Lecture from: 03.04.2025 | Video: Homelab

We’ve established what variance is and some of its basic properties. Now, let’s explore how variance behaves with sums and products of random variables. This will lead us to powerful tools called concentration inequalities, which allow us to bound the probability that a random variable deviates far from its expected value.

Recap: Expectation and Variance

  • Expected Value: $\mathbb{E}[X] = \sum_{x} x \cdot \Pr[X = x]$ is the average value of $X$; it is the quantity we typically want to estimate.

  • Variance: $\text{Var}(X) = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$ measures the expected squared deviation from the mean. A small variance suggests $X$ is usually close to $\mathbb{E}[X]$.

  • Property of Variance: $\text{Var}(aX + b) = a^2\,\text{Var}(X)$. Adding a constant $b$ shifts the distribution but doesn’t change its spread (variance). Scaling by $a$ scales the variance by $a^2$.

Rules for Moments: Expectation and Variance

Let’s now summarize how expectation and variance behave when working with multiple random variables. These foundational properties are often referred to as rules for moments, where:

  • The first moment is the expectation: $\mathbb{E}[X]$.
  • The second central moment is the variance: $\text{Var}(X) = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big]$.

1. Linearity of Expectation

$$\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$$

This holds for all random variables $X$ and $Y$, regardless of whether they are dependent or independent. This property - linearity of expectation - is one of the most powerful and broadly applicable tools in probability.

2. Product of Expectations

$$\mathbb{E}[X \cdot Y] = \mathbb{E}[X] \cdot \mathbb{E}[Y]$$

This identity holds if $X$ and $Y$ are independent.

Proof (for independent $X, Y$):

By independence, $\Pr[X = x, Y = y] = \Pr[X = x] \cdot \Pr[Y = y]$, so:

$$\mathbb{E}[X \cdot Y] = \sum_{x, y} x y \cdot \Pr[X = x, Y = y] = \left(\sum_{x} x \Pr[X = x]\right)\left(\sum_{y} y \Pr[Y = y]\right) = \mathbb{E}[X] \cdot \mathbb{E}[Y].$$

However, this fails for dependent variables.

Counterexample: Let $X$ be a Bernoulli($p$) variable with $0 < p < 1$, and let $Y = X$. Then:

  • $\mathbb{E}[X] = \mathbb{E}[Y] = p$, so $\mathbb{E}[X] \cdot \mathbb{E}[Y] = p^2$.
  • But $X \cdot Y = X^2 = X$ since $X \in \{0, 1\}$, so $\mathbb{E}[X \cdot Y] = \mathbb{E}[X] = p$.

Thus, $\mathbb{E}[X \cdot Y] = p \neq p^2 = \mathbb{E}[X] \cdot \mathbb{E}[Y]$ when $X$ and $Y$ are dependent.
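
As a quick sanity check, the counterexample can be enumerated exactly. A minimal sketch, assuming the choice $p = 1/2$ and $Y = X$ from above:

```python
# Exact check that E[X*Y] != E[X]*E[Y] when Y = X (fully dependent).
p = 0.5  # Bernoulli parameter used for illustration

E_X = p          # E[X] for X ~ Bernoulli(p)
E_Y = p          # Y = X, so same expectation
E_XY = p         # X*Y = X^2 = X, so E[X*Y] = E[X] = p

print("E[X] * E[Y] =", E_X * E_Y)  # 0.25
print("E[X * Y]    =", E_XY)       # 0.5
```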

3. Variance of a Sum

$$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$$

This identity holds if $X$ and $Y$ are independent.

Proof (for independent $X, Y$):

Let $\mu_X = \mathbb{E}[X]$ and $\mu_Y = \mathbb{E}[Y]$. Then:

$$\text{Var}(X + Y) = \mathbb{E}\big[\big((X - \mu_X) + (Y - \mu_Y)\big)^2\big]$$

By linearity of expectation:

$$= \mathbb{E}\big[(X - \mu_X)^2\big] + \mathbb{E}\big[(Y - \mu_Y)^2\big] + 2\,\mathbb{E}\big[(X - \mu_X)(Y - \mu_Y)\big]$$

The final term is the covariance:

$$\text{Cov}(X, Y) = \mathbb{E}\big[(X - \mu_X)(Y - \mu_Y)\big]$$

If $X$ and $Y$ are independent, then so are $X - \mu_X$ and $Y - \mu_Y$, since subtracting constants does not affect independence. Therefore:

$$\mathbb{E}\big[(X - \mu_X)(Y - \mu_Y)\big] = \mathbb{E}[X - \mu_X] \cdot \mathbb{E}[Y - \mu_Y]$$

But:

$$\mathbb{E}[X - \mu_X] = \mathbb{E}[X] - \mu_X = 0 \quad \text{and likewise} \quad \mathbb{E}[Y - \mu_Y] = 0$$

So:

$$\text{Cov}(X, Y) = 0$$

Thus, under independence:

$$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$$

In fact, this result extends to sums of more than two variables: if $X_1, \dots, X_n$ are pairwise independent, then:

$$\text{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \text{Var}(X_i)$$

Pairwise independence is sufficient here, which is strictly weaker than full (mutual) independence; see the sketch below.
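
The pairwise-independence remark can be illustrated with the classic XOR construction: $X$ and $Y$ are independent fair bits and $Z = X \oplus Y$. The three variables are pairwise independent but not mutually independent, yet the variance of their sum still equals the sum of their variances. A minimal Monte Carlo sketch (the sample size is arbitrary):

```python
import random

random.seed(0)
N = 200_000

sums = []
for _ in range(N):
    x = random.randint(0, 1)
    y = random.randint(0, 1)
    z = x ^ y  # pairwise independent of x and of y, but determined by (x, y) jointly
    sums.append(x + y + z)

mean = sum(sums) / N
var_of_sum = sum((s - mean) ** 2 for s in sums) / N

# Each of X, Y, Z is Bernoulli(1/2), so Var = 1/4 each; the rule predicts 3/4.
print("empirical Var(X + Y + Z):", round(var_of_sum, 3))  # ~0.75
print("Var(X) + Var(Y) + Var(Z):", 3 * 0.25)              # 0.75
```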

4. Variance of a Product

The analogous identity $\text{Var}(X \cdot Y) = \text{Var}(X) \cdot \text{Var}(Y)$ is false in general, even when $X$ and $Y$ are independent.

Counterexample: Let $X, Y$ be independent Bernoulli($\tfrac{1}{2}$) variables:

  • So $\text{Var}(X) = \text{Var}(Y) = \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{4}$, and $\text{Var}(X) \cdot \text{Var}(Y) = \tfrac{1}{16}$.

Now let $Z = X \cdot Y$:

  • $Z = 1$ only when $X = Y = 1$ (probability $\tfrac{1}{4}$), and $Z = 0$ otherwise, so $Z \sim \text{Bernoulli}(\tfrac{1}{4})$.
  • So $\text{Var}(Z) = \tfrac{1}{4} \cdot \tfrac{3}{4} = \tfrac{3}{16}$.

Clearly, $\text{Var}(X \cdot Y) = \tfrac{3}{16} \neq \tfrac{1}{16} = \text{Var}(X) \cdot \text{Var}(Y)$.
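
A direct numeric check of this counterexample (using the parameter $\tfrac{1}{2}$ assumed above):

```python
# X, Y independent Bernoulli(1/2); Z = X*Y is Bernoulli(1/4).
var_x = 0.5 * 0.5              # p(1 - p) = 1/4
var_y = 0.5 * 0.5
var_z = 0.25 * (1 - 0.25)      # Bernoulli(1/4) variance = 3/16

print("Var(X) * Var(Y) =", var_x * var_y)  # 0.0625  (1/16)
print("Var(X * Y)      =", var_z)          # 0.1875  (3/16)
```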

Summary of Moment Rules

| Property | Always True? | Conditions Required |
|---|---|---|
| $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$ | ✅ Always | None |
| $\mathbb{E}[X \cdot Y] = \mathbb{E}[X] \cdot \mathbb{E}[Y]$ | ✅ If independent | $X, Y$ independent |
| $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$ | ✅ If independent | $X, Y$ (pairwise) independent |
| $\text{Var}(X \cdot Y) = \text{Var}(X) \cdot \text{Var}(Y)$ | ❌ Not in general | Fails even if independent |

Estimating Probabilities: Concentration Inequalities

We know the expectation gives the average value of a random variable $X$. But how likely is it that $X$ takes a value far from its mean?

The goal is to bound:

$$\Pr\big[|X - \mathbb{E}[X]| \ge t\big]$$

for some deviation $t > 0$. These bounds are called concentration inequalities: they describe how tightly $X$ is concentrated around its mean.

Markov’s Inequality

This is the most basic concentration inequality. It applies to any non-negative random variable and requires only the expectation.

Intuitive explanation: a non-negative random variable cannot take large values too often; if it exceeded $k \cdot \mathbb{E}[X]$ with probability more than $1/k$, those values alone would already push the average above $\mathbb{E}[X]$.

Theorem (Markov’s Inequality)

Let $X$ be a non-negative random variable. Then for any $a > 0$:

$$\Pr[X \ge a] \le \frac{\mathbb{E}[X]}{a}$$

Proof

We split the expectation as follows (for discrete $X$):

$$\mathbb{E}[X] = \sum_{x} x \cdot \Pr[X = x] = \sum_{x < a} x \cdot \Pr[X = x] + \sum_{x \ge a} x \cdot \Pr[X = x]$$

Since all terms are non-negative, the first sum is non-negative, so:

$$\mathbb{E}[X] \ge \sum_{x \ge a} x \cdot \Pr[X = x]$$

Each $x$ in this sum satisfies $x \ge a$, so replacing $x$ with $a$ gives:

$$\mathbb{E}[X] \ge \sum_{x \ge a} a \cdot \Pr[X = x] = a \cdot \Pr[X \ge a]$$

Rearranging:

$$\Pr[X \ge a] \le \frac{\mathbb{E}[X]}{a}$$

This is a one-sided inequality: it only bounds the probability that $X$ is large. It’s general, but can be loose since it only uses the mean.
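
A small simulation comparing the true tail with Markov's bound. This is only a sketch; the test variable (the sum of two fair dice, mean 7) is an arbitrary non-negative example:

```python
import random

random.seed(1)
N = 100_000

# Non-negative test variable: sum of two fair dice (E[X] = 7).
samples = [random.randint(1, 6) + random.randint(1, 6) for _ in range(N)]
mean = sum(samples) / N

for a in (8, 10, 12):
    tail = sum(1 for s in samples if s >= a) / N
    print(f"a = {a:2d}   Pr[X >= a] ~ {tail:.3f}   Markov bound E[X]/a = {mean / a:.3f}")
```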

Chebyshev’s Inequality

Chebyshev’s inequality gives a two-sided bound using both the mean and the variance.

Theorem (Chebyshev’s Inequality)

Let $X$ have mean $\mu = \mathbb{E}[X]$ and finite variance $\text{Var}(X)$. Then for any $a > 0$:

$$\Pr\big[|X - \mu| \ge a\big] \le \frac{\text{Var}(X)}{a^2}$$

Proof

Let $Y = (X - \mu)^2$. This is a non-negative random variable. The event:

$$\{|X - \mu| \ge a\} \quad \text{is the same as the event} \quad \{Y \ge a^2\}$$

So:

$$\Pr\big[|X - \mu| \ge a\big] = \Pr[Y \ge a^2]$$

Apply Markov’s Inequality to $Y$:

$$\Pr[Y \ge a^2] \le \frac{\mathbb{E}[Y]}{a^2}$$

But $\mathbb{E}[Y] = \mathbb{E}\big[(X - \mu)^2\big] = \text{Var}(X)$, so:

$$\Pr\big[|X - \mu| \ge a\big] \le \frac{\text{Var}(X)}{a^2}$$

Interpretation

Smaller variance means tighter concentration around the mean. The tail probability decays quadratically in the deviation $a$.

Letting $a = k \cdot \sigma$, where $\sigma = \sqrt{\text{Var}(X)}$ is the standard deviation, we get:

$$\Pr\big[|X - \mu| \ge k\sigma\big] \le \frac{1}{k^2}$$

Examples:

  • Deviation of 2 standard deviations: $\Pr\big[|X - \mu| \ge 2\sigma\big] \le \tfrac{1}{4}$
  • Deviation of 10 standard deviations: $\Pr\big[|X - \mu| \ge 10\sigma\big] \le \tfrac{1}{100}$
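
A quick empirical comparison for a binomial test case (a sketch; $n = 100$, $p = 1/2$ chosen arbitrarily), showing that the $1/k^2$ bound holds but is usually far from tight:

```python
import math
import random

random.seed(2)
n, p, N = 100, 0.5, 20_000
mu, sigma = n * p, math.sqrt(n * p * (1 - p))

samples = [sum(random.random() < p for _ in range(n)) for _ in range(N)]

for k in (1, 2, 3):
    tail = sum(1 for x in samples if abs(x - mu) >= k * sigma) / N
    print(f"k = {k}   Pr[|X - mu| >= k*sigma] ~ {tail:.4f}   Chebyshev bound 1/k^2 = {1 / k**2:.4f}")
```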

Variance of Common Distributions

These are useful when applying Chebyshev’s inequality:

  • Bernoulli($p$): $\mathbb{E}[X] = p$, $\text{Var}(X) = p(1 - p)$

  • Binomial($n, p$): $X = \sum_{i=1}^{n} X_i$ with independent $X_i \sim \text{Bernoulli}(p)$, so $\mathbb{E}[X] = np$, $\text{Var}(X) = np(1 - p)$

  • Poisson($\lambda$): $\mathbb{E}[X] = \lambda$, $\text{Var}(X) = \lambda$

  • Geometric($p$): $\mathbb{E}[X] = \frac{1}{p}$, $\text{Var}(X) = \frac{1 - p}{p^2}$
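
These closed forms are easy to sanity-check by simulation. A minimal sketch for the geometric case (number of trials up to and including the first success), with $p = 0.2$ chosen arbitrarily:

```python
import random

random.seed(3)
p, N = 0.2, 200_000

def geometric(p):
    """Number of Bernoulli(p) trials until (and including) the first success."""
    k = 1
    while random.random() >= p:
        k += 1
    return k

samples = [geometric(p) for _ in range(N)]
mean = sum(samples) / N
var = sum((x - mean) ** 2 for x in samples) / N

print("E[X]:   formula", 1 / p, "  empirical", round(mean, 3))            # 5.0
print("Var(X): formula", (1 - p) / p**2, "  empirical", round(var, 2))    # 20.0
```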

Example: Coupon Collector with Chebyshev

Let $X$ be the total time (number of draws) to collect all $n$ coupons. Then:

$$X = \sum_{i=1}^{n} X_i$$

where $X_i \sim \text{Geometric}(p_i)$ is the waiting time for the $i$-th new coupon, and $p_i = \frac{n - i + 1}{n}$.

We know:

$$\mathbb{E}[X] = \sum_{i=1}^{n} \frac{1}{p_i} = n \cdot H_n \approx n \ln n$$

To apply Chebyshev, compute the variance (the $X_i$ are independent):

$$\text{Var}(X) = \sum_{i=1}^{n} \text{Var}(X_i) = \sum_{i=1}^{n} \frac{1 - p_i}{p_i^2} \le \sum_{i=1}^{n} \frac{1}{p_i^2}$$

Substitute $p_i = \frac{n - i + 1}{n}$, and after simplification:

$$\sum_{i=1}^{n} \frac{1}{p_i^2} = n^2 \sum_{k=1}^{n} \frac{1}{k^2} \le n^2 \cdot \frac{\pi^2}{6}$$

So:

$$\text{Var}(X) \le \frac{\pi^2}{6}\, n^2$$

Applying Chebyshev

Let $t = n \ln n$. Then:

$$\Pr\big[|X - \mathbb{E}[X]| \ge n \ln n\big] \le \frac{\pi^2 n^2 / 6}{(n \ln n)^2} = \frac{\pi^2}{6 \ln^2 n} \xrightarrow{\,n \to \infty\,} 0$$

This shows that $X$ is sharply concentrated around $\mathbb{E}[X] \approx n \ln n$.

Sharper Example

Let $t = n \sqrt{\ln n}$. Then:

$$\Pr\big[|X - \mathbb{E}[X]| \ge n \sqrt{\ln n}\big] \le \frac{\pi^2 n^2 / 6}{n^2 \ln n} = \frac{\pi^2}{6 \ln n} \xrightarrow{\,n \to \infty\,} 0$$

So even deviations much smaller than the mean (like $n \sqrt{\ln n}$) become unlikely for large $n$.
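
A simulation of the coupon collector process (a sketch, with $n = 200$ and 2000 runs chosen arbitrarily) illustrating how tightly $X$ concentrates around $n \cdot H_n$:

```python
import random

random.seed(4)
n, runs = 200, 2_000

def collect(n):
    """Draw uniform coupons until all n types have been seen; return the number of draws."""
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        draws += 1
    return draws

samples = [collect(n) for _ in range(runs)]
H_n = sum(1 / k for k in range(1, n + 1))
expected = n * H_n  # ~ n ln n

rel_dev = [abs(x - expected) / expected for x in samples]
print("n * H_n ~", round(expected, 1))
print("average relative deviation:", round(sum(rel_dev) / runs, 3))
print("maximum relative deviation:", round(max(rel_dev), 3))
```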

Chernoff Bounds: Concentration for Sums of Random Variables

Chebyshev’s inequality is useful for bounding how far a random variable might stray from its expected value, but it’s often too loose, especially for sums of independent random variables. This is where Chernoff bounds become powerful.

Chernoff bounds provide exponentially decreasing bounds on tail probabilities — i.e. the probability that a sum of independent random variables deviates significantly from its expected value.

Setting: Sums of Independent Bernoulli Variables

Suppose $X_1, \dots, X_n$ are independent Bernoulli random variables with $\Pr[X_i = 1] = p_i$, and let

$$X = \sum_{i=1}^{n} X_i, \qquad \mu = \mathbb{E}[X] = \sum_{i=1}^{n} p_i.$$

Then Chernoff bounds give the following tail bounds:

Chernoff Bounds (Standard Forms)

For $0 < \delta \le 1$:

  • Upper tail: $\Pr\big[X \ge (1 + \delta)\mu\big] \le e^{-\delta^2 \mu / 3}$

  • Lower tail: $\Pr\big[X \le (1 - \delta)\mu\big] \le e^{-\delta^2 \mu / 2}$

These bounds are exponentially decreasing in $\delta^2 \mu$, which gives much tighter control than Chebyshev’s polynomial behavior.
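
The bounds are easy to evaluate and compare against the exact binomial tail. The sketch below uses $n = 200$, $p = 1/2$ (so $\mu = 100$), chosen only for illustration:

```python
import math

def chernoff_upper(mu, delta):
    """Upper-tail bound Pr[X >= (1+delta)*mu] <= exp(-delta^2 * mu / 3), for 0 < delta <= 1."""
    return math.exp(-delta**2 * mu / 3)

n, p = 200, 0.5
mu = n * p

for delta in (0.1, 0.3, 0.5):
    threshold = round((1 + delta) * mu)
    exact = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(threshold, n + 1))
    print(f"delta = {delta}:  exact tail {exact:.2e}   Chernoff bound {chernoff_upper(mu, delta):.2e}")
```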

Examples
  • Try different values of $\delta$: 0.1, 0.5, 1.0 (for concreteness, take $\mu = 100$)

1. $\delta = 0.1$ (10% above the mean)

$$\Pr[X \ge 1.1\,\mu] \le e^{-0.1^2 \cdot 100 / 3} = e^{-1/3} \approx 0.717$$

🟠 Quite likely: about a 71.7% upper bound.

2. $\delta = 0.5$ (50% above the mean)

$$\Pr[X \ge 1.5\,\mu] \le e^{-0.5^2 \cdot 100 / 3} = e^{-25/3} \approx 2.4 \times 10^{-4}$$

🟢 Very unlikely — less than 0.03%.

3. $\delta = 1.0$ (100% above the mean, i.e., double the expectation)

$$\Pr[X \ge 2\,\mu] \le e^{-1^2 \cdot 100 / 3} = e^{-100/3} \approx 3.3 \times 10^{-15}$$

🔵 Astronomically unlikely — almost zero.

Intuition:

The probability decays exponentially fast as you move further above the expected value. The larger the deviation ($\delta$) or the expected value ($\mu$), the sharper the drop.

Comparing Chebyshev and Chernoff for $\text{Bin}(n, \tfrac{1}{2})$

Suppose $X \sim \text{Bin}(n, \tfrac{1}{2})$. Then:

$$\mu = \mathbb{E}[X] = \frac{n}{2}, \qquad \text{Var}(X) = \frac{n}{4}, \qquad \sigma = \frac{\sqrt{n}}{2}$$

Let’s study the probability of a deviation of size $t = k \cdot \sigma$. Rewriting in terms of standard deviations:

$$t = k\sigma = \frac{k\sqrt{n}}{2}$$

So we’re asking: what is the probability that $X$ deviates from its mean by more than $k$ standard deviations?

Chebyshev’s Bound

$$\Pr\big[|X - \mu| \ge k\sigma\big] \le \frac{\text{Var}(X)}{(k\sigma)^2} = \frac{1}{k^2}$$

Substitute $k = 10$ (i.e. a deviation of $10\sigma = 5\sqrt{n}$):

$$\Pr\big[|X - \mu| \ge 10\sigma\big] \le \frac{1}{100}$$

Interpretation: Chebyshev decays only polynomially in $k$. This means even for large deviations (e.g., $k = 10$), the bound is still only $\tfrac{1}{100}$.

Chernoff’s Bound (Upper Tail)

For a deviation of $k\sigma$ above the mean, write $k\sigma = \delta\mu$, so $\delta = \frac{k\sigma}{\mu} = \frac{k}{\sqrt{n}}$, and:

$$\Pr\big[X \ge (1 + \delta)\mu\big] \le e^{-\delta^2 \mu / 3} = e^{-\frac{k^2}{n} \cdot \frac{n}{2} \cdot \frac{1}{3}} = e^{-k^2 / 6}$$

Substitute $k = 10$, then:

$$\Pr\big[X \ge \mu + 10\sigma\big] \le e^{-100/6} \approx 6 \times 10^{-8}$$

Interpretation: Chernoff decays exponentially in $k^2$, giving much smaller probabilities for large deviations.
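
Tabulating both bounds for deviations of $k$ standard deviations from the mean of $\text{Bin}(n, \tfrac{1}{2})$ makes the gap explicit (a sketch; $n = 10{,}000$ chosen arbitrarily):

```python
import math

n = 10_000
mu = n / 2
sigma = math.sqrt(n) / 2

for k in (1, 2, 5, 10):
    chebyshev = 1 / k**2                     # Pr[|X - mu| >= k*sigma] <= 1/k^2
    delta = k * sigma / mu                   # rewrite k*sigma as delta*mu
    chernoff = math.exp(-delta**2 * mu / 3)  # = exp(-k^2 / 6) for Bin(n, 1/2)
    print(f"k = {k:2d}   Chebyshev: {chebyshev:.2e}   Chernoff: {chernoff:.2e}")
```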

General Chernoff Bounds (Summary)

Let $X = \sum_{i=1}^{n} X_i$, where the $X_i \in \{0, 1\}$ are independent.

Let $\mu = \mathbb{E}[X]$.

Then for all $0 < \delta \le 1$:

$$\Pr\big[X \ge (1 + \delta)\mu\big] \le e^{-\delta^2 \mu / 3} \qquad \text{and} \qquad \Pr\big[X \le (1 - \delta)\mu\big] \le e^{-\delta^2 \mu / 2}$$

  1. For large deviations: If $t \ge 6\mu$, then $\Pr[X \ge t] \le 2^{-t}$.

Idea Behind the Proof (MGFs and Markov)

We apply Markov’s inequality, not directly to $X$, but to an exponential transformation $e^{\lambda X}$ of $X$:

Let $\lambda > 0$. Then:

$$\Pr\big[X \ge (1 + \delta)\mu\big] = \Pr\big[e^{\lambda X} \ge e^{\lambda (1 + \delta)\mu}\big] \le \frac{\mathbb{E}\big[e^{\lambda X}\big]}{e^{\lambda (1 + \delta)\mu}}$$

So we want to bound $\mathbb{E}\big[e^{\lambda X}\big]$ (the moment generating function of $X$).

Because $X = \sum_{i=1}^{n} X_i$ and the $X_i$ are independent:

$$\mathbb{E}\big[e^{\lambda X}\big] = \prod_{i=1}^{n} \mathbb{E}\big[e^{\lambda X_i}\big]$$

Each $X_i \sim \text{Bernoulli}(p_i)$, so:

$$\mathbb{E}\big[e^{\lambda X_i}\big] = 1 - p_i + p_i e^{\lambda} = 1 + p_i\big(e^{\lambda} - 1\big) \le e^{p_i (e^{\lambda} - 1)}$$

Therefore:

$$\mathbb{E}\big[e^{\lambda X}\big] \le e^{(e^{\lambda} - 1)\mu}$$

The optimal value of $\lambda$ depends on $\delta$, and we can choose it to minimize this upper bound. The algebra is technical, but with this approach we can derive the exponential bounds above.
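
To make the "choose $\lambda$ to minimize the bound" step concrete, the sketch below grid-searches over $\lambda$ for the bound $\exp\big((e^{\lambda} - 1)\mu - \lambda(1+\delta)\mu\big)$ and compares the optimum with the simplified $e^{-\delta^2\mu/3}$ form ($\mu = 100$, $\delta = 0.3$ are arbitrary illustrative values):

```python
import math

def mgf_bound(mu, delta, lam):
    """Markov applied to e^(lam*X): Pr[X >= (1+delta)mu] <= exp((e^lam - 1)mu - lam(1+delta)mu)."""
    return math.exp((math.exp(lam) - 1) * mu - lam * (1 + delta) * mu)

mu, delta = 100, 0.3

# Coarse grid search over lambda; the true optimum is lam = ln(1 + delta).
best_lam = min((i / 1000 for i in range(1, 2001)), key=lambda lam: mgf_bound(mu, delta, lam))

print("best lambda ~", round(best_lam, 3), "  (ln(1 + delta) =", round(math.log(1 + delta), 3), ")")
print("optimized MGF bound: ", f"{mgf_bound(mu, delta, best_lam):.3e}")
print("exp(-delta^2 mu / 3):", f"{math.exp(-delta**2 * mu / 3):.3e}")
```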

Example: Quick Bound Using $e^{X}$

To get a quick bound on large deviations ($t$ much larger than $\mu$), we can simply fix $\lambda = 1$:

Let $Y = e^{X}$, and suppose we want to bound $\Pr[X \ge t]$. Then:

$$\Pr[X \ge t] = \Pr\big[e^{X} \ge e^{t}\big] \le \frac{\mathbb{E}\big[e^{X}\big]}{e^{t}}$$

Using the same independence and Bernoulli MGF trick:

$$\mathbb{E}\big[e^{X}\big] = \prod_{i=1}^{n} \mathbb{E}\big[e^{X_i}\big] = \prod_{i=1}^{n} \big(1 + p_i(e - 1)\big)$$

Now observe that for any real number $x$, we have $1 + x \le e^{x}$. Applying this to each factor:

$$\mathbb{E}\big[e^{X}\big] \le \prod_{i=1}^{n} e^{p_i (e - 1)} = e^{(e - 1)\mu}$$

So:

$$\Pr[X \ge t] \le \frac{e^{(e - 1)\mu}}{e^{t}} = e^{(e - 1)\mu - t}$$

If we want to make the bound even simpler, observe that:

$$e^{(e - 1)\mu - t} \le 2^{-t} \iff (e - 1)\mu \le (1 - \ln 2)\, t$$

So to ensure $\Pr[X \ge t] \le 2^{-t}$, it’s enough that:

$$t \ge \frac{e - 1}{1 - \ln 2}\,\mu \approx 5.6\,\mu$$

Basically, if $\frac{2^{t} \cdot e^{(e - 1)\mu}}{e^{t}} \le 1$, which happens if the numerator is smaller than the denominator, then we can upper bound $\Pr[X \ge t]$ by $2^{-t}$.

A conservative sufficient condition is:

$$t \ge 6\mu$$

Under this condition, the tail bound becomes very simple:

$$\Pr[X \ge t] \le 2^{-t}$$
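
A quick numeric check (a sketch, using the $\lambda = 1$ bound derived above) that $t \ge 6\mu$ indeed makes $e^{(e-1)\mu - t} \le 2^{-t}$:

```python
import math

for mu in (5, 20, 100):
    t = 6 * mu
    lambda_one_bound = math.exp((math.e - 1) * mu - t)  # the lambda = 1 bound on Pr[X >= t]
    simple_target = 2.0 ** (-t)
    print(f"mu = {mu:3d}, t = {t:3d}:  {lambda_one_bound:.2e} <= {simple_target:.2e}  ->  {lambda_one_bound <= simple_target}")
```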

Conclusion

Chernoff bounds are powerful tools for analyzing sums of independent random variables. Unlike Chebyshev, which offers general but loose control, Chernoff bounds exploit independence to yield strong exponential concentration.

These results are foundational in randomized algorithms, the probabilistic method, machine learning, and theoretical computer science.

Continue here: 15 Randomized Algorithms, Monte Carlo vs Las Vegas, Reducing Error Probability