Beyond the mean

The mean tells you where a distribution is centered. The variance tells you how spread out it is. But two distributions can share the same mean and variance yet look completely different. What else do we need?

What do you think?
Two distributions have the same mean and variance, but one is symmetric and the other has a long right tail. What summary statistic distinguishes them?

The moment hierarchy

k-th Moment

The k-th moment of a random variable X is: \mu_k = E[X^k]. The k-th central moment is: \mu_k' = E[(X - \mu)^k], where \mu = E[X].
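As a concrete warm-up (a worked example not from the text), here is a sketch computing raw and central moments of a fair six-sided die directly from the definition:

```python
# Raw and central moments of a fair six-sided die: X uniform on {1, ..., 6}.
outcomes = [1, 2, 3, 4, 5, 6]
p = 1 / 6  # each face is equally likely

def raw_moment(k):
    """E[X^k] = sum of x^k * P(X = x) over all outcomes."""
    return sum(x**k * p for x in outcomes)

mu = raw_moment(1)  # E[X] = 3.5

def central_moment(k):
    """E[(X - mu)^k], the k-th central moment."""
    return sum((x - mu)**k * p for x in outcomes)

print(mu)                 # 3.5
print(central_moment(2))  # variance = 35/12 ≈ 2.9167
print(central_moment(3))  # ≈ 0: the die's distribution is symmetric
```

The vanishing third central moment is the symmetry of the die showing up as a number, which is exactly the idea skewness formalizes below.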

Each moment tells you something different:

| Moment | Formula | What it measures |
|---|---|---|
| 1st | E[X] | Center (mean) |
| 2nd central | E[(X-\mu)^2] | Spread (variance) |
| 3rd standardized | E\left[\left(\frac{X-\mu}{\sigma}\right)^3\right] | Asymmetry (skewness) |
| 4th standardized | E\left[\left(\frac{X-\mu}{\sigma}\right)^4\right] | Tail heaviness (kurtosis) |
[Interactive: Moments Visualizer — choose a distribution and read off its four moments. Shown: μ = 5.0, σ = 1.50, giving mean 5.00, variance 2.25, skewness 0.00 (symmetric), kurtosis 3.00 (mesokurtic, the Normal baseline).]

Switch between distributions to see how the four moments change. The Normal has skewness 0 and kurtosis 3 (the baseline). The Exponential is right-skewed with heavier tails.
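A Monte Carlo version of the same comparison is easy to sketch. The parameters below (Normal with μ = 5, σ = 1.5; Exponential with rate 1) are chosen to mirror the visualizer's setting and are otherwise arbitrary:

```python
import math
import random

random.seed(0)

def four_moments(xs):
    """Sample mean, variance, skewness, and kurtosis of a dataset."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    sd = math.sqrt(var)
    skew = sum(((x - mean) / sd) ** 3 for x in xs) / n
    kurt = sum(((x - mean) / sd) ** 4 for x in xs) / n
    return mean, var, skew, kurt

normal = [random.gauss(5, 1.5) for _ in range(200_000)]
expo = [random.expovariate(1.0) for _ in range(200_000)]

print(four_moments(normal))  # ≈ (5.0, 2.25, 0.0, 3.0)
print(four_moments(expo))    # ≈ (1.0, 1.0, 2.0, 9.0)
```

The Exponential's skewness near 2 and kurtosis near 9 quantify the long right tail you see in the plot.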

Skewness

Skewness

\text{Skew}(X) = E\left[\left(\frac{X - \mu}{\sigma}\right)^3\right]

  • Skew = 0: symmetric (Normal, Uniform)
  • Skew > 0: right tail is longer (Exponential, Poisson)
  • Skew < 0: left tail is longer
The Normal distribution has skewness 0. Is this because it's always centered at 0? True (1) or False (0)? (whole number)
The Expo(λ) distribution has skewness = 2 for all λ. True (1) or False (0)? (whole number)
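The λ-independence of the Exponential's skewness can be checked exactly from its raw moments E[X^k] = k!/λ^k (a standard fact about the Exponential), without any sampling:

```python
# Exact check that Expo(λ) has skewness 2 for every λ, using E[X^k] = k!/λ^k.
def expo_skewness(lam):
    m1 = 1 / lam        # E[X]
    m2 = 2 / lam**2     # E[X^2]
    m3 = 6 / lam**3     # E[X^3]
    var = m2 - m1**2                       # σ² = 1/λ²
    mu3 = m3 - 3 * m1 * m2 + 2 * m1**3     # third central moment
    return mu3 / var ** 1.5

for lam in [0.5, 1.0, 2.0, 10.0]:
    print(lam, expo_skewness(lam))  # always 2.0
```

Rescaling X by λ changes the mean and the spread but not the shape, which is why every standardized moment of the Exponential is λ-free.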

Kurtosis

Kurtosis

\text{Kurt}(X) = E\left[\left(\frac{X - \mu}{\sigma}\right)^4\right] The Normal distribution has kurtosis = 3. Excess kurtosis = Kurt − 3 measures deviation from normality.

Kurtosis is often misunderstood as "peakedness." It's really about tail weight. High kurtosis means more probability in the extreme tails — more outliers, not necessarily a sharper peak.
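To make the tail-weight point concrete, here is a sketch comparing tail probabilities for two distributions with the same variance. The Laplace distribution (my choice of example, not from the text) has kurtosis 6, and correspondingly puts far more mass beyond 3σ than the Normal does:

```python
import math

def normal_tail(c):
    """P(|Z| > c) for Z ~ Normal(0, 1), via the complementary error function."""
    return math.erfc(c / math.sqrt(2))

def laplace_tail(c):
    """P(|X| > c) for Laplace(0, b) with b = 1/sqrt(2), so Var = 2b^2 = 1."""
    b = 1 / math.sqrt(2)
    return math.exp(-c / b)

print(normal_tail(3))   # ≈ 0.0027
print(laplace_tail(3))  # ≈ 0.0144, over 5x more mass beyond 3 sigma
```

Same mean, same variance, very different outlier behavior: that gap is what the fourth moment detects.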

The moment generating function

Computing moments one at a time is tedious. The MGF encodes all of them in a single function.

Moment Generating Function

The MGF of X is: M(t) = E[e^{tX}], defined for all t in a neighborhood of 0. Then: E[X^k] = M^{(k)}(0). The k-th moment is the k-th derivative of M evaluated at t = 0.

Why does this work? Expand e^{tX} as a Taylor series:

Why the MGF generates moments

e^{tX} = 1 + tX + \frac{t^2 X^2}{2!} + \frac{t^3 X^3}{3!} + \cdots

Taking expectations term by term gives M(t) = 1 + t\,E[X] + \frac{t^2}{2!}E[X^2] + \frac{t^3}{3!}E[X^3] + \cdots. Differentiating k times eliminates every term below t^k, and setting t = 0 eliminates every term above it, leaving exactly E[X^k].
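You can see the derivative property numerically. The sketch below differentiates the MGF of a Normal(μ = 1, σ² = 2), namely M(t) = exp(t + t²), by finite differences at t = 0 (the distribution and step size are illustrative choices):

```python
import math

# MGF of Normal(1, 2): M(t) = exp(mu*t + sigma^2 * t^2 / 2) = exp(t + t^2)
def M(t):
    return math.exp(t + t**2)

h = 1e-4
m1 = (M(h) - M(-h)) / (2 * h)          # central difference ≈ M'(0)
m2 = (M(h) - 2 * M(0) + M(-h)) / h**2  # second difference  ≈ M''(0)

print(m1)  # ≈ 1.0 = E[X] = mu
print(m2)  # ≈ 3.0 = E[X^2] = mu^2 + sigma^2
```

In practice you would differentiate symbolically, but the numerical check makes the claim E[X^k] = M^{(k)}(0) tangible.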

MGFs of common distributions

| Distribution | MGF M(t) | Valid for |
|---|---|---|
| Bernoulli(p) | 1 - p + pe^t | all t |
| Binomial(n, p) | (1 - p + pe^t)^n | all t |
| Poisson(\lambda) | e^{\lambda(e^t - 1)} | all t |
| Normal(\mu, \sigma^2) | e^{\mu t + \sigma^2 t^2/2} | all t |
| Exponential(\lambda) | \frac{\lambda}{\lambda - t} | t < \lambda |
What do you think?
Why is the MGF of Binomial(n,p) the n-th power of the Bernoulli MGF?
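A quick numeric sketch of the same fact: computing the Binomial MGF directly from its pmf and comparing it against the Bernoulli MGF raised to the n-th power (the parameters n = 10, p = 0.3 are arbitrary test values):

```python
import math

n, p = 10, 0.3

def binomial_mgf(t):
    """E[e^{tX}] for X ~ Binomial(n, p), summed over the pmf."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) * math.exp(t * k)
               for k in range(n + 1))

def bernoulli_mgf(t):
    """E[e^{tX}] for X ~ Bernoulli(p)."""
    return 1 - p + p * math.exp(t)

for t in [-1.0, 0.0, 0.5, 2.0]:
    assert math.isclose(binomial_mgf(t), bernoulli_mgf(t) ** n)
print("Binomial MGF equals Bernoulli MGF ** n at all tested t")
```

The agreement is no accident: a Binomial is a sum of n independent Bernoullis, and the next section shows why sums become products.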

The key MGF property

MGFs of Independent Sums

If X and Y are independent: M_{X+Y}(t) = M_X(t) \cdot M_Y(t). MGFs convert sums into products — they're probability's version of Fourier transforms.

Not all distributions have MGFs. For example, the Cauchy distribution's E[e^{tX}] is infinite for all t \neq 0. When MGFs exist, they uniquely determine the distribution.
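Here is the product rule in action for identifying a sum, using Pois(2) and Pois(7) as illustrative parameters: the product of the two MGFs is e^{2(e^t-1)} \cdot e^{7(e^t-1)} = e^{9(e^t-1)}, exactly the Pois(9) MGF.

```python
import math

def poisson_mgf(lam, t):
    """MGF of Pois(lam): exp(lam * (e^t - 1))."""
    return math.exp(lam * (math.exp(t) - 1))

# If X ~ Pois(2) and Y ~ Pois(7) are independent, M_{X+Y} = M_X * M_Y,
# which matches the Pois(9) MGF at every t.
for t in [-0.5, 0.0, 0.3, 1.0]:
    assert math.isclose(poisson_mgf(2, t) * poisson_mgf(7, t), poisson_mgf(9, t))
print("X + Y ~ Pois(9): MGFs match")
```

Since MGFs determine distributions uniquely, matching the MGF is enough to conclude X + Y ~ Pois(9).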

If M(t) = e^(3t + 2t²), what is E[X]? (Hint: M'(0)) (whole number)
Same MGF: M(t) = e^(3t + 2t²). What is E[X²]? (Hint: M''(0)) (whole number)
X ~ Pois(3), Y ~ Pois(5), independent. X + Y is Poisson with what parameter? (whole number)

Summary

| Concept | Key Idea |
|---|---|
| k-th moment | E[X^k] — raw information about the distribution's shape |
| Skewness | 3rd standardized moment — measures asymmetry |
| Kurtosis | 4th standardized moment — measures tail weight |
| MGF | M(t) = E[e^{tX}] encodes all moments |
| MGF derivatives | M^{(k)}(0) = E[X^k] |
| Independent sums | M_{X+Y} = M_X \cdot M_Y |

Moments summarize shape; the MGF packages them all. When you need to identify a sum's distribution, reaching for MGFs is often the fastest route.

What's next

We know the mean and variance. We have the MGF. But how tightly does the mean constrain where values actually fall? Enter Markov's and Chebyshev's inequalities — the first tools for bounding tail probabilities.