Beyond the mean

The mean tells you where a distribution is centered. The variance tells you how spread out it is. But two distributions can share the same mean and variance yet look completely different. What else do we need?

What do you think?
Two distributions have the same mean and variance, but one is symmetric and the other has a long right tail. What summary statistic distinguishes them?

The moment hierarchy

k-th Moment

The k-th moment of a random variable X is: \mu_k = E[X^k]. The k-th central moment is: \mu_k' = E[(X - \mu)^k], where \mu = E[X].
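As a concrete warm-up (a worked example not from the text), here is a sketch computing raw and central moments of a fair six-sided die directly from the definition:

```python
# Raw and central moments of a fair six-sided die: X uniform on {1, ..., 6}.
outcomes = [1, 2, 3, 4, 5, 6]
p = 1 / 6  # each face is equally likely

def raw_moment(k):
    """E[X^k] = sum of x^k * P(X = x) over all outcomes."""
    return sum(x**k * p for x in outcomes)

mu = raw_moment(1)  # E[X] = 3.5

def central_moment(k):
    """E[(X - mu)^k], the k-th central moment."""
    return sum((x - mu)**k * p for x in outcomes)

print(mu)                 # 3.5
print(central_moment(2))  # variance = 35/12 ≈ 2.9167
print(central_moment(3))  # ≈ 0: the die's distribution is symmetric
```

The vanishing third central moment is the symmetry of the die showing up as a number, which is exactly the idea skewness formalizes below.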

Each moment tells you something different:

| Moment | Formula | What it measures |
|---|---|---|
| 1st | E[X] | Center (mean) |
| 2nd central | E[(X-\mu)^2] | Spread (variance) |
| 3rd standardized | E\left[\left(\frac{X-\mu}{\sigma}\right)^3\right] | Asymmetry (skewness) |
| 4th standardized | E\left[\left(\frac{X-\mu}{\sigma}\right)^4\right] | Tail heaviness (kurtosis) |
[Interactive: Moments Visualizer — choose a distribution and read off its four moments. Shown: μ = 5.0, σ = 1.50, giving mean 5.00, variance 2.25, skewness 0.00 (symmetric), kurtosis 3.00 (mesokurtic, the Normal baseline).]

Switch between distributions to see how the four moments change. The Normal has skewness 0 and kurtosis 3 (the baseline). The Exponential is right-skewed with heavier tails.
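A Monte Carlo version of the same comparison is easy to sketch. The parameters below (Normal with μ = 5, σ = 1.5; Exponential with rate 1) are chosen to mirror the visualizer's setting and are otherwise arbitrary:

```python
import math
import random

random.seed(0)

def four_moments(xs):
    """Sample mean, variance, skewness, and kurtosis of a dataset."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    sd = math.sqrt(var)
    skew = sum(((x - mean) / sd) ** 3 for x in xs) / n
    kurt = sum(((x - mean) / sd) ** 4 for x in xs) / n
    return mean, var, skew, kurt

normal = [random.gauss(5, 1.5) for _ in range(200_000)]
expo = [random.expovariate(1.0) for _ in range(200_000)]

print(four_moments(normal))  # ≈ (5.0, 2.25, 0.0, 3.0)
print(four_moments(expo))    # ≈ (1.0, 1.0, 2.0, 9.0)
```

The Exponential's skewness near 2 and kurtosis near 9 quantify the long right tail you see in the plot.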

Skewness

Skewness

\text{Skew}(X) = E\left[\left(\frac{X - \mu}{\sigma}\right)^3\right]

  • Skew = 0: symmetric (Normal, Uniform)
  • Skew > 0: right tail is longer (Exponential, Poisson)
  • Skew < 0: left tail is longer
The Normal distribution has skewness 0. Is this because it's always centered at 0? True (1) or False (0)? (whole number)
The Expo(λ) distribution has skewness = 2 for all λ. True (1) or False (0)? (whole number)
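The λ-independence of the Exponential's skewness can be checked exactly from its raw moments E[X^k] = k!/λ^k (a standard fact about the Exponential), without any sampling:

```python
# Exact check that Expo(λ) has skewness 2 for every λ, using E[X^k] = k!/λ^k.
def expo_skewness(lam):
    m1 = 1 / lam        # E[X]
    m2 = 2 / lam**2     # E[X^2]
    m3 = 6 / lam**3     # E[X^3]
    var = m2 - m1**2                       # σ² = 1/λ²
    mu3 = m3 - 3 * m1 * m2 + 2 * m1**3     # third central moment
    return mu3 / var ** 1.5

for lam in [0.5, 1.0, 2.0, 10.0]:
    print(lam, expo_skewness(lam))  # always 2.0
```

Rescaling X by λ changes the mean and the spread but not the shape, which is why every standardized moment of the Exponential is λ-free.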

Kurtosis

Kurtosis

\text{Kurt}(X) = E\left[\left(\frac{X - \mu}{\sigma}\right)^4\right] The Normal distribution has kurtosis = 3. Excess kurtosis = Kurt − 3 measures deviation from normality.

Kurtosis is often misunderstood as "peakedness." It's really about tail weight. High kurtosis means more probability in the extreme tails — more outliers, not necessarily a sharper peak.
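To make the tail-weight point concrete, here is a sketch comparing tail probabilities for two distributions with the same variance. The Laplace distribution (my choice of example, not from the text) has kurtosis 6, and correspondingly puts far more mass beyond 3σ than the Normal does:

```python
import math

def normal_tail(c):
    """P(|Z| > c) for Z ~ Normal(0, 1), via the complementary error function."""
    return math.erfc(c / math.sqrt(2))

def laplace_tail(c):
    """P(|X| > c) for Laplace(0, b) with b = 1/sqrt(2), so Var = 2b^2 = 1."""
    b = 1 / math.sqrt(2)
    return math.exp(-c / b)

print(normal_tail(3))   # ≈ 0.0027
print(laplace_tail(3))  # ≈ 0.0144, over 5x more mass beyond 3 sigma
```

Same mean, same variance, very different outlier behavior: that gap is what the fourth moment detects.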

The moment generating function

Computing moments one at a time is tedious. The MGF encodes all of them in a single function.

Moment Generating Function

The MGF of X is: M(t) = E[e^{tX}], defined for all t in a neighborhood of 0. Then: E[X^k] = M^{(k)}(0). The k-th moment is the k-th derivative of M evaluated at t = 0.

Why does this work? Expand e^{tX} as a Taylor series:

Why the MGF generates moments

e^{tX} = 1 + tX + \frac{t^2 X^2}{2!} + \frac{t^3 X^3}{3!} + \cdots

Taking expectations term by term gives M(t) = 1 + t\,E[X] + \frac{t^2}{2!}E[X^2] + \frac{t^3}{3!}E[X^3] + \cdots. Differentiating k times eliminates every term below t^k, and setting t = 0 eliminates every term above it, leaving exactly E[X^k].
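You can see the derivative property numerically. The sketch below differentiates the MGF of a Normal(μ = 1, σ² = 2), namely M(t) = exp(t + t²), by finite differences at t = 0 (the distribution and step size are illustrative choices):

```python
import math

# MGF of Normal(1, 2): M(t) = exp(mu*t + sigma^2 * t^2 / 2) = exp(t + t^2)
def M(t):
    return math.exp(t + t**2)

h = 1e-4
m1 = (M(h) - M(-h)) / (2 * h)          # central difference ≈ M'(0)
m2 = (M(h) - 2 * M(0) + M(-h)) / h**2  # second difference  ≈ M''(0)

print(m1)  # ≈ 1.0 = E[X] = mu
print(m2)  # ≈ 3.0 = E[X^2] = mu^2 + sigma^2
```

In practice you would differentiate symbolically, but the numerical check makes the claim E[X^k] = M^{(k)}(0) tangible.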

MGFs of common distributions

| Distribution | MGF M(t) | Valid for |
|---|---|---|
| Bernoulli(p) | 1 - p + pe^t | all t |
| Binomial(n, p) | (1 - p + pe^t)^n | all t |
| Poisson(\lambda) | e^{\lambda(e^t - 1)} | all t |
| Normal(\mu, \sigma^2) | e^{\mu t + \sigma^2 t^2/2} | all t |
| Exponential(\lambda) | \frac{\lambda}{\lambda - t} | t < \lambda |
What do you think?
Why is the MGF of Binomial(n,p) the n-th power of the Bernoulli MGF?
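A quick numeric sketch of the same fact: computing the Binomial MGF directly from its pmf and comparing it against the Bernoulli MGF raised to the n-th power (the parameters n = 10, p = 0.3 are arbitrary test values):

```python
import math

n, p = 10, 0.3

def binomial_mgf(t):
    """E[e^{tX}] for X ~ Binomial(n, p), summed over the pmf."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) * math.exp(t * k)
               for k in range(n + 1))

def bernoulli_mgf(t):
    """E[e^{tX}] for X ~ Bernoulli(p)."""
    return 1 - p + p * math.exp(t)

for t in [-1.0, 0.0, 0.5, 2.0]:
    assert math.isclose(binomial_mgf(t), bernoulli_mgf(t) ** n)
print("Binomial MGF equals Bernoulli MGF ** n at all tested t")
```

The agreement is no accident: a Binomial is a sum of n independent Bernoullis, and the next section shows why sums become products.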

The key MGF property

MGFs of Independent Sums

If X and Y are independent: M_{X+Y}(t) = M_X(t) \cdot M_Y(t). MGFs convert sums into products — they're probability's version of Fourier transforms.

Not all distributions have MGFs. For example, the Cauchy distribution's E[e^{tX}] is infinite for all t \neq 0. When MGFs exist, they uniquely determine the distribution.
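Here is the product rule in action for identifying a sum, using Pois(2) and Pois(7) as illustrative parameters: the product of the two MGFs is e^{2(e^t-1)} \cdot e^{7(e^t-1)} = e^{9(e^t-1)}, exactly the Pois(9) MGF.

```python
import math

def poisson_mgf(lam, t):
    """MGF of Pois(lam): exp(lam * (e^t - 1))."""
    return math.exp(lam * (math.exp(t) - 1))

# If X ~ Pois(2) and Y ~ Pois(7) are independent, M_{X+Y} = M_X * M_Y,
# which matches the Pois(9) MGF at every t.
for t in [-0.5, 0.0, 0.3, 1.0]:
    assert math.isclose(poisson_mgf(2, t) * poisson_mgf(7, t), poisson_mgf(9, t))
print("X + Y ~ Pois(9): MGFs match")
```

Since MGFs determine distributions uniquely, matching the MGF is enough to conclude X + Y ~ Pois(9).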

If M(t) = e^(3t + 2t²), what is E[X]? (Hint: M'(0)) (whole number)
Same MGF: M(t) = e^(3t + 2t²). What is E[X²]? (Hint: M''(0)) (whole number)
X ~ Pois(3), Y ~ Pois(5), independent. X + Y is Poisson with what parameter? (whole number)

Summary

| Concept | Key Idea |
|---|---|
| k-th moment | E[X^k] — raw information about the distribution's shape |
| Skewness | 3rd standardized moment — measures asymmetry |
| Kurtosis | 4th standardized moment — measures tail weight |
| MGF | M(t) = E[e^{tX}] encodes all moments |
| MGF derivatives | M^{(k)}(0) = E[X^k] |
| Independent sums | M_{X+Y} = M_X \cdot M_Y |

Moments summarize shape; the MGF packages them all. When you need to identify a sum's distribution, reaching for MGFs is often the fastest route.

What's next

We know the mean and variance. We have the MGF. But how tightly does the mean constrain where values actually fall? Enter Markov's and Chebyshev's inequalities — the first tools for bounding tail probabilities.