What is the Exponential Family of Distributions

2021-05-13

Motivation

To understand the first 4 pages of Generalized Linear Models by Nelder and Wedderburn without a formal background in statistics. That took me a while.

This 4-hour lecture series from MIT helped a lot. Highly recommend.

By the end of this post, I hope to have answered the question in the title: what is the exponential family of distributions?

Sufficient Statistics

Definition A statistic is a function $T := r(X_1, X_2, \ldots, X_n)$ of the random samples $X_1, X_2, \ldots, X_n \in S$. For now, suppose the samples are drawn from an unknown distribution $\pi$.

Definition A statistical model is a pair $(S, P)$, where $S$ is the set of all possible observations and $P$ is a set of probability distributions on $S$. Note: $S$ for the Sample space and $P$ for the family of candidate Probability distributions.

Definition Given a statistical model $(S, P)$, suppose that $P$ is parameterized: $P := \{P_\theta \mid \theta \in O\}$. Then the model is called a parametric statistical model.

That was a mouthful, but it lets me state the following:

Definition Let $(S, P_\theta)$ be a parametric statistical model and $X_1, X_2, \ldots, X_n \in S$ be i.i.d. samples generated from $P_{\theta^*}$ for some unknown $\theta^* \in O$. An estimator for $\theta^*$ is a statistic $T(X_1, X_2, \ldots, X_n)$ that maps into $O$, i.e. $T: S^n \rightarrow O$.
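
For example, if the family is the Gaussians $P_\theta := \mathcal{N}(\theta, 1)$ with unknown mean $\theta \in \mathbb{R}$, then the sample mean

$$ T(X_1, X_2, \ldots, X_n) := \frac{1}{n} \sum_{i = 1}^n X_i $$

is an estimator for $\theta^*$.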

In general, the true distribution $\pi$ that generates the samples may not be in the family $\{P_\theta \mid \theta \in O\}$. That should be fine if the family is dense enough: for any $\epsilon > 0$, there exists $\theta \in O$ such that $d(P_\theta, \pi) < \epsilon$ for some distance $d$ between distributions.

Definition Let $(S, P_\theta)$ be a parametric statistical model and $X_1, X_2, \ldots, X_n \in S$ be i.i.d. samples generated from $P_{\theta^*}$ for some unknown $\theta^* \in O$. A statistic $T := T(X_1, X_2, \ldots, X_n)$ is sufficient for $\theta$ if the conditional distribution of $X_1, X_2, \ldots, X_n$ given $T$ does not depend on $\theta$.

Sufficient statistics formalize the idea that two sets of data yielding the same value of the statistic $T(x)$ yield the same inference about $\theta$. The notion of sufficiency makes the most sense in terms of information: there is no loss of information regarding $\theta$ when compressing the samples, $X_1, X_2, \ldots, X_n$, into the statistic, $T(X_1, X_2, \ldots, X_n)$.

If we were to describe the distribution, a parametric model (the distribution as a function of the statistics) is as good as a non-parametric model (the distribution as a function of the data points) if and only if the distribution admits sufficient statistics.

In layman's terms, no information is lost when compressing arbitrarily many data points (samples) into a fixed number of parameters (statistics).
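
A classic example: for i.i.d. $X_1, X_2, \ldots, X_n \sim \mathrm{Bernoulli}(p)$, the number of successes $T := \sum_{i = 1}^n X_i$ is sufficient for $p$, because the conditional distribution of the samples given $T = t$ is uniform over all sequences with exactly $t$ ones:

$$ P\left(X_1 = x_1, \ldots, X_n = x_n \,\middle|\, \sum_{i = 1}^n X_i = t\right) = \binom{n}{t}^{-1} $$

which does not depend on $p$.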

So what does this have to do with the exponential family of distributions?

TODO: use the factorization theorem instead, and drop the i.i.d. assumption.

Essence of the Exponential Family of Distributions

It turns out that a sufficient condition (no pun intended) for a distribution to admit sufficient statistics is that its density has a certain form:

Definition (slide 11 from the [MIT course notes][glm-mit-notes]) An exponential family of distributions is a parametric family whose density has the following form:

A family of distributions $\{P_\theta : \theta \in \Omega\}$, $\Omega \subset \mathbb{R}^k$, is said to be a $k$-parameter exponential family on $\mathbb{R}^q$ if there exist real-valued functions $\mu_1, \ldots, \mu_k$ and $B$ of $\theta$, real-valued statistics $T_1, \ldots, T_k$, and a function $h$ of $x$ such that

$$ p_\theta(x) := \exp \left( \sum_{i = 1}^k \mu_i(\theta) T_i(x) - B(\theta)\right) h(x) $$

For most practical purposes, $k$ is rarely more than 1. See this chart of commonly used distributions (FIXME: insert taxonomy of various distributions)
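
As a quick sanity check, the Poisson distribution with rate $\lambda$ is a $1$-parameter exponential family:

$$ p_\lambda(x) = \frac{\lambda^x e^{-\lambda}}{x!} = \exp\left( x \log \lambda - \lambda \right) \frac{1}{x!}, \qquad x \in \{0, 1, 2, \ldots\} $$

so $k = 1$, $\mu(\lambda) = \log \lambda$, $T(x) = x$, $B(\lambda) = \lambda$, and $h(x) = 1/x!$.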

Definition of the Canonical Exponential Family of Distributions

Definition An exponential family of distributions with $k = 1$ and $x \in \mathbb{R}$ is canonical if

$$ p(x; \theta) := \exp( \langle x, \theta \rangle - F(\theta) + k(x)) $$

where $\theta$ is called the natural (canonical) parameter and $F(\theta)$ is the log-partition function that normalizes the density. In terms of the general form above, $\mu(\theta) = \theta$, $T(x) = x$, $B(\theta) = F(\theta)$, and $h(x) = e^{k(x)}$.
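
Continuing the Poisson example, reparameterizing with the natural parameter $\theta := \log \lambda$ puts it in canonical form:

$$ p(x; \theta) = \exp\left( x \theta - e^{\theta} - \log x! \right) $$

so $F(\theta) = e^{\theta}$ and $k(x) = -\log x!$.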

Properties of the Canonical Exponential Family of Distributions

The exponential form of the pdf allows us to derive the mean and variance analytically, which will be useful when discussing Generalized Linear Models.

Lemma Let $p(x; \theta)$ be a probability density function with parameter $\theta$, let $l(\theta) := \log p(x; \theta)$ be the log-likelihood, and write $p'$ and $p''$ for the first and second partial derivatives of $p$ with respect to $\theta$. Assuming we may interchange differentiation and integration, we have

$$ E\left[\frac{\partial l}{\partial \theta}\right] := \int \frac{p'(x; \theta)}{p(x; \theta)} p(x; \theta) \, dx = \int \frac{\partial}{\partial \theta} p(x; \theta) \, dx = \frac{\partial}{\partial \theta} \int p(x; \theta) \, dx = 0 $$

$$ E\left[\frac{\partial^2 l}{\partial \theta^2}\right] := E\left[\frac{\partial}{\partial \theta}\left(\frac{p'(x; \theta)}{p(x; \theta)}\right)\right] = \int \frac{p''(x; \theta) \, p(x; \theta) - p'(x; \theta)^2}{p^2(x; \theta)} p(x; \theta) \, dx = \int p''(x; \theta) \, dx - \int \left(\frac{p'(x; \theta)}{p(x; \theta)}\right)^2 p(x; \theta) \, dx = - E\left[\left(\frac{\partial l}{\partial \theta}\right)^2\right] $$

where the first integral vanishes because $\int p''(x; \theta) \, dx = \frac{\partial^2}{\partial \theta^2} \int p(x; \theta) \, dx = 0$ by the same interchange argument.

Claim Suppose $X \sim p(x; \theta) := \exp(x \theta - F(\theta) + k(x))$. Then $E[X] = F'(\theta)$ and $\mathrm{Var}(X) = F''(\theta)$.

This follows from applying the lemma above.
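
Sketching the derivation: for the canonical form, $l(\theta) = x \theta - F(\theta) + k(x)$, so

$$ \frac{\partial l}{\partial \theta} = x - F'(\theta), \qquad \frac{\partial^2 l}{\partial \theta^2} = -F''(\theta) $$

The first identity of the lemma gives $E[X - F'(\theta)] = 0$, i.e. $E[X] = F'(\theta)$. The second gives $-F''(\theta) = -E\left[(X - F'(\theta))^2\right]$, i.e. $\mathrm{Var}(X) = F''(\theta)$. For the Poisson example with $F(\theta) = e^{\theta}$, both the mean and the variance come out to $e^{\theta} = \lambda$, as expected.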