2021-05-13
To understand the first 4 pages of Generalized Linear Models by Nelder and Wedderburn, without a formal background in Statistics. That took me a while.
This 4-hour lecture series from MIT helped a lot. Highly recommend.
By the end of the post, I hope to have answered this question:
Definition A statistic is a function $T := r(X_1, X_2, \ldots, X_n)$ of the random samples $X_1, X_2, \ldots, X_n \in S$. For now, suppose the samples are drawn from an unknown distribution $\pi$.
Definition A statistical model is a pair $(S, P)$, where $S$ is the set of all possible observations and $P$ is a set of probability distributions on $S$. Note: S for Sample space and P for Prior distributions.
Definition Given a statistical model $(S, P)$, suppose that $P$ is parameterized: $P := \{P_\theta \mid \theta \in O\}$. Then the model is said to be parametric.
That was a mouthful, but it lets me say the following:
Definition Let $(S, \{P_\theta\})$ be a parametric statistical model and $X_1, X_2, \ldots, X_n \in S$ be i.i.d. samples generated from $P_{\theta^*}$ for some unknown $\theta^* \in O$. An estimator for $\theta^*$ is a statistic $T(X_1, X_2, \ldots, X_n)$ that maps into $O$, i.e. $T: S^n \rightarrow O$.
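For example (a standard illustration I am adding, not from the paper): for the Gaussian family $\{N(\theta, 1) \mid \theta \in \mathbb{R}\}$, the sample mean

$$ T(X_1, X_2, \ldots, X_n) := \frac{1}{n} \sum_{i=1}^n X_i $$

is an estimator of the unknown $\theta^*$.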
In general, the true distribution, $\pi$, that generates the samples may not be in $P$ at all. That should be fine if the family $\{P_\theta\}$ is dense enough: for any $\epsilon > 0$, there exists a $\theta$ such that $d(P_\theta, \pi) < \epsilon$, for some distance $d$ between distributions.
Definition Let $(S, \{P_\theta\})$ be a parametric statistical model and $X_1, X_2, \ldots, X_n \in S$ be i.i.d. samples generated from $P_{\theta^*}$ for some unknown $\theta^* \in O$. A statistic, $T(X_1, X_2, \ldots, X_n)$, is sufficient if the conditional distribution of $X_1, X_2, \ldots, X_n$ given $T$ does not depend on $\theta$.
A sufficient statistic formalizes the idea that two sets of data yielding the same value of the statistic $T(x)$ would yield the same inference about $\theta$. The notion of sufficiency makes the most sense in information-theoretic terms: there is no loss of information regarding $\theta$ when compressing from the samples, $X_1, X_2, \ldots, X_n$, to the statistic, $T(X_1, X_2, \ldots, X_n)$.
For describing the distribution, a parametric model (the distribution as a function of the statistic) is as good as a non-parametric model (the distribution as a function of the data points) if and only if the distribution admits a sufficient statistic.
In layman's terms, no information is lost when compressing arbitrarily many data points (samples) into a fixed number of parameters (statistics).
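A standard worked example (my addition): for i.i.d. Bernoulli($\theta$) samples, the joint pmf factors as

$$ p(x_1, \ldots, x_n; \theta) = \prod_{i=1}^n \theta^{x_i} (1 - \theta)^{1 - x_i} = \theta^{\sum_i x_i} (1 - \theta)^{n - \sum_i x_i}, $$

which depends on the data only through $T(x) = \sum_i x_i$. By the factorization theorem (with $h(x) = 1$), the number of successes is a sufficient statistic: two datasets with the same count of successes lead to the same inference about $\theta$.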
So what does this have to do with the exponential family of distributions?
TODO: use factorization theorem instead, drop the assumption on i.i.d.
It turns out that a sufficient condition (no pun intended) for a distribution to admit a sufficient statistic is that its density takes a particular form:
Definition (slide 11 from the [MIT course notes][glm-mit-notes]) A family of distributions $\{P_\theta \mid \theta \in \Omega\}$, $\Omega \subset \mathbb{R}^k$, is said to be a $k$-parameter exponential family on $\mathbb{R}^q$ if there exist real-valued functions $\mu_1, \ldots, \mu_k$ and $B$ on $\Omega$, and real-valued functions $T_1, \ldots, T_k$ and $h$ on $\mathbb{R}^q$, such that the density of $P_\theta$ can be written as:
$$ p_\theta(x) := \exp \left( \sum_{i = 1}^k \mu_i(\theta) T_i(x) - B(\theta) \right) h(x) $$
For most practical purposes, $k$ is rarely more than 1. See this chart of commonly used distributions (FIXME: insert taxonomy of various distributions)
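As a concrete one-parameter example (my addition): the Poisson distribution with rate $\lambda$ fits the form above with $k = 1$, since

$$ p_\lambda(x) = \frac{\lambda^x e^{-\lambda}}{x!} = \exp\left( x \log \lambda - \lambda \right) \frac{1}{x!}, \qquad x \in \{0, 1, 2, \ldots\}, $$

i.e. $\mu(\lambda) = \log \lambda$, $T(x) = x$, $B(\lambda) = \lambda$, and $h(x) = 1/x!$.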
Definition An exponential family of distributions with $k = 1$ and $x \in \mathbb{R}$ is canonical if
$$ p(x; \theta) := \exp\left( \langle x, \theta \rangle - F(\theta) + k(x) \right) $$
where $\theta$ is the canonical (natural) parameter, $F(\theta)$ is the log-partition function that normalizes the density, and $k(x)$ depends on $x$ alone.
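For instance (my addition): the Bernoulli distribution with success probability $p$ becomes canonical after reparameterizing with the log-odds $\theta = \log \frac{p}{1 - p}$:

$$ p(x; \theta) = \exp\left( x \theta - \log(1 + e^\theta) \right), \qquad x \in \{0, 1\}, $$

so $F(\theta) = \log(1 + e^\theta)$ and $k(x) = 0$.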
The exponential form of the pdf allows us to derive the mean and variance analytically, which will be useful when discussing Generalized Linear Models.
Lemma Let $p(x; \theta)$ be a probability density function with parameter $\theta$ and $l(\theta) := \log p(x; \theta)$ be the log-likelihood function. We have
$$ E[\frac{\partial l}{\partial \theta}] := \int \frac{p'(x; \theta)}{p(x; \theta)} p(x; \theta) dx = \int \frac{d}{d \theta} \left( p(x; \theta) \right) dx= \frac{d}{d \theta} \int p(x; \theta) dx = 0 $$
$$ E\left[\frac{\partial^2 l}{\partial \theta^2}\right] := E\left[\frac{d}{d \theta}\left(\frac{p'(x; \theta)}{p(x; \theta)}\right)\right] = \int \frac{p''(x; \theta)p(x; \theta) - p'(x; \theta)^2}{p^2(x; \theta)} p(x; \theta) dx = \int p''(x; \theta) dx - \int \left(\frac{p'(x; \theta)}{p(x; \theta)}\right)^2 p(x; \theta) dx = - E\left[\left(\frac{\partial l}{\partial \theta}\right)^2\right] $$
where the last equality uses $\int p''(x; \theta) dx = \frac{d^2}{d \theta^2} \int p(x; \theta) dx = 0$, by the same interchange of differentiation and integration as above.
Claim Suppose a distribution $X \sim p(x; \theta) := \exp(x \theta - F(\theta) + k(x))$. Then $E[X] = F'(\theta)$ and $\operatorname{Var}(X) = F''(\theta)$.
This follows from applying the lemma above: here $\frac{\partial l}{\partial \theta} = x - F'(\theta)$ and $\frac{\partial^2 l}{\partial \theta^2} = -F''(\theta)$, so the first identity gives $E[X] = F'(\theta)$ and the second gives $F''(\theta) = E\left[(X - F'(\theta))^2\right] = \operatorname{Var}(X)$.
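A minimal numerical sanity check (a sketch I am adding, not from the paper), using the Poisson example: with $\theta = \log \lambda$ and $F(\theta) = e^\theta$, the claim predicts $E[X] = F'(\theta) = \lambda$ and $\operatorname{Var}(X) = F''(\theta) = \lambda$.

```python
# Sanity check: for the Poisson distribution written in canonical form,
# theta = log(lambda) and F(theta) = exp(theta), so F'(theta) = F''(theta) = lambda.
# The claim then predicts E[X] = Var[X] = lambda.
import numpy as np

rng = np.random.default_rng(0)

lam = 3.5                        # Poisson rate
theta = np.log(lam)              # canonical (natural) parameter
predicted_mean = np.exp(theta)   # F'(theta)
predicted_var = np.exp(theta)    # F''(theta)

samples = rng.poisson(lam, size=1_000_000)
print("empirical mean:", samples.mean(), "predicted:", predicted_mean)
print("empirical var :", samples.var(), "predicted:", predicted_var)
```

Both empirical values should land close to 3.5, matching $F'(\theta)$ and $F''(\theta)$.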