Statistical models are useful for describing the process that generated our data and for estimating the parameters of that process.

We assume that the data we have was generated by some underlying process (e.g. some distribution) which is parameterized in some way. Note that sometimes we may get mixtures of distributions (e.g. if you have a bimodal distribution); that is, the data is generated by multiple processes that we need to pull apart.

We don't know these parameters so we must estimate them.

In parametric inference we choose a distribution we think makes sense, then estimate parameters for that distribution such that it best fits the data. For example, if we choose a normal distribution, our parameters are the mean $\mu$ and the standard deviation $\sigma$.

Two common methods of estimation are:

  1. Method of moments: choose parameters such that the sample moments (typically the sample mean and variance) match the theoretical moments of our chosen distribution (a small sketch of this appears just after this list).
  2. Maximum likelihood: choose the parameters to maximize the likelihood, which measures how likely it is to observe our given sample.
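
For example, here is a minimal sketch (in Python with NumPy; the simulated `data` is just a stand-in) of the method of moments for a normal distribution, whose first two moments are exactly its mean and variance:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1000)  # stand-in sample for illustration

# For a normal distribution the theoretical mean and variance are mu and
# sigma^2, so matching moments means plugging in the sample mean and variance.
mu_hat = np.mean(data)
sigma_hat = np.sqrt(np.var(data))

print(mu_hat, sigma_hat)  # should come out close to 5.0 and 2.0
```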

Maximum likelihood

We assume that our data $y = y_1, \dots, y_n$ is distributed according to some distribution $P(Y_i = y_i|\theta)$; that is, it is parameterized by $\theta$.

We say that the joint probability of the data, $\prod_{i=1}^n P(y_i|\theta)$ (assuming the observations are independent), is the likelihood of our data - that is, how likely it is to observe the data we have, given our estimate for $\theta$. We want to choose $\theta$ such that this probability - this likelihood - is maximized.

It is often easier to work with the log likelihood instead, which is $\sum_{i=1}^n \log P(y_i|\theta)$. Since the log is monotonic, maximizing the log likelihood yields the same $\theta$ as maximizing the likelihood.
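
As a small illustration (assuming a normal model and using `scipy.stats`; the toy `data` array is made up), the log likelihood is just the sum of log densities evaluated at candidate parameters:

```python
import numpy as np
from scipy.stats import norm

data = np.array([4.2, 5.1, 6.3, 4.8, 5.5])  # toy sample, purely for illustration

def log_likelihood(mu, sigma):
    # Sum of log P(y_i | mu, sigma) under a normal model
    return np.sum(norm.logpdf(data, loc=mu, scale=sigma))

print(log_likelihood(5.0, 1.0))   # parameters near the data -> higher log likelihood
print(log_likelihood(0.0, 1.0))   # parameters far from the data -> much lower
```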

So then, to find the maximum likelihood estimate, we take the derivative of the (log) likelihood with respect to $\theta$, set it to zero, and solve (as you would in any optimization problem).

This equation, however, may not have a closed-form solution (i.e. it cannot be solved analytically), in which case we must use numerical optimization.

There are a variety of numerical optimization methods, but the general idea is that an initial guess is made for the solution, then this guess is iteratively improved upon until it approximates the solution.

One such method is the Newton-Raphson algorithm.
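
A minimal sketch of the idea in one dimension (the `newton_raphson` helper here is hypothetical, written just for illustration): we look for a root of the score (the derivative of the log likelihood) by repeatedly applying the update $\theta \leftarrow \theta - l'(\theta)/l''(\theta)$.

```python
def newton_raphson(score, score_prime, theta0, tol=1e-8, max_iter=100):
    """Find a root of the score (the derivative of the log likelihood) by
    repeatedly updating theta <- theta - score(theta) / score'(theta)."""
    theta = theta0
    for _ in range(max_iter):
        step = score(theta) / score_prime(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Toy illustration: maximize l(theta) = -(theta - 2)^2, whose derivative is
# -2*(theta - 2) and whose second derivative is -2. The maximum is at theta = 2.
print(newton_raphson(lambda t: -2 * (t - 2), lambda t: -2.0, theta0=0.0))
```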

Example

Say we have a coin and we're not sure whether or not it's fair.

We flip it 8 times ($N=8$) and get TTTHHHHH. What is the probability of tails ($P(T)$)?

We set $\pi = P(T)$ ($\pi$ is typical notation here).

A Bernoulli distribution is appropriate here:

$$ P(y_i) = \pi^{y_i}(1-\pi)^{1-y_i} $$

Where $y_i \in \\{0,1\\}$ (i.e. is either H or T). So then $P(y_i=1) = \pi$ (tails) and $P(y_i=0) = 1-\pi$ (heads).

The likelihood of our data then is:

$$ P(\text{data}) = \prod_{i=1}^N P(y_i) $$

If the number of tails is $n$ (and thus the number of heads is $N-n$), we can also write this as:

$$ P(\text{data}) = \pi^n (1-\pi)^{N-n} $$

For this example, this works out to be:

$$ P(\text{data}) = \pi^3 (1-\pi)^5 $$

Then we want to find a value for $\pi$ which maximizes this equation.

As mentioned above, it's often easier to work with log likelihood instead, so we'd be working with:

$$ \log P(\text{data}) = 3\log\pi + 5\log(1-\pi) $$

So then, to get the maximum, we set the derivative with respect to $\pi$ equal to 0 and solve.
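
Writing out the derivative explicitly:

$$ \frac{d}{d\pi} \log P(\text{data}) = \frac{3}{\pi} - \frac{5}{1-\pi} = 0 \quad \Rightarrow \quad 3(1-\pi) = 5\pi $$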

Solving gives $\pi = 3/8$ - that is, just the observed proportion of tails.
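
As a sanity check, here's a small sketch that maximizes the same log likelihood numerically (using `scipy.optimize.minimize_scalar`, purely for illustration) and recovers the same estimate:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Observed flips, with 1 = tails and 0 = heads (TTTHHHHH)
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])

def neg_log_likelihood(pi):
    # Negative Bernoulli log likelihood: -(sum of y_i*log(pi) + (1-y_i)*log(1-pi))
    return -np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))

# Minimize the negative log likelihood over (0, 1)
result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)  # ~0.375, i.e. 3/8
```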

Model Checking

How do you quantitatively evaluate how good your model is against the data?

You can compare the observed quantiles of the data to the quantiles from your estimated model.
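
For example, a minimal sketch of that comparison (assuming a normal model fit to some stand-in data; the names here are just for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(loc=5.0, scale=2.0, size=500)  # stand-in observed data

# Fit a normal model (maximum likelihood estimates)
mu_hat, sigma_hat = np.mean(data), np.std(data)

# Compare observed quantiles to the fitted model's quantiles
probs = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
observed_q = np.quantile(data, probs)
model_q = norm.ppf(probs, loc=mu_hat, scale=sigma_hat)

# If the model fits well, each pair of quantiles should be close
print(np.column_stack([observed_q, model_q]))
```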

Model Selection

You typically must choose from multiple models for a given dataset - how do you decide?

There are a few rules of thumb, such as preferring simpler models (e.g. those with fewer parameters).

You can capture this rule of thumb using Akaike's Information Criterion (AIC), which balances the fit of the model (in terms of likelihood) with the number of parameters needed to achieve that fit. It is computed as:

$$ AIC = n \log (\hat{\sigma}^2) + 2p $$

Where $p$ is the number of parameters in the model and $\hat{\sigma}^2 = \frac{RSS}{n-p-1}$ (reminder: $RSS$ is the residual sum of squares).

The intuition here is that adding parameters drives the residual sum of squares down (better fit), but the penalty term $2p$ grows, so extra parameters must improve the fit by enough to be worth their cost.

AIC is a metric of information distance between a given model and some "true" model. Of course, we don't know what the true model is, so AIC values aren't interpretable in the absolute sense. But these values are useful relative to one another, because we get a relative measure of model quality.

So we want the model with the lowest AIC.
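
As a rough illustration (the data and the two regression models below are made up, and the `aic` helper just follows the formula given above), here is a comparison of a model with one useful predictor against one that adds an irrelevant predictor:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                      # an irrelevant extra predictor
y = 2.0 + 3.0 * x1 + rng.normal(size=n)

def aic(X, y):
    """AIC = n*log(sigma_hat^2) + 2p with sigma_hat^2 = RSS / (n - p - 1),
    following the formula above; p is the number of estimated coefficients."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # least-squares fit
    rss = np.sum((y - X @ beta) ** 2)             # residual sum of squares
    return n * np.log(rss / (n - p - 1)) + 2 * p

X_small = np.column_stack([np.ones(n), x1])       # intercept + x1
X_big = np.column_stack([np.ones(n), x1, x2])     # also includes the useless x2

print(aic(X_small, y), aic(X_big, y))  # the smaller model usually has the lower AIC
```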

References