Statistical models are useful for:

- describing the data
- estimating parameters of the process that may have generated the data
- making predictions

We assume that the data we have was generated by some underlying process (e.g. some distribution) which is parameterized in some way. Note that sometimes we may get mixtures of distributions (e.g. if you have a bimodal distribution); that is, the data is generated by multiple processes that we need to pull apart.

We don't know these parameters so we must *estimate* them.

In *parametric* inference we choose a distribution we think makes sense, then estimate parameters for that distribution such that it best fits the data. For example, if we choose a normal distribution, our parameters are the mean $\mu$ and the variance $\sigma^2$.

Two common methods of estimation are:

- *Method of moments*: choose parameters such that the sample moments (typically the sample mean and variance) match the theoretical moments of our chosen distribution.
- *Maximum likelihood*: choose the parameters to maximize the likelihood, which measures how likely it is to observe our given sample.
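As a sketch of the method of moments, we can fit a gamma distribution by matching its first two moments (the simulated data and the "true" parameter values here are illustrative assumptions, not from the source):

```python
import numpy as np

# Hypothetical sample; in practice this would be your observed data.
rng = np.random.default_rng(42)
data = rng.gamma(shape=2.0, scale=3.0, size=10_000)

# For a gamma distribution with shape k and scale theta:
#   mean = k * theta,  variance = k * theta^2
# Solving for the parameters:
#   theta = variance / mean,  k = mean / theta
mean, var = data.mean(), data.var()
theta_hat = var / mean
k_hat = mean / theta_hat

print(k_hat, theta_hat)  # should be close to the true (2.0, 3.0)
```

For a normal distribution this procedure is trivial (the sample mean and variance *are* the parameter estimates), which is why a two-parameter gamma makes a more interesting example.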

We assume that our data points $x_1, \dots, x_n$ are independent and identically distributed (iid).

The product of the probabilities of each datapoint is then the *likelihood* of our data - that is, how likely it is to have the data we have, given our estimate $\theta$ for the parameters:

$$
L(\theta) = \prod_{i=1}^n P(x_i|\theta)
$$

You often want to deal with log likelihoods instead, in which case the *log likelihood* is:

$$
\ell(\theta) = \sum_{i=1}^n \log P(x_i|\theta)
$$

So then, to find the maximum likelihood estimate, we take the derivative of the log likelihood, set it to zero, and solve (like you would any optimization problem).

The resulting equation, however, may not have a closed-form solution (it cannot be solved analytically), in which case we must use numerical optimization.

There are a variety of numerical optimization methods, but the general idea is that an initial guess is made for the solution, then this guess is iteratively improved upon until it approximates the solution.

One such method is the Newton-Raphson algorithm.
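A minimal sketch of the Newton-Raphson iteration (the example function and starting guess below are illustrative, not from the source):

```python
def newton_raphson(f, f_prime, x0, tol=1e-10, max_iter=100):
    """Find a root of f by iteratively improving the guess x
    with the update x_new = x - f(x) / f'(x)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)
        x -= step
        if abs(step) < tol:  # stop once the update is negligible
            break
    return x

# Example: the positive root of f(x) = x^2 - 2 is sqrt(2).
root = newton_raphson(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
print(root)  # ≈ 1.41421356...
```

For maximum likelihood, the same iteration is applied to the derivative of the log likelihood (so `f` is $\ell'$ and `f_prime` is $\ell''$), since maximizing $\ell$ means finding a root of $\ell'$.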

Say we have a coin and we're not sure whether it's fair.

We flip it 8 times and observe `TTTHHHHH`. What is the probability of tails ($p$)?

We set $x = 1$ for tails and $x = 0$ for heads.

A Bernoulli distribution is appropriate here:

$$
P(x|p) = p^x (1-p)^{1-x}
$$

where $x \in \{0, 1\}$.

The likelihood of our data then is:

$$
L(p) = \prod_{i=1}^n p^{x_i} (1-p)^{1-x_i}
$$

If the number of tails is $k$ out of $n$ flips, this simplifies to:

$$
L(p) = p^k (1-p)^{n-k}
$$

For this example, this works out to be:

$$
L(p) = p^3 (1-p)^5
$$

Then we want to find the value of $p$ that maximizes this likelihood.

As mentioned above, it's often easier to work with the log likelihood instead, so we'd be working with:

$$
\ell(p) = 3 \log p + 5 \log (1-p)
$$

So then to get the maximum you'd solve for where the derivative is equal to 0:

$$
\frac{d\ell}{dp} = \frac{3}{p} - \frac{5}{1-p} = 0
$$

You should get $\hat{p} = \frac{3}{8}$.
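We can sanity-check this numerically, for example with SciPy's `minimize_scalar` (a sketch; the source doesn't prescribe this particular optimizer):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Negative log likelihood for 3 tails in 8 flips;
# we minimize the negative because SciPy optimizers minimize.
def neg_log_lik(p):
    return -(3 * np.log(p) + 5 * np.log(1 - p))

# Keep p strictly inside (0, 1) to avoid log(0).
res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)  # ≈ 0.375 = 3/8
```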

How do you quantitatively evaluate how good your model is against the data?

You can compare the observed quantiles of the data to the quantiles from your estimated model (e.g. with a quantile-quantile plot).
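A sketch of such a quantile comparison, assuming simulated normal data and a normal model fit by maximum likelihood (the data and parameter values are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1_000)

# MLE for a normal model: sample mean and (population) standard deviation.
mu_hat, sigma_hat = data.mean(), data.std()

# Compare observed sample quantiles to the fitted model's quantiles.
probs = [0.25, 0.5, 0.75]
observed = np.quantile(data, probs)
model = stats.norm.ppf(probs, loc=mu_hat, scale=sigma_hat)
print(observed)
print(model)  # close agreement suggests the model fits well
```

Plotting many such quantile pairs against each other (a Q-Q plot) turns this check into a visual one: a good fit hugs the diagonal.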

You typically must choose from multiple models for a given dataset - how do you decide?

There are a few rules of thumb, such as preferring simpler models (e.g. those with fewer parameters).

You can capture this rule of thumb using *Akaike's Information Criterion* (AIC), which balances the fit of the model (in terms of likelihood) with the number of parameters needed for that fit. It is computed:

$$
\text{AIC} = -2 \log L + 2k
$$

where $\log L$ is the maximized log likelihood of the model and $k$ is the number of parameters.

The intuition here is that as the number of parameters increases, the fit improves - the likelihood goes up (for least-squares models, equivalently, the residual sum of squares goes down) - but the second term (which is a penalty) also increases.

AIC is a metric of information distance between a given model and some "true" model. Of course, we don't know what the true model is, so AIC values aren't interpretable in the absolute sense. But these values are useful relative to one another, because we get a relative measure of model quality.

So we want the model with the lowest AIC.
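A sketch of comparing two candidate models by AIC (the simulated gamma data and the choice of candidate models are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.gamma(shape=2.0, scale=1.0, size=500)

def aic(log_lik, k):
    # AIC = -2 * (maximized log likelihood) + 2 * (number of parameters)
    return -2 * log_lik + 2 * k

# Candidate 1: exponential distribution (one parameter: the scale).
loc_e, scale_e = stats.expon.fit(data, floc=0)
ll_expon = stats.expon.logpdf(data, loc_e, scale_e).sum()

# Candidate 2: gamma distribution (two parameters: shape and scale).
a, loc_g, scale_g = stats.gamma.fit(data, floc=0)
ll_gamma = stats.gamma.logpdf(data, a, loc_g, scale_g).sum()

print(aic(ll_expon, 1), aic(ll_gamma, 2))  # prefer the lower AIC
```

Here the gamma model should win despite its extra parameter, because the data really were generated by a gamma process; the penalty term only overrides a better fit when the improvement in likelihood is small.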
