Statistical models are useful for reasoning about the process that generated our data.
We assume that the data we have was generated by some underlying process (e.g. some distribution) which is parameterized in some way. Note that sometimes we may get mixtures of distributions (e.g. if you have a bimodal distribution); that is, the data is generated by multiple processes that we need to pull apart.
We don't know these parameters so we must estimate them.
In parametric inference we choose a distribution we think makes sense, then estimate parameters for that distribution such that it best fits the data. For example, if we choose a normal distribution, our parameters are the mean $\mu$ and the variance $\sigma^2$.
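As a minimal sketch of what "estimating the parameters" looks like in practice (the sample data here is invented for illustration), fitting a normal distribution reduces to estimating those two parameters from the sample:

```python
import statistics

# Hypothetical sample, assumed to have been generated by some normal distribution
data = [4.9, 5.3, 4.7, 5.1, 5.6, 4.8, 5.2]

# Estimate the normal distribution's parameters from the sample
mu_hat = statistics.mean(data)      # estimate of the mean
sigma_hat = statistics.stdev(data)  # estimate of the standard deviation

print(mu_hat, sigma_hat)
```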
Two commonly used methods of estimation are:
We assume that our datapoints are independent and identically distributed (iid).
The total probability of the data is then the product of the probabilities of each datapoint; viewed as a function of the parameters $\theta$, this is the likelihood: $L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$.
You often want to deal with log likelihoods instead, which turn the product into a sum: $\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$.
So then, to find the maximum likelihood, we take the derivative of the likelihood, set it to zero, and solve (like you would any optimization problem).
The result however may not have a closed-form solution (it cannot be solved analytically) and thus we must use numerical optimization.
There are a variety of numerical optimization methods, but the general idea is that an initial guess is made for the solution, then this guess is iteratively improved upon until it approximates the solution.
One such method is the Newton-Raphson algorithm.
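As a sketch of the idea (not tied to any particular likelihood), Newton-Raphson repeatedly improves a guess $x$ via the update $x \leftarrow x - f(x)/f'(x)$ until $f(x) \approx 0$. Here we use it to find the positive root of $f(x) = x^2 - 2$:

```python
def newton_raphson(f, f_prime, x0, tol=1e-10, max_iter=100):
    """Iteratively improve an initial guess x0 until f(x) is (near) zero."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: the positive root of f(x) = x^2 - 2 is sqrt(2)
root = newton_raphson(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
print(root)
```

To find a maximum likelihood estimate this way, you would apply the same update to the derivative of the log likelihood (i.e. solve $\ell'(\theta) = 0$).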
Say we have a coin and we're not sure whether it's fair.
We flip it 8 times ($n = 8$) and observe the sequence TTTHHHHH. What is the probability of tails, $p$?
A Bernoulli distribution is appropriate here: each flip comes up tails with probability $p$ and heads with probability $1 - p$, i.e. $P(x \mid p) = p^x (1-p)^{1-x}$, where $x = 1$ for tails.
The likelihood of our data then is: $L(p) = \prod_{i=1}^{n} p^{x_i} (1-p)^{1-x_i}$.
If the number of tails is $k$, this simplifies to $L(p) = p^k (1-p)^{n-k}$.
For this example, this works out to be: $L(p) = p^3 (1-p)^5$.
Then we want to find a value for $p$ that maximizes this likelihood.
As mentioned above, it's often easier to work with the log likelihood instead, so we'd be working with: $\ell(p) = 3 \log p + 5 \log(1 - p)$.
So then to get the maximum, you'd solve for where the derivative is equal to 0: $\frac{d\ell}{dp} = \frac{3}{p} - \frac{5}{1-p} = 0$.
You should get $\hat{p} = 3/8 = 0.375$.
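To sanity-check the algebra (this is just a verification sketch, not part of the derivation), we can evaluate the log likelihood $\ell(p) = 3 \log p + 5 \log(1-p)$ over a fine grid and confirm the maximum lands at $p = 3/8$:

```python
import math

def log_likelihood(p, tails=3, heads=5):
    """Bernoulli log likelihood for the observed flips TTTHHHHH."""
    return tails * math.log(p) + heads * math.log(1 - p)

# Evaluate on a fine grid over (0, 1) and take the argmax
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_likelihood)

print(p_hat)  # maximum is at p = 3/8 = 0.375
```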
How do you quantitatively evaluate how good your model is against the data?
You can compare the observed quantiles of the data to the quantiles from your estimated model.
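One way to do this (a sketch using Python's `statistics.NormalDist`; the sample data is invented) is to line up each observed sample quantile against the fitted model's quantile at the same level:

```python
from statistics import NormalDist, mean, stdev

# Hypothetical sample we have fit with a normal distribution
data = sorted([4.9, 5.3, 4.7, 5.1, 5.6, 4.8, 5.2, 5.0])
model = NormalDist(mu=mean(data), sigma=stdev(data))

# Compare each observed quantile with the model's quantile at the same level
n = len(data)
for i, observed in enumerate(data):
    level = (i + 0.5) / n            # plotting position for the i-th point
    expected = model.inv_cdf(level)  # model quantile at that level
    print(f"q={level:.3f}  observed={observed:.2f}  model={expected:.2f}")
```

If the model fits well, the observed and model quantiles should lie close to the line $y = x$; this is exactly what a Q-Q plot visualizes.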
You typically must choose from multiple models for a given dataset - how do you decide?
There are a few rules of thumb, such as preferring simpler models (e.g. those with fewer parameters).
You can capture this rule of thumb using Akaike's Information Criterion (AIC), which balances the fit of the model (in terms of likelihood) with the number of parameters $k$ needed for that fit. It is computed: $\mathrm{AIC} = 2k - 2\ln\hat{L}$, where $\hat{L}$ is the maximized likelihood. For models fit by least squares, this is equivalent (up to a constant) to $n\ln(\mathrm{RSS}/n) + 2k$.
The intuition here is that as the number of parameters increases, the residual sum of squares goes down (the fit improves), but the penalty term $2k$ also increases.
AIC is a metric of information distance between a given model and some "true" model. Of course, we don't know what the true model is, so AIC values aren't interpretable in the absolute sense. But these values are useful relative to one another, because we get a relative measure of model quality.
So we want the model with the lowest AIC.
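As a sketch (the two maximized log likelihood values below are invented for illustration), comparing models by $\mathrm{AIC} = 2k - 2\ln\hat{L}$ just means computing it for each candidate and picking the smallest:

```python
def aic(k, log_likelihood):
    """Akaike's Information Criterion: 2k - 2*ln(L-hat)."""
    return 2 * k - 2 * log_likelihood

# Hypothetical fits: a 2-parameter model vs a 5-parameter model.
# The complex model fits slightly better, but not enough to justify
# the three extra parameters.
models = {
    "simple (k=2)": aic(k=2, log_likelihood=-120.4),
    "complex (k=5)": aic(k=5, log_likelihood=-118.9),
}

best = min(models, key=models.get)
print(models)
print("preferred model:", best)
```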