Bayesian statistics is an approach to statistics contrasted with frequentist approaches.
As with frequentist statistical inference, Bayesian inference is concerned with estimating parameters from observed data. However, whereas frequentist inference returns point estimates - that is, single values - for these parameters, Bayesian inference expresses the parameters themselves as probability distributions. This is intuitively appealing: we are uncertain about the parameters we've inferred, and with Bayesian inference we can represent that uncertainty.
This is to say that in Bayesian inference, we don't assign an explicit value to an unknown parameter. Rather, we describe it with a probability distribution: what values is the parameter likely to take on? That is, we treat the parameter itself as a random variable.
We may say for instance that an unknown parameter
Fundamentally, this is Bayesian inference:
Where the parameters
So we must decide (specify) probability distributions for both the data sample and for the unknown parameters. These decisions involve making a lot of assumptions. Then you must compute a posterior distribution, which often cannot be calculated analytically - so other methods are used (such as simulations, described later).
From the posterior distribution, you can calculate point estimates, credible intervals, quantiles, and make predictions.
Finally, because of the assumptions which go into specifying the initial distributions, you must test your model and see if it fits the data and seems reasonable.
Thus Bayesian inference amounts to:
For frequentists, probability is thought of in terms of frequencies, i.e. the probability of an event is the number of times it happened divided by the total number of times it could have happened.
In frequentist statistics, the observed data is considered random; if you gathered more observations they would be different according to the underlying distribution. The parameters of the model, however, are considered fixed.
For Bayesians, probability is belief or certainty about an event. Observed data is considered fixed, but the model parameters are random (uncertain) instead and considered to be drawn from some probability distribution.
Another way of phrasing this is that frequentists are concerned with uncertainty in the data, whereas Bayesians are concerned with uncertainty in the parameters.
In frequentist statistics, many different estimators may be used, but in Bayesian statistics the only estimator is Bayes' Formula (aka Bayes' Rule or Bayes' Theorem).
Bayes' Theorem, aka Bayes' Rule:
For an example of likelihood:
If I want to predict the number of sides of a die I rolled, and I rolled an 8, then
A key insight to draw from Bayes' Rule is that
Note that the normalizing constant
One workaround is to do approximate inference with non-normalized posteriors, since we know that the posterior is proportional to the numerator term:
Another workaround is to approximate the posterior using simulation methods such as Monte Carlo.
Given a set of hypotheses
The distribution of the posterior probabilities is the posterior distribution, i.e.
Likelihood is not the same as probability (thus it does not have to sum to 1), but it is proportional to probability. More specifically, the likelihood of a hypothesis
With the probability
The Law of Likelihood states that the hypothesis for which the probability of the data is greater is the more likely hypothesis. For example,
We can also quantify how much better
Likelihoods are meaningless in isolation (because of the constant
A Bayes factor is an extension of likelihood ratios: it is a weighted average likelihood ratio based on the prior distribution of hypotheses. So we have some prior bias as to what hypotheses we expect, i.e. how probable we expect some hypotheses to be, and we weigh the likelihood ratios by these expected probabilities.
With Bayesian inference, we must choose a prior distribution, then apply data to get our posterior distribution. The prior is chosen based on domain knowledge or intuition or perhaps from the results of previous analysis; that is, it is chosen subjectively - there is no prescribed formula for picking a prior. If you have no idea what to pick, you can just pick a uniform distribution as your prior.
Your choice of prior will affect the posterior that you get, and the subjectivity of this choice is what makes Bayesian statistics controversial - but it's worth noting that all of statistics, whether frequentist or Bayesian, involves many subjective decisions (e.g. frequentists must decide on an estimator to use, what data to collect and how, and so on). What matters most is that you are explicit about your decisions and why you made them.
Say we perform a Bayesian analysis and get a posterior. Then we get some new data for the same problem. We can reuse the posterior from before as our prior, and when we run Bayesian analysis on the new data, we get a new posterior which reflects the additional data. We don't have to redo any analysis on the earlier data; all we need is the posterior generated from it.
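A minimal sketch of this posterior-as-prior updating, assuming a Beta prior with binomial data (the counts here are made up):

```python
# Sequential Bayesian updating with a Beta prior and binomial data.
# The Beta-Binomial conjugate pair is used so that each update is
# just addition of counts (see the section on conjugate priors).
def update(alpha, beta, successes, failures):
    # Beta(alpha, beta) prior + binomial data -> Beta posterior
    return alpha + successes, beta + failures

# Start with a flat Beta(1, 1) prior
a, b = 1, 1

# First batch of data: 6 successes, 4 failures
a, b = update(a, b, 6, 4)   # posterior: Beta(7, 5)

# New data arrives: reuse the posterior as the prior
a, b = update(a, b, 4, 6)   # posterior: Beta(11, 11)

# Updating sequentially gives the same result as pooling all
# the data into a single analysis:
assert (a, b) == update(1, 1, 6 + 4, 4 + 6)
```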
For any unknown quantity we want to model, we say it is drawn from some prior of our choosing. This is usually some parameter describing a probability distribution, but it could be other values as well. This is central to Bayesian statistics - all unknowns are represented as distributions of possible values. In Bayesian statistics: if there's a value and you don't know what it is, come up with a prior for it and add it to your model!
If you think of distributions as landscapes or surfaces, then the data deforms the prior surface to mold it into the posterior distribution.
The surface's "resistance" to this shaping process depends on the selected prior distribution.
When it comes to selecting Bayesian priors, there are two broad categories:
An example objective prior is a uniform (flat) prior where every value has equal weighting. Using a uniform prior is called The Principle of Indifference. Note that a uniform prior restricted within a range is not objective - it has to be over all possibilities.
Note that the more data you have (as
Conjugate priors are priors which, when combined with the likelihood, result in a posterior which is in the same family. These are very convenient because the posterior can be calculated analytically, so there is no need to use approximation such as Markov Chain Monte Carlo (see below).
For example, a binomial likelihood is conjugate with a beta prior - their combination results in a beta posterior (the "beta-binomial" is the corresponding marginal distribution of the data).
For example, the Gaussian family of distributions is conjugate to itself (self-conjugate) - a Gaussian likelihood with a Gaussian prior results in a Gaussian posterior.
For example, when working with count data you will probably use the Poisson distribution for your likelihood, which is conjugate with gamma distribution priors, resulting in a gamma posterior.
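A sketch of the gamma-Poisson conjugate update from the count-data example, with made-up hyperparameters and counts:

```python
import numpy as np

# Conjugate Gamma prior for a Poisson rate, applied to count data.
# The prior hyperparameters (shape=2, rate=1) and the counts are
# illustrative assumptions.
shape, rate = 2.0, 1.0
counts = np.array([3, 5, 4, 6, 2, 4])  # hypothetical observed counts

# Gamma(shape, rate) prior + Poisson likelihood gives a
# Gamma(shape + sum(counts), rate + n) posterior in closed form -
# no simulation needed.
post_shape = shape + counts.sum()
post_rate = rate + len(counts)

# Posterior mean of the rate parameter lambda
post_mean = post_shape / post_rate
```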
Unfortunately, conjugate priors only really show up in simple one-dimensional models.
More generally, we can define a conjugate prior like so:
Say random variable
For the given distribution
The Beta-Binomial model is a useful Bayesian model because it provides values between 0 and 1, which is useful for estimating probabilities or percentages.
It involves, as you might expect, a beta and a binomial distribution.
So say we have
For frequentist inference we'd estimate
This being Bayesian inference, we first must select a prior.
We prefer a beta prior because, given binomial observations, the posterior will also be a beta distribution (a uniform prior is just the special case Beta(1, 1)).
It works out nicely mathematically:
So with these two distributions, we can directly compute the posterior with no need for simulation (e.g. MCMC).
How do you choose the parameters for a Beta prior? Well, it depends on the particular problem, but a conservative one, for when you don't have a whole lot of info to go on, is
We run 100 trials and observe 10 successes. What is the probability
Our knowns are
For our prior for
We can directly compute the posterior now:
Then we can draw samples from the distribution and compute its mean or other descriptive statistics such as the credible interval.
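A sketch of this computation; since the prior parameters above aren't fully specified, a flat Beta(1, 1) prior is assumed here:

```python
import numpy as np

rng = np.random.default_rng(42)

# 100 trials with 10 successes, and an assumed flat Beta(1, 1) prior.
alpha_prior, beta_prior = 1, 1
successes, trials = 10, 100

# Conjugate update: Beta(alpha + successes, beta + failures)
alpha_post = alpha_prior + successes            # 11
beta_post = beta_prior + (trials - successes)   # 91

# Draw samples from the posterior and summarize
samples = rng.beta(alpha_post, beta_post, size=100_000)
post_mean = samples.mean()                      # close to 11/102
ci_low, ci_high = np.percentile(samples, [2.5, 97.5])  # 95% credible interval
```

The posterior mean here is the analytic value alpha / (alpha + beta) = 11/102, so the sampling step is only for illustration; with a non-conjugate model it would be the main tool.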
The strength of the prior affects the posterior - the stronger your prior beliefs, the more difficult it is to change those beliefs (it requires more data/evidence). You can conduct sensitivity analysis to try your approach with various different priors to get an idea of how different priors affect your resulting posterior.
Empirical Bayes is a method which combines frequentist and Bayesian approaches by using frequentist methods to select the hyperparameters.
For instance, say you want to estimate the
You could use the empirical sample mean from the observed data:
Though if you're working with not much data, this ends up being a bit like double-counting your data.
With Bayesian inference, in order to describe your posterior, you often must evaluate complex multidimensional integrals (i.e. from very complex, multidimensional probability distributions), which can be computationally intractable.
Instead you can generate sample points from the posterior distribution and use those samples to compute whatever descriptions you need. This technique is called Monte Carlo integration, and the process of drawing repeated random samples in this way is called Monte Carlo simulation. In particular, we can use a family of techniques known as Markov Chain Monte Carlo, which combine Monte Carlo integration and simulation with Markov chains, to generate samples for us.
Monte Carlo integration is a way to approximate complex integrals using random number generation.
Say we have a complex integral:
If we can decompose
That is, the result of this integral is the expected value of
We can approximate this expected value by taking the mean of many, many samples (
This process of approximating the integral is Monte Carlo integration.
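A minimal sketch of Monte Carlo integration, using the standard normal for the sampling distribution and h(x) = x^2 (so the true value of the integral, the expected value of x^2, is 1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Approximate the integral of h(x) * p(x) dx, where p is the
# standard normal density and h(x) = x**2, by averaging h over
# many samples drawn from p. The exact answer is E[x^2] = 1.
samples = rng.standard_normal(1_000_000)
estimate = (samples ** 2).mean()  # Monte Carlo estimate of the integral
```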
For very simple cases of known distributions, we can sample directly, e.g.
import numpy as np

# Say we think the distribution is a Poisson distribution
# and the parameter of our distribution, lambda,
# is unknown and what we want to discover.
lam = 5

# Collect 100000 samples
sim_vals = np.random.poisson(lam, size=100000)

# Get whatever descriptions we want, e.g.
mean = sim_vals.mean()

# For Poisson, the mean is lambda, so we expect them to be
# approximately equal (given a large enough sample size)
abs(lam - mean) < 0.05
A Markov chain is a stochastic process in which the next state depends only on the current state.
Consider a random variable
For a Markov chain, the state
If we consider
If we call the
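A small simulation sketch of a Markov chain, with a made-up two-state transition matrix; the long-run fraction of time spent in each state approaches the chain's stationary distribution:

```python
import numpy as np

rng = np.random.default_rng(1)

# A two-state Markov chain: the next state depends only on the
# current state, via this (made-up) transition matrix. Row i gives
# the probabilities of moving from state i to each state.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

state = 0
visits = np.zeros(2)
for _ in range(100_000):
    state = rng.choice(2, p=P[state])  # transition from the current state
    visits[state] += 1

# Empirical occupancy approaches the stationary distribution,
# which for this matrix is (5/6, 1/6) analytically.
empirical = visits / visits.sum()
```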
MCMC is useful because often we may encounter distributions which aren't easily expressed mathematically (e.g. their functions may have very strange shapes), but we still want to compute some descriptive statistics (or make other computations) from them. MCMC allows us to work with such distributions without needing precise mathematical formulations of them.
More generally, MCMC is really useful if you don't want to (or can't) find the underlying function describing something. As long as you can simulate that process in some way, you don't need to know the exact function - you can just generate enough sample data to work with in its stead. So MCMC is a brute force but effective method.
Rather than directly compute the integral for posterior distributions in Bayesian analysis, we can instead use MCMC to draw several (thousands, millions, etc) samples from the probability distribution, then use these samples to compute whatever descriptions we'd like about the distribution (often this is some expected value of a function,
You start with some random initial sample and, based on that sample, you pick a new sample. This is the Markov chain aspect of MCMC - the next sample you choose depends only on the current sample. This works out so that you spend most of your time on high-probability samples (because they have higher transition probabilities) but occasionally jump out to lower-probability samples. Eventually the chain converges to its stationary distribution, after which its draws can be treated as samples from the target distribution.
So we can take all these
Because of the random initialization, there is a "burn-in" phase in which the sampling model needs to be "warmed up" until it reaches an equilibrium sampling state, the stationary distribution. So you discard the first hundred or thousand or so samples as part of this burn-in phase. You can (eventually) arrive at this stationary distribution independent of where you started, which is why the random initialization is OK - this is an important feature of Markov chains.
MCMC is a general technique of which there are several algorithms.
Monte Carlo integration allows us to draw samples from a posterior distribution with a known parametric form. It does not, however, enable us to draw samples from a posterior distribution without a known parametric form. We may instead use rejection sampling in such cases.
We can take our function
In the case of unbounded support (i.e. infinite tails), we instead choose some majorizing or enveloping function
Then, for each
The intuition here is that the probability of a given point being accepted is proportional to the function
In multidimensional cases, you draw candidates from every dimension simultaneously.
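A sketch of rejection sampling, targeting the Beta(2, 2) density f(x) = 6x(1 - x) with a uniform proposal (both choices are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Rejection sampling from f(x) = 6x(1-x) on [0, 1] (the Beta(2, 2)
# density), using a uniform proposal scaled by M = 1.5, the maximum
# of f. A candidate x is kept with probability f(x) / M.
def f(x):
    return 6 * x * (1 - x)

M = 1.5
xs = rng.uniform(0, 1, size=200_000)   # candidate draws from the proposal
us = rng.uniform(0, M, size=200_000)   # uniform heights under M
accepted = xs[us < f(xs)]              # keep points falling under the curve

# The accepted points are distributed according to f; for example,
# their mean approaches the Beta(2, 2) mean of 0.5, and the overall
# acceptance rate approaches 1 / M = 2/3.
```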
The Metropolis-Hastings algorithm combines Markov chains with an accept/reject step similar to rejection sampling.
The proposal density
First select some initial
There are a few required properties of the Markov chain for this to work properly:
It can be proven that the stationary distribution of the Metropolis-Hastings algorithm is the target density (proof omitted).
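A minimal sketch of the algorithm. A symmetric (normal) proposal is assumed, so the Hastings correction cancels and this reduces to the plain Metropolis algorithm; the target here is an unnormalized standard normal, chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Metropolis sampling from an unnormalized density: exp(-x**2 / 2),
# i.e. a standard normal up to its normalizing constant. Note that
# we never need that constant - it cancels in the acceptance ratio.
def target(x):
    return np.exp(-x ** 2 / 2)

x = 0.0            # arbitrary starting point
chain = []
for _ in range(50_000):
    proposal = x + rng.normal(0, 1)          # symmetric proposal density
    accept_prob = min(1.0, target(proposal) / target(x))
    if rng.uniform() < accept_prob:
        x = proposal                         # accept the move
    chain.append(x)                          # on rejection, repeat current state

samples = np.array(chain[5_000:])  # discard burn-in samples
```

The retained samples should have mean about 0 and standard deviation about 1, matching the standard normal target.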
The ergodic property (whether or not the chain "mixes" well) can be validated with some convergence diagnostics. A common method is to plot the chain's values as they're drawn and see if the values tend to concentrate around a constant; if not, you should try a different proposal density.
Alternatively, you can look at an autocorrelation plot, which measures the chain's correlation with itself (from -1 to 1) at increasing time offsets, called "lags". We expect that the greater the lag, the less the points should be autocorrelated - that is, we expect autocorrelation to smoothly decrease to 0 with increasing lag. If autocorrelation remains high, then the chain is not fully exploring the space. Autocorrelation can be improved by thinning, a technique in which only every nth sample is kept and the rest are discarded.
Finally, you also have the options of running multiple chains, each with different starting values, and combining those samples.
You should also use burn-in.
It is easy to sample from simple distributions. For example, for a binomial distribution, you can basically just flip a coin. For a multinomial distribution, you can basically just roll a die.
If you have a multinomial, multivariate distribution, e.g.
However - what if these aren't independent, and we want to sample from the joint distribution
With Gibbs sampling we can approximate this joint distribution under the condition that we can easily sample from the conditional distribution for each variable, i.e.
We take advantage of this and iteratively sample from these conditional distributions, using the most recent value for each of the other variables (starting with random values at first). For example, sampling
If you iterate through this a large number of times you get an approximation of samples taken from the actual joint distribution.
Another way to look at Gibbs sampling:
Say you have random variables
We can first pick some starting sample, e.g.
Then we fix
Now we fix
Now we fix
Your samples will reflect the actual joint distribution of these values, since more likely samples are, well, more likely to be generated.
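A sketch of Gibbs sampling for a bivariate normal with correlation rho = 0.8 (chosen arbitrarily), where both full conditionals are univariate normals - exactly the condition Gibbs sampling needs:

```python
import numpy as np

rng = np.random.default_rng(4)

# Gibbs sampling from a standard bivariate normal with correlation
# rho. Each conditional (x given y, y given x) is a univariate
# normal with mean rho * other and variance 1 - rho**2.
rho = 0.8
x, y = 0.0, 0.0     # arbitrary starting values
samples = []
for _ in range(50_000):
    # Sample x given the current y, then y given the new x
    x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))
    y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))
    samples.append((x, y))

samples = np.array(samples[5_000:])  # drop burn-in

# The sample correlation between the two coordinates should
# approach rho, reflecting the actual joint distribution.
corr = np.corrcoef(samples[:, 0], samples[:, 1])[0, 1]
```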
MCMC can take a long time to get good answers - in theory if you run it infinitely it will generate enough samples to get a perfectly accurate distribution, but that's not a fair criterion (many algorithms do well if they have infinite time).
With variational inference we don't need to take samples - instead we fit an approximate joint distribution
The mean-field form of variational inference assumes that
Bayesian inference returns a distribution (the posterior) but we often need a single value (or a vector in multivariate cases). So we choose a value from the posterior. This value is a Bayesian point estimate.
Selecting the MAP (maximum a posteriori) value is insufficient because it neglects the shape of the distribution.
The expected loss of choosing estimate
You can approximate the expected loss using the Law of Large Numbers, which states that as the sample size grows, the sample mean approaches the true expected value. That is, as
For approximating expected loss, it looks like:
You want to select the estimate
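A sketch of choosing a point estimate by minimizing approximate expected loss; squared-error loss and a made-up Beta posterior are assumed here:

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-in posterior samples; in practice these would come from
# your Bayesian analysis (e.g. via MCMC).
samples = rng.beta(2, 8, size=100_000)

def expected_loss(estimate, samples):
    # Approximate expected squared-error loss as the sample mean
    # of the loss over the posterior samples (Law of Large Numbers)
    return np.mean((samples - estimate) ** 2)

# Search a grid of candidate estimates for the minimizer
candidates = np.linspace(0, 1, 201)
losses = [expected_loss(c, samples) for c in candidates]
best = candidates[int(np.argmin(losses))]
# Under squared-error loss, the minimizer is (approximately) the
# posterior mean, 2 / (2 + 8) = 0.2.
```

With a different loss function (e.g. absolute error, which gives the posterior median), the same grid-search sketch applies; only `expected_loss` changes.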
In Bayesian statistics, the closest analog to the frequentist confidence interval is the credible interval. It is much easier to interpret than the confidence interval because it is exactly what most people mistake the confidence interval to be. For instance, the 95% credible interval is the interval in which we expect to find
Mathematically this is expressed as:
We condition on
The Bayesian methodology can be applied to regression as well. In conventional regression the parameters are treated as fixed values that we uncover. In Bayesian regression, the parameters are treated as random variables, as they are elsewhere in Bayesian statistics. We define prior distributions for each parameter - in particular, normal priors, so that for each parameter we define a prior mean as well as a covariance matrix for all the parameters.
So we specify:
So the prior for your parameters then is a normal distribution parameterized by
Then there are a few formulas:
So the resulting distribution of parameters is a multivariate
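A sketch of these formulas under the simplifying assumption of known noise variance, so the posterior over the weights has a closed normal form; the data and prior hyperparameters here are made up:

```python
import numpy as np

rng = np.random.default_rng(6)

# Bayesian linear regression with a normal prior on the weights and
# known noise variance sigma2 (a simplifying assumption).
n, true_w = 200, np.array([1.5, -2.0])
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])  # intercept + slope
sigma2 = 0.25
y = X @ true_w + rng.normal(0, np.sqrt(sigma2), n)        # simulated data

# Prior on the weights: mean zero, identity covariance (illustrative)
mu0 = np.zeros(2)
Sigma0 = np.eye(2)

# Closed-form normal posterior for the weights:
#   Sigma_n = (Sigma0^-1 + X'X / sigma2)^-1
#   mu_n    = Sigma_n (Sigma0^-1 mu0 + X'y / sigma2)
Sigma_n = np.linalg.inv(np.linalg.inv(Sigma0) + X.T @ X / sigma2)
mu_n = Sigma_n @ (np.linalg.inv(Sigma0) @ mu0 + X.T @ y / sigma2)
# mu_n should land close to the true weights [1.5, -2.0], and
# Sigma_n quantifies our remaining uncertainty about them.
```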
Let's say we have a coin. We are uncertain whether or not it's a fair coin. What can we learn about the coin's fairness from a Bayesian approach?
Let's restate the problem. We can represent the outcome of a coin flip with a random variable,
It's reasonable to assume that
The beta distribution is parameterized by
The posterior for a beta prior will not be derived here, but it is
Now we can flip the coin a few times to gather our evidence.
Below are some illustrations of possible evidence with the prior and the resulting posterior.
A few things to note here:
What if instead of a flat prior, we had assumed that the coin was fair to begin with? In this scenario, the
Since our prior belief is stronger than it was with a flat prior, the same amount of evidence doesn't change the prior belief as much. For instance, now if we see a streak of heads, we are less convinced it is unfair.
In either case, we could take the expected value of
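The whole coin example can be sketched in code, with made-up flip data, comparing the flat prior to the stronger "fair coin" prior (Beta(20, 20) is used as a stand-in for a prior concentrated at 0.5):

```python
# Estimating a coin's fairness with a Beta-Binomial model.
# Suppose we flip the coin 30 times and see 22 heads (made-up data).
heads, flips = 22, 30

# Flat prior: Beta(1, 1)
flat_post = (1 + heads, 1 + flips - heads)    # Beta(23, 9)

# "Fair coin" prior concentrated at 0.5: Beta(20, 20) (an assumption)
fair_post = (20 + heads, 20 + flips - heads)  # Beta(42, 28)

# Posterior expected value of p is alpha / (alpha + beta)
flat_mean = flat_post[0] / sum(flat_post)   # leans strongly toward "unfair"
fair_mean = fair_post[0] / sum(fair_post)   # pulled back toward 0.5
```

The same streak of heads moves the fair-coin prior much less than the flat prior, illustrating how stronger prior beliefs require more evidence to shift.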