**Bayesian statistics** is an approach to statistics often contrasted with frequentist approaches.

As with frequentist statistical inference, Bayesian inference is concerned with estimating parameters from some observed data. However, whereas frequentist inference returns point estimates - that is, single values - for these parameters, Bayesian inference instead expresses these parameters themselves as probability distributions. This is intuitively appealing because we are *uncertain* about the parameters we've inferred; with Bayesian inference we can represent this uncertainty.

This is to say that in Bayesian inference, we don't assign an explicit value to an unknown parameter. Rather, we define it over a probability distribution as well: what values is the parameter *likely* to take on? That is, we treat the parameter itself as a random variable.

We may say, for instance, that an unknown parameter $\theta$ is drawn from some distribution governed by another parameter $\alpha$: $\theta \sim \pi(\theta|\alpha)$.

Here $\alpha$ is a **hyperparameter**; that is, it is a parameter for our parameter.

Fundamentally, this is Bayesian inference: $\pi(\theta|X) \propto \pi(\theta)L(\theta|X)$

Where the resulting distribution over the parameters, $\pi(\theta|X)$, is the **posterior distribution**.

So we must decide (specify) probability distributions for both the data sample and for the unknown parameters. These decisions involve making a lot of assumptions. Then you must compute a posterior distribution, which often cannot be calculated analytically - so other methods are used (such as simulations, described later).

From the posterior distribution, you can calculate point estimates, credible intervals, quantiles, and make predictions.

Finally, because of the assumptions which go into specifying the initial distributions, you must test your model and see if it fits the data and seems reasonable.

Thus Bayesian inference amounts to:

- Specifying a sampling model for the observed data $X$, conditioned on the unknown parameter $\theta$ (which we treat as a random variable), such that $X \sim f(X|\theta)$, where $f(X|\theta)$ is either the PDF or the PMF (as appropriate).
- Specifying a marginal (prior) distribution $\pi(\theta)$ for $\theta$ ("prior" for short): $\theta \sim \pi(\theta)$.
- From these, computing the posterior, that is, uncovering the distribution for $\theta$ given the observed data $X$, like so: $\pi(\theta|X) = \frac{\pi(\theta)L(\theta|X)}{\int \pi(\theta) L(\theta|X) d\theta}$, where $L(\theta|X)$, the likelihood of $\theta$ given $X$, is proportional to $f(X|\theta)$ viewed as a function of $\theta$. More often than not, the posterior must be approximated through Markov Chain Monte Carlo (detailed later).

For frequentists, probability is thought of in terms of frequencies, i.e. the probability of an event is the number of times it happened divided by the total number of times it could have happened.

In frequentist statistics, the observed data is considered random; if you gathered more observations they would be different according to the underlying distribution. The parameters of the model, however, are considered fixed.

For Bayesians, probability is belief or certainty about an event. Observed data is considered fixed, but the model parameters are random (uncertain) instead and considered to be drawn from some probability distribution.

Another way of phrasing this is that frequentists are concerned with uncertainty in the data, whereas Bayesians are concerned with uncertainty in the parameters.

In frequentist statistics, many different estimators may be used, but in Bayesian statistics the only estimator is Bayes' Formula (aka Bayes' Rule or Bayes' Theorem).

Bayes' Theorem, aka Bayes' Rule:

$P(H|D) = \frac{P(D|H)P(H)}{P(D)}$

$H$ is the hypothesis (more commonly represented as the parameters $\theta$); $D$ is the data.

- $P(H)$ = the probability of the hypothesis before seeing the data. The **prior**.
- $P(H|D)$ = the probability of the hypothesis, given the data. The **posterior**.
- $P(D|H)$ = the probability of the data under the hypothesis. The **likelihood**.
- $P(D)$ = the probability of the data under *any* hypothesis. The **normalizing constant**.

For an example of likelihood:

If I want to predict the number of sides of a die I rolled, and then I rolled an 8, then the likelihood of the hypothesis that the die is six-sided is 0, while the likelihood of the hypothesis that it is eight-sided is $P(D|H) = 1/8$.

A key insight to draw from Bayes' Rule is that the posterior combines the prior with the likelihood: the posterior is proportional to their product, $P(H|D) \propto P(D|H)P(H)$.

Note that the normalizing constant $P(D)$ requires summing (or integrating) the likelihood over *every* possible hypothesis, which is often difficult or intractable to compute.

One workaround is to do approximate inference with non-normalized posteriors, since we know that the posterior is proportional to the numerator term: $P(H|D) \propto P(D|H)P(H)$

Another workaround is to approximate the posterior using simulation methods such as Monte Carlo.

Given a set of hypotheses $H_1, \dots, H_n$, the distribution of their prior probabilities is the *prior distribution*, i.e. $P(H_1), \dots, P(H_n)$.

The distribution of the posterior probabilities is the *posterior distribution*, i.e. $P(H_1|D), \dots, P(H_n|D)$.

Likelihood is not the same as probability (thus it does not have to sum to 1), but it is *proportional* to probability. More specifically, the likelihood of a hypothesis $H$ given data $D$ is defined as $L(H|D) = K \cdot P(D|H)$, where $K$ is an arbitrary positive constant - that is, the likelihood is the probability of the data given the hypothesis, up to the constant $K$.

The **Law of Likelihood** states that the hypothesis for which the probability of the data is greater is the more likely hypothesis. For example, if $P(D|H_1) > P(D|H_2)$, then $H_1$ is more likely than $H_2$.

We can also quantify *how much* better one hypothesis is than another with the likelihood ratio $\frac{P(D|H_1)}{P(D|H_2)}$.

Likelihoods are meaningless in isolation (because of the arbitrary constant $K$), but likelihood *ratios* are meaningful, since the constant cancels out.

A **Bayes factor** is an extension of likelihood ratios: it is a weighted average likelihood ratio based on the prior distribution of hypotheses. So we have some prior bias as to what hypotheses we expect, i.e. how probable we expect some hypotheses to be, and we weigh the likelihood ratios by these expected probabilities.

With Bayesian inference, we must *choose* a prior distribution, then apply data to get our posterior distribution. The prior is chosen based on domain knowledge or intuition or perhaps from the results of previous analysis; that is, it is chosen subjectively - there is no prescribed formula for picking a prior. If you have no idea what to pick, you can just pick a uniform distribution as your prior.

Your choice of prior will affect the posterior that you get, and the subjectivity of this choice is what makes Bayesian statistics controversial - but it's worth noting that all of statistics, whether frequentist or Bayesian, involves many subjective decisions (e.g. frequentists must decide on an estimator to use, what data to collect and how, and so on) - what matters most is that you are explicit about your decisions and why you made them.

Say we perform a Bayesian analysis and get a posterior. Then we get some new data for the same problem. We can reuse the posterior from before as our prior, and when we run Bayesian analysis on the new data, we will get a new posterior which reflects the additional data. We don't have to redo any analysis on the data from before; all we need is the posterior generated from it.
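
As a minimal sketch of this sequential updating, using the beta-binomial conjugacy covered later (the counts here are invented for illustration):

```python
# Sequential updating sketch: a Beta(a, b) prior updated on binomial
# counts (the specific numbers here are made up for illustration).
a, b = 1, 1                              # flat Beta(1, 1) prior
heads1, n1 = 6, 10                       # first batch of data
a, b = a + heads1, b + (n1 - heads1)     # posterior: Beta(7, 5)

# New data arrives; reuse the previous posterior as the prior.
heads2, n2 = 30, 50
a, b = a + heads2, b + (n2 - heads2)     # posterior: Beta(37, 25)

# This matches analyzing all 60 flips at once under the original prior:
a_all = 1 + heads1 + heads2                   # 37
b_all = 1 + (n1 - heads1) + (n2 - heads2)     # 25
assert (a, b) == (a_all, b_all)
```

The equivalence in the final assertion is exactly why we can discard the old data once we have its posterior.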

For any unknown quantity we want to model, we say it is drawn from some prior of our choosing. This is usually some parameter describing a probability distribution, but it could be other values as well. This is central to Bayesian statistics - all unknowns are represented as distributions of possible values. In Bayesian statistics: if there's a value and you don't know what it is, come up with a prior for it and add it to your model!

If you think of distributions as landscapes or surfaces, then the data deforms the prior surface to mold it into the posterior distribution.

The surface's "resistance" to this shaping process depends on the selected prior distribution.

When it comes to selecting Bayesian priors, there are two broad categories:

- **objective priors** - these let the data influence the posterior the most
- **subjective priors** - these allow the practitioner to assert their own views in the prior. This prior can be the posterior from another problem or just come from domain knowledge.

An example objective prior is a **uniform** (flat) prior where every value has equal weighting. Using a uniform prior is called *The Principle of Indifference*. Note that a uniform prior restricted within a range is *not* objective - it has to be over *all* possibilities.

Note that the more data you have (as the number of observations $n \to \infty$), the less influence the prior has on the posterior.

**Conjugate priors** are priors which, when combined with the likelihood, result in a posterior which is in the same family. These are very convenient because the posterior can be calculated analytically, so there is no need to use approximation such as Markov Chain Monte Carlo (see below).

For example, a binomial likelihood is conjugate with a beta prior - their combination results in a beta posterior (this pairing is known as the beta-binomial model).

For example, the Gaussian family of distributions is conjugate to itself (*self-conjugate*) - a Gaussian likelihood with a Gaussian prior results in a Gaussian posterior.

For example, when working with count data you will probably use the Poisson distribution for your likelihood, which is conjugate with gamma distribution priors, resulting in a gamma posterior.

Unfortunately, conjugate priors only really show up in simple one-dimensional models.

More generally, we can define a conjugate prior like so:

Say random variable $X$ comes from a distribution $f(x|\theta)$ with unknown parameter $\theta$.

For the given distribution $f(x|\theta)$, a prior $\pi(\theta)$ is *conjugate* if the resulting posterior $\pi(\theta|x)$ belongs to the same family of distributions as $\pi(\theta)$.

The Beta-Binomial model is a useful Bayesian model because it provides values between 0 and 1, which is useful for estimating probabilities or percentages.

It involves, as you might expect, a beta and a binomial distribution.

So say we have $n$ independent Bernoulli trials with $k$ successes, where the probability of success $p$ is unknown.

For frequentist inference we'd estimate $p$ with the sample proportion $\hat{p} = k/n$.

This being Bayesian inference, we first must select a prior.

However, which prior should we choose?

We prefer a beta prior over a uniform prior because, given binomial observations, the posterior will also be a beta distribution.

It works out nicely mathematically: with a $\text{Beta}(\alpha, \beta)$ prior and $k$ successes observed in $n$ binomial trials, the posterior is simply $\text{Beta}(\alpha + k, \beta + n - k)$.

So with these two distributions, we can directly compute the posterior with no need for simulation (e.g. MCMC).

How do you choose the parameters for a beta prior? Well, it depends on the particular problem, but a conservative one, for when you don't have a whole lot of info to go on, is the **Jeffreys prior**, which for this model is $\text{Beta}(1/2, 1/2)$.

We run 100 trials and observe 10 successes. What is the probability $p$ of success?

Our knowns are $n = 100$ and $k = 10$.

For our prior for $p$ we use the Jeffreys prior, $\text{Beta}(1/2, 1/2)$.

We can directly compute the posterior now: $p|X \sim \text{Beta}(1/2 + 10, 1/2 + 90) = \text{Beta}(10.5, 90.5)$

Then we can draw samples from the distribution and compute its mean or other descriptive statistics such as the credible interval.
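
A sketch of that last step with NumPy, using the posterior from the example above (the credible interval here is a simple quantile-based one):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw samples from the posterior Beta(10.5, 90.5) from the example above.
samples = rng.beta(10.5, 90.5, size=100_000)

post_mean = samples.mean()   # should be close to the analytic mean 10.5/101
# A 95% credible interval from the samples' quantiles:
ci_low, ci_high = np.quantile(samples, [0.025, 0.975])
```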

The strength of the prior affects the posterior - the stronger your prior beliefs, the more difficult it is to change those beliefs (it requires more data/evidence). You can conduct **sensitivity analysis** to try your approach with various different priors to get an idea of how different priors affect your resulting posterior.

Empirical Bayes is a method which combines frequentist and Bayesian approaches by using frequentist methods to select the hyperparameters.

For instance, say you want to estimate the prior mean for an unknown parameter. You could use the empirical sample mean from the observed data, $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$, as the hyperparameter.

Though if you're working with little data, this ends up being something like double-counting your data.

With Bayesian inference, in order to describe your posterior, you often must evaluate complex multidimensional integrals (i.e. from very complex, multidimensional probability distributions), which can be computationally intractable.

Instead you can generate sample points from the posterior distribution and use those samples to compute whatever descriptions you need. This technique is called **Monte Carlo integration**, and the process of drawing repeated random samples in this way is called **Monte Carlo simulation**. In particular, we can use a family of techniques known as **Markov Chain Monte Carlo**, which combine Monte Carlo integration and simulation with Markov chains, to generate samples for us.

Monte Carlo integration is a way to approximate complex integrals using random number generation.

Say we have a complex integral: $\int h(x) \, dx$

If we can decompose $h(x)$ into the product of a function $f(x)$ and a probability density function $p(x)$, then: $\int h(x) \, dx = \int f(x)p(x) \, dx = E_{p(x)}[f(x)]$

That is, the result of this integral is the expected value of $f(x)$ over the density $p(x)$.

We can approximate this expected value by taking the mean of many, many samples ($n$ of them) drawn from $p(x)$: $\int h(x) \, dx \approx \frac{1}{n} \sum_{i=1}^{n} f(x_i)$

This process of approximating the integral is *Monte Carlo integration*.
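
A small sketch of this, choosing $p$ as the standard normal density and $f(x) = x^2$ (an illustrative choice, so the true value of the integral, $E[X^2] = 1$, is known):

```python
import numpy as np

rng = np.random.default_rng(0)

# Approximate the integral of f(x) p(x) with p the standard normal density
# and f(x) = x^2; the exact answer is E[X^2] = Var(X) = 1.
x = rng.standard_normal(1_000_000)   # samples from p(x)
estimate = (x ** 2).mean()           # Monte Carlo estimate of the integral
```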

For very simple cases of known distributions, we can sample directly, e.g.

```
import numpy as np

# Say we think the distribution is a Poisson distribution
# and the parameter of our distribution, lambda,
# is unknown and what we want to discover.
lam = 5

# Collect 100000 samples
sim_vals = np.random.poisson(lam, size=100000)

# Get whatever descriptions we want, e.g. the mean
mean = sim_vals.mean()

# For Poisson, the mean is lambda, so we expect them to be
# approximately equal (given a large enough sample size)
assert abs(lam - mean) < 0.1
```

A Markov chain is a stochastic process in which the next state depends only on the current state.

Consider a random variable $X$ whose state changes over time steps, giving a sequence $X_0, X_1, X_2, \dots$.

For a Markov chain, the state $X_{t+1}$ depends only on the current state $X_t$: $P(X_{t+1}|X_t, X_{t-1}, \dots, X_0) = P(X_{t+1}|X_t)$

Where $P(X_{t+1}|X_t)$ is the **transition probability** of moving from one state to the next; for discrete states, the set of these probabilities forms a **transition matrix**; more generally it is called a **transition kernel**.

If we run the chain long enough, it can converge to a **stationary distribution**, where the distribution over states no longer changes from step to step. Chains that do so are *ergodic*, i.e. they "mix", which means that the influence of the initial state weakens with time (the rate at which it mixes is its *mixing speed*).

If we call the stationary distribution $\pi$ and the transition matrix $P$, then the stationary distribution satisfies $\pi = \pi P$.
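
A toy illustration with a two-state chain (the transition matrix is an arbitrary example): repeatedly applying the transition matrix to any starting distribution converges to the stationary distribution.

```python
import numpy as np

# Transition matrix: P[i, j] = probability of moving from state i to state j.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

pi = np.array([1.0, 0.0])   # arbitrary initial distribution over states
for _ in range(1000):
    pi = pi @ P             # one step of the chain's dynamics

# pi has converged to the stationary distribution, which satisfies pi = pi P
assert np.allclose(pi, pi @ P)
# For this particular P, the stationary distribution is (5/6, 1/6)
```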

MCMC is useful because often we may encounter distributions which aren't easily expressed mathematically (e.g. their functions may have very strange shapes), but we still want to compute some descriptive statistics (or make other computations) from them. MCMC allows us to work with such distributions without needing precise mathematical formulations of them.

More generally, MCMC is really useful if you don't want to (or can't) find the underlying function describing something. As long as you can simulate that process in some way, you don't need to know the exact function - you can just generate enough sample data to work with in its stead. So MCMC is a brute force but effective method.

Rather than directly compute the integral for posterior distributions in Bayesian analysis, we can instead use MCMC to draw several (thousands, millions, etc.) samples from the probability distribution, then use these samples to compute whatever descriptions we'd like about the distribution (often this is some expected value of a function, $E[f(\theta)]$).

You start with some random initial sample and, based on that sample, you pick a new sample. This is the Markov Chain aspect of MCMC - the next sample you choose depends only on the current sample. This works out so that you spend most of your time with high-probability samples (because they have higher transition probabilities) but occasionally jump out to lower-probability samples. Eventually the MCMC chain will converge so that its samples come from the target distribution.

So we can take all these samples and use them to compute whatever descriptions of the distribution we need.

Because of the random initialization, there is a "**burn-in**" phase in which the sampling model needs to be "warmed up" until it reaches an equilibrium sampling state, the *stationary distribution*. So you discard the first hundred or thousand or so samples as part of this burn-in phase. You can (eventually) arrive at this stationary distribution *independent of where you started* which is why the random initialization is ok - this is an important feature of Markov Chains.

MCMC is a general technique of which there are several algorithms.

Monte Carlo integration allows us to draw samples from a posterior distribution with a known parametric form. It does not, however, enable us to draw samples from a posterior distribution without a known parametric form. We may instead use **rejection sampling** in such cases.

We can take our function $f$ and enclose it in a box over its *support* ("support" is the range of input values for which the function is nonzero), then uniformly sample points from the box, rejecting those that fall above the curve of $f$.

In the case of unbounded support (i.e. infinite tails), we instead choose some *majorizing* or *enveloping* function $cg(\theta)$ (where $g$ is a *proposal density* and $c$ a constant) such that $cg(\theta) \geq f(\theta)$ for all $\theta$.

Then, for each candidate $\theta_i$ drawn from $g$, we draw $u \sim \text{Uniform}(0,1)$ and accept the candidate if $u < \frac{f(\theta_i)}{cg(\theta_i)}$; otherwise we reject it.

The intuition here is that the probability of a given point being accepted is proportional to the function $f$'s value at that point.

In multidimensional cases, you draw candidates from every dimension simultaneously.

The **Metropolis-Hastings** algorithm uses Markov chains with rejection sampling.

The proposal density $g(\theta_t|\theta_{t-1})$ generates a candidate for the next value given the current one.

First select some initial value $\theta_0$.

Then for $t = 1, 2, \dots$:

- Draw a candidate $\theta_t^c \sim g(\theta_t|\theta_{t-1})$
- Compute the Metropolis-Hastings ratio: $R = \frac{f(\theta_t^c)g(\theta_{t-1}|\theta_t^c)}{f(\theta_{t-1})g(\theta_t^c|\theta_{t-1})}$
- Draw $u \sim \text{Uniform}(0,1)$
- If $u < R$, accept: $\theta_t = \theta_t^c$; otherwise, $\theta_t = \theta_{t-1}$
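
A minimal sketch of these steps for an assumed target: an unnormalized standard normal density with a symmetric Gaussian random-walk proposal (for which the $g$ terms in $R$ cancel):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(theta):
    # Unnormalized target density: a standard normal, chosen so the
    # sampler's output can be checked against the known answer
    return np.exp(-theta ** 2 / 2)

n = 50_000
chain = np.empty(n)
theta = 0.0                             # initial value theta_0
for t in range(n):
    candidate = theta + rng.normal()    # draw from g(.|theta): symmetric
    R = f(candidate) / f(theta)         # M-H ratio (the g terms cancel)
    if rng.uniform() < R:
        theta = candidate               # accept the candidate
    chain[t] = theta                    # otherwise keep the current value

burned = chain[1000:]                   # discard burn-in samples
# burned.mean() should be near 0 and burned.std() near 1 for this target
```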

There are a few required properties of the Markov chain for this to work properly:

- The stationary distribution of the chain must be the target density $f$ (the density we wish to approximate).
- The chain must be *recurrent* - that is, for all $\theta \in \Theta$ in the target density, the probability of the chain returning to any state $\theta_i \in \Theta$ is 1. That is, it must be possible *eventually* for any state in the state space to be revisited.
- The chain must be *non-null* for all $\theta \in \Theta$ in the target density; that is, the expected time to recurrence is finite.
- The chain must be *ergodic*, that is:
    - The chain must be *irreducible* - that is, any state $\theta_i$ can be reached from any other state $\theta_j$ in a finite number of transitions (i.e. the chain should not get stuck in any infinite loops).
    - The chain must be *aperiodic* - that is, there should not be a fixed number of transitions to get from any state $\theta_i$ to any state $\theta_j$. For instance, it should not always take three steps to get from one place to another - that would be a period. Another way of putting this: there are no fixed cycles in the chain.

It has been proven that the stationary distribution of the Metropolis-Hastings algorithm is the target density (proof omitted).

The ergodic property (whether or not the chain "mixes" well) can be validated with *convergence diagnostics*. A common method is to plot the chain's values as they're drawn and see if the values tend to concentrate around a constant; if not, you should try a different proposal density.

Alternatively, you can look at an autocorrelation plot, which measures the internal correlation (from -1 to 1) over time, called "lag". We expect that the greater the lag, the less the points should be autocorrelated - that is, we expect autocorrelation to smoothly decrease to 0 with increasing lag. If autocorrelation remains high, then the chain is not fully exploring the space. Autocorrelation can be improved by *thinning*, a technique in which only every $k$th sample is kept and the rest are discarded.

Finally, you also have the option of running multiple chains, each with different starting values, and combining their samples.

You should also use burn-in.

It is easy to sample from simple distributions. For example, for a binomial distribution, you can basically just flip a coin. For a multinomial distribution, you can basically just roll a die.

If you have a multivariate distribution over several variables, e.g. $x_1, \dots, x_n$, and the variables are independent, you can simply sample from each variable's distribution separately.

However - what if these aren't independent, and we want to sample from the *joint distribution* $P(x_1, \dots, x_n)$?

With Gibbs sampling we can approximate this joint distribution under the condition that we can easily sample from the conditional distribution of each variable given all of the others, i.e. $P(x_i|x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n)$.

We take advantage of this and iteratively sample from these conditional distributions, using the most recent value of each of the other variables (starting with random values at first). For example, we sample $x_1^{(t+1)} \sim P(x_1|x_2^{(t)}, \dots, x_n^{(t)})$, then $x_2^{(t+1)} \sim P(x_2|x_1^{(t+1)}, x_3^{(t)}, \dots, x_n^{(t)})$, and so on.

If you iterate through this a large number of times you get an approximation of samples taken from the actual joint distribution.

Another way to look at Gibbs sampling:

Say you have binary random variables $c$, $r$, and $t$ with the following (conditional) probability tables:

c | P(c) |
---|---|
0 | 0.5 |
1 | 0.5 |

c | r | P(r\|c) |
---|---|---|
0 | 0 | 0.9 |
0 | 1 | 0.1 |
1 | 0 | 0.1 |
1 | 1 | 0.9 |

c | r | t | P(t\|c,r) |
---|---|---|---|
0 | 0 | 0 | 0.9 |
0 | 0 | 1 | 0.1 |
0 | 1 | 0 | 0.5 |
0 | 1 | 1 | 0.5 |
1 | 0 | 0 | 0.6 |
1 | 0 | 1 | 0.4 |
1 | 1 | 0 | 0.1 |
1 | 1 | 1 | 0.9 |

We can first pick some starting sample, e.g. $c=1, r=1, t=1$.

Then we fix $r$ and $t$ and draw a new value of $c$ from its conditional distribution given $r=1$ and $t=1$.

Now we fix $c$ (at its new value) and $t$ and draw a new value of $r$.

Now we fix $c$ and $r$ and draw a new value of $t$, completing one iteration and yielding a new sample $(c, r, t)$.

Your samples will reflect the actual joint distribution of these values, since more likely samples are, well, more likely to be generated.
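
A sketch of this sampler in code, using the tables above; here each variable's full conditional is computed by enumeration from the joint (the variable names and structure are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

# Probability tables from the text (all variables are binary)
P_c = {0: 0.5, 1: 0.5}
P_r = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.9}       # key: (c, r)
P_t = {(0, 0, 0): 0.9, (0, 0, 1): 0.1, (0, 1, 0): 0.5, (0, 1, 1): 0.5,
       (1, 0, 0): 0.6, (1, 0, 1): 0.4, (1, 1, 0): 0.1, (1, 1, 1): 0.9}  # (c, r, t)

def joint(c, r, t):
    return P_c[c] * P_r[(c, r)] * P_t[(c, r, t)]

def sample_conditional(var, state):
    # Full conditional P(var | others) by enumeration: proportional to
    # the joint with var set to each of its possible values (0 or 1).
    w = [joint(**dict(state, **{var: v})) for v in (0, 1)]
    return int(rng.random() < w[1] / (w[0] + w[1]))

state = {'c': 1, 'r': 1, 't': 1}   # arbitrary starting sample
samples = []
for _ in range(20_000):
    for var in ('c', 'r', 't'):    # resample each variable in turn
        state[var] = sample_conditional(var, state)
    samples.append(dict(state))

# The true marginal P(c = 1) is 0.5 by construction; the estimate from the
# samples (after discarding some burn-in) should be close to that.
est = sum(s['c'] for s in samples[1000:]) / len(samples[1000:])
```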

MCMC can take a long time to get good answers - in theory if you run it infinitely it will generate enough samples to get a perfectly accurate distribution, but that's not a fair criterion (many algorithms do well if they have infinite time).

With *variational inference* we don't need to take samples - instead we fit an approximating distribution $q$ to the true posterior by minimizing the divergence (typically the KL divergence) between the two, turning inference into an optimization problem.

The *mean-field* form of variational inference assumes that $q$ factorizes over the parameters, i.e. $q(\theta) = \prod_i q_i(\theta_i)$, so each factor can be fit separately.

Bayesian inference returns a distribution (the posterior) but we often need a single value (or a vector in multivariate cases). So we choose a value from the posterior. This value is a **Bayesian point estimate**.

Selecting the MAP (*maximum a posteriori*) value is insufficient because it neglects the shape of the distribution.

Suppose $P(\theta|X)$ is the posterior distribution of $\theta$ after observing data $X$.

The *expected loss* of choosing estimate $\hat{\theta}$ (also known as the *risk* of estimate $\hat{\theta}$) is: $E_\theta[L(\theta, \hat{\theta})] = \int L(\theta, \hat{\theta}) P(\theta|X) \, d\theta$

Where $L(\theta, \hat{\theta})$ is a loss function, e.g. the squared-error loss $L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2$.

You can approximate the expected loss using the Law of Large Numbers, which states that as the sample size grows, the sample average approaches the expected value. That is, as $N \to \infty$, $\frac{1}{N}\sum_{i=1}^N z_i \to E[Z]$ for i.i.d. samples $z_i$ of $Z$.

For approximating expected loss, it looks like: $\frac{1}{N}\sum_{i=1}^N L(\theta_i, \hat{\theta}) \approx E_\theta[L(\theta, \hat{\theta})]$, where $\theta_1, \dots, \theta_N$ are samples drawn from the posterior.

You want to select the estimate $\hat{\theta}$ that minimizes this expected loss.
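
A sketch of this: approximate the expected loss of a grid of candidate estimates using posterior samples, then pick the minimizer (here squared-error loss and samples from the beta posterior of the earlier example; grid and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from a posterior (here the Beta(10.5, 90.5) from earlier)
posterior_samples = rng.beta(10.5, 90.5, size=50_000)

def expected_loss(estimate, samples):
    # Approximate E[L(theta, estimate)] with squared-error loss
    return ((samples - estimate) ** 2).mean()

# Evaluate a grid of candidate estimates and pick the minimizer
candidates = np.linspace(0, 1, 501)
losses = [expected_loss(c, posterior_samples) for c in candidates]
best = candidates[np.argmin(losses)]

# For squared-error loss, the minimizer is the posterior mean
```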

In Bayesian statistics, the closest analog to the frequentist confidence interval is the **credible interval**. It is *much* easier to interpret than the confidence interval because it is exactly what most people mistake the confidence interval to mean. For instance, the 95% credible interval is the interval in which we expect to find $\theta$ with 95% probability.

Mathematically this is expressed as: $P(l \leq \theta \leq u \mid X) = 0.95$

We condition on $X$ because the interval is computed from the posterior, i.e. it is conditional on the observed data.

The Bayesian methodology can be applied to regression as well. In conventional regression the parameters are treated as fixed values that we uncover. In Bayesian regression, the parameters are treated as random variables, as they are elsewhere in Bayesian statistics. We define prior distributions for each parameter - in particular, normal priors, so that for each parameter we define a prior mean as well as a covariance matrix for all the parameters.

So we specify:

- $b_0$ - a vector of prior means for the parameters
- $B_0$ - a matrix such that $\sigma^2 B_0$ is the prior covariance matrix of $\beta$
- $v_0 > 0$ - the degrees of freedom for the prior
- $\sigma_0^2 > 0$ - the variance for the prior (which essentially functions as your strength of belief in the prior - the lower the variance, the more concentrated your prior is around the mean, thus the stronger your belief)

So the prior for your parameters is then a normal distribution parameterized by $b_0$ and $\sigma^2 B_0$: $\beta \sim N(b_0, \sigma^2 B_0)$

Then the prior for $\sigma^2$ is parameterized by $v_0$ and $\sigma_0^2$ (a scaled inverse chi-squared distribution).

Then there are a few conjugate updating formulas for the posterior parameters (omitted here).

So the resulting (marginal) distribution of the parameters is a multivariate $t$ distribution.

Let's say we have a coin. We are uncertain whether or not it's a fair coin. What can we learn about the coin's fairness from a Bayesian approach?

Let's restate the problem. We can represent the outcome of a coin flip with a random variable $X$, where $X = 1$ is heads and $X = 0$ is tails.

It's reasonable to assume that $X \sim \text{Bernoulli}(p)$, where $p$ is the (unknown) probability of heads; the coin is fair if $p = 0.5$.

Thus we want to state our beliefs about $p$ *prior* to any evidence (i.e. decide on a prior to use). Because the beta distribution is the Bernoulli likelihood's *conjugate prior*, the posterior is analytically derived and requires no simulation.

The beta distribution is parameterized by $\alpha$ and $\beta$; with $\alpha = \beta = 1$ it is the uniform (flat) distribution over $[0, 1]$.

The posterior for a beta prior will not be derived here, but it is $\text{Beta}(\alpha + n_H, \beta + n_T)$, where $n_H$ and $n_T$ are the numbers of heads and tails observed.

Now we can flip the coin a few times to gather our evidence.

Below are some illustrations of possible evidence with the prior and the resulting posterior.

A few things to note here:

- When the evidence has even amounts of tails and heads, the posterior centers around $p=0.5$.
- When the evidence has even one tail, the possibility of $p=1$ drops to nothing.
- When the evidence has no tails, the posterior places more weight on an unfair coin, but there is still some possibility of $p=0.5$. As the amount of evidence increases, however, and still no tails show up, the posterior will have even more weight pushed towards $p=1$.
- When there is a lot of evidence containing even amounts of tails and heads, there is greater confidence that $p=0.5$ (that is, there's smaller variance around it).

What if instead of a flat prior, we had assumed that the coin was fair to begin with? In this scenario, the prior would be a beta distribution concentrated around $p=0.5$ (i.e. with larger, equal values of $\alpha$ and $\beta$).

Since our prior belief is stronger than it was with a flat prior, the same amount of evidence doesn't change the prior belief as much. For instance, now if we see a streak of heads, we are less convinced it is unfair.

In either case, we could take the expected value of $p$ under the posterior as a point estimate of the coin's bias.
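
A sketch contrasting the two priors on the same made-up evidence (9 heads, 1 tail), using the beta updating rule from above; the specific prior strengths are arbitrary choices:

```python
# Evidence: 9 heads, 1 tail (made-up numbers)
n_heads, n_tails = 9, 1

# Flat prior Beta(1, 1) vs. a fair-leaning prior Beta(10, 10)
a_flat, b_flat = 1 + n_heads, 1 + n_tails     # posterior Beta(10, 2)
a_fair, b_fair = 10 + n_heads, 10 + n_tails   # posterior Beta(19, 11)

# Expected value of p under each posterior (a Beta(a, b) mean is a / (a + b)):
flat_mean = a_flat / (a_flat + b_flat)   # 10/12: leans strongly unfair
fair_mean = a_fair / (a_fair + b_fair)   # 19/30: the stronger prior resists
```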

- POLS 506: Simple Bayesian Models. Justin Esarey.
- POLS 506: Basic Monte Carlo Procedures and Sampling. Justin Esarey.
- POLS 506: Metropolis-Hastings, the Gibbs Sampler, and MCMC. Justin Esarey.
- Markov chain Monte Carlo (MCMC) introduction. mathematicalmonk.
- Markov Chain Monte Carlo Without all the Bullshit. Jeremy Kun.
- http://homepages.dcc.ufmg.br/~assuncao/pgm/aulas2014/mcmc-gibbs-intro.pdf
- Markov Chain Monte Carlo and Gibbs Sampling. B. Walsh.
- Computational Methods in Bayesian Analysis. Chris Fonnesbeck.
- Think Bayes. Version 1.0.6. Allen Downey.
- Computational Statistics II (code). Chris Fonnesbeck. SciPy 2015.
- Bayesian Statistical Analysis. Chris Fonnesbeck. SciPy 2014.
- Probabilistic Programming and Bayesian Methods for Hackers. Cam Davidson Pilon.
- Frequentism and Bayesianism V: Model Selection. Jake Vanderplas.
- Understanding Bayes: A Look at the Likelihood. Alex Etz.
- High-Level Explanation of Variational Inference. Jason Eisner.
- Bayesian Deep Learning. Thomas Wiecki.
- A Tutorial on Variational Bayesian Inference. Charles Fox, Stephen Roberts.
- Variational Inference. David M. Blei.
- Probabilistic Programming Data Science with PyMC3. Thomas Wiecki.