$$ \def\argmax{\operatorname*{argmax}} \def\argmin{\operatorname*{argmin}} \DeclareMathOperator{\Var}{Var} \DeclareMathOperator{\Cov}{Cov} \DeclareMathOperator{\CDF}{CDF} \DeclareMathOperator{\df}{df} \DeclareMathOperator{\SE}{SE} \DeclareMathOperator{\SST}{SST} \DeclareMathOperator{\SSW}{SSW} \DeclareMathOperator{\SSB}{SSB} \DeclareMathOperator{\MSG}{MSG} \DeclareMathOperator{\SSG}{SSG} \DeclareMathOperator{\MSE}{MSE} \DeclareMathOperator{\SSE}{SSE} $$


Broadly, statistics is concerned with collecting and analyzing data. It seeks to describe rigorous methods for collecting data (samples), for describing the data, and for inferring conclusions from the data. There are processes out there in the world that generate observable data, but these processes are often black boxes and we want to gain some insight into how they work.

We can crudely cleave statistics into two main practices: descriptive statistics, which provides tools for describing data, and inferential statistics, which provides tools for learning (inferring or estimating) from data.

This section is focused on frequentist (or classical) statistics, which is distinguished from Bayesian statistics (covered in another chapter).


Descriptive Statistics

Descriptive statistics involves computing values which summarize a set of data. This typically includes statistics like the mean, standard deviation, median, min, max, etc, which are called summary statistics.

Scales of Measurement

In statistics, numbers and variables are categorized in certain ways.

Variables may be categorical (also called qualitative), in which they represent discrete values (numbers here are arbitrarily assigned to represent categories of qualities), or numerical (also called quantitative) in which they represent continuous values.

These variables are further categorized into scales of measurement.


The average of a set of data can be described as its central tendency; which gives some sense of a typical or common value for a variable. There are three types:

Arithmetic mean

Often just called the "mean" and notated $\mu$ (mu).

For a dataset $\{x_1, \dots, x_n\}$, the arithmetic mean is:

$$ \frac{\sum_{i=1}^n x_i}{n} $$

The mean can be sensitive to extreme values (outliers), which is one reason the median is sometimes used instead. Which is to say, the median is a more robust statistic (meaning that it is less sensitive to outliers).

Note that there are other types of means, but the arithmetic mean is by far the most common.


The central value in the dataset, e.g.

$$ 1 \; 1 \; \mathbf 2 \; 3 \; 4 $$
$$ \text{median} = 2 $$

If there are even number of values, you just take the value between the two central values:

$$ 1 \; 1 \; 2 \; 3 \; 4 \; 4 $$
$$ \text{median} = \frac{2+3}{2} = 2.5 $$


The most frequently occurring value in the dataset, e.g.

$$ 1 \; 2 \; 3 \; 3 \; 2 \; 3\; 4 \; 3 $$
$$ \text{mode} = 3 $$

Central tendencies for a distribution
Central tendencies for a distribution

Population vs Sample

With statistics we take a sample of a broader population or already have data which is a sample from a population. We use this limited sample in order to learn things about the whole population.

The mean of the population is denoted $\mu$ and consists of $N$ items, whereas the mean of the sample (i.e. the sample mean, sometimes called the empirical mean) is notated $\bar x$ or $\hat \mu$ and consists of $n$ items.

The sample mean is:

$$ \hat \mu = \frac{1}{n}\sum_i x^{(i)} $$

The sample variance is:

$$ \hat \sigma^2 = \frac{1}{n-1} \sum_i (x^{(i)} - \hat \mu)^2 $$

The sample covariance matrix is:

$$ \hat \Sigma = \frac{1}{n-1} \sum_i (x^{(i)} - \hat \mu)(x^{(t)} - \hat \mu)^T $$

These estimators are unbiased, i.e.:

$$ \begin{aligned} E[\hat \mu] &= \mu \\ E[\hat \sigma^2] &= \sigma^2 \\ E[\hat \Sigma] &= \Sigma \end{aligned} $$

Independent and Identically Distributed

Often in statistics we assume that a sample is independent and identically distributed (iid); that is; that the data points are independent from one another (the outcome of one has no influence over the outcome of any of the others) and that they share the same distribution.

We say that $X_1, \dots, X_n$ are iid if they are independent and drawn from the same distribution, that is $P(X_1) = \dots = P(X_n)$. This can also be stated:

$$ P(X_1, \dots, X_n) = \prod_i P(X_i) $$

In this case, they all share the same mean (expected value) and variance.

This assumption makes computing statistics for the sample much easier.

For instance, if a sample was not identically distributed, each datapoint might come from a different distribution, in which case there are different means and variances for each datapoint which must be computed from each of those datapoints alone. They can't really be treated as a group since the datapoints aren't quite equivalent to each other (in a statistical sense).

Or, if the sample was not independent, then we lose all the conveniences that come with independence.

The IID assumption doesn't always hold (i.e. it may be violated), of course, so there are other ways of approaching such situations while minimizing complexity, such as Hidden Markov Models.

The Law of Large Numbers (LLN)

Let $X_1, \dots X_n$ be iid with mean $\mu$.

The law of large numbers essentially states that as a sample size approaches infinity, its mean will approach the population ("true") mean:

$$ \lim_{n \to \infty} \frac{1}{n} \sum^n_{i=1} X_i = \mu $$

Regression to the mean

$P(Y < x|X = x)$ gets bigger as $x$ approaches very large values. That is, given a very large $X$ (an extreme), the chance that $Y$'s value is as large as or larger than $X$ is unlikely.

$P(Y > x|X = x)$ gets bigger as $x$ approaches very small values. That is, given a very small $X$ (an extreme), the chance that $Y$'s value is as small as or smaller than $X$ is unlikely.

Central Limit Theorem (CLT)

Say you have a set of data. Even if the distribution of that data is not normal, you can divide the data into groups (samples) and then average the values of those groups. Those averages will approach the form of a normal curve as you increase the size of those groups (i.e. increase the sample size).

Let $X_1, \dots X_n$ be iid with mean $\mu$ and variance $\sigma^2$.

Then the central limit theorem can be formalized as:

$$ \sqrt \frac{n}{\sigma^2} ((\frac{1}{n} \sum^n_{i=1} X_i) - \mu) \xrightarrow{D} N(0,1) $$

That is, the left side converges in distribution to a normal distribution with mean 0 and variance 1 as $n$ increases.

Dispersion (Variance and Standard Deviation)

Dispersion is the "spread" of a distribution - how spread it out its values are.

The main measures of dispersion are the variance and the standard deviation.

Standard Deviation

Standard deviation is represented by $\sigma$ (sigma) and describes the variation from the mean ($\mu$, i.e. the expected value), calculated:

$$ \sigma = \sqrt{E[(X-\mu)^2]} = \sqrt{E[X^2] - (E[X])^2} $$


The square of the standard deviation, that is, $E[(X-\mu)^2]$, is the variance of $X$, usually notated $\sigma^2$. It can also be written:

$$ \Var(X) = \sigma^2 = E(X^2) - E(X)^2 = \frac{\sum_{i=1}^N(x_i - \mu^2)}{N} $$

For a population of size $N$ of datapoints $x$.

That is, variance is the difference between the square of the inputs and the square of the expected value.

Coefficient of variation (CV)

Variance depends on the units of measurement, but this can be controlled for by computing the coefficient of variation:

$$ CV = \frac{\sigma}{\bar x} \times 100% $$

This allows us to compare variability across variables measured in different units.

Variance of a linear combination of random variables

The variance of a linear combination of (independent) random variables, e.g. $aX + bY$, can be computed:

$$ \Var(\beta_1 X_1 + \dots + \beta_n X_n) = \beta_1^2 \Var(X_1) + \dots + \beta_n^2 \Var(X_n) $$


The range can also be used to get a sense of dispersion. The range is the difference between the highest and lowest values, but very sensitive to outliers.

As an alternative to the range, you can look at the interquartile range, which is the range of the middle 50% of the values (that is, the difference of the 75th and 25th percentile values). This is less sensitive to outliers.

Z score

A Z score is just the number of standard deviations a value is from the mean. It is defined:

$$ Z = \frac{x-\mu}{\sigma} $$

The Empirical Rule

The empirical rule describes that, for a normal distribution, there is:

The Empirical Rule
The Empirical Rule

Pooled standard deviation estimates

If you have reason to expect that the standard deviations of two populations are practically identical, you can use the pooled standard deviation of the two groups to obtain a more accurate estimate of the standard deviation and standard error:

$$ s^2_{\text{pooled}} = \frac{s_1^2(n_1 - 1) + s_2^2 (n_2 -1)}{n_1 + n_2 -2} $$

Where $n_1, n_2, s_1, s_2$ are the sample sizes and standard deviations of the sample groups. We must update the degrees of freedom as well, $\df = n_1 + n_2 - 2$, which we can use for a new t-distribution.


The $k$th moment, $m_k(X)$, where $k \in {1,2,3,\dots}$ (i.e. it is a positive integer), is $E[X^k]$. So the first moment is just the mean, $E[X]$.

The $k$th central moment is $E[(X-E[X])^k]$. So the second central moment is just the variance, $E[(X-E[X])^2]$.

The third moment is the skewness, and the fourth moment is the kurtosis; they all share the same form (with different normalization terms):

$$ \begin{aligned} \text{skewness} &= E[(X - \mu)^3 \frac{1}{\sigma^3}] \\ \text{kurtosis} &= E[(X - \mu)^4] \frac{1}{\sigma^4} \end{aligned} $$

Moments have different units, e.g. the first moment might be in meters ($m$), the second moment would be $m^2$, and so on, so it is typical to standardize moments by taking their $k$-th root, e.g. $\sqrt{m^2}$


The covariance describes the variance between two random variables.

For random variables $x$ and $y$, the covariance is (remember, $E(x)$ denotes the expected value of random variable $x$):

$$ \Cov(x, y) = E[(x-E(x))(y-E(y))] = \frac{1}{n} \sum (x_i - \bar x)(y_i - \bar y)$$

There must the same number of values $n$ for each.

This is simplified to:

$$ \Cov(x,y) = E[xy] - E[y]E[x] \approx \bar{xy} - \bar{y}\bar{x} $$

A positive covariance means that as $x$ goes up, $y$ goes up.
A negative covariance means that as $x$ goes up, $y$ goes down.

Note that variance is just the covariance of a random variable with itself:

$$ \Var(X) = E(XX) - E(X)E(X) = \Cov(X, X) $$


Correlation gives us a measure of relatedness between two variables. Alone it does not imply causation, but it can help guide more formal inquiries (e.g. experiments) into causal relationships.

A good way to visually intuit correlation is through scatterplots.

We can measure correlation with correlation coefficients. These measure the strength and sign of a relationship (but not the slope, linear regression, detailed later, does that).

Some of the more common correlation coefficients include:

The Pearson and Spearman coefficients are the most commonly used ones, but sometimes the later two are used in special cases (e.g. with categorical data).

Pearson product-moment correlation coefficient (Pearson's correlation)

$$ r = \frac{\sum^n_{i=1}(\frac{x_i - \bar x}{s_x})(\frac{y_i - \bar y}{s_y})}{n-1} $$

Note: this is sometimes denoted as a capital $R$

You may recognize this as:

$$ \frac{\Cov(X,Y)}{S_X S_Y} $$

Here we convert our values to standard scores, i.e. $\frac{x_i - \bar x}{s_x}$. This standardizes the values such that their mean is 0 and their variance is 1 (and so they are unitless)

For a population, $r$ is notated $\rho$ (rho).

This value can range from $[-1, 1]$, where 1 and -1 mean complete correlation and 0 means no correlation.

To test the statistical significance of the Pearson correlation coefficient, you can use the $t$ statistic.

For instance, if you believe there is a relationship between two variables, you set your null hypothesis as $\rho = 0$ and then with your estimate of $r$, calculate the $t$ statistic:

$$ t = \frac{r}{\sqrt{\frac{1-r^2}{n-2}}} $$

Then look up the value in a $t$ table.

The Pearson correlation coefficient tells you the strength and direction of a relationship, but it doesn't tell you how much variance of one variable is explained by the other.

For that, you can use the coefficient of determination which is just $r^2$. So for instance, if you have $r=0.9$, then $r^2 = 0.81$ which means 81% of the variation of one variable is explained by the other.

Note that Pearson's correlation only accurately measures linear relationships; so even if you have a Pearson correlation near 0, it is still possible that there may be a strong nonlinear relationship. It's worthwhile to look at a scatter plot to verify.

It is also not robust in the presence of outliers.

Spearman rank-order correlation coefficient (Spearman's rank correlation)

Here you compute ranks (i.e. the indices in the sorted sample) rather than standard scores.

For example, for the dataset $[1,10,100]$, the rank of the value $10$ is $2$ because it is second in the sorted list.

Then you can compute the Spearman correlation:

$$ r_s = 1 - \frac{(6 \sum d^2)}{n(n^2 - 1)} $$

Where $d$ is the difference in ranks for each datapoint.

Generally, you can interpret $r_s$ in the following ways:

You can test its statistical significance using a $z$ test, where the null hypothesis is that $r_s=0$.

$$ z = r_s \sqrt{n-1} $$

Spearman's correlation is more robust to outliers and skewed distributions.

Point-Biserial correlation coefficient

This correlation coefficient is useful when comparing a categorical (binary) variable with an interval or ratio scale variable:

$$ r_{pbi} = \frac{M_p - M_q}{S_t} \sqrt{pq} $$

Where $M_p$ is the mean for the datapoints categorized as 1 and $M_q$ is the mean for the datapoints categorized as 0.
$S_t$ is the standard deviation for the interval/ratio variable, $p$ is the proportion of datapoints categorized as 1, and $q$ is the proportion of datapoints categorized as 0.

Phi correlation coefficient

This allows you to measure the correlation between two categorical (binary) variables.

It is calculated like so:

$$ \begin{aligned} A &= f(0,1) \\ B &= f(1,1) \\ C &= f(0,0) \\ D &= f(1,0) \\ r_{\phi} &= \frac{AD-BC}{\sqrt{(A+B)(C+D)(A+C)(B+D)}} \end{aligned} $$

Where $f(a,b)$ is the frequency of label $a$ and label $b$ occurring together in the data.

Degrees of Freedom

Degrees of freedom describes the number of variables that are "free" in what value they can take. Often a given variable must be a particular value because of the values the other variables take on and some constraint(s).

For example: say we have four unknown quantities $x_1, x_2, x_3, x_4$. We know that their mean is 5. In this case we have three degrees of freedom - this is because three of the variables are free to take arbitrary values, but once those three are set, the fourth value must be equal to $x_4 = 20 - x_1 - x_2 - x_3$ in order for the mean to be 5 (that is, in order to satisfy the constraint). So for instance, if $x_1=2, x_2=4, x_3=6$, then $x_4$ must equal 8. It is not "free" to take on any other value.

Time Series Analysis

Often data has a temporal component; e.g. you are looking for patterns over time.

Generally, time series data may have the following parts: a trend, which is some function reflecting persistent changes, seasonality; that is, periodic variation, and of course there is going to be some noise - random variation - as well.

Moving averages

To extract a trend from a series, you can use regression, but sometimes you will be better off with some kind of moving average. This divides the series into overlapping regions, windows, of some size, and takes the averages of each window. The rolling mean just takes the mean of each window. There is also the exponentially-weighted moving average (EWMA) which gives a weighted average, such that more recent values have the highest weight, and values before that have weights which drop off exponentially. The EWMA takes an additional span parameter which determines how fast the weights drop off.

Serial correlation (autocorrelation)

In time series data you may expect to see patterns. For example, if a value is low, it may stay low for a bit, if it's high, it may stay high for a bit. These types of patterns are serial correlations, also called autocorrelation (so-called because it is correlated a dataset with itself, in some sense), because the values correlate in their sequence.

You can compute serial correlation by shifting the time series by some interval, called a lag, and then compute the correlation of the shifted series with the original, unshifted series.

Survival Analysis

Survival analysis describes how long something lasts. It can refer to the survival of, for instance, a person - in the context of disease, a 5-year survival rate is the probability of surviving 5 years after diagnosis, for example - or a mechanical component, and so on. More broadly it can be seen as looking at how long something lasts until something happens - for instance, how long until someone gets married.

A survival curve is a function $S(t)$ which computes the probability of surviving longer than duration $t$. Such a duration is called a lifetime.

The survival curve ends up just being the complement of the CDF:

$$ S(t) = 1 - \CDF(t) $$

Looking at it this way, the CDF is the probability of a lifetime less than or equal to $t$.

Hazard function

A hazard function tells you the fraction of cases that continue until $t$ and then end at $t$. It can be computed from the survival curve:

$$ \lambda(t) = \frac{S(t) - S(t+1)}{S(t)} $$

Hazard functions are also used for estimating survival curves.

Estimating survival curves: Kaplan-Meier estimation

Often we do not have the CDF of lifetimes so we can't easily compute the survival curve. We often have non-survival cases alongside have survival cases, where we don't yet know what their final lifetime will be. Often, as is the case in the medical context, we don't want to wait to learn what these unknown lifetimes will be. So we need to estimate the survival curve with the data we do have.

The Kaplan-Meier estimation allows us to do this. We can use the data we have to estimate the hazard function, and then convert that into a survival curve.

We can convert a hazard function into an estimate of the survival curve, where each point at time $t$ is computed by taking the product of complementary hazard functions through that time $t$, like so:

$$ \prod_t (1 - \lambda(t)) $$

Inferential Statistics

Statistical inference is the practice of using statistics to infer some conclusion about a population based on only a sample of that population. This can be the population's distribution - we want to infer from the sample data what the "true" distribution (the population distribution) is and the unknown parameters that define it.

Generally, data is generated by some process; this data-generating process is also noisy; that is, there is a relatively small degree of imprecision or fluctuation in values due to randomness. In inferential statistics, we try to uncover the particular function that describes this process as closely as possible. We do so by choosing a model (e.g. if we believe it can be modeled linearly, we might choose linear regression, otherwise we might choose a different kind of model such as a probability distribution; modeling is covered in greater detail in the machine learning part). Once we have chosen the model, then we need to determine the parameters (linear coefficients, for example, or mean and variance for a probability distribution) for that model.

Broadly, the two paradigms of inference are frequentist, which relies on long-run repetitions of an event, that is, it is empirical (and could be termed the "conventional" or "traditional" framework, though there's a lot of focus on Bayesian inference now) and Bayesian, which is about generating a hypothesis distribution (the prior) and updating it as more evidence is acquired. Bayesian inference is valuable because there are many events which we cannot repeat, but we still want to learn something about.

The frequentist believes these unknown parameters have precise "true" values which can be (approximately) uncovered. In frequentist statistics, we can estimate these exact values. When we estimate a single value for an unknown, that estimation is called a point estimate. This is in contrast to describing a value estimate as a probability distribution, which is the Bayesian method. The Bayesian believes that we cannot express these parameters as single values and we should rather describe them as a distributions of possible values to be explicit about their uncertainty.

Here we focus on frequentist inference; Bayesian inference is covered in a later chapter.

In frequentist statistics, the factor of noise means that we may see relationships (and thus come up with non-zero parameters) where they don't exist, just because of the random noise. This is what p-values are meant to compensate for - if the relationship truly did not exist, what's the probability, given the data, that we'd see the non-zero parameter estimate that we computed? Generally if this probability is less than 0.05 (i.e. $p < 0.05$) then we accept the result.

Often with statistical inference you are trying to quantify some difference between groups (which can be framed as measuring an effect size) or testing if some data supports or refutes some hypothesis, and then trying to determine whether or not this difference or effect can be attributed to chance (this is covered in the section on experimental statistics).

A word of caution: many statistical tools work only under certain conditions, e.g. assumptions of independence, or for a particular distribution, or a large enough sample size, or lack of skew, and so on - so before applying statistical methods and drawing conclusions, make sure the tools are appropriate for the data. And of course you must always be cautious of potential biases involved in the data collection process.


Dealing with error is a big part of statistics and some error is unavoidable (noise is natural).

There are three kinds of error:

We never know the true value of something, only what we observe by imprecise means, so we always must grapple with error.

Estimates and estimators

We can think of the population as representing the underlying data generating process and consider these parameters as functions of the population. To estimate these parameters from the sample data, we use estimators, which are functions of the sample data that return an estimate for some unknown value. Essentially, any statistic is an estimator. For instance, we may estimate the population mean by using the sample mean as our estimator. Or we may estimate the population variance as the sample variance. And so on.


Estimators may be biased for small sample sizes; that is, it tends to have more error for small sample sizes.

Say we are estimating a parameter $\theta$. The bias of an estimator $\hat \theta_m$ is:

$$ \text{bias}(\hat \theta+m) = E[\hat \theta_m] - \theta $$

Where $m$ is the number of samples.

There are unbiased estimators as well, which have an expected mean error (against the population parameter) of 0. That is, $\text{bias}(\hat \theta) = 0$, which can also be stated as $E[\hat \theta] = \theta$.

For example, an unbiased estimator for population variance $\sigma^2$ is:

$$ \frac{1}{n-1} \sum(x_i - \bar x)^2 $$

An estimator may be asymptotically unbiased if $\lim_{m \to \infty} \text{bias}(\hat \theta_m) = $, that is, if $\lim_{m \to \infty} E[\hat \theta_m] = 0$.

Generally, unbiased estimators are preferred, but sometimes biased estimators have other properties which make them useful.

For an estimate, we can measure its standard error (SE), which describes how much we expect the estimate to be off by, on average. It can also be stated as:

$$ \SE (\hat \theta) = \sqrt{\Var(\hat \theta)} $$

"Standard error" sometimes refers to the standard error of the mean, which is the standard deviation of the mean:

$$ \SE (\hat \mu_m) = \sqrt{\Var(\frac{1}{m}\sum_{i=1}^m x^{(i)})} = \frac{\sigma}{\sqrt m} $$

Much of statistical inference is concerned with measuring the quality of these estimates.


When we used a biased estimator, we generally still want our point estimates to converge to the true value of the parameter. This property is called consistency. For some error $\epsilon > 0$, we want $\lim_{m \to \infty} P(|\theta_m - \theta| > \epsilon) = 0$. That is, as we increase our sample size, we want the probability of the absolute difference between the estimate and the true value being greater than $\epsilon$ to approach 0.

Point Estimation

Given an unknown population parameter, we may want to estimate a single value for it - this estimate is called a point estimate. Ideally, the estimate is as close to the true value as possible.

The estimation formula (the function which yields an estimate) is called an estimator and is a random variable (so there is some underlying distribution). A particular value of the estimator is the estimate.

A simple example: we have a series of trials with some number of successes. We want an estimate for the probability of success of the event we looked at. Here an obvious estimate is be the number of successes over the total number of trials, so our estimator would be $\frac{x}{N}$ and - say we had 40 successes out of 100 trials - our estimate would be $0.4$.

We consider a "good" estimator one whose distribution is concentrated as closely as possible around the parameter's true value (that is, it has a small variance). Generally this becomes the case as more data is collected.

We can take multiple samples (of a fixed size) from a population and compute a point estimate (e.g. for the mean) from each. Then we can consider the distribution of these point estimates - this distribution is called a sampling distribution. The standard deviation of the sampling distribution describes the typical error of a point estimate, so this standard deviation is known as the standard error (SE) of the estimate.

Alternatively, if you have only one sample, the standard error of the sample mean $\bar x$ can be computed (where $n$ is the size of the sample):

$$ \SE_{\bar x} = \sigma_{\bar x} = \frac{\sigma_x}{\sqrt n} $$

This however requires the population standard deviation, $\sigma_x$, which probably isn't known - but we can also use a point estimate for that as well; that is, you can just use $s$, the sample standard deviation, instead (provided that the sample size is at least 30, as a rule of thumb, and the population distribution is not strongly skewed).

Also remember that the distribution of sample means approximates a normal distribution, with better approximation as sample size increases, as described by the central limit theorem. Some other point estimates' sampling distribution also approximate a normal distribution. Such point estimates are called normal point estimates.

There are other such computations for the standard error of other estimates as well.

We say a point estimate is unbiased if the sampling distribution of the estimate is centered at the parameter it estimates.

Nuisance Parameters

Nuisance parameters are values we are not directly interested in, but still need to be dealt with in order to get at what we are interested in.

Confidence Intervals

Rather than provide a single value estimate of a population parameter, that is, a point estimate, it can be better to provide a range of values for the estimate instead. This range of values is a confidence interval. The confidence interval is the range of values where an estimate is likely to fall with some percent probability.

Confidence intervals are expressed in percentages, e.g. the "95% confidence interval", which describes the plausibility that the parameter is in that interval. It does not imply a probability (that is, it does not mean that the true parameter has a 95% chance of being in that interval), however. Rather, the 95% confidence interval is the range of values in which, over repeated experimentation, in 95% of the experiments, that confidence interval will contain the true value. To put it another way, for the 95% confidence interval, out of every 100 experiments, at least 95 of their confidence intervals will contain the true parameter value. You would say "We are 95% confident the population parameter is in this interval".

Confidence intervals are a tool for frequentist statistics, and in frequentist statistics, unknown parameters are considered fixed (we don't express them in terms of probability as we do in Bayesian statistics). So we do not associate a probability with the parameter. Rather, the confidence interval itself is the random variable, not the parameter. To put it another way, we are saying that 95% of the intervals we would generate from repeated experimentation would contain the real parameter - but we aren't saying anything about the parameter's value changing, just that the intervals will vary across experiments.

The mathematical definition of the 95% confidence interval is (where $\theta$ is the unknown parameter):

$$ P(a(Y) < \theta < b(Y)|\theta) = 0.95 $$

Where $a, b$ are the endpoints of the interval, calculated according to the sampling distribution of $Y$. We condition on $\theta$ because, as just mentioned, in frequentist statistics, the parameters are fixed and the data $Y$ is random.

We can compute the 95% confidence interval by taking the point estimate (which is the best estimate for the value) and $\pm 2\SE$, that is build an interval within two standard errors of the point estimate. The interval we add or subtract to the point estimate (here it is $2\SE$) is called the margin of error. The value we multiply the SE with is essentially a Z score, so we can more generally describe the margin of error as $z\SE$.

For the confidence interval of the mean, we can be more precise and look within $\pm 1.96\SE$ (that is, $z=1.96$) of the point estimate $\bar x$ for the 95% confidence interval (that is, our 95% confidence interval would be $\bar x \pm 1.96\SE$). This is because we know to the sampling distribution for the sample means approximates a normal distribution (for sufficiently large sample sizes, $n \geq 30$ is a rule of thumb) according to the central limit theorem.

Kernel Density Estimates

Sometimes we don't want the parameters of our data's distribution, but just a smoothed representation of it. Kernel density estimation allows us to get this representation. It is a nonparametric method because it makes no assumptions about the form of the underlying distribution (i.e. no assumptions about its parameters).

Kernel Density Estimation example
Kernel Density Estimation example

Some kernel function (which generates symmetric densities) is applied to each data point, then the density estimate is formed by summing the densities. The kernel function determines the shape of these densities and the bandwidth parameter, $h > 0$, determines their spread and the smoothing of the estimate. Typically, a Gaussian kernel function is used, so the bandwidth is equivalent to the variance.

KDE bandwidth comparisons
KDE bandwidth comparisons

In this figure, the grey curve is the true density, the red curve is the KDE with $h = 0.05$, the black curve is the KDE with $h = 0.337$, and the green curve is the KDE with $h = 2$.

Experimental Statistics

Experimental statistics is concerned with hypothesis testing, where you have a hypothesis and want to learn if your data supports it. That is, you have some sample data and an apparent effect, and you want to know if there is any reason to believe that the effect is genuine and not just by chance.

Often you are comparing two or more groups; more specifically, you are typically comparing statistics across these groups, such as their means. For example, you want to see if the difference of their means is statistically significant; which is to say, likely that it is a real effect and not just chance.

The "classical" approach to hypothesis testing, null hypothesis significance testing (NHST), follows this general structure:

  1. Quantify the size of the apparent effect by choosing some test statistic, which is just a summary statistic which is useful for hypothesis testing or identifying p-values. For example, if you have two populations you're looking at, this could be the difference in means (of whatever you are measuring) between the two groups.
  2. Define a null hypothesis, which is usually that the apparent effect is not real.
  3. Compute a p-value, which is the probability of seeing the effect if the null hypothesis is true.
  4. Determine the statistical significance of the result. The lower the p-value, the more significant the result is, since the less likely it is to have just occurred by chance.

Broadly speaking, there are two types of scientific studies: observational and experimental.

In observational studies, the research cannot interfere while recording data; as the name implies, the involvement is merely as an observer.

Experimental studies, however, are deliberately structured and executed. They must be designed to minimize error, both at a low level (e.g. imprecise instruments or measurements) and at a high-level (e.g. researcher biases).

Statistical Power

The power of a study is the likelihood that it will distinguish an effect of a certain size from pure luck. - Statistical power and underpowered statistics, Alex Reinhart

Statistical power, sometimes called sensitivity, can be defined as the probability of rejecting the null hypothesis when it is false.

If $\beta$ is the probability of a type II error (i.e. failing to reject the null hypothesis when it's false), then $\text{power} = 1 - \beta$.


Sample Selection

Bias can enter studies primarily in two ways:

To prevent selection bias (selecting samples in such a way that it encourages a particular outcome, whether done consciously or not), sample selection may be random.

In the case of medical trials and similar studies, random allocation is ideally double blind, so that neither the patient nor the researchers know which treatment a patient is receiving.

Another sample selection technique is stratified sampling, in which the population is divided into categories (e.g. male and female) and samples are selected from those subgroups. If the variable used for stratification is strongly related to the variable being studied, there may be better accuracy from the sample size.

You need large sample sizes because with small sample sizes, you're more sensitive to the effects of chance. e.g. if I flip a coin 10 times, it's feasible that I get heads 6/10 times (60% of the time). With that result I couldn't conclusively say whether or not that coin is rigged. If I flip that coin 1000 times, it's extremely unlikely that I will get heads 60% of the time (600/1000 times) if it were a fair coin.

Sometimes to increase sample size, a researcher may use a technique called "replication", which is simply repeating the measurements with new samples. but some researchers really only "pseudoreplicate". samples should be as independent from each other as possible - otherwise you have too many confounding factors. in medical research, researchers may sample a single patient multiple times, every week for instance, and treat each week's sample as a distinct sample. this is pseudoreplication - you begin to inflate other factors particular to that patient in your results. another example is - say you wanted to measure pH levels in soil samples across the US. well, you cant sample soil 15ft from each other because they are too dependent on each other:


Operationalization is the practice of coming up with some way of measuring something which cannot be directly measured, such as intelligence. This may be accomplished via proxy measurements.

The Null Hypothesis

In an experiment, the null hypothesis, notated $H_0$, is the "status quo". For example, in testing whether or not a drug has an impact on a disease, the null hypothesis would be that the drug has no effect.

When running an experiment, you do it under the assumption that the null hypothesis is true. Then you ask: what's the probability of getting the results you got, assuming the null hypothesis is true? If that probability is very small, the null hypothesis is likely false. This probability - of getting your results if the null hypothesis were true - is called the P value.

Type 1 Errors

A type 1 error is one where the null hypothesis is rejected, even though it is true.

Type 1 errors are usually presented as a probability of them occurring, e.g. a "0.5% chance of a type 1 error" or a "type 1 error with probability of 0.01".

P Values

P values are central to null hypothesis significance testing (NHST), but they are commonly misunderstood.

P values do not:

There's no mathematical tool to tell you if your hypothesis is true; you can only see whether it is consistent with the data, and if the data is sparse or unclear, your conclusions are uncertain. - Statistics Done Wrong, Alex Reinhart

So what is it then? The P value is the probability of seeing your results or data if the null hypothesis were true.

That is, given data $D$ and a hypothesis $H$, where $H_0$ is the null hypothesis, the P value is merely:

$$ P(D|H_0) $$

If instead we want to find the probability of our hypothesis given the data, that is, $P(H|D)$, we have to use Bayesian inference instead:

$$ P(H \, | \, D) = \frac{P(D \, | \, H)P(H)}{P(D \, | \, H)P(H) + P(D\, | \, \lnot H)P(\not H)} $$

Note that P values are problematic when testing multiple hypotheses (multiple testing or multiple comparisons) because any "significant" results (as determined by P value comparisons, e.g. $p < 0.5$) may be deceptively so, since that result may still have just been chance, as the following comic illustrates. That is, the more significance tests you conduct, the more likely you will make a Type 1 Error.

xkcd - "Significant"
xkcd - "Significant"

In this comic, 20 hypotheses are tested, so with a significance level at 5%, it's expected that at least one of those tests will come out significant by chance. In the real world this may be problematic in that multiple research groups may be testing the same hypothesis and chances may be such that one of them gets significant results.

The Base Rate Fallacy

A very important shortcoming to be aware of is the base rate fallacy. A P value cannot be considered in isolation. The base rate of whatever occurrence you are looking at must also be taken into account. Say you are testing 100 treatments for a disease, and it's a very difficult disease to treat, so there's a low chance (say 1%) that a treatment will actually be successful. This is your base rate. A low base rate means a higher probability of false positives - treatments which, during the course of your testing, may appear to be successful but are in reality not (i.e. their success was a fluke). A good example is the mammogram test example (see The p value and the base rate fallacy).

A p value is calculated under the assumption that the medication does not work and tells us the probability of obtaining the data we did, or data more extreme than it. It does not tell us the chance the medication is effective. (The p value and the base rate fallacy, Alex Reinhart)

False Discovery Rate

The false discovery rate is the expected proportion of false positives (Type 1 errors) amongst hypothesis tests.

For example, if we have a maximum FDR of 0.10 and we have 1000 observations which seem to indicate a significant hypothesis, then we can expect 100 of those observations to be false positives.

The q value for an individual hypothesis is the minimum FDR at which the test may be called significant.

Say you run multiple comparisons and have the following values:

We can calculate the FDR as:

$$ \text{FDR} = E[\frac{V}{V+S}] = E[\frac{V}{R}] $$

Note that $\frac{V}{R} = 0$ if $R = 0$.

Alpha Level

The value that you select to compare the p-value to, e.g. 0.5 in the comic, is the alpha level $\bar \alpha$, also called the significance level, of an experiment. Your alpha level should be selected according to the number of tests you'll be conducting in an experiment.

There are some approaches to help adjust the alpha level.

The Bonferroni Correction

The highly conservative Bonferroni Correction can be used as a safeguard.

You divide whatever your significance level $\bar \alpha$ is by the number of statistical tests $t$ you're doing:

$$ \alpha_p = \frac{\bar \alpha}{t} $$

$\alpha_p$ is the per-comparison significance level which you apply for each individual test, and $\bar \alpha$ is the maximum experiment-wide significance level, called the maximum familywise error rate (FWER).

The Sidak Correction

A more sensitive correction, the Sidak Correction, can also be used:

$$ \alpha_p = 1 - (1 - \bar \alpha)^{\frac{1}{n}} $$

For $n$ independent comparisons, $\alpha$, the experiment-wide significance level (the FWER) is:

$$ \alpha = 1 - (1 - \alpha_p)^n $$

For $n$ dependent comparisons, use:

$$ \alpha \leq n \alpha_p $$

The Benjamini-Hochberg Procedure

Approaches like the Bonferroni correction lowers the alpha level which end up decreasing your statistical power - that is, you fail to detect false effects as well as true effects.

And with such an approach, you are still susceptible to the base rate fallacy, and may still have false positives. So how can you calculate the false discovery rate? That is, what fraction of the statistically Significant results are false positives?

You can use the Benjamini-Hochberg procedure, which tells you which P values to consider statistically significant:

  1. Perform your statistical tests and get the P value for each. Make a list and sort it in ascending order.
  2. Choose a false-discovery rate $q$. The number of statistical tests is $m$.
  3. Find the largest p value such that $p \leq \frac{iq}{m}$, where $i$ is the P value's place in the sorted list.
  4. Call that P value and all smaller than it statistically significant.

The procedure guarantees that out of all statistically significant results, no more than q percent will be false positives.

The Benjamini-Hochberg procedure is fast and effective, and it has been widely adopted by statisticians and scientists in certain fields. It usually provides better statistical power than the Bonferroni correction and friends while giving more intuitive results. It can be applied in many different situations, and variations on the procedure provide better statistical power when testing certain kinds of data.

Of course, it's not perfect. In certain strange situations, the Benjamini-Hochberg procedure gives silly results, and it has been mathematically shown that it is always possible to beat it in controlling the false discovery rate. But it's a start, and it's much better than nothing. - (Controlling the false discovery rate, Alex Reinhart)

Sum of Squares

The sum of squares within (SSW)

$$ \SSW = \sum_{i=1}^m (\sum{j=1}^n (x_{ij} - \bar{x_i})^2) $$

The sum of squares between (SSB)

$$ \SSB = \sum_{i=1}^m [n_m [(\bar{x_m} - \bar{\bar{x}})^2]] $$

The total sum of squares (SST)

$$ \SST = \sum_{i=1}^m (\sum{j=1}^n (x_{ij} - \bar{\bar{x}})^2) $$
$$ \SST = \SSW + \SSB $$

Statistical Tests

Two-sided tests

Asks "What is the chance of seeing an effect as big as the observed effect, without regard to its sign?" That is, you are looking for any effect, increase or decrease.

One-sided tests

Asks "What is the chance of seeing an effect as big as the observed effect, with the same sign?" That is, you are looking for either only an increase or decrease.

Unpaired t-test

The most basic statistical test, used when comparing the means from two groups. Used for small sample sizes. The t-test returns a p-value.

Paired t-test

The paired t-test is a t-test used when each datapoint in one group corresponds to one datapoint in the other group.

Chi-squared test

When comparing proportions of two populations, it is common to use the chi-squared statistic:

$$ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} $$

Where $O_i$ is the observed frequencies and $E_i$ is the expected frequencies.

Say for example you want to test if a coin is fair. You expect that, if it is fair, you should see about 50/50 heads and tails - this describes your expected frequencies. You flip the coin and observe the actual resulting frequencies - these are your observed frequencies.

The Chi squared test allows you to determine if these frequencies differ significantly.

ANOVA (Analysis of Variance)

ANOVA, ANCOVA, MANOVA, and MANCOVA are various ways of comparing different groups.

ANOVA is used to compare three or more groups. It uses a single test to compare the means across multiple groups simultaneously, which avoids using multiple tests to make multiple comparisons (which can lead to differences across groups resulting from chance).

There are a few requirements:

ANOVA tests the null hypothesis that the means across groups are the same (that is, that $\mu_1 = \dots = \mu_k$, if there are $k$ groups), with the alternate hypothesis being that at least one mean is different. We look at the variability in the sample means and see if it is so large that it is unlikely to have been due to chance. THe variability we use is the mean square between groups (MSG) which has degrees of freedom $\df_G = k - 1$.

The MSG is calculated:

$$ \MSG = \frac{1}{\df_G}\SSG = \frac{1}{k-1} \sum_{i=1}^k n_i(\bar x_i - \bar x)^2 $$

Where the SSG is the sum of squares between groups and $n_i$ is the sample size of group $i$ out of $k$ total groups. $\bar x$ is the mean of outcomes across all groups.

We need a value to compare the MSG to, which is the mean square error (MSE), which measures the variability within groups and has degrees of freedom $\df_E = n - k$.

The MSE is calculated:

$$ \MSE = \frac{1}{\df_E}\SSE $$

Where the SSE is the sum of squared errors and is computed as:

$$ \SSE = \SST - \SSG $$

Where the SSG is same as before and the SST is the sum of squares total:

$$ \begin{aligned} \SST &= \sum_{i=1}^n(x_i - \bar x)^2 \\ \SSG = \sum_{i=1}^k n_i (\bar x_i - \bar x)^2 \end{aligned} $$

ANOVA uses a test statistic called $F$, which is computed:

$$ F = \frac{\MSG}{\MSE} $$

When the null hypothesis is true, difference in variability across sample means should be due only to chance, so we expect MSG and MSE to be about equal (and thus $F$ to be close to 1).

We take this $F$ statistic and use it with a test called the F test, where we compute a p-value from the $F$ statistic, using the F distribution, which has the parameters $\df_1$ and $\df_2$. We expect ANOVA's $F$ statistic to follow an F distribution with parameters $\df_1 = \df_G, \df_2 = \df_E$ if the null hypothesis is true.


Similar to a t-test but used to compare three or more groups. With ANOVA, you calculate the F statistic, assuming the null hypothesis^[Remember that $\SSB$ is the "sum of squares between" and $\SSW$ is the "sum of squares within".]:

$$ F = \frac{\frac{\SSB}{m-1}}{\frac{\SSW}{m(n-1)}} $$


Allows you to compare the means of two or more groups when there are multiple variables or factors to be considered.

One-tailed & two-tailed tests

In a two-tailed test, both tails of a distribution are considered. For example, with a drug where you're looking for any effect, positive or negative.

In a one-tailed, only one tail is considered. For example, you may be looking only for a positive or only for a negative effect.

Effect Size

A big part of statistical inference is measuring effect size, which more generally is trying to quantify differences between groups, but typically just referred to as "effect size".

There are a few ways of measuring effect size:

Difference in means

The difference in means, e.g. $\mu_1 - \mu_2$

But this has a few problems:

Distribution overlap

The overlap between the two distributions:

Choose some threshold between the two means, e.g.

Count how many in the first group are below the threshold, call it $m_1$
Count how many in the second group are above the threshold, call it $m_2$.

The overlap then is:

$$ \frac{m_1}{n_1} + \frac{m_2}{n_2} $$

Where $n_1, n_2$ are the sample sizes of the first and second groups, respectively.

This overlap can also be framed as a misclassification rate, which is just $\frac{\text{overlap}}{2}$.

These measures are unitless, which makes them easy to compare across studies.

Probability of superiority

The "probability of superiority" is the probability that a randomly chosen datapoint from group 1 is greater than a randomly chosen datapoint from group 2.

This measure is also unitless.

Cohen's $d$

Cohen's $d$ is the difference in means, divided by the standard deviation, which is computed from the pooled variance, $\sigma_p^2$, of the groups:

$$ \begin{aligned} \sigma_p^2 &= \frac{n_1\sigma_1^2 + n_2\sigma_2^2}{n_1+n_2} \\ d &= \frac{\mu_1 - \mu_2}{\sqrt{\sigma_p^2}} \end{aligned} $$

This measure is also unitless.

Different fields have different intuitions about how big a $d$ value is; it's something you have to learn.


Reliability refers to how consistent or repeatable a measurement is (for continuous data).

There are three main approaches:

Multiple-occasions reliability

Aka test-retest reliability. This is how a test holds up over repeated testing, e.g. "temporal stability". This assumes the underlying metric does not change.

Multiple-forms reliability

Aka parallel-forms reliability. This asks: how consistent are different tests at measuring the same thing?

Internal consistency reliability

This asks: do the items on a test all measure the same thing?


Agreement is similar to reliability, but used more for discrete data.

Percent agreement

$$ \frac{\text{number of cases where tests agreed}}{\text{all cases}} $$

Note that a high percent agreement may be obtained by chance.

Cohen's kappa

Often just called kappa, this corrects for the possibility of chance agreement:

$$ \kappa = \frac{p_o - p_e}{1 - p_e} $$

Where $p_o$ is the observed agreement, that is, $\frac{\text{num. agreement}}{\text{total cases}}$, and $p_e$ is the expected agreement. Kappa ranges from -1 to 1, where 1 is perfect agreement.

Handling Data

Transforming data

Occasionally you may find data easier to work with if you apply a transformation to it; that is, rescale it in some way. For instance, you might take the natural log of your values, or the square root, or the inverse. This can reduce skew and the effect of outliers or make linear modeling easier.

The function which applies this transformation is called a link function.

Dealing with missing data

Data can be missing for a few reasons:

There are a few strategies for dealing with missing data.

The worst you can do is to ignore the missing data and try to run your analysis, missing data and all (it likely won't and probably shouldn't work).

Alternatively, you can delete all datapoints which have missing data, leaving only complete data points - this is called complete case analysis. Complete case analysis makes the most sense with MCAR missing data - you will have a reduction in sample size, and thus a reduction in statistical power, as a result, but your inference will not be biased. The possibly systemic nature of missing data in MAR and MNAR means that complete case analysis may overlook important details for your model.

You also have the option of filling in missing values - this is called imputation (you "impute" the missing values). You can, for instance, filling in missing values with the mean of that variable. You don't gain any of the information that was missing, and you end up ignoring the uncertainty associated with the fill-in value (and the resulting variances will be artificially reduced), but you at least get to maintain your sample size. Again, bias may be introduced in MAR and MNAR situations since the missing data may be due to some systemic cause.

One of the better approaches is multiple imputation, which produces unbiased parameter estimates and accounts for the uncertainty of imputed values. A regression model is used to generated the imputed values, and does well especially under MAR conditions - the regression model may be able to exploit info in the dataset about the missing data. If some known values correlate with the missing values, they can be of use in this way.

Then, instead of using the regression model to produce one value for each missing value, multiple values are produced, so that the end result is multiple copies of your dataset, each with different imputed values for the missing values. Your perform your analysis across all datasets and average the produced estimates.


Resampling involves repeatedly drawing subsamples from an existing sample.

Resampling is useful for assessing and selecting models and for estimating the precision of parameter estimates.

A common resampling method is bootstrapping.


Bootstrapping is a resampling method to approximate the true sampling distribution of a dataset, which can then be used to estimate the mean and the variance of the distribution. The advantage with bootstrapping is that there is no need to compute derivatives or make assumptions about the distribution's form.

You take $R$ samples $S_i^*$, with replacement, each of size $n$ (i.e. each resample is the same size as the original sample), from your dataset. These samples, $S^*=\\{S_1^*, \dots, S_R^*\\}$ are called replicate bootstrap samples. Then you can compute an estimate of the $t$ statistic for each of the bootstrap samples, $T_i^* = t(S_i^*)$.

Then you can estimate the mean and variance:

$$ \begin{aligned} \bar T^* = \hat E[T^*] &= \frac{\sum_i T_i^*}{R} \\ \hat \Var(T^*) &= \frac{\sum_i (T_i^* - \bar T^*)^2}{R-1} \end{aligned} $$

With bootstrap estimates, there are two possible sources of error. You may have the sampling error from your original sample $S$ in addition to the bootstrap error, from failing to be comprehensive in your sampling of bootstrap samples. To avoid the latter, you should try to choose a large $R$, such as $R=1000$.