Probability theory is the study of uncertainty.

We typically talk about the probability of an **event**. The **probability space** defines the possible outcomes for the event, and is defined by the triple $(\Omega, \mathcal F, P)$, where:

- $\Omega$ is the space of possible outcomes, i.e. the **outcome space** (sometimes called the **sample space**).
- $\mathcal F \subseteq 2^{\Omega}$, where $2^{\Omega}$ is the power set of $\Omega$ (i.e. the set of all subsets of $\Omega$, including the empty set $\emptyset$ and $\Omega$ itself, the latter of which is called the **trivial event**), is the *space of measurable events* or the **event space**.
- $P$ is the *probability measure*, i.e. the **probability distribution**, that maps an event $E \in \mathcal F$ to a real value between 0 and 1 (that is, $P$ is a function that outputs a probability for the input event).

For example, if we have a six-sided die, the space of possible outcomes is $\Omega = \{1, 2, 3, 4, 5, 6\}$.

The outcome space

- *non-negativity*: for all $\alpha \in \mathcal F$, $P(\alpha) \geq 0$.
- *trivial event*: $P(\Omega) = 1$.
- *additivity*: for all $\alpha, \beta \in \mathcal F$ with $\alpha \cap \beta = \emptyset$, $P(\alpha \cup \beta) = P(\alpha) + P(\beta)$.

Other axioms include:

- $0 \leq P(a) \leq 1$
- $P(\text{True}) = 1$ and $P(\text{False}) = 0$

We refer to an event whose outcome is unknown as a *trial*, or an *experiment*, or an *observation*. An event is a trial which has resolved (we know the outcome), and we say "the event has occurred" or that the trial has "satisfied the event".

The **complement** of an event is everything in the outcome space that is *not* the event, and may be notated in a few ways:

If two events cannot occur together, they are **mutually exclusive**.

More concisely (from Probability, Paradox, and the Reasonable Person Principle):

- Experiment: An occurrence with an uncertain outcome that we can observe.
  - For example, rolling a die.
- Outcome: The result of an experiment; one particular state of the world. Synonym for "case."
  - For example: 6.
- Sample Space: The set of all possible outcomes for the experiment. (For now, assume each outcome is equally likely.)
  - For example, {1, 2, 3, 4, 5, 6}.
- Event: A subset of possible outcomes that together have some property we are interested in.
  - For example, the event "even die roll" is the set of outcomes {2, 4, 6}.
- Probability: The number of possible outcomes in the event divided by the number in the sample space.
  - For example, the probability of an even outcome from a six-sided die is |{2, 4, 6}| / |{1, 2, 3, 4, 5, 6}| = 3/6 = 1/2.
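The counting definition above can be sketched in a few lines of Python. This is a minimal illustration that assumes every outcome is equally likely; the function name `probability` and the variable names are chosen here for the example and are not part of the original text.

```python
from fractions import Fraction

def probability(event, sample_space):
    """Classical probability: favorable outcomes / all outcomes (equally likely)."""
    favorable = event & sample_space  # ignore anything outside the sample space
    return Fraction(len(favorable), len(sample_space))

die = {1, 2, 3, 4, 5, 6}   # sample space for one roll
even = {2, 4, 6}           # the event "even die roll"

p_even = probability(even, die)
print(p_even)  # 1/2
```

Using `Fraction` keeps the result exact, which makes it easy to compare against hand calculations.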

A **random variable** (sometimes called a **stochastic variable**) is a function which maps outcomes to real values (that is, they are technically not variables but rather functions), dependent on some other probabilistic factor. Random variables represent uncertain events we are interested in with a numerical value.

Random variables are typically denoted by a capital letter, e.g.

When we use **distribution** of the random variable

Contrast this to *probability* of some arbitrary single value

For example, we may be flipping a coin and have a random variable

Random variables may be:

- **Discrete**: the variable can only have specific values, e.g. on a 5 star rating system, the random variable could only be one of the values $[0,1,2,3,4,5]$. Another way of describing this is that the space of the variable's possible values (i.e. the outcome space) is *countable* (finite or countably infinite). For discrete random variables which are not numeric, e.g. gender (male, female, etc), we use an **indicator function** $I$ to map non-numeric values to numbers, e.g. $\text{male} = 0, \text{female} = 1, \dots$; we call variables from such functions **indicator variables**.
- **Continuous**: the variable can have arbitrarily exact values, e.g. time, speed, distance. That is, the outcome space is uncountably infinite.
- **Mixed**: these variables combine discrete and continuous components, e.g. by assigning positive probability to some specific values and a density to the rest.

- **Joint probability**: $P(a \cap b) = P(a \land b) = P(a, b) = P(a) P(b|a)$, the probability of both $a$ and $b$ occurring.
- **Disjoint probability**: $P(a \cup b) = P(a \lor b) = P(a) + P(b) - P(a, b)$, the probability of $a$ or $b$ occurring.

Probabilities can be visualized as a Venn diagram:

The overlap is where both

The previous axiom, describing

The conditional probability is the probability of

Formally, this is:

Where

For example, say you have two dice. Say

Now what is

Intuitively, this can be thought of as the probability of

So we ignore the part of the world in which

This can be re-written as:

Events

That is, their outcomes are unrelated.

Another way of saying this is that:

Knowing something about

The independence of

From this we can infer that:

More generally, we can say that events are **mutually independent** if

Mutual independence implies pairwise independence, but note that the converse is not true (that is, pairwise independence does not imply mutual independence).

**Conditional independence** is defined as:

If

Which is to say

From this we can infer that:

Note that mutual independence does not imply conditional independence.

Similarly, we can say that events

Say we have the joint probability

We can set

And we can again apply the previous equation to

This is generalized as the **chain rule of probability**:

With permutations, *order matters*. For instance, *permutations* (though they are the same *combination*, see below).

Permutations are notated:

Where:

- $x$ = total number of "items"
- $y$ = "spots" or "spaces" or "positions" available for the items.

A permutation can be expanded like so:

And generalized to the following formula:

For example, consider

With combinations, the order *doesn't* matter.

The notation is basically the same as permutation, except with a

or, expanded:

The **binomial coefficient**, and read as "n choose k".

Say you have a coin. What is $P(\frac{3}{8}H)$? That is, what is the probability of flipping *exactly* 3 heads in 8 flips?

So there are 56 possible outcomes that result in exactly 3 heads. Because a coin has two possible outcomes, and we're flipping 8 times, we know there are

So to figure out the probability, we can just take the ratio of these outcomes.
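As a quick check of the numbers in this example, the ratio can be computed directly. This is a sketch assuming a fair coin, so that every sequence of 8 flips is equally likely; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

flips = 8
heads = 3

favorable = comb(flips, heads)  # ways to choose which 3 of the 8 flips are heads
total = 2 ** flips              # all equally likely sequences of 8 flips

p_three_heads = Fraction(favorable, total)
print(favorable, total, p_three_heads, float(p_three_heads))
```

The count of favorable outcomes matches the 56 stated above, and the ratio reduces to 7/32.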

Given outcome

$P(A) = 0.8$ and $P(B) = 0.2$, what is $P(\frac{3}{5}A)$? That is, the probability of exactly 3 out of 5 trials being A.

Basically, like before, we're looking for possible combinations of

So we know there are 10 possible outcomes resulting in

So then we just multiply the number of these combinations, 10, by this resulting probability to get the final answer.
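The same calculation can be sketched in Python. It assumes the five trials are independent, so the probability of any one particular arrangement of three A's and two B's is the product of the individual probabilities; the variable names are illustrative.

```python
from math import comb

p_a, p_b = 0.8, 0.2
trials, wanted = 5, 3

arrangements = comb(trials, wanted)                        # 10 ways to place 3 A's in 5 trials
p_one_arrangement = p_a ** wanted * p_b ** (trials - wanted)  # 0.8^3 * 0.2^2
p_exactly_three = arrangements * p_one_arrangement

print(arrangements, p_one_arrangement, p_exactly_three)
```

Up to floating point rounding, the result is 10 × 0.02048 = 0.2048.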

For some random variable **probability distribution function** **probability distribution**); the particular kind depends on what kind of random variable

If the random variable

Distributions themselves are described by **parameters** - variables which determine the specifics of a distribution. Different kinds of distributions are described by different sets of parameters. For instance, the particular shape of a normal distribution is determined by **parameterized** by *parameterization* of the distribution; that is, with the normal distribution example, we don't know what

For discrete random variables, the distribution is a **probability mass function**.

It is called a "mass function" because it divides a unit mass (the total probability) across the different values the random variable can take.

In the example figure, the random variable can take on one of three discrete values,

For continuous random variables we have a **probability density function**. A probability density function

Where

The total area under the curve sums to 1 (which is to say that the aggregate probability of all possible values for

The probability of a random variable

It's worth noting that this implies that, for a continuous random variable, the probability of any *single* value is zero (when dealing with continuous random variables there are infinitely precise values). Rather, we compute probabilities for a range of values.

There are a few ways we can describe distributions.

- **Unimodal**: The distribution has one main peak.
- **Bimodal**: The distribution has two (approximately) equivalent main peaks.
- **Multimodal**: The distribution has more than two (approximately) equivalent main peaks.
- **Symmetrical**: The distribution falls in equal numbers on both sides of the middle.

**Skewness** describes distributions that have greater density on one side of the distribution. The side with less is the direction of the skew.

Skewness is defined:

Where

The normal distribution has a skewness of 0.

**Kurtosis** describes how the shape differs from a normal curve (if the tails are lighter or heavier).

Kurtosis is defined:

The standard normal distribution has a kurtosis of 3, so sometimes kurtosis is standardized by subtracting 3; this standardized kurtosis measure is called the *excess kurtosis*.

A cumulative distribution function

The *complementary cumulative distribution function* (CCDF) of a distribution is

The cumulative distribution function of a discrete random variable is just the sum of the probabilities for the values up to

Say our discrete random variable

The complete discrete CDF is a step function, as you might expect because the CDF is constant between discrete values.

The cumulative distribution function of a continuous random variable is:

That is, it is the integral of the PDF up to the value in question.

Probability values for a specific range

or more simply:

Visually, there are a few tricks you can do with CDFs.

You can estimate the median by looking at where

You can estimate the probability that your

You can estimate a confidence interval as well. For example, the 90% confidence interval by looking at the

The **survival function** of a random variable

The expected value of a random variable

That is, it is the average (mean) value.

It can be thought of as a way of "summarizing" a random variable to a single value.

A random variable can be thought of as a sample from a potentially infinite population. The expected value of such a sample is the mean of that population, and the value of that mean depends on the population's distribution.

For a discrete random variable

The expected value exists only if this sum is well-defined, which basically means it has to aggregate in some clear way, as either a finite value or positive or negative infinity. But it can't, for instance, contain a positive infinity and a negative infinity term simultaneously, because it's undefined how those combine.

For example, consider the infinite sum

For a continuous random variable

The expected value exists only when this integral is well-defined.

A function

Using whichever is appropriate, depending on if

Jensen's Inequality states that given a convex function

For random variables

1. $E(a) = a$ for all $a \in \mathbb R$. That is, the expected value of a constant is just the constant. This is called the *normalization* property.
2. $E(aX) = aE(X)$ for all $a \in \mathbb R$.
3. $E(X+Y) = E(X) + E(Y)$.
4. If $X \geq 0$, that is, all possible values of $X$ are non-negative, then $E[X] \geq 0$.
5. If $X \leq Y$, that is, for every outcome the value of $X$ is at most the value of $Y$, then $E[X] \leq E[Y]$. This is called the *order* property.
6. If $X$ and $Y$ are independent, then $E[XY] = E[X]E[Y]$. Note that the converse is not true; that is, if $E[XY] = E[X]E[Y]$, this does not necessarily mean that $X$ and $Y$ are independent.
7. $E[I_A(X)] = P(X \in A)$, that is, the expected value of the indicator function $I_A(X)$ (which is 1 if $X \in A$ and 0 otherwise) is the probability that the random variable $X$ takes a value in $A$.

Properties 2 and 3 are called *linearity*.
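Linearity can be checked numerically on a small example. The sketch below uses two fair six-sided dice as stand-in random variables; the dice, the helper `expectation`, and the scaling constant 3 are illustrative assumptions. The dice are taken to be independent only to make the joint distribution easy to build, since linearity itself does not require independence.

```python
from fractions import Fraction
from itertools import product

# PMF of one fair die: each face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

def expectation(pmf, g=lambda value: value):
    """Expected value of g(X) for a discrete PMF given as {value: probability}."""
    return sum(p * g(value) for value, p in pmf.items())

e_x = expectation(die)                       # E[X]
e_3x = expectation(die, lambda v: 3 * v)     # E[3X]

# Joint PMF of two independent dice (X, Y).
joint = {(x, y): px * py
         for (x, px), (y, py) in product(die.items(), die.items())}
e_sum = expectation(joint, lambda xy: xy[0] + xy[1])  # E[X + Y]

print(e_x, e_3x, e_sum)  # 7/2 21/2 7
```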

To put linearity another way: Let

The **variance** of a distribution is the "spread" of a distribution.

The variance of a random variable

It can be defined in a couple ways:

Variance is not a linear function of

If random variables

The **covariance** of two random variables is a measure of how "closely related" they are:

With more than two variables, a **covariance matrix** is used.

Covariance matrices show two things:

- the variance of a variable $i$, located at the $i,i$ element
- the covariance of variables $i,j$, located at the $i,j$ and $j,i$ elements

If the covariance between two variables is negative, then we have a downward slope, if it is positive, then we have an upward slope.

So the covariance matrix tells us a lot about the shape of the data.
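To make the layout of a covariance matrix concrete, here is a small pure-Python sketch. It assumes the population form of covariance (the mean of the products of deviations from the mean); the data and function names are made up for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance: mean of (x - mean_x) * (y - mean_y)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def covariance_matrix(columns):
    """Entry (i, j) is the covariance of variable i with variable j."""
    return [[covariance(a, b) for b in columns] for a in columns]

# Toy data: y decreases as x increases, so their covariance is negative.
x = [1.0, 2.0, 3.0, 4.0]
y = [8.0, 6.0, 4.0, 2.0]

cov = covariance_matrix([x, y])
print(cov)  # [[1.25, -2.5], [-2.5, 5.0]]
```

The diagonal holds the variances of `x` and `y`, and the negative off-diagonal entries reflect the downward slope in the data.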

Here a few distributions you are likely to encounter are described in more detail.

A random variable distributed according to the Bernoulli distribution can take on two possible values (typically 0 and 1) and is called a **Bernoulli random variable**.

The distribution is described as:

And for a Bernoulli random variable

The mean of a Bernoulli distribution is

A Bernoulli distribution describes a single trial, though often you may consider multiple trials, each with its own random variable.

Say we have a set of iid Bernoulli random variables, each representing a trial. What is the probability of finding the first success at the

This can be described with a geometric distribution, which is a distribution where the probabilities decrease exponentially fast.

It is formalized as:

With the mean

Suppose you have a binomial experiment (i.e. one with two mutually exclusive outcomes, such as "success" or "failure") of

Note that *binomial* is in contrast to *multinomial* in which a random variable can take on more than just two discrete values. This shouldn't be confused with *multivariate* which refers to a situation where there are multiple variables.

The resulting distribution is a **binomial distribution**, such as:

The binomial distribution has the following properties:

The binomial distribution is expressed as:

A binomial random variable

Here

Its expected value is:

The binomial distribution has two parameters:

- $n$ - a positive integer representing the number of trials
- $p$ - the probability of an event occurring in a single trial

The special case $n = 1$ is the *Bernoulli distribution*.

If we have

Thus the expected value of a Bernoulli random variable is

Some example questions that can be answered with a binomial distribution:

- Out of ten tosses, how many times will this coin be heads?
- From the children born in a given hospital on a given day, how many of them will be girls?
- How many students in a given class room will have green eyes?
- How many mosquitoes, out of a swarm, will die when sprayed with insecticide?

(Source)

When the number of trials

The negative binomial distribution is a more general form of the geometric distribution; instead of giving the probability of the *first* success in the

This distribution is described as:

The **Poisson distribution** is useful for describing the number of rare (independent) events in a large population (of independent individuals) during some time span. It looks at how many times a discrete event occurs, over a period of continuous space or time; without a fixed number of trials.

If

For the Poisson distribution, the parameter is called the *intensity* of the distribution.

For the Poisson distribution,

A shorthand for saying that

For Poisson distributions, the expected value of our random variable is equal to the parameter

In the Poisson distribution figure, although it looks like the values fall off at some point, it actually has an infinite tail, so that *every* positive integer has some positive probability.

On average, 9 cars pass this intersection every hour. What is the probability that two cars pass the intersection this hour? Assume a Poisson distribution.

This problem can be framed as: what is

We know the expected value is 9 and that we have a Poisson distribution, so
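The calculation for this example can be sketched in Python using the standard Poisson probability mass function, $P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$. The symbol $\lambda$ for the rate and the function name `poisson_pmf` are notational assumptions made here for illustration.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return lam ** k * exp(-lam) / factorial(k)

# On average 9 cars per hour; probability that exactly 2 pass this hour.
p_two_cars = poisson_pmf(2, 9)
print(round(p_two_cars, 6))  # roughly 0.004998
```

So with an average of 9 cars per hour, seeing exactly two cars is quite unlikely (about 0.5%).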

Some example questions that can be answered with a Poisson distribution:

- How many pennies will I encounter on my walk home?
- How many children will be delivered at the hospital today?
- How many products will I sell after airing a new television commercial?
- How many mosquito bites did you get today after having sprayed with insecticide?
- How many defects will there be per 100 metres of rope sold?

(Source)

With the uniform distribution, every value is equally likely.

It may be constrained to a range of values as well.

A random variable which is continuous may have an *exponential density*, often described as an *exponential random variable*:

Here we say *exponential*:

Like the Poisson random variable, the exponential random variable can only have positive values. But because it is continuous, it can also take on non-integral values such as 4.25.

For exponential distributions, the expected value of our random variable is equal to the inverse of the parameter

Say we have the random variable $y$, which is the exact amount of rain we will get tomorrow, in inches. What is the probability that $y = 2 \pm 0.1$? Assume you have the probability density function $f$ for $y$.

We'd notate the probability we're looking for like so:

Which is the probability that

Then we would just find the integral (area under the curve) of the PDF from 1.9 to 2.1, i.e.
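Since the text does not specify the density $f$, the sketch below substitutes a purely hypothetical one (an exponential density with an assumed rate of 0.5) just to show the mechanics of integrating a PDF over the range 1.9 to 2.1. The midpoint rule used here is one simple numerical choice among many.

```python
from math import exp

def f(y, rate=0.5):
    """Hypothetical rainfall PDF: exponential density with an assumed rate."""
    return rate * exp(-rate * y) if y >= 0 else 0.0

def integrate(pdf, a, b, steps=10_000):
    """Midpoint-rule approximation of the area under pdf between a and b."""
    width = (b - a) / steps
    return sum(pdf(a + (i + 0.5) * width) for i in range(steps)) * width

p_rain = integrate(f, 1.9, 2.1)

# For this particular density the integral has a closed form to compare against.
exact = exp(-0.5 * 1.9) - exp(-0.5 * 2.1)
print(p_rain, exact)
```

With a real density you would replace `f` and, if no closed form exists, rely on the numerical result alone.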

This is over positive real numbers.

It is just a generalization of the exponential random variable:

The PDF is:

Where *Gamma function*.

The normal distribution is perhaps the most common probability distribution, occurring very often in nature.

For a random variable

The (univariate) Gaussian distribution is parameterized by

The peak of the distribution is where

The height and width of the distribution varies according to

The *standard* normal distribution is just

The Gaussian distribution can be used to approximate other distributions, such as the binomial distribution when the number of experiments is large, or the Poisson distribution when the average arrival rate is high.

A normal random variable

Where the parameters are:

- $\mu$ = the mean
- $\sigma$ = the standard deviation

The expected value is:

For small sample sizes (

This distribution is the t-distribution, which, for large enough sample sizes (

The t-distribution has thicker tails than the normal distribution, so observations are more likely to fall beyond two standard deviations from its mean. This allows for more accurate estimations of the standard error for small sample sizes.

The t-distribution is always centered around zero and is described by one parameter: the **degrees of freedom**. The higher the degrees of freedom, the closer the t-distribution is to the standard normal distribution.

The confidence interval is computed slightly differently for a t distribution. Instead of the Z score we use a cutoff,

For a single sample with

The t-distribution's corresponding test is the t-test, sometimes called the "Student t-test", which is used to compare the means of two groups.

From the t-distribution we can calculate a t value:

Then we can use this t value with the t distribution with the degrees of freedom for the sample and use that to compute a p-value.

For an event with two outcomes, the beta distribution is the probability distribution of the probability of the outcome being positive. The beta distribution's domain is

That is, in a beta distribution both the

and the

It is notated:

Where

Its PDF is:

Where *Beta function*.

The Beta distribution is a generalization of the uniform distribution:

The mean of a beta distribution is just

If you need to estimate the probability of something happening, the beta distribution can be a good prior since it is quite easy to calculate its posterior distribution:

That is, you just use some plausible prior values for

The Weibull distribution is used for modeling reliability or "survival" data, e.g. for dealing with failure-rates.

It is defined as:

The *shape parameter* and the *scale parameter* of the distribution.

If

The

The

This distribution has a mean

A Pareto distribution has a CDF with the form:

They are characterized as having a long tail (i.e. many small values, few large ones), but the large values are large enough that they still make up a disproportionate share of the total (e.g. the large values take up 80% of the distribution, the rest are 20%).

Such a distribution is described as *scale-free* since it is not centered around any particular value. Compare this to Gaussian distributions, which are centered around some value.

Such a distribution is said to obey the *power law*. A distribution

Such distributions are (confusingly) sometimes called *scaling distributions* because they are invariant to changes of scale, which is to say that you can change the units the quantities are expressed in and

In the real world you often work with multiple random variables simultaneously - that is, you are working in higher dimensions. You could describe a group of random variables as a *random vector*, i.e. a random vector

A distribution over multiple random variables is called a *joint distribution*.

For a joint distribution *marginal distribution* (or just *marginal*) of the joint distribution, and is computed:

That is, fix

Generally, you can compute the marginal like so:

So you take the variable you want to remove and sum over the probabilities with it fixed for each of its possible outcomes.

The distribution over multiple random variables is called a **joint distribution**. When we have multiple random variables, the distribution of some subset of those random variables is the **marginal distribution** for that subset.

The probability density function for a joint distribution just takes more arguments, i.e.:

Conditional distributions are distributions in which the value of one or more other random variables are known.

For random variables

which is undefined if

This can be expanded to multiple given random variables:

The conditional distribution of

More generally, we can describe the conditional distribution of

For continuous random variables, the probability of the random variable being a given specific value is 0 (see the section on probability density functions), so here we have the denominator as 0, which won't do. However, it can be shown that the probability density function

And thus:

A random vector

Note that "Gaussian" often implies "multivariate Gaussian".

That is, the dot product of some vector

is Gaussian for every

We say

which means

If

Caveat: a random vector's individual components being Gaussian but *not* independent does not necessarily imply that the vector itself is Gaussian.

Intuitively this makes sense because if

A *degenerate* univariate Gaussian distribution is one where

A multivariate Gaussian can also be degenerate, which is when the determinant of its covariance matrix

These are some examples of what Gaussians can look like. Drawn over the first two are their *level sets* which demarcate where the density is constant (you can think of it like a topographical map).

The last example is a degenerate Gaussian.

A multivariate Gaussian random variable

The PDF is:

Note that

An **affine transformation** is just some function in the form

Any affine transformation of a Gaussian random variable is itself a Gaussian. If

The marginal distributions of a Gaussian are also Gaussian.

More formally, if you have a Gaussian random vector

The conditional distributions of a Gaussian are also Gaussian.

More formally, if you have a Gaussian random vector

The sum of independent Gaussians is also Gaussian.

More formally, if you have Gaussian random vectors

The probability of both *or*

This is the same as the probability of

This can be rearranged to form **Bayes' Theorem**:

Bayes' Theorem is useful for answering questions such as, "How likely is A given B?". For example, "How likely is my hypothesis true given the data I have?"

This explanation is adapted from Count Bayesie.

The accompanying figure depicts a 6x10 area (60 pegs total) of lego bricks representing a probability space with the following probabilities:

Red and blue alone describe the entire set of possible events. Yellow pegs are *conditional* upon the red and blue bricks; that is, their probabilities are conditional upon what color brick is underneath it.

So the following probability properties of yellow should be straightforward:

But say you want to figure out

This intuition is Bayes' Theorem, and can be written more formally as:

Step by step, what we did was:

If you expand out the last equation, you'll find Bayes' Theorem:

Consider the following problem:

1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening.

What is the probability that she actually has breast cancer?

Intuitively it's difficult to get the correct answer. Generally, only ~15% of doctors get it right (Casscells, Schoenberger, and Graboys 1978; Eddy 1982; Gigerenzer and Hoffrage 1995; and many other studies).

You can work through the problem like so:

- *1% of women at age forty have breast cancer.* To simplify the problem, assume there are 1000 women total, so 10/1000 have breast cancer.
- *80% of women w/ breast cancer will get positive mammographies.* So of the 10 women that have breast cancer, 8 (that is, 8/1000 overall) will get positive mammographies.
- *9.6% of women without breast cancer will also get positive mammographies.* We have 10/1000 women with breast cancer, which means there are 990 without breast cancer. Of those 990, 9.6% will also get positive mammographies, so ~95/1000 women are false positives.

We can rephrase the problem like so: *What is the probability that a woman in this age group has breast cancer, if she gets a positive mammography?*

In total, the number of positives we have are 95 + 8 = 103. Then we can just use simple probability: there's an 8/103 chance (7.8%) that she has breast cancer, and a 95/103 chance (92.2%) that she's a false positive.

One way to interpret these results is that, in general, women of age forty have a 1% chance of having breast cancer. Getting a positive mammography does not indicate that you have breast cancer, it just "slides up" your probability of having it to 7.8%.

We could break up the group of 1000 women into:

- True positives: 8
- False positives: 95
- True negatives: 990 - 95 = 895
- False negatives: 10 - 8 = 2

Which totals to 1000, so everyone is accounted for.
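The same result can be reached by plugging the stated rates straight into Bayes' Theorem rather than counting women. The following is a minimal sketch; the function name `posterior` and its parameter names are illustrative.

```python
def posterior(prior, p_pos_given_true, p_pos_given_false):
    """P(condition | positive test) via Bayes' Theorem."""
    true_positive = prior * p_pos_given_true            # has condition and tests positive
    false_positive = (1 - prior) * p_pos_given_false    # no condition but tests positive
    return true_positive / (true_positive + false_positive)

# 1% prevalence, 80% true positive rate, 9.6% false positive rate.
p_cancer_given_positive = posterior(0.01, 0.80, 0.096)
print(round(p_cancer_given_positive, 4))  # about 0.0776
```

The small difference from 8/103 comes from rounding 95.04 false positives down to 95 in the counting version.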

The original proportion of patients w/ breast cancer is the **prior probability**.

The probability of a **true positive** and the probability of a **false positive** are the **conditional probabilities**.

Collectively, this information is known as the **priors**. The priors are required to solve a Bayesian problem.

The final answer - the estimated probability that a patient has breast cancer given a positive mammography - is the revised probability, better known as the **posterior probability**.

If the two conditional probabilities are equal, the posterior probability equals the prior probability (i.e. if there's an equal chance of getting a true positive and a false positive, then the test really tells you nothing).

Your friend reads you a study which found that only 10% of happy people are rich. Your friend concludes that money can't buy happiness. How could you show them otherwise?

Rather than asking "What percent of happy people are rich?", it is probably better to ask "What percent of rich people are happy?" to determine if money buys happiness.

With the statistic from the study, statistics about the overall rate of happy people (say 40% of people are happy) and rich people (say 5% of people are rich), and Bayes' Theorem, you can calculate this value:

So it seems like a lot of rich people are happy.
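Using the numbers given in this example (10% of happy people are rich, 40% of people are happy, 5% of people are rich), the Bayes' Theorem calculation can be sketched as follows; the variable names are illustrative.

```python
p_rich_given_happy = 0.10  # from the study
p_happy = 0.40             # assumed overall rate of happy people
p_rich = 0.05              # assumed overall rate of rich people

# Bayes' Theorem: P(happy | rich) = P(rich | happy) * P(happy) / P(rich)
p_happy_given_rich = p_rich_given_happy * p_happy / p_rich
print(round(p_happy_given_rich, 2))  # 0.8
```

Under these assumed rates, about 80% of rich people are happy, twice the overall rate of 40%.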

Bayes' rule:

Say

We'll notate the class as

Our evidence may actually be multiple pieces of evidence:

If we can assume that each piece of evidence is independent given the class

In practice: say I have two coins. One is a fair coin (

The head and tail outcomes are our evidence. So we can take the product of the probabilities of these outcomes given a particular class.

The probability of picking either coin was uniform, i.e. there was a 50% chance of picking either. So we can ignore that probability.

For a fair coin, the probability of getting heads and then tails is

For the trick coin, the probability is

So it's more likely that I picked the fair coin.

If we flip again and get a heads, things change a bit:

For a fair coin:

For the trick coin:

So now it's slightly more likely that I picked the trick coin.

When working with many independent probabilities, which is often the case in machine learning, you have to multiply many probabilities which can result in underflow. So it's often easier to work with the logarithm of probability functions, which is fine because when optimizing, the max (or min) will be at the same location in the logarithm form (though their actual values will be different). Using logarithms will allow us to sum terms instead of multiplying them.
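The underflow problem is easy to reproduce. In the sketch below, the list of 100 probabilities of $10^{-5}$ is an arbitrary example chosen so that the true product ($10^{-500}$) is far below what a double-precision float can represent.

```python
from math import log, prod

probabilities = [1e-5] * 100

# Multiplying directly: the true product 1e-500 underflows to 0.0.
naive_product = prod(probabilities)

# Summing logarithms instead keeps the value well within float range.
log_probability = sum(log(p) for p in probabilities)

print(naive_product)     # 0.0
print(log_probability)   # about -1151.29
```

Because the logarithm is monotonic, comparing two hypotheses by their summed log-probabilities picks the same winner as comparing the raw products would, without the loss of precision.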

Information, measured in bits, answers questions - the more initial uncertainty there is about the answer, the more information the answer contains.

The amount of bits needed to encode an answer depends on the distribution over the possible answers (i.e., the uncertainty about the answer).

Examples:

- the answer to a boolean question with a prior $(0.5, 0.5)$ requires 1 bit to encode (i.e. just 0 or 1)
- the answer to a 4-way question with a prior $(0.25, 0.25, 0.25, 0.25)$ requires 2 bits to encode
- the answer to a 4-way question with a prior $(0, 0, 0, 1)$ requires 0 bits to encode, since the answer is already known (no uncertainty)
- the answer to a 3-way question with prior $(0.5, 0.25, 0.25)$ requires, on average, 1.5 bits to encode

More formally, we can compute the average number of bits required to encode uncertain information as follows:

This quantity is called the **entropy** of the distribution (

If you do something such that the answer distribution changes (e.g. observe new evidence), the difference between the entropy of the new distribution and the entropy of the old distribution is called the **information gain**.
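The bit counts in the examples above can be reproduced with the standard entropy formula, $H = \sum_i p_i \log_2 \frac{1}{p_i}$, where terms with $p_i = 0$ contribute nothing. The function name `entropy_bits` is illustrative.

```python
from math import log2

def entropy_bits(distribution):
    """Average number of bits needed to encode an answer drawn from distribution."""
    return sum(p * log2(1 / p) for p in distribution if p > 0)

print(entropy_bits((0.5, 0.5)))                # 1.0
print(entropy_bits((0.25, 0.25, 0.25, 0.25)))  # 2.0
print(entropy_bits((0, 0, 0, 1)))              # 0.0
print(entropy_bits((0.5, 0.25, 0.25)))         # 1.5
```

Information gain can then be computed as the difference between `entropy_bits` of the distribution before and after observing new evidence.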

The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred. A message saying "the sun rose this morning" is so uninformative as to be unnecessary to send, but a message saying "there was a solar eclipse this morning" is very informative.

The **self-information** of an event is:

and, when using natural log, is measured in *nats* (when using $\log_2$, it is measured in *bits* or *shannons*). One nat is the information gained by observing an event of probability $\frac{1}{e}$.

Self-information only measures a single event; to measure the amount of uncertainty in a complete probability distribution we can instead use the **Shannon entropy**, which tells us the expected information of an event drawn from that distribution:

When the random variable is continuous, the Shannon entropy is known as the **differential entropy**.

Broadly, entropy is the measure of disorder in a system.

In the case of probability, it is the measure of uncertainty that is associated with the distribution of a random variable.

If there are a few outcomes which are fairly certain, the system has low entropy.

A point-mass distribution has the lowest entropy. We know exactly what value we'll get from it.

If there are many outcomes which are equiprobable, the system has high entropy.

A uniform distribution has the highest entropy. We don't really have any idea of what value we'll draw from it.

To put it another way: with high entropy, it is very hard to guess the value of the random variable (because all values are equally or similarly likely); with low entropy it is easy to guess its value (because there are some values which are much more likely than the others).

The entropy of a random variable

Where

This does not say anything about the value of the random variable, only the spread of its distribution.

For example: what is the entropy of a roll of a six-sided die?

The **Maximum Entropy Principle** says that, all else being equal, we should prefer distributions that maximize the entropy. That is, you should be conservative in your confidence about how much you know - if you don't have any good reason for something to be more likely than something else, err on the side of them being equiprobable.

The specific conditional entropy

The conditional entropy

Say you must transmit the random variable

To put it more concretely:

The bigger the difference, the more

We can measure the difference between two probability distributions

The KL divergence has the following properties:

- It is non-negative
- It is 0 if and only if:
  - $P$ and $Q$ are the same distribution (for discrete variables)
  - $P$ and $Q$ are equal "almost everywhere" (for continuous variables)
- It is *not* symmetric, i.e. in general $D_{\text{KL}}(P||Q) \neq D_{\text{KL}}(Q||P)$, so it is not a true distance metric
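These properties can be observed on a small discrete example. The sketch below uses the standard discrete definition, $D_{\text{KL}}(P||Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$, with base-2 logarithms so the result is in bits. The two distributions are arbitrary, and the code assumes $Q(i) > 0$ wherever $P(i) > 0$.

```python
from math import log2

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(p || q) in bits; assumes q[i] > 0 where p[i] > 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl_divergence(p, p))  # 0.0 (identical distributions)
print(kl_divergence(p, q))  # about 0.737
print(kl_divergence(q, p))  # about 0.531, different from the line above
```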

The KL divergence is related to **cross entropy**

Given some discrete random variable

Usually

The mutual information between two discrete random variables

For continuous random variables, it is instead computed:

The variation of information between two random variables is computed:

The **Kullback-Leibler divergence** tells us the difference between two probability distributions

For discrete probability distributions, it is calculated:

For continuous probability distributions, it is computed:

- Probabilistic Programming and Bayesian Methods for Hackers. Cam Davidson Pilon.
- Parameter Estimation - The PDF, CDF and Quantile Function. Count Bayesie. Will Kurt.
- What is the intuition behind beta distribution?. David Robinson, KerrBer.
- Distributions of One Variable. An Introduction to Statistics with Python. Thomas Haslwanter.
- Probability Theory Review for Machine Learning. Samuel Ieong. November 6, 2006.
- MIT 6.034 (Fall 2010): Artificial Intelligence. Patrick H. Winston. MIT.
- *Principles of Statistics*. M.G. Bulmer. 1979.
- OpenIntro Statistics, Second Edition. David M Diez, Christopher D Barr, Mine Çetinkaya-Rundel.
- A Beginner's Guide to Eigenvectors, PCA, Covariance and Entropy. Deeplearning4j. Skymind.
- An Intuitive Explanation of Bayes' Theorem. Eliezer S. Yudkowsky.
- Why so Square? Jensen's Inequality and Moments of a Random Variable. Count Bayesie. Will Kurt.
- What is an intuitive explanation of Bayes' Rule?. Mike Kayser.
- Bayes' Theorem with Lego. Count Bayesie. Will Kurt.
- Probability, Paradox, and the Reasonable Person Principle. Peter Norvig. October 3, 2015.
- Deep Learning. Yoshua Bengio, Ian Goodfellow, Aaron Courville.
- CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley (edX).
- Think Complexity. Version 1.2.3. Allen B. Downey. 2012.
- Mining Massive Datasets (Coursera & Stanford, 2014). Jure Leskovec, Anand Rajaraman, Jeff Ullman.