The key tool for probabilistic inference is the **joint probability table**. Each row in a joint probability table describes a combination of values for a set of random variables. That is, say you have

X | Y | P(X,Y) |
---|---|---|
0 | 0 | 0.25 |
1 | 0 | 0.45 |
0 | 1 | 0.15 |
1 | 1 | 0.15 |

Using a joint probability table you can learn a lot about how those events are related probabilistically.
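As a minimal sketch of such queries, here is how a marginal and a conditional fall out of the table above by summing and normalizing rows (plain Python, no libraries):

```python
# The joint probability table above, keyed by (x, y).
joint = {(0, 0): 0.25, (1, 0): 0.45, (0, 1): 0.15, (1, 1): 0.15}

# Marginal P(X=1): sum the rows where X=1.
p_x1 = sum(p for (x, y), p in joint.items() if x == 1)  # ≈ 0.6

# Conditional P(Y=1 | X=1) = P(X=1, Y=1) / P(X=1).
p_y1_given_x1 = joint[(1, 1)] / p_x1  # ≈ 0.25

print(p_x1, p_y1_given_x1)
```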

The problem, however, is that joint probability tables can get very big; since joint probability tables are a representation of probabilistic models, this is another way of saying that models can get complex very quickly.

Typically, we have a set of $n$ random variables $X_1, \dots, X_n$.

Even in the simple case where each random variable is binary, you would still have a distribution over $2^n$ possible assignments.

We can use **probabilistic graphical models** (PGMs) to reduce this space. Probabilistic graphical models allow us to represent complex networks of interrelated and independent events efficiently and with sparse parameters. All graphical models have some limitations in their ability to graphically express conditional (in)dependence statements but are nevertheless very useful.

There are two main types of graphical models:

- **Bayesian models**: aka **Bayesian networks**, sometimes called **Bayes nets** or **belief networks**. These use directed graphs and are used when there are causal relationships between the random variables.
- **Markov models**: These use undirected graphs and are used when there are noncausal relationships between the random variables.

The concept of *factors* is important to PGMs.

A **factor** is a function $\phi(X_1, \dots, X_k)$ which maps each assignment of its arguments to a real number.

The set of random variables $\{X_1, \dots, X_k\}$ is called the *scope* of the factor.

A joint distribution is a factor which returns a number which is the probability of a given combination of assignments.

An unnormalized measure is also a factor.

A conditional probability distribution (CPD) is also a factor.

A common operation on factors is the *factor product*. Say we have the factors $\phi_1(A,B)$ and $\phi_2(B,C)$; their product is a new factor over the union of their scopes, $\phi(A,B,C) = \phi_1(A,B)\,\phi_2(B,C)$.

Another operation is *factor marginalization*. This is the same as marginalization for probability distributions but generalized to all factors. For example, marginalizing $B$ out of $\phi(A,B,C)$ gives $\psi(A,C) = \sum_B \phi(A,B,C)$.

Another operation is *factor reduction*, which is similarly a generalization of probability distribution reduction: we restrict the factor to the rows consistent with some assignment, e.g. $B = b$.
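These three operations can be sketched with factors represented as dicts from assignments to values, plus an explicit scope. The factor values below are illustrative, not from the text:

```python
from itertools import product

def factor_product(phi1, scope1, phi2, scope2):
    """Multiply two factors; the result's scope is the union of the scopes."""
    scope = scope1 + [v for v in scope2 if v not in scope1]
    i1 = [scope.index(v) for v in scope1]
    i2 = [scope.index(v) for v in scope2]
    result = {}
    for assign in product((0, 1), repeat=len(scope)):  # binary variables
        a1 = tuple(assign[i] for i in i1)
        a2 = tuple(assign[i] for i in i2)
        result[assign] = phi1[a1] * phi2[a2]
    return result, scope

def marginalize(phi, scope, var):
    """Sum out `var` from the factor."""
    i = scope.index(var)
    out = {}
    for assign, val in phi.items():
        key = assign[:i] + assign[i + 1:]
        out[key] = out.get(key, 0.0) + val
    return out, [v for v in scope if v != var]

def reduce_factor(phi, scope, var, value):
    """Keep only the rows where `var` == `value` (factor reduction)."""
    i = scope.index(var)
    out = {a[:i] + a[i + 1:]: v for a, v in phi.items() if a[i] == value}
    return out, [v for v in scope if v != var]

# Two small factors over binary variables (values are illustrative).
phi1 = {(0, 0): 0.5, (0, 1): 0.8, (1, 0): 0.1, (1, 1): 0.0}  # scope (A, B)
phi2 = {(0, 0): 0.3, (0, 1): 0.7, (1, 0): 0.9, (1, 1): 0.1}  # scope (B, C)

prod_phi, prod_scope = factor_product(phi1, ["A", "B"], phi2, ["B", "C"])
marg_phi, marg_scope = marginalize(prod_phi, prod_scope, "B")
```

Marginalizing $B$ out of the product gives a factor over $(A, C)$, exactly the $\psi(A,C) = \sum_B \phi_1(A,B)\phi_2(B,C)$ pattern above.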

Say we are looking at five events:

- a dog barking ($D$)
- a raccoon being present ($R$)
- a burglar being present ($B$)
- a trash can is heard knocked over ($T$)
- the police are called ($P$)

We can encode some assumptions about how these events are related in a *belief net* (also called a *Bayesian net*):

Every node depends on its parents and nothing else that is not a descendant. To put it another way: given its parents, a node is independent of all its non-descendants.

For instance, the event

We can then annotate the graph with probabilities:

The root nodes have unconditional (prior) probabilities.

The others depend on the outcomes of their parents.

With the belief net, we only needed to specify 10 probabilities.

If we had just constructed a joint probability table, we would have had to specify $2^5 - 1 = 31$ probabilities.

If we expand out the joint probability of this system using the chain rule, it would look like:

$P(B,R,D,T,P) = P(B)P(R|B)P(D|B,R)P(T|B,R,D)P(P|B,R,D,T)$

But we can bring in our belief net's conditional independence assumptions to simplify this:

$P(B,R,D,T,P) = P(B)P(R)P(D|B,R)P(T|R)P(P|D)$

Belief networks are *acyclic*, that is, they cannot have any loops (a node cannot have a path back to itself). In particular, a belief network is a directed acyclic graph (DAG).

Two nodes (variables) in a Bayes net are on an **active trail** if a change in one node affects the other. This includes cases where the two nodes have a causal relationship, an evidential relationship, or have some common cause.

Formally, a belief network is a distribution of the form:

$p(x_1, \dots, x_n) = \prod_{i=1}^n p(x_i | \text{pa}(x_i))$

where $\text{pa}(x_i)$ denotes the *parental* variables of variable $x_i$.

When you factorize a joint probability, you have a number of options for doing so.

For instance:

where

Without any conditional independence assumptions, all factorizations produce an equivalent DAG.

However, once you begin dropping edges (i.e. making conditional independence assumptions), the graphs are not necessarily equivalent anymore.

Some of the graphs are equivalent; they can be converted amongst each other via Bayes' rule. Others cannot be bridged in this way, and thus are not equivalent.

Note that belief networks encode conditional independences but do not necessarily encode dependences.

For instance, the graph

In these cases, we call this implied dependence **graphical dependence**.

The following belief network triple represents the conditional independence of $A$ and $C$ given $B$: $A \to B \to C$.

The following belief network triple also represents the conditional independence of $A$ and $C$ given $B$: $A \leftarrow B \to C$.

The following belief network triple represents the graphical conditional dependence of $A$ and $C$ given $B$: $A \to B \leftarrow C$.

Here $B$ is a **collider**, since its neighbors are pointing to it.

Generally, if there is a path between two variables that contains a collider which is not in the conditioning set (and whose descendants are not either), that path is **blocked**.

Similarly, if there is a non-collider along the path which is in the conditioning set, the path is also blocked, and we cannot induce dependence between the two variables along it.

If all paths between two variables are blocked, we say they are **d-separated**.

However, if there are no colliders, or the colliders that are there are in the conditioning set (or their descendants are), and there are no non-collider conditioning variables in the path, we say this path **d-connects** the two variables.

Note that colliders are relative to a path.

For example, in the accompanying figure,

Consider the belief network

Note that the term "causes" is used loosely here; belief networks really only make independence statements, not necessarily causal ones.

(TODO the below is another set of notes for bayes' nets, incorporate these two)

Independence allows us to more compactly represent joint probability distributions, in that independent random variables can be represented as smaller, separate probability distributions.

For example, if we have two independent binary random variables $X$ and $Y$, we can specify their joint distribution with two parameters (one per variable) instead of three, since $P(X,Y) = P(X)P(Y)$.

Typically, independence is too strong an assumption to make for real-world applications, but we can often make the weaker, yet still useful, assumption of conditional independence.

Conditional independence is when one variable makes another variable irrelevant (because the other variable adds no additional information), i.e. $P(X|Y,Z) = P(X|Z)$; that is, $X$ is conditionally independent of $Y$ given $Z$.

For example, if

As a more concrete example, given random variables traffic ($T$), umbrella ($U$), and rain ($R$), we could say that traffic and umbrella are conditionally independent given rain.

Similarly, given fire, smoke and an alarm, we could say that fire and alarm are conditionally independent given smoke.

As mentioned earlier, we can apply conditional independence to simplify joint distributions.

Take the traffic/umbrella/rain example from before. Their joint distribution is $P(R,T,U) = P(R)P(T|R)P(U|R,T)$.

If we make the conditional independence assumption from before ($U \perp T | R$), we get $P(R,T,U) = P(R)P(T|R)P(U|R)$.

That is, we simplified the term $P(U|R,T)$ to $P(U|R)$.

We can describe complex joint distributions more simply with these conditional independence assumptions, and we can do so with Bayes' nets (i.e. graphical models), which provide additional insight into the structure of these distributions (in particular, how variables interact locally, and how these local interactions propagate to more distant indirect interactions).

A Bayes' net is a directed acyclic graph.

The nodes in the graph are the variables (with domains). They may be assigned (observed) or unassigned (unobserved).

The arcs in the graphs are interactions between variables (similar to constraints in CSPs). They indicate "direct influence" between variables (note that this is *not* necessarily the same as causation; it's about the information that observation of one variable gives about the other, which can reflect causation, but not necessarily, e.g. it could simply be a hidden common underlying cause), which is to say that they encode conditional independences.

For each node, we have a conditional distribution over the variable that node represents, conditioned on its parents' values.

Bayes' nets implicitly encode joint distributions as a product of local conditional distributions:

$P(X_1, \dots, X_n) = \prod_{i=1}^n P(X_i | \text{parents}(X_i))$

This simply comes from the chain rule:

$P(X_1, \dots, X_n) = \prod_{i=1}^n P(X_i | X_1, \dots, X_{i-1})$

And then applying conditional independence assumptions.

The graph must be acyclic so that we can come up with a consistent ordering when we apply the chain rule (that is, decide the order for expanding the distributions). If the graph has cycles, we can't come up with a consistent ordering because we will have loops.

Note that arcs can be "reversed" (i.e. parent and children can be swapped) and encode the same joint distribution - so joint distributions can be represented by multiple Bayes' nets. But some Bayes' nets are better representations than others - some will be easier to work with; in particular, if the arcs *do* represent causality, the network will be easier to work with.

Bayes' nets are much smaller than representing such joint distributions without conditional independence assumptions.

A joint distribution over $n$ binary variables requires $2^n - 1$ parameters.

A Bayes' net, on the other hand, where each of the $n$ nodes has at most $k$ parents, requires only $O(n \cdot 2^k)$ parameters.

The Bayes' net also encodes additional conditional independence assumptions in its structure.

For example, the Bayes' net

This structure implies other conditional independence assumptions, e.g. that

More generally we might ask: given two nodes, are they independent given certain evidence and the structure of the graph (i.e. assignments of intermediary nodes)?

We can use the *d-separation* algorithm to answer this question.

First, we consider three configurations of *triples* as base cases, which we can use to deal with more complex networks. That is, any Bayes' net can be decomposed into these three triple configurations.

A simple configuration of nodes in the form $X \to Y \to Z$ is a *causal chain* and encodes the joint distribution $P(x,y,z) = P(x)P(y|x)P(z|y)$.

However, is $X$ guaranteed to be independent of $Z$ given $Y$?

From the definition of conditional probability, we know that:

$P(z|x,y) = \frac{P(x,y,z)}{P(x,y)}$

With the Bayes' net, we can simplify this (the numerator comes from the joint distribution the graph encodes, as demonstrated previously, and the denominator comes from applying the product rule):

$P(z|x,y) = \frac{P(x)P(y|x)P(z|y)}{P(x)P(y|x)}$

Then, canceling a few things out:

$P(z|x,y) = P(z|y)$

So yes, $X$ is conditionally independent of $Z$ given $Y$.
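The cancellation above can be checked numerically. This sketch builds the chain's joint from made-up CPDs (all numbers illustrative) and verifies that $P(z|x,y)$ does not depend on $x$:

```python
# Causal chain X -> Y -> Z, all binary, with illustrative CPDs.
px = {0: 0.6, 1: 0.4}
py_x = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}  # P(y | x), keyed (y, x)
pz_y = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}  # P(z | y), keyed (z, y)

# Joint from the chain factorization: P(x, y, z) = P(x) P(y|x) P(z|y).
joint = {(x, y, z): px[x] * py_x[(y, x)] * pz_y[(z, y)]
         for x in (0, 1) for y in (0, 1) for z in (0, 1)}

def p_z_given(x, y, z):
    """P(z | x, y) computed directly from the joint."""
    num = joint[(x, y, z)]
    den = sum(joint[(x, y, zz)] for zz in (0, 1))
    return num / den

# P(z | x, y) is the same for x = 0 and x = 1: X is independent of Z given Y.
for y in (0, 1):
    for z in (0, 1):
        assert abs(p_z_given(0, y, z) - p_z_given(1, y, z)) < 1e-12
```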

Another configuration of nodes is a **common cause** configuration: $X \leftarrow Y \to Z$.

The encoded joint distribution is $P(x,y,z) = P(y)P(x|y)P(z|y)$.

Again, $X$ is not guaranteed to be independent of $Z$ unconditionally.

Is $X$ independent of $Z$ given $Y$?

Again, we start with the definition of conditional probability:

$P(z|x,y) = \frac{P(x,y,z)}{P(x,y)}$

Apply the product rule to the denominator and replace the numerator with the Bayes' net's joint distribution:

$P(z|x,y) = \frac{P(y)P(x|y)P(z|y)}{P(y)P(x|y)}$

Yielding:

$P(z|x,y) = P(z|y)$

So again, yes, $X$ is conditionally independent of $Z$ given $Y$.

Another triple configuration is the **common effect** configuration (also called a **v-structure**): $X \to Z \leftarrow Y$.

$X$ and $Y$ *are* (unconditionally) independent here.

However, is $X$ independent of $Y$ given $Z$?

No: observing the common effect $Z$ puts its causes into competition as explanations (**causal competition**). That is, having observed $Z$, learning about $X$ changes our belief about $Y$.

Consider the following Bayes' net:

Where our random variables are rain ($R$), low pressure ($L$), traffic ($T$), a baseball game ($B$), and your friend's roof dripping ($D$).

The relationships assumed here are: low pressure fronts cause rain, rain or a baseball game causes traffic, and rain causes your friend's roof to drip.

Given that you observe traffic, the probability that your friend's roof is dripping goes up, since perhaps the traffic is caused by rain, which would cause the roof to drip. This relationship is encoded in the graph by the path between $T$ and $D$.

However, if we observe that it is raining, then observation of traffic has no further effect on $D$: the path between $T$ and $D$ is blocked by the observation of $R$.

One exception here is the v-structure formed with the baseball game: $R$ and $B$ can influence each other *only* if we have observed their common effect $T$.

That is, we must observe $T$ (or one of its descendants) to **activate** the path between $R$ and $B$.

Thus we make the distinction between **active** triples, in which information "flows" as it did with the path between $T$ and $D$, and **inactive** triples, in which this information is "blocked".

Active triples are chain and common cause configurations in which the central node is *not* observed and common effect configurations in which the central node *is* observed, *or* common effect configurations in which some child node of the central node is observed.

An example for the last case:

If

Inactive triples are chain and common cause configurations in which the central node *is* observed and common effect configurations in which the central node is *not* observed.

So now, if we want to know whether two nodes are guaranteed conditionally independent given some evidence, we check whether every path between them is inactive; if so, they are **d-separated**. Otherwise, conditional independence is not guaranteed. This is the **d-separation** algorithm.

You can apply d-separation to a Bayes net and get a complete list of conditional independences that are necessarily true given certain evidence. This tells you the set of probability distributions that can be represented.

- Sally comes home and hears the alarm ($A=1$)
- Has she been burgled? ($B=1$)
- Or was the alarm triggered by an earthquake? ($E=1$)
- She hears on the radio that there was an earthquake ($R=1$)

We start with the full chain rule expansion:

$P(A,B,E,R) = P(A|B,E,R)\,P(R|B,E)\,P(E|B)\,P(B)$

Then we can make some conditional independence assumptions:

- The radio report has no effect on the alarm: $P(A|B,E,R) \to P(A|B,E)$
- A burglary has no effect on the radio report: $P(R|B,E) \to P(R|E)$
- A burglary would have no effect on the earthquake: $P(E|B) \to P(E)$

Thus we have simplified the computation of the joint probability distribution:

$P(A,B,E,R) = P(A|B,E)\,P(R|E)\,P(E)\,P(B)$

We can also construct a belief network out of these conditional independence assumptions:

Say we are given the following probabilities:

First consider if Sally has not yet heard the radio; that is, she has only heard the alarm (so the only evidence she has is $A=1$).

Now consider that Sally has also heard the report, i.e. $R=1$.

So hearing the report and learning that there was an earthquake makes the burglary much less likely.

We may, however, only have **soft** or **uncertain evidence**.

For instance, say Sally is only 70% sure that she heard the alarm.

We denote our soft evidence of the alarm's ringing as

We're ignoring the case with the report ($R$) here for simplicity.

**Unreliable evidence** is distinct from *uncertain* evidence.

Say we represent Sally's *uncertainty* of hearing the alarm, as described before, as

Now say for some reason we feel that Sally is *unreliable* for other reasons (maybe she lies a lot). We would then replace the term

This new term is called **virtual evidence**, also called **likelihood evidence**.

A note on the following graphics: the top part shows the belief network, where a faded node means it has been marginalized out, and a filled node means it has been observed/conditioned on. The bottom part shows the relationship between

If we marginalize over

If we instead condition on

If we introduce

In this arrangement,

Here, marginalizing over

Conditioning on

The same applies for this arrangement - here

These graphs all encode the same conditional independence assumptions.

For both directed and undirected graphs, two graphs are **Markov equivalent** if they both represent the same set of conditional independence statements.

Consider a joint distribution over the following random variables:

- $G$, grade: $g^1$ for A, $g^2$ for B, $g^3$ for C
- $I$, intelligence, binary: $-i$ for low, $+i$ for high
- $D$, difficulty of the course, binary: $-d$ for easy, $+d$ for hard
- $S$, SAT score, binary: $-s$ for low, $+s$ for high
- $L$, reference letter, binary: $-l$ for not received, $+l$ for received

We can encode some conditional independence assumptions about these random variables into a belief net:

- the grade depends on the student's intelligence and difficulty of the course
- the student's SAT score seems dependent on only their intelligence
- whether or not a student receives a recommendation letter depends on their grade

Note that we could add the assumption that intelligent students are likely to take more difficult courses, if we felt strongly about it:

To turn this graph into a probability distribution, we can represent each node as a CPD:

Then we can apply the *chain rule* of Bayesian networks, which just multiplies all the CPDs:

$P(D,I,G,S,L) = P(D)\,P(I)\,P(G|I,D)\,P(S|I)\,P(L|G)$

A Bayesian network (BN) is a directed acyclic graph whose nodes represent the random variables $X_1, \dots, X_n$; each node carries a CPD conditioned on its parents.

In whole, the BN represents a joint distribution via the chain rule for BNs:

$P(X_1, \dots, X_n) = \prod_i P(X_i | \text{Par}_G(X_i))$

We say a probability distribution $P$ *factorizes* over a BN graph $G$ if it can be expressed as this product of the graph's CPDs.

There are three types of reasoning that occur with a BN:

- Causal reasoning: conditioning on an ancestor to determine a descendant's probability, e.g. $P(L=1|I=0)$.
- Evidential reasoning: conditioning on a descendant to determine an ancestor's probability, e.g. $P(I=1|L=0)$.
- Intercausal reasoning: consider $P(I=1|G=3,D=1)$ versus $P(I=1|G=3)$; learning that the course was difficult partly "explains away" the low grade, raising our belief in the student's intelligence.

As the simplest example of intercausal reasoning, consider an `OR` gate:

Knowing the output is 1 and that one input is 0 forces the other input to be 1; the inputs, independent a priori, become dependent once the output is observed.
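This "explaining away" effect can be demonstrated by enumeration. A minimal sketch, assuming two independent inputs with illustrative priors of 0.5 each and a deterministic `OR` output:

```python
from itertools import product

# Joint over (x1, x2, y) where y = x1 OR x2 deterministically.
p = {}
for x1, x2 in product((0, 1), repeat=2):
    y = x1 | x2
    p[(x1, x2, y)] = 0.5 * 0.5

def cond(query, evidence):
    """P(query | evidence) by enumeration; dicts map tuple index -> value."""
    both = {**query, **evidence}
    num = sum(v for k, v in p.items() if all(k[i] == val for i, val in both.items()))
    den = sum(v for k, v in p.items() if all(k[i] == val for i, val in evidence.items()))
    return num / den

print(cond({1: 1}, {2: 1}))        # P(X2=1 | Y=1)        ≈ 0.667
print(cond({1: 1}, {2: 1, 0: 0}))  # P(X2=1 | Y=1, X1=0)  = 1.0
print(cond({1: 1}, {2: 1, 0: 1}))  # P(X2=1 | Y=1, X1=1)  = 0.5
```

Observing $X_1 = 1$ fully explains the output, dropping belief in $X_2$ back to its prior; observing $X_1 = 0$ forces $X_2 = 1$.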

There are a few different structures in which a variable $X$ can **influence** a variable $Y$ (when no evidence is observed):

- $X \to Y$
- $X \leftarrow Y$
- $X \to W \to Y$
- $X \leftarrow W \leftarrow Y$
- $X \leftarrow W \to Y$

Which the different reasonings described above capture.

The one structure which "blocks" influence is the **v-structure**, $X \to W \leftarrow Y$.

A **trail** is a sequence of nodes that are connected to each other by single edges in the graph. When there is no evidence, a trail is **active** if it has no v-structures.

When can variable $X$ influence variable $Y$ given evidence $Z$? In these structures, always:

- $X \to Y$
- $X \leftarrow Y$

And in these structures, under the stated conditions:

- $X \to W \to Y$, if $W \notin Z$
- $X \leftarrow W \leftarrow Y$, if $W \notin Z$
- $X \leftarrow W \to Y$, if $W \notin Z$
- $X \to W \leftarrow Y$, if either $W \in Z$ or one of $W$'s descendants $\in Z$ (intercausal reasoning)

A trail is *active* given evidence $Z$ if every v-structure along it is activated in this way and no other node along it is in $Z$.

For events $\alpha$ and $\beta$, independence means:

- $P(\alpha,\beta) = P(\alpha)P(\beta)$
- $P(\alpha|\beta) = P(\alpha)$
- $P(\beta|\alpha) = P(\beta)$

This can be generalized to random variables:

- $P(X,Y) = P(X)P(Y)$
- $P(X|Y) = P(X)$
- $P(Y|X) = P(Y)$

For (sets of) random variables $X$, $Y$, and $Z$, conditional independence ($X \perp Y | Z$) means:

- $P(X,Y|Z) = P(X|Z) P(Y|Z)$
- $P(X|Y,Z) = P(X|Z)$
- $P(Y|X,Z) = P(Y|Z)$
- $P(X,Y,Z) \varpropto \phi_1(X,Z) \phi_2(Y,Z)$; that is, the probability of the joint distribution $P(X,Y,Z)$ is proportional to a product of the two factors $\phi_1(X,Z)$ and $\phi_2(Y,Z)$

For example:

There are two coins, one is fair and one is biased to show heads 90% of the time.

You pick a coin, toss it, and it comes up heads.

The probability of heads is *higher* in the second toss. You don't know which coin you have, but heads on the first toss makes it more likely that you have the biased coin, and thus there is a higher chance of heads on the second toss. So the two tosses are not marginally independent, though they *are* independent given which coin was picked.
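Running the numbers for this example confirms it (fair coin $P(H) = 0.5$, biased coin $P(H) = 0.9$, each picked with probability 0.5):

```python
p_coin = {"fair": 0.5, "biased": 0.5}
p_heads = {"fair": 0.5, "biased": 0.9}

# Marginal probability of heads on a single toss.
p_h1 = sum(p_coin[c] * p_heads[c] for c in p_coin)          # ≈ 0.7

# P(H1, H2): the tosses are conditionally independent given the coin.
p_h1_h2 = sum(p_coin[c] * p_heads[c] ** 2 for c in p_coin)  # ≈ 0.53

# P(H2 | H1) > P(H2): the first heads is evidence for the biased coin.
p_h2_given_h1 = p_h1_h2 / p_h1                              # ≈ 0.757
print(p_h2_given_h1)
```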

But note that conditioning can also lose you independence. For example, in the previous student example, intelligence and course difficulty are independent until we condition on the grade (the same phenomenon as in the `OR` gate example).

We say that $X$ and $Y$ are **d-separated** in a graph $G$ given $Z$ if there is no active trail between them given $Z$.

If $X$ and $Y$ are d-separated given $Z$, then they are conditionally independent given $Z$.

Any node is d-separated from its non-descendants given its parents.

So if a distribution $P$ factorizes over a graph $G$, the independencies implied by $G$'s d-separations hold in $P$.

We can notate the set of independencies implicit in a graph $G$ as $I(G)$.

If $P$ satisfies $I(G)$, we say $G$ is an **I-map** (**independency map**) of $P$.

This does not mean $G$ captures *all* independencies in $P$; it may capture only a subset of them.

So if $P$ factorizes over $G$, then $G$ is an I-map of $P$.

Within a model you may have structures which repeat throughout or you may want to reuse common structures between/across models.

In these cases we may use **template variables**.

A template variable *arguments*.

A **template model** is a language which specifies how "ground" variables inherit dependency models from templates.

A common example of template models are **temporal models**, used for systems which evolve over time.

When representing a distribution over continuous time, you typically want to *discretize* time. To do this, you pick a *time granularity* $\Delta$.

We also have a set of template variables.

That is, $X^{(t)}$ denotes the value of template variable $X$ at time $t\Delta$.

We want to represent the distribution over trajectories, $P(X^{(0)}, \dots, X^{(T)})$, for any $T$.

To simplify this, we can use the **Markov assumption**, a type of conditional independence assumption.

Without this assumption, we have:

$P(X^{(0:T)}) = P(X^{(0)}) \prod_{t=0}^{T-1} P(X^{(t+1)} | X^{(0:t)})$

(this is just using the chain rule for probability)

Then the Markov assumption is that the next step is independent of the past given the present:

$X^{(t+1)} \perp X^{(0:t-1)} \,|\, X^{(t)}$

So then we can simplify our distribution:

$P(X^{(0:T)}) = P(X^{(0)}) \prod_{t=0}^{T-1} P(X^{(t+1)} | X^{(t)})$

The Markov assumption isn't always appropriate, or it may be too strong.

You can make it a better approximation by adding other variables about the state, in addition to $X^{(t)}$.

The second assumption we make is of **time invariance**.

We use a single template probability model $P(X'|X)$ for every time step.

That is, for all $t$:

$P(X^{(t+1)} = x' | X^{(t)} = x) = P(X' = x' | X = x)$

That is, the probability distribution is not influenced by the time $t$.

Again, this is an approximation and is not always appropriate. Traffic, for example, has a different dynamic depending on what time of day it is.

Again, you can include extra variables to capture other aspects of the state of the world to improve the approximation.

- $W$ = weather
- $V$ = velocity
- $L$ = location
- $F$ = failure
- $O$ = observation

The left column of the graph is at time slice $t$; the right column (the primed variables) is at time slice $t+1$.

The edges connecting nodes at time $t$ to nodes at time $t+1$ are **inter-time-slice** edges, and the edges connecting nodes within the same time slice are **intra-time-slice** edges.

We can describe a conditional probability distribution (CPD) for our prime variables as such:

$P(W', V', L', F', O' | W, V, L, F)$

We don't need a CPD for the non-prime variables because they have already "happened".

We can rewrite this distribution with the independence assumptions in the graph:

$P(W'|W)\,P(V'|W,V)\,P(L'|L,V)\,P(F'|F,W)\,P(O'|L',F')$

Here the observation $O'$ is conditioned on other variables within its own time slice ($L'$ and $F'$), i.e. the relation is *intra-time-slice*.

All the other variables are conditioned on the previous time slice, i.e. they are *inter-time-slice* relations.

Now we start with some initial state (time slice 0, giving $X^{(0)}$).

Then we add on the next time slice, using the transition model to get $X^{(1)}$.

And we can repeatedly do this to represent all subsequent time slices.

So we have a **2-time-slice Bayesian network** (2TBN). A transition model (2TBN) over variables $X_1, \dots, X_n$ is specified as a BN **fragment** such that:

- the nodes include $X_1', \dots, X_n'$ (next time slice $t+1$) and a subset of $X_1, \dots, X_n$ (time slice $t$)
- only the nodes $X_1', \dots, X_n'$ have parents and a CPD

The 2TBN defines a conditional distribution using the chain rule:

$P(X'|X) = \prod_{i=1}^n P(X_i' | \text{Pa}(X_i'))$

We can consider a Markov model as a chain-structured Bayes' Net, so our reasoning there applies here as well.

Each node is a state in the sequence and each node is identically distributed (stationary) and depends on the previous state, i.e. $P(X_t | X_{t-1})$.

The parameters of a Markov model are the **transition probabilities** (or **dynamics**) $P(X_t|X_{t-1})$ and the initial state probabilities (i.e. the initial distribution $P(X_0)$).

Say we want to know $P(X_t)$. We can use the **forward algorithm**, which is just an instance of variable elimination (in the order $X_0, X_1, \dots$):

$P(x_t) = \sum_{x_{t-1}} P(x_t | x_{t-1}) P(x_{t-1})$

For most chains, $P(X_t)$ converges to a **stationary distribution**. The influence of the initial state fades away as $t$ grows.

The key insight for a stationary distribution is that $P(X_t) = P(X_{t-1})$; the distribution no longer changes from step to step.

Formally, the stationary distribution $P_\infty$ satisfies:

$P_\infty(X) = \sum_{x} P(X | x) P_\infty(x)$
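One way to find the stationary distribution is simply to iterate the time update until the belief stops changing. A sketch with illustrative sun/rain transition probabilities:

```python
# P(next | current) for a two-state weather chain (illustrative numbers).
T = {("sun", "sun"): 0.9, ("sun", "rain"): 0.1,
     ("rain", "sun"): 0.3, ("rain", "rain"): 0.7}
states = ("sun", "rain")

belief = {"sun": 1.0, "rain": 0.0}  # arbitrary initial distribution
for _ in range(100):
    # Time update: P(x_t) = sum over x_{t-1} of P(x_t | x_{t-1}) P(x_{t-1}).
    belief = {s: sum(T[(prev, s)] * belief[prev] for prev in states)
              for s in states}

print(belief)  # converges to roughly {'sun': 0.75, 'rain': 0.25}
```

The result is a fixed point of the update: plugging it back into the sum returns the same distribution, which is exactly the stationarity equation above.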

A dynamic Bayes' net (DBN) is a Bayes' net replicated through time, i.e. variables at time $t$ can be conditioned on variables from time $t-1$.

A dynamic Bayesian network over $X_1, \dots, X_n$ is defined by:

- a 2TBN $\text{BN}_{\to}$ over $X_1, \dots, X_n$
- a Bayesian network $\text{BN}^{(0)}$ over $X_1^{(0)}, \dots, X_n^{(0)}$ (time 0, i.e. the initial state)

For a trajectory over time slices $0, \dots, T$ we define a **ground (unrolled) network** such that:

- the dependency model for $X_1^{(0)}, \dots, X_n^{(0)}$ is copied from $\text{BN}^{(0)}$
- the dependency model for $X_1^{(t)}, \dots, X_n^{(t)}$ for all $t > 0$ is copied from $\text{BN}_{\to}$

That is, it is just an aggregate ("unrolled" copy) of the previously shown network up to time slice $T$.

Often we have a sequence of observations and we want to use these observations to learn something about the underlying process that generated them. As such we need to introduce time or space to our models.

A **Hidden Markov Model** (HMM) is a simple dynamic Bayes' net. In particular, it is a Markov model in which we don't directly observe the state. That is, there is a Markov chain whose states we don't see directly; we only see observations emitted from those states.

The actual observations are stochastic (e.g. an underlying state may produce one of many observations with some probability). We try to infer the state based on these observations.

For example, imagine we are in a windowless room and we want to know if it's raining. We can't directly observe whether it's raining, but we can see if people have brought umbrellas with them.

It is also a 2TBN.

HMMs are used to analyze or to predict time series involving noise or uncertainty.

There is a sequence of states $S_1, \dots, S_T$.

Each state **emits** a measurement/observation.

Together, these define a Bayes network that is at the core of HMMs.

An HMM is defined by:

- a state variable $S$ and an *observation* (sometimes called **emission**) variable $O$
- the initial distribution $P(S_0)$
- the transition model $P(S'|S)$
- the observation model $P(O|S)$ (the probability of seeing evidence given the hidden state, also called an *emission model*)

We introduce an additional conditional independence assumption - that the current observation is independent of everything else given the current state.

You can unroll this:

HMMs, however, may also have internal structures, more commonly in the transition model, but sometimes in the observation model as well.

TODO in the following

Say we have the following HMM:

We don't know the starting state, but we know the probabilities:

Say on the first day we see that this person is happy and we want to know whether or not it is raining. That is, we want $P(R_1 | H_1)$.

We can use Bayes' rule to compute this posterior:

$P(R_1|H_1) = \frac{P(H_1|R_1)P(R_1)}{P(H_1)}$

We can compute these values by hand:

Then you can just run the numbers.

The first base case: consider the start of an HMM, where we observe the first piece of evidence $e_1$.

Inferring $P(X_1|e_1)$:

$P(x_1|e_1) = \frac{P(x_1,e_1)}{P(e_1)} = \frac{P(x_1)P(e_1|x_1)}{P(e_1)} \varpropto P(x_1)P(e_1|x_1)$

That is, we applied the definition of conditional probability and then expanded the numerator with the product rule.

For an HMM, $P(x_1)$ comes from the initial distribution and $P(e_1|x_1)$ from the observation model, so we can compute this directly (normalizing at the end).

The second base case:

Say we want to infer $P(X_2)$ from $P(X_1)$.

That is, rather than observing evidence, time moves forward one step.

For an HMM,

$P(x_2) = \sum_{x_1} P(x_2|x_1) P(x_1)$

So we can compute the state distribution at any later time by repeatedly applying this update.

From these two base cases we can do all that we need with HMMs.

Assume that we have the current belief $B(X_t) = P(X_t | e_{1:t})$.

After one time step passes, we have:

$P(X_{t+1}|e_{1:t}) = \sum_{x_t} P(X_{t+1}|x_t) P(x_t|e_{1:t})$

Which can be written compactly as:

$B'(X_{t+1}) = \sum_{x_t} P(X_{t+1}|x_t) B(x_t)$

Intuitively, what is happening here is: we look at each place we could have been, $x_t$, and sum up the probabilities of transitioning from there, weighted by our belief that we were there.

Assume that we have the current belief $B'(X_{t+1}) = P(X_{t+1}|e_{1:t})$ and then observe new evidence $e_{t+1}$.

Then:

$B(X_{t+1}) \varpropto P(e_{t+1}|X_{t+1}) B'(X_{t+1})$

See the above base case for observing evidence - this is just that, and remember, renormalize afterwards.

Another way of putting this:

Now we can consider the forward algorithm (the one presented previously was a simplification).

We are given evidence at each time and want to know:

$P(X_t | e_{1:t})$

We can derive the following update:

$P(x_t|e_{1:t}) \varpropto P(e_t|x_t) \sum_{x_{t-1}} P(x_t|x_{t-1}) P(x_{t-1}|e_{1:t-1})$

Which we can normalize at each step (if we want $P(X_t|e_{1:t})$ itself rather than a quantity proportional to it).

This is just variable elimination with the order $X_1, X_2, \dots, X_{t-1}$.

This computation is proportional to the square of the number of states.
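The alternating time/observation updates above can be sketched for a small two-state rain/umbrella HMM. All probabilities are illustrative, not from the text:

```python
states = ("rain", "sun")
P0 = {"rain": 0.5, "sun": 0.5}                      # initial distribution
T = {("rain", "rain"): 0.7, ("rain", "sun"): 0.3,   # P(next | current)
     ("sun", "rain"): 0.3, ("sun", "sun"): 0.7}
O = {("rain", "umbrella"): 0.9, ("rain", "none"): 0.1,  # P(obs | state)
     ("sun", "umbrella"): 0.2, ("sun", "none"): 0.8}

def forward(observations):
    """Return P(X_t | e_{1:t}) after folding in each observation in turn."""
    # First base case: incorporate the first observation.
    belief = {s: P0[s] * O[(s, observations[0])] for s in states}
    for e in observations[1:]:
        # Time update: sum over each place we could have been.
        predicted = {s: sum(T[(prev, s)] * belief[prev] for prev in states)
                     for s in states}
        # Observation update: weight by the evidence likelihood.
        belief = {s: O[(s, e)] * predicted[s] for s in states}
    z = sum(belief.values())  # normalize at the end
    return {s: v / z for s, v in belief.items()}

print(forward(["umbrella", "umbrella", "none"]))
```

The two nested sums per step over the state set are what make each update quadratic in the number of states.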

With the **Most Likely Explanation** (MLE) query, the concern is not the state at time $t$ but the most likely *sequence* of states up to time $t$.

For MLE, we use an HMM and instead we want to know:

$\arg\max_{x_{1:t}} P(x_{1:t} | e_{1:t})$

We can use the **Viterbi algorithm** to solve this, which is essentially just the forward algorithm where the sum is replaced by a max:

$m_t[x_t] = P(e_t|x_t) \max_{x_{t-1}} P(x_t|x_{t-1})\, m_{t-1}[x_{t-1}]$

In contrast, the forward algorithm:

$f_t[x_t] = P(e_t|x_t) \sum_{x_{t-1}} P(x_t|x_{t-1})\, f_{t-1}[x_{t-1}]$
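The sum-to-max swap, plus backpointers to recover the winning sequence, can be sketched on a small two-state HMM (all probabilities illustrative):

```python
states = ("rain", "sun")
P0 = {"rain": 0.5, "sun": 0.5}
T = {("rain", "rain"): 0.7, ("rain", "sun"): 0.3,
     ("sun", "rain"): 0.3, ("sun", "sun"): 0.7}
O = {("rain", "umbrella"): 0.9, ("rain", "none"): 0.1,
     ("sun", "umbrella"): 0.2, ("sun", "none"): 0.8}

def viterbi(observations):
    """Return the most likely state sequence given the observations."""
    m = {s: P0[s] * O[(s, observations[0])] for s in states}
    backpointers = []
    for e in observations[1:]:
        bp, m_next = {}, {}
        for s in states:
            # Max over predecessors, where the forward algorithm would sum.
            best_prev = max(states, key=lambda p: T[(p, s)] * m[p])
            bp[s] = best_prev
            m_next[s] = O[(s, e)] * T[(best_prev, s)] * m[best_prev]
        backpointers.append(bp)
        m = m_next
    # Trace back from the best final state.
    seq = [max(states, key=lambda s: m[s])]
    for bp in reversed(backpointers):
        seq.append(bp[seq[-1]])
    return list(reversed(seq))

print(viterbi(["umbrella", "umbrella", "none"]))
```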

A common template model is a **plate model**.

Say we are repeatedly flipping a coin.

The surrounding box is the **plate**. The idea is that these are "stacked", one for each toss $t$.

The

Another way of visualizing this:

Where

Another example:

Plates may be *nested*:

If we were to draw this out for two courses and two students:

One oddity here is that now intelligence depends on both the student and the course.

Instead, we can use *overlapping* plates:

Plate models allow for **collective inference**, i.e. they allow us to look at the aggregate of these individual instances in order to find broader patterns.

More formally, a **plate dependency model**:

For a template variable

We get the following template CPD:

We can represent CPDs in tables.

But as we start to have more variables, this table can explode in size.

More generally, we can represent a CPD $P(X | Y_1, \dots, Y_k)$ as any function $\phi(X, Y_1, \dots, Y_k)$ such that

$\sum_x \phi(x, y_1, \dots, y_k) = 1$

for all assignments $y_1, \dots, y_k$.

There are many models for representing CPDs, including:

- deterministic CPDs
- tree-structured CPDs
- logistic CPDs and generalizations
- noisy OR/AND
- linear Gaussians and generalizations

**Context-specific independence** shows up in some CPD representations. It is a type of independence where we have a particular assignment $c$ to some variable $C$:

$X \perp_c Y \,|\, Z, c$

Which is to say this independence holds only for particular values of $C$.

For example, consider:

$X = Y_1 \lor Y_2$

Where $X$ is the deterministic `OR` of $Y_1$ and $Y_2$.

Consider:

- $X \perp Y_1 | y_2^0$? When $Y_2$ is false, $X$ just takes on the value of $Y_1$, so there's no context-specific independence here.
- $X \perp Y_1 | y_2^1$? When $Y_2$ is true, then it doesn't matter what value $Y_1$ takes, since $X$ will be true too. Thus we have context-specific independence.
- $Y_1 \perp Y_2 | x^0$? If we know $X$ is false, we already know $Y_1, Y_2$ are false, independent of each other. So we have context-specific independence here.
- $Y_1 \perp Y_2 | x^1$? We don't have context-specific independence here.

Say we have the following model:

That is, whether or not a student gets a job ($J$) depends on:

- $A$ - if they applied ($+a, -a$)
- $L$ - if they have a letter of recommendation ($+l, -l$)
- $S$ - if they scored well on the SAT ($+s, -s$)

We can represent the CPD as a tree structure.

Note that the notation at the leaf nodes is the probability of not getting the job and of getting it, i.e. $(P(-j), P(+j))$.

A bit more detail: we're assuming its possible that the student gets the job without applying, e.g. via a recruiter, in which case the SAT score and letter aren't important.

We also assume that if the student scored well on the SAT, the letter is unimportant.

We have three binary random variables. If we represented this CPD as a table, it would have $2^3 = 8$ rows. The tree structure instead exposes these context-specific independencies:

- $J \perp_c L | +a, +s$
- $J \perp_c L, S | -a$
- $J \perp_c L | +s, A$

This last one is just a compact representation of:

- $J \perp_c L | +s, +a$
- $J \perp_c L | +s, -a$

Consider another model:

Where the student chooses only one letter to submit.

The tree might look like:

Here the choice variable $C$ determines which letter the job outcome actually depends on.

This scenario has context-specific independence but also non-context-specific independence:

$L_1 \perp L_2 \,|\, J, C$

Because, if you break it down into its individual cases:

- $L_1 \perp_c L_2 | J, c_1$
- $L_1 \perp_c L_2 | J, c_2$

both are true.

This scenario relates to a class of CPDs called *multiplexer* CPDs:

Here we have some variables $Z_1, \dots, Z_k$, and $Y$ takes on the value of exactly *one* of these variables.

$A$ is the **multiplexer**, i.e. the "selector variable", taking a value from $1, \dots, k$.

For a multiplexer CPD, we have:

$P(Y | A, Z_1, \dots, Z_k) = \begin{cases} 1 & Y = Z_A \\ 0 & \text{otherwise} \end{cases}$

That is, the value of $A$ selects which of the $Z_i$ the variable $Y$ copies.

In a noisy OR CPD we introduce intermediary variables between the effect $Y$ and its causes $X_1, \dots, X_k$.

That is: each cause $X_i$ gets an intermediary variable $Z_i$ indicating whether $X_i$, when true, succeeds in turning $Y$ on.

Where $Y$ is then a deterministic `OR` of $Z_0, Z_1, \dots, Z_k$ (with $Z_0$ a "leak" variable allowing $Y$ to turn on by itself).

So if none of the $Z_i$ are on, $Y$ is off.

We can write this as a probability and consider the CPD of $Y$ being off:

$P(y^0 | x_1, \dots, x_k) = (1 - \lambda_0) \prod_{i : x_i = 1} (1 - \lambda_i)$

where $\lambda_i$ is the probability that cause $X_i$, when true, succeeds in turning $Y$ on, and $\lambda_0$ is the leak probability.

Thus:

$P(y^1 | x_1, \dots, x_k) = 1 - P(y^0 | x_1, \dots, x_k)$
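The CPD formula above is a one-liner in code. A sketch with illustrative $\lambda$ values:

```python
def noisy_or(lambdas, leak, x):
    """P(Y=1 | x) under a noisy-OR CPD, where x is a tuple of binary causes."""
    # Each active cause independently fails to turn Y on with prob (1 - lambda_i);
    # the leak lets Y turn on even with no active causes.
    p_y_off = 1 - leak
    for lam, xi in zip(lambdas, x):
        if xi == 1:
            p_y_off *= 1 - lam
    return 1 - p_y_off

lambdas = [0.8, 0.6]  # illustrative per-cause success probabilities
leak = 0.1

print(noisy_or(lambdas, leak, (0, 0)))  # just the leak: ≈ 0.1
print(noisy_or(lambdas, leak, (1, 1)))  # 1 - 0.9 * 0.2 * 0.4 ≈ 0.928
```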

A noisy OR CPD demonstrates **independence of causal influence**. We are assuming that we have a bunch of causes $X_1, \dots, X_k$ which each act independently to turn $Y$ on, with their influences combined only at the end.

Other CPDs for independence of causal influence include noisy AND, noisy MAX, etc.

Consider:

We have the temperature in a room and a sensor which measures the temperature.

The sensor is not perfect, so it usually around the right temperature, but not exactly.

We can represent this by saying the sensor reading $S$ is a Gaussian centered on the true temperature $T$:

$S \sim \mathcal{N}(T, \sigma_S^2)$

This model is a *linear Gaussian*.

We can make it more complex, assuming that the outside temperature will also affect the room temperature: the mean becomes a weighted combination of the room temperature $T$ and the outside temperature $O$.

The weights control how much each temperature contributes to the reading.

We can take it another step. Say there is a door $D$ which may be open or closed.

Now the parameters of the Gaussian depend on whether the door is open or closed.

This is a *conditional* linear Gaussian model, since its parameters are conditioned on the discrete variable $D$.

Generally, a linear Gaussian model looks like:

$Y \sim \mathcal{N}(a_0 + \textstyle\sum_{i=1}^k a_i x_i,\; \sigma^2)$

Where $Y$ is a continuous variable with continuous parents $X_1, \dots, X_k$: its mean is a linear function of its parents and its variance is fixed.

Then, conditional linear Gaussians introduce one or more discrete parents (only one, $A$, is shown here), which select the parameters:

$Y \sim \mathcal{N}(a_{A,0} + \textstyle\sum_{i=1}^k a_{A,i} x_i,\; \sigma_A^2)$
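Sampling from a conditional linear Gaussian is straightforward: look up the parameters for the discrete parent's value, then draw from a normal with a linear mean. A sketch of the door/sensor example, with made-up parameters:

```python
import random

# D -> (a, b, sigma) for S ~ N(a*T + b, sigma^2). All numbers illustrative.
params = {
    "closed": (1.0, 0.0, 0.5),
    "open":   (0.8, 3.0, 1.5),  # open door: outside air shifts the reading
}

def sample_sensor(temp, door, rng=random):
    """Draw one sensor reading given the room temperature and door state."""
    a, b, sigma = params[door]
    return rng.gauss(a * temp + b, sigma)

random.seed(0)
print(sample_sensor(20.0, "closed"))
print(sample_sensor(20.0, "open"))
```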

PGMs can be used to answer many queries, but the most common is probably the **conditional probability query**:

Given evidence $E = e$ and query variables $Y$, compute $P(Y | E = e)$.

Unfortunately, the problem of inference on graphical models is NP-Hard. In particular, the following are NP-Hard:

- **exact inference**: given a PGM $P_{\Phi}$, a variable $X$ and a value $x \in \text{Val}(X)$, compute $P_{\Phi}(X=x)$. Even just deciding if $P_{\Phi}(X=x) > 0$ is NP-hard.
- **approximate inference**: let $\epsilon < 0.5$. Given a PGM $P_{\Phi}$, a variable $X$, a value $x \in \text{Val}(X)$, and an observation $e \in \text{Val}(E)$, find a number $p$ that has $|P_{\Phi}(X=x|E=e) - p| < \epsilon$.

However, NP-hard is the worst-case result, and there are algorithms that perform well for most common cases.

Some conditional probability inference algorithms:

- variable elimination
- message passing over a graph
- belief propagation
- variational approximations
- random sampling instantiations
- Markov Chain Monte Carlo (MCMC)
- importance sampling

PGMs can also answer MAP queries:

We have a set of evidence $E = e$ and want the most likely joint assignment $y$ to the remaining variables $Y$, i.e. $\arg\max_y P(Y = y | E = e)$.

This is also an NP-hard problem, but there are also many algorithms that solve it efficiently for most cases.

Some MAP inference algorithms:

- variable elimination
- message passing over a graph
- max-product belief propagation
- using methods for integer programming
- for some networks, graph-cut methods
- combinatorial search

Given a query, i.e. a joint probability distribution we are interested in getting a value for, we can infer an answer for that query from a Bayes' net.

The simplest approach is **inference by enumeration** in which we extract the conditional probabilities from the Bayes' net and appropriately combine them together.

But this is very inefficient, especially because variables that aren't in the query require us to enumerate over all possible values for them. We lose most of the benefit of having this compact representation of joint distributions.

An alternative approach is **variable elimination**, which is still NP-hard, but faster than enumeration.

Variable elimination requires the notion of *factors*. Here are some factors:

- a joint distribution: $P(X,Y)$, which is just all entries $P(x,y)$ for all $x,y$, and sums to 1.

Example:

T | W | P |
---|---|---|
hot | sun | 0.4 |
hot | rain | 0.1 |
cold | sun | 0.2 |
cold | rain | 0.3 |

- a selected joint: $P(x,Y)$, i.e. we fix $X=x$, then look at all entries $P(x,y)$ for all $y$; sums to $P(x)$. This is a "slice" of the joint distribution.

Example:

T | W | P |
---|---|---|
cold | sun | 0.2 |
cold | rain | 0.3 |

- a single conditional: $P(Y|x)$, i.e. we fix $X=x$, then look at all entries $P(y|x)$ for all $y$; sums to 1.

Example:

T | W | P |
---|---|---|
cold | sun | 0.4 |
cold | rain | 0.6 |

- a family of conditionals: $P(X|Y)$, i.e. multiple conditional distributions at once, all entries $P(x|y)$ for all $x, y$; sums to $|Y|$ (one distribution per value of $Y$).

Example:

T | W | P |
---|---|---|
hot | sun | 0.8 |
hot | rain | 0.2 |
cold | sun | 0.4 |
cold | rain | 0.6 |

- a specified family: $P(y|X)$, i.e. we fix $y$ and look at all entries $P(y|x)$ for all $x$. This can sum to anything.

Example:

T | W | P |
---|---|---|
hot | rain | 0.2 |
cold | rain | 0.6 |

In general, when we write $P(Y_1, \dots, Y_N | X_1, \dots, X_M)$, we mean a factor: a multi-dimensional array whose values are the entries $P(y_1, \dots, y_N | x_1, \dots, x_M)$.

Any assigned/instantiated (lowercase) variable is a dimension that has been selected out of the array.

For example, if $X$ is instantiated as $x$ in $P(x, Y)$, the array no longer has a dimension for $X$; it varies only over $Y$.

Consider a simple Bayes' net: $R \rightarrow T \rightarrow L$

Where $R$ is whether it is raining, $T$ is whether there is traffic, and $L$ is whether we are late, each binary (e.g. $\text{Val}(R) = \{+r, -r\}$).

We are given the following factors for this Bayes' net:

$P(R)$:

R | P |
---|---|
+r | 0.1 |
-r | 0.9 |

$P(T|R)$:

R | T | P |
---|---|---|
+r | +t | 0.8 |
+r | -t | 0.2 |
-r | +t | 0.1 |
-r | -t | 0.9 |

$P(L|T)$:

T | L | P |
---|---|---|
+t | +l | 0.3 |
+t | -l | 0.7 |
-t | +l | 0.1 |
-t | -l | 0.9 |

For example, if we observe $L = +l$, we can select just the rows of $P(L|T)$ consistent with that evidence:

T | L | P |
---|---|---|
+t | +l | 0.3 |
-t | +l | 0.1 |

We can *join* factors, which gives us a new factor over the union of the variables involved.

For example, we can join on $R$: we multiply $P(R)$ and $P(T|R)$ entry-by-entry to get the factor $P(R,T)$:

R | T | P |
---|---|---|
+r | +t | 0.08 |
+r | -t | 0.02 |
-r | +t | 0.09 |
-r | -t | 0.81 |

After completing this join, the resulting factor $P(R,T)$ takes the place of the two factors $P(R)$ and $P(T|R)$ that we combined.
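A minimal sketch of the join (factor product) operation on these two tables, assuming a simple dict-based factor encoding of my own choosing (tuples of values keyed by an ordered variable scope):

```python
from itertools import product

# Domains for each variable in the rain/traffic/late example.
DOMAINS = {'R': ['+r', '-r'], 'T': ['+t', '-t'], 'L': ['+l', '-l']}

def join(scope1, f1, scope2, f2):
    """Factor product: pointwise multiply over the union of the scopes."""
    scope = scope1 + [v for v in scope2 if v not in scope1]
    out = {}
    for vals in product(*(DOMAINS[v] for v in scope)):
        a = dict(zip(scope, vals))
        out[vals] = (f1[tuple(a[v] for v in scope1)]
                     * f2[tuple(a[v] for v in scope2)])
    return scope, out

P_R = {('+r',): 0.1, ('-r',): 0.9}
P_T_given_R = {('+r', '+t'): 0.8, ('+r', '-t'): 0.2,
               ('-r', '+t'): 0.1, ('-r', '-t'): 0.9}

scope_RT, P_RT = join(['R'], P_R, ['R', 'T'], P_T_given_R)
# P_RT[('+r', '+t')] is 0.1 * 0.8 = 0.08 (up to float rounding)
```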

We can then join on $T$, multiplying $P(R,T)$ and $P(L|T)$ to get the full joint $P(R,T,L)$:

R | T | L | P |
---|---|---|---|
+r | +t | +l | 0.024 |
+r | +t | -l | 0.056 |
+r | -t | +l | 0.002 |
+r | -t | -l | 0.018 |
-r | +t | +l | 0.027 |
-r | +t | -l | 0.063 |
-r | -t | +l | 0.081 |
-r | -t | -l | 0.729 |

Now we have this joint distribution, and we can use the **marginalization** operation (also called **elimination**) on this factor - that is, we can sum out a variable to shrink the factor. We can only do this if the variable appears in only one factor.

For example, say we still had our factor $P(R,T)$. We can sum out $R$ to get $P(T)$:

T | P |
---|---|
+t | 0.17 |
-t | 0.83 |
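Marginalization is easy to sketch in code as well (same illustrative dict-based factor encoding as before: value tuples keyed by an ordered scope):

```python
def sum_out(var, scope, f):
    """Marginalize var out of factor f by summing matching entries."""
    new_scope = [v for v in scope if v != var]
    out = {}
    for vals, p in f.items():
        a = dict(zip(scope, vals))
        key = tuple(a[v] for v in new_scope)
        out[key] = out.get(key, 0.0) + p
    return new_scope, out

P_RT = {('+r', '+t'): 0.08, ('+r', '-t'): 0.02,
        ('-r', '+t'): 0.09, ('-r', '-t'): 0.81}

scope_T, P_T = sum_out('R', ['R', 'T'], P_RT)
# P_T is approximately {('+t',): 0.17, ('-t',): 0.83}
```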

So we can take our full joint distribution $P(R,T,L)$ and sum out $R$ to get $P(T,L)$:

T | L | P |
---|---|---|
+t | +l | 0.051 |
+t | -l | 0.119 |
-t | +l | 0.083 |
-t | -l | 0.747 |

Then we can further sum out $T$ to get $P(L)$:

L | P |
---|---|
+l | 0.134 |
-l | 0.866 |

This approach is equivalent to inference by enumeration (building up the full joint distribution, then taking it apart to get to the desired quantity).

However, we can use these operations (join and elimination) to find "shortcuts" to the desired quantity (i.e. marginalize early without needing to build the entire joint distribution first). This method is **variable elimination**.

For example, we can compute $P(L)$ with variable elimination as follows:

- join on $R$, as before, to get $P(R,T)$
- then eliminate (sum out) $R$ from $P(R,T)$ to get $P(T)$
- then join on $T$, i.e. $P(T)$ with $P(L|T)$, giving us $P(T,L)$
- then eliminate $T$, giving us $P(L)$

In contrast, the enumeration method required:

- join on $R$ to get $P(R,T)$
- join on $T$ to get $P(R,T,L)$
- eliminate $R$ to get $P(T,L)$
- eliminate $T$ to get $P(L)$

The advantage of variable elimination is that we never build a factor of more than two variables, whereas enumeration builds the full three-variable joint distribution $P(R,T,L)$.

In this case, we had no evidence (i.e. no fixed values) to work with. If we had evidence, we would first shrink the factors involving the observed variable, and the evidence would be retained in the final factor (since we can't sum it out once it's observed).

For example, say we observed $R = +r$.

We would take our initial factors and shrink those involving $R$ down to the rows consistent with $+r$:

R | P |
---|---|
+r | 0.1 |

R | T | P |
---|---|---|
+r | +t | 0.8 |
+r | -t | 0.2 |

And we would eventually end up with:

R | L | P |
---|---|---|
+r | +l | 0.026 |
+r | -l | 0.074 |

And then we could get $P(L|+r)$ by normalizing (dividing each entry by $P(+r) = 0.1$):

L | P |
---|---|
+l | 0.26 |
-l | 0.74 |

More concretely, the general variable elimination algorithm is as follows:

- start with a query $P(Q|E_1 = e_1, \dots, E_k = e_k)$, where $Q$ is the set of query variables
- start with the initial factors (i.e. the local conditional probability tables, instantiated by the evidence $E_1, \dots, E_k$; that is, shrink the factors involving the evidence)
- while there are still hidden variables (i.e. variables in the net that are not in $Q$ or among the evidence $E_1, \dots, E_k$):
    - pick a hidden variable $H$
    - join all factors mentioning $H$
    - eliminate (sum out) $H$
- finally, join all remaining factors and normalize. The resulting distribution will be $P(Q | e_1, \dots, e_k)$.
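The whole algorithm can be sketched end-to-end for the rain/traffic/late chain with no evidence (so the loop just joins and sums out each hidden variable; the dict-based factor encoding is an illustrative choice):

```python
from itertools import product

DOMAINS = {'R': ['+r', '-r'], 'T': ['+t', '-t'], 'L': ['+l', '-l']}

def join(factors):
    """Pointwise product of a list of (scope, table) factors."""
    scope = []
    for s, _ in factors:
        scope += [v for v in s if v not in scope]
    out = {}
    for vals in product(*(DOMAINS[v] for v in scope)):
        a = dict(zip(scope, vals))
        p = 1.0
        for s, f in factors:
            p *= f[tuple(a[v] for v in s)]
        out[vals] = p
    return scope, out

def sum_out(var, scope, f):
    """Marginalize var out of a factor."""
    new_scope = [v for v in scope if v != var]
    out = {}
    for vals, p in f.items():
        a = dict(zip(scope, vals))
        k = tuple(a[v] for v in new_scope)
        out[k] = out.get(k, 0.0) + p
    return new_scope, out

def eliminate(factors, hidden):
    """Variable elimination: for each hidden variable, join the factors
    that mention it, then sum it out; finally join what remains."""
    factors = list(factors)
    for h in hidden:
        touching = [f for f in factors if h in f[0]]
        rest = [f for f in factors if h not in f[0]]
        factors = rest + [sum_out(h, *join(touching))]
    return join(factors)

P_R = (['R'], {('+r',): 0.1, ('-r',): 0.9})
P_T_R = (['R', 'T'], {('+r', '+t'): 0.8, ('+r', '-t'): 0.2,
                      ('-r', '+t'): 0.1, ('-r', '-t'): 0.9})
P_L_T = (['T', 'L'], {('+t', '+l'): 0.3, ('+t', '-l'): 0.7,
                      ('-t', '+l'): 0.1, ('-t', '-l'): 0.9})

scope, P_L = eliminate([P_R, P_T_R, P_L_T], hidden=['R', 'T'])
# P_L is approximately {('+l',): 0.134, ('-l',): 0.866}
```

With no evidence the result is already normalized; with evidence, the factors would first be shrunk to the observed rows and the final factor normalized.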

The order in which you eliminate variables affects computational complexity in that some orderings generate larger factors than others. Again, the factor size is what influences complexity, so you want to use orderings that produce small factors.

For example, if a variable is mentioned in many factors, you generally want to avoid eliminating it until later on (usually last). This is because eliminating a variable mentioned in many factors means joining over all of those factors, which will probably produce a very large factor.

We can encode this in the algorithm by telling it to choose the next hidden variable that would produce the smallest factor (since factor sizes are relatively easy to compute without needing to actually produce the factor, just look at the number and sizes of tables that would have to be joined).

Unfortunately there isn't always an ordering with small factors, so variable elimination is great in many situations, but not all.

Another method for Bayes' net inference is **sampling**. This is an *approximate* inference method, but it can be much faster. Here, "sampling" essentially means "repeated simulation".

The basic idea:

- draw $N$ samples from a sampling distribution $S$
- compute an approximate posterior probability from the samples
- with enough samples, this converges to the true probability $P$

Sampling from a given distribution:

- get a sample $u$ from a uniform distribution over $[0,1)$
- convert this sample $u$ into an outcome for the given distribution by associating each outcome with a sub-interval of $[0,1)$, with sub-interval size equal to the probability of the outcome

For example, if we have the following distribution:

C | P(C) |
---|---|
red | 0.6 |
green | 0.1 |
blue | 0.3 |

Then we can map $u \in [0, 0.6)$ to red, $u \in [0.6, 0.7)$ to green, and $u \in [0.7, 1)$ to blue.
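A minimal sketch of this interval-mapping approach (the function name and dict encoding are illustrative; intervals follow the insertion order of the dict):

```python
import random

def sample_outcome(dist, u=None):
    """Map a uniform draw u in [0, 1) to an outcome via cumulative sub-intervals."""
    if u is None:
        u = random.random()
    cumulative = 0.0
    for outcome, p in dist.items():
        cumulative += p
        if u < cumulative:
            return outcome
    return outcome  # guard against float rounding near u = 1.0

dist = {'red': 0.6, 'green': 0.1, 'blue': 0.3}
# red gets [0, 0.6), green gets [0.6, 0.7), blue gets [0.7, 1)
```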

There are many different sampling strategies for Bayes' nets:

- prior sampling
- rejection sampling
- likelihood weighting
- Gibbs sampling

In practice, you typically want to use either likelihood weighting or Gibbs sampling.

We have a Bayes' net, and we want to sample the full joint distribution it encodes, but we don't want to have to build the full joint distribution.

Imagine we have a Bayes' net whose conditional probability tables we know.

We start from the nodes that have no parents, sampling each from its (unconditional) distribution.

Basically, we walk through the graph, sampling from the distribution at each node, and we choose a path through the graph such that we can condition on previously-sampled variables. This generates *one* final sample across the different variables. If we want more samples, we have to repeat this process.

Prior sampling (the procedure just described) generates each full sample $(x_1, \dots, x_n)$ with probability $S_{PS}(x_1, \dots, x_n) = \prod_i P(x_i | \text{parents}(X_i)) = P(x_1, \dots, x_n)$.

That is, it generates samples from the actual joint distribution the Bayes' net encodes, which is to say that this sampling procedure is **consistent**. This is worth mentioning because this isn't always the case; some sampling strategies sample from a different distribution and compensate in other ways.
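A sketch of prior sampling on the rain → traffic → late chain from earlier (the table encoding and helper names are illustrative):

```python
import random

P_R = {'+r': 0.1, '-r': 0.9}
P_T = {'+r': {'+t': 0.8, '-t': 0.2}, '-r': {'+t': 0.1, '-t': 0.9}}
P_L = {'+t': {'+l': 0.3, '-l': 0.7}, '-t': {'+l': 0.1, '-l': 0.9}}

def draw(dist):
    """Sample an outcome via cumulative sub-intervals of [0, 1)."""
    u, c = random.random(), 0.0
    for outcome, p in dist.items():
        c += p
        if u < c:
            return outcome
    return outcome

def prior_sample():
    r = draw(P_R)          # R has no parents
    t = draw(P_T[r])       # T conditioned on the sampled R
    l = draw(P_L[t])       # L conditioned on the sampled T
    return r, t, l

random.seed(0)
samples = [prior_sample() for _ in range(100000)]
frac = sum(l == '+l' for _, _, l in samples) / len(samples)
# frac converges to P(+l) = 0.134 as the number of samples grows
```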

Then we can use these samples to estimate probabilities by counting: the estimate of $P(X = x)$ is simply the fraction of samples in which $X = x$.

Prior sampling can be overkill, since many of the samples we generate may be irrelevant to the problem at hand. We can instead use the same approach but discard the irrelevant samples.

For instance, if we want to compute a conditional probability $P(Q | e)$, we can throw away every sample that disagrees with the evidence $e$ and count over only the samples we keep.

This method is called **rejection sampling** because we are rejecting samples that are irrelevant to our problem. This method is also consistent.
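A self-contained sketch of rejection sampling on the same chain (illustrative encoding), estimating $P(L = +l \mid R = +r)$, whose exact value is $0.8 \cdot 0.3 + 0.2 \cdot 0.1 = 0.26$:

```python
import random

P_R = {'+r': 0.1, '-r': 0.9}
P_T = {'+r': {'+t': 0.8, '-t': 0.2}, '-r': {'+t': 0.1, '-t': 0.9}}
P_L = {'+t': {'+l': 0.3, '-l': 0.7}, '-t': {'+l': 0.1, '-l': 0.9}}

def draw(dist):
    u, c = random.random(), 0.0
    for outcome, p in dist.items():
        c += p
        if u < c:
            return outcome
    return outcome

def prior_sample():
    r = draw(P_R)
    t = draw(P_T[r])
    return r, t, draw(P_L[t])

random.seed(0)
samples = [prior_sample() for _ in range(200000)]
kept = [(r, t, l) for r, t, l in samples if r == '+r']  # reject R != +r
estimate = sum(l == '+l' for _, _, l in kept) / len(kept)
# estimate is close to 0.26; note that roughly 90% of samples were rejected,
# since P(+r) = 0.1 -- exactly the weakness described below
```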

A problem with rejection sampling is that if the evidence is unlikely, we have to reject a lot of samples.

For example, if we wanted to estimate a probability conditioned on very unlikely evidence, almost all of our samples would be inconsistent with that evidence and thus rejected.

We could instead fix the evidence variables: when it comes time to sample an evidence variable, we just set it to its observed value instead of sampling it. On its own, however, this produces samples from the wrong distribution, because it ignores how probable the evidence actually was.

We can fix this by weighting each sample by the probability of the evidence given its sampled parents (i.e. $w = \prod_i P(e_i | \text{parents}(E_i))$). This method is **likelihood weighting**, and it is also consistent.

With likelihood weighting, we consider the evidence only for variables sampled after we fixed the evidence (that is, that come after the evidence node in our walk through the Bayes' net). Anything we sampled before did not take the evidence into account. It's possible that what we sample before we get to our evidence is very inconsistent with the evidence, i.e. makes it very unlikely and gives us a very low weight for our sample.
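A self-contained sketch of likelihood weighting on the same chain (illustrative encoding), estimating $P(R = +r \mid L = +l)$; the exact answer is $0.026 / 0.134 \approx 0.194$:

```python
import random

P_R = {'+r': 0.1, '-r': 0.9}
P_T = {'+r': {'+t': 0.8, '-t': 0.2}, '-r': {'+t': 0.1, '-t': 0.9}}
P_L = {'+t': {'+l': 0.3, '-l': 0.7}, '-t': {'+l': 0.1, '-l': 0.9}}

def draw(dist):
    u, c = random.random(), 0.0
    for outcome, p in dist.items():
        c += p
        if u < c:
            return outcome
    return outcome

random.seed(0)
num = den = 0.0
for _ in range(200000):
    r = draw(P_R)          # sample non-evidence variables as usual
    t = draw(P_T[r])
    w = P_L[t]['+l']       # fix L = +l; weight by likelihood of the evidence
    num += w * (r == '+r')
    den += w
estimate = num / den       # close to 0.194
```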

With Gibbs sampling, we fix our evidence and then instantiate all of our other variables, e.g. arbitrarily or at random.

Then, we sample a new value for one variable at a time, conditioned on the rest, though we keep the evidence fixed. We repeat this many times.

If we repeat this infinitely many times, the resulting sample comes from the correct distribution, and it is conditioned on both the upstream (pre-evidence) and downstream (post-evidence) variables.

Gibbs sampling is essentially a Markov model (hence it is a *Markov Chain* Monte Carlo method) in which the stationary distribution is the conditional distribution we are interested in.
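A self-contained sketch of Gibbs sampling on the same chain (illustrative encoding), again estimating $P(R = +r \mid L = +l) \approx 0.194$; the evidence $L = +l$ stays fixed while $R$ and $T$ are resampled in turn:

```python
import random

P_R = {'+r': 0.1, '-r': 0.9}
P_T = {'+r': {'+t': 0.8, '-t': 0.2}, '-r': {'+t': 0.1, '-t': 0.9}}
P_L = {'+t': {'+l': 0.3, '-l': 0.7}, '-t': {'+l': 0.1, '-l': 0.9}}

def draw(dist):
    u, c = random.random(), 0.0
    for outcome, p in dist.items():
        c += p
        if u < c:
            return outcome
    return outcome

def normalized(weights):
    z = sum(weights.values())
    return {k: v / z for k, v in weights.items()}

def resample_R(t):
    # P(R | t, +l) is proportional to P(R) P(t | R); L is independent of R given T
    return draw(normalized({r: P_R[r] * P_T[r][t] for r in P_R}))

def resample_T(r):
    # P(T | r, +l) is proportional to P(T | r) P(+l | T)
    return draw(normalized({t: P_T[r][t] * P_L[t]['+l'] for t in ('+t', '-t')}))

random.seed(0)
r, t = '-r', '-t'          # arbitrary initial instantiation
hits, n = 0, 300000
for _ in range(n):
    r = resample_R(t)      # resample one variable at a time,
    t = resample_T(r)      # conditioned on the rest
    hits += (r == '+r')
estimate = hits / n        # close to 0.194
```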

**Markov networks** are also called **Markov random fields**.

The simplest subclass is **pairwise Markov networks**.

Say we have the following scenario:

An idea is floating around and when, for example, Alice & Bob are hanging out, they may share the idea - they influence each other. We don't use a directed graph because the influence flows in both directions.

But how do you parametrize an undirected graph? We no longer have a notion of a conditional - that is, one variable conditioning another.

Well, we can just use factors, e.g. a factor $\phi_1(A,B)$ over the edge between $A$ and $B$:

A | B | $\phi_1(A,B)$ |
---|---|---|
$a^0$ | $b^0$ | 30 |
$a^0$ | $b^1$ | 5 |
$a^1$ | $b^0$ | 1 |
$a^1$ | $b^1$ | 10 |

These factors are sometimes called **affinity functions** or **compatibility functions** or **soft constraints**.

What do these numbers mean?

They indicate the "*local happiness*" of the variables $A$ and $B$ for each of their joint assignments; e.g. according to $\phi_1$, the happiest assignment is $a^0, b^0$ (weight 30).

We can define factors for the other edges as well, e.g. $\phi_2(B,C)$, $\phi_3(C,D)$, and $\phi_4(D,A)$:

B | C | $\phi_2(B,C)$ |
---|---|---|
$b^0$ | $c^0$ | 100 |
$b^0$ | $c^1$ | 1 |
$b^1$ | $c^0$ | 1 |
$b^1$ | $c^1$ | 100 |

C | D | $\phi_3(C,D)$ |
---|---|---|
$c^0$ | $d^0$ | 1 |
$c^0$ | $d^1$ | 100 |
$c^1$ | $d^0$ | 100 |
$c^1$ | $d^1$ | 1 |

D | A | $\phi_4(D,A)$ |
---|---|---|
$d^0$ | $a^0$ | 100 |
$d^0$ | $a^1$ | 1 |
$d^1$ | $a^0$ | 1 |
$d^1$ | $a^1$ | 100 |

Then we have the unnormalized measure:

$$\tilde{P}(A,B,C,D) = \phi_1(A,B) \cdot \phi_2(B,C) \cdot \phi_3(C,D) \cdot \phi_4(D,A)$$

This isn't a probability distribution because its numbers aren't in $[0,1]$ and don't sum to 1.

We can normalize it to get a probability distribution:

$$P(A,B,C,D) = \frac{1}{Z} \tilde{P}(A,B,C,D)$$

where the normalizing constant $Z = \sum_{A,B,C,D} \tilde{P}(A,B,C,D)$ is called the **partition function**.
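A small sketch of computing the partition function and the normalized distribution for these four factors (encoding each assignment as 0/1, with the row ordering $a^0 b^0, a^0 b^1, a^1 b^0, a^1 b^1$ assumed):

```python
from itertools import product

phi_AB = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}
phi_BC = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}
phi_CD = {(0, 0): 1, (0, 1): 100, (1, 0): 100, (1, 1): 1}
phi_DA = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}

def unnormalized(a, b, c, d):
    """The unnormalized measure: product of all four edge factors."""
    return phi_AB[a, b] * phi_BC[b, c] * phi_CD[c, d] * phi_DA[d, a]

# Partition function: sum over all 2^4 joint assignments.
Z = sum(unnormalized(*v) for v in product([0, 1], repeat=4))
P = {v: unnormalized(*v) / Z for v in product([0, 1], repeat=4)}

# Marginal over (A, B): sum out C and D.
marg_AB = {}
for (a, b, c, d), p in P.items():
    marg_AB[a, b] = marg_AB.get((a, b), 0.0) + p
best = max(marg_AB, key=marg_AB.get)  # (0, 1), i.e. a^0, b^1
```

Running this reproduces the mismatch discussed next: the marginal puts most of its mass on $a^0, b^1$ even though $\phi_1$ prefers $a^0, b^0$.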

Unfortunately, there is no natural mapping between the pairwise factors and the marginal probabilities of the distribution they generate.

For instance, say we are given the marginal probabilities of $A$ and $B$ under the distribution above:

A | B | $P(A,B)$ |
---|---|---|
$a^0$ | $b^0$ | 0.13 |
$a^0$ | $b^1$ | 0.69 |
$a^1$ | $b^0$ | 0.14 |
$a^1$ | $b^1$ | 0.04 |

Compare this to the factor $\phi_1(A,B)$ we started with:

A | B | $\phi_1(A,B)$ |
---|---|---|
$a^0$ | $b^0$ | 30 |
$a^0$ | $b^1$ | 5 |
$a^1$ | $b^0$ | 1 |
$a^1$ | $b^1$ | 10 |

The most likely joint assignment is $a^0, b^1$, even though $\phi_1$ gives its largest weight to $a^0, b^0$: the other factors in the network overwhelm $\phi_1$'s preference.

This is unlike Bayesian networks where the nodes were just conditional probabilities.

Formally, a *pairwise Markov network* is an undirected graph whose nodes are the random variables $X_1, \dots, X_n$, and where each edge $X_i - X_j$ is associated with a factor (also called a *potential*) $\phi_{ij}(X_i, X_j)$.

Pairwise Markov networks cannot represent all of the probability distributions we may be interested in. A pairwise Markov network over $n$ variables with $d$ values each has only $O(n^2 d^2)$ parameters, while a general distribution over those variables requires $O(d^n)$ parameters, so for larger $n$ the pairwise parameterization is not expressive enough.

Thus we generalize beyond pairwise Markov networks.

A **Gibbs distribution** is parameterized by a set of general factors $\Phi = \{\phi_1(D_1), \dots, \phi_k(D_k)\}$, where each scope $D_i$ can contain any number of variables.

We also have an unnormalized measure:

$$\tilde{P}_{\Phi}(X_1, \dots, X_n) = \prod_{i=1}^k \phi_i(D_i)$$

Where the partition function is:

$$Z_{\Phi} = \sum_{X_1, \dots, X_n} \tilde{P}_{\Phi}(X_1, \dots, X_n)$$

Thus we have:

$$P_{\Phi}(X_1, \dots, X_n) = \frac{1}{Z_{\Phi}} \tilde{P}_{\Phi}(X_1, \dots, X_n)$$

We can generate an **induced Markov network** $H_{\Phi}$ from a set of factors $\Phi$ by connecting $X_i$ and $X_j$ with an edge whenever they appear together in the scope of some factor in $\Phi$.

For example, a single factor $\phi(A,B,C)$ induces the edges $A - B$, $B - C$, and $A - C$; the factor set $\{\phi_1(A,B), \phi_2(B,C), \phi_3(A,C)\}$ induces exactly the same graph.

So multiple sets of factors can induce the same graph. We can go from a set of factors to a graph, but we can't uniquely go the other way.

We say a probability distribution $P$ **factorizes** over a Markov network $H$ if there is a set of factors $\Phi$ such that $P = P_{\Phi}$ and $H$ is the induced graph of $\Phi$.

We have **active trails** in Markov networks as well: a trail $X_1 - \dots - X_n$ is active given the observed variables $Z$ if no $X_i$ along the trail is in $Z$.

A commonly-used variant of Markov networks is **conditional random fields** (CRFs).

This kind of model is used for *task-specific prediction*, where we have a set of input/observed variables $X$ and a set of target variables $Y$ that we want to predict from them.

Using the graphical models we have seen so far is not the best fit here, because we don't want to spend modeling effort on the distribution over $X$: the inputs are always observed, and all we care about is $P(Y|X)$.

In this scenario, we can use a *conditional random field* representation:

$$\tilde{P}_{\Phi}(X, Y) = \prod_{i=1}^k \phi_i(D_i)$$

This looks just like a Gibbs distribution. The difference is in the partition function, which sums only over the target variables:

$$Z_{\Phi}(X) = \sum_Y \tilde{P}_{\Phi}(X, Y)$$

So a CRF is parameterized the same as a Gibbs distribution, but it is normalized differently.

The end result is:

$$P_{\Phi}(Y | X) = \frac{1}{Z_{\Phi}(X)} \tilde{P}_{\Phi}(X, Y)$$

which is a family of conditional distributions, one for each possible assignment of $X$.

In a Markov network, we have the concept of **separation**, which is like d-separation in Bayesian networks but we drop the "d" because they are not directed.

For example, consider a network in which every trail from $A$ to $E$ passes through some of the nodes $B$, $C$, and $D$.

We can separate $A$ and $E$ in several ways:

- $A$ and $E$ are separated given $B$ and $D$
- $A$ and $E$ are separated given $D$
- $A$ and $E$ are separated given $B$ and $C$

Like with Bayesian networks, we have a theorem: if $P$ factorizes over $H$ and $\text{sep}_H(X, Y | Z)$, then $P$ satisfies $(X \perp Y | Z)$.

We can say the independences induced by the graph $H$ are $I(H) = \{(X \perp Y | Z) : \text{sep}_H(X, Y | Z)\}$.

If $P$ satisfies $I(H)$, we say that $H$ is an **I-map** (independency map) of $P$.

We can also say that if $P$ factorizes over $H$, then $H$ is an I-map of $P$.

The converse is also true: for a positive distribution $P$, if $H$ is an I-map of $P$, then $P$ factorizes over $H$.

If a graph $H$ is an I-map of $P$, then we can read independences of $P$ directly off the graph.

How well can we capture a distribution $P$'s independences with a graph $G$?

We can denote all independences that hold in $P$ as $I(P) = \{(X \perp Y | Z) : P \models (X \perp Y | Z)\}$.

We know that if $P$ factorizes over $G$, then $G$ is an I-map of $P$, i.e. $I(G) \subseteq I(P)$.

The converse doesn't hold; $P$ may satisfy independences that do not appear in $I(G)$.

We want graphs which encode more independences because they are sparser (fewer parameters) and more informative.

So for sparsity, we want a **minimal I-map**; that is, an I-map from which no edge can be removed without breaking the I-map property. But a minimal I-map is still not sufficient for capturing all of $I(P)$.

Ideally, we want a **perfect map**, which is an I-map such that $I(G) = I(P)$.

It is possible that a perfect map for a distribution is not unique; that is, there may be other graphs which model the same set of independence assumptions and thus are also perfect maps.

When graphs model the same independence assumptions, we say they are **I-equivalent**. Most graphs have many I-equivalent variants.

Log-linear models allow us to incorporate local structure into undirected models.

In the original representation of the unnormalized density, we had:

$$\tilde{P} = \prod_{i=1}^k \phi_i(D_i)$$

We turn this into a linear form:

$$\tilde{P} = \exp\left(-\sum_j w_j f_j(D_j)\right)$$

Hence the name "log-linear", because the log is a linear function.

Each feature $f_j$ is a function with a scope $D_j$; different features can share the same scope.

We can further write it in the form:

$$\tilde{P} = \prod_j \exp(-w_j f_j(D_j))$$

which effectively turns each weighted feature into a factor of its own.

For example, say we have two binary variables $X_1, X_2$ and a factor $\phi(X_1, X_2)$ we want to represent in log-linear form.

We can define the following features using indicator functions (1 if true, else 0): $f^{ab}_{12}(X_1, X_2) = \mathbf{1}\{X_1 = a, X_2 = b\}$, one per joint assignment $(a, b)$.

So we have the log-linear model:

$$\tilde{P}(X_1, X_2) = \exp\left(-\sum_{a,b} w_{ab} f^{ab}_{12}(X_1, X_2)\right)$$

where choosing $w_{ab} = -\ln \phi(a, b)$ reproduces the factor exactly.

So we can represent any factor as a log-linear model by including the appropriate features.
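This equivalence is easy to check numerically. A small sketch, reusing the example factor values from the pairwise network earlier (30, 5, 1, 10) and encoding assignments as 0/1 (names and encoding are illustrative):

```python
import math

# One indicator feature per joint assignment, with weight w_ab = -ln(phi(a, b)).
phi = {(0, 0): 30.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 10.0}
weights = {ab: -math.log(v) for ab, v in phi.items()}

def f(ab, x1, x2):
    """Indicator feature: 1 if (X1, X2) equals the assignment (a, b), else 0."""
    return 1.0 if (x1, x2) == ab else 0.0

def p_tilde(x1, x2):
    """Log-linear form exp(-sum_j w_j f_j); recovers phi(x1, x2)."""
    return math.exp(-sum(w * f(ab, x1, x2) for ab, w in weights.items()))
```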

For example, say you want to develop a language model for labeling entities in text.

You have target labels $Y_i$ (e.g. person, location, organization, none) for the observed words $X_i$ in the text.

You could, for instance, have features such as an indicator for whether the word is capitalized, an indicator for whether the word ends in a particular suffix, an indicator for whether the word appears in a list of known names, and so on.

- *Bayesian Reasoning and Machine Learning*. David Barber.
- Probabilistic Graphical Models. Daphne Koller. Stanford University/Coursera.
- MIT 6.034 (Fall 2010): Artificial Intelligence. Patrick H. Winston. MIT.
- CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley (edX).
- Artificial Intelligence Planning. Dr. Gerhard Wickler, Prof. Austin Tate. The University of Edinburgh (Coursera). 2015.
- Intro to Artificial Intelligence. CS271. Peter Norvig, Sebastian Thrun. Udacity.