- complexity of joint distributions: for a multinomial distribution with $k$ states for each of its $n$ variables, the full distribution requires $k^n - 1$ parameters
- complexity can be reduced using a CPD graph structure such as a Bayes Net (BN)
- learning parameters of a BN: straightforward, like any CPD, you can use maximum likelihood estimates (MLE) or Bayesian estimates with a Dirichlet prior
Learning the structure of a BN
Includes local and global components...
Local: independence tests
Measures of deviance-from-independence between variables
For variables $x_i, x_j$ in dataset $\mathcal D$ of $M$ samples...
- Pearson's Chi-squared ($\chi^2$) statistic:
$$
d_{\chi^2}(\mathcal D) = \sum_{x_i, x_j} \frac{(M[x_i,x_j] - M \cdot \hat P(x_i) \cdot \hat P(x_j))^2}{M \cdot \hat P(x_i) \cdot \hat P(x_j)}
$$
Independence increases as this value approaches 0
- Mutual information (KL divergence) b/w joint and product of marginals:
$$
d_I(\mathcal D) = \frac{1}{M} \sum_{x_i,x_j} M[x_i, x_j] \log \frac{M[x_i, x_j]}{M[x_i]M[x_j]}
$$
Independence increases as this value approaches 0
A decision rule for accepting/rejecting hypothesis of independence
Choose some p-value $t$, acept if $d(\mathcal D) <= t$, else reject.
Global: scoring the structure
For a graph $\mathcal G$ with $n$ variables
- Log-likelihood score:
$$
\text{score}_L (\mathcal G: \mathcal D) = \sum_{\mathcal D} \sum_{i=1}^n \log \hat P (x_i | \text{parents}(x_i))
$$
- Bayesian score:
$$
\text{score}_B (\mathcal G: \mathcal D) = \log p(\mathcal D | \mathcal G) + \log p(\mathcal G)
$$
- Bayes information criterion (with Dirichlet prior over graphs):
$$
\text{score}_{BIC} (\mathcal G: \mathcal D) = l(\hat \theta) : \mathcal D) - \frac{\log M}{2} \text{Dim}(\mathcal G)
$$
Learning algorithms
- Constraint-based
- find best structure to explain determined dependencies
- sensitive to errors in testing individual dependencies
- Score-based
- search the space of networks to find high-scoring structure
- requires heuristics (e.g. greedy or branch-and-bound)
- Bayesian model averaging
- prediction over all structures
- may not have closed form
References