learning_causal

complexity of joint distributions: for a multinomial distribution with $k$ states for each of its $n$ variables, the full distribution requires $k^n - 1$ parameters
complexity can be reduced using a CPD graph structure such as a Bayes Net (BN)
learning parameters of a BN: straightforward, like any CPD, you can use maximum likelihood estimates (MLE) or Bayesian estimates with a Dirichlet prior

Learning the structure of a BN

Includes local and global components...

For variables $x_i, x_j$ in dataset $\mathcal D$ of $M$ samples...

$$ d_{\chi^2}(\mathcal D) = \sum_{x_i, x_j} \frac{(M[x_i,x_j] - M \cdot \hat P(x_i) \cdot \hat P(x_j))^2}{M \cdot \hat P(x_i) \cdot \hat P(x_j)} $$

Independence increases as this value approaches 0

$$ d_I(\mathcal D) = \frac{1}{M} \sum_{x_i,x_j} M[x_i, x_j] \log \frac{M[x_i, x_j]}{M[x_i]M[x_j]} $$

Independence increases as this value approaches 0

Choose some p-value $t$, acept if $d(\mathcal D) <= t$, else reject.

For a graph $\mathcal G$ with $n$ variables

$$ \text{score}_L (\mathcal G: \mathcal D) = \sum_{\mathcal D} \sum_{i=1}^n \log \hat P (x_i | \text{parents}(x_i)) $$

$$ \text{score}_B (\mathcal G: \mathcal D) = \log p(\mathcal D | \mathcal G) + \log p(\mathcal G) $$

$$ \text{score}_{BIC} (\mathcal G: \mathcal D) = l(\hat \theta) : \mathcal D) - \frac{\log M}{2} \text{Dim}(\mathcal G) $$

Constraint-based
- find best structure to explain determined dependencies
- sensitive to errors in testing individual dependencies
Score-based
- search the space of networks to find high-scoring structure
- requires heuristics (e.g. greedy or branch-and-bound)
Bayesian model averaging
- prediction over all structures
- may not have closed form