Machine Learning 05: Graphical Models
- Graphical Models
- Bayesian Networks
- Bayesian Networks Definition
- Conditional Independence, Revisited
- Markov Blanket
- Probabilities in Bayes Nets
- Learning of Bayes Nets
- Learning CPTs from Fully Observed Data
- Estimate θ from partly observed data
- How can we learn Bayes Net graph structure?
- Kullback-Leibler Divergence
- Chow-Liu Algorithm
- Resources
Course Notes of Professor Tom Mitchell Machine Learning Course @ CMU, 2017
Graphical Models
- Key Ideas:
- Conditional Independence assumptions useful to reduce # of parameters
- but Naive Bayes is extreme!
- Graphical models express sets of conditional independence assumptions via graph structure
- Graph structure plus associated parameters define joint probability distribution over sets of parameters
- Why Grapical Models:
- xxx
Conditional Independence
P(thunder | rain, light) = P(thunder | light)
Margin Independence
(Any i, j) P(X = x_i | Y = y_j) = P(X = x_i)
(Any i, j) P(Y = y_j | X = x_i) = P(Y = y_j)
Bayesian Networks
Bayes Nets are convinient representation for encoding dependencies / conditional independence.
Bayesian Networks Definition
- 4 for
P(W | L, R) = P(W | Pa)
, as shown inW
column in the diagram - 2 for
- 2
- 2
- 1 for P(StormClouds | ?) = P(StormClouds)
Bayesian Networks Joint Distribution
The joint distribution over all variables:
P(X1 ... Xn) = ∏_i P(Xi | Pa(Xi))
in which, Pa(X) = Parents = Immediat Parents
So the joint distribution of the graph is:
xxxxxx
In this case, for each node Xi
actually describes P(Xi | Pa(Xi))
.
According to chain rule:
P(S,L,R,T,W) = P(S) P(L|S) P(R|S,L) P(T|S,L,R) P(W|S,L,R,T)
But in a Bayes Net:
P(X1 ... Xn) = ∏_i P(Xi | Pa(Xi))
P(S,L,R,T,W) = P(S) P(L|S) P(R|S) P(T|L) P(W|L,R)
P(S=1,L=0,R=1,T=0,W=1) = P(S=1) P(L=0|S=1) P(R=1|S=1) P(T=0|L=0) P(W=1|L=0,R=1)
The last item P(W=1|L=0,R=1)=0.2
according to the table above.
Learning a Bayes Net
Bayes Network for X1…4 with NO assumed conditional independencies
P() = P(X1)P(X2|X1)P(X3|X1,X2)P(X4|X1,X2,X3)
Bayes Network for Naive Bayes
P(Y|X1...Xn) = P(X1...Xn|Y) P(Y) / P(X1...Xn)
= P(X1|Y) P(X2|Y) ... P(Xn|Y) P(Y)
What if variables are mix of dicrete and real values?
Bayes Network for a Hidden Markov Model
Bayesian Networks Definition
A Bayes network represents the joint probability distribution over a collection of random variables.
A Bayes network is a directed acyclic graph and a set of conditional probability distributions (CPD’s)
- Each node denotes a random variable
- Edges denote dependencies
- For each node
Xi
its CPD definesP(Xi | Pa(Xi))
- The joint distribution over all variables is defined to be:
Conditional Independence, Revisited
- We said:
- Each node is conditionally independent of its non-descendents, given its immediate parents.
- Does this rule give us all of the conditional independence relations implied by the Bayes network?
- No!
- E.g.,
X1
andX4
are conditionally indep given{X2, X3}
- But
X1
andX4
not conditionally indep givenX3
- For this, we need to understand D-separation
Easy Network 1: Head to Tail
Easy Network 2: Tail to Tail
Easy Network 3: Head to Head
X
and Y
are conditionally independent given Z
, if and only if X
and Y
are D-seperated by Z
.
Suppose we have three sets of random variables: X, Y and Z
X
and Y
are D-separated by Z
(and therefore conditionally indep, given Z
) iff every path from every variable in X
to every variable in Y
is blocked.
A path from variable A
to variable B
is blocked by Z
if it includes a node such that either:
- arrows on the path meet either head-to-tail or tail-to-tail at the node and this node is in
Z
- the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, is in
Z
X1
indep ofX3
givenX2
? YX3
indep ofX1
givenX2
? YX4
indep ofX1
givenX2
? YX4
indep ofX1
givenX3
? NX4
indep ofX1
given{X3, X2}
? YX4
indep ofX1
given{}
? Y
a
indep ofb
givenc
? Na
indep ofb
givenf
? Ya
indep ofb
given{}
? Y
Markov Blanket
It is actually a "sufficient condition rule": saying that to compute xi
, we need and only need to consider the nodes in purple.
Why Markov Blanket is Useful for Inference
According to Markov Blanket, we can get:
P(h|a,c,h,g) = P(h|c,j)
Solve the equation as following:
Probabilities in Bayes Nets
All following parts in this section are based on this graph:
Prob. of joint assignment: easy
Suppose we are interested in joint assignment <F=f,A=a,S=s,H=h,N=n>
P(f,a,s,h,n) = P(f)P(a)P(s|f,a)P(h|s)P(n|s)
Prob. of marginal probabilities P(Xi): not so easy
How to calculate P(N=n)
?
P(N=1) = ΣfΣaΣhΣs P(N=1,f,a,h,s)
In this equation, we need calculate 16 values to get the marginal probabilities.
Generating a sample from joint distribution: easy
How can we generate random samples drawn according to P(F,A,S,H,N)
?
Any 21th century programming language has a primitive function you can call to generate random variables between 0 ~ 1 following uniform distribution.
To generate a random sample for roots of network ( F or A ):
- let θ = P(F=1) # look up from CPD
- r = random number drawn uniformly between 0 and 1
- if r<θ then output 1, else 0
To generate a random sample for S, given F, A:
- let θ = P(S=1|F=f,A=a) # look up from CPD
- r = random number drawn uniformly between 0 and 1
- if r<θ then output 1, else 0
e.g., P(N=n)
by generating many samples from joint distribution, then count the fraction of samples for which N=n
Similarly, for anything else we care about: P(F=1|H=1, N=0)
Gibbs Sampling
A commonly used method in research.
Approach:
- start with arbitrary initial values for unobserved X1(0) ,…,Xn(0) (and the observed Xn+1, ..., Xm)
-
iterate for s=0 to a big number:
Eventually (after burn-in), the collection of samples will constitute a sample of the true P(X1,…,Xn | Xn+1, ..., Xm)
Learning of Bayes Nets
- Four categories of learning problems
- Graph structure may be known/unknown
- Variable values may be fully observed / partly unobserved
- Easy case: learn parameters when graph structure is known, and training data is fully observed
- Interesting case: graph known, data partly observed
- Gruesome case: graph structure unknown, data partly unobserved
Learning CPTs from Fully Observed Data
- Example: Consider learning the parameter:
MLE
- MLE (Max Likelihood Estimate) is:
Remember why?
- Maximum likelihood estimate:
- In our case:
MAP
- Maximum likelihood estimate
- MAP estimate
Estimate θ from partly observed data
- What if FAHN observed, but not S?
- Can't calculate MLE:
- Let
X
be all observed variable values (over all examples) - Let
Z
be all unobserved variable values - Can’t calculate MLE:
- EM seeks the estimate:
EM Algorithm - Informally
EM is a general procedure for learning from partly observed data.
Given observed variables X, unobserved Z (X={F,A,H,N}, Z={S} in the example)
Begin with arbitrary choice for parameters θ:
Iterate until convergence:
• E Step: use X, θ to estimate the unobserved Z values
• M Step: use X values and estimated Z values to derive a better θ
Guaranteed to find local maximum.
Each iteration increases:
EM Algorithm - Precisely
EM is a general procedure for learning from partly observed data.
Given observed variables X, unobserved Z (X={F,A,H,N}, Z={S})
Define
Iterate until convergence:
• E Step: Use X and current θ to calculate P(Z|X,θ)
• M Step: Replace current θ by θ ← argmax Q(θ'|θ)
Example using EM
EM seeks the estimate:
here, observed X = {F,A,H,N}, unobserved Z = {S}
E Step: Use X, θ, to Calculate P(Z|X,θ)
Observed X = {F,A,H,N}, Unobserved Z = {S}
How? Bayes net inference problem.
Recall that MLE was:
More generally, Given observed set X, unobserved set Z of boolean values:
Using Unlabeled Data to Help Train Naïve Bayes Classifier
How can we learn Bayes Net graph structure?
In general case, open problem
- can require lots of data (else high risk of overfitting)
- can use Bayesian priors, or other kinds of prior assumptions about graph structure to constrain search
One key result:
- Chow-Liu algorithm: finds “best” tree-structured network
- What’s best?
- suppose P(X) is true distribution, T(X) is our tree-structured network, where X =
- Chow-Liu minimizes Kullback-Leibler divergence.
- suppose P(X) is true distribution, T(X) is our tree-structured network, where X =
Kullback-Leibler Divergence
- KL(P(X) || T(X)) is a measure of the difference between distribution P(X) and T(X)
- It is asymetric, always greater or equal to 0
- It is 0 iff P(X) = T(X)
Chow-Liu Algorithm
Key result: To minimize KL(P || T) over possible tree networks T approximating true P, it suffices to find the tree network T that maximizes the sum of mutual informations over its edges.
- for each pair of variables A,B, use data to estimate P(A,B), P(A), and P(B)
- for each pair A, B calculate mutual information
- calculate the maximum spanning tree over the set of variables, using edge weights I(A,B) — given N vars, this costs only O(N2) time
- add arrows to edges to form a directed-acyclic graph
- learn the CPD’s for this graph
E.g. Greedy Algorithm to find Max-Spanning Tree
Note that for each iteration, we need to check no loop is formed becaused we need to get a tree structure.
Also note that there should be many other edges in the graph like A—F, here is just a simple representation.
Chow-Liu for improving Naïve Bayes
Can we use structure learning to modify Naïve Bayes cond. indep. assumptions?
If X2 = living in Pittburg, X3 = attending CMU, the assumption of cond. indep. is not perfect fit. By adding dependencies between X variables, and prune using Chow-Liu Algorithm, the dependencies between X variables can be represented and learnt.
Resources
- Bishop Ch. 8