PowerPoint Presentation

Slide 1

Directed GraphicalProbabilistic Models

William W. Cohen

Machine Learning 10-601

Feb 2008

Slide 2

Some practical problems

• I have 3 standard d20 dice, 1 loaded die.

• Experiment: (1) pick a d20 uniformly at random then (2)roll it. Let A=d20 picked is fair and B=roll 19 or 20 withthat die. What is P(B)?

P(B) = P(B|A) P(A) + P(B|~A) P(~A)

= 0.1*0.75 + 0.5*0.25 = 0.2

•What is P(A=fair|B=criticalHit)?

“Loaded” means P(19 or 20)=0.5

Slide 3

Definition of Conditional Probability

P(A ^ B)

P(A|B) = -----------

P(B)

Corollaries:

P(B|A) = P(A|B) P(B) / P(A)

P(A ^ B) = P(A|B) P(B)

Chain rule

Bayes rule

Slide 4

The (highly practical) Monty Hall problem

•You’re in a game show.Behind one door is a prize.Behind the others, goats.

•You pick one of threedoors, say #1

•The host, Monty Hall,opens one door,revealing…a goat!

You now can either

• stick with your guess

• always change doors

• flip a coin and pick a new doorrandomly according to the coin

Slide 5

Some practical problems

• I have 1 standard d6 die, 2 loaded d6 die.

• Loaded high: P(X=6)=0.50 Loaded low: P(X=1)=0.50

• Experiment: pick two d6 uniformly at random (A) and roll them. Whatis more likely – rolling a seven or rolling doubles?

Three combinations: HL, HF, FL

P(D) = P(D ^ A=HL) + P(D ^ A=HF) + P(D ^ A=FL)

= P(D | A=HL)*P(A=HL) + P(D|A=HF)*P(A=HF) + P(A|A=FL)*P(A=FL)

Slide 6

A brute-force solution

Roll 1

Roll 2

1/3 * 1/6 * ½

1/3 * 1/6 * 1/10

…

Comment

doubles

seven

doubles

A joint probability table shows P(X1=x1 and … andXk=xk) for every possible combination of valuesx1,x2,…., xk

With this you can compute any P(A) where A is anyboolean combination of the primitive events(Xi=Xk), e.g.

• P(doubles)

• P(seven or eleven)

• P(total is higher than 5)

• ….

Slide 7

The Joint Distribution

Recipe for making a joint distributionof M variables:

Example: Booleanvariables A, B, C

Slide 8

The Joint Distribution

Recipe for making a joint distributionof M variables:

1.Make a truth table listing allcombinations of values of yourvariables (if there are M Booleanvariables then the table will have2M rows).

Example: Booleanvariables A, B, C

Slide 9

The Joint Distribution

Recipe for making a joint distributionof M variables:

1.Make a truth table listing allcombinations of values of yourvariables (if there are M Booleanvariables then the table will have2M rows).

2.For each combination of values,say how probable it is.

Example: Booleanvariables A, B, C

Prob

0.30

0.05

0.10

0.05

0.10

0.25

0.10

Slide 10

The Joint Distribution

Recipe for making a joint distributionof M variables:

1.Make a truth table listing allcombinations of values of yourvariables (if there are M Booleanvariables then the table will have2M rows).

2.For each combination of values,say how probable it is.

3.If you subscribe to the axioms ofprobability, those numbers mustsum to 1.

Example: Booleanvariables A, B, C

Prob

0.30

0.05

0.10

0.05

0.10

0.25

0.10

0.05

0.25

0.10

0.05

0.10

0.30

Slide 11

Using theJoint

One you have the JD you canask for the probability of anylogical expression involvingyour attribute

Slide 12

Using theJoint

P(Poor Male) = 0.4654

Slide 13

Using theJoint

P(Poor) = 0.7604

Slide 14

Inferencewith theJoint

Slide 15

Inferencewith theJoint

P(Male | Poor) = 0.4654 / 0.7604 = 0.612

Slide 16

Inference is a big deal

•I’ve got this evidence. What’s the chance that thisconclusion is true?

•I’ve got a sore neck: how likely am I to have meningitis?

•I see my lights are out and it’s 9pm. What’s the chance my spouseis already asleep?

•…

•Lots and lots of important problems can be solvedalgorithmically using joint inference.

•How do you get from the statement of the problem to the jointprobability table? (What is a “statement of the problem”?)

•The joint probability table is usually too big to represent explicitly,so how do you represent it compactly?

Slide 17

Problem 1: A description of the experiment

• I have 3 standard d20 dice, 1 loaded die.

• Experiment: (1) pick a d20 uniformly at random then (2)roll it. Let A=d20 picked is fair and B=roll 19 or 20 withthat die. What is P(B)?

P(A=fair)=0.75

P(A=loaded)=0.25

P(B=critical | A=fair=0.1

P(B=noncritical | A=fair)=0.9

P(B=critical | A=loaded)=0.5

P(B=noncritical | A=loaded)=0.5

P(A)

Fair

0.75

Loaded

0.25

P(B|A)

Fair

Critical

0.1

Fair

Noncritical

0.9

Loaded

Critical

0.5

Loaded

Noncritical

0.5

Slide 18

Problem 1: A description of the experiment

P(A)

Fair

0.75

Loaded

0.25

P(B|A)

Fair

Critical

0.1

Fair

Noncritical

0.9

Loaded

Critical

0.5

Loaded

Noncritical

0.5

This is

• an “influence diagram” (informal: what influences what)

• a Bayes net / belief network / … (formal: more later!)

• a description of the experiment (i.e., how data could be generated)

• everything you need to reconstruct the joint distribution

Chain rule:

P(A,B)=P(B|A) P(A)

problem with probs:Bayes net::word problem:algebra

Slide 19

Problem 2: a description

• I have 1 standard d6 die, 2 loaded d6 die.

• Loaded high: P(X=6)=0.50 Loaded low: P(X=1)=0.50

• Experiment: pick two d6 without replacement (D1,D2) and roll them.What is more likely – rolling a seven or rolling doubles?

P(D1)

Fair

0.33

LoadedLo

0.33

LoadedHi

0.33

P(D2|D1)

Fair

LoadedLo

0.5

Fair

LoadedHi

0.5

LoadedLo

Fair

0.5

LoadedLo

LoadedHi

0.5

LoadedHi

Fair

0.5

LoadedHi

LoadedLo

0.5

P(R|D1,D2)

Fair

1,1

1/6 * 1/2

Fair

1,2

1/6 * 1/10

…

Slide 20

Problem 2: a description

P(D2)

Fair

0.33

LoadedLo

0.33

LoadedHi

0.33

P(D2|D1)

Fair

LoadedLo

0.5

Fair

LoadedHi

0.5

LoadedLo

Fair

0.5

LoadedLo

LoadedHi

0.5

LoadedHi

Fair

0.5

LoadedHi

LoadedLo

0.5

P(R|D1,D2)

Fair

1,1

1/6 * 1/2

Fair

1,2

1/6 * 1/10

…

P(A ^ B) = P(A|B) P(B)

Chain rule

P(R,D1,D2) = P(R|D1,D2) P(D1,D2)

P(R,D1,D2) = P(R|D1,D2) P(D2|D1) P(D1)

Slide 21

Problem 2: another description

Is7

R2=1,2,…,6

R1=1,2,…,6

Is7=yes,no

Slide 22

The (highly practical) Monty Hall problem

•You’re in a game show. Behind onedoor is a prize. Behind the others,goats.

•You pick one of three doors, say #1

•The host, Monty Hall, opens onedoor, revealing…a goat!

•You now can either stick with yourguess or change doors

First guess

The money

The goat

Stick, orswap?

Second guess

P(A)

0.33

P(B)

0.33

P(D)

Stick

0.5

Swap

0.5

P(C|A,B)

0.5

1.0

…

Slide 23

The (highly practical) Monty Hall problem

First guess

The money

The goat

Stick orswap?

Second guess

P(A)

0.33

P(B)

0.33

P(C|A,B)

0.5

1.0

…

P(E|A,C,D)

…

Slide 24

The (highly practical) Monty Hall problem

First guess

The money

The goat

Stick orswap?

Second guess

P(A)

0.33

P(B)

0.33

P(C|A,B)

0.5

1.0

…

P(E|A,C,D)

…

…again by the chain rule:

P(A,B,C,D,E) =

P(E|A,C,D) *

P(D) *

P(C | A,B ) *

P(B ) *

P(A)

We could construct thejoint and computeP(E=B|D=swap)

Slide 25

The (highly practical) Monty Hall problem

First guess

The money

The goat

Stick orswap?

Second guess

P(A)

0.33

P(B)

0.33

P(C|A,B)

0.5

1.0

…

P(E|A,C,D)

…

The joint table has…?

3*3*3*2*3 = 162 rows

The conditionalprobability tables (CPTs)shown have … ?

3 + 3 + 3*3*3 + 2*3*3= 51 rows < 162 rows

Big questions:

• why are the CPTs smaller?

•how much smaller are theCPTs than the joint?

• can we compute the answersto queries like P(E=B|d) withoutbuilding the joint probabilitytables, just using the CPTs?

Slide 26

The (highly practical) Monty Hall problem

First guess

The money

The goat

Stick orswap?

Second guess

P(A)

0.33

P(B)

0.33

P(C|A,B)

0.5

1.0

…

P(E|A,C,D)

…

Why is the CPTsrepresentation smaller?Follow the money! (B)

E is conditionallyindependent of Bgiven A,D,C

Slide 27

Conditional Independence formalized

Definition: R and L are conditionally independent given M if

for all x,y,z in {T,F}:

P(R=x  M=y ^ L=z) = P(R=x  M=y)

More generally:

Let S1 and S2 and S3 be sets of variables.

Set-of-variables S1 and set-of-variables S2 areconditionally independent given S3 if for allassignments of values to the variables in the sets,

P(S1’s assignments  S2’s assignments & S3’s assignments)=

P(S1’s assignments  S3’s assignments)

Slide 28

The (highly practical) Monty Hall problem

First guess

The money

The goat

Stick orswap?

Second guess

What are the conditional indepencies?

• I<A, {B}, C> ?

• I<A, {C}, B> ?

• I<E, {A,C}, B> ?

• I<D, {E}, B> ?

• …

Slide 29

Bayes Nets Formalized

A Bayes net (also called a belief network) is an augmenteddirected acyclic graph, represented by the pair V , E where:

•V is a set of vertices.

•E is a set of directed edges joining vertices. No loops ofany length are allowed.

Each vertex in V contains the following information:

•The name of a random variable

•A probability distribution table indicating how theprobability of this variable’s values depends on allpossible combinations of parental values.

Slide 30

Building a Bayes Net

1.Choose a set of relevant variables.

2.Choose an ordering for them

3.Assume they’re called X1 .. Xm (where X1 is the first in theordering, X1 is the second, etc)

4.For i = 1 to m:

1.Add the Xi node to the network

2.Set Parents(Xi ) to be a minimal subset of {X1…Xi-1}such that we have conditional independence of Xi andall other members of {X1…Xi-1} given Parents(Xi )

3.Define the probability table of

P(Xi =k  Assignments of Parents(Xi ) ).

Slide 31

The general case

P(X1=x1 ^ X2=x2 ^ ….Xn-1=xn-1 ^ Xn=xn) =

P(Xn=xn ^ Xn-1=xn-1 ^ ….X2=x2 ^ X1=x1) =

P(Xn=xn  Xn-1=xn-1 ^ ….X2=x2 ^ X1=x1) * P(Xn-1=xn-1 ^…. X2=x2 ^ X1=x1) =

P(Xn=xn  Xn-1=xn-1 ^ ….X2=x2 ^ X1=x1) * P(Xn-1=xn-1 …. X2=x2 ^ X1=x1) *

P(Xn-2=xn-2 ^…. X2=x2 ^ X1=x1) =

So any entry in joint pdf table can be computed. And so anyconditional probability can be computed.

Slide 32

Building a Bayes Net: Example 1

1.Choose a set of relevant variables.

2.Choose an ordering for them

3.Assume they’re called X1 .. Xm(where X1 is the first in theordering, X1 is the second, etc)

4.For i = 1 to m:

•Add the Xi node to the network

•Set Parents(Xi ) to be a minimalsubset of {X1…Xi-1} such that wehave conditional independenceof Xi and all other members of{X1…Xi-1} given Parents(Xi )

•Define the probability table of

P(Xi =k  values of Parents(Xi ) )

D1 – first die; D2 – second; R – roll, Is7

D1, D2, R, Is7

Is7

P(D2)

Fair

0.33

P(D2|D1)

Fair

0.5

Fair

0.5

Fair

0.5

Fair

0.5

P(R|D1,D2)

Fair

1,1

1/6 * 1/2

Fair

1,2

1/6 * 1/10

…

Is7

P(D2|D1)

1,1

1.0

1,2

1.0

…

1,6

Yes

1.0

2,1

…

P(D1,D2,R,Is7)=

P(Is7|R)*P(R|D1,D2)*P(D2|D1)*P(D1)

Slide 33

Another construction

Is7

R2=1,2,…,6

R1=1,2,…,6

Is7=yes,no

Pick the order D1, D2, R1, R2, Is7

Network will be:

What if I pick otherorders? WhataboutR2,R1,D2,D1,Is7?

What about thefirst network AB?suppose I pickedthe order B,A ?

Slide 34

Another simple example

•Bigram language model:

•Pick word w0 from a multinomial 1 P(w0)

•For t=1 to 100

•Pick word wt from multinomial 2 P(wt|wt-1)

…

w100

P(w0)

aardvark

0.0002

…

the

0.179

…

zymurgy

0.097

Wt-1

P(w0)

aardvark

0.00000001

aardvark

…

aardvark

tastes

0.1275

…

aardvark

zymurgy

0.0000076

Slide 35

A simple example

•Bigram languagemodel:

•Pick word w0 from amultinomial 1 P(w0)

•For t=1 to 100

•Pick word wt frommultinomial 2P(wt|wt-1)

…

w100

P(w0)

aardvark

0.0002

…

the

0.179

…

zymurgy

0.097

Wt-1

P(wt|wt-1)

aardvark

0.00000001

aardvark

…

aardvark

tastes

0.1275

…

aardvark

zymurgy

0.0000076

Size of CPTs: |V| + |V|2

Size of joint table: |V| ^ 101

|V|=20 for amino acids (proteins)

Slide 36

A simple example

•Trigram language model:

•Pick word w0 from a multinomial M1 P(w0)

•Pick word w1 from a multinomial M2 P(w1|w0)

•For t=2 to 100

•Pick word wt from multinomial M3 P(wt|wt-1,wt-2)

…

w99

Size of CPTs: |V| + |V|2 + |V|3

Size of joint table: |V| ^ 101

|V|=20 for amino acids (proteins)

w100

Slide 37

What Independencies does a Bayes Net Model?

•In order for a Bayesian network to model a probabilitydistribution, the following must be true by definition:

Each variable is conditionally independent of all its non-descendants in the graph given the value of all its parents.

•This implies

•But what else does it imply?

Slide 38

What Independencies does a Bayes Net Model?

•Example:

Given Y, does learning the value of Z tell us

nothing new about X?

I.e., is P(X|Y, Z) equal to P(X | Y)?

Yes. Since we know the value of all of X’sparents (namely, Y), and Z is not adescendant of X, X is conditionallyindependent of Z.

Also, since independence is symmetric,

P(Z|Y, X) = P(Z|Y).

Slide 39

Quick proof that independence is symmetric

•Assume: P(X|Y, Z) = P(X|Y)

•Then:

(Bayes’s Rule)

(Chain Rule)

(By Assumption)

(Bayes’s Rule)

Slide 40

What Independencies does a Bayes Net Model?

•Let I<X,Y,Z> represent X and Z being conditionally independentgiven Y.

•I<X,Y,Z>? Yes, just as in previous example: All X’s parents given,and Z is not a descendant.

Slide 41

What Independencies does a Bayes Net Model?

•I<X,{U},Z>? No.

•I<X,{U,V},Z>? Yes.

•Maybe I<X, S, Z> iff S acts a cutsetbetween X and Z in an undirectedversion of the graph…?

Slide 42

Things get a little more confusing

•X has no parents, so we know all its parents’values trivially

•Z is not a descendant of X

•So, I<X,{},Z>, even though there’s aundirected path from X to Z through anunknown variable Y.

•What if we do know the value of Y, though?Or one of its descendants?

Slide 43

The “Burglar Alarm” example

•Your house has a twitchy burglaralarm that is also sometimestriggered by earthquakes.

•Earth arguably doesn’t care whetheryour house is currently being burgled

•While you are on vacation, one ofyour neighbors calls and tells youyour home’s burglar alarm is ringing.Uh oh!

Burglar

Earthquake

Alarm

Phone Call

Slide 44

Things get a lot more confusing

•But now suppose you learnthat there was a medium-sizedearthquake in yourneighborhood. Oh, whew!Probably not a burglar after all.

•Earthquake “explains away” thehypothetical burglar.

•But then it must not be thecase that

I<Burglar,{Phone Call},Earthquake>, even though

I<Burglar,{}, Earthquake>!

Burglar

Earthquake

Alarm

Phone Call

Slide 45

d-separation to the rescue

•Fortunately, there is a relatively simple algorithm fordetermining whether two variables in a Bayesian networkare conditionally independent: d-separation.

•Definition: X and Z are d-separated by a set of evidencevariables E iff every undirected path from X to Z is“blocked”, where a path is “blocked” iff one or more of thefollowing conditions is true: ...

ie. X and Z are dependent iff there exists an unblocked path

Slide 46

A path is “blocked” when...

•There exists a variable V on the path such that

•it is in the evidence set E

•the arcs putting Y in the path are “tail-to-tail”

•Or, there exists a variable V on the path such that

•it is in the evidence set E

•the arcs putting Y in the path are “tail-to-head”

•Or, ...

unknown“commoncauses” of Xand Z imposedependency

unknown“causalchains”connecting Xan Z imposedependency

Slide 47

A path is “blocked” when… (the funky case)

•… Or, there exists a variable Y on the path such that

•it is NOT in the evidence set E

•neither are any of its descendants

•the arcs putting Y on the path are “head-to-head”

Known “commonsymptoms” of Xand Z imposedependencies… Xmay “explainaway” Z

Slide 48

d-separation to the rescue, cont’d

•Theorem [Verma & Pearl, 1998]:

•If a set of evidence variables E d-separates X and Z in a Bayesiannetwork’s graph, then I<X, E, Z>.

•d-separation can be computed in linear time using a depth-first-search-like algorithm.

•Great! We now have a fast algorithm for automatically inferringwhether learning the value of one variable might give us any additionalhints about some other variable, given what we already know.

•“Might”: Variables may actually be independent when they’re not d-separated, depending on the actual probabilities involved