Hidden Markov Models & POS Tagging

Corpora and Statistical Methods Lecture 9

Acknowledgement

Some of the diagrams are from slides by David Bley (available on the companion website to Manning and Schütze 1999)

Formalisation of a Hidden Markov model

Part 1

Crucial ingredients (familiar)

Underlying states: S = {s1,…,sN}

Output alphabet (observations): K = {k1,…,kM}

State transition probabilities:

A = {aij}, i, j ∈ S

State sequence: X = (X1,…,XT+1)

+ a function mapping each Xt to a state s

Output sequence: O = (O1,…,OT)

where each Ot ∈ K

Crucial ingredients (additional)

Initial state probabilities:

Π = {πi}, i ∈ S

(tell us the initial probability of each state)

Symbol emission probabilities:

B = {bijk}, i, j ∈ S, k ∈ K

(tell us the probability bijk of seeing observation Ot = k at time t, given that Xt = si and Xt+1 = sj)
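The ingredients above can be written down directly as data. The following is a minimal sketch, assuming a toy model with two states and two output symbols; every number in it is hypothetical and chosen only for illustration. The names PI, A, B and the helper functions are not part of the lecture.

```python
# Toy arc-emission HMM; every number below is a hypothetical illustration.
STATES = ["s1", "s2"]            # S, with N = 2
ALPHABET = ["k1", "k2"]          # K, with M = 2

PI = [0.6, 0.4]                  # PI[i] = P(X1 = si)
A = [[0.7, 0.3],                 # A[i][j] = P(Xt+1 = sj | Xt = si)
     [0.4, 0.6]]
B = [[[0.9, 0.1], [0.5, 0.5]],   # B[i][j][k] = P(Ot = k | Xt = si, Xt+1 = sj)
     [[0.2, 0.8], [0.3, 0.7]]]


def is_distribution(probs, tol=1e-9):
    """True if probs are non-negative and sum to 1 (within tol)."""
    return all(p >= 0 for p in probs) and abs(sum(probs) - 1.0) < tol


def is_valid_hmm(pi, a, b):
    """Check that pi, every row of A, and every B[i][j] are distributions."""
    return (is_distribution(pi)
            and all(is_distribution(row) for row in a)
            and all(is_distribution(b_ij) for b_i in b for b_ij in b_i))


valid = is_valid_hmm(PI, A, B)
```

Note that B has one emission distribution per pair of states (i, j), because in this formalisation symbols are emitted on the transition from si to sj rather than by a single state.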

Trellis diagram of an HMM

[Figure: trellis of states unrolled over time, with the observation sequence (Obs. seq) and the time axis below. Labelled arcs: transitions a1,1, a1,2, a1,3 out of state 1; emissions b1,1,k=O2, b1,2,k=O2, b1,3,k=O2 and b1,1,k=O3.]

The fundamental questions for HMMs

1. Given a model μ = (A, B, Π), how do we compute the likelihood of an observation sequence, P(O | μ)?

2. Given an observation sequence O and a model μ, which state sequence (X1,…,XT+1) best explains the observations?

This is the decoding problem

3. Given an observation sequence O and a space of possible models μ = (A, B, Π), which model best explains the observed data?

Application of question 1 (ASR)

Given a model μ = (A, B, Π), how do we compute the likelihood of an observation sequence, P(O | μ)?

Input of an ASR system: a continuous stream of soundwaves, which is ambiguous

Need to decode it into a sequence of phones.

is the input the sequence [n iy d] or [n iy]?

which sequence is the most probable?

Application of question 2 (POS Tagging)

Given an observation sequence O and a model μ, which state sequence (X1,…,XT+1) best explains the observations?

this is the decoding problem

Consider a POS Tagger

Input observation sequence:

I can read

need to find the most likely sequence of underlying POS tags:

e.g. is can a modal verb, or a noun?

how likely is it that can is a noun, given that the previous word is a pronoun?

Finding the probability of an observation sequence

Example problem: ASR

Assume that the input contains the word need

input stream is ambiguous (there is noise, individual variation in speech, etc)

Possible sequences of observations:

[n iy] (knee)

[n iy d] (need)

[n iy t] (neat)

…

States:

underlying sequences of phones giving rise to the input observations, with transition probabilities

assume we have state sequences for need, knee, new, neat, …

Formulating the problem

Probability of an observation sequence is logically an OR problem:

model gives us state transitions underlying several possible words (knee, need, neat…)

How likely is the word need? We have:

all possible state sequences X

each sequence can give rise to the signal received with a certain probability (possibly zero)

the probability of the word need is the sum of the probabilities with which each sequence can have given rise to the word.
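The "OR" reading can be made concrete by brute force: enumerate every state sequence, compute the probability that it produces the observations, and add the results. This is a minimal sketch, assuming a hypothetical two-state, two-symbol model with symbols emitted on transitions; the numbers and function names are illustrative only.

```python
from itertools import product

# Hypothetical toy parameters (not from the lecture).
PI = [0.6, 0.4]                  # PI[i] = P(X1 = si)
A = [[0.7, 0.3],                 # A[i][j] = P(Xt+1 = sj | Xt = si)
     [0.4, 0.6]]
B = [[[0.9, 0.1], [0.5, 0.5]],   # B[i][j][k] = P(Ot = k | Xt = si, Xt+1 = sj)
     [[0.2, 0.8], [0.3, 0.7]]]


def joint_probability(states, obs):
    """P(O, X | mu) for one state sequence X of length len(obs) + 1."""
    p = PI[states[0]]
    for t, k in enumerate(obs):
        i, j = states[t], states[t + 1]
        p *= A[i][j] * B[i][j][k]
    return p


def likelihood_by_enumeration(obs):
    """P(O | mu): sum of P(O, X | mu) over every possible state sequence X."""
    n = len(PI)
    return sum(joint_probability(x, obs)
               for x in product(range(n), repeat=len(obs) + 1))


likelihood = likelihood_by_enumeration([0, 1])   # observation sequence k1, k2
```

With N states and T observations the loop visits N^(T+1) sequences, so this is only practical for very small examples.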

Simplified trellis diagram representation

[Figure: simplified trellis running from start to end, with observations ot-1 and ot+1 shown along the visible layer]

Hidden layer: transitions between sounds forming the words need, knee…

This is our model

Visible layer is what ASR is given as input

Computing the probability of an observation

[Figure: trellis from start to end, with hidden states xt-1, xt+1 above observations ot-1, ot+1]
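The computation can be written out step by step. What follows is a sketch of the standard decomposition, assuming the arc-emission parameters aij and bijk defined in Part 1 and a state sequence X = (X1,…,XT+1); it is not necessarily the exact form shown on the original slides.

```latex
\begin{align*}
P(O \mid X, \mu) &= \prod_{t=1}^{T} b_{X_t X_{t+1} o_t} \\
P(X \mid \mu)    &= \pi_{X_1} \prod_{t=1}^{T} a_{X_t X_{t+1}} \\
P(O, X \mid \mu) &= P(O \mid X, \mu)\, P(X \mid \mu) \\
P(O \mid \mu)    &= \sum_{X} P(O \mid X, \mu)\, P(X \mid \mu)
                  = \sum_{X_1 \cdots X_{T+1}} \pi_{X_1}
                    \prod_{t=1}^{T} a_{X_t X_{t+1}}\, b_{X_t X_{t+1} o_t}
\end{align*}
```

The final sum ranges over N^{T+1} state sequences, which is why evaluating it directly becomes impractical as T grows.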

A final word on observation probabilities

Since we’re computing the probability of an observation given a model, we can use these methods to compare different models

if we take observations in our corpus as given, then the best model is the one which maximises the probability of these observations

(useful for training/parameter setting)

The forward procedure

Forward Procedure

Given our phone input, how do we decide whether the actual word is need, knee, …?

Could compute p(O|μ) for every single word

Highly expensive in terms of computation

Forward procedure

An efficient solution to the problem

based on dynamic programming (memoisation)

rather than perform separate computations for all possible sequences X, keep partial solutions in memory

Forward procedure

Network representation of all sequences (X) of states that could generate the observations

sum of probabilities for those sequences

E.g. O=[n iy] could be generated by

X1 = [n iy d] (need)

X2 = [n iy t] (neat)

shared histories can help us save on memory

Fundamental assumption:

Given several state sequences of length t+1 with shared history up to t

probability of first t observations is the same in all of them

Forward Procedure

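[Figure: trellis with hidden states xt-1, xt+1 above observations ot-1, ot+1]

•Probability of the first t observations is the same for all possible t+1 length state sequences.

•Define a forward variable:

$\alpha_i(t) = P(o_1 \ldots o_{t-1}, X_t = i \mid \mu)$

Probability of ending up in state si at time t after observations 1 to t-1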

Forward Procedure: initialisation

•Define:

$\alpha_i(1) = \pi_i, \quad 1 \le i \le N$

Probability of being in state si first is just equal to the initialisation probability

Forward Procedure (inductive step)

$\alpha_j(t+1) = \sum_{i=1}^{N} \alpha_i(t)\, a_{ij}\, b_{ij o_t}, \quad 1 \le t \le T,\ 1 \le j \le N$
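The forward procedure can be sketched in a few lines. This is a minimal illustration, assuming symbols are emitted on transitions (the bijk of Part 1), so that each step multiplies the cached forward values by aij and by the emission probability of the current symbol on the arc from i to j. The toy parameters and the function name are hypothetical.

```python
# Hypothetical toy parameters (not from the lecture).
PI = [0.6, 0.4]                  # PI[i] = P(X1 = si)
A = [[0.7, 0.3],                 # A[i][j] = P(Xt+1 = sj | Xt = si)
     [0.4, 0.6]]
B = [[[0.9, 0.1], [0.5, 0.5]],   # B[i][j][k] = P(Ot = k | Xt = si, Xt+1 = sj)
     [[0.2, 0.8], [0.3, 0.7]]]


def forward(obs, pi, a, b):
    """Return the table of forward values; alpha[t][i] is alpha_i(t + 1)."""
    n = len(pi)
    alpha = [list(pi)]                       # initialisation: alpha_i(1) = pi_i
    for k in obs:                            # k is the symbol o_t
        prev = alpha[-1]
        # Inductive step: sum over every predecessor state i of state j.
        alpha.append([sum(prev[i] * a[i][j] * b[i][j][k] for i in range(n))
                      for j in range(n)])
    return alpha


alpha = forward([0, 1], PI, A, B)
likelihood = sum(alpha[-1])                  # P(O | mu) = sum_i alpha_i(T + 1)
```

Each observation costs on the order of N² multiplications, so the whole table is filled in time proportional to N²·T rather than by enumerating N^(T+1) sequences.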

Looking backward

The forward procedure caches the probability of sequences of states leading up to an observation (left to right).

The backward procedure works the other way:

probability of seeing the rest of the obs sequence given that we were in some state at some time

Backward procedure: basic structure

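Define:

$\beta_i(t) = P(o_t \ldots o_T \mid X_t = i, \mu)$

probability of the remaining observations given that the current obs is emitted from state i

Initialise:

$\beta_i(T+1) = 1, \quad 1 \le i \le N$

probability at the final state

Inductive step:

$\beta_i(t) = \sum_{j=1}^{N} a_{ij}\, b_{ij o_t}\, \beta_j(t+1), \quad 1 \le t \le T,\ 1 \le i \le N$

Total:

$P(O \mid \mu) = \sum_{i=1}^{N} \pi_i\, \beta_i(1)$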
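The four steps of the backward procedure can be sketched as follows. This is an illustration only, assuming symbols are emitted on transitions as in Part 1; the toy parameters and the function name are hypothetical.

```python
# Hypothetical toy parameters (not from the lecture).
PI = [0.6, 0.4]                  # PI[i] = P(X1 = si)
A = [[0.7, 0.3],                 # A[i][j] = P(Xt+1 = sj | Xt = si)
     [0.4, 0.6]]
B = [[[0.9, 0.1], [0.5, 0.5]],   # B[i][j][k] = P(Ot = k | Xt = si, Xt+1 = sj)
     [[0.2, 0.8], [0.3, 0.7]]]


def backward(obs, a, b):
    """Return the table of backward values; beta[t][i] is beta_i(t + 1)."""
    n = len(a)
    beta = [[1.0] * n]                       # initialise: beta_i(T + 1) = 1
    for k in reversed(obs):                  # walk the observations right to left
        nxt = beta[0]
        # Inductive step: sum over every successor state j of state i.
        beta.insert(0, [sum(a[i][j] * b[i][j][k] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


beta = backward([0, 1], A, B)
# Total: weight the backward values at time 1 by the initial probabilities.
likelihood = sum(PI[i] * beta[0][i] for i in range(len(PI)))
```

The total agrees with the value obtained by working left to right, since both procedures sum over the same set of state sequences.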

Combining forward & backward variables

Our two variables can be combined:

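the likelihood of being in state i at time t with our sequence of observations is a function of:

the probability of ending up in i at t given what came previously

the probability of the remaining observations given that we are in i at t

Therefore:

$P(O, X_t = i \mid \mu) = \alpha_i(t)\, \beta_i(t)$

$P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(t)\, \beta_i(t), \quad 1 \le t \le T+1$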
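One way to check the combination is to multiply the forward and backward values at every time step and confirm that the sum over states is the same each time. The sketch below assumes arc emissions as in Part 1; the toy parameters and function names are hypothetical.

```python
# Hypothetical toy parameters (not from the lecture).
PI = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[[0.9, 0.1], [0.5, 0.5]],   # B[i][j][k] = P(Ot = k | Xt = si, Xt+1 = sj)
     [[0.2, 0.8], [0.3, 0.7]]]


def forward(obs, pi, a, b):
    """alpha[t][i]: probability of observations before time t+1, ending in i."""
    n = len(pi)
    alpha = [list(pi)]
    for k in obs:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * a[i][j] * b[i][j][k] for i in range(n))
                      for j in range(n)])
    return alpha


def backward(obs, a, b):
    """beta[t][i]: probability of the remaining observations, starting in i."""
    n = len(a)
    beta = [[1.0] * n]
    for k in reversed(obs):
        nxt = beta[0]
        beta.insert(0, [sum(a[i][j] * b[i][j][k] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def likelihood_at_each_time(obs, pi, a, b):
    """Sum over states of alpha * beta, computed separately for every time."""
    alpha, beta = forward(obs, pi, a, b), backward(obs, a, b)
    return [sum(al * be for al, be in zip(alpha_t, beta_t))
            for alpha_t, beta_t in zip(alpha, beta)]


totals = likelihood_at_each_time([0, 1], PI, A, B)
```

Every entry of totals is the same number, P(O | μ), because each time step partitions the same set of state sequences by the state they pass through at that time.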

Decoding: Finding the best state sequence

Best state sequence: example

Consider the ASR problem again

Input observation sequence:

[aa n iy dh ax]

(corresponds to I need the…)

Possible solutions:

I need a…

I need the…

I kneed a…

…

NB: each possible solution corresponds to a state sequence.

Problem is to find best word segmentation and most likely underlying phonetic input.

Some difficulties…

If we focus on the likelihood of each individual state, we run into problems

context effects mean that what is individually likely may together yield an unlikely sequence

the ASR program needs to look at the probability of entire sequences
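A small numerical example shows the difficulty. The sketch below uses a hypothetical two-state model (arc emissions, invented numbers) and brute-force enumeration to compare two answers: the sequence built by picking the individually most likely state at each time, and the single most probable sequence overall.

```python
from itertools import product

# Hypothetical toy parameters (not from the lecture).
PI = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[[0.9, 0.1], [0.5, 0.5]],   # B[i][j][k] = P(Ot = k | Xt = si, Xt+1 = sj)
     [[0.2, 0.8], [0.3, 0.7]]]


def joint_probability(states, obs):
    """P(O, X | mu) for one state sequence X of length len(obs) + 1."""
    p = PI[states[0]]
    for t, k in enumerate(obs):
        i, j = states[t], states[t + 1]
        p *= A[i][j] * B[i][j][k]
    return p


def compare_decodings(obs):
    n, length = len(PI), len(obs) + 1
    joint = {x: joint_probability(x, obs)
             for x in product(range(n), repeat=length)}
    # marginal[t][i]: total probability of all sequences in state i at step t.
    marginal = [[0.0] * n for _ in range(length)]
    for x, p in joint.items():
        for t, i in enumerate(x):
            marginal[t][i] += p
    per_state = tuple(m.index(max(m)) for m in marginal)  # best state per time
    best = max(joint, key=joint.get)                      # best whole sequence
    return per_state, joint[per_state], best, joint[best]


per_state, p_per_state, best, p_best = compare_decodings([0, 1])
```

In this toy model the two answers differ: choosing the likeliest state at each time gives the sequence (s1, s2, s2), whose joint probability is lower than that of the best whole sequence (s1, s1, s2).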

Viterbi algorithm

Given an observation sequence O and a model μ, find:

argmaxX P(X, O | μ)

the sequence of states X such that P(X, O | μ) is highest

Basic idea:

run a type of forward procedure (computes probability of all possible paths)

store partial solutions

at the end, look back to find the best path

Illustration: path through the trellis

At every node (state) and time, we store:

•the likelihood of reaching that state at that time by the most probable path leading to that state (denoted δ)

•the preceding state leading to the current state (denoted ψ)

Viterbi Algorithm: definitions

$\delta_j(t) = \max_{X_1 \ldots X_{t-1}} P(X_1 \ldots X_{t-1}, o_1 \ldots o_{t-1}, X_t = j \mid \mu)$

The probability of the most probable path from observation 1 to t-1, landing us in state j at t

Viterbi Algorithm: initialisation

$\delta_j(1) = \pi_j, \quad 1 \le j \le N$

The probability of being in state j at the beginning is just the initialisation probability of state j.

Viterbi Algorithm: inductive step

Probability of being in j at t+1 depends on

• the state i for which δi(t)·aij is highest (the best-scoring predecessor, not the largest aij alone)

• the probability that j emits the symbol Ot+1

Viterbi Algorithm: inductive step

Backtrace store ψj(t+1): the most probable state from which state j can be reached
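The two quantities stored at each step can be written as a pair of recursions. This is a sketch under an assumption: it keeps the arc-emission parameters bijk from the formalisation in Part 1, so the emission term sits on the transition from i to j. The wording of this slide suggests a state-emission variant, in which the term would instead be the probability that state j emits Ot+1.

```latex
\begin{align*}
\delta_j(t+1) &= \max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{ij o_t} \\
\psi_j(t+1)   &= \operatorname*{arg\,max}_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{ij o_t}
\end{align*}
```

These have the same shape as the forward recursion, with the sum over predecessor states replaced by a maximum.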

Illustration

δ2(t=6) = probability of reaching state 2 at time t=6 by the most probable path (marked) through state 2 at t=6

ψ2(t=6) = 3 is the state preceding state 2 at t=6 on the most probable path through state 2 at t=6

Viterbi Algorithm: backtrace

The best state at T is that state i for which the probability δi(T) is highest

Viterbi Algorithm: backtrace

Work backwards to the most likely preceding state

Viterbi Algorithm: backtrace

The probability of the best state sequence is the maximum value stored for the final time T
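Initialisation, inductive step and backtrace fit together as in the sketch below. It is an illustration under stated assumptions: symbols are emitted on transitions (the bijk of Part 1), so the state sequence has one more element than the observation sequence and the backtrace starts from the last column of the table. The toy parameters and the function name are hypothetical.

```python
# Hypothetical toy parameters (not from the lecture).
PI = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[[0.9, 0.1], [0.5, 0.5]],   # B[i][j][k] = P(Ot = k | Xt = si, Xt+1 = sj)
     [[0.2, 0.8], [0.3, 0.7]]]


def viterbi(obs, pi, a, b):
    """Return (best state sequence, its joint probability with obs)."""
    n = len(pi)
    delta = [list(pi)]           # initialisation: delta_j(1) = pi_j
    psi = []                     # psi[t][j]: best predecessor of j at that step
    for k in obs:
        prev = delta[-1]
        row, back = [], []
        for j in range(n):
            # Score of reaching j from each possible predecessor i.
            scores = [prev[i] * a[i][j] * b[i][j][k] for i in range(n)]
            best_i = max(range(n), key=scores.__getitem__)
            row.append(scores[best_i])
            back.append(best_i)
        delta.append(row)
        psi.append(back)
    # Backtrace: start from the best final state and follow the stored pointers.
    last = max(range(n), key=delta[-1].__getitem__)
    path = [last]
    for back in reversed(psi):
        path.append(back[path[-1]])
    path.reverse()
    return path, delta[-1][last]


path, prob = viterbi([0, 1], PI, A, B)
```

For long observation sequences the products underflow, so practical implementations usually add log probabilities instead of multiplying probabilities.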

Summary

We’ve looked at two algorithms for solving two of the fundamental problems of HMMs:

likelihood of an observation sequence given a model (Forward/Backward Procedure)

most likely underlying state sequence, given an observation sequence (Viterbi Algorithm)

Next up:

we look at POS tagging