Slide 1

CS 541: Artificial Intelligence

Lecture X: Markov Decision Process

Slides Credit: Peter Norvig and Sebastian Thrun

Schedule

TA: Vishakha Sharma

Email Address: vsharma1@stevens.edu

Week

Date

Topic

Reading

Homework

1

08/30/2011

Introduction & Intelligent Agent

Ch 1 & 2

N/A

2

09/06/2011

Search: search strategy and heuristic search

Ch 3 & 4s

HW1 (Search)

3

09/13/2011

Search: Constraint Satisfaction & Adversarial Search

Ch 4s & 5 & 6

Teaming Due

4

09/20/2011

Logic: Logic Agent & First Order Logic

Ch 7 & 8s

HW1 due, Midterm Project (Game)

5

09/27/2011

Logic: Inference on First Order Logic

Ch 8s & 9

6

10/04/2011

Uncertainty and Bayesian Network

Ch 13 &14s

HW2 (Logic)

10/11/2011

No class (function as Monday)

7

10/18/2011

Midterm Presentation

Midterm Project Due

8

10/25/2011

Inference in Baysian Network

Ch 14s

HW2 Due, HW3 (Probabilistic Reasoning)

9

11/01/2011

Probabilistic Reasoning over Time

Ch 15

10

11/08/2011

TA Session: Review of HW1 & 2

HW3 due, HW4 (MDP)

11

11/15/2011

Supervised Machine Learning

Ch 18

12

11/22/2011

Markov Decision Process

Ch 16

HW4 due

13

11/29/2011

Reinforcement learning

Ch 21

14

12/06/2011

Final Project Presentation

Final Project Due

Re-cap of Lecture IX

Machine Learning

Classification (Naïve Bayes)

Regression (Linear, Smoothing)

Linear Separation (Perceptron, SVMs)

Non-parametric classification (KNN)

Outline

Planning under uncertainty

Planning with Markov Decision Processes

Utilities and Value Iteration

Partially Observable Markov Decision Processes

Planning under uncertainty

Rhino Museum Tourguide

6

http://www.youtube.com/watch?v=PF2qZ-O3yAM

Minerva Robot

7

http://www.cs.cmu.edu/~rll/resources/cmubg.mpg

Pearl Robot

8

http://www.cs.cmu.edu/~thrun/movies/pearl-assist.mpg

Mine Mapping Robot Groundhog

9

http://www-2.cs.cmu.edu/~thrun/3D/mines/groundhog/anim/mine_mapping.mpg

10

Planning

Uncertainty

11

Planning: Classical Situation

hell

heaven

• World deterministic

• State observable

MDP-Style Planning

hell

heaven

• World stochastic

• State observable

[Koditschek 87, Barto et al. 89]

• Policy

• Universal Plan

• Navigation function

Stochastic, Partially Observable

sign

hell?

heaven?

[Sondik 72] [Littman/Cassandra/Kaelbling 97]

Stochastic, Partially Observable

sign

hell

heaven

sign

heaven

hell

Stochastic, Partially Observable

sign

heaven

hell

sign

?

?

sign

hell

heaven

start

50%

50%

Stochastic, Partially Observable

sign

?

?

start

sign

heaven

hell

sign

hell

heaven

50%

50%

sign

?

?

start

A Quiz

-dim continuous

stochastic

1-dim

continuous

stochastic

actions

# states

size belief space?

sensors

3: s1, s2, s3

deterministic

3

perfect

3: s1, s2, s3

stochastic

3

perfect

23-1: s1, s2, s3, s12, s13, s23, s123

deterministic

3

abstract states

deterministic

3

stochastic

2-dim continuous: p(S=s1), p(S=s2)

stochastic

3

none

2-dim continuous: p(S=s1), p(S=s2)

-dim continuous

deterministic

1-dim

continuous

stochastic

aargh!

stochastic

-dim

continuous

stochastic

Planning with Markov DecisionProcesses

MPD Planning

Solution for Planning problem

Noisy controls

Perfect perception

Generates “universal plan” (=policy)

Grid World

The agent lives in a grid

Walls block the agent’s path

The agent’s actions do not always go asplanned:

80% of the time, the action Northtakes the agent North(if there is no wall there)

10% of the time, North takes the agentWest; 10% East

If there is a wall in the direction theagent would have been taken, the agentstays put

Small “living” reward each step

Big rewards come at the end

Goal: maximize sum of rewards*

Grid Futures

22

Deterministic Grid World

Stochastic Grid World

X

X

E N S W

X

E N S W

?

X

X

X

Markov Decision Processes

An MDP is defined by:

A set of states s  S

A set of actions a  A

A transition function T(s,a,s’)

Prob that a from s leads to s’

i.e., P(s’ | s,a)

Also called the model

A reward function R(s, a, s’)

Sometimes just R(s) or R(s’)

A start state (or distribution)

Maybe a terminal state

MDPs are a family of non-deterministicsearch problems

Reinforcement learning: MDPs where we don’tknow the transition or reward functions

23

Solving MDPs

In deterministic single-agent search problems, want an optimal plan, orsequence of actions, from start to a goal

In an MDP, we want an optimal policy *: S → A

A policy  gives an action for each state

An optimal policy maximizes expected utility if followed

Defines a reflex agent

Optimal policy when R(s, a,s’) = -0.03 for all non-terminals s

Example Optimal Policies

R(s) = -2.0

R(s) = -0.4

R(s) = -0.03

R(s) = -0.01

MDP Search Trees

Each MDP state gives a search tree

a

s

s’

s, a

(s,a,s’) called a transition

T(s,a,s’) = P(s’|s,a)

R(s,a,s’)

s,a,s’

s is a state

(s, a) is a q-state

26

Why Not Search Trees?

Why not solve with conventional planning?

Problems:

This tree is usually infinite (why?)

Same states appear over and over (why?)

We would search once per state (why?)

27

Utilities

Utilities

Utility = sum of future reward

Problem: infinite state sequences have infinite rewards

Solutions:

Finite horizon:

Terminate episodes after a fixed T steps (e.g. life)

Gives nonstationary policies ( depends on time left)

Absorbing state: guarantee that for every policy, a terminal state willeventually be reached (like “done” for High-Low)

Discounting: for 0 <  < 1

Smaller  means smaller “horizon” – shorter term focus

29

Discounting

Typically discountrewards by  < 1 eachtime step

Sooner rewards havehigher utility than laterrewards

Also helps the algorithmsconverge

30

Recap: Defining MDPs

Markov decision processes:

States S

Start state s0

Actions A

Transitions P(s’|s,a) (or T(s,a,s’))

Rewards R(s,a,s’) (and discount )

MDP quantities so far:

Policy = Choice of action for each state

Utility (or return) = sum of discounted rewards

a

s

s, a

s,a,s’

s’

31

Optimal Utilities

Fundamental operation: compute thevalues (optimal expect-max utilities)of states s

Why? Optimal values define optimalpolicies!

Define the value of a state s:

V*(s) = expected utility starting in s andacting optimally

Define the value of a q-state (s,a):

Q*(s,a) = expected utility starting in s,taking action a and thereafter actingoptimally

Define the optimal policy:

*(s) = optimal action from state s

a

s

s, a

s,a,s’

s’

32

The Bellman Equations

Definition of “optimal utility” leads to asimple one-step lookahead relationshipamongst optimal utility values:

Optimal rewards = maximize over first action andthen follow optimal policy

Formally:

a

s

s, a

s,a,s’

s’

33

Solving MDPs

We want to find the optimal policy *

Proposal 1: modified expect-max search, starting from eachstate s:

a

s

s, a

s,a,s’

s’

34

Value Iteration

Idea:

Start with V0*(s) = 0

Given Vi*, calculate the values for all states for depth i+1:

This is called a value update or Bellman update

Repeat until convergence

Theorem: will converge to unique optimal values

Basic idea: approximations get refined towards optimal values

Policy may converge long before values do

35

Example: Bellman Updates

36

max happens fora=right, other actionsnot shown

Example: =0.9, livingreward=0, noise=0.2

Example: Value Iteration

Information propagates outward from terminal states andeventually all states have correct value estimates

V2

V3

37

Computing Actions

Which action should we chose from state s:

Given optimal values V?

Given optimal q-values Q?

Lesson: actions are easier to select from Q’s!

Asynchronous Value Iteration*

In value iteration, we update every state in each iteration

Actually, any sequences of Bellman updates will converge if everystate is visited infinitely often

In fact, we can update the policy as seldom or often as we like,and we will still converge

Idea: Update states whose value we expect to change:

If is large then update predecessors of s

$C:\Documents and Settings\Administrator\My Documents\My Pictures\Picture1.png$

Value Iteration: Example

plan1

plan2

Another Example

Value Function and Plan

Map

Another Example

Value Function and Plan

Map

Partially Observable MarkovDecision Processes

Stochastic, Partially Observable

sign

?

?

start

sign

heaven

hell

sign

hell

heaven

50%

50%

sign

?

?

start

Value Iteration in Belief space: POMDPs

Partially Observable Markov Decision Process

Known model (learning even harder!)

Observation uncertainty

Usually also: transition uncertainty

Planning in belief space = space of all probability distributions

Value function: Piecewise linear, convex function over the beliefspace

45

Introduction to POMDPs (1 of 3)

80

100

b

a

0

b

a

40

s2

s1

action a

action b

p(s1)

[Sondik 72, Littman, Kaelbling, Cassandra ‘97]

s2

s1

100

0

100

action a

action b

Introduction to POMDPs (2 of 3)

80

100

b

a

0

b

a

40

s2

s1

80%

c

20%

p(s1)

s2

s1’

s1

s2’

p(s1’)

p(s1)

s2

s1

100

0

100

[Sondik 72, Littman, Kaelbling, Cassandra ‘97]

Introduction to POMDPs (3 of 3)

80

100

b

a

0

b

a

40

s2

s1

80%

c

20%

p(s1)

s2

s1

100

0

100

p(s1)

s2

s1

s1

s2

p(s1’|A)

B

A

50%

50%

30%

70%

B

A

p(s1’|B)

[Sondik 72, Littman, Kaelbling, Cassandra ‘97]

POMDP Algorithm

Belief space = Space of all probability distribution(continuous)

Value function: Max of set of linear functions in beliefspace

Backup: Create new linear functions

Number of linear functions can grow fast!

49

Why is This So Complex?

State Space Planning

(no state uncertainty)

Belief Space Planning

(full state uncertainties)

?

longwood

but not usually.

longwood

The controller may beglobally uncertain...

Belief Space Structure

checkerboard

longwood

Augmented MDPs:

[Roy et al, 98/99]

conventional state space

uncertainty (entropy)

Path Planning with Augmented MDPs

entropy4

entropy5

information gain

Conventional planner

Probabilistic Planner

coastal

conv

coastal3

conv3

[Roy et al, 98/99]