In order to identify the optimal policy π*, we need to define the reward r(s, a, s′). Here, we assume that the reward simplifies to r(s, a) = r(s) + r(a), so it is only a function of the state s and the action a. For a given structure, the reward r(s) is estimated as a function of the number of users quantified through the annual average daily traffic flow (AADTF = 10^5 users/day), times a value of $3/user, times the capacity of the bridge as a function of its condition.

Note: Here the value of $3/user does not necessarily represent the direct cost per user as collected by a toll booth. It instead corresponds to the indirect value of the infrastructure for the society.
The capacity is defined as c(S) = {1, 1, 1, 0.90, 0.75, 0}, and the associated rewards are

r(S) = 10^5 users/day · 365 days · $3/user · c(S)
     = {109.5, 109.5, 109.5, 98.6, 82.1, 0} $M.
The rewards for actions r(a) correspond to a cost, so their values are less than or equal to zero,

r(A) = {0, −5, −20} $M.
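To make these numbers concrete, the minimal sketch below recomputes r(S) and r(A) from the capacity vector; the variable names (aadtf, value_per_user, capacity) are illustrative choices and not part of the original formulation.

```python
# Sketch: recompute the state and action rewards of the example.
# Variable names are illustrative; only the numbers come from the text.

aadtf = 1e5            # annual average daily traffic flow [users/day]
value_per_user = 3.0   # indirect value of the infrastructure [$/user]
days_per_year = 365

capacity = [1.0, 1.0, 1.0, 0.90, 0.75, 0.0]   # c(S), as a function of the condition

# State rewards r(S) in $M: AADTF * 365 days * $3/user * c(s)
r_state = [aadtf * days_per_year * value_per_user * c / 1e6 for c in capacity]
print(r_state)   # [109.5, 109.5, 109.5, 98.55, 82.125, 0.0]
                 # i.e. {109.5, 109.5, 109.5, 98.6, 82.1, 0} $M after rounding

# Action rewards r(A) in $M (costs, hence non-positive)
r_action = [0.0, -5.0, -20.0]
```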
Figure 15.3: Schematic representation of the interaction between an agent and its environment.
In the context of reinforcement learning, we can generalize the example above by introducing the concept of an agent interacting with its environment, as depicted in figure 15.3. In the context of the previous example, the environment is the population of structures that are degrading over time, and the agent is the hypothetical entity acting with the intent of maximizing the long-term accumulation of rewards. The environment interacts with the agent by defining, for each time t, the state s_t of the system and the reward for being in that state s, taking a given action a, and ending in the next state s′. The agent perceives the environment's state and selects an action a, which in turn can affect the environment and thus the state at time t + 1.
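The loop below is a minimal sketch of this interaction for a single structure. The one-step degradation rule and the random placeholder policy are assumptions made only to illustrate the agent-environment exchange; they do not correspond to the deterioration model of the example or to the policies developed later in this chapter.

```python
import random

# Minimal sketch of the agent-environment loop for one structure.
# The transition rule below is a hypothetical stand-in, not the
# deterioration model of the example.

R_STATE = [109.5, 109.5, 109.5, 98.6, 82.1, 0.0]   # r(S) in $M, from the example
R_ACTION = [0.0, -5.0, -20.0]                      # r(A) in $M: do nothing, maintain, replace

def step(s, a):
    """Environment: return the next state s' and the reward r(s, a) = r(s) + r(a)."""
    s_next = 0 if a == 2 else min(s + 1, len(R_STATE) - 1)   # replacing resets, otherwise degrade one step
    return s_next, R_STATE[s] + R_ACTION[a]

s, total_reward = 0, 0.0
for t in range(10):
    a = random.choice([0, 1, 2])   # placeholder policy; to be replaced by pi*(s)
    s, r = step(s, a)              # environment returns the next state and the reward
    total_reward += r
print(total_reward)
```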
This chapter explores the task of identifying the optimal policy π*(s) describing the optimal actions a* to be taken for each state s. This task is first formulated in §15.1 using the model-based method known as the Markov decision process. Building on this method, §15.2 then presents two model-free reinforcement learning methods: temporal difference and Q-learning.
15.1 Markov Decision Process
In order to formulate a Markov decision process (MDP), we need to define a planning horizon over which the utility is estimated. Planning horizons may be either finite or infinite. With a finite planning horizon, the rewards are considered over a fixed period of time. In such a case, the optimal policy π*(s, t) is nonstationary because it depends on the time t. For the infrastructure maintenance example,