15 Sequential Decisions
States: $s \in \mathcal{S} = \{s_1, s_2, \cdots, s_{\mathsf{S}}\}$
Actions: $a \in \mathcal{A} = \{a_1, a_2, \cdots, a_{\mathsf{A}}\}$
Policy: $\pi = \{a(s_1), a(s_2), \cdots, a(s_{\mathsf{S}})\}$
Reward: $r(s, a, s') \in \mathbb{R}$
Transition model: $p(s'|s, a)$
In this chapter, we explore how to make optimal decisions in a sequential context. Sequential decisions differ from the decision context presented in chapter 14 because here, the goal is to select the optimal action to be performed in order to maximize the current reward as well as future ones. In that process, we must take into account that our current action will affect future states as well as future decisions. Let us consider a system that can be in a state $s \in \mathcal{S} = \{s_1, s_2, \cdots, s_{\mathsf{S}}\}$. The set of actions $a \in \mathcal{A} = \{a_1, a_2, \cdots, a_{\mathsf{A}}\}$ that should be performed given each state of the system is called a policy, $\pi = \{a(s_1), a(s_2), \cdots, a(s_{\mathsf{S}})\}$.
In the context of sequential decisions, the goal is to identify the optimal policy $\pi^*$ describing for each possible state of the system the optimal action to take. This task can be formulated as a reinforcement learning (RL) problem, where the optimal policy is learned by choosing the actions $a^*(s)$ that maximize the expected utility over a given planning horizon.

Note: Reinforcement learning differs from supervised learning because the optimal policy is not learned using other examples of policies. RL also differs from unsupervised learning because the goal is not to discover a structure or patterns in the data.

In order to compute the expected utility, we need a function describing, for each set of current state $s$, action $a$, and next state $s'$, the associated reward $r(s, a, s') \in \mathbb{R}$. In practical cases, it can happen that the reward simplifies to a special case such as $r(s, a, s') = r(s, a)$, $r(s, a, s') = r(a, s')$, or even $r(s, a, s') = r(s)$. The last part we need to define is the Markovian transition model $p(s'|s_j, a_i) = \Pr(S' = s_k \,|\, s = s_j, a_i),\ \forall k \in \{1 : \mathsf{S}\}$, that is,
$$p(s'|s_j, a_i) \equiv \begin{cases} \Pr(S_{t+1} = s_1 \,|\, s_t = s_j, a_i) \\ \Pr(S_{t+1} = s_2 \,|\, s_t = s_j, a_i) \\ \quad\vdots \\ \Pr(S_{t+1} = s_{\mathsf{S}} \,|\, s_t = s_j, a_i). \end{cases}$$
The transition model describes, given the current state of the system $s_j$ and an action taken $a_i$, the probability of ending in any state $s' \in \mathcal{S}$ at time $t+1$. Here, the Markov property (see chapter 7) allows formulating a model where the probability of each state $s' \in \mathcal{S}$ at time $t+1$ depends on the current state $s_j$ at time $t$ and is independent of any previous states. Although sequential decision problems could also be formulated for continuous state variables, $s \in \mathbb{R}$, in this chapter we restrict ourselves to the discrete case.

Note: A transition model is analogous to the concept of a conditional probability table presented in chapter 11.
Example: Infrastructure maintenance planning  The notions associated with reinforcement learning and the sequential decision process are illustrated through an application to infrastructure maintenance planning. Let us consider a population of bridges (see figure 15.1) for which, at a given time $t$, each bridge can be in a state

$$s \in \mathcal{S} = \Big\{\underbrace{100\,\%}_{s_1},\ \underbrace{80\,\%}_{s_2},\ \underbrace{60\,\%}_{s_3},\ \underbrace{40\,\%}_{s_4},\ \underbrace{20\,\%}_{s_5},\ \underbrace{0\,\%}_{s_6}\Big\},$$

Figure 15.1: Contextual illustration of a sequential decision process applied to the problem of infrastructure maintenance planning. (a) Representation of North America's road network, which includes hundreds of thousands of bridges (photo: Commission for Environmental Cooperation); (b) example of a given structure within the network (photo: Michel Goulet).
where 100 percent indicates a perfect condition, and 0 percent corresponds to a state where the damage is so extensive that the structure must be closed. The objective of sequential decision making is to learn the optimal policy $\pi^*(s)$ that identifies, among a set of possible actions,

$$a \in \mathcal{A} = \{\text{do nothing (N)},\ \text{maintain (M)},\ \text{replace (R)}\},$$

what is the optimal one $a^*(s)$ to take at each time, depending on the current condition $s$.
In this example, a bridge condition $s$ is rated every year from the results of visual inspections. We assume we have a database $\mathcal{D}$ containing the conditions obtained from inspection results for $N$ structures, each inspected $T$ times,

$$\mathcal{D} = \left\{\begin{array}{c} \{s_1, s_2, \cdots, s_T\}_1 \\ \{s_1, s_2, \cdots, s_T\}_2 \\ \vdots \\ \{s_1, s_2, \cdots, s_T\}_N \end{array}\right\}_{N \times T}.$$
Figure 15.2: Example of bridge condition time histories (state $s$, from 0 % to 100 %, versus time $t$, from 0 to 20).
Each line in this database corresponds to a given structure and each column corresponds to a given time $t \in \{1, 2, \cdots, T\}$. Figure 15.2 presents an example for $N = 15$ bridges and time histories of length $T = 20$. One of them has been highlighted with a thicker red line to better visualize how a structure transits from state to state through time. In this data set, the action taking place at each time $t$ is $a = \text{do nothing}$. We can use this data set to estimate the parameters of the Markovian transition model $p(s_{t+1}|s_t, a = \text{do nothing})$ by employing the maximum likelihood estimate for each transition probability (see §11.4). The maximum likelihood estimate (MLE) corresponds to the number of structures that, from a state $i$ at time $t$, transitioned to a state $j$ at $t+1$, divided by the number of structures that started from a state $i$ at time $t$,

$$\widehat{\Pr}(S_{t+1} = s_j \,|\, s_t = s_i, a_t = \text{do nothing}) = \frac{\#\{s_t = s_i, s_{t+1} = s_j\}}{\#\{s_t = s_i\}},$$
where $\{s_i, s_j\} \in \mathcal{S}^2$ and the $\#$ symbol describes the number of elements in a set. These results are stored in a transition matrix,
$$p(s_{t+1}|s_t, \underbrace{a_t = \text{do nothing}}_{\text{N}}) = \begin{bmatrix} 0.95 & 0.03 & 0.02 & 0 & 0 & 0 \\ 0 & 0.9 & 0.05 & 0.03 & 0.02 & 0 \\ 0 & 0 & 0.8 & 0.12 & 0.05 & 0.03 \\ 0 & 0 & 0 & 0.7 & 0.25 & 0.05 \\ 0 & 0 & 0 & 0 & 0.6 & 0.4 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}_{\mathsf{S} \times \mathsf{S}},$$
where each element $[\,\cdot\,]_{ij}$ corresponds to the probability of landing in state $s_{t+1} = s_j$ given that the system is currently in state $s_t = s_i$. This matrix is analogous to the transition matrix we have seen in §12.2. In the case of infrastructure degradation, the transition matrix has only zeros below its main diagonal because, without intervention, a bridge's condition can only decrease over time, so a transition from $s_t = s_i \rightarrow s_{t+1} = s_j$ can only have a nonzero probability for $j \geq i$.
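To make the counting explicit, here is a minimal Python sketch of this maximum likelihood estimator. The array name `histories`, the integer state encoding, and the toy data are illustrative assumptions, not the book's actual database:

```python
import numpy as np

def estimate_transition_matrix(histories, n_states=6):
    """MLE of p(s_{t+1} | s_t, a = do nothing) by counting observed transitions.

    histories: (N, T) array of state indices in {0, ..., n_states-1},
               one row per structure, one column per inspection year.
    """
    counts = np.zeros((n_states, n_states))
    for row in histories:
        for s_t, s_next in zip(row[:-1], row[1:]):
            counts[s_t, s_next] += 1
    # Divide each row by the number of times the starting state was visited
    totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

# Example usage with two short, hypothetical condition histories
histories = np.array([[0, 0, 1, 1, 2],
                      [0, 1, 1, 2, 3]])
print(estimate_transition_matrix(histories))
```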
For maintenance and replacement actions, the transition matrix can be defined deterministically as

$$p(s_{t+1}|s_t, \underbrace{a_t = \text{maintain}}_{\text{M}}) = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \end{bmatrix}_{\mathsf{S} \times \mathsf{S}},$$
where, from a current state $s_t = s_i$, the state becomes $s_{t+1} = s_{i-1}$. For example, if at a time $t$ the state is $s_t = s_3 = 60\,\%$, the maintenance action will result in the state $s_{t+1} = s_2 = 80\,\%$. Note that if the state is $s_t = s_1 = 100\,\%$, it remains the same at $t+1$. For a replacement action, no matter the initial state at $t$, the next state becomes $s_{t+1} = s_1 = 100\,\%$, so

$$p(s_{t+1}|s_t, \underbrace{a_t = \text{replace}}_{\text{R}}) = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}_{\mathsf{S} \times \mathsf{S}}.$$
In order to identify the optimal policy $\pi^*$, we need to define the reward $r(s, a, s')$. Here, we assume that the reward simplifies to $r(s, a) = r(s) + r(a)$, so it is only a function of the state $s$ and the action $a$. For a given structure, the reward $r(s)$ is estimated as a function of the number of users quantified through the annual average daily traffic flow (AADTF = $10^5$ users/day), times a value of $3/user, times the capacity of the bridge as a function of its condition. The capacity is defined as

$$c(\mathcal{S}) = \{1,\ 1,\ 1,\ 0.90,\ 0.75,\ 0\},$$

and the associated rewards are

$$r(\mathcal{S}) = 10^5\,\text{users/day} \cdot 365\,\text{day} \cdot \$3/\text{user} \cdot c(\mathcal{S}) = \{109.5,\ 109.5,\ 109.5,\ 98.6,\ 82.1,\ 0\}\,\$\text{M}.$$

Note: Here the value of $3/user does not necessarily represent the direct cost per user as collected by a toll booth. It instead corresponds to the indirect value of the infrastructure for the society.
The rewards for actions $r(a)$ correspond to a cost, so their values are less than or equal to zero,

$$r(\mathcal{A}) = \{0,\ -5,\ -20\}\,\$\text{M}.$$
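As a quick numerical check, the reward vectors above can be reproduced with a few lines of Python (the variable names are illustrative only):

```python
# Reward for each state: AADTF (users/day) x 365 days x $3/user x capacity,
# expressed in $M; reward for each action is a (negative) cost in $M.
capacity = [1.0, 1.0, 1.0, 0.90, 0.75, 0.0]            # c(S)
r_state = [1e5 * 365 * 3 * c / 1e6 for c in capacity]  # -> 109.5, ..., 0.0 ($M)
r_action = {"do nothing": 0.0, "maintain": -5.0, "replace": -20.0}  # r(A) in $M
print([round(r, 1) for r in r_state])  # [109.5, 109.5, 109.5, 98.6, 82.1, 0.0]
```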
Figure 15.3: Schematic representation of the interaction between an agent and its environment.
For the context of reinforcement learning, we can generalize the example above by introducing the concept of an agent interacting with its environment, as depicted in figure 15.3. In the context of the previous example, the environment is the population of structures that are degrading over time, and the agent is the hypothetical entity acting with the intent of maximizing the long-term accumulation of rewards. The environment interacts with the agent by defining, for each time $t$, the state $s_t$ of the system and the reward for being in that state $s$, taking a given action $a$, and ending in the next state $s'$. The agent perceives the environment's state and selects an action $a$, which in turn can affect the environment and thus the state at time $t+1$.
This chapter explores the task of identifying the optimal policy $\pi^*(s)$ describing the optimal actions $a^*$ to be taken for each state $s$. This task is first formulated in §15.1 using the model-based method known as the Markov decision process. Building on this method, §15.2 then presents two model-free reinforcement learning methods: temporal difference and Q-learning.
15.1 Markov Decision Process
In order to formulate a Markov decision process (MDP), we need to define a planning horizon over which the utility is estimated. Planning horizons may be either finite or infinite. With a finite planning horizon, the rewards are considered over a fixed period of time. In such a case, the optimal policy $\pi^*(s, t)$ is nonstationary because it depends on the time $t$. For the infrastructure maintenance example, a finite planning horizon could correspond to the case where we want to identify the optimal actions to take over the next $T = 20$ years, after which we know a structure is going to be demolished. The optimal action for a state $s_t = 40\,\%$ would then not be the same whether we are at $t = 1$ or at $t = 19$. For an infinite planning horizon, the rewards are considered over an infinite period of time. In such a case, the optimal policy $\pi^*(s)$ is stationary as it does not depend on the time $t$. It means that at any time $t$, the optimal action to take is only a function of the current state $s$ we are in. In this chapter, we only study problems associated with an infinite planning horizon, as they allow identifying a stationary optimal policy $\pi^*(s)$ rather than one that depends on time, $\pi^*(s, t)$, as is the case with a finite planning horizon.

Note: Finite planning horizon $\rightarrow$ nonstationary policy; infinite planning horizon $\rightarrow$ stationary policy.
15.1.1 Utility for an Infinite Planning Horizon
The utility associated with an infinite planning horizon is defined as the reward for being in the current state plus the sum of the discounted rewards for all future states,

$$U(s_{t=0}, s_{t=1}, \cdots, s_{t=\infty}) = r(s_{t=0}) + \gamma\, r(s_{t=1}) + \gamma^2 r(s_{t=2}) + \cdots = \sum_{t=0}^{\infty} \gamma^t\, r(s_t) \leq \frac{\max_{s \in \mathcal{S}} r(s)}{1 - \gamma}. \qquad (15.1)$$

Note: Some references employ the term utility, others the term value, to refer to the sum of discounted rewards.
For a discount factor $\gamma \in (0, 1)$, the discounted sum of rewards over an infinite planning horizon is a finite quantity. This is an essential property because without a discount factor, we could not compare the performance of actions that would each have infinite utilities. In the special case where there is a terminal state for which the problem stops, the discount factor can be taken as $\gamma = 1$ because it will be possible to compute a finite estimate for the utility. Note that for the infrastructure maintenance planning example, there is no terminal state, and the discount factor can be interpreted as an interest rate $I$ using the transformation $\gamma = \frac{1}{1 + I}$. The issue with our planning problem is that we do not know yet what will be the future states $s_{t=1}, s_{t=2}, \cdots$, so we cannot compute the utility using equation 15.1. Instead, the expected utility for being in the state $s_0$ is computed as

$$U(s_0, \pi) \equiv \mathbb{E}[U(s_0, \pi)] = r(s_0) + \mathbb{E}\!\left[\sum_{t=1}^{\infty} \gamma^t\, r(S_t)\right]. \qquad (15.2)$$

Note: In this subsection, the notation is simplified by assuming that $r(s, a, s') = r(s)$. Nevertheless, the same reasoning holds for the general case where the reward is $r(s, a, s')$.
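For intuition, the expectation in equation 15.2 can be approximated by simulating the transition model under a fixed policy and averaging the discounted returns. The following Python sketch is an illustration under stated assumptions: the transition matrices are stacked in an array `P` of shape (actions, states, states), the example's reward $r(s, a) = r(s) + r(a)$ is used, and the infinite sum is truncated at a horizon where $\gamma^t$ is negligible:

```python
import numpy as np

def mc_expected_utility(s0, policy, P, r_state, r_action,
                        gamma=0.97, horizon=500, n_sims=1000, seed=0):
    """Monte Carlo estimate of U(s0, pi) = E[sum_t gamma^t r(S_t, a_t)].

    policy: array mapping state index -> action index
    P: (n_actions, n_states, n_states) transition matrices p(s'|s, a)
    """
    rng = np.random.default_rng(seed)
    n_states = P.shape[1]
    total = 0.0
    for _ in range(n_sims):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy[s]
            ret += discount * (r_state[s] + r_action[a])
            s = rng.choice(n_states, p=P[a, s])   # sample next state
            discount *= gamma
        total += ret
    return total / n_sims
```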
In equation 15.2, the probability of each state $s$ at each time $t > 0$ is estimated recursively using the transition model $p(s_{t+1}|s_t, a)$, where at each step, the action taken is the one defined in the policy $\pi(s)$. Here, we choose to simplify the notation by writing the expected utility as $\mathbb{E}[U(s, \pi)] \equiv U(s, \pi)$. Moreover, the notation $U(s, \pi) \equiv U(s)$ is employed later in this chapter in order to further simplify the notation by making the dependency on the policy implicit.
For a state $s$, the optimal policy $\pi^*(s)$ is defined as the action $a$ that maximizes the expected utility,

$$\pi^*(s) = \arg\max_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} p(s'|s, a) \cdot \big(r(s, a, s') + \gamma\, U(s', \pi^*)\big).$$

The difficulty is that, in this definition, the optimal policy $\pi^*(s)$ appears on both sides of the equality. The next two sections show how this difficulty can be tackled using value and policy iteration.
15.1.2 Value Iteration
Equation 15.2 described the expected utility for being in a state $s$, where the dependence on the actions $a$ was implicit. The definition of this relationship with the explicit dependence on the actions is described by the Bellman¹ equation,

$$U(s) = \max_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} p(s'|s, a) \cdot \big(r(s, a, s') + \gamma\, U(s')\big), \qquad (15.3)$$

¹ Bellman, R. (1957). A Markovian decision process. Journal of Mathematics and Mechanics 6(5), 679–684.
where the optimal action $a^*(s)$ is selected for each state through the max operation. Again, the difficulty with this equation is that the expected utility of a state $U(s)$ appears on both sides of the equality. One solution to this problem is to employ the value iteration algorithm, where we go from an iteration $i$ to an iteration $i+1$ using the Bellman update step,

$$U^{(i+1)}(s) \leftarrow \max_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} p(s'|s, a) \cdot \big(r(s, a, s') + \gamma\, U^{(i)}(s')\big). \qquad (15.4)$$
In order to use the Bellman update, we start from the initial values at iteration $i = 0$, for example, $U^{(0)}(s) = 0$, and then iterate until we converge to a steady state. The value iteration algorithm is guaranteed to converge to the exact expected utilities $U^{(\infty)}(s)$ if an infinite number of iterations is employed. The optimal actions $a^*(s)$ identified in the process for each state $s$ define the optimal policy $\pi^*$. The sequence of steps for value iteration is detailed in algorithm 10.

Note: The symbol $\leftarrow$ indicates that the quantity on the left-hand side is recursively updated using the terms on the right-hand side, which are themselves depending on the updated term.
Algorithm 10: Value iteration
1  define: $s \in \mathcal{S}$ (states)
2  define: $a \in \mathcal{A}$ (actions)
3  define: $r(s, a, s')$ (rewards)
4  define: $\gamma$ (discount factor)
5  define: $p(s'|s, a)$ (transition model)
6  define: $\epsilon$ (convergence criterion)
7  initialize: $U'(s)$ (expected utilities)
8  while $|U'(s) - U(s)| \geq \epsilon$ do
9    $U(s) \leftarrow U'(s)$
10   for $s \in \mathcal{S}$ do
11     for $a \in \mathcal{A}$ do
12       $U(s, a) = \sum_{s' \in \mathcal{S}} p(s'|s, a) \cdot \big(r(s, a, s') + \gamma\, U(s')\big)$
13     $\pi^*(s) \leftarrow a^* = \arg\max_{a \in \mathcal{A}} U(s, a)$
14     $U'(s) \leftarrow U(s, a^*)$
15 return: $U(s) = U'(s)$, $\pi^*(s)$

Example: Infrastructure maintenance planning  We now apply the value iteration algorithm to the infrastructure maintenance example
defined at the beginning of this chapter. The last parameters that need to be defined are the convergence criterion $\epsilon = 10^{-3}$ and the discount factor taken as $\gamma = 0.97$, which corresponds to an interest rate of 3 percent. Starting from $U^{(0)}(\mathcal{S}) = \{0, 0, 0, 0, 0, 0\}\,\$\text{M}$, we perform iteratively the Bellman update step for each state $s$,
$$U^{(i+1)}(s) \leftarrow \max_{a \in \mathcal{A}} \begin{cases} \sum_{s' \in \mathcal{S}} p(s'|s, \text{N}) \cdot \big(r(s, \text{N}) + \gamma\, U^{(i)}(s')\big) \\ \sum_{s' \in \mathcal{S}} p(s'|s, \text{M}) \cdot \big(r(s, \text{M}) + \gamma\, U^{(i)}(s')\big) \\ \sum_{s' \in \mathcal{S}} p(s'|s, \text{R}) \cdot \big(r(s, \text{R}) + \gamma\, U^{(i)}(s')\big), \end{cases}$$

where we choose the optimal action $a^*$ leading to the maximal expected utility. The expected utilities for the two first and last iterations are

$$\begin{aligned} U^{(1)}(\mathcal{S}) &= \{\overbrace{109.5}^{s=100\%},\ \overbrace{211}^{80\%},\ \overbrace{309}^{60\%},\ \overbrace{393}^{40\%},\ \overbrace{459}^{20\%},\ \overbrace{440}^{0\%}\}\,\$\text{M} \\ U^{(2)}(\mathcal{S}) &= \{222.5,\ 329,\ 430,\ 511,\ 572,\ 550\}\,\$\text{M} \\ &\ \,\vdots \\ U^{(356)}(\mathcal{S}) &= \{3640,\ 3635,\ 3630,\ 3615,\ 3592,\ 3510\}\,\$\text{M} \\ U^{(357)}(\mathcal{S}) &= \{3640,\ 3635,\ 3630,\ 3615,\ 3592,\ 3510\}\,\$\text{M}. \end{aligned}$$
Example setup
$\mathcal{S} = \{100, 80, 60, 40, 20, 0\}\,\%$
$\mathcal{A} = \{\underbrace{\text{do nothing}}_{\text{N}},\ \underbrace{\text{maintain}}_{\text{M}},\ \underbrace{\text{replace}}_{\text{R}}\}$
$r(\mathcal{S}) = \{109.5,\ 109.5,\ 109.5,\ 98.6,\ 82.1,\ 0\}\,\$\text{M}$
$r(\mathcal{A}) = \{0,\ -5,\ -20\}\,\$\text{M}$
$r(s, a, s') = r(s, a) = r(s) + r(a)$
$p(s'|s, \text{N}) = \begin{bmatrix} 0.95 & 0.03 & 0.02 & 0 & 0 & 0 \\ 0 & 0.9 & 0.05 & 0.03 & 0.02 & 0 \\ 0 & 0 & 0.8 & 0.12 & 0.05 & 0.03 \\ 0 & 0 & 0 & 0.7 & 0.25 & 0.05 \\ 0 & 0 & 0 & 0 & 0.6 & 0.4 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}$
$p(s'|s, \text{M}) = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \end{bmatrix}$
$p(s'|s, \text{R}) = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}$
Here, we explicitly show the calculations made to go from $U^{(1)}(100\,\%) = \$109.5\text{M}$ to $U^{(2)}(100\,\%) = \$222.5\text{M}$,

$$U^{(2)}(100\,\%) \leftarrow \max_{a \in \mathcal{A}} \begin{cases} 0.95 \cdot (\$109.5\text{M} + 0.97 \cdot \$109.5\text{M}) + 0.03 \cdot (\$109.5\text{M} + 0.97 \cdot \$211\text{M}) + 0.02 \cdot (\$109.5\text{M} + 0.97 \cdot \$309\text{M}) = \$222.5\text{M} \\ 1 \cdot (\$109.5\text{M} - \$5\text{M} + 0.97 \cdot \$109.5\text{M}) = \$210.7\text{M} \\ 1 \cdot (\$109.5\text{M} - \$20\text{M} + 0.97 \cdot \$109.5\text{M}) = \$195.7\text{M} \end{cases} = \$222.5\text{M},$$
where the optimal action to perform if you are in state $s = 100\,\%$ is $\pi^{(2)}(100\,\%) = a^* = \text{do nothing}$. In order to complete a full iteration, this operation needs to be repeated for each state $s \in \mathcal{S}$. Figure 15.4 presents the evolution of the expected utility of each state as a function of the iteration number. Note how the expected utilities converge to stationary values for all state variables. The corresponding optimal policy is

$$\pi^*(\mathcal{S}) = \{\underbrace{\text{do nothing}}_{s=100\%},\ \underbrace{\text{maintain}}_{80\%},\ \underbrace{\text{maintain}}_{60\%},\ \underbrace{\text{maintain}}_{40\%},\ \underbrace{\text{replace}}_{20\%},\ \underbrace{\text{replace}}_{0\%}\}.$$
Figure 15.4: Value iteration algorithm
converging to stationary expected utilities.
Figure 15.5 presents the expected utility for each action and state obtained using the value iteration algorithm. For each state, the expected utility corresponding to the optimal action is highlighted with a red border.
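The following Python sketch of algorithm 10 reproduces these numbers approximately; the array layout (one 6×6 matrix per action, rewards in $M) is an implementation choice for illustration, not the book's own code:

```python
import numpy as np

def value_iteration(P, r_state, r_action, gamma=0.97, eps=1e-3):
    """Value iteration (algorithm 10). P has shape (n_actions, n_states, n_states);
    r_state is r(s) and r_action[a] is r(a), so that r(s, a) = r(s) + r(a)."""
    n_actions, n_states, _ = P.shape
    U = np.zeros(n_states)
    while True:
        # U_sa[a, s] = r(s, a) + gamma * sum_s' p(s'|s, a) U(s')
        U_sa = np.array([r_state + r_action[a] + gamma * (P[a] @ U)
                         for a in range(n_actions)])
        U_new = U_sa.max(axis=0)
        if np.max(np.abs(U_new - U)) < eps:
            return U_new, U_sa.argmax(axis=0)  # expected utilities, optimal policy
        U = U_new

# Transition matrices for {do nothing, maintain, replace} and rewards in $M
P = np.array([
    [[.95, .03, .02, 0, 0, 0], [0, .9, .05, .03, .02, 0], [0, 0, .8, .12, .05, .03],
     [0, 0, 0, .7, .25, .05], [0, 0, 0, 0, .6, .4], [0, 0, 0, 0, 0, 1]],
    np.eye(6, k=-1) + np.diag([1, 0, 0, 0, 0, 0]),   # maintain: one state better
    np.tile([1, 0, 0, 0, 0, 0], (6, 1)),             # replace: back to 100 %
])
r_state = np.array([109.5, 109.5, 109.5, 98.6, 82.1, 0.0])
r_action = np.array([0.0, -5.0, -20.0])
U, policy = value_iteration(P, r_state, r_action)
print(np.round(U))   # approx. [3640, 3635, 3630, 3615, 3592, 3510]
print(policy)        # [0, 1, 1, 1, 2, 2] -> {N, M, M, M, R, R}
```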
In the Bellman update step of the value iteration algorithm, the max operation is computationally expensive if the number of possible actions $\mathsf{A}$ is large. This difficulty is solved by the policy iteration algorithm.
Figure 15.5: Expected utility $U(s)$, in $M, for each action (do nothing, maintain, replace) and state obtained using the value iteration algorithm. The expected utility of the optimal action is highlighted with a red border.
15.1.3 Policy Iteration
Remember that a policy $\pi = \{a(s_1), a(s_2), \cdots, a(s_{\mathsf{S}})\}$ defines an action to be taken for each state $s \in \mathcal{S}$. For the calculation of the expected utility of each state $s$, we can reduce the computational burden of the Bellman update step in equation 15.4 by replacing the max operation with the action $a$ defined by the policy $\pi^{(i)}(s)$,

$$U^{(i+1)}(s) \leftarrow \sum_{s' \in \mathcal{S}} p(s'|s, \underbrace{\pi^{(i)}(s)}_{a}) \cdot \big(r(s, \underbrace{\pi^{(i)}(s)}_{a}, s') + \gamma\, U^{(i)}(s')\big). \qquad (15.5)$$
If we employ equation 15.5 instead of equation 15.4 to update the expected utility, we are required to perform only one sum rather than $\mathsf{A}$ sums, where $\mathsf{A}$ corresponds to the number of possible actions. Once the expected utility has been calculated for each state, the policy can be optimized using

$$\pi^{(i+1)}(s) \leftarrow a^* = \arg\max_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} p(s'|s, a) \cdot \big(r(s, a, s') + \gamma\, U^{(i+1)}(s')\big).$$

We then repeat the successive steps of first calculating the expected utility with the value iteration algorithm using a fixed policy, and then updating the policy. The details of the policy iteration procedure are presented in algorithm 11, where the policy evaluation step corresponds to the value iteration performed while employing equation 15.5 instead of equation 15.4. Note that identifying an optimal policy using the policy or the value iteration algorithm leads to the same results, except that policy iteration is computationally cheaper.
Algorithm 11: Policy iteration
1  define: $s \in \mathcal{S}$ (states)
2  define: $a \in \mathcal{A}$ (actions)
3  define: $p(s'|s, a)$ (transition model)
4  define: $r(s, a, s')$ (reward)
5  define: $\gamma$ (discount factor)
6  initialize: $U(s)$ (expected utilities)
7  initialize: $\pi'(s) \in \mathcal{A}$ (policy)
8  while $\pi'(s) \neq \pi(s)$ do
9    $\pi(s) = \pi'(s)$
10   $U(s) \leftarrow \text{policy evaluation}(U(s), \pi(s), p(s'|s, a), r(s, a, s'), \gamma)$
11   for $s \in \mathcal{S}$ do
12     $\pi'(s) \leftarrow \arg\max_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} p(s'|s, a) \cdot \big(r(s, a, s') + \gamma\, U(s')\big)$
13 return: $U(s)$, $\pi^*(s) = \pi'(s)$
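A minimal Python sketch of algorithm 11, reusing the array conventions of the value iteration sketch above; the `policy_evaluation` helper iterates equation 15.5 until the utilities stabilize (the 10⁻³ tolerance is an assumption made for illustration):

```python
import numpy as np

def policy_evaluation(U, policy, P, r_state, r_action, gamma=0.97, eps=1e-3):
    """Iterate equation 15.5 for a fixed policy until the utilities stabilize."""
    while True:
        U_new = np.array([r_state[s] + r_action[policy[s]]
                          + gamma * (P[policy[s], s] @ U)
                          for s in range(len(U))])
        if np.max(np.abs(U_new - U)) < eps:
            return U_new
        U = U_new

def policy_iteration(P, r_state, r_action, gamma=0.97):
    """Policy iteration (algorithm 11): alternate evaluation and greedy improvement."""
    n_actions, n_states, _ = P.shape
    U = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)      # start with "do nothing" everywhere
    while True:
        U = policy_evaluation(U, policy, P, r_state, r_action, gamma)
        U_sa = np.array([r_state + r_action[a] + gamma * (P[a] @ U)
                         for a in range(n_actions)])
        new_policy = U_sa.argmax(axis=0)        # greedy policy improvement
        if np.array_equal(new_policy, policy):
            return U, policy
        policy = new_policy
```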
Example: Infrastructure maintenance planning  Starting from $U^{(0)}(\mathcal{S}) = \{0, 0, 0, 0, 0, 0\}\,\$\text{M}$, we perform the policy iteration algorithm, where the Bellman update step in the policy evaluation is defined by equation 15.5 and the action $a = \pi^{(i)}(s)$ is defined by the optimal policy at iteration $i$. The expected utilities for the two first and last iterations are
$$\begin{aligned} U^{(1)}(\mathcal{S}) &= \{\overbrace{2063}^{s=100\%},\ \overbrace{1290}^{80\%},\ \overbrace{768}^{60\%},\ \overbrace{455}^{40\%},\ \overbrace{197}^{20\%},\ \overbrace{0}^{0\%}\}\,\$\text{M} \\ U^{(2)}(\mathcal{S}) &= \{3483,\ 3483,\ 3468,\ 3457,\ 3441,\ 3359\}\,\$\text{M} \\ &\ \,\vdots \\ U^{(5)}(\mathcal{S}) &= \{3640,\ 3635,\ 3630,\ 3615,\ 3592,\ 3510\}\,\$\text{M}, \end{aligned}$$

and their corresponding policies are

$$\begin{aligned} \mathcal{A} &= \{\underbrace{\text{do nothing}}_{\text{N}},\ \underbrace{\text{maintain}}_{\text{M}},\ \underbrace{\text{replace}}_{\text{R}}\} \\ \pi^{(0)}(\mathcal{S}) &= \{\text{N, N, N, N, N, N}\} \\ \pi^{(1)}(\mathcal{S}) &= \{\text{M, M, R, R, R, R}\} \\ \pi^{(2)}(\mathcal{S}) &= \{\text{N, N, M, M, R, R}\} \\ &\ \,\vdots \\ \pi^*(\mathcal{S}) = \pi^{(6)} = \pi^{(5)}(\mathcal{S}) &= \{\text{N, M, M, M, R, R}\}. \end{aligned}$$
Here, we explicitly show the calculations made to go from the policy iteration loop $1 \rightarrow 2$, $U^{(1)}(100\,\%) = \$2063\text{M} \rightarrow U^{(2)}(100\,\%) = \$3483\text{M}$, within which 351 policy evaluation calls are required,

$$\begin{aligned} U^{(1)}(100\,\%) \rightarrow U^{(2)(1)}(100\,\%) &= \overbrace{1 \cdot (\$109.5\text{M} - \$5\text{M} + 0.97 \cdot \$2063\text{M})}^{\pi^{(1)}(100\%) = \text{M, maintain}} = \$2106\text{M} \\ U^{(2)(2)}(100\,\%) &= 1 \cdot (\$109.5\text{M} - \$5\text{M} + 0.97 \cdot \$2106\text{M}) = \$2147\text{M} \\ &\ \,\vdots \\ U^{(2)}(100\,\%) = U^{(2)(351)}(100\,\%) &= 1 \cdot (\$109.5\text{M} - \$5\text{M} + 0.97 \cdot \$3483\text{M}) = \$3483\text{M}, \end{aligned}$$

where, in the superscripts of $U^{(2)(351)}(100\,\%)$, the first index refers to the policy iteration loop, the second to the policy evaluation loop, and the argument to the state.
Once the policy evaluation loop is completed, the policy must be updated. Here, we explicitly look at the specific iteration $\pi^{(1)}(100\,\%) = \text{M} \rightarrow \pi^{(2)}(100\,\%) = \text{N}$, so that

$$\pi^{(2)}(100\,\%) = \arg\max_{a \in \mathcal{A}} \begin{cases} \sum_{s' \in \mathcal{S}} p(s'|s, \text{N}) \cdot \big(r(s, \text{N}) + \gamma\, U^{(2)}(s')\big) \\ \sum_{s' \in \mathcal{S}} p(s'|s, \text{M}) \cdot \big(r(s, \text{M}) + \gamma\, U^{(2)}(s')\big) \\ \sum_{s' \in \mathcal{S}} p(s'|s, \text{R}) \cdot \big(r(s, \text{R}) + \gamma\, U^{(2)}(s')\big) \end{cases}$$

$$= \arg\max_{a \in \mathcal{A}} \begin{cases} 0.95 \cdot (\$109.5\text{M} - \$0\text{M} + 0.97 \cdot \$3483\text{M}) + 0.03 \cdot (\$109.5\text{M} - \$0\text{M} + 0.97 \cdot \$3483\text{M}) + 0.02 \cdot (\$109.5\text{M} - \$0\text{M} + 0.97 \cdot \$3468\text{M}) = \$3488\text{M} \\ 1 \cdot (\$109.5\text{M} - \$5\text{M} + 0.97 \cdot \$3483\text{M}) = \$3483\text{M} \\ 1 \cdot (\$109.5\text{M} - \$20\text{M} + 0.97 \cdot \$3483\text{M}) = \$3468\text{M} \end{cases} = \text{N (do nothing)}.$$
The policy iteration converges in five loops, and it leads to the same optimal policy as the value iteration,

$$\pi^*(\mathcal{S}) = \{\underbrace{\text{do nothing}}_{s=100\%},\ \underbrace{\text{maintain}}_{80\%},\ \underbrace{\text{maintain}}_{60\%},\ \underbrace{\text{maintain}}_{40\%},\ \underbrace{\text{replace}}_{20\%},\ \underbrace{\text{replace}}_{0\%}\}.$$
Further details as well as advanced concepts regarding Markov decision processes can be found in the textbooks by Russell and Norvig² and by Bertsekas.³

² Russell, S. and P. Norvig (2009). Artificial Intelligence: A Modern Approach (3rd ed.). Prentice-Hall.
³ Bertsekas, D. P. (2007). Dynamic Programming and Optimal Control (4th ed.), Volume 1. Athena Scientific.
15.1.4 Partially Observable Markov Decision Process
One hypothesis with the Markov decision process presented in the previous section is that state variables are exactly observed, so that at the current time $t$, one knows in what state $s$ the system is. In such a case, the optimal action defined by the policy $\pi^*(s)$ can be directly selected. For the case we are now interested in, $s$ is a hidden-state variable, and the observed state $y$ is defined through the conditional probability $p(y|s)$. This problem configuration is called a partially observable MDP (POMDP). The challenge is that at any time $t$, you do not observe exactly what is the true state of the system; consequently, you cannot simply select the optimal action from the policy $\pi^*(s)$.
Because the state is observed with uncertainty, at each time $t$, we need to describe our knowledge of the state using a probability mass function $p(s) = \{\Pr(S = s_1), \Pr(S = s_2), \cdots, \Pr(S = s_{\mathsf{S}})\}$. Given the PMF $p(s)$ describing our current knowledge of $s$, an action taken $a$, and an observation $y \in \mathcal{S}$, then the probability of ending in any state $s' \in \mathcal{S}$ at time $t+1$ is given by

$$p(s'|a, y) = \frac{\overbrace{p(y|s')\,\overbrace{\sum_{s \in \mathcal{S}} \overbrace{p(s'|s, a) \cdot p(s)}^{p(s', s|a)}}^{p(s'|a)}}^{p(y, s'|a)}}{p(y)}.$$
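A short sketch of this belief update in Python, assuming the transition matrices are stored as in the earlier sketches and that the observation model $p(y|s')$ is available as a matrix `E` (an assumption made for illustration):

```python
import numpy as np

def belief_update(belief, action, y, P, E):
    """Update the state PMF after taking `action` and observing `y`.

    belief: current p(s), shape (n_states,)
    P: transition matrices p(s'|s, a), shape (n_actions, n_states, n_states)
    E: observation model p(y|s'), shape (n_states, n_observations)
    """
    predicted = P[action].T @ belief   # p(s'|a) = sum_s p(s'|s, a) p(s)
    joint = E[:, y] * predicted        # p(y, s'|a) = p(y|s') p(s'|a)
    return joint / joint.sum()         # normalize by p(y)
```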
With the MDP, the policy $\pi^*(s)$ was a function of the state we are in; with the POMDP we need to redefine the policy as being a function of the probability of the state, $\pi^*(p(s))$, we are in. In the context of an MDP, the policy can be seen as a function defined over discrete variables. With a POMDP, if there are $\mathsf{S}$ possible states, $\pi^*(p(s))$ is a function of a continuous domain with $\mathsf{S} - 1$ dimensions. This domain is continuous because $\Pr(S = s) \in (0, 1)$, and there are $\mathsf{S} - 1$ dimensions because of the constraint requiring that $\sum_{s \in \mathcal{S}} p(s) = 1$.
MDP: $\pi^*(s)$, $s \in \mathcal{S}$ (discrete); POMDP: $\pi^*(p(s))$, $p(s) \in (0, 1)^{\mathsf{S}}$ (continuous).
Because the policy is now a function of probabilities defined in a continuous domain, the value and policy iteration algorithms presented for an MDP are not directly applicable. One particularity of the POMDP is that the set of actions now includes additional information gathering through observation. The reader interested in the details of value- and policy-iteration algorithms for POMDPs should refer to specialized literature such as the paper by Kaelbling, Littman, and Cassandra.⁴ Note that solving a POMDP is significantly more demanding than solving an MDP. For practical applications, the exact solution of a POMDP is often intractable and we have to resort to approximate methods.⁵

⁴ Kaelbling, L. P., M. L. Littman, and A. R. Cassandra (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence 101(1), 99–134.
⁵ Hauskrecht, M. (2000). Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research 13(1), 33–94.
15.2 Model-Free Reinforcement Learning
The Markov decision process presented in the previous section is categorized as a model-based method because it requires knowing the model of the environment, which takes the form of the transition probabilities $p(s'|s, a)$. In some applications, it is not possible or practical to define such a model because the number of states can be too large, or can be continuous so that the definition of transition probabilities is not suited. Model-free reinforcement learning methods learn from the interaction between the agent and the environment, as depicted in figure 15.3. This learning process is typically done by subjecting the agent to multiple episodes. In this section, we will present two model-free methods suited for such a case: temporal difference learning and Q-learning.
15.2.1 Temporal Difference Learning
The quantity we are interested in estimating is $\mathbb{E}[U(s, \pi)] \equiv U(s, \pi) \equiv U(s)$, that is, the expected utility for being in a state $s$. We defined in §15.1.1 that the utility is the sum of the discounted rewards,

$$U(s, \pi) = \sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t, s_{t+1}),$$

obtained while following a policy $\pi$ so that $a_t = \pi(s_t)$. Moreover, we saw in §6.5 that we can estimate an expected value using the Monte Carlo method. Here, in the context where our agent can interact with the environment over multiple episodes, we can estimate $U(s, \pi)$ by taking the average over a set of realizations of $U(s, \pi)$.
In a general manner, the average $\bar{x}_T$ of a set $\{x_1, x_2, \cdots, x_T\}$ can be calculated incrementally by following

$$\bar{x}_t = \bar{x}_{t-1} + \frac{1}{t}\,(x_t - \bar{x}_{t-1}).$$
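A short check of this recursion, with samples drawn as in figure 15.6:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(3.0, 1.0, size=500)    # x_t : X ~ N(x; 3, 1)
x_bar = 0.0
for t, x_t in enumerate(x, start=1):  # incremental update of the average
    x_bar = x_bar + (x_t - x_bar) / t
print(x_bar, x.mean())                # both give the same value
```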
Figure 15.6: The incremental calculation of an average for a set $\{x_1, x_2, \cdots, x_{500}\}$, $x_t : X \sim \mathcal{N}(x; 3, 1)$, compared with the batch average $\bar{x} = \frac{1}{T}\sum_{t=1}^{T} x_t$.
Figure 15.6 presents an example of an application of the incremental average estimation for a set of $T = 500$ realizations of $x_t : X \sim \mathcal{N}(x; 3, 1)$. We can apply the same principle for the incremental estimation of the expected utility so that

$$U^{(i+1)}(s) \leftarrow U^{(i)}(s) + \frac{1}{N(s)}\Big(u^{(i)}(s) - U^{(i)}(s)\Big), \qquad (15.6)$$
where $N(s)$ is the number of times a state has been visited and $u^{(i)}(s)$ denotes the realization of the utility (the discounted sum of rewards) collected for state $s$. Estimating $U(s)$ through Monte Carlo sampling is limited to problems having a terminal state and requires evaluating the utilities over entire episodes. Temporal difference (TD) learning⁶ allows going around these requirements by replacing the explicit utility calculation $u^{(i)}(s)$ by the TD-target defined as $r(s, a, s') + \gamma\, U^{(i)}(s')$, so that equation 15.6 can be rewritten as

$$U^{(i+1)}(s) \leftarrow U^{(i)}(s) + \alpha\Big(\underbrace{r(s, a, s') + \gamma\, U^{(i)}(s')}_{\text{TD-target}} - U^{(i)}(s)\Big). \qquad (15.7)$$

⁶ Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3(1), 9–44.
Temporal difference learns by updating recursively the state-utility function $U^{(i+1)}(s)$ by taking the difference in expected utilities estimated at consecutive times while the agent interacts with the environment.

In equation 15.7, $\alpha$ is the learning rate defining how much is learned from the TD update step. Typically, we want to employ a learning rate that is initially large, so our poor initial values for $U(s)$ are strongly influenced by the TD updates, and that then decreases as a function of the number of times a particular set of state and action has been seen by the agent. In order to ensure the convergence of the state-utility function, the learning rate should satisfy the two following conditions,

$$\sum_{N(s,a)=1}^{\infty} \alpha(N(s, a)) = \infty, \qquad \sum_{N(s,a)=1}^{\infty} \alpha^2(N(s, a)) < \infty.$$
A common learning rate function that satisfies both criteria is

$$\alpha(N(s, a)) = \frac{c}{c + N(s, a)},$$

where $N(s, a)$ is the number of times a set of state and action has been visited and $c \in \mathbb{R}^+$ is a positive constant.
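The following Python sketch combines equation 15.7 with this learning-rate schedule to evaluate a fixed policy from simulated experience, reusing the environment arrays from the earlier sketches. The simulation loop, the value c = 70, and the use of the state visit count N(s) in place of N(s, a) (equivalent here since the policy is fixed) are assumptions made for illustration:

```python
import numpy as np

def td_policy_evaluation(policy, P, r_state, r_action,
                         gamma=0.97, c=70, n_episodes=500, T=100, seed=0):
    """Passive TD(0): estimate U(s) for a fixed policy from simulated experience."""
    rng = np.random.default_rng(seed)
    n_states = P.shape[1]
    U = np.zeros(n_states)
    N = np.zeros(n_states)                      # visit counts for the learning rate
    for _ in range(n_episodes):
        s = rng.integers(n_states)              # random initial state
        for _ in range(T):
            a = policy[s]
            s_next = rng.choice(n_states, p=P[a, s])
            reward = r_state[s] + r_action[a]
            N[s] += 1
            alpha = c / (c + N[s])
            # TD update: move U(s) toward the TD-target r + gamma * U(s')
            U[s] += alpha * (reward + gamma * U[s_next] - U[s])
            s = s_next
    return U
```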
The TD-learning update step presented in equation 15.7 defines the behavior of a passive agent, that is, an agent who is only evaluating the state-utility function for a predefined policy $\pi$. In other words, in its current form, TD-learning is suited to estimate $U(s)$ given that we already know the policy $\pi(s)$ we want the agent to follow. In practice, this is seldom the case because instead, we want an active agent, who will learn both the expected utilities and the optimal policy. The next section presents temporal difference Q-learning, a common active reinforcement learning method capable of doing both tasks simultaneously.
15.2.2 Temporal Difference Q-Learning
Temporal difference Q-learning⁷ (often referred to as simply Q-learning) is an active reinforcement learning method that consists in evaluating the action-utility function $Q(s, a)$, from which both the optimal policy

$$\pi^*(s) = \arg\max_{a \in \mathcal{A}} Q(s, a)$$

and the resulting expected utilities

$$U(s) \equiv U(s, \pi^*) = \max_{a \in \mathcal{A}} Q(s, a)$$

can be extracted. Analogously to equation 15.7, the TD update step for the action-utility function $Q(s, a)$ is defined as

$$Q^{(i+1)}(s, a) \leftarrow Q^{(i)}(s, a) + \alpha\Big(r(s, a, s') + \gamma \max_{a' \in \mathcal{A}} Q^{(i)}(s', a') - Q^{(i)}(s, a)\Big). \qquad (15.8)$$

⁷ Watkins, C. J. and P. Dayan (1992). Q-learning. Machine Learning 8(3–4), 279–292.
Equation 15.8 describes the behavior of an active agent, which updates $Q(s, a)$ by looking one step ahead and employing the action $a'$ that maximizes $Q(s', a')$. Q-learning is categorized as an off-policy reinforcement learning method because the optimal one-step look-ahead action $a'$ employed to update $Q(s, a)$ is not necessarily the action that will be taken by the agent to transit to the next state. This is because, in order to learn efficiently, it is necessary to find a tradeoff between exploiting the current best policy and exploring new policies.
While learning the action-utility function using Q-learning, one cannot rely only on the currently optimal action $a = \pi(s)$ in order to transit between successive states, $s$ and $s'$. This is because such a greedy agent that always selects the optimal action is likely to get stuck in a local maximum because of the poor initial estimates for $Q(s, a)$. One simple solution is to employ the $\epsilon$-greedy strategy, which consists in selecting the action randomly with probability $\epsilon_{N(s)}$ and otherwise selecting the action from the optimal policy $\pi(s)$. Here, $N(s)$ is again the number of times a state has been visited. Therefore, an $\epsilon$-greedy agent selects an action following

$$a = \begin{cases} a_i : p(a) = \frac{1}{\mathsf{A}},\ \forall a, & \text{if } u < \epsilon_{N(s)} & \text{(random action)} \\ \arg\max_{a \in \mathcal{A}} Q(s, a), & \text{if } u \geq \epsilon_{N(s)} & \text{(greedy action)}, \end{cases}$$

where $u$ is a sample taken from a uniform distribution $\mathcal{U}(0, 1)$. $\epsilon_{N(s)}$ should be defined so that it tends to zero when the number of times a state has been visited $N(s) \rightarrow \infty$. Algorithm 12 details the sequence of steps for temporal difference Q-learning.
Algorithm 12: Temporal difference Q-learning
1  define: $s \in \mathcal{S}$ (states)
2  define: $a \in \mathcal{A}$ (actions)
3  define: $r(s, a, s')$ (rewards)
4  define: $\gamma$ (discount factor)
5  define: $\epsilon$ (epsilon-greedy function)
6  define: $\alpha$ (learning rate function)
7  initialize: $Q^{(0)}(s, a)$ (action-utility function)
8  initialize: $N(s, a) = 0,\ \forall\{s \in \mathcal{S}, a \in \mathcal{A}\}$ (action-state counts)
9  $i = 0$
10 for episodes $e \in \{1 : \mathsf{E}\}$ do
11   initialize: $s$
12   for time $t \in \{1 : \mathsf{T}\}$ do
13     $u : U \sim \mathcal{U}(0, 1)$
14     $a = \begin{cases} a_i : p(a) = \frac{1}{\mathsf{A}},\ \forall a, & \text{if } u < \epsilon_{N(s)}\ \text{(random)} \\ \arg\max_{a \in \mathcal{A}} Q^{(i)}(s, a), & \text{if } u \geq \epsilon_{N(s)}\ \text{(greedy)} \end{cases}$
15     observe: $r(s, a, s')$, $s'$
16     $Q^{(i+1)}(s, a) \leftarrow Q^{(i)}(s, a) + \alpha(N(s, a))\big(r(s, a, s') + \gamma \max_{a' \in \mathcal{A}} Q^{(i)}(s', a') - Q^{(i)}(s, a)\big)$
17     $N(s, a) = N(s, a) + 1$
18     $s = s'$
19     $i = i + 1$
20 return: $U(s) = \max_{a \in \mathcal{A}} Q^{(i)}(s, a)$, $\pi^*(s) = \arg\max_{a \in \mathcal{A}} Q^{(i)}(s, a)$
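A Python sketch of algorithm 12 for the maintenance example, using the same array conventions as the earlier sketches and learning-rate and exploration schedules of the form c/(c + N) as discussed above; the constant c = 70, the 500 episodes of 100 steps, and the random initial state per episode match the example that follows, although the exact learning trajectory depends on the random seed:

```python
import numpy as np

def q_learning(P, r_state, r_action, gamma=0.97, c=70,
               n_episodes=500, T=100, seed=0):
    """Temporal difference Q-learning (algorithm 12) with epsilon-greedy exploration."""
    rng = np.random.default_rng(seed)
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    N_sa = np.zeros((n_states, n_actions))   # counts for the learning rate
    N_s = np.zeros(n_states)                 # counts for the exploration rate
    for _ in range(n_episodes):
        s = rng.integers(n_states)           # random initial state
        for _ in range(T):
            eps = c / (c + N_s[s])
            if rng.uniform() < eps:          # random (exploratory) action
                a = rng.integers(n_actions)
            else:                            # greedy action
                a = int(Q[s].argmax())
            s_next = rng.choice(n_states, p=P[a, s])
            reward = r_state[s] + r_action[a]
            alpha = c / (c + N_sa[s, a])
            Q[s, a] += alpha * (reward + gamma * Q[s_next].max() - Q[s, a])
            N_sa[s, a] += 1
            N_s[s] += 1
            s = s_next
    return Q.max(axis=1), Q.argmax(axis=1)   # U(s) and the greedy policy
```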
Example: Infrastructure maintenance planning  We revisit the infrastructure maintenance example, this time using the temporal difference Q-learning algorithm. The problem definition is identical to that in §15.1, except that we now initialize the action-utility functions as $Q^{(0)}(s, a) = \$0\text{M},\ \forall\{s \in \mathcal{S}, a \in \mathcal{A}\}$. Also, we define the iteration-dependent learning schedule as

$$\alpha_{N(s,a)} = \frac{c}{c + N(s, a)}$$

and the exploration schedule by

$$\epsilon_{N(s)} = \frac{c}{c + N(s)},$$

where $c = 70$. We choose to learn over a total of 500 episodes, each consisting of 100 time steps. For each new episode, the state is initialized randomly. Figure 15.7 presents the evolution of the expected utility $U(s)$ for each state as a function of the number of episodes.

Figure 15.7: Expected utility convergence for the Q-learning algorithm.

Given that we employed 500 episodes, each comprising 100 steps, the total number of iterations is 50,000. For the last iteration, the expected utilities are

$$U^{(50\,000)}(\mathcal{S}) = \{3640,\ 3635,\ 3630,\ 3615,\ 3592,\ 3510\}\,\$\text{M}.$$
Those values are identical to the ones obtained using either the policy- or value-iteration algorithm in §15.1. Analogously, the optimal policy derived from the action-utility function is also identical and corresponds to

$$\pi^*(\mathcal{S}) = \{\underbrace{\text{do nothing}}_{s=100\%},\ \underbrace{\text{maintain}}_{80\%},\ \underbrace{\text{maintain}}_{60\%},\ \underbrace{\text{maintain}}_{40\%},\ \underbrace{\text{replace}}_{20\%},\ \underbrace{\text{replace}}_{0\%}\}.$$
Figure 15.8: Evolution of $\epsilon_{N(s)}$ for each state $s$ as a function of the number of episodes.
Figure 15.8 presents $\epsilon_{N(s)}$ as a function of the number of episodes, that is, for each state, the probability that the agent selects a random action. For the learning rate $\alpha_{N(s,a)}$, which is not displayed here, there are three times as many curves because it also depends on the action $a$.
For this simple infrastructure maintenance example, using an active reinforcement learning method such as Q-learning is not beneficial in comparison with passive methods such as the policy- or value-iteration algorithm in §15.1. Nevertheless, active reinforcement learning becomes necessary for more advanced contexts where the number of states is large and often continuous. The reader interested in learning more about topics related to on-policy reinforcement learning methods such as SARSA, advanced exploration strategies, efficient action-utility function initialization, as well as continuous function approximation for Q-learning, should consult dedicated textbooks such as the one by Sutton and Barto.⁸

⁸ Sutton, R. S. and A. G. Barto (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.