In order to identify the optimal policy π*, we need to define the reward r(s, a, s′). Here, we assume that the reward simplifies to r(s, a) = r(s) + r(a), so it is only a function of the state s and the action a. For a given structure, the reward r(s) is estimated as a function of the number of users quantified through the annual average daily traffic flow (AADTF = 10^5 users/day), times a value of $3/user, times the capacity of the bridge as a function of its condition.

Note: Here the value of $3/user does not necessarily represent the direct cost per user as collected by a toll booth. It instead corresponds to the indirect value of the infrastructure for the society.
The capacity is defined as c(S) = {1, 1, 1, 0.90, 0.75, 0}, and the associated rewards are

r(S) = 10^5 users/day · 365 days · $3/user · c(S)
     = {109.5, 109.5, 109.5, 98.6, 82.1, 0} $M.
The rewards for actions r(a) correspond to a cost, so their values are less than or equal to zero,

r(A) = {0, −5, −20} $M.
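To make these numbers concrete, the minimal sketch below recomputes r(S) and r(A) from the capacity vector; the variable names (aadtf, value_per_user, capacity) are illustrative choices and not part of the original formulation.

```python
# Sketch: recompute the state and action rewards of the example.
# Variable names are illustrative; only the numbers come from the text.

aadtf = 1e5            # annual average daily traffic flow [users/day]
value_per_user = 3.0   # indirect value of the infrastructure [$/user]
days_per_year = 365

capacity = [1.0, 1.0, 1.0, 0.90, 0.75, 0.0]   # c(S), as a function of the condition

# State rewards r(S) in $M: AADTF * 365 days * $3/user * c(s)
r_state = [aadtf * days_per_year * value_per_user * c / 1e6 for c in capacity]
print(r_state)   # [109.5, 109.5, 109.5, 98.55, 82.125, 0.0]
                 # i.e. {109.5, 109.5, 109.5, 98.6, 82.1, 0} $M after rounding

# Action rewards r(A) in $M (costs, hence non-positive)
r_action = [0.0, -5.0, -20.0]
```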
Figure 15.3: Schematic representation of the interaction between an agent and its environment.
In the context of reinforcement learning, we can generalize the example above by introducing the concept of an agent interacting with its environment, as depicted in figure 15.3. In the context of the previous example, the environment is the population of structures that are degrading over time, and the agent is the hypothetical entity acting with the intent of maximizing the long-term accumulation of rewards. The environment interacts with the agent by defining, for each time t, the state s_t of the system and the reward for being in that state s, taking a given action a, and ending in the next state s′. The agent perceives the environment's state and selects an action a, which in turn can affect the environment and thus the state at time t + 1.
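The loop below is a minimal sketch of this interaction for a single structure. The one-step degradation rule and the random placeholder policy are assumptions made only to illustrate the agent-environment exchange; they do not correspond to the deterioration model of the example or to the policies developed later in this chapter.

```python
import random

# Minimal sketch of the agent-environment loop for one structure.
# The transition rule below is a hypothetical stand-in, not the
# deterioration model of the example.

R_STATE = [109.5, 109.5, 109.5, 98.6, 82.1, 0.0]   # r(S) in $M, from the example
R_ACTION = [0.0, -5.0, -20.0]                      # r(A) in $M: do nothing, maintain, replace

def step(s, a):
    """Environment: return the next state s' and the reward r(s, a) = r(s) + r(a)."""
    s_next = 0 if a == 2 else min(s + 1, len(R_STATE) - 1)   # replacing resets, otherwise degrade one step
    return s_next, R_STATE[s] + R_ACTION[a]

s, total_reward = 0, 0.0
for t in range(10):
    a = random.choice([0, 1, 2])   # placeholder policy; to be replaced by pi*(s)
    s, r = step(s, a)              # environment returns the next state and the reward
    total_reward += r
print(total_reward)
```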
This chapter explores the task of identifying the optimal policy π*(s) describing the optimal actions a* to be taken for each state s. This task is first formulated in §15.1 using the model-based method known as the Markov decision process. Building on this method, §15.2 then presents two model-free reinforcement learning methods: temporal difference and Q-learning.
15.1 Markov Decision Process
In order to formulate a Markov decision process (MDP), we need to define a planning horizon over which the utility is estimated. Planning horizons may be either finite or infinite. With a finite planning horizon, the rewards are considered over a fixed period of time. In such a case, the optimal policy π*(s, t) is nonstationary because it depends on the time t. For the infrastructure maintenance example,