15 Sequential Decisions
States: $s \in \mathcal{S} = \{s_1, s_2, \cdots, s_S\}$
Actions: $a \in \mathcal{A} = \{a_1, a_2, \cdots, a_A\}$
Policy: $\pi = \{a(s_1), a(s_2), \cdots, a(s_S)\}$
Reward: $r(s, a, s') \in \mathbb{R}$
Transition model: $p(s' \mid s, a)$
In this chapter, we explore how to make optimal decisions in a sequential context. Sequential decisions differ from the decision context presented in chapter 14 because here, the goal is to select the optimal action to be performed in order to maximize the current reward as well as future ones. In that process, we must take into account that our current action will affect future states as well as future decisions. Let us consider a system that can be in a state $s \in \mathcal{S} = \{s_1, s_2, \cdots, s_S\}$. The set of actions $a \in \mathcal{A} = \{a_1, a_2, \cdots, a_A\}$ that should be performed given each state of the system is called a policy, $\pi = \{a(s_1), a(s_2), \cdots, a(s_S)\}$.
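To make these definitions concrete, a policy can be represented as a plain mapping from states to actions. The following Python sketch is illustrative only; the state and action names are hypothetical, not taken from the text:

```python
# Hypothetical system with three states and two actions.
states = ["s1", "s2", "s3"]   # S = {s_1, s_2, s_3}
actions = ["a1", "a2"]        # A = {a_1, a_2}

# A policy pi assigns one action to each state: pi = {a(s_1), ..., a(s_S)}.
policy = {"s1": "a1", "s2": "a2", "s3": "a1"}

def act(state):
    """Return the action prescribed by the policy for a given state."""
    return policy[state]

print(act("s2"))  # prints a2
```

Note that a policy in this sense is deterministic: it commits to exactly one action per state, which matches the set notation $\pi = \{a(s_1), \cdots, a(s_S)\}$ above.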
In the context of sequential decisions, the goal is to identify the optimal policy $\pi^*$ describing for each possible state of the system the optimal action to take. This task can be formulated as a reinforcement learning (RL) problem, where the optimal policy is learned by choosing the actions $a^*(s)$ that maximize the expected utility over a given planning horizon.

Note: Reinforcement learning differs from supervised learning because the optimal policy is not learned using other examples of policies. RL also differs from unsupervised learning because the goal is not to discover a structure or patterns in the data.

In order to compute the expected utility, we need a function describing for each set of current state $s$, action $a$, and next state $s'$ the associated reward $r(s, a, s') \in \mathbb{R}$. In practical cases, it can happen that the reward simplifies to a special case such as $r(s, a, s') = r(s, a)$, $r(s, a, s') = r(a, s')$, or even $r(s, a, s') = r(s)$. The last part we need
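The reward function and its special cases can be sketched as lookup tables. This is a hypothetical illustration, with made-up states, actions, and reward values:

```python
# Full reward r(s, a, s'): depends on current state, action, and next state.
# Hypothetical values for illustration only.
def reward(s, a, s_next):
    """Return r(s, a, s'); unlisted triplets default to zero."""
    table = {
        ("s1", "a1", "s2"): 1.0,
        ("s1", "a1", "s1"): 0.0,
        ("s2", "a2", "s1"): -1.0,
    }
    return table.get((s, a, s_next), 0.0)

# Special case r(s, a, s') = r(s, a): the next state is irrelevant.
def reward_sa(s, a):
    """Return r(s, a), a reward that depends only on the state-action pair."""
    table = {("s1", "a1"): 1.0, ("s2", "a2"): -1.0}
    return table.get((s, a), 0.0)
```

In the simplest special case, $r(s, a, s') = r(s)$, the table would collapse to one value per state.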
to define is the Markovian transition model $p(s' \mid s_j, a_i) = \Pr(S' = s_k \mid s = s_j, a_i), \ \forall k \in \{1{:}S\}$, that is,

$$p(s' \mid s_j, a_i) \equiv \begin{cases} \Pr(S_{t+1} = s_1 \mid s_t = s_j, a_i) \\ \Pr(S_{t+1} = s_2 \mid s_t = s_j, a_i) \\ \vdots \\ \Pr(S_{t+1} = s_S \mid s_t = s_j, a_i). \end{cases}$$
Note: A transition model is analogous to the concept of a conditional probability table presented in chapter 11.

The transition model describes, given the current state of the system $s_j$ and an action taken $a_i$, the probability of ending in any