11 Bayesian Networks

Bayesian networks were introduced by Pearl [1] and are also known as belief networks and directed graphical models. They are the result of the combination of the probability theory covered in chapter 3 with graph theory, which employs graphs defined by links and nodes.

[1] Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann.

We saw in §3.2 that the chain rule allows formulating the joint probability for a set of random variables using conditional and marginal probabilities, for example,

$$p(x_1, x_2, x_3) = p(x_3|x_2, x_1)\, p(x_2|x_1)\, p(x_1).$$
Bayesian networks (BNs) employ nodes to represent random variables and directed links to describe the dependencies between them. Bayesian networks are probabilistic models where the goal is to learn the joint probability defined over the entire network. The joint probability for the structure encoded by these nodes and links is formulated using the chain rule. The key with Bayesian networks is that they allow building sparse models for which efficient variable elimination algorithms exist in order to estimate any conditional probabilities from the joint probability.
BNs can be categorized as unsupervised learning [2], where the goal is to estimate the joint probability density function (PDF) for a set of observed variables. In its most general form, we may seek to learn the structure of the BN itself. In this chapter, we restrict ourselves to the case where we know the graph structure, and the goal is to learn to predict unobserved quantities given some observed ones.

[2] Ghahramani, Z. (2004). Unsupervised learning. In Advanced lectures on machine learning, Volume 3176, pp. 72–112. Springer.
Figure 11.1: Example of a Bayesian network for representing the relationships between temperature (T: t ∈ {cold, hot}), the presence of the flu virus (V: v ∈ {yes, no}), and being sick from the flu virus (F: f ∈ {sick, ¬sick}).
Flu virus example. Section 3.3.5 presented the distinction between correlation and causality using the flu virus example. A Bayesian network can be employed to model the dependence between the temperature, the presence of the flu virus, and being sick from the flu. We model our knowledge of these three quantities using discrete random variables that are represented by the nodes in figure 11.1. The arrows represent the dependencies between variables: the temperature T affects the virus prevalence V, which in turn affects
j.-a. goulet 168
the probability of catching the virus and being sick, F. The absence of a link between T and F indicates that the temperature and being sick from the flu are conditionally independent from each other. In the context of this example, conditional independence implies that T and F are independent when V is known. The joint probability for T, V, and F,

$$p(f, v, t) = \overbrace{p(f|v)}^{=\,p(f|v,t)} \cdot \underbrace{p(v|t) \cdot p(t)}_{p(v,t)},$$

is obtained using the chain rule, where, for each arrow, the conditional probabilities are described in a conditional probability table.
Virus example: Marginal and conditional probability tables:

$p(t) = \{p(\text{cold}), p(\text{hot})\} = \{0.4, 0.6\}$

p(v|t):
             t = cold   t = hot
  v = yes      0.8        0.1
  v = no       0.2        0.9

p(f|v):
             v = yes    v = no
  f = sick     0.7        0
  f = ¬sick    0.3        1
Joint probability using the chain rule:

p(v, t) = p(v|t) · p(t):
             t = cold    t = hot
  v = yes    0.8·0.4     0.1·0.6
  v = no     0.2·0.4     0.9·0.6

             t = cold    t = hot
  v = yes     0.32        0.06
  v = no      0.08        0.54

p(f, v, t) = p(f|v) · p(v, t):
  f = sick     t = cold    t = hot
    v = yes    0.32·0.7    0.06·0.7
    v = no     0.08·0      0.54·0
  f = ¬sick    t = cold    t = hot
    v = yes    0.32·0.3    0.06·0.3
    v = no     0.08·1      0.54·1

  f = sick     t = cold    t = hot
    v = yes     0.224       0.042
    v = no      0           0
  f = ¬sick    t = cold    t = hot
    v = yes     0.096       0.018
    v = no      0.08        0.54

Variable elimination (marginalization):

$p(f, t) = \sum_v p(f, v, t)$:
              t = cold   t = hot
  f = sick      0.224      0.042
  f = ¬sick     0.176      0.558

$p(f) = \sum_t p(f, t)$:
  f = sick      0.266
  f = ¬sick     0.734

p(t|f) = p(f, t)/p(f):
              t = cold   t = hot
  f = sick      0.84       0.16
  f = ¬sick     0.24       0.76
In the case where we observe F = f, we can employ the marginalization operation in order to obtain a conditional probability quantifying how the observation f ∈ {sick, ¬sick} changes the probability for the temperature T,

$$p(t|f) = \frac{p(f,t)}{p(f)} = \frac{\sum_v p(f,v,t)}{\sum_t \sum_v p(f,v,t)}.$$
In minimalistic problems such as this one, it is trivial to calculate the joint probability using the chain rule and to eliminate variables using marginalization. However, in practical cases involving dozens of variables with as many links between them, these calculations become computationally demanding. Moreover, in practice, we seldom know the marginal and conditional probability tables. A key interest of working with directed graphs is that efficient estimation methods are available to perform all those tasks.
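The flu example's tables and the marginalization steps above can be reproduced numerically. Below is a minimal sketch, assuming numpy is available; the array-axis orderings stated in the comments are choices of this sketch, not the book's notation:

```python
import numpy as np

# Tables from the flu example. Assumed axis orderings:
# t indexes {cold, hot}, v indexes {yes, no}, f indexes {sick, not sick}.
p_t = np.array([0.4, 0.6])                       # p(t)
p_v_given_t = np.array([[0.8, 0.1],              # p(v|t), axes (v, t)
                        [0.2, 0.9]])
p_f_given_v = np.array([[0.7, 0.0],              # p(f|v), axes (f, v)
                        [0.3, 1.0]])

# Chain rule: p(f, v, t) = p(f|v) p(v|t) p(t)
p_vt = p_v_given_t * p_t                         # p(v, t), axes (v, t)
p_fvt = p_f_given_v[:, :, None] * p_vt[None, :, :]   # axes (f, v, t)

# Variable elimination by marginalization
p_ft = p_fvt.sum(axis=1)                         # p(f, t), sum over v
p_f = p_ft.sum(axis=1)                           # p(f)
p_t_given_f = p_ft / p_f[:, None]                # p(t|f)

print(np.round(p_f, 3))           # [0.266 0.734]
print(np.round(p_t_given_f, 2))   # rows f = {sick, not sick}: 0.84, 0.16 and 0.24, 0.76
```

The three-way table `p_fvt` is built explicitly here only for clarity; the sums could equally be interleaved with the products, which is exactly the point of the variable elimination discussed later in the chapter.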
Bayesian networks are applicable not only for discrete random variables but also for continuous ones, or a mix of both. In this chapter, we restrict ourselves to the study of BNs for discrete random variables. Note that the state-space models presented in chapter 12 can be seen as a time-dependent Bayesian network using Gaussian random variables with linear dependence models. This chapter presents the nomenclature employed to define graphical models, the methods for performing inference, and the methods allowing us to learn the conditional probabilities defining the dependencies between random variables. In addition, we present an introduction to time-dependent Bayesian networks, which are referred to as dynamic Bayesian networks. For advanced topics regarding Bayesian networks, the reader is invited to consult specialized textbooks such as the one by Nielsen and Jensen [3] or Murphy's PhD thesis [4].

[3] Nielsen, T. D. and F. V. Jensen (2007). Bayesian networks and decision graphs. Springer.

[4] Murphy, K. P. (2002). Dynamic Bayesian networks: representation, inference and learning. PhD thesis, University of California, Berkeley.
probabilistic machine learning for civil engineers 169
11.1 Graphical Models Nomenclature
Bayesian networks employ a special type of graph: the directed acyclic graph (DAG). A DAG $\mathcal{G} = \{\mathcal{U}, \mathcal{E}\}$ is defined by a set of nodes $\mathcal{U}$ interconnected by a set of directed links $\mathcal{E}$. In order to be acyclic, the directed links between variables cannot be defined such that there are self-loops or cycles in the graph. For a set of random variables, there are many ways to define links between variables, each one leading to the same joint probability. Note that directed links between variables are not required to describe causal relationships. Despite causality not being a requirement, it is a key to efficiency; if the directed links in a graphical model are assigned following the causal relationships, it generally produces sparse models requiring the definition of a smaller number of conditional probabilities than noncausal counterparts.
Figure 11.2: Example of dependence represented by directed links between hidden (white) and observed (shaded) nodes describing random variables: (a) a directed acyclic graph (DAG) over $X_1$, $X_2$, $X_3$, and $X_4$; (b) a directed graph containing a cycle among $X_1$, $X_2$, and $X_3$.
Figure 11.2a presents a directed acyclic graph, and figure 11.2b presents a directed graph containing a cycle, so that the latter cannot be modeled as a Bayesian network. In figure 11.2a, random variables are represented by nodes, where the observed variable $X_4$ depends on the hidden variables $X_2$ and $X_3$. The directions of the links indicate that $X_2$ and $X_3$ are the parents of $X_4$, that is, $\mathrm{parents}(X_4) = \{X_2, X_3\}$, and consequently, $X_4$ is the child of $X_2$ and $X_3$. Each child is associated with a conditional probability table (CPT) whose size depends on the number of parents, $p(x_i|\mathrm{parents}(X_i))$. Nodes without parents are described by their marginal prior probabilities $p(x_i)$. The joint PDF $p(\mathcal{U})$ for the entire Bayesian network is formulated using the chain rule,

$$p(\mathcal{U}) = p(x_1, x_2, \cdots, x_{\mathrm{X}}) = \prod_{i=1}^{\mathrm{X}} p(x_i|\mathrm{parents}(X_i)).$$

This application of the chain rule requires that, given its parents, each node is independent of its other ancestors.
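This factorization can be sketched in code. The dictionary-based graph and CPT encoding below is an assumption made for illustration, populated with the flu example's numbers from earlier in the chapter:

```python
# Joint probability of a full assignment in a discrete Bayesian network,
# computed as the product of p(x_i | parents(X_i)) over all nodes.
# The graph/CPT encoding is an assumed illustration, not the book's notation.

def joint_probability(assignment, parents, cpt):
    """assignment: dict node -> value
    parents: dict node -> tuple of parent nodes
    cpt: dict node -> dict mapping (value, tuple of parent values) -> probability
    """
    p = 1.0
    for node, value in assignment.items():
        pa_values = tuple(assignment[pa] for pa in parents[node])
        p *= cpt[node][(value, pa_values)]
    return p

# Flu example from figure 11.1: T -> V -> F
parents = {"T": (), "V": ("T",), "F": ("V",)}
cpt = {
    "T": {("cold", ()): 0.4, ("hot", ()): 0.6},
    "V": {("yes", ("cold",)): 0.8, ("yes", ("hot",)): 0.1,
          ("no", ("cold",)): 0.2, ("no", ("hot",)): 0.9},
    "F": {("sick", ("yes",)): 0.7, ("sick", ("no",)): 0.0,
          ("not sick", ("yes",)): 0.3, ("not sick", ("no",)): 1.0},
}
print(joint_probability({"T": "cold", "V": "yes", "F": "sick"}, parents, cpt))
# 0.4 * 0.8 * 0.7 = 0.224
```

Because each factor only consults a node and its parents, the table sizes grow with the number of parents per node rather than with the total number of variables, which is where the sparsity advantage comes from.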
Figure 11.3: Example of cyanobacteria bloom: (a) cyanobacteria seen under a microscope; (b) cyanobacteria bloom in a rural environment. (Photos: NASA and USGS)

Figure 11.4: Bayesian network for the cyanobacteria example: (a) semantic representation, with temperature and fertilizer pointing to cyanobacteria, which points to fish mortality and water color; (b) the same network represented with random variables T, F, C, M, and W.
Cyanobacteria example. We explore the example illustrated in figure 11.3 of cyanobacteria blooms that can occur in lakes, rivers, and estuaries. Cyanobacteria blooms are typically caused by the use of fertilizers that wash into a water body, combined with warm temperatures that allow for bacteria reproduction. Cyanobacteria blooms can cause a change in the water color and can cause fish or marine life mortality. We employ a Bayesian network to describe the joint probability for a set of random variables consisting of the temperature T, the use of fertilizer F, the presence of cyanobacteria in water C, fish mortality M, and the water color W. In figure 11.4,
the definition of the graph structure and its links follows the causal direction, where the presence of cyanobacteria in water C depends on the temperature T and the presence of fertilizer F. Both the fish mortality M and the water color W depend on the presence of cyanobacteria in water C.
11.2 Conditional Independence
As stated earlier, the absence of a link between two variables indicates that the pair is conditionally independent. In the case of a serial connection, as illustrated in figure 11.5a, the absence of a link between A and C indicates that these two are independent if B is known, that is, $A \perp\!\!\!\perp C\,|\,\{B = b\}$. Therefore, as long as B is not observed, A and C depend on each other through B. It is equivalent to say that, given its parent (i.e., B), C is independent of all its other non-descendants. In the case of a diverging connection (figure 11.5b), again the absence of a link between A and C indicates that these two are independent if B is observed. The case of a converging connection, represented in figure 11.5c, is different from the two others; the absence of a link between A and C implies that both variables are independent unless B, or one of its descendants, is observed. When B, or one of its descendants, is observed, the knowledge gained for B also modifies the knowledge for A and C. We say that A and C are d-separated in the case of a serial or diverging connection where the intermediary variable B is observed, or in the case of a converging connection where neither the intermediary variable B nor one of its descendants is observed. For any d-separated variables, a change in the knowledge for one variable does not affect the knowledge of the others.
Figure 11.5: Cases where A and C are conditionally independent for the different types of connection in Bayesian networks: (a) serial connection A → B → C; (b) diverging connection A ← B → C; (c) converging connection A → B ← C, with descendant D below B.

Figure 11.6: The concept of conditional independence illustrated using the cyanobacteria example: (a) serial connection, $T \perp\!\!\!\perp M|c$ and $T \perp\!\!\!\perp W|c$; (b) diverging connection, $M \perp\!\!\!\perp W|c$; (c) converging connection, $T \perp\!\!\!\perp F$.
Cyanobacteria example. In the example presented in figure 11.4b, the sets of variables {T, C, M}, {T, C, W}, {F, C, M}, and {F, C, W} are all examples of serial connections. If, as illustrated in figure 11.6a, we observe the presence of cyanobacteria, that is, C = yes, then the temperature T becomes independent of fish mortality M and water color W, that is, $T \perp\!\!\!\perp M|c$, $T \perp\!\!\!\perp W|c$. It implies that gaining knowledge about water temperature would not change our knowledge about either fish mortality or water color, because these two quantities only depend on the presence of cyanobacteria, which is now a certainty given that C was observed. For the diverging connection between variables {M, C, W} illustrated in figure 11.6b, observing C = c causes the fish mortality M to be independent from the water color W, $M \perp\!\!\!\perp W|c$. Gaining
knowledge about the occurrence of fish mortality would not change our knowledge about water color because these two quantities only depend on the presence of cyanobacteria, which is a certainty when C is observed. For the converging connection between variables {T, C, F} represented in figure 11.6c, despite the absence of a link between T and F, the temperature is not independent from the use of fertilizer F if we observe C or one of its descendants, that is, M or W. Without observing C, M, or W, the knowledge gained about the use of fertilizer has no impact on our knowledge of the temperature. On the other hand, if we observe the presence of cyanobacteria (C = yes), then knowing that no fertilizer is present in the environment (F = no) would increase the probability that the temperature is high, because there is only a small probability of having cyanobacteria without fertilizer and with cold temperature.
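This explaining-away behavior can be checked numerically with the CPTs listed in §11.3. A minimal enumeration sketch (the dictionary encoding is a choice of this sketch, not the book's notation) compares p(t = hot | c = yes) with p(t = hot | c = yes, f = no):

```python
from itertools import product

# CPTs from the cyanobacteria example (§11.3).
p_t = {"cold": 0.4, "hot": 0.6}
p_f = {"yes": 0.2, "no": 0.8}
p_c_yes = {("cold", "yes"): 0.5, ("hot", "yes"): 0.95,
           ("cold", "no"): 0.05, ("hot", "no"): 0.8}   # p(c = yes | t, f)

def p_joint(t, f, c):
    """Joint probability p(t, f, c) = p(t) p(f) p(c|t,f)."""
    pc = p_c_yes[(t, f)] if c == "yes" else 1 - p_c_yes[(t, f)]
    return p_t[t] * p_f[f] * pc

# p(t = hot | c = yes): marginalize over the unobserved fertilizer f
num = sum(p_joint("hot", f, "yes") for f in p_f)
den = sum(p_joint(t, f, "yes") for t, f in product(p_t, p_f))
print(round(num / den, 3))    # 0.899

# p(t = hot | c = yes, f = no): the absent fertilizer is "explained away,"
# so a hot temperature becomes even more probable.
num2 = p_joint("hot", "no", "yes")
den2 = sum(p_joint(t, "no", "yes") for t in p_t)
print(round(num2 / den2, 3))  # 0.96
```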
11.3 Inference
The purpose of Bayesian networks is not only to define the joint probability for random variables; it is to perform inference, that is, to compute the conditional probability for a set of unobserved random variables, given a set of observed ones. Let us consider $\mathcal{U} = \{X_1, X_2, \cdots, X_{\mathrm{X}}\}$, the set of random variables corresponding to the nodes defining a system, a subset of observed variables $\mathcal{D} \subseteq \mathcal{U}$, and another subset of query variables $\mathcal{Q} \subseteq \mathcal{U}$, such that $\mathcal{Q} \cap \mathcal{D} = \emptyset$. Following the definition of conditional probabilities presented in §3.2, the posterior probability for the variables in $\mathcal{Q}$ given observations $\mathcal{D}$ is

$$p(\mathcal{Q}|\mathcal{D}) = \frac{p(\mathcal{Q}, \mathcal{D})}{p(\mathcal{D})} = \frac{\sum_{\mathcal{U}\setminus\{\mathcal{Q},\mathcal{D}\}} p(\mathcal{U})}{\sum_{\mathcal{U}\setminus\{\mathcal{D}\}} p(\mathcal{U})},$$

where $\mathcal{U} \setminus \{\mathcal{Q}, \mathcal{D}\}$ describes the set of variables belonging to $\mathcal{U}$ while excluding those in $\{\mathcal{Q}, \mathcal{D}\}$. For the cyanobacteria example
Cyanobacteria example:

$$\mathcal{U} = \left\{\begin{array}{l} T: t \in \{\text{cold, hot}\}\\ F: f \in \{\text{yes, no}\}\\ C: c \in \{\text{yes, no}\}\\ M: m \in \{\text{yes, no}\}\\ W: w \in \{\text{clear, green}\} \end{array}\right\}$$
presented in figure 11.4b, let us consider that we want the posterior probability $p(m|w = \text{green})$, that is, the probability of fish mortality $M$ given that we have observed colored water, $W = \text{green}$. This conditional probability is described by

$$p(m|w = \text{green}) = \frac{p(m, w = \text{green})}{p(w = \text{green})},$$

where both terms on the right-hand side can be obtained through the marginalization of the joint probability mass function (PMF),

$$p(m, w = \text{green}) = \sum_t \sum_f \sum_c p(t, f, c, m, w = \text{green})$$

$$p(w = \text{green}) = \sum_m p(m, w = \text{green}).$$

This approach is theoretically correct; however, in practice, calculating the joint probability $p(\mathcal{U})$ quickly becomes computationally intractable with the increase in the number of variables in $\mathcal{U}$. The solution is to avoid computing the full joint probability table and instead proceed by eliminating variables sequentially.
Cyanobacteria example: CPTs

$p(t) = \{p(\text{cold}), p(\text{hot})\} = \{0.4, 0.6\}$
$p(f) = \{p(\text{yes}), p(\text{no})\} = \{0.2, 0.8\}$

p(c|t, f):
  c = yes      t = cold   t = hot
    f = yes      0.5        0.95
    f = no       0.05       0.8
  c = no       t = cold   t = hot
    f = yes      0.5        0.05
    f = no       0.95       0.2

p(m|c):
             c = yes   c = no
  m = yes      0.6       0.1
  m = no       0.4       0.9

p(w|c):
              c = yes   c = no
  w = clear     0.7       0.2
  w = green     0.3       0.8
Variable elimination. The goal of variable elimination is to avoid computing the full joint probability table $p(\mathcal{U})$ by working with its chain-rule factorization. For the cyanobacteria example, the joint probability is

$$p(\mathcal{U}) = p(t) \cdot p(f) \cdot p(c|t,f) \cdot p(w|c) \cdot p(m|c), \tag{11.1}$$
where computing $p(m, w = \text{green})$ corresponds to

$$p(m, w = \text{green}) = \sum_t \sum_f \sum_c p(t) \cdot p(f) \cdot p(c|t,f) \cdot p(w = \text{green}|c) \cdot p(m|c)$$

$$= \sum_c \underbrace{p(w = \text{green}|c)}_{2\times 1} \cdot \underbrace{p(m|c)}_{2\times 2} \cdot \sum_t \underbrace{p(t)}_{2\times 1} \cdot \sum_f \underbrace{p(f)}_{2\times 1} \cdot \underbrace{p(c|t,f)}_{2\times 2\times 2} = \{0.1354,\ 0.3876\},$$

where the successive intermediate tables are $p(c,f|t)$ ($2\times 2\times 2$), $p(c|t)$ ($2\times 2$), $p(c,t)$ ($2\times 2$), $p(c)$ ($2\times 1$), $p(m,c)$ ($2\times 2$), $p(m, w{=}\text{green}, c)$ ($2\times 2$), and $p(m, w{=}\text{green})$ ($2\times 1$).
Variable elimination:

p(c, f|t) = p(f) · p(c|t, f):
  c = yes      t = cold    t = hot
    f = yes    0.5·0.2     0.95·0.2
    f = no     0.05·0.8    0.8·0.8
  c = no       t = cold    t = hot
    f = yes    0.5·0.2     0.05·0.2
    f = no     0.95·0.8    0.2·0.8

$p(c|t) = \sum_f p(c, f|t)$:
             t = cold   t = hot
  c = yes      0.14       0.83
  c = no       0.86       0.17

p(c, t) = p(c|t) · p(t):
             t = cold    t = hot
  c = yes    0.14·0.4    0.83·0.6
  c = no     0.86·0.4    0.17·0.6

$p(c) = \sum_t p(c, t)$:
  c = yes     0.554
  c = no      0.446

p(m, c) = p(m|c) · p(c):
             c = yes      c = no
  m = yes    0.6·0.554    0.1·0.446
  m = no     0.4·0.554    0.9·0.446

p(m, w = green, c) = p(w = green|c) · p(m, c):
             c = yes       c = no
  m = yes    0.3324·0.3    0.0446·0.8
  m = no     0.2216·0.3    0.4014·0.8

$p(m, w = \text{green}) = \sum_c p(m, w = \text{green}, c)$:
  m = yes     0.1354
  m = no      0.3876

p(w = green) = 0.1354 + 0.3876 = 0.523

Inference:

p(m|w = green) = p(m, w = green)/p(w = green):
  m = yes     0.1354/0.523 = 0.26
  m = no      0.3876/0.523 = 0.74
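The elimination sequence above can be reproduced with a short script. Below is a minimal sketch assuming numpy; the axis orderings noted in the comments are choices of this sketch:

```python
import numpy as np

# CPTs for the cyanobacteria example. Assumed orderings:
# t = [cold, hot], f = [yes, no], c = [yes, no], m = [yes, no], w = [clear, green].
p_t = np.array([0.4, 0.6])
p_f = np.array([0.2, 0.8])
p_c_tf = np.array([[[0.5, 0.5],            # p(c|t,f), axes (t, f, c)
                    [0.05, 0.95]],
                   [[0.95, 0.05],
                    [0.8, 0.2]]])
p_m_c = np.array([[0.6, 0.4],              # p(m|c), axes (c, m)
                  [0.1, 0.9]])
p_w_c = np.array([[0.7, 0.3],              # p(w|c), axes (c, w)
                  [0.2, 0.8]])

# Eliminate f, then t, then c; no intermediate table exceeds 2x2x2 entries.
p_c_t = np.einsum("f,tfc->tc", p_f, p_c_tf)   # p(c|t)
p_c = np.einsum("t,tc->c", p_t, p_c_t)        # p(c)
p_m_wgreen = np.einsum("c,cm,c->m", p_c, p_m_c, p_w_c[:, 1])  # p(m, w=green)
p_m_given_wgreen = p_m_wgreen / p_m_wgreen.sum()

print(np.round(p_m_wgreen, 4))        # [0.1354 0.3876]
print(np.round(p_m_given_wgreen, 2))  # [0.26 0.74]
```

Each `einsum` call performs one product-then-sum contraction, which mirrors pushing a summation operator inside the factorized joint of equation 11.1.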
Note how several terms in equation 11.1 are independent of the variables in the summation operators. This allows us to take out of the sums the terms that do not depend on them and then perform the marginalization sequentially. This procedure is more efficient than working with full probability tables. In the previous equation, braces refer to the ordering of operations and indicate the size of each probability table. The variable elimination procedure allows working with probability tables containing no more than $2^3 = 8$ entries. In comparison, the full joint probability table for $p(\mathcal{U})$ contains $2^5 = 32$ entries.
The efficiency of the variable elimination approach depends on the ordering of operations. The common method for ordering operations while performing variable elimination is the junction tree method. This method transforms the graph into a tree and then employs clustering to group nodes. The reader interested in the details of this inference method should refer to specialized textbooks. Note that the methods covered above are exact inference methods. In the case where the number of variables is so large that exact methods become computationally prohibitive, we can also resort to approximate methods based on Monte Carlo sampling.
11.4 Conditional Probability Estimation
(a) Fully observed Bayesian network:

   i     T     F     C     M     W
   1    cold   no    yes   yes   clear
   2    hot    no    yes   yes   clear
   3    hot    yes   no    yes   green
   ⋮
   D    hot    no    no    yes   green

(b) Partially observed Bayesian network:

   i     T     F     C     M     W
   1    cold   ?     yes   ?     clear
   2    ?      ?     yes   yes   clear
   3    ?      yes   no    yes   ?
   ⋮
   D    hot    no    no    ?     green

Figure 11.7: Example of data sets for learning conditional probability tables.
In the previous sections, it was always assumed that the probabilities contained in conditional probability tables (CPTs) were known. In practice this is seldom the case; CPTs must be learned from data. There are two typical learning setups. In the first setup, the Bayesian network is fully observed, so each observation $\mathcal{D}_i = \{\mathbf{x}_i\}$ consists in a joint realization $\mathbf{x}_i: \mathbf{X} \sim p(\mathcal{U})$. Figure 11.7a presents a table where each line is one observation for the fully observed BN employed in the cyanobacteria example. In the second learning setup, the BN is only partially observed, so that some variables are not observed for the realization of $\mathbf{x}_i: \mathbf{X} \sim p(\mathcal{U})$. Figure 11.7b presents a data set for the cyanobacteria example where the BN is partially observed.
11.4.1 Fully Observed Bayesian Network
For a fully observed Bayesian network, the maximum likelihood estimate (MLE) for the conditional probability $\Pr(X_1 = x_1|X_2 = x_2)$ can be estimated as the ratio between the number of realizations of a specific joint outcome $\{X_1 = x_1, X_2 = x_2\}$, divided by the total number of realizations of the outcome $\{X_2 = x_2\}$,

$$\widehat{\Pr}(X_1 = x_1|X_2 = x_2) = \hat{p}(x_1|x_2) = \frac{\#\{X_1 = x_1, X_2 = x_2\}}{\#\{X_2 = x_2\}}. \tag{11.2}$$
We must be careful in the case where a specific outcome is not observed in the data set, so that $\#\{X_1 = x_1, X_2 = x_2\} = 0$. (Reminder: $\#\{\,\cdot\,\}$ denotes the number of elements in a set.) In such a case, the MLE will lead to a conditional probability equal to zero. This situation is problematic because we might not have observed this specific outcome yet, simply because the number of observations is too limited. A solution is to employ maximum a posteriori (MAP) estimation by adding one observation count to each possible outcome. Given that $x_1 \in \{1, 2, \cdots, n\}$, equation 11.2 becomes

$$\widehat{\Pr}(X_1 = x_1|X_2 = x_2) = \frac{\#\{X_1 = x_1, X_2 = x_2\} + 1}{\#\{X_2 = x_2\} + n}. \tag{11.3}$$
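Equations 11.2 and 11.3 amount to counting. A small sketch on a hypothetical list of (x1, x2) pairs (the data below is invented purely for illustration):

```python
# MLE and MAP (add-one) estimates of Pr(X1 = x1 | X2 = x2) from joint
# realizations, following equations 11.2 and 11.3.

def mle(data, x1, x2):
    joint = sum(1 for a, b in data if a == x1 and b == x2)
    cond = sum(1 for _, b in data if b == x2)
    return joint / cond

def map_estimate(data, x1, x2, n_outcomes):
    joint = sum(1 for a, b in data if a == x1 and b == x2)
    cond = sum(1 for _, b in data if b == x2)
    return (joint + 1) / (cond + n_outcomes)

# Hypothetical (x1, x2) realizations
data = [(1, 0), (1, 0), (0, 0), (1, 1), (0, 1), (0, 1)]
print(mle(data, 1, 0))               # 2/3
print(map_estimate(data, 1, 0, 2))   # (2 + 1)/(3 + 2) = 0.6
# An outcome never observed gets probability zero under the MLE,
# but stays strictly positive under the MAP estimate:
print(mle(data, 2, 0))               # 0.0 (x1 = 2 never seen with x2 = 0)
print(map_estimate(data, 2, 0, 3))   # (0 + 1)/(3 + 3) = 1/6
```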
Cyanobacteria example. From the data set $\mathcal{D}$ presented in figure 11.7a, we can estimate each CPT involved in the definition of $p(\mathcal{U})$ in equation 11.1. For example, we might want to obtain the MLE for the probability of having cyanobacteria given a hot temperature and the presence of fertilizer, $p(c = \text{yes}|t = \text{hot}, f = \text{yes})$. In such a case, we compute the number of realizations of $\{c = \text{yes}, t = \text{hot}, f = \text{yes}\}$ in $\mathcal{D}$, divided by the number of realizations of $\{t = \text{hot}, f = \text{yes}\}$. If we want to employ the MAP estimate instead of the MLE, we add one count to the numerator and two to the denominator so that

$$\hat{p}(c = \text{yes}|t = \text{hot}, f = \text{yes}) = \frac{\#\{c = \text{yes}, t = \text{hot}, f = \text{yes}\} + 1}{\#\{t = \text{hot}, f = \text{yes}\} + 2}. \tag{11.4}$$

The same method applies for any entry in a conditional probability table. In the case of the estimation of marginal probabilities such as $p(t = \text{hot})$, the calculation simplifies to the number of events where $\{t = \text{hot}\}$, divided by $\mathrm{D} + 2$, the number of joint observations in the data set plus the number of possible outcomes for $t$,

$$\hat{p}(t = \text{hot}) = \frac{\#\{t = \text{hot}\} + 1}{\mathrm{D} + 2}.$$
Before going further, we will explore the theoretical justification for the MLE and MAP in equations 11.2 and 11.3. Let us consider the simplified case where a network is made from X random variables $\mathcal{U} = \{\mathbf{X}\} = \{[X_1\ X_2\ \cdots\ X_{\mathrm{X}}]^{\intercal}\}$, where $x_k \in \{0, 1\},\ \forall k \in \{1{:}\mathrm{X}\}$. Here, the definition of a specific structure for the network is not necessary; we simply assume that the network is structured as a DAG, so that for each $X_k$ we have $p(x_k|\mathrm{parents}(X_k))$. The quantity we seek to estimate from the data is the conditional probability $\theta_k = p(x_k = 1|\mathrm{parents}(X_k))$, where, because $x_k$ is a binary variable, we also have $1 - \theta_k = p(x_k = 0|\mathrm{parents}(X_k))$. We describe the prior probability for $\theta_k$ by a Beta PDF (see §4.3),

$$f(\theta_k) = \mathrm{B}(\theta_k; \alpha_0 + 1, \beta_0 + 1) = \frac{\theta_k^{\alpha_0}(1 - \theta_k)^{\beta_0}}{\mathrm{B}(\alpha_0 + 1, \beta_0 + 1)} \propto \theta_k^{\alpha_0}(1 - \theta_k)^{\beta_0}.$$

Note: We add +1 to the parameters of the Beta prior because without them, $\mathrm{B}(\theta_k; \alpha_0, \beta_0) \propto \theta_k^{\alpha_0 - 1}(1 - \theta_k)^{\beta_0 - 1}$, which would lead to an MAP equal to

$$\hat{\theta}_k = \frac{\alpha_0 + \alpha - 1}{\alpha_0 + \beta_0 + \alpha + \beta - 2}.$$
We have a data set

$$\mathcal{D} = \{\mathcal{D}_1, \mathcal{D}_2, \cdots, \mathcal{D}_{\mathrm{D}}\} = \{\mathcal{U} = \mathbf{x}_1,\ \mathcal{U} = \mathbf{x}_2,\ \cdots,\ \mathcal{U} = \mathbf{x}_{\mathrm{D}}\}$$

containing D realizations from the fully observed Bayesian network $\mathcal{U}$. Here, we are interested in the probability of the joint realization of $X_k = 1$, along with a specific combination of its parents, that is, $\mathrm{parents}(X_k) \equiv \mathbf{x}_{\mathrm{pa}(k)}$. The number of such outcomes in the data set $\mathcal{D}$ is $\alpha = \#\{x_k = 1, \mathbf{x}_{\mathrm{pa}(k)}\} \in \{0, 1, \cdots, \mathrm{D}\}$. Analogously, the number of realizations in the data set of $X_k = 0$ along with the same specific combination of its parents is $\beta = \#\{x_k = 0, \mathbf{x}_{\mathrm{pa}(k)}\} \in \{0, 1, \cdots, \mathrm{D}\}$, and note that $\alpha + \beta = \#\{\mathbf{x}_{\mathrm{pa}(k)}\}$. The posterior PDF $f(\theta_k|\mathcal{D})$ is formulated following

$$\begin{aligned}
f(\theta_k|\mathcal{D}) &\propto f(\mathcal{D}|\theta_k) \cdot f(\theta_k)\\
&\propto \left(\prod_{j=1}^{\mathrm{D}} p(\mathbf{x}_j|\theta_k)\right) \cdot \theta_k^{\alpha_0}(1 - \theta_k)^{\beta_0}\\
&\propto \left(\prod_{j=1}^{\mathrm{D}} \prod_{i=1}^{\mathrm{X}} p_{X_i}\!\big([\mathbf{x}_j]_i \,\big|\, \mathbf{x}^j_{\mathrm{pa}(i)}, \theta_k\big)\right) \cdot \theta_k^{\alpha_0}(1 - \theta_k)^{\beta_0}.
\end{aligned}$$
We saw in §6.3 that it is the same value $\hat{\theta}_k$ that maximizes either $f(\theta_k|\mathcal{D})$ or $\ln f(\theta_k|\mathcal{D})$. Therefore, the log-posterior is formulated following

$$\ln f(\theta_k|\mathcal{D}) \propto \ln\!\big(\theta_k^{\alpha_0}(1 - \theta_k)^{\beta_0}\big) + \sum_{j=1}^{\mathrm{D}} \sum_{i=1}^{\mathrm{X}} \ln p_{X_i}\!\big([\mathbf{x}_j]_i \,\big|\, \mathbf{x}^j_{\mathrm{pa}(i)}, \theta_k\big),$$
where products were replaced by sums. We saw in §5.1 that the MAP estimator $\hat{\theta}_k$ corresponds to the location where the derivative of the log-posterior equals zero. The derivative of the log-posterior is given by

$$\begin{aligned}
\frac{\partial \ln f(\theta_k|\mathcal{D})}{\partial \theta_k} &= \frac{\partial \ln\!\big(\theta_k^{\alpha_0}(1 - \theta_k)^{\beta_0}\big)}{\partial \theta_k} + \frac{\partial \sum_{j=1}^{\mathrm{D}} \ln p_{X_k}\!\big([\mathbf{x}_j]_k \,\big|\, \mathbf{x}^j_{\mathrm{pa}(k)}, \theta_k\big)}{\partial \theta_k}\\
&= \frac{\partial \ln\!\big(\theta_k^{\alpha_0}(1 - \theta_k)^{\beta_0}\big)}{\partial \theta_k} + \frac{\partial \ln\!\big(\theta_k^{\alpha}(1 - \theta_k)^{\beta}\big)}{\partial \theta_k}\\
&= \frac{\partial \ln\!\big(\theta_k^{\alpha_0}(1 - \theta_k)^{\beta_0}\big)}{\partial \theta_k} + \frac{\partial \big(\alpha \ln \theta_k + \beta \ln(1 - \theta_k)\big)}{\partial \theta_k}\\
&= \frac{(\alpha_0 + \beta_0)\theta_k - \alpha_0}{\theta_k(\theta_k - 1)} + \frac{(\alpha + \beta)\theta_k - \alpha}{\theta_k(\theta_k - 1)}\\
&= \frac{(\alpha_0 + \beta_0 + \alpha + \beta)\theta_k - (\alpha_0 + \alpha)}{\theta_k(\theta_k - 1)},
\end{aligned}$$

where $\alpha = \#\{x_k = 1, \mathbf{x}_{\mathrm{pa}(k)}\} \in \{0, 1, \cdots, \mathrm{D}\}$ and $\beta = \#\{x_k = 0, \mathbf{x}_{\mathrm{pa}(k)}\} \in \{0, 1, \cdots, \mathrm{D}\}$.
Note that when taking the derivative of $\ln f(\theta_k|\mathcal{D})$ with respect to $\theta_k$, the sum with respect to $i$ simplifies to a single term involving the parameter $\theta_k$ we are trying to estimate, that is, the conditional log-probability $\ln p_{X_k}(x_k|\mathrm{parents}(X_k))$. By setting the derivative $\frac{\partial \ln f(\theta_k|\mathcal{D})}{\partial \theta_k} = 0$, we obtain the MAP estimator
$$\frac{(\alpha_0 + \beta_0 + \alpha + \beta)\theta_k - (\alpha_0 + \alpha)}{\theta_k(\theta_k - 1)} = 0 \quad\rightarrow\quad \hat{\theta}_k = \frac{\alpha_0 + \alpha}{\alpha_0 + \beta_0 + \alpha + \beta}.$$

When we employ the prior parameters $\alpha_0 = \beta_0 = 1$, the MAP estimator simplifies to

$$\hat{\theta}_k = \frac{\alpha + 1}{\alpha + \beta + 2},$$

which is analogous to the formulation given in equation 11.3.
If, instead of having a binary state $x_k \in \{0, 1\}$, we have $x_k \in \{0, 1, \cdots, n\}$, then the prior PDF is a Dirichlet distribution, and the same principles apply for the derivation of the MAP estimator. (The Dirichlet distribution is a generalization of the Beta distribution for multivariate domains.)
11.4.2 Partially Observed Bayesian Network
It is common in practice not to be able to observe every variable in a Bayesian network. In such a case, the method presented for a fully observed BN is not directly applicable, and we can resort to the expectation maximization (EM) method [5] (see §10.1.1).

[5] Note: Here, we need to satisfy a key hypothesis; the underlying process controlling whether a data point $x_i$ is available or missing must be independent of its actual value, so that we can say that data is missing at random. For the cyanobacteria example, the hypothesis of independence for the missing data would not be satisfied if, for example, experimentalists were reluctant to go and collect samples when the water temperature is T = cold.

The EM method consists in repeating two steps recursively until convergence. In the expectation step, we employ the current values contained in the conditional probability tables, along with the inference procedure presented in §11.3, in order to replace observation counts by the expected number of observation counts. Take the cyanobacteria example, where we want to estimate $p(c = \text{yes}|t = \text{hot}, f = \text{yes})$ from a partially observed Bayesian network. We need to replace the explicit number of realizations $\#\{c = \text{yes}, t = \text{hot}, f = \text{yes}\}$ in equation 11.4 by the expected number of counts,

$$\mathbb{E}[\#\{c = \text{yes}, t = \text{hot}, f = \text{yes}\}] = \sum_{i=1}^{\mathrm{D}} p(c = \text{yes}, t = \text{hot}, f = \text{yes}\,|\,\mathcal{D}_i).$$
For the first observation $\mathcal{D}_1 = \{\mathbf{x}_1\} = \{\text{cold, ?, yes, ?, clear}\}$ contained in the data set presented in figure 11.8, the conditional probability equals zero: the event where the temperature is t = hot is impossible because we already know for this observation that the temperature is t = cold,

$$p(c = \text{yes}, t = \text{hot}, f = \text{yes}\,|\,\mathcal{D}_1) = 0.$$

Figure 11.8: Example of a partially observed Bayesian network for the cyanobacteria data set.

   i     T     F     C     M     W
   1    cold   ?     yes   ?     clear
   2    ?      ?     yes   yes   clear
   3    ?      yes   no    yes   ?
   4    hot    yes   yes   yes   green
   ⋮
   D    hot    no    no    ?     green
For the second observation $\mathcal{D}_2 = \{\mathbf{x}_2\} = \{\text{?, ?, yes, yes, clear}\}$, the joint probability to be inferred using the procedure presented in §11.3 and the current values contained in the CPTs is

$$p(c = \text{yes}, t = \text{hot}, f = \text{yes}\,|\,\mathcal{D}_2) = p(t = \text{hot}, f = \text{yes}\,|\,c = \text{yes}, m = \text{yes}, w = \text{clear}) \in (0, 1).$$

In the case where no variables are missing for an observation, as in the fourth observation $\mathcal{D}_4$, then

$$p(c = \text{yes}, t = \text{hot}, f = \text{yes}\,|\,\mathcal{D}_4) = 1.$$

Once the expectation procedure is completed for all observations in $\mathcal{D}$, the maximization step consists in computing either the MLE or MAP for updating the probabilities contained in the CPTs. The maximization step employs the method presented in §11.4.1 for fully observed Bayesian networks, this time using the expected number of counts, for example,

$$\hat{p}(c = \text{yes}|t = \text{hot}, f = \text{yes}) = \frac{\mathbb{E}[\#\{c = \text{yes}, t = \text{hot}, f = \text{yes}\}] + 1}{\mathbb{E}[\#\{t = \text{hot}, f = \text{yes}\}] + 2}.$$

The expectation maximization method is intrinsically iterative, where the expectation step employs the CPT entries found during the maximization step of the previous iteration. Both steps are repeated until a steady solution is reached.
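One expectation step can be sketched by enumerating the possible completions of each partially observed record and weighting each by the current CPTs. The record format, the function names, and the use of the §11.3 CPT values as the "current" estimates are all assumptions of this sketch:

```python
from itertools import product

# Expected count of {c = yes, t = hot, f = yes} over partially observed
# records ("?" marks a missing value), using the current CPTs from §11.3.
p_t = {"cold": 0.4, "hot": 0.6}
p_f = {"yes": 0.2, "no": 0.8}
p_c = {("cold", "yes"): 0.5, ("hot", "yes"): 0.95,
       ("cold", "no"): 0.05, ("hot", "no"): 0.8}   # p(c = yes | t, f)
p_m = {"yes": 0.6, "no": 0.1}                      # p(m = yes | c)
p_w = {"yes": 0.7, "no": 0.2}                      # p(w = clear | c)

def joint(t, f, c, m, w):
    pc = p_c[(t, f)] if c == "yes" else 1 - p_c[(t, f)]
    pm = p_m[c] if m == "yes" else 1 - p_m[c]
    pw = p_w[c] if w == "clear" else 1 - p_w[c]
    return p_t[t] * p_f[f] * pc * pm * pw

domains = {"t": ["cold", "hot"], "f": ["yes", "no"], "c": ["yes", "no"],
           "m": ["yes", "no"], "w": ["clear", "green"]}

def completions(record):
    """Yield every full assignment consistent with the observed values."""
    keys = list(domains)
    free = [k for k, v in zip(keys, record) if v == "?"]
    for vals in product(*(domains[k] for k in free)):
        filled = dict(zip(keys, record))
        filled.update(dict(zip(free, vals)))
        yield filled

def expected_count(data, query):
    """E-step: sum over records of p(query | observed part of record)."""
    total = 0.0
    for record in data:
        z = sum(joint(**x) for x in completions(record))
        num = sum(joint(**x) for x in completions(record)
                  if all(x[k] == v for k, v in query.items()))
        total += num / z
    return total

data = [("cold", "?", "yes", "?", "clear"),     # D1 from figure 11.8
        ("?", "?", "yes", "yes", "clear"),      # D2
        ("hot", "yes", "yes", "yes", "green")]  # D4: fully observed
q = {"t": "hot", "f": "yes", "c": "yes"}
print(expected_count(data, q))  # D1 contributes 0, D2 about 0.206, D4 exactly 1
```

The maximization step would then plug these expected counts into the counting formulas of §11.4.1, and the two steps would alternate until the CPT entries stabilize.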
11.5 Dynamic Bayesian Network
So far we have looked at Bayesian networks for modeling static systems that do not have a temporal component. When a BN is defined over time steps, we call it a dynamic Bayesian network (DBN). Just like for Bayesian networks, the dependence between variables is described by directed links and conditional probability tables; the same holds for the dependence between variables over successive time steps. Figure 11.9a presents the expanded representation of a dynamic Bayesian network, where pairs of hidden states $X_{t-1}$ and $X_t$ are linked by an arrow that points in the direction of time. Figure 11.9b represents the same network using a compact notation where the representation of the same state at different time steps is replaced by a single state $X_t$, with the double arrows indicating the links between different time steps. Figure 11.10 shows how the notion of time could be introduced in the cyanobacteria example, where the double self-loop arrows indicate the presence of links between these variables over subsequent time steps.
Figure 11.9: Equivalent expanded and compact representations of a dynamic Bayesian network: (a) expanded DBN with hidden states $X_{t-1}$, $X_t$, $X_{t+1}$ and observations $Y_{t-1}$, $Y_t$, $Y_{t+1}$; (b) compact DBN with $X_t$ and $Y_t$. In (b), the double-lined arrow represents the conditional relationship between time steps.

Figure 11.10: Dynamic Bayesian network for the cyanobacteria example, with variables $C_t$, $M_t$, $W_t$, $T_t$, and $F_t$.
For DBNs, we can apply the same conditional probability estimation methods we presented for Bayesian networks in §11.4. In the case of inference, specialized algorithms are available to perform variable elimination for DBNs. The details of such algorithms can be found in dedicated literature [6].

[6] Murphy, K. P. (2002). Dynamic Bayesian networks: representation, inference and learning. PhD thesis, University of California, Berkeley.
Hidden Markov models. Note that the dynamic Bayesian network presented in figure 11.9 represents a special case called a hidden Markov model (HMM). An HMM has only one hidden-state variable at each time $t$, along with one observed variable. The model is Markovian because the future ($X_{t+1}$) is independent of the past ($X_{t-1}$) given the present ($X_t$), that is, $X_{t+1} \perp\!\!\!\perp X_{t-1}\,|\,x_t$. The HMM model is thus defined by an observation model $p(y_t|x_t)$ and a transition model $p(x_{t+1}|x_t)$. The fire alarm example presented in §6.2.2 can be described by a hidden Markov model. The reader interested in specialized inference algorithms for HMMs should refer to dedicated literature. In this book we will instead focus our attention on the case of a dynamic Bayesian network using continuous state variables, that is, state-space models. These models are described at length in the next chapter. The reader interested in the relationships between Bayesian networks, HMMs, state-space models, and the Markov decision process (see chapter 15) should consult the papers by Diard, Bessière, and Mazer [7] and Ghahramani [8].

[7] Diard, J., P. Bessière, and E. Mazer (2003). A survey of probabilistic models using the Bayesian programming methodology as a unifying framework. In International Conference on Computational Intelligence, Robotics and Autonomous Systems (IEEE-CIRAS), Singapore.

[8] Ghahramani, Z. (2001). An introduction to hidden Markov models and Bayesian networks. International Journal of Pattern Recognition and Artificial Intelligence 15(1), 9–42.