11 Bayesian Networks

Bayesian networks were introduced by Pearl [1] and are also known as belief networks and directed graphical models. They are the result of the combination of the probability theory covered in chapter 3 with graph theory, which employs graphs defined by links and nodes.

[1] Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann.

We saw in §3.2 that the chain rule allows formulating the joint probability for a set of random variables using conditional and marginal probabilities, for example,

$$p(x_1, x_2, x_3) = p(x_3|x_2, x_1)\, p(x_2|x_1)\, p(x_1).$$
Bayesian networks (BNs) employ nodes to represent random variables and directed links to describe the dependencies between them. Bayesian networks are probabilistic models where the goal is to learn the joint probability defined over the entire network. The joint probability for the structure encoded by these nodes and links is formulated using the chain rule. The key with Bayesian networks is that they allow building sparse models for which efficient variable elimination algorithms exist in order to estimate any conditional probabilities from the joint probability.
BNs can be categorized as unsupervised learning [2], where the goal is to estimate the joint probability density function (PDF) for a set of observed variables. In its most general form, we may seek to learn the structure of the BN itself. In this chapter, we restrict ourselves to the case where we know the graph structure, and the goal is to learn to predict unobserved quantities given some observed ones.

[2] Ghahramani, Z. (2004). Unsupervised learning. In Advanced lectures on machine learning, Volume 3176, pp. 72–112. Springer.
Figure 11.1: Example of a Bayesian network for representing the relationships between temperature (T: t ∈ {cold, hot}), the presence of the flu virus (V: v ∈ {yes, no}), and being sick from the flu virus (F: f ∈ {sick, ¬sick}).
Flu virus example. Section 3.3.5 presented the distinction between correlation and causality using the flu virus example. A Bayesian network can be employed to model the dependence between the temperature, the presence of the flu virus, and being sick from the flu. We model our knowledge of these three quantities using discrete random variables that are represented by the nodes in figure 11.1. The arrows represent the dependencies between variables: the temperature T affects the virus prevalence V, which in turn affects
j.-a. goulet 168
the probability of catching the virus and being sick, F. The absence of a link between T and F indicates that the temperature and being sick from the flu are conditionally independent from each other. In the context of this example, conditional independence implies that T and F are independent when V is known. The joint probability for T, V, and F,

$$p(f, v, t) = \overbrace{p(f|v)}^{=\,p(f|v,t)} \cdot \underbrace{p(v|t) \cdot p(t)}_{p(v,t)},$$

is obtained using the chain rule, where, for each arrow, the conditional probabilities are described in a conditional probability table.
Virus example: Marginal and conditional probability tables:

$p(t) = \{p(\text{cold}), p(\text{hot})\} = \{0.4, 0.6\}$

p(v|t):
             t = cold   t = hot
  v = yes      0.8        0.1
  v = no       0.2        0.9

p(f|v):
             v = yes    v = no
  f = sick     0.7        0
  f = ¬sick    0.3        1
Joint probability using the chain rule:

p(v, t) = p(v|t) · p(t):
             t = cold    t = hot
  v = yes    0.8·0.4     0.1·0.6
  v = no     0.2·0.4     0.9·0.6

             t = cold    t = hot
  v = yes     0.32        0.06
  v = no      0.08        0.54

p(f, v, t) = p(f|v) · p(v, t):
  f = sick     t = cold    t = hot
    v = yes    0.32·0.7    0.06·0.7
    v = no     0.08·0      0.54·0
  f = ¬sick    t = cold    t = hot
    v = yes    0.32·0.3    0.06·0.3
    v = no     0.08·1      0.54·1

  f = sick     t = cold    t = hot
    v = yes     0.224       0.042
    v = no      0           0
  f = ¬sick    t = cold    t = hot
    v = yes     0.096       0.018
    v = no      0.08        0.54

Variable elimination (marginalization):

$p(f, t) = \sum_v p(f, v, t)$:
              t = cold   t = hot
  f = sick      0.224      0.042
  f = ¬sick     0.176      0.558

$p(f) = \sum_t p(f, t)$:
  f = sick      0.266
  f = ¬sick     0.734

p(t|f) = p(f, t)/p(f):
              t = cold   t = hot
  f = sick      0.84       0.16
  f = ¬sick     0.24       0.76
In the case where we observe F = f, we can employ the marginalization operation in order to obtain a conditional probability quantifying how the observation f ∈ {sick, ¬sick} changes the probability for the temperature T,

$$p(t|f) = \frac{p(f,t)}{p(f)} = \frac{\sum_v p(f,v,t)}{\sum_t \sum_v p(f,v,t)}.$$
In minimalistic problems such as this one, it is trivial to calculate the joint probability using the chain rule and to eliminate variables using marginalization. However, in practical cases involving dozens of variables with as many links between them, these calculations become computationally demanding. Moreover, in practice, we seldom know the marginal and conditional probability tables. A key interest of working with directed graphs is that efficient estimation methods are available to perform all those tasks.
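The flu example's tables and the marginalization steps above can be reproduced numerically. Below is a minimal sketch, assuming numpy is available; the array-axis orderings stated in the comments are choices of this sketch, not the book's notation:

```python
import numpy as np

# Tables from the flu example. Assumed axis orderings:
# t indexes {cold, hot}, v indexes {yes, no}, f indexes {sick, not sick}.
p_t = np.array([0.4, 0.6])                       # p(t)
p_v_given_t = np.array([[0.8, 0.1],              # p(v|t), axes (v, t)
                        [0.2, 0.9]])
p_f_given_v = np.array([[0.7, 0.0],              # p(f|v), axes (f, v)
                        [0.3, 1.0]])

# Chain rule: p(f, v, t) = p(f|v) p(v|t) p(t)
p_vt = p_v_given_t * p_t                         # p(v, t), axes (v, t)
p_fvt = p_f_given_v[:, :, None] * p_vt[None, :, :]   # axes (f, v, t)

# Variable elimination by marginalization
p_ft = p_fvt.sum(axis=1)                         # p(f, t), sum over v
p_f = p_ft.sum(axis=1)                           # p(f)
p_t_given_f = p_ft / p_f[:, None]                # p(t|f)

print(np.round(p_f, 3))           # [0.266 0.734]
print(np.round(p_t_given_f, 2))   # rows f = {sick, not sick}: 0.84, 0.16 and 0.24, 0.76
```

The three-way table `p_fvt` is built explicitly here only for clarity; the sums could equally be interleaved with the products, which is exactly the point of the variable elimination discussed later in the chapter.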
Bayesian networks are applicable not only for discrete random variables but also for continuous ones, or a mix of both. In this chapter, we restrict ourselves to the study of BNs for discrete random variables. Note that the state-space models presented in chapter 12 can be seen as a time-dependent Bayesian network using Gaussian random variables with linear dependence models. This chapter presents the nomenclature employed to define graphical models, the methods for performing inference, and the methods allowing us to learn the conditional probabilities defining the dependencies between random variables. In addition, we present an introduction to time-dependent Bayesian networks, which are referred to as dynamic Bayesian networks. For advanced topics regarding Bayesian networks, the reader is invited to consult specialized textbooks such as the one by Nielsen and Jensen [3] or Murphy's PhD thesis [4].

[3] Nielsen, T. D. and F. V. Jensen (2007). Bayesian networks and decision graphs. Springer.

[4] Murphy, K. P. (2002). Dynamic Bayesian networks: representation, inference and learning. PhD thesis, University of California, Berkeley.
probabilistic machine learning for civil engineers 169
11.1 Graphical Models Nomenclature
Bayesian networks employ a special type of graph: the directed acyclic graph (DAG). A DAG $\mathcal{G} = \{\mathcal{U}, \mathcal{E}\}$ is defined by a set of nodes $\mathcal{U}$ interconnected by a set of directed links $\mathcal{E}$. In order to be acyclic, the directed links between variables cannot be defined such that there are self-loops or cycles in the graph. For a set of random variables, there are many ways to define links between variables, each one leading to the same joint probability. Note that directed links between variables are not required to describe causal relationships. Despite causality not being a requirement, it is a key to efficiency; if the directed links in a graphical model are assigned following the causal relationships, it generally produces sparse models requiring the definition of a smaller number of conditional probabilities than noncausal counterparts.
Figure 11.2: Example of dependence represented by directed links between hidden (white) and observed (shaded) nodes describing random variables: (a) a directed acyclic graph (DAG) over $X_1$, $X_2$, $X_3$, and $X_4$; (b) a directed graph containing a cycle among $X_1$, $X_2$, and $X_3$.
Figure 11.2a presents a directed acyclic graph, and figure 11.2b presents a directed graph containing a cycle, so that the latter cannot be modeled as a Bayesian network. In figure 11.2a, random variables are represented by nodes, where the observed variable $X_4$ depends on the hidden variables $X_2$ and $X_3$. The directions of the links indicate that $X_2$ and $X_3$ are the parents of $X_4$, that is, $\mathrm{parents}(X_4) = \{X_2, X_3\}$, and consequently, $X_4$ is the child of $X_2$ and $X_3$. Each child is associated with a conditional probability table (CPT) whose size depends on the number of parents, $p(x_i|\mathrm{parents}(X_i))$. Nodes without parents are described by their marginal prior probabilities $p(x_i)$. The joint PDF $p(\mathcal{U})$ for the entire Bayesian network is formulated using the chain rule,

$$p(\mathcal{U}) = p(x_1, x_2, \cdots, x_{\mathrm{X}}) = \prod_{i=1}^{\mathrm{X}} p(x_i|\mathrm{parents}(X_i)).$$

This application of the chain rule requires that, given its parents, each node is independent of its other ancestors.
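This factorization can be sketched in code. The dictionary-based graph and CPT encoding below is an assumption made for illustration, populated with the flu example's numbers from earlier in the chapter:

```python
# Joint probability of a full assignment in a discrete Bayesian network,
# computed as the product of p(x_i | parents(X_i)) over all nodes.
# The graph/CPT encoding is an assumed illustration, not the book's notation.

def joint_probability(assignment, parents, cpt):
    """assignment: dict node -> value
    parents: dict node -> tuple of parent nodes
    cpt: dict node -> dict mapping (value, tuple of parent values) -> probability
    """
    p = 1.0
    for node, value in assignment.items():
        pa_values = tuple(assignment[pa] for pa in parents[node])
        p *= cpt[node][(value, pa_values)]
    return p

# Flu example from figure 11.1: T -> V -> F
parents = {"T": (), "V": ("T",), "F": ("V",)}
cpt = {
    "T": {("cold", ()): 0.4, ("hot", ()): 0.6},
    "V": {("yes", ("cold",)): 0.8, ("yes", ("hot",)): 0.1,
          ("no", ("cold",)): 0.2, ("no", ("hot",)): 0.9},
    "F": {("sick", ("yes",)): 0.7, ("sick", ("no",)): 0.0,
          ("not sick", ("yes",)): 0.3, ("not sick", ("no",)): 1.0},
}
print(joint_probability({"T": "cold", "V": "yes", "F": "sick"}, parents, cpt))
# 0.4 * 0.8 * 0.7 = 0.224
```

Because each factor only consults a node and its parents, the table sizes grow with the number of parents per node rather than with the total number of variables, which is where the sparsity advantage comes from.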
Figure 11.3: Example of cyanobacteria bloom: (a) cyanobacteria seen under a microscope; (b) cyanobacteria bloom in a rural environment. (Photos: NASA and USGS)

Figure 11.4: Bayesian network for the cyanobacteria example: (a) semantic representation, with temperature and fertilizer pointing to cyanobacteria, which points to fish mortality and water color; (b) the same network represented with random variables T, F, C, M, and W.
Cyanobacteria example. We explore the example illustrated in figure 11.3 of cyanobacteria blooms that can occur in lakes, rivers, and estuaries. Cyanobacteria blooms are typically caused by the use of fertilizers that wash into a water body, combined with warm temperatures that allow for bacteria reproduction. Cyanobacteria blooms can cause a change in the water color and can cause fish or marine life mortality. We employ a Bayesian network to describe the joint probability for a set of random variables consisting of the temperature T, the use of fertilizer F, the presence of cyanobacteria in water C, fish mortality M, and the water color W. In figure 11.4,
the definition of the graph structure and its links follows the causal direction, where the presence of cyanobacteria in water C depends on the temperature T and the presence of fertilizer F. Both the fish mortality M and the water color W depend on the presence of cyanobacteria in water C.
11.2 Conditional Independence
As stated earlier, the absence of a link between two variables indicates that the pair is conditionally independent. In the case of a serial connection, as illustrated in figure 11.5a, the absence of a link between A and C indicates that these two are independent if B is known, that is, $A \perp\!\!\!\perp C\,|\,\{B = b\}$. Therefore, as long as B is not observed, A and C depend on each other through B. It is equivalent to say that, given its parent (i.e., B), C is independent of all its other non-descendants. In the case of a diverging connection (figure 11.5b), again the absence of a link between A and C indicates that these two are independent if B is observed. The case of a converging connection, represented in figure 11.5c, is different from the two others; the absence of a link between A and C implies that both variables are independent unless B, or one of its descendants, is observed. When B, or one of its descendants, is observed, the knowledge gained for B also modifies the knowledge for A and C. We say that A and C are d-separated in the case of a serial or diverging connection where the intermediary variable B is observed, or in the case of a converging connection where neither the intermediary variable B nor one of its descendants is observed. For any d-separated variables, a change in the knowledge for one variable does not affect the knowledge of the others.
Figure 11.5: Cases where A and C are conditionally independent for the different types of connection in Bayesian networks: (a) serial connection A → B → C; (b) diverging connection A ← B → C; (c) converging connection A → B ← C, with descendant D below B.

Figure 11.6: The concept of conditional independence illustrated using the cyanobacteria example: (a) serial connection, $T \perp\!\!\!\perp M|c$ and $T \perp\!\!\!\perp W|c$; (b) diverging connection, $M \perp\!\!\!\perp W|c$; (c) converging connection, $T \perp\!\!\!\perp F$.
Cyanobacteria example. In the example presented in figure 11.4b, the sets of variables {T, C, M}, {T, C, W}, {F, C, M}, and {F, C, W} are all examples of serial connections. If, as illustrated in figure 11.6a, we observe the presence of cyanobacteria, that is, C = yes, then the temperature T becomes independent of fish mortality M and water color W, that is, $T \perp\!\!\!\perp M|c$, $T \perp\!\!\!\perp W|c$. It implies that gaining knowledge about water temperature would not change our knowledge about either fish mortality or water color, because these two quantities only depend on the presence of cyanobacteria, which is now a certainty given that C was observed. For the diverging connection between variables {M, C, W} illustrated in figure 11.6b, observing C = c causes the fish mortality M to be independent from the water color W, $M \perp\!\!\!\perp W|c$. Gaining
knowledge about the occurrence of fish mortality would not change our knowledge about water color because these two quantities only depend on the presence of cyanobacteria, which is a certainty when C is observed. For the converging connection between variables {T, C, F} represented in figure 11.6c, despite the absence of a link between T and F, the temperature is not independent from the use of fertilizer F if we observe C or one of its descendants, that is, M or W. Without observing C, M, or W, the knowledge gained about the use of fertilizer has no impact on our knowledge of the temperature. On the other hand, if we observe the presence of cyanobacteria (C = yes), then knowing that no fertilizer is present in the environment (F = no) would increase the probability that the temperature is high, because there is only a small probability of having cyanobacteria without fertilizer and with cold temperature.
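This explaining-away behavior can be checked numerically with the CPTs listed in §11.3. A minimal enumeration sketch (the dictionary encoding is a choice of this sketch, not the book's notation) compares p(t = hot | c = yes) with p(t = hot | c = yes, f = no):

```python
from itertools import product

# CPTs from the cyanobacteria example (§11.3).
p_t = {"cold": 0.4, "hot": 0.6}
p_f = {"yes": 0.2, "no": 0.8}
p_c_yes = {("cold", "yes"): 0.5, ("hot", "yes"): 0.95,
           ("cold", "no"): 0.05, ("hot", "no"): 0.8}   # p(c = yes | t, f)

def p_joint(t, f, c):
    """Joint probability p(t, f, c) = p(t) p(f) p(c|t,f)."""
    pc = p_c_yes[(t, f)] if c == "yes" else 1 - p_c_yes[(t, f)]
    return p_t[t] * p_f[f] * pc

# p(t = hot | c = yes): marginalize over the unobserved fertilizer f
num = sum(p_joint("hot", f, "yes") for f in p_f)
den = sum(p_joint(t, f, "yes") for t, f in product(p_t, p_f))
print(round(num / den, 3))    # 0.899

# p(t = hot | c = yes, f = no): the absent fertilizer is "explained away,"
# so a hot temperature becomes even more probable.
num2 = p_joint("hot", "no", "yes")
den2 = sum(p_joint(t, "no", "yes") for t in p_t)
print(round(num2 / den2, 3))  # 0.96
```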
11.3 Inference
The purpose of Bayesian networks is not only to define the joint probability for random variables; it is to perform inference, that is, to compute the conditional probability for a set of unobserved random variables, given a set of observed ones. Let us consider $\mathcal{U} = \{X_1, X_2, \cdots, X_{\mathrm{X}}\}$, the set of random variables corresponding to the nodes defining a system, a subset of observed variables $\mathcal{D} \subseteq \mathcal{U}$, and another subset of query variables $\mathcal{Q} \subseteq \mathcal{U}$, such that $\mathcal{Q} \cap \mathcal{D} = \emptyset$. Following the definition of conditional probabilities presented in §3.2, the posterior probability for the variables in $\mathcal{Q}$ given observations $\mathcal{D}$ is

$$p(\mathcal{Q}|\mathcal{D}) = \frac{p(\mathcal{Q}, \mathcal{D})}{p(\mathcal{D})} = \frac{\sum_{\mathcal{U}\setminus\{\mathcal{Q},\mathcal{D}\}} p(\mathcal{U})}{\sum_{\mathcal{U}\setminus\{\mathcal{D}\}} p(\mathcal{U})},$$

where $\mathcal{U} \setminus \{\mathcal{Q}, \mathcal{D}\}$ describes the set of variables belonging to $\mathcal{U}$ while excluding those in $\{\mathcal{Q}, \mathcal{D}\}$. For the cyanobacteria example
Cyanobacteria example:

$$\mathcal{U} = \left\{\begin{array}{l} T: t \in \{\text{cold, hot}\}\\ F: f \in \{\text{yes, no}\}\\ C: c \in \{\text{yes, no}\}\\ M: m \in \{\text{yes, no}\}\\ W: w \in \{\text{clear, green}\} \end{array}\right\}$$
presented in figure 11.4b, let us consider that we want the posterior probability $p(m|w = \text{green})$, that is, the probability of fish mortality $M$ given that we have observed colored water, $W = \text{green}$. This conditional probability is described by

$$p(m|w = \text{green}) = \frac{p(m, w = \text{green})}{p(w = \text{green})},$$

where both terms on the right-hand side can be obtained through the marginalization of the joint probability mass function (PMF),

$$p(m, w = \text{green}) = \sum_t \sum_f \sum_c p(t, f, c, m, w = \text{green})$$

$$p(w = \text{green}) = \sum_m p(m, w = \text{green}).$$

This approach is theoretically correct; however, in practice, calculating the joint probability $p(\mathcal{U})$ quickly becomes computationally intractable with the increase in the number of variables in $\mathcal{U}$. The solution is to avoid computing the full joint probability table and instead proceed by eliminating variables sequentially.
Cyanobacteria example: CPTs

$p(t) = \{p(\text{cold}), p(\text{hot})\} = \{0.4, 0.6\}$
$p(f) = \{p(\text{yes}), p(\text{no})\} = \{0.2, 0.8\}$

p(c|t, f):
  c = yes      t = cold   t = hot
    f = yes      0.5        0.95
    f = no       0.05       0.8
  c = no       t = cold   t = hot
    f = yes      0.5        0.05
    f = no       0.95       0.2

p(m|c):
             c = yes   c = no
  m = yes      0.6       0.1
  m = no       0.4       0.9

p(w|c):
              c = yes   c = no
  w = clear     0.7       0.2
  w = green     0.3       0.8
Variable elimination. The goal of variable elimination is to avoid computing the full joint probability table $p(\mathcal{U})$ by working with its chain-rule factorization. For the cyanobacteria example, the joint probability is

$$p(\mathcal{U}) = p(t) \cdot p(f) \cdot p(c|t,f) \cdot p(w|c) \cdot p(m|c), \tag{11.1}$$
where computing $p(m, w = \text{green})$ corresponds to

$$p(m, w = \text{green}) = \sum_t \sum_f \sum_c p(t) \cdot p(f) \cdot p(c|t,f) \cdot p(w = \text{green}|c) \cdot p(m|c)$$

$$= \sum_c \underbrace{p(w = \text{green}|c)}_{2\times 1} \cdot \underbrace{p(m|c)}_{2\times 2} \cdot \sum_t \underbrace{p(t)}_{2\times 1} \cdot \sum_f \underbrace{p(f)}_{2\times 1} \cdot \underbrace{p(c|t,f)}_{2\times 2\times 2} = \{0.1354,\ 0.3876\},$$

where the successive intermediate tables are $p(c,f|t)$ ($2\times 2\times 2$), $p(c|t)$ ($2\times 2$), $p(c,t)$ ($2\times 2$), $p(c)$ ($2\times 1$), $p(m,c)$ ($2\times 2$), $p(m, w{=}\text{green}, c)$ ($2\times 2$), and $p(m, w{=}\text{green})$ ($2\times 1$).
Variable elimination:

p(c, f|t) = p(f) · p(c|t, f):
  c = yes      t = cold    t = hot
    f = yes    0.5·0.2     0.95·0.2
    f = no     0.05·0.8    0.8·0.8
  c = no       t = cold    t = hot
    f = yes    0.5·0.2     0.05·0.2
    f = no     0.95·0.8    0.2·0.8

$p(c|t) = \sum_f p(c, f|t)$:
             t = cold   t = hot
  c = yes      0.14       0.83
  c = no       0.86       0.17

p(c, t) = p(c|t) · p(t):
             t = cold    t = hot
  c = yes    0.14·0.4    0.83·0.6
  c = no     0.86·0.4    0.17·0.6

$p(c) = \sum_t p(c, t)$:
  c = yes     0.554
  c = no      0.446

p(m, c) = p(m|c) · p(c):
             c = yes      c = no
  m = yes    0.6·0.554    0.1·0.446
  m = no     0.4·0.554    0.9·0.446

p(m, w = green, c) = p(w = green|c) · p(m, c):
             c = yes       c = no
  m = yes    0.3324·0.3    0.0446·0.8
  m = no     0.2216·0.3    0.4014·0.8

$p(m, w = \text{green}) = \sum_c p(m, w = \text{green}, c)$:
  m = yes     0.1354
  m = no      0.3876

p(w = green) = 0.1354 + 0.3876 = 0.523

Inference:

p(m|w = green) = p(m, w = green)/p(w = green):
  m = yes     0.1354/0.523 = 0.26
  m = no      0.3876/0.523 = 0.74
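The elimination sequence above can be reproduced with a short script. Below is a minimal sketch assuming numpy; the axis orderings noted in the comments are choices of this sketch:

```python
import numpy as np

# CPTs for the cyanobacteria example. Assumed orderings:
# t = [cold, hot], f = [yes, no], c = [yes, no], m = [yes, no], w = [clear, green].
p_t = np.array([0.4, 0.6])
p_f = np.array([0.2, 0.8])
p_c_tf = np.array([[[0.5, 0.5],            # p(c|t,f), axes (t, f, c)
                    [0.05, 0.95]],
                   [[0.95, 0.05],
                    [0.8, 0.2]]])
p_m_c = np.array([[0.6, 0.4],              # p(m|c), axes (c, m)
                  [0.1, 0.9]])
p_w_c = np.array([[0.7, 0.3],              # p(w|c), axes (c, w)
                  [0.2, 0.8]])

# Eliminate f, then t, then c; no intermediate table exceeds 2x2x2 entries.
p_c_t = np.einsum("f,tfc->tc", p_f, p_c_tf)   # p(c|t)
p_c = np.einsum("t,tc->c", p_t, p_c_t)        # p(c)
p_m_wgreen = np.einsum("c,cm,c->m", p_c, p_m_c, p_w_c[:, 1])  # p(m, w=green)
p_m_given_wgreen = p_m_wgreen / p_m_wgreen.sum()

print(np.round(p_m_wgreen, 4))        # [0.1354 0.3876]
print(np.round(p_m_given_wgreen, 2))  # [0.26 0.74]
```

Each `einsum` call performs one product-then-sum contraction, which mirrors pushing a summation operator inside the factorized joint of equation 11.1.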
Note how several terms in equation 11.1 are independent of the variables in the summation operators. This allows us to take out of the sums the terms that do not depend on them and then perform the marginalization sequentially. This procedure is more efficient than working with full probability tables. In the previous equation, braces refer to the ordering of operations and indicate the size of each probability table. The variable elimination procedure allows working with probability tables containing no more than $2^3 = 8$ entries. In comparison, the full joint probability table for $p(\mathcal{U})$ contains $2^5 = 32$ entries.
The efficiency of the variable elimination approach depends on the ordering of operations. The common method for ordering operations while performing variable elimination is the junction tree method. This method transforms the graph into a tree and then employs clustering to group nodes. The reader interested in the details of this inference method should refer to specialized textbooks. Note that the methods covered above are exact inference methods. In the case where the number of variables is so large that exact methods become computationally prohibitive, we can also resort to approximate methods based on Monte Carlo sampling.
11.4 Conditional Probability Estimation
(a) Fully observed Bayesian network:

   i     T     F     C     M     W
   1    cold   no    yes   yes   clear
   2    hot    no    yes   yes   clear
   3    hot    yes   no    yes   green
   ⋮
   D    hot    no    no    yes   green

(b) Partially observed Bayesian network:

   i     T     F     C     M     W
   1    cold   ?     yes   ?     clear
   2    ?      ?     yes   yes   clear
   3    ?      yes   no    yes   ?
   ⋮
   D    hot    no    no    ?     green

Figure 11.7: Example of data sets for learning conditional probability tables.
In the previous sections, it was always assumed that the probabilities contained in conditional probability tables (CPTs) were known. In practice this is seldom the case; CPTs must be learned from data. There are two typical learning setups. In the first setup, the Bayesian network is fully observed, so each observation $\mathcal{D}_i = \{\mathbf{x}_i\}$ consists in a joint realization $\mathbf{x}_i: \mathbf{X} \sim p(\mathcal{U})$. Figure 11.7a presents a table where each line is one observation for the fully observed BN employed in the cyanobacteria example. In the second learning setup, the BN is only partially observed, so that some variables are not observed for the realization of $\mathbf{x}_i: \mathbf{X} \sim p(\mathcal{U})$. Figure 11.7b presents a data set for the cyanobacteria example where the BN is partially observed.
11.4.1 Fully Observed Bayesian Network
For a fully observed Bayesian network, the maximum likelihood estimate (MLE) for the conditional probability $\Pr(X_1 = x_1|X_2 = x_2)$ can be estimated as the ratio between the number of realizations of a specific joint outcome $\{X_1 = x_1, X_2 = x_2\}$, divided by the total number of realizations of the outcome $\{X_2 = x_2\}$,

$$\widehat{\Pr}(X_1 = x_1|X_2 = x_2) = \hat{p}(x_1|x_2) = \frac{\#\{X_1 = x_1, X_2 = x_2\}}{\#\{X_2 = x_2\}}. \tag{11.2}$$
We must be careful in the case where a specific outcome is not observed in the data set, so that $\#\{X_1 = x_1, X_2 = x_2\} = 0$. (Reminder: $\#\{\,\cdot\,\}$ denotes the number of elements in a set.) In such a case, the MLE will lead to a conditional probability equal to zero. This situation is problematic because we might not have observed this specific outcome yet, simply because the number of observations is too limited. A solution is to employ maximum a posteriori (MAP) estimation by adding one observation count to each possible outcome. Given that $x_1 \in \{1, 2, \cdots, n\}$, equation 11.2 becomes

$$\widehat{\Pr}(X_1 = x_1|X_2 = x_2) = \frac{\#\{X_1 = x_1, X_2 = x_2\} + 1}{\#\{X_2 = x_2\} + n}. \tag{11.3}$$
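Equations 11.2 and 11.3 amount to counting. A small sketch on a hypothetical list of (x1, x2) pairs (the data below is invented purely for illustration):

```python
# MLE and MAP (add-one) estimates of Pr(X1 = x1 | X2 = x2) from joint
# realizations, following equations 11.2 and 11.3.

def mle(data, x1, x2):
    joint = sum(1 for a, b in data if a == x1 and b == x2)
    cond = sum(1 for _, b in data if b == x2)
    return joint / cond

def map_estimate(data, x1, x2, n_outcomes):
    joint = sum(1 for a, b in data if a == x1 and b == x2)
    cond = sum(1 for _, b in data if b == x2)
    return (joint + 1) / (cond + n_outcomes)

# Hypothetical (x1, x2) realizations
data = [(1, 0), (1, 0), (0, 0), (1, 1), (0, 1), (0, 1)]
print(mle(data, 1, 0))               # 2/3
print(map_estimate(data, 1, 0, 2))   # (2 + 1)/(3 + 2) = 0.6
# An outcome never observed gets probability zero under the MLE,
# but stays strictly positive under the MAP estimate:
print(mle(data, 2, 0))               # 0.0 (x1 = 2 never seen with x2 = 0)
print(map_estimate(data, 2, 0, 3))   # (0 + 1)/(3 + 3) = 1/6
```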
Cyanobacteria example. From the data set $\mathcal{D}$ presented in figure 11.7a, we can estimate each CPT involved in the definition of $p(\mathcal{U})$ in equation 11.1. For example, we might want to obtain the MLE for the probability of having cyanobacteria given a hot temperature and the presence of fertilizer, $p(c = \text{yes}|t = \text{hot}, f = \text{yes})$. In such a case, we compute the number of realizations of $\{c = \text{yes}, t = \text{hot}, f = \text{yes}\}$ in $\mathcal{D}$, divided by the number of realizations of $\{t = \text{hot}, f = \text{yes}\}$. If we want to employ the MAP estimate instead of the MLE, we add one count to the numerator and two to the denominator so that

$$\hat{p}(c = \text{yes}|t = \text{hot}, f = \text{yes}) = \frac{\#\{c = \text{yes}, t = \text{hot}, f = \text{yes}\} + 1}{\#\{t = \text{hot}, f = \text{yes}\} + 2}. \tag{11.4}$$

The same method applies for any entry in a conditional probability table. In the case of the estimation of marginal probabilities such as $p(t = \text{hot})$, the calculation simplifies to the number of events where $\{t = \text{hot}\}$, divided by $\mathrm{D} + 2$, the number of joint observations in the data set plus the number of possible outcomes for $t$,

$$\hat{p}(t = \text{hot}) = \frac{\#\{t = \text{hot}\} + 1}{\mathrm{D} + 2}.$$
Before going further, we will explore the theoretical justification for the MLE and MAP in equations 11.2 and 11.3. Let us consider the simplified case where a network is made from X random variables $\mathcal{U} = \{\mathbf{X}\} = \{[X_1\ X_2\ \cdots\ X_{\mathrm{X}}]^{\intercal}\}$, where $x_k \in \{0, 1\},\ \forall k \in \{1{:}\mathrm{X}\}$. Here, the definition of a specific structure for the network is not necessary; we simply assume that the network is structured as a DAG, so that for each $X_k$ we have $p(x_k|\mathrm{parents}(X_k))$. The quantity we seek to estimate from the data is the conditional probability $\theta_k = p(x_k = 1|\mathrm{parents}(X_k))$, where, because $x_k$ is a binary variable, we also have $1 - \theta_k = p(x_k = 0|\mathrm{parents}(X_k))$. We describe the prior probability for $\theta_k$ by a Beta PDF (see §4.3),

$$f(\theta_k) = \mathrm{B}(\theta_k; \alpha_0 + 1, \beta_0 + 1) = \frac{\theta_k^{\alpha_0}(1 - \theta_k)^{\beta_0}}{\mathrm{B}(\alpha_0 + 1, \beta_0 + 1)} \propto \theta_k^{\alpha_0}(1 - \theta_k)^{\beta_0}.$$

Note: We add +1 to the parameters of the Beta prior because without them, $\mathrm{B}(\theta_k; \alpha_0, \beta_0) \propto \theta_k^{\alpha_0 - 1}(1 - \theta_k)^{\beta_0 - 1}$, which would lead to an MAP equal to

$$\hat{\theta}_k = \frac{\alpha_0 + \alpha - 1}{\alpha_0 + \beta_0 + \alpha + \beta - 2}.$$
We have a data set

$$\mathcal{D} = \{\mathcal{D}_1, \mathcal{D}_2, \cdots, \mathcal{D}_{\mathrm{D}}\} = \{\mathcal{U} = \mathbf{x}_1,\ \mathcal{U} = \mathbf{x}_2,\ \cdots,\ \mathcal{U} = \mathbf{x}_{\mathrm{D}}\}$$

containing D realizations from the fully observed Bayesian network $\mathcal{U}$. Here, we are interested in the probability of the joint realization of $X_k = 1$, along with a specific combination of its parents, that is, $\mathrm{parents}(X_k) \equiv \mathbf{x}_{\mathrm{pa}(k)}$. The number of such outcomes in the data set $\mathcal{D}$ is $\alpha = \#\{x_k = 1, \mathbf{x}_{\mathrm{pa}(k)}\} \in \{0, 1, \cdots, \mathrm{D}\}$. Analogously, the number of realizations in the data set of $X_k = 0$ along with the same specific combination of its parents is $\beta = \#\{x_k = 0, \mathbf{x}_{\mathrm{pa}(k)}\} \in \{0, 1, \cdots, \mathrm{D}\}$, and note that $\alpha + \beta = \#\{\mathbf{x}_{\mathrm{pa}(k)}\}$. The posterior PDF $f(\theta_k|\mathcal{D})$ is formulated following

$$\begin{aligned}
f(\theta_k|\mathcal{D}) &\propto f(\mathcal{D}|\theta_k) \cdot f(\theta_k)\\
&\propto \left(\prod_{j=1}^{\mathrm{D}} p(\mathbf{x}_j|\theta_k)\right) \cdot \theta_k^{\alpha_0}(1 - \theta_k)^{\beta_0}\\
&\propto \left(\prod_{j=1}^{\mathrm{D}} \prod_{i=1}^{\mathrm{X}} p_{X_i}\!\big([\mathbf{x}_j]_i \,\big|\, \mathbf{x}^j_{\mathrm{pa}(i)}, \theta_k\big)\right) \cdot \theta_k^{\alpha_0}(1 - \theta_k)^{\beta_0}.
\end{aligned}$$
We saw in §6.3 that it is the same value $\hat{\theta}_k$ that maximizes either $f(\theta_k|\mathcal{D})$ or $\ln f(\theta_k|\mathcal{D})$. Therefore, the log-posterior is formulated following

$$\ln f(\theta_k|\mathcal{D}) \propto \ln\!\big(\theta_k^{\alpha_0}(1 - \theta_k)^{\beta_0}\big) + \sum_{j=1}^{\mathrm{D}} \sum_{i=1}^{\mathrm{X}} \ln p_{X_i}\!\big([\mathbf{x}_j]_i \,\big|\, \mathbf{x}^j_{\mathrm{pa}(i)}, \theta_k\big),$$
where products were replaced by sums. We saw in §5.1 that the MAP estimator $\hat{\theta}_k$ corresponds to the location where the derivative of the log-posterior equals zero. The derivative of the log-posterior is given by

$$\begin{aligned}
\frac{\partial \ln f(\theta_k|\mathcal{D})}{\partial \theta_k} &= \frac{\partial \ln\!\big(\theta_k^{\alpha_0}(1 - \theta_k)^{\beta_0}\big)}{\partial \theta_k} + \frac{\partial \sum_{j=1}^{\mathrm{D}} \ln p_{X_k}\!\big([\mathbf{x}_j]_k \,\big|\, \mathbf{x}^j_{\mathrm{pa}(k)}, \theta_k\big)}{\partial \theta_k}\\
&= \frac{\partial \ln\!\big(\theta_k^{\alpha_0}(1 - \theta_k)^{\beta_0}\big)}{\partial \theta_k} + \frac{\partial \ln\!\big(\theta_k^{\alpha}(1 - \theta_k)^{\beta}\big)}{\partial \theta_k}\\
&= \frac{\partial \ln\!\big(\theta_k^{\alpha_0}(1 - \theta_k)^{\beta_0}\big)}{\partial \theta_k} + \frac{\partial \big(\alpha \ln \theta_k + \beta \ln(1 - \theta_k)\big)}{\partial \theta_k}\\
&= \frac{(\alpha_0 + \beta_0)\theta_k - \alpha_0}{\theta_k(\theta_k - 1)} + \frac{(\alpha + \beta)\theta_k - \alpha}{\theta_k(\theta_k - 1)}\\
&= \frac{(\alpha_0 + \beta_0 + \alpha + \beta)\theta_k - (\alpha_0 + \alpha)}{\theta_k(\theta_k - 1)},
\end{aligned}$$

where $\alpha = \#\{x_k = 1, \mathbf{x}_{\mathrm{pa}(k)}\} \in \{0, 1, \cdots, \mathrm{D}\}$ and $\beta = \#\{x_k = 0, \mathbf{x}_{\mathrm{pa}(k)}\} \in \{0, 1, \cdots, \mathrm{D}\}$.
Note that when taking the derivative of $\ln f(\theta_k|\mathcal{D})$ with respect to $\theta_k$, the sum with respect to $i$ simplifies to a single term involving the parameter $\theta_k$ we are trying to estimate, that is, the conditional log-probability $\ln p_{X_k}(x_k|\mathrm{parents}(X_k))$. By setting the derivative $\frac{\partial \ln f(\theta_k|\mathcal{D})}{\partial \theta_k} = 0$, we obtain the MAP estimator
$$\frac{(\alpha_0 + \beta_0 + \alpha + \beta)\theta_k - (\alpha_0 + \alpha)}{\theta_k(\theta_k - 1)} = 0 \quad\rightarrow\quad \hat{\theta}_k = \frac{\alpha_0 + \alpha}{\alpha_0 + \beta_0 + \alpha + \beta}.$$

When we employ the prior parameters $\alpha_0 = \beta_0 = 1$, the MAP estimator simplifies to

$$\hat{\theta}_k = \frac{\alpha + 1}{\alpha + \beta + 2},$$

which is analogous to the formulation given in equation 11.3.
If, instead of having a binary state $x_k \in \{0, 1\}$, we have $x_k \in \{0, 1, \cdots, n\}$, then the prior PDF is a Dirichlet distribution, and the same principles apply for the derivation of the MAP estimator. (The Dirichlet distribution is a generalization of the Beta distribution for multivariate domains.)
11.4.2 Partially Observed Bayesian Network
It is common in practice not to be able to observe every variable in a Bayesian network. In such a case, the method presented for a fully observed BN is not directly applicable, and we can resort to the expectation maximization (EM) method [5] (see §10.1.1).

[5] Note: Here, we need to satisfy a key hypothesis; the underlying process controlling whether a data point $x_i$ is available or missing must be independent of its actual value, so that we can say that data is missing at random. For the cyanobacteria example, the hypothesis of independence for the missing data would not be satisfied if, for example, experimentalists were reluctant to go and collect samples when the water temperature is T = cold.

The EM method consists in repeating two steps recursively until convergence. In the expectation step, we employ the current values contained in the conditional probability tables, along with the inference procedure presented in §11.3, in order to replace observation counts by the expected number of observation counts. Take the cyanobacteria example, where we want to estimate $p(c = \text{yes}|t = \text{hot}, f = \text{yes})$ from a partially observed Bayesian network. We need to replace the explicit number of realizations $\#\{c = \text{yes}, t = \text{hot}, f = \text{yes}\}$ in equation 11.4 by the expected number of counts,

$$\mathbb{E}[\#\{c = \text{yes}, t = \text{hot}, f = \text{yes}\}] = \sum_{i=1}^{\mathrm{D}} p(c = \text{yes}, t = \text{hot}, f = \text{yes}\,|\,\mathcal{D}_i).$$
For the first observation $\mathcal{D}_1 = \{\mathbf{x}_1\} = \{\text{cold, ?, yes, ?, clear}\}$ contained in the data set presented in figure 11.8, the conditional probability equals zero: the event where the temperature is t = hot is impossible because we already know for this observation that the temperature is t = cold,

$$p(c = \text{yes}, t = \text{hot}, f = \text{yes}\,|\,\mathcal{D}_1) = 0.$$

Figure 11.8: Example of a partially observed Bayesian network for the cyanobacteria data set.

   i     T     F     C     M     W
   1    cold   ?     yes   ?     clear
   2    ?      ?     yes   yes   clear
   3    ?      yes   no    yes   ?
   4    hot    yes   yes   yes   green
   ⋮
   D    hot    no    no    ?     green
For the second observation $\mathcal{D}_2 = \{\mathbf{x}_2\} = \{\text{?, ?, yes, yes, clear}\}$, the joint probability to be inferred using the procedure presented in §11.3 and the current values contained in the CPTs is

$$p(c = \text{yes}, t = \text{hot}, f = \text{yes}\,|\,\mathcal{D}_2) = p(t = \text{hot}, f = \text{yes}\,|\,c = \text{yes}, m = \text{yes}, w = \text{clear}) \in (0, 1).$$

In the case where no variables are missing for an observation, as in the fourth observation $\mathcal{D}_4$, then

$$p(c = \text{yes}, t = \text{hot}, f = \text{yes}\,|\,\mathcal{D}_4) = 1.$$

Once the expectation procedure is completed for all observations in $\mathcal{D}$, the maximization step consists in computing either the MLE or MAP for updating the probabilities contained in the CPTs. The maximization step employs the method presented in §11.4.1 for fully observed Bayesian networks, this time using the expected number of counts, for example,

$$\hat{p}(c = \text{yes}|t = \text{hot}, f = \text{yes}) = \frac{\mathbb{E}[\#\{c = \text{yes}, t = \text{hot}, f = \text{yes}\}] + 1}{\mathbb{E}[\#\{t = \text{hot}, f = \text{yes}\}] + 2}.$$

The expectation maximization method is intrinsically iterative, where the expectation step employs the CPT entries found during the maximization step of the previous iteration. Both steps are repeated until a steady solution is reached.
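One expectation step can be sketched by enumerating the possible completions of each partially observed record and weighting each by the current CPTs. The record format, the function names, and the use of the §11.3 CPT values as the "current" estimates are all assumptions of this sketch:

```python
from itertools import product

# Expected count of {c = yes, t = hot, f = yes} over partially observed
# records ("?" marks a missing value), using the current CPTs from §11.3.
p_t = {"cold": 0.4, "hot": 0.6}
p_f = {"yes": 0.2, "no": 0.8}
p_c = {("cold", "yes"): 0.5, ("hot", "yes"): 0.95,
       ("cold", "no"): 0.05, ("hot", "no"): 0.8}   # p(c = yes | t, f)
p_m = {"yes": 0.6, "no": 0.1}                      # p(m = yes | c)
p_w = {"yes": 0.7, "no": 0.2}                      # p(w = clear | c)

def joint(t, f, c, m, w):
    pc = p_c[(t, f)] if c == "yes" else 1 - p_c[(t, f)]
    pm = p_m[c] if m == "yes" else 1 - p_m[c]
    pw = p_w[c] if w == "clear" else 1 - p_w[c]
    return p_t[t] * p_f[f] * pc * pm * pw

domains = {"t": ["cold", "hot"], "f": ["yes", "no"], "c": ["yes", "no"],
           "m": ["yes", "no"], "w": ["clear", "green"]}

def completions(record):
    """Yield every full assignment consistent with the observed values."""
    keys = list(domains)
    free = [k for k, v in zip(keys, record) if v == "?"]
    for vals in product(*(domains[k] for k in free)):
        filled = dict(zip(keys, record))
        filled.update(dict(zip(free, vals)))
        yield filled

def expected_count(data, query):
    """E-step: sum over records of p(query | observed part of record)."""
    total = 0.0
    for record in data:
        z = sum(joint(**x) for x in completions(record))
        num = sum(joint(**x) for x in completions(record)
                  if all(x[k] == v for k, v in query.items()))
        total += num / z
    return total

data = [("cold", "?", "yes", "?", "clear"),     # D1 from figure 11.8
        ("?", "?", "yes", "yes", "clear"),      # D2
        ("hot", "yes", "yes", "yes", "green")]  # D4: fully observed
q = {"t": "hot", "f": "yes", "c": "yes"}
print(expected_count(data, q))  # D1 contributes 0, D2 about 0.206, D4 exactly 1
```

The maximization step would then plug these expected counts into the counting formulas of §11.4.1, and the two steps would alternate until the CPT entries stabilize.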
11.5 Dynamic Bayesian Network
So far we have looked at Bayesian networks for modeling static systems that do not have a temporal component. When a BN is defined over time steps, we call it a dynamic Bayesian network (DBN). Just like for Bayesian networks, the dependence between variables is described by directed links and conditional probability tables; the same holds for the dependence between variables over successive time steps. Figure 11.9a presents the expanded representation of a dynamic Bayesian network, where pairs of hidden states $X_{t-1}$ and $X_t$ are linked by an arrow that points in the direction of time. Figure 11.9b represents the same network using a compact notation where the representation of the same state at different time steps is replaced by a single state $X_t$, with the double arrows indicating the links between different time steps. Figure 11.10 shows how the notion of time could be introduced in the cyanobacteria example, where the double self-loop arrows indicate the presence of links between these variables over subsequent time steps.
Figure 11.9: Equivalent expanded and compact representations of a dynamic Bayesian network: (a) expanded DBN with hidden states $X_{t-1}$, $X_t$, $X_{t+1}$ and observations $Y_{t-1}$, $Y_t$, $Y_{t+1}$; (b) compact DBN with $X_t$ and $Y_t$. In (b), the double-lined arrow represents the conditional relationship between time steps.

Figure 11.10: Dynamic Bayesian network for the cyanobacteria example, with variables $C_t$, $M_t$, $W_t$, $T_t$, and $F_t$.
For DBNs, we can apply the same conditional probability estimation methods we presented for Bayesian networks in §11.4. In the case of inference, specialized algorithms are available to perform variable elimination for DBNs. The details of such algorithms can be found in dedicated literature [6].

[6] Murphy, K. P. (2002). Dynamic Bayesian networks: representation, inference and learning. PhD thesis, University of California, Berkeley.
Hidden Markov models. Note that the dynamic Bayesian network presented in figure 11.9 represents a special case called a hidden Markov model (HMM). An HMM has only one hidden-state variable at each time $t$, along with one observed variable. The model is Markovian because the future ($X_{t+1}$) is independent of the past ($X_{t-1}$) given the present ($X_t$), that is, $X_{t+1} \perp\!\!\!\perp X_{t-1}\,|\,x_t$. The HMM model is thus defined by an observation model $p(y_t|x_t)$ and a transition model $p(x_{t+1}|x_t)$. The fire alarm example presented in §6.2.2 can be described by a hidden Markov model. The reader interested in specialized inference algorithms for HMMs should refer to dedicated literature. In this book we will instead focus our attention on the case of a dynamic Bayesian network using continuous state variables, that is, state-space models. These models are described at length in the next chapter. The reader interested in the relationships between Bayesian networks, HMMs, state-space models, and the Markov decision process (see chapter 15) should consult the papers by Diard, Bessière, and Mazer [7] and Ghahramani [8].

[7] Diard, J., P. Bessière, and E. Mazer (2003). A survey of probabilistic models using the Bayesian programming methodology as a unifying framework. In International Conference on Computational Intelligence, Robotics and Autonomous Systems (IEEE-CIRAS), Singapore.

[8] Ghahramani, Z. (2001). An introduction to hidden Markov models and Bayesian networks. International Journal of Pattern Recognition and Artificial Intelligence 15(1), 9–42.