6
Learning from Data

Note (learning context): p(unknown | known); x: constant; θ: parameters of X ∼ f(x; θ); X: random variable; y: observation.

When scalars x are employed to model physical phenomena, we can use empirical observations y to learn about the probability (or probability density) for each value that x can take. When X is a random variable, we can use observations to learn about the parameters θ describing its probability density (or mass) function f(x; θ). This chapter presents the general Bayesian formulation, as well as approximate methods derived from it, for learning about unknown state variables and parameters, given known data. The applications presented in this chapter focus on simple models in order to keep the attention on the Bayesian methods themselves. We will see in chapters 8–13 how this theory can be generalized for applications to machine learning models having complex architectures. For an in-depth review of Bayesian methods applied to data analysis, the reader may refer to specialized textbooks such as those by Gelman et al.¹ or Box and Tiao.²

¹ Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (2014). Bayesian data analysis (3rd ed.). CRC Press.
² Box, G. E. P. and G. C. Tiao (1992). Bayesian inference in statistical analysis. Wiley.
6.1 Bayes
Learning from data can be seen as an indirect problem: we observe some quantities y, and we want to employ this information to infer knowledge about hidden-state variables x and model parameters θ, where y depends on x and θ through f(y|x; θ). In chapter 3, we saw that such a dependence can be treated using conditional probabilities.

Note: The term hidden describes a state variable that is not directly observed.

Note on notation: ";" and "|" are both employed to describe a conditional probability or probability density. The distinction is that ";" is employed when θ are parameters of a PDF or PMF, for example, f(x; θ) = N(x; µ, σ²), and "|" denotes a conditional dependence between variables, for example, f(y|x). Besides these semantic distinctions, both are employed in the same way when it comes to performing calculations.

In the first part of this chapter, we solely focus on the estimation of hidden-state variables x. Given two random variables X ∼ f(x) and Y ∼ f(y), their joint probability density function (PDF) is obtained by the product of a conditional and a marginal PDF,
$$
f(x,y) = \begin{cases} f(x|y)\cdot f(y) \\ f(y|x)\cdot f(x). \end{cases} \qquad (6.1)
$$
In the case where y is known and X is not, we can reorganize the terms of equation 6.1 in order to obtain Bayes rule,
$$
f(x|y) = \frac{f(y|x)\cdot f(x)}{f(y)},
$$
which describes the posterior PDF (i.e., our posterior knowledge) of X given that we have observed y. Let us consider X, a random variable so that X ∼ f(x), and a set of observations D = {y₁, y₂, ···, y_D} that are realizations of the random variables Y = [Y₁ Y₂ ··· Y_D]ᵀ ∼ f(y). Our posterior knowledge for X given the observations D is described by the conditional PDF
$$
\underbrace{f(x|\mathcal{D})}_{\text{posterior}} = \frac{\overbrace{f(\mathcal{D}|x)}^{\text{likelihood}}\cdot \overbrace{f(x)}^{\text{prior}}}{\underbrace{f(\mathcal{D})}_{\text{evidence}}}.
$$

Note: From now on, the number of elements in a set or the number of variables in a vector is defined by a typewriter-font upper-case letter; for example, D is the number of observations in a set D, and X is the number of variables in the vector x = [x₁ x₂ ··· x_X]ᵀ.

Note: A set of observations D = {y₁, y₂, ···, y_D}; e.g., D = {1.2, 3.8, ···, 0.22}, (y ∈ ℝ); D = {3, 1, ···, 6}, (y ∈ ℤ⁺); D = {blue, blue, ···, red}, (y ∈ {blue, red, green}).
Prior: f(x) describes our prior knowledge for the values x that a random variable X can take. The prior knowledge can be expressed in multiple ways. For instance, it can be based on heuristics such as expert opinions. In the case where data is obtained sequentially, the posterior knowledge at time t−1 becomes the prior at time t. In some cases, it also happens that no prior knowledge is available; then, we should employ a non-informative prior, that is, a prior that reflects an absence of knowledge (see §6.4.1 for further details on non-informative priors).

Note: A uniform prior and a non-informative prior are not the same thing. Some non-informative priors are uniform, but not all uniform priors are non-informative.
Likelihood: f(Y = y|x) ≡ f(D|x) describes the likelihood, or the conditional probability density, of observing the event {Y = D}, given the values that x can take. Note that in the special case of an exact observation where D = y = x, then
$$
f(\mathcal{D}|x) = \begin{cases} 1, & y = x \\ 0, & \text{otherwise.} \end{cases}
$$
In such a case, the observations are exact, so the prior does not play a role; by observing y you have all the information about x because y = x. In the general case where y : Y = fct(x), that is, y is a realization from a stochastic process which is a function of x, f(D|x) describes the prior probability density of observing D given a specific set of values for x.

Note: Here, the term prior refers to our state of knowledge that has not yet been influenced by the observations in D.
Evidence: f(Y = y) ≡ f(D) is called the evidence and consists in a normalization constant ensuring that the posterior PDF integrates to 1. In §3.3 we saw that for both the discrete and the continuous cases,
$$
\underbrace{\sum_{x} p(x|\mathcal{D})}_{\text{discrete case}} = 1, \qquad \underbrace{\int_{-\infty}^{\infty} f(x|\mathcal{D})\,\mathrm{d}x}_{\text{continuous case}} = 1.
$$
For that property to be true, the normalization constant is defined as the sum/integral, over the entire domain of x, of the likelihood times the prior,
$$
\underbrace{p(\mathcal{D}) = \sum_{x} p(\mathcal{D}|x)\cdot p(x)}_{\text{discrete case}}, \qquad \underbrace{f(\mathcal{D}) = \int f(\mathcal{D}|x)\cdot f(x)\,\mathrm{d}x}_{\text{continuous case}}.
$$
So far, the description of learning as an indirect problem is rather abstract. The next sections will apply this concept to simple problem configurations involving discrete and continuous random variables.
6.2 Discrete State Variables
Note: Discrete state variable: a discrete random variable X employed to represent the uncertainty regarding the state x of a system.

This section presents how the probabilities of occurrence for discrete state variables are estimated from observations. In all the examples presented in this section, the uncertainty associated with our knowledge of a hidden random variable X is epistemic because a true value x̌ exists, yet it is unknown. Moreover, x is not directly observable; we can only observe y, where p(y|x) describes the conditional probability of an observation y given the value of the discrete hidden-state variable x.

Reminder: The term hidden refers to a variable that is not observed.
6.2.1 Example: Disease Screening
Figure 6.1: Visual interpretation of Bayes rule in the context of disease screening: within the entire population, Pr(¬disease) = 99% and Pr(disease) = 1%, with Pr(test+|disease) = 80% and Pr(test+|¬disease) = 10%. (Adapted from John Henderson, CC BY.)
Figure 6.1 illustrates the context of a rare disease for which the probability of occurrence in the general population is 1 percent, and a screening test such that if one has the disease, the probability of testing positive is Pr(test+|disease) = 0.8; and if one does not have the disease, the probability of testing positive is Pr(test+|¬disease) = 0.1. The sample space for the variable describing the hidden state we are interested in is x ∈ {disease, ¬disease}, for which the prior probability is Pr(X = disease) = 0.01 and Pr(X = ¬disease) = 1 − Pr(X = disease) = 0.99.

For a single observation, D = {test+}, the posterior probability of having the disease is
$$
\begin{aligned}
\Pr(\text{disease}\,|\,\text{test+}) &= \frac{\Pr(\text{test+}|\text{disease})\cdot\Pr(\text{disease})}{\Pr(\text{test+}|\text{disease})\cdot\Pr(\text{disease}) + \Pr(\text{test+}|\neg\text{disease})\cdot\Pr(\neg\text{disease})} \\
&= \frac{1\%\times 80\%}{(1\%\times 80\%) + (99\%\times 10\%)} = 7.5\%,
\end{aligned}
$$
and the posterior probability of not having the disease is
$$
\begin{aligned}
\Pr(\neg\text{disease}\,|\,\text{test+}) &= 1 - \Pr(\text{disease}\,|\,\text{test+}) \\
&= \frac{\Pr(\text{test+}|\neg\text{disease})\cdot\Pr(\neg\text{disease})}{\Pr(\text{test+}|\text{disease})\cdot\Pr(\text{disease}) + \Pr(\text{test+}|\neg\text{disease})\cdot\Pr(\neg\text{disease})} \\
&= \frac{99\%\times 10\%}{(1\%\times 80\%) + (99\%\times 10\%)} = 92.5\%.
\end{aligned}
$$
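To make the arithmetic above concrete, here is a minimal sketch in Python that reproduces this single-observation update using the probabilities stated in the example; nothing beyond basic dictionary arithmetic is assumed.

```python
# Minimal sketch: Bayes rule update for the disease-screening example.
prior = {"disease": 0.01, "no_disease": 0.99}        # Pr(x)
likelihood = {"disease": 0.80, "no_disease": 0.10}   # Pr(test+ | x)

# Evidence: Pr(test+) = sum_x Pr(test+ | x) * Pr(x)
evidence = sum(likelihood[x] * prior[x] for x in prior)

# Posterior: Pr(x | test+) = Pr(test+ | x) * Pr(x) / Pr(test+)
posterior = {x: likelihood[x] * prior[x] / evidence for x in prior}
print(posterior)   # {'disease': 0.0748..., 'no_disease': 0.9252...} -> 7.5% / 92.5%
```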
If we push this example to the extreme case where the disease is so rare that only one human on Earth has it (see figure 6.2) and where we have a highly accurate screening test so that
$$
\text{test+} : \begin{cases} \Pr(\text{test+}\,|\,\text{disease}) = 0.999 \\ \Pr(\text{test+}\,|\,\neg\text{disease}) = 0.001, \end{cases}
$$
the question is: If you test positive, should you be worried? The answer is no; the disease is so rare that even with the high test accuracy, if we tested every human on Earth, we would expect 0.001 × (8 × 10⁹) = 8 × 10⁶ false-positive diagnoses. This example illustrates how, in extreme cases, prior information may outweigh the knowledge you obtain from observations.

Figure 6.2: The example of a disease so rare that only one human on Earth has it.
6.2.2 Example: Fire Alarm
Note: Σ_x p(y = alarm | x) does not have to equal one for the likelihood; however, the posterior does sum to one, Σ_x p(x|y) = 1.
It is common to observe that when a fire alarm is triggered in a public building, people tend to remain calm and ignore the alarm. Let us explore this situation using conditional probabilities. The sample space for the hidden state we are interested in is x ∈ {fire, ¬fire} and the sample space for the observation is y ∈ {alarm, ¬alarm}. If we assume that
$$
p(\text{alarm}\,|\,x) = \begin{cases} \Pr(\text{alarm}\,|\,\text{fire}) = 0.95 \\ \Pr(\text{alarm}\,|\,\neg\text{fire}) = 0.05, \end{cases}
$$
then the question is: Why does no one react when a fire alarm is triggered? The issue here is that p(alarm|x) describes the probability of an alarm given the state; in order to answer the question, we need to calculate Pr(fire_t | alarm), which depends on the prior probability that there was a fire at time t−1 as well as on the transition probabilities p(x_t | x_{t−1}), as illustrated in figure 6.3.

Figure 6.3: Fire alarm example.
We assume that at t−1, before the alarm is triggered, our prior knowledge is
$$
p(x_{t-1}) = \begin{cases} \Pr(\text{fire}_{t-1}) = 0.01 \\ \Pr(\neg\text{fire}_{t-1}) = 0.99. \end{cases}
$$
The transition probability between t−1 and t, at the moment when the alarm is triggered, is described by the conditional probabilities
$$
p(x_t\,|\,x_{t-1}) = \begin{cases}
\Pr(\neg\text{fire}_t\,|\,\neg\text{fire}_{t-1}) = 0.99 \\
\Pr(\text{fire}_t\,|\,\neg\text{fire}_{t-1}) = 0.01 \\
\Pr(\text{fire}_t\,|\,\text{fire}_{t-1}) = 1 \\
\Pr(\neg\text{fire}_t\,|\,\text{fire}_{t-1}) = 0.
\end{cases}
$$
By combining the prior knowledge at t−1 with the transition probabilities, we obtain the joint prior probability of the hidden states at both time steps,
$$
p(x_t, x_{t-1}) = \begin{cases}
\Pr(\text{fire}_t, \text{fire}_{t-1}) = \Pr(\text{fire}_t\,|\,\text{fire}_{t-1})\cdot\Pr(\text{fire}_{t-1}) = 1\cdot 0.01 = 0.01 \\
\Pr(\text{fire}_t, \neg\text{fire}_{t-1}) = \Pr(\text{fire}_t\,|\,\neg\text{fire}_{t-1})\cdot\Pr(\neg\text{fire}_{t-1}) = 0.01\cdot 0.99 = 0.0099 \\
\Pr(\neg\text{fire}_t, \neg\text{fire}_{t-1}) = \Pr(\neg\text{fire}_t\,|\,\neg\text{fire}_{t-1})\cdot\Pr(\neg\text{fire}_{t-1}) = 0.99\cdot 0.99 = 0.9801 \\
\Pr(\neg\text{fire}_t, \text{fire}_{t-1}) = \Pr(\neg\text{fire}_t\,|\,\text{fire}_{t-1})\cdot\Pr(\text{fire}_{t-1}) = 0\cdot 0.01 = 0.
\end{cases}
$$
We obtain the marginal prior probability of each state x at time t by marginalizing this joint probability, that is, summing over the states at t−1,
$$
p(x_t) = \begin{cases}
\Pr(\text{fire}_t) = \Pr(\text{fire}_t, \text{fire}_{t-1}) + \Pr(\text{fire}_t, \neg\text{fire}_{t-1}) = 0.01 + 0.0099 = 0.0199 \approx 0.02 \\
\Pr(\neg\text{fire}_t) = \Pr(\neg\text{fire}_t, \text{fire}_{t-1}) + \Pr(\neg\text{fire}_t, \neg\text{fire}_{t-1}) = 0.9801 \approx 0.98.
\end{cases}
$$
If we combine the information from the prior knowledge at t and the observation, we can compute the posterior probability of the state x_t, given that the alarm has been triggered,
$$
\begin{aligned}
\Pr(\text{fire}_t\,|\,\text{alarm}) &= \frac{\overbrace{\Pr(\text{alarm}\,|\,\text{fire}_t)\cdot\Pr(\text{fire}_t)}^{\Pr(\text{alarm},\,\text{fire}_t)}}{\underbrace{\big(\Pr(\text{alarm}\,|\,\text{fire}_t)\cdot\Pr(\text{fire}_t)\big) + \big(\Pr(\text{alarm}\,|\,\neg\text{fire}_t)\cdot\Pr(\neg\text{fire}_t)\big)}_{\Pr(\text{alarm})}} \\
&= \frac{0.95\cdot 0.02}{(0.95\cdot 0.02) + (0.05\cdot 0.98)} = 0.28 \\
\Pr(\neg\text{fire}_t\,|\,\text{alarm}) &= 1 - \Pr(\text{fire}_t\,|\,\text{alarm}) = 0.72.
\end{aligned}
$$

Overall, despite an alarm being triggered, the probability that there is a fire remains less than 30 percent.
We can explore what would happen at time t+1 if, after the alarm has been triggered, a safety official informs us that there is a fire and that we need to evacuate. We assume here that the conditional probability of receiving such information given the state is
$$
p(\text{info}\,|\,x) = \begin{cases} \Pr(\text{info}\,|\,\text{fire}) = 0.95 \\ \Pr(\text{info}\,|\,\neg\text{fire}) = 0.05, \end{cases}
$$
as illustrated in figure 6.4. Here we assume that, given the state, the probability of receiving the information from the safety official is conditionally independent from the alarm, that is, Pr(info | alarm, fire) = Pr(info | fire). If, as we did for the estimation at time t, we combine the information from the prior knowledge at time t, the transition probabilities, and the observation at time t+1, we can compute the posterior probability of the hidden state x_{t+1}, given that the alarm has been triggered and that we received the information from a safety official. First, the marginal prior probabilities of each state at time t+1 are estimated from the posterior probability at t combined with the transition probabilities so that
$$
\begin{aligned}
\Pr(\text{fire}_{t+1}\,|\,\text{alarm}) &= \overbrace{\Pr(\text{fire}_{t+1}\,|\,\neg\text{fire}_t)\cdot\Pr(\neg\text{fire}_t\,|\,\text{alarm})}^{\Pr(\text{fire}_{t+1},\,\neg\text{fire}_t\,|\,\text{alarm})} + \overbrace{\Pr(\text{fire}_{t+1}\,|\,\text{fire}_t)\cdot\Pr(\text{fire}_t\,|\,\text{alarm})}^{\Pr(\text{fire}_{t+1},\,\text{fire}_t\,|\,\text{alarm})} \\
&= 0.01\cdot 0.72 + 1\cdot 0.28 = 0.29 \\
\Pr(\neg\text{fire}_{t+1}\,|\,\text{alarm}) &= 1 - \Pr(\text{fire}_{t+1}\,|\,\text{alarm}) = 0.71.
\end{aligned}
$$

Figure 6.4: Example illustrating the role of the prior in sequential data acquisition and interpretation.
Then, the posterior probability at t+1 is estimated following
$$
\begin{aligned}
\Pr(\text{fire}_{t+1}\,|\,\text{alarm}, \text{info}) &= \frac{\overbrace{\Pr(\text{info}\,|\,\text{fire}_{t+1})\cdot\Pr(\text{fire}_{t+1}\,|\,\text{alarm})}^{\Pr(\text{info},\,\text{fire}_{t+1}\,|\,\text{alarm})}}{\underbrace{\big(\Pr(\text{info}\,|\,\text{fire}_{t+1})\cdot\Pr(\text{fire}_{t+1}\,|\,\text{alarm})\big) + \big(\Pr(\text{info}\,|\,\neg\text{fire}_{t+1})\cdot\Pr(\neg\text{fire}_{t+1}\,|\,\text{alarm})\big)}_{\Pr(\text{info}\,|\,\text{alarm})}} \\
&= \frac{0.95\cdot 0.29}{(0.95\cdot 0.29) + (0.05\cdot 0.71)} = 0.89 \\
\Pr(\neg\text{fire}_{t+1}\,|\,\text{alarm}, \text{info}) &= 1 - \Pr(\text{fire}_{t+1}\,|\,\text{alarm}, \text{info}) = 0.11.
\end{aligned}
$$

This example illustrates how, when knowledge about a hidden state evolves over time as new information becomes available, the posterior at t is employed to estimate the prior at t+1.
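The sequential logic of this example (propagate the state through the transition model, then update with each new observation) can be condensed into a short sketch. The code below is an illustration that uses the probabilities stated above, with the states written out as fire/no_fire; it reproduces the results up to the rounding of intermediate values used in the text.

```python
# Minimal sketch: sequential prediction/update for the fire-alarm example.
prior_tm1 = {"fire": 0.01, "no_fire": 0.99}                  # p(x_{t-1})
transition = {("fire", "fire"): 1.0, ("no_fire", "fire"): 0.0,
              ("fire", "no_fire"): 0.01, ("no_fire", "no_fire"): 0.99}  # p(x_t | x_{t-1})
lik_alarm = {"fire": 0.95, "no_fire": 0.05}                  # p(alarm | x)
lik_info = {"fire": 0.95, "no_fire": 0.05}                   # p(official info | x)

def predict(p_prev):
    """Marginal prior at the next step: sum over x' of p(x_t | x') * p(x')."""
    return {x: sum(transition[(x, xp)] * p_prev[xp] for xp in p_prev)
            for x in p_prev}

def update(p_prior, lik):
    """Bayes update: posterior proportional to likelihood times prior."""
    unnorm = {x: lik[x] * p_prior[x] for x in p_prior}
    z = sum(unnorm.values())
    return {x: v / z for x, v in unnorm.items()}

p_t = update(predict(prior_tm1), lik_alarm)     # Pr(fire_t | alarm) ~ 0.28
p_tp1 = update(predict(p_t), lik_info)          # ~ 0.88; the text obtains 0.89 after rounding
print(p_t["fire"], p_tp1["fire"])
```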
6.2.3 Example: Post-Earthquake Damage Assessment
Consider an earthquake of intensity³
$$
X : x \in \{\text{light (L)}, \text{moderate (M)}, \text{important (I)}\}
$$
and a structure that can be in a state (e.g., see figure 6.5)
$$
Y : y \in \{\text{damaged (D)}, \text{undamaged } (\neg\text{D})\}.
$$
These two random variables are illustrated using a Venn diagram in figure 6.6. The likelihood of damage given each earthquake intensity is
$$
p(\text{D}\,|\,x) = \begin{cases} \Pr(Y = \text{D}\,|\,x = \text{L}) = 0.01 \\ \Pr(Y = \text{D}\,|\,x = \text{M}) = 0.10 \\ \Pr(Y = \text{D}\,|\,x = \text{I}) = 0.60, \end{cases}
$$
and the prior probability of each intensity is given by
$$
p(x) = \begin{cases} \Pr(X = \text{L}) = 0.90 \\ \Pr(X = \text{M}) = 0.08 \\ \Pr(X = \text{I}) = 0.02. \end{cases}
$$

Note: For the likelihood, Σ_x p(D|x) ≠ 1; however, for the prior, Σ_x p(x) = 1, and for the posterior, Σ_x p(x|D) = 1.

³ This example is adapted from Armen Der Kiureghian's CE229 course at UC Berkeley.

Figure 6.5: Extensive damage after San Francisco's 1906 earthquake.

Figure 6.6: Venn diagram illustrating the sample space where the earthquake intensities light (L), moderate (M), and important (I) are mutually exclusive and collectively exhaustive events. The post-earthquake structural states of damaged (D) and undamaged (¬D) intersect the earthquake intensity events.
We are interested in inferring the intensity of an earthquake given that we have observed damaged buildings, y = D, following an event, so that
$$
p(x\,|\,\text{D}) = \frac{p(\text{D}\,|\,x)\cdot p(x)}{p(\text{D})}.
$$
The first step to solve this problem is to compute the normalization constant,
$$
\begin{aligned}
p(y = \text{D}) &= \overbrace{p(\text{D}|\text{L})\cdot p(\text{L})}^{\Pr(Y=\text{D},\,X=\text{L})} + \overbrace{p(\text{D}|\text{M})\cdot p(\text{M})}^{\Pr(Y=\text{D},\,X=\text{M})} + \overbrace{p(\text{D}|\text{I})\cdot p(\text{I})}^{\Pr(Y=\text{D},\,X=\text{I})} \\
&= 0.01\cdot 0.90 + 0.10\cdot 0.08 + 0.60\cdot 0.02 = 0.029.
\end{aligned}
$$
Then, the posterior probability of each state is calculated following
$$
\begin{aligned}
\Pr(x = \text{I}\,|\,y = \text{D}) &= \frac{p(\text{D}|\text{I})\cdot p(\text{I})}{p(\text{D})} = \frac{0.60\cdot 0.02}{0.029} = 0.41 \\
\Pr(x = \text{M}\,|\,y = \text{D}) &= \frac{p(\text{D}|\text{M})\cdot p(\text{M})}{p(\text{D})} = \frac{0.10\cdot 0.08}{0.029} = 0.28 \\
\Pr(x = \text{L}\,|\,y = \text{D}) &= \frac{p(\text{D}|\text{L})\cdot p(\text{L})}{p(\text{D})} = \frac{0.01\cdot 0.90}{0.029} = 0.31.
\end{aligned}
$$
We see that despite the prior probability of a high-intensity earthquake being small (i.e., Pr(X = I) = 0.02), the posterior is significantly higher because of the observation indicating the presence of structural damage.
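The same update generalizes to any number of discrete states; as a minimal sketch, the function below reuses the prior and likelihood values from this example.

```python
# Minimal sketch: discrete Bayes update over the three earthquake intensities.
def discrete_posterior(prior, likelihood):
    """prior[x] = p(x), likelihood[x] = p(D | x); returns p(x | D)."""
    evidence = sum(likelihood[x] * prior[x] for x in prior)   # p(D) = 0.029 here
    return {x: likelihood[x] * prior[x] / evidence for x in prior}

prior = {"L": 0.90, "M": 0.08, "I": 0.02}
lik_damage = {"L": 0.01, "M": 0.10, "I": 0.60}
print(discrete_posterior(prior, lik_damage))   # {'L': 0.31, 'M': 0.28, 'I': 0.41}
```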
6.3 Continuous State Variables
This section presents examples of applications of hidden-state estimation for continuous cases. Compared with discrete cases, continuous ones involve more abstract concepts in the definition of the likelihood f(D|x) and evidence f(D). As for the discrete case, if our model is y = x and we observe y, then we know all there is to know about x, so the conditional probability of y given x is nonzero only for the true value x̌. In this section, we will cover cases involving imperfect observations rather than perfect ones.
6.3.1 Likelihood: f(D|x)
In practice, we are interested in the case where x̌ is a true yet unknown hidden-state variable and where we have access to imprecise observations yᵢ, where the observation model is either
$$
\begin{aligned}
y_i &= \check{x} + v_i && \text{(direct observation)} \\
y_i &> \check{x} + v_i && \text{(upper-bound censored observation)} \\
y_i &< \check{x} + v_i && \text{(lower-bound censored observation)},
\end{aligned}
$$
where in all three cases the observation error is vᵢ : Vᵢ ∼ N(v; 0, σ²_V), with Vᵢ ⊥⊥ Vⱼ, ∀i ≠ j. A direct observation corresponds to the standard case where the observation is the true state x̌ to which an observation error v is added. V is a Normal random variable employed to describe observation errors. V is often chosen to be Normal in order to simplify the problems by maintaining analytical tractability for linear models (see §4.1.3). Lower- and upper-bound observations are called censored observations, where we only know that the hidden-state variable is, respectively, greater or smaller than the observed value y.

Note: Vᵢ ⊥⊥ Vⱼ, ∀i ≠ j indicates that the random variable Vᵢ is independent of Vⱼ, for all i ≠ j (i.e., that all observation errors are independent of each other).
Direct observations. For a direct observation y, the likelihood is formulated from the additive observation model so that the PDF of Y is described by the Normal PDF Y ∼ N(y; x̌, σ²_V). Because x̌ is unknown, the likelihood, denoted by f(y|x) or L(x|y), describes the prior probability density of observing {Y = y} given {x̌ = x},
$$
f(y|x) = \mathcal{L}(x|y) = \frac{1}{\sqrt{2\pi}\,\sigma_V}\exp\!\left(-\frac{1}{2}\left(\frac{y-x}{\sigma_V}\right)^{2}\right). \qquad (6.2)
$$

Note (observation model): Y = x̌ + V; observation: N(y; x̌, σ²_V); hidden state: constant; measurement error: N(v; 0, σ²_V).

The leftmost surface in figure 6.7 corresponds to equation 6.2 plotted with respect to both x and y, for σ_V = 1. The pink dashed curve represents the explicit evaluation of f(y|x = 0), where the evaluation at x = 0 is arbitrarily chosen for illustration purposes, and the red solid curve reverses the relationship to represent the likelihood of the observation y = −1 for any value x, where f(y = −1|x) = L(x|y = −1).
Figure 6.7: Representation of the likelihood function for a direct observation. The leftmost surface describes f(y|x), the rightmost graph describes f(y = −1|x) = L(x|y = −1), and the center graph describes f(y|x = 0).
Keep in mind that the likelihood is not a PDF with respect to x, so it does not integrate to 1. Accordingly, the rightmost graph in figure 6.7 is a function quantifying, for each value x, how likely it is to observe the value y = −1.
Upper-bound observations. For an upper-bound observation y > x̌ + v, the likelihood is described by the cumulative distribution function (CDF) of Y ∼ N(y; x, σ²_V),
$$
f(y|x) = \overbrace{\int_{-\infty}^{y}\frac{1}{\sqrt{2\pi}\,\sigma_V}\exp\!\left(-\frac{1}{2}\left(\frac{y'-x}{\sigma_V}\right)^{2}\right)\mathrm{d}y'}^{\text{CDF of }\mathcal{N}(y;\,x,\,\sigma_V^2)} = \frac{1}{2} + \frac{1}{2}\,\mathrm{erf}\!\left(\frac{y-x}{\sqrt{2}\,\sigma_V}\right). \qquad (6.3)
$$
The surface in figure 6.8 corresponds to equation 6.3 plotted with respect to x and y, for σ_V = 1. The pink dashed curve represents the explicit evaluation of f(y|x = 0), where the evaluation at x = 0 is arbitrarily chosen for illustration purposes, and the red solid curve reverses the relationship to represent the likelihood of the observation y = −1, given any value x, L(x|y = −1).
Figure 6.8: Representation of the likelihood function for an upper-bound censored observation. The leftmost surface describes f(y|x), the rightmost graph describes f(y = −1|x) = L(x|y = −1), and the center graph describes f(y|x = 0).
Lower-bound observations. For a lower-bound observation y < x̌ + v, the likelihood is described by the complement of the cumulative distribution function of Y ∼ N(y; x, σ²_V),
$$
f(y|x) = 1 - \overbrace{\int_{-\infty}^{y}\frac{1}{\sqrt{2\pi}\,\sigma_V}\exp\!\left(-\frac{1}{2}\left(\frac{y'-x}{\sigma_V}\right)^{2}\right)\mathrm{d}y'}^{\text{CDF of }\mathcal{N}(y;\,x,\,\sigma_V^2)} = 1 - \left(\frac{1}{2} + \frac{1}{2}\,\mathrm{erf}\!\left(\frac{y-x}{\sqrt{2}\,\sigma_V}\right)\right). \qquad (6.4)
$$
The surface in figure 6.9a corresponds to equation 6.4 plotted with respect to x and y, for σ_V = 1. The pink dashed curve represents the explicit evaluation of f(y|x = 0), where the evaluation at x = 0 is arbitrarily chosen for illustration purposes, and again the red curve reverses the relationship to represent the likelihood of the observation y = −1, given any value x, L(x|y = −1). Note that in figure 6.9b, if instead of having σ_V = 1, σ_V tends to zero, the smooth transition from low likelihood values to high ones would become a sharp jump from zero to one.
Figure 6.9: Representation of the likelihood function for a lower-bound censored observation, for (a) σ_V = 1 and (b) σ_V = 0. The leftmost surface describes f(y|x), the rightmost graph describes f(y = −1|x) = L(x|y = −1), and the center graph describes f(y|x = 0).
Note: The notion of conditional independence implies that, given x, Yᵢ is independent of Yⱼ, that is,
$$
Y_i \perp\!\!\!\perp Y_j \,|\, x \iff f(y_i|y_j, x) = f(y_i|x),
$$
so that f(yᵢ, yⱼ|x) = f(yᵢ|x) · f(yⱼ|x). This assumption is often satisfied because the observation model follows the form y = x + v, where v is the realization of an observation error so that Vᵢ ⊥⊥ Vⱼ, ∀i ≠ j.
Multiple observations. All cases above dealt with a single observation. When multiple observations are available, one must define the joint likelihood for all observations given x. With the assumption that observations are conditionally independent given x, the joint PDF is obtained from the product of the marginals,
$$
f(\mathcal{D}|x) = f(Y_1 = y_1, \cdots, Y_{\mathtt{D}} = y_{\mathtt{D}}\,|\,x) = \prod_{i=1}^{\mathtt{D}} f(Y_i = y_i|x) = \exp\!\left(\sum_{i=1}^{\mathtt{D}}\ln f(Y_i = y_i|x)\right). \qquad (6.5)
$$
Note that practical difficulties related to numerical underflow or overflow arise when trying to evaluate the product of several marginal likelihoods f(yᵢ|x). This is why, in equation 6.5, the product of the marginals is replaced with the equivalent yet numerically more robust formulation using the exponential of the sum of the log marginals.
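As a sketch of how equations 6.2 to 6.5 translate into code, the snippet below evaluates the joint log-likelihood of a mixed set of direct and censored observations over a grid of candidate x values, working in log space to avoid the underflow issue mentioned above. The observation values and σ_V = 1 are illustrative assumptions, not taken from the text.

```python
# Minimal sketch: joint log-likelihood for direct and censored observations (eqs. 6.2-6.5).
import numpy as np
from scipy.stats import norm

sigma_v = 1.0                      # assumed observation-error standard deviation
x_grid = np.linspace(-5, 5, 201)   # candidate values for the hidden state x

def log_lik(y, kind, x):
    """Marginal log-likelihood ln f(y|x) for one observation."""
    if kind == "direct":       # eq. 6.2: Normal PDF
        return norm.logpdf(y, loc=x, scale=sigma_v)
    if kind == "upper":        # eq. 6.3: Normal CDF
        return norm.logcdf(y, loc=x, scale=sigma_v)
    if kind == "lower":        # eq. 6.4: complement of the Normal CDF
        return norm.logsf(y, loc=x, scale=sigma_v)
    raise ValueError(kind)

# Illustrative data set: two direct observations and one lower-bound observation.
data = [(-1.0, "direct"), (0.0, "direct"), (0.5, "lower")]

# eq. 6.5: sum the log marginals instead of multiplying the marginals.
joint_log_lik = sum(log_lik(y, kind, x_grid) for y, kind in data)
joint_lik = np.exp(joint_log_lik)   # f(D|x) evaluated on the grid
```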
6.3.2 Evidence: f(D)
Figure 6.10: Examples of application of the rectangle integration rule for (a) S = 7, (b) S = 10, (c) S = 20, and (d) S = 100 bins of width Δx_s = 10/S; the resulting approximations are f̂(D) = 0.2158, 0.2198, 0.2197, and 0.2197, respectively.
The evidence acts as a normalization constant obtained by integrating the product of the likelihood and the prior over all possible values of x,
$$
f(\mathcal{D}) = \int f(\mathcal{D}|x)\cdot f(x)\,\mathrm{d}x.
$$
This integral is typically difficult to evaluate because it is not analytically tractable. Figure 6.10 presents a naive way of approaching this integral using the rectangle rule, where the domain containing a significant probability content is subdivided into S bins of width Δx_s in order to approximate
$$
\hat{f}(\mathcal{D}) = \sum_{s=1}^{\mathtt{S}} f(\mathcal{D}|x_s)\cdot f(x_s)\,\Delta x_s.
$$
This approach is reasonable for trivial 1-D problems; however, it becomes computationally intractable for problems in larger dimensions (i.e., for x ∈ ℝ^X) because the number of grid combinations to evaluate increases exponentially with the number of dimensions X to integrate over. Section 6.5 presents an introduction to sampling methods that allow overcoming this challenge. Chapter 7 will introduce more advanced sampling methods allowing the estimation of the normalization constants for high-dimensional domains.
6.3.3 Posterior: f(x|D)
The posterior is the product of the likelihood function times the prior PDF, divided by the normalization constant,
$$
f(x|\mathcal{D}) = \frac{f(\mathcal{D}|x)\cdot f(x)}{f(\mathcal{D})}.
$$
Let us consider two direct observations D = {y₁ = −1, y₂ = 0} obtained from the observation model
$$
y_i = x + v_i, \quad v_i : V_i \sim \mathcal{N}(v;\, 0,\, 1^2), \quad V_i \perp\!\!\!\perp V_j,\ \forall i \neq j.
$$
The likelihood is thus given by f(D|x) = N(y₁; x, 1²) · N(y₂; x, 1²). We assume the prior PDF follows a uniform PDF within the range from −5 to 5, f(x) = U(x; −5, 5). As presented in figure 6.10, the normalization constant is estimated using the rectangle integration rule to be f̂(D) = 0.2197. Figure 6.11 presents the prior, the likelihood, and the posterior for this example.
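A minimal sketch of this two-observation example is given below: it evaluates the unnormalized posterior on a grid and applies the rectangle rule. The printed evidence depends on how the uniform prior is normalized, so the number is only indicative; the posterior mean and variance are unaffected by that constant.

```python
# Minimal sketch: rectangle-rule evidence and posterior for D = {y1 = -1, y2 = 0}.
import numpy as np
from scipy.stats import norm, uniform

y_obs = np.array([-1.0, 0.0])
S = 100                                     # number of bins
x_s = np.linspace(-5, 5, S)                 # bin centers (approximately)
dx = x_s[1] - x_s[0]                        # bin width

prior = uniform.pdf(x_s, loc=-5, scale=10)  # U(x; -5, 5)
lik = np.prod([norm.pdf(y, loc=x_s, scale=1.0) for y in y_obs], axis=0)  # f(D|x)

evidence = np.sum(lik * prior * dx)         # rectangle-rule approximation of f(D)
posterior = lik * prior / evidence          # f(x|D) on the grid

post_mean = np.sum(x_s * posterior * dx)                      # ~ -0.5
post_var = np.sum((x_s - post_mean) ** 2 * posterior * dx)    # ~ 0.5
print(evidence, post_mean, post_var)
```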
Figure 6.11: Example of prior, likelihood, and posterior.
6.3.4 Number of Observations and Identifiability
In the extreme case where no observation is available (i.e., D = ∅), the posterior is equal to the prior, f(x|∅) = f(x). In order to explore what happens at the other end of the spectrum, where an infinite amount of data is available, it is required to distinguish between identifiable and non-identifiable problems. A problem is identifiable when, given an infinite number of direct observations, we are theoretically able to retrieve the true hidden state x̌ of a deterministic system. For non-identifiable problems, even an infinite number of direct observations does not allow retrieving the true hidden state x̌ of the system, because multiple equally valid solutions exist.

Figure 6.12: Example of two streams merging in a single river. (Photo: Ashwini Chaudhary)
Example: Non-identifiable problem. Take the example illustrated in figure 6.12, where we want to estimate the contaminant concentrations in two streams, x₁ and x₂, and where we only have access to contaminant concentration observations
$$
y_i = x_1 + x_2 + v_i, \quad v_i : V_i \sim \mathcal{N}(v;\, 0,\, 2^2), \quad V_i \perp\!\!\!\perp V_j,\ \forall i \neq j,
$$
which are collected at the output of the river after both streams have merged. We assume that our prior knowledge is uniform over positive concentration values, that is, f(x₁, x₂) ∝ 1. The resulting posteriors are presented in figure 6.13 for D = 1 and D = 100 observations, along with the true values marked by a red cross. This problem is intrinsically non-identifiable because, even when the epistemic uncertainty has been removed by a large number of observations, an infinite number of combinations of x₁ and x₂ remain possible.
Figure 6.13: Posterior contours for an example of a non-identifiable problem, with (a) D = 1 and (b) D = 100 observations and a prior f(x₁, x₂) ∝ 1.
Non-identifiable problems can be regularized using prior knowledge. If, for example, we know that the prior probability density for the contaminant concentration in the first stream follows f(x₁) = N(x₁; 10, 5²), the posterior will now have a single mode, as presented in figure 6.14.

Figure 6.14: Posterior contours for an example of a non-identifiable problem for D = 1, which is constrained by the prior knowledge f(x₁, x₂) ∝ N(x₁; 10, 5²).
Well-constrained identifiable problems. For well-constrained problems where a true value x̌ exists, given an infinite number of direct observations (D → ∞) that are conditionally independent given x, the posterior tends to a Dirac delta function, which is nonzero only for x = x̌,
$$
f(x|\mathcal{D}) = \frac{f(\mathcal{D}|x)\cdot f(x)}{f(\mathcal{D})} \rightarrow \underbrace{\delta(x - \overbrace{\check{x}}^{\text{true value}})}_{\text{Dirac delta function}}.
$$
Figure 6.15 presents an example of such a case where the posterior has collapsed over a single value. The posterior expected value then corresponds to the true value,
$$
\mathbb{E}[X|\mathcal{D}] \rightarrow \check{x},
$$
and the posterior variance tends to zero because no epistemic uncertainty remains about the value of x,
$$
\mathrm{var}[X|\mathcal{D}] \rightarrow 0.
$$
Note that in most practical cases we do not have access to an infinite number of observations. Moreover, keep in mind that the less data we have access to, the more important it is to consider the uncertainty in the posterior f(x|D).

Figure 6.15: Example of posterior PDF tending toward the Dirac delta function.
6.4 Parameter Estimation
In this section, we will explore how to extend Bayesian estimation to the parameters θ defining the probability density function of a hidden random variable x : X ∼ f(x) ≡ f(x; θ). This setup is relevant when the system responses are stochastic. For example, even in the absence of observation errors, different samples from a concrete batch made from the same mix of components would have different resistances due to the uncertainty associated with the intrinsic heterogeneity of the material. The posterior probability density function sought is now
$$
f(\theta|\mathcal{D}) = \frac{f(\mathcal{D}|\theta)\cdot f(\theta)}{f(\mathcal{D})},
$$
where f(θ) is the prior PDF of the unknown parameters, f(D|θ) is the likelihood of the observations D = {y₁, ···, y_D}, and f(θ|D) is the posterior PDF for the unknown parameters θ. After having estimated the posterior PDF for the parameters, f(θ|D), we can marginalize this uncertainty in f(x; θ) in order to obtain the posterior predictive PDF for the hidden variable X,
$$
f(x|\mathcal{D}) = \int f(x;\theta)\cdot f(\theta|\mathcal{D})\,\mathrm{d}\theta. \qquad (6.6)
$$

Note (notation): xᵢ : X ∼ f_X(x; θ); θ: the PDF's parameters; {x₁, x₂, ···, x_D}: realizations of the hidden variable X.

Figure 6.16: Example of Bayesian parameter estimation.

Figure 6.16 presents an example where we want to estimate the joint posterior probability of the mean µ_X and standard deviation σ_X defining the PDF of X, which describes the variability in the resistance of a given concrete mix. Here, the concrete resistance X is assumed to be a random variable. This means that, unlike in previous sections, no true value exists for the resistance because it is now a stochastic quantity. Despite this, true values can still exist for the parameters of f(x; θ), that is, {µ̌_X, σ̌_X}. This example will be further explored in §6.5.3.
6.4.1 Prior: f(θ)
f(θ) describes our prior knowledge for the values that θ can take. Prior knowledge can be based on heuristics (expert knowledge), on the posterior PDF obtained from previous data, or on a non-informative prior (i.e., absence of prior knowledge). A non-informative prior for parameters such as the mean of a Normal PDF is described by a uniform density f(µ) ∝ 1. For standard deviations, the non-informative prior typically assumes that all orders of magnitude for σ have equal probabilities. This hypothesis implies that f(ln σ) ∝ 1. By using the change of variable rule presented in §3.4, it leads to f(σ) ∝ 1/σ, because the derivative of this transformation is d ln σ/dσ = 1/σ. Note that the two non-informative priors described for µ and σ are improper because ∫ f(θ) dθ = ∞, which does not respect the fundamental requirement that a PDF integrates to 1, as presented in §3.3. For this reason we cannot draw samples from an improper prior. Nevertheless, when used as a prior, an improper PDF can yield a proper posterior (see §7.5).

Note (non-informative prior): for θ = {µ_X, σ_X} and X ∼ N(x; µ_X, σ²_X): f(µ_X) ∝ 1; f(σ_X) ∝ 1/σ_X; f(µ_X, σ_X) ∝ 1/σ_X, if µ_X ⊥⊥ σ_X.
6.4.2 Likelihood: f(D|θ)
f(D|θ) ≡ f(Y = y|θ) ≡ f(y|θ) ≡ L(θ|y) are all equivalent ways of describing the likelihood of observing specific values y, given θ. In the special case where y = x, so that realizations of x : X ∼ f(x; θ) are directly observed, the likelihood is
$$
f(y|\theta) = f(y;\theta) = f(x;\theta).
$$
In a more general case where the observations D = {yᵢ, ∀i ∈ {1 : D}} are realizations of an additive observation model yᵢ = xᵢ + vᵢ, where vᵢ : V ∼ N(v; 0, σ²_V) and xᵢ : X ∼ N(x; µ_X, σ²_X), the likelihood is
$$
f(y|\theta) = \mathcal{N}(y;\, \mu_X,\, \sigma_X^2 + \sigma_V^2).
$$

Note (observation model): Y = X + V; observation: N(y; µ_X, σ²_X + σ²_V); hidden-state random variable: N(x; µ_X, σ²_X); measurement error: N(v; 0, σ²_V).

This choice of an additive observation model using Normal random variables allows having an analytically tractable formulation for the likelihood. If we want to employ a multiplicative observation model, y = x · v, then choosing v : V ∼ ln N(v; 0, σ²_{ln V}) and x : X ∼ ln N(x; µ_{ln X}, σ²_{ln X}) as lognormal random variables also preserves the analytical tractability for the likelihood, which follows
$$
f(y|\theta) = \ln\mathcal{N}(y;\, \mu_{\ln X},\, \sigma_{\ln X}^2 + \sigma_{\ln V}^2).
$$
6.4.3 Posterior PDF: f(θ|D)
In §6.3.4, we saw that when the number of independent observations D → ∞, the posterior for X tends to a Dirac delta function, f(x|D) → δ(x − x̌). The situation is different for the estimation of the parameters describing the PDF of a hidden random variable (R.V.); when D → ∞, it is now the posterior PDF for θ that tends to a Dirac delta function, as presented in figure 6.17,
$$
f(\theta|\mathcal{D}) = \frac{f(\mathcal{D}|\theta)\cdot f(\theta)}{f(\mathcal{D})} \rightarrow \underbrace{\delta(\theta - \overbrace{\check{\theta}}^{\text{true value}})}_{\text{Dirac delta function}},
$$
where the posterior expected value tends toward the true parameter values and the variance of the posterior tends to zero,
$$
\mathbb{E}[\theta|\mathcal{D}] \rightarrow \check{\theta}, \qquad \mathrm{var}[\theta|\mathcal{D}] \rightarrow 0.
$$
In such a situation, the epistemic uncertainty associated with the parameter values θ vanishes and the uncertainty about X reaches its irreducible level, f(x|D) → f(x; θ̌).

Note: The term independent observations is employed as a shortcut for the more rigorous term, which should in this case be conditionally independent given x.

Figure 6.17: The posterior PDF for parameters reduces to a Dirac delta function when the number of independent observations tends to infinity.
In practical cases where the number of observations D < ∞, uncertainty remains about the parameters, f(θ|D), so f(x|D) is typically more diffused than f(x; θ̌). The uncertainty associated with the parameters, f(θ|D), can be marginalized as described in equation 6.6 in order to obtain the posterior predictive PDF for the hidden random variable. This integral is typically not analytically tractable, so it is common to resort to the Monte Carlo sampling methods presented in the next section in order to compute the posterior predictive expected value and variance for f(x|D).
6.5 Monte Carlo
Monte Carlo sampling is a numerical integration method that allows performing Bayesian estimation for analytically intractable cases. In this section, we will first see how Monte Carlo can be employed as an integration method. Then, we will explore how this method fits in the context of Bayesian estimation.
6.5.1 Monte Carlo Integration
We employ the example of a unit-diameter circle with area a = πr² = 0.785 to demonstrate how we can approximate this area with Monte Carlo numerical integration. A circle of diameter d = 1 (r = 0.5) is described by the equation
$$
(x - 0.5)^2 + (y - 0.5)^2 = r^2.
$$
This equation is employed in the definition of the indicator function
$$
\mathbb{I}(x,y) = \begin{cases} 1 & \text{if } (x-0.5)^2 + (y-0.5)^2 \leq r^2 \\ 0 & \text{otherwise.} \end{cases}
$$
We then define a bivariate uniform probability density in the range {x, y} ∈ (0, 1), where f_XY(x, y) = 1. The circle area a can be formulated as the integral of the product of the indicator function times f_XY(x, y),
$$
\underbrace{a}_{\text{area}} = \int_y \int_x \mathbb{I}(x,y)\cdot f_{XY}(x,y)\,\mathrm{d}x\,\mathrm{d}y = \mathbb{E}[\mathbb{I}(X,Y)]. \qquad (6.7)
$$

Figure 6.18: Examples of Monte Carlo numerical integration for which the theoretical answer is a = πr² = 0.785: (a) S = 100, Ê[I(X, Y)] = 0.800; (b) S = 1000, Ê[I(X, Y)] = 0.759; (c) S = 10000, Ê[I(X, Y)] = 0.785.
We saw in §3.3.5 that the integral in equation 6.7 corresponds to the expectation of the indicator function I(X, Y). Given S random samples {x_s, y_s}, s ∈ {1, 2, ···, S}, generated from the uniform probability density f_XY(x, y), the expectation operation is defined as
$$
a = \mathbb{E}[\mathbb{I}(X,Y)] = \lim_{\mathtt{S}\to\infty} \frac{1}{\mathtt{S}}\sum_{s=1}^{\mathtt{S}} \mathbb{I}(x_s, y_s).
$$
In practice, because the number of samples S is finite, our estimation will remain an approximation of the expectation,
$$
a \approx \hat{\mathbb{E}}[\mathbb{I}(X,Y)] = \frac{1}{\mathtt{S}}\sum_{s=1}^{\mathtt{S}} \mathbb{I}(x_s, y_s).
$$
Figure 6.18 presents realizations for S = {10², 10³, 10⁴} samples as well as their approximations of the area a. As the number of samples increases, the approximation error decreases and Ê[I(X, Y)] tends to the true value of a = πr² = 0.785.

Note: The hat in Ê[I(X, Y)] denotes that the quantity is an approximation of E[I(X, Y)].

The interesting property of Monte Carlo integration is that the estimation quality, measured by the variance var[Ê[I(X, Y)]], depends on the number of samples and not on the number of dimensions.⁴ This property is key for the application of the Monte Carlo method in high-dimensional domains.

⁴ MacKay, D. J. C. (1998). Introduction to Monte Carlo methods. In Learning in graphical models, pp. 175–204. Springer.
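A minimal sketch of this Monte Carlo integration is shown below; it draws uniform samples in the unit square and averages the indicator function.

```python
# Minimal sketch: Monte Carlo estimate of the area of a circle with diameter 1.
import numpy as np

rng = np.random.default_rng(0)
S = 10_000
x, y = rng.uniform(0, 1, size=(2, S))                   # samples from f_XY(x, y) = 1 on (0, 1)^2

inside = (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.5 ** 2    # indicator I(x, y)
a_hat = inside.mean()                                   # E_hat[I(X, Y)] ~ pi * 0.5^2 = 0.785
print(a_hat)
```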
6.5.2 Monte Carlo Sampling: Continuous State Variables
Note (Monte Carlo sampling): h(x) is the sampling PDF; x_s : X ∼ h(x), s ∈ {1, 2, ···, S}.

Monte Carlo methods are now applied to the Bayesian estimation context presented in §6.3, where we are interested in estimating continuous state variables. The Monte Carlo integration method is first employed for approximating the evidence
$$
f(\mathcal{D}) = \int f(\mathcal{D}|x)\cdot f(x)\,\mathrm{d}x = \mathbb{E}[f(\mathcal{D}|X)].
$$
A Monte Carlo approximation of the evidence can be computed using importance sampling, where
$$
\hat{f}(\mathcal{D}) = \frac{1}{\mathtt{S}}\sum_{s=1}^{\mathtt{S}} \frac{f(\mathcal{D}|x_s)\cdot f(x_s)}{h(x_s)}, \qquad x_s : X \sim h(x)\ \text{(sampling PDF)},
$$
where samples x_s : X ∼ h(x) are drawn from any sampling distribution such that h(x) > 0 for any x for which the PDF on the numerator, f(D|x) · f(x), is greater than 0. If we choose the prior PDF as the sampling distribution, h(x) = f(x), the formulation simplifies to
$$
\hat{f}(\mathcal{D}) = \frac{1}{\mathtt{S}}\sum_{s=1}^{\mathtt{S}} f(\mathcal{D}|x_s), \qquad x_s : X \sim h(x) = f(x),
$$
because the prior and the sampling distribution cancel out.

The performance of the Monte Carlo integration depends on how close our sampling distribution h(x) is to the posterior f(x|D). Note that sampling from the prior, h(x) = f(x), is often not a viable option when the number of observations D is large. This is because in such a case, the likelihood is the product of several terms, so the probability content typically ends up being concentrated in a small region that is not well described by the prior, as illustrated in figure 6.19.

Figure 6.19: Example of a difficult situation encountered when trying to sample from the prior.

The posterior mean and variance can be approximated with Monte Carlo integration using
$$
\begin{aligned}
\mathbb{E}[X|\mathcal{D}] &= \int x\cdot f(x|\mathcal{D})\,\mathrm{d}x \\
&\approx \frac{1}{\mathtt{S}}\sum_{s=1}^{\mathtt{S}} x_s\cdot \frac{f(\mathcal{D}|x_s)\cdot f(x_s)}{f(\mathcal{D})}\cdot \frac{1}{h(x_s)}, && x_s : X \sim h(x) \\
&\approx \frac{1}{\mathtt{S}}\sum_{s=1}^{\mathtt{S}} x_s\cdot \frac{f(\mathcal{D}|x_s)}{f(\mathcal{D})}, && x_s : X \sim h(x) = f(x),
\end{aligned}
$$
and
$$
\begin{aligned}
\mathrm{var}[X|\mathcal{D}] &= \int (x - \mathbb{E}[X|\mathcal{D}])^2\cdot f(x|\mathcal{D})\,\mathrm{d}x \\
&\approx \frac{1}{\mathtt{S}-1}\sum_{s=1}^{\mathtt{S}} (x_s - \mathbb{E}[X|\mathcal{D}])^2\cdot \frac{f(\mathcal{D}|x_s)\cdot f(x_s)}{f(\mathcal{D})}\cdot \frac{1}{h(x_s)}, && x_s : X \sim h(x) \\
&\approx \frac{1}{\mathtt{S}-1}\sum_{s=1}^{\mathtt{S}} (x_s - \mathbb{E}[X|\mathcal{D}])^2\cdot \frac{f(\mathcal{D}|x_s)}{f(\mathcal{D})}, && x_s : X \sim h(x) = f(x).
\end{aligned}
$$
For both the posterior mean and variance, when choosing the prior as the sampling distribution, h(x) = f(x), it cancels out from the formulation.
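As a sketch, the snippet below applies these estimators by sampling from the prior, h(x) = f(x), for the two-observation example of §6.3.3 with a uniform prior on (−5, 5); the sample size is an arbitrary choice.

```python
# Minimal sketch: Monte Carlo estimation of f(D), E[X|D], and var[X|D]
# using the prior as the sampling distribution, h(x) = f(x).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
y_obs = np.array([-1.0, 0.0])                    # direct observations, sigma_V = 1
S = 100_000
x_s = rng.uniform(-5, 5, size=S)                 # samples from the prior U(-5, 5)

# Joint likelihood f(D|x_s), computed as exp of the sum of log marginals (eq. 6.5).
log_lik = norm.logpdf(y_obs[:, None], loc=x_s, scale=1.0).sum(axis=0)
lik = np.exp(log_lik)

f_D = lik.mean()                                 # evidence estimate with h(x) = f(x)
w = lik / f_D                                    # importance weights
post_mean = np.mean(x_s * w)                     # ~ -0.5
post_var = np.sum((x_s - post_mean) ** 2 * w) / (S - 1)   # ~ 0.5
print(f_D, post_mean, post_var)
```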
Figure 6.20: Example of Bayesian estimation with censored observations. (Photo: Elevate)
Example: Concentration estimation. Figure 6.20 depicts a laboratory test for measuring a deterministic mercury concentration x̌. Observations follow the observation model
$$
\begin{cases}
y_i = \check{x} + v_i, & \text{if } \check{x} + v_i \geq 10\ \mu\text{g/L} \\
y_i = \text{error}, & \text{if } \check{x} + v_i < 10\ \mu\text{g/L},
\end{cases}
$$
where the measuring device has a minimum threshold and the mutually independent observation errors are described by
$$
v_i : V_i \sim \mathcal{N}(v;\, 0,\, 2^2), \quad V_i \perp\!\!\!\perp V_j,\ \forall i \neq j.
$$
We want to estimate the posterior PDF describing the concentration if the observation obtained is D = {error}, for two hypotheses regarding the prior knowledge: (1) uniform in the interval (0, 50), f(x) = U(x; 0, 50), and (2) a Normal PDF with mean 25 and standard deviation 10, f(x) = N(x; 25, 10²). Figure 6.21 presents the results for the (a) uniform prior and (b) Normal prior. Note that only 100, out of the 10⁵ Monte Carlo samples that were employed to estimate the evidence f(D), are displayed. This example illustrates how information can be extracted from censored observations. Note also how the choice of prior influences the posterior knowledge. The choice of prior is particularly important here because limited information is carried by the censored observations.
Figure 6.21: Comparative example of a uniform and a Normal prior PDF for performing Bayesian estimation with censored observations: (a) uniform prior, U(0, 50), f(D) ≈ 0.20, E[x|D] = 5.2, var[x|D] = 10.4; (b) Normal prior, N(25, 10²), f(D) ≈ 0.07, E[x|D] = 6.2, var[x|D] = 18.0.
We now consider a different context where we employ D = 100 direct observations, D = {25.4, 23.6, 24.1, ···, 24.3}, and the prior is f(x) = U(x; 0, 50). The prior, likelihood, and posterior are presented in figure 6.22. Note that here, out of the 100 Monte Carlo samples displayed, none lie in the high-probability region of the likelihood. As a result, this leads to a poor approximation of the evidence, where the true value is f(D) = 1.19 × 10⁻⁷⁸ and where the approximation is
$$
\hat{f}(\mathcal{D}) = \frac{1}{\mathtt{S}}\sum_{s=1}^{\mathtt{S}} \left(\prod_{j=1}^{100} f(y_j|x_s)\right) = 1.95 \times 10^{-83}.
$$
Although it is possible to solve this problem by increasing the number of samples, this approach is computationally inefficient and even becomes impossible for high-dimensional domains. Chapter 7 presents advanced Monte Carlo methods for solving this limitation.

Figure 6.22: Example where drawing Monte Carlo samples from the prior leads to a poor estimate of the posterior because no sample falls in the high-probability region.
6.5.3 Monte Carlo Sampling: Parameter Estimation
For D observations D = {y₁, y₂, ···, y_D} and samples θ_s ∼ h(θ) = f(θ), s ∈ {1, 2, ···, S}, drawn from the prior PDF of the parameters, we can estimate the evidence
$$
\hat{f}(\mathcal{D}) = \frac{1}{\mathtt{S}}\sum_{s=1}^{\mathtt{S}} \left(\prod_{j=1}^{\mathtt{D}} f(y_j|\theta_s)\right) \qquad (6.8)
$$
as well as the posterior expected value and covariance for the parameters,
$$
\begin{aligned}
\hat{\mathbb{E}}[\theta|\mathcal{D}] &= \frac{1}{\mathtt{S}}\sum_{s=1}^{\mathtt{S}} \theta_s\cdot \left(\frac{\prod_{j=1}^{\mathtt{D}} f(y_j|\theta_s)}{f(\mathcal{D})}\right) \\
\widehat{\mathrm{cov}}(\theta|\mathcal{D}) &= \frac{1}{\mathtt{S}-1}\sum_{s=1}^{\mathtt{S}} (\theta_s - \mathbb{E}[\theta|\mathcal{D}])(\theta_s - \mathbb{E}[\theta|\mathcal{D}])^{\intercal}\cdot \left(\frac{\prod_{j=1}^{\mathtt{D}} f(y_j|\theta_s)}{f(\mathcal{D})}\right).
\end{aligned}
$$
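The sketch below applies these estimators to a generic setup that is assumed for illustration (it is not the concrete example that follows): a hidden variable X ∼ N(x; µ_X, σ²_X) observed through additive noise with known σ_V, so that each marginal likelihood is N(y; µ_X, σ²_X + σ²_V) as in §6.4.2. The likelihoods are rescaled before exponentiation, which leaves the normalized weights unchanged while avoiding underflow.

```python
# Minimal sketch: Monte Carlo parameter estimation (eq. 6.8) with samples drawn from the prior.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
sigma_v = 1.0                                     # assumed known observation-error std
y = rng.normal(42.0, np.sqrt(2.0**2 + sigma_v**2), size=20)   # illustrative synthetic data

S = 200_000
mu_s = rng.uniform(30.0, 55.0, size=S)            # prior samples for mu_X (assumed prior)
sig_s = rng.uniform(0.5, 6.0, size=S)             # prior samples for sigma_X (assumed prior)

# Joint log-likelihood of all observations for each theta_s = (mu_s, sig_s).
scale = np.sqrt(sig_s**2 + sigma_v**2)
log_lik = norm.logpdf(y[:, None], loc=mu_s, scale=scale).sum(axis=0)

lik = np.exp(log_lik - log_lik.max())             # rescaled likelihood (constant cancels below)
w = lik / lik.sum()                               # weights proportional to prod_j f(y_j | theta_s)
post_mean = np.array([np.sum(w * mu_s), np.sum(w * sig_s)])   # approximates E[theta | D]
print(post_mean)                                  # should lie near the generating values (42, 2)
```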
Figure 6.23: Example of samples taken from the same concrete batch.

Example: Concrete strength. We consider the case where we want to characterize the resistance R of a given concrete batch, where the intrinsic variability is estimated by testing the resistance of several samples, as depicted in figure 6.23. We assume that the true yet unknown parameter values are µ̌_R = 42 MPa and σ̌_R = 2 MPa and that the resistance R is described by a log-normal random variable
$$
R \sim \ln\mathcal{N}(r;\, \check{\mu}_{\ln R},\, \check{\sigma}_{\ln R})\ \text{MPa}.
$$
The unknown parameters we want to learn about are θ = [µ_R σ_R]ᵀ. Here, R is a hidden random variable because it is not directly observed; only imprecise observations yᵢ are available. The set of observations D = {y₁, y₂, ···, y_D} is employed to estimate the posterior PDF for the parameters θ that control the PDF of the hidden random variable R,
$$
\underbrace{f(\mu_R, \sigma_R|\mathcal{D})}_{\text{posterior PDF}} = \frac{\overbrace{f(\mathcal{D}|\mu_R, \sigma_R)}^{\text{likelihood}}\cdot \overbrace{f(\mu_R, \sigma_R)}^{\text{prior PDF}}}{\underbrace{f(\mathcal{D})}_{\text{normalization constant}}}.
$$
The joint prior knowledge for the parameters, f(µ_R, σ_R) = f(µ_R) · f(σ_R) (µ_R ⊥⊥ σ_R), is obtained from the product of the marginal PDFs,
$$
f(\mu_R) = \underbrace{\ln\mathcal{N}(\mu_R;\, \mu_{\ln \mu_R},\, \sigma_{\ln \mu_R})}_{\mu_{\mu_R} = 45\ \text{MPa},\ \sigma_{\mu_R} = 7\ \text{MPa}}, \qquad f(\sigma_R) = \underbrace{\ln\mathcal{N}(\sigma_R;\, \mu_{\ln \sigma_R},\, \sigma_{\ln \sigma_R})}_{\mu_{\sigma_R} = 5\ \text{MPa},\ \sigma_{\sigma_R} = 2\ \text{MPa}}.
$$
The prior PDFs for µ_R and σ_R are themselves described by log-normal PDFs with respective means {µ_{µ_R}, µ_{σ_R}} and variances {σ²_{µ_R}, σ²_{σ_R}}.
Figure 6.24: Examples of Bayesian estimation as a function of the number of observations employed, D = {1, 5, 100}: each case shows the prior f(µ_R, σ_R), the likelihood f(D|µ_R, σ_R), the posterior f(µ_R, σ_R|D) with the true values {µ̌_R, σ̌_R} marked, and the posterior predictive f(r|D) compared with the true PDF f(r; µ̌_R, σ̌_R).
The observation model is ln yᵢ = ln rᵢ + vᵢ, where the observation errors are vᵢ : V ∼ N(v; 0, 0.01²) MPa. With these assumptions regarding the observation model, the marginal likelihood for each observation is
$$
f(y|\theta_{\ln}) = f(y|\mu_{\ln R}, \sigma_{\ln R}) = \ln\mathcal{N}\!\Big(y;\, \mu_{\ln R},\, \big(\sigma_{\ln R}^2 + \underbrace{\sigma_V^2}_{=0.01^2}\big)^{1/2}\Big),
$$
and the joint likelihood for the entire set of observations is
$$
f(\mathcal{D}\,|\underbrace{\mu_{\ln R}, \sigma_{\ln R}}_{\theta_{\ln} = \text{fct}(\theta)}) = \underbrace{\prod_{j=1}^{\mathtt{D}} \frac{1}{y_j\sqrt{2\pi}\sqrt{\sigma_{\ln R}^2 + 0.01^2}}\exp\!\left(-\frac{1}{2}\left(\frac{\ln y_j - \mu_{\ln R}}{\sqrt{\sigma_{\ln R}^2 + 0.01^2}}\right)^{2}\right)}_{\text{log-normal PDF}}.
$$
Note that the marginal and joint likelihoods are defined as a function of the log-space parameters. In order to evaluate the likelihood for the parameters {µ_R, σ_R}, we employ the transformation functions given in equation 4.4. Note that the likelihood f(y|θ) describes a probability density only for y and not for θ. Therefore, this re-parametrization θ_ln = fct(θ) of the likelihood does not require using the change of variable rule we saw in §3.4.

Figure 6.24 presents the prior PDF for the parameters, the likelihood, the posterior, and the posterior predictive for (a) D = 1, (b) D = 5, and (c) D = 100 observations. We see that, as the number of observations D increases, the likelihood and posterior for θ become concentrated around the true values, and the posterior predictive for R tends toward the true PDF with parameters µ_R → µ̌_R = 42 MPa and σ_R → σ̌_R = 2 MPa.
6.6 Conjugate Priors
The previous section went through the process of using numerical approximation methods for estimating the posterior PDF for the parameters of a hidden random variable X. This complex process is necessary when no analytical formulation exists to describe the posterior.

Figure 6.25: Thumbtacks described by a random variable with sample space S = {Head (H), Tail (T)}. (Photo: Michel Goulet)
One classic example of a conjugate prior is the Beta-Binomial. Take, for example, the probability that a thumbtack (see figure 6.25) lands on either Head or Tail, S = {Head (H), Tail (T)}. We can employ the Beta-Binomial conjugate prior to describe the posterior PDF for x = Pr(Head|D) ∈ (0, 1), given a set of observations D. It is suited in this case to assume that the prior PDF for x = Pr(Head) follows a Beta PDF,
$$
f(x) = \mathcal{B}(x;\, \alpha, \beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)},
$$
where B(α, β) is the normalization constant. In this case, observations are binary, so for a set containing D = D_H + D_T observations, heads are assigned the outcome H, D_H = {H, H, ···, H}, and tails are assigned the outcome T, D_T = {T, T, ···, T}. In that case, the likelihood describing the probability of observing D_H heads and D_T tails follows a Binomial distribution,
$$
\begin{aligned}
\Pr(\text{Head}|x) &= x \\
\Pr(\text{Tail}|x) &= 1 - x \\
\Pr(\mathcal{D}_{\text{H}}|x) &= x^{\mathtt{D}_{\text{H}}} \\
\Pr(\mathcal{D}_{\text{T}}|x) &= (1-x)^{\mathtt{D}_{\text{T}}} \\
\Pr(\mathcal{D}|x) &= x^{\mathtt{D}_{\text{H}}}(1-x)^{\mathtt{D}_{\text{T}}} \\
p(\mathcal{D}|x) &\propto x^{\mathtt{D}_{\text{H}}}(1-x)^{\mathtt{D}_{\text{T}}}.
\end{aligned}
$$
The multiplication of the prior and the likelihood remains analytically tractable so that
$$
f(x|\mathcal{D}) \propto x^{\mathtt{D}_{\text{H}}}(1-x)^{\mathtt{D}_{\text{T}}}\cdot \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)},
$$
and by recalculating the normalization constant, we find that the posterior also follows a Beta PDF with parameters α + D_H and β + D_T,
$$
f(x|\mathcal{D}) = \frac{x^{\alpha+\mathtt{D}_{\text{H}}-1}(1-x)^{\beta+\mathtt{D}_{\text{T}}-1}}{B(\alpha+\mathtt{D}_{\text{H}},\, \beta+\mathtt{D}_{\text{T}})} = \mathcal{B}(x;\, \alpha+\mathtt{D}_{\text{H}},\, \beta+\mathtt{D}_{\text{T}}).
$$
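In code, this conjugate update amounts to adding the head and tail counts to the Beta parameters; the sketch below uses arbitrary illustrative counts.

```python
# Minimal sketch: Beta-Binomial conjugate update for x = Pr(Head).
from scipy.stats import beta

alpha0, beta0 = 1.0, 1.0        # prior B(x; 1, 1), i.e., uniform on (0, 1)
n_heads, n_tails = 7, 3         # illustrative observation counts D_H and D_T

posterior = beta(alpha0 + n_heads, beta0 + n_tails)   # B(x; alpha + D_H, beta + D_T)
print(posterior.mean(), posterior.interval(0.95))     # posterior mean and 95% interval
```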
The analytic formulations for conjugate priors with their parameters, likelihood, posterior, and posterior predictive can be found online⁵ or in specialized publications such as the compendium by Murphy.⁶

⁵ Wikipedia (2017). Conjugate prior. URL https://en.wikipedia.org/wiki/Conjugate_prior. Accessed November 8, 2019.
⁶ Murphy, K. P. (2007). Conjugate Bayesian analysis of the Gaussian distribution. Citeseer.
Figure 6.26: Example of application of conjugate priors on traffic data. (Photo: John Matychuk)

Example: Traffic safety. We take as another example an intersection where the number of users is u = 3 × 10⁵/yr and where 20 accidents occurred over the last 5 years. Note that, for the purpose of simplicity, the number of passages without accident is taken as the total number of users rather than this number minus the number of accidents. This current information can be summarized as
$$
\mathcal{D}_{\text{old}} : \begin{cases}
\mathtt{D}_{\text{N}} = 5\cdot(3\times 10^{5}) & \text{(passages without accident, N)} \\
\mathtt{D}_{\text{A}} = 20 & \text{(accidents, A)}.
\end{cases}
$$
Our prior knowledge for x, the probability of an accident per user, is described by the Beta PDF f(x) = B(x; 1, 1), where the parameters are α₀ = β₀ = 1, which corresponds to a uniform distribution over the interval (0, 1). The posterior PDF describing the probability of an accident given the observations D_old is thus also described by the Beta PDF,
$$
f(x|\mathcal{D}_{\text{old}}) = \mathcal{B}(x;\, \alpha_0 + \mathtt{D}_{\text{A}},\, \beta_0 + \mathtt{D}_{\text{N}}).
$$

Figure 6.27: Posterior PDF for the probability of an accident before rehabilitation works.
Figure 6.27 presents the posterior PDF transformed into the log₁₀-space using the change of variable rule (see §3.4),
$$
f(\log_{10}(x)|\mathcal{D}_{\text{old}}) = f(x|\mathcal{D}_{\text{old}})\cdot \left(\frac{\mathrm{d}\log_{10}(x)}{\mathrm{d}x}\right)^{-1} = f(x|\mathcal{D}_{\text{old}})\cdot x\ln 10.
$$
Now, we consider that, following rehabilitation works made on the intersection in order to decrease the probability of accidents, we observe a = 1 accident over a period of one year, that is, 3 × 10⁵ users without accident. In order to estimate the new probability of accident, we assume that our previous posterior knowledge f(x|D_old) is an upper bound for X, because the rehabilitation works made the intersection safer. The new posterior then becomes
$$
f(x|\mathcal{D}) \propto \underbrace{x^{1}\cdot(1-x)^{3\times 10^{5}}}_{\text{likelihood}}\cdot \underbrace{\big(1 - F(x|\mathcal{D}_{\text{old}})\big)}_{\text{prior}}.
$$
Note that because the prior is now described by the complement of the CDF of X|D_old, which is not a conjugate prior, the posterior does not have an analytic solution anymore, and the evidence f(D_new) must be approximated using numerical integration. Figure 6.28 presents the prior, 1 − F(x|D_old), and the posterior PDF, f(x|D), for the probability of an accident after the intervention. The posterior predictive probability of an accident was 1.3 × 10⁻⁵ before the rehabilitation works, compared to 5.8 × 10⁻⁶ after.
Figure 6.28: Prior, 1 − F(x|D_old), and posterior PDF, f(x|D), for the probability of an accident after rehabilitation works.
6.7 Approximating the Posterior
For complex machine learning models such as those that will be covered in the next chapters, conjugate priors are often not applicable, and it can be computationally prohibitive to rely on sampling methods for performing Bayesian estimation. In those cases, we may have to employ point estimates rather than a probability density function for describing the posterior PDF of model parameters.
6.7.1 Maximum Likelihood and Posterior Estimates
Instead of estimating the posterior using a Bayesian approach, a common approximation is to find a single set of optimal parameter values θ* that maximizes the likelihood (MLE) or the product of the likelihood and the prior (MAP), where
$$
\theta^* = \begin{cases}
\arg\max_{\theta}\ f(\mathcal{D}|\theta) & \text{maximum likelihood estimate (MLE)} \\
\arg\max_{\theta}\ f(\mathcal{D}|\theta)\cdot f(\theta) & \text{maximum a posteriori (MAP)}.
\end{cases}
$$
Optimization methods covered in chapter 5 can be employed to identify θ*. Note that it is common to perform the optimization using a log-transformed objective function, for example, ln(f(D|θ)), for numerical stability reasons (see equation 6.5). Note that it is equivalent to optimize either the original or the log-transformed objective function,
$$
\arg\max_{\theta}\ f(\mathcal{D}|\theta) \equiv \arg\max_{\theta}\ \ln f(\mathcal{D}|\theta).
$$
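As a sketch of how such a point estimate is obtained in practice, the snippet below computes the MLE of θ = {µ_X, σ_X} for the additive Gaussian observation model of §6.4.2 by minimizing the negative log-likelihood; the synthetic data, the known σ_V, and the choice of optimizing ln σ_X (to keep the search unconstrained, as discussed below) are illustrative assumptions.

```python
# Minimal sketch: MLE of theta = (mu_X, sigma_X) by minimizing the negative log-likelihood.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)
sigma_v = 1.0                                             # assumed known observation noise
y = rng.normal(42.0, np.sqrt(2.0**2 + sigma_v**2), 50)    # illustrative synthetic observations

def neg_log_lik(theta):
    mu_x, log_sigma_x = theta                             # optimize ln(sigma_X) so sigma_X > 0
    sigma_x = np.exp(log_sigma_x)
    scale = np.sqrt(sigma_x**2 + sigma_v**2)              # marginal std of Y = X + V
    return -np.sum(norm.logpdf(y, loc=mu_x, scale=scale))

res = minimize(neg_log_lik, x0=np.array([40.0, 0.0]))     # unconstrained optimization (chapter 5)
mu_mle, sigma_mle = res.x[0], np.exp(res.x[1])
print(mu_mle, sigma_mle)
```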
Estimating only the most likely value θ* with respect to either the MAP or MLE criterion can lead to a negligible approximation error when applied to identifiable problems (see §6.3.4) where the amount of independent data is large; in that case, the posterior is concentrated around a single value, f(θ|D) → δ(θ − θ*). When the set of observations is small, in the case of a multimodal posterior, or for non-identifiable problems, estimating only θ* is no longer a good approximation of the posterior. The definition of what is a small or a large data set is problem dependent. In simple cases, data sets containing fewer than 10 observations will typically lead to significant epistemic uncertainty associated with unknown parameter values. Cases involving complex models, such as those treated in chapter 12, may need tens of thousands (D ∼ 10⁵) of observations to reach the point where the epistemic uncertainty associated with unknown parameter values is negligible.
We saw in §5.5 that it is common to resort to parameter-space transformations in order to perform optimization in an unconstrained domain. For an MLE, the quantity maximized is the likelihood of the observations y, given the parameters θ. Because f(y|θ) describes a probability density only for y and not for θ, a parameter-space transformation for an MLE does not require using the change of variable rule we saw in §3.4. On the other hand, the MAP seeks to maximize the product of the likelihood and the prior, f(y|θ)·f(θ). Here, the prior is a probability density, so the change of variable rule applies when we employ space transformations. Nevertheless, as we saw in figure 3.19, when a PDF is subject to a nonlinear transformation, the modes in the original and transformed spaces do not coincide. Therefore, the MLE is invariant to parameter-space transformations and the MAP is not.⁷ Note that Bayesian estimation is invariant to parameter-space transformations, given that we apply the change of variable rule. We will apply parameter-space transformations in the context of Bayesian estimation in chapter 7.

⁷ Jermyn, I. H. (2005). Invariant Bayesian estimation on manifolds. The Annals of Statistics 33(2), 583–605.
6.7.2 Laplace Approximation
Another method for approximating a posterior PDF for the unknown parameters of a hidden random variable is the Laplace approximation. This method approximates the posterior as a multivariate Normal with a posterior mean vector equal to either the MLE or MAP estimate θ*, and a posterior covariance Σ,
$$
\underbrace{f(\theta|\mathcal{D})}_{\text{posterior}} \approx \mathcal{N}(\theta;\, \overbrace{\theta^*}^{\text{MLE/MAP estimate}},\, \underbrace{\Sigma}_{\text{Laplace approximation}}).
$$
When we choose the MLE approximation to describe the posterior mode, the posterior covariance is approximated by the inverse Hessian matrix of the negative log-likelihood evaluated at θ*,
$$
\Sigma = \mathcal{H}\big[-\underbrace{\ln f(\mathcal{D}|\theta^*)}_{\text{log-likelihood}}\big]^{-1} = \left[-\frac{\partial^2 \ln f(\mathcal{D}|\theta)}{\partial\theta\,\partial\theta^{\intercal}}\bigg|_{\theta^*}\right]^{-1}.
$$
Details regarding the numerical estimation of the Hessian are presented in §5.4. In order to include prior knowledge, we have to estimate the Laplace approximation using ln(f(D|θ)·f(θ)) instead of the log-likelihood ln f(D|θ) alone.
The relation between the second derivative of the log-likelihood evaluated at θ* and the covariance Σ can be explained by looking at the multivariate Normal with probability density function described by
$$
f(\theta) = \frac{1}{(2\pi)^{\mathtt{P}/2}(\det\Sigma)^{1/2}}\exp\!\left(-\frac{1}{2}(\theta - \mu_{\theta})^{\intercal}\Sigma^{-1}(\theta - \mu_{\theta})\right),
$$
where P is the number of parameters θ = [θ₁ θ₂ ··· θ_P]ᵀ. By taking the natural logarithm of f(θ), we obtain
$$
\ln f(\theta) = -\frac{\mathtt{P}}{2}\ln 2\pi - \frac{1}{2}\ln\det\Sigma - \frac{1}{2}(\theta - \mu_{\theta})^{\intercal}\Sigma^{-1}(\theta - \mu_{\theta}),
$$
where the derivative of ln f(θ) with respect to θ is
$$
\frac{\partial \ln f(\theta)}{\partial\theta} = -\frac{1}{2}\,\frac{\partial\big((\theta - \mu_{\theta})^{\intercal}\Sigma^{-1}(\theta - \mu_{\theta})\big)}{\partial\theta} = -\frac{1}{2}\cdot 2\,\Sigma^{-1}(\theta - \mu_{\theta}) = -\Sigma^{-1}(\theta - \mu_{\theta}).
$$
If we differentiate this last expression a second time with respect to θ, we obtain
$$
\frac{\partial^2 \ln f(\theta)}{\partial\theta\,\partial\theta^{\intercal}} = -\Sigma^{-1}.
$$
This development shows how the Laplace approximation allows estimating the covariance matrix using the posterior PDF's curvatures. In the case of a Gaussian log-posterior, the second-order derivatives are constant, so the curvatures can be evaluated anywhere in the domain. In practice, we employ the Laplace approximation when the posterior is not Gaussian; in order to minimize the approximation error, we evaluate the second-order derivatives of the log-posterior at the MLE or MAP θ*, where the probability density is the highest.
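A minimal sketch of the Laplace approximation is shown below: the MLE of a simple Normal model (with hypothetical synthetic data) is available in closed form, the Hessian of the negative log-likelihood is approximated by central finite differences, and its inverse gives the covariance.

```python
# Minimal sketch: Laplace approximation of the posterior covariance at the MLE.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
y = rng.normal(42.0, 2.2, 50)                       # hypothetical synthetic observations

def neg_log_lik(theta):                             # theta = (mu_X, sigma_X)
    mu, sigma = theta
    return -np.sum(norm.logpdf(y, loc=mu, scale=sigma))

theta_star = np.array([y.mean(), y.std()])          # closed-form MLE for this Normal model

def numerical_hessian(fun, x, eps=1e-4):
    """Central finite-difference approximation of the Hessian of fun at x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.eye(n)[i] * eps
            e_j = np.eye(n)[j] * eps
            H[i, j] = (fun(x + e_i + e_j) - fun(x + e_i - e_j)
                       - fun(x - e_i + e_j) + fun(x - e_i - e_j)) / (4 * eps**2)
    return H

H = numerical_hessian(neg_log_lik, theta_star)      # Hessian of -ln f(D|theta) at theta*
Sigma = np.linalg.inv(H)                            # Laplace approximation of the covariance
print(theta_star, Sigma)
```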
Figure 6.29: Example of an application of the Laplace approximation to describe a posterior PDF (true posterior compared with the Laplace approximation).

Figure 6.29 presents an example of an application of the Laplace approximation, where a generic posterior PDF for two parameters is approximated by a bivariate Normal PDF. The Hessian matrix evaluated at the mode θ* = [1.47  0.72]ᵀ is
$$
\mathcal{H}[\ln f(\mathcal{D}|\theta^*)] = \begin{bmatrix} -0.95 & 0.02 \\ 0.02 & -0.95 \end{bmatrix},
$$
so the Laplace approximation of the posterior covariance is described by
$$
\Sigma = -\mathcal{H}[\ln f(\mathcal{D}|\theta^*)]^{-1} = \begin{bmatrix} 1.05 & 0.02 \\ 0.02 & 1.05 \end{bmatrix}.
$$
The closer the posterior is to a multivariate Normal, the better the Laplace approximation is. Note that for problems involving several parameters, computing the Hessian matrix using numerical derivatives becomes computationally prohibitive. One possible approximation is to estimate only the main diagonal of the Hessian. The main diagonal of the Hessian contains information about the variances of the posterior with respect to each variable, without providing information about the covariances (i.e., the pairwise linear dependencies between variables).

In addition to being employed to approximate the posterior PDF, the Laplace approximation can be used to estimate the evidence f(D). With the Laplace approximation, the posterior is
$$
f(\theta|\mathcal{D}) \approx \mathcal{N}(\theta;\, \theta^*,\, \Sigma) = \underbrace{\frac{1}{(2\pi)^{\mathtt{P}/2}(\det\Sigma)^{1/2}}}_{\text{(a)}}\,\underbrace{\exp\!\left(-\frac{1}{2}(\theta - \theta^*)^{\intercal}\Sigma^{-1}(\theta - \theta^*)\right)}_{\text{(b)}}. \qquad (6.9)
$$
In equation 6.9, (a) corresponds to the normalization constant for the exponential term indicated in (b), whose value is exp(0) = 1 when evaluated at the mode θ* (see §4.1). When estimating f̃(θ*) using either an MAP, f̃(θ*) = f(D|θ*)·f(θ*), or an MLE, f̃(θ*) = f(D|θ*), its value is not going to equal 1. Therefore, the normalization constant in (a) must be multiplied by the value of the unnormalized posterior evaluated at the mode θ*, so that
$$
\hat{f}(\mathcal{D}) = \tilde{f}(\theta^*)\cdot(2\pi)^{\mathtt{P}/2}(\det\Sigma)^{1/2}. \qquad (6.10)
$$
The reader interested in the formal derivation of this result from a Taylor-series expansion should consult MacKay.⁸

⁸ MacKay, D. J. C. (2003). Information theory, inference, and learning algorithms. Cambridge University Press.
6.8 Model Selection
In the concrete strength estimation example presented in §6.5.3, it is known that the data was generated from a log-normal distribution. In real-life cases, the PDF form for the hidden random variable X may either remain unknown or not correspond to an analytically tractable probability density function. In these cases, it is required to compare the performance of several model classes at explaining the observations D.
Given a set of M model classes m ∈ M = {1, 2, ···, M}, the posterior probability of each class given D is described using Bayes rule,
$$
\underbrace{p(m|\mathcal{D})}_{\text{model class posterior}} = \frac{\overbrace{f(\mathcal{D}|m)}^{\text{model likelihood}}\cdot \overbrace{p(m)}^{\text{model class prior}}}{\underbrace{f(\mathcal{D}^{\{1:\mathtt{M}\}})}_{\text{evidence}}}, \qquad (6.11)
$$
where p(m|D) and p(m) are respectively the posterior and prior probability mass functions for the model classes. The likelihood of a model class m is obtained by marginalizing the model parameters from the joint distribution of observations and parameters,
$$
f(\mathcal{D}|m) = \int f(\mathcal{D}|\theta^{\{m\}})\cdot f(\theta^{\{m\}})\,\mathrm{d}\theta, \qquad (6.12)
$$
where θ^{m} is the set of parameters associated with the model class m. The likelihood of a model class in equation 6.12 corresponds to its evidence, which also corresponds to the normalization constant from the posterior PDF for its parameters,
$$
f(\mathcal{D}|m) \equiv \underbrace{f(\mathcal{D}^{\{m\}})}_{\text{normalization constant for model class } m}. \qquad (6.13)
$$
The normalization constant for the model comparison problem presented in equation 6.11 is obtained by summing the likelihood of each model class times its prior probability,
$$
f(\mathcal{D}^{\{1:\mathtt{M}\}}) = \sum_{m\in\mathcal{M}} f(\mathcal{D}|m)\cdot p(m).
$$
Note that using the importance-sampling procedure presented in equation 6.8 for estimating the normalization constants in equation 6.13 is only a viable option for simple problems involving a small number of parameters and a small number of observations. Section 7.5 further discusses the applicability of more advanced sampling methods for estimating f(D).

Note: With Bayesian model selection, the objective is not to identify the single best model; it allows quantifying the posterior probability of each model, which can be employed for performing subsequent predictions.
Here, we revisit the concrete strength example presented in §6.5.3. We compare the posterior probabilities for two model classes m ∈ M = {L, N} (i.e., a log-normal and a Normal PDF); note that we know the correct hypothesis is m = L because the simulated data were generated using a log-normal PDF.

For D = 5 observations, the evidence of each model is
$$
f(\mathcal{D}|\text{L}) = 8.79\times 10^{-7}, \qquad f(\mathcal{D}|\text{N}) = 6.44\times 10^{-7}.
$$
With the hypothesis that the prior probability of each model is equal, p(L) = p(N) = 0.5, the posterior probabilities for the model classes m are
$$
\begin{aligned}
p(\text{L}|\mathcal{D}) &= \frac{8.79\times 10^{-7}\cdot 0.5}{8.79\times 10^{-7}\cdot 0.5 + 6.44\times 10^{-7}\cdot 0.5} = 0.58 \\
p(\text{N}|\mathcal{D}) &= 1 - p(\text{L}|\mathcal{D}) = \frac{6.44\times 10^{-7}\cdot 0.5}{8.79\times 10^{-7}\cdot 0.5 + 6.44\times 10^{-7}\cdot 0.5} = 0.42.
\end{aligned}
$$
This result indicates that both models predict the observations D with almost equal performance. Therefore, if we want to make predictions for unobserved quantities, it is possible to employ both models, where the probability of each of them is quantified by its posterior probability p(m|D).
In practice, because estimating the evidence f(D|m) is a computationally demanding task, it is common to resort to approximations such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) for comparing model classes. Note that the BIC is a special case derived from the Laplace approximation in equation 6.10 for estimating f(D). Other advanced approximate methods for comparing model classes are the deviance information criterion⁹ (DIC) and the Watanabe-Akaike information criterion¹⁰ (WAIC). The reader interested in these methods should refer to specialized textbooks.¹¹ Another common model selection technique widely employed in machine learning is cross-validation; the main idea consists in employing a first subset of data to estimate the model parameters and then testing the predictive performance of the model using a second independent data set. The details regarding cross-validation are presented in §8.1.2, where the method is employed to discriminate among model structures in the context of regression problems.

⁹ Spiegelhalter, D. J., N. G. Best, B. P. Carlin, and A. van der Linde (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64(4), 583–639.
¹⁰ Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research 11, 3571–3594.
¹¹ Burnham, K. P. and D. R. Anderson (2002). Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). Springer; and Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (2014). Bayesian data analysis (3rd ed.). CRC Press.