7 Markov Chain Monte Carlo
The importance sampling method was introduced in chapter 6 for estimating the normalization constant of a posterior probability density function (PDF) as well as its expected values and covariance. A limitation of this approach is related to the difficulties associated with the choice of an efficient sampling distribution $h(\cdot)$. The sampling methods covered in this chapter address this limitation by providing ways of sampling directly from the posterior using Markov chain Monte Carlo (MCMC) methods.
Figure 7.1: Example of random walk generated with MCMC.
Markov chain Monte Carlo takes its name from the Markov property, which states:

Given the present, the future is independent of the past.

It means that if we know the current state of the system, we can predict future states without having to consider past states. Starting from $X_t \sim p(x_t)$, we can transition to $x_{t+1}$ using a transition probability $p(x_{t+1}|x_t)$ that only depends on $x_t$. It implicitly conveys that the conditional probability depending on $x_t$ is equivalent to the one depending on $x_{1:t}$,
$$p(x_{t+1}|x_t) = p(x_{t+1}|x_1, x_2, \cdots, x_t).$$
A Markov chain defines the joint distribution for random variables by combining the chain rule (see §3.3.4) and the Markov property so that
$$p(x_{1:T}) = p(x_1)\,p(x_2|x_1)\,p(x_3|x_2)\cdots p(x_T|x_{T-1}) = p(x_1)\prod_{t=2}^{T} p(x_t|x_{t-1}).$$
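The chain-rule factorization above can be checked numerically. Below is a minimal sketch using a hypothetical two-state chain; the initial distribution and transition matrix are illustrative assumptions, not taken from the text:

```python
import numpy as np

# Hypothetical two-state chain (states 0 and 1), used only for illustration.
p1 = np.array([0.6, 0.4])          # p(x_1)
P = np.array([[0.7, 0.3],          # p(x_t | x_{t-1} = 0)
              [0.2, 0.8]])         # p(x_t | x_{t-1} = 1)

def joint_prob(x):
    """p(x_{1:T}) = p(x_1) * prod_{t=2}^T p(x_t | x_{t-1})."""
    prob = p1[x[0]]
    for t in range(1, len(x)):
        prob *= P[x[t - 1], x[t]]
    return prob

print(joint_prob([0, 0, 1]))  # 0.6 * 0.7 * 0.3 = 0.126
```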
The idea behind MCMC is to construct a Markov chain for which the stationary distribution is the posterior. Conceptually, it corresponds to randomly walking through the parameter space so that the fraction of steps spent exploring each part of the domain is proportional to the posterior density. Figure 7.1 presents an example of a random walk generated with MCMC that follows the density described by the underlying contour plot.
j.-a. goulet 90
Several MCMC methods exist: Metropolis-Hastings, Gibbs sampling, slice sampling, Hamiltonian Monte Carlo, and more. This chapter covers only the Metropolis and Metropolis-Hastings methods because they are the most accessible. The reader interested in advanced methods should refer to dedicated textbooks such as the one by Brooks et al.¹

¹ Brooks, S., A. Gelman, G. Jones, and X.-L. Meng (2011). Handbook of Markov Chain Monte Carlo. CRC Press.
7.1 Metropolis
The Metropolis algorithm² was developed during the Second World War while working on the Manhattan Project (i.e., the atomic bomb) at Los Alamos, New Mexico. Metropolis is not the most efficient sampling algorithm, yet it is a simple one allowing for an easy introduction to MCMC methods.

² Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics 21(6), 1087–1092.

Notation
Initial state: $\boldsymbol{\theta}_0$
Target distribution: $\tilde{f}(\boldsymbol{\theta})$
Proposal distribution: $q(\boldsymbol{\theta}'|\boldsymbol{\theta})$
The Metropolis algorithm requires defining an initial state for a set of $P$ parameters $\boldsymbol{\theta}_0 = [\theta_1\ \theta_2\ \cdots\ \theta_P]^\intercal_0$, a target distribution $\tilde{f}(\boldsymbol{\theta}) = f(\mathcal{D}|\boldsymbol{\theta}) \cdot f(\boldsymbol{\theta})$ corresponding to the unnormalized posterior we want to sample from, and a proposal distribution, $q(\boldsymbol{\theta}'|\boldsymbol{\theta})$, which describes where to move next given the current parameter values. The proposal must have a nonzero probability to transition from the current state to any state supported by the target distribution and must be symmetric, that is,
$$q(\boldsymbol{\theta}'|\boldsymbol{\theta}) = q(\boldsymbol{\theta}|\boldsymbol{\theta}').$$
The Normal distribution (see §4.1) is a common general-purpose proposal distribution,
$$q(\boldsymbol{\theta}'|\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta}'; \boldsymbol{\theta}, \boldsymbol{\Sigma}_q),$$
where the mean is defined by the current position and the probability to move in a region around the current position is controlled by the covariance matrix $\boldsymbol{\Sigma}_q$. Figure 7.2 presents an example of 1-D and 2-D Normal proposal distributions.
Figure 7.2: Examples of 1-D and 2-D proposal distributions, $q(\boldsymbol{\theta}'|\boldsymbol{\theta})$: (a) $\mathcal{N}(\theta'; \theta, \sigma_q^2)$; (b) $\mathcal{N}(\boldsymbol{\theta}'; [\theta_1\ \theta_2]^\intercal, \mathrm{diag}(\sigma_{q_1}^2\ \sigma_{q_2}^2))$.
The Metropolis algorithm is recursive, so at the $s$th step, given a current position in the parameter space $\boldsymbol{\theta} = \boldsymbol{\theta}_s$, we employ $q(\boldsymbol{\theta}'|\boldsymbol{\theta})$ to propose moving to a new position $\boldsymbol{\theta}'$. If the target distribution evaluated at the proposed location is greater than or equal to the current one, that is, $\tilde{f}(\boldsymbol{\theta}') \geq \tilde{f}(\boldsymbol{\theta})$, we accept the proposed location. If the proposed location has a target value that is lower than the current one, we accept moving to the proposed location with a probability equal to the acceptance ratio $\tilde{f}(\boldsymbol{\theta}')/\tilde{f}(\boldsymbol{\theta})$. In the case where the proposed location is rejected, we stay at the current location and $\boldsymbol{\theta}_{s+1} = \boldsymbol{\theta}_s$. Each step from the Metropolis sampling method is formalized in algorithm 4.

Note: In order to accept a proposed location with a probability equal to $r = \tilde{f}(\boldsymbol{\theta}')/\tilde{f}(\boldsymbol{\theta})$, we compare it with a sample $u$ taken from $\mathcal{U}(0,1)$. If $u \leq r$, we accept the move; otherwise, we reject it.
probabilistic machine learning for civil engineers 91
Algorithm 4: Metropolis sampling
1  define $\tilde{f}(\boldsymbol{\theta})$ (target distribution)
2  define $q(\boldsymbol{\theta}'|\boldsymbol{\theta})$ (proposal distribution)
3  define $S$ (number of samples)
4  initialize $\mathcal{S} = \emptyset$ (set of samples)
5  initialize $\boldsymbol{\theta}_0$ (initial starting location)
6  for $s \in \{0, 1, 2, \cdots, S-1\}$ do
7    define $\boldsymbol{\theta} = \boldsymbol{\theta}_s$
8    sample $\boldsymbol{\theta}' \sim q(\boldsymbol{\theta}'|\boldsymbol{\theta})$
9    compute $\alpha = \tilde{f}(\boldsymbol{\theta}')/\tilde{f}(\boldsymbol{\theta})$ (acceptance ratio)
10   compute $r = \min(1, \alpha)$
11   sample $u \sim \mathcal{U}(0, 1)$
12   if $u \leq r$ then
13     $\boldsymbol{\theta}_{s+1} = \boldsymbol{\theta}'$
14   else
15     $\boldsymbol{\theta}_{s+1} = \boldsymbol{\theta}_s$
16   $\mathcal{S} \leftarrow \{\mathcal{S} \cup \{\boldsymbol{\theta}_{s+1}\}\}$ (add to the set of samples)
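Algorithm 4 can be sketched in a few lines of Python. This is a minimal illustration rather than a production sampler; the unnormalized Gaussian target and the isotropic Normal proposal below are illustrative assumptions:

```python
import numpy as np

def metropolis(f_tilde, theta0, sigma_q, S, rng=None):
    """Minimal Metropolis sampler (algorithm 4) with an isotropic
    Normal proposal of standard deviation sigma_q."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    samples = np.empty((S, theta.size))
    for s in range(S):
        theta_p = theta + sigma_q * rng.standard_normal(theta.size)  # sample from q
        alpha = f_tilde(theta_p) / f_tilde(theta)    # acceptance ratio (line 9)
        if rng.uniform() <= min(1.0, alpha):         # accept with probability r
            theta = theta_p                          # move; otherwise stay put
        samples[s] = theta
    return samples

# Example: sample an unnormalized standard Normal target.
f_tilde = lambda th: np.exp(-0.5 * np.sum(th**2))
samples = metropolis(f_tilde, theta0=[3.0], sigma_q=1.0, S=20000,
                     rng=np.random.default_rng(0))
post = samples[10000:]            # discard the first half as burn-in
print(post.mean(), post.std())    # should be close to 0 and 1
```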
If we apply this recursive algorithm over $S$ iterations, the result is that the fraction of the steps spent exploring each part of the domain is proportional to the density of the target distribution $\tilde{f}(\boldsymbol{\theta})$. Note that this last statement is valid under some conditions regarding the chain starting location and number of samples $S$ that we will further discuss in §7.3. If we select a target distribution that is an unnormalized posterior, it implies that each sample $\boldsymbol{\theta}_s$ is a realization of that posterior,
$$\boldsymbol{\theta}_s : \boldsymbol{\theta} \sim f(\mathcal{D}|\boldsymbol{\theta}) \cdot f(\boldsymbol{\theta}).$$
Figure 7.3: Step-by-step example of 1-D Metropolis sampling. Panels (a)–(c) show steps $s = 1, 2, 3$, each displaying the target $\tilde{f}(\theta)$, the current location $\theta = \theta_s$, the proposed location $\theta'$, and the resulting acceptance ratio $\alpha = \tilde{f}(\theta')/\tilde{f}(\theta)$; panel (d) shows the chain after $s = 2400$ samples.
Figure 7.3 presents a step-by-step application of algorithm 4 for sampling a given target density $\tilde{f}(\theta)$. The proposal distribution employed in this example has a standard deviation $\sigma_q = 1$. At step $s = 1$, the target value at the proposed location $\theta'$ is greater than at the current location $\theta$, so the move is accepted. For the second step, $s = 2$, the target value at the proposed location is smaller than the value at the current location. The move is nevertheless accepted because the random number $u$ drawn from $\mathcal{U}(0,1)$ turned out to be smaller than the ratio $\tilde{f}(\theta')/\tilde{f}(\theta)$. At step $s = 3$, the target value at the proposed location $\theta'$ is greater than the value at the current location $\theta$, so the move is accepted. Figure 7.3d presents the chain containing a total of $S = 2400$ samples. The superposition of the samples' histogram and the target density confirms that the Metropolis algorithm is sampling from $\tilde{f}(\theta)$.
7.2 Metropolis-Hastings

The Metropolis-Hastings algorithm³ is identical to the Metropolis algorithm except that it allows for nonsymmetric transition probabilities where
$$q(\boldsymbol{\theta}'|\boldsymbol{\theta}) \neq q(\boldsymbol{\theta}|\boldsymbol{\theta}').$$

³ Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1), 97–109.
The main change with Metropolis-Hastings is that when the proposed location $\boldsymbol{\theta}'$ has a target value that is lower than the current location $\boldsymbol{\theta}$, we accept the proposed location with a probability equal to the ratio $\tilde{f}(\boldsymbol{\theta}')/\tilde{f}(\boldsymbol{\theta})$ times the ratio $q(\boldsymbol{\theta}|\boldsymbol{\theta}')/q(\boldsymbol{\theta}'|\boldsymbol{\theta})$, that is, the ratio of the probability density of going from the proposed location to the current one, divided by the probability density of going from the current location to the proposed one.
Applying Metropolis-Hastings only requires replacing line 9 in algorithm 4 with the new acceptance ratio,
$$\alpha = \frac{\tilde{f}(\boldsymbol{\theta}')}{\tilde{f}(\boldsymbol{\theta})} \cdot \frac{q(\boldsymbol{\theta}|\boldsymbol{\theta}')}{q(\boldsymbol{\theta}'|\boldsymbol{\theta})}.$$
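As a sketch of this correction, consider a multiplicative log-Normal proposal on a positive parameter (an illustrative choice, not from the text); its density is asymmetric, so the Hastings ratio differs from one:

```python
import numpy as np

def lognormal_pdf(x, mu, sigma):
    """PDF of a log-Normal evaluated at x (mu, sigma are in log space)."""
    return (np.exp(-0.5 * ((np.log(x) - mu) / sigma)**2)
            / (x * sigma * np.sqrt(2.0 * np.pi)))

def mh_acceptance(f_tilde, theta, theta_p, sigma_q):
    """Metropolis-Hastings acceptance ratio for a multiplicative
    log-Normal proposal q(theta'|theta), which is not symmetric."""
    q_forward = lognormal_pdf(theta_p, np.log(theta), sigma_q)    # q(theta'|theta)
    q_backward = lognormal_pdf(theta, np.log(theta_p), sigma_q)   # q(theta|theta')
    return f_tilde(theta_p) / f_tilde(theta) * q_backward / q_forward

# Unnormalized Exponential(1) target on (0, inf); for this proposal the
# Hastings correction q_backward / q_forward reduces to theta' / theta.
f_tilde = lambda th: np.exp(-th)
print(mh_acceptance(f_tilde, theta=1.0, theta_p=2.0, sigma_q=0.5))
```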
In the particular case where transition probabilities are symmetric, that is, $q(\boldsymbol{\theta}'|\boldsymbol{\theta}) = q(\boldsymbol{\theta}|\boldsymbol{\theta}')$, Metropolis-Hastings is equivalent to Metropolis. Note that when we employ a Normal proposal density for parameters defined in the unbounded real domain, $\boldsymbol{\theta} \in \mathbb{R}^P$, there is no need for Metropolis-Hastings. On the other hand, using Metropolis-Hastings for sampling bounded domains requires modifying the proposal density at each step. Figure 7.4 illustrates why: if we employ a truncated Normal PDF as a proposal, $q(\boldsymbol{\theta}'|\boldsymbol{\theta}) \neq q(\boldsymbol{\theta}|\boldsymbol{\theta}')$ because the normalization constant needs to be recalculated at each iteration. Section 7.4 shows how to leverage transformation functions $\boldsymbol{\theta}^{tr} = g(\boldsymbol{\theta})$ in order to transform a bounded domain into an unbounded one, $\boldsymbol{\theta}^{tr} \in \mathbb{R}$, so that the Metropolis method can be employed.
Figure 7.4: Example illustrating how, when using a truncated Normal PDF as a proposal, $q(\boldsymbol{\theta}'|\boldsymbol{\theta}) \neq q(\boldsymbol{\theta}|\boldsymbol{\theta}')$.
7.3 Convergence Checks

So far, the presentation of the Metropolis and Metropolis-Hastings algorithms overlooked the notion of convergence. For that, we need to further address two aspects: the burn-in phase and convergence metrics.

7.3.1 Burn-In Phase

For samples taken with an MCMC method to be considered as realizations from the stationary distribution describing the target PDF, a chain must have forgotten where it started from. Depending on the choice of initial state $\boldsymbol{\theta}_0$ and transition PDF, the sampling procedure may initially stay trapped in a part of the domain so that the samples' density is not representative of the target density. As depicted in figure 7.5, this issue requires discarding samples taken before reaching the stationary distribution. These discarded samples are called the burn-in phase. In practice, it is common to discard the first half of each chain as the burn-in phase and then perform the convergence check that will be further detailed in §7.3.2.
Figure 7.5: The impact of the initial starting location on MCMC sampling.
Figure 7.6 presents an example adapted from Murphy,⁴ where, given a current location $x \in \{0, 1, \cdots, 20\}$, the transition model is defined so there is an equal probability of moving to the nearest neighbor either on the left or on the right. If we apply this random transition an infinite number of times, we will reach the stationary distribution $p^{(\infty)}(x) = 1/21,\ \forall x$. Each graph in figure 7.6 presents the probability $p^{(n)}(x)$ of being in any state $x$ after $n$ transitions from the initial state $x_0 = 17$. Here, we see that even after 100 transitions, the chain has not yet forgotten where it started from because it is still skewed toward $x = 17$. After 400 transitions, the initial state has been forgotten, because in this graph we can no longer infer the chain's initial value.

⁴ Murphy, K. P. (2012). Machine learning: A probabilistic perspective. MIT Press.
Figure 7.6: The impact of the initial starting location using a stochastic process: $p^{(n)}(x)$ for $n = 0, 1, 2, 3, 10, 100, 200, 400$.
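The convergence of $p^{(n)}(x)$ toward the uniform stationary distribution can be reproduced numerically. The reflecting boundary behavior below is an assumption made for the sketch, since the text does not specify what happens at the edge states:

```python
import numpy as np

# 21-state nearest-neighbor random walk: from state x, move left or right
# with equal probability. At the boundaries we let the walker hold its
# position with probability 1/2 (an assumption), which keeps the transition
# matrix doubly stochastic so the stationary distribution is uniform.
n_states = 21
P = np.zeros((n_states, n_states))
for x in range(n_states):
    P[x, max(x - 1, 0)] += 0.5
    P[x, min(x + 1, n_states - 1)] += 0.5

p = np.zeros(n_states)
p[17] = 1.0                 # initial state x0 = 17
for n in range(400):
    p = p @ P               # propagate: p^(n+1) = p^(n) P

# After 400 transitions, p is nearly uniform (1/21 everywhere).
print(np.abs(p - 1.0 / n_states).max())
```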
7.3.2 Monitoring Convergence

Monitoring convergence means assessing whether or not the MCMC samples belong to a stationary distribution. For one-dimensional problems, it is possible to track the convergence by plotting the sample numbers versus their values and identifying visually whether or not they belong to a stationary distribution (e.g., see figure 7.3). When sampling in several dimensions, this visual check is limited. Instead, the solution is to generate samples from multiple chains (e.g., 3–5), each having a different starting location $\boldsymbol{\theta}_0$. The stationarity of these chains can be quantified by comparing the variance within and between chains using the estimated potential scale reduction (EPSR). Figure 7.7 illustrates the notation employed to describe samples from multiple chains, where $\boldsymbol{\theta}_{s,c}$ identifies the $s$th sample out of $S$, from the $c$th chain out of $C$. Note that because the final number of samples desired is $S$, a quantity equal to $2S$ samples must be generated in order to account for those discarded during the burn-in period.
Figure 7.7: The notation for samples taken from multiple chains: for each of chains $1, 2, \cdots, c, \cdots, C$, the burn-in samples are discarded and the "stationary" samples $1, \cdots, S$ are kept.
7.3.3 Estimated Potential Scale Reduction

The estimated potential scale reduction⁵ metric denoted by $\hat{R}$ is computed from two quantities: the within-chains and between-chains variances. The within-chains mean $\bar{\theta}_{\cdot c}$ and variance $W$ are estimated using
$$\bar{\theta}_{\cdot c} = \frac{1}{S}\sum_{s=1}^{S}\theta_{s,c} \quad \text{and} \quad W = \frac{1}{C}\sum_{c=1}^{C}\left(\frac{1}{S-1}\sum_{s=1}^{S}(\theta_{s,c} - \bar{\theta}_{\cdot c})^2\right).$$

⁵ Gelman, A. and D. B. Rubin (1992). Inference from iterative simulation using multiple sequences. Statistical Science 7(4), 457–472.
The property of the within-chains variance is that it underestimates the true variance of samples. The between-chains mean $\bar{\theta}_{\cdot\cdot}$ is estimated using
$$\bar{\theta}_{\cdot\cdot} = \frac{1}{C}\sum_{c=1}^{C}\bar{\theta}_{\cdot c},$$
and the variance between the means of chains is given by
$$B = \frac{S}{C-1}\sum_{c=1}^{C}(\bar{\theta}_{\cdot c} - \bar{\theta}_{\cdot\cdot})^2.$$
Contrarily to the within-chains variance $W$, the between-chains estimate
$$\hat{V} = \frac{S-1}{S}\,W + \frac{1}{S}\,B$$
overestimates the variance of samples. The metric $\hat{R}$ is defined as the square root of the ratio between $\hat{V}$ and $W$,
$$\hat{R} = \sqrt{\frac{\hat{V}}{W}}.$$
Because $\hat{V}$ overestimates and $W$ underestimates the variance, $\hat{R}$ should be greater than one. As illustrated in figure 7.8, a value $\hat{R} \approx 1$ indicates that the upper and lower bounds have converged to the same value. Otherwise, if $\hat{R} > 1$, it is an indication that convergence is not reached and the number of samples needs to be increased. In practice, convergence can be deemed to be met when $\hat{R} < 1 + \epsilon$, for a small threshold $\epsilon$.

Note: When generating MCMC samples for $\boldsymbol{\theta} = [\theta_1\ \theta_2\ \cdots\ \theta_P]^\intercal$, we have to compute the EPSR $\hat{R}_i$ for each dimension $i \in \{1, 2, \cdots, P\}$.

Figure 7.8: The EPSR metric to check for convergence.
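The within- and between-chains formulas above translate directly into code. A minimal sketch follows; the synthetic chains used to exercise it are illustrative, not from the text:

```python
import numpy as np

def epsr(chains):
    """Estimated potential scale reduction R-hat for an array of shape
    (C, S): S post-burn-in samples from each of C chains (one dimension)."""
    C, S = chains.shape
    chain_means = chains.mean(axis=1)            # theta-bar_{.c}
    W = chains.var(axis=1, ddof=1).mean()        # within-chains variance
    B = S * chain_means.var(ddof=1)              # between-chains variance
    V_hat = (S - 1) / S * W + B / S              # overestimates the variance
    return np.sqrt(V_hat / W)

# Chains drawn from the same distribution should give R-hat close to 1;
# chains stuck in different regions should give R-hat well above 1.
rng = np.random.default_rng(1)
same = rng.standard_normal((4, 5000))
far_apart = same + np.array([[0.0], [5.0], [10.0], [15.0]])
print(epsr(same), epsr(far_apart))
```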
Figures 7.9a–c compare the EPSR convergence metric $\hat{R}$ for the Metropolis method applied to a 2-D target distribution using a different number of samples. Figures 7.9a–c are using an isotropic bivariate Normal proposal PDF with $\sigma_q = 1$. From the three tests, only the one employing $S = 10^4$ samples meets the criterion that we have set here to $\hat{R} < 1.01$.
Figure 7.9: Comparison of the EPSR convergence metric $\hat{R}$ for 100; 1,000; and 10,000 MCMC samples; each panel shows the proposal $q(\boldsymbol{\theta}'|\boldsymbol{\theta})$ and chains $c = 1, 2, 3$ over $(\theta_1, \theta_2)$: (a) $S = 100$, acceptance rate 60%, $\hat{R} = [1.1074\ 1.0915]$; (b) $S = 1000$, acceptance rate 60%, $\hat{R} = [1.0042\ 1.0107]$; (c) $S = 10{,}000$, acceptance rate 60%, $\hat{R} = [1.0005\ 1.0011]$; (d) $S = 1000$, acceptance rate 38%, $\hat{R} = [1.0182\ 1.0194]$.
7.3.4 Acceptance Rate

The acceptance rate $\bar{\alpha}$ is the ratio between the number of steps where the proposed move is accepted and the total number of steps. Figures 7.9a–c have an acceptance rate of 60 percent.

In the case presented in figure 7.9d, the standard deviation of the proposal was doubled to $\sigma_q = 2$. It has the effect of reducing the acceptance rate to 38 percent. The convergence speed of MCMC methods is related to the acceptance rate. For ideal cases involving Normal random variables, the optimal acceptance rate is approximately 23 percent for parameter spaces having five dimensions or more (i.e., $P \geq 5$), and approximately 44 percent for cases in one dimension ($P = 1$).⁶ Figure 7.10 presents the relative convergence speed as a function of the acceptance rate for a case involving five dimensions or more ($P \geq 5$). Note that there is a wide range of values for $\bar{\alpha}$ with similar efficiency, so the optimal values should not be sought strictly.

⁶ Rosenthal, J. S. (2011). Optimal proposal distributions and adaptive MCMC. In Handbook of Markov Chain Monte Carlo, 93–112. CRC Press.

Figure 7.10: Relative convergence speed of MCMC for $P \geq 5$ as a function of the acceptance rate $\bar{\alpha}$. (Adapted from Rosenthal (2011).)
Figure 7.11: Comparison of the EPSR convergence metric $\hat{R}$ for 1,000 MCMC samples using different proposal distributions; each panel shows the proposal $q(\boldsymbol{\theta}'|\boldsymbol{\theta})$ and chains $c = 1, 2, 3$: (a) $S = 1000$, acceptance rate 34%, $\hat{R} = [1.0421\ 1.0389]$; (b) $S = 1000$, acceptance rate 28%, $\hat{R} = [1.2031\ 1.1952]$; (c) $S = 1000$, acceptance rate 37%, $\hat{R} = [1.0064\ 1.0048]$.
Figure 7.11 presents another set of examples applied to a target distribution having a high (negative) correlation between the dimensions $\theta_1$ and $\theta_2$, and using a fixed number of samples equal to $S = 1000$. In comparison with the case presented in (a), the case in (b) displays worse $\hat{R}$ values. This difference is due to the poor choice of correlation coefficient for the proposal PDF in (b); the correlation coefficient of the proposal has a sign that is the opposite of the one for the target PDF. If, as in figure 7.11c, the proposal is well selected for the target distribution, the convergence speed will be higher.

For trivial cases, it may be possible to infer efficient parameters for proposals from trial and error or from heuristics. However, for more complex cases involving a large number of dimensions, manual tuning falls short. The next section presents a method for automatically tuning the proposal covariance matrix.
7.3.5 Proposal Tuning

One generic way to define the proposal PDF is using a multivariate Normal centered on the current state $\boldsymbol{\theta}$ and with covariance matrix $\lambda^2\boldsymbol{\Sigma}$,
$$q(\boldsymbol{\theta}'|\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta}'; \boldsymbol{\theta}, \lambda^2\boldsymbol{\Sigma}), \quad \lambda^2 = \frac{2.4^2}{P},$$
where the scaling factor $\lambda$ depends on the number of parameters $P$. The covariance matrix $\boldsymbol{\Sigma}$ is an approximation of the posterior covariance obtained using the Laplace approximation (see §6.7.2) calculated for the maximum a-posteriori (MAP) value, $\boldsymbol{\theta}^*$. The MAP value can be estimated using gradient-based optimization techniques such as those presented in §5.2.

Algorithm 5 presents a simple implementation of the Metropolis algorithm with convergence checks and with a tuning procedure for the scaling factor $\lambda$. Note that this algorithm is a minimal example intended to help the reader understand the algorithm flow. Implementations with state-of-the-art efficiency include several more steps.
7.4 Space Transformation

When dealing with parameters that are not defined in the unbounded real space, one solution is to employ the Metropolis-Hastings algorithm to account for the non-reversibility of the transitions caused by the domain constraints, as illustrated in figure 7.4.
Algorithm 5: Convergence check and scaling parameter tuning
1  define $\tilde{f}(\boldsymbol{\theta})$, $S$, $C$, $\epsilon$
2  initialize $\boldsymbol{\theta}_{0,c}$, $i = 0$
3  compute MAP: $\boldsymbol{\theta}^*$ (e.g., gradient-based optimization)
4  compute $\boldsymbol{\Sigma} = -\mathbf{H}[\ln\tilde{f}(\boldsymbol{\theta}^*)]^{-1}$ (Laplace approximation)
5  define $q(\boldsymbol{\theta}'|\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta}'; \boldsymbol{\theta}, \lambda^2\boldsymbol{\Sigma})$, $\lambda^2 = \frac{2.4^2}{P}$
6  for chains $c \in \{1, 2, \cdots, C\}$ do
7    $\mathcal{S}_c = \emptyset$
8    for samples $s \in \{0, 1, 2, \cdots, S-1\}$ do
9      Metropolis algorithm: $\mathcal{S}_c \leftarrow \{\mathcal{S}_c \cup \{\boldsymbol{\theta}_{s+1,c}\}\}$
10 if $i = 0$ (pilot run) then
11   for $c \in \{1, 2, \cdots, C\}$ do
12     $\mathcal{S}_c = \{\boldsymbol{\theta}_{S/2,c}, \cdots, \boldsymbol{\theta}_{S,c}\}$ (discard burn-in samples)
13   compute $\bar{\alpha}$ (acceptance rate)
14   if $\bar{\alpha} < 0.15$ then
15     $\lambda^2 = \lambda^2/2$, go to 5 (decrease scaling factor)
16   else if $\bar{\alpha} > 0.50$ then
17     $\lambda^2 = 2\lambda^2$, go to 5 (increase scaling factor)
18 compute $\hat{R}_p$, $\forall p \in \{1, 2, \cdots, P\}$ (EPSR)
19 if $\hat{R}_p > 1 + \epsilon$ for any $p$ then
20   $\boldsymbol{\theta}_{0,c} = \boldsymbol{\theta}_{S,c}$ (restart using the last sample)
21   $S = 2^i S$ (increase the number of samples)
22   $i = i + 1$, go to 6
23 converged
Another solution is to transform each constrained parameter in the real space in order to employ the Metropolis algorithm. Transformation functions $\theta^{tr} = g(\theta_i)$ suited for this purpose are described in §3.4. When working with transformed parameters, the proposal distribution has to be defined in the transformed space,
$$q(\boldsymbol{\theta}'^{tr}|\boldsymbol{\theta}^{tr}) = \mathcal{N}(\boldsymbol{\theta}'^{tr}; \boldsymbol{\theta}^{tr}, \lambda^2\boldsymbol{\Sigma}^{tr}),$$
where $\boldsymbol{\Sigma}^{tr}$ is estimated using the Laplace approximation, which is itself defined in the transformed space. The target probability can be evaluated in the transformed space using the inverse transformation function $g^{-1}(\theta^{tr})$, its gradient $|\nabla g(\theta)|^{-1}$ evaluated at $\theta^{tr}$, and the change of variable rule,
$$\tilde{f}(\boldsymbol{\theta}^{tr}) = \tilde{f}(\overbrace{g^{-1}(\boldsymbol{\theta}^{tr})}^{=\,\boldsymbol{\theta}}) \cdot \prod_{i=1}^{P}\frac{1}{|\nabla_i g(\boldsymbol{\theta})|_{\boldsymbol{\theta}^{tr}}}.$$

Change of variable rule (see §3.4): $f_Y(y) = f_X(x)\,|\det \mathbf{J}_{y,x}|^{-1}$.
Because each transformation function $g(\theta_i)$ depends on a single parameter, the determinant of the diagonal Jacobian matrix required for the transformation $g(\boldsymbol{\theta})$ simplifies to the product of the inverse absolute value of the gradient of each transformation function evaluated at $\theta_i = g^{-1}(\theta_i^{tr})$. Algorithm 6 presents a simple implementation of the Metropolis algorithm applied to a transformed space.

Determinant of a diagonal matrix: $\det(\mathrm{diag}(\mathbf{x})) = \prod_{i=1}^{X} x_i$.
Algorithm 6: Metropolis with transformed space
1  define $\tilde{f}(\boldsymbol{\theta})$ (target distribution)
2  define $g(\boldsymbol{\theta}) \rightarrow \boldsymbol{\theta}^{tr}$, $g^{-1}(\boldsymbol{\theta}^{tr}) \rightarrow \boldsymbol{\theta}$ (transformation functions)
3  define $\nabla g(\boldsymbol{\theta})$ (transformation gradient)
4  define $S$ (number of samples)
5  initialize $\mathcal{S} = \emptyset$ (set of samples)
6  initialize $\boldsymbol{\theta}_0$ (initial starting location)
7  compute MAP: $\boldsymbol{\theta}^{tr*}$ (e.g., gradient-based optimization)
8  compute $\boldsymbol{\Sigma}^{tr} = -\mathbf{H}[\ln\tilde{f}(\boldsymbol{\theta}^{tr*})]^{-1}$ (Laplace approximation)
9  for $s \in \{0, 1, 2, \cdots, S-1\}$ do
10   define $\boldsymbol{\theta}^{tr} = g(\boldsymbol{\theta}_s)$
11   sample $\boldsymbol{\theta}'^{tr} \sim \mathcal{N}(\boldsymbol{\theta}'^{tr}; \boldsymbol{\theta}^{tr}, \lambda^2\boldsymbol{\Sigma}^{tr})$
12   compute $\alpha = \dfrac{\tilde{f}(g^{-1}(\boldsymbol{\theta}'^{tr}))}{\tilde{f}(g^{-1}(\boldsymbol{\theta}^{tr}))} \cdot \prod_{i=1}^{P} \dfrac{|\nabla_i g(\boldsymbol{\theta})|_{\boldsymbol{\theta}^{tr}}}{|\nabla_i g(\boldsymbol{\theta}')|_{\boldsymbol{\theta}'^{tr}}}$
13   compute $r = \min(1, \alpha)$
14   sample $u \sim \mathcal{U}(0, 1)$
15   if $u \leq r$ then
16     $\boldsymbol{\theta}_{s+1} = g^{-1}(\boldsymbol{\theta}'^{tr})$
17   else
18     $\boldsymbol{\theta}_{s+1} = \boldsymbol{\theta}_s$
19   $\mathcal{S} \leftarrow \{\mathcal{S} \cup \{\boldsymbol{\theta}_{s+1}\}\}$
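A minimal Python sketch of algorithm 6 for a single positive parameter, using the logarithmic transformation $g(\theta) = \ln\theta$ so that $|\nabla g(\theta)| = 1/\theta$. The Exponential target is an illustrative assumption, and for simplicity the MAP/Laplace initialization of lines 7–8 is replaced by a fixed proposal standard deviation:

```python
import numpy as np

def metropolis_transformed(f_tilde, theta0, sigma_tr, S, rng=None):
    """Metropolis in the transformed space theta_tr = ln(theta) for a
    target defined on (0, inf); returns samples in the original space."""
    rng = np.random.default_rng() if rng is None else rng
    theta = float(theta0)
    samples = np.empty(S)
    for s in range(S):
        theta_tr = np.log(theta)                                  # line 10
        theta_tr_p = theta_tr + sigma_tr * rng.standard_normal()  # line 11
        theta_p = np.exp(theta_tr_p)                              # g^{-1}
        # Line 12: target ratio times the Jacobian correction
        # |grad g(theta)| / |grad g(theta')| = (1/theta)/(1/theta') = theta'/theta.
        alpha = f_tilde(theta_p) / f_tilde(theta) * theta_p / theta
        if rng.uniform() <= min(1.0, alpha):
            theta = theta_p
        samples[s] = theta
    return samples

# Example: unnormalized Exponential(1) target defined on (0, inf).
f_tilde = lambda th: np.exp(-th)
samples = metropolis_transformed(f_tilde, 1.0, sigma_tr=1.0, S=40000,
                                 rng=np.random.default_rng(0))
print(samples[20000:].mean())  # should be close to the Exponential mean, 1
```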
7.5 Computing with MCMC Samples

When employing an MCMC method, each sample $\boldsymbol{\theta}_s$ is a realization of the target distribution $\tilde{f}(\boldsymbol{\theta})$. When we are interested in performing a Bayesian estimation for parameters using a set of observations $\mathcal{D} = \{y_1, y_2, \cdots, y_D\}$, the unnormalized target distribution is defined as the product of the likelihood and the prior,
$$\tilde{f}(\boldsymbol{\theta}) = f(\mathcal{D}|\boldsymbol{\theta}) \cdot f(\boldsymbol{\theta}).$$
The posterior expected values and covariance for $f(\boldsymbol{\theta}|\mathcal{D})$ are then obtained by computing the empirical average and covariance of the samples $\boldsymbol{\theta}_s$,
$$\mathbb{E}[\boldsymbol{\theta}|\mathcal{D}] = \int \boldsymbol{\theta} \cdot f(\boldsymbol{\theta}|\mathcal{D})\,d\boldsymbol{\theta} \approx \frac{1}{S}\sum_{s=1}^{S}\boldsymbol{\theta}_s$$
$$\mathrm{cov}(\boldsymbol{\theta}|\mathcal{D}) = \mathbb{E}\!\left[(\boldsymbol{\theta} - \mathbb{E}[\boldsymbol{\theta}|\mathcal{D}])(\boldsymbol{\theta} - \mathbb{E}[\boldsymbol{\theta}|\mathcal{D}])^\intercal\right] \approx \frac{1}{S-1}\sum_{s=1}^{S}(\boldsymbol{\theta}_s - \mathbb{E}[\boldsymbol{\theta}|\mathcal{D}])(\boldsymbol{\theta}_s - \mathbb{E}[\boldsymbol{\theta}|\mathcal{D}])^\intercal.$$
Given $f_X(x; \boldsymbol{\theta})$, a probability density function defined by the parameters $\boldsymbol{\theta}$, the posterior predictive density of $X$ given observations $\mathcal{D}$ is obtained by marginalizing the uncertainty of $\boldsymbol{\theta}$,
$$X|\mathcal{D} \sim f(x|\mathcal{D}) = \int f_X(x; \boldsymbol{\theta}) \cdot \overbrace{f(\boldsymbol{\theta}|\mathcal{D})}^{\boldsymbol{\theta}_s}\,d\boldsymbol{\theta}.$$
Predictive samples $x_s$ can be generated by using samples $\boldsymbol{\theta}_s$ and evaluating them in $f_X(x; \boldsymbol{\theta}_s)$, so
$$\underbrace{x_s : X|\boldsymbol{\theta}_s \sim f_X(x; \boldsymbol{\theta}_s)}_{\text{sample from } X \text{ using MCMC samples } \boldsymbol{\theta}_s}.$$
The posterior predictive expected value and variance for $X|\mathcal{D} \sim f(x|\mathcal{D})$ can then be estimated as
$$\mathbb{E}[X|\mathcal{D}] = \int x \cdot f(x|\mathcal{D})\,dx \approx \frac{1}{S}\sum_{s=1}^{S}x_s$$
$$\mathrm{var}[X|\mathcal{D}] = \mathbb{E}[(X - \mathbb{E}[X|\mathcal{D}])^2] \approx \frac{1}{S-1}\sum_{s=1}^{S}(x_s - \mathbb{E}[X|\mathcal{D}])^2.$$
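These estimators can be sketched as follows; the posterior samples below are fabricated stand-ins for MCMC output, used only to illustrate the computation:

```python
import numpy as np

rng = np.random.default_rng(0)
S = 50000
# Stand-in "posterior samples" theta_s = [mu_s, sigma_s] (illustrative only).
mu_s = 42.0 + 0.5 * rng.standard_normal(S)
sigma_s = np.abs(2.0 + 0.2 * rng.standard_normal(S))

# One predictive sample per posterior sample: x_s ~ N(x; mu_s, sigma_s^2).
x_s = rng.normal(mu_s, sigma_s)

E_x = x_s.mean()          # E[X|D] ~ (1/S) sum_s x_s
var_x = x_s.var(ddof=1)   # var[X|D] ~ 1/(S-1) sum_s (x_s - E[X|D])^2
print(E_x, var_x)
```

Note that the predictive variance mixes the aleatory part ($\sigma_s^2$) with the epistemic spread of $\mu_s$ across posterior samples, which is exactly what the marginalization integral expresses.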
Given a function that depends on parameters $\boldsymbol{\theta}$ such that $z = g(\boldsymbol{\theta})$, posterior predictive samples from that function can be obtained by evaluating it for randomly selected samples $\boldsymbol{\theta}_s$,
$$\underbrace{z_s : Z|\boldsymbol{\theta}_s = g(\boldsymbol{\theta}_s)}_{\text{sample from } Z \text{ using MCMC samples } \boldsymbol{\theta}_s}.$$
The posterior predictive expected value and variance for the function output are thus once again
$$\mathbb{E}[Z|\mathcal{D}] = \int z \cdot f(z|\mathcal{D})\,dz \approx \frac{1}{S}\sum_{s=1}^{S}z_s$$
$$\mathrm{var}[Z|\mathcal{D}] = \mathbb{E}[(Z - \mathbb{E}[Z|\mathcal{D}])^2] \approx \frac{1}{S-1}\sum_{s=1}^{S}(z_s - \mathbb{E}[Z|\mathcal{D}])^2.$$
As we saw in §6.8, performing Bayesian model selection requires computing the evidence
$$f(\mathcal{D}) = \int f(\mathcal{D}|\boldsymbol{\theta}) \cdot f(\boldsymbol{\theta})\,d\boldsymbol{\theta}.$$
It is theoretically possible to estimate this normalization constant from Metropolis-Hastings samples using the harmonic mean of the likelihood.⁷ This method is not presented here because it is known for its poor performance in practice.⁸ Annealed importance sampling⁹ (AIS) is a method combining annealing optimization methods, importance sampling, and MCMC sampling. Despite being one of the efficient methods for estimating $f(\mathcal{D})$, we have to keep in mind that estimating the evidence is intrinsically difficult when the number of parameters is large and when the posterior is multimodal or when it displays nonlinear dependencies. Therefore, we should always be careful when estimating $f(\mathcal{D})$ because no perfect black-box solution is currently available for estimating it.

⁷ Newton, M. A. and A. E. Raftery (1994). Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society. Series B (Methodological) 56(1), 3–48.
⁸ Neal, R. (2008). The harmonic mean of the likelihood: Worst Monte Carlo method ever. URL https://radfordneal.wordpress.com/2008/08/17/the-harmonic-mean-of-the-likelihood-worst-monte-carlo-method-ever/. Accessed November 8, 2019.
⁹ Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing 11(2), 125–139.
Example: Concrete tests  We are revisiting the example presented in §6.5.3, where Bayesian estimation is employed to characterize the resistance $R$ of a concrete mix. The resistance is now modeled as a Normal random variable with unknown mean and variance, $R \sim \mathcal{N}(r; \mu_R, \sigma_R^2)$. The observation model is $y = r + v$, $v : V \sim \mathcal{N}(v; 0, 0.01^2)$ MPa, where $V$ describes the observation errors that are independent of each other, that is, $V_i \perp V_j,\ \forall i \neq j$. Our goal is to estimate the posterior PDF for the resistance's mean $\mu_R$ and standard deviation $\sigma_R$,
$$\overbrace{f(\mu_R, \sigma_R|\mathcal{D})}^{\text{Posterior PDF}} = \frac{\overbrace{f(\mathcal{D}|\mu_R, \sigma_R)}^{\text{Likelihood}} \cdot \overbrace{f(\mu_R, \sigma_R)}^{\text{Prior PDF}}}{\underbrace{f(\mathcal{D})}_{\text{Evidence}}}.$$
In order to reflect an absence of prior knowledge, we employ non-informative priors (see §6.4.1) for both parameters, so that $f(\mu_R) \propto 1$, $f(\sigma_R) \propto 1/\sigma_R$, and thus $f(\mu_R, \sigma_R) \propto 1/\sigma_R$. The likelihood of an observation $y$, given a set of parameters $\boldsymbol{\theta}$, is $f(y|\boldsymbol{\theta}) = \mathcal{N}(y; \mu_R, \sigma_R^2 + \sigma_V^2)$ with $\sigma_V^2 = 0.01^2$. Because of the conditional independence assumption for the observations, the joint likelihood for $D$ observations is obtained from the product of the marginals,
$$f(\mathcal{D}|\mu_R, \sigma_R) = \prod_{i=1}^{D}\underbrace{\frac{1}{\sqrt{2\pi}\sqrt{\sigma_R^2 + \sigma_V^2}}\exp\!\left[-\frac{1}{2}\left(\frac{y_i - \mu_R}{\sqrt{\sigma_R^2 + \sigma_V^2}}\right)^2\right]}_{\text{Normal PDF}}.$$
In this example, we assume we only have three observations,
$$\mathcal{D} = \{43.3, 40.4, 44.8\}\ \text{MPa}.$$
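The unnormalized log-posterior for this example can be written compactly. Working in log space is a practical choice made here to avoid numerical underflow, not something required by the text:

```python
import numpy as np

D = np.array([43.3, 40.4, 44.8])   # observations in MPa
sigma_V = 0.01                     # observation-error standard deviation

def log_f_tilde(mu_R, sigma_R):
    """Unnormalized log-posterior ln f(D|mu_R, sigma_R) + ln f(mu_R, sigma_R),
    with the non-informative prior f(mu_R, sigma_R) proportional to 1/sigma_R."""
    if sigma_R <= 0:
        return -np.inf             # sigma_R is constrained to (0, inf)
    var = sigma_R**2 + sigma_V**2
    log_lik = np.sum(-0.5 * np.log(2.0 * np.pi * var)
                     - 0.5 * (D - mu_R)**2 / var)
    log_prior = -np.log(sigma_R)
    return log_lik + log_prior

# The target should be larger near the values reported in the text
# (mu_R around 43, sigma_R around 1.8) than far away from them.
print(log_f_tilde(43.0, 1.8), log_f_tilde(43.0, 10.0), log_f_tilde(30.0, 1.8))
```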
The parameter $\sigma_R \in (0, \infty) = \mathbb{R}^+$ is constrained to positive real numbers. Therefore, we perform the Bayesian estimation in the transformed space $\boldsymbol{\theta}^{tr}$,
$$\boldsymbol{\theta}^{tr} = [\theta_1^{tr}\ \theta_2^{tr}]^\intercal = [\mu_R\ \ln(\sigma_R)]^\intercal \quad \Leftrightarrow \quad [\mu_R\ \sigma_R]^\intercal = [\theta_1^{tr}\ \exp(\theta_2^{tr})]^\intercal.$$
Note that no transformation is applied for $\mu_R$ because it is already defined over $\mathbb{R}$.
The Newton-Raphson algorithm is employed in order to identify the set of parameters maximizing the target PDF defined in the transformed space,
$$\tilde{f}(\boldsymbol{\theta}^{tr}) = f(\mathcal{D}|\boldsymbol{\theta}^{tr}) \cdot f(\boldsymbol{\theta}^{tr}).$$
The optimal values found are $\boldsymbol{\theta}^{tr*} = [43.0\ 0.6]^\intercal$, that is, $\boldsymbol{\theta}^* = [43.0\ 1.8]^\intercal$. The Laplace approximation evaluated at $\boldsymbol{\theta}^{tr*}$ is employed to estimate the posterior covariance
$$\boldsymbol{\Sigma}^{tr} = -\mathbf{H}[\ln\tilde{f}(\boldsymbol{\theta}^{tr*})]^{-1} = \begin{bmatrix} 1.11 & 0 \\ 0 & 0.17 \end{bmatrix}.$$
This estimation is employed with a scaling factor $\lambda^2 = 2.88$ in order to initialize the proposal PDF covariance. A total of $C = 4$ chains are generated, where each contains $S = 10^4$ samples. The scaling factor obtained after tuning is $\lambda^2 = 1.44$, which leads to an acceptance rate of $\bar{\alpha} = 0.29$ and EPSR convergence metrics $\hat{R} = [1.0009\ 1.0017]$.
The posterior mean and posterior covariance are estimated from the MCMC samples $\boldsymbol{\theta}_s$,
$$\hat{\mathbb{E}}[\boldsymbol{\theta}|\mathcal{D}] = \frac{1}{S}\sum_{s=1}^{S}\boldsymbol{\theta}_s = [42.9\ 3.8]^\intercal \quad \text{(posterior mean)}$$
$$\widehat{\mathrm{cov}}(\boldsymbol{\theta}|\mathcal{D}) = \frac{1}{S-1}\sum_{s=1}^{S}(\boldsymbol{\theta}_s - \mathbb{E}[\boldsymbol{\theta}|\mathcal{D}])(\boldsymbol{\theta}_s - \mathbb{E}[\boldsymbol{\theta}|\mathcal{D}])^\intercal = \begin{bmatrix} 8.0 & 2.5 \\ 2.5 & 20.6 \end{bmatrix} \quad \text{(posterior covariance)}.$$
The posterior predictive mean and variance for the concrete resistance, which include the epistemic uncertainty about the parameters $\mu_R$ and $\sigma_R$, are
$$\hat{\mathbb{E}}[R|\mathcal{D}] = \frac{1}{S}\sum_{s=1}^{S}r_s = 42.9$$
$$\widehat{\mathrm{var}}[R|\mathcal{D}] = \frac{1}{S-1}\sum_{s=1}^{S}(r_s - \mathbb{E}[R|\mathcal{D}])^2 = 40.5,$$
where samples $r_s : R \sim \mathcal{N}(r; \mu_{R,s}, \sigma_{R,s}^2)$ are generated using the MCMC samples $\boldsymbol{\theta}_s = [\mu_{R,s}\ \sigma_{R,s}]^\intercal$.
Figure 7.12: Example of application of MCMC sampling for the estimation of the resistance of a concrete mix, showing the prior $f(\mu_R, \sigma_R)$ with the true values $\{\check{\mu}_R, \check{\sigma}_R\}$, the likelihood $f(\mathcal{D}|\mu_R, \sigma_R)$, the posterior $f(\mu_R, \sigma_R|\mathcal{D})$, and the predictive PDF $\tilde{f}(r|\mathcal{D})$ compared with the true PDF. The cross indicates the true (unknown) parameter values.
Figure 7.12 presents the prior, likelihood, posterior, and posterior predictive PDFs. Samples $\boldsymbol{\theta}_s$ are represented by dots on the posterior. Note how the posterior predictive does not exactly correspond with the true PDF for $R$, which has the true values $\check{\mu}_R = 42$ MPa and $\check{\sigma}_R = 2$ MPa. This discrepancy is attributed to the fact that only 3 observations are available, so epistemic uncertainty remains in the estimation of the parameters $\{\mu_R, \sigma_R\}$. In figure 7.12, this uncertainty translates into heavier tails for the posterior predictive than for the true PDF.