7 Markov Chain Monte Carlo
The importance sampling method was introduced in chapter 6 for estimating the normalization constant of a posterior probability density function (PDF) as well as its expected values and covariance. A limitation of this approach is related to the difficulties associated with the choice of an efficient sampling distribution $h(\cdot)$. The sampling methods covered in this chapter address this limitation by providing ways of sampling directly from the posterior using Markov chain Monte Carlo (MCMC) methods.
Figure 7.1: Example of random walk generated with MCMC.
Markov chain Monte Carlo takes its name from the Markov property, which states:

Given the present, the future is independent of the past.

It means that if we know the current state of the system, we can predict future states without having to consider past states. Starting from $X_t \sim p(x_t)$, we can transition to $x_{t+1}$ using a transition probability $p(x_{t+1}|x_t)$ that only depends on $x_t$. It implicitly conveys that the conditional probability depending on $x_t$ is equivalent to the one depending on $x_{1:t}$,
$$p(x_{t+1}|x_t) = p(x_{t+1}|x_1, x_2, \cdots, x_t).$$
A Markov chain defines the joint distribution for random variables by combining the chain rule (see §3.3.4) and the Markov property so that
$$p(x_{1:T}) = p(x_1)\,p(x_2|x_1)\,p(x_3|x_2)\cdots p(x_T|x_{T-1}) = p(x_1)\prod_{t=2}^{T} p(x_t|x_{t-1}).$$
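The chain-rule factorization above can be checked numerically. Below is a minimal sketch using a hypothetical two-state chain; the initial distribution and transition matrix are illustrative assumptions, not taken from the text:

```python
import numpy as np

# Hypothetical two-state chain (states 0 and 1), used only for illustration.
p1 = np.array([0.6, 0.4])          # p(x_1)
P = np.array([[0.7, 0.3],          # p(x_t | x_{t-1} = 0)
              [0.2, 0.8]])         # p(x_t | x_{t-1} = 1)

def joint_prob(x):
    """p(x_{1:T}) = p(x_1) * prod_{t=2}^T p(x_t | x_{t-1})."""
    prob = p1[x[0]]
    for t in range(1, len(x)):
        prob *= P[x[t - 1], x[t]]
    return prob

print(joint_prob([0, 0, 1]))  # 0.6 * 0.7 * 0.3 = 0.126
```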
The idea behind MCMC is to construct a Markov chain for which the stationary distribution is the posterior. Conceptually, it corresponds to randomly walking through the parameter space so that the fraction of steps spent exploring each part of the domain is proportional to the posterior density. Figure 7.1 presents an example of a random walk generated with MCMC that follows the density described by the underlying contour plot.
j.-a. goulet 90
Several MCMC methods exist: Metropolis-Hastings, Gibbs sampling, slice sampling, Hamiltonian Monte Carlo, and more. This chapter covers only the Metropolis and Metropolis-Hastings methods because they are the most accessible. The reader interested in advanced methods should refer to dedicated textbooks such as the one by Brooks et al.¹

¹ Brooks, S., A. Gelman, G. Jones, and X.-L. Meng (2011). Handbook of Markov Chain Monte Carlo. CRC Press.
7.1 Metropolis
The Metropolis algorithm² was developed during the Second World War while working on the Manhattan Project (i.e., the atomic bomb) at Los Alamos, New Mexico. Metropolis is not the most efficient sampling algorithm, yet it is a simple one allowing for an easy introduction to MCMC methods.

² Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics 21(6), 1087–1092.

Notation
Initial state: $\boldsymbol{\theta}_0$
Target distribution: $\tilde{f}(\boldsymbol{\theta})$
Proposal distribution: $q(\boldsymbol{\theta}'|\boldsymbol{\theta})$
The Metropolis algorithm requires defining an initial state for a set of $P$ parameters $\boldsymbol{\theta}_0 = [\theta_1\ \theta_2\ \cdots\ \theta_P]^\intercal_0$, a target distribution $\tilde{f}(\boldsymbol{\theta}) = f(\mathcal{D}|\boldsymbol{\theta}) \cdot f(\boldsymbol{\theta})$ corresponding to the unnormalized posterior we want to sample from, and a proposal distribution, $q(\boldsymbol{\theta}'|\boldsymbol{\theta})$, which describes where to move next given the current parameter values. The proposal must have a nonzero probability to transition from the current state to any state supported by the target distribution and must be symmetric, that is,
$$q(\boldsymbol{\theta}'|\boldsymbol{\theta}) = q(\boldsymbol{\theta}|\boldsymbol{\theta}').$$
The Normal distribution (see §4.1) is a common general-purpose proposal distribution,
$$q(\boldsymbol{\theta}'|\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta}'; \boldsymbol{\theta}, \boldsymbol{\Sigma}_q),$$
where the mean is defined by the current position and the probability to move in a region around the current position is controlled by the covariance matrix $\boldsymbol{\Sigma}_q$. Figure 7.2 presents an example of 1-D and 2-D Normal proposal distributions.
Figure 7.2: Examples of 1-D and 2-D proposal distributions, $q(\boldsymbol{\theta}'|\boldsymbol{\theta})$: (a) $\mathcal{N}(\theta'; \theta, \sigma_q^2)$; (b) $\mathcal{N}(\boldsymbol{\theta}'; [\theta_1\ \theta_2]^\intercal, \mathrm{diag}(\sigma_{q_1}^2\ \sigma_{q_2}^2))$.
The Metropolis algorithm is recursive, so at the $s$th step, given a current position in the parameter space $\boldsymbol{\theta} = \boldsymbol{\theta}_s$, we employ $q(\boldsymbol{\theta}'|\boldsymbol{\theta})$ to propose moving to a new position $\boldsymbol{\theta}'$. If the target distribution evaluated at the proposed location is greater than or equal to the current one, that is, $\tilde{f}(\boldsymbol{\theta}') \geq \tilde{f}(\boldsymbol{\theta})$, we accept the proposed location. If the proposed location has a target value that is lower than the current one, we accept moving to the proposed location with a probability equal to the acceptance ratio $\tilde{f}(\boldsymbol{\theta}')/\tilde{f}(\boldsymbol{\theta})$. In the case where the proposed location is rejected, we stay at the current location and $\boldsymbol{\theta}_{s+1} = \boldsymbol{\theta}_s$. Each step from the Metropolis sampling method is formalized in algorithm 4.

Note: In order to accept a proposed location with a probability equal to $r = \tilde{f}(\boldsymbol{\theta}')/\tilde{f}(\boldsymbol{\theta})$, we compare it with a sample $u$ taken from $\mathcal{U}(0,1)$. If $u \leq r$, we accept the move; otherwise, we reject it.
probabilistic machine learning for civil engineers 91
Algorithm 4: Metropolis sampling
1  define $\tilde{f}(\boldsymbol{\theta})$ (target distribution)
2  define $q(\boldsymbol{\theta}'|\boldsymbol{\theta})$ (proposal distribution)
3  define $S$ (number of samples)
4  initialize $\mathcal{S} = \emptyset$ (set of samples)
5  initialize $\boldsymbol{\theta}_0$ (initial starting location)
6  for $s \in \{0, 1, 2, \cdots, S-1\}$ do
7    define $\boldsymbol{\theta} = \boldsymbol{\theta}_s$
8    sample $\boldsymbol{\theta}' \sim q(\boldsymbol{\theta}'|\boldsymbol{\theta})$
9    compute $\alpha = \tilde{f}(\boldsymbol{\theta}')/\tilde{f}(\boldsymbol{\theta})$ (acceptance ratio)
10   compute $r = \min(1, \alpha)$
11   sample $u \sim \mathcal{U}(0, 1)$
12   if $u \leq r$ then
13     $\boldsymbol{\theta}_{s+1} = \boldsymbol{\theta}'$
14   else
15     $\boldsymbol{\theta}_{s+1} = \boldsymbol{\theta}_s$
16   $\mathcal{S} \leftarrow \{\mathcal{S} \cup \{\boldsymbol{\theta}_{s+1}\}\}$ (add to the set of samples)
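Algorithm 4 can be sketched in a few lines of Python. This is a minimal illustration rather than a production sampler; the unnormalized Gaussian target and the isotropic Normal proposal below are illustrative assumptions:

```python
import numpy as np

def metropolis(f_tilde, theta0, sigma_q, S, rng=None):
    """Minimal Metropolis sampler (algorithm 4) with an isotropic
    Normal proposal of standard deviation sigma_q."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    samples = np.empty((S, theta.size))
    for s in range(S):
        theta_p = theta + sigma_q * rng.standard_normal(theta.size)  # sample from q
        alpha = f_tilde(theta_p) / f_tilde(theta)    # acceptance ratio (line 9)
        if rng.uniform() <= min(1.0, alpha):         # accept with probability r
            theta = theta_p                          # move; otherwise stay put
        samples[s] = theta
    return samples

# Example: sample an unnormalized standard Normal target.
f_tilde = lambda th: np.exp(-0.5 * np.sum(th**2))
samples = metropolis(f_tilde, theta0=[3.0], sigma_q=1.0, S=20000,
                     rng=np.random.default_rng(0))
post = samples[10000:]            # discard the first half as burn-in
print(post.mean(), post.std())    # should be close to 0 and 1
```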
If we apply this recursive algorithm over $S$ iterations, the result is that the fraction of the steps spent exploring each part of the domain is proportional to the density of the target distribution $\tilde{f}(\boldsymbol{\theta})$. Note that this last statement is valid under some conditions regarding the chain starting location and number of samples $S$ that we will further discuss in §7.3. If we select a target distribution that is an unnormalized posterior, it implies that each sample $\boldsymbol{\theta}_s$ is a realization of that posterior,
$$\boldsymbol{\theta}_s : \boldsymbol{\theta} \sim f(\mathcal{D}|\boldsymbol{\theta}) \cdot f(\boldsymbol{\theta}).$$
Figure 7.3: Step-by-step example of 1-D Metropolis sampling. Panels (a)–(c) show steps $s = 1, 2, 3$, each displaying the target $\tilde{f}(\theta)$, the current location $\theta = \theta_s$, the proposed location $\theta'$, and the resulting acceptance ratio $\alpha = \tilde{f}(\theta')/\tilde{f}(\theta)$; panel (d) shows the chain after $s = 2400$ samples.
Figure 7.3 presents a step-by-step application of algorithm 4 for sampling a given target density $\tilde{f}(\theta)$. The proposal distribution employed in this example has a standard deviation $\sigma_q = 1$. At step $s = 1$, the target value at the proposed location $\theta'$ is greater than at the current location $\theta$, so the move is accepted. For the second step, $s = 2$, the target value at the proposed location is smaller than the value at the current location. The move is nevertheless accepted because the random number $u$ drawn from $\mathcal{U}(0,1)$ turned out to be smaller than the ratio $\tilde{f}(\theta')/\tilde{f}(\theta)$. At step $s = 3$, the target value at the proposed location $\theta'$ is greater than the value at the current location $\theta$, so the move is accepted. Figure 7.3d presents the chain containing a total of $S = 2400$ samples. The superposition of the samples' histogram and the target density confirms that the Metropolis algorithm is sampling from $\tilde{f}(\theta)$.
7.2 Metropolis-Hastings

The Metropolis-Hastings algorithm³ is identical to the Metropolis algorithm except that it allows for nonsymmetric transition probabilities where
$$q(\boldsymbol{\theta}'|\boldsymbol{\theta}) \neq q(\boldsymbol{\theta}|\boldsymbol{\theta}').$$

³ Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1), 97–109.
The main change with Metropolis-Hastings is that when the proposed location $\boldsymbol{\theta}'$ has a target value that is lower than the current location $\boldsymbol{\theta}$, we accept the proposed location with a probability equal to the ratio $\tilde{f}(\boldsymbol{\theta}')/\tilde{f}(\boldsymbol{\theta})$ times the ratio $q(\boldsymbol{\theta}|\boldsymbol{\theta}')/q(\boldsymbol{\theta}'|\boldsymbol{\theta})$, that is, the ratio of the probability density of going from the proposed location to the current one, divided by the probability density of going from the current location to the proposed one.
Applying Metropolis-Hastings only requires replacing line 9 in algorithm 4 with the new acceptance ratio,
$$\alpha = \frac{\tilde{f}(\boldsymbol{\theta}')}{\tilde{f}(\boldsymbol{\theta})} \cdot \frac{q(\boldsymbol{\theta}|\boldsymbol{\theta}')}{q(\boldsymbol{\theta}'|\boldsymbol{\theta})}.$$
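As a sketch of this correction, consider a multiplicative log-Normal proposal on a positive parameter (an illustrative choice, not from the text); its density is asymmetric, so the Hastings ratio differs from one:

```python
import numpy as np

def lognormal_pdf(x, mu, sigma):
    """PDF of a log-Normal evaluated at x (mu, sigma are in log space)."""
    return (np.exp(-0.5 * ((np.log(x) - mu) / sigma)**2)
            / (x * sigma * np.sqrt(2.0 * np.pi)))

def mh_acceptance(f_tilde, theta, theta_p, sigma_q):
    """Metropolis-Hastings acceptance ratio for a multiplicative
    log-Normal proposal q(theta'|theta), which is not symmetric."""
    q_forward = lognormal_pdf(theta_p, np.log(theta), sigma_q)    # q(theta'|theta)
    q_backward = lognormal_pdf(theta, np.log(theta_p), sigma_q)   # q(theta|theta')
    return f_tilde(theta_p) / f_tilde(theta) * q_backward / q_forward

# Unnormalized Exponential(1) target on (0, inf); for this proposal the
# Hastings correction q_backward / q_forward reduces to theta' / theta.
f_tilde = lambda th: np.exp(-th)
print(mh_acceptance(f_tilde, theta=1.0, theta_p=2.0, sigma_q=0.5))
```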
In the particular case where transition probabilities are symmetric, that is, $q(\boldsymbol{\theta}'|\boldsymbol{\theta}) = q(\boldsymbol{\theta}|\boldsymbol{\theta}')$, Metropolis-Hastings is equivalent to Metropolis. Note that when we employ a Normal proposal density for parameters defined in the unbounded real domain, $\boldsymbol{\theta} \in \mathbb{R}^P$, there is no need for Metropolis-Hastings. On the other hand, using Metropolis-Hastings for sampling bounded domains requires modifying the proposal density at each step. Figure 7.4 illustrates why: if we employ a truncated Normal PDF as a proposal, $q(\boldsymbol{\theta}'|\boldsymbol{\theta}) \neq q(\boldsymbol{\theta}|\boldsymbol{\theta}')$ because the normalization constant needs to be recalculated at each iteration. Section 7.4 shows how to leverage transformation functions $\boldsymbol{\theta}^{tr} = g(\boldsymbol{\theta})$ in order to transform a bounded domain into an unbounded one, $\boldsymbol{\theta}^{tr} \in \mathbb{R}$, so that the Metropolis method can be employed.
Figure 7.4: Example illustrating how, when using a truncated Normal PDF as a proposal, $q(\boldsymbol{\theta}'|\boldsymbol{\theta}) \neq q(\boldsymbol{\theta}|\boldsymbol{\theta}')$.
7.3 Convergence Checks

So far, the presentation of the Metropolis and Metropolis-Hastings algorithms overlooked the notion of convergence. For that, we need to further address two aspects: the burn-in phase and convergence metrics.

7.3.1 Burn-In Phase

For samples taken with an MCMC method to be considered as realizations from the stationary distribution describing the target PDF, a chain must have forgotten where it started from. Depending on the choice of initial state $\boldsymbol{\theta}_0$ and transition PDF, the sampling procedure may initially stay trapped in a part of the domain so that the samples' density is not representative of the target density. As depicted in figure 7.5, this issue requires discarding samples taken before reaching the stationary distribution. These discarded samples are called the burn-in phase. In practice, it is common to discard the first half of each chain as the burn-in phase and then perform the convergence check that will be further detailed in §7.3.2.
Figure 7.5: The impact of the initial starting location on MCMC sampling.
Figure 7.6 presents an example adapted from Murphy,⁴ where, given a current location $x \in \{0, 1, \cdots, 20\}$, the transition model is defined so there is an equal probability of moving to the nearest neighbor either on the left or on the right. If we apply this random transition an infinite number of times, we will reach the stationary distribution $p^{(\infty)}(x) = 1/21,\ \forall x$. Each graph in figure 7.6 presents the probability $p^{(n)}(x)$ of being in any state $x$ after $n$ transitions from the initial state $x_0 = 17$. Here, we see that even after 100 transitions, the chain has not yet forgotten where it started from because it is still skewed toward $x = 17$. After 400 transitions, the initial state has been forgotten, because in this graph we can no longer infer the chain's initial value.

⁴ Murphy, K. P. (2012). Machine learning: A probabilistic perspective. MIT Press.
Figure 7.6: The impact of the initial starting location using a stochastic process: $p^{(n)}(x)$ for $n = 0, 1, 2, 3, 10, 100, 200, 400$.
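The convergence of $p^{(n)}(x)$ toward the uniform stationary distribution can be reproduced numerically. The reflecting boundary behavior below is an assumption made for the sketch, since the text does not specify what happens at the edge states:

```python
import numpy as np

# 21-state nearest-neighbor random walk: from state x, move left or right
# with equal probability. At the boundaries we let the walker hold its
# position with probability 1/2 (an assumption), which keeps the transition
# matrix doubly stochastic so the stationary distribution is uniform.
n_states = 21
P = np.zeros((n_states, n_states))
for x in range(n_states):
    P[x, max(x - 1, 0)] += 0.5
    P[x, min(x + 1, n_states - 1)] += 0.5

p = np.zeros(n_states)
p[17] = 1.0                 # initial state x0 = 17
for n in range(400):
    p = p @ P               # propagate: p^(n+1) = p^(n) P

# After 400 transitions, p is nearly uniform (1/21 everywhere).
print(np.abs(p - 1.0 / n_states).max())
```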
7.3.2 Monitoring Convergence

Monitoring convergence means assessing whether or not the MCMC samples belong to a stationary distribution. For one-dimensional problems, it is possible to track the convergence by plotting the sample numbers versus their values and identifying visually whether or not they belong to a stationary distribution (e.g., see figure 7.3). When sampling in several dimensions, this visual check is limited. Instead, the solution is to generate samples from multiple chains (e.g., 3–5), each having a different starting location $\boldsymbol{\theta}_0$. The stationarity of these chains can be quantified by comparing the variance within and between chains using the estimated potential scale reduction (EPSR). Figure 7.7 illustrates the notation employed to describe samples from multiple chains, where $\boldsymbol{\theta}_{s,c}$ identifies the $s$th sample out of $S$, from the $c$th chain out of $C$. Note that because the final number of samples desired is $S$, a quantity equal to $2S$ samples must be generated in order to account for those discarded during the burn-in period.
Figure 7.7: The notation for samples taken from multiple chains: for each of chains $1, 2, \cdots, c, \cdots, C$, the burn-in samples are discarded and the "stationary" samples $1, \cdots, S$ are kept.
7.3.3 Estimated Potential Scale Reduction

The estimated potential scale reduction⁵ metric denoted by $\hat{R}$ is computed from two quantities: the within-chains and between-chains variances. The within-chains mean $\bar{\theta}_{\cdot c}$ and variance $W$ are estimated using
$$\bar{\theta}_{\cdot c} = \frac{1}{S}\sum_{s=1}^{S}\theta_{s,c} \quad \text{and} \quad W = \frac{1}{C}\sum_{c=1}^{C}\left(\frac{1}{S-1}\sum_{s=1}^{S}(\theta_{s,c} - \bar{\theta}_{\cdot c})^2\right).$$

⁵ Gelman, A. and D. B. Rubin (1992). Inference from iterative simulation using multiple sequences. Statistical Science 7(4), 457–472.
The property of the within-chains variance is that it underestimates the true variance of samples. The between-chains mean $\bar{\theta}_{\cdot\cdot}$ is estimated using
$$\bar{\theta}_{\cdot\cdot} = \frac{1}{C}\sum_{c=1}^{C}\bar{\theta}_{\cdot c},$$
and the variance between the means of chains is given by
$$B = \frac{S}{C-1}\sum_{c=1}^{C}(\bar{\theta}_{\cdot c} - \bar{\theta}_{\cdot\cdot})^2.$$
Contrarily to the within-chains variance $W$, the between-chains estimate
$$\hat{V} = \frac{S-1}{S}\,W + \frac{1}{S}\,B$$
overestimates the variance of samples. The metric $\hat{R}$ is defined as the square root of the ratio between $\hat{V}$ and $W$,
$$\hat{R} = \sqrt{\frac{\hat{V}}{W}}.$$
Because $\hat{V}$ overestimates and $W$ underestimates the variance, $\hat{R}$ should be greater than one. As illustrated in figure 7.8, a value $\hat{R} \approx 1$ indicates that the upper and lower bounds have converged to the same value. Otherwise, if $\hat{R} > 1$, it is an indication that convergence is not reached and the number of samples needs to be increased. In practice, convergence can be deemed to be met when $\hat{R} < 1 + \epsilon$, for a small threshold $\epsilon$.

Note: When generating MCMC samples for $\boldsymbol{\theta} = [\theta_1\ \theta_2\ \cdots\ \theta_P]^\intercal$, we have to compute the EPSR $\hat{R}_i$ for each dimension $i \in \{1, 2, \cdots, P\}$.

Figure 7.8: The EPSR metric to check for convergence.
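The within- and between-chains formulas above translate directly into code. A minimal sketch follows; the synthetic chains used to exercise it are illustrative, not from the text:

```python
import numpy as np

def epsr(chains):
    """Estimated potential scale reduction R-hat for an array of shape
    (C, S): S post-burn-in samples from each of C chains (one dimension)."""
    C, S = chains.shape
    chain_means = chains.mean(axis=1)            # theta-bar_{.c}
    W = chains.var(axis=1, ddof=1).mean()        # within-chains variance
    B = S * chain_means.var(ddof=1)              # between-chains variance
    V_hat = (S - 1) / S * W + B / S              # overestimates the variance
    return np.sqrt(V_hat / W)

# Chains drawn from the same distribution should give R-hat close to 1;
# chains stuck in different regions should give R-hat well above 1.
rng = np.random.default_rng(1)
same = rng.standard_normal((4, 5000))
far_apart = same + np.array([[0.0], [5.0], [10.0], [15.0]])
print(epsr(same), epsr(far_apart))
```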
Figures 7.9a–c compare the EPSR convergence metric $\hat{R}$ for the Metropolis method applied to a 2-D target distribution using a different number of samples. Figures 7.9a–c are using an isotropic bivariate Normal proposal PDF with $\sigma_q = 1$. From the three tests, only the one employing $S = 10^4$ samples meets the criterion that we have set here to $\hat{R} < 1.01$.
Figure 7.9: Comparison of the EPSR convergence metric $\hat{R}$ for 100; 1,000; and 10,000 MCMC samples; each panel shows the proposal $q(\boldsymbol{\theta}'|\boldsymbol{\theta})$ and chains $c = 1, 2, 3$ over $(\theta_1, \theta_2)$: (a) $S = 100$, acceptance rate 60%, $\hat{R} = [1.1074\ 1.0915]$; (b) $S = 1000$, acceptance rate 60%, $\hat{R} = [1.0042\ 1.0107]$; (c) $S = 10{,}000$, acceptance rate 60%, $\hat{R} = [1.0005\ 1.0011]$; (d) $S = 1000$, acceptance rate 38%, $\hat{R} = [1.0182\ 1.0194]$.
7.3.4 Acceptance Rate

The acceptance rate $\bar{\alpha}$ is the ratio between the number of steps where the proposed move is accepted and the total number of steps. Figures 7.9a–c have an acceptance rate of 60 percent.

In the case presented in figure 7.9d, the standard deviation of the proposal was doubled to $\sigma_q = 2$. It has the effect of reducing the acceptance rate to 38 percent. The convergence speed of MCMC methods is related to the acceptance rate. For ideal cases involving Normal random variables, the optimal acceptance rate is approximately 23 percent for parameter spaces having five dimensions or more (i.e., $P \geq 5$), and approximately 44 percent for cases in one dimension ($P = 1$).⁶ Figure 7.10 presents the relative convergence speed as a function of the acceptance rate for a case involving five dimensions or more ($P \geq 5$). Note that there is a wide range of values for $\bar{\alpha}$ with similar efficiency, so the optimal values should not be sought strictly.

⁶ Rosenthal, J. S. (2011). Optimal proposal distributions and adaptive MCMC. In Handbook of Markov Chain Monte Carlo, 93–112. CRC Press.

Figure 7.10: Relative convergence speed of MCMC for $P \geq 5$ as a function of the acceptance rate $\bar{\alpha}$. (Adapted from Rosenthal (2011).)
Figure 7.11: Comparison of the EPSR convergence metric $\hat{R}$ for 1,000 MCMC samples using different proposal distributions; each panel shows the proposal $q(\boldsymbol{\theta}'|\boldsymbol{\theta})$ and chains $c = 1, 2, 3$: (a) $S = 1000$, acceptance rate 34%, $\hat{R} = [1.0421\ 1.0389]$; (b) $S = 1000$, acceptance rate 28%, $\hat{R} = [1.2031\ 1.1952]$; (c) $S = 1000$, acceptance rate 37%, $\hat{R} = [1.0064\ 1.0048]$.
Figure 7.11 presents another set of examples applied to a target distribution having a high (negative) correlation between the dimensions $\theta_1$ and $\theta_2$, and using a fixed number of samples equal to $S = 1000$. In comparison with the case presented in (a), the case in (b) displays worse $\hat{R}$ values. This difference is due to the poor choice of correlation coefficient for the proposal PDF in (b); the correlation coefficient of the proposal has a sign that is the opposite of the one for the target PDF. If, as in figure 7.11c, the proposal is well selected for the target distribution, the convergence speed will be higher.

For trivial cases, it may be possible to infer efficient parameters for proposals from trial and error or from heuristics. However, for more complex cases involving a large number of dimensions, manual tuning falls short. The next section presents a method for automatically tuning the proposal covariance matrix.
7.3.5 Proposal Tuning

One generic way to define the proposal PDF is using a multivariate Normal centered on the current state $\boldsymbol{\theta}$ and with covariance matrix $\lambda^2\boldsymbol{\Sigma}$,
$$q(\boldsymbol{\theta}'|\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta}'; \boldsymbol{\theta}, \lambda^2\boldsymbol{\Sigma}), \quad \lambda^2 = \frac{2.4^2}{P},$$
where the scaling factor $\lambda$ depends on the number of parameters $P$. The covariance matrix $\boldsymbol{\Sigma}$ is an approximation of the posterior covariance obtained using the Laplace approximation (see §6.7.2) calculated for the maximum a-posteriori (MAP) value, $\boldsymbol{\theta}^*$. The MAP value can be estimated using gradient-based optimization techniques such as those presented in §5.2.

Algorithm 5 presents a simple implementation of the Metropolis algorithm with convergence checks and with a tuning procedure for the scaling factor $\lambda$. Note that this algorithm is a minimal example intended to help the reader understand the algorithm flow. Implementations with state-of-the-art efficiency include several more steps.
7.4 Space Transformation

When dealing with parameters that are not defined in the unbounded real space, one solution is to employ the Metropolis-Hastings algorithm to account for the non-reversibility of the transitions caused by the domain constraints, as illustrated in figure 7.4.
Algorithm 5: Convergence check and scaling parameter tuning
1  define $\tilde{f}(\boldsymbol{\theta})$, $S$, $C$, $\epsilon$
2  initialize $\boldsymbol{\theta}_{0,c}$, $i = 0$
3  compute MAP: $\boldsymbol{\theta}^*$ (e.g., gradient-based optimization)
4  compute $\boldsymbol{\Sigma} = -\mathbf{H}[\ln\tilde{f}(\boldsymbol{\theta}^*)]^{-1}$ (Laplace approximation)
5  define $q(\boldsymbol{\theta}'|\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta}'; \boldsymbol{\theta}, \lambda^2\boldsymbol{\Sigma})$, $\lambda^2 = \frac{2.4^2}{P}$
6  for chains $c \in \{1, 2, \cdots, C\}$ do
7    $\mathcal{S}_c = \emptyset$
8    for samples $s \in \{0, 1, 2, \cdots, S-1\}$ do
9      Metropolis algorithm: $\mathcal{S}_c \leftarrow \{\mathcal{S}_c \cup \{\boldsymbol{\theta}_{s+1,c}\}\}$
10 if $i = 0$ (pilot run) then
11   for $c \in \{1, 2, \cdots, C\}$ do
12     $\mathcal{S}_c = \{\boldsymbol{\theta}_{S/2,c}, \cdots, \boldsymbol{\theta}_{S,c}\}$ (discard burn-in samples)
13   compute $\bar{\alpha}$ (acceptance rate)
14   if $\bar{\alpha} < 0.15$ then
15     $\lambda^2 = \lambda^2/2$, go to 5 (decrease scaling factor)
16   else if $\bar{\alpha} > 0.50$ then
17     $\lambda^2 = 2\lambda^2$, go to 5 (increase scaling factor)
18 compute $\hat{R}_p$, $\forall p \in \{1, 2, \cdots, P\}$ (EPSR)
19 if $\hat{R}_p > 1 + \epsilon$ for any $p$ then
20   $\boldsymbol{\theta}_{0,c} = \boldsymbol{\theta}_{S,c}$ (restart using the last sample)
21   $S = 2^i S$ (increase the number of samples)
22   $i = i + 1$, go to 6
23 converged
Another solution is to transform each constrained parameter in the real space in order to employ the Metropolis algorithm. Transformation functions $\theta^{tr} = g(\theta_i)$ suited for this purpose are described in §3.4. When working with transformed parameters, the proposal distribution has to be defined in the transformed space,
$$q(\boldsymbol{\theta}'^{tr}|\boldsymbol{\theta}^{tr}) = \mathcal{N}(\boldsymbol{\theta}'^{tr}; \boldsymbol{\theta}^{tr}, \lambda^2\boldsymbol{\Sigma}^{tr}),$$
where $\boldsymbol{\Sigma}^{tr}$ is estimated using the Laplace approximation, which is itself defined in the transformed space. The target probability can be evaluated in the transformed space using the inverse transformation function $g^{-1}(\theta^{tr})$, its gradient $|\nabla g(\theta)|^{-1}$ evaluated at $\theta^{tr}$, and the change of variable rule,
$$\tilde{f}(\boldsymbol{\theta}^{tr}) = \tilde{f}(\overbrace{g^{-1}(\boldsymbol{\theta}^{tr})}^{=\,\boldsymbol{\theta}}) \cdot \prod_{i=1}^{P}\frac{1}{|\nabla_i g(\boldsymbol{\theta})|_{\boldsymbol{\theta}^{tr}}}.$$

Change of variable rule (see §3.4): $f_Y(y) = f_X(x)\,|\det \mathbf{J}_{y,x}|^{-1}$.
Because each transformation function $g(\theta_i)$ depends on a single parameter, the determinant of the diagonal Jacobian matrix required for the transformation $g(\boldsymbol{\theta})$ simplifies to the product of the inverse absolute value of the gradient of each transformation function evaluated at $\theta_i = g^{-1}(\theta_i^{tr})$. Algorithm 6 presents a simple implementation of the Metropolis algorithm applied to a transformed space.

Determinant of a diagonal matrix: $\det(\mathrm{diag}(\mathbf{x})) = \prod_{i=1}^{X} x_i$.
Algorithm 6: Metropolis with transformed space
1  define $\tilde{f}(\boldsymbol{\theta})$ (target distribution)
2  define $g(\boldsymbol{\theta}) \rightarrow \boldsymbol{\theta}^{tr}$, $g^{-1}(\boldsymbol{\theta}^{tr}) \rightarrow \boldsymbol{\theta}$ (transformation functions)
3  define $\nabla g(\boldsymbol{\theta})$ (transformation gradient)
4  define $S$ (number of samples)
5  initialize $\mathcal{S} = \emptyset$ (set of samples)
6  initialize $\boldsymbol{\theta}_0$ (initial starting location)
7  compute MAP: $\boldsymbol{\theta}^{tr*}$ (e.g., gradient-based optimization)
8  compute $\boldsymbol{\Sigma}^{tr} = -\mathbf{H}[\ln\tilde{f}(\boldsymbol{\theta}^{tr*})]^{-1}$ (Laplace approximation)
9  for $s \in \{0, 1, 2, \cdots, S-1\}$ do
10   define $\boldsymbol{\theta}^{tr} = g(\boldsymbol{\theta}_s)$
11   sample $\boldsymbol{\theta}'^{tr} \sim \mathcal{N}(\boldsymbol{\theta}'^{tr}; \boldsymbol{\theta}^{tr}, \lambda^2\boldsymbol{\Sigma}^{tr})$
12   compute $\alpha = \dfrac{\tilde{f}(g^{-1}(\boldsymbol{\theta}'^{tr}))}{\tilde{f}(g^{-1}(\boldsymbol{\theta}^{tr}))} \cdot \prod_{i=1}^{P} \dfrac{|\nabla_i g(\boldsymbol{\theta})|_{\boldsymbol{\theta}^{tr}}}{|\nabla_i g(\boldsymbol{\theta}')|_{\boldsymbol{\theta}'^{tr}}}$
13   compute $r = \min(1, \alpha)$
14   sample $u \sim \mathcal{U}(0, 1)$
15   if $u \leq r$ then
16     $\boldsymbol{\theta}_{s+1} = g^{-1}(\boldsymbol{\theta}'^{tr})$
17   else
18     $\boldsymbol{\theta}_{s+1} = \boldsymbol{\theta}_s$
19   $\mathcal{S} \leftarrow \{\mathcal{S} \cup \{\boldsymbol{\theta}_{s+1}\}\}$
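A minimal Python sketch of algorithm 6 for a single positive parameter, using the logarithmic transformation $g(\theta) = \ln\theta$ so that $|\nabla g(\theta)| = 1/\theta$. The Exponential target is an illustrative assumption, and for simplicity the MAP/Laplace initialization of lines 7–8 is replaced by a fixed proposal standard deviation:

```python
import numpy as np

def metropolis_transformed(f_tilde, theta0, sigma_tr, S, rng=None):
    """Metropolis in the transformed space theta_tr = ln(theta) for a
    target defined on (0, inf); returns samples in the original space."""
    rng = np.random.default_rng() if rng is None else rng
    theta = float(theta0)
    samples = np.empty(S)
    for s in range(S):
        theta_tr = np.log(theta)                                  # line 10
        theta_tr_p = theta_tr + sigma_tr * rng.standard_normal()  # line 11
        theta_p = np.exp(theta_tr_p)                              # g^{-1}
        # Line 12: target ratio times the Jacobian correction
        # |grad g(theta)| / |grad g(theta')| = (1/theta)/(1/theta') = theta'/theta.
        alpha = f_tilde(theta_p) / f_tilde(theta) * theta_p / theta
        if rng.uniform() <= min(1.0, alpha):
            theta = theta_p
        samples[s] = theta
    return samples

# Example: unnormalized Exponential(1) target defined on (0, inf).
f_tilde = lambda th: np.exp(-th)
samples = metropolis_transformed(f_tilde, 1.0, sigma_tr=1.0, S=40000,
                                 rng=np.random.default_rng(0))
print(samples[20000:].mean())  # should be close to the Exponential mean, 1
```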
7.5 Computing with MCMC Samples

When employing an MCMC method, each sample $\boldsymbol{\theta}_s$ is a realization of the target distribution $\tilde{f}(\boldsymbol{\theta})$. When we are interested in performing a Bayesian estimation for parameters using a set of observations $\mathcal{D} = \{y_1, y_2, \cdots, y_D\}$, the unnormalized target distribution is defined as the product of the likelihood and the prior,
$$\tilde{f}(\boldsymbol{\theta}) = f(\mathcal{D}|\boldsymbol{\theta}) \cdot f(\boldsymbol{\theta}).$$
The posterior expected values and covariance for $f(\boldsymbol{\theta}|\mathcal{D})$ are then obtained by computing the empirical average and covariance of the samples $\boldsymbol{\theta}_s$,
$$\mathbb{E}[\boldsymbol{\theta}|\mathcal{D}] = \int \boldsymbol{\theta} \cdot f(\boldsymbol{\theta}|\mathcal{D})\,d\boldsymbol{\theta} \approx \frac{1}{S}\sum_{s=1}^{S}\boldsymbol{\theta}_s$$
$$\mathrm{cov}(\boldsymbol{\theta}|\mathcal{D}) = \mathbb{E}\!\left[(\boldsymbol{\theta} - \mathbb{E}[\boldsymbol{\theta}|\mathcal{D}])(\boldsymbol{\theta} - \mathbb{E}[\boldsymbol{\theta}|\mathcal{D}])^\intercal\right] \approx \frac{1}{S-1}\sum_{s=1}^{S}(\boldsymbol{\theta}_s - \mathbb{E}[\boldsymbol{\theta}|\mathcal{D}])(\boldsymbol{\theta}_s - \mathbb{E}[\boldsymbol{\theta}|\mathcal{D}])^\intercal.$$
Given $f_X(x; \boldsymbol{\theta})$, a probability density function defined by the parameters $\boldsymbol{\theta}$, the posterior predictive density of $X$ given observations $\mathcal{D}$ is obtained by marginalizing the uncertainty of $\boldsymbol{\theta}$,
$$X|\mathcal{D} \sim f(x|\mathcal{D}) = \int f_X(x; \boldsymbol{\theta}) \cdot \overbrace{f(\boldsymbol{\theta}|\mathcal{D})}^{\boldsymbol{\theta}_s}\,d\boldsymbol{\theta}.$$
Predictive samples $x_s$ can be generated by using samples $\boldsymbol{\theta}_s$ and evaluating them in $f_X(x; \boldsymbol{\theta}_s)$, so
$$\underbrace{x_s : X|\boldsymbol{\theta}_s \sim f_X(x; \boldsymbol{\theta}_s)}_{\text{sample from } X \text{ using MCMC samples } \boldsymbol{\theta}_s}.$$
The posterior predictive expected value and variance for $X|\mathcal{D} \sim f(x|\mathcal{D})$ can then be estimated as
$$\mathbb{E}[X|\mathcal{D}] = \int x \cdot f(x|\mathcal{D})\,dx \approx \frac{1}{S}\sum_{s=1}^{S}x_s$$
$$\mathrm{var}[X|\mathcal{D}] = \mathbb{E}[(X - \mathbb{E}[X|\mathcal{D}])^2] \approx \frac{1}{S-1}\sum_{s=1}^{S}(x_s - \mathbb{E}[X|\mathcal{D}])^2.$$
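These estimators can be sketched as follows; the posterior samples below are fabricated stand-ins for MCMC output, used only to illustrate the computation:

```python
import numpy as np

rng = np.random.default_rng(0)
S = 50000
# Stand-in "posterior samples" theta_s = [mu_s, sigma_s] (illustrative only).
mu_s = 42.0 + 0.5 * rng.standard_normal(S)
sigma_s = np.abs(2.0 + 0.2 * rng.standard_normal(S))

# One predictive sample per posterior sample: x_s ~ N(x; mu_s, sigma_s^2).
x_s = rng.normal(mu_s, sigma_s)

E_x = x_s.mean()          # E[X|D] ~ (1/S) sum_s x_s
var_x = x_s.var(ddof=1)   # var[X|D] ~ 1/(S-1) sum_s (x_s - E[X|D])^2
print(E_x, var_x)
```

Note that the predictive variance mixes the aleatory part ($\sigma_s^2$) with the epistemic spread of $\mu_s$ across posterior samples, which is exactly what the marginalization integral expresses.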
Given a function that depends on parameters $\boldsymbol{\theta}$ such that $z = g(\boldsymbol{\theta})$, posterior predictive samples from that function can be obtained by evaluating it for randomly selected samples $\boldsymbol{\theta}_s$,
$$\underbrace{z_s : Z|\boldsymbol{\theta}_s = g(\boldsymbol{\theta}_s)}_{\text{sample from } Z \text{ using MCMC samples } \boldsymbol{\theta}_s}.$$
The posterior predictive expected value and variance for the function output are thus once again
$$\mathbb{E}[Z|\mathcal{D}] = \int z \cdot f(z|\mathcal{D})\,dz \approx \frac{1}{S}\sum_{s=1}^{S}z_s$$
$$\mathrm{var}[Z|\mathcal{D}] = \mathbb{E}[(Z - \mathbb{E}[Z|\mathcal{D}])^2] \approx \frac{1}{S-1}\sum_{s=1}^{S}(z_s - \mathbb{E}[Z|\mathcal{D}])^2.$$
As we saw in §6.8, performing Bayesian model selection requires computing the evidence
$$f(\mathcal{D}) = \int f(\mathcal{D}|\boldsymbol{\theta}) \cdot f(\boldsymbol{\theta})\,d\boldsymbol{\theta}.$$
It is theoretically possible to estimate this normalization constant from Metropolis-Hastings samples using the harmonic mean of the likelihood.⁷ This method is not presented here because it is known for its poor performance in practice.⁸ Annealed importance sampling⁹ (AIS) is a method combining annealing optimization methods, importance sampling, and MCMC sampling. Despite being one of the efficient methods for estimating $f(\mathcal{D})$, we have to keep in mind that estimating the evidence is intrinsically difficult when the number of parameters is large and when the posterior is multimodal or when it displays nonlinear dependencies. Therefore, we should always be careful when estimating $f(\mathcal{D})$ because no perfect black-box solution is currently available for estimating it.

⁷ Newton, M. A. and A. E. Raftery (1994). Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society. Series B (Methodological) 56(1), 3–48.
⁸ Neal, R. (2008). The harmonic mean of the likelihood: Worst Monte Carlo method ever. URL https://radfordneal.wordpress.com/2008/08/17/the-harmonic-mean-of-the-likelihood-worst-monte-carlo-method-ever/. Accessed November 8, 2019.
⁹ Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing 11(2), 125–139.
Example: Concrete tests  We are revisiting the example presented in §6.5.3, where Bayesian estimation is employed to characterize the resistance $R$ of a concrete mix. The resistance is now modeled as a Normal random variable with unknown mean and variance, $R \sim \mathcal{N}(r; \mu_R, \sigma_R^2)$. The observation model is $y = r + v$, $v : V \sim \mathcal{N}(v; 0, 0.01^2)$ MPa, where $V$ describes the observation errors that are independent of each other, that is, $V_i \perp V_j,\ \forall i \neq j$. Our goal is to estimate the posterior PDF for the resistance's mean $\mu_R$ and standard deviation $\sigma_R$,
$$\overbrace{f(\mu_R, \sigma_R|\mathcal{D})}^{\text{Posterior PDF}} = \frac{\overbrace{f(\mathcal{D}|\mu_R, \sigma_R)}^{\text{Likelihood}} \cdot \overbrace{f(\mu_R, \sigma_R)}^{\text{Prior PDF}}}{\underbrace{f(\mathcal{D})}_{\text{Evidence}}}.$$
In order to reflect an absence of prior knowledge, we employ non-informative priors (see §6.4.1) for both parameters, so that $f(\mu_R) \propto 1$, $f(\sigma_R) \propto 1/\sigma_R$, and thus $f(\mu_R, \sigma_R) \propto 1/\sigma_R$. The likelihood of an observation $y$, given a set of parameters $\boldsymbol{\theta}$, is $f(y|\boldsymbol{\theta}) = \mathcal{N}(y; \mu_R, \sigma_R^2 + \sigma_V^2)$ with $\sigma_V^2 = 0.01^2$. Because of the conditional independence assumption for the observations, the joint likelihood for $D$ observations is obtained from the product of the marginals,
$$f(\mathcal{D}|\mu_R, \sigma_R) = \prod_{i=1}^{D}\underbrace{\frac{1}{\sqrt{2\pi}\sqrt{\sigma_R^2 + \sigma_V^2}}\exp\!\left[-\frac{1}{2}\left(\frac{y_i - \mu_R}{\sqrt{\sigma_R^2 + \sigma_V^2}}\right)^2\right]}_{\text{Normal PDF}}.$$
In this example, we assume we only have three observations,
$$\mathcal{D} = \{43.3, 40.4, 44.8\}\ \text{MPa}.$$
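The unnormalized log-posterior for this example can be written compactly. Working in log space is a practical choice made here to avoid numerical underflow, not something required by the text:

```python
import numpy as np

D = np.array([43.3, 40.4, 44.8])   # observations in MPa
sigma_V = 0.01                     # observation-error standard deviation

def log_f_tilde(mu_R, sigma_R):
    """Unnormalized log-posterior ln f(D|mu_R, sigma_R) + ln f(mu_R, sigma_R),
    with the non-informative prior f(mu_R, sigma_R) proportional to 1/sigma_R."""
    if sigma_R <= 0:
        return -np.inf             # sigma_R is constrained to (0, inf)
    var = sigma_R**2 + sigma_V**2
    log_lik = np.sum(-0.5 * np.log(2.0 * np.pi * var)
                     - 0.5 * (D - mu_R)**2 / var)
    log_prior = -np.log(sigma_R)
    return log_lik + log_prior

# The target should be larger near the values reported in the text
# (mu_R around 43, sigma_R around 1.8) than far away from them.
print(log_f_tilde(43.0, 1.8), log_f_tilde(43.0, 10.0), log_f_tilde(30.0, 1.8))
```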
The parameter $\sigma_R \in (0, \infty) = \mathbb{R}^+$ is constrained to positive real numbers. Therefore, we perform the Bayesian estimation in the transformed space $\boldsymbol{\theta}^{tr}$,
$$\boldsymbol{\theta}^{tr} = [\theta_1^{tr}\ \theta_2^{tr}]^\intercal = [\mu_R\ \ln(\sigma_R)]^\intercal \quad \Leftrightarrow \quad [\mu_R\ \sigma_R]^\intercal = [\theta_1^{tr}\ \exp(\theta_2^{tr})]^\intercal.$$
Note that no transformation is applied for $\mu_R$ because it is already defined over $\mathbb{R}$.
The Newton-Raphson algorithm is employed in order to identify the set of parameters maximizing the target PDF defined in the transformed space,
$$\tilde{f}(\boldsymbol{\theta}^{tr}) = f(\mathcal{D}|\boldsymbol{\theta}^{tr}) \cdot f(\boldsymbol{\theta}^{tr}).$$
The optimal values found are $\boldsymbol{\theta}^{tr*} = [43.0\ 0.6]^\intercal$, that is, $\boldsymbol{\theta}^* = [43.0\ 1.8]^\intercal$. The Laplace approximation evaluated at $\boldsymbol{\theta}^{tr*}$ is employed to estimate the posterior covariance
$$\boldsymbol{\Sigma}^{tr} = -\mathbf{H}[\ln\tilde{f}(\boldsymbol{\theta}^{tr*})]^{-1} = \begin{bmatrix} 1.11 & 0 \\ 0 & 0.17 \end{bmatrix}.$$
This estimation is employed with a scaling factor $\lambda^2 = 2.88$ in order to initialize the proposal PDF covariance. A total of $C = 4$ chains are generated, where each contains $S = 10^4$ samples. The scaling factor obtained after tuning is $\lambda^2 = 1.44$, which leads to an acceptance rate of $\bar{\alpha} = 0.29$ and EPSR convergence metrics $\hat{R} = [1.0009\ 1.0017]$.
The posterior mean and posterior covariance are estimated from the MCMC samples $\boldsymbol{\theta}_s$,
$$\hat{\mathbb{E}}[\boldsymbol{\theta}|\mathcal{D}] = \frac{1}{S}\sum_{s=1}^{S}\boldsymbol{\theta}_s = [42.9\ 3.8]^\intercal \quad \text{(posterior mean)}$$
$$\widehat{\mathrm{cov}}(\boldsymbol{\theta}|\mathcal{D}) = \frac{1}{S-1}\sum_{s=1}^{S}(\boldsymbol{\theta}_s - \mathbb{E}[\boldsymbol{\theta}|\mathcal{D}])(\boldsymbol{\theta}_s - \mathbb{E}[\boldsymbol{\theta}|\mathcal{D}])^\intercal = \begin{bmatrix} 8.0 & 2.5 \\ 2.5 & 20.6 \end{bmatrix} \quad \text{(posterior covariance)}.$$
The posterior predictive mean and variance for the concrete resistance, which include the epistemic uncertainty about the parameters $\mu_R$ and $\sigma_R$, are
$$\hat{\mathbb{E}}[R|\mathcal{D}] = \frac{1}{S}\sum_{s=1}^{S}r_s = 42.9$$
$$\widehat{\mathrm{var}}[R|\mathcal{D}] = \frac{1}{S-1}\sum_{s=1}^{S}(r_s - \mathbb{E}[R|\mathcal{D}])^2 = 40.5,$$
where samples $r_s : R \sim \mathcal{N}(r; \mu_{R,s}, \sigma_{R,s}^2)$ are generated using the MCMC samples $\boldsymbol{\theta}_s = [\mu_{R,s}\ \sigma_{R,s}]^\intercal$.
Figure 7.12: Example of application of MCMC sampling for the estimation of the resistance of a concrete mix, showing the prior $f(\mu_R, \sigma_R)$ with the true values $\{\check{\mu}_R, \check{\sigma}_R\}$, the likelihood $f(\mathcal{D}|\mu_R, \sigma_R)$, the posterior $f(\mu_R, \sigma_R|\mathcal{D})$, and the predictive PDF $\tilde{f}(r|\mathcal{D})$ compared with the true PDF. The cross indicates the true (unknown) parameter values.
Figure 7.12 presents the prior, likelihood, posterior, and posterior predictive PDFs. Samples $\boldsymbol{\theta}_s$ are represented by dots on the posterior. Note how the posterior predictive does not exactly correspond with the true PDF for $R$, which has the true values $\check{\mu}_R = 42$ MPa and $\check{\sigma}_R = 2$ MPa. This discrepancy is attributed to the fact that only 3 observations are available, so epistemic uncertainty remains in the estimation of the parameters $\{\mu_R, \sigma_R\}$. In figure 7.12, this uncertainty translates into heavier tails for the posterior predictive than for the true PDF.