6
Learning from Data

Note (learning context): p(unknown | known); x: constant; θ: parameters of X ∼ f(x; θ); X: random variable; y: observation.

When scalars x are employed to model physical phenomena, we can use empirical observations y to learn about the probability (or probability density) for each value that x can take. When X is a random variable, we can use observations to learn about the parameters θ describing its probability density (or mass) function f(x; θ). This chapter presents the general Bayesian formulation, as well as approximate methods derived from it, for learning about unknown state variables and parameters, given known data. The applications presented in this chapter focus on simple models in order to keep the attention on the Bayesian methods themselves. We will see in chapters 8–13 how this theory can be generalized for applications to machine learning models having complex architectures. For an in-depth review of Bayesian methods applied to data analysis, the reader may refer to specialized textbooks such as those by Gelman et al.¹ or Box and Tiao.²

¹ Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (2014). Bayesian data analysis (3rd ed.). CRC Press.
² Box, G. E. P. and G. C. Tiao (1992). Bayesian inference in statistical analysis. Wiley.
6.1 Bayes
Learning from data can be seen as an indirect problem: we observe some quantities y, and we want to employ this information to infer knowledge about hidden-state variables x and model parameters θ, where y depends on x and θ through f(y|x; θ). In chapter 3, we saw that such a dependence can be treated using conditional probabilities.

Note: The term hidden describes a state variable that is not directly observed.

Note on notation: ";" and "|" are both employed to describe a conditional probability or probability density. The distinction is that ";" is employed when θ are parameters of a PDF or PMF, for example, f(x; θ) = N(x; µ, σ²), and "|" denotes a conditional dependence between variables, for example, f(y|x). Besides these semantic distinctions, both are employed in the same way when it comes to performing calculations.

In the first part of this chapter, we solely focus on the estimation of hidden-state variables x. Given two random variables X ∼ f(x) and Y ∼ f(y), their joint probability density function (PDF) is obtained by the product of a conditional and a marginal PDF,
$$
f(x,y) = \begin{cases} f(x|y)\cdot f(y) \\ f(y|x)\cdot f(x). \end{cases} \qquad (6.1)
$$
In the case where y is known and X is not, we can reorganize the terms of equation 6.1 in order to obtain Bayes rule,
$$
f(x|y) = \frac{f(y|x)\cdot f(x)}{f(y)},
$$
which describes the posterior PDF (i.e., our posterior knowledge) of X given that we have observed y. Let us consider X, a random variable so that X ∼ f(x), and a set of observations D = {y₁, y₂, ···, y_D} that are realizations of the random variables Y = [Y₁ Y₂ ··· Y_D]ᵀ ∼ f(y). Our posterior knowledge for X given the observations D is described by the conditional PDF
$$
\underbrace{f(x|\mathcal{D})}_{\text{posterior}} = \frac{\overbrace{f(\mathcal{D}|x)}^{\text{likelihood}}\cdot \overbrace{f(x)}^{\text{prior}}}{\underbrace{f(\mathcal{D})}_{\text{evidence}}}.
$$

Note: From now on, the number of elements in a set or the number of variables in a vector is defined by a typewriter-font upper-case letter; for example, D is the number of observations in a set D, and X is the number of variables in the vector x = [x₁ x₂ ··· x_X]ᵀ.

Note: A set of observations D = {y₁, y₂, ···, y_D}; e.g., D = {1.2, 3.8, ···, 0.22}, (y ∈ ℝ); D = {3, 1, ···, 6}, (y ∈ ℤ⁺); D = {blue, blue, ···, red}, (y ∈ {blue, red, green}).
Prior: f(x) describes our prior knowledge for the values x that a random variable X can take. The prior knowledge can be expressed in multiple ways. For instance, it can be based on heuristics such as expert opinions. In the case where data is obtained sequentially, the posterior knowledge at time t−1 becomes the prior at time t. In some cases, it also happens that no prior knowledge is available; then, we should employ a non-informative prior, that is, a prior that reflects an absence of knowledge (see §6.4.1 for further details on non-informative priors).

Note: A uniform prior and a non-informative prior are not the same thing. Some non-informative priors are uniform, but not all uniform priors are non-informative.
Likelihood: f(Y = y|x) ≡ f(D|x) describes the likelihood, or the conditional probability density, of observing the event {Y = D}, given the values that x can take. Note that in the special case of an exact observation where D = y = x, then
$$
f(\mathcal{D}|x) = \begin{cases} 1, & y = x \\ 0, & \text{otherwise.} \end{cases}
$$
In such a case, the observations are exact, so the prior does not play a role; by observing y you have all the information about x because y = x. In the general case where y : Y = fct(x), that is, y is a realization from a stochastic process which is a function of x, f(D|x) describes the prior probability density of observing D given a specific set of values for x.

Note: Here, the term prior refers to our state of knowledge that has not yet been influenced by the observations in D.
Evidence: f(Y = y) ≡ f(D) is called the evidence and consists in a normalization constant ensuring that the posterior PDF integrates to 1. In §3.3 we saw that for both the discrete and the continuous cases,
$$
\underbrace{\sum_{x} p(x|\mathcal{D})}_{\text{discrete case}} = 1, \qquad \underbrace{\int_{-\infty}^{\infty} f(x|\mathcal{D})\,\mathrm{d}x}_{\text{continuous case}} = 1.
$$
For that property to be true, the normalization constant is defined as the sum/integral, over the entire domain of x, of the likelihood times the prior,
$$
\underbrace{p(\mathcal{D}) = \sum_{x} p(\mathcal{D}|x)\cdot p(x)}_{\text{discrete case}}, \qquad \underbrace{f(\mathcal{D}) = \int f(\mathcal{D}|x)\cdot f(x)\,\mathrm{d}x}_{\text{continuous case}}.
$$
So far, the description of learning as an indirect problem is rather abstract. The next sections will apply this concept to simple problem configurations involving discrete and continuous random variables.
6.2 Discrete State Variables
Note: Discrete state variable: a discrete random variable X employed to represent the uncertainty regarding the state x of a system.

This section presents how the probabilities of occurrence for discrete state variables are estimated from observations. In all the examples presented in this section, the uncertainty associated with our knowledge of a hidden random variable X is epistemic because a true value x̌ exists, yet it is unknown. Moreover, x is not directly observable; we can only observe y, where p(y|x) describes the conditional probability of an observation y given the value of the discrete hidden-state variable x.

Reminder: The term hidden refers to a variable that is not observed.
6.2.1 Example: Disease Screening
Figure 6.1: Visual interpretation of Bayes rule in the context of disease screening: within the entire population, Pr(¬disease) = 99% and Pr(disease) = 1%, with Pr(test+|disease) = 80% and Pr(test+|¬disease) = 10%. (Adapted from John Henderson, CC BY.)
Figure 6.1 illustrates the context of a rare disease for which the probability of occurrence in the general population is 1 percent, and a screening test such that if one has the disease, the probability of testing positive is Pr(test+|disease) = 0.8; and if one does not have the disease, the probability of testing positive is Pr(test+|¬disease) = 0.1. The sample space for the variable describing the hidden state we are interested in is x ∈ {disease, ¬disease}, for which the prior probability is Pr(X = disease) = 0.01 and Pr(X = ¬disease) = 1 − Pr(X = disease) = 0.99.

For a single observation, D = {test+}, the posterior probability of having the disease is
$$
\begin{aligned}
\Pr(\text{disease}\,|\,\text{test+}) &= \frac{\Pr(\text{test+}|\text{disease})\cdot\Pr(\text{disease})}{\Pr(\text{test+}|\text{disease})\cdot\Pr(\text{disease}) + \Pr(\text{test+}|\neg\text{disease})\cdot\Pr(\neg\text{disease})} \\
&= \frac{1\%\times 80\%}{(1\%\times 80\%) + (99\%\times 10\%)} = 7.5\%,
\end{aligned}
$$
and the posterior probability of not having the disease is
$$
\begin{aligned}
\Pr(\neg\text{disease}\,|\,\text{test+}) &= 1 - \Pr(\text{disease}\,|\,\text{test+}) \\
&= \frac{\Pr(\text{test+}|\neg\text{disease})\cdot\Pr(\neg\text{disease})}{\Pr(\text{test+}|\text{disease})\cdot\Pr(\text{disease}) + \Pr(\text{test+}|\neg\text{disease})\cdot\Pr(\neg\text{disease})} \\
&= \frac{99\%\times 10\%}{(1\%\times 80\%) + (99\%\times 10\%)} = 92.5\%.
\end{aligned}
$$
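To make the arithmetic above concrete, here is a minimal sketch in Python that reproduces this single-observation update using the probabilities stated in the example; nothing beyond basic dictionary arithmetic is assumed.

```python
# Minimal sketch: Bayes rule update for the disease-screening example.
prior = {"disease": 0.01, "no_disease": 0.99}        # Pr(x)
likelihood = {"disease": 0.80, "no_disease": 0.10}   # Pr(test+ | x)

# Evidence: Pr(test+) = sum_x Pr(test+ | x) * Pr(x)
evidence = sum(likelihood[x] * prior[x] for x in prior)

# Posterior: Pr(x | test+) = Pr(test+ | x) * Pr(x) / Pr(test+)
posterior = {x: likelihood[x] * prior[x] / evidence for x in prior}
print(posterior)   # {'disease': 0.0748..., 'no_disease': 0.9252...} -> 7.5% / 92.5%
```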
If we push this example to the extreme case where the disease is so rare that only one human on Earth has it (see figure 6.2) and where we have a highly accurate screening test so that
$$
\text{test+} : \begin{cases} \Pr(\text{test+}\,|\,\text{disease}) = 0.999 \\ \Pr(\text{test+}\,|\,\neg\text{disease}) = 0.001, \end{cases}
$$
the question is: If you test positive, should you be worried? The answer is no; the disease is so rare that even with the high test accuracy, if we tested every human on Earth, we would expect 0.001 × (8 × 10⁹) = 8 × 10⁶ false-positive diagnoses. This example illustrates how, in extreme cases, prior information may outweigh the knowledge you obtain from observations.

Figure 6.2: The example of a disease so rare that only one human on Earth has it.
6.2.2 Example: Fire Alarm
Note: Σ_x p(y = alarm | x) does not have to equal one for the likelihood; however, the posterior does sum to one, Σ_x p(x|y) = 1.
It is common to observe that when a fire alarm is triggered in a public building, people tend to remain calm and ignore the alarm. Let us explore this situation using conditional probabilities. The sample space for the hidden state we are interested in is x ∈ {fire, ¬fire} and the sample space for the observation is y ∈ {alarm, ¬alarm}. If we assume that
$$
p(\text{alarm}\,|\,x) = \begin{cases} \Pr(\text{alarm}\,|\,\text{fire}) = 0.95 \\ \Pr(\text{alarm}\,|\,\neg\text{fire}) = 0.05, \end{cases}
$$
then the question is: Why does no one react when a fire alarm is triggered? The issue here is that p(alarm|x) describes the probability of an alarm given the state; in order to answer the question, we need to calculate Pr(fire_t | alarm), which depends on the prior probability that there was a fire at time t−1 as well as on the transition probabilities p(x_t | x_{t−1}), as illustrated in figure 6.3.

Figure 6.3: Fire alarm example.
We assume that at t−1, before the alarm is triggered, our prior knowledge is
$$
p(x_{t-1}) = \begin{cases} \Pr(\text{fire}_{t-1}) = 0.01 \\ \Pr(\neg\text{fire}_{t-1}) = 0.99. \end{cases}
$$
The transition probability between t−1 and t, at the moment when the alarm is triggered, is described by the conditional probabilities
$$
p(x_t\,|\,x_{t-1}) = \begin{cases}
\Pr(\neg\text{fire}_t\,|\,\neg\text{fire}_{t-1}) = 0.99 \\
\Pr(\text{fire}_t\,|\,\neg\text{fire}_{t-1}) = 0.01 \\
\Pr(\text{fire}_t\,|\,\text{fire}_{t-1}) = 1 \\
\Pr(\neg\text{fire}_t\,|\,\text{fire}_{t-1}) = 0.
\end{cases}
$$
By combining the prior knowledge at t−1 with the transition probabilities, we obtain the joint prior probability of the hidden states at both time steps,
$$
p(x_t, x_{t-1}) = \begin{cases}
\Pr(\text{fire}_t, \text{fire}_{t-1}) = \Pr(\text{fire}_t\,|\,\text{fire}_{t-1})\cdot\Pr(\text{fire}_{t-1}) = 1\cdot 0.01 = 0.01 \\
\Pr(\text{fire}_t, \neg\text{fire}_{t-1}) = \Pr(\text{fire}_t\,|\,\neg\text{fire}_{t-1})\cdot\Pr(\neg\text{fire}_{t-1}) = 0.01\cdot 0.99 = 0.0099 \\
\Pr(\neg\text{fire}_t, \neg\text{fire}_{t-1}) = \Pr(\neg\text{fire}_t\,|\,\neg\text{fire}_{t-1})\cdot\Pr(\neg\text{fire}_{t-1}) = 0.99\cdot 0.99 = 0.9801 \\
\Pr(\neg\text{fire}_t, \text{fire}_{t-1}) = \Pr(\neg\text{fire}_t\,|\,\text{fire}_{t-1})\cdot\Pr(\text{fire}_{t-1}) = 0\cdot 0.01 = 0.
\end{cases}
$$
We obtain the marginal prior probability of each state x at time t by marginalizing this joint probability, that is, summing over the states at t−1,
$$
p(x_t) = \begin{cases}
\Pr(\text{fire}_t) = \Pr(\text{fire}_t, \text{fire}_{t-1}) + \Pr(\text{fire}_t, \neg\text{fire}_{t-1}) = 0.01 + 0.0099 = 0.0199 \approx 0.02 \\
\Pr(\neg\text{fire}_t) = \Pr(\neg\text{fire}_t, \text{fire}_{t-1}) + \Pr(\neg\text{fire}_t, \neg\text{fire}_{t-1}) = 0.9801 \approx 0.98.
\end{cases}
$$
If we combine the information from the prior knowledge at t and the observation, we can compute the posterior probability of the state x_t, given that the alarm has been triggered,
$$
\begin{aligned}
\Pr(\text{fire}_t\,|\,\text{alarm}) &= \frac{\overbrace{\Pr(\text{alarm}\,|\,\text{fire}_t)\cdot\Pr(\text{fire}_t)}^{\Pr(\text{alarm},\,\text{fire}_t)}}{\underbrace{\big(\Pr(\text{alarm}\,|\,\text{fire}_t)\cdot\Pr(\text{fire}_t)\big) + \big(\Pr(\text{alarm}\,|\,\neg\text{fire}_t)\cdot\Pr(\neg\text{fire}_t)\big)}_{\Pr(\text{alarm})}} \\
&= \frac{0.95\cdot 0.02}{(0.95\cdot 0.02) + (0.05\cdot 0.98)} = 0.28 \\
\Pr(\neg\text{fire}_t\,|\,\text{alarm}) &= 1 - \Pr(\text{fire}_t\,|\,\text{alarm}) = 0.72.
\end{aligned}
$$

Overall, despite an alarm being triggered, the probability that there is a fire remains less than 30 percent.
We can explore what would happen at time t+1 if, after the alarm has been triggered, a safety official informs us that there is a fire and that we need to evacuate. We assume here that the conditional probability of receiving such information given the state is
$$
p(\text{info}\,|\,x) = \begin{cases} \Pr(\text{info}\,|\,\text{fire}) = 0.95 \\ \Pr(\text{info}\,|\,\neg\text{fire}) = 0.05, \end{cases}
$$
as illustrated in figure 6.4. Here we assume that, given the state, the probability of receiving the information from the safety official is conditionally independent from the alarm, that is, Pr(info | alarm, fire) = Pr(info | fire). If, as we did for the estimation at time t, we combine the information from the prior knowledge at time t, the transition probabilities, and the observation at time t+1, we can compute the posterior probability of the hidden state x_{t+1}, given that the alarm has been triggered and that we received the information from a safety official. First, the marginal prior probabilities of each state at time t+1 are estimated from the posterior probability at t combined with the transition probabilities so that
$$
\begin{aligned}
\Pr(\text{fire}_{t+1}\,|\,\text{alarm}) &= \overbrace{\Pr(\text{fire}_{t+1}\,|\,\neg\text{fire}_t)\cdot\Pr(\neg\text{fire}_t\,|\,\text{alarm})}^{\Pr(\text{fire}_{t+1},\,\neg\text{fire}_t\,|\,\text{alarm})} + \overbrace{\Pr(\text{fire}_{t+1}\,|\,\text{fire}_t)\cdot\Pr(\text{fire}_t\,|\,\text{alarm})}^{\Pr(\text{fire}_{t+1},\,\text{fire}_t\,|\,\text{alarm})} \\
&= 0.01\cdot 0.72 + 1\cdot 0.28 = 0.29 \\
\Pr(\neg\text{fire}_{t+1}\,|\,\text{alarm}) &= 1 - \Pr(\text{fire}_{t+1}\,|\,\text{alarm}) = 0.71.
\end{aligned}
$$

Figure 6.4: Example illustrating the role of the prior in sequential data acquisition and interpretation.
Then, the posterior probability at t+1 is estimated following
$$
\begin{aligned}
\Pr(\text{fire}_{t+1}\,|\,\text{alarm}, \text{info}) &= \frac{\overbrace{\Pr(\text{info}\,|\,\text{fire}_{t+1})\cdot\Pr(\text{fire}_{t+1}\,|\,\text{alarm})}^{\Pr(\text{info},\,\text{fire}_{t+1}\,|\,\text{alarm})}}{\underbrace{\big(\Pr(\text{info}\,|\,\text{fire}_{t+1})\cdot\Pr(\text{fire}_{t+1}\,|\,\text{alarm})\big) + \big(\Pr(\text{info}\,|\,\neg\text{fire}_{t+1})\cdot\Pr(\neg\text{fire}_{t+1}\,|\,\text{alarm})\big)}_{\Pr(\text{info}\,|\,\text{alarm})}} \\
&= \frac{0.95\cdot 0.29}{(0.95\cdot 0.29) + (0.05\cdot 0.71)} = 0.89 \\
\Pr(\neg\text{fire}_{t+1}\,|\,\text{alarm}, \text{info}) &= 1 - \Pr(\text{fire}_{t+1}\,|\,\text{alarm}, \text{info}) = 0.11.
\end{aligned}
$$

This example illustrates how, when knowledge about a hidden state evolves over time as new information becomes available, the posterior at t is employed to estimate the prior at t+1.
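The sequential logic of this example (propagate the state through the transition model, then update with each new observation) can be condensed into a short sketch. The code below is an illustration that uses the probabilities stated above, with the states written out as fire/no_fire; it reproduces the results up to the rounding of intermediate values used in the text.

```python
# Minimal sketch: sequential prediction/update for the fire-alarm example.
prior_tm1 = {"fire": 0.01, "no_fire": 0.99}                  # p(x_{t-1})
transition = {("fire", "fire"): 1.0, ("no_fire", "fire"): 0.0,
              ("fire", "no_fire"): 0.01, ("no_fire", "no_fire"): 0.99}  # p(x_t | x_{t-1})
lik_alarm = {"fire": 0.95, "no_fire": 0.05}                  # p(alarm | x)
lik_info = {"fire": 0.95, "no_fire": 0.05}                   # p(official info | x)

def predict(p_prev):
    """Marginal prior at the next step: sum over x' of p(x_t | x') * p(x')."""
    return {x: sum(transition[(x, xp)] * p_prev[xp] for xp in p_prev)
            for x in p_prev}

def update(p_prior, lik):
    """Bayes update: posterior proportional to likelihood times prior."""
    unnorm = {x: lik[x] * p_prior[x] for x in p_prior}
    z = sum(unnorm.values())
    return {x: v / z for x, v in unnorm.items()}

p_t = update(predict(prior_tm1), lik_alarm)     # Pr(fire_t | alarm) ~ 0.28
p_tp1 = update(predict(p_t), lik_info)          # ~ 0.88; the text obtains 0.89 after rounding
print(p_t["fire"], p_tp1["fire"])
```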
6.2.3 Example: Post-Earthquake Damage Assessment
Consider an earthquake of intensity³
$$
X : x \in \{\text{light (L)}, \text{moderate (M)}, \text{important (I)}\}
$$
and a structure that can be in a state (e.g., see figure 6.5)
$$
Y : y \in \{\text{damaged (D)}, \text{undamaged } (\neg\text{D})\}.
$$
These two random variables are illustrated using a Venn diagram in figure 6.6. The likelihood of damage given each earthquake intensity is
$$
p(\text{D}\,|\,x) = \begin{cases} \Pr(Y = \text{D}\,|\,x = \text{L}) = 0.01 \\ \Pr(Y = \text{D}\,|\,x = \text{M}) = 0.10 \\ \Pr(Y = \text{D}\,|\,x = \text{I}) = 0.60, \end{cases}
$$
and the prior probability of each intensity is given by
$$
p(x) = \begin{cases} \Pr(X = \text{L}) = 0.90 \\ \Pr(X = \text{M}) = 0.08 \\ \Pr(X = \text{I}) = 0.02. \end{cases}
$$

Note: For the likelihood, Σ_x p(D|x) ≠ 1; however, for the prior, Σ_x p(x) = 1, and for the posterior, Σ_x p(x|D) = 1.

³ This example is adapted from Armen Der Kiureghian's CE229 course at UC Berkeley.

Figure 6.5: Extensive damage after San Francisco's 1906 earthquake.

Figure 6.6: Venn diagram illustrating the sample space where the earthquake intensities light (L), moderate (M), and important (I) are mutually exclusive and collectively exhaustive events. The post-earthquake structural states of damaged (D) and undamaged (¬D) intersect the earthquake intensity events.
We are interested in inferring the intensity of an earthquake given that we have observed damaged buildings, y = D, following an event, so that
$$
p(x\,|\,\text{D}) = \frac{p(\text{D}\,|\,x)\cdot p(x)}{p(\text{D})}.
$$
The first step to solve this problem is to compute the normalization constant,
$$
\begin{aligned}
p(y = \text{D}) &= \overbrace{p(\text{D}|\text{L})\cdot p(\text{L})}^{\Pr(Y=\text{D},\,X=\text{L})} + \overbrace{p(\text{D}|\text{M})\cdot p(\text{M})}^{\Pr(Y=\text{D},\,X=\text{M})} + \overbrace{p(\text{D}|\text{I})\cdot p(\text{I})}^{\Pr(Y=\text{D},\,X=\text{I})} \\
&= 0.01\cdot 0.90 + 0.10\cdot 0.08 + 0.60\cdot 0.02 = 0.029.
\end{aligned}
$$
Then, the posterior probability of each state is calculated following
$$
\begin{aligned}
\Pr(x = \text{I}\,|\,y = \text{D}) &= \frac{p(\text{D}|\text{I})\cdot p(\text{I})}{p(\text{D})} = \frac{0.60\cdot 0.02}{0.029} = 0.41 \\
\Pr(x = \text{M}\,|\,y = \text{D}) &= \frac{p(\text{D}|\text{M})\cdot p(\text{M})}{p(\text{D})} = \frac{0.10\cdot 0.08}{0.029} = 0.28 \\
\Pr(x = \text{L}\,|\,y = \text{D}) &= \frac{p(\text{D}|\text{L})\cdot p(\text{L})}{p(\text{D})} = \frac{0.01\cdot 0.90}{0.029} = 0.31.
\end{aligned}
$$
We see that despite the prior probability of a high-intensity earthquake being small (i.e., Pr(X = I) = 0.02), the posterior is significantly higher because of the observation indicating the presence of structural damage.
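The same update generalizes to any number of discrete states; as a minimal sketch, the function below reuses the prior and likelihood values from this example.

```python
# Minimal sketch: discrete Bayes update over the three earthquake intensities.
def discrete_posterior(prior, likelihood):
    """prior[x] = p(x), likelihood[x] = p(D | x); returns p(x | D)."""
    evidence = sum(likelihood[x] * prior[x] for x in prior)   # p(D) = 0.029 here
    return {x: likelihood[x] * prior[x] / evidence for x in prior}

prior = {"L": 0.90, "M": 0.08, "I": 0.02}
lik_damage = {"L": 0.01, "M": 0.10, "I": 0.60}
print(discrete_posterior(prior, lik_damage))   # {'L': 0.31, 'M': 0.28, 'I': 0.41}
```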
6.3 Continuous State Variables
This section presents examples of applications of hidden-state estimation for continuous cases. Compared with discrete cases, continuous ones involve more abstract concepts in the definition of the likelihood f(D|x) and evidence f(D). As for the discrete case, if our model is y = x and we observe y, then we know all there is to know about x, so the conditional probability of y given x is nonzero only for the true value x̌. In this section, we will cover cases involving imperfect observations rather than perfect ones.
6.3.1 Likelihood: f(D|x)
In practice, we are interested in the case where x̌ is a true yet unknown hidden-state variable and where we have access to imprecise observations yᵢ, where the observation model is either
$$
\begin{aligned}
y_i &= \check{x} + v_i && \text{(direct observation)} \\
y_i &> \check{x} + v_i && \text{(upper-bound censored observation)} \\
y_i &< \check{x} + v_i && \text{(lower-bound censored observation)},
\end{aligned}
$$
where in all three cases the observation error is vᵢ : Vᵢ ∼ N(v; 0, σ²_V), with Vᵢ ⊥⊥ Vⱼ, ∀i ≠ j. A direct observation corresponds to the standard case where the observation is the true state x̌ to which an observation error v is added. V is a Normal random variable employed to describe observation errors. V is often chosen to be Normal in order to simplify the problems by maintaining analytical tractability for linear models (see §4.1.3). Lower- and upper-bound observations are called censored observations, where we only know that the hidden-state variable is, respectively, greater or smaller than the observed value y.

Note: Vᵢ ⊥⊥ Vⱼ, ∀i ≠ j indicates that the random variable Vᵢ is independent of Vⱼ, for all i ≠ j (i.e., that all observation errors are independent of each other).
Direct observations. For a direct observation y, the likelihood is formulated from the additive observation model so that the PDF of Y is described by the Normal PDF Y ∼ N(y; x̌, σ²_V). Because x̌ is unknown, the likelihood, denoted by f(y|x) or L(x|y), describes the prior probability density of observing {Y = y} given {x̌ = x},
$$
f(y|x) = \mathcal{L}(x|y) = \frac{1}{\sqrt{2\pi}\,\sigma_V}\exp\!\left(-\frac{1}{2}\left(\frac{y-x}{\sigma_V}\right)^{2}\right). \qquad (6.2)
$$

Note (observation model): Y = x̌ + V; observation: N(y; x̌, σ²_V); hidden state: constant; measurement error: N(v; 0, σ²_V).

The leftmost surface in figure 6.7 corresponds to equation 6.2 plotted with respect to both x and y, for σ_V = 1. The pink dashed curve represents the explicit evaluation of f(y|x = 0), where the evaluation at x = 0 is arbitrarily chosen for illustration purposes, and the red solid curve reverses the relationship to represent the likelihood of the observation y = −1 for any value x, where f(y = −1|x) = L(x|y = −1).
Figure 6.7: Representation of the likelihood function for a direct observation. The leftmost surface describes f(y|x), the rightmost graph describes f(y = −1|x) = L(x|y = −1), and the center graph describes f(y|x = 0).
Keep in mind that the likelihood is not a PDF with respect to x, so it does not integrate to 1. Accordingly, the rightmost graph in figure 6.7 is a function quantifying, for each value x, how likely it is to observe the value y = −1.
Upper-bound observations. For an upper-bound observation y > x̌ + v, the likelihood is described by the cumulative distribution function (CDF) of Y ∼ N(y; x, σ²_V),
$$
f(y|x) = \overbrace{\int_{-\infty}^{y}\frac{1}{\sqrt{2\pi}\,\sigma_V}\exp\!\left(-\frac{1}{2}\left(\frac{y'-x}{\sigma_V}\right)^{2}\right)\mathrm{d}y'}^{\text{CDF of }\mathcal{N}(y;\,x,\,\sigma_V^2)} = \frac{1}{2} + \frac{1}{2}\,\mathrm{erf}\!\left(\frac{y-x}{\sqrt{2}\,\sigma_V}\right). \qquad (6.3)
$$
The surface in figure 6.8 corresponds to equation 6.3 plotted with respect to x and y, for σ_V = 1. The pink dashed curve represents the explicit evaluation of f(y|x = 0), where the evaluation at x = 0 is arbitrarily chosen for illustration purposes, and the red solid curve reverses the relationship to represent the likelihood of the observation y = −1, given any value x, L(x|y = −1).
Figure 6.8: Representation of the likelihood function for an upper-bound censored observation. The leftmost surface describes f(y|x), the rightmost graph describes f(y = −1|x) = L(x|y = −1), and the center graph describes f(y|x = 0).
Lower-bound observations. For a lower-bound observation y < x̌ + v, the likelihood is described by the complement of the cumulative distribution function of Y ∼ N(y; x, σ²_V),
$$
f(y|x) = 1 - \overbrace{\int_{-\infty}^{y}\frac{1}{\sqrt{2\pi}\,\sigma_V}\exp\!\left(-\frac{1}{2}\left(\frac{y'-x}{\sigma_V}\right)^{2}\right)\mathrm{d}y'}^{\text{CDF of }\mathcal{N}(y;\,x,\,\sigma_V^2)} = 1 - \left(\frac{1}{2} + \frac{1}{2}\,\mathrm{erf}\!\left(\frac{y-x}{\sqrt{2}\,\sigma_V}\right)\right). \qquad (6.4)
$$
The surface in figure 6.9a corresponds to equation 6.4 plotted with respect to x and y, for σ_V = 1. The pink dashed curve represents the explicit evaluation of f(y|x = 0), where the evaluation at x = 0 is arbitrarily chosen for illustration purposes, and again the red curve reverses the relationship to represent the likelihood of the observation y = −1, given any value x, L(x|y = −1). Note that in figure 6.9b, if instead of having σ_V = 1, σ_V tends to zero, the smooth transition from low likelihood values to high ones would become a sharp jump from zero to one.
Figure 6.9: Representation of the likelihood function for a lower-bound censored observation, for (a) σ_V = 1 and (b) σ_V = 0. The leftmost surface describes f(y|x), the rightmost graph describes f(y = −1|x) = L(x|y = −1), and the center graph describes f(y|x = 0).
Note: The notion of conditional independence implies that, given x, Yᵢ is independent of Yⱼ, that is,
$$
Y_i \perp\!\!\!\perp Y_j \,|\, x \iff f(y_i|y_j, x) = f(y_i|x),
$$
so that f(yᵢ, yⱼ|x) = f(yᵢ|x) · f(yⱼ|x). This assumption is often satisfied because the observation model follows the form y = x + v, where v is the realization of an observation error so that Vᵢ ⊥⊥ Vⱼ, ∀i ≠ j.
Multiple observations. All cases above dealt with a single observation. When multiple observations are available, one must define the joint likelihood for all observations given x. With the assumption that observations are conditionally independent given x, the joint PDF is obtained from the product of the marginals,
$$
f(\mathcal{D}|x) = f(Y_1 = y_1, \cdots, Y_{\mathtt{D}} = y_{\mathtt{D}}\,|\,x) = \prod_{i=1}^{\mathtt{D}} f(Y_i = y_i|x) = \exp\!\left(\sum_{i=1}^{\mathtt{D}}\ln f(Y_i = y_i|x)\right). \qquad (6.5)
$$
Note that practical difficulties related to numerical underflow or overflow arise when trying to evaluate the product of several marginal likelihoods f(yᵢ|x). This is why, in equation 6.5, the product of the marginals is replaced with the equivalent yet numerically more robust formulation using the exponential of the sum of the log marginals.
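As a sketch of how equations 6.2 to 6.5 translate into code, the snippet below evaluates the joint log-likelihood of a mixed set of direct and censored observations over a grid of candidate x values, working in log space to avoid the underflow issue mentioned above. The observation values and σ_V = 1 are illustrative assumptions, not taken from the text.

```python
# Minimal sketch: joint log-likelihood for direct and censored observations (eqs. 6.2-6.5).
import numpy as np
from scipy.stats import norm

sigma_v = 1.0                      # assumed observation-error standard deviation
x_grid = np.linspace(-5, 5, 201)   # candidate values for the hidden state x

def log_lik(y, kind, x):
    """Marginal log-likelihood ln f(y|x) for one observation."""
    if kind == "direct":       # eq. 6.2: Normal PDF
        return norm.logpdf(y, loc=x, scale=sigma_v)
    if kind == "upper":        # eq. 6.3: Normal CDF
        return norm.logcdf(y, loc=x, scale=sigma_v)
    if kind == "lower":        # eq. 6.4: complement of the Normal CDF
        return norm.logsf(y, loc=x, scale=sigma_v)
    raise ValueError(kind)

# Illustrative data set: two direct observations and one lower-bound observation.
data = [(-1.0, "direct"), (0.0, "direct"), (0.5, "lower")]

# eq. 6.5: sum the log marginals instead of multiplying the marginals.
joint_log_lik = sum(log_lik(y, kind, x_grid) for y, kind in data)
joint_lik = np.exp(joint_log_lik)   # f(D|x) evaluated on the grid
```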
6.3.2 Evidence: f(D)
Figure 6.10: Examples of application of the rectangle integration rule for (a) S = 7, (b) S = 10, (c) S = 20, and (d) S = 100 bins of width Δx_s = 10/S; the resulting approximations are f̂(D) = 0.2158, 0.2198, 0.2197, and 0.2197, respectively.
The evidence acts as a normalization constant obtained by integrating the product of the likelihood and the prior over all possible values of x,
$$
f(\mathcal{D}) = \int f(\mathcal{D}|x)\cdot f(x)\,\mathrm{d}x.
$$
This integral is typically difficult to evaluate because it is not analytically tractable. Figure 6.10 presents a naive way of approaching this integral using the rectangle rule, where the domain containing a significant probability content is subdivided into S bins of width Δx_s in order to approximate
$$
\hat{f}(\mathcal{D}) = \sum_{s=1}^{\mathtt{S}} f(\mathcal{D}|x_s)\cdot f(x_s)\,\Delta x_s.
$$
This approach is reasonable for trivial 1-D problems; however, it becomes computationally intractable for problems in larger dimensions (i.e., for x ∈ ℝ^X) because the number of grid combinations to evaluate increases exponentially with the number of dimensions X to integrate over. Section 6.5 presents an introduction to sampling methods that allow overcoming this challenge. Chapter 7 will introduce more advanced sampling methods allowing the estimation of the normalization constants for high-dimensional domains.
6.3.3 Posterior: f(x|D)
The posterior is the product of the likelihood function times the prior PDF, divided by the normalization constant,
$$
f(x|\mathcal{D}) = \frac{f(\mathcal{D}|x)\cdot f(x)}{f(\mathcal{D})}.
$$
Let us consider two direct observations D = {y₁ = −1, y₂ = 0} obtained from the observation model
$$
y_i = x + v_i, \quad v_i : V_i \sim \mathcal{N}(v;\, 0,\, 1^2), \quad V_i \perp\!\!\!\perp V_j,\ \forall i \neq j.
$$
The likelihood is thus given by f(D|x) = N(y₁; x, 1²) · N(y₂; x, 1²). We assume the prior PDF follows a uniform PDF within the range from −5 to 5, f(x) = U(x; −5, 5). As presented in figure 6.10, the normalization constant is estimated using the rectangle integration rule to be f̂(D) = 0.2197. Figure 6.11 presents the prior, the likelihood, and the posterior for this example.
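A minimal sketch of this two-observation example is given below: it evaluates the unnormalized posterior on a grid and applies the rectangle rule. The printed evidence depends on how the uniform prior is normalized, so the number is only indicative; the posterior mean and variance are unaffected by that constant.

```python
# Minimal sketch: rectangle-rule evidence and posterior for D = {y1 = -1, y2 = 0}.
import numpy as np
from scipy.stats import norm, uniform

y_obs = np.array([-1.0, 0.0])
S = 100                                     # number of bins
x_s = np.linspace(-5, 5, S)                 # bin centers (approximately)
dx = x_s[1] - x_s[0]                        # bin width

prior = uniform.pdf(x_s, loc=-5, scale=10)  # U(x; -5, 5)
lik = np.prod([norm.pdf(y, loc=x_s, scale=1.0) for y in y_obs], axis=0)  # f(D|x)

evidence = np.sum(lik * prior * dx)         # rectangle-rule approximation of f(D)
posterior = lik * prior / evidence          # f(x|D) on the grid

post_mean = np.sum(x_s * posterior * dx)                      # ~ -0.5
post_var = np.sum((x_s - post_mean) ** 2 * posterior * dx)    # ~ 0.5
print(evidence, post_mean, post_var)
```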
Figure 6.11: Example of prior, likelihood, and posterior.
6.3.4 Number of Observations and Identifiability
In the extreme case where no observation is available (i.e., D = ∅), the posterior is equal to the prior, f(x|∅) = f(x). In order to explore what happens at the other end of the spectrum, where an infinite amount of data is available, it is required to distinguish between identifiable and non-identifiable problems. A problem is identifiable when, given an infinite number of direct observations, we are theoretically able to retrieve the true hidden state x̌ of a deterministic system. For non-identifiable problems, even an infinite number of direct observations does not allow retrieving the true hidden state x̌ of the system, because multiple equally valid solutions exist.

Figure 6.12: Example of two streams merging in a single river. (Photo: Ashwini Chaudhary)
Example: Non-identifiable problem. Take the example illustrated in figure 6.12, where we want to estimate the contaminant concentrations in two streams, x₁ and x₂, and where we only have access to contaminant concentration observations
$$
y_i = x_1 + x_2 + v_i, \quad v_i : V_i \sim \mathcal{N}(v;\, 0,\, 2^2), \quad V_i \perp\!\!\!\perp V_j,\ \forall i \neq j,
$$
which are collected at the output of the river after both streams have merged. We assume that our prior knowledge is uniform over positive concentration values, that is, f(x₁, x₂) ∝ 1. The resulting posteriors are presented in figure 6.13 for D = 1 and D = 100 observations, along with the true values marked by a red cross. This problem is intrinsically non-identifiable because, even when the epistemic uncertainty has been removed by a large number of observations, an infinite number of combinations of x₁ and x₂ remain possible.
Figure 6.13: Posterior contours for an example of a non-identifiable problem, with (a) D = 1 and (b) D = 100 observations and a prior f(x₁, x₂) ∝ 1.
Non-identifiable problems can be regularized using prior knowledge. If, for example, we know that the prior probability density for the contaminant concentration in the first stream follows f(x₁) = N(x₁; 10, 5²), the posterior will now have a single mode, as presented in figure 6.14.

Figure 6.14: Posterior contours for an example of a non-identifiable problem for D = 1, which is constrained by the prior knowledge f(x₁, x₂) ∝ N(x₁; 10, 5²).
Well-constrained identifiable problems. For well-constrained problems where a true value x̌ exists, given an infinite number of direct observations (D → ∞) that are conditionally independent given x, the posterior tends to a Dirac delta function, which is nonzero only for x = x̌,
$$
f(x|\mathcal{D}) = \frac{f(\mathcal{D}|x)\cdot f(x)}{f(\mathcal{D})} \rightarrow \underbrace{\delta(x - \overbrace{\check{x}}^{\text{true value}})}_{\text{Dirac delta function}}.
$$
Figure 6.15 presents an example of such a case where the posterior has collapsed over a single value. The posterior expected value then corresponds to the true value,
$$
\mathbb{E}[X|\mathcal{D}] \rightarrow \check{x},
$$
and the posterior variance tends to zero because no epistemic uncertainty remains about the value of x,
$$
\mathrm{var}[X|\mathcal{D}] \rightarrow 0.
$$
Note that in most practical cases we do not have access to an infinite number of observations. Moreover, keep in mind that the less data we have access to, the more important it is to consider the uncertainty in the posterior f(x|D).

Figure 6.15: Example of posterior PDF tending toward the Dirac delta function.
6.4 Parameter Estimation
In this section, we will explore how to extend Bayesian estimation to the parameters θ defining the probability density function of a hidden random variable x : X ∼ f(x) ≡ f(x; θ). This setup is relevant when the system responses are stochastic. For example, even in the absence of observation errors, different samples from a concrete batch made from the same mix of components would have different resistances due to the uncertainty associated with the intrinsic heterogeneity of the material. The posterior probability density function sought is now
$$
f(\theta|\mathcal{D}) = \frac{f(\mathcal{D}|\theta)\cdot f(\theta)}{f(\mathcal{D})},
$$
where f(θ) is the prior PDF of the unknown parameters, f(D|θ) is the likelihood of the observations D = {y₁, ···, y_D}, and f(θ|D) is the posterior PDF for the unknown parameters θ. After having estimated the posterior PDF for the parameters, f(θ|D), we can marginalize this uncertainty in f(x; θ) in order to obtain the posterior predictive PDF for the hidden variable X,
$$
f(x|\mathcal{D}) = \int f(x;\theta)\cdot f(\theta|\mathcal{D})\,\mathrm{d}\theta. \qquad (6.6)
$$

Note (notation): xᵢ : X ∼ f_X(x; θ); θ: the PDF's parameters; {x₁, x₂, ···, x_D}: realizations of the hidden variable X.

Figure 6.16: Example of Bayesian parameter estimation.

Figure 6.16 presents an example where we want to estimate the joint posterior probability of the mean µ_X and standard deviation σ_X defining the PDF of X, which describes the variability in the resistance of a given concrete mix. Here, the concrete resistance X is assumed to be a random variable. This means that, unlike in previous sections, no true value exists for the resistance because it is now a stochastic quantity. Despite this, true values can still exist for the parameters of f(x; θ), that is, {µ̌_X, σ̌_X}. This example will be further explored in §6.5.3.
6.4.1 Prior: f(θ)
f(θ) describes our prior knowledge for the values that θ can take. Prior knowledge can be based on heuristics (expert knowledge), on the posterior PDF obtained from previous data, or on a non-informative prior (i.e., absence of prior knowledge). A non-informative prior for parameters such as the mean of a Normal PDF is described by a uniform density f(µ) ∝ 1. For standard deviations, the non-informative prior typically assumes that all orders of magnitude for σ have equal probabilities. This hypothesis implies that f(ln σ) ∝ 1. By using the change of variable rule presented in §3.4, it leads to f(σ) ∝ 1/σ, because the derivative of this transformation is d ln σ/dσ = 1/σ. Note that the two non-informative priors described for µ and σ are improper because ∫ f(θ) dθ = ∞, which does not respect the fundamental requirement that a PDF integrates to 1, as presented in §3.3. For this reason we cannot draw samples from an improper prior. Nevertheless, when used as a prior, an improper PDF can yield a proper posterior (see §7.5).

Note (non-informative prior): for θ = {µ_X, σ_X} and X ∼ N(x; µ_X, σ²_X): f(µ_X) ∝ 1; f(σ_X) ∝ 1/σ_X; f(µ_X, σ_X) ∝ 1/σ_X, if µ_X ⊥⊥ σ_X.
6.4.2 Likelihood: f(D|θ)
f(D|θ) ≡ f(Y = y|θ) ≡ f(y|θ) ≡ L(θ|y) are all equivalent ways of describing the likelihood of observing specific values y, given θ. In the special case where y = x, so that realizations of x : X ∼ f(x; θ) are directly observed, the likelihood is
$$
f(y|\theta) = f(y;\theta) = f(x;\theta).
$$
In a more general case where the observations D = {yᵢ, ∀i ∈ {1 : D}} are realizations of an additive observation model yᵢ = xᵢ + vᵢ, where vᵢ : V ∼ N(v; 0, σ²_V) and xᵢ : X ∼ N(x; µ_X, σ²_X), the likelihood is
$$
f(y|\theta) = \mathcal{N}(y;\, \mu_X,\, \sigma_X^2 + \sigma_V^2).
$$

Note (observation model): Y = X + V; observation: N(y; µ_X, σ²_X + σ²_V); hidden-state random variable: N(x; µ_X, σ²_X); measurement error: N(v; 0, σ²_V).

This choice of an additive observation model using Normal random variables allows having an analytically tractable formulation for the likelihood. If we want to employ a multiplicative observation model, y = x · v, then choosing v : V ∼ ln N(v; 0, σ²_{ln V}) and x : X ∼ ln N(x; µ_{ln X}, σ²_{ln X}) as lognormal random variables also preserves the analytical tractability for the likelihood, which follows
$$
f(y|\theta) = \ln\mathcal{N}(y;\, \mu_{\ln X},\, \sigma_{\ln X}^2 + \sigma_{\ln V}^2).
$$
6.4.3 Posterior PDF: f(θ|D)
In §6.3.4, we saw that when the number of independent observations D → ∞, the posterior for X tends to a Dirac delta function, f(x|D) → δ(x − x̌). The situation is different for the estimation of the parameters describing the PDF of a hidden random variable (R.V.); when D → ∞, it is now the posterior PDF for θ that tends to a Dirac delta function, as presented in figure 6.17,
$$
f(\theta|\mathcal{D}) = \frac{f(\mathcal{D}|\theta)\cdot f(\theta)}{f(\mathcal{D})} \rightarrow \underbrace{\delta(\theta - \overbrace{\check{\theta}}^{\text{true value}})}_{\text{Dirac delta function}},
$$
where the posterior expected value tends toward the true parameter values and the variance of the posterior tends to zero,
$$
\mathbb{E}[\theta|\mathcal{D}] \rightarrow \check{\theta}, \qquad \mathrm{var}[\theta|\mathcal{D}] \rightarrow 0.
$$
In such a situation, the epistemic uncertainty associated with the parameter values θ vanishes and the uncertainty about X reaches its irreducible level, f(x|D) → f(x; θ̌).

Note: The term independent observations is employed as a shortcut for the more rigorous term, which should in this case be conditionally independent given x.

Figure 6.17: The posterior PDF for parameters reduces to a Dirac delta function when the number of independent observations tends to infinity.
In practical cases where the number of observations D < ∞, uncertainty remains about the parameters, f(θ|D), so f(x|D) is typically more diffused than f(x; θ̌). The uncertainty associated with the parameters, f(θ|D), can be marginalized as described in equation 6.6 in order to obtain the posterior predictive PDF for the hidden random variable. This integral is typically not analytically tractable, so it is common to resort to the Monte Carlo sampling methods presented in the next section in order to compute the posterior predictive expected value and variance for f(x|D).
6.5 Monte Carlo
Monte Carlo sampling is a numerical integration method that allows performing Bayesian estimation for analytically intractable cases. In this section, we will first see how Monte Carlo can be employed as an integration method. Then, we will explore how this method fits in the context of Bayesian estimation.
6.5.1 Monte Carlo Integration
We employ the example of a unit-diameter circle with area a = πr² = 0.785 to demonstrate how we can approximate this area with Monte Carlo numerical integration. A circle of diameter d = 1 (r = 0.5) is described by the equation
$$
(x - 0.5)^2 + (y - 0.5)^2 = r^2.
$$
This equation is employed in the definition of the indicator function
$$
\mathbb{I}(x,y) = \begin{cases} 1 & \text{if } (x-0.5)^2 + (y-0.5)^2 \leq r^2 \\ 0 & \text{otherwise.} \end{cases}
$$
We then define a bivariate uniform probability density in the range {x, y} ∈ (0, 1), where f_XY(x, y) = 1. The circle area a can be formulated as the integral of the product of the indicator function times f_XY(x, y),
$$
\underbrace{a}_{\text{area}} = \int_y \int_x \mathbb{I}(x,y)\cdot f_{XY}(x,y)\,\mathrm{d}x\,\mathrm{d}y = \mathbb{E}[\mathbb{I}(X,Y)]. \qquad (6.7)
$$

Figure 6.18: Examples of Monte Carlo numerical integration for which the theoretical answer is a = πr² = 0.785: (a) S = 100, Ê[I(X, Y)] = 0.800; (b) S = 1000, Ê[I(X, Y)] = 0.759; (c) S = 10000, Ê[I(X, Y)] = 0.785.
We saw in §3.3.5 that the integral in equation 6.7 corresponds to the expectation of the indicator function I(X, Y). Given S random samples {x_s, y_s}, s ∈ {1, 2, ···, S}, generated from the uniform probability density f_XY(x, y), the expectation operation is defined as
$$
a = \mathbb{E}[\mathbb{I}(X,Y)] = \lim_{\mathtt{S}\to\infty} \frac{1}{\mathtt{S}}\sum_{s=1}^{\mathtt{S}} \mathbb{I}(x_s, y_s).
$$
In practice, because the number of samples S is finite, our estimation will remain an approximation of the expectation,
$$
a \approx \hat{\mathbb{E}}[\mathbb{I}(X,Y)] = \frac{1}{\mathtt{S}}\sum_{s=1}^{\mathtt{S}} \mathbb{I}(x_s, y_s).
$$
Figure 6.18 presents realizations for S = {10², 10³, 10⁴} samples as well as their approximations of the area a. As the number of samples increases, the approximation error decreases and Ê[I(X, Y)] tends to the true value of a = πr² = 0.785.

Note: The hat in Ê[I(X, Y)] denotes that the quantity is an approximation of E[I(X, Y)].

The interesting property of Monte Carlo integration is that the estimation quality, measured by the variance var[Ê[I(X, Y)]], depends on the number of samples and not on the number of dimensions.⁴ This property is key for the application of the Monte Carlo method in high-dimensional domains.

⁴ MacKay, D. J. C. (1998). Introduction to Monte Carlo methods. In Learning in graphical models, pp. 175–204. Springer.
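A minimal sketch of this Monte Carlo integration is shown below; it draws uniform samples in the unit square and averages the indicator function.

```python
# Minimal sketch: Monte Carlo estimate of the area of a circle with diameter 1.
import numpy as np

rng = np.random.default_rng(0)
S = 10_000
x, y = rng.uniform(0, 1, size=(2, S))                   # samples from f_XY(x, y) = 1 on (0, 1)^2

inside = (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.5 ** 2    # indicator I(x, y)
a_hat = inside.mean()                                   # E_hat[I(X, Y)] ~ pi * 0.5^2 = 0.785
print(a_hat)
```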
6.5.2 Monte Carlo Sampling: Continuous State Variables
Note (Monte Carlo sampling): h(x) is the sampling PDF; x_s : X ∼ h(x), s ∈ {1, 2, ···, S}.

Monte Carlo methods are now applied to the Bayesian estimation context presented in §6.3, where we are interested in estimating continuous state variables. The Monte Carlo integration method is first employed for approximating the evidence
$$
f(\mathcal{D}) = \int f(\mathcal{D}|x)\cdot f(x)\,\mathrm{d}x = \mathbb{E}[f(\mathcal{D}|X)].
$$
A Monte Carlo approximation of the evidence can be computed using importance sampling, where
$$
\hat{f}(\mathcal{D}) = \frac{1}{\mathtt{S}}\sum_{s=1}^{\mathtt{S}} \frac{f(\mathcal{D}|x_s)\cdot f(x_s)}{h(x_s)}, \qquad x_s : X \sim h(x)\ \text{(sampling PDF)},
$$
where samples x_s : X ∼ h(x) are drawn from any sampling distribution such that h(x) > 0 for any x for which the PDF on the numerator, f(D|x) · f(x), is greater than 0. If we choose the prior PDF as the sampling distribution, h(x) = f(x), the formulation simplifies to
$$
\hat{f}(\mathcal{D}) = \frac{1}{\mathtt{S}}\sum_{s=1}^{\mathtt{S}} f(\mathcal{D}|x_s), \qquad x_s : X \sim h(x) = f(x),
$$
because the prior and the sampling distribution cancel out.

The performance of the Monte Carlo integration depends on how close our sampling distribution h(x) is to the posterior f(x|D). Note that sampling from the prior, h(x) = f(x), is often not a viable option when the number of observations D is large. This is because in such a case, the likelihood is the product of several terms, so the probability content typically ends up being concentrated in a small region that is not well described by the prior, as illustrated in figure 6.19.

Figure 6.19: Example of a difficult situation encountered when trying to sample from the prior.

The posterior mean and variance can be approximated with Monte Carlo integration using
$$
\begin{aligned}
\mathbb{E}[X|\mathcal{D}] &= \int x\cdot f(x|\mathcal{D})\,\mathrm{d}x \\
&\approx \frac{1}{\mathtt{S}}\sum_{s=1}^{\mathtt{S}} x_s\cdot \frac{f(\mathcal{D}|x_s)\cdot f(x_s)}{f(\mathcal{D})}\cdot \frac{1}{h(x_s)}, && x_s : X \sim h(x) \\
&\approx \frac{1}{\mathtt{S}}\sum_{s=1}^{\mathtt{S}} x_s\cdot \frac{f(\mathcal{D}|x_s)}{f(\mathcal{D})}, && x_s : X \sim h(x) = f(x),
\end{aligned}
$$
and
$$
\begin{aligned}
\mathrm{var}[X|\mathcal{D}] &= \int (x - \mathbb{E}[X|\mathcal{D}])^2\cdot f(x|\mathcal{D})\,\mathrm{d}x \\
&\approx \frac{1}{\mathtt{S}-1}\sum_{s=1}^{\mathtt{S}} (x_s - \mathbb{E}[X|\mathcal{D}])^2\cdot \frac{f(\mathcal{D}|x_s)\cdot f(x_s)}{f(\mathcal{D})}\cdot \frac{1}{h(x_s)}, && x_s : X \sim h(x) \\
&\approx \frac{1}{\mathtt{S}-1}\sum_{s=1}^{\mathtt{S}} (x_s - \mathbb{E}[X|\mathcal{D}])^2\cdot \frac{f(\mathcal{D}|x_s)}{f(\mathcal{D})}, && x_s : X \sim h(x) = f(x).
\end{aligned}
$$
For both the posterior mean and variance, when choosing the prior as the sampling distribution, h(x) = f(x), it cancels out from the formulation.
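As a sketch, the snippet below applies these estimators by sampling from the prior, h(x) = f(x), for the two-observation example of §6.3.3 with a uniform prior on (−5, 5); the sample size is an arbitrary choice.

```python
# Minimal sketch: Monte Carlo estimation of f(D), E[X|D], and var[X|D]
# using the prior as the sampling distribution, h(x) = f(x).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
y_obs = np.array([-1.0, 0.0])                    # direct observations, sigma_V = 1
S = 100_000
x_s = rng.uniform(-5, 5, size=S)                 # samples from the prior U(-5, 5)

# Joint likelihood f(D|x_s), computed as exp of the sum of log marginals (eq. 6.5).
log_lik = norm.logpdf(y_obs[:, None], loc=x_s, scale=1.0).sum(axis=0)
lik = np.exp(log_lik)

f_D = lik.mean()                                 # evidence estimate with h(x) = f(x)
w = lik / f_D                                    # importance weights
post_mean = np.mean(x_s * w)                     # ~ -0.5
post_var = np.sum((x_s - post_mean) ** 2 * w) / (S - 1)   # ~ 0.5
print(f_D, post_mean, post_var)
```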
Figure 6.20: Example of Bayesian estimation with censored observations. (Photo: Elevate)
Example: Concentration estimation. Figure 6.20 depicts a laboratory test for measuring a deterministic mercury concentration x̌. Observations follow the observation model
$$
\begin{cases}
y_i = \check{x} + v_i, & \text{if } \check{x} + v_i \geq 10\ \mu\text{g/L} \\
y_i = \text{error}, & \text{if } \check{x} + v_i < 10\ \mu\text{g/L},
\end{cases}
$$
where the measuring device has a minimum threshold and the mutually independent observation errors are described by
$$
v_i : V_i \sim \mathcal{N}(v;\, 0,\, 2^2), \quad V_i \perp\!\!\!\perp V_j,\ \forall i \neq j.
$$
We want to estimate the posterior PDF describing the concentration if the observation obtained is D = {error}, for two hypotheses regarding the prior knowledge: (1) uniform in the interval (0, 50), f(x) = U(x; 0, 50), and (2) a Normal PDF with mean 25 and standard deviation 10, f(x) = N(x; 25, 10²). Figure 6.21 presents the results for the (a) uniform prior and (b) Normal prior. Note that only 100, out of the 10⁵ Monte Carlo samples that were employed to estimate the evidence f(D), are displayed. This example illustrates how information can be extracted from censored observations. Note also how the choice of prior influences the posterior knowledge. The choice of prior is particularly important here because limited information is carried by the censored observations.
Figure 6.21: Comparative example of a uniform and a Normal prior PDF for performing Bayesian estimation with censored observations: (a) uniform prior, U(0, 50), f(D) ≈ 0.20, E[x|D] = 5.2, var[x|D] = 10.4; (b) Normal prior, N(25, 10²), f(D) ≈ 0.07, E[x|D] = 6.2, var[x|D] = 18.0.
We now consider a different context where we employ D = 100 direct observations, D = {25.4, 23.6, 24.1, ···, 24.3}, and the prior is f(x) = U(x; 0, 50). The prior, likelihood, and posterior are presented in figure 6.22. Note that here, out of the 100 Monte Carlo samples displayed, none lie in the high-probability region of the likelihood. As a result, this leads to a poor approximation of the evidence, where the true value is f(D) = 1.19 × 10⁻⁷⁸ and where the approximation is
$$
\hat{f}(\mathcal{D}) = \frac{1}{\mathtt{S}}\sum_{s=1}^{\mathtt{S}} \left(\prod_{j=1}^{100} f(y_j|x_s)\right) = 1.95 \times 10^{-83}.
$$
Although it is possible to solve this problem by increasing the number of samples, this approach is computationally inefficient and even becomes impossible for high-dimensional domains. Chapter 7 presents advanced Monte Carlo methods for solving this limitation.

Figure 6.22: Example where drawing Monte Carlo samples from the prior leads to a poor estimate of the posterior because no sample falls in the high-probability region.
6.5.3 Monte Carlo Sampling: Parameter Estimation
For D observations D = {y₁, y₂, ···, y_D} and samples θ_s ∼ h(θ) = f(θ), s ∈ {1, 2, ···, S}, drawn from the prior PDF of the parameters, we can estimate the evidence
$$
\hat{f}(\mathcal{D}) = \frac{1}{\mathtt{S}}\sum_{s=1}^{\mathtt{S}} \left(\prod_{j=1}^{\mathtt{D}} f(y_j|\theta_s)\right) \qquad (6.8)
$$
as well as the posterior expected value and covariance for the parameters,
$$
\begin{aligned}
\hat{\mathbb{E}}[\theta|\mathcal{D}] &= \frac{1}{\mathtt{S}}\sum_{s=1}^{\mathtt{S}} \theta_s\cdot \left(\frac{\prod_{j=1}^{\mathtt{D}} f(y_j|\theta_s)}{f(\mathcal{D})}\right) \\
\widehat{\mathrm{cov}}(\theta|\mathcal{D}) &= \frac{1}{\mathtt{S}-1}\sum_{s=1}^{\mathtt{S}} (\theta_s - \mathbb{E}[\theta|\mathcal{D}])(\theta_s - \mathbb{E}[\theta|\mathcal{D}])^{\intercal}\cdot \left(\frac{\prod_{j=1}^{\mathtt{D}} f(y_j|\theta_s)}{f(\mathcal{D})}\right).
\end{aligned}
$$
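The sketch below applies these estimators to a generic setup that is assumed for illustration (it is not the concrete example that follows): a hidden variable X ∼ N(x; µ_X, σ²_X) observed through additive noise with known σ_V, so that each marginal likelihood is N(y; µ_X, σ²_X + σ²_V) as in §6.4.2. The likelihoods are rescaled before exponentiation, which leaves the normalized weights unchanged while avoiding underflow.

```python
# Minimal sketch: Monte Carlo parameter estimation (eq. 6.8) with samples drawn from the prior.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
sigma_v = 1.0                                     # assumed known observation-error std
y = rng.normal(42.0, np.sqrt(2.0**2 + sigma_v**2), size=20)   # illustrative synthetic data

S = 200_000
mu_s = rng.uniform(30.0, 55.0, size=S)            # prior samples for mu_X (assumed prior)
sig_s = rng.uniform(0.5, 6.0, size=S)             # prior samples for sigma_X (assumed prior)

# Joint log-likelihood of all observations for each theta_s = (mu_s, sig_s).
scale = np.sqrt(sig_s**2 + sigma_v**2)
log_lik = norm.logpdf(y[:, None], loc=mu_s, scale=scale).sum(axis=0)

lik = np.exp(log_lik - log_lik.max())             # rescaled likelihood (constant cancels below)
w = lik / lik.sum()                               # weights proportional to prod_j f(y_j | theta_s)
post_mean = np.array([np.sum(w * mu_s), np.sum(w * sig_s)])   # approximates E[theta | D]
print(post_mean)                                  # should lie near the generating values (42, 2)
```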
Figure 6.23: Example of samples taken from the same concrete batch.

Example: Concrete strength. We consider the case where we want to characterize the resistance R of a given concrete batch, where the intrinsic variability is estimated by testing the resistance of several samples, as depicted in figure 6.23. We assume that the true yet unknown parameter values are µ̌_R = 42 MPa and σ̌_R = 2 MPa and that the resistance R is described by a log-normal random variable
$$
R \sim \ln\mathcal{N}(r;\, \check{\mu}_{\ln R},\, \check{\sigma}_{\ln R})\ \text{MPa}.
$$
The unknown parameters we want to learn about are θ = [µ_R σ_R]ᵀ. Here, R is a hidden random variable because it is not directly observed; only imprecise observations yᵢ are available. The set of observations D = {y₁, y₂, ···, y_D} is employed to estimate the posterior PDF for the parameters θ that control the PDF of the hidden random variable R,
$$
\underbrace{f(\mu_R, \sigma_R|\mathcal{D})}_{\text{posterior PDF}} = \frac{\overbrace{f(\mathcal{D}|\mu_R, \sigma_R)}^{\text{likelihood}}\cdot \overbrace{f(\mu_R, \sigma_R)}^{\text{prior PDF}}}{\underbrace{f(\mathcal{D})}_{\text{normalization constant}}}.
$$
The joint prior knowledge for the parameters, f(µ_R, σ_R) = f(µ_R) · f(σ_R) (µ_R ⊥⊥ σ_R), is obtained from the product of the marginal PDFs,
$$
f(\mu_R) = \underbrace{\ln\mathcal{N}(\mu_R;\, \mu_{\ln \mu_R},\, \sigma_{\ln \mu_R})}_{\mu_{\mu_R} = 45\ \text{MPa},\ \sigma_{\mu_R} = 7\ \text{MPa}}, \qquad f(\sigma_R) = \underbrace{\ln\mathcal{N}(\sigma_R;\, \mu_{\ln \sigma_R},\, \sigma_{\ln \sigma_R})}_{\mu_{\sigma_R} = 5\ \text{MPa},\ \sigma_{\sigma_R} = 2\ \text{MPa}}.
$$
The prior PDFs for µ_R and σ_R are themselves described by log-normal PDFs with respective means {µ_{µ_R}, µ_{σ_R}} and variances {σ²_{µ_R}, σ²_{σ_R}}.
Figure 6.24: Examples of Bayesian estimation as a function of the number of observations employed, D = {1, 5, 100}: each case shows the prior f(µ_R, σ_R), the likelihood f(D|µ_R, σ_R), the posterior f(µ_R, σ_R|D) with the true values {µ̌_R, σ̌_R} marked, and the posterior predictive f(r|D) compared with the true PDF f(r; µ̌_R, σ̌_R).
The observation model is ln yᵢ = ln rᵢ + vᵢ, where the observation errors are vᵢ : V ∼ N(v; 0, 0.01²) MPa. With these assumptions regarding the observation model, the marginal likelihood for each observation is
$$
f(y|\theta_{\ln}) = f(y|\mu_{\ln R}, \sigma_{\ln R}) = \ln\mathcal{N}\!\Big(y;\, \mu_{\ln R},\, \big(\sigma_{\ln R}^2 + \underbrace{\sigma_V^2}_{=0.01^2}\big)^{1/2}\Big),
$$
and the joint likelihood for the entire set of observations is
$$
f(\mathcal{D}\,|\underbrace{\mu_{\ln R}, \sigma_{\ln R}}_{\theta_{\ln} = \text{fct}(\theta)}) = \underbrace{\prod_{j=1}^{\mathtt{D}} \frac{1}{y_j\sqrt{2\pi}\sqrt{\sigma_{\ln R}^2 + 0.01^2}}\exp\!\left(-\frac{1}{2}\left(\frac{\ln y_j - \mu_{\ln R}}{\sqrt{\sigma_{\ln R}^2 + 0.01^2}}\right)^{2}\right)}_{\text{log-normal PDF}}.
$$
Note that the marginal and joint likelihoods are defined as a function of the log-space parameters. In order to evaluate the likelihood for the parameters {µ_R, σ_R}, we employ the transformation functions given in equation 4.4. Note that the likelihood f(y|θ) describes a probability density only for y and not for θ. Therefore, this re-parametrization θ_ln = fct(θ) of the likelihood does not require using the change of variable rule we saw in §3.4.

Figure 6.24 presents the prior PDF for the parameters, the likelihood, the posterior, and the posterior predictive for (a) D = 1, (b) D = 5, and (c) D = 100 observations. We see that, as the number of observations D increases, the likelihood and posterior for θ become concentrated around the true values, and the posterior predictive for R tends toward the true PDF with parameters µ_R → µ̌_R = 42 MPa and σ_R → σ̌_R = 2 MPa.
6.6 Conjugate Priors
The previous section went through the process of using numerical approximation methods for estimating the posterior PDF for the parameters of a hidden random variable X. This complex process is necessary when no analytical formulation exists to describe the posterior.

Figure 6.25: Thumbtacks described by a random variable with sample space S = {Head (H), Tail (T)}. (Photo: Michel Goulet)
One classic example of a conjugate prior is the Beta-Binomial. Take, for example, the probability that a thumbtack (see figure 6.25) lands on either Head or Tail, S = {Head (H), Tail (T)}. We can employ the Beta-Binomial conjugate prior to describe the posterior PDF for x = Pr(Head|D) ∈ (0, 1), given a set of observations D. It is suited in this case to assume that the prior PDF for x = Pr(Head) follows a Beta PDF,
$$
f(x) = \mathcal{B}(x;\, \alpha, \beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)},
$$
where B(α, β) is the normalization constant. In this case, observations are binary, so for a set containing D = D_H + D_T observations, heads are assigned the outcome H, D_H = {H, H, ···, H}, and tails are assigned the outcome T, D_T = {T, T, ···, T}. In that case, the likelihood describing the probability of observing D_H heads and D_T tails follows a Binomial distribution,
$$
\begin{aligned}
\Pr(\text{Head}|x) &= x \\
\Pr(\text{Tail}|x) &= 1 - x \\
\Pr(\mathcal{D}_{\text{H}}|x) &= x^{\mathtt{D}_{\text{H}}} \\
\Pr(\mathcal{D}_{\text{T}}|x) &= (1-x)^{\mathtt{D}_{\text{T}}} \\
\Pr(\mathcal{D}|x) &= x^{\mathtt{D}_{\text{H}}}(1-x)^{\mathtt{D}_{\text{T}}} \\
p(\mathcal{D}|x) &\propto x^{\mathtt{D}_{\text{H}}}(1-x)^{\mathtt{D}_{\text{T}}}.
\end{aligned}
$$
The multiplication of the prior and the likelihood remains analytically tractable so that
$$
f(x|\mathcal{D}) \propto x^{\mathtt{D}_{\text{H}}}(1-x)^{\mathtt{D}_{\text{T}}}\cdot \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)},
$$
and by recalculating the normalization constant, we find that the posterior also follows a Beta PDF with parameters α + D_H and β + D_T,
$$
f(x|\mathcal{D}) = \frac{x^{\alpha+\mathtt{D}_{\text{H}}-1}(1-x)^{\beta+\mathtt{D}_{\text{T}}-1}}{B(\alpha+\mathtt{D}_{\text{H}},\, \beta+\mathtt{D}_{\text{T}})} = \mathcal{B}(x;\, \alpha+\mathtt{D}_{\text{H}},\, \beta+\mathtt{D}_{\text{T}}).
$$
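In code, this conjugate update amounts to adding the head and tail counts to the Beta parameters; the sketch below uses arbitrary illustrative counts.

```python
# Minimal sketch: Beta-Binomial conjugate update for x = Pr(Head).
from scipy.stats import beta

alpha0, beta0 = 1.0, 1.0        # prior B(x; 1, 1), i.e., uniform on (0, 1)
n_heads, n_tails = 7, 3         # illustrative observation counts D_H and D_T

posterior = beta(alpha0 + n_heads, beta0 + n_tails)   # B(x; alpha + D_H, beta + D_T)
print(posterior.mean(), posterior.interval(0.95))     # posterior mean and 95% interval
```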
The analytic formulations for conjugate priors with their parameters, likelihood, posterior, and posterior predictive can be found online⁵ or in specialized publications such as the compendium by Murphy.⁶

⁵ Wikipedia (2017). Conjugate prior. URL https://en.wikipedia.org/wiki/Conjugate_prior. Accessed November 8, 2019.
⁶ Murphy, K. P. (2007). Conjugate Bayesian analysis of the Gaussian distribution. Citeseer.
Figure 6.26: Example of application of conjugate priors on traffic data. (Photo: John Matychuk)

Example: Traffic safety. We take as another example an intersection where the number of users is u = 3 × 10⁵/yr and where 20 accidents occurred over the last 5 years. Note that, for the purpose of simplicity, the number of passages without accident is taken as the total number of users rather than this number minus the number of accidents. This current information can be summarized as
$$
\mathcal{D}_{\text{old}} : \begin{cases}
\mathtt{D}_{\text{N}} = 5\cdot(3\times 10^{5}) & \text{(passages without accident, N)} \\
\mathtt{D}_{\text{A}} = 20 & \text{(accidents, A)}.
\end{cases}
$$
Our prior knowledge for x, the probability of an accident per user, is described by the Beta PDF f(x) = B(x; 1, 1), where the parameters are α₀ = β₀ = 1, which corresponds to a uniform distribution over the interval (0, 1). The posterior PDF describing the probability of an accident given the observations D_old is thus also described by the Beta PDF,
$$
f(x|\mathcal{D}_{\text{old}}) = \mathcal{B}(x;\, \alpha_0 + \mathtt{D}_{\text{A}},\, \beta_0 + \mathtt{D}_{\text{N}}).
$$

Figure 6.27: Posterior PDF for the probability of an accident before rehabilitation works.
Figure 6.27 presents the posterior PDF transformed into the log₁₀-space using the change of variable rule (see §3.4),
$$
f(\log_{10}(x)|\mathcal{D}_{\text{old}}) = f(x|\mathcal{D}_{\text{old}})\cdot \left(\frac{\mathrm{d}\log_{10}(x)}{\mathrm{d}x}\right)^{-1} = f(x|\mathcal{D}_{\text{old}})\cdot x\ln 10.
$$
Now, we consider that, following rehabilitation works made on the intersection in order to decrease the probability of accidents, we observe a = 1 accident over a period of one year, that is, 3 × 10⁵ users without accident. In order to estimate the new probability of accident, we assume that our previous posterior knowledge f(x|D_old) is an upper bound for X, because the rehabilitation works made the intersection safer. The new posterior then becomes
$$
f(x|\mathcal{D}) \propto \underbrace{x^{1}\cdot(1-x)^{3\times 10^{5}}}_{\text{likelihood}}\cdot \underbrace{\big(1 - F(x|\mathcal{D}_{\text{old}})\big)}_{\text{prior}}.
$$
Note that because the prior is now described by the complement of the CDF of X|D_old, which is not a conjugate prior, the posterior does not have an analytic solution anymore, and the evidence f(D_new) must be approximated using numerical integration. Figure 6.28 presents the prior, 1 − F(x|D_old), and the posterior PDF, f(x|D), for the probability of an accident after the intervention. The posterior predictive probability of an accident was 1.3 × 10⁻⁵ before the rehabilitation works, compared to 5.8 × 10⁻⁶ after.
Figure 6.28: Prior, 1 − F(x|D_old), and posterior PDF, f(x|D), for the probability of an accident after rehabilitation works.
6.7 Approximating the Posterior
For complex machine learning models such as those that will be covered in the next chapters, conjugate priors are often not applicable, and it can be computationally prohibitive to rely on sampling methods for performing Bayesian estimation. In those cases, we may have to employ point estimates rather than a probability density function for describing the posterior PDF of model parameters.
6.7.1 Maximum Likelihood and Posterior Estimates
Instead of estimating the posterior using a Bayesian approach, a common approximation is to find a single set of optimal parameter values θ* that maximizes the likelihood (MLE) or the product of the likelihood and the prior (MAP), where
$$
\theta^* = \begin{cases}
\arg\max_{\theta}\ f(\mathcal{D}|\theta) & \text{maximum likelihood estimate (MLE)} \\
\arg\max_{\theta}\ f(\mathcal{D}|\theta)\cdot f(\theta) & \text{maximum a posteriori (MAP)}.
\end{cases}
$$
Optimization methods covered in chapter 5 can be employed to identify θ*. Note that it is common to perform the optimization using a log-transformed objective function, for example, ln(f(D|θ)), for numerical stability reasons (see equation 6.5). Note that it is equivalent to optimize either the original or the log-transformed objective function,
$$
\arg\max_{\theta}\ f(\mathcal{D}|\theta) \equiv \arg\max_{\theta}\ \ln f(\mathcal{D}|\theta).
$$
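As a sketch of how such a point estimate is obtained in practice, the snippet below computes the MLE of θ = {µ_X, σ_X} for the additive Gaussian observation model of §6.4.2 by minimizing the negative log-likelihood; the synthetic data, the known σ_V, and the choice of optimizing ln σ_X (to keep the search unconstrained, as discussed below) are illustrative assumptions.

```python
# Minimal sketch: MLE of theta = (mu_X, sigma_X) by minimizing the negative log-likelihood.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)
sigma_v = 1.0                                             # assumed known observation noise
y = rng.normal(42.0, np.sqrt(2.0**2 + sigma_v**2), 50)    # illustrative synthetic observations

def neg_log_lik(theta):
    mu_x, log_sigma_x = theta                             # optimize ln(sigma_X) so sigma_X > 0
    sigma_x = np.exp(log_sigma_x)
    scale = np.sqrt(sigma_x**2 + sigma_v**2)              # marginal std of Y = X + V
    return -np.sum(norm.logpdf(y, loc=mu_x, scale=scale))

res = minimize(neg_log_lik, x0=np.array([40.0, 0.0]))     # unconstrained optimization (chapter 5)
mu_mle, sigma_mle = res.x[0], np.exp(res.x[1])
print(mu_mle, sigma_mle)
```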
Estimating only the most likely value θ* with respect to either the MAP or MLE criterion can lead to a negligible approximation error when applied to identifiable problems (see §6.3.4) where the amount of independent data is large; in that case, the posterior is concentrated around a single value, f(θ|D) → δ(θ − θ*). When the set of observations is small, in the case of a multimodal posterior, or for non-identifiable problems, estimating only θ* is no longer a good approximation of the posterior. The definition of what is a small or a large data set is problem dependent. In simple cases, data sets containing fewer than 10 observations will typically lead to significant epistemic uncertainty associated with unknown parameter values. Cases involving complex models, such as those treated in chapter 12, may need tens of thousands (D ∼ 10⁵) of observations to reach the point where the epistemic uncertainty associated with unknown parameter values is negligible.
We saw in §5.5 that it is common to resort to parameter-space transformations in order to perform optimization in an unconstrained domain. For an MLE, the quantity maximized is the likelihood of the observations y, given the parameters θ. Because f(y|θ) describes a probability density only for y and not for θ, a parameter-space transformation for an MLE does not require using the change of variable rule we saw in §3.4. On the other hand, the MAP seeks to maximize the product of the likelihood and the prior, f(y|θ)·f(θ). Here, the prior is a probability density, so the change of variable rule applies when we employ space transformations. Nevertheless, as we saw in figure 3.19, when a PDF is subject to a nonlinear transformation, the modes in the original and transformed spaces do not coincide. Therefore, the MLE is invariant to parameter-space transformations and the MAP is not.⁷ Note that Bayesian estimation is invariant to parameter-space transformations, given that we apply the change of variable rule. We will apply parameter-space transformations in the context of Bayesian estimation in chapter 7.

⁷ Jermyn, I. H. (2005). Invariant Bayesian estimation on manifolds. The Annals of Statistics 33(2), 583–605.
6.7.2 Laplace Approximation
Another method for approximating a posterior PDF for the unknown parameters of a hidden random variable is the Laplace approximation. This method approximates the posterior as a multivariate Normal with a posterior mean vector equal to either the MLE or MAP estimate θ*, and a posterior covariance Σ,
$$
\underbrace{f(\theta|\mathcal{D})}_{\text{posterior}} \approx \mathcal{N}(\theta;\, \overbrace{\theta^*}^{\text{MLE/MAP estimate}},\, \underbrace{\Sigma}_{\text{Laplace approximation}}).
$$
When we choose the MLE approximation to describe the posterior mode, the posterior covariance is approximated by the inverse Hessian matrix of the negative log-likelihood evaluated at θ*,
$$
\Sigma = \mathcal{H}\big[-\underbrace{\ln f(\mathcal{D}|\theta^*)}_{\text{log-likelihood}}\big]^{-1} = \left[-\frac{\partial^2 \ln f(\mathcal{D}|\theta)}{\partial\theta\,\partial\theta^{\intercal}}\bigg|_{\theta^*}\right]^{-1}.
$$
Details regarding the numerical estimation of the Hessian are presented in §5.4. In order to include prior knowledge, we have to estimate the Laplace approximation using ln(f(D|θ)·f(θ)) instead of the log-likelihood ln f(D|θ) alone.
The relation between the second derivative of the log-likelihood evaluated at θ* and the covariance Σ can be explained by looking at the multivariate Normal with probability density function described by
$$
f(\theta) = \frac{1}{(2\pi)^{\mathtt{P}/2}(\det\Sigma)^{1/2}}\exp\!\left(-\frac{1}{2}(\theta - \mu_{\theta})^{\intercal}\Sigma^{-1}(\theta - \mu_{\theta})\right),
$$
where P is the number of parameters θ = [θ₁ θ₂ ··· θ_P]ᵀ. By taking the natural logarithm of f(θ), we obtain
$$
\ln f(\theta) = -\frac{\mathtt{P}}{2}\ln 2\pi - \frac{1}{2}\ln\det\Sigma - \frac{1}{2}(\theta - \mu_{\theta})^{\intercal}\Sigma^{-1}(\theta - \mu_{\theta}),
$$
where the derivative of ln f(θ) with respect to θ is
$$
\frac{\partial \ln f(\theta)}{\partial\theta} = -\frac{1}{2}\,\frac{\partial\big((\theta - \mu_{\theta})^{\intercal}\Sigma^{-1}(\theta - \mu_{\theta})\big)}{\partial\theta} = -\frac{1}{2}\cdot 2\,\Sigma^{-1}(\theta - \mu_{\theta}) = -\Sigma^{-1}(\theta - \mu_{\theta}).
$$
If we differentiate this last expression a second time with respect to θ, we obtain
$$
\frac{\partial^2 \ln f(\theta)}{\partial\theta\,\partial\theta^{\intercal}} = -\Sigma^{-1}.
$$
This development shows how the Laplace approximation allows estimating the covariance matrix using the posterior PDF's curvatures. In the case of a Gaussian log-posterior, the second-order derivatives are constant, so the curvatures can be evaluated anywhere in the domain. In practice, we employ the Laplace approximation when the posterior is not Gaussian; in order to minimize the approximation error, we evaluate the second-order derivatives of the log-posterior at the MLE or MAP θ*, where the probability density is the highest.
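A minimal sketch of the Laplace approximation is shown below: the MLE of a simple Normal model (with hypothetical synthetic data) is available in closed form, the Hessian of the negative log-likelihood is approximated by central finite differences, and its inverse gives the covariance.

```python
# Minimal sketch: Laplace approximation of the posterior covariance at the MLE.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
y = rng.normal(42.0, 2.2, 50)                       # hypothetical synthetic observations

def neg_log_lik(theta):                             # theta = (mu_X, sigma_X)
    mu, sigma = theta
    return -np.sum(norm.logpdf(y, loc=mu, scale=sigma))

theta_star = np.array([y.mean(), y.std()])          # closed-form MLE for this Normal model

def numerical_hessian(fun, x, eps=1e-4):
    """Central finite-difference approximation of the Hessian of fun at x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.eye(n)[i] * eps
            e_j = np.eye(n)[j] * eps
            H[i, j] = (fun(x + e_i + e_j) - fun(x + e_i - e_j)
                       - fun(x - e_i + e_j) + fun(x - e_i - e_j)) / (4 * eps**2)
    return H

H = numerical_hessian(neg_log_lik, theta_star)      # Hessian of -ln f(D|theta) at theta*
Sigma = np.linalg.inv(H)                            # Laplace approximation of the covariance
print(theta_star, Sigma)
```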
Figure 6.29: Example of an application of the Laplace approximation to describe a posterior PDF (true posterior compared with the Laplace approximation).

Figure 6.29 presents an example of an application of the Laplace approximation, where a generic posterior PDF for two parameters is approximated by a bivariate Normal PDF. The Hessian matrix evaluated at the mode θ* = [1.47  0.72]ᵀ is
$$
\mathcal{H}[\ln f(\mathcal{D}|\theta^*)] = \begin{bmatrix} -0.95 & 0.02 \\ 0.02 & -0.95 \end{bmatrix},
$$
so the Laplace approximation of the posterior covariance is described by
$$
\Sigma = -\mathcal{H}[\ln f(\mathcal{D}|\theta^*)]^{-1} = \begin{bmatrix} 1.05 & 0.02 \\ 0.02 & 1.05 \end{bmatrix}.
$$
The closer the posterior is to a multivariate Normal, the better the Laplace approximation is. Note that for problems involving several parameters, computing the Hessian matrix using numerical derivatives becomes computationally prohibitive. One possible approximation is to estimate only the main diagonal of the Hessian. The main diagonal of the Hessian contains information about the variances of the posterior with respect to each variable, without providing information about the covariances (i.e., the pairwise linear dependencies between variables).

In addition to being employed to approximate the posterior PDF, the Laplace approximation can be used to estimate the evidence f(D). With the Laplace approximation, the posterior is
$$
f(\theta|\mathcal{D}) \approx \mathcal{N}(\theta;\, \theta^*,\, \Sigma) = \underbrace{\frac{1}{(2\pi)^{\mathtt{P}/2}(\det\Sigma)^{1/2}}}_{\text{(a)}}\,\underbrace{\exp\!\left(-\frac{1}{2}(\theta - \theta^*)^{\intercal}\Sigma^{-1}(\theta - \theta^*)\right)}_{\text{(b)}}. \qquad (6.9)
$$
In equation 6.9, (a) corresponds to the normalization constant for the exponential term indicated in (b), whose value is exp(0) = 1 when evaluated at the mode θ* (see §4.1). When estimating f̃(θ*) using either an MAP, f̃(θ*) = f(D|θ*)·f(θ*), or an MLE, f̃(θ*) = f(D|θ*), its value is not going to equal 1. Therefore, the normalization constant in (a) must be multiplied by the value of the unnormalized posterior evaluated at the mode θ*, so that
$$
\hat{f}(\mathcal{D}) = \tilde{f}(\theta^*)\cdot(2\pi)^{\mathtt{P}/2}(\det\Sigma)^{1/2}. \qquad (6.10)
$$
The reader interested in the formal derivation of this result from a Taylor-series expansion should consult MacKay.⁸

⁸ MacKay, D. J. C. (2003). Information theory, inference, and learning algorithms. Cambridge University Press.
6.8 Model Selection
In the concrete strength estimation example presented in §6.5.3, it is known that the data was generated from a log-normal distribution. In real-life cases, the PDF form for the hidden random variable X may either remain unknown or not correspond to an analytically tractable probability density function. In these cases, it is required to compare the performance of several model classes at explaining the observations D.
Given a set of M model classes m ∈ M = {1, 2, ···, M}, the posterior probability of each class given D is described using Bayes rule,
$$
\underbrace{p(m|\mathcal{D})}_{\text{model class posterior}} = \frac{\overbrace{f(\mathcal{D}|m)}^{\text{model likelihood}}\cdot \overbrace{p(m)}^{\text{model class prior}}}{\underbrace{f(\mathcal{D}^{\{1:\mathtt{M}\}})}_{\text{evidence}}}, \qquad (6.11)
$$
where p(m|D) and p(m) are respectively the posterior and prior probability mass functions for the model classes. The likelihood of a model class m is obtained by marginalizing the model parameters from the joint distribution of observations and parameters,
$$
f(\mathcal{D}|m) = \int f(\mathcal{D}|\theta^{\{m\}})\cdot f(\theta^{\{m\}})\,\mathrm{d}\theta, \qquad (6.12)
$$
where θ^{m} is the set of parameters associated with the model class m. The likelihood of a model class in equation 6.12 corresponds to its evidence, which also corresponds to the normalization constant from the posterior PDF for its parameters,
$$
f(\mathcal{D}|m) \equiv \underbrace{f(\mathcal{D}^{\{m\}})}_{\text{normalization constant for model class } m}. \qquad (6.13)
$$
The normalization constant for the model comparison problem presented in equation 6.11 is obtained by summing the likelihood of each model class times its prior probability,
$$
f(\mathcal{D}^{\{1:\mathtt{M}\}}) = \sum_{m\in\mathcal{M}} f(\mathcal{D}|m)\cdot p(m).
$$
Note that using the importance-sampling procedure presented in equation 6.8 for estimating the normalization constants in equation 6.13 is only a viable option for simple problems involving a small number of parameters and a small number of observations. Section 7.5 further discusses the applicability of more advanced sampling methods for estimating f(D).

Note: With Bayesian model selection, the objective is not to identify the single best model; it allows quantifying the posterior probability of each model, which can be employed for performing subsequent predictions.
Here, we revisit the concrete strength example presented in §6.5.3. We compare the posterior probabilities for two model classes m ∈ M = {L, N} (i.e., a log-normal and a Normal PDF); note that we know the correct hypothesis is m = L because the simulated data were generated using a log-normal PDF.

For D = 5 observations, the evidence of each model is
$$
f(\mathcal{D}|\text{L}) = 8.79\times 10^{-7}, \qquad f(\mathcal{D}|\text{N}) = 6.44\times 10^{-7}.
$$
With the hypothesis that the prior probability of each model is equal, p(L) = p(N) = 0.5, the posterior probabilities for the model classes m are
$$
\begin{aligned}
p(\text{L}|\mathcal{D}) &= \frac{8.79\times 10^{-7}\cdot 0.5}{8.79\times 10^{-7}\cdot 0.5 + 6.44\times 10^{-7}\cdot 0.5} = 0.58 \\
p(\text{N}|\mathcal{D}) &= 1 - p(\text{L}|\mathcal{D}) = \frac{6.44\times 10^{-7}\cdot 0.5}{8.79\times 10^{-7}\cdot 0.5 + 6.44\times 10^{-7}\cdot 0.5} = 0.42.
\end{aligned}
$$
This result indicates that both models predict the observations D with almost equal performance. Therefore, if we want to make predictions for unobserved quantities, it is possible to employ both models, where the probability of each of them is quantified by its posterior probability p(m|D).
In practice, because estimating the evidence f(D|m) is a computationally demanding task, it is common to resort to approximations such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) for comparing model classes. Note that the BIC is a special case derived from the Laplace approximation in equation 6.10 for estimating f(D). Other advanced approximate methods for comparing model classes are the deviance information criterion⁹ (DIC) and the Watanabe-Akaike information criterion¹⁰ (WAIC). The reader interested in these methods should refer to specialized textbooks.¹¹ Another common model selection technique widely employed in machine learning is cross-validation; the main idea consists in employing a first subset of data to estimate the model parameters and then testing the predictive performance of the model using a second independent data set. The details regarding cross-validation are presented in §8.1.2, where the method is employed to discriminate among model structures in the context of regression problems.

⁹ Spiegelhalter, D. J., N. G. Best, B. P. Carlin, and A. van der Linde (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64(4), 583–639.
¹⁰ Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research 11, 3571–3594.
¹¹ Burnham, K. P. and D. R. Anderson (2002). Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). Springer; and Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (2014). Bayesian data analysis (3rd ed.). CRC Press.