6
Learning from Data
Learning context
p(unknown|known)
x:Constant
: X f(x; )
X:Random variable
y:Observation
When scalars
x
are employed to model physical phenomena, we
can use empirical observations y to learn about the probability (or
probability density) for each value that
x
can take. When
X
is a
random variable, we can us e observations to learn about the p aram -
eters
describing its probability density (or mass) function
f
(
x;
).
This chapter presents the general Bayesian formu lat i on , as well as
approximate methods derived from it, for l ear n in g about unknown
state variables and parameter s, given known data. The applications
presented in this chapter focus on simple models in or d er to keep
the attention on the Bayesian methods themselves. We will see in
chapters 8–13 how this theory can be generaliz ed for applicat ion s
to machine learning models having complex architectures. For an
in-depth review of Bayesian methods applied to dat a analysis, th e
reader may refer to specialized textbooks such as those by Gelman
et al.
1
or Box and Ti ao.
2
1
Gelman, A., J. B. Carlin, H. S. Stern, and
D. B. Rubin (2014). Bayesian data analysis
(3rd ed.). CRC Press
2
Box, G. E. P. and G. C. Tiao (1992).
Bayesian inference in statistical analysis.
Wiley
6.1 Bayes
Learning from data can be seen as an indirect problem; we observe
some quantities y , and we want to employ this informati on to infer
knowledge about hidden-state variab l es x and model parameters Note: The term hidden describes a state
variable that is not directly observed.
,wherey depends on x and
through
f
(y
|
x
;
). In chapter 3,
we saw that such a dependence can be tre at ed using conditional
probabilities.
Notation:
;
and
|
are both employed
to describe a conditional probability or
probability density. The distinction is that
;isemployedwhen
are parameters of
aPDForPMF,forexample,
f
(
x
;
)=
N
(
x
;
µ,
2
), and
|
denotes a conditional
dependence between variables, for example,
f
(
y|x
). Besides these semantic distinctions,
both are employed in the same way when it
comes to performing calculations.
In the first part of this chapter, we sol el y focu s on the estimation
of hidden-state variables
x
. Given two random variables
X f
(
x
)
and
Y f
(
y
), their joint probability density funct i on (PDF) is
obtained by the product of a conditional and a marginal PDF,
f(x, y)=
(
f(x|y) ·f(y)
f(y|x) · f(x).
(6.1)
j.-a. goulet 60
In the case where
y
is known and
X
is not, we can reorganize the
terms of equation 6.1 in order to obtain Bayes rule,
f(x|y)=
f(y|x) · f(x)
f(y)
,
which describes the posterior PDF (i. e. , our poster ior knowled ge)
of
X
given that we have observed
y
. Let us consider
X
, a random
variable so that
X f
(
x
), and given a set of observations
D
=
{y
1
,y
2
, ··· ,y
D
}
that are realizations of the random var i abl e s Y = Note: From now on, the number of ele-
ments in a set or the number of variables
in a vector is defined by a
typewriter
-font
upper-case letter; for ex ample,
D
is the
number of observations in a set
D
,and
X
is the number of variables in the vector
x =[x
1
x
2
··· x
X
]
|
.
Asetofobservations
D = {y
1
,y
2
, ··· ,y
D
}
e.g., D = {1.2, 3.8, ··· , 0.22},(y 2 R)
D = {3, 1, ··· , 6},(y 2 Z
+
)
D = {blue, blue, ··· , red},
(y 2{blue, red, green})
[
Y
1
Y
2
··· Y
D
]
|
f
(y). Our posterior knowledge for
X
given the
observations D is described by the conditional PDF
f(x|D)
| {z }
posterior
=
likelihood
z }| {
f(D|x) ·
prior
z}|{
f(x)
f(D)
|{z}
evidence
.
Prior
f
(
x
) describes our prior knowledge for the values
x
that a
random variable
X
can take. The prior knowledge can be expressed
in multiple ways. For instance, it can be based on heur is t i cs such
as expert opinions. In the case where d ata is obtained s eq u entially,
the posterior knowledge at time
t
1 becomes the prior at time
t
.
In some cases, it also happens that no pr i or knowledge is available;
then, we should employ a non-informative prior, that is, a p r i or Note: A uniform prior and a non-
informative prior are not the same
thing. Some non-informative priors are
uniform, but not all uniform priors are
non-informative.
that reflects an absence of knowledge (see §6.4. 1 for further details
on non-informative priors ) .
Likelihood
f
(Y = y
|x
)
f
(
D|x
)describesthelikelihood, or the
conditional probability density, of observing the event
{
Y =
D}
,
given the values that
x
can take. Note that i n the special case of an
exact observation where D = y = x,then
f(D|x)=
1,y= x
0, otherwise.
In such a case, the observations are ex act so the prior does not
play a role; by observing
y
you have all the information about
x
because
y
=
x
. In the general case where
y
:
Y
=
fct
(
x
), that is,
y
is a realization from a stochastic process, which is a functi on of
x
f
(
D|x
)describestheprior pr ob abi l i ty density of observing
D
Here, the term prior refers to our state of
knowledge that has not yet been influenced
by the observations in D.
given a specific set of values for x.
Evidence
f
(Y = y)
f
(
D
) is called the evidence and consists in a
normalization constant ensuring that the posterior P DF int egr at es
probabilistic machine learning for civil engineers 61
to 1. In §3.3 we saw that for both the dis cr et e and the continuous
cases,
X
x
p(x|D)
| {z }
discrete case
=1,
Z
1
1
f(x|D)dx
| {z }
continuous case
=1.
For th at property to be true, the normalizati on constant is defined
as the sum/integral over the entire dom ain of
x
, of the likelihood
times the prior,
p(D)=
X
x
p(D|x) · p(x)
| {z }
discrete case
,f(D)=
Z
f(D|x) · f(x)dx
| {z }
continuous case
.
So far, the description of learning as an indirect pr obl em is
rather abstract. The next sections wi l l apply th is concept t o simple
problem configurations involving discret e an d conti nuous random
variables.
6.2 Discrete State Variables
Discrete state variable:Adiscrete
random variable
X
employed to represent
the uncertainty regarding the state
x
of a
system.
This section presents how the probability of occurrence for discrete
state variables are estimated from observations. In all the examples
presented in this section, the uncertainty associated with our
knowledge of a hidden random variable
X
is epistemic because Reminder: The term hidden refers to a
variable that is not observed.
a true value
ˇx
exists, yet it is unknown. Moreover,
x
is not directly
observable; we can only observe
y
,where
p
(
y|x
)describesthe
conditional probability of an observation
y
given the value of the
discrete hidden-state variable x.
6.2.1 Example: Disease Screening
Entire
population
Person
without disease
Pr
(
¬disease
)=
99%
Person
with
disease
Pr
(
disease
)
=1%
Pr
(
test
+
|disease
)=80%
Pr
(
test
+
disease
)=10%
1% 80% 99% 10%
Figure 6.1: Visual interpretation of Bayes
rule in the context of disease screening.
(Adapted from John Henderson (CC BY).)
Figure 6.1 illustrates the context of a r are disease for which t h e
probability of occurrence in the general population is 1 percent,
and a screening test such that if one has the disease, the prob-
ability of testing positive is
Pr
(
test+|disease
)=0
.
8; and if one
does not have the disease, the pr obab il ity of testing positive is
Pr
(
test+disease
)=0
.
1. The sample space for the variable describ-
ing the hidden state we are interested in is
x 2{disease, ¬disease}
,
for which the prior probability is
Pr
(
X
=
disease
)=0
.
01 and
Pr(X = ¬disease) = 1 Pr(X = disease) = 0.99.
For a single observation,
D
=
{test
+
}
, the posterior probability
j.-a. goulet 62
of having the disease is
Pr(disease | test+) =
Pr(test+|disease)·Pr(disease)
Pr(test+|disease)·Pr(disease)+Pr(test+disease)·Pr(¬disease)
=
1%80%
(1%80%)+(99%10%)
=7.5%,
and the posterior probability of not havi n g the disease is
Pr(¬disease | test+) = 1 Pr(disease | test+)
=
Pr(test+disease)·Pr(¬disease)
Pr(test+|disease)·Pr(disease)+Pr(test+disease)·Pr(¬disease)
=
99%10%
(1%80%)+(99%10%)
= 92.5%.
If we push this example to the extreme c ase where the disease
is so rare that only one human on Earth has it (see figu r e 6.2) and
where we have a highly accurate sc re en i ng test so that
test+ !
(
Pr(test + |disease) = 0.999
Pr(test + disease) = 0.001,
the question is If you test positive, should you be worried? The
answer is no; the disease is so rare that even with the hi gh test
accuracy, if we teste d every human on Earth, we would expect
0
.
001
(8
10
9
)=8
10
6
false positive diagnoses. This example
illustrates how in extreme cases, prior i n for mat i on may outweigh
the knowledge you obtain from observations.
Humans
Only one has the disease
Figure 6.2: The example of a disease so
rare that only one human on Earth has it.
6.2.2 Example: Fire Alarm
Note:
P
x
p
(
y
=
|x
)doesnothaveto
equal one for the likelihood; however, it
does for the posterior,
P
x
p(x|y)=1.
It is common to observe that when a fire alar m is triggered in
a public building, people tend to rem ain calm and ignor e the
alarm. Let us explore this situation usi n g con d it i on al probabili-
ties. The sample space for the hidden st at e we are inter es t ed in is
x 2{fire, ¬fire}{ , }
and the sample space for the observati on
is y 2{alarm, ¬alarm}{ , }. If we assume that
p( |x)=
Pr( | )=0.95
Pr( | )=0.05,
then the question is Why does no one react when a fire alarm is
triggered? The is su e here is that
p
(
|x
) describes the probability of
an alarm given the state; in order to answer the question, we need
to calculate
Pr
(
t
|
), which depends on the prior probability that
there was a fire at time
t
1 as well as at the transition probabilities
p(x
t
|x
t1
), as illustrated in figure 6.3.
t
0.05
0.95
Pr = 0.99
Pr = 0.01
t 1
0.99
0
.
01
1
Pr =?
Pr =?
Figure 6.3: Fire alarm example.
probabilistic machine learning for civil engineers 63
We assum e that at
t
1, before the alarm is triggered, our p ri or
knowledge is
p(x
t1
)=
Pr(
t1
)=0.01
Pr(
t1
)=0.99.
The transition probab il i ty between
t
1 and
t
, at the moment when
the alarm is triggered, is described by the conditional probabilities
p(x
t
|x
t1
)=
8
>
>
>
<
>
>
>
:
Pr(
t
|
t1
)=0.99
Pr(
t
|
t1
)=0.01
Pr(
t
|
t1
)=1
Pr(
t
|
t1
)=0.
By combining the prior knowledge at
t
1 with the transition
probabilities, we obtain the joint prior pr ob abi l i ty of hidden states
at both time steps,
p(x
t
,x
t1
)=
8
>
>
>
>
>
>
>
>
>
>
>
>
<
>
>
>
>
>
>
>
>
>
>
>
>
:
Pr(
t
,
t1
)=Pr(
t
|
t1
) · Pr(
t1
)
=1· 0.01 = 0.01
Pr(
t
,
t1
)=Pr(
t
|
t1
) · Pr(
t1
)
=0.01 · 0.99 = 0.0099
Pr(
t
,
t1
)=Pr(
t
|
t1
) · Pr(
t1
)
=0.99 · 0.99 = 0.9801
Pr(
t
,
t1
)=Pr(
t
|
t1
) · Pr(
t1
)
=0· 0.01 = 0.
We obtain t he marginal prior probability of each state
x
at a time
t
by marginalizing this joint pr obab i li ty, th at is, su mm in g over states
at t 1,
p(x
t
)=
8
>
>
>
>
>
<
>
>
>
>
>
:
Pr(
t
)=Pr(
t
,
t1
)+Pr(
t
,
t1
)
=0.01 + 0.0099
=0.0199 0.02
Pr(
t
)=Pr(
t
,
t1
)+Pr(
t
,
t1
)
=0.9801 0.98.
If we combine the information from the prior knowledge at
t
and
the observation, we can compute the posterior probability of the
state x
t
, given that the alarm has been triggered,
Pr(
t
| )=
Pr( ,
t
)
z }| {
Pr( |
t
) · Pr(
t
)
(Pr( |
t
) · Pr(
t
)) + (Pr( |
t
) · Pr(
t
))
| {z }
Pr( )
=
0.95 · 0.02
(0.95 · 0.02) + (0.05 · 0.98)
=0.28
Pr(
t
| )=1 Pr(
t
| )=0.72.
j.-a. goulet 64
Overall, despite an alarm being triggered, the probability that there
is a fire remains less than 30 perce nt.
We can explore what would h appen at time
t
+ 1 if, after the
alarm has been triggered, a safety ocial i nf orm s us that the re
is a fire and that we need to evacuat e. We assume here that the
conditional probability of receiving such information given t h e state
is
p( |x)=
Pr( | )=0.95
Pr( | )=0.05,
as illustrated in figu re 6.4. Here we assume that given the state, the
probability of receiving the information from the safety oc i al is
conditionally independent from the alarm, that is,
Pr
(
| ,
)=
Pr
(
|
). If, as we did for the estimation at time
t
, we combine
the information from the prior knowledge at time
t
, the transition
probabilities, and t he observation at time
t
+ 1, we can compute t he
posterior probability of the hidden stat e
x
t+1
, given that the alarm
has been triggered and that we received the information from a
safety ocial. First, the marginal prior probabilities of each state at
time
t
+ 1 are estimated from the post er i or probability at
t
combined
with the transiti on probabilities so that
Pr(
t+1
| )=
Pr(
t+1
,
t
| )
z }| {
Pr(
t+1
|
t
) · Pr(
t
| )+
Pr(
t+1
,
t
| )
z }| {
Pr(
t+1
|
t
) · Pr(
t
| )
=0.01 · 0.72 + 1 · 0.28
=0.29
Pr(
t+1
| )=1 Pr(
t+1
| )=0.71.
t
0.05
0.95
Pr = 0.99
Pr = 0.01
t 1
0.99
0
.
01
1
Pr = 0.72
Pr = 0.28
t +1
0.99
1
0
.
01
0.05
0.95
Pr = ?
Pr = ?
Figure 6.4: Example illustrating the role of
the prior in sequential data acquisition and
interpretation.
Then, the posterior probability at t + 1 is estimated following
Pr(
t+1
| , )=
Pr( ,
t+1
| )
z }| {
Pr( |
t+1
) · Pr(
t+1
| )
(Pr( |
t+1
)·Pr(
t+1
| ))+(Pr( |
t+1
)·Pr(
t+1
| ))
| {z }
Pr( | )
=
0.95 · 0.29
(0.95 · 0.29) + (0.05 · 0.71)
=0.89
Pr(
t+1
| , )=1 Pr(
t+1
| , )=0.11.
This example illustrates how when knowledge about a hidden
state evolves over time as new information becomes available, the
posterior at t is employed to estimate the prior at t + 1.
probabilistic machine learning for civil engineers 65
6.2.3 Example: Post-Earthquake Damage Assessment
Given an earthquake of intensity
3
3
This example is adapted from Armen Der
Kiureghian’s CE229 course at UC Berkeley.
X : x 2{light (L), moderate (M), important (I)}
and a structure that can be i n a state (e.g., see figure 6.5)
Y : y 2{damaged ( D ) , undamaged (D)}.
These two random variables are illustrated usin g a Venn diagram in
figure 6.6. The likelihood of damage given each earthquake intensity
is
p(D|x) =
8
<
:
Pr(Y = D|x = L)=0.01
Pr(Y = D|x = M)=0.10
Pr(Y = D|x = I)=0.60,
Figure 6.5: Extensive damage af ter San
Franciscos 1906 earthquake.
Figure 6.6: Venn diagram illustrating
the sample space where the earthquake
intensities light (L), moderate (M), and
important (I)aremutuallyexclusiveand
collectively exhaustive events. The post-
earthquake structural states of damaged
(D)andundamaged(
D
)intersectthe
earthquake intensity events.
and the prior probability of each intensity is given by
Note: For the likelihood
P
x
p(D|x) 6
=1;
however, for the prior
P
x
p(x)
=1,andfor
the posterior
P
x
p(x|D) =1.
p(x) =
8
<
:
Pr(X = L)=0.90
Pr(X = M)=0.08
Pr(X = I)=0.02.
We are inter es t ed in inferrin g the intensity of an earthquake given
that we have observed damaged bu il d i ngs
y
= D following an event ,
so that
p(x|D) =
p(D|x) · p(x)
p(D)
.
The first step to solve this problem is to compute the normalization
constant,
p(y = D)=
Pr(Y =D,X=L)
z
}| {
p(D|L) · p(L)+
Pr(Y =D,X=M)
z }| {
p(D|M) · p(M)+
Pr(Y =D,X=I)
z }| {
p(D|I) · p(I)
=0.01 · 0.90 + 0.10 · 0.08 + 0.6 · 0. 02
=0.029.
Then, the posterior probability of each state is calculate d following
Pr(x = I |y = D)=
p(D|I)·p(I)
p(D)
=
0.60·0.02
0.029
=0.41
Pr(x = M| y = D)=
p(D|M)·p(M)
p(D)
=
0.10·0.08
0.029
=0.28
Pr(x = L |y = D)=
p(D|L)·p(L)
p(D)
=
0.01·0.90
0.029
=0.31.
We see that despit e the prior pr ob abi l i ty of a high-intensity earth-
quake being small (i.e.,
Pr
(
X
= I)=0
.
02), the posterior is signifi-
cantly higher because of the observat i on indicating t h e presence of
structural damage.
j.-a. goulet 66
6.3 Continuous State Variables
This section presents examples of applications of hidden-stat e
estimation for continuous cases. Compared with discrete cases,
continuous ones involve more abstract conce pt s in the defin i t ion
of the likelihood
f
(
D|
x) and evidence
f
(
D
). Like for the discrete
case, if our model is
y
=
x
and we observe
y
, then we know all
there is to know about
x
so the conditional probability of
y
given
x
is nonzero only for the true value
ˇx
. In this section, we will cover
cases involving im perfect observations rather than perfect ones.
6.3.1 Likelihood: f(D|x)
In practice, we are interested in the case where
ˇx
is a true yet un-
known hidden-state variable and where we have access to imprecise
observations y
i
,wheretheobservation model is either
y
i
x + v
i
(direct observation)
y
i
> ˇx + v
i
(upper-bound censored observation)
y
i
< ˇx + v
i
(lower-bound censored observation),
where in all three cases the observat i on error,
v
i
:
V
i
N
(
v
;0
,
2
V
),
V
i
?? V
j
, 8i 6
=
j
. A direct observation corresponds t o the standard Note:
V
i
?? V
j
, 8i 6
=
j
indicates that the
random variable
V
i
is independent of
V
j
,
for all
i 6
=
j
(i.e., that all observation errors
are independent of each other).
case where the observation is the tr ue state
ˇx
on which an observa-
tion error
v
is added.
V
is a Normal random variable employed to
describe observation err or s.
V
is often chosen to be Normal in order
to simplify the problems by maintaining analy t i cal tractability for
linear models (see §4.1.3). Lower- and up per-bound observations are
called censored observations where we only know that the hidden-
state variable is either respecti vely greater or smaller than the
observed value y.
Direct observations For a direct observation
y
,thelikelihoodis
formulated from the additive observation model so that the PD F of
Y
is described by the Normal PDF
Y N
(
y
;
x,
2
V
). Because
ˇx
is
Y = x + V
Observation: N(y; x,
2
V
)
Hidden state: constant
Measurement error: N(v;0,
2
V
)
unknown, the likelihood denoted by
f
(
y|x
) or
L
(
x|y
)describesthe
prior probability den si ty of observing {Y = y} given {ˇx = x},
f(y|x)=L(x|y)=
1
p
2 ·
V
exp
1
2
y x
V
2
!
. (6.2)
The leftmost surface in figure 6.7 correspon ds to equation 6. 2
plotted with respect to both
x
and
y
, for
V
= 1. The p i nk dashed
curve represents the explicit evaluat i on of
f
(
y|x
= 0), where the
evaluation at
x
= 0 is arbitrarily chosen for illustration p u rpose,
probabilistic machine learning for civil engineers 67
and the red solid curve reverses the relat i on sh ip to represent the
likelihood of the observation
y
=
1 for any value
x
,where
f
(
y
=
1|x)=L(x|y = 1).
5
0
5
5
0
5
x
y
f(y|x)
y =-1.0|x =0.0
5 0 5
y
f(y|x =0)
5 0 5
x
L(x|y = 1)
Figure 6.7: Representation of the likelihood
function for a direct observation. The
leftmost surface describes
f
(
y|x
), the
rightmost graph describes
f
(
y
=
1
|x
)=
L
(
x|y
=
1), and the center graph
describes f(y|x =0).
Keep in mind that the likelihood is not a PDF with respect to
x
so it does not integrate to 1. Accordingly, the rightmost graph in
figure 6.7 is a function quantifying for each value
x
how likely it is
to observe the value y = 1.
Upper-bound observations For an upper-bound observation
y>ˇx
+
v
,
the likelihood is described by the cumulati ve distribution function
(CDF) of Y N(y; x,
2
V
),
f(y|x)=
CDFN(y;x,
2
V
)
z }| {
Z
y
1
1
p
2⇡
V
exp
1
2
y
0
x
V
2
!
dy
0
=
1
2
+
1
2
erf
y x
p
2
V
.
(6.3)
The surface in figure 6.8 corresponds to eq uat i on 6.3 plotted with
respect to
x
and
y
, for
V
= 1. The pink dashed curve r ep re sents
the explicit evaluation of
f
(
y|x
= 0), where t h e evaluation at
x
= 0 is ar bi t r ari l y chosen for illustration pu r pose, and the red solid
curve reverses the relationship to represent the likelihood of the
observation y = 1, gi ven any value x, L(x|y = 1).
5
0
5
5
0
5
x
y
f(y|x)
y =-1.0|x =0.0
5 0 5
y
f(y|x =0)
5 0 5
x
L(x|y = 1)
Figure 6.8: Representation of the likelihood
function for an upper-bound censored
observation. The leftmost surface describes
f
(
y|x
), the rightmost graph describes
f
(
y
=
1
|x
)=
L
(
x|y
=
1), and the center
graph describes f(y|x =0).
j.-a. goulet 68
Lower-bound observations For a lower-bound observation
y<ˇx
+
v
,
the likelihood is described by the complement of t he cumul at i ve
distribution function of Y N(y; x,
2
V
),
f(y|x)=1
CDFN(y;x,
2
V
)
z }| {
Z
y
1
1
p
2⇡
V
exp
1
2
y
0
x
V
2
!
dy
0
=1
1
2
+
1
2
erf
y x
p
2
V
◆◆
.
(6.4)
The surface in figure 6.9a corresponds to eq uat i on 6.4 plotted with
respect to
x
and
y
, for
V
= 1. The pink dashed curve r ep re sents
the explicit evaluation of
f
(
y|x
= 0), where the evaluation at
x
=0
is arbitrarily chosen for illustration pur pose, and again the red
curve reverses the relationship to represent the likelihood of the
observation
y
=
1, given any value
x
,
L
(
x|y
=
1). Note that
in figure 6.9b, if instead of having
V
= 1,
V
tends to zero, the
smooth transition from low likelihood values to high ones would
become a sharp jump from zero to one.
5
0
5
5
0
5
x
y
f(y|x)
y =-1.0|x =0.0
5 0 5
y
f(y|x =0)
5 0 5
x
L(x|y = 1)
(a)
V
=1
5
0
5
5
0
5
x
y
f(y|x)
y =-1.0|x =0.0
5 0 5
y
f(y|x)
5 0 5
x
L(x|y)
(b)
V
=0
Figure 6.9: Representation of the likelihood
function for a lower-bound censored
observation. The leftmost surface describes
f
(
y|x
), the rightmost graph describes
f
(
y
=
1
|x
)=
L
(
x|y
=
1), and the center
graph describes f(y|x =0).
Note: The notion of conditional inde-
pendence implies that given
x
,
Y
i
is
independent of Y
j
,thatis,
Y
i
?? Y
j
|x , f(y
i
|y
j
,x)=f(y
i
|x),
so that
f
(
y
i
,y
j
|x
)=
f
(
y
i
|x
)
· f
(
y
j
|x
). This
assumption is often satisfied because the
observation model follows the form
y = x + v,
where
v
is the realization of an observation
error so that V
i
?? V
j
, 8i 6= j.
Multiple observations All cases above dealt with a single observa-
tion. When multiple observation s are available, one must define the
joint likelihood for all observations given
x
. With the assumption
that observations are conditi onally independent given
x
, the joint
probabilistic machine learning for civil engineers 69
PDF is obtained from the product of marginals,
f(D|x)=f(Y
1
= y
1
, ··· ,Y
D
= y
D
|x)
=
D1
!
z }| {
D
Y
i=1
f(Y
i
= y
i
|x)
=exp
D
X
i=1
ln
f(Y
i
= y
i
|x)
!
.
(6.5)
Note that practical diculties related to numerical underflow
or overflow arise when trying to evaluate t h e product of several
marginal likelihoods
f
(
y
i
|x
). This is why in equation 6.5, the pr od-
uct of the marginals is replaced with the equivalent yet numerically
more robust formulation using the exponential of the sum of the log
marginals.
6.3.2 Evidence: f(D)
5 0 5
0
0.1
x
f(y|x)f (x)
(a) S =7, x
s
=
10
7
,
ˆ
f(D)=0.2158
5 0 5
0
0.1
x
f(y|x)f (x)
(b) S =10, x
s
=
10
10
,
ˆ
f(D)=0.2198
5 0 5
0
0.1
x
f(y|x)f (x)
(c) S =20, x
s
=
10
20
,
ˆ
f(D)=0.2197
5 0 5
0
0.1
x
f(y|x)f (x)
(d) S =100, x
s
=
10
100
,
ˆ
f(D)=0.2197
Figure 6.10: Examples of application of the
rectangle integration rule.
The evidence acts as a normalization constant obtained by inte-
grating the product of the likelihood and the prior over al l possible
values of x,
f(D)=
Z
f(D|x) · f(x)dx.
This integral is typically dicult to evaluate because it is not ana-
lytically tractable. Figure 6.10 presents a naive way of approaching
this integral using the rectangle rule, where the domain containing
a significant probability content is su bdivided in
S
bins of width
x
s
in order to approx i mat e
ˆ
f(D)=
S
X
s=1
f(D|x
s
) · f( x
s
)x
s
.
This approach is reasonable for trivial 1-D pr ob l ems ; however, it
becomes computationally intractable for proble ms in larger di men -
sions (i.e., for x
2 R
X
) because the number of grid combinations to
evaluate increases exponentially with the number of dimensions
X
to integrate on. Section 6.5 presents an intro d u ct i on to sampling
methods that allow overcoming this challenge. Chapter 7 will intro-
duce more advanced sampling methods all owing the estimation of
the normalization const ants for high-dimensional domains.
j.-a. goulet 70
6.3.3 Posterior: f(x|D)
The posterior is the product of the likel ih ood function tim es the
prior PDF divided by the normalization constant,
f(x|D)=
f(D|x) · f(x)
f(D)
.
Let us consider two direct observations
D
=
{y
1
=
1
,y
2
=0
}
obtained from the observation model
y
i
= x + v
i
,v
i
: V
i
N(v;0, 1
2
),V
i
?? V
j
, 8i 6= j.
The likelihood is thus given by
f
(
D|x
)=
N
(
y
1
;
x,
1
2
)
·N
(
y
2
;
x,
1
2
).
We assum e the prior PDF follows a u ni form PDF withi n the range
from -5 to 5,
f
(
x
)=
U
(
x
;
5
,
5). As presented in figure 6.10, the
normalization constant is estimated using the rectangle integration
rule to
ˆ
f
(
D
)=0
.
2197. Figure 6.11 presents the prior, the likelihood,
and the posterior for this example.
5 0 5
0
0.1
0.2
x
f(D|x)
Likelihood
5 0 5
0
0.1
0.2
x
f(x)
Prior
5 0 5
0
0.2
0.4
0.6
x
f(x|D)
Posterior
Figure 6.11: Example of prior, likelihood,
and posterior.
6.3.4 Number of Observations and Identifiability
In the extreme case where no observat i on is available (i.e.,
D
=
;
),
the posterior is equal to the prior
f
(x
|;
)=
f
(x). In order to
explore what happens at the other end of the spect ru m where an
infinite amount of data is available, it is required to distin guis h
between identifiable and non-ident ifi able problems. A problem is
identifiable when, given an infinite numb er of direct observations,
we are theoretically able to retrieve the true hidden state of a
deterministic system
ˇx
. For non-identifiable problems, even an
infinite number of direct observations does not allow retrieving the
true hidden state of the system
ˇx
, because multiple equally valid
solutions exist.
Figure 6.12: Example of two streams
merging in a single river.
(Photo: Ashwini Chaudhary)
Example: Non-identifiable problem Take the example illustrated
in figure 6.12, where we want to estimat e the contaminant concen-
tration in two streams
x
1
and
x
2
and where we only have access to
contaminant concentration obser vations
y
i
= x
1
+ x
2
+ v
i
,v
i
: V
i
N(v;0, 2
2
),V
i
?? V
j
, 8i 6= j,
probabilistic machine learning for civil engineers 71
which are collected at the output of a ri ver after both streams
have merged. We assume that our prior k n owledge is uniform over
positive concentration values, that is ,
f
(
x
1
,x
2
)
/
1. The resulting
posteriors are presented in figure 6.13 for
D
= 1 and
D
= 100
observations, along with the true values marked by a red cross.
This problem is intrinsically non-identifiable because, even when
the epistemic uncertainty has been removed by a large number
of observations, an infinite number of combinations of
x
1
and
x
2
remain possible.
0 10 20 30 40
0
10
20
30
40
x
1
x
2
(a) D =1,f(x
1
,x
2
) / 1
0 10 20 30 40
0
10
20
30
40
x
1
x
2
(b) D =100,f(x
1
,x
2
) / 1
Figure 6.13: Posterior contours for an
example of non-identifiable problem.
Non-identifiable problems can be regularized u si n g prior knowl-
edge. If, for example, we know that the pri or probability for the
contaminant concentration in the first stream follows
p
(
x
1
)=
N
(
x
1
; 10
,
5
2
), the posterior will now have a s i ngl e mode, as pr e-
sented in figure 6.14.
0 10 20 30 40
0
10
20
30
40
x
1
x
2
Figure 6.14: Posterior contours for an
example of non-identifiable problem for
D
=1andwhichisconstrainedbytheprior
knowledge f (x
1
,x
2
) /N(x
1
;10, 5
2
).
Well-constrained identifiable problems For well-constrained prob-
lems where a true value
ˇx
exists, given an infinite number of direct
observations (
D !1
) that are conditionally independent given
x
,
the posterior tends to a Dirac delt a function, wh i ch is nonzero only
for x x,
f(x|D)=
f(D|x) · f(x)
f(D)
! (x
true value
z}|{
ˇx )
| {z }
Dirac delta function
.
Figure 6.15: Example of posterior PDF
tending toward the Dirac delta function.
Figure 6.15 presents an example of such a case where the posterior
has collapsed over a single value. The posterior expected value then
corresponds to the true value
E[X|D] ! ˇx
|{z}
true value
,
and the posterior variance tends t o z e ro because no ep i st em ic
uncertainty remains about the value of x,
var[X|D] ! 0
|{z}
ˇx:constant
.
Note that in most practical cases we do not have acce ss to an
infinite number of observations. Moreover, keep in mind that the
less data we have access to, the more important it is to consider the
uncertainty in the posterior f(x|D).
6.4 Parameter Estimation
In this section, we will explore how to ex t en d Bayesian estimation
for the parameters
defining the probability density function of
j.-a. goulet 72
a hidden random variable
x : X f(x) f(x; )
.Thissetupis
relevant when the system responses are stochastic. For ex amp le ,
even in the absence of observation er r ors , dierent sampl es from
a concrete batch made from the same mix of com ponents would
have dierent resistances due to the uncertainty associate d with the
intrinsic heterogeneity of the material. Th e posterior p rob abi l i ty
Notation
x
i
: X f
X
(x; )
:PDFsparameters
{x
1
,x
2
, ··· ,x
D
}
:realizationsofthehidden
variable X
density function sought is now
f(|D)=
f(D|) · f()
f(D)
,
where
f
(
) is the prior PDF of unknown parameters,
f(D|)
is
the likelihood of observations
D
=
{y
1
, ··· ,y
D
}
, and
f(|D)
is the
posterior PDF for unknown parameters
. After having estimated
the posterior PDF for parameters
f(|D)
, we can marginalize this
uncertainty in
f
(
x
;
) in order to obtain the po st eri or predictive
PDF for the hidden variable X,
f(x|D)=
Z
f(x; ) · f(|D)d. (6.6)
Parameters of
Random variable
Figure 6.16: Example of Bayesian parame-
ter estimation.
Figure 6.16 presents an example where we want t o e st i mat e the
joint posterior probability of the mean
µ
X
and standard deviation
X
defining the PDF of
X
, which describes the variability in the
resistance of a given concrete mix. Here, the concrete r esi s tan ce
X
is assumed to be a random variable. It means that unlike in
previous sections, no true value exi s t s for the resi st an ce because it
is now a stochastic quantity. D esp i t e t h i s, true values can still exist
for the parameters of
f
(
x
;
), that is,
{ˇµ
X
, ˇ
X
}
. This example will
be further explor ed in §6.5.3.
6.4.1 Prior: f()
f
(
) describes our prior knowledge for the values that
can
take. Prior knowledge can be based on heuri st i cs (expert knowl-
edge), on the posterior PDF obtained f rom previous data, or on
a non-informative prior (i.e., absence of pr ior knowl ed ge) . A non-
Non-informative prior
= {µ
X
,
X
}, N(x; µ
X
,
2
X
)
f(µ
X
) / 1
f(
X
) /
1
X
f(µ
X
,
X
) /
1
X
, if µ
X
??
X
informative prior for parameters such as the mean of a Normal
PDF is described by a uniform density
f
(
µ
)
/
1. For standard
deviations, the non-informative prior typically assumes that all
orders of magnitudes for
have equal probabilities. This hypoth-
esis implies that
f
(
ln
)
/
1. By using the change of variable rul e
presented in §3.4, it leads to
f
(
)
/
1
, because the derivative of
this transformation is
d ln
d
=
1
. Note that the two non-informative
priors described for
µ
and
are improper because
R
f
(
)
d
=
1
,
which does not respect the fundamental re q ui r eme nt that a PDF
probabilistic machine learning for civil engineers 73
integrates to 1, as pre se nted in §3.3. For this reason we cannot draw
samples from an improper prior. Nevertheless, when used as a prior,
an improper PDF can yield to a p r oper posterior (see §7.5).
6.4.2 Likelihood: f(D|)
f(D|) f(Y = y|) f(y | ) L(|y)
are all equivalent ways of
describing the likelihood of observing specific valu es y, given
.In
the special case where
y
=
x
so that realizations of
x
:
X f
(
x
;
)
are directly observed, the likelihood is
f(y|)=f(y; )=f(x; ).
In a more general case where observations
D
=
{y
i
, 8i 2{
1:
D}}
are realizations of an additive observation model
y
i
= x
i
+ v
i
,where
v
i
: V N(v;0,
2
V
) and x
i
: X N(x; µ
X
,
2
X
| {z }
), the likelihood is
Y = X + V
Observation: N(y; µ
X
,
2
X
+
2
V
)
Hidden-state R.V.: N(x; µ
X
,
2
X
)
Measurement error: N(v;0,
2
V
)
f(y|)=N(y; µ
X
,
2
X
+
2
V
).
This choice of an additive observation model using normal random
variables allows having an analytically t rac t abl e formulat ion for
the likelihood. If we want to employ a multiplicative observation
model,
y = x · v
, then choosing
v
:
V ln N
(
v
;0
,
2
ln V
) and
x
:
X ln N
(
x
;
µ
ln X
,
2
ln X
| {z }
) as lognormal random variables also
preserves the analytic al tractability for the likelihood that follows
f(y|)=lnN( y; µ
ln X
,
2
ln X
+
2
ln V
).
6.4.3 Posterior PDF: f(|D)
In §6.3.4, when the number of independent observations
D !1
, Note: The term independent observations
is employed as a shortcut for the more
rigorous term that should be in this case
conditionally independent given x.
we saw that the posterior for
X
tends to a Dirac delta function
f
(x
|D
)
!
(x
ˇ
x
). The situation is dierent for the est i mat i on of
parameters describing the P D F of a hidden random variabl e (R.V.);
when
D !1
, it is now the posterior PDF for
that tends to a
Dirac delta functi on as presented in figure 6.17,
Figure 6.17: The posterior PDF for pa-
rameters reduces to a Dirac delta function
when the number of independent observa-
tions tends to infinity.
f(|D)=
f(D|) · f()
f(D)
! (
true value
z}|{
ˇ
)
| {z }
Dirac delta function
,
where the posterior expected value tends toward the true parameter
value and the variance of the posterior tends to zero,
E[|D] !
ˇ
|{z}
true value
j.-a. goulet 74
var[|D] ! 0
|{z}
ˇ
:constant
.
In such a situation, the epistemic uncert ai nty associated with
parameter values
vanishes and the uncertainty about
X
reaches
its irreducible level, f (x|D) ! f(x;
ˇ
).
In practical cases where the number of observations
D < 1
,
uncertainty remains about the param et er s
f
(
|D
), so
f
(
x|D
)istyp-
ically more diused than
f
(
x
;
ˇ
). The uncertainty associated with
parameters
f
(
|D
) can be marginalized as described in equation 6.6
in order to obtain the posteri or predictive PDF for the hidden ran-
dom variable. This integral is typically not analytically tractable,
so it is common to resort to the Monte Carlo sampling methods
presented in the next section in order to compute th e posterior
predictive expected value and variance for f(x|D).
6.5 Monte Carlo
Monte Carlo sampling is a numerical integration method that
allows performing Bayesian estimation f or analytically intractable
cases. In this section, we will first see how Monte Carlo can be
employed as an integration method. T he n, we will explore how t hi s
method fits in the context of Bayesian estimation.
6.5.1 Monte Carlo Integration
We empl oy the example of a unit diameter circle wit h area
a
=
r
2
=0
.
785 to demonstrate how we can approximate this area
with Monte Carlo numerical integration. A circl e of diameter
d
=1
(r =0.5) is d esc r ibed by the equation
(x 0.5)
2
+(y 0. 5)
2
= r
2
.
This equation is employed in the d efi ni t i on of the indicator fun ctio n
I(x, y)=
1if(x 0.5)
2
+(y 0. 5)
2
r
2
0 otherwise.
0 0.5 1
0
0.5
1
x
y
(a) S =100,
ˆ
E[I(X, Y )] = 0.800
0 0.5 1
0
0.5
1
x
y
(b) S =1000,
ˆ
E[I(X, Y )] = 0.759
0 0.5 1
0
0.5
1
x
y
(c) S =10000,
ˆ
E[I(X, Y )] = 0.785
Figure 6.18: Examples of Monte Carlo
numerical integration for which the theoret-
ical answer is a = r
2
=0.785.
We the n define a bivariate uniform probability density in the ran ge
{x, y}2
(0
,
1), where
f
XY
(
x, y
) = 1. The circle area
a
can be
formulated as the integral of the product of the indi c ator function
times f
XY
(x, y),
a
|{z}
area
=
Z
y
Z
x
I(x, y) · f
XY
(x, y)dxdy
= E[I(X, Y )].
(6.7)
probabilistic machine learning for civil engineers 75
We saw in §3.3.5 that the integral in equation 6.7 corresponds to
the expectation of the indicator fu nc t i on
I
(
X, Y
). Given
S
ran-
dom samples
{x
s
,y
s
}
,
s 2{
1
,
2
, ··· , S}
generated from a uniform
probability density
f
XY
(
x, y
), the expectation operation is defined
as
a = E[ I(X, Y )] = lim
S!1
1
S
S
X
s=1
I(x
s
,y
s
).
In practice, because the number of samples
S
is finite, our estima-
tion will remain an approximation of the expectation,
a
ˆ
E[I(X, Y )] =
1
S
S
X
s=1
I(x
s
,y
s
).
Figure 6.18 presents realizations for
S
=
{
10
2
,
10
3
,
10
4
}
samples as
Note: The hat in
ˆ
E
[
I
(
X, Y
)] denotes
that the quantity is an approximation of
E[I(X, Y )].
well as their appr oximation of the area
a
. As the number of samples
increases, the approximation error decreases and
ˆ
E
[
I
(
X, Y
)] tends
to the true value of a = r
2
=0.785.
The interesting property of Monte Carlo integration is that
the estimation quality, measured by the varian ce
var
h
ˆ
E[I(X, Y )]
i
,
depends on the number of samples and not on the number of
dimensions.
4
This property is key for the application of the Monte
4
MacKay, D. J. C. (1998). Introduction
to Monte Carlo methods. In Learning in
graphical models,pp.175204.Sprigner
Carlo method in high-dimensional domains.
6.5.2 Monte Carlo Sampling: Continuous State Variables
Monte Carlo sampling
h(x): Sampling PDF
x
s
: X h(x)
s 2{1, 2, ··· , S}
Monte Carlo methods are now applied to the Bayesian e s ti m at ion
context presented in §6.3 where we are interested in estimating
continuous state variables. The Monte Car lo integrat i on method is
first employed for approximating the evidence
f(D)=
Z
f(D|x) · f(x)dx
= E[f(D|X)].
A Monte Carlo approximation of the evidence can be computed
using importance sampling where
ˆ
f(D)=
1
S
S
X
s=1
f(D|x
s
) · f( x
s
)
h(x
s
)
,x
s
: X h(x) : Sampling PDF,
where samples
x
s
:
X h
(
x
) are drawn from any sampling
distribution such that
h
(
x
)
>
0 for any
x
for which the PDF on the
numerator
f
(
D|x
)
· f
(
x
)
>
0. If we choose the prior PDF as sampling
distribution, h(x)=f (x), the formulation simplifies to
ˆ
f(D)=
1
S
S
X
s=1
f(D|x
s
),x
s
: X h(x)=f (x)
j.-a. goulet 76
because the prior and the sampling distribution cancel out.
The performance of the Monte Carlo integration depends on
how close our sampling distribution
h
(
x
) is to the posterior
f
(
x|D
).
Note that sampling from the prior,
h
(
x
)=
f
(
x
), is often not a
viable option when the number of ob se r vations
D
is large. This is be-
cause in such a case, the likelihood is the product of several t er ms
so the probability content typically ends up being concentrated in a
small region that is not well described by the p r ior , as illustrated in
figure 6.19.
Figure 6.19: Example of dicult situation
encountered when trying to sample from
the prior.
The posterior mean and variance can be approximated with
Monte Carlo integration using
E[X|D]=
Z
x · f( x|D)dx
1
S
S
X
s=1
x
s
·
f(D|x
s
) · f( x
s
)
f(D)
·
1
h(x
s
)
,x
s
: X h(x)
1
S
S
X
s=1
x
s
·
f(D|x
s
)
f(D)
,x
s
: X h(x)=f (x)
and
var[X|D]=
Z
(x E[X|D])
2
· f (x|D)dx
1
S 1
S
X
s=1
(x
s
E[X|D])
2
·
f(D|x
s
) · f( x
s
)
f(D)
·
1
h(x
s
)
,x
s
: X h(x)
1
S 1
S
X
s=1
(x
s
E[X|D])
2
·
f(D|x
s
)
f(D)
,x
s
: X h(x)=f (x).
For both the posteri or mean and variance, when choosing the prior
as the sampling distribution,
h
(
x
)=
f
(
x
), it cancels out from the
formulation.
Figure 6.20: Example of Bayesian estima-
tion with censored observations.
(Photo: Elevate)
Example: Concentration estimation Figure 6.20 depicts a labora-
tory test for measuring a deterministic mercury concentration
ˇx
.
Observations follow t h e observation model
y
i
x + v
i
, if ˇx + v
i
10µg/L
y
i
= error, if ˇx + v
i
< 10µg/L,
where the measuring device has a minimum thr eshold and the
mutually independent observation errors are described by
v
i
: V
i
N(v;0, 2
2
),V
i
?? V
j
, 8i 6= j.
We want to estimate the posterior PDF describin g t h e concent r a-
tion if the observation obtained is
D
=
{error}
for two hypotheses
probabilistic machine learning for civil engineers 77
regarding the prior knowledge: (1) uniform in the interval (0
,
50),
f
(
x
)=
U
(
x
;0
,
50), and (2) a normal PDF with mean 25 and stan-
dard deviation 10,
f
(
x
)=
N
(
x
; 25
,
10
2
). Figure 6.21 presents the
results for the (a) uniform prior and ( b ) normal prior. Note that
only 100, out of the 10
5
Monte Carlo samples that were employed
to estimate the evidence
f
(
D
), are displayed. This example illu s-
trates how information can be extracted from censored observations.
Note also how the choice of prior influences the posterior knowledge.
The choice of prior is particularly impor t ant here because limited
information is carrie d by the censored observations.
0 25 50
0.0
0.1
0.2
x
f(x)
0 25 50
0.0
0.5
1.0
x
f(D|x)
0 25 50
0.0
0.1
x
f(x|D)
(a) Uniform prior, U(0, 50), f(D) 0.20, E[x|D]=5.2, var[x|D]=10.4
0 25 50
0.0
0.1
0.2
x
f(x)
0 25 50
0.0
0.5
1.0
x
f(D|x)
0 25 50
0.0
0.2
x
f(x|D)
(b) Normal prior, N(25, 10
2
), f (D) 0.07, E[x|D]=6.2, var[x|D]=18.0
Figure 6.21: Comparative example of
auniformandnormalpriorPDFfor
performing Bayesian estimation with
censored observations.
We consi d er a dierent context where we employ
D
= 100 direct
observations,
D
=
{
25
.
4
,
23
.
6
,
24
.
1
, ··· ,
24
.
3
}
1100
, and the prior is
f
(
x
)=
U
(
x
;0
,
50). The prior, likelihood, and posterior are presented
in figure 6.22. Note that here, out of the 100 Monte Carlo samples
displayed, none lie in the high p rob ab il i ty region for the likelihood.
As a result, it leads to a poor approximation of the evidence where
the true value is
f
(
D
)=1
.
19
10
78
and where the approximation
0 25 50
0.0
0.1
0.2
x
f(x)
0 25 50
x
f(D|x)
0 25 50
x
f(x|D)
Figure 6.22: Example where drawing
Monte Carlo samples from the prior leads
to a poor estimate of the posterior because
no sample falls in the high probability
region.
j.-a. goulet 78
is
ˆ
f(D)=
1
S
S
X
s=1
0
@
100
Y
j=1
f(y
j
|x
s
)
1
A
=1.95 10
83
.
Although it is possible to solve this problem by incr eas i ng the
number of samples, this ap p roach is computationally inecient an d
even becomes impossible for high-dimension al domains. Chapter 7
presents advanced Monte Carlo methods for solving this li mi t at i on.
6.5.3 Monte Carlo Sampling: Parameter Estimation
For
D
observations
D
=
{y
1
,y
2
, ··· ,y
D
}
, and samples
s
h
(
)=
f
(
)
,s2{
1
,
2
, ··· , S}
drawn from the pri or PDF of parameters, we
can estimate the evidence
ˆ
f(D)=
1
S
S
X
s=1
0
@
D
Y
j=1
f(y
j
|
s
)
1
A
(6.8)
as well as t he posterior expected values and variance for parameters
ˆ
E[|D]=
1
S
S
X
s=1
s
·
Q
D
j=1
f(y
j
|
s
)
f(D)
!
ˆc o v ( |D)=
1
S 1
S
X
s=1
(
s
E[|D])(
s
E[|D])
|
·
Q
D
j=1
f(y
j
|
s
)
f(D)
!
.
Figure 6.23: Example of samples taken
from the same concrete batch.
Example: Concrete strength We consider the case where we want
to characterize the resistance
R
of a given concrete batch, where
the intrinsic variability is estimat e d by te st i ng the resist anc e from
several samples, as depicte d in figure 6.23. We assume that t h e true
yet unknown parameter values are
ˇµ
R
= 42
MPa
and
ˇ
R
=2
MPa
and that the resistance
R
is described by a log-normal random
variable
R ln N(rµ
ln R
, ˇ
ln R
)MPa.
The unknown parameters we want to learn about are
=[
µ
R
R
]
|
.
Here,
R
is a hidden random variable because i t is not di r ect l y
observed; only imprecise observations
y
i
are avail ab l e. The set of
observations
D
=
{y
1
,y
2
, ··· ,y
D
}
is employed to estimate the
posterior PDF for th e parameters
that are controlling the PDF of
the hidden random variable R,
Posterior PDF
z }| {
f(µ
R
,
R
|D)=
Likelihood
z }| {
f(D|µ
R
,
R
) ·
Prior PDF
z }| {
f(µ
R
,
R
)
f(D)
|{z}
Normalization constant
.
probabilistic machine learning for civil engineers 79
The joint prior knowledge for par amet e rs
f
(
µ
R
,
R
)=
f
(
µ
R
)
· f
(
R
),
(µ
R
??
R
) is obtained from the product of the marginal PDFs,
f(µ
R
)=lnN(µ
R
; µ
ln µ
R
,
ln µ
R
)
| {z }
µ
µ
R
=45MPa
µ
R
=7MPa
,f(
R
)=lnN(
R
; µ
ln
R
,
ln
R
).
| {z }
µ
R
=5MPa
R
=2MPa
The prior PDF for
µ
R
and
R
are themselves described by log-
normal PDFs with respective mean
{µ
µ
R
R
}
and variance
{
2
µ
R
,
2
R
}.
20
40
60
0
5
10
µ
R
R
f(µ
R
,
R
)
{ˇµ
R
, ˇ
R
}
20
40
60
0
5
10
µ
R
R
f(D|µ
R
,
R
)
20
40
60
0
5
10
µ
R
R
f(µ
R
,
R
|D)
10 30 50 70
r
f(r|D)
Predictive f(r|D)
Reality f(rµ
R
, ˇ
R
)
(a) D =1
20
40
60
0
5
10
µ
R
R
f(µ
R
,
R
)
20
40
60
0
5
10
µ
R
R
f(D|µ
R
,
R
)
20
40
60
0
5
10
µ
R
R
f(µ
R
,
R
|D)
10 30 50 70
r
f(r|D)
(b) D =5
20
40
60
0
5
10
µ
R
R
f(µ
R
,
R
)
20
40
60
0
5
10
µ
R
R
f(D|µ
R
,
R
)
20
40
60
0
5
10
µ
R
R
f(µ
R
,
R
|D)
10 30 50 70
r
f(r|D)
(c) D =100
Figure 6.24: Examples of Bayesian esti-
mation as a function of the number of
observations employed, D = {1, 5, 100}.
The observation model is
ln y
i
=
ln r
i
+
v
i
, where the observation
errors
v
i
:
V N
(
v
;0
,
0
.
01
2
)
MPa
. With these assumptions
regarding the observation model, the marginal likelihood for each
observation is
f(y|
ln
)=f(y|µ
ln R
,
ln R
)=lnN(y; µ
ln R
, (
2
ln R
+
=0.01
2
z}|{
2
V
)
1/2
),
and the joint likelihood for t h e entire set of observations is
f(D|
ln
=fct()
z }| {
µ
ln R
,
ln R
)=
D
Y
j=1
1
y
j
p
2
p
2
ln R
+0.01
2
exp
0
@
1
2
ln y
j
µ
ln R
p
2
ln R
+0.01
2
!
2
1
A
| {z }
Log-Normal PDF
.
Note that the marginal and joint likelihood are defined as a func-
tion of the log-space parameters. In order t o e valuate the likeli-
hood for the parameters
{µ
R
,
R
}
, we employ the transformation
functions given in equation 4.4. Note that the likeliho od
f
(y
|
)de-
scribes a probabili ty density only for y and not for . Therefore this
re-parametrization
ln
=
fct
(
) of the likelihood does not require
using the change of variable rul e we saw in §3.4.
Figure 6.24 presents the prior PDF for par amet e rs , the likeli-
hood, the posterior, and the poster i or predictive for (a)
D
= 1,
(b)
D
= 5, and (c)
D
= 100 observations. We see that, as the
number of observations
D
increases, the likelihood and posterior
for
become concentrated around the true val ue s , and the poste-
rior predictive for
R
tends toward the true PDF with p aram et er s
µ
R
! ˇµ
R
= 42 MPa and
R
! ˇ
R
= 2 MPa.
6.6 Conjugate Priors
The previous section went through the process of using numerical
approximation methods for estimating the posterior PDF for the
parameters of a hidden random variable
X
. This complex process