3
Probability Theory
“La théorie des probabilités n’est, au fond, que le bon sens réduit au calcul ; elle fait apprécier avec exactitude ce que les esprits justes sentent par une sorte d’instinct.” (“Probability theory is, at bottom, nothing but common sense reduced to calculus; it makes one appreciate with exactness what accurate minds feel by a sort of instinct.”)
Pierre-Simon, marquis de Laplace (1749–1827)
The interpretation of probability theory employed in this book follows Laplace’s view of “common sense reduced to calculus.” It means that probabilities describe our state of knowledge rather than intrinsically aleatory phenomena. In practice, few phenomena are actually intrinsically unpredictable. Take, for example, a coin as displayed in figure 3.1. Whether a coin toss results in either heads or tails has nothing to do with an inherently aleatory process. The outcome appears unpredictable because of the lack of knowledge about the coin’s initial position, speed, and acceleration. If we could gather information about the coin’s initial kinematic conditions, the outcome would become predictable. Devices that can throw coins with repeatable initial kinematic conditions will lead to repeatable outcomes.
Figure 3.1: A coin toss illustrates the
concept of epistemic uncertainty.
(Photo: Michel Goulet)
Figure 3.2 presents another example where we consider the elastic modulus¹ E at one specific location in a dam. Notwithstanding long-term effects such as creep,² at any given location, E does not vary with time: E is a deterministic, yet unknown constant. Probability is employed here as a tool to describe our incomplete knowledge of that constant.

¹ The elastic modulus relates the stress and strain in Hooke’s law, σ = Eε.
² Creep is the long-term (i.e., years) deformation that occurs under constant stress.
Figure 3.2: The concrete elastic modulus E at a given location is an example of a deterministic, yet unknown quantity. The possible values for E can be described using probability theory.
There are two types of uncertainty: aleatory and epistemic. Aleatory uncertainty is characterized by its irreducibility; no information can either reduce or alter it. Conversely, epistemic uncertainty refers to a lack of knowledge that can be altered by new information. In an engineering context, aleatory uncertainties arise when we are concerned with future realizations that have yet to occur. Epistemic uncertainty applies to any other case dealing with deterministic, yet unknown quantities.
This book approaches machine learning using probability theory
because in many practical engineering problems, the number of
observations available is limited, from a few to a few thousand. In
such a context, the amount of information available is typically
j.-a. goulet 18
insufficient to eliminate epistemic uncertainties. When large data sets are available, probabilistic and deterministic methods may lead to indistinguishable results; the opposite occurs when little data is available. Therefore, the less we know about a problem, the stronger the argument for approaching it using probability theory.
In this chapter, a review of set theory lays the foundation for probability theory, whose central concept is the random variable. Machine learning methods are built from an ensemble of functions organized in a clever way. Therefore, the last part of this chapter looks at what happens when random variables are introduced into deterministic functions.
For specific notions related to probability theory that are outside the scope of this chapter, the reader should refer to dedicated textbooks such as those by Box and Tiao³ and Ang and Tang.⁴

³ Box, G. E. P. and G. C. Tiao (1992). Bayesian inference in statistical analysis. Wiley.
⁴ Ang, A. H.-S. and W. H. Tang (1975). Probability concepts in engineering planning and decision, Volume 1: Basic Principles. John Wiley.
3.1 Set Theory

Set: Ensemble of events or elements.
Universe/sampling space (S): Ensemble of all possible events.
Elementary event (x): A single event, x ∈ S.
Event (E): Ensemble of elementary events.
E ⊆ S : Subset of S
E = S : Certain event
E = ∅ : Impossible event
Ē : Complement of E
A set describes an ensemble of elements, also referred to as events. An elementary event x refers to a single event among a sampling space (or universe) denoted by the calligraphic letter S. By definition, a sampling space contains all the possible events, E ⊆ S. The special case where an event is equal to the sampling space, E = S, is called a certain event. The opposite, E = ∅, where an event is an empty set, is called a null event. Ē refers to the complement of a set, that is, all elements belonging to S and not to E. Figure 3.3 illustrates these concepts using a Venn diagram.
Figure 3.3: Venn diagram representing the sampling space S, an event E, its complement Ē, and an elementary event x.
Figure 3.4: Venn diagrams representing the two basic operations: (a) union; (b) intersection.
Let us consider the example⁵ of the state of a structure following an earthquake, which is described by a sampling space,

  S = {no damage, light damage, important damage, collapse}
    = {N, L, I, C}.

⁵ This example is adapted from Armen Der Kiureghian’s course, CE229, at University of California, Berkeley.

In that context, an event E1 = {N, L} could contain the no damage and light damage events, and another event E2 = {C} could contain only the collapsed state. The complements of these events are, respectively, Ē1 = {I, C} and Ē2 = {N, L, I}.
The two main operations for events, union and intersection, are illustrated in figure 3.4. A union is analogous to the “or” operator, where E1 ∪ E2 holds if the event belongs to either E1, E2, or both. The intersection is analogous to the “and” operator, where E1 ∩ E2 ≡ E1E2 holds if the event belongs to both E1 and E2. As a convention, intersection has priority over union. Moreover, both operations are commutative, associative, and distributive.
Given a set of n events {E1, E2, ⋯, En} ∈ S,
probabilistic machine learning for civil engineers 19
the events are mutually exclusive if EiEj = ∅, ∀i ≠ j, that is, if the intersection for any pair of events is an empty set. Events E1, E2, ⋯, En are collectively exhaustive if ∪_{i=1}^n Ei = S, that is, the union of all events is the sampling space. Events E1, E2, ⋯, En are mutually exclusive and collectively exhaustive if they satisfy both properties simultaneously. Figure 3.5 presents examples of mutually exclusive (3.5a), collectively exhaustive (3.5b), and mutually exclusive and collectively exhaustive (3.5c–d) events. Note that the difference between (b) and (c) is the absence of overlap in the latter.
Union (“or”): E1 ∪ E2
Intersection (“and”): E1 ∩ E2 ≡ E1E2
Commutativity: E1 ∪ E2 = E2 ∪ E1, E1E2 = E2E1
∪_{i=1}^n Ei = E1 ∪ E2 ∪ ⋯ ∪ En
∩_{i=1}^n Ei = E1 ∩ E2 ∩ ⋯ ∩ En
Associativity: (E1 ∪ E2) ∪ E3 = E1 ∪ (E2 ∪ E3) = E1 ∪ E2 ∪ E3
Distributivity: E1(E2 ∪ E3) = (E1E2 ∪ E1E3)
(a) Mutually exclusive
(b) Collectively exhaustive
(c) Mutually exclusive and collectively exhaustive
(d) Mutually exclusive and collectively exhaustive
Figure 3.5: Venn diagrams representing the concepts of mutual exclusivity and collective exhaustivity for events.
3.2 Probability of Events

Pr(Ei) denotes the probability of the event Ei. There are two main interpretations for a probability: the Frequentist and the Bayesian. Frequentists interpret a probability as the number of occurrences of Ei relative to the number of samples s, as s goes to infinity,

  Pr(Ei) = lim_{s→∞} #{Ei} / s.

For Bayesians, a probability measures how likely Ei is in comparison with other events in S. This interpretation assumes that the nature of uncertainty is epistemic, that is, it describes our knowledge of a phenomenon. For instance, the probability depends on the available knowledge and can change when new information is obtained. Throughout this book we adopt this Bayesian interpretation.
By definition, the probability of an event is a number between zero and one, 0 ≤ Pr(Ei) ≤ 1. At the ends of this spectrum, the probability of any event in S is one, Pr(S) = 1, and the probability of an empty set is zero, Pr(∅) = 0. If two events E1 and E2 are mutually exclusive, then the probability of the events’ union is the sum of each event’s probability. Because the union of an event and its complement is the sampling space, E ∪ Ē = S (see figure 3.5d), and because Pr(S) = 1, the probability of the complement is

  Pr(Ē) = 1 − Pr(E).
When events are not mutually exclusive, the general addition rule for the probability of the union of two events is

  Pr(E1 ∪ E2) = Pr(E1) + Pr(E2) − Pr(E1E2).
This general addition rule is illustrated in figure 3.6, where if we simply add the probability of each event without accounting for the subtraction of Pr(E1E2), the probability of the intersection of both events will be counted twice.
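As a quick numeric check of the addition rule, the sketch below uses assumed probabilities for two overlapping events (the values are illustrative, not taken from the text):

```python
# Hypothetical probabilities for two overlapping events (assumed values).
p_e1 = 0.60      # Pr(E1)
p_e2 = 0.30      # Pr(E2)
p_joint = 0.10   # Pr(E1 E2), the intersection

# General addition rule: Pr(E1 U E2) = Pr(E1) + Pr(E2) - Pr(E1 E2)
p_union = p_e1 + p_e2 - p_joint

# Naively summing the marginals counts the intersection twice
naive_sum = p_e1 + p_e2
double_count = naive_sum - p_union  # equals Pr(E1 E2)
```

The difference between the naive sum and the correct union probability is exactly the intersection probability that would otherwise be double counted.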
Figure 3.6: Venn diagram representing the
addition rule for the probability of events.
Pr(E1|E2) denotes the probability of the event E1 conditional on the realization of the event E2. This conditional probability is defined as the joint probability for both events divided by the probability of E2,

  Pr(E1|E2) = Pr(E1E2) / Pr(E2),  Pr(E2) ≠ 0.  (3.1)

Note: Pr(E2) ≠ 0 in the denominator because a division by 0 is not finite.
In equation 3.1, Pr(E1|E2) is a conditional probability, Pr(E1E2) a joint probability, and Pr(E2) a marginal probability.

Note: Statistical independence (⊥⊥) between a pair of random variables implies that learning about one random variable does not modify our knowledge of the other.
The probability of a single event is referred to as a marginal probability. A joint probability designates the probability of the intersection of events. The terms in equation 3.1 can be rearranged to explicitly show that the joint probability of two events {E1, E2} is the product of a conditional probability and its associated marginal,

  Pr(E1E2) = Pr(E1|E2) · Pr(E2)
           = Pr(E2|E1) · Pr(E1).
In cases where E1 and E2 are statistically independent, E1 ⊥⊥ E2, the conditional probabilities are equal to the marginals,

  E1 ⊥⊥ E2 :  Pr(E1|E2) = Pr(E1)
              Pr(E2|E1) = Pr(E2).

In the special case of statistically independent events, the joint probability reduces to the product of the marginals,

  Pr(E1E2) = Pr(E1) · Pr(E2).
The joint probability for n events can be broken down into n − 1 conditionals and one marginal probability using the chain rule,

  Pr(E1E2 ⋯ En) = Pr(E1|E2 ⋯ En) Pr(E2 ⋯ En)
                = Pr(E1|E2 ⋯ En) Pr(E2|E3 ⋯ En) Pr(E3 ⋯ En)
                = Pr(E1|E2 ⋯ En) Pr(E2|E3 ⋯ En) ⋯ Pr(En−1|En) Pr(En).
Let us define {E1, E2, E3, ⋯, En} ∈ S, a set of mutually exclusive and collectively exhaustive events, that is, EiEj = ∅, ∀i ≠ j and ∪_{i=1}^n Ei = S, and an event A belonging to the same sampling
space, that is, A ∈ S. This context is illustrated using a Venn diagram in figure 3.7. The probability of the event A can be obtained by summing the joint probability of A and each event Ei,

  Pr(A) = Σ_{i=1}^n Pr(A|Ei) · Pr(Ei),  (3.2)

where each term Pr(A|Ei) · Pr(Ei) = Pr(AEi).
Figure 3.7: Venn diagram representing the
conditional occurrence of events.
This operation of obtaining a marginal probability from a joint is called marginalization. The addition rule for the union E1 ∪ E2 conditional on A is

  Pr(E1 ∪ E2|A) = Pr(E1|A) + Pr(E2|A) − Pr(E1E2|A),

and the intersection rule is

  Pr(E1E2|A) = Pr(E1|E2, A) · Pr(E2|A).
Using the definition of a conditional probability in equation 3.1, we can break Pr(AEi) into two different products of a conditional and its associated marginal probability,

  Pr(AEi) = Pr(A|Ei) · Pr(Ei)
          = Pr(Ei|A) · Pr(A),

so that

  Pr(Ei|A) · Pr(A) = Pr(A|Ei) · Pr(Ei).  (3.3)
Reorganizing the right-hand terms of equation 3.3 leads to Bayes rule,

  Pr(Ei|A) = Pr(A|Ei) · Pr(Ei) / Pr(A),

where Pr(Ei|A) is the posterior probability, Pr(A|Ei) the conditional probability, Pr(Ei) the prior probability, and Pr(A) the evidence.
On the left-hand side is the posterior probability: the probability of the event Ei given the realization of the event A. On the numerator of the right-hand side is the conditional probability of the event A given the event Ei, times the prior probability of Ei. The term on the denominator is referred to as the evidence and acts as a normalization constant, which ensures that Σᵢ Pr(Ei|A) = 1. The normalization constant Pr(A) is obtained using the marginalization operation presented in equation 3.2. In practical applications, Pr(A) is typically difficult to estimate. Chapters 6 and 7 present analytic as well as numerical methods for tackling this challenge. Figure 3.8 illustrates the conditional occurrence of events in the context of Bayes rule.
  Pr(E1|A) = Pr(A|E1) Pr(E1) / Pr(A) = Pr(AE1) / Pr(A)

Figure 3.8: Venn diagram representing the conditional occurrence of events in the context of Bayes rule.
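The mechanics of Bayes rule can be sketched numerically for the four damage states of §3.1. The priors Pr(Ei) and the likelihoods Pr(A|Ei) of an observation A below are assumed values chosen for illustration, not ones given in the text:

```python
# Assumed priors Pr(E_i) over the damage states and assumed likelihoods
# Pr(A | E_i) of some observation A (e.g., a visible crack).
priors = {"N": 0.70, "L": 0.20, "I": 0.08, "C": 0.02}
likelihood = {"N": 0.05, "L": 0.30, "I": 0.70, "C": 0.95}

# Evidence Pr(A) via marginalization (equation 3.2)
evidence = sum(likelihood[s] * priors[s] for s in priors)

# Posterior Pr(E_i | A) = Pr(A | E_i) * Pr(E_i) / Pr(A)
posterior = {s: likelihood[s] * priors[s] / evidence for s in priors}
```

Dividing by the evidence is what makes the posterior probabilities sum to one; observing A shifts probability mass toward the states that make A likely.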
3.3 Random Variables
Set theory is relevant for introducing the concepts related to probabilities. However, on its own, it has limited applicability to practical problems, which require defining the concept of random variables. A random variable is denoted by a capital letter X. Contrary to what its name implies, a random variable is not intended to describe only intrinsically random events; in our case, it describes lack of knowledge. A random variable X does not take any specific value. Instead, it takes any value in its valid sampling space x ∈ S and, as we will see shortly, the probability of occurrence of each value is typically not equal. Values of x are called either realizations or outcomes and are elementary events that are mutually exclusive and collectively exhaustive. A sampling space S for a random variable can be either discrete or continuous. Continuous cases are always infinite, whereas discrete ones can be either finite or infinite. Figure 3.9 illustrates how the concepts of events and sampling space can be transposed from a Venn diagram representation to the domain of a random variable.

3.3.1 Discrete Random Variables

Figure 3.9: Parallel between a Venn diagram and a continuous domain to represent a random variable.
In the case where S is a discrete domain, the probability that X = x is described by a probability mass function (PMF). In terms of notation, Pr(X = x) ≡ p_X(x) ≡ p(x) are all equivalent. Moreover, we typically describe a random variable by defining its sampling space and its probability mass function so that x : X ∼ p_X(x). The symbol ∼ reads as distributed like. Analogously to the probability of events, the probability that X = x must be

  0 ≤ p_X(x) ≤ 1,

and the sum of the probability for all x ∈ S follows

  Σ_x p_X(x) = 1.

Notation
  X : Random variable
  x : Realization of X
  p_X(x) : Probability that X = x
For the post-earthquake structural safety example introduced in §3.1, where

  S = {no damage (N), light damage (L), important damage (I), collapse (C)},

the sampling space along with the probability of each event can be represented by a probability mass function as depicted in figure 3.10.

Figure 3.10: Representation of a sampling space for a discrete random variable.
The event corresponding to damage that is either light or important corresponds to L ∪ I ≡ {1 ≤ X ≤ 2}. Because the events x = 1 and x = 2 are mutually exclusive, the probability is

  Pr(L ∪ I) = Pr({1 ≤ X ≤ 2})
            = p_X(x = 1) + p_X(x = 2).
The probability that X takes a value less than or equal to x is described by a cumulative mass function (CMF),

  Pr(X ≤ x) = F_X(x) = Σ_{x′ ≤ x} p_X(x′).

Figure 3.11 presents on the same graph the probability mass function (PMF) and the cumulative mass function. As its name indicates, the CMF corresponds to the cumulative sum of the PMF. Inversely, the PMF can be obtained from the CMF following

  p_X(x_i) = F_X(x_i) − F_X(x_{i−1}).
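These two relations can be checked with a small sketch; the PMF values below over the four damage states are assumed for illustration:

```python
# Assumed PMF over the four damage states, indexed x = 1 (N) ... 4 (C).
pmf = [0.70, 0.20, 0.08, 0.02]

# CMF as the cumulative sum of the PMF
cmf = []
running = 0.0
for p in pmf:
    running += p
    cmf.append(running)

# Recover the PMF from the CMF: p_X(x_i) = F_X(x_i) - F_X(x_{i-1})
pmf_back = [cmf[0]] + [cmf[i] - cmf[i - 1] for i in range(1, len(cmf))]
```

The last CMF value equals one, the CMF is nondecreasing, and differencing it recovers the original PMF.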
Figure 3.11: Comparison of a probability mass function p_X(x) and a cumulative mass function F_X(x).
3.3.2 Continuous Random Variables

The concepts presented for discrete sampling spaces can be extended to cases where S is a continuous domain. Because continuous domains are inevitably infinite, the probability that a random variable takes a specific value X = x is zero,

  Pr(X = x) = 0.

Note: Here, a probability equal to zero does not mean that a specific value x is impossible. Take, for example, a random variable defined in the interval (0, 1), for which all the outcomes are equally probable. The probability that X = 0.23642 is only one out of an infinite number of possibilities in (0, 1).
For continuous random variables, the probability is only defined for intervals x < X ≤ x + Δx,

  Pr(x < X ≤ x + Δx) = f_X(x) Δx,

where f_X(x) ≡ f(x) denotes a probability density function (PDF). A PDF must always be greater than or equal to zero, f_X(x) ≥ 0; however, unlike the discrete case where 0 ≤ p_X(x) ≤ 1, f_X(x) can take values greater than one because it describes a probability density rather than a probability. In order to satisfy the property that Pr(S) = 1, the integral of f_X(x) over all possible values of x must be one,

  ∫_{−∞}^{+∞} f_X(x) dx = 1.
The probability Pr(X ≤ x) is given by the cumulative distribution function (CDF),

  Pr(X ≤ x) = F_X(x) = ∫_{−∞}^{x} f_X(x′) dx′.
For a random variable x ∈ ℝ : X ∼ f_X(x), the CDF evaluated at the lower and upper bounds is, respectively, F_X(−∞) = 0 and F_X(+∞) = 1. Notice that the CDF is obtained by integrating the PDF, and inversely, the PDF is obtained by differentiating the CDF,

  F_X(x) = ∫_{−∞}^{x} f_X(x′) dx′  ↔  f_X(x) = dF_X(x)/dx.
Moreover, because F_X(x) is the integral of f_X(x) and f_X(x) ≥ 0, F_X(x) is nondecreasing. Figure 3.12 presents examples of a probability density function and a cumulative distribution function.

Figure 3.12: Examples of PDF and CDF for a continuous random variable: (a) probability density function; (b) cumulative distribution function (nondecreasing).
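The PDF–CDF relations can be verified numerically. The sketch below assumes an exponential random variable with rate lam = 2 (an arbitrary choice); its density exceeds 1 near x = 0 even though all probabilities remain in [0, 1]:

```python
import math

lam = 2.0  # assumed rate parameter

def f(x):
    """PDF of an exponential random variable with rate lam."""
    return lam * math.exp(-lam * x)

def F(x):
    """Closed-form CDF: F_X(x) = 1 - exp(-lam * x)."""
    return 1.0 - math.exp(-lam * x)

def cdf_numeric(x, n=20000):
    """Left Riemann-sum integration of the PDF from 0 to x."""
    dx = x / n
    return sum(f(i * dx) * dx for i in range(n))
```

Here f(0) = 2 > 1, which is legitimate for a density, while the numerically integrated CDF matches the closed form and never exceeds one.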
3.3.3 Conditional Probabilities

Conditional probabilities describe the probability of a random variable’s outcomes, given the realization of another variable. The conditional notation for discrete random variables follows

  X|y ∼ p(x|y) ≡ p_{X|y}(x) ≡ Pr(X = x|y) = p_{XY}(x, y) / p_Y(y),

and the conditional notation for continuous random variables follows

  X|y ∼ f(x|y) ≡ f_{X|y}(x|y) = f_{XY}(x, y) / f_Y(y).
Conditional probabilities are employed in Bayes rule to infer the posterior knowledge associated with a random variable, given the observations made for another.
Let us revisit the post-earthquake structural safety example introduced in §3.1, where the damage state x ∈ S. If we measure the peak ground acceleration (PGA) after an earthquake, y ∈ ℝ⁺, we can employ the conditional probability of having structural damage given the PGA value to infer the structural state of a building that itself has not been observed. (Peak ground acceleration is a metric quantifying the intensity of an earthquake using the maximal acceleration recorded during an event.) Figure 3.13 illustrates schematically how an observation of the peak ground acceleration y can be employed to infer the structural state of a building x, using conditional probabilities. Because the structural state X is a discrete random variable, p(x|y) describes the posterior probability of each state x ∈ S, given an observed value of PGA y ∈ ℝ⁺. f(y) is a normalization constant obtained by marginalizing X from f(x, y) and evaluating it for the particular observed value y,

  f(y) = Σ_{x∈S} f(y|x) · p(x).
The posterior is obtained by multiplying the likelihood of observing the particular value of PGA y given each of the structural states x, times the prior probability of each structural state, and then dividing by the probability of the observation y itself. Conditional probabilities can be employed to relate any combination of continuous and discrete random variables. Chapter 6 further explores Bayesian estimation with applied examples.

Figure 3.13: Schematic example of how observations of the peak ground acceleration y can be employed to infer the structural state of a building x using conditional probabilities.
3.3.4 Multivariate Random Variables

It is common to study the joint occurrence of multiple phenomena. In the context of probability theory, this is done using multivariate random variables. x = [x1 x2 ⋯ xn]ᵀ is a column vector containing realizations for n random variables X = [X1 X2 ⋯ Xn]ᵀ, with x : X ∼ p_X(x) ≡ p(x), or x : X ∼ f_X(x) ≡ f(x). For the discrete case, the probability of the joint realization x is described by

  p_X(x) = Pr(X1 = x1 ∩ X2 = x2 ∩ ⋯ ∩ Xn = xn),
where 0 ≤ p_X(x) ≤ 1. For the continuous case, it is

  f_X(x) Δx = Pr(x1 < X1 ≤ x1 + Δx1 ∩ ⋯ ∩ xn < Xn ≤ xn + Δxn),

for Δx → 0. Note that f_X(x) can be > 1 because it describes a probability density. As mentioned earlier, two random variables X1 and X2 are statistically independent (⊥⊥) if

  p_{X1|x2}(x1|x2) = p_{X1}(x1).
If X1 ⊥⊥ X2 ⊥⊥ ⋯ ⊥⊥ Xn, the joint PMF is defined by the product of its marginals,

  p_{X1:Xn}(x1, ⋯, xn) = p_{X1}(x1) p_{X2}(x2) ⋯ p_{Xn}(xn).
For the general case where X1, X2, ⋯, Xn are not statistically independent, their joint PMF can be defined using the chain rule,

  p_{X1:Xn}(x1, ⋯, xn) = p_{X1|X2:Xn}(x1|x2, ⋯, xn) ⋯ p_{Xn−1|Xn}(xn−1|xn) · p_{Xn}(xn).
The same rules apply for continuous random variables except that p_X(x) is replaced by f_X(x). Figure 3.14 presents examples of marginals and a bivariate joint probability density function.
The multivariate cumulative distribution function describes the probability that a set of n random variables is simultaneously less than or equal to x,

  F_X(x) = Pr(X1 ≤ x1 ∩ ⋯ ∩ Xn ≤ xn).
The joint CDF is obtained by integrating the joint PDF over each dimension from its lower bound up to x, and inversely, the joint PDF is obtained by differentiating the CDF,

  F_X(x) = ∫_{−∞}^{x1} ⋯ ∫_{−∞}^{xn} f_X(x′) dx′  ↔  f_X(x) = ∂^n F_X(x) / (∂x1 ⋯ ∂xn).
A multivariate CDF has values 0 ≤ F_X(x) ≤ 1, and its value is zero at the lower bound for any dimension and one at the upper bound for all dimensions,

  F_{X1:Xn}(x1, ⋯, xn−1, −∞) = 0
  F_{X1:Xn}(+∞, ⋯, +∞, +∞) = 1.

Figure 3.15 presents an example of marginals and a bivariate cumulative distribution function.
Figure 3.14: Examples of marginals and a bivariate probability density function f_{X1X2}(x1, x2).
Figure 3.15: Examples of marginals and a bivariate cumulative distribution function F_{X1X2}(x1, x2).
The operation consisting of removing a random variable from a joint set is called marginalization. For a set of n joint random variables, we can remove the i-th variable by summing over the i-th dimension; for example, summing over the n-th dimension gives

  Σ_{xn} p_{X1:Xn}(x1, ⋯, xn) = p_{X1:Xn−1}(x1, ⋯, xn−1).

If we marginalize all variables by summing over all dimensions, the result is

  Σ_{x1} ⋯ Σ_{xn} p_X(x) = 1.
For the example presented in figure 3.16, X1 ⊥⊥ X2, so the joint PMF is obtained by the product of its marginals. It is possible to obtain the marginal PMF for x1 from the joint through marginalizing, Σ_{i=1}^3 p_X(x1, i):

  p_X(x1, x2)   x2 = 1   x2 = 2   x2 = 3   Σ_{i=1}^3 p_X(x1, i)
  x1 = 1         0.08     0.015    0.005    0.1
  x1 = 2         0.40     0.075    0.025    0.5
  x1 = 3         0.32     0.060    0.020    0.4

Marginalization:
  ∫_{−∞}^{∞} f_{X1X2}(x1, x2) dx2 = f_{X1}(x1)
  Σ_{x2} p_{X1X2}(x1, x2) = p_{X1}(x1)
  F_{X1X2}(x1, +∞) = F_{X1}(x1)
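The table above can be reproduced in a few lines; the joint values are the ones from the example, with X1 indexing rows and X2 indexing columns:

```python
# Joint PMF p_X(x1, x2) from the example (rows: x1 = 1..3, cols: x2 = 1..3).
joint = [
    [0.08, 0.015, 0.005],
    [0.40, 0.075, 0.025],
    [0.32, 0.060, 0.020],
]

# Marginalize X2 out by summing each row: p_X1(x1)
p_x1 = [sum(row) for row in joint]
# Marginalize X1 out by summing each column: p_X2(x2)
p_x2 = [sum(row[j] for row in joint) for j in range(3)]

# Here X1 and X2 are independent, so each joint entry equals the
# product of the corresponding marginals.
is_independent = all(
    abs(joint[i][j] - p_x1[i] * p_x2[j]) < 1e-9
    for i in range(3)
    for j in range(3)
)
```

Summing each row recovers p_X1 = (0.1, 0.5, 0.4) and summing each column recovers p_X2 = (0.8, 0.15, 0.05), confirming the independence structure of the example.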
Marginalization applies to continuous random variables using integration,

  ∫_{−∞}^{∞} f_{X1:Xn}(x1, ⋯, xn) dxn = f_{X1:Xn−1}(x1, ⋯, xn−1),

where again, if we integrate over all dimensions, the result is

  ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} f_X(x) dx = 1.
For both continuous and discrete random variables, we can marginalize a random variable by evaluating its CDF at its upper bound,

  F_{X1:Xn}(x1, ⋯, xn−1, +∞) = F_{X1:Xn−1}(x1, ⋯, xn−1).

  p_{X1}(x1) : p_{X1}(1) = 0.1, p_{X1}(2) = 0.5, p_{X1}(3) = 0.4
  p_{X2}(x2) : p_{X2}(1) = 0.8, p_{X2}(2) = 0.15, p_{X2}(3) = 0.05
  p_X(x1, x2) = p_{X1}(x1) · p_{X2}(x2)

Figure 3.16: Examples of marginals and a bivariate probability mass function.
3.3.5 Moments and Expectation

The moment of order m, E[X^m], of a random variable X is defined as

  E[X^m] = ∫ x^m · f_X(x) dx   (continuous)
         = Σᵢ xᵢ^m · p_X(xᵢ)   (discrete),

where E[·] denotes the expectation operation. For m = 1, E[X] = µ_X is a measure of position for the centroid of the probability density or mass function. This centroid is analogous to the concept of center of gravity for a solid body or cross section. An expected value,

  E[X] = ∫ x · f_X(x) dx   (continuous)
       = Σᵢ xᵢ · p_X(xᵢ)   (discrete),

refers to the sum of all possible values weighted by their probability of occurrence. A key property of the expectation is that it is a linear operation, so that

  E[X + Y] = E[X] + E[Y].
The notion of expectation can be extended to any function of random variables g(X),

  E[g(X)] = ∫ g(x) · f_X(x) dx.

The expectation of the function g(X) = (X − µ_X)^m is referred to as the centered moment of order m,

  E[(X − µ_X)^m] = ∫ (x − µ_X)^m · f_X(x) dx.
For the special cases where m = 1, E[(X − µ_X)¹] = 0, and for m = 2,

  E[(X − µ_X)²] = σ²_X = var[X]
                = E[X²] − E[X]²,

where σ_X denotes the standard deviation of X, and var[·] denotes the variance operator that measures the dispersion of the probability density function with respect to its mean. The notion of variance is analogous to the concept of moment of inertia for a cross section. Together, µ_X and σ_X are metrics describing the centroid and dispersion of a random variable. Another, dimensionless, dispersion metric for describing a random variable is the coefficient of variation, δ_X = σ_X / µ_X. Note that δ_X only applies for µ_X ≠ 0.
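A short sketch computes these metrics for the discrete marginal p_X1 = (0.1, 0.5, 0.4) from the earlier example:

```python
# Discrete random variable: outcomes and their PMF (from the example).
xs = [1, 2, 3]
pmf = [0.1, 0.5, 0.4]

mean = sum(x * p for x, p in zip(xs, pmf))        # E[X] = mu_X
second = sum(x**2 * p for x, p in zip(xs, pmf))   # E[X^2]
variance = second - mean**2                       # var[X] = E[X^2] - E[X]^2
std = variance**0.5                               # sigma_X
delta_x = std / mean                              # coefficient of variation
```

This gives µ_X = 2.3 and var[X] = 5.7 − 2.3² = 0.41, illustrating the shortcut var[X] = E[X²] − E[X]².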
Given two random variables X, Y, their covariance, cov(X, Y), is defined by the expectation of the product of the mean-centered variables,

  E[(X − µ_X)(Y − µ_Y)] = cov(X, Y)
                        = E[XY] − E[X] · E[Y]
                        = ρ_XY · σ_X · σ_Y.
The correlation coefficient ρ_XY can take a value between −1 and 1, and quantifies the linear dependence between X and Y,

  ρ_XY = cov(X, Y) / (σ_X σ_Y),  −1 ≤ ρ_XY ≤ +1.

A positive (negative) correlation indicates that a large outcome for X is associated with a high probability for a large (small) outcome for Y. Figure 3.17 presents examples of scatter plots generated for different correlation coefficients.

Figure 3.17: Examples of scatter plots between the realizations of two random variables for different correlation coefficients.
In the special case where X and Y are independent, the correlation is zero, X ⊥⊥ Y ⟹ ρ_XY = 0. Note that the inverse is not true; a correlation coefficient equal to zero does not guarantee independence, ρ_ij = 0 ⇏ X_i ⊥⊥ X_j. This happens because correlation only measures the linear dependence between a pair of random variables; two random variables can be nonlinearly dependent, yet have a correlation coefficient equal to zero. Figure 3.18 presents an example of a scatter plot with quadratic dependence yet no linear dependence, so ρ ≈ 0.

Figure 3.18: Example of a scatter plot where there is a quadratic dependence between the variables, yet the correlation coefficient ρ ≈ 0.
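This effect is easy to reproduce: below, Y depends deterministically on X through a square, yet the sample correlation coefficient vanishes. The symmetric grid of x values is an arbitrary choice made for the sketch:

```python
# Symmetric grid of x values and a purely quadratic dependence y = x^2.
xs = [i / 10.0 for i in range(-50, 51)]
ys = [x**2 for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Sample covariance and standard deviations
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
std_x = (sum((x - mean_x) ** 2 for x in xs) / n) ** 0.5
std_y = (sum((y - mean_y) ** 2 for y in ys) / n) ** 0.5

# Correlation is ~0 even though Y is a deterministic function of X
rho = cov_xy / (std_x * std_y)
```

By symmetry, E[X³] = 0 on this grid, so the covariance (and hence ρ) collapses to zero despite the perfect nonlinear dependence.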
Correlation also does not imply causality. For example, the number of flu cases is negatively correlated with the temperature; when the seasonal temperatures drop during winter, the number of flu cases increases. Nonetheless, the cold itself is not causing the flu; someone isolated in a cold climate is unlikely to contract the flu because the virus is itself unlikely to be present in the environment. Instead, studies have shown that the flu virus has a higher transmissibility in the cold and dry conditions that are prevalent during winter. See, for example, Lowen and Steel.⁶

⁶ Lowen, A. C. and J. Steel (2014). Roles of humidity and temperature in shaping influenza seasonality. Journal of Virology 88(14), 7692–7695.
For a set of n random variables X1, X2, ⋯, Xn, the covariance matrix defines the dispersion of each variable through its variance, located on the main diagonal, and the dependence between variables through the pairwise covariances, located on the off-diagonal terms,

  Σ = | σ²_{X1}   ⋯   ρ_{1n} σ_{X1} σ_{Xn} |
      |    ⋮      ⋱           ⋮            |
      |  sym.     ⋯        σ²_{Xn}         | .
A covariance matrix is symmetric (sym.), and each term is defined following [Σ]_ij = cov(Xi, Xj) = ρ_ij σ_{Xi} σ_{Xj}. Because a variable is linearly correlated with itself (ρ = 1), the main diagonal terms reduce to [Σ]_ii = σ²_{Xi}. A covariance matrix has to be positive semidefinite (see §2.4.2), so the variances on the main diagonal must be > 0. In order to avoid singular cases, there should be no linearly dependent variables, that is, −1 < ρ_ij < 1, ∀i ≠ j.
3.4 Functions of Random Variables

Let us consider a continuous random variable X ∼ f_X(x) and a monotonic deterministic function y = g(x). (A monotonic function g(x) takes one variable as input, returns one variable as output, and is strictly either increasing or decreasing.) The function’s output Y is a random variable because it takes as input the random variable X. The PDF f_Y(y) is defined knowing that for each infinitesimal part of the domain dx, there is a corresponding dy, and the probability over both domains must be equal,

  Pr(y < Y ≤ y + dy) = Pr(x < X ≤ x + dx)
  f_Y(y) dy = f_X(x) dx,

where both densities satisfy f_Y(y) ≥ 0 and f_X(x) ≥ 0.
The change-of-variable rule for f_Y(y) is defined by

  f_Y(y) = f_X(x) |dx/dy|
         = f_X(x) |dy/dx|⁻¹
         = f_X(g⁻¹(y)) |dg(g⁻¹(y))/dx|⁻¹,
where multiplying by |dx/dy| accounts for the change in the size of the neighborhood of x with respect to y, and where the absolute value ensures that f_Y(y) ≥ 0. For a function y = g(x) and its inverse x = g⁻¹(y), the gradient is obtained from

  dy/dx ≡ dg(x)/dx ≡ dg(g⁻¹(y))/dx,  since g⁻¹(y) = x.
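The change-of-variable rule can be checked by Monte Carlo. The sketch assumes X ∼ Uniform(0, 1) and the monotonic map y = g(x) = x², for which F_Y(y) = √y, so f_Y(y) = 1/(2√y) on (0, 1):

```python
import math
import random

random.seed(0)  # deterministic sketch

# Draw samples of X ~ Uniform(0, 1) and push them through g(x) = x^2
samples_y = [random.random() ** 2 for _ in range(100_000)]

# Empirical Pr(a < Y <= b) against the exact value obtained from the
# change-of-variable result: F_Y(b) - F_Y(a) = sqrt(b) - sqrt(a)
a, b = 0.25, 0.64
empirical = sum(a < y <= b for y in samples_y) / len(samples_y)
exact = math.sqrt(b) - math.sqrt(a)
```

The empirical interval probability agrees with the analytic one to within Monte Carlo error, confirming that the |dx/dy| factor is what keeps probabilities consistent across the transformation.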
Figure 3.19: Example of a 1-D nonlinear transformation y = g(x). Notice how the nonlinear transformation causes the modes (i.e., the most likely values) to be different in the x and y spaces.

A transformation from a space x to another space y requires taking into account the change in the size of the neighborhood, f_Y(y) = f_X(x)|dx/dy|.
Figure 3.19 presents an example of a nonlinear transformation y = g(x). Notice how, because of the nonlinear transformation, the maximum for f_X(x) and the maximum for f_Y(y) do not occur at the same locations, that is, y* ≠ g(x*).
Given a set of n random variables x ∈ ℝⁿ : X ∼ f_X(x), we can generalize the transformation rule to an n-to-n multivariate function y = g(x), as illustrated in figure 3.20a for a case where n = 2. As with the univariate case, we need to account for the change in the neighborhood size when going from the original to the transformed space, as illustrated in figure 3.20b.

Figure 3.20: Illustration of a 2-D transformation: (a) 2-D transformation; (b) effect of a 2-D transformation on the neighborhood size.

The transformation is then defined by
  f_Y(y) dy = f_X(x) dx
  f_Y(y) = f_X(x) |dx/dy|,

where |dx/dy| is the inverse of the determinant of the Jacobian matrix,

  |dx/dy| = |det J_{y,x}|⁻¹,  |dy/dx| = |det J_{y,x}|.
The Jacobian is an $n \times n$ matrix containing the partial derivatives of
$y_k$ with respect to $x_l$, evaluated at $\mathbf{x}$ so that
$[\mathbf{J}_{y,x}]_{k,l} = \frac{\partial y_k}{\partial x_l}$,
$$\mathbf{J}_{y,x} =
\begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n}\\
\vdots & \ddots & \vdots\\
\frac{\partial y_n}{\partial x_1} & \cdots & \frac{\partial y_n}{\partial x_n}
\end{bmatrix}
=
\begin{bmatrix}
\nabla g_1(\mathbf{x})\\
\vdots\\
\nabla g_n(\mathbf{x})
\end{bmatrix}.$$
Note that each row of the Jacobian matrix corresponds to the gradient
vector evaluated at $\mathbf{x}$,
$$\nabla g(\mathbf{x}) = \left[\frac{\partial g(\mathbf{x})}{\partial x_1} \;\cdots\; \frac{\partial g(\mathbf{x})}{\partial x_n}\right].$$
The determinant (see §2.4.1) of the Jacobian is a scalar quantifying
the size of the neighborhood of $d\mathbf{y}$ with respect to $d\mathbf{x}$.
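To make the determinant concrete, here is a small sketch (illustrative, not from the text) that builds the Jacobian by finite differences for the 2-D linear map $y_1 = x_1 + x_2$, $y_2 = x_1 - x_2$, whose analytic Jacobian is $[[1, 1], [1, -1]]$ with $|\det \mathbf{J}| = 2$; NumPy is assumed here.

```python
import numpy as np

# Sketch: for y1 = x1 + x2, y2 = x1 - x2, the determinant of the Jacobian
# quantifies how the size of a neighborhood dx changes under the map.

def g(x):
    return np.array([x[0] + x[1], x[0] - x[1]])

def jacobian(g, x, h=1e-6):
    """Central finite-difference Jacobian [J]_{k,l} = dy_k / dx_l."""
    n = len(x)
    J = np.zeros((n, n))
    for l in range(n):
        dx = np.zeros(n)
        dx[l] = h
        J[:, l] = (g(x + dx) - g(x - dx)) / (2 * h)
    return J

J = jacobian(g, np.array([1.0, 2.0]))
# |det J| = 2, so every neighborhood doubles in size and
# f_Y(y) = f_X(x) / 2 everywhere for this map.
assert np.allclose(J, [[1.0, 1.0], [1.0, -1.0]])
assert abs(abs(np.linalg.det(J)) - 2.0) < 1e-6
```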
3.4.1 Linear Functions
(a) Generic linear transformation
(b) Linear transformation $y = 2x$
Figure 3.21: Examples of transformations through a linear function.
Figure 3.21b illustrates how a function $y = 2x$ transforms a random
variable $X$ with mean $\mu_X = 1$ and standard deviation $\sigma_X = 0.5$
into $Y$ with mean $\mu_Y = 2$ and standard deviation $\sigma_Y = 1$. In the
machine learning context, it is common to employ linear functions
of random variables $y = g(x) = ax + b$, as illustrated in figure 3.21a.
Given a random variable $X$ with mean $\mu_X$ and variance $\sigma_X^2$, the
change in the neighborhood size simplifies to
$$\left|\frac{dy}{dx}\right| = |a|.$$
In such a case, because of the linear property of the expectation
operation (see §3.3.5),
$$\mu_Y = g(\mu_X) = a\mu_X + b, \qquad \sigma_Y = |a|\,\sigma_X.$$
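These closed-form moments can be checked against a Monte Carlo simulation. The sketch below (an illustrative addition, not part of the text) uses the same numbers as figure 3.21b: $Y = 2X$ with $X \sim \text{Normal}(1, 0.5)$.

```python
import random
import statistics

# Sketch: Monte Carlo check of mu_Y = a*mu_X + b and sigma_Y = |a|*sigma_X
# for Y = 2X, X ~ Normal(mu_X = 1, sigma_X = 0.5), as in figure 3.21b.
random.seed(0)
a, b = 2.0, 0.0
xs = [random.gauss(1.0, 0.5) for _ in range(100_000)]
ys = [a * x + b for x in xs]

# Sample moments should approach mu_Y = 2 and sigma_Y = 1.
assert abs(statistics.mean(ys) - 2.0) < 0.02
assert abs(statistics.stdev(ys) - 1.0) < 0.02
```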
probabilistic machine learning for civil engineers 31
Let us consider a set of $n$ random variables $\mathbf{X}$ defined by its mean
vector and covariance matrix,
$$\mathbf{X} = \begin{bmatrix} X_1\\ \vdots\\ X_n \end{bmatrix}, \quad
\boldsymbol{\mu}_{\mathbf{X}} = \begin{bmatrix} \mu_{X_1}\\ \vdots\\ \mu_{X_n} \end{bmatrix}, \quad
\boldsymbol{\Sigma}_{\mathbf{X}} = \begin{bmatrix}
\sigma_{X_1}^2 & \cdots & \rho_{1n}\sigma_{X_1}\sigma_{X_n}\\
\vdots & \ddots & \vdots\\
\text{sym.} & & \sigma_{X_n}^2
\end{bmatrix},$$
and the variables $\mathbf{Y} = [Y_1\; Y_2\; \cdots\; Y_n]^\intercal$ obtained from a linear
function $\mathbf{Y} = \mathbf{g}(\mathbf{X}) = \mathbf{A}\mathbf{X} + \mathbf{b}$ so that
$$\underbrace{\mathbf{Y}}_{n\times 1} =
\underbrace{\mathbf{A}}_{\substack{n\times n\\ \mathbf{A}\,=\,\mathbf{J}_{y,x}}}
\underbrace{\mathbf{X}}_{n\times 1} +
\underbrace{\mathbf{b}}_{n\times 1}.$$
The function outputs $\mathbf{Y}$ are then described by the mean vector,
covariance matrix, and joint covariance,
$$\left.\begin{aligned}
\boldsymbol{\mu}_{\mathbf{Y}} &= \mathbf{g}(\boldsymbol{\mu}_{\mathbf{X}}) = \mathbf{A}\boldsymbol{\mu}_{\mathbf{X}} + \mathbf{b}\\
\boldsymbol{\Sigma}_{\mathbf{Y}} &= \mathbf{A}\boldsymbol{\Sigma}_{\mathbf{X}}\mathbf{A}^\intercal\\
\boldsymbol{\Sigma}_{\mathbf{XY}} &= \boldsymbol{\Sigma}_{\mathbf{X}}\mathbf{A}^\intercal
\end{aligned}\right\}
\quad
\begin{bmatrix}\mathbf{X}\\ \mathbf{Y}\end{bmatrix}: \quad
\begin{bmatrix}\boldsymbol{\mu}_{\mathbf{X}}\\ \boldsymbol{\mu}_{\mathbf{Y}}\end{bmatrix}, \quad
\begin{bmatrix}\boldsymbol{\Sigma}_{\mathbf{X}} & \boldsymbol{\Sigma}_{\mathbf{XY}}\\
\boldsymbol{\Sigma}_{\mathbf{XY}}^\intercal & \boldsymbol{\Sigma}_{\mathbf{Y}}\end{bmatrix}.$$

Note: For linear functions $\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b}$,
the Jacobian $\mathbf{J}_{y,x}$ is the matrix $\mathbf{A}$ itself.
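These moment-propagation formulas translate directly into matrix products. The sketch below is an illustrative addition (the numbers are invented, not from the text) assuming NumPy.

```python
import numpy as np

# Sketch: propagating the mean and covariance through Y = AX + b using
# mu_Y = A mu_X + b, Sigma_Y = A Sigma_X A^T, Sigma_XY = Sigma_X A^T.
# Illustrative numbers, not from the text.
mu_X = np.array([1.0, 2.0])
Sigma_X = np.array([[0.25, 0.10],
                    [0.10, 0.50]])
A = np.array([[1.0, 1.0],
              [2.0, -1.0]])   # A is also the Jacobian J_{y,x}
b = np.array([0.0, 1.0])

mu_Y = A @ mu_X + b            # [3, 1]
Sigma_Y = A @ Sigma_X @ A.T
Sigma_XY = Sigma_X @ A.T

# Sigma_Y must be symmetric positive semidefinite.
assert np.allclose(mu_Y, [3.0, 1.0])
assert np.allclose(Sigma_Y, Sigma_Y.T)
assert np.all(np.linalg.eigvalsh(Sigma_Y) >= 0)
```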
If instead of having an $n \to n$ function, we have an $n \to 1$
function $y = g(\mathbf{X}) = \mathbf{a}^\intercal\mathbf{X} + b$, then the Jacobian simplifies to the
gradient vector $\nabla g(\mathbf{x}) = \left[\frac{\partial g(\mathbf{x})}{\partial x_1} \;\cdots\; \frac{\partial g(\mathbf{x})}{\partial x_n}\right]$, which is again equal to
the vector $\mathbf{a}^\intercal$,
$$\underbrace{Y}_{1\times 1} =
\underbrace{\mathbf{a}^\intercal}_{\substack{1\times n\\ \mathbf{a}^\intercal\,=\,\nabla g(\mathbf{x})}}
\underbrace{\mathbf{X}}_{n\times 1} +
\underbrace{b}_{1\times 1}.$$
The function output $Y$ is then described by
$$\mu_Y = g(\boldsymbol{\mu}_{\mathbf{X}}) = \mathbf{a}^\intercal\boldsymbol{\mu}_{\mathbf{X}} + b, \qquad
\sigma_Y^2 = \mathbf{a}^\intercal\boldsymbol{\Sigma}_{\mathbf{X}}\mathbf{a}.$$
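The $n \to 1$ case reduces to vector-matrix-vector products. A minimal sketch with invented numbers (not from the text), assuming NumPy:

```python
import numpy as np

# Sketch: for y = a^T x + b, the output moments are mu_Y = a^T mu_X + b
# and sigma_Y^2 = a^T Sigma_X a. Illustrative numbers.
mu_X = np.array([1.0, 2.0])
Sigma_X = np.array([[0.25, 0.10],
                    [0.10, 0.50]])
a = np.array([1.0, -1.0])
b = 3.0

mu_Y = a @ mu_X + b        # 1*1 - 1*2 + 3 = 2
var_Y = a @ Sigma_X @ a    # 0.25 - 0.10 - 0.10 + 0.50 = 0.55

assert abs(mu_Y - 2.0) < 1e-12
assert abs(var_Y - 0.55) < 1e-12
```

Note how the variance combines the individual variances with the covariance term: the negative weight on $X_2$ subtracts the covariance twice.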
3.4.2 Linearization of Nonlinear Functions
The Hessian $\mathbf{H}(\boldsymbol{\mu}_{\mathbf{X}})$ is an $n \times n$ matrix containing the
2nd-order partial derivatives evaluated at $\boldsymbol{\mu}_{\mathbf{X}}$; see §5.2 for details.

Because of the analytic simplicity associated with linear functions
of random variables, it is common to approximate nonlinear functions
by linear ones using a Taylor series so that
$$g(\mathbf{X}) =
\overbrace{\underbrace{g(\boldsymbol{\mu}_{\mathbf{X}})
+ \overbrace{\nabla g(\boldsymbol{\mu}_{\mathbf{X}})}^{\text{Gradient}}
(\mathbf{X}-\boldsymbol{\mu}_{\mathbf{X}})}_{\text{1st-order approximation}}
+ \tfrac{1}{2}(\mathbf{X}-\boldsymbol{\mu}_{\mathbf{X}})^\intercal
\overbrace{\mathbf{H}(\boldsymbol{\mu}_{\mathbf{X}})}^{\text{Hessian}}
(\mathbf{X}-\boldsymbol{\mu}_{\mathbf{X}})}^{\text{2nd-order approximation}}
+ \underbrace{\cdots}_{m\text{th-order approximation}}.$$
In practice, the series are most often limited to the first-order
approximation, so for a one-to-one function, it simplifies to
$$Y = g(X) \approx aX + b.$$
Figure 3.22 presents an example of such a linear approximation
for a one-to-one transformation. Linearizing at the expected value
$\mu_X$ minimizes the approximation errors because the linearization
is then centered in the region associated with a high probability
content for $f_X(x)$. In that case, $a$ corresponds to the gradient of
$g(x)$ evaluated at $\mu_X$,
Figure 3.22: Example of a linearized nonlinear transformation.
$$a = \left.\frac{dg(x)}{dx}\right|_{x=\mu_X}.$$
For the $n \to 1$ multivariate case, the linearized transformation leads
to
$$Y = g(\mathbf{X}) \approx \mathbf{a}^\intercal\mathbf{X} + b
= \nabla g(\boldsymbol{\mu}_{\mathbf{X}})(\mathbf{X} - \boldsymbol{\mu}_{\mathbf{X}}) + g(\boldsymbol{\mu}_{\mathbf{X}}),$$
where $Y$ has a mean and variance equal to
$$\mu_Y \approx g(\boldsymbol{\mu}_{\mathbf{X}}), \qquad
\sigma_Y^2 \approx \nabla g(\boldsymbol{\mu}_{\mathbf{X}})\,\boldsymbol{\Sigma}_{\mathbf{X}}\,\nabla g(\boldsymbol{\mu}_{\mathbf{X}})^\intercal.$$
For the $n \to n$ multivariate case, the linearized transformation leads
to
$$\mathbf{Y} = \mathbf{g}(\mathbf{X}) \approx \mathbf{A}\mathbf{X} + \mathbf{b}
= \mathbf{J}_{\mathbf{Y},\mathbf{X}}(\boldsymbol{\mu}_{\mathbf{X}})(\mathbf{X} - \boldsymbol{\mu}_{\mathbf{X}}) + \mathbf{g}(\boldsymbol{\mu}_{\mathbf{X}}),$$
where $\mathbf{Y}$ is described by the mean vector and covariance matrix,
$$\boldsymbol{\mu}_{\mathbf{Y}} \approx \mathbf{g}(\boldsymbol{\mu}_{\mathbf{X}}), \qquad
\boldsymbol{\Sigma}_{\mathbf{Y}} \approx \mathbf{J}_{\mathbf{Y},\mathbf{X}}(\boldsymbol{\mu}_{\mathbf{X}})\,\boldsymbol{\Sigma}_{\mathbf{X}}\,\mathbf{J}_{\mathbf{Y},\mathbf{X}}^\intercal(\boldsymbol{\mu}_{\mathbf{X}}).$$
For multivariate nonlinear functions, the gradient or Jacobian is
evaluated at the expected value $\boldsymbol{\mu}_{\mathbf{X}}$.
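To see how well the first-order approximation performs, the sketch below (an illustrative addition with invented numbers, assuming NumPy) linearizes $g(x) = x^2$ at $\mu_X$ and compares the linearized moments against a Monte Carlo estimate. With a small $\sigma_X$, the density is concentrated where the linearization is accurate.

```python
import numpy as np

# Sketch: first-order propagation for the nonlinear one-to-one function
# y = g(x) = x^2, with X ~ Normal(mu_X = 3, sigma_X = 0.1), compared
# against Monte Carlo. Illustrative numbers, not from the text.
rng = np.random.default_rng(0)
mu_X, sigma_X = 3.0, 0.1

g = lambda x: x**2
dg = lambda x: 2 * x                     # gradient of g

mu_Y_lin = g(mu_X)                       # linearized mean: 9
sigma_Y_lin = abs(dg(mu_X)) * sigma_X    # |a| * sigma_X = 0.6

ys = g(rng.normal(mu_X, sigma_X, 1_000_000))
# The exact mean is mu_X^2 + sigma_X^2 = 9.01, so the linearized
# moments are close but not exact.
assert abs(ys.mean() - mu_Y_lin) < 0.05
assert abs(ys.std() - sigma_Y_lin) < 0.05
```

Repeating the experiment with a larger $\sigma_X$ would show the approximation degrading, since more probability mass falls where $g$ is visibly curved.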