3
Probability Theory
“La théorie des probabilités n’est, au fond, que le bon sens réduit au calcul ; elle fait apprécier avec exactitude ce que les esprits justes sentent par une sorte d’instinct.” (“Probability theory is, at bottom, nothing but common sense reduced to calculus; it makes one appreciate with exactness what accurate minds feel by a sort of instinct.”)
Pierre-Simon, marquis de Laplace (1749–1827)
The interpretation of probability theory employed in this book follows Laplace’s view of “common sense reduced to calculus.” It means that probabilities describe our state of knowledge rather than intrinsically aleatory phenomena. In practice, few phenomena are actually intrinsically unpredictable. Take, for example, a coin as displayed in figure 3.1. Whether a coin toss results in either heads or tails has nothing to do with an inherently aleatory process. The outcome appears unpredictable because of the lack of knowledge about the coin’s initial position, speed, and acceleration. If we could gather information about the coin’s initial kinematic conditions, the outcome would become predictable. Devices that can throw coins with repeatable initial kinematic conditions will lead to repeatable outcomes.
Figure 3.1: A coin toss illustrates the
concept of epistemic uncertainty.
(Photo: Michel Goulet)
Figure 3.2 presents another example where we consider the elastic modulus¹ E at one specific location in a dam. Notwithstanding long-term effects such as creep,² at any given location, E does not vary with time: E is a deterministic, yet unknown constant. Probability is employed here as a tool to describe our incomplete knowledge of that constant.

¹ The elastic modulus relates the stress and strain in Hooke’s law, σ = Eε.
² Creep is the long-term (i.e., years) deformation that occurs under constant stress.
Figure 3.2: The concrete elastic modulus E at a given location is an example of a deterministic, yet unknown quantity. The possible values for E can be described using probability theory.
There are two types of uncertainty: aleatory and epistemic. Aleatory uncertainty is characterized by its irreducibility; no information can either reduce or alter it. Conversely, epistemic uncertainty refers to a lack of knowledge that can be altered by new information. In an engineering context, aleatory uncertainties arise when we are concerned with future realizations that have yet to occur. Epistemic uncertainty applies to any other case dealing with deterministic, yet unknown quantities.
This book approaches machine learning using probability theory
because in many practical engineering problems, the number of
observations available is limited, from a few to a few thousand. In
such a context, the amount of information available is typically
j.-a. goulet 18
insufficient to eliminate epistemic uncertainties. When large data sets are available, probabilistic and deterministic methods may lead to indistinguishable results; the opposite occurs when little data is available. Therefore, the less we know about a problem, the stronger the argument for approaching it using probability theory.
In this chapter, a review of set theory lays the foundation for probability theory, whose central concept is the random variable. Machine learning methods are built from an ensemble of functions organized in a clever way. Therefore, the last part of this chapter looks at what happens when random variables are introduced into deterministic functions.
For specific notions related to probability theory that are outside the scope of this chapter, the reader should refer to dedicated textbooks such as those by Box and Tiao³ and Ang and Tang.⁴

³ Box, G. E. P. and G. C. Tiao (1992). Bayesian inference in statistical analysis. Wiley.
⁴ Ang, A. H.-S. and W. H. Tang (1975). Probability concepts in engineering planning and decision, Volume 1: Basic Principles. John Wiley.
3.1 Set Theory

Set: Ensemble of events or elements.
Universe/sampling space (S): Ensemble of all possible events.
Elementary event (x): A single event, x ∈ S.
Event (E): Ensemble of elementary events.
E ⊆ S : Subset of S
E = S : Certain event
E = ∅ : Impossible event
Ē : Complement of E
A set describes an ensemble of elements, also referred to as events. An elementary event x refers to a single event among a sampling space (or universe) denoted by the calligraphic letter S. By definition, a sampling space contains all the possible events, E ⊆ S. The special case where an event is equal to the sampling space, E = S, is called a certain event. The opposite, E = ∅, where an event is an empty set, is called a null event. Ē refers to the complement of a set, that is, all elements belonging to S and not to E. Figure 3.3 illustrates these concepts using a Venn diagram.
Figure 3.3: Venn diagram representing the sampling space S, an event E, its complement Ē, and an elementary event x.
Figure 3.4: Venn diagrams representing the two basic operations: (a) union; (b) intersection.
Let us consider the example⁵ of the state of a structure following an earthquake, which is described by a sampling space,

  S = {no damage, light damage, important damage, collapse}
    = {N, L, I, C}.

⁵ This example is adapted from Armen Der Kiureghian’s course, CE229, at University of California, Berkeley.

In that context, an event E1 = {N, L} could contain the no damage and light damage events, and another event E2 = {C} could contain only the collapsed state. The complements of these events are, respectively, Ē1 = {I, C} and Ē2 = {N, L, I}.
The two main operations for events, union and intersection, are illustrated in figure 3.4. A union is analogous to the “or” operator, where E1 ∪ E2 holds if the event belongs to either E1, E2, or both. The intersection is analogous to the “and” operator, where E1 ∩ E2 ≡ E1E2 holds if the event belongs to both E1 and E2. As a convention, intersection has priority over union. Moreover, both operations are commutative, associative, and distributive.
Given a set of n events {E1, E2, ⋯, En} ∈ S,
probabilistic machine learning for civil engineers 19
the events are mutually exclusive if EiEj = ∅, ∀i ≠ j, that is, if the intersection for any pair of events is an empty set. Events E1, E2, ⋯, En are collectively exhaustive if ∪_{i=1}^n Ei = S, that is, the union of all events is the sampling space. Events E1, E2, ⋯, En are mutually exclusive and collectively exhaustive if they satisfy both properties simultaneously. Figure 3.5 presents examples of mutually exclusive (3.5a), collectively exhaustive (3.5b), and mutually exclusive and collectively exhaustive (3.5c–d) events. Note that the difference between (b) and (c) is the absence of overlap in the latter.
Union (“or”): E1 ∪ E2
Intersection (“and”): E1 ∩ E2 ≡ E1E2
Commutativity: E1 ∪ E2 = E2 ∪ E1, E1E2 = E2E1
∪_{i=1}^n Ei = E1 ∪ E2 ∪ ⋯ ∪ En
∩_{i=1}^n Ei = E1 ∩ E2 ∩ ⋯ ∩ En
Associativity: (E1 ∪ E2) ∪ E3 = E1 ∪ (E2 ∪ E3) = E1 ∪ E2 ∪ E3
Distributivity: E1(E2 ∪ E3) = (E1E2 ∪ E1E3)
(a) Mutually exclusive
(b) Collectively exhaustive
(c) Mutually exclusive and collectively exhaustive
(d) Mutually exclusive and collectively exhaustive
Figure 3.5: Venn diagrams representing the concepts of mutual exclusivity and collective exhaustivity for events.
3.2 Probability of Events

Pr(Ei) denotes the probability of the event Ei. There are two main interpretations for a probability: the Frequentist and the Bayesian. Frequentists interpret a probability as the number of occurrences of Ei relative to the number of samples s, as s goes to infinity,

  Pr(Ei) = lim_{s→∞} #{Ei} / s.

For Bayesians, a probability measures how likely Ei is in comparison with other events in S. This interpretation assumes that the nature of uncertainty is epistemic, that is, it describes our knowledge of a phenomenon. For instance, the probability depends on the available knowledge and can change when new information is obtained. Throughout this book we adopt this Bayesian interpretation.
By definition, the probability of an event is a number between zero and one, 0 ≤ Pr(Ei) ≤ 1. At the ends of this spectrum, the probability of any event in S is one, Pr(S) = 1, and the probability of an empty set is zero, Pr(∅) = 0. If two events E1 and E2 are mutually exclusive, then the probability of the events’ union is the sum of each event’s probability. Because the union of an event and its complement is the sampling space, E ∪ Ē = S (see figure 3.5d), and because Pr(S) = 1, the probability of the complement is

  Pr(Ē) = 1 − Pr(E).
When events are not mutually exclusive, the general addition rule for the probability of the union of two events is

  Pr(E1 ∪ E2) = Pr(E1) + Pr(E2) − Pr(E1E2).
This general addition rule is illustrated in figure 3.6, where if we simply add the probability of each event without accounting for the subtraction of Pr(E1E2), the probability of the intersection of both events will be counted twice.
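As a quick numeric check of the addition rule, the sketch below uses assumed probabilities for two overlapping events (the values are illustrative, not taken from the text):

```python
# Hypothetical probabilities for two overlapping events (assumed values).
p_e1 = 0.60      # Pr(E1)
p_e2 = 0.30      # Pr(E2)
p_joint = 0.10   # Pr(E1 E2), the intersection

# General addition rule: Pr(E1 U E2) = Pr(E1) + Pr(E2) - Pr(E1 E2)
p_union = p_e1 + p_e2 - p_joint

# Naively summing the marginals counts the intersection twice
naive_sum = p_e1 + p_e2
double_count = naive_sum - p_union  # equals Pr(E1 E2)
```

The difference between the naive sum and the correct union probability is exactly the intersection probability that would otherwise be double counted.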
Figure 3.6: Venn diagram representing the
addition rule for the probability of events.
Pr(E1|E2) denotes the probability of the event E1 conditional on the realization of the event E2. This conditional probability is defined as the joint probability for both events divided by the probability of E2,

  Pr(E1|E2) = Pr(E1E2) / Pr(E2),  Pr(E2) ≠ 0.  (3.1)

Note: Pr(E2) ≠ 0 in the denominator because a division by 0 is not finite.
In equation 3.1, Pr(E1|E2) is a conditional probability, Pr(E1E2) a joint probability, and Pr(E2) a marginal probability.

Note: Statistical independence (⊥⊥) between a pair of random variables implies that learning about one random variable does not modify our knowledge of the other.
The probability of a single event is referred to as a marginal probability. A joint probability designates the probability of the intersection of events. The terms in equation 3.1 can be rearranged to explicitly show that the joint probability of two events {E1, E2} is the product of a conditional probability and its associated marginal,

  Pr(E1E2) = Pr(E1|E2) · Pr(E2)
           = Pr(E2|E1) · Pr(E1).
In cases where E1 and E2 are statistically independent, E1 ⊥⊥ E2, the conditional probabilities are equal to the marginals,

  E1 ⊥⊥ E2 :  Pr(E1|E2) = Pr(E1)
              Pr(E2|E1) = Pr(E2).

In the special case of statistically independent events, the joint probability reduces to the product of the marginals,

  Pr(E1E2) = Pr(E1) · Pr(E2).
The joint probability for n events can be broken down into n − 1 conditionals and one marginal probability using the chain rule,

  Pr(E1E2 ⋯ En) = Pr(E1|E2 ⋯ En) Pr(E2 ⋯ En)
                = Pr(E1|E2 ⋯ En) Pr(E2|E3 ⋯ En) Pr(E3 ⋯ En)
                = Pr(E1|E2 ⋯ En) Pr(E2|E3 ⋯ En) ⋯ Pr(En−1|En) Pr(En).
Let us define {E1, E2, E3, ⋯, En} ∈ S, a set of mutually exclusive and collectively exhaustive events, that is, EiEj = ∅, ∀i ≠ j and ∪_{i=1}^n Ei = S, and an event A belonging to the same sampling
space, that is, A ∈ S. This context is illustrated using a Venn diagram in figure 3.7. The probability of the event A can be obtained by summing the joint probability of A and each event Ei,

  Pr(A) = Σ_{i=1}^n Pr(A|Ei) · Pr(Ei),  (3.2)

where each term Pr(A|Ei) · Pr(Ei) = Pr(AEi).
Figure 3.7: Venn diagram representing the
conditional occurrence of events.
This operation of obtaining a marginal probability from a joint is called marginalization. The addition rule for the union E1 ∪ E2 conditional on A is

  Pr(E1 ∪ E2|A) = Pr(E1|A) + Pr(E2|A) − Pr(E1E2|A),

and the intersection rule is

  Pr(E1E2|A) = Pr(E1|E2, A) · Pr(E2|A).
Using the definition of a conditional probability in equation 3.1, we can break Pr(AEi) into two different products of a conditional and its associated marginal probability,

  Pr(AEi) = Pr(A|Ei) · Pr(Ei)
          = Pr(Ei|A) · Pr(A),

so that

  Pr(Ei|A) · Pr(A) = Pr(A|Ei) · Pr(Ei).  (3.3)
Reorganizing the right-hand terms of equation 3.3 leads to Bayes rule,

  Pr(Ei|A) = Pr(A|Ei) · Pr(Ei) / Pr(A),

where Pr(Ei|A) is the posterior probability, Pr(A|Ei) the conditional probability, Pr(Ei) the prior probability, and Pr(A) the evidence.
On the left-hand side is the posterior probability: the probability of the event Ei given the realization of the event A. On the numerator of the right-hand side is the conditional probability of the event A given the event Ei, times the prior probability of Ei. The term on the denominator is referred to as the evidence and acts as a normalization constant, which ensures that Σᵢ Pr(Ei|A) = 1. The normalization constant Pr(A) is obtained using the marginalization operation presented in equation 3.2. In practical applications, Pr(A) is typically difficult to estimate. Chapters 6 and 7 present analytic as well as numerical methods for tackling this challenge. Figure 3.8 illustrates the conditional occurrence of events in the context of Bayes rule.
  Pr(E1|A) = Pr(A|E1) Pr(E1) / Pr(A) = Pr(AE1) / Pr(A)

Figure 3.8: Venn diagram representing the conditional occurrence of events in the context of Bayes rule.
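The mechanics of Bayes rule can be sketched numerically for the four damage states of §3.1. The priors Pr(Ei) and the likelihoods Pr(A|Ei) of an observation A below are assumed values chosen for illustration, not ones given in the text:

```python
# Assumed priors Pr(E_i) over the damage states and assumed likelihoods
# Pr(A | E_i) of some observation A (e.g., a visible crack).
priors = {"N": 0.70, "L": 0.20, "I": 0.08, "C": 0.02}
likelihood = {"N": 0.05, "L": 0.30, "I": 0.70, "C": 0.95}

# Evidence Pr(A) via marginalization (equation 3.2)
evidence = sum(likelihood[s] * priors[s] for s in priors)

# Posterior Pr(E_i | A) = Pr(A | E_i) * Pr(E_i) / Pr(A)
posterior = {s: likelihood[s] * priors[s] / evidence for s in priors}
```

Dividing by the evidence is what makes the posterior probabilities sum to one; observing A shifts probability mass toward the states that make A likely.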
3.3 Random Variables
Set theory is relevant for introducing the concepts related to probabilities. However, on its own, it has limited applicability to practical problems, which require defining the concept of random variables. A random variable is denoted by a capital letter X. Contrary to what its name implies, a random variable is not intended to describe only intrinsically random events; in our case, it describes lack of knowledge. A random variable X does not take any specific value. Instead, it takes any value in its valid sampling space x ∈ S and, as we will see shortly, the probability of occurrence of each value is typically not equal. Values of x are called either realizations or outcomes and are elementary events that are mutually exclusive and collectively exhaustive. A sampling space S for a random variable can be either discrete or continuous. Continuous cases are always infinite, whereas discrete ones can be either finite or infinite. Figure 3.9 illustrates how the concepts of events and sampling space can be transposed from a Venn diagram representation to the domain of a random variable.

3.3.1 Discrete Random Variables

Figure 3.9: Parallel between a Venn diagram and a continuous domain to represent a random variable.
In the case where S is a discrete domain, the probability that X = x is described by a probability mass function (PMF). In terms of notation, Pr(X = x) ≡ p_X(x) ≡ p(x) are all equivalent. Moreover, we typically describe a random variable by defining its sampling space and its probability mass function so that x : X ∼ p_X(x). The symbol ∼ reads as distributed like. Analogously to the probability of events, the probability that X = x must be

  0 ≤ p_X(x) ≤ 1,

and the sum of the probability for all x ∈ S follows

  Σ_x p_X(x) = 1.

Notation
  X : Random variable
  x : Realization of X
  p_X(x) : Probability that X = x
For the post-earthquake structural safety example introduced in §3.1, where

  S = {no damage (N), light damage (L), important damage (I), collapse (C)},

the sampling space along with the probability of each event can be represented by a probability mass function as depicted in figure 3.10.

Figure 3.10: Representation of a sampling space for a discrete random variable.
The event corresponding to damage that is either light or important corresponds to L ∪ I ≡ {1 ≤ X ≤ 2}. Because the events x = 1 and x = 2 are mutually exclusive, the probability is

  Pr(L ∪ I) = Pr({1 ≤ X ≤ 2})
            = p_X(x = 1) + p_X(x = 2).
The probability that X takes a value less than or equal to x is described by a cumulative mass function (CMF),

  Pr(X ≤ x) = F_X(x) = Σ_{x′ ≤ x} p_X(x′).

Figure 3.11 presents on the same graph the probability mass function (PMF) and the cumulative mass function. As its name indicates, the CMF corresponds to the cumulative sum of the PMF. Inversely, the PMF can be obtained from the CMF following

  p_X(x_i) = F_X(x_i) − F_X(x_{i−1}).
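These two relations can be checked with a small sketch; the PMF values below over the four damage states are assumed for illustration:

```python
# Assumed PMF over the four damage states, indexed x = 1 (N) ... 4 (C).
pmf = [0.70, 0.20, 0.08, 0.02]

# CMF as the cumulative sum of the PMF
cmf = []
running = 0.0
for p in pmf:
    running += p
    cmf.append(running)

# Recover the PMF from the CMF: p_X(x_i) = F_X(x_i) - F_X(x_{i-1})
pmf_back = [cmf[0]] + [cmf[i] - cmf[i - 1] for i in range(1, len(cmf))]
```

The last CMF value equals one, the CMF is nondecreasing, and differencing it recovers the original PMF.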
Figure 3.11: Comparison of a probability mass function p_X(x) and a cumulative mass function F_X(x).
3.3.2 Continuous Random Variables

The concepts presented for discrete sampling spaces can be extended to cases where S is a continuous domain. Because continuous domains are inevitably infinite, the probability that a random variable takes a specific value X = x is zero,

  Pr(X = x) = 0.

Note: Here, a probability equal to zero does not mean that a specific value x is impossible. Take, for example, a random variable defined in the interval (0, 1), for which all the outcomes are equally probable. The probability that X = 0.23642 is only one out of an infinite number of possibilities in (0, 1).
For continuous random variables, the probability is only defined for intervals x < X ≤ x + Δx,

  Pr(x < X ≤ x + Δx) = f_X(x) Δx,

where f_X(x) ≡ f(x) denotes a probability density function (PDF). A PDF must always be greater than or equal to zero, f_X(x) ≥ 0; however, unlike the discrete case where 0 ≤ p_X(x) ≤ 1, f_X(x) can take values greater than one because it describes a probability density rather than a probability. In order to satisfy the property that Pr(S) = 1, the integral of f_X(x) over all possible values of x must be one,

  ∫_{−∞}^{+∞} f_X(x) dx = 1.
The probability Pr(X ≤ x) is given by the cumulative distribution function (CDF),

  Pr(X ≤ x) = F_X(x) = ∫_{−∞}^{x} f_X(x′) dx′.
For a random variable x ∈ ℝ : X ∼ f_X(x), the CDF evaluated at the lower and upper bounds is, respectively, F_X(−∞) = 0 and F_X(+∞) = 1. Notice that the CDF is obtained by integrating the PDF, and inversely, the PDF is obtained by differentiating the CDF,

  F_X(x) = ∫_{−∞}^{x} f_X(x′) dx′  ↔  f_X(x) = dF_X(x)/dx.
Moreover, because F_X(x) is the integral of f_X(x) and f_X(x) ≥ 0, F_X(x) is nondecreasing. Figure 3.12 presents examples of a probability density function and a cumulative distribution function.

Figure 3.12: Examples of PDF and CDF for a continuous random variable: (a) probability density function; (b) cumulative distribution function (nondecreasing).
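The PDF–CDF relations can be verified numerically. The sketch below assumes an exponential random variable with rate lam = 2 (an arbitrary choice); its density exceeds 1 near x = 0 even though all probabilities remain in [0, 1]:

```python
import math

lam = 2.0  # assumed rate parameter

def f(x):
    """PDF of an exponential random variable with rate lam."""
    return lam * math.exp(-lam * x)

def F(x):
    """Closed-form CDF: F_X(x) = 1 - exp(-lam * x)."""
    return 1.0 - math.exp(-lam * x)

def cdf_numeric(x, n=20000):
    """Left Riemann-sum integration of the PDF from 0 to x."""
    dx = x / n
    return sum(f(i * dx) * dx for i in range(n))
```

Here f(0) = 2 > 1, which is legitimate for a density, while the numerically integrated CDF matches the closed form and never exceeds one.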
3.3.3 Conditional Probabilities

Conditional probabilities describe the probability of a random variable’s outcomes, given the realization of another variable. The conditional notation for discrete random variables follows

  X|y ∼ p(x|y) ≡ p_{X|y}(x) ≡ Pr(X = x|y) = p_{XY}(x, y) / p_Y(y),

and the conditional notation for continuous random variables follows

  X|y ∼ f(x|y) ≡ f_{X|y}(x|y) = f_{XY}(x, y) / f_Y(y).
Conditional probabilities are employed in Bayes rule to infer the posterior knowledge associated with a random variable, given the observations made for another.
Let us revisit the post-earthquake structural safety example introduced in §3.1, where the damage state x ∈ S. If we measure the peak ground acceleration (PGA) after an earthquake, y ∈ ℝ⁺, we can employ the conditional probability of having structural damage given the PGA value to infer the structural state of a building that itself has not been observed. (Peak ground acceleration is a metric quantifying the intensity of an earthquake using the maximal acceleration recorded during an event.) Figure 3.13 illustrates schematically how an observation of the peak ground acceleration y can be employed to infer the structural state of a building x, using conditional probabilities. Because the structural state X is a discrete random variable, p(x|y) describes the posterior probability of each state x ∈ S, given an observed value of PGA y ∈ ℝ⁺. f(y) is a normalization constant obtained by marginalizing X from f(x, y) and evaluating it for the particular observed value y,

  f(y) = Σ_{x∈S} f(y|x) · p(x).
The posterior is obtained by multiplying the likelihood of observing the particular value of PGA y given each of the structural states x, times the prior probability of each structural state, and then dividing by the probability of the observation y itself. Conditional probabilities can be employed to relate any combination of continuous and discrete random variables. Chapter 6 further explores Bayesian estimation with applied examples.

Figure 3.13: Schematic example of how observations of the peak ground acceleration y can be employed to infer the structural state of a building x using conditional probabilities.
3.3.4 Multivariate Random Variables

It is common to study the joint occurrence of multiple phenomena. In the context of probability theory, this is done using multivariate random variables. x = [x1 x2 ⋯ xn]ᵀ is a column vector containing realizations for n random variables X = [X1 X2 ⋯ Xn]ᵀ, with x : X ∼ p_X(x) ≡ p(x), or x : X ∼ f_X(x) ≡ f(x). For the discrete case, the probability of the joint realization x is described by

  p_X(x) = Pr(X1 = x1 ∩ X2 = x2 ∩ ⋯ ∩ Xn = xn),
where 0 ≤ p_X(x) ≤ 1. For the continuous case, it is

  f_X(x) Δx = Pr(x1 < X1 ≤ x1 + Δx1 ∩ ⋯ ∩ xn < Xn ≤ xn + Δxn),

for Δx → 0. Note that f_X(x) can be > 1 because it describes a probability density. As mentioned earlier, two random variables X1 and X2 are statistically independent (⊥⊥) if

  p_{X1|x2}(x1|x2) = p_{X1}(x1).
If X1 ⊥⊥ X2 ⊥⊥ ⋯ ⊥⊥ Xn, the joint PMF is defined by the product of its marginals,

  p_{X1:Xn}(x1, ⋯, xn) = p_{X1}(x1) p_{X2}(x2) ⋯ p_{Xn}(xn).
For the general case where X1, X2, ⋯, Xn are not statistically independent, their joint PMF can be defined using the chain rule,

  p_{X1:Xn}(x1, ⋯, xn) = p_{X1|X2:Xn}(x1|x2, ⋯, xn) ⋯ p_{Xn−1|Xn}(xn−1|xn) · p_{Xn}(xn).
The same rules apply for continuous random variables except that p_X(x) is replaced by f_X(x). Figure 3.14 presents examples of marginals and a bivariate joint probability density function.
The multivariate cumulative distribution function describes the probability that a set of n random variables is simultaneously less than or equal to x,

  F_X(x) = Pr(X1 ≤ x1 ∩ ⋯ ∩ Xn ≤ xn).
The joint CDF is obtained by integrating the joint PDF over each dimension from its lower bound up to x, and inversely, the joint PDF is obtained by differentiating the CDF,

  F_X(x) = ∫_{−∞}^{x1} ⋯ ∫_{−∞}^{xn} f_X(x′) dx′  ↔  f_X(x) = ∂^n F_X(x) / (∂x1 ⋯ ∂xn).
A multivariate CDF has values 0 ≤ F_X(x) ≤ 1, and its value is zero at the lower bound for any dimension and one at the upper bound for all dimensions,

  F_{X1:Xn}(x1, ⋯, xn−1, −∞) = 0
  F_{X1:Xn}(+∞, ⋯, +∞, +∞) = 1.

Figure 3.15 presents an example of marginals and a bivariate cumulative distribution function.
Figure 3.14: Examples of marginals and a bivariate probability density function f_{X1X2}(x1, x2).
Figure 3.15: Examples of marginals and a bivariate cumulative distribution function F_{X1X2}(x1, x2).
The operation consisting of removing a random variable from a joint set is called marginalization. For a set of n joint random variables, we can remove the i-th variable by summing over the i-th dimension; for example, summing over the n-th dimension gives

  Σ_{xn} p_{X1:Xn}(x1, ⋯, xn) = p_{X1:Xn−1}(x1, ⋯, xn−1).

If we marginalize all variables by summing over all dimensions, the result is

  Σ_{x1} ⋯ Σ_{xn} p_X(x) = 1.
For the example presented in figure 3.16, X1 ⊥⊥ X2, so the joint PMF is obtained by the product of its marginals. It is possible to obtain the marginal PMF for x1 from the joint through marginalizing, Σ_{i=1}^3 p_X(x1, i):

  p_X(x1, x2)   x2 = 1   x2 = 2   x2 = 3   Σ_{i=1}^3 p_X(x1, i)
  x1 = 1         0.08     0.015    0.005    0.1
  x1 = 2         0.40     0.075    0.025    0.5
  x1 = 3         0.32     0.060    0.020    0.4

Marginalization:
  ∫_{−∞}^{∞} f_{X1X2}(x1, x2) dx2 = f_{X1}(x1)
  Σ_{x2} p_{X1X2}(x1, x2) = p_{X1}(x1)
  F_{X1X2}(x1, +∞) = F_{X1}(x1)
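The table above can be reproduced in a few lines; the joint values are the ones from the example, with X1 indexing rows and X2 indexing columns:

```python
# Joint PMF p_X(x1, x2) from the example (rows: x1 = 1..3, cols: x2 = 1..3).
joint = [
    [0.08, 0.015, 0.005],
    [0.40, 0.075, 0.025],
    [0.32, 0.060, 0.020],
]

# Marginalize X2 out by summing each row: p_X1(x1)
p_x1 = [sum(row) for row in joint]
# Marginalize X1 out by summing each column: p_X2(x2)
p_x2 = [sum(row[j] for row in joint) for j in range(3)]

# Here X1 and X2 are independent, so each joint entry equals the
# product of the corresponding marginals.
is_independent = all(
    abs(joint[i][j] - p_x1[i] * p_x2[j]) < 1e-9
    for i in range(3)
    for j in range(3)
)
```

Summing each row recovers p_X1 = (0.1, 0.5, 0.4) and summing each column recovers p_X2 = (0.8, 0.15, 0.05), confirming the independence structure of the example.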
Marginalization applies to continuous random variables using integration,

  ∫_{−∞}^{∞} f_{X1:Xn}(x1, ⋯, xn) dxn = f_{X1:Xn−1}(x1, ⋯, xn−1),

where again, if we integrate over all dimensions, the result is

  ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} f_X(x) dx = 1.
For both continuous and discrete random variables, we can marginalize a random variable by evaluating its CDF at its upper bound,

  F_{X1:Xn}(x1, ⋯, xn−1, +∞) = F_{X1:Xn−1}(x1, ⋯, xn−1).

  p_{X1}(x1) : p_{X1}(1) = 0.1, p_{X1}(2) = 0.5, p_{X1}(3) = 0.4
  p_{X2}(x2) : p_{X2}(1) = 0.8, p_{X2}(2) = 0.15, p_{X2}(3) = 0.05
  p_X(x1, x2) = p_{X1}(x1) · p_{X2}(x2)

Figure 3.16: Examples of marginals and a bivariate probability mass function.
3.3.5 Moments and Expectation

The moment of order m, E[X^m], of a random variable X is defined as

  E[X^m] = ∫ x^m · f_X(x) dx   (continuous)
         = Σᵢ xᵢ^m · p_X(xᵢ)   (discrete),

where E[·] denotes the expectation operation. For m = 1, E[X] = µ_X is a measure of position for the centroid of the probability density or mass function. This centroid is analogous to the concept of center of gravity for a solid body or cross section. An expected value,

  E[X] = ∫ x · f_X(x) dx   (continuous)
       = Σᵢ xᵢ · p_X(xᵢ)   (discrete),

refers to the sum of all possible values weighted by their probability of occurrence. A key property of the expectation is that it is a linear operation, so that

  E[X + Y] = E[X] + E[Y].
The notion of expectation can be extended to any function of random variables g(X),

  E[g(X)] = ∫ g(x) · f_X(x) dx.

The expectation of the function g(X) = (X − µ_X)^m is referred to as the centered moment of order m,

  E[(X − µ_X)^m] = ∫ (x − µ_X)^m · f_X(x) dx.
For the special cases where m = 1, E[(X − µ_X)¹] = 0, and for m = 2,

  E[(X − µ_X)²] = σ²_X = var[X]
                = E[X²] − E[X]²,

where σ_X denotes the standard deviation of X, and var[·] denotes the variance operator that measures the dispersion of the probability density function with respect to its mean. The notion of variance is analogous to the concept of moment of inertia for a cross section. Together, µ_X and σ_X are metrics describing the centroid and dispersion of a random variable. Another, dimensionless, dispersion metric for describing a random variable is the coefficient of variation, δ_X = σ_X / µ_X. Note that δ_X only applies for µ_X ≠ 0.
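A short sketch computes these metrics for the discrete marginal p_X1 = (0.1, 0.5, 0.4) from the earlier example:

```python
# Discrete random variable: outcomes and their PMF (from the example).
xs = [1, 2, 3]
pmf = [0.1, 0.5, 0.4]

mean = sum(x * p for x, p in zip(xs, pmf))        # E[X] = mu_X
second = sum(x**2 * p for x, p in zip(xs, pmf))   # E[X^2]
variance = second - mean**2                       # var[X] = E[X^2] - E[X]^2
std = variance**0.5                               # sigma_X
delta_x = std / mean                              # coefficient of variation
```

This gives µ_X = 2.3 and var[X] = 5.7 − 2.3² = 0.41, illustrating the shortcut var[X] = E[X²] − E[X]².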
Given two random variables X, Y, their covariance, cov(X, Y), is defined by the expectation of the product of the mean-centered variables,

  E[(X − µ_X)(Y − µ_Y)] = cov(X, Y)
                        = E[XY] − E[X] · E[Y]
                        = ρ_XY · σ_X · σ_Y.
The correlation coefficient ρ_XY can take a value between −1 and 1, and quantifies the linear dependence between X and Y,

  ρ_XY = cov(X, Y) / (σ_X σ_Y),  −1 ≤ ρ_XY ≤ +1.

A positive (negative) correlation indicates that a large outcome for X is associated with a high probability for a large (small) outcome for Y. Figure 3.17 presents examples of scatter plots generated for different correlation coefficients.

Figure 3.17: Examples of scatter plots between the realizations of two random variables for different correlation coefficients.
In the special case where X and Y are independent, the correlation is zero, X ⊥⊥ Y ⟹ ρ_XY = 0. Note that the inverse is not true; a correlation coefficient equal to zero does not guarantee independence, ρ_ij = 0 ⇏ X_i ⊥⊥ X_j. This happens because correlation only measures the linear dependence between a pair of random variables; two random variables can be nonlinearly dependent, yet have a correlation coefficient equal to zero. Figure 3.18 presents an example of a scatter plot with quadratic dependence yet no linear dependence, so ρ ≈ 0.

Figure 3.18: Example of a scatter plot where there is a quadratic dependence between the variables, yet the correlation coefficient ρ ≈ 0.
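This effect is easy to reproduce: below, Y depends deterministically on X through a square, yet the sample correlation coefficient vanishes. The symmetric grid of x values is an arbitrary choice made for the sketch:

```python
# Symmetric grid of x values and a purely quadratic dependence y = x^2.
xs = [i / 10.0 for i in range(-50, 51)]
ys = [x**2 for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Sample covariance and standard deviations
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
std_x = (sum((x - mean_x) ** 2 for x in xs) / n) ** 0.5
std_y = (sum((y - mean_y) ** 2 for y in ys) / n) ** 0.5

# Correlation is ~0 even though Y is a deterministic function of X
rho = cov_xy / (std_x * std_y)
```

By symmetry, E[X³] = 0 on this grid, so the covariance (and hence ρ) collapses to zero despite the perfect nonlinear dependence.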
Correlation also does not imply causality. For example, the number of flu cases is negatively correlated with the temperature; when the seasonal temperatures drop during winter, the number of flu cases increases. Nonetheless, the cold itself is not causing the flu; someone isolated in a cold climate is unlikely to contract the flu because the virus is itself unlikely to be present in the environment. Instead, studies have shown that the flu virus has a higher transmissibility in the cold and dry conditions that are prevalent during winter. See, for example, Lowen and Steel.⁶

⁶ Lowen, A. C. and J. Steel (2014). Roles of humidity and temperature in shaping influenza seasonality. Journal of Virology 88(14), 7692–7695.
For a set of n random variables X1, X2, ⋯, Xn, the covariance matrix defines the dispersion of each variable through its variance, located on the main diagonal, and the dependence between variables through the pairwise covariances, located on the off-diagonal terms,

  Σ = | σ²_{X1}   ⋯   ρ_{1n} σ_{X1} σ_{Xn} |
      |    ⋮      ⋱           ⋮            |
      |  sym.     ⋯        σ²_{Xn}         | .
A covariance matrix is symmetric (sym.), and each term is defined following [Σ]_ij = cov(Xi, Xj) = ρ_ij σ_{Xi} σ_{Xj}. Because a variable is linearly correlated with itself (ρ = 1), the main diagonal terms reduce to [Σ]_ii = σ²_{Xi}. A covariance matrix has to be positive semidefinite (see §2.4.2), so the variances on the main diagonal must be > 0. In order to avoid singular cases, there should be no linearly dependent variables, that is, −1 < ρ_ij < 1, ∀i ≠ j.
3.4 Functions of Random Variables

Let us consider a continuous random variable X ∼ f_X(x) and a monotonic deterministic function y = g(x). (A monotonic function g(x) takes one variable as input, returns one variable as output, and is strictly either increasing or decreasing.) The function’s output Y is a random variable because it takes as input the random variable X. The PDF f_Y(y) is defined knowing that for each infinitesimal part of the domain dx, there is a corresponding dy, and the probability over both domains must be equal,

  Pr(y < Y ≤ y + dy) = Pr(x < X ≤ x + dx)
  f_Y(y) dy = f_X(x) dx,

where both densities satisfy f_Y(y) ≥ 0 and f_X(x) ≥ 0.
The change-of-variable rule for f_Y(y) is defined by

  f_Y(y) = f_X(x) |dx/dy|
         = f_X(x) |dy/dx|⁻¹
         = f_X(g⁻¹(y)) |dg(g⁻¹(y))/dx|⁻¹,
where multiplying by |dx/dy| accounts for the change in the size of the neighborhood of x with respect to y, and where the absolute value ensures that f_Y(y) ≥ 0. For a function y = g(x) and its inverse x = g⁻¹(y), the gradient is obtained from

  dy/dx ≡ dg(x)/dx ≡ dg(g⁻¹(y))/dx,  since g⁻¹(y) = x.
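The change-of-variable rule can be checked by Monte Carlo. The sketch assumes X ∼ Uniform(0, 1) and the monotonic map y = g(x) = x², for which F_Y(y) = √y, so f_Y(y) = 1/(2√y) on (0, 1):

```python
import math
import random

random.seed(0)  # deterministic sketch

# Draw samples of X ~ Uniform(0, 1) and push them through g(x) = x^2
samples_y = [random.random() ** 2 for _ in range(100_000)]

# Empirical Pr(a < Y <= b) against the exact value obtained from the
# change-of-variable result: F_Y(b) - F_Y(a) = sqrt(b) - sqrt(a)
a, b = 0.25, 0.64
empirical = sum(a < y <= b for y in samples_y) / len(samples_y)
exact = math.sqrt(b) - math.sqrt(a)
```

The empirical interval probability agrees with the analytic one to within Monte Carlo error, confirming that the |dx/dy| factor is what keeps probabilities consistent across the transformation.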
Figure 3.19: Example of a 1-D nonlinear transformation y = g(x). Notice how the nonlinear transformation causes the modes (i.e., the most likely values) to be different in the x and y spaces.

A transformation from a space x to another space y requires taking into account the change in the size of the neighborhood, f_Y(y) = f_X(x)|dx/dy|.
Figure 3.19 presents an example of a nonlinear transformation y = g(x). Notice how, because of the nonlinear transformation, the maximum for f_X(x) and the maximum for f_Y(y) do not occur at the same locations, that is, y* ≠ g(x*).
Given a set of n random variables x ∈ ℝⁿ : X ∼ f_X(x), we can generalize the transformation rule to an n-to-n multivariate function y = g(x), as illustrated in figure 3.20a for a case where n = 2. As with the univariate case, we need to account for the change in the neighborhood size when going from the original to the transformed space, as illustrated in figure 3.20b.

Figure 3.20: Illustration of a 2-D transformation: (a) 2-D transformation; (b) effect of a 2-D transformation on the neighborhood size.

The transformation is then defined by
  f_Y(y) dy = f_X(x) dx
  f_Y(y) = f_X(x) |dx/dy|,

where |dx/dy| is the inverse of the determinant of the Jacobian matrix,

  |dx/dy| = |det J_{y,x}|⁻¹,  |dy/dx| = |det J_{y,x}|.
The Jacobian is an $n \times n$ matrix containing the partial derivatives of
$y_k$ with respect to $x_l$, evaluated at $\mathbf{x}$ so that
$[\mathbf{J}_{y,x}]_{k,l} = \frac{\partial y_k}{\partial x_l}$,
$$\mathbf{J}_{y,x} =
\begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n}\\
\vdots & \ddots & \vdots\\
\frac{\partial y_n}{\partial x_1} & \cdots & \frac{\partial y_n}{\partial x_n}
\end{bmatrix}
=
\begin{bmatrix}
\nabla g_1(\mathbf{x})\\
\vdots\\
\nabla g_n(\mathbf{x})
\end{bmatrix}.$$
Note that each row of the Jacobian matrix corresponds to the gradient
vector evaluated at $\mathbf{x}$,
$$\nabla g(\mathbf{x}) = \left[\frac{\partial g(\mathbf{x})}{\partial x_1} \;\cdots\; \frac{\partial g(\mathbf{x})}{\partial x_n}\right].$$
The determinant (see §2.4.1) of the Jacobian is a scalar quantifying
the size of the neighborhood of $d\mathbf{y}$ with respect to $d\mathbf{x}$.
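To make the determinant concrete, here is a small sketch (illustrative, not from the text) that builds the Jacobian by finite differences for the 2-D linear map $y_1 = x_1 + x_2$, $y_2 = x_1 - x_2$, whose analytic Jacobian is $[[1, 1], [1, -1]]$ with $|\det \mathbf{J}| = 2$; NumPy is assumed here.

```python
import numpy as np

# Sketch: for y1 = x1 + x2, y2 = x1 - x2, the determinant of the Jacobian
# quantifies how the size of a neighborhood dx changes under the map.

def g(x):
    return np.array([x[0] + x[1], x[0] - x[1]])

def jacobian(g, x, h=1e-6):
    """Central finite-difference Jacobian [J]_{k,l} = dy_k / dx_l."""
    n = len(x)
    J = np.zeros((n, n))
    for l in range(n):
        dx = np.zeros(n)
        dx[l] = h
        J[:, l] = (g(x + dx) - g(x - dx)) / (2 * h)
    return J

J = jacobian(g, np.array([1.0, 2.0]))
# |det J| = 2, so every neighborhood doubles in size and
# f_Y(y) = f_X(x) / 2 everywhere for this map.
assert np.allclose(J, [[1.0, 1.0], [1.0, -1.0]])
assert abs(abs(np.linalg.det(J)) - 2.0) < 1e-6
```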
3.4.1 Linear Functions
(a) Generic linear transformation
(b) Linear transformation $y = 2x$
Figure 3.21: Examples of transformations through a linear function.
Figure 3.21b illustrates how a function $y = 2x$ transforms a random
variable $X$ with mean $\mu_X = 1$ and standard deviation $\sigma_X = 0.5$
into $Y$ with mean $\mu_Y = 2$ and standard deviation $\sigma_Y = 1$. In the
machine learning context, it is common to employ linear functions
of random variables $y = g(x) = ax + b$, as illustrated in figure 3.21a.
Given a random variable $X$ with mean $\mu_X$ and variance $\sigma_X^2$, the
change in the neighborhood size simplifies to
$$\left|\frac{dy}{dx}\right| = |a|.$$
In such a case, because of the linear property of the expectation
operation (see §3.3.5),
$$\mu_Y = g(\mu_X) = a\mu_X + b, \qquad \sigma_Y = |a|\,\sigma_X.$$
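These closed-form moments can be checked against a Monte Carlo simulation. The sketch below (an illustrative addition, not part of the text) uses the same numbers as figure 3.21b: $Y = 2X$ with $X \sim \text{Normal}(1, 0.5)$.

```python
import random
import statistics

# Sketch: Monte Carlo check of mu_Y = a*mu_X + b and sigma_Y = |a|*sigma_X
# for Y = 2X, X ~ Normal(mu_X = 1, sigma_X = 0.5), as in figure 3.21b.
random.seed(0)
a, b = 2.0, 0.0
xs = [random.gauss(1.0, 0.5) for _ in range(100_000)]
ys = [a * x + b for x in xs]

# Sample moments should approach mu_Y = 2 and sigma_Y = 1.
assert abs(statistics.mean(ys) - 2.0) < 0.02
assert abs(statistics.stdev(ys) - 1.0) < 0.02
```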
probabilistic machine learning for civil engineers 31
Let us consider a set of $n$ random variables $\mathbf{X}$ defined by its mean
vector and covariance matrix,
$$\mathbf{X} = \begin{bmatrix} X_1\\ \vdots\\ X_n \end{bmatrix}, \quad
\boldsymbol{\mu}_{\mathbf{X}} = \begin{bmatrix} \mu_{X_1}\\ \vdots\\ \mu_{X_n} \end{bmatrix}, \quad
\boldsymbol{\Sigma}_{\mathbf{X}} = \begin{bmatrix}
\sigma_{X_1}^2 & \cdots & \rho_{1n}\sigma_{X_1}\sigma_{X_n}\\
\vdots & \ddots & \vdots\\
\text{sym.} & & \sigma_{X_n}^2
\end{bmatrix},$$
and the variables $\mathbf{Y} = [Y_1\; Y_2\; \cdots\; Y_n]^\intercal$ obtained from a linear
function $\mathbf{Y} = \mathbf{g}(\mathbf{X}) = \mathbf{A}\mathbf{X} + \mathbf{b}$ so that
$$\underbrace{\mathbf{Y}}_{n\times 1} =
\underbrace{\mathbf{A}}_{\substack{n\times n\\ \mathbf{A}\,=\,\mathbf{J}_{y,x}}}
\underbrace{\mathbf{X}}_{n\times 1} +
\underbrace{\mathbf{b}}_{n\times 1}.$$
The function outputs $\mathbf{Y}$ are then described by the mean vector,
covariance matrix, and joint covariance,
$$\left.\begin{aligned}
\boldsymbol{\mu}_{\mathbf{Y}} &= \mathbf{g}(\boldsymbol{\mu}_{\mathbf{X}}) = \mathbf{A}\boldsymbol{\mu}_{\mathbf{X}} + \mathbf{b}\\
\boldsymbol{\Sigma}_{\mathbf{Y}} &= \mathbf{A}\boldsymbol{\Sigma}_{\mathbf{X}}\mathbf{A}^\intercal\\
\boldsymbol{\Sigma}_{\mathbf{XY}} &= \boldsymbol{\Sigma}_{\mathbf{X}}\mathbf{A}^\intercal
\end{aligned}\right\}
\quad
\begin{bmatrix}\mathbf{X}\\ \mathbf{Y}\end{bmatrix}: \quad
\begin{bmatrix}\boldsymbol{\mu}_{\mathbf{X}}\\ \boldsymbol{\mu}_{\mathbf{Y}}\end{bmatrix}, \quad
\begin{bmatrix}\boldsymbol{\Sigma}_{\mathbf{X}} & \boldsymbol{\Sigma}_{\mathbf{XY}}\\
\boldsymbol{\Sigma}_{\mathbf{XY}}^\intercal & \boldsymbol{\Sigma}_{\mathbf{Y}}\end{bmatrix}.$$

Note: For linear functions $\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b}$,
the Jacobian $\mathbf{J}_{y,x}$ is the matrix $\mathbf{A}$ itself.
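These moment-propagation formulas translate directly into matrix products. The sketch below is an illustrative addition (the numbers are invented, not from the text) assuming NumPy.

```python
import numpy as np

# Sketch: propagating the mean and covariance through Y = AX + b using
# mu_Y = A mu_X + b, Sigma_Y = A Sigma_X A^T, Sigma_XY = Sigma_X A^T.
# Illustrative numbers, not from the text.
mu_X = np.array([1.0, 2.0])
Sigma_X = np.array([[0.25, 0.10],
                    [0.10, 0.50]])
A = np.array([[1.0, 1.0],
              [2.0, -1.0]])   # A is also the Jacobian J_{y,x}
b = np.array([0.0, 1.0])

mu_Y = A @ mu_X + b            # [3, 1]
Sigma_Y = A @ Sigma_X @ A.T
Sigma_XY = Sigma_X @ A.T

# Sigma_Y must be symmetric positive semidefinite.
assert np.allclose(mu_Y, [3.0, 1.0])
assert np.allclose(Sigma_Y, Sigma_Y.T)
assert np.all(np.linalg.eigvalsh(Sigma_Y) >= 0)
```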
If instead of having an $n \to n$ function, we have an $n \to 1$
function $y = g(\mathbf{X}) = \mathbf{a}^\intercal\mathbf{X} + b$, then the Jacobian simplifies to the
gradient vector $\nabla g(\mathbf{x}) = \left[\frac{\partial g(\mathbf{x})}{\partial x_1} \;\cdots\; \frac{\partial g(\mathbf{x})}{\partial x_n}\right]$, which is again equal to
the vector $\mathbf{a}^\intercal$,
$$\underbrace{Y}_{1\times 1} =
\underbrace{\mathbf{a}^\intercal}_{\substack{1\times n\\ \mathbf{a}^\intercal\,=\,\nabla g(\mathbf{x})}}
\underbrace{\mathbf{X}}_{n\times 1} +
\underbrace{b}_{1\times 1}.$$
The function output $Y$ is then described by
$$\mu_Y = g(\boldsymbol{\mu}_{\mathbf{X}}) = \mathbf{a}^\intercal\boldsymbol{\mu}_{\mathbf{X}} + b, \qquad
\sigma_Y^2 = \mathbf{a}^\intercal\boldsymbol{\Sigma}_{\mathbf{X}}\mathbf{a}.$$
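The $n \to 1$ case reduces to vector-matrix-vector products. A minimal sketch with invented numbers (not from the text), assuming NumPy:

```python
import numpy as np

# Sketch: for y = a^T x + b, the output moments are mu_Y = a^T mu_X + b
# and sigma_Y^2 = a^T Sigma_X a. Illustrative numbers.
mu_X = np.array([1.0, 2.0])
Sigma_X = np.array([[0.25, 0.10],
                    [0.10, 0.50]])
a = np.array([1.0, -1.0])
b = 3.0

mu_Y = a @ mu_X + b        # 1*1 - 1*2 + 3 = 2
var_Y = a @ Sigma_X @ a    # 0.25 - 0.10 - 0.10 + 0.50 = 0.55

assert abs(mu_Y - 2.0) < 1e-12
assert abs(var_Y - 0.55) < 1e-12
```

Note how the variance combines the individual variances with the covariance term: the negative weight on $X_2$ subtracts the covariance twice.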
3.4.2 Linearization of Nonlinear Functions
The Hessian $\mathbf{H}(\boldsymbol{\mu}_{\mathbf{X}})$ is an $n \times n$ matrix containing the
2nd-order partial derivatives evaluated at $\boldsymbol{\mu}_{\mathbf{X}}$; see §5.2 for details.

Because of the analytic simplicity associated with linear functions
of random variables, it is common to approximate nonlinear functions
by linear ones using a Taylor series so that
$$g(\mathbf{X}) =
\overbrace{\underbrace{g(\boldsymbol{\mu}_{\mathbf{X}})
+ \overbrace{\nabla g(\boldsymbol{\mu}_{\mathbf{X}})}^{\text{Gradient}}
(\mathbf{X}-\boldsymbol{\mu}_{\mathbf{X}})}_{\text{1st-order approximation}}
+ \tfrac{1}{2}(\mathbf{X}-\boldsymbol{\mu}_{\mathbf{X}})^\intercal
\overbrace{\mathbf{H}(\boldsymbol{\mu}_{\mathbf{X}})}^{\text{Hessian}}
(\mathbf{X}-\boldsymbol{\mu}_{\mathbf{X}})}^{\text{2nd-order approximation}}
+ \underbrace{\cdots}_{m\text{th-order approximation}}.$$
In practice, the series are most often limited to the first-order
approximation, so for a one-to-one function, it simplifies to
$$Y = g(X) \approx aX + b.$$
Figure 3.22 presents an example of such a linear approximation
for a one-to-one transformation. Linearizing at the expected value
$\mu_X$ minimizes the approximation errors because the linearization
is then centered in the region associated with a high probability
content for $f_X(x)$. In that case, $a$ corresponds to the gradient of
$g(x)$ evaluated at $\mu_X$,
Figure 3.22: Example of a linearized nonlinear transformation.
$$a = \left.\frac{dg(x)}{dx}\right|_{x=\mu_X}.$$
For the $n \to 1$ multivariate case, the linearized transformation leads
to
$$Y = g(\mathbf{X}) \approx \mathbf{a}^\intercal\mathbf{X} + b
= \nabla g(\boldsymbol{\mu}_{\mathbf{X}})(\mathbf{X} - \boldsymbol{\mu}_{\mathbf{X}}) + g(\boldsymbol{\mu}_{\mathbf{X}}),$$
where $Y$ has a mean and variance equal to
$$\mu_Y \approx g(\boldsymbol{\mu}_{\mathbf{X}}), \qquad
\sigma_Y^2 \approx \nabla g(\boldsymbol{\mu}_{\mathbf{X}})\,\boldsymbol{\Sigma}_{\mathbf{X}}\,\nabla g(\boldsymbol{\mu}_{\mathbf{X}})^\intercal.$$
For the $n \to n$ multivariate case, the linearized transformation leads
to
$$\mathbf{Y} = \mathbf{g}(\mathbf{X}) \approx \mathbf{A}\mathbf{X} + \mathbf{b}
= \mathbf{J}_{\mathbf{Y},\mathbf{X}}(\boldsymbol{\mu}_{\mathbf{X}})(\mathbf{X} - \boldsymbol{\mu}_{\mathbf{X}}) + \mathbf{g}(\boldsymbol{\mu}_{\mathbf{X}}),$$
where $\mathbf{Y}$ is described by the mean vector and covariance matrix,
$$\boldsymbol{\mu}_{\mathbf{Y}} \approx \mathbf{g}(\boldsymbol{\mu}_{\mathbf{X}}), \qquad
\boldsymbol{\Sigma}_{\mathbf{Y}} \approx \mathbf{J}_{\mathbf{Y},\mathbf{X}}(\boldsymbol{\mu}_{\mathbf{X}})\,\boldsymbol{\Sigma}_{\mathbf{X}}\,\mathbf{J}_{\mathbf{Y},\mathbf{X}}^\intercal(\boldsymbol{\mu}_{\mathbf{X}}).$$
For multivariate nonlinear functions, the gradient or Jacobian is
evaluated at the expected value $\boldsymbol{\mu}_{\mathbf{X}}$.
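To see how well the first-order approximation performs, the sketch below (an illustrative addition with invented numbers, assuming NumPy) linearizes $g(x) = x^2$ at $\mu_X$ and compares the linearized moments against a Monte Carlo estimate. With a small $\sigma_X$, the density is concentrated where the linearization is accurate.

```python
import numpy as np

# Sketch: first-order propagation for the nonlinear one-to-one function
# y = g(x) = x^2, with X ~ Normal(mu_X = 3, sigma_X = 0.1), compared
# against Monte Carlo. Illustrative numbers, not from the text.
rng = np.random.default_rng(0)
mu_X, sigma_X = 3.0, 0.1

g = lambda x: x**2
dg = lambda x: 2 * x                     # gradient of g

mu_Y_lin = g(mu_X)                       # linearized mean: 9
sigma_Y_lin = abs(dg(mu_X)) * sigma_X    # |a| * sigma_X = 0.6

ys = g(rng.normal(mu_X, sigma_X, 1_000_000))
# The exact mean is mu_X^2 + sigma_X^2 = 9.01, so the linearized
# moments are close but not exact.
assert abs(ys.mean() - mu_Y_lin) < 0.05
assert abs(ys.std() - sigma_Y_lin) < 0.05
```

Repeating the experiment with a larger $\sigma_X$ would show the approximation degrading, since more probability mass falls where $g$ is visibly curved.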