Classiﬁcation

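[Figure 9.1: Examples of application of classification analysis in the context of civil engineering. (a) Pr(pathogens) as a function of the chlorine concentration; (b) Pr(damage) as a function of the peak ground acceleration (PGA). In both panels, the observed classes are −1 or +1.]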

Classification is similar to regression, where the task is to predict a system response $y$, given some covariates $\mathbf{x}$ describing the system properties. With regression, the system responses were continuous quantities; with classification, the system responses are now discrete classes. Figure 9.1 presents two examples of classification applied to the civil engineering context; in (a), the sigmoid-like curve describes the probability of having pathogens in water, as a function of the chlorine concentration. In this problem, observations are either $-1$ or $+1$, respectively corresponding to one of the two classes {pathogens, ¬pathogens}. The second example, (b), presents the probability of having structural damage in a building given the peak ground acceleration (PGA) associated with an earthquake. Observations are either $-1$ or $+1$, respectively corresponding to one of the two classes {damage, ¬damage}.

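For classification, the data consists of $\mathtt{D}$ pairs of covariates $\mathbf{x}_i \in \mathbb{R}^{\mathtt{X}}$ and system responses $y_i \in \{-1, +1\}$, so $\mathcal{D} = \{(\mathbf{x}_i, y_i), \forall i \in \{1:\mathtt{D}\}\}$. For problems involving multiple classes, the system response is $y_i \in \{1, 2, \cdots, \mathtt{C}\}$. The typical goal of a classification method is to obtain a mathematical model for the probability of a system response given a covariate, for example, $\Pr(Y = +1|\mathbf{x})$.

Note (data): $\mathcal{D} = \{(\mathbf{x}_i, y_i), \forall i \in \{1:\mathtt{D}\}\}$
$\mathbf{x}_i \in \mathbb{R}^{\mathtt{X}}$: covariate, attribute, regressor
$y_i \in \{-1, +1\}$: observation for binary classification
$y_i \in \{1, 2, \cdots, \mathtt{C}\}$: observation for multiclass classification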

This problem can be modeled with either a generative or a discriminative approach. A generative classifier models the probability density function (PDF) of the covariates $\mathbf{x}$, for each class $y$, and then uses these PDFs to compute the probability of classes given the covariate values. A discriminative classifier directly builds a model of $\Pr(Y = +1|\mathbf{x})$ without modeling the PDF of covariates $\mathbf{x}$ for each class $y$. In practice, discriminative approaches are most common because they typically perform better on medium and large data sets [Ng, A. and M. Jordan (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems, Volume 14, pp. 841–848. MIT Press] and are often easier to use. This chapter presents the generic formulation for probabilistic generative classifiers and three discriminative classifiers: logistic

j.-a. goulet 140

regression, Gaussian process classification, and neural networks. We will see that generative classifiers are best suited for small data sets, Gaussian process classification is suited for medium-size ones, and neural networks for large ones. In order to simplify the explanations, this chapter presents each method while focusing on the context of binary classification, that is, $y \in \{-1, +1\}$.

9.1 Generative Classiﬁers

The basic idea of a generative classifier is to employ a set of observations $\mathcal{D}$ to model the conditional PDF of covariates $\mathbf{x} \in \mathbb{R}^{\mathtt{X}}$ given each class $y \in \{-1, +1\}$. Then these PDFs are employed in combination with the prior probability of each class in order to obtain the posterior probability of a class given a covariate $\mathbf{x}$. Generative classifiers are most appropriate when working with problems involving a small number of covariates and observations because they allow us to explicitly include domain-specific knowledge regarding $f(\mathbf{x}|y)$ or tailored features such as censored data.

9.1.1 Formulation

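The posterior probability of a class $y$ given covariates $\mathbf{x}$ can be expressed using Bayes rule,

$$\underbrace{p(y|\mathbf{x})}_{\text{posterior}} = \frac{\overbrace{f(\mathbf{x}|y)}^{\text{likelihood}} \cdot \overbrace{p(y)}^{\text{prior}}}{\underbrace{f(\mathbf{x})}_{\text{evidence}}}.$$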

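The posterior probability of a class $y$ conditional on a covariate value $\mathbf{x}$ is defined according to

$$p(y|\mathbf{x}) = \begin{cases} \Pr(Y = -1|\mathbf{x}) & \text{if } y = -1 \\ \Pr(Y = +1|\mathbf{x}) = 1 - \Pr(Y = -1|\mathbf{x}) & \text{if } y = +1. \end{cases}$$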

The prior probability of a class is described by the probabilities

$$p(y) = \begin{cases} \Pr(Y = -1) & \text{if } y = -1 \\ \Pr(Y = +1) & \text{if } y = +1. \end{cases}$$

Note: the prior knowledge can either be based on expert knowledge, i.e., $p(y)$, or be empirically estimated using observations, i.e., $p(y|\mathcal{D}) = \dfrac{p(\mathcal{D}|y) \cdot p(y)}{p(\mathcal{D})}$.

The prior and posterior are described by probability mass functions because $y \in \{-1, +1\}$ is a discrete variable. The likelihood of covariates $\mathbf{x}$ conditional on a class $y$ is a multivariate conditional probability density function that follows

$$f(\mathbf{x}|y) = \begin{cases} f(\mathbf{x}|y = -1) & \text{if } y = -1 \\ f(\mathbf{x}|y = +1) & \text{if } y = +1. \end{cases}$$

probabilistic machine learning for civil engineers 141

The normalization constant (i.e., the evidence) is obtained by summing the product of the likelihood and prior over all classes,

$$f(\mathbf{x}) = \sum_{y \in \{-1, 1\}} f(\mathbf{x}|y) \cdot p(y).$$

A key step with a generative classifier is to pick a distribution type for each $f(\mathbf{x}|y_j; \theta_j), \forall j$, and then estimate the parameters $\theta_j$ for these PDFs using the data set $\mathcal{D}$, that is, $f(\theta_j|\mathcal{D})$. For a class $y = j$, the parameters of $f(\mathbf{x}|y = j) \equiv f(\mathbf{x}; \theta_j)$ are estimated using only observations $\{\mathbf{x}_i : (\mathbf{x}_i, y_i = j), \forall i\}$, that is, the covariates $\mathbf{x}_i$ such that the associated system response is $y_i = j$. The estimation of parameters $\theta$ can be performed using either a Bayesian approach or a deterministic one as detailed in §6.3. As for the likelihood, the prior probability of each class $p(y|\mathcal{D})$ can be estimated using either a Bayesian approach or a deterministic one using the relative frequency of observations where $y_i = j$. With a Bayesian approach, the posterior predictive probability mass function (PMF) is obtained by marginalizing the uncertainty about the posterior parameter estimate,

Note: The notation $f(\mathbf{x}|y; \theta) \equiv f(\mathbf{x}|y, \theta)$ indicates that the conditional probability $f(\mathbf{x}|y)$ is parameterized by $\theta$.

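$$\underbrace{p(y|\mathbf{x}, \mathcal{D})}_{\text{posterior predictive}} = \int \frac{\overbrace{f(\mathbf{x}|y; \theta)}^{\text{likelihood}} \cdot \overbrace{p(y; \theta)}^{\text{prior}}}{\underbrace{f(\mathbf{x}; \theta)}_{\text{normalization constant}}} \cdot \underbrace{f(\theta|\mathcal{D})}_{\text{posterior } \theta} \, d\theta.$$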

When either a maximum a-posteriori (MAP) or a maximum likelihood estimate (MLE) is employed instead of a Bayesian estimation approach, the posterior predictive is replaced by the approximation,

$$\hat{p}(y|\mathbf{x}, \mathcal{D}) = \frac{f(\mathbf{x}|y; \theta^*) \cdot p(y; \theta^*)}{f(\mathbf{x}; \theta^*)},$$

where $\theta^*$ are the MLE or MAP estimates for parameters.

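Note (MLE for a normal PDF): $\theta^* = \arg\max_\theta f(\mathcal{D}; \theta)$, with $\theta = \{\mu_x, \sigma_x\}$ and $\mathcal{D} = \{x_i \in \mathbb{R}, \forall i \in \{1:\mathtt{D}\}\}$.

$$\begin{aligned}
f(x; \theta) &= \mathcal{N}(x; \mu_x, \sigma_x^2) \\
f(\mathcal{D}; \theta) &= \prod_{i=1}^{\mathtt{D}} \mathcal{N}(x_i; \mu_x, \sigma_x^2) \\
\ln f(\mathcal{D}; \theta) &\propto \sum_{i=1}^{\mathtt{D}} \left( -\frac{(x_i - \mu_x)^2}{2\sigma_x^2} + \ln \frac{1}{\sigma_x} \right) \\
\frac{\partial \ln f(\mathcal{D}; \theta)}{\partial \mu_x} &= 0 = \frac{1}{\sigma_x^2} \left( \sum_{i=1}^{\mathtt{D}} x_i - \mathtt{D}\mu_x \right) \;\rightarrow\; \mu_x^* = \frac{1}{\mathtt{D}} \sum_{i=1}^{\mathtt{D}} x_i \\
\frac{\partial \ln f(\mathcal{D}; \theta)}{\partial \sigma_x} &= 0 = \sum_{i=1}^{\mathtt{D}} \frac{(x_i - \mu_x)^2}{\sigma_x^3} - \frac{\mathtt{D}}{\sigma_x} \;\rightarrow\; \sigma_x^{2*} = \frac{1}{\mathtt{D}} \sum_{i=1}^{\mathtt{D}} (x_i - \mu_x^*)^2
\end{aligned}$$

Note (MLE for a class probability): $\theta = p_j$, with $\mathcal{D} = \{y_i \in \{1:\mathtt{C}\}, \forall i \in \{1:\mathtt{D}\}\}$ and $\mathtt{D}_j = \#\{i : y_i = j\} \le \mathtt{D}$.

$$\begin{aligned}
f(y; \theta) &= \begin{cases} p_j & \text{for } y = j \\ 1 - p_j & \text{for } y \neq j \end{cases} \\
f(\mathcal{D}; \theta) &\propto p_j^{\mathtt{D}_j} (1 - p_j)^{(\mathtt{D} - \mathtt{D}_j)} \\
\ln f(\mathcal{D}; \theta) &\propto \mathtt{D}_j \ln p_j + (\mathtt{D} - \mathtt{D}_j) \ln(1 - p_j) \\
\frac{\partial \ln f(\mathcal{D}; \theta)}{\partial p_j} &= 0 = \frac{\mathtt{D}_j}{p_j} - \frac{(\mathtt{D} - \mathtt{D}_j)}{1 - p_j} \;\rightarrow\; p_j^* = \frac{\mathtt{D}_j}{\mathtt{D}}
\end{aligned}$$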

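Maximum likelihood estimate. In the special case where $f(x|y_j; \theta_j^*) = \mathcal{N}(x; \mu^*_{x|y_j}, \sigma^{2*}_{x|y_j})$ and the number of available data $\mathtt{D}$ is large, one can employ the MLE approximation for the parameters $\mu_{x|y_j}$ and $\sigma_{x|y_j}$, which is defined by

$$\mu^*_{x|y_j} = \frac{1}{\mathtt{D}_j} \sum_i x_i, \qquad \sigma^{2*}_{x|y_j} = \frac{1}{\mathtt{D}_j} \sum_i (x_i - \mu^*_{x|y_j})^2, \quad \forall \{i : y_i = j\},$$

where $\mathtt{D}_j = \#\{i : y_i = j\}$ denotes the number of observations in a data set $\mathcal{D}$ that belong to the $j$th class. The MLE approximation of $p(y = j; \theta^*)$ is given by

$$p(y = j; \theta^*) = \frac{\mathtt{D}_j}{\mathtt{D}}.$$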
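The per-class MLE estimates above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `fit_gaussian_classes` and the small data set are hypothetical, a single covariate is used, and the variance divides by the class count $\mathtt{D}_j$ as in the MLE derivation.

```python
def fit_gaussian_classes(x, y):
    """Return {class: (mean, variance, prior)} from per-class MLE estimates."""
    params = {}
    n_total = len(x)
    for label in sorted(set(y)):
        # Keep only the covariates whose associated response equals this class
        xj = [xi for xi, yi in zip(x, y) if yi == label]
        n_j = len(xj)
        mu = sum(xj) / n_j
        var = sum((xi - mu) ** 2 for xi in xj) / n_j  # MLE: divide by D_j
        params[label] = (mu, var, n_j / n_total)      # prior = D_j / D
    return params

# Hypothetical data: one covariate, classes in {-1, +1}
x = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0, 9.0]
y = [-1, -1, -1, +1, +1, +1, +1]
params = fit_gaussian_classes(x, y)
print(params)
```

For small data sets, the MLE variance is biased low, which is one reason the text restricts this approximation to cases where the number of observations is large.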


Figure 9.2 presents an example of application of a generative classifier where the parameters for the normal PDFs are estimated using the MLE approximation. The bottom graph overlays the empirical histograms for the data associated with each class along with the PDFs $f(x|y = j, \theta^*)$. The top graph presents the classifier $p(y = +1|x, \theta^*)$, which is the probability of the class $y = +1$ given the covariate value $x$. If we assume that the prior probability of each class is equal, this conditional probability is computed as

$$\begin{aligned}
p(y = +1|x) &= \frac{f(x|y = +1) \cdot p(y = +1)}{f(x|y = -1) \cdot p(y = -1) + f(x|y = +1) \cdot p(y = +1)} \\
&= \frac{f(x|y = +1)}{f(x|y = -1) + f(x|y = +1)}, \quad \text{for } p(y = -1) = p(y = +1).
\end{aligned}$$
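As a hedged illustration of the formula above, the following sketch evaluates $p(y = +1|x)$ for two normal class-conditional PDFs. The parameter values and the names `normal_pdf` and `posterior_plus` are assumptions made for the example; they are not the values behind figure 9.2.

```python
import math

def normal_pdf(x, mu, sigma):
    """PDF of a univariate normal with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def posterior_plus(x, theta_minus, theta_plus, prior_plus=0.5):
    """Pr(Y=+1|x) from Bayes rule; each theta is a (mean, std) pair."""
    num = normal_pdf(x, *theta_plus) * prior_plus
    den = num + normal_pdf(x, *theta_minus) * (1.0 - prior_plus)
    # Note: far in the tails both PDFs can underflow to zero; this sketch does not guard against it
    return num / den

theta_minus = (2.0, 1.0)  # hypothetical (mean, std) for y = -1
theta_plus = (6.0, 1.0)   # hypothetical (mean, std) for y = +1
p_mid = posterior_plus(4.0, theta_minus, theta_plus)
p_high = posterior_plus(6.0, theta_minus, theta_plus)
print(p_mid, p_high)
```

With equal priors and equal standard deviations, the posterior equals 0.5 halfway between the two means, which is where the decision boundary sits.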

2 0 2 4 6 8 10

0.5

p(y =+1|x)

= 1} {x

=+1}

2 0 2 4 6 8 10

0.2

0.4

f(x|y)

f(x|y = 1) f (x|y =+1)

Figure 9.2: Example of generative classiﬁer

using marginal PDFs.

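Method        $f(\mathbf{x}|y = j)$
Naive Bayes   $X_i \perp\!\!\!\perp X_k \,|\, y, \forall i \neq k$; e.g., $\mathcal{B}(x_1; \alpha, \beta) \cdot \mathcal{N}(x_2; \mu, \sigma^2)$
LDA           $\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_{x,j}, \boldsymbol{\Sigma}_x)$
QDA           $\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_{x,j}, \boldsymbol{\Sigma}_{x,j})$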

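[Figure 9.3: Examples of PDF $f(\mathbf{x}|y = j)$ for naive Bayes, LDA, and QDA. (a) Naive Bayes; (b) LDA, $\boldsymbol{\Sigma}_{x,-1} = \boldsymbol{\Sigma}_{x,+1}$; (c) QDA, $\boldsymbol{\Sigma}_{x,-1} \neq \boldsymbol{\Sigma}_{x,+1}$.]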

Multiple covariates. When there is more than one covariate $\mathbf{x} = [x_1 \; x_2 \; \cdots \; x_{\mathtt{X}}]^\intercal$ describing a system, we must define the joint PDF $f(\mathbf{x}|y = j)$. A common approach to define this joint PDF is called naive Bayes (NB). It assumes that covariates $X_k$ are independent of each other so their joint PDF is obtained from the product of the marginals,

$$f(\mathbf{x}|y = j) = \prod_{k=1}^{\mathtt{X}} f(x_k|y = j).$$
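The product of marginals can be sketched as a sum of log-densities, which is the usual way to keep the computation numerically stable. The example below is a minimal sketch assuming normal marginals for every covariate; the names `naive_bayes_log_joint` and `naive_bayes_posterior` and all parameter values are hypothetical.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Log of the univariate normal PDF."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def naive_bayes_log_joint(x, marginals):
    """ln f(x|y=j) = sum_k ln f(x_k|y=j) under the independence assumption."""
    return sum(log_normal_pdf(xk, mu, sigma) for xk, (mu, sigma) in zip(x, marginals))

def naive_bayes_posterior(x, class_marginals, priors):
    """Posterior probability of each class given the covariate vector x."""
    log_terms = {j: naive_bayes_log_joint(x, m) + math.log(priors[j])
                 for j, m in class_marginals.items()}
    top = max(log_terms.values())  # subtract the maximum before exponentiating
    weights = {j: math.exp(v - top) for j, v in log_terms.items()}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

# Hypothetical (mean, std) of each of the two covariates, per class
class_marginals = {-1: [(0.0, 1.0), (0.0, 2.0)], +1: [(2.0, 1.0), (3.0, 2.0)]}
priors = {-1: 0.5, +1: 0.5}
post = naive_bayes_posterior([1.0, 1.5], class_marginals, priors)
print(post)
```

Any other marginal PDF (for example a Beta distribution for a bounded covariate) could replace `log_normal_pdf` for individual covariates without changing the structure.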

The method employing the special case where the joint PDF of covariates is described by a multivariate normal $f(\mathbf{x}|y = j) = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$ is called quadratic discriminant analysis (QDA). The label quadratic refers to the shape of the boundary in the covariate domain where the probabilities of both classes are equal, $\Pr(Y = +1|\mathbf{x}) = \Pr(Y = -1|\mathbf{x})$. In the case where the covariance matrix $\boldsymbol{\Sigma}$ is the same for each class, the boundary becomes linear, and this special case is named linear discriminant analysis (LDA). The naive Bayes, QDA, and LDA methods typically employ an MLE approach to estimate the parameters of $f(\mathbf{x}|y = j)$. Figure 9.3 presents examples of $f(\mathbf{x}|y = j)$ for the three approaches: NB, LDA, and QDA. Figure 9.4 presents an example of an application of quadratic discriminant analysis where the joint probabilities $f(\mathbf{x}|y = j)$ are multivariate Normal PDFs as depicted by the contour plots. The mean vectors and covariance matrices describing these PDFs are estimated using an MLE approximation. Because there are two covariates, the classifier $\Pr(Y = +1|\mathbf{x})$ is now a 2-D surface obtained following

$$p(y = +1|\mathbf{x}) = \frac{f(\mathbf{x}|y = +1) \cdot p(y = +1)}{f(\mathbf{x}|y = -1) \cdot p(y = -1) + f(\mathbf{x}|y = +1) \cdot p(y = +1)}.$$
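A minimal sketch of this 2-D classifier is given below for two covariates, assuming the class means and covariance matrices are already known. The helper names `log_mvn2` and `qda_posterior_plus` and the parameter values are illustrative assumptions; the 2×2 inverse is written out by hand to keep the example dependency-free.

```python
import math

def log_mvn2(x, mu, cov):
    """Log PDF of a bivariate normal; cov is ((a, b), (b, c))."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx0, dx1 = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T cov^{-1} (x - mu) using the closed-form 2x2 inverse
    quad = (c * dx0 * dx0 - 2.0 * b * dx0 * dx1 + a * dx1 * dx1) / det
    return -0.5 * quad - 0.5 * math.log(det) - math.log(2.0 * math.pi)

def qda_posterior_plus(x, theta_minus, theta_plus, prior_plus=0.5):
    """Pr(Y=+1|x) where each theta is a (mean, covariance) pair."""
    lp = log_mvn2(x, *theta_plus) + math.log(prior_plus)
    lm = log_mvn2(x, *theta_minus) + math.log(1.0 - prior_plus)
    diff = lm - lp
    if diff > 0.0:  # branch chosen so that exp() never overflows
        e = math.exp(-diff)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(diff))

identity = ((1.0, 0.0), (0.0, 1.0))
theta_minus = ((0.0, 0.0), identity)                      # hypothetical class -1
theta_plus_lda = ((2.0, 2.0), identity)                   # same covariance: LDA case
theta_plus_qda = ((2.0, 2.0), ((2.0, 0.5), (0.5, 1.0)))   # different covariance: QDA case
p_lda = qda_posterior_plus((1.0, 1.0), theta_minus, theta_plus_lda)
p_qda = qda_posterior_plus((2.0, 2.0), theta_minus, theta_plus_qda)
print(p_lda, p_qda)
```

When both classes share the same covariance matrix, the quadratic terms cancel in `lm - lp`, which is why the boundary becomes linear in the LDA case.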


Note that for many problems, generative methods such as naive

Bayes and quadratic discriminant analysis are outperformed by

discriminative approaches such as those presented in §9.2–§9.4.

2 2 6 10

2

= 1} {x

=+1}

2

0.5

p(y =+1|x)

Figure 9.4: Example of quadratic discrimi-

nant analysis where the joint probabilities

)depictedbycontourplotsarede-

scribed by multivariate Normal P DFs. The

parameters of each PDF are estimated

using an MLE approach.

9.1.2 Example: Post-Earthquake Structural Safety Assessment

We present here an example of a data-driven post-earthquake structural safety assessment. The goal is to assess right after an earthquake whether or not buildings are safe for occupation, that is, $y \in \{\text{safe}, \neg\text{safe}\}$. We want to guide this assessment based on empirical data where the observed covariate $x \in (0, 1)$ describes the ratio of a building's first natural frequency $f$ [Hz] after and before the earthquake,

$$x = \frac{f_{\text{post-earthquake}}}{f_{\text{pre-earthquake}}}.$$

Figure 9.5 presents a set of 45 empirical observations collected in different countries [Goulet, J.-A., C. Michel, and A. Der Kiureghian (2015). Data-driven post-earthquake rapid structural safety assessment. Earthquake Engineering & Structural Dynamics 44 (4), 549–562] and describing the frequency ratio $x$ as a function of the damage index $d \in \{0, 1, \cdots, 5\}$, which ranges from undamaged ($d = 0$) up to collapse ($d = 5$). Note that observations marked by an arrow are censored data, where $x$ is an upper-bound censored observation (see §6.3) for the frequency ratio. The first step is to employ this data set $\mathcal{D} = \{(x_i, d_i), \forall i \in \{1:45\}\}$, with $x_i \in (0, 1)$ and $d_i \in \{0:5\}$, in order to learn the conditional PDF of the frequency ratio $X$, given a damage index $d$. Because $x \in (0, 1)$, we choose to describe its conditional PDF using a Beta distribution,

$$f(x; \theta(d)) = \mathcal{B}(x; \underbrace{\mu(d), \sigma(d)}_{\theta}).$$

[Figure 9.5: Empirical observations of the ratio between fundamental frequency of buildings after and before an earthquake, plotted against the damage index $d$ (0 to 5). Observations come from Algeria, Japan, Martinique, Spain, Italy, and the USA; arrows mark upper-bound observations.]

Note that the Beta distribution is parameterized here by its mean and standard deviation and not by $\alpha$ and $\beta$ as described in §4.3. The justification for this choice of parameterization will become clear when we present the constraints that are linking the prior knowledge for different damage indexes $d$. For a given $d$, the posterior PDF for the parameters $\theta(d) = \{\mu(d), \sigma(d)\}$ is obtained using Bayes rule,

$$\underbrace{f(\theta(d)|\mathcal{D})}_{\text{posterior}} = \frac{\overbrace{f(\mathcal{D}|\theta(d))}^{\text{likelihood}} \cdot \overbrace{f(\theta(d))}^{\text{prior}}}{\underbrace{f(\mathcal{D})}_{\text{constant}}},$$

and the posterior predictive PDF of $X$ is obtained by marginalizing the uncertainty associated with the estimation of $\theta(d)$,

$$f(x|d, \mathcal{D}) = \int f(x; \theta(d)) \cdot f(\theta(d)|\mathcal{D}) \, d\theta.$$
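To make the mean and standard deviation parameterization of the Beta distribution concrete, the sketch below converts $\{\mu, \sigma\}$ into the usual shape parameters $\{\alpha, \beta\}$ using the standard moment relations $\alpha = \mu\nu$ and $\beta = (1-\mu)\nu$ with $\nu = \mu(1-\mu)/\sigma^2 - 1$. The function names are hypothetical, and this is an illustration rather than the implementation used for the example in the text.

```python
import math

def beta_shape_from_moments(mu, sigma):
    """Convert a Beta mean and standard deviation into (alpha, beta)."""
    if not 0.0 < mu < 1.0:
        raise ValueError("the mean must lie in (0, 1)")
    nu = mu * (1.0 - mu) / sigma ** 2 - 1.0
    if nu <= 0.0:
        raise ValueError("sigma is too large for a Beta PDF with this mean")
    return mu * nu, (1.0 - mu) * nu

def beta_pdf(x, mu, sigma):
    """Beta PDF evaluated at x in (0, 1), parameterized by mean and std."""
    a, b = beta_shape_from_moments(mu, sigma)
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x))

# Check case: mean 0.5 and std sqrt(1/12) correspond to the uniform PDF on (0, 1)
alpha, beta = beta_shape_from_moments(0.5, math.sqrt(1.0 / 12.0))
density = beta_pdf(0.3, 0.5, math.sqrt(1.0 / 12.0))
print(alpha, beta, density)
```

The validity check on `nu` reflects that a Beta distribution with mean $\mu$ cannot have a variance larger than $\mu(1-\mu)$.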


By assuming that observations are conditionally independent given $x$, the joint likelihood of observations is

$$f(\mathcal{D}|\theta(d)) = \prod_{\{i : d_i = d\}} L(x_i|\mu(d), \sigma(d)),$$

where the marginal likelihood for each observation (see §6.3) is

$$L(x_i|\mu(d), \sigma(d)) = \begin{cases} f(x_i|\mu(d), \sigma(d)), & x_i \text{: direct observation} \\ F(x_i|\mu(d), \sigma(d)), & x_i \text{: censored observation} \end{cases}$$

and where $f(\cdot)$ denotes the PDF and $F(\cdot)$ denotes the cumulative distribution function (CDF). The prior is assumed to be uniform in the range $(0, 1)$ for $\mu(d)$ and uniform in the range $(0, 0.25)$ for $\sigma(d)$. For the mean parameters $\mu(d)$, the prior is also constrained following $\mu(0) > \mu(1) > \cdots > \mu(5)$. These constraints reflect our domain-specific knowledge where we expect the mean frequency ratio to decrease with increasing values of $d$.

The Metropolis algorithm presented in §7.1 is employed to generate 35 000 samples ($\hat{R} \le 1.005$ for 3 chains) from the posterior for the parameters $\theta(d)$. Figure 9.6a presents the contours of the posterior PDF for each pair of parameters $\{\mu(d), \sigma(d)\}$ along with a subset of samples. The predictive PDFs $f(x|d, \mathcal{D})$ of a frequency ratio $x$ conditional on $d$ are presented in figure 9.6b. The predictive probability of each damage index $d$ given a frequency ratio $x$ is obtained following

$$p(d|x, \mathcal{D}) = \int \frac{f(x; \theta(d)) \cdot p(d)}{\sum_{d'} f(x; \theta(d')) \cdot p(d')} \cdot f(\theta(d)|\mathcal{D}) \, d\theta. \tag{9.1}$$

The integral in equation 9.1 can be approximated using the Markov chain Monte Carlo (MCMC) samples as described in §7.5. Figure 9.7 presents the posterior predictive probability $p(d|x, \mathcal{D})$, which is computed while assuming that there is an equal prior probability $p(d)$ for each damage index. This classifier can be employed to assist inspectors during post-earthquake safety assessment of structures.

[Figure 9.6: Posterior PDFs of means and standard deviations and posterior predictive PDFs of frequency ratios for each damage index $d$. (a) Posterior PDFs for $\theta(d)$, shown as one panel of $\sigma(d)$ versus $\mu(d)$ for each $d \in \{0, \cdots, 5\}$; (b) Posterior predictive PDF $f(x|d, \mathcal{D})$ for $d = 0$ to $d = 5$.]

[Figure 9.7: Posterior predictive probability $p(d|x, \mathcal{D})$ of a damage index $d$ given a frequency ratio $x$, for $d = 0$ to $d = 5$.]

9.2 Logistic Regression

Logistic regression is not a state-of-the-art approach, yet its prevalence, historical importance, and simple formulation make it a good choice to introduce discriminative regression methods. Logistic regression is the extension of linear regression (see §8.1) for classification problems. In the context of regression, a linear function $g(\mathbf{x}): \mathbb{R}^{\mathtt{X}} \to \mathbb{R}$ is defined so that it transforms an $\mathtt{X}$-dimensional covariate domain into a single output $g(\mathbf{x}) = \mathbf{X}\mathbf{b} \in \mathbb{R}$. In the context of classification, system responses are discrete, for example, $y \in \{-1, +1\}$. The idea with classification is to transform the output of a linear function $g(\mathbf{x})$ into the $(0, 1)$ interval describing the probability $\Pr(Y = +1|\mathbf{x})$. We can transform the output of a linear model $\mathbf{X}\mathbf{b}$ into the interval $(0, 1)$ by passing it through a logistic sigmoid function $\sigma(z)$,

$$\sigma(z) = \frac{1}{1 + \exp(-z)} = \frac{\exp(z)}{\exp(z) + 1},$$

as plotted in figure 9.8. The sigmoid is a transformation $\mathbb{R} \to (0, 1)$, such that an input defined in the real space is squashed into the interval $(0, 1)$. In short, logistic regression maps the covariates $\mathbf{x}$ to the probability of a class,

$$\underbrace{\mathbf{x} \in \mathbb{R}^{\mathtt{X}}}_{\text{covariates}} \;\to\; \underbrace{g(\mathbf{x}) = \mathbf{X}\mathbf{b} \in \mathbb{R}}_{\text{hidden/latent variable}} \;\to\; \underbrace{\sigma(g(\mathbf{x})) \in (0, 1)}_{\Pr(Y = y|\mathbf{x})}.$$

Note that $g(\mathbf{x})$ is considered as a latent or hidden variable because it is not directly observed; only the class $y$ and its associated covariate $\mathbf{x}$ are.
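The mapping from a covariate to a class probability can be sketched as follows for a single covariate. The coefficients `b0` and `b1` are hypothetical values chosen for illustration, and the two-branch form of `sigmoid` is an implementation choice that avoids overflow for large negative inputs.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, arranged so that exp() never receives a large positive argument."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def prob_plus(x, b0, b1):
    """Pr(Y=+1|x) for a linear latent function g(x) = b0 + b1 * x."""
    return sigmoid(b0 + b1 * x)

# Hypothetical coefficients: g(x) crosses zero at x = 50
p50 = prob_plus(50.0, b0=-4.0, b1=0.08)
p100 = prob_plus(100.0, b0=-4.0, b1=0.08)
print(p50, p100)
```

The probability equals 0.5 wherever the latent function $g(x)$ crosses zero, so the decision boundary of the classifier is the zero level of $g$.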

6 4 2 0 2 4 6

(z)

Figure 9.8: The logistic sigmoid function

(z)=(1+exp(z))

1

Figure 9.9 presents the example of a function $g(x) = 0.08x - 4$ (i.e., the red line on the horizontal plane), which is passed through a logistic sigmoid function. The blue curve on the leftmost vertical plane is the logistic sigmoid function, and the orange curve on the right should be interpreted as the probability of the class $y = +1$, given a covariate $x$, $\Pr(Y = +1|x)$. For four different covariates $x_i$, simulated observations $y_i \in \{-1, +1\}$ are depicted by crosses.

[Figure 9.9: Example of a function $g(x) = 0.08x - 4$ passed through the logistic sigmoid function.]

In practice, we have to follow the reverse path, where observations $\mathcal{D} = \{(\mathbf{x}_i, y_i), \forall i \in \{1:\mathtt{D}\}\}$ are available and we need to estimate the parameters $\mathbf{b}$ and the basis functions $\phi(\mathbf{x})$ defining the model matrix $\mathbf{X}$. For a given data set and a choice of basis functions, we separate the observations $\mathbf{x}_i$ in order to build two model matrices: one for $y_i = -1$, $\mathbf{X}^{(-1)}$, and another for $y_i = +1$, $\mathbf{X}^{(+1)}$. The likelihood of an observation $y = +1$ is given by $\Pr(Y = +1|\mathbf{x}) = \sigma(\mathbf{X}\mathbf{b})$, and for an observation $y = -1$ the likelihood is $\Pr(Y = -1|\mathbf{x}) = 1 - \sigma(\mathbf{X}\mathbf{b})$. With the hypothesis that observations are conditionally independent, the joint likelihood is obtained from the product of the marginal likelihoods or, equivalently, the sum of the marginal log-likelihoods of observations given parameters $\mathbf{b}$,

$$\ln p(\mathcal{D}|\mathbf{b}) = \sum \ln\left(\sigma(\mathbf{X}^{(+1)}\mathbf{b})\right) + \sum \ln\left(1 - \sigma(\mathbf{X}^{(-1)}\mathbf{b})\right).$$

Optimal parameters $\mathbf{b}^*$ can be inferred using an MLE procedure that maximizes $\ln p(\mathcal{D}|\mathbf{b})$. Unfortunately, contrary to the optimization of parameters in the case of the linear regression presented in §8.1, no closed-form analytic solution exists here to identify $\mathbf{b}^*$. With linear regression, the derivative of the log-likelihood (i.e., the loss function) was linear, leading to an analytic solution for $\mathbf{b}^*$: $\frac{\partial J(\mathbf{b})}{\partial \mathbf{b}} = 0$. With logistic regression, the derivative of the log-likelihood is a nonlinear function, so we have to resort to an optimization algorithm (see chapter 5) to identify $\mathbf{b}^*$.
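As an illustration of the optimization step, the sketch below maximizes the log-likelihood for a single covariate and the linear model $g(x) = b_0 + b_1 x$ using Newton-Raphson iterations with step halving. The choice of optimizer, the function names, and the small data set are assumptions made for this example; the compact form $\ln \sigma(y_i \cdot g(x_i))$ is used because $1 - \sigma(g) = \sigma(-g)$, so it matches the two-sum expression above.

```python
import math

def log_sigmoid(z):
    """ln sigma(z), arranged to avoid overflow."""
    if z >= 0.0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def sigmoid(z):
    return math.exp(log_sigmoid(z))

def log_likelihood(b, x, y):
    """ln p(D|b) = sum_i ln sigma(y_i * (b0 + b1 * x_i)) for y_i in {-1, +1}."""
    return sum(log_sigmoid(yi * (b[0] + b[1] * xi)) for xi, yi in zip(x, y))

def gradient_and_curvature(b, x, y):
    """Gradient of ln p(D|b) and the entries of the negative Hessian."""
    g0 = g1 = a00 = a01 = a11 = 0.0
    for xi, yi in zip(x, y):
        p = sigmoid(b[0] + b[1] * xi)
        r = (yi + 1) / 2 - p      # residual between the 0/1 target and Pr(Y=+1|x)
        w = p * (1.0 - p)
        g0 += r
        g1 += r * xi
        a00 += w
        a01 += w * xi
        a11 += w * xi * xi
    return (g0, g1), (a00, a01, a11)

def fit_logistic(x, y, max_iter=50, tol=1e-10):
    b = [0.0, 0.0]
    for _ in range(max_iter):
        (g0, g1), (a00, a01, a11) = gradient_and_curvature(b, x, y)
        if max(abs(g0), abs(g1)) < tol:
            break
        det = a00 * a11 - a01 * a01
        s0 = (a11 * g0 - a01 * g1) / det   # Newton step: solve A s = gradient
        s1 = (a00 * g1 - a01 * g0) / det
        current, scale = log_likelihood(b, x, y), 1.0
        while scale > 1e-6:                # halve the step until it does not decrease ln p(D|b)
            cand = [b[0] + scale * s0, b[1] + scale * s1]
            if log_likelihood(cand, x, y) >= current - 1e-12:
                b = cand
                break
            scale *= 0.5
    return b

# Hypothetical, non-separable observations
x = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
y = [-1, -1, -1, +1, -1, +1, -1, +1, +1, +1]
b_hat = fit_logistic(x, y)
print(b_hat, log_likelihood(b_hat, x, y))
```

If the two classes were perfectly separable, the likelihood would have no finite maximum and the estimates would grow without bound, so a data set with overlapping classes is used here.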

Example: Logistic regression. Figure 9.10 presents three examples involving a single covariate $x$ and a linear model $g(x) = b_0 + b_1 x$, and where parameters $\mathbf{b} = [b_0 \; b_1]^\intercal$ are estimated with respectively 5, 10, and 100 observations. The true parameter values employed to generate simulated observations are $\check{\mathbf{b}} = [-4 \;\; 0.08]^\intercal$. The corresponding functions $g(x)$ and $\sigma(g(x))$ are represented by dashed lines, and those estimated using MLE parameters, $g(\mathbf{X}\mathbf{b}^*)$ and $\sigma(g(\mathbf{X}\mathbf{b}^*))$, are represented by solid lines. We can observe that as the number of observations increases, the classifier converges toward the one employed to generate the data.

[Figure 9.10: Example of application of logistic regression. (a) $\mathtt{D} = 5$, $\mathbf{b}^* = [-3.1 \;\; 0.09]^\intercal$; (b) $\mathtt{D} = 10$, $\mathbf{b}^* = [-4.4 \;\; 0.08]^\intercal$; (c) $\mathtt{D} = 100$, $\mathbf{b}^* = [-3.9 \;\; 0.08]^\intercal$.]

This case is a trivial one because, in this closed-loop simulation, the model structure (i.e., as defined by the basis functions in model matrix $\mathbf{X}$) was a perfect fit for the problem. In practical cases, we have to select an appropriate set of basis functions $\phi(\mathbf{x})$ to suit the problem at hand. As for linear regression, the selection of basis functions is prone to overfitting, so we have to employ either Bayesian model selection (see §6.8) or cross-validation (see §8.1.2) for that purpose.

Civil engineering perspectives. In the field of transportation engineering, logistic regression has been extensively employed for discrete choice modeling [Ben-Akiva, M. E. and S. R. Lerman (1985). Discrete choice analysis: Theory and application to travel demand. MIT Press; and McFadden, D. (2001). Economic choices. American Economic Review 91 (3), 351–378] because of the interpretability of the model parameters $\mathbf{b}$ in the context of behavioral economics. However, for most benchmark problems, the predictive capacity of logistic regression is outperformed by more modern techniques such as Gaussian process classification and neural networks, which are presented in the next two sections.

9.3 Gaussian Process Classiﬁcation

Gaussian process classification (GPC) is the extension of Gaussian process regression (GPR; see §8.2) to classification problems. In the context of GPR, a function $g(\mathbf{x}): \mathbb{R}^{\mathtt{X}} \to \mathbb{R}$ is defined so that it transforms an $\mathtt{X}$-dimensional covariate domain into a single output $g(\mathbf{x}) \in \mathbb{R}$. In the context of classification, the system response is $y \in \{-1, +1\}$. Again, the idea with classification is to transform $g(\mathbf{x})$'s outputs into the $(0, 1)$ interval describing the probability $\Pr(Y = +1|\mathbf{x})$. For GPC, the transformation into the interval $(0, 1)$ is done using the standard Normal CDF presented in figure 9.11, where $\Phi(z)$ denotes the CDF of $Z \sim \mathcal{N}(z; 0, 1)$.

[Figure 9.11: The standard Normal CDF, where $\Phi(z)$ denotes the CDF of $Z \sim \mathcal{N}(z; 0, 1)$, plotted for $z$ from −4 to 4.]

Like the logistic sigmoid function presented in §9.2, $\Phi(z)$ is a transformation $\mathbb{R} \to (0, 1)$, such that an input defined in the real space is mapped into the interval $(0, 1)$. Note that choosing the transformation function $\Phi(z)$ instead of the logistic function is not arbitrary; we will see later that it allows maintaining the analytic tractability of GPC. We saw in §8.2 that the function $g(\mathbf{x}_*)$, describing the system responses at prediction locations $\mathbf{x}_*$, is modeled by a Gaussian process,

$$f(\mathbf{g}_*|\mathbf{x}_*, \mathcal{D}) \equiv f(\mathbf{g}_*|\mathcal{D}) = \mathcal{N}(\mathbf{g}_*; \boldsymbol{\mu}_{*|\mathcal{D}}, \boldsymbol{\Sigma}_{*|\mathcal{D}}).$$

[Figure 9.12: Example of Gaussian process $g(x)$ evaluated through $\Phi(g(x))$ and whose uncertainty is marginalized in order to describe $\Pr(Y = +1|x)$. The figure shows $g(x)$ with its $\pm 2\sigma$ interval, $\Phi(g(x))$, and $\Pr(Y = +1|x)$.]

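In order to compute $\Pr(Y = +1|\mathbf{x}_*, \mathcal{D})$, that is, the probability that $Y = +1$ for a covariate $\mathbf{x}_*$ and a set of observations $\mathcal{D}$, we need to transform $g_*$ into the interval $(0, 1)$ and then marginalize the uncertainty associated with $f(g_*|\mathbf{x}_*, \mathcal{D})$ so that

$$\Pr(Y = +1|\mathbf{x}_*, \mathcal{D}) = \int \Phi(g_*) \cdot f(g_*|\mathbf{x}_*, \mathcal{D}) \, dg_*. \tag{9.2}$$

If we choose the standard normal CDF, $\Phi(z)$, as a transformation function, the integral in equation 9.2 follows the closed-form solution

$$\Pr(Y = +1|\mathbf{x}_*, \mathcal{D}) = \Phi\left( \frac{\mathbb{E}[G_*|\mathbf{x}_*, \mathcal{D}]}{\sqrt{1 + \operatorname{var}[G_*|\mathbf{x}_*, \mathcal{D}]}} \right), \tag{9.3}$$

where for the $i$th prediction location, $\mathbb{E}[G_*|\mathbf{x}_*, \mathcal{D}] \equiv [\boldsymbol{\mu}_{*|\mathcal{D}}]_i$ and $\operatorname{var}[G_*|\mathbf{x}_*, \mathcal{D}] \equiv [\boldsymbol{\Sigma}_{*|\mathcal{D}}]_{ii}$. Figure 9.12 illustrates how the uncertainty related to a Gaussian process evaluated through $\Phi(g(x))$ is marginalized in order to describe $\Pr(Y = +1|x, \mathcal{D})$.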
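The closed-form solution in equation 9.3 can be checked numerically for one prediction location. The sketch below compares it with a trapezoidal approximation of the integral in equation 9.2; the mean and variance values and the function names are assumptions chosen for illustration.

```python
import math

def std_normal_cdf(z):
    """Standard normal CDF Phi(z), written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_plus_closed_form(mean, var):
    """Equation 9.3 for a single location: Phi(mean / sqrt(1 + var))."""
    return std_normal_cdf(mean / math.sqrt(1.0 + var))

def prob_plus_numeric(mean, var, n=20001, width=10.0):
    """Trapezoidal approximation of the integral of Phi(g) * N(g; mean, var) dg."""
    sd = math.sqrt(var)
    lo, hi = mean - width * sd, mean + width * sd
    h = (hi - lo) / (n - 1)
    total = 0.0
    for k in range(n):
        g = lo + k * h
        pdf = math.exp(-0.5 * ((g - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        weight = 0.5 if k in (0, n - 1) else 1.0
        total += weight * std_normal_cdf(g) * pdf
    return total * h

# Hypothetical posterior mean and variance of the latent variable g_*
closed = prob_plus_closed_form(0.8, 2.0)
numeric = prob_plus_numeric(0.8, 2.0)
print(closed, numeric)
```

The variance term in the denominator pulls the probability toward 0.5: the more uncertain the latent variable, the less confident the classification.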

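So far, we have seen how to employ the standard normal CDF to transform a Gaussian process $g_* \in \mathbb{R}$, with PDF $f(g_*|\mathbf{x}_*, \mathcal{D})$ obtained from a set of observations $\mathcal{D} = \{\mathcal{D}_x, \mathcal{D}_y\} = \{(\mathbf{x}_i, y_i), \forall i \in \{1:\mathtt{D}\}\}$, into a space $\Phi(g_*) \in (0, 1)$. The issue is that in a classification setup, $g(\mathbf{x})$ is not directly observable; only $y_i \in \{-1, +1\}$ is. This requires inferring for each covariate $\mathbf{x} \in \mathcal{D}$ the mean and standard deviation for latent variables $G(\mathbf{x})$. For that, we need to rewrite equation 9.3 in order to explicitly include the conditional dependence over $\mathbf{g}$ so that

$$\Pr(Y = +1|\mathbf{x}_*, \mathcal{D}) = \int \Phi(g_*) \cdot f(g_*|\mathbf{x}_*, \mathbf{g}, \mathcal{D}) \, dg_* = \Phi\left( \frac{\mathbb{E}[G_*|\mathbf{x}_*, \mathbf{g}, \mathcal{D}]}{\sqrt{1 + \operatorname{var}[G_*|\mathbf{x}_*, \mathbf{g}, \mathcal{D}]}} \right).$$

Note: The qualification latent refers to variables that are not directly observed and that need to be inferred from observations of the classes $y \in \{-1, +1\}$.

For the $i$th prediction location, $\mathbb{E}[G_*|\mathbf{x}_*, \mathbf{g}, \mathcal{D}] \equiv [\boldsymbol{\mu}_{*|\mathbf{g},\mathcal{D}}]_i$ and $\operatorname{var}[G_*|\mathbf{x}_*, \mathbf{g}, \mathcal{D}] \equiv [\boldsymbol{\Sigma}_{*|\mathbf{g},\mathcal{D}}]_{ii}$, where

$$f(\mathbf{g}_*|\mathbf{g}) \equiv f(\mathbf{g}_*|\mathbf{x}_*, \mathbf{g}, \mathcal{D}) = \mathcal{N}(\mathbf{g}_*; \boldsymbol{\mu}_{*|\mathbf{g},\mathcal{D}}, \boldsymbol{\Sigma}_{*|\mathbf{g},\mathcal{D}}). \tag{9.4}$$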

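Equation 9.4 presents the posterior probability of the Gaussian process outcomes $\mathbf{g}_*$ at prediction locations $\mathbf{x}_*$, given the data set $\mathcal{D} = \{\mathcal{D}_x, \mathcal{D}_y\} = \{(\mathbf{x}_i, y_i \in \{-1, +1\}), \forall i \in \{1:\mathtt{D}\}\}$, and the inferred values for $\mathbf{g}$ at observed locations $\mathbf{x}$. The mean and covariance matrix for the PDF in equation 9.4 are, respectively,

$$\begin{aligned}
\boldsymbol{\mu}_{*|\mathbf{g},\mathcal{D}} &= \overbrace{\boldsymbol{\mu}_*}^{=0} + \boldsymbol{\Sigma}_{G*}^\intercal \boldsymbol{\Sigma}_G^{-1} (\boldsymbol{\mu}_{G|\mathcal{D}} - \overbrace{\boldsymbol{\mu}_G}^{=0}) \\
\boldsymbol{\Sigma}_{*|\mathbf{g},\mathcal{D}} &= \boldsymbol{\Sigma}_* - \boldsymbol{\Sigma}_{G*}^\intercal \boldsymbol{\Sigma}_{G|\mathcal{D}}^{-1} \boldsymbol{\Sigma}_{G*},
\end{aligned}$$

Note: The prior means for the Gaussian process at both observed ($\boldsymbol{\mu}_G$) and predicted ($\boldsymbol{\mu}_*$) locations are assumed to be equal to zero.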

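where $\boldsymbol{\mu}_{G|\mathcal{D}}$ and $\boldsymbol{\Sigma}_{G|\mathcal{D}}$ are the mean vector and covariance matrix for the inferred latent observations of $G(\mathbf{x}) \sim f(\mathbf{g}|\mathcal{D}) \approx \mathcal{N}(\mathbf{g}; \boldsymbol{\mu}_{G|\mathcal{D}}, \boldsymbol{\Sigma}_{G|\mathcal{D}})$. The posterior PDF for inferred latent variables $G(\mathbf{x})$ is

$$f(\mathbf{g}|\mathcal{D}) = \frac{p(\mathcal{D}_y|\mathbf{g}) \cdot f(\mathbf{g}|\mathcal{D}_x)}{p(\mathcal{D}_y)} \propto f(\mathcal{D}_y|\mathbf{g}) \cdot f(\mathbf{g}|\mathcal{D}_x). \tag{9.5}$$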

The first term corresponds to the joint likelihood of observations $\mathcal{D}_y$ given the associated set of inferred latent variables $\mathbf{g}$. With the assumption that observations are conditionally independent given $g_i$, the joint likelihood is obtained from the product of the marginals so that

$$p(\mathcal{D}_y|\mathbf{g}) = \prod_{i=1}^{\mathtt{D}} p(y_i|g_i) = \prod_{i=1}^{\mathtt{D}} \Phi(y_i \cdot g_i).$$

Note: Because of the symmetry in the standard normal CDF, $p(y = +1|g) = \Phi(g)$ and $p(y = -1|g) = 1 - \Phi(g) = \Phi(-g)$. This explains why the marginal likelihood simplifies to $p(y|g) = \Phi(y \cdot g)$.

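The second term in equation 9.5, $f(\mathbf{g}|\mathcal{D}_x) = \mathcal{N}(\mathbf{g}; \boldsymbol{\mu}_G = \mathbf{0}, \boldsymbol{\Sigma}_G)$, is the prior knowledge for the inferred latent variables $G(\mathbf{x})$. The posterior mean of $f(\mathbf{g}|\mathcal{D})$, that is, $\boldsymbol{\mu}_{G|\mathcal{D}}$, is obtained by maximizing the logarithm of $f(\mathbf{g}|\mathcal{D})$ so that

$$\begin{aligned}
\boldsymbol{\mu}_{G|\mathcal{D}} = \mathbf{g}^* &= \arg\max_{\mathbf{g}} \ln f(\mathbf{g}|\mathcal{D}) \\
&= \arg\max_{\mathbf{g}} \; \ln f(\mathcal{D}_y|\mathbf{g}) - \frac{1}{2} \mathbf{g}^\intercal \boldsymbol{\Sigma}_G^{-1} \mathbf{g} - \frac{1}{2} \ln |\boldsymbol{\Sigma}_G| - \frac{\mathtt{D}}{2} \ln 2\pi.
\end{aligned}$$

Note: As detailed in §8.2, the prior covariance $[\boldsymbol{\Sigma}_G]_{ij} = \rho(\mathbf{x}_i, \mathbf{x}_j) \sigma_G^2$ depends on the hyperparameters $\theta = \{\sigma_G, \ell\}$, which also need to be inferred from data.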


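The maximum $\mathbf{g}^*$ corresponds to the location where the derivative equals zero so that

$$\frac{\partial \ln f(\mathbf{g}|\mathcal{D})}{\partial \mathbf{g}} = \nabla_{\mathbf{g}} \ln f(\mathbf{g}|\mathcal{D}) = \nabla_{\mathbf{g}} \ln f(\mathcal{D}_y|\mathbf{g}) - \boldsymbol{\Sigma}_G^{-1} \mathbf{g} = 0. \tag{9.6}$$

By isolating $\mathbf{g}$ in equation 9.6, we can obtain optimal values $\mathbf{g}^*$ iteratively using

$$\mathbf{g} \leftarrow \boldsymbol{\Sigma}_G \nabla_{\mathbf{g}} \ln f(\mathcal{D}_y|\mathbf{g}), \tag{9.7}$$

where the initial starting location can be taken as $\mathbf{g} = \mathbf{0}$. The uncertainty associated with the optimal set of inferred latent variables $\boldsymbol{\mu}_{G|\mathcal{D}} = \mathbf{g}^*$ is estimated using the Laplace approximation (see §6.7.2). The Laplace approximation for the covariance matrix corresponds to the inverse of the Hessian of the negative log-posterior, $-\ln f(\mathbf{g}|\mathcal{D})$, evaluated at $\boldsymbol{\mu}_{G|\mathcal{D}}$, so that

$$\boldsymbol{\Sigma}_{G|\mathcal{D}} = -\mathcal{H}[\ln f(\mathbf{g}|\mathcal{D})]^{-1} = \left( \boldsymbol{\Sigma}_G^{-1} - \operatorname{diag}\left( \nabla_{\mathbf{g}} \nabla_{\mathbf{g}} \ln f(\mathcal{D}_y|\boldsymbol{\mu}_{G|\mathcal{D}}) \right) \right)^{-1}.$$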

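[Figure 9.13: Example of application of Gaussian process classification using the package GPML with a different number of observations $\mathtt{D} \in \{25, 50, 100\}$. (a) $\mathtt{D} = 25$; (b) $\mathtt{D} = 50$; (c) $\mathtt{D} = 100$. Each panel shows $g(x)$, $\Phi(g(x))$, the true $\Pr(Y = +1|x)$, the inferred $\Pr(Y = +1|x, \mathcal{D})$, and the intervals $\mu_* \pm 2\sigma_*$ and $\mu_{G|\mathcal{D}} \pm 2\sigma_{G|\mathcal{D}}$.]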

Note that we can improve the efficiency of the maximization in equation 9.7 by using the information contained in the Hessian as we did with the Newton-Raphson method in §5.2. In that case, optimal values $\mathbf{g}^*$ are obtained iteratively using

$$\mathbf{g} \leftarrow \mathbf{g} - \mathcal{H}[\ln f(\mathbf{g}|\mathcal{D})]^{-1} \cdot \nabla_{\mathbf{g}} \ln f(\mathbf{g}|\mathcal{D}). \tag{9.8}$$
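The Newton-Raphson update and the Laplace approximation can be illustrated in the simplest possible setting: a single latent variable $g$ with prior $\mathcal{N}(g; 0, \sigma_G^2)$ and a single observed class $y$. This scalar reduction, the function names, and the prior variance value are assumptions made to keep the sketch short; the derivatives of $\ln \Phi(y \cdot g)$ are the standard ones, $y\,\phi(g)/\Phi(y g)$ and $-(\phi(g)/\Phi(y g))^2 - y g\,\phi(g)/\Phi(y g)$, where $\phi$ is the standard normal PDF.

```python
import math

def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def grad_hess(g, y, prior_var):
    """First and second derivatives of ln Phi(y*g) - g^2 / (2*prior_var)."""
    ratio = std_normal_pdf(g) / std_normal_cdf(y * g)
    grad = y * ratio - g / prior_var
    hess = -ratio * ratio - y * g * ratio - 1.0 / prior_var
    return grad, hess

def laplace_1d(y, prior_var, iterations=20):
    """Newton-Raphson search for g*, then Laplace variance = -1 / Hessian at g*."""
    g = 0.0  # starting location, as suggested in the text
    for _ in range(iterations):
        grad, hess = grad_hess(g, y, prior_var)
        g = g - grad / hess
    _, hess = grad_hess(g, y, prior_var)
    return g, -1.0 / hess

g_star, var_star = laplace_1d(y=+1, prior_var=4.0)  # hypothetical prior variance
print(g_star, var_star)
```

With $\mathtt{D}$ observations, the same steps apply with vectors and matrices, and the scalar division by the Hessian becomes the solution of a linear system.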

Figure 9.13 presents an application of GPC using the Gaussian process for machine learning (GPML) package with a different number of observations, $\mathtt{D} \in \{25, 50, 100\}$. In this figure, the green solid lines on the horizontal planes describe the inferred confidence intervals for the latent variables, $\mu_{G|\mathcal{D}} \pm 2\sigma_{G|\mathcal{D}}$. We can see that as the number of observations increases, the inferred function $\Pr(Y = +1|x_*, \mathcal{D})$ (orange solid line) tends toward the true function (orange dashed line).

Figure 9.14 presents the application of Gaussian process classification to the post-earthquake structural safety assessment example introduced in §9.1.2. Here, the multiclass problem is transformed into a binary one by modeling the probability that the damage index is either 0 or 1. The main advantage of GPC is that the problem setup is trivial when using existing, pre-implemented packages. With the generative approaches presented in §9.1, the formulation has to be specifically tailored for the problem. Nevertheless, note that a generative approach allows including censored data, whereas a discriminative approach such as the GPC presented here cannot.


In practice, all the operations required to infer the latent variables $G(\mathbf{x})$ as well as the parameters $\theta = \{\sigma_G, \ell\}$ in GPC are already implemented in the same open-source packages dealing with GPR, for example, GPML, GPstuff, and pyGPs (see §8.2).

[Figure 9.14: Example of application of Gaussian process classification to the post-earthquake structural safety evaluation data set. The figure shows $\Pr(D = 0|x, \mathcal{D})$ for $x$ from 0 to 1, along with the observations $y = 1$ and $y = 0$.]

Strengths and limitations. With the availability of Gaussian process classification implemented in a variety of open-source packages, the setup of a GPC problem is trivial, even for problems involving several covariates $\mathbf{x}$. When the number of observations is small, the task of inferring latent variables $\mathbf{g}$ becomes increasingly difficult and the performance decreases. Note also that the formulation presented assumes that we only have error-free direct observations. Moreover, because of the computationally demanding procedure of inferring latent variables $\mathbf{g}$, the performance is limited in the case of large data sets. Given the limitations for both small and large data sets, the GPC presented here is thus best suited for medium-size data sets.

For more details about Gaussian process classification and its extension to multiple classes, the reader should refer to dedicated publications such as the book by Rasmussen and Williams [Rasmussen, C. E. and C. K. Williams (2006). Gaussian processes for machine learning. MIT Press] or the tutorial by Ebden [Ebden, M. (2008, August). Gaussian processes: A quick introduction. arXiv preprint (1505.02965)].

9.4 Neural Networks

The formulation for the feedforward neural network presented in §8.3 can be adapted for binary classification problems by replacing the linear output activation function by a sigmoid function as illustrated in figure 9.15a. For an observation $y \in \{-1, +1\}$, the log of the probability for the outcome $y_i = +1$ is modeled as being proportional to the hidden variable on the output layer $z^{(\mathtt{O})}$, and the log of the probability for the outcome $y_i = -1$ is assumed to be proportional to 0,

$$\begin{aligned}
\ln \Pr(Y = +1|\mathbf{x}, \theta) &\propto z^{(\mathtt{O})} = \mathbf{W}^{(\mathtt{L})} \mathbf{a}^{(\mathtt{L})} + b^{(\mathtt{L})} \\
\ln \Pr(Y = -1|\mathbf{x}, \theta) &\propto 0.
\end{aligned}$$

These unnormalized log-probabilities are transformed into a probability for each possible outcome by taking the exponential of each of them and then normalizing by following

$$\begin{aligned}
\Pr(Y = +1|\mathbf{x}, \theta) &= \frac{\exp(z^{(\mathtt{O})})}{\exp(z^{(\mathtt{O})}) + \exp(0)} = \sigma(z^{(\mathtt{O})}) \\
\Pr(Y = -1|\mathbf{x}, \theta) &= \frac{\exp(0)}{\exp(z^{(\mathtt{O})}) + \exp(0)} = \sigma(-z^{(\mathtt{O})}).
\end{aligned}$$


As presented in §8.3.1, this procedure for normalizing the log-probabilities is equivalent to evaluating the hidden variable $z^{(\mathtt{O})}$ (or its negative) in the logistic sigmoid function. Thus, the likelihood of an observation $y \in \{-1, +1\}$, given its associated covariates $\mathbf{x}$ and a set of parameters $\theta = \{\theta_w, \theta_b\}$, is given by

$$p(y|\mathbf{x}, \theta) = \sigma(y \cdot z^{(\mathtt{O})}).$$

[Figure 9.15: Nomenclature for a feedforward neural network employed for classification: (a) represents the case for binary classification $y \in \{-1, +1\}$; (b) the case for $\mathtt{C}$ classes $y \in \{1, 2, \cdots, \mathtt{C}\}$.]

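With neural networks, it is common to minimize a loss function $J(\mathcal{D}; \theta_w, \theta_b)$ that is defined as the negative joint log-likelihood for a set of $\mathtt{D}$ observations,

$$\begin{aligned}
J(\mathcal{D}; \theta_w, \theta_b) &= -\ln p(\mathcal{D}_y|\mathcal{D}_x, \theta) \\
&= -\sum_{i=1}^{\mathtt{D}} \ln \sigma(y_i \cdot z_i^{(\mathtt{O})}).
\end{aligned}$$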
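The binary loss above can be sketched without any neural network library by working directly on output-layer values. In the example below, the values in `z_out` are hypothetical stand-ins for $z_i^{(\mathtt{O})}$, and `log_sigmoid` is written in a form that avoids overflow; neither name comes from the text.

```python
import math

def log_sigmoid(z):
    """ln sigma(z), computed without overflow for large |z|."""
    if z >= 0.0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def binary_loss(y, z_out):
    """J = -sum_i ln sigma(y_i * z_i) for labels y_i in {-1, +1}."""
    return -sum(log_sigmoid(yi * zi) for yi, zi in zip(y, z_out))

y = [+1, -1, +1]
z_out = [2.0, -1.0, 0.0]  # hypothetical output-layer values z^(O)
loss = binary_loss(y, z_out)

# Normalizing exp(z) against exp(0) gives the same probability as the sigmoid
p_norm = math.exp(2.0) / (math.exp(2.0) + math.exp(0.0))
p_sigmoid = math.exp(log_sigmoid(2.0))
print(loss, p_norm, p_sigmoid)
```

Each term is small when the sign of $z_i^{(\mathtt{O})}$ agrees with $y_i$ and grows roughly linearly when it disagrees, which is what drives the parameters during training.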

In the case where the observed system responses can belong to multiple classes, $y \in \{1, 2, \cdots, \mathtt{C}\}$, the output layer needs to be modified to include as many hidden states as there are classes, $\mathbf{z}^{(\mathtt{O})} = [z_1^{(\mathtt{O})} \; z_2^{(\mathtt{O})} \; \cdots \; z_{\mathtt{C}}^{(\mathtt{O})}]^\intercal$, as presented in figure 9.15b. The log-probability of an observation $y = k$ is assumed to be proportional to the $k$th hidden state $z_k^{(\mathtt{O})}$,

$$\ln \Pr(Y = k|\mathbf{x}, \theta) \propto z_k^{(\mathtt{O})}.$$

The normalization of the log-probabilities is done using the softmax activation function, where the probability of an observation $y = k$ is given by

$$p(y = k|\mathbf{x}, \theta) = \operatorname{softmax}(\mathbf{z}^{(\mathtt{O})}, k) = \frac{\exp(z_k^{(\mathtt{O})})}{\sum_{j=1}^{\mathtt{C}} \exp(z_j^{(\mathtt{O})})}.$$

By assuming that observations are conditionally independent from each other given $\mathbf{z}^{(\mathtt{O})}$, the loss function for the entire data set is obtained as in §8.3.2 by summing the log-probabilities (i.e., the marginal log-likelihoods) for each observation,

$$J(\mathcal{D}; \theta_w, \theta_b) = -\sum_{i=1}^{\mathtt{D}} \ln p(y_i|\mathbf{x}_i, \theta). \tag{9.9}$$
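A minimal sketch of the softmax normalization and of the loss in equation 9.9 is given below. The output-layer values are hypothetical, classes are numbered from 1 to $\mathtt{C}$ as in the text, and subtracting the largest value before exponentiating is an implementation choice that leaves the result unchanged while avoiding overflow.

```python
import math

def softmax(z):
    """Normalize unnormalized log-probabilities into class probabilities."""
    top = max(z)  # shifting by the maximum does not change the ratios
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def multiclass_loss(labels, z_rows):
    """J = -sum_i ln softmax(z_i, y_i), with classes numbered 1..C."""
    return -sum(math.log(softmax(z)[k - 1]) for k, z in zip(labels, z_rows))

probs = softmax([1.0, 2.0, 3.0])  # hypothetical output-layer values
loss = multiclass_loss([3, 1], [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
print(probs, loss)
```

With two classes and the second hidden state fixed at zero, this softmax reduces to the logistic sigmoid used for binary classification.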

Minimizing the loss function $J(\mathcal{D}; \theta_w, \theta_b)$ defined in equation 9.9 corresponds to minimizing the cross-entropy. Cross-entropy is a concept from the field of information theory [MacKay, D. J. C. (2003). Information theory, inference, and learning algorithms. Cambridge University Press] that measures the similarity between two probability distributions or mass functions. Note that in the context of classification, the gradient $\nabla \sigma(z^{(\mathtt{O})})$ in the backpropagation algorithm (see §8.3.2) is no longer equal to one because of the sigmoid function. Moreover, as for the regression setup, a neural network classifier is best suited for problems with a large number of covariates and for which large data sets are available.

Note: Given two PMFs, $p(x)$ and $q(x)$, the cross-entropy is defined as $H(p, q) = -\sum_x p(x) \ln q(x)$.

9.5 Regression versus Classiﬁcation

Note that when the data and context allow for it, it can be preferable to formulate a problem using a regression approach rather than using classification. Take, for example, the soil contamination example presented in §8.2.5. The data set available describes the contaminant concentration $c_i$ measured at coordinates $\mathbf{l}_i$. Using regression to model this problem results in a function describing the PDF of contaminant concentration as a function of coordinates $\mathbf{l}$, $f(c|\mathbf{l}, \mathcal{D})$. A classification setup would instead model the probability that the soil contamination exceeds an admissible value $c_{\text{adm.}}$, $\Pr(C \ge c_{\text{adm.}}|\mathbf{l}, \mathcal{D})$. Here, the drawback of classification is that information is lost when transforming a continuous system response $C$ into a categorical event $\{C \ge c_{\text{adm.}}\}$. For the classification setup, the information available to build the model is whether the contaminant concentration is above ($y_i = +1$) or below ($y_i = -1$) the admissible concentration $c_{\text{adm.}}$ for multiple locations $\mathbf{l}_i$. The issue with a classification approach is that because we work with categories, the information about how far or how close we are from the threshold $c_{\text{adm.}}$ is lost.