9 Classification

Figure 9.1: Examples of application of classification analysis in the context of civil engineering: (a) the probability of pathogens in water, Pr(pathogens), as a function of the chlorine concentration; (b) the probability of structural damage, Pr(damage), as a function of the peak ground acceleration (PGA).
Classification is similar to regression, where the task is to predict a system response y, given some covariates x describing the system properties. With regression, the system responses were continuous quantities; with classification, the system responses are now discrete classes. Figure 9.1 presents two examples of classification applied to the civil engineering context; in (a), the sigmoid-like curve describes the probability of having pathogens in water as a function of the chlorine concentration. In this problem, observations are either −1 or +1, respectively corresponding to one of the two classes {pathogens, ¬pathogens}. The second example, (b), presents the probability of having structural damage in a building given the peak ground acceleration (PGA) associated with an earthquake. Observations are either −1 or +1, respectively corresponding to one of the two classes {damage, ¬damage}.
For classification, the data consists in D pairs of covariates x_i ∈ ℝ^X and system responses y_i ∈ {−1, +1}, so D = {(x_i, y_i), ∀i ∈ {1 : D}}. For problems involving multiple classes, the system response is y_i ∈ {1, 2, ···, C}.

Data: D = {(x_i, y_i), ∀i ∈ {1 : D}}, with x_i ∈ ℝ^X a covariate (also called an attribute or regressor); y_i ∈ {−1, +1} an observation for binary classification; and y_i ∈ {1, 2, ···, C} an observation for multiclass classification.

The typical goal of a classification method is to obtain a mathematical model for the probability of a system response given a covariate, for example, Pr(Y = +1|x).
This problem can be modeled with either a generative or a discriminative approach. A generative classifier models the probability density function (PDF) of the covariates x for each class y, and then uses these PDFs to compute the probability of classes given the covariate values. A discriminative classifier directly builds a model of Pr(Y = +1|x) without modeling the PDF of the covariates x for each class y. In practice, discriminative approaches are most common because they typically perform better on medium and large data sets,¹ and are often easier to use. This chapter presents the generic formulation for probabilistic generative classifiers and three discriminative classifiers: logistic regression, Gaussian process classification, and neural networks. We will see that generative classifiers are best suited for small data sets, Gaussian process classification for medium-size ones, and neural networks for large ones. In order to simplify the explanations, this chapter presents each method while focusing on the context of binary classification, that is, y ∈ {−1, +1}.

¹ Ng, A. and M. Jordan (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems, Volume 14, pp. 841–848. MIT Press.
9.1 Generative Classifiers
The basic idea of a generative classifier is to employ a set of observations D to model the conditional PDF of covariates x ∈ ℝ^X given each class y ∈ {−1, +1}. Then these PDFs are employed in combination with the prior probability of each class in order to obtain the posterior probability of a class given a covariate x. Generative classifiers are most appropriate when working with problems involving a small number of covariates and observations because they allow us to explicitly include domain-specific knowledge regarding f(x|y) or tailored features such as censored data.
9.1.1 Formulation
The posterior probability of a class y given covariates x can be expressed using Bayes rule,

\[
\overbrace{p(y|\mathbf{x})}^{\text{posterior}} = \frac{\overbrace{f(\mathbf{x}|y)}^{\text{likelihood}} \cdot \overbrace{p(y)}^{\text{prior}}}{\underbrace{f(\mathbf{x})}_{\text{evidence}}}.
\]
The posterior probability of a class y conditional on a covariate value x is defined according to

\[
p(y|\mathbf{x}) = \begin{cases} \Pr(Y = -1|\mathbf{x}) & \text{if } y = -1\\ \Pr(Y = +1|\mathbf{x}) = 1 - \Pr(Y = -1|\mathbf{x}) & \text{if } y = +1. \end{cases}
\]
The prior probability of a class is described by the probabilities

\[
p(y) = \begin{cases} \Pr(Y = -1) & \text{if } y = -1\\ \Pr(Y = +1) & \text{if } y = +1. \end{cases}
\]

Note: The prior knowledge can either be based on expert knowledge, i.e., p(y), or be empirically estimated using observations, i.e., p(y|D) = p(D|y) · p(y)/p(D).
The prior and posterior are described by probability mass functions because y ∈ {−1, +1} is a discrete variable. The likelihood of covariates x conditional on a class y is a multivariate conditional probability density function that follows

\[
f(\mathbf{x}|y) = \begin{cases} f(\mathbf{x}|y = -1) & \text{if } y = -1\\ f(\mathbf{x}|y = +1) & \text{if } y = +1. \end{cases}
\]
The normalization constant (i.e., the evidence) is obtained by summing the product of the likelihood and prior over all classes,

\[
f(\mathbf{x}) = \sum_{y \,\in\, \{-1,+1\}} f(\mathbf{x}|y) \cdot p(y).
\]
A key step with a generative classifier is to pick a distribution type for each f(x|y_j; θ_j), ∀j, and then estimate the parameters θ_j for these PDFs using the data set D, that is, f(θ_j|D). Note that the notation f(x|y; θ) ≡ f(x|y, θ) indicates that the conditional probability f(x|y) is parameterized by θ. For a class y = j, the parameters of f(x|y_j) ≡ f(x|y_j; θ_j) are estimated using only the observations D_j : {(x, y)_i, ∀i : y_i = j}, that is, the covariates x_i such that the associated system response is y_i = j. The estimation of the parameters θ can be performed using either a Bayesian approach or a deterministic one, as detailed in §6.3. Like for the likelihood, the prior probability of each class p(y|D) can be estimated using either a Bayesian approach or a deterministic one based on the relative frequency of observations where y_i = j. With a Bayesian approach, the posterior predictive probability mass function (PMF) is obtained by marginalizing the uncertainty about the posterior parameter estimate,
\[
\overbrace{p(y|\mathbf{x},\mathcal{D})}^{\text{posterior predictive}} = \int \frac{\overbrace{f(\mathbf{x}|y;\boldsymbol{\theta})}^{\text{likelihood}} \cdot \overbrace{p(y;\boldsymbol{\theta})}^{\text{prior}}}{\underbrace{f(\mathbf{x};\boldsymbol{\theta})}_{\text{normalization constant}}} \cdot \overbrace{f(\boldsymbol{\theta}|\mathcal{D})}^{\text{posterior}}\, d\boldsymbol{\theta}.
\]
When either a maximum a posteriori (MAP) or a maximum likelihood estimate (MLE) is employed instead of a Bayesian estimation approach, the posterior predictive is replaced by the approximation

\[
\hat{p}(y|\mathbf{x},\mathcal{D}) = \frac{f(\mathbf{x}|y;\boldsymbol{\theta}^{*}) \cdot p(y;\boldsymbol{\theta}^{*})}{f(\mathbf{x};\boldsymbol{\theta}^{*})},
\]

where θ* are the MLE or MAP estimates for the parameters.
Note (maximum likelihood estimation): The MLE is θ* = arg max_θ f(D; θ). For a Normal distribution with θ = {µ_x, σ_x} and D = {x_i ∈ ℝ, ∀i ∈ {1 : D}},

\[
f(x;\boldsymbol{\theta}) = \mathcal{N}(x;\mu_x,\sigma_x^2), \qquad f(\mathcal{D};\boldsymbol{\theta}) = \prod_{i=1}^{\mathtt{D}} \mathcal{N}(x_i;\mu_x,\sigma_x^2),
\]
\[
\ln f(\mathcal{D};\boldsymbol{\theta}) \propto \sum_{i=1}^{\mathtt{D}} \left[ -\frac{(x_i-\mu_x)^2}{2\sigma_x^2} + \ln\frac{1}{\sigma_x} \right],
\]
\[
\frac{\partial \ln f(\mathcal{D};\boldsymbol{\theta})}{\partial \mu_x} = 0 \;\rightarrow\; \mu_x^{*} = \frac{1}{\mathtt{D}}\sum_{i=1}^{\mathtt{D}} x_i, \qquad
\frac{\partial \ln f(\mathcal{D};\boldsymbol{\theta})}{\partial \sigma_x^2} = 0 \;\rightarrow\; \sigma_x^{2*} = \frac{1}{\mathtt{D}}\sum_{i=1}^{\mathtt{D}} (x_i-\mu_x^{*})^2.
\]

For a class probability with θ = p_j and D = {y_i ∈ {1 : C}, ∀i ∈ {1 : D}}, where D_j = #{i : y_i = j} ≤ D,

\[
f(y;\boldsymbol{\theta}) = \begin{cases} p_j & \text{for } y = j\\ 1-p_j & \text{for } y \neq j,\end{cases} \qquad
f(\mathcal{D};\boldsymbol{\theta}) \propto p_j^{\mathtt{D}_j}(1-p_j)^{(\mathtt{D}-\mathtt{D}_j)},
\]
\[
\frac{\partial \ln f(\mathcal{D};\boldsymbol{\theta})}{\partial p_j} = \frac{\mathtt{D}_j}{p_j} - \frac{\mathtt{D}-\mathtt{D}_j}{1-p_j} = 0 \;\rightarrow\; p_j^{*} = \frac{\mathtt{D}_j}{\mathtt{D}}.
\]
Maximum likelihood estimate
In the special case where f(x|y = j; θ) = N(x; µ_{x|y_j}, σ²_{x|y_j}) and the number of available data D is large, one can employ the MLE approximation for the parameters µ_{x|y_j} and σ²_{x|y_j}, which is defined by

\[
\mu_{x|y_j} = \frac{1}{\mathtt{D}_j}\sum_i x_i, \qquad \sigma^2_{x|y_j} = \frac{1}{\mathtt{D}_j}\sum_i \big(x_i - \mu_{x|y_j}\big)^2, \qquad \forall\{i : y_i = j\},
\]

where D_j = #{i : y_i = j} denotes the number of observations in the data set D that belong to the j-th class. The MLE approximation of p(y = j; θ*) is given by

\[
p(y = j;\boldsymbol{\theta}^{*}) = \frac{\mathtt{D}_j}{\mathtt{D}}.
\]
Figure 9.2 presents an example of application of a generative classifier where the parameters for the normal PDFs are estimated using the MLE approximation. The bottom graph overlays the empirical histograms for the data associated with each class along with the PDFs f(x|y = j, θ*). The top graph presents the classifier p(y = +1|x, θ*), which is the probability of the class y = +1 given the covariate value x. If we assume that the prior probability of each class is equal, this conditional probability is computed as

\[
p(y = +1|x) = \frac{f(x|y=+1)\cdot p(y=+1)}{f(x|y=-1)\cdot p(y=-1) + f(x|y=+1)\cdot p(y=+1)} = \frac{f(x|y=+1)}{f(x|y=-1) + f(x|y=+1)}, \quad \text{for } p(y=-1)=p(y=+1).
\]
Figure 9.2: Example of generative classifier using marginal PDFs; the bottom panel shows f(x|y = −1) and f(x|y = +1), and the top panel shows p(y = +1|x).
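As a concrete illustration of this computation, the sketch below fits a univariate normal PDF to the covariates of each class by MLE and evaluates p(y = +1|x) with equal priors. It is a minimal sketch assuming NumPy and SciPy are available; the data values and variable names are hypothetical and are not taken from the book's example.

import numpy as np
from scipy.stats import norm

# Hypothetical 1-D training data for the two classes y in {-1, +1}
x_neg = np.array([0.8, 1.5, 2.2, 1.1, 2.9])   # covariates with y_i = -1
x_pos = np.array([5.6, 6.8, 7.4, 6.1, 8.0])   # covariates with y_i = +1

# MLE plug-in estimates: sample mean and standard deviation per class
mu_neg, sig_neg = x_neg.mean(), x_neg.std()
mu_pos, sig_pos = x_pos.mean(), x_pos.std()

def p_pos_given_x(x):
    """p(y=+1|x) for equal class priors, using normal class-conditional PDFs."""
    f_neg = norm.pdf(x, mu_neg, sig_neg)   # f(x|y=-1; theta*)
    f_pos = norm.pdf(x, mu_pos, sig_pos)   # f(x|y=+1; theta*)
    return f_pos / (f_neg + f_pos)

print(p_pos_given_x(4.0))  # probability of class +1 at x = 4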
Method         f(x|y = j)
Naive Bayes    X_i ⊥⊥ X_k | y, ∀i ≠ k; e.g., B(x_1; α, β) · N(x_2; µ, σ²)
LDA            N(x; µ_{x,j}, Σ_x)
QDA            N(x; µ_{x,j}, Σ_{x,j})

Figure 9.3: Examples of PDF f(x|y = j) for (a) naive Bayes, (b) LDA with Σ_{x,−1} = Σ_{x,+1}, and (c) QDA with Σ_{x,−1} ≠ Σ_{x,+1}.
Multiple covariates
When there is more than one covariate x = [x_1 x_2 ··· x_X]ᵀ describing a system, we must define the joint PDF f(x|y). A common approach to define this joint PDF is called naive Bayes (NB). It assumes that the covariates X_k are independent of each other, so their joint PDF is obtained from the product of the marginals,

\[
f(\mathbf{x}|y = j) = \prod_{k=1}^{\mathtt{X}} f(x_k|y = j).
\]
The method employing the special case where the joint PDF of covariates is described by a multivariate normal f(x|y) = N(x; µ, Σ) is called quadratic discriminant analysis (QDA). The label quadratic refers to the shape of the boundary in the covariate domain where the probabilities of both classes are equal, Pr(y = +1|x) = Pr(y = −1|x). In the case where the covariance matrix is the same for each class, the boundary becomes linear, and this special case is named linear discriminant analysis (LDA). The naive Bayes, QDA, and LDA methods typically employ an MLE approach to estimate the parameters of f(x|y = j). Figure 9.3 presents examples of f(x|y = j) for the three approaches: NB, LDA, and QDA. Figure 9.4 presents an example of an application of quadratic discriminant analysis where the joint probabilities f(x|y) are multivariate Normal PDFs, as depicted by the contour plots. The mean vectors and covariance matrices describing these PDFs are estimated using an MLE approximation. Because there are two covariates, the classifier p(y = +1|x) is now a 2-D surface obtained following

\[
p(y = +1|\mathbf{x}) = \frac{f(\mathbf{x}|y=+1)\cdot p(y=+1)}{f(\mathbf{x}|y=-1)\cdot p(y=-1) + f(\mathbf{x}|y=+1)\cdot p(y=+1)}.
\]

Note that for many problems, generative methods such as naive Bayes and quadratic discriminant analysis are outperformed by discriminative approaches such as those presented in §9.2–§9.4.
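All three generative models discussed above are available pre-implemented; the sketch below contrasts them on hypothetical 2-D data, assuming the scikit-learn library is installed (a library choice of this sketch, not of the book). The synthetic data and prediction location are illustrative only.

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)
# Hypothetical 2-D covariates drawn from one Gaussian per class
X_neg = rng.multivariate_normal([1.0, 1.0], [[1.0, 0.3], [0.3, 1.0]], size=50)
X_pos = rng.multivariate_normal([5.0, 6.0], [[2.0, -0.5], [-0.5, 1.5]], size=50)
X = np.vstack([X_neg, X_pos])
y = np.hstack([-np.ones(50), np.ones(50)])

x_star = np.array([[3.0, 3.5]])  # prediction location
for model in (GaussianNB(), LinearDiscriminantAnalysis(),
              QuadraticDiscriminantAnalysis()):
    model.fit(X, y)
    # predict_proba returns [Pr(y=-1|x), Pr(y=+1|x)] (classes sorted ascending)
    print(type(model).__name__, model.predict_proba(x_star)[0, 1])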
Figure 9.4: Example of quadratic discriminant analysis where the joint probabilities f(x|y), depicted by contour plots, are described by multivariate Normal PDFs. The parameters of each PDF are estimated using an MLE approach.
9.1.2 Example: Post-Earthquake Structural Safety Assessment

We present here an example of a data-driven post-earthquake structural safety assessment. The goal is to assess right after an earthquake whether or not buildings are safe for occupation, that is, y ∈ {safe, ¬safe}. We want to guide this assessment based on empirical data where the observed covariate x ∈ (0, 1) describes the ratio of a building's first natural frequency f [Hz] after and before the earthquake,

\[
x = \frac{f_{\text{post-earthquake}}}{f_{\text{pre-earthquake}}}.
\]
Figure 9.5 presents a set of 45 empirical observations² collected in different countries and describing the frequency ratio x as a function of the damage index d ∈ {0, 1, 2, 3, 4, 5}, which ranges from undamaged (d = 0) up to collapse (d = 5). Note that observations marked by an arrow are censored data, where x is an upper-bound censored observation (see §6.3) for the frequency ratio.

² Goulet, J.-A., C. Michel, and A. Der Kiureghian (2015). Data-driven post-earthquake rapid structural safety assessment. Earthquake Engineering & Structural Dynamics 44 (4), 549–562.

Figure 9.5: Empirical observations of the ratio between the fundamental frequency of buildings after and before an earthquake, plotted against the damage index d; arrows mark upper-bound (censored) observations. Data collected in Algeria, Japan, Martinique, Spain, Italy, and the USA.

The first step is to employ this data set D = {(x_i ∈ (0, 1), d_i ∈ {0 : 5}), ∀i ∈ {1 : 45}} in order to learn the conditional PDF of the frequency ratio X, given a damage index d. Because x ∈ (0, 1), we choose to describe its conditional PDF using a Beta distribution,

\[
f(x;\boldsymbol{\theta}(d)) = \mathcal{B}\big(x;\, \mu(d),\, \sigma(d)\big).
\]

Note that the Beta distribution is parameterized here by its mean and standard deviation and not by α and β as described in §4.3. The justification for this choice of parameterization will become clear when we present the constraints linking the prior knowledge for different damage indexes d. For a given d, the posterior PDF for the parameters θ(d) = {µ(d), σ(d)} is obtained using Bayes rule,

\[
\overbrace{f(\boldsymbol{\theta}(d)|\mathcal{D})}^{\text{posterior}} = \frac{\overbrace{f(\mathcal{D}|\boldsymbol{\theta}(d))}^{\text{likelihood}} \cdot \overbrace{f(\boldsymbol{\theta}(d))}^{\text{prior}}}{\underbrace{f(\mathcal{D})}_{\text{constant}}},
\]

and the posterior predictive PDF of X is obtained by marginalizing the uncertainty associated with the estimation of θ(d),

\[
f(x|d,\mathcal{D}) = \int f(x;\boldsymbol{\theta}(d)) \cdot f(\boldsymbol{\theta}(d)|\mathcal{D})\, d\boldsymbol{\theta}.
\]
By assuming that the observations are conditionally independent given the parameters θ(d), the joint likelihood of observations is

\[
f(\mathcal{D}|\boldsymbol{\theta}(d)) = \prod_{\{i:\, d_i = d\}} \mathcal{L}\big(x_i|\mu(d),\sigma(d)\big),
\]

where the marginal likelihood for each observation (see §6.3) is

\[
\mathcal{L}\big(x_i|\mu(d),\sigma(d)\big) = \begin{cases} f\big(x_i|\mu(d),\sigma(d)\big), & x_i:\text{ direct observation}\\ F\big(x_i|\mu(d),\sigma(d)\big), & x_i:\text{ censored observation,}\end{cases}
\]

and where f(·) denotes the PDF and F(·) denotes the cumulative distribution function (CDF). The prior is assumed to be uniform in the range (0, 1) for µ(d) and uniform in the range (0, 0.25) for σ(d). For the mean parameters µ(d), the prior is also constrained following µ(0) ≥ µ(1) ≥ ··· ≥ µ(5). These constraints reflect our domain-specific knowledge, where we expect the mean frequency ratio to decrease with increasing values of d.
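To make the censored-data likelihood concrete, the sketch below evaluates the joint log-likelihood for one damage index d and a given (µ(d), σ(d)), using a Beta distribution re-parameterized by its mean and standard deviation. It is a minimal sketch assuming SciPy is available; the small data arrays and function names are hypothetical.

import numpy as np
from scipy.stats import beta

def beta_params(mu, sigma):
    """Convert the (mean, std) parameterization into the standard (alpha, beta)."""
    nu = mu * (1 - mu) / sigma**2 - 1          # common factor alpha + beta
    return mu * nu, (1 - mu) * nu

def log_likelihood(mu, sigma, x, censored):
    """Sum of marginal log-likelihoods: the PDF for direct observations,
    the CDF for upper-bound (censored) observations."""
    a, b = beta_params(mu, sigma)
    ll = np.where(censored,
                  beta.logcdf(x, a, b),   # F(x_i | mu(d), sigma(d))
                  beta.logpdf(x, a, b))   # f(x_i | mu(d), sigma(d))
    return ll.sum()

# Hypothetical frequency-ratio observations for one damage index d
x = np.array([0.85, 0.92, 0.78, 0.95])
censored = np.array([False, False, True, False])  # third value is an upper bound
print(log_likelihood(mu=0.9, sigma=0.05, x=x, censored=censored))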
The Metropolis algorithm presented in §7.1 is employed to generate S = 35,000 samples (R̂ ≤ 1.005 for C = 3 chains) from the posterior for the parameters θ(d). Figure 9.6a presents the contours of the posterior PDF for each pair of parameters {µ(d), σ(d)}, along with a subset of samples. The predictive PDFs f(x|d, D) of a frequency ratio x conditional on d are presented in figure 9.6b. The predictive probability of each damage index d given a frequency ratio x is obtained following

\[
p(d|x,\mathcal{D}) = \int \frac{f(x;\boldsymbol{\theta}(d)) \cdot p(d)}{\sum_{d'=0}^{5} f(x;\boldsymbol{\theta}(d')) \cdot p(d')} \cdot f(\boldsymbol{\theta}(d)|\mathcal{D})\, d\boldsymbol{\theta}. \tag{9.1}
\]
The integral in equation 9.1 can be approximated using the Markov chain Monte Carlo (MCMC) samples, as described in §7.5. Figure 9.7 presents the posterior predictive probability p(d|x, D), which is computed while assuming that there is an equal prior probability p(d) for each damage index. This classifier can be employed to assist inspectors during post-earthquake safety assessment of structures.
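In practice, the Monte Carlo approximation of equation 9.1 amounts to averaging the normalized class likelihoods over the posterior samples; a minimal sketch is given below. It assumes hypothetical arrays mu_s[d, s] and sigma_s[d, s] of S posterior samples per damage index, for example produced by a Metropolis sampler.

import numpy as np
from scipy.stats import beta

def p_d_given_x(x, mu_s, sigma_s, prior=None):
    """Approximate p(d|x, D) of equation 9.1 by averaging over S posterior samples.
    mu_s, sigma_s: arrays of shape (6, S) with posterior samples of mu(d), sigma(d)."""
    n_d, n_s = mu_s.shape
    prior = np.full(n_d, 1.0 / n_d) if prior is None else prior
    nu = mu_s * (1 - mu_s) / sigma_s**2 - 1          # (mean, std) -> (alpha, beta)
    a, b = mu_s * nu, (1 - mu_s) * nu
    lik = beta.pdf(x, a, b) * prior[:, None]         # f(x; theta_s(d)) * p(d), shape (6, S)
    post_s = lik / lik.sum(axis=0, keepdims=True)    # normalize over d for each sample s
    return post_s.mean(axis=1)                       # average over the S samples

# Usage: p_d = p_d_given_x(0.8, mu_s, sigma_s); the entries of p_d sum to 1.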
Figure 9.6: Posterior PDFs of the means µ(d) and standard deviations σ(d), and posterior predictive PDFs of the frequency ratio x for each damage index d: (a) posterior PDFs for θ(d); (b) posterior predictive PDFs f(x|d, D).

Figure 9.7: Posterior predictive probability of a damage index d given a frequency ratio x.
9.2 Logistic Regression
Logistic regression is not a state-of-the-art approach, yet its prevalence, historical importance, and simple formulation make it a good choice to introduce discriminative classification methods. Logistic regression is the extension of linear regression (see §8.1) for classification problems. In the context of regression, a linear function ℝ^X → ℝ is defined so that it transforms an X-dimensional covariate domain into a single output g(x) = Xb ∈ ℝ. In the context of classification, system responses are discrete, for example, y ∈ {−1, +1}. The idea with classification is to transform the output of a linear function g(x) into the (0, 1) interval describing the probability Pr(y = +1|x). We can transform the output of a linear model Xb into the interval (0, 1) by passing it through a logistic sigmoid function σ(z),

\[
\sigma(z) = \frac{1}{1+\exp(-z)} = \frac{\exp(z)}{\exp(z)+1},
\]

as plotted in figure 9.8. The sigmoid is a transformation ℝ → (0, 1) such that an input defined in the real space is squashed in the interval (0, 1). In short, logistic regression maps the covariates x to the probability of a class,

\[
\underbrace{\mathbf{x} \in \mathbb{R}^{\mathtt{X}}}_{\text{covariates}} \;\rightarrow\; \underbrace{g(\mathbf{x}) = \mathbf{X}\mathbf{b} \in \mathbb{R}}_{\text{hidden/latent variable}} \;\rightarrow\; \underbrace{\sigma(g(\mathbf{x})) \in (0,1)}_{\Pr(Y\,=\,y\,|\,\mathbf{x})}.
\]

Note that g(x) is considered as a latent or hidden variable because it is not directly observed; only the class y and its associated covariate x are.
Figure 9.8: The logistic sigmoid function σ(z) = (1 + exp(−z))⁻¹.
Figure 9.9 presents the example of a function g(x) = 0.08x − 4 (i.e., the red line on the horizontal plane), which is passed through a logistic sigmoid function. The blue curve on the leftmost vertical plane is the logistic sigmoid function, and the orange curve on the right should be interpreted as the probability of the class y = +1, given a covariate x, Pr(y = +1|x). For four different covariates x_i, simulated observations y_i ∈ {−1, +1} are depicted by crosses.

Figure 9.9: Example of a function g(x) = 0.08x − 4 passed through the logistic sigmoid function.
In practice, we have to follow the reverse path, where observations D = {(x_i, y_i), ∀i ∈ {1 : D}} are available and we need to estimate the parameters b and the basis functions φ(x) defining the model matrix X. For a given data set and a choice of basis functions, we separate the observations x_i in order to build two model matrices: one for x_i : y_i = −1, X^(−1), and another for x_i : y_i = +1, X^(+1). The likelihood of an observation y = +1 is given by Pr(y = +1|x) = σ(Xb), and for an observation y = −1 the likelihood is Pr(y = −1|x) = 1 − σ(Xb). With the hypothesis that observations are conditionally independent, the joint likelihood is obtained from the product of the marginal likelihoods or, equivalently, the sum of the marginal log-likelihoods of observations given parameters b,

\[
\ln p(\mathcal{D}|\mathbf{b}) = \sum \ln\!\big(\sigma(\mathbf{X}^{(+1)}\mathbf{b})\big) + \sum \ln\!\big(1 - \sigma(\mathbf{X}^{(-1)}\mathbf{b})\big).
\]
Optimal parameters b* can be inferred using an MLE procedure that maximizes ln p(D|b). Unfortunately, contrarily to the optimization of parameters in the case of the linear regression presented in §8.1, no closed-form analytic solution exists here to identify b*. With linear regression, the derivative of the log-likelihood (i.e., the loss function) was linear, leading to an analytic solution for b* : ∂J(b)/∂b = 0. With logistic regression, the derivative of the log-likelihood is a nonlinear function, so we have to resort to an optimization algorithm (see chapter 5) to identify b*.
Example: Logistic regression
Figure 9.10 presents three examples involving a single covariate x and a linear model g(x) = b₁x + b₀, where the parameters b = [b₀ b₁]ᵀ are estimated with respectively 5, 10, and 100 observations. The true parameter values employed to generate the simulated observations are b̌ = [−4 0.08]ᵀ. The corresponding functions g(Xb̌) and σ(g(Xb̌)) are represented by dashed lines, and those estimated using the MLE parameters, g(Xb*) and σ(g(Xb*)), are represented by solid lines. We can observe that as the number of observations increases, the classifier converges toward the one employed to generate the data.
Figure 9.10: Example of application of logistic regression: (a) D = 5, b* = [−3.1 0.09]ᵀ; (b) D = 10, b* = [−4.4 0.08]ᵀ; (c) D = 100, b* = [−3.9 0.08]ᵀ.
This case is a trivial one because, in this closed-loop simulation, the model structure (i.e., as defined by the basis functions in the model matrix X) was a perfect fit for the problem. In practical cases, we have to select an appropriate set of basis functions φ_j(x_i) to suit the problem at hand. Like for the linear regression, the selection of basis functions is prone to overfitting, so we have to employ either Bayesian model selection (see §6.8) or cross-validation (see §8.1.2) for that purpose.
Civil engineering perspectives
In the field of transportation engineering, logistic regression has been extensively employed for discrete choice modeling³ because of the interpretability of the model parameters b in the context of behavioral economics. However, for most benchmark problems, the predictive capacity of logistic regression is outperformed by more modern techniques such as Gaussian process classification and neural networks, which are presented in the next two sections.

³ Ben-Akiva, M. E. and S. R. Lerman (1985). Discrete Choice Analysis: Theory and Application to Travel Demand. MIT Press; and McFadden, D. (2001). Economic choices. American Economic Review 91 (3), 351–378.
9.3 Gaussian Process Classification
Gaussian process classification (GPC) is the extension of Gaussian process regression (GPR; see §8.2) to classification problems. In the context of GPR, a function ℝ^X → ℝ is defined so that it transforms an X-dimensional covariate domain into a single output g(x) ∈ ℝ. In the context of classification, the system response is y ∈ {−1, +1}.

Figure 9.11: The standard Normal CDF, where Φ(z) denotes the CDF of Z ~ N(z; 0, 1).

Again, the idea with classification is to transform g(x)'s outputs into the (0, 1) interval describing the probability Pr(Y = +1|x). For GPC, the transformation in the interval (0, 1) is done using the standard Normal CDF presented in figure 9.11, where Φ(z) denotes the CDF of Z ~ N(z; 0, 1).
Like the logistic sigmoid function presented in §9.2, Φ(z) is a transformation ℝ → (0, 1), such that an input defined in the real space is mapped in the interval (0, 1). Note that choosing the transformation function Φ(z) instead of the logistic function is not arbitrary; we will see later that it allows maintaining the analytic tractability of GPC. We saw in §8.2 that the function g(x*), describing the system responses at prediction locations x*, is modeled by a Gaussian process,

\[
f_{G^{*}|\mathbf{x}^{*},\mathcal{D}}(g^{*}|\mathbf{x}^{*},\mathcal{D}) \equiv f(g^{*}|\mathbf{x}^{*},\mathcal{D}) = \mathcal{N}\big(g^{*};\, \boldsymbol{\mu}^{*}_{|\mathcal{D}},\, \boldsymbol{\Sigma}^{*}_{|\mathcal{D}}\big).
\]

Figure 9.12: Example of a Gaussian process G(x) evaluated through Φ(G(x)) and whose uncertainty is marginalized in order to describe Pr(Y = +1|x).
In order to compute Pr(Y = +1|x*, D), that is, the probability that Y = +1 for a covariate x* and a set of observations D, we need to transform g* in the interval (0, 1) and then marginalize the uncertainty associated with f(g*|x*, D) so that

\[
\Pr(Y = +1|\mathbf{x}^{*},\mathcal{D}) = \int \Phi(g^{*}) \cdot f(g^{*}|\mathbf{x}^{*},\mathcal{D})\, dg^{*}. \tag{9.2}
\]
If we choose the standard normal CDF, with Φ(z) as a transformation function, the integral in equation 9.2 follows the closed-form solution

\[
\Pr(Y = +1|\mathbf{x}^{*},\mathcal{D}) = \Phi\!\left(\frac{\mathbb{E}[G^{*}|\mathbf{x}^{*},\mathcal{D}]}{\sqrt{1 + \operatorname{var}[G^{*}|\mathbf{x}^{*},\mathcal{D}]}}\right), \tag{9.3}
\]

where for the i-th prediction location, E[G*|x*, D] ≡ [µ*_{|D}]_i and var[G*|x*, D] ≡ [Σ*_{|D}]_{ii}. Figure 9.12 illustrates how the uncertainty related to a Gaussian process evaluated through Φ(G(x)) is marginalized in order to describe Pr(Y = +1|x, D).
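The closed-form squashing of equation 9.3 is a one-line computation once the GP predictive mean and variance are available; the sketch below assumes hypothetical arrays mu_star and var_star returned by a GP regression step and uses SciPy's standard Normal CDF.

import numpy as np
from scipy.stats import norm

# Hypothetical GP predictive moments at a few prediction locations x*
mu_star = np.array([-1.2, 0.0, 2.3])    # E[G*|x*, D]
var_star = np.array([0.5, 1.0, 0.8])    # var[G*|x*, D]

# Equation 9.3: Pr(Y=+1|x*, D) = Phi(mu / sqrt(1 + var))
pr_pos = norm.cdf(mu_star / np.sqrt(1.0 + var_star))
print(pr_pos)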
So far, we have seen how to employ the standard normal CDF to transform a Gaussian process g ∈ ℝ, with PDF f(g*|x*, D) obtained from a set of observations D = {D_x, D_y} = {(x_i, y_i ∈ ℝ), ∀i ∈ {1 : D}}, into a space Φ(g) ∈ (0, 1). The issue is that in a classification setup, g(x) is not directly observable; only y_i ∈ {−1, +1} is. This requires inferring, for each covariate x ∈ D_x, the mean and standard deviation of the latent variables G(x). Note that the qualifier latent refers to variables that are not directly observed and that need to be inferred from observations of the classes y ∈ {−1, +1}. For that, we need to rewrite equation 9.3 in order to explicitly include the conditional dependence over g so that

\[
\Pr(Y = +1|\mathbf{x}^{*},\mathcal{D}) = \int \Phi(g^{*}) \cdot f(g^{*}|\mathbf{x}^{*},\mathbf{g},\mathcal{D})\, dg^{*}
= \Phi\!\left(\frac{\mathbb{E}[G^{*}|\mathbf{x}^{*},\mathbf{g},\mathcal{D}]}{\sqrt{1 + \operatorname{var}[G^{*}|\mathbf{x}^{*},\mathbf{g},\mathcal{D}]}}\right).
\]

For the i-th prediction location, E[G*|x*, g, D] ≡ [µ*_{|g,D}]_i and var[G*|x*, g, D] ≡ [Σ*_{|g,D}]_{ii}, where

\[
f(g^{*}|\mathbf{g}) \equiv f(g^{*}|\mathbf{x}^{*},\mathbf{g},\mathcal{D}) = \mathcal{N}\big(g^{*};\, \boldsymbol{\mu}^{*}_{|\mathbf{g},\mathcal{D}},\, \boldsymbol{\Sigma}^{*}_{|\mathbf{g},\mathcal{D}}\big). \tag{9.4}
\]
Equation 9.4 presents the posterior probability of the Gaussian process outcomes g* at prediction locations x*, given the data set D = {D_x, D_y} = {(x_i, y_i ∈ {−1, +1}), ∀i ∈ {1 : D}} and the inferred values for g at the observed locations x. The mean and covariance matrix for the PDF in equation 9.4 are, respectively,
\[
\boldsymbol{\mu}^{*}_{|\mathbf{g},\mathcal{D}} = \overbrace{\boldsymbol{\mu}^{*}}^{=\,\mathbf{0}} + \boldsymbol{\Sigma}^{*\intercal}_{G}\boldsymbol{\Sigma}_{G}^{-1}\big(\boldsymbol{\mu}_{G|\mathcal{D}} - \overbrace{\boldsymbol{\mu}_{G}}^{=\,\mathbf{0}}\big)
\]
\[
\boldsymbol{\Sigma}^{*}_{|\mathbf{g},\mathcal{D}} = \boldsymbol{\Sigma}^{*} - \boldsymbol{\Sigma}^{*\intercal}_{G}\boldsymbol{\Sigma}_{G}^{-1}\big(\boldsymbol{\Sigma}_{G} - \boldsymbol{\Sigma}_{G|\mathcal{D}}\big)\boldsymbol{\Sigma}_{G}^{-1}\boldsymbol{\Sigma}^{*}_{G},
\]

Note: The prior means for the Gaussian process at both the observed locations, µ_G, and the predicted locations, µ*, are assumed to be equal to zero.
where µ_{G|D} and Σ_{G|D} are the mean vector and covariance matrix for the inferred latent observations of G(x), f(g|D) = N(g; µ_{G|D}, Σ_{G|D}). The posterior PDF for the inferred latent variables G(x) is

\[
f(\mathbf{g}|\mathcal{D}) = \frac{p(\mathcal{D}_y|\mathbf{g}) \cdot f(\mathbf{g}|\mathcal{D}_x)}{p(\mathcal{D}_y|\mathcal{D}_x)} \propto p(\mathcal{D}_y|\mathbf{g}) \cdot f(\mathbf{g}|\mathcal{D}_x). \tag{9.5}
\]
The first term corresponds to the joint likelihood of the observations D_y given the associated set of inferred latent variables g. With the assumption that observations are conditionally independent given g_i, the joint likelihood is obtained from the product of the marginals so that

\[
p(\mathcal{D}_y|\mathbf{g}) = \prod_{i=1}^{\mathtt{D}} p(y_i|g_i) = \prod_{i=1}^{\mathtt{D}} \Phi(y_i \cdot g_i).
\]

Note: Because of the symmetry of the standard normal CDF, p(y = +1|g) = Φ(g) and p(y = −1|g) = 1 − Φ(g) = Φ(−g). This explains why the marginal likelihood simplifies to p(y|g) = Φ(y · g).
The second term in equation 9.5, f(g|D_x) = N(g; µ_G = 0, Σ_G), is the prior knowledge for the inferred latent variables G(x). The posterior mean of f(g|D), that is, µ_{G|D}, is obtained by maximizing the logarithm of f(g|D) so that

\[
\boldsymbol{\mu}_{G|\mathcal{D}} = \mathbf{g}^{*} = \arg\max_{\mathbf{g}}\, \ln f(\mathbf{g}|\mathcal{D})
= \arg\max_{\mathbf{g}} \left[ \ln p(\mathcal{D}_y|\mathbf{g}) - \tfrac{1}{2}\mathbf{g}^{\intercal}\boldsymbol{\Sigma}_{G}^{-1}\mathbf{g} - \tfrac{1}{2}\ln|\boldsymbol{\Sigma}_{G}| - \tfrac{\mathtt{D}}{2}\ln 2\pi \right].
\]

Note: As detailed in §8.2, the prior covariance [Σ_G]_{ij} = ρ(x_i, x_j)·σ²_G depends on the hyperparameters θ = {σ_G, ℓ}, which also need to be inferred from the data.
The maximum g* corresponds to the location where the derivative equals zero so that

\[
\frac{\partial \ln f(\mathbf{g}|\mathcal{D})}{\partial \mathbf{g}} = \nabla_{\mathbf{g}} \ln f(\mathbf{g}|\mathcal{D}) = \nabla \ln p(\mathcal{D}_y|\mathbf{g}) - \boldsymbol{\Sigma}_{G}^{-1}\mathbf{g} = \mathbf{0}. \tag{9.6}
\]

By isolating g in equation 9.6, we can obtain the optimal values g* iteratively using

\[
\mathbf{g} \leftarrow \boldsymbol{\Sigma}_{G}\, \nabla \ln p(\mathcal{D}_y|\mathbf{g}), \tag{9.7}
\]

where the initial starting location can be taken as g = 0. The uncertainty associated with the optimal set of inferred latent variables µ_{G|D} = g* is estimated using the Laplace approximation (see §6.7.2). The Laplace approximation for the covariance matrix corresponds to the inverse of the Hessian of the negative log-likelihood evaluated at µ_{G|D}, so that

\[
\boldsymbol{\Sigma}_{G|\mathcal{D}} = -\mathbf{H}[\ln f(\mathbf{g}|\mathcal{D})]^{-1} = \Big(\boldsymbol{\Sigma}_{G}^{-1} - \operatorname{diag}\big(\nabla\nabla \ln p(\mathcal{D}_y|\boldsymbol{\mu}_{G|\mathcal{D}})\big)\Big)^{-1}.
\]
Figure 9.13: Example of application of Gaussian process classification using the package GPML with a different number of observations: (a) D = 25; (b) D = 50; (c) D = 100.
Note that we can improve the efficiency for computing the MLE in equation 9.7 by using the information contained in the Hessian, as we did with the Newton-Raphson method in §5.2. In that case, optimal values g* are obtained iteratively using

\[
\mathbf{g} \leftarrow \mathbf{g} - \mathbf{H}[\ln f(\mathbf{g}|\mathcal{D})]^{-1}\, \nabla_{\mathbf{g}} \ln f(\mathbf{g}|\mathcal{D}). \tag{9.8}
\]
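The sketch below implements this mode-finding and Laplace step for a probit likelihood: the Newton update of equation 9.8 followed by the Laplace covariance. It is a minimal NumPy/SciPy sketch with a hypothetical squared-exponential kernel, fixed hyperparameters, and synthetic data; in practice one would rely on a package such as GPML rather than this naive implementation.

import numpy as np
from scipy.stats import norm

def kernel(xa, xb, sigma_g=1.0, ell=10.0):
    """Hypothetical squared-exponential prior covariance [Sigma_G]_ij."""
    return sigma_g**2 * np.exp(-0.5 * (xa[:, None] - xb[None, :])**2 / ell**2)

# Hypothetical 1-D training data with labels in {-1, +1}
x = np.linspace(0.0, 100.0, 20)
y = np.where(x > 50.0, 1.0, -1.0)

Sigma_G = kernel(x, x) + 1e-6 * np.eye(len(x))    # jitter for numerical stability
g = np.zeros(len(x))                               # initial latent values g = 0

for _ in range(20):                                # Newton-Raphson iterations (eq. 9.8)
    pdf, cdf = norm.pdf(y * g), norm.cdf(y * g)
    grad_lik = y * pdf / cdf                       # d/dg ln p(D_y|g) for a probit likelihood
    hess_lik = -(pdf / cdf)**2 - (y * g) * pdf / cdf   # d^2/dg^2 ln p(y_i|g_i), diagonal terms
    grad = grad_lik - np.linalg.solve(Sigma_G, g)  # gradient of ln f(g|D) (eq. 9.6)
    H = np.diag(hess_lik) - np.linalg.inv(Sigma_G) # Hessian of ln f(g|D)
    g = g - np.linalg.solve(H, grad)               # Newton update

mu_GD = g                                          # posterior mean of the latent variables
pdf, cdf = norm.pdf(y * g), norm.cdf(y * g)
hess_lik = -(pdf / cdf)**2 - (y * g) * pdf / cdf
Sigma_GD = np.linalg.inv(np.linalg.inv(Sigma_G) - np.diag(hess_lik))  # Laplace covariance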
Figure 9.13 presents an application of GPC using the Gaussian process for machine learning (GPML) package with a different number of observations, D ∈ {25, 50, 100}. In this figure, the green solid lines on the horizontal planes describe the inferred confidence intervals for the latent variables, µ_{G|D} ± 2σ_{G|D}. We can see that as the number of observations increases, the inferred function Pr(Y = +1|x, D) (orange solid line) tends toward the true function (orange dashed line).
Figure 9.14 presents the application of Gaussian process classification to the post-earthquake structural safety assessment example introduced in §9.1.2. Here, the multiclass problem is transformed into a binary one by modeling the probability that the damage index is either 0 or 1. The main advantage of GPC is that the problem setup is trivial when using existing, pre-implemented packages. With the generative approaches presented in §9.1, the formulation has to be specifically tailored for the problem. Nevertheless, note that a generative approach allows including censored data, whereas a discriminative approach such as the GPC presented here cannot.
In practice, all the operations required to infer the latent variables G(x) as well as the parameters θ = {σ_G, ℓ} in GPC are already implemented in the same open-source packages dealing with GPR, for example, GPML, GPstuff, and pyGPs (see §8.2).

Figure 9.14: Example of application of Gaussian process classification to the post-earthquake structural safety evaluation data set.
Strengths and limitations
With the availability of Gaussian process classification implemented in a variety of open-source packages, the setup of a GPC problem is trivial, even for problems involving several covariates x. When the number of observations is small, the task of inferring the latent variables g becomes increasingly difficult and the performance decreases. Note also that the formulation presented assumes that we only have error-free direct observations. Moreover, because of the computationally demanding procedure of inferring the latent variables g, the performance is limited in the case of large data sets, for example, D > 10⁵. Given the limitations for both small and large data sets, the GPC presented here is thus best suited for medium-size data sets.

For more details about Gaussian process classification and its extension to multiple classes, the reader should refer to dedicated publications such as the book by Rasmussen and Williams⁴ or the tutorial by Ebden.⁵

⁴ Rasmussen, C. E. and C. K. Williams (2006). Gaussian Processes for Machine Learning. MIT Press.
⁵ Ebden, M. (2008, August). Gaussian processes: A quick introduction. arXiv preprint (1505.02965).
9.4 Neural Networks
The formulation for the feedforward neural network presented in §8.3 can be adapted for binary classification problems by replacing the linear output activation function by a sigmoid function, as illustrated in figure 9.15a. For an observation y ∈ {−1, +1}, the log of the probability for the outcome y_i = +1 is modeled as being proportional to the hidden variable on the output layer z^(O), and the log of the probability for the outcome y_i = −1 is assumed to be proportional to 0,

\[
\ln \Pr(Y = +1|\mathbf{x},\boldsymbol{\theta}) \propto z^{(\mathrm{O})} = \mathbf{W}^{(\mathrm{L})}\mathbf{a}^{(\mathrm{L})} + \mathbf{b}^{(\mathrm{L})}
\]
\[
\ln \Pr(Y = -1|\mathbf{x},\boldsymbol{\theta}) \propto 0.
\]
These unnormalized log-probabilities are transformed into a probability for each possible outcome by taking the exponential of each of them and then normalizing, following

\[
\Pr(Y = +1|\mathbf{x},\boldsymbol{\theta}) = \frac{\exp\big(z^{(\mathrm{O})}\big)}{\exp\big(z^{(\mathrm{O})}\big) + \underbrace{\exp(0)}_{=1}} = \sigma\big(z^{(\mathrm{O})}\big)
\]
\[
\Pr(Y = -1|\mathbf{x},\boldsymbol{\theta}) = \frac{\exp(0)}{\exp\big(z^{(\mathrm{O})}\big) + \exp(0)} = \sigma\big(-z^{(\mathrm{O})}\big).
\]
As presented in §8.3.1, this procedure for normalizing the log-probabilities of y is equivalent to evaluating the hidden variable z^(O) (or its negative) in the logistic sigmoid function. Thus, the likelihood of an observation y ∈ {−1, +1}, given its associated covariates x and a set of parameters θ = {w, b}, is given by

\[
p(y|\mathbf{x},\boldsymbol{\theta}) = \sigma\big(y \cdot z^{(\mathrm{O})}\big).
\]

Figure 9.15: Nomenclature for a feedforward neural network employed for classification: (a) represents the case for binary classification, y ∈ {−1, +1}, with a single output variable z^(O); (b) the case for C classes, y ∈ {1, 2, ···, C}, with output variables z^(O)_1, z^(O)_2, ···, z^(O)_C.
With neural networks, it is common to minimize a loss function J(D; w, b) that is defined as the negative joint log-likelihood for a set of D observations,

\[
J(\mathcal{D};\mathbf{w},\mathbf{b}) = -\ln p(\mathcal{D}_y|\mathcal{D}_x,\boldsymbol{\theta}) = -\sum_{i=1}^{\mathtt{D}} \ln \sigma\big(y_i \cdot z^{(\mathrm{O})}_i\big).
\]
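Written out, this loss is simply a log-sigmoid evaluated at y_i · z_i^(O); a minimal sketch is given below, assuming NumPy and hypothetical output-layer values z. In practice this loss and its gradient are provided by deep-learning frameworks.

import numpy as np

def binary_loss(z_out, y):
    """J(D; w, b) = -sum_i ln sigma(y_i * z_i^(O)) for labels y_i in {-1, +1}.
    Uses -ln sigma(v) = ln(1 + exp(-v)) in a numerically stable form."""
    v = y * z_out
    return np.sum(np.logaddexp(0.0, -v))

# Hypothetical output-layer values and observed classes
z_out = np.array([2.1, -0.3, 4.0, -1.5])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(binary_loss(z_out, y))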
In the case where the observed system responses can belong to multiple classes, y ∈ {1, 2, ···, C}, the output layer needs to be modified to include as many hidden states as there are classes, z^(O) = [z^(O)_1 z^(O)_2 ··· z^(O)_C]ᵀ, as presented in figure 9.15b. The log-probability of an observation y = k is assumed to be proportional to the k-th hidden state z^(O)_k,

\[
\ln \Pr(Y = k|\mathbf{x},\boldsymbol{\theta}) \propto z^{(\mathrm{O})}_k.
\]

The normalization of the log-probabilities is done using the softmax activation function, where the probability of an observation y = k is given by

\[
p(y = k|\mathbf{x},\boldsymbol{\theta}) = \operatorname{softmax}\big(\mathbf{z}^{(\mathrm{O})}, k\big) = \frac{\exp\big(z^{(\mathrm{O})}_k\big)}{\sum_{j=1}^{\mathtt{C}} \exp\big(z^{(\mathrm{O})}_j\big)}.
\]
By assuming that observations are conditionally independent from each other given z^(O), the loss function for the entire data set is obtained as in §8.3.2 by summing the log-probabilities (i.e., the marginal log-likelihoods) for each observation,

\[
J(\mathcal{D};\mathbf{w},\mathbf{b}) = -\sum_{i=1}^{\mathtt{D}} \ln p(y_i|\mathbf{x}_i,\boldsymbol{\theta}). \tag{9.9}
\]
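Combined with the softmax, the loss of equation 9.9 is the familiar softmax cross-entropy; a small sketch with hypothetical output values follows, using a log-sum-exp form for numerical stability.

import numpy as np
from scipy.special import logsumexp

def softmax_cross_entropy(Z_out, y):
    """J(D; w, b) = -sum_i ln softmax(z_i^(O), y_i) for class labels y_i in {1,...,C}.
    Z_out: array of shape (D, C) with the output-layer values for each observation."""
    log_p = Z_out - logsumexp(Z_out, axis=1, keepdims=True)  # ln p(y=k|x, theta)
    return -np.sum(log_p[np.arange(len(y)), y - 1])          # pick the observed class

# Hypothetical output values for D = 3 observations and C = 4 classes
Z_out = np.array([[2.0, 0.1, -1.0, 0.3],
                  [0.2, 1.5, 0.0, -0.7],
                  [-0.5, 0.0, 0.4, 2.2]])
y = np.array([1, 2, 4])   # observed classes, 1-indexed as in the text
print(softmax_cross_entropy(Z_out, y))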
Minimizing the loss function J(D; w, b) defined in equation 9.9 corresponds to minimizing the cross-entropy. Cross-entropy is a concept from the field of information theory⁶ that measures the similarity between two probability distributions or mass functions. Note that in the context of classification, the gradient of the output activation function, ∇σ^(O)(z^(O)), in the backpropagation algorithm (see §8.3.2) is no longer equal to one because of the sigmoid function. Moreover, like for the regression setup, a neural network classifier is best suited for problems with a large number of covariates and for which large data sets are available.

Note: Given two PMFs, p(x) and q(x), the cross-entropy is defined as H(p, q) = −Σ_x p(x) ln q(x).

⁶ MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press.
9.5 Regression versus Classification
Note that when the data and context allow for it, it can be preferable to formulate a problem using a regression approach rather than using classification. Take, for example, the soil contamination example presented in §8.2.5. The data set available describes the contaminant concentration c_i measured at coordinates l_i = {x_i, y_i, z_i}. Using regression to model this problem results in a function describing the PDF of the contaminant concentration as a function of the coordinates l, f(c|l, D). A classification setup would instead model the probability that the soil contamination exceeds an admissible value c_adm., Pr(C ≥ c_adm.|l, D). Here, the drawback of classification is that information is lost when transforming a continuous system response C into a categorical event {C ≥ c_adm.}. For the classification setup, the information available to build the model is whether the contaminant concentration is above (c_i = +1) or below (c_i = −1) the admissible concentration c_adm. for multiple locations l_i. The issue with a classification approach is that because we work with categories, the information about how far or how close we are from the threshold c_adm. is lost.