8 Regression

Data: $\mathcal{D} = \{(\mathbf{x}_i, y_i),\ \forall i \in \{1\!:\!\mathtt{D}\}\}$
$\mathbf{x}_i \in \mathbb{R}^{\mathtt{X}}$: Covariate / Attribute / Regressor
$y_i \in \mathbb{R}$: Observation
Model: $g(\mathbf{x}) \triangleq \mathrm{fct}(\mathbf{x})$
Regression consists in the task of modeling the relationships between a system's responses and the covariates describing the properties of this system. Here, the generic term system is employed because regression can be applied to many types of problems. Figure 8.1 presents examples of applications in the field of civil engineering; for instance, one could model the relationship between (a) the number of pathogens and chlorine concentration, (b) the number of accidents at a road intersection as a function of the hour of the day, and (c) concrete strength as a function of the water-cement ratio. These examples are all functions of a single covariate; later in this chapter we will see how regression is generalized to model the responses of a system as a function of any number of covariates.
Figure 8.1: Examples of applications of regression analysis in the context of civil engineering: (a) pathogens [ppm] versus chlorine concentration; (b) number of accidents versus hour of the day; (c) concrete strength versus water/cement ratio.
In regression problems, the data sets are described by $\mathtt{D}$ pairs of observations $\mathcal{D} = \{(\mathbf{x}_i, y_i),\ \forall i \in \{1\!:\!\mathtt{D}\}\}$, where $\mathbf{x}_i = [x_1\ x_2\ \cdots\ x_{\mathtt{X}}]_i^\intercal \in \mathbb{R}^{\mathtt{X}}$ are covariates, also called attributes or regressors, and $y_i \in \mathbb{R}$ is an observed system response. A regression model $g(\mathbf{x}) \triangleq \mathrm{fct}(\mathbf{x}) \in \mathbb{R}$ is thus an $\mathtt{X} \to 1$ function, that is, a function of $\mathtt{X}$ covariates leading to one output. This chapter covers three regression methods for building $g(\mathbf{x})$: linear regression, Gaussian process regression, and neural networks.
8.1 Linear Regression
Linear regression is one of the easiest methods for approaching regression problems. Contrary to what its name may lead us to think, linear regression is not limited to cases where the relationships between the covariates and system responses are linear. The term linear refers to the linear dependence of the model with respect to its parameters. In the case involving a single covariate $x$,
the model employed in linear regression takes the form
$$g(x) = b_0 + b_1\phi_1(x) + b_2\phi_2(x) + \dots + b_B\phi_B(x),$$
where $\boldsymbol{\theta} = \mathbf{b} = [b_0\ b_1\ \cdots\ b_B]^\intercal$ is a vector of model parameters and $\phi_i(x)$ is a basis function applied on the covariate $x$. Basis functions are nonlinear functions employed when we want to apply linear regression to describe nonlinear relationships. Figure 8.2 presents two examples of linear regression applied to (a) linear and (b) nonlinear relationships.
Figure 8.2: Examples of linear regression applied to linear and nonlinear relationships: (a) linear, $g(x) = b_0 + b_1 x$; (b) nonlinear, $g(x) = b_0 + b_1\sin(10x)^3$.
8.1.1 Mathematical Formulation
For the simplest case where there is only one covariate, the data set consists in $\mathtt{D}$ pairs of covariates and system responses $\mathcal{D} = \{(x_i, y_i),\ \forall i \in \{1\!:\!\mathtt{D}\}\}$, where a covariate $x_i \in \mathbb{R}$, and its associated observation $y_i \in \mathbb{R}$. The observation model is defined as
$$\underbrace{y}_{\text{Observation}} = \overbrace{g(x)}^{\text{Model}} + \underbrace{v}_{\text{Observation error}}, \qquad v: V \sim \mathcal{N}(v; 0, \sigma_V^2),$$
where $v$ describes the observation errors that are assumed to be independent of each other, $V_i \perp\!\!\!\perp V_j,\ \forall i \neq j$. Note that observed covariates $x$ are assumed to be exact and free from any observation errors. Let us consider the special case of a linear basis function $\phi(x) = x$, where the model takes the form
$$g(x) = b_0 + b_1 x = [b_0\ b_1]\begin{bmatrix}1\\ x\end{bmatrix}. \tag{8.1}$$
We can express the model for the entire data set using the following matrix notation,
Note: The 1 in $[1\ x]^\intercal$ stands for the absence of dependence on a covariate for the bias parameter $b_0$.
$$\mathbf{b} = \begin{bmatrix} b_0\\ b_1 \end{bmatrix},\quad \mathbf{y} = \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_{\mathtt{D}} \end{bmatrix},\quad \mathbf{X} = \begin{bmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_{\mathtt{D}} \end{bmatrix},\quad \mathbf{v} = \begin{bmatrix} v_1\\ v_2\\ \vdots\\ v_{\mathtt{D}} \end{bmatrix}.$$
With this matrix notation, we reformulate the observation model as
$$\mathbf{y} = \mathbf{Xb} + \mathbf{v},\qquad \mathbf{v}: \mathbf{V} \sim \mathcal{N}(\mathbf{v}; \mathbf{0}, \sigma_V^2\cdot\mathbf{I}),$$
where $\mathbf{I}$ is the identity matrix indicating that all measurement errors are independent from each other and have the same variance. If we separate the set of observations $\mathcal{D} = \{\mathcal{D}_x, \mathcal{D}_y\}$ by isolating covariates $\mathcal{D}_x$ and system responses $\mathcal{D}_y$, the joint likelihood of observed system responses given the set of observed covariates and model parameters $\boldsymbol{\theta} = \mathbf{b}$ is expressed as
$$\begin{aligned}
f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta}) \equiv f(\mathcal{D}_y = \mathbf{y}|\mathcal{D}_x = \mathbf{x}, \boldsymbol{\theta} = \mathbf{b})
&= \prod_{i=1}^{\mathtt{D}} \mathcal{N}\big(y_i; g(x_i), \sigma_V^2\big)\\
&\propto \prod_{i=1}^{\mathtt{D}} \exp\!\left(-\frac{1}{2}\frac{\big(y_i - g(x_i)\big)^2}{\sigma_V^2}\right)\\
&= \exp\!\left(-\frac{1}{2}\sum_{i=1}^{\mathtt{D}}\frac{\big(y_i - g(x_i)\big)^2}{\sigma_V^2}\right)\\
&= \exp\!\left(-\frac{1}{2\sigma_V^2}(\mathbf{y} - \mathbf{Xb})^\intercal(\mathbf{y} - \mathbf{Xb})\right).
\end{aligned}$$
The goal is to estimate optimal parameters $\boldsymbol{\theta}^* = \mathbf{b}^*$, which maximize the likelihood $f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta})$, or equivalently, which maximize the log-likelihood,
Note: Because we have assumed that $\sigma_V$ is the same for all observations, optimal parameter values $\boldsymbol{\theta}^*$ are independent of the observation-error standard deviation.
$$\ln f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta}) = -\frac{1}{2\sigma_V^2}(\mathbf{y} - \mathbf{Xb})^\intercal(\mathbf{y} - \mathbf{Xb}) \propto -\frac{1}{2}(\mathbf{y} - \mathbf{Xb})^\intercal(\mathbf{y} - \mathbf{Xb}).$$
The set of data $\mathcal{D}$ employed to obtain the maximum likelihood estimate (MLE) for parameters $\boldsymbol{\theta} = \mathbf{b}$ is called the training set.
Note: $\arg\max_{\mathbf{b}} \ln f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta}) \equiv \arg\min_{\mathbf{b}} J(\mathbf{b})$.
In the context of linear regression, the objective of maximizing $\ln f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta})$ is often replaced by the equivalent objective of minimizing a loss function $J(\mathbf{b})$ corresponding to the sum of the squares of the prediction errors,
$$J(\mathbf{b}) = \frac{1}{2}\sum_{i=1}^{\mathtt{D}}\big(y_i - g(x_i)\big)^2 = \frac{1}{2}(\mathbf{y} - \mathbf{Xb})^\intercal(\mathbf{y} - \mathbf{Xb}) = -\ln f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta}).$$
Figure 8.3: The minimum of a continuous function corresponds to the location where its derivative equals zero.
Figure 8.3 presents a loss function $J(\mathbf{b})$ as a function of the model parameters $\mathbf{b}$. Optimal parameter values $\boldsymbol{\theta}^* = \mathbf{b}^*$ are defined as
$$\mathbf{b}^* = \arg\min_{\mathbf{b}}\ \frac{1}{2}(\mathbf{y} - \mathbf{Xb})^\intercal(\mathbf{y} - \mathbf{Xb}).$$
As for other continuous functions, the minimum of $J(\mathbf{b})$ coincides with the point where its derivative equals zero,
$$\frac{\partial J(\mathbf{b})}{\partial \mathbf{b}} = 0 = \mathbf{X}^\intercal\mathbf{X}\mathbf{b} - \mathbf{X}^\intercal\mathbf{y}.$$
Figure 8.4: Example of linear regression minimizing the performance metric $J$ computed on the training set; $J_{\text{train}}(\mathbf{b}^*) = 1.98$.
By reorganizing the terms from the last equation, optimal parameters are defined as
$$\mathbf{b}^* = (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y}. \tag{8.2}$$
Figure 8.4 presents an example of linear regression employing a linear basis function $g(x) = b_0 + b_1 x$. The optimal parameters found using the training set are $\mathbf{b}^* = [1.14\ 3.71]^\intercal$, for which $J_{\text{train}}(\mathbf{b}^*) = 1.98$.
We can extend the formulation presented above for the general case where the relationship between $x$ and $y$ is modeled by the summation of multiple basis functions,
$$g(x) = b_0 + b_1\phi_1(x) + b_2\phi_2(x) + \cdots + b_B\phi_B(x) = \mathbf{Xb},$$
where $\phi_i(x)$ can be any linear or nonlinear function, for example: $\phi_i(x) = x^2$, $\phi_i(x) = \sin(x)$, $\cdots$. In a matrix form, this general model is expressed as
$$\mathbf{b} = \begin{bmatrix} b_0\\ b_1\\ \vdots\\ b_B \end{bmatrix},\quad \mathbf{y} = \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_{\mathtt{D}} \end{bmatrix},\quad \mathbf{X} = \begin{bmatrix} 1 & \phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_B(x_1)\\ 1 & \phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_B(x_2)\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & \phi_1(x_{\mathtt{D}}) & \phi_2(x_{\mathtt{D}}) & \cdots & \phi_B(x_{\mathtt{D}}) \end{bmatrix}.$$
Remember that with linear regression, the model is necessarily linear with respect to $\phi_i(x)$ and $\mathbf{b}$, yet not with respect to $x$. Figure 8.5 presents two examples of linear regression applied to nonlinear relationships between a system response and a covariate.
Figure 8.5: Examples of linear regression applied to nonlinear relationships: (a) $g(x) = b_0 + b_1 x^2$; (b) $g(x) = b_0 - b_2\sin(10x)^3$.
8.1.2 Overfitting and Cross-Validation
One of the main challenges with linear regression, as with many other machine learning methods, is overfitting. Overfitting happens when you select a model class or model structure that represents the data in your training set well, yet it fails to provide good predictions for other covariates that were not seen during training. A highly complex model capable of adapting to the training data is said to have a large capacity.
Figure 8.6: Illustration of the bias and variance when talking about predictive performance and model complexity: (a) low bias & low variance; (b) high bias & low variance; (c) low bias & high variance; (d) high bias & high variance; (e) bias-variance tradeoff, where the training error (variance) and validation error (bias) are plotted against model complexity, with the optimal complexity at the minimum of the validation error, underfitting to its left and overfitting to its right.
Overfitting with a model that has too much capacity corresponds to the case (c) in figure 8.6, where prediction errors are displaying a low bias when evaluated with the data employed to train the model; however, when validated with the true objective of predicting unobserved data, there is a high variance. Case (b) represents the situation where a model underfits the data because its low capacity causes a large bias and a small variance. As depicted in (e), we want the optimal model complexity offering a tradeoff between the prediction bias and variance by minimizing the validation error. This is known as the bias-variance tradeoff. The concept illustrated in figure 8.6 will be revisited at the end of this section.
In the context of linear regression, overfitting arises when one chooses basis functions $\phi(x)$ that have too much capacity and that will be able to fit the data in a training set but not the data of an independent validation set. Figure 8.7a presents an example where data is generated from a linear basis function and where the regression model also employs a linear basis function to fit the data. Figure 8.7b presents the case where the same data is fitted with a model using polynomial basis functions of order 5. Notice how when more complex basis functions are introduced in the model, it becomes more flexible, and it can better fit the data in the training set. Here the fitting ability is quantified by the loss function $J$, which is lower for the polynomial model than for the linear one.
Figure 8.7: Example of (a) a model with correct complexity, $g(x) = b_0 + b_1 x$, for which $J_{\text{train}}(\mathbf{b}^*) = 1.98$; and (b) a model with too much capacity, $g(x) = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + b_4 x^4 + b_5 x^5$, for which $J_{\text{train}}(\mathbf{b}^*) = 1.72$.
One solution to this challenge is to replace the MLE procedure for optimizing the parameters $\mathbf{b}$ (see equation 8.2) with the Bayesian estimation procedure as described in §6.3. The appropriate model complexity should then be selected by employing Bayesian model selection as presented in §6.8. In practice, this is seldom performed because calculating the evidence $f(\mathcal{D}|m_i)$ for a model is computationally demanding. An alternative is to perform ridge regression,¹ which is a special case of Bayesian linear regression that employs regularization to constrain the parameters (see §8.3.3).
¹ Hoerl, A. E. and R. W. Kennard (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(1), 55-67
Another general procedure to identify the optimal model complexity is to employ cross-validation. The idea is to avoid employing the same data to (1) estimate optimal parameters, (2) discriminate between model classes, and (3) test the final model performance using the loss function. To start with, the data is separated into a training set and a test set. In order to maximize the utilization of the data available, we can rely on the $\mathtt{N}$-fold cross-validation procedure, where the training set is again divided into $\mathtt{N}$ subsets called folds; one at a time, each subset is employed as an independent validation set while the training is performed on the remaining folds. The total loss is obtained by summing the loss obtained for each validation subset. The model class selected is the one with the lowest aggregated loss. Once a model class is selected as the best one, its parameters can be retrained using the entire training set, and its predictive performance is evaluated on the independent test set that was not employed to estimate parameters or to discriminate between model classes.
Figure 8.8: The $\mathtt{N}$-fold cross-validation procedure: the entire data set is divided into a training set and a test set; the training set is further divided into folds (#1, 2, 3), each used in turn as the validation subset while the remaining folds form the training subset.
Figure 8.8 presents an example of a 3-fold cross-validation where the green-shaded boxes describe the part of the data set employed to train the model, the blue boxes describe the part of the data set employed to compare the predictive performance while choosing between model classes, and the red box represents the independent test set that is only employed to quantify the model performance on unobserved data. It is essential that for each fold, the training and validation sets, and the test set are kept separated. Typically, we would employ 80 percent of the data for the training and validation sets and 20 percent for the test set.
It is common to employ 10-fold cross-validation because it offers a balance between the computational effort required and the amount of data available to estimate the model parameters. For $\mathtt{N}$ folds, the model needs to be trained $\mathtt{N}$ times, which may involve a substantial computational effort. On the other hand, using few folds may lead to poor parameter estimates. For example, if we only employ 2 folds, for each of them the model is only trained on half the training set. When $\mathtt{N} = \mathtt{D}_{\text{train}}$, the procedure is called leave-one-out cross-validation (LOO-CV). This approach leads to the best estimation of validation errors because more data is employed during training, yet it is the most computationally demanding. Algorithm 7 summarizes the $\mathtt{N}$-fold cross-validation procedure for choosing between $\mathtt{M}$ model classes.
Algorithm 7: N-fold cross-validation
1  define {x_train, y_train} (training set with D_train observations)
2  define {x_test, y_test} (test set)
3  define g_m(x), ∀m ∈ {1:M} (models)
4  define N (number of folds)
5  d = randperm(D_train) (random permutation of [1:D_train]ᵀ)
6  n = round(D_train/N) (# observations per fold)
7  for m ∈ {1, 2, ···, M} do
8      for i ∈ {1, 2, ···, N} do
9          x_t = x_train; y_t = y_train (initialize training set)
10         k = d(n(i−1)+1 : min(i·n, D_train)) (validation-set indices)
11         x_v = x_train(k, :); y_v = y_train(k, :) (validation set)
12         x_t(k, :) = [ ]; y_t(k, :) = [ ] (remove validation set)
13         X = fct(x_t) (training model-matrix)
14         b*_i = (XᵀX)⁻¹Xᵀy_t (estimate parameters)
15         X = fct(x_v) (validation model-matrix)
16         J_i = J(b*_i, y_v) (i-th validation-set loss)
17     J_m^validation = Σ_{i=1}^{N} J_i (aggregated loss for model m)
18 m* = arg min_m J_m^validation (select best model)
19 X = fct(x_train) (training model-matrix for m*)
20 b* = (XᵀX)⁻¹Xᵀy_train (estimate parameters)
21 X = fct(x_test) (test-set model matrix)
22 J_m*^test = J(b*, y_test) (test-set loss for m*)
Figure 8.9: Example of 2-fold cross-validation for the model $g_1(x) = b_0 + b_1 x$: (a) fold 1/2, $J_{\text{train}}(\mathbf{b}^*) = 0.43$, $J_{\text{validation}}(\mathbf{b}^*) = 1.56$; (b) fold 2/2, $J_{\text{train}}(\mathbf{b}^*) = 1.53$, $J_{\text{validation}}(\mathbf{b}^*) = 0.46$.
Figure 8.10: Example of 2-fold cross-validation for the model $g_2(x) = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + b_4 x^4 + b_5 x^5$: (a) fold 1/2, $J_{\text{train}}(\mathbf{b}^*) = 0.7$, $J_{\text{validation}}(\mathbf{b}^*) = 5.42$; (b) fold 2/2, $J_{\text{train}}(\mathbf{b}^*) = 0.11$, $J_{\text{validation}}(\mathbf{b}^*) = 15.47$.
Figures 8.9 and 8.10 present a 2-fold cross-validation procedure employed to choose between a linear model $g_1(x)$ and a fifth-order polynomial one $g_2(x)$,
$$\begin{aligned} g_1(x) &= b_0 + b_1 x &&\text{(linear)}\\ g_2(x) &= b_0 + b_1 x + b_2 x^2 + b_3 x^3 + b_4 x^4 + b_5 x^5 &&(5^{\text{th}}\ \text{order}). \end{aligned}$$
For the linear model, the sum of the losses for the two training and validation sets is
$$\textstyle\sum J_{\text{train}} = 1.97,\quad \sum J_{\text{validation}} = 2.02\ \Rightarrow\ g_1(x),$$
and for the fifth-order polynomial model, the sum of the losses for the two training and validation sets is
$$\textstyle\sum J_{\text{train}} = 0.81,\quad \sum J_{\text{validation}} = 20.89\ \Rightarrow\ g_2(x).$$
These results indicate that the fifth-order polynomial function is better at fitting the training set; however, it fails at predicting data that was not seen in the training set. This is reflected in the loss function for the validation set, which is lower for the linear model having the correct complexity, compared to the fifth-order polynomial function that is overcomplex.
Figure 8.11: Example of 2-fold cross-validation for the model $g_3(x) = b_0 + b_1 x$, where the data follows the real model $\check{g}(x) = 1 + 5x^4$: (a) fold 1/2; (b) fold 2/2.
We now come back to the targets presented in figure 8.6; the model $g_1(x)$ presented in figure 8.9 corresponds to the target in figure 8.6a, where the complexity is optimal so there is both a low bias and a low variance. The model $g_2(x)$ presented in figure 8.10 corresponds to figure 8.6c, where because the capacity is too high, the model closely fits the training data in each fold (low bias), yet the model is drastically different (i.e., high variance) for each fold. Here, the high model capacity allowed the model to overfit the noise contained in the data. Finally, the target in figure 8.6b corresponds to the new case illustrated in figure 8.11, where the model does not have enough capacity to fit the data. The result is that the model is highly biased, yet there is almost no variability between the models obtained for each fold.
8.1.3 Mathematical Formulation >1-D
Linear regression can be employed for cases involving more than one covariate. The data set is then $\mathcal{D} = \{(\mathbf{x}_i, y_i),\ \forall i \in \{1\!:\!\mathtt{D}\}\}$, where $\mathbf{x}_i = [x_1\ x_2\ \cdots\ x_{\mathtt{X}}]_i^\intercal \in \mathbb{R}^{\mathtt{X}}$ is an $\mathtt{X}$ by one vector associated with an observation $y_i \in \mathbb{R}$. The model is an $\mathtt{X} \to 1$ function, $g(\mathbf{x}) \triangleq \mathrm{fct}(\mathbf{x}) \in \mathbb{R}$, which is defined using basis functions so that
$$g(\mathbf{x}) = b_0 + b_1\phi_1(x_1) + b_2\phi_2(x_2) + \cdots + b_B\phi_B(x_{\mathtt{X}}) = \mathbf{Xb}.$$
The matrix notation for the model defined over the entire data set is
$$\mathbf{b} = \begin{bmatrix} b_0\\ b_1\\ \vdots\\ b_B \end{bmatrix},\quad \mathbf{y} = \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_{\mathtt{D}} \end{bmatrix},\quad \mathbf{X} = \begin{bmatrix} 1 & \phi_1(x_{1,1}) & \phi_2(x_{1,2}) & \cdots & \phi_B(x_{1,\mathtt{X}})\\ 1 & \phi_1(x_{2,1}) & \phi_2(x_{2,2}) & \cdots & \phi_B(x_{2,\mathtt{X}})\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & \phi_1(x_{\mathtt{D},1}) & \phi_2(x_{\mathtt{D},2}) & \cdots & \phi_B(x_{\mathtt{D},\mathtt{X}}) \end{bmatrix},$$
where $[\mathbf{x}_i]_j \equiv x_{i,j}$ refers to the $j^{\text{th}}$ component of the $\mathbf{x}$ vector for the $i^{\text{th}}$ observation. Figure 8.12 presents an example of multivariate linear regression for $\mathbf{x} = [x_1\ x_2]^\intercal$ and for which the model is
$$g(\mathbf{x}) = b_0 + b_1 x_1^{1/2} + b_2 x_2^2.$$
Note that even if the model now involves multiple covariates, the optimal model parameters $\mathbf{b}^*$ are still estimated using equation 8.2.
Figure 8.12: Example of application of linear regression to a 2-D data set.
8.1.4 Limitations
Figure 8.13: Illustrations of some of the limitations of the linear regression method presented in §8.1.1: (a) example of heteroscedasticity where the variance of errors is a function of the covariate $x$; (b) example of a bias introduced by an outlier in the bottom right corner; (c) example of biased extrapolation for unobserved covariates.
Some of the limitations of the linear regression method presented in §8.1.1 are illustrated in figure 8.13: (a) it assumes homoscedastic errors and is thus not suited for the cases where the variance of errors depends on the covariate $x$, that is, heteroscedastic cases; (b) it is sensitive to outliers, so a single biased observation may severely affect the model; and (c) it does not differentiate between interpolation and extrapolation. Interpolation refers to predictions made between covariates for which observations are available to train the model, and extrapolation refers to predictions made beyond the domain the model has been trained for. In (c), the predictions performed with $g(x)$ are good at interpolating between data points in the training set, yet they offer a poor performance for extrapolating beyond them on the test data. A main limitation of the linear regression formulation presented here is that it does not provide uncertainty estimates indicating that its predictive performance degrades for covariate values that are far from those available in the training set.
Another key limitation not illustrated in figure 8.13 is that the performance of linear regression depends on our ability to hand-pick the correct basis functions. For cases involving one or two covariates, this task is feasible from a simple visual representation of the data set. This task becomes difficult when the number of covariates $\mathtt{X} > 2$. In short, linear regression is simple and parameter estimation is quick, yet it is outperformed by modern techniques such as Gaussian process regression and neural networks, which will be covered in the next sections. Note that the cross-validation technique covered here is not limited to linear regression and can be employed with a wide range of machine learning methods.
8.2 Gaussian Process Regression
Gaussian process regression (GPR) works by describing the prior knowledge of system responses over its covariate domain using a joint multivariate Normal probability density function (PDF). Then, it employs the properties of the multivariate Normal (see §4.1.3) in order to update the joint prior PDF using empirical observations.
Figure 8.14: Example of a 100 km long pipeline.
Pipeline example Take for example the 100 km long pipeline illustrated in figure 8.14, for which we are interested in quantifying the temperature at any coordinate $x \in (0, 100)$ km. Given that we know the pipeline temperature to be $y = 8$ °C at $x = 30$ km, what is the temperature at $x = 30.1$ km? Because of the proximity between the two locations, it is intuitive that the temperature has to be close to 8 °C. Then, what is the temperature at $x = 70$ km? We cannot say for sure that it is going to be around 8 °C because over a 40 km distance, the change in temperature can reach a few degrees. This example illustrates how we can take advantage of the underlying dependencies between system responses for different observed covariates $x$, for predicting the responses for unobserved covariates.
Note: As for linear regression (see §8.1), covariates $x$ are assumed to be deterministic constants that are free from observation errors.
Let us consider system responses defined by a single covariate $x_i$ so that
$$\underbrace{g_i}_{\text{observation}} = \overbrace{g(x_i)}^{\text{model}}.$$
The data set is defined as $\mathcal{D} = \{(x_i, g_i),\ \forall i \in \{1\!:\!\mathtt{D}\}\} = \{\mathcal{D}_x, \mathcal{D}_g\}$, where for $\mathtt{D}$ observations, $\mathcal{D}_g = \{\mathbf{g}\} = \{[g_1\ g_2\ \cdots\ g_{\mathtt{D}}]^\intercal\}$ and $\mathcal{D}_x = \{\mathbf{x}\} = \{[x_1\ x_2\ \cdots\ x_{\mathtt{D}}]^\intercal\}$, so for each observed $g_i$, there is an associated covariate $x_i$. A Gaussian process is a multivariate Normal random variable defined over a domain described by covariates. In the case of Gaussian process regression, a set of system responses is assumed to be a realization from a Gaussian process $g(\mathbf{x}): \mathbf{G}(\mathbf{x}) \sim \mathcal{N}(g(\mathbf{x}); \boldsymbol{\mu}_G, \boldsymbol{\Sigma}_G)$, where $\boldsymbol{\mu}_G$ is the mean vector, $[\boldsymbol{\mu}_G]_i = \mu_G(x_i)$, and the covariance matrix $\boldsymbol{\Sigma}_G$ is
$$[\boldsymbol{\Sigma}_G]_{ij} = \rho(x_i, x_j)\cdot\sigma_G(x_i)\cdot\sigma_G(x_j),\qquad \rho(x_i, x_j) = \mathrm{fct}(x_i - x_j).$$
$\boldsymbol{\Sigma}_G$ defines the covariance between a discrete set of Gaussian random variables for which the pairwise correlation $\rho(x_i, x_j)$ between $G(x_i)$ and $G(x_j)$ is a function of the distance between the covariates $x_i$ and $x_j$. Note that $\boldsymbol{\mu}_G$ and $\boldsymbol{\Sigma}_G$ depend implicitly on the covariates $\mathbf{x}$.
Figure 8.15: Representation of the prior knowledge for the temperature along a pipeline. The dashed horizontal red line represents the prior mean, and the pink-shaded area is the $\pm 2\sigma_G$ confidence region. Overlaid colored curves represent five realizations of the Gaussian process, $\mathbf{g}_i: \mathbf{G} \sim \mathcal{N}(g(\mathbf{x}); \boldsymbol{\mu}_G, \boldsymbol{\Sigma}_G)$.
Pipeline example (continued) Figure 8.15 illustrates our prior knowledge of the temperature for coordinates $\mathbf{x} = [0\ 1\ \cdots\ 100]^\intercal$ km,
$$\underbrace{\mathbf{G} \sim \mathcal{N}(g(\mathbf{x}); \boldsymbol{\mu}_G, \boldsymbol{\Sigma}_G)}_{\text{Prior knowledge}},\qquad [\boldsymbol{\Sigma}_G]_{ij} = \rho(x_i, x_j)\,\sigma_G^2,$$
where each colored curve represents a different realization of the Gaussian process $\mathbf{G}$. The prior means and standard deviations are assumed to be equal for all locations $x$, so $\mu_G(x) = \mu_G = 0$ °C and $\sigma_G(x_i) = \sigma_G = 2.5$ °C. The correlation between the temperature at two locations $G(x_i)$ and $G(x_j)$ is defined here by a square-exponential correlation function
$$\rho(x_i, x_j) = \exp\!\left(-\frac{1}{2}\frac{(x_i - x_j)^2}{\ell^2}\right),$$
where $\ell$ is the length-scale parameter. Figure 8.16 presents the square-exponential correlation function plotted for parameter values $\ell = \{1, 2, 5\}$.
Figure 8.16: Square-exponential correlation function.
In figure 8.15, for a length-scale of $\ell = 25$ km, the correlation for the three pairs of coordinates $x = \{30, 31, 70\}$ is
$$\left.\begin{aligned} \rho(x_1 = 30\ \text{km},\ x_2 = 31\ \text{km}) &= 0.999\\ \rho(x_1 = 30\ \text{km},\ x_3 = 70\ \text{km}) &= 0.278\\ \rho(x_2 = 31\ \text{km},\ x_3 = 70\ \text{km}) &= 0.296 \end{aligned}\right\}\ \text{for } \ell = 25\ \text{km}.$$
Figure 8.17 presents examples of realizations for the same Gaussian process using different length-scale parameters.
Figure 8.17: Examples of Gaussian process realizations for different length-scale parameter values: (a) $\ell = 1$ km; (b) $\ell = 10$ km; (c) $\ell = 100$ km. The dashed horizontal red line represents the prior mean, and the pink-shaded area is the $\pm 2\sigma_G$ confidence region.
8.2.1 Updating a GP Using Exact Observations
Let us consider $\mathcal{D} = \{(x_i, g_i),\ i \in \{1\!:\!\mathtt{D}\}\}$, a set of $\mathtt{D}$ exact observations (i.e., without observation errors), and $\mathbf{x}^* = [x_1^*\ x_2^*\ \cdots\ x_{\mathtt{P}}^*]^\intercal$, a vector of $\mathtt{P}$ covariates corresponding to either observed or unobserved locations where we want to obtain predictions. The goal is to obtain the posterior PDF $f(\mathbf{g}^*|\mathbf{x}^*, \mathcal{D})$. Before obtaining this posterior PDF, we have to define the joint prior knowledge for the system responses at both the observed $\mathbf{x}$ and prediction $\mathbf{x}^*$ locations,
$$\begin{bmatrix}\mathbf{G}\\ \mathbf{G}^*\end{bmatrix},\qquad \boldsymbol{\mu} = \begin{bmatrix}\boldsymbol{\mu}_G\\ \boldsymbol{\mu}^*\end{bmatrix},\qquad \boldsymbol{\Sigma} = \underbrace{\begin{bmatrix}\boldsymbol{\Sigma}_G & \boldsymbol{\Sigma}_{G*}\\ \boldsymbol{\Sigma}_{G*}^\intercal & \boldsymbol{\Sigma}^*\end{bmatrix}}_{\text{Prior knowledge}},$$
where $\boldsymbol{\mu}_G$ and $\boldsymbol{\mu}^*$ are, respectively, the prior means at observed and at prediction locations; $[\boldsymbol{\Sigma}_G]_{ij} = \rho(x_i, x_j)\sigma_G^2$ and $[\boldsymbol{\Sigma}^*]_{ij} = \rho(x_i^*, x_j^*)\sigma_G^2$ are their respective covariances. The covariance between observed and prediction locations is described by $[\boldsymbol{\Sigma}_{G*}]_{ij} = \rho(x_i, x_j^*)\sigma_G^2$. The posterior probability of $\mathbf{G}^*$ at prediction locations $\mathbf{x}^*$, conditioned on the observations $\mathcal{D}$, is given by $f_{\mathbf{G}^*|\mathbf{x}^*,\mathcal{D}}(\mathbf{g}^*|\mathbf{x}^*, \mathcal{D}) = \mathcal{N}(\mathbf{g}^*; \boldsymbol{\mu}^*_{|\mathcal{D}}, \boldsymbol{\Sigma}^*_{|\mathcal{D}})$, where the posterior mean vector and covariance matrix are
$$\underbrace{\begin{aligned}\boldsymbol{\mu}^*_{|\mathcal{D}} &= \boldsymbol{\mu}^* + \boldsymbol{\Sigma}_{G*}^\intercal\,\boldsymbol{\Sigma}_G^{-1}(\mathbf{g} - \boldsymbol{\mu}_G)\\ \boldsymbol{\Sigma}^*_{|\mathcal{D}} &= \boldsymbol{\Sigma}^* - \boldsymbol{\Sigma}_{G*}^\intercal\,\boldsymbol{\Sigma}_G^{-1}\boldsymbol{\Sigma}_{G*}\end{aligned}}_{\text{Posterior knowledge}}.$$
These two equations are analogous to those presented in §4.1.3 for the conditional distribution of a multivariate Normal ($\equiv$ Gaussian).
Nomenclature
$\mathbf{x} = [x_1\ x_2\ \cdots\ x_{\mathtt{D}}]^\intercal$: Covariates for $\mathtt{D}$ observed locations
$\mathbf{x}^* = [x_1^*\ x_2^*\ \cdots\ x_{\mathtt{P}}^*]^\intercal$: Covariates for $\mathtt{P}$ prediction locations
$\mathbf{g} = [g_1\ g_2\ \cdots\ g_{\mathtt{D}}]^\intercal$: System responses for $\mathtt{D}$ observed locations
$\mathbf{g}^* = [g_1^*\ g_2^*\ \cdots\ g_{\mathtt{P}}^*]^\intercal$: Model predictions for $\mathtt{P}$ locations
$\boldsymbol{\Sigma}_G$: Prior covariance of $\mathbf{G}$ for observed locations $\mathbf{x}$
$\boldsymbol{\Sigma}^*$: Prior covariance of $\mathbf{G}^*$ for prediction locations $\mathbf{x}^*$
$\boldsymbol{\Sigma}_{G*}$: Prior covariance of $\mathbf{G}$ between observed and prediction locations, $\mathbf{x}$ and $\mathbf{x}^*$, respectively
Strengths and limitations One of the main differences between Gaussian process regression and other regression methods that encode the relationships contained in the data set through a parametric model is that GPR is a nonparametric approach, for which predictions depend explicitly on the observations available in the data set. The upside, in comparison with other methods such as linear regression, is that there is no requirement to define basis functions that are compatible with the system responses. GPR thus allows modeling highly nonlinear responses using few parameters. Another key feature of GPR is that it is an intrinsically probabilistic method that allows differentiating between the prediction quality in interpolation and extrapolation situations. For example, figures 8.19c and 8.19d show how interpolating between observations leads to a narrower confidence region than when extrapolating. One of the main downsides of Gaussian process regression is that when the data set is large ($\mathtt{D} > 1000$), building the matrices describing the joint prior knowledge becomes computationally demanding, and it is then necessary to employ approximate methods.²
² Quiñonero-Candela, J. and C. E. Rasmussen (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research 6, 1939-1959
Pipeline example (continued) We consider the case where the pipeline temperature $g_2 = -2$ °C is observed for the covariate $x_2$ and we want to use this information to obtain the posterior PDF describing the temperature $G_1$ at a location $x_1$. Given $\mu_G = 0$ °C, $\sigma_G = 2.5$ °C, and $\rho(x_1, x_2) = 0.8$, the joint PDF describing the prior probability of the temperature at locations $x_1$ and $x_2$ is defined by
$$f_{G_1 G_2}(g_1, g_2) = \mathcal{N}(\mathbf{g}; \boldsymbol{\mu}_G, \boldsymbol{\Sigma}_G)\quad \begin{cases}\boldsymbol{\mu}_G = [0\ 0]^\intercal\\[2pt] \boldsymbol{\Sigma}_G = \begin{bmatrix}2.5^2 & 0.8\cdot 2.5^2\\ 0.8\cdot 2.5^2 & 2.5^2\end{bmatrix}.\end{cases}$$
The conditional probability of $G_1$ given the observation $g_2$ is
$$f_{G_1|g_2}(g_1|g_2) = \frac{f_{G_1 G_2}(g_1, g_2)}{f_{G_2}(g_2)} = \mathcal{N}\big(g_1; \mu_{1|2}, \sigma_{1|2}^2\big),$$
where the posterior mean and standard deviation are
$$\begin{aligned}\mu_{1|2} &= \mu_1 + \rho\,\sigma_1\,\frac{g_2 - \mu_2}{\sigma_2} = 0 + 0.8\cdot 2.5\cdot\frac{-2 - 0}{2.5} = -1.6\ \text{°C}\\ \sigma_{1|2} &= \sigma_1\sqrt{1 - \rho^2} = 2.5\sqrt{1 - 0.8^2} = 1.5\ \text{°C}.\end{aligned}$$
Figure 8.18: Joint Gaussian prior knowledge of the temperature at two locations, $f_{G_1,G_2}(g_1, g_2)$, and conditional PDF for the temperature $g_1$, given the observation $g_2 = -2$, $f_{G_1|g_2}(g_1|g_2)$.
Again, these last two equations are analogous to those presented in §4.1.3. The joint prior and posterior PDFs are presented in figure 8.18. Figure 8.19 presents the generalization of this update process for $\mathtt{P} = 101$ predicted locations $\mathbf{x}^* = [0\ 1\ \cdots\ 100]^\intercal$ km and using, respectively, $\mathtt{D} = \{0, 1, 2, 3\}$ observations. Note how the uncertainty reduces to zero at locations where observations are available. Realizations of the updated Gaussian process are then required to pass through all observed values. The correlation embedded in the joint prior knowledge allows sharing the information carried by observations for predicting the system responses at adjacent unobserved covariate values.
Figure 8.19: Examples of how the Gaussian process describing the pipeline temperature evolves as it is updated with different data sets $\mathcal{D}$: (a) no observations; (b) one observation at $x = 31$ km; (c) two observations at $x = \{31, 70\}$ km; (d) three observations at $x = \{31, 70, 30\}$ km.
8.2.2 Updating a GP Using Imperfect Observations
We now explore how to extend the formulation presented in the previous section to the case where observations are contaminated by noise. The observation model is now defined as
$$\underbrace{y}_{\text{observation}} = \overbrace{g(x)}^{\text{model}} + \underbrace{v}_{\text{measurement error}},$$
where $v: V \sim \mathcal{N}(v; 0, \sigma_V^2)$ describes zero-mean independent observation errors so that $V_i \perp\!\!\!\perp V_j\ \forall i \neq j$. Because $g(x): G(x)$ is now indirectly observed through $y$, we have to redefine our prior knowledge between the observations $\mathbf{Y}$ and the model responses $\mathbf{G}^*$ at prediction locations. The joint prior knowledge is described by
$$\begin{bmatrix}\mathbf{Y}\\ \mathbf{G}^*\end{bmatrix},\qquad \mathbf{m} = \begin{bmatrix}\boldsymbol{\mu}_Y\\ \boldsymbol{\mu}_{G*}\end{bmatrix},\qquad \boldsymbol{\Sigma} = \underbrace{\begin{bmatrix}\boldsymbol{\Sigma}_Y & \boldsymbol{\Sigma}_{Y*}\\ \boldsymbol{\Sigma}_{Y*}^\intercal & \boldsymbol{\Sigma}^*\end{bmatrix}}_{\text{prior knowledge}}.$$
In the case where the observation noise has a mean equal to zero, the mean vector for observations equals the prior mean for the system responses, $\boldsymbol{\mu}_Y = \boldsymbol{\mu}_G$. The global covariance matrix is then defined using
$$\begin{aligned}\ [\boldsymbol{\Sigma}_Y]_{ij} &= \rho(x_i, x_j)\sigma_G^2 + \sigma_V^2\,\delta_{ij}, &&\text{where } \delta_{ij} = 1 \text{ if } i = j\\ [\boldsymbol{\Sigma}^*]_{ij} &= \rho(x_i^*, x_j^*)\sigma_G^2, && \phantom{\text{where }}\delta_{ij} = 0 \text{ if } i \neq j\\ [\boldsymbol{\Sigma}_{Y*}]_{ij} &= \rho(x_i, x_j^*)\sigma_G^2.\end{aligned}$$
Given the prediction locations $\mathbf{x}^*$ and the observations $\mathcal{D}$, the posterior knowledge for the system responses $\mathbf{G}^*$ is
$$f_{\mathbf{G}^*|\mathbf{x}^*,\mathcal{D}}(\mathbf{g}^*|\mathbf{x}^*, \mathcal{D}) = \mathcal{N}(\mathbf{g}^*; \boldsymbol{\mu}^*_{|\mathcal{D}}, \boldsymbol{\Sigma}^*_{|\mathcal{D}}),$$
where the posterior mean vector $\boldsymbol{\mu}^*_{|\mathcal{D}}$ and covariance matrix $\boldsymbol{\Sigma}^*_{|\mathcal{D}}$ are
$$\underbrace{\begin{aligned}\boldsymbol{\mu}^*_{|\mathcal{D}} &= \boldsymbol{\mu}_{G*} + \boldsymbol{\Sigma}_{Y*}^\intercal\,\boldsymbol{\Sigma}_Y^{-1}(\mathbf{y} - \boldsymbol{\mu}_Y)\\ \boldsymbol{\Sigma}^*_{|\mathcal{D}} &= \boldsymbol{\Sigma}^* - \boldsymbol{\Sigma}_{Y*}^\intercal\,\boldsymbol{\Sigma}_Y^{-1}\boldsymbol{\Sigma}_{Y*}\end{aligned}}_{\text{posterior knowledge}}.$$
Figure 8.20: Examples of how the Gaussian process describing the pipeline temperature evolves as it is updated with different data sets $\mathcal{D}$, with an observation standard deviation $\sigma_V = 1$ °C: (a) no observations; (b) one noisy observation at $x = 31$ km; (c) two noisy observations at $x = \{31, 70\}$ km; (d) three noisy observations at $x = \{31, 70, 30\}$ km.
Pipeline example (continued) We continue our study of the pipeline example by considering noisy observations characterized by a measurement-error standard deviation $\sigma_V = 1$ °C. Figure 8.20 presents the updated Gaussian process for $\mathtt{P} = 101$ prediction locations $\mathbf{x}^* = [0\ 1\ \cdots\ 100]^\intercal$ km, and using, respectively, $\mathtt{D} = \{0, 1, 2, 3\}$ noisy observations. Notice how, contrarily to the results presented in figure 8.19, uncertainty now remains in the posterior at observed locations. The larger $\sigma_V$ is, the less information observations carry and the less they modify the prior knowledge.
8.2.3 Multiple Covariates
$$\underbrace{y_i}_{\text{observation}} = \overbrace{g(\mathbf{x}_i)}^{\text{model}} + \underbrace{v}_{\text{measurement error}},\qquad \underbrace{\mathbf{x}_i = [x_1\ x_2\ \cdots\ x_{\mathtt{X}}]_i^\intercal}_{\text{covariates}}$$
$$\mathcal{D} = \{(\mathbf{x}_i, y_i),\ \forall i \in \{1\!:\!\mathtt{D}\}\} = \{\mathcal{D}_x, \mathcal{D}_y\}$$
In the case where the regression is performed for multiple covariates $\mathbf{x} = [x_1\ x_2\ \cdots\ x_{\mathtt{X}}]^\intercal$, it is necessary to define a joint correlation function defining the correlation coefficient associated with a pair of covariate vectors. Typically, the joint correlation function is obtained from the product of a marginal correlation function $\rho([\mathbf{x}_i]_k, [\mathbf{x}_j]_k)$ for each covariate,
$$\rho(\mathbf{x}_i, \mathbf{x}_j) = \prod_{k=1}^{\mathtt{X}} \rho\big([\mathbf{x}_i]_k, [\mathbf{x}_j]_k\big).$$
The multivariate formulation for the square-exponential correlation function is
$$\rho(\mathbf{x}_i, \mathbf{x}_j) = \prod_{k=1}^{\mathtt{X}} \exp\!\left(-\frac{1}{2}\frac{\big([\mathbf{x}_i]_k - [\mathbf{x}_j]_k\big)^2}{\ell_k^2}\right) \equiv \exp\!\left(-\frac{1}{2}(\mathbf{x}_i - \mathbf{x}_j)^\intercal\,\mathrm{diag}(\boldsymbol{\ell})^{-2}\,(\mathbf{x}_i - \mathbf{x}_j)\right),$$
where there is one parameter $\ell_k$ for each covariate. Figure 8.21 presents an example of a bivariate square-exponential correlation function.
Figure 8.21: Bivariate square-exponential correlation function $\rho(\mathbf{x}_i, \mathbf{x}_j)$ with parameters $\ell_1 = 1$ and $\ell_2 = 1.5$.
One relevant feature of this correlation function is that the length-scales $\ell_k$ directly measure the importance of each covariate for predicting the system responses. As a simple heuristic, we can say that if $\ell_k$ is greater than two times the empirical standard deviation of the observed covariates $\sigma_{X_k}$, then the $k^{\text{th}}$ covariate has a limited influence on the system response predictions. The reason is that the correlation coefficient remains close to one for any pairs of this $k^{\text{th}}$ covariate. In the opposite case, where $\ell_k$ is smaller than the empirical standard deviation $\sigma_{X_k}$, we can conclude that the $k^{\text{th}}$ covariate has a strong influence on the system response predictions.
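A sketch of this product-form correlation for covariate vectors, with one length-scale per covariate, could be written as follows (function and variable names are arbitrary):

```python
import numpy as np

def ard_sq_exp_corr(Xa, Xb, lengths):
    """rho(x_i, x_j) = prod_k exp(-0.5 ((x_ik - x_jk) / l_k)^2) for rows of Xa and Xb."""
    Xa = np.atleast_2d(Xa); Xb = np.atleast_2d(Xb)
    d = (Xa[:, None, :] - Xb[None, :, :]) / np.asarray(lengths)  # scaled differences
    return np.exp(-0.5 * np.sum(d**2, axis=-1))

# Example: two covariates with length-scales 1 and 1.5 (as in figure 8.21)
Xa = np.array([[0.0, 0.0], [1.0, 2.0]])
R = ard_sq_exp_corr(Xa, Xa, lengths=[1.0, 1.5])
```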
8.2.4 Parameter Estimation
Up to now, we have assumed that we knew the parameters such as the observation standard deviation $\sigma_V$, the Gaussian process prior standard deviation $\sigma_G$, and the length-scales $\boldsymbol{\ell} = [\ell_1\ \ell_2\ \cdots\ \ell_{\mathtt{X}}]^\intercal$. In practice, we need to learn these parameters $\boldsymbol{\theta} = [\sigma_V\ \sigma_G\ \boldsymbol{\ell}^\intercal]^\intercal$ from the training data $\mathcal{D} = \{\mathcal{D}_x, \mathcal{D}_y\} = \{(\mathbf{x}_i, y_i),\ i \in \{1\!:\!\mathtt{D}\}\}$.
Note: $\boldsymbol{\theta} = [\sigma_V\ \sigma_G\ \boldsymbol{\ell}^\intercal]^\intercal$ can be referred to as hyperparameters because they are parameters of the prior.
Following Bayes rule, the posterior PDF for parameters given data is
$$\underbrace{f(\boldsymbol{\theta}|\mathcal{D})}_{\text{posterior}} = \frac{\overbrace{f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta})}^{\text{likelihood}}\cdot\overbrace{f(\boldsymbol{\theta})}^{\text{prior}}}{\underbrace{f(\mathcal{D}_y)}_{\text{constant}}}.$$
Often, evaluating $f(\boldsymbol{\theta}|\mathcal{D}_x, \mathcal{D}_y)$ is computationally expensive, so a maximum likelihood estimation approach (see §6.7) is typically employed. The optimal parameter values $\boldsymbol{\theta}^*$ are then
$$\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}}\ \overbrace{f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta})}^{\text{likelihood}} \equiv \arg\max_{\boldsymbol{\theta}}\ \overbrace{\ln f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta})}^{\text{log-likelihood}}.$$
The likelihood function $f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta})$ describes the joint prior probability density of observations $\mathcal{D}_y$, given a set of parameters $\boldsymbol{\theta}$,
Note: Theoretically, here, we should refer to $f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta})$ as the marginal likelihood because the hidden variables $\mathbf{g}$ are marginalized.
$$f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta}) = \int f(\mathcal{D}_y|\mathbf{g})\cdot f(\mathbf{g}|\mathcal{D}_x, \boldsymbol{\theta})\,d\mathbf{g} = \mathcal{N}(\mathcal{D}_y; \boldsymbol{\mu}_Y, \boldsymbol{\Sigma}_Y).$$
The log-likelihood for a vector of observations $\mathbf{y} = [y_1\ y_2\ \cdots\ y_{\mathtt{D}}]^\intercal$ is given by
$$\begin{aligned}\ln f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta}) &= \ln\mathcal{N}(\mathbf{y}; \boldsymbol{\mu}_Y, \boldsymbol{\Sigma}_Y)\\ &= -\frac{1}{2}(\mathbf{y} - \boldsymbol{\mu}_Y)^\intercal\boldsymbol{\Sigma}_Y^{-1}(\mathbf{y} - \boldsymbol{\mu}_Y) - \frac{1}{2}\ln|\boldsymbol{\Sigma}_Y| - \frac{\mathtt{D}}{2}\ln 2\pi.\end{aligned}$$
Figure 8.22: Log-likelihood maximization.
The optimal parameters $\boldsymbol{\theta}^*$ correspond to the parameter values for which the derivative of the likelihood equals zero, as illustrated in figure 8.22. For $\boldsymbol{\mu}_Y = \mathbf{0}$, the derivative of the log-likelihood is
$$\frac{\partial}{\partial\theta_j}\ln f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta}) = \frac{1}{2}\mathbf{y}^\intercal\boldsymbol{\Sigma}_Y^{-1}\frac{\partial\boldsymbol{\Sigma}_Y}{\partial\theta_j}\boldsymbol{\Sigma}_Y^{-1}\mathbf{y} - \frac{1}{2}\mathrm{tr}\!\left(\boldsymbol{\Sigma}_Y^{-1}\frac{\partial\boldsymbol{\Sigma}_Y}{\partial\theta_j}\right).$$
Efficient gradient-based optimization algorithms for learning parameters by maximizing the log-likelihood are already implemented in open-source packages such as the GPML³, GPstuff⁴, and pyGPs⁵ toolboxes.
³ Rasmussen, C. E. and H. Nickisch (2010). Gaussian processes for machine learning (GPML) toolbox. The Journal of Machine Learning Research 11, 3011-3015
⁴ Vanhatalo, J., J. Riihimäki, J. Hartikainen, P. Jylänki, V. Tolvanen, and A. Vehtari (2013). GPstuff: Bayesian modeling with Gaussian processes. Journal of Machine Learning Research 14, 1175-1179
⁵ Neumann, M., S. Huang, D. E. Marthaler, and K. Kersting (2015). pyGPs: A python library for Gaussian process regression and classification. The Journal of Machine Learning Research 16(1), 2611-2616
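As an alternative to those toolboxes, the negative log-likelihood above can be written directly and handed to a general-purpose optimizer. The sketch below does so with `scipy.optimize.minimize` for the single-covariate case, parameterizing $\{\sigma_V, \sigma_G, \ell\}$ in log space to keep them positive; it is an illustration under assumptions (zero prior mean, square-exponential correlation, synthetic data), not the implementation used in the cited packages.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(log_theta, x, y):
    """-ln N(y; 0, Sigma_Y) with theta = (sigma_V, sigma_G, ell), zero prior mean."""
    sigma_V, sigma_G, ell = np.exp(log_theta)
    d = x[:, None] - x[None, :]
    Sigma_Y = sigma_G**2 * np.exp(-0.5 * (d / ell)**2) + sigma_V**2 * np.eye(x.size)
    L = np.linalg.cholesky(Sigma_Y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))      # Sigma_Y^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return 0.5 * y @ alpha + 0.5 * log_det + 0.5 * x.size * np.log(2 * np.pi)

# Synthetic 1-D data (for illustration only)
x_obs = np.linspace(0, 10, 25)
y_obs = np.sin(x_obs) + 0.2 * np.random.randn(x_obs.size)

# theta* = arg min -ln f(D_y | D_x, theta), from an arbitrary starting point
res = minimize(neg_log_likelihood, x0=np.log([0.5, 1.0, 1.0]), args=(x_obs, y_obs))
sigma_V, sigma_G, ell = np.exp(res.x)
```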
8.2.5 Example: Soil Contamination Characterization
This section presents an applied example of GPR for the characterization of contaminant concentration in soils. Take, for example, a set of observations $y_i$ of soil contaminant concentrations,
$$\mathcal{D} = \{\mathcal{D}_x, \mathcal{D}_y\} = \{(\mathbf{l}_i, y_i),\ i \in \{1\!:\!116\}\},$$
where $\mathbf{l}_i$ is a covariate vector associated with each concentration observation. These covariates denote the geographic coordinates (longitude, latitude, and depth) of each observation,
$$\mathbf{l}_i = \underbrace{[x\ y\ z]_i^\intercal}_{\text{geodesic coordinates}}.$$
Each observation is represented in figure 8.23 by a circle where its size is proportional to the contaminant concentration. The goal is to employ Gaussian process regression in order to model the contaminant concentration across the whole soil volume, given the available observations for 116 discrete locations.
Figure 8.23: Example of application of GPR for soil contamination characterization. The iso-contours correspond to $\Pr([\text{contaminant}] > 55\ \text{mg/kg}) = 0.5$, and the size of each circle is proportional to the observed contaminant concentration. (Adapted from Quach et al. (2017).)
Because a concentration must be a positive number, and because several of the observations available are close to zero, the observation model is defined in the log-space,
$$\underbrace{\ln y}_{\text{observation}} = \overbrace{g(\mathbf{l})}^{\text{true contaminant concentration in log space}} + \underbrace{v}_{\text{meas. error in log space}},$$
where $g(\mathbf{l})$ models the natural logarithm of the true contaminant concentration. When transformed back to the original space using the exponential function, the model predictions are restricted to positive values. The observation errors are assumed to be described by the zero-mean independent Normal random variable $v: V \sim \mathcal{N}(v; 0, \sigma_V^2)$, $V_i \perp\!\!\!\perp V_j,\ \forall i \neq j$. The correlation between covariates is modeled using a square-exponential correlation function, and the prior mean at any location is assumed to be equal to one in the original space, which corresponds to zero in the log-transformed space.
The parameters to be estimated from the data are the length-scale along each direction, the Gaussian process standard deviation, and the observation noise standard deviation,
$$\boldsymbol{\theta} = \{\ell_x, \ell_y, \ell_z, \sigma_G, \sigma_V\}.$$
The maximum likelihood estimate (MLE) parameter values are
$$\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}}\ \overbrace{\ln f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta})}^{\text{log-likelihood}} = \left\{\begin{aligned}\ell_x^* &= 56.4\ \text{m}\\ \ell_y^* &= 53.7\ \text{m}\\ \ell_z^* &= 0.64\ \text{m}\\ \sigma_G^* &= 2.86\\ \sigma_V^* &= 0.18\end{aligned}\right\}.$$
The optimal values for the length-scales $\{\ell_x, \ell_y, \ell_z\}$ are all smaller than the range of each covariate, as depicted in figure 8.23. This confirms that all three coordinates have an influence on the contaminant concentration. Figure 8.23 presents the iso-contours describing $\Pr([\text{contaminant}] > 55\ \text{mg/kg}) = 0.5$, where $\mathbf{l}^*$ are the covariates describing the $x, y, z$ geographic coordinates of a 3-D grid covering the entire soil volume for the site studied. Further details about this example are presented in the paper by Quach et al.⁶
Note — Covariate importance metric: if $\ell_k \gg \sigma_{X_k}$: $\mathrm{Imp}(X_k \to y) \approx 0$; if $\ell_k < \sigma_{X_k}$: $\mathrm{Imp}(X_k \to y) \gg 0$.
⁶ Quach, A. N. O., D. Tabor, L. Dumont, B. Courcelles, and J.-A. Goulet (2017). A machine learning approach for characterizing soil contamination in the presence of physical site discontinuities and aggregated samples. Advanced Engineering Informatics 33, 60-67
8.2.6 Example: Metamodel
In civil engineering, it is common to employ computationally demanding models such as the finite element models presented in figure 8.24. When performing probabilistic studies related to structural or system reliability, it is often required to have millions of model realizations for different parameter values. In such a case, for the purpose of computational efficiency, it may be interesting to replace the finite element model with a regression model built from a set of simulated responses. Such a model of a model is called a metamodel or a surrogate model. Because observations are the output of a simulation, the observation model does not include any observation error and follows the formulation presented in §8.2.1.
$$\underbrace{g_i}_{\text{simulation}} = \overbrace{g(\mathbf{x}_i)}^{\text{model}},\qquad \underbrace{\mathbf{x}_i = [x_1\ x_2\ \cdots\ x_{\mathtt{X}}]_i^\intercal}_{\text{covariates}}$$
Figure 8.24: Examples of finite element structural models.
In this example, we built a metamodel for the Tamar Bridge finite element model.⁷ Figure 8.25a presents an overview of the bridge finite element model, and figure 8.25b presents the modal displacement associated with its first natural frequency, which is the response modeled in this example.
⁷ Westgate, R. J. and J. M. W. Brownjohn (2011). Development of a Tamar Bridge finite element model. In Conference Proceedings of the Society for Experimental Mechanics, Volume 5, pp. 13-20. Springer
Figure 8.25: Tamar Bridge finite element model: (a) model representation; (b) 1st vibration mode.
A set of $\mathtt{D} = 1000$ model simulations is obtained by evaluating the finite element model for several sets of parameters that are generated from the prior $f(\boldsymbol{\theta})$. There are five model parameters that are treated as covariates of the GP regression model, $\mathbf{x}_i = [x_1\ x_2\ \cdots\ x_5]_i^\intercal$: the stiffness of the three expansion joints located on the bridge deck, along with the initial strain in the cables on the main and side spans. The locations of elements affected by these parameters are illustrated in figure 8.26.
Figure 8.26: Tamar Bridge model parameter description: Saltash side deck, Plymouth side deck, and Saltash tower deck expansion joint stiffnesses; main cable and sidespan cable initial strains.
This problem employs the square-exponential correlation function, so there is a total of six parameters to be learned,
$$\boldsymbol{\theta} = \{\ell_1, \ell_2, \ell_3, \ell_4, \ell_5, \sigma_G\}.$$
Figure 8.27 compares the GPR model predictions with the true finite element model outputs $g_i$. Note that in order to obtain a meaningful comparison between predicted and measured values, it is essential to test the model iteratively using a cross-validation procedure, where at each step, the observation corresponding to the prediction location in the validation set is removed from the training set.
Figure 8.27: Comparison of the GPR model predictions $\mu^*_{|\mathcal{D},i} \pm \sigma^*_{|\mathcal{D},i}$ with the finite-element model predictions $g_i$ using the leave-one-out cross-validation procedure.
8.2.7 Advanced Considerations
There are several aspects that should be considered in order to take advantage of the full potential of Gaussian process regression. Key aspects are related to (1) the definition of the prior mean functions, (2) the choice of covariance and correlation functions, and (3) the formulation required for modeling cases involving heteroscedastic errors.
Prior mean functions In the examples presented in §8.2, the prior mean was always assumed to be a known constant. If needed, this assumption can be relaxed by treating the prior mean as a parameterized function to be inferred from data. In its simplest case, the mean can be assumed to be a single unknown constant. In more advanced cases we could describe our prior knowledge of the mean as a parameterized function that depends on covariates $\mathbf{x}$. For example, if there is a single covariate, the mean could be modeled by the affine function $m(x) = ax + b$, where $\{a, b\}$ are parameters that need to be estimated from data. When using complex functions to describe the prior mean, it is essential to keep in mind that this procedure is sensitive to overfitting in a similar way as linear regression (see §8.1.2). Employing a highly flexible parameterization may lead to an increased log-likelihood in the training set, yet it may not materialize in the prediction performance evaluated on the test set. It is therefore essential to choose the best prior parameterization using methods such as cross-validation (see §8.1.2) or Bayesian model selection (see §6.8).
Covariance functions In this section, only the square-exponential correlation function was described; nonetheless, several other choices exist. In his thesis, Duvenaud⁸ presents several covariance models. A common one is the Matern-class correlation function, for which a special case is defined as
$$\rho(x_i, x_j) = \exp\!\left(-\frac{|x_i - x_j|}{\ell}\right).$$
Figure 8.28 represents this correlation function for several length-scale values.
⁸ Duvenaud, D. (2014). Automatic model construction with Gaussian processes. PhD thesis, University of Cambridge
Figure 8.28: An example of Matern-class correlation function.
When using the square-exponential correlation function with covariate values that are close to each other, it is common to obtain correlation coefficients $\rho(\mathbf{x}_i, \mathbf{x}_j)$ close to one. This has the adverse effect of leading to singular or near-singular covariance matrices (see §2.4.2). Matern-class correlation functions can mitigate this problem because of their rapid reduction of the correlation coefficient with an increase in the absolute distance between covariates.
Another correlation function particularly useful for engineering applications is the periodic one, defined as
$$\rho(x_i, x_j) = \exp\!\left(-\frac{2}{\ell^2}\sin^2\!\left(\frac{x_i - x_j}{p}\pi\right)\right).$$
Figure 8.29: Examples of periodic correlation functions: (a) $p = 5$, $\ell = 1$; (b) $p = 5$, $\ell = 0.5$; (c) $p = 10$, $\ell = 1$.
Figure 8.29 presents examples of the periodic correlation function for periods $p = \{5, 10\}$ and length-scale parameters $\ell = \{0.5, 1\}$. As its name indicates, this function is suited for modeling periodic phenomena, where events spaced by a distance close to the period $p$ will have a high correlation. The length-scale $\ell$ controls how fast the correlation decreases as the distance between covariates gets away from an integer that is a multiple of the period $p$.
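A sketch of these two alternative correlation functions, following the definitions above (function names are arbitrary):

```python
import numpy as np

def matern_exp_corr(xi, xj, ell):
    """Special case of the Matern-class correlation: exp(-|x_i - x_j| / ell)."""
    return np.exp(-np.abs(xi - xj) / ell)

def periodic_corr(xi, xj, ell, p):
    """Periodic correlation: exp(-(2 / ell^2) * sin^2(pi (x_i - x_j) / p))."""
    return np.exp(-2.0 / ell**2 * np.sin(np.pi * (xi - xj) / p) ** 2)

# Example values matching figure 8.29a: p = 5, ell = 1
r = periodic_corr(np.array([0.0, 2.5, 5.0]), 0.0, ell=1.0, p=5.0)
```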
Figure 8.30 presents an example of the periodic correlation function employed to model the historical daily average temperatures recorded at Montreal airport from 1953 to 1960. The blue circles represent the training data that were employed to learn the parameters: the prior mean value $\mu$, the Gaussian process prior standard deviation $\sigma_G$, the observation standard deviation $\sigma_V$, and the period $p$. The pink crosses represent the test data that was not employed to perform predictions or to learn parameters. Most observations from the test set lie within the $\pm 2\sigma$ posterior interval. It shows how Gaussian process regression can, even with such a simple model with only four parameters, capture the temperature pattern as well as its uncertainty.
Figure 8.30: Example of application of a periodic covariance function to model temperature data; the shaded region represents $\mu^*_{|\mathcal{D}} \pm 2\sqrt{\sigma^{2*}_{|\mathcal{D}} + \sigma_V^2}$.
For more complex problems, highly complex covariance structures can be built by adding together simpler ones, such as those described in this section (see Rasmussen and Williams⁹ for details).
⁹ Rasmussen, C. E. and C. K. Williams (2006). Gaussian processes for machine learning. MIT Press
The choice regarding which correlation or covariance structure is most appropriate should, as for the prior mean functions, be based on methods such as cross-validation (see §8.1.2) or Bayesian model selection (see §6.8).
Homo- and heteroscedasticity So far, this section has only treated cases associated with homoscedastic errors such as the generic case presented in figure 8.31a. In engineering applications, it is common to be confronted with cases where the variances $\sigma_V^2$ or $\sigma_G^2$ depend on the covariate values. A generic example of heteroscedastic cases is presented in figure 8.31b, which has been modeled by considering that the observation-error standard deviation $\sigma_V$ is itself modeled as a Gaussian process that depends on covariate values.¹⁰ This method is implemented in the GPstuff¹¹ toolbox.
¹⁰ Tolvanen, V., P. Jylänki, and A. Vehtari (2014). Expectation propagation for nonstationary heteroscedastic Gaussian process regression. In IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 1-6. IEEE
¹¹ Vanhatalo, J., J. Riihimäki, J. Hartikainen, P. Jylänki, V. Tolvanen, and A. Vehtari (2013). GPstuff: Bayesian modeling with Gaussian processes. Journal of Machine Learning Research 14, 1175-1179
Figure 8.31: Examples of (a) homoscedastic and (b) heteroscedastic cases where the shaded regions represent $\mu_Y \pm 2\sigma_Y$.
8.3 Neural Networks
The idea behind neural networks originates from the perceptron model¹² developed in the 1950s. Neural networks became extremely popular in the period from 1980 until the mid-1990s, when they became overshadowed by other emerging methods. It was not until the early 2010s, with the appellation of deep learning, that they again took the leading role in the development of machine learning methods. In deep learning, the label deep refers to the great number of layers in the model. The rise of deep learning as the state-of-the-art method was caused by three main factors:¹³ (1) the increasing size of the data sets to train on, (2) the increase in computational capacity brought by graphical processing units (GPUs), and (3) the improvements in the methods allowing to train deeper models. One limitation is that in order to achieve the performance it is renowned for, deep learning requires large data sets containing from thousands to millions of labeled examples for covariates $\mathbf{x}_i$ and system responses $y_i$. In the fields of computer vision, speech recognition, and genomics, it is common to deal with data sets of such a large size; in the civil engineering context, it is not yet the case.
¹² Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65(6), 386
¹³ Goodfellow, I., Y. Bengio, and A. Courville (2016). Deep learning. MIT Press
This section presents the basic formulation for the simplest neural network architecture: the feedforward neural network. The reader interested in a detailed review of neural networks and their numerous architectures, such as convolutional neural networks (CNN), generative adversarial networks (GAN), and recurrent neural networks (RNN, LSTM), should refer to specialized textbooks such as the one by Goodfellow, Bengio, and Courville (2016).
8.3.1 Feedforward Neural Networks
Feedforward is the simplest architecture for neural networks, where information is transferred from the input layer to the output layer by propagating it into layers of hidden variables. The idea is to approximate complex functions by a succession of simple combinations of hidden variables organized in layers.
Figure 8.32: Representation of linear regression as a feedforward neural network. The leftmost nodes represent covariates, the white node is a hidden variable, and the rightmost node represents the observation.
Linear regression The linear regression method presented in §8.1 is employed to introduce the concepts surrounding feedforward neural networks. Figure 8.32 presents a regression problem where three covariates $\mathbf{x} = [x_1\ x_2\ x_3]^\intercal$ are associated with a single observed system response $y = z + v$, $v: V \sim \mathcal{N}(v; 0, \sigma_V^2)$, where $v$ is a realization of a zero-mean normal observation error. $z$ represents a hidden variable, which is defined by the linear combination of covariate values using three weight parameters and one bias parameter,
$$z = w_1 x_1 + w_2 x_2 + w_3 x_3 + b.$$
In figure 8.32, the leftmost nodes represent covariates, the white node is a hidden variable, and the rightmost node is the observation $y_i: Y \sim \mathcal{N}(y; z, \sigma_V^2)$. Arrows represent the dependencies defined by the weight and bias parameters $\{w_1, w_2, w_3, b\}$. Weight parameter $w_i$ describes the slope of a plane with respect to each variable, and $b$ the plane's intercept (e.g., see figure 2.2). With the current configuration, this model is limited because it is only a linear function, that is, the hidden variable $z$ is a linear function with respect to covariates $\mathbf{x}$.
Linear regression with multiple layers Now we explore the effect of modifying the model as presented in figure 8.33 by adding multiple hidden variables, where each one is modeled as a linear combination of either the covariates or the hidden variables on the preceding layer. First, let us define the notation employed to describe this network: $z_i^{(l)}$ represents the $i^{\text{th}}$ hidden variable on the $l^{\text{th}}$ layer. In the first layer, the two hidden variables are defined by
$$\begin{aligned} z_1^{(1)} &= w_{1,1}^{(0)}x_1 + w_{1,2}^{(0)}x_2 + w_{1,3}^{(0)}x_3 + b_1^{(0)}\\ z_2^{(1)} &= w_{2,1}^{(0)}x_1 + w_{2,2}^{(0)}x_2 + w_{2,3}^{(0)}x_3 + b_2^{(0)},\end{aligned}$$
Figure 8.33: Representation of linear regression using multiple layers. The leftmost nodes represent covariates, the white nodes the hidden variables, and the rightmost node the observation.
Nomenclature
$z_i^{(l)}$: the $i^{\text{th}}$ hidden variable on the $l^{\text{th}}$ layer
$w_{i,j}^{(l)}$: the weight defining the dependence between the $i^{\text{th}}$ variable of the $(l+1)^{\text{th}}$ layer and the $j^{\text{th}}$ hidden variable of the $(l)^{\text{th}}$ layer
$b_i^{(l)}$: the plane's intercept for the $i^{\text{th}}$ hidden variable of the $(l+1)^{\text{th}}$ layer
$x_j$: $j^{\text{th}}$ covariate
where $w_{i,j}^{(0)}$ is the weight defining the dependence between the $j^{\text{th}}$ variable of the $0^{\text{th}}$ layer (i.e., the covariates themselves) and the $i^{\text{th}}$ hidden variable of the $1^{\text{st}}$ layer. Following the same notation, $b_i^{(0)}$ describes the plane's intercept for the $i^{\text{th}}$ hidden variable of the $1^{\text{st}}$ layer. The only hidden variable on the second layer is defined as a linear combination of the two hidden variables on layer 1, so
$$z_1^{(2)} = w_{1,1}^{(1)}z_1^{(1)} + w_{1,2}^{(1)}z_2^{(1)} + b^{(1)}.$$
The observation model is now defined by
y
=
z
(2)
1
+
v, v
:
V
N
(
v
;0
,
2
V
), so each observation of the system response is defined
as
y
i
:
Y N
(
y
;
z
(2)
1
,
2
V
). Note that
z
(2)
1
implicitly depends on
the covariates
x
, so that there are covariates implicitly associated
with each observation
y
i
. The complete data set is
D
=
{D
x
, D
y
}
=
{(x
i
,y
i
), 8i 2{1:D}}.
In this second case, adding hidden variables did not help us to generalize our model to nonlinear system responses because linear functions of linear functions are themselves still linear. Therefore, no matter how many layers of hidden variables we add, the final model remains a hyperplane. In order to describe nonlinear functions, we need to introduce nonlinearities using activation functions.
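The following sketch, with randomly assumed weights and biases, verifies numerically that two stacked linear layers reduce to a single linear map, which is why the multilayer model of figure 8.33 remains a hyperplane.

```python
# Minimal check (assumed parameter values) that two stacked linear layers collapse
# to one linear map: W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2).
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(2, 3)), rng.normal(size=2)   # layer 0 -> 1
W2, b2 = rng.normal(size=(1, 2)), rng.normal(size=1)   # layer 1 -> 2

x = rng.normal(size=3)
two_layer = W2 @ (W1 @ x + b1) + b2
equivalent = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layer, equivalent))  # True: the composition is still linear
```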
Activation functions.  An activation function describes a nonlinear transformation of a hidden variable $z$. Common activation functions are the logistic, the hyperbolic tangent (tanh), and the rectified linear unit (ReLU),
$$\begin{aligned}
s(z) &= \frac{1}{1+\exp(-z)} = \frac{\exp(z)}{\exp(z)+1} && \text{(logistic)} \\
t(z) &= \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} && \text{(tanh)} \\
r(z) &= \max(0, z) && \text{(ReLU)}.
\end{aligned}$$
These three activation functions are presented in figure 8.34. The ReLU activation function is currently widely employed because it facilitates the training of models by mitigating the problems associated with vanishing gradients (see §8.3.2).

[Figure 8.34: Comparison of the logistic, tanh, and ReLU activation functions.]
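A minimal sketch of these three activation functions, following the definitions above:

```python
# The logistic, tanh, and ReLU activation functions defined elementwise.
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))  # equivalent to np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

z = np.linspace(-3, 3, 7)
print(logistic(z), tanh(z), relu(z), sep="\n")
```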
Feedforward neural networks: A simple case.  In a feedforward network, a hidden variable $z^{(j)}_i$ is passed through an activation function,
$$a^{(j)}_i = \sigma\big(z^{(j)}_i\big),$$
before being fed to the next layer. The output of the activation function is called the activation unit. Figure 8.35 presents the expanded and compact forms for a simple feedforward network with three covariates in its input layer, one hidden layer containing two activation units (green nodes), and one output layer containing a single activation unit (blue node). Note that in the compact forms (i.e., (b) and (c)), hidden variables $z^{(j)}_i$ are not explicitly represented anymore. The activation units on the first layer are
$$a^{(1)}_1 = \sigma\big(z^{(1)}_1\big), \qquad a^{(1)}_2 = \sigma\big(z^{(1)}_2\big).$$
The output unit $a^{(\mathrm{O})}$ is equal to the hidden variable $z^{(\mathrm{O})}$, which is itself defined as a linear combination of the activation units on the hidden layer, so that
$$a^{(\mathrm{O})} = z^{(\mathrm{O})} = w^{(1)}_1 a^{(1)}_1 + w^{(1)}_2 a^{(1)}_2 + b^{(1)}.$$
The observation model is now defined by $y = a^{(\mathrm{O})} + v$, $v: V \sim \mathcal{N}(v; 0, \sigma_V^2)$, so each observation of the system response is defined as $y_i : Y \sim \mathcal{N}(y; a^{(\mathrm{O})}, \sigma_V^2)$.
[Figure 8.35: A simple feedforward neural network. (a) Expanded-form variable nomenclature; (b) compact-form representation of activation functions; (c) compact-form variable nomenclature.]
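The forward pass of the simple network in figure 8.35 can be sketched as follows; the parameter values are placeholders, and ReLU is used here as an assumed choice of activation function.

```python
# Minimal forward pass for a 3-covariate -> 2-hidden-unit -> 1-output network
# (assumed parameter values; ReLU assumed as the hidden activation function).
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(2)
W0, b0 = rng.normal(scale=0.01, size=(2, 3)), np.zeros(2)  # input -> hidden layer
w1, b1 = rng.normal(scale=0.01, size=2), 0.0               # hidden layer -> output

x = np.array([0.2, 0.5, 0.8])        # covariates x_1, x_2, x_3
a1 = relu(W0 @ x + b0)               # activation units a^(1)
a_out = w1 @ a1 + b1                 # output unit a^(O) = z^(O) (linear output)
print(a_out)
```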
Fully connected feedforward neural network.  We now describe the formulation for the fully connected feedforward neural network (FCNN) with $L$ hidden layers, each having $A$ activation units. The example presented includes a single output observation. In practice, an FCNN can be employed for any number of output variables.
Figure 8.36 presents a graphical representation for an FCNN, where (a) presents the connectivity with the activation functions and (b) the nomenclature employed for weights $w^{(l)}_{i,j}$ and hidden units $a^{(l)}_i$. The activation unit $i$ in the first layer is obtained from a linear combination of the covariates,
$$a^{(1)}_i = \sigma\big(z^{(1)}_i\big), \qquad z^{(1)}_i = w^{(0)}_{i,1} x_1 + w^{(0)}_{i,2} x_2 + \cdots + w^{(0)}_{i,X} x_X + b^{(0)}_i.$$
In subsequent layers, the activation unit $i$ in layer $l+1$ is a linear combination of all the activation units in layer $l$, as described by
$$a^{(l+1)}_i = \sigma\big(z^{(l+1)}_i\big), \qquad z^{(l+1)}_i = w^{(l)}_{i,1} a^{(l)}_1 + w^{(l)}_{i,2} a^{(l)}_2 + \cdots + w^{(l)}_{i,A} a^{(l)}_A + b^{(l)}_i,$$
where $a^{(l+1)}_i$ is the activation unit in layer $l+1$, $z^{(l+1)}_i$ the hidden variable in layer $l+1$, $\sigma(\cdot)$ the activation function, $w^{(l)}_{i,j}$ the weights, $b^{(l)}_i$ the bias, and $a^{(l)}_j$ the activation units from layer $l$.

[Figure 8.36: A fully connected neural network having $L$ hidden layers, each having $A$ activation units. (a) Representation of the network structure with its activation functions; (b) variable nomenclature, where, for example, the activation unit 1 in layer 2 is $a^{(2)}_1 = \sigma\big(z^{(2)}_1\big)$ with $z^{(2)}_1 = w^{(1)}_{1,1} a^{(1)}_1 + w^{(1)}_{1,2} a^{(1)}_2 + \cdots + w^{(1)}_{1,A} a^{(1)}_A + b^{(1)}_1$. The leftmost nodes describe the input layer containing $X$ covariates. The green inner nodes describe the unobserved hidden layers, where an activation unit $i$ in layer $l$ is $a^{(l)}_i = \sigma\big(z^{(l)}_i\big)$. The blue node describes the output layer, where $a^{(\mathrm{O})} = \sigma_{\mathrm{O}}\big(z^{(\mathrm{O})}\big) = z^{(\mathrm{O})}$ for regression problems. The rightmost node describes the observed system responses, which are realizations $y_i : Y \sim \mathcal{N}(y; a^{(\mathrm{O})}, \sigma_V^2)$.]
The feedforward propagation for all the units contained in a layer can be expressed with matrix notation as $\mathbf{a}^{(l+1)} = \sigma\big(\mathbf{z}^{(l+1)}\big)$, that is,
$$\underbrace{\begin{bmatrix} a^{(l+1)}_1 \\ a^{(l+1)}_2 \\ \vdots \\ a^{(l+1)}_A \end{bmatrix}}_{\mathbf{a}^{(l+1)},\; A \times 1} = \sigma\left(\underbrace{\begin{bmatrix} z^{(l+1)}_1 \\ z^{(l+1)}_2 \\ \vdots \\ z^{(l+1)}_A \end{bmatrix}}_{\mathbf{z}^{(l+1)},\; A \times 1}\right),$$
where the hidden variables are given by $\mathbf{z}^{(l+1)} = \mathbf{W}^{(l)} \mathbf{a}^{(l)} + \mathbf{b}^{(l)}$, that is,
$$\underbrace{\begin{bmatrix} z^{(l+1)}_1 \\ z^{(l+1)}_2 \\ \vdots \\ z^{(l+1)}_A \end{bmatrix}}_{\mathbf{z}^{(l+1)},\; A \times 1} = \underbrace{\begin{bmatrix} w^{(l)}_{1,1} & w^{(l)}_{1,2} & \cdots & w^{(l)}_{1,A} \\ w^{(l)}_{2,1} & w^{(l)}_{2,2} & \cdots & w^{(l)}_{2,A} \\ \vdots & \vdots & \ddots & \vdots \\ w^{(l)}_{A,1} & w^{(l)}_{A,2} & \cdots & w^{(l)}_{A,A} \end{bmatrix}}_{\mathbf{W}^{(l)},\; A \times A} \underbrace{\begin{bmatrix} a^{(l)}_1 \\ a^{(l)}_2 \\ \vdots \\ a^{(l)}_A \end{bmatrix}}_{\mathbf{a}^{(l)},\; A \times 1} + \underbrace{\begin{bmatrix} b^{(l)}_1 \\ b^{(l)}_2 \\ \vdots \\ b^{(l)}_A \end{bmatrix}}_{\mathbf{b}^{(l)},\; A \times 1}.$$
The variables $z^{(\mathrm{O})}$ and $a^{(\mathrm{O})}$ on the output layer are obtained from a linear combination of the activation units on the last hidden layer $L$,
$$a^{(\mathrm{O})} = \sigma_{\mathrm{O}}\big(z^{(\mathrm{O})}\big) = z^{(\mathrm{O})} = w^{(L)}_1 a^{(L)}_1 + w^{(L)}_2 a^{(L)}_2 + \cdots + w^{(L)}_A a^{(L)}_A + b^{(L)}.$$
Because we are dealing with a regression problem, we can consider that the output activation function $\sigma_{\mathrm{O}}(\cdot)$ is linear. The observation model is defined by $y = a^{(\mathrm{O})} + v \equiv a^{(\mathrm{O})}(\mathbf{x}) + v$, $v: V \sim \mathcal{N}(v; 0, \sigma_V^2)$, so each observation of the system response is defined as $y_i : Y \sim \mathcal{N}(y; a^{(\mathrm{O})}(\mathbf{x}), \sigma_V^2)$.
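The matrix form above translates directly into code. The following sketch propagates a covariate vector through an FCNN with $L$ hidden layers of $A$ units each; the ReLU activation and the randomly initialized parameters are assumptions made for illustration.

```python
# Sketch of forward propagation a^(l+1) = sigma(W^(l) a^(l) + b^(l)) with a
# linear output layer (assumed placeholder parameter values).
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, weights, biases):
    """Return a^(O) for covariates x, given lists of weight matrices and bias vectors."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)           # hidden layers: a^(l+1) = sigma(z^(l+1))
    W_out, b_out = weights[-1], biases[-1]
    return W_out @ a + b_out          # output layer: linear activation for regression

rng = np.random.default_rng(3)
X_dim, A, L = 5, 10, 2                # X covariates, A units per layer, L hidden layers
sizes = [X_dim] + [A] * L + [1]
weights = [rng.normal(scale=0.01, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

x = rng.uniform(size=X_dim)
print(forward(x, weights, biases))    # a^(O)(x)
```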
Neural networks are characterized by their large number of parameters to be estimated. For example, a fully connected network with $L$ layers, each having $A$ activation units, has $A \times A \times (L-1)$ weight parameters and $A \times L$ bias parameters for its hidden layers. For $L = A = 10$, it leads to more than 1,000 parameters to be learned from the data set $\mathcal{D}$. In comparison, the GPR model presented in §8.2 had fewer than 10 parameters. Moreover, it is not uncommon for modern deep neural networks to be parameterized by millions of parameters. As a consequence, learning these parameters (i.e., training the neural network) requires large data sets and is a computationally demanding task.
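A quick count following these formulas, restricted to the hidden layers:

```python
# Number of hidden-layer parameters for a fully connected network with
# L hidden layers of A activation units each.
A, L = 10, 10
hidden_weights = A * A * (L - 1)   # weights between consecutive hidden layers
hidden_biases = A * L              # one bias per hidden unit
print(hidden_weights + hidden_biases)  # 1000, before counting input- and output-layer parameters
```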
8.3.2 Parameter Estimation and Backpropagation
The task of learning model parameters is referred to as training a neural network. In general, there are so many parameters to be learned that the utilization of second-order optimization methods such as Newton-Raphson (see §5.2) is computationally prohibitive. Instead, neural networks rely on gradient descent algorithms (see §5.1), where the gradient of the loss function with respect to each parameter is obtained using the backpropagation method.
Log-likelihood.  As for other types of regression models, the performance of a neural network is measured using quantities derived from the likelihood of observations $f(\mathcal{D}_y | \mathcal{D}_x, \mathbf{w}, \mathbf{b})$, where $\{\mathbf{w}, \mathbf{b}\}$ are respectively the weight and bias parameters. By assuming that observations are conditionally independent given $a^{(\mathrm{O})}$, the joint likelihood for a data set $\mathcal{D} = \{\mathcal{D}_x, \mathcal{D}_y\} = \{(\mathbf{x}_i, y_i), \forall i \in \{1:D\}\}$ is defined as
$$f(\mathcal{D}_y | \mathcal{D}_x, \mathbf{w}, \mathbf{b}) = \prod_{i=1}^{D} \mathcal{N}\big(y_i;\, a^{(\mathrm{O})}(\mathbf{x}_i; \mathbf{w}, \mathbf{b}),\, \sigma_V^2\big),$$
where $a^{(\mathrm{O})}(\mathbf{x}_i; \mathbf{w}, \mathbf{b}) \equiv a^{(\mathrm{O})}$ explicitly defines the dependence of the output layer on the bias and weight parameters from all layers. In practice, neural networks are trained by minimizing a loss function $J(\mathcal{D}; \mathbf{w}, \mathbf{b})$, which is derived from the negative log-likelihood,
$$\begin{aligned}
-\ln f(\mathcal{D}_y | \mathcal{D}_x, \mathbf{w}, \mathbf{b}) &= -\sum_{i=1}^{D} \ln \mathcal{N}\big(y_i;\, a^{(\mathrm{O})}(\mathbf{x}_i; \mathbf{w}, \mathbf{b}),\, \sigma_V^2\big) \\
&\propto \frac{1}{D} \sum_{i=1}^{D} \frac{1}{2}\big(y_i - a^{(\mathrm{O})}(\mathbf{x}_i; \mathbf{w}, \mathbf{b})\big)^2 \\
&= \frac{1}{D} \sum_{i=1}^{D} J(y_i; \mathbf{x}_i, \mathbf{w}, \mathbf{b}) \\
&= J(\mathcal{D}; \mathbf{w}, \mathbf{b}).
\end{aligned}$$
Note: The derivation of the loss function for a neural network is analogous to the derivation of the loss function presented in §8.1.1 for linear regression.
Note that the loss function $J(\mathcal{D}; \mathbf{w}, \mathbf{b})$ corresponds to the mean-square error (MSE). The optimal set of parameters $\boldsymbol{\theta}^* = \{\mathbf{w}^*, \mathbf{b}^*\}$ is thus the one that minimizes the MSE,
$$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}}\, J(\mathcal{D}; \mathbf{w}, \mathbf{b}).$$
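A minimal sketch of the loss $J(\mathcal{D}; \mathbf{w}, \mathbf{b})$ evaluated as the mean-square error; the observations and predictions below are assumed values standing in for a real data set and a trained network.

```python
# Mean-square-error loss J = (1/D) sum_i 1/2 (y_i - a^(O)(x_i))^2.
import numpy as np

def loss(y_obs, a_out):
    return np.mean(0.5 * (y_obs - a_out) ** 2)

y_obs = np.array([1.0, 2.0, 3.0])      # observed responses y_i (assumed)
a_out = np.array([1.1, 1.9, 3.2])      # network predictions a^(O)(x_i) (assumed)
print(loss(y_obs, a_out))
```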
Gradient descent.  The optimal values $\boldsymbol{\theta}^*$ are sought iteratively using a gradient-descent algorithm, where the updated parameter values are
$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta\, \nabla_{\boldsymbol{\theta}} J(\mathcal{D}; \mathbf{w}, \mathbf{b}),$$
and where $\eta$ is the learning rate (see §5.1). The gradients are estimated using the backpropagation algorithm that will be presented next. The parameters $\boldsymbol{\theta}_0 = \{\mathbf{w}, \mathbf{b}\}$ are typically initialized randomly following $\theta_i \sim \mathcal{N}(\theta; 0, \sigma^2)$, where $\sigma \approx 10^{-2}$. The random initialization is intended to avoid having an initial symmetry in the model. The objective of having initial values close to zero is that multiplying covariates and hidden variables by weights $w^{(l)}_{i,j} \approx 0$ will initially limit the model capacity. Note that a common practice is to normalize the covariates and observations in the data set.
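A sketch of the random initialization and of a single gradient-descent update; the gradient arrays below are placeholders standing in for the backpropagated gradients.

```python
# Random initialization theta_i ~ N(0, sigma^2) with sigma ~ 10^-2, followed by
# one gradient-descent step theta <- theta - eta * grad J (placeholder gradients).
import numpy as np

rng = np.random.default_rng(4)
sigma_init = 1e-2
W = rng.normal(0.0, sigma_init, size=(10, 10))     # weights initialized close to zero
b = rng.normal(0.0, sigma_init, size=10)

eta = 0.01                                         # learning rate
grad_W = rng.normal(size=W.shape)                  # placeholder for grad_W J(D; w, b)
grad_b = rng.normal(size=b.shape)                  # placeholder for grad_b J(D; w, b)
W = W - eta * grad_W
b = b - eta * grad_b
print(np.abs(W).mean())
```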
Backpropagation.  The backpropagation¹⁴ method employs the chain rule of derivation in order to backpropagate the model prediction errors through the network.

¹⁴ Rumelhart, D. E., G. E. Hinton, and R. J. Williams (1986). Learning representations by back-propagating errors. Nature 323, 533–536.

Note (feedforward neural network and the chain rule of derivation):
$$J = \tfrac{1}{2}\big(y_i - a^{(\mathrm{O})}\big)^2, \qquad a^{(\mathrm{O})} = \underbrace{\mathbf{W}^{(L)}\mathbf{a}^{(L)} + \mathbf{b}^{(L)}}_{z^{(\mathrm{O})}}, \qquad \mathbf{a}^{(l+1)} = \sigma\big(\underbrace{\mathbf{W}^{(l)}\mathbf{a}^{(l)} + \mathbf{b}^{(l)}}_{\mathbf{z}^{(l+1)}}\big),$$
$$\frac{\partial J}{\partial \mathbf{W}^{(L)}} = \underbrace{\frac{\partial J}{\partial a^{(\mathrm{O})}}\frac{\partial a^{(\mathrm{O})}}{\partial z^{(\mathrm{O})}}}_{\delta^{(\mathrm{O})}_i}\underbrace{\frac{\partial z^{(\mathrm{O})}}{\partial \mathbf{W}^{(L)}}}_{\mathbf{a}^{(L)}_i}, \qquad \frac{\partial J}{\partial \mathbf{b}^{(L)}} = \frac{\partial J}{\partial a^{(\mathrm{O})}}\frac{\partial a^{(\mathrm{O})}}{\partial z^{(\mathrm{O})}}\underbrace{\frac{\partial z^{(\mathrm{O})}}{\partial \mathbf{b}^{(L)}}}_{1},$$
$$\frac{\partial J}{\partial \mathbf{W}^{(l)}} = \underbrace{\frac{\partial J}{\partial \mathbf{z}^{(l+2)}}\frac{\partial \mathbf{z}^{(l+2)}}{\partial \mathbf{a}^{(l+1)}}\frac{\partial \mathbf{a}^{(l+1)}}{\partial \mathbf{z}^{(l+1)}}}_{\boldsymbol{\delta}^{(l+1)}_i}\underbrace{\frac{\partial \mathbf{z}^{(l+1)}}{\partial \mathbf{W}^{(l)}}}_{\mathbf{a}^{(l)}_i}, \qquad \frac{\partial J}{\partial \mathbf{b}^{(l)}} = \underbrace{\frac{\partial J}{\partial \mathbf{z}^{(l+2)}}}_{\boldsymbol{\delta}^{(l+2)}_i}\underbrace{\frac{\partial \mathbf{z}^{(l+2)}}{\partial \mathbf{a}^{(l+1)}}}_{\mathbf{W}^{(l+1)}}\underbrace{\frac{\partial \mathbf{a}^{(l+1)}}{\partial \mathbf{z}^{(l+1)}}}_{\nabla\sigma(\mathbf{z}^{(l+1)})}\underbrace{\frac{\partial \mathbf{z}^{(l+1)}}{\partial \mathbf{b}^{(l)}}}_{1}.$$

For a single observation $(\mathbf{x}_i, y_i)$, the prediction error $\delta^{(\mathrm{O})}_i$ is defined as the discrepancy between the predicted value $a^{(\mathrm{O})}(\mathbf{x}_i; \mathbf{w}, \mathbf{b})$ and the observation $y_i$, times the gradient of the output's activation function,
$$\begin{aligned}
\delta^{(\mathrm{O})}_i &= \frac{\partial}{\partial z^{(\mathrm{O})}} J(y_i; \mathbf{x}_i, \mathbf{w}, \mathbf{b}) \\
&= \frac{\partial}{\partial a^{(\mathrm{O})}} J(y_i; \mathbf{x}_i, \mathbf{w}, \mathbf{b}) \cdot \frac{\partial a^{(\mathrm{O})}}{\partial z^{(\mathrm{O})}} \\
&= -\big(y_i - a^{(\mathrm{O})}(\mathbf{x}_i; \mathbf{w}, \mathbf{b})\big) \cdot \frac{\partial}{\partial z^{(\mathrm{O})}} \sigma_{\mathrm{O}}\big(z^{(\mathrm{O})}\big) && \text{(general case)} \\
&= a^{(\mathrm{O})}(\mathbf{x}_i; \mathbf{w}, \mathbf{b}) - y_i && \text{(regression)}.
\end{aligned}$$
In the context of regression, the term $\frac{\partial}{\partial z^{(\mathrm{O})}} \sigma_{\mathrm{O}}\big(z^{(\mathrm{O})}\big)$ is equal to one because $a^{(\mathrm{O})} = \sigma_{\mathrm{O}}\big(z^{(\mathrm{O})}\big) = z^{(\mathrm{O})}$. Note that in order to evaluate the prediction error for an observation $y_i$, we first need to evaluate the model prediction for the current parameter values $\{\mathbf{w}, \mathbf{b}\}$ and a vector of covariates $\mathbf{x}_i$. For the $i$th observation in the training set, the gradient vector of the loss function with respect to either the weight or bias parameters on the $l$th layer is
$$\begin{aligned}
\nabla_{\mathbf{W}^{(l)}_i} J(y_i; \mathbf{x}_i, \mathbf{w}, \mathbf{b}) &= \frac{\partial J}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l+1)}_i \big(\mathbf{a}^{(l)}_i\big)^\top \\
\nabla_{\mathbf{b}^{(l)}_i} J(y_i; \mathbf{x}_i, \mathbf{w}, \mathbf{b}) &= \frac{\partial J}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l+1)}_i.
\end{aligned}$$
For hidden layers, the responsibility of each activation unit for the prediction error is estimated using
$$\boldsymbol{\delta}^{(l+1)}_i = \big(\mathbf{W}^{(l+1)}\big)^\top \boldsymbol{\delta}^{(l+2)}_i \odot \nabla\sigma\big(\mathbf{z}^{(l+1)}\big),$$
where $\odot$ is the Hadamard (element-wise) product, and $\nabla$ denotes the gradient vector.
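The following sketch implements this backpropagation recursion for a single observation in a small network with ReLU hidden layers and a linear output; the network sizes and parameter values are assumed placeholders, and the code follows the $\delta$ recursion and gradient expressions above.

```python
# Backpropagation sketch for one observation (x_i, y_i): forward pass, then
# delta^(O) = a^(O) - y_i backpropagated through ReLU hidden layers.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)

rng = np.random.default_rng(5)
sizes = [3, 4, 4, 1]                     # X=3 covariates, two hidden layers, one output
Ws = [rng.normal(scale=0.01, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

x_i, y_i = rng.uniform(size=3), 0.7      # one training pair (assumed values)

# Forward pass: store pre-activations z^(l) and activation units a^(l).
a_list, z_list = [x_i], []
for l, (W, b) in enumerate(zip(Ws, bs)):
    z = W @ a_list[-1] + b
    z_list.append(z)
    a_list.append(z if l == len(Ws) - 1 else relu(z))   # linear output layer

# Backward pass: propagate the prediction error back through the layers.
delta = a_list[-1] - y_i                                # delta^(O) = a^(O) - y_i
grads_W, grads_b = [], []
for l in reversed(range(len(Ws))):
    grads_W.insert(0, np.outer(delta, a_list[l]))       # dJ/dW^(l) = delta (a^(l))^T
    grads_b.insert(0, delta.copy())                     # dJ/db^(l) = delta
    if l > 0:
        delta = (Ws[l].T @ delta) * relu_grad(z_list[l - 1])  # delta for the previous layer
print(grads_W[0].shape, grads_b[0].shape)
```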
The common procedure is to perform parameter updates not using each observation individually, but by using mini-batches. A mini-batch $\mathcal{D}_{\mathrm{mb}} \subset \mathcal{D}$ is a subset of observations of size $n$ that is a power of 2, ranging from 32 to 256. The power-of-2 constraint is there to maximize the computational efficiency when working with graphical processing units (GPUs). The gradient of the loss function is then the average of the gradients obtained for each observation in a mini-batch,
$$\begin{aligned}
\nabla_{\mathbf{W}^{(l)}} J(\mathcal{D}_{\mathrm{mb}}; \mathbf{w}, \mathbf{b}) &= \frac{1}{n} \sum_{i: y_i \in \mathcal{D}_{\mathrm{mb}}} \nabla_{\mathbf{W}^{(l)}_i} J(y_i; \mathbf{x}_i, \mathbf{w}, \mathbf{b}) \\
\nabla_{\mathbf{b}^{(l)}} J(\mathcal{D}_{\mathrm{mb}}; \mathbf{w}, \mathbf{b}) &= \frac{1}{n} \sum_{i: y_i \in \mathcal{D}_{\mathrm{mb}}} \nabla_{\mathbf{b}^{(l)}_i} J(y_i; \mathbf{x}_i, \mathbf{w}, \mathbf{b}).
\end{aligned}$$
Note: Several of the advances made in neural networks can be attributed to advancements in GPU calculations. Nowadays most nontrivial models are trained using GPUs.
During training, an epoch corresponds to performing one pass of parameter updates for all mini-batches contained in the training set. Neural networks are typically trained using stochastic gradient descent, which computes the gradient on the subset of data contained in mini-batches rather than on the full data set. The reader should refer to the specialized literature¹⁵ for descriptions of stochastic gradient descent optimization methods such as Adagrad, RMSprop, and Adam.

¹⁵ Goodfellow, I., Y. Bengio, and A. Courville (2016). Deep learning. MIT Press.
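One training epoch with mini-batch gradient averaging can be sketched as follows; grad_for_observation is a hypothetical placeholder standing in for the per-observation gradient obtained by backpropagation.

```python
# One epoch of mini-batch stochastic gradient descent (placeholder gradients).
import numpy as np

def grad_for_observation(x_i, y_i, W):
    return np.zeros_like(W)            # placeholder: would come from backpropagation

rng = np.random.default_rng(6)
X, y = rng.uniform(size=(1000, 3)), rng.uniform(size=1000)
W = rng.normal(scale=0.01, size=(4, 3))
eta, n = 0.01, 64                      # learning rate and mini-batch size (a power of 2)

indices = rng.permutation(len(y))
for start in range(0, len(y), n):      # one epoch = one pass over all mini-batches
    batch = indices[start:start + n]
    grad_W = np.mean([grad_for_observation(X[i], y[i], W) for i in batch], axis=0)
    W = W - eta * grad_W               # stochastic gradient-descent update
print(np.abs(W).mean())
```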
8.3.3 Regularization
Because of the large number of parameters involved in the definition of a neural network, it is prone to overfitting, where the performance obtained during training does not generalize to the test set. Here, we present two regularization procedures to prevent overfitting: weight decay and early stopping.
Weight decay.  The first regularization strategy is to employ maximum a posteriori (MAP) estimation rather than a maximum likelihood estimate (MLE). The prior knowledge for model parameters in an MAP setup allows us to prevent overfitting by limiting the model complexity. For instance, we can define a prior for the weights in the vector $\mathbf{w}$ so that $f\big(w^{(l)}_{i,j}\big) = \mathcal{N}\big(w^{(l)}_{i,j}; 0, 1/\lambda\big)$. A large value $\lambda$ forces weight parameters to be close to zero, which in turn reduces the model capacity. If $\lambda$ is small, little penalization is applied on the weights, and the model is then free to be highly nonlinear and can more easily fit the data in the training set.
With MAP, the posterior is now approximated by the product of the likelihood and the prior, so
$$\begin{aligned}
f(\mathbf{w}, \mathbf{b} | \mathcal{D}) &\propto f(\mathcal{D}_y | \mathcal{D}_x, \mathbf{w}, \mathbf{b}) \cdot f(\mathbf{w}, \mathbf{b}) \\
&= \prod_{k=1}^{D} \mathcal{N}\big(y_k;\, a^{(\mathrm{O})}(\mathbf{x}_k; \mathbf{w}, \mathbf{b}),\, \sigma_V^2\big) \cdot \prod_{i,j,l} \mathcal{N}\big(w^{(l)}_{i,j}; 0, 1/\lambda\big) \\
-\ln f(\mathbf{w}, \mathbf{b} | \mathcal{D}) &\propto \frac{1}{D} \sum_{k=1}^{D} J(y_k; \mathbf{x}_k, \mathbf{w}, \mathbf{b}) + \frac{\lambda}{2} \sum_{i,j,l} \big(w^{(l)}_{i,j}\big)^2 \\
&\propto J(\mathcal{D}; \mathbf{w}, \mathbf{b}) + \frac{\lambda}{2} \mathbf{w}^\top \mathbf{w}.
\end{aligned}$$
We can then redefine the loss function as
$$\tilde{J}(\mathcal{D}; \mathbf{w}, \mathbf{b}) = J(\mathcal{D}; \mathbf{w}, \mathbf{b}) + \lambda\, \Omega(\mathbf{w}),$$
where $\Omega(\mathbf{w})$, the regularization or weight decay term, is
$$\Omega(\mathbf{w}) = \frac{1}{2} \|\mathbf{w}\|^2 = \frac{1}{2} \mathbf{w}^\top \mathbf{w}.$$
Defining a Gaussian prior for the weights corresponds to $L_2$-norm regularization. This choice of regularization is not the only one available, yet it is a common one. The gradient of the loss function with respect to the weight parameters has to be modified to include the regularization term, so
$$\nabla_{\mathbf{W}^{(l)}} \tilde{J}(\mathcal{D}_{\mathrm{mb}}; \mathbf{w}, \mathbf{b}) = \nabla_{\mathbf{W}^{(l)}} J(\mathcal{D}_{\mathrm{mb}}; \mathbf{w}, \mathbf{b}) + \lambda \mathbf{W}^{(l)}.$$
In short, weight decay allows controlling the model capacity with the hyperparameter $\lambda$. The hyperparameter $\lambda$ then needs to be learned using a method such as cross-validation (see §8.1.2).
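A sketch of the weight-decay penalty and of the corresponding gradient correction; grad_W_data is a placeholder for the data-fit gradient that would be obtained by backpropagation.

```python
# L2 (weight decay) penalty Omega(w) = 1/2 ||w||^2 and the regularized gradient
# grad J_tilde = grad J + lambda * W (placeholder data-fit gradient).
import numpy as np

rng = np.random.default_rng(7)
W = rng.normal(scale=0.1, size=(10, 10))
lam = 1e-3                                   # regularization hyperparameter lambda

penalty = 0.5 * lam * np.sum(W ** 2)         # lambda * Omega(w)
grad_W_data = rng.normal(size=W.shape)       # placeholder for grad_W J(D_mb; w, b)
grad_W_total = grad_W_data + lam * W         # gradient of the regularized loss
print(penalty, np.linalg.norm(grad_W_total))
```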
Early stopping.  Given that we initialize each parameter randomly, following $\theta_i \sim \mathcal{N}(\theta; 0, \sigma^2)$, with $\sigma \approx 10^{-2}$, the model will initially have a poor performance at fitting the observations in the training set. As the number of optimization epochs increases, the fitting ability of the model will improve and, as a result, the loss for the training set $J(\mathcal{D}_{\mathrm{train}}; \mathbf{w}, \mathbf{b})$ will decrease. If we keep optimizing parameters, there is a point where the model will gain too much capacity, and it will start overfitting the data in the training set. Thus, a second option to prevent overfitting is to monitor the loss function evaluated at each epoch for an independent validation set, $J(\mathcal{D}_{\mathrm{validation}}; \mathbf{w}, \mathbf{b})$. Note that the data in the validation set must not be employed to train the model parameters. The optimal parameter values $\boldsymbol{\theta}^* = [\mathbf{w}^* \; \mathbf{b}^*]^\top$ are taken as those corresponding to the epoch for which the loss evaluated for the validation set is minimal. This procedure is equivalent to treating the number of training epochs as a hyperparameter to be optimized using the validation set.
Figure 8.37 presents an example of early stopping where we see that the loss for the training set keeps decreasing with the number of epochs. For the loss evaluated on the validation set, the minimum is reached at seven epochs. Beyond seven epochs, the fitting ability we gain on the training set does not generalize to the validation set.

[Figure 8.37: Example of early stopping when the loss function for the validation set stops decreasing. The plot shows the training and validation losses $J(\mathcal{D}; \mathbf{w}, \mathbf{b})$ as functions of the number of epochs, with the early-stop point marked.]
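Early stopping can be sketched as follows; train_one_epoch and validation_loss are hypothetical placeholders for one epoch of gradient-descent updates and for evaluating $J(\mathcal{D}_{\mathrm{validation}}; \mathbf{w}, \mathbf{b})$.

```python
# Early stopping: keep the parameters from the epoch with the lowest validation loss.
import numpy as np

def train_one_epoch(params):
    return params - 0.01 * np.ones_like(params)      # placeholder parameter update

def validation_loss(params):
    return float(np.sum(params ** 2))                 # placeholder for J(D_validation; w, b)

params = np.random.default_rng(8).normal(scale=1e-2, size=20)
best_params, best_loss = params.copy(), np.inf
for epoch in range(50):
    params = train_one_epoch(params)
    loss_val = validation_loss(params)
    if loss_val < best_loss:                           # retain the best epoch so far
        best_loss, best_params = loss_val, params.copy()
print(best_loss)
```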
8.3.4 Example: Metamodel
Here we revisit the example presented in §8.2.6, where we construct a metamodel for the responses of a structure using five covariates and a set of $D = 1000$ simulations. For this example, we employ a tanh activation function with $A = 10$ activation units on each of the $L = 2$ hidden layers. Figure 8.38 compares the system responses with the model predictions obtained in a leave-one-out cross-validation setup where, for each fold, 90 percent of the data is employed for training the model and the remaining 10 percent to validate when the training should be stopped. The loss function $J(\mathcal{D}; \mathbf{w}, \mathbf{b})$, evaluated using the leave-one-out cross-validation procedure, is equal to $1.32 \times 10^{-5}$. In comparison, the equivalent mean-square error obtained for the GPR model presented in §8.2.6 is $8.8 \times 10^{-6}$. This shows that neural networks do not necessarily outperform other methods, because they are typically best suited for problems involving a large number of covariates and for which large data sets are available.
[Figure 8.38: Comparison of the system responses $y_i$ with the model predictions $a^{(\mathrm{O})}_i$ obtained in a leave-one-out cross-validation setup.]