8 Regression

Data: $\mathcal{D} = \{(\mathbf{x}_i, y_i),\ \forall i \in \{1\!:\!\mathtt{D}\}\}$
$\mathbf{x}_i \in \mathbb{R}^{\mathtt{X}}$: Covariate / Attribute / Regressor
$y_i \in \mathbb{R}$: Observation
Model: $g(\mathbf{x}) \triangleq \mathrm{fct}(\mathbf{x})$
Regression consists in the task of modeling the relationships between a system's responses and the covariates describing the properties of this system. Here, the generic term system is employed because regression can be applied to many types of problems. Figure 8.1 presents examples of applications in the field of civil engineering; for instance, one could model the relationship between (a) the number of pathogens and chlorine concentration, (b) the number of accidents at a road intersection as a function of the hour of the day, and (c) concrete strength as a function of the water-cement ratio. These examples are all functions of a single covariate; later in this chapter we will see how regression is generalized to model the responses of a system as a function of any number of covariates.
Figure 8.1: Examples of applications of regression analysis in the context of civil engineering: (a) pathogens [ppm] versus chlorine concentration; (b) number of accidents versus hour of the day; (c) concrete strength versus water/cement ratio.
In regression problems, the data sets are described by $\mathtt{D}$ pairs of observations $\mathcal{D} = \{(\mathbf{x}_i, y_i),\ \forall i \in \{1\!:\!\mathtt{D}\}\}$, where $\mathbf{x}_i = [x_1\ x_2\ \cdots\ x_{\mathtt{X}}]_i^\intercal \in \mathbb{R}^{\mathtt{X}}$ are covariates, also called attributes or regressors, and $y_i \in \mathbb{R}$ is an observed system response. A regression model $g(\mathbf{x}) \triangleq \mathrm{fct}(\mathbf{x}) \in \mathbb{R}$ is thus an $\mathtt{X} \to 1$ function, that is, a function of $\mathtt{X}$ covariates leading to one output. This chapter covers three regression methods for building $g(\mathbf{x})$: linear regression, Gaussian process regression, and neural networks.
8.1 Linear Regression
Linear regression is one of the easiest methods for approaching regression problems. Contrary to what its name may lead us to think, linear regression is not limited to cases where the relationships between the covariates and system responses are linear. The term linear refers to the linear dependence of the model with respect to its parameters. In the case involving a single covariate $x$,
the model employed in linear regression takes the form
$$g(x) = b_0 + b_1\phi_1(x) + b_2\phi_2(x) + \dots + b_B\phi_B(x),$$
where $\boldsymbol{\theta} = \mathbf{b} = [b_0\ b_1\ \cdots\ b_B]^\intercal$ is a vector of model parameters and $\phi_i(x)$ is a basis function applied on the covariate $x$. Basis functions are nonlinear functions employed when we want to apply linear regression to describe nonlinear relationships. Figure 8.2 presents two examples of linear regression applied to (a) linear and (b) nonlinear relationships.
Figure 8.2: Examples of linear regression applied to linear and nonlinear relationships: (a) linear, $g(x) = b_0 + b_1 x$; (b) nonlinear, $g(x) = b_0 + b_1\sin(10x)^3$.
8.1.1 Mathematical Formulation
For the simplest case where there is only one covariate, the data set consists in $\mathtt{D}$ pairs of covariates and system responses $\mathcal{D} = \{(x_i, y_i),\ \forall i \in \{1\!:\!\mathtt{D}\}\}$, where a covariate $x_i \in \mathbb{R}$, and its associated observation $y_i \in \mathbb{R}$. The observation model is defined as
$$\underbrace{y}_{\text{Observation}} = \overbrace{g(x)}^{\text{Model}} + \underbrace{v}_{\text{Observation error}}, \qquad v: V \sim \mathcal{N}(v; 0, \sigma_V^2),$$
where $v$ describes the observation errors that are assumed to be independent of each other, $V_i \perp\!\!\!\perp V_j,\ \forall i \neq j$. Note that observed covariates $x$ are assumed to be exact and free from any observation errors. Let us consider the special case of a linear basis function $\phi(x) = x$, where the model takes the form
$$g(x) = b_0 + b_1 x = [b_0\ b_1]\begin{bmatrix}1\\ x\end{bmatrix}. \tag{8.1}$$
We can express the model for the entire data set using the following matrix notation,
Note: The 1 in $[1\ x]^\intercal$ stands for the absence of dependence on a covariate for the bias parameter $b_0$.
$$\mathbf{b} = \begin{bmatrix} b_0\\ b_1 \end{bmatrix},\quad \mathbf{y} = \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_{\mathtt{D}} \end{bmatrix},\quad \mathbf{X} = \begin{bmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_{\mathtt{D}} \end{bmatrix},\quad \mathbf{v} = \begin{bmatrix} v_1\\ v_2\\ \vdots\\ v_{\mathtt{D}} \end{bmatrix}.$$
With this matrix notation, we reformulate the observation model as
$$\mathbf{y} = \mathbf{Xb} + \mathbf{v},\qquad \mathbf{v}: \mathbf{V} \sim \mathcal{N}(\mathbf{v}; \mathbf{0}, \sigma_V^2\cdot\mathbf{I}),$$
where $\mathbf{I}$ is the identity matrix indicating that all measurement errors are independent from each other and have the same variance. If we separate the set of observations $\mathcal{D} = \{\mathcal{D}_x, \mathcal{D}_y\}$ by isolating covariates $\mathcal{D}_x$ and system responses $\mathcal{D}_y$, the joint likelihood of observed system responses given the set of observed covariates and model parameters $\boldsymbol{\theta} = \mathbf{b}$ is expressed as
$$\begin{aligned}
f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta}) \equiv f(\mathcal{D}_y = \mathbf{y}|\mathcal{D}_x = \mathbf{x}, \boldsymbol{\theta} = \mathbf{b})
&= \prod_{i=1}^{\mathtt{D}} \mathcal{N}\big(y_i; g(x_i), \sigma_V^2\big)\\
&\propto \prod_{i=1}^{\mathtt{D}} \exp\!\left(-\frac{1}{2}\frac{\big(y_i - g(x_i)\big)^2}{\sigma_V^2}\right)\\
&= \exp\!\left(-\frac{1}{2}\sum_{i=1}^{\mathtt{D}}\frac{\big(y_i - g(x_i)\big)^2}{\sigma_V^2}\right)\\
&= \exp\!\left(-\frac{1}{2\sigma_V^2}(\mathbf{y} - \mathbf{Xb})^\intercal(\mathbf{y} - \mathbf{Xb})\right).
\end{aligned}$$
The goal is to estimate optimal parameters $\boldsymbol{\theta}^* = \mathbf{b}^*$, which maximize the likelihood $f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta})$, or equivalently, which maximize the log-likelihood,
Note: Because we have assumed that $\sigma_V$ is the same for all observations, optimal parameter values $\boldsymbol{\theta}^*$ are independent of the observation-error standard deviation.
$$\ln f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta}) = -\frac{1}{2\sigma_V^2}(\mathbf{y} - \mathbf{Xb})^\intercal(\mathbf{y} - \mathbf{Xb}) \propto -\frac{1}{2}(\mathbf{y} - \mathbf{Xb})^\intercal(\mathbf{y} - \mathbf{Xb}).$$
The set of data $\mathcal{D}$ employed to obtain the maximum likelihood estimate (MLE) for parameters $\boldsymbol{\theta} = \mathbf{b}$ is called the training set.
Note: $\arg\max_{\mathbf{b}} \ln f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta}) \equiv \arg\min_{\mathbf{b}} J(\mathbf{b})$.
In the context of linear regression, the objective of maximizing $\ln f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta})$ is often replaced by the equivalent objective of minimizing a loss function $J(\mathbf{b})$ corresponding to the sum of the squares of the prediction errors,
$$J(\mathbf{b}) = \frac{1}{2}\sum_{i=1}^{\mathtt{D}}\big(y_i - g(x_i)\big)^2 = \frac{1}{2}(\mathbf{y} - \mathbf{Xb})^\intercal(\mathbf{y} - \mathbf{Xb}) = -\ln f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta}).$$
Figure 8.3: The minimum of a continuous function corresponds to the location where its derivative equals zero.
Figure 8.3 presents a loss function $J(\mathbf{b})$ as a function of the model parameters $\mathbf{b}$. Optimal parameter values $\boldsymbol{\theta}^* = \mathbf{b}^*$ are defined as
$$\mathbf{b}^* = \arg\min_{\mathbf{b}}\ \frac{1}{2}(\mathbf{y} - \mathbf{Xb})^\intercal(\mathbf{y} - \mathbf{Xb}).$$
As for other continuous functions, the minimum of $J(\mathbf{b})$ coincides with the point where its derivative equals zero,
$$\frac{\partial J(\mathbf{b})}{\partial \mathbf{b}} = 0 = \mathbf{X}^\intercal\mathbf{X}\mathbf{b} - \mathbf{X}^\intercal\mathbf{y}.$$
Figure 8.4: Example of linear regression minimizing the performance metric $J$ computed on the training set; $J_{\text{train}}(\mathbf{b}^*) = 1.98$.
By reorganizing the terms from the last equation, optimal parameters are defined as
$$\mathbf{b}^* = (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y}. \tag{8.2}$$
Figure 8.4 presents an example of linear regression employing a linear basis function $g(x) = b_0 + b_1 x$. The optimal parameters found using the training set are $\mathbf{b}^* = [1.14\ 3.71]^\intercal$, for which $J_{\text{train}}(\mathbf{b}^*) = 1.98$.
We can extend the formulation presented above for the general case where the relationship between $x$ and $y$ is modeled by the summation of multiple basis functions,
$$g(x) = b_0 + b_1\phi_1(x) + b_2\phi_2(x) + \cdots + b_B\phi_B(x) = \mathbf{Xb},$$
where $\phi_i(x)$ can be any linear or nonlinear function, for example: $\phi_i(x) = x^2$, $\phi_i(x) = \sin(x)$, $\cdots$. In a matrix form, this general model is expressed as
$$\mathbf{b} = \begin{bmatrix} b_0\\ b_1\\ \vdots\\ b_B \end{bmatrix},\quad \mathbf{y} = \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_{\mathtt{D}} \end{bmatrix},\quad \mathbf{X} = \begin{bmatrix} 1 & \phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_B(x_1)\\ 1 & \phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_B(x_2)\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & \phi_1(x_{\mathtt{D}}) & \phi_2(x_{\mathtt{D}}) & \cdots & \phi_B(x_{\mathtt{D}}) \end{bmatrix}.$$
Remember that with linear regression, the model is necessarily linear with respect to $\phi_i(x)$ and $\mathbf{b}$, yet not with respect to $x$. Figure 8.5 presents two examples of linear regression applied to nonlinear relationships between a system response and a covariate.
Figure 8.5: Examples of linear regression applied to nonlinear relationships: (a) $g(x) = b_0 + b_1 x^2$; (b) $g(x) = b_0 - b_2\sin(10x)^3$.
8.1.2 Overfitting and Cross-Validation
One of the main challenges with linear regression, as with many other machine learning methods, is overfitting. Overfitting happens when you select a model class or model structure that represents the data in your training set well, yet it fails to provide good predictions for other covariates that were not seen during training. A highly complex model capable of adapting to the training data is said to have a large capacity.
Figure 8.6: Illustration of the bias and variance when talking about predictive performance and model complexity: (a) low bias & low variance; (b) high bias & low variance; (c) low bias & high variance; (d) high bias & high variance; (e) bias-variance tradeoff, where the training error (variance) and validation error (bias) are plotted against model complexity, with the optimal complexity at the minimum of the validation error, underfitting to its left and overfitting to its right.
Overfitting with a model that has too much capacity corresponds to the case (c) in figure 8.6, where prediction errors are displaying a low bias when evaluated with the data employed to train the model; however, when validated with the true objective of predicting unobserved data, there is a high variance. Case (b) represents the situation where a model underfits the data because its low capacity causes a large bias and a small variance. As depicted in (e), we want the optimal model complexity offering a tradeoff between the prediction bias and variance by minimizing the validation error. This is known as the bias-variance tradeoff. The concept illustrated in figure 8.6 will be revisited at the end of this section.
In the context of linear regression, overfitting arises when one chooses basis functions $\phi(x)$ that have too much capacity and that will be able to fit the data in a training set but not the data of an independent validation set. Figure 8.7a presents an example where data is generated from a linear basis function and where the regression model also employs a linear basis function to fit the data. Figure 8.7b presents the case where the same data is fitted with a model using polynomial basis functions of order 5. Notice how when more complex basis functions are introduced in the model, it becomes more flexible, and it can better fit the data in the training set. Here the fitting ability is quantified by the loss function $J$, which is lower for the polynomial model than for the linear one.
Figure 8.7: Example of (a) a model with correct complexity, $g(x) = b_0 + b_1 x$, for which $J_{\text{train}}(\mathbf{b}^*) = 1.98$; and (b) a model with too much capacity, $g(x) = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + b_4 x^4 + b_5 x^5$, for which $J_{\text{train}}(\mathbf{b}^*) = 1.72$.
One solution to this challenge is to replace the MLE procedure for optimizing the parameters $\mathbf{b}$ (see equation 8.2) with the Bayesian estimation procedure as described in §6.3. The appropriate model complexity should then be selected by employing Bayesian model selection as presented in §6.8. In practice, this is seldom performed because calculating the evidence $f(\mathcal{D}|m_i)$ for a model is computationally demanding. An alternative is to perform ridge regression,¹ which is a special case of Bayesian linear regression that employs regularization to constrain the parameters (see §8.3.3).
¹ Hoerl, A. E. and R. W. Kennard (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(1), 55-67
Another general procedure to identify the optimal model complexity is to employ cross-validation. The idea is to avoid employing the same data to (1) estimate optimal parameters, (2) discriminate between model classes, and (3) test the final model performance using the loss function. To start with, the data is separated into a training set and a test set. In order to maximize the utilization of the data available, we can rely on the $\mathtt{N}$-fold cross-validation procedure, where the training set is again divided into $\mathtt{N}$ subsets called folds; one at a time, each subset is employed as an independent validation set while the training is performed on the remaining folds. The total loss is obtained by summing the loss obtained for each validation subset. The model class selected is the one with the lowest aggregated loss. Once a model class is selected as the best one, its parameters can be retrained using the entire training set, and its predictive performance is evaluated on the independent test set that was not employed to estimate parameters or to discriminate between model classes.
Figure 8.8: The $\mathtt{N}$-fold cross-validation procedure: the entire data set is divided into a training set and a test set; the training set is further divided into folds (#1, 2, 3), each used in turn as the validation subset while the remaining folds form the training subset.
Figure 8.8 presents an example of a 3-fold cross-validation where the green-shaded boxes describe the part of the data set employed to train the model, the blue boxes describe the part of the data set employed to compare the predictive performance while choosing between model classes, and the red box represents the independent test set that is only employed to quantify the model performance on unobserved data. It is essential that for each fold, the training and validation sets, and the test set are kept separated. Typically, we would employ 80 percent of the data for the training and validation sets and 20 percent for the test set.
It is common to employ 10-fold cross-validation because it offers a balance between the computational effort required and the amount of data available to estimate the model parameters. For $\mathtt{N}$ folds, the model needs to be trained $\mathtt{N}$ times, which may involve a substantial computational effort. On the other hand, using few folds may lead to poor parameter estimates. For example, if we only employ 2 folds, for each of them the model is only trained on half the training set. When $\mathtt{N} = \mathtt{D}_{\text{train}}$, the procedure is called leave-one-out cross-validation (LOO-CV). This approach leads to the best estimation of validation errors because more data is employed during training, yet it is the most computationally demanding. Algorithm 7 summarizes the $\mathtt{N}$-fold cross-validation procedure for choosing between $\mathtt{M}$ model classes.
Algorithm 7: N-fold cross-validation
1  define {x_train, y_train} (training set with D_train observations)
2  define {x_test, y_test} (test set)
3  define g_m(x), ∀m ∈ {1:M} (models)
4  define N (number of folds)
5  d = randperm(D_train) (random permutation of [1:D_train]ᵀ)
6  n = round(D_train/N) (# observations per fold)
7  for m ∈ {1, 2, ···, M} do
8      for i ∈ {1, 2, ···, N} do
9          x_t = x_train; y_t = y_train (initialize training set)
10         k = d(n(i−1)+1 : min(i·n, D_train)) (validation-set indices)
11         x_v = x_train(k, :); y_v = y_train(k, :) (validation set)
12         x_t(k, :) = [ ]; y_t(k, :) = [ ] (remove validation set)
13         X = fct(x_t) (training model-matrix)
14         b*_i = (XᵀX)⁻¹Xᵀy_t (estimate parameters)
15         X = fct(x_v) (validation model-matrix)
16         J_i = J(b*_i, y_v) (i-th validation-set loss)
17     J_m^validation = Σ_{i=1}^{N} J_i (aggregated loss for model m)
18 m* = arg min_m J_m^validation (select best model)
19 X = fct(x_train) (training model-matrix for m*)
20 b* = (XᵀX)⁻¹Xᵀy_train (estimate parameters)
21 X = fct(x_test) (test-set model matrix)
22 J_m*^test = J(b*, y_test) (test-set loss for m*)
Figure 8.9: Example of 2-fold cross-validation for the model $g_1(x) = b_0 + b_1 x$: (a) fold 1/2, $J_{\text{train}}(\mathbf{b}^*) = 0.43$, $J_{\text{validation}}(\mathbf{b}^*) = 1.56$; (b) fold 2/2, $J_{\text{train}}(\mathbf{b}^*) = 1.53$, $J_{\text{validation}}(\mathbf{b}^*) = 0.46$.
Figure 8.10: Example of 2-fold cross-validation for the model $g_2(x) = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + b_4 x^4 + b_5 x^5$: (a) fold 1/2, $J_{\text{train}}(\mathbf{b}^*) = 0.7$, $J_{\text{validation}}(\mathbf{b}^*) = 5.42$; (b) fold 2/2, $J_{\text{train}}(\mathbf{b}^*) = 0.11$, $J_{\text{validation}}(\mathbf{b}^*) = 15.47$.
Figures 8.9 and 8.10 present a 2-fold cross-validation procedure employed to choose between a linear model $g_1(x)$ and a fifth-order polynomial one $g_2(x)$,
$$\begin{aligned} g_1(x) &= b_0 + b_1 x &&\text{(linear)}\\ g_2(x) &= b_0 + b_1 x + b_2 x^2 + b_3 x^3 + b_4 x^4 + b_5 x^5 &&(5^{\text{th}}\ \text{order}). \end{aligned}$$
For the linear model, the sum of the losses for the two training and validation sets is
$$\textstyle\sum J_{\text{train}} = 1.97,\quad \sum J_{\text{validation}} = 2.02\ \Rightarrow\ g_1(x),$$
and for the fifth-order polynomial model, the sum of the losses for the two training and validation sets is
$$\textstyle\sum J_{\text{train}} = 0.81,\quad \sum J_{\text{validation}} = 20.89\ \Rightarrow\ g_2(x).$$
These results indicate that the fifth-order polynomial function is better at fitting the training set; however, it fails at predicting data that was not seen in the training set. This is reflected in the loss function for the validation set, which is lower for the linear model having the correct complexity, compared to the fifth-order polynomial function that is overcomplex.
Figure 8.11: Example of 2-fold cross-validation for the model $g_3(x) = b_0 + b_1 x$, where the data follows the real model $\check{g}(x) = 1 + 5x^4$: (a) fold 1/2; (b) fold 2/2.
We now come back to the targets presented in figure 8.6; the model $g_1(x)$ presented in figure 8.9 corresponds to the target in figure 8.6a, where the complexity is optimal so there is both a low bias and a low variance. The model $g_2(x)$ presented in figure 8.10 corresponds to figure 8.6c, where because the capacity is too high, the model closely fits the training data in each fold (low bias), yet the model is drastically different (i.e., high variance) for each fold. Here, the high model capacity allowed the model to overfit the noise contained in the data. Finally, the target in figure 8.6b corresponds to the new case illustrated in figure 8.11, where the model does not have enough capacity to fit the data. The result is that the model is highly biased, yet there is almost no variability between the models obtained for each fold.
8.1.3 Mathematical Formulation >1-D
Linear regression can be employed for cases involving more than one covariate. The data set is then $\mathcal{D} = \{(\mathbf{x}_i, y_i),\ \forall i \in \{1\!:\!\mathtt{D}\}\}$, where $\mathbf{x}_i = [x_1\ x_2\ \cdots\ x_{\mathtt{X}}]_i^\intercal \in \mathbb{R}^{\mathtt{X}}$ is an $\mathtt{X}$ by one vector associated with an observation $y_i \in \mathbb{R}$. The model is an $\mathtt{X} \to 1$ function, $g(\mathbf{x}) \triangleq \mathrm{fct}(\mathbf{x}) \in \mathbb{R}$, which is defined using basis functions so that
$$g(\mathbf{x}) = b_0 + b_1\phi_1(x_1) + b_2\phi_2(x_2) + \cdots + b_B\phi_B(x_{\mathtt{X}}) = \mathbf{Xb}.$$
The matrix notation for the model defined over the entire data set is
$$\mathbf{b} = \begin{bmatrix} b_0\\ b_1\\ \vdots\\ b_B \end{bmatrix},\quad \mathbf{y} = \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_{\mathtt{D}} \end{bmatrix},\quad \mathbf{X} = \begin{bmatrix} 1 & \phi_1(x_{1,1}) & \phi_2(x_{1,2}) & \cdots & \phi_B(x_{1,\mathtt{X}})\\ 1 & \phi_1(x_{2,1}) & \phi_2(x_{2,2}) & \cdots & \phi_B(x_{2,\mathtt{X}})\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & \phi_1(x_{\mathtt{D},1}) & \phi_2(x_{\mathtt{D},2}) & \cdots & \phi_B(x_{\mathtt{D},\mathtt{X}}) \end{bmatrix},$$
where $[\mathbf{x}_i]_j \equiv x_{i,j}$ refers to the $j^{\text{th}}$ component of the $\mathbf{x}$ vector for the $i^{\text{th}}$ observation. Figure 8.12 presents an example of multivariate linear regression for $\mathbf{x} = [x_1\ x_2]^\intercal$ and for which the model is
$$g(\mathbf{x}) = b_0 + b_1 x_1^{1/2} + b_2 x_2^2.$$
Note that even if the model now involves multiple covariates, the optimal model parameters $\mathbf{b}^*$ are still estimated using equation 8.2.
Figure 8.12: Example of application of linear regression to a 2-D data set.
8.1.4 Limitations
Figure 8.13: Illustrations of some of the limitations of the linear regression method presented in §8.1.1: (a) example of heteroscedasticity where the variance of errors is a function of the covariate $x$; (b) example of a bias introduced by an outlier in the bottom right corner; (c) example of biased extrapolation for unobserved covariates.
Some of the limitations of the linear regression method presented in §8.1.1 are illustrated in figure 8.13: (a) it assumes homoscedastic errors and is thus not suited for the cases where the variance of errors depends on the covariate $x$, that is, heteroscedastic cases; (b) it is sensitive to outliers, so a single biased observation may severely affect the model; and (c) it does not differentiate between interpolation and extrapolation. Interpolation refers to predictions made between covariates for which observations are available to train the model, and extrapolation refers to predictions made beyond the domain the model has been trained for. In (c), the predictions performed with $g(x)$ are good at interpolating between data points in the training set, yet they offer a poor performance for extrapolating beyond them on the test data. A main limitation of the linear regression formulation presented here is that it does not provide uncertainty estimates indicating that its predictive performance degrades for covariate values that are far from those available in the training set.
Another key limitation not illustrated in figure 8.13 is that the performance of linear regression depends on our ability to hand-pick the correct basis functions. For cases involving one or two covariates, this task is feasible from a simple visual representation of the data set. This task becomes difficult when the number of covariates $\mathtt{X} > 2$. In short, linear regression is simple and parameter estimation is quick, yet it is outperformed by modern techniques such as Gaussian process regression and neural networks, which will be covered in the next sections. Note that the cross-validation technique covered here is not limited to linear regression and can be employed with a wide range of machine learning methods.
8.2 Gaussian Process Regression
Gaussian process regression (GPR) works by describing the prior knowledge of system responses over its covariate domain using a joint multivariate Normal probability density function (PDF). Then, it employs the properties of the multivariate Normal (see §4.1.3) in order to update the joint prior PDF using empirical observations.
Figure 8.14: Example of a 100 km long pipeline.
Pipeline example Take for example the 100 km long pipeline illustrated in figure 8.14, for which we are interested in quantifying the temperature at any coordinate $x \in (0, 100)$ km. Given that we know the pipeline temperature to be $y = 8$ °C at $x = 30$ km, what is the temperature at $x = 30.1$ km? Because of the proximity between the two locations, it is intuitive that the temperature has to be close to 8 °C. Then, what is the temperature at $x = 70$ km? We cannot say for sure that it is going to be around 8 °C because over a 40 km distance, the change in temperature can reach a few degrees. This example illustrates how we can take advantage of the underlying dependencies between system responses for different observed covariates $x$, for predicting the responses for unobserved covariates.
Note: As for linear regression (see §8.1), covariates $x$ are assumed to be deterministic constants that are free from observation errors.
Let us consider system responses defined by a single covariate $x_i$ so that
$$\underbrace{g_i}_{\text{observation}} = \overbrace{g(x_i)}^{\text{model}}.$$
The data set is defined as $\mathcal{D} = \{(x_i, g_i),\ \forall i \in \{1\!:\!\mathtt{D}\}\} = \{\mathcal{D}_x, \mathcal{D}_g\}$, where for $\mathtt{D}$ observations, $\mathcal{D}_g = \{\mathbf{g}\} = \{[g_1\ g_2\ \cdots\ g_{\mathtt{D}}]^\intercal\}$ and $\mathcal{D}_x = \{\mathbf{x}\} = \{[x_1\ x_2\ \cdots\ x_{\mathtt{D}}]^\intercal\}$, so for each observed $g_i$, there is an associated covariate $x_i$. A Gaussian process is a multivariate Normal random variable defined over a domain described by covariates. In the case of Gaussian process regression, a set of system responses is assumed to be a realization from a Gaussian process $g(\mathbf{x}): \mathbf{G}(\mathbf{x}) \sim \mathcal{N}(g(\mathbf{x}); \boldsymbol{\mu}_G, \boldsymbol{\Sigma}_G)$, where $\boldsymbol{\mu}_G$ is the mean vector, $[\boldsymbol{\mu}_G]_i = \mu_G(x_i)$, and the covariance matrix $\boldsymbol{\Sigma}_G$ is
$$[\boldsymbol{\Sigma}_G]_{ij} = \rho(x_i, x_j)\cdot\sigma_G(x_i)\cdot\sigma_G(x_j),\qquad \rho(x_i, x_j) = \mathrm{fct}(x_i - x_j).$$
$\boldsymbol{\Sigma}_G$ defines the covariance between a discrete set of Gaussian random variables for which the pairwise correlation $\rho(x_i, x_j)$ between $G(x_i)$ and $G(x_j)$ is a function of the distance between the covariates $x_i$ and $x_j$. Note that $\boldsymbol{\mu}_G$ and $\boldsymbol{\Sigma}_G$ depend implicitly on the covariates $\mathbf{x}$.
Figure 8.15: Representation of the prior knowledge for the temperature along a pipeline. The dashed horizontal red line represents the prior mean, and the pink-shaded area is the $\pm 2\sigma_G$ confidence region. Overlaid colored curves represent five realizations of the Gaussian process, $\mathbf{g}_i: \mathbf{G} \sim \mathcal{N}(g(\mathbf{x}); \boldsymbol{\mu}_G, \boldsymbol{\Sigma}_G)$.
Pipeline example (continued) Figure 8.15 illustrates our prior knowledge of the temperature for coordinates $\mathbf{x} = [0\ 1\ \cdots\ 100]^\intercal$ km,
$$\underbrace{\mathbf{G} \sim \mathcal{N}(g(\mathbf{x}); \boldsymbol{\mu}_G, \boldsymbol{\Sigma}_G)}_{\text{Prior knowledge}},\qquad [\boldsymbol{\Sigma}_G]_{ij} = \rho(x_i, x_j)\,\sigma_G^2,$$
where each colored curve represents a different realization of the Gaussian process $\mathbf{G}$. The prior means and standard deviations are assumed to be equal for all locations $x$, so $\mu_G(x) = \mu_G = 0$ °C and $\sigma_G(x_i) = \sigma_G = 2.5$ °C. The correlation between the temperature at two locations $G(x_i)$ and $G(x_j)$ is defined here by a square-exponential correlation function
$$\rho(x_i, x_j) = \exp\!\left(-\frac{1}{2}\frac{(x_i - x_j)^2}{\ell^2}\right),$$
where $\ell$ is the length-scale parameter. Figure 8.16 presents the square-exponential correlation function plotted for parameter values $\ell = \{1, 2, 5\}$.
Figure 8.16: Square-exponential correlation function.
In figure 8.15, for a length-scale of $\ell = 25$ km, the correlation for the three pairs of coordinates $x = \{30, 31, 70\}$ is
$$\left.\begin{aligned} \rho(x_1 = 30\ \text{km},\ x_2 = 31\ \text{km}) &= 0.999\\ \rho(x_1 = 30\ \text{km},\ x_3 = 70\ \text{km}) &= 0.278\\ \rho(x_2 = 31\ \text{km},\ x_3 = 70\ \text{km}) &= 0.296 \end{aligned}\right\}\ \text{for } \ell = 25\ \text{km}.$$
Figure 8.17 presents examples of realizations for the same Gaussian process using different length-scale parameters.
Figure 8.17: Examples of Gaussian process realizations for different length-scale parameter values: (a) $\ell = 1$ km; (b) $\ell = 10$ km; (c) $\ell = 100$ km. The dashed horizontal red line represents the prior mean, and the pink-shaded area is the $\pm 2\sigma_G$ confidence region.
8.2.1 Updating a GP Using Exact Observations
Let us consider $\mathcal{D} = \{(x_i, g_i),\ i \in \{1\!:\!\mathtt{D}\}\}$, a set of $\mathtt{D}$ exact observations (i.e., without observation errors), and $\mathbf{x}^* = [x_1^*\ x_2^*\ \cdots\ x_{\mathtt{P}}^*]^\intercal$, a vector of $\mathtt{P}$ covariates corresponding to either observed or unobserved locations where we want to obtain predictions. The goal is to obtain the posterior PDF $f(\mathbf{g}^*|\mathbf{x}^*, \mathcal{D})$. Before obtaining this posterior PDF, we have to define the joint prior knowledge for the system responses at both the observed $\mathbf{x}$ and prediction $\mathbf{x}^*$ locations,
$$\begin{bmatrix}\mathbf{G}\\ \mathbf{G}^*\end{bmatrix},\qquad \boldsymbol{\mu} = \begin{bmatrix}\boldsymbol{\mu}_G\\ \boldsymbol{\mu}^*\end{bmatrix},\qquad \boldsymbol{\Sigma} = \underbrace{\begin{bmatrix}\boldsymbol{\Sigma}_G & \boldsymbol{\Sigma}_{G*}\\ \boldsymbol{\Sigma}_{G*}^\intercal & \boldsymbol{\Sigma}^*\end{bmatrix}}_{\text{Prior knowledge}},$$
where $\boldsymbol{\mu}_G$ and $\boldsymbol{\mu}^*$ are, respectively, the prior means at observed and at prediction locations; $[\boldsymbol{\Sigma}_G]_{ij} = \rho(x_i, x_j)\sigma_G^2$ and $[\boldsymbol{\Sigma}^*]_{ij} = \rho(x_i^*, x_j^*)\sigma_G^2$ are their respective covariances. The covariance between observed and prediction locations is described by $[\boldsymbol{\Sigma}_{G*}]_{ij} = \rho(x_i, x_j^*)\sigma_G^2$. The posterior probability of $\mathbf{G}^*$ at prediction locations $\mathbf{x}^*$, conditioned on the observations $\mathcal{D}$, is given by $f_{\mathbf{G}^*|\mathbf{x}^*,\mathcal{D}}(\mathbf{g}^*|\mathbf{x}^*, \mathcal{D}) = \mathcal{N}(\mathbf{g}^*; \boldsymbol{\mu}^*_{|\mathcal{D}}, \boldsymbol{\Sigma}^*_{|\mathcal{D}})$, where the posterior mean vector and covariance matrix are
$$\underbrace{\begin{aligned}\boldsymbol{\mu}^*_{|\mathcal{D}} &= \boldsymbol{\mu}^* + \boldsymbol{\Sigma}_{G*}^\intercal\,\boldsymbol{\Sigma}_G^{-1}(\mathbf{g} - \boldsymbol{\mu}_G)\\ \boldsymbol{\Sigma}^*_{|\mathcal{D}} &= \boldsymbol{\Sigma}^* - \boldsymbol{\Sigma}_{G*}^\intercal\,\boldsymbol{\Sigma}_G^{-1}\boldsymbol{\Sigma}_{G*}\end{aligned}}_{\text{Posterior knowledge}}.$$
These two equations are analogous to those presented in §4.1.3 for the conditional distribution of a multivariate Normal ($\equiv$ Gaussian).
Nomenclature
$\mathbf{x} = [x_1\ x_2\ \cdots\ x_{\mathtt{D}}]^\intercal$: Covariates for $\mathtt{D}$ observed locations
$\mathbf{x}^* = [x_1^*\ x_2^*\ \cdots\ x_{\mathtt{P}}^*]^\intercal$: Covariates for $\mathtt{P}$ prediction locations
$\mathbf{g} = [g_1\ g_2\ \cdots\ g_{\mathtt{D}}]^\intercal$: System responses for $\mathtt{D}$ observed locations
$\mathbf{g}^* = [g_1^*\ g_2^*\ \cdots\ g_{\mathtt{P}}^*]^\intercal$: Model predictions for $\mathtt{P}$ locations
$\boldsymbol{\Sigma}_G$: Prior covariance of $\mathbf{G}$ for observed locations $\mathbf{x}$
$\boldsymbol{\Sigma}^*$: Prior covariance of $\mathbf{G}^*$ for prediction locations $\mathbf{x}^*$
$\boldsymbol{\Sigma}_{G*}$: Prior covariance of $\mathbf{G}$ between observed and prediction locations, $\mathbf{x}$ and $\mathbf{x}^*$, respectively
Strengths and limitations One of the main differences between Gaussian process regression and other regression methods that encode the relationships contained in the data set through a parametric model is that GPR is a nonparametric approach, for which predictions depend explicitly on the observations available in the data set. The upside, in comparison with other methods such as linear regression, is that there is no requirement to define basis functions that are compatible with the system responses. GPR thus allows modeling highly nonlinear responses using few parameters. Another key feature of GPR is that it is an intrinsically probabilistic method that allows differentiating between the prediction quality in interpolation and extrapolation situations. For example, figures 8.19c and 8.19d show how interpolating between observations leads to a narrower confidence region than when extrapolating. One of the main downsides of Gaussian process regression is that when the data set is large ($\mathtt{D} > 1000$), building the matrices describing the joint prior knowledge becomes computationally demanding, and it is then necessary to employ approximate methods.²
² Quiñonero-Candela, J. and C. E. Rasmussen (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research 6, 1939-1959
Pipeline example (continued) We consider the case where the pipeline temperature $g_2 = -2$ °C is observed for the covariate $x_2$ and we want to use this information to obtain the posterior PDF describing the temperature $G_1$ at a location $x_1$. Given $\mu_G = 0$ °C, $\sigma_G = 2.5$ °C, and $\rho(x_1, x_2) = 0.8$, the joint PDF describing the prior probability of the temperature at locations $x_1$ and $x_2$ is defined by
$$f_{G_1 G_2}(g_1, g_2) = \mathcal{N}(\mathbf{g}; \boldsymbol{\mu}_G, \boldsymbol{\Sigma}_G)\quad \begin{cases}\boldsymbol{\mu}_G = [0\ 0]^\intercal\\[2pt] \boldsymbol{\Sigma}_G = \begin{bmatrix}2.5^2 & 0.8\cdot 2.5^2\\ 0.8\cdot 2.5^2 & 2.5^2\end{bmatrix}.\end{cases}$$
The conditional probability of $G_1$ given the observation $g_2$ is
$$f_{G_1|g_2}(g_1|g_2) = \frac{f_{G_1 G_2}(g_1, g_2)}{f_{G_2}(g_2)} = \mathcal{N}\big(g_1; \mu_{1|2}, \sigma_{1|2}^2\big),$$
where the posterior mean and standard deviation are
$$\begin{aligned}\mu_{1|2} &= \mu_1 + \rho\,\sigma_1\,\frac{g_2 - \mu_2}{\sigma_2} = 0 + 0.8\cdot 2.5\cdot\frac{-2 - 0}{2.5} = -1.6\ \text{°C}\\ \sigma_{1|2} &= \sigma_1\sqrt{1 - \rho^2} = 2.5\sqrt{1 - 0.8^2} = 1.5\ \text{°C}.\end{aligned}$$
Figure 8.18: Joint Gaussian prior knowledge of the temperature at two locations, $f_{G_1,G_2}(g_1, g_2)$, and conditional PDF for the temperature $g_1$, given the observation $g_2 = -2$, $f_{G_1|g_2}(g_1|g_2)$.
Again, these last two equations are analogous to those presented in §4.1.3. The joint prior and posterior PDFs are presented in figure 8.18. Figure 8.19 presents the generalization of this update process for $\mathtt{P} = 101$ predicted locations $\mathbf{x}^* = [0\ 1\ \cdots\ 100]^\intercal$ km and using, respectively, $\mathtt{D} = \{0, 1, 2, 3\}$ observations. Note how the uncertainty reduces to zero at locations where observations are available. Realizations of the updated Gaussian process are then required to pass through all observed values. The correlation embedded in the joint prior knowledge allows sharing the information carried by observations for predicting the system responses at adjacent unobserved covariate values.
Figure 8.19: Examples of how the Gaussian process describing the pipeline temperature evolves as it is updated with different data sets $\mathcal{D}$: (a) no observations; (b) one observation at $x = 31$ km; (c) two observations at $x = \{31, 70\}$ km; (d) three observations at $x = \{31, 70, 30\}$ km.
8.2.2 Updating a GP Using Imperfect Observations
We now explore how to extend the formulation presented in the previous section to the case where observations are contaminated by noise. The observation model is now defined as
$$\underbrace{y}_{\text{observation}} = \overbrace{g(x)}^{\text{model}} + \underbrace{v}_{\text{measurement error}},$$
where $v: V \sim \mathcal{N}(v; 0, \sigma_V^2)$ describes zero-mean independent observation errors so that $V_i \perp\!\!\!\perp V_j\ \forall i \neq j$. Because $g(x): G(x)$ is now indirectly observed through $y$, we have to redefine our prior knowledge between the observations $\mathbf{Y}$ and the model responses $\mathbf{G}^*$ at prediction locations. The joint prior knowledge is described by
$$\begin{bmatrix}\mathbf{Y}\\ \mathbf{G}^*\end{bmatrix},\qquad \mathbf{m} = \begin{bmatrix}\boldsymbol{\mu}_Y\\ \boldsymbol{\mu}_{G*}\end{bmatrix},\qquad \boldsymbol{\Sigma} = \underbrace{\begin{bmatrix}\boldsymbol{\Sigma}_Y & \boldsymbol{\Sigma}_{Y*}\\ \boldsymbol{\Sigma}_{Y*}^\intercal & \boldsymbol{\Sigma}^*\end{bmatrix}}_{\text{prior knowledge}}.$$
In the case where the observation noise has a mean equal to zero, the mean vector for observations equals the prior mean for the system responses, $\boldsymbol{\mu}_Y = \boldsymbol{\mu}_G$. The global covariance matrix is then defined using
$$\begin{aligned}\ [\boldsymbol{\Sigma}_Y]_{ij} &= \rho(x_i, x_j)\sigma_G^2 + \sigma_V^2\,\delta_{ij}, &&\text{where } \delta_{ij} = 1 \text{ if } i = j\\ [\boldsymbol{\Sigma}^*]_{ij} &= \rho(x_i^*, x_j^*)\sigma_G^2, && \phantom{\text{where }}\delta_{ij} = 0 \text{ if } i \neq j\\ [\boldsymbol{\Sigma}_{Y*}]_{ij} &= \rho(x_i, x_j^*)\sigma_G^2.\end{aligned}$$
Given the prediction locations $\mathbf{x}^*$ and the observations $\mathcal{D}$, the posterior knowledge for the system responses $\mathbf{G}^*$ is
$$f_{\mathbf{G}^*|\mathbf{x}^*,\mathcal{D}}(\mathbf{g}^*|\mathbf{x}^*, \mathcal{D}) = \mathcal{N}(\mathbf{g}^*; \boldsymbol{\mu}^*_{|\mathcal{D}}, \boldsymbol{\Sigma}^*_{|\mathcal{D}}),$$
where the posterior mean vector $\boldsymbol{\mu}^*_{|\mathcal{D}}$ and covariance matrix $\boldsymbol{\Sigma}^*_{|\mathcal{D}}$ are
$$\underbrace{\begin{aligned}\boldsymbol{\mu}^*_{|\mathcal{D}} &= \boldsymbol{\mu}_{G*} + \boldsymbol{\Sigma}_{Y*}^\intercal\,\boldsymbol{\Sigma}_Y^{-1}(\mathbf{y} - \boldsymbol{\mu}_Y)\\ \boldsymbol{\Sigma}^*_{|\mathcal{D}} &= \boldsymbol{\Sigma}^* - \boldsymbol{\Sigma}_{Y*}^\intercal\,\boldsymbol{\Sigma}_Y^{-1}\boldsymbol{\Sigma}_{Y*}\end{aligned}}_{\text{posterior knowledge}}.$$
Figure 8.20: Examples of how the Gaussian process describing the pipeline temperature evolves as it is updated with different data sets $\mathcal{D}$, with an observation standard deviation $\sigma_V = 1$ °C: (a) no observations; (b) one noisy observation at $x = 31$ km; (c) two noisy observations at $x = \{31, 70\}$ km; (d) three noisy observations at $x = \{31, 70, 30\}$ km.
Pipeline example (continued) We continue our study of the pipeline example by considering noisy observations characterized by a measurement-error standard deviation $\sigma_V = 1$ °C. Figure 8.20 presents the updated Gaussian process for $\mathtt{P} = 101$ prediction locations $\mathbf{x}^* = [0\ 1\ \cdots\ 100]^\intercal$ km, and using, respectively, $\mathtt{D} = \{0, 1, 2, 3\}$ noisy observations. Notice how, contrarily to the results presented in figure 8.19, uncertainty now remains in the posterior at observed locations. The larger $\sigma_V$ is, the less information observations carry and the less they modify the prior knowledge.
8.2.3 Multiple Covariates
$$\underbrace{y_i}_{\text{observation}} = \overbrace{g(\mathbf{x}_i)}^{\text{model}} + \underbrace{v}_{\text{measurement error}},\qquad \underbrace{\mathbf{x}_i = [x_1\ x_2\ \cdots\ x_{\mathtt{X}}]_i^\intercal}_{\text{covariates}}$$
$$\mathcal{D} = \{(\mathbf{x}_i, y_i),\ \forall i \in \{1\!:\!\mathtt{D}\}\} = \{\mathcal{D}_x, \mathcal{D}_y\}$$
In the case where the regression is performed for multiple covariates $\mathbf{x} = [x_1\ x_2\ \cdots\ x_{\mathtt{X}}]^\intercal$, it is necessary to define a joint correlation function defining the correlation coefficient associated with a pair of covariate vectors. Typically, the joint correlation function is obtained from the product of a marginal correlation function $\rho([\mathbf{x}_i]_k, [\mathbf{x}_j]_k)$ for each covariate,
$$\rho(\mathbf{x}_i, \mathbf{x}_j) = \prod_{k=1}^{\mathtt{X}} \rho\big([\mathbf{x}_i]_k, [\mathbf{x}_j]_k\big).$$
The multivariate formulation for the square-exponential correlation function is
$$\rho(\mathbf{x}_i, \mathbf{x}_j) = \prod_{k=1}^{\mathtt{X}} \exp\!\left(-\frac{1}{2}\frac{\big([\mathbf{x}_i]_k - [\mathbf{x}_j]_k\big)^2}{\ell_k^2}\right) \equiv \exp\!\left(-\frac{1}{2}(\mathbf{x}_i - \mathbf{x}_j)^\intercal\,\mathrm{diag}(\boldsymbol{\ell})^{-2}\,(\mathbf{x}_i - \mathbf{x}_j)\right),$$
where there is one parameter $\ell_k$ for each covariate. Figure 8.21 presents an example of a bivariate square-exponential correlation function.
Figure 8.21: Bivariate square-exponential correlation function $\rho(\mathbf{x}_i, \mathbf{x}_j)$ with parameters $\ell_1 = 1$ and $\ell_2 = 1.5$.
One relevant feature of this correlation function is that the length-scales $\ell_k$ directly measure the importance of each covariate for predicting the system responses. As a simple heuristic, we can say that if $\ell_k$ is greater than two times the empirical standard deviation of the observed covariates $\sigma_{X_k}$, then the $k^{\text{th}}$ covariate has a limited influence on the system response predictions. The reason is that the correlation coefficient remains close to one for any pairs of this $k^{\text{th}}$ covariate. In the opposite case, where $\ell_k$ is smaller than the empirical standard deviation $\sigma_{X_k}$, we can conclude that the $k^{\text{th}}$ covariate has a strong influence on the system response predictions.
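A sketch of this product-form correlation for covariate vectors, with one length-scale per covariate, could be written as follows (function and variable names are arbitrary):

```python
import numpy as np

def ard_sq_exp_corr(Xa, Xb, lengths):
    """rho(x_i, x_j) = prod_k exp(-0.5 ((x_ik - x_jk) / l_k)^2) for rows of Xa and Xb."""
    Xa = np.atleast_2d(Xa); Xb = np.atleast_2d(Xb)
    d = (Xa[:, None, :] - Xb[None, :, :]) / np.asarray(lengths)  # scaled differences
    return np.exp(-0.5 * np.sum(d**2, axis=-1))

# Example: two covariates with length-scales 1 and 1.5 (as in figure 8.21)
Xa = np.array([[0.0, 0.0], [1.0, 2.0]])
R = ard_sq_exp_corr(Xa, Xa, lengths=[1.0, 1.5])
```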
8.2.4 Parameter Estimation
Up to now, we have assumed that we knew the parameters such as the observation standard deviation $\sigma_V$, the Gaussian process prior standard deviation $\sigma_G$, and the length-scales $\boldsymbol{\ell} = [\ell_1\ \ell_2\ \cdots\ \ell_{\mathtt{X}}]^\intercal$. In practice, we need to learn these parameters $\boldsymbol{\theta} = [\sigma_V\ \sigma_G\ \boldsymbol{\ell}^\intercal]^\intercal$ from the training data $\mathcal{D} = \{\mathcal{D}_x, \mathcal{D}_y\} = \{(\mathbf{x}_i, y_i),\ i \in \{1\!:\!\mathtt{D}\}\}$.
Note: $\boldsymbol{\theta} = [\sigma_V\ \sigma_G\ \boldsymbol{\ell}^\intercal]^\intercal$ can be referred to as hyperparameters because they are parameters of the prior.
Following Bayes rule, the posterior PDF for parameters given data is
$$\underbrace{f(\boldsymbol{\theta}|\mathcal{D})}_{\text{posterior}} = \frac{\overbrace{f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta})}^{\text{likelihood}}\cdot\overbrace{f(\boldsymbol{\theta})}^{\text{prior}}}{\underbrace{f(\mathcal{D}_y)}_{\text{constant}}}.$$
Often, evaluating $f(\boldsymbol{\theta}|\mathcal{D}_x, \mathcal{D}_y)$ is computationally expensive, so a maximum likelihood estimation approach (see §6.7) is typically employed. The optimal parameter values $\boldsymbol{\theta}^*$ are then
$$\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}}\ \overbrace{f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta})}^{\text{likelihood}} \equiv \arg\max_{\boldsymbol{\theta}}\ \overbrace{\ln f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta})}^{\text{log-likelihood}}.$$
The likelihood function $f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta})$ describes the joint prior probability density of observations $\mathcal{D}_y$, given a set of parameters $\boldsymbol{\theta}$,
Note: Theoretically, here, we should refer to $f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta})$ as the marginal likelihood because the hidden variables $\mathbf{g}$ are marginalized.
$$f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta}) = \int f(\mathcal{D}_y|\mathbf{g})\cdot f(\mathbf{g}|\mathcal{D}_x, \boldsymbol{\theta})\,d\mathbf{g} = \mathcal{N}(\mathcal{D}_y; \boldsymbol{\mu}_Y, \boldsymbol{\Sigma}_Y).$$
The log-likelihood for a vector of observations $\mathbf{y} = [y_1\ y_2\ \cdots\ y_{\mathtt{D}}]^\intercal$ is given by
$$\begin{aligned}\ln f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta}) &= \ln\mathcal{N}(\mathbf{y}; \boldsymbol{\mu}_Y, \boldsymbol{\Sigma}_Y)\\ &= -\frac{1}{2}(\mathbf{y} - \boldsymbol{\mu}_Y)^\intercal\boldsymbol{\Sigma}_Y^{-1}(\mathbf{y} - \boldsymbol{\mu}_Y) - \frac{1}{2}\ln|\boldsymbol{\Sigma}_Y| - \frac{\mathtt{D}}{2}\ln 2\pi.\end{aligned}$$
Figure 8.22: Log-likelihood maximization.
The optimal parameters $\boldsymbol{\theta}^*$ correspond to the parameter values for which the derivative of the likelihood equals zero, as illustrated in figure 8.22. For $\boldsymbol{\mu}_Y = \mathbf{0}$, the derivative of the log-likelihood is
$$\frac{\partial}{\partial\theta_j}\ln f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta}) = \frac{1}{2}\mathbf{y}^\intercal\boldsymbol{\Sigma}_Y^{-1}\frac{\partial\boldsymbol{\Sigma}_Y}{\partial\theta_j}\boldsymbol{\Sigma}_Y^{-1}\mathbf{y} - \frac{1}{2}\mathrm{tr}\!\left(\boldsymbol{\Sigma}_Y^{-1}\frac{\partial\boldsymbol{\Sigma}_Y}{\partial\theta_j}\right).$$
Efficient gradient-based optimization algorithms for learning parameters by maximizing the log-likelihood are already implemented in open-source packages such as the GPML³, GPstuff⁴, and pyGPs⁵ toolboxes.
³ Rasmussen, C. E. and H. Nickisch (2010). Gaussian processes for machine learning (GPML) toolbox. The Journal of Machine Learning Research 11, 3011-3015
⁴ Vanhatalo, J., J. Riihimäki, J. Hartikainen, P. Jylänki, V. Tolvanen, and A. Vehtari (2013). GPstuff: Bayesian modeling with Gaussian processes. Journal of Machine Learning Research 14, 1175-1179
⁵ Neumann, M., S. Huang, D. E. Marthaler, and K. Kersting (2015). pyGPs: A python library for Gaussian process regression and classification. The Journal of Machine Learning Research 16(1), 2611-2616
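As an alternative to those toolboxes, the negative log-likelihood above can be written directly and handed to a general-purpose optimizer. The sketch below does so with `scipy.optimize.minimize` for the single-covariate case, parameterizing $\{\sigma_V, \sigma_G, \ell\}$ in log space to keep them positive; it is an illustration under assumptions (zero prior mean, square-exponential correlation, synthetic data), not the implementation used in the cited packages.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(log_theta, x, y):
    """-ln N(y; 0, Sigma_Y) with theta = (sigma_V, sigma_G, ell), zero prior mean."""
    sigma_V, sigma_G, ell = np.exp(log_theta)
    d = x[:, None] - x[None, :]
    Sigma_Y = sigma_G**2 * np.exp(-0.5 * (d / ell)**2) + sigma_V**2 * np.eye(x.size)
    L = np.linalg.cholesky(Sigma_Y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))      # Sigma_Y^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return 0.5 * y @ alpha + 0.5 * log_det + 0.5 * x.size * np.log(2 * np.pi)

# Synthetic 1-D data (for illustration only)
x_obs = np.linspace(0, 10, 25)
y_obs = np.sin(x_obs) + 0.2 * np.random.randn(x_obs.size)

# theta* = arg min -ln f(D_y | D_x, theta), from an arbitrary starting point
res = minimize(neg_log_likelihood, x0=np.log([0.5, 1.0, 1.0]), args=(x_obs, y_obs))
sigma_V, sigma_G, ell = np.exp(res.x)
```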
8.2.5 Example: Soil Contamination Characterization
This section presents an applied example of GPR for the characterization of contaminant concentration in soils. Take, for example, a set of observations $y_i$ of soil contaminant concentrations,
$$\mathcal{D} = \{\mathcal{D}_x, \mathcal{D}_y\} = \{(\mathbf{l}_i, y_i),\ i \in \{1\!:\!116\}\},$$
where $\mathbf{l}_i$ is a covariate vector associated with each concentration observation. These covariates denote the geographic coordinates (longitude, latitude, and depth) of each observation,
$$\mathbf{l}_i = \underbrace{[x\ y\ z]_i^\intercal}_{\text{geodesic coordinates}}.$$
Each observation is represented in figure 8.23 by a circle where its size is proportional to the contaminant concentration. The goal is to employ Gaussian process regression in order to model the contaminant concentration across the whole soil volume, given the available observations for 116 discrete locations.
Figure 8.23: Example of application of GPR for soil contamination characterization. The iso-contours correspond to $\Pr([\text{contaminant}] > 55\ \text{mg/kg}) = 0.5$, and the size of each circle is proportional to the observed contaminant concentration. (Adapted from Quach et al. (2017).)
Because a concentration must be a positive number, and because several of the observations available are close to zero, the observation model is defined in the log-space,
$$\underbrace{\ln y}_{\text{observation}} = \overbrace{g(\mathbf{l})}^{\text{true contaminant concentration in log space}} + \underbrace{v}_{\text{meas. error in log space}},$$
where $g(\mathbf{l})$ models the natural logarithm of the true contaminant concentration. When transformed back to the original space using the exponential function, the model predictions are restricted to positive values. The observation errors are assumed to be described by the zero-mean independent Normal random variable $v: V \sim \mathcal{N}(v; 0, \sigma_V^2)$, $V_i \perp\!\!\!\perp V_j,\ \forall i \neq j$. The correlation between covariates is modeled using a square-exponential correlation function, and the prior mean at any location is assumed to be equal to one in the original space, which corresponds to zero in the log-transformed space.
The parameters to be estimated from the data are the length-scale along each direction, the Gaussian process standard deviation, and the observation noise standard deviation,
$$\boldsymbol{\theta} = \{\ell_x, \ell_y, \ell_z, \sigma_G, \sigma_V\}.$$
The maximum likelihood estimate (MLE) parameter values are
$$\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}}\ \overbrace{\ln f(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta})}^{\text{log-likelihood}} = \left\{\begin{aligned}\ell_x^* &= 56.4\ \text{m}\\ \ell_y^* &= 53.7\ \text{m}\\ \ell_z^* &= 0.64\ \text{m}\\ \sigma_G^* &= 2.86\\ \sigma_V^* &= 0.18\end{aligned}\right\}.$$
The optimal values for the length-scales $\{\ell_x, \ell_y, \ell_z\}$ are all smaller than the range of each covariate, as depicted in figure 8.23. This confirms that all three coordinates have an influence on the contaminant concentration. Figure 8.23 presents the iso-contours describing $\Pr([\text{contaminant}] > 55\ \text{mg/kg}) = 0.5$, where $\mathbf{l}^*$ are the covariates describing the $x, y, z$ geographic coordinates of a 3-D grid covering the entire soil volume for the site studied. Further details about this example are presented in the paper by Quach et al.⁶
Note — Covariate importance metric: if $\ell_k \gg \sigma_{X_k}$: $\mathrm{Imp}(X_k \to y) \approx 0$; if $\ell_k < \sigma_{X_k}$: $\mathrm{Imp}(X_k \to y) \gg 0$.
⁶ Quach, A. N. O., D. Tabor, L. Dumont, B. Courcelles, and J.-A. Goulet (2017). A machine learning approach for characterizing soil contamination in the presence of physical site discontinuities and aggregated samples. Advanced Engineering Informatics 33, 60-67
8.2.6 Example: Metamodel
In civil engineering, it is common to employ computationally demanding models such as the finite element models presented in figure 8.24. When performing probabilistic studies related to structural or system reliability, it is often required to have millions of model realizations for different parameter values. In such a case, for the purpose of computational efficiency, it may be interesting to replace the finite element model with a regression model built from a set of simulated responses. Such a model of a model is called a metamodel or a surrogate model. Because observations are the output of a simulation, the observation model does not include any observation error and follows the formulation presented in §8.2.1.
$$\underbrace{g_i}_{\text{simulation}} = \overbrace{g(\mathbf{x}_i)}^{\text{model}},\qquad \underbrace{\mathbf{x}_i = [x_1\ x_2\ \cdots\ x_{\mathtt{X}}]_i^\intercal}_{\text{covariates}}$$
Figure 8.24: Examples of finite element structural models.
In this example, we built a metamodel for the Tamar Bridge finite element model.⁷ Figure 8.25a presents an overview of the bridge finite element model, and figure 8.25b presents the modal displacement associated with its first natural frequency, which is the response modeled in this example.
⁷ Westgate, R. J. and J. M. W. Brownjohn (2011). Development of a Tamar Bridge finite element model. In Conference Proceedings of the Society for Experimental Mechanics, Volume 5, pp. 13-20. Springer
Figure 8.25: Tamar Bridge finite element model: (a) model representation; (b) 1st vibration mode.
A set of $\mathtt{D} = 1000$ model simulations is obtained by evaluating the finite element model for several sets of parameters that are generated from the prior $f(\boldsymbol{\theta})$. There are five model parameters that are treated as covariates of the GP regression model, $\mathbf{x}_i = [x_1\ x_2\ \cdots\ x_5]_i^\intercal$: the stiffness of the three expansion joints located on the bridge deck, along with the initial strain in the cables on the main and side spans. The locations of elements affected by these parameters are illustrated in figure 8.26.
Figure 8.26: Tamar Bridge model parameter description: Saltash side deck, Plymouth side deck, and Saltash tower deck expansion joint stiffnesses; main cable and sidespan cable initial strains.
This problem employs the square-exponential correlation function, so there is a total of six parameters to be learned,
$$\boldsymbol{\theta} = \{\ell_1, \ell_2, \ell_3, \ell_4, \ell_5, \sigma_G\}.$$
Figure 8.27 compares the GPR model predictions with the true finite element model outputs $g_i$. Note that in order to obtain a meaningful comparison between predicted and measured values, it is essential to test the model iteratively using a cross-validation procedure, where at each step, the observation corresponding to the prediction location in the validation set is removed from the training set.
Figure 8.27: Comparison of the GPR model predictions $\mu^*_{|\mathcal{D},i} \pm \sigma^*_{|\mathcal{D},i}$ with the finite-element model predictions $g_i$ using the leave-one-out cross-validation procedure.
8.2.7 Advanced Considerations
There are several aspects that should be considered in order to take advantage of the full potential of Gaussian process regression. Key aspects are related to (1) the definition of the prior mean functions, (2) the choice of covariance and correlation functions, and (3) the formulation required for modeling cases involving heteroscedastic errors.
Prior mean functions In the examples presented in §8.2, the prior mean was always assumed to be a known constant. If needed, this assumption can be relaxed by treating the prior mean as a parameterized function to be inferred from data. In its simplest case, the mean can be assumed to be a single unknown constant. In more advanced cases we could describe our prior knowledge of the mean as a parameterized function that depends on covariates $\mathbf{x}$. For example, if there is a single covariate, the mean could be modeled by the affine function $m(x) = ax + b$, where $\{a, b\}$ are parameters that need to be estimated from data. When using complex functions to describe the prior mean, it is essential to keep in mind that this procedure is sensitive to overfitting in a similar way as linear regression (see §8.1.2). Employing a highly flexible parameterization may lead to an increased log-likelihood in the training set, yet it may not materialize in the prediction performance evaluated on the test set. It is therefore essential to choose the best prior parameterization using methods such as cross-validation (see §8.1.2) or Bayesian model selection (see §6.8).
Covariance functions In this section, only the square-exponential correlation function was described; nonetheless, several other choices exist. In his thesis, Duvenaud⁸ presents several covariance models. A common one is the Matern-class correlation function, for which a special case is defined as
$$\rho(x_i, x_j) = \exp\!\left(-\frac{|x_i - x_j|}{\ell}\right).$$
Figure 8.28 represents this correlation function for several length-scale values.
⁸ Duvenaud, D. (2014). Automatic model construction with Gaussian processes. PhD thesis, University of Cambridge
Figure 8.28: An example of Matern-class correlation function.
When using the square-exponential correlation function with covariate values that are close to each other, it is common to obtain correlation coefficients $\rho(\mathbf{x}_i, \mathbf{x}_j)$ close to one. This has the adverse effect of leading to singular or near-singular covariance matrices (see §2.4.2). Matern-class correlation functions can mitigate this problem because of their rapid reduction of the correlation coefficient with an increase in the absolute distance between covariates.
Another correlation function particularly useful for engineering applications is the periodic one, defined as
$$\rho(x_i, x_j) = \exp\!\left(-\frac{2}{\ell^2}\sin^2\!\left(\frac{x_i - x_j}{p}\pi\right)\right).$$
Figure 8.29: Examples of periodic correlation functions: (a) $p = 5$, $\ell = 1$; (b) $p = 5$, $\ell = 0.5$; (c) $p = 10$, $\ell = 1$.
Figure 8.29 presents examples of the periodic correlation function for periods $p = \{5, 10\}$ and length-scale parameters $\ell = \{0.5, 1\}$. As its name indicates, this function is suited for modeling periodic phenomena, where events spaced by a distance close to the period $p$ will have a high correlation. The length-scale $\ell$ controls how fast the correlation decreases as the distance between covariates gets away from an integer that is a multiple of the period $p$.
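A sketch of these two alternative correlation functions, following the definitions above (function names are arbitrary):

```python
import numpy as np

def matern_exp_corr(xi, xj, ell):
    """Special case of the Matern-class correlation: exp(-|x_i - x_j| / ell)."""
    return np.exp(-np.abs(xi - xj) / ell)

def periodic_corr(xi, xj, ell, p):
    """Periodic correlation: exp(-(2 / ell^2) * sin^2(pi (x_i - x_j) / p))."""
    return np.exp(-2.0 / ell**2 * np.sin(np.pi * (xi - xj) / p) ** 2)

# Example values matching figure 8.29a: p = 5, ell = 1
r = periodic_corr(np.array([0.0, 2.5, 5.0]), 0.0, ell=1.0, p=5.0)
```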
Figure 8.30 presents an example of the periodic correlation function employed to model the historical daily average temperatures recorded at Montreal airport from 1953 to 1960. The blue circles represent the training data that were employed to learn the parameters: the prior mean value $\mu$, the Gaussian process prior standard deviation $\sigma_G$, the observation standard deviation $\sigma_V$, and the period $p$. The pink crosses represent the test data that was not employed to perform predictions or to learn parameters. Most observations from the test set lie within the $\pm 2\sigma$ posterior interval. It shows how Gaussian process regression can, even with such a simple model with only four parameters, capture the temperature pattern as well as its uncertainty.
Figure 8.30: Example of application of a periodic covariance function to model temperature data; the shaded region represents $\mu^*_{|\mathcal{D}} \pm 2\sqrt{\sigma^{2*}_{|\mathcal{D}} + \sigma_V^2}$.
For more complex problems, highly complex covariance structures can be built by adding together simpler ones, such as those described in this section (see Rasmussen and Williams⁹ for details).
⁹ Rasmussen, C. E. and C. K. Williams (2006). Gaussian processes for machine learning. MIT Press
The choice regarding which correlation or covariance structure is most appropriate should, as for the prior mean functions, be based on methods such as cross-validation (see §8.1.2) or Bayesian model selection (see §6.8).
Homo- and heteroscedasticity So far, this section has only treated cases associated with homoscedastic errors such as the generic case presented in figure 8.31a. In engineering applications, it is common to be confronted with cases where the variances $\sigma_V^2$ or $\sigma_G^2$ depend on the covariate values. A generic example of heteroscedastic cases is presented in figure 8.31b, which has been modeled by considering that the observation-error standard deviation $\sigma_V$ is itself modeled as a Gaussian process that depends on covariate values.¹⁰ This method is implemented in the GPstuff¹¹ toolbox.
¹⁰ Tolvanen, V., P. Jylänki, and A. Vehtari (2014). Expectation propagation for nonstationary heteroscedastic Gaussian process regression. In IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 1-6. IEEE
¹¹ Vanhatalo, J., J. Riihimäki, J. Hartikainen, P. Jylänki, V. Tolvanen, and A. Vehtari (2013). GPstuff: Bayesian modeling with Gaussian processes. Journal of Machine Learning Research 14, 1175-1179
Figure 8.31: Examples of (a) homoscedastic and (b) heteroscedastic cases where the shaded regions represent $\mu_Y \pm 2\sigma_Y$.
8.3 Neural Networks
The idea behind neural networks originates from the perceptron model¹² developed in the 1950s. Neural networks became extremely popular in the period from 1980 until the mid-1990s, when they became overshadowed by other emerging methods. It was not until the early 2010s, with the appellation of deep learning, that they again took the leading role in the development of machine learning methods. In deep learning, the label deep refers to the great number of layers in the model. The rise of deep learning as the state-of-the-art method was caused by three main factors:¹³ (1) the increasing size of the data sets to train on, (2) the increase in computational capacity brought by graphical processing units (GPUs), and (3) the improvements in the methods allowing to train deeper models. One limitation is that in order to achieve the performance it is renowned for, deep learning requires large data sets containing from thousands to millions of labeled examples for covariates $\mathbf{x}_i$ and system responses $y_i$. In the fields of computer vision, speech recognition, and genomics, it is common to deal with data sets of such a large size; in the civil engineering context, it is not yet the case.
¹² Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65(6), 386
¹³ Goodfellow, I., Y. Bengio, and A. Courville (2016). Deep learning. MIT Press
This section presents the basic formulation for the simplest neural network architecture: the feedforward neural network. The reader interested in a detailed review of neural networks and their numerous architectures, such as convolutional neural networks (CNN), generative adversarial networks (GAN), and recurrent neural networks (RNN, LSTM), should refer to specialized textbooks such as the one by Goodfellow, Bengio, and Courville (2016).
8.3.1 Feedforward Neural Networks
Feedforward is the simplest architecture for neural networks, where information is transferred from the input layer to the output layer by propagating it into layers of hidden variables. The idea is to approximate complex functions by a succession of simple combinations of hidden variables organized in layers.
Figure 8.32: Representation of linear regression as a feedforward neural network. The leftmost nodes represent covariates, the white node is a hidden variable, and the rightmost node represents the observation.
Linear regression The linear regression method presented in §8.1 is employed to introduce the concepts surrounding feedforward neural networks. Figure 8.32 presents a regression problem where three covariates $\mathbf{x} = [x_1\ x_2\ x_3]^\intercal$ are associated with a single observed system response $y = z + v$, $v: V \sim \mathcal{N}(v; 0, \sigma_V^2)$, where $v$ is a realization of a zero-mean normal observation error. $z$ represents a hidden variable, which is defined by the linear combination of covariate values using three weight parameters and one bias parameter,
$$z = w_1 x_1 + w_2 x_2 + w_3 x_3 + b.$$
In figure 8.32, the leftmost nodes represent covariates, the white node is a hidden variable, and the rightmost node is the observation $y_i: Y \sim \mathcal{N}(y; z, \sigma_V^2)$. Arrows represent the dependencies defined by the weight and bias parameters $\{w_1, w_2, w_3, b\}$. Weight parameter $w_i$ describes the slope of a plane with respect to each variable, and $b$ the plane's intercept (e.g., see figure 2.2). With the current configuration, this model is limited because it is only a linear function, that is, the hidden variable $z$ is a linear function with respect to covariates $\mathbf{x}$.
Linear regression with multiple layers Now we explore the effect of modifying the model as presented in figure 8.33 by adding multiple hidden variables, where each one is modeled as a linear combination of either the covariates or the hidden variables on the preceding layer. First, let us define the notation employed to describe this network: $z_i^{(l)}$ represents the $i^{\text{th}}$ hidden variable on the $l^{\text{th}}$ layer. In the first layer, the two hidden variables are defined by
$$\begin{aligned} z_1^{(1)} &= w_{1,1}^{(0)}x_1 + w_{1,2}^{(0)}x_2 + w_{1,3}^{(0)}x_3 + b_1^{(0)}\\ z_2^{(1)} &= w_{2,1}^{(0)}x_1 + w_{2,2}^{(0)}x_2 + w_{2,3}^{(0)}x_3 + b_2^{(0)},\end{aligned}$$
Figure 8.33: Representation of linear regression using multiple layers. The leftmost nodes represent covariates, the white nodes the hidden variables, and the rightmost node the observation.
Nomenclature
$z_i^{(l)}$: the $i^{\text{th}}$ hidden variable on the $l^{\text{th}}$ layer
$w_{i,j}^{(l)}$: the weight defining the dependence between the $i^{\text{th}}$ variable of the $(l+1)^{\text{th}}$ layer and the $j^{\text{th}}$ hidden variable of the $(l)^{\text{th}}$ layer
$b_i^{(l)}$: the plane's intercept for the $i^{\text{th}}$ hidden variable of the $(l+1)^{\text{th}}$ layer
$x_j$: $j^{\text{th}}$ covariate
where $w_{i,j}^{(0)}$ is the weight defining the dependence between the $j^{\text{th}}$ variable of the $0^{\text{th}}$ layer (i.e., the covariates themselves) and the $i^{\text{th}}$ hidden variable of the $1^{\text{st}}$ layer. Following the same notation, $b_i^{(0)}$ describes the plane's intercept for the $i^{\text{th}}$ hidden variable of the $1^{\text{st}}$ layer. The only hidden variable on the second layer is defined as a linear combination of the two hidden variables on layer 1, so
$$z_1^{(2)} = w_{1,1}^{(1)}z_1^{(1)} + w_{1,2}^{(1)}z_2^{(1)} + b^{(1)}.$$
The observation model is now defined by
y
=
z
(2)
1
+
v, v
:
V
N
(
v
;0
,
2
V
), so each observation of the system response is defined
as
y
i
:
Y N
(
y
;
z
(2)
1
,
2
V
). Note that
z
(2)
1
implicitly depends on
the covariates
x
, so that there are covariates implicitly associated
with each observation
y
i
. The complete data set is
D
=
{D
x
, D
y
}
=
{(x
i
,y
i
), 8i 2{1:D}}.
In this second case, adding hidden variables did not help us to generalize our model to nonlinear system responses because linear functions of linear functions are themselves still linear. Therefore, no matter how many layers of hidden variables we add, the final model remains a hyperplane. In order to describe nonlinear functions, we need to introduce nonlinearities using activation functions.
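The following sketch, with randomly assumed weights and biases, verifies numerically that two stacked linear layers reduce to a single linear map, which is why the multilayer model of figure 8.33 remains a hyperplane.

```python
# Minimal check (assumed parameter values) that two stacked linear layers collapse
# to one linear map: W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2).
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(2, 3)), rng.normal(size=2)   # layer 0 -> 1
W2, b2 = rng.normal(size=(1, 2)), rng.normal(size=1)   # layer 1 -> 2

x = rng.normal(size=3)
two_layer = W2 @ (W1 @ x + b1) + b2
equivalent = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layer, equivalent))  # True: the composition is still linear
```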
Activation functions.  An activation function describes a nonlinear transformation of a hidden variable $z$. Common activation functions are the logistic, the hyperbolic tangent (tanh), and the rectified linear unit (ReLU),
$$\begin{aligned}
s(z) &= \frac{1}{1+\exp(-z)} = \frac{\exp(z)}{\exp(z)+1} && \text{(logistic)} \\
t(z) &= \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} && \text{(tanh)} \\
r(z) &= \max(0, z) && \text{(ReLU)}.
\end{aligned}$$
These three activation functions are presented in figure 8.34. The ReLU activation function is currently widely employed because it facilitates the training of models by mitigating the problems associated with vanishing gradients (see §8.3.2).

[Figure 8.34: Comparison of the logistic, tanh, and ReLU activation functions.]
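A minimal sketch of these three activation functions, following the definitions above:

```python
# The logistic, tanh, and ReLU activation functions defined elementwise.
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))  # equivalent to np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

z = np.linspace(-3, 3, 7)
print(logistic(z), tanh(z), relu(z), sep="\n")
```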
Feedforward neural networks: A simple case.  In a feedforward network, a hidden variable $z^{(j)}_i$ is passed through an activation function,
$$a^{(j)}_i = \sigma\big(z^{(j)}_i\big),$$
before being fed to the next layer. The output of the activation function is called the activation unit. Figure 8.35 presents the expanded and compact forms for a simple feedforward network with three covariates in its input layer, one hidden layer containing two activation units (green nodes), and one output layer containing a single activation unit (blue node). Note that in the compact forms (i.e., (b) and (c)), hidden variables $z^{(j)}_i$ are not explicitly represented anymore. The activation units on the first layer are
$$a^{(1)}_1 = \sigma\big(z^{(1)}_1\big), \qquad a^{(1)}_2 = \sigma\big(z^{(1)}_2\big).$$
The output unit $a^{(\mathrm{O})}$ is equal to the hidden variable $z^{(\mathrm{O})}$, which is itself defined as a linear combination of the activation units on the hidden layer, so that
$$a^{(\mathrm{O})} = z^{(\mathrm{O})} = w^{(1)}_1 a^{(1)}_1 + w^{(1)}_2 a^{(1)}_2 + b^{(1)}.$$
The observation model is now defined by $y = a^{(\mathrm{O})} + v$, $v: V \sim \mathcal{N}(v; 0, \sigma_V^2)$, so each observation of the system response is defined as $y_i : Y \sim \mathcal{N}(y; a^{(\mathrm{O})}, \sigma_V^2)$.
[Figure 8.35: A simple feedforward neural network. (a) Expanded-form variable nomenclature; (b) compact-form representation of activation functions; (c) compact-form variable nomenclature.]
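The forward pass of the simple network in figure 8.35 can be sketched as follows; the parameter values are placeholders, and ReLU is used here as an assumed choice of activation function.

```python
# Minimal forward pass for a 3-covariate -> 2-hidden-unit -> 1-output network
# (assumed parameter values; ReLU assumed as the hidden activation function).
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(2)
W0, b0 = rng.normal(scale=0.01, size=(2, 3)), np.zeros(2)  # input -> hidden layer
w1, b1 = rng.normal(scale=0.01, size=2), 0.0               # hidden layer -> output

x = np.array([0.2, 0.5, 0.8])        # covariates x_1, x_2, x_3
a1 = relu(W0 @ x + b0)               # activation units a^(1)
a_out = w1 @ a1 + b1                 # output unit a^(O) = z^(O) (linear output)
print(a_out)
```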
Fully connected feedforward neural network.  We now describe the formulation for the fully connected feedforward neural network (FCNN) with $L$ hidden layers, each having $A$ activation units. The example presented includes a single output observation. In practice, an FCNN can be employed for any number of output variables.
Figure 8.36 presents a graphical representation for an FCNN, where (a) presents the connectivity with the activation functions and (b) the nomenclature employed for weights $w^{(l)}_{i,j}$ and hidden units $a^{(l)}_i$. The activation unit $i$ in the first layer is obtained from a linear combination of the covariates,
$$a^{(1)}_i = \sigma\big(z^{(1)}_i\big), \qquad z^{(1)}_i = w^{(0)}_{i,1} x_1 + w^{(0)}_{i,2} x_2 + \cdots + w^{(0)}_{i,X} x_X + b^{(0)}_i.$$
In subsequent layers, the activation unit $i$ in layer $l+1$ is a linear combination of all the activation units in layer $l$, as described by
$$a^{(l+1)}_i = \sigma\big(z^{(l+1)}_i\big), \qquad z^{(l+1)}_i = w^{(l)}_{i,1} a^{(l)}_1 + w^{(l)}_{i,2} a^{(l)}_2 + \cdots + w^{(l)}_{i,A} a^{(l)}_A + b^{(l)}_i,$$
where $a^{(l+1)}_i$ is the activation unit in layer $l+1$, $z^{(l+1)}_i$ the hidden variable in layer $l+1$, $\sigma(\cdot)$ the activation function, $w^{(l)}_{i,j}$ the weights, $b^{(l)}_i$ the bias, and $a^{(l)}_j$ the activation units from layer $l$.

[Figure 8.36: A fully connected neural network having $L$ hidden layers, each having $A$ activation units. (a) Representation of the network structure with its activation functions; (b) variable nomenclature, where, for example, the activation unit 1 in layer 2 is $a^{(2)}_1 = \sigma\big(z^{(2)}_1\big)$ with $z^{(2)}_1 = w^{(1)}_{1,1} a^{(1)}_1 + w^{(1)}_{1,2} a^{(1)}_2 + \cdots + w^{(1)}_{1,A} a^{(1)}_A + b^{(1)}_1$. The leftmost nodes describe the input layer containing $X$ covariates. The green inner nodes describe the unobserved hidden layers, where an activation unit $i$ in layer $l$ is $a^{(l)}_i = \sigma\big(z^{(l)}_i\big)$. The blue node describes the output layer, where $a^{(\mathrm{O})} = \sigma_{\mathrm{O}}\big(z^{(\mathrm{O})}\big) = z^{(\mathrm{O})}$ for regression problems. The rightmost node describes the observed system responses, which are realizations $y_i : Y \sim \mathcal{N}(y; a^{(\mathrm{O})}, \sigma_V^2)$.]
The feedforward propagation for all the units contained in a layer can be expressed with matrix notation as $\mathbf{a}^{(l+1)} = \sigma\big(\mathbf{z}^{(l+1)}\big)$, that is,
$$\underbrace{\begin{bmatrix} a^{(l+1)}_1 \\ a^{(l+1)}_2 \\ \vdots \\ a^{(l+1)}_A \end{bmatrix}}_{\mathbf{a}^{(l+1)},\; A \times 1} = \sigma\left(\underbrace{\begin{bmatrix} z^{(l+1)}_1 \\ z^{(l+1)}_2 \\ \vdots \\ z^{(l+1)}_A \end{bmatrix}}_{\mathbf{z}^{(l+1)},\; A \times 1}\right),$$
where the hidden variables are given by $\mathbf{z}^{(l+1)} = \mathbf{W}^{(l)} \mathbf{a}^{(l)} + \mathbf{b}^{(l)}$, that is,
$$\underbrace{\begin{bmatrix} z^{(l+1)}_1 \\ z^{(l+1)}_2 \\ \vdots \\ z^{(l+1)}_A \end{bmatrix}}_{\mathbf{z}^{(l+1)},\; A \times 1} = \underbrace{\begin{bmatrix} w^{(l)}_{1,1} & w^{(l)}_{1,2} & \cdots & w^{(l)}_{1,A} \\ w^{(l)}_{2,1} & w^{(l)}_{2,2} & \cdots & w^{(l)}_{2,A} \\ \vdots & \vdots & \ddots & \vdots \\ w^{(l)}_{A,1} & w^{(l)}_{A,2} & \cdots & w^{(l)}_{A,A} \end{bmatrix}}_{\mathbf{W}^{(l)},\; A \times A} \underbrace{\begin{bmatrix} a^{(l)}_1 \\ a^{(l)}_2 \\ \vdots \\ a^{(l)}_A \end{bmatrix}}_{\mathbf{a}^{(l)},\; A \times 1} + \underbrace{\begin{bmatrix} b^{(l)}_1 \\ b^{(l)}_2 \\ \vdots \\ b^{(l)}_A \end{bmatrix}}_{\mathbf{b}^{(l)},\; A \times 1}.$$
The variables $z^{(\mathrm{O})}$ and $a^{(\mathrm{O})}$ on the output layer are obtained from a linear combination of the activation units on the last hidden layer $L$,
$$a^{(\mathrm{O})} = \sigma_{\mathrm{O}}\big(z^{(\mathrm{O})}\big) = z^{(\mathrm{O})} = w^{(L)}_1 a^{(L)}_1 + w^{(L)}_2 a^{(L)}_2 + \cdots + w^{(L)}_A a^{(L)}_A + b^{(L)}.$$
Because we are dealing with a regression problem, we can consider that the output activation function $\sigma_{\mathrm{O}}(\cdot)$ is linear. The observation model is defined by $y = a^{(\mathrm{O})} + v \equiv a^{(\mathrm{O})}(\mathbf{x}) + v$, $v: V \sim \mathcal{N}(v; 0, \sigma_V^2)$, so each observation of the system response is defined as $y_i : Y \sim \mathcal{N}(y; a^{(\mathrm{O})}(\mathbf{x}), \sigma_V^2)$.
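The matrix form above translates directly into code. The following sketch propagates a covariate vector through an FCNN with $L$ hidden layers of $A$ units each; the ReLU activation and the randomly initialized parameters are assumptions made for illustration.

```python
# Sketch of forward propagation a^(l+1) = sigma(W^(l) a^(l) + b^(l)) with a
# linear output layer (assumed placeholder parameter values).
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, weights, biases):
    """Return a^(O) for covariates x, given lists of weight matrices and bias vectors."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)           # hidden layers: a^(l+1) = sigma(z^(l+1))
    W_out, b_out = weights[-1], biases[-1]
    return W_out @ a + b_out          # output layer: linear activation for regression

rng = np.random.default_rng(3)
X_dim, A, L = 5, 10, 2                # X covariates, A units per layer, L hidden layers
sizes = [X_dim] + [A] * L + [1]
weights = [rng.normal(scale=0.01, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

x = rng.uniform(size=X_dim)
print(forward(x, weights, biases))    # a^(O)(x)
```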
Neural networks are characterized by their large number of parameters to be estimated. For example, a fully connected network with $L$ layers, each having $A$ activation units, has $A \times A \times (L-1)$ weight parameters and $A \times L$ bias parameters for its hidden layers. For $L = A = 10$, it leads to more than 1,000 parameters to be learned from the data set $\mathcal{D}$. In comparison, the GPR model presented in §8.2 had fewer than 10 parameters. Moreover, it is not uncommon for modern deep neural networks to be parameterized by millions of parameters. As a consequence, learning these parameters (i.e., training the neural network) requires large data sets and is a computationally demanding task.
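A quick count following these formulas, restricted to the hidden layers:

```python
# Number of hidden-layer parameters for a fully connected network with
# L hidden layers of A activation units each.
A, L = 10, 10
hidden_weights = A * A * (L - 1)   # weights between consecutive hidden layers
hidden_biases = A * L              # one bias per hidden unit
print(hidden_weights + hidden_biases)  # 1000, before counting input- and output-layer parameters
```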
8.3.2 Parameter Estimation and Backpropagation
The task of learning model parameters is referred to as training a neural network. In general, there are so many parameters to be learned that the utilization of second-order optimization methods such as Newton-Raphson (see §5.2) is computationally prohibitive. Instead, neural networks rely on gradient descent algorithms (see §5.1), where the gradient of the loss function with respect to each parameter is obtained using the backpropagation method.
Log-likelihood.  As for other types of regression models, the performance of a neural network is measured using quantities derived from the likelihood of observations $f(\mathcal{D}_y | \mathcal{D}_x, \mathbf{w}, \mathbf{b})$, where $\{\mathbf{w}, \mathbf{b}\}$ are respectively the weight and bias parameters. By assuming that observations are conditionally independent given $a^{(\mathrm{O})}$, the joint likelihood for a data set $\mathcal{D} = \{\mathcal{D}_x, \mathcal{D}_y\} = \{(\mathbf{x}_i, y_i), \forall i \in \{1:D\}\}$ is defined as
$$f(\mathcal{D}_y | \mathcal{D}_x, \mathbf{w}, \mathbf{b}) = \prod_{i=1}^{D} \mathcal{N}\big(y_i;\, a^{(\mathrm{O})}(\mathbf{x}_i; \mathbf{w}, \mathbf{b}),\, \sigma_V^2\big),$$
where $a^{(\mathrm{O})}(\mathbf{x}_i; \mathbf{w}, \mathbf{b}) \equiv a^{(\mathrm{O})}$ explicitly defines the dependence of the output layer on the bias and weight parameters from all layers. In practice, neural networks are trained by minimizing a loss function $J(\mathcal{D}; \mathbf{w}, \mathbf{b})$, which is derived from the negative log-likelihood,
$$\begin{aligned}
-\ln f(\mathcal{D}_y | \mathcal{D}_x, \mathbf{w}, \mathbf{b}) &= -\sum_{i=1}^{D} \ln \mathcal{N}\big(y_i;\, a^{(\mathrm{O})}(\mathbf{x}_i; \mathbf{w}, \mathbf{b}),\, \sigma_V^2\big) \\
&\propto \frac{1}{D} \sum_{i=1}^{D} \frac{1}{2}\big(y_i - a^{(\mathrm{O})}(\mathbf{x}_i; \mathbf{w}, \mathbf{b})\big)^2 \\
&= \frac{1}{D} \sum_{i=1}^{D} J(y_i; \mathbf{x}_i, \mathbf{w}, \mathbf{b}) \\
&= J(\mathcal{D}; \mathbf{w}, \mathbf{b}).
\end{aligned}$$
Note: The derivation of the loss function for a neural network is analogous to the derivation of the loss function presented in §8.1.1 for linear regression.
Note that the loss function $J(\mathcal{D}; \mathbf{w}, \mathbf{b})$ corresponds to the mean-square error (MSE). The optimal set of parameters $\boldsymbol{\theta}^* = \{\mathbf{w}^*, \mathbf{b}^*\}$ is thus the one that minimizes the MSE,
$$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}}\, J(\mathcal{D}; \mathbf{w}, \mathbf{b}).$$
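A minimal sketch of the loss $J(\mathcal{D}; \mathbf{w}, \mathbf{b})$ evaluated as the mean-square error; the observations and predictions below are assumed values standing in for a real data set and a trained network.

```python
# Mean-square-error loss J = (1/D) sum_i 1/2 (y_i - a^(O)(x_i))^2.
import numpy as np

def loss(y_obs, a_out):
    return np.mean(0.5 * (y_obs - a_out) ** 2)

y_obs = np.array([1.0, 2.0, 3.0])      # observed responses y_i (assumed)
a_out = np.array([1.1, 1.9, 3.2])      # network predictions a^(O)(x_i) (assumed)
print(loss(y_obs, a_out))
```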
Gradient descent.  The optimal values $\boldsymbol{\theta}^*$ are sought iteratively using a gradient-descent algorithm, where the updated parameter values are
$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta\, \nabla_{\boldsymbol{\theta}} J(\mathcal{D}; \mathbf{w}, \mathbf{b}),$$
and where $\eta$ is the learning rate (see §5.1). The gradients are estimated using the backpropagation algorithm that will be presented next. The parameters $\boldsymbol{\theta}_0 = \{\mathbf{w}, \mathbf{b}\}$ are typically initialized randomly following $\theta_i \sim \mathcal{N}(\theta; 0, \sigma^2)$, where $\sigma \approx 10^{-2}$. The random initialization is intended to avoid having an initial symmetry in the model. The objective of having initial values close to zero is that multiplying covariates and hidden variables by weights $w^{(l)}_{i,j} \approx 0$ will initially limit the model capacity. Note that a common practice is to normalize the covariates and observations in the data set.
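A sketch of the random initialization and of a single gradient-descent update; the gradient arrays below are placeholders standing in for the backpropagated gradients.

```python
# Random initialization theta_i ~ N(0, sigma^2) with sigma ~ 10^-2, followed by
# one gradient-descent step theta <- theta - eta * grad J (placeholder gradients).
import numpy as np

rng = np.random.default_rng(4)
sigma_init = 1e-2
W = rng.normal(0.0, sigma_init, size=(10, 10))     # weights initialized close to zero
b = rng.normal(0.0, sigma_init, size=10)

eta = 0.01                                         # learning rate
grad_W = rng.normal(size=W.shape)                  # placeholder for grad_W J(D; w, b)
grad_b = rng.normal(size=b.shape)                  # placeholder for grad_b J(D; w, b)
W = W - eta * grad_W
b = b - eta * grad_b
print(np.abs(W).mean())
```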
Backpropagation.  The backpropagation¹⁴ method employs the chain rule of derivation in order to backpropagate the model prediction errors through the network.

¹⁴ Rumelhart, D. E., G. E. Hinton, and R. J. Williams (1986). Learning representations by back-propagating errors. Nature 323, 533–536.

Note (feedforward neural network and the chain rule of derivation):
$$J = \tfrac{1}{2}\big(y_i - a^{(\mathrm{O})}\big)^2, \qquad a^{(\mathrm{O})} = \underbrace{\mathbf{W}^{(L)}\mathbf{a}^{(L)} + \mathbf{b}^{(L)}}_{z^{(\mathrm{O})}}, \qquad \mathbf{a}^{(l+1)} = \sigma\big(\underbrace{\mathbf{W}^{(l)}\mathbf{a}^{(l)} + \mathbf{b}^{(l)}}_{\mathbf{z}^{(l+1)}}\big),$$
$$\frac{\partial J}{\partial \mathbf{W}^{(L)}} = \underbrace{\frac{\partial J}{\partial a^{(\mathrm{O})}}\frac{\partial a^{(\mathrm{O})}}{\partial z^{(\mathrm{O})}}}_{\delta^{(\mathrm{O})}_i}\underbrace{\frac{\partial z^{(\mathrm{O})}}{\partial \mathbf{W}^{(L)}}}_{\mathbf{a}^{(L)}_i}, \qquad \frac{\partial J}{\partial \mathbf{b}^{(L)}} = \frac{\partial J}{\partial a^{(\mathrm{O})}}\frac{\partial a^{(\mathrm{O})}}{\partial z^{(\mathrm{O})}}\underbrace{\frac{\partial z^{(\mathrm{O})}}{\partial \mathbf{b}^{(L)}}}_{1},$$
$$\frac{\partial J}{\partial \mathbf{W}^{(l)}} = \underbrace{\frac{\partial J}{\partial \mathbf{z}^{(l+2)}}\frac{\partial \mathbf{z}^{(l+2)}}{\partial \mathbf{a}^{(l+1)}}\frac{\partial \mathbf{a}^{(l+1)}}{\partial \mathbf{z}^{(l+1)}}}_{\boldsymbol{\delta}^{(l+1)}_i}\underbrace{\frac{\partial \mathbf{z}^{(l+1)}}{\partial \mathbf{W}^{(l)}}}_{\mathbf{a}^{(l)}_i}, \qquad \frac{\partial J}{\partial \mathbf{b}^{(l)}} = \underbrace{\frac{\partial J}{\partial \mathbf{z}^{(l+2)}}}_{\boldsymbol{\delta}^{(l+2)}_i}\underbrace{\frac{\partial \mathbf{z}^{(l+2)}}{\partial \mathbf{a}^{(l+1)}}}_{\mathbf{W}^{(l+1)}}\underbrace{\frac{\partial \mathbf{a}^{(l+1)}}{\partial \mathbf{z}^{(l+1)}}}_{\nabla\sigma(\mathbf{z}^{(l+1)})}\underbrace{\frac{\partial \mathbf{z}^{(l+1)}}{\partial \mathbf{b}^{(l)}}}_{1}.$$

For a single observation $(\mathbf{x}_i, y_i)$, the prediction error $\delta^{(\mathrm{O})}_i$ is defined as the discrepancy between the predicted value $a^{(\mathrm{O})}(\mathbf{x}_i; \mathbf{w}, \mathbf{b})$ and the observation $y_i$, times the gradient of the output's activation function,
$$\begin{aligned}
\delta^{(\mathrm{O})}_i &= \frac{\partial}{\partial z^{(\mathrm{O})}} J(y_i; \mathbf{x}_i, \mathbf{w}, \mathbf{b}) \\
&= \frac{\partial}{\partial a^{(\mathrm{O})}} J(y_i; \mathbf{x}_i, \mathbf{w}, \mathbf{b}) \cdot \frac{\partial a^{(\mathrm{O})}}{\partial z^{(\mathrm{O})}} \\
&= -\big(y_i - a^{(\mathrm{O})}(\mathbf{x}_i; \mathbf{w}, \mathbf{b})\big) \cdot \frac{\partial}{\partial z^{(\mathrm{O})}} \sigma_{\mathrm{O}}\big(z^{(\mathrm{O})}\big) && \text{(general case)} \\
&= a^{(\mathrm{O})}(\mathbf{x}_i; \mathbf{w}, \mathbf{b}) - y_i && \text{(regression)}.
\end{aligned}$$
In the context of regression, the term $\frac{\partial}{\partial z^{(\mathrm{O})}} \sigma_{\mathrm{O}}\big(z^{(\mathrm{O})}\big)$ is equal to one because $a^{(\mathrm{O})} = \sigma_{\mathrm{O}}\big(z^{(\mathrm{O})}\big) = z^{(\mathrm{O})}$. Note that in order to evaluate the prediction error for an observation $y_i$, we first need to evaluate the model prediction for the current parameter values $\{\mathbf{w}, \mathbf{b}\}$ and a vector of covariates $\mathbf{x}_i$. For the $i$th observation in the training set, the gradient vector of the loss function with respect to either the weight or bias parameters on the $l$th layer is
$$\begin{aligned}
\nabla_{\mathbf{W}^{(l)}_i} J(y_i; \mathbf{x}_i, \mathbf{w}, \mathbf{b}) &= \frac{\partial J}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l+1)}_i \big(\mathbf{a}^{(l)}_i\big)^\top \\
\nabla_{\mathbf{b}^{(l)}_i} J(y_i; \mathbf{x}_i, \mathbf{w}, \mathbf{b}) &= \frac{\partial J}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l+1)}_i.
\end{aligned}$$
For hidden layers, the responsibility of each activation unit for the prediction error is estimated using
$$\boldsymbol{\delta}^{(l+1)}_i = \big(\mathbf{W}^{(l+1)}\big)^\top \boldsymbol{\delta}^{(l+2)}_i \odot \nabla\sigma\big(\mathbf{z}^{(l+1)}\big),$$
where $\odot$ is the Hadamard (element-wise) product, and $\nabla$ denotes the gradient vector.
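The following sketch implements this backpropagation recursion for a single observation in a small network with ReLU hidden layers and a linear output; the network sizes and parameter values are assumed placeholders, and the code follows the $\delta$ recursion and gradient expressions above.

```python
# Backpropagation sketch for one observation (x_i, y_i): forward pass, then
# delta^(O) = a^(O) - y_i backpropagated through ReLU hidden layers.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)

rng = np.random.default_rng(5)
sizes = [3, 4, 4, 1]                     # X=3 covariates, two hidden layers, one output
Ws = [rng.normal(scale=0.01, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

x_i, y_i = rng.uniform(size=3), 0.7      # one training pair (assumed values)

# Forward pass: store pre-activations z^(l) and activation units a^(l).
a_list, z_list = [x_i], []
for l, (W, b) in enumerate(zip(Ws, bs)):
    z = W @ a_list[-1] + b
    z_list.append(z)
    a_list.append(z if l == len(Ws) - 1 else relu(z))   # linear output layer

# Backward pass: propagate the prediction error back through the layers.
delta = a_list[-1] - y_i                                # delta^(O) = a^(O) - y_i
grads_W, grads_b = [], []
for l in reversed(range(len(Ws))):
    grads_W.insert(0, np.outer(delta, a_list[l]))       # dJ/dW^(l) = delta (a^(l))^T
    grads_b.insert(0, delta.copy())                     # dJ/db^(l) = delta
    if l > 0:
        delta = (Ws[l].T @ delta) * relu_grad(z_list[l - 1])  # delta for the previous layer
print(grads_W[0].shape, grads_b[0].shape)
```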
The common procedure is to perform parameter updates not using each observation individually, but by using mini-batches. A mini-batch $\mathcal{D}_{\mathrm{mb}} \subset \mathcal{D}$ is a subset of observations of size $n$ that is a power of 2, ranging from 32 to 256. The power-of-2 constraint is there to maximize the computational efficiency when working with graphical processing units (GPUs). The gradient of the loss function is then the average of the gradients obtained for each observation in a mini-batch,
$$\begin{aligned}
\nabla_{\mathbf{W}^{(l)}} J(\mathcal{D}_{\mathrm{mb}}; \mathbf{w}, \mathbf{b}) &= \frac{1}{n} \sum_{i: y_i \in \mathcal{D}_{\mathrm{mb}}} \nabla_{\mathbf{W}^{(l)}_i} J(y_i; \mathbf{x}_i, \mathbf{w}, \mathbf{b}) \\
\nabla_{\mathbf{b}^{(l)}} J(\mathcal{D}_{\mathrm{mb}}; \mathbf{w}, \mathbf{b}) &= \frac{1}{n} \sum_{i: y_i \in \mathcal{D}_{\mathrm{mb}}} \nabla_{\mathbf{b}^{(l)}_i} J(y_i; \mathbf{x}_i, \mathbf{w}, \mathbf{b}).
\end{aligned}$$
Note: Several of the advances made in neural networks can be attributed to advancements in GPU calculations. Nowadays most nontrivial models are trained using GPUs.
During training, an epoch corresponds to performing one pass of parameter updates for all mini-batches contained in the training set. Neural networks are typically trained using stochastic gradient descent, which computes the gradient on the subset of data contained in mini-batches rather than on the full data set. The reader should refer to the specialized literature¹⁵ for descriptions of stochastic gradient descent optimization methods such as Adagrad, RMSprop, and Adam.

¹⁵ Goodfellow, I., Y. Bengio, and A. Courville (2016). Deep learning. MIT Press.
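One training epoch with mini-batch gradient averaging can be sketched as follows; grad_for_observation is a hypothetical placeholder standing in for the per-observation gradient obtained by backpropagation.

```python
# One epoch of mini-batch stochastic gradient descent (placeholder gradients).
import numpy as np

def grad_for_observation(x_i, y_i, W):
    return np.zeros_like(W)            # placeholder: would come from backpropagation

rng = np.random.default_rng(6)
X, y = rng.uniform(size=(1000, 3)), rng.uniform(size=1000)
W = rng.normal(scale=0.01, size=(4, 3))
eta, n = 0.01, 64                      # learning rate and mini-batch size (a power of 2)

indices = rng.permutation(len(y))
for start in range(0, len(y), n):      # one epoch = one pass over all mini-batches
    batch = indices[start:start + n]
    grad_W = np.mean([grad_for_observation(X[i], y[i], W) for i in batch], axis=0)
    W = W - eta * grad_W               # stochastic gradient-descent update
print(np.abs(W).mean())
```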
8.3.3 Regularization
Because of the large number of parameters involved in the definition of a neural network, it is prone to overfitting, where the performance obtained during training does not generalize to the test set. Here, we present two regularization procedures to prevent overfitting: weight decay and early stopping.
Weight decay.  The first regularization strategy is to employ maximum a posteriori (MAP) estimation rather than a maximum likelihood estimate (MLE). The prior knowledge for model parameters in an MAP setup allows us to prevent overfitting by limiting the model complexity. For instance, we can define a prior for the weights in the vector $\mathbf{w}$ so that $f\big(w^{(l)}_{i,j}\big) = \mathcal{N}\big(w^{(l)}_{i,j}; 0, 1/\lambda\big)$. A large value $\lambda$ forces weight parameters to be close to zero, which in turn reduces the model capacity. If $\lambda$ is small, little penalization is applied on the weights, and the model is then free to be highly nonlinear and can more easily fit the data in the training set.
With MAP, the posterior is now approximated by the product of the likelihood and the prior, so
$$\begin{aligned}
f(\mathbf{w}, \mathbf{b} | \mathcal{D}) &\propto f(\mathcal{D}_y | \mathcal{D}_x, \mathbf{w}, \mathbf{b}) \cdot f(\mathbf{w}, \mathbf{b}) \\
&= \prod_{k=1}^{D} \mathcal{N}\big(y_k;\, a^{(\mathrm{O})}(\mathbf{x}_k; \mathbf{w}, \mathbf{b}),\, \sigma_V^2\big) \cdot \prod_{i,j,l} \mathcal{N}\big(w^{(l)}_{i,j}; 0, 1/\lambda\big) \\
-\ln f(\mathbf{w}, \mathbf{b} | \mathcal{D}) &\propto \frac{1}{D} \sum_{k=1}^{D} J(y_k; \mathbf{x}_k, \mathbf{w}, \mathbf{b}) + \frac{\lambda}{2} \sum_{i,j,l} \big(w^{(l)}_{i,j}\big)^2 \\
&\propto J(\mathcal{D}; \mathbf{w}, \mathbf{b}) + \frac{\lambda}{2} \mathbf{w}^\top \mathbf{w}.
\end{aligned}$$
We can then redefine the loss function as
$$\tilde{J}(\mathcal{D}; \mathbf{w}, \mathbf{b}) = J(\mathcal{D}; \mathbf{w}, \mathbf{b}) + \lambda\, \Omega(\mathbf{w}),$$
where $\Omega(\mathbf{w})$, the regularization or weight decay term, is
$$\Omega(\mathbf{w}) = \frac{1}{2} \|\mathbf{w}\|^2 = \frac{1}{2} \mathbf{w}^\top \mathbf{w}.$$
Defining a Gaussian prior for the weights corresponds to $L_2$-norm regularization. This choice of regularization is not the only one available, yet it is a common one. The gradient of the loss function with respect to the weight parameters has to be modified to include the regularization term, so
$$\nabla_{\mathbf{W}^{(l)}} \tilde{J}(\mathcal{D}_{\mathrm{mb}}; \mathbf{w}, \mathbf{b}) = \nabla_{\mathbf{W}^{(l)}} J(\mathcal{D}_{\mathrm{mb}}; \mathbf{w}, \mathbf{b}) + \lambda \mathbf{W}^{(l)}.$$
In short, weight decay allows controlling the model capacity with the hyperparameter $\lambda$. The hyperparameter $\lambda$ then needs to be learned using a method such as cross-validation (see §8.1.2).
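A sketch of the weight-decay penalty and of the corresponding gradient correction; grad_W_data is a placeholder for the data-fit gradient that would be obtained by backpropagation.

```python
# L2 (weight decay) penalty Omega(w) = 1/2 ||w||^2 and the regularized gradient
# grad J_tilde = grad J + lambda * W (placeholder data-fit gradient).
import numpy as np

rng = np.random.default_rng(7)
W = rng.normal(scale=0.1, size=(10, 10))
lam = 1e-3                                   # regularization hyperparameter lambda

penalty = 0.5 * lam * np.sum(W ** 2)         # lambda * Omega(w)
grad_W_data = rng.normal(size=W.shape)       # placeholder for grad_W J(D_mb; w, b)
grad_W_total = grad_W_data + lam * W         # gradient of the regularized loss
print(penalty, np.linalg.norm(grad_W_total))
```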
Early stopping.  Given that we initialize each parameter randomly, following $\theta_i \sim \mathcal{N}(\theta; 0, \sigma^2)$, with $\sigma \approx 10^{-2}$, the model will initially have a poor performance at fitting the observations in the training set. As the number of optimization epochs increases, the fitting ability of the model will improve and, as a result, the loss for the training set $J(\mathcal{D}_{\mathrm{train}}; \mathbf{w}, \mathbf{b})$ will decrease. If we keep optimizing parameters, there is a point where the model will gain too much capacity, and it will start overfitting the data in the training set. Thus, a second option to prevent overfitting is to monitor the loss function evaluated at each epoch for an independent validation set, $J(\mathcal{D}_{\mathrm{validation}}; \mathbf{w}, \mathbf{b})$. Note that the data in the validation set must not be employed to train the model parameters. The optimal parameter values $\boldsymbol{\theta}^* = [\mathbf{w}^* \; \mathbf{b}^*]^\top$ are taken as those corresponding to the epoch for which the loss evaluated for the validation set is minimal. This procedure is equivalent to treating the number of training epochs as a hyperparameter to be optimized using the validation set.
Figure 8.37 presents an example of early stopping where we see that the loss for the training set keeps decreasing with the number of epochs. For the loss evaluated on the validation set, the minimum is reached at seven epochs. Beyond seven epochs, the fitting ability we gain on the training set does not generalize to the validation set.

[Figure 8.37: Example of early stopping when the loss function for the validation set stops decreasing. The plot shows the training and validation losses $J(\mathcal{D}; \mathbf{w}, \mathbf{b})$ as functions of the number of epochs, with the early-stop point marked.]
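Early stopping can be sketched as follows; train_one_epoch and validation_loss are hypothetical placeholders for one epoch of gradient-descent updates and for evaluating $J(\mathcal{D}_{\mathrm{validation}}; \mathbf{w}, \mathbf{b})$.

```python
# Early stopping: keep the parameters from the epoch with the lowest validation loss.
import numpy as np

def train_one_epoch(params):
    return params - 0.01 * np.ones_like(params)      # placeholder parameter update

def validation_loss(params):
    return float(np.sum(params ** 2))                 # placeholder for J(D_validation; w, b)

params = np.random.default_rng(8).normal(scale=1e-2, size=20)
best_params, best_loss = params.copy(), np.inf
for epoch in range(50):
    params = train_one_epoch(params)
    loss_val = validation_loss(params)
    if loss_val < best_loss:                           # retain the best epoch so far
        best_loss, best_params = loss_val, params.copy()
print(best_loss)
```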
8.3.4 Example: Metamodel
Here we revisit the example presented in §8.2.6, where we construct a metamodel for the responses of a structure using five covariates and a set of $D = 1000$ simulations. For this example, we employ a tanh activation function with $A = 10$ activation units on each of the $L = 2$ hidden layers. Figure 8.38 compares the system responses with the model predictions obtained in a leave-one-out cross-validation setup where, for each fold, 90 percent of the data is employed for training the model and the remaining 10 percent to validate when the training should be stopped. The loss function $J(\mathcal{D}; \mathbf{w}, \mathbf{b})$, evaluated using the leave-one-out cross-validation procedure, is equal to $1.32 \times 10^{-5}$. In comparison, the equivalent mean-square error obtained for the GPR model presented in §8.2.6 is $8.8 \times 10^{-6}$. This shows that neural networks do not necessarily outperform other methods, because they are typically best suited for problems involving a large number of covariates and for which large data sets are available.
[Figure 8.38: Comparison of the system responses $y_i$ with the model predictions $a^{(\mathrm{O})}_i$ obtained in a leave-one-out cross-validation setup.]