probabilistic machine learning for civil engineers 193
ascent method such as the Newton-Raphson algorithm presented in
§5.2. Note that because Newton-Raphson is a convex optimization
method, and because SSMs typically involve a non-convex (see
chapter 5) and non-identifiable (see §6.3.4) likelihood function, the
optimization is particularly sensitive to the initial values θ0. It is thus
essential to initialize parameters using engineering knowledge and
to try several starting locations.
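The multi-start strategy above can be sketched as follows. Note that the objective here is a hypothetical multimodal toy function standing in for the negative log-likelihood of an SSM, which in practice would be evaluated by running a Kalman filter over the data; the optimizer and starting-point ranges are likewise illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical stand-in for the negative log-likelihood of an SSM as a
# function of its parameters theta. A real implementation would run a
# Kalman filter and return the negative prediction-error log-likelihood.
def neg_log_likelihood(theta):
    # Toy multimodal objective: several local minima, like a
    # non-convex, non-identifiable SSM likelihood surface.
    return (theta[0] - 1.0) ** 2 + np.sin(5.0 * theta[0]) ** 2

# Try several starting locations theta_0 and keep the best local optimum.
rng = np.random.default_rng(0)
starts = rng.uniform(-2.0, 3.0, size=(10, 1))
results = [minimize(neg_log_likelihood, x0, method="BFGS") for x0 in starts]
best = min(results, key=lambda r: r.fun)
print(best.x, best.fun)
```

Different starting points converge to different local minima; only by comparing the final objective values across restarts can the best local optimum be retained.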
Note that a closed-form expectation maximization method exists
for estimating the parameters of SSMs.6 However, even if it is
mathematically elegant, in the context of civil engineering applications
such as those presented in the following sections, it is still sensitive
to the selection of the initial values and is thus easily trapped in
local maxima.

6. Ghahramani, Z. and G. E. Hinton (1996). Parameter estimation for linear dynamical systems. Technical Report CRG-TR-96-2, University of Toronto, Dept. of Computer Science.
12.1.5 Limitations and Practical Considerations
There are some limitations to the Kalman filter formulation. First,
it is sensitive to numerical errors when the covariance is rapidly
reduced in the measurement step, for example, when accurate
measurements are used or after a period with missing data, or
when there are orders of magnitude separating the eigenvalues
of the covariance matrix. Two alternative formulations that are
numerically more robust are the UD filter and the square-root filter.
The reader interested in these methods should consult specialized
textbooks such as the one by Simon7 or Gibbs.8

7. Simon, D. (2006). Optimal state estimation: Kalman, H infinity, and nonlinear approaches. Wiley.
8. Gibbs, B. P. (2011). Advanced Kalman filtering, least-squares and modeling: A practical handbook. Wiley.
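The ill-conditioning problem can be illustrated with a small sketch. The Joseph-form measurement update shown below is a standard numerically stable variant of the covariance update; it is simpler than the UD and square-root filters cited above, which go further by propagating a factor of the covariance matrix rather than the matrix itself. The state dimensions and noise values are illustrative assumptions:

```python
import numpy as np

# Kalman measurement update for the covariance, written two ways.
def update_naive(P, H, R):
    # Standard form: (I - K H) P. Can lose symmetry / positive
    # semi-definiteness under round-off in ill-conditioned cases.
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    return (np.eye(P.shape[0]) - K @ H) @ P

def update_joseph(P, H, R):
    # Joseph form: A P A' + K R K' with A = I - K H. Guaranteed
    # symmetric and positive semi-definite for any gain K.
    n = P.shape[0]
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    A = np.eye(n) - K @ H
    return A @ P @ A.T + K @ R @ K.T

# Eigenvalues of P separated by many orders of magnitude, plus a very
# accurate measurement: exactly the situation described in the text.
P = np.diag([1e8, 1e-8])
H = np.array([[1.0, 0.0]])
R = np.array([[1e-6]])
P_joseph = update_joseph(P, H, R)
print(np.allclose(P_joseph, P_joseph.T))
```

The Joseph form costs a few extra matrix products but keeps the updated covariance symmetric and positive semi-definite, which is why numerically robust filter implementations prefer it (or a factored UD/square-root variant) over the naive update.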
A second key limiting aspect is that the formulations of the
Kalman filter and smoother are only applicable to linear observation
and transition models. When the problems at hand require nonlinear
models, there are two main types of alternatives. First, there
are the unscented Kalman filter9 and extended Kalman filter,10
which allow for an approximation of the Kalman algorithm while
still describing the hidden states using multivariate Normal random
variables. The second option is to employ sampling methods such
as the particle filter,11 where the hidden-state variables are described
by particles (i.e., samples) that are propagated through the
transition model. The weight of each particle is updated using the
likelihood computed from the observation model. Particle filtering is
suited to estimating PDFs that are not well represented by the multivariate
Normal. The reader interested in these nonlinear-compatible
formulations can consult Murphy's12 textbook for further details.

9. Wan, E. A. and R. van der Merwe (2000). The unscented Kalman filter for nonlinear estimation. In Adaptive Systems for Signal Processing, Communications, and Control Symposium, 153–158. IEEE.
10. Ljung, L. (1979). Asymptotic behavior of the extended Kalman filter as a parameter estimator for linear systems. IEEE Transactions on Automatic Control 24(1), 36–50.
11. Del Moral, P. (1997). Non-linear filtering: interacting particle resolution. Comptes Rendus de l'Académie des Sciences, Series I, Mathematics 325(6), 653–658.
12. Murphy, K. P. (2012). Machine learning: A probabilistic perspective. MIT Press.
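The propagate-weight-resample cycle described above can be sketched as a minimal bootstrap particle filter. The nonlinear transition and observation models below are toy assumptions chosen for illustration, not models from this book:

```python
import numpy as np

# Assumed toy nonlinear SSM:
#   x_t = 0.5 x_{t-1} + sin(x_{t-1}) + process noise
#   y_t = x_t^2 / 2 + measurement noise
rng = np.random.default_rng(1)

def transition(x):
    # Propagate each particle through the (nonlinear) transition model.
    return 0.5 * x + np.sin(x) + rng.normal(0.0, 0.5, size=x.shape)

def log_likelihood(y, x):
    # Gaussian observation model with sigma = 0.3 (unnormalized).
    return -0.5 * ((y - x**2 / 2.0) / 0.3) ** 2

# Simulate a short sequence of observations from the true model.
n_particles, T = 500, 20
x_true, ys = 1.0, []
for _ in range(T):
    x_true = 0.5 * x_true + np.sin(x_true) + rng.normal(0.0, 0.5)
    ys.append(x_true**2 / 2.0 + rng.normal(0.0, 0.3))

particles = rng.normal(1.0, 1.0, n_particles)
estimates = []
for y in ys:
    particles = transition(particles)             # propagate particles
    logw = log_likelihood(y, particles)           # weight by observation model
    w = np.exp(logw - logw.max())
    w /= w.sum()
    estimates.append(np.sum(w * particles))       # weighted posterior mean
    idx = rng.choice(n_particles, n_particles, p=w)
    particles = particles[idx]                    # multinomial resampling
print(estimates[-1])
```

Because the particles are arbitrary samples rather than the mean and covariance of a Normal, the filter can represent skewed or multimodal hidden-state PDFs; here, for instance, the squared observation makes the posterior bimodal in the sign of the state.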