2 Linear Algebra
Linear algebra is employed in a majority of machine learning methods and algorithms. Before going further, it is essential to understand the mathematical notation and basic operations.
2.1 Notation
We employ lowercase letters $x, s, v, \cdots$ in order to describe variables that can lie in specific domains such as the real numbers $\mathbb{R}$, the positive real numbers $\mathbb{R}^{+}$, the integers $\mathbb{Z}$, closed intervals $[\,\cdot\,,\cdot\,]$, open intervals $(\,\cdot\,,\cdot\,)$, and so on. The notation is as follows:

$x$ : scalar variable
$\mathbf{x}$ : column vector
$\mathbf{X}$ : matrix
$x_i \equiv [\mathbf{x}]_i$ : $i$th element of a vector
$x_{ij} \equiv [\mathbf{X}]_{ij}$ : $\{i,j\}$th element of a matrix

Examples of variables $x$ belonging to different domains are $x \in \mathbb{R} \equiv (-\infty,\infty)$, $x \in \mathbb{R}^{+} \equiv (0,\infty)$, and $x \in \mathbb{Z} \equiv \{\cdots,-1,0,1,2,\cdots\}$. Often, the problems studied involve multiple variables that can be regrouped in arrays. A 1-D array or vector containing scalars is represented as
$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.$$
By convention, a vector $\mathbf{x}$ implicitly refers to an $n \times 1$ column vector. For example, if each element $x_i \equiv [\mathbf{x}]_i$ is a real number, $[\mathbf{x}]_i \in \mathbb{R}$, for all $i$ from 1 to $n$, then the vector belongs to the $n$-dimensional real domain $\mathbb{R}^n$. This last statement can be expressed mathematically as $[\mathbf{x}]_i \in \mathbb{R}, \forall i \in \{1\!:\!n\} \rightarrow \mathbf{x} \in \mathbb{R}^n$. In machine learning, it is common to have 2-D arrays or matrices,
$$\mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix},$$
where, for example, if each $x_{ij} \equiv [\mathbf{X}]_{ij} \in \mathbb{R}, \forall i \in \{1\!:\!m\}, j \in \{1\!:\!n\} \rightarrow \mathbf{X} \in \mathbb{R}^{m \times n}$. Arrays beyond two dimensions are referred to as
tensors. Although tensors are widely employed in the field of neural
networks, they will not be treated in this book.
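As a minimal illustrative sketch (added here, not part of the original text), the vector and matrix conventions above map directly onto NumPy arrays; the values and names are arbitrary examples:

```python
import numpy as np

x = np.array([[1.0], [2.0], [3.0]])    # n x 1 column vector, x in R^n with n = 3
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])         # m x n matrix, X in R^(m x n) with m = 2, n = 3

x_i = x[1, 0]                           # [x]_i : element i = 2 (0-based index 1)
x_ij = X[0, 2]                          # [X]_ij : element {i, j} = {1, 3}
```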
There are several matrices with specific properties. A diagonal matrix is square and has nonzero terms only on its main diagonal,
$$\mathbf{Y} = \text{diag}(\mathbf{x}) = \begin{bmatrix} x_1 & 0 & \cdots & 0 \\ 0 & x_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & x_n \end{bmatrix}_{n \times n}.$$
An identity matrix $\mathbf{I}$ is similar to a diagonal matrix except that the elements on the main diagonal are 1, and 0 everywhere else,
$$\mathbf{I} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}_{n \times n}.$$
A block diagonal matrix concatenates several matrices on the main diagonal of a single matrix,
$$\text{blkdiag}(\mathbf{A}, \mathbf{B}) = \begin{bmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0} & \mathbf{B} \end{bmatrix}.$$
For example, with
$$\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} 4 & 5 & 6 \\ 7 & 8 & 9 \\ 10 & 11 & 12 \end{bmatrix},$$
$$\text{blkdiag}(\mathbf{A}, \mathbf{B}) = \begin{bmatrix} 1 & 2 & 0 & 0 & 0 \\ 3 & 4 & 0 & 0 & 0 \\ 0 & 0 & 4 & 5 & 6 \\ 0 & 0 & 7 & 8 & 9 \\ 0 & 0 & 10 & 11 & 12 \end{bmatrix}.$$
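As a short sketch (an addition for illustration, assuming SciPy is available for the block diagonal case), these special matrices can be built as follows:

```python
import numpy as np
from scipy.linalg import block_diag

x = np.array([1.0, 2.0, 3.0])
Y = np.diag(x)            # diagonal matrix with x on the main diagonal
I = np.eye(3)             # 3 x 3 identity matrix

A = np.array([[1, 2], [3, 4]])
B = np.array([[4, 5, 6], [7, 8, 9], [10, 11, 12]])
C = block_diag(A, B)      # 5 x 5 block diagonal matrix, matching the example above
```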
We can manipulate the dimensions of matrices using the transposition operation, so that indices are permuted, $[\mathbf{X}^{\intercal}]_{ij} = [\mathbf{X}]_{ji}$. For example,
$$\mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \end{bmatrix} \rightarrow \mathbf{X}^{\intercal} = \begin{bmatrix} x_{11} & x_{21} \\ x_{12} & x_{22} \\ x_{13} & x_{23} \end{bmatrix}.$$
The trace of a square matrix $\mathbf{X}$ corresponds to the sum of the elements on its main diagonal,
$$\text{tr}(\mathbf{X}) = \sum_{i=1}^{n} x_{ii}.$$
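A brief sketch of these two operations in NumPy (added for illustration; the matrices are arbitrary):

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
Xt = X.T                      # transpose: [X^T]_ij == [X]_ji, shape (3, 2)

S = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(np.trace(S))            # trace: sum of the main diagonal, 1 + 4 = 5.0
```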
2.2 Operations
[Figure 2.1: 1-D plot representing a linear system, $y = ax + b$, with $y = 3x + 1$, that is, $a = 3$, $b = 1$.]
[Figure 2.2: 2-D plot representing a linear system, $y = \mathbf{a}^{\intercal}\mathbf{x} + b$, with $y = x_1 + 2x_2 + 3$, that is, $\mathbf{a} = [1~~2]^{\intercal}$, $\mathbf{x} = [x_1~~x_2]^{\intercal}$, $b = 3$.]
In the context of machine learning, linear algebra is employed because of its capacity to model linear systems of equations in a format that is compact and well suited for computer calculations. In a 1-D case, such as the one represented in figure 2.1, the $x$ space is mapped into the $y$ space, $\mathbb{R} \rightarrow \mathbb{R}$, through a linear (i.e., affine) function. Figure 2.2 presents an example of a 2-D linear function where the $\mathbf{x}$ space is mapped into the $y$ space, $\mathbb{R}^2 \rightarrow \mathbb{R}$. This can be generalized to linear systems $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}$, defining a mapping so that $\mathbb{R}^n \rightarrow \mathbb{R}^m$, where $\mathbf{x}$ and $\mathbf{y}$ are respectively $n \times 1$ and $m \times 1$
vectors. The product of the matrix $\mathbf{A}$ with the vector $\mathbf{x}$ is defined as $[\mathbf{A}\mathbf{x}]_i = \sum_j [\mathbf{A}]_{ij} \cdot [\mathbf{x}]_j$.
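As a hedged illustration (added here; the numbers and names are arbitrary), a linear system $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}$ mapping $\mathbb{R}^2 \rightarrow \mathbb{R}^3$ can be evaluated directly:

```python
import numpy as np

A = np.array([[1.0, 0.5],
              [0.0, 2.0],
              [3.0, 1.0]])          # maps R^2 -> R^3
b = np.array([1.0, -1.0, 0.5])
x = np.array([2.0, 3.0])

y = A @ x + b                       # [Ax]_i = sum_j A_ij * x_j, then add b; y has shape (3,)
```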
In more general cases, linear algebra is employed to multiply a matrix $\mathbf{A}$ of size $n \times k$ with another matrix $\mathbf{B}$ of size $k \times m$, so the result is an $n \times m$ matrix,
$$\mathbf{C} = \mathbf{A}\mathbf{B}.$$
The matrix multiplication operation follows
$$[\mathbf{C}]_{ij} = \sum_k [\mathbf{A}]_{ik} \cdot [\mathbf{B}]_{kj},$$
as illustrated in figure 2.3. Following the requirement on the size of the matrices multiplied, this operation is not generally commutative, so that $\mathbf{A}\mathbf{B} \neq \mathbf{B}\mathbf{A}$. Matrix multiplication follows several properties such as the following:
Distributivity: $\mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{A}\mathbf{B} + \mathbf{A}\mathbf{C}$
Associativity: $\mathbf{A}(\mathbf{B}\mathbf{C}) = (\mathbf{A}\mathbf{B})\mathbf{C}$
Conjugate transposability: $(\mathbf{A}\mathbf{B})^{\intercal} = \mathbf{B}^{\intercal}\mathbf{A}^{\intercal}$.
[Figure 2.3: Example of matrix multiplication operation $\mathbf{C} = \mathbf{A}\mathbf{B}$.]
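The following sketch (an illustrative addition with arbitrary random matrices) checks the dimensionality rule and the transposability property numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 2))
C = A @ B                              # [C]_ij = sum_k A_ik * B_kj; C has shape (3, 2)

# Transposability: (AB)^T == B^T A^T
assert np.allclose(C.T, B.T @ A.T)
```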
When the matrix multiplication operator is applied to $n \times 1$ vectors, it reduces to the inner product,
$$\mathbf{x}^{\intercal}\mathbf{y} \equiv \mathbf{x} \cdot \mathbf{y} = [x_1 ~\cdots~ x_n] \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} = \sum_{i=1}^{n} x_i y_i.$$
Another common operation is the Hadamard product or element-wise product, which is represented by the symbol $\odot$. It consists of multiplying each term from matrices $\mathbf{A}_{m \times n}$ and $\mathbf{B}_{m \times n}$ in order to obtain $\mathbf{C}_{m \times n}$,
$$\mathbf{C} = \mathbf{A} \odot \mathbf{B}, \quad [\mathbf{C}]_{ij} = [\mathbf{A}]_{ij} \cdot [\mathbf{B}]_{ij}.$$
The element-wise product is seldom employed to define mathematical equations; however, it is extensively employed when implementing these equations in a computer language. Matrix addition is by definition an element-wise operation that applies only to matrices of the same dimensions,
$$\mathbf{C} = \mathbf{A} + \mathbf{B}, \quad [\mathbf{C}]_{ij} = [\mathbf{A}]_{ij} + [\mathbf{B}]_{ij}.$$
One last key operation is the matrix inversion $\mathbf{A}^{-1}$. In order to be invertible, a matrix must be square and must not have linearly dependent rows or columns. The product of a matrix with its inverse is equal to the identity matrix, $\mathbf{A}^{-1}\mathbf{A} = \mathbf{I}$.

Linearly dependent vectors: vectors $\mathbf{x}_1 \in \mathbb{R}^n$ and $\mathbf{x}_2 \in \mathbb{R}^n$ are linearly dependent if a nonzero vector $\mathbf{y} \in \mathbb{R}^2$ exists such that $y_1\mathbf{x}_1 + y_2\mathbf{x}_2 = \mathbf{0}$.

Matrix inversion is particularly useful for solving linear systems of equations,
$$\begin{aligned} \mathbf{A}\mathbf{x} &= \mathbf{b}, \\ \mathbf{A}^{-1}\mathbf{A}\mathbf{x} &= \mathbf{A}^{-1}\mathbf{b}, \\ \mathbf{I}\mathbf{x} &= \mathbf{A}^{-1}\mathbf{b}, \\ \mathbf{x} &= \mathbf{A}^{-1}\mathbf{b}. \end{aligned}$$
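A minimal sketch of solving $\mathbf{A}\mathbf{x} = \mathbf{b}$ numerically (added for illustration; the values are arbitrary):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.linalg.solve(A, b)       # solves Ax = b directly; here x = [2, 3]
A_inv = np.linalg.inv(A)        # explicit inverse, so that A_inv @ A ≈ I
assert np.allclose(A_inv @ b, x)
```

In practice, `np.linalg.solve` is generally preferred over forming $\mathbf{A}^{-1}$ explicitly, as it is faster and numerically more stable.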
[Figure 2.4: Examples of applications of different norms for computing the length of a vector $\mathbf{x}$: $||\mathbf{x}||_2 = \sqrt{x_1^2 + x_2^2}$, $||\mathbf{x}||_1 = |x_1| + |x_2|$, and $||\mathbf{x}||_{\infty} = \max\{|x_1|, |x_2|\}$.]
2.3 Norms
Norms measure how large a vector is. In a generic way, the $L_p$-norm is defined as
$$||\mathbf{x}||_p = \left( \sum_i \left|[\mathbf{x}]_i\right|^p \right)^{1/p}.$$
Special cases of interest are
$$\begin{aligned} ||\mathbf{x}||_2 &= \sqrt{\textstyle\sum_i [\mathbf{x}]_i^2} \equiv \sqrt{\mathbf{x}^{\intercal}\mathbf{x}} &&\text{(Euclidean norm)} \\ ||\mathbf{x}||_1 &= \textstyle\sum_i \left|[\mathbf{x}]_i\right| &&\text{(Manhattan norm)} \\ ||\mathbf{x}||_{\infty} &= \max_i \left|[\mathbf{x}]_i\right|. &&\text{(Max norm)} \end{aligned}$$
These cases are illustrated in figure 2.4. Among all cases, the $L_2$-norm (Euclidean distance) is the most common. For example, §8.1.1 presents, in the context of linear regression, how choosing the Euclidean norm to measure the distance between observations and model predictions allows solving the parameter estimation problem analytically.
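The three special cases can be computed directly; the following is an illustrative sketch (added here, with an arbitrary vector):

```python
import numpy as np

x = np.array([3.0, -4.0])

l2 = np.linalg.norm(x)                 # Euclidean norm: sqrt(3^2 + 4^2) = 5.0
l1 = np.linalg.norm(x, ord=1)          # Manhattan norm: |3| + |-4| = 7.0
linf = np.linalg.norm(x, ord=np.inf)   # max norm: max(|3|, |-4|) = 4.0
```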
2.4 Transformations
Machine learning involves transformations from one space to another. In the context of linear algebra, we are interested in the special case of linear transformations.
2.4.1 Linear Transformations
[Figure 2.5: Examples of linear transformations $\mathbf{x}' = \mathbf{A}\mathbf{x}$: (a) $\mathbf{A} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, $\det(\mathbf{A}) = 1$; (b) $\mathbf{A} = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}$, $\det(\mathbf{A}) = 2$; (c) $\mathbf{A} = \begin{bmatrix} 1.5 & 0 \\ 0 & 1 \end{bmatrix}$, $\det(\mathbf{A}) = 1.5$.]
Figure 2.1 presented an example of an $\mathbb{R} \rightarrow \mathbb{R}$ linear transformation. More generally, an $n \times n$ square matrix can be employed to perform an $\mathbb{R}^n \rightarrow \mathbb{R}^n$ linear transformation through multiplication. Figures 2.5a–c illustrate how a matrix $\mathbf{A}$ transforms a space $\mathbf{x}$ into another $\mathbf{x}'$ using the matrix product operation $\mathbf{x}' = \mathbf{A}\mathbf{x}$. The deformation of the circle and the underlying grid (see (a)) shows the effect of various transformations. Note that the terms on the main
diagonal of $\mathbf{A}$ control the transformations along the $x'_1$ and $x'_2$ axes, and the nondiagonal terms control the transformation dependency between both axes (see, for example, figure 2.6).
The determinant of a square matrix $\mathbf{A}$ measures how much the transformation contracts or expands the space:

$\det(\mathbf{A}) = 1$: preserves the space/volume
$\det(\mathbf{A}) = 0$: collapses the space/volume along a subset of dimensions, for example, 2-D space $\rightarrow$ 1-D space (see figure 2.7)
In the examples presented in figures 2.5a–c, the determinant quantifies how much the area/volume is changed in the transformed space; for the circle, it corresponds to the change of area caused by the transformation. As shown in figure 2.5a, if $\mathbf{A} = \mathbf{I}$, the transformation has no effect, so $\det(\mathbf{A}) = 1$. For a square matrix $[\mathbf{A}]_{n \times n}$, $\det(\mathbf{A}): \mathbb{R}^{n \times n} \rightarrow \mathbb{R}$.
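A short sketch (an addition; the matrix matches the stretch shown in figure 2.5c) of applying a linear transformation and computing its determinant:

```python
import numpy as np

A = np.array([[1.5, 0.0],
              [0.0, 1.0]])          # stretches the x_1 axis by 1.5

x = np.array([1.0, 2.0])
x_prime = A @ x                     # linear transformation x' = Ax

d = np.linalg.det(A)                # 1.5: the area/volume is expanded by a factor of 1.5
```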
2.4.2 Eigen Decomposition
Linear transformations operate on several dimensions, such as in the case presented in figure 2.6 where the transformation introduces dependency between variables. Eigen decomposition enables finding a linear transformation that removes the dependency while preserving the area/volume. A square matrix $[\mathbf{A}]_{n \times n}$ can be decomposed in eigenvectors $\{\boldsymbol{\nu}_1, \cdots, \boldsymbol{\nu}_n\}$ and eigenvalues $\{\lambda_1, \cdots, \lambda_n\}$. In its matrix form,
$$\mathbf{A} = \mathbf{V}\,\text{diag}(\boldsymbol{\lambda})\,\mathbf{V}^{-1},$$
where
$$\mathbf{V} = [\boldsymbol{\nu}_1 ~\cdots~ \boldsymbol{\nu}_n], \quad \boldsymbol{\lambda} = [\lambda_1 ~\cdots~ \lambda_n]^{\intercal}.$$
Figure 2.6 presents the eigen decomposition of the transformation $\mathbf{x}' = \mathbf{A}\mathbf{x}$. Eigenvectors $\boldsymbol{\nu}_1$ and $\boldsymbol{\nu}_2$ describe the new referential into which the transformation is independently applied to each axis. Eigenvalues $\lambda_1$ and $\lambda_2$ describe the transformation magnitude along each eigenvector.
[Figure 2.6: Example of eigen decomposition, $\mathbf{A} = \mathbf{V}\,\text{diag}(\boldsymbol{\lambda})\,\mathbf{V}^{-1}$, for the transformation $\mathbf{x}' = \mathbf{A}\mathbf{x}$ with $\mathbf{A} = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}$, $\mathbf{V} = [\boldsymbol{\nu}_1 ~ \boldsymbol{\nu}_2] = \begin{bmatrix} 0.71 & 0.71 \\ -0.71 & 0.71 \end{bmatrix}$, $\boldsymbol{\lambda} = [0.5 ~~ 1.5]^{\intercal}$.]
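The decomposition for the matrix shown in figure 2.6 can be reproduced numerically; this is an illustrative sketch (an addition, and note that the ordering and signs of the computed eigenvectors may differ from the figure):

```python
import numpy as np

A = np.array([[1.0, 0.5],
              [0.5, 1.0]])

lam, V = np.linalg.eig(A)                     # eigenvalues and eigenvectors (columns of V)
A_rec = V @ np.diag(lam) @ np.linalg.inv(V)   # A = V diag(lambda) V^{-1}
assert np.allclose(A_rec, A)
```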
[Figure 2.7: Example of a nearly singular transformation, $\mathbf{A} = \begin{bmatrix} 1 & 0.99 \\ 0.99 & 1 \end{bmatrix}$, $\det(\mathbf{A}) = 0.02$.]
A matrix is positive definite if all eigenvalues $\lambda_i > 0$, and a matrix is positive semidefinite (PSD) if all eigenvalues $\lambda_i \geq 0$. The determinant of a matrix corresponds to the product of its eigenvalues. Therefore, in the case where one eigenvalue equals zero, it indicates that two or more dimensions are linearly dependent and have collapsed into a single one. The transformation matrix is then said to be singular. Figure 2.7 presents an example of a nearly singular transformation. For a positive semidefinite matrix $\mathbf{A}$ and for any
vector $\mathbf{x}$, the following relation holds:
$$\mathbf{x}^{\intercal}\mathbf{A}\mathbf{x} \geq 0.$$
This property is employed in §3.3.5 to define the requirements for an admissible covariance matrix.
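A minimal sketch (added for illustration; the matrix is an arbitrary symmetric example) of checking positive semidefiniteness through the eigenvalues and verifying the quadratic-form property:

```python
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])

eigvals = np.linalg.eigvalsh(A)     # eigenvalues of a symmetric matrix
is_psd = np.all(eigvals >= 0)       # PSD check: all eigenvalues >= 0

x = np.random.default_rng(1).normal(size=2)
assert x @ A @ x >= 0               # x^T A x >= 0 holds for any x when A is PSD
```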
A more exhaustive review of linear algebra can be found in dedicated textbooks such as the one by Kreyszig.¹

¹ Kreyszig, E. (2011). Advanced engineering mathematics (10th ed.). Wiley.