j.-a. goulet 162
previous case, each cluster is well approximated by a multivariate Normal. Gaussian mixture models may display a poor performance when the clusters do not have ellipsoidal shapes. In such a case, we should resort to more advanced methods such as spectral clustering.³ Further details regarding the derivation and theoretical justification of the EM methods are presented by Murphy⁴ and Bishop.⁵

³ Ng, A., M. Jordan, and Y. Weiss (2002). On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, Volume 14, pp. 849–856. MIT Press.
⁴ Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
⁵ Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Estimating the number of clusters  In the previous examples, we have assumed that we knew the number of clusters $K$. In most practical applications, this is not the case and $K$ must be estimated using methods such as cross-validation (see §8.1.2) or Bayesian model selection (see §6.8).
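As an illustration of this model-selection step, the sketch below fits Gaussian mixtures for a range of candidate $K$ and keeps the one with the lowest Bayesian information criterion (BIC), a common approximation to Bayesian model selection. The synthetic data, the candidate range, and the use of scikit-learn's `GaussianMixture` are assumptions for the example, not part of the text.

```python
# Sketch: estimating the number of clusters K via the BIC
# (an approximation to Bayesian model selection).
# Data and candidate range are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data: three well-separated Gaussian clusters in [0, 1]^2
X = np.vstack([rng.normal(loc=c, scale=0.05, size=(100, 2))
               for c in ([0.2, 0.2], [0.8, 0.3], [0.5, 0.8])])

# Fit a mixture for each candidate K; keep the K with the lowest BIC
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 7)}
K_best = min(bics, key=bics.get)
print(K_best)
```

Because the BIC penalizes the number of parameters, it guards against the trivial improvement in likelihood obtained by adding more clusters.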
10.1.2 K-Means
K-means⁶ is one of the most widespread clustering methods due to its simplicity. It can be seen as a special case of the Gaussian mixture model presented in §10.1.1, where the probability of each cluster is $p(k) = 1/K$, the covariance of each cluster is diagonal, $\Sigma_k = \sigma^2\mathbf{I}$, and we seek to infer the cluster means $\boldsymbol{\mu}_k \in \mathbb{R}^{X}$. With Gaussian mixture models, we were computing the probability that the covariates $\mathbf{x}_i$ belong to the cluster $k$; with K-means, the cluster membership assignment $k_i^*$ for a covariate $\mathbf{x}_i$ is deterministic so that it corresponds to

$$p(k\,|\,\mathbf{x}_i, \boldsymbol{\theta}) = \begin{cases} 1, & \text{if } k = k_i^* \\ 0, & \text{otherwise,} \end{cases}$$

where a covariate $\mathbf{x}_i$ is assigned to the cluster associated with the mean it is the closest to,

$$k_i^* = \arg\min_k \, \|\mathbf{x}_i - \boldsymbol{\mu}_k\|.$$

⁶ Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2), 129–137.
The mean vector for the $k$th cluster is estimated as the average location of the covariates $\mathbf{x}_i$ associated with a cluster $k$ so that

$$\boldsymbol{\mu}_k = \frac{1}{\#\{i : k_i^* = k\}} \sum_{i:\,k_i^* = k} \mathbf{x}_i,$$

where $\#\{i : k_i^* = k\}$ is the number of observed covariates assigned to the cluster $k$.
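The two rules above, the assignment $k_i^* = \arg\min_k \|\mathbf{x}_i - \boldsymbol{\mu}_k\|$ and the mean update, can be alternated until convergence. A minimal NumPy sketch follows; the random initialization, iteration cap, and synthetic data are illustrative assumptions.

```python
import numpy as np

def k_means(X, K, n_iter=100, seed=0):
    """Minimal K-means sketch: alternate the assignment rule
    k*_i = argmin_k ||x_i - mu_k|| and the mean update
    mu_k = average of the x_i assigned to cluster k."""
    rng = np.random.default_rng(seed)
    # Initialize mu_k^(0) by picking K distinct observations at random
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assignment step: Euclidean distance from each x_i to each mu_k
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        k_star = d.argmin(axis=1)
        # Update step: mean of the covariates assigned to each cluster
        # (an empty cluster keeps its previous mean)
        new_mu = np.array([X[k_star == k].mean(axis=0)
                           if np.any(k_star == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):  # converged: means stopped moving
            break
        mu = new_mu
    return mu, k_star

# Usage on synthetic 2-D data with two well-separated clusters (illustrative)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.2, 0.05, size=(50, 2)),
               rng.normal(0.8, 0.05, size=(50, 2))])
mu, k_star = k_means(X, K=2)
```

Note that, like EM for Gaussian mixtures, this procedure only finds a local optimum, so in practice it is common to restart it from several random initializations.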
[Figure 10.6 appears here: two scatter plots of $x_2$ versus $x_1$ on $[0, 1]^2$; (a) real clusters, (b) clusters approximated with K-means, where crosses describe each cluster center.]
Figure 10.6: Example of application of K-means. Notice how some samples were misclassified due to the hypothesis related to the circular shape of each cluster, that is, $\Sigma_k = \sigma^2\mathbf{I}$.
The general procedure consists in initializing $\boldsymbol{\mu}_k^{(0)}$ randomly, iteratively assigning each observation to the cluster it is the closest to, and recomputing the mean of each cluster. Figure 10.6 presents in (a) the true clusters and in (b) the cluster membership assignment computed using K-means, where the mean of each cluster is