10
Clustering and Dimension Reduction
This chapter covers two key problems associated with unsupervised learning: clustering and dimension reduction. Clustering consists in discovering patterns or subgroups among a set of covariates. Clustering is considered unsupervised learning because the cluster indexes are not available to train a model; cluster indexes have to be learned from the observed covariates themselves. With dimension reduction, the goal is to identify a lower-dimensional subspace for a set of covariates, in which the data can be represented with a negligible loss of information. In this chapter, we will introduce two clustering methods, Gaussian mixture models and K-means, and a dimension reduction technique, that is, principal component analysis.
Figure 10.1: Eruption durations and time between eruptions can be clustered into two groups. (a) Old Faithful geyser; (b) eruption data, with eruption time [min] on the horizontal axis, time to next eruption [min] on the vertical axis, and clusters #1 and #2 identified.
10.1 Clustering
Figure 10.1 presents the Old Faithful geyser data set,¹ where the covariates $\mathbf{x}_i = [x_1\; x_2]_i^\intercal$ describe the eruption time and the time to the next eruption. Figure 10.1 displays two distinct behaviors: short eruptions with short times to the next ones and long eruptions with a long time to the next ones. The goal with clustering is to discover the existence of such an underlying structure in data.

¹ Azzalini, A. and A. W. Bowman (1990). A look at some data on the Old Faithful geyser. Journal of the Royal Statistical Society, Series C (Applied Statistics) 39(3), 357–365.
Note that when working with data sets $\mathcal{D} = \mathcal{D}_x = \{\mathbf{x}_i, \forall i \in \{1{:}\mathsf{D}\}\}$, $\mathbf{x}_i \in \mathbb{R}^{\mathsf{X}}$, with covariates defined in three or fewer dimensions ($\mathsf{X} \le 3$), it is then trivial to visually identify clusters. The interest of having clustering methods is in identifying patterns in cases where $\mathsf{X} > 3$, such that the dimensionality of the data does not allow for a visual separation of clusters.
10.1.1 Gaussian Mixture Models
Clustering can be formulated as the task of fitting a Gaussian mixture model (GMM) on a set of observed covariates. A Gaussian mixture distribution is defined as the sum of $\mathsf{K}$ Normal (i.e., Gaussian) probability density functions (PDFs), each weighted by a probability $p(k), \forall k \in \{1{:}\mathsf{K}\}$. The joint PDF for a Gaussian mixture is defined by
Figure 10.2: Examples of Gaussian mixture PDFs. (a) Univariate: components $\mathcal{N}_1(1, 1^2)$ and $\mathcal{N}_2(4, 1.5^2)$ weighted by $p(1) = 0.4$ and $p(2) = 0.6$, with the resulting mixture $f_{12}(x)$; (b) bivariate: components $\mathcal{N}_1(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)$ and $\mathcal{N}_2(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$ with the mixture $f_{12}(\mathbf{x})$.
$$f(\mathbf{x}) = \sum_{k=1}^{\mathsf{K}} f(\mathbf{x}, k) = \sum_{k=1}^{\mathsf{K}} f(\mathbf{x}|k) \cdot p(k) = \sum_{k=1}^{\mathsf{K}} \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \cdot p(k), \qquad (10.1)$$
where $\boldsymbol{\theta} = \{(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \forall k \in \{1{:}\mathsf{K}\}\}$ describes the PDF parameters. Figure 10.2 presents examples of 1-D and 2-D Gaussian mixture PDFs. The issue is that in practice, we do not know $\boldsymbol{\theta}$ or $p(k)$, so we have to learn $\mathcal{P} = \{\boldsymbol{\theta}, p(k)\}$ from $\mathcal{D}_x$. Given the hypothesis that $\mathcal{D}_x$ contains observations that are conditionally independent given $\mathcal{P}$, the joint likelihood defined from equation 10.1 is
$$f(\mathcal{D}_x|\boldsymbol{\theta}, p(k)) = \prod_{i=1}^{\mathsf{D}} \sum_{k=1}^{\mathsf{K}} \mathcal{N}(\mathbf{x}_i; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \cdot p(k),$$
and equivalently, the log-likelihood is
$$\ln f(\mathcal{D}_x|\boldsymbol{\theta}, p(k)) = \sum_{i=1}^{\mathsf{D}} \ln\!\left(\sum_{k=1}^{\mathsf{K}} \mathcal{N}(\mathbf{x}_i; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \cdot p(k)\right). \qquad (10.2)$$
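To make equations 10.1 and 10.2 concrete, here is a minimal Python sketch that evaluates the log-likelihood of a small data set under the univariate mixture of figure 10.2a; the observation values are hypothetical and only serve to illustrate the computation.

```python
import numpy as np
from scipy.stats import norm

# Mixture from figure 10.2a: N(1, 1^2) and N(4, 1.5^2), weighted by p(k)
mu = np.array([1.0, 4.0])      # component means mu_k
sigma = np.array([1.0, 1.5])   # component standard deviations
p_k = np.array([0.4, 0.6])     # mixture probabilities p(k)

# Hypothetical observations x_i (for illustration only)
x = np.array([0.3, 1.2, 3.8, 4.5, 5.1])

# Equation 10.1: f(x_i) = sum_k N(x_i; mu_k, sigma_k^2) * p(k)
f_x = np.sum(norm.pdf(x[:, None], loc=mu, scale=sigma) * p_k, axis=1)

# Equation 10.2: log-likelihood of the whole data set
log_likelihood = np.sum(np.log(f_x))
print(log_likelihood)
```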
If we choose a maximum likelihood estimate (MLE) approach to estimate the parameters, their optimal values are defined as
$$\mathcal{P}^* = \arg\max_{\mathcal{P}} \, \ln f(\mathcal{D}_x|\boldsymbol{\theta}, p(k)), \qquad (10.3)$$
where $\mathcal{P} = \{\underbrace{(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}_{\boldsymbol{\theta}}, p(k), \forall k \in \{1{:}\mathsf{K}\}\}$ denotes the full set of Gaussian mixture parameters.
When the optimization in equation 10.3 employs methods such as those presented in chapter 5, it is referred to as hard clustering. Hard clustering is a difficult task because of several issues, such as the constraint on $\boldsymbol{\Sigma}_k$ that requires it to be positive semidefinite (see §3.3.5), and because of non-identifiability (see §6.3.4). Soft clustering solves these issues by employing the expectation maximization method to estimate $\mathcal{P}^*$.
Expectation maximization² (EM) is an iterative method that allows identifying optimal parameters $\mathcal{P}^*$ for problems involving latent (i.e., unobserved) variables. In the context of clustering, the latent variable is the unobserved cluster index $k$.

² Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 39(1), 1–38.
If we could observe $\mathcal{D} = \{\mathcal{D}_x, \mathcal{D}_k\} = \{(\mathbf{x}_i, k_i), \forall i \in \{1{:}\mathsf{D}\}\}$, the complete data log-likelihood for one observation $(\mathbf{x}_i, k_i)$ would be
$$\ln f(\mathbf{x}_i, k_i|\mathcal{P}) = \ln\!\big(\mathcal{N}(\mathbf{x}_i; \boldsymbol{\mu}_{k_i}, \boldsymbol{\Sigma}_{k_i}) \cdot p(k_i)\big).$$
Because complete data is not available, EM identifies optimal parameters by maximizing the expected complete data log-likelihood, $Q(\mathcal{P}^{(t)}, \mathcal{P}^{(t-1)})$, which depends on the sets of parameters defined for two successive iterations, $t-1$ and $t$. For a single observation $\mathbf{x}_i$,
$$Q(\mathcal{P}^{(t)}, \mathcal{P}^{(t-1)}) = \sum_{k=1}^{\mathsf{K}} \underbrace{\ln\!\big(\mathcal{N}(\mathbf{x}_i; \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)}) \cdot p^{(t)}(k)\big)}_{\text{fct.}(\mathcal{P}^{(t)})} \cdot \underbrace{p(k|\mathbf{x}_i, \mathcal{P}^{(t-1)})}_{\text{fct.}(\mathcal{P}^{(t-1)})},$$
where $p(k|\mathbf{x}_i, \mathcal{P}^{(t-1)})$ is the posterior probability mass function (PMF) for the latent cluster labels $k$ obtained at the previous iteration $t-1$. This last quantity is referred to as the responsibility of a cluster $k$. For an entire data set $\mathcal{D} = \{\mathcal{D}_x\} = \{\mathbf{x}_i, \forall i \in \{1{:}\mathsf{D}\}\}$,
the expected complete data log-likelihood is
$$Q(\mathcal{P}^{(t)}, \mathcal{P}^{(t-1)}) = \sum_{i=1}^{\mathsf{D}} \sum_{k=1}^{\mathsf{K}} \ln\!\big(\mathcal{N}(\mathbf{x}_i; \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)}) \cdot p^{(t)}(k)\big) \cdot p(k|\mathbf{x}_i, \mathcal{P}^{(t-1)}).$$
The EM method identifies the optimal parameter values $\mathcal{P}^*$ from $Q(\mathcal{P}^{(t)}, \mathcal{P}^{(t-1)})$ by iteratively performing the expectation and maximization steps detailed below.
Expectation step   For an iteration $t$, the expectation step computes the probability $p_{ik}^{(t-1)}$ that an observed covariate $\mathbf{x}_i$ is associated with the cluster $k$, given the parameters $\mathcal{P}^{(t-1)}$, so that
$$p_{ik}^{(t-1)} = p(k|\mathbf{x}_i, \mathcal{P}^{(t-1)}) = \frac{\mathcal{N}(\mathbf{x}_i; \boldsymbol{\mu}_k^{(t-1)}, \boldsymbol{\Sigma}_k^{(t-1)}) \cdot p^{(t-1)}(k)}{\sum_{k=1}^{\mathsf{K}} \mathcal{N}(\mathbf{x}_i; \boldsymbol{\mu}_k^{(t-1)}, \boldsymbol{\Sigma}_k^{(t-1)}) \cdot p^{(t-1)}(k)}. \qquad (10.4)$$
The particularity of the expectation step is that the estimation of the responsibility $p_{ik}^{(t-1)}$ in equation 10.4 employs the PDF parameters $\boldsymbol{\theta}^{(t-1)} = \{(\boldsymbol{\mu}_k^{(t-1)}, \boldsymbol{\Sigma}_k^{(t-1)}), \forall k \in \{1{:}\mathsf{K}\}\}$ and the marginal PMF $p^{(t-1)}(k)$ estimated at the previous iteration $t-1$.
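As an illustration of the expectation step, the following sketch computes the responsibilities of equation 10.4 for a bivariate mixture with K = 2; the parameter values playing the role of $\mathcal{P}^{(t-1)}$ and the three covariate vectors are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical parameters playing the role of P^(t-1), for K = 2 clusters
mu = [np.array([2.0, 55.0]), np.array([4.5, 80.0])]   # means mu_k
Sigma = [np.eye(2), 2.0 * np.eye(2)]                  # covariances Sigma_k
p_k = np.array([0.5, 0.5])                            # marginal PMF p(k)

X = np.array([[1.8, 54.0], [4.3, 82.0], [3.9, 75.0]]) # toy covariates x_i

# Numerator of equation 10.4 for each observation i and cluster k
num = np.column_stack([multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k]) * p_k[k]
                       for k in range(2)])

# Responsibilities p_ik: normalize so that each row sums to one
r = num / num.sum(axis=1, keepdims=True)
print(r)
```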
Maximization step   In the maximization step, the goal is to identify
$$\mathcal{P}^{(t)} = \{\boldsymbol{\theta}^{(t)}, p^{(t)}(k)\} = \arg\max_{\mathcal{P}^{(t)}} \, Q(\mathcal{P}^{(t)}, \mathcal{P}^{(t-1)}),$$
where the optimization is broken into two parts. The PDF's new set of optimal parameters $\boldsymbol{\theta}^{(t)}$ is obtained through the maximization problem
$$\boldsymbol{\theta}^{(t)} = \arg\max_{\boldsymbol{\theta}^{(t)}} \, Q(\mathcal{P}^{(t)}, \mathcal{P}^{(t-1)}),$$
where the maximum corresponds to the value $\boldsymbol{\theta}$ such that the derivative of the function is zero,
$$\begin{aligned}
0 &= \frac{\partial Q(\mathcal{P}^{(t)}, \mathcal{P}^{(t-1)})}{\partial \boldsymbol{\theta}} \\
&= \frac{\partial \sum_{i=1}^{\mathsf{D}} \sum_{k=1}^{\mathsf{K}} \ln\!\big(\mathcal{N}(\mathbf{x}_i; \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)}) \cdot p^{(t)}(k)\big) \cdot p_{ik}^{(t-1)}}{\partial \boldsymbol{\theta}} \\
&= \frac{\partial \sum_{i=1}^{\mathsf{D}} \sum_{k=1}^{\mathsf{K}} p_{ik}^{(t-1)} \cdot \ln \mathcal{N}(\mathbf{x}_i; \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)})}{\partial \boldsymbol{\theta}} + \overbrace{\frac{\partial \sum_{i=1}^{\mathsf{D}} \sum_{k=1}^{\mathsf{K}} p_{ik}^{(t-1)} \cdot \ln p^{(t)}(k)}{\partial \boldsymbol{\theta}}}^{=\,0} \\
&= \frac{\partial \sum_{i=1}^{\mathsf{D}} p_{ik}^{(t-1)} \Big[\ln \det \boldsymbol{\Sigma}_k^{(t)} + (\mathbf{x}_i - \boldsymbol{\mu}_k^{(t)})^\intercal \big(\boldsymbol{\Sigma}_k^{(t)}\big)^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_k^{(t)})\Big]}{\partial \boldsymbol{\theta}}.
\end{aligned}$$
By isolating the mean vector $\boldsymbol{\mu}_k^{(t)}$ and the covariance matrix $\boldsymbol{\Sigma}_k^{(t)}$, the new estimates are
$$\boldsymbol{\mu}_k^{(t)} = \frac{\sum_{i=1}^{\mathsf{D}} \mathbf{x}_i \cdot p_{ik}^{(t-1)}}{p_k^{(t-1)}}, \qquad \boldsymbol{\Sigma}_k^{(t)} = \frac{\sum_{i=1}^{\mathsf{D}} (\mathbf{x}_i - \boldsymbol{\mu}_k^{(t)})(\mathbf{x}_i - \boldsymbol{\mu}_k^{(t)})^\intercal \cdot p_{ik}^{(t-1)}}{p_k^{(t-1)}},$$
where $p_k^{(t-1)} = \sum_{i=1}^{\mathsf{D}} p_{ik}^{(t-1)}$.
By taking the derivative of $Q(\mathcal{P}^{(t)}, \mathcal{P}^{(t-1)})$ with respect to $p(k)$ and repeating the same procedure, the new marginal PMF for each cluster $k$ is given by
$$p^{(t)}(k) = \frac{p_k^{(t-1)}}{\mathsf{D}}.$$
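The maximization-step updates can likewise be written in a few lines; in this sketch, the covariates and the responsibility matrix standing in for $p_{ik}^{(t-1)}$ are hypothetical, and the three updates correspond to the expressions above for $\boldsymbol{\mu}_k^{(t)}$, $\boldsymbol{\Sigma}_k^{(t)}$, and $p^{(t)}(k)$.

```python
import numpy as np

# Toy covariates and hypothetical responsibilities p_ik^(t-1) (rows sum to one)
X = np.array([[1.8, 54.0], [4.3, 82.0], [3.9, 75.0], [2.1, 58.0]])
r = np.array([[0.9, 0.1], [0.1, 0.9], [0.2, 0.8], [0.8, 0.2]])
D, K = r.shape

p_k_sum = r.sum(axis=0)                    # p_k^(t-1) = sum_i p_ik^(t-1)
mu = (r.T @ X) / p_k_sum[:, None]          # mu_k^(t): responsibility-weighted means

Sigma = []                                 # Sigma_k^(t): responsibility-weighted covariances
for k in range(K):
    diff = X - mu[k]
    Sigma.append((r[:, k, None] * diff).T @ diff / p_k_sum[k])

p_k = p_k_sum / D                          # p^(t)(k): updated marginal PMF
print(mu, p_k)
```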
In practice, the values $\boldsymbol{\mu}_k^{(0)}$ are initialized randomly, $\boldsymbol{\Sigma}_k^{(0)} = \sigma^2\mathbf{I}$ is taken as a diagonal matrix, and $p^{(0)}(k) = 1/\mathsf{K}$ is initially assumed to be equal for each cluster. Algorithm 8 summarizes the steps of the expectation maximization algorithm for estimating the parameters of a Gaussian mixture model.
Figure 10.3: The first two iterations of the EM algorithm for a Gaussian mixture model applied on the Old Faithful data (eruption time [min] vs. time to next eruption [min]): (a) expectation step, t = 1; (b) maximization step, t = 1; (c) expectation step, t = 2; (d) maximization step, t = 2.
Figure 10.3 presents the first two steps of the EM algorithm for a Gaussian mixture model applied to the Old Faithful example introduced in figure 10.1. At the first iteration $t = 1$, the two cluster centers $\boldsymbol{\mu}_k^{(0)}$ as well as their covariances $\boldsymbol{\Sigma}_k^{(0)}$ are initialized randomly. In the expectation step, the cluster responsibility is estimated for each data point $\mathbf{x}_i$. As depicted in figure 10.3a, a data point belongs to the cluster associated with the highest responsibility.
Algorithm 8: EM for Gaussian mixture models
1   define $\mathcal{D}_x$, $\epsilon$; $\ln f(\mathcal{D}_x|\mathcal{P}^{(0)}) = -1$, $\ln f(\mathcal{D}_x|\mathcal{P}^{(1)}) = 0$, $t = 1$
2   initialize $\boldsymbol{\theta}^{(0)} = \{\boldsymbol{\mu}_k^{(0)}, \boldsymbol{\Sigma}_k^{(0)}\}$, $p^{(0)}(k) = 1/\mathsf{K}$
3   while $|\ln f(\mathcal{D}_x|\mathcal{P}^{(t)}) - \ln f(\mathcal{D}_x|\mathcal{P}^{(t-1)})| > \epsilon$ do
4       for $i \in \{1, 2, \cdots, \mathsf{D}\}$ (Expectation step) do
5           $p_i^{(t-1)} = \sum_{k=1}^{\mathsf{K}} \mathcal{N}(\mathbf{x}_i; \boldsymbol{\mu}_k^{(t-1)}, \boldsymbol{\Sigma}_k^{(t-1)}) \cdot p^{(t-1)}(k)$
6           for $k \in \{1, 2, \cdots, \mathsf{K}\}$ do
7               $p_{ik}^{(t-1)} = \mathcal{N}(\mathbf{x}_i; \boldsymbol{\mu}_k^{(t-1)}, \boldsymbol{\Sigma}_k^{(t-1)}) \cdot p^{(t-1)}(k) \,/\, p_i^{(t-1)}$
8       for $k \in \{1, 2, \cdots, \mathsf{K}\}$ (Maximization step) do
9           $p_k^{(t-1)} = \sum_{i=1}^{\mathsf{D}} p_{ik}^{(t-1)}$
10          $\boldsymbol{\mu}_k^{(t)} = \sum_{i=1}^{\mathsf{D}} \mathbf{x}_i \cdot p_{ik}^{(t-1)} \,/\, p_k^{(t-1)}$
11          $\boldsymbol{\Sigma}_k^{(t)} = \sum_{i=1}^{\mathsf{D}} (\mathbf{x}_i - \boldsymbol{\mu}_k^{(t)})(\mathbf{x}_i - \boldsymbol{\mu}_k^{(t)})^\intercal \cdot p_{ik}^{(t-1)} \,/\, p_k^{(t-1)}$
12          $p^{(t)}(k) = p_k^{(t-1)} / \mathsf{D}$
13      $\ln f(\mathcal{D}_x|\mathcal{P}^{(t)}) = \sum_{i=1}^{\mathsf{D}} \ln\!\left[\sum_{k=1}^{\mathsf{K}} \mathcal{N}(\mathbf{x}_i; \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)}) \cdot p^{(t)}(k)\right]$
14      $t = t + 1$
At this point, due to the crude initialization of $\{\boldsymbol{\mu}_k^{(0)}, \boldsymbol{\Sigma}_k^{(0)}\}$, the quality of the cluster membership assignment is poor. The maximization step for $t = 1$ updates the initial cluster centers and covariances. The updated contours for the clusters using $\boldsymbol{\mu}_k^{(1)}$ and $\boldsymbol{\Sigma}_k^{(1)}$ are depicted in figure 10.3b. Note that the responsibility for each data point $\mathbf{x}_i$ has not changed, so the cluster membership assignment remains the same. Figure 10.3c presents the expectation step for the second iteration, where we can notice the change in cluster membership assignment in comparison with the previous steps in (a) and (b). In (d), the contours of each cluster are presented for parameters $\{\boldsymbol{\mu}_k^{(2)}, \boldsymbol{\Sigma}_k^{(2)}\}$, which have been updated for the second time. Figure 10.4 presents the final clusters along with their contours obtained after $t = 12$ iterations.
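The complete procedure of Algorithm 8 can be prototyped as follows; this sketch runs EM on synthetic two-dimensional data standing in for the Old Faithful covariates, and the convergence tolerance, the number of clusters, and the initialization scheme are illustrative choices rather than prescriptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
# Synthetic 2-D data with two modes (a stand-in for the Old Faithful covariates)
X = np.vstack([rng.normal([2.0, 55.0], [0.3, 6.0], size=(100, 2)),
               rng.normal([4.4, 80.0], [0.4, 6.0], size=(150, 2))])
D, K, tol = len(X), 2, 1e-6

# Initialization: means picked among the data, broad covariances, uniform p(k)
mu = X[rng.choice(D, size=K, replace=False)]
Sigma = [np.cov(X.T) for _ in range(K)]
p_k = np.full(K, 1.0 / K)

prev_ll = -np.inf
for t in range(200):
    # Expectation step: responsibilities p_ik (equation 10.4)
    num = np.column_stack([multivariate_normal.pdf(X, mu[k], Sigma[k]) * p_k[k]
                           for k in range(K)])
    ll = np.sum(np.log(num.sum(axis=1)))      # log-likelihood (equation 10.2)
    r = num / num.sum(axis=1, keepdims=True)

    # Maximization step: update mu_k, Sigma_k, and p(k)
    Nk = r.sum(axis=0)
    mu = (r.T @ X) / Nk[:, None]
    Sigma = [((r[:, k, None] * (X - mu[k])).T @ (X - mu[k])) / Nk[k]
             for k in range(K)]
    p_k = Nk / D

    if abs(ll - prev_ll) < tol:               # stopping criterion, as in Algorithm 8
        break
    prev_ll = ll

labels = r.argmax(axis=1)                     # cluster with the highest responsibility
```

The final `labels` assign each point to the cluster with the highest responsibility, which is what the contours of figure 10.4 visualize.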
Figure 10.4: Example of an application of a Gaussian mixture model to the Old Faithful geyser data set (eruption time [min] vs. time to next eruption [min]). Contours represent the clusters identified using the EM algorithm after t = 12 iterations.

Figure 10.5: Example of an application of a Gaussian mixture model to overlapping clusters ($x_1$ vs. $x_2$). Contours represent the clusters identified using the EM algorithm.
Figure 10.5 presents a second example of a Gaussian mixture model for $\mathsf{K} = 3$ generic clusters. Note that despite the overlap between clusters, the method performed well because, like in the
previous case, each cluster is well approximated by a multivariate Normal. Gaussian mixture models may display a poor performance when the clusters do not have ellipsoidal shapes. In such a case, we should resort to more advanced methods such as spectral clustering.³ Further details regarding the derivation and theoretical justification of the EM method are presented by Murphy⁴ and Bishop.⁵

³ Ng, A., M. Jordan, and Y. Weiss (2002). On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, Volume 14, pp. 849–856. MIT Press.
⁴ Murphy, K. P. (2012). Machine learning: A probabilistic perspective. MIT Press.
⁵ Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
Estimating the number of clusters   In the previous examples, we have assumed that we knew the number of clusters $\mathsf{K}$. In most practical applications, this is not the case and $\mathsf{K}$ must be estimated using methods such as cross-validation (see §8.1.2) or Bayesian model selection (see §6.8).
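As one possible illustration of such a selection procedure, the sketch below scores candidate values of K by their held-out log-likelihood using scikit-learn's GaussianMixture; the synthetic data, the hold-out split, and the candidate range are arbitrary assumptions, and criteria such as the BIC could be used instead.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Synthetic data with three clusters (hypothetical example)
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in ([0, 0], [3, 3], [0, 4])])

# Simple hold-out split, standing in for the cross-validation of section 8.1.2
idx = rng.permutation(len(X))
train, valid = X[idx[:200]], X[idx[200:]]

for K in range(1, 6):
    gmm = GaussianMixture(n_components=K, random_state=0).fit(train)
    print(K, gmm.score(valid))   # mean held-out log-likelihood per observation
```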
10.1.2 K-Means
K-means⁶ is one of the most widespread clustering methods due to its simplicity.
⁶ Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2), 129–137.
It can be seen as a special case of the Gaussian mixture model presented in §10.1.1, where the probability of each cluster is $p(k) = 1/\mathsf{K}$, the covariance of each cluster is diagonal, $\boldsymbol{\Sigma}_k = \sigma^2\mathbf{I}$, and we seek to infer the cluster means $\boldsymbol{\mu}_k \in \mathbb{R}^{\mathsf{X}}$. With Gaussian mixture models, we were computing the probability that the covariates $\mathbf{x}_i$ belong to the cluster $k$; with K-means, the cluster membership assignment $k_i$ for a covariate $\mathbf{x}_i$ is deterministic, so that it corresponds to
$$p(k|\mathbf{x}_i, \boldsymbol{\theta}) = \begin{cases} 1, & \text{if } k = k_i \\ 0, & \text{otherwise,} \end{cases}$$
where a covariate $\mathbf{x}_i$ is assigned to the cluster associated with the mean it is the closest to,
$$k_i = \arg\min_k ||\mathbf{x}_i - \boldsymbol{\mu}_k||.$$
The mean vector for the $k^{\text{th}}$ cluster is estimated as the average location of the covariates $\mathbf{x}_i$ associated with a cluster $k$, so that
$$\boldsymbol{\mu}_k = \frac{1}{\#\{i : k_i = k\}} \sum_{i : k_i = k} \mathbf{x}_i,$$
where $\#\{i : k_i = k\}$ is the number of observed covariates assigned to the cluster $k$.
Figure 10.6: Example of application of K-means ($x_1$ vs. $x_2$): (a) real clusters; (b) clusters approximated with K-means, where crosses describe each cluster center. Notice how some samples were misclassified due to the hypothesis related to the circular shape of each cluster, that is, $\boldsymbol{\Sigma}_k = \sigma^2\mathbf{I}$.
The general procedure consists in initializing $\boldsymbol{\mu}_k^{(0)}$ randomly, iteratively assigning each observation to the cluster it is the closest to, and recomputing the mean of each cluster. Figure 10.6 presents in (a) the real clusters and in (b) the cluster membership assignment computed using K-means, where the mean of each cluster is
presented by a cross. Note that because clusters are assumed to be spherical (i.e., $\boldsymbol{\Sigma}_k = \sigma^2\mathbf{I}$), the cluster membership assignment is only based on the Euclidean distance between a covariate and the center of each cluster. This explains why in figure 10.6b several observed covariates are assigned to the wrong cluster. Algorithm 9 summarizes the steps for K-means clustering.
Algorithm 9: K-means
1   define $\mathbf{x}$, $\epsilon$, $t = 0$
2   initialize $\boldsymbol{\mu}_k^{(0)}$, and $\boldsymbol{\mu}_k^{(-1)} = \boldsymbol{\mu}_k^{(0)} + \epsilon$
3   while $||\boldsymbol{\mu}_k^{(t)} - \boldsymbol{\mu}_k^{(t-1)}|| \ge \epsilon$ do
4       for $i \in \{1, 2, \cdots, \mathsf{D}\}$ do
5           $k_i = \arg\min_k ||\mathbf{x}_i - \boldsymbol{\mu}_k^{(t-1)}||^2$
6       for $k \in \{1, 2, \cdots, \mathsf{K}\}$ do
7           $\boldsymbol{\mu}_k^{(t)} = \frac{1}{\#\{i : k_i = k\}} \sum_{i : k_i = k} \mathbf{x}_i$
8       $t = t + 1$
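A compact Python counterpart of Algorithm 9 could look as follows; the synthetic data, the number of clusters, and the convergence tolerance are illustrative assumptions, and the sketch keeps the previous center whenever a cluster happens to become empty.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic 2-D data around three hypothetical centers
X = np.vstack([rng.normal(c, 0.08, size=(100, 2))
               for c in ([0.3, 0.3], [0.7, 0.6], [0.4, 0.8])])
K, tol = 3, 1e-8

mu = X[rng.choice(len(X), size=K, replace=False)]   # random initial centers
for t in range(100):
    # Assignment step: k_i = argmin_k ||x_i - mu_k||^2
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    k_i = d2.argmin(axis=1)

    # Update step: mu_k = average of the covariates assigned to cluster k
    new_mu = np.vstack([X[k_i == k].mean(axis=0) if np.any(k_i == k) else mu[k]
                        for k in range(K)])

    if np.linalg.norm(new_mu - mu) < tol:           # stop when centers no longer move
        mu = new_mu
        break
    mu = new_mu
```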
10.2 Principal Component Analysis
Principal component analysis (PCA)⁷ is a dimension reduction technique often employed in conjunction with clustering. The goal of PCA is to transform the covariate space into a new set of orthogonal axes for which the variance is maximized for the first dimension. Note that in order to perform PCA, we must work with normalized data with mean zero and with variance equal to one for each covariate.

⁷ Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology 24(6), 417.
Let us assume our observations consist of a set of raw unnormalized covariates $\mathcal{D} = \{\mathbf{x}_i, \forall i \in \{1{:}\mathsf{D}\}\}$, where $\mathbf{x}_i = [x_1\; x_2\; \cdots\; x_{\mathsf{X}}]_i^\intercal$. We normalize each value $x_{ij} = [\mathbf{x}_i]_j$ in this data set by subtracting from it the empirical average $\hat{\mu}_j = \frac{1}{\mathsf{D}}\sum_{i=1}^{\mathsf{D}} x_{ij}$ and dividing by the empirical standard deviation $\hat{\sigma}_j = \big[\frac{1}{\mathsf{D}-1}\sum_{i=1}^{\mathsf{D}} (x_{ij} - \hat{\mu}_j)^2\big]^{1/2}$, so that
$$\tilde{x}_{ij} = \frac{x_{ij} - \hat{\mu}_j}{\hat{\sigma}_j}.$$
From our set of normalized covariates $\tilde{\mathbf{x}}_i = [\tilde{x}_1\; \tilde{x}_2\; \cdots\; \tilde{x}_{\mathsf{X}}]_i^\intercal$, we seek orthogonal vectors $\mathbf{v}_i \in \mathbb{R}^{\mathsf{X}}, \forall i \in \{1{:}\mathsf{X}\}$, so that our transformation matrix,
$$\mathbf{V} = [\mathbf{v}_1\; \mathbf{v}_2\; \cdots\; \mathbf{v}_{\mathsf{X}}], \qquad (10.5)$$
is orthonormal, that is, $\mathbf{V}^\intercal\mathbf{V} = \mathbf{I}$. We can transform the covariates $\tilde{\mathbf{x}}_i$ into the new space using $\mathbf{z}_i = \mathbf{V}^\intercal\tilde{\mathbf{x}}_i$. The transformed data is then referred to as the score. Here, we want to find the orthogonal
transformation matrix $\mathbf{V}$ that minimizes the reconstruction error
$$J(\mathbf{V}) = \frac{1}{\mathsf{D}} \sum_{i=1}^{\mathsf{D}} ||\tilde{\mathbf{x}}_i - \mathbf{V}\mathbf{z}_i||^2. \qquad (10.6)$$
The optimal solution minimizing $J(\mathbf{V})$ is obtained by taking $\mathbf{V}$ as the matrix containing the eigenvectors (see §2.4.2) from the empirical covariance matrix estimated from the set of normalized covariates,
$$\hat{\boldsymbol{\Sigma}} = \frac{1}{\mathsf{D}-1} \sum_{i=1}^{\mathsf{D}} \tilde{\mathbf{x}}_i \tilde{\mathbf{x}}_i^\intercal.$$
With PCA, the eigenvectors $\mathbf{v}_i$ in $\mathbf{V}$ are sorted according to the decreasing order of their associated eigenvalues. PCA finds a new set of orthogonal reference axes so that the variance for the first principal component is maximized in the transformed space. This explains why we need to work with normalized data so that the variance is similar for each dimension in the original space. If it is not the case, PCA will identify a transformation that is biased by the difference in the scale or the units of covariates. Note that an alternative to performing principal component analysis using the eigen decomposition is to employ the singular value decomposition⁸ procedure.

⁸ Van Loan, C. F. and G. H. Golub (1983). Matrix computations. Johns Hopkins University Press.
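The procedure described above translates directly into code: normalize the covariates, eigendecompose their empirical covariance, sort the eigenvectors by decreasing eigenvalue, and project. In this sketch the input matrix is a synthetic placeholder for any D × X data set.

```python
import numpy as np

rng = np.random.default_rng(3)
# Placeholder data: 200 observations of 2 correlated covariates
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=200)

# Normalize each covariate to zero mean and unit variance
X_tilde = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigendecomposition of the empirical covariance of the normalized covariates
Sigma_hat = np.cov(X_tilde, rowvar=False)
eigvals, V = np.linalg.eigh(Sigma_hat)

# Sort by decreasing eigenvalue so principal component #1 carries the most variance
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

Z = X_tilde @ V                        # scores z_i = V^T x_tilde_i, stacked as rows
explained = eigvals / eigvals.sum()    # fraction of the variance explained
print(V, explained)
```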
Figure 10.7 presents a PCA transformation applied to the Old Faithful geyser example. In (a), the data is normalized by subtracting the empirical average and dividing by the empirical standard deviation for each covariate. We then obtain the PCA transformation matrix by performing the eigen decomposition of the empirical covariance matrix estimated from normalized covariates,
$$\mathbf{V} = [\mathbf{v}_1\; \mathbf{v}_2] = \begin{bmatrix} 0.71 & -0.71 \\ 0.71 & 0.71 \end{bmatrix}.$$
We can see in figure 10.7b that when transformed in the PCA space, the variance is maximized along the first principal component.
Figure 10.7: Example of application of principal component analysis to the normalized and centered Old Faithful data set: (a) normalized and centered space, eruption time $\tilde{x}_1 = (x_1 - \hat{\mu}_1)/\hat{\sigma}_1$ vs. time to next eruption $\tilde{x}_2 = (x_2 - \hat{\mu}_2)/\hat{\sigma}_2$; (b) principal component space, $z_1$ vs. $z_2$.
Fraction of the variance explained   In the PCA space, the eigenvalues describe the variance associated with each principal component. The fraction of the variance explained by each principal component, $I$, is quantified by the ratio of the eigenvalue associated with the $i^{\text{th}}$ component divided by the sum of eigenvalues,
$$I = \frac{\lambda_i}{\sum_{i=1}^{\mathsf{X}} \lambda_i}.$$
In the example presented in figure 10.7, the fraction of the variance explained by each principal component is $I = \{0.9504, 0.0496\}$.
Dimension reduction   PCA can be employed to reduce the number of dimensions describing the covariate space. This is achieved by removing the eigenvectors, that is, columns from the matrix $\mathbf{V}$, that are associated with a negligible fraction of the variance explained. Figure 10.8a presents the example of a data set with $\mathsf{K} = 3$ clusters and where covariates $\mathbf{x}_i \in \mathbb{R}^3$. In figure 10.8b, we see clusters identified using the K-means algorithm. By performing PCA on the normalized and centered data, we find the transformation matrix
$$\mathbf{V} = \begin{bmatrix} 0.7064 & 0.3103 & 0.6362 \\ 0.1438 & 0.9430 & 0.3002 \\ 0.6931 & 0.1206 & 0.7107 \end{bmatrix},$$
where the fraction of the variance explained by each principal component is $I = \{85.5, 14.2, 0.3\}$. We can see here that the third principal component explains a negligible fraction of the variance. This indicates that the original data set defined in a 3-D space can be represented in a 2-D space with a negligible loss of information. Such a 2-D space is obtained by removing the last column from the original PCA matrix,
$$\mathbf{V}_2 = \begin{bmatrix} 0.7064 & 0.3103 \\ 0.1438 & 0.9430 \\ 0.6931 & 0.1206 \end{bmatrix},$$
and then performing the transformation $\mathbf{z}_i = \mathbf{V}_2^\intercal \tilde{\mathbf{x}}_i$. Figure 10.8c presents K-means clusters for the covariates transformed in the 2-D PCA space, $\mathbf{z}_i \in \mathbb{R}^2$.
Figure 10.8: Example of application of the principal component analysis for dimensionality reduction: (a) real clusters in a 3-D space ($x_1$, $x_2$, $x_3$); (b) K-means clusters in a 3-D space; (c) K-means clusters in the 2-D principal component space ($z_1$ vs. $z_2$).
Note that principal component analysis is limited to linear dimension reduction. For nonlinear cases, we can opt for more advanced methods such as t-distributed stochastic neighbor embedding (t-SNE).⁹ In addition to being employed in unsupervised learning, PCA and other dimensionality reduction techniques can be employed in the context of regression and classification (see chapters 8 and 9). In such a case, instead of modeling the relationship between the observed system responses and the covariates, the relationship is modeled between the observed system responses and the transformed covariates represented in a lower-dimensional space where inter-covariate dependence is removed using PCA or any other dimension-reduction technique.

⁹ Van der Maaten, L. and G. Hinton (2008). Visualizing data using t-SNE. Journal of Machine Learning Research 9, 2579–2605.