5 Convex Optimization

Note: In the context of this book, we are only concerned by target functions f̃(θ) ∈ ℝ that are continuous and twice differentiable.
When building a model in the context of machine learning, we often seek optimal model parameters θ*, in the sense where they maximize the prior probability (or probability density) of predicting observed data. Here, we denote by f̃(θ) the target function we want to maximize. Optimal parameter values θ* are those that maximize the function f̃(θ),
Note: In order to be mathematically rigorous, equation 5.1 should employ the operator ∈ rather than = to recognize that θ* belongs to a set of possible solutions maximizing f̃(θ).
θ* = arg max_θ f̃(θ). (5.1)
Figure 5.1: Examples of a convex and a non-convex set. (a) Convex set; (b) Non-convex set.
[Figure: surfaces of f̃(θ) over (θ₁, θ₂). (a) f̃(θ): concave, −f̃(θ): convex; (b) f̃(θ): non-concave, −f̃(θ): non-convex.]
Figure 5.2: Representations of convex/concave and non-convex/non-concave functions.
With a small caveat that will be covered below, convex optimization methods can be employed for the maximization task in equation 5.1. The key aspect of convex optimization methods is that, under certain conditions, they are guaranteed to reach optimal values for convex functions. Figure 5.1 presents examples of convex and non-convex sets. For a set to be convex, you must be able to link any two points belonging to it without being outside of this set. Figure 5.1b presents a case where this property is not satisfied. For a convex function, the segment linking any pair of its points lies above or is equal to the function. Conversely, for a concave function, the opposite holds: the segment linking any pair of points lies below or is equal to the function. A concave function can be transformed into a convex one by taking the negative of it. Therefore, a maximization problem formulated as a concave optimization can be formulated in terms of a convex optimization following

θ* = arg max_θ f̃(θ) (concave optimization) ≡ arg min_θ −f̃(θ) (convex optimization).

In this chapter, we refer to convex optimization even if we are interested in maximizing a concave function, rather than minimizing a convex one. This choice is justified by the prevalence of convex optimization in the literature. Moreover, note that for several machine learning methods, we seek θ* based on a minimization problem
j.-a. goulet 48
where f̃(θ) is a function of the difference between observed values and those predicted by a model. Figure 5.2 presents examples of convex/concave and non-convex/non-concave functions. Non-convex/non-concave functions such as the one in figure 5.2b may have several local optima. Many functions of practical interest are non-convex/non-concave. As we will see in this chapter, convex optimization methods can also be employed for non-convex/non-concave functions given that we choose a proper starting location.
This chapter presents the gradient ascent and Newton-Raphson
methods, as well as practical tools to be employed with them.
For full-depth details regarding optimization methods, the reader
should refer to dedicated textbooks.
¹ Bertsekas, D. P., A. Nedić, and A. E. Ozdaglar (2003). Convex analysis and optimization. Athena Scientific; Chong, E. K. P. and S. H. Żak (2013). An introduction to optimization (4th ed.). Wiley; and Nocedal, J. and S. Wright (2006). Numerical optimization. Springer Science & Business Media.
Derivative: f̃′(θ) ≡ df̃(θ)/dθ

Gradient: ∇f̃(θ) ≡ ∇_θ f̃(θ) = [∂f̃(θ)/∂θ₁ ∂f̃(θ)/∂θ₂ ··· ∂f̃(θ)/∂θₙ]ᵀ

Maximum of a concave function: θ* = arg max_θ f̃(θ) : df̃(θ*)/dθ = 0
5.1 Gradient Ascent

A gradient is a vector containing the partial derivatives of a function with respect to its variables. For a continuous function, the maximum is located at the point where its gradient equals zero. Gradient ascent is based on the principle that as long as we move in the direction of the gradient, we are moving toward a maximum. For the unidimensional case, we choose to move to a new position θ_new defined as the old value θ_old plus a search direction d defined by a scaling factor λ times the derivative estimated at θ_old,

θ_new = θ_old + λ · f̃′(θ_old), with d = λ · f̃′(θ_old).

Note: λ is also known as the learning rate or step length.
A common practice for setting λ is to employ backtracking line search, where a new position is accepted if the Armijo rule² is satisfied so that

f̃(θ_new) ≥ f̃(θ_old) + c · d · f̃′(θ_old), with c ∈ (0, 1). (5.2)

² Armijo, L. (1966). Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics 16(1), 1–3.
[Figure: top panel, f̃(θ) with θ_old, θ_new, and the region of Armijo-inadmissible θ_new for the cases c = 0 and c = 1; bottom panel, f̃′(θ).]
Figure 5.3: Example of application of the Armijo rule to test if θ_new has sufficiently increased the objective function in comparison with θ_old.
Figure 5.3 presents a comparison of the application of equation 5.2 with the two extreme cases, c = 0 and c = 1. For c = 1, θ_new is only accepted if f̃(θ_new) lies above the plane defined by the tangent at θ_old. For c = 0, θ_new is only accepted if f̃(θ_new) > f̃(θ_old). The larger c is, the stricter is the Armijo rule for ensuring that sufficient progress is made by the current step. With backtracking line search, we start from an initial value λ₀ and reduce it until equation 5.2 is satisfied. Algorithm 1 presents a minimal version of the gradient ascent with backtracking line search.
probabilistic machine learning for civil engineers 49
Algorithm 1: Gradient ascent with backtracking line search
 1  initialize λ = λ₀, θ_old = θ₀; define ε, c, f̃(θ)
 2  while |f̃′(θ_old)| > ε do
 3      compute f̃(θ_old) (function value) and f̃′(θ_old) (1st derivative)
 4      compute θ_new = θ_old + λ·f̃′(θ_old), with d = λ·f̃′(θ_old)
 5      if f̃(θ_new) < f̃(θ_old) + c·d·f̃′(θ_old) then
 6          assign λ = λ/2 (backtracking)
 7          go to 4
 8      assign λ = λ₀, θ_old = θ_new
 9  θ* = θ_old
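As an illustration, Algorithm 1 can be sketched in Python as follows. The function names, the concave test function, and the default values for λ₀, c, and ε are assumptions made for this example, not part of the text.

```python
def gradient_ascent(f, df, theta0, lam0=1.0, c=0.5, eps=1e-6, max_iter=100):
    """Gradient ascent with backtracking line search (Algorithm 1)."""
    theta_old = theta0
    for _ in range(max_iter):
        slope = df(theta_old)                  # f~'(theta_old)
        if abs(slope) <= eps:                  # convergence: |f~'| <= eps
            break
        lam = lam0                             # reset the scaling factor
        for _ in range(50):                    # backtracking loop
            d = lam * slope                    # search direction
            theta_new = theta_old + d
            # Armijo rule (eq. 5.2): require a sufficient increase
            if f(theta_new) >= f(theta_old) + c * d * slope:
                break
            lam /= 2                           # halve lambda and retry
        theta_old = theta_new
    return theta_old

# Concave example: f~(theta) = -(theta - 3)^2 has its maximum at theta* = 3
theta_star = gradient_ascent(lambda t: -(t - 3.0)**2,
                             lambda t: -2.0 * (t - 3.0), theta0=0.0)
```

For this concave quadratic, the iterates reach θ* = 3 regardless of the starting location, which would not hold for a non-convex/non-concave f̃(θ).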
[Figure: two gradient-ascent iterations, each showing f̃(θ) and f̃′(θ).
Loop #1: θ_old = 3.50, λ₀ = 3; f̃(θ_old) = 0.21, f̃′(θ_old) = 0.32; θ_new = θ_old + λ₀·f̃′(θ_old) = 4.47, f̃(θ_new) = 0.82.
Loop #2: θ_old = 4.47, λ₀ = 3; f̃(θ_old) = 0.82, f̃′(θ_old) = 0.72; θ_new = θ_old + λ₀·f̃′(θ_old) = 6.64, f̃(θ_new) = −0.03; with λ₀/2: θ_new = 5.55, f̃(θ_new) = 0.70; with λ₀/4: θ_new = 5.01, f̃(θ_new) = 1.01.]
Figure 5.4: Example of application of gradient ascent with backtracking for finding the maximum of a function.
Figure 5.4 presents the first two steps of the application of algorithm 1 to a non-convex/non-concave function with an initial value θ₀ = 3.5 and a scaling factor λ₀ = 3. For the second step, the scaling factor λ has to be reduced twice in order to satisfy the Armijo rule. One of the difficulties with gradient ascent is that the convergence speed depends on the choice of λ₀. If λ₀ is too small, several steps will be wasted and convergence will be slow. If λ₀ is too large, the algorithm may not converge.
[Figure: f̃(θ) and f̃′(θ) with gradient-ascent iterates converging to a local maximum.]
Figure 5.5: Example of application of gradient ascent converging to a local maximum for a function.
Figure 5.5 presents a limitation common to all convex optimization methods when applied to functions involving local maxima; if the starting location θ₀ is not located on the slope segment leading to the global maximum, the algorithm will most likely miss it and converge to a local maximum. The task of selecting a proper value θ₀ is nontrivial because in most cases, it is not possible to visualize f̃(θ). This issue can be tackled by attempting multiple starting locations θ₀ and by using domain knowledge to identify proper starting locations.
Figure 5.6: Comparison of gradient ascent with and without momentum.

Gradient ascent can be applied to search for the maximum of a multivariate function by replacing the univariate derivative by the gradient so that

θ_new = θ_old + λ · ∇f̃(θ_old).
As illustrated in figure 5.6, because gradient ascent follows the direction where the gradient is maximal, it often displays an oscillatory pattern. This issue can be mitigated by introducing a momentum term in the calculation of θ_new,³

v_new = γ · v_old + λ · ∇f̃(θ_old),
θ_new = θ_old + v_new,

where v can be interpreted as a velocity that carries the momentum from the previous iterations, scaled by a momentum coefficient γ.

³ Rumelhart, D. E., G. E. Hinton, and R. J. Williams (1986). Learning representations by back-propagating errors. Nature 323, 533–536.
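A minimal sketch of this momentum update follows; the elongated concave objective, the values of λ and γ, and the function names are illustrative assumptions, not from the text.

```python
def gradient_ascent_momentum(grad, theta0, lam=0.05, gamma=0.8, n_iter=200):
    """Gradient ascent where a velocity v carries momentum across
    iterations: v_new = gamma*v_old + lam*grad(theta_old)."""
    theta = list(theta0)
    v = [0.0] * len(theta)                     # initial velocity is zero
    for _ in range(n_iter):
        g = grad(theta)
        v = [gamma * vi + lam * gi for vi, gi in zip(v, g)]
        theta = [ti + vi for ti, vi in zip(theta, v)]
    return theta

# Elongated concave bowl f~(theta) = -(theta_1^2 + 10*theta_2^2)/2 with
# maximum at the origin; the gradient is steep along theta_2, where plain
# gradient ascent oscillates and the velocity term damps the oscillation.
theta_hat = gradient_ascent_momentum(lambda th: [-th[0], -10.0 * th[1]],
                                     [4.0, 1.0])
```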
5.2 Newton-Raphson

The Newton-Raphson method allows us to adaptively scale the search direction vector using the second-order derivative f̃″(θ).

Second-order derivatives: f̃″(θ) ≡ d²f̃(θ)/dθ², f̃ᵢ″(θ) ≡ ∂²f̃(θ)/∂θᵢ²

Hessian: [H[f̃(θ)]]ᵢⱼ = ∂²f̃(θ)/(∂θᵢ∂θⱼ)
Knowing that the maximum of a function corresponds to the point where the gradient is zero, f̃′(θ*) = 0, we can find this maximum by formulating a linearized gradient equation using the second-order derivative of f̃(θ) and then setting it equal to zero. The analytic formulation for the linearized gradient function (see §3.4.2) approximated at the current location θ_old is

f̃′(θ) ≈ f̃″(θ_old) · (θ − θ_old) + f̃′(θ_old). (5.3)

We can estimate θ_new by setting equation 5.3 equal to zero, and then by solving for θ, we obtain

θ_new = θ_old − f̃′(θ_old)/f̃″(θ_old). (5.4)
[Figure: f̃(θ), f̃′(θ), and f̃″(θ) for a quadratic function.]
Figure 5.7: Example of application of Newton-Raphson to a quadratic function.
Let us consider the case where we want to find the maximum of a quadratic function (i.e., ∝ x²), as illustrated in figure 5.7. In the case of a quadratic function, the algorithm converges to the exact solution in one iteration, no matter the starting point, because the gradient of a quadratic function is exactly described by the linear function in equation 5.3.
Algorithm 2 presents a minimal version of the Newton-Raphson method with backtracking line search. Note that at line 6, there is again a scaling factor λ, which is employed because the Newton-Raphson method is exact only for quadratic functions. For more general non-convex/non-concave functions, the linearized gradient is an approximation, such that a value of λ = 1 will not always lead to a θ_new satisfying the Armijo rule in equation 5.2.
Figure 5.8 presents the application of algorithm 2 to a non-convex/non-concave function with an initial value θ₀ = 3.5 and a scaling factor λ₀ = 1. For each loop, the pink solid line represents the linearized gradient function formulated in equation 5.3. Notice how, for the first two iterations, the second derivative f̃″(θ) > 0. Having a positive second derivative indicates that the linearization of f̃′(θ) equals zero for a minimum rather than for a maximum. One simple option in this situation is to define λ = −λ in order to ensure that the next step moves in the same direction as the gradient. The convergence with Newton-Raphson is typically faster than with gradient ascent.
Algorithm 2: Newton-Raphson with backtracking line search
 1  initialize λ = λ₀ = 1, θ_old = θ₀; define ε, c, f̃(θ)
 2  while |f̃′(θ_old)| > ε do
 3      compute: θ_old → f̃(θ_old) (function evaluation), f̃′(θ_old) (first derivative), f̃″(θ_old) (second derivative)
 4      if f̃″(θ_old) > 0 then
 5          λ = −λ
 6      compute θ_new = θ_old − λ·f̃′(θ_old)/f̃″(θ_old), with d = −λ·f̃′(θ_old)/f̃″(θ_old)
 7      if f̃(θ_new) < f̃(θ_old) + c·d·f̃′(θ_old) then
 8          assign λ = λ/2 (backtracking)
 9          go to 6
10      assign λ = λ₀, θ_old = θ_new
11  θ* = θ_old
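Algorithm 2 can be sketched in Python as follows. The direction reversal for f̃″ > 0 and the Armijo test mirror the algorithm's lines 4-9; the trigonometric test function and the default values of c and ε are assumptions made for this example.

```python
import math

def newton_raphson(f, df, d2f, theta0, lam0=1.0, c=1e-4, eps=1e-6,
                   max_iter=100):
    """1-D Newton-Raphson with backtracking line search (Algorithm 2)."""
    theta_old = theta0
    for _ in range(max_iter):
        g = df(theta_old)                      # f~'(theta_old)
        if abs(g) <= eps:
            break
        h = d2f(theta_old)                     # f~''(theta_old)
        lam = lam0 if h < 0 else -lam0         # f~'' > 0: reverse direction
        for _ in range(50):                    # backtracking loop
            d = -lam * g / h                   # Newton search direction
            theta_new = theta_old + d
            if f(theta_new) >= f(theta_old) + c * d * g:   # Armijo rule
                break
            lam /= 2
        theta_old = theta_new
    return theta_old

# The maximum of f~(theta) = cos(theta) closest to theta_0 = 0.5 is theta* = 0
theta_star = newton_raphson(math.cos,
                            lambda t: -math.sin(t),
                            lambda t: -math.cos(t), theta0=0.5)
```

On a concave quadratic, the same routine converges in a single iteration, consistent with the discussion surrounding figure 5.7.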
[Figure: three Newton-Raphson iterations, each showing f̃(θ), f̃′(θ), and f̃″(θ).
Loop #1: θ_old = 3.5, λ = 1; f̃(θ_old) = 0.21, f̃′(θ_old) = 0.32, f̃″(θ_old) = 0.64 > 0 → λ = −λ; θ_new = θ_old − λ·f̃′(θ_old)/f̃″(θ_old) = 4.01, f̃(θ_new) = 0.46.
Loop #2: θ_old = 4.01, λ = 1; f̃(θ_old) = 0.46, f̃′(θ_old) = 0.70, f̃″(θ_old) = 0.64 > 0 → λ = −λ; θ_new = 5.11, f̃(θ_new) = 0.99.
Loop #3: θ_old = 5.11, λ = 1; f̃(θ_old) = 0.99, f̃′(θ_old) = 0.31, f̃″(θ_old) = −1.94 < 0 → OK; θ_new = 4.95, f̃(θ_new) = 1.02.]
Figure 5.8: Example of application of the Newton-Raphson algorithm with backtracking line search for finding the maximum of a function.
The Newton-Raphson algorithm can be employed for identifying the optimal values θ* = [θ₁* θ₂* ··· θₙ*]ᵀ in domains having multiple dimensions. The equation 5.4 developed for univariate cases can be extended for n-dimensional domains by following

θ_new = θ_old − H[f̃(θ_old)]⁻¹ · ∇f̃(θ_old). (5.5)
H[f̃(θ)] denotes the n × n Hessian matrix containing the second-order partial derivatives for the function f̃(θ) evaluated at θ. The Hessian is a symmetric matrix where each term is defined by

[H[f̃(θ)]]ᵢⱼ = ∂²f̃(θ)/(∂θᵢ∂θⱼ) = ∂/∂θᵢ (∂f̃(θ)/∂θⱼ) = ∂/∂θⱼ (∂f̃(θ)/∂θᵢ).
[Figure: surface of f̃(θ) over (θ₁, θ₂) exhibiting a saddle point.]
Figure 5.9: Example of saddle point where ∂²f̃(θ)/∂θ₁² < 0 and ∂²f̃(θ)/∂θ₂² > 0.
When the terms on the main diagonal of the Hessian matrix are positive, it is an indication that our linearized gradient points toward a minimum rather than a maximum. As we did in algorithm 2, it is then possible to move toward the maximum by reversing the search direction. One issue with gradient-based multi-dimensional optimization is saddle points. Figure 5.9 presents an example of a saddle point in a function for which one second-order partial derivative is negative, ∂²f̃(θ)/∂θ₁² < 0, and the other is positive, ∂²f̃(θ)/∂θ₂² > 0.
In such a case, one option is to regularize the Hessian matrix by subtracting a constant τ on its main diagonal in order to ensure that all terms are negative:

H̃[f̃(θ)] = H[f̃(θ)] − τI.

Avoiding being trapped in saddle points is a key challenge in optimization. The reader interested in more advanced strategies for that purpose should consult specialized literature.⁴
⁴ Nocedal, J. and S. Wright (2006). Numerical optimization. Springer Science & Business Media; Dauphin, Y. N., R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems 27, 2933–2941; and Goodfellow, I., Y. Bengio, and A. Courville (2016). Deep learning. MIT Press.
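For intuition, a single multivariate Newton-Raphson step (equation 5.5) with a diagonal regularization constant (here named `tau`) can be sketched for n = 2, using the closed-form inverse of a 2×2 matrix; the function names and the quadratic test case are assumptions of this example.

```python
def newton_step_2d(grad, hess, theta, tau=0.0):
    """One Newton-Raphson step theta_new = theta_old - H~^{-1} grad(theta_old),
    where H~ = H - tau*I subtracts tau on the Hessian's main diagonal."""
    g = grad(theta)
    (h11, h12), (h21, h22) = hess(theta)       # 2x2 symmetric Hessian
    h11 -= tau                                 # regularize the diagonal
    h22 -= tau
    det = h11 * h22 - h12 * h21                # closed-form 2x2 inverse
    d1 = ( h22 * g[0] - h12 * g[1]) / det      # H~^{-1} * grad, row 1
    d2 = (-h21 * g[0] + h11 * g[1]) / det      # H~^{-1} * grad, row 2
    return [theta[0] - d1, theta[1] - d2]

# Concave quadratic f~ = -(theta_1 - 1)^2 - 2*(theta_2 + 2)^2:
# a single step from any start lands on the maximum (1, -2).
theta_new = newton_step_2d(lambda th: [-2.0 * (th[0] - 1.0),
                                       -4.0 * (th[1] + 2.0)],
                           lambda th: [[-2.0, 0.0], [0.0, -4.0]],
                           [0.0, 0.0])
```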
In practice, second-order methods such as Newton-Raphson can be employed when we have access to an analytical formulation for evaluating the Hessian, H[f̃(θ)]. Otherwise, if we have to employ numerical derivatives (see §5.4) to estimate the Hessian, the computational demand quickly becomes prohibitive as the number of variables increases. For example, if the number of variables is n = 100, by considering the symmetry in the Hessian matrix, there are 100 + ½(100 · 100 − 100) = 5050 second-order partial derivatives to estimate for each Newton-Raphson iteration. Even for computationally efficient functions f̃(θ), the necessary number of evaluations becomes a challenge. In addition, for a large n, there is a substantial effort required to invert the Hessian in equation 5.5. Therefore, even if Newton-Raphson is more efficient in terms of the number of iterations required to reach convergence, this does not take into account the time necessary to obtain the second-order derivatives. When dealing with a large number of variables or with functions f̃(θ) that are computationally expensive, we may revert to using momentum-based gradient-ascent methods.
5.3 Coordinate Ascent

One alternative to the methods we have seen for multivariate optimization is to perform the search for each variable separately, which corresponds to solving a succession of 1-D optimization problems. This approach is known as coordinate optimization,⁵ and it is known to perform well when there is no strong coupling between parameters with respect to the values of the objective function. Algorithm 3 presents a minimal version of the coordinate ascent Newton-Raphson using backtracking line search. Figure 5.10 presents two intermediate steps illustrating the application of algorithm 3 to a bidimensional non-convex/non-concave function.

⁵ Wright, S. J. (2015). Coordinate descent algorithms. Mathematical Programming 151(1), 3–34; and Friedman, J., T. Hastie, H. Höfling, and R. Tibshirani (2007). Pathwise coordinate optimization. The Annals of Applied Statistics 1(2), 302–332.
Algorithm 3: Coordinate ascent using Newton-Raphson
 1  initialize λ = λ₀ = 1, θ_old = θ₀ = [θ₁ θ₂ ··· θₙ]ᵀ
 2  define ε, c, f̃(θ)
 3  while ||∇f̃(θ_old)|| > ε do
 4      for i ∈ {1 : n} do
 5          compute f̃(θ_old) (function evaluation), f̃ᵢ′(θ_old) (i-th 1st partial derivative), f̃ᵢ″(θ_old) (i-th 2nd partial derivative)
 6          if f̃ᵢ″(θ_old) > 0 then
 7              λ = −λ
 8          compute [θ_new]ᵢ = [θ_old]ᵢ − λ·f̃ᵢ′(θ_old)/f̃ᵢ″(θ_old), with d = −λ·f̃ᵢ′(θ_old)/f̃ᵢ″(θ_old)
 9          if f̃(θ_new) < f̃(θ_old) + c·d·f̃ᵢ′(θ_old) then
10              assign λ = λ/2 (backtracking)
11              go to 8
12          assign λ = λ₀, θ_old = θ_new
13  θ* = θ_new
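Algorithm 3 can be sketched by reusing the 1-D Newton update for one coordinate at a time. The lists `dfs` and `d2fs` holding the partial-derivative callables, as well as the separable test function, are assumptions made for this example.

```python
def coordinate_ascent(f, dfs, d2fs, theta0, lam0=1.0, c=1e-4, eps=1e-6,
                      max_sweeps=100):
    """Coordinate ascent using Newton-Raphson (Algorithm 3).
    dfs[i] / d2fs[i]: i-th first / second partial derivative of f~."""
    theta = list(theta0)
    for _ in range(max_sweeps):
        grad_norm = sum(df(theta)**2 for df in dfs) ** 0.5
        if grad_norm <= eps:                   # ||grad f~(theta)|| <= eps
            break
        for i in range(len(theta)):            # one 1-D problem per variable
            g, h = dfs[i](theta), d2fs[i](theta)
            lam = lam0 if h < 0 else -lam0     # move toward a maximum
            for _ in range(50):                # backtracking loop
                d = -lam * g / h               # Newton step on coordinate i
                cand = list(theta)
                cand[i] += d
                if f(cand) >= f(theta) + c * d * g:   # Armijo rule
                    break
                lam /= 2
            theta = cand
    return theta

# Separable concave function with its maximum at (1, 2)
f = lambda th: -(th[0] - 1.0)**2 - 2.0 * (th[1] - 2.0)**2
theta_star = coordinate_ascent(f,
                               [lambda th: -2.0 * (th[0] - 1.0),
                                lambda th: -4.0 * (th[1] - 2.0)],
                               [lambda th: -2.0, lambda th: -4.0],
                               [0.0, 0.0])
```

Because this test function has no coupling between θ₁ and θ₂, a single sweep reaches the maximum, which matches the remark above about coordinate optimization performing well for weakly coupled parameters.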
[Figure: two coordinate-ascent updates on a 2-D function, each shown on contour plots over (θ₁, θ₂).
Loop #5 → [θ]₁: θ_old = [0.41 0.31]ᵀ, λ = 1; f̃(θ_old) = 3.19, f̃₁′(θ_old) = 0.72, f̃₁″(θ_old) = −0.41 < 0 → OK; [θ_new]₁ = [θ_old]₁ − f̃₁′(θ)/f̃₁″(θ) = 2.13; θ_new = [2.13 0.31]ᵀ, f̃(θ_new) = 3.01.
Loop #6 → [θ]₂: θ_old = [2.13 0.31]ᵀ, λ = 1; f̃(θ_old) = 3.01, f̃₂′(θ_old) = 0.93, f̃₂″(θ_old) = −0.77 < 0 → OK; [θ_new]₂ = [θ_old]₂ − f̃₂′(θ)/f̃₂″(θ) = 0.90; θ_new = [2.13 0.90]ᵀ, f̃(θ_new) = 2.52.]
Figure 5.10: Example of application of the coordinate ascent Newton-Raphson algorithm with backtracking for finding the maximum of a 2-D function.
5.4 Numerical Derivatives

One aspect that was omitted in the previous sections is How do we obtain the first- and second-order derivatives? When a twice-differentiable formulation for f̃(θ) exists, ∂f̃(θ)/∂θᵢ and ∂²f̃(θ)/∂θᵢ² can be expressed analytically. When analytic formulations are not available, derivatives can be estimated numerically using either a forward, backward, or central differentiation scheme.⁶ Here, we only focus on the central differentiation method. Note that forward and backward differentiations are not as accurate as central, yet they are computationally cheaper. As illustrated in figure 5.11, first- and second-order partial derivatives of f̃(θ) with respect to the i-th element of a vector θ = [θ₁ θ₂ ··· θₙ]ᵀ are given by

Figure 5.11: Illustration of 1-D numerical derivatives.

f̃ᵢ′(θ) = ∂f̃(θ)/∂θᵢ ≈ [f̃(θ + δI(i)) − f̃(θ − δI(i))] / (2δ)

f̃ᵢ″(θ) = ∂²f̃(θ)/∂θᵢ² ≈ [f̃(θ + δI(i)) − 2f̃(θ) + f̃(θ − δI(i))] / δ²,

⁶ Nocedal, J. and S. Wright (2006). Numerical optimization. Springer Science & Business Media; and Abramowitz, M. and I. A. Stegun (1972). Handbook of mathematical functions with formulas, graphs, and mathematical tables. National Bureau of Standards, Applied Mathematics.
where δ is a small perturbation to the value θᵢ and I(i) is an n × 1 indicator vector, for which all values are equal to 0, except the i-th, which is equal to one.

Examples: I(3) = [0 0 1 0 ··· 0]ᵀ, I(1) = [1 0 0 0 ··· 0]ᵀ
Figure 5.12: Illustration of 2-D partial numerical derivatives.

As illustrated in figure 5.12, numerical derivatives can also be employed to estimate each term of the Hessian matrix,

[H[f̃(θ)]]ᵢⱼ ≈ [ ∂f̃(θ + δᵢI(i))/∂θⱼ − ∂f̃(θ − δᵢI(i))/∂θⱼ ] / (2δᵢ),

where the terms on the numerator are defined as

∂f̃(θ + δᵢI(i))/∂θⱼ = [ f̃(θ + I(i)δᵢ + I(j)δⱼ) − f̃(θ + I(i)δᵢ − I(j)δⱼ) ] / (2δⱼ)

∂f̃(θ − δᵢI(i))/∂θⱼ = [ f̃(θ − I(i)δᵢ + I(j)δⱼ) − f̃(θ − I(i)δᵢ − I(j)δⱼ) ] / (2δⱼ).
In practice, the full Hessian matrix can only be estimated numerically when the number of variables n is small or when evaluating f̃(θ) is computationally cheap.
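The central-difference formulas above can be sketched as follows; the perturbation sizes and the polynomial test function are illustrative assumptions.

```python
def central_first(f, theta, i, delta=1e-5):
    """Central-difference estimate of the i-th first partial derivative."""
    tp, tm = list(theta), list(theta)
    tp[i] += delta                             # theta + delta*I(i)
    tm[i] -= delta                             # theta - delta*I(i)
    return (f(tp) - f(tm)) / (2 * delta)

def central_second(f, theta, i, delta=1e-4):
    """Central-difference estimate of the i-th second partial derivative."""
    tp, tm = list(theta), list(theta)
    tp[i] += delta
    tm[i] -= delta
    return (f(tp) - 2 * f(theta) + f(tm)) / delta**2

def hessian_ij(f, theta, i, j, delta=1e-4):
    """Central-difference estimate of the (i, j) term of the Hessian."""
    def dfj(th):                               # first partial w.r.t. theta_j
        tp, tm = list(th), list(th)
        tp[j] += delta
        tm[j] -= delta
        return (f(tp) - f(tm)) / (2 * delta)
    tp, tm = list(theta), list(theta)
    tp[i] += delta                             # perturb along theta_i ...
    tm[i] -= delta
    return (dfj(tp) - dfj(tm)) / (2 * delta)   # ... and difference dfj

# f~(theta) = theta_1^2 * theta_2 at theta = (1, 2):
# the exact values are df/dtheta_1 = 4, d2f/dtheta_1^2 = 4,
# and d2f/(dtheta_1 dtheta_2) = 2.
f = lambda th: th[0]**2 * th[1]
```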
5.5 Parameter-Space Transformation

When optimizing using either the gradient ascent or the Newton-Raphson method, we are likely to run into difficulties for parameters θ that are not defined in an unbounded space. In such a case, the efficiency is hindered because the algorithms may propose new positions θ_new that lie outside the valid domain. When trying to identify optimal parameters for probability distributions such as those described in chapter 4, common domains for parameters are as follows:

Mean parameters: μ ∈ ℝ
Standard deviations: σ ∈ ℝ⁺
Correlation coefficients: ρ ∈ (−1, 1)
Probability: Pr(X = x) ∈ (0, 1)

One solution to this problem is to perform the optimization in a transformed space θᵗʳ such that

θᵗʳ = g(θ) ∈ ℝ.

For each θ, the choice of transformation function g(θ) depends on its domain. Figure 5.13 presents examples of transformations for θ ∈ ℝ, θ ∈ ℝ⁺, and θ ∈ (a, b). Note that in the simplest case, where θ ∈ ℝ, no transformation is required, so θᵗʳ = θ.
(a) θᵗʳ = θ; (b) θᵗʳ = ln(θ); (c) θᵗʳ = −ln((b − a)/(θ − a) − 1)
Figure 5.13: Examples of transformation functions.
For θ ∈ ℝ⁺, a common transformation is to take the logarithm θᵗʳ = ln(θ), and its inverse transformation is θ = exp(θᵗʳ). The analytical derivatives for the transformation and its inverse are

dθᵗʳ/dθ = 1/θ,  dθ/dθᵗʳ = exp(θᵗʳ).

For parameters bounded in an interval θ ∈ (a, b), a possible transformation is the scaled logistic sigmoid function

θᵗʳ = −ln( (b − a)/(θ − a) − 1 ),

and its inverse is given by

θ = (b − a)/(1 + exp(−θᵗʳ)) + a.

The derivatives of the transformation and its inverse are

dθᵗʳ/dθ = (a − b) / ((θ − a)(θ − b)),  dθ/dθᵗʳ = (b − a) · exp(−θᵗʳ) / (1 + exp(−θᵗʳ))².
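These two bounded-domain transformations can be sketched as follows; the function names are assumptions of the example.

```python
import math

def to_unbounded(theta, a, b):
    """theta in (a, b) -> theta_tr in R, via the scaled logistic sigmoid."""
    return -math.log((b - a) / (theta - a) - 1)

def from_unbounded(theta_tr, a, b):
    """Inverse transformation: theta_tr in R -> theta in (a, b)."""
    return (b - a) / (1 + math.exp(-theta_tr)) + a

# For sigma in R+, the pair is simply math.log / math.exp.
rho_tr = to_unbounded(0.3, -1.0, 1.0)          # correlation in (-1, 1)
rho = from_unbounded(rho_tr, -1.0, 1.0)        # round trip recovers 0.3
```

An optimizer can then search freely over θᵗʳ ∈ ℝ and map each proposal back into (a, b), so that no proposed position ever leaves the valid domain.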
Note that the derivatives of these transformations will be employed in chapter 7 when performing parameter-space transformations using the change-of-variable rule we have seen in §3.4. The transformations presented here are not unique, as many other functions can be employed. For further details about parameter-space transformations, the reader is invited to refer to Gelman et al.⁷
⁷ Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (2014). Bayesian data analysis (3rd ed.). CRC Press.