5 Convex Optimization

Note: In the context of this book, we are only concerned by target functions f̃(θ) ∈ ℝ that are continuous and twice differentiable.
When building a model in the context of machine learning, we often seek optimal model parameters θ*, in the sense where they maximize the prior probability (or probability density) of predicting observed data. Here, we denote by f̃(θ) the target function we want to maximize. Optimal parameter values θ* are those that maximize the function f̃(θ),
Note: In order to be mathematically rigorous, equation 5.1 should employ the operator ∈ rather than = to recognize that θ* belongs to a set of possible solutions maximizing f̃(θ).
θ* = arg max_θ f̃(θ). (5.1)
Figure 5.1: Examples of a convex and a non-convex set. (a) Convex set; (b) Non-convex set.
[Figure: surfaces of f̃(θ) over (θ₁, θ₂). (a) f̃(θ): concave, −f̃(θ): convex; (b) f̃(θ): non-concave, −f̃(θ): non-convex.]
Figure 5.2: Representations of convex/concave and non-convex/non-concave functions.
With a small caveat that will be covered below, convex optimization methods can be employed for the maximization task in equation 5.1. The key aspect of convex optimization methods is that, under certain conditions, they are guaranteed to reach optimal values for convex functions. Figure 5.1 presents examples of convex and non-convex sets. For a set to be convex, you must be able to link any two points belonging to it without being outside of this set. Figure 5.1b presents a case where this property is not satisfied. For a convex function, the segment linking any pair of its points lies above or is equal to the function. Conversely, for a concave function, the opposite holds: the segment linking any pair of points lies below or is equal to the function. A concave function can be transformed into a convex one by taking the negative of it. Therefore, a maximization problem formulated as a concave optimization can be formulated in terms of a convex optimization following

θ* = arg max_θ f̃(θ) (concave optimization) ≡ arg min_θ −f̃(θ) (convex optimization).

In this chapter, we refer to convex optimization even if we are interested in maximizing a concave function, rather than minimizing a convex one. This choice is justified by the prevalence of convex optimization in the literature. Moreover, note that for several machine learning methods, we seek θ* based on a minimization problem
j.-a. goulet 48
where f̃(θ) is a function of the difference between observed values and those predicted by a model. Figure 5.2 presents examples of convex/concave and non-convex/non-concave functions. Non-convex/non-concave functions such as the one in figure 5.2b may have several local optima. Many functions of practical interest are non-convex/non-concave. As we will see in this chapter, convex optimization methods can also be employed for non-convex/non-concave functions given that we choose a proper starting location.
This chapter presents the gradient ascent and Newton-Raphson
methods, as well as practical tools to be employed with them.
For full-depth details regarding optimization methods, the reader
should refer to dedicated textbooks.
¹ Bertsekas, D. P., A. Nedić, and A. E. Ozdaglar (2003). Convex analysis and optimization. Athena Scientific; Chong, E. K. P. and S. H. Żak (2013). An introduction to optimization (4th ed.). Wiley; and Nocedal, J. and S. Wright (2006). Numerical optimization. Springer Science & Business Media.
Derivative: f̃′(θ) ≡ df̃(θ)/dθ

Gradient: ∇f̃(θ) ≡ ∇_θ f̃(θ) = [∂f̃(θ)/∂θ₁ ∂f̃(θ)/∂θ₂ ··· ∂f̃(θ)/∂θₙ]ᵀ

Maximum of a concave function: θ* = arg max_θ f̃(θ) : df̃(θ*)/dθ = 0
5.1 Gradient Ascent

A gradient is a vector containing the partial derivatives of a function with respect to its variables. For a continuous function, the maximum is located at the point where its gradient equals zero. Gradient ascent is based on the principle that as long as we move in the direction of the gradient, we are moving toward a maximum. For the unidimensional case, we choose to move to a new position θ_new defined as the old value θ_old plus a search direction d defined by a scaling factor λ times the derivative estimated at θ_old,

θ_new = θ_old + λ · f̃′(θ_old), with d = λ · f̃′(θ_old).

Note: λ is also known as the learning rate or step length.
A common practice for setting λ is to employ backtracking line search, where a new position is accepted if the Armijo rule² is satisfied so that

f̃(θ_new) ≥ f̃(θ_old) + c · d · f̃′(θ_old), with c ∈ (0, 1). (5.2)

² Armijo, L. (1966). Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics 16(1), 1–3.
[Figure: top panel, f̃(θ) with θ_old, θ_new, and the region of Armijo-inadmissible θ_new for the cases c = 0 and c = 1; bottom panel, f̃′(θ).]
Figure 5.3: Example of application of the Armijo rule to test if θ_new has sufficiently increased the objective function in comparison with θ_old.
Figure 5.3 presents a comparison of the application of equation 5.2 with the two extreme cases, c = 0 and c = 1. For c = 1, θ_new is only accepted if f̃(θ_new) lies above the plane defined by the tangent at θ_old. For c = 0, θ_new is only accepted if f̃(θ_new) > f̃(θ_old). The larger c is, the stricter is the Armijo rule for ensuring that sufficient progress is made by the current step. With backtracking line search, we start from an initial value λ₀ and reduce it until equation 5.2 is satisfied. Algorithm 1 presents a minimal version of the gradient ascent with backtracking line search.
probabilistic machine learning for civil engineers 49
Algorithm 1: Gradient ascent with backtracking line search
 1  initialize λ = λ₀, θ_old = θ₀; define ε, c, f̃(θ)
 2  while |f̃′(θ_old)| > ε do
 3      compute f̃(θ_old) (function value) and f̃′(θ_old) (1st derivative)
 4      compute θ_new = θ_old + λ·f̃′(θ_old), with d = λ·f̃′(θ_old)
 5      if f̃(θ_new) < f̃(θ_old) + c·d·f̃′(θ_old) then
 6          assign λ = λ/2 (backtracking)
 7          go to 4
 8      assign λ = λ₀, θ_old = θ_new
 9  θ* = θ_old
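As an illustration, Algorithm 1 can be sketched in Python as follows. The function names, the concave test function, and the default values for λ₀, c, and ε are assumptions made for this example, not part of the text.

```python
def gradient_ascent(f, df, theta0, lam0=1.0, c=0.5, eps=1e-6, max_iter=100):
    """Gradient ascent with backtracking line search (Algorithm 1)."""
    theta_old = theta0
    for _ in range(max_iter):
        slope = df(theta_old)                  # f~'(theta_old)
        if abs(slope) <= eps:                  # convergence: |f~'| <= eps
            break
        lam = lam0                             # reset the scaling factor
        for _ in range(50):                    # backtracking loop
            d = lam * slope                    # search direction
            theta_new = theta_old + d
            # Armijo rule (eq. 5.2): require a sufficient increase
            if f(theta_new) >= f(theta_old) + c * d * slope:
                break
            lam /= 2                           # halve lambda and retry
        theta_old = theta_new
    return theta_old

# Concave example: f~(theta) = -(theta - 3)^2 has its maximum at theta* = 3
theta_star = gradient_ascent(lambda t: -(t - 3.0)**2,
                             lambda t: -2.0 * (t - 3.0), theta0=0.0)
```

For this concave quadratic, the iterates reach θ* = 3 regardless of the starting location, which would not hold for a non-convex/non-concave f̃(θ).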
[Figure: two gradient-ascent iterations, each showing f̃(θ) and f̃′(θ).
Loop #1: θ_old = 3.50, λ₀ = 3; f̃(θ_old) = 0.21, f̃′(θ_old) = 0.32; θ_new = θ_old + λ₀·f̃′(θ_old) = 4.47, f̃(θ_new) = 0.82.
Loop #2: θ_old = 4.47, λ₀ = 3; f̃(θ_old) = 0.82, f̃′(θ_old) = 0.72; θ_new = θ_old + λ₀·f̃′(θ_old) = 6.64, f̃(θ_new) = −0.03; with λ₀/2: θ_new = 5.55, f̃(θ_new) = 0.70; with λ₀/4: θ_new = 5.01, f̃(θ_new) = 1.01.]
Figure 5.4: Example of application of gradient ascent with backtracking for finding the maximum of a function.
Figure 5.4 presents the first two steps of the application of algorithm 1 to a non-convex/non-concave function with an initial value θ₀ = 3.5 and a scaling factor λ₀ = 3. For the second step, the scaling factor λ has to be reduced twice in order to satisfy the Armijo rule. One of the difficulties with gradient ascent is that the convergence speed depends on the choice of λ₀. If λ₀ is too small, several steps will be wasted and convergence will be slow. If λ₀ is too large, the algorithm may not converge.
[Figure: f̃(θ) and f̃′(θ) with gradient-ascent iterates converging to a local maximum.]
Figure 5.5: Example of application of gradient ascent converging to a local maximum for a function.
Figure 5.5 presents a limitation common to all convex optimization methods when applied to functions involving local maxima; if the starting location θ₀ is not located on the slope segment leading to the global maximum, the algorithm will most likely miss it and converge to a local maximum. The task of selecting a proper value θ₀ is nontrivial because in most cases, it is not possible to visualize f̃(θ). This issue can be tackled by attempting multiple starting locations θ₀ and by using domain knowledge to identify proper starting locations.
Figure 5.6: Comparison of gradient ascent with and without momentum.

Gradient ascent can be applied to search for the maximum of a multivariate function by replacing the univariate derivative by the gradient so that

θ_new = θ_old + λ · ∇f̃(θ_old).
As illustrated in figure 5.6, because gradient ascent follows the direction where the gradient is maximal, it often displays an oscillatory pattern. This issue can be mitigated by introducing a momentum term in the calculation of θ_new,³

v_new = γ · v_old + λ · ∇f̃(θ_old),
θ_new = θ_old + v_new,

where v can be interpreted as a velocity that carries the momentum from the previous iterations, scaled by a momentum coefficient γ.

³ Rumelhart, D. E., G. E. Hinton, and R. J. Williams (1986). Learning representations by back-propagating errors. Nature 323, 533–536.
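A minimal sketch of this momentum update follows; the elongated concave objective, the values of λ and γ, and the function names are illustrative assumptions, not from the text.

```python
def gradient_ascent_momentum(grad, theta0, lam=0.05, gamma=0.8, n_iter=200):
    """Gradient ascent where a velocity v carries momentum across
    iterations: v_new = gamma*v_old + lam*grad(theta_old)."""
    theta = list(theta0)
    v = [0.0] * len(theta)                     # initial velocity is zero
    for _ in range(n_iter):
        g = grad(theta)
        v = [gamma * vi + lam * gi for vi, gi in zip(v, g)]
        theta = [ti + vi for ti, vi in zip(theta, v)]
    return theta

# Elongated concave bowl f~(theta) = -(theta_1^2 + 10*theta_2^2)/2 with
# maximum at the origin; the gradient is steep along theta_2, where plain
# gradient ascent oscillates and the velocity term damps the oscillation.
theta_hat = gradient_ascent_momentum(lambda th: [-th[0], -10.0 * th[1]],
                                     [4.0, 1.0])
```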
5.2 Newton-Raphson

The Newton-Raphson method allows us to adaptively scale the search direction vector using the second-order derivative f̃″(θ).

Second-order derivatives: f̃″(θ) ≡ d²f̃(θ)/dθ², f̃ᵢ″(θ) ≡ ∂²f̃(θ)/∂θᵢ²

Hessian: [H[f̃(θ)]]ᵢⱼ = ∂²f̃(θ)/(∂θᵢ∂θⱼ)
Knowing that the maximum of a function corresponds to the point where the gradient is zero, f̃′(θ*) = 0, we can find this maximum by formulating a linearized gradient equation using the second-order derivative of f̃(θ) and then setting it equal to zero. The analytic formulation for the linearized gradient function (see §3.4.2) approximated at the current location θ_old is

f̃′(θ) ≈ f̃″(θ_old) · (θ − θ_old) + f̃′(θ_old). (5.3)

We can estimate θ_new by setting equation 5.3 equal to zero, and then by solving for θ, we obtain

θ_new = θ_old − f̃′(θ_old)/f̃″(θ_old). (5.4)
[Figure: f̃(θ), f̃′(θ), and f̃″(θ) for a quadratic function.]
Figure 5.7: Example of application of Newton-Raphson to a quadratic function.
Let us consider the case where we want to find the maximum of a quadratic function (i.e., ∝ x²), as illustrated in figure 5.7. In the case of a quadratic function, the algorithm converges to the exact solution in one iteration, no matter the starting point, because the gradient of a quadratic function is exactly described by the linear function in equation 5.3.
Algorithm 2 presents a minimal version of the Newton-Raphson method with backtracking line search. Note that at line 6, there is again a scaling factor λ, which is employed because the Newton-Raphson method is exact only for quadratic functions. For more general non-convex/non-concave functions, the linearized gradient is an approximation, such that a value of λ = 1 will not always lead to a θ_new satisfying the Armijo rule in equation 5.2.
Figure 5.8 presents the application of algorithm 2 to a non-convex/non-concave function with an initial value θ₀ = 3.5 and a scaling factor λ₀ = 1. For each loop, the pink solid line represents the linearized gradient function formulated in equation 5.3. Notice how, for the first two iterations, the second derivative f̃″(θ) > 0. Having a positive second derivative indicates that the linearization of f̃′(θ) equals zero for a minimum rather than for a maximum. One simple option in this situation is to define λ = −λ in order to ensure that the next step moves in the same direction as the gradient. The convergence with Newton-Raphson is typically faster than with gradient ascent.
Algorithm 2: Newton-Raphson with backtracking line search
 1  initialize λ = λ₀ = 1, θ_old = θ₀; define ε, c, f̃(θ)
 2  while |f̃′(θ_old)| > ε do
 3      compute: θ_old → f̃(θ_old) (function evaluation), f̃′(θ_old) (first derivative), f̃″(θ_old) (second derivative)
 4      if f̃″(θ_old) > 0 then
 5          λ = −λ
 6      compute θ_new = θ_old − λ·f̃′(θ_old)/f̃″(θ_old), with d = −λ·f̃′(θ_old)/f̃″(θ_old)
 7      if f̃(θ_new) < f̃(θ_old) + c·d·f̃′(θ_old) then
 8          assign λ = λ/2 (backtracking)
 9          go to 6
10      assign λ = λ₀, θ_old = θ_new
11  θ* = θ_old
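Algorithm 2 can be sketched in Python as follows. The direction reversal for f̃″ > 0 and the Armijo test mirror the algorithm's lines 4-9; the trigonometric test function and the default values of c and ε are assumptions made for this example.

```python
import math

def newton_raphson(f, df, d2f, theta0, lam0=1.0, c=1e-4, eps=1e-6,
                   max_iter=100):
    """1-D Newton-Raphson with backtracking line search (Algorithm 2)."""
    theta_old = theta0
    for _ in range(max_iter):
        g = df(theta_old)                      # f~'(theta_old)
        if abs(g) <= eps:
            break
        h = d2f(theta_old)                     # f~''(theta_old)
        lam = lam0 if h < 0 else -lam0         # f~'' > 0: reverse direction
        for _ in range(50):                    # backtracking loop
            d = -lam * g / h                   # Newton search direction
            theta_new = theta_old + d
            if f(theta_new) >= f(theta_old) + c * d * g:   # Armijo rule
                break
            lam /= 2
        theta_old = theta_new
    return theta_old

# The maximum of f~(theta) = cos(theta) closest to theta_0 = 0.5 is theta* = 0
theta_star = newton_raphson(math.cos,
                            lambda t: -math.sin(t),
                            lambda t: -math.cos(t), theta0=0.5)
```

On a concave quadratic, the same routine converges in a single iteration, consistent with the discussion surrounding figure 5.7.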
[Figure: three Newton-Raphson iterations, each showing f̃(θ), f̃′(θ), and f̃″(θ).
Loop #1: θ_old = 3.5, λ = 1; f̃(θ_old) = 0.21, f̃′(θ_old) = 0.32, f̃″(θ_old) = 0.64 > 0 → λ = −λ; θ_new = θ_old − λ·f̃′(θ_old)/f̃″(θ_old) = 4.01, f̃(θ_new) = 0.46.
Loop #2: θ_old = 4.01, λ = 1; f̃(θ_old) = 0.46, f̃′(θ_old) = 0.70, f̃″(θ_old) = 0.64 > 0 → λ = −λ; θ_new = 5.11, f̃(θ_new) = 0.99.
Loop #3: θ_old = 5.11, λ = 1; f̃(θ_old) = 0.99, f̃′(θ_old) = 0.31, f̃″(θ_old) = −1.94 < 0 → OK; θ_new = 4.95, f̃(θ_new) = 1.02.]
Figure 5.8: Example of application of the Newton-Raphson algorithm with backtracking line search for finding the maximum of a function.
The Newton-Raphson algorithm can be employed for identifying the optimal values θ* = [θ₁* θ₂* ··· θₙ*]ᵀ in domains having multiple dimensions. The equation 5.4 developed for univariate cases can be extended for n-dimensional domains by following

θ_new = θ_old − H[f̃(θ_old)]⁻¹ · ∇f̃(θ_old). (5.5)
H[f̃(θ)] denotes the n × n Hessian matrix containing the second-order partial derivatives for the function f̃(θ) evaluated at θ. The Hessian is a symmetric matrix where each term is defined by

[H[f̃(θ)]]ᵢⱼ = ∂²f̃(θ)/(∂θᵢ∂θⱼ) = ∂/∂θᵢ (∂f̃(θ)/∂θⱼ) = ∂/∂θⱼ (∂f̃(θ)/∂θᵢ).
[Figure: surface of f̃(θ) over (θ₁, θ₂) exhibiting a saddle point.]
Figure 5.9: Example of saddle point where ∂²f̃(θ)/∂θ₁² < 0 and ∂²f̃(θ)/∂θ₂² > 0.
When the terms on the main diagonal of the Hessian matrix are positive, it is an indication that our linearized gradient points toward a minimum rather than a maximum. As we did in algorithm 2, it is then possible to move toward the maximum by reversing the search direction. One issue with gradient-based multi-dimensional optimization is saddle points. Figure 5.9 presents an example of a saddle point in a function for which one second-order partial derivative is negative, ∂²f̃(θ)/∂θ₁² < 0, and the other is positive, ∂²f̃(θ)/∂θ₂² > 0.
In such a case, one option is to regularize the Hessian matrix by subtracting a constant τ on its main diagonal in order to ensure that all terms are negative:

H̃[f̃(θ)] = H[f̃(θ)] − τI.

Avoiding being trapped in saddle points is a key challenge in optimization. The reader interested in more advanced strategies for that purpose should consult specialized literature.⁴
⁴ Nocedal, J. and S. Wright (2006). Numerical optimization. Springer Science & Business Media; Dauphin, Y. N., R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems 27, 2933–2941; and Goodfellow, I., Y. Bengio, and A. Courville (2016). Deep learning. MIT Press.
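For intuition, a single multivariate Newton-Raphson step (equation 5.5) with a diagonal regularization constant (here named `tau`) can be sketched for n = 2, using the closed-form inverse of a 2×2 matrix; the function names and the quadratic test case are assumptions of this example.

```python
def newton_step_2d(grad, hess, theta, tau=0.0):
    """One Newton-Raphson step theta_new = theta_old - H~^{-1} grad(theta_old),
    where H~ = H - tau*I subtracts tau on the Hessian's main diagonal."""
    g = grad(theta)
    (h11, h12), (h21, h22) = hess(theta)       # 2x2 symmetric Hessian
    h11 -= tau                                 # regularize the diagonal
    h22 -= tau
    det = h11 * h22 - h12 * h21                # closed-form 2x2 inverse
    d1 = ( h22 * g[0] - h12 * g[1]) / det      # H~^{-1} * grad, row 1
    d2 = (-h21 * g[0] + h11 * g[1]) / det      # H~^{-1} * grad, row 2
    return [theta[0] - d1, theta[1] - d2]

# Concave quadratic f~ = -(theta_1 - 1)^2 - 2*(theta_2 + 2)^2:
# a single step from any start lands on the maximum (1, -2).
theta_new = newton_step_2d(lambda th: [-2.0 * (th[0] - 1.0),
                                       -4.0 * (th[1] + 2.0)],
                           lambda th: [[-2.0, 0.0], [0.0, -4.0]],
                           [0.0, 0.0])
```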
In practice, second-order methods such as Newton-Raphson can be employed when we have access to an analytical formulation for evaluating the Hessian, H[f̃(θ)]. Otherwise, if we have to employ numerical derivatives (see §5.4) to estimate the Hessian, the computational demand quickly becomes prohibitive as the number of variables increases. For example, if the number of variables is n = 100, by considering the symmetry in the Hessian matrix, there are 100 + ½(100 · 100 − 100) = 5050 second-order partial derivatives to estimate for each Newton-Raphson iteration. Even for computationally efficient functions f̃(θ), the necessary number of evaluations becomes a challenge. In addition, for a large n, there is a substantial effort required to invert the Hessian in equation 5.5. Therefore, even if Newton-Raphson is more efficient in terms of the number of iterations required to reach convergence, this does not take into account the time necessary to obtain the second-order derivatives. When dealing with a large number of variables or with functions f̃(θ) that are computationally expensive, we may revert to using momentum-based gradient-ascent methods.
5.3 Coordinate Ascent

One alternative to the methods we have seen for multivariate optimization is to perform the search for each variable separately, which corresponds to solving a succession of 1-D optimization problems. This approach is known as coordinate optimization,⁵ and it is known to perform well when there is no strong coupling between parameters with respect to the values of the objective function. Algorithm 3 presents a minimal version of the coordinate ascent Newton-Raphson using backtracking line search. Figure 5.10 presents two intermediate steps illustrating the application of algorithm 3 to a bidimensional non-convex/non-concave function.

⁵ Wright, S. J. (2015). Coordinate descent algorithms. Mathematical Programming 151(1), 3–34; and Friedman, J., T. Hastie, H. Höfling, and R. Tibshirani (2007). Pathwise coordinate optimization. The Annals of Applied Statistics 1(2), 302–332.
Algorithm 3: Coordinate ascent using Newton-Raphson
 1  initialize λ = λ₀ = 1, θ_old = θ₀ = [θ₁ θ₂ ··· θₙ]ᵀ
 2  define ε, c, f̃(θ)
 3  while ||∇f̃(θ_old)|| > ε do
 4      for i ∈ {1 : n} do
 5          compute f̃(θ_old) (function evaluation), f̃ᵢ′(θ_old) (i-th 1st partial derivative), f̃ᵢ″(θ_old) (i-th 2nd partial derivative)
 6          if f̃ᵢ″(θ_old) > 0 then
 7              λ = −λ
 8          compute [θ_new]ᵢ = [θ_old]ᵢ − λ·f̃ᵢ′(θ_old)/f̃ᵢ″(θ_old), with d = −λ·f̃ᵢ′(θ_old)/f̃ᵢ″(θ_old)
 9          if f̃(θ_new) < f̃(θ_old) + c·d·f̃ᵢ′(θ_old) then
10              assign λ = λ/2 (backtracking)
11              go to 8
12          assign λ = λ₀, θ_old = θ_new
13  θ* = θ_new
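Algorithm 3 can be sketched by reusing the 1-D Newton update for one coordinate at a time. The lists `dfs` and `d2fs` holding the partial-derivative callables, as well as the separable test function, are assumptions made for this example.

```python
def coordinate_ascent(f, dfs, d2fs, theta0, lam0=1.0, c=1e-4, eps=1e-6,
                      max_sweeps=100):
    """Coordinate ascent using Newton-Raphson (Algorithm 3).
    dfs[i] / d2fs[i]: i-th first / second partial derivative of f~."""
    theta = list(theta0)
    for _ in range(max_sweeps):
        grad_norm = sum(df(theta)**2 for df in dfs) ** 0.5
        if grad_norm <= eps:                   # ||grad f~(theta)|| <= eps
            break
        for i in range(len(theta)):            # one 1-D problem per variable
            g, h = dfs[i](theta), d2fs[i](theta)
            lam = lam0 if h < 0 else -lam0     # move toward a maximum
            for _ in range(50):                # backtracking loop
                d = -lam * g / h               # Newton step on coordinate i
                cand = list(theta)
                cand[i] += d
                if f(cand) >= f(theta) + c * d * g:   # Armijo rule
                    break
                lam /= 2
            theta = cand
    return theta

# Separable concave function with its maximum at (1, 2)
f = lambda th: -(th[0] - 1.0)**2 - 2.0 * (th[1] - 2.0)**2
theta_star = coordinate_ascent(f,
                               [lambda th: -2.0 * (th[0] - 1.0),
                                lambda th: -4.0 * (th[1] - 2.0)],
                               [lambda th: -2.0, lambda th: -4.0],
                               [0.0, 0.0])
```

Because this test function has no coupling between θ₁ and θ₂, a single sweep reaches the maximum, which matches the remark above about coordinate optimization performing well for weakly coupled parameters.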
[Figure: two coordinate-ascent updates on a 2-D function, each shown on contour plots over (θ₁, θ₂).
Loop #5 → [θ]₁: θ_old = [0.41 0.31]ᵀ, λ = 1; f̃(θ_old) = 3.19, f̃₁′(θ_old) = 0.72, f̃₁″(θ_old) = −0.41 < 0 → OK; [θ_new]₁ = [θ_old]₁ − f̃₁′(θ)/f̃₁″(θ) = 2.13; θ_new = [2.13 0.31]ᵀ, f̃(θ_new) = 3.01.
Loop #6 → [θ]₂: θ_old = [2.13 0.31]ᵀ, λ = 1; f̃(θ_old) = 3.01, f̃₂′(θ_old) = 0.93, f̃₂″(θ_old) = −0.77 < 0 → OK; [θ_new]₂ = [θ_old]₂ − f̃₂′(θ)/f̃₂″(θ) = 0.90; θ_new = [2.13 0.90]ᵀ, f̃(θ_new) = 2.52.]
Figure 5.10: Example of application of the coordinate ascent Newton-Raphson algorithm with backtracking for finding the maximum of a 2-D function.
5.4 Numerical Derivatives

One aspect that was omitted in the previous sections is How do we obtain the first- and second-order derivatives? When a twice-differentiable formulation for f̃(θ) exists, ∂f̃(θ)/∂θᵢ and ∂²f̃(θ)/∂θᵢ² can be expressed analytically. When analytic formulations are not available, derivatives can be estimated numerically using either a forward, backward, or central differentiation scheme.⁶ Here, we only focus on the central differentiation method. Note that forward and backward differentiations are not as accurate as central, yet they are computationally cheaper. As illustrated in figure 5.11, first- and second-order partial derivatives of f̃(θ) with respect to the i-th element of a vector θ = [θ₁ θ₂ ··· θₙ]ᵀ are given by

Figure 5.11: Illustration of 1-D numerical derivatives.

f̃ᵢ′(θ) = ∂f̃(θ)/∂θᵢ ≈ [f̃(θ + δI(i)) − f̃(θ − δI(i))] / (2δ)

f̃ᵢ″(θ) = ∂²f̃(θ)/∂θᵢ² ≈ [f̃(θ + δI(i)) − 2f̃(θ) + f̃(θ − δI(i))] / δ²,

⁶ Nocedal, J. and S. Wright (2006). Numerical optimization. Springer Science & Business Media; and Abramowitz, M. and I. A. Stegun (1972). Handbook of mathematical functions with formulas, graphs, and mathematical tables. National Bureau of Standards, Applied Mathematics.
where δ is a small perturbation to the value θᵢ and I(i) is an n × 1 indicator vector, for which all values are equal to 0, except the i-th, which is equal to one.

Examples: I(3) = [0 0 1 0 ··· 0]ᵀ, I(1) = [1 0 0 0 ··· 0]ᵀ
Figure 5.12: Illustration of 2-D partial numerical derivatives.

As illustrated in figure 5.12, numerical derivatives can also be employed to estimate each term of the Hessian matrix,

[H[f̃(θ)]]ᵢⱼ ≈ [ ∂f̃(θ + δᵢI(i))/∂θⱼ − ∂f̃(θ − δᵢI(i))/∂θⱼ ] / (2δᵢ),

where the terms on the numerator are defined as

∂f̃(θ + δᵢI(i))/∂θⱼ = [ f̃(θ + I(i)δᵢ + I(j)δⱼ) − f̃(θ + I(i)δᵢ − I(j)δⱼ) ] / (2δⱼ)

∂f̃(θ − δᵢI(i))/∂θⱼ = [ f̃(θ − I(i)δᵢ + I(j)δⱼ) − f̃(θ − I(i)δᵢ − I(j)δⱼ) ] / (2δⱼ).
In practice, the full Hessian matrix can only be estimated numerically when the number of variables n is small or when evaluating f̃(θ) is computationally cheap.
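The central-difference formulas above can be sketched as follows; the perturbation sizes and the polynomial test function are illustrative assumptions.

```python
def central_first(f, theta, i, delta=1e-5):
    """Central-difference estimate of the i-th first partial derivative."""
    tp, tm = list(theta), list(theta)
    tp[i] += delta                             # theta + delta*I(i)
    tm[i] -= delta                             # theta - delta*I(i)
    return (f(tp) - f(tm)) / (2 * delta)

def central_second(f, theta, i, delta=1e-4):
    """Central-difference estimate of the i-th second partial derivative."""
    tp, tm = list(theta), list(theta)
    tp[i] += delta
    tm[i] -= delta
    return (f(tp) - 2 * f(theta) + f(tm)) / delta**2

def hessian_ij(f, theta, i, j, delta=1e-4):
    """Central-difference estimate of the (i, j) term of the Hessian."""
    def dfj(th):                               # first partial w.r.t. theta_j
        tp, tm = list(th), list(th)
        tp[j] += delta
        tm[j] -= delta
        return (f(tp) - f(tm)) / (2 * delta)
    tp, tm = list(theta), list(theta)
    tp[i] += delta                             # perturb along theta_i ...
    tm[i] -= delta
    return (dfj(tp) - dfj(tm)) / (2 * delta)   # ... and difference dfj

# f~(theta) = theta_1^2 * theta_2 at theta = (1, 2):
# the exact values are df/dtheta_1 = 4, d2f/dtheta_1^2 = 4,
# and d2f/(dtheta_1 dtheta_2) = 2.
f = lambda th: th[0]**2 * th[1]
```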
5.5 Parameter-Space Transformation

When optimizing using either the gradient ascent or the Newton-Raphson method, we are likely to run into difficulties for parameters θ that are not defined in an unbounded space. In such a case, the efficiency is hindered because the algorithms may propose new positions θ_new that lie outside the valid domain. When trying to identify optimal parameters for probability distributions such as those described in chapter 4, common domains for parameters are as follows:

Mean parameters: μ ∈ ℝ
Standard deviations: σ ∈ ℝ⁺
Correlation coefficients: ρ ∈ (−1, 1)
Probability: Pr(X = x) ∈ (0, 1)

One solution to this problem is to perform the optimization in a transformed space θᵗʳ such that

θᵗʳ = g(θ) ∈ ℝ.

For each θ, the choice of transformation function g(θ) depends on its domain. Figure 5.13 presents examples of transformations for θ ∈ ℝ, θ ∈ ℝ⁺, and θ ∈ (a, b). Note that in the simplest case, where θ ∈ ℝ, no transformation is required, so θᵗʳ = θ.
(a) θᵗʳ = θ; (b) θᵗʳ = ln(θ); (c) θᵗʳ = −ln((b − a)/(θ − a) − 1)
Figure 5.13: Examples of transformation functions.
For θ ∈ ℝ⁺, a common transformation is to take the logarithm θᵗʳ = ln(θ), and its inverse transformation is θ = exp(θᵗʳ). The analytical derivatives for the transformation and its inverse are

dθᵗʳ/dθ = 1/θ,  dθ/dθᵗʳ = exp(θᵗʳ).

For parameters bounded in an interval θ ∈ (a, b), a possible transformation is the scaled logistic sigmoid function

θᵗʳ = −ln( (b − a)/(θ − a) − 1 ),

and its inverse is given by

θ = (b − a)/(1 + exp(−θᵗʳ)) + a.

The derivatives of the transformation and its inverse are

dθᵗʳ/dθ = (a − b) / ((θ − a)(θ − b)),  dθ/dθᵗʳ = (b − a) · exp(−θᵗʳ) / (1 + exp(−θᵗʳ))².
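These two bounded-domain transformations can be sketched as follows; the function names are assumptions of the example.

```python
import math

def to_unbounded(theta, a, b):
    """theta in (a, b) -> theta_tr in R, via the scaled logistic sigmoid."""
    return -math.log((b - a) / (theta - a) - 1)

def from_unbounded(theta_tr, a, b):
    """Inverse transformation: theta_tr in R -> theta in (a, b)."""
    return (b - a) / (1 + math.exp(-theta_tr)) + a

# For sigma in R+, the pair is simply math.log / math.exp.
rho_tr = to_unbounded(0.3, -1.0, 1.0)          # correlation in (-1, 1)
rho = from_unbounded(rho_tr, -1.0, 1.0)        # round trip recovers 0.3
```

An optimizer can then search freely over θᵗʳ ∈ ℝ and map each proposal back into (a, b), so that no proposed position ever leaves the valid domain.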
Note that the derivatives of these transformations will be employed in chapter 7 when performing parameter-space transformations using the change-of-variable rule we have seen in §3.4. The transformations presented here are not unique, as many other functions can be employed. For further details about parameter-space transformations, the reader is invited to refer to Gelman et al.⁷
⁷ Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (2014). Bayesian data analysis (3rd ed.). CRC Press.