In such a case, one option is to regularize the Hessian matrix by subtracting a constant $\alpha$ from its main diagonal in order to ensure that all terms are negative:

$$\tilde{\mathbf{H}}[\tilde{f}(\boldsymbol{\theta})] = \mathbf{H}[\tilde{f}(\boldsymbol{\theta})] - \alpha\mathbf{I}.$$
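As an illustration only (not taken from the text), the following NumPy sketch applies this regularization to a hypothetical Hessian with mixed-sign eigenvalues; here $\alpha$ is chosen just above the largest eigenvalue, one simple way to guarantee that the regularized matrix becomes negative definite.

```python
import numpy as np

# Hypothetical Hessian at a saddle point: its eigenvalues have mixed signs
H = np.array([[2.0, 0.5],
              [0.5, -1.0]])

# Choose alpha slightly larger than the largest eigenvalue so that every
# eigenvalue of the regularized Hessian is strictly negative
alpha = np.max(np.linalg.eigvalsh(H)) + 1e-3

H_reg = H - alpha * np.eye(H.shape[0])  # H_tilde = H - alpha * I

print(np.linalg.eigvalsh(H))      # approx. [-1.08, 2.08]: mixed signs (saddle)
print(np.linalg.eigvalsh(H_reg))  # all negative: usable for an ascent step
```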
Avoiding being trapped in saddle points is a key challenge in optimization. The reader interested in more advanced strategies for that purpose should consult specialized literature.⁴
⁴ Nocedal, J. and S. Wright (2006). Numerical optimization. Springer Science & Business Media; Dauphin, Y. N., R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, 27, 2933–2941; and Goodfellow, I., Y. Bengio, and A. Courville (2016). Deep learning. MIT Press.
In practice, second-order methods such as Newton-Raphson can be employed when we have access to an analytical formulation for evaluating the Hessian, $\mathbf{H}[\tilde{f}(\boldsymbol{\theta})]$. Otherwise, if we have to employ numerical derivatives (see §5.4) to estimate the Hessian, the computational demand quickly becomes prohibitive as the number of variables increases. For example, if the number of variables is $n = 100$, by considering the symmetry of the Hessian matrix, there are $100 + \frac{1}{2}(100 \times 100 - 100) = 5050$ second-order partial derivatives to estimate for each Newton-Raphson iteration. Even for computationally efficient functions $\tilde{f}(\boldsymbol{\theta})$, the necessary number of evaluations becomes a challenge. In addition, for a large $n$, there is a substantial effort required to invert the Hessian in equation 5.5. Therefore, even if Newton-Raphson is more efficient in terms of the number of iterations required to reach convergence, this does not take into account the time necessary to obtain the second-order derivatives. When dealing with a large number of variables or for functions $\tilde{f}(\boldsymbol{\theta})$ that are computationally expensive, we may revert to using momentum-based gradient-ascent methods.
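To make the operation count concrete, the short sketch below (not from the text) reproduces this arithmetic for a general $n$ and, under the assumption of standard central-difference stencils, gives a rough count of the evaluations of $\tilde{f}$ needed per iteration.

```python
def num_second_derivatives(n):
    """Unique second-order partial derivatives of an n-variable function,
    accounting for the symmetry of the Hessian: n diagonal terms plus
    (n^2 - n)/2 off-diagonal terms."""
    return n + (n * n - n) // 2

print(num_second_derivatives(100))  # 5050, as in the text

# Assuming each mixed derivative uses a 4-point central difference and each
# diagonal term a 3-point stencil, the number of evaluations of f per
# Newton-Raphson iteration grows roughly as O(n^2):
n = 100
approx_evals = 3 * n + 4 * (n * n - n) // 2
print(approx_evals)  # about 20,100 evaluations of f for n = 100
```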
5.3 Coordinate Ascent
One alternative to the methods we have seen for multivariate optimization is to perform the search for each variable separately, which corresponds to solving a succession of 1D optimization problems. This approach is known as coordinate optimization.⁵
⁵ Wright, S. J. (2015). Coordinate descent algorithms. Mathematical Programming 151(1), 3–34; and Friedman, J., T. Hastie, H. Höfling, and R. Tibshirani (2007). Pathwise coordinate optimization. The Annals of Applied Statistics 1(2), 302–332.
This approach is known to perform well when there is no strong coupling between parameters with respect to the values of the objective function. Algorithm 3 presents a minimal version of the coordinate ascent Newton-Raphson using backtracking line search. Figure 5.10 presents two intermediate steps illustrating the application of algorithm 3 to a bidimensional non-convex/non-concave function.
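For illustration, here is a minimal Python sketch in the spirit of a coordinate ascent Newton-Raphson with backtracking line search; it is not the book's algorithm 3, and it assumes finite-difference estimates of the 1D derivatives and a hypothetical test function.

```python
import numpy as np

def coordinate_ascent(f, theta0, n_iter=50, h=1e-5, tol=1e-6):
    """Maximize f by cycling through coordinates; each coordinate is updated
    with a 1D Newton-Raphson step, damped by backtracking line search."""
    theta = np.array(theta0, dtype=float)
    for _ in range(n_iter):
        theta_old = theta.copy()
        for i in range(theta.size):
            def f1d(x):                 # f restricted to coordinate i
                t = theta.copy()
                t[i] = x
                return f(t)
            x = theta[i]
            g = (f1d(x + h) - f1d(x - h)) / (2 * h)              # 1D gradient
            H = (f1d(x + h) - 2 * f1d(x) + f1d(x - h)) / h**2    # 1D curvature
            d = -g / H if H < 0 else g  # Newton step if concave, else gradient
            fx = f1d(x)
            lam = 1.0
            while lam > 1e-8 and f1d(x + lam * d) <= fx:  # backtracking
                lam *= 0.5
            theta[i] = x + lam * d
        if np.linalg.norm(theta - theta_old) < tol:
            break
    return theta

# Hypothetical example: concave quadratic with mild coupling, maximum at (1, 2)
f = lambda t: -(t[0] - 1)**2 - 2 * (t[1] - 2)**2 - 0.5 * (t[0] - 1) * (t[1] - 2)
print(coordinate_ascent(f, [0.0, 0.0]))  # approximately [1., 2.]
```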