Computing a descent direction
Given a starting point $x_0$ and a tolerance $\varepsilon > 0$:
for $k = 0, 1, 2, \dots$
choose a direction $d_k$ such that $f(x_k+\alpha d_k)< f(x_k)$ for some $\alpha > 0$
line search: choose a step size $\alpha_k > 0$
update: $x_{k+1}=x_k+\alpha_k d_k$
check convergence: exit if $\lVert \nabla f(x_{k+1})\rVert < \varepsilon \cdot \lVert \nabla f(x_k)\rVert$ (e.g., $\varepsilon=10^{-6}$)
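The loop above can be sketched in code. This is a minimal illustration, not a robust implementation: it uses steepest descent ($d_k = -\nabla f(x_k)$) as the direction choice, a crude halving line search, and the "perfect" quadratic $f(x)=\tfrac{1}{2}\lVert x\rVert^2$ as an assumed test function.

```python
import numpy as np

# Illustrative objective: the "perfect" quadratic f(x) = 1/2 ||x||^2.
def f(x):
    return 0.5 * x @ x

def grad(x):
    return x

def descent(x0, eps=1e-6, max_iter=100):
    x = x0.astype(float)
    for k in range(max_iter):
        g = grad(x)
        d = -g                                  # descent direction d_k
        alpha = 1.0                             # line search: halve until f decreases
        while f(x + alpha * d) >= f(x) and alpha > 1e-12:
            alpha *= 0.5
        x_new = x + alpha * d                   # x_{k+1} = x_k + alpha_k d_k
        # convergence: ||grad f(x_{k+1})|| < eps * ||grad f(x_k)||
        if np.linalg.norm(grad(x_new)) < eps * np.linalg.norm(g):
            return x_new, k + 1
        x = x_new
    return x, max_iter

x_star, iters = descent(np.array([3.0, -4.0]))
```

On this quadratic the full step $\alpha = 1$ lands exactly at the minimizer $x = 0$, so the loop exits after one iteration.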
Scaled Descent
$(P)\quad \min_{x \in \mathbb{R}^n} f(x)$, where $f:\mathbb{R}^n \to \mathbb{R}$.
Make a change of variables:
$x := Sy$, i.e., $y = S^{-1}x$,
where $S$ is a non-singular square matrix.
Ex: a diagonal scaling matrix
$$S = \begin{bmatrix} s_1 & & & &\\ &s_2&&&\\ & &s_3&& \\ & & & \ddots & \\ &&&&s_n \end{bmatrix}$$
Non-singularity means $Sw=0 \Rightarrow w=0$, and likewise $S^Tv = 0 \Rightarrow v=0$.
With this diagonal $S$,
$$x=\begin{bmatrix}s_1 & & \\ & \ddots & \\&&s_n\end{bmatrix}\begin{bmatrix}y_1 \\ \vdots \\ y_n\end{bmatrix}, \qquad x_i=s_iy_i.$$
$(P_{scaled})\quad \min_{y\in \mathbb{R}^n} g(y):=f(Sy)$
Apply gradient descent to $(P_{scaled})$:
$$y_{k+1}=y_k-\alpha_k \nabla g(y_k)=y_k-\alpha_kS^T\nabla f(Sy_k)$$
since
$$\nabla g(y) = \nabla_y f(Sy) = S^T\nabla f(Sy)$$
by the chain rule (cf. $\nabla_x (a^Tx)=a$).
“Move” to the “x-space” by pre-multiplying by $S$:
$$Sy_{k+1}=Sy_k-\alpha_kSS^T\nabla f(Sy_k)$$
$$x_{k+1}=x_k-\alpha_kSS^T\nabla f(x_k)$$
Let $D:=SS^T$. The iteration moves along $-D\nabla f(x)$ from $x$.
Is $d = -D\nabla f(x)$ a descent direction? Yes:
$$\begin{aligned}0> \nabla f(x)^Td & =\nabla f(x)^T(-D\nabla f(x)) \\ &= -\nabla f(x)^TD\nabla f(x) \\ &=-\nabla f(x)^TSS^T\nabla f(x) \\ &= -(S^T\nabla f(x))^T(S^T \nabla f(x))\\ &=-\lVert S^T\nabla f(x)\rVert^2\end{aligned}$$
If $S = I$, then $SS^T=I$ and scaled descent coincides with unscaled gradient descent.
In general, $-D\nabla f(x)$ is a descent direction whenever $\nabla f(x)\ne 0$ (since $S^T$ is non-singular, $S^T\nabla f(x)\ne 0$).
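A quick numerical sketch of the two facts above, on an assumed ill-conditioned quadratic $f(x)=\tfrac{1}{2}x^TAx$ with an illustrative choice of $S$: the scaled direction $d=-D\nabla f(x)$ satisfies $\nabla f(x)^Td = -\lVert S^T\nabla f(x)\rVert^2 < 0$, and picking $S$ so that $D = SS^T = A^{-1}$ makes one unit step solve the problem exactly.

```python
import numpy as np

# Illustrative quadratic f(x) = 1/2 x^T A x with cond(A) = 100.
A = np.diag([1.0, 100.0])
S = np.diag([1.0, 0.1])          # chosen so that S S^T = diag(1, 0.01) = A^{-1}
D = S @ S.T                      # D := S S^T

def grad_f(x):
    return A @ x

x = np.array([2.0, 2.0])
d = -D @ grad_f(x)               # scaled descent direction

# descent-direction check: grad f(x)^T d = -||S^T grad f(x)||^2 < 0
slope = grad_f(x) @ d
rhs = -float(np.linalg.norm(S.T @ grad_f(x)) ** 2)

# with D = A^{-1}, one step with alpha = 1 lands at the minimizer x = 0
x_next = x + 1.0 * d
```

The point of the example: a good scaling cancels the ill-conditioning, so the scaled method behaves like plain gradient descent on the "perfect" quadratic.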
“Perfect” quadratic function:
$$f(x)=\frac{1}{2}x^Tx=\frac{1}{2}\lVert x\rVert^2$$
$$\nabla f(x)=x, \qquad \nabla^2 f(x)=I, \qquad \lambda_i(I)=1 \ \forall i$$
Recall $\operatorname{cond}(A)=\frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}\ge 1$; the perfect quadratic attains the best possible value, $\operatorname{cond}(I)=1$.
For the scaled function $g(y)=f(Sy)$:
$$\nabla g(y)=S^T\nabla f(Sy)$$
$$\nabla^2g(y)=S^T\nabla^2f(Sy)\cdot S$$
where $\nabla^2g(y)$ is $n\times n$.
$\Rightarrow$ Idea: choose $S$ so that $\nabla^2 g(y)=S^T\nabla^2f(x)S$ is well-conditioned, ideally $=I$.
Newton’s Method
Choose $S=(\nabla^2f(x))^{-\frac{1}{2}}$.
$\Rightarrow$
$$\begin{aligned}SS^T &=\nabla^2 f(x)^{-\frac{1}{2}}(\nabla^2 f(x)^{-\frac{1}{2}})^T \\ &=\nabla^2 f(x)^{-\frac{1}{2}} \cdot \nabla^2 f(x)^{-\frac{1}{2}} \\ &= \nabla^2 f(x)^{-1}\end{aligned}$$
If $\nabla^2 f(x)$ is positive definite, this inverse square root exists and is symmetric, and $\nabla^2 f(x)=D^{-\frac{1}{2}}\cdot D^{-\frac{1}{2}}$ with $D = SS^T = \nabla^2 f(x)^{-1}$.
$$\begin{aligned}\nabla^2 g(y)=S^T\nabla^2f(x)S &= \nabla^2 f(x)^{-\frac{1}{2}}\nabla^2 f(x)\nabla^2 f(x)^{-\frac{1}{2}} \\ &=\nabla^2 f(x)^{-\frac{1}{2}}(\nabla^2 f(x)^{\frac{1}{2}} \cdot \nabla^2 f(x)^{\frac{1}{2}})\nabla^2 f(x)^{-\frac{1}{2}}\\ &=I\end{aligned}$$
Newton’s direction at $x$:
$$\begin{aligned}d_N &= -SS^T\nabla f(x)\\ &=-\nabla^2f(x)^{-1}\cdot \nabla f(x)\end{aligned}$$
i.e., the Newton direction $d_N$ solves the
Newton equation: $\nabla^2 f(x)\cdot d=-\nabla f(x)$
Newton’s Method Steps
Given $x_0$ and $\varepsilon > 0$:
for $k = 0, 1, 2, \dots$
compute $d_k = d_N$ by solving $\nabla^2f(x_k)d=-\nabla f(x_k)$
update: $x_{k+1}=x_k+\alpha_k d_k$
check convergence, e.g., on $\lVert \nabla f(x_k)\rVert$
Work per iteration: $O(n^3)$ to solve the linear system.
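The steps above can be sketched as follows. This is a bare-bones illustration with a fixed full step $\alpha_k = 1$ (no line search) and an assumed quadratic test problem; the Newton equation is solved with a dense linear solver, which is the $O(n^3)$ cost per iteration.

```python
import numpy as np

def newton(grad, hess, x0, eps=1e-8, max_iter=50):
    x = x0.astype(float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < eps:        # convergence check on ||grad f(x_k)||
            return x, k
        # Newton equation: hess(x) d = -grad(x); solve, don't invert
        d = np.linalg.solve(hess(x), -g)
        x = x + d                          # full step alpha_k = 1
    return x, max_iter

# illustrative problem: f(x) = 1/2 x^T A x - b^T x, minimizer solves A x = b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
hess = lambda x: A

x_star, iters = newton(grad, hess, np.zeros(2))
```

On a quadratic with constant positive-definite Hessian, Newton's method converges in a single iteration, since the Newton step solves $Ax=b$ exactly.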
To construct such an $S$ in general, use the eigendecomposition:
$$\begin{aligned}\nabla^2 f(x) & = U \Lambda U^T \quad \text{(eigendecomposition)} \\ &=(U\Lambda^{\frac{1}{2}})(U\Lambda^{\frac{1}{2}})^T\end{aligned}$$
Set
$$S=(U\Lambda^{\frac{1}{2}})^{-T}=(\Lambda^{-\frac{1}{2}}U^{-1})^T=U^{-T}\Lambda^{-\frac{1}{2}}=U\Lambda^{-\frac{1}{2}}$$
(using $U^{-1}=U^T$ for orthogonal $U$). Then
$$\begin{aligned} \nabla^2 g(y) &= S^T\nabla^2 f(x)S\\ &= (\Lambda^{-\frac{1}{2}}U^T)(U\Lambda U^T)(U\Lambda^{-\frac{1}{2}})\\ &= \Lambda^{-\frac{1}{2}}\cdot \Lambda \cdot \Lambda^{-\frac{1}{2}} \\ &=I\end{aligned}$$
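The eigendecomposition construction can be checked numerically. A small sketch with an arbitrary symmetric positive-definite matrix standing in for the Hessian: `numpy.linalg.eigh` returns $\Lambda$ (as a vector) and $U$, and $S = U\Lambda^{-1/2}$ should give $S^T H S = I$.

```python
import numpy as np

# Arbitrary SPD matrix standing in for the Hessian at some point x.
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])

lam, U = np.linalg.eigh(H)            # H = U diag(lam) U^T, U orthogonal
S = U @ np.diag(lam ** -0.5)          # S = U Lambda^{-1/2}

scaled_hessian = S.T @ H @ S          # should equal the identity
```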
Gauss-Newton
$$\min_{x\in \mathbb{R}^n} f(x):=\frac{1}{2}\lVert r(x)\rVert^2$$
where
$$r(x)=\begin{bmatrix}r_1(x)\\ \vdots \\ r_m(x)\end{bmatrix}, \qquad r_i: \mathbb{R}^n \to \mathbb{R}$$
Gradient: $\nabla f(x) = J(x)r(x)$, where $J(x)=[\nabla r_1(x) \cdots \nabla r_m(x)] \in \mathbb{R}^{n\times m}$ is the matrix whose columns are the gradients $\nabla r_i(x)$.
G-N iteration:
$$\begin{aligned}x_{k+1} &=\argmin_x \frac{1}{2}\lVert \underbrace{r(x_k)+J(x_k)^T(x-x_k)}_{\text{linearization of }r\text{ at }x_k}\rVert^2\\ &= \argmin_x \frac{1}{2} \lVert J_k^Tx-(J_k^Tx_k-r_k)\rVert^2\\ &=(J_kJ_k^T)^{-1}J_k(J_k^Tx_k-r_k)\\ &=x_k-(J_kJ_k^T)^{-1}J_kr_k\\ &=x_k-(J_kJ_k^T)^{-1}\nabla f(x_k)\end{aligned}$$
(normal equations: minimizing $\frac{1}{2}\lVert Ax-b\rVert^2$ gives $x=(A^TA)^{-1}A^Tb$, here with $A=J_k^T$, $b=J_k^Tx_k-r_k$.)
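A sketch of the G-N iteration on a small assumed residual system. One caveat on conventions: `Jac(x)` below is the usual $m\times n$ Jacobian with rows $\nabla r_i(x)^T$, i.e., the transpose of the notes' $J(x)$, so the update reads $x_{k+1}=x_k-(\mathrm{Jac}^T\mathrm{Jac})^{-1}\mathrm{Jac}^T r(x_k)$. The residuals are an illustrative zero-residual example with solution $x=(1,1)$.

```python
import numpy as np

# Illustrative residuals r: R^2 -> R^3, all zero at x = (1, 1).
def r(x):
    return np.array([x[0] + x[1] - 2.0,
                     x[0] * x[1] - 1.0,
                     x[0] - x[1]])

# m x n Jacobian with rows grad r_i(x)^T (transpose of the notes' J(x)).
def Jac(x):
    return np.array([[1.0,  1.0],
                     [x[1], x[0]],
                     [1.0, -1.0]])

def gauss_newton(x0, tol=1e-10, max_iter=30):
    x = x0.astype(float)
    for k in range(max_iter):
        J = Jac(x)
        # normal equations of the linearized subproblem
        # minimize 1/2 || r(x_k) + J (x - x_k) ||^2
        step = np.linalg.solve(J.T @ J, -J.T @ r(x))
        x = x + step
        if np.linalg.norm(step) < tol:
            break
    return x

x_star = gauss_newton(np.array([2.0, 0.5]))
```

Because the problem has zero residual at the solution, plain G-N converges rapidly here; in general a line search or damping (as in Levenberg-Marquardt) is needed for robustness.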