Computing a Descent Direction
$x_0$ given; $\varepsilon > 0$ convergence tolerance
for $k = 0, 1, 2, \dots$
choose $d_k$ such that $f(x_k + \alpha d_k) < f(x_k)$ for all sufficiently small $\alpha > 0$ (descent direction)
line search: choose a step size $\alpha_k > 0$ that enforces descent, i.e., $f(x_k + \alpha_k d_k) < f(x_k)$
$x_{k+1} = x_k + \alpha_k d_k$
check convergence: exit if $\lVert \nabla f(x_{k+1})\rVert < \varepsilon \cdot \lVert \nabla f(x_0)\rVert$, e.g., $\varepsilon = 10^{-6}$
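To make the scheme concrete, here is a minimal sketch in Python/NumPy, assuming the steepest-descent choice $d_k = -\nabla f(x_k)$ and a simple backtracking (Armijo) line search; the function names and constants are illustrative, not part of the notes.

```python
import numpy as np

def descent(f, grad, x0, eps=1e-6, max_iter=1000):
    """Generic descent scheme: descent direction + line search + relative convergence test."""
    x = x0
    g0 = np.linalg.norm(grad(x0))              # reference norm for the stopping test
    for _ in range(max_iter):
        g = grad(x)
        d = -g                                  # steepest descent: d_k = -grad f(x_k)
        alpha = 1.0
        # backtracking (Armijo) line search: shrink alpha until sufficient decrease
        while f(x + alpha * d) > f(x) + 1e-4 * alpha * (g @ d):
            alpha *= 0.5
        x = x + alpha * d
        if np.linalg.norm(grad(x)) < eps * g0:  # ||grad f(x_{k+1})|| < eps * ||grad f(x_0)||
            return x
    return x

# usage: minimize the "perfect" quadratic f(x) = 1/2 ||x||^2
x_star = descent(lambda x: 0.5 * x @ x, lambda x: x, np.ones(5))
```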
Scaled Descent
$(P)\qquad \min_{x \in \mathbb{R}^n} f(x)$, where $f:\mathbb{R}^n \to \mathbb{R}$ is continuously differentiable.

Make a change of variables:

$x := Sy$, i.e., $y = S^{-1}x$,

where $S$ is square and non-singular.
Ex: $S$ diagonal,

$$S = \begin{bmatrix} s_1 & & & & \\ & s_2 & & & \\ & & s_3 & & \\ & & & \ddots & \\ & & & & s_n \end{bmatrix}$$

Non-singularity means $Sw = 0$ iff $w = 0$; equivalently, $S^T v = 0$ iff $v = 0$.

$$x = \begin{bmatrix} s_1 & & \\ & \ddots & \\ & & s_n \end{bmatrix} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \;\Rightarrow\; x_i = s_i y_i$$

i.e., each coordinate is rescaled independently.
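A quick NumPy check of this example (the scale factors are illustrative): multiplying by a diagonal $S$ is the same as elementwise rescaling.

```python
import numpy as np

# Diagonal scaling x = S y is elementwise rescaling: x_i = s_i * y_i.
s = np.array([1.0, 10.0, 100.0])   # illustrative scale factors, all nonzero
S = np.diag(s)
y = np.array([2.0, 2.0, 2.0])
assert np.allclose(S @ y, s * y)
```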
Scaled descent ↔ unscaled descent

$(P_{\text{scaled}})\qquad \min_{y\in \mathbb{R}^n} g(y) := f(Sy)$

Apply gradient descent to $(P_{\text{scaled}})$:

$$y_{k+1} = y_k - \alpha_k \nabla g(y_k) = y_k - \alpha_k S^T \nabla f(Sy_k)$$

since $\nabla g(y) = \nabla_y f(Sy) = S^T \nabla f(Sy)$ (chain rule, together with $\nabla_x(a^Tx) = a$).

"Move" to the $x$-space by pre-multiplying by $S$:

$$Sy_{k+1} = Sy_k - \alpha_k SS^T \nabla f(Sy_k)$$

$$x_{k+1} = x_k - \alpha_k SS^T \nabla f(x_k)$$

Writing $D := SS^T$, the vector $D\nabla f(x)$ is the "scaled" gradient at $x$.
Is $d = -D\nabla f(x)$ a descent direction? Check the sign of $\nabla f(x)^T d$:

$$\begin{aligned}\nabla f(x)^T d &= \nabla f(x)^T(-D\nabla f(x)) \\ &= -\nabla f(x)^T D\nabla f(x) \\ &= -\nabla f(x)^T SS^T\nabla f(x) \\ &= -(S^T\nabla f(x))^T(S^T\nabla f(x)) \\ &= -\lVert S^T\nabla f(x)\rVert^2\end{aligned}$$

which is negative whenever $S^T\nabla f(x) \ne 0$, i.e. (since $S$ is non-singular) whenever $\nabla f(x) \ne 0$. (If $S$ is orthogonal, then $SS^T = I$ and scaled descent reduces to plain gradient descent.)

So $-D\nabla f(x)$ is a descent direction if $\nabla f(x) \ne 0$ (i.e., $x$ is not stationary).
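A numerical sanity check of the identity above, with a randomly generated (hence generically non-singular) $S$ standing in for an actual scaling matrix:

```python
import numpy as np

# Verify grad_f^T d = -||S^T grad_f||^2 < 0 for d = -S S^T grad_f.
rng = np.random.default_rng(0)
S = rng.standard_normal((4, 4))        # generic random S (non-singular w.p. 1)
grad_f = rng.standard_normal(4)        # stand-in for grad f(x) at a non-stationary x
d = -(S @ S.T) @ grad_f
assert np.isclose(grad_f @ d, -np.linalg.norm(S.T @ grad_f) ** 2)
assert grad_f @ d < 0                  # descent direction
```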
"Perfect" quadratic function:

$$f(x) = \frac{1}{2}x^Tx = \frac{1}{2}\lVert x\rVert^2$$

$$\nabla f(x) = x, \qquad \nabla^2 f(x) = I, \qquad \lambda_i(I) = 1$$

For symmetric positive definite $A$, $\operatorname{cond}(A) = \dfrac{\lambda_{\max}(A)}{\lambda_{\min}(A)} \ge 1$. This $f$ attains the ideal value $\operatorname{cond}(\nabla^2 f(x)) = 1$: the negative gradient $-x$ points straight at the minimizer, so gradient descent with $\alpha = 1$ converges in a single step.
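As a small check (the matrix is illustrative), the eigenvalue ratio matches NumPy's 2-norm condition number for symmetric positive definite matrices, and the identity Hessian is perfectly conditioned:

```python
import numpy as np

A = np.diag([1.0, 4.0, 100.0])                 # eigenvalues 1, 4, 100 => cond = 100
lam = np.linalg.eigvalsh(A)
assert np.isclose(lam.max() / lam.min(), np.linalg.cond(A, 2))
assert np.linalg.cond(np.eye(3), 2) == 1.0     # "perfect" Hessian: cond(I) = 1
```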
Newton's Method

$$g(y) = f(Sy)$$

$$\nabla g(y) = S^T\nabla f(Sy)$$

$$\nabla^2 g(y) = S^T\nabla^2 f(Sy)\,S$$

where $\nabla^2 g(y)$ is the Hessian: symmetric, $n\times n$.

⇒ choose $S$ to make $\nabla^2 g(y) = S^T\nabla^2 f(x)S$ "well-conditioned", i.e., all eigenvalues approximately equal.
Choose $S = (\nabla^2 f(x))^{-\frac{1}{2}}$. If $\nabla^2 f(x)$ is positive definite, its symmetric square root $\nabla^2 f(x)^{\frac{1}{2}}$ exists and is non-singular, so this $S$ is well defined. Then

$$\begin{aligned}SS^T &= \nabla^2 f(x)^{-\frac{1}{2}}\left(\nabla^2 f(x)^{-\frac{1}{2}}\right)^T \\ &= \nabla^2 f(x)^{-\frac{1}{2}} \cdot \nabla^2 f(x)^{-\frac{1}{2}} \\ &= \nabla^2 f(x)^{-1}\end{aligned}$$

(using symmetry of the square root), and

$$\begin{aligned}\nabla^2 g(y) = S^T\nabla^2 f(x)S &= \nabla^2 f(x)^{-\frac{1}{2}}\,\nabla^2 f(x)\,\nabla^2 f(x)^{-\frac{1}{2}} \\ &= \nabla^2 f(x)^{-\frac{1}{2}}\left(\nabla^2 f(x)^{\frac{1}{2}} \cdot \nabla^2 f(x)^{\frac{1}{2}}\right)\nabla^2 f(x)^{-\frac{1}{2}} \\ &= I\end{aligned}$$
⇒ Newton's direction at $x$:

$$\begin{aligned}d_N &= -SS^T\nabla f(x) \\ &= -\nabla^2 f(x)^{-1}\nabla f(x)\end{aligned}$$

i.e., the Newton direction $d_N$ is the solution of the (positive definite) linear system

Newton equation: $\nabla^2 f(x)\,d = -\nabla f(x)$
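In practice the Newton equation is solved directly rather than by forming $\nabla^2 f(x)^{-1}$; a minimal sketch, with illustrative Hessian and gradient values:

```python
import numpy as np

def newton_direction(H, g):
    """Solve the Newton equation H d = -g instead of computing H^{-1} explicitly."""
    return np.linalg.solve(H, -g)

H = np.array([[4.0, 1.0], [1.0, 3.0]])   # a positive definite Hessian
g = np.array([1.0, 2.0])                 # gradient at the current iterate
d = newton_direction(H, g)
assert np.allclose(H @ d, -g)
assert g @ d < 0                         # descent: g^T d = -g^T H^{-1} g < 0 for H pos-def
```

Solving the linear system is cheaper and numerically more stable than explicitly inverting the Hessian.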
Newton's Method Steps

$x_0$ given; $\varepsilon > 0$
for $k = 0, 1, 2, \dots$
compute $d_k$ by solving the Newton equation $\nabla^2 f(x_k)\,d = -\nabla f(x_k)$
$x_{k+1} = x_k + \alpha_k d_k$
check convergence: $\lVert \nabla f(x_{k+1})\rVert$ small
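Putting the steps together, a minimal damped-Newton sketch in NumPy, assuming a positive definite Hessian at every iterate and a simple backtracking choice of $\alpha_k$:

```python
import numpy as np

def newton(f, grad, hess, x0, eps=1e-6, max_iter=50):
    """Newton's method: solve the Newton equation, take a damped step, test ||grad||."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < eps:         # check convergence: ||grad f(x_k)|| small
            break
        d = np.linalg.solve(hess(x), -g)    # Newton equation
        alpha = 1.0
        while f(x + alpha * d) > f(x):      # backtrack until the step decreases f
            alpha *= 0.5
        x = x + alpha * d
    return x

# usage: on a strictly convex quadratic, Newton converges in one step with alpha = 1
A = np.array([[3.0, 1.0], [1.0, 2.0]])
x_star = newton(lambda x: 0.5 * x @ A @ x, lambda x: A @ x, lambda x: A, np.ones(2))
```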
One concrete way to build such an $S$ is from the eigendecomposition of the Hessian:

$$\begin{aligned}\nabla^2 f(x) &= U\Lambda U^T \quad \text{(eigendecomposition, $U$ orthogonal)} \\ &= (U\Lambda^{\frac{1}{2}})(U\Lambda^{\frac{1}{2}})^T\end{aligned}$$

Set $S = (U\Lambda^{\frac{1}{2}})^{-T} = (\Lambda^{-\frac{1}{2}}U^{-1})^T = U^{-T}\Lambda^{-\frac{1}{2}} = U\Lambda^{-\frac{1}{2}}$ (using $U^{-T} = U$ for orthogonal $U$). Then

$$\begin{aligned}\nabla^2 g(y) = S^T\nabla^2 f(x)S &= (\Lambda^{-\frac{1}{2}}U^T)(U\Lambda U^T)(U\Lambda^{-\frac{1}{2}}) \\ &= \Lambda^{-\frac{1}{2}} \cdot \Lambda \cdot \Lambda^{-\frac{1}{2}} \\ &= I\end{aligned}$$
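The same construction, checked numerically on an illustrative positive definite matrix:

```python
import numpy as np

H = np.array([[4.0, 1.0], [1.0, 3.0]])          # illustrative pos-def Hessian
lam, U = np.linalg.eigh(H)                      # H = U diag(lam) U^T
S = U @ np.diag(lam ** -0.5)                    # S = U Lambda^{-1/2}
assert np.allclose(S.T @ H @ S, np.eye(2))      # scaled Hessian is the identity
assert np.allclose(S @ S.T, np.linalg.inv(H))   # S S^T = H^{-1}, the Newton scaling
```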
Gauss-Newton

$$\min_{x\in \mathbb{R}^n} f(x) := \frac{1}{2}\lVert r(x)\rVert^2$$

$$r(x) = \begin{bmatrix}r_1(x)\\ \vdots \\ r_m(x)\end{bmatrix}, \quad \text{where } r_i:\mathbb{R}^n \to \mathbb{R}$$

$$\nabla f(x) = J(x)\,r(x), \qquad J(x) := \begin{bmatrix}\nabla r_1(x) & \cdots & \nabla r_m(x)\end{bmatrix} \in \mathbb{R}^{n\times m}$$

G-N iteration:

$$\begin{aligned}x_{k+1} &= \operatorname*{arg\,min}_x \frac{1}{2}\Big\lVert \underbrace{r(x_k)+J(x_k)^T(x-x_k)}_{\text{linearization of } r \text{ at } x_k}\Big\rVert^2 \\ &= \operatorname*{arg\,min}_x \frac{1}{2}\lVert J_k^Tx - (J_k^Tx_k - r_k)\rVert^2 \\ &= (J_kJ_k^T)^{-1}J_k(J_k^Tx_k - r_k) \\ &= x_k - (J_kJ_k^T)^{-1}J_kr_k \\ &= x_k - (J_kJ_k^T)^{-1}\nabla f(x_k)\end{aligned}$$

(normal equation: $\min_x \frac{1}{2}\lVert Ax - b\rVert^2$ ⇒ $x = (A^TA)^{-1}A^Tb$, applied here with $A = J_k^T$ and $b = J_k^Tx_k - r_k$)
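A minimal Gauss-Newton sketch following the notes' convention that $J(x)$ is $n\times m$; the linearized subproblem is solved by least squares rather than by forming $(J_kJ_k^T)^{-1}$. The test problem (linear residuals, so G-N converges in one iteration) is illustrative:

```python
import numpy as np

def gauss_newton(r, J, x0, eps=1e-6, max_iter=50):
    """Gauss-Newton for min 1/2 ||r(x)||^2, with J(x) = [grad r_1 ... grad r_m] (n x m)."""
    x = x0
    for _ in range(max_iter):
        rk, Jk = r(x), J(x)
        if np.linalg.norm(Jk @ rk) < eps:            # grad f(x) = J(x) r(x)
            break
        # linearized subproblem: min_s 1/2 ||r_k + J_k^T s||^2, solved by least squares
        s, *_ = np.linalg.lstsq(Jk.T, -rk, rcond=None)
        x = x + s
    return x

# usage: linear residuals r(x) = A^T x - b, so J(x) = A and one G-N step suffices
A = np.array([[1.0, 3.0, 5.0], [2.0, 4.0, 6.0]])     # n = 2, m = 3
b = np.array([1.0, 2.0, 3.0])
x_fit = gauss_newton(lambda x: A.T @ x - b, lambda x: A, np.zeros(2))
```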