Scaled Descent

  • Scaled Descent

  • Gauss-Newton Method

  • Newton’s Method

  • Diagonal Scaling

Computing the descent direction

$x_0$ given; $\varepsilon > 0$ convergence tolerance

for $k = 0, 1, 2, \dots$

  • choose $d_k$ such that $f(x_k+\alpha d_k) < f(x_k)$ for some small $\alpha > 0$ (descent direction)

  • line search: choose a step size $\alpha_k > 0$ that enforces descent, i.e., $f(x_k+\alpha_k d_k) < f(x_k)$

  • $x_{k+1}=x_k+\alpha_k d_k$

  • check convergence: exit if $\lVert \nabla f(x_{k+1})\rVert < \varepsilon \cdot \lVert \nabla f(x_k)\rVert$, e.g., $\varepsilon=10^{-6}$
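A minimal sketch of this loop in Python/NumPy (illustrative only; the steepest-descent choice $d_k=-\nabla f(x_k)$ and the simple backtracking line search are assumptions, not part of the notes):

```python
import numpy as np

def descent(f, grad, x0, eps=1e-6, max_iter=1000):
    """Generic descent loop: pick a direction, line search, update, test convergence."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        d = -g  # steepest descent: one simple valid choice of descent direction d_k
        # backtracking line search: shrink alpha until f actually decreases
        alpha = 1.0
        while f(x + alpha * d) >= f(x) and alpha > 1e-12:
            alpha *= 0.5
        x_new = x + alpha * d
        # convergence test from the notes: ||grad f(x_{k+1})|| < eps * ||grad f(x_k)||
        if np.linalg.norm(grad(x_new)) < eps * np.linalg.norm(g):
            return x_new
        x = x_new
    return x

# example: minimize the "perfect" quadratic f(x) = 1/2 ||x||^2
x_star = descent(lambda x: 0.5 * x @ x, lambda x: x, np.array([3.0, -4.0]))
```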

Scaled Descent

$(P)\quad \min_{x \in \mathbb{R}^n} f(x)$, where $f:\mathbb{R}^n \to \mathbb{R}$ is continuously differentiable

Make a change of variables:

$x := Sy$ or $y = S^{-1}x$,

where $S$ is square and non-singular.

Ex: a diagonal scaling matrix

$$S=\begin{bmatrix} s_1 & & & &\\ &s_2&&&\\ & &s_3&& \\ & & & \ddots & \\ &&&&s_n \end{bmatrix}$$

$Sw=0$ iff $w=0$

$S^Tv = 0$ iff $v=0$

$$x=\begin{bmatrix}s_1 & & \\ & \ddots & \\&&s_n\end{bmatrix}\begin{bmatrix}y_1 \\ \vdots \\ y_n\end{bmatrix}, \qquad x_i=s_iy_i$$

$(P_{scaled})\quad \min_{y\in \mathbb{R}^n} g(y):=f(Sy)$

Apply gradient descent to $(P_{scaled})$:

$$y_{k+1}=y_k-\alpha_k \nabla g(y_k)=y_k-\alpha_k S^T\nabla f(Sy_k)$$

$$\nabla g(y) = \nabla_y f(Sy) = S^T\nabla f(Sy)$$

by the chain rule and $\nabla_x (a^Tx)=a$
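Writing that step out componentwise (using $(Sy)_i=\sum_j S_{ij}y_j$):

$$\frac{\partial g}{\partial y_j}(y) = \sum_{i=1}^n \frac{\partial f}{\partial x_i}(Sy)\,\frac{\partial (Sy)_i}{\partial y_j} = \sum_{i=1}^n S_{ij}\,\frac{\partial f}{\partial x_i}(Sy) = \big(S^T \nabla f(Sy)\big)_j$$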

“Move” to the “x-space” by pre-multiplying by $S$:

$$Sy_{k+1}=Sy_k-\alpha_k SS^T\nabla f(Sy_k)$$

$$x_{k+1}=x_k-\alpha_k SS^T\nabla f(x_k)$$

Let $D:=SS^T$

$D\nabla f(x)$ is the “scaled” gradient of $f$ at $x$

Is $d = -D\nabla f(x)$ a descent direction?

$$
\begin{aligned}
\nabla f(x)^Td & =\nabla f(x)^T(-D\nabla f(x)) \\
&= -\nabla f(x)^TD\nabla f(x) \\
&= -\nabla f(x)^TSS^T\nabla f(x) \\
&= -(S^T\nabla f(x))^T(S^T \nabla f(x))\\
&= -\lVert S^T\nabla f(x)\rVert^2 \\
&< 0
\end{aligned}
$$

If $S$ is orthogonal ⇒ $SS^T=I$

Scaled descent ↔ unscaled descent

$-D\nabla f(x)$ is a descent direction whenever $\nabla f(x)\ne 0$ (i.e., $x$ is not stationary)
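A quick numerical sanity check of this claim (a sketch with an assumed quadratic test function, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
S = np.diag(rng.uniform(0.5, 2.0, n))     # non-singular (diagonal) scaling
D = S @ S.T                                # D = S S^T

M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                # SPD matrix for the test function f(x) = 1/2 x^T A x
grad_f = lambda x: A @ x

x = rng.standard_normal(n)
g = grad_f(x)
d = -D @ g                                 # scaled descent direction

print(g @ d)                               # negative, so d is a descent direction
print(-np.linalg.norm(S.T @ g) ** 2)       # equals the value above: -||S^T grad f(x)||^2
```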

“Perfect” quadratic function:

$$f(x)=\frac{1}{2}x^Tx=\frac{1}{2}\lVert x\rVert^2$$

$\nabla f(x)=x$

$\nabla^2 f(x)=I$

$\lambda_i(I)=1$ for all $i$

$$\mathrm{cond}(A)=\frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}\ge 1 \quad \text{(for symmetric positive-definite } A\text{)}$$
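For instance, the identity has every eigenvalue equal to $1$, so $\mathrm{cond}(I)=1$ (the best possible), whereas a stretched quadratic is poorly conditioned:

$$A = \begin{bmatrix} 100 & 0 \\ 0 & 1 \end{bmatrix} \ \Rightarrow\ \mathrm{cond}(A) = \frac{100}{1} = 100$$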

$g(y)=f(Sy)$

$\nabla g(y)=S^T\nabla f(Sy)$

$$\nabla^2 g(y)=S^T\,\nabla^2 f(Sy)\,S$$

where $\nabla^2 g(y)$ is the Hessian of $g$: symmetric and $n\times n$

⇒ choose $S$ to make $\nabla^2 g(y)=S^T\nabla^2 f(x)S$ “well-conditioned”, i.e., all eigenvalues approximately equal
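As a concrete illustration (an assumed toy example, not from the notes): for $f(x)=\frac{1}{2}x^TAx$ with $A=\mathrm{diag}(100,1)$, the diagonal scaling $S=\mathrm{diag}(1/10,\,1)$ gives $\nabla^2 g(y)=S^TAS=I$:

```python
import numpy as np

A = np.diag([100.0, 1.0])     # Hessian of f(x) = 1/2 x^T A x; cond(A) = 100
S = np.diag([0.1, 1.0])       # diagonal scaling S = A^{-1/2}

H_g = S.T @ A @ S             # Hessian of g(y) = f(Sy)
print(np.linalg.cond(A))      # 100.0
print(np.linalg.cond(H_g))    # 1.0 -- perfectly conditioned after scaling
```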

Newton’s Method

Choose $S$ as $S=(\nabla^2 f(x))^{-\frac{1}{2}}$:

$$
\begin{aligned}
SS^T &=\nabla^2 f(x)^{-\frac{1}{2}}\big(\nabla^2 f(x)^{-\frac{1}{2}}\big)^T \\
&=\nabla^2 f(x)^{-\frac{1}{2}} \cdot \nabla^2 f(x)^{-\frac{1}{2}} \quad \text{(since } \nabla^2 f(x)^{-\frac{1}{2}} \text{ is symmetric)} \\
&= \nabla^2 f(x)^{-1}
\end{aligned}
$$

If $\nabla^2 f(x)$ is pos-def, then $\nabla^2 f(x)=D^{-\frac{1}{2}}\cdot D^{-\frac{1}{2}}$ for some non-singular $D$

$$
\begin{aligned}
\nabla^2 g(y)=S^T\nabla^2 f(x)S &= \nabla^2 f(x)^{-\frac{1}{2}}\,\nabla^2 f(x)\,\nabla^2 f(x)^{-\frac{1}{2}} \\
&=\nabla^2 f(x)^{-\frac{1}{2}}\big(\nabla^2 f(x)^{\frac{1}{2}} \cdot \nabla^2 f(x)^{\frac{1}{2}}\big)\nabla^2 f(x)^{-\frac{1}{2}}\\
&=I
\end{aligned}
$$

Newton’s direction at $x$:

$$
\begin{aligned}
d_N &= -SS^T\nabla f(x)\\
&=-\nabla^2 f(x)^{-1}\,\nabla f(x)
\end{aligned}
$$

i.e., the Newton direction $d_N$ is the solution of a positive-definite linear system:

Newton equation: $\nabla^2 f(x)\, d=-\nabla f(x)$

Newton’s Method Steps

$x_0$ given; $\varepsilon > 0$

for $k = 0, 1, 2, \dots$

  • $d_k$ solves the Newton equation $\nabla^2 f(x_k)\, d=-\nabla f(x_k)$

  • line search to choose $\alpha_k$

  • $x_{k+1}=x_k+\alpha_k d_k$

  • check convergence: $\lVert \nabla f(x_k)\rVert$ small

Work per iteration: $O(n^3)$ (solving the Newton equation).
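A minimal Python/NumPy sketch of these steps (the backtracking line search and stopping rule are illustrative assumptions):

```python
import numpy as np

def newton(f, grad, hess, x0, eps=1e-8, max_iter=50):
    """Newton's method with a simple backtracking line search."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < eps:          # convergence: gradient norm small
            break
        d = np.linalg.solve(hess(x), -g)     # Newton equation: hess(x) d = -grad f(x), O(n^3)
        alpha = 1.0                          # line search on alpha
        while f(x + alpha * d) >= f(x) and alpha > 1e-12:
            alpha *= 0.5
        x = x + alpha * d
    return x

# example: for a strictly convex quadratic, a single full Newton step reaches the minimizer
A = np.diag([100.0, 1.0])
print(newton(lambda x: 0.5 * x @ A @ x, lambda x: A @ x, lambda x: A, np.array([3.0, -4.0])))
```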

$$
\begin{aligned}
\nabla^2 f(x) & = U \Lambda U^T \quad \text{(eigendecomposition)} \\
&=(U\Lambda^{\frac{1}{2}})(U\Lambda^{\frac{1}{2}})^T
\end{aligned}
$$

Set $S=(U\Lambda^{\frac{1}{2}})^{-T}=(\Lambda^{-\frac{1}{2}}U^{-1})^T=U^{-T}\Lambda^{-\frac{1}{2}}=U\Lambda^{-\frac{1}{2}}$ (the last step uses $U^{-1}=U^T$, since $U$ is orthogonal)

$$
\begin{aligned}
\nabla^2 g(y) &= S^T\nabla^2 f(x)S\\
&= (\Lambda^{-\frac{1}{2}}U^T)(U\Lambda U^T)(U\Lambda^{-\frac{1}{2}})\\
&= \Lambda^{-\frac{1}{2}}\,\Lambda\,\Lambda^{-\frac{1}{2}} \\
&=I
\end{aligned}
$$
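A quick numerical check of this construction (illustrative, using a random symmetric positive-definite matrix in place of $\nabla^2 f(x)$):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
H = M @ M.T + 4 * np.eye(4)                    # stand-in for a positive-definite Hessian

lam, U = np.linalg.eigh(H)                     # H = U diag(lam) U^T
S = U @ np.diag(lam ** -0.5)                   # S = U Lambda^{-1/2}

print(np.allclose(S.T @ H @ S, np.eye(4)))     # True: S^T H S = I
print(np.allclose(S @ S.T, np.linalg.inv(H)))  # True: S S^T = H^{-1} (the Newton scaling)
```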

Gauss-Newton

$$\min_{x\in \mathbb{R}^n} f(x):=\frac{1}{2}\lVert r(x)\rVert^2$$

$r(x)=\begin{bmatrix}r_1(x)\\ \vdots \\ r_m(x)\end{bmatrix}$, where $r_i: \mathbb{R}^n \to \mathbb{R}$

$\nabla f(x) = J(x)\,r(x)$, where $J(x)=\begin{bmatrix}\nabla r_1(x) & \cdots & \nabla r_m(x)\end{bmatrix}\in\mathbb{R}^{n\times m}$

G-N iteration:

$$
\begin{aligned}
x_{k+1} &=\argmin_x \frac{1}{2}\Big\lVert \underbrace{r(x_k)+J(x_k)^T(x-x_k)}_{\text{linearization of } r \text{ at } x_k}\Big\rVert^2\\
&= \argmin_x \frac{1}{2} \lVert J_k^T x-(J_k^T x_k-r_k)\rVert^2\\
&=(J_kJ_k^T)^{-1}J_k(J_k^T x_k-r_k)\\
&=x_k-(J_kJ_k^T)^{-1}J_k r_k\\
&=x_k-(J_kJ_k^T)^{-1}\nabla f(x_k)
\end{aligned}
$$

(normal equations: minimizing $\frac{1}{2}\lVert Ax-b\rVert^2$ gives $x=(A^TA)^{-1}A^Tb$)
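A minimal sketch of the Gauss-Newton step in Python/NumPy, using the notes' convention that $J(x)$ has columns $\nabla r_i(x)$; the linear residual $r(x)=Ax-b$ is an assumed toy problem, for which a single G-N step reproduces the normal-equation solution:

```python
import numpy as np

def gauss_newton_step(r, J, x):
    """One Gauss-Newton step: x+ = x - (J J^T)^{-1} J r(x)."""
    Jx, rx = J(x), r(x)                        # Jx is n x m, columns are grad r_i(x)
    return x - np.linalg.solve(Jx @ Jx.T, Jx @ rx)

# linear residuals r(x) = A x - b (assumed toy problem)
rng = np.random.default_rng(2)
A, b = rng.standard_normal((6, 3)), rng.standard_normal(6)
r = lambda x: A @ x - b
J = lambda x: A.T                              # columns of J are grad r_i(x) = a_i

x1 = gauss_newton_step(r, J, np.zeros(3))
print(np.allclose(x1, np.linalg.lstsq(A, b, rcond=None)[0]))   # True: matches least squares
```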
