Gradient Descent

solves $\min \frac{1}{2} \lVert Ax-b\rVert^2$

check if $x^*$ solves it via $A^TAx^*=A^Tb$

$A^T(Ax^*-b)=0$

stop when $\lVert A^Tr^*\rVert\le \varepsilon$, where $r^*=Ax^*-b$
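A minimal numpy sketch of this check (the data $A$, $b$ and the tolerance $\varepsilon$ below are illustrative assumptions):

```python
import numpy as np

# Illustrative overdetermined system (m > n)
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

# Candidate solution from the normal equations A^T A x = A^T b
x_star = np.linalg.solve(A.T @ A, A.T @ b)

# Optimality check: A^T r* should (numerically) vanish
r_star = A @ x_star - b
eps = 1e-8
print(np.linalg.norm(A.T @ r_star) <= eps)  # True up to round-off
```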

$\underset{x\in \mathbb{R}^n}{\min}f(x)$, $f:\mathbb{R}^n \to \mathbb{R}$ smooth

(later: constrained case $x\in C \subseteq \mathbb{R}^n$)

Optimality

$\bar{x}$ is a …

  1. local min of $f$ if $f(\bar{x}) \le f(x),\ \forall x \in B_\varepsilon(\bar{x})$, for some $\varepsilon>0$ small,

    where $B_\varepsilon(\bar{x})=\{x \mid \lVert x-\bar{x}\rVert_2 \le \varepsilon\}$

  2. global min of $f$ if $f(\bar{x})\le f(x),\ \forall x \in \mathbb{R}^n$

$\bar{x}$ is a maximizer of $f$ ⟺ $\bar{x}$ is a minimizer of $-f$

Ex: lack of attainment

$\underset{x \to \infty}{\lim} e^{-x}=0$

$\underset{x}{\inf}\, e^{-x}=0$, but the infimum is not attained by any $x$

$e^{-x}$ is not coercive

Definition: A function $f$ is coercive if $\underset{\lVert x\rVert \to \infty}{\lim} f(x) = \infty$

Equivalently (for continuous $f$): the level sets $\{x \mid f(x) \le \tau\}$ are compact (closed & bounded) for every $\tau \in \mathbb{R}$
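For the running example, with $\tau \gt 0$ (for $\tau \le 0$ the level set is empty, hence trivially compact),

$$\{x \mid e^{-x} \le \tau\} = [-\ln \tau,\ \infty),$$

which is unbounded, hence not compact, so $e^{-x}$ fails the level-set test and is not coercive.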

First-order necessary condition

$\bar{x}$ is a local min of $f:\mathbb{R}^n \to \mathbb{R}$ only if $\nabla f(\bar{x})=0$

$\left[\ \bar{x}\ \text{local min} \Rightarrow \nabla f(\bar{x})=0\ \right]$

$\left[\ \bar{x}\ \text{local max} \Rightarrow \nabla f(\bar{x})=0\ \right]$

Proof sketch: up to first order,

$f(\bar{x}+\alpha d)-f(\bar{x})=\nabla f(\bar{x})^T(\alpha d)+o(\alpha \lVert d\rVert)$

Because $\bar{x}$ is a local min,

$0\le \underset{\alpha \to 0^+}{\lim}\frac{f(\bar{x}+\alpha d)-f(\bar{x})}{\alpha}=\underset{\alpha \to 0^+}{\lim}\left[\nabla f(\bar{x})^Td+\frac{o(\alpha \lVert d\rVert)}{\alpha}\right]=\nabla f(\bar{x})^Td$

This holds for every $d$; taking $d=-\nabla f(\bar{x})$ gives $-\lVert \nabla f(\bar{x})\rVert^2 \ge 0$, so $\nabla f(\bar{x})=0$

2nd-order sufficient condition

$\bar{x}$ is a local min of $f:\mathbb{R}^n \to \mathbb{R}$ if

  • $\nabla f(\bar{x})=0$ (stationary)

  • $\nabla^2 f(\bar{x})\gt 0$ (Hessian positive definite)

$\left[\ \nabla f(\bar{x})=0,\ \nabla^2 f(\bar{x})\gt 0 \Rightarrow \bar{x}\ \text{local min}\ \right]$

(Figure: a 1-D curve with labeled critical points satisfying $f''(x_{\text{global}})\gt 0$, $f''(x_{\text{local}})\gt 0$, $f''(x'_{\text{local}})=0$, $f''(x_{\text{max}})\lt 0$.)

Ex: $f(x)=x^4$

$f'(0)=0$

$f''(0)=0$, yet $x=0$ is a global min: the sufficient condition can fail at a true minimizer

Definition: A square, symmetric matrix $A$ is positive semi-definite (i.e., $A \ge 0$) if the following equivalent conditions hold:

  1. $x^TAx \ge 0,\ \forall x\in \mathbb{R}^n$

  2. $\lambda_i(A) \ge 0,\ i=1,\ldots,n$

  3. $A=R^TR$ for some $R$ (a Cholesky factorization exists)

Positive Definite

  1. $x^TAx \gt 0,\ \forall x \ne 0$

  2. $\lambda_i(A) \gt 0,\ \forall i$

  3. $A=R^TR$, $R$ non-singular
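A small numpy sketch of conditions 2 and 3 (the test matrices are illustrative; the eigenvalue tolerance is an assumption to absorb round-off):

```python
import numpy as np

def is_positive_definite(A):
    """Condition 3: a Cholesky factorization A = R^T R exists iff A > 0."""
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

def is_positive_semidefinite(A, tol=1e-10):
    """Condition 2: all eigenvalues nonnegative (up to a tolerance)."""
    return bool(np.all(np.linalg.eigvalsh(A) >= -tol))

A = np.array([[2.0, 1.0], [1.0, 2.0]])  # eigenvalues 1, 3: PD
B = np.array([[1.0, 1.0], [1.0, 1.0]])  # eigenvalues 0, 2: PSD but not PD
print(is_positive_definite(A), is_positive_definite(B))          # True False
print(is_positive_semidefinite(A), is_positive_semidefinite(B))  # True True
```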

Proof Sketch: up to second order,

$f(\bar{x}+\alpha d)-f(\bar{x})=\alpha\underbrace{\nabla f(\bar{x})^T}_{=0}d+\underbrace{\tfrac{1}{2}\alpha^2d^T\nabla^2 f(\bar{x})d}_{\gt 0}+o(\alpha^2\lVert d\rVert^2)$

take $\alpha$ “small”: the positive quadratic term dominates the $o(\cdot)$ remainder, so $f(\bar{x}+\alpha d) \gt f(\bar{x})$ for every $d \ne 0$

⟹ $\bar{x}$ is a local min

2nd-order necessary condition

$\bar{x}$ is a local min of $f:\mathbb{R}^n \to \mathbb{R}$ only if

  • $\nabla f(\bar{x})=0$ (stationary)

  • $\nabla^2 f(\bar{x})\ge 0$ (Hessian positive semi-definite)
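Putting the two second-order conditions together, a sketch that classifies a stationary point from its Hessian eigenvalues (the tolerance and the test Hessians are assumptions):

```python
import numpy as np

def classify_stationary_point(H, tol=1e-10):
    """Classify x_bar with grad f(x_bar) = 0 via Hessian eigenvalues."""
    lam = np.linalg.eigvalsh(H)  # H assumed symmetric
    if np.all(lam > tol):
        return "local min (sufficient condition holds)"
    if np.all(lam < -tol):
        return "local max"
    if np.any(lam > tol) and np.any(lam < -tol):
        return "saddle point"
    return "inconclusive (only the necessary condition holds, cf. f(x) = x^4)"

print(classify_stationary_point(np.diag([2.0, 3.0])))   # local min
print(classify_stationary_point(np.diag([2.0, -1.0])))  # saddle point
print(classify_stationary_point(np.array([[0.0]])))     # inconclusive
```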

Eigenpair $(x,\lambda)$ of $A$ (square, symmetric), normalized so $\lVert x\rVert_2=1$:

$Ax=\lambda x$

$x^TAx=\lambda \underbrace{x^Tx}_{=1}=\lambda$

($\lambda \gt 0$ means $Ax$ points in the same direction as $x$)

Ex: $\min \frac{1}{2}\lVert r(x)\rVert^2=\frac{1}{2}\sum^m_{i=1}r_i(x)^2$, $r_i:\mathbb{R}^n \to \mathbb{R}$

$f(x)=\frac{1}{2}\lVert r(x)\rVert^2=\frac{1}{2}r_1(x)^2+\frac{1}{2}r_2(x)^2+\cdots$

$$\begin{aligned}\nabla f(x) & =\nabla r_1(x)\, r_1(x)+\nabla r_2(x)\, r_2(x)+\cdots\\ &= \begin{bmatrix}\mid & \mid & & \mid\\ \nabla r_1(x) & \nabla r_2(x) & \cdots & \nabla r_m(x)\\ \mid & \mid & & \mid \end{bmatrix}\begin{bmatrix}r_1(x)\\ r_2(x)\\ \vdots\\ r_m(x)\end{bmatrix}\\ &= J(x)^Tr(x)\end{aligned}$$

where $J(x)$ is the Jacobian of $r$ (row $i$ is $\nabla r_i(x)^T$)

stop when $\lVert J(x)^T r(x)\rVert=\lVert \nabla f(x)\rVert$ is small
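A sketch of this gradient computation, with the residual map $r$ an illustrative assumption, checked against central finite differences:

```python
import numpy as np

def r(x):
    """Illustrative residuals r_i : R^2 -> R (an assumption for this demo)."""
    return np.array([x[0]**2 - x[1], x[0] + x[1] - 1.0, np.sin(x[0])])

def J(x):
    """Jacobian of r: row i is grad r_i(x)^T."""
    return np.array([[2.0 * x[0],  -1.0],
                     [1.0,          1.0],
                     [np.cos(x[0]), 0.0]])

f = lambda x: 0.5 * np.dot(r(x), r(x))   # f(x) = 0.5 ||r(x)||^2
x = np.array([0.5, -0.3])
grad = J(x).T @ r(x)                      # grad f(x) = J(x)^T r(x)

# Central finite-difference check of the formula
h = 1e-6
fd = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(2)])
print(np.allclose(grad, fd, atol=1e-6))   # True
```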

Least Squares (Linear)

$f(x)=\frac{1}{2}\lVert Ax-b\rVert^2=\frac{1}{2}\lVert r(x)\rVert^2$

where $r_i(x)=a_i^Tx-b_i$, $\nabla r_i(x)=a_i$

$$\begin{aligned}\nabla f(x) & =\begin{bmatrix}\mid & \mid & & \mid\\ a_1 & a_2 & \cdots & a_m\\ \mid & \mid & & \mid \end{bmatrix}\begin{bmatrix}a_1^Tx-b_1\\ a_2^Tx-b_2\\ \vdots\\ a_m^Tx-b_m\end{bmatrix}\\ &= A^T(Ax-b)\end{aligned}$$

where $A=\begin{bmatrix}-a^T_1-\\-a^T_2-\\ \vdots \\ -a^T_m-\end{bmatrix}$ and $a_i$ is an $n \times 1$ vector.

$\nabla f(x)=0$ ⟺ $A^TAx=A^Tb$
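The same identity in numpy, on random illustrative data; the normal-equations solution matches np.linalg.lstsq:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 4))
b = rng.standard_normal(30)
x = rng.standard_normal(4)

# grad f(x) = A^T (A x - b), checked by central finite differences
f = lambda x: 0.5 * np.linalg.norm(A @ x - b)**2
grad = A.T @ (A @ x - b)
h = 1e-6
fd = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(4)])
print(np.allclose(grad, fd, atol=1e-5))  # True

# Setting the gradient to zero gives the normal equations A^T A x = A^T b
x_star = np.linalg.solve(A.T @ A, A.T @ b)
print(np.allclose(x_star, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```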

Gradient Descent

  • step size selection (linesearch)

  • search direction selection

$\underset{x \in \mathbb{R}^n}{\min} f(x)$, $f:\mathbb{R}^n \to \mathbb{R}$ continuously differentiable

Descent Direction

$x^0$ given

for $k = 0,1,2,\ldots$

$x^{k+1}=x^k+\alpha^kd^k$

  • $d^k \in \mathbb{R}^n$ search direction

  • $\alpha^k \in \mathbb{R}_+$ step length

“Linesearch Methods”

A search direction $d \in \mathbb{R}^n$ is a descent direction at $x$ if the directional derivative is negative: $\nabla f(x)^Td=f'(x;d) \lt 0$

if $f: \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable and $d \in \mathbb{R}^n$ is a descent direction at $x$, then $\exists$ some $\varepsilon \gt 0$ such that $f(x+\alpha d) \lt f(x)$ for all $\alpha \in (0,\varepsilon]$

Proof Sketch

  • because $f'(x;d) \lt 0$: $\underset{\alpha \to 0}{\lim}\frac{f(x+\alpha d)-f(x)}{\alpha} \equiv f'(x;d) \lt 0$

  • because this limit is negative, $f(x+\alpha d) \lt f(x)$ for all sufficiently small $\alpha \gt 0$

Conceptual Algorithm

$x^0$ given; for $k=0,1,2,\ldots$

  • compute search direction $d^k$ (given $x^k$)

  • compute step size $\alpha^k$ such that $f(x^k+\alpha^kd^k)\lt f(x^k)$

  • update $x^{k+1}=x^k+\alpha^kd^k$

  • check stop criteria (see the sketch below)
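A minimal sketch of this loop, using the negative gradient as the search direction (introduced formally below) and simple halving backtracking; the quadratic test problem, tolerances, and iteration cap are assumptions:

```python
import numpy as np

def gradient_descent(f, grad, x0, alpha0=1.0, tol=1e-8, max_iter=500):
    """Conceptual linesearch method with d^k = -grad f(x^k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:            # stop criterion: near-stationary
            break
        d = -g                                  # steepest descent direction
        alpha = alpha0
        while f(x + alpha * d) >= f(x) and alpha > 1e-12:
            alpha /= 2.0                        # halve until f decreases
        x = x + alpha * d                       # x^{k+1} = x^k + alpha^k d^k
    return x

# Illustrative strongly convex quadratic: f(x) = 0.5 x^T A x + b^T x
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([-1.0, 0.5])
f = lambda x: 0.5 * x @ A @ x + b @ x
grad = lambda x: A @ x + b
print(gradient_descent(f, grad, np.zeros(2)))  # approx. solves A x = -b
```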

Step Size Selection ($\alpha^k$)

  1. constant step $\alpha^k \equiv \alpha$ ($\alpha \gt 0,\ \forall k$)

  2. exact linesearch (not always possible)

    choose $\alpha^k$ to solve

    $\underset{\alpha \ge 0}{\min}\ \phi (\alpha) := f(x^k+\alpha d^k)$

  3. backtracking linesearch (“Armijo”)

    reduce $\alpha$ in steps (e.g. $\alpha \leftarrow \alpha/2$) until $f(x^k+\alpha d^k) \lt f(x^k)$; see the sketch below
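A sketch of the Armijo variant, which backtracks until a sufficient-decrease condition holds (the constant c and the shrink factor are conventional choices, assumed here):

```python
import numpy as np

def backtracking_armijo(f, g, x, d, alpha0=1.0, c=1e-4, shrink=0.5):
    """Shrink alpha until f(x + alpha d) <= f(x) + c * alpha * g^T d,
    where g = grad f(x). Terminates for any descent direction d."""
    slope = g @ d
    assert slope < 0, "d must be a descent direction"
    fx = f(x)
    alpha = alpha0
    while f(x + alpha * d) > fx + c * alpha * slope:
        alpha *= shrink
    return alpha

# Usage on the quadratic from the previous sketch
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([-1.0, 0.5])
f = lambda x: 0.5 * x @ A @ x + b @ x
x = np.array([1.0, 1.0])
g = A @ x + b
print(backtracking_armijo(f, g, x, -g))  # a step with guaranteed decrease
```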

Exact Linesearch

  • not always possible

  • always possible for quadratic functions

    $f(x)=\frac{1}{2}x^TAx+b^Tx+\text{const}$, with $A$ symmetric

    $$\begin{aligned} \phi(\alpha) & = f(x+\alpha d) \\ & = \frac{1}{2}(x+\alpha d)^TA(x+\alpha d)+b^T(x+\alpha d)+\text{const} \\ & = \underbrace{\frac{1}{2}x^TAx+b^Tx+\text{const}}_{=f(x)}+\frac{1}{2}\alpha^2d^TAd+\alpha b^Td+\alpha d^T Ax\\ & = f(x)+\frac{1}{2}\alpha^2 d^TAd+\alpha\underbrace{(Ax+b)^T}_{\equiv \nabla f(x)^T}d \\ & = f(x)+\frac{1}{2}\alpha^2 d^TAd+\alpha \nabla f(x)^Td \end{aligned}$$

At an (unconstrained) minimizer of $\phi$: $0=\phi'(\alpha)=\alpha d^T Ad+\nabla f(x)^T d \iff \alpha =\frac{-\nabla f(x)^Td}{d^TAd}$

  1. $d$ is a descent direction ⇒ $\nabla f(x)^Td \lt 0$

  2. $d^TAd \gt 0$ (e.g. if $A \gt 0$)

if 1 & 2 hold ⇒ $\alpha \gt 0$

If $A \gt 0$ and $d$ is a descent direction (i.e., $\nabla f(x)^Td \lt 0$), then

$\alpha = \frac{-\nabla f(x)^Td}{d^TAd} \gt 0$

(If instead $\nabla f(x)^Td=0$, then $\alpha=0$: no first-order progress is possible along $d$.)
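A numeric sketch of the exact step on a quadratic (the matrix $A \gt 0$, vector $b$, and starting point are assumptions), with a grid check that $\alpha$ indeed minimizes $\phi$:

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])  # A > 0 (illustrative)
b = np.array([-1.0, 0.5])
f = lambda x: 0.5 * x @ A @ x + b @ x

x = np.array([1.0, 1.0])
g = A @ x + b                    # grad f(x)
d = -g                           # a descent direction

alpha = -(g @ d) / (d @ A @ d)   # exact step: -grad f^T d / d^T A d
print(alpha > 0)                 # True: descent direction and d^T A d > 0

# Sanity check: alpha is the grid minimizer of phi(a) = f(x + a d)
grid = np.linspace(0.0, 2.0 * alpha, 1001)
a_best = grid[np.argmin([f(x + a * d) for a in grid])]
print(abs(a_best - alpha) < 1e-2)  # True
```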

Choosing Search Direction d

steepest descent: $d^k=-\nabla f(x^k)\equiv -\nabla f_k$

Always provides descent direction:

$$\begin{aligned}f'(x^k;d^k) &= \nabla f(x^k)^Td^k\\ & = \nabla f(x^k)^T(-\nabla f(x^k)) \\ & = -\nabla f(x^k)^T \nabla f(x^k) \\ & = -\lVert \nabla f(x^k) \rVert^2 \end{aligned}$$

If $\nabla f(x^k) \ne 0$ (i.e., $x^k$ not stationary),

$d^k = -\nabla f_k$ is a descent direction

“Steepest” Descent is “Greedy”

First-order Taylor approximation of $f$ at $x^k$:

$f(x+d) \approx f(x)+\nabla f(x)^Td$ ⇒ $f(x+d)-f(x) \approx \nabla f(x)^Td$

choose $d$ to minimize $f(x+d)-f(x)$,

i.e., $\underset{\lVert d \rVert_2 \le 1}{\min}\ \nabla f(x)^Td$ ⇒ choose $d = \frac{-\nabla f(x)}{\lVert \nabla f(x) \rVert_2}$ (i.e., $\lVert d \rVert_2 =1$)

$$\begin{aligned} \nabla f(x)^Td &= -\nabla f(x)^T\frac{\nabla f(x)}{\lVert \nabla f(x) \rVert_2} \\ & = - \frac{\lVert \nabla f(x) \rVert_2^2}{\lVert \nabla f(x) \rVert_2} \\ & = - \lVert \nabla f(x) \rVert_2\end{aligned}$$

By Cauchy-Schwarz, $-\lVert a \rVert_2 \cdot \lVert b \rVert_2 \le a^Tb \le \lVert a \rVert_2 \cdot \lVert b \rVert_2,\ \forall a,b \in \mathbb{R}^n$; the lower bound is attained exactly when $a$ and $b$ are anti-parallel, which is the choice $d=-\nabla f(x)/\lVert \nabla f(x)\rVert_2$
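A quick numeric check of this greediness claim: among many random unit directions, none has a smaller slope $\nabla f(x)^Td$ than the normalized negative gradient (the gradient vector here is an arbitrary stand-in):

```python
import numpy as np

rng = np.random.default_rng(2)
g = rng.standard_normal(5)                   # stand-in for grad f(x)

d_star = -g / np.linalg.norm(g)              # steepest descent, ||d||_2 = 1
best = g @ d_star                            # equals -||g||_2

# Random unit directions never beat the steepest-descent slope
D = rng.standard_normal((1000, 5))
D /= np.linalg.norm(D, axis=1, keepdims=True)
print(np.all(D @ g >= best - 1e-12))         # True, by Cauchy-Schwarz
print(np.isclose(best, -np.linalg.norm(g)))  # True
```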
