solves $\min_x \tfrac{1}{2}\|Ax - b\|^2$
check whether $x^*$ solves it via the normal equations $A^T A x^* = A^T b$, i.e.,
$A^T (A x^* - b) = 0$
in practice: stop when $\|A^T r^*\| \le \varepsilon$, where $r^* = A x^* - b$
$\min_{x \in \mathbb{R}^n} f(x)$, with $f: \mathbb{R}^n \to \mathbb{R}$ smooth
(later: $x \in C \subseteq \mathbb{R}^n$)
Optimality
$\bar{x}$ is a …
local min of $f$ if $f(\bar{x}) \le f(x)\ \forall x \in B_\varepsilon(\bar{x})$, for some small $\varepsilon > 0$,
where $B_\varepsilon(\bar{x}) = \{x \mid \|x - \bar{x}\|_2 \le \varepsilon\}$
global min of $f$ if $f(\bar{x}) \le f(x)\ \forall x \in \mathbb{R}^n$
$\bar{x}$ is a maximizer of $f$ $\iff$ $\bar{x}$ is a minimizer of $-f$
Ex: lack of attainment
$\lim_{x \to \infty} e^{-x} = 0$
$\inf_x e^{-x} = 0$, but no $x \in \mathbb{R}$ attains the infimum
$e^{-x}$ is not coercive
Definition: A function $f$ is coercive if $\lim_{\|x\| \to \infty} f(x) = \infty$
Equivalently: the level sets $\{x \mid f(x) \le \tau\}$ are compact (closed & bounded) for every $\tau \in \mathbb{R}$
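A quick worked check of the level-set characterization (an added example, not from the lecture), using the two functions just discussed:

```latex
f(x) = \|x\|_2^2:\;\; \{x \mid \|x\|_2^2 \le \tau\} = \{x \mid \|x\|_2 \le \sqrt{\tau}\}
\text{ is a closed ball (compact) for every } \tau \ge 0 \;\Rightarrow\; \text{coercive.}

f(x) = e^{-x} \text{ on } \mathbb{R}:\;\; \{x \mid e^{-x} \le \tau\} = [-\ln \tau,\, \infty)
\text{ for } \tau > 0 \text{ is unbounded} \;\Rightarrow\; \text{not coercive.}
```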
First-order necessary condition
$\bar{x}$ is a local min of $f: \mathbb{R}^n \to \mathbb{R}$ only if $\nabla f(\bar{x}) = 0$
[local min $\bar{x}$ $\Rightarrow$ $\nabla f(\bar{x}) = 0$]
[local max $\bar{x}$ $\Rightarrow$ $\nabla f(\bar{x}) = 0$]
Proof sketch: up to first order,
$f(\bar{x} + \alpha d) - f(\bar{x}) = \nabla f(\bar{x})^T (\alpha d) + o(\alpha \|d\|)$
Because $\bar{x}$ is a local min,
$0 \le \lim_{\alpha \to 0^+} \frac{f(\bar{x} + \alpha d) - f(\bar{x})}{\alpha} = \lim_{\alpha \to 0^+} \left[ \nabla f(\bar{x})^T d + \frac{o(\alpha \|d\|)}{\alpha} \right] = \nabla f(\bar{x})^T d$
This holds for every $d$; applying it to both $d$ and $-d$ forces $\nabla f(\bar{x})^T d = 0$ for all $d$
$\Rightarrow$ $\nabla f(\bar{x}) = 0$
2nd-order sufficient condition
$\bar{x}$ is a local min of $f: \mathbb{R}^n \to \mathbb{R}$ if
$\nabla f(\bar{x}) = 0$ (stationary)
$\nabla^2 f(\bar{x}) \succ 0$ (Hessian positive definite)
[$\nabla f(\bar{x}) = 0$ and $\nabla^2 f(\bar{x}) \succ 0$ $\Rightarrow$ $\bar{x}$ local min]
[Figure: 1-D curve with labeled critical points: $f''(x_{\text{global}}) > 0$, $f''(x_{\text{local}}) > 0$, $f''(x'_{\text{local}}) = 0$, $f''(x_{\text{max}}) < 0$]
Ex: $f(x) = x^4$
$f'(0) = 0$, $f''(0) = 0$, yet $x = 0$ is a (global) min,
so the 2nd-order sufficient condition is not necessary
Definition: A square, symmetric matrix $A$ is positive semi-definite (written $A \succeq 0$) if the following equivalent conditions hold:
$x^T A x \ge 0\ \forall x \in \mathbb{R}^n$
$\lambda_i(A) \ge 0\ \forall i = 1, \dots, n$
$A = R^T R$ for some $R$ (a Cholesky-type factorization exists)
Positive definite ($A \succ 0$):
$x^T A x > 0\ \forall x \ne 0$
$\lambda_i(A) > 0\ \forall i$
$A = R^T R$ with $R$ non-singular
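A minimal numpy sketch (not from the lecture) checking two of the equivalent conditions; the test matrix and tolerance are illustrative:

```python
import numpy as np

def is_positive_definite(A, tol=1e-10):
    """Check two equivalent conditions for A > 0 (A symmetric)."""
    # Condition: all eigenvalues strictly positive.
    eigvals = np.linalg.eigvalsh(A)          # eigenvalues of a symmetric matrix
    by_eigs = np.all(eigvals > tol)
    # Condition: a Cholesky factorization exists (fails if A is not PD).
    try:
        np.linalg.cholesky(A)
        by_chol = True
    except np.linalg.LinAlgError:
        by_chol = False
    return by_eigs and by_chol

A = np.array([[2.0, -1.0], [-1.0, 2.0]])     # PD: eigenvalues 1 and 3
print(is_positive_definite(A))               # True
```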
Proof sketch: up to second order,
$f(\bar{x} + \alpha d) - f(\bar{x}) = \underbrace{\alpha \nabla f(\bar{x})^T d}_{= 0} + \underbrace{\tfrac{1}{2} \alpha^2 d^T \nabla^2 f(\bar{x}) d}_{> 0} + o(\alpha^2 \|d\|^2)$
take $\alpha$ "small", so the positive quadratic term dominates the remainder
$\Rightarrow$ $\bar{x}$ is a local min
2nd-order necessary condition
$\bar{x}$ is a local min of $f: \mathbb{R}^n \to \mathbb{R}$ only if
$\nabla f(\bar{x}) = 0$ (stationary)
$\nabla^2 f(\bar{x}) \succeq 0$ (positive semi-definite)
Eigenpair $(x, \lambda)$ of a square, symmetric $A$:
$A x = \lambda x$
for a unit eigenvector ($\|x\|_2 = 1$): $x^T A x = \lambda\, x^T x = \lambda$
($\lambda > 0$ means $A$ does not flip the direction of $x$)
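A short numpy illustration of this aside (added; the matrix is arbitrary):

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])    # symmetric, for illustration
lam, V = np.linalg.eigh(A)                # eigenpairs of a symmetric matrix
x = V[:, 1]                               # unit eigenvector, ||x||_2 = 1
print(np.allclose(A @ x, lam[1] * x))     # A x = lambda x
print(np.isclose(x @ A @ x, lam[1]))      # x^T A x = lambda
```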
Ex: $\min \tfrac{1}{2} \|r(x)\|^2 = \tfrac{1}{2} \sum_{i=1}^m r_i(x)^2$, with $r_i: \mathbb{R}^n \to \mathbb{R}$
$f(x) = \tfrac{1}{2} \|r(x)\|^2 = \tfrac{1}{2} r_1(x)^2 + \tfrac{1}{2} r_2(x)^2 + \cdots$
$\nabla f(x) = r_1(x) \nabla r_1(x) + r_2(x) \nabla r_2(x) + \cdots = \begin{bmatrix} | & | & & | \\ \nabla r_1(x) & \nabla r_2(x) & \cdots & \nabla r_m(x) \\ | & | & & | \end{bmatrix} \begin{bmatrix} r_1(x) \\ r_2(x) \\ \vdots \\ r_m(x) \end{bmatrix} = J(x)^T r(x)$
stop when $\|J(x)^T r(x)\| = \|\nabla f(x)\|$ is small
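A small sketch verifying $\nabla f = J^T r$ numerically; the residual function here is made up for illustration:

```python
import numpy as np

# Hypothetical residuals r(x) = (x1^2 - x2, x1 + x2 - 1); J is their Jacobian.
def r(x):
    return np.array([x[0]**2 - x[1], x[0] + x[1] - 1.0])

def J(x):
    return np.array([[2*x[0], -1.0],
                     [1.0,     1.0]])

x = np.array([0.5, 2.0])
grad = J(x).T @ r(x)                      # gradient of f(x) = 0.5*||r(x)||^2

# Finite-difference check of the identity grad f = J^T r
f = lambda x: 0.5 * np.dot(r(x), r(x))
eps = 1e-6
fd = np.array([(f(x + eps*e) - f(x - eps*e)) / (2*eps) for e in np.eye(2)])
print(np.allclose(grad, fd, atol=1e-5))   # True
```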
Least Squares (Linear)
$f(x) = \tfrac{1}{2} \|Ax - b\|^2 = \tfrac{1}{2} \|r(x)\|^2$,
where $r_i(x) = a_i^T x - b_i$ and $\nabla r_i(x) = a_i$
$\nabla f(x) = \begin{bmatrix} | & | & & | \\ a_1 & a_2 & \cdots & a_m \\ | & | & & | \end{bmatrix} \begin{bmatrix} a_1^T x - b_1 \\ a_2^T x - b_2 \\ \vdots \\ a_m^T x - b_m \end{bmatrix} = A^T (Ax - b)$,
where $A = \begin{bmatrix} \text{---}\, a_1^T\, \text{---} \\ \text{---}\, a_2^T\, \text{---} \\ \vdots \\ \text{---}\, a_m^T\, \text{---} \end{bmatrix}$ and each $a_i$ is an $n \times 1$ vector.
$\nabla f(x) = 0 \iff A^T A x = A^T b$ (the normal equations)
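A minimal numpy sketch of the normal equations and the $\|A^T r^*\| \le \varepsilon$ optimality check from the top of these notes; the random data is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))          # overdetermined system (illustrative)
b = rng.standard_normal(20)

# Solve the normal equations A^T A x = A^T b
x_star = np.linalg.solve(A.T @ A, A.T @ b)

# Optimality check: ||A^T r*|| should be ~0, where r* = A x* - b
r_star = A @ x_star - b
print(np.linalg.norm(A.T @ r_star))       # ~1e-15

# Agrees with numpy's least-squares solver
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_star, x_lstsq))       # True
```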
Gradient Descent
step size selection (linesearch)
search direction selection
$\min_{x \in \mathbb{R}^n} f(x)$, with $f: \mathbb{R}^n \to \mathbb{R}$ continuously differentiable
Descent Direction
$x_0$ given
for $k = 0, 1, 2, \dots$:
$x_{k+1} = x_k + \alpha_k d_k$
$d_k \in \mathbb{R}^n$: search direction
$\alpha_k \in \mathbb{R}_+$: step length
“Linesearch Methods”
if $f: \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable and $d \in \mathbb{R}^n$ is a descent direction at $x$ (i.e., $f'(x; d) = \nabla f(x)^T d < 0$), then $\exists\, \varepsilon > 0$ such that $f(x + \alpha d) < f(x)$ for all $\alpha \in (0, \varepsilon)$
Proof sketch:
because $f'(x; d) < 0$: $\lim_{\alpha \to 0^+} \frac{f(x + \alpha d) - f(x)}{\alpha} = f'(x; d) < 0$
hence for all sufficiently small $\alpha > 0$, $f(x + \alpha d) < f(x)$
Conceptual Algorithm
$x_0$ given; for $k = 0, 1, 2, \dots$:
compute search direction $d_k$ (given $x_k$)
compute step size $\alpha_k$ such that $f(x_k + \alpha_k d_k) < f(x_k)$
update $x_{k+1} = x_k + \alpha_k d_k$
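A minimal Python sketch of this conceptual loop, instantiated with the steepest-descent direction and a constant step (both choices are discussed below); names and defaults are illustrative:

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, tol=1e-8, max_iter=1000):
    """Conceptual linesearch method: pick d_k, pick alpha_k, update."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        d = -grad_f(x)                    # search direction: steepest descent
        if np.linalg.norm(d) < tol:       # stop when ||grad f|| is small
            break
        x = x + alpha * d                 # update with constant step alpha
    return x

# f(x) = 0.5 ||x||^2 has grad f(x) = x; minimizer at the origin
print(gradient_descent(lambda x: x, [3.0, -4.0]))
```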
Step Size Selection ($\alpha_k$)
constant step: $\alpha_k \equiv \alpha$ ($\alpha > 0\ \forall k$)
exact linesearch (not always possible): choose $\alpha_k$ to solve
$\min_{\alpha \ge 0} \phi(\alpha) := f(x_k + \alpha d_k)$
backtracking linesearch ("Armijo"): reduce $\alpha$ in steps (e.g., $\alpha \leftarrow \alpha / 2$) until $f(x_k + \alpha d_k) < f(x_k)$ (see the sketch below)
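A sketch of the backtracking rule as stated in the notes (plain decrease; Armijo's rule proper would additionally demand a sufficient-decrease margin proportional to $\alpha \nabla f(x)^T d$); parameter names are illustrative:

```python
import numpy as np

def backtracking(f, x, d, alpha0=1.0, shrink=0.5, max_halvings=50):
    """Halve alpha until f(x + alpha d) < f(x)."""
    alpha = alpha0
    fx = f(x)
    for _ in range(max_halvings):
        if f(x + alpha * d) < fx:         # accept: we made progress
            return alpha
        alpha *= shrink                   # otherwise shrink the step
    return alpha

f = lambda x: 0.5 * np.dot(x, x)
x = np.array([1.0, 1.0])
print(backtracking(f, x, d=-x))           # alpha = 1.0 already gives descent
```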
Exact Linesearch on Quadratics
always possible for quadratic functions
$f(x) = \tfrac{1}{2} x^T A x + b^T x + \text{const}$
$\phi(\alpha) = f(x + \alpha d) = \tfrac{1}{2} (x + \alpha d)^T A (x + \alpha d) + b^T (x + \alpha d) + \text{const}$
$= \underbrace{\tfrac{1}{2} x^T A x + b^T x + \text{const}}_{f(x)} + \alpha \underbrace{(Ax + b)}_{\equiv \nabla f(x)}{}^{T} d + \tfrac{1}{2} \alpha^2 d^T A d = f(x) + \alpha \nabla f(x)^T d + \tfrac{1}{2} \alpha^2 d^T A d$
At the (unconstrained) minimizer of $\phi$: $0 = \phi'(\alpha) = \alpha\, d^T A d + \nabla f(x)^T d \iff \alpha = \dfrac{-\nabla f(x)^T d}{d^T A d}$
(a) $d$ is a descent direction $\Rightarrow \nabla f(x)^T d < 0$
(b) $A \succ 0 \Rightarrow d^T A d > 0$
(a) & (b) $\Rightarrow \alpha = \dfrac{-\nabla f(x)^T d}{d^T A d} > 0$
if instead $\nabla f(x)^T d = 0$, the formula gives $\alpha = 0$: the exact step makes no progress along such a $d$
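A numeric sketch of the exact step on a quadratic (the matrix and point are made up for illustration):

```python
import numpy as np

A = np.array([[3.0, 0.0], [0.0, 1.0]])    # A > 0 (illustrative)
b = np.array([-1.0, -1.0])
grad = lambda x: A @ x + b                # grad f(x) = Ax + b

x = np.array([2.0, 2.0])
d = -grad(x)                              # steepest-descent direction
alpha = -(grad(x) @ d) / (d @ A @ d)      # exact step: -grad^T d / d^T A d
print(alpha > 0)                          # True: descent direction and A > 0
x_new = x + alpha * d
print(grad(x_new) @ d)                    # ~0: phi'(alpha) = 0 at the exact step
```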
Choosing Search Direction d
steepest descent: $d_k = -\nabla f(x_k) \equiv -\nabla f_k$
Always provides a descent direction:
$f'(x_k; d_k) = \nabla f(x_k)^T d_k = \nabla f(x_k)^T (-\nabla f(x_k)) = -\|\nabla f(x_k)\|^2$
If $\nabla f(x_k) \ne 0$ (i.e., $x_k$ is not stationary)
$\Rightarrow$ $f'(x_k; d_k) < 0$, so $d_k = -\nabla f_k$ is a descent direction
“Steepest” Descent is “Greedy”
First-order Taylor approximation of $f$ at $x$:
$f(x + d) \approx f(x) + \nabla f(x)^T d \;\Rightarrow\; f(x + d) - f(x) \approx \nabla f(x)^T d$
choose $d$ to minimize $f(x + d) - f(x)$, i.e.,
$\min_{\|d\|_2 \le 1} \nabla f(x)^T d \;\Rightarrow\; \text{choose } d = \dfrac{-\nabla f(x)}{\|\nabla f(x)\|}$ (so $\|d\|_2 = 1$)
Check: $\nabla f(x)^T d = -\dfrac{\nabla f(x)^T \nabla f(x)}{\|\nabla f(x)\|_2} = -\dfrac{\|\nabla f(x)\|_2^2}{\|\nabla f(x)\|_2} = -\|\nabla f(x)\|_2$
By Cauchy-Schwarz, $-\|a\|_2 \|b\|_2 \le a^T b \le \|a\|_2 \|b\|_2\ \forall a, b \in \mathbb{R}^n$, so this $d$ attains the smallest possible value of $\nabla f(x)^T d$ over the unit ball
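A quick numeric check of the greedy claim (added; the gradient stand-in is random): no unit direction does better than the normalized negative gradient.

```python
import numpy as np

rng = np.random.default_rng(1)
g = rng.standard_normal(5)                # stand-in for grad f(x)

d_star = -g / np.linalg.norm(g)           # steepest-descent unit direction
best = g @ d_star                         # = -||g||_2, by the computation above

# Cauchy-Schwarz lower bound: random unit vectors can't beat d_star
ds = rng.standard_normal((1000, 5))
ds /= np.linalg.norm(ds, axis=1, keepdims=True)
print(np.all(ds @ g >= best - 1e-12))     # True
print(np.isclose(best, -np.linalg.norm(g)))  # True
```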