Gradient Descent
$$\min_{x \in C} f(x)$$
$f$: convex and differentiable on the constraint set $C$
The objective function is an “expectation”:
$$f(x) = \frac{1}{m}\big[f_1(x) + f_2(x) + \cdots + f_m(x)\big] = \mathbb{E}_i\, f_i(x),$$
where $i$ is a random variable uniformly distributed over $\{1,\dots,m\}$.
Example: least-squares
$$f(x) = \frac{1}{m}\left[\frac{1}{2}\|Ax-b\|_2^2\right] = \frac{1}{m}\cdot\frac{1}{2}\left\|\begin{bmatrix} a_1^T \\ \vdots \\ a_m^T \end{bmatrix} x - \begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix}\right\|^2 = \frac{1}{m}\sum_{i=1}^m \frac{1}{2}\big(a_i^T x - b_i\big)^2 = \frac{1}{m}\sum_{i=1}^m f_i(x)$$
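The finite-sum structure can be checked numerically; a minimal sketch on synthetic data (the matrix `A`, vector `b`, and sizes `m`, `n` below are illustrative choices, not from the notes):

```python
import numpy as np

# Synthetic least-squares instance (sizes and data are illustrative).
rng = np.random.default_rng(0)
m, n = 100, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)

def f(x):
    """Full objective: f(x) = (1/m) * (1/2) * ||Ax - b||^2."""
    return 0.5 * np.sum((A @ x - b) ** 2) / m

def f_i(x, i):
    """Per-sample term: f_i(x) = (1/2) * (a_i^T x - b_i)^2."""
    return 0.5 * (A[i] @ x - b[i]) ** 2

# f is exactly the average of the m per-sample terms.
assert np.isclose(f(x), np.mean([f_i(x, i) for i in range(m)]))
```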
Gradient Descent
$$x_{k+1} = x_k - \alpha_k \nabla f(x_k)$$
The main cost at each iteration $k$ is computing $\nabla f(x_k) = \frac{1}{m}\sum_{i=1}^m \nabla f_i(x_k)$.
Example: for least squares,
$$\nabla f(x) = \nabla\left[\frac{1}{2m}\|Ax-b\|^2\right] = \frac{1}{m}A^T(Ax-b) = \frac{1}{m}A^T r = \frac{1}{m}\begin{bmatrix} a_1 & \cdots & a_m \end{bmatrix}\begin{bmatrix} r_1 \\ \vdots \\ r_m \end{bmatrix} = \frac{1}{m}\sum_{i=1}^m a_i r_i,$$
where $r = Ax - b$ is the residual.
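The matrix form $\frac{1}{m}A^T r$ and the sum form $\frac{1}{m}\sum_i a_i r_i$ of the gradient agree; a quick numerical check (synthetic data, all names are mine):

```python
import numpy as np

# Synthetic least-squares instance (sizes and data are illustrative).
rng = np.random.default_rng(1)
m, n = 100, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)

r = A @ x - b                                       # residual vector
grad_matrix = A.T @ r / m                           # (1/m) A^T (Ax - b)
grad_sum = sum(A[i] * r[i] for i in range(m)) / m   # (1/m) sum_i a_i r_i
assert np.allclose(grad_matrix, grad_sum)
```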
Approximate the Gradient
Stochastic Gradient Method
Randomly sample $i \in \{1,\dots,m\}$.
The stochastic gradient approximation to $\nabla f(x)$ is $g = \nabla f_i(x)$.
Taking the expectation over $i$,
$$\mathbb{E}\, g = \sum_{i=1}^m \frac{1}{m}\nabla f_i(x) = \frac{1}{m}\sum_{i=1}^m \nabla f_i(x) = \nabla f(x),$$
so $g$ is an unbiased estimate of the gradient.
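Unbiasedness can be verified by averaging $\nabla f_i(x)$ exactly over all $i$; a sketch on synthetic least-squares data (names are illustrative):

```python
import numpy as np

# Synthetic least-squares instance (sizes and data are illustrative).
rng = np.random.default_rng(2)
m, n = 50, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)

def grad_f(x):
    """Full gradient: (1/m) A^T (Ax - b)."""
    return A.T @ (A @ x - b) / m

def grad_fi(x, i):
    """Stochastic gradient: grad f_i(x) = a_i (a_i^T x - b_i)."""
    return A[i] * (A[i] @ x - b[i])

# E g = (1/m) sum_i grad f_i(x): the exact average over all i equals grad f(x).
Eg = np.mean([grad_fi(x, i) for i in range(m)], axis=0)
assert np.allclose(Eg, grad_f(x))
```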
$$x_{k+1} = x_k - \alpha_k g_k$$
Cost at iteration $k$: choose $i_k \in \{1,\dots,m\}$ and compute $g_k := \nabla f_{i_k}(x_k)$.
Variant: instead of sampling a single element of $\{1,\dots,m\}$, sample a “batch” $B_k \subseteq \{1,\dots,m\}$. The $k$th stochastic gradient estimate is
$$g_k = \frac{1}{|B_k|}\sum_{i \in B_k} \nabla f_i(x_k).$$
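A minimal mini-batch SGD loop on the least-squares problem; the data, step size, batch size, and iteration count below are illustrative choices:

```python
import numpy as np

# Synthetic least-squares instance (sizes and data are illustrative).
rng = np.random.default_rng(3)
m, n = 200, 10
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

def f(x):
    """Objective: (1/m) * (1/2) * ||Ax - b||^2."""
    return 0.5 * np.sum((A @ x - b) ** 2) / m

x = np.zeros(n)
alpha, batch = 0.01, 20           # illustrative step size and batch size
f0 = f(x)
for k in range(2000):
    Bk = rng.choice(m, size=batch, replace=False)   # sample batch B_k
    gk = A[Bk].T @ (A[Bk] @ x - b[Bk]) / batch      # (1/|B_k|) sum_{i in B_k} grad f_i(x)
    x = x - alpha * gk

assert f(x) < f0                  # the objective decreased from its starting value
```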
Convergence of function values
We track $f(x_k)$ in expectation, writing $f_k := f(x_k)$ and $\nabla f_k := \nabla f(x_k)$.
By the descent lemma, if $f$ has an $L$-Lipschitz continuous gradient, i.e.,
$$\|\nabla f(x) - \nabla f(y)\|_2 \le L\,\|x - y\|_2 \quad \text{for some } L > 0 \text{ and all } x, y,$$
then
$$f(z) \le f(x) + \nabla f(x)^T(z - x) + \frac{L}{2}\|z - x\|^2 \quad \forall x, z,$$
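For the least-squares objective the descent lemma holds with $L = \lambda_{\max}(A^T A/m)$; a numerical spot check on synthetic data (names are illustrative):

```python
import numpy as np

# Synthetic least-squares instance (sizes and data are illustrative).
rng = np.random.default_rng(5)
m, n = 100, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

def f(x):
    return 0.5 * np.sum((A @ x - b) ** 2) / m

def grad_f(x):
    return A.T @ (A @ x - b) / m

# For this quadratic, grad f is L-Lipschitz with L = lambda_max(A^T A / m).
L = np.linalg.eigvalsh(A.T @ A / m).max()

for _ in range(100):
    x, z = rng.standard_normal(n), rng.standard_normal(n)
    quad = f(x) + grad_f(x) @ (z - x) + 0.5 * L * np.sum((z - x) ** 2)
    assert f(z) <= quad + 1e-12   # descent-lemma upper bound holds at (x, z)
```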
so take $z = x_{k+1}$, $x = x_k$:
$$f_{k+1} \le f_k + \nabla f_k^T(x_{k+1} - x_k) + \frac{L}{2}\|x_{k+1} - x_k\|^2.$$
For SGD with constant step size, $x_{k+1} = x_k - \alpha g_k$, so $x_{k+1} - x_k = -\alpha g_k$ and
$$f_{k+1} \le f_k - \alpha\, \nabla f_k^T g_k + \frac{L\alpha^2}{2}\|g_k\|^2.$$
Assumption: the mean-squared error of the stochastic approximation is bounded,
$$\mathbb{E}\big[\|g_k - \nabla f_k\|^2\big] = \mathbb{E}\|g_k\|^2 - \|\nabla f_k\|^2 \le \sigma^2 \quad \forall k$$
for some fixed $\sigma > 0$ (the first equality uses $\mathbb{E}\, g_k = \nabla f_k$), which gives $\mathbb{E}\|g_k\|^2 \le \sigma^2 + \|\nabla f_k\|^2$.
Take expectations of both sides:
$$\mathbb{E} f_{k+1} \le \mathbb{E} f_k - \alpha\, \mathbb{E}\|\nabla f_k\|^2 + \frac{L\alpha^2}{2}\mathbb{E}\|g_k\|^2 \le \mathbb{E} f_k - \alpha\, \mathbb{E}\|\nabla f_k\|^2 + \frac{L\alpha^2}{2}\big(\sigma^2 + \mathbb{E}\|\nabla f_k\|^2\big) = \mathbb{E} f_k - \alpha\Big(1 - \frac{L\alpha}{2}\Big)\mathbb{E}\|\nabla f_k\|^2 + \frac{L\alpha^2}{2}\sigma^2 \le \mathbb{E} f_k - \frac{\alpha}{2}\mathbb{E}\|\nabla f_k\|^2 + \frac{L\alpha^2}{2}\sigma^2 \quad \text{if } \alpha \le \frac{1}{L}.$$
Sum and telescope over $k = 0, 1, \dots, T-1$:
$$\mathbb{E} f_T \le f(x_0) - \frac{\alpha}{2}\sum_{k=0}^{T-1}\mathbb{E}\big[\|\nabla f_k\|^2\big] + \frac{L\alpha^2\sigma^2}{2}T.$$
Since $\mathbb{E} f_T \ge f^*$, rearrange and divide both sides by $\frac{\alpha T}{2}$:
$$\frac{1}{T}\sum_{k=0}^{T-1}\mathbb{E}\|\nabla f_k\|^2 \le \frac{2\big(f(x_0) - f^*\big)}{\alpha T} + \underbrace{L\alpha\sigma^2}_{\text{error term}}.$$
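The bound can be checked empirically on least squares by averaging SGD runs and plugging in computable values of $L$, $f^*$, and an observed $\sigma^2$; a sketch with synthetic data (all names, sizes, and the choice $\alpha = 0.5/L$ are illustrative):

```python
import numpy as np

# Synthetic least-squares instance (sizes and data are illustrative).
rng = np.random.default_rng(4)
m, n = 100, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

def f(x):
    return 0.5 * np.sum((A @ x - b) ** 2) / m

def grad_f(x):
    return A.T @ (A @ x - b) / m

L = np.linalg.eigvalsh(A.T @ A / m).max()          # Lipschitz constant of grad f
f_star = f(np.linalg.lstsq(A, b, rcond=None)[0])   # optimal value f*

T, runs = 400, 50
alpha = 0.5 / L                                    # step size <= 1/L, as the analysis requires
x0 = np.zeros(n)

avg_sq, sigma2 = 0.0, 0.0
for _ in range(runs):
    x = x0.copy()
    for k in range(T):
        avg_sq += np.sum(grad_f(x) ** 2)
        # Empirical mean-squared error of the stochastic gradient at this iterate;
        # sigma2 keeps the largest value seen, a stand-in for the assumed bound.
        g_all = A * (A @ x - b)[:, None]           # row i is grad f_i(x)
        sigma2 = max(sigma2, np.mean(np.sum((g_all - grad_f(x)) ** 2, axis=1)))
        i = rng.integers(m)                        # one-sample SGD step
        x = x - alpha * A[i] * (A[i] @ x - b[i])
avg_sq /= runs * T

bound = 2 * (f(x0) - f_star) / (alpha * T) + L * alpha * sigma2
assert avg_sq <= bound   # averaged squared gradient norm sits below the bound
```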