Stochastic Gradient Descent (SGD)

Gradient Descent

minxCf(x)\min_{x\in C}f(x)

ff: convex differentiable CC

objective function is an “expectation”

f(x)=1m[f1(x)+f2(x)++fm(x)]=Eifi(x)f(x)=\frac{1}{m}\left[f_1(x)+f_2(x)+\cdots+f_m(x)\right]=\mathbb{E}_i f_i(x),

where ii is random variable uniform distributed over {1,,m}\left \{1,\cdots,m\right \}

Example: least-squares

f(x)=1m[12Axb22]=1m[12[a1TamT]x[b1bm]2]=1mi=1m12(aiTxbi)2=1mi=1mfi(x)\begin{aligned}f(x)&=\frac{1}{m}\left[\frac{1}{2}\lVert Ax-b\rVert_2^2\right] \\&=\frac{1}{m}\left[\frac{1}{2}\lVert \begin{bmatrix}a_1^T \\ \vdots \\ a_m^T\end{bmatrix}x-\begin{bmatrix}b_1 \\ \vdots \\ b_m\end{bmatrix}\rVert^2\right] \\ &=\frac{1}{m}\sum_{i=1}^m \frac{1}{2}(a_i^Tx-b_i)^2 \\ &=\frac{1}{m}\sum_{i=1}^mf_i(x)\end{aligned}

Gradient Descent

xk+1=xkαkf(xk)x_{k+1}=x_k-\alpha_k\nabla f(x_k)

Main cost at each iteration kk is f(xk)=1mi=1mfi(x)\nabla f(x_k)=\frac{1}{m}\sum_{i=1}^m\nabla f_i(x)

Ex: for LS,

f(x)=[12Axb2]=AT(Axb)=ATr=[a1am][r1rm]=i=1mairi\begin{aligned}\nabla f(x)&=\nabla \left[ \frac{1}{2}\lVert Ax-b\rVert^2\right] \\ &=A^T(Ax-b) \\ &=A^Tr \\ &=\begin{bmatrix}a_1 \cdots a_m\end{bmatrix}\begin{bmatrix}r_1 \\ \vdots \\ r_m\end{bmatrix}=\sum_{i=1}^m a_ir_i\end{aligned}

Approximate the Gradient

randomly sample i{1,,m}i\in\left \{1,\cdots,m\right \}

Stochastic gradient approx to f(x)\nabla f(x) is g=fi(x)g=\nabla f_i(x)

Eg=i=1m1mfi(x)=1mi=1mfi(x)f(x)\begin{aligned}\mathbb{E}g & =\sum_{i=1}^m \frac{1}{m}\nabla f_i(x) \\ &=\frac{1}{m}\sum_{i=1}^m\nabla f_i(x) \\ &\equiv \nabla f(x)\end{aligned}

Stochastic Gradient Method

xk+1=xkαkgkx_{k+1}=x_k-\alpha_kg_k

cost at iteration kk: choose ik{1,,m}i_k \in \left \{ 1,\ldots, m\right\}, compute gk:=fik(x)g_k:=\nabla f_{i_k}(x)

Variant: instead of sampling single element {1,,m}\left \{1,\cdots,m\right \}, sample a “batch” Bk{1,,m}B_k \subseteq \left \{ 1,\ldots, m\right \}

kkth Stochastic gradient estimation: gk=1BkiBkfi(x)g_k=\frac{1}{\lvert B_k\rvert}\sum_{i\in B_k}\nabla f_i(x)

Convergence of function values f(xk)f(x_k) in expectation:

fk:=f(xk)f_k:=f(x_k)

fk:=f(xk)\nabla f_k := \nabla f(x_k)

By the descent lemma, if ff has an L-Lip continuous gradient

f(x)f(y)2Lxy2\lVert \nabla f(x)-\nabla f(y)\rVert_2 \le L\cdot \lVert x-y\rVert_2 for some L>0,x,yL \gt 0, \forall x,y

f(z)f(x)+f(x)T(zx)+L2zx2,x,zf(z)\le f(x)+\nabla f(x)^T(z-x)+\frac{L}{2}\lVert z-x\rVert^2, \forall x,z

so take z=xk+1z=x_{k+1}, x=xkx=x_k:

fk+1fk+fkT(xk+1xk)+L2xk+1xk2f_{k+1}\le f_k+\nabla f_k^T(x_{k+1}-x_k)+\frac{L}{2}\lVert x_{k+1}-x_k\rVert^2

SGD: xk+1=xkαgkx_{k+1}=x_k-\alpha g_kxk+1xk=αgkx_{k+1}-x_k=-\alpha g_k

fk+1fkαfkTgk+Lα22gk2f_{k+1}\le f_k-\alpha \nabla f_k^Tg_k+\frac{L\alpha^2}{2}\lVert g_k\rVert^2

Take expectations of both sides:

Efk+1EfkαEfk2+Lα22Egk2EfkαEfk2+Lα22(σ2+Efk2)Efkα(1Lα2)Efk2+Lα2σ22Efkα2Efk2+Lα2σ22 if α1L...\begin{aligned}\mathbb{E}f_{k+1} &\le \mathbb{E}f_k-\alpha\mathbb{E}\lVert \nabla f_k\rVert^2+\frac{L\alpha^2}{2}\mathbb{E}\lVert g_k\rVert^2 \\ &\le \mathbb{E}f_k-\alpha\mathbb{E}\lVert \nabla f_k\rVert^2+\frac{L\alpha^2}{2}(\sigma^2+\mathbb{E}\lVert \nabla f_k\rVert^2) \\ &\le \mathbb{E}f_k-\alpha(1-\frac{L\alpha}{2})\mathbb{E}\lVert \nabla f_k\rVert^2+\frac{L\alpha^2\sigma^2}{2} \\ &\le \mathbb{E}f_k-\frac{\alpha}{2}\mathbb{E}\lVert \nabla f_k\rVert^2+\frac{L\alpha^2\sigma^2}{2} \text{ if } \alpha \le \frac{1}{L}...\end{aligned}

Assumption: need to assume mean-squared error in stochastic approx is bounded:

E[gkfk2]=E[gk2]fk2σ2,k\mathbb{E}\left[\lVert g_k-\nabla f_k \rVert^2\right]=\mathbb{E}\left[\lVert g_k\rVert^2\right]-\lVert \nabla f_k\rVert^2 \le \sigma^2 ,\forall k

for some σ>0\sigma \gt 0 fixed ⇒ E[gk2]σ2+fk2\mathbb{E}\left[\lVert g_k\rVert^2\right] \le \sigma^2+\lVert \nabla f_k\rVert^2

Sum and reverse for all k=0,1,2,,Tk=0,1,2,\cdots,T

Efkf(x0)α2k=1T1E[fk2]+α2σLT2\mathbb{E}f_k \le f(x_0)-\frac{\alpha}{2}\sum_{k=1}^{T-1}\mathbb{E} \left[\lVert \nabla f_k \rVert^2\right]+\frac{\alpha^2\sigma L T}{2}

Divide both sides by αT2\alpha \frac{T}{2},

1Tk=0T1Efk22(f(x0)f)αT+ασLerror term\frac{1}{T}\sum_{k=0}^{T-1}\mathbb{E} \lVert \nabla f_k\rVert^2 \le\frac{2(f(x_0)-f^*)}{\alpha T}+\underbrace{\alpha \sigma L}_{\text{error term}}.

Last updated