f:Rn→R
Assume ith-partial derivative exists:
δxiδf(x)=ε→0limεf(x+εei)−f(x)
ei=0⋮10⋮0ith position
The gradient of f is the collection of partials:
∇f(x)=δf(x)/δx1δf(x)/δx2⋮δf(x)/δxn∈Rn Directional derivative of at
x∈Rn
in the direction d∈Rn
f’(x,d)=ε→0limεf(x+εd)−f(x)
The function f is differentiable of x∈Rn if f’(x,d) exists for all of Rn, and f’(x,d)=∇f(x)Td
Fact: the direction derivative is positively homogeneous of degree 1, i.e., f’(x,αd)=αf’(x,d),∀α≥0
Special Directions
d=d1d2⋮dn→ d=ei
f’(x,ei)=∇f(x)Tei=[∇f(x)]i=δxiδf(x)
Calculus Rules for
Rn
f:Rn→R
Linear function: f(x)=aTx=∑i=1naixi
∇f(x)=a1a2⋮an=a
Quadratic functions:
f(x)=xTQx+cTx+α
Q is n×n matrix, c∈Rn and α∈R
∇f(x)=Qx+QTx+c=(Q+QT)x+c
If Q is symmetric, i.e., Q=QT, ∇f(x)=2Qx+c
WLOG, assume Q symmetric.
If not, observe:
f(x)=xTQx+cTx+α=21xTQx+21xTQx+cTx+α=21xTQTx+21xTQx+cTx+α=xT(21QT+21Q)x+cTx+α=xTQx+cTx+α
where Q=21QT+21Q, called “symmetric part of Q”
Directional Derivative:
f’(x,d)=∇f(x)Td
Computing Gradients
Numerical → ”finite differencing”
Automatic Differentiation
Numerical:
Taylor’s Theorem:
f(x+εd)=f(x)+ε∇f(x)Td+O(ε), g∈O(ε) if ε→0limεg(ε)=0
f’(x;d)=∇f(x)Td≅εf(x+εd)−f(x)
Choose d=ei
δxiδf(x)≅εf(x+εei)−f(x) “forward finite differencing”
Central Differencing:
δxiδf(x)=2εf(x+εei)−f(x−εei)+O(ε2)
Symbolic Differentiation
δxδ(f1(x)+f2(x))=δxδf1(x)+δxδf2(x)
δxδ(f1(x)⋅f2(x))=(δxδf1(x))f2(x)+(δxδf2(x))f1(x)
δxδf1(f2(x))=δf2(x)δf1(f2(x))⋅δxδf2(x)
f:Rn→R
∇f(x)=δx1δf(x)⋮δxnδf(x)
Example f(x1,x2)=ln(x1)+x1x2−sin(x2)
δx1δf(x1,x2)=x11+x2
δx2δf(x1,x2)=x1−cos(x2)
Computational Graph
input (x1,x2)=(2,5)
v1=x1=2
v2=x2=5
v3=ln(v1)=ln(2)
v4=v1⋅v2=2⋅5=10
v5=sin(v2)=sin(5)=−0.96
v6=v3+v4=10.69
v7=v6−v5=11.65
y=v7
Forward Mode Auto Diff
Let v˙i=δx1δvi (for gradient with respect to x1, repeat for x2)
v˙1=δx1δv1=1
v˙2=δx1δv2=0
v˙3=δx1δv3=δv1δv3⋅δx1δv1=δv1δv3v˙1=v11⋅v˙1=21
v˙4=δx1δv4=δv1δv4⋅δx1δv1+δv2δv4⋅δx1δv2=δv1δv4⋅v˙1+δx2δv4v˙2=5
v˙5=cos(v2)⋅v˙2=0
v˙6=v˙3+v˙4=21+5
v˙7=v˙6−v˙5=5.5−0=5.5
Reverse Mode Auto Diff
Let vˉi=δviδy ("adjoint")
vˉ7=δv7δy=1
vˉ6=δv6δy=δv7δy⋅δv6δv7=vˉ7⋅δv6δv7=1⋅1=1
vˉ5=δv5δy=δv7δy⋅δv5δv7=1⋅(−1)=−1
vˉ4=δv4δy=vˉ6⋅δv4δv6=1⋅1=1
vˉ3=δv3δy=vˉ6⋅δv3δv6=1⋅1=1
vˉ2=δv2δy=vˉ4⋅δv2δv4+vˉ5δv2δv5=2+(−1)⋅cos(5)=2−0.28=1.72
δx1δf(x1,x2)=δv1δy=vˉ1=vˉ3⋅δv1δv3+vˉ4⋅δv1δv4=1⋅21+1⋅5=5.5