Gradient descent

Recently, I was studying neural networks and deep learning on Coursera. I realised that it is crucial to understand gradient descent and backward propagation when exploring neural network training algorithms, so I decided to write an article showing the details for a basic linear, sigmoid neural net.

Gradient descent is a numerical method for finding a local minimum of a function. It is widely used in supervised learning to find the optimal values of the training parameters.

First, start with the quadratic function $y = (x-1)^2$. It is trivial to see that the minimum is at $(1, 0)$. Assuming the starting point is $(0, 1)$, we then use gradient descent to find the target optimal value.

First, let $(x, y) = (0, 1)$. The derivative $\frac{dy}{dx} = 2(x-1)$ at $x=0$ is $-2$. Then update $x$ with

$$x := x - \alpha\frac{dy}{dx}$$

The $\alpha$ here is called the learning rate in deep learning, and it is chosen as a hyperparameter by the user. Let $\alpha=0.1$ in this case; then $x$ will be updated to $x := 0 - 0.1 \times (-2) = 0.2$. Substituting $x = 0.2$, the derivative at $x=0.2$ is $-1.6$, so $x$ will be updated to 0.36. By repeating this process for enough iterations (the number of iterations is also a hyperparameter chosen by the user), the final value will be close enough to the minimum at $x = 1$.
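
To make the iteration concrete, here is a minimal Python sketch of the same process; the learning rate, starting point, and derivative come from the example above, while the iteration count is an arbitrary choice.

```python
def gradient_descent(alpha=0.1, num_iterations=25, x=0.0):
    """Minimise y = (x - 1)**2 from a given starting point with a fixed learning rate."""
    for _ in range(num_iterations):
        dy_dx = 2 * (x - 1)      # derivative of (x - 1)**2
        x = x - alpha * dy_dx    # the update rule x := x - alpha * dy/dx
    return x

print(gradient_descent())  # first iterates: 0.2, 0.36, 0.488, ...; converges towards x = 1
```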

Now let's apply gradient descent to a neural net.

One hidden layer neural net

Single Training Example

First, define the input layer $x$ as $a^{[0]}$ for a single training example, and $X$ as $A^{[0]}$ for all training examples. Similarly, the output value is $\hat{y} = a^{[L]}$ for a single example, and $\hat{Y} = A^{[L]}$ for all training examples. Then for each layer, we have

$$z^{[i]} = w^{[i]}a^{[i-1]} + b^{[i]}$$

$$a^{[i]} = g^{[i]}(z^{[i]})$$

And the loss function for a single example is

$$L(a^{[L]}, y) = -\left(y\log a^{[L]} + (1-y)\log(1-a^{[L]})\right)$$

where $y$ is the labelled output in the training set.
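
As a sketch of how the forward pass and the loss look in code, assuming NumPy, sigmoid activations $g^{[i]}$ in every layer, and hypothetical lists `Ws` and `bs` holding $w^{[1..L]}$ and $b^{[1..L]}$:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(x, Ws, bs):
    """Forward pass for a single training example x (a column vector).

    Returns the list of activations [a^[0], a^[1], ..., a^[L]].
    """
    activations = [x]                     # a^[0] = x
    for w, b in zip(Ws, bs):
        z = w @ activations[-1] + b       # z^[i] = w^[i] a^[i-1] + b^[i]
        activations.append(sigmoid(z))    # a^[i] = g^[i](z^[i]) with sigmoid g
    return activations

def loss(a_L, y):
    """Cross-entropy loss L(a^[L], y) for a single example."""
    return float(-(y * np.log(a_L) + (1 - y) * np.log(1 - a_L)))
```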

Since we need to search for the optimal values of $w^{[l]}$ and $b^{[l]}$, to apply gradient descent we need to find the derivatives in each layer. We use $dx=\frac{\partial L}{\partial x}$ to simplify the notation.

First

$$da^{[L]}=\frac{\partial L}{\partial a^{[L]}}=\frac{1-y}{1-a^{[L]}} - \frac{y}{a^{[L]}}$$

If the activation function $g^{[L]}$ is the sigmoid function, then $g' = g(1-g)$. We can calculate $dz^{[L]}$ with the chain rule

$$dz^{[L]}=da^{[L]}\frac{\partial a^{[L]}}{\partial z^{[L]}}=da^{[L]} \cdot a^{[L]}(1-a^{[L]})=a^{[L]}-y$$
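
Substituting the expression for $da^{[L]}$ makes the last equality explicit:

$$dz^{[L]}=\left(\frac{1-y}{1-a^{[L]}}-\frac{y}{a^{[L]}}\right)a^{[L]}\left(1-a^{[L]}\right)=(1-y)\,a^{[L]}-y\left(1-a^{[L]}\right)=a^{[L]}-y$$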

It is trivial to see that

$$dw^{[L]}=a^{[L-1]}dz^{[L]}$$

$$db^{[L]}=dz^{[L]}$$

$$da^{[L-1]}=w^{[L]}dz^{[L]}$$

Applying this form recursively backward, we can get the derivatives for all variables:

$$dz^{[l]}=da^{[l]}g{'}(z^{[l]})$$

$$dw^{[l]}=a^{[l-1]}dz^{[l]}$$

$$db^{[l]}=dz^{[l]}$$

$$da^{[l-1]}=w^{[l]}dz^{[l]}$$
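
As a minimal sketch, this recursion can be transcribed directly into Python. To keep the shapes out of the way, it assumes one unit per layer (so every quantity is a plain float) and sigmoid activations throughout; the lists `activations` and `ws` are hypothetical outputs of a forward pass.

```python
def backward_scalar(activations, ws, y):
    """Backward pass transcribing the recursion above.

    Assumes one unit per layer, so every a^[l], w^[l], b^[l] is a plain float.
    activations = [a0, a1, ..., aL] from a forward pass with sigmoid layers;
    ws = [w1, ..., wL].
    """
    L = len(ws)
    dz = activations[L] - y                       # dz^[L] = a^[L] - y
    dws, dbs = [0.0] * L, [0.0] * L
    for l in range(L, 0, -1):                     # l = L, L-1, ..., 1
        a_prev = activations[l - 1]
        dws[l - 1] = a_prev * dz                  # dw^[l] = a^[l-1] dz^[l]
        dbs[l - 1] = dz                           # db^[l] = dz^[l]
        if l > 1:
            da_prev = ws[l - 1] * dz              # da^[l-1] = w^[l] dz^[l]
            dz = da_prev * a_prev * (1 - a_prev)  # dz^[l-1] = da^[l-1] g'(z^[l-1])
    return dws, dbs
```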

Vectorisation with m Training Examples

The above equations show a recursive way to calculate the derivatives for a single training example. For m training examples, we replace each vector with a matrix of m columns, where each column corresponds to one training example.

First

$$Z^{[i]}=W^{[i]}A^{[i-1]} + b^{[i]}$$

$$A^{[i]}=g^{[i]}(Z^{[i]})$$

and the cost function is

$$J = \frac{1}{m}\sum_{i=1}^{m}L(a^{[L](i)}, y^{(i)})$$

Since each column is one training example and the examples are independent of each other, we have $dZ^{[L]}=A^{[L]}-Y$. Similarly, we can derive the recursive form of the vectorised equations, where the product with $g'$ is element-wise.

$$dZ^{[l]}=dA^{[l]}g{'}(Z^{[l]})$$

$$dW^{[l]}=\frac{1}{m}dZ^{[l]}{A^{[l-1]}}^{T}$$

$$db^{[l]}=\frac{1}{m}\sum_{i=1}^{m}dZ^{[l](i)}$$

$$dA^{[l-1]}={W^{[l]}}^{T}dZ^{[l]}$$

With this recursive formula, we can calculate derivatives for each layer and update $W$ and $b$ accordingly.
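
Putting the vectorised formulas together, here is a sketch of one training iteration in NumPy. The parameter lists `Ws` and `bs`, the `sigmoid` helper, and the layer sizes in the usage snippet are assumptions for illustration, with sigmoid activations in every layer.

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def train_step(X, Y, Ws, bs, alpha=0.01):
    """One forward/backward pass over all m examples (the columns of X),
    followed by a gradient descent update of every W^[l] and b^[l].

    Ws and bs are hypothetical lists holding W^[1..L] and b^[1..L].
    """
    m = X.shape[1]

    # Forward: A^[i] = g(W^[i] A^[i-1] + b^[i]), starting from A^[0] = X
    As = [X]
    for W, b in zip(Ws, bs):
        As.append(sigmoid(W @ As[-1] + b))

    # Backward: start from dZ^[L] = A^[L] - Y
    dZ = As[-1] - Y
    for l in range(len(Ws), 0, -1):
        A_prev = As[l - 1]
        dW = (dZ @ A_prev.T) / m                     # dW^[l] = (1/m) dZ^[l] A^[l-1]^T
        db = np.sum(dZ, axis=1, keepdims=True) / m   # db^[l]
        if l > 1:
            dA_prev = Ws[l - 1].T @ dZ               # dA^[l-1] = W^[l]^T dZ^[l]
            dZ = dA_prev * A_prev * (1 - A_prev)     # dZ^[l-1] = dA^[l-1] * g'(Z^[l-1])
        Ws[l - 1] -= alpha * dW                      # W^[l] := W^[l] - alpha dW^[l]
        bs[l - 1] -= alpha * db                      # b^[l] := b^[l] - alpha db^[l]
    return Ws, bs

# Hypothetical usage: 3 inputs, 4 hidden units, 1 output, m = 5 examples
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 3)) * 0.01, rng.standard_normal((1, 4)) * 0.01]
bs = [np.zeros((4, 1)), np.zeros((1, 1))]
X = rng.standard_normal((3, 5))
Y = rng.integers(0, 2, size=(1, 5)).astype(float)
Ws, bs = train_step(X, Y, Ws, bs)
```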