Backpropagation is an algorithm for training artificial neural networks. It computes the gradient of a loss function with respect to every parameter in the network (the weights and biases) and uses those gradients to adjust the parameters, minimizing the loss and thereby improving the network's performance.
Backpropagation is based on the chain rule, which lets it compute the gradients of every layer efficiently. It is also the core algorithm behind most deep learning frameworks (TensorFlow, PyTorch, Keras, and so on).
Throughout this article we will use the mean squared error (MSE) as the loss function.
Simple Network
Let's start with the simplest possible neural network, where every layer has exactly one neuron.
First, here is what each symbol means:
- $l$: the layer index; for example, $L$ denotes the last layer and $L-1$ the second-to-last layer
- $n$: the number of neurons in a layer; for example, layer $L$ here has one neuron, so $n^{(L)} = 1$
- $a$: a neuron's activation value; for example, $a_0^{(L-1)}$ is the value of the first neuron in the second-to-last layer, which here is +0.15
- $j$, $k$: $j$ indexes neurons in the current layer and $k$ indexes neurons in the previous layer
- $w$: a weight; for example, $w_{jk}$ is the weight connecting the $j$-th neuron of the current layer to the $k$-th neuron of the previous layer
Basic Formulas
This gives us the following formula:
$$a_0^{(L)} = \sigma(w_{00}^{(L)} \cdot a_0^{(L-1)} + b_0^{(L)})$$
(The activation function here is the sigmoid, $\sigma(x) = {1 \over 1 + e^{-x}}$, but any other activation would work as well.)
Now let $$z_0^{(L)} = w_{00}^{(L)} \cdot a_0^{(L-1)} + b_0^{(L)}$$
so that $a_0^{(L)}$ becomes:
$$a_0^{(L)} = \sigma(z_0^{(L)})$$
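To make this concrete, here is a minimal sketch in Python/NumPy of this forward computation. The values of `w` and `b` are made up for illustration; only `a_prev = 0.15` echoes the activation mentioned above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical values for illustration.
a_prev = 0.15   # a_0^(L-1), activation of the only neuron in layer L-1
w = 0.5         # w_00^(L)
b = 0.1         # b_0^(L)

z = w * a_prev + b   # z_0^(L)
a = sigmoid(z)       # a_0^(L) = sigma(z_0^(L))
print(a)             # roughly 0.544
```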
Loss Value
Next we get the loss $L_0$ (the subscript 0 of $L$ refers to the 0-th sample):
$$L_0 = \frac {(a_{00}^{(L)} - y_{00})^2}{2}$$
(The factor $1 \over 2$ is there so that the constant 2 cancels when we differentiate $L_0$.)
The loss for the $i$-th sample is then $L_i$:
$$L_i = \frac {(a_{i0}^{(L)} - y_{i0})^2}{2}$$
and the total loss over $n$ samples is the average $L = {1 \over n} \sum_{i = 0}^{n-1} L_i$.
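As a quick sanity check, the per-sample and total losses take only a couple of lines of code (the outputs and targets below are invented):

```python
import numpy as np

a_out = np.array([0.53, 0.61, 0.48])   # hypothetical a_i0^(L) for three samples
y = np.array([1.0, 0.0, 1.0])          # hypothetical targets y_i0

L_i = (a_out - y) ** 2 / 2   # per-sample losses L_i
L = L_i.mean()               # total loss L = (1/n) * sum of L_i
```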
Since backpropagation boils down to finding the gradient with respect to each weight ($w$) and bias ($b$), the next step is simply to take the partial derivatives of the loss $L_0$ with respect to $w_{00}^{(L)}$ and $b_0^{(L)}$, which the chain rule splits into the following two formulas:
$$\frac {\partial L_0}{\partial w_{00}^{(L)}} = \frac {\partial L_0}{\partial a_{00}^{(L)}} \cdot \frac {\partial a_{00}^{(L)}}{\partial z_{00}^{(L)}} \cdot \frac {\partial z_{00}^{(L)}}{\partial w_{00}^{(L)}} $$
$$\frac {\partial L_0}{\partial b_0^{(L)}} = \frac {\partial L_0}{\partial a_{00}^{(L)}} \cdot \frac {\partial a_{00}^{(L)}}{\partial z_{00}^{(L)}} \cdot \frac {\partial z_{00}^{(L)}}{\partial b_0^{(L)}} $$
This looks a bit complicated, so let's compute each partial derivative in turn:
$$\begin{aligned} \frac {\partial L_0}{\partial a_{00}^{(L)}} &= (\frac {(a_{00}^{(L)} - y_{00})^2}{2})\prime \\ &= {1 \over 2}[(a_{00}^{(L)})^2 - 2{y_{00}}{a_{00}^{(L)}} + {y_{00}}^2]\prime \\ &= {1 \over 2}[2{a_{00}^{(L)}} - 2{y_{00}}] \\ &= {a_{00}^{(L)}} - {y_{00}} \end{aligned}$$
$$\frac {\partial a_{00}^{(L)}}{\partial z_{00}^{(L)}} = \sigma\prime(z_{00}^{(L)}), \frac {\partial z_{00}^{(L)}}{\partial w_{00}^{(L)}} = a_{00}^{(L-1)}, \frac {\partial z_{00}^{(L)}}{\partial b_0^{(L)}} = 1$$
Putting these together gives the final results:
$$\begin{aligned} \frac {\partial L_0}{\partial b_0^{(L)}} &= \frac {\partial L_0}{\partial a_{00}^{(L)}} \cdot \frac {\partial a_{00}^{(L)}}{\partial z_{00}^{(L)}} \cdot \frac {\partial z_{00}^{(L)}}{\partial b_0^{(L)}} \\ &= (a_{00}^{(L)} - y_{00}) \cdot \sigma\prime(z_{00}^{(L)}) \cdot 1 \end{aligned}$$
And:
$$\begin{aligned} \frac {\partial L_0}{\partial w_{00}^{(L)}} &= \frac {\partial L_0}{\partial a_{00}^{(L)}} \cdot \frac {\partial a_{00}^{(L)}}{\partial z_{00}^{(L)}} \cdot \frac {\partial z_{00}^{(L)}}{\partial w_{00}^{(L)}} \\ &= (a_{00}^{(L)} - y_{00}) \cdot \sigma\prime(z_{00}^{(L)}) \cdot a_{00}^{(L-1)} \\ &= a_{00}^{(L-1)} \cdot \frac {\partial L_0}{\partial b_0^{(L)}} \end{aligned}$$
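A direct translation of these two results into code might look like this (a sketch for a single sample; the parameter values are again made up, and we use the fact that $\sigma\prime(z) = \sigma(z)(1 - \sigma(z))$ for the sigmoid):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # derivative of the sigmoid

# Hypothetical values for one sample.
a_prev, w, b, y = 0.15, 0.5, 0.1, 1.0

z = w * a_prev + b
a = sigmoid(z)

# dL0/db = (a - y) * sigma'(z) * 1
dL_db = (a - y) * sigmoid_prime(z)
# dL0/dw = (a - y) * sigma'(z) * a_prev = a_prev * dL0/db
dL_dw = a_prev * dL_db
```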
That completes the derivation for a single sample's loss. With $n$ samples, the gradients of the total loss $L$ are the averages:
$$\frac {\partial L}{\partial w_{00}^{(L)}} = {1 \over n} \cdot \sum_{i = 0}^{n-1} \frac {\partial L_i}{\partial w_{00}^{(L)}}$$
$$\frac {\partial L}{\partial b_0^{(L)}} = {1 \over n} \cdot \sum_{i = 0}^{n-1} \frac {\partial L_i}{\partial b_0^{(L)}}$$
Gradients
We can now write down the corresponding gradients:
$$\nabla W^{(L)} = \frac {\partial L}{\partial w_{00}^{(L)}} = {1 \over n} \cdot \sum_{i = 0}^{n-1} \frac {\partial L_i}{\partial w_{00}^{(L)}}$$
$$\nabla B^{(L)} = \frac {\partial L}{\partial b_0^{(L)}} = {1 \over n} \cdot \sum_{i = 0}^{n-1} \frac {\partial L_i}{\partial b_0^{(L)}}$$
and update the weight and bias of layer $L$ (where the learning rate $\eta$ is usually a small value in $(0, 1)$):
$$w_{00}^{(L)} = w_{00}^{(L)} - \eta \cdot \nabla W^{(L)}$$
$$b_0^{(L)} = b_0^{(L)} - \eta \cdot \nabla B^{(L)}$$
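The update itself is then one line per parameter (the gradient values below are placeholders standing in for the averages above):

```python
eta = 0.1                      # learning rate (assumed value)
w, b = 0.5, 0.1                # current parameters
grad_W, grad_B = 0.02, -0.04   # placeholder averaged gradients

w = w - eta * grad_W   # w_00^(L) <- w_00^(L) - eta * grad W^(L)
b = b - eta * grad_B   # b_0^(L)  <- b_0^(L)  - eta * grad B^(L)
```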
Propagation
We are not done yet: so far only the output layer's weight has been updated, and backpropagation has to go deeper than one layer. Let's look at how to update the weights of layer $L-1$.
First compute the partial derivative of the loss $L_i$ with respect to $a_{i0}^{(L-1)}$:
$$\frac {\partial L_i}{\partial a_{i0}^{(L-1)}} = \frac {\partial L_i}{\partial a_{i0}^{(L)}} \cdot \frac {\partial a_{i0}^{(L)}}{\partial z_{i0}^{(L)}} \cdot \frac {\partial z_{i0}^{(L)}}{\partial a_{i0}^{(L-1)}}$$
Then compute the partial derivatives of $L_i$ with respect to $w_{00}^{(L-1)}$ and $b_0^{(L-1)}$:
$$\frac {\partial L_i}{\partial w_{00}^{(L-1)}} = \frac {\partial L_i}{\partial a_{i0}^{(L-1)}} \cdot \frac {\partial a_{i0}^{(L-1)}}{\partial z_{i0}^{(L-1)}} \cdot \frac {\partial z_{i0}^{(L-1)}}{\partial w_{00}^{(L-1)}}$$
$$\frac {\partial L_i}{\partial b_0^{(L-1)}} = \frac {\partial L_i}{\partial a_{i0}^{(L-1)}} \cdot \frac {\partial a_{i0}^{(L-1)}}{\partial z_{i0}^{(L-1)}} \cdot \frac {\partial z_{i0}^{(L-1)}}{\partial b_0^{(L-1)}}$$
Generalizing to $n$ samples gives the gradients for layer $L-1$:
$$\nabla W^{(L-1)} = \frac {\partial L}{\partial w_{00}^{(L-1)}} = {1 \over n} \sum_{i = 0}^{n-1} \frac {\partial L_i}{\partial w_{00}^{(L-1)}}$$
$$\nabla B^{(L-1)} = \frac {\partial L}{\partial b_0^{(L-1)}} = {1 \over n} \sum_{i = 0}^{n-1} \frac {\partial L_i}{\partial b_0^{(L-1)}}$$
What if we want to keep propagating back to layer $L-2$?
First compute the partial derivative of $L_i$ with respect to $a_{i0}^{(L-2)}$:
$$\frac {\partial L_i}{\partial a_{i0}^{(L-2)}} = \frac {\partial L_i}{\partial a_{i0}^{(L-1)}} \cdot \frac {\partial a_{i0}^{(L-1)}}{\partial z_{i0}^{(L-1)}} \cdot \frac {\partial z_{i0}^{(L-1)}}{\partial a_{i0}^{(L-2)}}$$
Here $\frac {\partial L_i}{\partial a_{i0}^{(L-1)}}$ is exactly the result we just computed one layer up.
Carrying that result along, we get:
$$\nabla W^{(L-2)} = {1 \over n} \sum_{i = 0}^{n-1} (\frac {\partial L_i}{\partial a_{i0}^{(L-2)}} \cdot \frac {\partial a_{i0}^{(L-2)}}{\partial z_{i0}^{(L-2)}} \cdot \frac {\partial z_{i0}^{(L-2)}}{\partial w_{00}^{(L-2)}})$$
$$\nabla B^{(L-2)} = {1 \over n} \sum_{i = 0}^{n-1} (\frac {\partial L_i}{\partial a_{i0}^{(L-2)}} \cdot \frac {\partial a_{i0}^{(L-2)}}{\partial z_{i0}^{(L-2)}} \cdot \frac {\partial z_{i0}^{(L-2)}}{\partial b_0^{(L-2)}})$$
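Putting everything so far together, here is a minimal sketch of one gradient-descent step through a whole chain of single-neuron layers, for one sample. All weights, biases, and the learning rate are invented for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A chain of three single-neuron layers: input -> L-2 -> L-1 -> L.
ws = [0.5, -0.3, 0.8]   # hypothetical weights, one per layer
bs = [0.1, 0.2, -0.1]   # hypothetical biases
x, y, eta = 0.15, 1.0, 0.5

# Forward pass, caching z and a for every layer.
a_vals, z_vals = [x], []
for w, b in zip(ws, bs):
    z = w * a_vals[-1] + b
    z_vals.append(z)
    a_vals.append(sigmoid(z))

# Backward pass: start with dL/da at the output, then walk back layer by layer.
dL_da = a_vals[-1] - y
for l in reversed(range(len(ws))):
    s = sigmoid(z_vals[l])
    dL_dz = dL_da * s * (1.0 - s)   # dL/dz = dL/da * sigma'(z)
    dL_dw = dL_dz * a_vals[l]       # dL/dw: multiply by the input activation
    dL_db = dL_dz                   # dL/db: multiply by 1
    dL_da = dL_dz * ws[l]           # dL/da for the layer below (before the update!)
    ws[l] -= eta * dL_dw
    bs[l] -= eta * dL_db
```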
Complex Network
Having covered the simplest network, let's try widening the layers.
Now the output layer has two neurons and the last hidden layer has four.
So $j \in \{0, 1\}$ and $k \in \{0, 1, 2, 3\}$.
Basic Formulas
As before, we first express $a_j^{(L)}$, writing $a_j^{(L)} = \sigma(z_j^{(L)})$.
Let's first look at how a single neuron's activation is computed:
$$a_0^{(L)} = \sigma(z_0^{(L)})$$
$$z_0^{(L)} = w_{00}^{(L)} a_0^{(L-1)} + w_{01}^{(L)} a_1^{(L-1)} + w_{02}^{(L)} a_2^{(L-1)} + w_{03}^{(L)} a_3^{(L-1)} + b_0^{(L)}$$
In general, $z_j^{(L)}$ is therefore:
$$\begin{aligned} z_j^{(L)} &= w_{j0}^{(L)} a_0^{(L-1)} + w_{j1}^{(L)} a_1^{(L-1)} + \cdots + w_{j(n_{(L-1)}-1)}^{(L)} a_{n_{(L-1)}-1}^{(L-1)} + b_j^{(L)} \\ &= \sum_{k = 0}^{n_{(L-1)}-1} w_{jk}^{(L)} a_k^{(L-1)} + b_j^{(L)} \end{aligned}$$
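Vectorized, this weighted sum is just a matrix-vector product. A sketch matching the sizes in this example (4 neurons in layer $L-1$, 2 in layer $L$; the random values are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
a_prev = rng.random(4)          # a^(L-1): 4 hidden activations (made up)
W = rng.random((2, 4))          # W^(L): row j holds w_j0 ... w_j3
b = rng.random(2)               # b^(L)

z = W @ a_prev + b              # z_j^(L) = sum_k w_jk a_k^(L-1) + b_j
a = 1.0 / (1.0 + np.exp(-z))    # a_j^(L) = sigma(z_j^(L))
```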
Loss Value
We can now compute the loss contributed by one output neuron on one sample, $L_{0j}$ (the first subscript is the sample index and $j$ is the output neuron index):
$$L_{0j} = \frac {(a_{0j}^{(L)} - y_{0j})^2}{2}$$
Summing over the output neurons gives the loss $L_0$ for sample 0:
$$L_0 = \sum_{j = 0}^{n_{(L)}-1} \frac {(a_{0j}^{(L)} - y_{0j})^2}{2}$$
And in general $L_i$:
$$L_i = \sum_{j = 0}^{n_{(L)}-1} \frac {(a_{ij}^{(L)} - y_{ij})^2}{2}$$
Generalizing to $n$ samples:
$$L = {1 \over n} \sum_{i = 0}^{n-1} L_{i} = {1 \over n} \sum_{i = 0}^{n-1} \sum_{j = 0}^{n_{(L)}-1} \frac {(a_{ij}^{(L)} - y_{ij})^2}{2}$$
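Stored as an $(n, n_{(L)})$ array, this double sum is a one-liner (the arrays below are invented for illustration):

```python
import numpy as np

A = np.array([[0.7, 0.2],      # a_ij^(L): row i = sample, column j = output neuron
              [0.4, 0.9]])
Y = np.array([[1.0, 0.0],      # targets y_ij
              [0.0, 1.0]])

# L = (1/n) * sum_i sum_j (a_ij - y_ij)^2 / 2
L = np.sum((A - Y) ** 2 / 2, axis=1).mean()
```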
The situation is a bit different now: before, every layer had a single neuron, whereas now the output layer has $n_{(L)}$ neurons (indexed by $j$) and the last hidden layer has $n_{(L-1)}$ neurons (indexed by $k$).
The basic idea is unchanged, though: take the partial derivative of the total loss $L$ with respect to each $w_{jk}^{(L)}$. Let's compute the gradients for the first row of the output layer's weight matrix $W^{(L)}$:
$$\frac {\partial L_0}{\partial w_{00}^{(L)}} = \frac {\partial L_0}{\partial a_{00}^{(L)}} \cdot \frac {\partial a_{00}^{(L)}}{\partial z_{00}^{(L)}} \cdot \frac {\partial z_{00}^{(L)}}{\partial w_{00}^{(L)}}$$
$$\frac {\partial L_0}{\partial w_{01}^{(L)}} = \frac {\partial L_0}{\partial a_{00}^{(L)}} \cdot \frac {\partial a_{00}^{(L)}}{\partial z_{00}^{(L)}} \cdot \frac {\partial z_{00}^{(L)}}{\partial w_{01}^{(L)}}$$
$$\cdots$$
$$\frac {\partial L_0}{\partial w_{0k}^{(L)}} = \frac {\partial L_0}{\partial a_{00}^{(L)}} \cdot \frac {\partial a_{00}^{(L)}}{\partial z_{00}^{(L)}} \cdot \frac {\partial z_{00}^{(L)}}{\partial w_{0k}^{(L)}}$$
Generalizing to an arbitrary $w_{jk}^{(L)}$ and to $n$ samples:
$$\frac {\partial L}{\partial w_{jk}^{(L)}} = {1 \over n} \sum_{i = 0}^{n-1} (\frac {\partial L_i}{\partial a_{ij}^{(L)}} \cdot \frac {\partial a_{ij}^{(L)}}{\partial z_{ij}^{(L)}} \cdot \frac {\partial z_{ij}^{(L)}}{\partial w_{jk}^{(L)}})$$
and likewise the gradient for $b_j^{(L)}$:
$$\frac {\partial L}{\partial b_j^{(L)}} = {1 \over n} \sum_{i = 0}^{n-1} (\frac {\partial L_i}{\partial a_{ij}^{(L)}} \cdot \frac {\partial a_{ij}^{(L)}}{\partial z_{ij}^{(L)}} \cdot \frac {\partial z_{ij}^{(L)}}{\partial b_j^{(L)}})$$
These two expressions look complicated, so let's first evaluate the three factors in $\frac {\partial L}{\partial w_{jk}^{(L)}}$. In the sum defining $L_i$, every term whose output index differs from $j$ is constant with respect to $a_{ij}^{(L)}$, so its derivative vanishes (note the dummy index $t$ to avoid clashing with $j$):
$$\begin{aligned} \frac {\partial L_i}{\partial a_{ij}^{(L)}} &= \frac {\partial}{\partial a_{ij}^{(L)}} \sum_{t = 0}^{n_{(L)}-1} \frac {(a_{it}^{(L)} - y_{it})^2}{2} \\ &= \frac {\partial}{\partial a_{ij}^{(L)}} \Big( \frac {(a_{i0}^{(L)} - y_{i0})^2}{2} + \cdots + \frac {(a_{ij}^{(L)} - y_{ij})^2}{2} + \cdots \Big) \\ &= 0 + (a_{ij}^{(L)} - y_{ij}) + 0 \\ &= a_{ij}^{(L)} - y_{ij} \end{aligned}$$
$$\frac {\partial a_{ij}^{(L)}}{\partial z_{ij}^{(L)}} = \sigma\prime(z_{ij}^{(L)})$$
$$\frac {\partial z_{ij}^{(L)}}{\partial w_{jk}^{(L)}} = a_{ik}^{(L-1)}, \frac {\partial z_{ij}^{(L)}}{\partial b_j^{(L)}} = 1$$
Substituting these back in gives the concrete results:
$$\begin{aligned} \frac {\partial L}{\partial w_{jk}^{(L)}} &= {1 \over n} \sum_{i = 0}^{n-1} (\frac {\partial L_i}{\partial a_{ij}^{(L)}} \cdot \frac {\partial a_{ij}^{(L)}}{\partial z_{ij}^{(L)}} \cdot \frac {\partial z_{ij}^{(L)}}{\partial w_{jk}^{(L)}}) \\ &= {1 \over n} \sum_{i = 0}^{n-1} [(a_{ij}^{(L)} - y_{ij}) \cdot \sigma\prime(z_{ij}^{(L)}) \cdot a_{ik}^{(L-1)}] \end{aligned}$$
$$\begin{aligned} \frac {\partial L}{\partial b_j^{(L)}} &= {1 \over n} \sum_{i = 0}^{n-1} (\frac {\partial L_i}{\partial a_{ij}^{(L)}} \cdot \frac {\partial a_{ij}^{(L)}}{\partial z_{ij}^{(L)}} \cdot \frac {\partial z_{ij}^{(L)}}{\partial b_j^{(L)}}) \\ &= {1 \over n} \sum_{i = 0}^{n-1} [(a_{ij}^{(L)} - y_{ij}) \cdot \sigma\prime(z_{ij}^{(L)}) \cdot 1] \end{aligned}$$
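These two averages vectorize nicely over the whole batch. A sketch with made-up sizes and data (using the fact that $\sigma\prime(z) = a(1-a)$ once $a = \sigma(z)$ is known):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n, n_prev, n_out = 5, 4, 2          # samples, size of layer L-1, size of layer L
A_prev = rng.random((n, n_prev))    # a_ik^(L-1), one row per sample
W = rng.random((n_out, n_prev))     # W^(L): row j holds w_j0 ... w_jk
b = rng.random(n_out)               # b^(L)
Y = rng.random((n, n_out))          # targets y_ij (made up)

Z = A_prev @ W.T + b                # z_ij^(L), shape (n, n_out)
A = sigmoid(Z)                      # a_ij^(L)

delta = (A - Y) * A * (1 - A)       # (a_ij - y_ij) * sigma'(z_ij)
grad_W = delta.T @ A_prev / n       # dL/dw_jk, shape (n_out, n_prev)
grad_b = delta.mean(axis=0)         # dL/db_j, shape (n_out,)
```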
Gradients
Weights
Take a look at the output layer's weight matrix $W^{(L)}$:
$$W^{(L)} = \begin{pmatrix} w_{00}^{(L)} & w_{01}^{(L)} & \cdots & w_{0k}^{(L)} \\ w_{10}^{(L)} & \ddots & & \vdots \\ \vdots & & \ddots & \vdots \\ w_{j0}^{(L)} & \cdots & \cdots & w_{jk}^{(L)} \end{pmatrix}$$
Combining it with the partial derivatives we derived above (each output neuron's loss with respect to its $k$ incoming weights) gives the gradient matrix:
$$\nabla W^{(L)} = \begin{pmatrix} \frac {\partial L}{\partial w_{00}^{(L)}} & \frac {\partial L}{\partial w_{01}^{(L)}} & \cdots & \frac {\partial L}{\partial w_{0k}^{(L)}} \\ \frac {\partial L}{\partial w_{10}^{(L)}} & \ddots & & \vdots \\ \vdots & & \ddots & \vdots \\ \frac {\partial L}{\partial w_{j0}^{(L)}} & \cdots & \cdots & \frac {\partial L}{\partial w_{jk}^{(L)}} \end{pmatrix}$$
We can therefore update the output layer's weights:
$$W^{(L)} = W^{(L)} - \eta \cdot \nabla W^{(L)}$$
Biases
$$\nabla B^{(L)} = \begin{pmatrix} \frac{\partial L}{\partial b_0^{(L)}} \\ \vdots \\ \frac {\partial L}{\partial b_j^{(L)}} \end{pmatrix}$$
and likewise update the output layer's biases:
$$B^{(L)} = B^{(L)} - \eta \cdot \nabla B^{(L)}$$
Propagation
Let's continue with the weights of layer $L-1$, and suppose layer $L-2$ has $m$ neurons.
Remember the approach from the simple network? Once again the first step is computing $\frac {\partial L_i}{\partial a_{ik}^{(L-1)}}$. One thing is new here: $a_{ik}^{(L-1)}$ feeds into every output neuron, so the chain rule must sum the contributions over $j$:
$$\frac {\partial L_i}{\partial a_{ik}^{(L-1)}} = \sum_{j = 0}^{n_{(L)}-1} \frac {\partial L_i}{\partial a_{ij}^{(L)}} \cdot \frac {\partial a_{ij}^{(L)}}{\partial z_{ij}^{(L)}} \cdot \frac {\partial z_{ij}^{(L)}}{\partial a_{ik}^{(L-1)}}$$
This looks much like the earlier computation of $\frac {\partial L}{\partial w_{jk}^{(L)}}$; the main difference is that the last factor is now a derivative with respect to $a_{ik}^{(L-1)}$.
Recall $z_{ij}^{(L)}$:
$$z_{ij}^{(L)} = \sum_{k = 0}^{n_{(L-1)}-1} w_{jk}^{(L)}a_{ik}^{(L-1)} + b_j^{(L)}$$
which gives $\frac {\partial z_{ij}^{(L)}}{\partial a_{ik}^{(L-1)}}$:
$$\frac {\partial z_{ij}^{(L)}}{\partial a_{ik}^{(L-1)}} = w_{jk}^{(L)}$$
So we can now compute:
$$\frac {\partial L_i}{\partial a_{ik}^{(L-1)}} = \sum_{j = 0}^{n_{(L)}-1} (a_{ij}^{(L)} - y_{ij}) \cdot \sigma\prime(z_{ij}^{(L)}) \cdot w_{jk}^{(L)}$$
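In matrix form, the sum over $j$ collapses into a single matrix product. A small sketch with placeholder shapes and data:

```python
import numpy as np

rng = np.random.default_rng(0)
delta = rng.random((5, 2))   # (a_ij - y_ij) * sigma'(z_ij), shape (n, n_out)
W = rng.random((2, 4))       # W^(L), shape (n_out, n_prev)

# dL_i/da_ik^(L-1) = sum_j delta_ij * w_jk -- row i of (delta @ W):
dL_dA_prev = delta @ W       # shape (n, n_prev)
```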
From here the gradient for $w_{km}^{(L-1)}$ follows easily:
$$\frac {\partial L}{\partial w_{km}^{(L-1)}} = {1 \over n} \sum_{i = 0}^{n-1} (\frac {\partial L_i}{\partial a_{ik}^{(L-1)}} \cdot \frac {\partial a_{ik}^{(L-1)}}{\partial z_{ik}^{(L-1)}} \cdot \frac {\partial z_{ik}^{(L-1)}}{\partial w_{km}^{(L-1)}})$$
The last two factors are easy to compute:
$$\frac {\partial a_{ik}^{(L-1)}}{\partial z_{ik}^{(L-1)}} = \sigma\prime(z_{ik}^{(L-1)})$$
$$\frac {\partial z_{ik}^{(L-1)}}{\partial w_{km}^{(L-1)}} = a_{im}^{(L-2)}$$
Finally, we get the gradients for $w_{km}^{(L-1)}$ and $b_{k}^{(L-1)}$:
$$\frac {\partial L}{\partial w_{km}^{(L-1)}} = {1 \over n} \sum_{i = 0}^{n-1} [\frac {\partial L_i}{\partial a_{ik}^{(L-1)}} \cdot \sigma\prime(z_{ik}^{(L-1)}) \cdot a_{im}^{(L-2)}]$$
$$\frac {\partial L}{\partial b_k^{(L-1)}} = {1 \over n} \sum_{i = 0}^{n-1} [\frac {\partial L_i}{\partial a_{ik}^{(L-1)}} \cdot \sigma\prime(z_{ik}^{(L-1)}) \cdot 1]$$
The same reasoning yields the weight gradients for layer $L-2$. First compute $\frac {\partial L_i}{\partial a_{im}^{(L-2)}}$, again summing over every neuron that $a_{im}^{(L-2)}$ feeds into:
$$\begin{aligned} \frac {\partial L_i}{\partial a_{im}^{(L-2)}} &= \sum_{k = 0}^{n_{(L-1)}-1} \frac {\partial L_i}{\partial a_{ik}^{(L-1)}} \cdot \frac {\partial a_{ik}^{(L-1)}}{\partial z_{ik}^{(L-1)}} \cdot \frac {\partial z_{ik}^{(L-1)}}{\partial a_{im}^{(L-2)}} \\ &= \sum_{k = 0}^{n_{(L-1)}-1} \frac {\partial L_i}{\partial a_{ik}^{(L-1)}} \cdot \sigma\prime(z_{ik}^{(L-1)}) \cdot w_{km}^{(L-1)} \end{aligned}$$
which gives the two gradients:
$$\begin{aligned} \frac {\partial L}{\partial w_{mt}^{(L-2)}} &= {1 \over n} \sum_{i = 0}^{n-1} \frac {\partial L_i}{\partial w_{mt}^{(L-2)}} \\ &= {1 \over n} \sum_{i = 0}^{n-1} (\frac {\partial L_i}{\partial a_{im}^{(L-2)}} \cdot \frac {\partial a_{im}^{(L-2)}}{\partial z_{im}^{(L-2)}} \cdot \frac {\partial z_{im}^{(L-2)}}{\partial w_{mt}^{(L-2)}}) \\ &= {1 \over n} \sum_{i = 0}^{n-1} (\frac {\partial L_i}{\partial a_{im}^{(L-2)}} \cdot \sigma\prime(z_{im}^{(L-2)}) \cdot a_{it}^{(L-3)}) \end{aligned}$$
$$\frac {\partial L}{\partial b_m^{(L-2)}} = {1 \over n} \sum_{i = 0}^{n-1} (\frac {\partial L_i}{\partial a_{im}^{(L-2)}} \cdot \sigma\prime(z_{im}^{(L-2)}) \cdot 1)$$
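To close the loop, here is a compact sketch of one full training step for an arbitrary stack of layers, combining the forward pass, the backward recursion for $\partial L_i / \partial a$, and the parameter updates. It follows the formulas above, but the layer sizes and data are made up:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]                      # input, L-2, L-1, L (hypothetical)
Ws = [rng.standard_normal((o, i)) for i, o in zip(sizes, sizes[1:])]
bs = [np.zeros(o) for o in sizes[1:]]

X = rng.random((8, sizes[0]))             # 8 samples (made up)
Y = rng.random((8, sizes[-1]))            # targets (made up)
eta, n = 0.5, len(X)

# Forward pass, caching every layer's Z and A.
As, Zs = [X], []
for W, b in zip(Ws, bs):
    Zs.append(As[-1] @ W.T + b)
    As.append(sigmoid(Zs[-1]))

# Backward pass: dL/dA starts at the output and flows back layer by layer.
dA = As[-1] - Y                           # dL_i/da_ij^(L) for the half-MSE loss
for l in reversed(range(len(Ws))):
    s = sigmoid(Zs[l])
    delta = dA * s * (1 - s)              # dL/dz at layer l
    grad_W = delta.T @ As[l] / n          # averaged dL/dW
    grad_b = delta.mean(axis=0)           # averaged dL/db
    dA = delta @ Ws[l]                    # dL/dA for the layer below
    Ws[l] -= eta * grad_W
    bs[l] -= eta * grad_b
```

Note how `dA = delta @ Ws[l]` implements the summed chain rule from this section, and how it must be computed before `Ws[l]` is updated.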
Closing Remarks
This article was inspired by the video made by 3Blue1Brown: Backpropagation, step-by-step | DL3.
This was my first time using MathJax, so the formulas took quite a while to typeset; if you spot any mistakes, please feel free to point them out.