[CS231n]5·Recurrent Neural Networks

CS231n Convolutional Neural Networks for Visual Recognition

https://www.bilibili.com/video/BV1nJ411z7fe

RNN: Process Sequences

Overview

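A recurrent neural network (RNN) can be seen as a neural network with a notion of time, somewhat like a sequential logic circuit.

In the overview figure from the lecture, the horizontal axis from left to right can be read as successive time steps. An ordinary neural network has a one to one structure: one input layer passes through a series of hidden layers to one output layer, all within a single time step. An RNN, in contrast, can process sequential data, and can take one to many, many to one, many to many, and other structures.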

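one to many: an input is given at one time step, and outputs are produced at several subsequent time steps. A typical example is Image Captioning, where the input is an image and the output is text describing it.

many to one: inputs are given over several consecutive time steps, and a single output is produced at the last one. A typical example is Audio Prediction.

many to many: inputs are given over several consecutive time steps, and once the input is complete, outputs are produced over several subsequent time steps. A typical example is Video Captioning, i.e. generating text that describes a video.

many to many: inputs are given over several consecutive time steps while an output is produced at every step. A typical example is Video classification on frame level.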

Forward

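The key idea of the RNN forward pass is that each unit carries a "hidden", time-dependent vector $h_t$, which is updated from the current input $x_t$ using parameters $W$ that do not change over time:

$$h_t = f_W(h_{t-1}, x_t)$$

Depending on the structure (one to many / many to one / many to many, etc.), we then decide how to use $h_t$ to update the output $y_t$.

For example, a simple case (the vanilla RNN) is:

$$h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t), \qquad y_t = W_{hy}h_t$$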
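The simple case can be sketched in NumPy as follows. This is a minimal illustration, not the course's reference code: the weight names `W_hh`, `W_xh`, `W_hy` follow the usual vanilla RNN notation, and the sizes, the random initialization, and the omission of bias terms are assumptions made only for the example.

```python
import numpy as np

def rnn_step(h_prev, x, W_hh, W_xh, W_hy):
    """One vanilla RNN step: returns the new hidden state and the output."""
    h = np.tanh(W_hh @ h_prev + W_xh @ x)
    y = W_hy @ h
    return h, y

def rnn_forward(xs, h0, W_hh, W_xh, W_hy):
    """Apply the same step (same weights) to every element of xs, shape (T, D)."""
    h, hs, ys = h0, [], []
    for x in xs:
        h, y = rnn_step(h, x, W_hh, W_xh, W_hy)
        hs.append(h)
        ys.append(y)
    return np.stack(hs), np.stack(ys)

# Illustrative sizes (assumptions): T steps, D inputs, H hidden units, O outputs.
rng = np.random.default_rng(0)
T, D, H, O = 5, 3, 4, 2
W_hh = rng.normal(scale=0.5, size=(H, H))
W_xh = rng.normal(scale=0.5, size=(H, D))
W_hy = rng.normal(scale=0.5, size=(O, H))
xs = rng.normal(size=(T, D))

hs, ys = rnn_forward(xs, np.zeros(H), W_hh, W_xh, W_hy)
print(hs.shape, ys.shape)  # (5, 4) (5, 2)
```

A many to one model would keep only `ys[-1]`, while a frame-level many to many model would use every row of `ys`.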

Computational Graph

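A computational graph is a very useful tool for deriving Backpropagation. The graphs for the different structures (one to many / many to one / many to many, etc.) differ, but only in minor ways.

Note that in the one to many structure, the output of the previous time step can be used as the input of the next time step.

We can also chain a many to one network with a one to many network to obtain a sequence to sequence model.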
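Feeding each output back in as the next input can be sketched like this. It is only an illustration under stated assumptions: the outputs are treated as scores over a hypothetical vocabulary of size `V`, the next input is the one-hot vector of the highest-scoring entry (greedy choice), and the weights are random rather than trained.

```python
import numpy as np

def generate(h0, x0, W_hh, W_xh, W_hy, steps):
    """One to many: start from a single input, then feed each output back in."""
    V = W_hy.shape[0]                   # vocabulary size (assumed equal to the input size)
    h, x, tokens = h0, x0, []
    for _ in range(steps):
        h = np.tanh(W_hh @ h + W_xh @ x)
        scores = W_hy @ h
        idx = int(np.argmax(scores))    # greedy choice; sampling from the scores is also common
        tokens.append(idx)
        x = np.eye(V)[idx]              # one-hot of the chosen token becomes the next input
    return tokens

rng = np.random.default_rng(0)
V, H = 4, 8                             # illustrative sizes
W_hh = rng.normal(scale=0.5, size=(H, H))
W_xh = rng.normal(scale=0.5, size=(H, V))
W_hy = rng.normal(scale=0.5, size=(V, H))

tokens = generate(np.zeros(H), np.eye(V)[0], W_hh, W_xh, W_hy, steps=6)
print(tokens)
```

In a sequence to sequence setup, `h0` would be the final hidden state of the many to one encoder instead of a zero vector.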

Backpropagation

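The forward pass is computed in time order, so backpropagation runs against time order. The problem is that for a very long sequence this process uses a lot of memory. The solution is Truncated Backpropagation:

Split the sequence into chunks; after running the forward pass over one chunk, immediately backpropagate through that chunk only. The hidden state is carried forward from chunk to chunk, but gradients are not propagated across chunk boundaries.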
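The chunking idea can be sketched with a hand-written backward pass. Everything specific here is an assumption for illustration: the loss (squared error directly on the hidden state), the helper names `chunk_forward_backward` and `train_truncated`, the plain SGD update, and the sizes.

```python
import numpy as np

def chunk_forward_backward(xs, targets, h0, W_hh, W_xh):
    """Forward over one chunk, then backpropagate through that chunk only.

    Loss is 0.5 * sum_t ||h_t - target_t||^2 (an illustrative choice).
    Returns the loss, the gradients for W_hh and W_xh, and the last hidden state.
    """
    hs = [h0]
    for x in xs:                                  # forward, in time order
        hs.append(np.tanh(W_hh @ hs[-1] + W_xh @ x))
    loss = 0.5 * sum(np.sum((h - t) ** 2) for h, t in zip(hs[1:], targets))

    dW_hh, dW_xh = np.zeros_like(W_hh), np.zeros_like(W_xh)
    dh_next = np.zeros_like(h0)                   # nothing flows in from later chunks
    for t in reversed(range(len(xs))):            # backward, against time order
        dh = (hs[t + 1] - targets[t]) + dh_next
        da = dh * (1.0 - hs[t + 1] ** 2)          # derivative through tanh
        dW_hh += np.outer(da, hs[t])
        dW_xh += np.outer(da, xs[t])
        dh_next = W_hh.T @ da                     # discarded once the chunk start is reached
    return loss, dW_hh, dW_xh, hs[-1]

def train_truncated(xs, targets, W_hh, W_xh, chunk_len, lr=0.01):
    """Carry h across chunks, but never backpropagate across a chunk boundary."""
    h = np.zeros(W_hh.shape[0])
    losses = []
    for s in range(0, len(xs), chunk_len):
        loss, dW_hh, dW_xh, h = chunk_forward_backward(
            xs[s:s + chunk_len], targets[s:s + chunk_len], h, W_hh, W_xh)
        W_hh -= lr * dW_hh                        # in-place SGD update after each chunk
        W_xh -= lr * dW_xh
        losses.append(loss)
    return losses

rng = np.random.default_rng(0)
T, D, H = 12, 2, 3                                # illustrative sizes
W_hh = rng.normal(scale=0.5, size=(H, H))
W_xh = rng.normal(scale=0.5, size=(H, D))
xs, targets = rng.normal(size=(T, D)), 0.5 * rng.normal(size=(T, H))
losses = train_truncated(xs, targets, W_hh, W_xh, chunk_len=4)
print(len(losses))  # 3 chunks of 4 steps each
```

Only the hidden states of the current chunk are kept in memory, which is the point of truncation; the price is that dependencies longer than `chunk_len` steps receive no gradient.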

RNN Tradeoffs

RNN Advantages:

Can process any length input
Computation for step t can (in theory) use information from many steps back
Model size doesn’t increase for longer input
Same weights applied on every timestep, so there is symmetry in how inputs are processed.

RNN Disadvantages:

Recurrent computation is slow
In practice, difficult to access information from many steps back

LSTM (Long Short Term Memory)

RNN Gradient Flow

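The update of an RNN in a single time step is:

$$h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t) = \tanh\left(W\begin{pmatrix}h_{t-1}\\ x_t\end{pmatrix}\right)$$

Within this single time step we have:

$$\frac{\partial h_t}{\partial h_{t-1}} = \tanh'(W_{hh}h_{t-1} + W_{xh}x_t)\,W_{hh}$$

Over the whole sequence:

$$L = \sum_{t=1}^{T} L_t$$

The goal of Backpropagation is to find:

$$\frac{\partial L}{\partial W} = \sum_{t=1}^{T}\frac{\partial L_t}{\partial W}$$

Considering only $\partial L_T/\partial W$:

$$\frac{\partial L_T}{\partial W} = \frac{\partial L_T}{\partial h_T}\left(\prod_{t=2}^{T}\frac{\partial h_t}{\partial h_{t-1}}\right)\frac{\partial h_1}{\partial W} = \frac{\partial L_T}{\partial h_T}\left(\prod_{t=2}^{T}\tanh'(W_{hh}h_{t-1} + W_{xh}x_t)\right)W_{hh}^{T-1}\,\frac{\partial h_1}{\partial W}$$

Since $\tanh'(x) \le 1$ (with equality if and only if $x = 0$), the product inside the parentheses becomes very small, which leads to Vanishing gradients.

Even ignoring the term in parentheses, consider the factor $W_{hh}^{T-1}$: if the largest singular value of $W_{hh}$ is greater than 1, this factor can become very large, leading to Exploding gradients; if the largest singular value is less than 1, it becomes very small, leading to Vanishing gradients.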
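The effect of the repeated weight-matrix factor can be seen numerically. The sketch below ignores the nonlinearity and uses a scaled orthogonal matrix as a stand-in for the recurrent weights, so that every singular value equals the chosen scale; that choice, the size 8, and the 50 steps are assumptions made to keep the outcome easy to read.

```python
import numpy as np

def backprop_norms(W_hh, steps):
    """Norm of a gradient vector after repeated multiplication by W_hh^T."""
    n = W_hh.shape[0]
    g = np.ones(n) / np.sqrt(n)          # unit-norm starting gradient
    norms = []
    for _ in range(steps):
        g = W_hh.T @ g                   # one step back in time, no nonlinearity
        norms.append(np.linalg.norm(g))
    return norms

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # orthogonal: all singular values are 1

for scale in (0.9, 1.0, 1.1):
    norms = backprop_norms(scale * Q, steps=50)
    # With all singular values equal to `scale`, the norm is scale**50 up to rounding:
    # it shrinks for 0.9, stays at 1 for 1.0, and grows for 1.1.
    print(scale, norms[-1])
```

For a general matrix the singular values differ, so the growth or decay depends on the direction of the gradient, but the same compounding over time steps applies.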

总而言之，梯度在 RNN 中的传播是困难的，于是我们思考改进 RNN 的结构来解决这个问题。

LSTM

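On top of the vanilla RNN, LSTM adds four intermediate variables and defines the update in a single time step as:

$$\begin{pmatrix} i\\ f\\ o\\ g \end{pmatrix} = \begin{pmatrix} \sigma\\ \sigma\\ \sigma\\ \tanh \end{pmatrix} W \begin{pmatrix} h_{t-1}\\ x_t \end{pmatrix}$$

$$c_t = f \odot c_{t-1} + i \odot g$$

$$h_t = o \odot \tanh(c_t)$$

where:

$i$: Input gate, whether to write to cell
$f$: Forget gate, whether to erase cell
$o$: Output gate, how much to reveal cell
$g$: how much to write to cell

Note that each of the four gates is computed with its own $W$; the $W$ in the equation above denotes the matrix obtained by stacking them together.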
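A single LSTM step with the four gate matrices stacked into one can be sketched as below. This is an illustrative NumPy version under assumptions: there are no bias terms, the stacked matrix `W` has shape `(4H, H + D)` with rows ordered input, forget, output, candidate, and the sizes and random weights are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x, W):
    """One LSTM step. W stacks the four gate matrices: shape (4H, H + D)."""
    H = h_prev.shape[0]
    a = W @ np.concatenate([h_prev, x])   # all four pre-activations at once
    i = sigmoid(a[0:H])                   # input gate: whether to write to cell
    f = sigmoid(a[H:2 * H])               # forget gate: whether to erase cell
    o = sigmoid(a[2 * H:3 * H])           # output gate: how much to reveal cell
    g = np.tanh(a[3 * H:4 * H])           # candidate: how much to write to cell
    c = f * c_prev + i * g                # elementwise cell update
    h = o * np.tanh(c)
    return h, c, (i, f, o, g)

rng = np.random.default_rng(0)
H, D = 4, 3                               # illustrative sizes
W = rng.normal(scale=0.5, size=(4 * H, H + D))
h_prev, c_prev, x = np.zeros(H), rng.normal(size=H), rng.normal(size=D)

h, c, (i, f, o, g) = lstm_step(h_prev, c_prev, x, W)
print(h.shape, c.shape)  # (4,) (4,)
```

Computing one large matrix product and slicing it is equivalent to four separate products with the individual gate matrices, and is usually more convenient.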

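Within a single time step, backpropagating from $c_t$ to $c_{t-1}$ involves only an elementwise multiplication by the forget gate $f$, with no matrix multiplication by $W$.

Across the whole sequence, the gradient can therefore flow along the cell states without interruption.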

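Although LSTM cannot guarantee that exploding or vanishing gradients will not occur, its gradient flow mechanism does make the network easier to train. Backpropagation through an LSTM is like a highway for gradients, and in this respect LSTM and ResNet reach the same effect by different means.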

https://xyfjason.github.io/blog-main/2021/03/06/CS231n-5·Recurrent-Neural-Networks/

Author: xyfJASON

Published: March 6, 2021
