# 【CS224N笔记】一文详解神经网络来龙去脉

## The structure of the neural network

##### A neuron can be a binary logistic regression unit

• b: We can have an “always on” feature, which gives a class prior, or separate it out, as a bias term------b我们常常认为是偏置
• 单层神经网络

• 多层神经网络

• for example：

f的运算:

##### 非线性的f的必须
• 重要性：
1. 没有非线性的激活函数，深度神经网络无法做比线性变换更复杂的运算
2. 多个线性的神经网络层相当于一个简单的线性变换:W_1W_2x=Wx
3. 如果采用更多的非线性激活函数，它们可以拟合更复杂的函数

### 命名主体识别Named Entity Recognition (NER)

• The task: findand classifynames in text, for example:
• Possible purposes:
1. Tracking mentions of particular entities in documents---跟踪文档中特殊的实体
3. A lot of wanted information is really associations between named entities---抽取命名主体之间关系
4. The same techniques can be extended to other slot-filling classifications----可以扩展到分类任务

#### Why might NER be hard?

• 实体的边界很难计算
• 很难指导某个物体是否是一个实体
• 很难知道未知/新奇实体的类别
• 很难识别实体---当实体是模糊的，并且依赖于上下文

#### Binary word window classification ---小窗口上下文文本分类器

• 问题:
1. in general，很少对单个单词进行分类
2. ambiguity arise in context，一词多义的问题
• example1：auto-antonyms
1. "To sanction" can mean "to permit" or "to punish”
2. "To seed" can mean "to place seeds" or "to remove seeds"
• example2：resolving linking of ambiguous named entities
1. Paris ->Paris, France vs. Paris Hilton vs. Paris, Texas
2. Hathaway ->Berkshire Hathaway vs. Anne Hathaway

#### Window classification: Softmax

• Idea: classify a word in its context window of neighboring words---在相邻词的上下文窗口对一个词进行分类
• A simple way to classify a word in context might be to average the word vectors in a window and to classify the average vector ---一个简单方法是对上下文的所有词向量去平均，但是这个方法会丢失位置信息
• 另一种方法: Train softmaxclassifier to classify a center word by taking concatenation of word vectors surrounding it in a window---将上下文所有单词的词向量串联起来
• for example: Classify “Paris” in the context of this sentence with window length 2Resulting vector x_{window}=x\varepsilon R^{5d}一个列向量然后通过softmax分类器

#### Binary classification with unnormalizedscores ---给分类的结果一个非标准化分数

• 之前的例子中:
• 假设我们需要确认一个中心词是否是一个地点
• 和word2vec类似，我们遍历语料库的所有位置，但是这一次，它只能在一些地方得到高分
• the positions that have an actual NER Location in their center are “true” positions and get a high score ---它们中符合标准的会获得最高分

#### Neural Network Feed-forward Computation

• 采用神经网络激活函数a简单的给出一个非标准化分数

• s = score("museums in Paris are amazing”)
##### Main intuition for extra layer
• 中间层的作用学习的是输入词向量的非线性交互----Example: only if “museums”is first vector should it matter that “in”is in the second position
##### The max-margin loss
• Idea for training objective: Make true window’s score larger and corrupt window’s score lower (until they’re good enough)---训练思路，让真实窗口的准确率提高，让干扰窗口的得分降低
• s = score(museums in Paris are amazing)
• 最小化
• 这是不可微分，但是是连续的，所以我们可以采用sgd算法进行优化
• Each window with an NER location at its center should have a score +1 higher than any window without a location at its center -----每个中心有ner位置的窗口得分应该比中心没有位置的窗口高1分
• For full objective function: Sample several corrupt windows per true one. Sum over all training windows---使用类似于负采样的方法，为真实窗口采样几个错误窗口

### 矩阵求导---不详细推导

#### Example Jacobian: Elementwise activation Function

• 多元求导的例子
• 针对我们的公式

• 在上面式子中，我们通过链式法则可以得出

## Neural Networks: Foundations

### A neuron

• A neuron is a generic computational unit that takes n inputs and produces a single output. What differentiates the outputs of different neurons is their parameters (also referred to as their weights). -----神经元作用

### A single layer of neurons

• 我们将单个神经元的思想扩展到多个神经元，考虑输入x作为多个这样的神经元输入
• If we refer to the different neurons’ weights as

and the biases as

, we can say the respective activations are

• 公式简化

### feed-forward computation

• 首先我们考虑一个nlp中命名实体识别问题作为例子: "Museums in Paris are amazing" 我们想判别中心词Paris是不是命名主体。在这种情况下，我们很可能不仅想要捕捉窗口中单词向量，还想要捕捉单词间的一些其他交互，方便我们分类。For instance, maybe it should matter that "Museums" is the ﬁrst word only if "in" is the second word. --上面可能存在顺序的约束的问题。所以这样的非线性决策通常不能被直接输入softmax，而是需要一个中间层进行score。因此我们使用另一个矩阵

• Analysis of Dimensions: If we represent each word using a 4 dimensional word vector and we use a 5-word window as input (as in the above example), then the input

. If we use 8 sigmoid units in the hidden layer and generate 1 score output from the activations, then

### Maximum Margin Objective Function

• 我们采用:Maximum Margin Objective Function(最大间隔目标函数)，使得保证对‘真’的数据的可能性比‘假’数据的可能性要高
• 定义符号：Using the previous example, if we call the score computed for the "true" labeled window "Museums in Paris are amazing" as S and the score computed for the "false" labeled window "Not all museums in Paris" as Sc (subscripted as c to signify that the window is "corrupt"). ---正确窗口S，错误窗口Sc
• 随后，我们目标函数最大化S-Sc或者最小化Sc-S.然而，我们修改目标函数来保证误差在Sc>S的时候在进行计算。这么做的目的是我们只在关心‘真’数据的得分高于‘假’数据的得分，其他不是很重要。因此，误差Sc>S时候存在为Sc-S，反之不存在，即为0。所以我们的优化目标是:
• However, the above optimization objective is risky in the sense that it does not attempt to create a margin of safety. We would want the "true" labeled data point to score higher than the "false" labeled data point by some positive margin ∆. In other words, we would want error to be calculated if (s−sc < ∆) and not just when (s−sc < 0). Thus, we modify the optimization objective: ----上面的优化目标函数是存在风险的，它不能创造一个比较安全的间隔，所以我们希望存在一个这样的间隔，并且这个间隔需要大于0
• We can scale this margin such that it is ∆ = 1 and let the other parameters in the optimization problem adapt to this without any change in performance.-----有希望了解的可以看svm的推导，这里的意思是说，我们把间隔设置为1，这样我们可以让其他参数在优化过程中自动进行调整，并不会影响模型的表现

### 梯度更新

• 方向传播是一种利用链式法则来计算模型上任意参数的损失梯度的方法
• 定义:

1.

is an input to the neural network. ---输入

2. s is the output of the neural network.---输出

3. •Each layer (including the input and output layers) has neurons which receive an input and produce an output. The j-th neuron of layer k receives the scalar input

and produces the scalar activation output

---每层中的神经元，上角标是层数，下角标的神经元位置

4. We will call the backpropagated error calculated at

as

. ---定义z的反向传播误差

5. Layer 1 refers to the input layer and not the ﬁrst hidden layer. ----输入层不是隐藏层

6.

is the transfer matrix that maps the output from the k-th layer to the input to the (k+1)-th -----转移矩阵

#### 开始反向传播

• 现在开始反向传播：假设损失函数

, 我们看到

• 我们只分析

, 其中

• 对更定的参数

, 我们知道它的误差梯度是

，其中

## Neural Networks: Tips and Tricks

• Given a model with parameter vector θ and loss function J, the numerical gradient around θi is simply given by centered difference formula:
• 其实就是一个梯度的估计
• Now, a natural question you might ask is, if this method is so precise, why do we not use it to compute all of our network gradients instead of applying back-propagation? The simple answer, as hinted earlier, is inefﬁciency – recall that every time we want to compute the gradient with respect to an element, we need to make two forward passes through the network, which will be computationally expensive. Furthermore, many large-scale neural networks can contain millions of parameters, and computing two passes per parameter is clearly not optimal. -----虽然上面的梯度估计公式很有效，但是这仅仅是随机检测我们梯度是否正确的方法。我们最有效的并且最实用(运算效率最高的)算法就是反向传播算法

### Regularization ---正则化

• As with many machine learning models, neural networks are highly prone to overﬁtting, where a model is able to obtain near perfect performance on the training dataset, but loses the ability to generalize to unseen data. ----和很多机器学习模型一样，神经网络也会陷入过拟合，这回让它无法泛化到测试集上。一个常见的解决过拟合的问题就是采用L2正则化(只需要给损失函数J添加一个正则项)，改进的损失函数
• 对上面公式的参数解释,λ是一个超参数,控制正则项的权值大小,

• 正则项的作用: what regularization is essentially doing is penalizing weights for being too large while optimizing over the original cost function---在优化损失函数的时候，惩罚数值太大的权值,让权值分配更均匀，防止出现权值过大的情况
• Due to the quadratic nature of the Frobenius norm (which computes the sum of the squared elements of a matrix), L2 regularization effectively reduces the ﬂexibility of the model and thereby reduces the overﬁtting phenomenon. Imposing such a constraint can also be interpreted as the prior Bayesian belief that the optimal weights are close to zero – how close depends on the value of λ.-----因为正则化有一个二次项的存在，这有利有弊，它会降低模型的灵活性但是也会降低过拟合的可能性。在贝叶斯学说的认为下，正则项可以优化权值并且使得其接近0，但是这个取决于你的一个λ值的大小
• Too high a value of λ causes most of the weights to be set too close to 0, and the model does not learn anything meaningful from the training data, often obtaining poor accuracy on training, validation, and testing sets. ---λ取值要合适
##### 为什么偏置没有正则项
• 正则化的目的是为了防止过拟合，但是过拟合的表现形式是模型对于输入的微小变化产生了巨大差异，这主要是因为W的原因，有些w的参数过大。但是b是不背锅的，偏置b对于输入的改变是不敏感的，不管输入改变大还是小。

### Dropout---部分参数抛弃运算

• the idea is simple yet effective – during training, we will randomly “drop” with some probability (1−p) a subset of neurons during each forward/backward pass (or equivalently, we will keep alive each neuron with a probability p). Then, during testing, we will use the full network to compute our predictions. The result is that the network typically learns more meaningful information from the data, is less likely to overﬁt, and usually obtains higher performance overall on the task at hand. One intuitive reason why this technique should be so effective is that what dropout is doing is essentially doing is training exponentially many smaller networks at once and averaging over their predictions.---------dropout的思想就是在每次前向传播或者反向传播的时候我们按照一定的概率(1-P)冻结神经元，但是剩下概率为p的神经元是激活的，然后在测试阶段，我们使用所有的神经元。使用dropout的网络可以从数据中学到更多的知识
• However, a key subtlety is that in order for dropout to work effectively, the expected output of a neuron during testing should be approximately the same as it was during training – else the magnitude of the outputs could be radically different, and the behavior of the network is no longer well-deﬁned. Thus, we must typically divide the outputs of each neuron during testing by a certain value --------为了使得的dropout能够有效，测试阶段的神经元的预期输出应该和训练阶段大致相同---否则输出的大小存在很大差异，所以我们通常需要在测试阶段将每个神经元的输出除以P(P是存活神经元的概率)

### Parameter Initialization--参数初始化

• A key step towards achieving superlative performance with a neural network is initializing the parameters in a reasonable way. A good starting strategy is to initialize the weights to small random numbers normally distributed around 0 ---通常我们的权值随机初始化在0附近

## 加餐：优化算法----此内容来源于datawhale'颜值担当'

### NAG

下载一：中文版！学习TensorFlow、PyTorch、机器学习、深度学习和数据结构五件套！后台回复【五件套】

0 条评论

• ### 简单实例讲解为何深度学习有效

在之前的一些年里，深度学习已经占领了模式识别领域，之后又横扫了计算机数视觉，之后自然语言处理也慢慢的朝着这个方向开始了它的发展。

• ### 这么好的视频不看吗？深度学习和线代，微积分

大家盼望的中秋节和十一已经基本都要过去了，大家是不是都玩的挺开心呀？（哎，我可没0.0，基本没离开过实验室，别认为我在学习

• ### Sentiment Analysis情感分析——珍藏版

本文为Stanford Dan Jurafsky & Chris Manning: Natural Language Processing 课程笔记。

• ### Quant求职系列：Jane Street烧脑Puzzle（2019-2020）

全球顶尖的自营交易公司Jane Street创立于1999年，其对自己的描述是：“a quantitative trading firm and liquidi...

• ### Computational Geometry in Python

This post is a simplified version of the accompanying notebook to chapter 6 of m...

• ### Palabos Tutorial 1/3: First steps with Palabos

The following tutorial provides an overview of the C++ programmer interface of t...

• ### Doctor Si, Xiao：Thoughts on Chinese Safe Harbor Rules

Thoughts on Chinese Safe Harbor Rules ——Shared by Doctor Si, Xiao at Stanford...

• ### RL09 Bayesian Network

• Bayesian Networks are directedacyclic graphs (DAGs) withan associated set of ...

• ### 如何快速定位SAP CRM订单应用(Order Application)错误消息抛出的准确位置

In my blog Six kinds of debugging tips Fabian Geyer raised a very good point abo...