Attention Is All You Need


Attention solves the long-range dependency problem of RNN-style models and makes the computation parallelizable, while Multi-Head Attention achieves an effect similar to the multi-channel mechanism of CNNs. In the overall Transformer architecture, both the Encoder and the Decoder are built from stacked self-attention and point-wise, fully connected layers (MLP).

Encoder

The Encoder consists of 6 layers, each containing 2 sub-layers: Sub-Layer 1 is Multi-Head Self-Attention and Sub-Layer 2 is an MLP (a position-wise feed-forward network). Each sub-layer is wrapped with a residual connection followed by Layer Normalization.
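
To make this concrete, here is a minimal PyTorch sketch of one Encoder layer, assuming the post-norm "Add & Norm" ordering described above; the hyperparameters (d_model=512, n_heads=8, d_ff=2048) follow the paper's base configuration, while the class and variable names are illustrative rather than taken from any reference implementation.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # Sub-Layer 1: Multi-Head Self-Attention
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        # Sub-Layer 2: position-wise MLP (feed-forward network)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        # each sub-layer output is dropped out, added to its input
        # (residual connection) and then layer-normalized
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        x = self.norm2(x + self.dropout(self.mlp(x)))
        return x
```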

Decoder

The Decoder also consists of 6 layers, and each layer contains 3 sub-layers. Sub-Layer 1 is Masked Multi-Head Self-Attention, where the mask guarantees that "predictions for position i can depend only on the known outputs at positions less than i"; Sub-Layer 2 is Multi-Head Attention over the Encoder output (cross-attention); Sub-Layer 3 is an MLP. As in the Encoder, each sub-layer is wrapped with a residual connection and Layer Normalization.
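
In the same spirit, a sketch of one Decoder layer with its three sub-layers (again post-norm, illustrative names); memory stands for the Encoder output and tgt_mask for the causal mask discussed below.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # Sub-Layer 1: Masked Multi-Head Self-Attention over the decoder input
        self.masked_self_attn = nn.MultiheadAttention(d_model, n_heads,
                                                      dropout=dropout, batch_first=True)
        # Sub-Layer 2: Multi-Head Attention over the Encoder output (cross-attention)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                dropout=dropout, batch_first=True)
        # Sub-Layer 3: position-wise MLP
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, memory, tgt_mask):
        # tgt_mask: bool tensor of shape (tgt_len, tgt_len), True above the
        # diagonal, so position i cannot attend to positions greater than i
        a, _ = self.masked_self_attn(x, x, x, attn_mask=tgt_mask)
        x = self.norm1(x + self.dropout(a))
        a, _ = self.cross_attn(x, memory, memory)   # queries from the decoder,
        x = self.norm2(x + self.dropout(a))         # keys/values from the encoder
        x = self.norm3(x + self.dropout(self.mlp(x)))
        return x

# the causal mask can be built from an upper-triangular matrix, e.g.:
# tgt_mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)
```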

Layer Normalization

In machine translation, input sequence lengths vary from sample to sample. Compared with Batch Normalization, which normalizes across samples, Layer Normalization, which normalizes within a single sample, is therefore the better strategy.
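
A small NumPy sketch of the difference (the learned gain and bias of a real LayerNorm are omitted for brevity): LayerNorm normalizes each token vector on its own, whereas BatchNorm normalizes each feature across the batch, which becomes awkward when sequence lengths and padding differ.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize every token vector independently, over the feature axis
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def batch_norm(x, eps=1e-6):
    # normalize each feature over the batch and sequence axes -- statistics
    # get mixed across samples of different lengths
    mean = x.mean(axis=(0, 1), keepdims=True)
    std = x.std(axis=(0, 1), keepdims=True)
    return (x - mean) / (std + eps)

x = np.random.randn(2, 5, 8)                      # (batch, seq_len, d_model)
print(layer_norm(x).shape, batch_norm(x).shape)   # both (2, 5, 8)
```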

Scaled Dot-Product Attention

Attention is essentially a weighted-sum mechanism; it is computed as:

Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V
  1. Why can the inner product of Q and K be used to measure their similarity?

Image source: https://charlieleee.github.io/post/cmc/

Vector similarity is measured by the cosine of the angle between the vectors: the smaller the angle, the larger the cosine, and the more similar the two vectors. Since the dot product q·k equals |q||k|cosθ, for vectors of comparable norm a larger dot product means higher similarity.

  2. Why divide by \sqrt{d_k}?

“While for small values of d_k the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of d_k. We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients.”

Put simply, in dot-product attention, when d_k is large the dot products can become very large or very small, so the relative differences between the scores grow; after the softmax, large scores are pushed toward 1 and small scores toward 0, which makes the gradients very small and the network hard to train. Since d_k in the Transformer is fairly large, dividing by \sqrt{d_k} is a sensible choice (the code sketch after this list illustrates both the scaling and the mask).

  3. What is the mask?

The mask prevents the network from seeing, ahead of time, the inputs at positions after the current one.
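
The NumPy sketch below (illustrative shapes) puts these pieces together: dot-product scores scaled by \sqrt{d_k}, an optional causal mask that blocks every position after the current one, a softmax over the keys, and a weighted sum of V.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, causal=False):
    d_k = Q.shape[-1]
    # similarity of queries and keys via dot products, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)            # (len_q, len_k)
    if causal:
        # position i may only attend to positions <= i
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)  # masked scores -> ~0 after softmax
    return softmax(scores) @ V                 # weighted sum of the values

# without the 1/sqrt(d_k) scaling, scores for large d_k grow in magnitude and
# the softmax saturates toward one-hot outputs, shrinking the gradients
Q = K = V = np.random.randn(4, 64)             # 4 tokens, d_k = 64
print(scaled_dot_product_attention(Q, K, V, causal=True).shape)   # (4, 64)
```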

Multi-Head Attention

In Multi-Head Attention, Q, K and V first pass through Linear layers. Much like the multi-channel mechanism in CNNs, this gives the network h chances to learn a projection, so it has the opportunity to learn better projections.

MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O

where:

head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)

“In this work we employ h = 8 parallel attention layers, or heads. For each of these we use d_k = d_v = d_{model}/h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.”
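
A sketch of multi-head attention built on the scaled_dot_product_attention function above, using h = 8 heads and d_k = d_v = d_model / h = 64 as in the quoted setting; the random projection matrices are placeholders standing in for the learned weights W_i^Q, W_i^K, W_i^V and W^O.

```python
import numpy as np

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        # each head attends in its own learned projection of Q, K and V
        heads.append(scaled_dot_product_attention(Q @ Wq_i, K @ Wk_i, V @ Wv_i))
    # concatenate the h heads and project back to d_model with W^O
    return np.concatenate(heads, axis=-1) @ W_o

d_model, h, d_k = 512, 8, 64
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_k)) * 0.02 for _ in range(3))
W_o = rng.normal(size=(h * d_k, d_model)) * 0.02
x = rng.normal(size=(10, d_model))      # 10 tokens of dimension d_model
print(multi_head_attention(x, x, x, W_q, W_k, W_v, W_o).shape)    # (10, 512)
```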

Embedding & Positional Encoding

The Input Embedding maps each input token to a vector of dimension d_{model}. The embedding weights are multiplied by \sqrt{d_{model}} so that the Input Embedding and the Positional Encoding are on roughly the same scale.

The Positional Encoding uses fixed sinusoidal encodings, chosen so that relative positions are easy to represent:

PE(pos,2i) = sin(pos/10000^{2i/d_{model}})
PE(pos,2i+1) = cos(pos/10000^{2i/d_{model}})

“where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE_{pos+k} can be represented as a linear function of PE_{pos}.”
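
A NumPy sketch of the sinusoidal positional encoding together with the embedding scaling described above; the vocabulary size and the random embedding table are placeholders for a learned embedding layer.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                    # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe

d_model, vocab_size = 512, 10000
embedding = np.random.randn(vocab_size, d_model) * 0.02  # toy embedding table
tokens = np.array([5, 42, 7])
# scale the embeddings by sqrt(d_model) so they sit on a scale comparable to
# the positional encodings, whose entries lie in [-1, 1]
x = embedding[tokens] * np.sqrt(d_model) + positional_encoding(len(tokens), d_model)
print(x.shape)                                           # (3, 512)
```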

References

1. https://www.zhihu.com/question/592626839/answer/2965200007?utm_campaign=shareopn&utm_medium=social&utm_oi=703332645056049152&utm_psn=1630928281645072385

2. [跟李沐读论文] https://www.youtube.com/watch?v=nzqlFIcCSWQ
