Hand-Rolling a Transformer: Encoder & Decoder


The Transformer architecture from *Attention Is All You Need*

1. Encoder

An Encoder layer consists of a multi-head attention sublayer and a feed-forward sublayer. The input embedding vectors first go through multi-head attention, which captures semantic features and long-range dependencies, followed by a residual connection and layer normalization; the feed-forward sublayer then projects into a higher-dimensional space to capture richer features, and a final residual connection plus layer normalization produces the output.

Let's trace how the tensor shapes change through the Encoder:

【Embedding】

  • Input tensor shape: [batch_size, seq_length]
  • After the Embedding layer: [batch_size, seq_length, d_model]

【Multi-Head Attention】

  • The input is split across num_heads heads; each head's Q, K and V matrices have shape [batch_size, seq_length, d_model / num_heads]
  • After the attention computation the heads are concatenated back together: [batch_size, seq_length, d_model]

【Add & Norm】

  • The attention output is added to the original input (the residual connection) and passed through layer normalization
  • The output shape stays [batch_size, seq_length, d_model]

【FFN】

  • A linear projection up to a higher dimension (hidden_dim) + ReLU activation + a linear projection back down to d_model

Final output: [batch_size, seq_length, d_model], i.e. the same shape as the tensor after the Embedding layer.
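
A quick shape check of this flow, using PyTorch's built-in nn.Embedding, nn.MultiheadAttention and nn.LayerNorm instead of the hand-rolled modules below (the sizes are arbitrary):

import torch
import torch.nn as nn

batch_size, seq_length, d_model, num_heads, hidden_dim = 2, 10, 512, 8, 2048

tokens = torch.randint(0, 1000, (batch_size, seq_length))        # [2, 10]
x = nn.Embedding(1000, d_model)(tokens)                          # [2, 10, 512]

attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
attn_out, _ = attn(x, x, x)                                      # [2, 10, 512]
x = nn.LayerNorm(d_model)(x + attn_out)                          # Add & Norm

ffn = nn.Sequential(nn.Linear(d_model, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, d_model))
x = nn.LayerNorm(d_model)(x + ffn(x))                            # Add & Norm
print(x.shape)                                                   # torch.Size([2, 10, 512])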

A PyTorch implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

import multi_attention
import layernorm
import embedding

# Position-wise feed-forward network: Linear (d_model -> hidden_dim) -> ReLU -> Dropout -> Linear (hidden_dim -> d_model)
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, hidden_dim, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.fc1 = nn.Linear(d_model, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x
    
class EncoderLayer(nn.Module):
    def __init__(self, d_model, hidden_dim, num_heads, dropout=0.1):
        super(EncoderLayer, self).__init__()
        self.attn = multi_attention.MultiHeadAttention(d_model, num_heads)
        self.norm1 = layernorm.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.ffn = PositionwiseFeedForward(d_model, hidden_dim, dropout)
        self.norm2 = layernorm.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # multi-head self-attention sublayer
        _x = x
        x = self.attn(x, x, x, mask)
        x = self.dropout1(x)
        # residual connection + layer normalization
        x = self.norm1(x + _x)
        # feed-forward sublayer
        _x = x
        x = self.ffn(x)
        x = self.dropout2(x)
        x = self.norm2(x + _x)
        return x
    
class Encoder(nn.Module):
    def __init__(self, voc_size, max_len, d_model, hidden_dim, num_heads, num_layers, dropout=0.1, device=None):
        super(Encoder, self).__init__()
        self.embed = embedding.TransformerEmbedding(voc_size, d_model, max_len, dropout, device)  # token embedding + positional encoding, same as the Decoder below
        self.layers = nn.ModuleList([EncoderLayer(d_model, hidden_dim, num_heads, dropout) for _ in range(num_layers)])

    def forward(self, x, mask=None):
        x = self.embed(x)
        for layer in self.layers:
            x = layer(x, mask)
        return x

2. Decoder

The decoder contains two attention parts: self-attention and cross-attention.

The encoder's output is used as the K and V matrices (key and value vectors) of the cross-attention part, while the Q matrix (query vectors) comes from the output of the decoder's self-attention sublayer. The cross-attention layer attends over the encoded representation, computing the attention relationship between the encoder output and the decoder input so that each decoding position can retrieve the encoder information relevant to it.
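
A small sketch of this wiring, again with PyTorch's built-in nn.MultiheadAttention rather than the hand-rolled module; enc and dec are stand-ins for the encoder output and the decoder's self-attention output:

import torch
import torch.nn as nn

d_model, num_heads = 512, 8
enc = torch.randn(2, 10, d_model)   # encoder output: K and V come from here
dec = torch.randn(2, 6, d_model)    # decoder self-attention output: Q comes from here

cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
out, _ = cross_attn(query=dec, key=enc, value=enc)
print(out.shape)                    # torch.Size([2, 6, 512]), aligned with the decoder length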

In the Multi-Head Attention module we saw that a mask can be applied to the attention scores: wherever the mask value is 0, the corresponding attention score is set to negative infinity (in practice a very large negative number), so after the Softmax those positions receive an attention weight of 0 and are effectively ignored.

In the Encoder we only have the **padding mask (s_mask)**: input sequences are padded with 0 up to the maximum length, and the mask marks which positions are padding. The Decoder uses one more mask, the **sequence mask (t_mask)**, in its self-attention layer; it ensures that when predicting the token at position i, only the positions before i are visible.
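
A minimal sketch of building the two masks, assuming (as described above) that a mask value of 0 marks positions to be blocked and that pad_id is the padding token id; the exact shape expected depends on how the hand-rolled MultiHeadAttention applies the mask:

import torch

pad_id = 0
src = torch.tensor([[5, 7, 9, 0, 0]])                    # padded source batch, [batch, src_len]

# padding mask (s_mask): 0 (False) where the token is padding
s_mask = (src != pad_id).unsqueeze(1).unsqueeze(2)       # [batch, 1, 1, src_len]

# sequence mask (t_mask): lower-triangular, position i only sees positions <= i
tgt_len = 4
t_mask = torch.tril(torch.ones(tgt_len, tgt_len, dtype=torch.bool))   # [tgt_len, tgt_len]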

The feed-forward sublayer is implemented exactly as in the encoder. Every sublayer is followed by Add & Norm and dropout.

The final output passes through a linear layer and then a Softmax to obtain a probability distribution (the softmax layer turns the vector into probabilities); looking up the vocabulary, the word with the highest probability is taken as the predicted output.
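
As a sketch, with logits standing in for the decoder's final linear output:

import torch
import torch.nn.functional as F

logits = torch.randn(1, 5, 1000)             # [batch, tgt_len, vocab_size]
probs = F.softmax(logits, dim=-1)            # probability distribution over the vocabulary
next_token = probs[:, -1, :].argmax(dim=-1)  # most likely word id at the last position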

A PyTorch implementation:

import torch
import torch.nn as nn

import multi_attention
import layernorm
import embedding

class DecoderLayer(nn.Module):
    def __init__(self, d_model, ffn_hidden, num_heads, dropout=0.1):
        super(DecoderLayer, self).__init__()
        self.attn1 = multi_attention.MultiHeadAttention(d_model, num_heads)
        self.norm1 = layernorm.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.cross_attn = multi_attention.MultiHeadAttention(d_model, num_heads)
        self.norm2 = layernorm.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)
        self.ffn = embedding.PositionwiseFeedForward(d_model, ffn_hidden, dropout)
        self.norm3 = layernorm.LayerNorm(d_model)
        self.dropout3 = nn.Dropout(dropout)

    # t_mask is the mask for the target sequence, s_mask is the mask for the source sequence
    # enc is the encoder output, dec is the decoder-side input
    def forward(self, dec, enc, t_mask, s_mask):
        # masked self-attention sublayer (uses the sequence mask t_mask)
        _x = dec
        x = self.attn1(_x, _x, _x, t_mask)
        x = self.dropout1(x)
        x = self.norm1(x + _x)
        _x = x
        # cross-attention sublayer: Q from the decoder, K and V from the encoder output (uses the padding mask s_mask)
        x = self.cross_attn(x, enc, enc, s_mask)
        x = self.dropout2(x)
        x = self.norm2(x + _x)
        # feed-forward sublayer
        _x = x
        x = self.ffn(x)
        x = self.dropout3(x)
        x = self.norm3(x + _x)
        return x
    
class Decoder(nn.Module):
    def __init__(self, dec_voc_size, max_len, d_model, hidden_dim, num_heads, num_layers, dropout=0.1, device=None):
        super(Decoder, self).__init__()
        self.embed = embedding.TransformerEmbedding(dec_voc_size, d_model, max_len, dropout, device)
        self.layers = nn.ModuleList([DecoderLayer(d_model, hidden_dim, num_heads, dropout) for _ in range(num_layers)])
        self.fc = nn.Linear(d_model, dec_voc_size)  # project to vocabulary logits; the Softmax described above is applied outside this module
    
    def forward(self, dec, enc, t_mask, s_mask):
        dec = self.embed(dec)
        for layer in self.layers:
            dec = layer(dec, enc, t_mask, s_mask)
        return self.fc(dec)
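
To tie the two halves together, a minimal usage sketch, assuming the Encoder and Decoder above are importable and that the multi_attention, layernorm and embedding modules provide the classes used in their constructors (masks are omitted for brevity):

import torch

voc_size, max_len, d_model, hidden_dim, num_heads, num_layers = 1000, 64, 512, 2048, 8, 6
enc = Encoder(voc_size, max_len, d_model, hidden_dim, num_heads, num_layers)
dec = Decoder(voc_size, max_len, d_model, hidden_dim, num_heads, num_layers)

src = torch.randint(1, voc_size, (2, 10))    # source token ids, [batch, src_len]
tgt = torch.randint(1, voc_size, (2, 6))     # target token ids, [batch, tgt_len]

memory = enc(src)                            # [2, 10, 512]
logits = dec(tgt, memory, None, None)        # [2, 6, voc_size]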
        

Author: Cyan.
License: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Cyan. when reposting.