AI · 1 min read

The Mathematical Intuition Behind the Attention Mechanism

June 20, 2026

The Self-Attention Formula

The core of the Transformer is Scaled Dot-Product Attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

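where:

$Q \in \mathbb{R}^{n \times d_k}$: the Query matrix
$K \in \mathbb{R}^{n \times d_k}$: the Key matrix
$V \in \mathbb{R}^{n \times d_v}$: the Value matrix
$d_k$: the dimension of each key (and query) vector
$n$: the sequence length (number of tokens)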

Why Scaling Is Needed

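If the components of a query and a key are independent with mean 0 and variance 1, their dot product has mean 0 and variance $d_k$. So when $d_k$ is large, the entries of $QK^T$ grow in magnitude with the dimension, which pushes softmax into its saturated region, where gradients are extremely small. Dividing by $\sqrt{d_k}$ normalizes the scores, bringing their variance back to roughly 1.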
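The variance argument above can be checked numerically. The following is a minimal sketch, assuming query and key entries are drawn i.i.d. from a standard normal distribution; the helper names `dot_product_variance` and `softmax` are illustrative and not part of any library. It uses only the Python standard library so it runs without PyTorch.

```python
import math
import random
import statistics

def dot_product_variance(d_k, trials=2000, seed=0):
    """Empirical variance of q . k when q and k have i.i.d. N(0, 1) entries."""
    rng = random.Random(seed)
    dots = [
        sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0) for _ in range(d_k))
        for _ in range(trials)
    ]
    return statistics.pvariance(dots)

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
var = dot_product_variance(d_k)
# Dividing every sample by sqrt(d_k) divides the variance by d_k.
print(f"raw variance ~ {var:.1f}, after dividing by sqrt(d_k) ~ {var / d_k:.2f}")

# The same scores at the unscaled magnitude (multiplied by sqrt(d_k) = 8)
# and at the scaled magnitude.
base = [1.0, 0.0, -1.0, 0.5]
unscaled = softmax([s * math.sqrt(d_k) for s in base])
scaled = softmax(base)
print("max weight, unscaled:", round(max(unscaled), 4))
print("max weight, scaled:  ", round(max(scaled), 4))
```

With the unscaled scores, nearly all of the weight lands on a single position, which is the saturated regime where the softmax gradient is close to zero. The scaled scores give a much flatter distribution.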

Implementation

import torch
import torch.nn.functional as F

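def self_attention(Q, K, V):
    """Scaled dot-product attention over the last two dimensions of Q, K, V."""
    d_k = Q.size(-1)  # key/query dimension
    # Similarity scores, divided by sqrt(d_k) to keep their variance near 1
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    # Normalize each query's scores into weights that sum to 1
    attn = F.softmax(scores, dim=-1)
    # Weighted sum of the value vectors
    return torch.matmul(attn, V)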

This is a minimal example of the core attention computation; it omits masking, multiple heads, and dropout.
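To trace the computation by hand, it may help to run the same three steps on a two-token input. The sketch below is a dependency-free reimplementation on nested lists, not the PyTorch function above; the name `attention` and the toy values of `Q`, `K`, and `V` are assumptions chosen for illustration. Unlike `self_attention`, it also returns the attention weights so they can be inspected.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on nested lists (each row is one token)."""
    d_k = len(Q[0])
    # scores[i][j] = (q_i . k_j) / sqrt(d_k)
    scores = [
        [sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in K]
        for q_row in Q
    ]
    # Each row of weights sums to 1
    weights = [softmax(row) for row in scores]
    # out[i] = sum_j weights[i][j] * V[j]
    out = [
        [sum(w * v_row[c] for w, v_row in zip(w_row, V)) for c in range(len(V[0]))]
        for w_row in weights
    ]
    return weights, out

# Toy input: two tokens, d_k = d_v = 2, with queries equal to keys.
Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
weights, out = attention(Q, K, V)
for w_row, o_row in zip(weights, out):
    print([round(w, 3) for w in w_row], "->", [round(o, 3) for o in o_row])
```

Each query matches its own key best, so each output row is a weighted average of the rows of `V` that leans toward the value at the same position.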