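I had skimmed Normalization before and knew roughly how LN and BN differ. It came up in a recent interview, and I realized my understanding was still too vague, so I studied it in more depth and wrote up these notes.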

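What is Normalization

An operation that rescales the features inside a neural network so that they keep a fixed distribution (a fixed mean and variance).

How is Normalization computed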

y = \frac{x - \mathrm{E}[x]}{ \sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta

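Below is a comparison experiment I ran, computing the result by hand from the mean and variance; it matches the module's output. In practice LN is mostly used with its default parameters, i.e. elementwise_affine=True and eps=1e-5, but comparing against those by hand is more cumbersome, so here the goal is only to understand the idea.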

import torch
import torch.nn as nn

# Official NLP Example
batch, sentence_length, embedding_dim = 20, 5, 10
embedding = torch.randn(batch, sentence_length, embedding_dim)
layer_norm = nn.LayerNorm(embedding_dim, elementwise_affine=False, eps=0.0)
# Activate module
y = layer_norm(embedding)

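# Compare code: normalize by hand with the biased (population) std over the last dimension
embedding2 = embedding.clone().detach()
p = (embedding2-embedding2.mean(dim=-1, keepdim=True)) / embedding2.std(dim=-1, unbiased=False, keepdim=True)
# Both results should agree up to floating-point error
print(torch.allclose(y, p, atol=1e-5))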
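For completeness, the same check can also be done with the default settings. The following is a minimal sketch, assuming the defaults elementwise_affine=True and eps=1e-5; the random values written into weight and bias are only there so that the gamma and beta terms of the formula actually affect the result, and the tolerance is an arbitrary choice.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, sentence_length, embedding_dim = 20, 5, 10
x = torch.randn(batch, sentence_length, embedding_dim)

# Default settings: elementwise_affine=True, eps=1e-5
layer_norm = nn.LayerNorm(embedding_dim)

with torch.no_grad():
    # Illustrative values only: make gamma (weight) and beta (bias) non-trivial
    layer_norm.weight.uniform_(0.5, 1.5)
    layer_norm.bias.uniform_(-0.5, 0.5)

    y = layer_norm(x)

    # Manual version of the formula, using the biased variance over the last dimension
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    manual = (x - mean) / torch.sqrt(var + layer_norm.eps) * layer_norm.weight + layer_norm.bias

matches = torch.allclose(y, manual, atol=1e-5)
print(matches)
```

Note that eps is added to the variance inside the square root, not to the standard deviation.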

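Some questions

Which dimension does LN normalize over in BERT

Below is the HF source code, where config.hidden_size is 768 for bert-base.
"If a single integer is used, it is treated as a singleton list, and this module will normalize over the last dimension which is expected to be of that specific size." This is from the official LN docstring: if a single integer is passed, normalization is applied over the last dimension.
In other words, in NLP, LN works per token: the hidden vector of each token is normalized independently over the hidden dimension, so that every token's features keep a fixed mean and variance.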

# transformers.models.bert.modeling_bert.BertSelfOutput

class BertSelfOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor:
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states
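To make the "last dimension" behavior concrete, here is a small sketch that is independent of the HF code. The input shape (batch, seq_len, hidden) and the scaling applied to the random input are assumptions for illustration; the default eps is used instead of the config value.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size = 768  # hidden size of bert-base
# Illustrative input with non-zero mean and non-unit variance: (batch, seq_len, hidden)
x = torch.randn(2, 4, hidden_size) * 3.0 + 1.0

layer_norm = nn.LayerNorm(hidden_size)
with torch.no_grad():
    out = layer_norm(x)

# One mean and one std per token, computed over the hidden dimension
token_mean = out.mean(dim=-1)
token_std = out.std(dim=-1, unbiased=False)

print(token_mean.shape)               # one value per (batch, position) pair
print(token_mean.abs().max().item())  # close to 0
print(token_std.min().item(), token_std.max().item())  # both close to 1
```

Each of the 2 x 4 tokens ends up with mean close to 0 and std close to 1 over its 768 features, regardless of the other tokens in the batch.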

As an aside: the Chinese term for normalization (归一化) literally suggests "turning many into one". However, the torch docstring says:

Shape:
    - Input: :math:`(N, *)`
    - Output: :math:`(N, *)` (same shape as input)

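That is, the output keeps the same shape as the input. The "turning into one" actually shows up in std and mean: if we expand the mean and variance computation from the beginning of this note, each of them collapses to a single scalar along the dimension that LN normalizes over.
In the interview, when I could not recall which dimension LN works on, the interviewer hinted at what the result looks like after normalization. But I remembered that normalization does not change the input shape, which only confused me more, so I took this chance to study it and write it down clearly.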
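The shape argument can be checked directly. This is a minimal sketch that reuses the (20, 5, 10) shape from the earlier example: the statistics collapse the last dimension to size 1, while the normalized output keeps the input shape.

```python
import torch

x = torch.randn(20, 5, 10)

# keepdim=True keeps the reduced dimension with size 1 so broadcasting works
mean = x.mean(dim=-1, keepdim=True)
std = x.std(dim=-1, unbiased=False, keepdim=True)
out = (x - mean) / std

print(x.shape)     # torch.Size([20, 5, 10])
print(mean.shape)  # torch.Size([20, 5, 1])
print(std.shape)   # torch.Size([20, 5, 1])
print(out.shape)   # torch.Size([20, 5, 10])
```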

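Does Normalization need extra storage, and is it learnable

This is another question I was asked in the interview.
LN has a parameter elementwise_affine that defaults to True; in that case it is learnable (gamma and beta, exposed as weight and bias).
BN has a parameter affine that defaults to True; in that case it is learnable. In addition, BN has to keep estimates of the mean and variance across batches for use at inference time, so it needs extra storage: with the default track_running_stats=True it registers the buffers running_mean and running_var.
So in typical usage Normalization is learnable, although it can be configured not to be, and BN needs extra storage for its running statistics.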
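One way to check this is to list the parameters and buffers that each module registers. The sketch below uses nn.BatchNorm1d with 10 features as an arbitrary example; the input shift of 5.0 is only there to make the update of the running mean visible.

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(10)
bn = nn.BatchNorm1d(10)

# Learnable parameters (updated by the optimizer)
ln_params = [name for name, _ in ln.named_parameters()]
bn_params = [name for name, _ in bn.named_parameters()]
# Buffers (stored with the module, but not learned by gradient descent)
ln_buffers = [name for name, _ in ln.named_buffers()]
bn_buffers = [name for name, _ in bn.named_buffers()]

print(ln_params, ln_buffers)  # ['weight', 'bias'] []
print(bn_params, bn_buffers)  # ['weight', 'bias'] ['running_mean', 'running_var', 'num_batches_tracked']

# With the affine transform disabled, LN has nothing to learn
ln_plain = nn.LayerNorm(10, elementwise_affine=False)
num_plain_params = len(list(ln_plain.parameters()))

# In training mode, a forward pass updates BN's running statistics
before = bn.running_mean.clone()
bn.train()
bn(torch.randn(32, 10) + 5.0)
running_mean_changed = not torch.equal(before, bn.running_mean)
print(num_plain_params, running_mean_changed)
```

LN computes its statistics from the current input in both training and evaluation, which is why it registers no buffers.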