GPT Series
GPT1
Improving Language Understanding by Generative Pre-Training
Introduction
- The paper refers to earlier unsupervised methods such as word vectors as semi-supervised learning, and calls GPT-1's kind of unsupervised training unsupervised pre-training (treated as a subset of semi-supervised learning)
Framework

Unsupervised pre-training
Given a token sequence $\mathcal{U}=(u_1, \ldots, u_n)$, maximize the likelihood
$$
L_1(\mathcal{U})=\sum_i \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1} ; \Theta\right)
$$
where $k$ is the size of the context window
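A minimal sketch of this objective, assuming a hypothetical toy model in place of the Transformer decoder; only the sliding-window loss computation mirrors $L_1$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, k = 100, 32, 8

class ToyLM(nn.Module):
    # Stand-in "language model": embeds the k context tokens, mean-pools them,
    # and predicts the next token. A real GPT uses a multi-layer Transformer
    # decoder with causal masking instead.
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, ctx):                               # ctx: (batch, k)
        return self.out(self.emb(ctx).mean(dim=1))        # (batch, vocab)

model = ToyLM()
tokens = torch.randint(0, vocab_size, (1, 64))            # one toy "document"

# L1(U) = sum_i log P(u_i | u_{i-k}, ..., u_{i-1}); minimizing the NLL maximizes it.
nll = torch.zeros(())
for i in range(k, tokens.size(1)):
    ctx = tokens[:, i - k:i]                              # the previous k tokens
    nll = nll + F.cross_entropy(model(ctx), tokens[:, i])
loss = nll / (tokens.size(1) - k)
loss.backward()
print(float(loss))
```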
Supervised fine-tuning
$$
…
$$
The fine-tuning objective $L_2(\mathcal{C})$ is combined with the language-modeling loss $L_1(\mathcal{C})$ as an auxiliary objective, weighted by $\lambda$:
$$
L_3(\mathcal{C})=L_2(\mathcal{C})+\lambda * L_1(\mathcal{C})
$$
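A sketch of the combined objective, assuming the fine-tuned model has already produced classification logits for the labeled task and LM logits over the same text (names and shapes here are illustrative):

```python
import torch
import torch.nn.functional as F

lam = 0.5   # weight of the auxiliary LM objective; the paper uses lambda = 0.5

def combined_loss(cls_logits, labels, lm_logits, next_tokens):
    # cls_logits: (batch, n_classes), labels: (batch,)
    # lm_logits:  (batch, seq, vocab), next_tokens: (batch, seq)
    l2 = F.cross_entropy(cls_logits, labels)                      # supervised task loss (-L2)
    l1 = F.cross_entropy(lm_logits.transpose(1, 2), next_tokens)  # LM loss on the same text (-L1)
    return l2 + lam * l1                                          # minimizing this maximizes L3

loss = combined_loss(torch.randn(4, 2), torch.randint(0, 2, (4,)),
                     torch.randn(4, 16, 100), torch.randint(0, 100, (4, 16)))
print(float(loss))
```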

Setup
BERT-base was deliberately sized to be comparable to GPT-1
| | GPT-1 | BERT-base |
|---|---|---|
| Parameters | 117M | 110M |
| Dataset | BooksCorpus (800M words) | BooksCorpus (800M words) + English Wikipedia (2,500M words) |
| Vocabulary | BPE | WordPiece |
| Position embeddings | learned | learned |
| Release date | 2018.06 | 2018.10 |
GPT2
Key contributions
- zero-shot
- prompt (the paper uses this paradigm, but does not yet name or define it explicitly)
Compared with GPT-1
| | GPT-1 | GPT-2 |
|---|---|---|
| Parameters | 117M | 1.5B |
| Dataset | BooksCorpus (800M words) | WebText (~40 GB of text from outbound Reddit links) |
WebText was built by scraping outbound links from Reddit posts with at least 3 karma; the GPT-2 authors deliberately avoided Common Crawl (a large public web-crawl corpus) because of its low data quality
Parameter counts of the GPT-2 variants
Parameters | Layers | $d_{model}$ |
---|---|---|
117M | 12 | 768 |
345M | 24 | 1024 |
762M | 36 | 1280 |
1542M | 48 | 1600 |
GPT3
Key contributions
- few-shot (no fine-tuning, no gradient updates)
Framework
meta-learning

in-context learning
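A sketch of what few-shot in-context learning looks like in practice: the demonstrations live inside the prompt and nothing is fine-tuned; `query_model` is a hypothetical placeholder for a completion-API call (the English-to-French demonstrations follow the example in the GPT-3 paper):

```python
# Few-shot prompt construction: K demonstrations + the actual query.
few_shot_examples = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("plush giraffe", "girafe en peluche"),
]

def build_prompt(examples, query):
    lines = ["Translate English to French:"]              # task description
    lines += [f"{en} => {fr}" for en, fr in examples]      # K demonstrations
    lines.append(f"{query} =>")                            # the query to complete
    return "\n".join(lines)

def query_model(prompt):
    # Placeholder: a real few-shot run would send `prompt` to GPT-3 and read the
    # completion; no gradient computation or weight update happens anywhere.
    return "<model completion>"

prompt = build_prompt(few_shot_examples, "cheese")
print(prompt)
print(query_model(prompt))
```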

Setup
Training data: a logistic-regression classifier, trained with WebText and other curated corpora as positive examples, was used to filter the raw Common Crawl data
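A sketch of such a quality filter, assuming a simple hashed bag-of-words representation (the features, threshold, and toy documents are assumptions for illustration):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Positives: "high quality" text (WebText-like); negatives: raw crawled noise.
positives = ["a carefully edited article explaining machine learning concepts",
             "a well written blog post about transformer language models"]
negatives = ["buy cheap followers now click here best deal deal deal",
             "lorem ipsum boilerplate text text text text text"]

vec = HashingVectorizer(n_features=2**16, alternate_sign=False)
X = vec.transform(positives + negatives)
y = [1] * len(positives) + [0] * len(negatives)

clf = LogisticRegression().fit(X, y)   # the classifier used as a data filter

def keep(doc, threshold=0.5):
    # Keep a crawled document only if it scores as high quality.
    return clf.predict_proba(vec.transform([doc]))[0, 1] >= threshold

print(keep("an informative, well written explanation of attention mechanisms"))
```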


InstructGPT & ChatGPT
We trained this model using Reinforcement Learning from Human Feedback (RLHF), using the same methods as InstructGPT, but with slight differences in the data collection setup.
ChatGPT is fine-tuned from a model in the GPT-3.5 series, which finished training in early 2022.
GPT-3.5 series is a series of models that was trained on a blend of text and code from before Q4 2021.
Framework
GPT-3 (175B) is fine-tuned on prompts collected through the OpenAI API; for each prompt, human labelers rank the N answers the model returns, and this feedback is used to train the model via RLHF, yielding InstructGPT (even the 1.3B InstructGPT is preferred by labelers over the 175B GPT-3)

- SFT: human labelers write reference answers for the collected high-quality prompts, and GPT-3.5 is fine-tuned on them (hand-writing full answers is very expensive); see the sketch after this list
- RM: human labelers rank several candidate answers returned by the model, and the rankings are used to train a reward model (this online collection is time-consuming and still fairly costly)
- RL: once the RM is trained, the policy model is further optimized with reinforcement learning
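A minimal sketch of the SFT step referenced in the list above: plain next-token cross-entropy on (prompt + labeler-written answer) token ids; the tiny `nn.Sequential` model is only a stand-in for the pretrained LM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab = 100
# Hypothetical stand-in for the pretrained causal LM (a real run would load pretrained weights).
model = nn.Sequential(nn.Embedding(vocab, 32), nn.Linear(32, vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def sft_step(input_ids):
    # input_ids: (batch, seq) token ids of prompt + human-written answer.
    logits = model(input_ids)                                 # (batch, seq, vocab)
    loss = F.cross_entropy(logits[:, :-1].transpose(1, 2),    # predict token t+1 from tokens <= t
                           input_ids[:, 1:])
    opt.zero_grad()
    loss.backward()
    opt.step()
    return float(loss)

print(sft_step(torch.randint(0, vocab, (2, 16))))
```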
The reward model is built by replacing the final softmax layer with a fully connected head that outputs a single scalar, which is used as the reward signal. The RM is trained with a pairwise ranking loss:
$$
\operatorname{loss}(\theta)=-\frac{1}{\binom{K}{2}} E_{\left(x, y_w, y_l\right) \sim D}\left[\log \left(\sigma\left(r_\theta\left(x, y_w\right)-r_\theta\left(x, y_l\right)\right)\right)\right]
$$
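A sketch of this pairwise ranking loss, assuming the reward model has already returned one scalar per response and the $K$ responses are ordered from best to worst according to the human ranking:

```python
import itertools
import torch
import torch.nn.functional as F

def rm_pairwise_loss(rewards):
    # rewards: (K,) tensor of r_theta(x, y) for one prompt, ordered best -> worst.
    K = rewards.size(0)
    losses = [F.logsigmoid(rewards[w] - rewards[l])       # log sigma(r(y_w) - r(y_l))
              for w, l in itertools.combinations(range(K), 2)]
    return -torch.stack(losses).mean()                    # average over the C(K, 2) pairs

rewards = torch.tensor([1.8, 0.7, 0.2, -0.5], requires_grad=True)  # K = 4 ranked responses
loss = rm_pairwise_loss(rewards)
loss.backward()
print(float(loss))
```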
Overall optimization objective:
$$
\begin{aligned}
\operatorname{objective}(\phi)= & E_{(x, y) \sim D_{\pi_\phi^{\mathrm{RL}}}}\left[r_\theta(x, y)-\beta \log \left(\pi_\phi^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\right)\right]+ \\
& \gamma E_{x \sim D_{\text {pretrain }}}\left[\log \left(\pi_\phi^{\mathrm{RL}}(x)\right)\right]
\end{aligned}
$$
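A sketch that just evaluates this objective for a single sample, assuming the reward-model score and the relevant summed log-probabilities are already available; the coefficient values are illustrative, and in practice the objective is maximized with PPO rather than computed in isolation:

```python
import torch

beta, gamma = 0.02, 27.8   # KL penalty and pretraining-loss coefficients (illustrative values)

def rl_objective(r_theta, logp_rl, logp_sft, logp_rl_pretrain):
    # r_theta:           reward-model score for (x, y) sampled from the RL policy
    # logp_rl, logp_sft: log pi^RL(y|x) and log pi^SFT(y|x) for that sample
    # logp_rl_pretrain:  log pi^RL(x) over a batch of pretraining text
    kl_penalised_reward = r_theta - beta * (logp_rl - logp_sft)   # r - beta * log(pi_RL / pi_SFT)
    pretrain_term = gamma * logp_rl_pretrain.mean()               # keeps language-modeling ability
    return kl_penalised_reward + pretrain_term                    # maximized with respect to phi

obj = rl_objective(torch.tensor(1.2),
                   torch.tensor(-35.0), torch.tensor(-33.5),
                   torch.tensor([-120.0, -98.0]))
print(float(obj))
```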
Setup
The SFT dataset contains about 13k training prompts (from the API and labeler-written), the RM dataset has 33k training prompts (from the API and labeler-written), and the PPO dataset has 31k training prompts (only from the API).
