GPT1

Improving Language Understanding by Generative Pre-Training

Preface

  • The paper groups earlier unsupervised approaches such as word vectors under semi-supervised learning, and calls GPT1-style unsupervised training "unsupervised pre-training" (treated as a subset of semi-supervised learning)

Framework

Unsupervised pre-training

Given a token sequence \mathcal{U}=(u_1, \ldots, u_n), maximize the likelihood

L_1(\mathcal{U})=\sum_i \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1} ; \Theta\right)

where k is the size of the context window.
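A minimal sketch of this objective as next-token cross-entropy, assuming a PyTorch-style causal decoder `model` that maps token ids to next-token logits (the names `model`, `lm_loss`, `tokens` are illustrative, not from the paper):

```python
import torch.nn.functional as F

def lm_loss(model, tokens):
    """Negative L_1 for one batch; tokens has shape (batch, seq_len)."""
    logits = model(tokens[:, :-1])            # predict u_i from the preceding window u_{i-k..i-1}
    targets = tokens[:, 1:]                   # next-token targets
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (batch * (seq_len-1), vocab)
        targets.reshape(-1),                  # (batch * (seq_len-1),)
    )                                         # minimizing this maximizes L_1
```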

Supervised fine-tuning

......

L_3(\mathcal{C})=L_2(\mathcal{C})+\lambda * L_1(\mathcal{C})
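Here L_2(\mathcal{C}) is the supervised classification objective on the labeled dataset \mathcal{C} (elided above), as given in the paper, and \lambda weights the auxiliary language-modeling loss during fine-tuning:

L_2(\mathcal{C})=\sum_{(x, y)} \log P\left(y \mid x^1, \ldots, x^m\right)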

Setup

BERT-base is the direct counterpart of GPT1 (comparable model size):

|                     | GPT1                     | BERT-base                                      |
| ------------------- | ------------------------ | ---------------------------------------------- |
| Parameters          | 110M                     | 110M                                           |
| Training data       | BooksCorpus (800M words) | BooksCorpus & English Wikipedia (2,500M words) |
| Vocabulary          | BPE                      | WordPiece                                      |
| Position embeddings | learned                  | learned                                        |
| Release date        | 2018.06                  | 2018.10                                        |

GPT2

Key innovations

  • zero-shot
  • prompt (GPT2 uses the prompting paradigm, but does not explicitly name or define it)

Compared with GPT1

|               | GPT1                     | GPT2                                   |
| ------------- | ------------------------ | -------------------------------------- |
| Parameters    | 110M                     | 1.5B                                   |
| Training data | BooksCorpus (800M words) | WebText (Reddit outbound links, ~40 GB) |

WebText was collected by scraping outbound links from Reddit posts (with at least 3 karma); the GPT2 authors deliberately avoided Common Crawl (a public web-crawl corpus) because of its quality issues.

Parameter counts of the GPT2 model sizes

| Parameters | Layers | d_model |
| ---------- | ------ | ------- |
| 117M       | 12     | 768     |
| 345M       | 24     | 1024    |
| 762M       | 36     | 1280    |
| 1542M      | 48     | 1600    |
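As a rough sanity check on these numbers, a transformer's parameter count is approximately 12·L·d_model² for the blocks plus the embedding matrices. The sketch below assumes GPT2's BPE vocabulary of 50,257 tokens and a 1,024-token context, and ignores biases and LayerNorm parameters:

```python
def approx_gpt2_params(n_layers, d_model, vocab_size=50_257, n_ctx=1_024):
    # Per layer: ~4*d^2 for the attention projections + ~8*d^2 for the MLP.
    blocks = 12 * n_layers * d_model ** 2
    # Token embeddings (tied with the output layer) + learned position embeddings.
    embeddings = vocab_size * d_model + n_ctx * d_model
    return blocks + embeddings

# approx_gpt2_params(48, 1600) ≈ 1.56B, close to the 1542M reported for the largest GPT2.
```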

GPT3

Key innovations

  • few-shot (via examples in the context, not fine-tuning; no gradient updates)

Framework

meta-learning

in-context learning
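A minimal illustration of few-shot in-context learning: the "training examples" are placed directly in the prompt and the model's weights are never updated. The English-to-French demonstrations mirror the figure in the GPT3 paper; the `generate` call is a hypothetical placeholder:

```python
# K demonstrations are provided purely as conditioning context; no gradient updates.
few_shot_prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"     # demonstration 1
    "peppermint => menthe poivrée\n"   # demonstration 2
    "cheese => "                       # query for the model to complete
)
# completion = language_model.generate(few_shot_prompt)  # hypothetical API call
```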

Setup

Training data: a logistic-regression classifier was trained to filter Common Crawl, using high-quality corpora such as WebText as the positive reference.
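A sketch of that filtering step, assuming scikit-learn and placeholder corpora: curated text (e.g. WebText) serves as positive examples and raw Common Crawl as negatives; the stochastic Pareto keep-rule is the one reported in the GPT3 paper's appendix, while the featurization here is a simplification:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

def train_quality_filter(positive_docs, negative_docs):
    # positive_docs: curated text (e.g. WebText); negative_docs: raw Common Crawl.
    vectorizer = HashingVectorizer(n_features=2 ** 18)
    X = vectorizer.transform(positive_docs + negative_docs)
    y = np.array([1] * len(positive_docs) + [0] * len(negative_docs))
    classifier = LogisticRegression(max_iter=1000).fit(X, y)
    return vectorizer, classifier

def keep_document(doc, vectorizer, classifier, alpha=9.0):
    score = classifier.predict_proba(vectorizer.transform([doc]))[0, 1]
    # Stochastic rule: high-scoring docs are almost always kept, low-scoring ones rarely.
    return np.random.pareto(alpha) > 1 - score
```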

Parameters: up to 175B (the largest GPT3 model)

InstructGPT & ChatGPT

We trained this model using Reinforcement Learning from Human Feedback (RLHF), using the same methods as InstructGPT, but with slight differences in the data collection setup.

ChatGPT is fine-tuned from a model in the GPT-3.5 series, which finished training in early 2022.

GPT-3.5 series is a series of models that was trained on a blend of text and code from before Q4 2021.

Framework

Prompts collected through the OpenAI API were used to fine-tune GPT-3 (175B); for the N answers the model returns to a prompt, humans rank them and the rankings are fed back into training (RLHF), yielding InstructGPT (1.3B).

The authors argue that in current mainstream LMs there is a mismatch between the next-token-prediction pre-training task and prompting (following user instructions), and they propose the following three-stage method:
  1. SFT: have labelers write reference answers for the collected high-quality prompts, then fine-tune GPT3.5 on them (writing demonstrations is too costly to scale)
  2. RM: have labelers rank several outputs returned by the model and use the rankings to train a reward model (online labeling is time-consuming and still not cheap)
  3. With the RM in hand, further train the model with RL (PPO)

The RM is built by replacing the final softmax with a fully connected layer that maps to a 1-dimensional scalar, which is used as the reward. RM loss (pairwise ranking loss):

\operatorname{loss}(\theta)=-\frac{1}{\binom{K}{2}} E_{\left(x, y_w, y_l\right) \sim D}\left[\log \left(\sigma\left(r_\theta\left(x, y_w\right)-r_\theta\left(x, y_l\right)\right)\right)\right]
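A minimal PyTorch-style sketch of this pairwise loss for one prompt with K ranked responses (the reward-model forward pass that produces the scalars is assumed to happen elsewhere):

```python
import torch
import torch.nn.functional as F
from itertools import combinations

def rm_pairwise_loss(rewards):
    # rewards: tensor of shape (K,), scalar RM outputs for K responses to one prompt,
    # sorted from best (y_w) to worst (y_l) according to the human ranking.
    losses = [
        -F.logsigmoid(rewards[i] - rewards[j])       # -log sigma(r(x, y_w) - r(x, y_l))
        for i, j in combinations(range(len(rewards)), 2)
    ]
    return torch.stack(losses).mean()                # mean over the K-choose-2 pairs
```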

Overall optimization objective

\begin{aligned} \operatorname{objective}(\phi)= & E_{(x, y) \sim D_{\pi_\phi^{\mathrm{RL}}}}\left[r_\theta(x, y)-\beta \log \left(\pi_\phi^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\right)\right]+ \\ & \gamma E_{x \sim D_{\text{pretrain}}}\left[\log \left(\pi_\phi^{\mathrm{RL}}(x)\right)\right] \end{aligned}
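A sketch of how the three terms combine for one sampled response plus one pretraining batch (the log-probabilities and RM score are assumed to be computed elsewhere; β and γ are the KL and pretraining coefficients from the paper, left as arguments here, and PPO clipping / advantage estimation are omitted for brevity):

```python
def rl_objective(rm_score, logp_rl, logp_sft, logp_pretrain, beta, gamma):
    # rm_score:      r_theta(x, y) for (x, y) sampled from the current RL policy
    # logp_rl/sft:   log pi^RL(y|x) and log pi^SFT(y|x) for that same sample
    # logp_pretrain: log pi^RL(x) on text drawn from the pretraining distribution
    kl_penalty = beta * (logp_rl - logp_sft)   # keeps the policy close to the SFT model
    return (rm_score - kl_penalty) + gamma * logp_pretrain   # maximized during PPO
```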

Setup

The SFT dataset contains about 13k training prompts (from the API and labeler-written), the RM dataset has 33k training prompts (from the API and labeler-written), and the PPO dataset has 31k training prompts (only from the API).