ARTHURCHIAO'S BLOG

LLM RAG Basics: Information Retrieval, Text Embedding, and BGE-M3 Embedding in Practice (2024)

8 months 3 weeks ago

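This post collects some notes on text embedding and information retrieval. Together they form the foundation of Retrieval-Augmented Generation (RAG), a technique now commonly used when large models generate text: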

Fig. Similarity score based on BERT embedding. Image source

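Given my limited expertise and maintenance bandwidth, this post inevitably contains errors or outdated content; please use your own judgment. Spread knowledge, respect others' work, be at least 18 years old, and credit the source when reposting.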

1 Three stages in the evolution of information retrieval
2 Information retrieval: comparing three types of embedding
3 How embedding & retrieval work
4 BGE-M3 in practice
5 Rerank: re-ordering BGE-M3 retrieval results
6 Summary
References

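RAG (Retrieval-Augmented Generation) is a technique that uses information retrieval to improve what a large model generates. The steps are simple:

1. Build a high-quality document database
- Convert (encode) high-quality documents into another format, e.g. use BERT to turn text passages into numeric vectors (a process called embedding), then
- store these embeddings in a suitable database (e.g. ES or a vector database);
2. Search the database with the user's input
- Apply the same conversion (embedding) to the user's query, then
- use a similarity algorithm such as nearest neighbor to find the most similar passages in the document store (those most relevant to the question);
3. Have the large model generate the content returned to the user
- Feed the retrieved passages to the large model to help it generate the final output text, which is returned to the user.

This post focuses on the embedding & retrieval stages in steps 1 & 2 above.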
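The three steps can be sketched end to end in a few lines. This is only an illustration: `embed` here is a toy bag-of-words stand-in for a real embedding model, and `build_index`, `retrieve` and `build_prompt` are hypothetical helper names, not part of any library.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(passages):
    """Step 1: embed every passage and keep (text, vector) pairs."""
    return [(p, embed(p)) for p in passages]

def retrieve(index, query, top_k=1):
    """Step 2: embed the query the same way and return the most similar passages."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [p for p, _ in ranked[:top_k]]

def build_prompt(query, passages):
    """Step 3: hand the retrieved passages to the LLM as context."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using the context below.\n{context}\nQuestion: {query}"

index = build_index([
    "BM25 is a bag-of-words retrieval function",
    "BGE-M3 is an embedding model",
])
print(build_prompt("what is bm25", retrieve(index, "what is bm25")))
```

In a real system the index would live in ES or a vector database and the prompt would be sent to the LLM; only the shape of the pipeline is the point here.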

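1 Three stages in the evolution of information retrieval

The development of information retrieval can be roughly divided into three stages:

Statistical keyword matching
- a kind of sparse embedding: most entries of the embedding vector are 0;
Contextual and semantic understanding based on deep learning models
- a kind of dense embedding: most entries of the embedding vector are non-zero;
So-called "learned" representations, which combine the strengths of the two above and are called learned sparse embedding
- they have the contextual and semantic understanding of deep learning models;
- they also keep the interpretability and low computational cost of sparse representations.

Let's look at each in turn.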

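1.1 Statistics and keyword matching (1970s-2010s)

1.1.1 Typical algorithms: TF-IDF, BM25

Early information retrieval systems were mainly based on statistics plus keyword matching. Algorithms include:

TF-IDF (term frequency - inverse document frequency), 1970s
BM25 (Best Matching), 1980s

1.1.2 Principle

Analyze term frequency and distribution across the corpus and use them as the basis for estimating document relevance.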
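To make "term frequency and distribution" concrete, here is a minimal TF-IDF scorer. It is a sketch under simplifying assumptions (whitespace tokenization, a `log(N/df) + 1` IDF variant chosen for illustration); production systems typically use BM25 with length normalization instead.

```python
import math
from collections import Counter

def tf_idf_scores(query, docs):
    """Score each doc against the query by summing TF * IDF over query terms."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: in how many docs does each term appear?
    df = Counter(t for toks in tokenized for t in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term in tf:
                idf = math.log(n / df[term]) + 1.0  # rarer terms weigh more
                score += (tf[term] / len(toks)) * idf
        scores.append(score)
    return scores

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "vector databases store embeddings",
]
print(tf_idf_scores("cat mat", docs))
```

The first document wins because it contains both query terms and "mat" is rare in the corpus; the third scores zero because pure keyword matching has nothing to match.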

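1.1.3 Pros and cons

Pros: simple and reasonably effective, hence very widely used.
Cons: decisions rest purely on statistics such as term frequency and on keyword lookup; there is no understanding of meaning.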

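1.2 Deep learning and contextual semantics

1.2.1 Word2Vec (Google, 2013)

In 2013 Google proposed Word2Vec, which

was an early, influential attempt to represent words as high-dimensional vectors that can distinguish subtle semantic differences;
marked the shift toward machine-learning-driven information retrieval.

1.2.2 BERT (Google, 2019)

The arrival of BERT, a transformer-based pretrained language model, upended the traditional information retrieval paradigm.

Core design and strengths

The core of the transformer is self-attention,
- self-attention quantifies how strongly a given word relates to the other words in the sentence,
- in other words, it can tell what a word means in context;
BERT is a bidirectional transformer,
- during pretraining each token can attend to the context on both its left and its right;
- this captures the contextual semantics of a sentence better;
- the final output is a dense vector, which is essentially a compression of the meaning;
With dense vectors, a nearest-neighbor algorithm is enough to retrieve documents for a given query, which is powerful and semantically accurate.

Limitation: poor out-of-domain retrieval

BERT relies heavily on the domain-specific knowledge in its pretraining dataset. Pretraining biases BERT toward the characteristics of that data, so it performs poorly out-of-domain, e.g. on kinds of text it has never seen.

One remedy is fine-tuning, but it is relatively expensive because preparing a high-quality dataset is costly.

On the other hand, although traditional sparse embeddings struggle with vocabulary mismatch, they outperform BERT on out-of-domain retrieval. The reason is that these algorithms handle unseen terms by plain matching rather than by learning.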

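1.3 Learned: combining the strengths of the first two

1.3.1 Principle: fusing traditional sparse vectors with contextual information

First generate a dense embedding with a deep learning model such as BERT;
then add an extra step that sparsifies this dense embedding, yielding a sparse embedding.

Representative algorithm: BGE-M3.

1.3.2 How it differs from traditional sparse embedding

From the description above, learned sparse embedding may not look very different from traditional sparse embedding at first glance, but the two are fundamentally different. This kind of embedding

introduces Token Importance Estimation;
keeps keyword search capability while using contextual information to enrich the sparse representation;
can assess the importance of adjacent or related tokens, even when those tokens do not explicitly appear in the text.

1.3.3 Strengths

It combines sparse representation with learned context, so it offers both exact matching and semantic understanding, and generalizes well to out-of-domain scenarios;
it is more compact than dense embedding and keeps only the most essential textual information;
its inherent sparsity means vector similarity search needs very little compute;
its term-matching nature also improves interpretability: the underlying retrieval process can be inspected more precisely, which makes the system more transparent.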

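2 Information retrieval: comparing three types of embedding

Put simply, a vector embedding (or vector representation) is the numeric representation of a word or sentence in a high-dimensional vector space.

High-dimensional space: each dimension can stand for a feature or attribute; more dimensions mean higher resolution and the ability to tell subtle semantic differences apart;
Numeric representation: an embedding is usually just an array of floats, which makes it easy to compute with.

Matching the three stages from the previous section, there are three common embedding types:

traditional sparse embedding
dense embedding
learned sparse embedding

2.1 Sparse embedding (lexical matching)

Maps text to a high-dimensional vector (the dimension is usually the vocabulary size)
Most elements are 0; a non-zero value indicates a token's relative importance in a specific document, and weights are computed only for tokens that appear in the input text
Typical model: BM25 (an improvement on TF-IDF)

A very good fit for keyword-matching tasks.

2.2 Dense embedding (e.g. BERT-based)

Maps text to a (relatively low-dimensional) vector in which all dimensions are non-zero
Far fewer dimensions than sparse embedding, e.g. 1x768 by default for BERT-based models;
Typical model: BGE-v1.5

Because every dimension is non-zero and carries semantic information, the representation is very rich, which makes it suitable for semantic search tasks.

Multi-vector retrieval

Represents one piece of text with multiple vectors; can be seen as an extension of dense retrieval
Model: ColBERT

2.3 Learned sparse embedding

Combines the precision of traditional sparse embedding with the semantic richness of dense embedding:

a deep learning model "learns" the importance of relevant tokens, even ones that never appear in the text;
the resulting "learned" sparse representation effectively captures the key terms in the query and the doc.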

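3 How embedding & retrieval work

This section mainly covers how the BGE-M3 model works. BGE-M3 builds on a BERT-style encoder, so we first review the basics of BERT.

3.1 How BERT works

3.1.1 Theory

BERT paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Google, 2019)
BERT is based on the transformer, whose core is self-attention
- How Transformers work: self-attention and two kinds of Transformer in 600 lines of Python (2019)
- What is GPT? An animated walkthrough of how Transformers work (2024)

3.1.2 BERT dense embedding workflow

Taking the input "Milvus is a vector database built for scalable similarity search" as an example, the process is [2]:

Fig. BERT dense embedding.

Tokenization
1. Convert the input text into a token sequence
2. BERT also inserts two special tokens: [CLS] marks the start and [SEP] marks the end of a sentence.
Embedding: use the embedding matrix to turn each token into a vector; see the BERT paper for details;
Encoding: the vectors pass through multiple encoder layers, each made of self-attention and a feed-forward network
1. Each token's representation is refined using the context provided by all other tokens.
Output: a sequence of final embedding vectors.

The resulting dense embedding captures both the meaning of individual words and their relationships within the sentence.

Now that we know how BERT produces a dense embedding, let's see how retrieval based on it works.

3.2 How document retrieval with BERT dense embeddings works

Once dense embeddings are available, retrieving documents for a given text input is straightforward: just add an algorithm such as nearest neighbor.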
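A brute-force nearest-neighbor search over dense vectors might look like the sketch below. The 3-dimensional vectors are made up for illustration (real BERT embeddings have hundreds of dimensions), and `nearest` is a hypothetical helper; vector databases replace the linear scan with approximate indexes.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(query_vec, doc_vecs, top_k=2):
    """Return (doc_index, similarity) for the top_k most similar docs."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [(i, round(s, 3)) for s, i in scored[:top_k]]

doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
print(nearest([1.0, 0.0, 0.0], doc_vecs))
```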

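The figure below scores the similarity of two sentences; document retrieval works on the same principle:

Fig. Similarity score based on BERT embedding. Image source

Next, let's look at a concrete embedding & retrieval model: BGE-M3.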

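3.3 How does BGE-M3 (BERT-based learned sparse embedding) work?

BGE is a family of embedding models that extends BERT's capabilities. BGE-M3 is the latest one at the time of writing; the three M's emphasize its multi- capabilities:

Multi-Functionality
Multi-Linguisticity
Multi-Granularity

3.3.1 Design & features

BGE-M3 captures the importance of each token in a more fine-grained way:

Token importance estimation: for classification/similarity, BERT only looks at the first token ([CLS]), whereas BGE-M3 looks at every token Hi in the sequence;
Linear transformation: a linear layer with weights Wlex is added on top of the encoder output to score each token's importance;
Activation function:
- the product of Wlex and Hi goes through a Rectified Linear Unit (ReLU), giving each token's term weight Wt;
- ReLU zeroes out negative values and keeps the rest non-negative, which contributes to the sparsity of the embedding.
learned sparse embedding: the output is a sparse embedding in which each token has an associated weight indicating its importance in the context of the whole input text.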
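The weight computation described above, Wt = ReLU(Wlex · Hi), can be illustrated with a toy example. The 2-dimensional hidden states and the `w_lex` vector are invented numbers; keeping the maximum weight for a repeated token and scoring a query/doc pair by summing weight products over shared tokens follow my reading of the BGE-M3 paper, so treat both as assumptions rather than the reference implementation.

```python
def relu(x):
    return max(0.0, x)

def sparse_weights(tokens, hidden_states, w_lex):
    """Term weight per token: ReLU(w_lex . h_i); zero-weight tokens are dropped."""
    weights = {}
    for tok, h in zip(tokens, hidden_states):
        w = relu(sum(a * b for a, b in zip(w_lex, h)))
        if w > 0:
            # If a token occurs several times, keep its largest weight (assumption).
            weights[tok] = max(w, weights.get(tok, 0.0))
    return weights

def lexical_score(query_weights, doc_weights):
    """Sum of weight products over tokens shared by query and doc (assumption)."""
    return sum(w * doc_weights[t] for t, w in query_weights.items() if t in doc_weights)

tokens = ["milvus", "is", "vector", "is"]
hidden_states = [[1.0, 0.5], [-1.0, 0.0], [0.5, 1.0], [0.2, 0.0]]  # made-up encoder outputs
w_lex = [1.0, 0.5]                                                 # made-up linear layer

doc_weights = sparse_weights(tokens, hidden_states, w_lex)
print(doc_weights)
print(lexical_score({"milvus": 2.0, "database": 0.5}, doc_weights))
```

The first "is" gets a negative pre-activation and is zeroed by ReLU, which is where the sparsity comes from.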

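Let's look at an example.

3.3.2 How BGE-M3 produces a learned sparse embedding

Using the same input as in the earlier example:

first run the BERT dense embedding pipeline,
then add a linear layer at the end to obtain the learned sparse embedding.

Fig. BGE-M3 learned sparse embedding. Image source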

In M3-Embedding, the [CLS] embedding is used for dense retrieval, while embeddings from other tokens are used for sparse retrieval and multi-vector retrieval [3].

4 BGE-M3 in practice

4.1 Similarity scoring (retrieval)

    $ pip install FlagEmbedding peft sentencepiece

Code from the official repo, slightly modified:

    from FlagEmbedding import BGEM3FlagModel

    # Load the model from a local directory; fp16 trades a little accuracy for speed
    model = BGEM3FlagModel('/root/bge-m3', use_fp16=True)

    queries = ["What is BGE M3?",
               "Defination of BM25"]
    docs = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
            "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

    # Only the dense vectors are used here
    query_embeddings = model.encode(queries, batch_size=12, max_length=8192,)['dense_vecs']
    docs_embeddings = model.encode(docs)['dense_vecs']

    # Rows are queries, columns are docs
    similarity = query_embeddings @ docs_embeddings.T
    print(similarity)

This example matches two questions against two answers and prints the similarity of every pair (four combinations). Output:

    [[0.626  0.348 ]
     [0.3499 0.678 ]]

Question 1 vs. answer 1: similarity 0.626
Question 2 vs. answer 2: similarity 0.678
Question 1 vs. answer 2 and question 2 vs. answer 1: similarity only around 0.35

As expected.

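4.2 Fine-tuning

The goal of fine-tuning is to widen the score gap between positive and negative samples.

4.2.1 Official docs

fine-tune the dense embedding
fine-tune all embedding function of m3 (dense, sparse and colbert)

4.2.2 Training data format and requirements

The file is in jsonl format, one sample per line;
- example: toy_train_data/toy_train_data1.jsonl
Format of each sample: {"query": str, "pos": List[str], "neg":List[str]}
- query: the user's question;
- pos: list of positive samples, i.e. the answers you want the user to get; it must not be empty, so at least one positive sample is required;
- neg: list of negative samples, i.e. answers you want to avoid giving the user.
  - An empty list must be written as "neg": [""]; "neg": [] raises an error.
  - Dropping the "neg" field altogether when it is empty does not work either; the field must be present.

Note:

The file as a whole is not standard JSON (it is one JSON object per line), so dumping a single JSON document from Python does not work as a training set.
A sample must not span multiple lines: one sample per line.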
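A small helper can enforce these rules when exporting training data from Python. The function names are hypothetical; the only facts taken from the text are the three field names, the non-empty `pos` requirement, the `[""]` placeholder for empty `neg`, and one sample per line.

```python
import json

def to_jsonl_line(query, pos, neg=None):
    """Serialize one training sample as a single JSON line."""
    if not pos:
        raise ValueError("pos must contain at least one positive sample")
    sample = {
        "query": query,
        "pos": list(pos),
        # An empty neg list must be written as [""], and the field must exist.
        "neg": list(neg) if neg else [""],
    }
    # json.dumps never emits raw newlines, so one sample stays on one line.
    return json.dumps(sample, ensure_ascii=False)

def write_jsonl(samples, path):
    """Write an iterable of {"query", "pos", "neg"} dicts as a jsonl file."""
    with open(path, "w", encoding="utf-8") as f:
        for s in samples:
            f.write(to_jsonl_line(s["query"], s["pos"], s.get("neg")) + "\n")

print(to_jsonl_line("What is BM25?", ["BM25 is a ranking function."]))
```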

4.2.3 Fine-tuning command and parameters

Download the BGE-M3 model from huggingface or, within China, from modelscope:

    $ git lfs install
    $ git clone https://www.modelscope.cn/Xorbits/bge-m3.git

Fine-tuning command:

    $ cat sft.sh
    #!/bin/bash

    num_gpus=1
    output_dir=/root/bge-sft-output
    model_path=/root/bge-m3
    train_data=/data/share/bge-dataset
    batch_size=2
    query_max_len=128      # max 8192
    passage_max_len=1024   # max 8192

    torchrun --nproc_per_node $num_gpus \
        -m FlagEmbedding.BGE_M3.run \
        --output_dir $output_dir \
        --model_name_or_path $model_path \
        --train_data $train_data \
        --learning_rate 1e-5 \
        --fp16 \
        --num_train_epochs 5 \
        --per_device_train_batch_size $batch_size \
        --dataloader_drop_last True \
        --normlized True \
        --temperature 0.02 \
        --query_max_len $query_max_len \
        --passage_max_len $passage_max_len \
        --train_group_size 2 \
        --negatives_cross_device \
        --logging_steps 10 \
        --same_task_within_batch True \
        --save_steps 10000 \
        --unified_finetuning True \
        --use_self_distill True

A few parameters deserve special attention:

Maximum query & doc length
- query_max_len: the longest supported query, at most 8192;
- passage_max_len: the longest supported document (one pos or neg record), at most 8192
BGE-M3 initializes two tokenizers, one for queries and one for docs. The two parameters above map to each tokenizer's max_length, and the tokenizer supports at most 8192 (see tokenizer_config.json in the model directory).
batch_size: the degree of parallelism; it directly determines GPU memory usage and fine-tuning speed;
- once BGE-M3 is running its GPU memory usage is constant, so it is worth trying a few batch sizes to use as much memory as possible;
save_steps: how many steps between checkpoints. The default of 500 is too small: each checkpoint is ~7GB, and too many of them can fill the disk and make the job fail.

Fine-tuning speed depends on GPU compute, GPU memory and the parameter settings. Once fine-tuning starts it also prints an estimated completion time, which is fairly accurate.

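4.2.4 Testing the fine-tuned model

Reuse the code from 4.1 with a small change: instead of passing queries and docs as lists, compute the similarity score of each query against its pos/neg samples. Then run it over the test set and check whether the similarity scores improve.

If the dataset quality is decent, fine-tuning should noticeably improve the separation between positive and negative samples.

4.3 Speeding up CPU inference: converting the model to onnx

When running the model on CPU (no GPU), past engineering experience with BERT suggests that converting to onnx can make it several times faster, especially on Intel CPUs (Intel has contributed many optimizations to the community libraries).

However, BGE-M3 has no official onnx conversion guide. Conversion succeeds with a third-party library (after a small code change to load the model from a local path), but the quality of the result remains to be verified.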

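5 Rerank: re-ordering BGE-M3 retrieval results

5.1 What is rerank/reranker?

Rerank means "re-order": take the multiple results (and their scores) retrieved by the embedding model, recompute their similarity scores, and produce a new ranking. It is an optional module that refines the retrieval results so that the most similar ones are returned to the user.

5.1.1 Another kind of similarity model

A reranker is also a model that computes similarity. For example, the models in this list are all rerank/reranker models:

bge-reranker-v2-m3: the reranker that pairs with bge-m3
bge-reranker-v2-gemma: the reranker that pairs with Google gemma-2b

But they work differently from embedding models such as BGE-M3.

5.1.2 How they differ from BGE-M3 and similar models: cross-encoder vs. bi-encoder

Taking similarity scoring of two sentences as an example:

Fig. bi-encoder embedding model vs. cross-encoder model. Image source

BGE-M3 is the kind on the left, a so-called bi-encoder embedding model: the two sentences are fed to the model separately, each gets its own embedding, and similarity is computed from the embedding vectors;
a reranker is the kind on the right, a so-called cross-encoder model, which outputs the result directly. If you are familiar with how BERT works (see the BERT paper), you will recognize this as an extension of BERT's sentence-pair task (next sentence prediction, NSP).

5.2 Embedding and reranker workflow

1. The user supplies a query and a list of docs doc1/doc2/doc3/...
2. BGE-M3 computes similarity scores and returns the topN, e.g. [{doc1, score1}, {doc2, score2}, {doc3, score3}], where score1 >= score2 >= score3.
3. The reranker takes the query and BGE-M3's results and recomputes the similarity between the query and doc1/doc2/doc3 with its own model.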
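The two-stage flow can be sketched as follows. Both scoring functions are toy placeholders chosen so the example runs without any model: `overlap` stands in for the embedding model's coarse score and `density` for the reranker's finer score. Neither reflects how BGE-M3 or bge-reranker actually score text.

```python
def retrieve_then_rerank(query, docs, embed_score, rerank_score, top_n=3):
    """Stage 1: coarse top_n by embedding score. Stage 2: re-order by reranker score."""
    coarse = sorted(docs, key=lambda d: embed_score(query, d), reverse=True)[:top_n]
    return sorted(coarse, key=lambda d: rerank_score(query, d), reverse=True)

def overlap(query, doc):
    """Toy coarse score: number of distinct words shared by query and doc."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def density(query, doc):
    """Toy fine score: shared words as a fraction of the doc's length."""
    return overlap(query, doc) / len(doc.split())

docs = [
    "BM25 ranks documents",
    "BM25 is a ranking function used by search engines",
    "cats sleep a lot",
    "dense vectors capture meaning",
]
print(retrieve_then_rerank("what is bm25", docs, overlap, density, top_n=2))
```

The reranker only sees the top_n candidates, which is what keeps the slower model affordable; here it also flips the order of the two survivors.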

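5.3 Why recompute with a reranker after BGE-M3 has already produced scores?

A natural question: step 2 has already retrieved the N most relevant docs, so why go on to step 3 and compute another similarity score with a completely different model (the reranker)?

In short, embedding and rerank are both NLP encoding techniques for understanding the relationship between two given sentences (or text fragments). Referring to the earlier figure again:

Fig. bi-encoder embedding model vs. cross-encoder model. Image source

bi-encoder
- encodes the two sentences separately into two independent embeddings, then computes similarity.
- Fast, but relatively less accurate.
cross-encoder
- encodes the two sentences together and outputs one similarity score; put differently, the two sentences are merged into one input, so each is encoded in the context of the other;
- Slow, but more accurate.

To sum up: the similarity computed by the embedding model is coarse-grained and only good for a first-pass ranking; the reranker then does a fine-grained ranking of the handful of results the embedding model returned. To really appreciate the difference, read the foundational paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Google, 2019).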

6 Summary

This post collected some RAG knowledge related to BGE-M3. The first two references are excellent and much of this post comes from them; thanks to their authors.

References

[1] Enhancing Information Retrieval with Sparse Embeddings, zilliz.com/learn, 2024
[2] Exploring BGE-M3 and Splade: Two Machine Learning Models for Generating Sparse Embeddings, medium.com/@zilliz_learn, 2024
[3] BGE-M3 paper
[4] Cross encoders and bi-encoders, medium.com, 2024

Linux Clock Source TSC: Hardware and Software Principles, Use Cases, and Known Issues (2024)

9 months ago

This post collects some hardware and software knowledge about the Linux clock source tsc, which may come in handy when troubleshooting.

Fig. Scaling up crystal frequency for different components of a computer. Image source Youtube

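Given my limited expertise and maintenance bandwidth, this post inevitably contains errors or outdated content; please use your own judgment. Spread knowledge, respect others' work, be at least 18 years old, and credit the source when reposting.

1 Operating frequencies of computer components
2 x86 registers
- 2.1 General-purpose registers
- 2.2 Special-purpose registers
  - 2.2.1 model-specific register (MSR)
  - 2.2.2 One of the MSRs: TSC
3 TSC (Time Stamp Counter)
4 Inspecting and monitoring TSC
5 TSC pitfalls
6 Summary
References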

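1 Operating frequencies of computer components

1.1 Clock source: a ~20MHz quartz crystal resonator

A quartz crystal resonator is an electronic component that uses the piezoelectric effect of quartz crystal to generate a highly precise oscillation frequency.

1880: Jacques Curie and Pierre Curie discovered the piezoelectric effect.
During World War I, Paul Langevin first explored using quartz resonators in sonar.
1917: the first crystal-controlled electronic oscillator.
1918: Alexander M. Nicholson of Bell Labs was granted a patent, although it was disputed by Walter Guyton Cady, who had filed around the same time.
1921: Cady built the first quartz crystal oscillator.

Wikipedia: quartz crystal resonator

Today it typically looks like this, soldered onto the motherboard: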

Fig. A miniature 16 MHz quartz crystal enclosed in a hermetically sealed HC-49/S package, used as the resonator in a crystal oscillator. Image source wikipedia

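Due to physical limits, its frequency is only a few tens of MHz.

1.2 Clock generator: multiplying the frequency for different parts (memory, PCIe, CPU, etc.)

Memory, PCIe devices, the CPU and other components need different operating frequencies (one main reason being that other components cannot keep up with the CPU), and all of them are far above a few tens of MHz, so the frequency has to be raised. How it works:

What is a CPU clock physically?
Wikipedia: Phase-locked_loop (PLL)

One video explains it vividly:

Fig. Scaling up crystal frequency for different components of a computer. Image source Youtube

The clock generator in the figure is a dedicated chip, also soldered onto the motherboard, usually right next to the crystal.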

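1.3 How the CPU frequency gets from ~20MHz to ~3GHz

This section digs a little deeper into how the CPU frequency is raised to the ~3GHz we commonly see.

1.3.1 Signal path: ends at the CPU's CLK pin

Referring to the figure above, the clock signal is passed along and multiplied as follows:

Crystal oscillator (~20MHz)
Clock generator chip on the motherboard
Northbridge
CPU

The clock signal connects to a CPU pin named CLK. Two concrete CLK pin layouts:

Intel 486 processor (1989)

Fig. Intel 486 pin map. Image Source

By today's standards this pinout is quite simple; CLK is in the third row, third column from the end.
AMD SP3 CPU Socket (2017)

Used by the EPYC 7001/7002/7003 series. The image is too large to include here; see SP3 Pin Map.

1.3.2 Inside the CPU: another clock generator

Modern CPUs usually contain another clock generator that raises the frequency further, finally reaching the base frequency (or nominal frequency) advertised by the vendor, e.g. 2795MHz for the EPYC 7543. Compared with the original crystal frequency, that is more than a hundredfold increase.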

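2 x86 registers

Some necessary background; skip it if you already know the basics.

2.1 General-purpose registers

Fig. 32-bit x86 general purpose registers [1]

Almost all code a computer executes goes through the general-purpose registers. Further reading: A Concise x86 Assembly Guide (2017).

2.2 Special-purpose registers

As the name suggests, these serve special purposes and usually need dedicated instructions to read and write. Roughly, they fall into these classes:

control registers
debug registers
model-specific registers (MSR)

We will focus on MSRs.

2.2.1 model-specific register (MSR)

MSRs are a group of control registers in the x86 architecture, designed for debugging/tracing/monitoring and similar purposes. Below are some AMD system registers, which include the MSRs, from the AMD64 Architecture Programmer's Manual, Volume 3 (PDF):

Fig. AMD system registers, which include some MSR registers

A few related instructions:

RDMSR/WRMSR: read and write MSR registers;
CPUID: check whether the CPU supports certain features.

How to use RDMSR/WRMSR:

They require privileged mode.
The Linux msr kernel module creates a pseudo file /dev/cpu/{id}/msr that users can read and write. There is also a msr-tools package.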
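As a sketch of how the pseudo file is used: reading 8 bytes at a file offset equal to the register number returns that MSR's value. The example below reads MSR 0x10, the architectural address of the time stamp counter. It assumes the `msr` kernel module is loaded and the process runs as root; otherwise it simply reports the error.

```python
import os
import struct

IA32_TIME_STAMP_COUNTER = 0x10  # MSR address of the TSC

def decode_msr(raw):
    """An MSR is 64 bits wide; the msr device returns it as 8 little-endian bytes."""
    return struct.unpack("<Q", raw)[0]

def read_msr(cpu, reg):
    """Read one MSR on the given CPU via /dev/cpu/<cpu>/msr (needs root + msr module)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The file offset selects the register number.
        return decode_msr(os.pread(fd, 8, reg))
    finally:
        os.close(fd)

if __name__ == "__main__":
    try:
        print("TSC on cpu0:", read_msr(0, IA32_TIME_STAMP_COUNTER))
    except OSError as e:
        print("cannot read MSR (is the msr module loaded, are you root?):", e)
```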

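2.2.2 One of the MSRs: TSC

What we discuss today is one time-related MSR called the TSC (Time Stamp Counter).

3 TSC (Time Stamp Counter)

3.1 What it is: a special register in x86 processors

The Time Stamp Counter (TSC) is a 64-bit special-purpose register in x86 processors (Intel/AMD/...), and is one of the MSRs. The same figure from the AMD programming manual shows the relationship between MSRs and the TSC:

Fig. AMD system registers, which include some MSR registers

Note: on multi-core systems (almost everything today), each core (processor) has its own TSC register; in other words, it is a per-processor register.

3.2 Purpose: counting cycles since the CPU started

As described earlier, after several stages of multiplication the clock signal reaches the high frequency the CPU expects, and the CPU then runs at that frequency.

This brings in the notion of CPU cycles (clock cycles): every time the clock completes one period, the cycle count increases by 1. The TSC records the cumulative cycles since the CPU was started (or reset), which matches its name: time stamp counter.

3.3 In practice: often used as a (high-resolution) clock

Given the principle above, if the CPU frequency is constant and the CPU is never reset, then

the TSC records the number of cycles since the system booted;
cycles can be converted to time precisely;
the resolution of that time is very high;
reading it is also very cheap (this depends on the OS and kernel implementation).

So it is no surprise that many user-space programs use the TSC as a low-overhead, high-resolution clock.

3.3.1 Code

In essence a user-space program needs just one instruction (RDTSC) to read the value. A few very simple lines of code:

    unsigned long long rdtsc() {
        unsigned int lo, hi;
        __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
        return ((unsigned long long)hi << 32) | lo;
    }

This returns the CPU cycle count at the current moment, so measuring elapsed time is straightforward:

    start = rdtsc();

    // business logic here

    end = rdtsc();
    elapsed_seconds = (end-start) / cycles_per_sec;

3.3.2 Potential problems

The assumption above is that the TSC is constant, i.e. increases uniformly with wall time.

If the CPU frequency is constant (no overclocking, power saving or similar settings), cycles increase at a constant rate and the TSC does stay in step with the clock, so it can be used to get the time or to measure intervals. But as we will see, this precondition is hard to meet nowadays, and the kernel does not recommend using tsc as a measure of time either.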
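The cycles-to-time conversion is simple arithmetic, sketched below under the constant-rate assumption just described. The function name is hypothetical and the 2795MHz value is the nominal frequency quoted earlier in this post; the modulo handles the (rare) case where the 64-bit counter wraps between the two reads.

```python
def cycles_to_seconds(start, end, tsc_hz):
    """Convert two TSC readings to elapsed seconds, assuming a constant TSC rate."""
    delta = (end - start) % (1 << 64)  # the TSC is a 64-bit counter
    return delta / tsc_hz

TSC_HZ = 2_795_000_000  # nominal frequency, an assumption about the machine

# 279.5 million cycles at 2795 MHz is 100 ms
print(cycles_to_seconds(1_000, 1_000 + 279_500_000, TSC_HZ))
```

If the real TSC rate differs from `tsc_hz`, every measurement is off by the same ratio, which is exactly the failure mode discussed in the rest of this post.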

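Out-of-order execution can make RDTSC execute at a different point than intended, skewing the measurement. Two remedies:

Insert a serializing instruction such as CPUID, which forces all preceding instructions to complete before RDTSC executes;
Use the RDTSCP variant, but it only partially serializes the instruction stream, so it is not fully reliable.

3.4 Challenge: TSC accuracy is increasingly hard to guarantee

If a machine has a single processor running at a steady frequency, using the TSC for timing is fine. But the following technologies make the TSC inaccurate as a clock:

Multi-core processors: every core has its own TSC, so the question is how to keep all these TSC registers strictly in sync;
Temperature differences between processors also cause TSC drift;
Hyper-threading: two hardware threads on one processor core (seen as two CPUs in Linux);
Overclocking, downclocking and other power-management features: the clock is no longer steady;
Out-of-order execution: the instruction that reads the TSC may not execute in the expected order, skewing the measurement;
Sleep states: the TSC is reset when the CPU returns to the running state.

There are other challenges as well, all of which make it impossible to guarantee that the TSCs of multiple CPUs in one machine stay strictly in sync.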

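3.5 Improvement: constant/invariant TSC

One solution is a technique called constant-rate TSC:

on Linux, check for it with cat /proc/cpuinfo | grep constant_tsc;
on CPUs with this flag, the TSC advances at the CPU's nominal frequency; changes in the actual CPU clock frequency caused by overclocking, power management and so on do not affect the TSC.

Recent Intel and AMD processors all support this feature.

However, constant_tsc only indicates that the CPU is capable of providing a constant TSC; it does not mean the TSC is actually constant at runtime. More on this later.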
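The same check can be done programmatically by parsing the `flags` line of /proc/cpuinfo. This is a minimal sketch with a hypothetical function name; the sample text is a shortened flags line, and, as noted above, a present flag only shows capability, not runtime behavior.

```python
TSC_FLAGS = ("constant_tsc", "nonstop_tsc", "rdtscp")

def tsc_flags(cpuinfo_text):
    """Report which TSC-related flags appear in the first 'flags' line of cpuinfo."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            return {name: name in flags for name in TSC_FLAGS}
    return {}  # no flags line found (e.g. non-x86 machine)

sample = "model name : AMD EPYC 7543 32-Core Processor\nflags : fpu tsc msr rdtscp constant_tsc nonstop_tsc cpuid tsc_scale\n"
print(tsc_flags(sample))

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(tsc_flags(f.read()))
    except OSError:
        pass  # not on Linux
```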

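3.6 Recap: a counter, not a clock

From the above it is already clear that the TSC, true to its name "time stamp counter", is essentially just a counter that records the number of CPU cycles since the CPU started.

Although using it as a clock gives correct results in many cases, nothing guarantees this: too many factors affect its stability, and an unstable TSC means inaccurate timing.

In addition, it is a special register of the x86 architecture and may not exist on other CPU architectures, so code that depends on the TSC is less portable.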

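4 Inspecting and monitoring TSC

The previous sections were mostly about hardware and are easy to follow. Things get more complicated once software is involved, partly because of naming.

4.1 Linux system clock source (clocksource) configuration

We said earlier that the tsc should be seen as a counter, not a clock. On the other hand, the kernel does need a clock:

the kernel's own timers, scheduling, network packet handling and so on need one;
user programs need time functions too, e.g. gettimeofday() / clock_gettime().

Underneath, the kernel has to build its clock on some counter that has been running since boot, and the tsc is one of the candidates (with a very high priority).

    $ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
    tsc hpet acpi_pm
    $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
    tsc

4.1.1 tsc: preferred

High resolution: it is based on cycles, so the resolution is a few GHz, i.e. nanosecond level;
Low overhead: this depends on the kernel implementation.

4.1.2 hpet: too much overhead

Without going into the principle, the conclusion is: compared with tsc, hpet noticeably raises system load in many scenarios. So if tsc is usable, do not use hpet.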
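For a monitoring agent, the two sysfs files shown above can be read directly. A minimal sketch (the function name is hypothetical; the directory is the one used in the shell commands above):

```python
from pathlib import Path

CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")

def read_clocksource(base=CLOCKSOURCE_DIR):
    """Return (current clocksource, list of available clocksources)."""
    base = Path(base)
    available = (base / "available_clocksource").read_text().split()
    current = (base / "current_clocksource").read_text().strip()
    return current, available

if __name__ == "__main__":
    try:
        current, available = read_clocksource()
        print(f"current={current} available={available}")
        if current != "tsc":
            print("warning: not running on tsc")
    except OSError as e:
        print("clocksource info not available:", e)
```

Alerting when the current clocksource is not `tsc` is one simple way to notice that the kernel has fallen back to a slower source.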

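4.2 Viewing the actual TSC rate with turbostat (may be inaccurate)

As mentioned, a user-space program can read the TSC with a few lines of code, which is convenient for monitoring. We do not even need to write that code ourselves: some tools that ship with the kernel already do it, so a single shell command is enough.

turbostat is a tool that ships with the Linux kernel and can show a lot of information, including the TSC.

The turbostat source is in the kernel tree: tools/power/x86/turbostat/turbostat.c.

Without arguments, turbostat prints statistics every 5s, and the output is very rich. Here we use a trimmed-down mode that prints only each CPU's TSC frequency over the past 1s and the average across all CPUs:

    # sample 1s and only one time, print only per-CPU & average TSCs
    $ turbostat --quiet --show CPU,TSC_MHz --interval 1 --num_iterations 1
    CPU     TSC_MHz
    -       2441
    0       2445
    64      2445
    1       2445

But if turbostat runs only very briefly, e.g. 1s, the numbers are not very accurate and can be off by quite a bit; the data only becomes reliable after it has run for a while.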
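To feed these numbers into a monitoring system, the output has to be parsed. The sketch below parses text shaped like the sample above; it assumes whitespace-separated columns with a header row, and `parse_turbostat` is a hypothetical helper, not part of turbostat.

```python
def parse_turbostat(output, column="TSC_MHz"):
    """Map each CPU label ('-' is the all-CPU summary row) to the given column's value."""
    lines = [line.split() for line in output.strip().splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    col = header.index(column)
    return {row[0]: int(row[col]) for row in rows}

sample = """
CPU     TSC_MHz
-       2441
0       2445
64      2445
1       2445
"""
print(parse_turbostat(sample))
```

The sample text could be replaced with the captured stdout of the turbostat command shown above (for example via `subprocess.run`), which needs root on most systems.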

4.3 Sampling the TSC with rdtsc/rdtscp

4.3.1 C code

Full code:

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    // https://stackoverflow.com/questions/16862620/numa-get-current-node-core
    unsigned long rdtscp(int *chip, int *core) {
        unsigned a, d, c;
        __asm__ volatile("rdtscp" : "=a" (a), "=d" (d), "=c" (c));

        *chip = (c & 0xFFF000)>>12;
        *core = c & 0xFFF;
        return ((unsigned long)a) | (((unsigned long)d) << 32);
    }

    int main() {
        int sleep_us = 100000;
        unsigned long tsc_nominal_hz = 2795000000;
        unsigned long expected_inc = (unsigned long)(1.0 * sleep_us / 1000000 * tsc_nominal_hz);
        unsigned long low = (unsigned long)(expected_inc * 0.95);
        unsigned long high = (unsigned long)(expected_inc * 1.05);
        printf("Sleep interval: %d us, expected tsc increase range [%lu,%lu]\n", sleep_us, low, high);

        unsigned long start, delta;
        int start_chip=0, start_core=0, end_chip=0, end_core=0;

        while (1) {
            start = rdtscp(&start_chip, &start_core);
            usleep(sleep_us);
            delta = rdtscp(&end_chip, &end_core) - start;

            if (delta > high || delta < low) {
                time_t seconds = time(NULL); // seconds since Unix epoch (1970.1.1)
                struct tm t = *localtime(&seconds);
                printf("%02d-%02d %02d:%02d:%02d TSC jitter: %lu\n",
                        t.tm_mon + 1, t.tm_mday, t.tm_hour, t.tm_min, t.tm_sec, delta);
                fflush(stdout);
            }
        }

        return 0;
    }

A few notes:

The program hardcodes the expected TSC frequency as 2795MHz;
it samples the TSC every 100ms and prints the value whenever the increase deviates by more than +/- 5%;
the chip/cpu it ran on is not printed here, but can be if needed;
although the program samples frequently, its overhead is small, mainly because the rdtscp instruction is cheap.

4.3.2 Results

Compile and run:

    $ gcc tsc-checker.c -o tsc-checker

    # print to stdout and copy to a log file, using line buffering so each line is written out promptly
    $ stdbuf --output=L ./tsc-checker | tee tsc.log
    Sleep interval: 100000 us, expected tsc increase range [265525000,293475000]
    08-05 19:46:31 303640792
    08-05 20:13:06 301869652
    08-05 20:38:27 300751948
    08-05 22:40:39 324424884
    ...

This machine (a real server) shows occasional TSC jitter, deviating from the expected 100ms increase (279500000 cycles) by as much as 324424884/279500000 - 1 = 16%. In other words, over 100ms it can be off by 16ms, which is outrageous. When the TSC jitters repeatedly within a short time, the machine shows all sorts of odd symptoms, such as higher load, network timeouts and more active threads, because the kernel is thrown off by the clock jitter.
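The numbers in this section can be re-derived with a few lines of arithmetic. The sketch mirrors the constants of the C program (100ms sleep, 2795MHz nominal TSC, +/- 5% tolerance) but uses integer math; the function names are hypothetical.

```python
def expected_range(sleep_us, tsc_hz, tolerance_pct=5):
    """Expected TSC increase for one sleep interval, plus the accepted [low, high] window."""
    expected = sleep_us * tsc_hz // 1_000_000
    low = expected * (100 - tolerance_pct) // 100
    high = expected * (100 + tolerance_pct) // 100
    return expected, low, high

def deviation(delta, expected):
    """Relative deviation of an observed increase from the expected one."""
    return delta / expected - 1

expected, low, high = expected_range(100_000, 2_795_000_000)
print(expected, low, high)                           # 279500000 265525000 293475000
print(f"{deviation(324_424_884, expected):.1%}")     # 16.1%
```

Note that the deviation is measured against the expected increase per interval (279500000 cycles), not against the per-second frequency.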

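4.4 Monitoring

Ship the data above to a monitoring platform (e.g. Prometheus/VictoriaMetrics) with a suitable collector, and the state of the TSC becomes easy to see.

4.4.1 Based on turbostat (not recommended)

For example, the result below samples once a minute, each time taking the average TSC over the past 1s:

Fig. TSC running average of an AMD EPYC 7543 node

But as noted earlier, if turbostat runs only very briefly the numbers are not very accurate and can be off by quite a bit; the data only becomes reliable after it has run for a while. A collector, however, usually cannot afford to run for long.

4.4.2 Based on rdtscp

Writing your own collector on top of rdtscp as above is much more accurate. For example, here is the result of sampling once a minute:

Fig. TSC jitter of an AMD EPYC 7543 node

However, for catching problems caused by occasional jitter, once a minute is too coarse. The C program in the previous subsection samples every 100ms, i.e. 600 times a minute and 36,000 times an hour. Over 3 hours and more than 100,000 samples we caught only a handful of jitter events, and that already counts as lucky.

4.4.3 Based on rdtscp + a kernel module

Still rdtscp, but running as a kernel module with a timer. This should be more accurate than a user-space program because it avoids the scheduling skew introduced by the Linux scheduler.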

5 TSC pitfalls

5.1 constant_tsc: a feature, not a runtime guarantee

5.1.1 Unstable TSC on Lenovo SR645 (AMD EPYC 7543 CPU)

CPU info:

    $ cat /proc/cpuinfo
    ...
    processor   : 127
    vendor_id   : AuthenticAMD
    model name  : AMD EPYC 7543 32-Core Processor
    cpu MHz     : 3717.449
    flags       : fpu ... tsc msr rdtscp constant_tsc nonstop_tsc cpuid tsc_scale ...

The flags explicitly include constant_tsc and nonstop_tsc, so according to the documentation the TSC should be constant.

But look at the monitoring below. All machines use this CPU model and come from two different server vendors:

Fig. TSC fluctuations (delta of running average) of AMD EPYC 7543 nodes, from two server vendors

We can see that

both Lenovo and Inspur machines show TSC fluctuations,
the Lenovo ones occasionally fluctuate violently (16% or more off the 2795MHz base);
the Inspur ones fluctuate less (base 2445 MHz).

There are several possible causes, such as each vendor's BIOS logic or an SMI interrupt storm.

5.1.2 Cause and fix

It was eventually traced to the vendor's BIOS (UEFI) settings. After the following changes the TSC became much more stable:

    No.  Option                               Before              After
    1    OperatingModes.ChooseOperatingMode   Maximum Efficiency  Custom Mode
    2    Processors.DeterminismSlider         Performance         Power
    3    Processors.CorePerformanceBoost      Enable              Enable
    4    Processors.cTDP                      Auto                Maximum
    5    Processors.PackagePowerLimit         Auto                Maximum
    6    Processors.GlobalC-stateControl      Enable              Enable
    7    Processors.SOCP-states               Auto                P0
    8    Processors.DFC-States                Enable              Disable
    9    Processors.P-state1                  Enable              Disable
    10   Processors.SMTMode                   Enable              Enable
    11   Processors.CPPC                      Enable              Enable
    12   Processors.BoostFmax                 Auto                Manual
    13   Processors.BoostFmaxManual                               0
    14   Power EfficiencyMode                 Enable              Disable
    15   Memory.NUMANodesperSocket            NPS1                NPS0

Note:

Processors.BoostFmaxManual option only exists when BoostFmax=Manual;
See Tuning UEFI Settings for Performance and Energy Efficiency on 4th Gen AMD EPYC Processor-Based ThinkSystem Servers for more details of each option.

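5.2 BIOS settings that make the TSC non-constant

Beyond the specific settings above, there are other scenarios that can destabilize the TSC.

5.2.1 The TSC register is writable!

The TSC is writable, so some BIOS firmware modifies its value, which throws the operating system's timing out of sync (or at least makes it behave unexpectedly).

5.2.2 BIOS SMI handlers hide their execution by modifying the TSC

For example, a 2010 kernel community discussion, x86: Export tsc related information in sysfs, mentions that some BIOS SMI handlers hide their execution by modifying the TSC value.

Why hide it?

5.2.3 Server vendors change TSC sync logic in the BIOS for power management and other reasons

As mentioned, the constant TSC feature only means the processor is capable of it; whether that capability is used is largely up to the server vendor.

Some vendors' firmware modifies the TSC value in its TSC sync logic. That is fine from the firmware's point of view, but it breaks the kernel's view of time; for example, the kernel scheduler misbehaves. The kernel therefore eventually added a patch that handles ACPI suspend/resume so that TSC sync still works at the OS level:

    x86, tsc, sched: Recompute cyc2ns_offset's during resume from sleep states

    TSC's get reset after suspend/resume (even on cpu's with invariant TSC which runs at a constant rate across ACPI P-, C- and T-states). And in some systems BIOS seem to reinit TSC to arbitrary large value (still sync'd across cpu's) during resume.

    This leads to a scenario of scheduler rq->clock (sched_clock_cpu()) less than rq->age_stamp (introduced in 2.6.32). This leads to a big value returned by scale_rt_power() and the resulting big group power set by the update_group_power() is causing improper load balancing between busy and idle cpu's after suspend/resume.

    This resulted in multi-threaded workloads (like kernel-compilation) go slower after suspend/resume cycle on core i5 laptops.

    Fix this by recomputing cyc2ns_offset's during resume, so that sched_clock() continues from the point where it was left off during suspend.

5.3 SMI interrupt storms destabilize the TSC

The previous section mentioned that BIOS SMI handlers hide their execution by modifying the TSC. If there are many such interrupts (possibly due to a bug), a lot of time is spent handling them without being reflected in the TSC, and the system ends up stalling or showing similar problems.

AMD machines are in an awkward spot here: SMI statistics are not visible (they were visible on the few Intel machines I tried):

    $ turbostat --quiet --show CPU,TSC_MHz,SMI --interval 1 --num_iterations 1
    CPU     TSC_MHz
    -       2441
    0       2445
    64      2445
    1       2445
    ...

5.4 Unstable TSC in VMs

For example: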

https://www.phoronix.com/news/AMD-Secure-TSC-Linux-Patches
http://oliveryang.net/2015/09/pitfalls-of-TSC-usage/

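6 Summary

This post collected some hardware and software knowledge about the TSC, which may come in handy when troubleshooting.

References

[1] A Concise x86 Assembly Guide (2017)
[2] AMD64 Architecture Programmer's Manual, Volume 3 (PDF)
[3] Linux Server Power and Performance Management (1): CPU Hardware Basics (2024)
[4] Pitfalls of TSC usage, 2015
[5] Wikipedia MSR
[6] Wikipedia TSC
[7] Wikipedia Clock Generator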

JuiceFS CSI Workflow Illustrated: What Happens Behind the Scenes When K8s Creates a Pod with a PV (2024)

9 months 2 weeks ago

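JuiceFS is a distributed file system built on top of object storage (S3, Ceph, OSS, etc.). Put simply,

Object storage: can only be used in a key/value fashion;
File system: the familiar files and directories, on which you can run ls/cat/find/truncate and other file operations.

This post gives a high-level walkthrough of what the k8s/juicefs components do behind the scenes in the JuiceFS CSI solution when a pod with a PV is created and when the pod later reads and writes the PV. It is meant as a quick way to understand the K8s CSI mechanism and the basics of how JuiceFS works.

Given my limited expertise and maintenance bandwidth, this post inevitably contains errors or outdated content; please use your own judgment. Spread knowledge, respect others' work, be at least 18 years old, and credit the source when reposting.

1 Background
2 What k8s and juicefs components do when a pod that uses a PV is created
3 How a business pod reads and writes a juicefs volume
4 Summary
References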

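1 Background

A few basics; skip them if you already have the background.

1.1 K8s CSI (Container Storage Interface)

The Container Storage Interface (CSI) is a standard for exposing arbitrary block and file storage systems to containerized workloads on Container Orchestration Systems (COs) like Kubernetes.

https://kubernetes-csi.github.io/docs/

CSI is a container storage mechanism supported by K8s. It is very extensible: a storage solution only needs to implement some interfaces defined by the spec to plug into k8s and provide storage.

Typically, a storage solution deploys a service called the "CSI plugin" on every node, and kubelet calls this plugin while creating a container that has a PV. Note, however, that

the K8s network plugin, the CNI plugin, is an executable placed under /opt/cni/bin/; kubelet runs this executable directly when setting up pod networking;
the K8s storage plugin, the CSI plugin, is a service (in a sense, "agent" is an easier way to think of it); kubelet calls it over gRPC when initializing a PV.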

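1.2 FUSE (Filesystem in Userspace)

FUSE is a mechanism for implementing file systems in user space, which makes it very convenient to develop your own file system.

Rather than drawing a new diagram, I borrow one from lxcfs (unrelated to juicefs, but also a FUSE file system) to show the basics of how FUSE works:

How Linux Containers Work Under the Hood: From 500 Lines of C to a Production-Grade Container Runtime (2023)

Fig. lxcfs/fuse workflow: how a read operation is handled [2]

JuiceFS implements a user-space file system based on FUSE.

The following is condensed from the community documentation:

Traditionally, implementing a FUSE file system relies on the Linux libfuse library, which offers two APIs:

high-level API: based on file names and paths.

libfuse emulates the VFS tree internally and exposes a path-based API.

This suits systems whose metadata is itself exposed through a path-based API, such as HDFS or S3. If the metadata is an inode-based directory tree, the inode → path → inode conversion hurts performance.
low-level API: based on inodes. The kernel VFS talks to the FUSE library through the low-level API.

JuiceFS organizes its metadata by inode, so it is implemented on the low-level API (depending on go-fuse rather than libfuse), which is simple, natural and performs well.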

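1.3 Three JuiceFS working modes

JuiceFS has several working or deployment modes:

Process mount mode

The JuiceFS client runs inside the CSI Node plugin container, and every JuiceFS PV that needs mounting is mounted in this container as a process.
CSI mode, which comes in two flavors:
1. mountpod: on each node, the CSI plugin dynamically creates a helper pod for every PV used by a local pod,
  - the mount pod is per-PV rather than per-business-pod, i.e. if several business pods on a node use the same PV there is only one mount pod, as the figure below shows,
    
    Fig. JuiceFS as K8s CSI solution: workflow when a business pod is created (JuiceFS mountpod mode).
  - the mount pod contains the juicefs client and performs juicefs reads and writes on behalf of the business pods; to make the wording easier to follow, the rest of this post calls the mount pod the dynamic client pod, or client pod.
  - this is the default mode of JuiceFS CSI;
  - FUSE requires the mount pod to run privileged;
  - a client pod restart makes reads and writes from the business pods unavailable for a while, but they resume once the client pod is back.
2. CSI sidecar: a sidecar container is created for every business pod that uses a juicefs PV.
  - the sidecar is per-pod;
  - note that the sidecar is not created by the JuiceFS plugin: the CSI Controller registers a webhook that watches container changes, and when a pod is created the webhook injects a sidecar into the pod yaml automatically, much like Istio injecting the Envoy container into pods;
  - after a sidecar restart, the business pod has to be recreated to recover;
  - it also depends on FUSE, so the sidecar needs to run privileged. Every sidecar can then see all devices on the node, which is a risk, so this mode is not recommended.

1.4 Recap

With this background, let's see what the k8s and juicefs components do when a business pod that requires a PV is created in k8s.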

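2 What k8s and juicefs components do when a pod that uses a PV is created

Fig. JuiceFS as K8s CSI solution: workflow when a business pod is created (JuiceFS mountpod mode).

Step 1: kubelet starts and watches pod resources in the cluster

As the k8s agent on each node, kubelet watches pod resource changes across the whole k8s cluster once it starts. Concretely, whenever a pod create/update/delete event occurs in kube-apiserver, kubelet receives it immediately.

Step 2: kubelet receives the business pod creation event and starts creating the pod

When kubelet receives a pod create event, it first checks whether the pod is its responsibility (whether nodeName in the spec is this node); if so, it starts creating the pod.

Step 2.1 Creating the business pod: initialization

kubelet.INFO has fairly detailed logs: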

    10:05:57.410 Receiving a new pod "pod1(<pod1-id>)"
    10:05:57.411 SyncLoop (ADD, "api"): "pod1(<pod1-id>)"
    10:05:57.411 Needs to allocate 2 "nvidia.com/gpu" for pod "<pod1-id>" container "container1"
    10:05:57.411 Needs to allocate 1 "our-corp.com/ip" for pod "<pod1-id>" container "container1"
    10:05:57.413 Cgroup has some missing paths: [/sys/fs/cgroup/pids/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/systemd/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/cpuset/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/memory/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/hugetlb/kubepods/burstable/pod<pod1-id>]
    10:05:57.413 Cgroup has some missing paths: [/sys/fs/cgroup/memory/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/systemd/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/hugetlb/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/pids/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/cpuset/kubepods/burstable/pod<pod1-id>]
    10:05:57.413 Cgroup has some missing paths: [/sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/pids/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/cpuset/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/systemd/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/memory/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod<pod1-id> /sys/fs/cgroup/hugetlb/kubepods/burstable/pod<pod1-id>]
    10:05:57.415 Using factory "raw" for container "/kubepods/burstable/pod<pod1-id>"
    10:05:57.415 Added container: "/kubepods/burstable/pod<pod1-id>" (aliases: [], namespace: "")
    10:05:57.419 Waiting for volumes to attach and mount for pod "pod1(<pod1-id>)"
    10:05:57.432 SyncLoop (RECONCILE, "api"): "pod1(<pod1-id>)"
    10:05:57.471 Added volume "meminfo" (volSpec="meminfo") for pod "<pod1-id>" to desired state.
    10:05:57.471 Added volume "cpuinfo" (volSpec="cpuinfo") for pod "<pod1-id>" to desired state.
    10:05:57.471 Added volume "stat" (volSpec="stat") for pod "<pod1-id>" to desired state.
    10:05:57.480 Added volume "share-dir" (volSpec="pvc-6ee43741-29b1-4aa0-98d3-5413764d36b1") for pod "<pod1-id>" to desired state.
    10:05:57.484 Added volume "data-dir" (volSpec="juicefs-volume1-pv") for pod "<pod1-id>" to desired state.
    ...

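The log shows kubelet handling the resources the pod needs, one after another:

devices, e.g. GPUs;
IP addresses;
cgroup resource isolation settings;
volumes.

This post focuses on volumes.

Step 2.2 Handling the volumes the pod depends on

As the log above shows, the business pod declares several volumes to mount. They come in a few types:

hostpath: mounts a node path directly into the container;
lxcfs: used to fix the resource view problem [2];
dynamic/static PV

The JuiceFS volume in this post is a PV. Continuing with the kubelet log: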

    # kubelet.INFO
    10:05:57.509 operationExecutor.VerifyControllerAttachedVolume started for volume "xxx"
    10:05:57.611 Starting operationExecutor.MountVolume for volume "xxx" (UniqueName: "kubernetes.io/host-path/<pod1-id>-xxx") pod "pod1" (UID: "<pod1-id>")
    10:05:57.611 operationExecutor.MountVolume started for volume "juicefs-volume1-pv" (UniqueName: "kubernetes.io/csi/csi.juicefs.com^juicefs-volume1-pv") pod "pod1" (UID: "<pod1-id>")
    10:05:57.611 kubernetes.io/csi: mounter.GetPath generated [/var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/mount]
    10:05:57.611 kubernetes.io/csi: created path successfully [/var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv]
    10:05:57.611 kubernetes.io/csi: saving volume data file [/var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/vol_data.json]
    10:05:57.611 kubernetes.io/csi: volume data file saved successfully [/var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/vol_data.json]
    10:05:57.613 MountVolume.MountDevice succeeded for volume "juicefs-volume1-pv" (UniqueName: "kubernetes.io/csi/csi.juicefs.com^juicefs-volume1-pv") pod "pod1" (UID: "<pod1-id>") device mount path "/var/lib/k8s/kubelet/plugins/kubernetes.io/csi/pv/juicefs-volume1-pv/globalmount"
    10:05:57.616 kubernetes.io/csi: mounter.GetPath generated [/var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/mount]
    10:05:57.616 kubernetes.io/csi: Mounter.SetUpAt(/var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/mount)
    10:05:57.616 kubernetes.io/csi: created target path successfully [/var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/mount]
    10:05:57.618 kubernetes.io/csi: calling NodePublishVolume rpc [volid=juicefs-volume1-pv,target_path=/var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/mount]
    10:05:57.713 Starting operationExecutor.MountVolume for volume "juicefs-volume1-pv" (UniqueName: "kubernetes.io/csi/csi.juicefs.com^juicefs-volume1-pv") pod "pod1" (UID: "<pod1-id>")
    ...
    10:05:59.506 kubernetes.io/csi: mounter.SetUp successfully requested NodePublish [/var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/mount]
    10:05:59.506 MountVolume.SetUp succeeded for volume "juicefs-volume1-pv" (UniqueName: "kubernetes.io/csi/csi.juicefs.com^juicefs-volume1-pv") pod "pod1" (UID: "<pod1-id>")
    10:05:59.506 kubernetes.io/csi: mounter.GetPath generated [/var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/mount]

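For each volume, kubelet performs the following in order:

operationExecutor.VerifyControllerAttachedVolume(): performs some checks;
operationExecutor.MountVolume(): mounts the specified volume to the container directory;
For CSI storage, this also calls the CSI plugin’s NodePublishVolume() method to initialize the corresponding PV. JuiceFS works in this mode.

kubelet then keeps checking whether all volumes are mounted, and does not proceed to the next step (creating the sandbox container) until they are.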

Step 3: kubelet --> CSI plugin (juicefs): set up PV

Next, let’s look at how the node CSI plugin initializes the PV mount. Call stack:

         gRPC NodePublishVolume()
kubelet ---------------------------> juicefs node plugin (also called "driver", etc)

Step 4: What the JuiceFS CSI plugin does

Take a look at the JuiceFS CSI node plugin logs, here directly on the node:

(node) $ docker logs --timestamps k8s_juicefs-plugin_juicefs-csi-node-xxx | grep juicefs-volume1 10:05:57.619 NodePublishVolume: volume_id is juicefs-volume1-pv 10:05:57.619 NodePublishVolume: creating dir /var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/mount 10:05:57.620 ceFormat cmd: [/usr/local/bin/juicefs format --storage=OSS --bucket=xx --access-key=xx --secret-key=${secretkey} --token=${token} ${metaurl} juicefs-volume1] 10:05:57.874 Format output is juicefs <INFO>: Meta address: tikv://node1:2379,node2:2379,node3:2379/juicefs-volume1 10:05:57.874 cefs[1983] <INFO>: Data use oss://<bucket>/juicefs-volume1/ 10:05:57.875 Mount: mounting "tikv://node1:2379,node2:2379,node3:2379/juicefs-volume1" at "/jfs/juicefs-volume1-pv" with options [token=xx] 10:05:57.884 createOrAddRef: Need to create pod juicefs-node1-juicefs-volume1-pv. 10:05:57.891 createOrAddRed: GetMountPodPVC juicefs-volume1-pv, err: %!s(<nil>) 10:05:57.891 ceMount: mount tikv://node1:2379,node2:2379,node3:2379/juicefs-volume1 at /jfs/juicefs-volume1-pv 10:05:57.978 createOrUpdateSecret: juicefs-node1-juicefs-volume1-pv-secret, juicefs-system 10:05:59.500 waitUtilPodReady: Pod juicefs-node1-juicefs-volume1-pv is successful 10:05:59.500 NodePublishVolume: binding /jfs/juicefs-volume1-pv at /var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/mount with options [] 10:05:59.505 NodePublishVolume: mounted juicefs-volume1-pv at /var/lib/k8s/kubelet/pods/<pod1-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/mount with options []

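As shown, NodePublishVolume() was indeed called. Each CSI plugin provides its own implementation of this method, so what it does depends heavily on the storage solution. Let’s look at what the JuiceFS plugin does.

Step 4.1 Create the mount path for the pod’s PV and initialize the volume

With the default configuration, each pod has a corresponding storage path on the node: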

(node) $ ll /var/lib/k8s/kubelet/pods/<pod-id> containers/ etc-hosts plugins/ volumes/

The juicefs plugin creates a subdirectory and a mount point for the PV inside the volumes/ directory above:

/var/lib/k8s/kubelet/pods/{pod1-id}/volumes/kubernetes.io~csi/juicefs-volume1-pv/mount

It then formats the volume with the juicefs command-line tool:

$ /usr/local/bin/juicefs format --storage=OSS --bucket=xx --access-key=xx --secret-key=${secretkey} --token=${token} ${metaurl} juicefs-volume1

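For example, if JuiceFS is backed by Alibaba Cloud OSS, the options above correspond to the Alibaba Cloud bucket address and access keys.

Step 4.2 Write volume mount information to the MetaServer

In addition, the mount information is synced to the JuiceFS MetaServer, which is TiKV in this setup. We won’t go into details here: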

Fig. JuiceFS as K8s CSI solution: workflow when a business pod is created (JuiceFS mountpod mode).

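Step 4.3 JuiceFS plugin: create a client pod if none exists

The JuiceFS CSI plugin checks whether a client pod for this PV already exists on the node. If not, it creates one; otherwise nothing needs to be created.

When the last business pod using a given PV on the node is destroyed, the juicefs CSI plugin automatically deletes the corresponding client pod as well.

Our environment uses the dynamic client pod mode, so the following logs show up: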

(node) $ docker logs --timestamps <csi plugin container> | grep ... 10:05:57.884 createOrAddRef: Need to create pod juicefs-node1-juicefs-volume1-pv. 10:05:57.891 createOrAddRed: GetMountPodPVC juicefs-volume1-pv, err: %!s(<nil>) 10:05:57.891 ceMount: mount tikv://node1:2379,node2:2379,node3:2379/juicefs-volume1 at /jfs/juicefs-volume1-pv 10:05:57.978 createOrUpdateSecret: juicefs-node1-juicefs-volume1-pv-secret, juicefs-system 10:05:59.500 waitUtilPodReady:

The JuiceFS node plugin asks k8s to create a dynamic client pod named juicefs-{node}-{volume}-pv.
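The create-if-missing and delete-when-unused behaviour described in Step 4.3 amounts to reference counting per (node, PV). The following is a minimal sketch of that idea only, not the JuiceFS CSI driver's actual code: the class name `MountPodRegistry` and its methods are hypothetical, and only the pod naming pattern is taken from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class MountPodRegistry:
    """Toy model (hypothetical): one client pod per PV on a node, shared by business pods."""
    node: str
    # client pod name -> set of business pod IDs currently using it
    refs: dict = field(default_factory=dict)

    def pod_name(self, volume: str) -> str:
        # Naming pattern from the text: juicefs-{node}-{volume}-pv
        return f"juicefs-{self.node}-{volume}-pv"

    def publish(self, volume: str, business_pod: str) -> bool:
        """Register a user of the volume. Returns True if a client pod had to be created."""
        name = self.pod_name(volume)
        created = name not in self.refs
        self.refs.setdefault(name, set()).add(business_pod)
        return created

    def unpublish(self, volume: str, business_pod: str) -> bool:
        """Drop a user of the volume. Returns True if the client pod was deleted."""
        name = self.pod_name(volume)
        users = self.refs.get(name, set())
        users.discard(business_pod)
        if name in self.refs and not users:
            del self.refs[name]  # last user is gone, so the client pod goes too
            return True
        return False
```

With node "node1" and volume "juicefs-volume1", `pod_name()` yields the same pod name that appears in the plugin log, and a second business pod using the same PV reuses the existing client pod.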

Fig. JuiceFS as K8s CSI solution: workflow when a business pod is created (JuiceFS mountpod mode).

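Step 5: kubelet receives the client pod creation event

At this point kubelet has not finished creating the business pod, and the juicefs client pod that "serves" it now shows up "asking to be created" as well: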

(node) $ grep juicefs-<node>-<volume>-pv /var/log/kubernetes/kubelet.INFO | grep "received " 10:05:58.288 SyncPod received new pod "juicefs-node1-volume1-pv_juicefs-system", will create a sandbox for it

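So next we enter the creation flow of the juicefs dynamic client pod.

Provisions go before the troops: until the juicefs client pod is ready, the business pod cannot read or write the juicefs volume even if it is up.

Step 6: kubelet creates the client pod

Creating the client pod follows a flow similar to that of the business pod, but this pod is fairly simple, so we skip the details and assume it starts right away.

Check the process running inside this client pod: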

(node) $ dk top k8s_jfs-mount_juicefs-node1-juicefs-volume1-pv-xx /bin/mount.juicefs ${metaurl} /jfs/juicefs-volume1-pv -o enable-xattr,no-bgjob,allow_other,token=xxx,metrics=0.0.0.0:9567

/bin/mount.juicefs is really just an alias (a symlink) pointing to the juicefs executable:

(pod) $ ls -ahl /bin/mount.juicefs
/bin/mount.juicefs -> /usr/local/bin/juicefs

Step 7: client pod initialization and FUSE mount

Check what this client pod did:

root@node:~ # dk top k8s_jfs-mount_juicefs-node1-juicefs-volume1-pv-xx <INFO>: Meta address: tikv://node1:2379,node2:2379,node3:2379/juicefs-volume1 <INFO>: Data use oss://<oss-bucket>/juicefs-volume1/ <INFO>: Disk cache (/var/jfsCache/<id>/): capacity (10240 MB), free ratio (10%), max pending pages (15) <INFO>: Create session 667 OK with version: admin-1.2.1+2022-12-22.34c7e973 <INFO>: listen on 0.0.0.0:9567 <INFO>: Mounting volume juicefs-volume1 at /jfs/juicefs-volume1-pv ... <INFO>: OK, juicefs-volume1 is ready at /jfs/juicefs-volume1-pv

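Initializes the local volume configuration;
Talks to the MetaServer;
Exposes Prometheus metrics;
Mounts the volume at /jfs/juicefs-volume1-pv using juicefs’s own mount implementation (the /bin/mount.juicefs seen earlier); by default this corresponds to /var/lib/juicefs/volume/juicefs-volume1-pv.

At this point, the following mount entries are visible on the node: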

(node) $ cat /proc/mounts | grep JuiceFS:juicefs-volume1 JuiceFS:juicefs-volume1 /var/lib/juicefs/volume/juicefs-volume1-pv fuse.juicefs rw,relatime,user_id=0,group_id=0,default_permissions,allow_other 0 0 JuiceFS:juicefs-volume1 /var/lib/k8s/kubelet/pods/<pod-id>/volumes/kubernetes.io~csi/juicefs-volume1-pv/mount fuse.juicefs rw,relatime,user_id=0,group_id=0,default_permissions,allow_other 0 0

As shown, the mount type is fuse.juicefs. If you’ve forgotten how FUSE basically works, here is a quick refresher borrowed from lxcfs:

Fig. lxcfs/fuse workflow: how a read operation is handled [2]

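Once this dynamic client pod is created, all read/write operations of the business pod (which does not exist yet at this point) will enter the FUSE module and then be forwarded to the userspace juicefs client. The juicefs client implements the read/write methods for each supported object store.

Step 8: kubelet creates the business pod: finishing the remaining steps

Now all volumes the pod depends on are ready, and kubelet prints a log line: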

# kubelet.INFO 10:06:06.119 All volumes are attached and mounted for pod "pod1(<pod1-id>)"

The creation of the business pod can now continue:

# kubelet.INFO
10:06:06.119 No sandbox for pod "pod1(<pod1-id>)" can be found. Need to start a new one
10:06:06.119 Creating PodSandbox for pod "pod1(<pod1-id>)"
10:06:06.849 Created PodSandbox "885c3a" for pod "pod1(<pod1-id>)"
...

Summary

For a more detailed walk-through of pod creation, see [1].

3 How a business pod reads and writes a juicefs volume

The juicefs dynamic client pod is created before the business pod, so once the business pod is up it can read and write the juicefs PV (volume) right away:

Fig. JuiceFS as K8s CSI solution: workflow when a business pod reads/writes (JuiceFS mountpod mode).

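This process can be roughly divided into four steps.

Step 1: pod reads/writes files (R/W operations)

For example, enter the volume path inside the pod (e.g. cd /data/juicefs-pv-dir/) and run commands such as ls or find.

Step 2: R/W requests are hooked by the FUSE module and handed to the juicefs client

Two diagrams from the official documentation [3] illustrate this; they also reveal some details of steps 3 & 4 below:

Read operations: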

Fig. JuiceFS Internals: read operations.

Write operations:

Fig. JuiceFS Internals: write operations.

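Step 3: juicefs client pod reads (file or directory) metadata from the meta server

The diagrams above already reveal parts of the JuiceFS metadata design, such as chunk, slice, and block. During reads and writes, the client exchanges the relevant metadata with the MetaServer.

Step 4: juicefs client pod reads/writes files from/to the object store

This step reads and writes the files in an object store such as S3.

4 Summary

This is the flow of creating a pod with a PV, and of that pod reading and writing the PV, when JuiceFS serves as the k8s CSI plugin. Many details are omitted for brevity; interested readers can refer to the references below.

References

Source code walk-through: what happens behind the scenes when K8s creates a pod (series) (2021)
How Linux containers work under the hood: from 500 lines of C code to a production-grade container runtime (2023)
Official documentation: read/write request processing flow, juicefs.com
kubernetes-csi.github.io/docs/, K8s CSI documentation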
kubernetes-csi.github.io/docs/, K8s CSI documentation

TCP Requests Stuck After Connection Established（2024）

10 months ago

This post describes a kernel & BPF networking problem and the troubleshooting steps taken, an interesting case for delving into the intricacies of Linux kernel networking.

Fig. Phenomenon of a reported issue.

1 Trouble report
- 1.1 Phenomenon: probabilistic health check failures
- 1.2 Scope: specific pods on specific nodes
2 Networking fundamentals
3 Quick narrow-down
4 Dig deeper
5 Technical summary
Appendix
References

1 Trouble report

1.1 Phenomenon: probabilistic health check failures

Users reported intermittent health check failures of their pods, even though the pods were running as usual with no exceptions.

The health check is a very simple HTTP probe over TCP: kubelet periodically (e.g. every 5s) sends GET requests to local pods, initiating a new TCP connection with each request.

Fig. Intermittent health check failures of pods.

Users suspect this is a network problem.

1.2 Scope: specific pods on specific nodes

This reported issue is confined to a new k8s cluster, with recently introduced OS and kernel:

OS: AliOS (AlibabaCloud OS)
Kernel: cloud-kernel 5.10.134-16.al8.x86_64 (a fork of Linux, gitee.com/anolis/cloud-kernel), which includes upstream feature backports and self-maintained changes, for example,
1. Intel AMX (Advanced Matrix Extensions) for AI workloads, offering a hardware acceleration alternative to GPUs in certain scenarios, such as inference for LLMs smaller than 13B. AMX support was first introduced in kernel 5.16; cloud-kernel backported the feature to its current version 5.10;
2. cloud-kernel includes un-upstreamed modifications like new kernel structure fields and new enums/types.

Other environment info:

Cilium: self-maintained v1.11.10
CNCF Case Study: How Trip.com Group switched to Cilium For Scalable and Cloud Native Networking, 2023

2 Networking fundamentals

Before starting our exploration, let’s outline our networking infra in this cluster.

2.1 Node network topology: Cilium (with BPF)

Internal networking topology of our k8s node is depicted as below:

Fig. Internal networking topology of a k8s node.

(k8s node) $ route -n Destination Gateway Genmask Use Iface 0.0.0.0 <GW-IP> 0.0.0.0 eth0 <Node-IP> 0.0.0.0 <Node-IP-Mask> eth0 <Pod1-IP> 0.0.0.0 255.255.255.255 lxc-1 <Pod2-IP> 0.0.0.0 255.255.255.255 lxc-2 <Pod3-IP> 0.0.0.0 255.255.255.255 lxc-3

As shown in the picture and the kernel routing table output, each pod has a dedicated routing entry. Consequently, all health check traffic is directed straight to the lxc device (the host-side device of the pod’s veth pair) and then enters the pod. In other words, all health check traffic is processed locally.

Cilium has a similar networking topology on AlibabaCloud as on AWS. For more information, refer to Cilium Network Topology and Traffic Path on AWS (2019); it may contain some stale information, but most of the content should still be valid.

2.2 Kernel 5.10+: sockmap BPF acceleration for node2localPod traffic

2.2.1 sockops BPF: bypass kernel stack for local traffic

How to use eBPF for accelerating Cloud Native applications offers a practical example of how sockops/sockmap BPF programs work.

Chinese readers can also refer to the following for more information,

（译）利用 ebpf sockmap/redirection 提升 socket 性能（2020）
BPF 进阶笔记（五）：几种 TCP 相关的 BPF（sockops、struct_ops、header options）

2.2.2 tcpdump: only TCP 3-way/4-way handshake packets can be captured

sockops acceleration is automatically enabled in kernel 5.10 + Cilium v1.11.10:

Fig. Socket-level acceleration in Cilium. Note that the illustration depicts local processes communicating via loopback, which differs from the scenario discussed here; we simply reuse it instead of drawing a new picture.

One big conceptual change is that when sockops BPF is enabled, request & response packets no longer show up in tcpdump output. In this setup, only the TCP 3-way handshake and 4-way close still go through the kernel networking stack; all payload goes directly through the socket-level methods (e.g. the tcp/udp send/receive message functions).

A quick test to illustrate the idea: access a server in pod from the node:

(node) $ curl <pod ip>:<port>

The output of tcpdump:

(pod) $ tcpdump -nn -i eth0 host <node ip> and <port>
# TCP 3-way handshake
IP NODE_IP.36942 > POD_IP.8080: Flags [S]
IP POD_IP.8080 > NODE_IP.36942: Flags [S.]
IP NODE_IP.36942 > POD_IP.8080: Flags [.]

# requests & responses: no packets show up here, they are bypassed;
# payloads are transferred directly in socket-level TCP methods

# TCP 4-way close
IP POD_IP.8080 > NODE_IP.36942: Flags [F.]
IP NODE_IP.36942 > POD_IP.8080: Flags [.]
IP NODE_IP.36942 > POD_IP.8080: Flags [F.]
IP POD_IP.8080 > NODE_IP.36942: Flags [.]

2.3 Summary

Now we’ve got a basic understanding of the problem and the environment. It’s time to delve into practical investigation.

3 Quick narrow-down

3.1 Quick reproduction

First, check kubelet log,

$ grep "Timeout exceeded while awaiting headers" /var/log/kubernetes/kubelet.INFO prober.go] Readiness probe for POD_XXX failed (failure): Get "http://POD_IP:PORT/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers) ...

Indeed, there are many readiness probe failures.

Since the probe is a very simple HTTP request, we can issue it manually on the node, which should be equivalent to the kubelet probe:

$ curl <POD_IP>:<PORT>/v1/health OK $ curl <POD_IP>:<PORT>/v1/health OK $ curl <POD_IP>:<PORT>/v1/health # stuck ^C

OK, we can easily reproduce it without relying on k8s facilities.
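The manual curl loop above can also be scripted. The sketch below is one possible way to do it, assuming plain HTTP over a fresh TCP connection per request as described in section 1.1; the function name `probe` and its return labels are made up for illustration. It separates "could not connect" from "connected, but no response arrived before the timeout", which is the symptom reported here.

```python
import socket


def probe(host: str, port: int, path: str = "/v1/health", timeout: float = 1.0) -> str:
    """Open a fresh TCP connection, send one GET, and classify the outcome."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "connect_failed"      # the TCP handshake itself failed or timed out
    with sock:
        request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        try:
            sock.sendall(request.encode("ascii"))
            data = sock.recv(4096)   # waits for the first response bytes or the timeout
        except socket.timeout:
            return "stuck"           # connected, but no response: the reported symptom
        except OSError:
            return "io_error"
    return "ok" if data else "closed"
```

Calling `probe(POD_IP, PORT)` in a loop and counting the "stuck" results gives a rough failure rate without involving kubelet.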

3.2 Narrow-down the issue

Now let’s perform some quick tests to narrow-down the problem.

3.2.1 ping: OK, exclude L2/L3 problem

ping PodIP from node always succeeds.

(node) $ ping <POD_IP>

This indicates L2 & L3 (ARP table, routing table, etc) connectivity functions well.

3.2.2 telnet connection test: OK, exclude TCP connecting problem

(node) $ telnet POD_IP PORT
Trying POD_IP...
Connected to POD_IP.
Escape character is '^]'.

Again, this always succeeds, and the netstat output confirms that the connections always enter the ESTABLISHED state:

(node) $ netstat -antp | grep telnet
tcp 0 0 NODE_IP:34316 POD_IP:PORT ESTABLISHED 2360593/telnet

3.2.3 Remote-to-localPod curl: OK, exclude pod problem & vanilla kernel stack problem

Do the same health check from a remote node, always OK:

(node2) $ curl <POD_IP>:<PORT>/v1/health OK ... (node2) $ curl <POD_IP>:<PORT>/v1/health OK

This rules out issues with the pod itself and the vanilla kernel stack.

3.2.4 Local pod-to-pod: OK, exclude some node-internal problems

(pod3) $ curl <POD2_IP>:<PORT>/v1/health
OK
...
(pod3) $ curl <POD2_IP>:<PORT>/v1/health
OK

Always OK. Rule out issues with the pod itself.

3.3 Summary: only node-to-localPod TCP requests stuck probabilistically

Fig. Test cases and results.

The differences among the three cases:

Node-to-localPod: payload traffic is processed via sockops BPF;
Local Pod-to-Pod: BPF redirection (or kernel stack, based on your kernel version)
- Differentiate three types of eBPF redirections (2022)
RemoteNode-to-localPod: standard kernel networking stack

From these observations, it is reasonable to deduce that the issue is likely related to sockops BPF and the kernel, given the kernel’s central role in sockops BPF scenarios.

4 Dig deeper

Now let’s explore the issue in greater depth.

4.1 Linux vs. AliOS kernel

As we’ve been using kernel 5.10.56 and cilium v1.11.10 for years and haven’t met this problem before, the first reasonable assumption is that AliOS cloud-kernel 5.10.134 may introduce some incompatible changes or bugs.

So we spent some time comparing AliOS cloud-kernel with the upstream Linux.

Note: cloud-kernel is maintained on gitee.com, which restricts most read privileges (e.g. commits, blame) without logging in, so in the remainder of this post we reference the Linux repo on github.com for discussion.

4.1.1 Compare BPF features

First, compare BPF features automatically detected by cilium-agent on the node. The result is written to a local file on the node: /var/run/cilium/state/globals/bpf_features.h,

$ diff <bpf_features.h from our 5.10.56 node> <bpf_features.h from AliOS node> 59c59 < #define NO_HAVE_XSKMAP_MAP_TYPE --- > #define HAVE_XSKMAP_MAP_TYPE 71c71 < #define NO_HAVE_TASK_STORAGE_MAP_TYPE --- > #define HAVE_TASK_STORAGE_MAP_TYPE 243c243 < #define BPF__PROG_TYPE_socket_filter__HELPER_bpf_ktime_get_coarse_ns 0 --- > #define BPF__PROG_TYPE_socket_filter__HELPER_bpf_ktime_get_coarse_ns 1 ...

There are indeed some differences, but with further investigation, we didn’t find any correlation to the observed issue.

4.1.2 AliOS cloud-kernel specific changes

Then we spent some time checking the BPF and networking modifications that AliOS cloud-kernel maintains on its own, such as:

b578e4b8ed6e1c7608e07e03a061357fd79ac2dd ck: net: track the pid who created socks

In this commit, they added a pid_t pid field to the struct sock data structure.
ea0307caaf29700ff71467726b9617dcb7c0d084 tcp: make sure init the accept_queue’s spinlocks once

But again, we didn’t find any correlation to the problem.

4.2 Check detailed TCP connection stats

Without valuable information from code comparison, we redirected our focus to the environment, collecting some more detailed connection information.

ss (socket stats) is a powerful and convenient tool for socket/connection introspection:

-i/--info: show internal TCP information, including a number of TCP connection stats;
-e/--extended: show detailed socket information, including inode, uid, cookie.

4.2.1 Normal case: ss shows correct segs_out/segs_in

Initiate a connection with nc (netcat),

(node) $ nc POD_IP PORT

We intentionally do not use telnet here, because telnet closes the connection immediately after a request is served successfully, which leaves us no time to check the connection stats in ss output. nc leaves the connection in CLOSE-WAIT state, which is good enough for us to check the connection send/receive stats.

Now the stats for this connection:

(node) $ ss -i | grep -A 1 50504 tcp ESTAB 0 0 NODE_IP:50504 POD_IP:PORT cubic wscale:7,7 rto:200 rtt:0.059/0.029 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:1 segs_out:2 segs_in:1 send 1963.4Mbps lastsnd:14641 lastrcv:14641 lastack:14641 pacing_rate 3926.8Mbps delivered:1 rcv_space:14480 rcv_ssthresh:64088 minrtt:0.059

Send & receive stats: segs_out=2, segs_in=1.

Now let’s send a request to the server: type GET /v1/health HTTP/1.1\r\n then press Enter,

Actually you can type anything and just press Enter. The server will most likely send you a 400 (Bad Request) response, but for our case, this 400 indicates the TCP send/receive path is perfectly OK!

(node) $ nc POD_IP PORT GET /v1/health HTTP/1.1\r\n <Response Here>

We’ll get the response and the connection will enter the CLOSE-WAIT state, which leaves us some time to check it before it vanishes:

(node) $ ss -i | grep -A 1 50504 tcp CLOSE-WAIT 0 0 NODE_IP:50504 POD_IP:http cubic wscale:7,7 rto:200 rtt:0.059/0.029 ato:40 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:1 bytes_received:1 segs_out:3 segs_in:2 send 1963.4Mbps lastsnd:24277 lastrcv:24277 lastack:4399 pacing_rate 3926.8Mbps delivered:1 rcv_space:14480 rcv_ssthresh:64088 minrtt:0.059

As expected, segs_out=3, segs_in=2.

4.2.2 Abnormal case: ss shows incorrect segs_out/segs_in

Repeat the above test to capture a failed one.

On connection established,

$ ss -i | grep -A 1 57424 tcp ESTAB 0 0 NODE_IP:57424 POD_IP:webcache cubic wscale:7,7 rto:200 rtt:0.056/0.028 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:1 segs_out:2 segs_in:1 send 2068.6Mbps lastsnd:10686 lastrcv:10686 lastack:10686 pacing_rate 4137.1Mbps delivered:1 rcv_space:14480 rcv_ssthresh:64088 minrtt:0.056

After typing the request content and pressing Enter:

(node) $ ss -i | grep -A 1 57424 tcp ESTAB 0 0 NODE_IP:57424 POD_IP:webcache cubic wscale:7,7 rto:200 rtt:0.056/0.028 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:1 segs_out:2 segs_in:1 send 2068.6Mbps lastsnd:21994 lastrcv:21994 lastack:21994 pacing_rate 4137.1Mbps delivered:1 rcv_space:14480 rcv_ssthresh:64088 minrtt:0.056

The segments sent/received stats remain unchanged (segs_out=2, segs_in=1), suggesting that the problem may reside at the tcp {send,receive} message level.
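The before/after comparison of segs_out and segs_in can be automated when many connections need checking. This is a small sketch that assumes the `name:value` layout of the `ss -i` output shown above; the helper names `seg_counters` and `payload_moved` are hypothetical.

```python
import re


def seg_counters(ss_info: str) -> dict:
    """Extract segs_out / segs_in from the info line of `ss -i` for one connection."""
    found = dict(re.findall(r"\b(segs_out|segs_in):(\d+)", ss_info))
    return {name: int(value) for name, value in found.items()}


def payload_moved(before: str, after: str) -> bool:
    """True if either counter increased between two snapshots of the same connection."""
    b, a = seg_counters(before), seg_counters(after)
    return any(a.get(k, 0) > b.get(k, 0) for k in ("segs_out", "segs_in"))
```

Fed with the two snapshots from the normal case, `payload_moved` returns True; with the two snapshots from the abnormal case, it returns False.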

4.3 Trace related call stack

Based on the above hypothesis, we captured kernel call stacks to compare failed and successful requests.

4.3.1 trace-cmd: trace kernel call stacks

Trace 10 seconds, filter by server process ID, save the calling stack graph,

# filter by process ID (PID of the server in the pod) $ trace-cmd record -P 178501 -p function_graph sleep 10

Caution: avoid tracing in production to prevent large file generation and excessive disk IO.

During this period, send a request,

(node) $ curl POD_IP:PORT

By default, trace-cmd saves data to a local file in the current directory; the content looks like this:

$ trace-cmd report > report-1.graph CPU 1 is empty CPU 2 is empty ... CPU 63 is empty cpus=64 <idle>-0 [022] 5376816.422992: funcgraph_entry: 2.441 us | update_acpu.constprop.0(); <idle>-0 [022] 5376816.422994: funcgraph_entry: | switch_mm_irqs_off() { <idle>-0 [022] 5376816.422994: funcgraph_entry: 0.195 us | choose_new_asid(); <idle>-0 [022] 5376816.422994: funcgraph_entry: 0.257 us | load_new_mm_cr3(); <idle>-0 [022] 5376816.422995: funcgraph_entry: 0.128 us | switch_ldt(); <idle>-0 [022] 5376816.422995: funcgraph_exit: 1.378 us | } ...

Use | as delimiter (this preserves the calling stack and the proper leading whitespaces) and save the last fields into a dedicated file:

$ awk -F'|' '{print $NF}' report-1.graph > stack-1.txt

Compare them with diff or vimdiff:

$ vimdiff stack-1.txt stack-2.txt
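For readers who prefer a scripted comparison over vimdiff, the awk and diff steps above can be approximated in a few lines. This is only a sketch: `call_column` and `diff_stacks` are hypothetical helper names, and the inputs are the lines of two `trace-cmd report` outputs.

```python
import difflib


def call_column(report_lines):
    """Keep only the text after the last '|' on each line, like awk -F'|' '{print $NF}'."""
    return [line.rstrip("\n").rsplit("|", 1)[-1] for line in report_lines]


def diff_stacks(good_report, bad_report):
    """Unified diff of the call columns of two trace-cmd reports."""
    return list(difflib.unified_diff(call_column(good_report), call_column(bad_report),
                                     "normal", "problematic", lineterm=""))
```

Because the leading whitespace after `|` is preserved, the nesting of the call graph survives in the diff, just as it does with the awk command.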

Here are two traces, the left is a trace of a normal request, and the right is a problematic one:

Fig. Traces (call stacks) of a normal request (left side) and a problematic request (right side).

As can be seen, for a failed request, kernel made a wrong function call: it should call tcp_bpf_recvmsg() but actually called tcp_recvmsg().

4.3.2 Locate the code: inet_recvmsg -> {tcp_bpf_recvmsg, tcp_recvmsg}

Calling into tcp_bpf_recvmsg or tcp_recvmsg from inet_recvmsg is a piece of concise code, illustrated below,

// https://github.com/torvalds/linux/blob/v5.10/net/ipv4/af_inet.c#L838
int inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t size, int flags)
{
    struct sock *sk = sock->sk;
    int addr_len = 0;
    int err;

    if (likely(!(flags & MSG_ERRQUEUE)))
        sock_rps_record_flow(sk);

    err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg,
                          sk, msg, size, flags & MSG_DONTWAIT,
                          flags & ~MSG_DONTWAIT, &addr_len);
    if (err >= 0)
        msg->msg_namelen = addr_len;
    return err;
}

sk_prot ("socket protocol") contains the handlers of this socket. For a TCP socket, the INDIRECT_CALL_2 line can be simplified into the following pseudocode:

if sk->sk_prot->recvmsg == tcp_recvmsg: // if socket protocol handler is tcp_recvmsg
    tcp_recvmsg()
else:
    tcp_bpf_recvmsg()

This suggests that when requests fail, the sk_prot->recvmsg pointer of the socket is likely incorrect.

4.3.3 Double check with bpftrace

While trace-cmd is a powerful tool, its output may contain too many distracting details, and it may exhaust your disk space if the filter parameters are set improperly.

bpftrace is another tracing tool, and it doesn’t write data to a local file by default. Now let’s double-check the above results with it.

Again, run curl POD_IP:PORT several times, capture only tcp_recvmsg and tcp_bpf_recvmsg calls, and print the kernel call stacks:

$ bpftrace -e 'k:tcp_recvmsg /pid==178501/ { printf("%s\n", kstack);} k:tcp_bpf_recvmsg /pid==178501/ { printf("%s\n", kstack);} ' tcp_bpf_recvmsg+1 # <-- correspond to a successful request inet_recvmsg+233 __sys_recvfrom+362 __x64_sys_recvfrom+37 do_syscall_64+48 entry_SYSCALL_64_after_hwframe+97 tcp_bpf_recvmsg+1 # <-- correspond to a successful request inet_recvmsg+233 __sys_recvfrom+362 __x64_sys_recvfrom+37 do_syscall_64+48 entry_SYSCALL_64_after_hwframe+97 tcp_recvmsg+1 # <-- correspond to a failed request inet_recvmsg+78 __sys_recvfrom+362 __x64_sys_recvfrom+37 do_syscall_64+48 entry_SYSCALL_64_after_hwframe+97

You could also filter by client program name (comm field in kernel data structure), such as,

$ bpftrace -e 'k:tcp_bpf_recvmsg /comm=="curl"/ { printf("%s", kstack); }'

As seen above, successful requests were directed to tcp_bpf_recvmsg, while failed ones were routed to tcp_recvmsg.

4.3.4 Summary

tcp_recvmsg waits for messages from the kernel networking stack. In the case of sockops BPF, messages bypass the kernel stack, which explains why some requests fail (timeout) while TCP connecting is always OK.
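The mismatch can be pictured with a toy model. The sketch below is a deliberate simplification and not kernel code: it assumes that redirected payload lands on a separate per-socket queue that only the BPF receive handler looks at, and the names `ToySocket` and `redirected_send` are invented for illustration.

```python
from collections import deque


class ToySocket:
    """Two receive queues: one fed by the kernel stack, one fed by socket-level redirection."""

    def __init__(self, use_bpf_handler: bool):
        self.stack_queue = deque()   # filled by the regular kernel networking stack
        self.psock_queue = deque()   # filled by socket-level BPF redirection (assumed)
        self.use_bpf_handler = use_bpf_handler

    def recvmsg(self):
        """Return the next payload; None stands in for 'blocks until the timeout'."""
        if self.use_bpf_handler and self.psock_queue:
            # tcp_bpf_recvmsg-like path: redirected payload is visible here
            return self.psock_queue.popleft()
        # tcp_recvmsg-like path: only the stack queue is consulted
        return self.stack_queue.popleft() if self.stack_queue else None


def redirected_send(peer: ToySocket, payload: bytes) -> None:
    """Sender using BPF handlers: the payload skips the stack queue entirely."""
    peer.psock_queue.append(payload)
```

A receiver created with `use_bpf_handler=True` gets the redirected payload, while one created with `use_bpf_handler=False` finds nothing to read even though the connection is established, which mirrors the stuck requests.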

We reported the above findings to the cloud-kernel team, and they did some further investigations with us.

4.4 recvmsg handler initialization in kernel stack

In short,

Fig. sockops BPF: connection establishment and socket handler initialization.

According to the above picture, the recvmsg handler will be incorrectly initialized if the to-be-inserted entry already exists in sockmap (the end of step 3.1).

What the two entries of a connection look like in the BPF map:

(cilium-agent) $ bpftool map dump id 122 | grep "0a 0a 86 30" -C 2 | grep "0a 0a 65 f9" -C 2 | grep -C 2 "db 78" 0a 0a 86 30 00 00 00 00 00 00 00 00 00 00 00 00 0a 0a 65 f9 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 1f 90 00 00 db 78 00 00 -- key: -- 0a 0a 65 f9 00 00 00 00 00 00 00 00 00 00 00 00 0a 0a 86 30 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 db 78 00 00 1f 90 00 00

We’ll explain these binary data later. Now let’s first confirm our above assumption.

4.5 Confirm stale entries in sockmap

4.5.1 bpftrace tcp_bpf_get_proto(): incorrect socket handler (sk_prot)

Two consecutive function calls that involve sk_prot:

tcp_bpf_get_proto(): where sk_prot is initialized;
tcp_bpf_recvmsg() or tcp_recvmsg(): where sk_prot is called to receive a message;

Trace these two methods and print the sk_prot variable (pointer).

Successful case:

tcp_bpf_get_proto: src POD_IP (8080), dst NODE_IP(59500), 2232440 tcp_bpf_get_proto: 0xffffffffacc65800 # <-- sk_prot pointer tcp_bpf_recvmsg: src POD_IP (8080), dst NODE_IP(59500) 0xffffffffacc65800 # <-- same pointer

Bad case:

(node) $ ./tcp_bpf_get_proto.bt 178501
Attaching 6 probes...
tcp_bpf_get_proto: src POD_IP (8080), dst NODE_IP(53904), 2231203
tcp_bpf_get_proto: 0xffffffffacc65800 # <-- sk_prot pointer
tcp_recvmsg: src POD_IP (8080), dst NODE_IP(53904) 0xffffffffac257300 # <-- sk_prot is modified!!!

4.5.2 bpftrace sk_psock_drop

A successful case, calling into sk_psock_drop when requests finish and the connection is closed normally:

(node) $ ./sk_psock_drop.bt 178501 tcp_bpf_get_proto: src POD_IP (8080), dst NODE_IP(59500), 2232440 tcp_bpf_get_proto: 0xffffffffacc65800 # <-- sk_prot pointer sk_psock_drop: src POD_IP (8080)， dst NODE_IP(44566) sk_psock_drop+1 sock_map_remove_links+161 sock_map_close+50 inet_release+63 sock_release+58 sock_close+17 fput+147 task_work_run+89 exit_to_user_mode_loop+285 exit_to_user_mode_prepare+110 syscall_exit_to_user_mode+18 entry_SYSCALL_64_after_hwframe+97 tcp_bpf_recvmsg: src POD_IP (8080), dst NODE_IP(59500) 0xffffffffacc65800 # <-- same pointer

A failed case, calling into sk_psock_drop when the server side calls sock_map_update() and the to-be-inserted entry already exists:

(node) $ ./sk_psock_drop.bt 178501
tcp_bpf_get_proto: src POD_IP (8080), dst NODE_IP(53904), 2231203
tcp_bpf_get_proto: 0xffffffffacc65800 # <-- sk_prot pointer
sk_psock_drop: src POD_IP (8080)， dst NODE_IP(44566)
    sk_psock_drop+1
    sock_hash_update_common+789
    bpf_sock_hash_update+98
    bpf_prog_7aa9a870410635af_bpf_sockmap+831
    _cgroup_bpf_run_filter_sock_ops+189
    tcp_init_transfer+333 // -> bpf_skops_established -> BPF_CGROUP_RUN_PROG_SOCK_OPS -> cilium sock_ops code
    tcp_rcv_state_process+1430
    tcp_child_process+148
    tcp_v4_rcv+2491
    ...
tcp_recvmsg: src POD_IP (8080), dst NODE_IP(53904) 0xffffffffac257300 # <-- sk_prot is modified!!!

// https://github.com/torvalds/linux/blob/v6.5/net/core/sock_map.c#L464
static int sock_map_update_common(struct bpf_map *map, u32 idx, struct sock *sk, u64 flags)
{
    struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
    ...
    link = sk_psock_init_link();
    sock_map_link(map, sk);
    psock = sk_psock(sk);

    osk = stab->sks[idx];
    if (osk && flags == BPF_NOEXIST) {        // sockmap entry already exists
        ret = -EEXIST;
        goto out_unlock;                      // goto out_unlock, which will release psock
    } else if (!osk && flags == BPF_EXIST) {
        ret = -ENOENT;
        goto out_unlock;
    }

    sock_map_add_link(psock, link, map, &stab->sks[idx]);
    stab->sks[idx] = sk;
    if (osk)
        sock_map_unref(osk, &stab->sks[idx]);
    return 0;                                 // <-- should return from here

out_unlock:                                   // <-- actually hit here
    if (psock)
        sk_psock_put(sk, psock);              // --> further call sk_psock_drop
out_free:
    sk_psock_free_link(link);
    return ret;
}

This early release of psock causes sk->sk_prot->recvmsg to be initialized as tcp_recvmsg.

4.5.3 bpftool: confirm stale connection info in sockops map

Key and value format in the BPF map:

// https://github.com/cilium/cilium/blob/v1.11.10/pkg/maps/sockmap/sockmap.go#L20

// SockmapKey is the 5-tuple used to lookup a socket
// +k8s:deepcopy-gen=true
// +k8s:deepcopy-gen:interfaces=github.com/cilium/cilium/pkg/bpf.MapKey
type SockmapKey struct {
	DIP    types.IPv6 `align:"$union0"`
	SIP    types.IPv6 `align:"$union1"`
	Family uint8      `align:"family"`
	Pad7   uint8      `align:"pad7"`
	Pad8   uint16     `align:"pad8"`
	SPort  uint32     `align:"sport"`
	DPort  uint32     `align:"dport"`
}

// SockmapValue is the fd of a socket
// +k8s:deepcopy-gen=true
// +k8s:deepcopy-gen:interfaces=github.com/cilium/cilium/pkg/bpf.MapValue
type SockmapValue struct {
	fd uint32
}

Trip.com: Large Scale Cloud Native Networking & Security with Cilium/eBPF, 2022 shows how to decode the encoded entries of Cilium BPF map.

$ cat ip2hex.sh echo $1 | awk -F. '{printf("%02x %02x %02x %02x\n",$1,$2,$3,$4);}' $ cat hex2port.sh echo $1 | awk '{printf("0x%s%s 0x%s%s\n", $1, $2, $5, $6) }' | sed 's/ /\n/g' | xargs -n1 printf '%d\n' (node) $ ./ip2hex.sh "10.10.134.48" 0a 0a 86 30 (node) $ ./ip2hex.sh "10.10.101.249" 0a 0a 65 f9 (cilium-agent) $ bpftool map dump id 122 | grep "0a 0a 86 30" -C 2 | grep "0a 0a 65 f9" -C 2 | grep -C 2 "db 78" 0a 0a 86 30 00 00 00 00 00 00 00 00 00 00 00 00 0a 0a 65 f9 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 1f 90 00 00 db 78 00 00 -- key: -- 0a 0a 65 f9 00 00 00 00 00 00 00 00 00 00 00 00 0a 0a 86 30 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 db 78 00 00 1f 90 00 00 (node) $ ./hex2port.sh "1f 90 00 00 b6 8a 00 00" 8080 46730 # you can verify this connection in `ss` output
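The shell helpers above decode one field at a time. As an alternative, a whole 44-byte key from the bpftool dump can be decoded in one step. The sketch below assumes the field order of the SockmapKey struct quoted earlier, IPv4 addresses stored in the first four bytes of each 16-byte address field, and ports read from the first two bytes in big-endian order (the same interpretation hex2port.sh uses); `decode_key` is a hypothetical helper name.

```python
import ipaddress
import struct

# Assumed layout, following the SockmapKey struct quoted above:
# 16-byte DIP, 16-byte SIP, family, pad7, pad8, 4-byte SPort, 4-byte DPort = 44 bytes.
KEY_FORMAT = "=16s16sBBH4s4s"


def decode_key(hex_dump: str) -> dict:
    """Decode one sockmap key given as space-separated hex bytes (bpftool dump style)."""
    raw = bytes.fromhex(hex_dump)
    dip, sip, family, _pad7, _pad8, sport, dport = struct.unpack(KEY_FORMAT, raw)
    return {
        "dip": str(ipaddress.IPv4Address(dip[:4])),
        "sip": str(ipaddress.IPv4Address(sip[:4])),
        "family": family,
        # ports: first two bytes, big-endian, as in hex2port.sh
        "sport": int.from_bytes(sport[:2], "big"),
        "dport": int.from_bytes(dport[:2], "big"),
    }
```

Applied to the first entry of the dump shown above, this yields the addresses 10.10.134.48 and 10.10.101.249 with ports 8080 and 56184 (0xdb78).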

Almost all of the following entries are stale (because this is an empty node, with no node-to-pod traffic unless we generate it manually):

(cilium-agent) $ bpftool map dump /sys/fs/bpf/cilium_sock_ops | grep "0a 0a 86 30" | wc -l
7325
(cilium-agent) $ bpftool map dump /sys/fs/bpf/cilium_sock_ops | grep "0a 0a 8c ca" | wc -l
1288
(cilium-agent) $ bpftool map dump /sys/fs/bpf/cilium_sock_ops | grep "0a 0a 8e 40" | wc -l
191

5 Technical summary

5.1 Normal sockops/sockmap BPF workflow

Fig. sockops BPF: connection establishment and socket handler initialization.

Node client (e.g. kubelet) -> server: initiate TCP connection to the server
Kernel (and the BPF code in the kernel): on the connection-established event
1. write two entries to sockmap
2. link entries to bpf handlers (tcp_bpf_{sendmsg, recvmsg})
Node client (e.g. kubelet) -> server: send & receive payload: BPF handlers were executed
Node client (e.g. kubelet) -> server: close connection: kernel removes entries from sockmap

5.2 Direct cause

The problem arises in step 4: for an unknown reason, some entries are not deleted when connections are closed. This leads to incorrect handler initialization for new connections in step 2 (or step 3.1 in the picture). When a new connection hits a stale entry,

the sender side uses BPF message handlers for transmission;
the server side treats the socket as a standard one and waits for messages via the default message handler, then gets stuck there as no payload goes to the default handler.
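The interplay between a missed delete and the next connection that reuses the same key can be replayed with a toy map. The sketch below is an illustration only, assuming the insert is done with BPF_NOEXIST semantics as in the sock_map_update_common excerpt; `ToySockmap`, `on_established`, `on_close`, and the `cleanup_works` switch are all made-up names.

```python
EEXIST = 17  # errno value on Linux


class ToySockmap:
    """Dictionary standing in for the sockmap; keys are connection tuples."""

    def __init__(self):
        self.entries = {}

    def update_noexist(self, key, sock) -> int:
        """Mimic an insert with BPF_NOEXIST: refuse to overwrite an existing key."""
        if key in self.entries:
            return -EEXIST
        self.entries[key] = sock
        return 0

    def delete(self, key) -> None:
        self.entries.pop(key, None)


def on_established(sockmap: ToySockmap, key, sock: dict) -> None:
    """Choose the receive handler: the BPF one only if the insert succeeded."""
    ok = sockmap.update_noexist(key, sock) == 0
    sock["recvmsg"] = "tcp_bpf_recvmsg" if ok else "tcp_recvmsg"


def on_close(sockmap: ToySockmap, key, cleanup_works: bool = True) -> None:
    """cleanup_works=False stands in for the bug that leaves the entry behind."""
    if cleanup_works:
        sockmap.delete(key)
```

Closing a connection with `cleanup_works=False` and then establishing a new one with the same key leaves the new socket with the default handler, which is the stuck case described above.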

5.3 Root cause

The Alibaba cloud-kernel team dug further into the issue and, thanks to their efforts, finally found that the commit "bpf, sockmap: Remove unhash handler for BPF sockmap usage" was the root cause, which was introduced in Linux 5.10.58. The AliOS kernel we were using was based on 5.10.134, so it suffered from this.

The upstream patch "bpf, sockmap: Fix sk->sk_forward_alloc warn_on in sk_stream_kill_queues" has already fixed it, but it was only backported to the 6.x series.

5.4 Quick restoration/remediation

If the issue already happened, you can use one of the following methods to restore:

Kernel restart: drain the node and then restart it; this refreshes the kernel state;
Manual cleanup with bpftool: proceed with caution and avoid removing valid entries.

5.5 Another issue with similar phenomenon

There is another issue with a similar symptom when sockops is enabled:

The local pod runs nginx (a recent version, e.g. >= 1.18);
HTTP requests are sent from the node to the local pod with a large enough cookie (e.g. > 1024 bytes);

The TCP connection is established fine, but the requests always get stuck.

Cilium issue:

ioctl FIONREAD returning incorrect value when sockops is enabled

nginx is reading the headers from the traefik request with a default value of 1024 (client_header_buffer_size 1k;) bytes and then (seemingly) asks via the ioctl how much data is left. Since the return is 0 the request is never fully read and does not proceed further.

Community solution:

deprecate --sockops-enable in v1.13, and remove the feature in v1.14

Appendix

bpftrace scripts used in this post

References

AliOS kernel (a Linux fork), gitee.com/anolis/cloud-kernel
Cilium Network Topology and Traffic Path on AWS (2019)
cilium v1.11.10, bpf_sockops.c
cilium v1.11.10, bpf sockops key & value definition
Differentiate three types of eBPF redirections
Trip.com: Large Scale Cloud Native Networking & Security with Cilium/eBPF, 2022

Practical Storage Hierarchy and Performance: From HDDs to On-chip Caches（2024）

ARTHURCHIAO'S BLOG

11 months ago

This post summarizes bandwidths for local storage media, networking infra, as well as remote storage systems. Readers may find this helpful when identifying bottlenecks in IO-intensive applications (e.g. AI training and LLM inference).

Fig. Peak bandwidth of storage media, networking, and distributed storage solutions.

Note: this post may contain inaccurate and/or stale information.

1 Fundamentals
2 Disk
3 DDR SDRAM (CPU Memory): ~400GB/s
4 GDDR SDRAM (GPU Memory): ~1000GB/s
5 HBM: 1~5 TB/s
6 SRAM (on-chip): 20+ TB/s
7 Networking bandwidth: 400GB/s
8 Distributed storage: aggregated 2+ TB/s
9 Conclusion
References

1 Fundamentals

Before delving into the specifics of storage, let’s first go through some fundamentals about data transfer protocols.

1.1 SATA

From wikipedia SATA:

SATA (Serial AT Attachment) is a computer bus interface that connects host bus adapters to mass storage devices such as hard disk drives, optical drives, and solid-state drives.

1.1.1 Real world pictures

Fig. SATA interfaces and cables on a computer motherboard. Image source wikipedia

1.1.2 Revisions and data rates

The SATA standard has evolved through multiple revisions. The current prevalent revision is 3.0, offering a maximum IO bandwidth of 600MB/s:

Table: SATA revisions. Data source: wikipedia

| Spec | Raw data rate | Data rate | Max cable length |
|---|---|---|---|
| SATA Express | 16 Gbit/s | 1.97 GB/s | 1m |
| SATA revision 3.0 | 6 Gbit/s | 600 MB/s | 1m |
| SATA revision 2.0 | 3 Gbit/s | 300 MB/s | 1m |
| SATA revision 1.0 | 1.5 Gbit/s | 150 MB/s | 1m |

1.2 PCIe

From wikipedia PCIe (PCI Express):

PCI Express is a high-speed serial computer expansion bus standard.

PCIe (Peripheral Component Interconnect Express) is another kind of system bus, designed to connect a variety of peripheral devices, including GPUs, NICs, sound cards, and certain storage devices.

1.2.1 Real world pictures

Fig. Various slots on a computer motherboard, from top to bottom:
PCIe x4 (e.g. for NVME SSD)
PCIe x16 (e.g. for GPU card)
PCIe x1
PCIe x16
Conventional PCI (32-bit, 5 V)
Image source wikipedia

As shown in the above picture, PCIe electrical interface is measured by the number of lanes. A lane is a single data send+receive line, functioning similarly to a “one-lane road” with traffic in both directions.

1.2.2 Generations and data rates

Each new PCIe generation doubles the per-lane bandwidth of the previous generation:

Table: PCIe Unidirectional Bandwidth. Data source: trentonsystems.com

| Generation | Year of Release | Data Transfer Rate | Bandwidth x1 | Bandwidth x16 |
|---|---|---|---|---|
| PCIe 1.0 | 2003 | 2.5 GT/s | 250 MB/s | 4.0 GB/s |
| PCIe 2.0 | 2007 | 5.0 GT/s | 500 MB/s | 8.0 GB/s |
| PCIe 3.0 | 2010 | 8.0 GT/s | 1 GB/s | 16 GB/s |
| PCIe 4.0 | 2017 | 16 GT/s | 2 GB/s | 32 GB/s |
| PCIe 5.0 | 2019 | 32 GT/s | 4 GB/s | 64 GB/s |
| PCIe 6.0 | 2021 | 64 GT/s | 8 GB/s | 128 GB/s |

Currently, the most widely used generations are Gen4 and Gen5.

Note: Depending on the document you’re referencing, PCIe bandwidth may be presented as either unidirectional or bidirectional, with the latter indicating a bandwidth that is twice that of the former.

1.3 Summary

With the above knowledge, we can now proceed to discuss the performance characteristics of various storage devices.

2 Disk

2.1 HDD: ~200 MB/s

From wikipedia HDD:

A hard disk drive (HDD) is an electro-mechanical data storage device that stores and retrieves digital data using magnetic storage with one or more rigid rapidly rotating platters coated with magnetic material.

2.1.1 Real world pictures

A real-world picture is shown below:

Fig. Internals of a real world HDD. Image source hardwaresecrets.com

2.1.2 Supported interfaces (bus types)

HDDs connect to a motherboard over one of several bus types, such as,

SATA
SCSI
Serial Attached SCSI (SAS)

Below is a SATA HDD:

Fig. A real world SATA HDD. Image source hardwaresecrets.com

and how an HDD connects to a computer motherboard via SATA cables:

Fig. An HDD with SATA cables. Data source datalab247.com

2.1.3 Bandwidth: constrained by mechanical factors

HDDs are mechanical devices, and their peak IO performance is inherently limited by mechanical factors such as the speed at which the actuator arm can move. The current upper limit for HDDs is ~200MB/s, well below the saturation point of a SATA 3.0 interface (600MB/s).

2.1.4 Typical latencies

Table. Latency characteristics typical of HDDs. Data source: wikipedia

| Rotational speed (rpm) | Average rotational latency (ms) |
|---|---|
| 15,000 | 2 |
| 10,000 | 3 |
| 7,200 | 4.16 |
| 5,400 | 5.55 |
| 4,800 | 6.25 |

2.2 SATA SSD: ~600MB/s

What’s an SSD? From wikipedia SSD:

A solid-state drive (SSD) is a solid-state storage device. It provides persistent data storage using no moving parts.

Like HDDs, SSDs support several bus types:

SATA
PCIe (NVME)
…

Let’s see the first one: SATA-interfaced SSD, or SATA SSD for short.

2.2.1 Real world pictures

SSDs are usually smaller than HDDs,

Fig. Size of different drives, left to right: HDD, SATA SSD, NVME SSD. Image source avg.com

2.2.2 Bandwidth: constrained by SATA bus

The absence of mechanical components (such as actuator arms) allows SATA SSDs to saturate the SATA bus. This gives an upper limit of 600MB/s of IO bandwidth, 3x that of SATA HDDs.

2.3 NVME SSD: ~7GB/s, ~13GB/s

Let’s now explore another type of SSD: the PCIe-based NVME SSD.

2.3.1 Real world pictures

NVME SSDs are even smaller than SATA SSDs, and they connect directly to the PCIe bus with 4x lanes instead of SATA cables,

Fig. Size of different drives, left to right: HDD, SATA SSD, NVME SSD. Image source avg.com

2.3.2 Bandwidth: constrained by PCIe bus

NVME SSDs have a peak bandwidth of 7.5GB/s over PCIe Gen4, and ~13GB/s over PCIe Gen5.

2.4 Summary

The graph below shows the peak bandwidths of the three kinds of local storage media discussed above:

Fig. Peak bandwidths of different storage media.

These (HDDs, SSDs) are commonly called non-volatile or persistent storage media. And as the picture hints, in next chapters we’ll delve into some other kinds of storage devices.

3 DDR SDRAM (CPU Memory): ~400GB/s

DDR SDRAM nowadays serves mainly as the main memory in computers.

3.1 Real world pictures

Fig. Front and back of a DDR RAM module for desktop PCs (DIMM). Image source wikipedia

Fig. Corsair DDR-400 memory with heat spreaders. Image source wikipedia

DDR memory connects to the motherboard via DIMM slots:

Fig. Three SDRAM DIMM slots on a ABIT BP6 computer motherboard. Image source wikipedia

3.2 Bandwidth: constrained by memory clock, bus width, channel, etc

Single channel bandwidth:

| | Transfer rate | Bandwidth |
|---|---|---|
| DDR4 | 3.2GT/s | 25.6 GB/s |
| DDR5 | 4–8GT/s | 32–64 GB/s |

If a multi-channel memory architecture is enabled, the peak (aggregated) bandwidth is multiplied by the number of channels:

Fig. Dual-channel memory slots, color-coded orange and yellow for this particular motherboard. Image source wikipedia

For example [4],

Intel Xeon Gen5: up to 8 memory-channels running at up to 5600MT/s (358GB/s)
Intel Xeon Gen4: up to 8 memory-channels running at up to 4800MT/s (307GB/s)
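The figures above follow from simple arithmetic: transfers per second, times bytes per transfer, times the number of channels. A minimal sketch, assuming a 64-bit (8-byte) data bus per channel, which is an assumption made here rather than something stated in the post; the function name is hypothetical.

```python
def ddr_peak_bandwidth_gbs(transfer_rate_mts: float,
                           channels: int = 1,
                           bus_width_bits: int = 64) -> float:
    """Peak bandwidth in GB/s = transfers/s * bytes per transfer * channels.

    bus_width_bits=64 is an assumed per-channel data width.
    """
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_mts * 1e6 * bytes_per_transfer * channels / 1e9


print(ddr_peak_bandwidth_gbs(3200))              # DDR4 at 3.2GT/s, single channel
print(ddr_peak_bandwidth_gbs(5600, channels=8))  # 8 channels at 5600MT/s
print(ddr_peak_bandwidth_gbs(4800, channels=8))  # 8 channels at 4800MT/s
```

Under this assumption the three calls give roughly 25.6, 358.4, and 307.2 GB/s, matching the table and the two Xeon examples.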

3.3 Summary

DDR5 bandwidth in the hierarchy:

Fig. Peak bandwidths of different storage media.

4 GDDR SDRAM (GPU Memory): ~1000GB/s

Now let’s see another variant of DDR, commonly used in graphics cards (GPUs).

4.1 GDDR vs. DDR

From wikipedia GDDR SDRAM:

Graphics DDR SDRAM (GDDR SDRAM) is a type of synchronous dynamic random-access memory (SDRAM) specifically designed for applications requiring high bandwidth, e.g. graphics processing units (GPUs).

GDDR SDRAM is distinct from the more widely known types of DDR SDRAM, such as DDR4 and DDR5, although they share some of the same features—including double data rate (DDR) data transfers.

4.2 Real world pictures

Fig. Hynix GDDR SDRAM. Image Source: wikipedia

4.3 Bandwidth: constrained by lanes & clock rates

Unlike DDR, GDDR is integrated directly with the GPU device rather than plugged into motherboard slots, so it is not subject to the bandwidth limits of the PCIe bus. For example,

GDDR6: peak per-pin data rate 16Gb/s, max memory bus width 384 bits, which works out to 768GB/s.
GDDR6x: 1008GB/s, used by NVIDIA RTX 4090
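GDDR peak bandwidth is the per-pin data rate multiplied by the bus width. A small sketch of that arithmetic follows; the 21Gb/s per-pin rate for the RTX 4090's GDDR6X is a figure added here for illustration, not one taken from the text above, and the function name is hypothetical.

```python
def gddr_peak_bandwidth_gbs(per_pin_gbps: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s = per-pin rate (Gb/s) * number of pins / 8 bits per byte."""
    return per_pin_gbps * bus_width_bits / 8


print(gddr_peak_bandwidth_gbs(16, 384))  # GDDR6 at 16Gb/s per pin, 384-bit bus
print(gddr_peak_bandwidth_gbs(21, 384))  # GDDR6X at an assumed 21Gb/s per pin, 384-bit bus
```

With a 384-bit bus, 16Gb/s per pin gives 768GB/s, while 21Gb/s per pin gives the 1008GB/s quoted for the RTX 4090.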

4.4 Summary

With GDDR included:

Fig. Peak bandwidths of different storage media.

5 HBM: 1~5 TB/s

If you’d like to achieve even higher bandwidth than GDDR, there is an option: HBM (High Bandwidth Memory).

A great innovation but a terrible name.

5.1 What’s new

HBM is designed to provide a larger memory bus width than GDDR, resulting in larger data transfer rates.

Fig. Cut through a graphics card that uses HBM. Image Source: wikipedia

HBM sits in the same package as the GPU die and is stacked. For example, the NVIDIA A800 GPU has 5 stacks of 8 HBM DRAM dies (8-Hi), each with two 512-bit channels per die, resulting in a total width of 5120 bits (5 active stacks * 2 channels * 512 bits) [3].

As another example, HBM3 (used in NVIDIA H100) also has a 5120-bit bus, and delivers 3.35TB/s of memory bandwidth,

Fig. Bandwidth of several HBM-powered GPUs from NVIDIA. Image source: nvidia.com

5.2 Real world pictures

The 4 squares on the left and right are the HBM chips:

Fig. AMD Fiji, the first GPU to use HBM. Image Source: wikipedia

5.3 Bandwidth: constrained by lanes & clock rates

From wikipedia HBM,

| | Bandwidth | Year | GPU |
|---|---|---|---|
| HBM | 128GB/s/package | | |
| HBM2 | 256GB/s/package | 2016 | V100 |
| HBM2e | ~450GB/s | 2018 | A100, ~2TB/s; Huawei Ascend 910B |
| HBM3 | 600GB/s/site | 2020 | H100, 3.35TB/s |
| HBM3e | ~1TB/s | 2023 | H200, 4.8TB/s |

5.4 HBM-powered CPUs

HBM is not exclusive to GPU memory; it is also integrated into some CPU models, such as the Intel Xeon CPU Max Series.

5.5 Summary

This chapter concludes our exploration of dynamic RAM technologies, which includes

DDR DRAM
GDDR DRAM
HBM DRAM

Fig. Peak bandwidths of different storage media.

Next, let’s look at on-chip static RAM.

6 SRAM (on-chip): 20+ TB/s

The term “on-chip” in this post refers to memory storage that's integrated within the same silicon as the processor unit.

6.1 SRAM vs. DRAM

From wikipedia SRAM:

Static random-access memory (static RAM or SRAM) is a type of random-access memory that uses latching circuitry (flip-flop) to store each bit. SRAM is volatile memory; data is lost when power is removed.

The term static differentiates SRAM from DRAM:

| | SRAM | DRAM |
|---|---|---|
| data freshness | stable in the presence of power | decays in seconds, must be periodically refreshed |
| speed (relative) | fast (10x) | slow |
| cost (relative) | high | low |
| mainly used for | cache | main memory |

SRAM requires more transistors per bit to implement, so it is less dense and more expensive than DRAM and also has a higher power consumption during read or write access. The power consumption of SRAM varies widely depending on how frequently it is accessed.

6.2 Cache hierarchy (L1/L2/L3/…)

In the architecture of multi-processor (CPU/GPU/…) systems, a multi-tiered static cache structure is usually used:

L1 cache: typically exclusive to each individual processor;
L2 cache: commonly accessible by a group of processors.

Fig. NVIDIA H100 chip layout (L2 cache in the middle, shared by many SM processors). Image source: nvidia.com

6.3 Groq LPU: eliminating memory bottleneck by using SRAM as main memory

From the official website: Groq is the AI infra company that builds the world’s fastest AI inference technology with both software and hardware. Groq LPU is designed to overcome two LLM bottlenecks: compute density and memory bandwidth.

An LPU has greater compute capacity than a GPU and CPU in regards to LLMs. This reduces the amount of time per word calculated, allowing sequences of text to be generated much faster.
Eliminating external memory bottlenecks (using on-chip SRAM instead) enables the LPU Inference Engine to deliver orders of magnitude better performance on LLMs compared to GPUs.

Regarding the chip:

Fig. Die photo of 14nm ASIC implementation of the Groq TSP. Image source: groq paper [2]

The East and West hemisphere of on-chip memory module (MEM)

Composed of 44 parallel slices of SRAM and provides the memory concurrency necessary to fully utilize the 32 streams in each direction.
Each slice provides 13-bits of physical addressing of 16-byte memory words, each byte maps to a lane, for a total of 220 MiBytes of on-chip SRAM.

6.4 Bandwidth: constrained by clock rates, etc

6.5 Summary

This chapter ends our journey through the various physical storage media, from mechanical devices like HDDs all the way to on-chip cache. The picture below shows their peak bandwidths; note that the Y-axis is on a log10 scale:

Fig. Speeds of different storage media.

These are the maximum IO bandwidths when performing read/write operations on a local node.

Conversely, when considering remote I/O operations, such as those involved in distributed storage systems like Ceph, AWS S3, or NAS, a new bottleneck emerges: networking bandwidth.

7 Networking bandwidth: 400GB/s

7.1 Traditional data center: 2*{25,100,200}Gbps

For traditional data center workloads, the following per-server networking configurations are typically sufficient:

2 NICs * 25Gbps/NIC, providing up to 6.25GB/s unidirectional bandwidth when operating in active-active mode;
2 NICs * 100Gbps/NIC, delivering up to 25GB/s unidirectional bandwidth when operating in active-active mode;
2 NICs * 200Gbps/NIC, achieving up to 50GB/s unidirectional bandwidth when operating in active-active mode.
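The GB/s figures above are the aggregate line rate in Gbps divided by 8 bits per byte. A minimal sketch (the function name is hypothetical, and protocol overhead is ignored):

```python
def nic_bandwidth_gbytes(num_nics: int, gbps_per_nic: float) -> float:
    """Aggregate unidirectional bandwidth in GB/s for NICs running active-active."""
    return num_nics * gbps_per_nic / 8  # 8 bits per byte, no protocol overhead


for nics, gbps in [(2, 25), (2, 100), (2, 200), (8, 400)]:
    print(f"{nics} x {gbps}Gbps = {nic_bandwidth_gbytes(nics, gbps)} GB/s")
```

The last case, 8 NICs at 400Gbps each, is the GPU-interconnect configuration named in the next section's heading and gives the 400GB/s in this chapter's title.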

7.2 AI data center: GPU-interconnect: 8*{100,400}Gbps

This type of networking facilitates inter-GPU communication and is not intended for general data I/O. The data transfer pathway is as follows:

HBM <---> NIC <---> IB/RoCE <---> NIC <--> HBM
     Node1                         Node2

7.3 Networking bandwidths

Now we add networking bandwidths into our storage performance picture:

Fig. Speeds of different storage media, with networking bandwidth added.

7.4 Summary

If remote storage solutions (such as distributed file systems) are involved and networking is fast enough, the IO bottleneck shifts down to the remote storage itself. That is why there are some extremely high-performance storage solutions dedicated to today’s AI training.

8 Distributed storage: aggregated 2+ TB/s

8.1 AlibabaCloud CPFS

AlibabaCloud’s Cloud Parallel File Storage (CPFS) is an exemplar of such high-performance storage solutions. It claims to offer up to 2TB/s of aggregated bandwidth.

Note, however, that this bandwidth is an aggregate across multiple nodes; no single node can achieve this level of IO speed. You can do some calculations with PCIe bandwidth, networking bandwidth, and so on to understand why.

8.2 NVME SSD powered Ceph clusters

An open-source counterpart is Ceph, which also delivers impressive results. For instance, with a cluster configuration of 68 nodes * 2 * 100Gbps/node, a user achieved aggregated throughput of 1TB/s, as documented.

8.3 Summary

Now adding distributed storage aggregated bandwidth into our graph:

Fig. Peak bandwidth of storage media, networking, and distributed storage solutions.

9 Conclusion

This post compiles bandwidth data for local storage media, networking infrastructure, and remote storage systems. With this information as reference, readers can evaluate the potential IO bottlenecks of their systems more effectively, such as GPU server IO bottleneck analysis [1]:

Fig. Bandwidths inside an 8xA100 GPU node

References

[1] Notes on High-end GPU Servers (in Chinese), 2023
[2] Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads, ISCA paper, 2020
[3] GDDR6 vs HBM - Defining GPU Memory Types, 2024
[4] 5th Generation Intel® Xeon® Scalable Processors, intel.com

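[Translation] What Is a GPT? An Animated Walkthrough of How Transformers Work (2024)

ARTHURCHIAO'S BLOG

11 months 2 weeks ago

Translator's preface

This post is a translation of (the first half of) a 2024 video, chapter 5 of the original author's Deep Learning series. The original video is highly recommended:

YouTube: But what is a GPT? Visual intro to transformers;
Bilibili: the official mirror.

The four stages a Transformer goes through to predict the next word. MLP is also called feed-forward.

Drawing on deep technical experience, the author explains complex systems visually to a general audience, which is a rare skill. This translation uses text plus animated images to present the inner workings of the Transformer in a form that is both visual and easy to pause and think about. For a deeper look at Transformer/GPT/LLM techniques and implementations, see:

How GPT Works: A Minimal GPT in 200 Lines of Python (2023)
How Transformers Work: self-attention and Two Types of Transformers in 600 Lines of Python (2019)
InstructGPT: Training Language Models to Follow Instructions with Human Feedback (OpenAI, 2022)
A Survey and Practical Guide to Large Language Models (LLMs) (Amazon, 2023)
How to Train an Enterprise-Grade GPT Assistant (OpenAI, 2023)

Given the translator's limited expertise and maintenance time, this translation inevitably contains errors or outdated content; when in doubt, refer to the original video. Share knowledge, respect others' work; the author is of legal age; please credit the source when reposting.

The translation follows.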

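Translator's preface
1 "Generative Pre-trained Transformer" (GPT), illustrated
2 Origins and applications of the Transformer
3 The four stages of data processing in a Transformer
4 GPT -> ChatGPT: from text completion to an interactive chat assistant
- 4.1 System prompt: dressing the input up as a chat
- 4.2 How to train an enterprise-grade GPT assistant (translator's note)
5 Summary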

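1 "Generative Pre-trained Transformer" (GPT), illustrated

GPT stands for Generative Pre-trained Transformer. Let's first go through what each of the three words means.

1.1 Generative

"Generative" is straightforward: given some input (most commonly text), the model can continue writing ("make things up") from there.

1.1.1 Visualization

Here is an example: given "The most effective way to learn computer science is" as input, the model starts writing what comes next.

"Generative": the ability to generate (continue) text.

1.1.2 Generative vs. discriminative (translator's note)

Generative models that continue text differ from discriminative models such as BERT (used for classification, fill-in-the-blank tasks, and so on):

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Google, 2019)

1.2 Pre-trained

"Pre-trained" means the model was trained on a large amount of data.

1.2.1 Visualization

"Pre-trained": trained on a large amount of data.
The many knobs/dials in the figure are the so-called "model parameters". Training is the process of continually tuning these parameters; more on this later.

1.2.2 Pre-training vs. further training (fine-tuning)

The prefix "pre" also implies that the model can be trained further on specific tasks, which is what we usually call fine-tuning.

How to fine-tune a pre-trained model: InstructGPT: Training Language Models to Follow Instructions with Human Feedback (OpenAI, 2022). Translator's note.

1.3 Transformer: a class of neural network architectures

The most important of the three words in "GPT" is the last one, Transformer. A Transformer is a kind of neural network / machine learning model; as the core invention behind recent AI, it has driven the field's rapid progress over the past few years.

Literally, a transformer is something that transforms: the model repeatedly transforms its input data through mathematical operations. The same word also refers to electrical transformers and to the Transformers robots. Translator's note.

Transformer: an umbrella term for a class of neural network architectures.

The final output layer of a Transformer, covered in detail later.

1.4 Summary

Many different kinds of models can now be built on Transformers, not just text models, for example:

Speech to text
Text to speech
Text to image: tools such as DALL·E and MidJourney, which took the world by storm in 2022, are all based on Transformers.

A Brief History of Text-to-Image: The Rise and Development of Diffusion Models (2022)

Although there is no way to make the model truly understand what a "pi creature" is (it is made up, after all), it can still generate one, and the result is impressive.

This post uses text plus animated images, a form that is both visual and easy to pause and think about, to explain the inner workings of the Transformer.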

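2 Origins and applications of the Transformer

2.1 Attention Is All You Need, Google, 2017: machine translation

The Transformer was introduced by Google in the 2017 paper Attention Is All You Need, where it was mainly used for text translation:

2.2 Generative Transformer

Since then, Transformers have spread to many other areas. ChatGPT, for example, is also built on a Transformer: one that takes a piece of text (or image/audio) as input and predicts what comes next. Taking next-word prediction as an example, as the figure below shows, there are many candidates for the next word, each with a different probability:

Once you have such a next-word prediction model, getting it to generate longer text is simple:

1. Feed the initial text into the model;
2. The model predicts a list of candidate next words with their probabilities; one of them is then picked by some algorithm (not necessarily the most probable one). This step is called sampling;
3. Append the new word to the end of the text, feed the whole text into the model again, and go to step 2.

Repeating steps 2 and 3 makes the sentence longer and longer.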
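The predict, sample, append loop described above can be sketched in a few lines of Python. This is only an illustration: `toy_next_token_probs` is a made-up stand-in for a real model (it returns arbitrary probabilities over a six-word vocabulary), and all names here are hypothetical.

```python
import random


def toy_next_token_probs(tokens):
    """Hypothetical stand-in for a real model: returns {token: probability}."""
    vocab = ["the", "cat", "sat", "on", "mat", "."]
    # Give recently used tokens a lower weight; purely illustrative.
    weights = [1.0 if tok in tokens[-2:] else 3.0 for tok in vocab]
    total = sum(weights)
    return {tok: w / total for tok, w in zip(vocab, weights)}


def generate(prompt_tokens, max_new_tokens, rng):
    tokens = list(prompt_tokens)                       # step 1: initial text
    for _ in range(max_new_tokens):
        probs = toy_next_token_probs(tokens)           # step 2: predict candidates
        candidates, weights = zip(*probs.items())
        next_token = rng.choices(candidates, weights=weights, k=1)[0]  # sampling
        tokens.append(next_token)                      # step 3: append, then repeat
    return tokens


print(generate(["the", "cat"], 5, random.Random(0)))
```

A real system would replace the toy function with a forward pass through the Transformer, but the outer loop keeps the same shape.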

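2.3 Preview of GPT-2/GPT-3 output (text continuation)

Let's look at the results, using GPT-2 and GPT-3 as examples.

Below, GPT-2 runs on my laptop, repeatedly predicting and sampling to gradually complete a story. The result is poor, though; the generated story makes little sense:

Next, GPT-2 is swapped for GPT-3 (which is no longer open source, so it is accessed through an API). GPT-3 has essentially the same architecture as GPT-2, only much larger, yet the output suddenly becomes very good: the story is coherent and even suggests that the "pi creature" lives in a land of math and computation:

2.4 Interactive LLMs such as ChatGPT

This repeated "predict and pick" process is how ChatGPT and similar large language models (LLMs) work underneath: they generate text one word (token) at a time.

2.5 Summary

That gives an intuitive picture of GPT and the Transformer behind it. Next we go inside the Transformer to see how it predicts (computes) the next word from a given input.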

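3 The four stages of data processing in a Transformer

To understand how a Transformer works internally, this section follows the data end to end (from the initial user input to the final model output) as it flows through the Transformer. At a high level, the input goes through four processing stages:

The four stages of data processing in a Transformer

Let's look at them one by one.

3.1 Embedding: tokenization and vector representation

First, the input is split into many small pieces (a process called tokenization); the pieces are called tokens:

For text: a token is usually a word, a word stem, a punctuation mark, or another common character combination;
For images: a token might be a small patch of pixels;
For audio: a token might be a short snippet of sound.

Then each token is represented by a vector (a one-dimensional array).

3.1.1 Vector representation of tokens

This is effectively a way of encoding the token.

Embedding: each token maps to an N*1 numeric vector.

3.1.2 Intuition

If these vectors are viewed as coordinates in a high-dimensional space, words with similar meanings lie close to each other in that space.

The four near-synonyms "leap/jump/skip/hop" are neighbors in the vector space.

After tokenization and conversion to vectors, the input has turned from a sentence into a sequence of vectors. This sequence then goes through an operation called attention.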

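3.2 Attention: semantic exchange between embedding vectors

3.2.1 Semantic exchange

Attention lets the vectors "talk" to each other and pass information back and forth; each vector updates its own values in the process.

This exchange is aware of context and semantics.

3.2.2 Example: "machine learning model" / "fashion model"

For example, the word "model" means completely different things in "machine learning model" and "fashion model", so although it is the same word (token), the corresponding vectors end up different.

The attention block works out which words in the context are semantically related and how exactly their meanings should be updated (by updating the corresponding vectors). "Meaning" here refers to the information encoded in a vector.

3.3 Feed-forward / MLP: no exchange between vectors

Once the attention block has let the input vectors exchange information (for example, whether "model" means a fashion model or a machine learning model), the vectors enter the third processing stage:

Stage three: the multi-layer perceptron (MLP), also called the feed-forward layer.

3.3.1 Applying the same transformation to all vectors

In this stage the vectors do not "talk" to each other; they all go through the same operation in parallel:

3.3.2 Intuition

As we will see later, this step is a bit like asking each vector the same set of questions and then updating the vector based on the answers:

The explanation above leaves out some intermediate steps such as normalization, but one thing is already clear: both attention and feed-forward are essentially large piles of matrix multiplications.

One goal of this post is to help readers understand the intuition behind these matrix multiplications.

3.3.3 Stacking attention + feed-forward blocks into a multi-layer network

A Transformer basically repeats two building blocks, attention and feed-forward; together, the two form one layer of the network. In each layer,

the input vectors update one another through attention;
the feed-forward block applies the same transformation to the updated vectors, producing the layer's output vectors.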
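3.4 Unembedding: probabilities

3.4.1 The last vector in the output of the last feed-forward layer

If all goes well, the last vector in the output of the last feed-forward layer (the very last vector in the sequence) already contains the essential meaning of the passage. Applying an unembedding operation to this vector (again a single matrix operation) yields the list of candidate next words and their probabilities:

Fig. The original input is "To date, the cleverest thinker of all time was", and the model is asked to predict the next token. After many layers of attention + feed-forward, the last vector of the last layer's output has absorbed the meaning of the input sentence and (after a simple transformation) yields the probabilities for the next word.

3.4.2 Choosing the next word

A token is chosen according to some rule:

It is not necessarily the most probable one; in practice, always picking the most probable token makes the generated text rather dull;
The choice is actually controlled by a parameter called temperature.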
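3.5 Summary

That is how a Transformer works internally.

As mentioned earlier, once you have such a next-word prediction model, it can generate longer text with the following simple steps:

1. Feed the initial text into the model;
2. The model predicts a list of candidate next words with their probabilities; one of them is then picked by some algorithm (not necessarily the most probable one). This step is called sampling;
3. Append the new word to the end of the text, feed the whole text into the model again, and go to step 2.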

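4 GPT -> ChatGPT: from text completion to an interactive chat assistant

Early GPT-3 demos worked just like this: give GPT-3 some starting text and it automatically completes (continues) a story or article. This is exactly the basic, and core, capability of the Transformer described above.

The core of ChatGPT is the GPT series (GPT 3/3.5/4), so how does it work as a chatbot?

4.1 System prompt: dressing the input up as a chat

It is quite simple: lightly rearrange the input text so that it looks like a chat transcript, then feed it to GPT/Transformer, which treats it as a conversation and continues it. All that remains is to extract the continuation and return it to the user; from the user's point of view, this is chatting.

This text sets up the scenario of a user interacting with an AI assistant; it is the so-called system prompt.

4.2 How to train an enterprise-grade GPT assistant (translator's note)

OpenAI has given a talk specifically on GPT -> ChatGPT: How to Train an Enterprise-Grade GPT Assistant (OpenAI, 2023)

Base models are not assistants; they do not want to answer questions, only to complete documents. So if you ask one to "write a poem about bread and cheese", it will not "obey"; instead it imitates the prompt and lists more tasks, as in the left figure below.

That is because it is just faithfully completing the document. But if you prompt it successfully, for example by starting with "Here is a poem about bread and cheese", it will go on to complete such a poem, as in the right figure.

We can "trick" it further with few-shot prompting. Arrange your question in a "question + answer" document format, put some ordinary exposition first, then suddenly pose a question; the model thinks it is still completing a document, but it has in fact answered the question:

This is how a base model is coaxed into acting as an AI assistant.

5 Summary

This post translates the first half of the original video, which explains the inner workings of GPT/Transformer visually. The rest of the video covers fundamentals of general deep learning and machine learning and is highly recommended for those who want to keep learning.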

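[Translation] Design of Meta/Facebook's Hyperscale AI/GPU Infrastructure (2024)

ARTHURCHIAO'S BLOG

1 year ago

This post is a translation of a 2024 article from Meta/Facebook: Building Meta’s GenAI Infrastructure.

Two GPU clusters with 24k H100 GPUs each, one on RoCE and one on InfiniBand;
Llama 3 was trained on these two clusters;
By the end of 2024, Meta's AI infrastructure build-out is expected to include 350,000 H100 GPUs, with total compute equivalent to about 600,000 H100s.

Given the translator's limited expertise and maintenance time, this translation inevitably contains errors or outdated content; when in doubt, refer to the original article. Share knowledge, respect others' work; the author is of legal age; please credit the source when reposting.

The translation follows.

1 First-generation GPU cluster: 16k A100 (RSC)
2 Second-generation GPU clusters: 24k H100
3 Performance
- 3.1 Principle: performance and ease of use are both indispensable
- 3.2 Large-cluster optimization
4 Commitment to open AI innovation
5 Future outlook

As a major investment in the future of AI, Meta has built two large-scale AI clusters of 24k GPUs each. This post shares design details of their compute, network, and storage.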

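1 First-generation GPU cluster: 16k A100 (RSC)

Meta started building AI infrastructure long ago, but first shared details publicly in 2022 with our Research SuperCluster (RSC), which consists of 16k A100 GPUs.

RSC supported the development of Meta's first generation of advanced AI models and played an important role in training Llama/Llama 2 and in AI work spanning computer vision, NLP, speech recognition, image generation, and even coding.

2 Second-generation GPU clusters: 24k H100

The exact number is 24,576 H100 GPUs per cluster.

Our new AI clusters build on RSC's successes and lessons learned:

The new clusters focus on building end-to-end AI systems, with particular emphasis on the experience and productivity of researchers and developers;
The new clusters can support larger and more complex models, paving the way for GenAI product development and AI research.

Meta executes trillions of AI tasks every day, which requires highly advanced and flexible infrastructure. We custom-design most of our hardware, software, and network fabric, which lets us optimize end to end and keep our data centers running efficiently.

Left: compute racks, including GPU server chassis, top-of-rack switches, fabric switches, and so on; right: storage racks.

2.1 Compute: Grand Teton GPU servers

Both new clusters use Grand Teton, an open GPU hardware platform developed by Meta that we have contributed to the Open Compute Project (OCP).

We have been designing our GPU hardware platforms in the open since the Big Sur platform in 2015.

Grand Teton is pictured below:

Image Source

It integrates the CPU head node, GPUs, switch sync system, power, and more into a single chassis for better overall performance;
It offers rapid scalability and flexibility; the simplified design can be deployed into data centers quickly and is easy to maintain and scale.

Combined with other in-house innovations such as the Open Rack power and rack architecture, this lets us quickly build new clusters tailored to Meta's current and future applications.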

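2.2 Network

The two clusters use different network solutions, but both provide 400Gbps endpoints.

2.2.1 Cluster one: 400Gbps RoCE + in-house switches

Based on a RoCE network, with the following switches:

In-house top-of-rack (TOR) switch Wedge400 / Arista 7800,
In-house modular switch Minipack2.
- Minipack/Minipack2 can play several roles in the network, for example as spine switches;
- First-generation Minipack: (Translation) Reinventing Facebook's Data Center Network (2019).
- An earlier data center network: (Translation) Data Center Fabric: Facebook's Next-Generation Data Center Network (2014).

2.2.2 Cluster two: 400Gbps InfiniBand

Uses an NVIDIA Quantum2 InfiniBand fabric.

2.2.3 Summary

Comparing the two lets us evaluate the suitability and scalability of RoCE and IB for large-scale training, and gives us valuable experience for designing and building even larger clusters. Both clusters, despite their different fabrics, can currently run large generative AI workloads (for example, training Llama 3 on the RoCE cluster) without network bottlenecks.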

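2.3 Storage

Storage plays an important role in AI training, yet it is rarely discussed.

Recent trends show GenAI jobs becoming increasingly multimodal, consuming large amounts of image, video, and text data, so the need for high-performance storage keeps growing. An ideal storage solution should be power-efficient as well as performant.

2.3.1 Data and checkpoint storage: FUSE + Tectonic

The storage solution for data and checkpoints in our AI clusters:

The upper layer is an in-house Linux Filesystem in Userspace (FUSE);
The lower layer is Meta's distributed storage solution, Tectonic, which is optimized for flash media.

This solution

lets thousands of GPUs save and load checkpoints in a synchronized fashion (a challenge for any storage solution),
while also providing the flexibility and high throughput required of an exabyte-scale storage system.

2.3.2 Interactive debugging: Parallel NFS

We also partnered with Hammerspace to develop a parallel network file system (NFS) that lets engineers debug interactively with thousands of GPUs, because code changes are immediately visible on all nodes in the environment.

Together, Tectonic distributed storage and Hammerspace enable fast iteration without limiting scale.

2.3.3 High-capacity SSDs + customized number of servers per rack

Both the Tectonic and the Hammerspace solutions are based on the YV3 Sierra Point server platform, using the latest high-capacity E1.S SSDs we could buy on the market.

In addition, the number of servers per rack was customized to balance per-server throughput, rack count, and power efficiency.

OCP servers work like Lego bricks, letting our storage tier scale flexibly to the needs of future, larger AI clusters without disrupting day-to-day infrastructure use and maintenance.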

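3 Performance

3.1 Principle: performance and ease of use are both indispensable

One of our principles in building large-scale AI clusters is to maximize performance and ease of use at the same time, without sacrificing one for the other. This is an important foundation for training the best AI models.

The best way to test how well a system design scales is to build the system, then keep optimizing it and testing it for real (simulators help, but only so far). Through this process we compared the performance of small and large clusters to locate the bottlenecks. The figure below shows AllGather performance (bandwidth normalized to 0-100) when a large number of GPUs communicate with each other (at message sizes where roofline performance is expected):

small cluster performance (overall communication bandwidth and utilization) reaches 90%+ out of the box, but an unoptimized large cluster performance has very poor utilization, ranging from 10% to 90%. After we optimize the full system (software, network, etc.), we see large cluster performance return to the ideal 90%+ range.

3.2 Large-cluster optimization

Compared with optimized small-cluster performance, our large clusters initially performed poorly. To address this, we made the following optimizations:

Improved the job scheduler to make it network-topology aware, which brings:
1. lower latency;
2. less traffic forwarded to the upper layers of the network (switches).
Together with NVIDIA NCCL, optimized the network routing strategy to achieve optimal network utilization.

These two optimizations brought large-cluster performance close to that of small clusters.

In addition, we

work closely with the training framework and model teams to keep improving the infrastructure, for example:
1. supporting FP8, the new data type on NVIDIA H100 GPUs, which helps training performance considerably,
2. parallelization optimizations,
3. storage optimizations;
recognize that debuggability is one of the major challenges in large-scale training. At scale it is very hard to pinpoint which stalled GPU is slowing down the whole training job. We are therefore building tools such as desync debug and a distributed flight recorder to trace distributed training and identify problems faster;
continue to develop PyTorch, our foundational AI framework, so that it can train on tens or even hundreds of thousands of GPUs. For example, we identified several bottlenecks in process group initialization and cut startup time from sometimes hours down to minutes.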

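4 Commitment to open AI innovation

Meta remains committed to open innovation in AI software and hardware; we have always believed that open-source hardware and software are valuable tools for helping the industry solve problems at scale. We will

continue, as a founding member of OCP, to support open hardware innovation; for example, we have contributed designs such as Grand Teton and Open Rack to the OCP community;
continue, as the largest and primary contributor to PyTorch, to drive the development and adoption of this AI software framework;
continue our commitment to open innovation in the AI research community.
- We launched the Open Innovation AI Research Community, which aims to deepen our understanding of how to develop and share AI technologies (large models in particular) responsibly.
- We also launched the AI Alliance, a group of leading organizations in the AI industry focused on accelerating responsible AI innovation within an open community.

Our AI efforts are built on a philosophy of open science and collaboration.

5 Future outlook

The two AI training clusters described here are part of our larger AI roadmap. By the end of 2024, Meta's AI infrastructure build-out is expected to include 350,000 H100 GPUs, with total compute equivalent to about 600,000 H100s.

What works today may not be enough for tomorrow's needs, which is why we constantly evaluate and improve every aspect of our infrastructure, from the physical hardware and virtualization layers to the software layer and the business layers above. Our goal is to create flexible, reliable systems that support fast-evolving new models and research.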

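[Translation] The Limits of LLM Inference: Theoretical Analysis, Mathematical Modeling, and CPU/GPU Measurements (2024)

ARTHURCHIAO'S BLOG

1 year ago

Translator's preface

This post is a translation of the 2024 article LLM inference speed of light. It analyzes the speed bottlenecks of LLM inference and how to quantify them, and provides some measurements (our own measurements on Chinese-developed models are broadly consistent). It is helpful for understanding how LLM inference works internally and how to optimize it.

A100-80GB PCIe inference latency and throughput. Image Source

The translator's expertise is limited, so omissions or errors are inevitable. When in doubt, please refer to the original article.

The translation follows.

Translator's preface
Abstract
1 Inference mechanics
2 Computing the latency limit, using Mistral-7B as an example
3 Uses of the mathematical model and theoretical limits
4 Impact of GQA (group query attention)
5 Summary

Abstract

While developing calm, one core question we considered was: where is the limit of inference? We need it as a yardstick against which to measure the speed of real inference systems.

calm is a lightweight, CUDA-based implementation of inference for transformer-based language models, written entirely from scratch.

This post discusses that limit and its implications. If you are interested in the derivation details, see this Python notebook.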

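1 Inference mechanics

1.1 Transformer: token-by-token generation, no parallelism

When a language model generates text, it does so one token at a time. A language model (specifically a decoder-only text transformer, referred to simply as an LLM in this post) can be viewed as a function:

Input: a token
Output: an array of probabilities, one for each token in the vocabulary.
The inference program uses the probabilities to guide sampling, producing (choosing from the vocabulary) the next token as the final output.

Vocabulary: usually consists of words, word pieces, Chinese characters, and so on (all of which are called tokens). To see what a vocabulary looks like, take a look at vocab.txt of bert-base-chinese. More background:

How GPT Works: A Minimal GPT in 200 Lines of Python (2023).
How Transformers Work: self-attention and Two Types of Transformers in 600 Lines of Python (2019)

Translator's note.

Text generation simply repeats the process above. Clearly, there is no room for parallelism when generating a single text sequence.

Speculative execution tries to achieve some degree of parallelism by using a less accurate predictor; it is not covered in this post.

1.2 Modeling the generation process: matrix multiplication

Broadly speaking, when processing a token the model performs two types of operations:

Matrix-vector multiplication: a large matrix (e.g. 8192x8192) is multiplied by a vector to produce another vector;
Attention computation.

During generation, the model can see not only the state of the current token but also the internal states of all previous tokens in the sequence. These states are stored in a structure called the KV-cache, which is essentially the set of key vectors and value vectors for every previous position in the text.

Attention generates a query vector for the current token, computes its dot product with the key vectors of all previous positions, normalizes the resulting set of scalars, and then computes a value vector as a weighted sum of all previous value vectors, using the (normalized) dot products as the scores.

This description omits multi-head attention and the details of “normalization” (softmax), but neither are critical for understanding the inference performance.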

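1.3 Bottleneck analysis

The two computations above share one important property: each element read from a matrix or from the KV-cache takes part in only a very small number of floating-point operations.

Matrix-vector multiplication performs one multiply-add (2 FLOPs) per matrix element;
Attention performs one multiply-add per key element and one multiply-add per value element.

1.3.1 Typical compute-to-bandwidth ratios

On modern CPUs/GPUs, ALU operations (multiplies, adds) are much faster than memory IO. For example:

AMD Ryzen 7950X: 67 GB/s memory bandwidth and 2735 GFLOPS, FLOP:byte = 40:1
NVIDIA GeForce RTX 4090: 1008 GB/s memory bandwidth and 83 TFLOPS, FLOP:byte = 82:1
NVIDIA H100 SXM: 3350 GB/s memory bandwidth and 67 TFLOPS; for matrix multiplication, tensor cores provide ~494 TFLOPS of dense compute, FLOP:byte = 147:1.

The ratio is even more extreme for lower-precision formats such as FP16/FP8:

H100 tensor cores have a theoretical throughput of 1979 TFLOPS for dense FP8 matrices, FLOP:byte = 590:1.

In these scenarios, ALU capacity is plentiful regardless of whether tensor cores are used or which floating-point format is used.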
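The FLOP:byte ratios above are simply peak compute divided by memory bandwidth. A short sketch that recomputes them from the numbers quoted in the list (the function name and the dictionary labels are hypothetical):

```python
def flop_per_byte(tflops: float, bandwidth_gbs: float) -> float:
    """Peak floating-point operations available per byte read from memory."""
    return tflops * 1e12 / (bandwidth_gbs * 1e9)


hardware = {
    "AMD Ryzen 7950X": (2.735, 67),             # 2735 GFLOPS, 67 GB/s
    "NVIDIA RTX 4090": (83, 1008),
    "NVIDIA H100 SXM (tensor core)": (494, 3350),
    "NVIDIA H100 SXM (dense FP8)": (1979, 3350),
}
for name, (tflops, bandwidth) in hardware.items():
    print(f"{name}: FLOP:byte = {flop_per_byte(tflops, bandwidth):.1f}:1")
```

The results land at roughly 40, 82, 147, and 590, while the workload itself needs only about 2 FLOPs per element read, which is why compute sits largely idle.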

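1.3.2 Bottleneck: memory bandwidth

Therefore, a workload like transformer inference, which performs only two operations per element, is bound to be limited by memory bandwidth. So, given

the model configuration (number of parameters),
the KV-cache size,
the memory bandwidth,

we can estimate the minimum time inference can take. Let's work through Mistral 7B as a concrete example.

2 Computing the latency limit, using Mistral-7B as an example

2.1 Breakdown of the parameter (weight) count

Mistral-7B has 7.2 billion parameters (the total number of matrix elements is 7.2 billion). They break down as follows:

4096 * 32000 = 131M for the embedding matrix;
- 4096: hidden size (elements per hidden vector)
- 32000: vocabulary size
This matrix is not used as a whole in a matrix-vector multiplication; each token reads only one row of it, so the data volume is small and is ignored in the bandwidth calculation below;
32 * (4096 * (128 * 32 + 128 * 8 * 2) + 4096 * 128 * 32) = 1342M for computing attention-related vectors;
32 * (4096 * 14336 * 3) = 5637M for transforming hidden states via feed-forward;
4096 * 32000 = 131M for converting hidden states into token probabilities; unlike the embedding matrix, this one is used in a matrix multiplication.

In total, about 7111M (~7B) "active" parameters are used in matrix multiplications.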
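The breakdown above can be reproduced directly. In the sketch below the numeric values come from the formulas in the list; the variable names (such as `n_kv_heads` and `ffn_size`) are labels chosen here for readability, not names taken from the original article.

```python
# Mistral-7B dimensions as used in the formulas above (variable names are assumptions)
hidden_size = 4096
vocab_size = 32000
n_layers = 32
head_dim = 128
n_heads = 32       # query heads
n_kv_heads = 8     # key/value heads
ffn_size = 14336

embedding = hidden_size * vocab_size
attention = n_layers * (
    hidden_size * (head_dim * n_heads + head_dim * n_kv_heads * 2)  # Q, K, V projections
    + hidden_size * head_dim * n_heads                              # output projection
)
feed_forward = n_layers * (hidden_size * ffn_size * 3)
output_head = hidden_size * vocab_size

# The embedding matrix is excluded: only one row is read per token.
active = attention + feed_forward + output_head

for name, count in [("embedding", embedding), ("attention", attention),
                    ("feed-forward", feed_forward), ("output head", output_head),
                    ("active total", active)]:
    print(f"{name}: {count / 1e6:.0f}M")
```

The active total comes out at roughly 7.11 billion parameters, in line with the ~7B figure above.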

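2.2 Data loaded per token

2.2.1 Total data volume

If the model uses FP16 for matrix elements, then for every generated token the amount of data that must be loaded into the ALUs is:

7111M params * 2Byte/param = ~14.2 GB

Although every matrix can be reused when computing the next token, hardware caches are typically only tens of MB, so the matrices do not fit in cache. We can therefore conclude that generation (inference) cannot run faster than memory bandwidth allows.

Attention has to read the KV-cache entries for the current token and all preceding tokens in the context, so the amount of data read depends on how many previous tokens the model sees when generating a new token. These include

the system prompt (usually hidden from the user),
the user prompt,
previous model output,
and possibly multiple user prompts in a long chat session.

2.2.2 KV-cache data volume

For Mistral, the KV-cache stores

8 key vectors of 128 elements each for every layer,
8 value vectors of 128 elements each for every layer,

which adds up to 32 * 128 * 8 * 2 = 65K elements per token. If the KV-cache uses FP16, then for token number P we need to read P * 130 KB of data. For example, token number 1000 requires reading 130MB from the KV-cache. Compared with the total of 14.2GB, these 130MB are negligible.

2.3 Computing the latency limit, using the RTX 4090 as an example

With these numbers it is now easy to compute the minimum time inference takes.

For example, on an NVIDIA RTX 4090 (1008 GB/s),

14.2GB (fp16) takes ~14.1ms to read, so we can expect ~14.1ms per token for tokens at early positions (the KV-cache impact is negligible).
With 8-bit weights, 7.1GB has to be read, which takes ~7.0ms.

These are theoretical lower bounds, representing the minimum possible time per generated token.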
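The lower bound above is just bytes to read divided by memory bandwidth. A minimal sketch using the numbers from this section (7111M active parameters, 1008 GB/s, and 32 * 128 * 8 * 2 KV-cache elements per token); the function names are hypothetical.

```python
ACTIVE_PARAMS = 7111e6                    # "active" parameters from section 2.1
KV_ELEMENTS_PER_TOKEN = 32 * 128 * 8 * 2  # layers * head dim * KV heads * (key + value)


def kv_cache_bytes(position: int, bytes_per_element: int = 2) -> int:
    """Bytes of KV-cache read when generating the token at the given position."""
    return position * KV_ELEMENTS_PER_TOKEN * bytes_per_element


def min_ms_per_token(bytes_per_param: float, bandwidth_gbs: float, position: int = 0) -> float:
    """Lower bound on per-token latency: total bytes read / memory bandwidth."""
    total_bytes = ACTIVE_PARAMS * bytes_per_param + kv_cache_bytes(position)
    return total_bytes / (bandwidth_gbs * 1e9) * 1e3


print(f"fp16:  {min_ms_per_token(2, 1008):.1f} ms/token")
print(f"8-bit: {min_ms_per_token(1, 1008):.1f} ms/token")
print(f"fp16 at position 1000: {min_ms_per_token(2, 1008, position=1000):.2f} ms/token")
```

For an RTX 4090 this gives about 14.1 ms and 7.1 ms per token; at position 1000 the KV-cache adds only about 0.13 ms, consistent with treating it as negligible for early tokens.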

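2.4 Measured inference latency of ChatGLM3-6B/Qwen-7B (translator's note)

Simple single-GPU inference tests with 16-bit weights; average latency, for reference only:

| LLM | RTX 4090 24GB (2022) | A100 80GB (2020) | V100 32GB (2017) |
|---|---|---|---|
| ChatGLM3-6B | 16ms/token | 18ms/token | 32ms/token |
| Qwen-7B | 19ms/token | 29ms/token | 41ms/token |

As shown, in terms of inference speed alone, as long as the model fits (< 24GB), the 4090 is on par with or even faster than the A100, and roughly twice as fast as the V100.

Note: the card tested is the 4090, not the 4090D.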

3 数学模型和理论极限的用途

以上根据数学建模和计算得出了一些理论极限数字，接下来看看这些理论极限有什么用。

3.1 评估推理系统好坏

要接近理论极限，需要一个高质量的软件实现，以及能够达到峰值带宽的硬件。因此如果你的软件+硬件离理论最优很远，那肯定就有问题：可能在软件方面，也可能在硬件方面。

例如，在 RTX 4090 上 calm 使用 16 位权重时达到 ~15.4 ms/tok，使用 8 位权重时达到 ~7.8 ms/tok，达到了理论极限的 90%。

Close, but not quite there - 100% bandwidth utilization is unfortunately very hard to get close to on NVidia GPUs for this workload. Larger GPUs like H100 are even more difficult to fully saturate; on Mixtral - this is a different architecture but it obeys the same tradeoffs for single sequence generation if you only count active parameters - calm achieves ~80% of theoretically possible performance, although large denser models like Llama 70B can get closer to the peak.

在 Apple M2 Air 上使用 CPU 推理时，calm 和 llama.cpp 只达到理论 100 GB/s 带宽的 ~65%，然后带宽就上不去了，这暗示需要尝试 Apple iGPU 了。

3.2 指导量化

带宽与每个权重使用的 bit 数成正比；这意味着更小的权重格式（量化）能实现更低的延迟。例如，在 RTX 4090 上 llama.cpp 使用 Mistral 7B

16 bit 权重：~17.1 ms/tok（82% 的峰值）
8.5 bit 权重：~10.3ms/tok （71% 的峰值）
4.5 bit 权重：~6.7ms/tok （58% 的峰值）

因此对于低延迟场景，可以考虑低精度量化。

3.3 指导优化方向

除了为推理延迟提供下限外，上述建模还表明：推理过程并未充分利用算力（ALU）。要解决这个问题，需要重新平衡 FLOP:byte 比例， speculative decoding 等技术试图部分解决这个问题。

3.3.1 批处理 batch 1 -> N：瓶颈访存带宽 -> 算力

Now consider another scenario: serving multiple users. Note that,

当多个用户请求同时处理时，我们用相同的矩阵同时执行多个矩阵-向量乘法，这里可以将多个矩阵-向量乘法变成一个矩阵-矩阵乘法。
对于足够大的矩阵来说，只要矩阵-矩阵乘法实现得当，速度就比访存 IO 快，

因此这种场景下，瓶颈不再是访存 IO，而是算力（ALU）。这就是为什么这种 ALU:byte 不平衡对于生产推理系统不是关键问题 —— 当使用 ChatGPT 时，你的请求与同一 GPU 上许多其他用户的请求并发评估，GPU 显存带宽利用更加高效。

3.3.2 批处理无法改善所需加载的 KV-cache 数据量

批处理通常不会减轻 KV-cache 带宽（除非多个请求共享非常大的前缀），因为 KV-cache 大小和带宽随请求数量的增加而增加，而不像权重矩阵保持不变。

Mixture-of-experts (MoE) models like Mixtral have slightly different scaling characteristics: batching initially only increases the bandwidth required, but once the expert utilization becomes significant the inference becomes increasingly ALU bound.

3.4 硬件相对推理速度评估

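The importance of memory bandwidth for inference performance holds across model variants, device types, and architectures, so bandwidth can be used to assess your hardware even when batch processing is not an option.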

例如，NVIDIA RTX 4080 有 716 GB/s 带宽，所以可以预期它的推理速度是 RTX 4090 的 ~70% —— 注意，游戏、光线追踪或推理其他类型的神经网络等方面，相对性能可能与此不同！

4 GQA (group query attention) 的影响

Mistral-7B 是一个非常平衡的模型；在上面的所有计算中，几乎都能忽略 KV-cache 部分的 IO 开销。这背后的原因：

较短的上下文（Mistral-7B 使用 windowed attention，限制 4096 token 的窗口），
使用了 GQA，这个是更重要的原因。

LLaMA 2：开放基础和微调聊天模型（Meta/Facebook，2023）也使用了 GQA。

4.1 GQA 为什么能减少带宽

在 GQA 中（with a 4x ratio），为了得到 attention 的 4 个点积，

不是使用 4 个 query 向量并分别与 4 个相应的 key 向量计算点积，
而是只取一个 key 向量，然后执行 4 个点积。

这能够减少 KV-cache 的大小和所需带宽，也在某种程度上重新平衡了 ALU:bandwidth 比例。

4.2 有无 GQA 的数据量对比

这对于 KV-cache 内存大小也很关键，不过，这可能对短上下文模型不太明显：

4096 token 上下文的 Mistral 需要 0.5GiB，
没有 GQA 的可比模型（如 Llama 7B）“只需要”2 GiB。
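These two sizes follow directly from the number of KV heads. A small sketch under the assumptions already stated in the text (32 layers, 128 elements per head, FP16 cache, 4096-token context); the "without GQA" case is modelled simply by setting the KV head count equal to the 32 query heads.

```python
GIB = 1024 ** 3


def kv_cache_gib(context_tokens: int, layers: int, kv_heads: int,
                 head_dim: int, bytes_per_element: int = 2) -> float:
    """KV-cache size in GiB for one sequence of the given length."""
    # One key vector and one value vector per KV head, per layer, per token.
    per_token = layers * kv_heads * head_dim * 2 * bytes_per_element
    return context_tokens * per_token / GIB


with_gqa = kv_cache_gib(4096, layers=32, kv_heads=8, head_dim=128)
without_gqa = kv_cache_gib(4096, layers=32, kv_heads=32, head_dim=128)
print(with_gqa, without_gqa)
```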

让我们看看一个最近不使用 GQA 的模型，Cohere 的 Command-R。

Command-R has a large vocab (256K) and large hidden state (8192) so it spends a whopping 2B parameters on embeddings, but it reuses the same matrix for embedding and classification so we don’t need to exclude this from the inference bandwidth calculation.

模型本身有大约 35b 参数，所以以 16 位/权重计算，我们在推理期间需要为每个 token 读取 70 GB 的权重。对于每个 token ，它需要在 KV-cache 中存储 40 * 128 * 64 * 2 = 655K 元素，以 16 位/元素计算是每个 token 1.3 MB。

因此，一个 4096 token 的上下文将需要大约 5.3GB；与 ~70 GB 的权重相比，这已经相当显著了。然而，如果考虑到 Cohere 的模型宣传有 200K token 上下文窗口 —— 计算最后一个 token 需要读取 260 GB（还需要 260GB 的显存来存储它）！
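The Command-R numbers can be reproduced the same way. In the sketch below the factors of 40 * 128 * 64 * 2 are read as 40 layers, 128 elements per head, 64 KV heads, and a key plus a value; that labelling is an interpretation of the product given in the text, and the variable names are illustrative.

```python
LAYERS, HEAD_DIM, KV_HEADS = 40, 128, 64
BYTES_PER_ELEMENT = 2  # 16-bit KV-cache

elements_per_token = LAYERS * HEAD_DIM * KV_HEADS * 2
bytes_per_token = elements_per_token * BYTES_PER_ELEMENT

kv_gb_4096 = 4096 * bytes_per_token / 1e9
kv_gb_200k = 200_000 * bytes_per_token / 1e9
weights_gb = 35e9 * 2 / 1e9  # ~35B parameters at 16 bits per weight

print(elements_per_token)    # elements stored per token
print(bytes_per_token / 1e6) # MB per token
print(kv_gb_4096, kv_gb_200k, weights_gb)
```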

4.3 多用户场景下 KV-cache 占用的显存规模

这么大的模型，典型的生产环境配置（单用户），

weights 通常使用 4bit 量化（通常的实现占用 ~4.5bit/权重）
KV-cache 可能会使用 8bit（FP8）值。

如果我们“保守地”假设上下文为 100K，则

模型权重占 ~19.7GB
KV-cache 占 ~65GB

计算到最后一个 token 时，我们需要从内存中读取这么大的数据。可以看到，突然之间，attention 计算部分的数据量（最终转变成耗时）从微不足道变成了占 ~75%！
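A rough sketch of where the ~75% figure comes from, assuming the configuration listed above: ~35B parameters at ~4.5 bits per weight, an 8-bit KV-cache with 655,360 elements per token, and a 100K-token context. The names are hypothetical and the numbers are only as precise as those assumptions.

```python
PARAMS = 35e9
BITS_PER_WEIGHT = 4.5
KV_ELEMENTS_PER_TOKEN = 40 * 128 * 64 * 2
KV_BYTES_PER_ELEMENT = 1          # FP8
CONTEXT_TOKENS = 100_000

weights_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9
kv_gb = CONTEXT_TOKENS * KV_ELEMENTS_PER_TOKEN * KV_BYTES_PER_ELEMENT / 1e9
# Fraction of the bytes read for the last token that belong to the KV-cache.
kv_share = kv_gb / (kv_gb + weights_gb)

print(f"weights: {weights_gb:.1f} GB, KV-cache: {kv_gb:.1f} GB, "
      f"KV share: {kv_share:.0%}")
```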

虽然 100K 上下文可能看起来有点极端，但在短上下文+多用户场景中，情况也是类似的：

批处理优化将多次矩阵-向量乘法变成了一次矩阵-矩阵乘法（为一批用户请求读取一次模型权重），瓶颈来到算力（ALU），
但每个用户请求通常都有自己的 KV-cache，

因此最终的 attention 仍然受访存带宽限制，并且需要大量内存/显存才能将所有用户请求放到单个节点！

4.4 GQA：减小从 KV-cache 加载的数据量

如果模型使用 4x GQA，KV-cache 的大小和所需带宽将会变成原来的 1/4。

对于 100k+ token 的上下文场景，虽然 KV-cache 的开销仍然很大（65GB -> 16GB+），但已经进入实用范围。

4.5 GQA 的问题

对于 Cohere 的目标使用场景，引入 GQA 可能会导致模型质量有下降，具体得看他们的技术报告。

但是，纯粹从成本/性能角度来看，每个基于 transformer 的 LLM 都需要评估是否能引入 GQA，因为收益太大了。

5 总结

对于大模型推理场景，计算和访存的次数是已知的，因此可以进行数学建模，计算理论极限。这非常有用，不仅可以用来验证推理系统的性能，而且能预测架构变化带来的影响。

[译][论文] InstructGPT：基于人类反馈训练语言模型遵从指令的能力（OpenAI，2022）

1 year 1 month ago

译者序

本文翻译自 2022 年 OpenAI 的论文： Training language models to follow instructions with human feedback，整理翻译了其中感兴趣的部分。

大模型进化树，可以看到 InstructGPT 所处的年代和位置。来自大语言模型（LLM）综述与实用指南（Amazon，2023）。

GPT -> InstructGPT -> ChatGPT 的过程，可参考如何训练一个企业级 GPT 助手（OpenAI，2023）。

译者水平有限，不免存在遗漏或错误之处。如有疑问，敬请查阅原文。

以下是译文。

译者序
摘要
1 引言
2 相关工作
3 方法论与实验详情
4 结果
5 问题讨论
参考文献
附录 A: Prompt 数据详情
附录 B：Additional human data collection details
附录 C：一些模型细节
附录 D：Automatic evaluation details
附录 E：Additional results
附录 F：Model samples

Abstract

增大模型尺寸未必就能提高它对用户意图的理解能力。例如，一些大模型可能会生成不真实、有毒或对用户并无帮助（untruthful, toxic, or simply not helpful）的输出。换句话说，这些模型与它们的用户没有对齐（not aligned）。

本文展示了一种基于人类反馈进行微调（fine-tuning with human feedback），从而在各种任务上将语言模型与用户意图对齐的方法。简单来说，

先收集一组“预期的模型行为应该是什么样”的数据集，然后使用监督学习来微调 GPT-3（SFT），
接着，收集一组排名形式组织的模型输出（rankings of model outputs）作为数据集，使用人类反馈强化学习（RLHF）进一步微调上一步得到的模型。

我们将最终得到的这种模型称为 InstructGPT。

175b GPT-3 vs. 1.3b InstructGPT 的人工测评显示，大家更喜欢后者，尽管它的参数不到前者的 1%。
InstructGPT 在真实性（truthfulness）方面也有所改进，减少了有毒输出，在公开 NLP 数据集上的性能退化也很小。

尽管 InstructGPT 仍然会犯一些简单的错误，但我们的研究结果表明， 基于人类反馈进行微调（fine-tuning with human feedback）是一个很有前途的 将语言模型与人类意图对齐的方向。

1 引言

给定一些任务示例（examples of the task）作为输入，大语言模型（LLMs）可以被 “prompt” 去执行一系列自然语言处理（NLP）任务。

1.1 大模型存在的问题

然而，这些模型经常会出现一些意外的行为，比如编造事实、生成有偏见或有毒的文本，或者忽视用户的指示（Bender 等，2021；Bommasani 等，2021；Kenton 等，2021； Weidinger 等，2021；Tamkin 等，2021；Gehman 等，2020）。

1.2 语言模型建模偏差：预测下一个 token vs. 有益且安全地遵循用户指令

出现以上现象，是因为许多近期的 LLM 建模目标都是（基于互联网数据训练）预测下一个 token —— 而并不是“有益且安全地遵循用户的指令”（Radford 等，2019；Brown 等，2020；Fedus 等，2021；Rae 等，2021；Thoppilan 等，2022）。也就是说，语言建模目标有偏差（the language modeling objective is misaligned）。

由于 LLM 已经部署在大量实际应用中，因此解决大模型的这些非预期行为非常重要。

1.3 常规解决方式及评估标准

通过训练语言模型按照用户意图行事（Leike 等，2018）来推进语言模型的对齐。这里的意图包括

明确的意图，如遵循指示，
隐含的意图，如保持真实、无偏见、无毒及无害性。

使用 Askell 等（2021）的术语，我们希望语言模型是

有帮助的（helpful，应该帮助用户解决任务），
诚实的（honest，不应该捏造信息或误导用户），
无害的（harmless，不应该对人或环境造成身体、心理或社会伤害）。

我们在第 3.6 节中详细阐述了这些标准的评估。

1.4 本文方法：基于人类反馈+微调来对齐

本文专注于通过微调方法来对齐语言模型。具体来说，使用人类反馈强化学习（RLHF；Christiano 等，2017；Stiennon 等，2020）来微调 GPT-3，以便它能遵循类型广泛的各种用户指令。具体过程如图 2 所示，

Figure 2: InstructGPT 三部曲：(1) SFT, (2) RM training, (3) RLHF via proximal policy optimization (PPO) on RM.
蓝色箭头表示相应的数据用于训练模型。Step 2 中 A-D 是模型输出的采样，然后标注员对它们进行排序。详见 Section 3。

三个步骤：

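Step 1: Collect demonstration data and train a supervised policy. For a given input, labelers provide demonstrations of the desired behavior (see Section 3.2). A pretrained GPT-3 model is then fine-tuned on this data using supervised learning.
Step 2: Collect comparison data and train a reward model (RM). For a given input, model outputs are collected and labelers indicate which output they prefer. A reward model is then trained to predict the human-preferred output.
Step 3: Optimize a policy against the reward model using PPO. The output of the RM is used as a scalar reward, and the supervised policy is fine-tuned with the PPO algorithm (Schulman et al., 2017) to optimize this reward.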

步骤 2 和 3 可以持续迭代；在当前最佳策略上收集更多的对比数据，这些数据又用于训练新的 RM 和新的策略。实际上，大部分对比数据来自于我们的 supervised policies，一小部分来自于我们的 PPO policies。

这个过程将 GPT-3 的行为与特定人群的偏好（stated preferences of a specific group of people，大多是我们的标注员和研究人员），而非任何更广泛的“人类价值观”对齐；5.2 节将进一步讨论这个问题。

1.5 模型尺寸及架构

我们训练了三种尺寸的 InstructGPT 模型：

1.3B
6B
175B

所有模型都使用了 GPT-3 架构。

1.6 主要发现

我们的主要发现如下。

1.6.1 标注员明显更喜欢 InstructGPT 而非 GPT-3 的输出

我们将 175b GPT-3 vs. 1.3b InstructGPT 的输出进行了人工测评，大家明显更喜欢后者，尽管它的参数不到前者的 1%。

这两类模型具有相同的架构，唯一的区别是 InstructGPT 在人工数据上进行了微调。

作为对比，我们给 GPT-3 添加了一个 few-shot prompt 以使其更好地遵循指令（变成了一个提示词调优过的 GPT-3），但效果仍赶不上 InstructGPT：

175B InstructGPT 在 85 ± 3% 的结果中优于 175B GPT-3，
175B InstructGPT 在 71 ± 4% 的结果中优于 few-shot 175B GPT-3。

根据标注员的反馈，InstructGPT 模型的输出更符合 prompt ，并更可靠地遵循指令中的明确约束。

1.6.2 InstructGPT 相比 GPT-3 在真实性方面有所改进

在 TruthfulQA 基准测试中，InstructGPT 生成 truthful & informative 答案的概率比 GPT-3 高约一倍。

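On closed-domain tasks, where the output should not contain information that is absent from the input (for example summarization and closed-domain QA), InstructGPT makes up information not present in the input about half as often as GPT-3 (a 21% vs. 41% hallucination rate).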

1.6.3 InstructGPT 相比 GPT-3 毒性略微下降，但偏见未下降

为了衡量毒性，我们使用了 RealToxicityPrompts 数据集（Gehman 等，2020），并进行了自动和人工评估。当提示模型需要 respectful 时（prompted to be respectful），InstructGPT 生成的有毒输出比 GPT-3 少约 25%。

在 Winogender（Rudinger 等，2018）和 CrowSPairs（Nangia 等，2020）数据集上， InstructGPT 相比 GPT-3 没有明显改进。

1.6.4 通过修改 RLHF 微调过程，可以最小化在公开 NLP 数据集上的性能退化

在 RLHF 微调过程中，我们观察到在某些公开 NLP 数据集上 InstructGPT 相比 GPT-3 存在性能下降，尤其是 SQuAD（Rajpurkar 等，2018）、DROP（Dua 等，2019）、HellaSwag（Zellers 等，2019）和 WMT 2015 法英翻译（Bojar 等，2015）。

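This is an example of an "alignment tax": alignment may come at the cost of lower performance on some tasks. Without compromising labeler preference scores, we greatly reduce the performance regressions on these datasets by mixing PPO updates with updates that increase the log likelihood of the pretraining distribution (PPO-ptx).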

InstructGPT 可以推广到那些未参与编写训练数据的标注员（held-out labelers）。为测试 InstructGPT 的泛化能力而进行的初步实验结果表明，与参与训练的标注员（training labelers）一样，未参与训练的标注员也更喜欢 InstructGPT 而不是 GPT-3 的输出。当然，还需要进一步研究这些模型在更广泛的用户群体上的表现，以及它们在人们所期望的行为存在分歧的输入上的表现（inputs where humans disagree about the desired behavior）。

1.6.5 在公开 NLP 数据集上微调不如在人类偏好数据上微调的效果好

我们比较了两个微调的 GPT-3：

在人类偏好数据上微调的 GPT-3（即 InstructGPT）；
GPT-3 fine-tuned on two compilations of public NLP tasks: FLAN (Wei et al., 2021) and T0/T0++ (Sanh et al., 2021).

这两个数据集包含多种 NLP 任务，以及每个任务的自然语言指令（natural language instructions）。

标注员明显更喜欢 InstructGPT 的输出。相比基线，

InstructGPT 的胜率为 73.4 ± 2％，
T0 和 FLAN fine-tuned GPT-3 分别为 26.8 ± 2％和 29.8 ± 2％。

1.6.6 InstructGPT 对 RLHF 微调之外的指令有良好的泛化能力

我们对 InstructGPT 的能力进行了定性探究，发现它能够遵循如下指令：

总结代码，
回答关于代码的问题，
有时还能遵循不同语言的指令，尽管这些指令在训练数据中非常少。

相比之下，GPT-3 虽然也可以执行这些任务，但需要更精心设计的 prompt ，并且遵循这些领域指令的效果欠佳。

这个结果很令人兴奋，因为它表明 InstructGPT 能够推广“遵循指令”的概念。即使在只有非常少的直接监督信号（SFT 训练样本）的任务上，它们仍然具备了一定的对齐性。

InstructGPT 仍然会犯一些简单的错误。例如，可能无法遵循指令，捏造事实，对简单问题给出冗长的回答，或无法检测出有错误前提的指令。但总体而言，我们的结果表明，使用人类偏好微调大语言模型可以显著改善它们在各种任务上的行为，相应的，也需要更多工作提高它们的安全性和可靠性。

2 Related work

2.1 Research on alignment and learning from human feedback

2.1.1 RLHF: origins in games

InstructGPT 建立在前人的技术基础上，特别是用人类反馈强化学习（RLHF）来对齐模型。

RLHF 最初是为了在模拟环境（simulated environments）和 Atari 游戏中训练简单机器人而开发的（Christiano 等，2017; Ibarz 等，2018），最近被用于微调语言模型来总结文本（Ziegler 等，2019; Stiennon 等，2020; Böhm 等，2019; Wu 等，2021）。

2.1.2 InstructGPT：基于 RLHF 在更广泛的语言任务上对齐 LLM

这项工作还受到了下列类似工作的影响：

对话（Jaques 等，2019; Yi 等，2019; Hancock 等，2019）
翻译（Kreutzer 等，2018; Bahdanau 等，2016）
语义解析（Lawrence 和 Riezler，2018）
故事生成（Zhou 和 Xu，2020）
评论生成（Cho 等，2018）
证据提取（Perez 等，2019）

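Madaan et al. (2022) use human feedback to augment prompts and improve the performance of GPT-3. There has also been work on aligning agents in text-based environments using RL with a normative prior (Nahian et al., 2021).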

我们的工作可以看作是用 RLHF 在更广泛的语言任务上对齐语言模型。

2.1.3 语言模型对齐意味着什么

近期，“语言模型对齐意味着什么”这一问题备受关注（Gabriel，2020）。

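Kenton et al. (2021) catalog behavioral issues in LMs that result from misalignment, including producing harmful content and gaming misspecified objectives.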
同一时间，Askell 等（2021）提出将语言助手作为对齐研究的测试对象，研究了一些简单的基线和它们的扩展特性。

2.2 训练模型遵循指令（follow instructions）

Our work is also related to research on cross-task generalization in language models, where LMs are fine-tuned on a broad range of public NLP datasets (usually prefixed with an appropriate instruction) and evaluated on a different set of NLP tasks. There has been a range of work in this domain (Yi et al., 2019; Mishra et al., 2021; Wei et al., 2021; Khashabi et al., 2020; Sanh et al., 2021; Aribandi et al., 2021), which differ in training and evaluation data, formatting of instructions, size of pretrained models, and other experimental details. A consistent finding across studies is that fine-tuning LMs on a range of NLP tasks, with instructions, improves their downstream performance on held-out tasks, both in the zero-shot and few-shot settings.

There is also a related line of work on instruction following for navigation, where models are trained to follow natural language instructions to navigate in a simulated environment (Bahdanau et al., 2018; Abramson et al., 2020; Zhao et al., 2021).

2.3 评估语言模型的危害

A goal of modifying the behavior of language models is to mitigate the harms of these models when they’re deployed in the real world. These risks have been extensively documented (Bender et al., 2021; Bommasani et al., 2021; Kenton et al., 2021; Weidinger et al., 2021; Tamkin et al., 2021). Language models can produce biased outputs (Dhamala et al., 2021; Liang et al., 2021; Manela et al., 2021; Caliskan et al., 2017; Kirk et al., 2021), leak private data (Carlini et al., 2021), generate misinformation (Solaiman et al., 2019; Buchanan et al., 2021), and be used maliciously; for a thorough review we direct the reader to Weidinger et al. (2021). Deploying language models in specific domains gives rise to new risks and challenges, for example in dialog systems (Henderson et al., 2018; Xu et al., 2020; Dinan et al., 2019b). There is a nascent but growing field that aims to build benchmarks to concretely evaluate these harms, particularly around toxicity (Gehman et al., 2020), stereotypes (Nadeem et al., 2020), and social bias (Dhamala et al., 2021; Nangia et al., 2020; Rudinger et al., 2018). Making significant progress on these problems is hard since well-intentioned interventions on LM behavior can have side-effects (Welbl et al., 2021; Blodgett et al., 2020); for instance, efforts to reduce the toxicity of LMs can reduce their ability to model text from under-represented groups, due to prejudicial correlations in the training data (Xu et al., 2021).

2.4 修改模型行为，降低危害

There are many ways to change the generation behavior of language models. Solaiman and Dennison (2021) fine-tune LMs on a small, value-targeted dataset, which improves the models’ ability to adhere to these values on a question answering task. Ngo et al. (2021) filter the pretraining dataset by removing documents on which a language model has a high conditional likelihood of generating a set of researcher-written trigger phrases. When trained on this filtered dataset, their LMs generate less harmful text, at the cost of a slight decrease in language modeling performance. Xu et al. (2020) use a variety of approaches to improve the safety of chatbots, including data filtering, blocking certain words or n-grams during generation, safety-specific control tokens (Keskar et al., 2019; Dinan et al., 2019a), and human-in-the-loop data collection (Dinan et al., 2019b). Other approaches for mitigating the generated bias by LMs use word embedding regularization (Liu et al., 2019; Huang et al., 2019), data augmentation (Liu et al., 2019; Dinan et al., 2019a; Sheng et al., 2019), null space projection to make the distribution over sensitive tokens more uniform (Liang et al., 2021), different objective functions (Qian et al., 2019), or causal mediation analysis (Vig et al., 2020). There is also work on steering the generation of language models using a second (usually smaller) language model (Dathathri et al., 2019; Krause et al., 2020), and variants of this idea have been applied to reducing language model toxicity (Schick et al., 2021).

3 Methods and experimental details

3.1 High-level methodology

Our methodology follows that of Ziegler et al. (2019) and Stiennon et al. (2020), who applied it in the stylistic continuation and summarization domains.

3.1.1 准备工作

如下基础准备：

一个预训练的语言模型，即 GPT-3 (Radford 等，2019; Brown 等，2020; Fedus 等，2021; Rae 等，2021; Thoppilan 等，2022)
一个 prompt 类别分布（a distribution of prompts，希望模型输出对齐到这些领域）
一个经过培训的人工标注团队 (详见 3.4 节)。

3.1.2 InstructGPT 训练三部曲

按照以下三个步骤开始训练，如图 2 所示，

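Step 1: Collect demonstration data and train a supervised policy. For a given input, labelers provide demonstrations of the desired behavior (see Section 3.2). A pretrained GPT-3 model is then fine-tuned on this data using supervised learning.
Step 2: Collect comparison data and train a reward model (RM). For a given input, model outputs are collected and labelers indicate which output they prefer. A reward model is then trained to predict the human-preferred output.
Step 3: Optimize a policy against the reward model using PPO. The output of the RM is used as a scalar reward, and the supervised policy is fine-tuned with the PPO algorithm (Schulman et al., 2017) to optimize this reward.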

步骤 2 和 3 可以持续迭代；每次在当前最佳策略上收集更多的对比数据，这些数据又用于训练新的 RM 和新的策略。实际上，大部分对比数据来自于我们的 supervised policies，一小部分来自于我们的 PPO policies。

3.2 Dataset

3.2.1 Mainly from OpenAI API user data

我们的 prompts 数据集主要来自用户提交给 OpenAI API 的文本 prompts，尤其是用户通过 OpenAI Playground interface 提交的那些 prompts ——

这个环境背后运行的是我们用 SFT + 我们自己的一部分示例数据训练出来的初期 InstructGPT models。
用户每次通过 Playground 接口用到 InstructGPT 时，我们都会告知他们，他们的数据可能会被用于训练下一步的模型。

本文并没有用到生产环境 OpenAI API 的用户数据。

3.2.2 去重

我们的去重比较简单，有共同长前缀的 prompt 就认为是重复的，并将每个用户 ID 的 prompt 数量限制为 200 个。

我们还基于用户 ID 创建训练、验证和测试集（train, validation, and test splits）。为了避免模型学习到客户信息，我们过滤掉了训练数据集中包含个人身份信息（PII）的 prompts。
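The data preparation described in this subsection (prefix-based deduplication, a cap of 200 prompts per user ID, and splits assigned by user ID) can be sketched as follows. This is only an illustration of the idea, not the procedure actually used: the 64-character prefix length, the hash-based split, and the 10%/10% validation/test proportions are all assumptions introduced here.

```python
import hashlib
from collections import defaultdict

PREFIX_CHARS = 64    # hypothetical definition of a "long common prefix"
MAX_PER_USER = 200   # cap per user ID, as stated in the text


def dedup_and_cap(records):
    """Keep the first prompt for each shared prefix, at most MAX_PER_USER per user.

    `records` is an iterable of (user_id, prompt) tuples.
    """
    seen_prefixes = set()
    per_user = defaultdict(int)
    kept = []
    for user_id, prompt in records:
        prefix = prompt[:PREFIX_CHARS]
        if prefix in seen_prefixes or per_user[user_id] >= MAX_PER_USER:
            continue
        seen_prefixes.add(prefix)
        per_user[user_id] += 1
        kept.append((user_id, prompt))
    return kept


def split_for_user(user_id, valid_pct=10, test_pct=10):
    """Assign every prompt of a user to the same split (assumed proportions)."""
    bucket = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + valid_pct:
        return "valid"
    return "train"
```

Because the split depends only on the user ID, no user can contribute prompts to more than one of the train, validation, and test sets.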

3.2.3 冷启动（第一版 InstructGPT）

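To train the very first InstructGPT models, we asked labelers to write prompts themselves. This is because we needed an initial source of instruction-like prompts to bootstrap the process, and these kinds of prompts were not often submitted to the regular GPT-3 models on the API.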

3.2.3 三种 prompt：plain/few-shot/user-based

我们要求标注员编写三种类型的 prompt ：

Plain: 标注员提出任意的任务，确保任务具有足够的多样性就行。
Few-shot: 标注员提出一条指令，并为该指令提供多个查询/响应对（query/response pairs）。
User-based: OpenAI API 的 waitlist applications 中我们列了一些使用案例。我们要求标注员提供与这些使用案例相关的 prompts。

详见附录 A。

3.2.4 三个 prompts 数据集及大小

根据以上 prompts，我们生成了三个不同的数据集用于不同的微调过程，如表 6 所示，

Table 6: Dataset sizes, in terms of number of prompts.

split | source | size
SFT train | labeler | 11,295
SFT train | customer | 1,430
SFT valid | labeler | 1,550
SFT valid | customer | 103
SFT total | | ~14k
RM train | labeler | 6,623
RM train | customer | 26,584
RM valid | labeler | 3,488
RM valid | customer | 14,399
RM total | | ~51k
PPO train | customer | 31,144
PPO valid | customer | 16,185
PPO total | | ~47k

SFT 数据集（来自 API 和标注员）：用于训练 SFT 模型，包含约 13k training prompts
RM 数据集（来自 API 和标注员）：标注员对模型输出的排名数据，用于训练 RM 模型，有 33k training prompts
PPO 数据集（仅来自 API）：没有任何人工标签，用作 RLHF 微调的输入，有 31k training prompts。
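The rounded sizes quoted for the three datasets can be checked against the per-split counts in Table 6. The dictionary below just transcribes the table; the helper name is illustrative.

```python
# (dataset, split, source) -> number of prompts, transcribed from Table 6.
DATASET_SIZES = {
    ("SFT", "train", "labeler"): 11295,
    ("SFT", "train", "customer"): 1430,
    ("SFT", "valid", "labeler"): 1550,
    ("SFT", "valid", "customer"): 103,
    ("RM", "train", "labeler"): 6623,
    ("RM", "train", "customer"): 26584,
    ("RM", "valid", "labeler"): 3488,
    ("RM", "valid", "customer"): 14399,
    ("PPO", "train", "customer"): 31144,
    ("PPO", "valid", "customer"): 16185,
}


def total(dataset, split=None):
    """Sum prompt counts for a dataset, optionally restricted to one split."""
    return sum(
        n for (d, s, _), n in DATASET_SIZES.items()
        if d == dataset and (split is None or s == split)
    )


for name in ("SFT", "RM", "PPO"):
    print(name, "train:", total(name, "train"), "all splits:", total(name))
```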

3.2.5 Prompts 类别分布及占比

表 1 中展示了 API prompt（尤其是 RM 数据集）的类别分布，这些类别由我们的承包商标注。可以看到，占比最大的是文本生成，

Table 1: API prompt dataset 中 use case 类别及占比

Use-case | (%)
Generation | 45.6%
Open QA | 12.4%
Brainstorming | 11.2%
Chat | 8.4%
Rewrite | 6.6%
Summarization | 4.2%
Classification | 3.5%
Other | 3.5%
Closed QA | 2.6%
Extract | 1.9%

3.2.6 Example prompts

表 2 展示了几个 prompt 示例（由研究人员编写，提交给 InstructGPT 的格式），

Table 2: API prompt 具体例子。

Use-case | Prompt
Brainstorming | List five ideas for how to regain enthusiasm for my career
Generation | Write a short story where a bear goes to the beach, makes friends with a seal, and then returns home.
Rewrite | This is the summary of a Broadway play:
"""
{summary}
"""
This is the outline of the commercial for that play:
"""

更多信息：

提交给 InstructGPT 的 prompts 见附录 A.2.1，
提交给 GPT-3 的 prompts（做对比）见附录 A.2.2，

3.3 训练任务

我们的训练任务有两个来源：

标注员编写的 prompt 数据集，
提交给早期 InstructGPT 模型的 prompt 数据集。

这些 prompt 种类繁多，包括生成、问答、对话、摘要、提取和其他自然语言任务 (见表 1)。我们的数据集中 96%+ 是英文，但在 4.3 节中，我们也探讨了 InstructGPT 对其他语言指令的响应能力以及完成代码任务的能力。

对于每个自然语言 prompt，

任务通常通过自然语言指令直接指定，例如， “Write a story about a wise frog”（“写一个关于一只聪明的青蛙的故事”），
也可以间接指定
- 通过 few-shot examples，例如，提供两个关于青蛙的故事作为示例，prompt 模型生成一个新的故事，
- 通过 implicit continuation，例如，提供一个关于青蛙的故事的开头，让模型续写。

在每种情况下，我们都要求标注员尽力推断每个 prompt 背后的用户意图，并要求他们跳过那些任务非常模糊的 prompt。此外，标注员还会根据我们提供的指导 (见附录 B) 和他们自己的判断，思考其中隐含的意图（implicit intentions），例如回答的真实性，潜在的有偏见或有毒输出。

3.4 人工数据收集

为了生成示范和对比数据，以及进行结果评估，我们通过 Upwork 和 ScaleAI 雇了大约 40 名外包人员。

与之前关于摘要任务的人类偏好数据收集工作 (Ziegler 等，2019; Stiennon 等，2020; Wu 等，2021) 相比，我们的输入数据涵盖了范围更广泛的任务，甚至还包括有争议和敏感的主题。

3.4.1 标注员筛选

我们的目标是选择一组标注员，他们对不同人口分布的偏好很敏锐，并擅长识别潜在的有害输出。因此，

我们进行了一个筛选测试（screening test）来衡量标注员在这些方面的表现；
最后选出在这个测试中表现良好的标注员。

相关的选择过程和标注员分布信息，见附录 B.1。

3.4.2 对齐冲突的处理

在训练和评估过程中，我们的对齐标准可能会发生冲突：例如，当用户请求一个潜在有害的响应时。针对这种情况，我们采取如下方式，

训练阶段：优先考虑对用户的有用性 (否则就需要做出一些艰难的设计决策，我们留给未来的工作；更多讨论见 5.4 节)。
最终评估阶段：要求标注员优先考虑真实性和无害性 (因为这是我们真正关心的)。

与 Stiennon 等 (2020) 一样，我们在项目过程中与标注员密切合作。我们有一个入职流程，对标注员进行培训，为每个任务编写详细的说明 (见附录 B.2)，并在群聊中回答标注员的问题。

3.4.3 Held-out labelers: testing generalization

为了解 InstructGPT 推广到其他标注员的偏好时表现有多好，我们雇佣了另一组独立的标注员，他们不参与编写任何训练数据。这些标注员来自相同的供应商，但没有经过前面的筛选过程。

Despite the complexity of the task, we find that inter-annotator agreement rates are quite high: training labelers agree with each other 72.6 ± 1.5% of the time, while for held-out labelers this number is 77.3 ± 1.3%. For comparison, in the summarization work of Stiennon et al. (2020) researcher-researcher agreement was 73 ± 4%.

3.5 Models（模型）

我们从 GPT-3 预训练模型开始微调。GPT-3 在大量互联网数据上进行了训练，适用于各种下游任务，但其行为尚未充分符合人类需求。基于 GPT-3，我们使用三种不同技术进行了模型微调。

3.5.1 Supervised fine-tuning (SFT)

使用监督学习的方式，在我们的示范数据上对 GPT-3 进行微调。

16 epoch
a cosine learning rate decay
residual dropout 0.2

得到很多个 SFT 模型。最后根据 validation set 上的 RM 分数选择最终的 SFT 模型。

与 Wu 等（2021）类似，我们发现我们的 SFT 模型在 1 个 epoch 后在 validation loss 上就会过拟合（overfit）；但是，同时我们发现，尽管存在过拟合问题，但更多 epoch 对 RM 分数和人类偏好得分都有帮助。

3.5.2 Reward modeling (RM)

将 SFT 模型去掉最后的 unembedding 层，然后从这样的模型开始训练，

输入：prompt 和 response，
输出：一个标量奖励。

最后得到的就是一个 RM 模型。

在本文中，我们仅使用 6B 的 RM，因为这样可以节省大量计算资源，并且我们发现 175B 的 RM 训练可能不稳定，因此不太适合在 RL 中用作 value function（更多细节见附录 C）。

In Stiennon et al. (2020), the RM is trained on a dataset of comparisons between two model outputs on the same input.

They use a cross-entropy loss, with the comparisons as labels—the difference in rewards represents the log odds that one response will be preferred to the other by a human labeler.

To speed up comparison collection, we present labelers with anywhere between $K=4$ and $K=9$ responses to the same prompt and ask them to rank them. This produces ${K \choose 2}$ comparisons for each prompt shown to a labeler.
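To make the counting concrete, here is a minimal sketch of how one ranking of K responses expands into K-choose-2 pairwise comparisons. It assumes a strict ranking with no ties, ordered best first; the function name is illustrative.

```python
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Expand a best-first ranking into (preferred, rejected) pairs."""
    # combinations() keeps input order, so the first element of each pair
    # is always ranked above the second.
    return list(combinations(ranked_responses, 2))


pairs_k4 = ranking_to_pairs(["A", "B", "C", "D"])
pairs_k9 = ranking_to_pairs(list("ABCDEFGHI"))
print(len(pairs_k4), len(pairs_k9))
```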

由于每个 labeling task 内的 comparisons 非常相关，我们发现如果简单地 shuffle the comparisons into one dataset，对数据集训练一次就会导致 RM 过拟合。

That is, if each of the possible ${K \choose 2}$ comparisons is treated as a separate data point, then each completion will potentially be used for $K-1$ separate gradient updates. The model tends to overfit after a single epoch, so repeating data within an epoch also causes it to overfit.

因此，我们将每个 prompt 的所有 ${K \choose 2}$ 个对比作为单个 batch element 进行训练。这样做在计算上更加高效，因为只需要一次正向遍历（forward pass of the RM for each completion，而不是 ${K \choose 2}$ forward passes for $K$ completions），并且由于不再过拟合，它实现了更好的 validation accuracy and log loss。

具体来说，奖励模型的损失函数为：

\begin{equation} \label{eq1} \operatorname{loss}\left(\theta \right) = -\frac{1}{{K \choose 2}} E_{\left(x, y_{w}, y_{l}\right) \sim D}\left[\log \left(\sigma\left(r_{\theta}\left(x, y_{w}\right)-r_{\theta}\left(x, y_{l}\right)\right)\right)\right] \end{equation}

其中，

$x$：prompt（输入的提示词）
$y$：completion（模型的返回）
$y_{w}$：the preferred completion out of the pair of $y_{w}$ and $y_{l}$
$D$：dataset of human comparisons（标注员给出的对比）
$r_{\theta}(x, y)$：scalar output of the RM for prompt $x$ and completion $y$ with parameters $\theta$

最后，由于 RM loss 对奖励的平移不变，我们使用一个 bias 来对奖励模型进行归一化，这样标注员的示范在进行 RL 之前的平均分数为 0。
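The loss for a single prompt can be sketched in plain Python. The function below takes the scalar RM outputs for the K completions of one prompt, ordered from most to least preferred, and averages the negative log-sigmoid of the reward difference over all K-choose-2 pairs. It illustrates the formula only and is not the training code; the names are made up here.

```python
import math
from itertools import combinations


def log_sigmoid(d: float) -> float:
    """Numerically stable log(sigmoid(d))."""
    if d >= 0:
        return -math.log1p(math.exp(-d))
    return d - math.log1p(math.exp(d))


def rm_loss_for_prompt(rewards_best_first):
    """Pairwise ranking loss for one prompt.

    `rewards_best_first` holds r_theta(x, y) for each of the K completions,
    ordered by the labeler's ranking (most preferred first).
    """
    pairs = list(combinations(rewards_best_first, 2))  # (r_w, r_l) pairs
    total = sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs)
    return -total / len(pairs)


good_order = rm_loss_for_prompt([2.0, 1.0, 0.0, -1.0])
bad_order = rm_loss_for_prompt([-1.0, 0.0, 1.0, 2.0])
shifted = rm_loss_for_prompt([12.0, 11.0, 10.0, 9.0])
print(good_order, bad_order, shifted)
```

Since only reward differences enter the loss, adding a constant to every reward leaves it unchanged, which is the shift invariance that motivates the bias normalization mentioned above.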

3.5.3 Reinforcement learning (RL)

再次沿用（Stiennon 等，2020），我们使用 PPO（Schulman 等，2017）对 SFT 模型进行微调。我们创建一个 bandit 环境，

给一个随机的客户 prompt，得到一个 response。
给定一个 prompt 和相应的 response，会产生一个由 RM 确定的奖励，然后结束这一轮。

此外，我们添加了一个 per-token 的 KL 惩罚（来自 SFT 模型），以减轻奖励模型的过优化（over-optimization）。值函数是从 RM 初始化的。我们将这些模型称为PPO。

我们还尝试将预训练梯度（pretraining gradients）mixing into PPO 梯度中，以减轻在公开 NLP 数据集上的性能下降。我们将这些模型称为PPO-ptx。

我们在 RL 训练中最大化以下组合目标函数：

\begin{equation} \label{eq2} \begin{split} \operatorname{objective}\left(\phi\right)= & E_{\left(x, y\right) \sim D_{\pi_{\phi}^{\mathrm{RL}}}}\left[r_{\theta}(x, y)-\beta \log \left(\pi_{\phi}^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\right)\right] + \\
& \gamma E_{x \sim D_{\textrm{pretrain}}}\left[\log(\pi_{\phi}^{\mathrm{RL}}(x))\right] \end{split} \end{equation}

其中

$\pi_{\phi}^{\mathrm{RL}}$ 是学习到的 RL 策略，
$\pi^{\mathrm{SFT}}$ 是 SFT 模型，
$D_\textrm{pretrain}$ 是预训练分布（pretraining distribution）。
KL 奖励系数 $\beta$ 和预训练损失系数 $\gamma$ 分别控制 KL 惩罚和预训练梯度的强度（strength）。
对于 “PPO” models，$\gamma$ 设置为 0。

除非另有说明，在本文中，InstructGPT 指的是 PPO-ptx models。
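A Monte Carlo reading of the objective may help: the first expectation is estimated over sampled (prompt, response) pairs, the second over pretraining text. The sketch below evaluates the objective from precomputed log-probabilities; it is not a PPO implementation, and the example values of beta and gamma are arbitrary placeholders rather than values from the paper.

```python
def combined_objective(rl_samples, pretrain_logps, beta, gamma):
    """Estimate the PPO-ptx objective from samples.

    rl_samples:     list of (reward, logp_rl, logp_sft) for responses y
                    sampled from the RL policy, where logp_* = log pi(y | x).
    pretrain_logps: list of log pi_RL(x) for text x drawn from the
                    pretraining distribution.
    """
    # Reward minus the KL penalty; log(a / b) = log a - log b.
    rl_term = sum(
        reward - beta * (logp_rl - logp_sft)
        for reward, logp_rl, logp_sft in rl_samples
    ) / len(rl_samples)
    ptx_term = (
        sum(pretrain_logps) / len(pretrain_logps) if pretrain_logps else 0.0
    )
    return rl_term + gamma * ptx_term


samples = [(1.0, -2.0, -3.0)]
ppo_value = combined_objective(samples, [-4.0], beta=0.1, gamma=0.0)
ppo_ptx_value = combined_objective(samples, [-4.0], beta=0.1, gamma=0.5)
print(ppo_value, ppo_ptx_value)
```

Setting gamma to 0 recovers the plain "PPO" objective, matching the definition in the list above.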

3.5.4 性能比较基线

我们将 PPO 模型与下列模型进行比较：

SFT 模型
GPT-3
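GPT-3-prompted: GPT-3 provided with a few-shot prefix that "prompts" it into an instruction-following mode. In practice, the prefix is prepended to the user-specified instruction.
- To obtain this prefix, authors RL and DA held a prefix-finding competition: each spent an hour interacting with GPT-3 to come up with their two best prefixes. The winning prefix was the one that led GPT-3 to attain the highest RM score on the prompt validation set. DA won.
175B GPT-3 fine-tuned on the FLAN and T0 datasets.
- Both datasets consist of a variety of NLP tasks, combined with natural language instructions for each task. We fine-tune on approximately 1 million examples from each and choose the checkpoint that obtains the highest reward model score on the validation set. See Appendix C for more details.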

3.6 性能评估

为了评估我们模型的“对齐”程度，首先需要明确“对齐”的含义。

一直以来，“对齐”的定义都很模糊和令人困惑，有很多提法（Chen 等，2021；Leike 等，2018；Gabriel，2020）。

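Following Leike et al. (2018), our aim is to train models that act in accordance with user intentions. More practically, for the purpose of our language tasks, we use a framework similar to Askell et al. (2021), who define models to be aligned if they are helpful, honest, and harmless.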

3.6.1 指标 helpful

要做到有帮助，模型不仅要能遵循指令，还应该能从一个 few-shot prompt 或其他可解释的模式（例如 Q: {question}\nA:）中推断意图（infer intention）。

给定的 prompt 的意图可能不清楚，这种情况下就依赖于标注员的判断，我们的主要指标是标注员的偏好评分。但另一方面，由于我们的标注员不是生成 prompt 的用户，因此，下面二者之间可能存在偏差：

用户的实际意图
标注员通过 prompt 所理解的用户意图

honest / truthfulness

在纯生成式模型中如何衡量诚实度尚无定论；这需要将模型的实际输出与其关于正确输出的“信念”进行比较（model’s actual output to its “belief” about the correct output），由于模型是一个黑盒，我们无法推断它的信念。

因此，我们使用真实性（truthfulness）—— 模型关于世界的陈述是否真实 —— 来衡量。具体到两个指标：

模型在封闭域任务中编造信息（make up information on closed domain tasks）的倾向，即“幻觉”（hallucinations），
TruthfulQA 数据集（Lin 等，2021）上的表现。

显然，这只能覆盖真实性实际含义（what is actually meant by truthfulness）的一小部分。

harmless

与诚实度类似，衡量语言模型的有害性也很难。在大多数情况下，语言模型的危害取决于其输出在现实世界中是如何被使用的。例如，对于一个生成有毒输出的模型，

如果部署在聊天机器人环境中，可能就是有害的，
如果用于数据增强以训练更准确的毒性检测模型，则可能是有益的。

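Early in the project, we had labelers evaluate whether an output was "potentially harmful". We later discontinued this, because it required too much speculation about how the outputs would ultimately be used, especially since part of our data also comes from customers using the Playground API.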

因此，我们使用了一套更具体的替代方案，旨在捕捉最终部署的模型中可能导致有害的不同行为方面：我们让标注员从一个用户助理的角度来评估输出是否恰当，是否 denigrates a protected class，或是否包含性或暴力内容。

我们还在衡量偏见和毒性的数据集上对 InstructGPT 进行基准测试，例如 RealToxicityPrompts（Gehman 等，2020）和 CrowS-Pairs（Nangia 等，2020）。

3.6.2 定量评估

我们的定量评估分为两个独立的部分。

在 OpenAI API 真实用户的 prompts 上的表现

数据来源：OpenAI Playground API（背后是 InstructGPT）收集来的用户 prompts。所以，评估用的 prompts 与训练用的 prompt 同源，但未参与训练，也就是说只选择那些未参与训练的客户 prompts。

但这里有个问题，训练用的 prompt 是专门为 InstructGPT 设计的，因此 GPT-3 在这些 prompts 上的效果可能不佳，有失公平。为此，我们还收集了用户通过 OpenAI GPT-3 API 提交的 prompts 进行评估；这些 prompt 通常不是“遵循指令”的风格，而是专门为 GPT-3 设计的。

主要评估指标是人类偏好评分。对于每个模型，都计算其输出相对于 baseline 被人类偏好的频率；这里用我们的 175B SFT 模型作为 baseline ，因为它的性能处于中等水平。此外，我们要求标注员使用 1-7 Likert scale 判断每个 response 的整体质量，并为每个输出收集一些元数据（见表 3）。

Table 3: Labeler-collected metadata on the API distribution

Metadata | Scale
Overall quality | Likert scale; 1-7
Fails to follow the correct instruction / task | Binary
Inappropriate for customer assistant | Binary
Hallucination | Binary
Satisfies constraint provided in the instruction | Binary
Contains sexual content | Binary
Contains violent content | Binary
Encourages or fails to discourage violence/abuse/terrorism/self-harm | Binary
Denigrates a protected class | Binary
Gives harmful advice | Binary
Expresses opinion | Binary
Expresses moral judgment | Binary

Performance on public NLP datasets

我们在两种公开数据集上进行评估：

能衡量模型安全性的数据集，特别是真实性、毒性和偏见；
能衡量在传统 NLP 任务（如问答、阅读理解和摘要）上的 zero-shot 性能的数据集。

我们还在 RealToxicityPrompts 数据集（Gehman 等，2020）上人工评估了毒性。

We are releasing samples from our models on all of the sampling-based NLP tasks.

4 结果

暂略。见原文。

Figure 1: Human evaluations of various models on our API prompt distribution, evaluated by how often outputs from each model were preferred to those from the 175B SFT model. Our InstructGPT models (PPO-ptx) as well as its variant trained without pretraining mix (PPO) significantly outperform the GPT-3 baselines (GPT, GPT prompted); outputs from our 1.3B PPO-ptx model are preferred to those from the 175B GPT-3. Error bars throughout the paper are 95% confidence intervals

5 问题讨论

暂略。见原文。

参考文献

Abramson, J., Ahuja, A., Barr, I., Brussee, A., Carnevale, F., Cassin, M., Chhaparia, R., Clark, S., Damoc, B., Dudzik, A., et al. (2020). Imitating interactive intelligence. arXiv preprint arXiv:2012.05672
Achiam, J., Held, D., Tamar, A., and Abbeel, P. (2017). Constrained policy optimization. In International Conference on Machine Learning pages 22–31. PMLR.
Anthony, T., Tian, Z., and Barber, D. (2017). Thinking fast and slow with deep learning and tree search. arXiv preprint arXiv:1705.08439
Aribandi, V., Tay, Y., Schuster, T., Rao, J., Zheng, H. S., Mehta, S. V., Zhuang, H., Tran, V. Q., Bahri, D., Ni, J., et al. (2021). Ext5: Towards extreme multi-task scaling for transfer learning. arXiv preprint arXiv:2111.10952
Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., Jones, A., Joseph, N., Mann, B., DasSarma, N., et al. (2021). A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861
Bahdanau, D., Brakel, P., Xu, K., Goyal, A., Lowe, R., Pineau, J., Courville, A., and Bengio, Y. (2016). An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086
Bahdanau, D., Hill, F., Leike, J., Hughes, E., Hosseini, A., Kohli, P., and Grefenstette, E. (2018). Learning to understand goal specifications by modelling reward. arXiv preprint arXiv:1806.01946
Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency pages 610–623.
Blodgett, S. L., Barocas, S., Daumé III, H., and Wallach, H. (2020). Language (technology) is power: A critical survey of "bias" in NLP. arXiv preprint arXiv:2005.14050
Böhm, F., Gao, Y., Meyer, C. M., Shapira, O., Dagan, I., and Gurevych, I. (2019). Better rewards yield better summaries: Learning to summarise without references. arXiv preprint arXiv:1909.01214
Bojar, O., Chatterjee, R., Federmann, C., Haddow, B., Huck, M., Hokamp, C., Koehn, P., Logacheva, V., Monz, C., Negri, M., Post, M., Scarton, C., Specia, L., and Turchi, M. (2015). Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation pages 1–46, Lisbon, Portugal. Association for Computational Linguistics.
Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258
Bostrom, N. (2014). Superintelligence Dunod.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165
Buchanan, B., Lohn, A., Musser, M., and Sedova, K. (2021). Truth, lies, and automation. Technical report, Center for the Study of Emerging Technology.
Caliskan, A., Bryson, J. J., and Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science 356(6334):183–186.
Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., et al. (2021). Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21) pages 2633–2650.
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374
Cho, W. S., Zhang, P., Zhang, Y., Li, X., Galley, M., Brockett, C., Wang, M., and Gao, J. (2018). Towards coherent and cohesive long-form text generation. arXiv preprint arXiv:1811.00511
Choi, E., He, H., Iyyer, M., Yatskar, M., Yih, W.-t., Choi, Y., Liang, P., and Zettlemoyer, L. (2018). Quac: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing pages 2174–2184.
Christiano, P., Cotra, A., and Xu, M. (2021). Eliciting latent knowledge: How to tell if your eyes deceive you. https://www.alignmentforum.org/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge
Christiano, P., Shlegeris, B., and Amodei, D. (2018). Supervising strong learners by amplifying weak experts. arXiv preprint arXiv:1810.08575
Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. (2017). Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems pages 4299–4307.
Dathathri, S., Madotto, A., Lan, J., Hung, J., Frank, E., Molino, P., Yosinski, J., and Liu, R. (2019). Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164
Dhamala, J., Sun, T., Kumar, V., Krishna, S., Pruksachatkun, Y., Chang, K.-W., and Gupta, R. (2021). Bold: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency pages 862–872.
Dinan, E., Fan, A., Williams, A., Urbanek, J., Kiela, D., and Weston, J. (2019a). Queens are powerful too: Mitigating gender bias in dialogue generation. arXiv preprint arXiv:1911.03842
Dinan, E., Humeau, S., Chintagunta, B., and Weston, J. (2019b). Build it break it fix it for dialogue safety: Robustness from adversarial human attack. arXiv preprint arXiv:1908.06083
Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. (2019). Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. arXiv preprint arXiv:1903.00161
Fedus, W., Zoph, B., and Shazeer, N. (2021). Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961
Gabriel, I. (2020). Artificial intelligence, values, and alignment. Minds and machines 30(3):411–437.
Gehman, S., Gururangan, S., Sap, M., Choi, Y., and Smith, N. A. (2020). Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462
Hancock, B., Bordes, A., Mazare, P.-E., and Weston, J. (2019). Learning from dialogue after deployment: Feed yourself, chatbot! arXiv preprint arXiv:1901.05415
Henderson, P., Sinha, K., Angelard-Gontier, N., Ke, N. R., Fried, G., Lowe, R., and Pineau, J. (2018). Ethical challenges in data-driven dialogue systems. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society pages 123–129.
Huang, P.-S., Zhang, H., Jiang, R., Stanforth, R., Welbl, J., Rae, J., Maini, V., Yogatama, D., and Kohli, P. (2019). Reducing sentiment bias in language models via counterfactual evaluation. arXiv preprint arXiv:1911.03064
Ibarz, B., Leike, J., Pohlen, T., Irving, G., Legg, S., and Amodei, D. (2018). Reward learning from human preferences and demonstrations in atari. In Advances in neural information processing systems pages 8011–8023.
Irving, G., Christiano, P., and Amodei, D. (2018). AI safety via debate. arXiv preprint arXiv:1805.00899
Jaques, N., Ghandeharioun, A., Shen, J. H., Ferguson, C., Lapedriza, A., Jones, N., Gu, S., and Picard, R. (2019). Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456
Kenton, Z., Everitt, T., Weidinger, L., Gabriel, I., Mikulik, V., and Irving, G. (2021). Alignment of language agents. arXiv preprint arXiv:2103.14659
Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C., and Socher, R. (2019). Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858
Khashabi, D., Min, S., Khot, T., Sabharwal, A., Tafjord, O., Clark, P., and Hajishirzi, H. (2020). Unifiedqa: Crossing format boundaries with a single qa system. arXiv preprint arXiv:2005.00700
Kirk, H., Jun, Y., Iqbal, H., Benussi, E., Volpin, F., Dreyer, F. A., Shtedritski, A., and Asano, Y. M. (2021). How true is gpt-2? an empirical analysis of intersectional occupational biases. arXiv preprint arXiv:2102.04130
Krause, B., Gotmare, A. D., McCann, B., Keskar, N. S., Joty, S., Socher, R., and Rajani, N. F. (2020). Gedi: Generative discriminator guided sequence generation. arXiv preprint arXiv:2009.06367
Kreutzer, J., Khadivi, S., Matusov, E., and Riezler, S. (2018). Can neural machine translation be improved with user feedback? arXiv preprint arXiv:1804.05958
Lawrence, C. and Riezler, S. (2018). Improving a neural semantic parser by counterfactual learning from human bandit feedback. arXiv preprint arXiv:1805.01252
Leike, J., Krueger, D., Everitt, T., Martic, M., Maini, V., and Legg, S. (2018). Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871
Leike, J., Martic, M., Krakovna, V., Ortega, P. A., Everitt, T., Lefrancq, A., Orseau, L., and Legg, S. (2017). AI safety gridworlds. arXiv preprint arXiv:1711.09883
Liang, P. P., Wu, C., Morency, L.-P., and Salakhutdinov, R. (2021). Towards understanding and mitigating social biases in language models. In International Conference on Machine Learning pages 6565–6576. PMLR.
Lin, S., Hilton, J., and Evans, O. (2021). Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958
Liu, H., Dacon, J., Fan, W., Liu, H., Liu, Z., and Tang, J. (2019). Does gender matter? towards fairness in dialogue systems. arXiv preprint arXiv:1910.10486
Madaan, A., Tandon, N., Clark, P., and Yang, Y. (2022). Memory-assisted prompt editing to improve gpt-3 after deployment. arXiv preprint arXiv:2201.06009
Manela, D. d. V., Errington, D., Fisher, T., van Breugel, B., and Minervini, P. (2021). Stereotype and skew: Quantifying gender bias in pre-trained and fine-tuned language models. arXiv preprint arXiv:2101.09688
Mishra, S., Khashabi, D., Baral, C., and Hajishirzi, H. (2021). Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773
Nadeem, M., Bethke, A., and Reddy, S. (2020). Stereoset: Measuring stereotypical bias in pretrained language models. arXiv preprint arXiv:2004.09456
Nahian, M. S. A., Frazier, S., Harrison, B., and Riedl, M. (2021). Training value-aligned reinforcement learning agents using a normative prior. arXiv preprint arXiv:2104.09469
Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., et al. (2021). Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332
Nallapati, R., Zhou, B., Gulcehre, C., Xiang, B., et al. (2016). Abstractive text summarization using sequence-to-sequence rnns and beyond. arXiv preprint arXiv:1602.06023
Nangia, N., Vania, C., Bhalerao, R., and Bowman, S. R. (2020). CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Online. Association for Computational Linguistics.
Ngo, H., Raterink, C., Araújo, J. G., Zhang, I., Chen, C., Morisot, A., and Frosst, N. (2021). Mitigating harm in language models with conditional-likelihood filtration. arXiv preprint arXiv:2108.07790
Perez, E., Karamcheti, S., Fergus, R., Weston, J., Kiela, D., and Cho, K. (2019). Finding generalizable evidence by learning to convince q&a models. arXiv preprint arXiv:1909.05863
Qian, Y., Muaz, U., Zhang, B., and Hyun, J. W. (2019). Reducing gender bias in word-level language models with a gender-equalizing loss function. arXiv preprint arXiv:1905.12801
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog 1(8):9.
Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., et al. (2021). Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446
Rajpurkar, P., Jia, R., and Liang, P. (2018). Know what you don’t know: Unanswerable questions for squad. arXiv preprint arXiv:1806.03822
Rudinger, R., Naradowsky, J., Leonard, B., and Van Durme, B. (2018). Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, Louisiana. Association for Computational Linguistics.
Sanh, V., Webson, A., Raffel, C., Bach, S. H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T. L., Raja, A., et al. (2021). Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207
Schick, T., Udupa, S., and Schutze, H. (2021). Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in nlp. arXiv preprint arXiv:2103.00453
Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2016). High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations (ICLR)
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347
Sheng, E., Chang, K.-W., Natarajan, P., and Peng, N. (2019). The woman worked as a babysitter: On biases in language generation. arXiv preprint arXiv:1909.01326
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. (2017). Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815
Soares, N., Fallenstein, B., Armstrong, S., and Yudkowsky, E. (2015). Corrigibility. In Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing pages 1631–1642.
Solaiman, I., Brundage, M., Clark, J., Askell, A., Herbert-Voss, A., Wu, J., Radford, A., Krueger, G., Kim, J. W., Kreps, S., et al. (2019). Release strategies and the social impacts of language models. arXiv preprint arXiv:1908.09203
Solaiman, I. and Dennison, C. (2021). Process for adapting language models to society (palms) with values-targeted datasets. arXiv preprint arXiv:2106.10328
Stiennon, N., Ouyang, L., Wu, J., Ziegler, D. M., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. (2020). Learning to summarize from human feedback. arXiv preprint arXiv:2009.01325
Tamkin, A., Brundage, M., Clark, J., and Ganguli, D. (2021). Understanding the capabilities, limitations, and societal impact of large language models. arXiv preprint arXiv:2102.02503
Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos, T., Baker, L., Du, Y., et al. (2022). Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239
Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Singer, Y., and Shieber, S. M. (2020). Investigating gender bias in language models using causal mediation analysis. In NeurIPS
Volske, M., Potthast, M., Syed, S., and Stein, B. (2017). Tl; dr: Mining reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization pages 59–63.
Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. (2019). Superglue: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537
Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. (2021). Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652
Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.-S., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A., et al. (2021). Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359
Welbl, J., Glaese, A., Uesato, J., Dathathri, S., Mellor, J., Hendricks, L. A., Anderson, K., Kohli, P., Coppin, B., and Huang, P.-S. (2021). Challenges in detoxifying language models. arXiv preprint arXiv:2109.07445
Wu, J., Ouyang, L., Ziegler, D. M., Stiennon, N., Lowe, R., Leike, J., and Christiano, P. (2021). Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862
Xu, A., Pathak, E., Wallace, E., Gururangan, S., Sap, M., and Klein, D. (2021). Detoxifying language models risks marginalizing minority voices. arXiv preprint arXiv:2104.06390
Xu, J., Ju, D., Li, M., Boureau, Y.-L., Weston, J., and Dinan, E. (2020). Recipes for safety in open-domain chatbots. arXiv preprint arXiv:2010.07079
Yi, S., Goel, R., Khatri, C., Cervone, A., Chung, T., Hedayatnia, B., Venkatesh, A., Gabriel, R., and Hakkani-Tur, D. (2019). Towards coherent and engaging spoken dialog response generation using automatic conversation evaluators. arXiv preprint arXiv:1904.13015
Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. (2019). Hellaswag: Can a machine really finish your sentence? In Association for Computational Linguistics pages 4791–4800.
Zhao, M., Anderson, P., Jain, V., Wang, S., Ku, A., Baldridge, J., and Ie, E. (2021). On the evaluation of vision-and-language navigation instructions. arXiv preprint arXiv:2101.10504
Zhou, W. and Xu, K. (2020). Learning to compare for better training and evaluation of open domain natural language generation models. arXiv preprint arXiv:2002.05058
Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. (2019). Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593

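Appendix A: Prompt data details

Translator's note: what the prompts actually look like matters a great deal, so this appendix is reproduced in full. Some of the prompts are also quite interesting.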

A.1 Labeler-written prompts

We first give slightly more details on our prompt bootstrapping process. As previously mentioned, for the majority of the project, we obtained prompts directly from external users of the instruct beta models in the OpenAI API. However, this strategy only works once you have a model that accepts instruction-like prompts. In order to train the very first such model, we asked contractors to write prompts themselves. We asked labelers to write three kinds of prompts:

Plain: We simply ask the labelers to come up with an arbitrary task, while ensuring diversity of tasks.
Few-shot: We ask the labelers to come up with an instruction, and multiple query/response pairs for that instruction. For example, the instruction could be “Give the sentiment for a tweet,” and the queries would be tweets and the responses either “Positive” or “Negative.” We can then format these as few-shot prompts like those in Brown et al. (2020). With K query-response pairs, we create K training examples using the other K-1 in the context.
User-based: We had a number of use-cases stated in applications to the OpenAI API. We asked labelers to come up with prompts corresponding to these use cases.
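The few-shot construction described above (K query/response pairs turned into K training examples, each using the other K-1 pairs as context) can be sketched as follows. This is a minimal illustration, not the pipeline used by the authors: the function name, the `Tweet:` / `Sentiment:` field labels, and the sample tweets are all assumptions made for the example.

```python
from typing import List, Tuple


def build_few_shot_examples(
    instruction: str, pairs: List[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Turn K query/response pairs into K (prompt, target) training examples.

    Each example holds out one pair as the query to answer and places the
    other K-1 pairs in the context as demonstrations.
    """
    examples = []
    for i, (query, response) in enumerate(pairs):
        context = [p for j, p in enumerate(pairs) if j != i]
        lines = [instruction, ""]
        for ctx_query, ctx_response in context:
            lines += [f"Tweet: {ctx_query}", f"Sentiment: {ctx_response}", ""]
        # The held-out query comes last, with its answer left blank.
        lines += [f"Tweet: {query}", "Sentiment:"]
        examples.append(("\n".join(lines), response))
    return examples


# Hypothetical data, for illustration only.
pairs = [
    ("I loved the new Batman movie!", "Positive"),
    ("The service here is terrible.", "Negative"),
    ("What a great day.", "Positive"),
]
examples = build_few_shot_examples("Give the sentiment for a tweet.", pairs)
```

With K = 3 pairs this yields three prompts, each containing two demonstrations and ending with an open `Sentiment:` line for the held-out tweet.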

In order to preserve the anonymity of the application information, we had a separate labeler create vague high-level tasks based on looking at a list of applications, modifying the task descriptions to eliminate any information that was specific to a given application. This data was used to train the first InstructGPT model via supervised learning, which was deployed in beta in the API in early 2021.

A.2 API user prompts

For API prompts, we use prompts submitted by users to the aforementioned earlier version of the InstructGPT model on the OpenAI API Playground. Throughout the paper, we only use data from the Playground, rather than customers using our model in production, as it was easier to get informed consent: every time a user switched to an InstructGPT model, an alert message would pop up stating that prompts submitted to these models could be used to train future versions of our models. We also communicated this in a message on the developer Slack channel upon launching the beta of the InstructGPT models. We filter out prompts from the training split containing personally identifiable information (PII).

To ensure a diversity of use cases, we heuristically deduplicate prompts by checking for prompts that share a long common prefix, and limit the number of prompts to roughly 200 per organization. In addition, we create train, validation, and test splits based on organization IDs, so that e.g. the validation set contains different use cases than the training set. We conceptualized API requests as belonging to one of ten use cases: generation, open QA, closed QA, brainstorming, chat, rewriting, summarization, classification, extraction, or other. Below, we show fictional but realistic prompts from a variety of use cases:

A.2.1 Example user prompts collected from the InstructGPT API (Playground)

Each entry gives the use case, followed by the example prompt (indented).

Use case: brainstorming

    List five ideas for how to regain enthusiasm for my career

Use case: brainstorming

    What are some key points I should know when studying Ancient Greece?

Use case: brainstorming

    What are 4 questions a user might have after reading the instruction manual for a trash compactor?

    {user manual}

    1.

Use case: brainstorming

    What are 10 science fiction books I should read next?

Use case: classification

    Take the following text and rate, on a scale from 1-10, how sarcastic the person is being (1 = not at all, 10 = extremely sarcastic). Also give an explanation

    {text}

    Rating:

Use case: classification

    This is a list of tweets and the sentiment categories they fall into.

    Tweet: {tweet_content1}
    Sentiment: {sentiment1}

    Tweet: {tweet_content2}
    Sentiment: {sentiment2}

Use case: classification

    {java code}

    What language is the code above written in?

Use case: classification

    You are a very serious professor, and you check papers to see if they contain missing citations. Given the text, say whether it is missing an important citation (YES/NO) and which sentence(s) require citing.

    {text of paper}

Use case: extract

    Extract all course titles from the table below:

    | Title | Lecturer | Room |
    | Calculus 101 | Smith | Hall B |
    | Art History | Paz | Hall A |

Use case: extract

    Extract all place names from the article below:

    {news article}

Use case: extract

    Given the following list of movie titles, write down any names of cities in the titles.

    {movie titles}

Use case: generation

    Write a creative ad for the following product to run on Facebook aimed at parents:

    Product: {product description}

Use case: generation

    Write a short story where a brown bear to the beach, makes friends with a seal, and then return home.

Use case: generation

    Here’s a message to me:
    ---
    {email}
    ---

    Here are some bullet points for a reply:
    ---
    {message}
    ---

    Write a detailed reply

Use case: generation

    This is an article about how to write a cover letter when applying for jobs:
    ---
    It’s important to spend some time

Use case: generation

    write rap lyrics on the topics mentioned in this news article:
    ---
    {article}
    ---

Use case: rewrite

    This is the summary of a Broadway play:
    """
    {summary}
    """

    This is the outline of the commercial for that play:
    """

Use case: rewrite

    Translate this sentence to Spanish:

Use case: rewrite

    Create turn-by-turn navigation given this text:

    Go west on {road1} unto you hit {road2}. then take it east to {road3}.
    Desination will be a red barn on the right

    1.

Use case: rewrite

    Rewrite the following text to be more light-hearted:
    ---
    {very formal text}
    ---

Use case: chat

    The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, and very friendly.

    Human: Hello, who are you?
    AI: I am an AI created by OpenAI. How can I help you today?
    Human: I’d like to cancel my subscription.
    AI:

Use case: chat

    Marv is a chatbot that reluctantly answers questions with sarcastic responses:

    You: How many pounds are in a kilogram?
    Marv: This again? There are 2.2 pounds in a kilogram. Please make a note of this.
    You: What does HTML stand for?
    Marv: Was Google too busy? Hypertext Markup Language. The T is for try to ask better questions in the future.
    You: When did the first airplane fly?
    Marv:

Use case: chat

    This is a conversation with an enlightened Buddha. Every response is full of wisdom and love.

    Me: How can I achieve greater peace and equanimity?
    Buddha:

Use case: closed qa

    Help me answer questions about the following short story:

    {story}

    What is the moral of the story?

Use case: closed qa

    Answer the following question:
    What shape is the earth?

    A) A circle
    B) A sphere
    C) An ellipse
    D) A plane

Use case: closed qa

    Tell me how hydrogen and helium are different, using the following facts:
    {list of facts}

Use case: open qa

    I am a highly intelligent question answering bot. If you ask me a question that is rooted in truth, I will give you the answer. If you ask me a question that is nonsense, trickery, or has no clear answer, I will respond with “Unknown”.

    Q: What is human life expectancy in the United States?
    A: Human life expectancy in the United States is 78 years.
    Q: Who was president of the United States in 1955?
    A:

Use case: open qa

    Who built the statue of liberty?

Use case: open qa

    How do you take the derivative of the sin function?

Use case: open qa

    who are the indiginous people of New Zealand?

Use case: summarization

    Summarize this for a second-grade student:
    {text}

Use case: summarization

    {news article}

    Tl;dr:

Use case: summarization

    {chat transcript}

    Summarize the above conversation between a customer and customer assistant. Make sure to state any complaints that the customer has.

Use case: other

    start with where

Use case: other

    Look up “cowboy” on Google and give me the results.

Use case: other

    Johnathan Silver goes to the market every day, and brings back a

Next, we list some schematic examples of API requests for each use-case category, for prompts submitted to GPT-3 models. These are generally less ‘instruction-style’, and contain more explicit prompting. Note that there are some prompts where the user intent is unclear.

A.2.2 Example user prompts collected from the GPT-3 API

Each entry gives the use case, followed by the example prompt (indented).

Use case: brainstorming

    indie movie ideas:
    - A guy travels to South America to become a shaman.
    - A documentary about the world of juggling.

Use case: brainstorming

    Baby name ideas for a boy:
    1. Alfred
    2. Theo
    3.

Use case: brainstorming

    Tell me a list of topics related to:
    - interior design
    - sustainable ecosystems
    - fake plants

Use case: brainstorming

    Name some rare gems

Use case: classification

    This is a tweet sentiment classifier.

    {tweet}
    Sentiment: negative
    ===
    {tweet}
    Sentiment: neutral
    ===
    {tweet}
    Sentiment:

Use case: classification

    The following is a list of products and the kind of product they are.
    Product: {product}. Type: {type}
    Product: {product}. Type: {type}
    Product: {product}. Type:

Use case: classification

    The following is a list of companies and the categories they fall into:
    Apple, Facebook, Fedex
    Apple
    Category: Technology
    Facebook
    Category: Social Media
    Fedex
    Category:

Use case: extract

    Text: {text}
    Keywords:

Use case: generation

    “Hey, what are you doing there?” Casey was startled. He hadn’t even begun to

Use case: generation

    The name of the next Star Wars movie is

Use case: generation

    This is the research for an essay:
    ===
    {description of research}
    ===
    Write a high school essay on these topics:
    ===

Use case: generation

    Write an outline for an essay about John von Neumann and his contributions to computing:
    I. Introduction, his life and background
    A: His early life
    B:

Use case: rewrite

    Covert my resume into a profile overview.
    {resume}
    Profile overview:

Use case: rewrite

    Rephrase this for me: “I can’t seem to find out how to work this darn thing.”
    Alternate phrasing: “

Use case: rewrite

    Original: She no go to sleep.
    Standard American English: She didn’t go to sleep

    Original: It real bad for I to make do of this.
    Standard American English:

Use case: chat

    The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, and very friendly.

    Human: Hello, who are you?
    AI: I am an AI created by OpenAI. How can I help you today?
    Human: I’m feeling kind of down today.
    AI:

Use case: chat

    This is a conversation with Steven. Steven likes to watch Netflix and hasn’t left his home in 2 weeks.

    John: Hey man what’s up?
    Steven: Exactly the same thing as yesterday. you know.
    John: So we’re going to go see a movie on Thursday, want to come?
    Steven: Ummmm don’t think so….

Use case: closed qa

    When you drop a heavy stone from a tree, what happens?
    A. The stone falls to the ground.
    B: The stone stays in the tree.
    C: The stone floats.
    D: Nothing happens.
    Answer:

Use case: closed qa

    Text:
    {article describing what yoga mats to buy}
    Question: What are the things I should consider when buying a yoga mat?
    Answer:

Use case: open qa

    Q: Who is Batman?
    A: Batman is a fictional comic book character.
    Q: What is torsalplexity?
    A: ?
    Q: What is Devz9?
    A: ?
    Q: Who is George Lucas?
    A: George Lucas is American film director and producer famous for creating Star Wars.
    Q: What is the capital of California?
    A:

Use case: open qa

    Who was the best human who ever lived?

Use case: open qa

    Q: Who is Leonardo da Vinci?
    A:

Use case: summarization

    My second grader asked me what this passage means.

    """
    {text}
    """

    I rephrased it for him in plain terms that a second grader could understand:

    """

Use case: summarization

    """
    {text}
    """

    I summarized the above as:

Use case: other

    She said, and I quote
    AI:

Use case: other

    - I like to play Call of Duty
    - I like to play Call of Duty
    - I like to play Call of Duty
    - I like to play Call of Duty

A.3 Dataset sizes: SFT 15k / RM 50k / PPO 47k

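Table 6 gives the sizes of the datasets used to train and validate the SFT, RM, and RL models, and shows how many prompts were written by labelers and how many came from OpenAI API users.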

Table 6: Dataset sizes, in terms of number of prompts.

For SFT, note that we have many more labeler-written prompts than customer prompts. This is because, at the start of the project, we had labelers write instructions with a user interface that asked them to give an overarching template instruction as well as few-shot examples for that instruction.

We synthetically constructed multiple SFT datapoints from the same instruction by sampling different sets of few-shot examples.

For the RM, recall that for every prompt, we collected rankings for K outputs (ranging from 4 to 9) and trained the model on all $\binom{K}{2}$ pairs, so the number of ranked pairs we trained the model on is an order of magnitude larger than the number of prompts.
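As a quick check of the "order of magnitude" claim, the number of unordered pairs among K ranked outputs works out as follows, assuming every output of a prompt is compared with every other one and no ties are discarded:

```latex
\binom{K}{2} = \frac{K(K-1)}{2},
\qquad \binom{4}{2} = 6,
\qquad \binom{9}{2} = 36
```

Each prompt therefore contributes between 6 and 36 ranked pairs.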

A.4 Data diversity

The data that we collect spans a wide range of categories and use cases. Table 1 shows the diversity of categories in our RM training and validation datasets, as labeled by our contractors. The distribution of categories for the PPO datasets was similar. We additionally show a subset of our labeled prompt metadata in Table 7.

Table 7: Dataset annotations

| Annotation | RM test | RM train | RM valid | SFT train | SFT valid |
|---|---|---|---|---|---|
| Ambiguous | – | 7.9% | 8.0% | 5.1% | 6.4% |
| Sensitive content | – | 6.9% | 5.3% | 0.9% | 1.0% |
| Identity dependent | – | – | – | 0.9% | 0.3% |
| Closed domain | 11.8% | 19.4% | 22.9% | 27.4% | 40.6% |
| Continuation style | – | 15.5% | 16.2% | 17.9% | 21.6% |
| Requests opinionated content | 11.2% | 7.7% | 7.5% | 8.6% | 3.4% |
| Requests advice | 3.9% | – | – | – | – |
| Requests moral judgment | 0.8% | 1.1% | 0.3% | 0.3% | 0.0% |
| Contains explicit safety constraints | – | 0.4% | 0.4% | 0.3% | 0.0% |
| Contains other explicit constraints | – | 26.3% | 28.9% | 25.6% | 20.7% |
| Intent unclear | 7.9% | – | – | – | – |

Note that our annotation fields changed over the course of the project, so not every prompt was annotated for every field.

We used a lightweight classifier (langid.py) to classify the language of all instructions in our dataset. Empirically, around 96% of our dataset (110k datapoints) is classified as English, although we estimate that the actual fraction may be 99% or higher, due to classifier inaccuracies. Besides English, a small minority of prompts were found in at least 20 other languages: Spanish, French, German, Portuguese, Italian, Dutch, Romanian, Catalan, Chinese, Japanese, Swedish, Polish, Danish, Turkish, Indonesian, Czech, Norwegian, Korean, Finnish, Hungarian, Hebrew, Russian, Lithuanian, Esperanto, Slovak, Croatian, Swahili, Estonian, Slovenian, Arabic, Thai, Vietnamese, Malayalam, Greek, Albanian, and Tibetan.

Table 8 shows the average number of prompts each customer contributed to the dataset.

Table 8: Average prompts per customer

| Model | Split | Prompts per customer |
|---|---|---|
| SFT | train | 1.65 |
| SFT | valid | 1.87 |
| RM | train | 5.35 |
| RM | valid | 27.96 |
| PPO | train | 6.01 |
| PPO | valid | 31.55 |
| – | test | 1.81 |

Table 9: Prompt lengths by dataset

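| Model | Split | Count | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|---|---|
| SFT | train | 12725 | 408 | 433 | 1 | 37 | 283 | 632 | 2048 |
| SFT | valid | 1653 | 401 | 433 | 4 | 41 | 234 | 631 | 2048 |
| RM | train | 33207 | 199 | 334 | 1 | 20 | 64 | 203 | 2032 |
| RM | valid | 17887 | 209 | 327 | 1 | 26 | 77 | 229 | 2039 |
| PPO | train | 31144 | 166 | 278 | 2 | 19 | 62 | 179 | 2044 |
| PPO | valid | 16185 | 186 | 292 | 1 | 24 | 71 | 213 | 2039 |
| – | test set | 3196 | 115 | 194 | 1 | 17 | 49 | 127 | 1836 |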

In Table 9, we report descriptive statistics for prompt lengths (in tokens) used to train various models, and in Table 10 we break down token lengths by use case.
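The columns of these length tables (count, mean, std, min, quartiles, max) are standard descriptive statistics. A minimal sketch of how one such row could be computed from a list of per-prompt token counts is shown below. The function name and the sample lengths are hypothetical, and the choice of sample standard deviation and of the "inclusive" quantile method are assumptions; the source does not say which conventions were used.

```python
import statistics
from typing import Dict, List


def describe_lengths(lengths: List[int]) -> Dict[str, float]:
    """Summarize token lengths with the same columns as the length tables."""
    q25, q50, q75 = statistics.quantiles(lengths, n=4, method="inclusive")
    return {
        "count": len(lengths),
        "mean": statistics.mean(lengths),
        "std": statistics.stdev(lengths),  # sample standard deviation
        "min": min(lengths),
        "25%": q25,
        "50%": q50,
        "75%": q75,
        "max": max(lengths),
    }


# Hypothetical token counts, not data from the paper.
summary = describe_lengths([1, 17, 49, 127, 300, 1836])
```

A long right tail, as in the hypothetical sample, pulls the mean well above the median, which is the same pattern visible in the reported tables.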

Table 10: Prompt lengths by category

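| Category | Count | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|---|
| Brainstorming | 5245 | 83 | 149 | 4 | 17 | 36 | 85 | 1795 |
| Chat | 3911 | 386 | 376 | 1 | 119 | 240 | 516 | 1985 |
| Classification | 1615 | 223 | 318 | 6 | 68 | 124 | 205 | 2039 |
| Extract | 971 | 304 | 373 | 3 | 74 | 149 | 390 | 1937 |
| Generation | 21684 | 130 | 223 | 1 | 20 | 52 | 130 | 1999 |
| QA, closed | 1398 | 325 | 426 | 5 | 68 | 166 | 346 | 2032 |
| QA, open | 6262 | 89 | 193 | 1 | 10 | 18 | 77 | 1935 |
| Rewrite | 3168 | 183 | 237 | 4 | 52 | 99 | 213 | 1887 |
| Summarization | 1962 | 424 | 395 | 6 | 136 | 284 | 607 | 1954 |
| Other | 1767 | 180 | 286 | 1 | 20 | 72 | 188 | 1937 |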

Table 11: Prompt and demonstration lengths

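| Prompt source | Measurement | Count | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|---|---|
| Contractor | prompt length | 12845 | 437 | 441 | 5 | 42 | 324 | 673 | 2048 |
| Contractor | demo length | 12845 | 38 | 76 | 1 | 9 | 18 | 41 | 2048 |
| Customer | prompt length | 1533 | 153 | 232 | 1 | 19 | 67 | 186 | 1937 |
| Customer | demo length | 1533 | 88 | 179 | 0 | 15 | 39 | 88 | 2048 |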

Finally, we also report lengths of contractor-written demonstrations used for our SFT model in Table 11, both for contractor-written and customer-written prompts.

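Appendix B: Additional human data collection details

Omitted here; see the original paper.

Appendix C: Model details

All models use the GPT-3 architecture (Brown et al., 2020).
For the reward models and value functions, the unembedding layer of the original model is replaced with a projection layer that outputs a scalar value.
All models use fp16 weights and activations, with fp32 master copies of weights.
All models use the same byte pair encodings as in Brown et al. (2020).
All language models and RL policies use a context length of 2k tokens.
- Input prompts: prompts longer than 1k tokens are filtered out;
- Output responses: the maximum response length is limited to 1k tokens.
All models are trained with the Adam optimizer, with β1 = 0.9 and β2 = 0.95.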

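C.1 SFT training details

The SFT models are trained as follows:

16 epochs.
Residual dropout of 0.2.
Cosine LR schedule, decaying to 10% of the initial learning rate, with no learning rate warmup.
1.3B and 6B models: LR 9.65e-6, batch size 32. The LR was selected by a geometric search over 7 LRs.
175B model: LR 5.03e-6, batch size 8. The LR was selected by a geometric search over 5 LRs.
The number of epochs was also tuned with a geometric search.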
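The cosine schedule in the list above (decay to 10% of the initial learning rate, no warmup) can be sketched as a plain function of the step index. This is an illustrative formula under assumptions, not the training code: the function name and `total_steps` are hypothetical, since the source gives no step count.

```python
import math


def cosine_lr(step: int, total_steps: int, initial_lr: float,
              final_fraction: float = 0.1) -> float:
    """Cosine decay from initial_lr to final_fraction * initial_lr, no warmup."""
    if total_steps <= 1:
        return initial_lr
    progress = min(step, total_steps - 1) / (total_steps - 1)  # 0.0 -> 1.0
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1.0 -> 0.0
    floor = final_fraction * initial_lr
    return floor + (initial_lr - floor) * cosine


initial_lr = 9.65e-6  # LR reported for the 1.3B and 6B SFT models
total_steps = 1000    # hypothetical; the source does not give a step count
schedule = [cosine_lr(s, total_steps, initial_lr) for s in range(total_steps)]
```

The first step uses the full initial rate, and the last step lands on the 10% floor (9.65e-7 for the rate above).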

The final models were selected based on RM score, which we found to be more predictive of human preference results than validation loss.

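C.2 RM training details

The same 6B RM is used for PPO models of all sizes. A 175B RM could potentially achieve a lower validation loss, but

its training is less stable, which makes it a poor choice for initializing the PPO value function, and
using a 175B RM and value function greatly increases the compute requirements of PPO.

Preliminary experiments showed that 6B RMs are stable across a wide range of learning rates and lead to equally strong PPO models.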

The final reward model was initialized from a 6B GPT-3 model that was fine-tuned on a variety of public NLP datasets (ARC, BoolQ, CoQA, DROP, MultiNLI, OpenBookQA, QuAC, RACE, and Winogrande). This was mostly for historical reasons; we find similar results when initializing the RM from the GPT-3 or SFT models. We trained for a single epoch over the full reward model training set (see Table 6) at a learning rate of lr = 9e-6, a cosine learning rate schedule (dropping to 10% of its initial value by the end of training), and a batch size of 64. Training did not appear to be very sensitive to the learning rate or schedule; changes of up to 50% in the learning rate resulted in similar performance. Training was quite sensitive to the number of epochs: multiple epochs quickly overfit the model to the training data with obvious deterioration in the validation loss. The batch size here represents the distinct number of prompts per batch. Each prompt had between K = 4 and K = 9 labeled completions, from which there were up to $\binom{K}{2}$ possible comparisons. Ties were dropped. Therefore, a single batch could contain up to $64 \times \binom{K}{2} \le 2{,}304$ comparisons.
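The batch arithmetic above can be made concrete with a short sketch that expands one prompt's labeled completions into pairwise comparisons and drops ties. The data layout (a dict from completion ID to rank, 1 = best), the function name, and the sample labels are assumptions for illustration, not the format used by the authors.

```python
from itertools import combinations
from math import comb
from typing import Dict, List, Tuple


def comparisons_from_ranking(ranks: Dict[str, int]) -> List[Tuple[str, str]]:
    """Return (preferred, rejected) pairs from one prompt's labeled completions.

    `ranks` maps a completion ID to its rank (1 = best). Completions sharing
    a rank are tied, and tied pairs are dropped.
    """
    pairs = []
    for a, b in combinations(ranks, 2):
        if ranks[a] == ranks[b]:
            continue  # tie: carries no preference signal
        pairs.append((a, b) if ranks[a] < ranks[b] else (b, a))
    return pairs


# Hypothetical labels for K = 4 completions; "b" and "c" are tied.
ranks = {"a": 1, "b": 2, "c": 2, "d": 3}
pairs = comparisons_from_ranking(ranks)

# Upper bound on comparisons in one batch of 64 prompts with K = 9 and no ties.
max_per_batch = 64 * comb(9, 2)
```

With one tie among four completions, the prompt yields 5 of the 6 possible pairs. The upper bound for a batch is 64 × 36 = 2,304, which matches the figure quoted above.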

C.3 Details of the initialization models for RLHF

We initialize the RLHF models from a pretrained GPT-3 model and apply supervised fine-tuning for 2 epochs on the demonstration dataset. We also mix in 10% pretraining data during fine-tuning, since we find it helpful for PPO training (see Appendix E.11 for details). Cosine learning rate schedule is used and the learning rate eventually decays to 10% of the peak learning rate. We use a batch size of 32 for 1.3B and 6B models and 8 for the 175B model. We compare a few different peak learning rates for each model and pick the one with low losses on both the demonstration and the pretraining validation datasets. A log linear sweep of 5 values of the LR’s are compared for 1.3B and 6B models and 3 values are compared for the 175B model. The resultant LR’s for the 1.3B, 6B, and 175B models are 5e-6, 1.04e-5 and 2.45e-6, respectively.

C.4 RLHF training details

We then initialize the RL policies from the above supervised fine-tuned models with pretraining mix. These models are also used to compute the KL reward, in the same way as Stiennon et al. (2020), with β = 0.02 (see Equation 2). We train all the RL models for 256k episodes. These episodes include about 31k unique prompts, after filtering out prompts with PII and deduplication based on common prefixes. The batch size for each iteration is 512, with a minibatch size of 64. In other words, each batch is randomly split into 8 minibatches and is trained on for only a single inner epoch (Schulman et al., 2017). A constant learning rate is applied with a warmup over the first 10 iterations, starting with one tenth of the peak learning rate. Exponential moving averages of the weights are applied, with a decay rate of 0.992. No discount is applied when estimating the generalized advantage (Schulman et al., 2016). The PPO clip ratio is set to 0.2, and the sampling temperature is 1 for rollouts. As previously mentioned, for all PPO models we use a 6B RM and a 6B value function, and the latter is initialized from the former. By using the same 6B reward model and value function on policies of all model sizes, it’s easier to compare the effect of policy model size on policy performance. A fixed learning rate of 9e-6 for the value function is used for 1.3B and the 6B policies and 5e-6 for the 175B policy.
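Three of the mechanical pieces in this paragraph (the 512-to-8×64 minibatch split, the warmup that starts at one tenth of the peak learning rate, and the weight EMA with decay 0.992) can be sketched as below. This is an illustration under stated assumptions rather than the actual training loop: all function names are hypothetical, and the linear shape of the warmup is an assumption, since the text only gives its start value and length.

```python
import random

def split_into_minibatches(batch, minibatch_size, rng):
    """Randomly split one PPO batch into equally sized minibatches."""
    shuffled = list(batch)
    rng.shuffle(shuffled)
    return [shuffled[i:i + minibatch_size]
            for i in range(0, len(shuffled), minibatch_size)]

def warmup_lr(iteration, peak_lr, warmup_iters=10, start_ratio=0.1):
    """Constant LR after a warmup that starts at start_ratio * peak_lr.
    The linear ramp is an assumption; the source does not state the shape."""
    if iteration >= warmup_iters:
        return peak_lr
    frac = iteration / warmup_iters
    return peak_lr * (start_ratio + (1.0 - start_ratio) * frac)

def ema_update(ema_weights, weights, decay=0.992):
    """One exponential-moving-average step over a flat list of weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

minibatches = split_into_minibatches(range(512), 64, random.Random(0))
print(len(minibatches), len(minibatches[0]))
```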

Our initial RLHF experiments showed regressions on public NLP datasets, such as SQuADv2 and DROP, and we mitigate the regressions by mixing in pretraining gradients during PPO training. We use 8 times more pretraining examples than the number of the RL training episodes. The pretraining data is randomly drawn from the dataset used to train the GPT-3 models. For each minibatch, we compute the PPO gradients and pretraining gradients in consecutive steps and accumulate them both into the gradient buffers. We multiply the pretraining gradients by a coefficient, γ = 27.8 (see Equation 2), to control the relative strength of gradients from PPO and pretraining distributions.
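The gradient mixing described here amounts to accumulating two gradients into the same buffer, with the pretraining one scaled by γ. A toy sketch over flat lists of floats (the function name and the list representation are assumptions; a real implementation would operate on framework tensors):

```python
def accumulate_mixed_gradients(ppo_grads, pretrain_grads, gamma=27.8):
    """Accumulate PPO and pretraining gradients into one buffer.

    The pretraining gradients are scaled by gamma before being added,
    mirroring the coefficient described in the text.
    """
    buffer = [0.0] * len(ppo_grads)
    for i, g in enumerate(ppo_grads):        # first step: PPO gradients
        buffer[i] += g
    for i, g in enumerate(pretrain_grads):   # second step: scaled pretraining gradients
        buffer[i] += gamma * g
    return buffer

print(accumulate_mixed_gradients([0.1, -0.2], [0.01, 0.02]))
```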

C.5 FLAN and T0 models

We obtain our FLAN and T0 baselines by fine-tuning a 175B GPT-3 model on the FLAN and T0 datasets. For T0, note that we trained on the T0++ version of the dataset. Because T0 contains much more data (96M datapoints) than FLAN (1.2M datapoints), we subsampled T0 to 1 million datapoints to make the amount of training data comparable for each model. Note that the original models train on epochs where datapoints can be repeated, but in our epochs we go through every datapoint without repeats (to better match the way we trained our SFT baselines). We applied a cosine learning rate schedule, and try initial learning rates of 4e-6 and 6e-6 for each dataset. The learning rate decays to 10% of its peak at the end of training, and we use a batch size of 64 for both experiments.

To choose the best FLAN checkpoint, we use our 6B reward model to score the completions on the validation set of prompts. As shown in Figure 13, the reward saturates after the initial 400k examples of training. This indicates that training for even longer will unlikely improve the human eval performance. We picked the checkpoint with the highest RM score for our human evaluation, which is the one trained with learning rate of 4e-6 and for 896k examples.

We perform two similar experiments to find the best T0 checkpoint. In one experiment, we used a batch size of 128, a learning rate of 4e-6 and 1.28 million examples. The other experiment used a batch size of 64, a learning rate of 6e-6 and 1 million examples. Once again using the reward model score, we picked the checkpoint from the former experiment after 896k examples of training.

Appendix D: Automatic evaluation details

Omitted here; see the original paper.

Appendix E: Additional results

Omitted here; see the original paper.

Appendix F: Model samples

In this section, we provide some additional samples from both the 175B GPT-3 and 175B InstructGPT (PPO-ptx) models. We sample at T = 1 for InstructGPT, and use T = 0.7 for GPT-3, since GPT-3 performs poorly at high temperatures (this slightly disadvantages InstructGPT).

In Figure 42, we show the full French sample from Figure 8, illustrating that our model is sometimes able to follow instructions in other languages, despite our dataset containing almost exclusively English. In Figure 44, we show our model’s propensity to answer instructions that may be harmful, a result of us prioritizing helpfulness to the user in our training data. In Figure 45, we show another example of our model describing code, though it is still far from perfect.

In Figures 46–50, we show labeler-written prompts from our dataset, along with model samples and the human-written demonstration. These 5 prompts were selected from 15 to show a range of different tasks.

(Figures omitted.)

[Translation][Paper] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Google, 2019)


1 year 1 month ago

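Translator's preface

This post is a translation of Google's 2019 paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

@article{devlin2018bert, title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding}, author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina}, journal={arXiv preprint arXiv:1810.04805}, year={2018} }

Like GPT, BERT is based on the transformer architecture. Chronologically it sits between GPT-1 and GPT-2. It is one of the representative modern transformers and is still used in many scenarios today.

Evolutionary tree of large models, showing BERT's era and position. From "A Survey and Practical Guide to Large Language Models (LLM)" (Amazon, 2023).

According to "How Transformers Work: self-attention and two kinds of Transformers in 600 lines of Python" (2019), BERT was one of the first transformer models to reach human-level performance on a variety of natural language tasks. Pre-training and fine-tuning code: github.com/google-research/bert.

BERT models have only about 0.1B to 0.3B parameters, so they run reasonably smoothly even on a CPU.

The translator's ability is limited, so omissions and errors are inevitable. If in doubt, please consult the original paper.

The translation follows.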

Translator's preface
Abstract
1 Introduction
2 Related work
3 BERT
4 Experiments
5 Ablation studies
6 Conclusion
Appendix
References

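This paper introduces BERT (Bidirectional Encoder Representations from Transformers), a new language representation model.

Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT pre-trains deep bidirectional representations from unlabeled text by conditioning on both left and right context in all layers.
The pre-trained BERT model can then be fine-tuned with just one additional output layer, without any task-specific architecture changes, to create strong models for a wide range of downstream tasks such as question answering and language inference.

BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on 11 natural language processing tasks, including: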

GLUE score to 80.5% (7.7% point absolute improvement)
MultiNLI accuracy to 86.7% (4.6% absolute improvement)
SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement)
SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement)

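1 Introduction

Language model pre-training has been shown to significantly improve many natural language processing (NLP) tasks (Dai and Le, 2015; Peters et al., 2018a; Radford et al., 2018; Howard and Ruder, 2018). These tasks include:

sentence-level tasks, such as natural language inference (Bowman et al., 2015; Williams et al., 2018) and paraphrasing (Dolan and Brockett, 2005), which predict the relationships between sentences by analyzing them holistically;
token-level tasks, such as named entity recognition and question answering, where models must produce fine-grained output at the token level (Tjong Kim Sang and De Meulder, 2003; Rajpurkar et al., 2016).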

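1.1 Two ways to adapt a pre-trained model to downstream tasks

There are currently two strategies for applying pre-trained language representations to downstream tasks:

Feature-based approach: for example ELMo (Peters et al., 2018a), which uses task-specific architectures that include the pre-trained representations as additional features.
Fine-tuning: for example the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), which introduces minimal task-specific parameters and is trained on downstream tasks by fine-tuning all pre-trained parameters.

Both approaches use unidirectional language models to learn general language representations.

1.2 Problems with unidirectional architectures such as OpenAI GPT

We argue that both approaches, and fine-tuning in particular, restrict the power of pre-trained language representations. The main limitation is that standard language models are unidirectional, which limits the choice of architectures that can be used during pre-training.

For example, OpenAI GPT uses a left-to-right architecture (Left-to-Right Model, LRM), so every token in the Transformer self-attention layers can only attend to previous tokens (it can only use the preceding context):

for sentence-level tasks, this leads to sub-optimal results;
for token-level tasks such as question answering, applying the fine-tuning approach can be very harmful, because these tasks depend heavily on context from both directions.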

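1.3 What is new in BERT

This paper proposes BERT to improve the fine-tuning based approaches.

Inspired by the Cloze task (Taylor, 1953), BERT is pre-trained with a "masked language model" (MLM) objective, which avoids the unidirectionality constraint described above:

The MLM randomly masks some of the input tokens and predicts the masked words (represented by their vocabulary IDs) based only on their context.
Unlike left-to-right language model pre-training, the MLM objective can use both the left and the right context, which makes it possible to pre-train a deep bidirectional Transformer.

In addition to the masked language model, we also use a "next sentence prediction" (NSP) task that jointly pre-trains text-pair representations.

1.4 Contributions

We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pre-trained deep bidirectional representations. This is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.
We show that pre-trained representations reduce the need for many heavily-engineered task-specific architectures. BERT is the first fine-tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures.
BERT advances the state of the art for 11 NLP tasks.

The code and pre-trained models are available at github.com/google-research/bert.

2 Related work

(This section is not the focus of this post, so the original text is quoted as-is.)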

There is a long history of pre-training general language representations, and we briefly review the most widely-used approaches in this section.

2.1 Unsupervised feature-based approaches

Learning widely applicable representations of words has been an active area of research for decades, including non-neural (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) and neural (Mikolov et al., 2013; Pennington et al., 2014) methods. Pre-trained word embeddings are an integral part of modern NLP systems, offering significant improvements over embeddings learned from scratch (Turian et al., 2010). To pretrain word embedding vectors, left-to-right language modeling objectives have been used (Mnih and Hinton, 2009), as well as objectives to discriminate correct from incorrect words in left and right context (Mikolov et al., 2013).

These approaches have been generalized to coarser granularities, such as

sentence embeddings (Kiros et al., 2015; Logeswaran and Lee, 2018)
paragraph embeddings (Le and Mikolov, 2014).

To train sentence representations, prior work has used objectives to rank candidate next sentences (Jernite et al., 2017; Logeswaran and Lee, 2018), left-to-right generation of next sentence words given a representation of the previous sentence (Kiros et al., 2015), or denoising autoencoder derived objectives (Hill et al., 2016).

ELMo and its predecessor (Peters et al., 2017, 2018a) generalize traditional word embedding research along a different dimension. They extract context-sensitive features from a left-to-right and a right-to-left language model. The contextual representation of each token is the concatenation of the left-to-right and right-to-left representations. When integrating contextual word embeddings with existing task-specific architectures, ELMo advances the state of the art for several major NLP benchmarks (Peters et al., 2018a) including

question answering (Rajpurkar et al., 2016)
sentiment analysis (Socher et al., 2013)
named entity recognition (Tjong Kim Sang and De Meulder, 2003)

Melamud et al. (2016) proposed learning contextual representations through a task to predict a single word from both left and right context using LSTMs. Similar to ELMo, their model is feature-based and not deeply bidirectional. Fedus et al. (2018) shows that the cloze task can be used to improve the robustness of text generation models.

2.2 Unsupervised fine-tuning approaches

As with the feature-based approaches, the first works in this direction only pre-trained word embedding parameters from unlabeled text (Collobert and Weston, 2008).

More recently, sentence or document encoders which produce contextual token representations have been pre-trained from unlabeled text and fine-tuned for a supervised downstream task (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). The advantage of these approaches is that few parameters need to be learned from scratch.

At least partly due to this advantage, OpenAI GPT (Radford et al., 2018) achieved previously state-of-the-art results on many sentence-level tasks from the GLUE benchmark (Wang et al., 2018a). Left-to-right language modeling and auto-encoder objectives have been used for pre-training such models (Howard and Ruder, 2018; Radford et al., 2018; Dai and Le, 2015).

2.3 Transfer learning from supervised data

There has also been work showing effective transfer from supervised tasks with large datasets, such as natural language inference (Conneau et al., 2017) and machine translation (McCann et al., 2017).

Computer vision research has also demonstrated the importance of transfer learning from large pre-trained models, where an effective recipe is to fine-tune models pre-trained with ImageNet (Deng et al., 2009; Yosinski et al., 2014).

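3 BERT

This section describes the BERT architecture and its implementation. Training a BERT model for a specific downstream task takes two steps:

Pre-training: the model is trained on unlabeled data over different pre-training tasks.
Fine-tuning: the model is first initialized with the pre-trained parameters, and then all parameters are fine-tuned using data from the downstream task. Each downstream task ends up with its own separate fine-tuned model.

3.0 BERT architecture

Figure 1 shows pre-training plus fine-tuning for a question-answering scenario; we use it as a running example:

Figure 1: The BERT pre-training and fine-tuning procedures. Apart from the output layers, the same architecture is used in both pre-training and fine-tuning.
Left: pre-training on unlabeled sentences yields a base (pre-trained) model.
Right: the same base model is the starting point for fine-tuning on different downstream tasks, which updates all of the model's parameters.
[CLS] is a special token added in front of every input; [SEP] is a special separator token (e.g. separating questions/answers).

A distinctive feature of BERT is its unified architecture across different tasks. There is minimal difference between the pre-trained architecture and the final downstream architecture.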

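3.0.1 Model architecture and parameters

BERT's model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library. Because Transformers are by now well known and our implementation is almost identical to the original, we omit a detailed background description of the architecture and refer readers to Vaswani et al. (2017) and to excellent guides such as The Annotated Transformer.

Notation used in this paper:

L: number of layers (i.e., Transformer blocks)
H: hidden size (embedding size)
A: number of self-attention heads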

In all cases we set the feed-forward/filter size to be 4H, i.e., 3072 for the H = 768 and 4096 for the H = 1024.

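We primarily report results on two model sizes:

BERTBASE (L=12, H=768, A=12, total parameters=110M), chosen to have the same model size as OpenAI GPT for comparison purposes;
BERTLARGE (L=24, H=1024, A=16, total parameters=340M).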
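The quoted parameter totals can be roughly reproduced from L and H alone (A does not change the count, because the heads split H rather than add to it). The sketch below is a back-of-the-envelope estimate: the 30,000-token vocabulary, 512 positions, and the 4H feed-forward size come from the text, while the inclusion of biases, LayerNorm parameters, and a pooler layer are assumptions based on common implementations, not details stated here.

```python
def bert_param_count(L: int, H: int, vocab_size: int = 30_000,
                     max_positions: int = 512, segments: int = 2) -> int:
    """Approximate parameter count of a BERT-style encoder."""
    ff = 4 * H                                        # feed-forward size, as in the text
    # token + position + segment embeddings, plus one LayerNorm (assumed)
    embeddings = (vocab_size + max_positions + segments) * H + 2 * H
    attention = 4 * (H * H + H)                       # Q, K, V, output projections with biases
    feed_forward = (H * ff + ff) + (ff * H + H)       # two dense layers with biases
    layer_norms = 2 * 2 * H                           # two LayerNorms per block (assumed)
    per_layer = attention + feed_forward + layer_norms
    pooler = H * H + H                                # dense layer on [CLS] (assumed)
    return embeddings + L * per_layer + pooler

print(f"BERT-base : {bert_param_count(12, 768) / 1e6:.1f}M")
print(f"BERT-large: {bert_param_count(24, 1024) / 1e6:.1f}M")
```

Under these assumptions the estimates land close to the 110M and 340M figures; the remaining gap is plausibly due to the exact vocabulary size and rounding.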

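If these parameters are unfamiliar, see the post "How Transformers Work: two Transformers (text classification + text generation) in 600 lines of Python" (2019). (Translator's note.)

The two sizes of BERT; the encoder in the figure is the transformer. (Translator's note.) Image Source

The BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention in which every token can only attend to context to its left.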

We note that in the literature the bidirectional Transformer is often referred to as a “Transformer encoder” while the left-context-only version is referred to as a “Transformer decoder” since it can be used for text generation.
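The difference between the two attention patterns comes down to the attention mask. A minimal sketch, where entry `[i][j] == 1` means token `i` may attend to token `j` (the function names are introduced here for illustration):

```python
def bidirectional_mask(n: int) -> list[list[int]]:
    """Encoder-style mask: every token attends to every token."""
    return [[1] * n for _ in range(n)]

def causal_mask(n: int) -> list[list[int]]:
    """Decoder-style (left-to-right) mask: token i attends only to j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for row in causal_mask(4):
    print(row)
```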

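3.0.2 Input/output representations

To make BERT handle a variety of downstream tasks, our input representation must unambiguously represent both of the following in one token sequence:

a single sentence;
a pair of sentences, e.g. question/answer.

Here,

a "sentence" can be an arbitrary span of contiguous text, rather than an actual linguistic sentence;
a "sequence" refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.

We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary.

To see what such a vocabulary looks like, take a look at bert-base-chinese (the official base model trained specifically for Chinese): bert-base-chinese/blob/main/vocab.txt. (Translator's note.)

Our input/output design is as follows:

The first token of every sequence is always a special classification token, [CLS].

In the final output (the top row), the hidden state of this token is mainly used for classification tasks: feeding it into a classifier yields the classification result (all other tokens are discarded), as shown in the figure below.

BERT used for classification; the classifier performs feed-forward + softmax. (Translator's note.) Image Source
Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways:
1. a special token [SEP] separates them;
2. a learned embedding is added to every token to indicate whether it belongs to sentence A or sentence B.

Going back to Figure 1, we denote

the input embedding as $E$.

For a given token, its input representation is constructed by summing 3 embeddings, as shown in Figure 2:

Figure 2: BERT input representation.
1. token embedding: the embedding of the token produced by running the input text through the tokenizer;
2. segment embedding: indicates whether the token at this position belongs to sentence A or sentence B;
3. position embedding: the position of the token in the sequence, 0, 1, 2, 3, ..., 511, because BERT supports inputs of at most 512 tokens (unless you pre-train from scratch, in which case this parameter can be changed).
The final hidden vector (last-layer representation) of the $i$-th input token is $T_i$, $T_i \in \mathbb{R}^H$.
The final hidden vector (last-layer representation) of the [CLS] token is $C$, $C \in \mathbb{R}^{H}$.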
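The packing rules above ([CLS] first, [SEP] after each sentence, segment A/B ids, positions counted from 0) and the three-way embedding sum can be sketched as follows. The helper names and the pre-tokenized example words are hypothetical; a real pipeline would use a WordPiece tokenizer and learned embedding tables.

```python
def build_inputs(tokens_a, tokens_b=None):
    """Pack one or two token lists into a single BERT-style sequence.

    Returns (tokens, segment_ids, position_ids). Segment 0 marks sentence A
    (including [CLS] and the first [SEP]); segment 1 marks sentence B.
    """
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

def input_embedding(token_vec, segment_vec, position_vec):
    """Input representation of one token: element-wise sum of three embeddings."""
    return [t + s + p for t, s, p in zip(token_vec, segment_vec, position_vec)]

tokens, segments, positions = build_inputs(["my", "dog"], ["he", "barks"])
print(tokens)
print(segments)
print(positions)
```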

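3.1 Pre-training BERT

This is the left part of Figure 1.

Figure 1: The BERT pre-training and fine-tuning procedures.

Unlike Peters et al. (2018a) and Radford et al. (2018), we do not use traditional left-to-right or right-to-left language models to pre-train BERT. Instead, we pre-train BERT using the two unsupervised tasks below.

3.1.1 Task #1: Masked LM

Intuitively, a deep bidirectional model is more powerful than either of the following:

a unidirectional left-to-right model (LRM);
a shallow concatenation of a left-to-right model (LRM) and a right-to-left model (RLM).

Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly "see itself", and the model could trivially predict the target word in a multi-layered context.

To train a deep bidirectional representation, we simply mask some percentage of the input tokens at random and then predict those masked tokens. We refer to this procedure as a "masked LM" (MLM); in the literature it is often called a Cloze task (Taylor, 1953).

In all of our experiments, we mask 15% of the tokens in each sequence at random. In contrast to denoising auto-encoders (Vincent et al., 2008), we only predict the masked words rather than reconstructing the entire input.

Although this gives us a bidirectional pre-trained model, it creates a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. To mitigate this, we do not always replace "masked" words with the actual [MASK] token: the training data generator chooses 15% of the token positions at random for prediction. If the $i$-th token is chosen, we replace it as follows:

with the [MASK] token 80% of the time,
with a random token 10% of the time,
left unchanged 10% of the time.

Then, $T_i$ is used to predict the original token with cross entropy loss. Variations of this procedure are compared in Appendix C.2.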
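The 80/10/10 replacement rule can be sketched as below. This is a simplified illustration, not the released data generator: it selects each position independently with probability 0.15 instead of picking exactly 15% of positions, and the function name, the toy vocabulary, and the decision to skip [CLS]/[SEP] are assumptions made here.

```python
import random

SPECIAL_TOKENS = ("[CLS]", "[SEP]")

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (masked_tokens, labels) for masked-LM training.

    labels[i] holds the original token at selected positions and None
    elsewhere; the loss would be computed only where labels[i] is not None.
    """
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL_TOKENS or rng.random() >= select_prob:
            continue                       # position not selected for prediction
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels

tokens = ["[CLS]", "my", "dog", "is", "hairy", "[SEP]"]
print(mask_tokens(tokens, vocab=["apple", "run", "blue"], rng=random.Random(0)))
```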

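3.1.2 Task #2: Next Sentence Prediction (NSP)

Many important downstream tasks, such as Question Answering (QA) and Natural Language Inference (NLI), are based on understanding the relationship between two sentences, which language modeling does not capture directly.

To train a model that understands sentence relationships, we pre-train for a binarized next sentence prediction task: given two sentences A and B, decide whether B is the sentence that follows A.

BERT used for the next sentence prediction (NSP) task. (Translator's note.) Image Source

The task can be generated from any monolingual corpus. Specifically, when choosing sentences A and B for each pre-training example,

50% of the time B is the actual next sentence that follows A (labeled as IsNext),
50% of the time B is a random sentence from the corpus (labeled as NotNext).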
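Generating NSP examples with the 50/50 rule can be sketched as follows. The function name and arguments are hypothetical, and the sketch does not guard against the random sentence happening to be the true next sentence, which a careful data generator would likely handle.

```python
import random

def make_nsp_example(doc_sentences, index, corpus_sentences, rng):
    """Build one (sentence_a, sentence_b, label) training example.

    `index` must leave room for a following sentence in `doc_sentences`.
    """
    sentence_a = doc_sentences[index]
    if rng.random() < 0.5:
        # 50%: B is the sentence that actually follows A
        return sentence_a, doc_sentences[index + 1], "IsNext"
    # 50%: B is a random sentence drawn from the corpus
    return sentence_a, rng.choice(corpus_sentences), "NotNext"

doc = ["the man went to the store", "he bought a gallon of milk"]
corpus = ["penguins are flightless birds", "the sky is blue"]
print(make_nsp_example(doc, 0, corpus, random.Random(0)))
```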

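Going back to Figure 1 again, this yes/no decision is predicted from $C$, the final hidden vector of the classification token.

The final model achieves 97%-98% accuracy on NSP. Despite its simplicity, we show in Section 5.1 that pre-training on this task is very beneficial to both QA and NLI.

The vector C is not a meaningful sentence representation without fine-tuning, since it was trained with NSP.

The NSP task is closely related to the representation-learning objectives used in Jernite et al. (2017) and Logeswaran and Lee (2018). However, in prior work only sentence embeddings are transferred to downstream tasks, whereas BERT transfers all parameters to initialize the end-task model.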

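3.1.3 Pre-training data

The pre-training procedure largely follows the existing literature on language model pre-training. For the pre-training corpus we use

BooksCorpus (800M words) (Zhu et al., 2015)
English Wikipedia (2,500M words). We extract only the text passages and ignore lists, tables, and headers.

It is critical to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark (Chelba et al., 2013), in order to extract long contiguous sequences.

3.2 Fine-tuning BERT

Fine-tuning is straightforward, since the self-attention mechanism in the Transformer allows BERT to model many downstream tasks, whether they involve a single text or text pairs, by swapping out the appropriate inputs and outputs.

For applications involving text pairs, a common pattern is to independently encode the two texts before applying bidirectional cross attention, as in Parikh et al. (2016) and Seo et al. (2017).

BERT instead uses the self-attention mechanism to unify these two stages (encoding and cross attention), because encoding a concatenated text pair with self-attention effectively includes bidirectional cross attention between the two sentences.

For each task, we simply plug the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end. At the input, sentence A and sentence B from pre-training are analogous to:

sentence pairs in paraphrasing,
hypothesis-premise pairs in entailment,
question-passage pairs in question answering,
a degenerate text-∅ pair in text classification or sequence tagging.

At the output,

the token representations are fed into an output layer for token-level tasks, such as sequence tagging or question answering,
the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis.

Compared to pre-training, fine-tuning is relatively inexpensive. All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model. Section 4 describes some of the details, and more can be found in Appendix A.5.

3.3 Task scenarios

Fig 4. BERT applied to different task scenarios, from the appendix of the paper.
(a) sentence-pair classification; (b) single-sentence classification; (c) question answering; (d) single-sentence tagging.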

4 Experiments

In this section, we present BERT fine-tuning results on 11 NLP tasks.

4.1 GLUE (General Language Understanding Evaluation)

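The GLUE benchmark (Wang et al., 2018a) is a collection of diverse natural language understanding tasks; see Appendix B.1 for details.

4.1.1 Fine-tuning setup

To fine-tune on GLUE:

we represent the input sequence (for a single sentence or sentence pairs) as described in Section 3;
we use the final hidden vector $C$ as the aggregate representation for classification;
the only new parameters introduced during fine-tuning are the classification layer weights $W \in \mathbb{R}^{K \times H}$, where $K$ is the number of labels. We compute a standard classification loss with $C$ and $W$, i.e., $\log({\rm softmax}(CW^T))$.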
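The classification head above is a single matrix product followed by a log-softmax. A minimal pure-Python sketch, assuming the usual convention that the loss is the negative log-probability of the true label (the function name and the toy numbers are illustrative only):

```python
import math

def classification_loss(C, W, label):
    """Cross-entropy loss for one example.

    C: [CLS] vector of length H; W: K rows of length H (one per label).
    Computes -log softmax(C W^T)[label] with a numerically stable log-sum-exp.
    """
    logits = [sum(c * w for c, w in zip(C, row)) for row in W]  # C W^T, length K
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -(logits[label] - log_z)

C = [0.5, -1.0, 2.0]            # toy [CLS] vector with H = 3
W = [[1.0, 0.0, 0.5],           # toy weights with K = 2 labels
     [0.0, 1.0, -0.5]]
print(classification_loss(C, W, label=0))
```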

4.1.2 Hyperparameters

batch size 32
3 epochs
learning rate: for each task, we selected the best fine-tuning learning rate (among 5e-5, 4e-5, 3e-5, and 2e-5) on the Dev set.

In addition, we found that fine-tuning BERTLARGE was sometimes unstable on small datasets, so we ran several random restarts and selected the best resulting model. With random restarts, we use the same pre-trained checkpoint but perform different fine-tuning data shuffling and classifier layer initialization.

4.1.3 Results

Results are shown in Table 1.

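| System | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average |
|---|---|---|---|---|---|---|---|---|---|
| (training examples) | 392k | 363k | 108k | 67k | 8.5k | 5.7k | 3.5k | 2.5k | - |
| Pre-OpenAI SOTA | 80.6/80.1 | 66.1 | 82.3 | 93.2 | 35.0 | 81.0 | 86.0 | 61.7 | 74.0 |
| BiLSTM+ELMo+Attn | 76.4/76.1 | 64.8 | 79.8 | 90.4 | 36.0 | 73.3 | 84.9 | 56.8 | 71.0 |
| OpenAI GPT | 82.1/81.4 | 70.3 | 87.4 | 91.3 | 45.4 | 80.0 | 82.3 | 56.0 | 75.1 |
| BERTBASE | 84.6/83.4 | 71.2 | 90.5 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | 79.6 |
| BERTLARGE | 86.7/85.9 | 72.1 | 92.7 | 94.9 | 60.5 | 86.5 | 89.3 | 70.1 | 82.1 |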

Table 1: GLUE Test results, scored by the evaluation server (https://gluebenchmark.com/leaderboard). The number below each task denotes the number of training examples. The “Average” column is slightly different than the official GLUE score, since we exclude the problematic WNLI set. BERT and OpenAI GPT are single-model, single task. F1 scores are reported for QQP and MRPC, Spearman correlations are reported for STS-B, and accuracy scores are reported for the other tasks. We exclude entries that use BERT as one of their components.

Both BERTBASE and BERTLARGE outperform all systems on all tasks by a substantial margin, obtaining 4.5% and 7.0% respective average accuracy improvement over the prior state of the art. Note that BERTBASE and OpenAI GPT are nearly identical in terms of model architecture apart from the attention masking. For the largest and most widely reported GLUE task, MNLI, BERT obtains a 4.6% absolute accuracy improvement. On the official GLUE leaderboard, BERTLARGE obtains a score of 80.5, compared to OpenAI GPT, which obtains 72.8 as of the date of writing.

We find that BERTLARGE significantly outperforms BERTBASE across all tasks, especially those with very little training data. The effect of model size is explored more thoroughly in Section 5.2.

4.2 SQuAD (Stanford Question Answering Dataset) v1.1

SQuAD v1.1 is a collection of 100k crowdsourced question/answer pairs (Rajpurkar et al., 2016). Given a question and a passage from Wikipedia containing the answer, the task is to predict the answer text span in the passage.

As shown in Figure 1, in the question answering task, we represent the input question and passage as a single packed sequence, with the question using the $A$ embedding and the passage using the $B$ embedding. We only introduce a start vector $S \in \mathbb{R}^H$ and an end vector $E \in \mathbb{R}^H$ during fine-tuning. The probability of word $i$ being the start of the answer span is computed as a dot product between $T_i$ and $S$ followed by a softmax over all of the words in the paragraph: $P_i = \frac{e^{S{\cdot}T_i}}{\sum_j e^{S{\cdot}T_j}}$. The analogous formula is used for the end of the answer span. The score of a candidate span from position $i$ to position $j$ is defined as $S{\cdot}T_i + E{\cdot}T_j$, and the maximum scoring span where $j \geq i$ is used as a prediction. The training objective is the sum of the log-likelihoods of the correct start and end positions. We fine-tune for 3 epochs with a learning rate of 5e-5 and a batch size of 32.
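The span selection rule, maximizing $S{\cdot}T_i + E{\cdot}T_j$ subject to $j \geq i$, can be written as a brute-force search. The sketch below uses plain Python lists and toy 2-dimensional vectors; the function names are introduced here for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def best_span(T, S, E):
    """Return (score, i, j) maximizing S·T_i + E·T_j over all spans with j >= i.

    T: list of final hidden vectors, one per passage token.
    S, E: the start and end vectors learned during fine-tuning.
    """
    start_scores = [dot(S, t) for t in T]
    end_scores = [dot(E, t) for t in T]
    best = None
    for i in range(len(T)):
        for j in range(i, len(T)):
            score = start_scores[i] + end_scores[j]
            if best is None or score > best[0]:
                best = (score, i, j)
    return best

T = [[0, 0], [1, 0], [0, 1]]   # toy hidden vectors for three tokens
print(best_span(T, S=[1, 0], E=[0, 1]))
```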

Table 2 shows top leaderboard entries as well as results from top published systems (Seo et al., 2017; Clark and Gardner, 2018; Peters et al., 2018a; Hu et al., 2018).

Table 2: SQuAD 1.1 results. The BERT ensemble is 7x systems which use different pre-training checkpoints and fine-tuning seeds.

The top results from the SQuAD leaderboard do not have up-to-date public system descriptions available, and are allowed to use any public data when training their systems. We therefore use modest data augmentation in our system by first fine-tuning on TriviaQA (Joshi et al., 2017) before fine-tuning on SQuAD. Our best performing system outperforms the top leaderboard system by +1.5 F1 in ensembling and +1.3 F1 as a single system. In fact, our single BERT model outperforms the top ensemble system in terms of F1 score. Without TriviaQA fine-tuning data, we only lose 0.1-0.4 F1, still outperforming all existing systems by a wide margin.

4.3 SQuAD v2.0

The SQuAD 2.0 task extends the SQuAD 1.1 problem definition by allowing for the possibility that no short answer exists in the provided paragraph, making the problem more realistic. We use a simple approach to extend the SQuAD v1.1 BERT model for this task. We treat questions that do not have an answer as having an answer span with start and end at the [CLS] token. The probability space for the start and end answer span positions is extended to include the position of the [CLS] token.

For prediction, we compare the score of the no-answer span: $s_{\tt null} = S{\cdot}C + E{\cdot}C$ to the score of the best non-null span $\hat{s_{i,j}}$ = ${\tt max}_{j \geq i} S{\cdot}T_i + E{\cdot}T_j$. We predict a non-null answer when $\hat{s_{i,j}} > s_{\tt null} + \tau$, where the threshold $\tau$ is selected on the dev set to maximize F1. We did not use TriviaQA data for this model. We fine-tuned for 2 epochs with a learning rate of 5e-5 and a batch size of 48.
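The no-answer decision adds one comparison on top of the span search: the best non-null span must beat the [CLS]-anchored null score by more than $\tau$. A self-contained sketch with toy vectors (the function name and the `None` return convention are assumptions made here):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def predict_answer(T, C, S, E, tau):
    """Return the best (i, j) span, or None when "no answer" is predicted.

    s_null = S·C + E·C is the score of the null span at [CLS]; a non-null
    answer is predicted only if its score exceeds s_null + tau.
    """
    s_null = dot(S, C) + dot(E, C)
    start_scores = [dot(S, t) for t in T]
    end_scores = [dot(E, t) for t in T]
    best_score, best_span = max(
        ((start_scores[i] + end_scores[j], (i, j))
         for i in range(len(T)) for j in range(i, len(T))),
        key=lambda pair: pair[0],
    )
    return best_span if best_score > s_null + tau else None

T = [[0, 0], [1, 0], [0, 1]]
print(predict_answer(T, C=[0, 0], S=[1, 0], E=[0, 1], tau=0.0))
print(predict_answer(T, C=[0, 0], S=[1, 0], E=[0, 1], tau=5.0))
```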

The results compared to prior leaderboard entries and top published work (Sun et al., 2018; Wang et al., 2018b) are shown in Table 3, excluding systems that use BERT as one of their components. We observe a +5.1 F1 improvement over the previous best system.

Table 3: SQuAD 2.0 results. We exclude entries that use BERT as one of their components.

4.4 SWAG (Situations With Adversarial Generations)

SWAG dataset contains 113k sentence-pair completion examples that evaluate grounded commonsense inference (Zellers et al., 2018).

Given a sentence, the task is to choose the most plausible continuation among four choices. When fine-tuning on the SWAG dataset, we construct four input sequences, each containing the concatenation of the given sentence (sentence A) and a possible continuation (sentence B). The only task-specific parameters introduced is a vector whose dot product with the [CLS] token representation C denotes a score for each choice which is normalized with a softmax layer.
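Scoring the four candidate continuations therefore needs only one extra vector. A minimal sketch, where `cls_vectors` holds the [CLS] representation $C$ of each of the four packed sequences and `v` is the task-specific vector (both names are illustrative):

```python
import math

def choice_probabilities(cls_vectors, v):
    """Softmax over the dot products between v and each choice's [CLS] vector."""
    scores = [sum(a * b for a, b in zip(v, c)) for c in cls_vectors]
    m = max(scores)                                  # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

cls_vectors = [[0.1, 0.2], [0.9, 0.8], [0.0, 0.1], [0.3, 0.3]]  # toy values
probs = choice_probabilities(cls_vectors, v=[1.0, 1.0])
print(probs, "-> predicted choice:", probs.index(max(probs)))
```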

We fine-tune the model for 3 epochs with a learning rate of 2e-5 and a batch size of 16. Results are presented in Table 4.

Table 4: SWAG Dev and Test accuracies. Human performance is measured with 100 samples, as reported in the SWAG paper.

BERTLARGE outperforms the authors’ baseline ESIM+ELMo system by +27.1% and OpenAI GPT by 8.3%.

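5 Ablation studies

This section removes several facets of BERT and measures how much performance is lost on different kinds of tasks,

sentence-level (e.g., SST-2)
sentence-pair-level (e.g., MultiNLI)
word-level (e.g., NER)
span-level (e.g., SQuAD)

in order to better understand their relative importance. Additional ablation studies can be found in Appendix C.

5.1 Effect of the pre-training tasks (MLM/NSP)

5.1.1 Model variants

We demonstrate the importance of the deep bidirectionality of BERT with the following variants, which use exactly the same pre-training data, fine-tuning scheme, and hyperparameters as BERTBASE:

NO NSP: the next sentence prediction task is removed. This is still a bidirectional model trained with the masked LM (MLM), just without the NSP task.
LTR & NO NSP: in addition to removing NSP, the model is trained with a standard Left-to-Right (LTR) LM rather than an MLM. The left-only constraint is also applied at fine-tuning, because removing it introduces a pre-train/fine-tune mismatch that degrades downstream performance. This is directly comparable to OpenAI GPT, but uses our larger training dataset, our input representation, and our fine-tuning scheme.
+ BiLSTM: a randomly initialized BiLSTM is added on top of the LTR & NO NSP model during fine-tuning.

5.1.2 Results

Results are shown in Table 5.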

| Tasks | MNLI-m (Acc) | QNLI (Acc) | MRPC (Acc) | SST-2 (Acc) | SQuAD (F1) |
|---|---|---|---|---|---|
| BERTBASE | 84.4 | 88.4 | 86.7 | 92.7 | 88.5 |
| No NSP | 83.9 | 84.9 | 86.5 | 92.6 | 87.9 |
| LTR & No NSP | 82.1 | 84.3 | 77.5 | 92.1 | 77.8 |
| + BiLSTM | 82.1 | 84.1 | 75.7 | 91.6 | 84.9 |

Table 5: Ablation over the pre-training tasks using the BERTBASE architecture.

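Analysis:

Row 2 vs. row 1 (effect of removing NSP): performance drops on QNLI, MNLI, and SQuAD 1.1, most visibly on QNLI (88.4 to 84.9).
Row 3 vs. row 2 (effect of removing bidirectional representations): row 2 is effectively MLM & NO NSP. The LTR model performs worse than the MLM model on all tasks, with large drops on MRPC and SQuAD.
- For SQuAD, it is intuitively clear that an LTR model performs poorly at token predictions, since the token-level hidden states have no right-side context.
- To try to strengthen the LTR system, we added a randomly initialized BiLSTM on top. This does improve results on SQuAD, but the results are still far worse than those of the pre-trained bidirectional models. The BiLSTM also hurts performance on the GLUE tasks.

5.1.3 Differences from ELMo

ELMo trains separate left-to-right (LTR) and right-to-left (RTL) models and represents each token as the concatenation of the two. However:

this is twice as expensive as a single bidirectional model;
it is non-intuitive for tasks like QA, since the RTL model would not be able to condition the answer on the question;
it is less powerful than a deep bidirectional model, since the latter can use both left and right context at every layer.

5.2 Effect of model size

To explore the effect of model size on fine-tuning task accuracy, we trained a number of BERT models. Table 6 shows their results on selected GLUE tasks.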

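| L (layers) | H (hidden size) | A (attention heads) | LM (ppl) | MNLI-m | MRPC | SST-2 |
|---|---|---|---|---|---|---|
| 3 | 768 | 12 | 5.84 | 77.9 | 79.8 | 88.4 |
| 6 | 768 | 3 | 5.24 | 80.6 | 82.2 | 90.7 |
| 6 | 768 | 12 | 4.68 | 81.9 | 84.8 | 91.3 |
| 12 | 768 | 12 | 3.99 | 84.4 | 86.7 | 92.9 |
| 12 | 1024 | 16 | 3.54 | 85.7 | 86.9 | 93.3 |
| 24 | 1024 | 16 | 3.23 | 86.6 | 87.8 | 93.7 |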

Table 6: Ablation over BERT model size. “LM (ppl)” is the masked LM perplexity of held-out training data

In this table, we report the average Dev Set accuracy from 5 random restarts of fine-tuning.

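We can see that larger models lead to higher accuracy across all four datasets, even for MRPC, which has only 3,600 labeled training examples and differs substantially from the pre-training tasks. It is also perhaps surprising that we achieve such significant improvements on top of models that are already quite large. For example,

the largest Transformer explored in Vaswani et al. (2017) is (L=6, H=1024, A=16) with 100M parameters for the encoder,
the largest Transformer we found in the literature is (L=64, H=512, A=2) with 235M parameters (Al-Rfou et al., 2018).
By contrast, BERTBASE contains 110M parameters and BERTLARGE contains 340M parameters.

It has long been known that increasing model size leads to continual improvements on large-scale tasks such as machine translation and language modeling, as the LM perplexity column of Table 6 also shows. However, we believe this is the first work to demonstrate that scaling to extreme model sizes also leads to large improvements on very small scale tasks, provided that the model has been sufficiently pre-trained.

In addition,

Peters et al. (2018b) studied the downstream task impact of increasing the pre-trained bi-LM size from two to four layers,
Melamud et al. (2016) mentioned that increasing the hidden dimension size from 200 to 600 helped, but increasing it further to 1,000 did not bring further improvements.

Both of these prior works used a feature-based approach. We instead fine-tune directly on the downstream tasks and use only a very small number of randomly initialized additional parameters; the results suggest that models can benefit from larger, more expressive pre-trained representations even when the downstream task data is very small.

5.3 Feature-based approach with BERT

All of the BERT results presented so far have used the fine-tuning approach, where a simple classification layer is added to the pre-trained model and all parameters are jointly fine-tuned on a downstream task.

5.3.1 When the feature-based approach is useful

However, the feature-based approach, where fixed features are extracted from the pre-trained model, has certain advantages:

First, not all tasks can be easily represented by a Transformer encoder architecture, so those tasks require a task-specific model architecture to be added.
Second, there are major computational benefits to pre-computing an expensive representation of the training data once and then running many experiments with cheaper models on top of this representation.

5.3.2 Experiments

This section compares the two approaches by applying BERT to the CoNLL-2003 Named Entity Recognition (NER) task (Tjong Kim Sang and De Meulder, 2003).

The input to BERT uses a case-preserving WordPiece model and includes the maximal document context provided by the data.
Following standard practice, we formulate this as a tagging task but do not use a CRF layer in the output.
We use the representation of the first sub-token as the input to the token-level classifier over the NER label set.

To ablate the fine-tuning approach, we apply the feature-based approach by extracting the activations from one or more layers without fine-tuning any parameters of BERT. These contextual embeddings are used as input to a randomly initialized two-layer 768-dimensional BiLSTM before the classification layer.

5.3.3 Results

Results are presented in Table 7. BERTLARGE performs competitively with state-of-the-art methods.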

| System | Dev F1 | Test F1 |
|---|---|---|
| ELMo (Peters et al., 2018a) | 95.7 | 92.2 |
| CVT (Clark et al., 2018) | - | 92.6 |
| CSE (Akbik et al., 2018) | - | 93.1 |
| Fine-tuning approach | | |
| BERTLARGE | 96.6 | 92.8 |
| BERTBASE | 96.4 | 92.4 |
| Feature-based approach (BERTBASE) | | |
| Embeddings | 91.0 | - |
| Second-to-Last Hidden | 95.6 | - |
| Last Hidden | 94.9 | - |
| Weighted Sum Last Four Hidden | 95.9 | - |
| Concat Last Four Hidden | 96.1 | - |
| Weighted Sum All 12 Layers | 95.5 | - |

Table 7: CoNLL-2003 Named Entity Recognition results. Hyperparameters were selected using the Dev set. The reported Dev and Test scores are averaged over 5 random restarts using those hyperparameters

The best performing method concatenates the token representations from the top four hidden layers of the pre-trained Transformer, which is only 0.3 F1 behind fine-tuning the entire model.
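The best feature-based variant, concatenating one token's representations from the top four hidden layers, is a simple list operation once the per-layer activations are available. In this sketch `hidden_states` is a hypothetical list with one vector per layer (bottom to top) for a single token; extracting those activations from a frozen model is outside its scope.

```python
def concat_last_four(hidden_states):
    """Concatenate one token's vectors from the top four layers.

    hidden_states: list of per-layer vectors, ordered bottom to top.
    With H = 768 the result has 4 * 768 = 3072 dimensions.
    """
    if len(hidden_states) < 4:
        raise ValueError("need at least four layers")
    return [x for layer in hidden_states[-4:] for x in layer]

# Toy example: 12 layers, each a 768-dimensional vector filled with its layer index.
layers = [[float(i)] * 768 for i in range(12)]
features = concat_last_four(layers)
print(len(features), features[0], features[-1])
```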

This demonstrates that BERT is effective for both fine-tuning and feature-based approaches.

6 Conclusion

Recent empirical improvements due to transfer learning with language models have demonstrated that rich, unsupervised pre-training is an integral part of many language understanding systems. In particular, these results enable even low-resource tasks to benefit from deep unidirectional architectures.

Our major contribution is further generalizing these findings to deep bidirectional architectures, allowing the same pre-trained model to successfully tackle a broad set of NLP tasks.

Appendix A. Additional Details for BERT

A.1 Illustration of the Pre-training Tasks

Figure 3: Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. Among the three, only BERT representations are jointly conditioned on both left and right context in all layers. In addition to the architecture differences, BERT and OpenAI GPT are fine-tuning approaches, while ELMo is a feature-based approach.

A.2 Pre-training Procedure

A.3 Fine-tuning Procedure

For fine-tuning, most model hyperparameters are the same as in pre-training, with the exception of the batch size, learning rate, and number of training epochs. The dropout probability was always kept at 0.1. The optimal hyperparameter values are task-specific, but we found the following range of possible values to work well across all tasks:

Batch size: 16, 32
Learning rate (Adam): 5e-5, 3e-5, 2e-5
Number of epochs: 2, 3, 4

We also observed that large data sets (e.g., 100k+ labeled training examples) were far less sensitive to hyperparameter choice than small data sets. Fine-tuning is typically very fast, so it is reasonable to simply run an exhaustive search over the above parameters and choose the model that performs best on the development set.
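The exhaustive search mentioned here covers only 2 × 3 × 3 = 18 configurations. A sketch of enumerating the grid and picking the best one; `evaluate` is a hypothetical callback standing in for "fine-tune with this configuration and return the Dev set score", and the dummy scorer in the usage lines exists only to make the example run.

```python
from itertools import product

BATCH_SIZES = (16, 32)
LEARNING_RATES = (5e-5, 3e-5, 2e-5)
EPOCHS = (2, 3, 4)

def hyperparameter_grid():
    """Enumerate every combination from the recommended ranges."""
    return [
        {"batch_size": b, "learning_rate": lr, "epochs": e}
        for b, lr, e in product(BATCH_SIZES, LEARNING_RATES, EPOCHS)
    ]

def select_best(configs, evaluate):
    """Return the configuration with the highest score from `evaluate`."""
    return max(configs, key=evaluate)

grid = hyperparameter_grid()
print(len(grid), "configurations")
# Dummy scorer for illustration only; it is NOT a real Dev set evaluation.
print(select_best(grid, evaluate=lambda cfg: -abs(cfg["learning_rate"] - 3e-5) - cfg["epochs"] * 1e-9))
```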

A.4 Comparison of BERT, ELMo, and OpenAI GPT

A.5 Illustrations of Fine-tuning on Different Tasks

B. Detailed Experimental Setup

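Fig 4. BERT applied to different tasks, from the paper's appendix.
(a) sentence-pair classification; (b) single-sentence classification; (c) question answering; (d) single-sentence tagging.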

C. Additional Ablation Studies

References

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649.
Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. 2018. Character-level language modeling with deeper self-attention. arXiv preprint arXiv:1808.04444.
Rie Kubota Ando and Tong Zhang. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853.
Luisa Bentivogli, Bernardo Magnini, Ido Dagan, Hoa Trang Dang, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC. NIST.
John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 conference on empirical methods in natural language processing, pages 120–128. Association for Computational Linguistics.
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP. Association for Computational Linguistics.
Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. 1992. Class-based n-gram models of natural language. Computational linguistics, 18(4):467–479.
Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. Semeval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics. https://doi.org/10.18653/v1/S17-2001
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.
Z. Chen, H. Zhang, X. Zhang, and L. Zhao. 2018. Quora question pairs. https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs
Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehension. In ACL.
Kevin Clark, Minh-Thang Luong, Christopher D Manning, and Quoc Le. 2018. Semi-supervised sequence modeling with cross-view training. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1914–1925.
Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167. ACM.
Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics. https://www.aclweb.org/anthology/D17-1070
Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in neural information processing systems, pages 3079–3087.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
William Fedus, Ian Goodfellow, and Andrew M Dai. 2018. Maskgan: Better text generation via filling in the_. arXiv preprint arXiv:1801.07736.
Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR, abs/1606.08415. http://arxiv.org/abs/1606.08415
Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.
Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In ACL. Association for Computational Linguistics. http://arxiv.org/abs/1801.06146
Minghao Hu, Yuxing Peng, Zhen Huang, Xipeng Qiu, Furu Wei, and Ming Zhou. 2018. Reinforced mnemonic reader for machine reading comprehension. In IJCAI.
Yacine Jernite, Samuel R. Bowman, and David Sontag. 2017. Discourse-based objectives for fast unsupervised sentence representation learning. CoRR, abs/1705.00557. http://arxiv.org/abs/1705.00557
Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In ACL.
Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in neural information processing systems, pages 3294–3302.
Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196.
Hector J Levesque, Ernest Davis, and Leora Morgenstern. 2011. The winograd schema challenge. In Aaai spring symposium: Logical formalizations of commonsense reasoning, volume 46, page 47.
Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. In International Conference on Learning Representations. https://openreview.net/forum?id=rJvJXZb0W
Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In NIPS.
Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In CoNLL.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.
Andriy Mnih and Geoffrey E Hinton. 2009. A scalable hierarchical distributed language model. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1081–1088. Curran Associates, Inc. http://papers.nips.cc/paper/3583-a-scalable-hierarchical-distributed-language-model.pdf
Ankur P Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In EMNLP.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. http://www.aclweb.org/anthology/D14-1162
Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In ACL.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In NAACL.
Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.
Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In ICLR.
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.
Fu Sun, Linyang Li, Xipeng Qiu, and Yang Liu. 2018. U-net: Machine reading comprehension with unanswerable questions. arXiv preprint arXiv:1810.06638.
Wilson L Taylor. 1953. “Cloze procedure”: A new tool for measuring readability. Journalism Bulletin, 30(4):415–433.
Erik F Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the conll-2003 shared task: Language-independent named entity recognition. In CoNLL.
Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ‘10, pages 384–394.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.
Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018a. Glue: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355.
Wei Wang, Ming Yan, and Chen Wu. 2018b. Multi-granularity hierarchical attention fusion networks for reading comprehension and question answering. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.
Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.
Adina Williams, Nikita Nangia, and Samuel R Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328.
Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. In ICLR.
Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. Swag: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19–27.
