But What Is MCP (Model Context Protocol)? (2025)
There are already some good documents for MCP,
- Model Context Protocol documentation
- Model Context Protocol specification
- Officially supported servers
but developers and architects may still find it confusing how MCP works under the hood, and this post tries to fill that gap.
Fig. Integrating external services into AI applications with MCP. Note that MCP also supports connecting to local services (co-located with the AI application) with the same client-server architecture.
1 What’s MCP?

1.1 Naming

MCP is an abbreviation for Model Context Protocol. From the name, we can see that
- First of all, it’s a communication protocol,
- Then, it’s for models (LLMs),
- Finally, it is used for exchanging/passing model context.
When building agents or complex workflows on top of LLMs, it is often necessary to integrate with external data or tools (e.g. external MySQL, Google Maps). MCP provides a standardized way to do this.
Let’s use an analogy to better explain it.
1.3 Analogy

Traditionally, personal computers have a variety of hardware connectors, such as USB, HDMI, DP, RJ45, etc.
Fig. Various kinds of hardware connectors.
Computer designers have to decide which devices they want to support during the design phase, and then pre-install the corresponding hardware interfaces on the motherboard. When a new kind of hardware connector comes along, it is impossible to support it without changing the motherboard or introducing a new kind of hardware adapter.
1.3.1 USB type-c for computers

With the introduction of the USB type-c specification, things have changed. USB type-c is becoming the standard connector for most devices. As illustrated below,
Fig. Peripheral devices connected to a computer's USB type-c hub with adapters.
When the computer needs to connect to many peripherals, it first plugs in a USB type-c hub (the actual hub generally supports multiple interfaces, not just type-c), and for those peripheral devices,
- If they are already of type-c, they can connect to the hub directly;
- Otherwise, for example older devices or specialized devices in professional fields, they can first be converted to type-c through an adapter, then connected to the hub.
So, as long as a device supports the type-c interface (directly or through an adapter), it can easily be integrated into the computer.
1.3.2 MCP for AI Apps

MCP is like a USB-C port for AI applications. Just as USB-C provides a standardized way to connect your devices to various peripherals and accessories, MCP provides a standardized way to connect AI models to different data sources and tools.
An analogy is shown below,
Fig. Integrating external services into AI applications with MCP. Note that MCP also supports connecting to local services (co-located with the AI application) with the same client-server architecture.
From left to right,
| Personal Computer case | AI App case | Notes |
|---|---|---|
| Peripherals, such as monitors | External data or services, such as Google Translate | To be integrated into the AI application. They may use various protocols, such as HTTP, WebSocket, gRPC, Redis protocol, etc. |
| Connector adapters | Protocol adaptation layer (server-side) | One MCP server for each external service, providing a standardized interface (JSON-RPC) to the MCP client. |
| USB type-c hub | Protocol adaptation layer (client-side) | One MCP client for each external service, connecting to the corresponding MCP server with the standard protocol. |
| The personal computer | The AI app | The main part, integrating external services with the MCP clients. |
| - | LLM layer | AI apps rely on LLM services for function calling to the external services with MCP. |

1.4 Summary

MCP is an open protocol that standardizes how applications provide context to LLMs. Think of MCP like a USB-C port for AI applications. Just as USB-C provides a standardized way to connect your devices to various peripherals and accessories, MCP provides a standardized way to connect AI models to different data sources and tools.
2 Architecture & Spec

MCP follows the classic client-server architecture.
2.1 Base Protocol

- JSON-RPC message format
- Stateful connections
- Server and client capability negotiation
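To make the base protocol concrete, here is a rough sketch of the JSON-RPC 2.0 messages exchanged over an MCP connection. The tools/list method name comes from the MCP specification; the payload details (the translate tool and its schema) are illustrative assumptions, not output from a real server.

```python
import json

# Client -> server: a JSON-RPC 2.0 request asking the server to list its tools.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
}

# Server -> client: the matching response, carrying tool descriptions that the
# AI app can later forward to the LLM as candidate functions.
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [
            {
                "name": "translate",                # hypothetical tool
                "description": "Translate text between languages",
                "inputSchema": {"type": "object"},  # JSON Schema for arguments
            }
        ]
    },
}

print(json.dumps(request, indent=2))
print(json.dumps(response, indent=2))
```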
2.2 Server side

The MCP protocol defines three core primitives that servers can implement:
| Primitive | Control | Description | Example Use |
|---|---|---|---|
| Prompts | User-controlled | Interactive templates invoked by user choice | Slash commands, menu options |
| Resources | Application-controlled | Contextual data managed by the client application | File contents, API responses |
| Tools | Model-controlled | Functions exposed to the LLM to take actions | API calls, data updates |
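To illustrate how a server exposes these primitives, below is a minimal sketch using FastMCP from the official python-sdk (linked in section 2.4). The demo-server name and the translate/greeting/review primitives are hypothetical examples, not a real service.

```python
# A minimal MCP server sketch using FastMCP from the official python-sdk
# (https://github.com/modelcontextprotocol/python-sdk). The tool, resource,
# and prompt below are hypothetical examples of the three primitives.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")

@mcp.tool()                           # Tools: model-controlled functions
def translate(text: str, target_lang: str) -> str:
    """Translate text into the target language (stub implementation)."""
    return f"[{target_lang}] {text}"

@mcp.resource("greeting://{name}")    # Resources: application-controlled data
def greeting(name: str) -> str:
    """Return a greeting resource for the given name."""
    return f"Hello, {name}!"

@mcp.prompt()                         # Prompts: user-controlled templates
def review(code: str) -> str:
    """A prompt template the user can invoke explicitly."""
    return f"Please review this code:\n{code}"

if __name__ == "__main__":
    mcp.run()                         # serves over stdio by default
```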
Server Capabilities

MCP servers declare capabilities during initialization:

| Capability | Feature Flag | Description |
|---|---|---|
| prompts | listChanged | Prompt template management |
| resources | subscribe, listChanged | Resource exposure and updates |
| tools | listChanged | Tool discovery and execution |
| logging | - | Server logging configuration |
| completion | - | Argument completion suggestions |

2.3 Client side

Clients may offer the following feature to servers:
- Sampling: Server-initiated agentic behaviors and recursive LLM interactions
The MCP client gets the server’s capabilities through APIs such as list_tools.

Note that the LLM is only responsible for selecting functions; the actual function call is triggered inside the AI app, as sketched below.
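Here is a minimal capability-discovery sketch based on the official python-sdk quickstart: it connects to a hypothetical server script over stdio, negotiates capabilities, and lists the server’s tools.

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # demo_server.py is a hypothetical MCP server script.
    params = StdioServerParameters(command="python", args=["demo_server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()           # capability negotiation
            result = await session.list_tools()  # names, descriptions, schemas
            for tool in result.tools:
                print(tool.name, "-", tool.description)

asyncio.run(main())
```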
2.4 Programming examples

- https://modelcontextprotocol.io/quickstart/client
- https://github.com/modelcontextprotocol/python-sdk
3 MCP vs. Function Call

Conceptually, MCP and function call both aim to let AI applications call external services easily, but they work in different ways. Let’s take a look at the workflow of a specific example, accessing the Google Translate API, and see the difference between these two methods.
3.1 Function Call

Fig. Function call workflow for accessing Google Translate.
Steps:
- AI app: build the prompt, including the function information of the Google Translate API;
- AI app: call LLM with the prompt;
- LLM: model response, with the selected function included;
- AI app: call into the Google Translate API directly (HTTP/HTTPS), as sketched below.
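A minimal sketch of these steps, assuming the OpenAI Python SDK as the LLM service; the tool schema follows OpenAI’s function-calling format, and call_google_translate is a hypothetical HTTP wrapper that the AI app implements itself (stubbed here).

```python
import json
from openai import OpenAI

def call_google_translate(text: str, target_lang: str) -> str:
    """Hypothetical wrapper over the Google Translate HTTP API (stubbed)."""
    return f"[{target_lang}] {text}"

client = OpenAI()

# Step 1: build the prompt, including the function information as `tools`.
tools = [{
    "type": "function",
    "function": {
        "name": "translate",
        "description": "Translate text with the Google Translate API",
        "parameters": {
            "type": "object",
            "properties": {
                "text": {"type": "string"},
                "target_lang": {"type": "string"},
            },
            "required": ["text", "target_lang"],
        },
    },
}]

# Steps 2-3: call the LLM; the response carries the selected function.
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Translate 'hello' to French"}],
    tools=tools,
)
tool_call = resp.choices[0].message.tool_calls[0]

# Step 4: the AI app itself calls the Google Translate API over HTTP/HTTPS.
args = json.loads(tool_call.function.arguments)
print(call_google_translate(**args))
```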
The same scenario for MCP:
Fig. MCP workflow for accessing Google Translate.
Steps:
- AI app: init the MCP client with the MCP server address of the Google Translate service;
- MCP client: get the capabilities of the Google Translate MCP server via the server’s built-in list_tools API;
- AI app: build the prompt, including all the function information of the Google Translate API (obtained in step 2);
- AI app: call LLM with the prompt;
- LLM: model response, with the selected function included;
- AI app: call into the proper Google Translate API through MCP, as sketched below.
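A minimal end-to-end sketch of this workflow with the official python-sdk; the translate_server.py script and the translate tool are hypothetical stand-ins for a Google Translate MCP server, and the LLM call in steps 3-5 is elided.

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Step 1: init the MCP client with the server's address (a stdio command here).
    params = StdioServerParameters(command="python", args=["translate_server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Step 2: get the server's capabilities via list_tools.
            tools = (await session.list_tools()).tools
            print("available tools:", [t.name for t in tools])

            # Steps 3-5: build the prompt from `tools`, call the LLM, and parse
            # the selected function from the response (LLM call elided here).
            name, args = "translate", {"text": "hello", "target_lang": "fr"}

            # Step 6: the actual call goes through the MCP client/server.
            result = await session.call_tool(name, args)
            print(result.content)

asyncio.run(main())
```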
4 Limitations

From the Cursor documentation: https://docs.cursor.com/context/model-context-protocol#limitations
MCP is a very new protocol and is still in active development. There are some known caveats to be aware of:
Tool Quantity

Some MCP servers, or users with many MCP servers active, may have many tools available for Cursor to use. Currently, Cursor will only send the first 40 tools to the Agent.
Remote Development

Cursor directly communicates with MCP servers from your local machine, either directly through stdio or via the network using SSE. Therefore, MCP servers may not work properly when accessing Cursor over SSH or other development environments. We are hoping to improve this in future releases.
MCP Resources

MCP servers offer two main capabilities: tools and resources. Tools are available in Cursor today, and allow Cursor to execute the tools offered by an MCP server, and use the output in its further steps. However, resources are not yet supported in Cursor. We are hoping to add resource support in future releases.
5 Summary of Issues

1. MCP client issues: the effort of adapting to different LLMs

The official MCP client is bound to Anthropic’s models (Claude), which means users of OpenAI, Google Gemini, and other models have to implement MCP clients themselves; so far there is no official MCP client from OpenAI.
Different models differ in:
- the description format for functions;
- the API’s request format, parameter format, and return format.
https://github.com/anthropics/anthropic-sdk-python/issues/384
Officially there is no explicit plan to support OpenAI-compatible APIs; besides, the same prompt can perform differently across models, which is another reason they don’t want to support it.
2. The number of functions

Whether using function call or MCP, the list of functions eventually has to be passed to the LLM as tools, and this list can get very long. How should this be handled?

First, a list that is too long may exceed the model’s context window;

Second, even if it fits within the context length, it consumes too many tokens. There may also be hidden, application-specific limits, for example:
- OpenAI’s best practices recommend using no more than 20 functions: https://platform.openai.com/docs/guides/function-calling/function-calling#best-practices-for-defining-functions
- Cursor only sends the first 40 tools to the agent.
Different LLMs also have different planning capabilities, and differ in how well they judge which function to use.
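One common mitigation, not part of the MCP spec, is to pre-filter the tool list by relevance to the user query before handing it to the LLM. Real systems often use embedding similarity; the sketch below uses crude keyword overlap just to illustrate the idea, with hypothetical tool entries.

```python
# Rank tools by keyword overlap with the user query and forward only the
# top-k to the LLM. Illustrative only; real systems usually use embeddings.
def select_tools(query: str, tools: list[dict], k: int = 20) -> list[dict]:
    query_words = set(query.lower().split())

    def score(tool: dict) -> int:
        desc_words = set(tool["description"].lower().split())
        return len(query_words & desc_words)

    return sorted(tools, key=score, reverse=True)[:k]

tools = [
    {"name": "translate", "description": "translate text between languages"},
    {"name": "get_weather", "description": "get the current weather"},
]
print(select_tools("please translate this text to French", tools, k=1))
```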
3. Model dependency: still centered on Anthropic’s Claude models
The official MCP SDK (https://github.com/modelcontextprotocol/python-sdk/) is designed only around the Claude API, which means users of OpenAI, Google, or other models cannot directly build clients/servers quickly on top of it.

Community members have filed an issue asking for OpenAI-compatible API support, but the maintainers said there is no such plan for now: https://github.com/anthropics/anthropic-sdk-python/issues/384.

There are some personal projects that provide MCP-SDK-like support for OpenAI-compatible APIs, but none of them has many stars yet, and whether they will keep being maintained over time remains to be seen:
- https://github.com/bartolli/mcp-llm-bridge
- https://github.com/S1M0N38/mcp-openai
- https://github.com/chrishayuk/mcp-cli
Update: frameworks such as LlamaIndex have already wrapped MCP client capabilities, which avoids this problem: https://docs.llamaindex.ai/en/stable/api_reference/tools/mcp/
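For example, a rough sketch based on the LlamaIndex docs linked above; the server URL is hypothetical, and the exact class names may differ between versions.

```python
import asyncio
from llama_index.tools.mcp import BasicMCPClient, McpToolSpec

async def main():
    # Connect to a hypothetical MCP server exposed over SSE.
    client = BasicMCPClient("http://127.0.0.1:8000/sse")
    tool_spec = McpToolSpec(client=client)

    # Convert the server's MCP tools into LlamaIndex FunctionTool objects,
    # which can then be handed to any LlamaIndex agent regardless of the LLM.
    tools = await tool_spec.to_tool_list_async()
    for tool in tools:
        print(tool.metadata.name)

asyncio.run(main())
```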