Blog

  • Clever Hacks in the RLHF Family: On Policy vs. Off Policy, a Head-to-Head Showdown

    With the open-sourcing of [Llama3], alignment has become more important than ever, and the RLHF family, the backbone of alignment work, keeps growing richer. Today, let's take a tour of the most ingenious twists people have made to RLHF.

    On Policy vs. Off Policy: Which One Comes Out on Top?

    In the LLM world, RLHF splits into two main routes:

    • The On Policy route: represented by [PPO]; the LLM must generate text on the fly during training.
    • The Off Policy route: represented by [DPO]; the LLM does not generate text during training and instead learns from existing data.

    On Policy methods typically demand more compute and longer training time, but in theory they have a higher performance ceiling.

    On Policy: Let the Model Take the Field Itself

    On Policy methods have the model "do the work itself" during training, learning to improve from how good or bad its own generations are.

    For example, imagine you are learning to play Honor of Kings:

    • On Policy: You play the matches yourself, with a coach beside you giving real-time guidance, cheering when you take down a tower and pointing out the mistake when you get killed.
    • Off Policy: You watch a large number of replays from both professional players and Bronze-tier players, learning the former's brilliant plays and avoiding the latter's basic blunders.

    The strength of On Policy methods is that the training data always matches the model's current ability, because every sample is generated by the current model.

    Off Policy: Learning on the Shoulders of Giants

    Off Policy methods instead focus on learning from existing data. The model never has to generate answers itself, so training is faster and demands less compute.

    However, how well Off Policy methods work depends heavily on the quality of the training data and how closely it matches the model's ability. If the data is poor, or too far from what the model can do, the results suffer badly.

    1. The On Policy Route: PPO and Its Optimizations

    1.1 ReMax: Drop the Critic and Travel Light

    [ReMax] puts forward a bold idea: throw away PPO's Critic network and let the Actor align directly with the Reward Model.

    The benefits are obvious:

    • Fewer model parameters: from 4 models down to 3, a large cut in parameter count.
    • Faster training: with no Critic network left to update, backpropagation is quicker.

    The core of ReMax is to use the behavior the current policy itself considers best (its greedy generation) as the baseline, which reduces variance and stabilizes training even without a Critic.
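
    A minimal sketch of this idea, assuming hypothetical helpers `sample_response`, `greedy_response`, `reward_model`, and `logprob_sum` that wrap the actor and the reward model; it illustrates the greedy-reward baseline rather than reproducing the authors' implementation.

    ```python
    import torch

    def remax_loss(actor, reward_model, prompt,
                   sample_response, greedy_response, logprob_sum):
        """ReMax-style policy-gradient sketch: the reward of the greedy rollout
        acts as the baseline that would otherwise come from a Critic."""
        y_sample = sample_response(actor, prompt)      # stochastic rollout
        y_greedy = greedy_response(actor, prompt)      # what the current policy thinks is best
        with torch.no_grad():
            advantage = reward_model(prompt, y_sample) - reward_model(prompt, y_greedy)
        # REINFORCE objective with the baseline subtracted; minimize the negative.
        return -(advantage * logprob_sum(actor, prompt, y_sample))
    ```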

    1.2 GRPO: Brute-Force Averaging, Still Impressive Results

    The [GRPO] algorithm proposed in [DeepSeek-v2] takes a different tack: it keeps PPO's refined machinery such as importance sampling and clipping, but replaces the Critic network with brute-force sampling and averaging over a group of responses.

    GRPO's advantages:

    • Simpler model structure: no Critic network, so lower model complexity.
    • Keeps PPO's strengths: PPO's proven mechanisms are retained, preserving training quality.
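
    A simplified sketch of the group-relative objective for a single prompt, assuming sequence-level log-probabilities; token-level details and the KL regularization toward a reference model are omitted.

    ```python
    import torch

    def grpo_loss(logp_new, logp_old, rewards, clip_eps=0.2):
        """Schematic GRPO loss for one prompt with a group of G sampled responses.

        logp_new / logp_old: (G,) log-probs under the current / sampling policy.
        rewards: (G,) scalar rewards from the reward model.
        Group statistics replace the Critic: advantages are rewards normalized
        within the group.
        """
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        ratio = torch.exp(logp_new - logp_old)                   # importance-sampling ratio
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
        # PPO-style clipped surrogate averaged over the group; minimize the negative.
        return -torch.min(ratio * advantages, clipped * advantages).mean()
    ```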

    1.3 Other On Policy Optimization Directions

    Beyond ReMax and GRPO, researchers have explored other ways to optimize the PPO algorithm, for example:

    • Distributed PPO: spreading the training workload across multiple GPUs or TPUs to speed up training.
    • Transformer-based PPO: leveraging the strong representational power of Transformer models to improve the policy network.

    2. The Off Policy Route: DPO and Its Refinements

    2.1 DPO: Maximize the Probability Gap, Simple and Efficient

    The idea behind [DPO] is very intuitive: for the same prompt, train the model by lowering the sampling probability of the "bad answer" while raising that of the "good answer".

    DPO's advantages:

    • Efficient training: the model never has to generate text, so training is fast.
    • High data utilization: existing preference-pair data can be exploited in full.
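
    A minimal sketch of the DPO objective, assuming each argument is a batch of summed token log-probabilities of the chosen or rejected response under the policy or the frozen reference model.

    ```python
    import torch.nn.functional as F

    def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
        """DPO loss for a batch of preference pairs."""
        chosen_margin = logp_chosen - ref_logp_chosen        # implicit reward of the good answer
        rejected_margin = logp_rejected - ref_logp_rejected  # implicit reward of the bad answer
        # Widen the gap: raise the chosen response, lower the rejected one,
        # both measured relative to the reference model.
        return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
    ```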

    2.2 DPOP: Add a Regularizer to Keep Training from Collapsing

    DPO has a known failure mode: in some cases the sampling probabilities of the "good answer" and the "bad answer" drop together, and model quality suffers.

    To fix this, [DPOP] adds a regularization term on top of the DPO loss, designed to:

    • push the model to learn the "good answer" harder when it still underfits it;
    • focus on driving down the sampling probability of the "bad answer" once the "good answer" is already well fitted.
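
    A sketch of a DPOP-style penalty added to the DPO loss above; the placement of the penalty inside the sigmoid and the hyperparameter values follow my reading of the DPO-Positive formulation and should be treated as assumptions.

    ```python
    import torch
    import torch.nn.functional as F

    def dpop_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
                  beta=0.1, lam=5.0):
        """DPOP-style loss: DPO plus a penalty that activates whenever the policy
        assigns the chosen response less probability than the reference model does."""
        dpo_term = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
        # Zero while pi_theta(chosen) >= pi_ref(chosen); otherwise it grows, forcing
        # the model to keep probability mass on the good answer instead of letting
        # both answers sink together.
        penalty = torch.clamp(ref_logp_chosen - logp_chosen, min=0.0)
        return -F.logsigmoid(beta * (dpo_term - lam * penalty)).mean()
    ```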

    2.3 TDPO: Introduce a KL Penalty to Balance Efficiency and Diversity

    Much like PPO, [TDPO] introduces a KL penalty term into the DPO loss to limit how far the model can move in each update and to prevent overfitting.

    Unlike PPO, however, TDPO uses the forward KL rather than the backward KL. The benefit:

    • Higher output diversity: the forward KL pushes the model to cover a broader probability distribution, so it generates more varied text.
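
    Purely schematic: the sketch below adds a token-level forward-KL penalty, KL(reference || policy), to a DPO-style preference term to show the shape of the idea; the real TDPO objective is more involved (token-wise, with sequential KL terms), so treat this as an illustration rather than the paper's loss.

    ```python
    import torch.nn.functional as F

    def kl_regularized_preference_loss(logp_chosen, logp_rejected,
                                       ref_logp_chosen, ref_logp_rejected,
                                       policy_token_logits, ref_token_logits,
                                       beta=0.1, alpha=0.1):
        """Schematic only: DPO-style preference term + forward KL(ref || policy)."""
        pref = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
        # Forward KL takes the expectation under the *reference* distribution, so the
        # policy is rewarded for covering everything the reference covers -> more
        # diverse outputs than the mode-seeking backward KL.
        ref_logp = F.log_softmax(ref_token_logits, dim=-1)
        pol_logp = F.log_softmax(policy_token_logits, dim=-1)
        forward_kl = (ref_logp.exp() * (ref_logp - pol_logp)).sum(-1).mean()
        return -F.logsigmoid(pref).mean() + alpha * forward_kl
    ```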

    2.4 ORPO: Ditch the Reference Model, Simplify Everything

    [ORPO] goes one step further and tries to do away with the reference model entirely.

    The ORPO loss consists of two parts:

    • SFT Loss: guarantees a basic fit to the chosen response.
    • Odds Ratio Loss: maximizes the ratio between the odds of the "good answer" and the "bad answer", strengthening the model's preference for the "good answer".
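
    A sketch of the two-part ORPO objective, assuming length-normalized (mean per-token) log-probabilities for the odds computation; the λ weighting and the exact normalization follow my reading of the ORPO paper and should be treated as assumptions.

    ```python
    import torch
    import torch.nn.functional as F

    def orpo_loss(nll_chosen, avg_logp_chosen, avg_logp_rejected, lam=0.1):
        """ORPO sketch: SFT loss on the chosen response + odds-ratio preference term,
        with no reference model anywhere.

        nll_chosen: ordinary cross-entropy (SFT) loss on the chosen response.
        avg_logp_*: mean per-token log-probability of the chosen / rejected response.
        """
        def log_odds(avg_logp):
            # odds(y|x) = p / (1 - p), computed in log space for numerical stability
            p = torch.exp(avg_logp).clamp(max=1 - 1e-6)
            return avg_logp - torch.log1p(-p)

        odds_ratio_term = -F.logsigmoid(log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected))
        return (nll_chosen + lam * odds_ratio_term).mean()
    ```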

    Closing Thoughts

    Whether On Policy or Off Policy, the RLHF family's road of clever hacks is paved with inventive ideas. As research deepens, RLHF is bound to play an ever larger role in alignment.

  • PowerInfer-2: Unlocking High-Speed Large Language Model Inference on Smartphones

    In the rapidly evolving landscape of artificial intelligence, Large Language Models (LLMs) have become indispensable tools, offering unparalleled capabilities in understanding and generating human-like text. Traditionally, these models have been deployed in data centers equipped with powerful GPUs, but there’s a growing trend to bring these capabilities to more ubiquitous devices like smartphones. This shift aims to leverage rich personal data while maintaining privacy by keeping computations local. However, deploying LLMs on smartphones presents significant challenges due to their limited processing power and memory. Enter PowerInfer-2, a groundbreaking framework from the Institute of Parallel and Distributed Systems (IPADS) at Shanghai Jiao Tong University, designed to tackle these challenges head-on.

    Introduction to PowerInfer-2

    PowerInfer-2 is an innovative framework specifically engineered for high-speed inference of LLMs on smartphones, even for models whose sizes exceed the device’s memory capacity. The key to PowerInfer-2’s success lies in its ability to utilize the heterogeneous computation, memory, and I/O resources available in modern smartphones. By decomposing traditional matrix computations into fine-grained neuron cluster computations, PowerInfer-2 significantly enhances inference speed and efficiency.

    Key Features of PowerInfer-2

    1. Polymorphic Neuron Engine: Adapts computational strategies for various stages of LLM inference.
    2. Segmented Neuron Caching: Minimizes and conceals I/O overhead.
    3. Fine-Grained Neuron-Cluster-Level Pipelining: Reduces computational delays caused by I/O operations.
    4. Support for Large Models: Capable of running models with up to 47 billion parameters.

    Technical Insights

    Heterogeneous Computation Utilization

    PowerInfer-2 leverages the heterogeneous hardware present in smartphones, such as asymmetric big.LITTLE CPU cores, GPUs, and NPUs. This approach allows the framework to dynamically adapt to the strengths of each component during the different stages of LLM inference.

    Prefill Stage

    During the prefill stage, which processes all tokens in the input sequence concurrently, PowerInfer-2 employs the NPU to handle large matrix computations. This stage benefits from the NPU’s efficiency in processing dense computations, significantly speeding up the generation of the first token.

    Decoding Stage

    In the decoding stage, where tokens are generated sequentially, PowerInfer-2 utilizes small neuron clusters and CPU cores to handle the sparse computations. This method leverages the flexibility of CPU cores, which are well-suited for the lighter computational tasks associated with sparse activations.
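
    A hypothetical sketch of how this division of labor might look in an inference loop; the object and method names below are illustrative placeholders, not PowerInfer-2's actual API.

    ```python
    # Hypothetical sketch only: "npu" and "cpu_cores" stand in for backend objects
    # that do not exist under these names in PowerInfer-2.
    def generate(model, prompt_tokens, max_new_tokens, npu, cpu_cores):
        # Prefill: all prompt tokens at once -> large, dense matrix multiplications,
        # which suit the NPU.
        hidden, kv_cache = npu.run_dense_prefill(model, prompt_tokens)
        token = hidden[-1].argmax(-1)          # greedy sampling, for simplicity
        output = [token]
        for _ in range(max_new_tokens - 1):
            # Decoding: one token at a time -> sparse activations, so only the neuron
            # clusters predicted to be active are computed, on the flexible CPU cores.
            clusters = model.predict_active_neuron_clusters(token)
            hidden, kv_cache = cpu_cores.run_sparse_decode(model, token, clusters, kv_cache)
            token = hidden.argmax(-1)
            output.append(token)
        return output
    ```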

    Neuron Caching and Pipelining

    PowerInfer-2 introduces a segmented cache that operates at the neuron granularity level. This cache is designed to enhance the cache hit rate and reduce the impact of I/O overhead on inference performance. By overlapping I/O operations with neuron cluster computations, the framework minimizes waiting times and maximizes throughput.
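
    The pipelining idea can be pictured with a simple prefetch loop. This is a conceptual sketch built on assumed objects (`cache`, `flash`, `io_pool`, `compute_cluster`), not PowerInfer-2's actual scheduler.

    ```python
    def run_layer(neuron_clusters, cache, flash, io_pool, compute_cluster):
        """Conceptual sketch of neuron-cluster-level pipelining.

        cache: segmented, neuron-granularity in-memory cache (hypothetical object)
        flash: storage backend with a load_cluster() read method (hypothetical)
        io_pool: e.g. a concurrent.futures.ThreadPoolExecutor for background reads
        """
        # Kick off asynchronous flash reads for every cluster missing from the cache.
        pending = {c: io_pool.submit(flash.load_cluster, c)
                   for c in neuron_clusters if c not in cache}

        for c in neuron_clusters:
            # Cache hit: use in-memory weights. Cache miss: wait for the prefetch,
            # which has been overlapping with the computation of earlier clusters.
            weights = cache.get(c) if c in cache else pending[c].result()
            cache.put(c, weights)
            compute_cluster(c, weights)
    ```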

    Offline Planner

    Before running a new model on a smartphone, PowerInfer-2 executes an offline planning phase. This phase analyzes the model and hardware specifications to generate an execution plan that optimally configures computation, memory, and I/O resources. This plan ensures that inference is performed efficiently, even for models that do not fit entirely in memory.
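
    One way to picture the planner's output is as a declarative execution plan consumed at runtime. The structure and field names below are purely illustrative guesses, not the format PowerInfer-2 actually produces.

    ```python
    # Purely illustrative: hypothetical fields an offline execution plan might record.
    execution_plan = {
        "model": "TurboSparse-Mixtral-47B",
        "prefill": {"backend": "NPU", "computation": "dense whole-matrix"},
        "decode": {"backend": "CPU cores", "computation": "sparse neuron clusters"},
        "memory": {"neuron_cache": "sized to available DRAM", "kv_cache": "kept in memory"},
        "io": {"strategy": "overlap flash reads with neuron-cluster computation"},
    }
    ```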

    Implementation and Evaluation

    PowerInfer-2 has been implemented with an additional 12,000 lines of code on top of the original PowerInfer framework. The researchers deployed it on two smartphones, the OnePlus 12 and the OnePlus Ace 2, whose Qualcomm SoCs provide heterogeneous XPUs (CPU, GPU, and NPU) and which have 24 GB and 16 GB of DRAM, respectively.

    Supported Models

    PowerInfer-2 supports a diverse array of LLMs, including:

    • Llama-2 (7B, 13B)
    • TurboSparse-Mistral (7B)
    • TurboSparse-Mixtral (47B)

    Performance

    The evaluation of PowerInfer-2 shows impressive results:

    • Speed: Up to 29.2× speed increase compared to state-of-the-art frameworks.
    • Memory Efficiency: Approximately 40% reduction in memory usage for smaller models while maintaining comparable inference speeds.

    Notably, PowerInfer-2 is the first system to support the TurboSparse-Mixtral-47B model on mobile platforms, achieving a generation speed of 11.68 tokens per second.

    Real-World Applications

    To demonstrate its practical utility, PowerInfer-2 was tested on various real-world tasks such as multi-turn dialogue, code generation, math problem solving, and role play. The framework consistently delivered high performance across these diverse tasks, showcasing its robustness and versatility.

    Conclusion

    PowerInfer-2 represents a significant advancement in the deployment of LLMs on smartphones. By harnessing the heterogeneous resources of modern smartphones and optimizing computation, memory, and I/O operations, PowerInfer-2 enables high-speed, efficient inference for even the largest models. This innovation opens up new possibilities for privacy-preserving, intelligent personal assistants and other applications that require powerful language understanding and generation capabilities on mobile devices.

    For more details and a demonstration video, visit the PowerInfer-2 project site.
