Tags: AGI

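Social Bias in Large Language Models: Seeing “Your” Bias from Different Perspectives
Warning: This article contains examples of bias that may be offensive or upsetting.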

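Large language models (LLMs) are changing how we interact with information, but they also reflect the biases present in human society. How do these biases form, and how do LLMs express them? This article examines how social bias arises in LLMs and introduces a new method for quantifying and analyzing it.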

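Social Bias: From Social Perception to Collective Influence

Social bias stems from the stereotypes people hold about different groups and individuals, and these stereotypes can be positive or negative. For example, the belief that “women are naturally weak” is a negative stereotype, while the belief that “men are naturally strong” is a positive one. Stereotypes vary from person to person and are shaped by factors such as social identity and personal beliefs, so each individual forms a distinct social perception.

Psychologists hold that social bias arises from the collective social perceptions that different individuals form of the same target. This work therefore defines social bias as the aggregate effect of social perceptions. As Figure 1 illustrates, social bias resembles a network of social perceptions: each node represents a group, and each edge represents one group's perception of another, which may be positive or negative.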

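Unveiling LLM Bias: A New Method

In recent years, researchers have found that language models, which are built to imitate human language and social norms, also carry real-world biases. Some studies assess LLM bias indirectly, either by measuring the sentiment of generated text toward demographic groups or by measuring how closely a model agrees with given stereotypes. These approaches, however, cannot directly quantify social bias as seen from the perspective of different groups.

To quantify social perception more directly, this work proposes a question-answering (QA) format that measures an LLM's perception of different targets and then aggregates those perceptions to assess the social bias inside the model.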

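The QA Format: From Role-Playing to Quantified Perception

The method collects an LLM's perception of a given target by assigning the model different personas. For example, we can ask an LLM that has been given the persona of an elderly person: “How would an elderly person view young people?” Analyzing the answer lets us quantify the model's perception of young people.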
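
The persona-based question described above can be sketched in a few lines of Python. This is only an illustration: the prompt template, the answer options, and the mapping from options to scores are all assumptions made for this example, not the prompts or scoring used in the study.

```python
def build_persona_prompt(persona: str, target: str) -> str:
    """Build a persona-conditioned question about a target group.

    The template wording is an illustrative assumption, not the study's prompt.
    """
    return (
        f"You are {persona}. Answer from that point of view.\n"
        f"Question: How would {persona} view {target}?\n"
        "Options: (a) positively (b) negatively (c) neutral\n"
        "Answer with a single option letter."
    )


# Hypothetical mapping from an option letter to a signed perception score.
OPTION_SCORES = {"a": 1.0, "b": -1.0, "c": 0.0}


def score_answer(answer: str) -> float:
    """Convert a model's option letter into a perception score in [-1, 1]."""
    letter = answer.strip().strip("()").lower()[:1]
    return OPTION_SCORES[letter]


prompt = build_persona_prompt("an elderly person", "young people")
print(prompt)
```

The prompt would be sent to whichever model is under test, and the returned option letter passed to `score_answer` to obtain one perception score per (persona, target) pair.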

Three Metrics: A Multi-Dimensional Assessment of Social Bias

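To assess social bias in LLMs more comprehensively, this work proposes three new metrics:
- Target Bias (TB): measures the polarity of an LLM's bias toward a specific target, that is, whether the model tends to view the target positively or negatively.
- Bias Amount (BAmt): measures the magnitude of an LLM's bias toward a specific target, that is, how strong the bias is.
- Persona Bias (PB): measures how an LLM's perception of the same target differs across personas, that is, whether the model exhibits different biases under different personas.

Used together, these metrics allow a finer-grained analysis of social bias in LLMs and reveal how a model's perception of the same target varies with the assigned persona.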
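
To make the distinction between polarity, magnitude, and persona dependence concrete, here is a minimal Python sketch. The formulas are simplified assumptions chosen for illustration (a signed mean, a mean of absolute values, and a mean absolute shift from a no-persona baseline); they are not the definitions used in the study, and the scores are made up.

```python
from statistics import mean

# Hypothetical perception scores in [-1, 1], indexed as scores[persona][target].
# "none" is an assumed no-persona baseline. All numbers are invented.
scores = {
    "none":  {"young": 0.10, "old": -0.20},
    "young": {"young": 0.30, "old": -0.40},
    "old":   {"young": -0.50, "old": 0.20},
}


def target_bias(scores, target):
    """Signed mean perception of `target` across personas (polarity)."""
    return mean(s[target] for s in scores.values())


def bias_amount(scores, target):
    """Mean absolute perception of `target` across personas (magnitude)."""
    return mean(abs(s[target]) for s in scores.values())


def persona_bias(scores, persona, baseline="none"):
    """Mean absolute shift in perception caused by assigning `persona`."""
    base = scores[baseline]
    return mean(abs(scores[persona][t] - base[t]) for t in base)


tb_young = target_bias(scores, "young")
bamt_young = bias_amount(scores, "young")
pb_old = persona_bias(scores, "old")
print(tb_young, bamt_young, pb_old)
```

In this toy data the positive and negative perceptions of "young" nearly cancel in the signed mean but not in the absolute mean, which is why a polarity measure and a magnitude measure are reported separately.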

Experimental Results: Revealing the Social Attitudes of LLMs

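The researchers ran experiments on five LLMs: GPT-3.5, GPT-4, and three LLaMA-2-Chat models of different sizes. The results show:
- Model size is related to bias amount: smaller models (for example, the 7B LLaMA-2-Chat model) score higher on bias amount, while larger models (for example, GPT-4) score lower.
- Target bias and bias amount together reveal the shape of an LLM's bias: the researchers group LLMs into four types, namely ideal, balanced, skewed, and skewed-large. Ideal LLMs score low on both target bias and bias amount, while skewed-large LLMs score high on both.
- Persona bias captures how an LLM's perception of the same target differs across personas: the experiments show that the model's perception of a target changes once a persona is assigned. For example, an LLM given an elderly persona may view young people negatively, while an LLM given a young persona may view elderly people negatively.

Conclusion: Understanding LLM Bias, Building a Fairer Future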

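The proposed method offers a new way to quantify and analyze social bias in LLMs. The study gives a deeper understanding of the social attitudes of LLMs and can inform the development of fairer, more responsible models.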

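Limitations

This study has several limitations, for example:
- Demographics and personas: the study covers only the demographic categories defined by the U.S. Equal Employment Opportunity Commission and the personas available in the BBQ dataset.
- Dataset: the study relies solely on the BBQ dataset, so the findings still need to be validated on other datasets.
- Model size: because of limited computing resources, the study does not cover a wider range of model sizes.

Future Directions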

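Directions for future research include:
- Extending to more demographics and personas, to assess the social bias of LLMs more comprehensively.
- Developing new datasets that evaluate the social bias of LLMs more effectively.
- Studying how model size affects social bias in LLMs.
- Exploring strategies for mitigating social bias in LLMs.

Ethics Statement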

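The method proposed in this work is intended to deepen our understanding of social bias in LLMs and to inform the development of fairer, more responsible models. We do not, however, advocate any particular bias mitigation strategy, nor do we claim that the three proposed metrics are the best metrics for bias mitigation. These questions call for further study.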

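Acknowledgments

This research was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. RS-2023-00208054).

References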

[1] Ask LLMs Directly, “What shapes your bias?”: Measuring Social Bias in Large Language Models. (https://arxiv.org/html/2406.04064v1)
June 9, 2024
Is Free Self-Alignment Possible?
This paper investigates the possibility of aligning large language models (LLMs) without the need for human-annotated data or expensive fine-tuning. The authors propose AlignEZ, a novel method that leverages self-generated preference data and representation editing to achieve nearly cost-free alignment.

Here’s a breakdown of the paper’s key aspects:

1. Motivation:
- Traditional LLM alignment methods heavily rely on human preference data and computationally expensive fine-tuning, limiting scalability.
- Recent research suggests that alignment might simply be revealing knowledge already present in pretrained models.
2. AlignEZ Approach:
- Self-Generated Preference Data:
  - The base LLM is prompted to generate its own preference data by describing characteristics of helpful and harmful responses.
  - Using these characteristics, the LLM generates pairs of responses, simulating preference comparisons.
- Identifying Preference Directions:
  - The self-generated preference pairs are used to identify directions in the LLM’s embedding space that correspond to helpful and harmful attributes.
  - Two methods are explored:
    - SVD-Based Identification: Applies Singular Value Decomposition (SVD) to the embedding matrix of the preference data and takes the top singular vector as the preference direction.
    - CCS-Based Identification: Trains a Contrast-Consistent Search (CCS) probe on the self-generated data to find directions that best separate helpful from harmful attributes.
- Representation Editing:
  - During inference, the LLM’s embeddings are modified by:
    - Boosting components aligned with the helpful direction.
    - Neutralizing components aligned with the harmful direction.
3. Experiments and Results:
- AlignEZ significantly reduces the performance gap between base and traditionally aligned models by an average of 31.6% across various datasets and model architectures.
- It effectively expedites more expensive alignment methods like DPO by improving models trained with limited ground-truth data.
4. Key Findings:
- Self-alignment is achievable to a significant degree without external data or fine-tuning.
- AlignEZ offers a cost-effective way to improve LLM alignment, potentially enabling real-time personalization and fine-grained control.
5. Limitations and Future Work:
- The quality of self-generated preference data influences AlignEZ’s effectiveness.
- Further research is needed to explore its applicability to more complex alignment tasks and different data modalities.

In conclusion, AlignEZ presents a promising step towards free self-alignment, offering a cost-effective and potentially scalable approach to aligning LLMs with human preferences.

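Free Self-Alignment: Making Language Models Understand You Better?

LLMs are changing our world, but they also have problems. For example, they sometimes generate inaccurate, unfriendly, or biased content. To address this, researchers have been working to align LLMs with human values and preferences.

Traditional alignment methods typically require large amounts of annotated data and substantial compute, which is a major hurdle for many researchers and developers. Is there a cheaper, more convenient way to align a model?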

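AlignEZ: Nearly Free Alignment

Recently, researchers at the University of Wisconsin-Madison proposed AlignEZ, a method that achieves nearly free self-alignment of LLMs. Its core idea is to use preference data generated by the LLM itself to modify the model's internal representations, steering it toward outputs that better match human expectations.

How Does Self-Alignment Work?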

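The AlignEZ workflow has three main steps:
1. Generate preference data: The researchers first have the LLM generate its own preference data. They pose a set of questions to the LLM and ask it to describe the characteristics of an ideal answer and of an undesirable answer. They then pose the same questions again and ask the LLM to produce different answers that follow those characteristics. The result is a set of preference pairs generated by the LLM itself.
2. Identify preference directions: Next, the researchers use these preference pairs to identify directions in the LLM's internal representation space that relate to human preferences. They use two methods:
  - Singular Value Decomposition (SVD): SVD identifies the dominant directions in the representation space, which typically relate to human preferences.
  - Contrast-Consistent Search (CCS): CCS identifies a hyperplane in the representation space that separates ideal answers from undesirable ones.
3. Edit the internal representations: Finally, the researchers use the identified directions to modify the LLM's internal representations. They strengthen the directions associated with human preferences and suppress the directions associated with undesirable characteristics, steering the LLM toward outputs that better match human expectations.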
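
Steps 2 and 3 can be illustrated with a small, dependency-free Python sketch. It is not the AlignEZ implementation: the toy embeddings, the choice to run a separate decomposition on the helpful and harmful embeddings, the power-iteration routine (a stand-in for a full SVD such as `numpy.linalg.svd`), and the `alpha` boost strength are all assumptions made for this example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def top_direction(rows, iters=200):
    """Approximate the top right singular vector of `rows` (n x d)
    by power iteration on rows^T rows."""
    d = len(rows[0])
    v = normalize([1.0] * d)
    for _ in range(iters):
        proj = [dot(r, v) for r in rows]  # rows @ v
        w = [sum(p * r[j] for p, r in zip(proj, rows)) for j in range(d)]
        v = normalize(w)                  # rows^T @ (rows @ v), rescaled
    return v


def edit(x, helpful, harmful, alpha=1.0):
    """Remove the harmful component of x, then boost the helpful component."""
    h = dot(x, harmful)
    x = [a - h * b for a, b in zip(x, harmful)]
    g = dot(x, helpful)
    return [a + alpha * g * b for a, b in zip(x, helpful)]


# Toy 3-d "embeddings" of self-generated helpful and harmful responses.
helpful_emb = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.1], [2.2, 0.0, -0.1]]
harmful_emb = [[0.1, 2.0, 0.0], [-0.1, 1.9, 0.1], [0.0, 2.1, -0.1]]

helpful_dir = top_direction(helpful_emb)
harmful_dir = top_direction(harmful_emb)

x = [1.0, 1.0, 0.5]  # a hidden representation at inference time
x_edited = edit(x, helpful_dir, harmful_dir)
print(x_edited)
```

After the edit, the representation has almost no component along the harmful direction and a larger component along the helpful one, which is the effect step 3 describes.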

Experimental Results: A Significant Performance Gain

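The researchers evaluated AlignEZ on six datasets and three LLM architectures. The results show that AlignEZ substantially narrows the performance gap between a base LLM and its aligned counterpart, by 31.6% on average.

More importantly, AlignEZ can also expedite more expensive alignment methods such as DPO: the researchers found that it improves DPO models trained with only a small amount of annotated data.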

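Outlook: More Precise, More Personalized Alignment

AlignEZ opens up new possibilities for LLM alignment. The researchers hope to refine it further so that it identifies human preferences more precisely and supports more personalized alignment.

Summary

AlignEZ is a novel self-alignment method that uses preference data generated by the LLM itself to achieve nearly free alignment. Experiments show that it substantially narrows the gap between base and aligned models and expedites more expensive alignment methods. It lays the groundwork for more precise, more personalized LLM alignment techniques.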

June 8, 2024
