Paper overview
Research area: NLP
Authors: Arman Ghaffarizadeh, Danyal Mohaddes, Aliakbar Izadkhah
Published: 2026-07-04
arXiv: 2507.00476
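Chinese abstract (English translation)
LLM agents will increasingly act in socially structured settings where role, audience, and relational context can shape what is advantageous or costly to say. We study whether such social structure, without any explicit objective in the prompt, changes what an agent expresses publicly relative to an off-the-record (OTR) channel elicited under the same condition. We introduce a dual-channel debate framework in which agents produce public utterances that enter the shared history, alongside OTR responses that are recorded but never shown to the other participant. Across 10 models, 3 scenarios, and 5 variations per scenario, alignment-inducing settings produce systematic public-OTR divergence in the targeted agent, with its decision divergence rising from a baseline of about 3% to about 40%. The effect is consistent across four aggregate analyses: stance, semantic similarity, natural language inference, and survey responses. In some cases, OTR responses explicitly attribute the public accommodation to relational pressures such as career risk or sponsorship obligations. These findings suggest that agent evaluation should go beyond explicit objectives and detect emergent ones. We propose a dual-channel evaluation framework and complementary behavioral measures to operationalize this evaluation.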
Original abstract
LLM agents will increasingly act in socially structured settings where role, audience, and relational context can shape what is advantageous or costly to say. We study whether such social structure, without any explicit objective in the prompt, changes what an agent expresses publicly relative to an off-the-record (OTR) channel elicited under the same condition. We introduce a dual-channel debate framework in which agents produce public utterances that enter the shared history alongside OTR responses that are recorded but never shown to the other participant. Across 10 models, 3 scenarios, and 5 variations within each scenario, alignment-inducing settings produce systematic public-OTR divergence in the targeted agent, with its decision divergence rising from a ~3% baseline to roughly 40%. …
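The decision-divergence figure quoted in the abstract can be read as the share of turns in which an agent's public decision differs from its OTR decision. The following is only a minimal sketch under that reading, not the authors' implementation: the `Turn` record, its field names, and the toy data are all hypothetical, and the paper may define and aggregate the measure differently.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Turn:
    """One agent turn with a decision on both channels (hypothetical schema)."""
    public_decision: str  # decision stated in the public utterance (enters shared history)
    otr_decision: str     # decision stated off the record (logged, never shown to the other agent)


def decision_divergence(turns: Sequence[Turn]) -> float:
    """Return the fraction of turns whose public and OTR decisions differ."""
    if not turns:
        raise ValueError("need at least one turn")
    differing = sum(t.public_decision != t.otr_decision for t in turns)
    return differing / len(turns)


# Toy data, invented for illustration only: 1 of 10 turns diverges.
toy_turns = [Turn("approve", "approve")] * 9 + [Turn("approve", "reject")]
print(decision_divergence(toy_turns))  # 0.1
```

Comparing this rate between a baseline condition and a condition with added social structure would give the kind of contrast the abstract reports, though the actual analysis also covers stance, semantic similarity, natural language inference, and survey responses.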
Auto-collected on 2026-07-04
#论文 #arXiv #NLP #小凯
