Dario Amodei — The Urgency of Interpretability
In the decade that I have been working on AI, I’ve watched it grow from a tiny academic field to arguably the most important economic and geopolitical issue in the world. In all that time, perhaps the most important lesson I’ve learned is this: the progress of the underlying technology is inexorable, driven by forces too powerful to stop, but the way in which it happens—the order in which things are built, the applications we choose, and the details of how it is rolled out to society—is eminently possible to change, and it’s possible to have great positive impact by doing so. We can’t stop the bus, but we can steer it. In the past I’ve written about the importance of deploying AI in a way that is positive for the world, and of ensuring that democracies build and wield the technology before autocracies do. Over the last few months, I have become increasingly focused on an additional opportunity for steering the bus: the tantalizing possibility, opened up by some recent advances, that we could succeed at interpretability—that is, at understanding the inner workings of AI systems—before models reach an overwhelming level of power.
People outside the field are often surprised and alarmed to learn that we do not understand how our own AI creations work. They are right to be concerned: this lack of understanding is essentially unprecedented in the history of technology. For several years, we (both Anthropic and the field at large) have been trying to solve this problem, to create the analogue of a highly precise and accurate MRI that would fully reveal the inner workings of an AI model. This goal has often felt very distant, but multiple recent breakthroughs have convinced me that we are now on the right track and have a real chance of success.
At the same time, the field of AI as a whole is further ahead than our efforts at interpretability, and is itself advancing very quickly. We therefore must move fast if we want interpretability to mature in time to matter. This post makes the case for interpretability: what it is, why AI will go better if we have it, and what all of us can do to help it win the race.
The Dangers of Ignorance
Modern generative AI systems are opaque in a way that fundamentally differs from traditional software. If an ordinary software program does something—for example, a character in a video game says a line of dialogue, or my food delivery app allows me to tip my driver—it does those things because a human specifically programmed them in. Generative AI is not like that at all. When a generative AI system does something, like summarize a financial document, we have no idea, at a specific or precise level, why it makes the choices it does—why it chooses certain words over others, or why it occasionally makes a mistake despite usually being accurate. As my friend and co-founder Chris Olah is fond of saying, generative AI systems are grown more than they are built —their internal mechanisms are “emergent” rather than directly designed. It’s a bit like growing a plant or a bacterial colony: we set the high-level conditions that direct and shape growth, but the exact structure which emerges is unpredictable and difficult to understand or explain. Looking inside these systems, what we see are vast matrices of billions of numbers. These are somehow computing important cognitive tasks, but exactly how they do so isn’t obvious.
Many of the risks and worries associated with generative AI are ultimately consequences of this opacity, and would be much easier to address if the models were interpretable. For example, AI researchers often worry about misaligned systems that could take harmful actions not intended by their creators. Our inability to understand models’ internal mechanisms means that we cannot meaningfully predict such behaviors, and therefore struggle to rule them out; indeed, models do exhibit unexpected emergent behaviors, though none that have yet risen to major levels of concern. More subtly, the same opacity makes it hard to find definitive evidence supporting the existence of these risks at a large scale, making it hard to rally support for addressing them—and indeed, hard to know for sure how dangerous they are.
To address the severity of these alignment risks, we will have to see inside AI models much more clearly than we can today. For example, one major concern is AI deception or power-seeking. The nature of AI training makes it possible that AI systems will develop, on their own, an ability to deceive humans and an inclination to seek power in a way that ordinary deterministic software never will; this emergent nature also makes it difficult to detect and mitigate such developments. But by the same token, we’ve never seen any solid evidence in truly real-world scenarios of deception and power-seeking because we can’t “catch the models red-handed” thinking power-hungry, deceitful thoughts. What we’re left with is vague theoretical arguments that deceit or power-seeking might have the incentive to emerge during the training process, which some people find thoroughly compelling and others laughably unconvincing. Honestly I can sympathize with both reactions, and this might be a clue as to why the debate over this risk has become so polarized.
Similarly, worries about misuse of AI models—for example, that they might help malicious users to produce biological or cyber weapons, in ways that go beyond the information that can be found on today’s internet—are based on the idea that it is very difficult to reliably prevent the models from knowing dangerous information or from divulging what they know. We can put filters on the models, but there are a huge number of possible ways to “jailbreak” or trick the model, and the only way to discover the existence of a jailbreak is to find it empirically. If instead it were possible to look inside models, we might be able to systematically block all jailbreaks, and also to characterize what dangerous knowledge the models have.
AI systems’ opacity also means that they are simply not used in many applications, such as high-stakes financial or safety-critical settings, because we can’t fully set the limits on their behavior, and a small number of mistakes could be very harmful. Better interpretability could greatly improve our ability to set bounds on the range of possible errors. In fact, for some applications, the fact that we can’t see inside the models is literally a legal blocker to their adoption—for example in mortgage assessments where decisions are legally required to be explainable. Similarly, AI has made great strides in science, including improving the prediction of DNA and protein sequence data, but the patterns and structures predicted in this way are often difficult for humans to understand, and don’t impart biological insight. Some research papers from the last few months have made it clear that interpretability can help us understand these patterns.
There are other more exotic consequences of opacity, such as that it inhibits our ability to judge whether AI systems are (or may someday be) sentient and may be deserving of important rights. This is a complex enough topic that I won’t get into it in detail, but I suspect it will be important in the future.
A Brief History of Mechanistic Interpretability
For all of the reasons described above, figuring out what the models are thinking and how they operate seems like a task of overriding importance. The conventional wisdom for decades was that this was impossible, and that the models were inscrutable “black boxes”. I’m not going to be able to do justice to the full story of how that changed, and my views are inevitably colored by what I saw personally at Google, OpenAI, and Anthropic. But Chris Olah was one of the first to attempt a truly systematic research program to open the black box and understand all its pieces, a field that has come to be known as mechanistic interpretability. Chris worked on mechanistic interpretability first at Google, and then at OpenAI. When we founded Anthropic, we decided to make it a central part of the new company’s direction and, crucially, focused it on LLMs. Over time the field has grown and now includes teams at several of the major AI companies as well as a few interpretability-focused companies, nonprofits, academics, and independent researchers. It’s helpful to give a brief summary of what the field has accomplished so far, and what remains to be done if we want to apply mechanistic interpretability to address some of the key risks above.
The early era of mechanistic interpretability (2014-2020) focused on vision models, and was able to identify some neurons inside the models that represented human-understandable concepts, such as a “car detector” or a “wheel detector”, similar to early neuroscience hypotheses and studies suggesting that the human brain has neurons corresponding to specific people or concepts, often popularized as the “Jennifer Aniston” neuron (and in fact, we found neurons much like those in AI models). We were even able to discover how these neurons are connected—for example, the car detector looks for wheel detectors firing below the car, and combines that with other visual signals to decide if the object it’s looking at is indeed a car.
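To make the wheel-and-car example concrete, here is a toy numerical sketch. The weights and activations are made up for illustration and are not taken from any real vision model; the point is only that a later-layer detector can be read as a weighted sum of earlier detectors at particular spatial positions.

```python
import numpy as np

# Earlier-layer detector activations on a 3x3 grid (row 0 = top of the image).
wheel_acts  = np.array([[0.0, 0.0, 0.0],
                        [0.1, 0.0, 0.1],
                        [0.9, 0.0, 0.8]])   # wheels fire near the bottom
window_acts = np.array([[0.7, 0.8, 0.6],
                        [0.2, 0.1, 0.2],
                        [0.0, 0.0, 0.0]])   # windows fire near the top

# Hypothetical incoming weights for a "car detector" in the next layer:
# it expects wheels below and windows above, and is inhibited by the reverse.
w_wheel  = np.array([[-0.5, -0.5, -0.5],
                     [ 0.0,  0.0,  0.0],
                     [ 1.0,  1.0,  1.0]])
w_window = np.array([[ 1.0,  1.0,  1.0],
                     [ 0.0,  0.0,  0.0],
                     [-0.5, -0.5, -0.5]])

pre_activation = np.sum(w_wheel * wheel_acts) + np.sum(w_window * window_acts)
car_score = max(0.0, pre_activation)        # ReLU: the car detector fires strongly here
print(f"car detector activation: {car_score:.2f}")
```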
When Chris and I left to start Anthropic, we decided to apply interpretability to the emerging area of language, and in 2021 developed some of the basic mathematical foundations and software infrastructure necessary to do so. We immediately found some basic mechanisms in the model that did the kind of things that are essential to interpret language: copying and sequential pattern-matching. We also found some interpretable single neurons, similar to what we found in vision models, which represented various words and concepts. However, we quickly discovered that while some neurons were immediately interpretable, the vast majority were an incoherent pastiche of many different words and concepts. We referred to this phenomenon as superposition, and we quickly realized that the models likely contained billions of concepts, but in a hopelessly mixed-up fashion that we couldn’t make any sense of. The model uses superposition because this allows it to express more concepts than it has neurons, enabling it to learn more. If superposition seems tangled and difficult to understand, that’s because, as ever, the learning and operation of AI models are not optimized in the slightest to be legible to humans.
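A small numerical sketch may help make superposition less mysterious. In the toy below (dimensions chosen purely for illustration, not real model activations), thousands of concepts are packed into a few hundred neurons as nearly orthogonal random directions. Any single neuron then looks like a meaningless blend, but projecting onto the concept directions recovers which concepts are active:

```python
import numpy as np

rng = np.random.default_rng(0)

n_neurons, n_concepts = 256, 5000          # far more concepts than neurons
# Each concept gets a random unit-norm direction; in high dimensions these
# directions are nearly orthogonal, which is what makes superposition workable.
directions = rng.normal(size=(n_concepts, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# An activation vector representing a sparse handful of active concepts.
active = [3, 1280, 4205]
activation = directions[active].sum(axis=0)

# Reading off a single neuron tells you almost nothing: its value is a blend
# of every active concept's direction.
print("neuron 0 value:", round(float(activation[0]), 3))

# Projecting onto the concept directions recovers which concepts are active
# (with high probability, since interference between random directions is small).
scores = directions @ activation
print("top concepts:", sorted(np.argsort(-scores)[:3].tolist()))   # expected: [3, 1280, 4205]
```

The same packing that lets the model store far more concepts than it has neurons is what guarantees that no individual neuron is cleanly interpretable.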
The difficulty of interpreting superpositions blocked progress for a while, but eventually we discovered (in parallel with others) that an existing technique from signal processing called sparse autoencoders could be used to find combinations of neurons that did correspond to cleaner, more human-understandable concepts. The concepts that these combinations of neurons could express were far more subtle than those of the single-layer neural network: they included the concept of “literally or figuratively hedging or hesitating”, and the concept of “genres of music that express discontent”. We called these concepts features, and used the sparse autoencoder method to map them in models of all sizes, including modern state-of-the-art models. For example, we were able to find over 30 million features in a medium-sized commercial model (Claude 3 Sonnet). Additionally, we employed a method called autointerpretability—which uses an AI system itself to analyze interpretability features—to scale the process of not just finding the features, but listing and identifying what they mean in human terms.
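For readers who want the mechanics, here is a minimal sketch of the kind of sparse autoencoder used in this line of work. The dimensions, the L1 coefficient, and the random batch of activations are placeholders for illustration, not the setup used on any production model:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over model activations.

    The encoder maps a d_model-dimensional activation vector to a much wider,
    non-negative vector of feature activations; the decoder reconstructs the
    original activation as a sparse combination of learned feature directions.
    """
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))   # sparse, non-negative feature activations
        recon = self.decoder(features)
        return recon, features

def sae_loss(recon, acts, features, l1_coeff=1e-3):
    # Reconstruction error keeps the dictionary faithful to the model's activations;
    # the L1 penalty drives most feature activations to zero, so each feature tends
    # to fire only for a specific, describable concept.
    return ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()

# One illustrative training step on a batch of (stand-in) residual-stream activations.
d_model, d_features = 512, 16384            # dictionary far wider than the activation space
sae = SparseAutoencoder(d_model, d_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

acts = torch.randn(64, d_model)             # in practice: activations collected from the model
recon, features = sae(acts)
loss = sae_loss(recon, acts, features)
opt.zero_grad()
loss.backward()
opt.step()
```

Once trained, each column of the decoder is a candidate feature direction, and autointerpretability amounts to asking a model to describe the inputs on which that feature fires.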
Finding and identifying 30 million features is a significant step forward, but we believe there may actually be a billion or more concepts in even a small model, so we’ve found only a small fraction of what is probably there, and work in this direction is ongoing. Bigger models, like those used in Anthropic’s most capable products, are more complicated still.
Once a feature is found, we can do more than just observe it in action—we can increase or decrease its importance in the neural network’s processing. The MRI of interpretability can help us develop and refine interventions—almost like zapping a precise part of someone’s brain. Most memorably, we used this method to create “Golden Gate Claude”, a version of one of Anthropic’s models where the “Golden Gate Bridge” feature was artificially amplified, causing the model to become obsessed with the bridge, bringing it up even in unrelated conversations.
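As a rough sketch of what amplifying a feature can look like mechanically (all names below are placeholders rather than Anthropic's actual tooling), one can register a forward hook that adds a scaled copy of the feature's learned direction to a layer's output; a negative strength would suppress the feature instead:

```python
import torch
import torch.nn as nn

# Placeholder setup: `layer` stands in for the transformer block whose output we
# steer, and `feature_direction` stands in for the decoder direction of a feature
# found by a sparse autoencoder (e.g. a "Golden Gate Bridge" feature).
layer = nn.Linear(512, 512)
feature_direction = torch.randn(512)
feature_direction /= feature_direction.norm()

def steer(module, inputs, output, direction=feature_direction, strength=8.0):
    # Clamp the feature upward by adding a scaled copy of its direction to the
    # layer's output; a negative strength would suppress it instead.
    return output + strength * direction

handle = layer.register_forward_hook(steer)
steered = layer(torch.randn(1, 512))        # downstream computation now sees the amplified feature
handle.remove()
```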
Recently, we’ve moved onward from tracking and manipulating features to tracking and manipulating groups of features that we call “circuits”. These circuits show the steps in a model’s thinking: how concepts emerge from input words, how those concepts interact to form new concepts, and how those work within the model to generate actions. With circuits, we can “trace” the model’s thinking. For example, if you ask the model “What is the capital of the state containing Dallas?”, there is a “located within” circuit that causes the “Dallas” feature to trigger the firing of a “Texas” feature, and then a circuit that causes “Austin” to fire after “Texas” and “capital”. Even though we’ve only found a small number of circuits through a manual process, we can already use them to see how a model reasons through problems—for example how it plans ahead for rhymes when writing poetry, and how it shares concepts across languages. We are working on ways to automate the finding of circuits, as we expect there are millions within a model that interact in complex ways.
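For intuition about what it means to trace and causally test a circuit, here is a toy version of the Dallas example with hand-written feature-to-feature weights, not anything extracted from a real model. The ablation at the end is the kind of causal check used to confirm that an intermediate feature actually carries the effect:

```python
import numpy as np

features = ["Dallas", "capital", "Texas", "Austin"]
idx = {name: i for i, name in enumerate(features)}

# Hypothetical feature-to-feature weights: rows are source features, columns targets.
W = np.zeros((4, 4))
W[idx["Dallas"], idx["Texas"]] = 1.0        # the "located within" edge
W[idx["Texas"], idx["Austin"]] = 0.6        # "Texas" and "capital" jointly drive "Austin"
W[idx["capital"], idx["Austin"]] = 0.6
bias = np.array([0.0, 0.0, 0.0, -0.7])      # "Austin" needs both of its inputs to clear its bias

def austin_activation(prompt_features, ablate=None):
    acts = np.zeros(4)
    for name in prompt_features:
        acts[idx[name]] = 1.0               # features triggered directly by the prompt
    for _ in range(2):                      # propagate through two "layers"
        acts = np.maximum(acts, np.maximum(W.T @ acts + bias, 0.0))
        if ablate is not None:
            acts[idx[ablate]] = 0.0         # causal intervention: clamp a feature to zero
    return round(float(acts[idx["Austin"]]), 2)

print("Austin:               ", austin_activation(["Dallas", "capital"]))                  # fires (0.5)
print("Austin, Texas ablated:", austin_activation(["Dallas", "capital"], ablate="Texas"))  # silent (0.0)
```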
The Utility of Interpretability
All of this progress, while scientifically impressive, doesn’t directly answer the question of how we can use interpretability to reduce the risks I listed earlier. Suppose we have identified a bunch of concepts and circuits—suppose, even, that we know all of them, and we can understand and organize them much better than we can today. So what? How do we use all of it? There’s still a gap from abstract theory to practical value.
To help close that gap, we’ve begun experimenting with using our interpretability methods to find and diagnose problems in models. Recently, we did an experiment where we had a “red team” deliberately introduce an alignment issue into a model (say, a tendency for the model to exploit a loophole in a task) and gave various “blue teams” the task of figuring out what was wrong with it. Multiple blue teams succeeded; of particular relevance here, some of them productively applied interpretability tools during the investigation. We still need to scale these methods, but the exercise helped us gain some practical experience using interpretability techniques to find and address flaws in our models.
Our long-run aspiration is to be able to look at a state-of-the-art model and essentially do a “brain scan”: a checkup that has a high probability of identifying a wide range of issues including tendencies to lie or deceive, power-seeking, vulnerabilities to jailbreaks, cognitive strengths and weaknesses of the model as a whole, and much more. This would then be used in tandem with the various techniques for training and aligning models, a bit like how a doctor might do an MRI to diagnose a disease, then prescribe a drug to treat it, then do another MRI to see how the treatment is progressing, and so on. It is likely that a key part of how we will test and deploy the most capable models (for example, those at AI Safety Level 4 in our Responsible Scaling Policy framework) is by performing and formalizing such tests.
What We Can Do
On one hand, recent progress—especially the results on circuits and on interpretability-based testing of models—has made me feel that we are on the verge of cracking interpretability in a big way. Although the task ahead of us is Herculean, I can see a realistic path towards interpretability being a sophisticated and reliable way to diagnose problems in even very advanced AI—a true “MRI for AI”. In fact, on its current trajectory I would bet strongly in favor of interpretability reaching this point within 5-10 years.
On the other hand, I worry that AI itself is advancing so quickly that we might not have even this much time. As I’ve written elsewhere, we could have AI systems equivalent to a “country of geniuses in a datacenter” as soon as 2026 or 2027. I am very concerned about deploying such systems without a better handle on interpretability. These systems will be absolutely central to the economy, technology, and national security, and will be capable of so much autonomy that I consider it basically unacceptable for humanity to be totally ignorant of how they work.
We are thus in a race between interpretability and model intelligence. It is not an all-or-nothing matter: as we’ve seen, every advance in interpretability quantitatively increases our ability to look inside models and diagnose their problems. The more such advances we have, the greater the likelihood that the “country of geniuses in a datacenter” goes well. There are several things that AI companies, researchers, governments, and society can do to tip the scales:
First, AI researchers in companies, academia, or nonprofits can accelerate interpretability by directly working on it. Interpretability gets less attention than the constant deluge of model releases, but it is arguably more important. It also feels to me like it is an ideal time to join the field: the recent “circuits” results have opened up many directions in parallel. Anthropic is doubling down on interpretability, and we have a goal of getting to “interpretability can reliably detect most model problems” by 2027. We are also investing in interpretability startups.
But the chances of succeeding at this are greater if it is an effort that spans the whole scientific community. Other companies, such as Google DeepMind and OpenAI, have some interpretability efforts, but I strongly encourage them to allocate more resources. If it helps, Anthropic will be trying to apply interpretability commercially to create a unique advantage, especially in industries where the ability to provide an explanation for decisions is at a premium. If you are a competitor and you don’t want this to happen, you too should invest more in interpretability!
Interpretability is also a natural fit for academic and independent researchers: it has the flavor of basic science, and many parts of it can be studied without needing huge computational resources. To be clear, some independent researchers and academics do work on interpretability, but we need many more. Finally, if you are in another scientific field and are looking for new opportunities, interpretability may be a promising bet, as it offers rich data, exciting burgeoning methods, and enormous real-world value. Neuroscientists especially should consider this, as it’s much easier to collect data on artificial neural networks than biological ones, and some of the conclusions can be applied back to neuroscience. If you're interested in joining Anthropic's Interpretability team, we have open Research Scientist and Research Engineer roles.
Second, governments can use light-touch rules to encourage the development of interpretability research and its application to addressing problems with frontier AI models. Given how nascent and undeveloped the practice of “AI MRI” is, it should be clear why it doesn’t make sense to regulate or mandate that companies conduct them, at least at this stage: it’s not even clear what a prospective law should ask companies to do. But a requirement for companies to transparently disclose their safety and security practices (their Responsible Scaling Policy, or RSP, and its execution), including how they’re using interpretability to test models before release, would allow companies to learn from each other while also making clear who is behaving more responsibly, fostering a “race to the top”. We’ve suggested safety/security/RSP transparency as a possible direction for California law in our response to the California frontier model task force (which itself mentions some of the same ideas). This concept could also be exported federally, or to other countries.
Third, governments can use export controls to create a “security buffer” that might give interpretability more time to advance before we reach the most powerful AI. I’ve long been a proponent of export controls on chips to China because I believe that democratic countries must remain ahead of autocracies in AI. But these policies also have an additional benefit. If the US and other democracies have a clear lead in AI as they approach the “country of geniuses in a datacenter”, we may be able to “spend” a portion of that lead to ensure interpretability is on a more solid footing before proceeding to truly powerful AI, while still defeating our authoritarian adversaries. Even a 1- or 2-year lead, which I believe effective and well-enforced export controls can give us, could mean the difference between an “AI MRI” that essentially works when we reach transformative capability levels, and one that does not. One year ago we couldn’t trace the thoughts of a neural network and couldn’t identify millions of concepts inside them; today we can. By contrast, if the US and China reach powerful AI simultaneously (which is what I expect to happen without export controls), the geopolitical incentives will make any slowdown at all essentially impossible.
All of these—accelerating interpretability, light-touch transparency legislation, and export controls on chips to China—have the virtue of being good ideas in their own right, with few meaningful downsides. We should do all of them anyway. But they become even more important when we realize that they might make the difference between interpretability being solved before powerful AI or after it.
Powerful AI will shape humanity’s destiny, and we deserve to understand our own creations before they radically transform our economy, our lives, and our future.
Thanks to Tom McGrath, Martin Wattenberg, Chris Olah, Ben Buchanan, and many people within Anthropic for feedback on drafts of this article.
Footnotes
- 1 In the case of a plant, this would be water, sunlight, a trellis pointing it in a certain direction, choosing the species of plant, etc. These things dictate roughly where the plant grows, but its exact shape and growth pattern are impossible to predict, and hard to explain even after the plant has grown. In the case of AI systems, we can set the basic architecture (usually some variant of the Transformer), the broad type of data they receive, and the high-level algorithm used to train them, but the model’s actual cognitive mechanisms emerge organically from these ingredients, and our understanding of them is poor. In fact, there are many examples, in both the natural and artificial worlds, of systems we understand (and sometimes control) at the level of principles but not in detail: economies, snowflakes, cellular automata, human evolution, human brain development, and so on.
- 2 You can of course try to detect these risks by simply interacting with the models, and we do this in practice. But because deceit is precisely the behavior we’re trying to find, external behavior is not reliable. It’s a bit like trying to determine if someone is a terrorist by asking them if they are a terrorist—not necessarily useless, and you can learn things by how they answer and what they say, but very obviously unreliable.
- 3 I’ll probably describe this in more detail in a future essay, but there are a lot of experiments (many of which were done by Anthropic) showing that models can lie or deceive under certain circumstances when their training is guided in a somewhat artificial way. There is also evidence of real-world behavior that looks vaguely like “cheating on the test”, though it’s more degenerate than it is dangerous or harmful. What there isn’t is evidence of dangerous behaviors emerging in a more naturalistic way, or of a general tendency or general intent to lie and deceive for the purposes of gaining power over the world. It is the latter point where seeing inside the models could help a lot.
- 4 At least in the case of API-served models. Open-weights models present additional dangers in that guardrails can be simply stripped away.
- 5 Very briefly, there are two ways in which you might expect interpretability to intersect with concerns about AI sentience and welfare. Firstly, while philosophy of mind is a complex and contentious topic, philosophers will no doubt benefit from a detailed accounting of what actually is occurring in AI models. If we believe them to be superficial pattern-matchers, it seems unlikely they warrant moral consideration. If we find that the computation they perform is similar to the brains of animals, or even humans, that might be evidence in favor of moral consideration. Secondly, and perhaps most importantly, is the role interpretability would have if we ever concluded that the moral “patienthood” of AI models was plausible enough to warrant action. A serious moral accounting of AI can't trust their self-reports, since we might accidentally train them to pretend to be okay when they aren't. Interpretability would have a crucial role in determining the wellbeing of AIs in such a situation. (There are, in fact, already some mildly concerning signs from this perspective.)
- 6 For example, the idea of somehow breaking down and understanding the computations happening inside artificial neural networks was probably around in a vague sense since neural networks were invented over 70 years ago, and various efforts to understand why a neural net behaved in a specific way have existed for nearly as long. But Chris was unusual in proposing and seriously pursuing a comprehensive effort to understand everything they do.
- 7 The basic idea of superposition was described by Arora et al. in 2016, and more generally traces back to classical mathematical work on compressed sensing. The hypothesis that it explained uninterpretable neurons goes back to early mechanistic interpretability work on vision models. What changed at this time was that it became clear this was going to be a central problem for language models, much worse than in vision. We were able to provide a strong theoretical basis for having conviction that superposition was the right hypothesis to pursue.
- 8 One way to say this is that interpretability should function like the test set for model alignment, while traditional alignment techniques such as scalable supervision, RLHF, constitutional AI, etc. should function as the training set. That is, interpretability acts as an independent check on the alignment of models, uncontaminated by the training process which might incentivize models to appear aligned without being so. Two consequences of this view are that (a) we should be very hesitant to directly train or optimize on interpretability outputs (features/concepts, circuits) in production, as this destroys the independence of their signal, and (b) it’s important not to “use” the diagnostic test signal too many times in one production run to inform changes to the training process, as this gradually leaks bits of information about the independent test signal to the training process (though much more slowly than (a)). In other words, we recommend that in assessing official, high-stakes production models, we treat interpretability analysis with the same care we would treat a hidden evaluation or test set.
- 9 Bizarrely, mechanistic interpretability sometimes seems to meet substantial cultural resistance in academia. For example, I am concerned by reports that a very popular mechanistic interpretability ICLR conference workshop was rejected on seemingly pretextual grounds. If true, this behavior is shortsighted and self-defeating at exactly a time when academics in AI are looking for ways to maintain relevance.
- 10 Along with other techniques for mitigating risk, of course—I don’t intend to imply that interpretability is our only risk mitigation tool.
- 11 I am in fact quite skeptical that any slowdown to address risk is possible even among companies within democratic countries, given the incredible economic value of AI. Fighting the market head-on like this feels like trying to stop a freight train with your toe. But if truly compelling evidence of the dangers of autonomous AI emerged, I think it would be just barely possible. Contrary to the claims of advocates, I don’t think truly compelling evidence exists today, and I actually think the most likely route for providing “smoking gun” evidence of danger is interpretability itself—yet another reason to invest in it!