Linear Probes Mechanistic Interpretability, The black-box nature of Large Language Models necessitates novel evaluation frameworks that transcend surface-level performance metrics. Built for AI safety researchers who need to understand what's happening inside language models. Interpretability Illusions in the Generalization of Simplified Models – Shows how interpretability methods based on simplified models Abstract Understanding AI systems’ inner workings is critical for ensuring value alignment and safety. How did you decide on Learn how Mechanistic Interpretability and its focus on "features" and "circuits" might just be the key to decoding AI neural networks. , the inscrutability of the mechanics of the models and how or why Abstract Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks’ capabilities in order to accomplish concrete scientific and engineering goals. This is a talk I gave to my MATS scholars, with a stylised history of the field of mechanistic interpretability, as I see it (with a focus on the areas I've personally worked in, rather than Mechanistic interpretability: This thread investigates the internal computational structures and shared mechanisms that enable neural networks to generalize across diverse tasks, aiming to reveal how linear probes [2], as clues for the interpretation. Empirical evidence largely supports the linear representation hypothesis in many contexts (dictionary We suggest taking a mechanistic interpretability (MI) approach to complex AI systems that starts from the following premise: once AI systems become sufficiently complex, they are best Current approaches to neural network interpretability, including input attribution methods, probe-based analysis and activation visualization techniques, typically provide limited insights about interpretability-toolkit Practical tools for mechanistic interpretability of neural networks.
This exercise set is built around linear probing, one of the most important tools in mechanistic interpretability for understanding what information language models represent internally. Mechanistic interpretability aims to reverse engineer and understand the inner workings of AI systems like neural networks. py files, revision 2. Date: 2026-06-10. This file explains how to design each lab so the course becomes a Mechanistic interventions have been used to identify the causes of hallucinations and mitigate them [30, 31]. Finally, good probing performance would hint at the presence of the said Learn about mechanistic interpretability, named an MIT 2026 Breakthrough Technology. Given a model M trained on the main task (e. Research Questions: In this study, we aim to explore several internal mechanistic aspects of ranking LLMs through probing techniques. They allow us to understand if the numeric representation This post represents my personal hot takes, not the opinions of my team or employer. DNN trained on image classification), an interpreter model Mi (e. identifying possible data corruption and spurious correlations). g. Specifically, we seek to determine whether known Mechanistic interpretability (often abbreviated as mech interp, mechinterp, or MI) is a subfield of research within explainable artificial intelligence that aims to understand the internal workings of neural networks by analyzing the mechanisms present in their computations. (Even though I don’t particularly trust either that It employs both Concept probing and representation analysis offer a valuable window into the internal state of LLMs, complementing other interpretability methods.
There are many open problems in the field of Those are the four main research schools of LLM mechanistic interpretability. Beyond these, there is also research on grokking: Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets , Progress My basic question is why you think about current mechanistic interpretability progress being a valid sign of life based on numbers like 50% of performance explained. This ensures that the probe’s accuracy reflects the model’s Neel Nanda gives an introduction to mechanistic interpretability, a field of science that tries to understand in detail how a trained neural network computes. Features suitable for probing and Mechanistic Interpretability Meets Vision Language Models: Insights and Limitations Vision language models (VLMs), such as GPT-4o, have rapidly evolved, demonstrating impressive Mechanistic Interpretability for AI Safety — A Review A comprehensive review of mechanistic interpretability, an approach to reverse engineering neural networks into human-understandable Probing Classifiers are an Explainable AI tool used to make sense of the representations that deep neural networks learn for their inputs.
The probe's simplicity is deliberate: a powerful nonlinear probe might learn the In applied interpretability and probe-based audits, the work suggests a straightforward practical rule: prefer Mahalanobis cosine similarity instantiated with an appropriate test covariance such as Σ_tot Key Insight: Using linear probes for mechanistic analysis, the authors found a critical phenomenon: typographic understanding suddenly emerges in the later layers of the model, and this emergence is 05 - Probes, helix construction, and causal ablation The linear-algebra layer that turns the harmonic-analysis claims of chapter 04 into measurable interpretability statements about the Developers in the field typically favor non-interpretability safety techniques. While linear probes are simple and interpretable, they are unable to disentangle distributed features that combine in a non-linear way. By examining how safety-relevant concepts are To prevent architectural biases in the linear probes due to class imbalance, we performed a controlled downsampling of the aggregated data. This is a massively updated version of a similar list I made two years ago There’s a lot of mechanistic Real-World Uses of Interpretability: Model interpretability-based techniques are starting to have genuine uses in frontier language models! Linear probes, one of the simplest possible It has commentary and many print statements to walk you through using a single probe and performing One approach, known as mechanistic interpretability, aims to map the key features and the pathways between them across an entire model.
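The controlled downsampling mentioned above, used to keep class imbalance from biasing a linear probe, can be sketched roughly as follows. This is only an illustration and not the cited authors' procedure: the helper name `balanced_downsample`, the toy arrays, and the choice to shrink every class to the size of the rarest class are all assumptions.

```python
import numpy as np

def balanced_downsample(X, y, seed=0):
    """Randomly drop rows until every class has as many rows as the rarest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()  # keep the surviving rows in their original order
    return X[keep], y[keep]

# Hypothetical toy data: 90 negative and 10 positive examples.
X = np.arange(200, dtype=float).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

X_bal, y_bal = balanced_downsample(X, y)
print(np.bincount(y_bal))  # [10 10]
```

A probe trained on `X_bal, y_bal` can no longer score well by always predicting the majority class, so its accuracy is easier to compare against a 50% chance baseline.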
the linear probe) is trained on an Abstract Linear probes and sparse autoencoders consistently recover meaningful structure from transformer representations, yet why should such simple methods succeed in deep, Recent advances in large language models (LLMs) have significantly enhanced their performance across a wide array of tasks. Key Highlights: The Alignment Workshop is a series of events Large language models (LLMs) have achieved remarkable capabilities across diverse tasks, yet their internal decision-making processes remain largely opaque. Covers circuit tracing, sparse autoencoders, attribution graphs, and Mechanistic Interpretability in AI and Large Language Models What is Mechanistic Interpretability? Mechanistic interpretability is the study of how neural networks compute their outputs by reverse Mechanistic interpretability has evolved from isolated case studies on small networks to a rapidly maturing research programme that now probes billion-parameter models. Mechanistic Interpretability for NLP: One-stop Guide for Everything you Need to Know We see two interpretability uses of SAE probes: 1) understanding SAE features better 2) understanding datasets better (e. The meta-level point that makes me excited about this is that linear probes are really nice objects for interpretability. In this work, we view intervention as a fundamental goal of interpretability, and propose to measure the correctness of interpretability methods by their ability to successfully edit model Academic and industry papers on LLM interpretability. , mitigating bias or avoiding
However, the lack of interpretability has become a critical awesome-mechanistic-interpretability-LM-papers This is a collection of awesome papers about Mechanistic Interpretability (MI) for Transformer-based Language Models (LMs), organized In this talk, Neel Nanda describes his team's pivot from ambitious mechanistic interpretability toward "pragmatic interpretability": using proxy tasks and hard-to-fake empirical benchmarks to How simple classifiers trained on model activations reveal what information is encoded in representations, from structural probes to MDL probing, and the fundamental gap between Sheet 8. If a linear probe achieves high accuracy, the information is present and linearly accessible in the representations. Mechanistic interpretability is a suite of methods that reverse-engineer neural network computations by causally probing internal activations, weights, and circuits. ipynb. Table 5 provides a cross-domain summary of representative applications of mechanistic interpretability, illustrating how its techniques are used for model safety, debugging, scientific insight, and Specifically, we examine mechanistic interpretability, probing techniques, and representation engineering as tools to decipher how knowledge is structured, encoded, and retrieved Probe performance could reflect its own capabilities more than actual characteristics of the representation. This is the topic of mechanistic interpretability research, and it can answer many exciting questions. Covers circuit tracing, sparse autoencoders, attribution graphs, and Explore how mechanistic interpretability dissects neural network internals via causal, observational, and interventional methods for human-understandable insights. e. Remember: An LLM is a deep artificial neural To address these questions, we extract activation vectors from the residual stream of four state-of-the-art open-weights LLMs and train linear probes at each layer to classify Bloom levels. 
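The layer-wise probing idea described above (train a linear probe on each layer's activations and read high held-out accuracy as evidence that the information is linearly accessible) can be sketched as follows. This is a minimal illustration on synthetic data, not any cited paper's setup: the three "layers", the signal strengths, and the helper names `train_logistic_probe` and `probe_accuracy` are assumptions, and real use would substitute residual-stream activations extracted from a model.

```python
import numpy as np

def train_logistic_probe(X, y, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression probe (weights w, bias b) by gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(label = 1)
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b

def probe_accuracy(w, b, X, y):
    """Fraction of examples whose predicted class matches the label."""
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))

rng = np.random.default_rng(0)
n, d = 400, 16
y = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

# Hypothetical "layers": the label is written along one direction with a
# strength that grows with depth (0.0 means the layer carries no signal).
layer_accuracy = []
for strength in (0.0, 0.5, 3.0):
    acts = rng.normal(size=(n, d)) + strength * (2 * y - 1)[:, None] * direction
    w, b = train_logistic_probe(acts[:200], y[:200])                  # fit on first half
    layer_accuracy.append(probe_accuracy(w, b, acts[200:], y[200:]))  # score on second half

for layer, acc in enumerate(layer_accuracy):
    print(f"layer {layer}: held-out probe accuracy = {acc:.2f}")
```

Accuracy near chance at a layer only says the label is not linearly decodable there; as the surrounding snippets caution, probe performance alone does not show that the model uses the information.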
If a simple linear relationship predicts complexity, that's Chapter 1: Transformer Interpretability Dive deep into language model interpretability, from linear probes and SAEs to circuit analysis and toy models. Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. We also tried multi-token Probing classifiers can give us some insight into what happens inside neural networks, but are far from being able to provide a complete picture. As the field grows in influence, it is increasingly Related work Linear probes were originally introduced in the context of image models but have since been widely applied to language models, including in explicitly safety-relevant Mechanistic Interpretability of Cognitive Complexity in LLMs via Linear Probing using Bloom’s Taxonomy Bianca Raimondi University of Bologna, Italy bianca. Fundamentally, transformers are made of linear algebra! The field of mechanistic interpretability is evolving rapidly. The linear representation hypothesis offers a “resolution” to this problem. What is probing? It fits a simple linear ridge regression model on the network activations to predict Mechanistic interpretability [14], [16] attempts to discover specific circuits within models; many of these studies [15], [17] have been conducted on the GPT-2 model which is large enough to be interesting If you want to learn linear algebra, check out 3Blue1Brown or Linear Algebra Done Right - this is just a refresher of key concepts that are relevant to mechanistic interpretability. md and . We also found that baseline logistic regression probes worked as well even on the interpretability case studies that we were initially most excited about.
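The ridge-regression form of probing mentioned above (fit a linear ridge model on network activations to predict some target) can be sketched in closed form. This is a hedged illustration on synthetic data: the names `fit_ridge_probe` and `r_squared`, the penalty `lam=1.0`, and the fabricated activations and target are assumptions, not taken from the source.

```python
import numpy as np

def fit_ridge_probe(X, y, lam=1.0):
    """Closed-form ridge regression on centred data:
    w = (Xc^T Xc + lam * I)^-1 Xc^T yc, with the intercept recovered afterwards."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

rng = np.random.default_rng(0)
acts = rng.normal(size=(300, 8))              # stand-in for network activations
true_w = rng.normal(size=8)                   # hypothetical linear encoding
target = acts @ true_w + 0.1 * rng.normal(size=300)

w, b = fit_ridge_probe(acts[:200], target[:200])      # fit on 200 examples
r2 = r_squared(target[200:], acts[200:] @ w + b)      # evaluate on the other 100
print(f"held-out R^2 = {r2:.3f}")
```

The penalty `lam` shrinks the probe weights; a larger value makes the probe less expressive, which is one way to limit how much the probe itself contributes to the score.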
We then introduce the field of mechanistic interpretability, discussing the superposition hypothesis and the The probe-style intuition behind ROME, that linear directions in the residual stream carry compact subject representations, has since been used to understand and modify knowledge in large A step towards more interpretable interpretability methods In this blog post, we’ll describe control tasks, which put into action the intuition that the more a probe is able to make memorized Mechanistic interpretability is important for predicting the behavior of these systems and ideally controlling them to avoid behavior we don’t want to see, e. 1: Mechanistic interpretability Author: Polina Tsvilodub One criticism often raised in context of LLMs is their blackbox nature, i. This review explores mechanistic interpretability: reverse engineering the computational mechanisms Large language models (LLMs) have achieved remarkable capabilities across diverse tasks, yet their internal decision-making processes remain largely opaque. Nanda's key claim is that this is Another angle is quite a lot of mechanistic interpretability is fundamentally theory crafting about what we think happens in models on We first outline the use of probing in revealing internal structures within LLMs. Specifically, we examine mechanistic interpretability, probing techniques, and representation engineering as tools to decipher how knowledge is structured, encoded, and retrieved [ACL2025][Interpretability][Mechanistic interpretability] By utilizing three mechanistic interpretability techniques—probing, activation patching, and generation steering—this study reveals that the Neel Nanda from DeepMind presenting 'Mechanistic Interpretability: A Whirlwind Tour' on July 21, 2024 at the Vienna Alignment Workshop. The approach seeks to analyze neural networks in a manner similar to how binary computer programs can be reverse-engineered to understand their functions. 
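The control-task idea mentioned above can be sketched as follows: alongside the real labels, train the same probe on labels assigned at random to each word type, and report selectivity, the real-task accuracy minus the control-task accuracy. Everything in this sketch is synthetic and assumed for illustration (the type embeddings, the noise level, the probe hyperparameters, and the helper names); it is not the setup of the blog post being quoted.

```python
import numpy as np

def train_logistic_probe(X, y, lr=0.5, steps=300, l2=1e-3):
    """Fit a logistic-regression probe (weights w, bias b) by gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b

def probe_accuracy(w, b, X, y):
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))

rng = np.random.default_rng(0)
n_types, d, n_tokens = 500, 32, 4000
type_emb = rng.normal(size=(n_types, d))       # one hypothetical vector per word type
direction = rng.normal(size=d)

real_label = (type_emb @ direction > 0).astype(int)   # linearly encoded property
control_label = rng.integers(0, 2, size=n_types)      # random label per word type

type_ids = rng.integers(0, n_types, size=n_tokens)
acts = type_emb[type_ids] + 0.1 * rng.normal(size=(n_tokens, d))
train, test = slice(0, 2000), slice(2000, None)

def held_out_accuracy(labels_by_type):
    """Train a probe on the first 2000 tokens and score it on the remaining ones."""
    y = labels_by_type[type_ids]
    w, b = train_logistic_probe(acts[train], y[train])
    return probe_accuracy(w, b, acts[test], y[test])

real_acc = held_out_accuracy(real_label)
control_acc = held_out_accuracy(control_label)
selectivity = real_acc - control_acc
print(f"real = {real_acc:.2f}, control = {control_acc:.2f}, selectivity = {selectivity:.2f}")
```

A probe that scores well on the control task can only be doing so by memorizing word types, so high selectivity suggests the real-task accuracy reflects structure in the representation rather than the capacity of the probe.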
The probe learns the mapping from model coordinates to human interpretable coordinates. Non-linear probes have been alleged to have this property, and that is why a linear probe is entrusted with this task. If not neurons, what are features then? Prevalence of linear layers in modern NN architectures. Abstract Mechanistic interpretability (MI) aims to explain how neural networks work by uncovering their underlying causal mechanisms. Robust AI-enabled security will require specifying the full transformation space of the task and monitoring semantic drift through fine-tuning. In the future, it would be interesting to use non Mechanistic interpretability, a branch of AI research, seeks to uncover how neural networks process information, offering insights into the “why” behind While focusing on bottom-up, mechanistic interpretability approaches, we can also consider integrating top-down, concept-based structured probes with mechanistic interpretability. A Google TechTalk, presented by Neel Nanda, 2023/06/20 Google Algorithms Seminar - ABSTRACT: Mechanistic Interpretability is the study of reverse engineering the learned algorithms in a trained To visualise probe outputs or better understand my work, check out probe_output_visualization. Mechanistic? [BlackBoxNLP workshop at EMNLP 2024] This paper explores the multiple definitions and uses of "mechanistic interpretability," tracing its evolution in NLP research and Instead, by constraining the probe to be linear, the researchers force it to find the most straightforward, interpretable signals.
Keywords: security fine-tuning, evasion vulnerability, Are these really the ground-truth components/features? Looking forward: the case for interpretable tasks Moving forward, we propose that mechanistic interpretability in both fields may One question that sometimes comes up in interpretability work is: “why do I trust simple linear probes more than complex non-linear ones?” While there are exceptions involving non-linear or context-dependent features, this hypothesis remains a cornerstone for studying mechanistic While most of this review focuses on bottom-up, mechanistic approaches to interpretability, it is worth considering the potential for integrating top-down, concept-based techniques like structured probes. It could help ensure safety and alignment. While early case studies have demonstrated its feasibility, scaling these techniques to the most advanced foundation models This study investigates the internal How to Design the Interpretability Labs Companion guide for building the .