Evaluating SAEs

Four key methods for evaluating sparse autoencoders

Sparse autoencoders (SAEs) decompose language model activations into interpretable features, but measuring whether those features are actually good remains an open problem. The four evaluation approaches below represent the core toolkit the field has converged on: the SAEBench suite (Karvonen et al., 2025), the most comprehensive SAE evaluation suite to date, and three methods it builds on, each probing a different dimension of SAE quality (concept detection, feature separability, and human-understandable explanations). A note on citations: the arXiv ID 2412.06410 cited in the query corresponds to the BatchTopK paper (Bussmann et al., 2024), not to a paper titled “SynthSAEBench.” The benchmark that consolidates these methods is SAEBench, published as arXiv 2503.09532.

SAEBench tests SAEs across eight diverse metrics

SAEBench (Karvonen et al., 2025; arXiv 2503.09532) is an open-source evaluation suite that measures sparse autoencoder performance across eight distinct metrics organized into four capability dimensions: concept detection, interpretability, reconstruction fidelity, and feature disentanglement. The benchmark was created because most prior SAE research evaluated progress using unsupervised proxy metrics, such as reconstruction loss at a given sparsity level, whose practical relevance was unclear. SAEBench demonstrated that gains on proxy metrics do not reliably translate to better practical performance. For example, Matryoshka SAEs slightly underperform on traditional sparsity–fidelity tradeoffs but substantially outperform other architectures on feature disentanglement metrics, an advantage that grows with scale. The suite includes over 200 open-source SAEs across seven architectures, evaluated on eight metrics: sparse probing, automated interpretability, RAVEL disentanglement, spurious correlation removal, targeted probe perturbation, feature absorption detection, unlearning, and reconstruction loss. By providing a standardized framework, SAEBench lets researchers make apples-to-apples comparisons between SAE architectures and training methods, revealing tradeoffs that single-metric evaluations would miss.
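To make the "proxy metrics" mentioned above concrete, here is a minimal sketch of the two quantities usually plotted against each other in a sparsity–fidelity tradeoff: mean L0 (how many latents fire per input) and the fraction of variance left unexplained by the reconstruction. This is not SAEBench code; the function names and the toy numbers are illustrative assumptions.

```python
def mean_l0(latents):
    """Average number of nonzero latents per input (a common sparsity proxy)."""
    return sum(sum(1 for z in row if z != 0.0) for row in latents) / len(latents)


def fraction_variance_unexplained(activations, reconstructions):
    """Squared reconstruction error divided by the total variance of the activations."""
    n, d = len(activations), len(activations[0])
    means = [sum(row[j] for row in activations) / n for j in range(d)]
    sq_err = sum((a - r) ** 2
                 for act, rec in zip(activations, reconstructions)
                 for a, r in zip(act, rec))
    total_var = sum((row[j] - means[j]) ** 2 for row in activations for j in range(d))
    return sq_err / total_var


# Toy data (all values invented): 3 inputs, 2-dimensional activations, 4 SAE latents.
activations = [[1.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
reconstructions = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
latents = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.0, 1.2, 0.0], [0.5, 0.0, 1.2, 0.3]]

l0 = mean_l0(latents)
fvu = fraction_variance_unexplained(activations, reconstructions)
print(f"mean L0 = {l0:.2f}, FVU = {fvu:.3f}")
```

Both numbers say how sparse and how faithful the SAE is, but nothing about whether individual latents mean anything, which is the gap the remaining metrics are meant to fill.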

Sparse probing asks whether individual features capture known concepts

Sparse probing (Gurnee et al., 2023; arXiv 2305.01610) is an evaluation method that tests whether a small number of SAE latents, sometimes just one, can accurately predict human-labeled concepts in the input data. The technique works by selecting the top-k SAE features most correlated with a binary classification task (such as “is this text in French?” or “does this express positive sentiment?”), then training a simple logistic regression probe on only those k features. If one or a few features suffice to classify the concept accurately, the SAE has learned a dedicated, interpretable representation of that concept. Gurnee et al. originally applied this to raw MLP neurons in the Pythia model suite, finding that certain “context neurons” in middle layers are monosemantic (a single neuron encodes a single concept), while early-layer representations tend to be in superposition. When adapted for SAE evaluation in SAEBench, sparse probing is run across 35 binary classification tasks spanning language identification, profession classification, and sentiment analysis. Kantamneni et al. (2025; arXiv 2502.16681) extended this line of work by testing, across 113 datasets, whether SAE-based sparse probes outperform standard baselines such as logistic regression on raw activations. Notably, they found that SAE probes do not consistently beat simple baselines on supervised classification, though SAE latents remain useful for unsupervised concept discovery and for debugging dataset quality issues.
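The select-then-probe recipe described above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the SAEBench or Gurnee et al. implementation: ranking latents by the absolute difference in class means is one simple stand-in for "most correlated", the logistic regression is a bare gradient-descent loop, and the data generator (with latent 3 as the planted concept latent) is entirely hypothetical.

```python
import math
import random


def top_k_by_mean_difference(X, y, k):
    """Rank latents by |mean activation on positives - mean activation on negatives|."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]

    def gap(j):
        return abs(sum(x[j] for x in pos) / len(pos) - sum(x[j] for x in neg) / len(neg))

    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]


def train_logistic_probe(X, y, steps=500, lr=0.5):
    """Plain full-batch gradient-descent logistic regression; returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, label in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            for j, xi in enumerate(x):
                grad_w[j] += (p - label) * xi
            grad_b += p - label
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def probe_accuracy(w, b, X, y):
    """Fraction of examples where the sign of the logit matches the label."""
    hits = sum(1 for x, label in zip(X, y)
               if (sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (label == 1))
    return hits / len(X)


def make_example(n_latents=16, concept_latent=3):
    """Hypothetical data: sparse noise latents plus one latent that tracks the label."""
    label = random.randint(0, 1)
    x = [random.random() if random.random() < 0.1 else 0.0 for _ in range(n_latents)]
    x[concept_latent] = label * (1.0 + random.random())
    return x, label


random.seed(0)
data = [make_example() for _ in range(200)]
X, y = [x for x, _ in data], [label for _, label in data]
X_train, y_train, X_test, y_test = X[:150], y[:150], X[150:], y[150:]

selected = top_k_by_mean_difference(X_train, y_train, k=1)
X_train_k = [[row[j] for j in selected] for row in X_train]
X_test_k = [[row[j] for j in selected] for row in X_test]

w, b = train_logistic_probe(X_train_k, y_train)
accuracy = probe_accuracy(w, b, X_test_k, y_test)
print(f"selected latents: {selected}, held-out accuracy: {accuracy:.2f}")
```

Restricting the probe to k latents is the point of the method: a high score with k=1 indicates a single dedicated latent, whereas a probe over all latents could succeed even if the concept were smeared across many of them.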

Concept disentanglement measures whether SAEs properly separate independent features

Concept disentanglement (Karvonen et al., 2024; arXiv 2408.00113) evaluates whether an SAE decomposes neural activations into cleanly separated, non-overlapping features that each correspond to a single independent concept. The original paper tackled this using language models trained on chess and Othello game transcripts, where ground-truth features are formally specifiable, for instance “there is a knight on F3” or “the position is in check.” Two metrics were introduced: coverage, which measures the fraction of known concepts that at least one SAE feature can accurately classify, and board reconstruction, which tests whether the full game board state can be recovered from SAE features using simple rules. These board-game metrics cannot be applied directly to language models trained on natural text, so SAEBench extends the idea to general language models through three related metrics: RAVEL (testing whether intervening on specific latents can change one attribute, like a city’s country, without altering others, like the language spoken there), spurious correlation removal (testing whether ablating a small set of latents removes a gender bias without destroying the profession signal), and targeted probe perturbation (testing whether concept-specific latent sets are non-overlapping). A key SAEBench finding is that Matryoshka SAEs uniquely exhibit positive scaling on disentanglement: they get better at separating concepts as the dictionary grows, while most other architectures actually get worse, likely due to feature splitting.
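A coverage-style score of the kind described above can be sketched as follows. This is an illustrative reading, not the paper's code: each feature is treated as a binary classifier that "fires" when its activation exceeds a threshold, the best F1 over features is taken per concept, and two summaries are reported (the average best F1, and the fraction of concepts whose best F1 clears a cutoff). The threshold of 0.0, the cutoff of 0.9, and the toy activations are all assumptions.

```python
def f1_score(predictions, labels):
    """F1 of boolean predictions against boolean labels (0.0 when there are no true positives)."""
    tp = sum(1 for p, t in zip(predictions, labels) if p and t)
    fp = sum(1 for p, t in zip(predictions, labels) if p and not t)
    fn = sum(1 for p, t in zip(predictions, labels) if not p and t)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def best_f1_per_concept(feature_acts, concept_labels, threshold=0.0):
    """For each concept, the best F1 that any single feature achieves as a classifier.

    feature_acts[i][j]   : activation of feature j on board position i
    concept_labels[c][i] : whether concept c holds on board position i
    """
    n_features = len(feature_acts[0])
    best = {}
    for name, labels in concept_labels.items():
        scores = []
        for j in range(n_features):
            fires = [row[j] > threshold for row in feature_acts]
            scores.append(f1_score(fires, labels))
        best[name] = max(scores)
    return best


# Toy data (invented): 6 board positions, 3 SAE features, 2 ground-truth concepts.
feature_acts = [
    [1.0, 0.0, 0.0],
    [0.8, 0.0, 0.2],
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0],
    [0.9, 0.7, 0.0],
    [0.0, 0.0, 0.3],
]
concept_labels = {
    "knight_on_f3": [True, True, False, False, True, False],
    "in_check": [False, False, True, True, True, False],
}

best = best_f1_per_concept(feature_acts, concept_labels)
mean_best_f1 = sum(best.values()) / len(best)
fraction_covered = sum(1 for score in best.values() if score >= 0.9) / len(best)
print(best, mean_best_f1, fraction_covered)
```

In this toy case feature 0 tracks "knight_on_f3" exactly, while "in_check" is only partly captured by feature 1, so the two summaries differ; which one to report is a design choice that depends on how strict "accurately classify" is meant to be.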

Autointerpretability uses LLMs to explain and score SAE features at scale

Autointerpretability (Paulo et al., 2025; arXiv 2410.13928), also called automated interpretability, is a technique that uses large language models to automatically generate and evaluate natural language explanations of what each SAE feature represents. The pipeline works in two stages. First, an “explainer” LLM (such as Llama 3.1 70B) is shown dozens of text examples where a given SAE feature activates, with the activating tokens highlighted, and asked to produce a concise explanation, for example “this feature activates on idiomatic phrases following negation words.” Second, a “scorer” LLM is given the explanation and a mix of new activating and non-activating sequences, then asked to predict which sequences should activate the feature. The accuracy of these predictions constitutes the autointerpretability score, which essentially measures whether the explanation is precise and complete enough to serve as a working definition of the feature. Paulo et al. built on foundational work by Bills et al. (2023) at OpenAI, who first demonstrated this explain-then-simulate paradigm for GPT-2 neurons, but made the process dramatically cheaper (roughly $1,300 for 1.5 million features versus $200,000) by using open-weight models and introducing more efficient scoring methods, including detection scoring, fuzzing scoring, and generation scoring. The EleutherAI team released this as an open-source library called Delphi. In SAEBench, autointerpretability is one of the eight evaluation metrics, though the authors note it often struggles to differentiate between SAE architectures, suggesting it is a necessary but not sufficient measure of SAE quality.
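The scoring stage described above reduces to a simple accuracy computation once the scorer's guesses are in hand. The sketch below shows that computation only; it does not use the Delphi API. The `keyword_scorer` function is a hypothetical stand-in for the scorer LLM (a real pipeline would prompt a model with the explanation and the sequence), and the explanation format and example sentences are invented.

```python
def detection_score(scorer, explanation, sequences, truly_active):
    """Fraction of sequences where the scorer's 'active or not' guess matches ground truth."""
    guesses = [scorer(explanation, seq) for seq in sequences]
    correct = sum(1 for guess, truth in zip(guesses, truly_active) if guess == truth)
    return correct / len(sequences)


def keyword_scorer(explanation, sequence):
    """Hypothetical stand-in for a scorer LLM.

    Reads the comma-separated keywords after the colon in the explanation and
    predicts 'active' if any of them appears as a word in the sequence.
    """
    keywords = {word.strip() for word in explanation.split(":")[1].split(",")}
    return any(token in keywords for token in sequence.lower().split())


# Invented example: an explanation for a feature that fires on negation.
explanation = "negation words: not, never, no"
sequences = [
    "I would never do that",
    "She is not here",
    "Nobody came today",      # feature fires, but the explanation misses this case
    "The cat sat down",
    "We arrived on time",
    "It rained all day",
]
truly_active = [True, True, True, False, False, False]

score = detection_score(keyword_scorer, explanation, sequences, truly_active)
print(f"detection score: {score:.3f}")
```

The missed "Nobody" example shows what the score penalizes: an explanation that is accurate for the cases it names but incomplete relative to the feature's actual behavior.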

These methods reveal that no single metric captures SAE quality

The central lesson from these four evaluation methods, and their integration in SAEBench, is that SAE quality is irreducibly multi-dimensional. Sparse probing reveals whether features align with known concepts but provides limited differentiation between architectures. Concept disentanglement exposes whether features are properly separated but favors higher sparsity levels than conventionally used. Autointerpretability scales to millions of features but lacks discriminative power across architectures. No single metric can identify the “best” SAE — a result that has shifted the field away from optimizing purely for reconstruction loss toward a more nuanced, multi-metric evaluation paradigm. These methods collectively push researchers to ask not just whether SAEs reconstruct activations faithfully, but whether the features they learn are individually meaningful, properly separated, and understandable to humans.