
Integrated Research Implementation Plan: Synthetic Benchmarking for Universal Feature Geometry

Research Document: Detailed Implementation of Joint Research Directions

Date: February 2026
Version: 1.0


1. Complementary Research Questions

Implementation Overview

This research direction focuses on creating a unified experimental framework that bridges the gap between controlled synthetic evaluation and real-world cross-model feature comparison. The core implementation involves developing a “synthetic model zoo” where we can systematically vary properties that affect feature universality.

Detailed Implementation Steps

Task 1.1: Multi-Model Synthetic Generator Development

Create an extension to the SynthSAEBench codebase that generates pairs or families of related synthetic models with controlled differences. Each synthetic model pair will share a base feature set but diverge in specific, measurable ways:
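As an illustration only, a paired generator could look something like the sketch below. It assumes the shared base is a dictionary of unit-norm feature directions and that the controlled difference is a random rotation plus Gaussian noise. The helper name `make_model_pair` and its parameters are hypothetical and are not part of the SynthSAEBench codebase.

```python
import numpy as np

def make_model_pair(n_features=64, d_model=32, noise_std=0.05, seed=0):
    """Return two feature dictionaries sharing a base but differing by rotation + noise.

    Hypothetical helper for illustration; not a SynthSAEBench API.
    """
    rng = np.random.default_rng(seed)
    # Shared base: unit-norm feature directions, one per row.
    base = rng.standard_normal((n_features, d_model))
    base /= np.linalg.norm(base, axis=1, keepdims=True)
    # Random orthogonal matrix from a QR decomposition (column signs fixed).
    q, r = np.linalg.qr(rng.standard_normal((d_model, d_model)))
    q = q * np.sign(np.diag(r))
    # Model B holds the same features in a rotated basis, with small perturbations.
    model_b = base @ q + noise_std * rng.standard_normal(base.shape)
    model_b /= np.linalg.norm(model_b, axis=1, keepdims=True)
    return base, model_b, q

model_a, model_b, rotation = make_model_pair()
```

Because the rotation is returned alongside the pair, any matching method can be scored against the known ground-truth correspondence.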

Task 1.2: Controlled Divergence Experiments

Design experiments that systematically vary one aspect of model difference while holding others constant:

Task 1.3: Validation Metrics Framework

Implement comprehensive metrics that leverage ground truth:

Task 1.4: Integration with Real Model Benchmarking

Create a pipeline that uses synthetic experiments to predict real-world behavior:

Expected Outcomes:


2. SynthSAEBench Results Supporting Weak Universality

Implementation Overview

This research direction focuses on characterizing the “weak universality regime”, in which SAEs capture analogous but not identical features. We implement experiments that quantify the spectrum from strong to weak universality.

Detailed Implementation Steps

Task 2.1: Architecture Comparison Framework

Systematically compare all major SAE architectures on identical synthetic data to understand their different “views” of the same feature space:

Task 2.2: Reconstruction-Interpretability Trade-off Surface

Map the Pareto frontier between reconstruction quality and feature interpretability:
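A minimal sketch of extracting a Pareto frontier from a set of runs, assuming both axes are higher-is-better scores. The function name `pareto_frontier` and the numbers in `runs` are hypothetical placeholders, not results.

```python
def pareto_frontier(points):
    """Return the points not dominated on (reconstruction, interpretability).

    Both coordinates are treated as higher-is-better.
    """
    frontier = []
    for i, (r_i, s_i) in enumerate(points):
        dominated = any(
            (r_j >= r_i and s_j >= s_i) and (r_j > r_i or s_j > s_i)
            for j, (r_j, s_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            frontier.append((r_i, s_i))
    return sorted(frontier)

# Hypothetical (explained variance, interpretability score) pairs for five SAE runs.
runs = [(0.95, 0.40), (0.90, 0.60), (0.85, 0.55), (0.80, 0.80), (0.70, 0.75)]
frontier = pareto_frontier(runs)
```

The quadratic scan is adequate for the modest number of runs a sweep over architectures and sparsity levels would produce.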

Task 2.3: Weak Universality Quantification

Develop metrics that specifically measure “analogous but not identical” feature similarity:
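One simple baseline such metrics could be compared against is the mean max cosine similarity between two dictionaries. The sketch below assumes each dictionary stores one feature direction per row; the name `mean_max_cosine` is an illustrative choice.

```python
import numpy as np

def mean_max_cosine(dict_a, dict_b):
    """For each feature in dict_a, take its best |cosine| match in dict_b, then average."""
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    cos = np.abs(a @ b.T)  # absolute value ignores sign flips
    return float(cos.max(axis=1).mean())

rng = np.random.default_rng(0)
dict_a = rng.standard_normal((10, 8))
# Same features, reordered, sign-flipped and rescaled: still a perfect match.
score_same = mean_max_cosine(dict_a, -2.0 * dict_a[::-1])
# Independent random dictionary: only partial overlap.
score_random = mean_max_cosine(dict_a, rng.standard_normal((10, 8)))
```

This score is invariant to permutation, sign, and scale but not to rotation, so on its own it cannot distinguish "analogous up to a transformation" from "unrelated".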

Task 2.4: Stability Analysis Across Training

Track how weak universality emerges during SAE training:

Task 2.5: Controlled Analogy Experiments

Create synthetic scenarios where we know features should be analogous but not identical:

Expected Outcomes:


3. Superposition as a Source of Non-Identical Features

Implementation Overview

This research direction investigates how superposition creates a fundamental ambiguity in feature decomposition, leading to multiple valid but non-identical solutions. We implement experiments that directly manipulate superposition levels and measure the resulting decomposition diversity.

Detailed Implementation Steps

Task 3.1: Superposition Spectrum Experiments

Extend SynthSAEBench to generate models across a fine-grained superposition spectrum:
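A rough sketch of sweeping a superposition spectrum, assuming random unit-norm feature directions as a stand-in for a synthetic model and using the mean of each feature's largest overlap with any other feature as a proxy for interference. The function name and the ratio grid are hypothetical.

```python
import numpy as np

def mean_max_interference(n_features, d_model, seed=0):
    """Average, over features, of the largest |cosine| with any other feature."""
    rng = np.random.default_rng(seed)
    feats = rng.standard_normal((n_features, d_model))
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    overlap = np.abs(feats @ feats.T)
    np.fill_diagonal(overlap, 0.0)  # ignore each feature's overlap with itself
    return float(overlap.max(axis=1).mean())

d_model = 32
# Ratio of features to dimensions; values above 1 force superposition.
spectrum = {
    ratio: mean_max_interference(int(ratio * d_model), d_model)
    for ratio in (0.5, 1, 2, 4, 8)
}
```

Packing more features into the same number of dimensions raises the interference proxy, which gives a single knob for placing a synthetic model on the spectrum.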

Task 3.2: Mechanistic Analysis of MP-SAE Overfitting

Implement a detailed analysis of the observed phenomenon in which Matching Pursuit SAEs (MP-SAEs) overfit superposition noise:
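For reference, the greedy encoding step that gives matching pursuit its name can be sketched as below. This is the classical matching pursuit algorithm over a fixed dictionary of unit-norm atoms; how an MP-SAE learns its dictionary and how it is trained are not shown, and the function name is an illustrative choice.

```python
import numpy as np

def matching_pursuit(x, dictionary, n_steps):
    """Greedy sparse coding of x over a dictionary whose rows are unit-norm atoms."""
    residual = x.astype(float).copy()
    codes = np.zeros(dictionary.shape[0])
    for _ in range(n_steps):
        scores = dictionary @ residual          # correlation of each atom with the residual
        k = int(np.argmax(np.abs(scores)))      # pick the best-matching atom
        codes[k] += scores[k]
        residual -= scores[k] * dictionary[k]   # remove what that atom explains
    return codes, residual

dictionary = np.eye(4)
x = np.array([0.0, 2.0, 0.0, -1.0])
codes, residual = matching_pursuit(x, dictionary, n_steps=2)
```

With an orthonormal dictionary the residual reaches zero after as many steps as there are nonzero coefficients. With overlapping atoms each step also absorbs some interference, which is the regime this task examines.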

Task 3.3: Optimal Decomposition Under Superposition

Formulate and solve the theoretical problem of optimal feature decomposition under superposition:

Task 3.4: Superposition-Aware Feature Matching

Develop matching algorithms that explicitly account for superposition:

Task 3.5: Cross-Model Superposition Transfer

Test whether SAEs trained on high-superposition models transfer better to other high-superposition models:

Task 3.6: Validation on Real LLMs

Estimate superposition levels in real LLMs and test predictions:

Expected Outcomes:


4. Hierarchy and Correlation Create Analogous Features

Implementation Overview

This research direction explores how hierarchical feature relationships and correlation patterns create structured feature spaces that are similar across models even when individual features differ. We implement experiments that independently manipulate hierarchy and correlation to measure their effects on analogous universality.

Detailed Implementation Steps

Task 4.1: Hierarchical Structure Experiments

Systematically vary hierarchical properties in synthetic models:

Task 4.2: Semantic Subspace Analysis

Implement a detailed analysis of the semantic subspace phenomenon reported in the universality paper:

Task 4.3: Correlation Pattern Experiments

Use the low-rank correlation structure from SynthSAEBench to study correlation effects:
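A sketch of one common way to build a low-rank correlation structure and sample correlated firing patterns from it. This is an assumption for illustration and not necessarily the construction SynthSAEBench uses; the helper name, the `jitter` term, and the firing threshold are all hypothetical.

```python
import numpy as np

def low_rank_correlation(n_features, rank, jitter=0.1, seed=0):
    """Correlation matrix built from a rank-`rank` factor matrix plus a small diagonal."""
    rng = np.random.default_rng(seed)
    factors = rng.standard_normal((n_features, rank))
    cov = factors @ factors.T + jitter * np.eye(n_features)
    scale = 1.0 / np.sqrt(np.diag(cov))
    corr = cov * scale[:, None] * scale[None, :]  # rescale to unit diagonal
    return corr, factors

corr, factors = low_rank_correlation(n_features=50, rank=3)

# Sample correlated Gaussian latents and threshold them into sparse firing patterns.
rng = np.random.default_rng(1)
latents = rng.multivariate_normal(np.zeros(50), corr, size=2000)
firing = latents > 1.5  # hypothetical threshold; each feature fires in roughly 7% of samples
```

Varying `rank` changes how many shared factors drive co-activation, which is the quantity a correlation experiment would sweep.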

Task 4.4: Hierarchical and Correlation Interaction Studies

Test whether hierarchy and correlation have synergistic or antagonistic effects:

Task 4.5: Transfer Learning Using Structure

Implement structure-aware transfer learning methods:

Task 4.6: Real-World Semantic Validation

Apply semantic subspace analysis to real LLMs:

Expected Outcomes:


5. The Precision-Recall Trade-off Validates Transformation-Based Similarity

Implementation Overview

This research direction leverages the precision-recall trade-off observed in SynthSAEBench to understand and optimize transformation-based similarity measures. We implement experiments that show why rotation-invariant measures are necessary and how to optimize them.

Detailed Implementation Steps

Task 5.1: Precision-Recall Trade-off Surface Mapping

Comprehensively map the precision-recall trade-off across all conditions:

Task 5.2: Basis-Dependent vs Basis-Independent Quality

Demonstrate that precision-recall trade-offs are basis-dependent while geometric similarity is basis-independent:
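The basis-dependence half of this claim can be illustrated with a small numerical check: rotating a feature dictionary changes every feature's direction, yet leaves the pairwise geometry (the Gram matrix) unchanged. The setup below is a hypothetical toy and uses random features rather than a trained SAE.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal((40, 16))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Random rotation of the representation space.
q, r = np.linalg.qr(rng.standard_normal((16, 16)))
q = q * np.sign(np.diag(r))
rotated = features @ q

# Basis-dependent view: cosine between each feature and its own rotated copy.
direct_cosine = np.sum(features * rotated, axis=1)

# Basis-independent view: pairwise feature geometry before and after rotation.
gram_before = features @ features.T
gram_after = rotated @ rotated.T
gram_gap = float(np.abs(gram_before - gram_after).max())
```

Here `direct_cosine` is far from 1 for most features while `gram_gap` is zero up to floating-point error, so a metric computed feature by feature in the original basis would report a mismatch that a geometric metric would not.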

Task 5.3: Optimal Transformation Learning

Implement learnable transformations that maximize feature correspondence while respecting geometric structure:
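A useful closed-form baseline for any learned transformation is orthogonal Procrustes alignment, which finds the orthogonal matrix that best maps one set of paired vectors onto another in the least-squares sense. The sketch below assumes the two dictionaries are already row-paired; the function and variable names are illustrative.

```python
import numpy as np

def orthogonal_procrustes(source, target):
    """Orthogonal R minimizing ||source @ R - target||_F, via the SVD of source.T @ target."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

rng = np.random.default_rng(0)
source = rng.standard_normal((100, 16))

# Ground-truth rotation to recover.
true_q, r = np.linalg.qr(rng.standard_normal((16, 16)))
true_q = true_q * np.sign(np.diag(r))
target = source @ true_q

learned = orthogonal_procrustes(source, target)
```

Because the solution is constrained to be orthogonal, it preserves angles and norms, which is one way to "respect geometric structure". A gradient-trained transformation would need an explicit constraint or penalty to offer the same guarantee.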

Task 5.4: Transformation Quality Metrics

Develop comprehensive metrics for evaluating transformation quality:

Task 5.5: Trade-off-Aware Matching

Develop matching algorithms that account for the precision-recall trade-off:

Task 5.6: L0-Specific Transformation Learning

Test whether optimal transformations depend on the L0 regime of the SAEs:

Task 5.7: Real-World Transformation Validation

Apply transformation learning to real LLM pairs:

Expected Outcomes:


6. Dead Latents and Feature Coverage

Implementation Overview

This research direction addresses the observation that SAEs incompletely cover the feature space, with different SAEs covering different subsets. We implement experiments that characterize feature coverage patterns and develop methods to achieve more complete coverage.

Detailed Implementation Steps

Task 6.1: Dead Latent Characterization

Systematically study dead latent patterns across architectures and conditions:
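A minimal sketch of flagging dead latents, assuming SAE activations are collected into an array of shape (n_samples, n_latents) and that "dead" means the latent never exceeds a small threshold on the evaluation set. The function name and the `eps` parameter are hypothetical.

```python
import numpy as np

def dead_latent_mask(activations, eps=0.0):
    """Boolean mask over latents: True where the latent never fires above eps."""
    return ~(activations > eps).any(axis=0)

# Toy activations: 3 samples, 4 latents; the last two latents never fire.
acts = np.array([
    [0.0, 1.2, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.0],
    [0.0, 0.7, 0.0, 0.0],
])
dead = dead_latent_mask(acts)
dead_fraction = float(dead.mean())
```

The result depends on how many samples are evaluated, so the sample count should be reported alongside any dead-latent fraction.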

Task 6.2: Coverage Complementarity Analysis

Measure the extent to which different SAE architectures cover different feature subsets:
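With ground-truth features available, complementarity can be sketched as set overlap between the features each SAE recovers. The example assumes a ground-truth feature counts as covered when some decoder row reaches a cosine similarity threshold with it; the threshold of 0.8 and all names are hypothetical.

```python
import numpy as np

def covered_features(ground_truth, decoder, threshold=0.8):
    """Indices of ground-truth features matched by at least one decoder row."""
    gt = ground_truth / np.linalg.norm(ground_truth, axis=1, keepdims=True)
    dec = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
    best = np.abs(gt @ dec.T).max(axis=1)
    return set(np.flatnonzero(best >= threshold).tolist())

# Toy setup: four orthogonal ground-truth features, two SAEs that each recover two.
ground_truth = np.eye(4)
sae_a = np.eye(4)[[0, 1]]
sae_b = np.eye(4)[[1, 2]]

cov_a = covered_features(ground_truth, sae_a)
cov_b = covered_features(ground_truth, sae_b)
jaccard = len(cov_a & cov_b) / len(cov_a | cov_b)
union_coverage = len(cov_a | cov_b) / ground_truth.shape[0]
```

A low Jaccard score combined with a high union coverage would indicate that the two SAEs are complementary rather than redundant.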

Task 6.3: Ensemble Coverage Methods

Develop methods to combine multiple SAEs for more complete feature coverage:

Task 6.4: Active Feature Discovery

Implement active learning methods that iteratively improve feature coverage:

Task 6.5: Feature Rarity vs Coverage Analysis

Test whether feature rarity (firing frequency) predicts coverage failures:

Task 6.6: Coverage Transfer Across Models

Test whether feature coverage patterns transfer across models:

Task 6.7: Minimum Coverage Requirements

Determine what level of feature coverage is sufficient for practical interpretability:

Expected Outcomes:


7. Quantitative Validation of Analogous Universality Hypothesis

Implementation Overview

This research direction provides rigorous quantitative validation of the analogous universality hypothesis using the controlled environment of synthetic benchmarks. We implement comprehensive experiments that measure universality at multiple levels of granularity.

Detailed Implementation Steps

Task 7.1: Multi-Level Universality Measurement

Implement a hierarchy of universality metrics from strongest (identical features) to weakest (unrelated features):

For each synthetic configuration and real model pair, report the distribution across these levels.
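Reporting that distribution could be sketched as follows. The level names and the cosine-similarity boundaries in `LEVELS` are hypothetical placeholders chosen for illustration; they are not the plan's own level definitions, and the input scores are made up.

```python
import numpy as np

# Hypothetical levels: (name, lower bound on best-match cosine similarity).
LEVELS = [
    ("near-identical", 0.9),
    ("analogous", 0.6),
    ("weakly related", 0.3),
    ("unrelated", 0.0),
]

def level_distribution(best_match_scores):
    """Fraction of features whose best-match score falls in each level's range."""
    scores = np.asarray(best_match_scores, dtype=float)
    fractions = {}
    upper = np.inf
    for name, lower in LEVELS:
        in_level = (scores >= lower) & (scores < upper)
        fractions[name] = int(in_level.sum()) / len(scores)
        upper = lower
    return fractions

# Made-up best-match scores for five features.
dist = level_distribution([0.97, 0.92, 0.75, 0.41, 0.10])
```

Keeping the boundaries in one table makes it easy to rerun the same report under different level definitions.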

Task 7.2: Synthetic-to-Real Validity Testing

Rigorously test whether findings from synthetic data transfer to real LLMs:

Task 7.3: Universality Spectrum Experiments

Map the complete spectrum from no universality to perfect universality:

Task 7.4: Layer-Wise Universality Patterns

Validate the layer-wise universality patterns found in the original universality paper:

Task 7.5: Cross-Model-Family Universality

Test universality across model families (not just within families):

Task 7.6: Universality Stability Analysis

Test whether universality is stable across perturbations:

Task 7.7: Quantitative Benchmarking Suite

Create a comprehensive benchmark suite for universality:

Expected Outcomes:


8. Methodological Alignment

Implementation Overview

This research direction focuses on unifying and extending the methodologies from both papers into a comprehensive toolkit for studying feature universality. We implement standardized protocols and shared infrastructure.

Detailed Implementation Steps

Task 8.1: Unified Feature Matching Pipeline

Develop a single pipeline that implements all feature matching methods:

Make all modules interoperable: output of one module can be input to validation steps in another.

Task 8.2: Comprehensive Similarity Metrics Library

Implement all similarity metrics from both papers plus extensions:

Provide implementations in both PyTorch and NumPy for maximum compatibility.

Task 8.3: Standardized Experimental Protocols

Create detailed, reproducible protocols for common experimental patterns:

Task 8.4: Shared Infrastructure and Tools

Build shared infrastructure to enable both research directions:

Task 8.5: Cross-Validation Framework

Implement methods to validate one paper’s findings with the other’s methods:

Task 8.6: Reproducibility Infrastructure

Ensure all research is fully reproducible:

Task 8.7: Benchmark Standardization

Create official benchmark tasks and leaderboards:

Publish official leaderboards and provide submission infrastructure.

Expected Outcomes:


9. The Critical Insight: Why Analogous, Not Identical?

Implementation Overview

This research direction addresses the fundamental theoretical question: Why do SAEs learn analogous rather than identical features? We implement experiments that test mechanistic hypotheses about the origins of feature analogy.

Detailed Implementation Steps

Task 9.1: Superposition Decomposition Ambiguity Theory

Formalize and test the theory that superposition creates fundamental ambiguity in optimal feature decomposition:

Task 9.2: Optimization Landscape Analysis

Analyze the loss landscape to understand why different training runs find different solutions:

Task 9.3: Architecture-Specific Biases

Identify what implicit biases different SAE architectures have that lead them to different solutions:

Task 9.4: Information-Theoretic Analysis

Apply information theory to understand feature analogousness:

Task 9.5: Causal Mechanisms of Analogy

Identify causal factors that create analogous vs. identical features:

Task 9.6: Analogousness as a Continuum

Model feature analogousness as a continuous spectrum rather than a binary property:

Task 9.7: Predictive Models of Analogousness

Build models that predict the degree of analogousness from model/architecture properties:

Expected Outcomes:


10. Synthesis: The Complete Picture

Implementation Overview

This final research direction integrates all previous findings into a unified theory and comprehensive practical framework. We implement the synthesis phase that combines theoretical understanding with practical tools.

Detailed Implementation Steps

Task 10.1: Unified Theory of Feature Universality

Develop a comprehensive theoretical framework integrating all findings:

Task 10.2: Practical Universality Toolkit

Create a comprehensive software toolkit implementing all methods:

Task 10.3: Comprehensive Validation Study

Conduct a large-scale validation study integrating all methods:

Task 10.4: Application Demonstrations

Implement concrete applications enabled by universality understanding:

Task 10.5: Best Practices Guide

Develop a comprehensive guide for practitioners:

Task 10.6: Educational Materials

Create materials enabling broader adoption:

Task 10.7: Community Building and Dissemination

Establish infrastructure for ongoing community engagement:

Task 10.8: Future Research Directions

Identify and articulate open problems for the community:

For each problem:

Expected Outcomes:


Conclusion

This integrated research program connects controlled synthetic evaluation (SynthSAEBench) with real-world cross-model universality analysis. By implementing the 10 research directions detailed above, we aim to:

  1. Understand why SAEs learn analogous rather than identical features
  2. Quantify the degree of universality across models and conditions
  3. Predict when and how strongly universality will occur
  4. Exploit universality for practical interpretability transfer
  5. Advance the broader field of mechanistic interpretability

The implementation plan above lays out a roadmap of roughly 24 months, matching the timeline in the Appendix. Its goal is to deepen our understanding of feature learning in neural networks and to support new capabilities in AI interpretability and safety.


Appendix: Implementation Timeline

Months 1-4: Infrastructure (Tasks 1.1-1.3, 2.1, 3.1, 8.1-8.4)
Months 5-8: Core Experiments (Tasks 2.2-2.5, 3.2-3.4, 4.1-4.3, 5.1-5.3)
Months 9-12: Advanced Analysis (Tasks 4.4-4.6, 5.4-5.6, 6.1-6.4, 7.1-7.3)
Months 13-16: Theory & Transfer (Tasks 7.4-7.7, 9.1-9.5, 10.1-10.2)
Months 17-20: Validation & Applications (Tasks 6.5-6.7, 8.5-8.7, 9.6-9.7, 10.3-10.5)
Months 21-24: Synthesis & Dissemination (Tasks 10.6-10.8, paper writing, release)

Total Estimated Effort: 4-5 full-time researchers for 24 months

