Integrated Research Implementation Plan: Synthetic Benchmarking for
Universal Feature Geometry
Research Document: Detailed Implementation of Joint Research Directions
Date: February 2026
Version: 1.0
1. Complementary Research Questions
Implementation Overview
This research direction focuses on creating a unified experimental
framework that bridges the gap between controlled synthetic evaluation
and real-world cross-model feature comparison. The core implementation
involves developing a “synthetic model zoo” where we can systematically
vary properties that affect feature universality.
Detailed Implementation Steps
Task 1.1: Multi-Model Synthetic Generator Development
Create an extension to the SynthSAEBench codebase that generates
pairs or families of related synthetic models with controlled
differences. Each synthetic model pair will share a base feature set but
diverge in specific, measurable ways:
Implement a SyntheticModelFamily class that takes a
base configuration (number of features N, hidden dimension D,
superposition level ρₘₘ) and generates K variants
Add parameters controlling “model divergence”:
feature_overlap_ratio (0.5-1.0, determining what fraction
of features are shared), basis_rotation_angle (0-90
degrees, controlling how much the feature basis rotates between models),
and superposition_delta (±0.1, allowing different models to
have different superposition levels)
Implement a ground-truth correspondence matrix C ∈ ℝ^(N₁×N₂) where
C_ij = 1 if feature i in Model 1 corresponds to feature j in Model 2,
enabling perfect evaluation of feature matching algorithms
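A minimal sketch of how such a generator could look is given here. It is not part of the SynthSAEBench codebase: the method name make_pair, the single-plane rotation, and the "shared features come first" layout are simplifying assumptions made for illustration, and the superposition parameters (ρₘₘ, superposition_delta) are left out entirely.

```python
from dataclasses import dataclass

import numpy as np


def _unit_rows(x):
    """Normalize each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


@dataclass
class SyntheticModelFamily:
    """Hypothetical sketch; not the SynthSAEBench API."""

    n_features: int                      # N
    hidden_dim: int                      # D
    feature_overlap_ratio: float = 1.0   # fraction of shared features
    basis_rotation_angle: float = 0.0    # degrees
    seed: int = 0

    def make_pair(self):
        rng = np.random.default_rng(self.seed)
        w1 = _unit_rows(rng.standard_normal((self.n_features, self.hidden_dim)))

        # Assumption: rotate only in the plane of the first two hidden axes.
        theta = np.deg2rad(self.basis_rotation_angle)
        rot = np.eye(self.hidden_dim)
        rot[:2, :2] = [[np.cos(theta), -np.sin(theta)],
                       [np.sin(theta), np.cos(theta)]]
        w2 = w1 @ rot.T

        # Features after the shared prefix are replaced by fresh directions.
        n_shared = int(round(self.feature_overlap_ratio * self.n_features))
        n_fresh = self.n_features - n_shared
        if n_fresh:
            w2[n_shared:] = _unit_rows(
                rng.standard_normal((n_fresh, self.hidden_dim)))

        # C[i, j] = 1 iff feature i of Model 1 corresponds to feature j of Model 2.
        c = np.zeros((self.n_features, self.n_features))
        idx = np.arange(n_shared)
        c[idx, idx] = 1.0
        return w1, w2, c


family = SyntheticModelFamily(8, 4, feature_overlap_ratio=0.75,
                              basis_rotation_angle=30.0)
w1, w2, c = family.make_pair()
```

With N=8 and an overlap ratio of 0.75, six features are shared, so C has six ones on its diagonal and the last two rows and columns are empty.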
Task 1.2: Controlled Divergence Experiments
Design experiments that systematically vary one aspect of model
difference while holding others constant:
Experiment 1A: Basis Rotation Only - Generate model
pairs that have identical features but rotated bases (angles: 0°, 15°,
30°, 45°, 60°, 75°, 90°). Train SAEs on each and measure whether
SVCCA/RSA can recover similarity despite basis differences.
Experiment 1B: Partial Feature Overlap - Generate
models where 50%, 60%, 70%, 80%, 90%, 100% of features are shared. This
tests whether universality measures degrade gracefully as models become
more different.
Experiment 1C: Different Superposition Regimes -
Create model pairs where Model 1 has ρₘₘ=0.10 and Model 2 ranges from
ρₘₘ=0.10 to ρₘₘ=0.30, testing how superposition differences affect
feature correspondence.
Task 1.3: Validation Metrics Framework
Implement comprehensive metrics that leverage ground truth:
Ground-Truth MCC (GT-MCC): Instead of using
Hungarian matching on learned features, use the known correspondence
matrix C to compute the “oracle MCC” - the best possible MCC if features
were perfectly matched
Transformation Quality Score (TQS): After learning
a transformation T from SAE₁ to SAE₂, measure TQS = correlation(T(W₁),
W₂_corresponding) using ground truth correspondences
Universality Gap Metric: Define UG = GT-MCC -
Empirical-MCC, quantifying how much feature universality is lost due to
SAE limitations vs. being genuinely absent in the models
Task 1.4: Integration with Real Model Benchmarking
Create a pipeline that uses synthetic experiments to predict
real-world behavior:
Train SAEs with various architectures on synthetic model pairs
across all divergence conditions
Build a regression model predicting real-world SVCCA scores from:
(1) synthetic GT-MCC, (2) synthetic TQS, (3) SAE architecture type, (4)
model size ratio
Validate predictions on held-out real model pairs (e.g., Gemma-2-9B
vs Gemma-2-27B)
Expected Outcomes:
A flexible synthetic model generator supporting 50+ divergence
configurations
Quantitative understanding of how each type of model difference
affects feature universality
Predictive models for real-world universality with R²>0.7
This research direction focuses on characterizing the “weak
universality regime” - the observation that SAEs capture features
analogously but not identically. We implement experiments that quantify
the spectrum from strong to weak universality.
Detailed Implementation Steps
Task 2.1: Architecture Comparison Framework
Systematically compare all major SAE architectures on identical
synthetic data to understand their different “views” of the same feature
space:
Use SynthSAEBench-16k as the standard testbed
Train 5 seeds each of: Standard L1, TopK, BatchTopK, Matryoshka,
JumpReLU, Gated, Matching Pursuit
For each architecture pair (e.g., Matryoshka vs. Matching Pursuit),
compute:
Architecture Agreement Score: Percentage of
features where both architectures identify the same ground-truth feature
(using GT correspondences)
Complementary Coverage: Percentage of ground-truth
features captured by exactly one of the two architectures
Cross-Architecture SVCCA: Apply SVCCA between
feature spaces learned by different architectures on the same data
Map the Pareto frontier between reconstruction quality and feature
interpretability:
For each architecture and hyperparameter configuration, plot points
in (Reconstruction MSE, MCC, F1-score) space
Fit a Pareto frontier surface to identify non-dominated
configurations
Identify “reconstruction specialists” (low MSE, low MCC - like
Matching Pursuit) and “interpretability specialists” (higher MSE, high
MCC - like Matryoshka)
Implement ensemble methods that combine specialists: train both
types of SAEs, use reconstruction specialists for generation tasks and
interpretability specialists for feature analysis
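The non-dominated configurations can be found with a simple pairwise filter. The sketch below handles the two-objective case (lower MSE and higher MCC are both better); the function name pareto_mask and the four example configurations are illustrative placeholders, not measured results.

```python
import numpy as np


def pareto_mask(mse, mcc):
    """Return True for configurations that no other configuration dominates.

    A configuration is dominated if another one is at least as good on both
    objectives (lower MSE, higher MCC) and strictly better on at least one.
    """
    mse = np.asarray(mse, dtype=float)
    mcc = np.asarray(mcc, dtype=float)
    mask = np.ones(len(mse), dtype=bool)
    for i in range(len(mse)):
        no_worse = (mse <= mse[i]) & (mcc >= mcc[i])
        strictly_better = (mse < mse[i]) | (mcc > mcc[i])
        mask[i] = not np.any(no_worse & strictly_better)
    return mask


# Placeholder values for four hypothetical (architecture, hyperparameter) runs.
example_mse = [0.02, 0.05, 0.06, 0.03]
example_mcc = [0.55, 0.78, 0.70, 0.50]
frontier = pareto_mask(example_mse, example_mcc)
```

In this toy input the first run is a reconstruction specialist and the second an interpretability specialist; the other two are dominated. Adding F1 as a third objective would only require one more comparison term in each mask.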
Task 2.3: Weak Universality Quantification
Develop metrics that specifically measure “analogous but not
identical” feature similarity:
Angular Similarity with Tolerance (AST): Instead of
requiring features to be exactly aligned, measure the percentage of
feature pairs within θ degrees of alignment (test θ = 5°, 10°, 15°,
20°)
Functional Equivalence Score (FES): For paired
features f₁ and f₂, compute FES = correlation(f₁·dataset, f₂·dataset) -
do they activate on the same tokens even if their weight vectors
differ?
Semantic Preservation Score (SPS): For semantically
labeled features (using the concept categories from the universality
paper), measure what percentage maintain their semantic label across
architectures
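AST and FES can be sketched in a few lines, assuming features are already paired row by row. One caveat: FES below uses raw linear projections of the data onto each feature direction, as the formula above suggests; real SAE activations pass through an encoder nonlinearity, so this is a simplification. Function names are illustrative.

```python
import numpy as np


def angular_similarity_with_tolerance(w1, w2, theta_deg):
    """Fraction of paired rows (w1[i], w2[i]) separated by at most theta_deg."""
    cos = np.sum(w1 * w2, axis=1) / (
        np.linalg.norm(w1, axis=1) * np.linalg.norm(w2, axis=1))
    angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return float(np.mean(angles <= theta_deg))


def functional_equivalence_score(f1, f2, data):
    """Pearson correlation of the projections of `data` onto f1 and f2."""
    return float(np.corrcoef(data @ f1, data @ f2)[0, 1])


# Two paired features: one rotated by 8 degrees, one by 30 degrees.
w_a = np.array([[1.0, 0.0], [0.0, 1.0]])
w_b = np.array([
    [np.cos(np.deg2rad(8)), np.sin(np.deg2rad(8))],
    [np.sin(np.deg2rad(30)), np.cos(np.deg2rad(30))],
])
ast_by_theta = {theta: angular_similarity_with_tolerance(w_a, w_b, theta)
                for theta in (5, 10, 15, 20)}

rng = np.random.default_rng(0)
data = rng.standard_normal((100, 2))
fes_same = functional_equivalence_score(w_a[0], w_a[0], data)
```

Here only the first pair falls within tolerance once θ reaches 10°, so AST is 0.0 at θ = 5° and 0.5 at the other three thresholds.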
Task 2.4: Stability Analysis Across Training
Track how weak universality emerges during SAE training:
Save SAE checkpoints every 10M tokens during training
At each checkpoint, compute universality metrics (SVCCA, MCC)
between checkpoints from different architecture trainings
Identify whether architectures start similar and diverge, or start
different and converge
Test hypothesis: “Weak universality emerges early (first 50M tokens)
and remains stable, while strong universality (exact feature matching)
never fully develops”
Task 2.5: Controlled Analogy Experiments
Create synthetic scenarios where we know features should be analogous
but not identical:
Generate Model A with features [f₁, f₂, f₃] and Model B with
features [f₁, 0.9·f₂ + 0.1·f₃, 0.1·f₂ + 0.9·f₃] (controlled feature
mixing)
Train SAEs on both models and verify that:
Strong matching (exact feature recovery) succeeds for f₁
Weak matching (SVCCA) succeeds for the mixed features
Functional equivalence is preserved for all features
Use this as a calibration for real-world experiments: if methods
work on controlled analogies, they should work on natural analogies
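To see how "analogous but not identical" the mixed features are, the sketch below builds Model B from Model A under the assumption that f₁, f₂, f₃ are orthonormal (the plan does not state this) and measures the angle between each original feature and its mixed counterpart.

```python
import numpy as np

# Assumption: Model A's features are orthonormal basis vectors.
features_a = np.eye(3)

# Rows give Model B's features as combinations of Model A's features.
mixing = np.array([
    [1.0, 0.0, 0.0],   # f1 unchanged
    [0.0, 0.9, 0.1],   # 0.9*f2 + 0.1*f3
    [0.0, 0.1, 0.9],   # 0.1*f2 + 0.9*f3
])
features_b = mixing @ features_a


def paired_cosine(a, b):
    """Cosine similarity between corresponding rows of a and b."""
    return np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))


cosines = paired_cosine(features_a, features_b)
angles_deg = np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))
```

Under this assumption the mixed features sit about 6.3° away from their originals (cosine 0.9/√0.82 ≈ 0.994), which falls between the 5° and 10° tolerances tested for AST in Task 2.3.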
Expected Outcomes:
Quantitative spectrum from strong to weak universality (0.0 = no
similarity to 1.0 = identical features)
Understanding of which architectures are “natural analogues” (high
SVCCA despite different features)
Ensemble methods achieving MCC>0.75 and MSE<0.05
simultaneously by combining specialists
3. Superposition as a Source of Non-Identical Features
Implementation Overview
This research direction investigates how superposition creates a
fundamental ambiguity in feature decomposition, leading to multiple
valid but non-identical solutions. We implement experiments that
directly manipulate superposition levels and measure the resulting
decomposition diversity.
Detailed Implementation Steps
Task 3.1: Superposition Spectrum Experiments
Extend SynthSAEBench to generate models across a fine-grained
superposition spectrum:
Create 20 synthetic models with ρₘₘ ranging from 0.05 to 0.40 in
increments of ~0.018
For each model, train 3 seeds each of 5 SAE architectures (Standard
L1, TopK, Matryoshka, JumpReLU, Matching Pursuit)
For each (superposition_level, architecture) pair, compute:
Decomposition Diversity (DD): Standard deviation of
MCC scores across the 3 seeds - high DD indicates superposition creates
multiple valid solutions
Architecture-Specific Response (ASR): Compare how
different architectures respond to the same superposition level - do
some architectures handle high superposition better?
Task 3.2: Mechanistic Analysis of MP-SAE Overfitting
Implement detailed analysis of the discovered phenomenon where
Matching Pursuit SAEs overfit superposition noise:
For models with high superposition (ρₘₘ > 0.2), train MP-SAEs and
Standard L1 SAEs
Extract specific examples where:
MP-SAE achieves lower reconstruction error
But has lower MCC (worse ground-truth feature recovery)
Visualize the weight vectors: show that MP-SAE splits one
ground-truth feature into multiple SAE features that collectively
reconstruct better but individually correspond worse
Implement a “superposition exploitation score”: Measure the average
number of SAE features needed to reconstruct each ground-truth feature -
MP-SAEs should have higher scores
Task 3.3: Optimal Decomposition Under Superposition
Formulate and solve the theoretical problem of optimal feature
decomposition under superposition:
Given: Ground-truth features D with known superposition ρₘₘ
Find: SAE decoder W that optimizes a weighted combination of
reconstruction and feature recovery: L = α·MSE + (1-α)·(1-MCC)
Solve this optimization problem for various α values (0.0, 0.25,
0.5, 0.75, 1.0) and compare solutions to actual SAE architectures
Hypothesis: Different SAE architectures implicitly optimize
different α values (MP-SAEs ≈ α=1.0, Matryoshka ≈ α=0.3)
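The objective L = α·MSE + (1-α)·(1-MCC) can be explored on a small scale before solving the full optimization. The sketch below picks, for each α, whichever of two candidate solutions has the lower loss; the candidate names and their (MSE, MCC) values are hypothetical placeholders, not results.

```python
def combined_loss(mse, mcc, alpha):
    """L = alpha * MSE + (1 - alpha) * (1 - MCC)."""
    return alpha * mse + (1.0 - alpha) * (1.0 - mcc)


# Hypothetical (MSE, MCC) pairs for two kinds of solution.
candidates = {
    "mp_like": (0.02, 0.55),          # better reconstruction, worse recovery
    "matryoshka_like": (0.05, 0.78),  # worse reconstruction, better recovery
}


def best_for(alpha):
    """Name of the candidate with the lowest combined loss at this alpha."""
    return min(candidates,
               key=lambda name: combined_loss(*candidates[name], alpha))


preferred = {alpha: best_for(alpha) for alpha in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

For these placeholder numbers the preference flips only near α ≈ 0.88, because raw MSE and 1-MCC live on very different scales. If that holds for real runs, MSE may need normalizing (for example, by the variance of the activations) before the α values are comparable across architectures.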
Task 3.4: Superposition-Aware Feature Matching
Develop matching algorithms that explicitly account for
superposition:
Traditional matching: One-to-one Hungarian algorithm
Superposition-aware matching: Allow many-to-many correspondences
with weights
Implement “Soft Correspondence Matrix” S where S_ij ∈ [0,1]
represents the strength of correspondence between SAE feature i and
ground-truth feature j
Use this to compute “Superposition-Adjusted MCC” (SA-MCC) that
credits partial correspondences
Test whether SA-MCC better correlates with functional feature
quality than standard MCC
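One possible way to make S and SA-MCC concrete is sketched below; the plan does not fix a definition, so this is an assumption. S_ij is taken to be the absolute cosine similarity between SAE feature i and ground-truth feature j, and each ground-truth feature is credited with the norm of its projection onto the span of its k most similar SAE features.

```python
import numpy as np


def _unit_rows(x):
    """Normalize each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def soft_correspondence(sae_decoder, gt_features):
    """S[i, j] in [0, 1]: |cosine| between SAE feature i and ground-truth feature j."""
    return np.abs(_unit_rows(sae_decoder) @ _unit_rows(gt_features).T)


def sa_mcc(sae_decoder, gt_features, k=2):
    """Mean over ground-truth features of the projection norm onto the span
    of the k SAE features with the highest soft correspondence."""
    sae = _unit_rows(sae_decoder)
    gt = _unit_rows(gt_features)
    s = soft_correspondence(sae, gt)
    scores = []
    for j in range(gt.shape[0]):
        top = np.argsort(s[:, j])[-k:]
        basis = sae[top].T                                   # (D, k)
        coef, *_ = np.linalg.lstsq(basis, gt[j], rcond=None)
        scores.append(np.linalg.norm(basis @ coef))
    return float(np.mean(scores))


# One ground-truth feature that the SAE splits across two latents.
gt = np.array([[1.0, 0.0, 0.0]])
sae = np.array([[1.0, 1.0, 0.0], [1.0, -1.0, 0.0], [0.0, 0.0, 1.0]])
one_to_one = sa_mcc(sae, gt, k=1)   # best single latent only
many_to_one = sa_mcc(sae, gt, k=2)  # credits the split pair jointly
```

With k=1 the score reduces to the best single cosine match (about 0.707 here), while k=2 gives full credit because the two split latents jointly span the ground-truth direction. This is the feature-splitting pattern described for MP-SAEs in Task 3.2.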
Task 3.5: Cross-Model Superposition Transfer
Test whether SAEs trained on high-superposition models transfer
better to other high-superposition models:
Create three model types: Low Superposition (LS, ρₘₘ=0.08), Medium
Superposition (MS, ρₘₘ=0.15), High Superposition (HS, ρₘₘ=0.25)
Train SAEs on LS₁ and test transferability to LS₂, MS₁, HS₁
Train SAEs on HS₁ and test transferability to HS₂, MS₁, LS₁
Hypothesis: “HS→HS transfer succeeds better than LS→HS transfer”
(SAEs learn superposition-handling strategies that transfer within
regime)
Task 3.6: Validation on Real LLMs
Estimate superposition levels in real LLMs and test predictions:
Use the eigenvalue decay of activation covariance matrices as a
proxy for superposition (fast decay = low superposition)
Classify real LLM layers into estimated superposition regimes
Test whether SAE transferability between real models correlates with
similarity of estimated superposition levels
Example: Pythia-70m layer 3 and Pythia-160m layer 5 should have
similar superposition if they show high universality in the original
paper
Expected Outcomes:
Quantitative relationship: “Each 0.05 increase in ρₘₘ reduces
expected MCC by ~0.08 and increases decomposition diversity by
~0.12”
Superposition-aware matching algorithms improving effective MCC by
15-20% in high-superposition regimes
Validation that real-world universality patterns match synthetic
superposition predictions
4. Hierarchy and Correlation Create Analogous Features
Implementation Overview
This research direction explores how hierarchical feature
relationships and correlation patterns create structured feature spaces
that are similar across models even when individual features differ. We
implement experiments that independently manipulate hierarchy and
correlation to measure their effects on analogous universality.
Detailed Implementation Steps
Task 4.1: Hierarchical Structure Experiments
Systematically vary hierarchical properties in synthetic models:
Hierarchy Depth Sweep: Create models with hierarchy
depths 0 (flat), 1, 2, 3, 4, 5 levels while keeping total features
constant (N=16384)
Branching Factor Sweep: For fixed depth=3, vary
branching factors from 2 (binary trees) to 8 (8-ary trees)
Mutual Exclusivity Variations: Test 0%, 50%, 100%
mutual exclusivity among sibling features
For each configuration:
Train SAEs from multiple architectures
Measure whether hierarchical structure is preserved: compute
“Hierarchy Recovery Score (HRS)” = percentage of parent-child
relationships in ground truth that are also found in nearest-neighbor
relationships in SAE feature space
Test cross-model transfer: Do SAEs trained on depth-3 models
transfer better to other depth-3 models than to depth-5 models?
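HRS can be sketched as a nearest-neighbor check. The version below assumes each ground-truth feature has already been matched to one SAE decoder direction (row i of directions belongs to ground-truth feature i) and counts a parent-child relationship as recovered when the parent is among the child's k nearest neighbors by cosine similarity. The neighborhood size k is not specified in the plan and is an assumption here.

```python
import numpy as np


def hierarchy_recovery_score(parent_of, directions, k=1):
    """Fraction of (child -> parent) pairs whose parent is among the child's
    k nearest neighbors in SAE feature space (cosine similarity)."""
    d = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    sim = d @ d.T
    np.fill_diagonal(sim, -np.inf)   # a feature is not its own neighbor
    hits = 0
    for child, parent in parent_of.items():
        neighbors = np.argsort(sim[child])[-k:]
        hits += int(parent in neighbors)
    return hits / len(parent_of)


# Toy layout: features 1 and 3 sit close to features 0 and 2 respectively.
directions = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
hrs_correct = hierarchy_recovery_score({1: 0, 3: 2}, directions)
hrs_half = hierarchy_recovery_score({1: 2, 3: 2}, directions)
```

In the toy layout both true relationships are recovered (HRS = 1.0); if the ground truth instead said feature 1 were a child of feature 2, only one of the two relationships would be found (HRS = 0.5).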
Task 4.2: Semantic Subspace Analysis
Implement detailed analysis of the semantic subspace phenomenon
discovered in the universality paper:
Use the 8 concept categories from the universality paper (Time,
Calendar, Nature, People/Roles, Emotions, MonthNames, Countries,
Biology)
For each category in synthetic data:
Assign ground-truth features to categories (e.g., 200 “time”
features including hierarchical relationships like
“decade→year→month→day”)
Train SAEs and identify which SAE features correspond to each
category
Measure Subspace Coherence Score (SCS): Are
category features clustered in SAE space? Compute average pairwise
cosine similarity within vs. between categories
Generate model pairs with identical category structures but
different individual features
Test hypothesis: “Category subspaces show higher SVCCA similarity
than the overall feature space”
Task 4.3: Correlation Pattern Experiments
Use the low-rank correlation structure from SynthSAEBench to study
correlation effects:
Generate models with correlation rank r ∈ {10, 25, 50, 100, 200} and
correlation scale s ∈ {0.025, 0.050, 0.075, 0.100, 0.150}
For each (r, s) configuration:
Train SAEs and measure “Correlation Recovery” = percentage of highly
correlated feature pairs (correlation > 0.5) in ground truth that are
also correlated in SAE feature activation space
Measure how correlation affects feature matching: Do highly
correlated features in Model 1 map to highly correlated features in
Model 2?
Create “correlation transfer tasks”: Train on Model 1, use learned
correlations to improve feature matching in Model 2
Task 4.4: Hierarchical and Correlation Interaction Studies
Test whether hierarchy and correlation have synergistic or
antagonistic effects:
Measure how each factor independently and jointly affects:
Feature recovery (MCC, F1)
Cross-model universality (SVCCA, RSA)
Semantic subspace preservation
Hypothesis: “Hierarchy and correlation both improve universality,
and their effects are super-additive (combined effect > sum of
individual effects)”
Task 4.5: Transfer Learning Using Structure
Implement structure-aware transfer learning methods:
Hierarchy-Guided Matching: When matching features
between models, prefer matches that preserve parent-child relationships
(if feature A matches feature X, and A’s children are B and C,
prioritize matching B and C to X’s children)
Correlation-Aware Transformation: Learn
transformation matrices T that preserve correlation structure: minimize
||corr(W₁) - corr(T(W₂))||
Semantic Subspace Alignment: For each semantic
category, learn a separate transformation and blend them for overall
feature matching
Compare structure-aware methods to structure-agnostic baselines
Task 4.6: Real-World Semantic Validation
Apply semantic subspace analysis to real LLMs:
Use the exact semantic categories from the universality paper’s
experiments
For model pairs shown to have high universality (e.g.,
Pythia-70m/160m layers 2-3 vs 4-7), test predictions:
Semantic subspaces should have higher SVCCA than random subspaces of
same size (already shown in original paper)
Categories with more hierarchical structure (like “Calendar” with
its year→month→day hierarchy) should show stronger universality than
flat categories
Correlation patterns within categories should be preserved across
models
Expected Outcomes:
Quantification: “Hierarchical structure accounts for ~35% of
observed universality, correlation patterns for ~25%, with ~20% from
their interaction”
Structure-aware transfer methods improving matching accuracy by
20-30% over structure-agnostic baselines
Discovery of which types of semantic structure are most universal
(prediction: hierarchies with mutual exclusivity show strongest
universality)
5. The Precision-Recall Trade-off Validates Transformation-Based
Similarity
Implementation Overview
This research direction leverages the precision-recall trade-off
observed in SynthSAEBench to understand and optimize
transformation-based similarity measures. We implement experiments that
show why rotation-invariant measures are necessary and how to optimize
them.
Comprehensively map the precision-recall trade-off across all
conditions:
Train SAEs at L0 values: {5, 10, 15, 20, 25, 30, 35, 40, 45, 50}
using the L0 controller from SynthSAEBench
For each L0 and architecture combination, compute:
Per-feature precision and recall (treating each SAE feature as a
binary classifier for its matched ground-truth feature)
Aggregate precision-recall curves
F1-score, but also F_β scores for β ∈ {0.5, 2.0} to emphasize
precision or recall
Create a “Pareto frontier” in (precision, recall) space showing the
optimal trade-offs achievable by each architecture
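Per-feature precision, recall, and F_β follow the standard binary-classification definitions. The sketch below assumes that both the SAE latent and its matched ground-truth feature have been binarized into "active / not active" per sample; the function name and example vectors are illustrative.

```python
import numpy as np


def precision_recall_fbeta(pred_active, true_active, beta=1.0):
    """Treat one SAE latent as a binary classifier for its matched feature."""
    pred = np.asarray(pred_active, dtype=bool)
    true = np.asarray(true_active, dtype=bool)
    tp = int(np.sum(pred & true))
    fp = int(np.sum(pred & ~true))
    fn = int(np.sum(~pred & true))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    fbeta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, fbeta


# Eight samples: the latent fires on 3, the ground-truth feature on 4.
latent_fires = [1, 1, 1, 0, 0, 0, 0, 0]
feature_on = [1, 1, 0, 1, 1, 0, 0, 0]
p, r, f1 = precision_recall_fbeta(latent_fires, feature_on, beta=1.0)
_, _, f_half = precision_recall_fbeta(latent_fires, feature_on, beta=0.5)
_, _, f_two = precision_recall_fbeta(latent_fires, feature_on, beta=2.0)
```

Because this latent has higher precision (2/3) than recall (1/2), F_0.5 rewards it more than F_1, and F_2 penalizes it; the ordering reverses for a high-recall, low-precision latent.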
Task 5.2: Basis-Dependent vs Basis-Independent Quality
Demonstrate that precision-recall trade-offs are basis-dependent
while geometric similarity is basis-independent:
Take two SAEs trained on the same data that achieve different
precision-recall trade-offs (e.g., low-L0 SAE with high precision,
high-L0 SAE with high recall)
Apply rotation matrices R to one SAE’s features: W_rotated =
R·W
Show that:
Precision and recall change dramatically under rotation
(basis-dependent)
SVCCA and RSA remain constant under rotation
(basis-independent)
This demonstrates why transformation-based measures are necessary
for fair comparison
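The contrast can be illustrated on random data before running it on trained SAEs. In the sketch below, features are stored as rows, so the rotation is applied as W·Rᵀ rather than R·W. The Gram matrix of feature directions (the object that RSA-style comparisons are built on) stays fixed under an orthogonal transformation, while the per-feature cosine to the unrotated feature, which any one-to-one matching score depends on, does not. SVCCA itself is not implemented here.

```python
import numpy as np


def unit_rows(x):
    """Normalize each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


rng = np.random.default_rng(0)
w = unit_rows(rng.standard_normal((6, 4)))            # 6 features in R^4
orthogonal, _ = np.linalg.qr(rng.standard_normal((4, 4)))
w_rotated = w @ orthogonal.T                          # rotate every feature

# Basis-independent: pairwise similarity structure is unchanged.
gram_before = w @ w.T
gram_after = w_rotated @ w_rotated.T

# Basis-dependent: each feature no longer lines up with its original.
per_feature_cosine = np.sum(w * w_rotated, axis=1)
```

The same check can be repeated with real decoder matrices: any metric computed from gram_before alone is unaffected by the rotation, whereas precision and recall against fixed ground-truth directions would have to be recomputed.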
Task 5.3: Optimal Transformation Learning
Implement learnable transformations that maximize feature
correspondence while respecting geometric structure:
Linear Transformation Learning:
Learn T ∈ ℝ^(L×L) that maps features from SAE₁ to SAE₂
Test whether constrained transformations generalize better to
out-of-distribution model pairs
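One constrained option is an orthogonal transformation fitted by orthogonal Procrustes. The sketch below assumes T ∈ ℝ^(L×L) acts on latent activations (rows are samples, columns are the L latents), which is one reading of the task; the plan does not say which constraints it has in mind. The activations are Gaussian placeholders; real SAE activations are sparse and non-negative.

```python
import numpy as np


def fit_orthogonal_map(acts1, acts2):
    """Orthogonal T minimizing ||acts1 @ T - acts2||_F.

    Closed form (orthogonal Procrustes): with U S V^T = SVD(acts1^T acts2),
    the minimizer is T = U V^T.
    """
    u, _, vt = np.linalg.svd(acts1.T @ acts2)
    return u @ vt


rng = np.random.default_rng(0)
n_samples, n_latents = 200, 6

# Placeholder activations for SAE 1 and a known orthogonal map to SAE 2.
acts1 = rng.standard_normal((n_samples, n_latents))
true_t, _ = np.linalg.qr(rng.standard_normal((n_latents, n_latents)))
acts2 = acts1 @ true_t

t_hat = fit_orthogonal_map(acts1, acts2)
residual = np.linalg.norm(acts1 @ t_hat - acts2)
```

An unconstrained least-squares fit (np.linalg.lstsq) would serve as the baseline against which the generalization of the orthogonal fit is compared.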
Task 5.4: Transformation Quality Metrics
Develop comprehensive metrics for evaluating transformation
quality:
Correspondence Fidelity (CF): Using ground truth,
measure percentage of correct correspondences after transformation
Geometric Preservation (GP): Measure how well
pairwise distances are preserved: GP = correlation(||w_i - w_j||,
||T(w_i) - T(w_j)||)
Semantic Consistency (SC): For semantically labeled
features, measure whether semantic labels are preserved after
transformation
Reconstruction Preservation (RP): If features are
used for steering/intervention, measure whether functionality is
preserved: RP = correlation(steering_effect₁, steering_effect₂)
Create a composite “Transformation Quality Score” (TQS) as a weighted
combination: TQS = 0.4·CF + 0.3·GP + 0.2·SC + 0.1·RP. This composite is
distinct from the correlation-based TQS defined in Task 1.3.
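GP and the composite score can be sketched as follows. GP is computed as the Pearson correlation between pairwise Euclidean distances before and after the transformation; the input values passed to the composite are placeholders.

```python
import numpy as np


def geometric_preservation(w, w_transformed):
    """Pearson correlation of pairwise distances before and after T."""
    upper = np.triu_indices(len(w), k=1)   # each unordered pair once
    d_before = np.linalg.norm(w[:, None] - w[None, :], axis=-1)[upper]
    d_after = np.linalg.norm(
        w_transformed[:, None] - w_transformed[None, :], axis=-1)[upper]
    return float(np.corrcoef(d_before, d_after)[0, 1])


def composite_tqs(cf, gp, sc, rp):
    """TQS = 0.4*CF + 0.3*GP + 0.2*SC + 0.1*RP."""
    return 0.4 * cf + 0.3 * gp + 0.2 * sc + 0.1 * rp


rng = np.random.default_rng(0)
w = rng.standard_normal((5, 3))
orthogonal, _ = np.linalg.qr(rng.standard_normal((3, 3)))

gp_orthogonal = geometric_preservation(w, w @ orthogonal.T)
gp_scaled = geometric_preservation(w, 2.0 * w)
tqs_example = composite_tqs(cf=0.8, gp=gp_orthogonal, sc=0.5, rp=0.6)
```

Because GP is a correlation, it is 1.0 for any orthogonal map and also for a uniform rescaling, so on its own it cannot tell a rotation from a rotation plus scaling. The CF term is what ties the score to ground truth.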
Task 5.5: Trade-off-Aware Matching
Develop matching algorithms that account for the precision-recall
trade-off:
Confidence-Weighted Matching: For each feature
pair, compute a confidence score based on activation correlation and
precision-recall position
Multi-Threshold Matching: Instead of
single-threshold matching, use multiple thresholds and ensemble the
results
Calibrated Matching: Use synthetic data to
calibrate the relationship between activation correlation and true
feature correspondence, then apply calibration to real-world
matching
Task 5.6: L0-Specific Transformation Learning
Test whether optimal transformations depend on the L0 regime of the
SAEs:
Train SAEs at low L0 (≈15), medium L0 (≈25), high L0 (≈35)
Learn separate transformations for each L0 regime
Test cross-regime transfer: Does a transformation learned for low-L0
SAEs work for high-L0 SAEs?
Hypothesis: “Transformations learned at medium L0 generalize best to
other L0 regimes”
Task 5.7: Real-World Transformation Validation
Apply transformation learning to real LLM pairs:
Use Pythia-70m/160m and Gemma-1-2B/2-2B pairs from the universality
paper
Train transformations on layers where universality is known to be
strong (e.g., middle layers)
Validate on held-out layers
Test whether transformations enable transfer of interpretability
artifacts:
Extract a steering vector in Model 1
Apply learned transformation to predict corresponding steering
vector in Model 2
Test whether predicted steering vector actually works in Model
2
Expected Outcomes:
Transformation learning improving feature correspondence from
baseline 60% to 75-80%
Demonstration that precision-recall trade-off explains ~40% of why
different SAEs learn different features
Successful transfer of steering vectors across models with >0.7
correlation in steering effects
6. Dead Latents and Feature Coverage
Implementation Overview
This research direction addresses the observation that SAEs
incompletely cover the feature space, with different SAEs covering
different subsets. We implement experiments that characterize feature
coverage patterns and develop methods to achieve more complete
coverage.
Detailed Implementation Steps
Task 6.1: Dead Latent Characterization
Systematically study dead latent patterns across architectures and
conditions:
Train SAEs across all architectures on SynthSAEBench-16k with
multiple seeds
Track dead latent counts throughout training (checkpoints every 10M
tokens)
For each architecture and condition:
Dead Latent Rate (DLR): Percentage of latents that
never activate above threshold
Dead Latent Stability (DLS): What percentage of
latents that are dead at 100M tokens were also dead at 50M tokens?
(Measures whether dead latents “die early”)
Dead Latent Recovery (DLRec): Can we revive dead
latents through continued training, auxiliary losses, or
re-initialization?
Analyze which ground-truth features are NOT captured by any SAE
features: create a “feature discovery gap” metric
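DLR and DLS can be computed directly from activation matrices collected at each checkpoint. The sketch below assumes activations are stored as (samples × latents) arrays and uses a threshold of 0.0 by default; both choices, and the tiny example arrays, are illustrative.

```python
import numpy as np


def dead_mask(acts, threshold=0.0):
    """True for latents that never exceed `threshold` on any sample."""
    return ~(np.asarray(acts) > threshold).any(axis=0)


def dead_latent_rate(acts, threshold=0.0):
    """Fraction of latents that are dead on this evaluation batch."""
    return float(dead_mask(acts, threshold).mean())


def dead_latent_stability(dead_early, dead_late):
    """Fraction of latents dead at the late checkpoint that were already
    dead at the early checkpoint."""
    dead_early = np.asarray(dead_early, dtype=bool)
    dead_late = np.asarray(dead_late, dtype=bool)
    if not dead_late.any():
        return float("nan")
    return float(dead_early[dead_late].mean())


# Placeholder activations (2 samples x 4 latents) at two checkpoints.
acts_50m = np.array([[0.0, 1.2, 0.0, 0.3],
                     [0.0, 0.0, 0.0, 0.9]])
acts_100m = np.array([[0.0, 0.8, 0.0, 0.0],
                      [0.0, 0.5, 0.0, 0.0]])

dlr_50m = dead_latent_rate(acts_50m)
dlr_100m = dead_latent_rate(acts_100m)
dls = dead_latent_stability(dead_mask(acts_50m), dead_mask(acts_100m))
```

In this toy case two of the three latents that are dead at 100M tokens were already dead at 50M, so DLS is 2/3. A real evaluation batch must be large enough that rarely firing latents are not misclassified as dead.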
Task 6.2: Coverage Complementarity Analysis
Measure the extent to which different SAE architectures cover
different feature subsets:
For each pair of architectures (A, B), compute:
Coverage Overlap (CO): Percentage of ground-truth
features captured by both A and B
Unique Coverage (UC_A, UC_B): Percentage of
ground-truth features captured by A but not B (and vice versa)
Union Coverage (UnC): Percentage of ground-truth
features captured by at least one of A or B
Hypothesis: “Different architectures have UC ≈ 15-25%, meaning they
capture genuinely different features”
Create visualizations showing which types of features each
architecture is best at capturing (e.g., JumpReLU might excel at rare
features, Matryoshka at hierarchically organized features)
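The four coverage quantities reduce to set operations once each architecture's captured ground-truth features are known. The sketch below assumes "captured" has already been decided elsewhere (for example, by an MCC threshold) and that features are identified by integer index.

```python
def coverage_metrics(captured_a, captured_b, n_features):
    """Coverage Overlap, Unique Coverage, and Union Coverage as fractions
    of all ground-truth features."""
    a, b = set(captured_a), set(captured_b)
    return {
        "CO": len(a & b) / n_features,
        "UC_A": len(a - b) / n_features,
        "UC_B": len(b - a) / n_features,
        "UnC": len(a | b) / n_features,
    }


# Placeholder: 10 ground-truth features, two architectures.
metrics = coverage_metrics(captured_a={0, 1, 2, 3, 4, 5},
                           captured_b={4, 5, 6, 7},
                           n_features=10)
```

By construction UnC = CO + UC_A + UC_B, which gives a quick consistency check on any reported table of these metrics.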
Task 6.3: Ensemble Coverage Methods
Develop methods to combine multiple SAEs for more complete feature
coverage:
Simple Union Ensemble: Concatenate features from
multiple SAE architectures, remove duplicates based on high correlation
(>0.9)
Weighted Ensemble: For each ground-truth feature,
identify which SAE architecture captures it best, weight architectures
accordingly
Complementary Training: Train SAEs sequentially
where each subsequent SAE is encouraged (via auxiliary loss) to capture
features missed by previous SAEs
Coverage-Aware Architecture Search: Use synthetic
data to identify which architecture combinations achieve highest union
coverage with minimal redundancy
Task 6.4: Active Feature Discovery
Implement active learning methods that iteratively improve feature
coverage:
Start with a trained SAE (with dead latents)
Identify dataset examples that are poorly reconstructed (high
MSE)
Analyze these examples to identify which ground-truth features are
active but not captured
Re-initialize dead latents to target these missing features
Continue training with emphasis on improving reconstruction of
poorly-reconstructed examples
Iterate until coverage plateaus
Task 6.5: Feature Rarity vs Coverage Analysis
Test whether feature rarity (firing frequency) predicts coverage
failures:
Divide ground-truth features into deciles by firing frequency
For each decile, measure what percentage of features are
successfully captured (MCC > 0.5) by SAEs
Hypothesis: “Rare features (bottom 20% by frequency) are captured
<50% of the time, while common features (top 20%) are captured
>85% of the time”
Test whether architectural choices differentially affect rare
vs. common feature coverage:
Prediction: TopK SAEs better at common features due to explicit
top-k selection
Prediction: L1 SAEs better at rare features due to continuous
optimization
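The decile analysis can be sketched as follows, assuming one firing frequency and one best-match MCC per ground-truth feature. The input arrays are synthetic placeholders constructed so that rarer features score lower.

```python
import numpy as np


def coverage_by_decile(firing_freq, best_mcc, threshold=0.5):
    """Capture rate (best MCC > threshold) per firing-frequency decile,
    ordered from rarest decile to most common."""
    firing_freq = np.asarray(firing_freq, dtype=float)
    best_mcc = np.asarray(best_mcc, dtype=float)
    order = np.argsort(firing_freq)        # rarest features first
    deciles = np.array_split(order, 10)    # ten near-equal groups
    return [float(np.mean(best_mcc[idx] > threshold)) for idx in deciles]


# Placeholder data: 20 features, MCC rising with firing frequency.
firing_freq = np.arange(1, 21) / 100.0
best_mcc = np.linspace(0.1, 0.9, 20)
coverage = coverage_by_decile(firing_freq, best_mcc)
```

The hypothesis above corresponds to the mean of the first two entries of coverage falling below 0.5 and the mean of the last two exceeding 0.85; running the function per architecture gives the TopK versus L1 comparison.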
Task 6.6: Coverage Transfer Across Models
Test whether feature coverage patterns transfer across models:
Train SAEs on synthetic Model 1, identify which features are
consistently missed (across architectures and seeds)
Train SAEs on synthetic Model 2 (similar but distinct), check if the
same types of features are missed
If patterns transfer, develop “coverage prediction models” that
predict which features will be hard to capture based on feature
properties (rarity, hierarchy position, correlation with other
features)
Apply predictions to real LLMs: predict which types of features will
be poorly covered by SAEs
Task 6.7: Minimum Coverage Requirements
Determine what level of feature coverage is sufficient for practical
interpretability:
Create synthetic models where ground truth is known
Train SAEs achieving various coverage levels (40%, 60%, 80%,
95%)
Test downstream interpretability tasks with each coverage level:
Circuit discovery: Can we identify a ground-truth circuit using SAE
features?
Steering vector effectiveness: How well do steering vectors work
when based on incomplete feature coverage?
Causal interventions: What accuracy do we achieve in causal
intervention experiments?
Establish empirically: “X% feature coverage is required for Y%
accuracy on task Z”
Expected Outcomes:
Discovery that dead latents are not random: specific feature types
(rare + hierarchically deep + highly correlated) are systematically
missed
Ensemble methods that achieve >85% coverage with only 2-3
architectures (vs. 70% for best single architecture)
7. Quantitative Validation of Analogous Universality Hypothesis
Implementation Overview
This research direction provides rigorous quantitative validation of
the analogous universality hypothesis using the controlled environment
of synthetic benchmarks. We implement comprehensive experiments that
measure universality at multiple levels of granularity.
Detailed Implementation Steps
Task 7.1: Multi-Level Universality Measurement
Implement a hierarchy of universality metrics from strongest
(identical features) to weakest (unrelated features):
Level 1 - Strong Universality (Identical Features):
Measure percentage of features with MCC > 0.95
Level 2 - Feature-Level Analogous Universality:
Percentage with MCC > 0.7 (highly similar but not identical)
Level 3 - Geometric Analogous Universality:
Percentage of feature pairs with high SVCCA (>0.6) between their
neighborhoods (local geometric structure preserved)
Level 4 - Functional Analogous Universality:
Percentage of features with high activation correlation (>0.7) across
same dataset
Level 5 - Semantic Analogous Universality:
Percentage of features that activate on the same semantic category (even
if activation correlation is lower)
Level 6 - Weak/No Universality: Features that fail
all above criteria
For each synthetic configuration and real model pair, report the
distribution across these levels.
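The level assignment can be sketched as a cascade that gives each feature the strongest level whose criterion it meets. The thresholds are the ones listed above; the argument names and the example features are illustrative, and the per-feature inputs are assumed to have been computed elsewhere.

```python
from collections import Counter


def universality_level(mcc, neighborhood_svcca, activation_corr, same_category):
    """Return the strongest universality level (1-6) a feature satisfies."""
    if mcc > 0.95:
        return 1          # strong universality
    if mcc > 0.7:
        return 2          # feature-level analogous
    if neighborhood_svcca > 0.6:
        return 3          # geometric analogous
    if activation_corr > 0.7:
        return 4          # functional analogous
    if same_category:
        return 5          # semantic analogous
    return 6              # weak / no universality


# Placeholder features: (mcc, neighborhood SVCCA, activation corr, same category)
example_features = [
    (0.97, 0.9, 0.9, True),
    (0.80, 0.7, 0.8, True),
    (0.50, 0.65, 0.4, True),
    (0.50, 0.30, 0.75, False),
    (0.20, 0.10, 0.20, True),
    (0.20, 0.10, 0.20, False),
]
levels = [universality_level(*feature) for feature in example_features]
distribution = Counter(levels)
```

Assigning only the strongest level keeps the reported distribution summing to 100%; reporting the fraction of features that pass each criterion independently would be a complementary view.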
Task 7.2: Synthetic-to-Real Validity Testing
Rigorously test whether findings from synthetic data transfer to real
LLMs:
Identify 10 key findings from synthetic experiments (e.g., “middle
layers show higher universality than early/late layers”)
For each finding, formulate a testable prediction for real LLMs
Test predictions on 5 real model pairs: Pythia-70m/160m,
Gemma-1-2B/2-2B, Gemma-2-2B/9B, Llama-3/3.1, GPT-2-small/medium
Compute “prediction accuracy” = percentage of synthetic-derived
predictions that hold in real data
Goal: Achieve >80% prediction accuracy, validating that synthetic
benchmarks are realistic
Task 7.3: Universality Spectrum Experiments
Map the complete spectrum from no universality to perfect
universality:
Generate synthetic model pairs with controlled universality levels:
0% universality: Completely independent models with no shared
features
25% universality: Models share 25% of features, 75% are unique
50% universality: Equal mix of shared and unique features
75% universality: Most features shared, some unique
For each universality level, train SAEs and measure:
What SVCCA/RSA scores result from different true universality
levels?
How does SAE architecture affect the observed universality (e.g., do
some architectures “find” more universality)?
Create calibration curves: Given observed SVCCA=0.6, what is the
estimated true universality level (with confidence intervals)?
Task 7.4: Layer-Wise Universality Patterns
Validate the layer-wise universality patterns found in the original
universality paper:
Generate synthetic “multi-layer models” where early layers have high
superposition, middle layers have moderate superposition, late layers
have low superposition but high hierarchy
Train layer-specific SAEs
Verify synthetic models reproduce the pattern: “middle layers show
highest universality”
Mechanistically explain why: “Middle layers balance feature
distinguishability (enough to avoid superposition confusion) with
feature generality (not yet specialized to specific tasks)”
Task 7.5: Cross-Model-Family Universality
Test universality across model families (not just within
families):
Generate synthetic models representing different “families”:
Family A: High superposition, shallow hierarchy
Family B: Low superposition, deep hierarchy
Family C: Medium superposition, no hierarchy but high
correlation
Train SAEs within and across families
Measure: Is universality higher within families than across
families?
Expected Outcomes:
Benchmark suite enabling standardized comparison of future
universality research
8. Methodological Alignment
Implementation Overview
This research direction focuses on unifying and extending the
methodologies from both papers into a comprehensive toolkit for studying
feature universality. We implement standardized protocols and shared
infrastructure.
Detailed Implementation Steps
Task 8.1: Unified Feature Matching Pipeline
Develop a single pipeline that implements all feature matching
methods:
Build shared infrastructure to enable both research directions:
SyntheticModelZoo: Repository of pre-generated
synthetic models with various properties
Models covering full parameter space: superposition × hierarchy ×
correlation
Pre-computed ground-truth metadata
Versioned and immutable for reproducibility
SAE Model Zoo: Repository of pre-trained SAEs
SAEs for all major architectures on standard synthetic models
SAEs for common real model pairs (Pythia, Gemma, Llama)
Standardized naming and metadata
Evaluation Harness: Unified evaluation framework
Load any SAE + any ground-truth or comparison SAE
Run all applicable metrics
Generate standardized reports and visualizations
Experiment Tracking Integration:
Weights & Biases integration for experiment logging
Automatic hyperparameter tracking
Visualization dashboards for comparing runs
Task 8.5: Cross-Validation Framework
Implement methods to validate one paper’s findings with the other’s
methods:
Validation 1: Use SynthSAEBench metrics (MCC, F1)
to validate universality paper’s claim that high SVCCA indicates good
features
Train SAEs on synthetic data where ground truth is known
Show: “High SVCCA between two SAEs correlates with both SAEs having
high MCC to ground truth”
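For reference, the SVCCA score used in Validation 1 can be computed as below. This is a sketch under stated assumptions: each SAE is represented by an activation matrix over the same inputs (rows are samples), and the 99% variance threshold is an illustrative default rather than a value taken from either paper.

```python
import numpy as np

def svcca(X, Y, var_kept=0.99):
    """SVCCA similarity between activation matrices X (n x d1) and Y (n x d2)."""
    def reduce(A):
        A = A - A.mean(axis=0)                      # centre each column
        U, S, _ = np.linalg.svd(A, full_matrices=False)
        explained = np.cumsum(S**2) / np.sum(S**2)
        # smallest number of directions explaining var_kept of the variance
        k = int(np.searchsorted(explained, var_kept)) + 1
        return U[:, :k]  # orthonormal basis; rescaling would not change CCA
    Qx, Qy = reduce(np.asarray(X, float)), reduce(np.asarray(Y, float))
    # canonical correlations are the singular values of Qx^T Qy
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(np.clip(rho, 0.0, 1.0).mean())
```

Validation 1 would then compare this score for each pair of SAEs against the two SAEs' ground-truth MCC values, for example with a rank correlation across pairs.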
Validation 2: Use universality paper’s methods
(SVCCA, RSA) to validate SynthSAEBench’s finding about MP-SAE
overfitting
Show: “MP-SAEs have high within-architecture SVCCA (different seeds
converge to similar solutions) but lower cross-architecture SVCCA
(fundamentally different from other architectures)”
Validation 3: Use both frameworks to validate
semantic subspace findings
In synthetic data, define ground-truth semantic categories
Show: Universality paper’s semantic subspace SVCCA scores are higher
when SynthSAEBench’s HRS (Hierarchy Recovery Score) is also high
Task 8.6: Reproducibility Infrastructure
Ensure all research is fully reproducible:
Containerization: Docker containers with all
dependencies
Configuration Management: Use Hydra or similar for
experiment configuration
Data Version Control: DVC for managing datasets and
model checkpoints
Continuous Integration: Automated testing of all
pipelines
Documentation: Comprehensive tutorials and API
documentation
Public Release: Open-source repository with
pre-commit hooks, code formatting, type hints
Task 8.7: Benchmark Standardization
Create official benchmark tasks and leaderboards:
Task 1: Feature Recovery on SynthSAEBench-16k
Metric: MCC
Current best: ~0.75 (Matryoshka SAE)
Task 2: Cross-Model Universality on Pythia-70m/160m
Metric: SVCCA at layer pairs
Current best: ~0.68 (middle layer pairs)
Task 3: Transformation Learning on Synthetic Pairs
Metric: Ground-truth correspondence accuracy after
transformation
Target: >0.80
Task 4: Semantic Subspace Preservation
Metric: Mean category subspace SVCCA
Current best: ~0.60
Publish official leaderboards and provide submission
infrastructure.
Expected Outcomes:
Unified codebase reducing duplication and enabling direct method
comparison
10+ standardized protocols enabling reproducible universality
research
Public benchmarks with leaderboards driving community progress
At least 3 external research groups using the infrastructure within
12 months of release
9. The Critical Insight: Why Analogous, Not Identical?
Implementation Overview
This research direction addresses the fundamental theoretical
question: Why do SAEs learn analogous rather than identical features? We
implement experiments that test mechanistic hypotheses about the origins
of feature analogy.
Detailed Implementation Steps
Task 9.1: Superposition Decomposition Ambiguity Theory
Formalize and test the theory that superposition creates fundamental
ambiguity in optimal feature decomposition:
Theoretical Framework:
Given features in superposition with overlap matrix O where O_ij =
|d_i^T d_j| (absolute cosine similarity)
Prove: When O is non-identity, there exist multiple factorizations
of activation space achieving identical reconstruction error
Derive: Number of “equivalent decompositions” as a function of
spectral properties of O
Empirical Validation:
Generate models with controlled superposition (vary ρₘₘ from 0.05 to
0.30)
For each model, train 10 SAE seeds
Measure “decomposition diversity” = variance in feature recovery
across seeds
Test prediction: “Decomposition diversity increases linearly with
ρₘₘ”
Uniqueness Conditions:
Identify conditions under which decomposition becomes unique
Test: “When ρₘₘ < 0.03, different SAE seeds converge to identical
features (MCC between seeds > 0.95)”
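The quantities in Task 9.1 can be sketched as follows. Two assumptions are made here and should be read as such: feature directions are stored as rows of a dictionary matrix, and "feature recovery" is approximated by a simplified score (best absolute cosine per true feature, with no one-to-one matching), which is a stand-in for the benchmark's MCC rather than its definition. All function names are hypothetical.

```python
import numpy as np

def unit_rows(M):
    """Normalise each row of M to unit length."""
    M = np.asarray(M, dtype=float)
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def overlap_matrix(D):
    """O_ij = |d_i^T d_j| for unit-normalised feature directions (rows of D)."""
    U = unit_rows(D)
    return np.abs(U @ U.T)

def recovery_score(decoder, truth):
    """Simplified stand-in for MCC: for each true feature, the best |cosine|
    over SAE latents, averaged over true features."""
    sim = np.abs(unit_rows(truth) @ unit_rows(decoder).T)
    return float(sim.max(axis=1).mean())

def decomposition_diversity(decoders, truth):
    """Variance across seeds of each seed's recovery score."""
    return float(np.var([recovery_score(W, truth) for W in decoders]))
```

With 10 decoders per superposition level, plotting decomposition_diversity against ρₘₘ gives the curve whose linearity the prediction concerns.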
Analyze the loss landscape to understand why different training runs
find different solutions:
Loss Landscape Visualization:
Train two SAE seeds to different solutions (low MCC between
them)
Interpolate in weight space: W(t) = (1-t)W₁ + t·W₂ for t ∈
[0,1]
Plot reconstruction loss, sparsity loss, and ground-truth MCC along
this path
Hypothesis: “Loss remains low along the entire path (suggesting flat
valley connecting multiple good solutions)”
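The interpolation experiment can be prototyped as below. This is a sketch, assuming a bias-free ReLU autoencoder as a deliberately simple stand-in for an SAE and tracking reconstruction loss only. The representation of an SAE as an (encoder, decoder) tuple is a hypothetical convention for this example.

```python
import numpy as np

def recon_loss(X, W_enc, W_dec):
    """Mean squared reconstruction error of a bias-free ReLU autoencoder."""
    Z = np.maximum(X @ W_enc, 0.0)
    return float(np.mean((Z @ W_dec - X) ** 2))

def interpolate_losses(X, sae1, sae2, steps=11):
    """Loss along W(t) = (1 - t) W1 + t W2, applied to encoder and decoder."""
    ts = np.linspace(0.0, 1.0, steps)
    losses = []
    for t in ts:
        W_enc = (1 - t) * sae1[0] + t * sae2[0]
        W_dec = (1 - t) * sae1[1] + t * sae2[1]
        losses.append(recon_loss(X, W_enc, W_dec))
    return ts, np.array(losses)
```

SAE latents are only defined up to permutation, so a loss barrier on the raw path may reflect mismatched latent indices rather than distinct basins. Matching latents between the two seeds before interpolating is one way to control for this.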
Local Minima Characterization:
Use random weight perturbations to estimate local curvature around
each solution
Measure: Are different solutions in the same basin of attraction or
different basins?
Test whether adding noise during training increases or decreases
decomposition diversity
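One simple way to turn random perturbations into a curvature estimate is sketched below, assuming the loss is available as a function of a flat weight vector. The name sharpness and the default noise scale are illustrative choices.

```python
import numpy as np

def sharpness(loss_fn, w, sigma=1e-2, n_samples=200, seed=0):
    """Mean loss increase under isotropic Gaussian perturbations of scale sigma.

    At a minimum of a smooth loss this is approximately
    0.5 * sigma**2 * trace(Hessian), so larger values mean a sharper basin.
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(w, dtype=float)
    base = loss_fn(w)
    deltas = [loss_fn(w + sigma * rng.standard_normal(w.shape)) - base
              for _ in range(n_samples)]
    return float(np.mean(deltas))
```

Comparing this value across solutions found by different seeds gives a coarse indication of whether they sit in basins of similar shape. It does not by itself show whether two solutions share a basin, which is what the interpolation and mode connectivity experiments address.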
Mode Connectivity:
Apply mode connectivity algorithms to find low-loss paths between
solutions
If such paths exist, this supports the view that the solutions are
“equally good” from an optimization perspective
Task 9.3: Architecture-Specific Biases
Identify what implicit biases different SAE architectures have that
lead them to different solutions:
Bias Analysis Framework:
For each architecture, identify its inductive bias (e.g., TopK
explicitly biases toward k most important features per sample)
Generate synthetic scenarios where different biases would favor
different decompositions
Example: Create features where half are rare-but-strong and half are
common-but-weak. TopK should favor rare-but-strong, while L1 should
balance both
Systematic Bias Testing:
Create 5 synthetic scenarios favoring different decomposition
strategies
Train all architectures on each scenario
Measure which architectures succeed on which scenarios
Build a “bias profile” for each architecture
Bias Complementarity:
Test whether architectures with complementary biases can be combined
for more complete coverage
Example: Combine TopK (good for rare-strong features) + L1 (good for
common features) to capture both
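The rare-but-strong versus common-but-weak scenario can be generated as below. This is a minimal sketch: the firing probabilities, magnitudes, and function name are illustrative assumptions, and feature directions are drawn at random rather than taken from a SynthSAEBench configuration.

```python
import numpy as np

def make_rare_common_data(n_samples, n_features, dim, seed=0,
                          p_rare=0.02, mag_rare=5.0,
                          p_common=0.30, mag_common=0.5):
    """First half of the features fire rarely but strongly, second half
    fire often but weakly. Returns activations, coefficients, dictionary."""
    rng = np.random.default_rng(seed)
    half = n_features // 2
    p = np.r_[np.full(half, p_rare), np.full(n_features - half, p_common)]
    mag = np.r_[np.full(half, mag_rare), np.full(n_features - half, mag_common)]
    D = rng.standard_normal((n_features, dim))
    D /= np.linalg.norm(D, axis=1, keepdims=True)   # unit-norm feature directions
    coeffs = (rng.random((n_samples, n_features)) < p) * mag
    return coeffs @ D, coeffs, D

X, coeffs, D = make_rare_common_data(n_samples=5000, n_features=8, dim=16)
```

Per-group recovery (rare half versus common half) can then be reported separately for each architecture to build its bias profile.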
Task 9.4: Information-Theoretic Analysis
Apply information theory to understand feature analogousness:
Mutual Information Framework:
Measure I(SAE_features; ground_truth_features) = mutual information
between SAE and ground-truth features
Also measure I(SAE₁_features; SAE₂_features) = mutual information
between two SAEs trained on same data
Hypothesis: “I(SAE₁; ground_truth) ≈ I(SAE₂; ground_truth) but
I(SAE₁; SAE₂) < I(SAE₁; ground_truth)” (both SAEs capture same amount
of information about ground truth, but encode it differently)
Information Bottleneck Perspective:
Frame SAE training as information bottleneck: minimize
I(SAE_features; input) while maximizing I(SAE_features;
ground_truth_features)
Different SAEs may find different trade-offs on this Pareto
frontier
Measure where different architectures fall on this frontier
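Mutual information between full feature sets is hard to estimate in high dimensions, so a practical starting point is a per-pair estimate on discretized activations. The sketch below is a plug-in estimator for two discrete variables, for example binary firing indicators of one SAE latent and one ground-truth feature. The discretization choice is an assumption, and plug-in estimates are biased upward on small samples.

```python
import numpy as np

def mutual_information(a, b):
    """Plug-in mutual information estimate (in bits) between two discrete 1-D arrays."""
    a = np.asarray(a).ravel()
    b = np.asarray(b).ravel()
    a_vals, a_idx = np.unique(a, return_inverse=True)
    b_vals, b_idx = np.unique(b, return_inverse=True)
    joint = np.zeros((len(a_vals), len(b_vals)))
    np.add.at(joint, (a_idx, b_idx), 1.0)       # contingency table of counts
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)       # marginal of a, shape (na, 1)
    pb = joint.sum(axis=0, keepdims=True)       # marginal of b, shape (1, nb)
    nz = joint > 0                              # skip empty cells (0 * log 0 = 0)
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])))
```

Applied to matched latent pairs, this gives per-feature versions of the three quantities in the hypothesis above.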
Task 9.5: Causal Mechanisms of Analogy
Identify causal factors that create analogous vs. identical
features:
Controlled Intervention Experiments:
Start with a baseline condition yielding analogous features
Systematically remove potential causes: eliminate superposition → do
features become identical? Eliminate hierarchy → does analogy
remain?
Build causal graph: superposition → decomposition ambiguity →
analogous features
Sufficiency and Necessity Tests:
Test sufficiency: “If we have high superposition (ρₘₘ > 0.15),
will we always get analogous features?” (Train 20 SAEs, measure if all
pairs have MCC < 0.80)
Test necessity: “Can we get analogous features without
superposition?” (Test on ρₘₘ = 0 models)
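The sufficiency check over 20 SAEs reduces to a test on the upper triangle of a pairwise MCC matrix. This is a sketch, assuming that matrix has already been computed. The function name is hypothetical and the 0.80 threshold is the one stated above.

```python
import numpy as np

def all_pairs_analogous(pairwise_mcc, threshold=0.80):
    """Return (whether every distinct pair is below threshold,
    fraction of distinct pairs below threshold)."""
    m = np.asarray(pairwise_mcc, dtype=float)
    upper = m[np.triu_indices_from(m, k=1)]     # each unordered pair once
    below = upper < threshold
    return bool(below.all()), float(below.mean())
```

A finite number of seeds cannot establish "always", so the fraction of pairs below the threshold is likely the more informative quantity to report.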
Task 9.6: Analogousness as a Continuum
Model feature analogousness as a continuous spectrum rather than
binary property:
Analogousness Score Definition:
Define A(f₁, f₂) = function measuring “degree of analogousness”
between features f₁ and f₂
Mechanistic understanding, expressed as a quantitative attribution of
analogousness to its sources (illustrative target: “75% from
superposition ambiguity, 15% from optimization stochasticity, 10% from
architectural biases”)
Predictive models achieving R² > 0.65 in predicting analogousness
scores
Clear answer to “Why analogous?”: Because superposition creates
fundamental ambiguity that different SAEs resolve differently
10. Synthesis: The Complete Picture
Implementation Overview
This final research direction integrates all previous findings into a
unified theory and comprehensive practical framework. We implement the
synthesis phase that combines theoretical understanding with practical
tools.
Detailed Implementation Steps
Task 10.1: Unified Theory of Feature Universality
Develop a comprehensive theoretical framework integrating all
findings:
Identify and articulate open problems for the community:
Open Problem 1: Can we achieve >95% feature
coverage with better SAE architectures?
Open Problem 2: How does universality extend to
multimodal models (vision + language)?
Open Problem 3: Can we characterize universality in
RL agents and policy networks?
Open Problem 4: What universal structures exist at
the circuit level (beyond features)?
Open Problem 5: How can we use universality for
efficient continual learning?
For each problem:
Clearly define the problem and why it matters
Provide initial experiments showing feasibility
Outline potential approaches
Offer starter code and datasets
Expected Outcomes:
Comprehensive theory paper (target: 40+ pages with appendices)
Practical toolkit with >1000 GitHub stars within 1 year
5+ concrete applications demonstrating real-world value
Active research community with 20+ groups building on this work
At least 10 follow-up papers citing this work within 18 months
Conclusion
This integrated research program bridges the gap between controlled
synthetic evaluation (SynthSAEBench) and real-world cross-model
universality analysis. By implementing the 10 research directions
detailed above, we will:
Understand why SAEs learn analogous rather than
identical features
Quantify the degree of universality across models
and conditions
Predict when and how strongly universality will
occur
Exploit universality for practical interpretability
transfer
Advance the broader field of mechanistic
interpretability
The comprehensive implementation plan provided here offers a roadmap
for 18-24 months of research that will fundamentally advance our
understanding of feature learning in neural networks and enable new
capabilities in AI interpretability and safety.