This document summarizes the key papers discovered through semantic
search that are relevant to developing manifold-aware sparse
autoencoders based on the SynthSAEBench framework.
1. Core Papers on Feature Manifolds
1.1
Not All Language Model Features Are One-Dimensionally Linear
Provides rigorous definition of irreducible
multi-dimensional features: features that cannot be decomposed
into independent or non-co-occurring lower-dimensional features
Develops scalable method using sparse autoencoders to automatically
find multi-dimensional features
Discovers circular features representing days of
the week and months of the year in GPT-2 and Mistral 7B
Validates computational importance through intervention experiments
on Mistral 7B and Llama 3 8B
Relevance to Our Work: This is the foundational
paper establishing that manifold features exist in LLMs and are
computationally fundamental. Our synthetic benchmark must reproduce
these circular structures and test whether SAEs can recover them.
Key Findings:
Circular features form 2D representations: (cos θ, sin θ) for
periodic concepts
These features resist decomposition into 1D components
Intervention experiments confirm they are the fundamental unit of
computation
Standard SAEs may not optimally represent these geometric
structures
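To make the (cos θ, sin θ) structure concrete, here is a minimal sketch of how a circular feature could be generated for a synthetic benchmark. The function name `sample_circular_feature`, the dimensions, and the choice of a random orthonormal plane are illustrative assumptions, not part of SynthSAEBench or of the paper's method.

```python
import numpy as np

def sample_circular_feature(n_samples, d_model, n_positions=7, seed=0):
    """Embed a periodic concept (e.g. 7 weekdays) as points on a circle in R^d_model."""
    rng = np.random.default_rng(seed)
    # Random orthonormal basis for the 2D plane that contains the circle.
    basis, _ = np.linalg.qr(rng.standard_normal((d_model, 2)))
    plane = basis.T  # shape (2, d_model), rows are orthonormal
    # Discrete positions on the circle, e.g. day-of-week indices.
    k = rng.integers(0, n_positions, size=n_samples)
    theta = 2 * np.pi * k / n_positions
    coords = np.stack([np.cos(theta), np.sin(theta)], axis=1)  # (n_samples, 2)
    return coords @ plane, theta, plane

acts, theta, plane = sample_circular_feature(n_samples=1000, d_model=64)
```

Projecting `acts` back onto `plane` recovers the 2D coordinates, which gives a known ground truth against which learned SAE latents could be compared.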
1.2
The Geometry of Concepts: Sparse Autoencoder Feature Structure
Analyzes geometric structure of SAE feature dictionaries at three
scales:
Atomic scale: “Crystal” structures with
parallelogram and trapezoid faces
Intermediate scale: Spatial modularity resembling
functional lobes (math/code features cluster)
Large scale: Global organization reflecting
semantic relationships
Demonstrates hierarchical geometric organization in learned
features
Shows that features are not randomly distributed but form structured
manifolds
Relevance to Our Work: Provides empirical evidence
for geometric structure in SAE features. Our evaluation metrics must
capture these multi-scale geometric properties: local crystalline
structure, intermediate clustering, and global topology.
Key Findings:
SAE features exhibit non-trivial geometric relationships beyond
simple linear directions
Semantic similarity corresponds to geometric proximity
Hierarchical organization suggests manifolds participate in
hierarchical structures
Different conceptual domains (math, code, language) form distinct
geometric “lobes”
1.3
Introduces theoretical framework for understanding feature
manifolds
Proposes that related features lie on continuous manifolds where
nearby points represent similar concepts
Contrasts with discrete, independent feature assumption
Suggests manifolds may be fundamental to neural representation
Relevance to Our Work: Provides theoretical
motivation for our manifold-aware synthetic benchmark. The toy model
framework can guide our design of synthetic manifolds and help us
understand what properties to test.
Key Concepts:
Features form continuous families rather than discrete atoms
Manifolds allow efficient representation of structured concept
spaces
Standard sparse coding may miss manifold structure
Need specialized methods to detect and represent manifolds
1.4
Understanding Sparse Autoencoder Scaling in the Presence of Feature
Manifolds
Adapts capacity-allocation model to understand SAE scaling with
manifolds
Identifies two scaling regimes:
Pathological regime (β < α): SAEs learn far
fewer features than available latents; continuously “tile” manifolds at
expense of discovering rare features
Benign regime (α < β): Number of discovered
features scales linearly with latents
Provides theoretical predictions for SAE behavior on manifold
data
Relevance to Our Work: CRITICAL PAPER for understanding how manifolds
affect SAE performance.
We must test these scaling predictions empirically on our synthetic
benchmark and verify the pathological tiling behavior.
Key Predictions:
SAEs will allocate disproportionately many latents to high-frequency
manifolds
Manifold “tiling” behavior: learning many latents per manifold
instead of efficient manifold representation
Scaling laws differ between manifold and non-manifold features
Standard SAE architectures may be fundamentally limited for manifold
features
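The tiling prediction can be illustrated with a toy calculation. The sketch below is not a trained SAE and not the capacity-allocation model from the paper: it uses a hand-built dictionary of evenly spaced unit directions on a circle (an assumption made for illustration) and reconstructs each point with its single best latent. It shows why adding more latents to one manifold keeps lowering reconstruction error, which is the incentive behind tiling.

```python
import numpy as np

def tiling_error(n_latents, n_points=10_000, seed=0):
    """Mean squared error when each circle point is reconstructed by one latent."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0, 2 * np.pi, size=n_points)
    points = np.stack([np.cos(theta), np.sin(theta)], axis=1)
    # Hypothetical dictionary: n_latents unit directions evenly tiling the circle.
    phi = 2 * np.pi * np.arange(n_latents) / n_latents
    dictionary = np.stack([np.cos(phi), np.sin(phi)], axis=1)
    # Nonnegative activations; keep only the best-matching latent (L0 = 1).
    acts = np.maximum(points @ dictionary.T, 0.0)
    best = acts.argmax(axis=1)
    coeff = acts[np.arange(n_points), best]
    recon = coeff[:, None] * dictionary[best]
    return float(np.mean(np.sum((points - recon) ** 2, axis=1)))

errors = {k: tiling_error(k) for k in (2, 4, 8, 16, 32)}
```

In this toy setting the error falls steadily as latents are added, so a capacity-limited SAE has a reason to keep spending latents on a frequent manifold rather than on rare features.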
2. Foundational SAE Papers
2.1
Towards Monosemanticity: Decomposing Language Models with Dictionary
Learning
Introduces sparse autoencoder approach for decomposing neural
network activations
Demonstrates that learned features are more monosemantic than raw
neurons
Establishes dictionary learning as interpretability method
Shows SAEs can find interpretable features in language models
Relevance to Our Work: Foundation for the entire SAE
field. Our manifold-aware extensions build on this baseline approach. We
must ensure our manifold SAEs maintain the interpretability benefits
while improving geometric representation.
2.2
Studies relationship between SAE capacity, sparsity, and
reconstruction quality
Provides practical guidance for training large-scale SAEs
Relevance to Our Work: Provides baseline metrics and
scaling laws for standard SAEs. We must compare manifold-aware SAE
scaling to these established baselines and show where manifolds create
different scaling behavior.
2.3
Sparse Autoencoders Find Highly Interpretable Features in Language
Models
Authors: Hoagy Cunningham, Aidan Ewart, Logan Riggs,
Robert Huben, Lee Sharkey
Publication: 2023
ArXiv: https://arxiv.org/abs/2309.08600
Key Contributions:
Demonstrates that SAE features are highly interpretable
Studies feature activation patterns and semantics
Validates SAE approach on multiple models
Analyzes trade-offs between sparsity and reconstruction
Relevance to Our Work: Establishes interpretability
as key evaluation criterion. Our manifold-aware SAEs must preserve
interpretability while improving geometric representation. We need to
test whether manifold features are as interpretable as standard
features.
3. Geometric Representation and Structure
3.1
The Linear Representation Hypothesis and the Geometry of Large Language
Models
Authors: Kiho Park, Yo Joong Choe, Victor Veitch
Publication: ICML 2024
Key Contributions:
Formalizes the Linear Representation Hypothesis (LRH)
Studies geometric properties of concept representations
Analyzes when linear representations are sufficient
Provides framework for understanding representation geometry
Relevance to Our Work: Provides theoretical
foundation for comparing linear vs. manifold representations. Our
synthetic benchmark can instantiate both hypotheses and test which
better explains SAE behavior.
3.2
The Geometry of Categorical and Hierarchical Concepts in Large Language
Models
Authors: Kiho Park, Yo Joong Choe, Yibo Jiang,
Victor Veitch
Publication: 2024
Key Contributions:
Studies how categorical and hierarchical concepts are geometrically
organized
Analyzes representation of structured knowledge
Connects to cognitive science theories of concept
representation
Relevance to Our Work: Hierarchical manifolds in our
benchmark must reflect these organizational principles. We need to test
how SAEs handle the interaction between hierarchy and manifold
structure.
3.3
Empirical discovery of geometric structure in calendar features
Shows days and months form 2D geometric patterns
Provides visualization techniques for feature geometry
Validates existence of structured manifolds in real LLMs
Relevance to Our Work: Direct empirical validation
that calendar manifolds exist in real models. Our synthetic benchmark
should reproduce this specific phenomenon and test whether SAEs can
recover it. This provides a concrete validation target.
4. SAE Architecture Variants
4.1
Improving Dictionary Learning with Gated Sparse Autoencoders
Introduces gated SAE architecture separating feature selection from
magnitude
Improves reconstruction quality and feature interpretability
Shows benefits of architectural modifications
Relevance to Our Work: The gating mechanism could be
adapted for manifold SAEs: the gate could select which manifold is
active, and the magnitude path could then encode position on the
manifold.
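The separation of selection from magnitude can be sketched as a forward pass. This is our reading of the gated encoder (a binary gate path and a ReLU magnitude path that share weights up to a per-latent rescaling `exp(r_mag)`); parameter names are illustrative, the training loss is omitted, and the details should be checked against the paper before use.

```python
import numpy as np

def gated_encode(x, W_gate, b_gate, r_mag, b_mag, b_dec):
    """Gated SAE encoder sketch: the gate decides WHICH latents fire, magnitude decides HOW MUCH."""
    x_centered = x - b_dec
    # Gate path: binary decision per latent.
    pre_gate = x_centered @ W_gate.T + b_gate
    gate = (pre_gate > 0).astype(x.dtype)
    # Magnitude path: same directions as the gate, rescaled per latent.
    W_mag = np.exp(r_mag)[:, None] * W_gate
    magnitude = np.maximum(x_centered @ W_mag.T + b_mag, 0.0)
    return gate * magnitude

rng = np.random.default_rng(0)
d_model, n_latents = 8, 16
x = rng.standard_normal((5, d_model))
W_gate = rng.standard_normal((n_latents, d_model))
b_gate = rng.standard_normal(n_latents)
r_mag = 0.1 * rng.standard_normal(n_latents)
b_mag = rng.standard_normal(n_latents)
b_dec = np.zeros(d_model)
latents = gated_encode(x, W_gate, b_gate, r_mag, b_mag, b_dec)
```

For a manifold-aware variant, one could imagine one gate per manifold with a multi-dimensional magnitude output; that extension is a design idea, not something the paper provides.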
4.2
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse
Autoencoders
Introduces JumpReLU activation for better reconstruction
Reduces reconstruction-sparsity trade-off
Demonstrates architecture improvements matter
Relevance to Our Work: Could be combined with
manifold-aware architectures to improve reconstruction while maintaining
manifold structure.
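For reference, the JumpReLU activation itself is simple to state: a pre-activation passes through unchanged only when it exceeds a per-latent threshold. The sketch below shows only the forward computation; how the thresholds are trained is out of scope here, and the example values are made up.

```python
import numpy as np

def jump_relu(pre_acts, threshold):
    """JumpReLU: keep a pre-activation unchanged only if it exceeds its threshold."""
    return np.where(pre_acts > threshold, pre_acts, 0.0)

pre = np.array([-0.5, 0.2, 0.6, 1.5])
theta = np.array([0.5, 0.5, 0.5, 0.5])   # hypothetical per-latent thresholds
out = jump_relu(pre, theta)              # small activation 0.2 is zeroed out
relu_out = np.maximum(pre, 0.0)          # plain ReLU would keep 0.2
```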
5. Evaluation and Analysis Methods
5.1
Representational Similarity Analysis
Authors: Nikolaus Kriegeskorte, Marieke Mur, Peter
Bandettini
Publication: Frontiers in Systems Neuroscience, 2008
DOI: 10.3389/neuro.06.004.2008
Key Contributions:
Framework for comparing neural representations across systems
Distance-based similarity metrics
Applicable to both biological and artificial systems
Relevance to Our Work: RSA methods can be adapted to
compare manifold structures between ground truth and learned SAE
representations. Provides principled framework for manifold alignment
metrics.
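A minimal sketch of how RSA could be applied to ground-truth versus learned representations follows. It assumes Euclidean distance for the dissimilarity matrix and Pearson correlation for the comparison; classical RSA often uses correlation distance and rank correlation instead, so these choices are simplifications.

```python
import numpy as np

def rdm(X):
    """Representational dissimilarity matrix: pairwise Euclidean distances between rows."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.sqrt(np.maximum(d2, 0.0))

def rsa_score(X, Y):
    """Pearson correlation between the upper triangles of the two RDMs."""
    iu = np.triu_indices(X.shape[0], k=1)
    return float(np.corrcoef(rdm(X)[iu], rdm(Y)[iu])[0, 1])

rng = np.random.default_rng(0)
truth = rng.standard_normal((20, 5))
rotation, _ = np.linalg.qr(rng.standard_normal((5, 5)))
score_rotated = rsa_score(truth, truth @ rotation)             # same geometry, new basis
score_random = rsa_score(truth, rng.standard_normal((20, 5)))  # unrelated geometry
```

Because the score depends only on pairwise distances, it is unchanged by rotating the representation, which is useful when SAE latents and ground-truth coordinates live in different bases.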
5.2
Similarity of Neural Network Representations Revisited
Authors: Simon Kornblith, Mohammad Norouzi, Honglak
Lee, Geoffrey Hinton
Publication: 2019
ArXiv: https://arxiv.org/abs/1905.00414
Key Contributions:
Comprehensive comparison of representation similarity metrics
Studies CKA, CCA, and other methods
Provides guidance on metric selection
Relevance to Our Work: These metrics can be adapted
for comparing manifold representations. We need manifold-aware versions
of CKA/CCA to evaluate geometric structure preservation.
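As a starting point for such metrics, linear CKA can be written in a few lines. This is the standard linear-kernel form; a manifold-aware variant is not specified in the source and is not attempted here.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representations with the same number of rows (examples)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(cross / (norm_x * norm_y))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 6))
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
cka_same = linear_cka(X, 2.0 * X @ Q)                     # rotated and rescaled copy
cka_other = linear_cka(X, rng.standard_normal((50, 6)))   # unrelated representation
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling, so a rotated and rescaled copy of a representation scores 1.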
6. Feature Phenomena in SAEs
6.1
A is for Absorption: Studying Feature Splitting and Absorption in Sparse
Autoencoders
Identifies feature splitting: one ground-truth
feature learned by multiple SAE latents
Identifies feature absorption: a latent tracking a general
feature fails to fire on inputs where a more specific latent has
absorbed that feature's direction
Studies frequency and causes of these phenomena
Relevance to Our Work: Critical for predictions. We predict that
manifold features will exhibit more splitting (the SAE tiles the
manifold) and less absorption (the manifold resists collapse). We must
measure these phenomena on our synthetic benchmark.
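One simple way to quantify splitting against known ground truth is to count how many SAE decoder directions align with each true feature direction. The function name, the cosine threshold of 0.8, and the tiny example below are illustrative assumptions; this is a proxy, not the measurement procedure used in the paper.

```python
import numpy as np

def latents_per_feature(decoder, true_features, threshold=0.8):
    """Count decoder rows whose cosine similarity with each true feature exceeds threshold."""
    dec = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
    gt = true_features / np.linalg.norm(true_features, axis=1, keepdims=True)
    cos = dec @ gt.T  # shape (n_latents, n_features)
    return (cos > threshold).sum(axis=0)

true_features = np.eye(3)
decoder = np.array([
    [1.0, 0.0, 0.0],   # matches feature 0
    [0.9, 0.1, 0.0],   # also matches feature 0 (a split)
    [0.0, 1.0, 0.0],   # matches feature 1
])
counts = latents_per_feature(decoder, true_features)
```

A count above 1 suggests splitting and a count of 0 means the feature was not recovered as a single direction. For manifold features, the same idea could be applied to the subspace spanned by the manifold rather than to a single direction.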
6.2
Decomposing the Dark Matter of Sparse Autoencoders
Studies “dark matter”: activation variance not explained by learned
features
Analyzes what SAEs miss
Provides insights into SAE limitations
Relevance to Our Work: Manifolds may be part of the
“dark matter” that standard SAEs fail to capture. Our manifold-aware
SAEs should reduce dark matter by explicitly modeling geometric
structure.
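A common scalar summary of what an SAE misses is the fraction of variance unexplained (FVU). The sketch below is a generic definition used here as a stand-in; the paper's own analysis of the residual is more detailed than this single number.

```python
import numpy as np

def fraction_variance_unexplained(acts, recon):
    """Residual sum of squares divided by total variance of the activations."""
    resid = np.sum((acts - recon) ** 2)
    total = np.sum((acts - acts.mean(axis=0)) ** 2)
    return float(resid / total)

rng = np.random.default_rng(0)
acts = rng.standard_normal((100, 16))
fvu_perfect = fraction_variance_unexplained(acts, acts)
fvu_mean_only = fraction_variance_unexplained(
    acts, np.broadcast_to(acts.mean(axis=0), acts.shape)
)
```

A perfect reconstruction gives 0 and predicting only the mean gives 1, so comparing FVU between a standard SAE and a manifold-aware SAE on the same synthetic data would show whether modeling geometry reduces the unexplained part.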
7. Neuroscience Connections
7.1
Bipartite Invariance in Mouse Primary Visual Cortex
Studies manifold representations in biological neural systems
Shows visual cortex represents features on low-dimensional
manifolds
Demonstrates manifolds are fundamental to biological
computation
Relevance to Our Work: Provides biological precedent
for manifold representations. That biological systems use manifolds
makes it plausible, though not certain, that artificial systems do too.
Our work bridges neuroscience and AI interpretability.
7.2
Alignment of Brain Embeddings and Artificial Contextual Embeddings
Shows brain and LLM representations share common geometric
patterns
Demonstrates alignment between biological and artificial
embeddings
Suggests universal geometric principles
Relevance to Our Work: If brain and LLM
representations share geometry, manifold structures may be universal.
Our synthetic benchmark can test whether SAEs capture these universal
geometric principles.
8.1
Theoretical framework for understanding superposition
Toy models demonstrating when and why superposition occurs
Analysis of feature interference and recovery
Relevance to Our Work: Superposition interacts with
manifolds (Michaud et al.). We must extend the toy model framework to
include manifold features and test predictions about manifold
superposition.
8.2
Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by
V1?
Authors: Bruno A. Olshausen and David J. Field
Publication: Vision Research, 37(23), 1997
DOI: 10.1016/S0042-6989(97)00169-7
Key Contributions:
Original sparse coding framework for neural representation
Demonstrates biological plausibility of sparse representations
Establishes connection to visual cortex
Relevance to Our Work: Historical foundation for
sparse autoencoders. Our manifold extensions maintain this biological
inspiration while adding geometric structure.
9.1
Studies universality of SAE features across models
Analyzes concept alignment between different architectures
Tests transferability of learned representations
Relevance to Our Work: If manifolds are fundamental,
they should be universal across models. Our validation experiments
should test whether manifold structures transfer across different
synthetic models and to real LLMs.
9.2
Steering Language Models with Activation Engineering
Methods for steering LLM behavior through activation
manipulation
Demonstrates practical applications of interpretability
Shows importance of understanding representation geometry
Relevance to Our Work: Understanding manifolds
could enable better steering: an intervention can move along a geodesic
on the manifold for a smoother, more natural change. Manifold-aware
steering could be more effective than linear steering.
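The difference between linear and geodesic steering is easy to see on a unit circle. The sketch below uses spherical interpolation (slerp) as the geodesic path; the function names and the example angle are illustrative, and the formula is undefined for exactly antipodal points.

```python
import numpy as np

def linear_steer(a, b, t):
    """Straight-line interpolation; leaves the circle between the endpoints."""
    return (1 - t) * a + t * b

def geodesic_steer(a, b, t):
    """Spherical interpolation (slerp) between unit vectors; stays on the circle.
    Not defined when a and b are antipodal."""
    omega = np.arccos(np.clip(a @ b, -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return a.copy()
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

a = np.array([1.0, 0.0])
angle = 2 * np.pi / 3  # 120 degrees apart
b = np.array([np.cos(angle), np.sin(angle)])
mid_linear = linear_steer(a, b, 0.5)
mid_geodesic = geodesic_steer(a, b, 0.5)
```

Halfway between two points 120 degrees apart, the linear path shrinks to norm 0.5 while the geodesic path keeps norm 1, so only the latter remains a valid point on the feature manifold.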
10. Research Gaps and Our Contributions
Identified Gaps:
No synthetic benchmark for manifold features → We
create ManifoldSynthSAEBench
No manifold-specific evaluation metrics → We
develop geodesic preservation, topology preservation, curvature
accuracy
No manifold-aware SAE architectures → We propose
GL-SAE, MP-SAE, HM-SAE, ALM-SAE
No systematic hypothesis testing → We generate
models for LRH, manifold hypothesis, etc.
No scaling laws for manifolds → We test Michaud et
al.’s predictions empirically
No transfer learning from synthetic to real → We
test manifold SAE transfer to GPT-2
Our Novel Contributions:
Extension of SynthSAEBench with circular,
spherical, toroidal manifolds
Manifold-aware evaluation suite with 6 new
metrics
Four novel SAE architectures explicitly modeling
manifolds
Representation hypothesis testing framework with
testable predictions
Validation against real LLM phenomena (calendar
features, circular structure)
Practical implementation plan with 7 phases over 20
weeks
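The spherical and toroidal manifolds named in the first contribution above could be sampled along the following lines. This is a sketch under assumptions: the function names are hypothetical, the sphere is the unit 2-sphere in 3D, and the torus is the flat torus in 4D (a product of two circles) rather than the doughnut embedding in 3D.

```python
import numpy as np

def sample_sphere(n, seed=0):
    """Uniform samples on the unit sphere in R^3 via normalized Gaussians."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, 3))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def sample_flat_torus(n, seed=0):
    """Samples on the flat torus in R^4: two independent angles, one circle each."""
    rng = np.random.default_rng(seed)
    u, v = rng.uniform(0, 2 * np.pi, size=(2, n))
    return np.stack([np.cos(u), np.sin(u), np.cos(v), np.sin(v)], axis=1)

sphere_pts = sample_sphere(500)
torus_pts = sample_flat_torus(500)
```

As with the circular case, these low-dimensional points would still need to be embedded into the model's activation space, for example through a random orthonormal projection.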
11. Integration with Existing Work
How Our Work Connects:
Engels et al. (2025) discovers circular features →
We create synthetic circular features with known ground truth → We test
if SAEs can recover them
Li et al. (2025) finds geometric structure → We
generate synthetic geometric structure → We measure if SAEs preserve
it
Michaud et al. (2024) predicts pathological scaling
→ We test predictions empirically on controlled data → We validate or
refute theory
Olah & Batson (2023) proposes manifold toy model
→ We implement full-scale realistic benchmark → We bridge theory to
practice
SynthSAEBench provides LRH baseline → We add
manifold features → We compare LRH vs. manifold hypotheses
Synthesis:
Our work synthesizes theoretical insights (Olah,
Michaud), empirical discoveries (Engels, Li, Leask), and rigorous
benchmarking (SynthSAEBench, Gao) into a unified
framework for studying manifold-aware interpretability.
12. Next Steps
Immediate Actions:
✅ Create research plan document
✅ Organize key papers in Manifold-SAEs folder
✅ Summarize literature and identify gaps
⏳ Begin Phase 1 implementation: Manifold generation code
⏳ Implement circular and spherical manifolds
⏳ Generate first dataset with 10 manifolds
Long-term Roadmap:
Months 1-2: Implement data generation and
evaluation metrics
This literature review identifies feature manifolds
as a critical but understudied aspect of neural representation. Recent
empirical work (Engels, Li, Leask) provides strong evidence that
manifolds exist in real LLMs and are computationally fundamental.
Theoretical work (Olah, Michaud) predicts that standard SAEs will
struggle with manifolds due to pathological scaling and tiling
behaviors.
Our contribution: A systematic, rigorous approach to
understanding manifold-aware interpretability through controlled
synthetic experiments. By extending SynthSAEBench, we can test competing
hypotheses, develop better architectures, and establish new evaluation
standards.
Impact: This work could change how the field thinks about
feature learning, representation geometry, and the fundamental units
of neural computation. If manifolds are indeed fundamental, then
manifold-aware interpretability tools are not optional; they are
essential for understanding how language models work.
Papers Included in Folder
PDFs Available:
18572_SynthSAEBench_Evaluating.pdf - Original
SynthSAEBench paper
Sparse_Autoencoders_Universal_Feature_Spaces.pdf -
Universal SAE features
SAE_Research_Analysis.pdf - General SAE research
overview
Papers to Download:
Engels et al. (2025) - Not All Language Model Features Are
Linear
Li et al. (2025) - Geometry of Concepts
Michaud et al. (2024) - SAE Scaling with Manifolds
Gao et al. (2025) - Scaling and Evaluating SAEs
Bricken et al. (2023) - Towards Monosemanticity
Cunningham et al. (2023) - SAEs Find Interpretable Features
Rajamanoharan et al. (2024a) - Gated SAEs
Rajamanoharan et al. (2024b) - JumpReLU SAEs
Park et al. (2024) - Linear Representation Hypothesis
Chanin et al. (2024) - Feature Splitting and Absorption
Last Updated: February 11, 2026
Status: Literature review complete, ready for
implementation Phase 1