
Geometric Feature Invariance in SAEs: A Framework for Transferable Mechanistic Interpretability and Scalable AI Safety

Abstract

This research will integrate controlled evaluation on synthetic model features with cross-model analysis of geometric feature invariance, with the aim of developing a principled framework for transferable interpretability across model families and variants. Recent work has shown that Sparse Autoencoders (SAEs) trained on different LLMs learn geometrically similar feature spaces (invariance and analogous feature universality), while evaluation against synthetic ground-truth features reveals trade-offs: no current SAE architecture perfectly recovers the ground-truth features. Our work will address fundamental challenges in mechanistic interpretability by establishing principled methods for leveraging geometric feature universality. Efficient feature transfer is critical for AI safety because it enables:

  1. Rapid safety evaluation of new models without restarting interpretability analysis from scratch
  2. Early detection of hazardous capabilities by comparing feature spaces to known dangerous configurations
  3. Reliable monitoring across deployment contexts by tracking feature drift
  4. Scalable oversight of large model families where per-model analysis becomes infeasible

During the project implementation we will characterize geometric patterns that remain stable across models, and geometric transformation methods that reliably map feature correspondences, validated against synthetic ground-truth. Furthermore, we will demonstrate that safety-relevant interventions transfer within the same family of models, or fine-tuned variants of a model. This research will accelerate mechanistic interpretability, and enable efficient safety analysis as AI systems grow in capability and complexity.

Theory of Change

Activities: Develop synthetic benchmarks to test SAE transfer and the presence of invariant feature structures
Outputs: Protocols that enable reliable cross-model interpretability transfer based on invariant AI safety-related features and circuits
Outcomes: Enable interpretations and safety interventions developed on one model to reliably transfer to other models in the same family
Impact: AI safety becomes scalable without requiring complete re-interpretation and SAE generation for each new model, fine-tuned variant, or model family member
Key Assumption: Geometric feature similarity is both necessary and sufficient for interpretability transfer

Enforcing Semantic Constraints in SynthSAEBench Hierarchies

The Approach: Compositional Feature Directions with Hierarchical Constraints

While SynthSAEBench’s current hierarchy mechanism only enforces probabilistic dependencies (children fire only when parents fire), we can introduce partial semantic structure by modifying how feature direction vectors are constructed within the hierarchy. The key insight is to make child feature directions compositionally dependent on their parent directions, rather than having them be independent random vectors. Specifically, when constructing a hierarchical feature tree, we can define each child feature direction as d_child = α·d_parent + β·d_⊥, where d_parent is the parent’s direction vector, d_⊥ is a unit vector orthogonal to the parent (representing the “specialization” of the child concept), and α, β are mixing coefficients that control how much of the parent’s representation is inherited. This creates genuine compositional structure where activations containing child features literally contain components of parent features in their vector representations. By carefully choosing the α values across the hierarchy (e.g., α = 0.7 for closely related concepts, α = 0.3 for more distantly related ones), we can encode semantic relatedness as geometric similarity in the feature space.

Implementing d_⊥:

To get a vector orthogonal to the parent direction, you use Gram-Schmidt orthogonalization. You start with a random vector v, then subtract off its projection onto the parent direction:

d_⊥ = v − (v · d_parent) d_parent

The term (v · d_parent) d_parent is the component of v that “points along” the parent, so subtracting it leaves only the part that’s perpendicular. You then normalize the result to unit length. This gives you a vector that lives in the subspace of the full embedding space that is completely “unrelated” to the parent direction — it captures whatever is unique or distinguishing about the child concept beyond what it inherits from the parent.
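The Gram-Schmidt step described above can be sketched in NumPy as follows. This is a minimal illustration, not SynthSAEBench code: the function name `orthogonal_component`, the dimension of 64, and the random seed are assumptions made for the example.

```python
import numpy as np

def orthogonal_component(d_parent, rng):
    """Return a unit vector orthogonal to d_parent (one Gram-Schmidt step).

    Hypothetical helper for illustration; not part of SynthSAEBench.
    """
    d_parent = d_parent / np.linalg.norm(d_parent)   # make sure the parent is unit length
    v = rng.standard_normal(d_parent.shape[0])       # random starting vector v
    d_perp = v - np.dot(v, d_parent) * d_parent      # subtract the projection onto the parent
    return d_perp / np.linalg.norm(d_perp)           # normalize to unit length

rng = np.random.default_rng(0)
d_parent = rng.standard_normal(64)
d_parent /= np.linalg.norm(d_parent)
d_perp = orthogonal_component(d_parent, rng)

print(float(np.dot(d_perp, d_parent)))  # numerically zero up to floating-point error
```

In high dimensions a random v is almost never parallel to the parent, so the normalization is well conditioned in practice; a production implementation might still guard against a near-zero norm.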

What β does geometrically:

β controls how much of that orthogonal “specialization” component makes it into the final child direction. The full formula d_child = α·d_parent + β·d_⊥ is a linear combination of two orthogonal vectors, which means you’re essentially picking a point on a 2D plane spanned by those two directions. Since d_parent and d_⊥ are orthogonal unit vectors, the cosine similarity between d_child (after normalization) and d_parent is α / √(α² + β²), which depends only on the ratio α/β. A large β relative to α means the child direction tilts strongly away from the parent toward its own unique subspace, encoding a concept that is more “specialized” and less like its parent. A small β means the child hugs closely to the parent direction, encoding a concept that is almost synonymous with the parent. So α and β together act as a geometric dial controlling where on the spectrum between “identical to parent” and “completely independent from parent” the child concept sits.
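The dial behaviour can be checked numerically. The sketch below uses the closed form cos θ = α / √(α² + β²), which follows from d_parent and d_⊥ being orthonormal. The helper names `child_direction` and `beta_for_target_cosine` and the three-dimensional stand-in vectors are assumptions made for illustration.

```python
import numpy as np

def child_direction(d_parent, d_perp, alpha, beta):
    """Mix the parent and orthogonal directions, then normalize to unit length."""
    d_child = alpha * d_parent + beta * d_perp
    return d_child / np.linalg.norm(d_child)

def beta_for_target_cosine(alpha):
    """beta such that alpha^2 + beta^2 = 1, so the cosine with the parent equals alpha."""
    return np.sqrt(1.0 - alpha ** 2)

# Two orthonormal vectors standing in for d_parent and d_perp (illustrative only).
d_parent = np.array([1.0, 0.0, 0.0])
d_perp = np.array([0.0, 1.0, 0.0])

cosines = {}
for alpha, beta in [(0.7, 0.1), (0.7, 0.7), (0.7, 2.0)]:
    d_child = child_direction(d_parent, d_perp, alpha, beta)
    cosines[(alpha, beta)] = float(d_child @ d_parent)
    # The measured cosine matches the closed form alpha / sqrt(alpha^2 + beta^2).
    assert np.isclose(cosines[(alpha, beta)], alpha / np.hypot(alpha, beta))

# With alpha fixed, a larger beta gives a smaller cosine with the parent.
print(cosines)

# Choosing beta = sqrt(1 - alpha^2) makes the cosine equal alpha itself.
d_child = child_direction(d_parent, d_perp, 0.4, beta_for_target_cosine(0.4))
print(float(d_child @ d_parent))
```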

Hierarchical Feature Geometry Containing Misalignment Concept Semantics

Step 1: LLM generates the misalignment concept hierarchy. You prompt an LLM to produce a tree of misalignment-related concepts — something like “Deceptive Reasoning” as a root, with children “Goal Misrepresentation” and “Information Withholding,” and then grandchildren like “Reward Hacking” and “Sycophantic Agreement” under Goal Misrepresentation, and “Selective Omission” and “Framing Manipulation” under Information Withholding. Critically, you also ask the LLM to assign an α value to each parent-child edge, encoding its judgment of how semantically similar the child is to the parent — “Reward Hacking” might get α=0.4 from “Goal Misrepresentation” since it’s a fairly specific instantiation, while “Goal Misrepresentation” might get α=0.7 from “Deceptive Reasoning” since it’s almost a direct sub-case.

Step 2: Translate the tree into feature directions. You start at the root and assign it a random unit vector d_root. For each child, you compute d_child = α·d_parent + β·d_⊥, where d_⊥ is obtained by Gram-Schmidt against the parent, and β is chosen such that after normalization the cosine similarity between parent and child matches the α the LLM specified (so β is essentially derived from α, not independently set). You recurse down the tree level by level, so grandchildren inherit geometric structure from both their parent and transitively from their grandparent — “Reward Hacking” ends up with some directional overlap with both “Goal Misrepresentation” and “Deceptive Reasoning,” which is exactly the right semantic property.
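Steps 1 and 2 can be sketched together as a recursive construction over a small tree. This is a hedged illustration rather than a definitive implementation: the tuple layout of `TREE`, the function `build_directions`, and the dimension of 256 are assumptions, and only the α values 0.7 (Goal Misrepresentation) and 0.4 (Reward Hacking) come from the text. The remaining α values are placeholders, not LLM outputs.

```python
import numpy as np

# Hypothetical encoding of the hierarchy: (name, alpha_to_parent, children).
# alpha = 0.7 and alpha = 0.4 are taken from the text; every 0.5 is a placeholder.
TREE = ("Deceptive Reasoning", None, [
    ("Goal Misrepresentation", 0.7, [
        ("Reward Hacking", 0.4, []),
        ("Sycophantic Agreement", 0.5, []),   # placeholder alpha
    ]),
    ("Information Withholding", 0.5, [        # placeholder alpha
        ("Selective Omission", 0.5, []),      # placeholder alpha
        ("Framing Manipulation", 0.5, []),    # placeholder alpha
    ]),
])

def build_directions(tree, dim, rng):
    """Assign a unit direction to every node, top-down from the root."""
    directions = {}

    def recurse(node, d_parent):
        name, alpha, children = node
        if d_parent is None:
            # Root: random unit vector.
            d = rng.standard_normal(dim)
            d /= np.linalg.norm(d)
        else:
            # Gram-Schmidt against the immediate parent only.
            v = rng.standard_normal(dim)
            d_perp = v - (v @ d_parent) * d_parent
            d_perp /= np.linalg.norm(d_perp)
            # beta is derived from alpha so that the child is unit-norm
            # and its cosine similarity with the parent equals alpha.
            beta = np.sqrt(1.0 - alpha ** 2)
            d = alpha * d_parent + beta * d_perp
        directions[name] = d
        for child in children:
            recurse(child, d)

    recurse(tree, None)
    return directions

directions = build_directions(TREE, dim=256, rng=np.random.default_rng(0))
print(float(directions["Goal Misrepresentation"] @ directions["Deceptive Reasoning"]))
print(float(directions["Reward Hacking"] @ directions["Goal Misrepresentation"]))
```

Because each child is orthogonalized only against its immediate parent, the parent-child cosine is matched exactly, while the grandchild-to-root overlap is inherited through the parent and is only approximately controlled.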

Step 3: Feed into SynthSAEBench’s hierarchy mechanism. The resulting feature directions slot directly into the feature dictionary D, and the tree structure maps directly onto SynthSAEBench’s existing hierarchy: the parent firing-probability constraint (c_child ← c_child · 1[c_parent > 0]) enforces that “Reward Hacking” can only be active when “Goal Misrepresentation” is active, while the geometric construction ensures that the hidden activations produced for “Reward Hacking” samples literally contain a component pointing in the direction of “Deceptive Reasoning.”
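The interaction between the firing constraint and the compositional geometry can be illustrated in plain NumPy. The sketch does not call SynthSAEBench. The three-feature chain, the `parent_of` mapping, the 0.5 firing probability, and the uniform coefficient magnitudes are all assumptions chosen for the example, and the benchmark's actual sampling procedure may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_samples = 64, 2000

def unit(x):
    return x / np.linalg.norm(x)

def compose(d_parent, alpha):
    """Child direction with cosine similarity alpha to d_parent."""
    v = rng.standard_normal(dim)
    d_perp = unit(v - (v @ d_parent) * d_parent)
    return alpha * d_parent + np.sqrt(1.0 - alpha ** 2) * d_perp

# Hypothetical 3-feature chain: 0 = root, 1 = child of 0, 2 = child of 1.
parent_of = {1: 0, 2: 1}
d0 = unit(rng.standard_normal(dim))
d1 = compose(d0, 0.7)
d2 = compose(d1, 0.4)
D = np.stack([d0, d1, d2])  # feature dictionary, one row per feature

# Candidate coefficients: each feature fires independently with probability 0.5.
fires = rng.random((n_samples, 3)) < 0.5
c = fires * rng.uniform(0.1, 1.0, size=(n_samples, 3))

# Hierarchy constraint, applied top-down: c_child <- c_child * 1[c_parent > 0].
for child in sorted(parent_of):
    c[:, child] *= c[:, parent_of[child]] > 0

hidden = c @ D  # (n_samples, dim) synthetic hidden activations

# The grandchild's own contribution c2 * d2 projects onto its parent d1
# with magnitude 0.4 * c2, by construction of d2.
proj_on_parent = (c[:, 2:3] * d2) @ d1
```

The top-down loop order matters: a grandchild must be masked after its parent has already been masked by the root, otherwise a grandchild could stay active under an inactive root.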

Step 4 and additional experiments: You can now train an SAE on this synthetic data and ask very concrete diagnostic questions: does the SAE learn a latent whose decoder direction has high cosine similarity with d_root even when only grandchild features are firing? Does ablating the latent most aligned with “Deceptive Reasoning” impair reconstruction of “Reward Hacking” more than it impairs reconstruction of unrelated features? Does the SAE split the hierarchy correctly or does it absorb child concepts into parent latents? Because you hold the ground truth — you know exactly which direction corresponds to which concept and what the α values were — you can measure failure modes with precision that is impossible on a real LLM. The LLM-generated hierarchy is what makes the concepts interpretable to humans; the compositional direction construction is what makes the geometry testable.
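The first diagnostic question can be sketched as a matching between ground-truth directions and SAE decoder directions. The function `best_matching_latents` is a hypothetical helper, and `W_dec` below is a stand-in decoder built from permuted, noised copies of the ground truth plus random rows. It is not the output of a trained SAE and no result is implied.

```python
import numpy as np

def best_matching_latents(W_dec, D_true):
    """For each ground-truth feature, find the decoder row with the highest cosine similarity.

    W_dec:  (n_latents, dim) SAE decoder directions (assumed row-wise layout).
    D_true: (n_features, dim) ground-truth feature directions.
    Returns (indices, scores), each of length n_features.
    """
    W = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
    D = D_true / np.linalg.norm(D_true, axis=1, keepdims=True)
    cos = D @ W.T                      # (n_features, n_latents) cosine matrix
    return cos.argmax(axis=1), cos.max(axis=1)

rng = np.random.default_rng(0)
D_true = rng.standard_normal((4, 64))

# Stand-in decoder: latent i is a noisy copy of feature perm[i], followed by 4 random latents.
perm = np.array([2, 0, 3, 1])
W_dec = np.vstack([
    D_true[perm] + 0.05 * rng.standard_normal((4, 64)),
    rng.standard_normal((4, 64)),
])

idx, score = best_matching_latents(W_dec, D_true)
print(idx, score)
```

With a real SAE, a low best score for a parent feature combined with high scores for its children would be one signal worth inspecting for absorption or splitting. The ablation question would additionally require the encoder and a reconstruction metric, which are outside this sketch.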

Theory: Semantic Correlation Through Geometric Constraints

The theoretical foundation rests on the principle that semantic relatedness should manifest as geometric structure in representation space. When we set α > 0, we create non-zero cosine similarity between parent and child feature directions. Because d_parent and d_⊥ are orthogonal unit vectors,

$d_{\text{child}}^T d_{\text{parent}} = (\alpha \cdot d_{\text{parent}} + \beta \cdot d_{\perp})^T d_{\text{parent}} = \alpha \underbrace{(d_{\text{parent}}^T d_{\text{parent}})}_{=1} + \beta \underbrace{(d_{\perp}^T d_{\text{parent}})}_{=0} = \alpha$,

so after normalization (β chosen such that α² + β² = 1, which makes d_child a unit vector) we have cos(θ) = d_child^T d_parent = α, while the orthogonal components d_⊥ of different children point in different directions and distinguish them from each other. This creates a testable prediction: SAEs that successfully decompose these features should discover latents where the decoder directions for child features have high cosine similarity with the decoder direction for the parent feature, and interventions that ablate the parent feature should impair reconstruction of child features more severely than unrelated features.
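The same argument extends one level down the tree and makes the transitive overlap from Step 2 explicit. The subscripts 0, 1, 2 (root, child, grandchild) are notation introduced here for illustration, with each β_i chosen so that every direction is a unit vector.

```latex
\begin{aligned}
d_1 &= \alpha_1 d_0 + \beta_1 d_{\perp,1}, \qquad
d_2 = \alpha_2 d_1 + \beta_2 d_{\perp,2}, \qquad
\beta_i = \sqrt{1 - \alpha_i^2}, \\
d_2^T d_0 &= \alpha_2 \, (d_1^T d_0) + \beta_2 \, (d_{\perp,2}^T d_0)
           = \alpha_1 \alpha_2 + \beta_2 \, (d_{\perp,2}^T d_0).
\end{aligned}
```

Since d_{⊥,2} is orthogonalized only against its immediate parent d_1, the residual term d_{⊥,2}^T d_0 is not exactly zero. For a random starting vector in n dimensions it has zero mean and magnitude on the order of 1/√n, so in high dimensions the grandchild-root cosine is approximately α_1 α_2. With the example values α_1 = 0.7 and α_2 = 0.4, this gives roughly 0.28.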

Limitations and What This Achieves

This approach provides weak semantic structure—it’s geometrically grounded but still falls short of true semantic understanding. We’re encoding human-chosen semantic relationships (like “deception contains goal misrepresentation”) into the statistical properties of the data, but the model still doesn’t “understand” deception in any functional sense. What we gain is the ability to test whether SAEs can discover and respect compositional structure: if an SAE trained on hierarchically-constrained features with compositional directions fails to learn latents that preserve the parent-child geometric relationships, this tells us it will struggle even more with the implicit semantic hierarchies in real LLMs. Critically, this approach lets us validate necessary conditions for handling semantic structure—if SAEs can’t decompose explicitly encoded compositional hierarchies where we control the mixing coefficients and geometric relationships, they certainly won’t succeed on the far more complex implicit semantic structures in language model representations. This bridges the gap between purely statistical hierarchies (current SynthSAEBench) and truly semantic hierarchies (which we cannot fully create without solving the interpretability problem we’re trying to investigate), providing a testbed for architectural improvements that must handle compositional feature structure.