SynthSAEBench’s hierarchy mechanism is purely statistical, not semantic. It enforces probabilistic dependencies (child features can only fire when parents fire) but has no understanding of what those features actually mean. When we label a feature as “Deceptive Reasoning” with children “Goal Misrepresentation” and “Information Withholding,” those labels are purely for our bookkeeping; the synthetic model doesn’t know or care about deception, goals, or information. From the model’s perspective, it’s just a parent feature (index 42) and two child features (indices 137 and 291), each with a direction d_i and a firing coefficient c_i.
There is no inherent relationship between the directions d_42, d_137, and d_291 beyond the statistical constraint. They could point in completely unrelated directions in activation space. The “hierarchy” is implemented as: if c_42 == 0, then set c_137 = 0 and c_291 = 0. This is a simple masking operation that has nothing to do with semantic content.
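To make the masking rule concrete, here is a minimal NumPy sketch. It is not SynthSAEBench's actual code: the firing probability, the exponential magnitudes, and the names `coeffs`, `directions`, and `activations` are assumptions chosen for illustration. Only the parent and child indices (42, 137, 291) come from the example above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, d_model = 1000, 300, 64
parent, children = 42, [137, 291]  # feature indices from the example above

# Hypothetical sampling: each feature fires independently with probability 0.1
# and, when it fires, gets a positive magnitude.
fires = rng.random((n_samples, n_features)) < 0.1
coeffs = fires * rng.exponential(1.0, size=(n_samples, n_features))

# Hierarchy rule: zero the child coefficients wherever the parent is inactive.
parent_active = coeffs[:, [parent]] > 0  # shape (n_samples, 1)
coeffs[:, children] *= parent_active

# Random unit directions, one row per feature. Nothing ties the child
# directions to the parent direction.
directions = rng.standard_normal((n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Each activation is the coefficient-weighted sum of feature directions.
activations = coeffs @ directions  # shape (n_samples, d_model)
```

The hierarchy touches only `coeffs`. The `directions` matrix is drawn without any reference to which feature is a parent or a child, which is the sense in which the structure is statistical rather than semantic.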
The fundamental issue is that semantics requires compositional structure in the representation itself, not just in firing probabilities. In real language models, “Deceptive Reasoning” and “Goal Misrepresentation” have a semantic relationship because:
None of this exists in current SynthSAEBench. The features are just random orthogonal directions with a statistical dependency rule. When feature 137 fires, it adds c_137·d_137 to the activation, which has no compositional relationship to what feature 42 adds (c_42·d_42). They’re just independent vectors that happen to have correlated firing patterns.
To create genuinely semantic hierarchies in SynthSAEBench, we would need several fundamental changes:
Child feature directions should be composed from parent feature directions, not independent of them. For example:
Parent: "Deceptive Reasoning" → d_parent
Child: "Goal Misrepresentation" → d_child = α·d_parent + β·d_specificity
where:
- α·d_parent: The "deceptive reasoning" component (inherited)
- β·d_specificity: The specific additional structure for "goals" vs other deception types
- d_specificity is orthogonal to d_parent
This creates genuine compositional structure where the child representation literally contains the parent representation plus additional information. When an SAE decomposes activations containing the child feature, it should ideally recover both components.
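The construction d_child = α·d_parent + β·d_specificity can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not part of SynthSAEBench: the values of `alpha` and `beta`, the random draw of the directions, and the helper `unit` are all hypothetical.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
d_model = 64
# Hypothetical weights; alpha**2 + beta**2 == 1 keeps d_child at unit norm.
alpha, beta = 0.6, 0.8

d_parent = unit(rng.standard_normal(d_model))

# Draw a random vector and remove its projection onto d_parent
# (one Gram-Schmidt step), so d_specificity is orthogonal to d_parent.
raw = rng.standard_normal(d_model)
d_specificity = unit(raw - (raw @ d_parent) * d_parent)

# The child direction contains the parent direction plus extra structure.
d_child = alpha * d_parent + beta * d_specificity

# Because both components are unit vectors and orthogonal, the cosine
# similarity between child and parent equals alpha (up to rounding).
cos_parent_child = float(d_child @ d_parent)
```

Under these assumptions α directly controls how much of the parent is visible in the child: α near 1 makes the child nearly indistinguishable from the parent, while α near 0 recovers the current setting of unrelated directions.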
Instead of random orthogonal directions, define a set of semantic basis concepts that have interpretable meanings. For instance:
Semantic Bases:
- e_intent: "Intentionality/Agency" direction
- e_honesty: "Truthfulness" direction
- e_goal: "Goal-oriented behavior" direction
- e_other_aware: "Model of other agents" direction
Then construct features as combinations:
- "Deceptive Reasoning" = +1·e_intent -1·e_honesty +1·e_other_aware
- "Goal Misrepresentation" = +1·e_intent -1·e_honesty +1·e_goal +1·e_other_aware
- "Honest Goal Pursuit" = +1·e_intent +1·e_honesty +1·e_goal
The challenge is that we don’t actually know what the “right” semantic basis vectors are, and they may not exist as simple linear directions. This is precisely the problem SAEs are trying to solve for real models!
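As a rough sketch of the basis-combination idea, the snippet below builds the three example features from four basis directions and compares them. The basis here is a random orthonormal set, which is purely an assumption for illustration (as noted, the real semantic basis is unknown); the names `basis`, `weights`, and `cosine` are likewise hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64
names = ["e_intent", "e_honesty", "e_goal", "e_other_aware"]

# Hypothetical orthonormal basis: the reduced QR factorization of a random
# matrix yields a matrix with orthonormal columns.
q, _ = np.linalg.qr(rng.standard_normal((d_model, len(names))))
basis = q.T  # shape (4, d_model), one row per basis concept

# Signed weights copied from the list above (column order follows `names`).
weights = {
    "Deceptive Reasoning":    [1, -1, 0, 1],
    "Goal Misrepresentation": [1, -1, 1, 1],
    "Honest Goal Pursuit":    [1,  1, 1, 0],
}
features = {k: np.array(w, dtype=float) @ basis for k, w in weights.items()}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_parent_child = cosine(features["Deceptive Reasoning"],
                          features["Goal Misrepresentation"])
sim_deceptive_honest = cosine(features["Deceptive Reasoning"],
                              features["Honest Goal Pursuit"])
```

With an orthonormal basis, the cosine similarity between two features equals the cosine similarity of their weight vectors. Here the parent and child come out at 3/(2√3) ≈ 0.87, while “Deceptive Reasoning” and “Honest Goal Pursuit” are exactly orthogonal because the shared intent term and the opposed honesty term cancel.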
For hierarchies to be semantically meaningful, there should be tasks where the hierarchical relationship matters functionally. For example:
This would require coupling SynthSAEBench to synthetic task datasets where we can test functional relationships, not just generating random activations.
Another approach is to generate activations that are actually grounded in meaningful data. Instead of random sampling, create activations that correspond to:
For example, you could:
Here’s the deep problem: we want to use synthetic data to test SAEs because we don’t know the true features in real models. But to create semantically meaningful synthetic features, we need to know what semantic structure looks like in representation space, which is exactly what we don’t know.
This creates a paradox:
Given this limitation, here are realistic approaches:
We can’t create true semantics, but we can create statistical signatures that might correlate with semantics:
This won’t give us true semantics, but it tests whether SAEs can handle the statistical patterns that real semantic structures might produce.
Use real LLM activations to guide synthetic feature construction:
This grounds the synthetic features in real semantics while maintaining the control and ground truth of synthetic data.
Explicitly acknowledge that SynthSAEBench tests statistical decomposition, not semantic understanding:
Use it for what it’s good at: controlled experiments on statistical properties of SAE learning, not as a complete substitute for evaluation on real semantically meaningful data.
For AI safety applications, this limitation means:
What we CAN test with statistical hierarchies:
What we CANNOT test:
Practical recommendation: Use SynthSAEBench to test necessary but not sufficient conditions. If SAEs fail on statistical hierarchies, they will very likely fail on real semantic hierarchies too. If they succeed on statistical hierarchies, they might still fail on semantic ones, so you need additional validation on real models.
You’ve identified the core limitation: SynthSAEBench’s hierarchies are statistical simulacra of semantic structure, not genuine semantic hierarchies. The feature labels we assign (“Deceptive Reasoning,” “Goal Misrepresentation”) are for human interpretation only; the synthetic model has no semantic understanding. This is a fundamental constraint of working with synthetic data when we don’t know what true semantic features look like in representation space. The value of SynthSAEBench lies in providing controlled testbeds for statistical properties (sparsity, correlation, hierarchy, manifolds) that we believe are necessary for handling real semantic structures, while acknowledging that it cannot fully capture the compositional and functional properties that define true semantics.