To implement manifold features in SynthSAEBench, we need to modify the data generation pipeline to support features that are not just scalar coefficients times unit vectors. Instead, they are points sampled from geometric structures embedded in the activation space $\mathbb{R}^D$. The key is to extend the existing feature dictionary and coefficient sampling mechanism while preserving compatibility with correlation, hierarchy, and superposition.

First, extend the synthetic model to maintain two types of features: discrete features (as currently implemented) and manifold features. For each feature $i$, add metadata specifying feature_type (discrete or manifold). For manifold features, store three additional parameters:

- manifold_type (circle, sphere, torus, etc.)
- intrinsic_dim (the manifold's true dimensionality)
- embedding_basis (a $D \times k$ matrix with orthonormal columns spanning the subspace that contains the manifold)

Note that $k$ is generally larger than intrinsic_dim. A circle has intrinsic dimension 1 but needs a 2-dimensional subspace, and $S^{d-1}$ has intrinsic dimension $d-1$ but needs $d$. Instead of storing a single direction vector $d_i \in \mathbb{R}^D$ for manifold feature $i$, store the embedding basis $U_i$. The existing firing probability $p_i$, hierarchy relationships, and correlation structure remain unchanged. They still determine whether the manifold feature is active on a given sample. The crucial modification is in what happens when a manifold feature fires.
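The per-feature metadata could be held in a small record like the one below. This is a minimal sketch assuming NumPy arrays. FeatureSpec, embedding_width, and the field names are hypothetical illustrations, not SynthSAEBench's actual API. The validation encodes the point that the basis width $k$ depends on the manifold type and usually exceeds intrinsic_dim.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


def embedding_width(manifold_type: str, intrinsic_dim: int) -> int:
    """Columns k of U_i needed to embed the manifold (hypothetical helper)."""
    if manifold_type == "circle":   # S^1: 1 intrinsic dim, 2 columns
        return 2
    if manifold_type == "sphere":   # S^n: n intrinsic dims, n + 1 columns
        return intrinsic_dim + 1
    if manifold_type == "torus":    # flat T^n: n intrinsic dims, 2n columns
        return 2 * intrinsic_dim
    raise ValueError(f"unknown manifold_type {manifold_type!r}")


@dataclass
class FeatureSpec:
    """Illustrative per-feature metadata; field names are assumptions."""
    feature_type: str                               # "discrete" or "manifold"
    firing_prob: float                              # p_i, shared by both types
    direction: Optional[np.ndarray] = None          # d_i in R^D (discrete only)
    manifold_type: Optional[str] = None             # "circle", "sphere", "torus"
    intrinsic_dim: Optional[int] = None             # true manifold dimensionality
    embedding_basis: Optional[np.ndarray] = None    # U_i in R^{D x k}

    def __post_init__(self) -> None:
        if self.feature_type == "discrete":
            if self.direction is None or self.direction.ndim != 1:
                raise ValueError("discrete features need a 1-D direction d_i")
        elif self.feature_type == "manifold":
            U = self.embedding_basis
            k = embedding_width(self.manifold_type, self.intrinsic_dim)
            if U is None or U.ndim != 2 or U.shape[1] != k:
                raise ValueError(f"{self.manifold_type} needs a D x {k} basis")
            # Orthonormal columns keep the manifold's shape undistorted.
            if not np.allclose(U.T @ U, np.eye(k), atol=1e-8):
                raise ValueError("embedding_basis columns must be orthonormal")
        else:
            raise ValueError(f"unknown feature_type {self.feature_type!r}")
```

For example, a circular feature in $\mathbb{R}^{16}$ would be declared with a $16 \times 2$ basis and intrinsic_dim=1. Declaring a 2-sphere with that same basis is rejected because it needs three columns.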
When generating a batch of activations, sampling proceeds in phases just as in the current SynthSAEBench, with manifold-specific logic inserted.

Phase 1: Use the existing Gaussian copula mechanism to determine which features fire (binary indicators $z_i$), respecting correlations.

Phase 2: For discrete features that fire ($z_i = 1$), sample scalar coefficients exactly as before: $c_i = \mathrm{ReLU}(\mu_i + \sigma_i \epsilon_i)$ with $\epsilon_i \sim \mathcal{N}(0, 1)$. For manifold features that fire, sample in two steps:

- (a) Draw a point $m_i$ on the intrinsic manifold according to the manifold's parameterization.
- (b) Draw a radial magnitude $r_i = \mathrm{ReLU}(\mu_i + \sigma_i \epsilon_i)$ from the same rectified Gaussian used for discrete features.

Phase 3: Apply hierarchy constraints as before. When a manifold child is deactivated, zero out both its manifold position and its radial magnitude, so it contributes nothing.

Phase 4: Compute the final activation by summing contributions. Discrete features contribute $c_i d_i$ as usual. Manifold features contribute $r_i \phi_i(m_i)$, where $\phi_i$ is the embedding function that maps the intrinsic point into $\mathbb{R}^D$ through $U_i$. The total activation is

$$a = \sum_{i \in \mathcal{A}_{\text{disc}}} c_i d_i + \sum_{j \in \mathcal{A}_{\text{man}}} r_j \phi_j(m_j) + b,$$

where $\mathcal{A}_{\text{disc}}$ and $\mathcal{A}_{\text{man}}$ are the sets of active discrete and manifold features, and $b$ is the bias term.
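The four phases can be sketched end to end as below. This is a minimal NumPy illustration, not the benchmark's implementation, and it rests on several assumptions. The function sample_batch and all its argument names are hypothetical. Manifold features are restricted to circles for brevity. The hierarchy rule assumed here is that a child is switched off whenever its parent is off, with parents listed before children. The copula step thresholds correlated Gaussians at $\Phi^{-1}(p_i)$, so each feature's marginal firing probability before hierarchy is $p_i$.

```python
from statistics import NormalDist

import numpy as np


def sample_batch(n, p, corr, mu, sigma, parent, is_manifold, dirs, bases, bias, rng):
    """Hypothetical four-phase sampler (circle-only manifold features).

    p, mu, sigma: (F,) arrays; corr: (F, F) correlation matrix;
    parent: (F,) ints, -1 for roots, parents indexed before children;
    is_manifold: (F,) bools; dirs: (F, D) directions d_i (rows of manifold
    features are ignored); bases: {i: (D, 2) orthonormal U_i}; bias: (D,).
    """
    F, D = dirs.shape
    # Phase 1: Gaussian copula. g ~ N(0, corr); feature i fires when g_i is
    # below the standard-normal quantile Phi^{-1}(p_i), so P(z_i = 1) = p_i.
    L = np.linalg.cholesky(corr)
    g = rng.standard_normal((n, F)) @ L.T
    thresh = np.array([NormalDist().inv_cdf(pi) for pi in p])
    z = g < thresh
    # Phase 2: rectified-Gaussian magnitudes (c_i or r_i) and, for manifold
    # features, an intrinsic point (here an angle on the circle).
    mag = np.maximum(mu + sigma * rng.standard_normal((n, F)), 0.0)
    theta = rng.uniform(0.0, 2 * np.pi, size=(n, F))
    # Phase 3: hierarchy (assumed rule: child requires its parent). Zeroing
    # the magnitude removes the feature's whole contribution.
    for i in range(F):
        if parent[i] >= 0:
            z[:, i] &= z[:, parent[i]]
    mag = mag * z
    # Phase 4: a = sum_disc c_i d_i + sum_man r_j phi_j(theta_j) + b.
    disc = ~is_manifold
    a = mag[:, disc] @ dirs[disc] + bias
    for i in np.flatnonzero(is_manifold):
        circle = np.stack([np.cos(theta[:, i]), np.sin(theta[:, i])], axis=1)  # (n, 2)
        a += mag[:, i:i + 1] * (circle @ bases[i].T)  # (n, 1) * (n, D)
    return a, z
```

Sampling every magnitude and angle up front and then masking by $z$ is equivalent to sampling only for active features. It trades a little wasted work for fully vectorized code.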
For a circular feature ($S^1$), the implementation would store a 2-column embedding basis $U_i \in \mathbb{R}^{D \times 2}$ with orthonormal columns. When the feature fires, sample an angle $\theta \sim \mathrm{Uniform}(0, 2\pi)$. Alternatively, draw it from a von Mises distribution if you want probability mass concentrated at certain points, such as specific days of the week. The embedding function is $\phi_{\text{circle}}(\theta) = U_i [\cos\theta, \sin\theta]^\top$. This gives a point in $\mathbb{R}^D$ on the unit circle in the 2D subspace spanned by $U_i$. For radial variation, multiply by $r_i$, so the contribution is $r_i \, \phi_{\text{circle}}(\theta)$.
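A minimal NumPy sketch of one circular contribution, assuming a numpy.random.Generator. The function name sample_circle_contribution is hypothetical. Passing kappa switches the angle from uniform to von Mises, centered at loc.

```python
import numpy as np


def sample_circle_contribution(U, mu, sigma, rng, kappa=None, loc=0.0):
    """Return (r * U [cos t, sin t]^T, t, r) for one firing circular feature.

    U: (D, 2) basis with orthonormal columns. t ~ Uniform(0, 2*pi) by default,
    or von Mises(loc, kappa) when kappa is given (larger kappa = tighter).
    """
    if kappa is None:
        theta = rng.uniform(0.0, 2 * np.pi)
    else:
        theta = rng.vonmises(loc, kappa)
    r = max(mu + sigma * rng.standard_normal(), 0.0)  # r = ReLU(mu + sigma * eps)
    return r * (U @ np.array([np.cos(theta), np.sin(theta)])), theta, r
```

For a day-of-week feature, one illustrative choice is loc = 2πk/7 for day k with a moderate kappa. Because $U$ has orthonormal columns, the contribution's norm equals $r$ exactly.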
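For a sphere $S^{d-1}$, store a $d$-column embedding basis $U_i \in \mathbb{R}^{D \times d}$. Sample $g \sim \mathcal{N}(0, I_d)$ and normalize it to get a uniformly distributed point on the unit sphere, $v = g / \lVert g \rVert$. Then embed via $\phi_{\text{sphere}}(v) = U_i v$ and scale by $r_i$.

For a torus $T^2 = S^1 \times S^1$, store a 4-column basis $U_i \in \mathbb{R}^{D \times 4}$ and sample two independent angles $\theta_1, \theta_2$. Then embed as $\phi_{\text{torus}}(\theta_1, \theta_2) = U_i [\cos\theta_1, \sin\theta_1, \cos\theta_2, \sin\theta_2]^\top$. Points on this flat torus have norm $\sqrt{2}$ rather than 1. If $r_i$ should equal the contribution's norm, as it does for the circle and sphere, divide by $\sqrt{2}$.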
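The sphere and torus cases are sketched below in NumPy. All function names are hypothetical, and the torus angles are assumed uniform. Setting unit_norm=True applies the optional $1/\sqrt{2}$ rescaling so that $r$ is the contribution's norm.

```python
import numpy as np


def sample_sphere_point(d, rng):
    """Uniform point on S^{d-1}: normalize g ~ N(0, I_d)."""
    g = rng.standard_normal(d)
    return g / np.linalg.norm(g)


def embed_sphere(U, v, r):
    """r * U v for U in R^{D x d} with orthonormal columns."""
    return r * (U @ v)


def sample_torus_angles(rng):
    """Two independent angles (theta1, theta2), each ~ Uniform(0, 2*pi)."""
    return rng.uniform(0.0, 2 * np.pi, size=2)


def embed_torus(U, thetas, r, unit_norm=False):
    """r * U [cos t1, sin t1, cos t2, sin t2]^T for U in R^{D x 4}.

    The raw 4-vector has norm sqrt(2); unit_norm=True divides it by sqrt(2).
    """
    t1, t2 = thetas
    coords = np.array([np.cos(t1), np.sin(t1), np.cos(t2), np.sin(t2)])
    if unit_norm:
        coords /= np.sqrt(2.0)
    return r * (U @ coords)
```

Normalizing a standard Gaussian works because its density is rotation-invariant, which makes the direction $g / \lVert g \rVert$ uniform on the sphere.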
The embedding bases $U_i$ should be initialized as random orthonormal matrices, via QR decomposition of random Gaussian matrices. They can optionally be included in the orthogonalization procedure that reduces superposition between features. If they are, re-orthonormalize each $U_i$ afterwards so that its manifold keeps its shape.
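A minimal sketch of the QR-based initialization, assuming NumPy. The helper name random_orthonormal_basis is hypothetical. The sign correction by $\mathrm{sign}(\mathrm{diag}(R))$ is a standard refinement that makes the result uniformly (Haar) distributed. Without it, $Q$ is still orthonormal but not exactly Haar-distributed.

```python
import numpy as np


def random_orthonormal_basis(D, k, rng):
    """Random U in R^{D x k} with orthonormal columns (QR of a Gaussian matrix)."""
    if k > D:
        raise ValueError("cannot fit k orthonormal columns in R^D when k > D")
    Q, R = np.linalg.qr(rng.standard_normal((D, k)))  # reduced QR: Q is (D, k)
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0  # guard against a (measure-zero) zero pivot
    return Q * signs  # flip column j by signs[j]
```

The same helper serves every manifold type, with $k = 2$ for a circle, $k = d$ for $S^{d-1}$, and $k = 4$ for $T^2$.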
Critically, to match observed LLM behavior, set $\sigma_i > 0$ for realistic radial variation. Michaud et al. found that hollow manifolds lead to pathological tiling, while radial variation encourages SAEs to learn basis-like representations. A hollow manifold has $\sigma_i = 0$, so every active sample lies at the fixed radius $\mathrm{ReLU}(\mu_i)$. In practice, the implementation would add a ManifoldFeature class with methods sample_intrinsic_point() and embed_to_activation_space(intrinsic_point, radius). This design makes it modular to add new manifold types while reusing the existing SynthSAEBench infrastructure for correlation matrices, hierarchy trees, and evaluation metrics.
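One way the ManifoldFeature class could look is sketched below with NumPy. Only the two method names come from the text. The constructor arguments, the intrinsic_to_coords and sample_radius helpers, and the subclasses are hypothetical. Adding a new manifold type means subclassing and implementing the two abstract methods.

```python
from abc import ABC, abstractmethod

import numpy as np


class ManifoldFeature(ABC):
    """Hypothetical base class for manifold features."""

    def __init__(self, basis, mu, sigma, rng):
        self.basis = basis          # U_i, (D, k) with orthonormal columns
        self.mu, self.sigma = mu, sigma
        self.rng = rng              # numpy.random.Generator

    @abstractmethod
    def sample_intrinsic_point(self):
        """Draw m_i from the manifold's parameterization."""

    @abstractmethod
    def intrinsic_to_coords(self, point):
        """Map m_i to its k coordinates inside the embedding subspace."""

    def sample_radius(self):
        # Same rectified Gaussian as discrete features; sigma > 0 avoids a hollow manifold.
        return max(self.mu + self.sigma * self.rng.standard_normal(), 0.0)

    def embed_to_activation_space(self, intrinsic_point, radius):
        return radius * (self.basis @ self.intrinsic_to_coords(intrinsic_point))


class CircleFeature(ManifoldFeature):
    def sample_intrinsic_point(self):
        return self.rng.uniform(0.0, 2 * np.pi)

    def intrinsic_to_coords(self, theta):
        return np.array([np.cos(theta), np.sin(theta)])


class SphereFeature(ManifoldFeature):
    def sample_intrinsic_point(self):
        g = self.rng.standard_normal(self.basis.shape[1])
        return g / np.linalg.norm(g)

    def intrinsic_to_coords(self, v):
        return v


class TorusFeature(ManifoldFeature):
    def sample_intrinsic_point(self):
        return self.rng.uniform(0.0, 2 * np.pi, size=2)

    def intrinsic_to_coords(self, thetas):
        t1, t2 = thetas
        return np.array([np.cos(t1), np.sin(t1), np.cos(t2), np.sin(t2)])
```

A sampler would call feature.embed_to_activation_space(feature.sample_intrinsic_point(), feature.sample_radius()) for each active manifold feature. It leaves firing, correlation, and hierarchy to the existing machinery.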