zotero mcp manifold safety

Research on invariant geometric structures in neural networks reveals a tension at the heart of AI safety. Language models appear to learn representations with surprisingly stable geometric properties, but these structures are far more complex than initially hoped, which creates both opportunities and challenges for ensuring safe AI systems.

The foundational question of whether safety-critical features transfer reliably across models hinges on which geometric structures are actually invariant. Work by Kornblith et al. on representation similarity establishes a sobering mathematical constraint: canonical correlation analysis (CCA), like any other similarity measure that is invariant to invertible linear transformations, cannot meaningfully compare representations when the dimensionality exceeds the number of data points. This limitation is critical for safety, because it means we cannot assume that a “deception” or “alignment” feature identified in one model will show up as a corresponding linear subspace in another model that such measures can detect. As an alternative, they introduce centered kernel alignment (CKA), which compares representational similarity matrices: it asks whether two networks organize the relationships between data points in the same way, rather than whether their raw activation coordinates can be linearly matched.
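The constraint on CCA can be seen in a small numerical sketch. The code below is an illustration under stated assumptions, not the implementation from Kornblith et al.: the function names, the random Gaussian "activations", and the sizes (50 examples, 200 dimensions) are all made up for the example. It computes the mean squared canonical correlation and linear CKA for two unrelated activation matrices that have more dimensions than examples.

```python
import numpy as np

def _center(acts):
    """Subtract the per-feature mean (rows are examples, columns are features)."""
    return acts - acts.mean(axis=0, keepdims=True)

def _orthonormal_basis(acts, tol=1e-10):
    """Orthonormal basis for the column space of an activation matrix."""
    u, s, _ = np.linalg.svd(acts, full_matrices=False)
    return u[:, s > tol * s.max()]

def mean_cca_r2(x, y):
    """Mean squared canonical correlation between two activation matrices."""
    qx = _orthonormal_basis(_center(x))
    qy = _orthonormal_basis(_center(y))
    return np.linalg.norm(qy.T @ qx, ord="fro") ** 2 / min(qx.shape[1], qy.shape[1])

def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    x, y = _center(x), _center(y)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    return cross / (np.linalg.norm(x.T @ x, ord="fro") * np.linalg.norm(y.T @ y, ord="fro"))

rng = np.random.default_rng(0)
n_examples, width = 50, 200                       # more dimensions than examples
acts_a = rng.normal(size=(n_examples, width))
acts_b = rng.normal(size=(n_examples, width))     # statistically unrelated to acts_a

# A rotated and rescaled copy of acts_a, which should count as "the same" representation.
rotation, _ = np.linalg.qr(rng.normal(size=(width, width)))
acts_a_rotated = 3.0 * acts_a @ rotation

cca_unrelated = mean_cca_r2(acts_a, acts_b)       # saturates at 1: CCA cannot tell them apart
cka_unrelated = linear_cka(acts_a, acts_b)        # stays below 1
cka_rotated = linear_cka(acts_a, acts_a_rotated)  # 1 up to rounding error

print(cca_unrelated, cka_unrelated, cka_rotated)
```

In this regime CCA reports perfect similarity for unrelated data, while linear CKA still separates the unrelated pair from the rotated copy. CKA on so few examples is itself biased upward, so the absolute values here should not be over-read.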

The linear representation hypothesis, which many safety researchers have relied upon, posits that semantic concepts are encoded as sparse, approximately orthogonal direction vectors in activation space. This would be enormously valuable for safety: if “deception” or “harmful intent” corresponded to clean linear directions, we could potentially identify, monitor, and intervene on these features across different models. However, Engels et al.’s work on multi-dimensional features challenges this assumption. They argue that some concepts are irreducibly multi-dimensional, existing not as single vectors but as manifolds, that is, curved geometric structures in activation space. Their discovery of circular features representing days of the week and months of the year in GPT-2 and Mistral 7B is mathematically striking: the representations trace out circles in activation space, and the models appear to solve modular arithmetic problems by moving around them. Their intervention experiments support the view that the circle is the unit the model actually computes with, rather than a byproduct of how the activations are visualized.
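A toy sketch of what computing on a circular feature means is given below. It is an idealized illustration, not the representation that Engels et al. extract from GPT-2 or Mistral 7B: the two-dimensional embedding, the helper names `embed`, `rotate`, and `decode`, and the day ordering are all assumptions made for the example.

```python
import numpy as np

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def embed(day_index, period=7):
    """Place a day on the unit circle at angle 2*pi*day_index/period."""
    angle = 2 * np.pi * day_index / period
    return np.array([np.cos(angle), np.sin(angle)])

def rotate(point, steps, period=7):
    """Advance a point by `steps` positions by rotating it around the circle."""
    angle = 2 * np.pi * steps / period
    rotation = np.array([[np.cos(angle), -np.sin(angle)],
                         [np.sin(angle),  np.cos(angle)]])
    return rotation @ point

def decode(point, period=7):
    """Return the index of the day whose embedding is closest to `point`."""
    candidates = np.stack([embed(i, period) for i in range(period)])
    return int(np.argmax(candidates @ point))

# "Four days after Saturday" becomes a rotation; wrap-around is automatic.
start = DAYS.index("Sat")
result = DAYS[decode(rotate(embed(start), 4))]
print(result)  # Wed

# A single linear direction loses information: projected onto the first axis,
# Tuesday and Sunday land on the same value.
same_projection = bool(np.isclose(embed(DAYS.index("Tue"))[0], embed(DAYS.index("Sun"))[0]))
print(same_projection)  # True
```

The second check is the point of the example: both coordinates are needed to tell the days apart, so no one-dimensional probe recovers the feature.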

The implications for safety are profound. If alignment-relevant features like “intended deception” or “power-seeking behavior” are similarly encoded as multi-dimensional manifolds rather than linear directions, our current interpretability tools may be fundamentally inadequate. The problem is not just technical complexity but mathematical: a manifold that curves through several dimensions of activation space cannot be captured by projecting onto any single linear direction. When Michaud et al. study how sparse autoencoders (SAEs) scale in the presence of feature manifolds, they identify a pathological regime in which SAEs learn far fewer features than they have latent dimensions. In effect, SAE capacity is “wasted” tiling a curved manifold with flat linear features, like trying to cover a sphere with flat tiles.
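The tiling intuition can be made concrete with a deliberately crude stand-in for an SAE. This is not the scaling model analyzed by Michaud et al.; it is a hypothetical one-active-latent dictionary whose atoms are spaced evenly around a unit circle, and the function name and sizes are assumptions made for the example.

```python
import numpy as np

def top1_reconstruction_error(num_latents, num_points=3600):
    """Mean squared error when each point on a unit circle is reconstructed
    from the single best-matching dictionary atom (a 1-sparse code)."""
    angles = np.linspace(0.0, 2 * np.pi, num_points, endpoint=False)
    points = np.stack([np.cos(angles), np.sin(angles)], axis=1)

    # Dictionary atoms: unit vectors evenly spaced around the circle.
    atom_angles = 2 * np.pi * np.arange(num_latents) / num_latents
    atoms = np.stack([np.cos(atom_angles), np.sin(atom_angles)], axis=1)

    scores = points @ atoms.T                      # shape (num_points, num_latents)
    best = scores.argmax(axis=1)                   # one active latent per point
    coeff = np.maximum(scores[np.arange(num_points), best], 0.0)  # non-negative code
    recon = coeff[:, None] * atoms[best]
    return float(np.mean(np.sum((points - recon) ** 2, axis=1)))

latent_counts = (4, 8, 16, 32, 64)
errors = [top1_reconstruction_error(k) for k in latent_counts]
for k, err in zip(latent_counts, errors):
    print(f"{k:3d} latents -> mean squared error {err:.5f}")
```

The error keeps shrinking as latents are added but never reaches zero, so a single two-dimensional feature can keep absorbing dictionary capacity that would otherwise go to other features.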

Li et al.’s analysis of SAE feature geometry at multiple scales gives a more nuanced picture of which invariant structures might exist. At the “atomic” scale, they find parallelogram and trapezoid structures: the familiar analogy relationships such as “man:woman::king:queen” that are often cited as evidence for linear representations. Critically, these geometric regularities improve dramatically once global distractor directions such as word length are projected out using linear discriminant analysis. How linear the structure looks therefore depends on the subspace in which it is viewed. One possible reading is that apparent linearity is partly a matter of perspective on a higher-dimensional curved structure, in the way the Earth appears flat locally despite its spherical geometry.

At intermediate scales, Li et al. discover spatial modularity: math and code features cluster into “lobes” similar to the functional regions seen in fMRI images of the brain. This modularity could be safety-relevant if alignment-critical features cluster in the same way. If deceptive reasoning capabilities are geometrically localized, interventions might be more tractable. However, the large-scale “galaxy” structure shows power-law eigenvalue distributions with layer-dependent slopes, suggesting that the overall geometry is highly anisotropic and structured in ways we do not yet fully understand.
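A power-law eigenvalue spectrum can be checked with a log-log fit. The sketch below is a generic recipe under stated assumptions, not the analysis pipeline of Li et al.: the function name, the synthetic Gaussian point cloud, and the exponent of 1.5 are arbitrary choices made for the illustration.

```python
import numpy as np

def spectrum_slope(points):
    """Least-squares slope of log(eigenvalue) against log(rank) for the
    covariance spectrum of a point cloud (rows are points)."""
    cov = np.cov(points, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)[::-1]            # sort descending
    eigvals = eigvals[eigvals > 1e-12 * eigvals[0]]     # drop numerically zero modes
    ranks = np.arange(1, len(eigvals) + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(eigvals), deg=1)
    return float(slope)

# Synthetic stand-in for a feature point cloud: the variance of the i-th
# coordinate decays as i ** -1.5 (an arbitrary exponent for this example).
rng = np.random.default_rng(0)
num_points, dim, true_exponent = 20_000, 30, 1.5
variances = np.arange(1, dim + 1, dtype=float) ** -true_exponent
cloud = rng.normal(size=(num_points, dim)) * np.sqrt(variances)

slope = spectrum_slope(cloud)
print(slope)  # close to -1.5
```

An isotropic cloud would give a slope near zero, so a clearly negative slope is one simple signature of the anisotropy described above. Running the same fit per layer is how a layer-dependent slope would show up.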

Park et al.’s formalization of categorical and hierarchical concepts as polytopes in representation space offers a mathematical framework for thinking about safety-relevant concept boundaries. A polytope is a higher-dimensional generalization of a polygon; here it can be thought of as the convex hull of a finite set of vertices in the representation space. They prove a relationship between conceptual hierarchies (like “animal” containing “dog” containing “poodle”) and the geometry of these representations. For AI safety, this suggests that if we can identify the polytope corresponding to “harmful content,” subcategories of harm should correspond to smaller polytopes contained within it. The challenge is that these polytopes are estimated empirically from hundreds of hierarchically related concepts drawn from WordNet, and it is unclear whether adversarially optimized behaviors would respect the same geometric boundaries.

The critical question of transferability, that is, whether safety features identified in one model carry over to others, receives partial answers from both theoretical and empirical work. Modell et al.’s theory of representation manifolds suggests a mathematical account of why and when features might transfer: they show that cosine similarity in representation space can encode a feature’s intrinsic geometry through shortest on-manifold paths. This is an appealing result because it connects representational distance to conceptual relatedness in a mathematically principled way. The geodesic distance along a manifold, meaning the length of the shortest path that stays on the curved surface, corresponds to how the model represents semantic similarity. This suggests that if two models both learn manifolds with similar intrinsic geometry for a concept like “deception,” these features could transfer even if the manifolds are embedded differently in activation space.
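The simplest case in which cosine similarity encodes on-manifold distance is a unit circle, sketched below. This is a toy special case chosen for illustration, not the general result of Modell et al.; the variable names and the half-circle sampling are assumptions.

```python
import numpy as np

# Points along half of a unit circle, parameterized by arc length from the first point.
arc_lengths = np.linspace(0.0, np.pi, 50)
points = np.stack([np.cos(arc_lengths), np.sin(arc_lengths)], axis=1)
reference = points[0]

# Cosine similarity to the reference (all points already have unit norm).
cosine_sim = points @ reference

# On the unit circle the geodesic distance is the angle, so arccos of the
# cosine similarity recovers the on-manifold distance.
geodesic_from_cosine = np.arccos(np.clip(cosine_sim, -1.0, 1.0))

# The straight-line (chord) distance cuts through the ambient space instead
# and is never longer than the path along the circle.
chord = np.linalg.norm(points - reference, axis=1)

print(np.allclose(geodesic_from_cosine, arc_lengths, atol=1e-6))  # True
print(float(chord[-1]), float(arc_lengths[-1]))  # 2.0 versus pi at the antipode
```

Rotating the whole circle into a different plane of a larger activation space leaves these cosine similarities unchanged, which is the sense in which the same intrinsic geometry can survive a different embedding.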

However, Wang et al.’s empirical work on representation similarity across different initializations delivers a sobering reality check: representations learned by identical network architectures from different random initializations are far less similar than commonly expected, even when measured through neuron activation subspace matches. Their theory of maximum match and simple match provides a rigorous characterization of structural similarity, revealing that even the finest-grained correspondences between neurons in two networks are surprisingly sparse. For AI safety, this suggests that features learned during training are highly contingent on initialization and training trajectory, making it difficult to rely on feature transfer as a robust safety guarantee.

The work by Fel et al. on Minkowski geometry in vision transformers offers an alternative geometric framework that may better capture the actual structure of neural representations. They propose moving beyond the linear representation hypothesis to the Minkowski representation hypothesis: tokens are formed by summing convex mixtures of archetypes. Mathematically, this means a representation is not a sparse sum of basis vectors but a sum of points, each lying within a polytope whose vertices are archetypal concepts. For example, a representation of a brown rabbit might combine a mixture that sits near “rabbit” (among animals), one near “brown” (among colors), and one near “fluffy” (among textures). This is grounded in Gärdenfors’ conceptual spaces theory and aligns with how multi-head attention mechanistically produces sums of convex mixtures. The safety implication is potentially significant: if models represent concepts as convex combinations within bounded regions (Minkowski sums of simplices), then interpolations between safe and unsafe concepts would remain within a geometrically defined boundary. That could make safety boundaries more robust than if representations were arbitrary linear combinations extending without limit in any direction.
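The construction behind "Minkowski sums of simplices" can be sketched in a few lines. The archetype groups, their random coordinates, the dimension, and the function name below are hypothetical placeholders, not archetypes recovered by Fel et al.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Hypothetical archetype groups: each row is one archetype vector.
archetypes = {
    "animal":  rng.normal(size=(3, dim)),   # e.g. rabbit, dog, bird
    "color":   rng.normal(size=(4, dim)),   # e.g. white, brown, grey, black
    "texture": rng.normal(size=(2, dim)),   # e.g. smooth, fluffy
}

def minkowski_token(weights):
    """Sum of one convex mixture per archetype group.

    `weights` maps each group name to non-negative weights that sum to 1."""
    token = np.zeros(dim)
    for group, w in weights.items():
        w = np.asarray(w, dtype=float)
        if np.any(w < 0) or not np.isclose(w.sum(), 1.0):
            raise ValueError(f"weights for {group!r} must be convex")
        token += w @ archetypes[group]
    return token

brown_rabbit = minkowski_token({
    "animal":  [1.0, 0.0, 0.0],
    "color":   [0.0, 0.8, 0.2, 0.0],
    "texture": [0.3, 0.7],
})

# Every token lies in a bounded region: its norm cannot exceed the sum of
# the largest archetype norms, one per group.
norm_bound = sum(np.linalg.norm(a, axis=1).max() for a in archetypes.values())
random_tokens = [
    minkowski_token({g: rng.dirichlet(np.ones(len(a))) for g, a in archetypes.items()})
    for _ in range(1000)
]
max_norm = max(np.linalg.norm(t) for t in random_tokens)
print(max_norm <= norm_bound)  # True
```

The bound follows from the triangle inequality and holds for any convex weights, which is the boundedness property the safety argument relies on. An unconstrained linear combination of the same archetypes has no such bound.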

Gallifant et al.’s recent work on SAE features for classification and transferability provides crucial empirical evidence for safety applications. They demonstrate that SAE-derived features achieve strong performance (macro F1 > 0.8) on safety-critical classification tasks and, importantly, transfer across model scales from Gemma 2 2B to 9B-IT. The fact that these features generalize zero-shot to cross-lingual toxicity detection suggests there are indeed geometric structures corresponding to safety-relevant concepts that persist across different model sizes and even languages. However, their finding that pooling strategies and binarization thresholds significantly affect performance indicates that these structures are sensitive to how we extract and process them. The invariance is partial, not absolute.
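The sensitivity to pooling and binarization is easy to see on a toy input. The sketch below is not the pipeline from Gallifant et al.; the function name, the two pooling options, the threshold value, and the activation matrix are assumptions made for the illustration.

```python
import numpy as np

def pool_and_binarize(sae_acts, pooling="max", threshold=0.0):
    """Collapse per-token SAE activations of shape (tokens, latents) into one
    binary feature vector for a downstream classifier."""
    if pooling == "max":
        pooled = sae_acts.max(axis=0)
    elif pooling == "mean":
        pooled = sae_acts.mean(axis=0)
    else:
        raise ValueError(f"unknown pooling: {pooling!r}")
    return (pooled > threshold).astype(np.int8)

# Toy activations for a three-token input and three SAE latents.
acts = np.array([
    [0.0, 0.9, 0.0],
    [0.0, 0.0, 0.1],
    [0.0, 0.3, 0.0],
])

max_features = pool_and_binarize(acts, pooling="max", threshold=0.05)
mean_features = pool_and_binarize(acts, pooling="mean", threshold=0.05)
print(max_features.tolist())   # [0, 1, 1]
print(mean_features.tolist())  # [0, 1, 0]
```

The third latent fires weakly on a single token, so it survives max pooling but is averaged below the threshold by mean pooling. A classifier trained on one variant sees a different feature set than one trained on the other, even though the underlying activations are identical.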

Taken together, these mathematical and empirical results paint a complex picture for AI safety. On one hand, there are genuine geometric regularities: curved manifold structures, polytopes reflecting conceptual hierarchies, and spatially modular organization that sometimes transfers across models and scales. On the other hand, these structures are not simple linear features, they depend sensitively on training details and extraction methods, and multi-dimensional manifolds pose fundamental challenges for current interpretability tools. The pathological SAE scaling regime identified by Michaud et al. is particularly concerning: if safety-critical features are encoded as multi-dimensional manifolds, our primary tool for extracting interpretable features may systematically fail to capture them at the resolution needed for reliable monitoring and intervention.

The mathematics underlying all of this is differential geometry: the manifolds that encode features are curved, and this curvature matters. You cannot understand a curved surface by looking only at tangent planes at individual points; you need to understand how these planes turn as you move along the surface. Similarly, understanding whether an AI system is engaging in deceptive reasoning may require not just identifying a “deception direction” but understanding the entire manifold of deceptive strategies and how the model moves along geodesics within it. The invariance we seek for safety is the ability to identify and control dangerous capabilities regardless of how they are specifically implemented. It may need to be formulated not in terms of linear algebra but in terms of topological and differential geometric structures that remain stable across training runs and model architectures.