LLM Foundations and Mechanistic Interpretability

BIO 47102 / 79006 • 2025

🤖 Course Overview

Explore the foundations of transformer architectures and modern techniques for mechanistically interpreting neural language models. Gain a deep understanding of self-attention mechanisms and of interpretability methods including circuit analysis, feature decomposition, and attribution graphs. The course combines theoretical foundations with hands-on implementation and analysis of real models.

Credits

3 Credits (Lecture)

Level

400 + 700

Format

In-person - synchronous
Weekly lectures: 3 hours

Class Size

Graduate seminar

Prerequisites: Machine Learning fundamentals, Python/PyTorch

📚 Course Syllabus

Week 1: Introduction to Transformers
Course overview and motivation, attention mechanism intuition, transformer architecture overview, encoder vs. decoder vs. encoder-decoder models, and where transformers fit in the NLP landscape

Week 2: Self-Attention Mechanism Deep Dive
Query, key, value formulation, scaled dot-product attention mathematics, multi-head attention architecture, computational complexity analysis, softmax normalization, and information routing through attention

Week 3: Transformer Architecture Components
Positional encodings (sinusoidal and learned), feed-forward networks (MLP layers), residual connections and layer normalization, token embeddings and output layers, absolute vs. relative position encodings, and information flow through residual paths

Week 4: Training Transformers
Training dynamics and challenges, gradient issues in deep transformers, layer normalization interactions, and Adam optimizer considerations

Week 5: Advanced Transformer Extensions
Sparse attention mechanisms, long-range dependencies, attention approximation methods, memory efficiency techniques, and trade-offs in sparse attention

Week 6: Introduction to Mechanistic Interpretability
Goals of mechanistic interpretability, circuits hypothesis, feature vs. neuron analysis, superposition phenomenon, polysemantic vs. monosemantic neurons, and the toy models approach

Week 7: Toy Models and Superposition
Superposition hypothesis, toy model construction, feature geometry in neural representations, interference and capacity analysis, sparse features and coding theory, and privileged vs. non-privileged bases

Week 8: Induction Heads and In-Context Learning
Discovery of induction heads, in-context learning mechanisms, phase transitions in training, copying and matching circuits, induction circuit structure, emergent capabilities during training, and role in few-shot learning

Week 9: Monosemantic Features and Dictionary Learning
Sparse autoencoders for feature extraction, dictionary learning applied to neural networks, monosemantic feature discovery, overcomplete representations, feature activation patterns, and automated interpretability scoring

Week 10: Circuit Analysis and Causal Tracing
Path patching and activation patching, causal interventions, circuit discovery methodology, validation through ablations, counterfactual analysis, logit attribution, direct vs. indirect effects, and circuit completeness metrics

Week 11: Attribution Graphs and Cross-Layer Analysis
Feature attribution methods, cross-layer transcoders (CLTs), attribution graph construction, information flow visualization, linear approximation of feature interactions, graph pruning strategies, and reconstruction fidelity

Week 12: Attention Pattern Analysis
Attention head specialization, compositional attention patterns, attention flow and copying, multi-layer attention interactions, head roles (induction, previous token, etc.), composition through OV and QK circuits, and attention-MLP interaction

Week 13: Scaling and Real-World Models
Interpretability at scale, limitations of current methods, computational challenges, automated vs. manual analysis, emergence in large models, and research frontiers

Week 14: Advanced Topics and Applications
Steering and model editing, adversarial interpretability, safety applications, ethical considerations, activation engineering, feature manipulation, interpretability for alignment, and transparency vs. security

Week 15: Course Wrap-up
Open problems in interpretability, future research directions, and career paths in interpretability

🎯 Learning Outcomes

Students will be able to:

Understand Transformer Architectures
Apply Mechanistic Interpretability
Identify Computational Circuits
Design Intervention Experiments
Analyze Attention Patterns
Evaluate Interpretability Research

🔬 Hands-On Experience

Implement transformer components from scratch in PyTorch and train small language models. Build and analyze toy superposition models to understand feature geometry. Apply sparse autoencoders for dictionary learning and monosemantic feature discovery. Conduct causal tracing and path patching experiments to identify computational circuits. Perform model steering experiments using activation engineering techniques and explore attribution graphs for real-world LLMs.
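
To give a flavor of what implementing a transformer component from scratch involves, the scaled dot-product attention covered in Week 2 can be sketched in a few lines. This is a minimal illustration in plain Python, not course-provided code: the function names `softmax` and `scaled_dot_product_attention` are illustrative assumptions, it handles a single head with no masking, and the course itself works in PyTorch.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention over lists of equal-length vectors.

    Returns (outputs, weights), where weights[i][j] is how much
    query i attends to key j.
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        row = softmax(scores)
        weights.append(row)
        # Output is the attention-weighted average of the value vectors.
        outputs.append(
            [sum(w * v[c] for w, v in zip(row, values)) for c in range(d_v)]
        )
    return outputs, weights


# Toy example: two tokens with 2-dimensional queries, keys, and values.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out, attn = scaled_dot_product_attention(q, k, v)
```

Each row of `attn` sums to 1, and in this toy input each token places more weight on its own key than on the other, so each output vector is pulled toward that token's own value vector.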

Grading: 5 Assignments (40%) • Midterm Model Analysis (20%) • Final Research Project (30%) • Participation (10%)

Instructor: Prof. Konstantinos Krampis