Peering into LLMs
In a recent project, I set out to use Neuroscope (created by Neel Nanda and made available on Google Colab), an open-source tool for visualizing neuron activations in neural networks. Neuroscope gives researchers and practitioners an intuitive interface for inspecting neuron activation values, making it valuable for understanding how models process and interpret information. I paired Neuroscope with GPT-J-6B, a large language model with approximately 6 billion parameters. One significant modification I made to the original code was to visualize only those neurons exhibiting substantial activation values: specifically, neurons whose activation exceeds a threshold of 3 (highlighted in red) or drops below 0 (highlighted in blue) on at least one token. With this approach, between 50 and 500 neurons were selected from each layer, out of a total of 16,384 per layer. By focusing on the most strongly activated neurons, I aimed to identify distinct activation patterns that could shed some light on the model’s internal mechanisms.
As expected, there was a fascinating progression of neuron activation patterns throughout the layers of GPT-J-6B. In the initial layers, sparse and mixed token activations were observed, indicative of early-stage processing where input features are still being deciphered. Early layers often rely on distributed representations, where information is spread across many neurons, allowing the model to capture a range of syntactic and semantic features; this distributed nature tends to produce less coherent activations. Moving deeper into the architecture, particularly at Layer 27, a transition occurred: the model showed cohesive and robust neuron activations, with many neurons activating consistently across the entirety of each input prompt. This suggests the emergence of a “privileged basis”, a situation in which individual neuron directions become meaningful and carry significant information for solving the task. The robust activations in the final layers appear especially relevant for tasks requiring logical deduction, as they underpin the model’s predictive capabilities, such as forecasting the next item in a sequence; with the prompt “apple, banana, cherry”, the model goes on to predict “mango”.
These concepts are already well established in the mechanistic interpretability field (for a good starting point see here), and as expected my findings indicate a clear dichotomy between the activation patterns of earlier and later layers within GPT-J-6B. The early layers exhibit distributed and less coherent activations, allowing them to capture a wide array of linguistic features without settling on a definitive interpretation. The later layers, by contrast, show a more unified activation pattern, suggesting they are pivotal for complex reasoning and higher-level abstraction. This transition underscores the model’s ability to perform intricate computations and generate contextually relevant predictions.
Details of the implementation:
My goal was to run Neuroscope with a reasonably large model, GPT-J-6B (about 6 billion parameters, 28 layers, 16,384 MLP neurons per layer): large enough to perform well on a variety of tasks, while keeping its size manageable.
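For illustration, here is a minimal sketch of how loading the model might look, assuming the Neuroscope notebook runs on top of TransformerLens; the model name string is an assumption and the original notebook may differ:

```python
# Sketch only: load GPT-J-6B through TransformerLens and confirm its dimensions.
# A high-memory GPU runtime is needed for a ~6B-parameter model.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("EleutherAI/gpt-j-6b")  # assumed model name

print(model.cfg.n_layers)  # 28 transformer blocks
print(model.cfg.d_mlp)     # 16384 MLP neurons per layer
```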
Using in-context learning prompts with three examples (see the prompts listed in the Appendix of this post), the model is capable enough to predict “mango” after “apple, banana, cherry”.
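As a quick sanity check, a hedged sketch of how that prediction could be verified, continuing from the loading snippet above (Prompt 1 is reproduced from the Appendix; how the answer is tokenized is not guaranteed):

```python
# Sketch only: greedy next-token prediction after Prompt 1 from the Appendix.
prompt = (
    'Sequence: "red, green, blue" Pattern: Colors of the rainbow in order Next Word: "indigo"\n'
    'Sequence: "Monday, Tuesday, Wednesday" Pattern: Days of the week in order Next Word: "Thursday"\n'
    'Sequence: "one, two, three" Pattern: Counting numbers in order Next Word: "four"\n'
    "Based on the examples above, determine the pattern and the next word in the following sequence:\n"
    'Sequence: "apple, banana, cherry" Pattern: Alphabetical order of fruit names Next Word:'
)

logits = model(prompt)               # shape [1, n_tokens, d_vocab]
next_token = logits[0, -1].argmax()  # greedy choice for the token after the prompt
print(model.tokenizer.decode([next_token.item()]))
# The first token may just be a space or quote; a few more greedy steps
# should surface the expected completion, e.g. "mango".
```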
Since the model has many thousands of neurons per layer, I wanted to be able to scan the neurons quickly for activation patterns.
I modified the code to select and plot only neurons whose activation value on at least one of the tokens is above 3 (red) or below 0 (blue); tokens with activation values in between are rendered grey.
Based on this condition, the code selected 50-500 neurons per layer out of the total of 16,384, and I was able to quickly scroll through them and find the patterns shown in the figures that follow.
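The selection criterion itself reduces to a simple threshold over the MLP activations. The following is a sketch of the idea rather than the exact notebook code, assuming TransformerLens-style cached activations and reusing model and prompt from the snippets above:

```python
# Sketch of the selection criterion: keep a neuron if any token in the prompt
# drives its MLP activation above 3 or below 0.
logits, cache = model.run_with_cache(prompt)

layer = 27
acts = cache["post", layer][0]        # MLP post-activations, shape [n_tokens, 16384]

high = acts.max(dim=0).values > 3.0   # strongly positive on at least one token (red/orange)
low = acts.min(dim=0).values < 0.0    # negative on at least one token (blue)
selected = (high | low).nonzero(as_tuple=True)[0]

print(f"Layer {layer}: {len(selected)} of {acts.shape[1]} neurons selected")
# In my runs this came out to roughly 50-500 neurons per layer.
```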
As we move from the first towards the last layers of the model (Layer 27 in the pictures that follow), we go from single tokens activating to neurons solidly “lighting up” across the complete prompt.
Since there are neurons that are solidly red or solidly blue across all tokens in the prompt, could this demonstrate a “privileged basis” (for strongly positive and for negative activations, respectively) for logical-deduction prompts such as those in the Appendix, which ask the model to find the next item in a series?
It is apparent that in the earlier layers we have sparse token activations (or mixed positive / negative ones), while as we move towards the later layers the neurons activate on all tokens in a unified positive or negative direction.
Furthermore, the solid activation of neurons in the later layers possibly indicates that this is where the model performs the logical deduction / computation (predicting “mango” after “apple, banana, cherry”).
Overall, I acquired a decent grasp of the interpretability background and methods, and within the allotted hours (based on Neel’s Google doc with the instructions / guidance) I managed to modify Neuroscope for the experiment described here. The code can be found in my modified version of Neuroscope.
For the following images, the colour coding of activation values is: red > 3.5, orange between 3 and 3.5, grey between 0 and 3, and blue < 0. In the initial layers, the tokens from the prompt appear truncated (possibly an artifact of the code).
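In code, this colour bucketing amounts to something like the hypothetical helper below (not the actual plotting code):

```python
def token_colour(activation: float) -> str:
    """Map a single token's activation value to the colour used in the figures."""
    if activation > 3.5:
        return "red"
    if activation > 3.0:
        return "orange"
    if activation >= 0.0:
        return "grey"
    return "blue"
```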
The pictures show 20-30 of the 50-500 neurons selected by the code per layer; the selected neurons met the criterion of having an activation above 3 or below 0 on at least one of the tokens.
——— Neuron activations with tokens, Prompt 1 ———
[Figures: selected neuron activations at Layers 1, 5, 10 and 15]
APPENDIX: Prompts for in-context learning
Prompt 1
Sequence: “red, green, blue” Pattern: Colors of the rainbow in order Next Word: “indigo”
Sequence: “Monday, Tuesday, Wednesday” Pattern: Days of the week in order Next Word: “Thursday”
Sequence: “one, two, three” Pattern: Counting numbers in order Next Word: “four”
Based on the examples above, determine the pattern and the next word in the following sequence:
Sequence: “apple, banana, cherry” Pattern: Alphabetical order of fruit names Next Word:
Prompt 2
1. Sequence: 1, 3, 5, 7 Pattern: Increase by 2 Next Number: 9
2. Sequence: 10, 20, 30, 40 Pattern: Increase by 10 Next Number: 50
3. Sequence: 5, 10, 15, 20 Pattern: Increase by 5 Next Number: 25
Based on the examples above, determine the pattern and the next number in the following sequence:
Sequence: 2, 4, 6, 8 Pattern: Increase by 2 Next Number: