← Back to Blog

Visualizing model's thought

2025.7.10

Hypothesis

for classification tasks performed by in-context learning, model assign inputs belongs to different labels onto distinct, different manifolds to form clusters.

Task setting: a simple classification task with 10 labels.

$$\text{Label}(x) = \text{alphabet}(x \; \text{mod} \; 10)$$

where $x$ is an integer input, and $\text{alphabet}(n)$ returns the n-th letter in English alphabet. For example, $1:A, 11:A, 2:B, 22:B, ...$

Visualizing the hidden state of model

The context is : "1:A, ..., 101:A, 2:B, ..., 102:B, 3:C, ..., 103:C, 4:D, ..., 104:D, 111:".

I extract the hidden state of all number tokens in layer 1, 14, 32 for meta llama3-8b, and perform t-SNE on them to visualize. Blue, orange, green, red are numbers with label A, B, C, D, that is, end with 1, 2, 3, 4:

t-SNE visualization of hidden states at different layers — Layer 1: the representation learned from pre-train. Tokens like "1, 2, 3, 4" and "11, 12, 13, 14" are close to each other, see sticks in the figure.

As we can see, at the very first layer, the representations of different labels are mixed together. As we go deeper into the network, the model gradually separates the representations of different labels, forming distinct clusters by the final layer. This suggests that the model is organizing the input data into different clusters, supporting our hypothesis.

How much is the clusters separated?

One may solve the problem with one-dimensional non-linear representation. Let's check the effective dimension of model's representation.

$$\alpha = \frac{\left(\sum_{i}\sigma_{i}^{2}\right)^{2}}{\sum_{i}\sigma_{i}^{4}},\quad\mathrm{where}~X=U\Sigma V^{T}, \sigma_i \;\text{is the i-th singular value}$$

Let's see how the effective dimension changes across layers

The effective dimension is about 1.4, which meets with the intuition that the problem should be solved with one dimension representation.

Is the representation linear separable? Let's figure out:

$$S^{\star}(A):=\operatorname*{sup}_{\|u \|=1}\frac{1}{n}\left[\sum_{i\in A}\mathbb{I}(u^{T}h^{(i)}>0)+\sum_{i\ne A}\mathbb{I}(u^{T}h^{(i)})\right] $$

The separability in the 2 dimensional-space should be large, according to the effective dimension we had before. Let's check it by first project the representation on the first 2 right- singular vectors

$$h^{(i)} = (x^{(i)} \cdot v_1, \; x^{(i)} \cdot v_2)$$

And plot the average separability score $S^{\star}$ between clusters

The figure clearly demonstrate that the representation projected to the first two right singular vectors has a high separability score, indicating that the clusters are well-separated, which meets with the hypothesis, and the model's perfect performance on the task.

Other visualization

Here are some other visualizations. For example, how does model represent 7 days in a week, or words discribing size like "big", "small"?

Words describing size arranged by their meanings

The above visualization shows that, model can learn the geometry of semantics, or in-context. If one thinks about the relationships between words, it becomes clear that the model is capturing not just the meanings of individual words, but also their contextual relationships.

The representation of language model is still quite mysterious. Back to 2023, linear representation hypothesis was proposed. After that, researchers found that there are many non-linear representations, like circles, helix. Recently, some researcher proposed a new perspective of treating features, from a geometry perspective. However, there is still a lot of definition on features in the field of mechanistic interpretability.

This blog simply visualized some toy cases, aiming to let readers have an intuition about representations.

Visualizing model's thought

Hypothesis

Visualizing the hidden state of model

How much is the clusters separated?

Other visualization

References