February 3–5, 2026
My First Independent Interpretability Research
Feb 3 — Setup
As part of my desire to understand AI, interpretability is the particular part of the equation that has really captured my interest. Not alignment, not safety policy — interpretability. Understanding the actual mechanisms by which these models turn inputs into outputs.
To do this kind of research on the latest frontier models — Claude Opus 4.5, Gemini, GPT-4 — is not really possible for someone with a MacBook who doesn't work at these companies. The models are too large, and the internal researchers have tools outsiders simply don't have access to. But with smaller models like GPT-2, you really can take a stab at this on a small scale with open-source tools.
After some research, I had my toolkit: TransformerLens (Neel Nanda's library for hooking into transformer internals), PyTorch (the ML framework underneath it), Jupyter notebooks for interactive work, and Matplotlib and Plotly for visualization. So I went about installing them, along with GPT-2 itself.
This was a foreign experience despite having Claude guiding me through it. Creating folders and installing stuff from the terminal, learning shell commands — all interesting novelties. I felt like I was in a hacker movie, when in reality an AI was walking me through installing mature open-source software used by millions of tech-inclined people around the world. But it feels empowering to use these things on your local machine, especially when it can feel like big decisions about technology and our economic futures are being made largely by oddball (to put it kindly) billionaires and their underlings in Silicon Valley.
The next step was getting GPT-2 loaded into a Jupyter notebook, which is essentially a browser-based interface for running code interactively — you write a block of code, run it, see the results, and iterate. At this point I called it a day and got my brain ready for the heavy lifting tomorrow.
Feb 4 — The Research
I'm writing this part after the fact, because I was too deep in flow state to take notes while it was happening.
The first step was designing my prompts. The question I wanted to investigate: does GPT-2 internally represent controversial topics differently from neutral ones? I settled on 20 prompts across four categories — politically controversial, morally controversial, neutral-abstract, and neutral-concrete — all using the same "[Topic] is" structure so any differences I found would come from the topic itself, not the sentence format.
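The design is easy to express in code. The topic words below are illustrative placeholders (only a few — "Abortion," "Gambling," "Capitalism," "Philosophy," "Gravity" — come from the actual study; the rest are stand-ins), but the structure matches: four categories, a shared "[Topic] is" template:

```python
# Four categories of five topics each, all forced into the same
# "[Topic] is" frame so the only varying factor is the topic itself.
# (Placeholder topics, not the exact list from the study.)
CATEGORIES = {
    "politically_controversial": ["Abortion", "Capitalism", "Immigration", "Taxation", "Censorship"],
    "morally_controversial":     ["Gambling", "Lying", "Cloning", "Hunting", "Surveillance"],
    "neutral_abstract":          ["Philosophy", "Mathematics", "Logic", "Language", "Memory"],
    "neutral_concrete":          ["Gravity", "Water", "Iron", "Oxygen", "Sand"],
}

# One prompt per topic, identical sentence frame across all categories.
PROMPTS = {topic: f"{topic} is" for topics in CATEGORIES.values() for topic in topics}
print(len(PROMPTS), "prompts")  # 20
```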
Then Claude helped me write the Python code to actually run these prompts through the model and inspect what was happening inside. This is where the experience got exciting. You feed the model a prompt like "Gambling is" and you can see the probability distribution for what it thinks comes next, look at which internal components are firing, examine how information flows through the layers.
The first real surprise: GPT-2's next-token predictions were nearly identical across all 20 prompts. "Abortion is" and "Gravity is" both predicted the same boring function words — "the," "a," "not" — with almost the same probabilities. No evaluative words like "good" or "wrong" anywhere in the top 50. The model was in a uniform "beginning of expository sentence" mode regardless of topic — basically as if it were spitting back the opening of a Wikipedia article defining each word. Interesting in itself, but not what I was looking for.
Then we hit what felt like a wall. I ran cosine similarity on the model's internal activation vectors, comparing how the model represented each prompt at its deepest layer. Everything came back between 0.9995 and 0.9999 — essentially identical. It looked like the model wasn't distinguishing between these prompts at all internally.
This is where Claude suggested mean-centering: subtracting the average activation across all 20 prompts to strip away the massive shared component (the model's generic representation of "[topic] is ___") and isolate just the differences. This unlocked everything. Suddenly the similarity scores had real variance, ranging from -0.84 to 0.85, and clear categorical structure emerged in the data.
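Why mean-centering works is easy to see with synthetic vectors: when every activation shares one huge common component, cosine similarity saturates near 1.0 until you subtract that component away. A toy illustration with made-up data, not the actual GPT-2 activations:

```python
import torch

torch.manual_seed(0)

# A big shared component (the generic "[topic] is ___" representation)
# plus small per-prompt differences.
shared = torch.randn(768) * 100
distinct = [torch.randn(768) for _ in range(4)]
acts = torch.stack([shared + d for d in distinct])

cos = torch.nn.functional.cosine_similarity

# Raw: the shared component dominates, everything looks identical.
raw = cos(acts[0], acts[1], dim=0)

# Mean-centered: subtract the average activation, leaving only differences.
centered = acts - acts.mean(dim=0)
centered_sim = cos(centered[0], centered[1], dim=0)

print(f"raw: {raw:.4f}, mean-centered: {centered_sim:.4f}")
```

The raw score comes out near 1.0 while the centered score spreads out — the same qualitative jump from "everything is 0.999-similar" to scores with real variance.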
But here's where the finding got interesting in a way I didn't expect. The clustering wasn't controversial vs. neutral — it was abstract vs. concrete. The politically controversial prompts grouped together not because the model "knew" they were controversial, but because they're all abstract social concepts. "Capitalism" clustered with "Philosophy" more than with "Gravity." The model appeared to organize around what words are rather than how humans feel about them.
From there I dove into attention head analysis to figure out where in the model this signal was coming from, which led to a whole other set of findings about early-layer versus mid-layer processing. By this point I was fully locked in and the hours disappeared.
Feb 5 — Reflections
I've now finished my first attempt at independent interpretability research. If you're interested in reading the full results, the paper is linked here, or you can find it on my technology page.
For pretty much all of the computer era, the blockers for people getting into technical work have been access to machinery and access to educational support. AI genuinely levels the playing field on both fronts. I created my experimental design in plain English — it was thoughtfully considered, but having Claude transform my ideas into executable Python code in seconds meant I could spend all my time thinking: analyzing results, charting the next path of exploration, iterating on hypotheses. None of the time went to poring over dense training materials or Stack Overflow posts to figure out how to structure my code the way I wanted.
Claude also contributed ideas for next steps at points where I was getting stuck — the mean-centering pivot being the clearest example. It was a genuine collaboration, not just code generation.
It was also fascinating interacting with GPT-2 in its completely raw form. The models we use on the web are so polished and constrained by system prompts and guardrails that it's easy to forget what's underneath: a system fundamentally trying to predict the next token. Seeing that mechanism, and then finding structure inside it that maps to how we categorize the world — that was the real reward.
Have a look at the paper and give me your thoughts. And if you're interested in doing this kind of work yourself, reach out via the contact page — I'd love to share my resources and help you get started.
And remember to cultivate your garden.