Transformer Debugger (TDB) is a research tool developed by OpenAI’s Superalignment team to investigate and interpret the behavior of small language models. It combines automated interpretability techniques with sparse autoencoders, enabling researchers to analyze how specific neurons, attention heads, and autoencoder latents contribute to a model’s outputs. TDB lets users intervene directly in a model’s forward pass and observe how the intervention changes its predictions, making it possible to answer questions such as why the model output one token rather than another, or why an attention head attended to a particular token.

TDB automatically identifies the components that most influence a given behavior, generates explanations for them, highlights their activation patterns, and traces the circuits that connect them within the model. The tool includes both a React-based neuron viewer for exploring model components and a backend activation server that runs inference and serves activation data.
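For intuition, here is a minimal sketch of that kind of forward-pass intervention, written against the Hugging Face `transformers` GPT-2 implementation rather than TDB’s own inference library; the layer and head indices are arbitrary illustrations, not components TDB would necessarily flag:

```python
# Sketch: ablate one attention head during GPT-2's forward pass and compare
# next-token predictions. Uses Hugging Face transformers, not TDB itself.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER, HEAD = 5, 3  # illustrative choice of component to ablate
HEAD_DIM = model.config.n_embd // model.config.n_head

def zero_head(module, args):
    # The input to attn.c_proj is the concatenation of all head outputs,
    # shape (batch, seq, n_embd); zero the slice produced by one head.
    hidden = args[0].clone()
    hidden[..., HEAD * HEAD_DIM:(HEAD + 1) * HEAD_DIM] = 0.0
    return (hidden,)

prompt = "The capital of France is"
ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    baseline = model(ids).logits[0, -1]

handle = model.transformer.h[LAYER].attn.c_proj.register_forward_pre_hook(zero_head)
with torch.no_grad():
    ablated = model(ids).logits[0, -1]
handle.remove()

for name, logits in [("baseline", baseline), ("head ablated", ablated)]:
    token = tokenizer.decode(logits.argmax().item())
    print(f"{name:>12}: top next token = {token!r}")
```

Comparing the two logit vectors shows whether the ablated head mattered for this prediction, which is the same question TDB’s interventions are designed to answer interactively.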
Features
- Investigates the behavior of small language models using automated interpretability and sparse autoencoders
- Intervenes in the forward pass to test effects on outputs
- Identifies and explains neuron, attention head, and latent activations
- Provides a React-based neuron viewer for interactive exploration
- Includes an activation server and inference hooks for GPT-2 models
- Offers collated datasets of top-activating examples for deeper analysis (see the sketch after this list)
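To make the last two items concrete, the sketch below records one MLP neuron’s activations over a tiny corpus and ranks tokens by activation, approximating the kind of top-activating-example record the collated datasets contain. It again uses the Hugging Face `transformers` GPT-2 model via an activation hook; the layer, neuron index, and corpus are illustrative assumptions, not TDB’s actual pipeline or data:

```python
# Sketch: collect one MLP neuron's activations across a tiny corpus and
# print the top-activating tokens. Illustrative of what a collated
# activation dataset records; not TDB's actual data pipeline.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER, NEURON = 5, 123  # illustrative component
records = []            # (activation, token, source text)

def record(module, inputs, output):
    # output: post-GELU MLP activations, shape (batch, seq, 4 * n_embd)
    acts = output[0, :, NEURON]
    for pos, act in enumerate(acts.tolist()):
        token = tokenizer.decode(current_ids[0, pos].item())
        records.append((act, token, text))

hook = model.transformer.h[LAYER].mlp.act.register_forward_hook(record)

corpus = [
    "Paris is the capital of France.",
    "The stock market fell sharply today.",
    "def add(a, b): return a + b",
]
for text in corpus:
    current_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        model(current_ids)
hook.remove()

for act, token, text in sorted(records, reverse=True)[:5]:
    print(f"{act:+.3f}  {token!r:>12}  from {text!r}")
```

In TDB itself, records like these are precomputed over large datasets and served by the activation server to the neuron viewer, so researchers can browse what each neuron, head, or latent responds to without rerunning inference.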