OBLITERATUS is an open-source toolkit for analyzing and modifying the internal behavior of large language models by identifying and removing the mechanisms responsible for refusals or restricted responses. It implements a set of techniques collectively referred to as “abliteration,” which target specific internal representations within a neural network to change how the model responds to certain prompts. Unlike traditional fine-tuning, OBLITERATUS operates directly on model activations, so behavior can be changed without retraining the model. The toolkit provides a full pipeline for probing, analyzing, and modifying model behavior, including visualization tools that help researchers see where and how refusal mechanisms are encoded. It supports multiple analytical methods, such as PCA and SVD, for locating these behavioral directions within model layers.
Features
- Identification and removal of refusal behaviors in language models
- Techniques such as PCA and SVD for analyzing model activations
- Modification of model behavior without retraining
- Visualization tools for understanding internal model representations
- Python API for advanced experimentation and integration
- Optional telemetry for contributing to collaborative research
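
The direction-finding idea behind the PCA and SVD features can be illustrated with a small, self-contained sketch. It does not use the OBLITERATUS API. The function names `top_direction` and `project_out` are hypothetical, and the "activations" are synthetic NumPy arrays with a planted direction rather than values captured from a real model. Under those assumptions, the sketch takes the top singular vector of the differences between two contrasting activation sets and then projects that direction out.

```python
import numpy as np


def top_direction(acts_a, acts_b):
    """Unit vector along which two paired activation sets differ most.

    Uses the top right singular vector of the (uncentered) difference
    matrix. Hypothetical helper, not part of any toolkit's API.
    """
    diffs = acts_a - acts_b  # shape: (n_samples, hidden_dim)
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    direction = vt[0]
    # Singular vectors have arbitrary sign; orient from set b toward set a.
    if direction @ diffs.mean(axis=0) < 0:
        direction = -direction
    return direction / np.linalg.norm(direction)


def project_out(acts, direction):
    """Remove the component of every activation along a unit direction."""
    return acts - np.outer(acts @ direction, direction)


# Synthetic stand-in data: set a is set b shifted along a planted direction.
rng = np.random.default_rng(0)
hidden_dim, n_samples = 64, 200
planted = rng.normal(size=hidden_dim)
planted /= np.linalg.norm(planted)

base = rng.normal(size=(n_samples, hidden_dim))
acts_b = base + 0.1 * rng.normal(size=(n_samples, hidden_dim))
acts_a = base + 3.0 * planted + 0.1 * rng.normal(size=(n_samples, hidden_dim))

direction = top_direction(acts_a, acts_b)
cosine = float(direction @ planted)  # agreement with the planted direction

edited = project_out(acts_a, direction)
residual = float(np.abs(edited @ direction).max())  # leftover component

print(f"cosine with planted direction: {cosine:.3f}")
print(f"max residual along direction:  {residual:.2e}")
```

In this toy setting the recovered direction should align closely with the planted one, and the projected activations should have essentially no component left along it. How a real toolkit collects activations, chooses layers, or applies the edit is outside what this sketch shows.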