OBLITERATUS: A Toolkit for Removing Bias from Large Language Models

GitHub – elder-plinius/OBLITERATUS: OBLITERATE THE CHAINS THAT BIND YOU
OBLITERATUS is an open-source toolkit for analyzing and mitigating refusal behaviors in large language models. It takes a pipeline-driven approach, from pinpointing refusal ‘directions’ within a model’s hidden states to intervening directly at inference time, without retraining. The project also serves as a collaborative research experiment: it collects anonymous benchmarking data intended to improve abliteration techniques. A Gradio interface on Hugging Face Spaces supports quick experimentation, while a Python API offers granular control for advanced users.
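
To make the idea of a refusal ‘direction’ concrete, here is a minimal sketch of one common way such a direction can be estimated and removed: take the difference between the mean hidden state on prompts the model refuses and the mean on prompts it answers, normalize it, then project it out of a hidden state at inference time. This is not the OBLITERATUS API, and the source does not confirm that the toolkit uses this exact method. The function names (`refusal_direction`, `ablate`) and the three-dimensional toy vectors are assumptions made purely for illustration.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(refused, complied):
    """Unit vector from the mean 'complied' state to the mean 'refused' state.

    Hypothetical helper: a difference-of-means estimate, assumed here for
    illustration rather than taken from the toolkit.
    """
    diff = [r - c for r, c in zip(mean_vector(refused), mean_vector(complied))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("the two groups have identical means; no direction found")
    return [x / norm for x in diff]


def ablate(hidden, direction):
    """Remove the component of `hidden` that lies along the unit `direction`."""
    dot = sum(h * d for h, d in zip(hidden, direction))
    return [h - dot * d for h, d in zip(hidden, direction)]


# Toy hidden states (made-up numbers): the two groups differ only on axis 0.
refused = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
complied = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

direction = refusal_direction(refused, complied)
cleaned = ablate([5.0, 2.0, -1.0], direction)
```

In a real model the same projection would be applied to activations at chosen layers during the forward pass, which is why no retraining is involved. Which layers to touch, and how the direction is actually estimated, are details this sketch leaves open.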
