Fully automatic censorship removal for language models
Heretic is a tool that removes censorship (aka "safety alignment") from transformer-based language models without expensive post-training. It combines an advanced implementation of directional ablation, also known as "abliteration" (Arditi et al. 2024, Lai 2025 (1, 2)), with a TPE-based parameter optimizer powered by Optuna.
This approach enables Heretic to work completely automatically. Heretic finds high-quality abliteration parameters by co-minimizing the number of refusals and the KL divergence from the original model. This results in a decensored model that retains as much of the original model's intelligence as possible. Using Heretic does not require an understanding of transformer internals. In fact, anyone who knows how to run a command-line program can use Heretic to decensor language models.
0
0 comments
Luca Berton
4
Fully automatic censorship removal for language models
powered by
AI DevOps Ansible Community
skool.com/ai-devops-ansible-community-6317
AI DevOps Mastermind by Luca Berton: AI, DevOps, Kubernetes & Terraform. Access 50+ hours of courses, hands-on labs, and career-boosting mentorship!
Build your own community
Bring people together around your passion and get paid.
Powered by