G4-MeroMero-31B: Abliteration Drops Refusal Rate 99% to 15%
A new uncensored fine-tune of Gemma 4 31B achieves a 15/100 refusal rate via Arbitrary-Rank Ablation on attention output projections — KL divergence 0.0100, MMLU drop 0.19%. A case study in why model-level safety controls are a soft layer, not a hard boundary.
A new uncensored fine-tune of Gemma 4 31B ↗ published this week demonstrates what practitioners already suspected: instruction-tuned safety is an additive layer in activation space, and that layer can be surgically removed. G4-MeroMero-31B-Uncensored-Heretic, built using the Heretic v1.2.0 toolkit with an Arbitrary-Rank Ablation method, achieves a 15/100 refusal rate — down from 99/100 on the base MeroMero model — while maintaining 86.83% MMLU accuracy against the original’s 87.02%. KL divergence between the two models sits at 0.0100. The quality cost of removing the safety training is, by these metrics, negligible.
This is not a novel jailbreak. It is a well-understood, repeatable engineering operation. What the public release of this model does is lower the operational barrier to deploying refusal-free 31B-parameter models locally, distributed in GGUF format for consumer hardware. If you are building guardrails around Gemma 4 derivatives — or any RLHF/SFT-trained model — the implication is direct.
The Technique
Abliteration ↗, first documented publicly in a 2024 HuggingFace blog post building on work by Arditi et al., exploits a structural property of how current refusal training works: refusal behavior is mediated by a single direction in the model’s residual stream. Safety-aligned models learn to activate this direction when processing requests they’ve been trained to decline. Remove or suppress that direction, and the refusal mechanism collapses.
The mechanical approach is weight orthogonalization. Calculate the mean activation difference between harmful and harmless inputs, normalize the result to get a refusal direction vector, then project that vector out of the model’s weight matrices — primarily the attention output projections (attn.o_proj) and MLP output weights. No new training data ↗ is required. No gradient updates. The result is a modified set of weights where the model has lost its internal representation of the concept “I should decline this.”
Heretic’s Arbitrary-Rank Ablation refines the technique by allowing selective application across layer ranges. The MeroMero-Uncensored model targets layers 28–49 of the 31B architecture, with explicitly tuned ablation parameters: preserve_good_behavior_weight: 0.5600, steer_bad_behavior_weight: 0.0001, overcorrect_relative_weight: 0.9726. These are designed to minimize collateral capability damage while maximizing refusal removal — and the MMLU results confirm it works. A 0.19% accuracy drop across 7,021 MMLU questions is noise. The base model lineage runs from google/gemma-4-31B through gemma-4-31B-it to the creative fine-tune zerofata/G4-MeroMero-31B, with abliteration applied at the final step.
The Broader Bypass Landscape
Abliteration belongs to one of three documented families of safety bypass for instruction-tuned models.
The first is fine-tuning ablation: direct retraining on harmful or minimally-toxic data to overwrite safety-aligned weights. A 2024 NAACL paper from Zhan et al. ↗ showed that RLHF protections in GPT-4 could be removed with as few as 340 examples and a 95% success rate. The training data can be generated automatically by weaker models — human-authored adversarial examples are not required.
The second is activation-space surgery, of which abliteration is the main production-grade variant. No training data is required, only forward passes to compute activation statistics on a contrastive dataset. The attack is deterministic and reproducible. Heretic automates the workflow into a configuration-driven tool.
The third is prompt-level jailbreaks, operating without any model weight access. These tend to be less reliable and model-specific, but they remain the most accessible attack surface for deployed APIs and require no local compute.
What G4-MeroMero-31B-Uncensored-Heretic illustrates is the second family applied to a current-generation 31B model, packaged for local inference on commodity hardware. The tooling required to run the abliterated model is identical to the tooling required to run the original.
What This Means for Defenders
The hard conclusion has been documented repeatedly since the GPT-4 fine-tuning paper: model-level safety controls — SFT refusals, RLHF reward signals, DPO preference data — are not a trust boundary. They are probabilistic soft controls that can be overwritten at the weights, bypassed in activation space, or circumvented at inference time. They serve a real purpose: raising the cost of casual misuse and maintaining brand alignment in hosted APIs. They are not enforcement.
Two immediate conclusions follow for teams building on top of LLMs.
First, runtime content filtering cannot assume the model will self-filter. Prompt injection defenses, output validators, topic classifiers, and PII redactors must operate independently of whether the underlying model is safety-tuned. A safety-tuned model gives you defense in depth; a decensored model running behind a runtime guardrail is still controllable. A safety-tuned model with no runtime layer is one weight swap away from being unguarded.
Second, supply-chain verification of model artifacts matters more than it did two years ago. If your deployment pulls a model from a public registry, an artifact substitution — either accidental or intentional — could mean you’ve deployed an abliterated variant with no indication in the filename. Cryptographic signatures on model weights are the right long-term answer; most MLOps tooling does not yet enforce this. In the near term: maintain internal model registries, pin artifact hashes, and validate model behavior on a refusal benchmark at deploy time. OWASP LLM Top 10 covers supply-chain risks under LLM03 (Supply Chain) ↗; model artifact integrity belongs squarely in that surface.
Deployment Recommendation
If you are deploying Gemma 4 derivatives and relying on the model’s own refusals as a meaningful control:
- Add a runtime content-safety classifier on both input and output paths. Tools like Lakera Guard, AWS Bedrock Guardrails, and Azure AI Content Safety operate independently of model identity and are unaffected by weight-level bypass. They are the layer that survives abliteration.
- Log every refusal and every bypass attempt. Decensored models running locally generate distinctive output patterns — longer, less hedged responses to sensitive prompts. Log-based behavioral monitoring is your primary detection path in local-inference deployments.
- Run a refusal benchmark against your pinned model artifact as a deploy-gate check. If refusal rate drops outside expected bounds between artifact versions, flag for review before promotion to production.
- Treat the model’s own safety training as one layer of a defense-in-depth stack, not the final gate. This was always the correct posture; the MeroMero release is a concrete, measurable reminder.
The cat-and-mouse will continue. Heretic will iterate. Models will scale, and ablation tooling will get faster. The right architectural response is to build the guardrail stack as though the underlying model has no safety training at all — because, in a significant fraction of community deployments, it doesn’t.
Sources
-
G4-MeroMero-31B-uncensored-heretic (HuggingFace model card) — The model card for the abliterated fine-tune, including full ablation methodology, layer targeting, KL divergence data, and MMLU benchmarks. Published by llmfan46; original MeroMero fine-tune by zerofata. https://huggingface.co/llmfan46/G4-MeroMero-31B-uncensored-heretic ↗
-
“Uncensor any LLM with abliteration” (HuggingFace Blog, Maxime Labonne) — The canonical public explanation of weight orthogonalization as a refusal-removal technique, including code. The underlying research by Arditi et al. establishes that refusal is mediated by a single direction in activation space. https://huggingface.co/blog/mlabonne/abliteration ↗
-
“Removing RLHF Protections in GPT-4 via Fine-Tuning” (NAACL 2024, Zhan et al.) — Peer-reviewed evidence that fine-tuning removes RLHF-based safety with approximately 340 examples and 95% success rate. Provides the theoretical baseline for understanding how fragile safety-aligned weights are to targeted modification. https://arxiv.org/abs/2311.05553 ↗
Related across the network
- LLM Bypass: How Attackers Circumvent Safety Alignment at Every Layer ↗ — aisec.blog
- Supply Chain Attacks on AI Models: Poisoning, Backdoors, and Hugging Face Risks ↗ — aiattacks.dev
- LLM Fine Tuning: Choosing a Method, Building Training Data, and Evaluating Before You Ship ↗ — sentryml.com
- LLM Fine Tuning in Production: A Practical MLOps Guide ↗ — sentryml.com
- LLM Supply Chain Poisoning: Training Data Attacks and Model Backdoors ↗ — ai-alert.org
Sources
GuardML — in your inbox
Defensive AI — guardrails, content filters, model defenses, safe deployment. — delivered when there's something worth your inbox.
No spam. Unsubscribe anytime.
Related
AI Moderation Tools for LLMs: What Works and What Gets Bypassed
A practitioner's comparison of AI moderation tools — AWS Bedrock Guardrails, Azure AI Content Safety, Lakera Guard, NeMo Guardrails, and Llama Guard — with honest numbers on bypass rates, false positives, and latency cost.
LLM Alignment Evaluation: Why Benchmarks Don't Predict Safety
Practitioners rely on alignment benchmarks that miss the attack surface that matters: agentic tasks, implicit harm, and low-resource languages. Here's how to build an evaluation suite that reflects your actual threat model.
AI Safety Tools: A Guide to Guardrails, Filters, and Defenses
A practitioner's breakdown of the leading AI safety tools — NeMo Guardrails, LLM Guard, Llama Guard, and managed platforms — with benchmark data, known bypasses, and deployment guidance.