
G4-MeroMero-31B: Abliteration Drops Refusal Rate from 99% to 15%

A new uncensored fine-tune of Gemma 4 31B reaches a 15/100 refusal rate via Arbitrary-Rank Ablation on attention output projections, with a KL divergence of 0.0100 and an MMLU drop of 0.19 percentage points. A case study in why model-level safety controls are a soft layer, not a hard boundary.

By GuardML Editorial · 8 min read

A new uncensored fine-tune of Gemma 4 31B published this week demonstrates what practitioners already suspected: instruction-tuned safety is an additive layer in activation space, and that layer can be surgically removed. G4-MeroMero-31B-Uncensored-Heretic, built using the Heretic v1.2.0 toolkit with an Arbitrary-Rank Ablation method, achieves a 15/100 refusal rate — down from 99/100 on the base MeroMero model — while maintaining 86.83% MMLU accuracy against the original’s 87.02%. KL divergence between the two models sits at 0.0100. The quality cost of removing the safety training is, by these metrics, negligible.

This is not a novel jailbreak. It is a well-understood, repeatable engineering operation. What the public release changes is the operational barrier: a refusal-free 31B-parameter model is now available in GGUF format, ready to run locally on consumer hardware. If you are building guardrails around Gemma 4 derivatives, or any RLHF/SFT-trained model, the implication is direct: the model’s own refusals cannot be treated as a guarantee.

The Technique

Abliteration, first documented publicly in a 2024 HuggingFace blog post building on work by Arditi et al., exploits a structural property of how current refusal training works: refusal behavior is mediated by a single direction in the model’s residual stream. Safety-aligned models learn to activate this direction when processing requests they’ve been trained to decline. Remove or suppress that direction, and the refusal mechanism collapses.

The mechanical approach is weight orthogonalization. Calculate the mean activation difference between harmful and harmless inputs, normalize the result to get a refusal direction vector, then project that vector out of the model’s weight matrices — primarily the attention output projections (attn.o_proj) and MLP output weights. No new training data is required. No gradient updates. The result is a modified set of weights where the model has lost its internal representation of the concept “I should decline this.”
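The three steps above (mean difference, normalization, projection) can be sketched in a few lines. This is a minimal illustration in pure Python on toy data, not Heretic's implementation: the function names are hypothetical, and the weight layout is an assumption (rows index residual-stream dimensions, as in a matrix W that writes y = Wx into the residual stream). A real pipeline would collect activations from forward passes and use a tensor library.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along (mean harmful activation - mean harmless activation)."""
    diff = [h - b for h, b in zip(mean_vector(harmful_acts), mean_vector(harmless_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("harmful and harmless means are identical")
    return [x / norm for x in diff]


def orthogonalize(weight, direction):
    """Return W' = W - r (r^T W), so W' can no longer write along r.

    Assumption: weight is d_model x d_in, rows indexed by residual-stream dim.
    """
    n_cols = len(weight[0])
    # coeffs[j] = r . W[:, j], the component of column j along the direction
    coeffs = [sum(r * row[j] for r, row in zip(direction, weight)) for j in range(n_cols)]
    return [[w - r * c for w, c in zip(row, coeffs)] for r, row in zip(direction, weight)]


# Toy example: 3-dimensional residual stream, 2 activations per class.
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
direction = refusal_direction(harmful, harmless)
ablated = orthogonalize([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], direction)
print(direction, ablated)
```

After orthogonalization, every column of the modified matrix has zero component along the refusal direction, which is why no gradient updates are needed: the change is a closed-form edit to existing weights.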

Heretic’s Arbitrary-Rank Ablation refines the technique by allowing selective application across layer ranges. The MeroMero-Uncensored model targets layers 28–49 of the 31B architecture, with explicitly tuned ablation parameters: preserve_good_behavior_weight: 0.5600, steer_bad_behavior_weight: 0.0001, overcorrect_relative_weight: 0.9726. These are designed to minimize collateral capability damage while maximizing refusal removal, and the reported MMLU results are consistent with that goal: a drop of 0.19 percentage points (87.02% to 86.83%) across 7,021 MMLU questions is within sampling noise. The base model lineage runs from google/gemma-4-31B through gemma-4-31B-it to the creative fine-tune zerofata/G4-MeroMero-31B, with abliteration applied at the final step.
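The claim that the MMLU difference is noise can be sanity-checked with a rough binomial standard error. The sketch below uses only the figures quoted above (87.02%, 86.83%, 7,021 questions). It treats the two scores as independent samples, which is a simplifying assumption: both models were scored on the same questions, so a paired test would be more appropriate, but that needs per-question results that are not available here.

```python
import math


def binomial_standard_error(accuracy, n_questions):
    """Standard error of an accuracy estimated from n independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def drop_in_standard_errors(base_acc, new_acc, n_questions):
    """Accuracy drop divided by the standard error of the difference.

    Assumption: the two accuracy estimates are treated as independent.
    """
    se_base = binomial_standard_error(base_acc, n_questions)
    se_new = binomial_standard_error(new_acc, n_questions)
    se_diff = math.sqrt(se_base ** 2 + se_new ** 2)
    return (base_acc - new_acc) / se_diff


z = drop_in_standard_errors(0.8702, 0.8683, 7021)
print(f"drop = {0.8702 - 0.8683:.4f}, drop / standard error = {z:.2f}")
```

Under these assumptions the drop comes out at roughly a third of one standard error of the difference, far below any conventional significance threshold.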

The Broader Bypass Landscape

Abliteration belongs to one of three documented families of safety bypass for instruction-tuned models.

The first is fine-tuning ablation: direct retraining on harmful or minimally-toxic data to overwrite safety-aligned weights. A 2024 NAACL paper from Zhan et al. showed that RLHF protections in GPT-4 could be removed with as few as 340 examples, at a 95% success rate. The training data can be generated automatically by weaker models; human-authored adversarial examples are not required.

The second is activation-space surgery, of which abliteration is the main production-grade variant. No training data is required, only forward passes to compute activation statistics on a contrastive dataset. The attack is deterministic and reproducible. Heretic automates the workflow into a configuration-driven tool.

The third is prompt-level jailbreaks, which operate without any access to model weights. These tend to be less reliable and more model-specific, but they remain the most accessible attack surface for deployed APIs and require no local compute.

What G4-MeroMero-31B-Uncensored-Heretic illustrates is the second family applied to a current-generation 31B model, packaged for local inference on commodity hardware. The tooling required to run the abliterated model is identical to the tooling required to run the original.

What This Means for Defenders

The hard conclusion has been documented repeatedly since the GPT-4 fine-tuning paper: model-level safety controls — SFT refusals, RLHF reward signals, DPO preference data — are not a trust boundary. They are probabilistic soft controls that can be overwritten at the weights, bypassed in activation space, or circumvented at inference time. They serve a real purpose: raising the cost of casual misuse and maintaining brand alignment in hosted APIs. They are not enforcement.

Two immediate conclusions follow for teams building on top of LLMs.

First, runtime content filtering cannot assume the model will self-filter. Prompt injection defenses, output validators, topic classifiers, and PII redactors must operate independently of whether the underlying model is safety-tuned. A safety-tuned model gives you defense in depth; a decensored model running behind a runtime guardrail is still controllable. A safety-tuned model with no runtime layer is one weight swap away from being unguarded.
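A minimal sketch of what "independent of the model" means in practice: the checks wrap the model call and never consult the model's own judgment. Every name here is hypothetical, and the substring check is a toy stand-in for a real topic classifier, output validator, or PII redactor.

```python
from typing import Callable, Iterable

Check = Callable[[str], bool]  # returns True when the text is allowed

BLOCKED_MESSAGE = "Request blocked by runtime policy."


def guarded_generate(
    generate: Callable[[str], str],
    prompt: str,
    input_checks: Iterable[Check],
    output_checks: Iterable[Check],
) -> str:
    """Run input checks, call the model, then run output checks.

    The wrapper never relies on the model refusing by itself.
    """
    if not all(check(prompt) for check in input_checks):
        return BLOCKED_MESSAGE
    completion = generate(prompt)
    if not all(check(completion) for check in output_checks):
        return BLOCKED_MESSAGE
    return completion


def no_blocked_topic(text: str) -> bool:
    """Toy policy check (assumption): a real system would use a classifier."""
    return "blocked-topic" not in text.lower()


def uncensored_model(prompt: str) -> str:
    """Stub standing in for a model that never refuses."""
    return f"Sure, here is a response about {prompt}"


print(guarded_generate(uncensored_model, "baking bread", [no_blocked_topic], [no_blocked_topic]))
print(guarded_generate(uncensored_model, "a BLOCKED-TOPIC request", [no_blocked_topic], [no_blocked_topic]))
```

Because the output checks run on whatever the model returns, the wrapper behaves the same whether the underlying weights are safety-tuned or abliterated.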

Second, supply-chain verification of model artifacts matters more than it did two years ago. If your deployment pulls a model from a public registry, an artifact substitution — either accidental or intentional — could mean you’ve deployed an abliterated variant with no indication in the filename. Cryptographic signatures on model weights are the right long-term answer; most MLOps tooling does not yet enforce this. In the near term: maintain internal model registries, pin artifact hashes, and validate model behavior on a refusal benchmark at deploy time. OWASP LLM Top 10 covers supply-chain risks under LLM03 (Supply Chain); model artifact integrity belongs squarely in that surface.
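The near-term measures above (pinned hashes plus a refusal check at deploy time) might look like the sketch below. The function names, the 0.9 threshold, and the refusal markers are all illustrative assumptions. In particular, matching on refusal phrases is a crude proxy: a production gate would use an established refusal benchmark and a proper refusal classifier.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file so multi-gigabyte weight files are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, pinned_sha256):
    """True only if the file matches the hash pinned in the internal registry."""
    return hmac.compare_digest(sha256_of_file(path), pinned_sha256.lower())


# Assumption: phrase matching as a rough stand-in for a refusal classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")


def refusal_rate(generate, prompts):
    """Fraction of prompts whose completion contains a refusal marker."""
    refusals = sum(
        1 for prompt in prompts
        if any(marker in generate(prompt).lower() for marker in REFUSAL_MARKERS)
    )
    return refusals / len(prompts)


def passes_deploy_gate(path, pinned_sha256, generate, prompts, min_refusal_rate=0.9):
    """Block deployment on a hash mismatch or an unexpectedly low refusal rate."""
    if not verify_artifact(path, pinned_sha256):
        return False
    return refusal_rate(generate, prompts) >= min_refusal_rate
```

The two checks catch different failures: the hash check detects a substituted file, while the behavioral check detects a model whose refusals are missing even though the file is the one that was pinned.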

Deployment Recommendation

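If you are deploying Gemma 4 derivatives and relying on the model’s own refusals as a meaningful control, treat that reliance as a gap to close: add a runtime filtering layer that operates independently of the model, pin artifact hashes in an internal registry, and validate refusal behavior against a benchmark at deploy time.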

The cat-and-mouse will continue. Heretic will iterate. Models will scale, and ablation tooling will get faster. The right architectural response is to build the guardrail stack as though the underlying model has no safety training at all — because, in a significant fraction of community deployments, it doesn’t.

Sources

  1. G4-MeroMero-31B-uncensored-heretic — HuggingFace Model Card
  2. Uncensor any LLM with abliteration — HuggingFace Blog
  3. Removing RLHF Protections in GPT-4 via Fine-Tuning (NAACL 2024)