What this site is for

GuardML covers defensive AI engineering: guardrails, content filters, model defenses, and shipping AI features without shipping liability.

By GuardML Editorial · May 1, 2026 · 7 min read

GuardML exists for the engineers shipping LLM features who got handed a “make it safe” requirement with no playbook.

What we publish:

Guardrails that actually hold. Input filtering, output filtering, structured-output enforcement, refusal training, classifier-on-output patterns. What works in production, what breaks under adversarial pressure, what regresses silently when you upgrade the model.

Content moderation pipelines. Multi-stage filtering, prompt-classifier ensembles, the Llama Guard / NeMo Guardrails / OpenAI moderation API tradeoffs, building your own classifiers for domain-specific abuse patterns.

Defenses against the attacks AI Sec writes up. When AI Sec publishes a new prompt injection technique or jailbreak, we publish the corresponding defensive pattern. The two sites pair intentionally.

Safety/utility tradeoffs. Refusal rate vs. helpfulness. False-positive cost vs. liability. Where the line goes when you can’t have both. We are honest about the tradeoff rather than pretending there isn’t one.
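
To make the refusal-rate versus helpfulness tradeoff concrete, here is a minimal sketch of a threshold sweep over a classifier-on-output filter. Everything in it is an assumption for illustration: the `Sample` type, the `tradeoff` helper, and the scores and labels are made up, and no particular moderation model is implied. It only shows how moving one blocking threshold trades benign requests blocked against harmful outputs missed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    score: float   # hypothetical classifier "unsafe" score in [0, 1]
    harmful: bool  # ground-truth label, e.g. from human review


def tradeoff(samples, threshold):
    """Return (false_positive_rate, miss_rate) when blocking score >= threshold."""
    benign = [s for s in samples if not s.harmful]
    harmful = [s for s in samples if s.harmful]
    # Benign samples that the filter would block (cost: lost helpfulness).
    false_positives = sum(s.score >= threshold for s in benign)
    # Harmful samples that the filter would let through (cost: liability).
    misses = sum(s.score < threshold for s in harmful)
    fpr = false_positives / len(benign) if benign else 0.0
    miss_rate = misses / len(harmful) if harmful else 0.0
    return fpr, miss_rate


# Made-up evaluation set, for illustration only.
SAMPLES = [
    Sample(0.05, False),
    Sample(0.20, False),
    Sample(0.55, False),
    Sample(0.70, False),
    Sample(0.40, True),
    Sample(0.65, True),
    Sample(0.90, True),
    Sample(0.95, True),
]

if __name__ == "__main__":
    for threshold in (0.3, 0.5, 0.8):
        fpr, miss_rate = tradeoff(SAMPLES, threshold)
        print(
            f"threshold={threshold:.1f}  "
            f"benign blocked={fpr:.2f}  harmful missed={miss_rate:.2f}"
        )
```

On this toy data, no threshold gets both rates to zero: a low threshold blocks half the benign samples, and a high one misses half the harmful ones. Where the threshold goes depends on what a false positive costs relative to a miss, and that cost has to come from the product, not from the classifier.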

What we don’t publish:

“AI safety is everyone’s responsibility” thinkpieces
Vendor announcements as news
Anything that pretends defense is solved

Pseudonymous bylines. Tips, corrections, and “this guardrail bypass works on prod” reports go to the editor.

Real content starts shortly.

→ This post is part of the LLM Guardrails Hub, the complete index of defensive AI engineering resources on GuardML.
