All posts

LLM Guardrails Explained: What They Are and How to Implement Them

A practitioner's guide to LLM guardrails — the five rail types, what each one actually catches, where each is bypassed, and how to wire a stack that fails safe instead of failing silent.
June 2, 2026
MCP Tool Poisoning: The Guardrail Layer Most Teams Are Missing

MCP makes every server an injection surface in your LLM app. Tool poisoning, rug-pulls, and the lethal trifecta are live. Here is what to actually defend.
May 29, 2026
G4-MeroMero-31B: Abliteration Drops Refusal Rate 99% to 15%

A new uncensored fine-tune of Gemma 4 31B achieves a 15/100 refusal rate via Arbitrary-Rank Ablation on attention output projections — KL divergence 0.0100, MMLU drop 0.19%. A case study in why model-level safety controls are a soft layer, not a hard boundary.
May 15, 2026
AI Moderation Tools for LLMs: What Works and What Gets Bypassed

A practitioner's comparison of AI moderation tools — AWS Bedrock Guardrails, Azure AI Content Safety, Lakera Guard, NeMo Guardrails, and Llama Guard — with honest numbers on bypass rates, false positives, and latency cost.
May 13, 2026
LLM Alignment Evaluation: Why Benchmarks Don't Predict Safety

Practitioners rely on alignment benchmarks that miss the attack surface that matters: agentic tasks, implicit harm, and low-resource languages. Here's how to build an evaluation suite that reflects your actual threat model.
May 13, 2026
AI Safety Tools: A Guide to Guardrails, Filters, and Defenses

A practitioner's breakdown of the leading AI safety tools — NeMo Guardrails, LLM Guard, Llama Guard, and managed platforms — with benchmark data, known bypasses, and deployment guidance.
May 11, 2026
KV Cache Compression Is Now an Alignment Problem

A new preprint argues that compressing KV cache during RL rollouts silently biases the policy you ship. For teams treating RLHF as a defensive control, the off-policy bug matters more than the throughput win.
May 11, 2026
ChatGPT Safety: How OpenAI's Guardrails Work and Where They Break

A technical breakdown of ChatGPT safety architecture: hardcoded refusals, RLHF training, Rule-Based Rewards, safe-completions, and the bypass research that stress-tests every layer.
May 10, 2026
LLM Alignment: What It Does, Where It Breaks, How to Deploy

LLM alignment trains models to internalize safety constraints — but every technique has documented bypass paths. Here's how RLHF, DPO, and Constitutional AI work, and what practitioners need to layer on top.
May 10, 2026
LLM Guardrails: Comparing Tools and Implementation Patterns

A practical comparison of LLM guardrail implementations — classifiers, rule engines, LLM judges — with empirical bypass rates and deployment patterns that don't collapse under adversarial pressure.
May 10, 2026
LLM Guardrails: Architecture, Bypasses, and What to Deploy

LLM guardrails are the control layer between a language model and the real world. This guide covers how they work, how they fail under adversarial pressure, and the layered deployment stack that holds.
May 10, 2026
LLM Safety: What It Actually Means and How to Build It

LLM safety spans alignment training, inference-time guardrails, and external filters — each with known failure modes. This guide covers the layered defense stack practitioners actually need.
May 10, 2026
LLM Security Tools: A Practical Guide to the Current Stack

A working guide to LLM security tools for 2026 — covering red-teaming frameworks, runtime guardrails, and observability layers, with honest notes on what each category gets wrong.
May 10, 2026
Model Alignment: What It Is, How It Works, and Where It Fails

Model alignment trains AI systems to follow human intent rather than optimize for proxy metrics. Here's what the main techniques actually do, how they're bypassed, and what defenders must layer on top.
May 10, 2026
Content Moderation AI Tools: Benchmarks, Bypasses, and Deployment

A practitioner's comparison of leading content moderation AI tools — OpenAI Moderation, Azure AI Content Safety, Llama Guard 4, NeMo Guardrails, and more — with real benchmark data and adversarial bypass patterns.
May 9, 2026
AI Content Filter: Architecture, Bypasses, and Layered Defense

A practitioner's breakdown of AI content filter approaches — classifier-based, LLM-as-judge, and guard models — with honest coverage of bypass techniques and deployment recommendations for security-conscious teams.
May 8, 2026
Output Classification: A PII and Secrets Detector for LLM Apps

Most output filters catch the obvious cases and miss the long tail. Here's how to build an output classifier that's actually deployable in production.
May 6, 2026
Content Moderation Tools for LLMs: What Works and Where It Breaks

A practitioner's guide to the leading content moderation tools for LLM applications (OpenAI Moderation API, Llama Guard, Perspective API, and others), covering capabilities, documented bypasses, and a layered deployment strategy.
May 4, 2026
OpenAI's Under-18 Principles: An Engineer Reads the Model Spec

OpenAI's December Model Spec adds Root-level Under-18 Principles that bind the model even against jailbreak framing. The defense is real, the bypass surface is well-documented, and the deployment lessons cut across every team shipping age-gated AI.
May 4, 2026
AI Content Moderation: How LLM Filters Work and Where They Break

A technical breakdown of AI content moderation for LLM applications — how classifier-based guardrails work, the bypass techniques that defeat them, and how to layer defenses that hold under real adversarial pressure.
May 2, 2026
OpenAI's Under-18 Principles: What the New Model Spec Does

OpenAI's December 18 Model Spec adds Under-18 Principles, an age-prediction classifier, and real-time moderation across modalities. Here is what those defenses cover, where they have already been bypassed, and what to layer on top if you ship for minors.
May 2, 2026
What this site is for

GuardML covers defensive AI engineering: guardrails, content filters, model defenses, and shipping AI features without shipping liability.
May 1, 2026