Practical coverage of defensive AI engineering. Guardrails for LLMs, content filters and moderation pipelines, model defenses against adversarial attacks, output safety, and how to ship AI features without also shipping liability.
Most output filters catch the obvious cases and miss the long tail. Here's how to build an output classifier that's actually deployable in production.
A practitioner's guide to the leading content moderation tools for LLM applications—OpenAI Moderation API, Llama Guard, Perspective API, and others—covering capabilities, documented bypasses, and a layered deployment strategy.
OpenAI's December Model Spec adds Root-level Under-18 Principles that bind the model even against jailbreak framing. The defense is real, the bypass surface is well-documented, and the deployment lessons cut across every team shipping age-gated AI.
A technical breakdown of AI content moderation for LLM applications — how classifier-based guardrails work, the bypass techniques that defeat them, and how to layer defenses that hold under real adversarial pressure.
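The layered approach that breakdown describes can be sketched in a few lines: a cheap rule layer catches the obvious cases fast, and a classifier score covers the long tail. This is a minimal illustration, not a production policy; the denylist pattern, the 0.8 threshold, and the `classifier_score` input are all hypothetical stand-ins for whatever rules and model your pipeline actually uses.

```python
import re

# Hypothetical rule layer: a regex denylist for obvious violations.
DENYLIST = [r"\bhow to build a bomb\b"]

def rule_layer(text: str) -> bool:
    """Cheap first pass over the raw text. True = block."""
    return any(re.search(p, text, re.IGNORECASE) for p in DENYLIST)

def classifier_layer(score: float, threshold: float = 0.8) -> bool:
    """Second pass: a score from any toxicity/safety classifier.

    The 0.8 threshold is an assumed value; tune it against your own
    false-positive budget. True = block.
    """
    return score >= threshold

def moderate(text: str, classifier_score: float) -> str:
    # Layered defense: either layer alone can block, so a bypass
    # has to defeat both the rules and the classifier.
    if rule_layer(text) or classifier_layer(classifier_score):
        return "block"
    return "allow"
```

The design choice worth noting: running the rule layer first keeps latency low for the common case, while the classifier handles paraphrases and obfuscations the rules will never enumerate.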
OpenAI's December 18 Model Spec adds Under-18 Principles, an age-prediction classifier, and real-time moderation across modalities. Here is what those defenses cover, where they have already been bypassed, and what to layer on top if you ship for minors.
GuardML covers defensive AI engineering. Guardrails, content filters, model defenses, and shipping AI features without shipping liability.
Defensive AI — guardrails, content filters, model defenses, safe deployment — delivered when there's something worth your inbox.
No spam. Unsubscribe anytime.