Tag
#guardrails
4 posts tagged guardrails.
- tooling
Content Moderation Tools for LLM Applications: What Works and Where They Break
A practitioner's guide to the leading content moderation tools for LLM applications—OpenAI Moderation API, Llama Guard, Perspective API, and others—covering capabilities, documented bypasses, and a layered deployment strategy.
- deep-dive
OpenAI's Under-18 Principles: a guardrail engineer reads the new Model Spec
OpenAI's December Model Spec adds Root-level Under-18 Principles that bind the model even against jailbreak framing. The defense is real, the bypass surface is well-documented, and the deployment lessons cut across every team shipping age-gated AI.
- content-filter
AI Content Moderation: How LLM Filters Work and Where They Break
A technical breakdown of AI content moderation for LLM applications — how classifier-based guardrails work, the bypass techniques that defeat them, and how to layer defenses that hold under real adversarial pressure.
- guardrails
OpenAI's Under-18 Principles: what the new Model Spec teen guardrails actually do
OpenAI's December 18 Model Spec adds Under-18 Principles, an age-prediction classifier, and real-time moderation across modalities. Here is what those defenses cover, where they have already been bypassed, and what to layer on top if you ship for minors.