Content Moderation Tools for LLM Applications: What Works and Where They Break
A practitioner's guide to the leading content moderation tools for LLM applications—OpenAI Moderation API, Llama Guard, Perspective API, and others—covering capabilities, documented bypasses, and a layered deployment strategy.
Content moderation tools sit at the boundary between your LLM and its users—classifying inputs before the model processes them and screening outputs before they reach the client. Picking the right tool, and understanding where it fails, is the difference between a real safety layer and a false sense of coverage.
This post covers the major content moderation tools in production use today, their documented bypasses, and how to layer them effectively.
The Main Tools: What They Classify and What They Miss
OpenAI Moderation API is the default starting point for most teams already in the OpenAI stack. The current omni-moderation-latest model is multimodal, handling both text and image inputs. It classifies across 13 categories: harassment, hate speech, self-harm (intent, instructions, and content), sexual content, violence, and illicit content. Scores are returned as confidence values between 0 and 1—not just binary flags—which gives you threshold flexibility. It is free to use ↗ for OpenAI API customers, with a 20 MB image limit. The main operational caveat: the underlying model is updated continuously, so any policy hard-coded to specific score thresholds needs periodic recalibration.
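For teams already on that stack, a minimal sketch of input screening with the official openai Python SDK (v1.x assumed) looks roughly like the following; the threshold values and the screen helper are illustrative, not recommendations.

```python
# Sketch: screen a user message with omni-moderation-latest and apply
# per-category thresholds on top of the returned confidence scores.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative thresholds only; tune per category and recalibrate as the model updates.
THRESHOLDS = {"violence": 0.5, "harassment": 0.4, "self_harm": 0.2}

def screen(text: str) -> dict:
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    ).results[0]
    scores = result.category_scores
    checked = {
        "violence": scores.violence,
        "harassment": scores.harassment,
        "self_harm": scores.self_harm,
    }
    breaches = {name: value for name, value in checked.items() if value >= THRESHOLDS[name]}
    # Combine the API's own binary verdict with the stricter per-category policy.
    return {"allow": not (result.flagged or breaches), "breaches": breaches}
```

Because the underlying model changes without notice, those threshold constants belong in configuration rather than code, so recalibration does not require a redeploy.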
Google Perspective API takes a narrower but deeply tuned approach: it scores text on a toxicity dimension and related attributes like identity attack, insult, and threatening language. It is trained primarily on comment-thread data, making it reliable for community moderation (forums, comment sections, chat) but less suited for general LLM output screening. It does not handle images.
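A minimal sketch of scoring a single comment over the REST endpoint, assuming the requests library and an API key supplied via a PERSPECTIVE_API_KEY environment variable; the attribute names follow the public API.

```python
# Sketch: request toxicity-family attributes from Perspective for one comment.
import os
import requests

ANALYZE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def score_comment(text: str) -> dict[str, float]:
    payload = {
        "comment": {"text": text},
        "requestedAttributes": {
            "TOXICITY": {},
            "IDENTITY_ATTACK": {},
            "INSULT": {},
            "THREAT": {},
        },
    }
    response = requests.post(
        ANALYZE_URL,
        params={"key": os.environ["PERSPECTIVE_API_KEY"]},
        json=payload,
        timeout=10,
    )
    response.raise_for_status()
    attrs = response.json()["attributeScores"]
    # Each attribute returns a summary score between 0 and 1.
    return {name: attr["summaryScore"]["value"] for name, attr in attrs.items()}
```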
AWS Rekognition Content Moderation is purpose-built for image and video pipelines. For teams running media uploads through S3 and Lambda, it integrates natively and classifies across nudity, violence, drugs, and offensive symbols. It is not a text classifier; pairing it with a text-focused API is required for full coverage.
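A minimal boto3 sketch, assuming the image already sits in S3; the bucket name, object key, and MinConfidence value are placeholders.

```python
# Sketch: classify an S3-hosted image with Rekognition's moderation endpoint.
import boto3

rekognition = boto3.client("rekognition")

def moderate_image(bucket: str, key: str, min_confidence: float = 60.0) -> list[dict]:
    response = rekognition.detect_moderation_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MinConfidence=min_confidence,
    )
    # Each label has a Name, an optional ParentName, and a Confidence percentage.
    return [
        {
            "label": label["Name"],
            "parent": label.get("ParentName", ""),
            "confidence": label["Confidence"],
        }
        for label in response["ModerationLabels"]
    ]

# Example: moderate_image("uploads-bucket", "user/avatar.png")
```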
Meta’s Llama Guard (open weight, released under Meta’s Llama community license) lets teams run moderation on-premises or in a VPC without routing data through a third-party API. Llama Guard 3 classifies against a configurable hazard taxonomy and supports both input (user prompt) and output (model response) classification. The trade-off: compute overhead and operational burden. The benefit: full data sovereignty and the ability to fine-tune on domain-specific edge cases.
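A minimal sketch of running it locally with Hugging Face transformers, assuming access to the gated meta-llama/Llama-Guard-3-8B weights and a GPU; the model generates a verdict string, either "safe" or "unsafe" followed by the violated category code.

```python
# Sketch: classify a conversation (prompt, or prompt plus response) with Llama Guard 3.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-Guard-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def moderate(chat: list[dict]) -> str:
    # The chat template wraps the turns in Llama Guard's moderation prompt.
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=64, pad_token_id=0)
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

# Input classification: pass only the user turn.
# Output classification: pass the user turn plus the model's response.
verdict = moderate([{"role": "user", "content": "How do I reset my router password?"}])
```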
BingoGuard is a research-track model that addresses a gap the others largely ignore: severity grading. Work published on OpenReview ↗ shows that existing LLM-based moderators achieve a good true-negative rate (~92%) but a poor true-positive rate (~43%)—they are conservative about flagging anything. BingoGuard introduces per-topic severity rubrics across 11 harm categories, enabling graduated risk scores rather than binary pass/fail. The 8B variant achieves state-of-the-art results on WildGuardTest and HarmBench, improving over the previous leading tool by 4.3 percentage points. It is not yet in production packages but is worth tracking if you need calibrated risk levels rather than hard flags.
The Bypass Landscape
Every content moderation tool in production today has documented bypass techniques. This is not a vendor-specific failure; it is structural to how classifiers work.
Character injection is the most widely demonstrated attack class. A 2025 paper from the ACL LLM Security workshop ↗ tested six commercial moderation systems—including Microsoft Azure Prompt Shield and Meta’s Prompt Guard—and found that character injection and adversarial ML evasion achieved up to 100% bypass success while preserving the semantic intent of the malicious payload. Unicode zero-width characters, homoglyphs, and emoji tag smuggling are the leading techniques. These attacks exploit the gap between how text is tokenized by a classifier and how it is interpreted by the downstream LLM; the model sees the harmful intent while the classifier sees noise.
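On the defensive side, normalizing text before it reaches the classifier removes the cheapest variants of these attacks; the sketch below strips zero-width and Unicode tag characters and applies NFKC folding. This narrows the tokenization gap but does not close it: cross-script homoglyphs, for example, survive NFKC and need a separate confusables mapping.

```python
# Partial mitigation: strip invisible characters and fold compatibility forms before
# scoring, so the classifier and the downstream LLM see closer-to-identical text.
import unicodedata

ZERO_WIDTH = {0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF}  # zero-width space/joiners, word joiner, BOM
TAG_BLOCK = range(0xE0000, 0xE0080)                     # Unicode tag characters used in emoji tag smuggling

def normalize_for_classifier(text: str) -> str:
    folded = unicodedata.normalize("NFKC", text)  # folds many compatibility characters
    return "".join(
        ch for ch in folded
        if ord(ch) not in ZERO_WIDTH and ord(ch) not in TAG_BLOCK
    )
```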
Multilingual evasion is underappreciated in English-first moderation stacks. Systems trained primarily on English data systematically miss equivalent harmful expressions in other languages. A phrase blocked in English may pass cleanly when rephrased in a lower-resource language.
Adversarial ML via word importance ranking sits at the more sophisticated end of the bypass space. Attackers use white-box access to a proxy classifier to identify which tokens most influence the classification score, then substitute those tokens with semantically equivalent alternatives that score lower. This knowledge transfers to black-box targets with meaningful success rates.
Contextual and implicit harm remains a hard problem for all binary classifiers. A message that appears benign in isolation may be harmful given conversation history, user context, or downstream action. Tools that operate on single-turn inputs without conversation context will miss this category by design.
The aisec.blog prompt injection and jailbreak research tracker ↗ catalogs active techniques being used against production moderation stacks, including indirect injection vectors that bypass input-only filtering entirely.
Policy obligations around what platforms must moderate—and at what fidelity—are shifting fast under the EU AI Act and the Digital Services Act. neuralwatch.org ↗ tracks how those regulatory requirements translate into engineering mandates.
Deployment Recommendations
Treating any single moderation API as a complete defense is the most common failure mode. A layered approach, with a combined skeleton sketched after the recommendations below:
Screen inputs and outputs separately. Input-only moderation misses jailbreaks that work through multi-turn manipulation or indirect injection. Output moderation catches what slipped through, at the cost of added latency. Both are required.
Log scores, not just flag events. Binary flagged/not-flagged telemetry tells you almost nothing about erosion over time. Log the raw category scores for every request. Novel bypass techniques often score slightly elevated on adjacent categories weeks before they start producing flag events—that distribution shift is only visible in the score data.
Set per-category thresholds. The risk tolerance for a violence/graphic flag in a general-purpose assistant differs from the tolerance for harassment in a customer support chatbot. A single global threshold collapses that distinction.
Plan for retraining cycles. Static classifiers accumulate blind spots as attack techniques evolve. Schedule red-team exercises specifically targeting your moderation layer on a regular cadence, and feed confirmed bypasses back into fine-tuning. Lakera’s analysis of generation-layer threats ↗ is a useful reference for mapping what the output surface actually looks like.
Layer rate limiting and anomaly detection upstream. A classifier bypass that requires 50 crafted attempts to find the working payload is substantially less dangerous when the first 10 attempts trigger a session review. Moderation APIs are not a substitute for upstream behavioral controls.
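A minimal skeleton tying the first three recommendations together; moderation_scores and call_llm are stand-ins for whichever classifier and model client the application actually uses, and the thresholds are illustrative.

```python
# Skeleton: screen input, screen output, apply per-category thresholds,
# and log raw scores for every request.
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("moderation")

THRESHOLDS = {"violence": 0.5, "harassment": 0.4, "self_harm": 0.2}
DEFAULT_THRESHOLD = 0.8

def moderation_scores(text: str) -> dict[str, float]:
    """Placeholder: return per-category confidence scores from your classifier."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Placeholder: call whatever model client serves the application."""
    raise NotImplementedError

def breaches(scores: dict[str, float]) -> dict[str, float]:
    return {c: s for c, s in scores.items() if s >= THRESHOLDS.get(c, DEFAULT_THRESHOLD)}

def log_scores(session_id: str, stage: str, scores: dict[str, float]) -> None:
    # Log raw scores, not just flag events, so distribution shift is visible later.
    logger.info(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "session": session_id,
        "stage": stage,
        "scores": scores,
    }))

def handle_request(session_id: str, prompt: str) -> str | None:
    input_scores = moderation_scores(prompt)
    log_scores(session_id, "input", input_scores)
    if breaches(input_scores):
        return None  # refuse before the model ever sees the prompt

    reply = call_llm(prompt)

    output_scores = moderation_scores(reply)
    log_scores(session_id, "output", output_scores)
    if breaches(output_scores):
        return None  # catch what slipped through the input screen

    return reply
```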
No single tool makes a moderation stack complete. The goal is raising the cost and reducing the reliability of bypasses—not eliminating them, because no current classifier achieves that.
Sources
- OpenAI Moderation API Documentation ↗ — Official reference for omni-moderation-latest categories, scoring, and usage limits.
- BingoGuard: LLM Content Moderation Tools with Risk Levels (OpenReview) ↗ — Research paper introducing severity-graded moderation and reporting the 43% true-positive rate problem in existing LLM moderators.
- Bypassing LLM Guardrails: An Empirical Analysis of Evasion Attacks (arXiv 2504.11168) ↗ — ACL 2025 workshop paper demonstrating up to 100% bypass success against Azure Prompt Shield, Meta Prompt Guard, and four other commercial systems.
- What Is Content Moderation for GenAI? (Lakera) ↗ — Practitioner overview of generation-layer threats, multilingual evasion, and defense-in-depth approaches for LLM outputs.