GuardML

AI Moderation Tools for LLMs: What Works and What Gets Bypassed

A practitioner's comparison of AI moderation tools — AWS Bedrock Guardrails, Azure AI Content Safety, Lakera Guard, NeMo Guardrails, and Llama Guard — with honest numbers on bypass rates, false positives, and latency cost.

By GuardML Editorial · 8 min read

If you are shipping an LLM feature in 2026, you are operating under a standing assumption that the model alone is not your safety boundary. AI moderation tools — the runtime filters, content-safety classifiers, jailbreak detectors, and PII redactors that sit between user input and model response — are now a standard part of the production stack. The question is no longer whether to deploy them; it is which ones to layer, in what order, at what latency cost, and what residual risk you accept when you do.

This post maps the current landscape, names the known bypasses, and gives you a decision framework for picking controls that fit your threat model.

What “AI Moderation Tools” Actually Covers

The term spans at least five distinct control types, often conflated:

Input scanners inspect the user’s prompt before it reaches the model. They look for prompt injection patterns (OWASP LLM01:2025), jailbreak phrasing, PII, and forbidden topic keywords. Lakera Guard, Azure Prompt Shields, and Llama Guard all operate at this layer.

Output validators inspect the model’s response before it reaches the user. They flag hallucinations, check factual grounding against a retrieval corpus, detect policy violations in generated text, and enforce JSON-schema conformance. AWS Bedrock Guardrails’ Automated Reasoning checks and Azure’s Groundedness Detection sit here.

Semantic firewalls classify intent rather than surface patterns. Instead of blocking the phrase “how do I make,” they model the underlying topic and block requests that fall into denied categories regardless of phrasing — relevant to indirect prompt injection where the malicious instruction arrives via a retrieved document, not the user’s direct input.

PII redactors identify and remove sensitive entities (names, SSNs, API keys, health data) from both inputs and outputs before storage or further processing.

RAG-context isolators are a newer category: they attempt to prevent retrieved document content from overriding system instructions, addressing the indirect injection vector that the AI security research community has tracked as one of the highest-risk agent attack paths.

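Most enterprise deployments need at least input scanning and output validation running together. Deploying only one of the two leaves a control gap: input-only scanning never sees what the model actually returns, and output-only validation lets adversarial prompts reach the model unchecked.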
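To make the layering concrete, here is a minimal sketch of an input scanner, a PII redactor, and an output validator wrapped around a model call. Everything in it is an illustrative assumption: the regex patterns stand in for real classifiers, and `moderated_call`, `Verdict`, and the `model` callable are hypothetical names, not part of any product discussed here.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical deny patterns; a production input scanner would be a trained classifier.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (your )?system prompt", re.IGNORECASE),
]
# Illustrative PII pattern covering only US SSN-shaped strings.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


@dataclass
class Verdict:
    allowed: bool
    stage: str  # which control produced the decision
    text: str   # text to return to the caller (or a refusal message)


def scan_input(prompt: str) -> bool:
    """Return True when the prompt matches no known injection pattern."""
    return not any(p.search(prompt) for p in INJECTION_PATTERNS)


def redact_pii(text: str) -> str:
    """Replace SSN-shaped strings with a placeholder."""
    return SSN_PATTERN.sub("[REDACTED_SSN]", text)


def validate_output(response: str) -> bool:
    """Toy output check: block responses that echo a system-prompt marker."""
    return "SYSTEM PROMPT:" not in response


def moderated_call(prompt: str, model: Callable[[str], str]) -> Verdict:
    """Run input scan, redaction, model call, and output validation in order."""
    if not scan_input(prompt):
        return Verdict(False, "input_scanner", "Request blocked.")
    response = model(redact_pii(prompt))
    if not validate_output(response):
        return Verdict(False, "output_validator", "Response withheld.")
    return Verdict(True, "passed", redact_pii(response))
```

The point of the sketch is the ordering, not the checks themselves: each stage can be swapped for a managed or open-source control without changing the shape of the pipeline, and the `stage` field records which control made the decision.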

The Current Tool Landscape

Cloud-native: AWS Bedrock Guardrails and Azure AI Content Safety

AWS Bedrock Guardrails is the most complete managed offering. It covers content moderation (configurable severity thresholds for hate, violence, sexual content), prompt attack detection, denied-topic classification, PII redaction, and hallucination detection via Automated Reasoning checks. Per AWS, the Automated Reasoning checks produce mathematically verifiable explanations with claimed 99% accuracy on supported fact types — a strong claim for narrow domains. The control applies to any model accessible via Bedrock, including third-party models, which matters if you are running a multi-model architecture.
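As a rough illustration of applying a guardrail independently of model invocation, the sketch below wraps the `apply_guardrail` operation of the boto3 `bedrock-runtime` client. The wrapper name `check_with_bedrock_guardrail` is hypothetical, the guardrail ID and version are values you would supply from your own account, and the client is passed in so the function can be exercised without AWS credentials.

```python
from typing import Any


def check_with_bedrock_guardrail(
    client: Any,
    guardrail_id: str,
    guardrail_version: str,
    text: str,
    source: str = "INPUT",
) -> bool:
    """Return True if the guardrail did not intervene on the text.

    `client` is expected to be a boto3 "bedrock-runtime" client.
    `source` is "INPUT" for user prompts or "OUTPUT" for model responses.
    """
    response = client.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version,
        source=source,
        content=[{"text": {"text": text}}],
    )
    # "GUARDRAIL_INTERVENED" means a configured policy blocked or masked content.
    return response["action"] != "GUARDRAIL_INTERVENED"
```

In real use the client would come from `boto3.client("bedrock-runtime")`. Calling the wrapper once with `source="INPUT"` before the model call and once with `source="OUTPUT"` after it covers both sides of the exchange with the same guardrail configuration.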

Azure AI Content Safety adds Prompt Shields, which provide separate classifiers for direct jailbreak attempts and for indirect injection hidden in documents. For RAG architectures, the indirect injection detector is more granular than Bedrock’s current offering. The trade-off: Azure’s controls are more modular but require more integration work to cover the full input-output surface.

Open-source: NeMo Guardrails and Llama Guard

NVIDIA NeMo Guardrails is an open-source toolkit that lets you define rails in a declarative Colang syntax, combining LLM self-checking, NVIDIA safety classifiers, topic rails, and third-party API calls. The programmability is its main advantage — you can encode domain-specific policies that SaaS products cannot anticipate. The operational cost is real: you own the infrastructure, the model updates, and the evaluation harness.

Llama Guard (Meta, 2023) is an LLM-based classifier fine-tuned on a taxonomy of harm categories. The paper reports that it outperforms every other evaluated method on the ToxicChat benchmark and approaches OpenAI’s moderation API performance on OpenAI’s own dataset — without task-specific training examples. It is now widely used as a baseline for input-output safeguard evaluation. The practical caveat: running Llama Guard adds one full inference call per moderated turn, which at scale compounds latency and cost.
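The compounding effect is easy to estimate before committing to an LLM-based classifier. The helper below is a back-of-the-envelope sketch; the function name and every number fed into it are placeholders to be replaced with your own measurements, not figures from the Llama Guard paper or any vendor.

```python
def guard_overhead(
    turns_per_day: int,
    checks_per_turn: int,
    latency_ms_per_check: float,
    cost_per_check: float,
) -> dict:
    """Estimate the extra load from running a classifier on each moderated turn.

    Assumes checks run serially with the main model call, so their
    latencies add up within a turn.
    """
    calls = turns_per_day * checks_per_turn
    return {
        "extra_calls_per_day": calls,
        "added_latency_ms_per_turn": checks_per_turn * latency_ms_per_check,
        "added_cost_per_day": calls * cost_per_check,
    }
```

With purely illustrative inputs of 100,000 turns per day, two checks per turn (input and output), 150 ms per check, and 0.0002 currency units per check, this works out to 200,000 extra inference calls, 300 ms added to each turn, and roughly 40 currency units per day.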

Dedicated API: Lakera Guard

Lakera Guard operates as a single-endpoint API call inserted before the primary model request. It detects prompt injection, jailbreaks, PII, malicious URLs, and off-topic content without requiring model changes. The primary pitch is drop-in integration: no new infrastructure, one API call, audit log included. Per Lakera’s documentation, detection is real-time and adds minimal latency for most workloads. Independent evaluations on the Palit benchmark place it alongside Azure Prompt Shields as a strong performer for jailbreak detection, though both tools carry non-trivial false-positive rates on edge-case legitimate requests that resemble adversarial phrasing.

Bypass Reality

Unit 42’s comparative study across major GenAI platforms documents bypass techniques that succeed across multiple guardrail systems. The pattern across their findings: content-safety classifiers tuned for direct harmful phrasing remain susceptible to indirect framing, roleplay scaffolding, and multilingual restatement. No single tool in their evaluation blocked all tested attack variants.

This is not a vendor failure. It is the nature of semantic classifiers operating on natural language, which is infinitely rephrasable. The operational implication: a single content-safety classifier at the input layer is necessary but not sufficient. Teams tracking active jailbreak disclosures should monitor ai-alert.org for indicators that their classifier version is being targeted by newly public payloads.

The current state-of-the-art combination — semantic firewall on input, topic classifier on output, structured-output mode enforcement for tool calls, PII redaction on both sides — reduces the attack surface without eliminating it. Log everything; the audit trail is how you detect bypass campaigns in production rather than in post-incident review.
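Two pieces of that combination, structured-output enforcement for tool calls and the audit trail, can be sketched with the standard library alone. The tool registry, the `{"tool": ..., "args": ...}` call shape, and the log field names below are all assumptions made for illustration; a production system would more likely use a full JSON Schema validator or the model provider's structured-output mode.

```python
import json
import logging

logger = logging.getLogger("guard_audit")

# Hypothetical tool registry: tool name -> required argument names and types.
TOOL_SCHEMAS = {
    "get_weather": {"city": str},
    "send_email": {"to": str, "subject": str, "body": str},
}


def validate_tool_call(raw: str) -> tuple[bool, str]:
    """Check a model-emitted tool call against the registry; return (ok, reason)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not_json"
    if not isinstance(call, dict):
        return False, "not_object"
    tool = call.get("tool")
    if not isinstance(tool, str) or tool not in TOOL_SCHEMAS:
        return False, "unknown_tool"
    schema = TOOL_SCHEMAS[tool]
    args = call.get("args")
    # Reject missing or extra arguments so injected parameters cannot slip through.
    if not isinstance(args, dict) or set(args) != set(schema):
        return False, "bad_arguments"
    if not all(isinstance(args[name], typ) for name, typ in schema.items()):
        return False, "bad_argument_type"
    return True, "ok"


def audited_tool_call(raw: str, session_id: str) -> bool:
    """Validate a tool call and emit one structured audit record per decision."""
    ok, reason = validate_tool_call(raw)
    logger.info(json.dumps({
        "session": session_id,
        "control": "tool_schema",
        "allowed": ok,
        "reason": reason,
    }))
    return ok
```

Logging allowed decisions as well as blocked ones is deliberate: a spike in `unknown_tool` or `bad_arguments` records from one session is only visible as an anomaly if the normal baseline is in the same log.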

Deployment Decision Framework

Pick a managed cloud control (Bedrock, Azure) if you are already in that cloud’s ecosystem, need PII redaction with SOC 2 / HIPAA-eligible data handling, and want hallucination checks bundled. The integration overhead is low if your model serving is already in the same stack.

Pick an open-source option (NeMo Guardrails, Llama Guard) if you have domain-specific policy requirements that SaaS classifiers cannot be trained to understand, need on-premise data residency, or want to own the evaluation process end-to-end.

Pick a dedicated API (Lakera Guard) if you are operating across multiple model providers, need to retrofit guardrails onto an existing application quickly, or want a vendor whose entire focus is the classification problem rather than a feature within a broader cloud platform.

In all cases: run both input and output scanning, enforce JSON-schema or structured-output mode on any tool-calling interface, and treat your false-positive rate as a first-class SLA metric. An overly aggressive content filter degrades product quality as reliably as an overly permissive one creates security exposure.
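Tracking the false-positive rate as a metric requires a labeled evaluation set and a consistent definition. A minimal sketch, assuming each sample carries a ground-truth label and the guard's decision (the `Sample` type and metric names are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class Sample:
    is_attack: bool  # ground-truth label from your evaluation set
    blocked: bool    # what the guard actually did


def guard_metrics(samples: list[Sample]) -> dict[str, float]:
    """Compute false-positive rate on benign traffic and bypass rate on attacks."""
    benign = [s for s in samples if not s.is_attack]
    attacks = [s for s in samples if s.is_attack]
    false_positives = sum(s.blocked for s in benign)
    missed = sum(not s.blocked for s in attacks)
    return {
        # Share of legitimate requests the guard wrongly blocked.
        "false_positive_rate": false_positives / len(benign) if benign else 0.0,
        # Share of attack samples the guard let through.
        "bypass_rate": missed / len(attacks) if attacks else 0.0,
    }
```

Reporting the two numbers together keeps the trade-off visible: tightening a threshold to lower the bypass rate usually raises the false-positive rate, and the pair is what an SLA review should look at.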

Sources


  1. How Good Are the LLM Guardrails on the Market? — Unit 42, Palo Alto Networks
  2. Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations — arXiv 2312.06674
  3. Amazon Bedrock Guardrails — AWS Documentation
  4. OWASP Top 10 for LLM Applications 2025 — LLM01: Prompt Injection
  5. Lakera Guard Guardrails — Lakera API Documentation
  6. NVIDIA NeMo Guardrails for Developers