Tag

#jailbreak

5 posts tagged jailbreak.

tooling

AI Moderation Tools for LLMs: What Works and What Gets Bypassed

A practitioner's comparison of AI moderation tools: AWS Bedrock Guardrails, Azure AI Content Safety, Lakera Guard, NeMo Guardrails, and Llama Guard…
May 13, 2026
guardrails

ChatGPT Safety: How OpenAI's Guardrails Work and Fail

ChatGPT safety explained: how RLHF, Rule-Based Rewards, safe-completions, and the Moderation API work, plus the jailbreaks that defeat each layer.
May 10, 2026
guardrails

LLM Guardrails: Architecture, Bypasses, and What to Deploy

LLM guardrails are the control layer between a language model and the real world. This guide covers how they work, how they fail under adversarial…
May 10, 2026
defense-in-depth

LLM Safety: What It Actually Means and How to Build It

LLM safety spans alignment training, inference-time guardrails, and external filters — each with known failure modes.
May 10, 2026
content-filter

AI Content Moderation: How LLM Filters Work and Where They Break

A technical breakdown of AI content moderation for LLM applications: how classifier-based guardrails work, the bypass techniques that defeat them, and…
May 2, 2026