Topics
Browse posts by category and tag — every topic we cover, with the latest pieces under each.
Tags
- #guardrails 19
- #defense-in-depth 11
- #content-filter 9
- #alignment 7
- #llm-safety 6
- #prompt-injection 6
- #jailbreak 5
- #tooling 5
- #llm-security 4
- #rlhf 4
- #bypass 3
- #constitutional-ai 3
- #content-moderation 3
- #llm-guardrails 3
- #fine-tuning 2
- #model-spec 2
- #openai 2
- #rlaif 2
- #abliteration 1
- #age-gating 1
- #age-verification 1
- #agent-security 1
- #agentic-ai 1
- #ai-safety 1
- #ai-tools 1
- #benchmarks 1
- #chatgpt 1
- #detection-engineering 1
- #dpo 1
- #evaluation 1
- #filtering 1
- #kv-cache 1
- #lethal-trifecta 1
- #llama-guard 1
- #llm-security-tools 1
- #mcp 1
- #model-alignment 1
- #moderation 1
- #multilingual 1
- #output-filtering 1
- #pii-detection 1
- #policy-hierarchy 1
- #red-teaming 1
- #reward-hacking 1
- #roleplay-jailbreak 1
- #secrets-detection 1
- #teen-safety 1
- #tool-poisoning 1
- #training-infra 1
Categories
tooling 6 posts
- AI Moderation Tools for LLMs: What Works and What Gets Bypassed
  A practitioner's comparison of AI moderation tools — AWS Bedrock Guardrails, Azure AI Content Safety, Lakera Guard, NeMo Guardrails, and Llama Guard —
- AI Safety Tools: A Guide to Guardrails, Filters, and Defenses
  A practitioner's breakdown of the leading AI safety tools — NeMo Guardrails, LLM Guard, Llama Guard, and managed platforms — with benchmark data, known
- LLM Guardrails: Comparing Tools and Implementation Patterns
  A practical comparison of LLM guardrail implementations — classifiers, rule engines, LLM judges — with empirical bypass rates and deployment patterns that
- LLM Security Tools: A Practical Guide to the Current Stack
  A working guide to LLM security tools for 2026 — covering red-teaming frameworks, runtime guardrails, and observability layers, with honest notes on what
- Content Moderation AI Tools: Benchmarks, Bypasses, and Deployment
  A practitioner's comparison of leading content moderation AI tools — OpenAI Moderation, Azure AI Content Safety, Llama Guard 4, NeMo Guardrails, and more
- Content Moderation Tools for LLMs: What Works and Where It Breaks
  A practitioner's guide to the leading content moderation tools for LLM applications—OpenAI Moderation API, Llama Guard, Perspective API, and
deep-dive 4 posts
- Constitutional AI Explained: How Principle-Based Training Builds Safer Models
  Constitutional AI replaces human harm labels with a written set of principles and AI self-critique. Here is how the method works, where it sits in your
- MCP Tool Poisoning: The Guardrail Layer Most Teams Are Missing
  MCP makes every server an injection surface in your LLM app. Tool poisoning, rug-pulls, and the lethal trifecta are live. Here is what to actually defend.
- KV Cache Compression Is Now an Alignment Problem
  A new preprint argues that compressing KV cache during RL rollouts silently biases the policy you ship. For teams treating RLHF as a defensive control
- OpenAI's Under-18 Principles: An Engineer Reads the Model Spec
  OpenAI's December Model Spec adds Root-level Under-18 Principles that bind the model even against jailbreak framing.
guardrails 4 posts
- LLM Guardrails Explained: What They Are and How to Implement Them
  A practitioner's guide to LLM guardrails — the five rail types, what each one actually catches, where each is bypassed, and how to wire a stack that fails
- ChatGPT Safety: How OpenAI's Guardrails Work and Fail
  ChatGPT safety explained: how RLHF, Rule-Based Rewards, safe-completions, and the Moderation API work, plus the jailbreaks that defeat each layer.
- LLM Guardrails: Architecture, Bypasses, and What to Deploy
  LLM guardrails are the control layer between a language model and the real world. This guide covers how they work, how they fail under adversarial
- OpenAI's Under-18 Principles: What the New Model Spec Does
  OpenAI's December 18 Model Spec adds Under-18 Principles, an age-prediction classifier, and real-time moderation across modalities.
alignment 3 posts
- LLM Alignment Evaluation: Why Benchmarks Don't Predict Safety
  Practitioners rely on alignment benchmarks that miss the attack surface that matters: agentic tasks, implicit harm, and low-resource languages.
- LLM Alignment: What It Does, Where It Breaks, How to Deploy
  LLM alignment trains models to internalize safety constraints — but every technique has documented bypass paths.
- Model Alignment: What It Is, How It Works, and Where It Fails
  Model alignment trains AI systems to follow human intent rather than optimize for proxy metrics. Here's what the main techniques actually do, how they're
content-filter 2 posts
- AI Content Filter: Architecture, Bypasses, and Layered Defense
  A practitioner's breakdown of AI content filter approaches — classifier-based, LLM-as-judge, and guard models — with honest coverage of bypass techniques
- AI Content Moderation: How LLM Filters Work and Where They Break
  A technical breakdown of AI content moderation for LLM applications — how classifier-based guardrails work, the bypass techniques that defeat them, and