Topics

Browse posts by category and tag — every topic we cover, with the latest pieces under each.

Tags

Categories

tooling 6 posts

AI Moderation Tools for LLMs: What Works and What Gets Bypassed

A practitioner's comparison of AI moderation tools: AWS Bedrock Guardrails, Azure AI Content Safety, Lakera Guard, NeMo Guardrails, and Llama Guard …

AI Safety Tools: A Guide to Guardrails, Filters, and Defenses

A practitioner's breakdown of the leading AI safety tools (NeMo Guardrails, LLM Guard, Llama Guard, and managed platforms) with benchmark data, known …

LLM Guardrails: Comparing Tools and Implementation Patterns

A practical comparison of LLM guardrail implementations (classifiers, rule engines, LLM judges) with empirical bypass rates and deployment patterns that …

LLM Security Tools: A Practical Guide to the Current Stack

A working guide to LLM security tools for 2026, covering red-teaming frameworks, runtime guardrails, and observability layers, with honest notes on what …

Content Moderation AI Tools: Benchmarks, Bypasses, and Deployment

A practitioner's comparison of leading content moderation AI tools: OpenAI Moderation, Azure AI Content Safety, Llama Guard 4, NeMo Guardrails, and more …

Content Moderation Tools for LLMs: What Works and Where It Breaks

A practitioner's guide to the leading content moderation tools for LLM applications: OpenAI Moderation API, Llama Guard, Perspective API, and …

deep-dive 4 posts

guardrails 4 posts

alignment 3 posts

content-filter 2 posts

bypass 1 post

G4-MeroMero-31B: Abliteration Drops Refusal Rate 99% to 15%

A new uncensored fine-tune of Gemma 4 31B achieves a 15/100 refusal rate via Arbitrary-Rank Ablation on attention output projections; KL divergence 0.…

defense 1 post

Output Classification: A PII and Secrets Detector for LLM Apps

Most output filters catch the obvious cases and miss the long tail. Here's how to build an output classifier that's actually deployable in production.

defense-in-depth 1 post

LLM Safety: What It Actually Means and How to Build It

LLM safety spans alignment training, inference-time guardrails, and external filters — each with known failure modes.