Tag
#alignment
7 posts tagged alignment.
- bypass
G4-MeroMero-31B: Abliteration Drops Refusal Rate from 99% to 15%
A new uncensored fine-tune of Gemma 4 31B reaches a 15/100 refusal rate via Arbitrary-Rank Ablation on attention output projections, with a KL divergence of 0.0100 and an MMLU drop of 0.19%. A case study in why model-level safety controls are a soft layer, not a hard boundary.
- alignment
LLM Alignment Evaluation: Why Benchmarks Don't Predict Safety
Practitioners rely on alignment benchmarks that miss the attack surface that matters: agentic tasks, implicit harm, and low-resource languages. Here's how to build an evaluation suite that reflects your actual threat model.
- deep-dive
KV Cache Compression Is Now an Alignment Problem
A new preprint argues that compressing the KV cache during RL rollouts silently biases the policy you ship. For teams treating RLHF as a defensive control, the off-policy bug matters more than the throughput win.
- guardrails
ChatGPT Safety: How OpenAI's Guardrails Work and Where They Break
A technical breakdown of ChatGPT's safety architecture: hardcoded refusals, RLHF training, Rule-Based Rewards, safe-completions, and the bypass research that stress-tests every layer.
- alignment
LLM Alignment: What It Does, Where It Breaks, How to Deploy
LLM alignment trains models to internalize safety constraints — but every technique has documented bypass paths. Here's how RLHF, DPO, and Constitutional AI work, and what practitioners need to layer on top.
- defense-in-depth
LLM Safety: What It Actually Means and How to Build It
LLM safety spans alignment training, inference-time guardrails, and external filters — each with known failure modes. This guide covers the layered defense stack practitioners actually need.
- alignment
Model Alignment: What It Is, How It Works, and Where It Fails
Model alignment trains AI systems to follow human intent rather than optimize for proxy metrics. Here's what the main techniques actually do, how they're bypassed, and what defenders must layer on top.