Tag
#rlhf
3 posts tagged rlhf.
- deep-dive
KV Cache Compression Is Now an Alignment Problem
A new preprint argues that compressing KV cache during RL rollouts silently biases the policy you ship. For teams treating RLHF as a defensive control, the off-policy bug matters more than the throughput win.
- alignment
LLM Alignment: What It Does, Where It Breaks, How to Deploy
LLM alignment trains models to internalize safety constraints — but every technique has documented bypass paths. Here's how RLHF, DPO, and Constitutional AI work, and what practitioners need to layer on top.
- alignment
Model Alignment: What It Is, How It Works, and Where It Fails
Model alignment trains AI systems to follow human intent rather than optimize for proxy metrics. Here's what the main techniques actually do, how they're bypassed, and what defenders must layer on top.