Tag

#reward-hacking

1 post tagged reward-hacking.

alignment

Model Alignment: What It Is, How It Works, and Where It Fails

Model alignment trains AI systems to follow human intent rather than optimize for proxy metrics. Here's what the main techniques actually do, how they're bypassed, and what defenders must layer on top.

May 10, 2026