#bypass
3 posts tagged bypass.
- bypass
G4-MeroMero-31B: Abliteration Drops Refusal Rate 99% to 15%
A new uncensored fine-tune of Gemma 4 31B achieves a 15/100 refusal rate via Arbitrary-Rank Ablation on attention output projections — KL divergence 0.0100, MMLU drop 0.19%. A case study in why model-level safety controls are a soft layer, not a hard boundary.
- guardrails
ChatGPT Safety: How OpenAI's Guardrails Work and Where They Break
A technical breakdown of ChatGPT's safety architecture: hardcoded refusals, RLHF training, Rule-Based Rewards, safe-completions, and the bypass research that stress-tests every layer.
- content-filter
AI Content Filter: Architecture, Bypasses, and Layered Defense
A practitioner's breakdown of AI content filter approaches — classifier-based, LLM-as-judge, and guard models — with honest coverage of bypass techniques and deployment recommendations for security-conscious teams.