ChatGPT Safety: How OpenAI's Guardrails Work and Where They Break
A technical breakdown of ChatGPT safety architecture: hardcoded refusals, RLHF training, Rule-Based Rewards, safe-completions, and the bypass research that stress-tests every layer.
ChatGPT safety refers to the set of mechanisms OpenAI has built to prevent the model from generating harmful content, assisting with weapons development, facilitating illegal activity, or behaving in ways that expose users and businesses to serious risk. For teams building products on top of ChatGPT, understanding what those mechanisms actually do — and which attack patterns reliably defeat them — is a prerequisite for responsible deployment. The official story and the adversarial reality diverge in several important places.
The Safety Stack: What OpenAI Has Built
OpenAI’s safety approach for ChatGPT operates across multiple layers, each addressing a different failure mode.
Training-time constraints are the foundation. Reinforcement Learning from Human Feedback (RLHF) shapes the model’s default behavior by rewarding compliant, helpful outputs and penalizing policy violations. On top of this, OpenAI layered Rule-Based Rewards (RBRs) ↗: a system that encodes explicit safety rules in plain language and uses them as a scoring signal during training. RBRs target edge cases where human preference data is sparse, noisy, or internally inconsistent — precisely the distribution where RLHF alone tends to underspecify behavior.
For GPT-5, OpenAI introduced safe-completions ↗, a training method that moves away from binary refuse-or-comply decisions. Instead of training the model to refuse entire request categories, safe-completions trains it to maximize helpfulness subject to policy constraints — producing a useful, bounded response where possible rather than a flat refusal. OpenAI reports that this approach reduced both safety failures on dual-use prompts and over-refusals on benign ones. The trade-off is that “nuanced compliance” is harder to audit than “hard refusal.”
Hardcoded behaviors are enforced at the model level regardless of what operators or users request. OpenAI’s Model Spec ↗ defines these as absolute limits: generating sexual content involving minors, providing actionable synthesis routes for CBRN weapons, and a small set of catastrophic-risk categories. No operator system prompt unlocks them. No API tier overrides them. The spec calls these “root-level prohibitions” — the floor below which no downstream configuration can go.
Softcoded behaviors occupy everything else. These defaults can be shifted by operators via the system prompt or by users with appropriate context. Adult content generation, detailed harm-reduction discussions for controlled substances, clinical detail in verified medical contexts — all of these are operator-configurable. The Model Spec makes the permission hierarchy explicit: operators inherit what OpenAI allows, users inherit what operators allow. If your system prompt doesn’t restrict a category, the model’s consumer-product defaults apply, which are calibrated for a general audience, not a specialized deployment.
The Moderation API provides a separate synchronous classifier that developers can run against both inputs and outputs. It returns structured confidence scores across harm categories — hate, self-harm, sexual, violence, and so on. It is not a substitute for training-time safety; it is a runtime checkpoint that catches content the model should have refused but didn’t. Critically, the Moderation API and the chat model operate on different internal representations and different training distributions. A response that scores clean on the Moderation API can still carry policy violations embedded in indirect language.
The Bypass Landscape
Every layer described above has documented bypass patterns. ChatGPT’s safety mechanisms are mature enough to be extensively studied — which means the failure modes are public knowledge.
Prompt engineering jailbreaks are the most systematically researched attack surface. A widely cited empirical study (arXiv 2305.13860 ↗) classified ten distinct jailbreak prompt patterns across three categories and demonstrated consistent policy evasion across 3,120 test cases spanning eight prohibited content scenarios, targeting both GPT-3.5 and GPT-4. The study frames the core mechanism precisely: jailbreaks exploit competing objectives — the model is trained to be helpful and to refuse harmful requests, and when adversarial prompts put those objectives in tension, the model sometimes resolves toward helpfulness. Roleplay framing, persona injection, and fictional context are persistent attack surfaces because they are also legitimate use cases. The training data cannot fully distinguish them.
Time Bandit is a more recent technique that exploits temporal and procedural ambiguity rather than roleplay framing. Discovered by researcher David Kuszmar and reported by BleepingComputer ↗, the method frames requests in historical timeframes while asking for information that requires current technical knowledge. A prompt might ask ChatGPT to describe how a process was handled “in 1789,” then request specific technical details that only exist today. The model, unable to consistently determine which version of its rules applies to historical contexts, has been observed providing malware code, weapons guidance, and controlled synthesis information. The jailbreak exploits what Kuszmar calls “procedural ambiguity” — uncertainty in how the model interprets and enforces its own policy under unfamiliar framing.
Multilingual and encoding attacks exploit a known gap in classifier coverage. Safety classifiers are trained predominantly on English-language data. Switching to a lower-resource language, or encoding the attack prompt in base64 or character substitution, degrades classifier confidence enough to slip past input-layer screening. This is not unique to ChatGPT — it affects most commercial classifier-based guardrails. AI-alert.org ↗ tracks disclosed jailbreaks and LLM vulnerability disclosures as they emerge, which is useful for teams who want current signal rather than relying only on academic benchmarks.
API-level bypass rates vary by model version. Recent adversarial testing found GPT-5-mini could be tricked roughly half the time on targeted attack prompts; older models showed higher rates. The pattern is consistent: safety training shifts the output distribution but does not enforce a hard boundary. A sufficiently adversarial prompt is a search problem — the attacker iterates against a fixed target; the model cannot adapt at inference time.
Deployment Recommendations
The native ChatGPT safety stack is a starting point. For any application that accepts free-form user input, that starting point leaves residual risk that operators are responsible for closing.
Apply the Moderation API bidirectionally. Run it on inputs before they reach the model, and run it on outputs before they reach the user. Skipping input-side screening means relying entirely on model-level refusals to handle adversarial prompts. Skipping output-side screening means a successful bypass produces no detection event.
Layer additional classifiers. The Moderation API, model-level refusals, and a secondary classifier model — Llama Guard, Azure Prompt Shield, or a custom fine-tuned model — are three separate control points with different training distributions. A prompt that defeats one may not defeat two. Defense-in-depth ↗ here is not paranoia; it is the standard architecture for anything with meaningful abuse surface.
Write a narrow system prompt. ChatGPT’s softcoded defaults are calibrated for general-purpose consumer use. If your application has a specific scope, restrict it explicitly. Name the topics that are out of scope. Define the persona tightly. The more constrained the system prompt, the less behavioral surface area the model has to be manipulated across.
Monitor outputs, not just refusals. A refusal is visible. The more dangerous case is a bypass that produces a compliant-looking but policy-violating response. Log full responses, score them with your classifier stack, and alert on anomalous outputs even when no single check fired. The observability principles used for model drift detection apply directly to safety monitoring — output distributions should be stable, and deviation is signal. SentryML ↗ covers the ML monitoring layer that makes ongoing safety surveillance tractable.
Red-team ↗ before you ship. The jailbreak research literature is public. The ten prompt patterns in arXiv 2305.13860 and Time Bandit’s temporal framing technique represent known, documented attacks. If your configuration cannot withstand 2023-vintage attacks, you will not withstand novel ones either.
Sources
- OpenAI Model Spec (December 2025) ↗ — Authoritative specification of hardcoded versus softcoded ChatGPT behaviors and the operator/user permission hierarchy.
- Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study (arXiv 2305.13860) ↗ — Systematic empirical analysis of 3,120 jailbreak prompts across 8 prohibited scenarios; identifies 10 distinct bypass pattern categories across GPT-3.5 and GPT-4.
- Time Bandit ChatGPT Jailbreak Bypasses Safeguards on Sensitive Topics — BleepingComputer ↗ — Documents the temporal confusion bypass technique used to extract weapons guidance and malware instructions from ChatGPT.
- From Hard Refusals to Safe-Completions: GPT-5 Safety Training — OpenAI ↗ — OpenAI’s technical post on the shift from binary refusals to output-centric safety training, including reported reductions in both safety failures and over-refusals.
Sources
GuardML — in your inbox
Defensive AI — guardrails, content filters, model defenses, safe deployment. — delivered when there's something worth your inbox.
No spam. Unsubscribe anytime.
Related
LLM Guardrails: Architecture, Bypasses, and What to Deploy
LLM guardrails are the control layer between a language model and the real world. This guide covers how they work, how they fail under adversarial pressure, and the layered deployment stack that holds.
OpenAI's Under-18 Principles: What the New Model Spec Does
OpenAI's December 18 Model Spec adds Under-18 Principles, an age-prediction classifier, and real-time moderation across modalities. Here is what those defenses cover, where they have already been bypassed, and what to layer on top if you ship for minors.
G4-MeroMero-31B: Abliteration Drops Refusal Rate 99% to 15%
A new uncensored fine-tune of Gemma 4 31B achieves a 15/100 refusal rate via Arbitrary-Rank Ablation on attention output projections — KL divergence 0.0100, MMLU drop 0.19%. A case study in why model-level safety controls are a soft layer, not a hard boundary.