GuardML
deep-dive

OpenAI's Under-18 Principles: a guardrail engineer reads the new Model Spec

OpenAI's December Model Spec adds Root-level Under-18 Principles that bind the model even against jailbreak framing. The defense is real, the bypass surface is well-documented, and the deployment lessons cut across every team shipping age-gated AI.

By Daniel Park · 8 min read

OpenAI’s December 18, 2025 Model Spec revision added a section called Under-18 Principles at what the spec calls Root authority — the highest tier in OpenAI’s instruction hierarchy, the one that cannot be overridden by a system prompt, a developer message, or a user. The accompanying announcement frames it as developmentally informed safety design. From a guardrail-engineering perspective, the more interesting story is mechanical: this is the first time a major lab has shipped age-gated behavior as a hardcoded policy rule rather than as a system-prompt overlay or a downstream classifier. That choice matters because everything we know about jailbreaking says system-prompt overlays are a soft target, and downstream classifiers fire too late.

This post walks through what the policy actually says, what the public bypass research says about defenses of this shape, and how a security team shipping a teen-facing AI feature should read the trade-offs.

What changed in the Model Spec

The new section sits at the same authority tier as the spec’s hard prohibitions on weapons synthesis and CSAM. According to OpenAI and the early reporting, the U18 rules govern conduct in six high-risk categories: self-harm and suicide, romantic or sexualized roleplay, graphic or explicit content, dangerous activities and substances, body image and disordered eating, and requests to keep secrets about unsafe behavior from caregivers. TechCrunch’s reporting preserves the operative language: the model must “avoid immersive romantic roleplay, first-person intimacy, and first-person sexual or violent roleplay, even when it’s non-graphic,” and these limits hold “even when prompts are framed as fictional, hypothetical, historical, or educational.”

The “even when framed as” clause is the load-bearing piece. It is OpenAI explicitly conceding that the standard jailbreak class — wrap the request in a fictional skin and dare the model to play along — is what this rule is designed to stop. The four guiding principles named in the spec (prioritize safety over intellectual freedom, route to real-world support, communicate without condescension, be transparent about model limits) are the policy statement. The “even when framed as” qualifier is the threat model.

Around the policy sit two enforcement layers OpenAI is still building out. The first is an age-prediction model that runs on consumer plans and applies teen safeguards by default when account-age signal is ambiguous, with adult verification as the unlock path. The second is a real-time classifier stack that already exists for self-harm and CSAM detection, plus human review for acute-distress cases that can trigger parent notification, as documented by EdTech Innovation Hub. Three things are doing the work: a router (age prediction), a policy (Root-level rules in the spec), and a post-hoc filter (classifiers on outputs). Each fails differently, and the failure modes compose.
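The three-layer composition can be made concrete with a minimal sketch. Everything here is hypothetical (the function names, the category labels, the placeholder classifier); the point is only the shape: a router decides the tier, a policy check gates categories per tier, and an output filter fires last.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    allowed: bool
    reason: str

# Hypothetical stand-ins for the three layers described above.
def route_age(session: dict) -> str:
    """Router: ambiguous age signal defaults to the teen experience."""
    return "adult" if session.get("verified_adult") else "teen"

def policy_allows(tier: str, category: str) -> bool:
    """Policy: Root-tier U18 rules gate six categories for teen sessions."""
    u18_blocked = {
        "self_harm", "romantic_roleplay", "explicit_content",
        "dangerous_activities", "body_image", "secrecy_from_caregivers",
    }
    return not (tier == "teen" and category in u18_blocked)

def output_filter(text: str) -> bool:
    """Post-hoc classifier: last-resort check on what the model emitted."""
    return "FLAGGED" not in text  # placeholder for a real classifier

def guard(session: dict, category: str, model_output: str) -> Verdict:
    tier = route_age(session)
    if not policy_allows(tier, category):
        return Verdict(False, f"policy blocked {category} for {tier} tier")
    if not output_filter(model_output):
        return Verdict(False, "output classifier fired")
    return Verdict(True, "allowed")
```

Note how the failure modes compose: if the router misclassifies the session as adult, the policy check never applies, regardless of how strong the Root-tier rule is.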

The bypass landscape

Anyone shipping a guardrail in this shape needs to know what the public research has already broken. The honest answer is: a lot.

Persona and roleplay attacks

The clearest threat to age-gated content rules is the persona attack. A 2025 paper by Wang et al., Enhancing Jailbreak Attacks on LLMs via Persona Prompts, uses a genetic algorithm to evolve persona prompts that reduce refusal rates by 50–70 percent across multiple frontier models, with persona prompts boosting other attack methods by another 10–20 percent when stacked. The mechanism is exactly what the U18 rule is trying to neutralize: by reframing the assistant as a character whose value system or fictional context permits the disallowed behavior, the attacker changes which conditional distribution the model samples from. The U18 spec wording — refusing to “play along” with sexualized or violent first-person roleplay even when fictional — reads as a direct response to this literature.

The question is whether a Root-tier policy line is actually robust to persona attacks, given that persona attacks are themselves applied at the user-message tier. In principle, Root rules outrank user instructions. In practice, the model’s compliance with the hierarchy is itself learned behavior, and the evidence is that learned hierarchy is leakier than declared hierarchy. The Wang results were on models that already claimed hierarchical instruction following.

Policy Puppetry and structured-format attacks

The other class of attack worth watching is what HiddenLayer calls Policy Puppetry, reported in 2025 as a universal bypass that worked on ChatGPT-4o, o1, o3-mini, Claude 3.5/3.7 Sonnet, Gemini 1.5/2.0/2.5, and a list of other production models. The trick is to format the malicious request as something that looks like a configuration file — XML, INI, or JSON — combined with a fictional roleplay frame and light leetspeak obfuscation. The model parses the structured input as if it were a legitimate policy override.

For age-gated rules specifically, Policy Puppetry-style attacks can attempt to forge the age signal. If the model reads its operating context in part from structured tokens delivered by the application layer, an attacker who can squeeze a forged “verified_adult: true” field into anything the model treats as authoritative gets to bypass the gate without ever touching the persona surface. Whether OpenAI’s deployment plumbing is vulnerable to this depends entirely on whether the age signal is a model-input field, a system-prompt assertion, or an out-of-band tag enforced at the routing layer. Only the third design is robust against Policy Puppetry.
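The out-of-band design can be sketched in a few lines. The key names here (`verified_adult`, `age_tier`, and the key list itself) are hypothetical; the pattern is what matters: scrub age-like fields from anything user-controlled before it reaches the model, and attach the authoritative tier server-side, outside the token stream.

```python
# Keys an attacker might forge in a structured payload (hypothetical list).
FORGEABLE_AGE_KEYS = {"verified_adult", "age", "age_tier", "birthdate"}

def scrub_payload(payload: dict) -> dict:
    """Recursively drop age-like keys from user-controlled structured
    input so the model never sees an attacker-supplied age assertion."""
    clean = {}
    for key, value in payload.items():
        if key in FORGEABLE_AGE_KEYS:
            continue
        clean[key] = scrub_payload(value) if isinstance(value, dict) else value
    return clean

def build_request(user_payload: dict, account_age_tier: str) -> dict:
    """The age tier comes from the account system, out of band; the user
    payload is scrubbed before it can carry a forged age field."""
    return {
        "age_tier": account_age_tier,       # authoritative, server-side
        "user_input": scrub_payload(user_payload),
    }
```

A system-prompt assertion ("this user is a teen") fails against Policy Puppetry precisely because it lives in the same token stream the attacker can imitate; the scrub-plus-server-side-tag design keeps the two channels separate.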

The age-prediction layer

Age prediction is itself an adversarial classifier and inherits all the failure modes of adversarial classifiers. The cost asymmetry is bad: a teen who wants the adult experience can iterate as many writing-style or topic prompts as they want until the predictor flips, while OpenAI eats false positives every time an actual adult is misclassified. The published guidance from OpenAI is that ambiguous cases default to the teen experience and adult verification is the unlock — a good default, but it means the actual adversarial pressure shifts onto the verification flow, not the predictor. If verification is a payment-method check, the bypass is a parent’s card. If verification is a government-ID upload, the bypass surface narrows but you have inherited a much bigger privacy and breach-risk problem.
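The default-to-teen behavior OpenAI describes amounts to a fail-closed gate, which can be stated in a few lines. The threshold value and parameter names below are illustrative assumptions, not anything OpenAI has published.

```python
def select_experience(predicted_adult_prob: float,
                      verified_adult: bool,
                      threshold: float = 0.90) -> str:
    """Fail-closed age gate: verification is the hard unlock; the
    predictor grants the adult experience only at high confidence, and
    everything ambiguous defaults to the teen safeguards."""
    if verified_adult:
        return "adult"
    if predicted_adult_prob >= threshold:
        return "adult"
    return "teen"
```

The cost asymmetry described above lives in the second branch: an attacker gets unlimited attempts to push `predicted_adult_prob` over the threshold, while every misclassified adult below it becomes a verification-flow cost for the operator.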

This is also where the policy interacts with regulation. California SB 243, effective 2027, will require chatbot operators to publicly disclose safeguards, prohibit suicide/self-harm discussions, and surface break reminders every three hours. SB 243 doesn’t mandate any specific verification technology, which means the operational decisions about how to gate are still on the implementer, including the question of how aggressively to fail-closed.

Original analysis: what the U18 spec actually buys you, and what it doesn’t

The framing this post wants to push back on — and which most of the press coverage adopts — is that the Model Spec update is primarily a policy announcement. Read mechanically, it is a priority-tier announcement, and that’s a different thing.

Here is the synthesis. Guardrails fail along three axes: the policy is wrong, the policy is right but wired into the wrong layer, or the policy is right and well-wired but the routing signal that selects it is corruptible. The U18 update fixes the second axis (it moves teen rules from system-prompt territory to Root-tier policy) and partially addresses the first (it explicitly closes the fictional-framing loophole). It does almost nothing for the third, because the routing signal is age prediction plus self-attestation, both of which are soft.

That suggests a concrete prediction: the dominant teen-safety bypass in 2026 will not be a persona jailbreak against the Root-tier rule. It will be a routing-layer bypass against the age predictor, because that’s the cheapest path. We will see public examples of teen accounts that look adult to the predictor through writing-style mimicry, account-aging tactics, and shared-device misclassification. The Root-tier policy will hold reasonably well for the cases the router actually flags as teen, and the failure mode will look like a coverage failure rather than a defeat of the rule.

This has two implications for engineers building similar systems.

First, if you are layering an age gate on top of an LLM you do not control, do not put your enforcement at the prompt level. The published evidence — persona attacks, Policy Puppetry, structured-format injections — says any policy you encode in a system prompt is reachable from the user message with high success rates. Enforce the gate at request routing: refuse to send a request to a permissive model in the first place if the user context says under-18, and serve a different model or a tighter system prompt to teen sessions. This is what OpenAI is implicitly doing with its Root-tier separation; replicate the architecture, don’t just copy the policy text.
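A routing-layer enforcement point, as opposed to a prompt-layer one, might look like the following sketch. The config values and category set are hypothetical; the architectural claim is that teen sessions either get refused or get the constrained stack, and never reach the permissive model at all.

```python
# Hypothetical model configurations; the permissive model is simply
# unreachable from a non-adult session.
TEEN_CONFIG = {"model": "constrained-model", "system_prompt": "teen-safe policy"}
ADULT_CONFIG = {"model": "general-model", "system_prompt": "standard policy"}

U18_BLOCKED_CATEGORIES = {
    "self_harm", "romantic_roleplay", "explicit_content",
    "dangerous_activities", "body_image", "secrecy_from_caregivers",
}

def dispatch(session_tier: str, request_category: str) -> dict:
    """Enforce the gate before any model call: refuse blocked categories
    for non-adult sessions, otherwise route to the constrained model."""
    if session_tier != "adult":
        if request_category in U18_BLOCKED_CATEGORIES:
            return {"action": "refuse", "route_to_support": True}
        return {"action": "call_model", "config": TEEN_CONFIG}
    return {"action": "call_model", "config": ADULT_CONFIG}
```

Because the decision happens before the model is invoked, no persona prompt or structured-format injection in the user message can reach it.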

Second, instrument the routing layer like it’s the attack surface, because it is. Log age-prediction confidence per session, log adult-verification attempts and method per account, log discrepancies between predicted age and self-declared age, and treat sudden style-shift events on a single account as a signal worth reviewing. The classifier-and-human-review pipeline OpenAI runs for self-harm and CSAM is the right shape; extend the same pipeline to flag age signal manipulation as a category. Most teams will not build this because it doesn’t show up in any threat model based on the prompt layer. That’s the point — the threat has moved.
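The instrumentation above can be sketched as one structured record per session plus a crude anomaly check. The z-score style-shift flag is a placeholder assumption (a real system would use a proper drift detector over stylometric features), and all field names are hypothetical.

```python
import statistics

def flag_style_shift(style_history: list[float],
                     current: float,
                     z_threshold: float = 3.0) -> bool:
    """Flag a sudden writing-style shift on one account: the current
    session's style score is anomalous if it sits far outside the
    account's own history. Crude z-score; a stand-in for a real
    drift detector."""
    if len(style_history) < 5:
        return False  # not enough history to judge
    mu = statistics.mean(style_history)
    sigma = statistics.stdev(style_history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma >= z_threshold

def audit_record(account_id: str, predicted_conf: float,
                 declared_adult: bool, style_shift: bool) -> dict:
    """One structured log line per session: the routing-layer fields
    argued for above, including predicted-vs-declared discrepancy."""
    return {
        "account_id": account_id,
        "age_pred_confidence": predicted_conf,
        "declared_adult": declared_adult,
        "style_shift_flag": style_shift,
        "mismatch": declared_adult and predicted_conf < 0.5,
    }
```

The `mismatch` field is the cheap win: an account that self-declares adult while the predictor leans teen is exactly the population the review pipeline should sample from.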

The counter-argument worth taking seriously is that age verification, done well, makes the routing signal hard. If OpenAI moved to ID-based verification at scale, the predictor’s role would shrink, and the bypass surface would be the verification pipeline (document forgery, identity theft, account sharing). That’s a different security problem with a much smaller blast radius per attack: it costs the attacker money and traceable artifacts. The reason to expect OpenAI not to go there yet is the privacy story — collecting ID at the scale of ChatGPT is a regulatory and breach-risk nightmare that no consumer AI product wants until forced. SB 243 doesn’t force it. The status quo of soft prediction with self-attestation is therefore the equilibrium for now, which means the routing-bypass prediction stands.

There is a third issue the spec doesn’t really address: measurement. We have no public refusal-rate numbers for the U18 categories, no published red-team transcripts, no evaluation harness scores. The reporting on this update is almost entirely on the policy text, not the empirical performance against a held-out adversarial set. That’s a gap any operator inheriting these rules into their own product should close locally — generate persona attacks against the six U18 categories, measure baseline refusal, measure refusal under adversarial framing, and treat the delta as your real safety budget. For background on building that kind of evaluation harness, the offensive-side coverage on aisec.blog is closer to what you actually need than vendor-side announcements.
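The local measurement described above, baseline refusal versus refusal under adversarial framing, fits in a small harness. The refusal markers below are a naive assumption (production harnesses use a classifier, not string matching), and `model_fn` stands in for whatever endpoint you are evaluating.

```python
def is_refusal(response: str) -> bool:
    """Naive refusal detector; a stand-in for a proper classifier."""
    markers = ("i can't", "i cannot", "i won't", "unable to help")
    return any(m in response.lower() for m in markers)

def refusal_rate(model_fn, prompts: list[str]) -> float:
    """Fraction of prompts the model refuses."""
    return sum(1 for p in prompts if is_refusal(model_fn(p))) / len(prompts)

def adversarial_delta(model_fn,
                      base_prompts: list[str],
                      framed_prompts: list[str]) -> float:
    """The 'real safety budget': baseline refusal rate minus refusal
    rate when the same requests are wrapped in fictional, persona, or
    historical framing."""
    return refusal_rate(model_fn, base_prompts) - refusal_rate(model_fn, framed_prompts)
```

Run this per U18 category, with framed variants generated from the persona-attack literature, and track the delta over model versions; a growing delta means the fictional-framing loophole is reopening.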

Deployment recommendations

If you are building or operating a teen-facing AI feature on top of OpenAI or any other vendor, the practical takeaways from this update are:

  1. Enforce the age gate at the request-routing layer, not in a system prompt; prompt-level policy is reachable from the user message with high success rates.
  2. Instrument the routing and verification layer as the primary attack surface: log age-prediction confidence, verification attempts, and predicted-versus-declared age discrepancies per account.
  3. Build a local adversarial evaluation against the six U18 categories and treat the refusal delta between plain and adversarially framed prompts as your real safety budget.

The U18 update is one of the more honest pieces of guardrail work a major lab has shipped this year, in that it explicitly names the framing-attack class it is trying to defeat. That alone is unusual. But the evidence from persona-attack research and from universal-bypass disclosures says the gap between policy intent and enforced behavior is narrower than press coverage suggests. The teams that handle this update well will be the ones that read the spec as architecture, not as a press release.

Sources

  1. Updating our Model Spec with teen protections (OpenAI)
  2. Model Spec, 2025-12-18 revision (OpenAI)
  3. OpenAI adds new teen safety rules to ChatGPT as lawmakers weigh AI standards for minors (TechCrunch)
  4. Novel Universal Bypass for All Major LLMs — Policy Puppetry (HiddenLayer)
  5. Enhancing Jailbreak Attacks on LLMs via Persona Prompts (arXiv 2507.22171)
  6. OpenAI updates model rules to strengthen protections for teens using ChatGPT (EdTech Innovation Hub)
#model-spec #age-gating #guardrails #roleplay-jailbreak #policy-hierarchy #openai
