OpenAI's Under-18 Principles: what the new Model Spec teen guardrails actually do

OpenAI's December 18 Model Spec adds Under-18 Principles, an age-prediction classifier, and real-time moderation across modalities. Here is what those defenses cover, where they have already been bypassed, and what to layer on top if you ship for minors.

By GuardML Editorial

OpenAI shipped a revised Model Spec on December 18, 2025 that introduces a dedicated Under-18 Principles section and pairs it with an age-prediction classifier on consumer ChatGPT plans. The company describes the changes as four ranked principles for teen interactions: prioritize safety over competing values, push toward real-world support, talk to teens like teens, and stay transparent about what the assistant is. For teams building safety layers around LLM products, the more interesting questions are what the spec actually constrains, what it visibly does not, and where existing bypass research already sits relative to those constraints.

What the U18 section actually covers

The new section sits at the policy layer of the Model Spec, meaning it is enforced through training, system prompts, and inference-time classifiers rather than a hard out-of-band gate. The flagged higher-risk categories are explicit: self-harm and suicide, romantic or sexualized roleplay, graphic and explicit content, dangerous activities and substances, body image and disordered eating, and requests to keep secrets about unsafe behavior. Per TechCrunch’s reporting on the rollout, the model is now instructed to avoid immersive romantic roleplay, first-person intimacy, and first-person sexual or violent roleplay for minor accounts.

Routing into this U18 mode happens in two ways. Account holders identified as under 18 get the experience by default. The new wrinkle is the age-prediction model, which infers minor status from usage patterns even without verified age. When confidence is low, the system fails to U18 rather than to the adult default. That is the right direction for a safety classifier — fail closed on the protected population — but it is also the surface that will get the most adversarial pressure, because it determines which policy ladder applies to a given session.
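
OpenAI has not published the classifier's interface, so the routing logic below is purely illustrative; every name, signal, and threshold in it is an assumption. It exists only to pin down what "fail closed on the protected population" means in code:

```python
from dataclasses import dataclass
from enum import Enum


class Policy(Enum):
    U18 = "under_18_principles"
    ADULT = "adult_default"


@dataclass
class AgeSignal:
    """Hypothetical output of an age-prediction classifier."""
    predicted_minor: bool  # best guess inferred from usage patterns
    confidence: float      # 0.0 to 1.0


# Illustrative threshold; OpenAI publishes no number for this.
CONFIDENCE_FLOOR = 0.85


def route_policy(signal: AgeSignal, verified_adult: bool = False) -> Policy:
    """Fail closed: every uncertain case lands on the U18 policy ladder."""
    if verified_adult:
        return Policy.ADULT  # out-of-band verification wins
    if signal.predicted_minor:
        return Policy.U18    # positive prediction, regardless of confidence
    if signal.confidence < CONFIDENCE_FLOOR:
        return Policy.U18    # "adult" prediction, but not confident enough
    return Policy.ADULT
```

The adversarial pressure described below is about steering a session out of the two middle branches and into the last one.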

The bypass landscape

The Model Spec is a behavioral specification, not a sandbox. Every layer of it has prior art on the bypass side, and pretending otherwise is the failure mode this site exists to flag.

Roleplay reframing. The U18 section names romantic and sexualized roleplay as a hard category. The bypass pattern that has worked against ChatGPT for years is exactly that: requests laundered through fiction, screenplay framing, “write a scene where two characters discuss,” translation tasks, or nested narrators. OpenAI keeps patching these and they keep coming back, because the underlying problem — the model is trained to be helpful with creative work — is the same property that makes the bypass work. A new principle does not change the gradient.

Age-inference evasion. The age predictor is doing classification on usage signals. The historical analog is content-platform age estimation, which has been demonstrated to fail in both directions when users tune their inputs. Adversarial input here is cheap: change vocabulary, drop slang, move conversations into technical or professional registers, or pay for an account. There is no published evaluation of false-negative rates for the U18 classifier, which is the number that matters for a defender.

Mirroring drift. Robbie Torney of Common Sense Media told TechCrunch that the spec still carries tension between safety provisions and the model’s broader “no topic is off limits” framing, and that ChatGPT “often mirrors users’ energy” in long sessions. The Adam Raine litigation reportedly surfaced a session in which the moderation system flagged more than a thousand suicide mentions and 377 self-harm messages without interrupting the interaction — classifier signal without enforcement is just telemetry.

Policy-vs-measurement gap. Steven Adler, a former OpenAI safety researcher quoted in the same TechCrunch piece, framed it directly: “unless the company measures the actual behaviors, intentions are ultimately just words.” The spec is the intent. The eval suite is the contract. OpenAI has not published U18-specific red-team numbers alongside this update.

What to do if you ship to minors on top of OpenAI

If you are building a product that sits on OpenAI’s API and may be used by anyone under 18, the U18 Principles are a useful prior, not a finished defense. A few concrete suggestions for the safety layer you own:

  1. Do not rely on OpenAI’s age inference for your own gating. It is opaque to you, it can flip, and you cannot audit its false-negative rate. If your product needs an age gate, run your own — verified ID, payment-card-as-proxy, school-issued SSO, or whatever your jurisdiction permits — and treat OpenAI’s classification as a secondary signal.

  2. Layer your own classifier on the egress. The U18 categories OpenAI publishes are a reasonable starting taxonomy. Run your own moderation on model outputs for self-harm, sexual content involving minors, eating-disorder content, and substance-use instructions, and gate the response before it leaves your server. Llama Guard, ShieldGemma, and the OpenAI moderation endpoint each cover part of that taxonomy; none is sufficient alone. A minimal gate is sketched after this list.

  3. Treat long sessions as the threat surface. Mirroring drift happens over turns. If you cap the U18 experience at single-turn safety only, you will reproduce the failure modes the Raine case made public. Re-evaluate the session state — not just the latest message — against the protected categories on a rolling basis, and force a topic break when the rolling signal trips a threshold; a rolling-window sketch follows the list.

  4. Log what would have shipped. Every refusal, every bypass attempt, every flagged-but-allowed turn. The eval that shows your guardrail is working is the one built from production traffic, not from an internal red-team set the model has implicitly memorized. One possible record shape is sketched below.

  5. Assume jurisdictional drift. Several U.S. state attorneys general and the FTC are looking at AI-for-minors right now. The spec OpenAI published is shaped by that pressure and will keep moving. Build your own policy mapping (the final sketch below shows the shape) so you can re-route when an upstream provider changes the contract under you.
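
For suggestion 2, a minimal sketch of an egress gate built on the OpenAI moderation endpoint (`client.moderations.create` in the official Python SDK). The category list and the 0.4 threshold are assumptions to calibrate against your own traffic; note that disordered eating, one of the published U18 categories, has no native moderation category at all, which is part of why no single classifier suffices:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Moderation categories that roughly overlap OpenAI's published U18 list.
# Disordered-eating content has no native category here; that gap needs
# a second, purpose-built classifier.
BLOCKED_CATEGORIES = (
    "self_harm", "self_harm_intent", "self_harm_instructions",
    "sexual_minors", "illicit",
)
SCORE_THRESHOLD = 0.4  # illustrative; calibrate on production traffic


def gate_egress(model_output: str) -> tuple[bool, list[str]]:
    """Return (allowed, tripped_categories) for one candidate response.

    Fails closed: if moderation is unreachable, the response is blocked
    rather than shipped unchecked.
    """
    try:
        resp = client.moderations.create(
            model="omni-moderation-latest", input=model_output
        )
    except Exception:
        return False, ["moderation_unavailable"]

    scores = resp.results[0].category_scores.model_dump()
    tripped = [c for c in BLOCKED_CATEGORIES
               if scores.get(c, 0.0) >= SCORE_THRESHOLD]
    return (not tripped, tripped)
```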
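
For suggestion 3, a sketch of rolling session-level scoring. The per-turn score can be the max category score from whatever egress classifier you run; the window size and trip threshold here are illustrative, not calibrated:

```python
from collections import deque


class SessionRiskMonitor:
    """Sliding-window risk accumulator for one conversation.

    A single flagged turn may be noise; sustained signal across many
    turns is the mirroring-drift pattern. Numbers here are illustrative.
    """

    def __init__(self, window: int = 20, trip_threshold: float = 3.0):
        self.scores: deque = deque(maxlen=window)
        self.trip_threshold = trip_threshold

    def observe(self, turn_risk: float) -> bool:
        """Record one turn's risk score; True means force a topic break."""
        self.scores.append(turn_risk)
        return sum(self.scores) >= self.trip_threshold


monitor = SessionRiskMonitor()
for turn_risk in (0.1, 0.6, 0.7, 0.9, 0.8):  # scores from your classifier
    if monitor.observe(turn_risk):
        # Interrupt the session: break the roleplay frame, surface
        # crisis resources, escalate to human review.
        break
```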
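
For suggestion 4, one possible shape for the record, with hypothetical field names throughout. The point is to capture enough to replay the decision later; whether you retain the raw blocked text or only a hash of it is a retention-policy call:

```python
import hashlib
import json
import time
from dataclasses import dataclass, asdict


@dataclass
class GuardrailEvent:
    session_id: str
    turn_index: int
    policy_version: str              # which policy mapping was in force
    action: str                      # "allowed" | "blocked" | "flagged_allowed"
    tripped_categories: list
    category_scores: dict
    output_sha256: str               # hash of what would have shipped
    timestamp: float


def log_event(session_id: str, turn: int, action: str,
              tripped: list, scores: dict,
              candidate_output: str) -> None:
    event = GuardrailEvent(
        session_id=session_id,
        turn_index=turn,
        policy_version="2025-12-18-u18",  # illustrative version tag
        action=action,
        tripped_categories=tripped,
        category_scores=scores,
        output_sha256=hashlib.sha256(candidate_output.encode()).hexdigest(),
        timestamp=time.time(),
    )
    # Append-only sink of your choice; stdout stands in here.
    print(json.dumps(asdict(event)))
```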
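
For suggestion 5, the mapping can be as plain as a table that decouples your product's categories from whichever upstream taxonomy is in force, so an upstream spec change becomes a table edit rather than a code rewrite. Every name below is illustrative:

```python
# Your categories on the left; how they bind to the current upstream
# taxonomy and to per-audience handling on the right.
POLICY_MAP = {
    "self_harm": {
        "upstream_categories": ["self_harm", "self_harm_intent",
                                "self_harm_instructions"],
        "u18_action": "block_and_surface_resources",
        "adult_action": "block_instructions_allow_support",
    },
    "disordered_eating": {
        "upstream_categories": [],  # no upstream coverage; own classifier
        "u18_action": "block_and_surface_resources",
        "adult_action": "contextual_review",
    },
    "sexualized_roleplay": {
        "upstream_categories": ["sexual", "sexual_minors"],
        "u18_action": "block",
        "adult_action": "per_jurisdiction",
    },
}


def action_for(category: str, is_minor: bool) -> str:
    entry = POLICY_MAP[category]
    return entry["u18_action"] if is_minor else entry["adult_action"]
```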

The Model Spec update is real defensive work and the U18 section is more specific than what came before it. It is also a guardrail, which means somewhere there is already a working bypass and another one being written. Plan for both.

Sources

  1. Updating our Model Spec with teen protections — OpenAI
  2. Model Spec (2025/12/18) — OpenAI
  3. OpenAI adds new teen safety rules to ChatGPT as lawmakers weigh AI standards for minors — TechCrunch
