Output Classification: Building a PII and Secrets Detector for LLM Applications
Most output filters catch the obvious cases and miss the long tail. Here's how to build an output classifier that's actually deployable in production.
The input side of LLM security gets all the attention. Input is the obvious attack surface; the output side feels like a downstream concern. In practice, output classification is where you catch the things your input filter missed — leaked system prompts, exfiltrated PII, emitted secrets, malicious payloads heading to a downstream renderer. Skipping it leaves the heaviest gaps unguarded.
This is the production-grade approach we’ve shipped with multiple defending teams.
What an output classifier actually checks
A useful output classifier runs every model response through a sequence of detectors. Each detector is independent, returns a verdict + confidence, and never exceeds 50ms p99 latency on its own.
The base detectors:
- PII detection — names, emails, phone numbers, addresses, government IDs, payment cards, health info
- Secrets detection — API keys, tokens, private keys, passwords, internal hostnames
- System prompt leakage — substring overlap between output and the system prompt
- Injected instruction echo — output that contains imperative-mood directives at unusual density (a sign the model is forwarding an injected payload)
- Markdown/HTML rendering hazards — output contains JavaScript URIs, off-domain links, or encoded HTML headed for a downstream renderer
- Off-policy content — output falls outside the application’s allowed domain (e.g. a customer-support bot answering medical questions)
In production, you typically get the first three for free with off-the-shelf libraries. The last three require app-specific rules.
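Whatever the backend, it pays to normalize every detector's output into one small result type so the aggregation pipeline at the end of this post can treat verdicts uniformly. A minimal sketch; the field names are illustrative, not any library's API:

from dataclasses import dataclass
from typing import Literal

@dataclass
class DetectorResult:
    detector: str                               # e.g. "pii", "secrets", "prompt_leak"
    severity: Literal["pass", "warn", "block"]  # the verdict
    confidence: float                           # 0.0-1.0
    matched_span: str = ""                      # truncated; never the full secret/PII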
Building the PII detector
Microsoft Presidio ↗ is the most-used open-source option. It ships ~30 entity types, integrates with spaCy or Transformers backends, and is fast enough for production.
The trap is that Presidio’s defaults flag aggressively. Out of the box, you’ll get false positives on:
- Random alphanumeric strings the regex thinks are credit cards
- Common first names embedded in unrelated content
- Hex strings that look like IBANs
Tune the recognizers per your domain. For most LLM apps:
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.predefined_recognizers import (
    EmailRecognizer, PhoneRecognizer, CreditCardRecognizer,
)

# Only the recognizers your domain actually needs
registry = RecognizerRegistry()
registry.add_recognizer(EmailRecognizer())
registry.add_recognizer(PhoneRecognizer(supported_regions=["US", "GB"]))
registry.add_recognizer(CreditCardRecognizer())

analyzer = AnalyzerEngine(registry=registry)

# `output` is the model response under inspection
results = analyzer.analyze(text=output, language="en")
flagged = [
    (r.entity_type, r.score, output[r.start:r.end])
    for r in results
    if r.score > 0.85
]
The 0.85 threshold is empirical — at 0.7 we saw 4-6% false positive rate; at 0.85 we get ~1% with 95%+ recall on real PII in our test corpus.
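The same analyzer results can also drive log-time redaction (see “What to log” below), so raw PII never lands in your logs. A minimal sketch using presidio_anonymizer's default replace behavior:

from presidio_anonymizer import AnonymizerEngine

anonymizer = AnonymizerEngine()

# The default operator replaces each detected span with its entity type,
# e.g. "reach me at j.doe@example.com" -> "reach me at <EMAIL_ADDRESS>"
redacted = anonymizer.anonymize(text=output, analyzer_results=results).text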
Building the secrets detector
Don’t roll your own regex. TruffleHog ↗ maintains the most comprehensive set of secret-detection rules — 800+ patterns across cloud providers, SaaS APIs, and internal services. The engine validates suspected secrets against the issuing provider’s API where possible, which removes false positives for the detectors it can verify.
TruffleHog’s verification engine ships in the Go binary, so the practical integration from Python is to invoke the CLI and parse its JSON findings; if latency matters, keep a warm worker or run it as a sidecar service rather than spawning a process per response:

import json
import subprocess
import tempfile

# Write the model response to a temp file and scan it with the TruffleHog v3 CLI;
# each finding is emitted as one JSON object per line on stdout.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(output)
raw = subprocess.run(
    ["trufflehog", "filesystem", tmp.name, "--json"],
    capture_output=True, text=True,
).stdout
findings = [json.loads(line) for line in raw.splitlines() if line.strip()]
verified_hits = [f for f in findings if f.get("Verified") is True]
If Verified is True, the string was confirmed as a live secret by calling the issuing service. Block the output unconditionally.
If Verified is False but a high-confidence pattern matched, you have a likely secret that wasn’t validatable (revoked, custom-issued, or rate-limited). Default to block in production; allow override for trusted internal flows.
System prompt leakage
This is the detector most teams skip and most attacks exploit. Keep the system prompt available at request time; after generating the response, scan for any shared substring of length ≥ 40 characters (a rolling hash or suffix automaton keeps this cheap at scale).
import re

def system_prompt_overlap(output: str, system_prompt: str, min_len: int = 40) -> bool:
    # Find a common substring of >= min_len characters; if one exists, flag.
    # Production version uses a suffix automaton; this is the simple form.
    # Collapse whitespace so the model reformatting the prompt doesn't hide a leak.
    output = re.sub(r"\s+", " ", output)
    system_prompt = re.sub(r"\s+", " ", system_prompt)
    for i in range(len(system_prompt) - min_len + 1):
        chunk = system_prompt[i:i + min_len]
        if chunk in output:
            return True
    return False
40 characters is short enough to catch leakage of meaningful instructions and long enough to avoid false positives on common phrases. Tune to your prompt length.
When this fires, the response is blocked AND the session is marked compromised — assume the attacker can reproduce and exfiltrate further. Log the request_id for retro analysis.
Off-policy content
This is the hardest detector to build, because “off-policy” is application-specific. The most reliable approach we’ve found:
- Define the application’s allowed topic surface as a fixed set of phrases (10-50 entries; the customer-support bot’s “what we sell,” “shipping policies,” “returns process,” etc.)
- Compute embedding similarity between the output and each allowed phrase
- Take the max similarity. If below threshold, the output drifted off-policy.
This catches:
- A coding assistant answering legal questions
- A customer-support bot weighing in on geopolitics
- A medical-info bot opining on stock picks
The threshold is empirical (typically 0.4-0.6 with sentence-transformers’ all-mpnet-base-v2) and needs domain tuning.
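A minimal sketch with sentence-transformers, assuming the all-mpnet-base-v2 model mentioned above; the topic list and the 0.5 default threshold are illustrative and need domain tuning:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")
ALLOWED_TOPICS = ["what we sell", "shipping policies", "returns process"]  # illustrative
topic_embeddings = model.encode(ALLOWED_TOPICS, normalize_embeddings=True)

def off_policy(output: str, threshold: float = 0.5) -> bool:
    # Max cosine similarity between the output and any allowed topic phrase;
    # below the threshold means the response has drifted off-policy.
    output_embedding = model.encode(output, normalize_embeddings=True)
    max_sim = float(util.cos_sim(output_embedding, topic_embeddings).max())
    return max_sim < threshold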
Output composition: pipeline structure
def classify_output(output: str, ctx: dict) -> Verdict:
    checks = [
        pii_check(output),
        secrets_check(output),
        system_prompt_leak_check(output, ctx["system_prompt"]),
        markdown_hazard_check(output),
        off_policy_check(output, ctx["allowed_topics"]),
    ]
    # Any block-severity match → block
    if any(c.severity == "block" for c in checks):
        return Verdict.block(reasons=[c for c in checks if c.severity == "block"])
    # Any warn → log + alert + pass
    if any(c.severity == "warn" for c in checks):
        return Verdict.warn(reasons=[c for c in checks if c.severity == "warn"])
    return Verdict.pass_()
In production, run this as a sidecar to the LLM call, not inline. Latency budget for the full pipeline should be ≤100ms p95. If you exceed that, the LLM call’s UX degrades noticeably.
What to log
Every classification decision emits a structured event with:
- request_id, session_id, user_id
- The detectors that fired (entity types, severity, confidence)
- The truncated matched span (max 50 chars; never the full secret/PII)
- The action taken (block, warn, pass)
This is the audit trail when you need to investigate a leakage incident or tune detectors. It’s also the data you train on to improve recall.
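A sketch of the event shape, reusing the illustrative DetectorResult fields from earlier; verdict.action and the ctx keys are assumptions, not a fixed schema:

def classification_event(verdict, checks: list, ctx: dict) -> dict:
    # One structured event per decision; spans are truncated so the log itself
    # never stores the full secret or PII value.
    return {
        "request_id": ctx["request_id"],
        "session_id": ctx["session_id"],
        "user_id": ctx["user_id"],
        "detectors": [
            {"detector": c.detector, "severity": c.severity, "confidence": c.confidence}
            for c in checks
            if c.severity != "pass"
        ],
        "matched_spans": [c.matched_span[:50] for c in checks if c.matched_span],
        "action": verdict.action,  # "block" | "warn" | "pass"
    }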
What not to do
- Don’t show the user what was redacted. “We blocked ‘4111-1111-1111-1111’ from your response” tells the attacker exactly what triggered the filter. Show a generic message.
- Don’t store full responses with detected PII in normal logs. That’s a recursive privacy problem. Redact at log time.
- Don’t trust the LLM to self-classify. “Ask GPT if this output contains PII” sounds clever but adversarial inputs that bypass the production model also bypass the judge model.
- Don’t fail open on classifier errors. If a detector throws an exception, default to block; availability takes a small hit, but an unscreened response is the worse failure (see the wrapper sketch below).
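A minimal fail-closed wrapper around the pipeline above; log is a standard library logger, and Verdict.block follows the illustrative API used earlier:

import logging

log = logging.getLogger(__name__)

def safe_classify(output: str, ctx: dict) -> Verdict:
    # Fail closed: any detector exception becomes a block, never a silent pass.
    try:
        return classify_output(output, ctx)
    except Exception:
        log.exception("output classifier error; failing closed")
        return Verdict.block(reasons=["classifier_error"])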
Cross-mapping
This pipeline addresses OWASP LLM02 Insecure Output Handling ↗ (LLM05: Improper Output Handling in the 2025 list) and parts of LLM06 (Sensitive Information Disclosure) and LLM07:2025 (System Prompt Leakage). For the input side and the architectural concerns, see our detection engineering runbook ↗ on the sister site.
The output filter is not the most fun part of LLM security to build, but it catches the highest-impact class of failures. Build it before you build anything else.