Output Classification: A PII and Secrets Detector for LLM Apps
Most output filters catch the obvious cases and miss the long tail. Here's how to build an output classifier that's actually deployable in production.
The input side of LLM security gets all the attention. Input is the obvious attack surface; the output side feels like a downstream concern. In practice, output classification is where you catch the things your input filter missed — leaked system prompts, exfiltrated PII, emitted secrets, malicious payloads heading to a downstream renderer. Skipping it leaves the heaviest gaps unguarded. It is the output-rail half of the LLM guardrails stack; this post is the build guide for it.
This is the production-grade approach we’ve shipped at multiple defenders.
What an output classifier actually checks
A useful output classifier runs every model response through a sequence of detectors. Each detector is independent, returns a verdict + confidence, and never exceeds 50ms p99 latency on its own.
The base detectors:
- PII detection — names, emails, phone numbers, addresses, government IDs, payment cards, health info
- Secrets detection — API keys, tokens, private keys, passwords, internal hostnames
- System prompt leakage — substring overlap between output and the system prompt
- Injected instruction echo — output that contains imperative-mood directives at unusual density (signs the model is forwarding an injected payload)
- Markdown/HTML rendering hazards — output contains JS URIs, off-domain links, encoded HTML
- Off-policy content — falling outside the application’s allowed domain (e.g. a customer-support bot answering medical questions)
In production, you typically get the first three for free with off-the-shelf libraries. The last three require app-specific rules. Toxicity and policy-category scoring is a related but separate job — see our content moderation tools guide for that surface.
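As an illustration of the app-specific rules, here is a minimal sketch of a rendering-hazard check covering the three signals listed above (script-capable URIs, off-domain links, encoded HTML). The allowlist, the regexes, and the function name are our assumptions for the example, not a complete sanitizer; bare autolinked URLs, for instance, are not covered.

```python
import re
from urllib.parse import urlparse

# Hypothetical allowlist: replace with the hosts your app is allowed to link to.
ALLOWED_LINK_HOSTS = {"example.com", "docs.example.com"}

# Markdown links and images: [text](url) or ![alt](url)
MD_LINK = re.compile(r"!?\[[^\]]*\]\(\s*<?([^)\s>]+)")
# HTML attributes that carry URLs
HTML_URL_ATTR = re.compile(r"""(?:href|src)\s*=\s*["']?([^"'\s>]+)""", re.IGNORECASE)
# Schemes that can execute or embed content in a browser renderer
DANGEROUS_SCHEMES = ("javascript:", "data:", "vbscript:")
# Entity- or percent-encoded "<", a common way to smuggle tags past a filter
ENCODED_HTML = re.compile(r"&lt;|&#0*60;|&#x0*3c;|%3c", re.IGNORECASE)


def rendering_hazards(output: str) -> list[str]:
    """Return a list of reason strings; an empty list means nothing was flagged."""
    reasons = []
    for url in MD_LINK.findall(output) + HTML_URL_ATTR.findall(output):
        lowered = url.strip().lower()
        if lowered.startswith(DANGEROUS_SCHEMES):
            reasons.append(f"dangerous-scheme:{lowered.split(':', 1)[0]}")
            continue
        try:
            host = urlparse(url).hostname
        except ValueError:
            reasons.append("malformed-url")
            continue
        # Relative URLs have no host and are treated as on-domain.
        if host and host not in ALLOWED_LINK_HOSTS:
            reasons.append(f"off-domain:{host}")
    if ENCODED_HTML.search(output):
        reasons.append("encoded-html")
    return reasons
```

Off-domain image links deserve the most attention here: a renderer that fetches them automatically gives an injected payload a way to send data out without the user clicking anything.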
Building the PII detector
Microsoft Presidio is the most-used open-source option. It ships ~30 entity types, integrates with spaCy or Transformers backends, and is fast enough for production.
The trap is that Presidio’s defaults flag aggressively. Out of the box, you’ll get false positives on:
- Random digit strings the regex thinks are credit cards
- Common first names embedded in unrelated content
- Hex strings that look like IBANs
Tune the recognizers per your domain. For most LLM apps:
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.predefined_recognizers import (
    CreditCardRecognizer,
    EmailRecognizer,
    PhoneRecognizer,
)

# Start from an empty registry and add only the recognizers your domain actually needs
registry = RecognizerRegistry()
registry.add_recognizer(EmailRecognizer())
registry.add_recognizer(PhoneRecognizer(supported_regions=["US", "GB"]))
registry.add_recognizer(CreditCardRecognizer())

analyzer = AnalyzerEngine(registry=registry)

# `output` is the model response text being classified
results = analyzer.analyze(text=output, language="en")
# Keep high-confidence hits as (entity type, score, matched text)
flagged = [
    (r.entity_type, r.score, output[r.start:r.end])
    for r in results
    if r.score > 0.85
]
The threshold is a precision/recall dial on Presidio’s per-entity confidence score: a lower threshold (around 0.7) catches more borderline matches at the cost of more false positives, while a higher threshold (around 0.85) trims false positives but lets some lower-confidence true PII through. There is no universally correct value: calibrate it against your own labeled outputs, and set it per entity type if your false-positive tolerance differs across, say, credit-card numbers versus names.
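A per-entity threshold can be a plain lookup applied to the analyzer results. The sketch below assumes result objects that expose `entity_type` and `score` (as Presidio's results do); the threshold values and the `Finding` stand-in type are illustrative only.

```python
from typing import NamedTuple

# Hypothetical values: calibrate each one against your own labeled outputs.
THRESHOLDS = {"CREDIT_CARD": 0.9, "EMAIL_ADDRESS": 0.8, "PHONE_NUMBER": 0.6}
DEFAULT_THRESHOLD = 0.85


class Finding(NamedTuple):
    """Stand-in for an analyzer result, used so the sketch runs on its own."""
    entity_type: str
    score: float
    start: int
    end: int


def filter_findings(results, thresholds=THRESHOLDS, default=DEFAULT_THRESHOLD):
    """Keep results whose score meets the threshold for their entity type."""
    return [r for r in results if r.score >= thresholds.get(r.entity_type, default)]
```

Entity types without an explicit entry fall back to the default, so adding a recognizer never silently disables filtering for it.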
Building the secrets detector
Don’t roll your own regex. TruffleHog maintains the most comprehensive set of secret-detection rules we know of: 800+ patterns across cloud providers, SaaS APIs, and internal services. Where possible, the engine validates suspected secrets against the actual provider’s API, which removes false positives for any hit it can verify; matches it cannot verify still need a policy decision.
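Prefer an in-process binding over spawning a subprocess per response where one is available, but check what your installed package actually exposes: the verifying engine is distributed as a Go binary, and the trufflehog3 package on PyPI is a separate project. The snippet below assumes a scan_text wrapper that returns one dict per finding:
# scan_text is an assumed wrapper around the TruffleHog engine, not a confirmed upstream API
results = scan_text(output)
verified_hits = [r for r in results if r.get("Verified") is True]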
If Verified is True, the string was confirmed as a live secret by calling the issuing service. Block the output unconditionally.
If Verified is False but a high-confidence pattern matched, you have a likely secret that wasn’t validatable (revoked, custom-issued, or rate-limited). Default to block in production; allow override for trusted internal flows.
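Where an in-process binding is not available, one fallback is to shell out to the TruffleHog CLI and parse its JSON output. The sketch assumes a `trufflehog` binary on PATH that accepts `filesystem <path> --json`, prints one JSON object per finding, and includes `DetectorName` and `Verified` fields; verify this against the version you deploy. The function names are ours, and mapping the trusted-flow override to "warn" is our reading of the policy above.

```python
import json
import os
import subprocess
import tempfile


def parse_findings(stdout: str) -> list[dict]:
    """Parse newline-delimited JSON, keeping only records that look like findings."""
    findings = []
    for line in stdout.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if "DetectorName" in record:  # assumed field name; skips log lines
            findings.append(record)
    return findings


def scan_text_with_cli(text: str, timeout_s: float = 5.0) -> list[dict]:
    """Write the text to a temp file and scan it with the trufflehog binary."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "output.txt")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(text)
        # check=True raises on a non-zero exit so the caller can fail closed.
        proc = subprocess.run(
            ["trufflehog", "filesystem", path, "--json"],
            capture_output=True, text=True, timeout=timeout_s, check=True,
        )
    return parse_findings(proc.stdout)


def secrets_action(findings: list[dict], trusted_internal: bool = False) -> str:
    """Apply the policy: verified hits always block, unverified hits block by default."""
    if any(f.get("Verified") is True for f in findings):
        return "block"
    if findings and not trusted_internal:
        return "block"
    return "warn" if findings else "pass"
```

A process spawn per response will not fit a tight per-detector latency budget, so treat this as a starting point and measure before relying on it in the request path.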
System prompt leakage
This is the detector most teams skip and most attacks exploit. Compute a rolling-hash representation of the system prompt at request time. After generating the response, scan for substring matches at length ≥ 40 characters.
def system_prompt_overlap(output: str, system_prompt: str, min_len: int = 40) -> bool:
    # Flag if any min_len-character window of the system prompt appears verbatim in the output.
    # Production version uses a suffix automaton; this is the simple form.
    # The +1 includes the final window (and a prompt that is exactly min_len long).
    for i in range(len(system_prompt) - min_len + 1):
        chunk = system_prompt[i:i + min_len]
        if chunk in output:
            return True
    return False
40 characters is short enough to catch leakage of meaningful instructions and long enough to avoid false positives on common phrases. Tune to your prompt length.
When this fires, the response is blocked AND the session is marked compromised — assume the attacker can reproduce and exfiltrate further. Log the request_id for retro analysis.
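The precompute-then-scan idea can be sketched without a suffix automaton by hashing fixed-length windows. To be clear about what this is: it swaps a true rolling hash for a Python set of 40-character windows, and it adds case and whitespace normalization, which is our assumption about what counts as a leak rather than something the simple form does.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial reformatting does not hide a leak."""
    return " ".join(text.lower().split())


def prompt_windows(system_prompt: str, min_len: int = 40) -> frozenset[str]:
    """Precompute every min_len-character window of the normalized system prompt."""
    text = normalize(system_prompt)
    return frozenset(text[i:i + min_len] for i in range(len(text) - min_len + 1))


def leaks_prompt(output: str, windows: frozenset[str], min_len: int = 40) -> bool:
    """True if any min_len-character window of the output appears in the prompt."""
    text = normalize(output)
    return any(text[i:i + min_len] in windows for i in range(len(text) - min_len + 1))
```

Build the window set once per system prompt and reuse it across responses; only `leaks_prompt` needs to run per request. Paraphrased or translated leaks will still get through, since this only matches verbatim text.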
Off-policy content
The hardest detector to build because “off-policy” is application-specific. The most reliable approach we’ve found:
- Define the application’s allowed topic surface as a fixed set of phrases (10-50 entries; the customer-support bot’s “what we sell,” “shipping policies,” “returns process,” etc.)
- Compute embedding similarity between the output and each allowed phrase
- Take the max similarity. If below threshold, the output drifted off-policy.
This catches:
- A coding assistant answering legal questions
- A customer-support bot weighing in on geopolitics
- A medical-info bot opining on stock picks
The threshold is empirical (typically 0.4-0.6 cosine similarity with sentence-transformers/all-mpnet-base-v2) and needs domain tuning.
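The max-similarity rule is small once the embeddings exist. The sketch below works on precomputed vectors so it runs without a model; the function names and the 0.5 default are illustrative assumptions.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def max_topic_similarity(output_vec: Vector, topic_vecs: Sequence[Vector]) -> float:
    """Best match between the output and any allowed-topic phrase."""
    return max((cosine(output_vec, t) for t in topic_vecs), default=0.0)


def is_off_policy(output_vec: Vector, topic_vecs: Sequence[Vector],
                  threshold: float = 0.5) -> bool:
    """True when even the closest allowed topic is below the threshold."""
    return max_topic_similarity(output_vec, topic_vecs) < threshold
```

In practice the vectors would come from a sentence-embedding model, with the allowed-topic phrases embedded once at startup and only the output embedded per request. An empty topic list scores 0.0 and therefore flags everything, which is the safer default.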
Output composition: pipeline structure
def classify_output(output: str, ctx: dict) -> Verdict:
    checks = [
        pii_check(output),
        secrets_check(output),
        system_prompt_leak_check(output, ctx["system_prompt"]),
        markdown_hazard_check(output),
        off_policy_check(output, ctx["allowed_topics"]),
    ]
    # Any block-severity match → block
    if any(c.severity == "block" for c in checks):
        return Verdict.block(reasons=[c for c in checks if c.severity == "block"])
    # Any warn → log + alert + pass
    if any(c.severity == "warn" for c in checks):
        return Verdict.warn(reasons=[c for c in checks if c.severity == "warn"])
    return Verdict.pass_()
In production, run this as a sidecar to the LLM call, not inline. Latency budget for the full pipeline should be ≤100ms p95. If you exceed that, the LLM call’s UX degrades noticeably.
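The pipeline above relies on a `Verdict` type and on check results that carry a `severity`. Neither is defined in this post, so here is one minimal shape they could take; the field names and the `combine` helper are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    detector: str
    severity: str            # "block", "warn", or "pass"
    confidence: float = 1.0


@dataclass(frozen=True)
class Verdict:
    action: str              # "block", "warn", or "pass"
    reasons: tuple = ()

    @classmethod
    def block(cls, reasons):
        return cls("block", tuple(reasons))

    @classmethod
    def warn(cls, reasons):
        return cls("warn", tuple(reasons))

    @classmethod
    def pass_(cls):
        return cls("pass")


def combine(checks) -> Verdict:
    """Block takes precedence over warn; anything else passes."""
    for severity, build in (("block", Verdict.block), ("warn", Verdict.warn)):
        hits = [c for c in checks if c.severity == severity]
        if hits:
            return build(hits)
    return Verdict.pass_()
```

Keeping the verdict immutable makes it safe to hand the same object to both the response path and the logging path.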
What to log
Every classification decision emits a structured event with:
- request_id, session_id, user_id
- The detectors that fired (entity types, severity, confidence)
- A truncated, partially masked preview of the matched span (max 50 chars; never the full secret/PII)
- The action taken (block, warn, pass)
This is the audit trail when you need to investigate a leakage incident or tune detectors. It’s also the data you train on to improve recall.
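A sketch of such an event, using only the standard library. Truncation alone is not enough for short values (a card number fits well inside 50 characters), so this version also masks most of the span; the field names and the masking rule are our assumptions.

```python
import json
from datetime import datetime, timezone


def mask_span(span: str, max_len: int = 50, keep: int = 4) -> str:
    """Show at most `keep` leading characters (never more than a quarter), mask the rest."""
    visible = min(keep, len(span) // 4)
    return (span[:visible] + "*" * (len(span) - visible))[:max_len]


def build_event(request_id: str, session_id: str, user_id: str,
                findings: list[dict], action: str) -> str:
    """Serialize one classification decision as a JSON log line."""
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "request_id": request_id,
        "session_id": session_id,
        "user_id": user_id,
        "detectors": [
            {
                "detector": f["detector"],
                "entity_type": f["entity_type"],
                "severity": f["severity"],
                "confidence": f["confidence"],
                "span_preview": mask_span(f["span"]),  # raw span is never logged
            }
            for f in findings
        ],
        "action": action,
    }
    return json.dumps(event)
```

The preview keeps enough of the match to tell a test card from a real leak during triage while the raw value stays out of the log.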
What to NOT do
- Don’t show the user what was redacted. “We blocked ‘4111-1111-1111-1111’ from your response” tells the attacker exactly what triggered the filter. Show a generic message.
- Don’t store full responses with detected PII in normal logs. That’s a recursive privacy problem. Redact at log time.
- Don’t trust the LLM to self-classify. “Ask GPT if this output contains PII” sounds clever but adversarial inputs that bypass the production model also bypass the judge model.
- Don’t fail open on classifier errors. If the detector throws an exception, default to block. Production stability requires this.
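The fail-closed rule is easiest to enforce in one wrapper around every detector call instead of inside each detector. A minimal sketch, with a hypothetical `CheckResult` shape and logger name:

```python
import logging
from typing import Callable, NamedTuple

logger = logging.getLogger("output_classifier")  # hypothetical logger name


class CheckResult(NamedTuple):
    detector: str
    severity: str          # "block", "warn", or "pass"
    detail: str = ""


def run_fail_closed(name: str, check: Callable[[str], CheckResult],
                    output: str) -> CheckResult:
    """Run one detector; any exception becomes a block-severity result."""
    try:
        return check(output)
    except Exception:
        # Log the stack trace for debugging, but never let the response through.
        logger.exception("detector %s failed; blocking response", name)
        return CheckResult(name, "block", "detector_error")
```

Tagging the result with `detector_error` keeps outages distinguishable from real detections when you review block rates later.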
Cross-mapping
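This pipeline addresses Insecure Output Handling (LLM02 in the 2023 OWASP Top 10 for LLM Applications, renamed Improper Output Handling and moved to LLM05 in the 2025 list), plus parts of Sensitive Information Disclosure (LLM06 in 2023, LLM02 in 2025) and System Prompt Leakage (LLM07, added in the 2025 list). For the input side and the architectural concerns, see our LLM guardrails overview and the detection engineering runbook on the sister site.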
The output filter is not the most fun part of LLM security to build, but it catches the highest-impact class of failures. Build it before you build anything else.
→ This post is part of the LLM Guardrails Hub, the complete index of defensive AI engineering resources on GuardML.