Output Classification: Building a PII and Secrets Detector for LLM Applications

Most output filters catch the obvious cases and miss the long tail. Here's how to build an output classifier that's actually deployable in production.

By Daniel Park · 8 min read

The input side of LLM security gets all the attention. Input is the obvious attack surface; the output side feels like a downstream concern. In practice, output classification is where you catch what your input filter missed — leaked system prompts, exfiltrated PII, emitted secrets, malicious payloads headed for a downstream renderer. Skipping it leaves the highest-impact gaps unguarded.

This is the production-grade approach we’ve shipped with multiple defender teams.

What an output classifier actually checks

A useful output classifier runs every model response through a sequence of detectors. Each detector is independent, returns a verdict + confidence, and never exceeds 50ms p99 latency on its own.

The base detectors:

  1. PII detection — names, emails, phone numbers, addresses, government IDs, payment cards, health info
  2. Secrets detection — API keys, tokens, private keys, passwords, internal hostnames
  3. System prompt leakage — substring overlap between output and the system prompt
  4. Injected instruction echo — output that contains imperative-mood directives at unusual density (a sign the model is forwarding an injected payload)
  5. Markdown/HTML rendering hazards — output contains JS URIs, off-domain links, encoded HTML
  6. Off-policy content — falling outside the application’s allowed domain (e.g. a customer-support bot answering medical questions)

In production, you typically get the first two for free with off-the-shelf libraries. The remaining four require app-specific rules.
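
Of the app-specific detectors, the rendering-hazard check (item 5) is the easiest to sketch. A minimal regex-based version, returning a bare bool for brevity and assuming an allowlist of trusted link domains (the domain list is a placeholder; production rules should also cover encoded HTML):

import re
from urllib.parse import urlparse

ALLOWED_LINK_DOMAINS = {"example.com"}  # placeholder: your trusted domains

def markdown_hazard_check(output: str) -> bool:
    # javascript:/data: URIs are never legitimate in model output
    if re.search(r"(?i)\b(?:javascript|data):", output):
        return True
    # Any link pointing off the trusted domains is a renderer hazard
    for url in re.findall(r"https?://[^\s)\"'<>]+", output):
        host = (urlparse(url).hostname or "").lower()
        if host not in ALLOWED_LINK_DOMAINS and not any(
            host.endswith("." + d) for d in ALLOWED_LINK_DOMAINS
        ):
            return True
    return False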

Building the PII detector

Microsoft Presidio is the most-used open-source option. It ships ~30 entity types, integrates with spaCy or Transformers backends, and is fast enough for production.

The trap is that Presidio’s defaults flag aggressively, and out of the box you will see false positives on benign text. Tune the recognizers down to what your domain actually handles; for most LLM apps:

from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.predefined_recognizers import (
    EmailRecognizer, PhoneRecognizer, CreditCardRecognizer,
)

registry = RecognizerRegistry()
# Register only the recognizers your domain actually needs
registry.add_recognizer(EmailRecognizer())
registry.add_recognizer(PhoneRecognizer(supported_regions=["US", "GB"]))
registry.add_recognizer(CreditCardRecognizer())
analyzer = AnalyzerEngine(registry=registry)

# `output` is the model response under classification
results = analyzer.analyze(text=output, language="en")
flagged = [(r.entity_type, r.score, output[r.start:r.end])
           for r in results if r.score > 0.85]

The 0.85 threshold is empirical — at 0.7 we saw 4-6% false positive rate; at 0.85 we get ~1% with 95%+ recall on real PII in our test corpus.
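
To slot this into the composition pipeline shown later, wrap the analyzer call in a detector function (a sketch; the Check type is defined in the pipeline section below):

def pii_check(output: str, threshold: float = 0.85) -> "Check":
    results = analyzer.analyze(text=output, language="en")
    hits = [r for r in results if r.score >= threshold]
    return Check(
        detector="pii",
        severity="block" if hits else "pass",
        detail=", ".join(sorted({r.entity_type for r in hits})),
    )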

Building the secrets detector

Don’t roll your own regex. TruffleHog maintains the most comprehensive open-source set of secret-detection rules — 800+ patterns across cloud providers, SaaS APIs, internal services. Where possible, the engine validates suspected secrets against the actual provider’s API, which eliminates false positives for verified hits.

TruffleHog v3 ships as a Go binary, so the practical path from Python is invoking the CLI and parsing its JSON output (a sketch; temp-file cleanup omitted):

import json, subprocess, tempfile

def scan_for_secrets(text: str) -> list:
    # Write the model output to a temp file and scan it with the TruffleHog CLI
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write(text)
    proc = subprocess.run(["trufflehog", "filesystem", f.name, "--json"],
                          capture_output=True, text=True)
    # TruffleHog emits one JSON object per finding, one per line
    return [json.loads(line) for line in proc.stdout.splitlines() if line.strip()]

verified_hits = [r for r in scan_for_secrets(output) if r.get("Verified") is True]

If Verified is True, the string was confirmed as a live secret by calling the issuing service. Block the output unconditionally.

If Verified is False but a high-confidence pattern matched, you have a likely secret that wasn’t validatable (revoked, custom-issued, or rate-limited). Default to block in production; allow override for trusted internal flows.
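
Expressed as the detector function the composition pipeline below expects (a sketch; names match the pipeline section):

def secrets_check(output: str, trusted_flow: bool = False) -> "Check":
    hits = scan_for_secrets(output)
    if any(h.get("Verified") is True for h in hits):
        return Check("secrets", "block", "verified live secret")
    if hits:
        # Unverified pattern match: block by default, warn only for
        # trusted internal flows
        return Check("secrets", "warn" if trusted_flow else "block",
                     "unverified secret pattern")
    return Check("secrets", "pass")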

System prompt leakage

This is the detector most teams skip and the one most attacks exploit. The check is simple: after generating the response, scan it for any substring of the system prompt at length ≥ 40 characters. In production, precompute rolling hashes of the system prompt at request time; the simple form looks like this:

def system_prompt_overlap(output: str, system_prompt: str, min_len: int = 40) -> bool:
    # Flags if any min_len-character window of the system prompt
    # appears verbatim in the output. Production version uses rolling
    # hashes or a suffix automaton; this is the simple quadratic form.
    for i in range(len(system_prompt) - min_len + 1):
        if system_prompt[i:i + min_len] in output:
            return True
    return False

40 characters is short enough to catch leakage of meaningful instructions and long enough to avoid false positives on common phrases. Tune to your prompt length.

When this fires, the response is blocked AND the session is marked compromised — assume the attacker can reproduce and exfiltrate further. Log the request_id for retro analysis.

Off-policy content

This is the hardest detector to build, because “off-policy” is application-specific. The most reliable approach we’ve found:

  1. Define the application’s allowed topic surface as a fixed set of phrases (10-50 entries; the customer-support bot’s “what we sell,” “shipping policies,” “returns process,” etc.)
  2. Compute embedding similarity between the output and each allowed phrase
  3. Take the max similarity. If below threshold, the output drifted off-policy.

This catches both benign topic drift and injected instructions that steer the model outside its domain (the customer-support bot suddenly dispensing medical advice, for instance).

The threshold is empirical (typically 0.4-0.6 with sentence-transformers/all-mpnet-base-v2) and needs domain tuning.
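
A minimal sketch of steps 2 and 3 with the sentence-transformers library (the phrase list and 0.5 threshold are placeholders; the pipeline version takes its phrases from the request context):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")
ALLOWED_TOPICS = ["what we sell", "shipping policies", "returns process"]
allowed_emb = model.encode(ALLOWED_TOPICS, convert_to_tensor=True)

def is_off_policy(output: str, threshold: float = 0.5) -> bool:
    # Max cosine similarity against any allowed phrase; below the
    # threshold means the output drifted off-policy
    out_emb = model.encode(output, convert_to_tensor=True)
    return util.cos_sim(out_emb, allowed_emb).max().item() < threshold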

Output composition: pipeline structure

Each detector returns a Check carrying a severity; composition is a simple precedence rule, block beating warn beating pass:

def classify_output(output: str, ctx: dict) -> Verdict:
    checks = [
        pii_check(output),
        secrets_check(output),
        system_prompt_leak_check(output, ctx["system_prompt"]),
        markdown_hazard_check(output),
        off_policy_check(output, ctx["allowed_topics"]),
    ]
    # Any block-severity match → block
    if any(c.severity == "block" for c in checks):
        return Verdict.block(reasons=[c for c in checks if c.severity == "block"])
    # Any warn → log + alert + pass
    if any(c.severity == "warn" for c in checks):
        return Verdict.warn(reasons=[c for c in checks if c.severity == "warn"])
    return Verdict.pass_()
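
The Check and Verdict types are deliberately thin. A minimal shape that satisfies the pipeline above (field names are assumptions):

from dataclasses import dataclass, field
from typing import List

@dataclass
class Check:
    detector: str
    severity: str  # "pass" | "warn" | "block"
    detail: str = ""

@dataclass
class Verdict:
    action: str
    reasons: List[Check] = field(default_factory=list)

    @classmethod
    def block(cls, reasons):
        return cls("block", reasons)

    @classmethod
    def warn(cls, reasons):
        return cls("warn", reasons)

    @classmethod
    def pass_(cls):
        return cls("pass")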

In production, run this as a sidecar to the LLM call, not inline. Latency budget for the full pipeline should be ≤100ms p95. If you exceed that, the LLM call’s UX degrades noticeably.
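
If the summed sequential latencies threaten that budget, run the detectors concurrently so total latency tracks the slowest check rather than the sum (a sketch assuming async variants of each detector):

import asyncio

async def run_checks(output: str, ctx: dict) -> list:
    # Fan out all detectors; compose the results exactly as above
    return await asyncio.gather(
        pii_check_async(output),
        secrets_check_async(output),
        system_prompt_leak_check_async(output, ctx["system_prompt"]),
        markdown_hazard_check_async(output),
        off_policy_check_async(output, ctx["allowed_topics"]),
    )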

What to log

Every classification decision emits a structured event: the request_id, each detector’s verdict and confidence, the offsets of matched spans (never the raw matched content), and end-to-end pipeline latency.

This is the audit trail when you need to investigate a leakage incident or tune detectors. It’s also the data you train on to improve recall.
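
A minimal event shape (field names are illustrative):

event = {
    "request_id": ctx["request_id"],  # assumed present in the request context
    "verdict": "block",
    "checks": [
        {"detector": "secrets", "severity": "block", "confidence": 1.0},
        {"detector": "pii", "severity": "pass", "confidence": 0.0},
    ],
    "matched_spans": [[103, 151]],  # offsets only, never raw content
    "latency_ms": 42,
}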

What not to do

  1. Don’t roll your own secret regexes; maintained rule sets like TruffleHog’s will outpace anything you write.
  2. Don’t log raw matched PII or secret values; log offsets and entity types, or your logging pipeline recreates the leak.
  3. Don’t run the classifier inline if it blows the ≤100ms p95 budget; run it as a sidecar.
  4. Don’t treat a system prompt leak as an isolated event; mark the whole session compromised.

Cross-mapping

This pipeline addresses OWASP LLM02 (Insecure Output Handling) and parts of LLM06 (Sensitive Information Disclosure) from the v1.1 Top 10, plus LLM07 (System Prompt Leakage) from the 2025 revision. For the input side and the architectural concerns, see our detection engineering runbook on the sister site.

The output filter is not the most fun part of LLM security to build, but it catches the highest-impact class of failures. Build it before you build anything else.

Sources

  1. Microsoft Presidio: https://github.com/microsoft/presidio
  2. TruffleHog: https://github.com/trufflesecurity/trufflehog
  3. OWASP LLM02 (Insecure Output Handling): https://owasp.org/www-project-top-10-for-large-language-model-applications/
#output-filtering #pii-detection #secrets-detection #llm-security #detection-engineering