LLM Security Tools: A Practical Guide to the Current Stack
A working guide to LLM security tools for 2026 — covering red-teaming frameworks, runtime guardrails, and observability layers, with honest notes on what each category gets wrong.
The market for LLM security tools has matured significantly since 2023, when teams were largely writing ad-hoc keyword filters and hoping for the best. Today there are purpose-built frameworks for each phase of the security lifecycle: pre-deployment red-teaming, runtime guardrails, and post-deployment observability. This guide covers the tools worth knowing, what threat surface each one actually addresses, and where the gaps remain.
The OWASP Top 10 for LLM Applications 2025 ↗ is the closest thing the industry has to a shared threat model. It identifies ten risk categories — prompt injection, sensitive data disclosure, supply chain attacks, data poisoning, improper output handling, excessive agency, system prompt leakage, vector/embedding weaknesses, misinformation, and unbounded consumption — and it is useful as an evaluation checklist when selecting tools. Any LLM security tool should be assessed against which of these it actually addresses and with what fidelity.
Pre-Deployment: Red Teaming and Vulnerability Scanning
Red teaming tools run before you ship. They probe your model and application stack systematically, surfacing vulnerability classes that manual testing misses. This phase is where you find out whether your guardrails survive adversarial input, not after you’re in production.
Promptfoo is the most widely deployed open-source option in this category. It runs automated adversarial test suites against LLM endpoints and can be wired into CI/CD. The tool tests across 50+ vulnerability types organized into plugins: prompt injection, jailbreaks, PII leakage from RAG context, BOLA (broken object-level authorization), BFLA (broken function-level authorization), data exfiltration via tool calls, and indirect injection through external content. Promptfoo uses diverse adversarial input strategies — not just static fixtures — which means it surfaces failures that a fixed test suite would miss. It covers OWASP LLM Top 10 categories, NIST AI RMF controls, and EU AI Act presets. OpenAI acquired Promptfoo in March 2026; it remains MIT licensed.
The [Promptfoo red-team ↗ documentation](https://www.promptfoo.dev/docs/red-team/ ↗) is worth reading specifically for its treatment of application-layer threats. Most LLM vulnerabilities that matter in practice live at the application layer — a model that handles tool calls, retrieves documents, or operates inside an agent chain has a much larger attack surface than a simple chat endpoint. The docs reflect this: the framework’s plugin architecture maps directly to the attack surface of realistic deployments.
For adversarial ML testing, adversarialml.dev ↗ tracks research on evasion attacks against classifiers and detectors — relevant when your guardrail layer itself becomes the target.
Runtime: Guardrail Toolkits
Runtime guardrails intercept traffic between the user and the model. The three open-source frameworks with the most production adoption are LLM Guard, NeMo Guardrails, and LlamaFirewall.
LLM Guard, maintained by Protect AI, is a Python toolkit structured around scanners that run on inputs and outputs independently. It ships 15 input scanners and 20 output scanners covering prompt injection detection, PII anonymization, secrets detection, toxicity classification, malicious URL identification, and factual consistency checks. The prompt injection scanner uses a fine-tuned DeBERTa-v3 model rather than regex, which means it generalizes better across paraphrase and indirect injection patterns. All processing runs locally — no prompt data leaves your infrastructure — which matters for compliance-sensitive deployments. See the LLM Guard repository ↗ for scanner configuration details.
NeMo Guardrails (NVIDIA) takes a different architectural approach. Rather than a scanner pipeline, it introduces a Colang-based policy language that lets you define conversation flows, topic restrictions, and response constraints as explicit rules. This is more expressive than a scanner stack for use cases where you need to enforce dialog structure — a customer service bot that should never discuss competitor pricing, or a copilot that must follow a specific decision tree. The tradeoff is that Colang adds a layer of complexity that scanner-based tools avoid.
LlamaFirewall (Meta) targets the agentic deployment case specifically, where the threat model is harder: the model takes actions, calls tools, and operates over long contexts where prompt injection can arrive through tool outputs, not just user messages. LlamaFirewall comprises three components ↗: PromptGuard 2 (a jailbreak detector claiming state-of-the-art performance), Agent Alignment Checks (a chain-of-thought auditor that reads the agent’s reasoning trace and flags goal deviation or injected instructions), and CodeShield (a static analysis engine that intercepts unsafe code before execution). LlamaFirewall is production-deployed inside Meta’s own systems. For teams building agents rather than simple chat endpoints, it deserves serious evaluation.
One thing all three frameworks share: they are trained on existing attack patterns and degrade on novel ones. The OWASP Prompt Injection Prevention Cheat Sheet ↗ is direct about this: “research shows attackers can eventually bypass safety measures through sufficient variation attempts.” A guardrail toolkit reduces your attack surface substantially, but it does not close it. Layer it with structural mitigations — privilege minimization, clear instruction/data separation in prompts, output schema enforcement — not just classifier scoring.
For context on documented bypasses in the wild, ai-alert.org ↗ tracks reported incidents involving guardrail failures and jailbreak disclosures, which is useful for calibrating how much real-world adversarial pressure looks like your threat model.
Observability: Knowing When You’re Losing
A guardrail that fails silently is worse than no guardrail, because it creates false confidence. LLM security tools in the observability layer give you the signal to detect when runtime defenses are being bypassed, when behavior is drifting, or when a new attack pattern is landing.
Langfuse is the most widely used open-source option here. It provides tracing for LLM applications — capturing inputs, outputs, latency, and cost at each step of a chain — and supports attaching evaluation scores to traces, including custom classifier scores from your guardrail stack. This means you can log not just what the model said, but what your scanners scored, flag borderline cases, and build dashboards that surface patterns over time. Integrations exist for LangChain, LlamaIndex, and direct OpenAI/Anthropic SDK instrumentation.
The observability layer also matters for agentic systems where multi-step reasoning traces are your primary audit surface. If you cannot reconstruct what an agent did and why, you cannot investigate incidents. llmops.report ↗ covers the operational patterns for deploying and monitoring LLM systems in production, including trace management for agents.
Combining the Layers
No single tool covers the full threat model. A defensible stack in 2026 looks like:
- Pre-deployment: Promptfoo integrated into CI/CD, running against staging endpoints before each release. Covers prompt injection, jailbreaks, PII leakage, and authorization failures.
- Runtime input screening: LLM Guard’s input scanners or LlamaFirewall’s PromptGuard 2 on the request path. Fast ML classifiers, not regex, to handle paraphrase and multilingual inputs.
- Runtime output screening: LLM Guard’s output scanners for PII redaction and secrets detection. Schema validation for structured outputs. LlamaFirewall’s CodeShield for agent code generation.
- Observability: Full request/response tracing with guardrail scores attached. Alerting on score distributions, not just individual flags.
The OWASP LLM Top 10 categories that this stack leaves least covered are supply chain (LLM03), vector/embedding weaknesses (LLM08), and unbounded consumption (LLM10). Supply chain risks require vendor assessment and model provenance tracking outside the runtime stack. Embedding security requires schema validation on retrieval results and query sanitization before vector search. Rate limiting and cost controls address unbounded consumption, which standard guardrail frameworks do not handle.
Effective LLM security is not a product purchase — it is a layered operational posture. Pick tools that cover your actual attack surface, instrument them so you can see failures, and run the red-teaming loop continuously rather than as a one-time pre-launch check.
Sources
- OWASP Top 10 for LLM Applications 2025 ↗ — The threat taxonomy the industry has converged on; covers ten risk categories with mitigations and references.
- LLM Guard (Protect AI) ↗ — Open-source scanner toolkit; repository contains scanner documentation, configuration examples, and benchmark data.
- LlamaFirewall (Meta AI Research) ↗ — Research publication describing the architecture and evaluation of Meta’s open-source agent guardrail framework.
- Promptfoo ↗ — Open-source LLM red-teaming and vulnerability scanning CLI; MIT licensed, used by OpenAI and Anthropic internally.
- OWASP LLM Prompt Injection Prevention Cheat Sheet ↗ — Concrete defensive controls for prompt injection, including the honest caveat that current defenses have known bypass limits.
For more context, AI defense strategies ↗ covers related topics in depth.
Sources
- OWASP Top 10 for LLM Applications 2025 — OWASP GenAI Security Project
- LLM Guard — The Security Toolkit for LLM Interactions (Protect AI)
- LlamaFirewall: An Open Source Guardrail System for Building Secure AI Agents — Meta AI Research
- Promptfoo: LLM Red Teaming and Vulnerability Scanning
- OWASP LLM Prompt Injection Prevention Cheat Sheet
GuardML — in your inbox
Defensive AI — guardrails, content filters, model defenses, safe deployment. — delivered when there's something worth your inbox.
No spam. Unsubscribe anytime.
Related
AI Moderation Tools for LLMs: What Works and What Gets Bypassed
A practitioner's comparison of AI moderation tools — AWS Bedrock Guardrails, Azure AI Content Safety, Lakera Guard, NeMo Guardrails, and Llama Guard — with honest numbers on bypass rates, false positives, and latency cost.
AI Safety Tools: A Guide to Guardrails, Filters, and Defenses
A practitioner's breakdown of the leading AI safety tools — NeMo Guardrails, LLM Guard, Llama Guard, and managed platforms — with benchmark data, known bypasses, and deployment guidance.
LLM Guardrails: Comparing Tools and Implementation Patterns
A practical comparison of LLM guardrail implementations — classifiers, rule engines, LLM judges — with empirical bypass rates and deployment patterns that don't collapse under adversarial pressure.