5.7 KiB
API Reference: Prompt Injection Detection Tools
PromptInjectionDetector (agent.py)
The primary detection class combining three layers of prompt injection analysis.
Constructor
PromptInjectionDetector(
mode: str = "full", # "regex", "heuristic", or "full"
threshold: float = 0.85, # Classifier confidence threshold (0.0-1.0)
device: str = "cpu", # "cpu" or "cuda" for GPU inference
)
Methods
analyze(text: str) -> DetectionResult
Runs the configured detection layers against the input text and returns a structured result.
Parameters:
text(str): The user prompt to analyze for injection attempts.
Returns: DetectionResult dataclass with the following fields:
| Field | Type | Description |
|---|---|---|
input_text |
str | The original input text |
injection_detected |
bool | Final boolean verdict |
composite_score |
float | Weighted score from all active layers (0.0 - 1.0) |
regex_matches |
list[str] | Names of matched regex patterns |
regex_score |
float | Regex layer score (0.0 - 1.0) |
heuristic_score |
float | Heuristic layer score (0.0 - 1.0) |
classifier_score |
float | DeBERTa classifier injection probability (0.0 - 1.0) |
classifier_label |
str | "INJECTION", "SAFE", "SKIPPED", or "ERROR" |
detection_time_ms |
float | Total detection time in milliseconds |
layer_details |
dict | Detailed breakdown from each layer |
RegexDetector
Fast pattern-matching layer using compiled regular expressions.
scan(text: str) -> tuple[float, list[str]]
Scans input against 20+ compiled regex patterns for known injection signatures.
Returns: Tuple of (score, matched_pattern_names). Score is min(1.0, match_count * 0.25).
Pattern Categories:
system_prompt_override-- "ignore previous instructions" and variantsrole_play_escape-- "you are now", "act as", "pretend to be"instruction_hijack-- "do not follow", "new instructions", "instead do"delimiter_escape-- Markdown code fences with system/assistant roles, XML instruction tagsdata_exfiltration-- Attempts to extract system prompts, keys, credentialsencoding_obfuscation-- Base64/ROT13/hex encoding referencessql_injection_via_prompt-- SQL payloads embedded in promptscommand_injection_via_prompt-- Shell command payloadsdeveloper_mode-- "DAN mode", "developer mode", "god mode"prompt_leaking-- "what are your instructions", "repeat your prompt"token_smuggling-- Zero-width Unicode characters and control charactersbase64_payload-- Long Base64-encoded strings that may contain hidden instructions
HeuristicScorer
Structural anomaly detection using weighted feature analysis.
score(text: str) -> tuple[float, dict]
Computes an anomaly score from seven structural features.
Features and Weights:
| Feature | Weight | Description |
|---|---|---|
instruction_density |
0.30 | Ratio of instruction keywords to total words |
special_char_ratio |
0.10 | Ratio of non-alphanumeric characters |
delimiter_presence |
0.15 | Count of delimiter sequences (```, ---, ###) |
capitalization_ratio |
0.10 | Proportion of uppercase alphabetic characters |
line_structure_anomaly |
0.10 | Many short lines indicating structured payloads |
unicode_anomaly |
0.15 | Zero-width and control character presence |
repetition_score |
0.10 | Low unique-word ratio indicating repetitive overrides |
ClassifierDetector
Transformer-based binary classifier using ProtectAI's DeBERTa-v3 model.
Constructor
ClassifierDetector(
threshold: float = 0.85, # Confidence threshold for INJECTION label
device: str = "cpu", # Inference device
)
predict(text: str) -> tuple[float, str]
Runs the DeBERTa model on the input (truncated to 512 tokens) and returns the injection probability and label.
Model Details:
- Model:
protectai/deberta-v3-base-prompt-injection-v2 - Architecture: microsoft/deberta-v3-base fine-tuned for binary classification
- Labels: INJECTION (class 1) / SAFE (class 0)
- Max Input Length: 512 tokens
- Accuracy: 99.1% on holdout test set
- Size: ~700 MB (downloaded from Hugging Face Hub on first use)
CLI Reference
usage: agent.py [-h] [--input INPUT] [--file FILE]
[--mode {regex,heuristic,full}]
[--threshold THRESHOLD]
[--output {text,json}]
[--device {cpu,cuda}]
Arguments:
--input, -i Single prompt string to analyze
--file, -f Path to file with one prompt per line
--mode, -m Detection mode: regex | heuristic | full (default: full)
--threshold, -t Classifier confidence threshold (default: 0.85)
--output, -o Output format: text | json (default: text)
--device Inference device: cpu | cuda (default: cpu)
Exit Codes:
0-- No injections detected1-- Error (file not found, model load failure)2-- One or more injections detected
External Resources
- OWASP LLM01:2025 Prompt Injection: https://genai.owasp.org/llmrisk/llm01-prompt-injection/
- OWASP Prompt Injection Prevention Cheat Sheet: https://cheatsheetseries.owasp.org/cheatsheets/LLM_Prompt_Injection_Prevention_Cheat_Sheet.html
- ProtectAI DeBERTa Model: https://huggingface.co/protectai/deberta-v3-base-prompt-injection-v2
- Deepset Prompt Injection Dataset: https://huggingface.co/datasets/deepset/prompt-injections
- Rebuff Framework: https://github.com/protectai/rebuff
- Simon Willison's Prompt Injection Tag: https://simonwillison.net/tags/prompt-injection/
- Meta Prompt Guard 86M: https://huggingface.co/meta-llama/Prompt-Guard-86M