mirror of
https://github.com/mukul975/Anthropic-Cybersecurity-Skills.git
synced 2026-06-13 06:34:57 +03:00
152 lines
5.7 KiB
Markdown
152 lines
5.7 KiB
Markdown
# API Reference: Prompt Injection Detection Tools
|
|
|
|
## PromptInjectionDetector (agent.py)
|
|
|
|
The primary detection class combining three layers of prompt injection analysis.
|
|
|
|
### Constructor
|
|
|
|
```python
|
|
PromptInjectionDetector(
|
|
mode: str = "full", # "regex", "heuristic", or "full"
|
|
threshold: float = 0.85, # Classifier confidence threshold (0.0-1.0)
|
|
device: str = "cpu", # "cpu" or "cuda" for GPU inference
|
|
)
|
|
```
|
|
|
|
### Methods
|
|
|
|
#### `analyze(text: str) -> DetectionResult`
|
|
|
|
Runs the configured detection layers against the input text and returns a structured result.
|
|
|
|
**Parameters:**
|
|
- `text` (str): The user prompt to analyze for injection attempts.
|
|
|
|
**Returns:** `DetectionResult` dataclass with the following fields:
|
|
|
|
| Field | Type | Description |
|
|
|-------|------|-------------|
|
|
| `input_text` | str | The original input text |
|
|
| `injection_detected` | bool | Final boolean verdict |
|
|
| `composite_score` | float | Weighted score from all active layers (0.0 - 1.0) |
|
|
| `regex_matches` | list[str] | Names of matched regex patterns |
|
|
| `regex_score` | float | Regex layer score (0.0 - 1.0) |
|
|
| `heuristic_score` | float | Heuristic layer score (0.0 - 1.0) |
|
|
| `classifier_score` | float | DeBERTa classifier injection probability (0.0 - 1.0) |
|
|
| `classifier_label` | str | "INJECTION", "SAFE", "SKIPPED", or "ERROR" |
|
|
| `detection_time_ms` | float | Total detection time in milliseconds |
|
|
| `layer_details` | dict | Detailed breakdown from each layer |
|
|
|
|
---
|
|
|
|
## RegexDetector
|
|
|
|
Fast pattern-matching layer using compiled regular expressions.
|
|
|
|
### `scan(text: str) -> tuple[float, list[str]]`
|
|
|
|
Scans input against 20+ compiled regex patterns for known injection signatures.
|
|
|
|
**Returns:** Tuple of (score, matched_pattern_names). Score is min(1.0, match_count * 0.25).
|
|
|
|
**Pattern Categories:**
|
|
- `system_prompt_override` -- "ignore previous instructions" and variants
|
|
- `role_play_escape` -- "you are now", "act as", "pretend to be"
|
|
- `instruction_hijack` -- "do not follow", "new instructions", "instead do"
|
|
- `delimiter_escape` -- Markdown code fences with system/assistant roles, XML instruction tags
|
|
- `data_exfiltration` -- Attempts to extract system prompts, keys, credentials
|
|
- `encoding_obfuscation` -- Base64/ROT13/hex encoding references
|
|
- `sql_injection_via_prompt` -- SQL payloads embedded in prompts
|
|
- `command_injection_via_prompt` -- Shell command payloads
|
|
- `developer_mode` -- "DAN mode", "developer mode", "god mode"
|
|
- `prompt_leaking` -- "what are your instructions", "repeat your prompt"
|
|
- `token_smuggling` -- Zero-width Unicode characters and control characters
|
|
- `base64_payload` -- Long Base64-encoded strings that may contain hidden instructions
|
|
|
|
---
|
|
|
|
## HeuristicScorer
|
|
|
|
Structural anomaly detection using weighted feature analysis.
|
|
|
|
### `score(text: str) -> tuple[float, dict]`
|
|
|
|
Computes an anomaly score from seven structural features.
|
|
|
|
**Features and Weights:**
|
|
|
|
| Feature | Weight | Description |
|
|
|---------|--------|-------------|
|
|
| `instruction_density` | 0.30 | Ratio of instruction keywords to total words |
|
|
| `special_char_ratio` | 0.10 | Ratio of non-alphanumeric characters |
|
|
| `delimiter_presence` | 0.15 | Count of delimiter sequences (```, ---, ###) |
|
|
| `capitalization_ratio` | 0.10 | Proportion of uppercase alphabetic characters |
|
|
| `line_structure_anomaly` | 0.10 | Many short lines indicating structured payloads |
|
|
| `unicode_anomaly` | 0.15 | Zero-width and control character presence |
|
|
| `repetition_score` | 0.10 | Low unique-word ratio indicating repetitive overrides |
|
|
|
|
---
|
|
|
|
## ClassifierDetector
|
|
|
|
Transformer-based binary classifier using ProtectAI's DeBERTa-v3 model.
|
|
|
|
### Constructor
|
|
|
|
```python
|
|
ClassifierDetector(
|
|
threshold: float = 0.85, # Confidence threshold for INJECTION label
|
|
device: str = "cpu", # Inference device
|
|
)
|
|
```
|
|
|
|
### `predict(text: str) -> tuple[float, str]`
|
|
|
|
Runs the DeBERTa model on the input (truncated to 512 tokens) and returns the injection probability and label.
|
|
|
|
**Model Details:**
|
|
- **Model**: `protectai/deberta-v3-base-prompt-injection-v2`
|
|
- **Architecture**: microsoft/deberta-v3-base fine-tuned for binary classification
|
|
- **Labels**: INJECTION (class 1) / SAFE (class 0)
|
|
- **Max Input Length**: 512 tokens
|
|
- **Accuracy**: 99.1% on holdout test set
|
|
- **Size**: ~700 MB (downloaded from Hugging Face Hub on first use)
|
|
|
|
---
|
|
|
|
## CLI Reference
|
|
|
|
```
|
|
usage: agent.py [-h] [--input INPUT] [--file FILE]
|
|
[--mode {regex,heuristic,full}]
|
|
[--threshold THRESHOLD]
|
|
[--output {text,json}]
|
|
[--device {cpu,cuda}]
|
|
|
|
Arguments:
|
|
--input, -i Single prompt string to analyze
|
|
--file, -f Path to file with one prompt per line
|
|
--mode, -m Detection mode: regex | heuristic | full (default: full)
|
|
--threshold, -t Classifier confidence threshold (default: 0.85)
|
|
--output, -o Output format: text | json (default: text)
|
|
--device Inference device: cpu | cuda (default: cpu)
|
|
```
|
|
|
|
**Exit Codes:**
|
|
- `0` -- No injections detected
|
|
- `1` -- Error (file not found, model load failure)
|
|
- `2` -- One or more injections detected
|
|
|
|
---
|
|
|
|
## External Resources
|
|
|
|
- OWASP LLM01:2025 Prompt Injection: https://genai.owasp.org/llmrisk/llm01-prompt-injection/
|
|
- OWASP Prompt Injection Prevention Cheat Sheet: https://cheatsheetseries.owasp.org/cheatsheets/LLM_Prompt_Injection_Prevention_Cheat_Sheet.html
|
|
- ProtectAI DeBERTa Model: https://huggingface.co/protectai/deberta-v3-base-prompt-injection-v2
|
|
- Deepset Prompt Injection Dataset: https://huggingface.co/datasets/deepset/prompt-injections
|
|
- Rebuff Framework: https://github.com/protectai/rebuff
|
|
- Simon Willison's Prompt Injection Tag: https://simonwillison.net/tags/prompt-injection/
|
|
- Meta Prompt Guard 86M: https://huggingface.co/meta-llama/Prompt-Guard-86M
|