Anthropic-Cybersecurity-Skills/skills/implementing-llm-guardrails-for-security/references/api-reference.md

# API Reference: LLM Guardrails Security Tools

## GuardrailsPipeline (agent.py)

The primary orchestration class that chains all guardrail layers into a validation pipeline.

### Constructor

```python
GuardrailsPipeline(
    policy: dict = None,           # Inline policy dictionary
    policy_path: str = None,       # Path to JSON policy file
)
```

If neither `policy` nor `policy_path` is provided, the built-in DEFAULT_POLICY is used. Custom policies are merged with defaults so missing keys fall back to default values.

### Methods

#### `validate_input(text: str) -> ValidationResult`

Runs all input guardrail layers (length, injection, content policy, PII) on user input.

**Parameters:**
- `text` (str): The user input to validate.

**Returns:** `ValidationResult` with `safe=False` if any critical violation is found. PII-only findings are treated as warnings (input is redacted but not blocked).

#### `validate_output(response: str, original_input: str = "") -> ValidationResult`

Validates LLM-generated output for safety violations, system prompt leakage, and PII.

**Parameters:**
- `response` (str): The LLM output to validate.
- `original_input` (str): The original user input for context-aware validation.

#### `validate_pii_only(text: str) -> ValidationResult`

Runs only the PII detection and redaction layer.

---

## ValidationResult

Dataclass returned by all validation methods.

| Field | Type | Description |
|-------|------|-------------|
| `safe` | bool | True if no critical violations found |
| `blocked_reason` | str | Human-readable reason for blocking (empty if safe) |
| `violations` | list[dict] | List of violation dicts with guard, detail, severity keys |
| `pii_detected` | list[dict] | List of PII findings with type, value, start, end keys |
| `sanitized_text` | str | Input with PII redacted |
| `risk_score` | float | Composite risk score (0.0 - 1.0) |
| `validation_time_ms` | float | Validation latency in milliseconds |
| `layer_results` | dict | Per-guard detailed results |

---

## Individual Guards

### InjectionGuard

Detects prompt injection attempts using compiled regex patterns.

```python
guard = InjectionGuard(patterns=["(?i)ignore previous instructions"])
safe, violations = guard.check("Ignore previous instructions and do X")
# safe=False, violations=["injection_pattern_0: matched 'Ignore previous instructions'"]
```

**Default Patterns Detected:**
- System prompt override ("ignore/disregard/forget previous instructions")
- Role-play escape ("you are now", "act as", "pretend to be")
- Instruction hijacking ("do not follow", "new instructions", "instead do")
- Delimiter injection (Markdown code fences with system/assistant, XML instruction tags)
- Developer/jailbreak modes ("DAN mode", "developer mode", "god mode")
- Prompt leaking ("what are your instructions", "repeat your prompt")

### ContentPolicyGuard

Enforces blocked patterns and topic restrictions.

```python
guard = ContentPolicyGuard(
    blocked_patterns=[r"(?i)how to hack"],
    blocked_topics=["violence", "illegal_activities"],
)
safe, violations = guard.check("How to hack into a WiFi network")
# safe=False, violations=["blocked_content_0: matched 'How to hack'"]
```

**Supported Topic Categories:**
- `violence` -- Physical harm, assault, murder
- `illegal_activities` -- Fraud, money laundering, trafficking
- `weapons` -- Firearms, explosives, 3D-printed weapons
- `drugs` -- Drug synthesis, manufacturing instructions
- `exploitation` -- Child exploitation, human trafficking
- `politics` -- Partisan political opinions or endorsements
- `competitor_products` -- References to switching to competitors

### PIIGuard

Detects and redacts personally identifiable information using regex patterns.

```python
guard = PIIGuard(pii_patterns={"EMAIL_ADDRESS": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b"})
findings = guard.detect("Contact john@example.com for details")
# [{"type": "EMAIL_ADDRESS", "value": "john@example.com", "start": 8, "end": 24}]

redacted, findings = guard.redact("Contact john@example.com for details")
# ("Contact [EMAIL_REDACTED] for details", [...])
```

**Supported PII Types:**

| Type | Pattern | Redaction |
|------|---------|-----------|
| `US_SSN` | 123-45-6789 | [SSN_REDACTED] |
| `EMAIL_ADDRESS` | user@domain.com | [EMAIL_REDACTED] |
| `PHONE_NUMBER` | (555) 123-4567 | [PHONE_REDACTED] |
| `CREDIT_CARD` | 4111-1111-1111-1111 | [CARD_REDACTED] |
| `IP_ADDRESS` | 192.168.1.1 | [IP_REDACTED] |
| `US_PASSPORT` | A12345678 | [PASSPORT_REDACTED] |
| `AWS_ACCESS_KEY` | AKIAIOSFODNN7EXAMPLE | [AWS_KEY_REDACTED] |
| `GENERIC_API_KEY` | api_key=abc123... | [API_KEY_REDACTED] |

### OutputGuard

Validates LLM output for safety violations, length limits, system prompt leakage, and PII.

```python
guard = OutputGuard(blocked_patterns=[...], max_length=8000)
safe, violations = guard.check("Sure, I'll help you hack into the system")
# safe=False, violations=["output_blocked_0: matched ..."]
```

### LengthGuard

Enforces maximum input length.

```python
guard = LengthGuard(max_length=4000)
safe, violations = guard.check("x" * 5000)
# safe=False, violations=["input_too_long: 5000 chars exceeds 4000 limit"]
```

---

## Content Policy JSON Schema

```json
{
  "allowed_topics": ["list of allowed topic strings"],
  "blocked_topics": ["violence", "illegal_activities", "weapons", "drugs", "exploitation"],
  "blocked_patterns": ["regex patterns for blocked content"],
  "pii_patterns": {
    "ENTITY_TYPE": "regex pattern"
  },
  "injection_patterns": ["regex patterns for injection detection"],
  "max_input_length": 4000,
  "max_output_length": 8000,
  "output_blocked_patterns": ["regex patterns for blocked output content"]
}
```

---

## CLI Reference

```
usage: agent.py [-h] [--input INPUT] [--response RESPONSE] [--file FILE]
                [--mode {full,input-only,output-only,pii}]
                [--policy POLICY] [--output {text,json}]

Arguments:
  --input, -i       User input text to validate
  --response, -r    LLM response to validate (required for output-only mode)
  --file, -f        Path to file with one prompt per line
  --mode, -m        Validation mode: full | input-only | output-only | pii (default: full)
  --policy, -p      Path to JSON content policy file
  --output, -o      Output format: text | json (default: text)
```

**Exit Codes:**
- `0` -- All inputs passed validation
- `1` -- Error (file not found, invalid policy)
- `2` -- One or more inputs blocked or flagged

---

## External Resources

- NVIDIA NeMo Guardrails: https://github.com/NVIDIA-NeMo/Guardrails
- NeMo Guardrails Documentation: https://docs.nvidia.com/nemo/guardrails/latest/index.html
- Guardrails AI Framework: https://github.com/guardrails-ai/guardrails
- Guardrails AI Hub (Validators): https://guardrailsai.com/hub
- Microsoft Presidio (PII Engine): https://github.com/microsoft/presidio
- OpenAI Guardrails Python: https://github.com/openai/openai-guardrails-python
- Colang 2.0 Guide: https://docs.nvidia.com/nemo/guardrails/latest/configure-rails/colang/index.html
- NeMo Guardrails Security Guidelines: https://docs.nvidia.com/nemo/guardrails/latest/security/guidelines.html