API Reference: LLM Guardrails Security Tools

GuardrailsPipeline (agent.py)

The primary orchestration class that chains all guardrail layers into a validation pipeline.

Constructor

GuardrailsPipeline(
    policy: dict = None,           # Inline policy dictionary
    policy_path: str = None,       # Path to JSON policy file
)

If neither policy nor policy_path is provided, the built-in DEFAULT_POLICY is used. Custom policies are merged with defaults so missing keys fall back to default values.
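The fallback behaviour described above can be pictured as a dictionary merge. The following is a minimal sketch, not the code in agent.py: `resolve_policy` is a hypothetical helper, the `DEFAULT_POLICY` contents are illustrative, and both the shallow merge and the precedence of `policy_path` over `policy` are assumptions.

```python
import json
from typing import Optional

# Illustrative stand-in for the built-in defaults; the real DEFAULT_POLICY
# in agent.py may contain different keys and values.
DEFAULT_POLICY = {
    "blocked_topics": ["violence", "illegal_activities"],
    "max_input_length": 4000,
    "max_output_length": 8000,
}


def resolve_policy(policy: Optional[dict] = None,
                   policy_path: Optional[str] = None) -> dict:
    """Hypothetical helper: return the effective policy (assumes a shallow merge)."""
    custom = {}
    if policy_path is not None:
        # Assumption: a policy file takes precedence over an inline dict.
        with open(policy_path, encoding="utf-8") as fh:
            custom = json.load(fh)
    elif policy is not None:
        custom = policy
    # Keys in the custom policy win; missing keys fall back to the defaults.
    return {**DEFAULT_POLICY, **custom}


effective = resolve_policy(policy={"max_input_length": 1000})
```

With a shallow merge, a custom list such as `blocked_topics` replaces the default list rather than extending it. Whether the real pipeline behaves this way should be checked against agent.py.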

Methods

validate_input(text: str) -> ValidationResult

Runs all input guardrail layers (length, injection, content policy, PII) on user input.

Parameters:

  • text (str): The user input to validate.

Returns: ValidationResult with safe=False if any critical violation is found. PII-only findings are treated as warnings (input is redacted but not blocked).

validate_output(response: str, original_input: str = "") -> ValidationResult

Validates LLM-generated output for safety violations, system prompt leakage, and PII.

Parameters:

  • response (str): The LLM output to validate.
  • original_input (str): The original user input for context-aware validation.

validate_pii_only(text: str) -> ValidationResult

Runs only the PII detection and redaction layer.


ValidationResult

Dataclass returned by all validation methods.

| Field | Type | Description |
|---|---|---|
| safe | bool | True if no critical violations found |
| blocked_reason | str | Human-readable reason for blocking (empty if safe) |
| violations | list[dict] | List of violation dicts with guard, detail, severity keys |
| pii_detected | list[dict] | List of PII findings with type, value, start, end keys |
| sanitized_text | str | Input with PII redacted |
| risk_score | float | Composite risk score (0.0 - 1.0) |
| validation_time_ms | float | Validation latency in milliseconds |
| layer_results | dict | Per-guard detailed results |
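To make the fields above concrete, here is a sketch of a stand-in dataclass and one way a caller might act on it. The field names come from the table; the default values, the `text_to_forward` helper, and the sample `guard` and `severity` strings are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ValidationResult:
    """Stand-in mirroring the documented fields; defaults are assumptions."""
    safe: bool = True
    blocked_reason: str = ""
    violations: list = field(default_factory=list)    # dicts: guard, detail, severity
    pii_detected: list = field(default_factory=list)  # dicts: type, value, start, end
    sanitized_text: str = ""
    risk_score: float = 0.0
    validation_time_ms: float = 0.0
    layer_results: dict = field(default_factory=dict)


def text_to_forward(result: ValidationResult) -> Optional[str]:
    """Hypothetical helper: text that may be sent to the LLM, or None if blocked."""
    if not result.safe:
        return None
    # PII-only findings do not block; the redacted text is forwarded instead.
    return result.sanitized_text


pii_only = ValidationResult(
    safe=True,
    pii_detected=[{"type": "EMAIL_ADDRESS", "value": "john@example.com",
                   "start": 8, "end": 24}],
    sanitized_text="Contact [EMAIL_REDACTED] for details",
)
blocked = ValidationResult(
    safe=False,
    blocked_reason="prompt injection detected",  # illustrative wording
    violations=[{"guard": "injection", "detail": "matched pattern",
                 "severity": "critical"}],       # illustrative values
)
```

Forwarding `sanitized_text` rather than the raw input keeps detected PII out of the prompt even when the request is allowed through.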

Individual Guards

InjectionGuard

Detects prompt injection attempts using compiled regex patterns.

guard = InjectionGuard(patterns=["(?i)ignore previous instructions"])
safe, violations = guard.check("Ignore previous instructions and do X")
# safe=False, violations=["injection_pattern_0: matched 'Ignore previous instructions'"]

Default Patterns Detected:

  • System prompt override ("ignore/disregard/forget previous instructions")
  • Role-play escape ("you are now", "act as", "pretend to be")
  • Instruction hijacking ("do not follow", "new instructions", "instead do")
  • Delimiter injection (Markdown code fences with system/assistant, XML instruction tags)
  • Developer/jailbreak modes ("DAN mode", "developer mode", "god mode")
  • Prompt leaking ("what are your instructions", "repeat your prompt")
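As a rough illustration of how a few of these categories could be expressed as compiled regexes, consider the sketch below. The patterns are simplified examples written for this reference, not the defaults shipped in agent.py, and `check_injection` is a hypothetical function that only mimics the return shape shown in the InjectionGuard example.

```python
import re

# Illustrative patterns only; the default patterns in agent.py may differ.
EXAMPLE_PATTERNS = [
    # System prompt override
    r"(?i)\b(ignore|disregard|forget)\s+(all\s+)?previous\s+instructions\b",
    # Role-play escape
    r"(?i)\b(you\s+are\s+now|act\s+as|pretend\s+to\s+be)\b",
    # Developer/jailbreak modes
    r"(?i)\b(DAN|developer|god)\s+mode\b",
    # Prompt leaking
    r"(?i)\b(what\s+are\s+your\s+instructions|repeat\s+your\s+prompt)\b",
]
COMPILED = [re.compile(p) for p in EXAMPLE_PATTERNS]


def check_injection(text: str):
    """Return (safe, violations), one violation string per matching pattern."""
    violations = []
    for index, pattern in enumerate(COMPILED):
        match = pattern.search(text)
        if match:
            violations.append(
                f"injection_pattern_{index}: matched '{match.group(0)}'"
            )
    return (not violations, violations)
```

Pattern lists like this catch known phrasings but are easy to evade with paraphrase, which is one reason the pipeline layers several guards.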

ContentPolicyGuard

Enforces blocked patterns and topic restrictions.

guard = ContentPolicyGuard(
    blocked_patterns=[r"(?i)how to hack"],
    blocked_topics=["violence", "illegal_activities"],
)
safe, violations = guard.check("How to hack into a WiFi network")
# safe=False, violations=["blocked_content_0: matched 'How to hack'"]

Supported Topic Categories:

  • violence -- Physical harm, assault, murder
  • illegal_activities -- Fraud, money laundering, trafficking
  • weapons -- Firearms, explosives, 3D-printed weapons
  • drugs -- Drug synthesis, manufacturing instructions
  • exploitation -- Child exploitation, human trafficking
  • politics -- Partisan political opinions or endorsements
  • competitor_products -- References to switching to competitors

PIIGuard

Detects and redacts personally identifiable information using regex patterns.

guard = PIIGuard(pii_patterns={"EMAIL_ADDRESS": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"})
findings = guard.detect("Contact john@example.com for details")
# [{"type": "EMAIL_ADDRESS", "value": "john@example.com", "start": 8, "end": 24}]

redacted, findings = guard.redact("Contact john@example.com for details")
# ("Contact [EMAIL_REDACTED] for details", [...])

Supported PII Types:

| Type | Example match | Redaction |
|---|---|---|
| US_SSN | 123-45-6789 | [SSN_REDACTED] |
| EMAIL_ADDRESS | user@domain.com | [EMAIL_REDACTED] |
| PHONE_NUMBER | (555) 123-4567 | [PHONE_REDACTED] |
| CREDIT_CARD | 4111-1111-1111-1111 | [CARD_REDACTED] |
| IP_ADDRESS | 192.168.1.1 | [IP_REDACTED] |
| US_PASSPORT | A12345678 | [PASSPORT_REDACTED] |
| AWS_ACCESS_KEY | AKIAIOSFODNN7EXAMPLE | [AWS_KEY_REDACTED] |
| GENERIC_API_KEY | api_key=abc123... | [API_KEY_REDACTED] |
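The detect-then-redact flow can be sketched for two of the types above. This is an illustrative reimplementation, not PIIGuard itself: the regexes are simplified, and the module-level `detect` and `redact` functions are hypothetical stand-ins for the methods of the same name.

```python
import re

# Two of the documented types; these regexes are simplified illustrations.
PII_PATTERNS = {
    "US_SSN": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN_REDACTED]"),
    "EMAIL_ADDRESS": (
        re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
        "[EMAIL_REDACTED]",
    ),
}


def detect(text: str) -> list:
    """Return findings sorted by position, in the documented dict shape."""
    findings = []
    for pii_type, (pattern, _label) in PII_PATTERNS.items():
        for m in pattern.finditer(text):
            findings.append({"type": pii_type, "value": m.group(0),
                             "start": m.start(), "end": m.end()})
    return sorted(findings, key=lambda f: f["start"])


def redact(text: str):
    """Return (redacted_text, findings)."""
    findings = detect(text)
    redacted = text
    # Replace from right to left so earlier offsets stay valid.
    for f in reversed(findings):
        label = PII_PATTERNS[f["type"]][1]
        redacted = redacted[:f["start"]] + label + redacted[f["end"]:]
    return redacted, findings
```

Replacing matches from the end of the string backwards avoids recomputing offsets after each substitution changes the text length.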

OutputGuard

Validates LLM output for safety violations, length limits, system prompt leakage, and PII.

guard = OutputGuard(blocked_patterns=[...], max_length=8000)
safe, violations = guard.check("Sure, I'll help you hack into the system")
# safe=False, violations=["output_blocked_0: matched ..."]

LengthGuard

Enforces maximum input length.

guard = LengthGuard(max_length=4000)
safe, violations = guard.check("x" * 5000)
# safe=False, violations=["input_too_long: 5000 chars exceeds 4000 limit"]

Content Policy JSON Schema

{
  "allowed_topics": ["list of allowed topic strings"],
  "blocked_topics": ["violence", "illegal_activities", "weapons", "drugs", "exploitation"],
  "blocked_patterns": ["regex patterns for blocked content"],
  "pii_patterns": {
    "ENTITY_TYPE": "regex pattern"
  },
  "injection_patterns": ["regex patterns for injection detection"],
  "max_input_length": 4000,
  "max_output_length": 8000,
  "output_blocked_patterns": ["regex patterns for blocked output content"]
}
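Because the policy file stores regexes inside JSON strings, backslashes must be doubled (`"\\b"` in the file becomes `\b` once parsed). The sketch below parses a small example policy and runs a pre-flight check on it. `check_policy` is a hypothetical helper, not part of agent.py, and the sample values are illustrative.

```python
import json
import re

# Example policy file content. Note the doubled backslashes: JSON requires
# "\\b" to represent the regex token \b.
POLICY_JSON = r"""
{
  "blocked_topics": ["violence", "weapons"],
  "blocked_patterns": ["(?i)how to hack"],
  "pii_patterns": {"US_SSN": "\\b\\d{3}-\\d{2}-\\d{4}\\b"},
  "injection_patterns": ["(?i)ignore previous instructions"],
  "max_input_length": 2000
}
"""


def check_policy(policy: dict) -> list:
    """Hypothetical pre-flight check: return a list of problems in a policy dict."""
    problems = []
    patterns = []
    for key in ("blocked_patterns", "injection_patterns", "output_blocked_patterns"):
        patterns += [(key, p) for p in policy.get(key, [])]
    patterns += [(f"pii_patterns.{name}", p)
                 for name, p in policy.get("pii_patterns", {}).items()]
    for where, pattern in patterns:
        try:
            re.compile(pattern)
        except re.error as exc:
            problems.append(f"{where}: invalid regex {pattern!r} ({exc})")
    for key in ("max_input_length", "max_output_length"):
        value = policy.get(key)
        if value is not None and (not isinstance(value, int) or value <= 0):
            problems.append(f"{key}: expected a positive integer, got {value!r}")
    return problems


policy = json.loads(POLICY_JSON)
problems = check_policy(policy)
```

Running a check like this before constructing the pipeline surfaces malformed patterns early, instead of at the first request that exercises them.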

CLI Reference

usage: agent.py [-h] [--input INPUT] [--response RESPONSE] [--file FILE]
                [--mode {full,input-only,output-only,pii}]
                [--policy POLICY] [--output {text,json}]

Arguments:
  --input, -i       User input text to validate
  --response, -r    LLM response to validate (required for output-only mode)
  --file, -f        Path to file with one prompt per line
  --mode, -m        Validation mode: full | input-only | output-only | pii (default: full)
  --policy, -p      Path to JSON content policy file
  --output, -o      Output format: text | json (default: text)

Exit Codes:

  • 0 -- All inputs passed validation
  • 1 -- Error (file not found, invalid policy)
  • 2 -- One or more inputs blocked or flagged
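A calling script can branch on these exit codes. The sketch below is one possible wrapper: it uses only the flags listed in the usage text, and it assumes agent.py is in the current working directory. `classify_exit` and `run_guardrails` are hypothetical names, and the status labels are illustrative.

```python
import subprocess
import sys

# Status labels are illustrative; the exit codes are the documented ones.
EXIT_MEANINGS = {
    0: "passed",
    1: "error",
    2: "blocked_or_flagged",
}


def classify_exit(code: int) -> str:
    """Map a CLI exit code to a short status label."""
    return EXIT_MEANINGS.get(code, "unknown")


def run_guardrails(prompt: str, script: str = "agent.py") -> str:
    """Run the CLI on one prompt in input-only mode and classify the result.

    Assumes the script path is valid relative to the working directory.
    """
    completed = subprocess.run(
        [sys.executable, script, "--input", prompt,
         "--mode", "input-only", "--output", "json"],
        capture_output=True,
        text=True,
    )
    return classify_exit(completed.returncode)
```

In a CI job or gateway script, treating exit code 2 separately from 1 distinguishes "the guardrails fired" from "the tool itself failed".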

External Resources