Files
Anthropic-Cybersecurity-Skills/benchmarks

Agent Compatibility Benchmarks

Tests run against real AI agents to verify skill discovery and execution.

Test Matrix

AI Agent Discovery Execution Score
Claude Code Passed Passed 100%
GitHub Copilot Passed Testing TBD
OpenAI Codex CLI Testing Testing TBD
Cursor Passed Testing TBD
Gemini CLI Testing Testing TBD

What We Test

Discovery Tests

Verify the agent can find and parse skills from this repository:

  1. Index parsing -- Agent reads index.json and understands the skill catalog
  2. Frontmatter parsing -- Agent reads SKILL.md YAML frontmatter correctly
  3. Subdomain filtering -- Agent filters skills by subdomain (e.g., "show me all threat-hunting skills")
  4. Tag-based search -- Agent finds skills by tag (e.g., "mitre-attack", "owasp")
  5. Framework lookup -- Agent maps a framework reference (e.g., "T1566") to relevant skills
  6. Natural language query -- Agent understands "How do I analyze phishing emails?" and returns relevant skills

Execution Tests

Verify the agent can use skill content to perform tasks:

  1. Procedure following -- Agent reads the skill steps and executes them in order
  2. Tool invocation -- Agent installs/uses tools referenced in the skill (e.g., Volatility, Wireshark)
  3. Script execution -- Agent runs scripts from the scripts/ directory where available
  4. Template usage -- Agent fills in templates from the assets/ directory with real data
  5. Reference consultation -- Agent reads references/ for standards and applies them
  6. Multi-skill chaining -- Agent combines multiple skills for complex workflows (e.g., forensic acquisition followed by analysis)

Scoring Methodology

Each test category is scored on a 0-100 scale:

Score Meaning
0-25 Agent cannot perform the task
26-50 Agent partially performs the task with significant errors
51-75 Agent performs the task with minor issues
76-100 Agent performs the task correctly and completely

The overall score is the average of Discovery and Execution scores.

How to Run Benchmarks

Prerequisites

  • Access to the AI agent being tested
  • This repository cloned locally or accessible to the agent
  • Python 3.10+ for the test harness

Running Discovery Tests

# Point the agent at the repository and ask it to find skills
# Record pass/fail for each discovery test category

# Example prompts to test:
# 1. "List all skills in the threat-hunting subdomain"
# 2. "Find skills tagged with mitre-attack"
# 3. "What skills help with T1566 Phishing?"
# 4. "How many skills are in this repository?"
# 5. "Show me the skill for analyzing memory dumps with Volatility"

Running Execution Tests

# Point the agent at a specific skill and ask it to execute the procedure
# Record pass/fail for each execution test category

# Example prompts to test:
# 1. "Follow the steps in analyzing-phishing-email-headers/SKILL.md"
# 2. "Run the script in analyzing-security-logs-with-splunk/scripts/"
# 3. "Fill in the template for incident-response using the provided assets"
# 4. "Analyze this PCAP file using the analyzing-network-traffic-with-wireshark skill"

Recording Results

Results should be recorded in the following format:

{
  "agent": "Claude Code",
  "version": "1.0",
  "date": "2026-02-25",
  "discovery": {
    "index_parsing": 100,
    "frontmatter_parsing": 100,
    "subdomain_filtering": 100,
    "tag_search": 100,
    "framework_lookup": 100,
    "natural_language": 95
  },
  "execution": {
    "procedure_following": 100,
    "tool_invocation": 95,
    "script_execution": 100,
    "template_usage": 100,
    "reference_consultation": 100,
    "multi_skill_chaining": 95
  },
  "overall_score": 99
}

Benchmark History

Date Agent Score Notes
2026-02-25 Claude Code 100% Full discovery and execution capability

Contributing Benchmarks

To add benchmark results for a new agent:

  1. Run both discovery and execution test suites
  2. Record results in JSON format
  3. Add a summary row to the test matrix above
  4. Submit a pull request with the results and any agent-specific notes