Mirror of https://github.com/mukul975/Anthropic-Cybersecurity-Skills.git (synced 2026-06-10 21:24:56 +03:00)
Agent Compatibility Benchmarks
Tests run against real AI agents to verify skill discovery and execution.
Test Matrix
| AI Agent | Discovery | Execution | Score |
|---|---|---|---|
| Claude Code | Passed | Passed | 100% |
| GitHub Copilot | Passed | Testing | TBD |
| OpenAI Codex CLI | Testing | Testing | TBD |
| Cursor | Passed | Testing | TBD |
| Gemini CLI | Testing | Testing | TBD |
What We Test
Discovery Tests
Verify the agent can find and parse skills from this repository:
- Index parsing -- Agent reads `index.json` and understands the skill catalog
- Frontmatter parsing -- Agent reads SKILL.md YAML frontmatter correctly
- Subdomain filtering -- Agent filters skills by subdomain (e.g., "show me all threat-hunting skills")
- Tag-based search -- Agent finds skills by tag (e.g., "mitre-attack", "owasp")
- Framework lookup -- Agent maps a framework reference (e.g., "T1566") to relevant skills
- Natural language query -- Agent understands "How do I analyze phishing emails?" and returns relevant skills
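To judge whether an agent's answers to the filtering checks above are correct, it helps to compute the expected result independently. The following is a minimal sketch, assuming `index.json` holds a list of skill objects with `name`, `subdomain`, and `tags` fields. The repository's actual schema is not described here, so those field names, the subdomain values, and the sample entries are all hypothetical.

```python
import json

# Hypothetical excerpt of index.json; the real schema and values may differ.
SAMPLE_INDEX = json.loads("""
[
  {"name": "analyzing-phishing-email-headers",
   "subdomain": "phishing-defense", "tags": ["mitre-attack", "phishing"]},
  {"name": "analyzing-security-logs-with-splunk",
   "subdomain": "threat-hunting", "tags": ["mitre-attack", "splunk"]},
  {"name": "analyzing-network-traffic-with-wireshark",
   "subdomain": "network-security", "tags": ["wireshark"]}
]
""")


def filter_by_subdomain(skills, subdomain):
    """Return the names of skills whose subdomain matches exactly."""
    return [s["name"] for s in skills if s.get("subdomain") == subdomain]


def filter_by_tag(skills, tag):
    """Return the names of skills that carry the given tag."""
    return [s["name"] for s in skills if tag in s.get("tags", [])]


if __name__ == "__main__":
    # Compare these lists with what the agent reports for the same query.
    print(filter_by_subdomain(SAMPLE_INDEX, "threat-hunting"))
    print(filter_by_tag(SAMPLE_INDEX, "mitre-attack"))
```

Comparing the agent's reply against these lists gives a repeatable basis for scoring the subdomain and tag checks, rather than relying on a reviewer's memory of the catalog.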
Execution Tests
Verify the agent can use skill content to perform tasks:
- Procedure following -- Agent reads the skill steps and executes them in order
- Tool invocation -- Agent installs/uses tools referenced in the skill (e.g., Volatility, Wireshark)
- Script execution -- Agent runs scripts from the `scripts/` directory where available
- Template usage -- Agent fills in templates from the `assets/` directory with real data
- Reference consultation -- Agent reads `references/` for standards and applies them
- Multi-skill chaining -- Agent combines multiple skills for complex workflows (e.g., forensic acquisition followed by analysis)
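Because `scripts/`, `assets/`, and `references/` are only present for some skills, it is worth checking which of them a skill actually ships before choosing execution tests for it. The sketch below assumes each skill lives in its own directory containing `SKILL.md` plus those optional subdirectories. The helper name `describe_skill` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Optional subdirectories named in the execution tests above.
OPTIONAL_DIRS = ("scripts", "assets", "references")


def describe_skill(skill_dir):
    """Report which parts of a skill directory exist on disk."""
    skill_dir = Path(skill_dir)
    report = {"SKILL.md": (skill_dir / "SKILL.md").is_file()}
    for name in OPTIONAL_DIRS:
        report[name] = (skill_dir / name).is_dir()
    return report


if __name__ == "__main__":
    # Path is relative to the repository root; adjust to your checkout.
    print(describe_skill("analyzing-phishing-email-headers"))
```

A skill whose report shows `scripts` as `False` can be skipped for the script execution test rather than counted as an agent failure.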
Scoring Methodology
Each test category is scored on a 0-100 scale:
| Score | Meaning |
|---|---|
| 0-25 | Agent cannot perform the task |
| 26-50 | Agent partially performs the task with significant errors |
| 51-75 | Agent performs the task with minor issues |
| 76-100 | Agent performs the task correctly and completely |
The overall score is the average of Discovery and Execution scores.
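The scoring rules can be written down as a short script. This is a sketch under two assumptions not stated in the text: the Discovery and Execution scores are each the mean of their category scores, and the overall score is rounded to the nearest integer. The function names are illustrative.

```python
from statistics import mean

# Upper bound of each band and its meaning, taken from the table above.
BANDS = [
    (25, "Agent cannot perform the task"),
    (50, "Agent partially performs the task with significant errors"),
    (75, "Agent performs the task with minor issues"),
    (100, "Agent performs the task correctly and completely"),
]


def band_for(score):
    """Map an integer score in 0-100 to its meaning."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    for upper, meaning in BANDS:
        if score <= upper:
            return meaning


def overall_score(discovery, execution):
    """Average the two suite means (assumed) and round to an integer."""
    suite_means = [mean(discovery.values()), mean(execution.values())]
    return round(mean(suite_means))


if __name__ == "__main__":
    print(band_for(80))
    print(overall_score({"index_parsing": 100, "tag_search": 90},
                        {"procedure_following": 70}))
```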
How to Run Benchmarks
Prerequisites
- Access to the AI agent being tested
- This repository cloned locally or accessible to the agent
- Python 3.10+ for the test harness
Running Discovery Tests
# Point the agent at the repository and ask it to find skills
# Record pass/fail and a 0-100 score for each discovery test category
# Example prompts to test:
# 1. "List all skills in the threat-hunting subdomain"
# 2. "Find skills tagged with mitre-attack"
# 3. "What skills help with T1566 Phishing?"
# 4. "How many skills are in this repository?"
# 5. "Show me the skill for analyzing memory dumps with Volatility"
Running Execution Tests
# Point the agent at a specific skill and ask it to execute the procedure
# Record pass/fail and a 0-100 score for each execution test category
# Example prompts to test:
# 1. "Follow the steps in analyzing-phishing-email-headers/SKILL.md"
# 2. "Run the script in analyzing-security-logs-with-splunk/scripts/"
# 3. "Fill in the template for incident-response using the provided assets"
# 4. "Analyze this PCAP file using the analyzing-network-traffic-with-wireshark skill"
Recording Results
Results should be recorded in the following format:
{
  "agent": "Claude Code",
  "version": "1.0",
  "date": "2026-02-25",
  "discovery": {
    "index_parsing": 100,
    "frontmatter_parsing": 100,
    "subdomain_filtering": 100,
    "tag_search": 100,
    "framework_lookup": 100,
    "natural_language": 95
  },
  "execution": {
    "procedure_following": 100,
    "tool_invocation": 95,
    "script_execution": 100,
    "template_usage": 100,
    "reference_consultation": 100,
    "multi_skill_chaining": 95
  },
  "overall_score": 99
}
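Before submitting a result file, a quick structural check can catch missing categories or out-of-range scores. The sketch below validates a record against the field names in the format above. The function `validate_result` is a hypothetical helper and not part of the repository's test harness, and the ISO date check is an assumption based on the example value.

```python
from datetime import date

# Category names per suite, copied from the result format above.
EXPECTED = {
    "discovery": {"index_parsing", "frontmatter_parsing", "subdomain_filtering",
                  "tag_search", "framework_lookup", "natural_language"},
    "execution": {"procedure_following", "tool_invocation", "script_execution",
                  "template_usage", "reference_consultation",
                  "multi_skill_chaining"},
}


def validate_result(record):
    """Return a list of problems found in a result record (empty if none)."""
    problems = []
    for field in ("agent", "version", "date", "overall_score"):
        if field not in record:
            problems.append(f"missing field: {field}")
    if "date" in record:
        try:
            date.fromisoformat(record["date"])  # assumes YYYY-MM-DD
        except (TypeError, ValueError):
            problems.append(f"invalid date: {record['date']!r}")
    for suite, categories in EXPECTED.items():
        scores = record.get(suite, {})
        for name in sorted(categories - set(scores)):
            problems.append(f"missing {suite} category: {name}")
        for name, value in scores.items():
            if not isinstance(value, (int, float)) or not 0 <= value <= 100:
                problems.append(f"{suite}.{name} out of range: {value!r}")
    return problems
```

An empty list means the record has every expected field and all scores fall within 0-100. It does not verify that the scores themselves are accurate.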
Benchmark History
| Date | Agent | Score | Notes |
|---|---|---|---|
| 2026-02-25 | Claude Code | 100% | Full discovery and execution capability |
Contributing Benchmarks
To add benchmark results for a new agent:
- Run both discovery and execution test suites
- Record results in JSON format
- Add a summary row to the test matrix above
- Submit a pull request with the results and any agent-specific notes