mirror of
https://github.com/mukul975/Anthropic-Cybersecurity-Skills.git
synced 2026-06-13 14:44:58 +03:00
# Agent Compatibility Benchmarks

Tests run against real AI agents to verify skill discovery and execution.

## Test Matrix

| AI Agent | Discovery | Execution | Score |
|----------|-----------|-----------|-------|
| Claude Code | Passed | Passed | 100% |
| GitHub Copilot | Passed | Testing | TBD |
| OpenAI Codex CLI | Testing | Testing | TBD |
| Cursor | Passed | Testing | TBD |
| Gemini CLI | Testing | Testing | TBD |

## What We Test

### Discovery Tests

Verify the agent can find and parse skills from this repository:

1. **Index parsing** -- Agent reads `index.json` and understands the skill catalog
2. **Frontmatter parsing** -- Agent reads SKILL.md YAML frontmatter correctly
3. **Subdomain filtering** -- Agent filters skills by subdomain (e.g., "show me all threat-hunting skills")
4. **Tag-based search** -- Agent finds skills by tag (e.g., "mitre-attack", "owasp")
5. **Framework lookup** -- Agent maps a framework reference (e.g., "T1566") to relevant skills
6. **Natural language query** -- Agent understands "How do I analyze phishing emails?" and returns relevant skills

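To make the subdomain and tag checks concrete, here is a minimal sketch of the kind of lookup an agent would need to perform. It assumes a hypothetical catalog shape (a top-level `skills` list whose entries carry `name`, `subdomain`, and `tags`); the real `index.json` schema may differ, and the subdomain and tag values in the sample are made up for illustration.

```python
import json

# Hypothetical sample catalog. The field names and the subdomain/tag values
# are assumptions for illustration, not the repository's actual index.json.
SAMPLE_INDEX = """
{
  "skills": [
    {"name": "analyzing-phishing-email-headers",
     "subdomain": "phishing-defense",
     "tags": ["mitre-attack", "email"]},
    {"name": "analyzing-security-logs-with-splunk",
     "subdomain": "threat-hunting",
     "tags": ["mitre-attack", "siem"]},
    {"name": "analyzing-network-traffic-with-wireshark",
     "subdomain": "network-security",
     "tags": ["pcap"]}
  ]
}
"""


def load_skills(index_text):
    """Parse the catalog text and return the list of skill entries."""
    return json.loads(index_text)["skills"]


def filter_by_subdomain(skills, subdomain):
    """Return the names of skills whose subdomain matches exactly."""
    return [s["name"] for s in skills if s.get("subdomain") == subdomain]


def filter_by_tag(skills, tag):
    """Return the names of skills that carry the given tag."""
    return [s["name"] for s in skills if tag in s.get("tags", [])]


skills = load_skills(SAMPLE_INDEX)
print(filter_by_subdomain(skills, "threat-hunting"))
print(filter_by_tag(skills, "mitre-attack"))
```

Comparing an agent's answer against the output of a lookup like this gives a reproducible expected result for the subdomain-filtering and tag-search categories.
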
### Execution Tests

Verify the agent can use skill content to perform tasks:

1. **Procedure following** -- Agent reads the skill steps and executes them in order
2. **Tool invocation** -- Agent installs/uses tools referenced in the skill (e.g., Volatility, Wireshark)
3. **Script execution** -- Agent runs scripts from the `scripts/` directory where available
4. **Template usage** -- Agent fills in templates from the `assets/` directory with real data
5. **Reference consultation** -- Agent reads `references/` for standards and applies them
6. **Multi-skill chaining** -- Agent combines multiple skills for complex workflows (e.g., forensic acquisition followed by analysis)

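Several execution categories only apply when a skill actually ships the relevant directory. The sketch below, which assumes each skill lives in its own folder containing `SKILL.md` plus optional `scripts/`, `assets/`, and `references/` subdirectories, reports which resources are present so a tester knows which categories can be exercised. The `skill_resources` helper and the `example-skill` folder are hypothetical.

```python
import tempfile
from pathlib import Path

# Optional subdirectories named in the execution test list above.
OPTIONAL_DIRS = ("scripts", "assets", "references")


def skill_resources(skill_dir):
    """Report whether SKILL.md and each optional directory are present.

    A directory counts as present only if it exists and is non-empty.
    """
    skill_dir = Path(skill_dir)
    report = {"SKILL.md": (skill_dir / "SKILL.md").is_file()}
    for name in OPTIONAL_DIRS:
        sub = skill_dir / name
        report[name] = sub.is_dir() and any(sub.iterdir())
    return report


# Demo against a throwaway skill folder (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp) / "example-skill"
    (skill / "scripts").mkdir(parents=True)
    (skill / "SKILL.md").write_text("---\nname: example-skill\n---\n")
    (skill / "scripts" / "run.py").write_text("print('ok')\n")
    demo_report = skill_resources(skill)

print(demo_report)
```

A skill with no `scripts/` directory, for example, would simply be skipped for the script-execution category rather than scored as a failure.
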
## Scoring Methodology

Each test category is scored on a 0-100 scale:

| Score | Meaning |
|-------|---------|
| 0-25 | Agent cannot perform the task |
| 26-50 | Agent partially performs the task with significant errors |
| 51-75 | Agent performs the task with minor issues |
| 76-100 | Agent performs the task correctly and completely |

The overall score is the average of Discovery and Execution scores.

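The bands and the averaging rule above can be sketched in a few lines. The function names are illustrative, and the treatment of fractional scores that fall between two bands (for example 25.5) is an assumption, since the table only lists integer ranges.

```python
def score_band(score):
    """Map a 0-100 score to the meaning given in the scoring table.

    Assumption: a fractional score between two bands (e.g. 25.5) is
    assigned to the higher band; the table itself only lists integers.
    """
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score <= 25:
        return "Agent cannot perform the task"
    if score <= 50:
        return "Agent partially performs the task with significant errors"
    if score <= 75:
        return "Agent performs the task with minor issues"
    return "Agent performs the task correctly and completely"


def overall_score(discovery, execution):
    """Average the Discovery and Execution scores."""
    return (discovery + execution) / 2


print(score_band(80))
print(overall_score(100, 50))
```
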
## How to Run Benchmarks

### Prerequisites

- Access to the AI agent being tested
- This repository cloned locally or accessible to the agent
- Python 3.10+ for the test harness

### Running Discovery Tests

```bash
# Point the agent at the repository and ask it to find skills
# Record pass/fail for each discovery test category

# Example prompts to test:
# 1. "List all skills in the threat-hunting subdomain"
# 2. "Find skills tagged with mitre-attack"
# 3. "What skills help with T1566 Phishing?"
# 4. "How many skills are in this repository?"
# 5. "Show me the skill for analyzing memory dumps with Volatility"
```

### Running Execution Tests

```bash
# Point the agent at a specific skill and ask it to execute the procedure
# Record pass/fail for each execution test category

# Example prompts to test:
# 1. "Follow the steps in analyzing-phishing-email-headers/SKILL.md"
# 2. "Run the script in analyzing-security-logs-with-splunk/scripts/"
# 3. "Fill in the template for incident-response using the provided assets"
# 4. "Analyze this PCAP file using the analyzing-network-traffic-with-wireshark skill"
```

### Recording Results

Results should be recorded in the following format:

```json
{
  "agent": "Claude Code",
  "version": "1.0",
  "date": "2026-02-25",
  "discovery": {
    "index_parsing": 100,
    "frontmatter_parsing": 100,
    "subdomain_filtering": 100,
    "tag_search": 100,
    "framework_lookup": 100,
    "natural_language": 95
  },
  "execution": {
    "procedure_following": 100,
    "tool_invocation": 95,
    "script_execution": 100,
    "template_usage": 100,
    "reference_consultation": 100,
    "multi_skill_chaining": 95
  },
  "overall_score": 99
}
```

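A record in this format can be sanity-checked before it is submitted. The sketch below makes two assumptions that this document does not state: each category score is the mean of its six sub-scores, and the overall score is rounded to the nearest integer. Under those assumptions the example record is internally consistent.

```python
from statistics import mean

# The example record from the format above.
record = {
    "agent": "Claude Code",
    "version": "1.0",
    "date": "2026-02-25",
    "discovery": {
        "index_parsing": 100,
        "frontmatter_parsing": 100,
        "subdomain_filtering": 100,
        "tag_search": 100,
        "framework_lookup": 100,
        "natural_language": 95,
    },
    "execution": {
        "procedure_following": 100,
        "tool_invocation": 95,
        "script_execution": 100,
        "template_usage": 100,
        "reference_consultation": 100,
        "multi_skill_chaining": 95,
    },
    "overall_score": 99,
}


def scores_in_range(record):
    """Check that every sub-score lies on the 0-100 scale."""
    return all(
        0 <= value <= 100
        for category in ("discovery", "execution")
        for value in record[category].values()
    )


def expected_overall(record):
    """Average the two category means and round to an integer.

    Assumption: a category score is the mean of its sub-scores.
    """
    discovery = mean(record["discovery"].values())
    execution = mean(record["execution"].values())
    return round((discovery + execution) / 2)


print(scores_in_range(record))
print(expected_overall(record) == record["overall_score"])
```
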
## Benchmark History

| Date | Agent | Score | Notes |
|------|-------|-------|-------|
| 2026-02-25 | Claude Code | 100% | Full discovery and execution capability |

## Contributing Benchmarks

To add benchmark results for a new agent:

1. Run both discovery and execution test suites
2. Record results in JSON format
3. Add a summary row to the test matrix above
4. Submit a pull request with the results and any agent-specific notes
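
For step 3, a summary row can be generated rather than typed by hand. This is a small sketch with a hypothetical helper name; it emits a row in the same four-column layout as the test matrix, and the "Example Agent" values are placeholders, not real results.

```python
def matrix_row(agent, discovery, execution, score):
    """Format one row for the four-column test matrix."""
    return f"| {agent} | {discovery} | {execution} | {score} |"


# Placeholder values for illustration only.
print(matrix_row("Example Agent", "Passed", "Testing", "TBD"))
```
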