# Agent Compatibility Benchmarks Tests run against real AI agents to verify skill discovery and execution. ## Test Matrix | AI Agent | Discovery | Execution | Score | |----------|-----------|-----------|-------| | Claude Code | Passed | Passed | 100% | | GitHub Copilot | Passed | Testing | TBD | | OpenAI Codex CLI | Testing | Testing | TBD | | Cursor | Passed | Testing | TBD | | Gemini CLI | Testing | Testing | TBD | ## What We Test ### Discovery Tests Verify the agent can find and parse skills from this repository: 1. **Index parsing** -- Agent reads `index.json` and understands the skill catalog 2. **Frontmatter parsing** -- Agent reads SKILL.md YAML frontmatter correctly 3. **Subdomain filtering** -- Agent filters skills by subdomain (e.g., "show me all threat-hunting skills") 4. **Tag-based search** -- Agent finds skills by tag (e.g., "mitre-attack", "owasp") 5. **Framework lookup** -- Agent maps a framework reference (e.g., "T1566") to relevant skills 6. **Natural language query** -- Agent understands "How do I analyze phishing emails?" and returns relevant skills ### Execution Tests Verify the agent can use skill content to perform tasks: 1. **Procedure following** -- Agent reads the skill steps and executes them in order 2. **Tool invocation** -- Agent installs/uses tools referenced in the skill (e.g., Volatility, Wireshark) 3. **Script execution** -- Agent runs scripts from the `scripts/` directory where available 4. **Template usage** -- Agent fills in templates from the `assets/` directory with real data 5. **Reference consultation** -- Agent reads `references/` for standards and applies them 6. **Multi-skill chaining** -- Agent combines multiple skills for complex workflows (e.g., forensic acquisition followed by analysis) ## Scoring Methodology Each test category is scored on a 0-100 scale: | Score | Meaning | |-------|---------| | 0-25 | Agent cannot perform the task | | 26-50 | Agent partially performs the task with significant errors | | 51-75 | Agent performs the task with minor issues | | 76-100 | Agent performs the task correctly and completely | The overall score is the average of Discovery and Execution scores. ## How to Run Benchmarks ### Prerequisites - Access to the AI agent being tested - This repository cloned locally or accessible to the agent - Python 3.10+ for the test harness ### Running Discovery Tests ```bash # Point the agent at the repository and ask it to find skills # Record pass/fail for each discovery test category # Example prompts to test: # 1. "List all skills in the threat-hunting subdomain" # 2. "Find skills tagged with mitre-attack" # 3. "What skills help with T1566 Phishing?" # 4. "How many skills are in this repository?" # 5. "Show me the skill for analyzing memory dumps with Volatility" ``` ### Running Execution Tests ```bash # Point the agent at a specific skill and ask it to execute the procedure # Record pass/fail for each execution test category # Example prompts to test: # 1. "Follow the steps in analyzing-phishing-email-headers/SKILL.md" # 2. "Run the script in analyzing-security-logs-with-splunk/scripts/" # 3. "Fill in the template for incident-response using the provided assets" # 4. "Analyze this PCAP file using the analyzing-network-traffic-with-wireshark skill" ``` ### Recording Results Results should be recorded in the following format: ```json { "agent": "Claude Code", "version": "1.0", "date": "2026-02-25", "discovery": { "index_parsing": 100, "frontmatter_parsing": 100, "subdomain_filtering": 100, "tag_search": 100, "framework_lookup": 100, "natural_language": 95 }, "execution": { "procedure_following": 100, "tool_invocation": 95, "script_execution": 100, "template_usage": 100, "reference_consultation": 100, "multi_skill_chaining": 95 }, "overall_score": 99 } ``` ## Benchmark History | Date | Agent | Score | Notes | |------|-------|-------|-------| | 2026-02-25 | Claude Code | 100% | Full discovery and execution capability | ## Contributing Benchmarks To add benchmark results for a new agent: 1. Run both discovery and execution test suites 2. Record results in JSON format 3. Add a summary row to the test matrix above 4. Submit a pull request with the results and any agent-specific notes