Index parsing -- Agent reads index.json and understands the skill catalog
Frontmatter parsing -- Agent reads SKILL.md YAML frontmatter correctly
Subdomain filtering -- Agent filters skills by subdomain (e.g., "show me all threat-hunting skills")
Tag-based search -- Agent finds skills by tag (e.g., "mitre-attack", "owasp")
Framework lookup -- Agent maps a framework reference (e.g., "T1566") to relevant skills
Natural language query -- Agent understands "How do I analyze phishing emails?" and returns relevant skills

Execution Tests

Verify the agent can use skill content to perform tasks:

Procedure following -- Agent reads the skill steps and executes them in order
Tool invocation -- Agent installs/uses tools referenced in the skill (e.g., Volatility, Wireshark)
Script execution -- Agent runs scripts from the scripts/ directory where available
Template usage -- Agent fills in templates from the assets/ directory with real data
Reference consultation -- Agent reads references/ for standards and applies them
Multi-skill chaining -- Agent combines multiple skills for complex workflows (e.g., forensic acquisition followed by analysis)

Scoring Methodology

Each test category is scored on a 0-100 scale:

Score	Meaning
0-25	Agent cannot perform the task
26-50	Agent partially performs the task with significant errors
51-75	Agent performs the task with minor issues
76-100	Agent performs the task correctly and completely

The overall score is the average of Discovery and Execution scores.

How to Run Benchmarks

Prerequisites

Access to the AI agent being tested
This repository cloned locally or accessible to the agent
Python 3.10+ for the test harness

Running Discovery Tests

# Point the agent at the repository and ask it to find skills
# Record pass/fail for each discovery test category

# Example prompts to test:
# 1. "List all skills in the threat-hunting subdomain"
# 2. "Find skills tagged with mitre-attack"
# 3. "What skills help with T1566 Phishing?"
# 4. "How many skills are in this repository?"
# 5. "Show me the skill for analyzing memory dumps with Volatility"

Running Execution Tests

# Point the agent at a specific skill and ask it to execute the procedure
# Record pass/fail for each execution test category

# Example prompts to test:
# 1. "Follow the steps in analyzing-phishing-email-headers/SKILL.md"
# 2. "Run the script in analyzing-security-logs-with-splunk/scripts/"
# 3. "Fill in the template for incident-response using the provided assets"
# 4. "Analyze this PCAP file using the analyzing-network-traffic-with-wireshark skill"

Recording Results

Results should be recorded in the following format:

{
  "agent": "Claude Code",
  "version": "1.0",
  "date": "2026-02-25",
  "discovery": {
    "index_parsing": 100,
    "frontmatter_parsing": 100,
    "subdomain_filtering": 100,
    "tag_search": 100,
    "framework_lookup": 100,
    "natural_language": 95
  },
  "execution": {
    "procedure_following": 100,
    "tool_invocation": 95,
    "script_execution": 100,
    "template_usage": 100,
    "reference_consultation": 100,
    "multi_skill_chaining": 95
  },
  "overall_score": 99
}

Benchmark History

Date	Agent	Score	Notes
2026-02-25	Claude Code	100%	Full discovery and execution capability

Contributing Benchmarks

To add benchmark results for a new agent:

Run both discovery and execution test suites
Record results in JSON format
Add a summary row to the test matrix above
Submit a pull request with the results and any agent-specific notes