Files
claudekit/skills/root-cause-tracing/references/tracing-techniques.md
T
2026-04-19 14:10:38 +07:00

5.8 KiB

Tracing Techniques Reference

Backward-tracing techniques for systematic root cause analysis.

Stack Trace Analysis

Reading a Stack Trace

  1. Start at the bottom (most recent call) to find the immediate failure
  2. Scan upward to find the first frame in your code (not library code)
  3. That frame is usually the symptom location, not the cause
  4. Continue upward to find where bad data or state originated

Symptom vs Cause

What You See Likely Actual Cause
NullPointerException / TypeError: cannot read property of undefined Value not set upstream, missing null check at origin
IndexOutOfBoundsException Off-by-one in loop logic or empty collection not guarded
ConnectionRefusedError Service down, wrong port, firewall rule, DNS resolution
TimeoutError Deadlock, resource exhaustion, slow query, network partition
ValidationError Caller passing wrong shape, schema mismatch, migration gap

Tips

  • Filter out framework frames to reduce noise
  • In async code, the stack may be split; look for caused by or previous sections
  • In Python, read __cause__ and __context__ on chained exceptions
  • In TypeScript/Node, check error.cause (ES2022+)

Binary Search / Git Bisect

When to Use

  • Bug exists now but worked at some known-good point
  • Reproducer is automatable (script, test command)

Process

git bisect start
git bisect bad                    # current commit is broken
git bisect good <known-good-sha> # last known working commit
# Git checks out a midpoint; run your test
git bisect good   # or bad, based on result
# Repeat until Git identifies the first bad commit
git bisect reset  # return to original branch

Automated Bisect

git bisect start HEAD <good-sha>
git bisect run ./test-script.sh
# Exit 0 = good, exit 1 = bad, exit 125 = skip

Log Correlation

Technique

  1. Identify the exact timestamp of the error
  2. Search all related service logs within a window (e.g., +/- 30 seconds)
  3. Filter by correlation ID, request ID, or user ID across services
  4. Build a timeline of events across services

Correlation Fields to Look For

  • request_id or trace_id (distributed tracing)
  • user_id or session_id
  • Source IP or client identifier
  • Timestamps (normalize to UTC)

Tools

  • grep / rg with timestamp ranges
  • Structured logging with JSON output + jq
  • Distributed tracing (OpenTelemetry, Jaeger, Zipkin)

Dependency Analysis (Backward Data Flow)

Process

  1. Start at the error location
  2. Identify the variable or value that is wrong
  3. Trace backward: where was this value set?
  4. At each step, ask: is this value correct here? If yes, move forward. If no, keep going back.
  5. The root cause is where correct data first becomes incorrect.

Common Data Flow Points

User Input -> Validation -> Transform -> Business Logic -> Persistence -> Query -> Response

Trace backward through this chain from wherever the error manifests.

Dependency Categories

Dependency What to Check
Function arguments Caller passing wrong values
Config / env vars Wrong environment, stale config
Database state Missing migration, corrupt data
External API Changed response format, auth expiry
Shared state Race condition, stale cache

Instrumentation Points

Where to Add Temporary Logging

  1. Entry/exit of suspected function — log arguments and return value
  2. Before/after external calls — log request and response
  3. Branch points — log which path was taken and why
  4. Data transformation steps — log before and after
  5. Error handlers — log the full error with context

Guidelines

  • Use a distinct prefix (e.g., [DEBUG-TRACE]) so logs are easy to find and remove
  • Log the type as well as the value (catches "null" vs null)
  • In production, use feature flags or debug log levels, not code changes
  • Remove all temporary logging before committing

Python Example

import logging
logger = logging.getLogger(__name__)

def process_order(order_id: str) -> Order:
    logger.debug("[DEBUG-TRACE] process_order called with: %s (type: %s)", order_id, type(order_id))
    order = db.get_order(order_id)
    logger.debug("[DEBUG-TRACE] db.get_order returned: %s", order)
    # ... rest of logic

TypeScript Example

function processOrder(orderId: string): Order {
  console.debug(`[DEBUG-TRACE] processOrder called with: ${orderId} (type: ${typeof orderId})`);
  const order = db.getOrder(orderId);
  console.debug(`[DEBUG-TRACE] db.getOrder returned:`, order);
  // ... rest of logic
}

Common Root Cause Categories

Category Symptoms Investigation Approach
Data issues Wrong output, validation errors, corrupt state Trace the bad value backward through the data flow
Race conditions Intermittent failures, works-on-retry, order-dependent Look for shared mutable state, add timing logs, test with delays
Config drift Works locally but not in staging/prod Diff environment configs, check env vars, verify secrets
Dependency changes Broke after deploy with no code changes Check lock file diffs, dependency changelogs, API version headers
Resource exhaustion Timeouts, OOM, connection pool errors Monitor metrics (memory, CPU, connections, disk), check for leaks
Schema mismatch Serialization errors, missing fields Compare expected vs actual schema, check migration status

Quick Decision: Which Technique to Use

Situation Start With
Have a stack trace Stack trace analysis
"It used to work" Git bisect
Multi-service issue Log correlation
Wrong data in output Backward data flow
No idea where to start Add instrumentation at boundaries