Files
claudekit/skills/systematic-debugging/SKILL.md
T

357 lines
9.4 KiB
Markdown

---
name: systematic-debugging
user-invocable: true
description: >
Use when encountering ANY bug, error, test failure, or unexpected behavior. Activate for keywords like "bug", "error", "failing", "broken", "doesn't work", "unexpected", "crash", "exception", "TypeError", "undefined", stack traces, or any error message. Also trigger when tests fail unexpectedly, when behavior differs from expectations, when investigating production incidents, or when flaky/intermittent issues appear. ALWAYS investigate root cause before proposing fixes -- never guess at solutions.
---
# Systematic Debugging
## When to Use
- Bug reports with unclear cause
- Errors appearing in production
- Tests failing unexpectedly
- Intermittent/flaky issues
- Complex multi-component failures
## When NOT to Use
- Known issues with documented fixes already available in the codebase or runbook
- Simple typo or syntax errors that are immediately obvious from the error message
- Configuration issues where the fix is simply updating an environment variable or config value
---
## The Four Phases
### Phase 1: Root Cause Investigation
**Goal**: Understand what's happening before attempting to fix.
**Steps**:
1. **Read error messages carefully**
```markdown
- What is the exact error message?
- What is the stack trace?
- What line numbers are mentioned?
- What values are shown?
```
2. **Reproduce consistently**
```markdown
- Can you trigger the bug reliably?
- What exact steps reproduce it?
- What environment is required?
- Document the reproduction steps
```
3. **Track recent changes**
```markdown
- What changed recently?
- git log --oneline -20
- When did it last work?
- What was deployed?
```
4. **Gather evidence**
```markdown
- Collect logs
- Check monitoring/metrics
- Review related code
- Note any patterns
```
5. **Add instrumentation** (for multi-component systems)
```typescript
// Add diagnostic logging at each boundary
console.error('[DEBUG] Input received:', JSON.stringify(input));
console.error('[DEBUG] After validation:', JSON.stringify(validated));
console.error('[DEBUG] Before database call:', JSON.stringify(query));
console.error('[DEBUG] Database result:', JSON.stringify(result));
```
```python
# Python equivalent — add diagnostic logging at boundaries
import logging
logger = logging.getLogger(__name__)
async def get_user(user_id: str, db: AsyncSession) -> User:
logger.error(f"get_user called with user_id={user_id!r}, type={type(user_id)}")
user = await db.get(User, user_id)
logger.error(f"get_user result: {user!r}")
if not user:
raise HTTPException(status_code=404, detail=f"User {user_id} not found")
return user
```
---
### Phase 2: Pattern Analysis
**Goal**: Find comparable working code to identify differences.
**Steps**:
1. **Find working code**
```markdown
- Is there similar functionality that works?
- What did this code look like before?
- Are there reference implementations?
```
2. **Study reference thoroughly**
```markdown
- How does the working version handle this case?
- What dependencies does it use?
- What assumptions does it make?
```
3. **Identify differences**
```markdown
- What's different between working and broken?
- Configuration differences?
- Data differences?
- Environment differences?
```
4. **Understand dependencies**
```markdown
- What does this code depend on?
- What depends on this code?
- Are dependencies behaving correctly?
```
---
### Phase 3: Hypothesis and Testing
**Goal**: Form and test a specific theory about the cause.
**Steps**:
1. **Form specific hypothesis**
```markdown
Write it down explicitly:
"The bug occurs because [X] causes [Y] when [Z]"
Example:
"The bug occurs because the cache returns stale data
when the user's session expires during an active request"
```
2. **Test with minimal changes**
```markdown
- Change ONE variable at a time
- Don't combine multiple fixes
- Verify results after each change
```
3. **Validate hypothesis**
```markdown
- Does the fix address the hypothesis?
- Can you explain WHY it works?
- Does it make the bug impossible, not just unlikely?
```
---
### Phase 4: Implementation
**Goal**: Fix properly with verification.
**Steps**:
1. **Write failing test first**
```typescript
it('should handle expired session during request', () => {
const session = createExpiredSession();
const result = processRequest(session);
expect(result.error).toBe('SESSION_EXPIRED');
});
```
```python
# Python equivalent
async def test_expired_session_returns_401(client, expired_token):
response = await client.get(
"/api/me",
headers={"Authorization": f"Bearer {expired_token}"},
)
assert response.status_code == 401
```
2. **Implement single targeted fix**
```typescript
// Fix addresses root cause, not symptom
function processRequest(session: Session) {
if (session.isExpired()) {
return { error: 'SESSION_EXPIRED' };
}
// ... rest of logic
}
```
```python
# Python equivalent — add expiry check in dependency
async def get_current_user(token: str = Depends(oauth2_scheme)):
payload = decode_token(token)
if payload.exp < datetime.utcnow().timestamp():
raise HTTPException(status_code=401, detail="Token expired")
return await get_user(payload.sub)
```
3. **Verify fix works**
```bash
# TypeScript
npm test -- --grep "expired session"
# Python
pytest tests/test_auth.py -v -k "expired_session"
```
4. **Verify no regressions**
```bash
# TypeScript
npm test
# Python
pytest -v
```
---
## The Three-Fix Rule
**If three or more fixes fail consecutively, STOP.**
This signals an architectural problem, not a simple bug:
```markdown
Fix attempt 1: Failed
Fix attempt 2: Failed
Fix attempt 3: Failed
STOP: This is not a bug - this is a design problem.
Action: Discuss with user/team before proceeding
- Explain what's been tried
- Explain why it's not working
- Propose architectural changes
```
---
## Key Principles
### Never Skip Error Details
```markdown
BAD: "There's an error somewhere"
GOOD: "TypeError: Cannot read property 'id' of undefined
at UserService.getUser (user-service.ts:42)"
```
### Reproduce Before Investigating
```markdown
BAD: "I think I know what's wrong" (starts coding)
GOOD: "Let me reproduce this first" (writes repro steps)
```
### Trace Backward to Origin
```markdown
BAD: Fix where error appears
GOOD: Trace data backward to find where it became invalid
```
### One Change Per Test
```markdown
BAD: "I changed A, B, and C - now it works!"
(Which one fixed it? Are the others safe?)
GOOD: "I changed A - still broken.
I reverted A and changed B - now it works.
B was the fix."
```
---
## Debugging Checklist
Before attempting any fix:
- [ ] Error message fully read and understood
- [ ] Bug reproduced consistently
- [ ] Recent changes reviewed
- [ ] Evidence gathered (logs, traces)
- [ ] Hypothesis written down
- [ ] Similar working code identified
- [ ] Root cause identified (not just symptom)
Before declaring fixed:
- [ ] Failing test written
- [ ] Fix implemented
- [ ] Test passes
- [ ] No regressions (full test suite passes)
- [ ] Fix explained (can articulate why it works)
---
## Stack-Specific Debugging Tools
| Stack | Log Inspection | REPL Debug | Test Isolation |
|-------|---------------|------------|----------------|
| Python/FastAPI | `logging` + `structlog` | `breakpoint()` / `pdb` | `pytest -x -k test_name` |
| TypeScript/NestJS | NestJS `Logger` | `debugger` + `--inspect` | `jest --testNamePattern` |
| Next.js | `console.error` + React DevTools | Browser DevTools | `vitest run file.test.ts` |
| React | React DevTools + `useDebugValue` | Browser DevTools | `vitest run --reporter=verbose` |
| Django | `django.utils.log` + `DEBUG=True` | `breakpoint()` / `pdb` | `python manage.py test app.tests.TestCase.test_name` |
### Python-specific debugging tips
```python
# Quick pdb breakpoint (Python 3.7+)
breakpoint() # drops into pdb at this line
# Conditional breakpoint
if user_id == "problematic_id":
breakpoint()
# SQLAlchemy query logging — see actual SQL
import logging
logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO)
# FastAPI request/response logging middleware
@app.middleware("http")
async def log_requests(request: Request, call_next):
logger.info(f"{request.method} {request.url}")
response = await call_next(request)
logger.info(f"Status: {response.status_code}")
return response
```
### TypeScript-specific debugging tips
```typescript
// NestJS — enable verbose logging
const app = await NestFactory.create(AppModule, { logger: ['verbose'] });
// Prisma — log queries
const prisma = new PrismaClient({ log: ['query', 'info', 'warn', 'error'] });
// Next.js — debug server components
// Add to next.config.js
module.exports = { logging: { fetches: { fullUrl: true } } };
```
---
## Related Skills
- `root-cause-tracing` -- Deep-dive technique for tracing issues back through complex dependency chains
- `defense-in-depth` -- Add defensive layers to prevent similar bugs from recurring
- `verification-before-completion` -- Ensures the fix is actually verified with evidence before claiming the bug is resolved