# Databases — MongoDB Patterns


# MongoDB

## When to Use

- MongoDB database operations
- Document-based data modeling
- Aggregation pipelines
- Semi-structured or polymorphic data that varies per record
- Rapid prototyping where schema flexibility accelerates iteration
- Event logging, IoT telemetry, or content management systems

## When NOT to Use

- Relational-heavy data models with complex joins and foreign key constraints
- SQL-only projects where the entire stack is built around relational databases
- Simple key-value storage where Redis or a lightweight store is more appropriate
- Financial systems requiring multi-table ACID transactions as the norm

---

## Core Patterns

### 1. Schema Design

The central decision in MongoDB modeling is **embed vs. reference**.

**Decision tree:**

```
Does the child data belong to exactly one parent?
  YES --> Is the child array unbounded (could grow to thousands)?
            YES --> Reference (separate collection)
            NO  --> Embed
  NO  --> Is it a many-to-many relationship?
            YES --> Reference (with array of ObjectIds on one or both sides)
            NO  --> Reference
```
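The decision tree above can be written as a small function. This is only a sketch: the helper name, its parameter names, and the returned labels are illustrative assumptions, not part of any MongoDB API.

```javascript
// Hypothetical helper that encodes the embed-vs-reference decision tree.
function chooseRelationshipModel({ ownedBySingleParent, unbounded, manyToMany }) {
  if (ownedBySingleParent) {
    // Child belongs to exactly one parent: embed unless the array can grow without bound
    return unbounded ? "reference" : "embed";
  }
  // Shared children are referenced; many-to-many keeps ObjectId arrays on one or both sides
  return manyToMany ? "reference-with-id-array" : "reference";
}

// A user's address: one owner, bounded size
console.log(chooseRelationshipModel({ ownedBySingleParent: true, unbounded: false, manyToMany: false }));
// A user's orders: one owner, but the list keeps growing
console.log(chooseRelationshipModel({ ownedBySingleParent: true, unbounded: true, manyToMany: false }));
```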

**Embedding pattern -- best for data that is read together:**

```javascript
// User with embedded address and preferences
// Good: one read fetches everything the profile page needs
db.users.insertOne({
  email: "user@example.com",
  name: "Alice Chen",
  address: {
    street: "123 Main St",
    city: "Portland",
    state: "OR",
    zip: "97201"
  },
  preferences: {
    theme: "dark",
    language: "en",
    notifications: { email: true, push: false }
  },
  createdAt: new Date()
});
```

**Referencing pattern -- best for independent or unbounded data:**

```javascript
// Orders reference the user by ID
// Good: orders grow unboundedly, accessed independently
db.orders.insertOne({
  userId: ObjectId("6651a..."),
  status: "shipped",
  totalCents: 4999,
  items: [
    { sku: "WIDGET-001", name: "Blue Widget", qty: 2, priceCents: 1999 },
    { sku: "GADGET-010", name: "Mini Gadget", qty: 1, priceCents: 1001 }
  ],
  placedAt: new Date()
});
```

**Denormalization pattern -- duplicate data to avoid frequent lookups:**

```javascript
// Store author name directly on the post (denormalized from users)
// Trade-off: faster reads, but updates to user name require updating all posts
db.posts.insertOne({
  title: "Getting Started with MongoDB",
  body: "...",
  author: {
    _id: ObjectId("6651a..."),
    name: "Alice Chen"    // denormalized -- must be updated if name changes
  },
  tags: ["mongodb", "tutorial"],
  publishedAt: new Date()
});
```
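The cost of this pattern shows up on rename: every denormalized copy has to be rewritten. A minimal sketch of the two updates involved, assuming the `users` and `posts` shapes shown above (the helper name `buildAuthorRenameOps` is hypothetical):

```javascript
// Hypothetical helper: builds the filters and updates needed when a user is renamed.
// It only constructs plain objects; pass them to updateOne / updateMany yourself.
function buildAuthorRenameOps(userId, newName) {
  return {
    // 1. Update the source of truth in users
    user: { filter: { _id: userId }, update: { $set: { name: newName } } },
    // 2. Fan out to every post that carries the denormalized copy
    posts: { filter: { "author._id": userId }, update: { $set: { "author.name": newName } } }
  };
}

const renameOps = buildAuthorRenameOps("user-1", "Alice Park");
console.log(JSON.stringify(renameOps.posts));
```

In mongosh these would be applied as `db.users.updateOne(renameOps.user.filter, renameOps.user.update)` followed by `db.posts.updateMany(renameOps.posts.filter, renameOps.posts.update)`. Whether the two writes need a transaction depends on how long a stale author name can be tolerated.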

**Polymorphic pattern -- different shapes in one collection:**

```javascript
// Events collection stores different event types
db.events.insertMany([
  {
    type: "page_view",
    userId: ObjectId("6651a..."),
    url: "/products/widget",
    timestamp: new Date()
  },
  {
    type: "purchase",
    userId: ObjectId("6651a..."),
    orderId: ObjectId("6651b..."),
    totalCents: 4999,
    timestamp: new Date()
  }
]);
// Use a discriminator field (type) and query by it
```
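On the application side, the discriminator field usually drives a dispatch table. A small sketch, assuming the two event shapes above; the handler names and messages are illustrative:

```javascript
// Hypothetical per-type handlers keyed by the discriminator field
const eventHandlers = new Map([
  ["page_view", (e) => `viewed ${e.url}`],
  ["purchase", (e) => `spent ${e.totalCents} cents`]
]);

function describeEvent(event) {
  const handler = eventHandlers.get(event.type);
  // Unknown types are reported instead of silently dropped
  return handler ? handler(event) : `unhandled event type: ${event.type}`;
}

console.log(describeEvent({ type: "page_view", url: "/products/widget" }));
console.log(describeEvent({ type: "purchase", totalCents: 4999 }));
```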

**Schema validation -- enforce structure at the database level:**

```javascript
db.createCollection("users", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["email", "name", "createdAt"],
      properties: {
        email: {
          bsonType: "string",
          pattern: "^.+@.+\\..+$",
          description: "Must be a valid email"
        },
        name: {
          bsonType: "string",
          minLength: 1
        },
        role: {
          enum: ["admin", "editor", "viewer"],
          description: "Must be a valid role"
        },
        createdAt: { bsonType: "date" }
      }
    }
  },
  validationLevel: "strict",
  validationAction: "error"
});
```
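For a collection that already exists, the same validator can be attached with the `collMod` command. The sketch below only builds the command document; the helper name is hypothetical, and choosing `"moderate"` as the default level is an assumption that suits collections with legacy data.

```javascript
// Hypothetical builder for a collMod command that adds a validator
// to an existing collection.
function buildCollModCommand(collectionName, jsonSchema, level = "moderate") {
  return {
    collMod: collectionName,
    validator: { $jsonSchema: jsonSchema },
    // "moderate" does not validate updates to existing documents that are already invalid
    validationLevel: level,
    validationAction: "error"
  };
}

const command = buildCollModCommand("users", {
  bsonType: "object",
  required: ["email", "name", "createdAt"]
});
console.log(JSON.stringify(command, null, 2));
```

The result would be passed to `db.runCommand(command)`.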

---

### 2. Aggregation Pipeline

Build complex data transformations as a sequence of stages.

```javascript
// Revenue report: total and average spend per user, last 30 days
db.orders.aggregate([
  // Stage 1: filter to recent delivered orders
  { $match: {
    status: "delivered",
    placedAt: { $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) }
  }},

  // Stage 2: group by user
  { $group: {
    _id: "$userId",
    totalSpent: { $sum: "$totalCents" },
    orderCount: { $sum: 1 },
    avgOrderValue: { $avg: "$totalCents" }
  }},

  // Stage 3: sort by spend
  { $sort: { totalSpent: -1 } },

  // Stage 4: limit to top 10
  { $limit: 10 },

  // Stage 5: join user details
  { $lookup: {
    from: "users",
    localField: "_id",
    foreignField: "_id",
    as: "user"
  }},

  // Stage 6: flatten the joined array
  { $unwind: "$user" },

  // Stage 7: reshape output
  { $project: {
    _id: 0,
    userName: "$user.name",
    email: "$user.email",
    totalSpent: 1,
    orderCount: 1,
    avgOrderValue: { $round: ["$avgOrderValue", 0] }
  }}
]);
```
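To build intuition for what stages 1 to 4 compute, here is a plain JavaScript equivalent over an in-memory array. It is for illustration only and assumes the order shape used above; the real pipeline runs inside the database and can use indexes.

```javascript
// In-memory equivalent of $match, $group, $sort and $limit (illustrative only)
function topSpenders(orders, since, limit) {
  const byUser = new Map();
  for (const order of orders) {
    // $match: recent delivered orders only
    if (order.status !== "delivered" || order.placedAt < since) continue;
    // $group: accumulate per userId
    const group = byUser.get(order.userId) ?? { _id: order.userId, totalSpent: 0, orderCount: 0 };
    group.totalSpent += order.totalCents;
    group.orderCount += 1;
    byUser.set(order.userId, group);
  }
  return [...byUser.values()]
    .map((g) => ({ ...g, avgOrderValue: g.totalSpent / g.orderCount }))
    .sort((a, b) => b.totalSpent - a.totalSpent) // $sort
    .slice(0, limit);                            // $limit
}

const sampleOrders = [
  { userId: "u1", status: "delivered", totalCents: 1000, placedAt: new Date("2024-06-10") },
  { userId: "u1", status: "delivered", totalCents: 3000, placedAt: new Date("2024-06-12") },
  { userId: "u2", status: "delivered", totalCents: 5000, placedAt: new Date("2024-06-11") },
  { userId: "u2", status: "pending", totalCents: 9000, placedAt: new Date("2024-06-13") }
];
console.log(topSpenders(sampleOrders, new Date("2024-06-01"), 10));
```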

**$unwind -- flatten arrays into individual documents:**

```javascript
// Expand order items to analyze product-level metrics
db.orders.aggregate([
  { $unwind: "$items" },
  { $group: {
    _id: "$items.sku",
    totalQty: { $sum: "$items.qty" },
    totalRevenue: { $sum: { $multiply: ["$items.qty", "$items.priceCents"] } }
  }},
  { $sort: { totalRevenue: -1 } }
]);
```
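Conceptually, `$unwind` behaves like a `flatMap` that emits one copy of the parent document per array element. The sketch below mirrors the pipeline above in plain JavaScript. It assumes `items` is either an array or absent, and the function names are illustrative.

```javascript
// One output row per array element; orders with no items produce no rows,
// matching $unwind's default behavior for missing or empty arrays
function unwindItems(orders) {
  return orders.flatMap((order) =>
    (order.items ?? []).map((item) => ({ ...order, items: item }))
  );
}

// Equivalent of the $group stage: revenue per SKU
function revenueBySku(orders) {
  const totals = new Map();
  for (const row of unwindItems(orders)) {
    const { sku, qty, priceCents } = row.items;
    totals.set(sku, (totals.get(sku) ?? 0) + qty * priceCents);
  }
  return Object.fromEntries(totals);
}

console.log(revenueBySku([
  { _id: 1, items: [{ sku: "A", qty: 2, priceCents: 100 }, { sku: "B", qty: 1, priceCents: 50 }] },
  { _id: 2, items: [{ sku: "A", qty: 1, priceCents: 100 }] }
]));
```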

**$lookup with pipeline -- filtered/correlated joins:**

```javascript
// For each user, get their 3 most recent orders
db.users.aggregate([
  { $lookup: {
    from: "orders",
    let: { uid: "$_id" },
    pipeline: [
      { $match: { $expr: { $eq: ["$userId", "$$uid"] } } },
      { $sort: { placedAt: -1 } },
      { $limit: 3 },
      { $project: { status: 1, totalCents: 1, placedAt: 1 } }
    ],
    as: "recentOrders"
  }}
]);
```

**$facet -- run multiple sub-pipelines over the same input documents in one stage:**

```javascript
// Dashboard: get summary stats and top products in one query
db.orders.aggregate([
  { $match: { status: "delivered" } },
  { $facet: {
    summary: [
      { $group: {
        _id: null,
        totalRevenue: { $sum: "$totalCents" },
        totalOrders: { $sum: 1 }
      }}
    ],
    topProducts: [
      { $unwind: "$items" },
      { $group: { _id: "$items.sku", sold: { $sum: "$items.qty" } } },
      { $sort: { sold: -1 } },
      { $limit: 5 }
    ],
    monthlyTrend: [
      { $group: {
        _id: { $dateToString: { format: "%Y-%m", date: "$placedAt" } },
        revenue: { $sum: "$totalCents" }
      }},
      { $sort: { _id: 1 } }
    ]
  }}
]);
```

---

### 3. Index Strategies

```javascript
// Single field index -- most common
db.users.createIndex({ email: 1 }, { unique: true });

// Compound index -- order matters, follows the ESR rule:
// Equality fields first, Sort fields next, Range fields last
db.orders.createIndex({ status: 1, placedAt: -1 });
// Supports: find({status: "pending"}).sort({placedAt: -1})
// Also supports: find({status: "pending"}) alone (prefix)

// Multikey index -- automatically indexes each array element
db.posts.createIndex({ tags: 1 });
// Supports: find({ tags: "mongodb" })

// Text index -- basic full-text search
db.posts.createIndex(
  { title: "text", body: "text" },
  { weights: { title: 10, body: 1 }, name: "posts_text_search" }
);
// Usage:
db.posts.find(
  { $text: { $search: "mongodb aggregation" } },
  { score: { $meta: "textScore" } }
).sort({ score: { $meta: "textScore" } });

// TTL index -- auto-delete documents after expiry
db.sessions.createIndex(
  { expiresAt: 1 },
  { expireAfterSeconds: 0 }  // delete when expiresAt is in the past
);
// Documents must have a Date field; they are removed by a background task ~every 60s

// Partial index -- only index documents matching a filter
db.orders.createIndex(
  { placedAt: -1 },
  { partialFilterExpression: { status: "pending" } }
);
// Smaller index; only used when the query includes the filter condition

// Wildcard index -- for querying arbitrary keys in a sub-document
db.products.createIndex({ "attributes.$**": 1 });
// Supports: find({ "attributes.color": "red" }) without knowing keys in advance

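// Collation -- case-insensitive sorting and matching
db.users.createIndex(
  { name: 1 },
  { collation: { locale: "en", strength: 2 } }
);
// Only used when the query specifies the same collation:
// find({ name: "alice chen" }).collation({ locale: "en", strength: 2 })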
```
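With `expireAfterSeconds: 0`, the application decides each document's lifetime by writing the expiry time itself. A minimal sketch of building such a session document; the helper name and the `userId`/`createdAt` fields are assumptions:

```javascript
// Hypothetical helper that builds a session document for the TTL index above
function buildSession(userId, ttlSeconds, now = new Date()) {
  return {
    userId,
    createdAt: now,
    // Must be stored as a Date: documents whose indexed field is not a date never expire
    expiresAt: new Date(now.getTime() + ttlSeconds * 1000)
  };
}

const session = buildSession("user-1", 3600, new Date("2024-01-01T00:00:00Z"));
console.log(session.expiresAt.toISOString()); // 2024-01-01T01:00:00.000Z
```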

**The ESR rule for compound indexes:** order fields by **E**quality, **S**ort, **R**ange. This produces the most efficient index scans.

```javascript
// Query: find active orders for a user, sorted by date, in a date range
// Equality: userId, status
// Sort: placedAt
// Range: placedAt (but sort and range on same field -- sort wins)
db.orders.createIndex({ userId: 1, status: 1, placedAt: -1 });
```
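The ordering rule is mechanical enough to express as a helper. This is a sketch under the assumption that you have already classified each query field; `buildEsrIndex` is a hypothetical name, and range fields default to ascending order here.

```javascript
// Hypothetical helper: orders index keys as Equality, Sort, Range
function buildEsrIndex({ equality = [], sort = {}, range = [] }) {
  const keys = new Map();
  for (const field of equality) keys.set(field, 1);                        // E
  for (const [field, dir] of Object.entries(sort)) keys.set(field, dir);   // S
  for (const field of range) {
    // R: a field that is already a sort key keeps its sort position and direction
    if (!keys.has(field)) keys.set(field, 1);
  }
  return Object.fromEntries(keys);
}

console.log(buildEsrIndex({
  equality: ["userId", "status"],
  sort: { placedAt: -1 },
  range: ["placedAt"]
}));
// { userId: 1, status: 1, placedAt: -1 }
```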

---

### 4. Transactions

Multi-document transactions work across collections (requires replica set or sharded cluster).

```javascript
const session = client.startSession();

try {
  session.startTransaction({
    readConcern: { level: "snapshot" },
    writeConcern: { w: "majority" },
    readPreference: "primary"
  });

  const accounts = client.db("bank").collection("accounts");

  // Transfer $50 from account A to account B
  const fromAccount = await accounts.findOne(
    { _id: "account-A" },
    { session }
  );

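  // Throw and let the catch block abort; aborting here as well would make
  // the second abortTransaction() call fail and mask this error
  if (!fromAccount || fromAccount.balanceCents < 5000) {
    throw new Error("Insufficient funds");
  }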

  await accounts.updateOne(
    { _id: "account-A" },
    { $inc: { balanceCents: -5000 } },
    { session }
  );

  await accounts.updateOne(
    { _id: "account-B" },
    { $inc: { balanceCents: 5000 } },
    { session }
  );

  // Record the transfer in a separate collection -- still in the same tx
  await client.db("bank").collection("transfers").insertOne({
    from: "account-A",
    to: "account-B",
    amountCents: 5000,
    timestamp: new Date()
  }, { session });

  await session.commitTransaction();
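} catch (error) {
  // Abort only if the transaction is still open (e.g. not after a failed commit)
  if (session.inTransaction()) {
    await session.abortTransaction();
  }
  throw error;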
} finally {
  await session.endSession();
}
```

**Guidelines:**
- Keep transactions short -- they hold locks and consume resources
- Design your schema to minimize the need for multi-document transactions
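- Transactions are aborted after 60 seconds by default (server parameter `transactionLifetimeLimitSeconds`)
- Drivers retry the commit and abort operations once on transient errors; a failure inside the transaction that is labeled `TransientTransactionError` requires retrying the whole transaction, which `session.withTransaction()` handles for you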
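Errors that carry the `TransientTransactionError` label are safe to handle by running the whole transaction again. The sketch below shows that retry loop in isolation. `retryTransient` and its `maxAttempts` default are assumptions, and `runTransaction` stands for any async function that starts, runs, and commits one attempt. In practice the driver's `session.withTransaction()` provides comparable behavior.

```javascript
// Hypothetical retry wrapper for whole-transaction retries
async function retryTransient(runTransaction, maxAttempts = 3) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await runTransaction();
    } catch (error) {
      // Driver errors expose hasErrorLabel(); anything else is treated as fatal
      const transient =
        typeof error?.hasErrorLabel === "function" &&
        error.hasErrorLabel("TransientTransactionError");
      if (!transient || attempt >= maxAttempts) throw error;
      // Otherwise loop and run the whole transaction again
    }
  }
}
```

Each attempt must start a fresh transaction and repeat every read and write, because an aborted transaction leaves none of its changes behind.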

---

### 5. Change Streams

Watch for real-time changes to collections, databases, or the entire deployment (requires a replica set or sharded cluster).

```javascript
// Watch a single collection for inserts and updates
const pipeline = [
  { $match: {
    operationType: { $in: ["insert", "update"] },
    "fullDocument.status": "urgent"
  }}
];

const changeStream = db.collection("tickets").watch(pipeline, {
  fullDocument: "updateLookup"  // include the full document on updates
});

changeStream.on("change", (change) => {
  console.log("Change detected:", change.operationType);
  console.log("Document:", change.fullDocument);
  console.log("Resume token:", change._id);  // the event's _id is its resume token

  // Process the change (e.g., send notification, update cache)
  notifyTeam(change.fullDocument);
});

// Handle errors and resume from last known position
changeStream.on("error", (error) => {
  console.error("Change stream error:", error);
  // Reconnect using the stored resume token
});
```

**Resumable pattern for production:**

```javascript
let resumeToken = await loadResumeTokenFromStorage();

async function watchWithResume(collection) {
  const options = { fullDocument: "updateLookup" };
  if (resumeToken) {
    options.resumeAfter = resumeToken;
  }

  const stream = collection.watch([], options);

  stream.on("change", async (change) => {
    // Process change
    await handleChange(change);

    // Persist resume token so we can recover after restart
    resumeToken = change._id;
    await saveResumeTokenToStorage(resumeToken);
  });

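  stream.on("error", async (error) => {
    console.error("Change stream error:", error);
    // Close the failed stream, wait, then reopen from the last saved token
    await stream.close().catch(() => {});
    await new Promise(r => setTimeout(r, 5000));
    watchWithResume(collection);
  });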
}
```
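The fixed 5-second wait above can be swapped for a capped exponential backoff so that repeated failures do not hammer the cluster. A small sketch; the function name and the default base and cap values are assumptions:

```javascript
// Hypothetical helper: delay before the next reconnect attempt
function reconnectDelayMs(attempt, baseMs = 1000, maxMs = 30000) {
  // attempt 0 waits baseMs, then the delay doubles each time, capped at maxMs
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

console.log([0, 1, 2, 3, 4, 5, 6].map((n) => reconnectDelayMs(n)));
// [ 1000, 2000, 4000, 8000, 16000, 30000, 30000 ]
```

The caller would track the attempt count itself, reset it to zero after the first successfully processed change, and pass the result to `setTimeout`.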

**Use cases:** real-time dashboards, cache invalidation, event-driven architectures, syncing data to search indexes (e.g., Elasticsearch).

---

### 6. Performance

#### Reading explain() output

```javascript
// Run explain to see the query plan
db.orders.find({
  userId: ObjectId("6651a..."),
  status: "pending"
}).sort({ placedAt: -1 }).explain("executionStats");
```

**Key fields in the explain output:**

| Field | What to look for |
|-------|-----------------|
| `queryPlanner.winningPlan` | An `IXSCAN` stage (often nested under `FETCH`) is good; `COLLSCAN` is bad (full collection scan) |
| `executionStats.totalKeysExamined` | Should be close to `nReturned` (no wasted index scans) |
| `executionStats.totalDocsExamined` | Should be close to `nReturned` (no wasted document reads) |
| `executionStats.executionTimeMillis` | Overall query time |
| `queryPlanner.rejectedPlans` | Shows alternatives the optimizer considered |
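These checks can be automated against a saved explain document. The sketch below assumes the classic plan format, where stages nest through `inputStage` and `inputStages` under `queryPlanner.winningPlan`. Newer servers may wrap the plan differently, and the helper names are hypothetical.

```javascript
// Walk the plan tree and collect every stage name
function collectStages(plan, found = []) {
  if (!plan) return found;
  if (plan.stage) found.push(plan.stage);
  collectStages(plan.inputStage, found);
  for (const child of plan.inputStages ?? []) collectStages(child, found);
  return found;
}

// Summarize the signals from the table above
function assessExplain(explain) {
  const stages = collectStages(explain.queryPlanner.winningPlan);
  const stats = explain.executionStats;
  const returned = Math.max(stats.nReturned, 1); // avoid division by zero
  return {
    usesCollScan: stages.includes("COLLSCAN"),
    keysPerResult: stats.totalKeysExamined / returned,
    docsPerResult: stats.totalDocsExamined / returned
  };
}

console.log(assessExplain({
  queryPlanner: { winningPlan: { stage: "FETCH", inputStage: { stage: "IXSCAN" } } },
  executionStats: { nReturned: 20, totalKeysExamined: 20, totalDocsExamined: 20 }
}));
```

Ratios near 1 indicate a selective index; ratios far above 1 mean the query examines many keys or documents that it then discards.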

**Covered queries -- answered entirely from the index:**

```javascript
// Create an index that covers the query
db.orders.createIndex({ userId: 1, status: 1, totalCents: 1 });

// This query only needs fields in the index -- no document fetch
db.orders.find(
  { userId: ObjectId("6651a..."), status: "delivered" },
  { _id: 0, totalCents: 1 }  // projection must exclude _id and only include indexed fields
);
// explain() will show: "totalDocsExamined": 0
```

**Projection optimization -- fetch only what you need:**

```javascript
// BAD: fetches entire document including large body field
const fullPosts = await db.collection("posts").find({ "author._id": userId }).toArray();

// GOOD: only fetch fields needed for the list view
const listPosts = await db.collection("posts").find(
  { "author._id": userId },
  { projection: { title: 1, publishedAt: 1, tags: 1 } }
).toArray();
```

**Bulk operations for write-heavy workloads:**

```javascript
// priceUpdates: array of { sku, newPrice } objects
const operations = priceUpdates.map((update) => ({
  updateOne: {
    filter: { sku: update.sku },
    update: { $set: { priceCents: update.newPrice, updatedAt: new Date() } }
  }
}));

// ordered: false lets the server continue past individual failures
const result = await db.collection("products").bulkWrite(operations, { ordered: false });
console.log(`Matched: ${result.matchedCount}, Modified: ${result.modifiedCount}`);
```
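For very large update sets it can help to submit the work in fixed-size batches, which bounds memory use and makes progress reporting straightforward. A minimal chunking sketch; the helper name is hypothetical and the batch size is yours to choose:

```javascript
// Hypothetical helper: split an array into batches of at most `size` items
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

console.log(chunk([1, 2, 3, 4, 5], 2)); // [ [ 1, 2 ], [ 3, 4 ], [ 5 ] ]
```

Each batch would then be turned into one bulk request, for example one `bulkWrite` call per batch.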

---

## Best Practices

1. **Design schema around query patterns, not data relationships.** Ask "how will I read this data?" before "how does this data relate?" Embed data that is always fetched together; reference data accessed independently.

2. **Use the ESR rule for compound indexes.** Order index fields by Equality, Sort, Range. This maximizes the index's usefulness and minimizes keys examined.

3. **Set read/write concerns appropriately.** Use `w: "majority"` and `readConcern: "majority"` for data that must survive failovers. Use `w: 1` for non-critical writes where speed matters more than durability.

4. **Use projection to limit returned fields.** Transferring large documents over the network when you only need two fields wastes bandwidth and memory. Always project.

5. **Avoid unbounded array growth.** An embedded array that can grow to thousands of elements bloats the document (16 MB max) and degrades performance. As a rule of thumb, move the data to a separate collection with a reference once the array exceeds ~100 elements or has no natural upper bound.

6. **Use bulk operations for batch writes.** Individual `insertOne` or `updateOne` calls in a loop cost one network round trip each. Batching them with `bulkWrite` or `initializeUnorderedBulkOp` sends many operations per round trip, which usually raises throughput substantially.

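7. **Enable retryable writes.** Current drivers enable `retryWrites=true` by default; set it explicitly in the connection string if you use an older driver. The driver then retries a failed write once after a transient network error or primary election, so most such failures need no application-level retry logic.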

8. **Monitor with database profiler and serverStatus.** Use `db.setProfilingLevel(1, { slowms: 100 })` to log slow queries. Check `db.serverStatus().opcounters` and `db.serverStatus().connections` for overall health.
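Several of these practices (write concern, read concern, retryable writes) can be set once in the connection string. A sketch of assembling one from the standard URI options; the helper name and host names are placeholders:

```javascript
// Hypothetical helper: builds a connection string with durable defaults
function buildMongoUri(hosts, dbName, overrides = {}) {
  const params = new URLSearchParams({
    retryWrites: "true",
    w: "majority",
    readConcernLevel: "majority",
    ...overrides
  });
  return `mongodb://${hosts.join(",")}/${dbName}?${params.toString()}`;
}

console.log(buildMongoUri(["db1.example.com:27017", "db2.example.com:27017"], "shop"));
// mongodb://db1.example.com:27017,db2.example.com:27017/shop?retryWrites=true&w=majority&readConcernLevel=majority
```

Individual operations can still override these defaults, for example `w: 1` on a non-critical write.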

## Common Pitfalls

1. **Treating MongoDB like a relational database.** Normalizing everything into separate collections and using `$lookup` for every query defeats the purpose. If you need heavy joins, PostgreSQL is likely a better fit. Design for embedding first.

2. **Missing indexes on query fields.** Every `find()`, `$match`, and `sort()` should be backed by an index. Use `db.collection.getIndexes()` and `explain()` to verify. A `COLLSCAN` on a large collection is almost always a bug.

3. **Ignoring the 16 MB document size limit.** Embedding unbounded arrays (comments, logs, events) will eventually hit this limit, at which point writes to the document fail with an error. Use the bucket pattern (fixed-size sub-documents) or reference a separate collection.

4. **Not using readPreference for read-heavy workloads.** By default all reads go to the primary. For analytics or non-critical reads, use `readPreference: "secondaryPreferred"` to distribute load across replicas, accepting that secondaries may return slightly stale data.

5. **Assuming the positional `$` operator updates every matching array element.** `$set` with positional `$` updates only the first array element that matches the query. Use `$[]` for all elements or `$[<identifier>]` with `arrayFilters` for conditional updates:

```javascript
// Update price for a specific item in all orders
db.orders.updateMany(
  { "items.sku": "WIDGET-001" },
  { $set: { "items.$[item].priceCents": 2499 } },
  { arrayFilters: [{ "item.sku": "WIDGET-001" }] }
);
```

6. **Running aggregation pipelines without early $match.** Always filter as early as possible in the pipeline. A `$group` or `$unwind` before `$match` processes the entire collection unnecessarily. Put `$match` first to leverage indexes and reduce documents flowing through subsequent stages.
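The bucket pattern mentioned in pitfall 3 can be sketched as an upsert that appends to the first bucket with free space. The collection layout (`postId`, `count`, `comments`), the bucket size, and the helper name are all assumptions for illustration.

```javascript
const BUCKET_SIZE = 50; // assumed cap on comments per bucket document

// Hypothetical helper: builds the updateOne arguments for appending a comment
function buildBucketAppend(postId, comment) {
  return {
    // Match a bucket for this post that still has room
    filter: { postId, count: { $lt: BUCKET_SIZE } },
    update: { $push: { comments: comment }, $inc: { count: 1 } },
    // upsert creates a fresh bucket when every existing one is full
    options: { upsert: true }
  };
}

const append = buildBucketAppend("post-1", { author: "Alice Chen", text: "Nice write-up" });
console.log(JSON.stringify(append.filter));
```

With the Node.js driver this would be applied as `db.collection("comment_buckets").updateOne(append.filter, append.update, append.options)`, where `comment_buckets` is a hypothetical collection name. No bucket document ever holds more than `BUCKET_SIZE` comments, so document size stays bounded.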

## Related Skills

- `postgresql` - Relational database patterns for structured data with complex relationships
- `caching` - Caching strategies to reduce database load
- `logging` - Logging patterns for query debugging and monitoring