Files
claudekit/skills/databases/references/mongodb.md
T
2026-04-19 14:10:38 +07:00

577 lines
17 KiB
Markdown

# Databases — MongoDB Patterns
# MongoDB
## When to Use
- MongoDB database operations
- Document-based data modeling
- Aggregation pipelines
- Semi-structured or polymorphic data that varies per record
- Rapid prototyping where schema flexibility accelerates iteration
- Event logging, IoT telemetry, or content management systems
## When NOT to Use
- Relational-heavy data models with complex joins and foreign key constraints
- SQL-only projects where the entire stack is built around relational databases
- Simple key-value storage where Redis or a lightweight store is more appropriate
- Financial systems requiring multi-table ACID transactions as the norm
---
## Core Patterns
### 1. Schema Design
The central decision in MongoDB modeling is **embed vs. reference**.
**Decision tree:**
```
Does the child data belong to exactly one parent?
YES --> Is the child array unbounded (could grow to thousands)?
YES --> Reference (separate collection)
NO --> Embed
NO --> Is it a many-to-many relationship?
YES --> Reference (with array of ObjectIds on one or both sides)
NO --> Reference
```
**Embedding pattern -- best for data that is read together:**
```javascript
// User with embedded address and preferences
// Good: one read fetches everything the profile page needs
db.users.insertOne({
email: "user@example.com",
name: "Alice Chen",
address: {
street: "123 Main St",
city: "Portland",
state: "OR",
zip: "97201"
},
preferences: {
theme: "dark",
language: "en",
notifications: { email: true, push: false }
},
createdAt: new Date()
});
```
**Referencing pattern -- best for independent or unbounded data:**
```javascript
// Orders reference the user by ID
// Good: orders grow unboundedly, accessed independently
db.orders.insertOne({
userId: ObjectId("6651a..."),
status: "shipped",
totalCents: 4999,
items: [
{ sku: "WIDGET-001", name: "Blue Widget", qty: 2, priceCents: 1999 },
{ sku: "GADGET-010", name: "Mini Gadget", qty: 1, priceCents: 1001 }
],
placedAt: new Date()
});
```
**Denormalization pattern -- duplicate data to avoid frequent lookups:**
```javascript
// Store author name directly on the post (denormalized from users)
// Trade-off: faster reads, but updates to user name require updating all posts
db.posts.insertOne({
title: "Getting Started with MongoDB",
body: "...",
author: {
_id: ObjectId("6651a..."),
name: "Alice Chen" // denormalized -- must be updated if name changes
},
tags: ["mongodb", "tutorial"],
publishedAt: new Date()
});
```
**Polymorphic pattern -- different shapes in one collection:**
```javascript
// Events collection stores different event types
db.events.insertMany([
{
type: "page_view",
userId: ObjectId("6651a..."),
url: "/products/widget",
timestamp: new Date()
},
{
type: "purchase",
userId: ObjectId("6651a..."),
orderId: ObjectId("6651b..."),
totalCents: 4999,
timestamp: new Date()
}
]);
// Use a discriminator field (type) and query by it
```
**Schema validation -- enforce structure at the database level:**
```javascript
db.createCollection("users", {
validator: {
$jsonSchema: {
bsonType: "object",
required: ["email", "name", "createdAt"],
properties: {
email: {
bsonType: "string",
pattern: "^.+@.+\\..+$",
description: "Must be a valid email"
},
name: {
bsonType: "string",
minLength: 1
},
role: {
enum: ["admin", "editor", "viewer"],
description: "Must be a valid role"
},
createdAt: { bsonType: "date" }
}
}
},
validationLevel: "strict",
validationAction: "error"
});
```
---
### 2. Aggregation Pipeline
Build complex data transformations as a sequence of stages.
```javascript
// Revenue report: total and average spend per user, last 30 days
db.orders.aggregate([
// Stage 1: filter to recent delivered orders
{ $match: {
status: "delivered",
placedAt: { $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) }
}},
// Stage 2: group by user
{ $group: {
_id: "$userId",
totalSpent: { $sum: "$totalCents" },
orderCount: { $sum: 1 },
avgOrderValue: { $avg: "$totalCents" }
}},
// Stage 3: sort by spend
{ $sort: { totalSpent: -1 } },
// Stage 4: limit to top 10
{ $limit: 10 },
// Stage 5: join user details
{ $lookup: {
from: "users",
localField: "_id",
foreignField: "_id",
as: "user"
}},
// Stage 6: flatten the joined array
{ $unwind: "$user" },
// Stage 7: reshape output
{ $project: {
_id: 0,
userName: "$user.name",
email: "$user.email",
totalSpent: 1,
orderCount: 1,
avgOrderValue: { $round: ["$avgOrderValue", 0] }
}}
]);
```
**$unwind -- flatten arrays into individual documents:**
```javascript
// Expand order items to analyze product-level metrics
db.orders.aggregate([
{ $unwind: "$items" },
{ $group: {
_id: "$items.sku",
totalQty: { $sum: "$items.qty" },
totalRevenue: { $sum: { $multiply: ["$items.qty", "$items.priceCents"] } }
}},
{ $sort: { totalRevenue: -1 } }
]);
```
**$lookup with pipeline -- filtered/correlated joins:**
```javascript
// For each user, get their 3 most recent orders
db.users.aggregate([
{ $lookup: {
from: "orders",
let: { uid: "$_id" },
pipeline: [
{ $match: { $expr: { $eq: ["$userId", "$$uid"] } } },
{ $sort: { placedAt: -1 } },
{ $limit: 3 },
{ $project: { status: 1, totalCents: 1, placedAt: 1 } }
],
as: "recentOrders"
}}
]);
```
**$facet -- run multiple aggregations in parallel:**
```javascript
// Dashboard: get summary stats and top products in one query
db.orders.aggregate([
{ $match: { status: "delivered" } },
{ $facet: {
summary: [
{ $group: {
_id: null,
totalRevenue: { $sum: "$totalCents" },
totalOrders: { $sum: 1 }
}}
],
topProducts: [
{ $unwind: "$items" },
{ $group: { _id: "$items.sku", sold: { $sum: "$items.qty" } } },
{ $sort: { sold: -1 } },
{ $limit: 5 }
],
monthlyTrend: [
{ $group: {
_id: { $dateToString: { format: "%Y-%m", date: "$placedAt" } },
revenue: { $sum: "$totalCents" }
}},
{ $sort: { _id: 1 } }
]
}}
]);
```
---
### 3. Index Strategies
```javascript
// Single field index -- most common
db.users.createIndex({ email: 1 }, { unique: true });
// Compound index -- order matters, follows the ESR rule:
// Equality fields first, Sort fields next, Range fields last
db.orders.createIndex({ status: 1, placedAt: -1 });
// Supports: find({status: "pending"}).sort({placedAt: -1})
// Also supports: find({status: "pending"}) alone (prefix)
// Multikey index -- automatically indexes each array element
db.posts.createIndex({ tags: 1 });
// Supports: find({ tags: "mongodb" })
// Text index -- basic full-text search
db.posts.createIndex(
{ title: "text", body: "text" },
{ weights: { title: 10, body: 1 }, name: "posts_text_search" }
);
// Usage:
db.posts.find(
{ $text: { $search: "mongodb aggregation" } },
{ score: { $meta: "textScore" } }
).sort({ score: { $meta: "textScore" } });
// TTL index -- auto-delete documents after expiry
db.sessions.createIndex(
{ expiresAt: 1 },
{ expireAfterSeconds: 0 } // delete when expiresAt is in the past
);
// Documents must have a Date field; they are removed by a background task ~every 60s
// Partial index -- only index documents matching a filter
db.orders.createIndex(
{ placedAt: -1 },
{ partialFilterExpression: { status: "pending" } }
);
// Smaller index; only used when the query includes the filter condition
// Wildcard index -- for querying arbitrary keys in a sub-document
db.products.createIndex({ "attributes.$**": 1 });
// Supports: find({ "attributes.color": "red" }) without knowing keys in advance
// Collation -- case-insensitive sorting and matching
db.users.createIndex(
{ name: 1 },
{ collation: { locale: "en", strength: 2 } }
);
```
**The ESR rule for compound indexes:** order fields by **E**quality, **S**ort, **R**ange. This produces the most efficient index scans.
```javascript
// Query: find active orders for a user, sorted by date, in a date range
// Equality: userId, status
// Sort: placedAt
// Range: placedAt (but sort and range on same field -- sort wins)
db.orders.createIndex({ userId: 1, status: 1, placedAt: -1 });
```
---
### 4. Transactions
Multi-document transactions work across collections (requires replica set or sharded cluster).
```javascript
const session = client.startSession();
try {
session.startTransaction({
readConcern: { level: "snapshot" },
writeConcern: { w: "majority" },
readPreference: "primary"
});
const accounts = client.db("bank").collection("accounts");
// Transfer $50 from account A to account B
const fromAccount = await accounts.findOne(
{ _id: "account-A" },
{ session }
);
if (fromAccount.balanceCents < 5000) {
await session.abortTransaction();
throw new Error("Insufficient funds");
}
await accounts.updateOne(
{ _id: "account-A" },
{ $inc: { balanceCents: -5000 } },
{ session }
);
await accounts.updateOne(
{ _id: "account-B" },
{ $inc: { balanceCents: 5000 } },
{ session }
);
// Record the transfer in a separate collection -- still in the same tx
await client.db("bank").collection("transfers").insertOne({
from: "account-A",
to: "account-B",
amountCents: 5000,
timestamp: new Date()
}, { session });
await session.commitTransaction();
} catch (error) {
await session.abortTransaction();
throw error;
} finally {
await session.endSession();
}
```
**Guidelines:**
- Keep transactions short -- they hold locks and consume resources
- Design your schema to minimize the need for multi-document transactions
- Transactions have a default 60-second timeout (`maxTimeMS`)
- Retryable writes (`retryWrites=true` in connection string) handle transient errors automatically
---
### 5. Change Streams
Watch for real-time changes to collections, databases, or the entire deployment.
```javascript
// Watch a single collection for inserts and updates
const pipeline = [
{ $match: {
operationType: { $in: ["insert", "update"] },
"fullDocument.status": "urgent"
}}
];
const changeStream = db.collection("tickets").watch(pipeline, {
fullDocument: "updateLookup" // include the full document on updates
});
changeStream.on("change", (change) => {
console.log("Change detected:", change.operationType);
console.log("Document:", change.fullDocument);
console.log("Resume token:", change.resumeToken);
// Process the change (e.g., send notification, update cache)
notifyTeam(change.fullDocument);
});
// Handle errors and resume from last known position
changeStream.on("error", (error) => {
console.error("Change stream error:", error);
// Reconnect using the stored resume token
});
```
**Resumable pattern for production:**
```javascript
let resumeToken = await loadResumeTokenFromStorage();
async function watchWithResume(collection) {
const options = { fullDocument: "updateLookup" };
if (resumeToken) {
options.resumeAfter = resumeToken;
}
const stream = collection.watch([], options);
stream.on("change", async (change) => {
// Process change
await handleChange(change);
// Persist resume token so we can recover after restart
resumeToken = change._id;
await saveResumeTokenToStorage(resumeToken);
});
stream.on("error", async () => {
// Wait and reconnect
await new Promise(r => setTimeout(r, 5000));
watchWithResume(collection);
});
}
```
**Use cases:** real-time dashboards, cache invalidation, event-driven architectures, syncing data to search indexes (e.g., Elasticsearch).
---
### 6. Performance
#### Reading explain() output
```javascript
// Run explain to see the query plan
db.orders.find({
userId: ObjectId("6651a..."),
status: "pending"
}).sort({ placedAt: -1 }).explain("executionStats");
```
**Key fields in executionStats:**
| Field | What to look for |
|-------|-----------------|
| `winningPlan.stage` | `IXSCAN` good, `COLLSCAN` bad (full collection scan) |
| `totalKeysExamined` | Should be close to `nReturned` (no wasted index scans) |
| `totalDocsExamined` | Should be close to `nReturned` (no wasted document reads) |
| `executionTimeMillis` | Overall query time |
| `rejectedPlans` | Shows alternatives the optimizer considered |
**Covered queries -- answered entirely from the index:**
```javascript
// Create an index that covers the query
db.orders.createIndex({ userId: 1, status: 1, totalCents: 1 });
// This query only needs fields in the index -- no document fetch
db.orders.find(
{ userId: ObjectId("6651a..."), status: "delivered" },
{ _id: 0, totalCents: 1 } // projection must exclude _id and only include indexed fields
);
// explain() will show: "totalDocsExamined": 0
```
**Projection optimization -- fetch only what you need:**
```javascript
// BAD: fetches entire document including large body field
const posts = await db.posts.find({ author: userId }).toArray();
// GOOD: only fetch fields needed for the list view
const posts = await db.posts.find(
{ author: userId },
{ projection: { title: 1, publishedAt: 1, tags: 1 } }
).toArray();
```
**Bulk operations for write-heavy workloads:**
```javascript
const bulk = db.products.initializeUnorderedBulkOp();
for (const update of priceUpdates) {
bulk.find({ sku: update.sku })
.updateOne({ $set: { priceCents: update.newPrice, updatedAt: new Date() } });
}
const result = await bulk.execute();
console.log(`Modified: ${result.nModified}, Errors: ${result.getWriteErrorCount()}`);
```
---
## Best Practices
1. **Design schema around query patterns, not data relationships.** Ask "how will I read this data?" before "how does this data relate?" Embed data that is always fetched together; reference data accessed independently.
2. **Use the ESR rule for compound indexes.** Order index fields by Equality, Sort, Range. This maximizes the index's usefulness and minimizes keys examined.
3. **Set read/write concerns appropriately.** Use `w: "majority"` and `readConcern: "majority"` for data that must survive failovers. Use `w: 1` for non-critical writes where speed matters more than durability.
4. **Use projection to limit returned fields.** Transferring large documents over the network when you only need two fields wastes bandwidth and memory. Always project.
5. **Avoid unbounded array growth.** An embedded array that can grow to thousands of elements bloats the document (16 MB max) and degrades performance. Move to a separate collection with a reference when the array exceeds ~100 elements.
6. **Use bulk operations for batch writes.** Individual `insertOne` or `updateOne` calls in a loop are slow. Batch them with `bulkWrite` or `initializeUnorderedBulkOp` for 10-50x throughput improvement.
7. **Enable retryable writes.** Add `retryWrites=true` to your connection string. This handles transient network errors and primary elections automatically without application-level retry logic.
8. **Monitor with database profiler and serverStatus.** Use `db.setProfilingLevel(1, { slowms: 100 })` to log slow queries. Check `db.serverStatus().opcounters` and `db.serverStatus().connections` for overall health.
## Common Pitfalls
1. **Treating MongoDB like a relational database.** Normalizing everything into separate collections and using `$lookup` for every query defeats the purpose. If you need heavy joins, PostgreSQL is likely a better fit. Design for embedding first.
2. **Missing indexes on query fields.** Every `find()`, `$match`, and `sort()` should be backed by an index. Use `db.collection.getIndexes()` and `explain()` to verify. A `COLLSCAN` on a large collection is almost always a bug.
3. **Ignoring the 16 MB document size limit.** Embedding unbounded arrays (comments, logs, events) will eventually hit this wall, crashing writes. Use the bucket pattern (fixed-size sub-documents) or reference a separate collection.
4. **Not using readPreference for read-heavy workloads.** By default all reads go to the primary. For analytics or non-critical reads, use `readPreference: "secondaryPreferred"` to distribute load across replicas.
5. **Forgetting that updates replace matched array elements, not all of them.** Using `$set` on a matched array element with positional `$` only updates the first match. Use `$[]` for all elements or `$[<identifier>]` with `arrayFilters` for conditional updates:
```javascript
// Update price for a specific item in all orders
db.orders.updateMany(
{ "items.sku": "WIDGET-001" },
{ $set: { "items.$[item].priceCents": 2499 } },
{ arrayFilters: [{ "item.sku": "WIDGET-001" }] }
);
```
6. **Running aggregation pipelines without early $match.** Always filter as early as possible in the pipeline. A `$group` or `$unwind` before `$match` processes the entire collection unnecessarily. Put `$match` first to leverage indexes and reduce documents flowing through subsequent stages.
## Related Skills
- `postgresql` - Relational database patterns for structured data with complex relationships
- `caching` - Caching strategies to reduce database load
- `logging` - Logging patterns for query debugging and monitoring