MongoDB Schema Design Patterns

Quick reference for embedding vs referencing decisions and common schema patterns.

Embedding vs Referencing Decision Tree

What is the relationship cardinality?
|
+-- One-to-Few (< 50 items)?
|   --> EMBED in parent document
|   Example: user.addresses, post.tags
|
+-- One-to-Many (50 - 1000s)?
|   |
|   +-- Child data always accessed with parent?
|   |   --> EMBED (but watch 16 MB doc limit)
|   |
|   +-- Child data accessed independently?
|   |   --> REFERENCE (store child _id in parent array)
|   |
|   +-- Need atomic updates across parent + children?
|       --> EMBED
|
+-- One-to-Millions?
|   --> REFERENCE from child to parent
|   Example: log_entry.host_id (not host.log_entry_ids)
|
+-- Many-to-Many?
    --> REFERENCE with array of _ids on one or both sides
    Example: student.course_ids[], course.student_ids[]
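
To make the branches concrete, the tree can be read as a small function. This is only a sketch: `chooseStrategy` and its parameter names are hypothetical, and the numeric cut-offs (50 and 1,000,000) are assumptions taken loosely from the tree, not MongoDB limits.

```javascript
// Hypothetical helper that encodes the decision tree above.
// The thresholds are illustrative assumptions, not hard MongoDB limits.
function chooseStrategy({
  cardinality,
  manyToMany = false,
  readWithParent = false,
  needsAtomicUpdates = false
}) {
  if (manyToMany) {
    return "reference: arrays of _ids on one or both sides";
  }
  if (cardinality < 50) {
    return "embed"; // one-to-few
  }
  if (cardinality >= 1000000) {
    return "reference: parent _id stored on each child"; // one-to-millions
  }
  // One-to-many: embed only when the access pattern calls for it.
  if (readWithParent || needsAtomicUpdates) {
    return "embed (watch the 16 MB limit)";
  }
  return "reference: child _ids stored in a parent array";
}

console.log(chooseStrategy({ cardinality: 3 }));
console.log(chooseStrategy({ cardinality: 500, readWithParent: true }));
console.log(chooseStrategy({ cardinality: 5000000 }));
```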

Decision Factors

| Factor | Favor Embedding | Favor Referencing |
|---|---|---|
| Read pattern | Always read together | Read independently |
| Write pattern | Infrequent child updates | Frequent child updates |
| Data size | Small, bounded children | Large or growing children |
| Atomicity | Need single-document atomic writes | Can tolerate multi-document transactions |
| Duplication | OK to denormalize | Must avoid duplication |
| Cardinality | Few items | Many/unbounded items |
| Document size | Well under 16 MB limit | Approaching 16 MB |

Pattern Catalog

1. Subset Pattern

Problem: A document is large, but most reads need only a small part of its embedded data.

Solution: Embed a subset; keep full data in a separate collection.

// products collection - fast reads for listing pages
{
  _id: ObjectId("..."),
  name: "Widget",
  price: 29.99,
  // Only the 10 most recent reviews (subset)
  recent_reviews: [
    { user: "alice", rating: 5, text: "Great!", date: ISODate("...") }
  ],
  review_count: 247
}

// reviews collection - full review data
{
  _id: ObjectId("..."),
  product_id: ObjectId("..."),
  user: "alice",
  rating: 5,
  text: "Great!",
  date: ISODate("..."),
  helpful_votes: 12
}

When to use: Product pages, user profiles, any "preview + detail" pattern.
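
One way to keep the embedded subset current is to update it in the same code path that writes the full review. The sketch below builds such an update with the `$push` modifiers `$each`, `$position`, and `$slice`. The helper name `buildRecentReviewUpdate` and the string `product_id` are placeholders for illustration.

```javascript
// Hypothetical helper: builds the update that keeps `recent_reviews`
// capped at the newest `limit` entries while bumping `review_count`.
function buildRecentReviewUpdate(review, limit = 10) {
  const { user, rating, text, date } = review; // copy only the preview fields
  return {
    $push: {
      recent_reviews: {
        $each: [{ user, rating, text, date }],
        $position: 0, // newest review goes first
        $slice: limit // keep only the first `limit` elements
      }
    },
    $inc: { review_count: 1 }
  };
}

const review = {
  product_id: "p1", // placeholder; an ObjectId in a real collection
  user: "alice",
  rating: 5,
  text: "Great!",
  date: new Date("2025-01-15T00:00:00Z"),
  helpful_votes: 0
};
const update = buildRecentReviewUpdate(review);

// In mongosh, the two writes might look like:
//   db.reviews.insertOne(review);
//   db.products.updateOne({ _id: review.product_id }, update);
```

The two writes are not atomic together, so a failure between them can leave the preview briefly out of step with the reviews collection. Whether that is acceptable depends on the application.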

2. Computed Pattern

Problem: Expensive aggregation queries run repeatedly on the same data.

Solution: Pre-compute and store the result; update it on write.

// movies collection
{
  _id: ObjectId("..."),
  title: "Example Movie",
  // Pre-computed from screenings collection
  computed: {
    total_revenue: 1250000,
    avg_rating: 4.2,
    rating_count: 843,
    last_computed: ISODate("2025-01-15T00:00:00Z")
  }
}

Update strategy: On each new rating, increment the count and recalculate the average. For less time-sensitive data, recompute in a background job instead.
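
The incremental update needs only the stored average and count: new_avg = (avg * count + rating) / (count + 1). The sketch below shows that arithmetic in plain JavaScript and as a pipeline-style update. Both helper names are hypothetical, and `movieId` in the usage comment is a placeholder.

```javascript
// Hypothetical helper: folds one new rating into the stored aggregates
// without rescanning every rating.
function applyRating(computed, rating, now = new Date()) {
  const rating_count = computed.rating_count + 1;
  const avg_rating =
    (computed.avg_rating * computed.rating_count + rating) / rating_count;
  return { ...computed, avg_rating, rating_count, last_computed: now };
}

// The same arithmetic as an update pipeline, so the server performs the
// read-modify-write as one atomic single-document operation. Within a
// single $set stage, "$computed.*" refers to the pre-update values.
function buildRatingPipeline(rating) {
  return [
    {
      $set: {
        "computed.avg_rating": {
          $divide: [
            {
              $add: [
                { $multiply: ["$computed.avg_rating", "$computed.rating_count"] },
                rating
              ]
            },
            { $add: ["$computed.rating_count", 1] }
          ]
        },
        "computed.rating_count": { $add: ["$computed.rating_count", 1] },
        "computed.last_computed": "$$NOW"
      }
    }
  ];
}

const next = applyRating({ avg_rating: 4, rating_count: 3 }, 5);
console.log(next.avg_rating, next.rating_count);

// mongosh usage (movieId is a placeholder):
//   db.movies.updateOne({ _id: movieId }, buildRatingPipeline(5));
```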

When to use: Dashboards, leaderboards, summary statistics.

3. Bucket Pattern

Problem: Many small, time-series documents create overhead (indexes, storage per doc).

Solution: Group related data into fixed-size buckets.

// sensor_readings collection - one doc per sensor per hour
{
  sensor_id: "sensor-42",
  bucket_start: ISODate("2025-01-15T14:00:00Z"),
  bucket_end: ISODate("2025-01-15T14:59:59Z"),
  count: 60,
  readings: [
    { ts: ISODate("2025-01-15T14:00:00Z"), temp: 22.1, humidity: 45 },
    { ts: ISODate("2025-01-15T14:01:00Z"), temp: 22.3, humidity: 44 }
    // ... up to 60 readings per bucket
  ],
  // Pre-computed aggregates for the bucket
  summary: {
    avg_temp: 22.4,
    min_temp: 21.8,
    max_temp: 23.1
  }
}

Bucket sizing: Choose a size that balances doc count reduction vs update frequency. Common choices: 1 hour, 1 day, 100 events.
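
For the hourly layout shown above, each write first has to find the bucket that a reading belongs to. The sketch below floors the timestamp to its UTC hour and builds an upsert that appends the reading and maintains the running aggregates. The helper names are hypothetical, and `summary.sum_temp` is an assumed extra field (not in the example document) from which `avg_temp` could be derived as `sum_temp / count`.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Floors a timestamp to the start of its UTC hour.
function bucketStart(ts) {
  return new Date(Math.floor(ts.getTime() / HOUR_MS) * HOUR_MS);
}

// Hypothetical helper: builds the filter and update for an upsert that
// appends one reading to its hourly bucket.
function buildBucketUpsert(sensorId, reading) {
  const start = bucketStart(reading.ts);
  return {
    filter: { sensor_id: sensorId, bucket_start: start },
    update: {
      // Set once, when the bucket document is first created.
      $setOnInsert: { bucket_end: new Date(start.getTime() + HOUR_MS - 1000) },
      $push: { readings: reading },
      // sum_temp is an assumed field, used to derive the average later.
      $inc: { count: 1, "summary.sum_temp": reading.temp },
      $min: { "summary.min_temp": reading.temp },
      $max: { "summary.max_temp": reading.temp }
    },
    options: { upsert: true }
  };
}

const op = buildBucketUpsert("sensor-42", {
  ts: new Date("2025-01-15T14:37:12Z"),
  temp: 22.6,
  humidity: 44
});
console.log(op.filter.bucket_start.toISOString());

// mongosh usage:
//   db.sensor_readings.updateOne(op.filter, op.update, op.options);
```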

When to use: IoT, time-series, event logging, analytics.

4. Outlier Pattern

Problem: A few documents have vastly more data than the norm (e.g., a viral post with millions of likes).

Solution: Flag outliers and overflow into separate documents.

// books collection - normal case
{
  _id: ObjectId("..."),
  title: "Normal Book",
  customers_purchased: ["user1", "user2", "user3"],
  has_overflow: false
}

// books collection - outlier (bestseller)
{
  _id: ObjectId("..."),
  title: "Bestseller",
  customers_purchased: ["user1", "user2", /* ... first 1000 */],
  has_overflow: true
}

// book_purchases_overflow collection
{
  book_id: ObjectId("..."),
  page: 2,
  customers_purchased: ["user1001", "user1002", /* ... next 1000 */]
}

When to use: Social media (viral posts), e-commerce (bestsellers), any data with power-law distribution.
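
On the read side, the `has_overflow` flag lets the common case skip the second query entirely. A minimal sketch, assuming a hypothetical `fetchOverflow` callback that stands in for the overflow-collection query:

```javascript
// Hypothetical helper: merges a book's embedded purchasers with any
// overflow pages. `fetchOverflow` stands in for a query such as
//   db.book_purchases_overflow.find({ book_id: book._id }).toArray()
function allPurchasers(book, fetchOverflow) {
  if (!book.has_overflow) {
    return book.customers_purchased; // common case: no second query
  }
  const pages = fetchOverflow(book._id)
    .slice() // avoid mutating the caller's array
    .sort((a, b) => a.page - b.page);
  return pages.reduce(
    (all, page) => all.concat(page.customers_purchased),
    book.customers_purchased.slice()
  );
}

const bestseller = {
  _id: 2,
  title: "Bestseller",
  customers_purchased: ["user1", "user2"],
  has_overflow: true
};
const overflowPages = [
  { book_id: 2, page: 3, customers_purchased: ["user5"] },
  { book_id: 2, page: 2, customers_purchased: ["user3", "user4"] }
];
console.log(allPurchasers(bestseller, () => overflowPages));
```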

5. Extended Reference Pattern

Problem: Frequent joins (lookups) to get a few fields from a referenced document.

Solution: Copy the most-accessed fields into the referencing document.

// orders collection
{
  _id: ObjectId("..."),
  date: ISODate("..."),
  customer_id: ObjectId("..."),
  // Extended reference - copied fields for fast reads
  customer_name: "Alice Smith",
  customer_email: "alice@example.com",
  items: [
    {
      product_id: ObjectId("..."),
      product_name: "Widget",  // copied
      price: 29.99,            // copied (snapshot at time of order)
      quantity: 2
    }
  ]
}

Trade-off: Copied fields can go stale. That is acceptable, and often intended, for snapshots such as the price at purchase time. For data that must be current, keep only the reference.
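
The copy happens at write time. The sketch below assembles an order like the one above from a customer document and a product lookup. `buildOrder` is a hypothetical helper; the `name` and `email` fields on the customer, and a `Map` keyed by product id, are assumptions made for illustration.

```javascript
// Hypothetical helper: assembles an order, copying the frequently read
// customer and product fields next to each reference.
// `productsById` is assumed to be a Map keyed by product id.
function buildOrder(customer, cart, productsById, now = new Date()) {
  return {
    date: now,
    customer_id: customer._id,
    customer_name: customer.name, // copied
    customer_email: customer.email, // copied
    items: cart.map(({ product_id, quantity }) => {
      const product = productsById.get(product_id);
      return {
        product_id,
        product_name: product.name, // copied
        price: product.price, // snapshot at time of order
        quantity
      };
    })
  };
}

const catalog = new Map([["p1", { name: "Widget", price: 29.99 }]]);
const order = buildOrder(
  { _id: "c1", name: "Alice Smith", email: "alice@example.com" },
  [{ product_id: "p1", quantity: 2 }],
  catalog
);

// A later price change does not affect the stored snapshot.
catalog.get("p1").price = 34.99;
console.log(order.items[0].price);
```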

When to use: Orders (snapshot pricing), notifications (snapshot user name), audit logs.

6. Polymorphic Pattern

Problem: Objects share some fields but differ in others (e.g., different product types).

Solution: Store in a single collection with a type discriminator.

// vehicles collection
{ type: "car", make: "Toyota", doors: 4, trunk_size_liters: 450 }
{ type: "truck", make: "Ford", doors: 2, payload_kg: 5000 }
{ type: "motorcycle", make: "Harley", engine_cc: 1200 }

Index strategy: Index common fields. Use partial indexes for type-specific fields.

db.vehicles.createIndex(
  { payload_kg: 1 },
  { partialFilterExpression: { type: "truck" } }
);

When to use: Product catalogs, content management (articles, videos, images), mixed event streams.
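
Application code usually branches on the discriminator when reading these documents. A minimal sketch, with `describeVehicle` as a hypothetical helper over the three example shapes:

```javascript
// Hypothetical helper: formats a vehicle document by branching on its
// `type` discriminator; the shared fields are handled once, up front.
function describeVehicle(doc) {
  const base = `${doc.make} ${doc.type}`;
  switch (doc.type) {
    case "car":
      return `${base}: ${doc.doors} doors, ${doc.trunk_size_liters} L trunk`;
    case "truck":
      return `${base}: ${doc.doors} doors, ${doc.payload_kg} kg payload`;
    case "motorcycle":
      return `${base}: ${doc.engine_cc} cc engine`;
    default:
      return base; // unknown types still expose the shared fields
  }
}

console.log(describeVehicle({ type: "truck", make: "Ford", doors: 2, payload_kg: 5000 }));

// For the partial index above to be eligible, a query has to include
// the index's filter condition, for example:
//   db.vehicles.find({ type: "truck", payload_kg: { $gte: 3000 } });
```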

Anti-Patterns

| Mistake | Problem | Fix |
|---|---|---|
| Unbounded array growth | Document exceeds 16 MB | Use bucket or outlier pattern |
| Deep nesting (> 3 levels) | Hard to query and index | Flatten or reference |
| Normalizing everything | Too many lookups, slow reads | Embed when read together |
| Embedding large blobs | Wastes RAM in working set | Store in GridFS or S3 |
| No schema validation | Inconsistent data over time | Use JSON Schema validation |
| Indexing every field | Slow writes, wasted space | Index based on query patterns |

Schema Validation

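Use db.createCollection() with a $jsonSchema validator to enforce structure. Set validationLevel: "moderate" to validate inserts and updates to documents that already satisfy the schema; updates to existing documents that do not satisfy it are not checked. The default, "strict", validates all inserts and updates.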
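
As an illustration, a validator for the `products` collection from the Subset Pattern example might look like the sketch below. The required fields, the `maxItems` cap, and the `productsOptions` name are assumptions chosen for the example, not rules taken from the patterns above.

```javascript
// Hypothetical validator options for the `products` collection; the
// required fields and limits are illustrative.
const productsOptions = {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["name", "price"],
      properties: {
        name: { bsonType: "string", description: "must be a string" },
        price: {
          bsonType: ["double", "int", "long", "decimal"],
          minimum: 0,
          description: "must be a non-negative number"
        },
        recent_reviews: {
          bsonType: "array",
          maxItems: 10, // matches the "10 most recent reviews" subset
          items: { bsonType: "object", required: ["user", "rating"] }
        }
      }
    }
  },
  validationLevel: "moderate",
  validationAction: "error" // reject invalid writes instead of only logging
};

// In mongosh, for a new collection:
//   db.createCollection("products", productsOptions);
// or, for an existing collection:
//   db.runCommand({ collMod: "products", ...productsOptions });
```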