mirror of https://github.com/duthaho/claudekit.git synced 2026-07-05 23:08:58 +03:00

duthaho 7fa9a48c6c feat: adding new skills, including testing patterns and methodologies, along with bundled resources for better usability.

2026-03-30 12:18:00 +07:00

MongoDB Schema Design Patterns

Quick reference for embedding vs referencing decisions and common schema patterns.

Embedding vs Referencing Decision Tree

What is the relationship cardinality?
|
+-- One-to-Few (< 50 items)?
|   --> EMBED in parent document
|   Example: user.addresses, post.tags
|
+-- One-to-Many (50 - 1000s)?
|   |
|   +-- Child data always accessed with parent?
|   |   --> EMBED (but watch 16 MB doc limit)
|   |
|   +-- Child data accessed independently?
|   |   --> REFERENCE (store child _id in parent array)
|   |
|   +-- Need atomic updates across parent + children?
|       --> EMBED
|
+-- One-to-Millions?
|   --> REFERENCE from child to parent
|   Example: log_entry.host_id (not host.log_entry_ids)
|
+-- Many-to-Many?
    --> REFERENCE with array of _ids on one or both sides
    Example: student.course_ids[], course.student_ids[]
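
As a rough illustration of the two ends of the tree, the sketch below contrasts an embedded one-to-few shape with a child-to-parent reference for a one-to-millions shape. The collection names (`hosts`, `log_entries`), the sample values, the `ts` field, and the helper `logsForHost` are hypothetical and only serve the example.

```javascript
// One-to-few: addresses live inside the user document (embedded).
const user = {
  _id: "user-1",
  name: "Alice",
  addresses: [
    { label: "home", city: "Springfield" },
    { label: "work", city: "Shelbyville" }
  ]
};

// One-to-millions: each log entry points at its host (child-to-parent
// reference), so the host document does not grow with the number of entries.
const host = { _id: "host-1", hostname: "web-01" };
const logEntry = {
  _id: "log-1",
  host_id: host._id,
  ts: new Date("2025-01-15T14:00:00Z"),
  msg: "started"
};

// Hypothetical helper: query spec for reading a host's most recent entries.
function logsForHost(hostId, limit = 100) {
  return { filter: { host_id: hostId }, sort: { ts: -1 }, limit };
}

// In mongosh this might be used as:
//   const q = logsForHost("host-1");
//   db.log_entries.find(q.filter).sort(q.sort).limit(q.limit);
```

A compound index on `{ host_id: 1, ts: -1 }` would likely support this read, assuming entries are mostly fetched per host in reverse time order.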

Decision Factors

Factor	Favor Embedding	Favor Referencing
Read pattern	Always read together	Read independently
Write pattern	Infrequent child updates	Frequent child updates
Data size	Small, bounded children	Large or growing children
Atomicity	Need single-document atomic writes	Can accept multi-document transactions
Duplication	OK to denormalize	Must avoid duplication
Cardinality	Few items	Many/unbounded items
Document size	Well under 16 MB limit	Approaching 16 MB

Pattern Catalog

1. Subset Pattern

Problem: Document is large but reads only need a few fields from embedded data.

Solution: Embed a subset; keep full data in a separate collection.

// products collection - fast reads for listing pages
{
  _id: ObjectId("..."),
  name: "Widget",
  price: 29.99,
  // Only the 10 most recent reviews (subset)
  recent_reviews: [
    { user: "alice", rating: 5, text: "Great!", date: ISODate("...") }
  ],
  review_count: 247
}

// reviews collection - full review data
{
  _id: ObjectId("..."),
  product_id: ObjectId("..."),
  user: "alice",
  rating: 5,
  text: "Great!",
  date: ISODate("..."),
  helpful_votes: 12
}

When to use: Product pages, user profiles, any "preview + detail" pattern.
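
One way to keep the embedded subset current is to push each new review into `recent_reviews` and let the server trim the array. The sketch below builds such an update with the `$push` modifiers `$each`, `$sort`, and `$slice`. The helper name `recentReviewsUpdate` is hypothetical, and the choice to omit `helpful_votes` from the embedded copy is an assumption based on the example documents above.

```javascript
// Hypothetical helper: builds the update that keeps `recent_reviews`
// capped at `limit` entries, newest first, and bumps `review_count`.
function recentReviewsUpdate(review, limit = 10) {
  return {
    $push: {
      recent_reviews: {
        // Embed only the fields the listing page needs.
        $each: [
          { user: review.user, rating: review.rating, text: review.text, date: review.date }
        ],
        $sort: { date: -1 }, // newest first
        $slice: limit        // keep only the first `limit` entries after sorting
      }
    },
    $inc: { review_count: 1 }
  };
}

// In mongosh, after writing the full review to the `reviews` collection:
//   db.reviews.insertOne(review);
//   db.products.updateOne({ _id: review.product_id }, recentReviewsUpdate(review));
```

The two writes are separate operations, so the subset can briefly lag the full collection unless both run inside a transaction.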

2. Computed Pattern

Problem: Expensive aggregation queries run repeatedly on the same data.

Solution: Pre-compute and store the result, update on write.

// movies collection
{
  _id: ObjectId("..."),
  title: "Example Movie",
  // Pre-computed from screenings collection
  computed: {
    total_revenue: 1250000,
    avg_rating: 4.2,
    rating_count: 843,
    last_computed: ISODate("2025-01-15T00:00:00Z")
  }
}

Update strategy: On each new rating, increment the count and recalculate the average. For less time-sensitive data, recompute in a background job instead.
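
The incremental update can be sketched as a running-average calculation. The first function below shows the arithmetic in plain JavaScript; the second expresses the same arithmetic as an aggregation-pipeline update (available since MongoDB 4.2) so that it runs atomically on the server. Both helper names are hypothetical, and the pipeline form assumes `computed.avg_rating` and `computed.rating_count` are already initialized on the document.

```javascript
// Pure function: new running average after one more rating.
function addRating(computed, rating) {
  const count = computed.rating_count + 1;
  const avg = (computed.avg_rating * computed.rating_count + rating) / count;
  return { ...computed, avg_rating: avg, rating_count: count };
}

// The same arithmetic as a pipeline update. Within a single $set stage,
// every expression reads the document as it was before the stage ran,
// so both fields are computed from the old count.
function addRatingPipeline(rating) {
  return [
    {
      $set: {
        "computed.avg_rating": {
          $divide: [
            { $add: [{ $multiply: ["$computed.avg_rating", "$computed.rating_count"] }, rating] },
            { $add: ["$computed.rating_count", 1] }
          ]
        },
        "computed.rating_count": { $add: ["$computed.rating_count", 1] },
        "computed.last_computed": "$$NOW"
      }
    }
  ];
}

// In mongosh:
//   db.movies.updateOne({ _id: movieId }, addRatingPipeline(5));
```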

When to use: Dashboards, leaderboards, summary statistics.

3. Bucket Pattern

Problem: Many small, time-series documents create overhead (indexes, storage per doc).

Solution: Group related data into fixed-size buckets.

// sensor_readings collection - one doc per sensor per hour
{
  sensor_id: "sensor-42",
  bucket_start: ISODate("2025-01-15T14:00:00Z"),
  bucket_end: ISODate("2025-01-15T14:59:59Z"),
  count: 60,
  readings: [
    { ts: ISODate("2025-01-15T14:00:00Z"), temp: 22.1, humidity: 45 },
    { ts: ISODate("2025-01-15T14:01:00Z"), temp: 22.3, humidity: 44 }
    // ... up to 60 readings per bucket
  ],
  // Pre-computed aggregates for the bucket
  summary: {
    avg_temp: 22.4,
    min_temp: 21.8,
    max_temp: 23.1
  }
}

Bucket sizing: Choose a size that balances doc count reduction vs update frequency. Common choices: 1 hour, 1 day, 100 events.
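
A common way to fill buckets is a single upsert per reading: it appends to the current bucket if there is room and creates a new one otherwise. The sketch below builds that upsert for hourly buckets of at most 60 readings. The helper name `bucketUpsert` and the `summary.sum_temp` field are assumptions. A running sum is kept because `avg_temp` cannot be maintained with `$min`/`$max`-style operators; the average is then `sum_temp / count` at read time.

```javascript
const HOUR_MS = 60 * 60 * 1000;
const MAX_READINGS = 60; // matches "up to 60 readings per bucket" above

// Hypothetical helper: builds an upsert that appends one reading to its
// hourly bucket, creating the bucket on the first write.
function bucketUpsert(sensorId, reading) {
  // Truncate the timestamp to the start of its UTC hour.
  const start = new Date(Math.floor(reading.ts.getTime() / HOUR_MS) * HOUR_MS);
  const end = new Date(start.getTime() + HOUR_MS - 1000); // e.g. 14:59:59Z
  return {
    // `count < MAX_READINGS` skips full buckets, so the upsert starts a new one.
    filter: { sensor_id: sensorId, bucket_start: start, count: { $lt: MAX_READINGS } },
    update: {
      $push: { readings: reading },
      $inc: { count: 1, "summary.sum_temp": reading.temp }, // sum_temp: assumed extra field
      $min: { "summary.min_temp": reading.temp },
      $max: { "summary.max_temp": reading.temp },
      $setOnInsert: { bucket_end: end }
    },
    options: { upsert: true }
  };
}

// In mongosh:
//   const op = bucketUpsert("sensor-42", { ts: new Date(), temp: 22.1, humidity: 45 });
//   db.sensor_readings.updateOne(op.filter, op.update, op.options);
```

Because a full bucket causes a second document for the same hour, this sketch assumes there is no unique index on `{ sensor_id, bucket_start }`.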

When to use: IoT, time-series, event logging, analytics.

4. Outlier Pattern

Problem: A few documents have vastly more data than the norm (e.g., a viral post with millions of likes).

Solution: Flag outliers and overflow into separate documents.

// books collection - normal case
{
  _id: ObjectId("..."),
  title: "Normal Book",
  customers_purchased: ["user1", "user2", "user3"],
  has_overflow: false
}

// books collection - outlier (bestseller)
{
  _id: ObjectId("..."),
  title: "Bestseller",
  customers_purchased: ["user1", "user2", /* ... first 1000 */],
  has_overflow: true
}

// book_purchases_overflow collection
{
  book_id: ObjectId("..."),
  page: 2,
  customers_purchased: ["user1001", "user1002", /* ... next 1000 */]
}

When to use: Social media (viral posts), e-commerce (bestsellers), any data with power-law distribution.
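
On the read side, the `has_overflow` flag tells the application whether a second query is needed. The sketch below merges a book document with its overflow pages. The helper name `allPurchasers` is hypothetical, and `overflowPages` stands in for whatever a query against `book_purchases_overflow` returns.

```javascript
// Hypothetical helper: merges the base document with its overflow pages.
// `overflowPages` stands in for the result of
//   db.book_purchases_overflow.find({ book_id: book._id }).toArray()
function allPurchasers(book, overflowPages = []) {
  if (!book.has_overflow) {
    // Common case: everything fits in the main document, no second query.
    return [...book.customers_purchased];
  }
  // Outlier case: append overflow pages in page order.
  const pages = [...overflowPages].sort((a, b) => a.page - b.page);
  return pages.reduce(
    (acc, p) => acc.concat(p.customers_purchased),
    [...book.customers_purchased]
  );
}
```

Only the rare outlier documents pay for the extra lookup; the typical document is still served by a single read.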

5. Extended Reference Pattern

Problem: Frequent joins (lookups) to get a few fields from a referenced document.

Solution: Copy the most-accessed fields into the referencing document.

// orders collection
{
  _id: ObjectId("..."),
  date: ISODate("..."),
  customer_id: ObjectId("..."),
  // Extended reference - copied fields for fast reads
  customer_name: "Alice Smith",
  customer_email: "alice@example.com",
  items: [
    {
      product_id: ObjectId("..."),
      product_name: "Widget",  // copied
      price: 29.99,            // copied (snapshot at time of order)
      quantity: 2
    }
  ]
}

Trade-off: Copied fields can go stale. That is acceptable, and often desirable, for snapshots (an order records the price at purchase time). For data that must be current, keep only the reference.
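
The copy step can be sketched as a small function that assembles the order at write time. The helper name `buildOrder` and the shape of its inputs (a customer document and a list of `{ product, quantity }` lines) are assumptions made for illustration.

```javascript
// Hypothetical helper: builds an order document, copying a few fields
// from the customer and product documents at purchase time.
function buildOrder(customer, lines, date = new Date()) {
  return {
    date,
    customer_id: customer._id,
    customer_name: customer.name,   // copied
    customer_email: customer.email, // copied
    items: lines.map(({ product, quantity }) => ({
      product_id: product._id,
      product_name: product.name, // copied
      price: product.price,       // snapshot at time of order
      quantity
    }))
  };
}

// In mongosh:
//   db.orders.insertOne(buildOrder(customer, [{ product: widget, quantity: 2 }]));
```

Later changes to the product's price do not touch existing orders, which is the snapshot behavior described above.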

When to use: Orders (snapshot pricing), notifications (snapshot user name), audit logs.

6. Polymorphic Pattern

Problem: Objects share some fields but differ in others (e.g., different product types).

Solution: Store in a single collection with a type discriminator.

// vehicles collection
{ type: "car", make: "Toyota", doors: 4, trunk_size_liters: 450 }
{ type: "truck", make: "Ford", doors: 2, payload_kg: 5000 }
{ type: "motorcycle", make: "Harley", engine_cc: 1200 }

Index strategy: Index common fields. Use partial indexes for type-specific fields.

db.vehicles.createIndex(
  { payload_kg: 1 },
  { partialFilterExpression: { type: "truck" } }
);
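
A partial index is only eligible when the query is guaranteed to match documents inside the index, so queries on `payload_kg` should also include the `type: "truck"` condition. A minimal sketch, with the helper name `truckPayloadFilter` being hypothetical:

```javascript
// Hypothetical helper: filter for trucks that can carry at least `minKg`.
// Including `type: "truck"` matches the index's partialFilterExpression,
// which lets the query planner consider the partial index.
function truckPayloadFilter(minKg) {
  return { type: "truck", payload_kg: { $gte: minKg } };
}

// In mongosh:
//   db.vehicles.find(truckPayloadFilter(3000));
```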

When to use: Product catalogs, content management (articles, videos, images), mixed event streams.

Anti-Patterns

Mistake	Problem	Fix
Unbounded array growth	Document exceeds 16 MB	Use bucket or outlier pattern
Deep nesting (> 3 levels)	Hard to query and index	Flatten or reference
Normalizing everything	Too many lookups, slow reads	Embed when read together
Embedding large blobs	Wastes RAM in working set	Store in GridFS or S3
No schema validation	Inconsistent data over time	Use JSON Schema validation
Indexing every field	Slow writes, wasted space	Index based on query patterns

Schema Validation

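Use db.createCollection() with a $jsonSchema validator to enforce structure. The default validationLevel: "strict" checks every insert and update. Set validationLevel: "moderate" to keep checking inserts and updates to documents that already satisfy the schema, while skipping updates to existing documents that do not. Neither level validates documents already in the collection retroactively.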
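
As a rough sketch, a validator for the `vehicles` collection from the Polymorphic Pattern might look like the following. The required fields, the allowed `type` values, and the numeric constraints are assumptions chosen to fit the sample documents, not rules taken from the source.

```javascript
// Assumed rules: every vehicle has a type and a make; type-specific
// numeric fields, when present, must be non-negative numbers.
const vehiclesOptions = {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["type", "make"],
      properties: {
        type: { enum: ["car", "truck", "motorcycle"] },
        make: { bsonType: "string" },
        doors: { bsonType: "number", minimum: 0 },
        payload_kg: { bsonType: "number", minimum: 0 }
      }
    }
  },
  validationLevel: "moderate",
  validationAction: "error" // reject invalid writes; "warn" only logs them
};

// `db` exists only inside mongosh; the guard keeps the snippet loadable elsewhere.
if (typeof db !== "undefined") {
  db.createCollection("vehicles", vehiclesOptions);
}
```

If the collection already exists, the same `validator`, `validationLevel`, and `validationAction` fields can be applied with the `collMod` command instead of `createCollection`.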
