Contents
- Relationship-based Strategies
- General Rules for MongoDB Schema Design
- Anti-Patterns
- Recommended Patterns: Read Ratio to Write Ratio, Avoid Application Joins, Pre-aggregate Data, Avoid Growing Documents (MMAP), Avoid Updating Whole Documents (MMAP), Pre-allocated Documents (MMAP), Field Names Take Up Space (MMAP), Over-Eager Indexing, Custom _id Field, Covered Indexes

Relationship-based Strategies
- One-to-One - Prefer key value pairs within the document
- One-to-Few - Prefer embedding
- One-to-Many - Prefer embedding
- One-to-Squillions - Prefer referencing
- Many-to-Many - Prefer referencing
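To make the difference between embedding and referencing concrete, here is a minimal sketch using plain Python dictionaries as stand-ins for documents. The blog-post and host-log data, the field names, and the `logs_for_host` helper are all hypothetical and only illustrate the two shapes.

```python
# Hypothetical data, used only to contrast the two document shapes.

# One-to-few: comments are embedded, so one read returns the post and its comments.
post_embedded = {
    "_id": "post-1",
    "title": "Schema design",
    "comments": [
        {"author": "ann", "text": "Nice"},
        {"author": "bob", "text": "Thanks"},
    ],
}

# One-to-squillions: each log line references its host instead of living inside it.
host = {"_id": "host-1", "name": "web-01"}
log_lines = [
    {"_id": f"log-{i}", "host_id": host["_id"], "message": f"event {i}"}
    for i in range(3)
]


def logs_for_host(host_id, logs):
    """In-memory stand-in for a query such as find({"host_id": host_id})."""
    return [line for line in logs if line["host_id"] == host_id]
```

With the referenced shape the host document stays small no matter how many log lines exist, at the cost of a second query to fetch them.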
General Rules for MongoDB Schema Design
- Rule 1: Favor embedding unless there is a compelling reason not to.
- Rule 2: Needing to access an object on its own is a compelling reason not to embed it.
- Rule 3: Avoid joins and lookups if possible, but don't be afraid if they can provide a better schema design.
- Rule 4: Arrays should not grow without bound. If there are more than a couple of hundred documents on the many side, don't embed them; if there are more than a few thousand documents on the many side, don't use an array of ObjectID references. High-cardinality arrays are a compelling reason not to embed.
- Rule 5: As always, with MongoDB, how you model your data depends entirely on your particular application's data access patterns. You want to structure your data to match the ways that your application queries and updates it.
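The rules above can be read as a rough decision procedure. The sketch below is one possible encoding, not an official algorithm: the numeric cut-offs are assumptions standing in for "a couple of hundred" and "a few thousand", and the function name and return strings are made up for illustration.

```python
# Assumed cut-offs: Rule 4 only says "a couple of hundred" and "a few thousand".
EMBED_LIMIT = 200
REFERENCE_ARRAY_LIMIT = 2000


def choose_strategy(many_side_count, accessed_on_its_own=False):
    """Suggest a modelling strategy for the 'many' side of a relationship."""
    if accessed_on_its_own:
        # Rule 2: standalone access is a compelling reason not to embed.
        return "reference"
    if many_side_count <= EMBED_LIMIT:
        # Rule 1: embedding is the default.
        return "embed"
    if many_side_count <= REFERENCE_ARRAY_LIMIT:
        # Rule 4: too many to embed, but an array of references is still bounded.
        return "array of references on the one side"
    # Rule 4: even an array of references would grow too large.
    return "parent reference on the many side"
```

Rule 5 still applies: the actual access patterns of the application should override any such heuristic.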
Anti-Patterns
Below is a brief description of each of the schema design anti-patterns we've covered in this series.
- Massive arrays: storing massive, unbounded arrays in your documents.
- Massive number of collections: storing a massive number of collections (especially if they are unused or unnecessary) in your database.
- Unnecessary indexes: storing an index that is unnecessary because it is (1) rarely used if at all or (2) redundant because another compound index covers it.
- Bloated documents: storing large amounts of data together in a document when that data is not frequently accessed together.
- Separating data that is accessed together: separating data between different documents and collections that is frequently accessed together.
- Case-insensitive queries without case-insensitive indexes: frequently executing a case-insensitive query without having a case-insensitive index to cover it.
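As an illustration of the "redundant index" case, the following sketch flags any index whose keys are a leading prefix of another index on the same collection. Index definitions are modelled as lists of (field, direction) pairs; the example indexes are hypothetical, and the check deliberately ignores index options such as unique or sparse, which can make a prefix index worth keeping.

```python
def is_prefix(index, other):
    """True if `index` is a strict leading prefix of `other` (same fields and directions)."""
    return len(index) < len(other) and other[:len(index)] == index


def redundant_indexes(indexes):
    """Return the indexes that are covered by a longer compound index."""
    return [ix for ix in indexes if any(is_prefix(ix, other) for other in indexes)]


# Hypothetical index definitions for one collection.
indexes = [
    [("user_id", 1)],
    [("user_id", 1), ("created_at", -1)],
    [("created_at", -1)],
]
```

Here only the single-field `user_id` index is reported, because the compound index already supports queries on `user_id` alone; the `created_at` index is not a prefix of anything and is kept.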
Recommended Patterns
Read Ratio to Write Ratio
Whether your application is read heavy or write heavy should guide how you design your schema. If it is read heavy, you might want to choose a schema that minimizes the number of reads from MongoDB.
As an example, consider an auction-type website. Because most operations are reads caused by people browsing the catalog, it might make sense to use a denormalized schema for the product that includes as much relevant information as is needed to render the entire product page.
Similarly, if your application is write heavy, you might want to use a schema that maximizes MongoDB write throughput.
Avoid Application Joins
MongoDB was designed without server-side joins, so joins have traditionally been performed in the application itself (the $lookup aggregation stage, added in MongoDB 3.2, can now perform a left outer join on the server). Performance can suffer if you pull back and join a lot of data, because of the round-trips required to fetch it and the time it takes to perform the in-application join. If you find that your schema depends on a lot of joins, it might make sense to denormalize it in order to reduce their number.
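The cost of an application-side join can be sketched with a small in-memory stand-in for a collection that counts simulated round-trips. `FakeCollection`, the order/customer/item data, and both loader functions are hypothetical and exist only to compare the two schemas.

```python
class FakeCollection:
    """In-memory stand-in for a collection that counts simulated round-trips."""

    def __init__(self, docs):
        self.docs = {d["_id"]: d for d in docs}
        self.round_trips = 0

    def find_one(self, _id):
        self.round_trips += 1
        return self.docs.get(_id)


# Normalized schema: an order points at a customer and at items.
orders = FakeCollection([{"_id": 1, "customer_id": 10, "item_ids": [100, 101]}])
customers = FakeCollection([{"_id": 10, "name": "Ann"}])
items = FakeCollection([{"_id": 100, "name": "Lamp"}, {"_id": 101, "name": "Desk"}])

# Denormalized schema: the order carries what the page needs.
denormalized_orders = FakeCollection(
    [{"_id": 1, "customer": "Ann", "items": ["Lamp", "Desk"]}]
)


def load_order_with_join(order_id):
    """Application-side join: one lookup per referenced document."""
    order = orders.find_one(order_id)
    customer = customers.find_one(order["customer_id"])
    lines = [items.find_one(item_id) for item_id in order["item_ids"]]
    return {"customer": customer["name"], "items": [line["name"] for line in lines]}


def load_order_denormalized(order_id):
    """Single lookup against the denormalized document."""
    order = denormalized_orders.find_one(order_id)
    return {"customer": order["customer"], "items": order["items"]}
```

Both loaders return the same result, but the joined version needs one lookup for the order, one for the customer and one per item, while the denormalized version needs a single lookup.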
Pre-aggregate Data
If you find that many of your application queries aggregate data, consider pre-aggregating it. One example is a page view counter: instead of summing up the views for a particular page on each request, increment a view counter for that page each time it is viewed and use this counter to show the number of page views.
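A minimal sketch of the page view counter follows. `view_increment` builds the filter and `$inc` update that could be passed to PyMongo's `update_one(filter, update, upsert=True)`; the `views` field name and the in-memory `apply_inc` helper are assumptions that let the example run without a server.

```python
def view_increment(page_id, amount=1):
    """Return (filter, update) for an atomic counter bump on one page."""
    return {"_id": page_id}, {"$inc": {"views": amount}}


def apply_inc(store, flt, update):
    """Tiny in-memory imitation of an upserting $inc, for illustration only."""
    doc = store.setdefault(flt["_id"], {"_id": flt["_id"]})
    for field, delta in update["$inc"].items():
        doc[field] = doc.get(field, 0) + delta
    return doc
```

Rendering the page then only reads the stored `views` value instead of counting individual view records on each request.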
Avoid Growing Documents (MMAP)
If you find that your schema design creates documents that constantly grow in size, it will have an impact on your disk IO and database performance. Document buckets and document pre-allocation help address these issues for the MMAP storage engine.
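One way to picture document buckets is a sensor feed where each document holds at most a fixed number of readings. The sketch below is an assumption-laden illustration: `BUCKET_SIZE`, the field names, and the in-memory `add_reading` helper are hypothetical. `bucket_update` shows the filter and update that would typically be sent with upsert enabled, so that a new bucket is created once the current one is full.

```python
BUCKET_SIZE = 3  # assumed cap on readings per bucket document


def bucket_update(sensor_id, reading):
    """Return (filter, update) that appends to a non-full bucket for this sensor."""
    flt = {"sensor_id": sensor_id, "count": {"$lt": BUCKET_SIZE}}
    update = {"$push": {"readings": reading}, "$inc": {"count": 1}}
    return flt, update


def add_reading(buckets, sensor_id, reading):
    """In-memory imitation of applying the update above with upsert=True."""
    for bucket in buckets:
        if bucket["sensor_id"] == sensor_id and bucket["count"] < BUCKET_SIZE:
            break  # found a bucket with room
    else:
        # No bucket with room: start a new one, as an upsert would.
        bucket = {"sensor_id": sensor_id, "count": 0, "readings": []}
        buckets.append(bucket)
    bucket["readings"].append(reading)
    bucket["count"] += 1
    return bucket
```

Because every bucket is capped, no single document grows without bound, however many readings a sensor produces.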
Avoid Updating Whole Documents (MMAP)
MongoDB provides atomic update operators that let you modify individual fields in an existing document, and in most cases they result in an in-place update when using the MMAP storage engine. This minimizes the time spent re-allocating documents in memory and improves write performance.
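Instead of sending a whole replacement document, an application can send only what changed. The helper below is a hypothetical sketch that compares two versions of a document at the top level and builds a `$set`/`$unset` update; nested fields are treated as opaque values, which is a simplification.

```python
_MISSING = object()  # sentinel for "field not present in the original"


def set_update(original, modified):
    """Build an update containing only the top-level fields that differ."""
    changed = {
        key: value
        for key, value in modified.items()
        if key != "_id" and original.get(key, _MISSING) != value
    }
    removed = {key: "" for key in original if key not in modified and key != "_id"}
    update = {}
    if changed:
        update["$set"] = changed
    if removed:
        update["$unset"] = removed
    return update
```

For a document where only `visits` changed, the resulting update touches that single field rather than rewriting every field of the document.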
Pre-allocated Documents (MMAP)
If your documents grow to a known maximum size, you can avoid document moves by pre-allocating that size up front, so that all subsequent operations on the document are in-place updates.
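As a sketch of pre-allocation, consider a per-day statistics document with one counter per hour. The document layout, the `_id` format, and both function names are assumptions made for illustration. The document is created with all 24 slots already present, so later increments only overwrite existing values.

```python
def preallocated_day(page_id, day):
    """Create a day document with every hourly slot already filled with zero."""
    return {
        "_id": f"{page_id}:{day}",
        "hourly": {str(hour): 0 for hour in range(24)},  # field names are strings
        "total": 0,
    }


def hit_update(hour):
    """Return the $inc update that records one hit in the given hour."""
    return {"$inc": {f"hourly.{hour}": 1, "total": 1}}
```

Since every field the update touches already exists at its final size class (a number), the document does not need to grow when hits are recorded.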
Field Names Take Up Space (MMAP)
In some cases, documents can use more space for field names than for the data they store. If so, consider shortening your field names if you are using MMAP, or switch to WiredTiger, which supports compression using snappy or zlib.
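Shortened field names do not have to leak into application code. The following sketch keeps a mapping between readable and stored names and translates at the persistence boundary; the mapping and the sample reading are hypothetical.

```python
# Hypothetical mapping from readable names to short stored names.
FIELD_MAP = {"timestamp": "ts", "temperature": "t", "humidity": "h"}
REVERSE_MAP = {short: long for long, short in FIELD_MAP.items()}


def shorten(doc):
    """Rename known fields to their short form before writing."""
    return {FIELD_MAP.get(key, key): value for key, value in doc.items()}


def expand(doc):
    """Restore readable field names after reading."""
    return {REVERSE_MAP.get(key, key): value for key, value in doc.items()}


def name_bytes(doc):
    """Total length of the top-level field names (ASCII names assumed)."""
    return sum(len(key) for key in doc)
```

The saving is per document, so it matters most for collections with many small documents and few distinct field names.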
Over-Eager Indexing
You might be tempted to add all kinds of indexes to your schema. Keep in mind that each index affects performance: indexes must be updated when documents change, so the more indexes a collection has, the more overhead each write operation incurs. Each index also takes up disk space and memory, so over-eager indexing can cause your storage size to balloon.
Custom _id Field
You can save some space and an additional index by overriding the meaning of the _id field. The only requirement is that _id is unique within the collection. For example, you might store a structure containing a timestamp, userid and machineid in _id, which lets you use the _id index to look up documents by that combination of values without having to create an additional index.
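A compound _id of this kind can be sketched as an embedded document built in a fixed key order, since equality on embedded documents is sensitive to field order. The `make_id` and `insert` helpers and the dictionary-backed store are hypothetical stand-ins that only mimic the uniqueness check the _id index would enforce.

```python
def make_id(timestamp, userid, machineid):
    """Build the compound _id; the key order is kept fixed on purpose."""
    return {"timestamp": timestamp, "userid": userid, "machineid": machineid}


def insert(store, doc):
    """In-memory imitation of an insert that enforces _id uniqueness."""
    key = tuple(doc["_id"].items())
    if key in store:
        raise ValueError("duplicate _id")
    store[key] = doc


def find_by_id(store, timestamp, userid, machineid):
    """Exact-match lookup on the whole compound _id."""
    return store.get(tuple(make_id(timestamp, userid, machineid).items()))
```

Lookups here match the complete _id value; querying on just one of its parts would be a different access pattern and may still call for its own index.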
Covered Indexes
If your application can leverage covered indexes, it might help performance, because a query may be answered entirely from the data stored in the index without materializing the underlying documents.
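Whether a query can be covered depends on the index keys, the filter and the projection together. The function below is a simplified, hypothetical check of that basic condition: every filtered and returned field must be part of the index, and _id counts as returned unless the projection excludes it. It ignores further restrictions, such as those on multikey indexes.

```python
def is_covered(index_fields, filter_fields, projection):
    """Rough check of whether an index could cover a query.

    `projection` maps field name -> 1 (include) or 0 (exclude),
    in the style of a find() projection.
    """
    indexed = set(index_fields)
    included = {field for field, on in projection.items() if on}
    if not included:
        # Empty or exclusion-only projections return whole documents.
        return False
    if projection.get("_id", 1):
        # _id is returned by default unless explicitly excluded.
        included.add("_id")
    return set(filter_fields) <= indexed and included <= indexed
```

For an index on `sku` and `price`, a filter on `sku` with projection `{"sku": 1, "price": 1, "_id": 0}` passes the check, while leaving out `"_id": 0` makes it fail because _id is not in the index.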