Sharding Details

To do this, you tell MongoDB to use one of your indexes as a shard key. It then divides your documents into chunks with similar shard keys. These chunks are then spread out across your replica sets, in approximate shard key order.
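For concreteness, here's a minimal sketch of what that looks like in the mongo shell (the `analytics.events` namespace and the `someField` key are placeholders, not our actual setup):

```javascript
// Sharding has to be enabled for the database first.
sh.enableSharding("analytics")

// The shard key needs a supporting index on the collection
// (assuming `use analytics` has been run).
db.events.createIndex({ someField: 1 })

// Tell MongoDB to use that index as the shard key; documents are then
// grouped into chunks by shard key range and balanced across the shards.
sh.shardCollection("analytics.events", { someField: 1 })
```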
1. The distribution of reads and writes
The most important of these is the distribution of reads and writes. If you're always writing to one machine, then that machine will have a high write lock %, and writes to your cluster will be slow. It doesn't matter how many machines you have in total, as all the writes will go to the same place. This is why you should never use the monotonically increasing `_id` or a timestamp as the shard key: you'll always be adding things to the last replica set.

Similarly, if all of your reads are going to the same replica set, then you'd better hope that your working set fits into RAM on one machine. By splitting reads evenly across all replica sets, you can scale your working set size linearly with the number of shards, and you will be utilising RAM and disks equally across all machines.
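One way to check whether data and load are actually being spread out is to look at the per-shard distribution of a collection; a quick sketch (the collection name is illustrative):

```javascript
// Prints how many documents, how much data, and how many chunks
// each shard holds for this collection.
db.events.getShardDistribution()
```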
2. The size of your chunks
Next in importance is the chunk size. MongoDB will split large chunks into smaller ones, but only where the shard key values differ. If you have too many documents with the same shard key, you end up with jumbo chunks. Jumbo chunks are bad not only because they cause the data to be unevenly distributed, but also because once they grow too large they cannot be moved between shards at all.
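You can spot jumbo chunks in the config database, since the balancer flags chunks it has given up on moving; a sketch, run through a mongos:

```javascript
// Chunks the balancer has marked as too large to split or move.
db.getSiblingDB("config").chunks.find({ jumbo: true })
```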
3. The number of shards each query hits
Finally, it's nice to ensure that most queries hit as few shards as possible. The latency of a query is directly dependent on the latency of the slowest server it hits, so the fewer shards you hit, the faster queries are likely to run. This isn't a hard requirement, but it's nice to strive for; because the distribution of chunks onto shards is only approximately in shard key order, it can never be enforced strictly.
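`explain()` is a handy way to see how many shards a query actually touches; a sketch against the illustrative `events` collection from above:

```javascript
// On a sharded collection, the queryPlanner output lists the shards the
// query was routed to. A query that includes the shard key can be targeted
// at a single shard, while one that doesn't is scattered to all of them.
db.events.find({ someField: 42 }).explain("queryPlanner")
```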
Good shard key schemes
Hashed id
As a first approximation, you can use a hash of the `_id` of your documents:

```javascript
db.events.createIndex({_id: 'hashed'})
```
This will distribute reads and writes evenly, and it will ensure that each document has a different shard key so chunks can be fine-grained and small.
It’s not perfect, because queries for multiple documents will have to hit all shards, but it might be good enough.
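To actually shard on it, the hashed key goes straight into `shardCollection`; a sketch using the same illustrative namespace as above:

```javascript
// Hashing spreads consecutive _ids across the whole key space, so both
// inserts and reads are distributed evenly across the shards.
sh.shardCollection("analytics.events", { _id: "hashed" })
```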
Multi-tenant compound index
If you want to beat the hashed `_id` scheme, you need to come up with a way of grouping related documents close together in the index. At Bugsnag we group the documents by project: because of the way our app works, most queries are run in the scope of a project. You will have to figure out a grouping that works for your app.

We can't just use `projectId` as the shard key, because that leads to jumbo chunks, so we also include the `_id` to break large projects into multiple chunks. These chunks are still adjacent in the index, and so are still most likely to end up on the same shard:

```javascript
db.events.createIndex({projectId: 1, _id: 1})
```
This works particularly well for us because the number of reads and writes for a project is mostly independent of the age of that project, and old projects usually get deleted. If that weren't the case, we might see a slight imbalance towards higher load on more recent projects.
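Putting it together, here is a sketch of sharding on that compound key and of the kind of project-scoped query it helps (namespace and values are illustrative):

```javascript
// Use the compound index as the shard key.
sh.shardCollection("analytics.events", { projectId: 1, _id: 1 })

// someProjectId stands in for a real project's id. Because the query
// includes the shard key prefix, mongos can route it to the few shards
// holding that project's chunks instead of scattering it everywhere.
var someProjectId = ObjectId()
db.events.find({ projectId: someProjectId }).sort({ _id: -1 }).limit(50)
```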
To avoid this problem in the future, we will likely migrate to an index on `{projectId: 'hashed', _id: 1}`.
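A sketch of that future scheme (note that a compound shard key containing a hashed field is only supported on newer MongoDB releases):

```javascript
db.events.createIndex({ projectId: "hashed", _id: 1 })
sh.shardCollection("analytics.events", { projectId: "hashed", _id: 1 })
```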
In summary
Choosing a shard key is hard, but there are really only two options. If you can't find a good grouping key for your application, hash the `_id`. If you can, then go with that grouping key and add the `_id` to avoid jumbo chunks. Remember that whichever grouping key you use, it needs to also distribute reads and writes evenly to get the most out of each node in your cluster.