
Sharding Details

Queries that contain the shard key and can be sent to a single shard or subset of shards are called targeted queries. Queries that must be sent to all shards are called scatter-gather queries: mongos scatters the query to all the shards and then gathers up the results.
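As a minimal sketch of the difference, assume a collection sharded on { userId: 1 } (the namespace and fields are hypothetical, and sh.enableSharding is presumed to have been run):
// collection sharded on userId (hypothetical namespace and field)
sh.shardCollection("app.orders", { userId: 1 });
// targeted: the predicate contains the shard key, so mongos routes it
// to the single shard owning that key range
db.orders.find({ userId: 4021, status: "open" });
// scatter-gather: no shard key in the predicate, so mongos sends the
// query to every shard and merges the results
db.orders.find({ status: "open" });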
For loading data as fast as possible, hashed shard keys are the best option. A hashed shard key can make any field randomly distributed, so it is a good choice if you're going to be using an ascending key in a lot of queries but want writes to be randomly distributed. The trade-off is that you can never do a targeted range query with a hashed shard key. If you will not be doing range queries, though, hashed shard keys are a good option.
 
  • Non-targeted Sharding
    • If you do not need the ability to target queries at specific chunks (accepting that queries will execute on every shard), the best option is to shard by hashed _id so that cardinality is high and hot shards are minimized
  • Targeted Sharding
    • To accomplish this, we use a compound shard key. The first value in the compound key is a rough, random value with low-ish cardinality. The second part of the shard key is an ascending key. This means that, within a chunk, values are always increasing. Thus, if you had one chunk per shard, you'd have the perfect setup: ascending writes on every shard (see the sketch after this list).
      Of course, having n chunks with n hotspots spread across n shards isn't very extensible: add a new shard and it won't get any writes, because there is no hot chunk to put on it. Thus, you want a few hotspot chunks per shard (to give you room to grow), but not too many. A few hotspot chunks preserve the effectiveness of ascending writes, but having, say, a thousand "hotspots" on a shard ends up being equivalent to random writes.
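A hypothetical sketch of this pattern, using a coarse server name as the low-cardinality random prefix and a timestamp as the ascending suffix (namespace and fields are illustrative):
// compound key: low-ish cardinality random prefix + ascending suffix
sh.shardCollection("logs.events", { server: 1, timestamp: 1 });
// writes for a given server are ascending within that server's chunk(s)
db.events.insert({ server: "web-03", timestamp: new Date(), msg: "booted" });
// queries that include the prefix remain targeted
db.events.find({ server: "web-03", timestamp: { $gt: new Date(Date.now() - 3600 * 1000) } });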

High Randomness Shard Key (Write Scaling)

A key with high randomness will evenly distribute the writes and reads across all the shards. This works well if documents are self-contained entities, such as users. However, queries for ranges of documents, such as all users with an age of less than 35, will require a scatter-gather query in which all shards are queried and a merge sort is done on mongos.
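For instance, a sketch assuming users are sharded on a high-randomness uuid field (names are illustrative):
// shard on a high-randomness key
sh.shardCollection("app.users", { uuid: 1 });
// equality on the shard key: targeted to a single shard
db.users.find({ uuid: "8f14e45f-ceea-467f-a0e7-29b8a5a3c1d2" });
// range on another field: scatter-gather, merge-sorted on mongos
db.users.find({ age: { $lt: 35 } }).sort({ age: 1 });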
 

Single Shard Targeted Key (Query Isolation)

Picking a shard key that groups related documents together will make most queries go to a specific shard, which can avoid scatter-gather queries. One possible example might be a Geo application for the UK, where the first part of the key is the postcode and the second is the address. Because the first part of the shard key is the postcode, all documents for a particular postcode will end up on the same shard, meaning all queries for a specific postcode will be routed to a single shard.
The UK postcode works well because it has a lot of possible values due to the resolution of postcodes in the UK. This means there will only be a limited number of documents in each chunk for a specific postcode. However, if we were to do this with a US postcode, we might find that each postcode covers a lot of addresses, making the chunks hard to split into new ranges. The effect is that MongoDB is less able to spread out the documents, and in the end this impacts performance.
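A sketch of the Geo example (namespace and field names are illustrative):
// compound key: postcode groups documents, address keeps cardinality high
sh.shardCollection("uk.addresses", { postcode: 1, address: 1 });
// targeted: all documents for one postcode live on the same shard
db.addresses.find({ postcode: "SW1A 1AA" });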
 

Sharding Anti-Patterns

There are a couple of sharding anti-patterns that you should keep in mind to avoid some of the more common pitfalls.

Monotonically increasing shard key

A monotonically increasing shard key is an increasing function, such as a counter or an ObjectId. When writing documents to a sharded system using an incrementing counter as its shard key, all documents will be written to the same shard and chunk until MongoDB splits the chunk and attempts to migrate it to a different shard. There are two simple solutions to avoid this issue.
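For illustration, this is roughly what the anti-pattern looks like in the shell:
// anti-pattern: an ascending ObjectId shard key sends every insert to the
// chunk holding the current maximum value, i.e. one hot shard
sh.shardCollection("users.users", { _id: 1 });
// each new ObjectId is greater than the last, so these all land together
db.users.insert({ name: "some user" });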

Hashing the shard key

From MongoDB 2.4, we can tell MongoDB to automatically hash all of the shard key values. If we wanted to create documents where the shard key is an ObjectId stored in _id, we could shard the collection using the following options.
sh.shardCollection( "users.users", { _id: "hashed" } )
Shard using hashed _id
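The trade-off mentioned earlier applies here: with a hashed key, equality matches on _id can still be targeted, but range queries cannot. A minimal sketch (the id variable is illustrative):
// equality on the hashed shard key is still targeted to one shard
var id = ObjectId("507f1f77bcf86cd799439011"); // illustrative id
db.users.find({ _id: id });
// a range query on _id is scatter-gather, because adjacent
// values hash to different chunks
db.users.find({ _id: { $gt: id } });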

Pre-split the chunks

The second alternative is to pre-split the shard key ranges and migrate the chunks manually, ensuring that writes will be distributed. Say you want to use a date timestamp as a shard key. Every day you set up a new sharded system to collect data, then crunch the data into aggregated numbers before throwing away the original data. We could pre-split the key range so each chunk represents a single minute of data.
// first switch to the data DB
use data;

// now enable sharding for the data DB
sh.enableSharding("data");

// enable sharding on the relevant collection
sh.shardCollection("data.data", {"_id" : 1});

// disable the balancer while we pre-split
sh.stopBalancer();
Shard using _id field
Let’s split the data by minutes.
// first switch to the admin db
use admin;

// 60 minutes in one hour, 24 hours in a day
var numberOfMinutesInDay = 60*24;
var date = new Date();
date.setHours(0);
date.setMinutes(0);
date.setSeconds(0);

// Pre-split the collection, one chunk boundary per minute of the day
for(var i = 0; i < numberOfMinutesInDay; i++) {
  db.runCommand({ split: "data.data", middle: { _id: date } });
  date.setMinutes(date.getMinutes() + 1);
}
Pre-split
Finally, re-enable the balancer and allow MongoDB to start balancing the chunks to the shards.
// Enable the balancer
sh.startBalancer();
Enable balancer
You can monitor the migrations by connecting to mongos with the mongo shell and executing the sh.status() command helper to view the current status of the sharded system. The goal of pre-splitting the ranges is to ensure that writes are distributed across all the shards even though we are using a monotonically increasing number.
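A minimal usage sketch (run against a mongos; the output shape varies by version):
// view the sharded cluster status, including chunk counts per shard
sh.status();
// a roughly even chunk count per shard indicates the pre-split
// ranges have finished balancing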