Say I've got a "message" table with 2 secondary indexes:

  • "recipient_id"
  • "sender_id"

I wish to shard the "message" table by "recipient_id". This way to retrieve all messages delivered to a particular recipient I only have to query one shard.

But simultaneously, I wish to have the ability to create a query that request for those messages sent with a certain sender. Now I'd rather not send that question to each single shard from the "message" table. One method to do that would be to duplicate the information and also have a "message_by_sender" table sharded by "sender_id".

The issue that way is the fact that whenever a message continues to be sent, I have to place the content into both "message" and "message_by_sender" tables.

But let's say after placing into "message" the insertion into "message_by_sender" fail? For the reason that situation the content is available in "message" although not in "message_by_sender".

How do you make certain when a note is available in "message" it also is available in "message_by_sender" without turning to two phase commit?

This should be one such problem for anybody who shards their databases. How can you deal woth it?

There's no "silver bullet" for this problem. Some options:

  1. Make use of a message queue to publish the alterations. Eventually the alterations would reach the various partitions.
  2. Possess a trigger around the message table partitions that induce a "index entry needed" row inside a table. Another thing would periodically scan this and make the index.

You might like to look at this blog entry about doing distributed transactions on the internet Application Engine: http://blog.notdot.net/2009/9/Distributed-Transactions-on-App-Engine. Essentially, if you do not want 2phase commit or Paxos or something like that like this, you will want to reside with a few kind of eventually consistent model.

-Dork