How to design data storage for a huge tagging system (like Digg or Delicious)?
There is already a discussion about this, but it assumes a centralized database. Because the data is expected to grow substantially, we will have to partition it into multiple shards sooner or later. So the question becomes: how to design data storage for a partitioned tagging system?
The tagging system basically has 3 tables:
Item(item_id, item_content)
Tag(tag_id, tag_title)
TagMapping(map_id, tag_id, item_id)
That works fine for finding all items for a given tag and all tags for a given item when the tables are stored in a single database instance. If we have to partition the data across multiple database instances, it is not that simple.
The Item table can be partitioned easily by its key item_id, and the Tag table by its key tag_id. For example, to partition Tag across K databases, we can simply store a given tag in database number (tag_id % K).
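To make the modulo scheme concrete, here is a minimal sketch. The shards are plain in-memory dicts standing in for K separate database instances, and K = 4 and the names shard_for, put_tag and get_tag are just assumptions for illustration:

```python
K = 4  # assumed number of database instances (shards)

def shard_for(key_id: int, num_shards: int = K) -> int:
    """Return the index of the shard that stores the row with this key."""
    return key_id % num_shards

# One dict per shard, standing in for K separate Tag tables.
tag_shards = [dict() for _ in range(K)]

def put_tag(tag_id: int, tag_title: str) -> None:
    # Write the tag to the single shard chosen by tag_id % K.
    tag_shards[shard_for(tag_id)][tag_id] = tag_title

def get_tag(tag_id: int):
    # A lookup by key touches exactly one shard.
    return tag_shards[shard_for(tag_id)].get(tag_id)
```

The same function would work for Item with item_id as the key. Note that with plain modulo, changing K moves most rows to a different shard.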
But how to partition the TagMapping table?
The TagMapping table represents a many-to-many relationship. The only approach I can imagine is duplication: the same TagMapping content is kept in two copies, one partitioned by tag_id and the other by item_id. To find the tags for a given item, we use the copy partitioned by item_id. To find the items for a given tag, we use the copy partitioned by tag_id.
As a result, there is data redundancy, and the application level has to keep the two copies consistent. That looks hard.
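The duplication idea I have in mind looks roughly like the sketch below. Again the shards are in-memory dicts rather than real databases, and K and the names by_tag, by_item, add_mapping, items_for_tag and tags_for_item are made up for illustration:

```python
from collections import defaultdict

K = 4  # assumed number of shards for each copy

# Copy 1: TagMapping partitioned by tag_id (answers "items for a tag").
by_tag = [defaultdict(set) for _ in range(K)]
# Copy 2: TagMapping partitioned by item_id (answers "tags for an item").
by_item = [defaultdict(set) for _ in range(K)]

def add_mapping(tag_id: int, item_id: int) -> None:
    # Two separate writes, usually to two different shards.
    # Nothing here makes them atomic; that is the consistency problem.
    by_tag[tag_id % K][tag_id].add(item_id)
    by_item[item_id % K][item_id].add(tag_id)

def items_for_tag(tag_id: int) -> set:
    # Single-shard read from the copy partitioned by tag_id.
    return set(by_tag[tag_id % K].get(tag_id, ()))

def tags_for_item(item_id: int) -> set:
    # Single-shard read from the copy partitioned by item_id.
    return set(by_item[item_id % K].get(item_id, ()))
```

Each query touches only one shard, but every insert or delete has to be applied to both copies, and a failure between the two writes leaves them out of sync.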
Is there a better solution to this many-to-many partitioning problem?