We're looking to build a web-based platform (API, servers, data, wahoo!). For context, suppose we need to build something like Twitter, but with the comments (tweets) organized around a live event. Information about the live event itself needs to be pushed to clients as quickly and consistently as possible, while comments about the event can probably wait a little longer to be pushed. We'll be read-heavy after the live event finishes.

Scalability is essential. We want to start by leasing VPS slices and scale from there. I'm a big fan of the cloud and want to stay there as long as possible. We'll most likely be using Ruby.

I'm thinking of using a document store rather than an RDBMS. I like the idea of schema-less storage and the promise of easier scalability by focusing on key-value.

The problem is that I'm not sure which technology is the best fit for our platform. I've looked at Couch, Mongo, Tokyo Cabinet, Cassandra, and an RDBMS with blobbed documents. Any help picking the right tool for this particular job?

Check out the NoSQL options comparison by BJ Clark.

"Scalability is essential."

With that in mind, consider these excerpts from his blog post:

  1. Tokyo Cabinet - does not scale
  2. Redis - does not scale
  3. Project Voldemort - scales
  4. MongoDB - limited (sharding is being implemented)
  5. Cassandra - scales
  6. Amazon S3 - scales
  7. Couch - does not scale (clustering & replication)
  8. MySQL - does not scale

Also consider Hypertable. It is a significant contender among the NoSQL options: an open source implementation of Google's BigTable concept. In my opinion it scales, because it is used extensively by the Chinese search engine Baidu and the entertainment portal Rediff.

You said:

"Information about the live event itself needs to be pushed to clients as quickly and consistently as possible, while comments about the event can probably wait a little longer to be pushed. We'll be read-heavy after the live event finishes."

This is similar to Twitter's approach. Your choice of programming language is also very important: Twitter initially chose Ruby for back-end message delivery, but they later said it was not a suitable choice and moved the whole message delivery system to Scala.

They're still using Ruby for their front end. If you want a very reliable, fault-tolerant system that holds up as you scale, you should look at Scala or Erlang.

Ramesh has a good summary. I'd add that Cassandra has a richer data model than vanilla Dynamo clones (like Voldemort or Dynomite): rows with named, sorted columns as opposed to just key/value. Cassandra is used by Twitter, Mahalo, Ooyala, SimpleGeo, WebEx, and others (http://n2.nabble.com/Cassandra-customers-survey-td4040068.html), at least some of which are running Cassandra clusters on EC2 or Rackspace cloud servers.
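To make the data-model difference concrete, here is a rough in-memory sketch in Python. It is not Cassandra's API: the `Row` class, its `put` and `slice` methods, the `event:42` key, and the timestamps are all made up for illustration. It only shows why sorted, named columns make range reads within one row cheap, which a plain key/value store cannot do without fetching and decoding the whole value.

```python
from bisect import bisect_left, insort

# Plain key/value: one opaque value per key. Reading part of it means
# fetching and decoding the whole blob on the client side.
kv_store = {"event:42": b"opaque serialized blob"}


class Row:
    """Hypothetical row: named columns kept sorted by column name."""

    def __init__(self):
        self._names = []    # column names, always in sorted order
        self._values = {}   # column name -> value

    def put(self, name, value):
        # Insert the name in sorted position only the first time it is seen.
        if name not in self._values:
            insort(self._names, name)
        self._values[name] = value

    def slice(self, start, end):
        """Return (name, value) pairs with start <= name < end."""
        lo = bisect_left(self._names, start)
        hi = bisect_left(self._names, end)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


# One row per live event; column names are ISO timestamps, which sort
# chronologically as plain strings.
rows = {}
row = rows.setdefault("event:42", Row())
row.put("2010-01-01T20:05:00", "Goal!")
row.put("2010-01-01T20:01:00", "Kick-off")
row.put("2010-01-01T20:30:00", "Half time")

# Range read: everything posted in the first ten minutes, in order.
first_ten_minutes = row.slice("2010-01-01T20:00:00", "2010-01-01T20:10:00")
```

For an event-centric feed like the one in the question, a layout along these lines (one row per event, one column per update or comment) is the kind of thing the richer model allows.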

If you want to scale horizontally (distribute your data over several nodes), you need to take the CAP theorem into account.

http://www.julianbrowne.com/article/viewer/brewers-cap-theorem

It's not easy stuff, but you have to choose; there's always some kind of trade-off.
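One common way this trade-off shows up in practice is the replica-count tuning found in Dynamo-style stores, sketched below in Python. The N/R/W parameters (number of replicas, replicas that must answer a read, replicas that must acknowledge a write) are an assumption brought in for illustration and are not something the linked article or this thread prescribes; the function names are hypothetical too. The only rule relied on is that a read is guaranteed to see the latest acknowledged write when R + W > N.

```python
def quorums_overlap(n, r, w):
    """True when every read quorum shares at least one replica
    with every write quorum, i.e. r + w > n."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


def tolerated_failures(n, r, w):
    """Replicas that can be down while both reads and writes
    can still gather enough responses."""
    return n - max(r, w)


# Three illustrative settings for a 3-replica cluster: (n, r, w).
settings = {
    "fast, may read stale data": (3, 1, 1),
    "quorum reads and writes": (3, 2, 2),
    "write to all replicas": (3, 1, 3),
}

# label -> (reads always see latest write?, replica failures tolerated)
summary = {
    label: (quorums_overlap(*nrw), tolerated_failures(*nrw))
    for label, nrw in settings.items()
}

for label, (consistent, failures) in summary.items():
    print(f"{label}: consistent reads={consistent}, tolerates {failures} down")
```

For the question's workload, live event details might use the stricter setting while comments use the relaxed one, accepting occasional stale reads in exchange for availability and latency.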