For a bit of background: this concerns a project running on a single small EC2 instance, which is about to migrate to a medium one. The main components are Django, MySQL, and a large number of custom analysis tools written in Python and Java, which do the heavy lifting. The same machine is running Apache as well.

The data model looks like the following: a large amount of real-time data comes in, streamed from various networked sensors, and ideally I'd like to establish a long-poll approach rather than the current poll-every-15-minutes approach (a limitation of computing stats and writing into the database itself). Once the data comes in, I store the raw version in MySQL, let the analysis tools loose on this data, and store statistics in another few tables. All of this is done using Django.

Relational features I'd need -

  • Order by [SliceRange in Cassandra's API seems to satisfy this]
  • Group by
  • Many-to-many relations between multiple tables [Cassandra SuperColumns seem to do well for one-to-many]
  • Sphinx on top of this gives me a nice full-text engine, so that's a necessity too. [On Cassandra, the Lucandra project seems to satisfy this need]
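To make the "order by" point concrete, here is a minimal sketch in plain Python of why a range slice can stand in for ORDER BY when columns are kept sorted by name (for example, by timestamp). This is not Cassandra's API: the `Row` class and the `start`, `finish`, and `count` parameter names are assumptions chosen for illustration only.

```python
import bisect


class Row:
    """A single row whose columns stay sorted by column name (hypothetical model)."""

    def __init__(self):
        self._names = []    # column names, kept in sorted order
        self._values = {}   # column name -> value

    def insert(self, name, value):
        # Only add the name to the sorted list the first time it is seen.
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start, finish, count=100):
        """Return up to `count` (name, value) pairs with start <= name <= finish."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, finish)
        return [(n, self._values[n]) for n in self._names[lo:hi][:count]]


# Usage: timestamps as column names, inserted out of order.
row = Row()
row.insert(1060, 21.9)
row.insert(1000, 21.5)
row.insert(1120, 22.3)
window = row.slice(1000, 1060)
```

Because the ordering is fixed when the data is written, only one sort order per row comes for free; any other ordering would need a second copy of the data keyed differently.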

My major problem is that data reads are very slow (and writes aren't that hot either). I don't want to throw a lot of money and hardware at it right now, and I'd prefer something that can scale easily over time. Vertically scaling MySQL isn't trivial in that sense (or cheap).

So essentially, after having read a lot about NoSQL and played around with things like MongoDB, Cassandra and Voldemort, my questions are:

  • On a medium EC2 instance, would I gain any benefits in reads/writes by shifting to something like Cassandra? This article (pdf) certainly seems to suggest that. Currently, I'd say a few hundred writes per minute would be the norm. For reads: since the data changes every few minutes or so, cache invalidation has to happen pretty quickly. At some point, it should be able to handle a large number of concurrent users too. The app's performance currently gets killed on MySQL doing some joins on large tables, even when indexes are created; something on the order of 32k rows takes more than a minute to render. (This may be an artifact of EC2 virtualized I/O too.) The tables hold around 4-5 million rows each, and there are about 5 such tables.

  • Everyone talks about using Cassandra on multiple nodes, given the CAP theorem and eventual consistency. But for a project that's just beginning to grow, does it make sense to deploy a one-node Cassandra server? Are there any caveats? For instance, can it replace MySQL as a backend for Django? [Is this recommended?]

  • If I do shift, I'm guessing I'll have to rewrite parts of the app to do a lot more "administrivia", since I'd have to do multiple lookups to fetch rows.

  • Would it make any sense to just use MySQL as a key-value store rather than a relational engine, and go with that? That way I could use the large number of stable APIs available, as well as a stable engine (and go relational as needed). (Bret Taylor's post from FriendFeed on this - http://bret.appspot.com/entry/how-friendfeed-uses-mysql)
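To illustrate what "MySQL as a key-value store" in the last bullet could look like, here is a rough sketch in the spirit of the linked post: schemaless entities stored as JSON blobs under a primary key, plus a separate index table for each attribute that needs lookups. It uses Python's built-in sqlite3 as a stand-in for MySQL so that it runs anywhere; the table names (`entities`, `index_sensor`), the `sensor_id` attribute, and the helper functions are all hypothetical, and the sketch assumes an entity's `sensor_id` never changes after it is first stored.

```python
import json
import sqlite3
import uuid

# sqlite3 stands in for MySQL here. INSERT OR REPLACE is SQLite syntax;
# MySQL would use REPLACE INTO or INSERT ... ON DUPLICATE KEY UPDATE.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entities (id TEXT PRIMARY KEY, body TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE index_sensor ("
    "sensor_id TEXT NOT NULL, entity_id TEXT NOT NULL, "
    "PRIMARY KEY (sensor_id, entity_id))"
)


def put(entity):
    """Store a schemaless dict as a JSON blob and write its index row."""
    entity_id = entity.setdefault("id", uuid.uuid4().hex)
    with conn:  # one transaction covers the blob and its index entry
        conn.execute(
            "INSERT OR REPLACE INTO entities (id, body) VALUES (?, ?)",
            (entity_id, json.dumps(entity)),
        )
        conn.execute(
            "INSERT OR REPLACE INTO index_sensor (sensor_id, entity_id) VALUES (?, ?)",
            (entity["sensor_id"], entity_id),
        )
    return entity_id


def get(entity_id):
    """Primary-key fetch; returns None when the id is unknown."""
    row = conn.execute(
        "SELECT body FROM entities WHERE id = ?", (entity_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


def find_by_sensor(sensor_id):
    """Two-step lookup: consult the index table, then fetch each entity by key."""
    ids = conn.execute(
        "SELECT entity_id FROM index_sensor WHERE sensor_id = ?", (sensor_id,)
    ).fetchall()
    return [get(entity_id) for (entity_id,) in ids]


# Usage
first = put({"sensor_id": "s1", "value": 3.5})
put({"sensor_id": "s1", "value": 4.0})
put({"sensor_id": "s2", "value": 1.0})
s1_values = sorted(e["value"] for e in find_by_sensor("s1"))
```

`find_by_sensor` also shows the "multiple lookups" cost mentioned above: every indexed query becomes an index read followed by one fetch per matching key.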

Any insights from people who've made such a shift would be greatly appreciated!

Thanks.

Cassandra and the other distributed databases available today don't provide the kind of ad-hoc query support you are used to from SQL. This is because you can't distribute queries with joins performantly, so the emphasis is on denormalization instead.
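As a small illustration of what denormalization means in practice, the following plain-Python sketch contrasts joining at read time with copying the joined fields into each row at write time. The `sensors`, `readings`, and `readings_by_site` structures and their field names are made up for the example and are not taken from the question's schema.

```python
# Normalized: two "tables" that a SQL query would join at read time.
sensors = {1: {"name": "roof-temp", "site": "plant-a"}}
readings = [
    {"sensor_id": 1, "ts": 1000, "value": 21.5},
    {"sensor_id": 1, "ts": 1060, "value": 21.9},
]


def join_at_read_time(site):
    """What the relational query does: filter and join on every read."""
    return [
        {**r, "sensor_name": sensors[r["sensor_id"]]["name"]}
        for r in readings
        if sensors[r["sensor_id"]]["site"] == site
    ]


# Denormalized: rows are keyed by what the read needs (the site) and already
# carry the sensor name, so a read is a single key lookup with no join.
readings_by_site = {}


def write_denormalized(reading):
    """Pay the join cost once, at write time."""
    sensor = sensors[reading["sensor_id"]]
    row = {**reading, "sensor_name": sensor["name"]}
    readings_by_site.setdefault(sensor["site"], []).append(row)


for r in readings:
    write_denormalized(r)
```

The trade-off is that a renamed sensor would have to be rewritten in every copied row, which is the kind of bookkeeping the application takes on when the database no longer performs joins.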

However, Cassandra 0.6 (beta officially out tomorrow, but you can build from the 0.6 branch yourself if you're impatient) supports Hadoop map/reduce for statistics, which actually sounds like a good fit for you.
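For readers unfamiliar with the map/reduce pattern, here is a toy, single-process sketch of how per-sensor statistics fit that shape. It does not use Hadoop or Cassandra; the function names and the record layout (`sensor_id`, `value`) are assumptions for illustration only.

```python
from collections import defaultdict


def map_reading(reading):
    """Map step: emit (key, value) pairs, here one pair per reading."""
    yield reading["sensor_id"], reading["value"]


def reduce_values(sensor_id, values):
    """Reduce step: collapse all values seen for one key into a summary."""
    values = list(values)
    return sensor_id, {"count": len(values), "mean": sum(values) / len(values)}


def run_map_reduce(records):
    """Group mapped pairs by key (the 'shuffle'), then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_reading(record):
            groups[key].append(value)
    return dict(reduce_values(key, vals) for key, vals in groups.items())


# Usage
stats = run_map_reduce([
    {"sensor_id": "a", "value": 1.0},
    {"sensor_id": "b", "value": 10.0},
    {"sensor_id": "a", "value": 3.0},
])
```

In a real cluster the map and reduce steps would run in parallel on the nodes holding the data; the grouping loop here only imitates that on one machine.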

Cassandra provides excellent support for adding new nodes painlessly, even to an initial group of one.

That said, at a few hundred writes/minute you're going to be fine on MySQL for a long, long time. Cassandra is much better at being a key/value store (even better, key/columnfamily), but MySQL is much better at being a relational database. :)

There is no Django support for Cassandra (or any other NoSQL database) yet. They are talking about doing something for the next version after 1.2, but based on talking to Django devs at PyCon, nobody is really sure what that will look like yet.