We are creating a measurement system that will eventually contain thousands of measurement stations. Each station will save around 500 million data points composed of 30 scalar values over its lifetime. These will be float values. We are now wondering how to store this data on each station, given that we will build a web application on each station, so that
- we want to visualize the data on multiple timescales (e.g. data points of one week, one month, one year)
- we need to build moving averages over the data (e.g. the average over one month to show in the year graph)
- the database must be crash resistant (power outages)
- we are only doing writes and reads, no updates or deletes on the data
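To make the moving-average requirement concrete, here is a minimal trailing moving average over raw samples (a sketch, independent of any particular database):

```python
from collections import deque

def moving_average(values, window):
    """Trailing moving average over a fixed window of samples.

    For leading samples (fewer than `window` seen so far), the
    average is taken over whatever is in the buffer.
    """
    buf = deque(maxlen=window)
    out = []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out
```

In practice the averages would be computed over month-sized windows and materialized ahead of time rather than on every page load.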
Furthermore, we want one more server that can show the data of, say, 1000 measurement stations. That would be ~50 TB of data in 500 billion data points. To transmit the data from measurement station to server, I figured that some form of database-level replication would be a clean and efficient way.
Now I am wondering whether a NoSQL solution might be better than MySQL for these purposes. In particular CouchDB, Cassandra, and perhaps key-value stores like Redis look attractive to me. Which of those would suit the "measurement time series" data model best in your opinion? What about other concerns like crash-safety and replication from measurement station to the main server?
I believe CouchDB is a great database -- but its ability to deal with large data is questionable. CouchDB's primary focus is on simplicity of development and offline replication, not necessarily on performance or scalability. CouchDB itself doesn't support partitioning, so you will be limited by the maximum node size unless you use BigCouch or invent your own partitioning scheme.
Make no mistake, Redis is an in-memory database. It is very fast and efficient at getting data into and out of RAM. It is able to use disk for storage, but it's not terribly good at it. It is great for bounded quantities of data that change frequently. Redis does have replication, but has no built-in support for partitioning, so again, you will be on your own here.
You also mentioned Cassandra, which I think is more on track for your use case. Cassandra is well suited for databases that grow indefinitely; that is essentially its original use case. Partitioning and availability are baked in, so you won't need to worry about them much. The data model is also a bit more flexible than the average key/value store, adding another dimension of columns, and can practically accommodate millions of columns per row. This allows time-series data to be "bucketed" into rows that cover time ranges, for instance. The distribution of data across the cluster (partitioning) is done at the row level, so only a single node is needed to perform operations within a row.
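The bucketing idea boils down to building a row key from the station id plus a truncated timestamp. A sketch of such a key scheme (the layout is a common convention, not anything mandated by Cassandra itself; names are illustrative):

```python
from datetime import datetime, timezone

def row_key(station_id: str, ts: datetime, bucket: str = "day") -> str:
    """Build a row key that buckets samples by time range.

    All samples from the same station and the same day (or month)
    land in the same row, so a single node can serve a range query.
    """
    fmt = {"day": "%Y%m%d", "month": "%Y%m"}[bucket]
    return f"{station_id}:{ts.strftime(fmt)}"
```

A week-of-data query then touches at most a handful of rows instead of scanning the whole cluster.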
Hadoop plugs directly into Cassandra, with "native drivers" for MapReduce, Pig, and Hive, so it could potentially be used to aggregate the collected data and materialize the moving averages. The best practice is to shape the data around your queries, so you will probably want to store multiple copies of the data in "denormalized" form, one for each kind of query.
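As a toy illustration of the "one denormalized copy per query" idea, here plain Python dicts stand in for two column families: raw samples bucketed by day, and pre-aggregated per-month sums for average queries (all names are hypothetical):

```python
from collections import defaultdict

# Two denormalized "views" of the same data, one per query shape.
by_day = defaultdict(list)                 # (station, "YYYYMMDD") -> [values]
by_month = defaultdict(lambda: [0.0, 0])   # (station, "YYYYMM")  -> [sum, count]

def record(station, yyyymmdd, value):
    """Write each sample into every view it should be queryable from."""
    by_day[(station, yyyymmdd)].append(value)
    agg = by_month[(station, yyyymmdd[:6])]
    agg[0] += value
    agg[1] += 1

record("s1", "20240315", 2.0)
record("s1", "20240316", 4.0)
total, count = by_month[("s1", "202403")]
monthly_avg = total / count   # served from the aggregate, not the raw rows
```

The year graph then reads only the small monthly aggregate, never the 500 million raw samples.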
Check out this post on doing time-series in Cassandra:
For highly structured data of this nature (time series of float vectors) I tend to shy away from databases altogether. Most of the features of a database aren't very interesting here; you basically aren't interested in things like atomicity or transactional semantics. The only feature that is desirable is resilience to crashes. That feature, however, is trivially easy to implement when you never have to undo a write (no updates/deletes): simply append to a file. Crash recovery is simple: open a new file with an incremented serial number in the filename.
A sensible format for this is plain-old CSV. After each measurement is taken, call flush() on the underlying file. Getting the data replicated to the central server is a job effectively solved by rsync(1). You can then import the data into the analysis tool of your choice.
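A minimal sketch of that append-only scheme, with flush-per-sample and serial-numbered files for crash recovery (file naming and directory layout are arbitrary choices):

```python
import csv
import glob
import os

class MeasurementLog:
    """Append-only CSV log: one file per run, serial number in the name."""

    def __init__(self, directory):
        os.makedirs(directory, exist_ok=True)
        existing = glob.glob(os.path.join(directory, "measurements-*.csv"))
        serials = [int(os.path.basename(p)[13:-4]) for p in existing]
        # Crash recovery: never reopen an old file, just start the next serial.
        serial = max(serials, default=0) + 1
        self.path = os.path.join(directory, f"measurements-{serial:06d}.csv")
        self._fh = open(self.path, "a", newline="")
        self._writer = csv.writer(self._fh)

    def append(self, timestamp, values):
        """Write one sample (timestamp plus the 30 scalars) and force it out."""
        self._writer.writerow([timestamp, *values])
        self._fh.flush()               # hand the bytes to the OS...
        os.fsync(self._fh.fileno())    # ...and onto disk, for power-failure safety

    def close(self):
        self._fh.close()
```

Shipping the files to the central server is then a one-liner along the lines of `rsync -a data/ server:/incoming/station-42/` (paths illustrative).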