I am starting to build a brand-new application that will be used by about 50,000 devices. Each device generates about 1,440 records a day (one per minute), which means over 72 million records will be stored daily. These records keep arriving every minute, and I must be able to query the data from a Java application (J2EE). So it has to be fast to write, fast to read, and indexed to allow report generation. The devices only insert data; the J2EE application will need to read it from time to time. I am now looking at software options to support this kind of operation.

  • Putting this data in a single table would be catastrophic, because with the volume accumulated over a year I would not be able to make use of the data.

  • I am using Postgres, and database partitioning does not seem to be the answer, since I would have to partition tables by month, or maybe take a more granular approach, by day for instance.

I was thinking of a solution using SQLite: each device would have its own SQLite database, so the information would be granular enough for easy maintenance and fast inserts and queries.

What is your opinion?

  1. Record only changes of device position - most of the time a device will not move: a car will be parked, a person sits or sleeps, a phone lies on an unmoving person or sits charging, etc. This alone gives you an order of magnitude less data to store (a sketch of this delta recording follows the list).

  2. You will be generating at most about 1 TB a year (even without applying point 1), which is not a very large amount of data. Spread over a year (10^12 bytes over roughly 3×10^7 seconds) that averages only about 30 kB/s, which a single SATA drive can handle.

  3. A simple unpartitioned Postgres database on modest hardware should handle that load. The only problem may come when you need to query or back up the data - this can be solved with a Hot Standby mirror using Streaming Replication, a new feature in the soon-to-be-released PostgreSQL 9.0. Just run queries and backups against the mirror; if it gets busy it will temporarily and automatically queue incoming changes and catch up later (a replication configuration sketch follows the list).

  4. When you do want to partition, partition for example on device_id modulo 256 rather than on time. That way writes are spread across every partition; if you partition on time, only one partition is very busy at any moment and the others sit idle. Postgres supports this kind of partitioning very well. You can then also spread the load across several storage devices using tablespaces, which are likewise well supported in Postgres (a partitioning sketch follows the list).
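Here is a minimal sketch of point 1, assuming a hypothetical `positions` history table with plain lat/lon columns (the schema and all names are made up for illustration):

```sql
-- Hypothetical append-only history table.
CREATE TABLE positions (
    device_id   integer          NOT NULL,
    recorded_at timestamptz      NOT NULL,
    lat         double precision NOT NULL,
    lon         double precision NOT NULL
);

-- Store a fix only when it is the device's first one or differs from
-- the previous fix, so a parked car contributes one row, not 1440 a day.
CREATE OR REPLACE FUNCTION record_if_moved(
    p_device integer,
    p_at     timestamptz,
    p_lat    double precision,
    p_lon    double precision
) RETURNS void AS $$
DECLARE
    last_lat double precision;
    last_lon double precision;
BEGIN
    SELECT lat, lon INTO last_lat, last_lon
      FROM positions
     WHERE device_id = p_device
     ORDER BY recorded_at DESC
     LIMIT 1;

    IF NOT FOUND
       OR last_lat IS DISTINCT FROM p_lat
       OR last_lon IS DISTINCT FROM p_lon THEN
        INSERT INTO positions (device_id, recorded_at, lat, lon)
        VALUES (p_device, p_at, p_lat, p_lon);
    END IF;
END;
$$ LANGUAGE plpgsql;
```

In practice you would want an index on (device_id, recorded_at) to make the lookup cheap, and you might treat small GPS jitter below some threshold as "not moved".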
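For point 3, a rough configuration sketch of PostgreSQL 9.0 streaming replication with Hot Standby (host, IP and user names are placeholders; see the 9.0 documentation for the full setup):

```
# primary: postgresql.conf
wal_level = hot_standby      # emit enough WAL for a queryable standby
max_wal_senders = 3          # allow replication connections

# primary: pg_hba.conf - let the standby connect for replication
# host  replication  repuser  192.0.2.10/32  md5

# standby: postgresql.conf
hot_standby = on             # accept read-only queries during recovery

# standby: recovery.conf
standby_mode = 'on'
primary_conninfo = 'host=primary.example port=5432 user=repuser'
```

Reports and pg_dump backups can then run against the standby without adding load to the primary that is taking the inserts.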
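And a sketch of point 4's modulo partitioning, using the inheritance-style partitioning of the PostgreSQL 8.x/9.0 era, with the hypothetical `positions` table from the first sketch as the parent (in practice you would generate the 256 child tables with a script):

```sql
-- One child per bucket 0..255; two shown here.
CREATE TABLE positions_p000 (CHECK (device_id % 256 = 0)) INHERITS (positions);
CREATE TABLE positions_p001 (CHECK (device_id % 256 = 1)) INHERITS (positions);
-- ... positions_p002 through positions_p255 ...

-- Route rows inserted into the parent to the matching child.
CREATE OR REPLACE FUNCTION positions_route() RETURNS trigger AS $$
BEGIN
    EXECUTE 'INSERT INTO positions_p'
            || lpad((NEW.device_id % 256)::text, 3, '0')
            || ' SELECT ($1).*'
    USING NEW;
    RETURN NULL;  -- the row already lives in the child table
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER positions_partition
    BEFORE INSERT ON positions
    FOR EACH ROW EXECUTE PROCEDURE positions_route();
```

With constraint_exclusion enabled, queries that include a matching `device_id % 256 = ...` condition scan only the relevant children, and each child (or group of children) can be placed in its own tablespace.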

Time-interval partitioning is a very good solution, even if you have to roll your own. Maintaining separate connections to 50,000 SQLite databases is much less practical than a single Postgres database, even for millions of inserts a day.

Depending on the kind of queries you need to run against your dataset, you might consider partitioning your remote devices across several servers and then querying those servers to write aggregate data back to a back-end server.

The key to high-volume tables is to minimize the amount of data you write and the number of indexes that must be updated: avoid UPDATEs and DELETEs, do only INSERTs (and use partitioning for data you will delete in the future - DROP TABLE is much faster than DELETE FROM table!). A short illustration follows.
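To illustrate the DROP-versus-DELETE point, assume a time-partitioned layout with one hypothetical child table per month:

```sql
-- Row by row: writes WAL for every deleted row and leaves dead tuples
-- behind for VACUUM to clean up.
DELETE FROM positions WHERE recorded_at < DATE '2010-01-01';

-- Partition drop: removes the same month of data almost instantly by
-- unlinking the child table's files.
DROP TABLE positions_2009_12;
```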

Table design and query optimization become very database-specific as you start to push the limits of the database engine. Consider hiring a Postgres expert to at least consult on your design.

Maybe it is time for a database that you can shard across many machines? Cassandra? Redis? Don't limit yourself to SQL databases.