As part of a requirement, we need to process nearly 3 million records and associate each of them with a bucket. The association is decided by a set of rules (each made up of 5-15 attributes, with a single value or a range of values, and a priority) that derive the bucket for a record. Sequential processing of such a large number is clearly out of scope. Can someone guide us on how to design an effective solution?
3 million records is not really that much from a volume-of-data perspective (depending on record size, obviously), so I'd suggest that the simplest thing to try is parallelising the processing across multiple threads (using the java.util.concurrent.Executor framework). As long as you have multiple CPU cores available, you should be able to get near-linear performance increases.
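A minimal sketch of that idea, assuming the records are already in memory as a list and that the rule evaluation can be expressed as a function returning a bucket number. The class name ParallelBucketing, the method assignBuckets, and the rule parameter are made up for illustration and are not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToIntFunction;

public class ParallelBucketing {

    /**
     * Assigns a bucket to every record by splitting the list into
     * contiguous chunks and evaluating each chunk in its own task.
     * The rule function is a placeholder for the real rule engine.
     */
    public static <R> int[] assignBuckets(List<R> records, ToIntFunction<R> rule, int threads)
            throws Exception {
        int[] buckets = new int[records.size()];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = Math.max(1, (records.size() + threads - 1) / threads);
            List<Future<?>> pending = new ArrayList<>();
            for (int start = 0; start < records.size(); start += chunk) {
                final int from = start;
                final int to = Math.min(start + chunk, records.size());
                // Each task writes only to its own slice of the array,
                // so the tasks never contend for the same element.
                pending.add(pool.submit(() -> {
                    for (int i = from; i < to; i++) {
                        buckets[i] = rule.applyAsInt(records.get(i));
                    }
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // waits for the task and rethrows any failure
            }
        } finally {
            pool.shutdown();
        }
        return buckets;
    }
}
```

A call such as assignBuckets(records, myRule, Runtime.getRuntime().availableProcessors()) would be a reasonable starting point. Whether the speed-up is close to linear depends on the rule evaluation being CPU-bound and free of shared state.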
It depends on the data source. If it is a single database, you will spend most of the time retrieving the data anyway. If it is in a local file, you can partition the data into smaller files, or pad the records to equal size, which allows random access to a batch of records.
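To illustrate the padding idea: once every record occupies the same number of bytes, record number n starts at byte offset n * recordSize, so a worker can seek straight to its batch. The sketch below assumes ASCII text records padded with spaces. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FixedWidthRecords {

    /** Pads (or truncates) a record to exactly recordSize bytes using spaces. */
    public static byte[] pad(String record, int recordSize) {
        byte[] out = new byte[recordSize];
        Arrays.fill(out, (byte) ' ');
        byte[] raw = record.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(raw, 0, out, 0, Math.min(raw.length, recordSize));
        return out;
    }

    /** Reads record number index directly, without scanning the records before it. */
    public static String read(RandomAccessFile file, long index, int recordSize)
            throws IOException {
        byte[] buf = new byte[recordSize];
        file.seek(index * recordSize);
        file.readFully(buf);
        // trim() removes the padding (and any leading spaces in the record itself).
        return new String(buf, StandardCharsets.US_ASCII).trim();
    }
}
```

Each worker thread would open its own RandomAccessFile on the shared file and read only the index range assigned to it, since a single RandomAccessFile instance has one file position and is not safe to share between threads.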
If you have a multi-core machine, the partitioned data can be processed in parallel. Once you have determined the record-bucket assignment, you can write the information back into the database using PreparedStatement's batch capability.
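A rough sketch of the batched write-back using the standard JDBC addBatch/executeBatch calls. The table name records and the columns bucket_id and id are invented for the example, as are the class and method names. The batch size is something to tune.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Map;

public class BatchWriter {
    // Hypothetical table and column names; adjust to the real schema.
    static final String UPDATE_SQL = "UPDATE records SET bucket_id = ? WHERE id = ?";

    /**
     * Queues one update per record and sends them to the database in
     * groups of batchSize. Returns the number of batches sent.
     */
    public static int writeBatches(PreparedStatement ps,
                                   Map<Long, Integer> bucketByRecordId,
                                   int batchSize) throws SQLException {
        int queued = 0;
        int sent = 0;
        for (Map.Entry<Long, Integer> e : bucketByRecordId.entrySet()) {
            ps.setInt(1, e.getValue());
            ps.setLong(2, e.getKey());
            ps.addBatch();
            if (++queued % batchSize == 0) {
                ps.executeBatch(); // one round trip for the whole group
                sent++;
            }
        }
        if (queued % batchSize != 0) {
            ps.executeBatch(); // flush the final, partially filled batch
            sent++;
        }
        return sent;
    }

    /** Runs all batches in one transaction and rolls back on failure. */
    public static int writeAll(Connection conn,
                               Map<Long, Integer> bucketByRecordId,
                               int batchSize) throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(UPDATE_SQL)) {
            int sent = writeBatches(ps, bucketByRecordId, batchSize);
            conn.commit();
            return sent;
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```

How much a batch actually saves depends on the JDBC driver, since some drivers only send batches efficiently with specific connection settings.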
If you have only a single-core machine, you can still achieve some performance improvements by separating the work into data retrieval, data processing, and batch write-back stages, so that processing can make use of the time otherwise spent waiting on I/O operations.
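One way to picture that separation is three stages connected by bounded queues, so the processing stage keeps working while the reader and writer wait on I/O. This is only a sketch under simplifying assumptions: the source, process, and sink parameters stand in for the real retrieval, rule evaluation, and write-back code, items must be non-null, and error handling is left out (a failure in one stage would leave the others waiting).

```java
import java.util.Iterator;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;
import java.util.function.Function;

public class ThreeStagePipeline {
    private static final Object END = new Object(); // end-of-stream marker

    public static <I, O> void run(Iterator<I> source, Function<I, O> process,
                                  Consumer<O> sink, int capacity)
            throws InterruptedException {
        // Bounded queues keep a fast stage from running far ahead of a slow one.
        BlockingQueue<Object> toProcess = new ArrayBlockingQueue<>(capacity);
        BlockingQueue<Object> toWrite = new ArrayBlockingQueue<>(capacity);

        // Stage 1: retrieval. Blocks on I/O in real use.
        Thread reader = new Thread(() -> {
            try {
                while (source.hasNext()) {
                    toProcess.put(source.next());
                }
                toProcess.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Stage 2: processing. Runs while the other stages wait on I/O.
        Thread worker = new Thread(() -> {
            try {
                for (Object item = toProcess.take(); item != END; item = toProcess.take()) {
                    @SuppressWarnings("unchecked")
                    I input = (I) item;
                    toWrite.put(process.apply(input));
                }
                toWrite.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        reader.start();
        worker.start();

        // Stage 3: write-back, performed on the calling thread.
        for (Object item = toWrite.take(); item != END; item = toWrite.take()) {
            @SuppressWarnings("unchecked")
            O output = (O) item;
            sink.accept(output);
        }
        reader.join();
        worker.join();
    }
}
```

In practice the sink would collect results and flush them in batches rather than writing one row at a time.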
I'm not quite sure what you're after, but here's a blog post about how the New York Times used the Apache Hadoop project to process a large volume of data.
Is there a reason you have to use Java to process the data? Couldn't you use SQL queries to write to intermediate fields? You could build upon each field (the attributes) until you have everything in the bucket you need.
Or you could use a hybrid of SQL and Java... Use different queries to fetch different "buckets" of data, then send one down a thread path for more detailed processing, run another query to get another set of data, and send that down a different thread path...
As a meaningless benchmark, we have a system that has an internal cache. We're currently loading 500K rows. For each row we generate statistics, place keys in different caches, etc. Currently this takes < 20s for us to process.
It's a meaningless benchmark, but it shows that, depending on the circumstances, 3M rows is not a lot of rows on modern hardware.
As others have suggested, break the job up into pieces and parallelize the runs, 1-2 threads per core. Each thread maintains its own local data structures and state, and at the end, the master process consolidates the results. This is a crude "map/reduce" algorithm. The key here is to make sure the threads aren't fighting over global resources like global counters, etc. Let the final processing of the thread results deal with those serially.
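A small sketch of that crude "map/reduce" shape, assuming the goal is a count of records per bucket. Each task fills its own private map, and only the final merge, done serially on the calling thread, touches the combined result. The names LocalStateMapReduce, countPerBucket, and bucketOf are invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class LocalStateMapReduce {

    public static <R> Map<String, Long> countPerBucket(List<R> records,
                                                       Function<R, String> bucketOf,
                                                       int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = Math.max(1, (records.size() + threads - 1) / threads);
            List<Future<Map<String, Long>>> partials = new ArrayList<>();
            for (int start = 0; start < records.size(); start += chunk) {
                List<R> slice = records.subList(start, Math.min(start + chunk, records.size()));
                // "Map" step: each task owns its map, so there is no shared counter.
                partials.add(pool.submit(() -> {
                    Map<String, Long> local = new HashMap<>();
                    for (R r : slice) {
                        local.merge(bucketOf.apply(r), 1L, Long::sum);
                    }
                    return local;
                }));
            }
            // "Reduce" step: merge the per-task maps serially on this thread.
            Map<String, Long> total = new TreeMap<>();
            for (Future<Map<String, Long>> f : partials) {
                f.get().forEach((bucket, n) -> total.merge(bucket, n, Long::sum));
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

The same structure works for any per-thread state that can be merged afterwards. The design choice is to pay a small serial merge cost at the end in exchange for having no locks or atomic counters in the hot loop.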
You can use more than one thread per core if each thread is doing DB IO, since no single thread will be purely CPU bound. Simply run the process several times with different thread counts until you find the one that runs fastest.
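The trial-and-error tuning can be scripted with a small timing harness along these lines. The job parameter is a placeholder for "run the whole process with this many threads", and the class and method names are hypothetical. Single wall-clock measurements are noisy (JIT warm-up, caches, database load), so treat the result as a rough guide and repeat the runs.

```java
import java.util.function.IntConsumer;

public class ThreadCountTuner {

    /** Runs the job once per candidate thread count and returns the fastest count. */
    public static int fastest(int[] candidates, IntConsumer job) {
        int best = candidates[0];
        long bestNanos = Long.MAX_VALUE;
        for (int threads : candidates) {
            long t0 = System.nanoTime();
            job.accept(threads);
            long elapsed = System.nanoTime() - t0;
            System.out.printf("%d threads: %d ms%n", threads, elapsed / 1_000_000);
            if (elapsed < bestNanos) {
                bestNanos = elapsed;
                best = threads;
            }
        }
        return best;
    }
}
```

For example, on a 4-core machine the candidates might be 2, 4, 8, and 16.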
We've seen 50% speed-ups even when we run batches through a persistent queueing system like JMS to distribute the work, compared with linear processing, and I've seen these gains on 2-core laptops, so there is definite room for progress here.
Another thing, if possible, is to avoid ANY disk IO (save reading the data from the DB) until the very end. At that point you have much more opportunity to batch any updates that need to be made, so you can at least cut down on network round-trip times. Even if you had to update every single row, large batches of SQL will still show net gains in performance. Obviously this can be memory intensive. Fortunately, most modern systems have plenty of memory.