I have a requirement where I must select around 60 million plus records from a database. Once I have the records in a ResultSet, I must format some columns as per the client's requirements (date format and number format), and then I must write all the records to a file (secondary memory).

  • Currently I am selecting records on a per-day basis (7 selects for 7 days) from the DB and putting them into a HashMap. Then I read from the HashMap, format some columns, and finally write to a file (a separate file for each of the 7 days).
  • Finally I merge all 7 files into one file.

  • But this complete process takes 6 hours to finish. To improve it, I created 7 threads for the 7 days, and all the threads write to separate files.

  • Finally I merge all 7 files into one file. This approach takes 2 hours, but my program runs out of memory (OutOfMemoryError) after an hour or so.

Please suggest the best design for this scenario. Should I use some caching mechanism? If so, which one and how?

Note: The client does not want to change anything in the database, such as creating indexes or stored procedures; they do not want to touch the database at all. Thanks in advance.

Do you need to have all the records in memory to format them? You could try to stream the records through a process and straight into the file. If you are able to break the query up further, you may be able to start processing the results while you are still retrieving them.

Depending on your DB back end, there may be tools to help with this, for example SSIS for SQL Server 2005+.

Edit

I am a .NET developer, so let me suggest what I would do in .NET and hopefully you can convert it into comparable technologies on the Java side.

ADO.NET has a DataReader, which is a forward-only, read-only (firehose) cursor over a result set. It returns data as the query is executing. This is important. Basically, my logic would be:

IDataReader reader = GetTheDataReader(dayOfWeek);

while (reader.Read())
{
    // format and write each row as soon as it arrives; nothing is buffered
    file.Write(formatRow(reader));
}

Since this is executing while we are returning rows, you are not going to block on the network access, which I am guessing is a huge bottleneck for you. The key here is that we are not storing any of this in memory for long; as we cycle through, the reader discards the results, and the file write puts each row onto disk.
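On the Java side, the rough equivalent would be plain JDBC with a forward-only, read-only Statement and a modest fetch size, writing each row as it arrives. This is only a sketch under assumptions: the JDBC URL, query, column names, and the formatRow helper are placeholders, and the exact streaming behaviour of setFetchSize depends on your driver.

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DayExporter {

    // placeholder: apply the client's date/number formatting to one row
    static String formatRow(ResultSet rs) throws Exception {
        return rs.getString(1) + "," + rs.getString(2);
    }

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:yourdb://host/db", "user", "pass");
             Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                                   ResultSet.CONCUR_READ_ONLY);
             BufferedWriter out = new BufferedWriter(new FileWriter("day1.txt"))) {

            // ask the driver to stream rows in small batches instead of
            // materialising the whole result set in memory
            stmt.setFetchSize(1000);

            try (ResultSet rs = stmt.executeQuery("SELECT col1, col2 FROM big_table WHERE day = 1")) {
                while (rs.next()) {
                    out.write(formatRow(rs));   // format and write immediately
                    out.newLine();
                }
            }
        }
    }
}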

I think what Josh is suggesting is this:

You have loops where you currently go through all the result records of your query (just using pseudocode here):

while (rec = getNextRec() )
   {
   put in hash ...
   }

for each rec in (hash)
   {
   format and save back in hash ...
   }

for each rec in (hash)
   {
   write to a file ...
   }

instead, do it like this:

while (rec = getNextRec() )
   {
   format fields ...
   write to the file ...
   }

then you never have more than 1 record in memory at a time ... and you can process an unlimited number of records.
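For the "format fields" step, the per-record work in Java is typically just applying a date format and a number format before writing. A minimal sketch, assuming hypothetical column values and made-up client patterns:

import java.text.DecimalFormat;
import java.text.SimpleDateFormat;
import java.util.Date;

public class RowFormatter {

    // assumed client formats; substitute the real patterns
    // note: SimpleDateFormat is not thread-safe, so keep one instance per thread
    private static final SimpleDateFormat DATE_FMT = new SimpleDateFormat("dd-MM-yyyy");
    private static final DecimalFormat NUM_FMT = new DecimalFormat("#,##0.00");

    // formats one record; "createdOn" and "amount" are hypothetical columns
    static String formatRecord(String name, Date createdOn, double amount) {
        return name + "|" + DATE_FMT.format(createdOn) + "|" + NUM_FMT.format(amount);
    }
}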

It depends on the database you are using, but if it were SQL Server, I would suggest using something like SSIS to do this rather than writing a program.

Clearly reading 60 million records at once is using up all your memory - so you can't do that (i.e. your 7-thread model). Reading 60 million records one at a time is using up all your time - so you can't do that either (i.e. your original read-to-file model).

So... you are going to have to compromise and do a bit of both.

Josh has it right - open a cursor to your DB that simply reads the next record, one at a time, in the simplest, most feature-light way. A "firehose" cursor (otherwise known as a read-only, forward-only cursor) is what you want, as it imposes the least load on the database. The DB won't let you update the records or go backwards in the recordset - which you don't want anyway - so it won't need to maintain memory for the records.

Now that you have this cursor, you are getting 1 record at a time from the DB - read it, and write it to a file (or several files); this should finish quite quickly. Your task then is to merge the files into 1 in the correct order, which is relatively easy.
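The merge itself is just sequential copying. A minimal sketch, assuming the per-day files are named day1.txt ... day7.txt and that plain byte concatenation in day order is the order you need:

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class FileMerger {
    public static void main(String[] args) throws IOException {
        Path merged = Paths.get("all_days.txt");
        try (OutputStream out = Files.newOutputStream(merged,
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
            // append the day files one after another, in day order
            for (int day = 1; day <= 7; day++) {
                Files.copy(Paths.get("day" + day + ".txt"), out);
            }
        }
    }
}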

Given the number of records you have to process, I think this is the optimal solution for you.

But... since you are doing pretty well so far anyway, why not just reduce the number of threads until you are within your memory limits? Batch processing is run overnight in a lot of companies; this just seems to be another one of those processes.
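If you do keep the multi-threaded approach, one way to cap peak memory is a fixed-size pool instead of one thread per day. A rough sketch, where exportDay is a hypothetical method that streams one day's records straight to its own file:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BoundedExport {

    // hypothetical: query, format, and write one day's records to its own file
    static void exportDay(int day) { /* ... */ }

    public static void main(String[] args) throws InterruptedException {
        // at most 3 days are processed concurrently, which bounds peak memory
        ExecutorService pool = Executors.newFixedThreadPool(3);
        for (int day = 1; day <= 7; day++) {
            final int d = day;
            pool.submit(() -> exportDay(d));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }
}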