I’m trying to solve a problem where we’re analyzing a large amount of data from a table. We have to pull in a few subsets of the data and analyze each one. My initial thought is that it might be best to multithread it: pull in as much data as possible up front and perform the various computations on each region in parallel. Let’s say each subset of data to analyze is denoted S1, S2, …, so there would be a thread for each. After processing the data, some visualizations might be produced, and the results will have to be saved back into the database, because there could be many GB worth of data in the analysis results. Let’s say the results are denoted R1, R2, …
Even though this is still a little vague, I'm wondering whether we should create a separate table for each of R1, R2, etc., or store all the results in a single table. Multiple threads will likely be storing results simultaneously (recall the threads for S1, S2), so if there's just one table, I have to make sure multiple threads can write to it at the same time. If it helps, once the data for R1, R2, etc. is needed again, it will all be read back in a particular order, which might be easier to maintain if there were a table for each of R1, R2, etc. I was also thinking we would have a single object per table that handles requests to that results table, if we go down that path. Essentially, I'd like that object to behave like a bean that only loads data from the database as needed (there is too much to keep in memory at once). One more point: we're using InnoDB as our storage engine, in case that makes any difference to whether multiple threads can access a particular table.
So, with this bit of information, would it be better to create a single table for all the results, or one table per region of results (possibly hundreds of them)?
You can, but then you have to manage hundreds of tables, and computing statistics across the entire result set will be much harder.
If the data can easily be partitioned into subsets that don't intersect, the database shouldn't need to lock rows, especially if you are only doing reads and the processing happens in your application. In that case you don't need to split the table into hundreds of tables, and each thread in your application can work independently.
This sounds like a great map-reduce candidate, assuming you will perform the same calculation over the whole set and want to speed the process up.
Have you considered using something like MongoDB? You can write your own map-reduce aggregations in it.
Map reduce: http://en.wikipedia.org/wiki/MapReduce
Mongo supports in-place updates, and it is a lockless, eventually consistent store.