I've got a bioinformatics analysis pipeline consisting of 5 different steps. Each step is essentially a Perl script that takes input, does its magic, and outputs several text files. Each step must be completely finished before the next begins. The whole process takes roughly 24 hours on Core i7 machines.
One major problem is that each step produces about 5-10 GB of intermediate output text files needed by subsequent steps, and there is a lot of redundancy. For example, the output of step 1 is used by steps 2, 3, and 4, and each of them does the same preprocessing on it. This structure grew 'organically' b/c each step was developed independently. Doing everything in memory unfortunately won't work for us, since data that is 10 GB on disk is far too large to fit into memory once loaded into a Perl hash/array.
It would be nice if the data could be loaded into an intermediate database, processed once per step, and made available to all subsequent steps. The data is essentially relational/tabular. Some of the steps only need access to the data sequentially, while others need random access to it.
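To make the idea concrete, here is a minimal sketch of the pattern I have in mind, written with Python's built-in sqlite3 for brevity (the table name, columns, and file name are all hypothetical — the real thing would be Perl/DBI against whatever database fits):

```python
import sqlite3

# Hypothetical schema: step 1 writes its already-preprocessed rows once,
# and steps 2-4 read them instead of each redoing the preprocessing.
conn = sqlite3.connect("pipeline.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS step1_output (
        seq_id  TEXT PRIMARY KEY,   -- primary key gives random access
        length  INTEGER,
        payload TEXT
    )
""")

# Step 1: insert preprocessed records (placeholder data here).
rows = [("seq_%04d" % i, i * 10, "data") for i in range(1000)]
conn.executemany("INSERT OR REPLACE INTO step1_output VALUES (?, ?, ?)", rows)
conn.commit()

# A "sequential" step streams rows via a cursor, never holding the
# whole data set in memory at once.
total = 0
for seq_id, length, payload in conn.execute(
        "SELECT seq_id, length, payload FROM step1_output ORDER BY seq_id"):
    total += length

# A "random access" step looks up individual records by key.
row = conn.execute(
    "SELECT length FROM step1_output WHERE seq_id = ?", ("seq_0042",)).fetchone()
conn.close()
```

The point is that both access patterns go through the same on-disk store, so the preprocessing happens once instead of three times.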
Does anybody have experience with this kind of thing?
Which database would be suited to such a task? I have used and liked SQLite, but does it scale to 20 GB+ sizes? Can you tell PostgreSQL or MySQL to cache data heavily in memory? (I figure that databases written in C/C++ would be much more memory-efficient than Perl hashes/arrays, so most of it could be cached in memory on a 24 GB machine.) Or is there a better, non-RDBMS solution, given the overhead of creating, indexing, and then destroying 20 GB+ in an RDBMS for single-run analyses?
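For context on the SQLite side of the question: my understanding is that its behavior on a large, throwaway, single-run database is tunable per connection via PRAGMAs, roughly like this (cache size and file name are assumptions, not recommendations):

```python
import sqlite3

conn = sqlite3.connect("pipeline.db")  # hypothetical database file

# Enlarge the page cache (a negative value means size in KiB);
# here ~8 GiB, which a 24 GB machine could afford.
conn.execute("PRAGMA cache_size = -8388608")

# For a database that is rebuilt on every run, durability guarantees
# can be traded away for write speed.
conn.execute("PRAGMA journal_mode = OFF")
conn.execute("PRAGMA synchronous = OFF")

# Keep temporary tables and sort/index scratch space in memory.
conn.execute("PRAGMA temp_store = MEMORY")
```

Whether that is enough to make 20 GB+ practical in SQLite is exactly what I'm asking.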