I have the common problem of creating an index for an on-disk array of strings. In short, I need to store the position of each string in the on-disk representation. For example, a very naive solution would be an index array like the following:
uint64 idx[] = { 0, 20, 500, ..., 103434 };

which says that the first string is at position 0, the second at position 20, the third at position 500 and the nth at position 103434.
The positions are always non-negative 64-bit integers in increasing order. Although the gaps could be of any size, in practice I expect the typical delta to be in the range from 2^8 to 2^20. I expect this index to be mmap'ed in memory, and the positions will be accessed at random (assume uniform distribution).
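To make the baseline concrete, here is a minimal sketch of the raw mmap'ed index I am describing; the file layout (8 bytes per entry, host byte order) and the helper names are just for illustration:

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write a raw index file: n positions, 8 bytes each, host byte order. */
static int write_index(const char *path, const uint64_t *pos, size_t n) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return -1;
    ssize_t w = write(fd, pos, n * sizeof *pos);
    close(fd);
    return w == (ssize_t)(n * sizeof *pos) ? 0 : -1;
}

/* Baseline lookup: mmap the file and read the i-th position.
   O(1) per access, but a full 8 bytes per entry on disk, which is
   exactly what a compressed encoding should improve on. */
static uint64_t nth_position(const char *path, size_t i) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return UINT64_MAX;
    struct stat st;
    fstat(fd, &st);
    const uint64_t *idx = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (idx == MAP_FAILED) return UINT64_MAX;
    uint64_t v = idx[i];
    munmap((void *)idx, st.st_size);
    return v;
}
```

Any compressed index has to beat these 8 bytes per entry without giving up too much of the O(1) random access.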
I thought about writing my own code to do some kind of block delta encoding or other more sophisticated encoding, but there are so many different trade-offs between encoding/decoding speed and space that I would rather get a working library as a starting point and maybe even settle for something without any custom changes.
Any hints? A C library would be ideal, but a C++ one would also let me run some initial benchmarks.
A few more details if you are still following. This will be used to build a library similar to cdb (http://cr.yp.to/cdb/cdbmake.html) on top of the library cmph (http://cmph.sf.net). In short, it is for a large disk-based read-only associative map with a small index in memory.
Since it is a library, I do not have total control over the input, but the typical use case that I want to optimize has hundreds of millions of values, average value size in the few-kilobytes range and maximum value at 2^31.
For the record, if I do not find a library available I intend to implement delta encoding in blocks of 64 integers, with the initial bytes indicating the block offset so far. The blocks themselves would be indexed with a tree, giving me O(log(n/64)) access time. There are way too many other options and I would prefer not to discuss them. I am really looking for available code rather than ideas on how to implement the encoding. I will be glad to tell everyone what I did once I get it working.
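To be concrete about that fallback plan, here is a rough sketch of the per-block encoding, assuming LEB128-style varints for the deltas; the names and layout are illustrative, not a finished design:

```c
#include <stdint.h>
#include <stddef.h>

#define BLOCK 64 /* integers per block, as described above */

/* Varint (LEB128-style) encoding of one unsigned 64-bit value.
   Returns the number of bytes written (at most 10). */
static size_t varint_put(uint8_t *out, uint64_t v) {
    size_t n = 0;
    while (v >= 0x80) { out[n++] = (uint8_t)(v | 0x80); v >>= 7; }
    out[n++] = (uint8_t)v;
    return n;
}

/* Decode one varint; returns the number of bytes consumed. */
static size_t varint_get(const uint8_t *in, uint64_t *v) {
    size_t n = 0;
    int shift = 0;
    *v = 0;
    do { *v |= (uint64_t)(in[n] & 0x7F) << shift; shift += 7; } while (in[n++] & 0x80);
    return n;
}

/* Encode one block of up to BLOCK increasing positions: the block base
   comes first, then the deltas between consecutive positions. */
static size_t block_encode(uint8_t *out, const uint64_t *pos, size_t count) {
    size_t n = varint_put(out, pos[0]);
    for (size_t i = 1; i < count; i++)
        n += varint_put(out + n, pos[i] - pos[i - 1]);
    return n;
}

/* Recover the i-th position inside a block by summing deltas;
   O(BLOCK) worst case per access. */
static uint64_t block_decode_at(const uint8_t *in, size_t i) {
    uint64_t v, acc = 0;
    for (size_t k = 0; k <= i; k++) {
        in += varint_get(in, &v);
        acc += v;
    }
    return acc;
}
```

A separate sorted array of block bases would then be binary-searched (or walked as a tree) to find the right block, which is where the O(log(n/64)) bound comes from.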
Thanks for your help, and let me know if you have any doubts.
I use fastbit (Kesheng Wu, LBL.GOV). It seems you need something good, fast and available now, and fastbit is a highly competent improvement on Oracle's BBC (byte-aligned bitmap code, BerkeleyDB). It is easy to set up and very good generally.
However, given more time, you may want to look into a Gray code solution; it seems optimal for your purposes.
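For reference, the reflected binary Gray code is just the b ^ (b >> 1) mapping and its inverse; a minimal sketch:

```c
#include <stdint.h>

/* Reflected binary Gray code: adjacent integers differ in exactly one
   bit, which is what makes Gray-code reordering compress well. */
static uint64_t to_gray(uint64_t b) { return b ^ (b >> 1); }

/* Inverse mapping: fold the high bits back down. */
static uint64_t from_gray(uint64_t g) {
    for (int shift = 1; shift < 64; shift <<= 1)
        g ^= g >> shift;
    return g;
}
```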
Daniel Lemire has a number of libraries for C/C++/Java released on code.google. I have read over his papers and they are quite nice: several improvements on fastbit and alternative approaches to column reordering with permutated Gray codes.
Almost forgot: I also came across Tokyo Cabinet. Though I don't think it will be the best fit for my current project, I might have considered it more if I had known about it before ;). It has a good degree of interoperability:

Tokyo Cabinet is written in the C language, and provided as APIs of C, Perl, Ruby, Java, and Lua. Tokyo Cabinet is available on platforms which have APIs conforming to C99 and POSIX.
Since you referred to CDB, the TC benchmark includes a TC mode (TC supports several operational constraints for varying perf) where it surpassed CDB by 10 times for read performance and by 2 times for write.
With respect to your delta encoding requirement, I am quite confident in bsdiff and its ability to out-perform any file.exe content patching system; it may also have some fundamental interfaces for your general needs.
Google's new binary compression application, Courgette, may be worth checking out, in case you missed the press release: 10x smaller diffs than bsdiff in the one test case I have seen published.