I am thinking about writing an easy key/value store having a file architecture much like CouchDB, i.e. an append-only b+tree.
I have read everything I'm able to find on B+trees as well as everything I'm able to find on CouchDB's internals, however i haven't had time for you to work all things in the origin code (being in an exceedingly different language causes it to be a unique project on its own).
So I've got a question concerning the sizing the of B+tree nodes that is: given key-length is variable, is it more beneficial to help keep the nodes exactly the same length (in bytes) or is it more beneficial to provide them exactly the same quantity of secrets/child-pointers no matter how large they become?
I understand that in conventional databases the B+tree nodes are stored in a fixed length in bytes (e.g. 8K) because space within the documents is handled in fixed size pages. However in an append-only file plan in which the documents could be any length and also the up-to-date tree nodes are written after, there appears to become no benefit to getting a set-size node.
The aim of a b --tree would be to minimize the amount of disk accesses. When the file system cluster dimensions are 4k, then your ideal size for that nodes is 4k. Also, the nodes ought to be correctly aligned. A misaligned node may cause two groupings to become read, reducing performance.
Having a log based storage plan, selecting a 4k node dimensions are most likely the worst choice unless of course gaps are produced within the log to enhance alignment. Otherwise, 99.98% of times one node is saved on two groupings. Having a 2k node size, the chances of the happening are simply under 50%. However, there's an issue with a little node size: average height from the b-tree is elevated and also the time spent reading through a disk cluster isn't fully utilized.
Bigger node dimensions lessen the height from the tree, however they too increase the amount of disk accesses. Bigger nodes may also increase the overhead of maintaining the records inside the node. Make a b-tree in which the node dimensions are big enough to encapsulate the whole database. You need to embed a much better data structure inside the node itself, possibly another b-tree?
I spent a while prototyping a b --tree implementation over an append-only log format and finally declined the idea altogether. To pay for performance deficits because of node/cluster imbalance, you must have a really large cache. A classical storage approach could make better utilisation of the RAM.
The ultimate blow was after i examined the performance of at random-purchased card inserts. This kills performance associated with a disk-backed storage system, but log based formats suffer a lot more. A write of the littlest entry forces several nodes to become written towards the log, and also the internal nodes are invalidated soon after being written. Consequently, the log quickly fills track of garbage.
BerkeleyDB-JE (BDB-JE) can also be log based, and that i analyzed its performance qualities too. It suffers exactly the same problem that my prototype did -- rapid accumulation of garbage. BDB-JE has several "cleaner" threads which re-append making it through records towards the log, however the random order is maintained. Consequently, the brand new "clean" records have previously produced log files filled with garbage. The general performance from the system degrades to the stage the only factor running may be the cleaner, and it is hogging all system assets.
Log based formats are extremely attractive because it's possible to rapidly implement a strong database. The Achilles heel may be the cleaner, that is non-trivial. Caching methods will also be tricky to obtain right.