I am looking for a solution that can do the following:

  • storing arbitrary sized unique words, each with its own 64 bit unsigned integer identifier and a 32 or 64 bit unsigned integer reference count
  • being able to access the data quickly with these access patterns:
    • for a lookup of the word, return its uint64 identifier
    • for a lookup of the identifier, return the word
  • inserting new records, ideally with an auto-incremented identifier and an atomically incremented reference count, ideally in batch commits (meaning not word by word, each in a separate transaction, but several words in a single committed transaction)
  • atomically removing records that have a zero reference count (this may be done even with a rate-limited full table scan, iterating through all of the records and removing those with a zero refcount inside a transaction)
  • storing a large number of records on traditional spinning rust (hard drives); the record count is somewhere between approximately 100 million and 1000 billion (1000*10^9)
  • the typical word length is approximately 25-80 bytes
  • it would be good to have a Python (for prototyping) and a C interface, preferably embeddable, or else an efficient "remote" (it will be on localhost only) API
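
To make the requirements concrete, here is a minimal sketch of the operations I have in mind, written against Python's built-in sqlite3 module purely as a stand-in. The function names (open_store, add_words, release_words, id_of, word_of, purge_unreferenced) are made up for illustration, SQLite integers are signed 64 bit rather than unsigned, and I am not suggesting SQLite would cope with the record counts above.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) the word store. All names here are illustrative."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS words ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"   # auto-incremented identifier
        " word TEXT NOT NULL UNIQUE,"              # UNIQUE also creates the lookup index
        " refcnt INTEGER NOT NULL DEFAULT 0)"
    )
    return db

def add_words(db, words):
    """Insert new words or bump the refcnt of known ones, in ONE transaction."""
    with db:  # commits once at the end, rolls back if anything raises
        for w in words:
            cur = db.execute(
                "UPDATE words SET refcnt = refcnt + 1 WHERE word = ?", (w,))
            if cur.rowcount == 0:  # word not stored yet
                db.execute(
                    "INSERT INTO words (word, refcnt) VALUES (?, 1)", (w,))

def release_words(db, words):
    """Decrement the refcnt of each given word, again as one batch."""
    with db:
        db.executemany(
            "UPDATE words SET refcnt = refcnt - 1 WHERE word = ? AND refcnt > 0",
            [(w,) for w in words])

def id_of(db, word):
    """word -> identifier, or None if the word is unknown."""
    row = db.execute("SELECT id FROM words WHERE word = ?", (word,)).fetchone()
    return row[0] if row else None

def word_of(db, ident):
    """identifier -> word, or None if the identifier is unknown."""
    row = db.execute("SELECT word FROM words WHERE id = ?", (ident,)).fetchone()
    return row[0] if row else None

def purge_unreferenced(db):
    """Remove all records with a zero refcnt in one transaction; returns the count."""
    with db:
        return db.execute("DELETE FROM words WHERE refcnt = 0").rowcount
```

Usage would look like add_words(db, ["alpha", "beta", "alpha"]) followed by id_of(db, "alpha") and word_of(db, 2). The purge here is a single statement; a rate-limited version would delete in bounded chunks instead.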

For instance a MySQL schema could be something similar to this:

CREATE TABLE words (
    id SERIAL,
    word MEDIUMTEXT,
    refcnt INT UNSIGNED,
    INDEX(word(12)),
    PRIMARY KEY (id)
)

This obviously works, but MySQL is not really up to this task, and because of the index required for word lookups, it stores redundant information needlessly.
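
To put rough numbers on the sizes involved, here is a back-of-the-envelope calculation. It counts payload bytes only and ignores all page, row and engine overhead. It also assumes one byte per character in the 12-character index prefix and an 8-byte reference to the row in every index entry, which is my assumption and not a measured figure. The helper names are made up.

```python
def payload_bytes(records, avg_word_len, id_bytes=8, refcnt_bytes=4):
    """Raw data: word + 64 bit id + 32 bit refcnt per record."""
    return records * (avg_word_len + id_bytes + refcnt_bytes)

def prefix_index_bytes(records, prefix_len=12, id_bytes=8):
    """Assumed index cost: the duplicated word prefix plus a row reference."""
    return records * (prefix_len + id_bytes)

if __name__ == "__main__":
    for records in (100 * 10**6, 1000 * 10**9):   # lower and upper record count
        for avg_len in (25, 80):                  # typical word length range
            data = payload_bytes(records, avg_len)
            index = prefix_index_bytes(records)
            print(f"{records:>16,d} records, {avg_len} byte words: "
                  f"data {data / 1e9:,.1f} GB, index {index / 1e9:,.1f} GB")
```

Under these assumptions the index alone adds 20 bytes per record on top of the 37 to 92 bytes of raw data, 12 of which just duplicate the start of the word.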

During the search for the best solution, I have found the following so far:

  • since the words share lots of commonality (many of them are plain dictionary words in a variety of languages and character sets), something like this could be good: http://www.unixuser.org/~euske/doc/tcdb/index.html
  • the best I could find so far is Tokyo Cabinet's TDB: packages.python.org/tokyocabinet-python/TDB.html, but I still have to evaluate its performance and possible configurations (where to store what, and what kind of index to use where, for the best time and space efficiency)
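
To show what I mean by exploiting the commonality between words, here is a minimal sketch of front coding, where each word in a sorted list is stored as the length of the prefix it shares with the previous word plus the remaining suffix. This is a generic technique that I am using for illustration. I am not claiming that tcdb or Tokyo Cabinet work this way internally, and the function names are made up.

```python
def shared_prefix_len(a, b):
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def front_encode(sorted_words):
    """Encode each word as (shared_prefix_len, suffix) relative to its predecessor."""
    encoded = []
    prev = ""
    for w in sorted_words:
        shared = shared_prefix_len(prev, w)
        encoded.append((shared, w[shared:]))
        prev = w
    return encoded

def front_decode(encoded):
    """Rebuild the original sorted word list from front-coded pairs."""
    words = []
    prev = ""
    for shared, suffix in encoded:
        w = prev[:shared] + suffix
        words.append(w)
        prev = w
    return words

if __name__ == "__main__":
    words = sorted(["transaction", "transact", "transfer", "tree"])
    packed = front_encode(words)
    print(packed)
    assert front_decode(packed) == words
```

The catch is that a lookup then needs to decode from the start of a block, so real implementations restart the coding every so many words.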

Any ideas, algorithms, or better still, ready-to-use products and configurations?

Thanks,