The question: What solutions or tips would you suggest for a really large (multi-terabyte) database indexed on strong hashes with high redundancy?
Some type of inverted storage?
Is there something that can be done with Postgres?
I'm prepared to roll my own storage if needed.
(Hint: Should be free, no Java, must operate on Linux, should be disk-based, C/C++/Python preferred)
I need to build a large database where each record has:
- some arbitrary metadata (a few text fields) including a primary key
- one hash (128-bit, strong, MD5-like)
The number of records is what I'd qualify as quite large: from several tens to hundreds of billions. There is significant redundancy of hashes across rows (over 40% of the records share their hash with at least one other record, and some hashes appear in 100K records).
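For reference, that redundancy figure can be measured directly from a sample of the hash column. A minimal sketch (assuming the hashes are available as an iterable of values; the function name is illustrative):

```python
from collections import Counter

def hash_redundancy(hashes):
    """Return the fraction of rows whose hash is shared with at least
    one other row, and the row count of the most common hash."""
    counts = Counter(hashes)
    shared_rows = sum(c for c in counts.values() if c > 1)
    total = sum(counts.values())
    return shared_rows / total, counts.most_common(1)[0][1]

# Toy sample: 3 of 5 rows share a hash -> 60% redundancy.
frac, max_dup = hash_redundancy(["a", "a", "a", "b", "c"])
print(frac, max_dup)  # 0.6 3
```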
The main usage is to look up by hash, then retrieve the metadata. The secondary usage is to look up by primary key, then retrieve the metadata.
This is a statistics-type database, so the overall load is medium: mostly reads, few writes, and writes mostly batched.
The current approach is to use Postgres, with an index on the primary key and an index on the hash column. The table is loaded in batches with the index on the hash turned off.
All indexes are btrees. The index on the hash column is growing huge, as large as or larger than the table itself. On a 120 GB table it takes about a day to recreate the index. The query performance is very good though.
The problem is that the projected size of the target database will be over 4 TB, based on tests with a smaller data set of 400 GB representing about 10% of the total target. Once loaded into Postgres, 50+% of the storage is unfortunately used by the SQL index on the hash column.
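The projection is just linear scaling from the 10% sample; a back-of-envelope check (the 50% index share is the lower bound stated above):

```python
sample_size_gb = 400     # test data set, about 10% of the target
sample_fraction = 0.10
index_share = 0.50       # 50+% of storage goes to the hash index

projected_total_tb = sample_size_gb / sample_fraction / 1000  # GB -> TB
projected_index_tb = projected_total_tb * index_share

print(projected_total_tb)  # 4.0 TB total
print(projected_index_tb)  # at least 2.0 TB of hash index alone
```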
This is far too big. And I feel that the redundancy in hashes is an opportunity to store less.
Note also that while this describes the problem, there are a few of these tables that need to be built.
An answer: You could create a table with only id and hash, and put your other data in a table with index, metadata, and hashId. That way, you avoid writing the same hash up to 100K times into the table.
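A minimal sketch of that normalization, using sqlite3 as a stand-in for Postgres (table and column names are illustrative, not from the original):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Each distinct hash is stored exactly once.
    CREATE TABLE hashes (
        id   INTEGER PRIMARY KEY,
        hash BLOB NOT NULL UNIQUE          -- 128-bit digest
    );
    -- Records point at the shared hash row instead of repeating it.
    CREATE TABLE records (
        pk      TEXT PRIMARY KEY,
        meta    TEXT,
        hash_id INTEGER NOT NULL REFERENCES hashes(id)
    );
""")

def insert_record(pk, meta, digest):
    # Insert the hash only if it is not already present, then link to it.
    conn.execute("INSERT OR IGNORE INTO hashes(hash) VALUES (?)", (digest,))
    (hash_id,) = conn.execute(
        "SELECT id FROM hashes WHERE hash = ?", (digest,)).fetchone()
    conn.execute("INSERT INTO records VALUES (?, ?, ?)", (pk, meta, hash_id))

# Two records share the same digest, but it is stored only once.
insert_record("r1", "meta one", b"\x00" * 16)
insert_record("r2", "meta two", b"\x00" * 16)
n_hashes = conn.execute("SELECT COUNT(*) FROM hashes").fetchone()[0]
print(n_hashes)  # 1
```

The unique index then lives on the (much smaller) hashes table, and the wide records table only carries a compact integer foreign key.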