Storing and indexing binary strings inside a database

A binary string as defined here's fixed size "array" of bits. I give them a call strings since there's no order in it (sorting/indexing them as amounts doesn't have meaning), every bit is in addition to the others. Each such string is N bits lengthy, with N within the 100s.

I have to store these strings and given a brand new binary string query for that nearest neighbor while using Hamming distance because the distance metric.
You will find specialized data-structures (metric-trees) for metric-based search (Vice president-trees, cover-trees, M-trees), but I have to make use of a regular database (MongoDB during my situation).

Can there be some indexing function that may be put on the binary strings that will help the DB access merely a subset from the records before carrying out the main one-to-one Hamming distance match? Alternatively, how will it be easy to implement such Hamming based explore a typical DB?

The hamming distance is really a metric therefore it satisfies the triangular inequality. For every bitstring inside your database, you can keep it's hamming distance with a pre-defined constant bitstring. You'll be able to make use of the triangular inequality to remove bitstrings within the database.

So let us say

``````C <- some constant bitstring
S <- bitstring you're trying to find the best match for
B <- a bitstring in the database
distS <- hamming_dist(S,C)
distB <- hamming_dist(B,C)
``````

So for every `B`, you'd store it's corresponding `distB`.

A lesser bound for `hamming(B,S)` would then be `abs(distB-distS)`. And also the upper bound could be `distB+distS`.

You are able to discard all `B` so that the low bound is greater compared to cheapest upper bound.

I am not 100% sure regarding the optimal method to determine which `C`. I believe you'd want that it is a bitstring that's near to the "center" of the metric space of bitstrings.

You could utilize a strategy much like levenshtein automata, put on bitstrings:

1. Compute the very first (lexicographically littlest) bitstring that will meet your hamming-distance criteria.
2. Fetch the very first derive from the database that's more than or comparable to that value
3. When the value is really a match, output it and fetch the following result. Otherwise, compute the following value more than it that is a match, and goto 2.

This is the same as carrying out a merge join over your database and also the listing of possible matches, without needing to generate every possible match. It'll lessen the search space, but it is still prone to need a significant quantity of queries.