Let's say I have a large collection of rows of data, where each element in a row is a (key, value) pair:

1)    [(bird, "eagle"), (fish, "cod"),      ... , (soda, "coke")]
2)    [(bird, "lark"),  (fish, "bass"),     ...,  (soda, "pepsi")]
n)    ....
n+1)  [(bird, "robin"), (fish, "flounder"), ...,  (soda, "fanta")]

I'd like to run some computation that lets me determine, for any new row, which existing row is "most similar" to it.

The most direct way I can think of to find the "most similar" row for any particular row is to compare that row directly against every other row. This is obviously very expensive computationally.
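For reference, the direct comparison could look something like the sketch below. It assumes each row is stored as a dict and that "similar" simply means "shares the most identical (key, value) pairs"; both are assumptions made for illustration, and the names `overlap` and `most_similar` are made up here.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def overlap(a: Row, b: Row) -> int:
    """Count the keys on which both rows hold exactly the same value."""
    return sum(1 for key, value in a.items() if b.get(key) == value)


def most_similar(new_row: Row, rows: List[Row]) -> Tuple[int, int]:
    """Scan every stored row; return (index, score) of the best overlap."""
    best_index, best_score = -1, -1
    for index, row in enumerate(rows):
        score = overlap(new_row, row)
        if score > best_score:
            best_index, best_score = index, score
    return best_index, best_score


rows = [
    {"bird": "eagle", "fish": "cod", "soda": "coke"},
    {"bird": "lark", "fish": "bass", "soda": "pepsi"},
    {"bird": "robin", "fish": "flounder", "soda": "fanta"},
]
query = {"bird": "lark", "fish": "cod", "soda": "pepsi"}
print(most_similar(query, rows))  # (1, 2)
```

Every lookup touches every stored row, which is exactly the cost the rest of the question is trying to avoid.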

I'm looking for a solution of the following form.

  • A function that can take a row and generate some derived integer for that row. The returned integer would be a kind of "signature" of the row. The key property of the signature is that two very "similar" rows would generate very close integers, while very "different" rows would generate distant integers. Obviously, identical rows would generate the same signature.

  • I could then take these generated signatures, each paired with the index of the row it points to, and sort them all by signature. I'd keep this data structure around so that I can do fast lookups. Call it database B.

  • When I have a new row and want to know which existing row in database B is most similar, I would:

    1. Generate a signature for the new row
    2. Binary search through the sorted list of (signature, index) in database B for the closest match
    3. Return the closest matching (could be a perfect match) row in database B.
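The lookup half of this scheme is easy to sketch once a signature function is given. The code below assumes the signature function is passed in as a parameter. The `toy_signature` used in the demo is only a stand-in so the example runs, and it does not have the similarity property asked for above. All names here are hypothetical.

```python
import bisect
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]


def build_index(rows: List[Row],
                signature: Callable[[Row], int]) -> List[Tuple[int, int]]:
    """Database B: (signature, row index) pairs sorted by signature."""
    return sorted((signature(row), i) for i, row in enumerate(rows))


def nearest(index: List[Tuple[int, int]], sig: int) -> Tuple[int, int]:
    """Binary search for the stored pair whose signature is closest to sig."""
    if not index:
        raise ValueError("index is empty")
    # (sig,) sorts before every (sig, i), so pos is the first entry >= sig.
    pos = bisect.bisect_left(index, (sig,))
    # Only the neighbours on either side of pos can be the closest match.
    candidates = index[max(pos - 1, 0):pos + 1]
    return min(candidates, key=lambda pair: abs(pair[0] - sig))


def toy_signature(row: Row) -> int:
    """Placeholder only: total length of the values, NOT a real similarity signature."""
    return sum(len(value) for value in row.values())


rows = [
    {"bird": "eagle", "fish": "cod", "soda": "coke"},
    {"bird": "lark", "fish": "bass", "soda": "pepsi"},
    {"bird": "robin", "fish": "flounder", "soda": "fanta"},
]
index = build_index(rows, toy_signature)
new_row = {"bird": "robin", "fish": "flounder", "soda": "coke"}
print(nearest(index, toy_signature(new_row)))  # (18, 2)
```

Building the index costs one sort, and each lookup is a single binary search plus a comparison of at most two neighbours. The open problem is still the signature function itself.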

I know there is a lot of hand waving in this question. My problem is that I don't actually know what function would generate this signature. I've looked at Levenshtein distances, but those represent the cost of transforming one string into another, not a signature. I've also thought about trying lossy compression: two things might be "bucketable" if they compress to the same thing. I'm looking for other ideas on how to do this.
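To make the Levenshtein point concrete, here is a minimal sketch of the standard dynamic-programming edit distance. Note that it needs two strings as input, so it yields a cost between a pair of values rather than a standalone number per row that could be sorted and indexed.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("cod", "coke"))        # 2
```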


If you had a lot of data and wanted to do this hardcore, I would suggest a statistical method like PLSA or PSVM, which can extract identifying topics from text and identify documents with similar topic probabilities.

A simpler, but less accurate, way of doing it is to use Soundex, which is available for many languages. You can store the Soundex code (which is a short string, not an integer, I'm afraid) and look for exact matches on it, which should point to similar rows.
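As a rough sketch of the Soundex idea, the code below implements the usual American Soundex rules for ASCII letters and then buckets rows by the Soundex codes of their values. Bucketing on the tuple of per-value codes (the `row_key` helper) is an assumption made for illustration, not the only way to combine the codes.

```python
from collections import defaultdict

# Consonant groups of American Soundex. Vowels and y get no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Four-character Soundex code; non-ASCII-letter characters are ignored."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of equal codes
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets the run, so a repeat is coded again
    return (result + "000")[:4]


def row_key(row):
    """Hypothetical bucket key: Soundex of every value, in key order."""
    return tuple(soundex(value) for _, value in sorted(row.items()))


rows = [
    {"bird": "eagle", "fish": "cod", "soda": "coke"},
    {"bird": "lark", "fish": "bass", "soda": "pepsi"},
    {"bird": "robin", "fish": "flounder", "soda": "fanta"},
]
buckets = defaultdict(list)
for i, row in enumerate(rows):
    buckets[row_key(row)].append(i)

misspelled = {"bird": "robbin", "fish": "flownder", "soda": "fanta"}
print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(buckets.get(row_key(misspelled), []))  # [2]
```

This only catches rows whose values sound alike. A row that differs in one whole value lands in a different bucket, which is where the "less accurate" caveat comes from.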

I think it's unrealistic to expect a function to turn a series of strings into an integer such that integers near each other map to similar strings. The closest you might come is computing a checksum on each individual tuple and comparing the checksums for the new row against the checksums of existing rows, but I'm guessing you're trying to come up with a single number you can index on.
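A minimal sketch of the per-tuple checksum idea, assuming CRC-32 from Python's `zlib` as the checksum and an inverted index from checksum to row numbers. The function names and the `key=value` encoding are made up for the example.

```python
import zlib
from collections import Counter, defaultdict


def pair_checksum(key: str, value: str) -> int:
    """CRC-32 of one (key, value) tuple, encoded as 'key=value'."""
    return zlib.crc32(f"{key}={value}".encode("utf-8"))


def build_checksum_index(rows):
    """Map each tuple checksum to the set of row indices that contain it."""
    index = defaultdict(set)
    for i, row in enumerate(rows):
        for key, value in row.items():
            index[pair_checksum(key, value)].add(i)
    return index


def best_match(index, new_row):
    """Return (row index, shared checksum count) for the best row, or None."""
    hits = Counter()
    for key, value in new_row.items():
        for i in index.get(pair_checksum(key, value), ()):
            hits[i] += 1
    return hits.most_common(1)[0] if hits else None


rows = [
    {"bird": "eagle", "fish": "cod", "soda": "coke"},
    {"bird": "lark", "fish": "bass", "soda": "pepsi"},
    {"bird": "robin", "fish": "flounder", "soda": "fanta"},
]
index = build_checksum_index(rows)
print(best_match(index, {"bird": "lark", "fish": "cod", "soda": "pepsi"}))
```

This finds rows that share exact tuples without scanning the whole table, but it yields a set of checksums per row rather than the single sortable number the question asks for, and two values that differ by one character share nothing.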