# Compare 5000 strings with PHP Levenshtein

I've 5000, sometimes more, home address strings in a wide array. Let me do a comparison with levenshtein to locate similar matches. How do i do that without looping through all 5000 and evaluating them directly with almost every other 4999?

Edit: I'm also thinking about alternate techniques if anybody has suggestions. The general goal is to locate similar records (and eliminate replicates) according to user-posted street addresses.

Use a bk-tree to hurry-in the search/comparison.

Now we can produce a particularly helpful observation concerning the Levenshtein Distance: It forms a Metric Space.
[...]
Assume as it were we've two parameters, query, the string we're using within our search, and n the utmost distance a string could be from query but still be came back. Say we take an arbitary string, make sure compare it to question. Call the resultant distance d. Because we all know the triangular inequality holds, all of our results should have for the most part distance d+n and a minimum of distance d-n from test.
[...]
Tests reveal that searching having a distance of just one queries a maximum of 5-8% from the tree, and looking out with two errors queries a maximum of 17-25% from the tree - a considerable improvement over checking every node!

edit: But that does not assist you with your ("12 Bird Road, Apt 6" and "12 Bird Rd. #6") problem. Just with your brute-pressure m*n comparison.

I believe an easy method to group similar addresses is always to:

1. produce a database with two tables Body for that address (along with a id), one for that soundexes of words or literal amounts within the address (using the foreign key from the addresses table)

2. uppercase the address, replace anything apart from [A-Z] or [-9] having a space

3. split the address by space, calculate the soundex of every 'word', leave anything with only numbers out of the box and store it within the soundexes table using the foreign key from the address you began with

4. for every address (with id \$target) discover the most similar addresses:

``````SELECT similar.id, similar.address, count(*)
FROM adress similar, word cmp, word src
AND src.soundex=cmp.soundex
ORDER BY count(*)
LIMIT \$some_value;
``````
5. calculate the levenstein distinction between your source address and also the top couple of values came back through the query.

(doing any kind of operation on large arrays is frequently faster in databases)

I believe you can't avoid looping with the array because the levenstein() function takes only strings and never an assortment as input.

That you can do something similar to:

``````for(\$i=0;\$i<count(\$array)-1;\$i++)
{
for(\$j=\$i+1;\$j<count(\$array);\$j++)
{
\$lev = levenshtein(\$array[\$i],\$array[\$j]);
if(\$lev == 0)
{
// exact match
}
else if(\$lev <= THRESHOLD)
{
// similar
}
}
}
``````

Because of the character from the Levenshtein formula (particularly the truth that it is a comparision between two strings), I can not observe how you could do.

You can obviously reduce the amount of evaluations by carrying out some fundamental matching needs first, but this has run out of the scope of the items you are asking.

Like a (potentially irrelevant) option, you can always use something similar to `soundex` which may allow you to pre-compute the string values. (You may also utilize it directly in MySQL In my opinion.)

You can group them according to soundexes then limit the evaluations towards the nearest N cases...

`````` \$mashed=array();