I'm focusing on a potential architecture to have an abuse recognition mechanism with an account management system. Things I want would be to identify possible duplicate customers according to certain correlating fields inside a table. To create the issue simplistic, allows say I've got a USER table using the following fields:

Current Address

It is extremely possible that certain user has produced multiple records in this particular table. There can be a particular pattern by which this user has produced his/her accounts. An amount it decide to try mine this table to flag records that might be possible replicates. Another problem is scale. When we have allows say millions of customers, taking one user and matching it from the remaining customers is impractical computationally. Let's say this info are distributed across various machines in a variety of geographic locations?

What are the techniques, will be able to use, to resolve this issue? I've attempted to pose this inside a technologically agnostic manner using the hopes that individuals can offer me with multiple perspectives.


The solution really is dependent upon the way you model your customers and what comprises a replica.

There might be a person that utilizes names all harry potter figures. Best of luck discovering that pattern :)

If you're searching for records which are roughly similar do this simple approach: Hash each word within the doc and select the min shingle. Do that for k different hash functions. Concatenate these min hashes. That which you have is really a near duplicate.

To become obvious, allows say an archive has words w1....wn. Allows say your hash functions are h1...hk.

let m_i = min_j (h_i(w_j)

and also the signature is S = m1.m2.m3....mk

The awesome factor with this particular signature is when two documents contain 90% same words then there's a great 90% chance so good chance the signatures will be the same for that two documents. Hence, rather than searching for near replicates, you search for exact replicates within the signatures. If you wish to increase the amount of matches then you definitely decrease the need for k, if you're getting a lot of false positives then you definitely increase the amount of k.

Obviously there's the approach of implicit options that come with customers for example thier IP addresses and cookie etc.