I have a requirement to produce a list of possible duplicates before a user saves an entity to the database, and to warn them about the possible duplicates.

There are 7 criteria on which we should check for duplicates, and if at least 3 of them match we should flag this up to the user. The criteria will all match on ID, so there is no fuzzy string matching needed, but my problem comes from the fact that there are many possible ways (99 ways, if I have done my sums correctly) for at least 3 items to match out of the list of 7 possibles.
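The figure of 99 can be checked by summing the number of ways to pick exactly k matching criteria out of 7, for k from 3 to 7. A quick sanity check in Python (the function name `ways_at_least` is only illustrative):

```python
from math import comb


def ways_at_least(k_min: int, n: int) -> int:
    """Number of subsets of n criteria that contain at least k_min of them."""
    return sum(comb(n, k) for k in range(k_min, n + 1))


# C(7,3) + C(7,4) + C(7,5) + C(7,6) + C(7,7) = 35 + 35 + 21 + 7 + 1
print(ways_at_least(3, 7))  # 99
```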

I don't want to have to run 99 separate db queries to find my search results, nor do I want to bring the whole lot back from the db and filter on the client side. We are probably only talking about a few hundred thousand records at present, but this may grow into the millions as the system matures.

Has anybody got any thoughts on a nice, efficient way of doing this? I was thinking of a simple OR query to get the records where at least one field matches from the db, and then doing some processing on the client to filter it further, but a few of the fields have very low cardinality and won't really reduce the numbers by much.

Thanks, Jon

OR and CASE summing will work, but they are quite inefficient, since they don't use indexes.

You need to use UNION for the indexes to be usable.

If a user enters a name, phone, email and address into the database, and you want to check all records that match at least 3 of these fields, you issue:

SELECT  i.*
FROM    (
        SELECT  id, COUNT(*) AS cnt
        FROM    (
                SELECT  id
                FROM    t_info t
                WHERE   name  = 'Eve Chianese'
                UNION ALL
                SELECT  id
                FROM    t_info t
                WHERE   phone = '+15558000042'
                UNION ALL
                SELECT  id
                FROM    t_info t
                WHERE   email = '42@example.com'
                UNION ALL
                SELECT  id
                FROM    t_info t
                WHERE   address = '42 North Lane'
                ) q
        GROUP BY
                id
        HAVING  COUNT(*) >= 3
        ) dq
JOIN    t_info i
ON      i.id = dq.id

This will use the indexes on these fields, and the query will be fast.
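To try the query above end to end, here is a minimal sketch that runs it against an in-memory SQLite database with bound parameters. The `t_info` table and its columns come from the query; the sample rows, the index names and the `find_candidates` helper are assumptions made up for illustration, and SQLite merely stands in for whatever DBMS is actually in use.

```python
import sqlite3


def find_candidates(conn, name, phone, email, address, min_matches=3):
    """Return ids of rows matching at least min_matches of the four fields."""
    sql = """
        SELECT id
        FROM (
            SELECT id FROM t_info WHERE name = ?
            UNION ALL
            SELECT id FROM t_info WHERE phone = ?
            UNION ALL
            SELECT id FROM t_info WHERE email = ?
            UNION ALL
            SELECT id FROM t_info WHERE address = ?
        ) q
        GROUP BY id
        HAVING COUNT(*) >= ?
        ORDER BY id
    """
    params = (name, phone, email, address, min_matches)
    return [row[0] for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t_info (id INTEGER PRIMARY KEY, "
    "name TEXT, phone TEXT, email TEXT, address TEXT)"
)
# One single-column index per criterion, so each UNION ALL branch can seek.
for col in ("name", "phone", "email", "address"):
    conn.execute(f"CREATE INDEX ix_t_info_{col} ON t_info ({col})")

# Hypothetical sample data.
conn.executemany("INSERT INTO t_info VALUES (?, ?, ?, ?, ?)", [
    (1, "Eve Chianese", "+15558000042", "42@example.com", "1 Other Street"),    # 3 of 4
    (2, "Eve Chianese", "+15558000099", "other@example.com", "42 North Lane"),  # 2 of 4
    (3, "Someone Else", "+15558000042", "42@example.com", "42 North Lane"),     # 3 of 4
    (4, "Eve Chianese", "+15558000042", "42@example.com", "42 North Lane"),     # 4 of 4
])

dupes = find_candidates(
    conn, "Eve Chianese", "+15558000042", "42@example.com", "42 North Lane"
)
print(dupes)  # [1, 3, 4]
```

Extending this to 7 criteria means adding three more UNION ALL branches; the threshold in HAVING stays the same.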

See this article in my blog for details:

  • Matching 3 of 4: how to match a record which matches at least 3 of 4 possible conditions

Also see this question, which the article is based upon.

If you want to have a list of DISTINCT values in the existing data, you just wrap this query into a subquery:

SELECT  i1.*
FROM    t_info i1
WHERE   EXISTS
        (
        SELECT  1
        FROM    (
                SELECT  id
                FROM    t_info t
                WHERE   name  = i1.name
                UNION ALL
                SELECT  id
                FROM    t_info t
                WHERE   phone = i1.phone
                UNION ALL
                SELECT  id
                FROM    t_info t
                WHERE   email = i1.email
                UNION ALL
                SELECT  id
                FROM    t_info t
                WHERE   address = i1.address
                ) q
        GROUP BY
                id
        HAVING  COUNT(*) >= 3
        )

Note that this DISTINCT is not transitive: if A matches B and B matches C, it does not mean that A matches C.

You might want something like the following:
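A small illustration of why "matches at least 3 of 4" is not transitive. The three records and the `matches` helper below are made up for the example:

```python
FIELDS = ("name", "phone", "email", "address")


def matches(a: dict, b: dict, min_matches: int = 3) -> bool:
    """True if a and b agree on at least min_matches of FIELDS."""
    return sum(a[f] == b[f] for f in FIELDS) >= min_matches


a = {"name": "N1", "phone": "P1", "email": "E1", "address": "A1"}
b = {"name": "N1", "phone": "P1", "email": "E1", "address": "A2"}  # differs from a in address only
c = {"name": "N2", "phone": "P1", "email": "E1", "address": "A2"}  # differs from b in name only

# a~b and b~c share 3 fields each, but a and c share only phone and email.
print(matches(a, b), matches(b, c), matches(a, c))  # True True False
```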

SELECT id
FROM
    (SELECT id,
        CASE fld1 WHEN input1 THEN 1 ELSE 0 END AS rule1,
        CASE fld2 WHEN input2 THEN 1 ELSE 0 END AS rule2,
        ...,
        CASE fld7 WHEN input7 THEN 1 ELSE 0 END AS rule7
    FROM table) t
WHERE rule1 + rule2 + rule3 + ... + rule7 >= 3

This is not tested, but it shows a way to tackle this.

What DBMS are you using? Some support this kind of constraint through server-side code.
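For reference, a runnable sketch of the CASE-summing idea against an in-memory SQLite database. The `fld1` to `fld7` columns and `rule` aliases follow the query above; the `entity` table name, the sample rows and the `find_by_case_sum` helper are assumptions for illustration. Every row is evaluated, so this version reads the whole table.

```python
import sqlite3

N_FIELDS = 7


def find_by_case_sum(conn, inputs, min_matches=3):
    """Return ids of rows where at least min_matches of fld1..fld7 equal inputs."""
    rules = ", ".join(
        f"CASE fld{i} WHEN ? THEN 1 ELSE 0 END AS rule{i}"
        for i in range(1, N_FIELDS + 1)
    )
    total = " + ".join(f"rule{i}" for i in range(1, N_FIELDS + 1))
    sql = (
        f"SELECT id FROM (SELECT id, {rules} FROM entity) t "
        f"WHERE {total} >= ? ORDER BY id"
    )
    # Placeholders appear in order: the seven inputs, then the threshold.
    return [row[0] for row in conn.execute(sql, (*inputs, min_matches))]


conn = sqlite3.connect(":memory:")
cols = ", ".join(f"fld{i} INTEGER" for i in range(1, N_FIELDS + 1))
conn.execute(f"CREATE TABLE entity (id INTEGER PRIMARY KEY, {cols})")

# Hypothetical sample data: each fld holds the ID of a related record.
conn.executemany("INSERT INTO entity VALUES (?, ?, ?, ?, ?, ?, ?, ?)", [
    (1, 10, 20, 30, 0, 0, 0, 0),      # matches 3 of 7
    (2, 10, 20, 0, 0, 0, 0, 0),       # matches 2 of 7
    (3, 10, 20, 30, 40, 50, 60, 70),  # matches 7 of 7
])

result = find_by_case_sum(conn, (10, 20, 30, 40, 50, 60, 70))
print(result)  # [1, 3]
```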