I am focusing on an issue which involves multiple database instances, each with various table structures. The issue is, between these tables, you will find numerous of replicates, and that i need a method to effectively locate them, report them, and perhaps get rid of them.
Eg. I've two tables, the very first table,
CustomerData using the fields:
_countId, customerFID, customerName, customerAddress, _someRandomFlags
and that i have another table,
CustomerData2 (built later) using the fields:
_countId, customerFID, customerFirstName, customerLocation, _someOtherRandomFlags.
Between your two tables above, I understand for certain that
customerFirstName were utilised to keep exactly the same data, together with
customerAddress were also accustomed to keep same data.
Allows say, a few of the salesforce have used
customerData, yet others have used
customerData2. Let me have a scalable method of discovering the redundancies between your tables and report them. It may be assumed with a few quantity of certainty that
customerFID both in tables are consistent, and make reference to exactly the same customer.
One solution I possibly could think off was, to produce a
customerData class in python, map the records within the two tables for this class, and compute a hash/signature for that objects inside the class which are needed (
customerName, customerLocation/Address) and store these to a signature table, that has the posts:
sourceTableName, entityType (customerData), identifyingKey (customerFID), signature
as well as each
entityType, I search for duplicate signatures for every
The truth is, I am dealing with huge teams of biomedical data, with numerous of posts. These were produced at differing people (but yet without any standard nomenclature or structure) and also have been duplicate data saved inside them
EDIT: For simplicity sake, I'm able to move all of the database instances to some single server instance.
Basically could not take care of performance, I'd make use of a high-level practical approach. Use Django (or SQLAlchemy or...) to construct your preferred models (your tables) and fetch the information to check. Then make use of an formula for effectively determining replicates (...from lists or dicts,it is dependent of "how" you possess your computer data). To enhance performance you might attempt to "enhance" your application using the multiprocessing module or think about a map-reduce solution.