I have an ASP.NET MVC application, which I am attempting to write an import function for.

I do have some specifics, for instance I'm using Entity Framework v4 within an MVC application, but I am mostly interested in what algorithm will work best, ideally with an explanation of what kind of performance it has, and why.

This operation is going to be carried out asynchronously, so execution time isn't as much of a factor as something like RAM usage.

I should point out that there are a number of things (the database being the main one) that I've been forced to inherit and, due to time constraints, won't be able to clean up until later on.


The import function is to take an in-memory CSV file (which has been exported from Salesforce and uploaded) and merge it into an existing database table. The process must be prepared to:

  • Update existing records that have been changed in the CSV, without deleting and re-adding the database record, in order to preserve the primary key of each record.

  • Add and remove any records as they change in the CSV file.
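To make the requirements concrete, here is a minimal sketch of the merge classification I have in mind, assuming (as discussed below) that ContactID can serve as the match key. The dictionary-of-strings representation and the `MergePlan` name are my own simplification, not part of the real schema:

```csharp
using System;
using System.Collections.Generic;

class MergePlan
{
    // ContactIDs to INSERT, UPDATE, or DELETE after comparing the two sets.
    public List<string> Inserts = new List<string>();
    public List<string> Updates = new List<string>();
    public List<string> Deletes = new List<string>();

    // "existing" and "incoming" map ContactID -> concatenated field values.
    public static MergePlan Build(Dictionary<string, string> existing,
                                  Dictionary<string, string> incoming)
    {
        var plan = new MergePlan();
        foreach (var kv in incoming)
        {
            string current;
            if (!existing.TryGetValue(kv.Key, out current))
                plan.Inserts.Add(kv.Key);       // new in the CSV
            else if (current != kv.Value)
                plan.Updates.Add(kv.Key);       // changed in the CSV
        }
        foreach (var id in existing.Keys)
            if (!incoming.ContainsKey(id))
                plan.Deletes.Add(id);           // removed from the CSV
        return plan;
    }
}
```

The point is that updates never touch the existing primary key; only the classified IDs are acted on.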

The current structure of the CSV and database table is such that:

  • The table and CSV both contain 52 columns.

  • Each column in the existing database schema is a VARCHAR(100) field. I'm planning to optimise this, but cannot within the current time frame.

  • The database back-end is MS SQL.

  • The CSV file has about 1700 rows' worth of data in it. I can't see this number exceeding 5000, as there are already many duplicate records, apparently.

  • At the moment, I'm only interested in actually importing 10 of those columns from the CSV; the rest of the table's fields will be left null, and I'll be removing the unneeded columns later on.

  • The CSV file is being read into a DataTable to make it easier to work with.

  • I initially believed that the ContactID field in my Salesforce CSV was a unique identifier, although after doing some test imports, it appears that there are zero unique fields in the CSV file itself, at least that I can find.

  • Given that, I've been forced to add a primary key field to the Contacts table so that other tables can still maintain a valid relationship with a contact. However, this obviously prevents me from simply deleting and re-creating the records on each import.


It's obvious to me that what I was trying to achieve, performing updates on existing database records when no relationship exists between the table and the CSV, simply can't be accomplished.

It wasn't so much that I didn't know this beforehand, but more that I was hoping there was some bright idea I hadn't considered that might make it possible.

With that in mind, I ended up deciding simply to make the assumption in my algorithm that ContactID is a unique identifier, and then see how many duplicates I ended up with.
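Since the CSV is already in a DataTable, counting how badly the "ContactID is unique" assumption fails is cheap. A sketch of the check I mean (the `ContactID` column name comes from the CSV; `DuplicateCheck` is just an illustrative wrapper):

```csharp
using System;
using System.Data;
using System.Linq;

class DuplicateCheck
{
    // Counts the rows that would be lost if ContactID were treated as unique,
    // i.e. every row beyond the first within each ContactID group.
    public static int CountDuplicateIds(DataTable csv, string idColumn)
    {
        return csv.AsEnumerable()
                  .GroupBy(r => r.Field<string>(idColumn))
                  .Where(g => g.Count() > 1)
                  .Sum(g => g.Count() - 1);
    }
}
```

Running something like this before the import is how I'd quantify the loss up front rather than discovering it afterwards.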

I am posting a potential solution as an answer below, both the algorithm and an actual implementation. I'll leave it for a few more days because I'd much prefer to accept somebody else's better solution as the answer.

Here are some things I found after implementing my solution below:

  • I had to narrow the rows supplied by the CSV so that they matched those rows being imported into the database.
  • The SqlDataReader is perfectly fine; what has the biggest impact is the individual UPDATE/INSERT queries that are carried out.
  • For a completely fresh import, the initial read of items into memory isn't noticeable to the user; the insert process takes about thirty seconds to complete.
  • There were only 15 duplicate IDs missed on a fresh import, which is under 1% of the total data set. I've deemed this an acceptable loss, as I'm told the Salesforce database will have a clean-up anyway. I'm hoping the IDs can be regenerated in those cases.
  • I haven't gathered any resource metrics during the import, but in terms of speed this is OK, thanks to the progress bar I have implemented to provide feedback to the user.
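Since the individual UPDATE/INSERT statements are the bottleneck, one small mitigation is to at least collapse each pair into a single round-trip. A sketch of generating that combined statement (table and column names here are hypothetical stand-ins, and in practice the values would be bound as SqlParameters, which is what the `@`-prefixed placeholders are for):

```csharp
using System;
using System.Linq;

class UpsertSql
{
    // Builds one parameterized statement that updates a row if the ID exists,
    // otherwise inserts it, halving the per-row query count.
    public static string Build(string table, string idColumn, string[] columns)
    {
        string setList = string.Join(", ", columns.Select(c => c + " = @" + c));
        string colList = string.Join(", ", columns);
        string valList = string.Join(", ", columns.Select(c => "@" + c));
        return
            "IF EXISTS (SELECT 1 FROM " + table + " WHERE " + idColumn + " = @" + idColumn + ") " +
            "UPDATE " + table + " SET " + setList + " WHERE " + idColumn + " = @" + idColumn + " " +
            "ELSE INSERT INTO " + table + " (" + idColumn + ", " + colList + ") " +
            "VALUES (@" + idColumn + ", " + valList + ")";
    }
}
```

Wrapping all of the per-row commands in a single SqlTransaction is another stock way to cut the per-statement overhead, though I haven't measured either change myself.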

End EDIT


Given the allocated size of each field, despite the relatively small number of records, I'm concerned mostly about the amount of memory that could be allocated during the import.

The application won't be run in a shared environment, so there's room to breathe in that respect. Also, this particular function would only be run about once per week, manually.

My goal is to at least be able to run comfortably on a semi-dedicated machine. Machine specs are variable, as the application may eventually be sold as a product (though again, not targeted at a shared environment).

In terms of run-time for the import process itself, as mentioned, this is going to be asynchronous and I have already put together some AJAX calls and a progress bar. So I would assume anything up to a couple of minutes would be OK.


I did find the following post which seems to relate to what I want:

C#, how to compare two datatables A + B, how to show rows which are in B but not in A

It seems to me that performing lookups against a hashtable is probably the right idea. However, as mentioned, if I can avoid loading both the CSV and the Contacts table into memory entirely, that would be preferred, and I can't see a way to avoid it with the hashtable method.

One thing I'm not sure how to achieve is how I would calculate a hash of each row to compare, when one set of data is a DataTable object and the other is an EntitySet of Contact objects.

I'm thinking that unless I want to manually iterate over each column value in order to calculate the hash, I need to have both data sets be the same object type, unless anybody has some fancy solutions.
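For what it's worth, the "manually iterate over each column" approach doesn't actually require both sides to share a type, only that both sides produce the same canonical string before hashing. A sketch of that idea (column names are hypothetical; `RowHasher` is my own illustrative name, and a Contact entity would call `HashValues` with its property values):

```csharp
using System;
using System.Data;
using System.Linq;

class RowHasher
{
    // Joins the values with an unlikely delimiter and hashes the result.
    // A DataRow and a Contact entity can both be reduced to this same form,
    // so their hashes are comparable without a shared object type.
    public static int HashValues(params string[] values)
    {
        return string.Join("\u001F", values).GetHashCode();
    }

    public static int HashRow(DataRow row, string[] columns)
    {
        return HashValues(columns.Select(c => row.Field<string>(c) ?? "").ToArray());
    }
}
```

Note that `string.GetHashCode` is only stable within one process run, which is fine for an in-memory comparison like this but not for persisting hashes to the database.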

Am I better off just forgetting Entity Framework for this process? I have certainly spent a lot of time trying to get it to even remotely perform operations in bulk, so I'm more than happy to take it out of the equation.

If anything doesn't make sense or is missing, I apologise; I'm very tired. Just let me know and I'll fix it tomorrow.

I appreciate any help that can be offered, as I'm starting to get desperate. I've spent more time agonising over how to approach this than I had planned.