A company we do business with wants to send us a 1.2 GB CSV file every day containing about 900,000 product listings. Only a small portion of the file changes each day, probably under 0.5%, and it is only rows being added or removed, never modified. We need to display the product listings to our partners.
What makes this harder is that our partners should only be able to see product listings within a 30-500 mile radius of their zip code. Each product listing row has a field specifying the radius for that listing (some are 30, some are 100, some are 500, etc.; 500 is the max). A partner in a given zip code will probably only see about 20 results, so there will be a lot of unused data. We don't know the partner zip codes ahead of time.
We have to consider performance, so I'm not sure of the best way to go about this.
Should I have two databases: one with zip codes and latitude/longitude, using the Haversine formula to calculate distances, and the other the actual product database? And then what do I do: return all the zip codes within a given radius and look for matches in the product database? For a 500-mile radius that would be a lot of zip codes. Or should I write a MySQL function?
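For reference, here is a minimal sketch of the Haversine calculation in Python, plus the per-listing radius check described above. The function names and the listing dict keys (`lat`, `lon`, `radius_miles`) are illustrative, not from any existing schema:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3959.0

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

def listings_visible_from(partner_lat, partner_lon, listings):
    """Keep only listings whose own radius field covers the partner's location.

    Each listing is assumed to be a dict with 'lat', 'lon', and
    'radius_miles' keys (hypothetical schema).
    """
    return [
        row for row in listings
        if haversine_miles(partner_lat, partner_lon, row["lat"], row["lon"])
        <= row["radius_miles"]
    ]
```

Note that each listing is matched against its own radius, not a fixed one, which is why a simple "all zip codes within X miles" pre-filter is awkward: X varies per row, up to the 500-mile maximum.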
We could use Amazon SimpleDB to store the database, but I still have this problem with the zip codes. I could make two "domains", as Amazon calls them: one for the products and one for the zip codes. But I don't think you can run a query across multiple SimpleDB domains; at least, I don't see that anywhere in their documentation.
I'm open to another solution entirely. It doesn't have to be PHP/MySQL or SimpleDB. Just keep in mind that our dedicated server is a P4 with 2 GB of RAM. We could upgrade the RAM; it's just that we can't throw a lot of processing power at this. We could also store and process the database every night on a VPS somewhere, where it wouldn't matter if the VPS were unbearably slow while the 1.2 GB CSV is being processed. We could even process the file offline on a PC and then remotely update the database every day, except then I still have the problem of zip codes and product listings needing to be cross-referenced.
Consider PostgreSQL, particularly Postgres 9.1, which supports k-nearest-neighbour search queries using GiST indexes.
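As a sketch (table and column names are hypothetical), a KNN-style query in Postgres 9.1+ could use the built-in `point` type with a GiST index and the `<->` distance operator:

```sql
-- Hypothetical schema: one row per product listing, with its
-- location stored as a point (longitude, latitude) and its radius.
CREATE TABLE listings (
    id           serial PRIMARY KEY,
    name         text,
    location     point,
    radius_miles integer
);

CREATE INDEX listings_location_gist ON listings USING gist (location);

-- Nearest 50 listings to a partner's location. Note that '<->' on
-- point gives planar distance; for true great-circle distance you
-- would use the earthdistance or PostGIS extensions instead.
SELECT id, name, location <-> point '(-74.0060, 40.7128)' AS dist
FROM listings
ORDER BY location <-> point '(-74.0060, 40.7128)'
LIMIT 50;
```

The index lets the `ORDER BY ... <->` run as an index scan rather than sorting all 900,000 rows, which matters on modest hardware like the P4 described above.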
Well, that's an interesting problem indeed.
This sounds like it's really two issues: first, how you should index the databases, and second, how you keep them up to date. The first you can achieve as you describe, though normalization may be a problem depending on how you're storing the zip codes. It mostly comes down to what your data looks like.
For the second one, which is more my specialty: you can have the partner upload the CSV as they currently do, keep a copy of yesterday's file, and run the two through a diff utility, or use Perl, PHP, Python, Bash, or whatever other tools you have, to find the lines that have changed. Pass those into a second step that updates your database. I've dealt with clients with problems along these lines, and scripting it away tends to be the best option. If you need help organizing your script, that's always available.
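The diff step described above can be sketched in Python (file paths are placeholders). Because rows are only ever added or removed, never edited, two set differences capture the whole daily change:

```python
def csv_delta(old_path, new_path):
    """Return (added, removed) line sets between two daily CSV dumps.

    This works because rows are only added or removed, never modified,
    so a line-level set difference captures the entire change.
    """
    with open(old_path, encoding="utf-8") as f:
        old_rows = set(f)
    with open(new_path, encoding="utf-8") as f:
        new_rows = set(f)
    return new_rows - old_rows, old_rows - new_rows

# The (typically under 0.5%) added/removed rows can then be turned into
# INSERT/DELETE statements instead of reloading all 900,000 rows.
```

One caveat: holding both 1.2 GB files in memory as sets needs more RAM than the 2 GB server has, so on that box you would instead sort both files and stream them through the Unix `comm` or `diff` utilities, or run this step on the offline machine mentioned in the question.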