I have a file containing 250 million website URLs, each with an IP address, page title, country name, server banner (e.g. "Apache"), response time (in ms), number of images and so on. At the moment, these records are in a 25 GB flat file.
I'm interested in generating various statistics from this file, for example:
- number of IP addresses represented per country
- average response time per country
- number of images vs. response time
and so on.
My question is: how would you achieve this type and scale of processing in a reasonable time, and what platform and tools would you use?
I'm open to all suggestions, from MS SQL on Windows to Ruby on Solaris :-) Bonus points for DRY (don't repeat yourself): I'd prefer not to write a new program each time a different cut is required.
Any comments on what works, and what is to be avoided, would be greatly appreciated.
Step 1: get the data into a DBMS that can handle the volume of data. Index appropriately.
Step 2: use SQL queries to determine the values of interest.
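To make the two steps concrete, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in for whichever DBMS you pick, and the tab-separated layout, the column order, and the names `sites`, `load`, `ips_per_country` and `avg_response_per_country` are all assumptions for illustration, since the question does not describe the file format. "IP addresses per country" is interpreted here as a count of distinct IPs.

```python
import csv
import sqlite3

# Hypothetical column order for the flat file; adjust to the real layout.
COLUMNS = ("url", "ip", "title", "country", "server", "response_ms", "images")
INSERT_SQL = "INSERT INTO sites VALUES (?, ?, ?, ?, ?, ?, ?)"


def load(conn, lines, batch_size=50_000):
    """Step 1: bulk-load tab-separated records into one table, then index it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sites ("
        "url TEXT, ip TEXT, title TEXT, country TEXT, "
        "server TEXT, response_ms INTEGER, images INTEGER)"
    )
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    batch = []
    for row in reader:
        if len(row) != len(COLUMNS):
            continue  # skip malformed lines instead of aborting the load
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany(INSERT_SQL, batch)
            batch.clear()
    if batch:
        conn.executemany(INSERT_SQL, batch)
    # Build the index once, after loading, rather than maintaining it per insert.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sites_country ON sites(country)")
    conn.commit()


def ips_per_country(conn):
    """Step 2: number of distinct IP addresses per country."""
    return conn.execute(
        "SELECT country, COUNT(DISTINCT ip) FROM sites "
        "GROUP BY country ORDER BY country"
    ).fetchall()


def avg_response_per_country(conn):
    """Step 2: average response time (ms) per country."""
    return conn.execute(
        "SELECT country, AVG(response_ms) FROM sites "
        "GROUP BY country ORDER BY country"
    ).fetchall()
```

In use, you would open the real file (for example with `open(path, newline="", encoding="utf-8")`, where `path` is wherever your file lives) and pass the file object to `load`. Most server DBMSs also ship a dedicated bulk loader, which is likely to be faster than row inserts at this scale.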
You will still need to write a new query for each separate question you want answered. However, I think that's unavoidable. It does save you from repeating the rest of the work (parsing and loading the file) for every question.
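One way to keep the per-question work small is a single helper that builds the GROUP BY query from a whitelist of columns and aggregates, so a new cut is one function call. The following is only a sketch, again with sqlite3 as a stand-in; the `sites` table, its column names, and the helpers `stat_by` and `make_demo_db` are hypothetical.

```python
import sqlite3

# Hypothetical schema: one `sites` table with these columns.
GROUP_COLUMNS = {"country", "server", "images"}
AGGREGATES = {
    "count": "COUNT(*)",
    "distinct_ips": "COUNT(DISTINCT ip)",
    "avg_response_ms": "AVG(response_ms)",
}


def stat_by(conn, group_col, aggregate):
    """Return (group value, aggregate value) rows for one cut of the data.

    Column and aggregate names are checked against whitelists because SQL
    identifiers cannot be bound as query parameters.
    """
    if group_col not in GROUP_COLUMNS:
        raise ValueError(f"unknown grouping column: {group_col!r}")
    if aggregate not in AGGREGATES:
        raise ValueError(f"unknown aggregate: {aggregate!r}")
    sql = (
        f"SELECT {group_col}, {AGGREGATES[aggregate]} FROM sites "
        f"GROUP BY {group_col} ORDER BY {group_col}"
    )
    return conn.execute(sql).fetchall()


def make_demo_db():
    """Build a tiny in-memory table so the helper can be tried out."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sites (ip TEXT, country TEXT, server TEXT, "
        "response_ms INTEGER, images INTEGER)"
    )
    conn.executemany(
        "INSERT INTO sites VALUES (?, ?, ?, ?, ?)",
        [
            ("192.0.2.1", "US", "Apache", 100, 3),
            ("192.0.2.2", "US", "nginx", 200, 3),
            ("198.51.100.7", "DE", "Apache", 50, 1),
        ],
    )
    conn.commit()
    return conn
```

With this, `stat_by(conn, "country", "avg_response_ms")` covers average response time per country, and `stat_by(conn, "images", "avg_response_ms")` gives one simple view of images vs. response time. Anything more elaborate is still a new query, but not a new program.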
Note that although you can probably do a simple upload into a single table, you may well get better performance from the queries if you normalize the data after loading it into the single table. This isn't completely trivial, but it will likely reduce the volume of data, because repeated strings such as country names and server banners are stored only once. Making sure you have a good procedure (which will probably not be a stored procedure) for normalizing the data will help.
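As a rough illustration of that normalization, the sketch below moves the repeated country and server strings into lookup tables and keeps only integer keys in the main table. It uses sqlite3 as a stand-in, and the staging table `sites_raw`, the other table and column names, and the functions `normalize` and `avg_response_per_country` are hypothetical. It assumes every row has a non-NULL country and server, since the inner joins would otherwise drop those rows.

```python
import sqlite3


def normalize(conn):
    """Split the single staging table `sites_raw` into lookup tables plus facts."""
    conn.executescript(
        """
        CREATE TABLE countries (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE servers (id INTEGER PRIMARY KEY, banner TEXT UNIQUE);

        -- Each distinct string is stored once and referenced by id afterwards.
        INSERT INTO countries (name) SELECT DISTINCT country FROM sites_raw;
        INSERT INTO servers (banner) SELECT DISTINCT server FROM sites_raw;

        CREATE TABLE sites (
            url TEXT, ip TEXT, title TEXT,
            country_id INTEGER REFERENCES countries(id),
            server_id INTEGER REFERENCES servers(id),
            response_ms INTEGER, images INTEGER
        );

        -- Assumes no NULL country/server: inner joins would drop such rows.
        INSERT INTO sites
        SELECT r.url, r.ip, r.title, c.id, s.id, r.response_ms, r.images
        FROM sites_raw AS r
        JOIN countries AS c ON c.name = r.country
        JOIN servers AS s ON s.banner = r.server;

        CREATE INDEX idx_sites_country_id ON sites(country_id);
        """
    )


def avg_response_per_country(conn):
    """Same statistic as before, now via a join on the small lookup table."""
    return conn.execute(
        "SELECT c.name, AVG(s.response_ms) FROM sites AS s "
        "JOIN countries AS c ON c.id = s.country_id "
        "GROUP BY c.name ORDER BY c.name"
    ).fetchall()
```

Once the row counts in `sites` and `sites_raw` match, the staging table can be dropped to reclaim the space.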
Load the data into a table in a SQL Server (or any other mainstream DB) database, and then write queries to generate the data you need. You wouldn't need any tools other than the database itself and whatever UI is used to interact with the data (e.g. SQL Server Management Studio for SQL Server, TOAD or SqlDeveloper for Oracle, etc.).