I need advice on parsing a large text file, about 6GB in size.

What I did was download my Gmail using Thunderbird. Now I have an mbox file with all of my email in it. This is a text file, 6GB in size.

I need to parse this file and extract specific data that follows a particular pattern.

First question: what language should I use? I have looked at other threads like this one and gather that Perl or Python (and one or two others) would be fine.

Second question: I read in the posted replies that it may be easier to load the text file into a database and let the database sort through it?

I need a CSV file produced as the output.

So... would it be smarter for me to go the DB route?

Third question: How long is a piece of string... erm, I mean... how long will it take to go through my 6GB file? OK, hard to answer without some details!

I need to extract the following data:

First Name: 
Last Name: 

Address:

Telephone: 
Mobile:
Email:

So... I need to know whether I will have to run the script and leave my machine running overnight. I am not sure whether the above is a really dumb question or not, but I decided to ask anyway.

ANY replies would be great.

Thanks

Omar

  1. Use whichever language you are most familiar with. In terms of performance, Perl programs can generally parse text data faster than Python.

  2. You have to parse the data regardless of whether you use a database or not. If you are going to be doing a lot of queries/searches later on, then you should consider loading the parsed data into a database.

  3. It depends on how complex the pattern you are trying to match is. Probably no more than an hour.
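
To make points 2 and 3 concrete, here is a rough Python sketch of a streaming parse, so the 6GB file is never loaded into memory at once. It assumes each field sits on its own "Label: value" line in the message body, and that a repeated label or an mbox "From " separator line marks the start of a new record. Neither assumption comes from the question, so adjust the pattern to the real messages. The names extract_records, write_csv, load_sqlite and the contacts table are made up for illustration. Bodies that are base64 or quoted-printable encoded will not match and need decoding first.

```python
import csv
import re
import sqlite3
import sys

# Assumed layout: every field is on its own "Label: value" line.
FIELDS = ["First Name", "Last Name", "Address", "Telephone", "Mobile", "Email"]
FIELD_RE = re.compile(r"^(%s):\s*(.*)$" % "|".join(re.escape(f) for f in FIELDS))


def extract_records(lines):
    """Yield one dict per record, consuming the input one line at a time."""
    current = {}
    for line in lines:
        line = line.rstrip("\r\n")
        # In mbox format a line starting with "From " begins a new message.
        if line.startswith("From "):
            if current:
                yield current
                current = {}
            continue
        match = FIELD_RE.match(line)
        if not match:
            continue
        label, value = match.groups()
        # Assumption: seeing a label twice means a new record has started.
        if label in current:
            yield current
            current = {}
        current[label] = value.strip()
    if current:
        yield current


def write_csv(records, out_file):
    """Write records as CSV rows; missing fields become empty cells."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDS, restval="")
    writer.writeheader()
    count = 0
    for record in records:
        writer.writerow(record)
        count += 1
    return count


def load_sqlite(records, conn):
    """Optional: load the same records into a (hypothetical) contacts table."""
    columns = ", ".join('"%s" TEXT' % f for f in FIELDS)
    conn.execute("CREATE TABLE IF NOT EXISTS contacts (%s)" % columns)
    placeholders = ", ".join("?" for _ in FIELDS)
    conn.executemany(
        "INSERT INTO contacts VALUES (%s)" % placeholders,
        ([r.get(f, "") for f in FIELDS] for r in records),
    )
    conn.commit()


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python extract.py input.mbox output.csv
    # errors="replace" keeps going when the file mixes character encodings.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as src, \
            open(sys.argv[2], "w", newline="", encoding="utf-8") as dst:
        print(write_csv(extract_records(src), dst), "records written")
```

If a CSV is all that is needed, load_sqlite can be skipped. For a time estimate, run the script on the first few hundred MB and scale up, since the speed depends mostly on the disk and on how complex the pattern is.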

You can use the Aperture project to query the mbox contents:

http://aperture.sourceforge.net/
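
If you would rather stay in Python, the standard library's mailbox module can also walk an mbox file message by message. This is a different tool from Aperture, shown only as a sketch. The helper name iter_text_bodies is made up for illustration. It yields the decoded text/plain parts, which you could then run your pattern against.

```python
import mailbox


def iter_text_bodies(mbox_path):
    """Yield (subject, text) for every text/plain part in the mbox file."""
    # create=False: do not create an empty file if the path is wrong.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            subject = str(message.get("Subject", ""))
            for part in message.walk():
                if part.get_content_type() != "text/plain":
                    continue
                # decode=True undoes base64 / quoted-printable transfer encoding.
                payload = part.get_payload(decode=True)
                if payload is None:
                    continue
                charset = part.get_content_charset() or "utf-8"
                try:
                    text = payload.decode(charset, errors="replace")
                except LookupError:
                    # Unknown charset name in the message; fall back to UTF-8.
                    text = payload.decode("utf-8", errors="replace")
                yield subject, text
    finally:
        box.close()
```

The module indexes message offsets on first access and reads messages one at a time, so it should cope with a large file, but I have not timed it on 6GB.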