I have a large PDF file (200,000 KB or more) which consists of a series of pages containing nothing but tables. I'd like to somehow parse this information using Ruby and import the resulting data into a MySQL database.

Does anyone know of any methods for pulling this data out of the PDF? The data is formatted in the following manner:

Title | Address | Cash Reported | Year Reported | Holder Title

Sometimes the Title field overflows into the Address field, in which case the remaining columns are displayed on the following line.

Because of the irregular format, I've been stuck on figuring this out. At the very least, could anyone point me to a Ruby PDF library for this task?

UPDATE: I accidentally provided incorrect information! The actual size of the file is 300 MB, or 300,000 KB. I made the change above to reflect this.

At the very least, could anyone point me to a Ruby PDF library for this task?

If you haven't done so already, you can check these two previous questions: "Ruby: Reading PDF files," and "ruby pdf parsing gem/library." PDF::Reader, PDF::Toolkit, and Docsplit are some of the more popular suggested libraries. There is even a suggestion of using JRuby and some Java PDF parsing library.

I'm not sure whether any of these solutions is actually suitable for your problem, especially since you're dealing with such huge PDF files. So unless someone offers a more informative answer, perhaps you should pick a library or two and take them for a test drive.

I assume you can copy'n'paste text snippets without problems when your PDF is opened in Acrobat Reader or some other PDF viewer?

Before trying to parse and extract text from such monster files programmatically (even if it's only 200 MByte, which is huuuuge for simple text in tables, unless you have 200000 pages...), I would proceed like this:

  1. Try to sanitize the file first by re-distilling it.
  2. Try different CLI tools to extract the text into a .txt file.

This is a matter of a few minutes. Writing a Ruby program to do this is certainly a matter of hours, days or weeks (depending on your knowledge of the PDF file format internals... I suspect you don't have much experience with that yet).

If "2." works, you may be halfway done already. If it works, you also know that doing it programmatically with Ruby is a job that can in principle be solved. If "2." doesn't work, you know it may be extremely hard to achieve programmatically.
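If step 2 does produce usable text, the remaining Ruby work is mostly line parsing. The following is only a minimal sketch under assumptions that are not confirmed by the question: that the columns in the extracted text are separated by runs of two or more spaces, that every record has exactly the five fields listed in the question, and that an overflowing Title sits alone on one line with the remaining columns on the next. The names `FIELDS`, `split_cells` and `records` are made up for illustration.

```ruby
# Hypothetical column names, taken from the header row in the question.
FIELDS = %i[title address cash_reported year_reported holder_title].freeze

# Split one layout-preserved line into cells on runs of two or more
# whitespace characters (an assumption about the extracted text).
def split_cells(line)
  line.strip.split(/\s{2,}/)
end

# Return a lazy Enumerator of Hashes, one per record. Cells are buffered
# across lines, so a row whose Title overflowed is completed by the line
# that follows it. Extra cells beyond the fifth are dropped.
def records(lines)
  Enumerator.new do |out|
    buffer = []
    lines.each do |line|
      cells = split_cells(line)
      next if cells.empty?                 # skip blank lines
      buffer.concat(cells)
      next if buffer.size < FIELDS.size    # row continues on the next line
      out << FIELDS.zip(buffer.first(FIELDS.size)).to_h
      buffer.clear
    end
  end
end
```

Calling `records(File.foreach("some-extracted-text.txt"))` reads the file line by line, so the whole text never has to sit in memory at once. Each resulting Hash could then be handed to whatever MySQL client you choose. Real data will almost certainly need extra rules (header lines repeated per page, empty cells, single-spaced columns), so treat this as a starting point only.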

Sanitize the 'Monster.pdf':

I suggest using Ghostscript. You can also use Adobe Acrobat Distiller if you have access to it.

gswin32c.exe ^
  -o Monster-PDF-sanitized.pdf ^
  -sDEVICE=pdfwrite ^
  -f Monster.pdf

(I'm curious how much that single command will make your output PDF shrink compared to the input.)

Extract text from PDF:

I suggest trying pdftotext.exe first (from the XPDF folks). There are other, somewhat more inconvenient methods available too, but this might already do the job:

pdftotext.exe ^
   -f 1 ^
   -l 10 ^
   -layout ^
   -eol dos ^
   -enc Latin1 ^
   -nopgbrk ^
   Monster-PDF-sanitized.pdf ^
   first-10-pages-from-Monster-PDF-sanitized.txt

This will not extract all pages, only pages 1-10 (as a proof of concept, to see whether it works at all). To extract every page, just leave off the -f 1 -l 10 parameters. You may need to tweak the encoding by changing the parameter to -enc ASCII7 (or UTF-8, UCS-2).

If this doesn't work the quick'n'easy way (because, as sometimes happens, some font in the original PDF uses a "custom encoding vector"), you should ask a new question describing the details of your findings so far. Then you'll need to resort to bigger calibres to shoot down the problem.
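If you later want to drive the same command from Ruby (for example to extract a file this large in page ranges), a small wrapper could look like the sketch below. It assumes the executable is reachable on your PATH as `pdftotext`; the helper name `pdftotext_command` and its keyword arguments are hypothetical, and only the flags already shown above are used.

```ruby
# Build the argument list for pdftotext, mirroring the flags used above.
# first/last are optional; leaving both out extracts every page.
def pdftotext_command(pdf, txt, first: nil, last: nil, enc: "Latin1")
  cmd = ["pdftotext"]
  cmd += ["-f", first.to_s] if first
  cmd += ["-l", last.to_s] if last
  cmd + ["-layout", "-eol", "dos", "-enc", enc, "-nopgbrk", pdf, txt]
end

# Run the command only when this file is called as a script with two
# arguments, e.g.: ruby extract.rb Monster-PDF-sanitized.pdf out.txt
if __FILE__ == $PROGRAM_NAME && ARGV.size == 2
  ok = system(*pdftotext_command(ARGV[0], ARGV[1], first: 1, last: 10))
  abort "pdftotext failed or is not on PATH" unless ok
end
```

Passing the arguments to `system` as separate strings avoids shell quoting problems with file names that contain spaces.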