I'm using PDFBox in Java to extract text from PDF files. Some of the input files provided are not valid, and PDFTextStripper halts on them. Is there a clean way to check whether a provided file is actually a valid PDF?
Here's what I use in my NUnit tests, which have to validate multiple versions of PDF generated using Crystal Reports:
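public static void CheckIsPDF(byte[] data)
{
    Assert.IsNotNull(data);
    Assert.Greater(data.Length, 7);
    // header must be "%PDF-"
    Assert.AreEqual(0x25, data[0]); // %
    Assert.AreEqual(0x50, data[1]); // P
    Assert.AreEqual(0x44, data[2]); // D
    Assert.AreEqual(0x46, data[3]); // F
    Assert.AreEqual(0x2D, data[4]); // -
    if (data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x33) return; // version is 1.3 ?
    if (data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x34) return; // version is 1.4 ?
    Assert.Fail("Unsupported file format");
}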
Since you are using PDFBox, you can simply load the file with PDDocument.load(file).
It will fail with an exception if the PDF is corrupted, etc.
If loading succeeds, you can also check whether the PDF is encrypted using PDDocument's isEncrypted() method.
PDF files begin with "%PDF" (open one in TextPad or similar and have a look).
Any reason you can't just read the file with a StringReader and look for this?
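A minimal sketch of that header check in plain Java, reading raw bytes rather than going through a Reader, since a PDF is a binary file. The class and method names (PdfHeaderCheck, hasPdfHeader, fileHasPdfHeader) are made up for illustration and are not part of PDFBox. It only proves the file claims to be a PDF; a truncated or corrupted file will still pass.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of PDFBox.
final class PdfHeaderCheck {
    // Every PDF is expected to start with these five ASCII bytes.
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given bytes start with the "%PDF-" signature. */
    static boolean hasPdfHeader(byte[] head) {
        if (head == null || head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads only the first few bytes of the file and checks the signature. */
    static boolean fileHasPdfHeader(Path file) throws IOException {
        byte[] head = new byte[MAGIC.length];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (filled < head.length) {
                int n = in.read(head, filled, head.length - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
        }
        return filled == head.length && hasPdfHeader(head);
    }
}
```

Running this first is cheap, so it can be used to skip obviously wrong files before handing the rest to PDFTextStripper.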
What exactly do you mean by a valid PDF? Beyond the header, it must also have a valid cross-reference (xref) table that correctly points to all the objects within the file.
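As a rough illustration of that point, here is a sketch that looks at the end of the file for "startxref", the offset that follows it, and the closing "%%EOF", then checks that the offset really lands on something that looks like a cross-reference section. PdfTrailerCheck and hasPlausibleXref are hypothetical names, and the 1024-byte tail window is an assumption. It does not follow the individual object offsets inside the table, so treat it as a plausibility check and not as full validation.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
final class PdfTrailerCheck {
    // "startxref", the byte offset of the cross-reference section, then "%%EOF" at the very end.
    private static final Pattern TAIL =
            Pattern.compile("startxref\\s+(\\d+)\\s+%%EOF\\s*$");
    // An xref stream begins with an object header such as "12 0 obj".
    private static final Pattern OBJ_HEADER =
            Pattern.compile("\\d+\\s+\\d+\\s+obj");

    static boolean hasPlausibleXref(byte[] pdf) {
        if (pdf == null) {
            return false;
        }
        // ISO-8859-1 maps each byte to one char, so string indices equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        // Assumption: the trailer keywords sit within the last 1024 bytes.
        String tail = text.substring(Math.max(0, text.length() - 1024));
        Matcher m = TAIL.matcher(tail);
        if (!m.find()) {
            return false;
        }
        long offset;
        try {
            offset = Long.parseLong(m.group(1));
        } catch (NumberFormatException e) {
            return false; // absurdly long digit run
        }
        if (offset >= text.length()) {
            return false; // offset points past the end of the file
        }
        String atOffset = text.substring((int) offset);
        // Either a classic "xref" table or the object header of an xref stream.
        return atOffset.startsWith("xref") || OBJ_HEADER.matcher(atOffset).lookingAt();
    }
}
```

A file that passes the "%PDF" header test but was cut off during upload would typically fail here, because the "%%EOF" marker or the xref offset is missing.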
You can detect the MIME type of a file (or byte array), so you don't blindly depend on the extension. I do it with Aperture's MimeExtractor (http://aperture.sourceforge.net/), and a few days ago I saw a library just for that (http://sourceforge.net/projects/mime-util).
I use Aperture to extract text from a variety of files, not just PDF, but I had to tweak things for ebooks, for instance (Aperture uses PDFBox, but I added another library as a fallback for when PDFBox fails).