I'm trying to parse a plain text file using Tika but I'm getting erratic behavior.

More specifically, I've defined a simple handler as follows:

public class MyHandler extends DefaultHandler {
    public void characters(char ch[], int start, int length) throws SAXException {
        System.out.println(new String(ch));
    }
}

Then, I parse the file ("myfile.txt") as follows:

Tika tika = new Tika();
InputStream is = new FileInputStream("myfile.txt");

Metadata metadata = new Metadata();
ContentHandler handler = new MyHandler();

Parser parser = new TXTParser();
ParseContext context = new ParseContext();

String mimeType = tika.detect(is);
metadata.set(HttpHeaders.CONTENT_TYPE, mimeType);

parser.parse(is, handler, metadata, context);

I'd expect all of the text in the file to be printed on screen, but a small part at the end isn't. More specifically, the characters() callback keeps receiving 4,096 characters per call, but it apparently leaves out the last 5,083 characters of this particular file (which is a couple of megabytes long), so it is missing even more than the final callback.

Also, when testing on another, smaller file that is about 5,000 characters long, no callback seems to occur at all!

The MIME type is correctly detected as text/plain in both cases.

Any ideas?


What version of Tika are you using? Looking at the source code, it reads chunks of 4096 characters, which can be seen on line 129 of TXTParser. At line 132 the characters(...) method is invoked.

In a nutshell, the relevant code is:

   char[] buffer = new char[4096];
   int n = reader.read(buffer);
   while (n != -1) {
       xhtml.characters(buffer, 0, n);
       n = reader.read(buffer);
   }

where reader is a BufferedReader. I can't find any flaw in this code, so I'm thinking you may be using an older version?
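
For what it's worth, here is a minimal sketch of that read loop that runs without Tika, in case it helps narrow things down. It only uses the JDK's SAX classes; the names ChunkedReadSketch, CollectingHandler, pump and run are made up for illustration and are not part of Tika. The handler appends only ch[start .. start+length), since under the SAX contract the rest of the array is not guaranteed to hold meaningful data.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ChunkedReadSketch {

    /** Hypothetical handler: collects text, honoring the start/length window of each callback. */
    static class CollectingHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        int callbacks = 0;

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            // Only ch[start .. start+length) is valid for this call.
            text.append(ch, start, length);
            callbacks++;
        }
    }

    /** Same shape as the loop quoted above: read up to 4096 chars, forward them, repeat. */
    static void pump(Reader reader, ContentHandler handler) throws IOException, SAXException {
        char[] buffer = new char[4096];
        int n = reader.read(buffer);
        while (n != -1) {
            handler.characters(buffer, 0, n);
            n = reader.read(buffer);
        }
    }

    /** Hypothetical helper: runs the loop over an in-memory string and returns the handler. */
    static CollectingHandler run(String input) throws IOException, SAXException {
        CollectingHandler handler = new CollectingHandler();
        pump(new StringReader(input), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        // Build a 5,000-character input, roughly the size of the small file in the question.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 5000; i++) {
            sb.append('x');
        }
        CollectingHandler handler = run(sb.toString());
        System.out.println(handler.text.length() + " chars in " + handler.callbacks + " callbacks");
    }
}
```

If a loop like this delivers every character on your machine, the missing text is more likely lost before the parser ever sees the stream than inside the loop itself.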

I'm using version 0.9, the most recent.

I've also looked at the source code you pointed out, but it still doesn't work for me, as I described already. Very strange.