We need to extract a tree-like structure from a given text document using Java. The file type should be common and open (rtf, odt, ...). Currently we use Apache Tika to parse plain text from multiple documents.

Which file type and API should we use so that we can parse the correct structure as reliably as possible? If this is possible with Tika, I'd be glad to see any examples.

For example, we should be able to extract this kind of structure from the given document:

Main Heading
  Heading 1
    Heading 1.1
  Heading 2
    Heading 2.2

Main Heading is the title of the paper. The paper has two main headings, Heading 1 and Heading 2, and each of them has one subheading. We should also get the content under each heading (paragraph text).

Any assistance is appreciated.

OpenDocument (.odt) is essentially a zip package that contains multiple XML files. content.xml contains the actual text of the document. We are interested in headings, and they can be found inside text:h tags. Read more about ODT.
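
To make the "zip package" point concrete, here is a minimal sketch that builds a tiny in-memory zip mimicking the layout of an .odt file and lists its entries. The class name OdtPackageSketch, the helper names, and the placeholder XML bodies are made up for illustration; a real .odt contains more entries than these three.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class OdtPackageSketch {

    // Builds a tiny zip in memory that mimics the layout of an .odt package.
    // The entry bodies are placeholders, not real ODF content.
    static byte[] buildFakeOdt() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            addEntry(zip, "mimetype", "application/vnd.oasis.opendocument.text");
            addEntry(zip, "content.xml", "<placeholder/>");
            addEntry(zip, "styles.xml", "<placeholder/>");
        }
        return bytes.toByteArray();
    }

    static void addEntry(ZipOutputStream zip, String name, String body) throws IOException {
        zip.putNextEntry(new ZipEntry(name));
        zip.write(body.getBytes(StandardCharsets.UTF_8));
        zip.closeEntry();
    }

    // Reads the zip back and returns the entry names in stored order.
    static List<String> listEntryNames(byte[] zipBytes) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(listEntryNames(buildFakeOdt()));
    }
}
```

For a real file on disk you would open it with ZipFile instead and look up content.xml directly.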

I found an implementation for extracting headings from .odt files with QueryPath.

Since the original question was about Java, here you go. First we need to get access to content.xml by using ZipFile. Then we use SAX to parse the XML content of content.xml. The sample code simply prints out all the headings:

Test3.odt
content.xml
3764
1 My New Great Paper
2 Abstract
2 Introduction
2 Content
3 More content
3 Even more
2 Conclusions
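
The question asks for a tree rather than a flat list. The level/title lines above already carry enough information to rebuild one. Below is a rough sketch using a stack of currently open headings; the Node class, the artificial level-0 root, and the "<level> <title>" line format are my own assumptions for illustration, and it expects every level to be 1 or higher.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class HeadingTreeSketch {

    // Hypothetical node type: one heading plus its nested subheadings.
    static final class Node {
        final int level;
        final String title;
        final List<Node> children = new ArrayList<>();

        Node(int level, String title) {
            this.level = level;
            this.title = title;
        }
    }

    // Each input line is "<level> <title>", e.g. "2 Abstract". Levels must be >= 1.
    static Node buildTree(List<String> lines) {
        Node root = new Node(0, "(document)"); // artificial root above all headings
        Deque<Node> open = new ArrayDeque<>();
        open.push(root);
        for (String line : lines) {
            int space = line.indexOf(' ');
            int level = Integer.parseInt(line.substring(0, space));
            Node node = new Node(level, line.substring(space + 1));
            // Close headings at the same or a deeper level; the root (level 0) always stays.
            while (open.peek().level >= level) {
                open.pop();
            }
            open.peek().children.add(node);
            open.push(node);
        }
        return root;
    }

    // Appends the tree below the given node, indenting two spaces per depth.
    static void print(Node node, String indent, StringBuilder out) {
        for (Node child : node.children) {
            out.append(indent).append(child.title).append('\n');
            print(child, indent + "  ", out);
        }
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "1 My New Great Paper", "2 Abstract", "2 Introduction", "2 Content",
            "3 More content", "3 Even more", "2 Conclusions");
        StringBuilder out = new StringBuilder();
        print(buildTree(lines), "", out);
        System.out.print(out);
    }
}
```

Paragraph text could be attached in the same way by adding it to whichever node is on top of the stack when the paragraph is read.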

Sample code:

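    import java.io.File;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.XMLReaderFactory;

    public class OdtDocumentStructureExtractor {

        public void printHeadingsOfOdtFile(File odtFile) {
            // try-with-resources closes the zip file even if parsing fails
            try (ZipFile zFile = new ZipFile(odtFile)) {
                System.out.println(zFile.getName());

                // content.xml holds the document text, including the headings
                ZipEntry contentFile = zFile.getEntry("content.xml");
                System.out.println(contentFile.getName());
                System.out.println(contentFile.getSize());

                // Feed content.xml to SAX; the handler prints the headings
                XMLReader xr = XMLReaderFactory.createXMLReader();
                OdtDocumentContentHandler handler = new OdtDocumentContentHandler();
                xr.setContentHandler(handler);
                xr.parse(new InputSource(zFile.getInputStream(contentFile)));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        public static void main(String[] args) {
            new OdtDocumentStructureExtractor().printHeadingsOfOdtFile(new File("Test3.odt"));
        }
    }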

The relevant parts of the ContentHandler used look like this:

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException
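
The method bodies are not shown here, so the following is only a sketch of how such a handler could work, not the original OdtDocumentContentHandler. It assumes a namespace-aware parser, matches text:h by the ODF text namespace URI, reads the text:outline-level attribute, and collects "<level> <title>" strings instead of printing them. The class name HeadingHandlerSketch and the parseHeadings helper are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class HeadingHandlerSketch extends DefaultHandler {

    // Namespace URI that the "text:" prefix is bound to in ODF content.xml.
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    final List<String> headings = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a text:h element
    private String level;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (TEXT_NS.equals(uri) && "h".equals(localName)) {
            level = atts.getValue(TEXT_NS, "outline-level");
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so append rather than overwrite.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (TEXT_NS.equals(uri) && "h".equals(localName)) {
            headings.add(level + " " + current);
            current = null;
        }
    }

    // Hypothetical helper: parses an XML string and returns the collected headings.
    static List<String> parseHeadings(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        SAXParser parser = factory.newSAXParser();
        HeadingHandlerSketch handler = new HeadingHandlerSketch();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.headings;
    }

    public static void main(String[] args) throws Exception {
        // Simplified stand-in for content.xml; a real file has many more elements.
        String xml = "<doc xmlns:text=\"" + TEXT_NS + "\">"
            + "<text:h text:outline-level=\"1\">My New Great Paper</text:h>"
            + "<text:p>Some paragraph text.</text:p>"
            + "<text:h text:outline-level=\"2\">Abstract</text:h>"
            + "</doc>";
        parseHeadings(xml).forEach(System.out::println);
    }
}
```

Matching on the namespace URI rather than the literal "text:h" qName keeps the handler working even if a document binds the namespace to a different prefix.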