I downloaded the tika-core and tika-parsers libraries, but I couldn't find any example code for parsing HTML into a string. I need to strip all the HTML tags from the source of a web page. What can I do? Where can I find example code for Apache Tika?
Have a look at the examples; they should help you.
Do you want a plain-text version of an HTML file? If so, all you need is something like:
    InputStream input = new FileInputStream("myfile.html");
    ContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    new HtmlParser().parse(input, handler, metadata, new ParseContext());
    String plainText = handler.toString();
The BodyContentHandler, when created with no constructor arguments or with a character limit, will capture only the text of the HTML body and return it to you.
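For completeness, here is a minimal, self-contained sketch of the same approach with the imports filled in. It assumes tika-core and tika-parsers are on the classpath and that your Tika version still ships org.apache.tika.parser.html.HtmlParser. The class name HtmlToText and the method extractText are made up for illustration. Passing -1 to BodyContentHandler disables the write limit; the no-argument constructor applies a default limit and throws a SAXException once it is exceeded, which can matter for large pages.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of Tika itself.
public class HtmlToText {

    // Parses HTML from the stream and returns only the text of the <body>.
    static String extractText(InputStream input)
            throws IOException, SAXException, TikaException {
        // -1 disables the character limit on the captured text.
        BodyContentHandler handler = new BodyContentHandler(-1);
        new HtmlParser().parse(input, handler, new Metadata(), new ParseContext());
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: HtmlToText <file.html>");
            return;
        }
        // try-with-resources closes the file even if parsing fails.
        try (InputStream input = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(extractText(input));
        }
    }
}
```

If the page comes from the network rather than a file, the same extractText method should work with any InputStream, for example the one returned by URL.openStream().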