I've just begun reading through Lucene. In the examples provided, a whole file was put into a single Document before adding that Document to an index.
However, the documentation suggested that this indexing technique would not give good performance. The suggested approach is to put each line of the file in a separate Document.
I was curious to understand how this helps to improve indexing performance.
Also, I wanted to validate my understanding that, to add each line of the file as a Document field, we would have to first tokenize the line to get the tokens and then create a field from them.
Even if you don't take performance into account, these two approaches won't yield the same results. If you have a single document whose first line is "fox" and whose second line is "dog", and you search for "fox" AND "dog", there will be no results with the second approach: each line becomes its own document, so no single document contains both terms.
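A minimal sketch of the difference (assuming a recent Lucene release; the field name `contents` and class name are arbitrary, and `ByteBuffersDirectory` is the in-memory directory in Lucene 8+):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LineVsFileDemo {

    // Index each string as one Document, then count hits for: fox AND dog.
    static int countFoxAndDog(String... docs) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer =
                 new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            for (String text : docs) {
                Document doc = new Document();
                doc.add(new TextField("contents", text, Field.Store.NO));
                writer.addDocument(doc);
            }
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            BooleanQuery query = new BooleanQuery.Builder()
                .add(new TermQuery(new Term("contents", "fox")), BooleanClause.Occur.MUST)
                .add(new TermQuery(new Term("contents", "dog")), BooleanClause.Occur.MUST)
                .build();
            return searcher.search(query, 10).scoreDocs.length;
        }
    }

    public static void main(String[] args) throws Exception {
        // Whole file as one document: both terms co-occur, so the query matches.
        System.out.println(countFoxAndDog("fox\ndog"));   // 1
        // One document per line: no single document holds both terms.
        System.out.println(countFoxAndDog("fox", "dog")); // 0
    }
}
```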
Regarding your second question: no, you don't have to perform any tokenization before creating documents and fields. Tokenization is performed by the analyzer when you call IndexWriter#addDocument(Document).
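For example (a sketch under the same assumptions as above): you hand the raw line to a TextField, and the analyzer configured on the IndexWriter splits it into terms during addDocument. Searching for a single lowercase term afterwards shows the text was indeed tokenized:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class NoPreTokenization {

    // Index one raw line without any manual tokenization,
    // then report whether a term query finds it.
    static boolean indexRawLineAndSearch(String rawLine, String term) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer =
                 new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // Pass the raw, untokenized text straight to the field.
            doc.add(new TextField("contents", rawLine, Field.Store.NO));
            writer.addDocument(doc); // the analyzer tokenizes (and lowercases) here
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TermQuery query = new TermQuery(new Term("contents", term));
            return searcher.search(query, 1).scoreDocs.length > 0;
        }
    }

    public static void main(String[] args) throws Exception {
        // "The quick brown fox" was never tokenized by us, yet "fox" is findable.
        System.out.println(indexRawLineAndSearch("The quick brown fox", "fox")); // true
    }
}
```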
If you're getting started with Lucene, I recommend you read the demo code. It shows how to create and then search a Lucene index.
And if indexing speed is critical for the application you're developing, there is excellent advice on the Lucene wiki.