I am using Nutch 1.2. After I run the crawl command like so:
```
bin/nutch crawl urls -dir crawl -depth 2 -topN 1000
Injector: starting at 2011-07-11 12:18:37
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-07-11 12:18:44, elapsed: 00:00:07
Generator: starting at 2011-07-11 12:18:45
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 1000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl
```
However, it keeps complaining: **No URLs to fetch - check your seed list and URL filters.**
I have a list of URLs to crawl in the nutch_root/urls/nutch file, and my crawl-urlfilter.txt is also set up.
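(For reference, the seed file is just plain text with one URL per line, along these lines; the domain below is a placeholder, not my real list:)

```
http://www.MY.DOMAIN.NAME/
```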
Why would it complain about my URL list and filters? It never did this before.
Here's my crawl-urlfilter.txt
```
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*184.108.40.206/
+^http://([a-z0-9]*\.)*220.127.116.11/

# skip everything else
-.
```
Your URL filter rules look strange, and I don't think they match valid URLs; something like this should work better, no?
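For example, a minimal accept rule along these lines (a sketch only, with MY.DOMAIN.NAME as a placeholder for your real domain and the dots escaped so the regex matches literally):

```
# accept anything under your own domain
+^http://([a-z0-9]*\.)*MY\.DOMAIN\.NAME/

# skip everything else
-.
```

Also double-check that the URLs in your seed file actually pass the accept rule (same host, http:// prefix, trailing slash if the rule expects one). If the injected seeds are filtered out, the generator selects 0 records and you get exactly the "No URLs to fetch" message you're seeing.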