I'm an SEO working for a flight ticket booking company. We are trying to get an XML sitemap set up for the site. I asked the development team at my company to set up a Perl script that can generate an XML sitemap for this huge site (more than 1.5 lakh, i.e. 150,000, pages).
We used the Google Perl Sitemap Generator for this; for various reasons we can only use Perl. The output file had a lot of garbage, because it mainly indexed static pages and other content sitting in the server folders (it essentially did not follow the URLs from the home page down through the site, but indexed every file on the server). I'm not sure the terminology is correct, but I think you get my point.
The configuration options are described in the link above, but we have not been able to figure out which parameters to use to get a clean XML sitemap without the unnecessary URLs.
Could anybody help with the Perl script, or with how to configure it? If any more detail is needed, I'll be more than glad to provide it.
Make a copy of the site with `wget` (mirror option) and build a sitemap from that.
Have a look here, it has the code: http://www.isrcomputing.com/knowledge-base/linux-tips/240-how-to-create-google-sitemap-using-perl.html
Perhaps I'm naive, but couldn't you do a BFS of HTTP GET requests on links starting from the root, parsing out each `<a href>`?
Perl supports that pretty well.