I need to generate a graphical sitemap for my website. There are two stages, as far as I can tell:

  1. crawl the website and analyse the link relationships to extract the tree structure
  2. generate an aesthetically pleasing render of the tree

Does anybody have advice or experience with this, or know of existing work I can build on (ideally in Python)?

I found some nice CSS for rendering the tree, but it only works for 3 levels.


The only automatic way to produce a sitemap is to know the structure of the site and write a program that builds on that knowledge. Just following the links will not usually work, because links can run between any two pages, so you get a graph (i.e. connections between nodes). There is no way to convert a graph into a tree in the general case.

So you must identify the structure of the tree yourself and then crawl the relevant pages to get the titles of the pages.
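
One way to "identify the structure yourself" is to take it from the URL paths. The sketch below assumes the site's directory layout mirrors its hierarchy, which will not be true of every site; `build_tree` and the example.com URLs are made up for illustration.

```python
from urllib.parse import urlparse


def build_tree(urls):
    """Nest URLs by path segment: {"segment": {child segments...}}."""
    tree = {}
    for url in urls:
        node = tree
        # "/docs/api.html" -> ["docs", "api.html"]
        for segment in urlparse(url).path.strip("/").split("/"):
            if segment:  # the root URL "/" yields one empty segment; skip it
                node = node.setdefault(segment, {})
    return tree


pages = [
    "http://example.com/",
    "http://example.com/docs/",
    "http://example.com/docs/api.html",
    "http://example.com/blog/2009/first-post.html",
]
tree = build_tree(pages)
```

Because the tree comes from the paths rather than from the links, cross-links between pages cannot introduce cycles.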

As for the CSS only working for 3 levels: three levels is more than enough. If you try to create more levels, your sitemap will become unusable (too big, too wide). No one will want to download a 1MB sitemap and then scroll through 100'000 pages of links. If your site grows that big, then you must implement some kind of search.
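
If the tree does go deeper, the renderer can simply stop at three levels. A minimal sketch, assuming the tree is held as nested dicts (a hypothetical representation; `render` and the sample data are illustrative only):

```python
def render(tree, max_depth=3, depth=0):
    """Return indented lines for `tree`, ignoring anything below max_depth."""
    if depth >= max_depth:
        return []
    lines = []
    for name in sorted(tree):
        lines.append("  " * depth + name)
        lines.extend(render(tree[name], max_depth, depth + 1))
    return lines


site = {
    "blog": {"2009": {"june": {"first-post.html": {}}}},
    "docs": {"api.html": {}},
}
lines = render(site)  # "first-post.html" sits at level 4 and is left out
```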

Here is a Python web crawler, which should make a good starting point. Your general strategy is this:

  • you need to take care that outbound links are never followed, including links on the same domain but higher up than your starting point.
  • as you spider the site, collect a hash of page URLs mapped to a list of all the internal URLs included in each page.
  • take a pass over this list, assigning a token to each unique URL.
  • use your hash to generate a Graphviz file that will lay out a graph for you
  • convert the Graphviz output into an imagemap where each node links to the corresponding web page
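
The steps above can be sketched roughly as follows, using only the standard library. This is not the crawler referred to above, just an illustration: `LinkParser`, `fetch`, `crawl` and `to_dot` are made-up names, "internal" is approximated by a plain prefix test on the start URL (which should end in a slash), and there is no error handling, politeness delay or check that a link points to HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def fetch(url):
    """Download a page as text (no error handling, for brevity)."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start, fetch=fetch):
    """Map each page URL under `start` to the internal URLs it links to."""
    links, queue = {}, [start]
    while queue:
        url = queue.pop()
        if url in links:
            continue  # already visited
        parser = LinkParser()
        parser.feed(fetch(url))
        internal = []
        for href in parser.hrefs:
            target = urldefrag(urljoin(url, href)).url
            # Prefix test: drops other domains and pages above the start point.
            if target.startswith(start) and target not in internal:
                internal.append(target)
        links[url] = internal
        queue.extend(internal)
    return links


def to_dot(links):
    """Assign a short token to each unique URL and emit Graphviz DOT text."""
    tokens = {}
    for url in links:
        tokens.setdefault(url, "n%d" % len(tokens))
    out = ["digraph sitemap {"]
    for url, token in tokens.items():
        # URLs containing double quotes would need escaping here.
        out.append('  %s [label="%s", URL="%s"];' % (token, url, url))
    for url, targets in links.items():
        for target in targets:
            out.append("  %s -> %s;" % (tokens[url], tokens[target]))
    out.append("}")
    return "\n".join(out)
```

Writing the result of `to_dot(crawl("http://example.com/docs/"))` to `sitemap.dot` and running something like `dot -Tpng -o sitemap.png -Tcmapx -o sitemap.map sitemap.dot` should give an image plus a client-side imagemap built from the `URL` attributes.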

The reason you need to do all this is, as leonm noted, that websites are graphs, not trees, and laying out graphs is a harder problem than you can solve with a simple bit of JavaScript and CSS. Graphviz is good at what it does.

Please see http://aaron.oirt.rutgers.edu/myapp/paperwork/W1100_2200.TreeView for how to format tree views. You could also probably modify the example application http://aaron.oirt.rutgers.edu/myapp/DirectoryTree/index to scrape your pages if they are organized as directories of HTML files.