I'm storing my sitemaps in my web folder. I would like web spiders (Googlebot etc.) to be able to access the file, but I don't want just anybody to have access to it.
For instance, this site (stackoverflow.com) has a sitemap index, as per its robots.txt file (http://stackoverflow.com/robots.txt).
However, when you type http://stackoverflow.com/sitemap.xml, you're redirected to a 404 page.
How can I implement the same thing on my own website?
I'm running a LAMP website, and I'm using a sitemap index file (so I have multiple sitemaps for the site). I'd like to use the same mechanism to make them unavailable via a browser, as described above.
You could check the User-Agent header the client sends, and only serve the sitemap to known search bots. However, this isn't really safe, since the User-Agent header is easily spoofed.
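As an illustration, a naive check of this kind might look like the following Python sketch (the function name and bot list are my own, and, as noted, this is trivially spoofable):

```python
def looks_like_known_bot(user_agent):
    """Naive, spoofable check: does the User-Agent mention a known crawler?"""
    known = ('googlebot', 'bingbot', 'yandexbot')
    ua = (user_agent or '').lower()
    return any(bot in ua for bot in known)
```

Anyone can send `Googlebot` in their User-Agent header, so this should only ever be a first filter, not the whole access control.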
Stack Overflow presumably checks a couple of things when deciding who gets access to the sitemaps:
- The User-Agent header
- The originating IP address
Both will most likely be matched against a database of known legitimate bots.
The USER_AGENT string is fairly simple to check in a server-side language; it's also very easy to fake. More information:
On how to check the USER_AGENT string: Way to tell bots from human visitors?
For instructions on IP checking: Google Webmaster Central: How to verify Googlebot
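The IP check Google describes is a reverse DNS lookup followed by a forward confirmation. A minimal Python sketch of that procedure (the helper names are my own; the Google hostname suffixes come from Google's verification documentation):

```python
import socket

# Hostnames Google's documentation says its crawlers resolve to.
GOOGLE_SUFFIXES = ('.googlebot.com', '.google.com')

def hostname_is_google(hostname):
    """Return True if a reverse-DNS hostname belongs to a Google domain."""
    return hostname.rstrip('.').lower().endswith(GOOGLE_SUFFIXES)

def is_real_googlebot(ip):
    """Verify a claimed Googlebot IP as Google recommends:
    reverse DNS lookup, check the domain, then forward-confirm
    that the hostname resolves back to the same IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)          # reverse lookup
        if not hostname_is_google(hostname):
            return False
        forward_ips = socket.gethostbyname_ex(hostname)[2]  # forward lookup
        return ip in forward_ips
    except (socket.herror, socket.gaierror):
        return False
```

Because both lookups are under Google's DNS control, this is much harder to spoof than a User-Agent string, at the cost of a DNS round trip (cache the results in practice).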
First, decide which networks you want to receive the real sitemap.
Second, configure your web server to grant requests from those networks for the sitemap file, and configure your web server to redirect all other requests to your 404 error page.
For nginx, you're looking to put something like `allow 10.10.10.0/24;` into a `location` block for the sitemap file.
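Put together, such a `location` block might look like the following sketch (the network range is only an illustration; substitute the ranges you actually trust):

```nginx
location = /sitemap.xml {
    # example network; replace with the ranges you trust
    allow 66.249.64.0/19;
    deny  all;
    # mask the denial as a 404, like stackoverflow.com does
    error_page 403 =404 /404.html;
}
```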
For Apache, you're looking to use mod_authz_host's `Allow` directive inside a `<Files>` directive for the sitemap file.
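For example, an Apache 2.2-style sketch (again, the network range is only an illustration):

```apache
# mod_authz_host, Apache 2.2 syntax; 66.249.64.0/19 is an example range
<Files "sitemap.xml">
    Order deny,allow
    Deny from all
    Allow from 66.249.64.0/19
</Files>
```

On Apache 2.4 and later the equivalent is `Require ip 66.249.64.0/19` from mod_authz_core.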