I am using WordPress with custom permalinks, and I want to disallow my posts but leave my category pages available to bots. Here are a couple of examples of what the URLs look like:
Category page: somesite us dot com /2010/category-title/
Post: somesite us dot com /2010/category-title/product-title/
So I am wondering whether there is some regex-style way to leave the page at /category-title/ allowed while disallowing anything one level deeper (the second example).
Any ideas? Thanks! :)
Some good info that can help.
There is no official standards body or RFC for the robots.txt protocol. It was created by consensus in June 1994 by members of the robots mailing list. The information specifying which parts of a site should not be accessed is given in a file called robots.txt in the top-level directory of the website. The robots.txt patterns are matched by simple substring comparisons, so care should be taken to make sure that patterns matching directories have the final '/' character appended; otherwise all files with names beginning with that substring will match, rather than just those in the intended directory.
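To see that substring rule in action, Python's standard-library parser (which follows these consensus semantics) can be used as a quick check; example.com and the paths here are just placeholders:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private",  # note: no trailing '/'
])

# Without the trailing slash this is a plain prefix match, so it also
# blocks sibling files that merely share the prefix.
print(rp.can_fetch("*", "http://example.com/private/page.html"))   # False
print(rp.can_fetch("*", "http://example.com/private-notes.html"))  # False (surprise!)
print(rp.can_fetch("*", "http://example.com/public.html"))         # True
```

Writing `Disallow: /private/` instead would block only the directory and leave `/private-notes.html` crawlable.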
There is no 100% sure way to exclude your pages from being found, other than not publishing them at all, of course.
There is no Allow in the standard, and the regex option is not in the standard either.
From the robots.txt standard:
This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:
User-agent: *
Disallow: /~joe/stuff/
Alternatively you can explicitly disallow all disallowed pages:
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
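As a sanity check, the explicit-disallow approach can be verified with Python's standard-library parser (example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

explicit = RobotFileParser()
explicit.parse([
    "User-agent: *",
    "Disallow: /~joe/junk.html",
    "Disallow: /~joe/foo.html",
    "Disallow: /~joe/bar.html",
])

# The listed files are blocked; everything else stays crawlable.
print(explicit.can_fetch("*", "http://example.com/~joe/junk.html"))   # False
print(explicit.can_fetch("*", "http://example.com/~joe/index.html"))  # True
```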
A Potential Solution:
Use .htaccess to disallow search robots from a specific folder while blocking bad robots.
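A minimal sketch of the .htaccess idea, using Apache 2.2-style directives (the user-agent strings "BadBot" and "EvilScraper" are made-up placeholders; robots.txt handles the well-behaved crawlers, while this denies known misbehaving ones outright):

```apache
# Flag requests whose User-Agent matches a known bad bot...
SetEnvIfNoCase User-Agent "BadBot" bad_bot
SetEnvIfNoCase User-Agent "EvilScraper" bad_bot

# ...then deny those requests while allowing everyone else.
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```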
Would the following do the trick?
User-agent: *
Disallow: /2010/category-name/*/
You will need to explicitly allow certain folders under it:
User-agent: *
Disallow: /2010/category-name/
Allow: /2010/category-name/product-name-1/
Allow: /2010/category-name/product-name-2/
But according to this article, the Allow field is not within the standard, so some crawlers may not support it.
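This portability concern is easy to demonstrate with Python's standard-library parser, which implements the strict consensus rules: it applies rules in file order (first match wins, unlike Google's longest-match behavior) and treats `*` as a literal character, not a wildcard. The hostname and paths below are placeholders:

```python
from urllib.robotparser import RobotFileParser

# 1. A first-match-wins parser never reaches the Allow line when the
#    broader Disallow comes first.
first_match = RobotFileParser()
first_match.parse([
    "User-agent: *",
    "Disallow: /2010/category-name/",
    "Allow: /2010/category-name/product-name-1/",
])
print(first_match.can_fetch("*", "http://example.com/2010/category-name/product-name-1/"))
# False: the Disallow matched first, so the Allow was never consulted.

# 2. Putting the more specific Allow first changes the answer.
allow_first = RobotFileParser()
allow_first.parse([
    "User-agent: *",
    "Allow: /2010/category-name/product-name-1/",
    "Disallow: /2010/category-name/",
])
print(allow_first.can_fetch("*", "http://example.com/2010/category-name/product-name-1/"))
# True

# 3. The '*' in the proposed pattern is matched literally by a strict
#    parser, so it blocks nothing at all.
wildcard = RobotFileParser()
wildcard.parse([
    "User-agent: *",
    "Disallow: /2010/category-name/*/",
])
print(wildcard.can_fetch("*", "http://example.com/2010/category-name/product-name-1/"))
# True: nothing was actually disallowed.
```

So both the `Allow` lines and the `*` wildcard may work for major search engines but silently do nothing (or behave differently) for stricter crawlers.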
EDIT: I just found another resource that can be used within each page. This page describes it well:
The basic idea is that if you include a tag like:
<META NAME="ROBOTS" CONTENT="NOINDEX">
in your HTML document, that document will not be indexed.

If you do:
<META NAME="ROBOTS" CONTENT="NOFOLLOW">
the links in that document will not be parsed by the robot.