I've got a Joomla-based news website which has a lot of useless pages turning up in search engine indices. At least as a quick fix, until I'm able to look at fixing the site properly, I want to implement a NOINDEX, FOLLOW meta tag on all pages except the home page and article pages that end in .html.

Working off various snippets of code found here and elsewhere, I've come up with this:

<?php
if ((JRequest::getVar('view') == "frontpage" ) || ($_SERVER['REQUEST_URI']=='*.html' ))    {
echo "<meta name=\"robots\" content=\"index,follow\"/>\n";
} else {
echo "<meta name=\"robots\" content=\"noindex,follow\"/>\n";
}
?> 

I'm still very new to PHP programming and I'm sure I'm bound to make a few mistakes, so I'm wondering whether some kind soul would be able to give my code the once-over and tell me whether it's OK to use before I accidentally nuke my website.

Thanks,

Tom

Wouldn't it be easier to use the robots.txt file for this?

Some major crawlers support an Allow directive, which can counteract a preceding Disallow directive. This is useful when one disallows a whole directory but still wants some HTML documents in that directory crawled and indexed. While by the standard implementation the first matching robots.txt pattern always wins, Google's implementation differs in that Allow patterns with an equal or greater number of characters in the directive path win over a matching Disallow pattern. Bing uses whichever of the Allow or Disallow directives is the most specific.

In order to be compatible with all robots, if one wants to allow single files inside an otherwise disallowed directory, it is necessary to place the Allow directive(s) first, followed by the Disallow, for example:

Allow: /folder1/myfile.html
Disallow: /folder1/

This example will Disallow anything in /folder1/ except /folder1/myfile.html, since the latter will match first. In the case of Google, though, the order doesn't matter.
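Putting the pieces together, a complete minimal robots.txt following that ordering rule would look something like this (the User-agent line applies the rules to all crawlers; the paths are just the illustrative ones from above):

```
User-agent: *
Allow: /folder1/myfile.html
Disallow: /folder1/
```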

This will never match:

$_SERVER['REQUEST_URI']=='*.html'

== is a literal comparison and doesn't parse wildcards. You could check the end of the string with substr:

substr($_SERVER['REQUEST_URI'], -5) == '.html'

or use a regular expression:

//This will match when .html is anywhere inside the string
preg_match('/\.html/', $_SERVER['REQUEST_URI'])

//This will match when .html is at the end of the string, but the
//substr solution is faster in that case
preg_match('/\.html$/', $_SERVER['REQUEST_URI'])
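To see the difference concretely, here's a small standalone demo (the sample URIs are hypothetical) comparing the literal == comparison against the substr and preg_match checks:

```php
<?php
// Hypothetical example URIs: one article-style URL, one component URL
$uris = array('/news/story-123.html', '/index.php?option=com_content');

foreach ($uris as $uri) {
    $literal  = ($uri == '*.html');                      // always false: == does not expand wildcards
    $bySubstr = (substr($uri, -5) == '.html');           // true only when the URI ends in .html
    $byRegex  = (preg_match('/\.html$/', $uri) === 1);   // same check as a regular expression
    echo $uri . ': literal=' . var_export($literal, true)
        . ' substr=' . var_export($bySubstr, true)
        . ' regex=' . var_export($byRegex, true) . "\n";
}
```

Both the substr and regex checks agree; the wildcard comparison is false for every URI, which is why the original condition never fires.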

Taking advice from the posters here and a friend, I've come up with this:

You have to go to /public_html/libraries/joomla/document/html and edit html.php.

replace

//set default document metadata
     $this->setMetaData('Content-Type', $this->_mime . '; charset=' . $this->_charset , true );
     $this->setMetaData('robots', 'index, follow' );

with

//set default document metadata
$this->setMetaData('Content-Type', $this->_mime . '; charset=' . $this->_charset , true );

$queryString = $_SERVER['REQUEST_URI'];
// REQUEST_URI always begins with a slash, so compare against '/' for the
// home page and use leading slashes on the section URIs
if (( $queryString == '/' ) || ( $queryString == '/index.php/National-news' ) || ( $queryString == '/index.php/Business' ) || ( $queryString == '/index.php/Sport' ) || ( substr($queryString, -5) == '.html' )) {
    $this->setMetaData('robots', 'index, follow');
} else {
    $this->setMetaData('robots', 'noindex, follow');
}

This will update the meta robots tag on every page of the website, removing all of the junk content from search engines like Google and leaving just the content you want to be found in the index.
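One caveat worth testing for: $_SERVER['REQUEST_URI'] includes any query string, so a URL like /news/story-123.html?page=2 would fail the substr check and be noindexed. Stripping the query part with parse_url first avoids that. A simplified sketch (the section URIs are omitted here, and robotsFor is just an illustrative helper name):

```php
<?php
// Decide the robots value from a URI, stripping any query string first.
// parse_url() with PHP_URL_PATH returns only the path component.
function robotsFor($uri) {
    $path = parse_url($uri, PHP_URL_PATH);
    $indexable = ($path == '/') || (substr($path, -5) == '.html');
    return $indexable ? 'index, follow' : 'noindex, follow';
}

echo robotsFor('/news/story-123.html?page=2') . "\n"; // index, follow
echo robotsFor('/index.php?option=com_search') . "\n"; // noindex, follow
```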

I'll try running it on a test server within the next couple of days and report back.