Monday, November 1, 2010

How To: Dynamically Load Different Robots.txt Files on Apache Server

This approach relies on Apache configuration directives, so make sure the rewrite module (mod_rewrite) is enabled before you start.
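As a rough sketch of what "enabled" means in practice, the fragment below loads the module and only switches the engine on when the module is present. The module path and the directory path are assumptions for illustration; both vary by distribution and site layout. The AllowOverride line matters only if you plan to put the rules in an .htaccess file.

```apache
# Load mod_rewrite (module path is an assumption; it differs between distributions)
LoadModule rewrite_module modules/mod_rewrite.so

# Turn the rewrite engine on only if the module was actually loaded
<IfModule mod_rewrite.c>
    RewriteEngine On
</IfModule>

# Hypothetical document root; lets an .htaccess file there use rewrite directives
<Directory "/var/www/somesite">
    AllowOverride FileInfo
</Directory>
```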

In the following instructions, we assume that you want to keep crawlers that comply with the robots.txt specification from crawling and indexing your alternate hostname. You might maintain this other hostname because it acts as the origin server for the edge servers that deliver your content, and you don't want it indexed by Google or Yahoo, for instance. Blocking it keeps the alternate hostname from competing with your main hostname in search results and from diluting the content authority that search engine ranking algorithms rely on.

Here's what we need to do.

1) Write the necessary Apache configuration directives to your .htaccess file, your virtual host, or the main server configuration. This step uses Apache's rewrite module (mod_rewrite) to match the requested hostname and serve the robots.txt file you want for that hostname.

First Example:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^dev\.somesite\.com [NC]
RewriteRule ^robots\.txt$ robots.dev.txt [NC,L]

RewriteCond %{HTTP_HOST} ^origin\.somesite\.com [NC]
RewriteRule ^robots\.txt$ robots.origin.txt [NC,L]

Note: in [NC,L], "NC" makes the match case-insensitive and "L" makes the rule the last one applied when it matches. The patterns above are written for an .htaccess file; in a virtual host or the main configuration, the matched path starts with a slash, so the pattern becomes ^/robots\.txt$. Alternatively, you can negate a condition with "!" (i.e. !^dev\.somesite\.com) if you want to keep the original robots.txt file for the main hostname and serve an alternative for everything else. Example below.

Second Example:

RewriteCond %{HTTP_HOST} !^dev\.somesite\.com [NC]
RewriteRule ^robots\.txt$ robots.alt.txt [NC,L]
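If the rules go into a virtual host rather than an .htaccess file, they might look roughly like the sketch below. The ServerName, ServerAlias list, and DocumentRoot are assumptions made up for illustration; only the hostnames dev.somesite.com and origin.somesite.com come from the examples above. Here both the pattern and the substitution are written as URL paths with a leading slash.

```apache
<VirtualHost *:80>
    # Hypothetical main hostname and document root
    ServerName www.somesite.com
    ServerAlias dev.somesite.com origin.somesite.com
    DocumentRoot "/var/www/somesite"

    RewriteEngine On

    # Serve the blocking file when the request arrives on the origin hostname
    RewriteCond %{HTTP_HOST} ^origin\.somesite\.com [NC]
    RewriteRule ^/robots\.txt$ /robots.origin.txt [NC,L]
</VirtualHost>
```

Requests for robots.txt on any other hostname fall through and get the regular robots.txt file from the document root.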

2) Using the first example, write your crawler rules to the corresponding robots.txt files.

robots.dev.txt: this is your file for the indexable hostname.

User-agent: *
Disallow: /path/you/not/to/index/or/crawl
Crawl-delay: 1

Sitemap: http://dev.somesite.com/sitemap.xml

robots.origin.txt: this is the file to serve so that crawling and indexing of that hostname is not allowed.

User-agent: *
Disallow: /