How To: Dynamically Load Different Robots.txt Files on Apache Server

Monday, November 1, 2010

This approach relies on Apache configuration directives, so make sure the rewrite engine module (mod_rewrite) is enabled before you start.

In the following instructions, we assume that we want to prevent crawlers that comply with the robots.txt specification from crawling and indexing an alternate hostname. You might maintain such a hostname because it acts as an origin server for the edge servers that deliver your content, and you don't want it indexed by Google or Yahoo, for instance. Blocking it keeps the alternate hostname from competing with your main hostname in search results and from diluting the content authority used by search engine ranking algorithms.

Here's what we need to do.

1) Add the necessary Apache configuration directives to your .htaccess file, your virtual host, or the main server configuration. This step uses Apache's rewrite engine to match the requested hostname and serve the robots.txt file you want for that hostname.

First Example:

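RewriteEngine On
# Serve robots.dev.txt when the request arrives on the dev hostname.
RewriteCond %{HTTP_HOST} ^dev\.somesite\.com [NC]
RewriteRule ^/?robots\.txt$ /robots.dev.txt [NC,L]

# Serve robots.origin.txt when the request arrives on the origin hostname.
RewriteCond %{HTTP_HOST} ^origin\.somesite\.com [NC]
RewriteRule ^/?robots\.txt$ /robots.origin.txt [NC,L]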

Note: in [NC,L], "NC" makes the match case-insensitive and "L" makes a matching rule the last one applied. Alternatively, you can negate a condition with "!" (i.e. !^dev\.somesite\.com) if you want to keep the original robots.txt file for the main hostname and serve an alternative for everything else. Example below.

Second Example:

RewriteEngine On
RewriteCond %{HTTP_HOST} !^dev\.somesite\.com [NC]
RewriteRule ^/?robots\.txt$ /robots.alt.txt [NC,L]
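To reason about which file each hostname ends up with, it can help to model the two examples outside of Apache. The following is only a rough Python sketch of the selection logic, not how mod_rewrite is implemented; the function names robots_file_first and robots_file_second are made up for illustration.

```python
import re

# Hypothetical model of the first example: (host pattern, file served).
FIRST_EXAMPLE = [
    (re.compile(r"^dev\.somesite\.com", re.IGNORECASE), "robots.dev.txt"),
    (re.compile(r"^origin\.somesite\.com", re.IGNORECASE), "robots.origin.txt"),
]


def robots_file_first(host: str) -> str:
    """Return the file the first example would serve for a robots.txt request."""
    for pattern, target in FIRST_EXAMPLE:
        if pattern.search(host):
            return target  # first matching rule wins, like the L flag
    return "robots.txt"  # no condition matched, so the default file is served


def robots_file_second(host: str) -> str:
    """Return the file the second example would serve for a robots.txt request."""
    # The "!" negates the condition: every host except dev.* gets the alternate file.
    if not re.search(r"^dev\.somesite\.com", host, re.IGNORECASE):
        return "robots.alt.txt"
    return "robots.txt"


if __name__ == "__main__":
    for host in ("dev.somesite.com", "origin.somesite.com", "www.somesite.com"):
        print(host, robots_file_first(host), robots_file_second(host))
```

Note that re.IGNORECASE plays the role of the NC flag here, so a host such as ORIGIN.somesite.com is treated the same as its lowercase form.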

2) Following the first example, write the crawler rules for each robots.txt file.

robots.dev.txt is the file for the indexable hostname:

User-agent: *
Disallow: /path/you/not/to/index/or/crawl
Crawl-delay: 1

Sitemap: http://dev.somesite.com/sitemap.xml

robots.origin.txt is the file to serve so that indexing of the origin hostname is not allowed:

User-agent: *
Disallow: /
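Before deploying, you may want to sanity-check how a compliant crawler would read the two files. Below is a minimal sketch using Python's standard urllib.robotparser module. The rule strings copy only the User-agent and Disallow lines from above, and the allowed helper and the ExampleBot agent name are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the two files above (User-agent and Disallow lines only).
DEV_RULES = """\
User-agent: *
Disallow: /path/you/not/to/index/or/crawl
"""

ORIGIN_RULES = """\
User-agent: *
Disallow: /
"""


def allowed(rules: str, url: str, agent: str = "ExampleBot") -> bool:
    """Return True if a crawler honoring these rules may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(allowed(DEV_RULES, "http://dev.somesite.com/index.html"))
    print(allowed(ORIGIN_RULES, "http://origin.somesite.com/index.html"))
```

With these rules, the dev hostname stays crawlable apart from the disallowed path, while every URL on the origin hostname is refused.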

