SgCERT Advisory No. 042005 : Steps to disallow robots/spiders from indexing webservers containing email addresses

Audience

This document is intended for webmasters, system administrators, application developers and anyone in a similar role.

Introduction

Search engine robots and other software spiders can harvest email addresses from web pages, which gives rise to spam. To reduce such harvesting, it is advisable to disallow robots from indexing the parts of a webserver that contain email addresses.

It should be noted that some robots, by design, do not obey these restrictions. If stronger protection from robots and other agents is needed, alternative methods such as password protection should be used.

Procedure

This can be achieved by limiting what a robot can do via mechanisms such as the Robots Exclusion Protocol or the Robots META tag.

The Robots Exclusion Protocol

The Robots Exclusion Protocol is a method that allows webmasters to indicate to visiting robots which parts of their site should not be visited by the robot.

In a nutshell, when a robot visits a web site, say http://www.jpkn.sabah.gov.my, it first checks for http://www.jpkn.sabah.gov.my/robots.txt. If it can find this document, it will analyse its contents for records like:

User-agent: *
Disallow: /

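to see if it is allowed to retrieve the document. The record shown above excludes all robots from the entire site; to protect only the pages that contain email addresses, list their paths in Disallow lines instead. The precise details on how these rules can be specified, and what they mean, can be found in reference 1.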
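The check a well-behaved robot performs can be sketched with Python's standard urllib.robotparser module. This is only an illustration: the /contacts/ and /staff/ paths, the "ExampleBot" name and the www.example.com host are assumptions made for the example and do not come from this advisory.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body. The two Disallow paths stand in for
# the parts of a site that contain email addresses (assumed names).
ROBOTS_TXT = """\
User-agent: *
Disallow: /contacts/
Disallow: /staff/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file content as a list of lines, so no
    # network access is needed for this sketch.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    base = "http://www.example.com"  # placeholder host
    for path in ("/index.html", "/contacts/list.html", "/staff/"):
        print(path, is_allowed(ROBOTS_TXT, "ExampleBot", base + path))
```

Running the same check against your own robots.txt is a quick way to confirm that the rules cover the intended directories before relying on them.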

The Robots META tag

The Robots META tag allows web application developers to indicate to visiting robots whether a document may be indexed or used to harvest more links. No server administrator action is required. Currently only a few robots implement this.

In this simple example:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

a robot should neither index this document, nor analyse it for links.
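How a robot might read this tag can be sketched with Python's standard html.parser module. The class and function names below are hypothetical, and real robots may interpret the tag differently; the sketch only shows that the NAME and CONTENT values are matched case-insensitively and split on commas.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <META NAME="ROBOTS" ...> tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for item in attr_map.get("content", "").split(","):
            item = item.strip().lower()
            if item:
                self.directives.add(item)


def robots_directives(html: str) -> set:
    """Return the set of lowercase robots directives found in the document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head></html>'
    found = robots_directives(page)
    print("may index:", "noindex" not in found)
    print("may follow links:", "nofollow" not in found)
```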

Full details on how this tag works are provided in reference 2.

References

1       The Robots Exclusion Protocol: http://www.robotstxt.org/wc/exclusion.html#robotstxt
2       The Robots META tag: http://www.robotstxt.org/wc/exclusion.html#meta
3       Case study on how Google implemented this in its organisation: http://www.zone-h.com/en/news/read/id=122/