Andrew Gatenby

10 Things you can do with Robots.txt for SEO

Robots.txt is a very useful and underutilised tool that lets web designers control how search engine spiders treat their site. Most search engine spiders read robots.txt and parse its instructions before crawling a website.

Using robots.txt, you can ensure that spiders only index certain parts of your website, and you can specify which spiders are allowed to do so and which pages they index.

Stop Search Engines Indexing SessionID URLs

Your website may run on software that attaches Session IDs to URLs in some instances. However, the ?session_id=lkj23lj234 that can be attached to some URLs shows up as a duplicate page to search engines. The content at the base URL and at the URL with a Session ID attached is the same, but located at different addresses. This can lead to being penalised in search listings, so stop spiders from being able to index any Session ID URL:

User-agent: *
Disallow: /*session_id=
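
To see which URLs a wildcard rule like this catches, here is a minimal Python sketch of the * and $ matching that the major search engines document for robots.txt paths. The helper name rule_matches and the example URLs are made up for illustration, and the pattern is written with a leading slash, which is the form crawlers expect. Bear in mind that not every spider supports wildcards.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path.

    Hypothetical helper: `*` matches any run of characters and a
    trailing `$` anchors the end of the URL; everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with ".*".
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, so rules behave as prefix matches.
    return re.match(regex, path) is not None


print(rule_matches("/*session_id=", "/shop/item?session_id=lkj23lj234"))  # True
print(rule_matches("/*session_id=", "/shop/item"))  # False
```

The base URL stays crawlable while any address carrying the session_id parameter is caught by the rule.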

Tell Spiders your XML Sitemap Location

If you have an XML sitemap, you can give the location of this file to spiders using robots.txt. Of course, you can still log in to Google Webmaster Tools and Yahoo! Site Explorer, but this is a useful way to tell any other spiders that support the Sitemap protocol:

Sitemap: http://www.domain.com/sitemap.xml
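
If you want to check that the Sitemap line is being picked up, Python's standard-library urllib.robotparser can read it back (the site_maps() method needs Python 3.8 or later). This is only a sketch: the file contents below are an assumed example built around the placeholder domain used above.

```python
from urllib.robotparser import RobotFileParser

# Assumed example file; www.domain.com is only a placeholder.
robots_txt = """\
User-agent: *
Disallow: /cgi-bin/

Sitemap: http://www.domain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() returns a list of Sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)  # ['http://www.domain.com/sitemap.xml']
```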

Disallow Indexing of Particular File Extensions

You might not want particular types of file on your site indexed by search engines, so make it so! The * matches any run of characters and the $ marks the end of the URL:

User-agent: *
Disallow: /*.pdf$
Disallow: /*.doc$
Disallow: /*.inc$

Block Spiders from Certain Directories

It could be risky to allow spiders access to particular directories. Luckily, blocking them is easily done using robots.txt:

User-agent: *
Disallow: /cgi-bin/

Block Spiders from Directories Containing Specific Words

You might have a number of directories whose names start with the same word and that you want to block spiders from:

User-agent: *
Disallow: /admin*/

Block Entire Directory, Except Specific Files

Even though you’ve blocked a full directory, there may be a page in that directory you still want spidered:

User-agent: *
Disallow: /restricted/
Allow: /restricted/public.htm
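
Not every parser settles a clash between Allow and Disallow in the same way. Google documents that the most specific (longest) matching rule wins, but some simpler parsers just apply the first rule that matches, and Python's standard-library urllib.robotparser appears to be one of them. Listing the Allow line first should work under either scheme. A small sketch to check it, where ExampleBot and secret.htm are made-up names:

```python
from urllib.robotparser import RobotFileParser

# Allow is listed before Disallow so that parsers which stop at the
# first matching rule still let the public page through.
robots_txt = """\
User-agent: *
Allow: /restricted/public.htm
Disallow: /restricted/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "ExampleBot" is a made-up spider name used only for this check.
print(parser.can_fetch("ExampleBot", "/restricted/public.htm"))  # True
print(parser.can_fetch("ExampleBot", "/restricted/secret.htm"))  # False
```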

Block a Particular Spider

As well as search engine spiders, there are a number of spiders in use that harvest pages from the web for things like offline browsing, copying content, archiving websites etc. It’s possible that you don’t want this to happen to your website, so here’s an example:

User-agent: WebCopier
Disallow: /
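
A rule like this only affects the spider it names, and only if that spider chooses to obey robots.txt in the first place. A quick sketch with Python's standard-library urllib.robotparser shows how a well-behaved client would read it; /index.html is just an example path.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: WebCopier
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The named spider is blocked everywhere...
print(parser.can_fetch("WebCopier", "/index.html"))  # False
# ...while a spider with no matching group is unrestricted.
print(parser.can_fetch("Googlebot", "/index.html"))  # True
```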

Let Google Images Spider Just Spider Images

User-agent: Googlebot-Image
Disallow: /
Allow: /*.gif$
Allow: /*.png$
Allow: /*.jpeg$
Allow: /*.jpg$
Allow: /images
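
An image-only group depends on how Allow and Disallow are weighed against each other. Google documents that the longest matching rule wins, with Allow winning a tie, and that an empty Disallow blocks nothing. The sketch below is a simplified model of that precedence, not Google's actual code: it assumes the group opens with Disallow: / and the helper names to_regex and is_allowed are hypothetical.

```python
import re


def to_regex(pattern: str) -> str:
    """Translate a robots.txt path pattern (`*`, trailing `$`) to a regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return regex + ("$" if anchored else "")


def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Apply longest-match precedence; Allow wins a tie.

    `rules` is a list of ("allow" | "disallow", pattern) pairs.
    A path that no rule matches is allowed.
    """
    best_length, allowed = -1, True
    for kind, pattern in rules:
        if not pattern or re.match(to_regex(pattern), path) is None:
            continue  # empty patterns match nothing
        is_allow = kind == "allow"
        if len(pattern) > best_length or (len(pattern) == best_length and is_allow):
            best_length, allowed = len(pattern), is_allow
    return allowed


# Assumed rule set: block everything, then allow image files and /images.
image_only = [
    ("disallow", "/"),
    ("allow", "/*.gif$"),
    ("allow", "/*.png$"),
    ("allow", "/*.jpeg$"),
    ("allow", "/*.jpg$"),
    ("allow", "/images"),
]

print(is_allowed(image_only, "/photos/cat.jpg"))   # True
print(is_allowed(image_only, "/about.html"))       # False
print(is_allowed(image_only, "/images/logo.svg"))  # True
```

Under this model, swapping the first rule for an empty Disallow lets every path through, so the Allow lines on their own restrict nothing.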

Allow Google AdSense Spider to Index Whole Website

If you have Google AdSense adverts on your site, then the AdSense spider will crawl the content of your site to ensure that the adverts being served are appropriate and matched to the content. If you’ve restricted other spiders from parts of your site, then you may want AdSense to still spider the full site so it knows what sort of ads to place even on pages that are restricted from normal spiders.

User-agent: Mediapartners-Google
Disallow:

Shoot Yourself in the Foot

Robots.txt can be a very powerful tool - how about using it to stop Google from visiting your site ever again!?

User-agent: Googlebot
Disallow: /

Category: SEO
