Jul 10, 2008
10 Things you can do with Robots.txt for SEO
Robots.txt is a useful and underutilised tool that lets web designers control how search engine spiders crawl their site. Most search engine spiders read robots.txt and parse its instructions before crawling a website.
Using robots.txt, you can ensure that spiders index only certain parts of your website, and you can specify which spiders are allowed to do so and which pages they may index.
Stop Search Engines Indexing SessionID URLs
Your website may run on software that attaches session IDs to URLs in some instances. A parameter such as ?session_id=lkj23lj234 makes the URL look like a duplicate page to search engines: the content at the base URL and at the URL with a session ID attached is the same, but it lives at two different addresses. This can lead to being penalised in search listings, so stop spiders from indexing any session ID URL:
User-agent: *
Disallow: /*session_id=
Tell Spiders your XML Sitemap Location
If you have an XML sitemap, you can give its location to spiders using robots.txt. You can still log in to Google Webmaster Tools and Yahoo! Site Explorer to submit it, but this is a useful way to inform any other spiders that support the Sitemap protocol:
Sitemap: http://www.domain.com/sitemap.xml
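As a rough illustration of how a crawler might pick up that line, here is a minimal sketch using Python's standard-library urllib.robotparser. The robots.txt content is a placeholder built around the example domain above, and site_maps() assumes Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Placeholder robots.txt content, using the example domain from above.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: http://www.domain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the listed sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```

The Sitemap line is independent of any User-agent group, so it can sit anywhere in the file.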
Disallow Indexing of Particular File Extensions
You might not want particular types of file on your site being indexed by search engines, so make it so! The * matches any run of characters and the trailing $ anchors each pattern to the end of the URL:
User-agent: *
Disallow: /*.pdf$
Disallow: /*.doc$
Disallow: /*.inc$
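To see which URLs patterns like these would catch, the following is a minimal sketch of a matcher, assuming the common wildcard extension in which * matches any run of characters and a trailing $ anchors the pattern to the end of the URL. The function names and sample paths are made up for illustration, and real spiders may differ in the details.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumes the wildcard extension: '*' matches any run of characters
    and a trailing '$' anchors the pattern to the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal chunk, then join the chunks with '.*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_blocked(path, disallow_patterns):
    """Return True if any Disallow pattern matches the URL path."""
    return any(pattern_to_regex(p).search(path) for p in disallow_patterns)

# The three extension rules from the example above.
RULES = ["/*.pdf$", "/*.doc$", "/*.inc$"]

print(is_blocked("/files/report.pdf", RULES))      # True
print(is_blocked("/files/report.pdf?v=2", RULES))  # False: URL does not end in .pdf
print(is_blocked("/docs/index.html", RULES))       # False
```

Without the $, a rule such as /*.doc would also match /docs/index.html.doc.bak or any other URL that merely contains .doc.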
Block Spiders from Certain Directories
It could be risky to allow spiders access to particular directories. Luckily, blocking them is easily done using robots.txt:
User-agent: *
Disallow: /cgi-bin/
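A quick way to check a directory rule before publishing it is to run it through Python's standard-library urllib.robotparser. This is only a sketch: "ExampleBot" is a made-up spider name, the URLs use the placeholder domain from earlier, and this parser does plain prefix matching, so it suits rules like this one rather than wildcard patterns.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
])

# "ExampleBot" is a made-up spider name used only for illustration.
cgi_ok = parser.can_fetch("ExampleBot", "http://www.domain.com/cgi-bin/form.pl")
home_ok = parser.can_fetch("ExampleBot", "http://www.domain.com/index.html")
print(cgi_ok, home_ok)  # False True
```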
Block Spiders from Directories Containing Specific Words
You might have a number of directories with a similar name structure that you want to block spiders from:
User-agent: *
Disallow: /admin*/
Block Entire Directory, Except Specific Files
Even though you’ve blocked a full directory, there may be a page in that directory you still want spidered:
User-agent: *
Disallow: /restricted/
Allow: /restricted/public.htm
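Spiders do not all resolve an Allow/Disallow conflict the same way. Google documents that the most specific (longest) matching rule wins, while some parsers, including Python's urllib.robotparser, apply the first matching line in file order, in which case the Allow line would need to come first. The sketch below illustrates the longest-match approach only; the function name and sample paths are hypothetical, and it handles plain prefixes rather than wildcards.

```python
def is_allowed(path, rules):
    """Resolve Allow/Disallow conflicts by longest matching prefix.

    `rules` is a list of (directive, prefix) pairs. Plain prefixes only;
    wildcards are out of scope for this sketch.
    """
    best_len = -1
    best_allow = True  # No matching rule means the path is allowed.
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        allow = directive.lower() == "allow"
        # A longer prefix is more specific; on a tie, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and allow):
            best_len = len(prefix)
            best_allow = allow
    return best_allow

RULES = [
    ("Disallow", "/restricted/"),
    ("Allow", "/restricted/public.htm"),
]

print(is_allowed("/restricted/public.htm", RULES))   # True
print(is_allowed("/restricted/private.htm", RULES))  # False
print(is_allowed("/index.htm", RULES))               # True
```

Because Allow is an extension to the original robots.txt convention, it is worth checking that the spiders you care about support it.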
Block a Particular Spider
As well as search engine spiders, there are a number of spiders that harvest pages from the web for things like offline browsing, copying content and archiving websites. You may not want this to happen to your website, so here's an example:
User-agent: WebCopier
Disallow: /
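To confirm that a rule like this singles out one spider and leaves the rest alone, here is a small sketch using Python's standard-library urllib.robotparser. The URL uses the placeholder domain from earlier, and the check only shows what a compliant spider would conclude: robots.txt is advisory, so a badly behaved one can simply ignore it.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: WebCopier",
    "Disallow: /",
])

url = "http://www.domain.com/index.html"
webcopier_ok = parser.can_fetch("WebCopier", url)
googlebot_ok = parser.can_fetch("Googlebot", url)
print(webcopier_ok, googlebot_ok)  # False True
```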
Let Google Images Spider Just Spider Images
User-agent: Googlebot-Image
Disallow: /
Allow: /*.gif$
Allow: /*.png$
Allow: /*.jpeg$
Allow: /*.jpg$
Allow: /images
Allow Google AdSense Spider to Index Whole Website
If you have Google AdSense adverts on your site, then the AdSense spider will crawl the content of your site to ensure that the adverts being served are appropriate and matched to the content. If you’ve restricted other spiders from parts of your site, then you may want AdSense to still spider the full site so it knows what sort of ads to place even on pages that are restricted from normal spiders.
User-agent: Mediapartners-Google
Disallow:
Allow: /
Shoot Yourself in the Foot
Robots.txt can be a very powerful tool - how about using it to stop Google from visiting your site ever again!?
User-agent: Googlebot
Disallow: /






