The "robots" meta tag
If your web host prohibits you from uploading "robots.txt" to the root directory, or you simply wish to restrict crawlers from a few select pages on your site, an alternative to "robots.txt" is to use the robots meta tag.
Creating your "robots" meta tag
The "robots" meta tag looks like any other meta tag, and should be placed inside the HEAD section of each page in question:
<meta name="robots" content="noindex,nofollow" />
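To make the placement concrete, here is a minimal sketch of a complete page with the tag inside its HEAD section. The title and body text are placeholders chosen for illustration; only the meta tag itself comes from this tutorial.

```html
<!DOCTYPE html>
<html>
<head>
<title>Example page</title>
<!-- Ask compliant crawlers not to index this page or follow its links -->
<meta name="robots" content="noindex,nofollow" />
</head>
<body>
<p>Page content goes here.</p>
</body>
</html>
```

The tag only affects the page that carries it, so it has to be repeated on every page you want to restrict.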
Here's a list of the values you can specify within the "content" attribute of this tag:
Value | Description |
---|---|
(no)index | Determines whether the crawler should index this page. Possible values: "noindex" or "index". |
(no)follow | Determines whether the crawler should follow the links on this page and crawl them. Possible values: "nofollow" or "follow". |
Here are a few examples:
1) This disallows both indexing and following of links by a crawler on that specific page:
<meta name="robots" content="noindex,nofollow" />
2) This disallows indexing of the page, but lets the crawler follow and crawl the links contained within it:
<meta name="robots" content="noindex,follow" />
3) This allows indexing of the page, but instructs the crawler to not crawl links contained within it:
<meta name="robots" content="index,nofollow" />
4) Finally, there is a shorthand way of declaring 1) above (neither index the page nor follow its links):
<meta name="robots" content="none" />
Useful Links on "robots.txt"
To conclude this tutorial, here are some useful resources on "robots.txt" on the web. Enjoy!
- List of robots and crawlers (note that, like most such lists, it is not complete)
- Robots.txt validator: Validates your "robots.txt" for syntax errors. Very handy.
- Blocking bad bots and site rippers (aka offline browsers): Use mod_rewrite to ban bots and site downloaders.
- How to ban spambots, spybots, and evil robots: When "robots.txt" is not enough to ban the truly evil robots.