When running larger sites, especially ones built on platforms such as Magento and WordPress, it’s good practice for the webmaster to give search engines a little help with crawling and indexing the site. Alongside some more advanced tactics, robots.txt and sitemap.xml files are common ways to do this.
What is robots.txt?
A file called robots.txt, placed in the root of your website’s folder structure, is basically the gatekeeper to your site that search engines need to pass through. It’s a file where you can list instructions to search engines as to what they can and, more importantly, can’t look at.
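To make this concrete, here is a minimal sketch of what such instructions can look like and how a crawler might interpret them. It uses Python’s standard urllib.robotparser module; the example.com domain and the /admin/ and /tmp/ paths are assumptions made up for illustration, not recommendations for any particular site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are invented for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /tmp/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into an object that can answer can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A normal page is not covered by any Disallow rule, so it may be crawled.
print(parser.can_fetch("Googlebot", "https://www.example.com/products/blue-widget"))

# Anything under /admin/ matches a Disallow rule, so it may not be crawled.
print(parser.can_fetch("Googlebot", "https://www.example.com/admin/login"))
```

Note that this parser only does simple path-prefix matching, so it is a rough approximation of how a real search engine reads the file.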
Why Block Search Engines?
When a search engine crawls a site it’s making certain judgements: what the site’s about, whether it’s full of useful information, whether it’s an eCommerce site, and so on. For this reason you want to make sure the information it sees is relevant to what you do or offer, and you don’t want this signal clouded with irrelevant noise. Systems like Magento generate lots of pages, so a site with just a few products could have over 1000 URLs (pages) listed in Google’s index. Only 30 of these URLs might be the real ‘meat’ of the site. In this instance it could make sense to look at some of the indexed URLs and block search engines from seeing them any more, thus conveying a more concentrated view of your site.
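As a rough sketch of that clean-up exercise, the snippet below splits a list of indexed URLs into those a crawler could still fetch and those a set of rules would block. The rules, the sample URLs and the split_by_rules helper are all hypothetical; check your own URL structure before copying any of them.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; these paths are assumptions, not advice for your site.
RULES = """\
User-agent: *
Disallow: /catalogsearch/
Disallow: /checkout/
Disallow: /customer/
"""

# Hypothetical sample of URLs found in a search engine's index.
INDEXED_URLS = [
    "https://www.example.com/blue-widget.html",
    "https://www.example.com/catalogsearch/result/?q=widget",
    "https://www.example.com/checkout/cart/",
    "https://www.example.com/customer/account/login/",
    "https://www.example.com/about-us",
]


def split_by_rules(rules, urls, agent="Googlebot"):
    """Return (crawlable, blocked) lists of URLs for the given rules."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    crawlable = [u for u in urls if parser.can_fetch(agent, u)]
    blocked = [u for u in urls if not parser.can_fetch(agent, u)]
    return crawlable, blocked


crawlable, blocked = split_by_rules(RULES, INDEXED_URLS)
print(f"{len(crawlable)} crawlable, {len(blocked)} blocked")
for url in crawlable:
    print("Keep:", url)
```

Running a check like this against a real export of indexed URLs gives a feel for how much noise a rule would remove before you commit it to the live file.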
As well as blocking URLs, webmasters will also commonly block access to directories that contain scripts. The intent might be good… this information is useless to Google, right?
But block important files such as scripts and stylesheets and it’s only prudent to think Google will presume the worst: either that your site contains no layout and style information at all or, even worse, that you’re up to something shady that you’re trying to hide. Either way it will have a negative impact on your rankings.
So our advice is basically to be very careful with what you block. If it’s important to a user then it’s more than likely just as important to a search engine.
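One way to act on this advice is to audit your rules against the files a page needs in order to render. The sketch below assumes a hypothetical blocked_assets helper and made-up /js/, /css/ and /images/ paths; it simply lists which asset URLs an over-zealous set of rules would hide from a crawler.

```python
from urllib.robotparser import RobotFileParser


def blocked_assets(robots_txt, asset_urls, agent="Googlebot"):
    """Return the asset URLs that the given robots.txt text would block."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in asset_urls if not parser.can_fetch(agent, url)]


# Hypothetical over-zealous rules that hide scripts and styles from crawlers.
TOO_STRICT = """\
User-agent: *
Disallow: /js/
Disallow: /css/
Disallow: /admin/
"""

# Hypothetical files that a page on the site needs in order to display properly.
ASSETS = [
    "https://www.example.com/js/menu.js",
    "https://www.example.com/css/site.css",
    "https://www.example.com/images/logo.png",
]

for url in blocked_assets(TOO_STRICT, ASSETS):
    print("Blocked asset:", url)
```

If anything a visitor’s browser has to load shows up in that list, the rule blocking it is worth a second look.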
The Google help files are a good place to start for more information on robots.txt.
There are lots of features in Google Webmaster Tools that will help you see how Google views your site. One in particular is ‘Fetch as Google’ in the ‘Crawl’ section.