Robots.txt File And SEO
New SEO specialists often aren't clued in on what the robots.txt file is all about. That's a damn shame, because this little file can absolutely kill your rankings, and humble you, if it's put in place (properly or not) and you don't know about it!
This is a plain text file (duh!) that goes in the root directory of your site and asks search engine crawlers to stay away from whatever you list in it. Its purpose is to block search engine robots from crawling certain pages or files on your site so they don't get indexed.
Here's the in-depth look at robots.txt files from webconfs.com.
So, I'm going to give an overview of what the robots.txt file is, how to use it properly, and most importantly, when to take the damn thing off so your site can be crawled and indexed by search engine crawlers or robots.
First, let's set the stage about the need for this file. For a lot of sites, there may be pages, media or files you just don't want the search engines to find, crawl and index, because you don't want them showing up in search engine results pages (SERPs). The most common reason is to avoid duplicate content issues.
The best example I can give you is WordPress. I love, love, LOVE WordPress. Hell, my site is built with a WordPress template. I love me some WordPress because it's so damned SEO friendly! However, it can be too SEO friendly! Here's how:
You have a WordPress site and blog. Your blog has categories and tags. If you don't exclude the category and tag pages from being indexed by the search engines, they'll show up in search results, and since they just repeat what's in your published blog posts, they count as duplicate content. Therefore, you don't want your blog categories and tags indexed and ranked.
Or, suppose you have a site where each page can generate a print-friendly version. You only need to have the HTML page available for being indexed, not the print-friendly version. It's exactly like the HTML page, isn't it?
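Here's a rough sketch of what rules for those two situations might look like, checked with Python's built-in urllib.robotparser module. The paths are assumptions for illustration only: /category/ and /tag/ stand in for the blog archive pages, /print/ is a made-up home for print-friendly versions, and example.com is a placeholder domain. Your own site's URLs may well differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: /category/ and /tag/ are assumed archive paths,
# and /print/ is a made-up location for print-friendly pages.
DUPLICATE_CONTENT_RULES = """\
User-agent: *
Disallow: /category/
Disallow: /tag/
Disallow: /print/
"""

rules = RobotFileParser()
rules.parse(DUPLICATE_CONTENT_RULES.splitlines())


def is_crawlable(url: str, user_agent: str = "*") -> bool:
    """Return True if the rules above let this user agent crawl the URL."""
    return rules.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/category/seo/",
        "https://example.com/tag/robots/",
        "https://example.com/print/my-post/",
        "https://example.com/my-post/",
    ):
        status = "crawlable" if is_crawlable(url) else "blocked"
        print(f"{url}: {status}")
```

The blog post itself stays crawlable, while the archive and print copies are asked to be skipped. Keep in mind that this standard library parser does simple prefix matching, so it's a sanity check, not a guarantee of how any particular search engine reads the file.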
The next common reason to have it in place is when you're developing a new website and it's on a public server, not a private test server. We don't always get to choose how to develop a site, so on a public test domain you want to slap the robots.txt file in place IMMEDIATELY.
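For a development site, the usual move is to ask every crawler to stay away from everything. Below is a minimal sketch of that two-line file, again checked with Python's urllib.robotparser. The helper name can_crawl and the dev.example.com domain are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# "User-agent: *" addresses every crawler; "Disallow: /" covers the whole site.
BLOCK_EVERYTHING = """\
User-agent: *
Disallow: /
"""


def can_crawl(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Parse the robots.txt text and report whether the URL may be crawled."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # dev.example.com is a placeholder for a public test domain.
    for url in ("https://dev.example.com/", "https://dev.example.com/about/"):
        print(url, "crawlable" if can_crawl(BLOCK_EVERYTHING, url) else "blocked")
```

Every URL on the site comes back as blocked, which is exactly what you want during development and exactly what you don't want after launch.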
Examples of A Robots.txt File
The way you can exclude those pages is through a little plain text file called robots.txt. This powerful little file can exclude an entire site, certain files, or individual pages. It's up to you, but be aware that you need to create it properly, with the right syntax, to make it work the way you want. Below are some examples from the great folks at SEObook:
Moz has a great little cheat sheet that I think perfectly illustrates the uses of the file:
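To make the three cases concrete (an entire site, a set of files, an individual page), here's a small sketch in Python using the standard urllib.robotparser module. Everything named in it is hypothetical: BadBot is an invented crawler, and /private-files/ and /thank-you.html are invented paths on a placeholder domain.

```python
from urllib.robotparser import RobotFileParser

# First group: keep one (hypothetical) crawler off the entire site.
# Second group: everyone else skips one directory and one individual page.
EXAMPLE_ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private-files/
Disallow: /thank-you.html
"""

example_rules = RobotFileParser()
example_rules.parse(EXAMPLE_ROBOTS_TXT.splitlines())


def allowed(user_agent: str, path: str) -> bool:
    """Return True if user_agent may crawl the given path on example.com."""
    return example_rules.can_fetch(user_agent, "https://example.com" + path)


if __name__ == "__main__":
    checks = [
        ("BadBot", "/"),
        ("BadBot", "/blog/"),
        ("Googlebot", "/private-files/report.pdf"),
        ("Googlebot", "/thank-you.html"),
        ("Googlebot", "/blog/"),
    ]
    for agent, path in checks:
        verdict = "allowed" if allowed(agent, path) else "blocked"
        print(f"{agent} -> {path}: {verdict}")
```

Each group starts with a User-agent line naming who the rules are for, followed by one Disallow line per path. The blank line between groups matters, since that's how the file separates one set of rules from the next.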
Finally, let's talk about an all-too common scenario that I've seen time and time and time again.
You're all excited, because you've been part of a web development team to roll out an awesome website. Ta da!!! You launch it! Done like a dinner, amirite?
Well, remember how you had the web dev team put the robots.txt file in place? Did you go and get it removed right before launch? No? Bwahahahahaha... there are a couple of things you'll see. First, in the SERPs:
Oops! The other place you'll see the error is in Google Webmaster Tools, in the Index Status section:
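A quick pre-launch check can catch this before the search engines do. The sketch below is one possible way to do it, not an official tool: blocks_whole_site is a made-up helper that parses the text of a robots.txt file and reports whether a given crawler is shut out of the home page. The "Googlebot" default is just an illustrative user agent.

```python
from urllib.robotparser import RobotFileParser


def blocks_whole_site(robots_txt: str, user_agent: str = "Googlebot") -> bool:
    """Return True if these rules keep user_agent away from the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


# The development-time file that someone forgot to remove.
LEFTOVER_DEV_FILE = "User-agent: *\nDisallow: /\n"

# An empty Disallow value means nothing is blocked.
LAUNCH_READY_FILE = "User-agent: *\nDisallow:\n"

if __name__ == "__main__":
    for label, text in (("leftover", LEFTOVER_DEV_FILE), ("launch-ready", LAUNCH_READY_FILE)):
        if blocks_whole_site(text):
            print(f"{label}: WARNING, the whole site is blocked from crawling")
        else:
            print(f"{label}: OK, the home page is crawlable")
```

To check a live site instead of a string, the same parser can load the file itself with set_url() and read(), though that needs network access and the real address of your robots.txt file.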
Now that you know a little bit more about the robots.txt file, how to use it and what it looks like in search results, go forth and use it wisely!
And here's some related information about how your robots.txt file can kill crawling and indexing of your website.
Not only will you learn more about the robots.txt file to avoid crawling and indexing issues, you'll also learn basic and advanced search engine optimization in Invenio SEO's training classes - look at them here and sign up for one today!
Or, you can do online training now!
Until we meet again, stay safely between the ditches!
All the very best to you,
Nancy McDonald