Having the keywords “coding” and “SEO” in the title grabbed my attention and I thought it would be a must-read article.
Boy, was I wrong! The article contained nothing new and was filled with inaccurate information.
It’s further proof that guest blogging has become an abused art. Read what Matt Cutts said about guest blogging and then think about the misinformation the article I’m referencing contains. You’ll understand why Matt said what he did about guest blogging.
The Inaccurate Information Regarding Robots.txt
The information presented in Tip #5 is inaccurate with regard to keeping content out of Google’s index. The other problem with Tip #5 is the notion of keeping specific pages of your website out of Google’s index in the first place. I can only speculate that the reason is link flow distribution by way of link sculpting.
The image below is a snapshot of Tip #5 dated August 11, 2012:
Why is it Wrong?
For starters, there is no such thing as a noindex tag in the robots.txt file. The directive to use in the robots.txt file is Disallow, which only tells the bots what they can and cannot crawl. It does not tell them what can and cannot be indexed. To have a page noindexed, you need to use the noindex robots meta tag on the page itself.
While Googlebot won’t crawl or index the content of pages blocked by a Disallow directive in the robots.txt file, Google may still index the URLs if it finds them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results.
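To make the crawl versus index distinction concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the example.com URLs are made up for illustration; they are not taken from any real site. All the parser can answer is "may this bot fetch this URL?". Nothing in the file format lets you say "do not index".

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, similar in spirit to the rules discussed in this post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /home
"""


def can_crawl(robots_txt, user_agent, url):
    """Return True if the given robots.txt allows user_agent to crawl url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Disallow only answers the crawl question. It says nothing about indexing.
print(can_crawl(ROBOTS_TXT, "Googlebot", "https://www.example.com/home"))
print(can_crawl(ROBOTS_TXT, "Googlebot", "https://www.example.com/about"))
```

A blocked URL can still be discovered through links on other sites, which is why it can show up in search results anyway.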
I found two large websites online that use the robots.txt file improperly when trying to keep pages out of Google’s index.
Example 1: Snagajob.com
Let’s take a peek at the robots.txt file from Snagajob.com. The image below is a snippet of their file.
The following is an assumption based upon what I can see from the outside looking in. In the image above you can see that they are trying to hide the page found at the URL /home. My guess is they are testing an alternate homepage and don’t want Google to see or index the page. Problem is… Google has indexed the page.
Go to Google and type http://www.snagajob.com/home into the search box. Then hit the search button and the first result you will see is the page in question.
If they wanted to keep this URL completely out of the Google index they should use the noindex robots meta tag on the page itself and remove the disallow directive from the robots.txt file.
The tag in question is <meta name="robots" content="noindex, follow">. It instructs the bots not to index the page itself, but still allows the links on the page to be crawled and followed, thanks to the FOLLOW directive in the noindex robots meta tag.
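If you want to check whether a page actually carries that tag, a rough sketch with Python's standard `html.parser` could look like the following. The `PAGE` markup is a made-up stand-in for the page's source, and the helper names are my own. This is only an illustration of reading the directives, not how Googlebot itself parses pages.

```python
from html.parser import HTMLParser

# Hypothetical page source containing a noindex, follow robots meta tag.
PAGE = """<html><head>
<title>Alternate homepage</title>
<meta name="robots" content="noindex, follow">
</head><body><a href="/jobs">Jobs</a></body></html>"""


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for part in attr_map.get("content", "").split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of robots meta directives present in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


print(sorted(robots_directives(PAGE)))
```

A page with no robots meta tag yields an empty set, which bots treat as the default: index and follow.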
UPDATE: Snagajob now performs a 301 redirect from /home to the main homepage. This is good, but it wouldn’t be needed had they done it right when they implemented their test page.
Example 2: Elephant Auto Insurance
Let’s take a peek at the robots.txt file found on the Elephant.com website.
It’s such a small file that you would think they got it right. I would even say it’s so simple even a caveman could do it right. But just like Snagajob, they didn’t get it right.
Also like Snagajob, they have a second version of their homepage, named home.aspx, that they are testing or were testing at one time. Based upon the robots.txt file having the Disallow directive, I can only speculate that they didn’t want this page found in Google’s index.
But as you can see in the image below, it is in Google’s index.
Why is it in the Google index? Because they tried blocking it from Googlebot using the robots.txt file and not using the noindex robots meta tag.
From Google: However, robots.txt Disallow does not guarantee that a page will not appear in results: Google may still decide, based on external information such as incoming links, that it is relevant. If you wish to explicitly block a page from being indexed, you should instead use the noindex robots meta tag or X-Robots-Tag HTTP header. In this case, you should not disallow the page in robots.txt, because the page must be crawled in order for the tag to be seen and obeyed.
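The interaction Google describes can be sketched in a few lines of Python. This is a simplified model under my own assumptions: the function name, the return labels and the example.com URLs are hypothetical, and real X-Robots-Tag headers can also carry a user-agent prefix that this sketch ignores. The point is the order of the checks: if the page is disallowed, the noindex is never seen.

```python
from urllib.robotparser import RobotFileParser


def noindex_outcome(robots_txt, url, x_robots_tag="", meta_robots="",
                    user_agent="Googlebot"):
    """Classify a URL as 'not-crawled', 'noindex-seen' or 'indexable'."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(user_agent, url):
        # The bot never fetches the page, so it cannot see a noindex
        # in the header or the meta tag. The URL may still get indexed.
        return "not-crawled"
    combined = f"{x_robots_tag},{meta_robots}"
    directives = {part.strip().lower() for part in combined.split(",")}
    if "noindex" in directives:
        return "noindex-seen"
    return "indexable"


BLOCKING = "User-agent: *\nDisallow: /home.aspx\n"
URL = "https://www.example.com/home.aspx"

# Disallow plus noindex: the noindex is wasted.
print(noindex_outcome(BLOCKING, URL, meta_robots="noindex, follow"))
# No Disallow, noindex sent as an HTTP header: the bot can obey it.
print(noindex_outcome("", URL, x_robots_tag="noindex"))
```

Only the second setup gives the bot a chance to read the instruction and drop the page from the index.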
A Note About 404 Errors: If you visit http://elephant.com/home.aspx you will see a 404 error page that is about as user-unfriendly as they come. All websites should create user-friendly 404 pages that also return the proper 404 status code to browsers and bots. Mine is here.
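As a rough illustration of "friendly page, proper status code", here is a tiny WSGI sketch in Python. The `PAGES` dictionary, the markup and the function names are invented for the example. A real site would do this in its own server or CMS, so treat it as a sketch of the idea rather than a recommended setup.

```python
# Hypothetical site content: request path -> HTML body.
PAGES = {"/": "<h1>Home</h1>"}

FRIENDLY_404 = (
    "<h1>Sorry, we could not find that page</h1>"
    '<p>Try the <a href="/">homepage</a> instead.</p>'
)


def respond(path):
    """Return (status line, HTML body) for a request path."""
    if path in PAGES:
        return "200 OK", PAGES[path]
    # Helpful body for people, real 404 status for browsers and bots.
    return "404 Not Found", FRIENDLY_404


def app(environ, start_response):
    """Minimal WSGI application wrapping respond()."""
    status, body = respond(environ.get("PATH_INFO", "/"))
    start_response(status, [("Content-Type", "text/html; charset=utf-8")])
    return [body.encode("utf-8")]
```

To try it locally, you could pass `app` to `wsgiref.simple_server.make_server("127.0.0.1", 8000, app)` and call `serve_forever()` on the result. The mistake to avoid is sending the friendly page with a 200 status, which tells the bots the missing page exists.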
What Else is Wrong with Tip #5?
The statement about keeping your contact page out of the Google index is wrong, very wrong!
Using a restaurant website as an example…
Most restaurant websites have a contact page to display their phone number, physical street address, a map of their location and other contact information.
Why would you want to keep your contact page out of the Google search results? If a user on a mobile device typed in your restaurant name and the phrase map or location, you would obviously want your contact page to show up. It would probably be the most relevant page of your website for that search phrase, so please let Google find and index that page.
What about the Archive page?
Some people mix up the terms archive page and category page.
If the author is identifying an archive page properly then I agree with keeping it out of the Google index. Here is an example of one of my archive pages that has no value being in the SERPs. It’s only organized by date and contains random articles and as such shouldn’t be returned in the SERPs.
If you view the source code of my archive page you will see the noindex robots meta tag where I’m telling the bots not to index the page. That’s how you keep pages out of the Google index.
If the author has the terminology mixed up, then it’s simply wrong not to have your category pages available in the SERPs. On websites running WordPress with an SEO framework for child themes, you can build a valuable category page that will rank well in Google and provide value to users.
Again, let me provide an example.
When I Google the phrase Genesis tutorials I can see an archive/category page ranking #3. It’s the listing http://briangardner.com/genesis-tutorials/. That is an archive page while others might call it a category page. In either case it’s highly relevant for the search I entered into Google and obviously Brian recognizes the importance of that page and is allowing Google to index it accordingly.
Sidenote: The search mentioned above was done by appending pws=0 to the end of the search URL. Not everyone will see the exact same search results as me due to personalization. Whether you have it turned on or off, your results are still personalized.
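For anyone who wants to repeat that kind of search, here is a small Python sketch that builds the URL with the pws=0 parameter mentioned above. The helper name is my own, and I'm assuming the usual https://www.google.com/search?q=... URL shape. The sketch only builds the string. It cannot tell you how much the parameter changes your results.

```python
from urllib.parse import urlencode


def google_search_url(query, personalized=False):
    """Build a Google search URL, adding pws=0 unless personalized is True."""
    params = {"q": query}
    if not personalized:
        params["pws"] = "0"
    return "https://www.google.com/search?" + urlencode(params)


print(google_search_url("Genesis tutorials"))
```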
In both examples above (the archive & contact pages) I can only speculate that the author is trying to do some sort of link sculpting for link flow distribution. Link sculpting is all nonsense and a waste of time.
- Use the robots.txt file to let the bots know what they can and cannot crawl on your website.
- Use the noindex robots meta tag to let the bots know what you do not want to be included in their index.
I’ve provided two links below that go into even more detail on the robots.txt file. Enjoy!
- Google: https://developers.google.com/webmasters/control-crawl-index/docs/faq
- Robotstxt.org: http://www.robotstxt.org/