Web Crawler

Is A Web Crawler Giving Up On Your Site?

A web crawler is an automated program that search engines send out to crawl, or analyze, all the pages on your website. However, two domains I'm working on right now illustrate how you can cause a crawler to give up before it finds the most important pages on your site.

What many business owners may not realize is that Google has a "crawl budget" for each site. If your site is small, this is no big deal. All of your pages will be crawled by the search engine.

But what if your site has been up for several years and you've been steadily adding new content? Is the web crawler using your crawl budget wisely to find and index all of those new pages?

In other words, your money-making pages (products and services) may never even be visited, and if they aren't, they'll never see the light of day in search results! Let's look at why this can happen.

[Image: web crawler can't find pages]

The first thing that can cause the crawlers to never find your new or updated content is nobody being in charge of the site's architecture. If every publisher or web dev person is allowed to create new categories or new layers of navigation, this can cause big trouble once the site has hundreds or thousands of pages.

As an example, one site I'm working on has a file category called "Events." Then somebody created "Event Categories." Another one is "Alumni Events." And so on. Then there's "News." And "Alumni News." And so forth and so on.

While I can't argue with categorizing different types of events, there has to be a plan in place to keep the site architecture flat so the crawlers can find all of these published pages. What happens is that these categories get nested like Russian dolls, and the next thing you know, the most recently published content is several levels down, buried beneath old, top-level pages.
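
To make this concrete, here's a made-up illustration of how that nesting can play out in URLs (the paths are hypothetical, not taken from the actual site):

Nested: https://www.example.com/news/alumni-news/events/event-categories/alumni-events/homecoming
Flat: https://www.example.com/alumni-events/homecoming

The deeper a page sits, the more links the crawler has to follow before it ever reaches it, and the more likely it is that your crawl budget runs out first.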

How To See Which URLs The Web Crawler Is Indexing

You may be wondering how to find which URLs are being crawled. There's a handy tool made by Screaming Frog called the Log File Analyser. You can use it for free, but the free version is severely limited in what you can see. Don't be cheap - shell out the approximately $125 USD and get full functionality!

Here's how it works. You need to get your friendly neighborhood web dev person to collect and download the server log files. These are generated every day on your web hosting server, and they're stored for a limited number of days.

I'd get at least 15 - 30 days' worth of log files. They come compressed, so you need software to unzip them; through trial and error, I found that 7-Zip File Manager unzips them automatically as I download them to my desktop.

You import them into Screaming Frog and let the software do its thing. The log analyzer shows you which search engines are knocking on your website's door and, more importantly, which URLs are being crawled!

So, what you're looking for is whether old, outdated content is still being crawled first. You also want to make sure sections like categories, tags, user logins, etc., aren't being crawled and indexed. This is how your crawl budget gets wasted!
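
If you'd like a quick sanity check before (or alongside) the Screaming Frog tool, a few lines of Python can tally which URLs Googlebot is requesting in a raw access log. This is only a rough sketch: it assumes your host writes logs in the common "combined" format, and the file name access.log is made up, so substitute whatever your server actually produces.

from collections import Counter

# Rough sketch: count Googlebot requests per URL in a combined-format access log.
# "access.log" is a placeholder name; use the log files your host gives you.
hits = Counter()
with open("access.log", encoding="utf-8", errors="ignore") as log:
    for line in log:
        if "Googlebot" not in line:        # keep only lines from Google's crawler
            continue
        parts = line.split('"')            # in combined format, the request is the second quoted field
        if len(parts) > 1:
            request = parts[1].split()     # e.g. ['GET', '/tag/alumni-events/', 'HTTP/1.1']
            if len(request) > 1:
                hits[request[1]] += 1      # tally the requested URL path

for url, count in hits.most_common(20):    # show the 20 most-crawled URLs
    print(count, url)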

You can take the results and create a robots.txt file to tell the search engines what parts of your site you don't want crawled, like the categories and tags on your blog. You can also work with your web dev team or person to determine if there is really old content that is no longer relevant, or doesn't need a high priority for the crawlers to analyze.

Here's a sample robots.txt file that has a number of web page/site elements that shouldn't be crawled:

User-agent: *
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-login.php
Disallow: /wp-register.php
Disallow: /wp-content/themes
Disallow: /wp-content/plugins
Disallow: /wp-content/upgrade
Disallow: /wp-content/themes_backup
Disallow: /wp-content/cache
Disallow: /xmlrpc.php
Disallow: /template.html
Disallow: /wp-comments
Disallow: /cgi-bin
Disallow: /trackback
Disallow: /feed
Disallow: /comments
Disallow: /comment-page
Disallow: /replytocom=
Disallow: /author
Disallow: /?author=
Disallow: /tag
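
Before you upload the file, it's worth double-checking that a rule actually blocks what you think it blocks. Python's built-in urllib.robotparser can test URLs against your rules. Here's a minimal sketch; the two example.com URLs are made up for illustration:

from urllib.robotparser import RobotFileParser

# Paste in (part of) your robots.txt rules; these lines mirror the sample above.
rules = [
    "User-agent: *",
    "Disallow: /wp-admin",
    "Disallow: /tag",
]

parser = RobotFileParser()
parser.parse(rules)

# Hypothetical URLs, just to show the check.
print(parser.can_fetch("Googlebot", "https://www.example.com/wp-admin/options.php"))  # False - blocked
print(parser.can_fetch("Googlebot", "https://www.example.com/services/"))             # True - crawlable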

You can also customize your XML sitemap to set priorities on a page by page basis. New content that is published regularly, such as blog posts, should get a really high priority. Static pages that don't change much, like your home, contact, about and product/service pages, can be set to a lower crawl priority.

Here's a sample entry from a fictitious XML sitemap:

<url>
<loc>https://www.example.com/</loc> (The specific URL)
<lastmod>2005-01-01</lastmod> (The last time the URL was modified or updated)
<changefreq>monthly</changefreq> (This URL changes monthly)
<priority>0.8</priority> (Goes from 0.0 to 1.0; 0.8 is pretty high)
</url>
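
For context, <url> entries like that one live inside a <urlset> element in the sitemap file. A minimal, complete file built around the same fictitious URL would look like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>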

Back to Screaming Frog... you can also see which search engine robots (crawlers) are visiting:

[Screenshot: web crawler info from Screaming Frog, showing crawler visits and response codes]

As you can see, there's a lot of data you can get from a log analyzer. The important thing is to run it periodically and see if there are any crawl budget issues developing. It's easier to nip these in the bud before you have thousands of URLs and a butt ton of work to do!

See how a poorly laid out, or non-existent, information architecture scheme can hurt your crawl budget!

Want to know more about technical SEO? Take a search engine optimization training class with Invenio SEO and learn all about crawlers, keyword research, on-page optimization and more!

Do you need the basics of search engine optimization, but don't want to sit in a class to learn them? You're in luck, because you can now get this knowledge in our online course.

Until we meet again, stay safely between the ditches!

All the very best to you,

Nancy McDonald

Web crawler/web page image courtesy of Karen McDonald

Screen shots courtesy of author
