Tuesday 30 June 2009

Googlebot, Sitemaps and heavy crawling

Googlebot over-crawling a site

I recently had an issue with one of my jobboards where Googlebot was over-crawling the site, causing the following problems:
  1. Heavy load on the server. The site in question was recording 5 million page loads a month, which had doubled from 2.4 million within a month.
  2. 97% of all the site's traffic was accounted for by Googlebot.
  3. The site is on a shared server, so this heavy load was causing very high CPU usage and affecting other sites.

The reasons the site was receiving so much crawler traffic boiled down to the following points:

  1. The site has a large number of categories which users can filter job searches by. These categories are displayed in whole and in subsets in prominent places such as quick links and a job browser which allows users to filter results. As multiple categories can be chosen when filtering a search, Googlebot was crawling every possible combination in various orders.
  2. A new link had been added to the footer within the last month which passed a session ID in the URL; it was used to log whether users had JavaScript enabled. As Googlebot doesn't keep session state or execute JavaScript, the number of crawlable URLs effectively doubled, because every page the crawler hit contained a link with a new session ID that it hadn't already spidered.
  3. The sitemap had been set up incorrectly, containing URLs that didn't need crawling as well as incorrect change frequencies.
  4. The crawl rate was set to a very high level in Google Webmaster Tools.

Therefore a site with around a thousand jobs was receiving 200,000 page loads a day, nearly all of them from crawlers. To put this in perspective, other sites with 3,000+ jobs, good SEO and high PageRank usually get around 20,000 page loads a day from crawlers.

One of the ways I rectified the situation was by setting a low custom crawl rate of 0.2 crawls per second. This caused a nice big vertical drop in the crawl graph, which alarmed the site owner, as he didn't realise that there is no relation between the number of pages crawled by Google and the site's page ranking or overall search engine optimisation.


Top Tips for getting the best out of crawlers

  • Set up a sitemap and submit it to Google, Yahoo and Live.
  • Make sure only relevant URLs are put in the sitemap. For example, don't include pages such as error pages and logoff pages.
  • If you are rewriting URLs then don't include the non-rewritten URL as well, as this will be counted as duplicate content.
  • If you are including URLs that take IDs as parameters to display database content then make sure you don't include the URL without a valid ID. Taking the site I spoke about earlier as an example, someone had included the following:
www.some-site.com/jobview.asp

instead of

www.some-site.com/jobview.asp?jobid=35056

This meant crawlers were accessing pages with no content, which was a pointless and careless thing to do.

  • Make sure the change frequency value is set appropriately. For example, on a jobboard a job is usually posted for between 7 and 28 days, so it only needs to be crawled between once a week and once a month depending on how long it was advertised for. Setting a value of always is inappropriate, as the content will not change every time Googlebot accesses the URL (see the example sitemap after this list).
  • Avoid circular references such as placing links to a site index or category listings index in the footer of each page on a site. It makes it hard for the bot to determine the site structure, as every path it drills down leads it back to the parent page again. Although I suspect the bot's technology is clever enough to realise it has already spidered a link and not crawl it again, I have heard that it looks bad in terms of site structure.
  • Avoid dead links or links that lead to pages with no content. If you have a category index page and some categories have no content related to them, then don't make the category into a link, or otherwise link to a page that can show related content rather than nothing.
  • Prevent duplicate content and variations of the same URL being indexed by implementing one of the following two methods (rough examples of both follow after this list).
  1. Set your robots.txt to disallow your non-rewritten pages from being crawled and then only display rewritten URLs to agents identified as crawlers.
  2. Allow both forms of URL to be crawled but use a canonical link tag to specify that you want the rewritten version to be indexed.
  • Ban crawlers that misbehave. If we don't spank them when they are naughty they will never learn, so punish those that misbehave. It's very easy for an automated process to parse a robots.txt file, so there is no excuse for bots that ignore the commands set out in it. If you want to identify the bots that ignore your robots.txt rules there are various ways, such as parsing your webserver log files or serving a dynamic robots.txt file that records the agents that access it (see the sketch after this list). There are other ways, such as using the IsBanned flag available in the Browscap.ini file, however this relies on the user-agent being correct and more and more people spoof their agent nowadays. Not only is banning bad bots good for your server's performance, as it reduces load, it is also good for your site's security, as bots that ignore robots.txt rules are more likely to hack, spam and scrape your site's content.
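
To illustrate the sitemap tips above, here is a rough sketch of a minimal sitemap entry. The rewritten job URL and the dates are made up for illustration; the point is that only real content pages are listed and the change frequency reflects how often a job advert actually changes rather than "always":

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.some-site.com/jobs/sales-manager-london-35056.asp</loc>
    <lastmod>2009-06-22</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <!-- No error pages, logoff pages, session IDs or parameter-less URLs such as /jobview.asp -->
</urlset>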
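
Here are rough examples of the two duplicate content methods, again with made-up paths. Method 1 is a robots.txt rule that stops the non-rewritten page being crawled; method 2 is a canonical link tag placed in the head of both versions of the page:

# robots.txt - method 1: block the non-rewritten URL from being crawled
User-agent: *
Disallow: /jobview.asp

<!-- Method 2: placed in the <head> of both the rewritten and non-rewritten page -->
<link rel="canonical" href="http://www.some-site.com/jobs/sales-manager-london-35056.asp" />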
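
As for the dynamic robots.txt idea, the sketch below shows the general approach: serve the file from a script so that every request for it is logged, then compare that log with your normal access logs to see which agents never bother reading the rules at all. This is only a minimal sketch in Python rather than the ASP used on the site in question, and the log file name and disallow rules are made up:

# Minimal sketch of a "dynamic robots.txt" - serve the file from a script so
# every fetch of it can be logged along with the requesting user-agent.
import logging
from wsgiref.simple_server import make_server

logging.basicConfig(filename="robots_hits.log", level=logging.INFO,
                    format="%(asctime)s %(message)s")

ROBOTS_TXT = (
    "User-agent: *\n"
    "Disallow: /jobview.asp\n"   # hypothetical rules for illustration only
    "Disallow: /logoff.asp\n"
)

def app(environ, start_response):
    if environ.get("PATH_INFO") == "/robots.txt":
        # Record which agent fetched the rules; agents that appear in the main
        # access logs but never here are ignoring robots.txt altogether.
        logging.info("robots.txt fetched by %s (%s)",
                     environ.get("HTTP_USER_AGENT", "-"),
                     environ.get("REMOTE_ADDR", "-"))
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [ROBOTS_TXT.encode("utf-8")]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"Not found"]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()
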
If you are having similar issues with over-crawling then I would advise you to check your site's structure first, to see whether the problem is due to bad structure, invalid sitemap values or over-categorisation, before changing the crawl rate. Remember a site's SEO is unrelated to the amount of crawler activity and more is not necessarily better. It's not the number of crawled pages that counts but rather the quality of the content found when the crawlers visit.

2 comments:

John said...

My site http://allawebbisar.se has approx 50,000 pages but the problem for me is that the sitemaps in total are very large. How can I decrease the size of the sitemaps, thus decreasing traffic, without excluding pages from the sitemap?

Rob Reid said...

I am not sure what it is you want to achieve. Are you trying to reduce crawler traffic or reduce the size of the sitemap?

Changing what is in the sitemap won't automatically decrease traffic. The sitemap purely lets search engine bots know which pages you think are important on your site, so that when they visit they can access those URLs straight away rather than having to crawl your site to discover them.

If you want to reduce the size of the sitemap without excluding pages then just create multiple sitemaps instead of one large one and use a sitemap index file to link them together.
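
As a rough illustration (the sitemap file names are made up), a sitemap index just lists each of the smaller sitemaps:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://allawebbisar.se/sitemap-1.xml</loc>
    <lastmod>2009-06-30</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://allawebbisar.se/sitemap-2.xml</loc>
  </sitemap>
</sitemapindex>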

If you want to reduce crawler traffic then the article explains what you can do, including banning bad bots, having a correct site hierarchy and, for Google, setting a custom crawl rate in Webmaster Tools to prevent it from crawling too much.

Hope this helps