Tuesday, 30 June 2009

Googlebot, Sitemaps and heavy crawling

Googlebot over-crawling a site

I recently had an issue with one of my jobboards where Googlebot was over-crawling the site, which was causing the following problems:
  1. Heavy load on the server. The site in question was recording 5 million page loads a month, which had more than doubled from 2.4 million the month before.
  2. 97% of all traffic was accounted for by Googlebot.
  3. The site is on a shared server, so this heavy load was causing very high CPU usage and affecting other sites.

The reasons the site was receiving so much traffic boiled down to the following points.

  1. The site has a large number of categories which users can filter job searches by. These categories are displayed in whole and in subsets in prominent places such as quick links and a job browser that lets users filter results. As multiple categories can be chosen when filtering a search, Googlebot was crawling every possible combination in various orders.
  2. A new link had been added to the footer within the last month which passed a SessionID in the URL. The link was there to log whether users had JavaScript enabled. As Googlebot doesn't keep session state or use JavaScript, the number of crawled URLs effectively doubled: every page the crawler hit contained a link it hadn't already spidered because of the new SessionID.
  3. The sitemap had been set up incorrectly, containing URLs that didn't need crawling as well as incorrect change frequencies.
  4. The crawl rate was set to a very high level in Webmaster Tools.

Therefore a site with around a thousand jobs was receiving 200,000 page loads a day, nearly all of them from crawlers. To put this in perspective, other sites with 3000+ jobs, good SEO and high PR usually get around 20,000 page loads a day from crawlers.

One of the ways I rectified this situation was by changing the crawl rate to a low custom rate of 0.2 crawls per second. This caused a nice big vertical drop in the graph, which alarmed the site owner as he didn't realise that there is no relation between the number of pages crawled by Google and the site's page ranking or overall search engine optimisation.


Top Tips for getting the best out of crawlers

  • Set up a sitemap and submit it to Google, Yahoo and Live.
  • Make sure only relevant URLs are put in the sitemap. For example, don't include pages such as error pages or logoff pages.
  • If you are rewriting URLs then don't include the non-rewritten URL as well, as this will be counted as duplicate content.
  • If you are including URLs that take IDs as parameters to display database content then make sure you don't include the URL without a valid ID. Taking the site I spoke about earlier as an example, someone had included the following:
www.some-site.com/jobview.asp

instead of

www.some-site.com/jobview.asp?jobid=35056

This meant crawlers were accessing pages with no content, which was a pretty pointless and careless thing to do.

  • Make sure the change frequency value is set appropriately. For example, on a jobboard a job is usually posted for between 7 and 28 days, so it only needs to be crawled between once a week and once a month depending on how long it was advertised for. Setting a value of always is inappropriate, as the content will not change every time Googlebot accesses the URL (see the example sitemap entry after this list).
  • Avoid circular references such as placing links to a site index or category listings index in the footer of every page on a site. It makes it hard for the bot to determine the site structure, as every path it drills down it is able to find the parent page again. Although I suspect the bot's technology is clever enough to realise it has already spidered a link and not crawl it again, I have heard that it looks bad in terms of site structure.
  • Avoid dead links or links that lead to pages with no content. If you have a category index page and some categories have no content related to them, then don't turn those categories into links, or else link to a page that can show related content rather than nothing.
  • Prevent duplicate content and variations of the same URL being indexed by implementing one of the following two methods (both are sketched just after this list).
  1. Set your Robots.txt to disallow your non-rewritten URLs from being crawled and then only display rewritten URLs to agents identified as crawlers.
  2. Allow both forms of URL to be crawled but use a rel="canonical" link tag to specify that you want the rewritten version to be indexed.
  • Ban crawlers who misbehave. If we don't spank them when they are naughty they will never learn, so punish those that misbehave. It's very easy for an automated process to parse a Robots.txt file, so there is no excuse for bots that ignore the commands set out in it. If you want to know which bots ignore the Robots.txt rules there are various ways to find out, such as parsing your webserver log files or using a dynamic Robots.txt file to record the agents that access it. There are other ways, such as using the IsBanned flag available in the Browscap.ini file, however this relies on the user-agent being correct and more and more people spoof their agent nowadays. Not only is banning badly behaved bots good for your server's performance, as it reduces load, it's also good for your site's security, as bots that ignore the Robots.txt rules are more likely to hack, spam and scrape your site's content.
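
To illustrate the sitemap and duplicate-content tips above, here is a minimal sketch. The rewritten URL below is made up for the example; note that the canonical tag is actually a link element placed in the page head, not a meta tag.

<!-- example sitemap entry (placed inside the sitemap's urlset element) with a sensible change frequency -->
<url>
  <loc>http://www.some-site.com/jobs/web-developer-35056</loc>
  <changefreq>weekly</changefreq>
</url>

# Robots.txt - stop the non-rewritten version of the page being crawled
User-agent: *
Disallow: /jobview.asp

<!-- canonical link tag on the page itself, pointing at the rewritten URL -->
<link rel="canonical" href="http://www.some-site.com/jobs/web-developer-35056" />
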
If you are having similar issues with over-crawling then I would advise you to check your site's structure first, to see whether the problem is due to bad structure, invalid sitemap values or over-categorisation, before changing the crawl rate. Remember that a site's SEO is unrelated to the amount of crawler activity and more is not necessarily better. It's not the number of crawled pages that counts but rather the quality of the content the crawlers find when they visit.

Saturday, 20 June 2009

Using Google APIs

Creating a content-rich site using Google APIs

I recently had a domain renewal notification for a domain I had bought but never got round to using for its original purpose. I didn't want the domain to go to waste, so I thought about creating a site as a way for me to play with Google's APIs. They offer a wide range of objects and frameworks which let you add content to a site very quickly, such as the ability to search feeds, translate content on the fly from one language to another, search blogs, news and videos, and much more.


Hattrick Heaven

The site I created is a football site called www.hattrickheaven.com. Its main purpose, apart from being an example of Google's APIs in action, is to display the football league tables of countries all around the world. I found that on some other sites it took quite a few clicks from the home page to drill down to the league standings, so my site was a way to offer this data straight away. As well as the league tables I have links to the latest news, football-related blogs, discussion boards and videos.


Google Search APIs

To make use of Google's APIs you just need to set up an access key, which is linked to your domain and is free to use. This key is passed along in the URI when you reference the main Google JavaScript file. That file is all you need to include, as it handles the loading of all the other libraries with commands such as:
// load google APIS I want to use
google.load("feeds", "1");
google.load("language", "1");
google.load("search", "1");
As well as loading in Google's own framework with this main google.load method, you can also reference other well-used frameworks such as jQuery, YUI, Prototype, Dojo, MooTools and others, or you can do what I chose to do and reference the relevant framework directly from Google's CDN (Content Delivery Network).
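
For reference, the include and loader calls look something like the following sketch (YOUR-API-KEY is a placeholder for the key generated for your own domain, and the jQuery version is just an example):

<script type="text/javascript" src="http://www.google.com/jsapi?key=YOUR-API-KEY"></script>
<script type="text/javascript">
// ask Google's loader to serve a hosted copy of jQuery rather than hosting it myself
google.load("jquery", "1.3.2");
</script>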


Targeting Visitors with Google's APIs

One of the very cool features of Google's API is that you get geographical information about your visitors as standard, such as longitude, latitude, country, region and city. This was great for me as it meant I could use this information to default the search criteria for my feed, blog, news and video searches. If you're in the UK and live near a major footballing city such as Liverpool, Manchester or London, then HattrickHeaven.com will default the searches with the names of the clubs related to those cities, such as Man UTD and Man City for Manchester.

Below is an example from the site of my Visitor object, which uses the google.loader object to set up details about the current user's location and browser language.

(function(){

    // the language I wrote the system in and that translations will be converted from (en = English)
    var sysDefLanguage = "en";

    // NB: google.loader.ClientLocation can be null if Google cannot resolve the visitor's location
    V = H.Visitor = {
        sysDefLanguage : sysDefLanguage,
        Latitude : google.loader.ClientLocation.latitude,
        Longitude : google.loader.ClientLocation.longitude,
        CountryCode : google.loader.ClientLocation.address.country_code,
        Country : google.loader.ClientLocation.address.country,
        Region : google.loader.ClientLocation.address.region,
        City : google.loader.ClientLocation.address.city,
        // the language currently set in the user's browser, falling back to the system default
        BrowserLanguage : (navigator.language || navigator.browserLanguage || sysDefLanguage),
        Language : "",
        isEnglish : false
    };

    // set the visitor's language making sure "en-gb", "en-us" etc. are converted to "en"
    V.Language = H.ConvertLanguage(V.BrowserLanguage);
    V.isEnglish = (V.Language == "en"); // check for English

})();
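
The H.ConvertLanguage helper isn't shown above; all it needs to do is reduce codes like "en-gb" or "en-US" down to the two-letter language part. A minimal sketch of such a function might look like this (the real implementation on the site may differ):

H.ConvertLanguage = function(lang){
    // reduce "en-gb", "en-US" etc. to the lowercase two-letter language code
    return (lang || "en").split("-")[0].toLowerCase();
};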

Translating Content

Another cool feature is the ability to translate content from one language to another on demand. I make use of this on www.hattrickheaven.com to translate the headers and content delivered through Google's feed and search objects if the user's language is different from that set for the page (at the moment it's all English). You can see this in action on a specific page I created, http://www.hattrickheaven.com/spanish-news, which converts the content from English into Spanish once it has been inserted into the DOM. The code to do this is very simple: you just pass the text to convert, the language code of the language the text is written in, the language code of the language to translate the text into, and a callback function to run once the translation has completed.
google.language.translate(txt, langFrom, langTo, function(result){
// On translation this method is called which will either run the function
// defined in destFunc or set the innerHTML for outputTo
self.TranslateComplete(result, destFunc, outputTo);
});
On www.hattrickheaven.com I have created my own wrapper object for the site which encapsulates a lot of the various Google options and makes it very easy for me to specify new content to search, translate and output. I have options which control the number of results to show, whether to ignore duplicate URLs, and whether to show just the link or the snippet of content underneath. For example, the following is the code I use on the news page:

<script type="text/javascript">
// load google APIS I want to use
google.load("feeds", "1");
google.load("language", "1");
google.load("search", "1", H.Visitor.Language);

H.newsPageSteup = function(){
//set up 2 feed objects for the sidebar content
var blog = new H.Feeder();
var wnews = new H.Feeder();

blog.OutputNo = 12; // output 12 links
blog.FeedType = "blogs"; // set the type of feed
blog.getFeedOutputElement = "blogs"; // the node ID to output results
blog.findFeeds(); // find some relevant blogs, translate if necessary and output

wnews.FeedType = "news";
wnews.searchQuery = "World Cup 2010 News"; // overwrite the default search terms
wnews.ShowSnippet = true; // show the content snippet under the link
wnews.OutputNo = 5; // output 5 news articles
wnews.getFeedOutputElement = "worldcupnews"; // where to output news results
wnews.findFeeds(); // run news feed search, translate if necessary and output

// set up a search control to output a search box, paging for results etc
var news = new H.SearchControl();
news.controlType = "news"; // tell object to search for news
news.getFeedOutputElement = "news"; // where to output results in DOM
news.searchSetup(); // run search, translate and output results

// if visitor is not English then I want to translate some headers on this page
if(!V.isEnglish){
var sobj = new H.Search();
var arr = ["WorldCupNewsHeader","NewsHeader","BlogsHeader"];
sobj.TranslateContents(arr);
}

}
// On load run my initialise function
google.setOnLoadCallback(H.newsPageSteup,true);
</script>


As you will see if you take a look at the site, it's very easy to get some rich content up with very few lines of code. The only issue I currently have is that this functionality is all being done client-side with JavaScript, which leads to two problems.

1. Roughly 10% of visitors (ignoring bots) have JavaScript disabled. This means that apart from the league tables the site will look pretty bare to them.

2. Because the content is all loaded in using JavaScript and is only visible in the DOM after the page has loaded, the source code is going to be pretty empty for SEO purposes. I have a few ideas floating around regarding this and I will give more details if any of them come to fruition.

All in all I am pretty impressed with the framework, especially its simplicity, and hopefully others will feel the same way once they get stuck into developing with it.



Tuesday, 9 June 2009

Optimizing a query with Forced Parameterization

SQL 2005 Forced Parameterization Overview

SQL 2005 has a feature which will take AdHoc queries containing literal values and replace those values with parameters. This means that the query plan which gets cached can be re-used for similar queries that have different values, which can aid performance as it reduces compilation time. This is called FORCED PARAMETERIZATION; the default mode is SIMPLE, which will only re-use cached AdHoc plans for queries containing the same literal values. This feature is useful for systems where stored procs and prepared statements aren't being used, as those forms of query will already benefit from cached plan re-use. You should read BOL about the suitability of this feature, as certain constructs and features will prevent it from being utilized. It may also result in unsuitable query plans being re-used in certain cases rather than a new plan being generated for a query.
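
If you want to check which mode a database is currently using, the sys.databases catalog view exposes it (replace YOUR_DATABASE with the relevant database name):

-- 1 = FORCED, 0 = SIMPLE
SELECT name, is_parameterization_forced
FROM sys.databases
WHERE name = 'YOUR_DATABASE'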


Using Forced Parameterization

I came across this feature on the wonderful SQLServerCentral.com site in a discussion about dynamic SQL. I read up about it and thought it might come in useful on a site that has intermittent problems on a results page when hit by concurrent crawlers, which results in high CPU on the SQL Server. The page in question is a results page that allows filtering and ordering by multiple columns in both directions and also uses paging. However the query is an AdHoc query that is built up on the client by appending literal values to a string, rather than using parameters and something like ADO's SqlCommand object to pass those values to the database.

I looked into rewriting this query to make use of parameters, but the WHERE clause was being cached and re-used by multiple queries on the same page, so it wasn't a simple change. It therefore seemed like a good candidate for FORCED PARAMETERIZATION.
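
For comparison, the parameterised rewrite I decided against would have looked something along these lines (a minimal sketch using the simplified JOBS query that appears later in this post):

-- pass the values as parameters via sp_executesql so the plan can be re-used
-- without needing FORCED parameterization
EXEC sp_executesql
    N'SELECT * FROM JOBS WHERE Row BETWEEN @from AND @to',
    N'@from int, @to int',
    @from = 1, @to = 20;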


Testing the theory

Before enabling this feature for the live site I wanted to prove on our development box that
a) Simple mode just wasn't good enough and was causing multiple plans to be compiled and cached for this query.
b) Forced mode would enable one plan to be cached and re-used for multiple parameter variations of this query.


Enabling Forced Parameterization

I had some trouble with this due to a number of other sites/blogs showing the incorrect way to set it, but the correct way is:
-- enable forced mode
ALTER DATABASE YOUR_DATABASE SET PARAMETERIZATION FORCED

-- enable simple mode
ALTER DATABASE YOUR_DATABASE SET PARAMETERIZATION SIMPLE

You can also use Management Studio and the properties option under the relevant database to switch the setting over manually. When switching from one mode to another I made sure to clear the current cache by running the following:
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
I then used the following query, which makes use of SQL 2005's DMVs (dynamic management views), to check the queries that were currently cached and how their plans were being cached.

SELECT *
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
INNER JOIN sys.dm_exec_cached_plans AS cp ON qs.plan_handle = cp.plan_handle
WHERE cp.objtype = 'Adhoc' OR cp.objtype = 'Prepared'

Notice that I am filtering on AdHoc and Prepared and ignoring stored procs, triggers and views. I initially tried filtering by the dbid (database ID) column to narrow the results down to my particular database, however for AdHoc and Prepared queries this column contains NULL values.

The columns I am particularly interested in are plan_handle, which is the ID of the query plan related to the SQL statement, and execution_count, which tells me how many times the SQL related to the plan has been executed.

A very simple version of the query in question which conveys the point I am trying to make is below.

SELECT * FROM JOBS WHERE Row between 1 AND 20

With SIMPLE mode enabled, a test page running my query with different criteria values for the BETWEEN statement caused multiple records to be returned from sys.dm_exec_cached_plans, each with a different plan_handle value and an objtype of Adhoc. This tells me that the plans are not being re-used by similar queries when they could be. Running the same query with the same values does increment the execution_count against the relevant plan record, which shows that a very basic form of re-use is happening. However, as I want the best performance possible, this isn't good enough.
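
To make that concrete, under SIMPLE mode two statements that differ only in their literal values, such as the pair below, each end up with their own cached AdHoc plan (following the behaviour described above):

-- cached as two separate AdHoc plans under SIMPLE parameterization
SELECT * FROM JOBS WHERE Row BETWEEN 1 AND 20
SELECT * FROM JOBS WHERE Row BETWEEN 21 AND 40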

With FORCED mode enabled and the cache cleared, the test pages with different criteria for the BETWEEN statement resulted in a single record being returned with one plan_handle. The value for objtype has now changed from Adhoc to Prepared, and looking at the start of the text column, which contains the SQL, I can see the following before the start of my SQL:
(@0 int,@1 int,@2 int,@3 int,@4 varchar(8000),@5 int,@6 int)

This is the database creating the parameters required to replace the literal values in my query. The values from my BETWEEN clause have been changed from literals to parameters, e.g.:
--Simple Mode Enabled
WHERE Row BETWEEN 1 AND 20
--Forced Mode Enabled
WHERE Row BETWEEN @5 and @6
Re-running the test page with different values for these parameters causes the execution_count to increment, which tells me that the query plan is being cached and re-used across similar queries, which is what I wanted to prove.


Enabling Forced Parameterization for a single query

There may be situations where you want to enable this feature for one or more queries rather than turn it on for the whole database. You can do this by setting up a plan guide for the query in question. An easy way to do this is to run the following SQL, making sure to replace the first parameter value with the SQL you want to parameterize.


DECLARE @stmt nvarchar(max);
DECLARE @params nvarchar(max);
EXEC sp_get_query_template
N'SELECT * FROM JOBS WHERE Row between 1 AND 20;',
@stmt OUTPUT,
@params OUTPUT;
EXEC sp_create_plan_guide
N'MyPlanGuide_JobResults',
@stmt,
N'TEMPLATE',
NULL,
@params,
N'OPTION(PARAMETERIZATION FORCED)';


You can then set your database to SIMPLE mode but still benefit from having parameterized queries that match the template in the plan guide you have just created. Some useful SQL for managing plan guides is below:


--View all plans in your system
SELECT * FROM sys.plan_guides

--Disable the plan guide.
EXEC sp_control_plan_guide N'DISABLE', N'MyPlanGuide_JobResults';
GO

--Enable the plan guide.
EXEC sp_control_plan_guide N'ENABLE', N'MyPlanGuide_JobResults';
GO

--Drop the plan guide.
EXEC sp_control_plan_guide N'DROP', N'MyPlanGuide_JobResults';