Monday 1 December 2008

Job Rapists and other badly behaved bots

How bad does a bot have to be before banning it?

I had to ban a particularly naughty crawler the other day belonging to one of the many job aggregators that spring up ever so regularly. This particular bot belonged to a site called Jobrapido, which one of my customers calls "that job rapist" for the very good reason that it seems to think any job posted on a jobboard is theirs to take without asking either the job poster or the owner of the site it's posted on. Whereas the good aggregators such as JobsUK require the job poster or jobboard to submit a regular XML feed of their jobs, Rapido just seems to endlessly crawl every site they know about and take whatever they find. In fact on their home page they state the following:

Do you know a job site that does not appear in Jobrapido? Write to us!

Isn't that nice? Anyone can help them thieve, rape and pillage another site's jobs. Maybe there is some bonus point system for regular informers. Times are hard as we all know and people have to get their money where they can! However it shouldn't come from taking other people's content unrequested.

This particular bot crawls so much that it actually pushed one of our tiniest jobboards into our top 5 traffic rankings one month purely from its crawling alone. The site had fewer than 50 jobs but a lot of categories, which meant the bot decided to visit each day and crawl every single search combination: 50,000 page loads a day!

Although the Rapido site did link back to the original site that posted the job so the jobseeker could apply, the number of referrals was tiny (150 a day across all 100+ of our sites) compared to the huge amount of bandwidth it was stealing from us (16-20% of all traffic). It was regularly ranked above Googlebot, MSN and Yahoo as the biggest crawler in my daily server reports, as well as being the biggest hitter (page requests / time).

So I first tried banning it with robots.txt directives, as any legitimate, well-behaved bot should pay attention to that file. However, two weeks after adding all 3 of their agents to the file they were still paying us visits every day, and no bot should cache the file for that length of time, so I banned it using the .htaccess file instead.

So if you work in the jobboard business and don't want a particularly heavy bandwidth thief, content scraper and robots.txt ignorer hitting your sites every day, then do yourself a favour and ban the following agents and IP addresses (an example .htaccess block follows the list):

Mozilla/5.0 (compatible; Jobrapido/1.1; +http://www.jobrapido.com)

"Mozilla/5.0 (compatible; Jobrapido/1.1; +http://www.jobrapido.com)"

Mozilla/5.0 (JobRapido WebPump)

85.214.130.145
85.214.124.254
85.214.20.207
85.214.66.152
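If you happen to be fronting your sites with Apache, something along these lines in the .htaccess file will do the job. This is only a sketch using mod_setenvif; adapt the agent string and IP addresses to whatever shows up in your own logs.

# Block Jobrapido by user agent (matches all three agent strings above, case-insensitively)
SetEnvIfNoCase User-Agent "jobrapido" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
# Belt and braces: block the IP addresses listed above as well
Deny from 85.214.130.145 85.214.124.254 85.214.20.207 85.214.66.152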

If you want a good UK job aggregator that doesn't pinch sites' jobs without asking first then use JobsUK. It's simple to use and has thousands of new jobs from hundreds of jobboards added every day.

Adding Remote URL content to Index

Indexing a remote URL for use in a knowledge base

I have just completed work on a small knowledge base built in ASP.NET which has a few quite funky features, one of which is the ability to add an article into the system that lives at a remote location. Most of the articles revolve around written content or files attached to the articles, but sometimes users come across an article on the web that they think would be great to add to the system, and they want it indexed and searchable just like any other article. In the previous incarnation, which I hastily wrote one night back in the late 90's in classic ASP, you could add a URL but the only indexable content that could be used to find it in the knowledge base was the tag words the user added alongside the URL. Obviously this isn't really good enough, so in the latest version, on saving the article I do the following:

  1. Check the URL looks valid using a regular expression.
  2. Access the URL through a proxy server and return the HTML source.
  3. Locate and store the META keywords, description and title if they exist.
  4. Remove everything apart from content between the start and close BODY tags.
  5. From the body I strip any SCRIPT tags and anything between them.
  6. Remove all HTML tags.
  7. Clean the remaining content by removing noise words, numbers and swear words.
  8. I add the remaining content which consists of good descriptive wording to the META keywords, description and title which I stored earlier.
  9. I save this content to the database which then updates the Full Text Index so that it becomes searchable by the site users.

Following this process means that I get all the benefits of having the remote article indexed and searchable without the downside of having to store the whole HTML source code. After cleaning I am left with only the core descriptive wording that is useful and do away with all the rubbish.

Below are the two main methods that retrieve the URL content and clean the source, written in C#.


1. Method to access remote URL through proxy server.




public static string GetURLHTML(string remoteURL, string proxyServer)
{
    //requires the System.Net, System.IO and System.Text namespaces
    string remoteURLContent = "";

    WebProxy proxy = new WebProxy(proxyServer, true); //pass the name of the proxy server
    WebRequest webReq = WebRequest.Create(remoteURL);
    webReq.Proxy = proxy; //set request to use proxy

    // Set the HTTP-specific UserAgent property so those sites know who's come and ripped them up
    if (webReq is HttpWebRequest)
    {
        ((HttpWebRequest)webReq).UserAgent = ".NET Framework Strategies Knowledge Base Article Parser v1.0"; //set up my useragent
    }

    WebResponse webResp;
    int responseStatusCode = 0;

    try
    {
        // Get the response instance
        webResp = (HttpWebResponse)webReq.GetResponse();

        // Read an HTTP-specific property
        if (webResp is HttpWebResponse)
        {
            responseStatusCode = (int)((HttpWebResponse)webResp).StatusCode;
        }
    }
    catch (Exception ex)
    {
        //could not get a response at all so return the empty string
        return remoteURLContent;
    }

    //we can only collect HTML from valid responses so ignore 404s and 500s
    if (responseStatusCode != 200)
    {
        return remoteURLContent;
    }

    // Get the response stream
    Stream respStream = webResp.GetResponseStream();

    StreamReader reader = new StreamReader(respStream, Encoding.ASCII);
    remoteURLContent = reader.ReadToEnd();

    // Close the response and response stream
    webResp.Close();

    return remoteURLContent;
}



The reason I use a proxy is down to the security policy set on our web servers.


2. Method to gather the main content.



//When the article poster wants us to save a remote URL as the KB article content we need to get the content and parse it
protected string IndexURL(string remoteURL)
{
    string METAKeywords = "", METADescription = "", METATitle = "";
    string cleanHTML = "";
    StringBuilder indexText = new StringBuilder();

    //As I have to access all remote URLs through a proxy server I access my application setting from the web.config file
    string proxyServer = ConfigurationManager.AppSettings["ProxyServer"].ToString();

    //now access the remote URL and return the HTML source code if we can
    string remoteURLHTML = UtilLibrary.GetURLHTML(remoteURL, proxyServer);

    //if we have some HTML content to parse and clean
    if (!String.IsNullOrEmpty(remoteURLHTML))
    {
        remoteURLHTML = remoteURLHTML.ToLower(); //lower case it all as a) it doesn't matter and b) it means no need for ignore-case options in the regular expressions

        //Set up some regular expressions to help identify the META content we want to index in the source
        Regex HasKeywords = new Regex("<meta\\s+name=\"keywords\"");
        Regex HasDescription = new Regex("<meta\\s+name=\"description\"");
        Regex HasTitle = new Regex("<title>");

        //As I am using replaces to quickly return the content I require I do a test first for the relevant tag, otherwise if the source doesn't
        //contain the META tag we would be left with the whole HTML source which we obviously don't want!!
        if (HasKeywords.IsMatch(remoteURLHTML))
        {
            //get the data we require by replacing anything either side of the tag
            METAKeywords = "KEYWORDS = " + Regex.Replace(remoteURLHTML, "((?:.|\n)+?<meta\\s+name=\"keywords\"\\s+content=\")(.+)(\"(?:.|\n)+)", "$2");
        }
        if (HasDescription.IsMatch(remoteURLHTML))
        {
            METADescription = "DESCRIPTION = " + Regex.Replace(remoteURLHTML, "((?:.|\n)+?<meta\\s+name=\"description\"\\s+content=\")(.+)(\"(?:.|\n)+)", "$2");
        }
        if (HasTitle.IsMatch(remoteURLHTML))
        {
            METATitle = "TITLE = " + Regex.Replace(remoteURLHTML, "((?:.|\n)+?<title>)(.+)(<\\/title>(?:.|\n)+)", "$2");
        }

        cleanHTML = remoteURLHTML;

        //now get the main content which is between the open and close body tags
        cleanHTML = Regex.Replace(cleanHTML, "((?:.|\n)+?<body.*?>)((?:.|\n)+?)(<\\/body>(?:.|\n)+)", "$2");

        //strip any client side script by removing anything between open and close script tags
        cleanHTML = Regex.Replace(cleanHTML, "<script.*?</script>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);

        //put a gap before words that appear just before closing tags so that we keep gaps between values from listboxes
        cleanHTML = Regex.Replace(cleanHTML, "(\\w)(<\\/\\w)", "$1 $2");

        //strip HTML tags
        cleanHTML = Regex.Replace(cleanHTML, "<[^>]+?>", "");

        //Decode the HTML so that any encoded HTML entities get stripped
        cleanHTML = HttpUtility.HtmlDecode(cleanHTML);

        //now add all the content we want to index back together
        if (!String.IsNullOrEmpty(METAKeywords))
        {
            indexText.Append(METAKeywords + " ");
        }
        if (!String.IsNullOrEmpty(METADescription))
        {
            indexText.Append(METADescription + " ");
        }
        if (!String.IsNullOrEmpty(METATitle))
        {
            indexText.Append(METATitle + " ");
        }
        if (!String.IsNullOrEmpty(cleanHTML))
        {
            indexText.Append(cleanHTML);
        }
    }

    return indexText.ToString();
}


I have left out the other function that strips noise words, numbers and swear words as it's nothing special, just a couple of loops that check some arrays containing the words that need removing.
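For illustration only, a rough sketch of the kind of function I mean is below. This is not the actual code and the word lists are obviously placeholders; the real ones are much longer.

//Illustration only - strips numbers plus anything in the noise and banned word lists
//requires the System and System.Text namespaces
protected string CleanContent(string content)
{
    //placeholder lists for the sketch
    string[] noiseWords = { "the", "and", "but", "for", "with" };
    string[] swearWords = { "xxxx", "yyyy" };

    StringBuilder cleaned = new StringBuilder();

    foreach (string word in content.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
    {
        string w = word.Trim();

        //skip pure numbers
        double num;
        if (double.TryParse(w, out num)) continue;

        //skip noise and swear words
        if (Array.IndexOf(noiseWords, w) >= 0) continue;
        if (Array.IndexOf(swearWords, w) >= 0) continue;

        cleaned.Append(w);
        cleaned.Append(" ");
    }

    return cleaned.ToString().Trim();
}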

The performance of this method varies slightly depending on the size of the content being parsed. You could also leave the noise words and numbers in the content, as SQL Server will automatically ignore most of them when building the Full Text Index anyway. However if data storage is an issue you may still want to strip them so that you only save the core content to the database table.

Monday 24 November 2008

Trying to detect spoofed user-agents

User Agent Spoofing

A lot of traffic comes from browsers masking their real identity, either by using a user agent belonging to a different browser or by using a random string that relates to no known browser at all. The reasons for doing this are manifold, from malicious users trying to hide their identity to get round code that bans by user-agent, to getting round client side code that blocks certain browsers from using certain functionality. All of which is a good argument for using object detection rather than browser sniffing when deciding which code branch to run.

However, as you will probably know if you have tried doing anything beyond simple JavaScript coding, there are still times when you need to know the browser because object detection just isn't feasible, and trying to use object detection to work out the browser is, in my opinion, just as bad as using browser sniffing to work out the object.

Therefore when someone is using an agent switcher to mask the browser's agent and you come across one of these moments, you may end up running code that raises errors. There is no foolproof way to spot whether an agent is spoofed, but if you do require this information one of the things you can do is compare the agent with objects the claimed browser should support, and if they don't match then you have confirmed the spoof.

This form of spoof detection will only work if it's just the user agent string that has been changed. As examples of the kind of checks you can do: for agents that say they are Opera, make sure window.opera is supported as well as both event models (document.addEventListener && document.attachEvent), since as far as I know it's the only browser that supports both. For IE you shouldn't check document.all by itself, as you will actually find Firefox will return true for this, but you can check for window.ActiveXObject, the non-existence of addEventListener, and use JScript conditional compilation to test for JScript. Firefox should obviously not support JScript as it uses JavaScript.
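As a rough sketch of the kind of checks described above (the function name and structure are my own, and as already said it only catches spoofs where nothing but the user agent string has been changed):

function looksSpoofed(){
    var ua = navigator.userAgent.toLowerCase();
    var claimsOpera = ua.indexOf("opera") != -1;
    var claimsIE = ua.indexOf("msie") != -1 && !claimsOpera;

    //Opera identifies itself with window.opera and (uniquely) supports both event models
    if(claimsOpera && !(window.opera && document.addEventListener && document.attachEvent)){
        return true;
    }

    //IE 6/7 exposes ActiveX and has no addEventListener
    if(claimsIE && !(window.ActiveXObject && !document.addEventListener)){
        return true;
    }

    return false; //agent and objects agree, or we cannot tell
}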

Those are just a few of the checks you could make; you are basically using object detection and agent sniffing together to make sure they match. They may not tell you the real browser being masked, but they can tell you what it is not.

The idea is to make sure that in those cases where you have to branch on browser rather than object (see this previous article) you make the right choice and don't cause errors. Obviously you may decide that if the user is going to spoof the agent then they can be left to suffer any errors that come their way.

If you do require a lightweight browser detector that checks for user agent spoofing amongst the main browsers as well as support for advanced CSS, Flash and other properties then see this article.

Tuesday 18 November 2008

Problems upgrading to Firefox 3

Invalid Null Pointer Error relating to DesignMode and ContentEditable

I finally upgraded to Firefox 3 the other day on my work PC but only after I managed to fix an annoying bug that had suddenly appeared in my CMS WYSIWYG editor when I had upgraded at home.

The bug occurred whenever I closed my editor (display:none;) and then re-opened it on the same page. I finally tracked it down to the code that initialises the editor on opening, where it sets designMode and contentEditable on. In Firefox 2 this didn't seem to cause any problems if these properties had already been set, but in FF3 it caused invalid null pointer errors. A quick check to make sure I wasn't trying to turn something on that was already on fixed the problem, e.g:


if(self.IE && theIframe.contentWindow.document.body.contentEditable!=true){
    theIframe.contentWindow.document.body.contentEditable = true;
}
if(!self.IE && theIframe.contentWindow.document.designMode!="on" ){
    //in moz you cannot change design mode straight away so try and then retry
    try{
        theIframe.contentWindow.document.designMode = "on";
        theIframe.contentWindow.document.execCommand("redo", false, null);
    }catch(e){
        setTimeout(function(){self.MakeEditable(theID);},250);
        return false;
    }
}


UserAgentButton is null Error in Firefox 3

Anyway, I upgraded to Firefox 3 and then started getting "UserAgentButton is null" errors appearing all over my main site. The error in Firebug referred to lines in my own code, so it was quite confusing at first as I do not have any code that mentions UserAgentButton.

I disabled Firebug and still the problem persisted, until I remembered I had installed another add-on the other day that allows you to switch user agent. The plug-in is called "User Agent Switcher" version 0.6.11 and is very useful for testing sites. After disabling this add-on the problem went away. So if you see a similar problem, try disabling any add-ons first before spending any time debugging your code.


Flash Not Playing Movies Correctly

Another problem I came across after upgrading to FF3 was with Flash movies not playing correctly. They would start playing but only for a few seconds before stopping and freezing. The problem was definitely related to FF3 as they played correctly in IE and other browsers. I read a lot of "fixes" for this problem on the web but the only thing that worked successfully for me was to upgrade Flash to version 10.

I am not sure what other problems will come up and surprise me, but the Flash problem alone was reason enough not to make FF3 a level 1 supported browser on my main site. If anyone else has come across issues relating to upgrading to Firefox 3 please let me know.

Tuesday 4 November 2008

Foreign Keys

In my ten years of database development I have come across a number of people who did not believe in using foreign key constraints in their database systems. Some of these people have actually been my seniors at the time, and it's been very frustrating to ask why they are against enforcing referential integrity and get replies along the lines of:

"When I was first starting out and didn't know any better I used to use foreign keys but now I know what I am doing so I let the application handle it all."

What this actually means is that they were taught by someone who thought they knew better than everyone else, and they have never come to realise that, rather than being an SQL guru, their teacher was just another person who thought they knew better than everyone else, and they have foolishly followed in the same footsteps.

The way I see it, the whole point of a relational database is to enforce data integrity. If you just want to persist data then there are probably better and faster ways, but if you remove the relationships and constraints then you are just using your fancy and maybe expensive RDBMS as a storage system.

The analogy I use for a relational database is a car, where the constraints are the safety features within the car, i.e. seat belts, airbags, ABS etc. You can build a car without all these safety features and it may run perfectly well, but if you're unfortunate enough to have a crash you are going to be in a lot of trouble, and you will wish you hadn't removed those safety features for some imaginary performance gain. The data is your car's passenger and the constraints are there to ensure that the passenger arrives in one piece. If you accidentally bump into the car in front of you the air bag will prevent a broken nose, just like a check constraint will prevent dodgy data getting inserted into your table when there is a bug in your code. You can live with bumps and bruises just like you can live with dodgy data, however life is better without a bent nose and you will get fewer customer complaints when the data they report on is guaranteed to be accurate.

The reasons I have been given for not using foreign key constraints are usually one of the following two:

"It slows your system down and you should handle your data integrity checks in the application."


"It makes my life difficult when setting up sites and transferring data between systems due to all those referential key constraint errors when I try to insert data."

I will deal with each one in turn.

1. You are not going to beat the RDBMS for speed at its own game. If you are going to check data for integrity then the best place to do it is as close to the data as possible, so if you care about the quality of your data let the RDBMS handle it. For example, let's imagine a website that lets a user store a number of selected list options against their profile, and let's imagine we are going to let the application handle the integrity checks rather than the RDBMS and see what problems we may have.

Let's say the system has a table that holds the users who are on the website and allowed to save profiles.


CREATE TABLE USERS(UserPK int PRIMARY KEY, UserName varchar(50))

We also have a table that stores a number of locations which are displayed to the user on the profile page. The primary key is a surrogate key, which is an integer.


CREATE TABLE LOCATIONS(LocationPK int PRIMARY KEY, Location varchar(50))

The user can select multiple options so rather than hold the chosen Location value in the actual USERS table we have created a separate table that enables us to have a ONE to MANY relationship between an individual user and their selected locations.

CREATE TABLE USER_LOCATIONS (UserFK int,
    LocationFK int,
    CONSTRAINT pk_USER_LOCATIONS PRIMARY KEY(UserFK,LocationFK))

As you can see we have created a table with columns that reference the primary keys in both the USERS and LOCATIONS tables, but we have not enforced referential integrity by defining them as FOREIGN KEY constraints. I have however defined a PRIMARY KEY, as this is still usually done by the majority of people even when foreign keys are forgotten about or deliberately left off.

So we are going to let our application handle all data checks. Therefore we have a web page with a list that lets our USER choose the locations to save.


<select id="lstLocation" size="1" multiple="multiple">
<option value="1">Greater London</option>
<option value="2">South East</option>
<option value="3">South West</option>
</select>

Point 1. If the list only shows options that exist in the LOCATIONS table then we do not have to worry about invalid values being saved? WRONG! HTML can be manipulated, hacked and HTTP requests forged. Anyone with a web developer toolbar could change this list and therefore the values the form submits.

Therefore we have a tampered POST request from the form that has supplied us with the following values: 454,676

We call our stored procedure that handles UPDATES/INSERTS to the USER_LOCATIONS table:

EXEC dbo.usp_save_user_locations @LocationIDs = '454,676';

Point 2. Within our stored procedure we could only INSERT new values into the USER_LOCATIONS table if they exist in the LOCATIONS table, by doing either a LEFT JOIN or an IF EXISTS check. Yes we could do this, but we would be carrying out the same check that a FOREIGN KEY constraint would do anyway, only at a higher level.
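To make the point concrete, the check inside a hypothetical usp_save_user_locations would end up looking something like this (the parameter names are illustrative only), which simply duplicates what a FOREIGN KEY gives you for free:

-- Hypothetical validation inside the procedure, duplicating the work of a FOREIGN KEY
IF EXISTS (SELECT 1 FROM USERS WHERE UserPK = @UserID)
   AND EXISTS (SELECT 1 FROM LOCATIONS WHERE LocationPK = @LocationID)
BEGIN
    INSERT INTO USER_LOCATIONS (UserFK, LocationFK)
    SELECT @UserID, @LocationID
END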

Point 3. We could use TRIGGERS to ensure that the data we want to insert is correct, but this is the same as point 2: we are replicating behaviour best left to the RDBMS. Some people use TRIGGERS to handle CASCADING DELETES, although the discussion of CASCADES is another topic unrelated to this basic discussion, and in that case it may be a good idea as it gives you more control over the data you are removing. The point is that any code written by the developer to replicate basic functionality the RDBMS can already handle is always open to coding errors.

Even if you decide that using the application is the way to go, you are still left with the problem of the management console. Who has access to your database server? Even if access to your system is limited to a few trusted people, the fact remains that it's possible to bypass any checks your application may have by opening the table and editing or deleting the data by hand, whether by accident or on purpose. FOREIGN KEY constraints will prevent this from occurring, and this is the key point in my opinion, going back to my car analogy. They are your safety belt and airbag, preventing your data from getting crushed and mangled in any database accident. Whether that accident happens on purpose or through a bug in your code, they are your last line of defense, and the best kind you can get, because the checks are as close to the actual data as possible and therefore cover all possible points of corruption.

The correct way to save all this trouble and potential pitfalls is to define our USER_LOCATIONS table with FOREIGN KEY relationships such as the one below:

CREATE TABLE USER_LOCATIONS (UserFK int CONSTRAINT fk_USER_LOCATIONS_UseFK FOREIGN KEY (UserFK) REFERENCES USERS(UserPK),
LocationFK int CONSTRAINT fk_USER_LOCATIONS_LocationFK FOREIGN KEY (LocationFK) REFERENCES LOCATIONS(LocationPK),
CONSTRAINT pk_USER_LOCATIONS PRIMARY KEY(UserFK,LocationFK))

As you can see the primary key consists of both foreign keys. One relates to the LOCATIONS table and holds the ID for the chosen location. The other key relates to the USERS table and holds the ID of the user who made the selection.

We can test this by populating our tables with some test data.

INSERT INTO USERS
SELECT 101, 'Mr Smith'

INSERT INTO LOCATIONS
SELECT 1,'Greater London'
UNION
SELECT 2, 'South East'
UNION
SELECT 3, 'South West'

Now running the following will be okay, as both a USER with a UserPK of 101 and a LOCATION with a LocationPK of 1 exist in the tables referenced by the FOREIGN KEY relationships.

INSERT INTO USER_LOCATIONS
SELECT 101, 1

However running the following SQL will result in an error being raised, as neither the USER nor the LOCATION exists in the relevant referenced table with the corresponding value.

INSERT INTO USER_LOCATIONS
SELECT 102, 12


Msg 547, Level 16, State 0, Line 1
The INSERT statement conflicted with the FOREIGN KEY constraint "fk_USER_LOCATIONS_UseFK". The conflict occurred in database "XXX", table "dbo.USERS", column 'UserPK'.
The statement has been terminated.

As for making life difficult, well, it's supposed to, if by making life difficult you mean preventing you from adding or deleting data willy-nilly. You cannot TRUNCATE a table engaged in FOREIGN KEY constraints, which may cause problems, but then in a production environment why would you want to without taking the related data into consideration? You can easily create scripts to drop all your constraints and then reapply them if you need to do this. There are also easy ways to disable your constraints if you do not want to enforce checks when inserting data, for example:

-- Disable all table constraints
ALTER TABLE USER_LOCATIONS NOCHECK CONSTRAINT ALL

-- Enable all table constraints (use WITH CHECK CHECK CONSTRAINT ALL to also re-validate existing rows)
ALTER TABLE USER_LOCATIONS CHECK CONSTRAINT ALL

-- Disable single constraint
ALTER TABLE USER_LOCATIONS NOCHECK CONSTRAINT fk_USER_LOCATIONS_UseFK

-- Enable single constraint
ALTER TABLE USER_LOCATIONS CHECK CONSTRAINT fk_USER_LOCATIONS_UseFK
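And if you need to do this for every table in one go, a query along these lines (a sketch using the SQL 2005 catalog views; OBJECT_SCHEMA_NAME needs SP2 or later) will generate the statements for you:

-- Generate DISABLE statements for every foreign key in the database
SELECT 'ALTER TABLE ' + QUOTENAME(OBJECT_SCHEMA_NAME(parent_object_id)) + '.'
     + QUOTENAME(OBJECT_NAME(parent_object_id))
     + ' NOCHECK CONSTRAINT ' + QUOTENAME(name) + ';'
FROM sys.foreign_keys

-- Generate RE-ENABLE statements (WITH CHECK re-validates the existing rows)
SELECT 'ALTER TABLE ' + QUOTENAME(OBJECT_SCHEMA_NAME(parent_object_id)) + '.'
     + QUOTENAME(OBJECT_NAME(parent_object_id))
     + ' WITH CHECK CHECK CONSTRAINT ' + QUOTENAME(name) + ';'
FROM sys.foreign_keys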

If you're a DBA giving me this argument then I would say you're not a proper DBA in the first place if you don't understand the importance of the data in a relational database. And what DBA worth the name does not have to hand, or the ability to write, a simple script to drop and reapply constraints? Surely that little bit of trouble is worth it if it means your key business resource, i.e. the data, is left intact and uncompromised. If you take away the relations all you have left are isolated tables that persist data.

I will reminisce for a moment about a job I had some time back working for a telecoms company that specialised in predictive dialler software for call centres. My boss designed the database system and it was not in any way what you could call a normalised relational database. There was plenty of redundancy and lots of flat tables without a relationship in sight. His argument was that it was a real-time database that had to process thousands of calculations and DML statements a second, and therefore performance was the key issue: relationships would have slowed down INSERTS, so the denormalised schema was appropriate to the system. My job was to develop the OLTP reporting and management systems, and he used to joke about me enforcing referential integrity and how it was a rookie's mistake that experts like himself did not make. However, my other job at the company was to write the numerous reports for the call centre customers using the dialler data from his "unrelational" system. What a nightmare that was! The amount of hours I spent hunting for missing records, trying to work out why I had data for one time slot in table A but no corresponding data in table B when it should have been there. Yes, his system may have been fast, but because there were no defined relationships the data integrity was poor. I can understand wanting to make a system fast, but having a well defined relational design should not mean the opposite.

Apart from those two reasons, I think the main reason for not using referential integrity is purely laziness, or just the fact that on a lot of sites you can get away without it and no-one notices. A lot of sites are actually developed without RI, and extra coding is carried out to handle the fact that a percentage of the data is rubbish, orphaned records or duplicates due to not having defined constraints. Systems like this, if they still care about the quality of the data, will have scheduled jobs to clean up these records, or they will have to spend extra time on reports, like I had to, to accommodate the fact that they are reporting on data that cannot be guaranteed to be 100% accurate.

Remember the old saying "Rubbish In, Rubbish Out"? Well, if that saying was not invented to describe relational database design then it should have been.


Further Reading

A helpful how to guide on Foreign Keys with useful SQL related to reporting on existing keys and disabling and enabling keys across databases.
http://www.sqlservercentral.com/articles/Advanced/foreignkeys/2423/

A discussion on using foreign keys or triggers for referential integrity
http://www.mssqltips.com/tip.asp?tip=1508

Tuesday 21 October 2008

Dangers of Pattern Matching

Regular Expressions and Pattern Matching

Most modern systems use some form of pattern matching in their day-to-day running, whether it's client side JavaScript replace or match statements, server side regular expression objects, or database-resident LIKE and PATINDEX statements and, since SQL 2005, the ability to run CLR based regular expressions. Although these are very useful tools, you must be aware that it's possible to max out your web or database server's CPU by running certain types of patterns.

I will tell you a short story about a very long weekend spent in the office at the beginning of 2007. We were migrating a big system to another server and had copied all the code and database settings up and were watching the performance of the systems. Something strange was happening every minute or so: the CPU would jump from hardly anything up to 25%, and after 30 seconds it would go back down again. Sometimes it would jump to 25% and then a few seconds later to 50%. In the daytime, with all the sites running full throttle, the CPU would intermittently jump in blocks of 25% up to 100% and all the sites on the server would become unusable, causing lots of error messages and customer phone calls. We were at a loss as to what was going on.

It took quite a while before someone discovered that clicking on a link on a certain company profile on a certain site would always cause the CPU to spike. The site was a jobboard and the link displayed all the job results for that company. I copied that site's job data down to our dev box and ran the same test. Instead of causing the CPU to jump to 25% it maxed straight out to 100%, which made sense as our live server had a quad core. I debugged the code and found the cause was a regular expression in a function that re-formatted the snippet of job description shown on the results page. Although this function worked on 99% of the data in the system, this company had a couple of jobs containing certain Unicode characters that were obviously causing the regular expression engine to freak out. We put it down to a problem inherent in Microsoft's regular expression engine and rewrote the code to break the one pattern down into 3 smaller steps, which worked fine.

However, I then read an article about catastrophic backtracking and how patterns that can match a string in multiple ways can cause these sorts of CPU issues, as the number of permutations the engine has to try grows exponentially. If you have a very small pattern you may not notice it, but with long strings you can run into these CPU nightmares.
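To see the backtracking problem in isolation, here is a small C# console sketch (my own illustration, not the code from the system above) comparing a nested quantifier pattern with a flattened equivalent against a string that can never match:

using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class BacktrackingDemo
{
    static void Main()
    {
        // A string of 'a's with a trailing '!' so neither pattern can ever match
        string input = new string('a', 25) + "!";

        // Nested quantifiers: each extra 'a' roughly doubles the number of ways the engine
        // can split the string, so the failed match backtracks exponentially
        TimeMatch(@"^(a+)+$", input);

        // Equivalent pattern without the nesting - the failure is reported almost instantly
        TimeMatch(@"^a+$", input);
    }

    static void TimeMatch(string pattern, string input)
    {
        Stopwatch sw = Stopwatch.StartNew();
        bool matched = Regex.IsMatch(input, pattern);
        sw.Stop();
        Console.WriteLine("{0} -> {1} in {2} ms", pattern, matched, sw.ElapsedMilliseconds);
    }
}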

As well as problems that may be built into your application by patterns that cause high CPU when a particular replace or match runs against a particular piece of data, you may also be opening your site up to deliberate attacks by users who take advantage of an online search tool that uses SQL and LIKE to search records in your database. If you do not screen out the wildcards and other symbols used in LIKE pattern matching, or the regular expression symbols (+*^?$) if you are using a CLR based proc or function in SQL 2005, then you could be vulnerable to what is known as an SQL denial of service attack. Again, it's possible to max out your database server's CPU while a complex LIKE statement runs against your database.


Ways to avoid complex pattern matching affecting your system

  • Avoid patterns that are prone to catastrophic backtracking.
  • If you have overly complex expressions, consider breaking them down into multiple smaller expressions, especially expressions that include lots of nested quantifiers (? * +).
  • Avoid expressions that can match the same string in multiple different ways.
  • If you are re-using an expression then compile it if possible.
  • If you are using an inherently useless regular expression engine (VB) then make sure you re-use your objects. Create a global regular expression object variable that is instantiated on first use, re-used on subsequent calls and then destroyed at the end of the page.
  • If you have a frequently hit page that uses lots of expressions for formatting, such as the job results example I mentioned, then consider rewriting your code so that any complex formatting is done on input rather than output. Consider creating a separate column in your table that stores the formatted version so it can be accessed quickly whenever it needs to be viewed. The performance gained by not having to run complex pattern matching on every page load will be considerable.
  • If you are taking input from a user and using it in SQL LIKE pattern matching or regular expression database searches, then sanitise all input to make sure you do not fall victim to an SQL denial of service attack (see the sketch after this list).
  • Also consider rewriting any database searching to use a full text index instead of LIKE searches. You will be able to offer your users a more feature rich and comprehensive search facility.
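On the sanitising point, a small helper along these lines (a sketch using SQL Server's square bracket escaping, intended to be combined with a parameterised query) makes user-supplied LIKE input behave literally:

// Escape the characters LIKE treats as wildcards so user input is matched literally.
// The [ character must be escaped first, otherwise we would escape our own escapes.
public static string EscapeForLike(string userInput)
{
    if (string.IsNullOrEmpty(userInput)) return userInput;

    return userInput
        .Replace("[", "[[]")
        .Replace("%", "[%]")
        .Replace("_", "[_]");
}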

So now you know to be on the lookout for spikes in CPU. If you see the CPU in performance monitor jump up to 25% on a quad core, 50% on a dual core or 100% in one leap, then it could well be down to a regular expression pattern somewhere in your application that needs investigating.


Links

SQL Server Denial of Service Attack (LIKE Attack)

Detailed description of how backtracking can cause CPU issues

Catastrophic Backtracking

Tuesday 7 October 2008

Is there any point anymore using Javascript to output Flash?

Using Javascript to load Flash.

Because of the well known Eolas patent issue regarding Internet Explorer, and users having to click to activate Flash content, a number of JavaScript libraries emerged to get round the problem, e.g. SWFObject and UFO. The reason being that any Flash content loaded through script was not affected by the patent issue, so the user did not have to click anything.

I, along with many others, implemented generic functions to get round the "click to activate" problem. However, since March 2008 the issue has been resolved, and current versions of Internet Explorer, and any previous versions that have had the patches applied, no longer have this issue. Therefore I am wondering whether there is any point in having Flash content delivered through JavaScript rather than server side code.

The main reason for doing it this way has been removed, and by using JavaScript to load all Flash you are narrowing your user base. More and more users are browsing the web with JavaScript disabled, and I imagine that in these security conscious times this percentage will only increase. Checking my logged traffic statistics for today I can see that 10% of my sites' non-crawler users have JavaScript disabled, which is quite a large number of users who won't be able to see all those Flash adverts and banners your customers have paid £££ to place on your site.

Therefore in any new sites I create I am reverting back to generating the Flash HTML server side, unless there is a need for client side user interaction such as when I use Flash for text counters. I am using the same XHTML approach that SWFObject and others have started using, which checks the user's browser type and outputs the relevant OBJECT code accordingly, i.e. classid and codebase for IE and type for the others. This removes the need for embed tags and is still XHTML compliant, plus it's done server side so there is no issue when users turn JavaScript off. The markup for non-IE browsers looks like this (a rough server side sketch follows it):

<object type="application/x-shockwave-flash" data="/images/Banner.swf" id="BANNER643" align="middle" width="468" height="60">
<param name="movie" value="/images/Banner.swf">
<param name="allowScriptAccess" value="sameDomain">
<param name="quality" value="high" />
<param name="wmode" value="transparent">
</object>
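As a rough illustration of the server side branching, something along these lines would do it. This is a hypothetical helper rather than my production code; the classid and codebase values are the standard Adobe Flash ones.

// Hypothetical helper - outputs the OBJECT markup server side, branching on browser type
public static string FlashObjectHtml(string swfUrl, string id, int width, int height, bool isIE)
{
    if (isIE)
    {
        // IE gets the classid/codebase form
        return "<object classid=\"clsid:D27CDB6E-AE6D-11cf-96B8-444553540000\"" +
               " codebase=\"http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=9,0,0,0\"" +
               " id=\"" + id + "\" width=\"" + width + "\" height=\"" + height + "\">" +
               "<param name=\"movie\" value=\"" + swfUrl + "\" />" +
               "<param name=\"quality\" value=\"high\" />" +
               "<param name=\"wmode\" value=\"transparent\" />" +
               "</object>";
    }

    // Everything else gets the standards form shown above
    return "<object type=\"application/x-shockwave-flash\" data=\"" + swfUrl + "\"" +
           " id=\"" + id + "\" width=\"" + width + "\" height=\"" + height + "\">" +
           "<param name=\"quality\" value=\"high\" />" +
           "<param name=\"wmode\" value=\"transparent\" />" +
           "</object>";
}

It could then be called with something like Response.Write(FlashObjectHtml("/images/Banner.swf", "BANNER643", 468, 60, Request.Browser.Browser == "IE")), assuming the default ASP.NET browser capabilities identify Internet Explorer as "IE".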


If the user's browser is Internet Explorer I also output a JavaScript function call at the bottom of the page that will "fix" any OBJECT tags, so that users on unpatched versions of IE, which would otherwise still ask them to "click to activate", will not have the problem as long as they have JavaScript enabled.

A simple version of the EOLAS fix code is below, although I am currently using a more advanced version that handles complex OBJECT params a lot better.

var objects = document.getElementsByTagName("object");
for (var i = 0; i < objects.length; i++){
    objects[i].outerHTML = objects[i].outerHTML;
}


All my users should get this content, and only unpatched Internet Explorer users who also have JavaScript disabled will have the inconvenience of being asked to "click to activate".

It would be nice to be able to detect whether Internet Explorer users have patched or unpatched versions so that I could only run the JavaScript fix code when it is absolutely necessary, but I cannot think of a way to determine this. I was thinking of maybe trying to access the first Flash movie on a page and checking the PercentLoaded property, which should be unavailable if the Flash movie has not been activated. However, using the multiple IE version hack to run IE 6, 5 etc you always get a pop up saying "Press okay to continue loading content" when a page contains Flash, so I cannot test this out. Either way this seems the best way to code Flash for the future, as it ensures the maximum user coverage and still handles the EOLAS patent problem for those IE users who have not updated.

If anyone can think of any other reasons to be using Javascript and limiting your user coverage then please let me know.


Thursday 2 October 2008

Using Javascript to Parse a Querystring

Javascript has no support for querystring parsing

There may be times when you need to parse a querystring using JavaScript. I recently found myself needing to do this when rewriting a web application, as lots of places were using inline JavaScript because parameters were being output into the HTML source through server side code. I wanted to move as much inline script as possible out of the page source into external files, and I found that in a lot of cases I could only do this by using JavaScript to access the querystring instead of server side code.

As JavaScript has no built-in querystring parser you will have to create your own, making use of location.href and location.search, which hold the current URL and the querystring portion of that URL.

A lot of scripts use string parsing to split the querystring up on & and =, which is fine as long as you remember about the # anchor portion of the URL. My script, however, uses regular expressions to split the querystring into its constituent parts.

I have created a little object that can be accessed on any page that references it. The features of this object are:

  • Parse the current location's querystring or supply your own querystring. It may be that you have passed a URL encoded querystring as a value within the querystring, and once you have accessed this parameter you need to parse it on its own.
  • Returns the number of parameters within the querystring.
  • Returns the parameters as key/value in an associative array.
  • Option to output the parameters as a formatted string.
  • Ability to access the anchor value separately if it exists.
  • Handles parameter values specified multiple times by comma-separating them.

The code:


function PageQuery(qry){

    this.ParamValues = {};
    this.ParamNo = 0;

    var CurrentQuery, AnchorValue = "";

    //if no querystring passed to constructor default to current location
    if(qry && qry.length>0){
        CurrentQuery = qry;
    }else{
        if(location.search.length>0){
            CurrentQuery = location.href;
        }else{
            CurrentQuery = "";
        }
    }

    //may want to parse a query that is not the current window.location
    this.ParseQuery = function(qry){
        var rex = /[?&]([^=]+)(?:=([^&#]*))?/g;
        var rexa = /(\#.*$)/;
        var qmatch, key, val, amatch, cnt=0;

        //parse querystring storing key/values in the ParamValues associative array
        while(qmatch = rex.exec(qry)){
            key = denc(qmatch[1]); //get decoded key
            val = denc(qmatch[2] || ""); //get decoded value (parameters may have no value)

            if(this.ParamValues[key]){ //if we already have this key then append the new value comma separated
                if(key&&key!="") this.ParamValues[key] = this.ParamValues[key] + ","+val;
            }else{
                this.ParamValues[key] = val;
                cnt++;
            }
        }
        //as no length property with associative arrays
        this.ParamNo = cnt;

        //store anchor value if there is one
        amatch = rexa.exec( qry );
        if(amatch) AnchorValue = amatch[0].replace("#","");
    }

    //run function to parse querystring and store array of key/values and any anchor tag
    if(CurrentQuery.length){
        this.ParseQuery( CurrentQuery );
    }

    this.GetValue = function(key){ if(!this.ParamValues[key]) return ""; return this.ParamValues[key]; }
    this.GetAnchor = AnchorValue;

    // Output a string for display purposes
    this.OutputParams = function(){
        var Params = "";
        if(this.ParamValues && this.ParamNo>0){
            for(var key in this.ParamValues){
                Params+= key + ": " + this.ParamValues[key] + "\n";
            }
        }
        if(AnchorValue!="") Params+= "Anchor: " + AnchorValue + "\n";
        return Params;
    }
}

//Functions for encoding/decoding URL values used by the object

//encode
function enc(val){
    if (typeof(encodeURIComponent)=="function"){
        return encodeURIComponent(val);
    }else{
        return escape(val);
    }
}
//decode
function denc(val){
    if (typeof(decodeURIComponent)=="function"){
        return decodeURIComponent(val);
    }else{
        return unescape(val);
    }
}


How to call the code

To make use of the object's functions you instantiate the PageQuery object, either passing the string you want to parse as the constructor's only parameter or passing nothing, in which case it will default to the current location's querystring if there is one. Once you have created the object you can then reference the properties you require to return the relevant information.

//create new instance of the object
var qry = new PageQuery(); //defaults to current location if nothing passed to the constructor
//var qry = new PageQuery("?id=1044&name=Rob+Reid#56"); //parse a specific string instead of using location.search
var id = qry.GetValue("id"); //get value for a parameter called id
var anc = qry.GetAnchor; //get the anchor # value if it exists
var no = qry.ParamNo; //get the number of parameters
var s = qry.OutputParams(); //return a formatted string for display purposes
var p = qry.ParamValues; //return the array of parameters

//loop through array of parameters
if(p){
    for(var z in p){
        alert("Query Parameter["+z+"] = " +p[z]);
    }
}


So as you can see it's a pretty simple but very flexible object that provides all the functionality I require when handling querystrings with JavaScript.

Click here to download the ParseQuery object script.

Tuesday 23 September 2008

Latest SQL Injection URLS

Cleaning up a site infected with multiple SQL injected URLs

I have just had to clean up an ancient system that was successfully hacked by automated hack bots. The site was a small news system that was visited rarely and written about 7 years ago. The code was classic ASP, the SQL was all built inline in the ASP code using string concatenation with poor parameter sanitisation, and no thought had been paid at all to SQL injection. Luckily the site owner is moving to a new system this week, however I still had to clean the database up, and the main affected table contained at least 20 different script tags, some appearing over 5 times, all referencing dodgy URIs. In fact by the looks of things the majority of the site's traffic over the last month was purely from hack bots, which just goes to show that no matter how small a site is, if it can be found on the web then a hackbot is going to try its luck. Luckily I managed to remove all traces of the hack using my clean up script and there was no need for a database backup restore.

However I thought it would be helpful to list all the URIs injected into the system. As you can see most are Russian with a few Chinese thrown in for good measure, so nothing new there. They all caused Google's vulnerable site report to raise a flag, and I believe the JS is the standard hack that makes use of the well known iframe vulnerabilities in old browsers.

http://www0.douhunqn.cn/csrss/w.js
http://www.usaadp.com/ngg.js
http://www.bnsdrv.com/ngg.js
http://www.cdport.eu/ngg.js
http://www.movaddw.com/ngg.js
http://www.lodse.ru/ngg.js
http://www.sdkj.ru/ngg.js
http://www.kc43.ru/ngg.js
http://www.jex5.ru/ngg.js
http://www.bnrc.ru/ngg.js
http://www.bts5.ru/ngg.js
http://www.d5sg.ru/ngg.js
http://www.nemr.ru/ngg.js
http://www.kr92.ru/ngg.js
http://www.bjxt.ru/ngg.js
http://sdo.1000mg.cn/csrss/w.js
http://www.ujnc.ru/js.js
http://www.cnld.ru/js.js
http://www.juc8.ru/js.js
http://www.3njx.ru/js.js
http://www.19ssl.net/script.js
http://www.vtg43.ru/script.js

See my recovering from an SQL injection attack post for more details about clean ups and quick plasters that can be applied to prevent further injections.

Thursday 18 September 2008

What browser support do you offer your users?

Graceful degradation

In an ideal world you would want any website you develop to reach the largest possible audience. Sites these days can rely on multiple technologies to work (JavaScript, Flash, Java, ActiveX, CSS), most of which the user can disable through choice even when their browser supports them. This means that when it comes to developing a site you have a choice. You can draw a line in the sand and say you must use this browser above this version, with JavaScript, Flash and ActiveX enabled, for the site to work, which will mean a small and very annoyed audience. Or you can take the better way, which is to build the site from the bottom up, starting with a good core HTML structure that should work in all browsers and then adding layers of functionality on top.

This means making sure all links have href attributes that point to URIs and not javascript: calls, images have alt attributes, client side validation is duplicated on the server (which is good practice anyway to prevent hackers bypassing it by disabling JavaScript) and the core navigation is not reliant on JavaScript to work. This is called graceful degradation and it ensures that your site's core functionality is available to as many users as possible.

For example, if you were displaying Flash banners you would start with an image that has alt text defined, so that a user with a text based browser reads the content of the alt attribute, a user with image support sees the image, and a user with Flash enabled views the movie.

The site I am currently working on divides browser support into 3 levels, and to help the user we include a browser compatibility page that lets them see what level of support their browser and settings provide, along with links to download the latest version of their browser if they are using an old one. As well as showing the browser support level (see below), the page lists a number of key settings and features that the site makes use of, with an indicator of whether each feature is enabled or not. Settings displayed include Flash, JavaScript, AJAX, Java, cookies etc. Having one or more items on this list unavailable does not mean the site will not function, only that certain parts of the site use that technology, so to get the richest experience from the site and enjoy all the functionality the user should enable the missing features. Obviously there are some items that have to be available for the site to work, such as forms and session cookies, and we let the user know whether their browser and settings are compatible with the site by showing them crosses or ticks next to each item and an overall compatibility rating.

Browser support levels

The browser support grades are defined in the following way.

Level 1 Support
This means the site should be fully functional and display correctly. We test the site fully in level 1 browsers to make sure any bugs are fixed as soon as possible. Currently our level 1 browsers are IE 7 and 6, Firefox 2 (3 is still buggy), Safari 3 and Opera 9.

Level 2 Support
This means browsers that should work without any problem with our software, but as we cannot test every single available browser we cannot guarantee that everything will work. For example, if the browser is the latest version of SeaMonkey or Firebird, which are based on the same Gecko rendering engine as Firefox (a level 1 browser), then the system should work perfectly depending on your browser settings. However, as we do not fully test the system with these browsers we do not guarantee they will work 100%. Current level 2 browsers would be Firefox 3, Opera 8, SeaMonkey and Konqueror.

Level 3 Support
This means older or niche browsers, such as those on PDAs and mobile phones, that we do not test the system on. This is not to say the system won't work, as it might do depending on your browser settings, but we are not going to test our system on Netscape Navigator 4 or every mobile phone to ensure it works.

A fully functional site means no JavaScript errors, though the JavaScript is not really the problem, as most code that works in IE 7 is going to work in IE 5 and we all try to write cross browser code nowadays, which is made easier with all the libraries available. The issue is more to do with CSS and all the quirks between IE and Mozilla and the various versions. IE 6 for instance has a number of major differences which mean special stylesheets or hacks have to be implemented. As much as we would like to downgrade IE 6 to a level 2 browser we cannot, due to the simple fact that a number of our largest clients still use IE 6 as their browser. We have enquired about the possibility of them upgrading but were actually told that their technical support teams would not allow it, which is understandable if you consider how much support time would be taken up by requests for help because of the new browser layout. I can imagine they don't want to suffer hundreds of phone calls asking "where has the history menu disappeared to?"

I also have Firefox 3 in level 2 due to the fact that there are still a number of outstanding issues with the site which need resolving before we would make it level 1. Some of these issues are definitely a problem with the browser, such as the well known problem of playing Flash videos, for which an upgrade to Flash version 10 is the only workaround I have found. There is also an issue with the execCommand function which I am not sure about, as the code worked fine in Firefox 2 and other browsers but has suddenly started giving me "Component returned failure code: 0x80004003 (NS_ERROR_INVALID_POINTER) [nsIDOMNSHTMLDocument.execCommand]" errors in an iframe based WYSIWYG editor the site uses.

Current browser usage

I like to occasionally look at the traffic statistics to see whether our grading of browsers over 3 levels is in keeping with actual Internet use.

The system I am logging against is recording on average 350,000 page loads a day at the moment, roughly 50-60% of which are from crawlers, which is nothing out of the ordinary. So looking at the traffic from site members who have logged into the website, as opposed to just visitors, the browser breakdown for this month so far is:

Browser %
IE 7.0 61.56
IE 6.0 27.21
Firefox 3.0 5.44
Firefox 2.0 3.06
Safari 3.1 2.72

Which is no great surprise and shows how much of a market share IE 6 still has even after 2 years of IE 7 being around.

As for obscure and old browser versions, would you be surprised or not to know that for today alone the following browsers appeared in the stats for all non-crawler traffic:

  • 62 different people used IE 5.0 and 1 came along in IE 4!
  • 5 visited on various versions of Netscape Navigator 4.
  • In terms of overall users to the site, Yandex, the Russian search engine, came third after IE 7 and 6.
  • There were over 100 different types of browser/version recorded today alone.

Although not earth shatteringly exciting, it does show the breadth of browser types available and also that a large majority of users do not seem to upgrade very often, some not at all by the looks of things. With new versions of browsers rolling out all the time it would be physically impossible to test a site in each one, but if your site works in the major 4 rendering engines (IE/Trident, Firefox/Gecko, Safari/WebKit, Opera/Presto) then you will be covering the majority of those users who do keep up to date with browser revisions.

Sunday 14 September 2008

My growing love for Javascript

A shaky start to the love affair

I have no shame in admitting that I am currently in love with JavaScript. It may be that for the last eon or two I have been working with clunky old scripting languages such as classic ASP, so whenever I get a chance to do some JavaScript I grab it with open arms as an opportunity to do what seems like proper programming. I will be the first to admit that I never used to see JavaScript the way I do now; as well as never understanding its full potential, I never even thought of it as a proper object orientated language, which it most certainly is. When I first swapped over from client / server applications in the early 90's to web development using ASP/COM/SQL Server 6, JavaScript was just a nice little scripting language that could set the focus of inputs, validate input client side and do other little tweaks that didn't seem of much importance in the grand scheme of things. The "proper" coding was done server side, and I would have traded my ciggies by the dozen with anyone to work on a stored procedure or COM component rather than have to fiddle round with a scripting language I had little time or patience for.

Coming from a VB background I hated the case sensitivity and having to remember that equality tests involved more than one equals sign, sometimes even three. Trying to debug a script was always a nightmare, sitting there clicking away all those alert boxes before realising you were in some sort of never ending loop and having to kill the browser and start again. Yes, I really didn't think much of it, and I am certainly not alone in feeling like that.


The love that grew from necessity.

So over the years my views on JavaScript changed from hatred to a mild respect, still outweighed by all the annoyances that come with trying to write a complex, fully functional, cross browser script. I still didn't have to work with it much apart from replicating server side form validation on the client and some mild DOM manipulation. My annoyance with the language itself had disappeared after I learnt Java and C#, and the scripts I did have to knock out were not that complex. However I still had the attitude that if it worked in the browser I was using, which was always the latest version of Internet Explorer, then the script was fine by me. If it didn't work in Netscape or Safari then I would just ask the office "JavaScript guru" to have a look, and the code I was given usually seemed to work even if I didn't know what it was doing.

Then the other year I wanted to implement a WYSIWYG editor for the system I was working on. The system was using FCKeditor at the time and I wanted to implement what seemed like a simple request: a character counter, so that as the user typed, the number of characters used in the HTML source was available for viewing. I remember trying to edit the FCK source and realising what a huge beast it was. The size of its directory was ten times the size of the rest of the site, and I was sure that half the code and possible functionality was not required for my system's requirements. I had a look at some other widgets, including OpenWYSIWYG and another one that our JavaScript guru had used, and then I decided to write my own, combining the best bits of each, stripping out anything not needed, adding my character counter and making it as flexible as possible. It seemed like a straightforward task on paper, but it was the start of a painstaking development and, more importantly, the start of a long learning process which, although extremely painful at the time, opened my eyes to the wonders of cross browser coding and all the different caveats and pitfalls that were waiting for me to discover.


Items of interest discovered along the way.

Whilst developing this widget some of the most seemingly simple things turned out to be the most complex. Who would have thought that just putting the cursor at the end of any content in the IFrame editor would be such a tall order? So as well as learning far too much about browser differences, the history of Netscape and Mozilla and why User-Agent strings seem to make little sense, I found out some very important information and came across some specific problems that everyone hits at some stage when developing cross browser script.

1. How Firefox is indeed a wonderful invention with all those extensions, especially Firebug, which made my debugging life pain free once again. Not only that, but Firebug Lite brings most of that joy to debugging in IE. No more tired fingers from closing alert boxes.

2. The problems with relative URIs displayed within Iframes in Internet Explorer. Read this article for an explanation. The solution was to write out the Iframe content with document.write.

3. Different implementations of content editable HTML between browsers. Issues setting design mode on in Mozilla and disappearing event listeners. All good fun for the clueless I can assure you.

4. All the fun involved in learning about the event model, the problem of the "this" keyword in IE, the memory leaks in older IE versions and the illogical order in which IE fires events.

5. Differences between browsers when trying to calculate the size of the viewport and window dimensions for making my editor appear in a floating div.

6. Trying to make the content outputted by the editor as consistent as possible across browsers and XHTML compliant. IE seems to love capital letters and forgetting to close LI and DT elements for some reason.

7. Much much more.

So as you can see, if you have covered all those topics in detail yourself, which means you will almost certainly have read Dean Edwards' competition blog article from start to finish as well as followed most of the links it leads to, there is a lot of information to take in and understand. However, rather than put me off JavaScript for life, it has actually made me come to love the bloody thing.


Conclusion

So whereas in the 90's I used to hate all those cross browser problems, they are now more of a challenge to be overcome, and I love it when I get a complicated piece of code working in the main 4 browsers as well as in as many old versions as possible. In fact I may get a little too keen sometimes and often need my colleagues to tell me that the widget doesn't actually need to work in Netscape Navigator 4 or IE 4.

I am one of those people who will readily admit that I don't know everything, but I like finding out about those missing chunks of knowledge, and when given the choice of an easy life by implementing someone else's code as is, I will now often choose the more painful but also more enjoyable option of trying to write my own version. I will have a look at some of the best examples out on the web and try to put them together, which is usually the best way of learning about all those nice little cross browser intricacies along the way.

As the saying goes nothing worthwhile in life comes easily and this seems to be particularly true with writing cross browser JavaScript code.

Tuesday 9 September 2008

SQL Performance Tuning Queries

SQL 2005 Performance Tuning Queries

Here are some of my favourite queries that help with SQL performance tuning. I have gathered most of these from the web and 3rd party sources over the years, with the odd tweak here and there where needed, and they form the bulk of my performance tuning toolkit. You should keep a file containing a number of similar queries close at hand, as they are invaluable when hunting down problematic queries, SQL injection attempts or denial of service attacks.

Update: 07-Dec-08
In response to the comment about filtering by db_id(), I have removed that filter from a few of the examples: the commenter is correct in stating that filtering by db_id() will not return all the possible data and will miss queries run from the client or from within stored procs that use dynamic SQL. Also, if you are in a hurry and want to skip straight to a very useful procedure that outputs nearly 20 different performance related reports for one or more databases, then go to this download of mine: Download SQL 2005 Performance Reports.


Top 50 worst queries for I/O

There are 2 different kinds of I/O tracked within SQL Server: logical and physical I/O. Logical I/Os account for data that is processed from the buffer pool, which resides in memory, hence the phrase logical I/O. Physical I/Os are I/Os associated with accessing data directly from the physical disks that SQL Server uses to store databases. Physical I/Os are more expensive, meaning they take longer to process. I/O is generally the single most expensive operation that impacts the overall performance of a TSQL statement, so when you are tuning your queries you want to minimize the number of logical and physical I/O operations performed to produce a result set.

If your query shows physical I/O on the first run but none on the second, it is because SQL Server is now holding the data in memory, so to test the physical I/O cost again you need to clear the buffer cache using DBCC DROPCLEANBUFFERS.
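
If you do want to re-test the physical I/O of a query, the following standard commands will empty the cache between runs. Only ever do this on a development or test server, never on production:

CHECKPOINT -- write any dirty pages to disk so that clean buffers can be dropped
DBCC DROPCLEANBUFFERS -- remove all clean pages from the buffer pool
DBCC FREEPROCCACHE -- optional: also clear the plan cache so plans are recompiled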

SELECT TOP 50
(qs.total_logical_reads + qs.total_logical_writes) /qs.execution_count as [Avg IO],
substring (qt.text,qs.statement_start_offset/2,
(case when qs.statement_end_offset = -1
then len(convert(nvarchar(max), qt.text)) * 2
else qs.statement_end_offset end - qs.statement_start_offset)/2)
as query_text,
qt.dbid,
qt.objectid
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text (qs.sql_handle) as qt
ORDER BY [Avg IO] DESC


Top 50 worst queries by average running time

The following query will list the worst performing queries based on their average elapsed running time. It will give you the times, logical & physical reads and the TSQL of the statement in question.

SELECT TOP 50
creation_time,last_execution_time,total_physical_reads
,total_logical_reads,total_logical_writes, execution_count
,total_worker_time, total_elapsed_time, total_elapsed_time / execution_count
avg_elapsed_time, SUBSTRING(st.text, (qs.statement_start_offset/2) + 1,
((CASE statement_end_offset WHEN -1 THEN DATALENGTH(st.text) ELSE
qs.statement_end_offset END - qs.statement_start_offset)/2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY total_elapsed_time / execution_count DESC;


Top 50 worst queries for CPU

This query is similar to the previous ones in the DMVs (dynamic management views) it uses, but it returns the most CPU intensive queries in the system. If you are seeing high CPU spikes or top end flatlining on your DB server then this query will help you find out whether you are being hit with a SQL denial of service attack or whether you have some queries using pattern matching techniques (LIKE, PATINDEX, CLR regular expressions) that have been given a complex string to search for and caused the server to max out on CPU.

SELECT TOP 50 (a.total_worker_time/a.execution_count) AS [Avg_CPU_Time],
Convert(Varchar,Last_Execution_Time) AS 'Last_execution_Time',
Total_Physical_Reads,
SUBSTRING(b.text,a.statement_start_offset/2,
(case when a.statement_end_offset = -1 then len(convert(nvarchar(max), b.text)) * 2
else
a.statement_end_offset end - a.statement_start_offset)/2) AS Query_Text,
db_name(b.dbid)as DatabaseName,
b.objectid AS 'Object_ID'
FROM sys.dm_exec_query_stats a
CROSS APPLY sys.dm_exec_sql_text(a.sql_handle) AS b
ORDER BY [Avg_CPU_Time] DESC


Suggested index columns and usage

SELECT d.*
,s.avg_total_user_cost
,s.avg_user_impact
,s.last_user_seek
,s.unique_compiles
FROM sys.dm_db_missing_index_group_stats s
,sys.dm_db_missing_index_groups g
,sys.dm_db_missing_index_details d
WHERE s.group_handle = g.index_group_handle
AND d.index_handle = g.index_handle
AND database_id = db_id()
ORDER BY s.avg_user_impact DESC
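
To give a rough idea of what you would do with one of those suggestions, the equality_columns, inequality_columns and included_columns values map straight onto an index definition. A minimal sketch, assuming the DMV suggested an equality column of ClientFK on the JOBS table with JobTitle as an included column (the index name and included column here are made up for the example):

-- hypothetical index built from one missing index suggestion
CREATE NONCLUSTERED INDEX IX_JOBS_ClientFK_Suggested
ON dbo.JOBS (ClientFK)
INCLUDE (JobTitle)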


View index usage by table

Ideally you want high figures in the user_seeks column.
If you have no seeks and lots of lookups then you should investigate. Also, if you have indexes with no seeks, scans or lookups then they should probably be dropped (as shown after the query below), since the index is not being used but still has to be maintained on every write.

SELECT i.Name, i.Type_Desc, i.Is_unique, d.*
FROM sys.dm_db_index_usage_stats as d
JOIN sys.indexes as i
ON i.object_id = d.Object_id and
i.index_id = d.Index_id
WHERE database_id=db_id()
AND d.object_Id=object_id('JOBS')
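
Removing an index that the query above shows as never being seeked, scanned or looked up is then a one liner. A minimal sketch, with a made up index name on the JOBS table:

-- IX_JOBS_PostedDate is a hypothetical unused index on the JOBS table
DROP INDEX IX_JOBS_PostedDate ON dbo.JOBS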


View the fragmentation levels of your In Row (non blob/text) indexes


SELECT object_name(d.object_id), d.object_id, d.index_id, i.name as IndexName,
avg_fragmentation_in_percent, alloc_unit_type_desc
FROM sys.dm_db_index_physical_stats (db_id(), NULL, NULL, NULL, NULL) as d
JOIN sys.indexes as i
ON i.object_id = d.object_id and
d.index_id = i.index_id
WHERE d.Index_Type_Desc <> 'HEAP' AND
i.Name is not null AND
avg_fragmentation_in_percent>= 30 --change this figure to reflect your own requirements
ORDER BY object_name(d.object_id)
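
Once you have found the fragmented indexes you can defragment them with ALTER INDEX. The usual rule of thumb is to REORGANIZE an index at low to moderate fragmentation (roughly 5-30%) and REBUILD it above that. A minimal sketch, again using a made up index name on the JOBS table:

-- moderate fragmentation: reorganise (always an online operation)
ALTER INDEX IX_JOBS_ClientFK ON dbo.JOBS REORGANIZE

-- heavy fragmentation (30%+): rebuild the index
ALTER INDEX IX_JOBS_ClientFK ON dbo.JOBS REBUILD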


Finds the degree of selectivity for a specific column in a table

The higher the ratio the more selective the column is and the more worthwhile an index on it becomes. There is no point having indexes on columns with low selectivity, as they do not help differentiate between rows when 50% of the rows hold one value and 50% another. This is the reason why it's not a good idea to place indexes on bit columns, as there are only two possible values. The only exception would be when a very high percentage of rows hold one value (99% true) and you want to find the 1% that hold the opposite value (false), although even then I would advise against it.


Declare @total_unique float
Declare @total_rows float
Declare @selectivity_ratio float

SELECT @total_unique = 0
SELECT @total_rows = 0
SELECT @selectivity_ratio = 0

/*
Finds the total number of distinct values in the column
Be sure to replace ClientFK below with the name of your column
Be sure to replace JOBS below with your table name
*/
SELECT @total_unique = (SELECT COUNT(DISTINCT ClientFK) FROM JOBS)

/*
Calculates the total number of rows in the table
Be sure to replace JOBS below with your table name
*/
SELECT @total_rows = (SELECT COUNT(*) FROM JOBS)

--Calculates Selectivity Ratio for a Specific Column
SELECT @selectivity_ratio = ROUND((SELECT @total_unique/@total_rows),2,2)
SELECT @selectivity_ratio as 'Selectivity Ratio'
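
If you prefer not to bother with the variables, the same ratio can be calculated in a single statement against the ClientFK column used above:

-- multiply by 1.0 to force decimal rather than integer division
SELECT ROUND(COUNT(DISTINCT ClientFK) * 1.0 / COUNT(*), 2) AS [Selectivity Ratio]
FROM JOBS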



So those are a few of my favourite queries that I use very frequently to help hunt down and resolve back end performance bottlenecks. Since SQL 2005 came out with all these really cool DMVs a lot of people have spent time putting these sorts of queries together into one report.

If you want to get your hands on one of these all encompassing queries then check out the stored procedure in the download linked in the update at the top of this post (usp_sql_rpt_database_performance_stats), which uses a query I came across and extended to make it more customisable:


It's a beast of a report but it gives you everything you need to know in one place. Read the comments before each section carefully so that you understand what the data is showing you. Also, if you run the report as it is on a server with lots of databases, tables and indexes it will take some considerable time to execute, so I suggest using one of the modes that lets you filter on only those databases you require statistics for. So call it like so:

EXEC dbo.usp_sql_rpt_database_performance_stats 3, 'strictlysoftware%'

will only report on databases on that server whose name (sys.databases) starts with strictlysoftware. You will most likely also want to exclude the system databases (master, msdb, model), although you may want to report on tempdb, so call it like so:

EXEC dbo.usp_sql_rpt_database_performance_stats 1, NULL

Another script that contains some useful code related to blocking and wait times can be found at Microsoft's site. The code is near the bottom of the page.



So there you go, a one stop shop of goodies that should help you on your way. Please let me know if you have any other useful DMV related queries or anything specific to SQL 2008, which is coming our way soon.