Tuesday 21 October 2008

Dangers of Pattern Matching

Regular Expressions and Pattern Matching

Most modern systems use some form of pattern matching in there day to day running whether its client side Javascript replace or match statements, server side regular expression objects or database residing LIKE and PATINDEX statements and since SQL 2005 the ability to run CLR based regular expressions. Now although these are very useful tools to use you must be aware that its possible to max out your web or database servers CPU by running certain types of patterns.

I will tell you a short story about a very long weekend spent in the office at the beginning of 2007. We were migrating a big system to another server and we had copied all the code and database settings up and were watching the performance of the systems. Something strange was happening every minute or so. The CPU would jump from hardly anything up to 25%. After 30 seconds it would go back down again. Sometimes it would jump to 25% and then a few seconds later to 50%. In the daytime with all the sites running full throttle at intermittent stages throughout the day the CPU would jump in blocks of 25% up to 100% and all the sites on the server would become unusable causing lots of error messages and customer phone calls. We were at a loss to what was going on. It took quite a while before someone discovered that clicking on a link on a certain company profile on a certain site would always cause the CPU to spike. The site was a jobboard and the link was to display all the job results for that company. I copied down that sites job data to our dev box and ran the same test. Instead of causing the CPU to jump to 25% it maxed straight out to 100%. This made sense as our live server had a quad core. I debugged the code and found the cause was a regular expression that was in a function that re-formatted the snippet of job description shown on the results page. Although this function worked on 99% of the data in the system this company had a couple of jobs that contained certain Unicode characters and it was obviously causing the regular expression engine to freak out. We put it down to a problem inherit in Microsoft's Regular Expression engine and rewrote the code to break the one pattern down into 3 smaller steps which worked fine. However I then read an article about catastrophic backtracking and how certain patterns that can be matched in multiple ways can cause these sorts of CPU issues as the complexity of the pattern grows exponentially. If you have a very small pattern you may not notice it but if you have long strings then you can run into these CPU nightmares.

As well as problems that maybe built into your application due to patterns that could cause these high CPU issues when a certain replace or match is carried out in a specific instance you may be opening your site up to deliberate attacks by users that take advantage of an online searching tool that uses SQL and LIKE to search records in your database. If you are not careful and screen out wildcards and other symbols used within LIKEs pattern matching or regular expression symbols (+*^?$) if you are using a CLR build proc or function in SQL 2005 then you could be vulnerable to what is known as SQL Denial of Service attacks. Again its possible to max out your database servers CPU whilst a complex LIKE statement is run against your database.


Ways to avoid complex pattern matching affecting your system

  • Avoid backtracking.
  • If you have overly complex expressions consider breaking them down into multiple smaller expressions. Expressions that include lots of nested quantifiers ?*+.
  • Avoid expressions that can be matched by multiple string patterns.
  • If you are re-using an expression then compile it if possible.
  • If you are using an inherently useless regular expression engine (VB) then make sure you re-use your objects. Create a global regular expression object variable that's instantiated on first use and then re-used on subsequent calls and then destroyed at the end of the page.
  • If you have a page that is hit frequently that uses lots of expressions for formatting such as the job results example I mentioned then consider rewriting your code so that any complex formatting is done on input rather than output. Consider creating a separate column in your table that stores the formatted version which can be accessed quickly whenever it needs to be viewed. The performance gained by not having to run complex pattern matching constantly on each page load will be considerable.
  • If you are taking input from a user and using it in pattern matching in SQL LIKE or regular expression database searches then sanitise all input to make sure you do not fall victim to an SQL denial of service attack.
  • Also consider rewriting any database searching to use a full text index instead of LIKE searches. You will be able to offer your users a more feature rich and comprehensive search facility.

So now you know to be on the lookout for spikes in CPU. If you see the CPU in performance monitor jump up to 25% on a quad, 50% on dual or 100% in one leap then it could well be down to a regular expression pattern somewhere in your application that needs investigating.


Links

SQL Server Denial of Service Attack (LIKE Attack)

Detailed description of how backtracking can cause CPU issues

Catastrophic Backtracking

Tuesday 7 October 2008

Is there any point anymore using Javascript to output Flash?

Using Javascript to load Flash.

Because of the well known Eolas patent issue regarding Internet Explorer and users having to click to activate Flash content a number of Javascript libraries has emerged to get round this problem e.g swfobject, UFO. The reason being that any Flash content loaded up through script was not affected by this patent issue and the user did not have to click anything.

I along with many others implemented generic functions to get round the "click to activate" problem. However since March of 2008 the issue has been resolved and current versions of Internet Explorer and any previous versions that have had patches applied to them do not have this issue. Therefore I am wondering whether there is any point in having Flash content delivered through Javascript rather than Server side code?

The main reason for doing it this way has been removed and by using Javascript to load all Flash you are narrowing your user base. More and more users are crawling the web with Javascript de-activated and I imagine that in this security conscious times we live in that this percentage will only increase. Checking my logged traffic statistics for today I can see that 10% of my sites non crawler users have Javascript disabled so this is quite a large number of users that won't be able to see all those Flash adverts and banners that your customers have paid £££ to place on your site.

Therefore in any new sites that I create I am reverting back to generating the Flash HTML server side unless there is a need for client side user interaction such as when I use flash for text counters. I am using the same XHTML that SWFobject and others have started doing which checks the users browser type and outputs the relevant OBJECT code accordingly i.e classid and codebase for IE and type for the others. This removes the need for embed tags and is still XHTML compliant plus its done server side so there is no issue when users turn Javascript off.

<object type="application/x-shockwave-flash" data="/images/Banner.swf" id="BANNER643" align="middle" width="468" height="60">
<param name="movie" value="/images/Banner.swf">
<param name="allowScriptAccess" value="sameDomain">
<param name="quality" value="high" />
<param name="wmode" value="transparent">
</object>


If the users browser is Internet Explorer I output a Javascript function call at the bottom of the page that will "fix" any OBJECT tags so that any users that don't have patched versions of IE and so still ask the user to "click to activate" will not have this problem as long as they have Javascript enabled.

A simple version of the EOLAS fix code is below although I am currently using a more advanced option that handles complex OBJECT params alot better.

objects = document.getElementsByTagName("object");for (var i = 0; i < objects.length; i++){objects[i].outerHTML = objects[i].outerHTML;}


All my users should get this content and only the unpatched Internet Explorer users who have Javascript disabled will have the inconvenience of being asked to "click to activate".

It would be nice to be able to detect whether Internet Explorer users had patched or unpatched versions so that I could only run the Javascript fix code if it was absolutely necessary but I cannot think of a way to determine this. I was thinking of maybe trying to access the first flash movie on a page and check the PercentLoaded property which must be unavailable if the flash movie is unactivated. However using the multiple IE version hack to run IE 6, 5 etc you always get a pop up saying "Press okay to continue loading content" when a page contains Flash so I cannot test this out. Either way this seems the best way to code Flash for the future as it ensures the maximum user coverage and still handles the EOLAS patent problem for those IE users who have not updated.

If anyone can think of any other reasons to be using Javascript and limiting your user coverage then please let me know.


Thursday 2 October 2008

Using Javascript to Parse a Querystring

Javascript has no support for querystring parsing

There maybe times when you need to parse a querystring using Javascript. I have recently found myself requiring to do this when rewriting a web application as lots of places were using inline Javascript script due to parameters being outputed into the HTML source through server side code. I wanted to move as much inline script out of the page source into external files and I found that a lot of cases I could only do this by using Javascript to access the querystring instead of server side code.

As Javascript has no inbuilt objects that you can reference you will either have to create your own one making use of the location.href and location.search objects which refer to the current URL and the querystring portion of the URL.

A lot of scripts will use string parsing to split the querystring up on & and = which is fine if you remember about the # anchor portion of the URL. However my script uses regular expressions to split the querystring up into its constituant parts.

I have created a little object that can be accessed on any page that references it. The features of this object are:

  • Parse the current location's querystring or supply your own querystring. It maybe that you have passed a URL encoded querystring as a value within the querystring and once you have accessed this parameter you need to parse it on its own.
  • Returns the number of parameters within the querystring.
  • Returns the parameters as key/value in an associative array.
  • Option to output the parameters as a formatted string.
  • Ability to access the anchor value seperatley if it exists.
  • Handles parameter values specified multiple times by comma seperating them.

The code:


function PageQuery(qry){

this.ParamValues = {};
this.ParamNo = 0;

var CurrentQuery, AnchorValue = "";

//if no querystring passed to constructor default to current location
if(qry && qry.length>0){
CurrentQuery = qry;
}else{
if(location.search.length>0){
CurrentQuery = location.href;
}else{
CurrentQuery = "";
}
}

//may want to parse a query that is not the current window.location
this.ParseQuery = function(qry){
var rex = /[?&]([^=]+)(?:=([^&#]*))?/g;
var rexa = /(\#.*$)/;
var qmatch, key, amatch, cnt=0;

//parse querystring storing key/values in the ParamValues associative array
while(qmatch = rex.exec(qry)){
key = denc(qmatch[1]);//get decoded key
val = denc(qmatch[2]);//get decoded value

if(this.ParamValues[key]){ //if we already have this key then update it if it has a value
if(key&&key!="") this.ParamValues[key] = this.ParamValues[key] + ","+val;
}else{
this.ParamValues[key] = val;
cnt++;
}
}
//as no length property with associative arrays
this.ParamNo = cnt;

//store anchor value if there is one
amatch = rexa.exec( qry );
if(amatch) AnchorValue = amatch[0].replace("#","");
}

//run function to parse querystring and store array of key/values and any anchor tag
if(CurrentQuery.length){
this.ParseQuery( CurrentQuery );
}

this.GetValue = function(key){ if(!this.ParamValues[key]) return ""; return this.ParamValues[key]; }
this.GetAnchor = AnchorValue;

// Output a string for display purposes
this.OutputParams = function(){
var Params = "";
if(this.ParamValues && this.ParamNo>0){
for(var key in this.ParamValues){
Params+= key + ": " + this.ParamValues[key] + "\n";
}
}
if(AnchorValue!="") Params+= "Anchor: " + AnchorValue + "\n";
return Params;
}
}

//Functions for encoding/decoding URL used in object

//encode
function enc(val){
if (typeof(encodeURIComponent)=="function"){
return encodeURIComponent(val);
}else{
return escape(val);
}
}
//decode
function denc(val){
if (typeof(decodeURIComponent)=="function"){
return decodeURIComponent(val);
}else{
return unescape(val);
}
}


How to call the code

To make use of the objects functions you can instantiate the PageQuery object by either passing the string that you want to parse as the constructors only parameter or you can pass nothing in which case it will default to the current location's querystring if there is one. Once you have created the object you can then reference the properties you require to return the relevant information.

//create new instance of the object
var qry = new PageQuery(); //defaults to current location if nothing passed to the constuctor
//var qry = new PageQuery("?id=1044&name=Rob+Reid#56"); //parse a specific string instead of using location.search
var id = qry.GetValue("id"); //get value for a parameter called id
var anc = qry.GetAnchor; //get the anchor # value if exists
var no = qry.ParamNo; //get the number of parameters
var s = qry.OutputParams();//return a formatted string for display purposes
var p = qry.ParamValues; //return the array of parameters

//loop through array of parameters
if(p){
for(var z in p){
alert("Query Parameter["+z+"] = " +p[z]);
}
}


So as you can see its a pretty simple but very flexible function that provides me with all the necessary functionality I require when handling querystrings with Javascript.

Click here to download the ParseQuery object script.