Friday 25 May 2012

Debugging Regular Expressions

Regular Expressions - A guide to debugging

Regular expressions are a powerful tool to master and a great skill to learn as a good expression can save many lines of procedural code. They can be used for many tasks and are especially great for validating user input such as form validation as well as great for cleaning content such as HTML entered from a content management system. When used with remote content grabbers they are great for extracting sections of text from larger pieces as well as useful for quickly reformatting it.

Although a useful and powerful tool when used correctly its also quite easy to use them either accidentally or on purpose for destructive purposes. For example a sure sign that a regular expression has gone haywire and needs looking at is when the CPU on the computer running the expression suddenly spikes to 100%.

This can often happen for numerous reasons but a couple of the most common are when the expression gets itself into an internal loop because of a partial match that then matches multiple sections of text, a complicated pattern that matches nothing or because of a negative lookahead. All of these problems may not be spotted if the text you are matching against is quite small but when the text is quite big the problem amplifies. In fact some people have designed complicated patterns that are specifically created to match nothing but consume CPU whilst doing so. For more information about the dangers of regular expressions and SQL Denial of Service Attacks read the following two articles.


Common Problems

Sometimes trying to work out what complex regular expressions are doing can be quite hard especially if you didn't write it in the first place. For a novice looking at an expression like the following it could be quite confusing trying to work out what is going on:
var re_getme = /^<([^> ]+)[^>]*>(?:.|\n)+?<\/\1>$|^(\#?([-\w]+)|\.(\w[-\w]+))$/i
Therefore when you get stuck with a complex expression its a good idea to break it down into tiny little steps and build it up bit by bit. If your pattern isn't matching what you expect then comment it out and start a new expression. Start with a cut down simple version of your pattern and get it matching or replacing something and then build it up from there.

Even for performance reasons it might be worthwhile considering breaking a complicated expression down into multiple expressions. If you are using a pattern to replace certain content then using multiple expressions is a good idea especially if your existing expression contains lots of OR statements or negative matches.

Multiple ways to skin a cat

The great thing about regular expressions however is that they offer you the ability to handle a task in multiple ways. For example both of the following expressions look for HTML tags:
var reHTML1 = /<[^>]+?>/;

var reHTML2 = /<(.|\n)+?>/;
The first one matches an open bracket and then uses a negative match to look for any character apart from the closing tag one or more times before then matching the closing tag. The second one matches an open bracket and then any character or any space one or more times and then a close tag.

Therefore if you are getting stuck and have spent a lot of time trying to crack an expression one way but its not just working try to attack it from a different angle.

Negative Matching

Trying to do negative matching can be quite complicated especially if you trying to match a string rather than one character or a group of characters. For example this negative expression just removes any characters that are not between A to Z or 1 to 9.
[^A-Z1-9]+
Because this expression doesn't require any ordering of the characters its not matching its not hard for the engine to carry out. The following expression is a bit more complicated as its doing a negative lookahead to make sure the text is not one of the specified strings in the OR list.
(?!https?://.*(webmail|e?mail|live|inbox|outbox|junk|sent)).*

Expressions that carry out negative lookaheads are okay as long as the text being searched is not too large however trying to do complex lookaheads with large pieces of text is a sure way of maxing out your CPU.

Using Placeholders

If you are trying to carry out replacements of text that involve a negative match then you may want to consider using a placeholder to make your code simpler and perform better. Instead of trying to negatively match a string that you don't want to replace you do a positive match on this string replacing it with a placeholder then do your replacement before putting the value back in.

For example in my HTMLEncoder object I have a method that HTML Encodes text passed to it. When it comes to replacing ampersands with entities I cannot do a straight replacement as I want to keep any ampersands already in the text that are there as part of an existing numerical or HTML entity as I cannot guarantee the text I am encoding doesn't already contain some encoded characters. Therefore instead of trying to do a negative match for &# I do a positive match replacing that string with a placeholder. I then do my replacement of all other & ampersands before putting back my placeholder values.
 
// replace ampersands from existing numerical entities with a placeholder
s = s.replace(/&#/g,"##AMPHASH##");

// replace all remaining ampersands with their entity version
s = s.replace(/&/g, "&amp;");

// put back in my placeholder values for numeric entities
s = s.replace(/##AMPHASH##/g,"&#");

Re-use objects and Cache Expressions

If you using a regular expression object and carrying out multiple expressions or executing the same expression multiple times then you should look into caching the expression if possible. If you are using wrapper functions to carry out certain tasks then try not to create a new regular expression object on each call to the function. The creation and destruction of a COM object can be very expensive so try to use global objects that are created once and used by all your functions and then destroyed at the bottom of your page or script.

Differences in Regular Expression Engine

Although most languages have inbuilt support for regular expressions the quality of their engines differ as well as the syntax of the expressions themselves. I have found Microsoft's Visual Basic Regular Expression Engine is very poor and doing complex expressions in VBScript or ASP Classic can often cause CPU spikes on the web server. I have also seen certain patterns that work in other languages cause problems with this engine when the text being matched contained Unicode characters.

The JScript engine is a lot better than the VBScript engine and if you are using ASP classic then I would recommend using server-side JScript for any complex regular expressions. Not only is the engine better you will have much more flexibility with your code such as using lambda expressions and being able to pass in a function as the argument for the replacement value.

In SQL Server there is no inbuilt regular expression function like there is in MySQL but you can use LIKE or PATINDEX for basic patterns or if you are using SQL 2005 - 2008 you can extend the CLR and build some user defined functions that hook into .NET to allow you to carry out proper regular expression functions. In SQL 2000 you can use OLE and the built in extended stored procedures sp_OACreate, sp_OAsetProperty and sp_OAMethod to instantiate the VBScript.RegExp COM object to carry out regular expression tasks. However this method can be quite an overhead and should be used carefully if required especially if the create and destroy methods are wrapped in a UDF as it will mean a new COM object is created every time the function is called and we all know object re-use is a key ingredient when it comes to performance especially with COM objects.

Knowing when you have a problem

If your regular expression code is client side Javascript or VBScript then a good indicator that you have a problem is when your browser hangs or pops up an unresponsive pop-up. Check your task manager and investigate the CPU usage as a long running expression that is killing your machine will surely be maxing out your PC's CPU.

If the code is server-side either in a database or web server then if you are viewing your servers performance monitor and you notice the CPU jump up in blocks of 25% (on a quad), or 50% (on a dual) or 100% (single), stay there for a few seconds and then drop down again then the chances are its regular expressions doing the maxing.

No comments: