Tuesday 21 October 2008

Dangers of Pattern Matching

Regular Expressions and Pattern Matching

Most modern systems use some form of pattern matching in there day to day running whether its client side Javascript replace or match statements, server side regular expression objects or database residing LIKE and PATINDEX statements and since SQL 2005 the ability to run CLR based regular expressions. Now although these are very useful tools to use you must be aware that its possible to max out your web or database servers CPU by running certain types of patterns.

I will tell you a short story about a very long weekend spent in the office at the beginning of 2007. We were migrating a big system to another server and we had copied all the code and database settings up and were watching the performance of the systems. Something strange was happening every minute or so. The CPU would jump from hardly anything up to 25%. After 30 seconds it would go back down again. Sometimes it would jump to 25% and then a few seconds later to 50%. In the daytime with all the sites running full throttle at intermittent stages throughout the day the CPU would jump in blocks of 25% up to 100% and all the sites on the server would become unusable causing lots of error messages and customer phone calls. We were at a loss to what was going on. It took quite a while before someone discovered that clicking on a link on a certain company profile on a certain site would always cause the CPU to spike. The site was a jobboard and the link was to display all the job results for that company. I copied down that sites job data to our dev box and ran the same test. Instead of causing the CPU to jump to 25% it maxed straight out to 100%. This made sense as our live server had a quad core. I debugged the code and found the cause was a regular expression that was in a function that re-formatted the snippet of job description shown on the results page. Although this function worked on 99% of the data in the system this company had a couple of jobs that contained certain Unicode characters and it was obviously causing the regular expression engine to freak out. We put it down to a problem inherit in Microsoft's Regular Expression engine and rewrote the code to break the one pattern down into 3 smaller steps which worked fine. However I then read an article about catastrophic backtracking and how certain patterns that can be matched in multiple ways can cause these sorts of CPU issues as the complexity of the pattern grows exponentially. If you have a very small pattern you may not notice it but if you have long strings then you can run into these CPU nightmares.

As well as problems that maybe built into your application due to patterns that could cause these high CPU issues when a certain replace or match is carried out in a specific instance you may be opening your site up to deliberate attacks by users that take advantage of an online searching tool that uses SQL and LIKE to search records in your database. If you are not careful and screen out wildcards and other symbols used within LIKEs pattern matching or regular expression symbols (+*^?$) if you are using a CLR build proc or function in SQL 2005 then you could be vulnerable to what is known as SQL Denial of Service attacks. Again its possible to max out your database servers CPU whilst a complex LIKE statement is run against your database.


Ways to avoid complex pattern matching affecting your system

  • Avoid backtracking.
  • If you have overly complex expressions consider breaking them down into multiple smaller expressions. Expressions that include lots of nested quantifiers ?*+.
  • Avoid expressions that can be matched by multiple string patterns.
  • If you are re-using an expression then compile it if possible.
  • If you are using an inherently useless regular expression engine (VB) then make sure you re-use your objects. Create a global regular expression object variable that's instantiated on first use and then re-used on subsequent calls and then destroyed at the end of the page.
  • If you have a page that is hit frequently that uses lots of expressions for formatting such as the job results example I mentioned then consider rewriting your code so that any complex formatting is done on input rather than output. Consider creating a separate column in your table that stores the formatted version which can be accessed quickly whenever it needs to be viewed. The performance gained by not having to run complex pattern matching constantly on each page load will be considerable.
  • If you are taking input from a user and using it in pattern matching in SQL LIKE or regular expression database searches then sanitise all input to make sure you do not fall victim to an SQL denial of service attack.
  • Also consider rewriting any database searching to use a full text index instead of LIKE searches. You will be able to offer your users a more feature rich and comprehensive search facility.

So now you know to be on the lookout for spikes in CPU. If you see the CPU in performance monitor jump up to 25% on a quad, 50% on dual or 100% in one leap then it could well be down to a regular expression pattern somewhere in your application that needs investigating.


Links

SQL Server Denial of Service Attack (LIKE Attack)

Detailed description of how backtracking can cause CPU issues

Catastrophic Backtracking

No comments: