Monday, 19 January 2009

Cool Javascript regular expressions

Using Lambda Functions for HTML Parsing

One of the cool features of JavaScript that made me scratch my head when I first came across it, but which I now love to bits, is the ability to use lambda expressions. A lambda expression is an anonymous function, and because functions are values in JavaScript you can pass one as the argument to another function. This is best seen with the replace method, where you can use a function as the replacement value for each match, e.g.

// submatch receives the text of the first capture group, so the pattern needs one
somevar = something.replace(/(pattern)/, function(match, submatch){
    if(/another pattern/.test(submatch)){
        return match;
    }else{
        return "";
    }
});
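As a concrete, runnable illustration of the same shape, here is a minimal sketch that fills in {name} placeholders and leaves unknown ones untouched. The names values and fillTemplate are made up for this example and are not part of the widget code.

```javascript
// Hypothetical lookup table used only for this illustration.
var values = { user: "Rob", day: "Monday" };

function fillTemplate(template) {
    // match is the whole placeholder, e.g. "{user}"; name is the capture group, e.g. "user".
    return template.replace(/\{(\w+)\}/g, function (match, name) {
        if (Object.prototype.hasOwnProperty.call(values, name)) {
            return values[name]; // known name: substitute its value
        }
        return match; // unknown name: return the match to leave it as it was
    });
}

var filled = fillTemplate("Hello {user}, happy {day}! {unknown}");
// filled is "Hello Rob, happy Monday! {unknown}"
```

Returning match unchanged is the trick that lets the function decide case by case whether anything gets replaced at all.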

One of the ways that I have found to use this feature is within my WYSIWYG widget to parse user generated HTML content and to strip out any HTML tags or attributes that are not allowed to be entered.

The function starts off with a regular expression that matches all HTML tags and also provides a grouping that returns the actual HTML tag name.

theHTML = theHTML.replace(/<[/]?([^> ]+)[^>]*>/g, function(match,HTMLTag)
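To see what the capture group actually hands to the function, this small sketch collects the tag names instead of replacing anything. The helper name listTagNames is an assumption made for the example.

```javascript
function listTagNames(html) {
    var names = [];
    // Same pattern as above; the callback is used only to record the captured tag name.
    html.replace(/<[/]?([^> ]+)[^>]*>/g, function (match, tagName) {
        names.push(tagName);
        return match; // leave the HTML unchanged
    });
    return names;
}

var tagNames = listTagNames('<p class="x">Hi <b>there</b></p>');
// tagNames is ["p", "b", "b", "p"]: opening and closing tags both yield the bare name
```

One thing to be aware of is that a self-closing tag written without a space, such as <br/>, is captured as "br/" because the slash is not a space or a closing bracket.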

I can then use a function as my replacement value that either returns an empty string, removing the whole tag if it's not allowed, or otherwise runs another replacement to handle the attributes.

match = match.replace(/ ([^=]+)="[^"]*"/g, function(match2, attributeName)

This function does a similar job: it replaces the attribute with an empty string if it's not allowed, or otherwise returns the whole match, which is the attribute/value pair, so that it is kept as it was. This ends up being a very cool way of parsing HTML content using the power of regular expressions.
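The attribute step can be tried on its own against a single tag. This is only a sketch: the short allow list and the name stripAttributes are assumptions for the example, not the list used in the widget.

```javascript
// Hypothetical, deliberately short allow list for the illustration.
var allowedNames = /^(href|title)$/i;

function stripAttributes(tag) {
    // match2 is the whole ' name="value"' pair; attributeName is the captured name.
    return tag.replace(/ ([^=]+)="[^"]*"/g, function (match2, attributeName) {
        return allowedNames.test(attributeName) ? match2 : "";
    });
}

var cleanedTag = stripAttributes('<a href="/home" onclick="steal()" title="Home">');
// cleanedTag is '<a href="/home" title="Home">'
```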

The whole function is below:

// Set up my regular expressions that will match the HTML tags and attributes that I want to allow
var reAllowedAttributes = /^(face|size|style|dir|color|id|class|alignment|align|valign|rowspan|colspan|width|height|background|cellspacing|cellpadding|border|href|src|target|alt|title)$/i;
var reAllowedHTMLTags = /^(h1|h2|a|img|b|em|li|ol|p|pre|strong|ul|font|span|div|u|sub|sup|table|tbody|blockquote|tr|td)$/i;

function ParseHTML(theHTML){
    // Start off with a test to match all HTML tags and a group for the tag name which we pass in as an extra parameter
    theHTML = theHTML.replace(/<[/]?([^> ]+)[^>]*>/g, function(match,HTMLTag)
    {
        // If the HTML tag does not match our list of allowed tags return an empty string, which will be used as
        // the replacement for the pattern in our initial test.
        if(!reAllowedHTMLTags.test(HTMLTag)){
            return "";
        }else{
            // The HTML tag is allowed so check the attributes within the tag.

            // Only certain attributes are allowed so we do another replace looking for attributes and using another
            // function for the replacement value. Note that this pattern only matches double-quoted values, so
            // single-quoted or unquoted attributes are left as they are.
            match = match.replace(/ ([^=]+)="[^"]*"/g, function(match2, attributeName)
            {
                // If the attribute matches our list of allowed attributes we return the whole match string
                // so we replace our match with itself, basically allowing the attribute.
                if(reAllowedAttributes.test(attributeName)){
                    return match2;
                }else{
                    return ""; // not allowed so return blank string to wipe out the attribute/value pair
                }
            });

        }
        return match;

    }); //end of the first replace

    //return our cleaned HTML
    return theHTML;
}

Another good thing about this feature is that, as well as receiving the matched string as its first parameter, the replacement function also receives each sub group as an extra parameter, in order. So, using my ParseHTML function as an example again, instead of only capturing the attribute name in my check for valid attributes I could also capture the attribute value and have it passed to my replacement function as another parameter, like so:

match = match.replace(/ ([^=]+)="([^"]*)"/g, function(match2, attributeName, attributeValue)

That way you could test the validity of the supplied values if you wanted to. For example, if you were allowing the class attribute you might want to check that only certain class names were used.
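A minimal sketch of that class name check might look like the following. The list of permitted class names and the function name filterClassAttribute are invented for the example; other attributes are simply passed through here so that the value check stands out.

```javascript
// Hypothetical list of class names the editor is allowed to emit.
var allowedClassNames = /^(intro|note|warning)$/;

function filterClassAttribute(tag) {
    return tag.replace(/ ([^=]+)="([^"]*)"/g, function (match2, attributeName, attributeValue) {
        if (attributeName.toLowerCase() !== "class") {
            return match2; // not a class attribute: leave it alone in this sketch
        }
        // Keep only the class names on the allow list.
        var kept = attributeValue.split(/\s+/).filter(function (name) {
            return allowedClassNames.test(name);
        });
        // Drop the attribute completely when nothing valid is left.
        return kept.length ? ' class="' + kept.join(" ") + '"' : "";
    });
}

var filteredTag = filterClassAttribute('<p class="note evil" id="x">');
// filteredTag is '<p class="note" id="x">'
```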

This is brilliant for use in client side widgets and also as server side code for parsing user supplied HTML content. Remember, even if you are using crusty ASP classic and writing your code in VBScript, which has a really poor regular expression engine compared to JavaScript, you can still make use of this cool feature, as there is nothing stopping you mixing and matching VBScript and JavaScript on the server.

2 comments:

Yogesh Khadayate said...

How to retrieve title attribute of 'SPAN' html tag using Javascript.

Rob Reid said...

What is wrong with using getAttribute? No need for regular expressions unless I am missing something?

<span id="spanID" title="This is my span">hello there</span>

// code to get title
var title = document.getElementById('spanID').getAttribute("title");

alert("the title for the span is " + title);