Tuesday, 20 January 2009

ISAPI URL Rewriting - Hot Linking

Banning Image Hot Linking

I was updating my ISAPI rules the other day, implementing some new rules to identify XSS hacks and new SQL injection fingerprints, when I came across some articles on the web about banning image hot linking, which I thought was a cool idea. If you have spent time and effort designing images then there is nothing worse than someone just stealing them or, even worse, hot linking to the image so that your bandwidth is also taken up hosting their images!

I implemented the rules with a redirect to a logging page that logged the referrer to my traffic DB so that I could see what sort of requests were being made for images that didn't originate from the site they belonged to. After a week or so of logging I had enough data to come to the conclusion that I wouldn't be able to implement a hot linking ISAPI rule in the majority of the commercial applications I work on. This is for the simple reason that most sites I develop send out one or more HTML emails to users, and those emails themselves hot link to images and logos, e.g. using full URLs back to the image on the webserver. Unless I could come up with a rule that could handle every type of webmail or email client, there would always be someone somewhere in the world opening up their latest site generated email and clicking the "load images" button, only to get nothing in return.

There are so many different mail clients out there, developed in hundreds of countries, that no rule could keep up with all the possibilities. From viewing my logging I had at least 2 different Polish mail clients as well as a number of Russian, Chinese and god knows what others. So if you are developing a commercial application that has to send out HTML marketing or other forms of email, I would advise against implementing a hot linking ban.

However, for small or personal sites there is nothing wrong with implementing a rule to ban hot linkers. One of the cool ideas I read about was to return a banner advert or some other imagery that would annoy the user. If you are going to use these rules, though, you should know that as soon as the linker finds out you have blocked them they will just download the image from your site and host it themselves if they really want it. From what I have read, the original reason for this type of rule was to switch it on periodically to log the sites that are linking, maybe without even blocking the image, and then to issue cease and desist orders to the offenders.


The ISAPI Rules for IIS

I have to work with IIS at my current company and we use both 32 bit and 64 bit servers which means we have different versions of the ISAPI DLL installed on each and therefore slightly different syntax. The component I use is ISAPI Rewrite.

You will notice that the rules are slightly different due to the regular expression engine differences between versions.

Both versions do a number of conditional checks to ensure that
-The referer is not blank
-The referer is not the current site, which is obviously okay
-The referer is not one of the most popular image search bots
-The referer is not an email client
-The user-agent is not a popular search engine bot
-The image in question is a gif, jpeg, jpg, png or bmp

If all those conditions are matched I redirect to a 403 forbidden page.
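
To try the checks against sample headers before putting them on a live server, the same logic can be sketched in Javascript. This is only a rough mirror of the version 3 rules shown further down and not part of the ISAPI setup: the function name isHotLink and its parameters are made up for illustration, and "mysite" is a placeholder for the real domain.

```javascript
// Hypothetical helper that mirrors the rule conditions; "mysite" is a placeholder domain.
var reImage = /\.(?:gif|jpe?g|png|bmp)/i; // unanchored, like the RewriteRule pattern
var reOwnSite = /^https?:\/\/(?:www\.)?mysite\./i;
var reSearchReferer = /^https?:\/\/(?:images\.|www\.|cc\.)?(?:cache|mail|live|google|googlebot|yahoo|msn|ask|picsearch|alexa)/i;
var reMailReferer = /^https?:\/\/.*(?:webmail|e?mail|live|inbox|outbox|junk|sent)/i;
var reBotAgent = /google|yahoo|msn|ask|picsearch|alexa|clush|botw/i;

function isHotLink(path, referer, userAgent) {
    if (!reImage.test(path)) return false;             // only image requests are checked
    if (!referer) return false;                        // a blank referer is allowed through
    if (reOwnSite.test(referer)) return false;         // our own pages are obviously okay
    if (reSearchReferer.test(referer)) return false;   // image search bots
    if (reMailReferer.test(referer)) return false;     // webmail and email clients
    if (reBotAgent.test(userAgent || "")) return false; // search engine user-agents
    return true;                                       // everything else would get the 403
}

console.log(isHotLink("/images/logo.png", "http://www.evil.example/page.html", "Mozilla/5.0")); // true
console.log(isHotLink("/images/logo.png", "http://webmail.example.pl/inbox", "Mozilla/5.0"));   // false
```

Running a few logged referers through a helper like this also shows how loose the keyword lists are. A referer such as http://blog.example/presentation.html is let through simply because it contains "sent".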


Version 2.7 ( 32 bit server - httpd.ini)

Notice that I capture the host in the first conditional and then use a back reference to that matched value in the third conditional which ensures only hosts that are not the current site get matched.

RewriteCond Host: (.+)
RewriteCond Referer: .+
RewriteCond Referer: (?!https?://\1.*).*
RewriteCond Referer: (?!https?://(?:images\.|www\.|cc\.)?(cache|mail|live|google|googlebot|yahoo|msn|ask|picsearch|alexa)).*
RewriteCond Referer: (?!https?://.*(webmail|e?mail|live|inbox|outbox|junk|sent)).*
RewriteCond User-Agent: (?!.*(google|yahoo|msn|ask|picsearch|alexa|clush|botw)).*
RewriteRule .*\.(?:gif|jpe?g|png|bmp) /403.aspx [I,O,L]


Version 3 (64 bit server - .htaccess)

RewriteCond %{HTTP_REFERER} ^.+$
RewriteCond %{HTTP_REFERER} ^(?!https?://(?:www\.)?mysite\..*) [NC]
RewriteCond %{HTTP_REFERER} ^(?!https?://(?:images\.|www\.|cc\.)?(cache|mail|live|google|googlebot|yahoo|msn|ask|picsearch|alexa).*) [NC]
RewriteCond %{HTTP_REFERER} ^(?!https?://.*(webmail|e?mail|live|inbox|outbox|junk|sent).*) [NC]
RewriteCond %{HTTP_USER_AGENT} ^(?!.*(google|yahoo|msn|ask|picsearch|alexa|clush|botw).*) [NC]
RewriteRule .*\.(jpe?g|png|gif|bmp) /403.aspx [NC,L]


Quirks and Differences

As always, nothing is ever simple, especially when you want to implement the same rule across two different versions of an application. As well as the obvious syntax differences with the flags and HTTP header names, I found the following:

  • Using the IIS converter tool to convert the 2.7 rules to version 3 did not convert all the rules. It could not handle the back references, which is why I explicitly match the site domain in the v3 rules.
  • The negative lookahead asserts differ between versions. I could not get them working in 2.7 without putting the trailing .* outside the grouping, e.g. (?!https?://\1.*).* whereas in v3 they are within the grouping, e.g. ^(?!https?://(?:www\.)?mysite\..*)
  • The documentation recommends NOT using ^ and $ in rules that have conditions, because internally all the conditions are combined together to create one rule and this can lead to unexpected behaviour.
  • I could not get the Ignore Case flag [I] working in version 2 with the negative lookaheads. As soon as I added the flag the rules would not match. This does not seem to be a problem in version 3, where the equivalent flag [NC] works fine.
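
Where a negative lookahead is allowed to start matching makes a big difference to the result. The sketch below uses Javascript's regular expression engine purely as an illustration. I am not claiming either version of ISAPI Rewrite behaves exactly like this internally, and "mysite" is again a placeholder domain.

```javascript
// Illustration only: Javascript's engine, not ISAPI Rewrite's. "mysite" is a placeholder.
var ownReferer = "http://www.mysite.com/gallery.html";
var otherReferer = "http://www.example.org/forum.html";

// Anchored: the lookahead is only ever tried at the very start of the string.
var anchored = /^(?!https?:\/\/(?:www\.)?mysite\.)/i;

// Unanchored: when the lookahead fails at position 0 the engine retries from the
// second character, where the text no longer starts with "http", so it succeeds.
var unanchored = /(?!https?:\/\/(?:www\.)?mysite\.).*/i;

console.log(anchored.test(ownReferer));    // false - our own site is recognised
console.log(anchored.test(otherReferer));  // true  - a foreign referer
console.log(unanchored.test(ownReferer));  // true  - the check is silently defeated
```

So in an engine that searches for a match anywhere in the string, a negative lookahead needs to be pinned to the start to mean "does not begin with".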

Apart from those quirks, both sets of rules have been tested on live systems and work fine. However, if you are going to use rules such as these you will always run into problems with the new mail clients and user-agents that come along oh so very frequently. Unless you are going to constantly update your ini files, you will have a considerable percentage of false positives.

The full ISAPI Rewrite documentation can be found here: http://www.isapirewrite.com/docs/

Monday, 19 January 2009

Cool Javascript regular expressions

Using Lambda Functions for HTML Parsing

One of the cool features of Javascript that made me scratch my head when I first came across it, but that I now love to bits, is the ability to use lambda expressions. A lambda expression is basically an anonymous function that you can pass as the argument to another function. This is best seen with the replace method, where you can use a function as the replacement value for each match of a pattern, e.g.

somevar = something.replace(/pattern/, function(match, submatch){
    if(/another pattern/.test(submatch)){
        return match;
    }else{
        return "";
    }
});
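
To make that concrete, here is a small runnable sketch of the same shape. The sample string and both patterns are made up for illustration: bracketed tokens are kept only when the captured sub match is numeric.

```javascript
// Made-up sample data: keep bracketed tokens only when their contents are digits.
var text = "ids: [123][home][45]";

var cleaned = text.replace(/\[([^\]]+)\]/g, function (match, submatch) {
    if (/^\d+$/.test(submatch)) {
        return match; // replace the match with itself, i.e. keep it
    } else {
        return "";    // replace the match with nothing, i.e. remove it
    }
});

console.log(cleaned); // ids: [123][45]
```

The function is called once per match, with the whole match as the first argument and each captured group after it.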

One of the ways that I have found to use this feature is within my WYSIWYG widget to parse user generated HTML content and to strip out any HTML tags or attributes that are not allowed to be entered.

The function starts off with a regular expression that matches all HTML tags and also provides a grouping that returns the actual HTML tag name.

theHTML = theHTML.replace(/<[/]?([^> ]+)[^>]*>/g, function(match,HTMLTag)

I can then use a function as my replacement value that will either return an empty string, removing the whole tag if it's not allowed, or otherwise run another replacement to handle the attributes.

match = match.replace(/ ([^=]+)="[^"]*"/g, function(match2, attributeName)

This function does a similar job, replacing the attribute with an empty string if it's not allowed or otherwise returning the whole match, i.e. the attribute/value pair. This ends up being a very cool way of parsing HTML content using the power of regular expressions.
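
To see what the two patterns above actually hand to their replacement functions, this little sketch collects the captured names instead of filtering anything. The sample markup and the two array names are made up for the demonstration.

```javascript
// Made-up sample markup; the two patterns are the ones used in the post.
var sample = '<p class="intro" onclick="evil()">Hi <b>there</b></p>';

// Collect every tag name the first pattern captures.
var tagNames = [];
sample.replace(/<[/]?([^> ]+)[^>]*>/g, function (match, HTMLTag) {
    tagNames.push(HTMLTag);
    return match; // leave the HTML unchanged, we only want to look
});
console.log(tagNames.join(",")); // p,b,b,p

// Collect every attribute name the second pattern captures from one tag.
var attrNames = [];
'<p class="intro" onclick="evil()">'.replace(/ ([^=]+)="[^"]*"/g, function (match2, attributeName) {
    attrNames.push(attributeName);
    return match2;
});
console.log(attrNames.join(",")); // class,onclick
```

One thing worth knowing is that the attribute pattern only matches values wrapped in double quotes. An attribute written with single quotes or no quotes at all is never passed to the function and so is left in place.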

The whole function is below:

// Set up my regular expressions that will match the HTML tags and attributes that I want to allow
var reAllowedAttributes = /^(face|size|style|dir|color|id|class|alignment|align|valign|rowspan|colspan|width|height|background|cellspacing|cellpadding|border|href|src|target|alt|title)$/i;
var reAllowedHTMLTags = /^(h1|h2|a|img|b|em|li|ol|p|pre|strong|ul|font|span|div|u|sub|sup|table|tbody|blockquote|tr|td)$/i;

function ParseHTML(theHTML){
    // Start off with a test to match all HTML tags and a group for the tag name, which is passed in as an extra parameter
    theHTML = theHTML.replace(/<[/]?([^> ]+)[^>]*>/g, function(match,HTMLTag)
    {
        // If the HTML tag does not match our list of allowed tags return an empty string, which will be used as
        // a replacement for the pattern in our initial test.
        if(!reAllowedHTMLTags.test(HTMLTag)){
            return "";
        }else{
            // The HTML tag is allowed so check the attributes within the tag

            // Only certain attributes are allowed so we do another replace statement looking for attributes and using another
            // function for the replacement value.
            match = match.replace(/ ([^=]+)="[^"]*"/g, function(match2, attributeName)
            {
                // If the attribute matches our list of allowed attributes we return the whole match string
                // so we replace our match with itself, basically allowing the attribute.
                if(reAllowedAttributes.test(attributeName)){
                    return match2;
                }else{
                    return ""; // not allowed so return blank string to wipe out the attribute value pair
                }
            });

        }
        return match;

    }); //end of the first replace

    //return our cleaned HTML
    return theHTML;
}

Another good thing about this feature is that, as well as the match string being passed in to the replacement function as a parameter, any number of sub groups are passed in as extra parameters. So using my ParseHTML function as an example again, instead of only capturing the attribute name in my check for valid attributes I could also capture the attribute value and have that passed as an extra parameter to my replacement function like so:

match = match.replace(/ ([^=]+)="([^"]*)"/g, function(match2, attributeName, attributeValue)

So you could test for the validity of the supplied values if you wanted to. Maybe if you were allowing the class attribute you would want to check to make sure only certain class names were used.
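
As a rough sketch of that class name idea, something like the following would do. The whitelist reAllowedClasses and the function name filterClassAttributes are invented for the example and are not part of my widget.

```javascript
// Hypothetical whitelist of class names; change it to suit your own stylesheet.
var reAllowedClasses = /^(intro|note|warning)$/i;

// Hypothetical helper: takes one HTML tag and strips class names that are not whitelisted.
function filterClassAttributes(tag) {
    return tag.replace(/ ([^=]+)="([^"]*)"/g, function (match2, attributeName, attributeValue) {
        if (attributeName.toLowerCase() !== "class") {
            return match2; // not a class attribute, so leave it for the normal attribute check
        }
        // Keep only the class names that appear on the whitelist.
        var kept = attributeValue.split(/\s+/).filter(function (name) {
            return reAllowedClasses.test(name);
        });
        // Rebuild the attribute, or drop it completely if nothing survived.
        return kept.length ? ' class="' + kept.join(" ") + '"' : "";
    });
}

console.log(filterClassAttributes('<p class="intro hacked" id="x">')); // <p class="intro" id="x">
console.log(filterClassAttributes('<p class="hacked">'));              // <p>
```

The same approach would work for any other attribute whose values you want to restrict, such as target or align.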

This is brilliant for use in client side widgets and also as server side code for parsing user supplied HTML content. Remember, even if you are using crusty ASP classic and writing your code in VB Script, which has a really poor Regular Expression engine compared to Javascript, you can still make use of this cool feature, as there is nothing stopping you mixing and matching VB Script and Javascript on the server.