Tuesday, 20 January 2009

ISAPI URL Rewriting - Hot Linking

Banning Image Hot Linking

I was updating my ISAPI rules the other day, adding some new rules to identify XSS hacks and new SQL injection fingerprints, when I came across some articles on the web about banning image hot linking, which I thought was a cool idea. If you have spent time and effort designing images then there is nothing worse than someone stealing them or, even worse, hot linking to the image so that your bandwidth is also used up hosting their images!

I implemented the rules with a redirect to a logging page that recorded the referrer to my traffic database so that I could see what sort of requests were being made for images that didn't originate from the site they belonged to. After a week or so of logging I had enough data to conclude that I wouldn't be able to implement a hot linking ISAPI rule in the majority of the commercial applications I work on, for the simple reason that most sites I develop send out one or more HTML emails to users, and those emails themselves hot link to images and logos, i.e. they use full URLs back to the images on the web server. Unless I could come up with a rule that could handle every type of webmail or email client, there would always be someone somewhere in the world opening up their latest site-generated email, clicking the "load images" button and getting nothing in return.

There are so many different mail clients out there, developed in hundreds of countries, that no rule could keep up with all the possibilities. From my logging I saw at least two different Polish mail clients as well as a number of Russian, Chinese and god knows what else. So if you are developing a commercial application that has to send out HTML marketing or other forms of email, I would advise against implementing a hot linking ban.

However for small or personal sites there is nothing wrong with implementing a rule to ban hot linkers. One of the cool ideas I read about was to return a banner advert or some other image that would annoy the linker. However, if you are going to use these rules you should know that as soon as the linker finds out you have blocked them they will just download the image from your site and host it themselves if they really want it. The original reason for this type of rule, from what I have read, is to run it periodically to log the sites that are linking to you (perhaps without even blocking the image) and then issue cease and desist orders to the offenders.


The ISAPI Rules for IIS

I have to work with IIS at my current company and we use both 32 bit and 64 bit servers which means we have different versions of the ISAPI DLL installed on each and therefore slightly different syntax. The component I use is ISAPI Rewrite.

You will notice that the rules are slightly different due to the regular expression engine differences between versions.

Both versions do a number of conditional checks to ensure that:
  • The referer is not blank.
  • The referer is not the current site, which is obviously okay.
  • The referer is not one of the most popular image search bots.
  • The referer is not an email client.
  • The user-agent is not a popular search engine bot.
  • The requested image is a gif, jpeg, jpg, png or bmp.
If all those conditions are matched, the request is redirected to a 403 forbidden page.


Version 2.7 ( 32 bit server - httpd.ini)

Notice that I capture the host in the first conditional and then use a back reference to that matched value in the third conditional which ensures only hosts that are not the current site get matched.

RewriteCond Host: (.+)
RewriteCond Referer: .+
RewriteCond Referer: (?!https?://\1.*).*
RewriteCond Referer: (?!https?://(?:images\.|www\.|cc\.)?(cache|mail|live|google|googlebot|yahoo|msn|ask|picsearch|alexa)).*
RewriteCond Referer: (?!https?://.*(webmail|e?mail|live|inbox|outbox|junk|sent)).*
RewriteCond User-Agent: (?!.*(google|yahoo|msn|ask|picsearch|alexa|clush|botw)).*
RewriteRule .*\.(?:gif|jpe?g|png|bmp) /403.aspx [I,O,L]


Version 3 (64 bit server - .htaccess)

RewriteCond %{HTTP_REFERER} ^.+$
RewriteCond %{HTTP_REFERER} ^(?!https?://(?:www\.)?mysite\..*) [NC]
RewriteCond %{HTTP_REFERER} ^(?!https?://(?:images\.|www\.|cc\.)?(cache|mail|live|google|googlebot|yahoo|msn|ask|picsearch|alexa).*) [NC]
RewriteCond %{HTTP_REFERER} ^(?!https?://.*(webmail|e?mail|live|inbox|outbox|junk|sent).*) [NC]
RewriteCond %{HTTP_USER_AGENT} ^(?!.*(google|yahoo|msn|ask|picsearch|alexa|clush|botw).*) [NC]
RewriteRule .*\.(jpe?g|png|gif|bmp) /403.aspx [NC,L]


Quirks and Differences

As always nothing is ever simple, especially when you want to implement the same rule across two different versions of an application. As well as the obvious syntax differences with the flags and HTTP header names, I found the following:

  • Using the IIS converter tool to convert the 2.7 rules to version 3 did not convert all the rules. It could not handle the back references and therefore I explicitly match the site domain in the v3 rules.
  • The negative lookahead assertions differ between versions. I could not get them working in 2.7 without putting the trailing .* outside the grouping e.g. (?!https?://\1.*).* whereas in v3 they are within the grouping e.g. ^(?!https?://(?:www\.)?mysite\..*)
  • The documentation recommends NOT using the ^ and $ when using rules with conditions because internally all conditions are combined together to create one rule and this can lead to unexpected behaviour.
  • I could not get the Ignore Case flag [I] working in version 2.7 with the negative lookaheads. As soon as I added the flag the rules would not match. This does not seem to be a problem in version 3, where the equivalent flag [NC] works fine.

Apart from those quirks, both sets of rules have been tested on live systems and work fine. However, if you are going to use rules such as these you will always run into problems with the new mail clients and user-agents that appear oh so very frequently, and unless you constantly update your ini files you will end up with a considerable percentage of false positives.

The full ISAPI Rewrite documentation can be found here: http://www.isapirewrite.com/docs/

Monday, 19 January 2009

Cool Javascript regular expressions

Using Lambda Functions for HTML Parsing

One of the cool features that made me scratch my head when I first came across it, but which I now love to bits about Javascript, is the ability to use lambda expressions. A lambda expression is basically an anonymous function, which means you can pass a function as an argument to another function. This is best seen with the replace method, where you can use a function to generate the replacement value for a matched string, e.g.

somevar = something.replace(/pattern/, function(match, submatch){
    if(/another pattern/.test(submatch)){
        return match;
    }else{
        return "";
    }
});

One of the ways that I have found to use this feature is within my WYSIWYG widget, to parse user generated HTML content and strip out any HTML tags or attributes that are not allowed.

The function starts off with a regular expression that matches all HTML tags and also provides a grouping that returns the actual HTML tag name.

theHTML = theHTML.replace(/<[/]?([^> ]+)[^>]*>/g, function(match,HTMLTag)

I can then use a function as my replacement value that will either return an empty string, removing the whole tag if it's not allowed, or otherwise run another replacement to handle the attributes.

match = match.replace(/ ([^=]+)="[^"]*"/g, function(match2, attributeName)

This function does a similar job, replacing the attribute with an empty string if it's not allowed or otherwise returning the sub group that matches the attribute/value pair. This ends up being a very cool way of parsing HTML content using the power of regular expressions.

The whole function is below:

// Set up my regular expressions that will match the HTML tags and attributes that I want to allow
var reAllowedAttributes = /^(face|size|style|dir|color|id|class|alignment|align|valign|rowspan|colspan|width|height|background|cellspacing|cellpadding|border|href|src|target|alt|title)$/i;
var reAllowedHTMLTags = /^(h1|h2|a|img|b|em|li|ol|p|pre|strong|ul|font|span|div|u|sub|sup|table|tbody|blockquote|tr|td)$/i;

function ParseHTML(theHTML){
    // Start off with a test to match all HTML tags and a group for the tag name which we pass in as an extra parameter
    theHTML = theHTML.replace(/<[/]?([^> ]+)[^>]*>/g, function(match, HTMLTag){
        // If the HTML tag does not match our list of allowed tags return an empty string which will be used
        // as the replacement for the pattern in our initial test.
        if(!reAllowedHTMLTags.test(HTMLTag)){
            return "";
        }else{
            // The HTML tag is allowed so check the attributes within the tag.
            // Certain attributes are allowed so we do another replace statement looking for attributes and using
            // another function for the replacement value.
            match = match.replace(/ ([^=]+)="[^"]*"/g, function(match2, attributeName){
                // If the attribute matches our list of allowed attributes we return the whole match string
                // so we replace our match with itself, basically allowing the attribute.
                if(reAllowedAttributes.test(attributeName)){
                    return match2;
                }else{
                    return ""; // not allowed so return a blank string to wipe out the attribute/value pair
                }
            });
        }
        return match;
    }); //end of the first replace

    //return our cleaned HTML
    return theHTML;
}

Another good thing about this feature is that as well as being able to pass the matched string into the replacement function as a parameter, you can also pass in any number of sub groups as extra parameters. So, using my ParseHTML function as an example again, instead of only capturing the attribute name in my check for valid attributes I could also capture the attribute value and then pass that as an extra parameter to my replacement function like so:

match = match.replace(/ ([^=]+)="([^"]*)"/g, function(match2, attributeName, attributeValue)

So you could test the validity of the supplied values if you wanted to. For example, if you were allowing the class attribute you might want to check that only certain class names are used, as in the sketch below.
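As a rough sketch of that idea (the reAllowedClasses whitelist and its values are made up purely for illustration; the attribute regex is the same one used above):

// hypothetical whitelist of permitted class names
var reAllowedClasses = /^(note|warning|codeblock)$/i;

match = match.replace(/ ([^=]+)="([^"]*)"/g, function(match2, attributeName, attributeValue){
    // the class attribute is only kept if its value is on the approved list
    if(/^class$/i.test(attributeName)){
        return reAllowedClasses.test(attributeValue) ? match2 : "";
    }
    // any other attribute still goes through the normal whitelist check
    return reAllowedAttributes.test(attributeName) ? match2 : "";
});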

This is brilliant for use in client side widgets and also in server side code for parsing user supplied HTML content. Remember, even if you are using crusty ASP classic and writing your code in VB Script, which has a really poor regular expression engine compared to Javascript, you can still make use of this feature because there is nothing stopping you mixing and matching VB Script and Javascript on the server.
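For example, a minimal sketch of what that mixing looks like in a classic ASP page (the function and form field names here are made up): a JScript routine declared in a server-side script block can be called directly from the VBScript that drives the page.

<%@ Language="VBScript" %>
<script language="JScript" runat="server">
// JScript does the regular expression work
function StripScriptTags(html){
    return html.replace(/<script[\s\S]*?<\/script>/gi, "");
}
</script>
<%
'VBScript calling the JScript function defined above
Dim cleaned
cleaned = StripScriptTags(Request.Form("txtContent"))
Response.Write Server.HTMLEncode(cleaned)
%>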

Monday, 1 December 2008

Job Rapists and other badly behaved bots

How bad does a bot have to be before banning it?

I had to ban a particularly naughty crawler the other day that belonged to one of the many job aggregators that spring up ever so regularly. This particular bot belonged to a site called Jobrapido, which one of my customers calls "that job rapist" for the very good reason that it seems to think that any job posted on a jobboard is theirs to take without asking either the job poster or the owner of the site it's posted on. Whereas the good aggregators such as JobsUK require the job poster or jobboard to submit a regular XML feed of their jobs, Jobrapido just seems to endlessly crawl every site they know about and take whatever they find. In fact on their home page they state the following:

Do you know a job site that does not appear in Jobrapido? Write to us!

Isn't that nice? Anyone can help them thieve, rape and pillage another site's jobs. Maybe there is some bonus point system for regular informers. Times are hard as we all know and people have to get their money where they can! However, it shouldn't be by taking other people's content unrequested.

This particular bot crawls so much that it actually made one of our tiniest jobboards appear in our top 5 rankings one month purely from its crawling alone. The site had fewer than 50 jobs but a lot of categories, which meant that this bot decided to visit each day and crawl every single search combination: 50,000 page loads a day!

Although this Jobrapido site did link back to the original site that posted the job so the jobseeker could apply, the number of referrals was tiny (150 a day across all 100+ of our sites) compared to the huge amount of bandwidth it was stealing from us (16-20% of all traffic). It was regularly ranked above Googlebot, MSN and Yahoo as the biggest crawler in my daily server reports, as well as being the biggest hitter (page requests / time).

So I tried banning it using robots.txt directives, as any legal, well behaved bot should pay attention to that file. However, 2 weeks after adding all 3 of their agents to the file they were still paying us visits each day, and since no bot should cache the file for that length of time I banned it using the htaccess file instead.

So if you work in the jobboard business and don't want a particularly heavy bandwidth thief, content scraper and robots.txt ignorer hitting your sites every day, then do yourself a favour and ban the following agents and IP addresses (a sketch of the sort of rule I mean follows the list):

Mozilla/5.0 (compatible; Jobrapido/1.1; +http://www.jobrapido.com)

Mozilla/5.0 (JobRapido WebPump)

85.214.130.145
85.214.124.254
85.214.20.207
85.214.66.152
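For what it's worth, the sort of .htaccess rule I mean looks roughly like the sketch below (ISAPI Rewrite 3 / mod_rewrite style syntax; treat it as a starting point rather than the exact rule I run):

# Return 403 Forbidden to the Jobrapido agents or their known IP addresses
RewriteCond %{HTTP_USER_AGENT} jobrapido [NC,OR]
RewriteCond %{REMOTE_ADDR} ^85\.214\.(130\.145|124\.254|20\.207|66\.152)$
RewriteRule .* - [F]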

If you want a good UK job aggregator that doesn't pinch sites' jobs without asking first then use JobsUK. It's simple to use and has thousands of new jobs from hundreds of jobboards added every day.

Adding Remote URL content to Index

Indexing a remote URL for use in a knowledge base

I have just completed work on a small knowledge base built in ASP.NET that has a few quite funky features, one of which is the ability to add an article into the system that lives at a remote location. Most of the articles revolve around written content or files which are attached to the articles, but sometimes users come across an article on the web that they think would be great to add to the system, and they want it to be indexed and searchable just like any other article. In my previous incarnation of this, which I hastily wrote one night back in the late 90's in classic ASP, you could add a URL but the only indexable content that could be used to find it in the knowledge base was the tag words I allowed the user to add alongside the URL. Obviously this isn't really good enough, so in the latest version, on saving the article, I do the following:

  1. Check the URL looks valid using a regular expression.
  2. Access the URL through a proxy server and return the HTML source.
  3. Locate and store the META keywords, description and title if they exist.
  4. Remove everything apart from the content between the opening and closing BODY tags.
  5. Strip any SCRIPT tags and anything between them from the body.
  6. Remove all remaining HTML tags.
  7. Clean the remaining content by removing noise words, numbers and swear words.
  8. Add the remaining content, which consists of good descriptive wording, to the META keywords, description and title stored earlier.
  9. Save this content to the database, which then updates the Full Text Index so that it becomes searchable by the site users.

Following this process means that I get all the benefits of having the remote article indexed and searchable without the downside of having to store the whole HTML source code. After cleaning I am left with only the core descriptive wording that is useful and do away with all the rubbish.

I will show you the two main methods, written in C#, that retrieve the URL content and clean the source.


1. Method to access remote URL through proxy server.




public static string GetURLHTML(string remoteURL, string proxyServer)
{
    string remoteURLContent = "";

    WebProxy proxy = new WebProxy(proxyServer, true); //pass the name of the proxy server
    WebRequest webReq = WebRequest.Create(remoteURL);
    webReq.Proxy = proxy; //set the request to use the proxy

    // Set the HTTP-specific UserAgent property so those sites know who's come and ripped them up
    if (webReq is HttpWebRequest)
    {
        ((HttpWebRequest)webReq).UserAgent = ".NET Framework Strategies Knowledge Base Article Parser v1.0"; //set up my useragent
    }

    WebResponse webResp;
    int responseStatusCode = 0;

    try
    {
        // Get the response instance
        webResp = (HttpWebResponse)webReq.GetResponse();

        // Read an HTTP-specific property.
        if (webResp is HttpWebResponse)
        {
            responseStatusCode = (int)((HttpWebResponse)webResp).StatusCode;
        }
    }
    catch (Exception)
    {
        return remoteURLContent;
    }

    // We can only collect HTML from valid responses so ignore 404s and 500s
    if (responseStatusCode != 200)
    {
        webResp.Close();
        return remoteURLContent;
    }

    // Get the response stream and read it to the end.
    Stream respStream = webResp.GetResponseStream();
    StreamReader reader = new StreamReader(respStream, Encoding.ASCII);
    remoteURLContent = reader.ReadToEnd();

    // Close the reader, response stream and response.
    reader.Close();
    webResp.Close();

    return remoteURLContent;
}



The reason I use a proxy is down to the security policy set on our web servers.


2. Method to gather the main content.



//When the article poster wants us to save a remote URL as the KB article content we need to get the content and parse it
protected string IndexURL(string remoteURL)
{
    string METAKeywords = "", METADescription = "", METATitle = "";
    string cleanHTML = "";
    StringBuilder indexText = new StringBuilder();

    //As I have to access all remote URLs through a proxy server I read my application setting from the web.config file
    string proxyServer = ConfigurationManager.AppSettings["ProxyServer"].ToString();

    //now access the remote URL and return the HTML source code if we can
    string remoteURLHTML = UtilLibrary.GetURLHTML(remoteURL, proxyServer);

    //if we have some HTML content to parse and clean
    if (!String.IsNullOrEmpty(remoteURLHTML))
    {
        remoteURLHTML = remoteURLHTML.ToLower(); //lower case it all as a) it doesn't matter and b) it means no ignore case options are needed in the regular expressions

        //Set up some regular expressions to help identify the META content we want to index in the source
        Regex HasKeywords = new Regex("<meta\\s+name=\"keywords\"");
        Regex HasDescription = new Regex("<meta\\s+name=\"description\"");
        Regex HasTitle = new Regex("<title>");

        //As I am using replaces to quickly return the content I require I do a test first for the relevant tag, otherwise if the source
        //doesn't contain the META tag we would be left with the whole HTML source which we obviously don't want!!
        if (HasKeywords.IsMatch(remoteURLHTML))
        {
            //get the data we require by replacing anything either side of the tag
            METAKeywords = "KEYWORDS = " + Regex.Replace(remoteURLHTML, "((?:.|\n)+?<meta\\s+name=\"keywords\"\\s+content=\")(.+)(\"(?:.|\n)+)", "$2");
        }
        if (HasDescription.IsMatch(remoteURLHTML))
        {
            METADescription = "DESCRIPTION = " + Regex.Replace(remoteURLHTML, "((?:.|\n)+?<meta\\s+name=\"description\"\\s+content=\")(.+)(\"(?:.|\n)+)", "$2");
        }
        if (HasTitle.IsMatch(remoteURLHTML))
        {
            METATitle = "TITLE = " + Regex.Replace(remoteURLHTML, "((?:.|\n)+?<title>)(.+)(<\\/title>(?:.|\n)+)", "$2");
        }

        cleanHTML = remoteURLHTML;

        //now get the main content which is between the open and close body tags
        cleanHTML = Regex.Replace(cleanHTML, "((?:.|\n)+?<body.*?>)((?:.|\n)+?)(<\\/body>(?:.|\n)+)", "$2");

        //strip any client side script by removing anything between open and close script tags
        cleanHTML = Regex.Replace(cleanHTML, "<script.*?</script>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);

        //put a gap before words that appear just before closing tags so that we keep gaps between values from listboxes
        cleanHTML = Regex.Replace(cleanHTML, "(\\w)(<\\/\\w)", "$1 $2");

        //strip HTML tags
        cleanHTML = Regex.Replace(cleanHTML, "<[^>]+?>", "");

        //Decode the HTML so that any encoded HTML entities get stripped
        cleanHTML = HttpUtility.HtmlDecode(cleanHTML);

        //now add all the content we want to index back together
        if (!String.IsNullOrEmpty(METAKeywords))
        {
            indexText.Append(METAKeywords + " ");
        }
        if (!String.IsNullOrEmpty(METADescription))
        {
            indexText.Append(METADescription + " ");
        }
        if (!String.IsNullOrEmpty(METATitle))
        {
            indexText.Append(METATitle + " ");
        }
        if (!String.IsNullOrEmpty(cleanHTML))
        {
            indexText.Append(cleanHTML);
        }
    }

    return indexText.ToString();
}


I have left out the other function that strips noise words, numbers and swear words as it's nothing special, just a couple of loops that check some arrays containing the words that need removing.
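For completeness, the shape of it is roughly the sketch below; the word lists, the method name and the decision to treat anything numeric as noise are placeholders rather than the real implementation:

//Rough sketch only - noiseWords and swearWords stand in for the real arrays
private static readonly string[] noiseWords = { "and", "the", "of", "to", "a", "an" };
private static readonly string[] swearWords = { };

protected string StripNoise(string content)
{
    StringBuilder cleaned = new StringBuilder();

    foreach (string word in content.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
    {
        //skip numbers
        double ignored;
        if (double.TryParse(word, out ignored))
        {
            continue;
        }

        //skip anything on the noise or swear lists
        if (Array.Exists(noiseWords, delegate(string w) { return w.Equals(word, StringComparison.OrdinalIgnoreCase); }) ||
            Array.Exists(swearWords, delegate(string w) { return w.Equals(word, StringComparison.OrdinalIgnoreCase); }))
        {
            continue;
        }

        cleaned.Append(word).Append(' ');
    }

    return cleaned.ToString().Trim();
}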

The performance of this method varies slightly depending on the size of the content being parsed. You could also leave the noise words and numbers in the content, as they will not get added to a Full Text Index anyway; SQL Server automatically ignores most noise words and numbers. However, if data storage is an issue you may still want to strip them so that you only save the core content to the database table.

Monday, 24 November 2008

Trying to detect spoofed user-agents

User Agent Spoofing

A lot of traffic comes from browsers that mask their real identity, either by using a different user agent string from the one really associated with the browser or by using a random string that relates to no known browser. The reasons for doing this are many, from malicious users trying to mask their real identity to get round code that bans on user-agent, to users trying to get round client side code that blocks certain browsers from using certain functionality. All of which is a good reason for using object detection rather than browser sniffing when deciding which code branch to run.

However, as you will probably know if you have tried doing anything apart from simple JavaScript coding, there are still times when you need to know the browser because object detection just isn't feasible, and trying to use object detection to work out the browser is, in my opinion, just as bad as using browser sniffing to work out the object.

Therefore when someone is using an agent switcher to mask the browser's agent and you come across one of these moments, it may cause you to run code that raises errors. There is no foolproof way to spot whether an agent is spoofed, but if you do require this information one of the things you can do is compare the agent with known objects that should be supported by that browser; if they don't match then you can confirm the spoof.

This form of spoof detection will only work if it's just the user agent string that has been changed. Some examples of the checks you can do: for agents that claim to be Opera, make sure the browser supports window.opera as well as both event models (document.addEventListener && document.attachEvent), because as far as I know it's the only browser that supports both. For IE you shouldn't check document.all by itself as Firefox will actually return true for this, but you can check for window.ActiveXObject, the non-existence of addEventListener, and use conditional comments to test for JScript. Firefox should obviously not support JScript as it uses Javascript.
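Putting those checks together, a detector might look something like the rough sketch below (it only covers the three browsers mentioned and deliberately leaves out the conditional comment / JScript test):

//Rough sketch: returns true when the user agent string disagrees with the objects the browser actually exposes
function agentLooksSpoofed(){
    var ua = navigator.userAgent.toLowerCase();

    //a real Opera exposes window.opera and supports both event models
    if(ua.indexOf("opera") != -1){
        return !(window.opera && document.addEventListener && document.attachEvent);
    }

    //a real IE (at the time of writing) supports ActiveX and does not support addEventListener
    if(ua.indexOf("msie") != -1){
        return !(window.ActiveXObject && !document.addEventListener);
    }

    //a real Firefox supports addEventListener but not the IE-only attachEvent
    if(ua.indexOf("firefox") != -1){
        return !(document.addEventListener && !document.attachEvent);
    }

    return false; //nothing conclusive either way
}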

Those are just a few checks you could do; you are basically using object detection and agent sniffing together to make sure they match. They may not tell you the real browser being masked but they can be used to tell you what it's not.

The idea of this is to make sure that in those cases where you have to branch on browser rather than object (see this previous article) you make the right choice and don't cause errors. Obviously you may decide that if the user is going to spoof the agent then you will leave them to suffer any errors that come their way.

If you do require a lightweight browser detector that checks for user agent spoofing amongst the main browsers as well as support for advanced CSS, Flash and other properties then see this article.

Tuesday, 18 November 2008

Problems upgrading to Firefox 3

Invalid Null Pointer Error relating to DesignMode and ContentEditable

I finally upgraded to Firefox 3 the other day on my work PC but only after I managed to fix an annoying bug that had suddenly appeared in my CMS WYSIWYG editor when I had upgraded at home.

The bug occurred whenever I closed my editor (display:none;) and then re-opened it on the same page. I finally tracked it down to code that initialises the editor on opening, where it sets designMode and contentEditable on. In Firefox 2 this didn't seem to cause any problems if these properties had previously been set, but in FF3 it caused invalid null pointer errors. A quick check to make sure I wasn't trying to turn something on that was already on fixed the problem, e.g:


if(self.IE && theIframe.contentWindow.document.body.contentEditable != true){
    theIframe.contentWindow.document.body.contentEditable = true;
}
if(!self.IE && theIframe.contentWindow.document.designMode != "on"){
    //in moz you cannot change design mode straight away so try and then retry
    try{
        theIframe.contentWindow.document.designMode = "on";
        theIframe.contentWindow.document.execCommand("redo", false, null);
    }catch(e){
        setTimeout(function(){self.MakeEditable(theID);}, 250);
        return false;
    }
}


UserAgentButton is null Error in Firefox 3

Anyway, I upgraded to Firefox 3 and then started getting "UserAgentButton is null" errors appearing all over my main site. The error in Firebug referred to lines in my own code, so it was quite confusing at first as I do not have any code that mentions UserAgentButton.

I disabled Firebug and the problem still persisted, until I remembered I had installed another add-on the other day that allows you to switch user agent. The plug-in is called "User Agent Switcher" version 0.6.11 and is very useful for testing sites. After disabling this add-on the problem went away. So if you see a similar problem, try disabling any add-ons first before spending any time debugging your code.


Flash Not Playing Movies Correctly

Another problem I came across after upgrading to FF3 was with Flash movies not playing correctly. They would start playing but only for a few seconds before stopping and freezing. The problem was definitely related to FF3 as they played correctly in IE and other browsers. I read a lot of "fixes" for this problem on the web but the only thing that worked successfully for me was to upgrade Flash to version 10.

I am not sure what other problems will come up and surprise me, but the Flash problem alone was reason enough not to make FF3 a level 1 supported browser on my main site. If anyone else has come across any issues relating to upgrading to Firefox 3 please let me know.

Tuesday, 4 November 2008

Foreign Keys

In my ten years of database development I have come across a number of people who did not believe in using foreign key constraints in their database systems. Some of these people have actually been my seniors at the time, and it's been very frustrating to ask why they are against enforcing referential integrity and get replies along the lines of:

"When I was first starting out and didn't know any better I used to use foreign keys but now I know what I am doing so I let the application handle it all."

What this actually means is that they were taught by someone who thought they knew better than everyone else, and unfortunately they have not come to realise that rather than being an SQL guru their teacher was just another one of those people who thought they knew better than everyone else, and they have foolishly followed in the same footsteps.

The way I see it, the whole point of a relational database is to enforce data integrity. If you just want to persist data then there are probably better and faster ways, but if you remove the relationships and constraints then you are just using your fancy and maybe expensive RDBMS as a storage system.

The analogy I use is a car, where the constraints are the safety features within the car, i.e. seat belts, airbags, ABS etc. You can build a car without all these safety features and it may run perfectly well, but if you're unfortunate enough to have a crash you are going to be in a lot of trouble and you will wish that you hadn't removed those safety features for some imaginary performance gain. The data is your car's passenger and the constraints are there to ensure that the passenger arrives in one piece. If you accidentally bump into the car in front of you the air bag will prevent a broken nose, just like a check constraint will prevent dodgy data getting inserted into your table when there is a bug in your code. You can live with bumps and bruises just like you can live with dodgy data, however life is better without a bent nose and you will get fewer customer complaints when the data they report on is guaranteed to be accurate.

The reasons I have been given for not using foreign key constraints are usually one of the following two:

"It slows your system down and you should handle your data integrity checks in the application."


"It makes my life difficult when setting up sites and transferring data between systems due to all those referential key constraint errors when I try to insert data."

I will deal with each one in turn.

1. You are not going to be able to beat the RDBMS for speed at its own game. If you are going to check data for integrity then the best place to do it is as close to the data as possible. Therefore, if you care about the quality of your data and want to check it, let the RDBMS handle it. For example, let's imagine a website that lets a user store a number of selected list options against their profile. Let's imagine we are going to let the application handle the integrity checks rather than the RDBMS and see what problems we may have.

Let's say the system has a table that holds the users who are on the website and allowed to save profiles.


CREATE TABLE USERS(UserPK int PRIMARY KEY, UserName varchar(50))

We also have a table that stores a number of locations which are displayed to the user on the profile page. The primary key is a surrogate key which is an integer.


CREATE TABLE LOCATIONS(LocationPK int PRIMARY KEY, Location varchar(10))

The user can select multiple options, so rather than hold the chosen location value in the actual USERS table we have created a separate table that gives us a ONE to MANY relationship between an individual user and their selected locations.

CREATE TABLE USER_LOCATIONS (UserFK int,
LocationFK int,
CONSTRAINT pk_USER_LOCATIONS PRIMARY KEY(UserFK,LocationFK))

As you can see we have created a table with columns that reference the primary keys in both the USERS and LOCATIONS tables, but we have not enforced referential integrity by defining them as FOREIGN KEY constraints. I have however defined a PRIMARY KEY, as this is still usually done by the majority of people even when foreign keys are forgotten about or deliberately left off.

So we are going to let our application handle all data checks. Therefore we have a web page with a list that lets our USER choose the locations to save.


<select id="lstLocation" size="1" multiple="multiple">
<option value="1">Greater London</option>
<option value="2">South East</option>
<option value="3">South West</option>
</select>

Point 1. If the list only shows options that exist in the LOCATIONS table then we do not have to worry about invalid values being saved? WRONG! HTML can be manipulated and hacked, and HTTP requests forged. Anyone with a web developer toolbar could change this list and therefore the values that the form submits.

Therefore we have a tampered POST request from the form that has supplied us with the following values: 454,676

We call our stored procedure that handles UPDATES/INSERTS to the USER_LOCATIONS table:

EXEC dbo.usp_save_user_locations @LocationIDs = '454,676';

Point 2. Within our stored procedure we only INSERT new values into the USER_LOCATIONS table if they exist in the LOCATIONS table, by doing either a LEFT JOIN or an IF EXISTS check. Yes, we could do this, but we are trying to carry out the same check that a FOREIGN KEY constraint would do anyway, only at a higher level.
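To make that concrete, the application-level check inside the procedure ends up looking something like the sketch below (illustrative only, and shown for a single @UserID/@LocationID pair rather than the comma separated list the real procedure takes):

-- Illustrative only: guard the INSERT with existence checks that a FOREIGN KEY would do for us anyway
IF EXISTS (SELECT 1 FROM USERS WHERE UserPK = @UserID)
   AND EXISTS (SELECT 1 FROM LOCATIONS WHERE LocationPK = @LocationID)
BEGIN
    INSERT INTO USER_LOCATIONS (UserFK, LocationFK)
    VALUES (@UserID, @LocationID)
END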

Point 3. We could use TRIGGERS to ensure that the data we want to insert is correct, but this is the same as point 2: we are replicating behaviour best left to the RDBMS. Some people use TRIGGERS to handle CASCADING DELETES, however the discussion of CASCADES is another topic unrelated to this basic discussion; in that case it may be a good idea as it gives you more control over the data you are removing. However, any code written by the developer to replicate basic functionality that the RDBMS can handle is always open to coding errors.

Even if you decide that using the application is the way to go then you are still left with the problem of the management console. Who has access to your database server? Even if access to your system is limited to a few trusted people, the fact remains that it's possible to bypass any checks your application may have by opening the table and editing or deleting the data by hand, whether by accident or on purpose. Using FOREIGN KEY constraints will prevent this from occurring, and this is the key point in my opinion, going back to my car analogy. It's your safety belt / airbag preventing your data from getting crushed and mangled in any database accident. Whether this accident happens on purpose or through a bug in your code, it's your last line of defence, and it's the best kind you can get because the checks are as close to the actual data as possible and therefore cover all possible points of corruption.

The correct way to save all this trouble and these potential pitfalls is to define our USER_LOCATIONS table with FOREIGN KEY relationships such as below:

CREATE TABLE USER_LOCATIONS (UserFK int CONSTRAINT fk_USER_LOCATIONS_UseFK FOREIGN KEY (UserFK) REFERENCES USERS(UserPK),
LocationFK int CONSTRAINT fk_USER_LOCATIONS_LocationFK FOREIGN KEY (LocationFK) REFERENCES LOCATIONS(LocationPK),
CONSTRAINT pk_USER_LOCATIONS PRIMARY KEY(UserFK,LocationFK))

As you can see the primary key consists of both foreign keys. One relates to the LOCATIONS table and holds the ID for the chosen location. The other key relates to the USERS table and holds the ID of the user who made the selection.

We can test this by populating our tables with some test data.

INSERT INTO USERS
SELECT 101, 'Mr Smith'

INSERT INTO LOCATIONS
SELECT 1,'Greater London'
UNION
SELECT 2, 'South East'
UNION
SELECT 3, 'South West'

Now running the following will be okay, as both a USER with a UserPK of 101 and a LOCATION with a LocationPK of 1 exist in the tables referenced by the FOREIGN KEY relationships.

INSERT INTO USER_LOCATIONS
SELECT 101, 1

However running the following SQL will result in an error being raised, as neither the USER nor the LOCATION exists in the relevant table with the corresponding value.

INSERT INTO USER_LOCATIONS
SELECT 102, 12


Msg 547, Level 16, State 0, Line 1
The INSERT statement conflicted with the FOREIGN KEY constraint "fk_USER_LOCATIONS_UseFK". The conflict occurred in database "XXX", table "dbo.USERS", column 'UserPK'.
The statement has been terminated.

As for making life difficult, well, it's supposed to, if by making life difficult you mean it's preventing you from adding or deleting data willy nilly. You cannot TRUNCATE a table engaged in FOREIGN KEY constraints, which may cause problems, but then in a production environment why would you want to do that without taking related data into consideration? You can easily create scripts to drop all your constraints and then reapply them if you needed to. There are also easy ways to disable your constraints if you do not want to enforce checks when inserting data, for example:

--Disable all table constraints
ALTER TABLE USER_LOCATIONS NOCHECK CONSTRAINT ALL

-- Enable all table constraints
ALTER TABLE USER_LOCATIONS CHECK CONSTRAINT ALL

-- Disable single constraint
ALTER TABLE USER_LOCATIONS NOCHECK CONSTRAINT fk_USER_LOCATIONS_UseFK

-- Enable single constraint
ALTER TABLE USER_LOCATIONS CHECK CONSTRAINT fk_USER_LOCATIONS_UseFK

If you're a DBA giving me this argument then I would say that you're not a proper DBA in the first place if you don't understand the importance of data in a relational database. Also, what DBA worth the name does not have to hand, or the ability to write, a simple script to drop and reapply constraints? Surely this trouble is worth it if it means your key business resource, i.e. the data, is left intact and uncompromised. If you take away the relations all you have left are isolated tables that persist data.

I will reminisce for a moment about a job I had some time back working for a telecoms company that specialised in predictive dialler software for call centers. My boss designed the database system and it was not in any way what you could call a normalised relational database. There was plenty of redundancy and lots of flat tables without a relationship in sight. His argument was that it was a real-time database that had to process thousands of calculations and DML statements a second, and therefore performance was the key issue: relationships would have slowed down INSERTS, so the denormalised schema was appropriate to the system. My job was to develop the OLTP reporting and management systems, and he used to joke about me enforcing referential integrity and how it was a rookie's mistake that experts like himself did not make. However, my other job at the company was to actually write the numerous reports for the call center customers that had to use the dialler data from his "unrelational" system. What a nightmare that was! The amount of hours I spent hunting for missing records, trying to work out why I had data for one time slot in table A but no corresponding data in table B when it should have been there. Yes, his system may have been fast, but because there were no defined relationships the data integrity was poor. I can understand wanting to make a system fast, but having a well defined relational system should not mean the opposite.

Apart from the two reasons mentioned above, I think the main reason for not using referential integrity is purely laziness, or just the fact that on a lot of sites you can get away with not doing it and no-one notices. A lot of sites are actually developed without RI, and extra coding is carried out to handle the fact that a percentage of the data is either rubbish, orphaned records or duplicates due to not having defined constraints. Systems like this, if they still care about the quality of the data, will have scheduled jobs to clean up these records, or they will have to spend extra time on reports, like I had to, to accommodate the fact that you are reporting on data that you cannot guarantee to be 100% accurate.

Remember the old saying "Rubbish In, Rubbish Out"; well, if that saying was not invented to describe relational database design then it should have been.


Further Reading

A helpful how to guide on Foreign Keys with useful SQL related to reporting on existing keys and disabling and enabling keys across databases.
http://www.sqlservercentral.com/articles/Advanced/foreignkeys/2423/

A discussion on using foreign keys or triggers for referential integrity
http://www.mssqltips.com/tip.asp?tip=1508