Tuesday 4 November 2008

Foreign Keys

In my ten years of database development I have come across a number of people who did not believe in using foreign key constraints in their database systems. Some of these people have actually been my senior at the time and its been very frustrating when I have asked why they are against enforcing referential integrity and they give replies along the lines of:

"When I was first starting out and didn't know any better I used to use foreign keys but now I know what I am doing so I let the application handle it all."

What it actually means is that they were taught by someone who thought they knew better than everyone else and unfortunately have not come to realise that rather than being an SQL guru their teacher was just another one of those people who thought they knew better than everyone else and they have foolishly followed in the same footsteps.

The way I see it is that the whole point of a is to enforce data integrity if you just want to persist data then there are probably better and faster ways but if you remove the relationships and constraints then you are just using your fancy and maybe expensive RDBMS as a storage system.

The analogy I use for is a car where the constraints are the safety features within the car i.e seat belts, airbags, ABS etc. You can build a car without all these safety features and it may run perfectly well but if you're unfortunate to have a crash you are going to be in a lot of trouble and you would wish that you hadn't removed those safety features for some imaginary performance gain. The data is your cars passenger and the constraints are there to ensure that the passenger arrives in one piece. If you accidentally bump into the car in front of you the air bag will prevent a broken nose just like if you have a bug in your code a check constraint will prevent dodgy data getting inserted into your table. You can live with bumps and bruises just like you can live with dodgy data however life is better without a bent nose and you will get less customer complaints when the data they report on is guaranteed to be accurate.

The reasons I have been given for not using foreign key constraints are usually one of the following two:

"It slows your system down and you should handle your data integrity checks in the application."


"It makes my life difficult when setting up sites and transferring data between systems due to all those referential key constraint errors when I try to insert data."

I will deal with each one in turn.

1. You are not going to be able to beat the RDBMS for speed at its own game. If you are going to be checking data for integrity then the best place to do it is as close to the data as possible. Therefore if you care about the quality of your data and want to check it let the RDBMS handle it. For example lets imagine a website that lets a user store a number of selected list options against their profile. Lets imagine we are going to let the application handle the integrity checks rather than the RDBMS and see what problems we may have.

Lets say the system has a table that holds the users who are on the website and allowed to save profiles.


CREATE TABLE USERS(UserPK int PRIMARY KEY, UserName varchar(50))

We also have a table that stores a number of locations which are displayed to the user on the profile page. The primary key is using a surrogate key which is an integer.


CREATE TABLE LOCATIONS(LocationPK int PRIMARY KEY, Location varchar(10))

The user can select multiple options so rather than hold the chosen Location value in the actual USERS table we have created a separate table that enables us to have a ONE to MANY relationship between an individual user and their selected locations.

CREATE TABLE USER_LOCATIONS (UserFK int,
LocationFK int,
CONSTRAINT pk_USER_LOCATIONS PRIMARY KEY(UserPK,LocationFK))

As you can see we have created a table that has columns that reference the primary keys in both the USERS and LOCATIONS table but we have not enforced referential integrity by defining them as FOREIGN KEY constraints. I have however defined a PRIMARY KEY as this is still usually followed by the majority of people even if foreign keys are forgotten about or deliberately left off.

So we are going to let our application handle all data checks. Therefore we have a web page with a list that lets our USER choose the locations to save.


<select id="lstLocation" size="1" multiple="multiple">
<option value="1">Greater London</option>
<option value="2">South East</option>
<option value="3">South West</option>
</select>

Point 1. If the list only shows options that exist in the LOCATIONS table then we do not have to worry about invalid values being saved? WRONG! HTML can be manipulated hacked and HTTP requests forged. Anyone with a web developer toolbar could change this list and therefore the values that the form submitted.

Therefore we have a tampered POST request from the form that has supplied us with the following values: 454,676

We call our stored procedure that handles UPDATES/INSERTS to the USER_LOCATION table:

EXEC dbo.usp_save_user_locations @LocationIDs = '454,676';

Point 2. Within our stored procedure we only INSERT new values into the USER_LOCATION table if they exist in the LOCATIONS table by either doing a LEFT JOIN or an IF EXISTS statement. Yes we could do this but we are trying to carry out the same check that a FOREIGN KEY constraint would do anyway and at a higher level.

Point 3. We could use TRIGGERS to ensure that the data we want to insert is correct but this is the same as point 2. We are replicating behaviour best left to the RDBMS. Some people use TRIGGERS to handle CASCADING DELETES however the discussion of CASCADES is another topic that is unrelated to this basic discussion. In that case it may be a good idea as it would give you more control over the data you are removing. However any code written by the developer to replicate basic functionality that the RDBMS can handle is always open to coding errors.

Even if you decide that using the application is the way to go then you are still left with the problem of the management console. Who has access to your database server? Even if access to your system is limited to a few trusted people the fact remains that its possible to bypass any checks that your application may have by opening the table and editing/deleting the data by hand whether by accident or on purpose. Using FOREIGN KEY constraints will prevent this from occurring and this is the key point in my opinion going back to my car analogy. Its your safety belt / airbag preventing your data from getting crushed and mangled up in any database accident. Whether this accident happens on purpose or by a bug in your code its your
last line of defense and its the best kind you can get because the checks are as close to the actual data as possible and therefore cover all possible points of corruption.

The correct way to save all this trouble and potential pitfalls is to define our USER_LOCATIONS table with FOREIGN KEY relationships such as below

CREATE TABLE USER_LOCATIONS (UserFK int CONSTRAINT fk_USER_LOCATIONS_UseFK FOREIGN KEY (UserFK) REFERENCES USERS(UserPK),
LocationFK int CONSTRAINT fk_USER_LOCATIONS_LocationFK FOREIGN KEY (LocationFK) REFERENCES LOCATIONS(LocationPK),
CONSTRAINT pk_USER_LOCATIONS PRIMARY KEY(UserFK,LocationFK))

As you can see the primary key consists of both foreign keys. One relates to the LOCATIONS table and holds the ID for the chosen location. The other key relates to the USERS table and holds the ID of the user who made the selection.

We can test this by populating our tables with some test data.

INSERT INTO USERS
SELECT 101, 'Mr Smith'

INSERT INTO LOCATIONS
SELECT 1,'Greater London'
UNION
SELECT 2, 'South East'
UNION
SELECT 3, 'South West'

Now running the following will be okay as both a USER with a USERPK of 101 and a LOCATION with a LocationPK of 1 exists in the tables referenced by the FOREIGN KEY relationships.

INSERT INTO USER_LOCATIONS
SELECT 101, 1

However running the following SQL will result in an error being raised as neither the USER or LOCATION exist in the relevant tables with the corresponding values.

INSERT INTO USER_LOCATIONS
SELECT 102, 12


Msg 547, Level 16, State 0, Line 1
The INSERT statement conflicted with the FOREIGN KEY constraint "fk_USER_LOCATIONS_UseFK". The conflict occurred in database "XXX", table "dbo.USERS", column 'UserPK'.
The statement has been terminated.

As for making life difficult well its suppose to if by making life difficult you mean its preventing you from adding or deleting data willy nilly. You cannot TRUNCATE a table engaged in FOREIGN KEY constraints which may cause problems but then in a production environment why would you want to without taking into consideration related data. You can easily create scripts to drop all your constraints and then reapply them if you needed to do this. There are also easy ways to disable your constraints if you do not want to enforce checks when inserting data for example:

--Disable all table constraints
ALTER TABLE USER_LOCATIONS NOCHECK CONSTRAINT ALL

-- Enable all table constraints
ALTER TABLE USER_LOCATIONS CHECK CONSTRAINT ALL

-- Disable single constraint
ALTER TABLE USER_LOCATIONS NOCHECK CONSTRAINT fk_USER_LOCATIONS_UseFK

-- Enable single constraint
ALTER TABLE USER_LOCATIONS CHECK CONSTRAINT fk_USER_LOCATIONS_UseFK

If you're the DBA and giving me this argument then I would say that you're not a proper DBA in the first place if you don't understand the importance of data in a relational database. Also whatDBA worth the name does not have to hand or the ability to write a simple script to drop and reapply constraints. Surely this trouble is worth doing if it means your key business resource i.e the data is left intact and uncompromised. If you take away the relations all you have left are just isolated tables that persist data.

I will reminisce for a moment about a job I had some time back working for a telecoms company that specialised in predictive dialer software for call centers. My boss designed the database system and it was not in any way what you could call a normalised relational database. There was plenty of redundancy and lots of flat tables without a relationship in sight. His argument was that it was a real time database that had to process thousands of calculations and DML statements a second and therefore performance was the key issue. Having relations would have slowed down INSERTS and therefore the denormalised schema was appropriate to the system. My job was to develop the OLTP reporting and management systems and he used to joke about me enforcing referential integrity and how it was a rookies mistake that experts like himself did not make. However my other job at the company was to actually write the numerous reports for the call center customers that had to use the dialler data from his "unrelational" system. What a nightmare that was! The amount of hours I spent hunting for missing records trying to work out why I had data for one time slot in table A but no corresponding data in table B when it should be there. Yes his system may have been fast but because there was no defined relationships the data integrity was poor. I can understand wanting to make the system fast but having a well defined relational system should not mean the opposite.

Apart from the two reasons mentioned as reasons for not using referential integrity I think the main reason is purely laziness or just the fact that on a lot of sites you can get away with not doing it and no-one notices. A lot of sites are actually developed without RI and extra coding is carried out to handle the fact that a percentage of the data is either rubbish, orphaned records or duplicates due to not having defined constraints. Systems like this if they still care about the quality of the data will have scheduled jobs to clean up these records or they will have to spend extra time on reports like I had to do to accommodate the fact that you are reporting on data that you cannot guarantee to be 100%.

Remember the old saying "Rubbish in Rubbish Out" well if this saying was not invented to describe relational database design then it should have been done.


Further Reading

A helpful how to guide on Foreign Keys with useful SQL related to reporting on existing keys and disabling and enabling keys across databases.
http://www.sqlservercentral.com/articles/Advanced/foreignkeys/2423/

A discussion on using foreign keys or triggers for referential integrity
http://www.mssqltips.com/tip.asp?tip=1508

2 comments:

Anonymous said...

I agree with your boss, speed is king. and if you don't know how to correlate data, to find missing data, it means you need to understand the table structure first and read the application documentation. You probably avoid understanding the application details.

Rob Reid said...

Hi

Well I wish you had my old job then!

Trying to make reports to give to the customer that included duplicate, missing or just wrong data due to the lack of data integrity.

I actually remembering him to tell me at one point when I complained about it to just "make up the missing figures" - I doubt you would agree with that? I doubt the company receiving the billing data I was producing would have wanted that either.

If you don't agree with referential integrity then I am guessing you don't use a RDBMS for your systems? And if so why do you bother when there are so many new types of DB out there designed for just dumping and retrieving data?

Also can you give me an example of how you would handle the scenario I exampled above to ensure only correct data was inserted. If you can show me how data integrity can be assured in that scenario without constraints I would be happy to listen and learn.

By the way I am not advocating the use of cascading deletes or updates here. That is something I do leave up to the application logic for many reasons to long and complicated to list here and sometimes when people hear about FK constraints that it what they think about as well.

I would also say that just because you enforce PK->FK relationships it doesn't have to slow your system down.

Example - on my Jobboard systems I store all the data in a proper normalised data structure which ensures data integrity however I also have automated jobs that create fast flat de-normalised tables which are then used for searching, reporting, prevent locks etc etc.

These tables do not need to be joined to others as they have all the data they need in one row.

However I know 100% that the data within them is correct because of the RDBMS and the PK->FK relationships on the normalised part of the system.

Therefore I get the benefits of speed for the parts of the system that need it as well as having data integrity and a properly structured DB.

Maybe this is an approach you could follow to get the best of both worlds?

Anyway thanks for your comments.