Sunday 7 September 2008

When would you use a CROSS JOIN

Cross Joins??

Like most people I would imagine when I first heard about Cross Joins in SQL many years ago I remember thinking to myself when would I ever find myself
in a situation developing a website where I'd make use of them. They are one of those features that are infrequently used in day to day web development
but I over time I have come across a few situations where they have been very useful and solved some specific problems. I'm sure there are many
more that can be mentioned but I will talk about 2 specific situations where I have made use of them recently.


1. Many to Many Save

On nearly all of my sites I have one or more pages that output the results of some sort of search that the user has made on a previous page.
On the results page the user can select one or more of these records to view in more detail through selecting them with a checkbox or highlighting
the row with a double click etc. On some sites however I also have a folder option that allows the user to save those selected records into a new or
existing folder/group for later use. For example taking a jobboard as an example the user could select 3 candidates "Mr Smith, Mr Jones and Mr Robins"
from the results and then select 2 folders to save these candidates to "Top Candidates, Tuesdays Selections".
In the old days before I fully understood the flexibility and power of SQL I would do something like the following

(pseudo code)

For Each Candidate in Selected Candidates
For Each Folder in Selected Folders

'* Execute SQL either in a proc or inline
INSERT INTO CANDIDATE_GROUPS
(Folder, Candidate)
VALUES
(Folder, Candidate)

Next Folder
Next Candidate


Which for my example would involve 5 loop iterations, 5 separate calls to the database and then 5 separate insert statements. Obviously if you were
selecting more candidates and more folders this would increase the number of loops.

What always do nowadays is collect the two strings of delimited values and then call a stored procedure once passing the strings as parameters.
Then making use of my SPLIT user defined function I simply use a cross join to populate the table.

usp_asp_save_candidate_groups
@Candidates varchar(1000),
@Groups varchar(1000)
AS

SET NOCOUNT ON

INSERT INTO CANDIDATE_GROUPS
(Folder, Candidate)
SELECT a.[VALUE],b.[VALUE]
FROM dbo.udf_SPLIT(@Groups,',') as a,
dbo.udf_SPLIT(@Candidates,',') as b


This way I have reduced the application code to the bare minimum and we only have one INSERT statement. The overhead is now the CROSS JOIN and the SPLIT functions that convert a string of delimited values to a TABLE variable however this is always going to outperform nested loop inserts by a mile. Even though the SPLIT function makes use of multiple INSERTS to populate the TABLE variable that's then selected from the CROSS JOIN code outperforms the nested loops by factors of 20+ from tests I have run. Plus you are reducing network traffic and simplifying your application code.


2. Matrix Table

On a recent jobboard website I had to come up with a matrix table so that the user had access to the number of live jobs in the system for every possible category combination. You may have seen the sort of thing I mean where you have a filter form that lists job related categories with
the number of jobs in brackets to the side of them e.g

Job Type
-Full Time (50)
-Part Time (23)

Sector
-Management (4556)
-IT (3434)
-Sales (2456)

The page had to load fast so I didn't want to calculate the counts on the fly and there were 10 category types with over 150 categories that could be saved against a job in any possible combination. This meant that that as the user narrowed down his or her job search the counts had to reflect the options already selected. The first idea was to create a matrix table that would hold a row for each possible combination and a column for the number of jobs.

JobSector JobType Industry Region Location Role LiveJobs
12 8762 562 992 NULL NULL 345
12 8762 562 993 NULL NULL 321
12 8762 562 994 NULL NULL 78
12 8762 562 995 NULL NULL 963
12 8762 562 996 NULL NULL 13

This would have allowed me to do a simple statement such as

SELECT LiveJobs FROM MATRIX WHERE JobSector=12 AND Region=992

to get the number of live jobs per category combination.


Finance job in Afghanistan anyone?

So I knocked up some SQL to test how quick it would take to generate this matrix and clicked the run button. However it soon came apparent that this wasn't going to work as well as expected. After an hour or so the transaction log had filled the disk up and we had only populated 30 million or so rows out of some stupidly high number running into the billions.
I quickly decided this wasn't the way to go. Apart from the problems with generating this matrix table quickly most of the possible combinations were never going to actually ever have
a job count greater than 0. I doubt many UK based jobboards have ever posted jobs in the Private equity and venture capital sector based in Afghanistan so it seemed a pointless overhead trying to generate category combinations that would never be accessed.

So after a re-think I decided to rewrite the matrix using CROSS JOINs to populate the table so that instead of holding all possible category combinations it would only hold category combinations that were actually saved against the job. If a job was saved with only one category
for each possible category type then there would only be one row in the matrix table for that job. If however one category type had 3 options selected and all the rest one then there were be 3 rows and so on. This meant we would only ever hold data for categories being used and the table would be as small or as large at it needed to be.

A cut down version of the finished SQL that creates the matrix table is below. I have a working table #SITE_JOB_CATEGORY_VALUES that holds each job and each saved category ID and from the first SELECT that outputs the job ID it CROSS JOINS each possible derived table
to get all the combinations of saved category values against that job.


INSERT INTO SITE_JOB_CATEGORY_MATRIX_A
(JobPK,JobSector,JobType,Industry)
SELECT sjcv.JobFK,a.JobSector,b.JobType,c.Industry
FROM #SITE_JOB_CATEGORY_VALUES as sjcv,
(
SELECT a.JobFK, CategoryFK as JobSector
FROM (SELECT @JobPK as JobFK) as a
LEFT JOIN #SITE_JOB_CATEGORY_VALUES as b
ON b.JobFk = a.JobFK
AND CategoryTypeFK = 2248
) as a ,(
SELECT a.JobFK, CategoryFK as JobType
FROM (SELECT @JobPK as JobFK) as a
LEFT JOIN #SITE_JOB_CATEGORY_VALUES as b
ON b.JobFk = a.JobFK
AND CategoryTypeFK = 2249
) as b ,(
SELECT a.JobFK, CategoryFK as Industry
FROM (SELECT @JobPK as JobFK) as a
LEFT JOIN #SITE_JOB_CATEGORY_VALUES as b
ON b.JobFk = a.JobFK
AND CategoryTypeFK = 2190
) as c
GROUP BY sjcv.JobFK,a.JobSector,b.JobType,c.Industry


The reason that each derived table selects first from a constant and then LEFT JOINS to the main category table is that each derived table needs to return a value even if its a NULL otherwise the other derived tables wouldn't join to it and I'd lose the row for that job.

Depending on how many categories were selected under each category type (as some options allow multiple select) one job could have 1 or multiple rows with all the variations possible.

For example this job has one category value per category type apart from Region and Location that have 2 each which means there is a total of 4 rows to be outputted into the matrix.

JobPK JobSector JobType Industry Region Location Role
3434 12 8762 562 992 3543 78
3434 12 8762 562 993 3543 78
3434 12 8762 562 992 3544 78
3434 12 8762 562 993 3544 78

To get the job count you would just do a DISTINCT on the JobPK and the WHERE contains the columns and values that the user is filtering by.

The application references this table through the use of a SYNONYM as behind the scenes there will be 2 of these matrix tables. The current one and then every 15 minutes an MS Agent job runs that creates the new matrix and once done it points the SYNONYM to it and drops the existing table.

Using this approach I can rebuild the matrix table every 15 minutes in a manner of seconds. I have a clustered covering index on all the columns ordered by the most selective categories to the least and I can output the job count per category and run database searches using the users selections in lightening speed. The only downside is that there is a 15 minute delay between a job being posted on the site and it being searchable but that seems to be an acceptable limit to the posters.

So those are just two uses of a CROSS JOIN I have used lately and I am sure there are many more useful implementations. If you have any more please lets hear about them.

No comments: