Sunday 19 July 2009

Removing HTML with a User Defined Function

Using SQL to parse HTML content

Today I had the task of collating a long list of items for one of my sites. The list was being obtained from manually checking the source code of numerous sites and copying and pasting the relevant HTML source into a file. The items I wanted were contained within HTML list elements (UL, LI) and therefore the actual textual data was surrounded by HTML styling, tags such as LI, Strong, em with various in-line styles and class names etc.

I didn't want to spend much time parsing the list of a thousand items and I couldn't be bothered to write much code so I reverted back to an old user defined function that I wrote to remove HTML tags. Therefore once I had collated the list it was a simple case of using an import task to insert from the text file into my table wrapping the column in my user defined function.

Although I have been making use of the CLR lately for string parsing in the database with a few good C# regular expression functions this function doesn't require the CLR and can easily be converted for use with SQL 2k and earlier by changing the data type for the input and return parameters from nvarchar(max) to nvarchar(4000).

The code is pretty simple and neat and makes use of a PATINDEX search for the initial bracket in an open or close HTML tag making sure that the subsequent character is either a forward slash to match a closing HTML tag or a letter to match an opening HTML tag e.g

SELECT @StartPos = PATINDEX('<[A-Z/]%', @CleanHTML),

Which means that any unencoded opening bracket characters don't get mistaken as HTML tags. The code just keeps looping the input string looking for opening and closing tags and then recreating the string with the STUFF function until all matches have been found.

-- Look for open and close HTML tags making sure a letter or / follows < ensuring its an opening
-- HTML tag or closing HTML tag and not an unencoded < symbol
WHILE PATINDEX('%<[A-Z/]%', @CleanHTML) > 0 AND CHARINDEX('>', @CleanHTML, CHARINDEX('<', @CleanHTML)) > 0

SELECT @StartPos = PATINDEX('%<[A-Z/]%', @CleanHTML),
@EndPos = CHARINDEX('>', @CleanHTML, PATINDEX('%<[A-Z/]%', @CleanHTML)),
@Length = (@EndPos - @StartPos) + 1,
@CleanHTML = CASE WHEN @Length>0 THEN stuff(@CleanHTML, @StartPos, @Length, '') END

An example of the functions usage:

DECLARE @Test nvarchar(max)
SELECT @Test = '<span class="outer2" id="o1"><strong>10 is < 20 and 20 is > 10</strong></span>'

SELECT dbo.udf_STRIP_HTML(@Test)

10 is < 20 and 20 is > 10

I thought I would post this user defined function on my blog as its a good example of how a simple function using inbuilt system functions such as PATINDEX, CHARINDEX and STUFF alone can solve common day to day problems. I personally love UDF's and the day SQL 2k introduced them was a wonderful occasion almost as good as when England beat Germany 5-1 LOL. No I am not that sad of course England beating Germany was the better occasion but the introduction of user defined functions made SQL server programming a much easier job and made possible numerous pseudo-set based operations which previously had to be done iteratively (loops or cursors).

Download the Strip HTML user defined function source code here

1 comment:

Unknown said...

Nice Article !
This is my pleasure to read your article.
Really this will help to people of SQL Server Community.

I have also prepared one article about, How to remove html tag from string in SQL Server.
You can also visit my article, your comments and reviews are most welcome.