Saturday 25 July 2009

HTML Encoding and Decoding using Javascript

Javascript HTML Encoder and Decoder

I have recently started doing a lot of work with client-side code, especially delivering content to my football site from Google's AJAX API framework as well as from other feed content providers. In doing so I have come across numerous occasions where the feed data contained either content that required HTML encoding, or content with partially encoded strings and malformed or double-encoded entities.

Therefore the need for a client-side library to handle HTML encoding and decoding became apparent. I have always wondered why there isn't a built-in HTML encode function in JavaScript. There are escape, encodeURI and encodeURIComponent, but these are for encoding URIs and making sure text is portable, not for ensuring strings are correctly HTML encoded.

What's the problem with a simple replace function?

I have seen many sites recommend something simple along the lines of:

s = s.replace(/&/g,'&amp;').replace(/"/g,'&quot;').replace(/</g,'&lt;').replace(/>/g,'&gt;').replace(/'/g,'&apos;');

The problem with this is that it won't handle content that has already been partially encoded and could cause problems with double encoding.

For example if you had a string that was partially encoded such as:

"The player was sold for &pound;4.2 million"

Then you would end up double encoding the & in the &pound; entity like so:

&quot;The player was sold for &amp;pound;4.2 million&quot;

Therefore you have to be a bit clever when encoding and make sure that only ampersands that are not already part of entities get encoded. You could try a regular expression that does a negative match, but you would have to handle the fact that HTML entities can be either numeric or named, e.g.

&#60; and &lt;

are both representations of the less than symbol <. The way I have dealt with this is to make sure all named entities are converted to numeric codes first. You can then do a negative match on &# e.g.

s = s.replace(/&([^#])/g,"&amp;$1");

However, what if you had multiple ampersands all in a row, e.g. &&&&? You would only encode two out of the four, because each match consumes the following character, so the regex skips every other ampersand. You could run the replace multiple times to handle this, but I tend to use placeholders in cases like this, as it's easy and it's better for performance to do positive matches rather than convoluted negative matches. Therefore I replace all occurrences of &# with a placeholder, then do my replacement of & with &amp;, and then finally put back the &#.
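The steps above can be sketched as follows. Note that the entity map is only a small illustrative subset, and the placeholder value is an arbitrary choice; this is a sketch of the technique, not the finished encoder object:

```javascript
// Illustrative subset of named entities mapped to their numeric codes;
// a real encoder would use a full entity table.
var entityToCode = { lt: 60, gt: 62, quot: 34, apos: 39, amp: 38, pound: 163 };

// Step 1: convert named entities to numeric form so that only "&#"
// needs protecting later.
function toNumericEntities(s) {
    return s.replace(/&([a-zA-Z]+);/g, function (match, name) {
        return entityToCode.hasOwnProperty(name) ? '&#' + entityToCode[name] + ';' : match;
    });
}

// Step 2: protect "&#" with a placeholder, encode every remaining
// ampersand, restore "&#", then encode the other special characters.
function htmlEncode(s) {
    var placeholder = '\x01\x02'; // assumed never to appear in real content
    return toNumericEntities(s)
        .replace(/&#/g, placeholder)
        .replace(/&/g, '&amp;')
        .replace(new RegExp(placeholder, 'g'), '&#')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;')
        .replace(/'/g, '&#39;');
}
```

With this approach the partially encoded example above survives intact: &pound; comes out once-encoded as &#163; rather than double-encoded, and a run of ampersands such as &&&& is encoded in full.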

If you want to see my HTML encoder and decoder functions in action and get a copy of the encoder object, which has a number of other useful functions, then go to the following page on my site. The page also has a form to allow you to HTML encode and decode online if you ever need to do just that.
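For completeness, decoding can be sketched in the same style. Again, only a small illustrative subset of named entities is handled here, so treat this as a sketch rather than the decoder from the page mentioned above:

```javascript
// Illustrative subset of named entities; a real decoder would use a
// full entity table.
var codeToChar = { lt: '<', gt: '>', quot: '"', apos: "'", amp: '&', pound: '\u00a3' };

// Decode numeric entities first, then the named ones. Because each
// replace makes a single pass, a double-encoded "&amp;#60;" correctly
// comes back as "&#60;" (one level decoded) rather than "<".
function htmlDecode(s) {
    return s
        .replace(/&#(\d+);/g, function (m, code) {
            return String.fromCharCode(parseInt(code, 10));
        })
        .replace(/&([a-zA-Z]+);/g, function (m, name) {
            return codeToChar.hasOwnProperty(name) ? codeToChar[name] : m;
        });
}
```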

Tuesday 21 July 2009

Firebug 1.4.0 and Highlighter.js

The mysterious case of the disappearing code

Yesterday I posted an article relating to an issue with a code highlighter script and Firebug 1.4.0. A rough overview is that:

a) I use the highlight.js code from software maniacs to highlight code examples.
b) This works cross browser and has done for a number of months.
c) On loading Firefox 3.0.11 I was asked to upgrade Firebug from 1.3.3 to 1.4.0.
d) After doing so I noticed all my code examples were malformed with the majority of code disappearing from the DOM.
e) On disabling the Firebug add-on the code would re-appear.
f) This problem didn't occur on my other PC still using Firebug 1.3.3 and Firefox 3.0.11.

Other people have contacted me to say they had similar issues, while others who were using Firefox 3.5 did not have the problem, so it seemed to be specific to the combination of Firefox 3.0.11 and Firebug 1.4.0.

So tonight I was planning to debug the highlight.js file to see what the crack was. The version of highlight.js I use on my site is compressed and minified, so I uncompressed the file with my online unpacker tool. I thought I would just try some problematic pages on my site with this uncompressed version, and lo and behold it worked. The code didn't disappear!

So I have re-compressed the JS file with my own compressor tool and changed the references throughout the site to use this new file instead of the original, and it all seems to work (at least for me).

If anyone manages to get to the bottom of this problem then please let me know, but there must be some sort of conflict occurring between these two codebases, and I think it's very strange!

I have created a copy of yesterday's posting that still uses the original compressed file so that the problem can be viewed. You can view the problem code here.

Sunday 19 July 2009

Problems upgrading to Firebug 1.4.0

Problems with Firebug and Javascript Code Highlighting

Please read the following post for details about a resolution for this problem.
You can view the original article along with the original file at the following link:

I have recently started using a JavaScript include file from software maniacs called highlight.js to highlight the code snippets that I use in my blog articles. It uses a callback function fired on page load that looks for CODE tags in the HTML source and re-formats them by applying in-line styles appropriate for the code snippet. However, I have just found that a lot of my articles, when viewed in Firefox 3, are not displaying the formatting correctly and most of the code disappears. For example, if you view the code below and can see that it shows a small JS function, then that's okay. However, if all you can see is the closing brace }, then I would ask one question: have you just upgraded to Firebug 1.4.0?

function myJSfunc(var1, var2){
    return (var1==100) ? var1 : var2;
}

I myself have just upgraded my laptop version of Firebug from 1.3.3 to 1.4.0, and since that update I have noticed that all my highlighting has gone haywire, mainly where I am trying to output literal values such as > and <. For example, the following code snippet should appear like this:

-- Look for open and close HTML tags making sure a letter or / follows < ensuring its an opening
-- HTML tag or closing HTML tag and not an unencoded < symbol
WHILE PATINDEX('%<[A-Z/]%', @CleanHTML) > 0 AND CHARINDEX('>', @CleanHTML, CHARINDEX('<', @CleanHTML)) > 0

SELECT @StartPos = PATINDEX('%<[A-Z/]%', @CleanHTML),
@EndPos = CHARINDEX('>', @CleanHTML, PATINDEX('%<[A-Z/]%', @CleanHTML)),
@Length = (@EndPos - @StartPos) + 1,
@CleanHTML = CASE WHEN @Length>0 THEN stuff(@CleanHTML, @StartPos, @Length, '') END

However, when using the highlighter and viewed in Firefox 3.0 with Firebug 1.4.0, it comes out as:


(Another test for Firebug 1.4.0 users: can you see the colourful code below?)

-- Look for open and close HTML tags making sure a letter or / follows < ensuring its an opening
-- HTML tag or closing HTML tag and not an unencoded < symbol
WHILE PATINDEX('%<[A-Z/]%', @CleanHTML) > 0 AND CHARINDEX('>', @CleanHTML, CHARINDEX('<', @CleanHTML)) > 0

SELECT @StartPos = PATINDEX('%<[A-Z/]%', @CleanHTML),
@EndPos = CHARINDEX('>', @CleanHTML, PATINDEX('%<[A-Z/]%', @CleanHTML)),
@Length = (@EndPos - @StartPos) + 1,
@CleanHTML = CASE WHEN @Length>0 THEN stuff(@CleanHTML, @StartPos, @Length, '') END

Now I have checked the problematic code examples in other browsers such as Chrome and IE 8 as well as in Firefox 3 with Firebug 1.3.3 and it only seems to be Firefox 3 with Firebug 1.4.0 that causes the formatting problems.

I don't know why Firebug 1.4.0 is causing the problems but it seems to be the only differing factor between a working page and a broken page. Maybe there is some sort of clash with function names when both objects are loading. I have inspected the DOM using Firebug and the actual HTML has been deleted from the DOM so something is going wrong somewhere.

Anyway I thought I should let you know in case you are having similar problems or just wondering why all my code examples have gone missing. If you are experiencing the same problem but do not have Firebug 1.4.0 installed please let me know.

I am unaware of any other issues with this version of Firebug but will keep you posted.

Removing HTML with a User Defined Function

Using SQL to parse HTML content

Today I had the task of collating a long list of items for one of my sites. The list was being obtained by manually checking the source code of numerous sites and copying and pasting the relevant HTML source into a file. The items I wanted were contained within HTML list elements (UL, LI), and therefore the actual textual data was surrounded by HTML tags such as LI, STRONG and EM, with various in-line styles and class names etc.

I didn't want to spend much time parsing a list of a thousand items and I couldn't be bothered to write much code, so I reverted to an old user defined function that I wrote to remove HTML tags. Once I had collated the list it was a simple case of using an import task to insert from the text file into my table, wrapping the column in the user defined function.

Although I have been making use of the CLR lately for string parsing in the database, with a few good C# regular expression functions, this function doesn't require the CLR and can easily be converted for use with SQL Server 2000 and earlier by changing the data types of the input and return parameters from nvarchar(max) to nvarchar(4000).

The code is pretty simple and neat. It uses a PATINDEX search for the opening angle bracket of an HTML tag, making sure that the subsequent character is either a forward slash (matching a closing tag) or a letter (matching an opening tag), e.g.

SELECT @StartPos = PATINDEX('%<[A-Z/]%', @CleanHTML),

This means that any unencoded opening bracket characters don't get mistaken for HTML tags. The code just keeps looping over the input string looking for opening and closing tags, removing each match with the STUFF function, until no more matches are found.

-- Look for open and close HTML tags making sure a letter or / follows < ensuring its an opening
-- HTML tag or closing HTML tag and not an unencoded < symbol
WHILE PATINDEX('%<[A-Z/]%', @CleanHTML) > 0 AND CHARINDEX('>', @CleanHTML, CHARINDEX('<', @CleanHTML)) > 0

SELECT @StartPos = PATINDEX('%<[A-Z/]%', @CleanHTML),
@EndPos = CHARINDEX('>', @CleanHTML, PATINDEX('%<[A-Z/]%', @CleanHTML)),
@Length = (@EndPos - @StartPos) + 1,
@CleanHTML = CASE WHEN @Length>0 THEN stuff(@CleanHTML, @StartPos, @Length, '') END

An example of the function's usage:

DECLARE @Test nvarchar(max)
SELECT @Test = '<span class="outer2" id="o1"><strong>10 is < 20 and 20 is > 10</strong></span>'

SELECT dbo.udf_STRIP_HTML(@Test)

10 is < 20 and 20 is > 10

I thought I would post this user defined function on my blog as it's a good example of how a simple function built on inbuilt system functions such as PATINDEX, CHARINDEX and STUFF alone can solve common day-to-day problems. I personally love UDFs, and the day SQL Server 2000 introduced them was a wonderful occasion, almost as good as when England beat Germany 5-1 LOL. No, I am not that sad; of course England beating Germany was the better occasion, but the introduction of user defined functions made SQL Server programming a much easier job and made possible numerous pseudo-set-based operations which previously had to be done iteratively (with loops or cursors).

Download the Strip HTML user defined function source code here

Wednesday 8 July 2009

Another article about bots and crawlers

Good Guys versus the Bad Guys

As you will know if you have read my previous articles about badly behaved robots, I spend a lot of time dealing with crawlers that access the 200+ jobboards I look after. Crawlers break down into two varieties. The first are the indexers such as Googlebot, YSlurp, Ask etc. They provide a service to the site owner by indexing its content and displaying it in search engines to drive traffic to the site. Although these are legitimate bots, they can still cause issues (see this article about over-crawling). The second class are those bots that don't benefit the site owner and in a large majority of cases try to harm the site, either by hacking, spamming or content scraping (or job raping as we call it).

Now it may be that the site owner wishes that any Tom, Dick and Harry can come along with a scraper and take all their content, but seeing as content is the main asset of a website, that's a bit like starting a shop and leaving the doors open all night with no-one manning the till. The problem is that most web content is publicly accessible, and it's very hard to prevent someone who is determined to scrape your content while also maintaining good SEO. For instance, you could deliver all your content through JavaScript or Flash, but then indexers like Googlebot won't be able to access it either. Therefore, to prevent overloaded servers, stolen bandwidth and stolen content, it becomes a complex game involving a lot more than a blacklist of IP addresses and user-agents. Due to IP and agent spoofing, blacklists alone are very unreliable, so a variety of methods can be utilised by those wishing to reduce the amount of "bad traffic", including real-time checks to identify users that are hammering the site, whitelists and blacklists, bot traps, DNS checks and much more.

One of the problems I have found is that even bot traffic that is legitimate in the eyes of the site owner doesn't comply with even the most basic rules of crawler etiquette, such as supplying a proper user-agent string, passing DNS validation or even following the Robots.txt file rules. In fact, when we spoke to the technical team running one of these job-rapists/aggregators, they informed us that they didn't have the technical capability to parse a Robots.txt file. Now I think this is plainly ridiculous: if you have the technical capability to crawl hundreds of thousands of pages a day, all with different HTML, and correctly parse this content to extract the job details, which are then uploaded to your own server for display, it shouldn't be too difficult to add a few lines to parse the Robots.txt file. I am 99% positive this was just an excuse to hide the fact that they knew if they did parse my Robots.txt file they would find they were banned from all my sites. Obviously I had banned them by other methods, but just to show how easy it is to write code to parse a Robots.txt file (and for that matter a crawler), I have added some example code to my technical site: How to parse a Robots.txt file with C#.
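To show just how little code is involved, here is a minimal sketch of a Robots.txt parser. The article linked above uses C#, so this JavaScript version is only an illustrative equivalent, and it simplifies the real protocol (no Allow rules, no wildcards, and consecutive User-agent lines are not grouped):

```javascript
// Collect the Disallow rules that apply to a given user-agent
// (rules under "User-agent: *" or an exact agent-name match).
function parseRobotsTxt(txt, agent) {
    var rules = [], applies = false;
    txt.split(/\r?\n/).forEach(function (line) {
        line = line.replace(/#.*$/, '').trim(); // strip comments
        var m = line.match(/^(user-agent|disallow)\s*:\s*(.*)$/i);
        if (!m) return;
        if (m[1].toLowerCase() === 'user-agent') {
            applies = (m[2] === '*' || m[2].toLowerCase() === agent.toLowerCase());
        } else if (applies && m[2]) {
            rules.push(m[2]); // an empty Disallow value means "allow all"
        }
    });
    return rules;
}

// A path is allowed if no applicable Disallow rule is a prefix of it.
function isAllowed(rules, path) {
    return !rules.some(function (r) { return path.indexOf(r) === 0; });
}
```

A crawler would fetch /robots.txt once per site, call parseRobotsTxt with its own agent name, and check isAllowed before requesting each page.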

This is one of the primary reasons there is so much bot traffic on the web today: it's just too damn easy to either download and run a bot from the web or write your own. I'm not saying all crawlers and bots have nefarious purposes, but I reckon 80%+ of all Internet traffic is probably from crawlers and bots nowadays, with the other 20% being porn, music and film downloads and social networking traffic. It would be interesting to get exact figures for an Internet traffic breakdown. I know from my own sites' logging that on average 60-70% of traffic is from crawlers, and I am guessing that doesn't include traffic from spoofed agents that didn't get identified as such.