I have recently started doing a lot of work with client-side code, especially delivering content to my www.hattrickheaven.com football site from Google's Ajax API framework as well as from other feed content providers. In doing so I have had numerous occasions where the feed data contained either content that required HTML encoding, or content with partially encoded strings and malformed or double-encoded entities.
Therefore the requirement for a client-side library to handle HTML encoding and decoding became apparent. I have always wondered why there isn't a built-in HTML encode function in JavaScript. It has escape, encodeURI and encodeURIComponent, but these are for encoding URIs and making sure text is portable, not for ensuring strings are HTML encoded correctly.
What's the problem with a simple replace function?
I have seen many sites recommend something simple along the lines of:
s = s.replace(/&/g,'&amp;').replace(/"/g,'&quot;').replace(/</g,'&lt;').replace(/>/g,'&gt;').replace(/'/g,'&#39;');
The problem with this is that it won't handle content that has already been partially encoded and could cause problems with double encoding.
For example, if you had a string that was partially encoded such as:
"The player was sold for &pound;4.2 million"
Then you would end up double encoding the & in the &pound; entity like so:
"The player was sold for &amp;pound;4.2 million"
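To make the double-encoding problem concrete, here is a minimal sketch of the naive approach applied to that string. The function name naiveEncode is just an illustrative name of mine, not part of any library.

```javascript
// Naive encoder: escapes every &, including ones that already start an entity.
function naiveEncode(s) {
  return s
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/'/g, '&#39;');
}

var partiallyEncoded = 'The player was sold for &pound;4.2 million';
var naiveResult = naiveEncode(partiallyEncoded);

// The existing &pound; entity has been broken by escaping its ampersand.
console.log(naiveResult); // The player was sold for &amp;pound;4.2 million
```

A browser would render that result as the literal text &pound;4.2 rather than as a pound sign.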
Therefore you have to be a bit clever when encoding and make sure that only the & characters that are not part of an entity get encoded. You could try a regular expression that does a negative match, but you would have to handle the fact that HTML encoded strings can be either numeric or entity based, e.g.
&lt;
&#60;
are both representations of the less than symbol <. The way I have dealt with this is to make sure all entities are converted to numerical codes first. You could then do a negative match on &#, e.g.
s = s.replace(/&([^#])/g,"&amp;$1");
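The two steps just described can be sketched roughly as follows. This is only an illustration: the lookup table NAMED_TO_NUMERIC and the function names are made up for this example, and the table is deliberately tiny where a real one would need to cover many more named entities.

```javascript
// Hypothetical, deliberately small lookup table (assumption: a real one is far larger).
var NAMED_TO_NUMERIC = { amp: 38, lt: 60, gt: 62, quot: 34, pound: 163 };

// Step 1: rewrite known named entities as numeric ones, e.g. &lt; becomes &#60;
function namedToNumeric(s) {
  return s.replace(/&([a-zA-Z][a-zA-Z0-9]*);/g, function (match, name) {
    return Object.prototype.hasOwnProperty.call(NAMED_TO_NUMERIC, name)
      ? '&#' + NAMED_TO_NUMERIC[name] + ';'
      : match; // unknown names are left untouched
  });
}

// Step 2: escape any & that is followed by a character other than #
function encodeBareAmpersands(s) {
  return s.replace(/&([^#])/g, '&amp;$1');
}

var input = 'Fish & chips for &pound;5 &lt; &#60;';
var normalised = namedToNumeric(input);
var output = encodeBareAmpersands(normalised);

console.log(normalised); // Fish & chips for &#163;5 &#60; &#60;
console.log(output);     // Fish &amp; chips for &#163;5 &#60; &#60;
```

Because every entity is numeric by the time step 2 runs, the only ampersands left without a # after them are the bare ones that genuinely need encoding.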
However, what if you had multiple & characters all in a row, e.g. &&&&? Each match consumes the character after the &, so you would only encode two of the four. You could run the replace multiple times to handle it, but I tend to use placeholders in cases like this, as it's easier and better for performance to do positive matches rather than convoluted negative matches. Therefore I replace all occurrences of &# with a placeholder, then replace & with &amp;, and finally put the &# back.
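A minimal sketch of the placeholder idea is shown below. It is not the code from my encoder object; the PLACEHOLDER token and the encodeAmpersands name are invented for this example, and it assumes the token never appears in real input and that named entities have already been converted to numeric ones.

```javascript
// Hypothetical placeholder token (assumption: it never occurs in real input).
var PLACEHOLDER = '\u0000NUMERIC_ENTITY\u0000';

function encodeAmpersands(s) {
  return s
    .split('&#').join(PLACEHOLDER)   // 1. hide the numeric entity prefixes
    .replace(/&/g, '&amp;')          // 2. escape every remaining &
    .split(PLACEHOLDER).join('&#');  // 3. put the numeric entity prefixes back
}

// The negative match skips every second & in a run...
var negativeMatch = '&&&&'.replace(/&([^#])/g, '&amp;$1');
console.log(negativeMatch); // &amp;&&amp;&

// ...while the placeholder version encodes all four and leaves &#163; intact.
var viaPlaceholder = encodeAmpersands('&&&& &#163;5');
console.log(viaPlaceholder); // &amp;&amp;&amp;&amp; &#163;5
```

Note that this sketch treats any &# as the start of a valid numeric entity, so malformed input such as &#abc would pass through unescaped and would need extra checks.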
If you want to see my HTML encoder and decoder functions in action, and get a copy of the encoder object that has a number of useful functions, then go to the following page on my site: www.strictly-software.com/htmlencode. The page also has a form that lets you HTML encode and decode online if you ever need to do just that.