Saturday, 25 July 2009

HTML Encoding and Decoding using Javascript

Javascript HTML Encoder and Decoder

I have recently started doing a lot of work with client-side code, especially delivering content to my www.hattrickheaven.com football site from Google's Ajax API framework as well as other feed content providers. In doing so I have had numerous occasions where the feed data contained content that required HTML encoding, content that was only partially encoded, or malformed and double-encoded entities.

Therefore the need for a client-side library to handle HTML encoding and decoding became apparent. I have always wondered why there isn't a built-in HTML encode function in Javascript. It has escape, encodeURI and encodeURIComponent, but these are for encoding URIs and making sure text is portable, not for ensuring strings are HTML encoded correctly.
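To illustrate the difference, here is a quick check of what encodeURIComponent does with a fragment of markup. The sample string is made up for this example; the point is only that the output is percent-encoded for use in a URI, not entity-encoded for use in HTML.

```javascript
// Made-up sample containing markup and a bare ampersand.
var sample = '<b>Tom & Jerry</b>';

// encodeURIComponent percent-encodes characters that are unsafe in a URI component.
var uriEncoded = encodeURIComponent(sample);

console.log(uriEncoded);
// %3Cb%3ETom%20%26%20Jerry%3C%2Fb%3E
// Note there is no &lt;, &gt; or &amp; anywhere in the result.
```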

What's the problem with a simple replace function?

I have seen many sites recommend something simple along the lines of:

s = s.replace(/&/g,'&amp;').replace(/"/g,'&quot;').replace(/</g,'&lt;').replace(/>/g,'&gt;').replace(/'/g,'&apos;');



The problem with this is that it won't handle content that has already been partially encoded and could cause problems with double encoding.

For example if you had a string that was partially encoded such as:

"The player was sold for &pound;4.2 million"

Then you would end up double encoding the & in the &pound; entity like so:

&quot;The player was sold for &amp;pound;4.2 million&quot;
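The double encoding is easy to reproduce. The snippet below is a minimal sketch that wraps the simple replace chain in a function (naiveHtmlEncode is just a name chosen for this example) and runs it over the partially encoded string.

```javascript
// The simple replace chain, with global flags so every occurrence is replaced.
// naiveHtmlEncode is an illustrative name, not part of any library.
function naiveHtmlEncode(s) {
  return s
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/'/g, '&apos;');
}

// Already contains the &pound; entity, so it is partially encoded.
var partiallyEncoded = '"The player was sold for &pound;4.2 million"';
var doubleEncoded = naiveHtmlEncode(partiallyEncoded);

console.log(doubleEncoded);
// &quot;The player was sold for &amp;pound;4.2 million&quot;
```

A browser would render the result as the literal text &pound;4.2 million rather than showing the pound sign.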

Therefore you have to be a bit clever when encoding and make sure that only the & characters that are not part of an entity get encoded. You could try a regular expression that does a negative match, but you would have to handle the fact that HTML entities can be numeric or named, e.g.

&lt;
&#60;

are both representations of the less-than symbol <. The way I have dealt with this is to make sure all entities are converted to numeric codes first. You could then do a negative match on &#, e.g.

s = s.replace(/&([^#])/g,"&amp;$1");
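Put together, those two steps might look something like the sketch below. The lookup table is a deliberately tiny, hypothetical one made up for this example (a real one would need to cover every named entity), and the function names are illustrative rather than taken from my encoder object.

```javascript
// Hypothetical, deliberately tiny lookup table of named entities to numeric codes.
var NAMED_TO_NUMERIC = {
  '&pound;': '&#163;',
  '&lt;': '&#60;',
  '&gt;': '&#62;',
  '&quot;': '&#34;',
  '&amp;': '&#38;'
};

// Step 1: convert any named entity we know about into its numeric form.
// Unknown names are left untouched.
function namedToNumeric(s) {
  return s.replace(/&[a-zA-Z]+;/g, function (entity) {
    return Object.prototype.hasOwnProperty.call(NAMED_TO_NUMERIC, entity)
      ? NAMED_TO_NUMERIC[entity]
      : entity;
  });
}

// Step 2: encode any & that is not followed by #.
function encodeBareAmpersands(s) {
  return s.replace(/&([^#])/g, '&amp;$1');
}

var input = 'Sold for &pound;4.2 million & a sell-on fee';
var numeric = namedToNumeric(input);
var encoded = encodeBareAmpersands(numeric);

console.log(numeric); // Sold for &#163;4.2 million & a sell-on fee
console.log(encoded); // Sold for &#163;4.2 million &amp; a sell-on fee
```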

However, what if you had multiple & characters in a row, e.g. &&&&? You would only encode two of the four, because each match also consumes the character after the &. You could run the replace multiple times to handle it, but I tend to use placeholders in cases like this, as it's easier and better for performance to do positive matches rather than convoluted negative matches. Therefore I replace all occurrences of &# with a placeholder, then replace & with &amp;, and then finally put back the &#.
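As a rough sketch of the placeholder idea (not the exact code from my encoder object), the snippet below compares it with a global negative match. The placeholder string and the function names are made up for this example; the only requirement is that the placeholder cannot appear in the content being encoded.

```javascript
// Hypothetical placeholder; assumed never to occur in the input.
var PLACEHOLDER = '\u0000NUMERIC_ENTITY\u0000';

// Negative match: each match consumes the character after the &,
// so every second & in a run is skipped.
function encodeWithNegativeMatch(s) {
  return s.replace(/&([^#])/g, '&amp;$1');
}

// Placeholder approach: hide &#, encode every remaining &, then restore &#.
function encodeWithPlaceholder(s) {
  return s
    .split('&#').join(PLACEHOLDER)
    .replace(/&/g, '&amp;')
    .split(PLACEHOLDER).join('&#');
}

var run = '&&&& &#163;5';
var negativeResult = encodeWithNegativeMatch(run);
var placeholderResult = encodeWithPlaceholder(run);

console.log(negativeResult);    // &amp;&&amp;& &#163;5
console.log(placeholderResult); // &amp;&amp;&amp;&amp; &#163;5
```

Note that this sketch treats any &# as the start of a numeric entity, so a stray &# that is not followed by digits would be left as it is.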

If you want to see my html encoder and decoder functions in action and get a copy of the encoder object that has a number of useful functions then go to the following page on my site www.strictly-software.com/htmlencode. The page also has a form to allow you to HTML encode and decode online if you ever need to do just that.


8 Comments:

At 12 December 2010 19:17 , Anonymous Anonymous said...

Just what I have been looking for!
You saved me from going crazy over these awful conversions.
Thanks a lot!!!
peter

 
At 29 March 2011 12:37 , Blogger robertc said...

Great script, but there doesn't seem to be any license information. Is it OK to re-distribute this as part of commercial software?

Rob

 
At 29 March 2011 14:41 , Blogger R Reid said...

Yes, feel free to include it in any software you choose. If you put a reference to my site in there so people know where you got it from, that would be great.

Also if you make any money out of this software you can find my donate button here >> http://tools.strictly-software.com/

:)

 
At 29 March 2011 19:40 , Blogger robertc said...

In the end I did the createElement hack in my current project, but next time this issue comes up I'll consider adding your library.

Rob

 
At 8 September 2011 14:21 , Anonymous vlad said...

Could you please give an example of how to use it for XSS prevention with Ajax?

 
At 14 September 2011 17:27 , Blogger JAB Creations said...

I encountered this same problem. Thankfully there is a solution: not working directly with a nodeValue. I bet you were doing the same thing as I was in this instance. Take the nodeValue (for which alert(typeof n) will alert string even though it is still treated as a direct DOM node) and convert it to a string...

var n = String(youNode.nodeValue);

 
At 15 September 2011 00:32 , Blogger R Reid said...

But does that handle the issue of double or partially encoded content?

That was the main aim of the library as I was importing and joining content from various sources.

Taking a string that has been encoded correctly and adding it to a string partially encoded and then to one not encoded at all and running it through one function to ensure correct encoding is the target.

 
At 16 December 2011 20:27 , Anonymous Anonymous said...

Hey R, is your Decoder helper on github? I want to add something to it.

 
