Saturday, 25 July 2009

HTML Encoding and Decoding using Javascript

Javascript HTML Encoder and Decoder

I have recently started doing a lot of work with client-side code, especially delivering content to my www.hattrickheaven.com football site from Google's Ajax API framework as well as other feed content providers. In doing so I have had numerous occasions where the feed data contained either content that required HTML encoding or content that contained partially encoded strings or malformed and double-encoded entities.

Therefore the requirement for a client-side library to handle HTML encoding and decoding became apparent. I have always wondered why there wasn't an inbuilt HTML encode function in the JavaScript library. There are escape, encodeURI and encodeURIComponent, but these are for encoding URIs and making sure text is portable, not for ensuring strings are HTML encoded correctly.
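As a quick illustration of the difference, here is a small sketch showing what encodeURIComponent does with a piece of markup. The sample string is made up for the example. The output is percent-encoded for use in a URI, which is not the same thing as HTML entity encoding.

```javascript
// Hypothetical sample input, not taken from any real feed.
var markup = '<b>Tom & Jerry</b>';

// encodeURIComponent percent-encodes characters for safe use in a URI.
var uriEncoded = encodeURIComponent(markup);

// The result contains %3C, %3E and %26 rather than &lt;, &gt; and &amp;,
// so it is of no use for writing the text safely into an HTML page.
console.log(uriEncoded); // %3Cb%3ETom%20%26%20Jerry%3C%2Fb%3E
```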

What's the problem with a simple replace function?

I have seen many sites recommend something simple along the lines of:

s = s.replace(/&/g,'&amp;').replace(/"/g,'&quot;').replace(/</g,'&lt;').replace(/>/g,'&gt;').replace(/'/g,'&apos;');



The problem with this is that it won't handle content that has already been partially encoded and could cause problems with double encoding.

For example if you had a string that was partially encoded such as:

"The player was sold for &pound;4.2 million"

Then you would end up double encoding the & in the &pound; entity like so:

&quot;The player was sold for &amp;pound;4.2 million&quot;
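The effect is easy to reproduce. The following is a minimal sketch of the simple replace approach (the function name naiveHtmlEncode is just a label for this example), run once and then a second time on its own output to show how the damage compounds.

```javascript
// Naive encoder: replaces every special character, including the &
// that starts an entity which is already present in the input.
function naiveHtmlEncode(s) {
  return s
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

var partiallyEncoded = '"The player was sold for &pound;4.2 million"';

// First pass: the & of &pound; becomes &amp;
var once = naiveHtmlEncode(partiallyEncoded);
console.log(once);
// &quot;The player was sold for &amp;pound;4.2 million&quot;

// Second pass (e.g. content re-encoded further down the line): worse again.
var twice = naiveHtmlEncode(once);
console.log(twice);
// &amp;quot;The player was sold for &amp;amp;pound;4.2 million&amp;quot;
```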

Therefore you have to be a bit clever when encoding and make sure that only ampersands that are not part of entities get encoded. You could try a regular expression that did a negative match, but you would have to handle the fact that HTML encoded strings could be numeric or entity based, e.g.

&lt;
&#60;

are both representations of the less than symbol <. The way I have dealt with this is to make sure all entities are converted to numerical codes first. You could then do a negative match on &#, e.g.

s = s.replace(/&([^#])/g,"&amp;$1");

However, what if you had multiple & all in a row together, e.g. &&&&? You would only encode two out of the four, because each match consumes the character after the &. You could run the replace multiple times to handle it, but I tend to use placeholders in cases like this, as it's easier and better for performance to do positive matches rather than convoluted negative matches. Therefore I replace all occurrences of &# with a placeholder, then replace & with &amp;, and then finally put back the &#.
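The steps described above can be sketched roughly as follows. This is not the code from my Encoder object, just a minimal illustration under some assumptions: the NAMED_TO_NUMERIC table is a deliberately tiny, hypothetical lookup (a real one would need the full set of named entities, and anything missing from it will have its & encoded), and PLACEHOLDER is a made-up token that is assumed never to appear in the input.

```javascript
// Hypothetical, incomplete lookup of named entities to numeric codes.
var NAMED_TO_NUMERIC = { amp: 38, lt: 60, gt: 62, quot: 34, pound: 163, nbsp: 160 };

// Step 1: convert known named entities (&pound;) to numeric form (&#163;).
function namedToNumeric(s) {
  return s.replace(/&([a-zA-Z][a-zA-Z0-9]*);/g, function (match, name) {
    return Object.prototype.hasOwnProperty.call(NAMED_TO_NUMERIC, name)
      ? '&#' + NAMED_TO_NUMERIC[name] + ';'
      : match; // unknown name: leave untouched
  });
}

// The negative-style match, with a global flag. Each match consumes the
// character after the &, so every other & in a run is skipped.
function encodeAmpNegative(s) {
  return s.replace(/&([^#])/g, '&amp;$1');
}

// Assumed never to occur in real input.
var PLACEHOLDER = '@@AMPNUM@@';

// Step 2: hide &#, encode every remaining &, then put &# back.
function encodeAmpWithPlaceholder(s) {
  return s
    .split('&#').join(PLACEHOLDER)
    .replace(/&/g, '&amp;')
    .split(PLACEHOLDER).join('&#');
}

// Both steps together.
function encodeStrayAmpersands(s) {
  return encodeAmpWithPlaceholder(namedToNumeric(s));
}

console.log(encodeAmpNegative('&&&&'));        // &amp;&&amp;&
console.log(encodeAmpWithPlaceholder('&&&&')); // &amp;&amp;&amp;&amp;
console.log(encodeStrayAmpersands('Sold for &pound;4.2m & more &&'));
// Sold for &#163;4.2m &amp; more &amp;&amp;
```

Note that this sketch treats any &# as the start of a numeric entity without checking that digits and a semicolon follow, which matches the description above but is more trusting than a production encoder should be.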

If you want to see my html encoder and decoder functions in action and get a copy of the encoder object that has a number of useful functions then go to the following page on my site www.strictly-software.com/htmlencode. The page also has a form to allow you to HTML encode and decode online if you ever need to do just that.

16 comments:

Anonymous said...

Just what I have been looking for!
You saved me from going crazy over these awful conversions.
Thanks a lot!!!
peter

robertc said...

Great script, but there doesn't seem to be any license information. Is it OK to re-distribute this as part of commercial software?

Rob

Rob Reid said...

Yes, feel free to include it in any software you choose, as long as you put a reference to my site in there so people know where you got it from.

Also if you make any money out of this software you can find my donate button here >> http://tools.strictly-software.com/

:)

robertc said...

In the end I did the createElement hack in my current project, but next time this issue comes up I'll consider adding your library.

Rob

vlad said...

Could you please give an example of how to use it for XSS prevention with Ajax?

JAB Creations said...

I encountered this same problem. Thankfully there is a solution: not working directly with a nodeValue. I bet you were doing the same thing as I was in this instance. Take the nodeValue (for which alert(typeof n) will alert "string" even though it is still treated as a direct DOM node) and convert it to a string...

var n = String(youNode.nodeValue);

Rob Reid said...

But does that handle the issue of double or partially encoded content?

That was the main aim of the library as I was importing and joining content from various sources.

Taking a string that has been encoded correctly and adding it to a string partially encoded and then to one not encoded at all and running it through one function to ensure correct encoding is the target.

Anonymous said...

Hey R, is your Decoder helper on github? I want to add something to it.

Alex Oss said...

I've been using this for some time, but recently we encountered a severe performance problem when calling htmlEncode() on a long string. After some profiling work, we found a hotspot: numEncode(). By replacing the Javascript string concatenation with array push()ing followed by join(), we cut the runtime of numEncode() to about a tenth to a fifteenth of the time.

Thanks for a great utility!

// Numerically encodes all unicode characters
numEncode : function(s){

    if(this.isEmpty(s)) return "";

    var e = new Array();
    var slength = s.length;

    for (var i = 0; i < slength; ++i)
    {
        var c = s.charAt(i);
        if (c < " " || c > "~")
        {
            e.push("&#");
            e.push(c.charCodeAt()); //numeric value of code point
            e.push(";");
        }
        else
        {
            e.push(c);
        }
    }
    return e.join("");
},

Rob Reid said...

Hi Alex

Thanks. I usually use string builder objects in my server-side coding (C#, ASP, PHP etc.) instead of string concatenation for exactly the reason you stated, so the only explanation for why I didn't with this JS object is that it was written some years back. Plus I haven't experienced any bottlenecks myself with it.

However, I would suggest that anyone who does string building either use a string builder or make their own if they're using an old language, e.g. ASP classic.

E.g. for classic ASP, a class with a default large array size that is incremented in large chunks, so as not to ReDim on every addition, and a join at the end with a ReDim back down to the size (use a counter for each addition) if you need to. It depends what you are joining (e.g. CSV or text), as a TRIM at the end might just be enough for plain text.

If you don't need array size declaration (like in ASP classic) then just pushing+joining is enough to speed things up.

Remember strings are just arrays of characters anyway.

I will update the code when I get a chance.

Thanks for that.

Rob

Anonymous said...

Thanks a lot. Grazie mille!

Roberto

Anonymous said...

This utility looks great and I'm trying to incorporate it into my ASP.NET site.

To test it I copied and pasted the source code for a large .aspx page into a text box and called the Encoder.htmlEncode function.

It encoded the page just fine, but ASP.NET still complains of "A potentially dangerous Request.Form value was detected from the client".

Apparently this is caused by the combination of &#.

Do you have a solution for this?

Thank you,
Doug

Rob Reid said...

Hi Anon?

Why are you using JS to HTML encode characters in .NET if you are POSTING the form?

Why not just do the .HtmlEncode on the value once posted before outputting the value again?

The pure combination of the characters &# is not dangerous on its own, so I would suggest looking at how you are using the code.

Maybe use a replacement for them and then swap it out on the other side?

It is obviously thinking some form of XSS hack is going on so maybe some code/app/security setting is set too high or incorrectly?

The reason for this code is for JavaScript encoding where you would be doing AJAX or have no other way of HTML Encoding the characters.

Thanks

Rob

Anonymous said...

Rob, thanks for getting right back to me.

I have a page that allows users to post comments related to internal support requests.
In some cases users want to post source code of various types.

If I set ValidateRequest="false" I could handle this on the server side.
Since I don't want to do this I've been pursuing javascript solutions to encode the text on the client before .NET sees it and has a chance to complain.

It seemed that JavaScript's encodeURI solved the problem; I just called .NET's Uri.UnescapeDataString before sending the text back to the client.
However, at least one character causes a problem: '%0A', the URL encoding for a line feed, causes the web page to break. I thought about adding code to replace that character, but was afraid that could turn into a game of whack-a-mole trying to figure out what other characters might break the page.

So, that put me on a search for a solution that already handles it.

Doug

Rob Reid said...

OK, I did wonder why you needed it. I guess .NET sees the line break character as a potential XSS hack for escaping out of an HTML control to output your own script or HTML. I don't know of a way of changing that, but using encodeURIComponent or encodeURI could be the way to solve it.
Glad you have a solution. A replace and replace back can always be a good, performant way of getting round tough problems like this, and if you check my code I already use that method to get around certain issues.

Javier Roca said...

This is pure gold. Tomorrow I have to show my final project for a course and I had several problems with accents and other Spanish chars. You saved my life, thanks a lot!!