One of my most popular scripts, which continues to be downloaded a lot and still sparks many an email from companies and users, is the HTML Encoder script I wrote many years ago.
You can see the script in action here: www.strictly-software.com/htmlencoder
Earlier I was emailed another function that claimed to do the same work, and I thought other readers might be interested to know why I created the script in the first place and why many of the examples on the net that claim to do the same job are just not up to scratch.
On many sites nowadays people obtain content from various sources through XML and RSS feeds, and they often use JavaScript and AJAX to do part of the job. When I was creating my football site www.hattrickheaven.com, which was basically an exercise in using the Google AJAX API, I came across the following problems:
- Content is obtained from a multitude of sources and there is no common standard that can be relied on 100%.
- Content is often mixed together through feed blenders like Yahoo Pipes or Blastcasta.
- Content is often loaded in on the fly using AJAX, with a scraper proxy handling cross-domain content jacks and other formatting duties such as on-the-fly translation.
Now I looked at many an HTML Encode solution before writing my own, but the main problem I needed to overcome was that content could already be HTML encoded, or partially HTML encoded, and the functions I came across did nothing to handle this.
For example, take this bit of text. Notice that it has an encoded ampersand in the middle but the quotes are not encoded.
"Rob says hello &amp; goodbye"
If you ran this through many of the HTML Encode functions out there, including many server side ones, they would double encode the & in the &amp; so that you end up with this.
&quot;Rob says hello &amp;amp; goodbye&quot;
Now you may want this to happen, but I doubt it, not if you want to run all your content through the same function without worrying about double encoding issues like this on your website.
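To make the double encoding problem concrete, here is a minimal sketch of the kind of naive encoder described above. The function name naiveHtmlEncode is hypothetical and is not taken from any particular library; it just shows what happens when every ampersand is replaced without checking whether it already starts an entity.

```javascript
// Hypothetical naive encoder, typical of simple find-and-replace functions.
// It replaces every ampersand, including ones that already begin an entity.
function naiveHtmlEncode(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Partially encoded input: the ampersand is encoded, the quotes are not.
const partlyEncoded = '"Rob says hello &amp; goodbye"';
const doubleEncoded = naiveHtmlEncode(partlyEncoded);

console.log(doubleEncoded);
// prints: &quot;Rob says hello &amp;amp; goodbye&quot;
```

A browser would render that result with a visible "&amp;" in the middle of the sentence, which is almost never what you want.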
This is why I wrote my HTML Encoder object: it puts into practice the technique I used whilst writing classic ASP sites that supported the UTF-8 character set.
With multi-lingual sites that display Arabic or Japanese character sets, you don't want to be using the inbuilt Server.HTMLEncode function on all your textual input, as it will multiply the size of everything you store by converting every non-ASCII character into a &#XXXX; encoded string.
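As a rough illustration of the size problem, the sketch below uses a hypothetical helper, encodeAllNonAscii, that mimics an encode-everything approach by turning each non-ASCII character into a decimal numeric entity. It is not the Server.HTMLEncode implementation, and the exact growth factor depends on the characters involved and on how the text is stored.

```javascript
// Hypothetical helper: converts every non-ASCII character into a decimal
// numeric entity, mimicking an encode-everything approach.
function encodeAllNonAscii(text) {
  let out = "";
  for (const ch of text) {              // iterate by code point
    const code = ch.codePointAt(0);
    out += code > 127 ? "&#" + code + ";" : ch;
  }
  return out;
}

const japanese = "こんにちは";            // 5 characters
const encoded = encodeAllNonAscii(japanese);

console.log(encoded);
// prints: &#12371;&#12435;&#12395;&#12385;&#12399;
console.log(japanese.length, encoded.length);
// prints: 5 40
```

Five characters become forty, and every one of them was perfectly safe to store and output as it was.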
When your database is set up to store your text correctly as Unicode (nvarchar etc.) and you are outputting it with the correct code page, you need something a bit more clever to do the job. You still need to encode the characters that can cause you damage, the naughty 4 " ' < >, as well as the ampersand & to make your page validate (URLs, query strings etc.).
This is obviously for security reasons, as you don't want people to be able to feed you malformed input that breaks your layout or inserts XSS hack vectors into your system.
However, running the standard HTML Encode function will encode everything as well as cause double encoding issues. Therefore the HTML encoding needs to be done in stages, and the ampersand has to be handled correctly because it forms part of the &#XXXX; encoded format itself.
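The staged approach can be sketched roughly as follows. This is a simplified illustration written for this post, not the actual HTML Encoder script: the function name safeHtmlEncode and the regular expression are assumptions. Stage one encodes only those ampersands that do not already begin a named or numeric entity, and stage two encodes the four dangerous characters. Non-ASCII characters are left alone.

```javascript
// Sketch only: a simplified staged encoder, not the real HTML Encoder script.
function safeHtmlEncode(text) {
  return text
    // Stage 1: encode an ampersand only when it is NOT already the start of
    // an entity such as &amp; or &#1234; or &#x4E2D;
    .replace(/&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);)/g, "&amp;")
    // Stage 2: encode the four dangerous characters.
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

const once = safeHtmlEncode('"Rob says hello &amp; goodbye"');
console.log(once);
// prints: &quot;Rob says hello &amp; goodbye&quot;

// Running it again changes nothing, so content can pass through repeatedly.
const twice = safeHtmlEncode(once);
console.log(twice === once);
// prints: true

console.log(safeHtmlEncode("日本語 & <b>bold</b>"));
// prints: 日本語 &amp; &lt;b&gt;bold&lt;/b&gt;
```

The trade-off is that text which literally means to show the characters "&amp;" on screen cannot be told apart from an already encoded ampersand, so this approach suits mixed feed content better than, say, a code tutorial.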