Showing posts with label HTML Encode. Show all posts
Showing posts with label HTML Encode. Show all posts

Wednesday, 21 July 2021

Making A Super Trim Function

Using Regular Expressions To Make a SuperTrim() Function


By Strictly-Software

How many times has there been when you have two bits of text that you have extracted from various websites, or feeds or even databases and tried to compare them but they would not match?

I know I wrote a little example of when two different ASCII space characters are used within SQL the other day and how to check and remove them to make a match but what about all the various ways you can HTML Encode spaces like     and   plus others that make up a CrLF or just a Cr or Lf, a bit like the VbCrLf constant for a carriage return and line feed, either using Environment.NewLine or constants that hold values for \r \n and also maybe a tab \t.

All these are spaces that need removing and with a special function that uses regular expressions they can all easily be removed .

I use this function in MS SQL with a CLR C# UDF as well as Extending C# projects with a new SuperTrim() method like so:

public static string SuperTrim(this string value)
{
    string newval = "";

    // match each type of space from start of input up to a word character that may or may not have spaces in between
    // e.g a sentence like Hello There John and then removes the same space characters to the right to the end of sentence.
    string re = @"(^(?: | | |\s|\t|\r|\n)+?)(\w+[\s\S]+?\w+)((?: | | |\s|\t|\r|\n)+?$)";
    Regex regex = new Regex(re, RegexOptions.Compiled | RegexOptions.IgnoreCase);

    newval = regex.Replace(value, ""); // replace each space HTML char with nothing

    return newval;
}


It is pretty easy enough to create yourself a test page in HTML using JavaScript with a couple of textarea input boxes for the test value containing encoded spaces a button to run a JS function that runs the regex as seen in the C# example and then outputs the result in another box. 

The regular expression is interchangeable between languages, that's what I love about Regular Expressions, they can be tested and played about with on a simple HTML page with JavaScript and then one the expression works you can easily move it into whatever language you are working in e.g C# or PHP.

For example this encoded text:

      Rob Reid       

Then after running the regular expression or string newValue = EncodedValue.SuperTrim() method in C# or JavaScript you should get this value with no encoded characters left.
Rob Reid

I find extending whatever language I am writing in to include a SuperTrim() function very handy. If you were handling URL's you might want to remove %20 and the + sign, you can always add more or less into the expression depending on your needs of course like values for nulls or \v for vertical tabs depending on the content you are handling.


By Strictly-Software

© 2021 Strictly-Software

Wednesday, 3 August 2011

HTML Super Trim Functions

Remove non breaking spaces and other white space either side of strings

Most languages have a Trim function that removes white space from either side of a string.

However a lot of the time these Trim functions will only remove standard white space e.g if you hit the space bar a couple of times and not other forms of white space such as tabs, new lines or HTML space characters such as non breaking spaces whether they are HTML entity encoded: or Numerically encoded e.g

Therefore sometimes you may need a "Super Trim" function that will handle the removal of all types of space characters including HTML entities.

The following have been written in PHP but can easily be converted into C# or VB. The main part to take away is the regular expression used within each function which replaces a string containing one or more space characters whether they be control characters or HTML entities from either side of the string.

// wrapper function to do trim both sides
function HTMLTrim($text){

// call both functions at once
return HTMLLeftTrim(HTMLRightTrim($text));
}

// removes spaces and at the beginning of strings
function HTMLLeftTrim($text){

// remove space to the left of the text
return preg_replace("@^( | |\s)+(\S+)@","$2",$text);

}

// removes spaces and at the beginning of strings
function HTMLRightTrim($text){

// remove space to the right of the text
return preg_replace("@(\S+)( | |\s)+$@","$1",$text);

}


You can test this out in a simple PHP page with the following code:


$str = "     hello there        ";

echo "before trim its '" . $str . "'";

echo "<br><br>now its '" . HTMLTrim($str) . "'";


Which returns the following output:

before trim its '     hello there      '

now its 'hello there'

I find it very useful when I am scraping content from the web and need to handle the removal of a mixture of standard spaces and HTML spaces.

Wednesday, 27 July 2011

My JavaScript HTML Encoder Function

Why did I write my JavaScript HTML Encoder object

One of my most popular scripts that continues to get downloaded a lot as well as spark many an email from companies and users is my HTML Encoder script that I wrote many a year back.

You can see the script in action here: www.strictly-software.com/htmlencoder

I was emailed another function that claimed to do the same work earlier and I just thought it would be interesting to other readers to know why I created the script in the first place and why many of the examples out on the net that claim to do the same job are just not up to scratch.

On many sites nowadays people are obtaining content from various sources through XML and RSS feeds and they often use JavasScript and AJAX to do part of the job. I know that when I was creating my football site www.hattrickheaven.com which was basically an exercise in using the Google AJAX API I came across the following problems:

  1. Content is obtained from a multitude of sources and there is no common standard that can be relied on 100%.
  2. Content is often mixed together through feed blenders like Yahoo Pipes or Blastcasta.
  3. Content is often loaded in on the fly using AJAX, a Scraper Proxy to do cross domain content jacks and other formatting duties such as on the fly translations etc.
  4. There is no built in function for HTML encoding in the JavaScript language.

Now I looked at many an HTML Encode solution before writing my own but the main problem I needed to overcome was that content could already be HTML encoded or partially HTML encoded and functions I came across did nothing to handle this problem.

For example if you had this bit of text you will notice that it has an encoded ampersand in the middle but the quotes are not encoded.

"Rob says hello & goodbye"

If you ran this through many of the HTML Encode functions out there including many server side ones it will double encode the & in the & so that you end up with this.

&quot;Rob says hello &amp;amp; goodbye&quot;

Now you may want this to happen but I doubt it.

Not if you want to run all your content through the same function without worrying about double encoding issues like this on your website.

This is why I wrote my HTML Encoder object as it puts into practise the technique I used whilst writing ASP classic sites that supported the UTF-8 character set.

With multi-lingual sites that display Arabic or Japanese character sets you don't want to be using the inbuilt Server.HTMLEncode function on all your textual input as it will triple the size of everything you store by converting every non ASCII character into a &#XXXX; encoded string.

When your database is set up to correctly store your text as Unicode (nvarchar etc) and you are outputting it with the correct code-base then you need something a bit more clever to do the job as you still need to encode characters that can cause you damage such as the naughty 4 " ' < > as well as the ampersand & to make your page validate (URL's, Querystrings etc).

This is obviously for security reasons as you don't want people to be able to malform your input and break your layout or insert XSS hack vectors into your system.

However running the standard HTML Encode function will encode everything as well as cause double encoding issues. Therefore the HTML Encoding needs to be done in stages and the ampersand has to be handled correctly as it makes up part of the HTML encoded &#XXXX; format.

So whilst it comes as quite a surprise to see how popular this script is (#1 on Google for "HTML Encode with Javascript") it is not really surprising when one thinks about the problem this script solves.

Whilst I don't claim to be any kind of coding genius and I am sure the code can be improved it does do what is says on the tin and if you are thinking of using another JavaScript function to HTML Encode your content then compare it first with my tool to see if it handles the double encoding issue as if it doesn't then you might find yourself having problems down the road.