Wednesday, 3 August 2011

HTML Super Trim Functions

Remove non-breaking spaces and other white space from either side of strings

Most languages have a Trim function that removes white space from either side of a string.

However, a lot of the time these Trim functions only remove standard white space, e.g. the spaces you get from hitting the space bar a couple of times. Some of them leave other forms of white space such as tabs and new lines in place, and they generally do not touch HTML space characters such as non-breaking spaces, whether they are HTML entity encoded (&nbsp;) or numerically encoded (&#160;).
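As a quick illustration of the problem, here is a minimal sketch using PHP's built-in trim() on a made-up sample string (the string is my own assumption, not taken from any real page). PHP's trim() does strip tabs and new lines, but the HTML space entities survive:

```php
<?php
// Sample input: real white space on the outside, HTML space entities inside it.
$raw = "  \t&nbsp;hello there&#160;\n";

// trim() strips the spaces, tab and new line but leaves the entities alone.
$trimmed = trim($raw);

echo "'" . $trimmed . "'"; // prints '&nbsp;hello there&#160;'
```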

Therefore sometimes you may need a "Super Trim" function that will handle the removal of all types of space characters including HTML entities.

The following functions are written in PHP but can easily be converted into C# or VB. The main part to take away is the regular expression used within each function, which matches a run of one or more space characters, whether white space characters or HTML entities, at one side of the string so that it can be removed.

// wrapper function to do trim both sides
function HTMLTrim($text){

// call both functions at once
return HTMLLeftTrim(HTMLRightTrim($text));
}

// removes spaces and &nbsp; entities at the beginning of strings
function HTMLLeftTrim($text){

// remove any run of &nbsp;, &#160; or white space from the left of the text
return preg_replace('@^(?:&nbsp;|&#160;|\s)+@', '', $text);

}

// removes spaces and &nbsp; entities at the end of strings
function HTMLRightTrim($text){

// remove any run of &nbsp;, &#160; or white space from the right of the text
return preg_replace('@(?:&nbsp;|&#160;|\s)+$@', '', $text);

}
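If you also need to catch the hex form of the numeric entity (&#xA0;) or a literal non-breaking space character (U+00A0) that has already been decoded, the same idea can be stretched a bit further. The following is only a sketch under a few assumptions: the name HTMLSuperTrim is hypothetical, the input is assumed to be valid UTF-8, and both sides are handled in a single regular expression rather than two calls.

```php
<?php
// Hypothetical helper (not part of the functions above): trims both sides in one pass.
// Assumes $text is valid UTF-8, because the "u" modifier makes preg_replace()
// return null on malformed input.
function HTMLSuperTrim($text){
    // one "space" unit: named entity, decimal or hex numeric reference,
    // a literal U+00A0 character, or ordinary white space
    $space = '(?:&nbsp;|&#160;|&#x0*A0;|\x{00A0}|\s)';

    // strip a run of units anchored at the start or at the end of the string
    return preg_replace('@^' . $space . '+|' . $space . '+$@iu', '', $text);
}

// "\xC2\xA0" is the UTF-8 byte sequence for a literal non-breaking space
$str = "&nbsp; \t&#160;hello&nbsp;there&#xA0;\xC2\xA0 \n";

echo "'" . HTMLSuperTrim($str) . "'"; // prints 'hello&nbsp;there'
```

Note that the entity in the middle of the text is left alone, as only runs touching the start or end of the string are removed.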


You can test this out in a simple PHP page with the following code:


$str = "     hello there        ";

echo "before trim its '" . $str . "'";

echo "<br><br>now its '" . HTMLTrim($str) . "'";


Which returns the following output:

before trim its '     hello there      '

now its 'hello there'

I find it very useful when I am scraping content from the web and need to handle the removal of a mixture of standard spaces and HTML spaces.
