Using Regular Expressions To Make a SuperTrim() Function
By Strictly-Software
How many times have you extracted two bits of text from various websites, feeds or even databases and tried to compare them, only to find they would not match?
The other day I wrote a little example of what happens when two different ASCII space characters are used within SQL, and how to check for and remove them to make a match. But what about all the various ways you can HTML-encode a space, such as &nbsp; and &#160;, plus the characters that make up a CrLF or just a Cr or Lf? Think of the VbCrLf constant for a carriage return and line feed, Environment.NewLine, or constants that hold the values \r and \n, and maybe a tab \t as well.
All of these count as spaces that need removing, and a special function that uses regular expressions can easily remove them all.
I use this function in MS SQL as a CLR C# UDF, and I also extend C# projects with a new SuperTrim() method like so:
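To see why two values that look identical can fail a comparison, here is a minimal JavaScript sketch you could paste into a browser console. The sample strings and the codePoints() helper are made up for illustration; one string uses an ordinary space and the other a non-breaking space.

```javascript
const a = "Rob Reid";       // ordinary space, U+0020
const b = "Rob\u00A0Reid";  // non-breaking space, U+00A0

// The two values render the same but are not equal.
console.log(a === b);

// Hypothetical helper: list each character's code point as 4-digit hex.
function codePoints(text) {
  return Array.from(text, (ch) => ch.codePointAt(0).toString(16).padStart(4, "0"));
}

// Dumping the code points shows which kind of space is hiding in the text.
console.log(codePoints(a).join(" "));
console.log(codePoints(b).join(" "));
```

Dumping code points like this is a quick way to find out which space characters your own data contains before deciding what the trim function needs to handle.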
using System.Text.RegularExpressions;

// Extension methods must live in a static class; the class name here is arbitrary.
public static class StringExtensions
{
    // A run of space-like tokens at the start or at the end of the input. \s already covers
    // \t, \r and \n. The entity names are assumed examples; adjust the list to suit your data.
    private static readonly Regex EdgeSpaces = new Regex(
        @"^(?:&nbsp;|&#160;|&#32;|\s)+|(?:&nbsp;|&#160;|&#32;|\s)+$",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    public static string SuperTrim(this string value)
    {
        if (string.IsNullOrEmpty(value)) return value;
        // Replace the leading and trailing runs with nothing; text in between is left alone.
        return EdgeSpaces.Replace(value, "");
    }
}
It is easy enough to create yourself a test page in HTML using JavaScript: a couple of textarea boxes, one for the test value containing encoded spaces and one for the output, plus a button to run a JS function that applies the same regex as the C# example and writes the result to the second box.
The regular expression is interchangeable between languages, and that's what I love about regular expressions: they can be tested and played about with on a simple HTML page with JavaScript, and then once the expression works you can easily move it into whatever language you are working in, e.g. C# or PHP.
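The JavaScript side of such a test page could look something like the sketch below. The function name superTrim and the list of entities are assumptions made for illustration, not a definitive port; the pattern simply strips runs of space-like tokens from the start and the end of the input.

```javascript
// One space-like token: an assumed set of HTML entities, or any whitespace character.
const SPACE_TOKEN = "(?:&nbsp;|&#160;|&#32;|\\s)";

// A run of tokens anchored at the start, or a run anchored at the end.
const EDGE_SPACES = new RegExp("^" + SPACE_TOKEN + "+|" + SPACE_TOKEN + "+$", "gi");

// Hypothetical JavaScript counterpart of a SuperTrim() method.
function superTrim(value) {
  // Remove the matched runs at both ends; text in between is untouched.
  return String(value).replace(EDGE_SPACES, "");
}

console.log(superTrim("&nbsp;&#160; \t Rob Reid \r\n&nbsp;")); // Rob Reid
```

On a test page you would call superTrim() from the button's click handler, passing in the first textarea's value and writing the result to the second.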
For example this encoded text:
   Rob Reid    
After running the regular expression in JavaScript, or calling string newValue = EncodedValue.SuperTrim() in C#, you should get this value with no encoded characters left:
Rob Reid
I find extending whatever language I am writing in to include a SuperTrim() function very handy. If you were handling URLs you might also want to remove %20 and the + sign. You can always add alternatives to the expression, or take some out, depending on your needs, for example values for nulls or \v for vertical tabs, depending on the content you are handling.
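One way to make the token list adjustable is to build the expression from an array, as in this JavaScript sketch. The makeSuperTrim() factory, the base entity list and the URL example are all hypothetical; extra tokens are passed as regex fragments, so characters such as + need escaping.

```javascript
// Hypothetical factory: builds a trim function from a list of regex fragments.
function makeSuperTrim(extraTokens = []) {
  // Assumed base tokens; in JavaScript \s already covers tab, CR, LF and vertical tab.
  const tokens = ["&nbsp;", "&#160;", "&#32;", "\\s", ...extraTokens];
  const run = "(?:" + tokens.join("|") + ")+";
  // A run anchored at the start, or a run anchored at the end.
  const edges = new RegExp("^" + run + "|" + run + "$", "gi");
  return (value) => String(value).replace(edges, "");
}

// URL flavour: also treat %20, a literal plus sign and NUL as trimmable.
const trimUrlValue = makeSuperTrim(["%20", "\\+", "\\0"]);

console.log(trimUrlValue("%20+Rob+Reid+%20")); // Rob+Reid
```

Only the ends are trimmed, so the + between the two words survives. If you also need to clean the interior, that is a separate replace.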
© 2021 Strictly-Software
2 comments:
I like that function. It is especially useful when getting feeds from all over the place that may or may not contain HTML/UTF-8 encoded spaces etc. You need to handle all of it, tabs and newlines in different formats, when getting data from elsewhere, and it is easy to convert into a PHP or JavaScript function if needed.
Love this function. I am doing this in SQL, but we are not allowed to use CLR for some reason, so I am just doing multiple replaces on the left-hand side of the string and then the right-hand side, using PATINDEX to find the position of an [A-Z] character to use in the substring, then finishing off with LTRIM(RTRIM(@word)) to make sure. It seems to be working. Thanks