Tuesday, 4 October 2011

Find position of a string within a paragraph using a Regular Expression

How to find a string within another string using a Regular Expression instead of strpos or stripos

I was writing a piece of code in PHP the other day where I had to find a given word within a longer piece of text (e.g. an article). I then wanted to take X number of characters from the position of that first match and return a snippet that didn't cut off the last word.

At first I was using the PHP functions strpos and stripos, but these don't allow you to use a Regular Expression as the search term (the needle in the haystack, as PHP.net calls the parameters). That meant I was getting mismatches where the search term was contained within other words.

E.g. if I was looking for the word wool it would also match woollen.
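To make the mismatch concrete, here is a minimal sketch. The sample sentence and variable names are made up for illustration; stripos and preg_match are the standard PHP functions.

```php
<?php
// Illustrative sample text (not from the original project).
$haystack = "A woollen jumper is made of wool.";

// stripos does a plain substring search, so it stops inside "woollen".
$plainPos = stripos($haystack, "wool");

// A regular expression with \b word boundaries only accepts the whole word.
$wholeWordFound = preg_match('/\bwool\b/i', $haystack, $m);

var_dump($plainPos);        // int(2), the "wool" inside "woollen"
var_dump($wholeWordFound);  // int(1), the standalone "wool" exists
var_dump($m[0]);            // string(4) "wool"
```

Called like this, preg_match only reports that a whole-word match exists, not where it starts.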

Therefore the answer was to write a custom function that uses preg_match with a non-greedy capture group at the beginning of the pattern. The pattern is passed to the function without delimiters.

The function is below.



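/**
 * Function to find the first occurrence of a regular expression pattern within a string
 *
 * @param string $regex      the pattern, without delimiters
 * @param string $str        the string to search
 * @param bool   $ignorecase true for a case-insensitive match
 * @return int|false         position of the first match, or false if there is none
 */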
function preg_pos( $regex, $str, $ignorecase ) 
{ 
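 // build up the RegEx wrapping it in @ delimiters; the s modifier lets . match newlines
 // so the leading group can span any line breaks in the text before the match
 $pattern = "@^(.*?)" . $regex . "@s" . ($ignorecase===true ? "i" : "");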

 if( preg_match( $pattern, $str, $matches ) ) {
  return strlen( $matches[ 1 ] ); 
 }

 return false; 
} 


As you can see the pattern needs to be passed in without delimiters, e.g. instead of /\bwool\b/ or @\bwool\b@ just pass in \bwool\b. Because the function wraps the pattern in @, any literal @ inside the pattern has to be escaped.

I then add a non-greedy capture group to the beginning, ^(.*?), so that it captures everything from the start of the input string up to the first match. If the pattern is found, I can do a strlen on that captured group to get the starting position of the pattern (in bytes, just like strpos).

If you want the match to be case-insensitive, pass in TRUE as the third parameter and the i (ignore case) flag will be added to the end of the pattern. Pass in FALSE for a case-sensitive match.
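For comparison, preg_match can also report the position itself when it is given the PREG_OFFSET_CAPTURE flag. The sketch below is an alternative to the capture-group technique, not the function from this post: preg_pos_offset is a hypothetical name and the sample text is made up.

```php
<?php
/**
 * Hypothetical alternative to preg_pos(): same inputs and return values,
 * but the position comes from PREG_OFFSET_CAPTURE instead of a leading group.
 */
function preg_pos_offset($regex, $str, $ignorecase = false)
{
    // Wrap the pattern in @ delimiters, as the original function does.
    $pattern = "@" . $regex . "@" . ($ignorecase === true ? "i" : "");

    // With PREG_OFFSET_CAPTURE each entry in $matches is [text, byte offset].
    if (preg_match($pattern, $str, $matches, PREG_OFFSET_CAPTURE)) {
        return $matches[0][1];
    }

    return false;
}

// Sample text with a line break before the word we want.
$text = "A woollen jumper\nis made of wool.";

var_dump(preg_pos_offset('\bwool\b', $text, true)); // int(28)
var_dump(preg_pos_offset('\bsilk\b', $text, true)); // bool(false)
```

The offset is a byte count, so it can be passed to substr in the same way as the strlen result. As there is no leading (.*?) group, line breaks before the match make no difference to this version.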

An example of this code being used is below. The code is looping through an array of words looking for the first match within a longer string (some HTML) and then taking 250 characters of text from the starting point, ensuring the last word is a whole word match.


// find first occurrence of any of the terms I am looking for and then take 250 characters from the first word
// ensuring I get a whole word at the end

$a = explode(" ",$terms);
foreach($a as $w){

 // skip empty or small terms

 if(!empty($w) && strlen($w) > 2){
   
  // get the position of the word ensuring it's not within another word - using \b word boundary - notice no RegEx delimiters @regex@ or /regex/
  // also ensure any special characters within the word are escaped to prevent a mismatch (passing "@" so the delimiter used by preg_pos is escaped too)
  $pos = preg_pos( "\b" . preg_quote($w, "@") . "\b", $html, true ) ;

  // if pos is false then the word was not found, otherwise it holds the starting position (which can be 0)

  if($pos !== false){

   // found the word take 250 chars from the first occurrence

   $text = substr($html, $pos, 250);
   
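   // roll back to last space before our last word to ensure we don't get partial words
   // strrpos returns false if there is no space at all, so only trim when one was found

   $space = strrpos($text, " ");
   if($space !== false){
    $text = substr($text, 0, $space);
   }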
   
   // now we have found a term exit
   break;
  }
 }
}
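The trimming step can also be looked at in isolation. The sketch below is only an illustration: snippet_from is a hypothetical helper name and the sample strings are made up. It covers two edge cases, a snippet that already reaches the end of the text and a snippet that contains no space.

```php
<?php
/**
 * Hypothetical helper: take up to $length bytes from $pos and drop a
 * trailing partial word, but only when the text was actually cut short.
 */
function snippet_from($text, $pos, $length = 250)
{
    $snippet = substr($text, $pos, $length);

    // The snippet runs to the end of the text, so no word was cut in half.
    if ($pos + $length >= strlen($text)) {
        return $snippet;
    }

    // strrpos returns false when the snippet contains no space at all.
    $lastSpace = strrpos($snippet, " ");

    return $lastSpace === false ? $snippet : substr($snippet, 0, $lastSpace);
}

var_dump(snippet_from("wool is warm and soft", 0, 10)); // string(7) "wool is"
var_dump(snippet_from("wool is warm", 0, 250));         // string(12) "wool is warm"
var_dump(snippet_from("supercalifragilistic", 0, 5));   // string(5) "super"
```

Note that this sketch still drops the last word whenever the text continues past the cut, even if the cut happened to land exactly on a word boundary.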


Also remember to wrap your word in preg_quote so that any characters that have a special meaning to the Regular Expression engine, e.g. ? . + * [ ] ( ) { } etc, are escaped properly. preg_quote also takes the delimiter as an optional second argument so that it gets escaped as well.
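As a quick illustration of what preg_quote does and why it matters, here is a small sketch. The sample strings are made up; the @ delimiter matches the one used by the function in this post.

```php
<?php
// preg_quote escapes characters that have a special meaning in a pattern.
var_dump(preg_quote("3.5"));           // string(4) "3\.5"

// The optional second argument names a delimiter to escape as well.
var_dump(preg_quote("a@b.com", "@"));  // string(9) "a\@b\.com"

// Unquoted, the dot matches any character, so "3.5" also matches "315".
$unquoted = '@\b3.5\b@';
$quoted   = '@\b' . preg_quote("3.5", "@") . '\b@';

var_dump(preg_match($unquoted, "build 315")); // int(1), a mismatch
var_dump(preg_match($quoted, "build 315"));   // int(0)
var_dump(preg_match($quoted, "build 3.5"));   // int(1)
```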

I found this function quite useful.

5 comments:

Anonymous said...

Nice job! It took me a while to fully get what you were doing but I was able to adapt it in the end to my situation.
Try this \A (instead of ^) to start at beginning of string instead of line.
Add /s so that preg_match returns answers in $matches even if there are \r\n in your text.
Etta

R Reid said...

Well ^ means start at the beginning of input not the beginning of a line so I am not sure what you mean by that.

The idea is to start at the beginning of the string and match all characters up to the start of the word you are looking for.

"@^(.*?)" . [escaped word you are looking for] . @

so everything before the word you are looking for is captured in the subgroup i.e (.*?) and then doing a strlen on that will give you the number of characters from the start of the input until your word which is what you basically want to do.

I haven't had any problems with the /s as of yet but maybe I will do so thanks.

You could also probably use ([\s\S]*?) instead of (.*?), which matches any space OR non-space character (including line breaks) 0 or more times.

Thanks for commenting

Ben Wilkins said...

Wicked regex!

I was scratching my head for days as I am useless at regular expressions and I thought with the amount of PHP functions they have for everything there would surely be one for finding the position of a string within another one using a regex.

Thanks a lot you saved my bacon!

R Reid said...

Thanks Ben, glad you liked it.

One of the things I actually got taught at my company from an ex employee was regular expressions and I am so glad I learnt them as they are so powerful once you get the hang of them.

At first I used to look at one and just scratch my head going "what the FxxK is that" but now I know more about them - the do's and don'ts and they are a great tool.

If you can do something simple without them I would recommend it, but for complex pattern matching they are the bomb. Just don't start doing things that can cause yourself trouble, like catastrophic backtracking or negative lookaheads etc, as they could cause your web server's CPU to max out in an instant (a good sign of a runaway regex is if one of your server's processors keeps hitting 100% CPU use for a while and then drops back down).

If you want to do a negative regex I would flip reverse it and change it to a positive match. Split it into two - add a marker into the source text then use that for your second regex. It's always better to look for something than not look for something - especially if that something DOESN'T exist!

Check out this article for more info >> http://blog.strictly-software.com/2009/11/debugging-regular-expressions.html