Monday 26 September 2011

Regular Expression functions for reformatting content for Wordpress

Using RegEx to reformat Wordpress content

If you read my blog you will know I am "Mr Automate" and I specialise in scraping, reformatting, spinning and automatically posting content to Wordpress, Twitter and other sites.

I use Wordpress for a couple of my blogs. Because the well-known plugin WP-o-Matic is no longer supported, and the only replacements are paid-for plugins like WP Robot, I have written my own custom version to handle my automated content imports.

Below are a couple of Regular Expressions that I find particularly helpful and that you might like to use.

I am using PHP, as that is the language Wordpress is written in, but the core Regular Expressions should be easily transferable between languages.


1. Removing IFRAMES

As YouTube has moved from its old OBJECT / EMBED nested tag soup to an IFRAME, many feeds containing videos will now contain IFRAMEs. These need to be handled correctly, as we all know that viruses can be transferred through rogue IFRAME content.

If you are importing content from a feed without parsing and checking it, just imagine what happens if the feed one day contains an IFRAME with a SRC pointing to a dodgy, virus-infected URL. If you don't sanitize the content you will be inserting this dangerous content into your own blog, where it can infect your visitors.

Although most modern browsers and virus checkers are good at picking up on potentially malicious URLs, it is always wise to run your own checks using a white list of "allowed" domains rather than a "black list" of banned ones.

This code is a basic example of looping through all the IFRAMEs on a page, checking their SRC attribute against a white list of acceptable domains and then removing any IFRAMEs that don't match.

// match all XHTML IFRAME tags (in any letter case) where $content holds your HTML
$videocount = preg_match_all("@(<iframe[\s\S]+?<\/iframe>)@i", $content, $videos, PREG_SET_ORDER);

foreach ($videos as $video) {

    $object = $video[1];

    // only allow certain domains e.g youtube, dailymotion, vimeo, cnn, fox, msnbc and the bbc
    // (http or https; note this only tests how the host name starts)
    if (!preg_match("@src=['\"]https?:\/\/(www\.)?(?:youtube|dailymotion|vimeo|cnn|fox|msnbc|bbc)\.@i", $object)) {

        // remove the IFRAME as we don't know where it is pointing to
        // str_replace matches literally, so $object needs no regex escaping
        $content = str_replace($object, "", $content);
    }

}
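Because the pattern above only tests how the host name starts, a SRC such as youtube.evil.example would still pass. A stricter check could compare the whole host against an exact list. This is only a sketch: the helper name iframe_is_allowed() and the host names in $allowedHosts are my own assumptions for illustration, so swap in the hosts you actually trust.

```php
<?php
// Hypothetical white list of exact host names (illustrative only)
$allowedHosts = array('www.youtube.com', 'youtube.com', 'player.vimeo.com', 'www.dailymotion.com');

// Hypothetical helper: true only if the IFRAME's SRC host is on the white list
function iframe_is_allowed($iframe, array $allowedHosts)
{
    // pull out the value of the SRC attribute
    if (!preg_match('@\ssrc=[\'"]([^\'"]+)[\'"]@i', $iframe, $m)) {
        return false; // no SRC at all, treat as unsafe
    }
    // parse_url() returns the host part, or null/false if there isn't one
    $host = parse_url($m[1], PHP_URL_HOST);
    return is_string($host) && in_array(strtolower($host), $allowedHosts, true);
}

$content = '<p>Video</p><iframe src="https://www.youtube.com/embed/abc"></iframe>'
         . '<iframe src="http://youtube.evil.example/x"></iframe>';

// keep white listed IFRAMEs, drop everything else
$content = preg_replace_callback(
    '@<iframe[\s\S]+?</iframe>@i',
    function ($m) use ($allowedHosts) {
        return iframe_is_allowed($m[0], $allowedHosts) ? $m[0] : '';
    },
    $content
);
```

With the sample input only the paragraph and the www.youtube.com IFRAME should survive, as the second host is not an exact match for anything on the list.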


2. Re-sizing images

I have already written an article about how you can use regular expressions and a callback function to reformat videos or images so that they fit your template and this follows that theme.

PHP, like Javascript and other decent languages, offers the ability to create anonymous (unnamed) functions on the fly to be used as callbacks in regular expression replace functions.

An anonymous function like this is often called a lambda function, and passing it as a parameter to another function makes it a callback. This is very useful in certain situations, such as parsing HTML where you want to remove illegal tags and attributes.
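As a small illustration of the tag-removal idea, here is a minimal sketch that passes an anonymous function to preg_replace_callback and drops any tag whose name is not on a list. The $allowed list and the sample $html are assumptions made up for the example, and a regular expression like this is no substitute for a proper HTML sanitizer.

```php
<?php
// Hypothetical list of tag names we are happy to keep
$allowed = array('p', 'a', 'img', 'strong', 'em');

$html = '<p>Hello <script>alert(1)</script><strong>world</strong></p>';

// the anonymous function is called once per matched tag;
// $m[0] is the whole tag and $m[1] is just the tag name
$clean = preg_replace_callback(
    '@</?([a-z][a-z0-9]*)\b[^>]*>@i',
    function ($m) use ($allowed) {
        return in_array(strtolower($m[1]), $allowed, true) ? $m[0] : '';
    },
    $html
);
// $clean is now: <p>Hello alert(1)<strong>world</strong></p>
```

Note that only the SCRIPT tags themselves are removed here; the text between them is left in place.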

The image example that follows is pretty niche to my own needs, but the idea is useful and can easily be adapted to your own reformatting needs.

Basically, I show little Google Adsense box adverts at the top right of all my articles. As these are floated to the right, I don't want any images within my articles that are narrower than my content area (minus the width of the advert) to also be aligned to the right, as this causes a horrible looking mess.

Therefore I use this code to ensure that the Wordpress class "alignleft" is always inserted on the first image in any article but only if the width of the image is less than 350px. It doesn't matter if the image isn't right at the top of the article (although they usually are) but it helps to ensure a smooth flow and layout.

The preg_replace_callback function in PHP, like preg_replace, is nice in that it takes an optional limit on the number of replacements, and in this case I only want to reformat the first image. You will see that the lambda function I use also contains other regular expressions, and there is nothing stopping you from writing complex code within these anonymous functions.

$content = preg_replace_callback("@(<img[^>]+?class=['\"])([\s\S]*?)(['\"][\s\S]*?>)@i",
    // anonymous function used as the callback (needs PHP 5.3 or later)
    function ($matches) {
        // $chk stays true unless the image has a numeric width of 350 or more
        $chk = true;
        if (preg_match("@width=['\"](\d+)['\"]@", $matches[0], $widthmatch)) {
            if (is_numeric($widthmatch[1]) && $widthmatch[1] >= 350) {
                $chk = false;
            }
        }
        if ($chk && !preg_match("@alignleft@", $matches[2])) {
            // small image without alignleft: drop alignright and add alignleft
            $res = $matches[1] . trim(preg_replace("@alignright@", "", $matches[2]) . " alignleft") . $matches[3];
        } else {
            // otherwise just make sure alignright is removed
            $res = $matches[1] . preg_replace("@alignright@", "", $matches[2]) . $matches[3];
        }
        return $res;
    }, $content, 1);



As you can see, the lambda anonymous function carries out the following logic:

  • It looks for the first image that has a class attribute on it and passes the match collection into the anonymous function.
  • It then checks the image for a width attribute using another regular expression.
  • If found it then checks that the value of the width attribute is numeric and if so it tests to see if the value has reached my limit of 350px, setting a flag if so.
  • I then use another regular expression to test for the existence of the standard Wordpress class "alignleft" and if it isn't found then I add it in, making sure to remove any "alignright" class if it exists.
  • The lambda function then returns the new reformatted content of the image, which is then used as the replacement value by the preg_replace_callback function.

Just two useful examples which can be expanded upon by anyone interested in Regular Expressions and AutoBlogging.
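As one small way to expand on the image example, the limit argument is easiest to see on a toy string. This is only a sketch: the sample $html and the simplified pattern are assumptions for illustration, and the optional fifth argument ($count) simply reports how many replacements were made.

```php
<?php
$html = '<img class="a" src="1.png"><img class="b" src="2.png">';

// add "alignleft" to the class list, but only on the first image
$out = preg_replace_callback(
    '@(<img[^>]+?class=[\'"])([^\'"]*)@i',
    function ($m) {
        return $m[1] . trim($m[2] . ' alignleft');
    },
    $html,
    1,      // limit: stop after the first match
    $count  // filled in with the number of replacements made
);
// $out is now:   <img class="a alignleft" src="1.png"><img class="b" src="2.png">
// $count is now: 1
```

Leaving the limit out (or passing -1) would reformat every image instead of just the first.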

Remember if you require any custom plugin work for Wordpress, including special reformatting functions, scraping code or content automation then contact me for a quote.

4 comments:

Mike B said...

I am really struggling with the regex in WP O Matic.

Regular expressions that validate fine just don't seem to produce the results expected. Is there any useful info on how it is different or how to use it?

Rob Reid said...

Read my article on how to allow WP-o-Matic to import IFRAMEs and OBJECT/EMBED tags, as Wordpress's kses.php file removes them unless you are logged in whilst posting, which you are not if you're using WP-o-Matic.

Fabrizio said...

Hello, congratulations for your blog. I'm going crazy with wp-o-matic. I can not use regular expressions.

I use custom template tool of wp-o-matic plugin like this:

{title}
{content}
< a href="{permalink}">READ HERE< /a >
{campaigntitle}

The {permalink} tag gives me a link in the post like this:

http://www.domainsite.com/deals/offer/object/7614342?CID=IT_RSS_217_389_189_22&utm_source=rss_217&utm_medium=rss_389&utm_campaign=rss_189&utm_content=rss_22

I need a regex to transform the link like this:

http://www.domainsite.com/deals/offer/object/7614342/.NOAtRf (.NOAtRf is my referrer link and it's constant)

Now I've tried to use regex tools, but I'm not sure I'm using them in the right way.

On the gskinner site I found an expression that transforms the link exactly as I want, but when I try to use it in wp-o-matic it does not work.

If I understand correctly, I insert this regular expression in the "ORIGIN" field:

/([a-z0-9_\-]{1,5}:\/\/)?(([a-z0-9_\-]{1,}):([a-z0-9_\-]{1,})\@)?((www\.)|([a-z0-9_\-]{1,}\.)+)?([a-z0-9_\-]{3,})(\.[a-z]{2,4})(\/([a-z0-9_\-]{1,}\/)+)?([a-z0-9_\-]{1,})?(\.[a-z]{2,})?(\?)?(((\&)?[a-z0-9_\-]{1,}(\=[a-z0-9_\-]{1,})?)+)?/gi

tick the Regex option and put this line in "Rewrite to":

$1$5$8$9$10$12/.NOAtRf.

When I try to submit the mod I receive this message:

There's an error with the supplied RegEx expression

I cannot understand where the mistake is.
Could you help me please?

Rob Reid said...

Hi Fabrizio

The regex should be very simple, and it looks like you are overcomplicating it. You know it's a URL and that it resides in an href, so why do all the extra checking to ensure it's a URL?

You know there can only be one unencoded ? in a URL so just check for that and split it in two replacing the 2nd group with whatever it is you want.

E.G

@(^https?.+?)(\?.*?$)@

$1/.NOAtRf
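As a rough sketch of that idea in plain PHP, assuming the pattern ends up being handed to preg_replace (I have not checked how WP-O-MATIC applies it internally), and with $url and $rewritten being names made up for the example:

```php
<?php
// the example link from the question above
$url = 'http://www.domainsite.com/deals/offer/object/7614342?CID=IT_RSS_217_389_189_22&utm_source=rss_217&utm_medium=rss_389&utm_campaign=rss_189&utm_content=rss_22';

// group 1 is everything before the first ?, group 2 is the query string,
// which gets thrown away and replaced with the constant referrer suffix
$rewritten = preg_replace('@(^https?.+?)(\?.*?$)@', '$1/.NOAtRf', $url);
// $rewritten is now: http://www.domainsite.com/deals/offer/object/7614342/.NOAtRf
```

A URL with no ? in it does not match at all and is returned unchanged.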

I use @ as delimiters in my own expressions but you can use whatever you want (if the plugin allows it). However, I have also found issues with WP-O-MATIC and the RegEx part of the article format system, in that it will save a regex but it then comes back double encoded (I think) or with parts missing.

To get round this you might need to edit the actual row in the database table to ensure it's correct.

By the way, the donate button is on every page of the site so that people can show their appreciation of my work, and it is only through donations that I can (or will) continue to give out free debugging help.

I am not having a go at you personally, as you haven't done this, but I have had enough of people emailing me directly with requests for help along with promises of donations and payments that never come, which is why I created the page of shame >> http://blog.strictly-software.com/2011/10/naming-and-shaming-of-programming.html

So far a few people have relented and come back to fulfil their promises.

Anyhow the tips I gave you should help you out.

Thanks

Rob