Monday 26 September 2011

Regular Expression functions for reformatting content for Wordpress

Using RegEx to reformat Wordpress content

If you read my blog you will know I am "Mr Automate" and I specialise in scraping, reformatting, spinning and automatically posting content to Wordpress, Twitter and other sites.

I use Wordpress for a couple of my blogs. Because the well-known plugin WP-o-Matic is no longer supported, and the only replacements are paid-for plugins like WP Robot, I have written my own custom version to handle my automated content imports.

Below are a couple of Regular Expressions which I find particularly helpful and which you might like to use.

I am using PHP as that is the language Wordpress is written in, but the core Regular Expressions should transfer easily to other languages.


1. Removing IFRAMES

As YouTube has moved from its old OBJECT / EMBED nested tag soup to an IFRAME, many feeds containing videos will now contain IFRAMEs. These need to be handled correctly, as we all know that viruses can be transferred through rogue IFRAME content.

If you are importing content from a feed without parsing and checking it, just imagine what happens if the feed one day contains an IFRAME with a SRC pointing to a dodgy, virus-infected URL. If you don't sanitize the content you will be inserting this dangerous content into your own blog, where it can infect your visitors.

Although most modern browsers and virus checkers are good at picking up on potentially malicious URLs, it is always wise to run your own checks using a white list of "allowed" domains rather than a "black list" of banned ones.

This code is a basic example of looping through all the IFRAMEs on a page, checking their SRC attribute against a white list of acceptable domains and then removing any IFRAMEs that don't match.

// match all IFRAME tags (case insensitive) where $content holds your HTML
$videocount = preg_match_all("@(<iframe[\s\S]+?<\/iframe>)@i", $content, $videos, PREG_SET_ORDER);

foreach ($videos as $video) {

    $object = $video[1];

    // only allow certain domains e.g youtube, dailymotion, vimeo, cnn, fox, msnbc and the bbc
    // over http or https (note this only checks how the host name starts)
    if (!preg_match("@src=['\"]https?:\/\/(?:www\.)?(?:youtube|dailymotion|vimeo|cnn|fox|msnbc|bbc)\.@i", $object)) {

        // remove the IFRAME as we don't know where it is pointing to
        // str_replace is used so the matched HTML never has to be escaped for use in a pattern
        $content = str_replace($object, "", $content);
    }
}
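
One limitation of a pattern like the one above is that it only checks how the host name begins, so a host such as youtube.evil.example would still get through. A stricter check might parse the URL and compare the whole host against the white list. This is only a minimal sketch: the function name is_allowed_iframe_src and the list of hosts are my own assumptions for illustration, not part of Wordpress.

```php
<?php
// Hypothetical helper: returns true when an IFRAME src points at an allowed host.
// The host list passed in is an assumption for illustration; adjust it to your own feeds.
function is_allowed_iframe_src($src, array $allowedHosts)
{
    // protocol-relative URLs (//www.youtube.com/...) have no scheme, so add one for parsing
    if (strpos($src, '//') === 0) {
        $src = 'http:' . $src;
    }

    $scheme = parse_url($src, PHP_URL_SCHEME);
    $host   = parse_url($src, PHP_URL_HOST);

    // reject anything that is not a plain http(s) URL with a host
    if (!is_string($scheme) || !is_string($host)) {
        return false;
    }
    if (!in_array(strtolower($scheme), array('http', 'https'), true)) {
        return false;
    }

    $host = strtolower($host);
    foreach ($allowedHosts as $allowed) {
        // accept the domain itself or any sub-domain of it
        $suffix = '.' . $allowed;
        if ($host === $allowed || substr($host, -strlen($suffix)) === $suffix) {
            return true;
        }
    }
    return false;
}

$allowedHosts = array('youtube.com', 'vimeo.com', 'dailymotion.com');

var_dump(is_allowed_iframe_src('http://www.youtube.com/embed/abc123', $allowedHosts));      // bool(true)
var_dump(is_allowed_iframe_src('http://youtube.evil.example/embed/abc123', $allowedHosts)); // bool(false)
```

You would still need to pull the SRC value out of each matched IFRAME first, for example with another small regular expression, before passing it to a function like this.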


2. Re-sizing images

I have already written an article about how you can use regular expressions and a callback function to reformat videos or images so that they fit your template and this follows that theme.

PHP, like Javascript and other decent languages, offers the ability to create anonymous (unnamed) functions on the fly to be used as callbacks in regular expression replace functions.

An anonymous function like this is often called a lambda function, and when you pass it as a parameter to another function it acts as a callback. This is very useful in certain situations such as parsing HTML where you want to remove illegal tags and attributes.
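
As a quick illustration of the idea, here is a minimal sketch of an anonymous function used as a regular expression callback. The sample HTML and the choice of tags are made up purely for the example, and the closure syntax assumes PHP 5.3 or later.

```php
<?php
// Illustrative input only
$html = 'Visit <b>London</b> and <b>Paris</b>';

// The anonymous function receives the match array for each <b>...</b> found
// and returns the text that should replace it.
$result = preg_replace_callback(
    '@<b>(.*?)</b>@i',
    function ($matches) {
        return '<strong>' . strtoupper($matches[1]) . '</strong>';
    },
    $html
);

echo $result;
// Visit <strong>LONDON</strong> and <strong>PARIS</strong>
```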

This example is pretty niche to my own needs but the idea is useful and can easily be converted to your own reformatting needs.

Basically I show little Google Adsense box adverts at the top right of all my articles. As these are floated to the right, I don't want any images within my articles that are narrower than my content area (minus the width of the advert) to also be aligned to the right, as this causes a horrible looking mess.

Therefore I use this code to ensure that the Wordpress class "alignleft" is always inserted on the first image in any article but only if the width of the image is less than 350px. It doesn't matter if the image isn't right at the top of the article (although they usually are) but it helps to ensure a smooth flow and layout.

The preg_replace and preg_replace_callback functions in PHP are nice in that you can limit the number of replacements if you so wish, and in this case I only want to reformat the first image. You will see that the lambda function I use also contains other regular expressions, and there is nothing stopping you from writing complex code within these anonymous functions.

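// create_function() was removed in PHP 8.0, so an anonymous function (PHP 5.3+) is used as the callback
$content = preg_replace_callback("@(<img[^>]+?class=['\"])([\s\S]*?)(['\"][\s\S]*?>)@i",
    function ($matches) {
        // look for a width attribute on the matched image
        preg_match("@width=['\"](\d+)['\"]@", $matches[0], $widthmatch);
        $chk = true;
        if ($widthmatch) {
            if (is_numeric($widthmatch[1])) {
                // images 350px or wider should not be aligned left
                if ($widthmatch[1] >= 350) {
                    $chk = false;
                }
            }
        }
        if ($chk && !preg_match("@alignleft@", $matches[2])) {
            // strip any alignright class and add alignleft
            $res = $matches[1] . trim(preg_replace("@alignright@", "", $matches[2]) . " alignleft") . $matches[3];
        } else {
            // leave the alignment alone apart from stripping alignright
            $res = $matches[1] . preg_replace("@alignright@", "", $matches[2]) . $matches[3];
        }
        return $res;
    }, $content, 1); // the final 1 limits the replacement to the first match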
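
To see the limit argument in isolation, here is a minimal sketch with a much simpler callback. The sample HTML is made up for the example, and the optional fifth argument ($count) is only there to show how many replacements were made.

```php
<?php
// Illustrative input only: two images, both with a class attribute
$html = '<img class="a" src="one.jpg"><img class="b" src="two.jpg">';

$result = preg_replace_callback(
    '@(<img[^>]+?class=[\'"])([^\'"]*)@i',
    function ($matches) {
        // append alignleft to whatever classes are already there
        return $matches[1] . trim($matches[2] . ' alignleft');
    },
    $html,
    1,        // limit: stop after the first match
    $count    // filled with the number of replacements made
);

echo $result . "\n"; // <img class="a alignleft" src="one.jpg"><img class="b" src="two.jpg">
echo $count . "\n";  // 1
```

Only the first image is touched. With the limit left at its default of -1, every matching image would be rewritten.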



The alignleft code and its anonymous lambda function carry out the following logic:

  • The outer regular expression looks for the first image that has a class attribute on it and passes the match collection into the anonymous function.
  • The function then checks the image for a width attribute using another regular expression.
  • If one is found, it checks that the value of the width attribute is numeric and, if so, tests whether the value is 350px or more, switching off the flag that allows the class to be added if it is.
  • It then uses another regular expression to test for the existence of the standard Wordpress class "alignleft" and, if the flag is still set and the class isn't found, adds it in. Any "alignright" class is removed either way.
  • The lambda function then returns the reformatted image tag, which is used as the replacement value by the preg_replace_callback function.

These are just two useful examples which can be expanded upon by anyone interested in Regular Expressions and AutoBlogging.

Remember if you require any custom plugin work for Wordpress, including special reformatting functions, scraping code or content automation then contact me for a quote.