Friday 19 March 2010

Wordpress, WP-O-Matic and custom HTML tags

Automating Wordpress posts with WP-O-Matic

If you use Wordpress you should really look into the great WP-O-Matic plugin, which lets you automate postings by importing content from RSS or XML feeds. You can set up schedules to import at regular times, or import on demand from the admin area.

One issue, however, which I have just spent ages getting to the bottom of, is the handling of HTML such as OBJECT and EMBED tags. A lot of content feeds contain multimedia files nowadays, and you want this content to be imported directly into your site. The problem with WP-O-Matic and Wordpress in their default mode is that you will only get this content imported when you run the import from the admin menu, or when you call the page that the CRONJOB uses directly whilst logged in as an admin or publisher.

If you request the page the cronjob calls, e.g. /wp-content/plugins/wp-o-matic/cron.php?code=XXXXX, whilst logged out, or allow the job to run by itself, you will find that certain HTML tags and attributes are removed, including OBJECT and EMBED tags.

The stripping is there for security, to prevent XSS hacks, and it's possible to get round it if you need to. This took me quite a long time to get to the bottom of as I am very new to Wordpress, but I managed it in the end.

WP-O-Matic makes use of a library called SimplePie, which is a tool for extracting content from XML and RSS feeds. It has a number of settings for stripping out HTML, and the behaviour depends on how the feed import is called.

When running the import from the admin menu, a setting called set_stupidly_fast is set to true, which bypasses all the normal formatting and HTML sanitising. When the CRONJOB runs, this is set to false, so the reformatting is carried out. In reality you probably want the reformatting, as it does much more than just strip HTML: it also removes excess DIVs and comment tags and orders the results by date.

If you don't care about this formatting, find the fetchFeed method in the \wp-content\plugins\wp-o-matic\wpomatic.php file and force the setting to be true all of the time, so the sanitising is always bypassed:

$feed->set_stupidly_fast(true);

If you do want to keep the benefits of the reformatting (that is, leave set_stupidly_fast as false) but allow OBJECT and EMBED tags, you can override the list of tags that SimplePie removes by calling its strip_htmltags method. You can do this in the same fetchFeed method in the wpomatic.php file, just before the init method is called, by passing in an array of the tags that you do want SimplePie to remove from the extracted content.

// Strip only these tags; object, embed and param are left out of the list so they are kept
$feed->strip_htmltags(array('base', 'blink', 'body', 'doctype', 'font', 'form', 'frame', 'frameset', 'html', 'iframe', 'input', 'marquee', 'meta', 'noscript', 'script', 'style'));
$feed->init();
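
Rather than typing the strip list out by hand, it can be derived by removing the tags you want to keep from SimplePie's own list. The following is only a sketch: the default list shown is what I believe SimplePie 1.x ships with, so check it against the simplepie.class.php in your install, and the variable names are made up for this example.

```php
<?php
// ASSUMPTION: this mirrors the default strip list in SimplePie 1.x.
// Verify it against your own copy of simplepie.class.php before relying on it.
$defaultStripTags = array('base', 'blink', 'body', 'doctype', 'embed', 'font',
    'form', 'frame', 'frameset', 'html', 'iframe', 'input', 'marquee', 'meta',
    'noscript', 'object', 'param', 'script', 'style');

// Tags that should survive the import (hypothetical variable name).
$keepTags = array('object', 'embed', 'param');

// array_diff() preserves the original keys, so array_values() reindexes from 0.
$stripTags = array_values(array_diff($defaultStripTags, $keepTags));

// Inside fetchFeed this list would then be handed to SimplePie before init():
// $feed->strip_htmltags($stripTags);
```

The result is the same sixteen-tag list used in the strip_htmltags call above, but it makes it obvious which tags have been deliberately let through.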

So that takes care of the WP-O-Matic class but unfortunately we are not done yet, as Wordpress runs its own sanitisation on posts in a file called kses.php found in the wp-includes folder. If you are logged in as admin or a publisher you won't get this problem, but your CRONJOB will run into it, so you have two choices.

1. Comment out the hook that runs the kses sanitisation on post content. This isn't recommended for security reasons, but if you want to do it, comment out the content_save_pre line in the kses_init_filters function, e.g.

function kses_init_filters() {
	// Normal filtering.
	add_filter('pre_comment_content', 'wp_filter_kses');
	add_filter('title_save_pre', 'wp_filter_kses');

	// Post filtering
	// comment out the hook that sanitises the post content
	//add_filter('content_save_pre', 'wp_filter_post_kses');
	add_filter('excerpt_save_pre', 'wp_filter_post_kses');
	add_filter('content_filtered_save_pre', 'wp_filter_post_kses');
}

Commenting out this line means no sanitisation is carried out on your post content, whoever or whatever does the posting. Obviously this is bad for security: if a feed you import one day contained an inline script or an OBJECT that loaded a virus, you could be infecting all your visitors.

2. The other, safer way is to add the tags and attributes that you want to allow to the list of acceptable HTML that the kses.php file uses when sanitising input. At the top of the kses file is an array called $allowedposttags, which contains a list of HTML elements and their allowed attributes.

If you wanted to allow the playing of videos and audio through OBJECT and EMBED tags then the following section of code can just be inserted into the array.

'object' => array(
	'id' => array(),
	'classid' => array(),
	'data' => array(),
	'type' => array(),
	'codebase' => array(),
	'align' => array(),
	'width' => array(),
	'height' => array()
),
'param' => array(
	'name' => array(),
	'value' => array()
),
'embed' => array(
	'id' => array(),
	'type' => array(),
	'width' => array(),
	'height' => array(),
	'src' => array(),
	'bgcolor' => array(),
	'wmode' => array(),
	'quality' => array(),
	'allowscriptaccess' => array(),
	'allowfullscreen' => array(),
	'allownetworking' => array(),
	'flashvars' => array()
),




Obviously you can add whichever tags and attributes you like. In my opinion this is the preferred way of getting round the problem, as you are still whitelisting content rather than allowing anything.
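
If you would rather not paste the entries into the array by hand, the same whitelist additions can be expressed as a small function that merges them into an existing list. This is a minimal sketch under some assumptions: allow_media_tags is a made-up helper name, not a Wordpress function, and the short $allowedposttags array below is only a stand-in for the much longer one in kses.php.

```php
<?php
// HYPOTHETICAL helper (not part of Wordpress): returns a kses-style whitelist
// with the OBJECT, PARAM and EMBED entries from this article merged in.
function allow_media_tags(array $allowed)
{
    $media = array(
        // array_fill_keys() builds 'attribute' => array() pairs, the shape kses expects
        'object' => array_fill_keys(array('id', 'classid', 'data', 'type',
            'codebase', 'align', 'width', 'height'), array()),
        'param'  => array_fill_keys(array('name', 'value'), array()),
        'embed'  => array_fill_keys(array('id', 'type', 'width', 'height', 'src',
            'bgcolor', 'wmode', 'quality', 'allowscriptaccess', 'allowfullscreen',
            'allownetworking', 'flashvars'), array()),
    );

    // With string keys, array_merge() keeps existing tags and replaces any
    // entry that has the same tag name as one in $media.
    return array_merge($allowed, $media);
}

// Stand-in for the real array defined at the top of kses.php.
$allowedposttags = array('p' => array('align' => array()), 'em' => array());
$allowedposttags = allow_media_tags($allowedposttags);
```

Keeping the additions in one function makes them easy to re-apply after a Wordpress upgrade overwrites kses.php. Applying it to the global $allowedposttags from a plugin or theme file, instead of editing the core file, may also work, but whether it runs early enough for the cron import is something you would need to test on your own install.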

It took me quite a while to get to the bottom of this problem but I now have all my automated feeds running correctly importing media content into my blog. Hopefully this article will help some people out.

5 comments:

HypH Life said...

where is the kses file?

where is the wp includes file? help thanks

Rob Reid said...

as the article says "as Wordpress runs its own sanitisation on posts in a file called kses.php found in the wp-includes folder."

So wherever you have put your wp-includes folder for your website.

Thanks

Aleksandar said...

Thank you a lot for this. I had started to think there was no solution for my problem, which was part of the JavaScript I included in a custom post being stripped out.
After reading your article I found the tag list in wp-o-matic/inc/simplepie/simplepie.class.php, where I just deleted "script" and the other tags and attributes I want to keep.

Thank you again

Gautam said...

Sir, I want to fetch category/tags from the feed using wp-o-matic. Please help

Rob Reid said...

Well, you either need to write your own version of wp-o-matic that uses regular expressions to parse the incoming post content, look for words that you then link to categories, and save them in the post_meta, or you need to find a plugin that will extract relevant tags from the post and save them in a similar way.

My Strictly AutoTag plugin was designed to find relevant tags in content and save them. I use it with my version of wp-o-matic to save relevant tags whenever I parse a feed, and it works well for ENGLISH names, companies, institutions and other ENGLISH LANGUAGE wording.

You can buy the premium version here > http://www.strictly-software.com/plugins/strictly-auto-tags