Monday, 19 April 2010

Banning Bad Bots with Mod Rewrite

Banning Scrapers and Other Bad Bots using Mod Rewrite

There are many brilliant sites out there dedicated to the never-ending war on bad bots, and I thought I would add my own contribution to the lists of regular expressions used for banning spammers, scrapers and other malicious bots with Mod Rewrite.

As with all security measures a sys admin takes to lock down and protect a site or server, a layered approach is best. You should utilise as many different methods as possible so that an error, misconfiguration or breach of one ring in your defence does not mean your whole system is compromised.

The major problem with using the htaccess file to block bots by useragent or referrer is that any hacker worth the name would easily get round the rules by changing their agent string to a known browser or by hiding the referrer header.

However, in spite of this obvious fact, it still seems that many bots currently doing the rounds scraping and exploiting are not bothering to do this, so it's still worth doing. I run hundreds of major sites and have my own logger system which automatically scans traffic and blocks visitors who I catch hacking, spamming, scraping and heavy hitting. Therefore I regularly have access to the details of badly behaving bots, and although a large percentage of them hide behind an IE or Firefox useragent, many still use identifying agent strings that can be matched and banned.

The following directives are taken from one of my own personal sites and therefore I am cracking down on all forms of bandwidth theft. Anyone using an HTTP library like CURL, Snoopy, WinHTTP and not bothering to change their useragent will get blocked. If you don't want to do this then don't just copy and paste the rules.
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /

# Block blank or very short user-agents. If they cannot be bothered to tell me who they are or provide gibberish then they are not welcome!
RewriteCond %{HTTP_USER_AGENT} ^(?:-?|[a-z0-9\-\_]{1,10})$ [NC]
RewriteRule .* - [F,L]

# Block a number of libraries, email harvesters, spambots, hackbots and known bad bots

# Note that Apache does not allow comments at the end of a directive, so each comment must sit on its own line.

# HTTP libraries. I know these libraries are useful but if the user cannot be bothered to change the agent they are worth blocking, plus I only want people
# visiting my site in their browser. Automated requests with CURL usually means someone is being naughty so tough!
RewriteCond %{HTTP_USER_AGENT} (?:ColdFusion|curl|HTTPClient|Java|libwww|LWP|Nutch|PECL|PHP|POE|Python|Snoopy|urllib|Wget|WinHttp) [NC,OR]
# hackbots or SQL injection detector tools being misused!
RewriteCond %{HTTP_USER_AGENT} (?:ati2qs|cz32ts|indy|library|linkcheck|Morfeus|NV32ts|Pangolin|Paros|ripper|scanner) [NC,OR]
# offline downloaders and image grabbers
RewriteCond %{HTTP_USER_AGENT} (?:AcoiRobot|alligator|auto|bandit|capture|collector|copier|disco|devil|downloader|fetch|flickbot|hook|igetter|jetcar|leach|mole|miner|mirror|race|reaper|sauger|sucker|site|snake|stripper|vampire|weasel|whacker|xenu|zeus|zip) [NC]
RewriteRule .* - [F,L]

# fake referrers and known email harvesters which I send off to a honeytrap full of fake emails

# spambots and email harvesters
RewriteCond %{HTTP_USER_AGENT} (?:atomic|collect|e?mail|magnet|reaper|siphon|sweeper|harvest|(microsoft\surl\scontrol)|wolf) [NC,OR]
RewriteCond %{HTTP_REFERER} ^[^?]*(?:iaea|\.ideography|addresses)(?:\.co\.uk|\.org|\.com) [NC]
# redirect to a honeypot
RewriteRule ^.*$ http://english-61925045732.spampoison.com [R,L]


# copyright violation and brand monitoring bots
RewriteCond %{REMOTE_ADDR} ^12\.148\.196\.(12[8-9]|1[3-9][0-9]|2[0-4][0-9]|25[0-5])$ [OR]
RewriteCond %{REMOTE_ADDR} ^12\.148\.209\.(19[2-9]|2[0-4][0-9]|25[0-5])$ [OR]
RewriteCond %{REMOTE_ADDR} ^63\.148\.99\.2(2[4-9]|[3-4][0-9]|5[0-5])$ [OR]
RewriteCond %{REMOTE_ADDR} ^64\.140\.49\.6([6-9])$ [OR]
RewriteCond %{HTTP_USER_AGENT} (?:NPBot|TurnitinBot) [NC]
RewriteRule .* - [F,L]


# Image hotlinking blocker - replace any hotlinked images with a banner advert for the latest product I want free advertising for!
RewriteCond %{HTTP_REFERER} !^$
# change to your own site domain!
RewriteCond %{HTTP_REFERER} !^http://(www\.)?##SITEDOMAIN##\.com/.*$ [NC]
# ensure image indexers don't get blocked
RewriteCond %{HTTP_REFERER} !^https?://(?:images\.|www\.|cc\.)?(cache|mail|live|google|googlebot|yahoo|msn|ask|picsearch|alexa).*$ [NC]
# ensure email clients don't get blocked
RewriteCond %{HTTP_REFERER} !^https?://.*(webmail|e?mail|live|inbox|outbox|junk|sent).*$ [NC]
# free advertising for me
RewriteRule .*\.(gif|jpe?g|png)$ http://www.some-banner-advert.com/myadvert-468x60.png [NC,R,L]


# Security Rules - these rules help protect your site from hacks such as sql injection and XSS

# no-one should be running these requests against my site!
RewriteCond %{REQUEST_METHOD} ^(TRACE|TRACK)
RewriteRule .* - [F]

# My basic rules for catching SQL Injection - covers the majority of the automated attacks currently doing the rounds

# SQL Injection and XSS hacks - most hackbots will malform links and then log 500 errors for details. I use a special hack.php page to log details of the hacker and ban them by IP in future.
# A RewriteRule pattern only ever matches the URL path, never the query string, so the attack signatures must be tested with RewriteCond %{QUERY_STRING}.
# Works with the following extensions .php .asp .aspx .jsp so change/remove accordingly, and change the name of the hack.php page or replace the substitution with - [F,L]
RewriteCond %{QUERY_STRING} DECLARE[^a-z]+\@\w+[^a-z]+N?VARCHAR\((?:\d{1,4}|max)\) [NC,OR]
RewriteCond %{QUERY_STRING} sys.?(?:objects|columns|tables) [NC,OR]
RewriteCond %{QUERY_STRING} ;EXEC\(\@\w+\);? [NC,OR]
# XSS hacks
RewriteCond %{QUERY_STRING} (?:%3C|<)/?script(?:%3E|>) [NC]
# the original query string is automatically appended, so hack.php can log it
RewriteRule \.(?:aspx?|php|jsp)$ /hack.php [NC,L]

# Bad requests which look like attacks (these have all been seen in real attacks)
RewriteRule (^|/)(owssvr|strmver|Auth_data|redirect\.adp|MSOffice|DCShop|msadc|winnt|system32|script|autoexec|formmail\.pl|_mem_bin|NULL\.) /hack.php [NC,L]

</IfModule>
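Before deploying rules like these it pays to test the agent patterns against known good and bad strings. Here is a quick sanity check, sketched in Python using a condensed version of the HTTP-library pattern from the conditions above:

```python
import re

# Condensed copy of the HTTP-library alternation from the RewriteCond above
LIBRARY_PATTERN = re.compile(
    r"ColdFusion|curl|HTTPClient|Java|libwww|LWP|Nutch|PECL|PHP|POE|"
    r"Python|Snoopy|urllib|Wget|WinHttp",
    re.IGNORECASE,
)

def is_blocked(user_agent):
    """Return True if the agent string matches the library blacklist."""
    return bool(LIBRARY_PATTERN.search(user_agent))

print(is_blocked("curl/7.19.7"))                        # True
print(is_blocked("Mozilla/5.0 (Windows NT 5.1)"))       # False
```

Running a list of agent strings from your own logs through a check like this shows immediately whether a new pattern is too greedy before it starts blocking real browsers.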


Security, Free Advertising and Less Bandwidth

As you can see I have condensed the rules into sections to keep the file manageable. The major aims of the rules are to
  1. Improve security by blocking hackbots before they can reach the site.
  2. Reduce bandwidth by blocking the majority of automated requests that are not from known indexers such as Googlebot or Yahoo.
  3. Piss those people off who think they can treat my bandwidth as a drunk treats a wedding with a free bar. Known email scrapers go off to a honeypot full of fake email addresses and hotlinkers help me advertise my latest site or product by displaying banners for me.

Is it a pointless task?

A lot of the bad bot agent strings are well known and there are many more which could be added if you so wish; however, trying to keep a static file maintained with the latest bad boys is a pointless and thankless task. The best way is to automate the tracking of bad bots: serve the robots.txt file from a dynamic script so each request can be logged to a file or database, use the robots.txt to DENY access to a special directory or file, and then place hidden links to it on your site. Any agents who ignore the robots.txt file and crawl those links can then be logged and blocked.
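As a concrete sketch of the trap, the robots.txt only needs to deny one directory (the /bot-trap/ name here is a made-up example):

```
# robots.txt - served by a dynamic script so every request can be logged
User-agent: *
Disallow: /bot-trap/
```

You then hide a link to that directory somewhere in your pages, for example `<a href="/bot-trap/" style="display:none">&nbsp;</a>`. Well behaved crawlers read the robots.txt and never follow it; anything that does request the trap URL has ignored the rules and can be logged and banned.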

I also utilise my own database driven logger system that constantly analyses the traffic looking for bad users who can then be banned. I have an SQL function that checks for hack attempts and spam by pattern matching the stored querystring, as well as looking for heavy hitters (agents/IPs requesting lots of pages in a short time period). This helps me prevent DDOS attacks as well as scrapers who think they can take 1000+ jobs without saying thank you!
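The heavy-hitter check boils down to counting requests per IP inside a sliding time window. Here is a minimal Python sketch of the idea (the window length and threshold are made-up illustrative values):

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60    # look-back period (illustrative value)
MAX_REQUESTS = 100     # requests allowed per IP inside the window

hits = defaultdict(deque)  # ip -> timestamps of recent requests

def is_heavy_hitter(ip, now):
    """Record a hit for this IP and report whether it breaches the limit."""
    q = hits[ip]
    q.append(now)
    # drop timestamps that have fallen out of the window
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    return len(q) > MAX_REQUESTS
```

In production you would run the same counting inside the database or log-analysis job, but the logic is identical: anyone who exceeds the threshold inside the window is a candidate for a ban.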

A message for the other side

I know this probably defeats the point of me posting my htaccess rules, but as well as defending my own systems from attack I also make use of libraries such as CURL in my own development to make remote HTTP requests. Therefore I can see the issues involved in automated crawling from both sides, and I know all the tricks sys admins use to block as well as the tricks scrapers use to bypass them.

There are many legitimate reasons why you might need to crawl or scrape, but you should remember that what goes around comes around. Most developers have at least one site of their own, so you should know that bandwidth is not free and stealing someone else's will lead to bad karma. The web is full of lists containing bad agents and IP addresses obtained from log files or honey-traps, so you risk more than just being banned from the site you intend to crawl when you decide to test your new scraper out.

Remember, if you hit a site so hard and fast it breaks (which is extremely possible in this day and age of cheaply hosted Joomla sites designed by four year olds) then the sys admin will start analysing log files looking for culprits. A quiet site that runs smoothly usually means the owner is happy and not looking for bots to ban.

  • Rather than making multiple HTTP requests cache your content locally if possible.
  • Alternate requests between domains so that you don't hit a site too hard.
  • Put random delays in between requests.
  • Obey Robots.txt and don't risk getting caught in a honeypot or infinite loop by visiting pages you shouldn't be going to.
  • Never use the default agent as set in your chosen library.
  • Don't risk getting your server blacklisted by your crawling so always use a proxy.
  • Only scrape the bare minimum that you need to do the job. If you are only checking the header then don't return the body as well.


Tuesday, 30 March 2010

Great new site - From The Stables

Daily Insights straight from the stables

If you are into horse racing, like a bet, or are just interested in getting great information straight from the trainer's mouth about upcoming races, then I suggest checking out my new site www.fromthestables.com. We have teamed up with some of the best known UK trainers to provide a unique high quality service available to members on a daily basis.

Each day our top trainers will provide their expert information on their horses running that day. This isn't a tipster site and we won't pretend to guarantee winners and losers; however, we do promise to provide quality info straight from the stables every racing day. We have only been running for a week and have already provided our members with great information that has led to a number of winners and each-way placed horses.

We are currently offering half price membership of only £25 a month, and on top of that we are offering new users a free seven day trial so that they can experience the quality information that our trainers provide for themselves. Not only does membership guarantee great trainer insight into horses running that day, we also offer a variety of deals and special offers, including discounted race course tickets, champagne tours of our trainers' stables and free bets from our sponsors. To top it off, we also plan to buy a racehorse later this year which will be part owned by our subscribers.

If you are interested in utilising this valuable resource for yourself, or know a friend, family member or colleague who would be, then why not take advantage of our seven day free trial? You will need to set up a PayPal subscription before being granted entry to the site, but no money will be taken from your account until the seven day trial is up, and you can cancel at any time before that date. If you are happy with the service then at the end of the trial the monthly membership fee, currently at a 50% discount of only £25, will be taken from your PayPal account and you will continue to enjoy all the benefits of the site.

To take advantage of our trial offer please visit the following link:

www.fromthestables.com

Monday, 29 March 2010

My Hundredth Article

An overview of the last 102 articles

I really can't believe that I have managed to write 102 articles for this blog in the last year and a bit. When I first started the blog I only imagined writing the odd bit here and there and saw the site purely as a place to make public some of my more useful coding tips. I never imagined that I could output this amount of content by myself.

A hundred articles have come and gone pretty fast, and as with all magazines, TV shows and bloggers stuck for an idea, I thought I would celebrate my 102nd article by reviewing my work so far.

Recovering from an SQL Injection Attack

This was the article that started it all and it's one that still gets read quite a bit. It's a very detailed look at how to recover an infected system from an SQL Injection Attack and includes numerous ways of avoiding future attacks as well as quick sticking plasters, security tips and methods for cleaning up an infected database.

Linked to this article is one of my most downloaded SQL scripts which helps identify injected strings inside a database as well as removing them. This article was written after a large site at work was hacked and I was tasked with cleaning up the mess so it all comes from experience.

Performance Tuning Tips

I have written quite a few articles on performance tuning systems, both client and server side, and some of my earliest articles were top tips for tuning SQL databases and ASP Classic sites. As well as general tips which can be applied to any system, I have also delved into more detail regarding specific SQL queries for tuning SQL 2005 databases.

Regarding network issues, I also wrote an extensive how-to guide on troubleshooting your PC and Internet connection, which covered everything from TCP/IP settings to the best tools for cleaning up your system and diagnosing issues. On top of that I collated a number of tweaks and configuration options which can speed up Firefox.


Dealing with Hackers, Spammers and Bad Bots

My job means that I constantly have to deal with users trying to bring my systems down, and I have spent considerable time developing custom solutions to log, identify and automatically ban users who try to cause harm to my sites. Over the last year I have written about SQL Denial of Service attacks, which involve users making use of web based search forms and long running queries to bring a database driven system to a halt. I have also investigated new hacking techniques such as the two stage injection technique and the case insensitive technique, methods of client side security and why it's almost pointless, as well as detailing bad bots such as Job Rapists and the 4 rules I employ when dealing with them.

I have also detailed the various methods of using CAPTCHAs, as well as ways to prevent bots from stealing your content and bandwidth through hotlinking by using ISAPI rewriting rules.

Issues with Browsers and Add-Ons

I have also tried to provide up-to-date information on the latest issues with browsers and new version releases, and have covered problems and bugs related to major upgrades of Firefox, Chrome, Opera and IE. When IE 8 was released I was one of the first bloggers to detail the various browser and document modes as well as techniques for identifying them through Javascript.

I have also reported on current browser usage by revealing statistics taken from my network of 200+ large systems with regular updates every few months. This culminated in my Browser survey which I carried out over Christmas which looked at the browsers and add-ons that web developers themselves used.


Scripts, Tools, Downloads and Free Code

I have created a number of online tools, add-ons and scripts for download over the last year that range from C# to PHP and Javascript.

Search Engine Optimisation

As well as writing about coding I also run a number of my own sites and have had to learn SEO the hard way. I have written about my experiences, and the successful techniques I found that worked, in a couple of articles printed on the blog.

So there you go, an overview of the last year or so of Strictly-Software's technical blog. Hopefully you have found the site a good resource and maybe even used one or two of the scripts I have posted. Let me know whether you have enjoyed the blog or not.

Sunday, 28 March 2010

Turn one Google Mail account into many

Multiplying your GMail Account to create multiple email addresses

I just came across a wonderful tip that allowed me to bypass the rule on Twitter that prevents you from using an email address for more than one account. You may have tried this yourself and found that if you try to create a new Twitter account with an email address already assigned to another Twitter account you won't be able to.

Obviously there are good reasons for this, e.g. to prevent spam bots and auto sign-up tools. However, if you don't have multiple email accounts at the ready and don't fancy setting up a new one just to get round this problem, then the answer lies in GMail.

Apparently it's possible to add dots to your GMail address to make a unique email address in the eyes of Twitter or anyone else who sees the address. However GMail will treat it as one account no matter how many variations you use.

For example using an address like strictlysoftware@gmail.com

I could create the following aliases that would all be forwarding addresses for my underlying account:


strictly.software@gmail.com

strictly.soft.ware@gmail.com

s.trictly.softwa.re@gmail.com

So in the eyes of Twitter or any other website that requires an email address when signing up they are all unique addresses. However any email sent to these addresses would all appear in strictlysoftware@gmail.com.
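The equivalence is easy to express in code. This hypothetical helper (a Python sketch, not anything Google publishes) canonicalises a Gmail address by stripping the dots from the local part, which is what Google effectively does when routing mail:

```python
def canonical_gmail(address):
    """Collapse a dotted Gmail alias back to its underlying account."""
    local, domain = address.lower().rsplit("@", 1)
    if domain == "gmail.com":
        local = local.replace(".", "")
    return local + "@" + domain

aliases = [
    "strictly.software@gmail.com",
    "strictly.soft.ware@gmail.com",
    "s.trictly.softwa.re@gmail.com",
]
# all three collapse to the same mailbox
print({canonical_gmail(a) for a in aliases})  # {'strictlysoftware@gmail.com'}
```

A site that wanted to prevent this trick could run the same normalisation before checking whether an email address is already registered.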

This is a neat trick that helped me get round Twitter's sign-up form, and I will certainly be using it in future.


Monday, 22 March 2010

Write your own Proxy Checker Tool

Creating your own Proxy Checker Tool

Finding good reliable proxies is a hard task, and although many sites offer free lists, the majority of the proxies on them will be out of date or not working by the time you get round to testing them. If you are happy with using web proxies then there are plenty about, but they don't help if you want to access content that utilises complex AJAX to deliver movies or other country-specific goodies that are being blocked.

Therefore it's a good idea to have a proxy checker tool that can be run on demand to find working proxies. There are many tools you can buy or download for free, such as Proxyway, that will do this, but the problem with any executable is that you don't know what the code is doing behind the scenes.

Any executable that you run on your PC that is contacting multiple servers in Russia and China should be used with caution as these are well known hotspots for hackers and utilising hidden malware inside otherwise useful tools is a well known tactic.

Therefore it's always a good idea, if you can, to write your own code. I have just knocked up a Proxy Checker tool for my own use that not only finds useful proxy lists at the click of a button but also checks the proxies within those lists to see if they are working.

1. The code is written in PHP and uses an HTML form and multiple AJAX requests to load the results into the DOM. This makes the script very usable as you're not waiting around to see the results, and the page is updated as each result comes in, giving it the look and feel of a real time app.

2. If you don't have access to a webserver to run PHP from then install WAMP Server on your PC. This will allow you to run scripts from your localhost plus you can enable all the extensions that a webserver may not let you use such as CURL or FOPEN. I do like running "personal apps" from my localhost as it means I get all the flexibility of a webserver plus no-one else can use the code!

3. The HTML page contains a form with a textarea. Paste in the URLs of the proxy list sites you want to scrape. If you don't have a list of proxy sites then you can use the "Find Proxy List" button to scan the web for some. This is not an extensive search but it will return some lists. Remember, good quality proxy lists are hard to come by, and a quiet proxy is a quick proxy, so if you find a good reliable proxy server keep it quiet and don't destroy it!

4. On hitting the "Check Proxies" button the form loops through the proxy list URLs, making AJAX calls to a helper script that scrapes any proxy details it can find. I am using a basic scraper function utilising file_get_contents, but you can use CURL or fsockopen if you wish, or, as I do on other sites, a custom function that tries all three in case the server has blocked one of them or your CURL settings don't allow Proxy Tunnelling.
// A very simple function to extract HTTP content remotely. Requires fopen support on server.
// A better function would check for CURL use and fallback on fsockopen
function getHttpContent($url, $useragent="", $timeout=10, $maxredirs=3, $method="GET", $postdata="", $proxy="") {

    $urlinfo = null;

    // simple test for a valid URL
    if(!preg_match("/^https?:\/\/.+/", $url)) return $urlinfo;

    $headers = "";
    $status  = "";

    // create http array
    $http = array(
        'method'        => $method,
        'user_agent'    => $useragent,
        'timeout'       => $timeout,
        'max_redirects' => $maxredirs
    );

    // add proxy details if required
    if(!empty($proxy)){
        $http["proxy"] = $proxy;
        $http["request_fulluri"] = true; // most HTTP proxies expect the full URI in the request line
    }

    // if we want to POST data format it correctly
    if($method == "POST"){
        $content_length = strlen($postdata);

        $http["content"] = $postdata;
        $headers .= "Content-Type: application/x-www-form-urlencoded\r\nContent-Length: $content_length";
    }

    // now add any headers
    $http["header"] = $headers;

    // set options
    $opts = array('http' => $http);

    // create stream context
    $context = stream_context_create($opts);

    // Open the file using the HTTP headers set above
    $html = @file_get_contents($url, false, $context);

    // check global $http_response_header for the status code e.g. the first part is HTTP/1.1 200 OK
    if(isset($http_response_header[0])){
        // Retrieve the HTTP status code by splitting this into 3 vars
        list($version, $status, $msg) = explode(' ', $http_response_header[0], 3);
    }

    // if we have a valid status code then use it otherwise default to 400
    if(is_numeric($status)){
        $urlinfo["status"] = $status;
    }else{
        $urlinfo["status"] = "400"; // bad request
        $msg = "Bad Request";
    }

    // only return the HTML content for 200 = OK status codes
    if($status == "200"){
        $urlinfo["html"] = $html;
    // put all other headers into an array in case we want to access them (similar to CURL)
    }elseif(isset($http_response_header)){
        $urlinfo['info'] = $http_response_header;
    }

    // return array containing HTML, Status, Info
    return $urlinfo;
}



5. The content is decoded to get round people outputting HTML using Javascript or HTML-encoding it to hide the goodies. It then looks for text in the format IP:PORT, e.g.:
// call the function to get content from a proxy list URL
$content = getHttpContent($url, "", 10, 3, "GET");

// did we get a good response?
if($content["status"] == "200" && !empty($content["html"])){

    // extract content and decode it to get round people using Javascript to hide HTML
    $content = urldecode(html_entity_decode($content["html"]));

    // now look for all instances of IP:PORT
    preg_match_all("/(\d+\.\d+\.\d+\.\d+):(\d+)/", $content, $matches, PREG_SET_ORDER);

6. I then return the list of IPs to the front-end HTML page, which outputs them into a table with a "TESTING" status. As each unique IP:PORT is inserted into the report, another AJAX call is made to test the proxy server out.

7. The Proxy test utilises the same HTTP scraper function but this time it uses the IP and PORT details from the Proxy we are wanting to test. The page it calls is one of the many IP Checker tools that are available on the web. You can change the URL it calls but I am using a page that returns the Country after doing a reverse IP check. This way if the proxy is working I know the country details.

8. Once the reverse IP test is carried out on the Proxy the results are returned to the HTML report page and the relevant Proxy listing is updated in the table with either GOOD or BAD.

I have found that a lot of other proxy checker scripts only validate that a proxy is working by giving it a PING or opening a socket. Although this may show whether a server is accessible, it doesn't tell you whether using it as a proxy will work or not.

Therefore the best way to test whether a proxy is working is to check for a valid response by requesting a page, and if you are going to call a page you might as well call a useful one: one that returns the IP's location, or maybe one that shows any HTTP_VIA or FORWARDED_FOR headers so you can detect whether the proxy is anonymous or not.
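Checking those headers also tells you what kind of proxy you have found. Here is a hedged Python sketch of the usual classification logic (the header names are the ones transparent proxies commonly add; real proxies vary):

```python
def classify_proxy(headers):
    """Classify a proxy from the headers it adds to forwarded requests.

    `headers` is a dict of the request headers as seen by the target server.
    """
    via = "Via" in headers or "HTTP_VIA" in headers
    forwarded = "X-Forwarded-For" in headers or "HTTP_X_FORWARDED_FOR" in headers
    if forwarded:
        return "transparent"   # your real IP is passed along
    if via:
        return "anonymous"     # admits to being a proxy but hides your IP
    return "elite"             # looks like a normal browser request

print(classify_proxy({"Via": "1.1 squid"}))  # anonymous
```

The checker page you call through the proxy just needs to echo the headers it received; the classification can then happen on your side.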

Remember when you find some good quality proxies store their details as they are worth their weight in gold!


Removed. I'm going to sell this bad boy!

Friday, 19 March 2010

Wordpress, WP-O-Matic and custom HTML tags

Automating Wordpress posts with WP-O-Matic

If you use Wordpress you should really look into the great WP-O-Matic plugin, which allows you to automate postings by importing content from RSS or XML feeds. You can set up schedules to import at regular times or import on demand from the admin area.

One issue, however, which I have just spent ages getting to the bottom of, is the use of HTML such as OBJECT and EMBED tags. A lot of content feeds contain multimedia files nowadays, and you want this content imported directly into your site. The problem with WP-O-Matic and Wordpress in their default mode is that you will only get this content imported when you run the import from the admin menu, or from the page that the CRONJOB calls directly whilst logged in as an admin or publisher.

If you try to run the page the cronjob calls e.g /wp-content/plugins/wp-o-matic/cron.php?code=XXXXX whilst logged out or allow the job to run by itself you will find that certain HTML tags and attributes are removed including OBJECT and EMBED tags.

The reason is security, to prevent XSS hacks, but it's possible to get round it if you need to. This took me quite a long time to get to the bottom of as I am very new to Wordpress, but I managed it in the end.

WP-O-Matic makes use of another object called SimplePie, which is a tool for extracting content from XML and RSS feeds. This object has a number of settings for stripping out HTML, and the behaviour depends on how the feed import is called.

When running the import from the admin menu, a setting called set_stupidly_fast is set to true, which bypasses all the normal formatting and HTML parsing. When the CRONJOB runs it is set to false, so the reformatting is carried out. In reality you want the reformatting, as it does much more than just parse the HTML, such as removing excess DIVs and comment tags and ordering the results by date.

If you don't care about this formatting, you need to find the fetchFeed method in the \wp-content\plugins\wp-o-matic\wpomatic.php file and force the setting to true all of the time, so the HTML parsing is always bypassed and the tags survive:

$feed->set_stupidly_fast(true);

If you do want to keep the benefits of the stupidly_fast setting but allow OBJECT and EMBED tags, then you can override the list of tags that SimplePie strips. You can do this in the same fetchFeed method in the wpomatic.php file, just before the init method is called, by passing strip_htmltags an array containing only the tags you do want SimplePie to remove from the extracted content.

// the SimplePie default strip list minus object, param and embed so that multimedia content survives
$feed->strip_htmltags(array('base', 'blink', 'body', 'doctype', 'font', 'form', 'frame', 'frameset', 'html', 'iframe', 'input', 'marquee', 'meta', 'noscript', 'script', 'style'));
$feed->init();
So that takes care of the WP-O-Matic side, but unfortunately we are not done yet, as Wordpress runs its own sanitisation on posts in a file called kses.php found in the wp-includes folder. If you are logged in as an admin or publisher you won't hit this problem, but your CRONJOB will, so you have two choices.

1. Comment out the hook that runs the kses sanitisation on post content. This isn't recommended for security reasons, but if you wanted to do it, the following line should be commented out in the kses_init_filters function, e.g.

function kses_init_filters() {
    // Normal filtering.
    add_filter('pre_comment_content', 'wp_filter_kses');
    add_filter('title_save_pre', 'wp_filter_kses');

    // Post filtering
    // comment out the hook that sanitises the post content
    //add_filter('content_save_pre', 'wp_filter_post_kses');
    add_filter('excerpt_save_pre', 'wp_filter_post_kses');
    add_filter('content_filtered_save_pre', 'wp_filter_post_kses');
}
Commenting out this line will ensure no sanitisation is carried out on your posts, whoever or whatever does the posting. Obviously this is bad for security, because if you are importing a feed that one day contained an inline script or an OBJECT that loaded a virus, you could be infecting all your visitors.

2. The other safer way is to add the tags and attributes that you want to allow into the list of acceptable HTML content that the kses.php file uses when sanitising input. At the top of the kses file is an array called $allowedposttags which contains a list of HTML elements and their allowed attributes.

If you wanted to allow the playing of videos and audio through OBJECT and EMBED tags then the following section of code can just be inserted into the array.

'object' => array(
    'id'       => array(),
    'classid'  => array(),
    'data'     => array(),
    'type'     => array(),
    'codebase' => array(),
    'align'    => array(),
    'width'    => array(),
    'height'   => array()),
'param' => array(
    'name'  => array(),
    'value' => array()),
'embed' => array(
    'id'      => array(),
    'type'    => array(),
    'width'   => array(),
    'height'  => array(),
    'src'     => array(),
    'bgcolor' => array(),
    'wmode'   => array(),
    'quality' => array(),
    'allowscriptaccess' => array(),
    'allowfullscreen'   => array(),
    'allownetworking'   => array(),
    'flashvars'         => array()
),




Obviously you can add whichever tags and attributes you like, and this is, in my opinion, the preferred way of getting round the problem, as you are still whitelisting content rather than allowing anything.

It took me quite a while to get to the bottom of this problem but I now have all my automated feeds running correctly importing media content into my blog. Hopefully this article will help some people out.

Saturday, 27 February 2010

The difference between a fast running script and a slow one

When logging errors is not a good idea

I like to log errors and debug messages when running scripts as they help diagnose problems not immediately apparent when the script is started. There is nothing worse than starting a script running and then coming into work on a Monday morning to find that it hasn't done what it was supposed to and not know why. A helpful log file with details of the operation that failed and the input parameter values can be worth its weight in gold.

However, logging is an overhead and can literally be the difference between a script that takes days to run and one that takes minutes. I recently had a script to run on a number of webservers that had to segment hundreds of thousands of files (documents in the < 500KB range) into a new folder structure.

My ideal solution was to collate a list of the physical files in the folder I wanted to segment and pass that into my SELECT statement, so that I could return a recordset containing only the files that actually existed. However, my ideal solution was thwarted by an error message I had never come across before, which claimed the DB server didn't have the resources available to compile the necessary query plan. Apparently this was due to the complexity of my query, and it recommended reducing the number of joins. As I only had one join this was not a solution, and after a few more attempts with some covering indexes that also failed, I tried another approach.

The solution was just to try to move each file, catch any error that was raised and log it to a file. The script was set running, and a day later it was still running, along with a large log file containing lots of "File not found" errors.

The script was eventually killed for another unrelated reason, which pissed me off as I had to get the segmented folder structure up and running sharpish. I modified the script so that before each move it checked whether the file existed. I had initially thought that this check might itself be an overhead, as the move method would be doing the same search at a lower level, so I would be duplicating two lookups for a file in a large folder, and I had reckoned that just attempting the move and catching any error would be quicker.

However, because I was logging each error to a text file, this was causing an I/O bottleneck and slowing down the procedure immensely. Adding the FileExists check around each move ensured that only files that definitely existed were moved, so no errors were raised, which meant no logging and no I/O overhead.
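The pattern boils down to testing for the file before attempting the move, so the failure path (and its logging I/O) is never taken. A minimal Python sketch of the idea (the original script was not Python, this just illustrates the shape):

```python
import os
import shutil

def move_existing(filenames, src_dir, dest_dir):
    """Move only the files that actually exist; skip the rest silently.

    Returns the number of files moved. Missing files never raise an
    exception, so nothing needs to be written to an error log.
    """
    moved = 0
    for name in filenames:
        src = os.path.join(src_dir, name)
        if os.path.isfile(src):          # cheap check avoids the error path
            shutil.move(src, os.path.join(dest_dir, name))
            moved += 1
    return moved
```

The trade-off is exactly the one described above: you lose the record of which files were missing, but you also lose the per-error logging that dominated the running time.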

I had forgone the nicety of knowing which files were no longer on the server, but I had also reduced the script's running time to a mere 25 minutes. The lesson to be learned is that although logging is useful it can also be a major overhead, and if you can do without it you may just speed up your scripts.