Sunday 21 November 2010

Auto Resizing OBJECT and EMBED videos for display on import

Importing Videos from feeds into blogs

I have done a lot of work lately updating the popular WP-O-MATIC wordpress plugin that auto imports content into Wordpress blogs from feeds.

Mainly this is because the plugin is no longer supported having been taken over by the product WP Robot which seems to be based on the original source but with a lot more features.

One of the main things I find myself doing is having to resize videos so that they fit into my content which is 520 px wide. If you are importing content from various other blogs you can never guarantee that the videos will always be of a similar size and if you want your blog to look nice and not have scrollbars everywhere then the best way is to reformat the imported content so that any videos are resized to your desired dimensions.

As Wordpress and WP-O-Matic is PHP I use the preg_replace function to accomplish this using the callback function to handle the formatting. It looks for OBJECT or EMBED tags where the width is over 500px and if any are found I resize to 500 * 375 (4*3 ratio).

The function is below.

$content = preg_replace_callback("@(<(?:object|embed|param)[\s\S]+?width=['\"])(\d+)(['\"][\s\S]+?>)@i",
create_function(
'$matches',
'if($matches[2] > 500){
$res = preg_replace("@ height=[\'\"]\d+[\'\"]@i"," height=\"375\"",$matches[1] . "500" .$matches[3]);
}else{
$res = $matches[1] . $matches[2] .$matches[3];
}
return $res;'),$content);




Wednesday 29 September 2010

Analysing Bot Traffic from a Twitter Rush

Twitter Rush - Bot Traffic from Twitter

I blogged the other day about a link I found that listed the traffic that visits a site whenever a link to that site is posted upon twitter.

It seems that if you post a Tweet that contains a link a Twitter Rush is caused due to numerous BOTS, Social Media sites, SERPS and other sites noticing the new link and then all visiting your site at the same time.

This is one reason I created Strictly TweetBOT PRO which differs from my free version of Strictly TweetBOT as it allows you to do the following:


  • Make an HTTP request to the new post before Tweeting anything. If you have a caching plugin on your site then this should put the new post into the cache so that when the Twitter Rush comes they all hit a cached page and not a dynamically created one.
  • Add a query-string to the URL of the new post when making an HTTP request to aid caching. Some plugins like WP Super Cache allow you to force an un-cached page to be loaded with a query-string. So this will enable the new page to be loaded and re-cached.
  • Delay tweeting for N seconds after making the HTTP request to cache your post. This will help you ensure that the post is in the cache before the Twitter Rush.
  • Add a delay between each Tweet that is sent out. If you are tweeting to multiple accounts you will cause multiple Twitter Rushes. Therefore staggering the hits aids performance.

Buy Now
I have been carrying out performance tests on one of my LAMP sites and have been analysing this sort of data in some depth. I thought I would post an update with the actual traffic my own site receives when a link is Tweeted which is below.

A few interesting points:

1. This traffic is instantaneous so that the first item in the log file visiting the site has exactly the same time stamp as the WordPress URL that submitted the tweets to my Twitter account.

2. Yahoo seems to duplicate requests. This one tweet to a single post resulted in 3 requests for Yahoo's Slurp BOT but they originated from two different IP addresses.

3. These bots are not very clever and don't seem to log the URL's they visit to prevent duplicate requests. Not only does Yahoo have issues with the same account but if you post the same link to multiple Twitter accounts you will get all this traffic for each account.

For example when I posted the same link to 3 different Twitter accounts I received 57 requests (19 * 3). You would think maybe these Bots would be clever enough to realise that they only need to visit a link once every so often no matter which account posted it.

It just serves to prove my theory that most Twitter traffic is BOT related. 

BOTS following BOTS and Re-Tweeting and following traffic generated by other BOTS.

  • 128.242.241.133 - - [29/Sep/2010:21:06:45 +0000] "HEAD /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 200 - "-" "Twitterbot/0.1"
  • 216.24.142.47 - - [29/Sep/2010:21:06:47 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.0" 200 26644 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 OneRiot/1.0 (http://www.oneriot.com)"
  • 204.236.254.109 - - [29/Sep/2010:21:06:48 +0000] "HEAD /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 200 - "-" "PostRank/2.0 (postrank.com)"
  • 67.195.112.56 - - [29/Sep/2010:21:06:46 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.0" 200 100253 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
  • 72.30.142.220 - - [29/Sep/2010:21:06:47 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.0" 200 100253 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
  • 65.52.2.10 - - [29/Sep/2010:21:06:49 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 200 26643 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
  • 85.114.136.243 - - [29/Sep/2010:21:06:49 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 200 26634 "-" "Mozilla/5.0 (compatible; Windows NT 6.0) Gecko/20090624 Firefox/3.5 NjuiceBot"
  • 72.30.142.220 - - [29/Sep/2010:21:06:49 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.0" 200 100253 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
  • 89.151.113.134 - - [29/Sep/2010:21:06:49 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 200 100253 "-" "Mozilla/5.0 (compatible; MSIE 6.0b; Windows NT 5.0) Gecko/2009011913 Firefox/3.0.6 TweetmemeBot"
  • 67.202.63.158 - - [29/Sep/2010:21:06:54 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 200 26634 "-" "kame-rt (support@backtype.com)"
  • 38.113.234.180 - - [29/Sep/2010:21:06:57 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 200 100253 "-" "Voyager/1.0"
  • 74.112.128.61 - - [29/Sep/2010:21:07:03 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.0" 200 100253 "-" "Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly/) Gecko/2009032608 Firefox/3.0.8"
  • 64.233.172.20 - - [29/Sep/2010:21:07:10 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 200 26640 "-" "AppEngine-Google; (+http://code.google.com/appengine; appid: mapthislink)"
  • 208.94.147.190 - - [29/Sep/2010:21:07:17 +0000] "HEAD /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 200 - "http://longurl.org" "LongURL API"
  • 208.94.147.190 - - [29/Sep/2010:21:07:17 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 200 100253 "http://longurl.org" "LongURL API"
  • 66.249.65.166 - - [29/Sep/2010:21:07:25 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 200 26653 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
  • 64.12.237.17 - - [29/Sep/2010:21:07:32 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 403 455 "-" "Jakarta Commons-HttpClient/3.1"
  • 204.236.205.4 - - [29/Sep/2010:21:08:55 +0000] "HEAD /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 403 - "-" "Firefox"
  • 67.207.201.163 - - [29/Sep/2010:17:01:06 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 403 473 "-" "Mozilla/5.0 (compatible; mxbot/1.0; +http://www.chainn.com/mxbot.html)"

Buy Now

Monday 27 September 2010

Twitter Traffic is causing high server loads

Wordpress and Twitter Generated Traffic

I came across this interesting article the other day over at cloudtesting.com which listed the user-agents of bots that would visit a link if it was posted on Twitter. I have copied the list below and it's quite amazing to think that as soon as you post a link to your site or blog on Twitter you will suddenly get hammered by X amount of bots.

I can definitely attest to the truthfulness of this behaviour as I am experiencing a similar problem with one of my LAMP Wordpress blogs. Whenever an article is posted I automatically post tweets to 2 (sometimes 3 depending on relevance) Twitter accounts with my new Strictly Tweetbot Wordpress plugin.

Therefore when I import content at scheduled intervals throughout the day I can receive quite a sudden rush of bot traffic to my site which spikes my server load often to levels that are unrecoverable.

  • @hourlypress
  • Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  • Mozilla/5.0 (compatible; abby/1.0; +http://www.ellerdale.com/crawler.html)
  • Mozilla/5.0 (compatible; MSIE 6.0b; Windows NT 5.0) Gecko/2009011913 Firefox/3.0.6 TweetmemeBot
  • Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)
  • Mozilla/5.0 (compatible; Feedtrace-bot/0.2; bot@feedtrace.com)
  • Mozilla/5.0 (compatible; mxbot/1.0; +http://www.chainn.com/mxbot.html)
  • User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7 (.NET CLR 3.5.30729)
  • Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0 OneRiot/1.0 (http://www.oneriot.com)
  • PostRank/2.0 (postrank.com)
  • Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0 Me.dium/1.0 (http://me.dium.com)
  • Mozilla/5.0 (compatible; VideoSurf_bot +http://www.videosurf.com/bot.html)
  • Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3
  • Mozilla/5.0 (compatible; page-store) [email:paul at page-store.com]

Personally I think that this list might be out of date as from what I have seen there are quite a few more agents that can be added to that list including bots from Tweetme and Bit.ly.

Currently if I think the bots don't provide any kind of benefit to me in terms of traffic apart from stealing my bandwidth and killing my server I am serving 403's using htaccess rules.

Before banning an agent check your log files or stats to see if you can see any traffic being referred. If you want the benefit but without the constant hitting try contacting the company behind the bot to see if they could change their behaviour. You never know they may be relying on your content and be willing to tweaking their code. We can all live in hope.

Saturday 11 September 2010

Strictly System Checker - Wordpress Plugin

Updated Wordpress Plugin - Strictly System Check

Ensure your Wordpress site is up 24/7 and be kept informed when it isn't without even having to touch your server or install monitoring software with this Wordpress plugin I created.

I have just released version 1.0.2 which has some major new features including:

The option to check for fragmented indexes and to carry out an automated re-index using the OPTIMIZE command.

And most importantly I have migrated some of the key features that MySQL performance monitoring scripts such as MySQLReport use to the plugin so that you can now be kept informed of:
  • MySQL Database Uptime
  • No of connections made since last restart, connections per hour.
  • The no of aborted connections.
  • No of queries made since last restart, queries per hour.
  • The percentage of queries that are flagged as slow.
  • The number of joins being carried out without indexes.
  • The number of reads and writes.
You can find out more about this very useful plugin at my main site: www.strictly-software.com/plugins/strictly-system-check

It's a plugin I created out of necessity and one that has been 100% useful in keeping my site running and on those occasions it's not I get to know about it before anyone complains.

Wednesday 8 September 2010

An issue with mysql_unbuffered_query, CONCAT and Wordpress

MySQL Problems related to mysql_unbuffered_query

I have been doing a lot of work with Wordpress lately, mainly developing a number of plugins I have ended up creating to overcome issues with performance or lack of features in existing plugins. One of the plugins I have been working on lately is an XML feed plugin in which I have tried to make use of the database a lot more than other plugins seem to want to.

However for some time I have been experiencing an issue with one of the MySQL interface functions mysql_unbuffered_query. This function is designed to speed up the retrieval of results as records are returned to the client as soon as they are ready rather than waiting for the whole recordset to be completed.

Whilst this seems straight forward I have come across an issue which seems to be directly linked to using this method which affects queries that engage in certain replacement behaviour. In my case I am using a SELECT statement to CONCAT all the necessary column per row into one XML string. Rather than return each individual column by itself and then use PHP to string build and call other database related functions I am trying to save time and database calls by doing as much as possible in one statement. A cut down example of this SQL can be seen below in which I join a number of columns together as well as inserting the relevant value (in this case the tag slug) into the URL.

SELECT CONCAT(' ',REPLACE('http://www.mysite.com/tag/%tag%/','%tag%',t.slug),' ',REPLACE(NOW(),' ','T'),'Z always 1.0 ') as XML
FROM wp_terms AS t
JOIN wp_term_taxonomy AS tt
ON t.term_id = tt.term_id
WHERE tt.taxonomy IN ('post_tag')
ORDER BY Name;

Nothing too complex about that at all. The URL string containing the placeholder is a permalink structure that is obtained before hand and one that can contain multiple placeholders and sections. For the sake of clarity I have kept it simple so it only makes one replacement.

When I run this query in Navicat, from the website with the Wordpress SQL functions or the standard mysql_query functions it runs correctly returning all the rows with the appropriate tag values inserted into the correct %tag% place-holders within the XML e.g
<url><loc>http://www.mysite.com/tag/Sales/</loc><lastmod>2010-09-08T00:37:25Z</lastmod><changefreq>always</changefreq> <priority>1.0</priority></url>
<url><loc>http://www.hottospot.com/tag/Spain/</loc><lastmod>2010-09-08T00:37:25Z</lastmod><changefreq>always</changefreq> <priority>1.0</priority></url>
<url><loc>http://www.hottospot.com/tag/2020/</loc><lastmod>2010-09-08T00:37:25Z</lastmod><changefreq>always</changefreq> <priority>1.0</priority></url>


However when I use mysql_unbuffered_query to run this I get the problem that all rows contain the same data e.g
<url><loc>http://www.mysite.com/tag/Sales/</loc><lastmod>2010-09-08T00:37:25Z</lastmod><changefreq>always</changefreq> <priority>1.0</priority></url>
<url><loc>http://www.hottospot.com/tag/Sales/</loc><lastmod>2010-09-08T00:37:25Z</lastmod><changefreq>always</changefreq> <priority>1.0</priority></url>
<url><loc>http://www.hottospot.com/tag/Sales/</loc><lastmod>2010-09-08T00:37:25Z</lastmod><changefreq>always</changefreq> <priority>1.0</priority></url>



Even if I break the query down to something simple like this REPLACE without the CONCAT it still "misbehaves".

SELECT t.slug,REPLACE('http://www.hottospot.com/tag/%tag%/','%tag%',t.slug)
FROM wp_terms AS t
JOIN wp_term_taxonomy AS tt ON
t.term_id = tt.term_id
WHERE tt.taxonomy IN ('post_tag')
ORDER BY Name;


and the results will show the same value in the 2nd column for all rows e.g

Col 1Col 2
Saleshttp://www.mysite.com/tag/Sales
Spainhttp://www.mysite.com/tag/Sales
2012http://www.mysite.com/tag/Sales


It definitely seems to be connected to the execution of mysql_unbuffered_query as I can have Navicat open on my office PC connected to my remote server and all these queries run correctly but as soon as on my laptop at home I run the mysql_unbuffered_query query through the website to build the XML everything goes tits up. If I hit the refresh button on the query open in Navicat on my office PC which had been returning the correct results they then come back like I have described with all the values for rows 2 onwards displaying the value from the first row.

This is strange behaviour as I am providing a link identifier parameter when opening the connection and executing the SQL for the mysql_unbuffered_query and I presumed rightly or wrongly that this should have prevented issues like this.

I am yet to find a definitive answer to this problem so if anybody knows please contact me with details. From what it looks like the MySQL query engine has not finished processing the rows correctly when it starts to return them to the client. If the Replacement of values and placeholders used in the REPLACE function wasn't being carried out until the whole recordset was completed rather than after individual rows then this "might" explain it. As I am not an expert on the inner workings of MySQL I cannot say at this point in time however a quick solution to speed up some SQL has become more problematic than I would have though it could.


Any information regarding this issue would be great.

Thursday 26 August 2010

Problems connecting to Remote Desktop over a VPN

Troubleshooting issues with Remote Desktop connections

By Strictly-Software

I have just experienced and finally resolved a problem that started suddenly last week that prevented me from connecting to my work PC from my laptop at home using Terminal Services / Remote Desktop.

The problem started suddenly and it has made little sense for the last week. The symptoms were:
  • A virgin media broadband Internet connection.
  • A windows XP laptop connected to my broadband over a wireless connection.
  • My laptop could access the VPN without any problem and it could also access the PC in question over a windows share e.g \\mycomputer\c$
  • Trying to connect using a Remote Desktop connection returned a "this computer cannot connect to the remote computer" error message.

As far as I was aware nothing had changed on my computer and I first though that maybe a virus was blocking the port or a windows update had caused a problem. However when I brought the laptop into my office and tried connecting to my PC over the office wireless connection there was no problem.

I then tried this great little tool RD Enable XP that allows you to set up remote desktop access remotely as long as you have access to the computer and the necessary admin privileges. It requires that you have PSExec installed which comes with the PSTools admin suite and allows you to carry out tasks on computers remotely such as monitoring and managing processes.

The program checks that you can access the remote PC, that the Terminal Services options are enabled correctly in the registry and that you're not being blocked by a firewall.

I had already checked that the fDenyTSConnections registry option was set correctly so when the application hung whilst trying to set firewall exceptions I thought that there was a problem with my router and firewall.

I then tried changing the port number that Terminal Service connections are carried out with from 3389 to 3390. This is another registry setting that needs to be changed on the remote computer and then after a reboot you just append the port number after the computer name or IP address when connecting e.g strmycomputer:3390.

This didn't work so I was pretty annoyed as Virgin media hadn't been much help so I was about to give up until I came across a message thread related to the same problem.

One of the suggestions was to change the Advanced Network Error Search option which is something Virgin offers it's users and is described as follows:

Our advanced network error search helps you find the website you're looking for quickly.

We all make mistakes when we type in website addresses. Perhaps we miss a few letters, or the website doesn't exist any longer. If an address you enter doesn't locate a site, this handy feature will convert the incorrect address into a web search, so instead of an error message you will get a list of our closest matches, plus some additional related links.
This option is linked to your broadband connection which explains why the problem was related to the local connection and not the PC or remote network.

Low and behold when I disabled this option I could once again access my work PC over Remote Desktop again!

I have no idea why this option was suddenly enabled as I have never come across or even heard of it before tonight so I can only imagine Virgin decided to update their settings without asking their customers first.

It also seems to be pre-selected on newly bought laptops as I found out again tonight! Lucky I wrote this blog article otherwise I would have had to hunt down the original again!

I have no idea why this Virgin config option affects remote desktop connections but it obviously can cause a total block on this type of functionality.

If you too have similar problems related to Remote Desktop connections and are also a Virgin Media customer then save yourself a whole lot of time and go to this page first and check your settings: https://my.virginmedia.com/advancederrorsearch/settings

Friday 6 August 2010

Techies Law - Definition

The definition of Techies Law

In a similar vein to "Murphys Law" which states if something can go wrong it will I have over the last two days experienced multiple occurrences of what I call "Techies Law" which all developers should be well aware of even if they haven't heard the term described as such.

The law is pretty simple and states that
If you have spent considerable time trying to resolve a bug in your code, technical problem or other such computer related issue and you finally resort to asking for help from a colleague or support team member. You can be rest assured that when you go to show said person the problem in action it has miraculously resolved itself. You are then derided for either being a numpty and/or wasting their precious time for no reason.
Obviously Techies Law often leaves developers looking foolish in front of their colleagues but I have decided to utilise this unfathomable law of nature to my own benefit.

Every time I now have a problem with my network connection or a bug in my code I need fixing instead of spending a long time trying to fix it myself I will now only spend a quarter of the usual amount of time I do before asking for help and in doing so I usually find the issue magically resolving itself.

The knack to this tactic is to make sure the person you ask for help is not someone who is able to rip the piss out of you and is someone who won't mind being pestered for such a reason otherwise you could find yourself being the subject of many jokes.

Tuesday 20 July 2010

Strictly System Check Wordpress Plugin

A Site and Server Monitoring Plugin for Wordpress

I have just created a new Wordpress plugin, Strictly System Check, which allows me to easily monitor my Wordpress sites for signs of performance problems such as an overloaded server, too many database connections, corrupt tables or website issues.

I created this plugin primarily for my own use as I noticed that I quite regularly would experience my site going down with the "Error establishing a database connection" error after bulk imports of content after using WP-O-Matic. Whether it was this plugin or another one that caused the problem I don't know but the actual underlying issue is not a misconfigured database but corrupt database tables primarily the wp_options or wp_posts tables.

Therefore the main idea behind the plugin was for it to check my site at scheduled intervals and if it came across this database error message to then scan my MySQL database tables for corrupt tables and if any were found to automatically run a REPAIR statement.

I then extended this functionality to incorporate some more tests including a scan of the HTML source for a known string to ensure that my site is running correctly as well as checks against the webserver to guage the current load average and current database connection count.

The plugin allows webmasters to set their own preferred thresholds so that if the load average was say 2.0 or above then a report would be emailed out to a specified email address informing them of the situation.

Whilst this plugin is not meant to be a replacement for all the great professional site monitoring tools that are available it's a nice add-on that allows website administrators who may not have access to such tools to regularly check the status of their site and to be notified if the site goes down or becomes overloaded.

Please let me know what you think of the plugin and if you like it or it saves your site from going down then please make a donation.

Friday 18 June 2010

The Wordpress Survival Guide

Surviving Wordpress - An Introduction

As well as my technical blog here on blogger I have a number of Wordpress sites which I host myself on a virtual server.

I have now been using Wordpress and PHP for about 4 years and in that time I have learnt a hell of a lot regarding the pros and cons and do's and don'ts that are involved in running your own Wordpress blog.

As a developer who has spent most of his working career working with Microsoft products moving from a Windows environment to Linux has been a great learning curve and as well as all the tips I have gathered regarding Wordpress I thought I would write a series for other developers in a similar situation or for Webmasters who may have the sites but not the technical know how.

Covering the bases

In the following articles I will look in detail at how to get the most out of your system in terms of performance. If you are like me you are not made of money and able to afford lots of dedicated servers to host your sites on therefore you need to make the most of what you have. Performance tuning your Wordpress site is the most important thing you can do and luckily due to the nature of WordPresses plugins a lot of performance tuning can be done with a couple of clicks.

I will also be looking at performance tuning MySQL which is the database that Wordpress runs on. Moving from MS SQL with all its features and DMV's to MySQL was quite a culture shock for me so there are a few tips I have learnt which might be useful.

First things first - Tools of the trade

First off you will need to know how to get things done. My Wordpress sites are running on a Linux box and one of the first things I did was install VirutalMin which is a graphical UI you access in your browser which lets you manage your server. You could do everything from the command line but coming from a windows environment I found it very useful to be able to see a familiar environment.

After installing VirtualMin you should also install WebMin which is another graphical interface that gives you ultimate flexibility over your server without you ever needing to go near a command line prompt.

As well as setting up FTP to access your files through SFTP (secure FTP) I also installed PUTTY which enables me to connect to my server and get used to the command line way of doing things. I would definitely recommend doing this even if you were like me a Windows person as you should never be afraid to try something new and it's always good to have as many technical skills under your belt as possible. I always try to use the command line first but I know I can fall back on VMin if I need to.

Useful Commands

A good list of Linux applications and commands can be found here: Linux Commands but here are some of the key commands I find myself using over and again.

CommandDetails
dateShow the current date and time on the server
cdchange drive e.g cd /var (go to the var directory)
cd ../go back up one directory
cd ../../go back up two directories
lslist out the contents of a directory
whoamisee who you are logged in as
su - [username]Assume the permissions of the specified user
sudo [command]Run a command as root but stay as the user you are logged in as
topShow the current running processes and server load
top -d .2Show the current running processes with .2 second refresh
tail -f access_logView the most current entries in the sites access log
grep "61.252.14.247" access_log | tailView the most current entries in the sites access log for a certain IP address
netstat -taShow all current connections to the server
grep "27/Feb/2012:" access_log | sed 's/ - -.*//' | sort | uniq -c | sort -nr | lessView the IP's that appear in your access log the most for a certain date ordered by the most frequent first.
/etc/init.d/apache2 restartRestart Apache
apache2ctl configtestTest the Apache configuration for configuration errors
/etc/init.d/mysql restartRestart MySQL
wget [URL]Remotely access, load and save a file to the current directory
chmod 777 [filepath]Grant full read/write/delete permission to everyone to a file or folder
chmod +x [filepath]Grant execute permission to a script
rebootReboot the server


Handling Emergencies

You need to be prepared to handle emergencies and that involves a quick diagnosis and quick action. What sorts of emergencies can you expect to run into? Well the most common form will be very poor server performance that results in your site being unavailable to visitors. This can happen for a number of reasons including:

1. High visitor traffic due to a popular article appearing on a major site or another form of traffic driver.
2. High bot traffic from undesirable crawlers such as content scrapers, spammers, hackbots or even a denial of service attack. I recently experienced a DOS attack which came from an out of control bot that was making 10+ requests to my homepage a second.
3. A poorly written plugin that is eating up resources.
4. A corrupt database table causing database errors or poorly performing SQL causing long wait times.
5. Moderately high visitor traffic mixed with an unoptimised system set-up that exacerbates the problem.

Identifying the cause of your problem

If you are having a major slow down, site freeze or just don't know what is going on then the first thing is to open up a command prompt and run top to see the current processes.

The first thing to look at is the load average as this tells you the amount of resources and pressure your server is currently under. A value of 1.00 means your server is maxed out
and anything over that means you are in trouble. I have had a value of 124 before which wasn't good at all. My site was inaccessible and only a cold reboot could get me back to a controlable state.

If your load average is high then take a look at the types of processes that are consuming the most resources. You should be looking at the CPU% and Memory used by each process (the RES) column which shows the amount of physical memory in KB consumed by the process.

Each request to your site will use its own process so if your report is full of Apache rows then you are having a traffic spike. Each page request on your site should be a very quick affair so the processes will come and go very speedily and having a short delay interval is important to being able to spot problems.

Another process to look for is the MySQL process which again should come and go unless it's currently running a long performance intensive query in which case the database could be your problem. Another tool I like to use is mytop which gives you a top like display but of your MySQL processes only. You could open up your MySQL console and run SHOW PROCESSLIST constantly but using MyTop is alot easier and it will help identify problematic queries as well as queries that are being run by your site a lot.

If you don't have monitoring tools available to keep you up to date with your sites status then you may find that your system experiences severe problems periodically without your knowledge. Being informed by a customer that your site is down is never the best way of finding out therefore you might be interested in a plugin I developed called Strictly System Check.

This Wordpress plugin is a reporting tool that you run at scheduled intervals from a CRON job. The plugin will check that your site is available and reporting a 200 status code as well as scanning for known text. It will also connect to your database, check the database and server load and report on a number of important status variables such as the number of connections, aborted connections, slow queries and much more.

The great thing about this plugin is that if it finds any issues with the database it will CHECK and then REPAIR any corrupt tables as well as running the OPTIMIZE command to keep the tables defragged and fast. If any problems are found an email can be send to let you know. I wrote this plugin because there wasn't anything like it available and I have found it 100% useful in not only keeping me informed of site issues but also in maintaining my system automatically.

Scanning Access logs for heavy hitters

Something else you should take a look straight away is your access and error logs.
If you open up your access log and watch it for a while you should soon see if you are experiencing either high traffic in general or from a particular IP/Useragent such as a malicious bot. Using a command like tail or less to view the logfile with the -f flag ensures that as new data is added to the file it will be outputted to the screen which is what you want when examining current site usage.

mydomain:~# cd /home/mywebsite/logs
mydomain:~# tail -f access_log

Banning Bad Users

If the problem is down to one particular IP or Useragent who is hammering your site then one solution is to ban the robot by returning it a 403 Forbidden status code which you can do with your .htacess file by adding in the following lines:

order allow,deny
deny from 79.125.58.227
deny from 67.207.201.
deny from 89.146.55.222
allow from all

This will return 403 forbidden codes to all requests from the two IP addresses and the one IP subnet: 67.207.201.

If you don't want to ban by IP but by user-agent then you can use the Mod Rewrite rules to identify bad agents in the following manner:

RewriteCond %{HTTP_USER_AGENT} (?:ColdFusion|curl|HTTPClient|Java|libwww|LWP|Nutch|PECL|POE|Python|Snoopy|urllib|WinHttp) [NC,OR] # HTTP libraries
RewriteCond %{HTTP_USER_AGENT} (?:ati2qs|cz32ts|indy|linkcheck|Morfeus|NV32ts|Pangolin|Paros|ripper|scanner) [NC,OR] # hackbots or sql injection detector tools being misused!
RewriteCond %{HTTP_USER_AGENT} (?:AcoiRobot|alligator|auto|bandit|capture|collector|copier|disco|devil|downloader|fetch\s|flickbot|hapax|hook|igetter|jetcar|kmbot|leach|mole|miner|mirror|mxbot|race|reaper|sauger|sucker|snake|stripper|vampire|weasel|whacker|xenu|zeus|zip) [NC] # offline downloaders and image grabbers
RewriteRule .* - [F,L]

Here I am banning a multitude of known bad user-agents as well as a number of the popular HTTP libraries that script kiddies and hackers use off the shelf without knowing how to configure to hide the default values.

You should read up on banning bad robots using the htaccess file and Mod Rewrite as a considerable proportion of your traffic will be from non human bots and not the good kind e.g Googlebot or Yahoo. By banning bad bots, content scrapers, spammers, hackers and bandwidth leeches you will not only reduce the load on your server but save yourself money on bandwidth charges.

The other log file you should check ASAP in a potential emergency situation is the Apache error log as this will tell you if the problem is related to a PHP bug, a Wordpress plugin or MySQL error.

Unless you have disabled all your warnings and info messages the error log is likely to be full of non fatal errors however anything else should be checked out. If your error log is full of database errors such as "table X is corrupt" or "database has gone away" then you know where to look for a solution.

Tables get corrupted for many reasons but a common reason I have found is when I have had to carry out a cold reboot to regain control of my server. Sometimes after a reboot everything will be working okay but on accessing your website all the content will have disappeared. Do not worry yet as this could be due to the corrupt tables and carrying out a REPAIR command should remedy this.

Another potential flash point are new or recently upgraded plugins. Plugins can be written by anybody and there is no guarantee whatsoever that the code contained within the plugin is of any quality whatsoever even if the features it offers seem to be great. I have personally found some of the most popular plugins to be performance hogs due to either poor code or complex queries with missing indexes but more of that in a later article.

Unless you are prepared to tweak other peoples code you don't have many options apart from optimising the queries the plugin runs by adding missing indexes or disabling the plugin and finding an alternative. One good tip I have found is to create an empty plugin folder in the same directory as the current plugin folder and then in emergency situations you can rename your existing plugin folder to something like plugins_old and then your site will be running without plugins. Once you have remedied any problems you can add your plugins back one by one to ensure they don't cause any problems.

Regular Maintenance

You should reguarly check your access and error logs even when the site is running smoothly to ensure that problems don't build up without you realising. You should also check your slow query log for poor queries especially after installing new plugins as it's very easy to gain extra performance from adding missing indexes especially when your site has tens of thousands of articles.

You should also carry out regular backups of your database and Wordpress site and ensure that you run the OPTIMIZE command to defrag fragmented table indexes especially if you have deleted data such as posts, tags or comments. A fragmented table is slower to scan and it's very easy to optimize at the click of a button. Take a look at the Strictly System Check Wordpress Plugin which can be setup to report on and analyse your system at scheduled intervals as one of the features is the ability to run the OPTIMIZE command.

So this is the end of the first part of the Wordpress Survival Guide series and next time I will be looking at performance tuning and site optimisation techniques.


Wednesday 9 June 2010

SQL Injection attack from Googlebot

SQL Injection Hack By Googlebot Proxy

Earlier today on entering work I was faced with worried colleagues and angry customers who were complaining about Googlebot being banned from their site. I was tasked to finding out why.

First off all my large systems run with a custom built logger database that I created to help track visitors, page requests, traffic trends etc.

It also has a number of security features that constantly analyse recent traffic looking for signs of malicious intent such as spammers, scrapers and hackers.

If my system identifies a hacker it logs the details and bans the user. If a user comes to my site and its already in my banned table then it's met with a 403 error.

Today I found out that Googlebot had been hacking my site using known SQL Injection techniques.

The IP address was a legitimate Google IP coming from the 66.249 subnet and there were 20 or so records from one site in which SQL injection attack vectors had been passed in the querystring.

Why this has happened I do not know as an examination of the page in question found no trace of the logged links however I can think of a theoretical example which may explain it.

1. A malicious user has either created a page containing links to my site that contain SQL Injection attack vectors or has added content through a blog, message board or other form of user generated CMS that has not sanitised the input correctly.

2. This content has then been indexed by Google or even just appeared in a sitemap somewhere.

3. Googlebot has visited this content and crawled it following the links containing the attack vectors which have then been logged by site.

This "attack by SERP proxy" has left no trace of the actual attacker and the trail only leads back to Google who I cannot believe tried to hack me on purpose.

Therefore this is a very clever little trick as websites are rarely inclined to block the worlds foremost search engine from their site.

Therefore I was faced with the difficult choice of either adding this IP to my exception list of users never to block under any circumstance or blocking it from my site.

Obviously my sites database is secure and it's security policy is such that even if a hackbot found an exploitable hole updates couldn't be carried out by the websites login however this does not mean that in future an XSS attack vector could be created and then exploited.

Do I risk the wrath of customers and let my security system carry on doing it's job and block anyone trying to do my site harm even if its a Google by Proxy attack or do I risk a potential future attack by ignoring attacks coming from supposedly safe IP addresses?

Answer

The answer to the problem came from the now standard way of testing to see if a BOT really is a BOT. You can read about this on my page 4 Simple Rules Robots Won't Follow. It basically means doing a 2 step verification process to ensure the IP address that the BOT is crawling from belongs to the actual crawler and not someone else.

This method is also great if you have a table of IP/UserAgents that you whitelist but the BOT suddenly starts crawling from a new IP range. Without updating your table you need to make sure the BOT is really who they say they are.

Obviously it would be nice if Googlebot analysed all links before crawling them to ensure they are not hacking by proxy but then I cannot wait for them to do that.

I would be interested to know what other people think about this.

Saturday 22 May 2010

Debugging Memory Problems on Wordpress

High Memory Consumption on Wordpress

This year I have taught myself PHP and setup a couple of Wordpress blogs to get my head round the sort of code PHP developers hack together. Although PHP has a lot of good qualities such as the variety of functions compared to other scripting language such as ASP, the ability to write object orientated code and the amount of available help on the web it does have some downsides which include the ability for poor coders to write poor code without even realising it.

I won't go into a diatribe about coding practice as there will always be good coders and bad coders but my main problem with Wordpress is that although rich in functionality and easy to use the code behind the scenes does not lend itself well to performance.

My current blog is getting about 1000 human visitors a day and four times as many bots. It's not a big site and runs on a LINUX box with 1GB RAM which should be enough especially since I have already helped my performance by doing the following:
-Adding the Super Cache Plugin which GZips up content
-Tuning the MySQL and PHP configuation files
-Banning over 40% of undesirable bot traffic using MOD Rewrite
-Disabling unused plug-ins and comparing new ones before installing them so that performance is the key factor I use when deciding which one to use.

I am also looking into using a PHP accelerator which will store the complied version of the source code so that it's not re-interpreted on every load.

However I am still experiencing intermittent memory issues and the problem is trying to detect the source of them. Earlier today I added some debug functions to the Wordpress source code that outputted the Memory_Usage value at key points. The results are quiet shocking as it shows that when the home page is loaded it consumes 24MB of memory!

The file in question is located in your root directory wp_settings.php.

The output is here:

START OF SETTINGS FILE PHP
Memory Limit: 128M - Current Memory Usage: 768 kb - Peak Usage: 768 kb
Before Require Compat, Functions, Classes - Current Memory Usage: 1.25 mb
After require_wp_db - Current Memory Usage: 3 mb
Before Require Plugin, Default Filters, Pomo, l10n - Current Memory Usage: 3 mb
After Required Files loaded - Current Memory Usage: 4 mb
Before Requires of main files Formatting, Query, Widgets, Canonical, Rewrite and much more 30 files! - Current Memory Usage: 4.5 mb
After ALL Requires - Current Memory Usage: 15.25 mb
Before muplugins_loaded action fired - Current Memory Usage: 15.25 mb
Before require vars - Current Memory Usage: 15.25 mb
Before create_initial_taxonomies - Current Memory Usage: 15.25 mb
After create_initial_taxonomies - Current Memory Usage: 15.25 mb
Before include active plugins - Current Memory Usage: 15.25 mb
Before require pluggable - Current Memory Usage: 21.5 mb
Before wp_cache_postload and plugins_loaded - Current Memory Usage: 22.25 mb
After ALL Plugins loaded and action fired - Current Memory Usage: 22.75 mb
Before action sanitize_comment_cookies - Current Memory Usage: 22.75 mb
Before create global objects WP_Query WP_Rewite WP and WP_Widget_Factory - Current Memory Usage: 22.75 mb
Before action setup_theme - Current Memory Usage: 22.75 mb
Before wp-init and do_action(init) - Current Memory Usage: 23.75 mb
End of Settings - Current Memory Usage: 24 mb

As you can see the inclusion of 30 core include files that Wordpress loads on every page adds a wopping 10MB in one go is a major factor in the large size as well as the loading and activation of other plug-ins.

If I didn't have Caching enabled then you can imagine how much trouble I would be in. If I was suddenly hit with a spike in traffic of only 42 concurrent users my 1GB of RAM would have been eaten up straight away. Plus in reality it wouldn't even need 42 users as not all my RAM will be consumed by PHP / APACHE and we need to factor in all the MySQL queries that run behind the scenes.

In fact from using the Wordpress function get_num_queries() which logs each query run by the wp_db->query() method it shows that my home page makes 32 calls to the database!

Now in no developers world should these be good stats to report on and it just goes to show the sort of battle someone has to fight when making a Wordpress blog run under high loads on a small budget. Yes throwing resources at the problem is a form of answer but a better one is to resolve the underlying issues. The problem is that it's very hard to with someone else's code.

One of the big downsides to using other peoples code is that when it goes wrong you are big trouble. You can either wait for an upgrade from the developer which may or may not come within any sort of agreeable timescale or you try to make sense of the code they wrote. The issue with the core Wordpress codebase is that if you start hacking it about then you leave yourself in a position of having to ignore future updates or having to redevelop it yourself for time memorial.

I don't know which of these two undesirable options I am going to take but I know what I would be looking at if I had to redevelop the wordpress system.

1. I would most definitley consider running it from a different database instead of MySQL which has a lot of good features but also a lot of configurable options I don't want to have to worry about.

If I am joining a lot of records together in a GROUP_CONCAT or CONCAT I don't want to have to worry about calculating the size of the group_concat_max_len first I just want all my string to return.

Also as well as all the missing DML such as CTE's I really miss the very useful Data Management Views that SQL Server 2005-2008 has as they make optimising a database very easy. I have seen a lot of Plugins that use very poor SQL techniques and some time spent on proper indexing would speed things up a lot. Having to sift through the slow query log and then run an EXPLAIN on each one is a time consuming job whereas setting up a scheduled job to monitor missing indexes and then list all suggestions is very easy to do in MSSQL.

2. I would definitely change the database schema that Wordpress runs on and one of the major changes would be to horizontally partition the main wp_posts table as this is referenced a hell of a lot by plug-ins and other internal Wordpress code and the majority of the time the wide columns such as content and excerpt are never required. A narrow table containing the post id, post type, modified date and maybe the post title would help things a lot.

3. All of the queries run in the system are single executions. This means extra network traffic and extra resources. Returning multiple recordsets in one hit would save a lot of time and I know it's hard in a CMS to combine queries but with some clever thinking it could be done.

4. A lot of performance problems come from calling the same function multiple times within loops rather than calling it once, storing the value and then re-using it. A good example is all the calls to get the permalink structures when creating links.

5. All the checks to see whether functions, classes and constants exist must be consuming some resources. It is hard in a CMS to know what has and hasn't been included but maybe some of these could be left out.

6. Caching should be built into the Wordpress core code as standard. When new posts and pages are saved there should be the option to create a physical hard copy of the html instead of using the usual dynamic page. Any sidebar functionality could also be calculated and added there and then as does it really matter if a tag or category cloud is slightly out of date or do links to other blogs and sites really need to be loaded from the DB every time? The answer is obviously no.

Anyway those are just six suggestions I have just conjured up through a cursory examination of the codebase. I am sure there are lots more potential ideas and I see from reading up on the web that a lot of people have experienced similar issues with performance since upgrading to Wordpress 2.8.

If anyone else has some suggestions for reducing the memory footprint of a Wordpress blog please let me know as I really don't like the idea of a simple blog using up so much memory.

Problems with LINUX, Apache and PHP

LINUX Apache Server stopped serving up PHP pages

By Strictly-Software

When I logged into my hosted LINUX web server earlier tonight I was met with a message saying I should install a number of new packages.

I usually ignore things like this until it gets to a point where someone forces me to do purely for reasons that will shortly become obvious.

The packages were the following:
  • apache2.2-common
  • apache2-mpm-worker
  • apache2
  • usermin
  • libapt-pkg-perl
  • apt-show-versions
  • webmin
  • webmin-virtual-server
I have no idea what most of them do but they had been sitting around for a long time waiting for me to install them and tonight was the night. These are always nights I dread!

Shortly after doing the updates I noticed that my WordPress sites had stopped working and all the PHP files were being served up as file attachments with a content type of application/x-httpd-php instead of being parsed, executed and then delivered as text/html.

At first I thought it was something to do with the SQL performance tweaks I was doing but I soon remembered about the updates and I went off hunting the web for a solution.

It's nights like these that make me wish I was back doing carpet fitting, finishing the day at 3 pm and then going down the pub. Much more enjoyable than spending Friday nights scratching my head wondering what the hell I had done to bring down my websites at 3 am.

To cap off the nightmare I had just spent ages writing a detailed message to post on an APACHE web forum only for my session to timeout and then for the site to refuse to log me in.

They then decided to block me for failing 5 login attempts in a row. Obviously I couldn't get back my message so I was pretty pissed right now!

For some reason APACHE had stopped recognising PHP file extensions and I still don't know what had happened under the covers but after a long hunt on Google I came across the solution.

The libapache2-mod-php5 module had somehow disappeared so I had to re-install it with the following lines:

sudo apt-get purge libapache2-mod-php5
sudo apt-get install libapache2-mod-php5
sudo a2enmod php5
sudo /etc/init.d/apache2 restart
I also added the following two lines to /etc/php5/apache2/php.ini

AddHandler application/x-httpd-php .php
LoadModule php5_module modules/libphp5.so
I then cleared my browser cache and low and behold my site came back.

Maybe this info might come in handy for anyone else about to upgrade packages on their server or serve as a reminder of what happens when you try to behave like a sysadmin and have no real idea what your doing!

It also should make you glad that we live in the days where a Google search can provide almost any answer you are looking for. I doubt I would have owned, let alone found, a book that would have been of any use at 3am on a Saturday morning.

So despite all their snooping, minority report advertising and links to the alphabet agencies they are good for something.

Wednesday 19 May 2010

Tune Up Utilities 2010

Tune Up Utilities 2010

I don't usually promote products other than the odd Google advert and in reality that seems to be a waste of time considering I make literally pennies. However one of the products that I do admit to liking and help advertise is the TuneUp Utilities application which takes the best features of all the other clean up and performance tuning apps that are available and puts them into one nice easy to use interface.

You have all the features that other tune up tools such as CCleaner, Comodo and Defraggler have and a lot more. It also combines network and browser optimisations that well known plugins such as FireTune use as well as TCP/IP optimisers such as the DrTCP. It also puts all the important windows configuration options that manage memory, errors, start ups, background processes, programs, display settings and a whole lot more into the same application which means you should never really need another tune up utility once this has been installed.

A break down of some of the features:
  • Disk Defragmenter
  • Disk Cleaner
  • Registry Cleaner
  • Memory optimiser
  • TCP / IP optimiser
  • Internet Browser optimiser
  • File Shredder
  • Program manager (add / remove programs, shortcuts, start ups etc)
  • Windows Settings manager
  • Process manager
  • One Click maintenance
  • Maintenance Scheduler

It also has some features to increase your computers security, manage network traffic and important computer configuration options that can help improve performance. In fact it seems to have wrapped up all the important settings that you would ever need or want to change in a very nice user interface.

If you don't want to spend time going through the many levels of available settings there is the easy option of "One Click maintenance" that will check your system for potential issues and then offer solutions in an easy to understand dialog.

Another feature I quite like, which is probably more to do with the name than any perceived benefit it may give is "Turbo Mode". This is a big round button that you can press when you want a performance boost and it basically just ensures that the process your working on at the time gets the necessary resources and disables CPU and memory hogging processes such as disk de-fragmentation.

The price of the application is £29.99 but you can install it on up to 3 computers and this is a small price to pay for the convenience of having every feature wrapped up in one easy to use interface.

If you are looking for a program that can aid your PC's performance then I would suggest taking a look at Tune Up Utilities first but if you are not interested in paying for your tools and are willing to spend the time tweaking your settings manually then have a read of the following articles which discuss PC, Network and browser optimisation techniques:



TuneUp Utilities 2010

Wednesday 5 May 2010

Are you a spammer

Spam Checker Tool

If you post comments on blogs and sites around the web you might find that your time and effort is being wasted especially if you have been marked as a comment spammer.

Most blogs and sites that accept comment use a tool called Akismet to check whether a comment is spam or not. It's a free service that comes with a simple API that allows users with a key to check content and user details for spam. The plugin is easy to use and many popular blogging tools use it to check comments for spam.

The reason it's so popular is that it uses a variety of methods to identify spam such as using feedback from site owners who mark spam and ham that get through the automatic filter so that other users gain the benefit of a large community analysis. On top of that the automatic filtering uses the usual blacklists of known spammer IP addresses, bad emails, dodgy websites and also the usual content analysis that ensures your comment section doesn't get filled with adverts for Viagra.

So the service is well worth using if you want to block spammers filling your comment section up with bullshit replies and backlinks. However if you are a "commenter" and you're worried that you might be getting blocked from posting due to being unfairly marked as a spammer then the best way to find out is to use the same service yourself to see if it regards you as a spammer.

Therefore I created this Spam Checker Tool today which allows you to check whether your details (Name, Website, Email, IP) are blacklisted by Akismet as well as checking out a potential comment before you post it.

Just complete the fields you want to test and hit submit. The site will then report whether Akismet regards your content as spam or not.

All fields are optional but at least one value must be supplied to perform a spam analysis.


Note:
If Akismet clears your good name and doesn't regard you as spam it doesn't automatically mean that your comment will be posted.

Most blogging software including Wordpress allows the administrators to set up their own rules regarding comment spam. These rules usually consist of blacklists of words and a number of links to count therefore keep your comments free of profanity and the number of links down to avoid being flagged for review.

Most of all make your comment relevant to the post your commenting on. Auto posting generic comments such as "Great post, keep it up" and "I found this site is it also yours" will not get you very far unless all filtering has been disabled.

Monday 3 May 2010

Strictly AutoTags Wordpress Plugin

Wordpress Plugin - Strictly AutoTags

I have been doing a lot of work with Wordpress lately and have been working on a number of custom plugins that I already use with my own Wordpress blogs. One of these plugins that I find very useful is AutoTags which automatically detects words within new posts to use as tags.

Unlike other smart tag plugins that add lots of functionality to manage existing tags this plugin only attempts to find relevant and useful words within posts to use as tags by utilising the power of regular expressions and some simple logic to determine which words and sentences are relevant to the post.

I have been using this plugin on a couple of my own blogs for a while now and as well as helping me build up a nice big list of taxonomies it's not overly complicated in what it does so it doesn't throw any curve balls my way very often.

My posts are usually quite lengthy so I choose to set my MaxTag option to five which means that only the five most relevant tags are added and single or double occurrences are usually skipped which helps make the tags that are added a lot more relevant to the content.

Keeping in the spirit of all things open source I thought I would release the code to the public. It's my first Wordpress plugin and I have only just learned PHP so I cannot claim it to be a work of perfection or anything but I would like to hear feedback from anyone who uses it.

Also if any PHP or Wordpress developers do take a look at the source code I would be interested to hear any advice on the format the plugin takes. I started using a template I downloaded from the web but it wasn't very encapsulated so I decided to put the whole thing in a class. I don't know if this is the best way or not so please share your experience.

You can check out the plugin either at my own site - Strictly AutoTags or at the plugin directory hosted by Wordpress.

Monday 19 April 2010

Banning Bad Bots with Mod Rewrite

Banning Scrapers and Other Bad Bots using Mod Rewrite

There are many brilliant sites out there dedicated to the never ending war on bad bots and I thought I would add my own contribution to the lists of regular expressions used for banning spammers, scrapers and other malicious bots with Mod Rewrite.

As with all security measures a sys admin takes to lock down and protect a site or server a layered approach is best. You should be utilise as many different methods as possible so that an error, misconfiguration or breach of one ring in your defence does not mean your whole system is compromised.

The major problem with using the htaccess file to block bots by useragent or referrer is that any hacker worth the name would easily get round the rules by changing their agent string to a known browser or by hiding the referrer header.

However in spite of this obvious fact it still seems that many bots currently doing the rounds scraping and exploiting are not bothering to do this so its still worth doing. I run hundreds of major sites and have my own logger system which automatically scans traffic and blocks visitors who I catch hacking, spamming, scraping and heavy hitting. Therefore I regularly have access to the details of the badly behaving bots and whereas a large percentage of them will hide behind an IE or Firefox useragent a large percentage still use identifying agent strings that can be matched and banned.

The following directives are taken from one of my own personal sites and therefore I am cracking down on all forms of bandwidth theft. Anyone using an HTTP library like CURL, Snoopy, WinHTTP and not bothering to change their useragent will get blocked. If you don't want to do this then don't just copy and paste the rules.
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /

# Block blank or very short user-agents. If they cannot be bothered to tell me who they are or provide jibberish then they are not welcome!
RewriteCond %{HTTP_USER_AGENT} ^(?:-?|[a-z1-9\-\_]{1,10})$ [NC]
RewriteRule .* - [F,L]

# Block a number of libraries, email harvesters, spambots, hackbots and known bad bots

# I know these libraries are useful but if the user cannot be bothered to change the agent they are worth blocking plus I only want people
# visiting my site in their browser. Automated requests with CURL usually means someone is being naughty so tough!
RewriteCond %{HTTP_USER_AGENT} (?:ColdFusion|curl|HTTPClient|Java|libwww|LWP|Nutch|PECL|PHP|POE|Python|Snoopy|urllib|Wget|WinHttp) [NC,OR] # HTTP libraries
RewriteCond %{HTTP_USER_AGENT} (?:ati2qs|cz32ts|indy|library|linkcheck|Morfeus|NV32ts|Pangolin|Paros|ripper|scanner) [NC,OR] # hackbots or sql injection detector tools being misused!
RewriteCond %{HTTP_USER_AGENT} (?:AcoiRobot|alligator|auto|bandit|capture|collector|copier|disco|devil|downloader|fetch|flickbot|hook|igetter|jetcar|leach|mole|miner|mirror|race|reaper|sauger|sucker|site|snake|stripper|vampire|weasel|whacker|xenu|zeus|zip) [NC] # offline downloaders and image grabbers
RewriteRule .* - [F,L]

# fake referrers and known email harvesters which I send off to a honeytrap full of fake emails
RewriteCond %{HTTP_USER_AGENT} (?:atomic|collect|e?mail|magnet|reaper|siphon|sweeper|harvest|(microsoft\surl\scontrol)|wolf) [NC,OR] # spambots, email harvesters
RewriteCond %{HTTP_REFERER} ^[^?]*(?:iaea|\.ideography|addresses)(?:(\.co\.uk)|\.org\.com) [NC]
RewriteRule ^.*$ http://english-61925045732.spampoison.com [R,L] # redirect to a honeypot


# copyright violation and brand monitoring bots
RewriteCond %{REMOTE_ADDR} ^12\.148\.196\.(12[8-9]|1[3-9][0-9]|2[0-4][0-9]|25[0-5])$ [OR]
RewriteCond %{REMOTE_ADDR} ^12\.148\.209\.(19[2-9]|2[0-4][0-9]|25[0-5])$ [OR]
RewriteCond %{REMOTE_ADDR} ^63\.148\.99\.2(2[4-9]|[3-4][0-9]|5[0-5])$ [OR]
RewriteCond %{REMOTE_ADDR} ^64\.140\.49\.6([6-9])$ [OR]
RewriteCond %{HTTP_USER_AGENT} (?:NPBot|TurnitinBot) [NC]
RewriteRule .* - [F,L]


# Image hotlinking blocker - replace any hotlinked images with a banner advert for the latest product I want free advertising for!
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?##SITEDOMAIN##\.com/.*$ [NC] # change to your own site domain!
RewriteCond %{HTTP_REFERER} !^https?://(?:images\.|www\.|cc\.)?(cache|mail|live|google|googlebot|yahoo|msn|ask|picsearch|alexa).*$ [NC] # ensure image indexers don"t get blocked
RewriteCond %{HTTP_REFERER} !^https?://.*(webmail|e?mail|live|inbox|outbox|junk|sent).*$ [NC] # ensure email clients don't get blocked
RewriteRule .*\.(gif|jpe?g|png)$ http://www.some-banner-advert.com/myadvert-468x60.png [NC,L] # free advertising for me


# Security Rules - these rules help protect your site from hacks such as sql injection and XSS

RewriteCond %{REQUEST_METHOD} ^(TRACE|TRACK) # no-one should be running these requests against my site!
RewriteRule .* - [F]

# My basic rules for catching SQL Injection - covers the majority of the automated attacks currently doing the rounds

# SQL Injection and XSS hacks - Most hackbots will malform links and then log 500 errors for details I use a special hack.php page to log details of the hacker and ban them by IP in future
# Works with the following extensions .php .asp .aspx .jsp so change/remove accordingly and change the name of the hack.php page or replace it with [F,L]
RewriteRule ^/.*?\.(?:aspx?|php|jsp)\?(.*?DECLARE[^a-z]+\@\w+[^a-z]+N?VARCHAR\((?:\d{1,4}|max)\).*)$ /hack\.php\?$1 [NC,L,U]
RewriteRule ^/.*?\.(?:aspx?|php|jsp)\?(.*?sys.?(?:objects|columns|tables).*)$ /hack\.php\?$1 [NC,L,U]
RewriteRule ^/.*?\.(?:aspx?|php|jsp)\?(.*?;EXEC\(\@\w+\);?.*)$ /hack\.php\?$1 [NC,L,U]
RewriteRule ^/.*?\.(?:aspx?|php|jsp)\?(.*?(%3C|<)/?script(%3E|>).*)$ /hack\.php\?$1 [NC,L,U] # XSS hacks

# Bad requests which look like attacks (these have all been seen in real attacks)
RewriteRule ^[^?]*/(owssvr|strmver|Auth_data|redirect\.adp|MSOffice|DCShop|msadc|winnt|system32|script|autoexec|formmail\.pl|_mem_bin|NULL\.) /hack.php [NC,L]

</IfModule>


Security, Free Advertising and Less Bandwidth

As you can see from the rules I have condensed the rules into sections to keep the file manageable. The major aim of the rules is to
  1. Improve security by blocking hackbots before they can reach the site.
  2. Reduce bandwidth by blocking the majority of automated requests that are not from known indexers such as Googlebot or Yahoo.
  3. Piss those people off who think they can treat my bandwidth as a drunk treats a wedding with a free bar. Known email scrapers go off to a honeypot full of fake email addresses and hotlinkers help me advertise my latest site or product by displaying banners for me.

Is it a pointless task?

A lot of the bad bot agent strings are well known and there are many more which could be added if you so wish however trying to keep a static file maintained with the latest bad boys is a pointless and thankless task. The best way is to automate the tracking of bad bots by logging users who request the robots.txt file by creating a dynamic file that can log the request to a file or database. Then you place directives in the robots.txt to DENY access to a special directory or file and then place links on your site to this file. Any agents who ignore the robots.txt file and crawl these links can then be logged and blocked.

I also utilise my own database driven logger system that constantly analyses the traffic looking for bad users which can then be banned. I have an SQL function that checks for hack attempts and spam by pattern matching the stored querystring as well as looking for heavy hitters (agents/IP's requesting lots of pages in a short time period). This helps mes prevent DDOS attacks as well as scrapers who think they can take 1000+ jobs without saying thank you!

A message for the other side

I know this probably defeats the point of me posting my htaccess rules but as well as defending my own systems from attack I also make use of libraries such as CURL in my own development to make remote HTTP requests. Therefore I can see the issues involved in automated crawling from both sides and I know all the tricks system admin use to block as well as the tricks scrapers use to bypass.

There are many legitimate reasons why you might need to crawl or scrape but you should remember that what goes around comes around. Most developers will have at least one or more of their own sites and therefore you should know that bandwidth is not free so stealing others will lead to bad karma. The web is full of lists containing bad agents and IP addresses obtained from log files or honey-traps so you don't just risk being banned from the site you are intending to crawl when you decide to test your new scraper out.

Remember if you hit a site so hard and fast it breaks (which is extremely possible in this day and age of cheaply hosted Joomla sites designed by four year olds.) then sys admin will start analysing log files looking for culprits. A quiet site that runs smoothly usually means the owner is happy and not looking for bots to ban

  • Rather than making multiple HTTP requests cache your content locally if possible.
  • Alternate requests between domains so that you don't hit a site too hard.
  • Put random delays in between requests.
  • Obey Robots.txt and don't risk getting caught in a honeypot or infinite loop by visiting pages you shouldn't be going to.
  • Never use the default agent as set in your chosen library.
  • Don't risk getting your server blacklisted by your crawling so always use a proxy.
  • Only scrape the bare minimum that you need to do the job. If you are only checking the header then don't return the body as well.

The following articles are a good read for people interested in this topic:




Tuesday 30 March 2010

Great new site - From The Stables

Daily Insights straight from the stables

If you are into horse racing, like a bet or just interested in getting great information straight from the trainers mouth about upcoming races then I suggest checking out my new site www.fromthestables.com. We have teamed up with some of the best known UK trainers to provide a unique high quality service available to members on a daily basis.

Each day our top trainers will provide their expert information on their horses running that day. This isn't a tipster site and we won't pretend to guarantee winners and losers however we do promise to provide quality info straight from the stables every racing day. We have only been running for a week and already we have already provided our members with great information that has led to a number of winners and each way placed horses.

We are currently offering half price membership of only £25 a month but on top of that we are offering new users a free seven day trial so that they can experience the quality information that our trainers provide for themselves. Not only does membership guarantee great trainer insight into horses running that day we also offer a variety of deals and special offers which include discounted race course tickets, champagne tours of our trainers stables, free bets from our sponsors and to top it off we also plan to buy a racehorse later this year which will be part owned by our subscribers.

If you are interested in utilising this valuable resource for yourself or know a friend, family member or colleague who would be interested then why not take advantage of our seven day free trial. You will need to set up a PayPal subscription before being granted entry to the site but no money will be deducted from your account until the seven day trial is up and you can cancel at any time before that date. If you are happy with the service then at the end of the trial the monthly membership fee which is currently at a 50% discount of only £25 will be taken from your PayPal account and you will continue to enjoy all the benefits of the site.

To take advantage of our trial offer please visit the following link:

www.fromthestables.com

Monday 29 March 2010

My Hundredth Article

An overview of the last 102 articles

I really can't believe that I have managed to write 102 articles for this blog in the last year and a bit. When I first started the blog I only imagined writing the odd bit here and there and saw the site purely as a place to make public some of my more useful coding tips. I never imagined that I could output this amount of content by myself.

A hundred articles has come and gone pretty fast and as with all magazines, tv shows and bloggers stuck for an idea I thought I would celebrate my hundred and 2nd article by reviewing my work so far.

Recovering from an SQL Injection Attack

This was the article that started it all and it's one that still gets read quite a bit. It's a very detailed look at how to recover an infected system from an SQL Injection Attack and includes numerous ways of avoiding future attacks as well as quick sticking plasters, security tips and methods for cleaning up an infected database.

Linked to this article is one of my most downloaded SQL scripts which helps identify injected strings inside a database as well as removing them. This article was written after a large site at work was hacked and I was tasked with cleaning up the mess so it all comes from experience.

Performance Tuning Tips

I have wrote quite a few articles on performance tuning systems both client and server side and some of my earliest articles were on top tips for tuning SQL Databases and ASP Classic sites. As well as general tips which can be applied to any system I have also delved into more detail regarding specific SQL queries for tuning SQL 2005 databases.

Regarding network issues I also wrote an extensive how to guide on troubleshooting your PC and Internet connection which covered everything from TCP/IP settings to tips on the best tools for cleaning up your system and diagnosing issues. On top of that I collated a number of tweaks and configuration options which can speed up FireFox.


Dealing with Hackers, Spammers and Bad Bots

My job means that I have to deal with users trying to bring my systems down constantly and I have spent considerable time developing custom solutions to log, identify and automatically ban users that try to cause harm to my sites. Over the last year I have written about SQL Denial of Service attacks which involve users making use of web based search forms and long running queries to bring a database driven system to a halt. I have also investigated new hacking techniques such as the two stage injection technique, the case insensitive technique, methods of client side security and why its almost pointless as well as detailing bad bots such as Job Rapists and the 4 rules I employ when dealing with them.

I have also detailed the various methods of using CAPTCHA's as well as ways to prevent bots from stealing your content and bandwidth through hot linking by using ISAPI rewriting rules.

Issues with Browsers and Add-Ons

I have also tried to bring up to date information on the latest issues with browsers and new version releases and have covered problems and bugs related to major upgrades of Firefox, Chrome, Opera and IE. When IE 8 was released I was one of the first bloggers to detail the various browser and document modes as well as techniques for identifying them through Javascript.

I have also reported on current browser usage by revealing statistics taken from my network of 200+ large systems with regular updates every few months. This culminated in my Browser survey which I carried out over Christmas which looked at the browsers and add-ons that web developers themselves used.


Scripts, Tools, Downloads and Free Code

I have created a number of online tools, add-ons and scripts for download over the last year that range from C# to PHP and Javascript.

Downloadable Scripts Include:

SQL Scripts include:

Search Engine Optimisation

As well as writing about coding I also run a number of my own sites and have had to learn SEO the hard way. I have wrote about my experiences and the successful techniques I have found that worked in a couple of articles printed on the blog:
So there you go an overview of the last year or so of Strictly-Software's technical blog. Hopefully you have found the site a good resource and maybe even used one or two of the scripts I have posted. Let me know whether you have enjoyed the blog or not.