Friday, 18 June 2010

The Wordpress Survival Guide

Surviving Wordpress - An Introduction

As well as my technical blog here on Blogger, I have a number of Wordpress sites which I host myself on a virtual server.

I have now been using Wordpress and PHP for about 4 years, and in that time I have learnt a hell of a lot about the pros and cons, and the dos and don'ts, of running your own Wordpress blog.

As a developer who has spent most of his working career with Microsoft products, moving from a Windows environment to Linux has been a steep learning curve. Along with all the tips I have gathered regarding Wordpress, I thought I would write a series for other developers in a similar situation, or for webmasters who have the sites but not the technical know-how.

Covering the bases

In the following articles I will look in detail at how to get the most out of your system in terms of performance. If you are like me you are not made of money and cannot afford lots of dedicated servers to host your sites on, so you need to make the most of what you have. Performance tuning your Wordpress site is one of the most important things you can do, and luckily, thanks to the nature of WordPress's plugin system, a lot of performance tuning can be done with a couple of clicks.

I will also be looking at performance tuning MySQL, which is the database that Wordpress runs on. Moving from MS SQL, with all its features and DMVs, to MySQL was quite a culture shock for me, so there are a few tips I have learnt which might be useful.

First things first - Tools of the trade

First off you will need to know how to get things done. My Wordpress sites are running on a Linux box, and one of the first things I did was install Virtualmin, a graphical UI you access in your browser which lets you manage your server. You could do everything from the command line, but coming from a Windows environment I found it very useful to have a familiar point-and-click environment to fall back on.

Alongside Virtualmin you should also install Webmin (which Virtualmin actually runs on top of), another graphical interface that gives you ultimate flexibility over your server without you ever needing to go near a command line prompt.

As well as setting up FTP access to my files over SFTP (secure FTP), I also installed PuTTY, which enables me to connect to my server over SSH and get used to the command line way of doing things. I would definitely recommend doing this even if, like me, you are a Windows person, as you should never be afraid to try something new and it's always good to have as many technical skills under your belt as possible. I always try to use the command line first, but I know I can fall back on Virtualmin if I need to.

Useful Commands

A good list of Linux applications and commands can be found here: Linux Commands, but here are some of the key commands I find myself using over and over again.

Command - Details
date - Show the current date and time on the server
cd - Change directory, e.g. cd /var (go to the /var directory)
cd ../ - Go back up one directory
cd ../../ - Go back up two directories
ls - List the contents of a directory
whoami - See who you are currently logged in as
su - [username] - Assume the permissions of the specified user
sudo [command] - Run a command as root while staying logged in as your current user
top - Show the currently running processes and the server load
top -d .2 - Show the currently running processes with a 0.2 second refresh
tail -f access_log - View the most recent entries in the site's access log as they are written
grep "61.252.14.247" access_log | tail - View the most recent access log entries for a certain IP address
netstat -ta - Show all current TCP connections to the server
grep "27/Feb/2012:" access_log | sed 's/ - -.*//' | sort | uniq -c | sort -nr | less - List the IPs that appear most often in your access log for a certain date, most frequent first
/etc/init.d/apache2 restart - Restart Apache
apache2ctl configtest - Test the Apache configuration for syntax errors
/etc/init.d/mysql restart - Restart MySQL
wget [URL] - Download a remote file and save it to the current directory
chmod 777 [filepath] - Grant read/write/execute permission on a file or folder to everyone
chmod +x [filepath] - Grant execute permission on a script
reboot - Reboot the server


Handling Emergencies

You need to be prepared to handle emergencies, and that involves quick diagnosis and quick action. What sorts of emergencies can you expect to run into? The most common will be very poor server performance that results in your site being unavailable to visitors. This can happen for a number of reasons, including:

1. High visitor traffic due to a popular article appearing on a major site or another form of traffic driver.
2. High bot traffic from undesirable crawlers such as content scrapers, spammers, hackbots or even a denial of service attack. I recently experienced a DoS attack which came from an out-of-control bot that was making 10+ requests a second to my homepage.
3. A poorly written plugin that is eating up resources.
4. A corrupt database table causing database errors or poorly performing SQL causing long wait times.
5. Moderately high visitor traffic mixed with an unoptimised system set-up that exacerbates the problem.

Identifying the cause of your problem

If you are experiencing a major slow down, a site freeze, or you just don't know what is going on, then the first thing to do is open up a command prompt and run top to see the current processes.

The first thing to look at is the load average, as this tells you how much pressure your server is currently under. A value of 1.00 means a single-core server is maxed out (on a multi-core box the figure scales with the number of cores) and anything much above that means you are in trouble. I have had a value of 124 before, which wasn't good at all; my site was inaccessible and only a cold reboot could get me back to a controllable state.

If your load average is high then take a look at the types of processes that are consuming the most resources. You should be looking at the %CPU column and the memory used by each process (the RES column, which shows the amount of physical memory in KB consumed by the process).

Each request to your site is handled by its own process, so if your report is full of Apache rows then you are having a traffic spike. Each page request on your site should be a very quick affair, so the processes will come and go very speedily, and setting a short refresh interval (e.g. top -d .2) is important for being able to spot problems.
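
If you just want a quick snapshot rather than a live view, a couple of one-liners give you roughly the same information (a sketch assuming GNU ps and Apache processes named apache2, as they are on Debian/Ubuntu; on Red Hat style systems look for httpd instead):

mydomain:~# uptime
mydomain:~# ps aux --sort=-%mem | head -10
mydomain:~# ps aux | grep -c "[a]pache2"

The first shows the three load averages, the second lists the ten processes using the most physical memory, and the third counts the Apache worker processes currently alive (the square brackets stop grep matching its own process).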

Another process to look for is the MySQL process, which again should come and go unless it's currently running a long, performance-intensive query, in which case the database could be your problem. Another tool I like to use is mytop, which gives you a top-like display but of your MySQL processes only. You could open up your MySQL console and run SHOW PROCESSLIST constantly, but using mytop is a lot easier and it will help identify problematic queries as well as queries that are being run by your site a lot.
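
For a one-off look you can do this from the shell as well; the user, password and database names below are just placeholders for your own, and the mytop flags are the usual ones but may vary slightly by version:

mydomain:~# mysql -u root -p -e "SHOW FULL PROCESSLIST;"
mydomain:~# mytop -u wp_user -p wp_pass -d wp_database

The first dumps the current MySQL process list once; the second starts the live mytop display for your Wordpress database.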

If you don't have monitoring tools available to keep you up to date with your site's status then you may find that your system experiences severe problems periodically without your knowledge. Being informed by a customer that your site is down is never the best way of finding out, so you might be interested in a plugin I developed called Strictly System Check.

This Wordpress plugin is a reporting tool that you run at scheduled intervals from a CRON job. The plugin will check that your site is available and returning a 200 status code, as well as scanning the page for a known piece of text. It will also connect to your database, check the database and server load, and report on a number of important status variables such as the number of connections, aborted connections, slow queries and much more.

The great thing about this plugin is that if it finds any issues with the database it will CHECK and then REPAIR any corrupt tables, as well as running the OPTIMIZE command to keep the tables defragged and fast. If any problems are found an email can be sent to let you know. I wrote this plugin because there wasn't anything like it available and I have found it invaluable, not only in keeping me informed of site issues but also in maintaining my system automatically.
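
If you prefer to do it by hand, the same checks can be run with standard MySQL statements from the shell, for example (wp_posts and wp_comments are the default Wordpress table names; swap in your own database name):

mydomain:~# mysql -u root -p wp_database -e "CHECK TABLE wp_posts, wp_comments;"
mydomain:~# mysql -u root -p wp_database -e "REPAIR TABLE wp_posts;"
mydomain:~# mysql -u root -p wp_database -e "OPTIMIZE TABLE wp_posts, wp_comments;"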

Scanning Access logs for heavy hitters

Something else you should look at straight away is your access and error logs.
If you open up your access log and watch it for a while you should soon see whether you are experiencing high traffic in general or traffic from a particular IP/user-agent such as a malicious bot. Using tail with the -f flag (or less with +F) ensures that as new data is added to the file it is output to the screen, which is exactly what you want when examining current site usage.

mydomain:~# cd /home/mywebsite/logs
mydomain:~# tail -f access_log
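
If the log is flying past too fast to read, a quick way to see who is hitting you hardest is to count the top IPs and user-agents. This is a sketch that assumes the standard Apache combined log format, where the user-agent is the sixth quote-delimited field:

mydomain:~# awk '{print $1}' access_log | sort | uniq -c | sort -nr | head
mydomain:~# awk -F'"' '{print $6}' access_log | sort | uniq -c | sort -nr | head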

Banning Bad Users

If the problem is down to one particular IP or user-agent that is hammering your site then one solution is to ban the robot by returning it a 403 Forbidden status code, which you can do in your .htaccess file by adding the following lines:

order allow,deny
deny from 79.125.58.227
deny from 67.207.201.
deny from 89.146.55.222
allow from all

This will return 403 forbidden codes to all requests from the two IP addresses and the one IP subnet: 67.207.201.

If you don't want to ban by IP but by user-agent then you can use mod_rewrite rules to identify bad agents in the following manner:

# HTTP libraries
RewriteCond %{HTTP_USER_AGENT} (?:ColdFusion|curl|HTTPClient|Java|libwww|LWP|Nutch|PECL|POE|Python|Snoopy|urllib|WinHttp) [NC,OR]
# hackbots or SQL injection detector tools being misused!
RewriteCond %{HTTP_USER_AGENT} (?:ati2qs|cz32ts|indy|linkcheck|Morfeus|NV32ts|Pangolin|Paros|ripper|scanner) [NC,OR]
# offline downloaders and image grabbers
RewriteCond %{HTTP_USER_AGENT} (?:AcoiRobot|alligator|auto|bandit|capture|collector|copier|disco|devil|downloader|fetch\s|flickbot|hapax|hook|igetter|jetcar|kmbot|leach|mole|miner|mirror|mxbot|race|reaper|sauger|sucker|snake|stripper|vampire|weasel|whacker|xenu|zeus|zip) [NC]
RewriteRule .* - [F,L]

Here I am banning a multitude of known bad user-agents as well as a number of the popular HTTP libraries that script kiddies and hackers use off the shelf without knowing how to configure them to hide the default user-agent values.
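
A quick way to check rules like these are actually working is to request your own site while spoofing one of the banned agents with curl, for example (replace the example domain with your own):

mydomain:~# curl -I -A "Java/1.6.0" http://www.example.com/

If the rule is matching you should get a 403 Forbidden response back rather than a 200.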

You should read up on banning bad robots using the .htaccess file and mod_rewrite, as a considerable proportion of your traffic will be from non-human bots, and not the good kind such as Googlebot or Yahoo's crawler. By banning bad bots, content scrapers, spammers, hackers and bandwidth leeches you will not only reduce the load on your server but also save yourself money on bandwidth charges.

The other log file you should check ASAP in a potential emergency situation is the Apache error log, as this will tell you if the problem is related to a PHP bug, a Wordpress plugin or a MySQL error.

Unless you have disabled all your warnings and info messages the error log is likely to be full of non-fatal errors, but anything else should be checked out. If your error log is full of database errors such as "table X is marked as crashed" or "MySQL server has gone away" then you know where to look for a solution.
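
A quick way to skim past the notices and warnings and pull out the interesting entries is something like the following (the log path is the Debian/Ubuntu default; yours may differ):

mydomain:~# tail -n 500 /var/log/apache2/error.log | grep -viE "notice|warn" | less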

Tables get corrupted for many reasons, but a common one I have found is when I have had to carry out a cold reboot to regain control of my server. Sometimes after a reboot everything will seem to be working okay, but on accessing your website all the content will have disappeared. Do not panic just yet, as this could be down to corrupt tables and carrying out a REPAIR command should remedy it.
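
If you would rather check and repair everything in one go from the command line, mysqlcheck will do it across every database for you (it will prompt for the MySQL root password):

mydomain:~# mysqlcheck --all-databases --auto-repair -u root -p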

Another potential flash point is new or recently upgraded plugins. Plugins can be written by anybody and there is no guarantee whatsoever that the code contained within them is of any quality, even if the features on offer seem great. I have personally found some of the most popular plugins to be performance hogs due to either poor code or complex queries with missing indexes, but more of that in a later article.

Unless you are prepared to tweak other people's code you don't have many options apart from optimising the queries the plugin runs by adding missing indexes, or disabling the plugin and finding an alternative. One good tip I have found is to keep an empty plugins folder in the same directory as the real one; in an emergency you can rename your existing plugins folder to something like plugins_old, swap in the empty one, and your site will then be running without any plugins. Once you have remedied the problem you can add your plugins back one by one to ensure they don't cause any further issues.
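
In shell terms the swap is just a couple of renames; the path and the plugins_empty name below are only examples, so adjust them to wherever your wp-content directory actually lives:

mydomain:~# cd /home/mywebsite/public_html/wp-content
mydomain:~# mkdir plugins_empty
mydomain:~# mv plugins plugins_old
mydomain:~# mv plugins_empty plugins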

Regular Maintenance

You should regularly check your access and error logs even when the site is running smoothly to ensure that problems don't build up without you realising. You should also check your slow query log for poor queries, especially after installing new plugins, as it's very easy to gain extra performance by adding missing indexes, especially when your site has tens of thousands of articles.
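
If the slow query log isn't already switched on, you can enable it in the [mysqld] section of my.cnf and then summarise it with mysqldumpslow. The variable names below are the MySQL 5.1+ ones and the path is just an example (MySQL 5.0 uses the older log-slow-queries setting instead):

slow_query_log = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
long_query_time = 2

mydomain:~# mysqldumpslow -s t /var/log/mysql/mysql-slow.log | less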

You should also carry out regular backups of your database and Wordpress files, and run the OPTIMIZE command to defrag fragmented tables and indexes, especially if you have deleted data such as posts, tags or comments. A fragmented table is slower to scan and it's very easy to optimize at the click of a button. Take a look at the Strictly System Check Wordpress plugin, which can be set up to report on and analyse your system at scheduled intervals; one of its features is the ability to run the OPTIMIZE command for you.
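
For the backups themselves, mysqldump plus a tar of the Wordpress directory covers both halves; the paths and names here are only examples, so substitute your own:

mydomain:~# mysqldump -u wp_user -p wp_database | gzip > /home/mywebsite/backups/db-$(date +%F).sql.gz
mydomain:~# tar -czf /home/mywebsite/backups/site-$(date +%F).tar.gz /home/mywebsite/public_html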

So this is the end of the first part of the Wordpress Survival Guide series and next time I will be looking at performance tuning and site optimisation techniques.


Wednesday, 9 June 2010

SQL Injection attack from Googlebot

SQL Injection Hack By Googlebot Proxy

Earlier today on entering work I was faced with worried colleagues and angry customers who were complaining about Googlebot being banned from their site. I was tasked with finding out why.

First of all, my large systems run with a custom-built logger database that I created to help track visitors, page requests, traffic trends and so on.

It also has a number of security features that constantly analyse recent traffic looking for signs of malicious intent such as spammers, scrapers and hackers.

If my system identifies a hacker it logs the details and bans the user. If a user comes to my site and is already in my banned table then they are met with a 403 error.

Today I found out that Googlebot had been hacking my site using known SQL Injection techniques.

The IP address was a legitimate Google IP coming from the 66.249 subnet and there were 20 or so records from one site in which SQL injection attack vectors had been passed in the querystring.
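
If you want to check your own access logs for the same sort of thing, grepping for a few of the classic injection keywords is a crude but quick starting point (expect some false positives):

mydomain:~# grep -iE "union.*select|declare.*cast|information_schema|xp_cmdshell" access_log | less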

Why this happened I do not know, as an examination of the page in question found no trace of the logged links; however, I can think of a theoretical sequence of events which may explain it.

1. A malicious user has either created a page containing links to my site that contain SQL Injection attack vectors or has added content through a blog, message board or other form of user generated CMS that has not sanitised the input correctly.

2. This content has then been indexed by Google or even just appeared in a sitemap somewhere.

3. Googlebot has visited this content and crawled it, following the links containing the attack vectors, which have then been logged by my site.

This "attack by SERP proxy" has left no trace of the actual attacker and the trail only leads back to Google who I cannot believe tried to hack me on purpose.

Therefore this is a very clever little trick, as websites are rarely inclined to block the world's foremost search engine from their site.

So I was faced with the difficult choice of either adding this IP to my exception list of users never to block under any circumstances, or blocking it from my site.

Obviously my site's database is secure, and its security policy is such that even if a hackbot found an exploitable hole, updates couldn't be carried out by the website's login; however this does not mean that an attack vector (XSS or otherwise) couldn't be created and then exploited in future.

Do I risk the wrath of customers and let my security system carry on doing its job, blocking anyone trying to do my site harm even if it's a Google-by-proxy attack, or do I risk a potential future attack by ignoring attacks coming from supposedly safe IP addresses?

Answer

The answer to the problem came from the now standard way of testing whether a BOT really is a BOT, which you can read about on my page 4 Simple Rules Robots Won't Follow. It basically means doing a two-step verification process, a reverse DNS lookup followed by a forward lookup, to ensure the IP address that the BOT is crawling from belongs to the actual crawler and not someone else.

This method is also great if you have a table of IP/user-agents that you whitelist and the BOT suddenly starts crawling from a new IP range; without updating your table you can still make sure the BOT really is who it says it is.
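
As a sketch of what that two-step check looks like from the command line (66.249.66.1 is just an example Googlebot IP): do a reverse DNS lookup on the requesting IP, confirm the hostname ends in googlebot.com or google.com, then resolve that hostname forward again and confirm it points back to the original IP.

mydomain:~# host 66.249.66.1
mydomain:~# host crawl-66-249-66-1.googlebot.com

If the forward lookup of the name returned by the first command does not come back with the IP you started with, the "Googlebot" is an imposter and can be blocked with a clear conscience.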

Obviously it would be nice if Googlebot analysed all links before crawling them to ensure it is not being used to hack by proxy, but I cannot sit around waiting for them to do that.

I would be interested to know what other people think about this.