Tuesday 27 March 2012

Fixing The Postie Plugin for Wordpress to use Categories in the Subject line

Fixing the Postie WordPress plugin problem with categories in the Subject line

I needed a way of sending posts into a Wordpress site by email and I was looking at their default POP3 email format for email publishing but I soon realised that this wasn't good enough as I wanted to add categories to my article at the same time. The default way didn't seem to offer a solution so I looked for some plugins before writing my own.

I searched the web and came across a popular Wordpress plugin that lots of people seemed to rave about called Postie. There were a good few articles on how to set it up and get it working which were just what I needed and before long I was sending emails to a unique address to post articles on my site.


However, when it came to getting the categories in the subject line working, the plugin fell over like a drunk Irishman on St Patrick's Day! I was not happy.

The plugin was supposed to support the following formats which all relied on passing the categories in the subject line. If no categories were passed then the default category specified in the admin page was used instead.

The plugin allowed you to pass in Category IDs, partial category names or multiple categories, which all sounded wonderful. The formats were supposed to be:


  • Subject: This is my title
  • Subject: Rac: This is my title
  • Subject: -Racing- -Horse Racing- This is my title
  • Subject: [1] [Racing] [Horse Racing] This is my title


The first format is not passing in any category and any post would use the default setup for the plugin e.g "horse racing".

The second format would look for the first category in the system starting with the letters "Rac" e.g Racing.

The last two formats would allow you to pass in multiple categories in two different formats e.g Racing and Horse Racing either wrapped in hyphens -Racing- or square brackets [Racing].

In all instances the title for the blog posting would be the last part of the subject line e.g 'This is my title'.

However when I started testing the category formats it seemed that whatever format I tried I would always end up with a post title exactly the same as the subject line minus the word "Subject" e.g: from a subject line of "Subject: -Racing- -Horse Racing- This is my title" I would get a post title in my blog of "-Racing- -Horse Racing- This is my title". Not good and not what is advertised on the tin!

From a cursory look online it seemed other people were having the same issue and I posted a message to the forum but got no response. So yet again I delved into the code to fix the issue myself.

I could have written a new plugin myself as the logic isn't that hard, but why bother if only one method was screwed? However I find myself more and more disliking WordPress for the pure reason that relying on someone else's code is a pain for multiple reasons.

Not only do you have to wait ages or forever (if they have died, or stopped supporting it) for a response when bugs are found, but as with all code that isn't your own you have no clue how it works unless you spend considerable effort and time finding out.

If I had searched Google a bit harder first I might have found another solution which relied on totally rewriting the function which you can find here: Fixing Postie but alas I only spotted this after my own fix was in.

Therefore as I couldn't be arsed writing a whole new plugin I thought I would give half an hour over to debugging it on my local WAMP Server setup. Luckily it didn't take too long to find the problem.

The issue is in the following function within the file called postie-functions.php. This is the function which parses the subject line and returns any categories it can find in an array otherwise returning the default category.

function GetPostCategories(&$subject, $defaultCategory) {

From looking at the GetPostCategories function which runs a number of regular expression tests to retrieve the categories from the subject line I could see that the problem was down to the order they were in and that the first test which was to find single categories before a colon was always being matched whatever format was used.

This was down to the basic nature of the regular expression which split everything up into content before and after the colon. As the subject line always contained a colon whether you used multiple, single or no categories at all it meant this first test was always being matched.
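The behaviour is easy to reproduce outside WordPress. The sketch below is in JavaScript rather than PHP, since this kind of pattern behaves the same way in both. The pattern itself is an assumption about what the original first test looked like (anything before a colon is the category, the rest is the title), and parseWithColonPattern is a hypothetical helper name.

```javascript
// Assumed shape of the original first test: everything before ": " is
// treated as the category and everything after it as the title.
var colonPattern = /(.+): (.*)/;

// Hypothetical helper that applies the pattern to a raw subject line.
function parseWithColonPattern(subject) {
  var match = subject.match(colonPattern);
  return match ? { category: match[1], title: match[2] } : null;
}

var subjects = [
  "Subject: This is my title",
  "Subject: -Racing- -Horse Racing- This is my title",
  "Subject: [1] [Racing] [Horse Racing] This is my title"
];

// Every format matches, with "Subject" captured as the category and the
// untouched remainder (hyphens, brackets and all) captured as the title.
subjects.forEach(function (subject) {
  console.log(parseWithColonPattern(subject));
});
```

Because the colon test always succeeds, the hyphen and bracket tests that come after it are never reached.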

So to fix this I did the following.

  • Change the order of the tests within the function so that the most complex regular expressions were first and matched multiple categories either as -category- or [category].
  • Fix the first (now last) regular expression so that it didn't match the word "subject".
  • Ensure the word "Subject" is never returned as part of the title or as a category.
  • Test the code to prove it worked.
  • Publish it and test it on my live site.


To prove the fault and test the fix we can extract the GetPostCategories function from the postie-functions.php file and make a simpler version of it along with some test harnesses to call it with all the various subject formats. We can run this test PHP page either on our webserver or our local machine. As I have a Windows 7 box I used WAMP Server to test it.

The following test page code calls the function GetPostCategories multiple times with all the different subject formats e.g no category, a single category, multiple categories using both hyphens and square bracket formats.

To test it just create a testpostie.php page and paste in this code:

<?php


// should be no categories - so use default
$subject = "Subject: This is my title";

echo "call GetPostCategories with $subject <br>";

$cats = GetPostCategories($subject,"Default Category");

print_r($cats);
echo "<br>";

// should use the category MyCategory
$subject = "Subject: MyCategory: This is my title";

echo "call GetPostCategories with $subject <br>";

$cats = GetPostCategories($subject,"Default Category");

print_r($cats);
echo "<br>";


// should use the category MyCategory1 and MyCategory2
$subject = "Subject: -MyCategory1- -MyCategory2- This is my title";

echo "call GetPostCategories with $subject <br>";

$cats = GetPostCategories($subject,"Default Category");

print_r($cats);
echo "<br>";

// should use the category MyCategory1 and MyCategory2
$subject = "Subject: [MyCategory1] [MyCategory2] This is my title";

echo "call GetPostCategories with $subject <br>";

$cats = GetPostCategories($subject,"Default Category");

print_r($cats);
echo "<br>";

function GetPostCategories(&$subject, $defaultCategory) {   
    $post_categories = array();
    $matches = array();

    //try and determine category by running the most complicated tests first to look for multiple categories
    if (preg_match_all('/\[(.[^\[]*)\]/', $subject, $matches)) {
 echo "matched on first [cat]<br>";
        preg_match("/](.[^\[]*)$/",$subject,$subject_matches);
        $subject = trim($subject_matches[1]);
    }
    else if ( preg_match_all('/-(.[^-]*)-/', $subject, $matches) ) {
 echo "matched on second -cat-<br>";
        preg_match("/-(.[^-]*)$/",$subject,$subject_matches);
        $subject = trim($subject_matches[1]);
    }else if ( preg_match('/Subject\:\s*(.+): (.*)/i', $subject, $matches))  {
 echo "matched on third :cat<br>";
 $subject = trim($matches[2]);
        $matches[1] = array($matches[1]);
    }else{
 $subject = preg_replace('/Subject\:\s*(.*)/i', '$1',  $subject);
 echo "matched on last no cat<br>"; 
 $subject = trim($subject);        
    } 
     
    if (count($matches)) {
        foreach($matches[1] as $match) {
            $match = trim($match);
            $category =  $match;
           
     echo "Working on $match<br>"; 

          // we have removed the SQL that looks up the categories here

   // just add the category straight into the result for this test so comment out the if statement as category is always set
          //  if ($category) {
                $post_categories[] = $category;
          //  }
        }
    }
 echo "we have " . count($post_categories) . " cats from our subject<br>";
    if (!count($post_categories)) {
  echo "use default <br>";
        $post_categories[] =  $defaultCategory;
    }
 echo "subject is now '$subject'<br>";

    return($post_categories);
}

Things to notice in this test function.

  1. Debug statements to output what is being matched to the screen
  2. The Regular Expression that was the first test is now the third test.
  3. Any regular expression that mentions Subject has the i flag to make it case insensitive in case "subject:" is passed instead of "Subject:"
  4. Any WordPress dependent code has been removed including references to the $wpdb global object that runs database queries. This is not needed in our test function as:


  • a) We don't have the database object instantiated or any other WordPress code included
  • b) I am keeping the test page simple and.....
  • c) the SQL is not the problem.


If you run this test page you should get the following output:

call GetPostCategories with Subject: This is my title
matched on last no cat
we have 0 cats from our subject
use default 
subject is now 'This is my title'
Array ( [0] => Default Category ) 
call GetPostCategories with Subject: MyCategory: This is my title 
matched on third :cat
Working on MyCategory
we have 1 cats from our subject
subject is now 'This is my title'
Array ( [0] => MyCategory ) 
call GetPostCategories with Subject: -MyCategory1- -MyCategory2- This is my title 
matched on second -cat-
Working on MyCategory1
Working on MyCategory2
we have 2 cats from our subject
subject is now 'This is my title'
Array ( [0] => MyCategory1 [1] => MyCategory2 )
call GetPostCategories with Subject: [MyCategory1] [MyCategory2] This is my title 
matched on first [cat]
Working on MyCategory1
Working on MyCategory2
we have 2 cats from our subject
subject is now 'This is my title'
Array ( [0] => MyCategory1 [1] => MyCategory2 ) 

As you can see this test function now works with all the Category formats that Postie supports.

You can now just replace the regular expressions in the function with the correct code by copying the following function over the original and uploading to your server before running a test.


/**
  * This function determines categories for the post
  * @return array
  */
function GetPostCategories(&$subject, $defaultCategory) {
    global $wpdb;
    $post_categories = array();
    $matches = array();
    //try and determine category
    if (preg_match_all('/\[(.[^\[]*)\]/', $subject, $matches)) {  
        preg_match("/](.[^\[]*)$/",$subject,$subject_matches);
        $subject = trim($subject_matches[1]);
    }
    else if ( preg_match_all('/-(.[^-]*)-/', $subject, $matches) ) {
        preg_match("/-(.[^-]*)$/",$subject,$subject_matches);
        $subject = trim($subject_matches[1]);
    }else if ( preg_match('/Subject\:\s*(.+): (.*)/i', $subject, $matches))  {
  $subject = trim($matches[2]);
        $matches[1] = array($matches[1]);
    }else{
  $subject = preg_replace('/Subject\:\s*(.*)/i', '$1',  $subject);
  $subject = trim($subject);        
 } 
    if (count($matches)) {
        foreach($matches[1] as $match) {
            $match = trim($match);
            $category = NULL;

   // this code is a bit ropey but I am not re-writing the whole plugin
            $sql_name = 'SELECT term_id 
                         FROM ' . $wpdb->terms. ' 
                         WHERE name=\'' . addslashes($match) . '\'';
            
   $sql_id = 'SELECT term_id 
                       FROM ' . $wpdb->terms. ' 
                       WHERE term_id=\'' . addslashes($match) . '\'';
            
   $sql_sub_name = 'SELECT term_id 
                             FROM ' . $wpdb->terms. ' 
                             WHERE name LIKE \'' . addslashes($match) . '%\' limit 1';
                
            if ( $category = $wpdb->get_var($sql_name) ) {
                //then category is a name and found 
            } elseif ( $category = $wpdb->get_var($sql_id) ) {
                //then category was an ID and found 
            } elseif ( $category = $wpdb->get_var($sql_sub_name) ) {
                //then category is the start of a name and found
            }  
            if ($category) {
                $post_categories[] = $category;
            }
        }
    }
    if (!count($post_categories)) {
        $post_categories[] =  $defaultCategory;
    }
    return($post_categories);
}


And that should be that. We now have a Postie plugin that works and supports the category formats it said it did.

Saturday 24 March 2012

Logging and Suppressing JavaScript errors

Logging JavaScript errors to a file by overwriting the window.onerror method

Sometimes you may have intermittent JavaScript errors that you cannot re-produce or maybe you just want to be able to log JavaScript errors for later viewing. Or maybe you just want to suppress them so that end users don't see them.

By using the useful and also dangerous feature of being able to overwrite core JavaScript functions and objects you can utilise this to your advantage by overwriting the window.onerror method.

The window.onerror method takes 3 parameters which are:

  • message : the error message
  • url: the URL of the file that raised the error message
  • line: the line number that the error occurred on.


Therefore it is very easy to create your own window.onerror function to take these values and then make an AJAX call to a server side page which logs the JavaScript error details to a file or database or even sends an email.

Also, by overwriting the window.onerror function we can suppress JavaScript errors if we choose to, for example when we are debugging a script and don't want the error console constantly filled up, or when some of our users are still on Windows 98 and IE 4.

If we are using unobtrusive JavaScript that builds layer upon layer of functionality starting with the lowest common denominator e.g HTML, then adding JavaScript functionality if they have it, Flash if they have it and so on then we may want to just suppress these JavaScript errors in old browsers.

To suppress a JavaScript error you just need to return true.

This example uses jQuery seeing that is so popular but any AJAX library can be used. The point is that you are taking the error parameters and logging them somewhere useful.

I have used a little JavaScript wrapper object to set some system properties like a config object to define whether error logging is on or off and whether or not to suppress errors in older browsers. 

The code to define which browsers to suppress in can be left to you but I have done a simple test for document.getElementById which means browsers like IE 4 and NN4 won't get errors raised.


// Log JS errors to a file - file is overwritten each day only use when debugging a particular page/site

// Set up our global config options to decide whether to log JavaScript errors and whether or not to suppress them. A simple false/true could suffice but we might want to test for old browsers or certain features. If the browser doesn't support document.getElementById it's a pretty old browser!
var GlobalSettings = {
 SystemName : "Strictly-Software",
 Version : "2.0.1", // a string, as 2.0.1 is not a valid number literal
 LogJSErrors : true,
 SuppressJSErrors : (document.getElementById) ? false : true
};

// override the onerror object
window.onerror = function(msg, url, line)
{   
 // does our global system want to log errors - this could be a Client or Serverside setting
 if (GlobalSettings.LogJSErrors)
 {  
                // using JQuery to post a GET request to a page that logs the error details
  $.get("logJSError.php", { message: msg, errorlocation: url, lineno: line } );
 }
 

 // do we still raise the error or for old browsers which might have a lot of errors do we try and supresss them? Use our global config options again.
 if(GlobalSettings.SuppressJSErrors){  
  // return true to suppress the error so its not raised to the console.
  return true;
 }else{
  // return false to raise the error to the console.
  return false;
 } 
}

Then all you need is to define your server side page logJSError.php (or whatever language you are using) to collect the error data from the request and do whatever you want with it e.g log it somewhere for later viewing.
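As a rough idea of what that server side page might do, here is a minimal sketch written for Node.js rather than PHP. It reads the three values sent by the handler above (message, errorlocation and lineno) and appends them to a log file. The helper names buildLogLine and handleRequest, the jserrors.log file name and the 500 character cap are all assumptions made for illustration.

```javascript
var fs = require("fs");
var URL = require("url").URL;

// Hypothetical helper: turn the request URL sent by the onerror handler
// into a single tab separated log line.
function buildLogLine(requestUrl, now) {
  var params = new URL(requestUrl, "http://localhost").searchParams;
  // Strip line breaks and cap the length so one request cannot forge or flood log entries.
  function clean(value) {
    return String(value || "").replace(/[\r\n]+/g, " ").slice(0, 500);
  }
  return [
    now.toISOString(),
    clean(params.get("message")),
    clean(params.get("errorlocation")),
    clean(params.get("lineno"))
  ].join("\t");
}

// Hypothetical request handler: append the line to a log file (name assumed)
// and reply with an empty 204 response.
function handleRequest(req, res) {
  fs.appendFile("jserrors.log", buildLogLine(req.url, new Date()) + "\n", function () {
    res.writeHead(204);
    res.end();
  });
}

// To serve it you could use:
// require("http").createServer(handleRequest).listen(8080);
```

Whatever language you use, treat the values as untrusted input, as anybody can call the logging URL directly.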

Remember that whilst being able to overwrite functions that already exist is useful in certain situations, like this one and the Lazy Function scenario, it can also cause severe debugging nightmares like the one I discovered when using the common addEvent naming convention for cross-browser event handling.

Therefore be careful especially when overwriting core JavaScript features but also use them to your advantage when possible.
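One way to reduce the risk when overwriting window.onerror is to keep hold of any handler that was already installed and call it from the new one, instead of silently replacing it. The following is only a sketch of that idea; chainErrorHandlers and myLoggingHandler are hypothetical names.

```javascript
// Hypothetical helper: wrap an existing error handler instead of discarding it.
function chainErrorHandlers(previousHandler, newHandler) {
  return function (msg, url, line) {
    var suppressed = false;
    // Call the handler that was installed before ours, if there was one.
    if (typeof previousHandler === "function") {
      suppressed = previousHandler(msg, url, line) === true;
    }
    // Run the new handler too; the error is suppressed if either returns true.
    return newHandler(msg, url, line) === true || suppressed;
  };
}

// In a browser this would be installed with something like:
// window.onerror = chainErrorHandlers(window.onerror, myLoggingHandler);
```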


Friday 16 March 2012

WordPress - New Post page not loading and missing category list


Missing Categories and slow load time for New Post on Wordpress

I just had a weird issue with Wordpress in that every time I tried to open a new post to write an article the page would hang forever and when it eventually loaded no categories would appear in the list on the right.

The categories were definitely there so I don't know why they weren't loading.

However I went through all my plugins and disabled some and updated others and I stumbled across what seems to be the solution.

I had two SEO plugins installed - SEO  Ultimate and Yoast SEO for Wordpress.

I liked Yoast for the on-page SEO and Google term searching and I liked SEO Ultimate for the ability to edit .htaccess files and robots files as well as all the other reports (e.g 404) and other features.

However when I disabled SEO Ultimate the page suddenly worked. It loaded quickly and the categories all appeared in the sidebar.

I have no idea if these two SEO plugins were clashing or causing the problems I was experiencing but disabling the SEO Ultimate plugin seemed to fix the problem for me.

So if you have a similar issue and have both SEO plugins installed try disabling one or the other just to see if that fixes it for you as well.

Thursday 8 March 2012

Another MySQL Tuning Tool

MySQL Configuration and Tuning

One thing I really don't like about MySQL compared to MS SQL is the number of configurable options and the lack of dynamic management views to help you diagnose performance issues. Yes there are tools available but I have not come across anything for MySQL as good as MS SQL's Activity Monitor for an overview of your server's performance.

Being able to quickly see the primary cause of performance issues in graph form, high CPU, memory, I/O, blocking, the processes causing the blocking and those affected by the blocks, performance intensive queries and so on in a visual format is very useful.

Another great thing about MS SQL is the number of Dynamic Management Views (DMVs) and the reports you can build on them to find the most expensive queries, those with missing indexes, those that require tuning or index rebuilding, query plan re-use or under-use and so on.

Therefore when I come across anything that is semi useful for performance tuning MySQL I will make a note of it and list it on this blog so that other Microsoft developers using LAMP, WAMP etc can benefit from them as well.

One tool I came across tonight which has gone onto my Rackspace server alongside other MySQL configuration analysers is the MySQL Performance Tuning Primer Script which, along with others of a similar ilk, will analyse your database settings from a SHOW /*!50000 GLOBAL */ STATUS command and then make recommendations that you can use in your my.cnf file.

To install this script you will need to do the following.


  1. Open an SSH console window.
  2. Move to the right folder e.g cd /usr/local/src/
  3. Use WGet to load the file to your server e.g wget http://day32.com/MySQL/tuning-primer.sh
  4. Grant execute permission to the script e.g chmod u+x tuning-primer.sh
  5. Try running the script e.g ./tuning-primer.sh


You might get an error like I did on the first attempt which said:

Error: Command line calculator 'bc' not found!

If you don't know what bc is (which I didn't), it is the command line arbitrary precision calculator from GNU and it is obviously used within the shell script we are trying to run.

Therefore you will need to install this as well by using apt-get. So run this:

apt-get install bc

Which will install the bc app.

Now try again e.g

/usr/local/src# ./tuning-primer.sh

And you should get something like this:

Using login values from ~/.my.cnf
- INITIAL LOGIN ATTEMPT FAILED -
Testing for stored webmin passwords:
Could not auto detect login info!
Found potential sockets: /var/run/mysqld/mysqld.sock
Using: /var/run/mysqld/mysqld.sock
Would you like to provide a different socket?: [y/N] n
Do you have your login handy ? [y/N] : y
User: [enter your username e.g root or the user for the DB in question]
Password: [enter your password]

You will then get a report like the following and the option to create a new MySQL configuration file if you require it.


Would you like me to create a ~/.my.cnf file for you? [y/N] : n


-- MYSQL PERFORMANCE TUNING PRIMER --
- By: Matthew Montgomery -

MySQL Version 5.0.51a-24+lenny5-log x86_64

Uptime = 7 days 23 hrs 32 min 26 sec
Avg. qps = 10
Total Questions = 7286512
Threads Connected = 1

Server has been running for over 48hrs.
It should be safe to follow these recommendations

To find out more information on how each of these
runtime variables effects performance visit:
http://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html
Visit http://www.mysql.com/products/enterprise/advisors.html
for info about MySQL's Enterprise Monitoring and Advisory Service

SLOW QUERIES
The slow query log is enabled.
Current long_query_time = 2 sec.
You have 178496 out of 7286569 that take longer than 2 sec. to complete
Your long_query_time seems to be fine

BINARY UPDATE LOG
The binary update log is NOT enabled.
You will not be able to do point in time recovery
See http://dev.mysql.com/doc/refman/5.0/en/point-in-time-recovery.html

WORKER THREADS
Current thread_cache_size = 8
Current threads_cached = 7
Current threads_per_sec = 0
Historic threads_per_sec = 0
Your thread_cache_size is fine

MAX CONNECTIONS
Current max_connections = 100
Current threads_connected = 1
Historic max_used_connections = 15
The number of used connections is 15% of the configured maximum.
Your max_connections variable seems to be fine.

No InnoDB Support Enabled!

MEMORY USAGE
Max Memory Ever Allocated : 153 M
Configured Max Per-thread Buffers : 262 M
Configured Max Global Buffers : 114 M
Configured Max Memory Limit : 376 M
Physical Memory : 1.01 G
Max memory limit seem to be within acceptable norms

KEY BUFFER
Current MyISAM index space = 221 M
Current key_buffer_size = 64 M
Key cache miss rate is 1 : 3110
Key buffer free ratio = 14 %
You could increase key_buffer_size
It is safe to raise this up to 1/4 of total system memory;
assuming this is a dedicated database server.

QUERY CACHE
Query cache is enabled
Current query_cache_size = 40 M
Current query_cache_used = 20 M
Current query_cache_limit = 2 M
Current Query cache Memory fill ratio = 50.83 %
Current query_cache_min_res_unit = 4 K
MySQL won't cache query results that are larger than query_cache_limit in size

SORT OPERATIONS
Current sort_buffer_size = 2 M
Current read_rnd_buffer_size = 256 K
Sort buffer seems to be fine

JOINS
Current join_buffer_size = 132.00 K
You have had 89958 queries where a join could not use an index properly
You should enable "log-queries-not-using-indexes"
Then look for non indexed joins in the slow query log.
If you are unable to optimize your queries you may want to increase your join_buffer_size to accommodate larger joins in one pass.

Note! This script will still suggest raising the join_buffer_size when
ANY joins not using indexes are found.

OPEN FILES LIMIT
Current open_files_limit = 1024 files
The open_files_limit should typically be set to at least 2x-3x
that of table_cache if you have heavy MyISAM usage.
Your open_files_limit value seems to be fine

TABLE CACHE
Current table_cache value = 200 tables
You have a total of 146 tables
You have 200 open tables.
Current table_cache hit rate is 6%, while 100% of your table cache is in use
You should probably increase your table_cache

TEMP TABLES
Current max_heap_table_size = 200 M
Current tmp_table_size = 200 M
Of 985542 temp tables, 44% were created on disk
Perhaps you should increase your tmp_table_size and/or max_heap_table_size
to reduce the number of disk-based temporary tables
Note! BLOB and TEXT columns are not allow in memory tables.
If you are using these columns raising these values might not impact your ratio of on disk temp tables.

TABLE SCANS
Current read_buffer_size = 128 K
Current table scan ratio = 998 : 1
read_buffer_size seems to be fine

TABLE LOCKING
Current Lock Wait ratio = 1 : 1205
You may benefit from selective use of InnoDB.
If you have long running SELECT's against MyISAM tables and perform frequent updates consider setting 'low_priority_updates=1'
If you have a high concurrency of inserts on Dynamic row-length tables consider setting 'concurrent_insert=2'.

Then you can take the recommendations and change the MySQL configuration file yourself in /etc/mysql/my.cnf before restarting the MySQL server so that the changes take effect e.g.


/usr/local/src# /etc/init.d/mysql restart
Stopping MySQL database server: mysqld.
Starting MySQL database server: mysqld.
Checking for corrupt, not cleanly closed and upgrade needing tables..

Just another MySQL report of the same ilk as mysqltuner.pl and mysqlreport but still not as good as anything I have seen on MS SQL 2005-2008.

Saturday 3 March 2012

The Wordpress Survival Guide Part 3 - Security

This is the 3rd part of the Wordpress Survival Guide which looks at security measures.

The other two guides which cover basics for people new to Linux, Apache and Wordpress and Performance can be found here:

The Wordpress Survival Guide Part 1 - Linux, Apache and Wordpress
The Wordpress Survival Guide Part 2 - Performance

If you have an under powered or busy server then security and performance go hand in hand as reducing the amount of traffic from bad bots, hackbots, spammers, login hackers, heavy hitters and so on will also help reduce the load on your server.

There are many plugins out there which claim to help the security on Wordpress but you should be careful as from my own investigation of the code many of these plugins whilst protecting you from potential threats can reduce your sites performance as they carry out too many checks on submitted fields.

If a plugin is checking every form element submitted to the server for hundreds of known SQL injection or XSS hacks with regular expressions or string checks then this can slow down a page load incredibly.

Therefore the further up the chain you can push your security checks from PHP code running in Wordpress to the actual web server the better.

The aim is to move as much blocking code away from your site to your server so we want to make use of our firewall and our .htaccess file by adding a number of rules designed to identify and block potential hackers and spammers before they get to your site and any plugin code.

Blocking with our LINUX Firewall

Once you have found persistent offenders using the methods listed below, the aim is to stop WordPress and your .htaccess file wasting any CPU and memory on them by blocking them in your LINUX Firewall instead.

You can install a tool on your server called Fail2Ban which will analyse your log files for you, looking for spammers, hackers and bandwidth wasters, and add them automatically to your IPTables (Network Firewall).

However you should read up on it carefully and configure it correctly so that you don't end up blocking yourself sending emails into WordPress or other actions.

The higher up you can block the bad traffic the better. Therefore read this article on how you can block bad BOTS and users by the WebMin interface.

Blocking with the .htaccess file

The .htaccess file sits in your websites root folder and contains rules local to the site which can allow or deny users to your site by blocking certain requests either by IP address, user-agent, or the type of request the user is making. 

I used to return a 403 forbidden status code to the people I wanted to block but I am now trying out a different format which seems to have increased performance. I suspect this might be down to the users of malicious bots seeing a 403 Forbidden code as a "challenge" to crack rather than a sign they should go away therefore I have replaced returning 403 with a 404 code.

As there doesn't seem to be a quick flag like [F] for 403 to use you should create your own 404 page which should contain very basic HTML and no references to any Wordpress include files or other code that could be loaded in.

At the top of your page you put some PHP to return the 404 status code. An example is below.


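<?php
// send the 404 status line for mod_php, plus the Status header used by CGI/FastCGI setups
header($_SERVER["SERVER_PROTOCOL"] . " 404 Not Found", true, 404);
header("Status: 404 Not Found");
?>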
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US">
<head profile="http://gmpg.org/xfn/11">
<title>My Website</title>
<link rel="canonical" href="http://www.mywebsite.com/" />
</head>
<body id="home">
 <div id="page-wrap">
  <div id="header">  
   <p id="slogan">This is my website</p>
  </div>
 </div>
 <div id="body">
  <h1>My Website</h1>

  <p>Sorry that page doesn't exist</p>
 </div>
</body>
</html>

The idea is to have a quick loading page that returns a basic response rather than a blank one so that crawlers think they have just made a mistake and that the URL they are targeting doesn't exist. A blank response or a forbidden status could signify to them that you have caught them out. The PHP at the top ensures a 404 status code is returned.

Once you have created the custom 404.php page and put some basic text in it upload it to the root of your website.

Now you can edit your .htaccess file and change some of the main checks we are going to do so that they redirect the bot to the custom 404.php page and not the Wordpress 404 page.

We don't want to get ourselves in a big loop of circular redirects which is why we check for the 404.php page on our 2nd block of rules.

The first set of rules block common SQL injection attacks, common XSS hacks which include passing JavaScript in the querystring, known file lookups as well as calls to certain applications which should never be accessible from the webserver but sometimes are.

The second block is aimed at known bad bot user-agents, common HTTP libraries such as CURL, WGet, Snoopy and other libraries which are usually downloaded by Script Kiddies and used without any modification.

A proper hacker or spammer will mask themselves a lot better than this but these rules will stop the wannabes and baby nobs that have no clue what they are doing but still overload your server.

I also then block blank, very short or gibberish user-agents as I believe if the user cannot tell me who they are then I don't want them on my site. It is up to you whether you decide you want people masking themselves in this way to access your servers. You will notice that on this section I still use the [F] forbidden flag and return a 403 code.

The last block are known email harvesters and spammers which I redirect off to a honeypot to be logged and blocked by a proper tool designed to catch out email harvesting bots.

I have found that a good set of rules can reduce traffic to a server by over 50% which is obviously a major performance benefit and since I have changed my first two sets of rules from returning 403 to 404 codes the response time of my server and the sites upon it has improved.
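Before putting user-agent rules like these live it is worth running them against some real user-agent strings from your logs, as a pattern that is too broad will lock out genuine visitors. The sketch below copies the two HTTP library patterns from the second block of rules into JavaScript regular expressions, one case insensitive to mirror the [NC] flag and one case sensitive. The function name isBlockedUserAgent and the sample user-agents are made up for illustration, and JavaScript's regex engine is only an approximation of Apache's.

```javascript
// Patterns copied from the two HTTP library conditions in the second block of rules.
var caseInsensitiveLibraries = /(?:ColdFusion|Jakarta|HTTPClient|Java|libwww\-perl|Nutch|PycURL|Python|Snoopy|urllib)/i;
var caseSensitiveLibraries = /(?:LWP|PECL|POE|PycURL|WinHttp|curl|Wget)/;

// Hypothetical helper: would either condition match this user-agent?
function isBlockedUserAgent(userAgent) {
  return caseInsensitiveLibraries.test(userAgent) || caseSensitiveLibraries.test(userAgent);
}

// Made-up sample user-agents; swap in real ones from your access logs.
var samples = [
  "curl/7.21.0 (x86_64-pc-linux-gnu)",
  "Python-urllib/2.7",
  "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0"
];

samples.forEach(function (userAgent) {
  console.log((isBlockedUserAgent(userAgent) ? "BLOCK " : "allow ") + userAgent);
});
```

The same approach can be used for the longer lists, where short words such as "auto" or "zip" are the most likely to catch a user-agent you did not intend to block.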

<IfModule mod_rewrite.c>
Options +FollowSymlinks
RewriteEngine On
RewriteBase /
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{QUERY_STRING} (%3C|<)/?script(%3E|>)    [NC,OR]
RewriteCond %{QUERY_STRING} (eval\(|document\.|\.cookie|createElement)    [NC,OR]
RewriteCond %{QUERY_STRING} DECLARE[^a-z]+\@\w+[^a-z]+N?VARCHAR\((?:\d{1,4}|max)\)    [NC,OR]
RewriteCond %{QUERY_STRING} ^/+\?.*sys.?(?:objects|columns|tables|[xs]p_|exec)    [NC,OR]
RewriteCond %{REQUEST_URI} ^\/\/?(owssvr|strmver|Auth_data|redirect\.adp|MSOffice|DCShop|msadc|winnt|system32|script|autoexec|formmail\.pl|_mem_bin|NULL\.) [NC,OR]
RewriteCond %{REQUEST_URI} ^\/\/?(php\-?my\-?admin\-?\d?|P\/?M\/?A(\d+)?|(db|web)?(admin|db|sql)|(my)?sql\-?(admin|manager|web)?)/? [NC]
RewriteRule ^.*$ /404.php [R=301,L]


# ensure we are not already on our 404.php page
RewriteCond %{REQUEST_FILENAME} !/404\.php
# common HTTP libraries
RewriteCond %{HTTP_USER_AGENT} (?:ColdFusion|Jakarta|HTTPClient|Java|libwww\-perl|Nutch|PycURL|Python|Snoopy|urllib) [NC,OR]
# case sensitive HTTP libraries
RewriteCond %{HTTP_USER_AGENT} (?:LWP|PECL|POE|PycURL|WinHttp|curl|Wget) [OR]
# known rippers
RewriteCond %{HTTP_USER_AGENT} (?:ati2qs|cz32ts|EventMachine|indy|linkcheck|Morfeus|NV32ts|Pangolin|Paros|ripper|scanner|offline) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (?:AcoiRobot|alligator|auto|bandit|boardreader|BCD2000|blackwidow|capture|ChinaClaw|collector|copier|disco|devil|downloader|fetch|flickbot|grabber|gosospider|Gentoo|HTMLParser|hapax|hook|igetter|jetcar|JS-Kit|kame-rt|kmbot|KKman|leach|majestic|MetaURI|mole|miner|mirror|mxbot|rogerbot|race|reaper|sauger|speedy|Sogou|sucker|snake|spinn3r|Sosospider|stripper|UnwindFetchor|vampire|whacker|xenu|zeus|zip) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (?:AhrefsBot|fairshare|proxy|PageGetter|magpie|Zemanta|baidu|MiniRedir|SurveyBot|PMAFind|SolomonoBot|whitehat|blackhat|MSIE\s6\.0|ZmEu) [NC]
RewriteRule ^.*$ /404.php [R=301,L]

# Block blank or very short user-agents. If they cannot be bothered to tell me who they are or provide gibberish then they are not welcome!
RewriteCond %{HTTP_USER_AGENT} ^(?:-?|[a-z1-9\-\_]{1,10})$ [NC]
RewriteRule .* - [F,L]

# fake referrers and known email harvesters which I send off to a honeytrap full of fake emails
# spambots, email harvesters
RewriteCond %{HTTP_USER_AGENT} (?:atomic|collect|e?mail|magnet|reaper|siphon|sweeper|harvest|(?:microsoft\surl\scontrol)|wolf) [NC,OR]
RewriteCond %{HTTP_REFERER} ^[^?]*(?:iaea|\.ideography|addresses)(?:\.co\.uk|\.org|\.com) [NC]
RewriteRule ^.*$ http://english-61925045732.spampoison.com [R,L]

</IfModule>
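
Before putting user-agent rules like these live, it can help to try the patterns against a few sample user-agent strings to see who would be caught. The following is only a rough sketch in Python, on the assumption that Python's re module treats these simple patterns the same way Apache's regex engine does. The classify function and the cut-down library list are my own illustration, not part of the rules above.

```python
import re

# Pattern copied from the "blank or very short user-agent" rule.
# Apache's [NC] flag corresponds to re.IGNORECASE here.
SHORT_UA = re.compile(r"^(?:-?|[a-z1-9\-\_]{1,10})$", re.IGNORECASE)

# A cut-down version of the HTTP library rule (only a few names kept).
HTTP_LIBS = re.compile(r"(?:libwww\-perl|Python|Snoopy|urllib)", re.IGNORECASE)


def classify(user_agent):
    """Return the response the rules would roughly produce for a user-agent."""
    # The library rule comes first in the .htaccess file, so check it first.
    if HTTP_LIBS.search(user_agent):
        return "404"
    if SHORT_UA.search(user_agent):
        return "403"
    return "allow"


samples = [
    "",
    "-",
    "Python-urllib/2.7",
    "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0",
]
for ua in samples:
    print(repr(ua), "->", classify(ua))
```

A check like this is no substitute for watching your real logs, but it is a quick way to spot a pattern that is too greedy before it blocks genuine visitors.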


You should check your website's access and error logs regularly to see who has been getting banned or 404'd a lot by these rules, and then you can decide whether to block their IP address by adding it to your .htaccess file like so.



# example IP - not known to be bad
order allow,deny
allow from all
deny from 81.24.210.2

These rules deny all requests from a particular IP address.

Or you can add it to your firewall if it's making a large number of calls to your site.
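
To find the worst offenders without reading the log by eye, you can total up the 403 and 404 responses per IP address. This is a minimal sketch in Python that assumes your access log uses the standard Apache common or combined format. The count_blocked function and the sample lines (which use documentation-only IP ranges) are made up for illustration.

```python
import re
from collections import Counter

# Matches the start of a common/combined log line:
# client IP, the timestamp, the quoted request, then the status code.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "[^"]*" (\d{3}) ')


def count_blocked(lines, statuses=("403", "404")):
    """Count responses with the given status codes per client IP."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group(2) in statuses:
            hits[match.group(1)] += 1
    return hits


# Made-up sample lines in combined log format.
sample = [
    '192.0.2.10 - - [27/Mar/2012:10:00:00 +0000] "GET /phpmyadmin/ HTTP/1.1" 404 210 "-" "ZmEu"',
    '192.0.2.10 - - [27/Mar/2012:10:00:01 +0000] "GET /pma/ HTTP/1.1" 404 210 "-" "ZmEu"',
    '198.51.100.7 - - [27/Mar/2012:10:00:02 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '192.0.2.55 - - [27/Mar/2012:10:00:03 +0000] "GET / HTTP/1.1" 403 199 "-" "-"',
]

for ip, total in count_blocked(sample).most_common():
    print(ip, total)
```

To run it against a real log you would pass an open file instead of the sample list. The location of that file depends on your server setup, so check where your own Apache writes its access log.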

One trick I do like, for which I have to thank a commenter, RobinInTexas, is a rule which sends the BOT back to the IP address it came from in the first place. However, I make two changes to the rule he sent, which used:

http://%{REMOTE_ADDR} [L,R=301]

The first is to return them to their localhost address 127.0.0.1, NOT the IP they came from, as many people will be going through gateways, such as people at work, phone or tablet users and people on WiFi systems in shops etc.

You don't want the ISP that this gateway belongs to thinking you are sending it lots of hackers and bad bots as you might get your site blocked for sending so much traffic to it.

The other change is to make it a 302 temporary redirect instead of a 301 as it is the correct status code to use. So instead of the rule above use this.

http://127.0.0.1 [L,R=302]

You could decide to send them to a honeypot website that logs them as a bad BOT so other users know about them or even some sites that are designed to keep them crawling for days wasting time going through links that lead nowhere but to other links that lead nowhere etc.

The rule in action would look like this:


RewriteCond %{HTTP_USER_AGENT} (?:Spider|MJ12bot|seomax|atomic|collect|e?mail|magnet|reaper|tools\.ua\.random|siphon|sweeper|harvest|(?:microsoft\surl\scontrol)|wolf) [NC]
RewriteRule .* http://127.0.0.1 [L,R=302]




2. Blocking SSH Attacks with DenyHosts


Install DenyHosts if you haven't already, which will block SSH attacks on your server. It is amazing how many people I have blocked since installing this application, and people are always on the lookout for new webservers on known cloud hosting IP ranges like Rackspace to attack and hopefully compromise.

To install this you open an SSH connection (with Putty) and run the following commands.

apt-get install denyhosts

To watch what it is doing from the console, tail the DenyHosts log file.

tail -f /var/log/denyhosts

This will show you the tail end of the DenyHosts log file and any newly added IP addresses.

Make sure to add any IP addresses that you use to access your server console by SSH to the allowed hosts list, which you can do from the terminal in vi or from Webmin by going to:

Webmin > Networking > TCP Wrappers > AllowedHosts > Add New Rule

Fill out the form like so:

Services: ALL
Remote hosts: Tick the second radio button and add the IP address to the text input
Shell Commands: none
Save the form.


3. Stop Being Open To SSH / BASH Hack Attacks

Also, to stop your server being vulnerable to hacks like Shellshock, which appeared recently and exposed nearly every Linux machine through services such as SSH and CGI that pass data to BASH, you should do the following.

Test if you are vulnerable by running this command.

env x='() { :;}; echo vulnerable' bash -c "echo this is a test"

If the word "vulnerable" is printed then these are some things you can do.

Turn off BASH and install DASH, a smaller alternative shell.

If you are using DASH and want to run BASH, just type bash to get there.

Also point the default shell for root and any other users at another shell, which can be done with symbolic links. Look up on the web how to do this.

Disable any cgi-bin commands in all Apache config files as this is what the hack relies on e.g

#ScriptAlias /cgi-bin/ /home/searchmysite/cgi-bin/
 
#
#allow from all
#

Remove AWStats and Webalizer for all Virtualmin sites. These rely on CGI-BIN as well.

Regularly change all your user passwords and especially your root password.


A good technique for a strong password is to think of a common sentence or phrase you will remember, mix the characters up and add a number on the end that only you would remember (not your birthday!) e.g. the year of a football team's last trophy win or of your last holiday.

Add some dashes or underscores in as well to make it even harder for password crackers to crack it with dictionary attacks. An example would be.

hOWnOWbROWNcOW__1995**
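
The steps behind that example can be sketched in a few lines of Python. This is only an illustration of the mixing technique. The build_password function and its separator and suffix defaults are my own assumptions, and you should of course pick your own phrase rather than reuse this one.

```python
def build_password(phrase, year, separator="__", suffix="**"):
    """Turn a memorable phrase into a mixed-case password (illustration only)."""
    # Drop the spaces, then flip the case of every letter.
    mixed = phrase.replace(" ", "").swapcase()
    # Append underscores, a number only you would remember, and extra symbols.
    return mixed + separator + year + suffix


print(build_password("How Now Brown Cow", "1995"))
```

Flipping the case is just one way of mixing the characters up. The point is that the result is long, easy for you to rebuild from memory and does not appear in a dictionary.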

Regularly check your users table for any that look out of place e.g inserted by a hacker.

Also regularly check your home and temp folders for any files that shouldn't be there. One hack I saw replaced the default SSH config file with a temp file in /tmp/sh that loaded up (using WGET) a file hidden in a website. That file then ran more WGET commands to load in a library of hacks for DDOS and SSH etc and then ran the commands the hacker wanted.

With a compromised server and an SSH config file that had been overwritten he could then use your server to run hack attacks on other machines.

If this is happening, quickly get the IP of the site he is loading the files from and block incoming and outgoing TCP requests in your firewall. Then get a default SSH config file and replace the hacked version before changing all your passwords and ensuring BASH isn't available to be used in a hack.

You can check whether anyone who shouldn't be logged into your machine is by running the who or w commands, and look for processes that shouldn't be running with the ps ax command.



4. Using Wordpress Plugins to block dangerous traffic.

Two plugins I have found quite useful so far for reducing hack attacks are these:


The Limit Login Attempts plugin, which blocks brute force attacks on the wp-login page. If you don't want people signing up to your site you should use a plugin to obfuscate this page; otherwise just limit the number of failed attempts so that dictionary attacks are prevented.

http://wordpress.org/extend/plugins/limit-login-attempts/

Use the IP addresses this plugin collects and take the worst offenders and put them in your DENY HOSTS table as well as considering banning them with your LINUX firewall. Read this article for more information on banning bad BOTS and blocking hackers and scrapers.

Install the WordPress Firewall plugin to block certain hack attempts and be notified by email when attacks occur. Make sure to add any IP address you access your website from to the whitelist so you don't get locked out.

This plugin will look for some of the same tricks our .htaccess file rules are aimed at blocking as well as some different types of attack that are used when form parameters are filled with dangerous values and submitted to the server.

http://wordpress.org/extend/plugins/wordpress-firewall-2/

There are other things you can do as well but these 3 tips are a good starting point. I will update this page as and when new features are proven to increase security without affecting site performance.


5. Using other tools on your server to add rules to DenyHosts and your Firewall

There is a tool you can use on Linux machines called Fail2Ban, which Rackspace and other cloud hosts actually recommend using. It will constantly analyse your access and error logs and add IP addresses which it thinks are suspicious to your DenyHosts list and your firewall's IP table.

However, be warned: I used it myself and tried some of the email rules. I then found my own IP being blocked, along with emails sent from my own computer to my server.

I then tried removing these rules from the configuration and still ran into problems (not immediately, so it may not have been Fail2Ban's fault): emails sent from my PC to my WordPress site, where a plugin called Postie puts them into the system as articles, stopped working.

In the end I had to remove Fail2Ban from my server. However, if you are not doing anything like I am, or can configure it properly (I may have made a mistake), then it could be the tool for you, as it will save time adding rules to DenyHosts and to the IP table your firewall uses.


Read Part 1 - An Overview
Read Part 2 - Performance