Sunday 3 March 2013

Stop BOTS and Scrapers from bringing your site down

Blocking Traffic using WebMin on LINUX at the Firewall

If you have read my survival guides on WordPress you will know that you have to do a lot of work just to get a stable and fast site, due to all the code that comes bundled with it.

The Wordpress Survival Guide

  1. Wordpress Basics - Tools of the trade, useful commands, handling emergencies, banning bad traffic.
  2. Wordpress Performance - Caching, plugins, bottlenecks, Indexing, turning off features
  3. Wordpress Security - plugins, htaccess rules, denyhosts.


For instance, not only do you have to handle badly written plugins that can contain security holes and slow your site down, but the core WordPress codebase is, in my opinion, a very badly written piece of code itself.

However, they are slowly learning. I remember that only a few versions back there were over 200 queries being run on the home page, most of them returning single rows.

For example, if you used a plugin like Debug Queries you would see lots of SELECT statements on your homepage, one returning a single row for each post shown, plus separate queries for the META data, categories and tags associated with each post.

So instead of one query that returned the whole data set for the page (post data, category, tag and meta data), it was filled with lots of single-row queries like this.

SELECT wp_posts.* FROM wp_posts WHERE ID IN (36800)

However they have improved their code, and a recent check of one of my sites showed that although they are still using separate queries for the post, category/tag and meta data, they are at least getting all of the records in one go e.g

SELECT wp_posts.* FROM wp_posts WHERE ID IN (36800,36799,36798,36797,36796)

So the total number of queries has dropped, which aids performance. However, in my opinion they could write one query for the whole page that returned all the data they needed, and hopefully in a future version they will.
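
Just to illustrate the idea (this is not a query WordPress actually runs), a single combined query for the post, meta and term data might look something like the sketch below - although it would repeat the post columns for every meta/term row returned, which is presumably one reason they keep the queries separate.

SELECT p.*, m.meta_key, m.meta_value, t.name AS term_name, tt.taxonomy
FROM wp_posts p
LEFT JOIN wp_postmeta m ON m.post_id = p.ID
LEFT JOIN wp_term_relationships tr ON tr.object_id = p.ID
LEFT JOIN wp_term_taxonomy tt ON tt.term_taxonomy_id = tr.term_taxonomy_id
LEFT JOIN wp_terms t ON t.term_id = tt.term_id
WHERE p.ID IN (36800,36799,36798,36797,36796)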

However, one of the things that will kill a site like WordPress is the number of BOTS that hit you all day long. These could be good BOTS like GoogleBOT and BingBOT, which crawl your site to work out where it should appear in their search engines, or they could be social media BOTS that chase any link Twitter shows, or scrapers trying to steal your data.

One thing you can try to stop legitimate BOTS like Google and BING from hammering your site is to set up a Webmaster Tools account with Google and then change the Crawl Rate to a much slower one.

You can also do the same with BING and their Webmaster Tools account. BING also apparently respects the robots.txt Crawl-delay directive e.g


Crawl-delay: 3


This supposedly tells BOTS that respect robots.txt directives that they should wait 3 seconds between requests. However as far as I know only BING supports this at the moment, and it would be nice if more SERP BOTS did in future.

If you want a basic C# robots.txt parser that will tell you whether your agent can crawl a page on a site and extract any Sitemap directive, then check out > http://www.strictly-software.com/robotstxt. If you wanted to extend it to handle the Crawl-delay directive it wouldn't be hard to add (line 175 in Robot.cs), so that you could extract and respect it when crawling yourself.

Obviously you want all the SERP BOTS like GoogleBot and Bingbot to crawl you, but there are so many Social Media BOTS and spammers out there nowadays that they can literally hammer your site into the ground no matter how many caching plugins and .htaccess rules you put in to return 403 codes.

The best way to deal with traffic you don't want hitting your site is to block it as high up the chain as possible.

Just leaving WordPress to deal with it means the overhead of PHP code running, include files being loaded, regular expressions being run to test for harmful parameters and so on.

Moving it up to the .htaccess level is better, but it still means your webserver has to process all the .htaccess rules in your file to decide whether or not to let the traffic through.

Therefore if you can move the worst offenders up to your Firewall then it will save any code below that level from running and the TCP traffic is stopped before any regular expressions have to be run elsewhere.

Therefore what I tend to do is follow this process:


  • Use the WordPress plugin "Limit Login Attempts" to log people trying to log in (without permission) to my WordPress website. This logs all the IP addresses that have attempted and failed as well as those that have been blocked. This is a good starting list for your DenyHosts IP ban table.
  • Check the same IPs, as well as using the command: tail -n 10000 access_log|cut -f 1 -d ' '|sort|uniq -c|sort -nr|more  to see which IP addresses are visiting my site the most each day.
  • I then check the log files, either in WebMin or in an SSH tool like PUTTY, to see how many times they have been trying to visit my site. If I see lots of HEAD or POST/GET requests within a few seconds from the same IP I will investigate them further. I will do an nslookup and a whois (example commands just after this list) and see how many times the IP address has been visiting the site.
  • If they look suspicious e.g the same IP with multiple user-agents, or lots of requests within a short time period, I will consider banning them. Anyone using IE 6 as a user-agent is a good suspect (who uses IE 6 anymore apart from scrapers and hackers!)
  • I will then add them to my .htaccess file and return a [F] (403 status code) to all their requests.
  • If they keep hammering my site I will then move them from the DENY list in my .htaccess file to my firewall and DenyHosts table.
  • The aim is to move the most troublesome IPs and BOTS up the chain so they cause the least damage to your site.
  • Using PHP to block access is not good as it consumes memory and CPU. The .htaccess file is better, but still requires APACHE to run the regular expressions on every DENY or [F] rule. Therefore the most troublesome users should be moved up to the Firewall level so they cause the least server usage on your system.
  • Regularly stop APACHE and run the REPAIR and OPTIMIZE options on your MySQL tables to de-frag the indexes and ensure the tables are performing as well as possible. I have many articles on this site about other free tools which can help you increase your WordPress site's performance.
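
For reference, the look-ups mentioned in the list above are just the standard LINUX commands, shown here against one of the IPs that ends up blocked later in this post:

nslookup 91.207.8.110
whois 91.207.8.110
grep -c "91.207.8.110" access_log   # how many requests that IP has made in the current access log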

In More Details

You should regularly check the access log files for the IPs hitting your site the most, check them out with a reverse DNS tool to see where they come from, and if they are of no benefit to you (e.g not a SERP or Social Media agent you want hitting your site) then add them to your .htaccess file under the DENY commands e.g

order allow,deny
allow from all
deny from 208.115.224.0/24
deny from 37.9.53.71
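
You can also return a 403 based on the user-agent rather than the IP in the same .htaccess file, which is handy for the IE 6 style scrapers mentioned above. This is only a rough sketch, assuming mod_rewrite is enabled, and the agent string is purely an example:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "MSIE 6\.0" [NC]
RewriteRule .* - [F,L]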

Then if I find they are still hammering my site after a week or a month of getting 403 responses and ignoring them, I add them to the firewall in WebMin.


Blocking Traffic at the Firewall level

If you use LINUX and have WebMin installed it is pretty easy to do.

Just go to the WebMin panel and under the "Networking" menu is an item called "Linux Firewall". Select that and a panel will open up with all the current IP addresses, ports and packets that are allowed or denied access to your server.

Choose the "Add Rule" command or if you have an existing Deny command you have setup then it's quicker to just clone it and change the IP address. However if you don't have any setup yet then you just need to do the following.

In the window that opens up just follow these steps to block an IP address from accessing your server.

In the Chain and Action Details Panel at the top:


Add a Rule Comment such as "Block 71.32.122.222 Some Horrible BOT"
In the Action to take option select "Drop"
In the Reject with ICMP Type select "Default"

In Condition Details Panel:

In source address of network select "Equals" and then add the IP address you want to ban e.g 71.32.122.222
In network protocol select "Equals" and then "TCP"

Hit "Save"

The rule should now be saved and your firewall will ban all TCP traffic from that IP address by dropping any packets from it as soon as they arrive.
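
Under the hood the "Linux Firewall" module in WebMin is just managing iptables, so if you prefer the command line the rough equivalent of the rule above would be something like the following (note that for rules added through the GUI you may also need to hit "Apply Configuration" in WebMin before they take effect):

# insert a DROP rule at the top of the INPUT chain for the offending IP
iptables -I INPUT 1 -s 71.32.122.222 -p tcp -j DROP
# list the INPUT chain with line numbers so you can check the rule order
iptables -L INPUT -n --line-numbers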

Watch as your performance improves and the number of 403 status codes in your access files drops - until the next horrible social media BOT comes on the scene and tries scraping all your data.

IMPORTANT NOTE

WebMin isn't very clear on this and I found out the hard way by noticing that IP addresses I had supposedly blocked were still appearing in my access log.

You need to make sure all your DENY RULES are above the default ALLOW rules in the table WebMin will show you.

Therefore your rules to block bad bots and IP addresses that are hammering away at your server - which you can spot in PUTTY with a command like this:
tail -n 10000 access_log|cut -f 1 -d ' '|sort|uniq -c|sort -nr|more

should be put above all your other rules e.g:


Drop If protocol is TCP and source is 91.207.8.110
Drop If protocol is TCP and source is 95.122.101.52
Accept If input interface is not eth0
Accept If protocol is TCP and TCP flags ACK (of ACK) are set
Accept If state of connection is ESTABLISHED
Accept If state of connection is RELATED
Accept If protocol is ICMP and ICMP type is echo-reply
Accept If protocol is ICMP and ICMP type is destination-unreachable
Accept If protocol is ICMP and ICMP type is source-quench
Accept If protocol is ICMP and ICMP type is parameter-problem
Accept If protocol is ICMP and ICMP type is time-exceeded
Accept If protocol is TCP and destination port is auth
Accept If protocol is TCP and destination port is 443
Accept If protocol is TCP and destination ports are 25,587
Accept If protocol is ICMP and ICMP type is echo-request
Accept If protocol is TCP and destination port is 80
Accept If protocol is TCP and destination port is 22
Accept If protocol is TCP and destination ports are 143,220,993,21,20
Accept If protocol is TCP and destination port is 10000


If you have added loads at the bottom then you might need to copy the IPTables list out to a text editor, change the order by putting all the DENY rules at the top, and then re-save the whole list to your server and apply the firewall configuration again.
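
If you are comfortable in an SSH session you can also do this outside WebMin with something along these lines:

# dump the current rules out to a file
iptables-save > /tmp/firewall-rules.txt
# edit the file so all the DROP rules sit above the ACCEPT rules, then load it back in
iptables-restore < /tmp/firewall-rules.txt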

Or you can use the arrows by the side of each rule to move the rule up or down in the table - which is a very laborious task if you have lots of rules.

So if you find yourself still being hammered by IP addresses you thought you had blocked then check the order of the rules in your firewall and make sure the DENY rules are at the top NOT the bottom of your list of IP addresses.