Monday, 19 April 2010

Banning Bad Bots with Mod Rewrite


There are many brilliant sites out there dedicated to the never ending war on bad bots and I thought I would add my own contribution to the lists of regular expressions used for banning spammers, scrapers and other malicious bots with Mod Rewrite.

As with all security measures a sys admin takes to lock down and protect a site or server, a layered approach is best. You should utilise as many different methods as possible so that an error, misconfiguration or breach of one ring in your defence does not mean your whole system is compromised.

The major problem with using the htaccess file to block bots by useragent or referrer is that any hacker worth the name would easily get round the rules by changing their agent string to a known browser or by hiding the referrer header.

However, in spite of this obvious fact, many of the bots currently doing the rounds scraping and exploiting still don't bother to do this, so it's still worth doing. I run hundreds of major sites and have my own logger system which automatically scans traffic and blocks visitors who I catch hacking, spamming, scraping and heavy hitting. Therefore I regularly have access to the details of badly behaving bots, and although a large percentage of them hide behind an IE or Firefox useragent, a large percentage still use identifying agent strings that can be matched and banned.

The following directives are taken from one of my own personal sites where I am cracking down on all forms of bandwidth theft. Anyone using an HTTP library such as CURL, Snoopy or WinHTTP without bothering to change their useragent will get blocked. If you don't want that behaviour then don't just copy and paste the rules.
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /

# Block blank or very short user-agents. If they cannot be bothered to tell me who they are or provide gibberish then they are not welcome!
RewriteCond %{HTTP_USER_AGENT} ^(?:-?|[a-z0-9\-_]{1,10})$ [NC]
RewriteRule .* - [F,L]

# Block a number of libraries, email harvesters, spambots, hackbots and known bad bots.
# Note: Apache does not allow trailing comments on directive lines, so each comment sits on its own line.

# HTTP libraries. I know these libraries are useful but if the user cannot be bothered to change the agent they are worth blocking, plus I only want people
# visiting my site in their browser. Automated requests with CURL usually means someone is being naughty so tough!
RewriteCond %{HTTP_USER_AGENT} (?:ColdFusion|curl|HTTPClient|Java|libwww|LWP|Nutch|PECL|PHP|POE|Python|Snoopy|urllib|Wget|WinHttp) [NC,OR]
# Hackbots or SQL injection detector tools being misused!
RewriteCond %{HTTP_USER_AGENT} (?:ati2qs|cz32ts|indy|library|linkcheck|Morfeus|NV32ts|Pangolin|Paros|ripper|scanner) [NC,OR]
# Offline downloaders and image grabbers
RewriteCond %{HTTP_USER_AGENT} (?:AcoiRobot|alligator|auto|bandit|capture|collector|copier|disco|devil|downloader|fetch|flickbot|hook|igetter|jetcar|leach|mole|miner|mirror|race|reaper|sauger|sucker|site|snake|stripper|vampire|weasel|whacker|xenu|zeus|zip) [NC]
RewriteRule .* - [F,L]

# Fake referrers and known email harvesters which I send off to a honeytrap full of fake emails
RewriteCond %{HTTP_USER_AGENT} (?:atomic|collect|e?mail|magnet|reaper|siphon|sweeper|harvest|(microsoft\surl\scontrol)|wolf) [NC,OR]
RewriteCond %{HTTP_REFERER} ^[^?]*(?:iaea|\.ideography|addresses)(?:\.co\.uk|\.org|\.com) [NC]
# Redirect to a honeypot - change ##HONEYPOT## to the URL of your own honeytrap page
RewriteRule ^.*$ http://##HONEYPOT##/ [R,L]

# copyright violation and brand monitoring bots
RewriteCond %{REMOTE_ADDR} ^12\.148\.196\.(12[8-9]|1[3-9][0-9]|2[0-4][0-9]|25[0-5])$ [OR]
RewriteCond %{REMOTE_ADDR} ^12\.148\.209\.(19[2-9]|2[0-4][0-9]|25[0-5])$ [OR]
RewriteCond %{REMOTE_ADDR} ^63\.148\.99\.2(2[4-9]|[3-4][0-9]|5[0-5])$ [OR]
RewriteCond %{REMOTE_ADDR} ^64\.140\.49\.6([6-9])$ [OR]
RewriteCond %{HTTP_USER_AGENT} (?:NPBot|TurnitinBot) [NC]
RewriteRule .* - [F,L]

# Image hotlinking blocker - replace any hotlinked images with a banner advert for the latest product I want free advertising for!
RewriteCond %{HTTP_REFERER} !^$
# Change ##SITEDOMAIN## to your own site domain!
RewriteCond %{HTTP_REFERER} !^http://(www\.)?##SITEDOMAIN##\.com/.*$ [NC]
# Ensure image indexers don't get blocked
RewriteCond %{HTTP_REFERER} !^https?://(?:images\.|www\.|cc\.)?(cache|mail|live|google|googlebot|yahoo|msn|ask|picsearch|alexa).*$ [NC]
# Ensure email clients don't get blocked
RewriteCond %{HTTP_REFERER} !^https?://.*(webmail|e?mail|live|inbox|outbox|junk|sent).*$ [NC]
# Free advertising for me - change ##BANNER## to the path of your own banner image
RewriteRule .*\.(gif|jpe?g|png)$ /##BANNER##.gif [NC,L]

# Security Rules - these rules help protect your site from hacks such as sql injection and XSS

# No-one should be running these requests against my site!
RewriteCond %{REQUEST_METHOD} ^(TRACE|TRACK)
RewriteRule .* - [F]

# My basic rules for catching SQL Injection - covers the majority of the automated attacks currently doing the rounds

# SQL Injection and XSS hacks - Most hackbots will malform links and then log 500 errors for details. I use a special hack.php page to log details of the hacker and ban them by IP in future.
# Note: a RewriteRule pattern is only ever matched against the URL path, never the query string, so the injection patterns must go in RewriteCond %{QUERY_STRING} tests.
# Works with the following extensions .php .asp .aspx .jsp so change/remove accordingly, and change the name of the hack.php page or replace the rule target with - [F,L]
RewriteCond %{QUERY_STRING} DECLARE[^a-z]+\@\w+[^a-z]+N?VARCHAR\((?:\d{1,4}|max)\) [NC,OR]
RewriteCond %{QUERY_STRING} sys.?(?:objects|columns|tables) [NC,OR]
RewriteCond %{QUERY_STRING} ;EXEC\(\@\w+\);? [NC,OR]
# XSS hacks
RewriteCond %{QUERY_STRING} (%3C|<)/?script(%3E|>) [NC]
RewriteRule \.(?:aspx?|php|jsp)$ /hack.php [NC,QSA,L]

# Bad requests which look like attacks (these have all been seen in real attacks)
RewriteRule (^|/)(owssvr|strmver|Auth_data|redirect\.adp|MSOffice|DCShop|msadc|winnt|system32|script|autoexec|formmail\.pl|_mem_bin|NULL\.) /hack.php [NC,L]
</IfModule>
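Before deploying rules like these it is worth exercising the regular expressions offline. The following is an illustrative Python sketch (not part of the .htaccess itself) that tests patterns equivalent to the short-agent rule and one of the copyright-bot IP ranges; the helper names are my own for this example.

```python
import re
import ipaddress

# Pattern equivalent to the blank/short user-agent rule ([NC] maps to re.IGNORECASE)
short_agent = re.compile(r"^(?:-?|[a-z0-9\-_]{1,10})$", re.IGNORECASE)

# One of the copyright-bot ranges: should match exactly 63.148.99.224 - 63.148.99.255 (a /27)
ip_range = re.compile(r"^63\.148\.99\.2(2[4-9]|[3-4][0-9]|5[0-5])$")

def is_blocked_agent(ua: str) -> bool:
    """True if the user-agent string would trip the blank/short-agent rule."""
    return short_agent.match(ua) is not None

def regex_matches_cidr(pattern: re.Pattern, cidr: str) -> bool:
    """Check the last-octet regex matches exactly the addresses in the CIDR block."""
    network = ipaddress.ip_network(cidr)
    for i in range(256):
        ip = f"63.148.99.{i}"
        in_cidr = ipaddress.ip_address(ip) in network
        if bool(pattern.match(ip)) != in_cidr:
            return False
    return True
```

A blank agent, a lone hyphen, or something like "NV32ts" trips the short-agent rule, while a full browser string sails past it; the CIDR helper confirms the last-octet alternation covers the intended /27 and nothing more.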


Security, Free Advertising and Less Bandwidth

As you can see, I have condensed the rules into sections to keep the file manageable. The major aims of the rules are to
  1. Improve security by blocking hackbots before they can reach the site.
  2. Reduce bandwidth by blocking the majority of automated requests that are not from known indexers such as Googlebot or Yahoo.
  3. Piss those people off who think they can treat my bandwidth as a drunk treats a wedding with a free bar. Known email scrapers go off to a honeypot full of fake email addresses and hotlinkers help me advertise my latest site or product by displaying banners for me.

Is it a pointless task?

A lot of the bad bot agent strings are well known, and there are many more which could be added if you so wish; however, trying to keep a static file maintained with the latest bad boys is a pointless and thankless task. The best way is to automate the tracking of bad bots: serve robots.txt from a dynamic script that logs each request to a file or database, place directives in it that DENY access to a special directory or file, and then place hidden links on your site to that file. Any agents who ignore the robots.txt file and crawl these links can then be logged and blocked.
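The trap technique can be sketched in a few lines. This is an illustrative Python sketch under assumed names (the trap path and the in-memory blocklist are stand-ins), not the dynamic script I actually run, which persists everything to a database.

```python
from datetime import datetime, timezone

# Hypothetical trap path - disallowed in robots.txt and linked invisibly from pages
TRAP_PATH = "/private/trap.html"

# In-memory blocklist; a real system would persist this to a file or database
blocklist: set[str] = set()

def handle_request(path: str, ip: str, user_agent: str) -> bool:
    """Return True if the visitor tripped the trap and was blocklisted."""
    if path == TRAP_PATH:
        blocklist.add(ip)
        print(f"{datetime.now(timezone.utc).isoformat()} blocked {ip} ({user_agent})")
        return True
    return False

def is_banned(ip: str) -> bool:
    return ip in blocklist
```

Any agent that obeys robots.txt never sees the trap path, so only rule-breakers end up in the blocklist.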

I also utilise my own database-driven logger system that constantly analyses traffic looking for bad users who can then be banned. I have an SQL function that checks for hack attempts and spam by pattern matching the stored querystring, as well as looking for heavy hitters (agents/IPs requesting lots of pages in a short time period). This helps me prevent DDOS attacks as well as scrapers who think they can take 1000+ jobs without saying thank you!
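A heavy-hitter check like the one my logger performs boils down to a sliding window of request timestamps per IP. The sketch below is illustrative Python; the window size and threshold are made-up values, not the ones I actually use.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # look at the last minute of traffic (illustrative value)
MAX_REQUESTS = 100    # more than this per window flags a heavy hitter (illustrative value)

# Per-IP deque of request timestamps inside the current window
hits: dict[str, deque] = defaultdict(deque)

def record_hit(ip: str, now: float) -> bool:
    """Record a request at time `now` (in seconds); return True if ip is a heavy hitter."""
    window = hits[ip]
    window.append(now)
    # Drop timestamps that have fallen out of the sliding window
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_REQUESTS
```

An IP firing 150 requests in 15 seconds trips the flag; a one-off visitor never does. In production the hits structure would live in a database so the check survives restarts and works across servers.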

A message for the other side

I know this probably defeats the point of me posting my htaccess rules, but as well as defending my own systems from attack I also make use of libraries such as CURL in my own development to make remote HTTP requests. Therefore I can see the issues involved in automated crawling from both sides, and I know all the tricks system admins use to block scrapers as well as the tricks scrapers use to bypass them.

There are many legitimate reasons why you might need to crawl or scrape, but you should remember that what goes around comes around. Most developers have at least one site of their own, so you should know that bandwidth is not free and stealing others' will lead to bad karma. The web is full of lists of bad agents and IP addresses obtained from log files and honey-traps, so when you decide to test your new scraper out you risk being banned from far more than just the site you intend to crawl.

Remember, if you hit a site so hard and fast that it breaks (which is extremely possible in this day and age of cheaply hosted Joomla sites designed by four year olds), the sys admin will start analysing log files looking for culprits. A quiet site that runs smoothly usually means the owner is happy and not looking for bots to ban.

  • Rather than making multiple HTTP requests cache your content locally if possible.
  • Alternate requests between domains so that you don't hit a site too hard.
  • Put random delays in between requests.
  • Obey Robots.txt and don't risk getting caught in a honeypot or infinite loop by visiting pages you shouldn't be going to.
  • Never use the default agent as set in your chosen library.
  • Don't risk getting your server blacklisted by your crawling so always use a proxy.
  • Only scrape the bare minimum that you need to do the job. If you are only checking the header then don't return the body as well.
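Several of those points can be put together in a minimally polite fetch loop. This is an illustrative Python sketch: the agent name is a placeholder, and urllib.robotparser is fed robots.txt rules as text here so the example stays offline (a real crawler would load them with set_url() and read()).

```python
import random
import time
import urllib.robotparser

# Identify yourself honestly instead of using the library default (name is a placeholder)
USER_AGENT = "MyCrawler/1.0 (crawler@example.com)"

# Parse robots.txt rules; a real crawler would fetch these from the target site
robots = urllib.robotparser.RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])

def allowed_urls(urls: list[str]) -> list[str]:
    """Filter out anything robots.txt forbids - forbidden paths may be honeypots."""
    return [u for u in urls if robots.can_fetch(USER_AGENT, u)]

def crawl(urls: list[str], min_delay: float = 1.0, max_delay: float = 5.0) -> list[str]:
    """Visit only permitted URLs with a random pause between requests."""
    visited = []
    for url in allowed_urls(urls):
        visited.append(url)  # a real crawler would fetch and cache the content here
        # Random delay between requests so we never hammer the site
        time.sleep(random.uniform(min_delay, max_delay))
    return visited
```

The robots.txt check doubles as honeypot protection: any link the site owner has disallowed is exactly the kind of trap described above, so skipping it keeps you off the blocklists.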

