Wednesday, 23 October 2013

4 simple rules robots won't follow

Job Rapists and Content Scrapers - how to spot and stop them!

I work with many sites, from small blogs to large sites that receive millions of page loads a day. I have to spend a lot of my time checking my traffic log and logger database to investigate hack attempts, heavy-hitting bots and content scrapers that take content without asking (on my recruitment sites and jobboards I call this Job Raping, and the BOT that does it a Job Rapist).

I banned a large number of these "job rapists" the other week and then had to deal with a number of customers ringing up to ask why we had blocked them. The way I see it (and that really is the only way, as it's my responsibility to keep the system free of viruses and hacks), if you are a bot and want to crawl my site you have to follow these steps.

These steps are not uncommon, and many sites implement them to reduce bandwidth wasted on bad BOTS as well as to protect themselves from spammers and hackers.

4 Rules For BOTS to follow


1. Look at the Robots.txt file and follow the rules

If you don't even bother looking at this file (and I know because I log those that do) then you have broken the most basic rule that all BOTS should follow.

If you can't follow even the most basic rule then you will be given a ban or 403 ASAP.

To see how easy it is to make a BOT that can read and parse a Robots.txt file, please read this article (and this is some very basic code I knocked up in an hour or so):

How to write code to parse a Robots.txt file (including the sitemap directive).
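As a taster, a BOT can get most of the way there with standard library code alone. This is a minimal sketch in Python using the robotparser module; the rules and URLs are placeholder examples, and parse() is fed inline rules here so the snippet runs without a network fetch (use set_url() and read() against a live robots.txt file instead).

    from urllib import robotparser

    # Parse a robots.txt file (here fed inline; use set_url()/read() for a live site)
    rp = robotparser.RobotFileParser()
    rp.parse([
        "User-agent: *",
        "Disallow: /admin/",
        "Sitemap: http://www.example-jobboard.com/sitemap.xml",
    ])

    # A well-behaved BOT checks every URL before fetching it
    print(rp.can_fetch("MyBot", "/jobs/1234"))    # True - allowed
    print(rp.can_fetch("MyBot", "/admin/login"))  # False - disallowed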


2. Identify yourself correctly

Whilst it may not be set in stone, there is a "standard" for BOTS to identify themselves correctly in their user-agents, and all proper search engines and crawlers will supply a correct user-agent.

If you look at some common ones such as Google, BING or a Twitter BOT you can see a common theme.

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Mozilla/5.0 (compatible; TweetedTimes Bot/1.0; +http://tweetedtimes.com)

They all:
-Provide information on the browser compatibility e.g. Mozilla/5.0.
-Provide their name e.g. Googlebot, bingbot, TweetedTimes.
-Provide their version e.g. 2.1, 2.0, 1.0.
-Provide a URL where we can find out information about the BOT and what it does e.g. http://www.google.com/bot.html, http://www.bing.com/bingbot.htm and http://tweetedtimes.com.

On the systems I control, and on many others that use common intrusion detection systems at the firewall and system level (even WordPress plugins), having a blank user-agent, or a short one that doesn't contain a link or email address, is enough to get a 403 or a ban.

At the very least a BOT should provide some way to let the site owner find out who owns the BOT and what the BOT does.

Having a user-agent of "C4BOT" or "Oodlebot" is just not good enough.

If you are a new crawler identify yourself so that I can search for your URL and see whether I should ban you or not. If you don't identify yourself I will ban you!
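Checking for this server-side takes only a few lines. Here is a minimal sketch (my own illustration, not the exact rules any particular firewall or plugin applies) that rejects blank user-agents and any that offer no URL or email address to look up:

    import re

    def acceptable_user_agent(ua):
        # Blank or whitespace-only user-agents get a 403 straight away
        if not ua or not ua.strip():
            return False
        # A legitimate BOT gives the site owner some way to look it up:
        # a URL or an email address somewhere in the user-agent string
        has_url = re.search(r"https?://\S+", ua)
        has_email = re.search(r"[\w.+-]+@[\w-]+(\.[\w-]+)+", ua)
        return bool(has_url or has_email)

    print(acceptable_user_agent("C4BOT"))  # False - a name alone, no way to look it up
    print(acceptable_user_agent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))  # True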


3. Set up a Reverse DNS Entry

I am now using the "standard" way of validating crawlers against the IP address they crawl from.

This involves doing a reverse DNS lookup with the IP used by the bot.

If you haven't got this set up then tough luck. If you have, then I will do a forward DNS lookup to make sure the IP is registered with the host name.

I think most big crawlers are starting to come on board with this way of doing things now. Plus it is a great way to verify that GoogleBot really is GoogleBot, especially when user-agent switcher tools are so common nowadays.

I also have a lookup table of IP/user-agent pairs for the big crawlers I allow. However, if GoogleBot or BING start using new IP addresses that I don't know about, the only way I can correctly identify them (especially after experiencing a spoofed GoogleBOT hacking my site) is by doing this 2-step DNS verification routine.
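The routine itself is short. Below is a minimal sketch in Python; the allowed host name suffixes are assumptions based on the domains Google and Bing publish for their crawlers, so verify them against each crawler's documentation before relying on them.

    import socket

    # Host name suffixes assumed for this example - check each crawler's docs
    ALLOWED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

    def verify_crawler(ip):
        # Step 1: reverse DNS - find the host name registered for this IP
        try:
            host, _, _ = socket.gethostbyaddr(ip)
        except OSError:
            return False  # no reverse DNS entry at all
        # The host name must belong to a known crawler domain
        if not host.endswith(ALLOWED_SUFFIXES):
            return False
        # Step 2: forward DNS - the host name must resolve back to the same IP
        try:
            _, _, addresses = socket.gethostbyname_ex(host)
        except OSError:
            return False
        return ip in addresses

    print(verify_crawler("66.249.66.1"))  # an IP in Google's crawl range should pass both steps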


4. Job Raping / Scraping is not allowed under any circumstances. 

If you are crawling my system then you must have permission from each site owner as well as me to do this.

I have had tiny weeny itsy bitsy jobboards with only 30 jobs take up to 400,000 page loads a day because of scrapers, email harvesters and bad bots.

This is bandwidth you should not be taking up, and if you are a proper job aggregator like Indeed, JobsUK or GoogleBase then you should accept XML feeds of the jobs from the sites that want their jobs to appear on your site.

Having permission from the clients (recruiters/employers) on the site is not good enough, as they do not own the content; the site owner does. From what I have seen, the only job aggregators who crawl rather than accept feeds are those who can't, for whatever reason, get the jobs the correct way.

I have put automated traffic analysis reports into my systems that let me know at regular intervals which bots are visiting me, which visitors are heavy hitters, and which are spoofing, hacking, raping and carrying out other forms of content pillaging.
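A first pass at spotting the heavy hitters doesn't need much code. This is a minimal sketch, assuming an Apache-style access log where the client IP is the first field on each line, and an arbitrary daily threshold of 10,000 requests:

    from collections import Counter

    def heavy_hitters(log_path, threshold=10000):
        # Count requests per client IP (first field of each log line)
        hits = Counter()
        with open(log_path) as log:
            for line in log:
                if not line.strip():
                    continue
                hits[line.split(" ", 1)[0]] += 1
        # Report every IP whose request count crosses the threshold
        return [(ip, count) for ip, count in hits.most_common() if count >= threshold]

    for ip, count in heavy_hitters("access.log"):
        print(ip, count)  # candidates for investigation or a ban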

It really is like a cold-war arms race, and I am banning bots every day of the week for breaking these 4 simple-to-follow rules.

If you are a legitimate bot then it's not too hard to come up with a user-agent that identifies you correctly, set up a reverse DNS entry, follow the robots.txt rules and not visit my site every day crawling every single page!