Wednesday, 4 February 2009

4 simple rules robots won't follow

Job Rapists and Content Scrapers

I work with jobboards and spend a lot of time checking my traffic logs to investigate hack attempts, heavy-hitting bots and content scrapers that take the jobs without asking. I banned a large number of these "job rapists" the other week and then had to deal with a number of customers ringing up to ask why we had blocked them. The way I see it (and that really is the only way, as it's my system), if you are a bot and want to crawl my site you have to do the following:


1. Look at the robots.txt file and follow the rules

If you don't even bother looking at it (and I know, because I log those that do), then you have broken the most basic rule, and if you can't follow the basics you can get kicked to the kerb sharpish.
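As a rough illustration of the logging check described above, here is a minimal sketch that finds clients which requested pages without ever asking for robots.txt. The function name and the (client_ip, request_path) log shape are assumptions made for the example, not my actual log format.

```python
def clients_ignoring_robots(log_entries):
    """Return the set of client IPs that made requests but never fetched /robots.txt.

    log_entries is an iterable of (client_ip, request_path) tuples.
    This shape is a hypothetical stand-in for a real access log.
    """
    seen = set()
    fetched_robots = set()
    for client_ip, path in log_entries:
        seen.add(client_ip)
        # Drop any query string and compare case-insensitively.
        if path.split("?", 1)[0].lower() == "/robots.txt":
            fetched_robots.add(client_ip)
    return seen - fetched_robots


if __name__ == "__main__":
    sample = [
        ("192.0.2.1", "/robots.txt"),
        ("192.0.2.1", "/jobs/1"),
        ("192.0.2.2", "/jobs/1"),
    ]
    print(clients_ignoring_robots(sample))
```

This only shows who never looked at the file. Whether a bot that did fetch it then obeyed the Disallow rules needs a separate check against the paths it went on to request.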


2. Identify yourself correctly

I am sure there is some sort of standard for bots to use, but even if there isn't, my rules say that a blank user-agent, or a user-agent that doesn't include the URL or email address of your system, is enough to see a 403. All the major crawlers do it and most of the smaller ones as well. Having a user-agent of "C4BOT" or "Oodlebot" is just not good enough. If you are a new crawler, identify yourself so that I can search for your URL and see whether I should ban you or not. If you don't identify yourself I will ban you anyway.
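The user-agent rule could be sketched along these lines. The function name and the two regular expressions are illustrative assumptions, deliberately loose, and the check is only meant for requests already known to come from a bot, since ordinary browser user-agents carry no URL or email either.

```python
import re

# Loose patterns for "contains a URL" and "contains an email address".
# Both are assumptions for this sketch, not a formal grammar.
_URL_RE = re.compile(r"https?://[^\s;()]+", re.IGNORECASE)
_EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def bot_identifies_itself(user_agent):
    """Return True if a bot's user-agent is non-blank and names a URL or email."""
    if not user_agent or not user_agent.strip():
        return False
    return bool(_URL_RE.search(user_agent) or _EMAIL_RE.search(user_agent))


if __name__ == "__main__":
    for ua in ("", "C4BOT", "ExampleBot/1.0 (+http://example.com/bot.html)"):
        status = 200 if bot_identifies_itself(ua) else 403
        print(repr(ua), status)
```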


3. Set up a Reverse DNS Entry

I am starting to use the "standard" way of validating crawlers against their IP, which is being used by the online community. This involves doing a reverse DNS lookup on the IP used by the bot. If you haven't got this set up then tough luck. If you have, then I will do a forward DNS lookup on the returned host name to make sure it resolves back to the same IP. I think most big crawlers are starting to come on board with this way of doing things now that it's taken off, and for those that don't I have a lookup table of IP/agent pairs for the big crawlers I allow anyway.
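The reverse-then-forward check can be sketched with Python's standard socket module as below. The function name and the injectable lookup parameters are assumptions added so the logic can be exercised without a network, and socket.gethostbyname_ex only returns IPv4 addresses, so this version does not cover IPv6.

```python
import socket


def verified_hostname(ip, reverse_lookup=None, forward_lookup=None):
    """Return the host name for ip if forward DNS confirms it, else None.

    reverse_lookup(ip) -> host name and forward_lookup(host) -> list of IPs
    default to the real DNS calls; they are parameters only so the sketch
    can be tested offline.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        host = reverse_lookup(ip)
    except OSError:
        # No reverse DNS entry: tough luck.
        return None
    host = host.rstrip(".").lower()

    try:
        addresses = forward_lookup(host)
    except OSError:
        return None

    # The host name only counts if it resolves back to the same IP.
    return host if ip in addresses else None
```

A caller would then compare the returned host name against the domain the user-agent claims to belong to before trusting the bot.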


4. Job Raping is not allowed under any circumstances

If you are crawling my system then you must have permission from each site owner, as well as from me, to do this. I have had bots hit tiny weeny itsy bitsy jobboards, ones with only 30-40 jobs, with up to 40,000 page loads a day. This is bandwidth you should not be taking up, and if you are a proper job aggregator like Indeed, JobsUK or GoogleBase then you should accept XML feeds of the jobs from the sites that want their jobs to appear on your site. Having permission from the clients (recruiters/employers) on the site is not good enough, as they do not own the content; the site owner does. From what I have seen, the only job aggregators who crawl rather than accept feeds are those who can't, for whatever reason, get the jobs the correct way.

I have put automated traffic analysis reports into my systems that let me know at regular intervals which bots are visiting me, which visitors are heavy hitting and which are spoofing, hacking, raping and carrying out other forms of content pillaging. It's like an arms race from the '80s and I am banning bots every day of the week for breaking those rules. If you are a legitimate bot then it's not too hard to come up with a user-agent that identifies you, set up a reverse DNS entry, follow the robots.txt rules and not visit my site every day crawling every single page like a spider on yabba!
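For a flavour of what the heavy-hitter part of such a report might look like, here is a minimal sketch that counts page loads per day, IP and user-agent and lists anything over a threshold. The function name, the tuple layout and the threshold are all assumptions for the example, not how my reports are actually built.

```python
from collections import Counter


def heavy_hitters(log_entries, max_hits_per_day):
    """List (day, client_ip, user_agent, hits) rows exceeding max_hits_per_day.

    log_entries is an iterable of (day, client_ip, user_agent) tuples,
    one per page load. Rows are sorted with the heaviest hitter first.
    """
    hits = Counter()
    for day, client_ip, user_agent in log_entries:
        hits[(day, client_ip, user_agent)] += 1

    over_limit = [
        key + (count,) for key, count in hits.items() if count > max_hits_per_day
    ]
    return sorted(over_limit, key=lambda row: row[3], reverse=True)


if __name__ == "__main__":
    sample = [("2009-02-04", "192.0.2.1", "BadBot")] * 5
    sample += [("2009-02-04", "192.0.2.2", "GoodBot (+http://example.com/bot)")] * 2
    for row in heavy_hitters(sample, max_hits_per_day=3):
        print(row)
```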
