Monday 1 December 2008

Job Rapists and other badly behaved bots

How bad does a bot have to be before banning it?

I had to ban a particularly naughty crawler the other day that belonged to one of the many job aggregators that spring up ever so regularly. This particular bot belonged to a site called jobrapido, which one of my customers calls "that job rapist" for the very good reason that it seems to think that any job posted on a jobboard is theirs to take without asking either the job poster or the owner of the site it's posted on. Whereas the good aggregators such as JobsUK require that the job poster or jobboard submits a regular XML feed of their jobs, rapido just seems to endlessly crawl every site they know about and take whatever they find. In fact, on their home page they state the following:

Do you know a job site that does not appear in Jobrapido? Write to us!

Isn't that nice? Anyone can help them thieve, rape and pillage another site's jobs. Maybe there is some bonus point system for regular informers. Times are hard, as we all know, and people have to get their money where they can! However, it shouldn't be from taking other people's content unrequested.

This particular bot crawls so much it actually made one of our tiniest jobboards appear in our top 5 rankings one month purely from its crawling. The site had fewer than 50 jobs but a lot of categories, which meant that this bot decided to visit each day and crawl every single search combination, which meant 50,000 page loads a day!

Although this rapido site did link back to the original site that posted the job, to allow the jobseeker to apply for it, the number of referrals was tiny (150 a day across all 100+ of our sites) compared to the huge amount of bandwidth it was stealing from us (16-20% of all traffic). It was regularly ranked above Googlebot, MSN and Yahoo as the biggest crawler in my daily server reports, as well as being the biggest hitter (page requests / time).
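If you want to check whether a crawler is taking a similar share of your own traffic, a rough tally of requests per user agent is enough to start with. The following is only a minimal sketch, not how my server reports are produced: it assumes your access log uses the common "combined" format where the user agent is the last quoted field, and the function names `tally_user_agents` and `traffic_share` are made up for illustration.

```python
import re
from collections import Counter

# Assumption: combined log format, where the user agent is the
# last double-quoted field on each line.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def tally_user_agents(log_lines):
    """Count requests per user-agent string."""
    counts = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def traffic_share(counts, needle):
    """Fraction of all counted requests whose user agent contains
    `needle` (case-insensitive)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    needle = needle.lower()
    hits = sum(n for ua, n in counts.items() if needle in ua.lower())
    return hits / total
```

Feed it the lines of a day's log, then call `counts.most_common(5)` to see the heaviest agents and `traffic_share(counts, "jobrapido")` to see what fraction of requests one bot accounts for. Note that this counts requests, not bytes, so it is only a proxy for bandwidth.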

So I tried banning it using robots.txt directives, as any legal, well-behaved bot should pay attention to that file. However, 2 weeks after adding all 3 of their agents to the file they were still paying us visits each day, and no bot should cache the file for that length of time, so I banned it using the .htaccess file.
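Before blaming the bot, it is worth confirming that your robots.txt rules really do say what you think they say. Here is a small sketch using Python's standard `urllib.robotparser` to test a rule set the way a well-behaved crawler would. The robots.txt content and the `Jobrapido` agent token are assumptions for illustration (they are not copied from my own file), and `allowed_by_robots` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the "Jobrapido" token is an assumed product
# name, not taken from any published documentation for the bot.
ROBOTS_TXT = """\
User-agent: Jobrapido
Disallow: /
"""


def allowed_by_robots(robots_txt, agent_token, path):
    """Return True if a crawler identifying itself with `agent_token`
    may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent_token, path)
```

With these rules, `allowed_by_robots(ROBOTS_TXT, "Jobrapido", "/jobs/search")` is `False`, while the same call for `"Googlebot"` is `True`. Pass the short agent token rather than the full `Mozilla/5.0 (...)` header, because this parser only looks at the part of the string before the first `/`. Of course, this only tells you what a polite bot would do; a bot that ignores the file needs a server-side ban.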

So if you work in the jobboard business and don't want a particularly heavy bandwidth thief, content scraper and robots.txt file ignorer hitting your sites every day, then do yourself a favour and ban these agents and IP addresses:

Mozilla/5.0 (compatible; Jobrapido/1.1; +

"Mozilla/5.0 (compatible; Jobrapido/1.1; +"

Mozilla/5.0 (JobRapido WebPump)
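I did my ban in the .htaccess file, but if your site runs behind a Python application the same idea can be sketched as a small piece of WSGI middleware. This is an illustration only, not what I deployed: it assumes that a case-insensitive match on the fragment "jobrapido" is enough to catch the agents listed above, and the names `BLOCKED_AGENT_FRAGMENTS`, `is_blocked` and `block_bad_bots` are invented for the example.

```python
# Assumption: every agent to be banned contains this fragment
# somewhere in its User-Agent header (compared case-insensitively).
BLOCKED_AGENT_FRAGMENTS = ("jobrapido",)


def is_blocked(user_agent):
    """Return True if the User-Agent header contains a blocked fragment."""
    ua = (user_agent or "").lower()
    return any(fragment in ua for fragment in BLOCKED_AGENT_FRAGMENTS)


def block_bad_bots(app):
    """Wrap a WSGI app so blocked agents get a 403 and everyone
    else is passed through untouched."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware
```

You would wrap your existing application once, for example `application = block_bad_bots(application)`. Matching on the user agent only stops bots that identify themselves honestly, which is why banning by IP address as well is worth doing.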

If you want a good UK Job Aggregator that doesn't pinch sites' jobs without asking first then use JobsUK. It's simple to use and has thousands of new jobs from hundreds of jobboards added every day.
