Wednesday 8 July 2009

Another article about bots and crawlers

Good Guys versus the Bad Guys

As you will know if you have read my previous articles about badly behaved robots, I spend a lot of time dealing with crawlers that access one of the 200+ jobboards I look after. Crawlers break down into two varieties. The first are the indexers such as Googlebot, YSlurp and Ask, which provide a service to the site owner by indexing their content and displaying it in search engines to drive traffic to the site. Although these are legitimate bots, they can still cause issues (see this article about over-crawling). The second class are those bots that don't benefit the site owner at all and in the large majority of cases actively try to harm the site, whether by hacking, spamming or content scraping (or job raping, as we call it).

Now it may be that the site owner is happy for any Tom, Dick or Harry to come along with a scraper and take all their content, but seeing that content is the main asset of a website, it's a bit like opening a shop and leaving the doors open all night with no-one manning the till. The problem is that most web content is publicly accessible, and it's very hard to stop someone who is determined to scrape your content while still keeping good SEO. For instance, you could deliver all your content through Javascript or Flash, but then indexers like Googlebot won't be able to access your content either. Therefore, preventing overloaded servers, stolen bandwidth and stolen content becomes a complex game involving a lot more than a blacklist of IP addresses and user-agents; because of IP and agent spoofing such lists are very unreliable on their own. Instead, a variety of methods can be used by those wishing to reduce the amount of "Bad Traffic", including real-time checks to identify users that are hammering the site, white lists and blacklists, bot traps, DNS checks and much more.
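
To give an idea of the sort of DNS check I mean, here is a rough sketch in C# of a forward-confirmed reverse DNS test that catches visitors pretending to be Googlebot. This is only an illustrative outline (the method name and domain suffixes are my own, and a real version would cache the lookups rather than hit DNS on every request):

```csharp
using System;
using System.Linq;
using System.Net;

public static class CrawlerChecks
{
    // Forward-confirmed reverse DNS: resolve the visitor's IP to a host name,
    // check that the host name belongs to the claimed crawler's domain, then
    // resolve the host name back and confirm the original IP is in the list.
    // Spoofed agents claiming to be Googlebot fail one of these steps.
    public static bool IsRealGooglebot(string ipAddress)
    {
        try
        {
            string host = Dns.GetHostEntry(ipAddress).HostName;

            if (!host.EndsWith(".googlebot.com", StringComparison.OrdinalIgnoreCase) &&
                !host.EndsWith(".google.com", StringComparison.OrdinalIgnoreCase))
            {
                return false;
            }

            // The forward lookup must map back to the original IP address.
            return Dns.GetHostAddresses(host).Any(a => a.ToString() == ipAddress);
        }
        catch (Exception)
        {
            // No reverse record or the lookup failed - treat as unverified.
            return false;
        }
    }
}
```

The same pattern works for the other big indexers; only the domain suffixes change.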

One of the problems I have found is that even bot traffic that is legitimate in the eyes of the site owner often doesn't comply with the most basic rules of crawler etiquette, such as sending a proper user-agent string, passing DNS validation or even following the Robots.txt rules. In fact, when we spoke to the technical team running one of these job-rapists/aggregators, they informed us that they didn't have the technical capability to parse a Robots.txt file. I think that is plainly ridiculous: if you have the technical capability to crawl hundreds of thousands of pages a day, all with different HTML, and to correctly parse that content, extract the job details and upload them to your own server for display, it shouldn't be too difficult to add a few lines to parse a Robots.txt file. I am 99% positive this was just an excuse to hide the fact that they knew that if they did parse my Robots.txt files they would find they were banned from all my sites. Obviously I had banned them by other methods as well, but just to show how easy it is to write code to parse a robots.txt file (and for that matter a crawler), I have added some example code to my technical site: How to parse a Robots.txt file with C#.
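
To show just how little code is involved, here is a simplified sketch of the idea in C#. This is an outline rather than the exact code from that article: it merges every matching user-agent group instead of picking the most specific one, and it ignores Allow and wildcard rules.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;

public static class RobotsTxtChecker
{
    // Downloads robots.txt and returns the Disallow paths that apply to the
    // supplied user-agent (including the rules in any "*" group).
    public static List<string> GetDisallowedPaths(string siteUrl, string userAgent)
    {
        var disallowed = new List<string>();
        bool groupApplies = false;

        using (var client = new WebClient())
        using (var reader = new StringReader(
            client.DownloadString(siteUrl.TrimEnd('/') + "/robots.txt")))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Strip comments and surrounding whitespace.
                int hash = line.IndexOf('#');
                if (hash >= 0) line = line.Substring(0, hash);
                line = line.Trim();
                if (line.Length == 0) continue;

                int colon = line.IndexOf(':');
                if (colon < 0) continue;

                string field = line.Substring(0, colon).Trim().ToLowerInvariant();
                string value = line.Substring(colon + 1).Trim();

                if (field == "user-agent")
                {
                    // Does this group cover our crawler (or everyone)?
                    groupApplies = value == "*" ||
                        userAgent.IndexOf(value, StringComparison.OrdinalIgnoreCase) >= 0;
                }
                else if (field == "disallow" && groupApplies && value.Length > 0)
                {
                    disallowed.Add(value);
                }
            }
        }

        return disallowed;
    }

    // A URL path is off limits if it starts with any applicable Disallow rule.
    public static bool IsBlocked(string path, List<string> disallowedPaths)
    {
        foreach (var rule in disallowedPaths)
        {
            if (path.StartsWith(rule, StringComparison.OrdinalIgnoreCase))
                return true;
        }
        return false;
    }
}
```

A polite crawler would call GetDisallowedPaths once per site and run IsBlocked against every URL before requesting it, which is hardly rocket science compared with parsing job details out of arbitrary HTML.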

This is one of the primary reasons there is so much bot traffic on the web today: it's just too damn easy to download and run a bot from the web, or to write your own. I'm not saying all crawlers and bots have nefarious reasons behind them, but I reckon 80%+ of all Internet traffic is probably from crawlers and bots nowadays, with the other 20% being porn, music and film downloads and social networking traffic. It would be interesting to get exact figures for an Internet traffic breakdown. I know from my own sites' logging that on average 60-70% of traffic is from crawlers, and I am guessing that doesn't include traffic from spoofed agents that weren't identified as such.
