
Thursday, 26 May 2022

Extending Try/Catch To Handle Custom Exceptions - Part 1

Extending Core C# Object Try/Catch To Display Custom Regular Expression Exceptions - Part 1

By Strictly-Software

Have you ever wanted to throw custom exceptions, whether for business logic such as a customer not being found in your database, or because more complex logic has failed that you don't want to handle silently but raise so that you know about it?
 
For me, I required this solution due to a BOT I use that collects information from various sites every day, using regular expressions to find and extract the pieces of data I require. The problem is that a site will often change its HTML source code, which breaks the regular expression I have specifically crafted to extract the data.
 
For example, I could be getting the ordinal listing of a name with a very simple C# regular expression such as:
 
string regex = @"<span class=""ordinal_1"">(\d+?)</span>";
Then one day I will run the BOT and it won't return any ordinals for my values. When I look inside the log file I find my own custom message logged because no match was found, such as "Unable to extract value from HTML source", and when I go and check the source it's because they have changed the HTML to something like this:
<span class="ordinal_1__swrf" class="o1 klmn">"1"<sup>st</sup></span> 
 
This is obviously gibberish that many designers add into their HTML to stop BOTS crawling their data, hoping that the BOT developer has written a very specific regex that will break and return no data when met with such guff HTML.
 
Obviously the first expression, whilst it matched the original HTML fine, is too tight to handle extra guff within the source.
 
Therefore it required a change in the expression to something a bit more flexible in case they added even more guff into the HTML:
 
string regex = @"<span class=""ordinal_+?[\s\S]+?"">""?(\d+?)""?<sup";
 
As you can see I have made the expression looser, with extra question marks (?) so that it matches whether or not quotes are wrapped around values, and with non-greedy match-anything expressions like [\s\S]+? to handle the gibberish from the point it appears to the point I know it has to end, at the closing quote or bracket.
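
To see the looser pattern in action, here is a quick C# console snippet that runs it against the obfuscated HTML above (the class and variable names are just mine for illustration):

using System;
using System.Text.RegularExpressions;

class OrdinalRegexDemo
{
    static void Main()
    {
        // the looser pattern: optional quotes and a non-greedy [\s\S]+? to skip the guff
        string regex = @"<span class=""ordinal_+?[\s\S]+?"">""?(\d+?)""?<sup";

        // the obfuscated HTML the site switched to
        string html = @"<span class=""ordinal_1__swrf"" class=""o1 klmn"">""1""<sup>st</sup></span>";

        Match m = Regex.Match(html, regex, RegexOptions.IgnoreCase);

        // prints: Ordinal found: 1
        Console.WriteLine(m.Success ? "Ordinal found: " + m.Groups[1].Value : "Unable to extract value from HTML source");
    }
}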
 
So instead of just logging the fact that I have missing data from crawls, I wanted to use TRY/CATCH and make the fact that a specific piece of HTML no longer matches my expression an exception that gets raised, so I can see it as soon as it happens.
 
Well, with C# you can extend the base Exception object so that your own exceptions, based upon your own logic, can be thrown and caught in a TRY/CATCH. In future articles, we will build up to a full C# project with a couple of classes, a simple BOT and some regular expressions we can use to test what happens when trying to extract common values from the HTML source of various URLs and throw exceptions.
 

Creating The Custom Exception Class

First off, the C# exception class, where I extend the base Exception object to output the usual messages but use String.Format so that I can pass in specific values. I have put numbers at the start of each message so that when running the code later we can see which constructor got called.
 
I have just created a new solution in Visual Studio 2022 called TRYCATCH and then added my RegExException class that extends the Exception object. You can call your solution whatever you want, but for a test I don't see why you shouldn't just follow what I am doing.

using System;

[Serializable]
public class RegExException : Exception
{
    public RegExException() { }

    // 1: a regular expression and a message
    public RegExException(string regex, string msg)
        : base(String.Format("1 Regular Expression Exception: Regular Expression: {0}; {1}", regex, msg)) { }

    // 2: just a message
    public RegExException(string msg)
        : base(String.Format("2 {0}", msg)) { }

    // 3: a regular expression and the exception that was caught
    public RegExException(string regex, Exception ex)
        : base(String.Format("3 Regular Expression is no longer working: {0} {1}", regex, ex.Message)) { }
}
 
As you can see the class has 3 overloaded constructors which take either just a single message, a regular expression string and a message, or a regular expression and an exception. These values are placed in specific places within the exception messages.
 
You would cause a Regular Expression Exception to be thrown by placing something like this in your code:
 
throw new RegExException(String.Format("No match could be found with Regex {0} on supplied value '{1}'", re, val));
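
For example, wrapping the regex match in a TRY/CATCH might look something like this rough sketch (the ExtractOrdinal method and the test HTML are just illustrations, not part of the final project, and it assumes the RegExException class above is in the same project):

using System;
using System.Text.RegularExpressions;

class Program
{
    static string ExtractOrdinal(string html)
    {
        string re = @"<span class=""ordinal_+?[\s\S]+?"">""?(\d+?)""?<sup";

        Match m = Regex.Match(html, re, RegexOptions.IgnoreCase);
        if (!m.Success)
        {
            // no match usually means the site has changed its HTML again, so raise it as an exception
            throw new RegExException(String.Format("No match could be found with Regex {0} on supplied value '{1}'", re, html));
        }

        return m.Groups[1].Value;
    }

    static void Main()
    {
        try
        {
            Console.WriteLine(ExtractOrdinal(@"<span class=""ordinal_1__swrf"" class=""o1 klmn"">""1""<sup>st</sup></span>"));
        }
        catch (RegExException ex)
        {
            // in the real BOT this would be written to the log file instead
            Console.WriteLine(ex.Message);
        }
    }
}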
 
Hopefully, you can see what I am getting at, and we will build on this in a future post.
 
© 2022 By Strictly-Software 
 
 

Wednesday, 23 October 2013

4 simple rules robots won't follow


Job Rapists and Content Scrapers - how to spot and stop them!

I work with many sites, from small blogs to large sites that receive millions of page loads a day. I have to spend a lot of my time checking my traffic log and logger database to investigate hack attempts, heavy hitting bots and content scrapers that take content without asking (on my recruitment sites and jobboards I call this Job Raping, and a BOT that does it a Job Rapist).

I banned a large number of these "job rapists" the other week and then had to deal with a number of customers ringing up to ask why we had blocked them. The way I see it (and that really is the only way, as it's my responsibility to keep the system free of viruses and hacks), if you are a bot and want to crawl my site you have to follow these steps.

These steps are not uncommon and many sites implement them to reduce bandwidth wasted on bad BOTS as well as protect their sites from spammers and hackers.

4 Rules For BOTS to follow


1. Look at the Robots.txt file and follow the rules

If you don't even bother looking at this file (and I know because I log those that do) then you have broken the most basic rule that all BOTS should follow.

If you can't follow even the most basic rule then you will be given a ban or 403 ASAP.

To see how easy it is to make a BOT that can read and parse a Robots.txt file, please read this article (it's some very basic code I knocked up in an hour or so):

How to write code to parse a Robots.txt file (including the sitemap directive).
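
I won't repeat the article's code here, but the basic idea is simple enough. A rough C# sketch (the class name and the simplification of only honouring the * section are my own) might look like this:

using System;
using System.Collections.Generic;

class RobotsTxtParser
{
    public List<string> Disallowed { get; } = new List<string>();
    public List<string> Sitemaps { get; } = new List<string>();

    // collect the global (User-agent: *) Disallow rules and any Sitemap directives
    public void Parse(string robotsTxt)
    {
        bool inGlobalSection = false;

        foreach (string rawLine in robotsTxt.Split('\n'))
        {
            string line = rawLine.Split('#')[0].Trim(); // strip comments
            if (line.Length == 0) continue;

            int idx = line.IndexOf(':');
            if (idx < 0) continue;

            string field = line.Substring(0, idx).Trim().ToLowerInvariant();
            string value = line.Substring(idx + 1).Trim();

            if (field == "user-agent")
                inGlobalSection = (value == "*");
            else if (field == "disallow" && inGlobalSection && value.Length > 0)
                Disallowed.Add(value);
            else if (field == "sitemap") // sitemap directives apply to everyone
                Sitemaps.Add(value);
        }
    }

    // check a URL path against the Disallow rules before crawling it
    public bool IsAllowed(string path)
    {
        foreach (string rule in Disallowed)
            if (path.StartsWith(rule, StringComparison.OrdinalIgnoreCase)) return false;

        return true;
    }
}

A real BOT would download the file first (e.g. with WebClient.DownloadString), honour the section for its own user-agent token as well as *, and call IsAllowed on every path before requesting it.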


2. Identify yourself correctly

Whilst it may not be set in stone, there is a "standard" for BOTS to identify themselves correctly in their user-agents, and all the proper search engines and crawlers supply a correct user-agent.

If you look at some common ones such as Google or BING or a Twitter BOT we can see a common theme.

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Mozilla/5.0 (compatible; TweetedTimes Bot/1.0; +http://tweetedtimes.com)

They all:
-Provide information on the browser compatibility e.g Mozilla/5.0.
-Provide their name e.g Googlebot, bingbot, TweetedTimes.
-Provide their version e.g 2.1, 2.0, 1.0
-Provide a URL where we can find out information about the BOT and what it does e.g http://www.google.com/bot.html, http://www.bing.com/bingbot.htm and http://tweetedtimes.com

On the systems I control, and on many others that use common intrusion detection systems at firewall and system level (even WordPress plugins), having a blank user-agent or a short one that doesn't contain a link or email address is enough to get a 403 or ban.

At the very least a BOT should provide some way to let the site owner find out who owns the BOT and what the BOT does.

Having a user-agent of "C4BOT" or "Oodlebot" is just not good enough.

If you are a new crawler identify yourself so that I can search for your URL and see whether I should ban you or not. If you don't identify yourself I will ban you!
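
As a rough illustration of that policy, a first-pass check could be sketched like this in C# (the 10 character threshold and the helper name are my own choices, not any kind of standard):

using System;
using System.Text.RegularExpressions;

static class UserAgentChecks
{
    // blank or very short agents, or agents with no URL or email address identifying
    // their owner, get refused with a 403
    public static bool LooksLegitimate(string userAgent)
    {
        if (string.IsNullOrWhiteSpace(userAgent) || userAgent.Trim().Length <= 10)
            return false;

        bool hasUrl = Regex.IsMatch(userAgent, @"https?://|www\.", RegexOptions.IgnoreCase);
        bool hasEmail = Regex.IsMatch(userAgent, @"[\w\.\-]+@[\w\.\-]+\.\w+");

        return hasUrl || hasEmail;
    }
}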


3. Set up a Reverse DNS Entry

I am now using the "standard" way of validating crawlers against the IP address they crawl from.

This involves doing a reverse DNS lookup with the IP used by the bot.

If you haven't got this set up then tough luck. If you have, then I will do a forward DNS lookup to make sure the IP is registered with the host name.

I think most big crawlers are starting to come on board with this way of doing things now. Plus it is a great way to identify that GoogleBot really is GoogleBot, especially when the use of user-agent switcher tools is so common nowadays.

I also have a lookup table of IP/user-agents for the big crawlers I allow. However, if GoogleBot or BING start using new IP addresses that I don't know about, the only way I can correctly identify them (especially after experiencing "GoogleBOT" hacking my site) is by doing this 2 step DNS verification routine.
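
A sketch of that 2 step verification in C# might look like this (using the System.Net.Dns class; the Googlebot domain suffixes are the ones Google documents, the rest is illustrative):

using System;
using System.Linq;
using System.Net;
using System.Net.Sockets;

static class CrawlerVerification
{
    // Step 1: reverse DNS on the visiting IP. Step 2: forward DNS on the returned host
    // name to confirm it resolves back to the same IP. The suffix check here is for
    // Googlebot, which should crawl from googlebot.com or google.com hosts.
    public static bool IsVerifiedGooglebot(string ipAddress)
    {
        try
        {
            IPHostEntry reverse = Dns.GetHostEntry(ipAddress);
            string host = reverse.HostName.ToLowerInvariant();

            if (!host.EndsWith(".googlebot.com") && !host.EndsWith(".google.com"))
                return false;

            IPHostEntry forward = Dns.GetHostEntry(host);
            return forward.AddressList.Any(a => a.ToString() == ipAddress);
        }
        catch (SocketException)
        {
            // no PTR record or the lookup failed - treat as unverified
            return false;
        }
    }
}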


4. Job Raping / Scraping is not allowed under any circumstances. 

If you are crawling my system then you must have permission from each site owner as well as me to do this.

I have had bots hit tiny weeny itsy bitsy jobboards with only 30 jobs and drive up to 400,000 page loads a day because of scrapers, email harvesters and bad bots.

This is bandwidth you should not be taking up, and if you are a proper job aggregator like Indeed, JobsUK or GoogleBase then you should accept XML feeds of the jobs from the sites that want their jobs to appear on your site.

Having permission from the clients (recruiters/employers) on the site is not good enough, as they do not own the content; the site owner does. From what I have seen, the only job aggregators who crawl rather than accept feeds are those who can't, for whatever reason, get the jobs the correct way.

I have put automated traffic analysis reports into my systems that let me know at regular intervals which bots are visiting me, which visitors are heavy hitting, and which are spoofing, hacking, raping or engaging in other forms of content pillaging.

It really is like an arms race from the cold war and I am banning bots every day of the week for breaking these 4 simple to follow rules.

If you are a legitimate bot then it's not too hard to come up with a user-agent that identifies you correctly, set up a reverse DNS entry, follow the robots.txt rules and not visit my site every day crawling every single page!

Thursday, 12 September 2013

SEO - Search Engine Optimization

My two cents worth about Search Engine Optimisation - SEO

Originally Posted - 2009
UPDATED - 12th Sep 2013

SEO is big bucks at the moment and it seems to be one of those areas of the web with lots of snake oil salesmen and "SEO experts" who will promise number 1 positioning on Google, Bing and Yahoo for $$$ per month.

It is one of those areas that I didn't really pay much attention to when I started web developing, mainly because I was not the person paying for the site and relying on leads coming from the web. However, as I have worked on more and more sites over the years, it has become blatantly apparent to me that SEO comes in two forms from a development or sales point of view.

There are the forms of SEO which are basically good web development practice and which come about naturally from having a good site structure and making the site usable and readable, as well as helping in terms of accessibility.

Then there are the forms which people try to bolt onto a site afterwards, either as an afterthought or because an SEO expert has charged the site lots of money, promised the impossible, and wants to use some dubious link-sharing schemes that are believed to work.


Cover the SEO Basics when developing the site


It's a lot harder to just "add some Search Engine Optimization" in once a site has been developed, especially if you are developing generic systems that have to work for numerous clients.

I am not an SEO expert and I don't claim to be, otherwise I would be charging you lots of money for this advice and making promises that are impossible to keep. However, following these basic tips will only help your site's SEO.

Make sure all links have title attributes on them and contain worthy content rather than words like "click here". The content within the anchor tags matters when those bots come crawling in the dead of night.

You should also make sure all images have ALT attributes as well as titles, and make sure the content of both differs. As far as I know Googlebot will rate ALT content higher than title content, but it cannot hurt to have both.

Make sure you make use of header tags to mark out the important sections of your site, and try to use descriptive wording rather than "Section 1" etc.

Also, as I'm sure you have noticed if you have read my blogs before, I wrap keywords and keyword-rich sentences in strong tags.

I know that Google will also rank emphasised content or content marked as strong over normal content. So as well as helping those readers who skim read to view just the important parts, it also tells Google which words are important in my article.

Write decent content and don't just fill up your pages with visible or non-visible spammy keywords.

In the old days keyword density mattered when ranking content. This was calculated by removing all the noise words and other guff (CSS, JavaScript etc) and then calculating what percentage of the overall page content was made up of relevant keywords.
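
For what it's worth, the old-school calculation was roughly the following (a quick sketch; the noise word list and the crude tag stripping are purely illustrative):

using System;
using System.Linq;
using System.Text.RegularExpressions;

static class KeywordDensity
{
    // keyword occurrences as a percentage of total words, after stripping markup
    // and ignoring a few noise words
    public static double Calculate(string pageHtml, string keyword)
    {
        string[] noiseWords = { "the", "and", "a", "of", "to", "in" }; // example list only

        string stripped = Regex.Replace(pageHtml, @"<[^>]+>", " "); // crude tag removal

        var words = Regex.Matches(stripped, @"[A-Za-z0-9']+")
                         .Cast<Match>()
                         .Select(m => m.Value.ToLowerInvariant())
                         .Where(w => !noiseWords.Contains(w))
                         .ToList();

        if (words.Count == 0) return 0;

        int hits = words.Count(w => w == keyword.ToLowerInvariant());
        return (double)hits / words.Count * 100;
    }
}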

Nowadays the bots are a lot cleverer and will penalise content that does this as it looks like spam.

Also, it's good for your users to have readable content, and you shouldn't remove words between keywords as it makes the text harder to read and you will lose out on the longer 3, 4 and 5 word indexable search terms (called long-tail in the SEO world).

Saying this though, it's always good to remove filler from your pages, for example by putting your CSS and JavaScript code into external files when possible and removing large commented-out sections of HTML.

You should also aim to put your most important content at the top of the page so it's the first thing crawled.

Try moving main menus and other content that can be positioned by CSS to the bottom of the file. This is so that social media sites and other BOTS that take the "first image" on an article and use it in their own social snippets don't accidentally use an advertiser's banner instead of your logo or main article picture.

The same thing goes for links. If you have important links in the footer, such as links to site indexes, then try getting them higher up the HTML source.

I have seen Google recommend a maximum of 100 links per page. Therefore having a homepage with your most important links at the bottom of the HTML source but 200+ links above them (e.g links to searches, even if not all of them are visible) can be harmful.

If you are using a tabbed interface to switch between tabs of links then the links will still be in the source code, but if they are loaded in by JavaScript on demand then that's no good at all, as a lot of crawlers don't run JavaScript.

Tools such as ISAPI URL rewriting are very good for SEO, plus they give your site nicer URLs to display.

For example using a site I have just worked on as an example http://jobs.professionalpassport.com/companies/perfect-placement-uk-ltd is a much nicer URL to view a particular company profile than the underlying real URL which could also be accessed as http://jobs.professionalpassport.com/jobboard/cands/compview.asp?c=6101

If you can access that page by both links and you don't want to be penalised for duplicate content then you should specify which link you want to be indexed by specifying your canonical link. You should also use your Robots.txt file to specify that the non-rewritten URLs are not to be indexed e.g.

Disallow: /jobboard/cands/compview.asp

META tags such as the keywords tag are not considered as important as they once were; having good keyword-rich content in the main section of the page is the way to go, rather than filling up that META tag with hundreds of keywords.

The META Description will still be used to help describe your page on search results pages and the META Title tag is very important to describe your page's content to the user and BOT.

However some people are still living in the 90's and seem to think that stuffing their META Keywords with spam is the ultimate SEO trick when in reality that tag is probably ignored by most crawlers nowadays.

Set up a Sitemap straight away containing your site's pages ranked by their importance, how often they change, last modified date etc. The sooner you do this, the quicker your site will be getting indexed and gaining site authority. It doesn't matter if it's not 100% ready yet; the sooner it's in the indexes the better.

Whilst you can do this through Google's Webmaster Tools or Microsoft's Bing equivalent, you don't actually need to use their tools; as long as you use a sitemap directive in your robots.txt file, BOTS will find it e.g.

Sitemap: http://www.strictly-software.com/sitemap_110908.xml

You can also use tools such as the wonderful SEOBook Toolbar, an add-on for Firefox which combines numerous free online SEO tools into one helpful toolbar. It lets you see your Page Rank and compare your site to competitors on various keywords across the major search engines.

Also, using a text browser such as Lynx to see how your site would look to a crawler such as Yahoo or Google is a good trick, as it will skip all the styling and JavaScript and show you the site much as a BOT would see it.

There are many other good practices which are basic "musts" in this day and age, and the major search engines are moving more and more towards social media when it comes to indexing sites and seeing how popular they are.

You should set up a Twitter account and make sure each article is published to it as well as engaging with your followers.

A Facebook Fan page is also a good method of getting people to view snippets of your content and then find your site through the world's most popular social media website.

Making your website friendly for people viewing it on tablets or smart phones is also good advice as more and more people are using these devices to view Internet content.

The Other form of SEO, Black Magic Optimization

The other form of Search engine optimization is what I would call "black magic SEO" and it comes in the form of SEO specialists that will charge you lots of money and make impossible claims about getting you to the number one spot in Google for your major keywords and so on.

The problem with SEO is that no-one knows exactly how Google and the others calculate their rankings so no-one can promise anything regarding search engine positioning.

There is Google's PageRank, which is used in relation to other forms of analysis, and it basically means that if a site with a high PR links to your site and you do not link back to the original site, then it tells Google that your site has higher site authority than the linking site.

If your site only links out to other sites but doesn't have any links coming in from high page ranked relevant sites then you are unlikely to get a high page rank yourself. This is just one of the ways which Google will use to determine how high to place you in the rankings when a search is carried out.

Having lots of links coming in from sites that have nothing whatsoever to do with your site may help drive traffic but will probably not help your PR. Therefore engaging in all these link exchange systems is probably worth jack nipple, as unless the content that links to your site is relevant or related in some way it's just seen as a link for a link's sake i.e spam.

Some "SEO specialists" promote special schemes which have automated 3 way linking between sites enrolled on the scheme.

They know that just having two unrelated sites link to each other basically negates the Page Rank so they try and hide this by having your site A linking to site B which in turn links to site C that then links back to you.

The problem is obviously getting relevant sites linking to you rather than every tom dick and harry.

Also, advertising on other sites purely to get indexed links from that site to yours to increase PR may not work, due to the fact that most of the large advert management systems output banner adverts using Javascript. Therefore, although the advert will appear on the site and drive traffic when people click it, you will not get the benefit of an indexed link, because when the crawlers come to index the page containing the advert the banner image and any link to your site won't be there.

Anyone who claims that they can get you to the top spot in Google is someone to avoid!

The fact is that Google and the others are constantly changing the way they rank and what they penalise for, so something dubious that works currently could actually harm you down the line.

For example, in the old days people would put hidden links on white backgrounds or position them out of sight so that the crawlers would hit them but the users wouldn't see them. This worked for a while until Google and the others cracked down and penalised for it.

Putting any form of content up specifically for a crawler is seen as dubious and you will be penalised for doing it.

Google and BING want to crawl the content that a normal user would see and they have actually been known to mask their own identity ( IP and User-Agent ) when crawling your site so that they can check whether this is the case or not.

My advice would be to stick to the basics, don't pay anybody who makes any kind of promise about result ranking and avoid like the plague any scheme that is "unbeatable" and promises unrivalled PR within only a month or two.

Monday, 19 April 2010

Banning Bad Bots with Mod Rewrite

Banning Scrapers and Other Bad Bots using Mod Rewrite

There are many brilliant sites out there dedicated to the never ending war on bad bots and I thought I would add my own contribution to the lists of regular expressions used for banning spammers, scrapers and other malicious bots with Mod Rewrite.

As with all security measures a sys admin takes to lock down and protect a site or server, a layered approach is best. You should utilise as many different methods as possible so that an error, misconfiguration or breach of one ring in your defence does not mean your whole system is compromised.

The major problem with using the htaccess file to block bots by useragent or referrer is that any hacker worth the name would easily get round the rules by changing their agent string to a known browser or by hiding the referrer header.

However, in spite of this obvious fact, it still seems that many bots currently doing the rounds scraping and exploiting are not bothering to do this, so it's still worth doing. I run hundreds of major sites and have my own logger system which automatically scans traffic and blocks visitors who I catch hacking, spamming, scraping and heavy hitting. Therefore I regularly have access to the details of badly behaving bots, and whereas a large percentage of them will hide behind an IE or Firefox useragent, a large percentage still use identifying agent strings that can be matched and banned.

The following directives are taken from one of my own personal sites and therefore I am cracking down on all forms of bandwidth theft. Anyone using an HTTP library like CURL, Snoopy, WinHTTP and not bothering to change their useragent will get blocked. If you don't want to do this then don't just copy and paste the rules.
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /

# Block blank or very short user-agents. If they cannot be bothered to tell me who they are or provide gibberish then they are not welcome!
RewriteCond %{HTTP_USER_AGENT} ^(?:-?|[a-z1-9\-\_]{1,10})$ [NC]
RewriteRule .* - [F,L]

# Block a number of libraries, email harvesters, spambots, hackbots and known bad bots

# I know these libraries are useful but if the user cannot be bothered to change the agent they are worth blocking plus I only want people
# visiting my site in their browser. Automated requests with CURL usually means someone is being naughty so tough!
# HTTP libraries
RewriteCond %{HTTP_USER_AGENT} (?:ColdFusion|curl|HTTPClient|Java|libwww|LWP|Nutch|PECL|PHP|POE|Python|Snoopy|urllib|Wget|WinHttp) [NC,OR]
# hackbots or sql injection detector tools being misused!
RewriteCond %{HTTP_USER_AGENT} (?:ati2qs|cz32ts|indy|library|linkcheck|Morfeus|NV32ts|Pangolin|Paros|ripper|scanner) [NC,OR]
# offline downloaders and image grabbers
RewriteCond %{HTTP_USER_AGENT} (?:AcoiRobot|alligator|auto|bandit|capture|collector|copier|disco|devil|downloader|fetch|flickbot|hook|igetter|jetcar|leach|mole|miner|mirror|race|reaper|sauger|sucker|site|snake|stripper|vampire|weasel|whacker|xenu|zeus|zip) [NC]
RewriteRule .* - [F,L]

# fake referrers and known email harvesters which I send off to a honeytrap full of fake emails
# spambots, email harvesters
RewriteCond %{HTTP_USER_AGENT} (?:atomic|collect|e?mail|magnet|reaper|siphon|sweeper|harvest|(microsoft\surl\scontrol)|wolf) [NC,OR]
RewriteCond %{HTTP_REFERER} ^[^?]*(?:iaea|\.ideography|addresses)(?:(\.co\.uk)|\.org\.com) [NC]
# redirect to a honeypot
RewriteRule ^.*$ http://english-61925045732.spampoison.com [R,L]


# copyright violation and brand monitoring bots
RewriteCond %{REMOTE_ADDR} ^12\.148\.196\.(12[8-9]|1[3-9][0-9]|2[0-4][0-9]|25[0-5])$ [OR]
RewriteCond %{REMOTE_ADDR} ^12\.148\.209\.(19[2-9]|2[0-4][0-9]|25[0-5])$ [OR]
RewriteCond %{REMOTE_ADDR} ^63\.148\.99\.2(2[4-9]|[3-4][0-9]|5[0-5])$ [OR]
RewriteCond %{REMOTE_ADDR} ^64\.140\.49\.6([6-9])$ [OR]
RewriteCond %{HTTP_USER_AGENT} (?:NPBot|TurnitinBot) [NC]
RewriteRule .* - [F,L]


# Image hotlinking blocker - replace any hotlinked images with a banner advert for the latest product I want free advertising for!
RewriteCond %{HTTP_REFERER} !^$
# change to your own site domain!
RewriteCond %{HTTP_REFERER} !^http://(www\.)?##SITEDOMAIN##\.com/.*$ [NC]
# ensure image indexers don't get blocked
RewriteCond %{HTTP_REFERER} !^https?://(?:images\.|www\.|cc\.)?(cache|mail|live|google|googlebot|yahoo|msn|ask|picsearch|alexa).*$ [NC]
# ensure email clients don't get blocked
RewriteCond %{HTTP_REFERER} !^https?://.*(webmail|e?mail|live|inbox|outbox|junk|sent).*$ [NC]
# replace hotlinked images with free advertising for me
RewriteRule .*\.(gif|jpe?g|png)$ http://www.some-banner-advert.com/myadvert-468x60.png [NC,L]


# Security Rules - these rules help protect your site from hacks such as sql injection and XSS

# no-one should be running TRACE or TRACK requests against my site!
RewriteCond %{REQUEST_METHOD} ^(TRACE|TRACK)
RewriteRule .* - [F]

# My basic rules for catching SQL Injection - covers the majority of the automated attacks currently doing the rounds

# SQL Injection and XSS hacks - Most hackbots will malform links which would otherwise just show up as 500 errors. Instead I use a special hack.php page to log details of the hacker and ban them by IP in future
# Works with the following extensions .php .asp .aspx .jsp so change/remove accordingly and change the name of the hack.php page or replace it with [F,L]
RewriteRule ^/.*?\.(?:aspx?|php|jsp)\?(.*?DECLARE[^a-z]+\@\w+[^a-z]+N?VARCHAR\((?:\d{1,4}|max)\).*)$ /hack\.php\?$1 [NC,L,U]
RewriteRule ^/.*?\.(?:aspx?|php|jsp)\?(.*?sys.?(?:objects|columns|tables).*)$ /hack\.php\?$1 [NC,L,U]
RewriteRule ^/.*?\.(?:aspx?|php|jsp)\?(.*?;EXEC\(\@\w+\);?.*)$ /hack\.php\?$1 [NC,L,U]
# XSS hacks
RewriteRule ^/.*?\.(?:aspx?|php|jsp)\?(.*?(%3C|<)/?script(%3E|>).*)$ /hack\.php\?$1 [NC,L,U]

# Bad requests which look like attacks (these have all been seen in real attacks)
RewriteRule ^[^?]*/(owssvr|strmver|Auth_data|redirect\.adp|MSOffice|DCShop|msadc|winnt|system32|script|autoexec|formmail\.pl|_mem_bin|NULL\.) /hack.php [NC,L]

</IfModule>


Security, Free Advertising and Less Bandwidth

As you can see I have condensed the rules into sections to keep the file manageable. The major aims of the rules are to:
  1. Improve security by blocking hackbots before they can reach the site.
  2. Reduce bandwidth by blocking the majority of automated requests that are not from known indexers such as Googlebot or Yahoo.
  3. Piss those people off who think they can treat my bandwidth as a drunk treats a wedding with a free bar. Known email scrapers go off to a honeypot full of fake email addresses and hotlinkers help me advertise my latest site or product by displaying banners for me.

Is it a pointless task?

A lot of the bad bot agent strings are well known, and there are many more which could be added if you so wish; however, trying to keep a static file maintained with the latest bad boys is a pointless and thankless task. The best way is to automate the tracking of bad bots by turning the robots.txt file into a dynamic page that logs each request to a file or database. Then you place directives in the robots.txt to DENY access to a special directory or file, and place links to that location on your site. Any agents who ignore the robots.txt file and crawl these links can then be logged and blocked.
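
A rough sketch of that dynamic robots.txt idea in C# (an ASP.NET IHttpHandler is just one way of doing it; the trap directory, log path and sitemap URL are made-up examples):

using System;
using System.IO;
using System.Web;

// serve the robots.txt directives, including a trap directory that nothing legitimate
// should ever visit, and log every requester. Mapping this handler to the /robots.txt
// URL via routing or a rewrite rule is not shown here.
public class RobotsTxtHandler : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        string logLine = String.Format("{0}\t{1}\t{2}{3}",
            DateTime.UtcNow.ToString("s"),
            context.Request.UserHostAddress,
            context.Request.UserAgent ?? "(blank)",
            Environment.NewLine);

        // in production this would go to a database; a flat file keeps the sketch simple
        File.AppendAllText(context.Server.MapPath("~/App_Data/robots-requests.log"), logLine);

        context.Response.ContentType = "text/plain";
        context.Response.Write("User-agent: *\nDisallow: /trap/\nSitemap: http://www.example.com/sitemap.xml\n");
    }

    public bool IsReusable { get { return true; } }
}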

I also utilise my own database-driven logger system that constantly analyses the traffic looking for bad users who can then be banned. I have an SQL function that checks for hack attempts and spam by pattern matching the stored querystring, as well as looking for heavy hitters (agents/IPs requesting lots of pages in a short time period). This helps me prevent DDOS attacks as well as scrapers who think they can take 1000+ jobs without saying thank you!
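
The heavy hitter side of that check boils down to counting requests per IP inside a sliding time window. A minimal in-memory C# sketch (my real version runs in SQL against the logger database; the window and threshold here are invented for illustration):

using System;
using System.Collections.Generic;

class HeavyHitterDetector
{
    private readonly Dictionary<string, List<DateTime>> _hits = new Dictionary<string, List<DateTime>>();
    private readonly TimeSpan _window = TimeSpan.FromMinutes(5);
    private readonly int _maxRequests = 300; // example threshold, tune per site

    // record a request and report whether this IP is now hitting the site too hard
    public bool RecordAndCheck(string ip)
    {
        DateTime now = DateTime.UtcNow;

        if (!_hits.TryGetValue(ip, out List<DateTime> times))
        {
            times = new List<DateTime>();
            _hits[ip] = times;
        }

        times.Add(now);
        times.RemoveAll(t => now - t > _window); // drop requests outside the window

        return times.Count > _maxRequests;
    }
}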

A message for the other side

I know this probably defeats the point of me posting my htaccess rules, but as well as defending my own systems from attack I also make use of libraries such as CURL in my own development to make remote HTTP requests. Therefore I can see the issues involved in automated crawling from both sides, and I know all the tricks sys admins use to block as well as the tricks scrapers use to bypass them.

There are many legitimate reasons why you might need to crawl or scrape, but you should remember that what goes around comes around. Most developers will have at least one or more sites of their own, so you should know that bandwidth is not free and stealing others' will lead to bad karma. The web is full of lists containing bad agents and IP addresses obtained from log files or honey-traps, so you don't just risk being banned from the site you are intending to crawl when you decide to test your new scraper out.

Remember, if you hit a site so hard and fast that it breaks (which is extremely possible in this day and age of cheaply hosted Joomla sites designed by four year olds) then the sys admin will start analysing log files looking for culprits. A quiet site that runs smoothly usually means the owner is happy and not looking for bots to ban.

  • Rather than making multiple HTTP requests cache your content locally if possible.
  • Alternate requests between domains so that you don't hit a site too hard.
  • Put random delays in between requests.
  • Obey Robots.txt and don't risk getting caught in a honeypot or infinite loop by visiting pages you shouldn't be going to.
  • Never use the default agent as set in your chosen library.
  • Don't risk getting your server blacklisted by your crawling so always use a proxy.
  • Only scrape the bare minimum that you need to do the job. If you are only checking the header then don't return the body as well.

The following articles are a good read for people interested in this topic:




Tuesday, 30 June 2009

Googlebot, Sitemaps and heavy crawling

Googlebot over-crawling a site

I recently had an issue with one of my jobboards that meant that Googlebot was over-crawling the site which was causing the following problems:
  1. Heavy loads on the server. The site in question was recording 5 million page loads a month which had doubled from 2.4 million within a month.
  2. 97% of all their traffic was accounted for by Googlebot.
  3. The site is on a shared server so this heavy load was causing very high CPU usage and affecting other sites.

The reasons the site was receiving so much traffic boiled down to the following points.

  1. The site has a large number of categories which users can filter job searches by. These categories are displayed in whole and in subsets in prominent places such as quick links and a job browser which allows users to filter results. As multiple categories can be chosen when filtering a search this meant Googlebot was crawling every possible combination in various orders.
  2. A new link had been added within the last month to the footer which passed a SessionID in the URL. The link was there to log whether users had Javascript enabled. As Googlebot doesn't keep session state or use Javascript, it meant the number of crawled URLs effectively doubled, as each page the crawler hit would contain a new link that it hadn't already spidered due to the new SessionID.
  3. The sitemap had been set up incorrectly, containing URLs that didn't need crawling as well as incorrect change frequencies.
  4. The crawl rate was set to a very high level in Webmaster tools.

Therefore a site with around a thousand jobs was receiving 200,000 page loads a day nearly all of them from crawlers. To put this in some perspective other sites with 3000+ jobs, good SEO and high PR usually get around 20,000 page loads a day from crawlers.

One of the ways I rectified this situation was by changing the crawl rate to a low custom rate of 0.2 crawls per second. This caused a nice big vertical drop in the graph, and it alarmed the site owner as he didn't realise that there is no relation between the amount of pages crawled by Google and the site's page ranking or overall search engine optimisation.


Top Tips for getting the best out of crawlers

  • Set up a sitemap and submit it to Google, Yahoo and Live.
  • Make sure only relevant URLs are put in the sitemap. For example don't include pages such as error pages and logoff pages.
  • If you are rewriting URLs then don't include the non-rewritten URL as well as this will be counted as duplicate content.
  • If you are including URLs that take IDs as parameters to display database content then make sure you don't include the URL without a valid ID. Taking the site I spoke about earlier as an example, someone had included the following
www.some-site.com/jobview.asp

instead of

www.some-site.com/jobview.asp?jobid=35056

This meant crawlers were accessing pages without content and it was a pretty pointless and careless thing to do.

  • Make sure the change frequency value is set appropriately. For example, on a jobboard a job is usually posted for between 7 and 28 days, so it only needs to be crawled between once a week and once a month depending on the number of days it was advertised for. Setting a value of "always" is inappropriate, as the content will not change every time Googlebot accesses the URL.
  • Avoid circular references such as placing links to a site-index or category listings index in the footer of each page on a site. It makes it hard for the bot to determine the site structure, as every path it drills down it is able to find the parent page again. Although I suspect the bots' technology is clever enough to realise it has found a link already spidered and not crawl it again, I have heard that it looks bad in terms of site structure.
  • Avoid dead links or links that lead to pages with no content. If you have a category index page and some categories have no content related to them then don't make the category into a link or otherwise link to a page that can show related content rather than nothing.
  • Prevent duplicate content and variations of the same URL being indexed by implementing one of the following two methods.
  1. Set your Robots.txt to disallow your non URL rewritten pages from being crawled and then only display rewritten URLS to agents identified as crawlers.
  2. Allow both forms of URL to be crawled but use a CANONICAL META tag to specify that you want the rewritten version to be indexed.
  • Ban crawlers who misbehave. If we don't spank them when they are naughty they will never learn, so punish those that misbehave. It is very easy for an automated process to parse a Robots.txt file, therefore there is no excuse for those bots that ignore the commands set out in it. If you want to know which bots ignore the Robots.txt rules then there are various ways to find out, such as parsing your webserver log files or using a dynamic Robots.txt file to record those agents that access it. There are other ways, such as using the IsBanned flag available in the Browscap.ini file, however this relies on the user-agent being correct, and more and more people spoof their agent nowadays. Not only is banning bots good for your server's performance as it reduces load, it is also good for your site's security, as bots that ignore the Robots.txt rules are more likely to hack, spam and scrape your site's content.

If you are having similar issues with over-crawling then I would advise you to first check your site's structure to see if the problem is due to bad structure, invalid sitemap values or over-categorisation before changing the crawl rate. Remember, a site's SEO is unrelated to the amount of crawler activity, and more is not necessarily better. It's not the number of crawled pages that counts but rather the quality of the content that is found when the crawlers visit.

Monday, 1 December 2008

Job Rapists and other badly behaved bots

How bad does a bot have to be before banning it?

I had to ban a particularly naughty crawler the other day that belonged to one of the many job aggregators that spring up ever so regularly. This particular bot belonged to a site called Jobrapido, which one of my customers calls "that job rapist" for the very good reason that it seems to think any job posted on a jobboard is theirs to take without asking either the job poster or the owner of the site it's posted on. Whereas the good aggregators such as JobsUK require that the job poster or jobboard submits a regular XML feed of their jobs, Jobrapido just seems to endlessly crawl every site they know about and take whatever they find. In fact on their home page they state the following:

Do you know a job site that does not appear in Jobrapido? Write to us!

Isn't that nice? Anyone can help them thieve, rape and pillage another site's jobs. Maybe there is some bonus point system for regular informers. Times are hard as we all know and people have to get their money where they can! However, it shouldn't be from taking other people's content unrequested.

This particular bot crawls so much that it actually made one of our tiniest jobboards appear in our top 5 rankings one month purely from its crawling alone. The site had fewer than 50 jobs but a lot of categories, which meant that this bot decided to visit each day and crawl every single search combination, which meant 50,000 page loads a day!

Although this Jobrapido site did link back to the original site that posted the job to allow the jobseeker to apply, the amount of referrals was tiny (150 a day across all 100+ of our sites) compared to the huge amount of bandwidth it was stealing from us (16-20% of all traffic). It was regularly ranked above Googlebot, MSN and Yahoo as the biggest crawler in my daily server reports, as well as being the biggest hitter (page requests / time).

So I tried banning it using robots.txt directives, as any legal, well-behaved bot should pay attention to that file. However, 2 weeks after adding all 3 of their agents to the file they were still paying us visits each day, and no bot should cache the file for that length of time, so I banned it using the htaccess file.

So if you work in the jobboard business and don't want a particularly heavy bandwidth thief, content scraper and robots.txt ignorer hitting your sites every day, then do yourself a favour and ban these agents and IP addresses:

Mozilla/5.0 (compatible; Jobrapido/1.1; +http://www.jobrapido.com)

"Mozilla/5.0 (compatible; Jobrapido/1.1; +http://www.jobrapido.com)"

Mozilla/5.0 (JobRapido WebPump)

85.214.130.145
85.214.124.254
85.214.20.207
85.214.66.152

If you want a good UK job aggregator that doesn't pinch sites' jobs without asking first then use JobsUK. It's simple to use and has thousands of new jobs from hundreds of jobboards added every day.