Monday, 1 December 2008

Job Rapists and other badly behaved bots

How bad does a bot have to be before banning it?

I had to ban a particularly naughty crawler the other day that belonged to one of the many job aggregators that spring up ever so regularly. This particular bot belonged to a site called jobrapido, which one of my customers calls "that job rapist" for the very good reason that it seems to think any job posted on a jobboard is theirs to take without asking either the job poster or the owner of the site it's posted on. Whereas the good aggregators such as JobsUK require the job poster or jobboard to submit a regular XML feed of their jobs, rapido just seems to endlessly crawl every site they know about and take whatever they find. In fact on their home page they state the following:

Do you know a job site that does not appear in Jobrapido? Write to us!

Isn't that nice? Anyone can help them thieve, rape and pillage another site's jobs. Maybe there is some bonus point system for regular informers. Times are hard as we all know and people have to get their money where they can! However it shouldn't be from taking other people's content unrequested.

This particular bot crawls so much that it actually made one of our tiniest jobboards appear in our top 5 rankings one month purely from its crawling alone. The site had fewer than 50 jobs but a lot of categories, so the bot visited every day and crawled every single search combination, which amounted to 50,000 page loads a day!

Although the rapido site did link back to the original site that posted the job so that jobseekers could apply, the number of referrals was tiny (150 a day across all 100+ of our sites) compared to the huge amount of bandwidth it was stealing from us (16-20% of all traffic). It was regularly ranked above Googlebot, MSN and Yahoo as the biggest crawler in my daily server reports, as well as being the biggest hitter (page requests / time).

So I tried banning it using robots.txt directives, as any legal, well behaved bot should pay attention to that file. However, 2 weeks after adding all 3 of their agents to the file they were still paying us visits each day, and since no bot should cache the file for that length of time I banned it using the .htaccess file instead.

So if you work in the jobboard business and don't want a particularly heavy bandwidth thief, content scraper and robots.txt ignorer hitting your sites every day, then do yourself a favour and ban these agents and IP addresses (a sample robots.txt and .htaccess ban is sketched after the list):

Mozilla/5.0 (compatible; Jobrapido/1.1; +http://www.jobrapido.com)

Mozilla/5.0 (JobRapido WebPump)

85.214.130.145
85.214.124.254
85.214.20.207
85.214.66.152
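
For reference, this is roughly what such a ban looks like. The robots.txt tokens and the Apache 2.2 style access rules below are only a sketch based on the agent strings above, so adjust them to your own server setup (and bear in mind this particular bot ignored our robots.txt anyway):

# robots.txt - polite bots should obey this, although Jobrapido did not for us
User-agent: Jobrapido
Disallow: /

User-agent: JobRapido WebPump
Disallow: /

# .htaccess - block by user agent and IP (Apache 2.2 style access control assumed)
SetEnvIfNoCase User-Agent "jobrapido" bad_bot
SetEnvIfNoCase User-Agent "webpump" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
Deny from 85.214.130.145
Deny from 85.214.124.254
Deny from 85.214.20.207
Deny from 85.214.66.152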

If you want a good UK Job Aggregator that doesn't pinch sites' jobs without asking first then use JobsUK. It's simple to use and has thousands of new jobs from hundreds of jobboards added every day.

Adding Remote URL content to Index

Indexing a remote URL for use in a knowledge base

I have just completed work on a small knowledge base built in ASP.NET which includes a few quite funky features, one of which is the ability to add an article into the system that lives at a remote location. Most of the articles revolve around written content or files which are attached to the articles, but sometimes users come across an article on the web that they think would be great to add to the system, and they want it to be indexed and searchable just like any other article. In my previous incarnation of this, which I hastily wrote one night back in the late 90's in classic ASP, you could add a URL but the only indexable content that could be used to find it in the knowledge base was the tag words I allowed the user to add alongside the URL. Obviously this isn't really good enough, so in the latest version on saving the article I do the following:

  1. Check the URL looks valid using a regular expression (a sample check is sketched after this list).
  2. Access the URL through a proxy server and return the HTML source.
  3. Locate and store the META keywords, description and title if they exist.
  4. Remove everything apart from content between the start and close BODY tags.
  5. From the body I strip any SCRIPT tags and anything between them.
  6. Remove all HTML tags.
  7. Clean the remaining content by removing noise words, numbers and swear words.
  8. I add the remaining content which consists of good descriptive wording to the META keywords, description and title which I stored earlier.
  9. I save this content to the database which then updates the Full Text Index so that it becomes searchable by the site users.
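
For step 1, the check is nothing more elaborate than a regular expression along these lines; the pattern below is just an illustrative sketch rather than the exact one in the live system.

// rough URL validation sketch - the pattern is an example only, not the production one
// requires: using System; using System.Text.RegularExpressions;
public static bool LooksLikeValidURL(string url)
{
    if (String.IsNullOrEmpty(url))
    {
        return false;
    }

    // require http:// or https://, a host name and an optional port, path or querystring
    Regex urlCheck = new Regex(@"^https?://[\w\-]+(\.[\w\-]+)+(:\d+)?(/\S*)?$", RegexOptions.IgnoreCase);
    return urlCheck.IsMatch(url);
}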

Following this process means I get all the benefits of having the remote article indexed and searchable without the downside of having to store the whole HTML source code. After cleaning I am left with only the core descriptive wording that is useful and can do away with all the rubbish.

I will show you the two main methods, written in C#: one retrieves the URL content and the other cleans the source.


1. Method to access remote URL through proxy server.




// requires: using System; using System.IO; using System.Net; using System.Text;
public static string GetURLHTML(string remoteURL, string proxyServer)
{
    string remoteURLContent = "";

    WebProxy proxy = new WebProxy(proxyServer, true); // pass the name of the proxy server, bypassing it for local addresses
    WebRequest webReq = WebRequest.Create(remoteURL);
    webReq.Proxy = proxy; // set request to use proxy

    // Set the HTTP-specific UserAgent property so those sites know who's come and ripped them up
    if (webReq is HttpWebRequest)
    {
        ((HttpWebRequest)webReq).UserAgent = ".NET Framework Strategies Knowledge Base Article Parser v1.0"; // set up my useragent
    }

    WebResponse webResp;
    int responseStatusCode = 0;

    try
    {
        // Get the response instance
        webResp = (HttpWebResponse)webReq.GetResponse();

        // Read an HTTP-specific property
        if (webResp is HttpWebResponse)
        {
            responseStatusCode = (int)((HttpWebResponse)webResp).StatusCode;
        }
    }
    catch (Exception)
    {
        // the request failed (timeout, DNS failure etc) so return the empty string
        return remoteURLContent;
    }

    // we can only collect HTML from valid responses so ignore 404s and 500s
    if (responseStatusCode != 200)
    {
        webResp.Close();
        return remoteURLContent;
    }

    // Get the response stream and read the HTML source into a string
    using (Stream respStream = webResp.GetResponseStream())
    using (StreamReader reader = new StreamReader(respStream, Encoding.ASCII))
    {
        remoteURLContent = reader.ReadToEnd();
    }

    // Close the response
    webResp.Close();

    return remoteURLContent;
}



The reason I use a proxy is down to the security policy set on our web servers.


2. Method to gather the main content.



// requires: using System; using System.Configuration; using System.Text; using System.Text.RegularExpressions; using System.Web;
// When the article poster wants us to save a remote URL as the KB article content we need to get the content and parse it
protected string IndexURL(string remoteURL)
{
    string METAKeywords = "", METADescription = "", METATitle = "";
    string cleanHTML = "";
    StringBuilder indexText = new StringBuilder();

    // As I have to access all remote URLs through a proxy server I access my application setting from the web.config file
    string proxyServer = ConfigurationManager.AppSettings["ProxyServer"].ToString();

    // now access the remote URL and return the HTML source code if we can
    string remoteURLHTML = UtilLibrary.GetURLHTML(remoteURL, proxyServer);

    // if we have some HTML content to parse and clean
    if (!String.IsNullOrEmpty(remoteURLHTML))
    {
        remoteURLHTML = remoteURLHTML.ToLower(); // lower case it all as a) it doesn't matter and b) it means no need for ignore-case options in the regular expressions

        // Set up some regular expressions to help identify the META content we want to index in the source
        Regex HasKeywords = new Regex("<meta\\s+name=\"keywords\"");
        Regex HasDescription = new Regex("<meta\\s+name=\"description\"");
        Regex HasTitle = new Regex("<title>");

        // As I am using replaces to quickly return the content I require I do a test first for the relevant tag, otherwise if the source doesn't
        // contain the META tag we will be left with the whole HTML source which we obviously don't want!!
        if (HasKeywords.IsMatch(remoteURLHTML))
        {
            // get the data we require by replacing anything either side of the tag
            METAKeywords = "KEYWORDS = " + Regex.Replace(remoteURLHTML, "((?:.|\n)+?<meta\\s+name=\"keywords\"\\s+content=\")(.+)(\"(?:.|\n)+)", "$2");
        }
        if (HasDescription.IsMatch(remoteURLHTML))
        {
            METADescription = "DESCRIPTION = " + Regex.Replace(remoteURLHTML, "((?:.|\n)+?<meta\\s+name=\"description\"\\s+content=\")(.+)(\"(?:.|\n)+)", "$2");
        }
        if (HasTitle.IsMatch(remoteURLHTML))
        {
            METATitle = "TITLE = " + Regex.Replace(remoteURLHTML, "((?:.|\n)+?<title>)(.+)(<\\/title>(?:.|\n)+)", "$2");
        }

        cleanHTML = remoteURLHTML;

        // now get the main content which is between the open and close body tags
        cleanHTML = Regex.Replace(cleanHTML, "((?:.|\n)+?<body.*?>)((?:.|\n)+?)(<\\/body>(?:.|\n)+)", "$2");

        // strip any client side script by removing anything between open and close script tags
        cleanHTML = Regex.Replace(cleanHTML, "<script.*?</script>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);

        // put a gap before words that appear just before closing tags so that we keep gaps between values from listboxes
        cleanHTML = Regex.Replace(cleanHTML, "(\\w)(<\\/\\w)", "$1 $2");

        // strip the remaining HTML tags
        cleanHTML = Regex.Replace(cleanHTML, "<[^>]+?>", "");

        // Decode the HTML so that any encoded HTML entities get stripped
        cleanHTML = HttpUtility.HtmlDecode(cleanHTML);

        // now add all the content we want to index back together
        if (!String.IsNullOrEmpty(METAKeywords))
        {
            indexText.Append(METAKeywords + " ");
        }
        if (!String.IsNullOrEmpty(METADescription))
        {
            indexText.Append(METADescription + " ");
        }
        if (!String.IsNullOrEmpty(METATitle))
        {
            indexText.Append(METATitle + " ");
        }
        if (!String.IsNullOrEmpty(cleanHTML))
        {
            indexText.Append(cleanHTML);
        }
    }

    return indexText.ToString();
}


I have left out the other function that strips noise words, numbers and swear words as it's nothing special, just a couple of loops that check some arrays containing the words that need removing.
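
For anyone curious, the idea is roughly along the lines of the sketch below; the word lists are tiny examples and the method name is made up purely for illustration.

// rough sketch of the noise word clean up - the word lists here are tiny examples, the real ones are much longer
// requires: using System; using System.Text; using System.Text.RegularExpressions;
public static string StripNoiseWords(string text)
{
    string[] noiseWords = { "the", "and", "with", "from", "that" }; // example noise words only
    string[] swearWords = { }; // populate with whatever you consider unprintable

    StringBuilder cleaned = new StringBuilder();

    foreach (string word in text.Split(new char[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries))
    {
        // drop anything that is purely numeric
        if (Regex.IsMatch(word, @"^\d+$"))
        {
            continue;
        }

        // drop noise words and swear words
        if (Array.IndexOf(noiseWords, word) >= 0 || Array.IndexOf(swearWords, word) >= 0)
        {
            continue;
        }

        cleaned.Append(word);
        cleaned.Append(' ');
    }

    return cleaned.ToString().Trim();
}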

The performance of this method varies slightly depending on the size of the content being parsed. It's also possible to leave any noise words and numbers in the content, as these will not get added to a Full Text Index anyway; SQL Server automatically ignores most noise words and numbers. However, if data storage is an issue you may still want to strip them so that you only save the core content to the database table.
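
For completeness, the save in step 9 is just a parameterised update against the column covered by the Full Text Index; something like the sketch below, although the table, column and connection string names are made up for the example.

// minimal save sketch - KB_Articles, SearchText and the "KnowledgeBase" connection string name are assumptions for the example
// requires: using System.Configuration; using System.Data.SqlClient;
protected void SaveArticleSearchText(int articleID, string indexText)
{
    string connString = ConfigurationManager.ConnectionStrings["KnowledgeBase"].ConnectionString;

    using (SqlConnection conn = new SqlConnection(connString))
    using (SqlCommand cmd = new SqlCommand("UPDATE KB_Articles SET SearchText = @SearchText WHERE ArticleID = @ArticleID", conn))
    {
        cmd.Parameters.AddWithValue("@SearchText", indexText);
        cmd.Parameters.AddWithValue("@ArticleID", articleID);

        conn.Open();
        cmd.ExecuteNonQuery(); // the Full Text Index picks the new content up on its next population
    }
}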