Monday, 1 December 2008

Adding Remote URL content to Index

Indexing a remote URL for use in a knowledge base

I have just completed work on a small knowledge base that I built in ASP.NET which consisted of a few quite funky features one of which was the ability to add an article into the system that was at a remote location. Most of the articles revolve around written content or files which are attached to the articles but sometimes users may come across an article on the web that they think would be great to add to the system and want it to be indexed and searchable just like any other article. In my previous incarnation of this which I hastily had written one night back in the late 90's in classic ASP you could add a URL but the only indexable content that could be used to find it in the knowledge base was the tag words I allowed the user to add alongside the URL. Obviously this isn't really good enough so in the latest version on saving the article I do the following:

  1. Check the URL looks valid using a regular expression.
  2. Access the URL through a proxy server and return the HTML source.
  3. Locate and store the META keywords, description and title if they exist.
  4. Remove everything apart from content between the start and close BODY tags.
  5. From the body I strip any SCRIPT tags and anything between them.
  6. Remove all HTML tags.
  7. Clean the remaining content by removing noise words, numbers and swear words.
  8. I add the remaining content which consists of good descriptive wording to the META keywords, description and title which I stored earlier.
  9. I save this content to the database which then updates the Full Text Index so that it becomes searchable by the site users.

Following this process means that I get all the benefits of having the remote article indexed and searchable without the downside of having to store the whole HTML source code. After cleaning I am left with only the core descriptive wording that is useful and do away with all the rubbish.

I will show you the two main methods that retrieve the URL content and cleans the source which I have done using C#.


1. Method to access remote URL through proxy server.




public static string GetURLHTML(string remoteURL, string proxyServer)
{
string remoteURLContent = "";

WebProxy proxy = new WebProxy(proxyServer, true); //pass the name of the proxy server
WebRequest webReq = WebRequest.Create(remoteURL);
webReq.Proxy = proxy; //set request to use proxy

// Set the HTTP-specific UserAgent property so those sites know whos come and ripped them up
if (webReq is HttpWebRequest)
{
((HttpWebRequest)webReq).UserAgent = ".NET Framework Strategies Knowledge Base Article Parser v1.0"; //Set up my useragent
}

WebResponse webResp;
int responseStatusCode = 0;

try{
// Get the response instance
webResp = (HttpWebResponse)webReq.GetResponse();

// Read an HTTP-specific property.
if (webResp is HttpWebResponse)
{
responseStatusCode = (int)((HttpWebResponse)webResp).StatusCode;
}
}catch(Exception ex){
return remoteURLContent;
}

//we can only collect HTML from valid responses so ignore 404s and 500s
if (responseStatusCode != 200)
{
return remoteURLContent;
}

// Get the response stream.
Stream respStream = webResp.GetResponseStream();

StreamReader reader = new StreamReader(respStream, Encoding.ASCII);
remoteURLContent = reader.ReadToEnd();

// Close the response and response stream.
webResp.Close();

return remoteURLContent;
}



The reason I use a proxy is down to the security policy set on our web servers.


2. Method to gather the main content.



//When article poster wants us to save a remote URL as the KB article content then we need to get the content and parse it
protected string IndexURL(string remoteURL)
{
KeywordParser keywordParser;
string METAKeywords = "", METADescription = "", METATitle = "";
string cleanHTML = "";
StringBuilder indexText = new StringBuilder();

//As I have to access all remote URLs through a proxy server I access my application setting from the web.config file
string proxyServer = ConfigurationManager.AppSettings["ProxyServer"].ToString();

//now access the remote URL and return the HTML source code if we can
string remoteURLHTML = UtilLibrary.GetURLHTML(remoteURL, proxyServer);

//if we have some HTML content to parse and clean
if (!String.IsNullOrEmpty(remoteURLHTML))
{
remoteURLHTML = remoteURLHTML.ToLower(); //lower case it all as a)it doesn't matter and b)means no need for ignore options in regular expressions

//Set up some regular expressions to help identify the META conent we want to index in the source
Regex HasKeywords = new Regex("<meta\\s+name=\"keywords\"");
Regex HasDescription = new Regex("<meta\\s+name=\"description\"");
Regex HasTitle = new Regex("<title>");

//As I am using replaces to quickly return the content I require I do a test first for the relevant tag otherwise if the source doesn't
//contain the META tag then we will be left with the whole HTML source which we obviously don't want!!
if (HasKeywords.IsMatch(remoteURLHTML))
{
//get the data we require by replacing anything either side of the tag
METAKeywords = "KEYWORDS = " + Regex.Replace(remoteURLHTML, "((?:.|\n)+?<meta\\s+name=\"keywords\"\\s+content=\")(.+)(\"(?:.|\n)+)", "$2");
}
if (HasDescription.IsMatch(remoteURLHTML))
{
METADescription = "DESCRIPTION = " + Regex.Replace(remoteURLHTML, "((?:.|\n)+?<meta\\s+name=\"description\"\\s+content=\")(.+)(\"(?:.|\n)+)", "$2");
}
if (HasTitle.IsMatch(remoteURLHTML))
{
METATitle = "TITLE = " + Regex.Replace(remoteURLHTML, "((?:.|\n)+?<title>)(.+)(<\\/title>(?:.|\n)+)", "$2");
}

cleanHTML = remoteURLHTML;

//now get main content which is between open close body tags
cleanHTML = Regex.Replace(cleanHTML, "((?:.|\n)+?<body.*?>)((?:.|\n)+?)(<\\/body>(?:.|\n)+)", "$2");

//strip any client side script by removing anything between open and close script tags
cleanHTML = Regex.Replace(cleanHTML, "<script.*?</script>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);

//put a gap before words that appear just before closing tags so that we keep gaps between values from listboxes
cleanHTML = Regex.Replace(cleanHTML, "(\\w)(<\\/\\w)", "$1 $2");

//strip HTML tags
cleanHTML = Regex.Replace(cleanHTML, "<[^>]+?>", "");

//Decode the HTML so that any encoded HTML entities get stripped
cleanHTML = HttpUtility.HtmlDecode(cleanHTML);

//now add all the content we want to index back together
if (!String.IsNullOrEmpty(METAKeywords))
{
indexText.Append(METAKeywords + " ");
}
if (!String.IsNullOrEmpty(METADescription))
{
indexText.Append(METADescription + " ");
}
if (!String.IsNullOrEmpty(METATitle))
{
indexText.Append(METATitle + " ");
}
if (!String.IsNullOrEmpty(cleanHTML))
{
indexText.Append(cleanHTML);
}

}

return indexText.ToString();
}


I have left out the other function that strips noise words, numbers and swear words as its nothing special just a couple of loops that check some arrays containing the noise words that need removing.

The performance of this method varies slightly depending on the size of the content that is being parsed. Also its possible to leave in the content any noise words and numbers as these will not get added to any Full Text Index anyway as SQL Server will automatically ignore most noise words and numbers. However if data storage is an issue you may still want to do this so that you only save to the database table core content.

No comments: