Showing posts with label META. Show all posts

Monday, 3 January 2011

Handling UTF-8 characters when scraping

Handling incorrectly formatted characters when scraping

Scraping can be a bit of a nightmare, as you cannot expect every web page to be written to the same standard. You will find that most of your time is spent handling discrepancies in formats, bad encoding and so on.

On one of my sites, noagendashownotes.com, I create backup links of original news stories so that if the original story gets taken down (which happens a lot) the original version is still available.

This means I have to create local versions of the remote files. This isn't too much of a problem, as it's not hard to convert relative links to absolute ones and so on. One of the problems is sites that load content such as CSS with client side Javascript code, as it's virtually impossible with a simple server side scraping tool to work out what's going on when libraries are loading other libraries and file paths are built up with Javascript. Luckily this doesn't happen too often, so I am not too concerned about it.

One thing that does happen a lot, though, is character encoding issues caused by a web page declaring one character set in its HTML while the server sends it with another. This causes problems when you scrape the page and save it to another file, because when that file is viewed in a browser as a static HTML file the browser only has the META charset tag value to go on.


Take a look at the HTML source code of a page with this problem and you will see that it uses the following META tag

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> 

which tells the browser to interpret the page using the ISO-8859-1 character set (the Latin-1 extension of ASCII).

However, if you use something like HTTP Fox to examine the response headers you will see that the actual response charset is UTF-8. The page will be using server side code, e.g. PHP, JSP or ASPX, to set this like so

Response.Charset = "UTF-8"

or

header('Content-Type: text/html; charset=UTF-8');
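A scraper can spot this kind of mismatch itself before saving the file. The following is only a rough sketch in Python (not the PHP I use on the site), and the function names `header_charset` and `charset_mismatch` are made up for illustration. It compares the charset in the Content-Type response header with the one declared in the META tag.

```python
import re
from email.message import Message

# Finds charset=value inside a <meta ...> tag, with or without quotes.
META_CHARSET = re.compile(
    r"""<meta\b[^>]*?\bcharset\s*=\s*["']?([^"'\s/>;]+)""",
    re.IGNORECASE,
)

def header_charset(content_type):
    """Return the lower-cased charset from a Content-Type header, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

def charset_mismatch(content_type, html):
    """True when the header and the META tag both declare a charset and disagree."""
    declared = header_charset(content_type)
    match = META_CHARSET.search(html)
    meta = match.group(1).lower() if match else None
    return declared is not None and meta is not None and declared != meta

page = '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />'
print(charset_mismatch("text/html; charset=UTF-8", page))
```

If either side declares no charset the function reports no mismatch, which is a deliberate simplification.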

When I scraped this page, saved it as a local file (with UTF-8 encoding) and then viewed that local file in my browser, all the extended UTF-8 characters such as special quote marks or apostrophes appeared as the usual garbage, e.g.

New York’s governor.

instead of

New York’s governor

This is because the browser only has the HTML to tell it what character set to use and this has been set incorrectly to ISO instead of UTF-8.
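The effect is easy to reproduce. Here is a small Python sketch (the helper name `view_as` is just for illustration) that saves text in one encoding and reads it back in another. I am assuming the browser treats a declared iso-8859-1 as Windows-1252, which is how browsers generally handle that label.

```python
def view_as(text, saved_encoding, declared_encoding):
    """Encode text as it was saved, then decode it the way the browser was told to."""
    return text.encode(saved_encoding).decode(declared_encoding)

original = "New York\u2019s governor"  # contains a curly apostrophe

# Saved as UTF-8 but declared as ISO-8859-1 (treated as cp1252): garbage.
print(view_as(original, "utf-8", "cp1252"))

# Saved as UTF-8 and declared as UTF-8: the text survives intact.
print(view_as(original, "utf-8", "utf-8"))
```

The curly apostrophe is three bytes in UTF-8, and each of those bytes is shown as a separate character when the wrong charset is declared.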

Because the page is now a static HTML file rather than a dynamically generated page there is no server side code setting the Response Charset headers. Again HTTP Fox is useful for examining the response headers to prove a point.

I am pretty new to PHP, so I searched around the web for suggestions on how to fix this. These included things like wrapping the file_get_contents function in an mb_convert_encoding call, e.g.

function file_get_contents_utf8($fn,$incpath,$context) {
    $content = file_get_contents($fn,$incpath,$context);
    return mb_convert_encoding($content, 'UTF-8',
        mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}

However, this didn't solve the problem, so I came up with a method that did work for what I am trying to do (i.e. create static HTML versions of dynamic pages). This method uses regular expressions to rewrite the HTML so that any mismatched CHARSET setting is changed to UTF-8.

This function is designed to replace the value for any META CHARSET tags to be UTF-8. It works with both these formats (with single or double quotes)

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
OR

<meta charset='iso-8859-1' />

function ConvertHeaderToUTF8($html){

    // look for an existing charset value in a META tag
    if(preg_match("@<meta[\s\S]+?charset=['\"]?(.+?)['\"]\s*/?>@i",$html,$match)){

        $charset = $match[1];

        // check value for UTF-8 (case-insensitively, as "utf-8" is just as valid)
        if(strcasecmp($charset, "UTF-8") !== 0){

            // change it to UTF-8, keeping everything before and after the value
            $html = preg_replace("@(^[\s\S]+?<meta[\s\S]+?charset=['\"]?)(.+?)(['\"]\s*/?>[\s\S]+$)@i","$1UTF-8$3",$html);

        }
    }

    return $html;
}



This solved the problem for me perfectly. If anyone else has another way of solving this issue without creating local PHP / ASP files that set a Response.Charset = "UTF-8" please let me know as I would be interested to hear about it.
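For comparison, the same idea can be sketched in Python. This is not a port of the PHP function above, just a minimal illustration that assumes the first META charset declaration in the document is the one that matters. The name `force_meta_charset_utf8` is made up.

```python
import re

# Group 1 is everything up to the charset value; group 2 is the value itself.
META_CHARSET = re.compile(
    r"""(<meta\b[^>]*?\bcharset\s*=\s*["']?)([^"'\s/>;]+)""",
    re.IGNORECASE,
)

def force_meta_charset_utf8(html):
    """Rewrite the first META charset declaration so that it says UTF-8."""
    return META_CHARSET.sub(lambda m: m.group(1) + "UTF-8", html, count=1)

print(force_meta_charset_utf8(
    '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />'
))
print(force_meta_charset_utf8("<meta charset='iso-8859-1' />"))
```

Matching only the value, rather than the whole document either side of it, keeps the pattern short and avoids backtracking over large pages.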

Saturday, 21 March 2009

IE 8 Document and Browser modes

Controlling IE 8 Browser and Document Modes

In Internet Explorer 8 the developer toolbar can control the following settings and will override any other settings that have been set e.g META tags or the Compatibility View options. On changing a setting the browser will refresh and load the appropriate new configuration.

I will list the various document and browser modes with their basic differences. If you are in a hurry and just need a Javascript function to determine the client's current IE 8 settings, then that link will sort that for you. For an explanation of browser compatibility mode testing, this link will give you an article explaining the combination of agent sniffing and object detection required to identify a client's IE 8 browser settings.




Browser Modes:

Internet Explorer 8 Mode

The browser will run as IE 8.0 and the user-agent will appear as an IE 8.0 user-agent e.g

Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB5; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 1.1.4322; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022)

Notice the mention of Trident/4.0 which is the name and version of the rendering engine.

A new Javascript engine is used in IE 8.0, so there will be numerous differences. For example, using getElementById to return an element by its name attribute no longer works. This is the correct behaviour, but if you didn't know that and relied on it in IE to access elements, you will experience errors when running in full IE 8 mode.

Internet Explorer 7 Mode

The browser will run as if it were actually IE 7.0 and the user-agent will appear as an IE 7.0 user-agent e.g:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; GTB5; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 1.1.4322; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022)

Javascript will run as it does in IE 7.0. For example, using getElementById to return an element by name will work.

Internet Explorer 8 Compatibility Mode

This mode means that, and I quote from ieblog

"In a nutshell, Compatibility View allows content designed for older web browsers to still work well in Internet Explorer 8."

So in all respects the browser is still running as IE 8.0 but allows sites that worked perfectly well in IE 7.0 to continue to work correctly without having to revert to IE 7.0 mode.

By default IE will set all publicly accessible Internet sites to run in IE 8 mode and all Intranet sites to IE 8 compat mode.

When IE comes across an Internet site that it feels should run in compatibility mode, such as one with a strict doctype, it will display a button that lets the user switch to compatibility mode. For sites that use quirks mode it will not offer this option, so do not expect the button to appear all the time.

The user-agent in IE 8 Compatibility Mode is displayed similarly to IE 8 but with a 7 e.g

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; GTB5; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 1.1.4322; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022)

Javascript should run as it did in IE 7. For example, using getElementById to return an element by name will work.



Document Modes:

Quirks Mode (e.g IE 5)

This document mode setting will render pages as if there was no doctype specified. This will be the same as if you were using older IE versions such as IE 5.0.

Internet Explorer 7 Standards Mode

This document mode will render pages as they would be displayed in IE 7.0 when a strict doctype was specified.

Internet Explorer 8 Standards Mode (Page Default)

This document mode will render pages in the new IE 8 standards compliant manner. IE 8 apparently adheres to the CSS 2.1 specification and there are numerous changes to be aware of. For instance there is no longer any support for CSS expressions.

For a list of the various changes and potential problems that IE 8 will bring to your development view the following article: http://blogs.msdn.com/ie/archive/2009/03/12/site-compatibility-and-ie8.aspx


How to detect which settings are enabled

If for whatever reason you need to detect on the client side which of these settings the user currently has enabled in their browser, you can use a combination of user-agent sniffing and object detection to work out the true browser version and rendering engine. See the following article about how to detect the settings using Javascript.
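The user-agent half of that detection can also be done on the server. The sketch below is in Python and the function name `ie_browser_mode` is made up. It relies only on the differences in the example user-agent strings listed above: the MSIE version and whether the Trident/4.0 token is present. It cannot tell you the document mode, and a result of "IE7" could be a real IE 7 or IE 8 switched into IE 7 browser mode, as the two look the same.

```python
import re

def ie_browser_mode(user_agent):
    """Classify an IE user-agent string by its MSIE version and Trident token."""
    msie = re.search(r"MSIE (\d+)\.", user_agent)
    if not msie:
        return "not IE"
    version = int(msie.group(1))
    has_trident4 = "Trident/4.0" in user_agent
    if version == 8 and has_trident4:
        return "IE8"
    if version == 7 and has_trident4:
        # Claims to be IE 7 but carries the IE 8 rendering engine token.
        return "IE8 Compatibility View"
    if version == 7:
        return "IE7"
    return "other IE"

ua = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; GTB5)"
print(ie_browser_mode(ua))
```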


The new META UA-Compatible tag

Another new feature in IE 8 is the ability to quickly fix any potential problems that all these new document and browser modes may bring by adding a META tag to your existing sites. To force a page to run in IE 8 standards mode we can add the following META tag to pages:

<meta http-equiv="X-UA-Compatible" content="IE=8">

Or if you find that your sites do not currently work in IE 8.0 standards mode you can force them to work as they did in IE 7.0 with the following META tag:

<meta http-equiv="X-UA-Compatible" content="IE=7">

Once you get your site up to scratch you can remove the tags.

Monday, 1 December 2008

Adding Remote URL content to Index

Indexing a remote URL for use in a knowledge base

I have just completed work on a small knowledge base built in ASP.NET. It has a few quite funky features, one of which is the ability to add an article that lives at a remote location. Most articles revolve around written content or attached files, but sometimes users come across an article on the web that they think would be great to add to the system, and they want it indexed and searchable just like any other article. In my previous incarnation of this, which I hastily wrote one night back in the late 90s in classic ASP, you could add a URL, but the only indexable content that could be used to find it in the knowledge base was the tag words I allowed the user to add alongside the URL. Obviously this isn't really good enough, so in the latest version I do the following when the article is saved:

  1. Check the URL looks valid using a regular expression.
  2. Access the URL through a proxy server and return the HTML source.
  3. Locate and store the META keywords, description and title if they exist.
  4. Remove everything apart from content between the start and close BODY tags.
  5. From the body I strip any SCRIPT tags and anything between them.
  6. Remove all HTML tags.
  7. Clean the remaining content by removing noise words, numbers and swear words.
  8. I add the remaining content which consists of good descriptive wording to the META keywords, description and title which I stored earlier.
  9. I save this content to the database which then updates the Full Text Index so that it becomes searchable by the site users.

Following this process means that I get all the benefits of having the remote article indexed and searchable without the downside of having to store the whole HTML source code. After cleaning I am left with only the core descriptive wording that is useful and do away with all the rubbish.

I will show you the two main methods, written in C#, that retrieve the URL content and clean the source.


1. Method to access remote URL through proxy server.




// Needs these at the top of the file:
// using System; using System.IO; using System.Net; using System.Text;

public static string GetURLHTML(string remoteURL, string proxyServer)
{
    string remoteURLContent = "";

    WebProxy proxy = new WebProxy(proxyServer, true); //pass the name of the proxy server
    WebRequest webReq = WebRequest.Create(remoteURL);
    webReq.Proxy = proxy; //set request to use proxy

    // Set the HTTP-specific UserAgent property so those sites know who's come and ripped them up
    if (webReq is HttpWebRequest)
    {
        ((HttpWebRequest)webReq).UserAgent = ".NET Framework Strategies Knowledge Base Article Parser v1.0"; //Set up my useragent
    }

    WebResponse webResp;
    int responseStatusCode = 0;

    try
    {
        // Get the response instance
        webResp = webReq.GetResponse();

        // Read an HTTP-specific property.
        if (webResp is HttpWebResponse)
        {
            responseStatusCode = (int)((HttpWebResponse)webResp).StatusCode;
        }
    }
    catch (Exception)
    {
        //failed requests (including 404s and 500s, which throw) return no content
        return remoteURLContent;
    }

    //we can only collect HTML from valid responses so ignore anything that isn't a 200
    if (responseStatusCode != 200)
    {
        webResp.Close();
        return remoteURLContent;
    }

    // Get the response stream and read it. Note that Encoding.ASCII turns any
    // non-ASCII characters into question marks, which is fine for my indexing needs.
    using (Stream respStream = webResp.GetResponseStream())
    using (StreamReader reader = new StreamReader(respStream, Encoding.ASCII))
    {
        remoteURLContent = reader.ReadToEnd();
    }

    // Close the response.
    webResp.Close();

    return remoteURLContent;
}



The reason I use a proxy is down to the security policy set on our web servers.


2. Method to gather the main content.



// Needs these at the top of the file:
// using System; using System.Configuration; using System.Text;
// using System.Text.RegularExpressions; using System.Web;

//When article poster wants us to save a remote URL as the KB article content then we need to get the content and parse it
protected string IndexURL(string remoteURL)
{
    KeywordParser keywordParser;
    string METAKeywords = "", METADescription = "", METATitle = "";
    string cleanHTML = "";
    StringBuilder indexText = new StringBuilder();

    //As I have to access all remote URLs through a proxy server I access my application setting from the web.config file
    string proxyServer = ConfigurationManager.AppSettings["ProxyServer"].ToString();

    //now access the remote URL and return the HTML source code if we can
    string remoteURLHTML = UtilLibrary.GetURLHTML(remoteURL, proxyServer);

    //if we have some HTML content to parse and clean
    if (!String.IsNullOrEmpty(remoteURLHTML))
    {
        remoteURLHTML = remoteURLHTML.ToLower(); //lower case it all as a)it doesn't matter and b)means no need for ignore options in regular expressions

        //Set up some regular expressions to help identify the META content we want to index in the source
        Regex HasKeywords = new Regex("<meta\\s+name=\"keywords\"");
        Regex HasDescription = new Regex("<meta\\s+name=\"description\"");
        Regex HasTitle = new Regex("<title>");

        //As I am using replaces to quickly return the content I require I do a test first for the relevant tag otherwise if the source doesn't
        //contain the META tag then we will be left with the whole HTML source which we obviously don't want!!
        if (HasKeywords.IsMatch(remoteURLHTML))
        {
            //get the data we require by replacing anything either side of the tag
            METAKeywords = "KEYWORDS = " + Regex.Replace(remoteURLHTML, "((?:.|\n)+?<meta\\s+name=\"keywords\"\\s+content=\")(.+)(\"(?:.|\n)+)", "$2");
        }
        if (HasDescription.IsMatch(remoteURLHTML))
        {
            METADescription = "DESCRIPTION = " + Regex.Replace(remoteURLHTML, "((?:.|\n)+?<meta\\s+name=\"description\"\\s+content=\")(.+)(\"(?:.|\n)+)", "$2");
        }
        if (HasTitle.IsMatch(remoteURLHTML))
        {
            METATitle = "TITLE = " + Regex.Replace(remoteURLHTML, "((?:.|\n)+?<title>)(.+)(<\\/title>(?:.|\n)+)", "$2");
        }

        cleanHTML = remoteURLHTML;

        //now get main content which is between open close body tags
        cleanHTML = Regex.Replace(cleanHTML, "((?:.|\n)+?<body.*?>)((?:.|\n)+?)(<\\/body>(?:.|\n)+)", "$2");

        //strip any client side script by removing anything between open and close script tags
        cleanHTML = Regex.Replace(cleanHTML, "<script.*?</script>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);

        //put a gap before words that appear just before closing tags so that we keep gaps between values from listboxes
        cleanHTML = Regex.Replace(cleanHTML, "(\\w)(<\\/\\w)", "$1 $2");

        //strip HTML tags
        cleanHTML = Regex.Replace(cleanHTML, "<[^>]+?>", "");

        //Decode the HTML so that any encoded HTML entities get stripped
        cleanHTML = HttpUtility.HtmlDecode(cleanHTML);

        //now add all the content we want to index back together
        if (!String.IsNullOrEmpty(METAKeywords))
        {
            indexText.Append(METAKeywords + " ");
        }
        if (!String.IsNullOrEmpty(METADescription))
        {
            indexText.Append(METADescription + " ");
        }
        if (!String.IsNullOrEmpty(METATitle))
        {
            indexText.Append(METATitle + " ");
        }
        if (!String.IsNullOrEmpty(cleanHTML))
        {
            indexText.Append(cleanHTML);
        }
    }

    return indexText.ToString();
}


I have left out the other function that strips noise words, numbers and swear words, as it's nothing special: just a couple of loops that check some arrays containing the words that need removing.
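To give a rough idea of what that cleaning step does, here is a minimal sketch in Python rather than C#. It is not the function from the knowledge base. The word lists and the name `clean_index_text` are hypothetical placeholders, and a real list of noise words would be much longer.

```python
import re

# Hypothetical sample lists; the real arrays are not shown in this post.
NOISE_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}
BANNED_WORDS = {"darn"}

def clean_index_text(text):
    """Drop noise words, banned words and plain numbers, keeping the rest in order."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    kept = [
        word for word in words
        if word not in NOISE_WORDS
        and word not in BANNED_WORDS
        and not word.isdigit()
    ]
    return " ".join(kept)

print(clean_index_text("The 3 laws of robotics and the darn robot"))
```

Using sets rather than arrays makes each lookup cheap, which helps when the scraped page is large.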

The performance of this method varies slightly depending on the size of the content being parsed. It's also possible to leave the noise words and numbers in the content, as they will not get added to any Full Text Index anyway: SQL Server automatically ignores most noise words and numbers. However, if data storage is an issue you may still want to strip them so that you only save the core content to the database table.