Monday, 3 January 2011

Handling UTF-8 characters when scraping

Handling incorrectly formatted characters when scrapping

Scraping can be a bit of a nightmare as you cannot expect every web page to be written to the same standard and therefore you will find most of the time is spent trying to handle dispcrepencies in formats and bad encoding etc.

On one of my sites noagendashownotes.com I create backup links of original news stories so that if the original story gets taken down (which happens alot) the original version is still available.

This means I have to create local versions of the remote files and this isn't too much of a problem as its not too hard to convert relative links to absolute and so on. One of the problems is sites that load content such as CSS with client side Javascript code as its virtually impossible with a simple server side scraping tool to work out whats going on when libraries are loading other libraries and file paths are built up with Javascript. However luckily this doesn't happen too much so I am not too concerned about it.

One thing that does happen a lot though is character encoding issues caused by a webpage mismatching up the character sets on the server and client. This causes issues when you are scraping and saving to another file as when the file is viewed in a browser as a static html file it only has the META Charset tag value to go on.


Take a look at the HTML source code and you will see that they are using the following META tag

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> 

which tells the browser to output the character set as ISO (extended Latin version of ASCII)

However if you use something like HTTP Fox to examine the response headers you will see that the actual Response Charset is UTF-8. The page will be using server side code e.g PHP, JSP or ASPX to set this like so

Response.Charset = "UTF-8"

or

header('Content-Type: text/html; charset=UTF-8');

Now when I scraped this page and saved it as a local file (with a UTF-8 Encoding) and then viewed that local file in my browser all the extended UTF-8 characters such as special quote marks or apostrophes appear as the usual garbage e.g

New York’s governor.

instead of

New York’s governor

This is because the browser only has the HTML to tell it what character set to use and this has been set incorrectly to ISO instead of UTF-8.

Because the page is now a static HTML file rather than a dynamically generated page there is no server side code setting the Response Charset headers. Again HTTP Fox is useful for examining the response headers to prove a point.

I am pretty new to PHP and I searched around the web for a few suggestions on how to fix this which included things like wrapping the file_get_contents function in a mb_convert_encoding function e.g

function file_get_contents_utf8($fn,$incpath,$context) {
$content = file_get_contents($fn,$incpath,$context);
return mb_convert_encoding($content, 'UTF-8',
mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}

However this didn't solve the problem so I came up with a method that did work for what I am trying to do ( e.g create static HTML versions of dynamic pages). This method involved using regular expressions to reformat the HTML so that any mismatches of CHARSET settings are correctly set to UTF-8.

This function is designed to replace the value for any META CHARSET tags to be UTF-8. It works with both these formats (with single or double quotes)

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
OR

<meta charset='iso-8859-1' />

function ConvertHeaderToUTF8($html){

// look for an existing charset value
if(preg_match("@<meta[\s\S]+?charset=['\"]?(.+?)['\"]\s*/?>@i",$html,$match)){

$charset = $match[1];

// check value for UTF-8
if($charset != "UTF-8"){

// change it to UTF-8
$html = preg_replace("@(^[\s\S]+?<meta[\s\S]+?charset=['\"]?)(.+?)(['\"]\s*/?>[\s\S]+$)@i","$1UTF-8$3",$html);

}
}

return $html;
}



This solved the problem for me perfectly. If anyone else has another way of solving this issue without creating local PHP / ASP files that set a Response.Charset = "UTF-8" please let me know as I would be interested to hear about it.

1 comment:

keval kothari said...

Hello Rob,

Thanks for posting this great article, It helped me to fix one of web scraping project where the characters becoming funny due to utf-8 issue and your code helped me to get ride on it.

I provide web scraping service to small and medium businesses who need data scraped from websites.

Thank you!