Showing posts with label ASCII. Show all posts
Showing posts with label ASCII. Show all posts

Sunday, 25 April 2021

Solving an issue where you cannot tell the difference between the two conflicting words

Weird Issue - Two Names That Look The Same But Aren't

By Strictly-software 

I have a table of Jockeys that returns me all jockeys from a RACE_RUNNERS table but I had a problem which was I kept getting one jockey returned twice in the output. 

I couldn't work out why this would be, and only for one jockey, and jumped straight to collations. The column collation is set to DB default and when I do a compare using the same collation on each value they still don't match. e.g

-- Column in dbo.RACE_RUNNERS table is 

SELECT CONVERT (varchar, SERVERPROPERTY('collation')) AS 'Server Collation'; 

DB Collation = SQL_Latin1_General_CP1_CI_AS

SELECT	Jockey COLLATE SQL_Latin1_General_CP1_CI_AS,'JOCKEY',COUNT(RaceFK)
FROM	dbo.RACE_RUNNERS as run with (nolock)
JOIN	dbo.RACES as r with (nolock)
    ON	r.RacePK = run.RaceFK
WHERE	Jockey IN('Mr P W Mullins','Mr P W Mullins')
	AND Racedatetime > '2021-JAN-01'		
GROUP BY Jockey 
ORDER BY Jockey COLLATE SQL_Latin1_General_CP1_CI_AS

/*
Returns two records for the jockey that look alike
Mr W P Mullins
Mr W P Mullins
*/

-- do some basic comparison tests with each value

IF LTRIM(RTRIM('Mr P W Mullins')) = LTRIM(RTRIM('Mr P W Mullins'))
  PRINT 'MATCH'
ELSE
  PRINT 'NO MATCH'

--NO MATCH

IF LTRIM(RTRIM('Mr P W Mullins')) COLLATE SQL_Latin1_General_CP1_CI_AS = LTRIM(RTRIM('Mr P W Mullins')) COLLATE SQL_Latin1_General_CP1_CI_AS
  PRINT 'MATCH'
ELSE
  PRINT 'NO MATCH' 

--NO MATCH


So I was slightly confused and thinking about deleting the 2nd record for MR P W MULLINS from the table or changing his name first to the MR P W NULLINS with most results. 

Then it dawned on me. Importing data often leads to characters that you cannot see the difference in but in SQL will appear different. So I did a simple test by removing all spaces in the IF statement comparison e.g
IF 'MrPWMullins' = 'MrPWMullins'
  PRINT 'MATCH'
ELSE
  PRINT 'NO MATCH'

-- And the result was
MATCH
So I just checked each space character in the two words with ASCII() function and found that the last space between W and Mullins was different from each other.
-- match the space between W and Mullins as it seems to be different
SELECT ASCII(' ') -- CHAR 160 another type of SPACE
SELECT ASCII(' ') -- CHAR 32 ASCII SPACE

160
32

Outputting both character numbers with CHAR in a select statement just gave me two columns of spaces. No way to tell the difference by sight, and it only occurred on one jockey for some reason. 

Therefore I did an update on the main RACE_RUNNERS table to replace any CHAR(160) with CHAR(32) for this jockey and might have to put it in the main stored proc that either saves jockeys into the RACE_RUNNER table or just the stat proc.

UPDATE	RACE_RUNNERS 
SET	JOCKEY = REPLACE(Jockey,CHAR(160),CHAR(32))
WHERE	JOCKEY = 'Mr P W Mullins'
Hey presto one instance of the name in the table. A tricky issue that I thought was down to collation but it wasn't. 

The collations of table columns and the DB were the same and doing matches with a specified collation didn't solve the issue. 

Therefore you have to think out of the box and go back to basics and just think what could be different. In this case it was the spaces. Removing them from both words created a MATCH result therefore I knew one space was a different character to the other. 

Here is a link to an ASCII chart if you should ever want to see what character CHAR(N) returns > https://ascii.cl/.

Problem solved. 

By Strictly-Software

© 2021 By Strictly-Software

Monday, 3 January 2011

Handling UTF-8 characters when scraping

Handling incorrectly formatted characters when scrapping

Scraping can be a bit of a nightmare as you cannot expect every web page to be written to the same standard and therefore you will find most of the time is spent trying to handle dispcrepencies in formats and bad encoding etc.

On one of my sites noagendashownotes.com I create backup links of original news stories so that if the original story gets taken down (which happens alot) the original version is still available.

This means I have to create local versions of the remote files and this isn't too much of a problem as its not too hard to convert relative links to absolute and so on. One of the problems is sites that load content such as CSS with client side Javascript code as its virtually impossible with a simple server side scraping tool to work out whats going on when libraries are loading other libraries and file paths are built up with Javascript. However luckily this doesn't happen too much so I am not too concerned about it.

One thing that does happen a lot though is character encoding issues caused by a webpage mismatching up the character sets on the server and client. This causes issues when you are scraping and saving to another file as when the file is viewed in a browser as a static html file it only has the META Charset tag value to go on.


Take a look at the HTML source code and you will see that they are using the following META tag

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> 

which tells the browser to output the character set as ISO (extended Latin version of ASCII)

However if you use something like HTTP Fox to examine the response headers you will see that the actual Response Charset is UTF-8. The page will be using server side code e.g PHP, JSP or ASPX to set this like so

Response.Charset = "UTF-8"

or

header('Content-Type: text/html; charset=UTF-8');

Now when I scraped this page and saved it as a local file (with a UTF-8 Encoding) and then viewed that local file in my browser all the extended UTF-8 characters such as special quote marks or apostrophes appear as the usual garbage e.g

New York’s governor.

instead of

New York’s governor

This is because the browser only has the HTML to tell it what character set to use and this has been set incorrectly to ISO instead of UTF-8.

Because the page is now a static HTML file rather than a dynamically generated page there is no server side code setting the Response Charset headers. Again HTTP Fox is useful for examining the response headers to prove a point.

I am pretty new to PHP and I searched around the web for a few suggestions on how to fix this which included things like wrapping the file_get_contents function in a mb_convert_encoding function e.g

function file_get_contents_utf8($fn,$incpath,$context) {
$content = file_get_contents($fn,$incpath,$context);
return mb_convert_encoding($content, 'UTF-8',
mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}

However this didn't solve the problem so I came up with a method that did work for what I am trying to do ( e.g create static HTML versions of dynamic pages). This method involved using regular expressions to reformat the HTML so that any mismatches of CHARSET settings are correctly set to UTF-8.

This function is designed to replace the value for any META CHARSET tags to be UTF-8. It works with both these formats (with single or double quotes)

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
OR

<meta charset='iso-8859-1' />

function ConvertHeaderToUTF8($html){

// look for an existing charset value
if(preg_match("@<meta[\s\S]+?charset=['\"]?(.+?)['\"]\s*/?>@i",$html,$match)){

$charset = $match[1];

// check value for UTF-8
if($charset != "UTF-8"){

// change it to UTF-8
$html = preg_replace("@(^[\s\S]+?<meta[\s\S]+?charset=['\"]?)(.+?)(['\"]\s*/?>[\s\S]+$)@i","$1UTF-8$3",$html);

}
}

return $html;
}



This solved the problem for me perfectly. If anyone else has another way of solving this issue without creating local PHP / ASP files that set a Response.Charset = "UTF-8" please let me know as I would be interested to hear about it.