Saturday, 25 July 2009

HTML Encoding and Decoding using Javascript

Javascript HTML Encoder and Decoder

I have recently started doing a lot of work with client-side code, especially delivering content to my www.hattrickheaven.com football site from Google's Ajax API framework as well as other feed content providers. In doing so I have come across numerous occasions where the feed data contains either content that requires HTML encoding or content with partially encoded strings, malformed entities or double-encoded entities.

Therefore the requirement for a client-side library to handle HTML encoding and decoding became apparent. I have always wondered why there isn't an inbuilt HTML encode function in Javascript. There are escape, encodeURI and encodeURIComponent, but these are for encoding URIs and making sure text is portable, not for ensuring strings are HTML encoded correctly.
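
Just to illustrate the point (this is not code from my encoder, merely a quick demonstration), URI encoding percent-escapes characters for use in a URL, which is not the same thing as producing HTML entities:

// URI encoding escapes characters for URLs, it does not produce HTML entities
encodeURIComponent('<b>"Fish" & Chips</b>');
// returns "%3Cb%3E%22Fish%22%20%26%20Chips%3C%2Fb%3E"
// whereas an HTML encoder should return
// "&lt;b&gt;&quot;Fish&quot; &amp; Chips&lt;/b&gt;"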

What's the problem with a simple replace function?

I have seen many sites recommend something simple along the lines of:

s = s.replace(/&/g,'&amp;').replace(/"/g,'&quot;').replace(/</g,'&lt;').replace(/>/g,'&gt;').replace(/'/g,'&apos;');



The problem with this is that it won't handle content that has already been partially encoded and could cause problems with double encoding.

For example if you had a string that was partially encoded such as:

"The player was sold for &pound;4.2 million"

Then you would end up double encoding the & in the &pound; entity like so:

&quot;The player was sold for &amp;pound;4.2 million&quot;

Therefore you have to be a bit clever when encoding and make sure that only ampersands which are not part of existing entities get encoded. You could try a regular expression that does a negative match, but you would have to handle the fact that HTML entities can be either named or numeric, e.g.

&lt;
&#60;

are both representations of the less-than symbol <. The way I have dealt with this is to make sure all named entities are converted to their numerical codes first. You could then do a negative match on &# e.g.

s = s.replace(/&([^#])/g,"&amp;$1");

However, what if you had multiple ampersands in a row, e.g. &&&&? You would only encode two out of the four. You could run the replace multiple times to handle it, but I tend to use placeholders in cases like this as it's easier and better for performance to do positive matches rather than convoluted negative matches. Therefore I replace all occurrences of &# with a placeholder, then do my replacement of & with &amp;, and then finally put the &# back.
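
As a rough sketch of that placeholder idea (this is not the exact code from my encoder object, which handles many more cases, but it shows the principle):

// a minimal sketch of the placeholder approach - it assumes named entities
// such as &pound; have already been converted to their numeric form e.g &#163;
function htmlEncode(s){
    var placeholder = "@@AMPHASH@@"; // something that won't appear in normal content

    // protect the start of existing numeric entities e.g &#163; or &#60;
    s = s.replace(/&#/g, placeholder);

    // any remaining & must be a bare ampersand so it is safe to encode it
    s = s.replace(/&/g, "&amp;");

    // put the protected entity starts back
    s = s.replace(new RegExp(placeholder, "g"), "&#");

    // now encode the other special characters
    s = s.replace(/</g, "&lt;").replace(/>/g, "&gt;").replace(/"/g, "&quot;").replace(/'/g, "&#39;");

    return s;
}

// e.g htmlEncode("The player was sold for &#163;4.2 million & loved it")
// returns "The player was sold for &#163;4.2 million &amp; loved it"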

If you want to see my HTML encoder and decoder functions in action and get a copy of the encoder object, which has a number of useful functions, then go to the following page on my site: www.strictly-software.com/htmlencode. The page also has a form that allows you to HTML encode and decode online if you ever need to do just that.

Tuesday, 21 July 2009

Firebug 1.4.0 and Highlighter.js

The mysterious case of the disappearing code

Yesterday I posted an article relating to an issue with a code highlighter script and Firebug 1.4.0. A rough overview is that:

a) I use the highlight.js code from Software Maniacs to highlight code examples.
b) This works cross browser and has done for a number of months.
c) On loading Firefox 3.0.11 I was asked to upgrade Firebug from 1.3.3 to 1.4.0.
d) After doing so I noticed all my code examples were malformed with the majority of code disappearing from the DOM.
e) On disabling the Firebug add-on the code would re-appear.
f) This problem didn't occur on my other PC, which is still using Firebug 1.3.3 with Firefox 3.0.11.

Other people have contacted me to say they had similar issues, and others who were using Firefox 3.5 did not have this problem, so it seemed to be a problem specific to Firefox 3.0.11 with Firebug 1.4.0.

So tonight I was planning to debug the highlight.js file to see what the crack was. The version of highlight.js I use on my site is compressed and minified, so I uncompressed the file with my online unpacker tool. I thought I would just try some problematic pages on my site with this uncompressed version and, lo and behold, it worked. The code didn't disappear!

So I have re-compressed the JS file with my own compressor tool and changed the references throughout the site to use http://www.strictly-software.com/scripts/highlight/highlight.sspacked.js instead of the original file http://www.strictly-software.com/scripts/highlight/highlight.pack.js and it all seems to work (at least for me).

If anyone manages to get to the bottom of this problem then please let me know, but it seems there must be some sort of conflict occurring between these two codebases, and I think it's very strange!

I have created a copy of yesterday's posting that still uses the original compressed file so that the problem can be viewed. You can view the problem code here.

Sunday, 19 July 2009

Problems upgrading to Firebug 1.4.0

Problems with Firebug and Javascript Code Highlighting

UPDATE
Please read the following post for details about a resolution for this problem.
http://blog.strictly-software.com/2009/07/firebug-140-and-highlighterjs.html
You can view the original article along with the original file at the following link: http://www.strictly-software.com/highlight-problem.htm

I have recently started using a javascript include file from Software Maniacs called highlight.js to highlight the code snippets that I use in my blog articles. It uses a callback function fired on page load that looks for CODE tags in the HTML source and then re-formats them by applying in-line styles appropriate for the code snippet. However, I have just found that a lot of my articles, when viewed in Firefox 3, are not displaying the formatting correctly and most of the code disappears. For example, if you view the code below and can see that it shows a small JS function, then that's okay. However, if all you can see is the closing tag } then I would ask one question: have you just upgraded to Firebug 1.4.0?

function myJSfunc(var1,var2){
return (var1==100) ? var1 : var2;
}
I myself have just upgraded my laptop's version of Firebug from 1.3.3 to 1.4.0, and since that update I have noticed that all my highlighting has gone haywire, mainly where I am trying to output literal values such as > and <. For example, the following code snippet should appear like this:

-- Look for open and close HTML tags making sure a letter or / follows < ensuring its an opening
-- HTML tag or closing HTML tag and not an unencoded < symbol
WHILE PATINDEX('%<[A-Z/]%', @CleanHTML) > 0 AND CHARINDEX('>', @CleanHTML, CHARINDEX('<', @CleanHTML)) > 0

SELECT @StartPos = PATINDEX('%<[A-Z/]%', @CleanHTML),
@EndPos = CHARINDEX('>', @CleanHTML, PATINDEX('%<[A-Z/]%', @CleanHTML)),
@Length = (@EndPos - @StartPos) + 1,
@CleanHTML = CASE WHEN @Length>0 THEN stuff(@CleanHTML, @StartPos, @Length, '') END



However, when using the highlighter and viewing it in Firefox 3.0 with Firebug 1.4.0 it comes out as:

) END

(Another test for Firebug 1.4.0 users: can you see the colourful code below?)

-- Look for open and close HTML tags making sure a letter or / follows < ensuring its an opening
-- HTML tag or closing HTML tag and not an unencoded < symbol
WHILE PATINDEX('%<[A-Z/]%', @CleanHTML) > 0 AND CHARINDEX('>', @CleanHTML, CHARINDEX('<', @CleanHTML)) > 0

SELECT @StartPos = PATINDEX('%<[A-Z/]%', @CleanHTML),
@EndPos = CHARINDEX('>', @CleanHTML, PATINDEX('%<[A-Z/]%', @CleanHTML)),
@Length = (@EndPos - @StartPos) + 1,
@CleanHTML = CASE WHEN @Length>0 THEN stuff(@CleanHTML, @StartPos, @Length, '') END



Now I have checked the problematic code examples in other browsers such as Chrome and IE 8, as well as in Firefox 3 with Firebug 1.3.3, and it only seems to be Firefox 3 with Firebug 1.4.0 that causes the formatting problems.

I don't know why Firebug 1.4.0 is causing the problems, but it seems to be the only differing factor between a working page and a broken page. Maybe there is some sort of clash of function names when both scripts load. I have inspected the DOM using Firebug and the actual HTML has been deleted from the DOM, so something is going wrong somewhere.

Anyway, I thought I should let you know in case you are having similar problems or are just wondering why all my code examples have gone missing. If you are experiencing the same problem but do not have Firebug 1.4.0 installed, please let me know.

I am unaware of any other issues with this version of Firebug but will keep you posted.

Removing HTML with a User Defined Function

Using SQL to parse HTML content

Today I had the task of collating a long list of items for one of my sites. The list was being obtained by manually checking the source code of numerous sites and copying and pasting the relevant HTML source into a file. The items I wanted were contained within HTML list elements (UL, LI), and therefore the actual textual data was surrounded by HTML markup: tags such as LI, STRONG and EM with various in-line styles, class names and so on.

I didn't want to spend much time parsing a list of a thousand items and I couldn't be bothered to write much code, so I reverted to an old user defined function that I wrote to remove HTML tags. Once I had collated the list it was a simple case of using an import task to insert from the text file into my table, wrapping the column in my user defined function.

Although I have been making use of the CLR lately for string parsing in the database with a few good C# regular expression functions, this function doesn't require the CLR and can easily be converted for use with SQL 2000 and earlier by changing the data type of the input and return parameters from nvarchar(max) to nvarchar(4000).

The code is pretty simple and neat. It uses a PATINDEX search for the opening bracket of an HTML tag, making sure that the subsequent character is either a forward slash (to match a closing tag) or a letter (to match an opening tag), e.g.

SELECT @StartPos = PATINDEX('%<[A-Z/]%', @CleanHTML),

This means that any unencoded opening bracket characters don't get mistaken for HTML tags. The code just keeps looping over the input string looking for opening and closing tags, removing each one with the STUFF function, until no more matches are found.


-- Look for open and close HTML tags making sure a letter or / follows < ensuring its an opening
-- HTML tag or closing HTML tag and not an unencoded < symbol
WHILE PATINDEX('%<[A-Z/]%', @CleanHTML) > 0 AND CHARINDEX('>', @CleanHTML, CHARINDEX('<', @CleanHTML)) > 0

SELECT @StartPos = PATINDEX('%<[A-Z/]%', @CleanHTML),
@EndPos = CHARINDEX('>', @CleanHTML, PATINDEX('%<[A-Z/]%', @CleanHTML)),
@Length = (@EndPos - @StartPos) + 1,
@CleanHTML = CASE WHEN @Length>0 THEN stuff(@CleanHTML, @StartPos, @Length, '') END


An example of the function's usage:


DECLARE @Test nvarchar(max)
SELECT @Test = '<span class="outer2" id="o1"><strong>10 is < 20 and 20 is > 10</strong></span>'

SELECT dbo.udf_STRIP_HTML(@Test)

--Returns
10 is < 20 and 20 is > 10


I thought I would post this user defined function on my blog as it's a good example of how a simple function using inbuilt system functions such as PATINDEX, CHARINDEX and STUFF can solve common day-to-day problems. I personally love UDFs, and the day SQL 2000 introduced them was a wonderful occasion, almost as good as when England beat Germany 5-1, LOL. No, I am not that sad; of course England beating Germany was the better occasion, but the introduction of user defined functions made SQL Server programming a much easier job and made possible numerous pseudo-set-based operations which previously had to be done iteratively (with loops or cursors).

Download the Strip HTML user defined function source code here

Wednesday, 8 July 2009

Another article about bots and crawlers

Good Guys versus the Bad Guys

As you will know if you have read my previous articles about badly behaved robots, I have to spend a lot of time dealing with crawlers that access the 200+ jobboards I look after. Crawlers break down into two varieties. The first are the indexers such as Googlebot, YSlurp, Ask and so on. They provide a service to the site owner by indexing their content and displaying it in search engines to drive traffic to the site, and although these are legitimate bots they can still cause issues (see this article about over-crawling). The second class are those bots that don't benefit the site owner and in the large majority of cases try to harm the site, either by hacking, spamming or content scraping (or job raping as we call it).

Now it may be that the site owner is happy for any Tom, Dick or Harry to come along with a scraper and take all their content, but seeing that content is the main asset of a website, it's a bit like starting a shop and leaving the doors open all night with no one manning the till. The problem is that most web content is publicly accessible, and it's very hard to prevent someone who is determined to scrape your content while at the same time keeping good SEO. For instance, you could deliver all your content through Javascript or Flash, but then indexers like Googlebot wouldn't be able to access your content either. Therefore, to prevent overloaded servers, stolen bandwidth and stolen content it becomes a complex game involving a lot more than keeping a blacklist of IP addresses and user-agents. Due to IP and agent spoofing that approach is very unreliable, so a variety of methods can be used by those wishing to reduce the amount of "bad traffic", including real-time checks to identify users that are hammering the site, whitelists and blacklists, bot traps, DNS checks and much more.

One of the problems I have found is that even bot traffic that is legitimate in the eyes of the site owner often doesn't comply with even the most basic rules of crawler etiquette, such as supplying a proper user-agent string, passing DNS validation or following the Robots.txt rules. In fact, when we spoke to the technical team running one of these job-rapists/aggregators, they informed us that they didn't have the technical capability to parse a Robots.txt file. I think this is plainly ridiculous: if you have the technical capability to crawl hundreds of thousands of pages a day, all with different HTML, and correctly parse that content to extract the job details, which are then uploaded to your own server for display, it shouldn't be too difficult to add a few lines to parse the Robots.txt file. I am 99% positive this was just an excuse to hide the fact that they knew if they did parse my Robots.txt file they would find they were banned from all my sites. Obviously I had banned them with other methods, but just to show how easy it is to write code to parse a Robots.txt file (and for that matter a crawler) I have added some example code to my technical site: How to parse a Robots.txt file with C#.
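
My example is written in C#, but just to illustrate the principle here, a rough Javascript sketch of the parsing logic (assuming the Robots.txt content has already been downloaded, and ignoring plenty of edge cases) could be as simple as:

// rough sketch only - returns true if the given path is disallowed for the given user-agent
function isDisallowed(robotsTxt, userAgent, path){
    var lines = robotsTxt.split(/\r?\n/), appliesToMe = false, disallowed = false;

    for(var i = 0; i < lines.length; i++){
        // strip comments and trim whitespace
        var line = lines[i].replace(/#.*$/, "").replace(/^\s+|\s+$/g, "");
        if(line === "") continue;

        var pos = line.indexOf(":");
        if(pos === -1) continue;
        var field = line.substring(0, pos).toLowerCase();
        var value = line.substring(pos + 1).replace(/^\s+/, "");

        if(field === "user-agent"){
            // does this record apply to our agent (or to everyone via *)?
            appliesToMe = (value === "*" || userAgent.toLowerCase().indexOf(value.toLowerCase()) !== -1);
        }else if(field === "disallow" && appliesToMe){
            // an empty Disallow means nothing is blocked for this record
            if(value !== "" && path.indexOf(value) === 0) disallowed = true;
        }
    }

    return disallowed;
}

// e.g isDisallowed(robotsFileContents, "SomeAggregatorBot/1.0", "/jobsearch.asp")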

This is one of the primary reasons there is so much bot traffic on the web today. It's just too damn easy to either download and run a bot from the web or write your own. I'm not saying all crawlers and bots have nefarious reasons behind them, but I reckon 80%+ of all Internet traffic is probably from crawlers and bots nowadays, with the other 20% being porn, music and film downloads and social networking traffic. It would be interesting to get exact figures for an Internet traffic breakdown. I know from my own sites' logging that on average 60-70% of traffic is from crawlers, and I am guessing that doesn't include traffic from spoofed agents that didn't get identified as such.

Tuesday, 30 June 2009

Googlebot, Sitemaps and heavy crawling

Googlebot over-crawling a site

I recently had an issue with one of my jobboards where Googlebot was over-crawling the site, which was causing the following problems:
  1. Heavy loads on the server. The site in question was recording 5 million page loads a month, roughly double the 2.4 million it had recorded the month before.
  2. 97% of all its traffic was accounted for by Googlebot.
  3. The site is on a shared server, so this heavy load was causing very high CPU usage and affecting other sites.

The reasons the site was receiving so much traffic boiled down to the following points.

  1. The site has a large number of categories which users can filter job searches by. These categories are displayed in whole and in subsets in prominent places such as quick links and a job browser which allows users to filter results. As multiple categories can be chosen when filtering a search, this meant Googlebot was crawling every possible combination in various orders.
  2. A new link had been added within the last month to the footer which passed a session ID in the URL. The link was there to log whether users had Javascript enabled. As Googlebot doesn't keep session state or use Javascript, the number of crawled URLs effectively doubled, as each page the crawler hit contained a new link that it hadn't already spidered due to the new SessionID.
  3. The sitemap had been set up incorrectly, containing URLs that didn't need crawling as well as incorrect change frequencies.
  4. The crawl rate was set to a very high level in Webmaster tools.

Therefore a site with around a thousand jobs was receiving 200,000 page loads a day, nearly all of them from crawlers. To put this in perspective, other sites with 3,000+ jobs, good SEO and high PR usually get around 20,000 page loads a day from crawlers.

One of the ways I rectified this situation was by setting a low custom crawl rate of 0.2 requests per second in Webmaster Tools. This caused a nice big vertical drop in the crawl graph, which alarmed the site owner, as he didn't realise that there is no relation between the number of pages crawled by Google and the site's page ranking or overall search engine optimisation.


Top Tips for getting the best out of crawlers

  • Set up a sitemap and submit it to Google, Yahoo and Live.
  • Make sure only relevant URLs are put in the sitemap. For example, don't include pages such as error pages or logoff pages.
  • If you are rewriting URLs then don't include the non-rewritten URL as well, as this will be counted as duplicate content.
  • If you are including URLs that take IDs as parameters to display database content then make sure you don't include the URL without a valid ID. Taking the site I spoke about earlier as an example, someone had included the following:
www.some-site.com/jobview.asp

instead of

www.some-site.com/jobview.asp?jobid=35056

This meant crawlers were accessing pages without content, which was a pretty pointless and careless thing to do.

  • Make sure the change frequency value is set appropriately. For example, on a jobboard a job is usually posted for between 7 and 28 days, so it only needs to be crawled between once a week and once a month depending on how long it was advertised for. Setting a value of "always" is inappropriate, as the content will not change every time Googlebot accesses the URL.
  • Avoid circular references such as placing links to a site index or category listings index in the footer of each page on a site. It makes it hard for the bot to determine the site structure, as down every path it drills it is able to find the parent page again. Although I suspect the bot's technology is clever enough to realise it has already spidered a link and not crawl it again, I have heard that it looks bad in terms of site structure.
  • Avoid dead links or links that lead to pages with no content. If you have a category index page and some categories have no content related to them, then either don't make the category into a link or link to a page that can show related content rather than nothing.
  • Prevent duplicate content and variations of the same URL from being indexed by implementing one of the following two methods (see the example after this list).
  1. Set your Robots.txt file to disallow your non-rewritten URLs from being crawled and then only display rewritten URLs to agents identified as crawlers.
  2. Allow both forms of URL to be crawled but use a rel=canonical link tag to specify that you want the rewritten version to be indexed.
  • Ban crawlers that misbehave. If we don't spank them when they are naughty they will never learn, so punish those that misbehave. It's very easy for an automated process to parse a Robots.txt file, therefore there is no excuse for those bots that ignore the commands set out in it. If you want to know which bots ignore the Robots.txt rules then there are various ways to find out, such as parsing your webserver log files or using a dynamic Robots.txt file to record the agents that access it. There are other ways, such as using the IsBanned flag available in the Browscap.ini file, however this relies on the user-agent being correct, and more and more people spoof their agent nowadays. Not only is banning bots good for your server's performance as it reduces load, it's also good for your site's security, as bots that ignore the Robots.txt rules are more likely to hack, spam and scrape your site's content.
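
As a quick illustration of those two duplicate content options (the URLs below are just made-up examples based on the jobview page mentioned earlier):

# robots.txt - stop the non-rewritten version of the page being crawled
User-agent: *
Disallow: /jobview.asp

<!-- or allow both to be crawled but tell the engines which version to index
     by placing a rel=canonical link tag in the HEAD of the page -->
<link rel="canonical" href="http://www.some-site.com/jobs/senior-developer-35056" />
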
If you are having similar issues with over-crawling then I would advise you to check your site's structure first, to see if the problem is due to bad structure, invalid sitemap values or over-categorisation, before changing the crawl rate. Remember, a site's SEO is unrelated to the amount of crawler activity, and more is not necessarily better. It's not the number of crawled pages that counts but rather the quality of the content that is found when the crawlers visit.

Saturday, 20 June 2009

Using Google APIs

Creating a content rich site using Google APIs

I recently had a domain renewal notification for a domain I had bought but never got round to using for its original purpose. I didn't want the domain to go to waste, so I thought about creating a site as a way for me to play with Google's APIs. They offer a wide range of objects and frameworks which let you add content to a site very quickly, such as the ability to search feeds, translate content on the fly from one language to another, search blogs, news and videos, and much much more.


Hattrick Heaven

The site I created is a football site called www.hattrickheaven.com. Its main purpose, apart from being an example of Google's APIs in action, is to display the football league tables of all the countries in the world. I found that on some other sites it took quite a few clicks from the home page to drill down to the league standings, so my site was a way to offer this data straight away. As well as the league tables I have links to the latest news, football-related blogs, discussion boards and videos.


Google Search APIs

To make use of Google's APIs you just need to set up an access key, which is linked to the domain and is free to use. This key is passed along in the URI when you reference the main Google JavaScript file. This file is all you need to include, as it handles the loading of all the other libraries with commands such as:
// load google APIS I want to use
google.load("feeds", "1");
google.load("language", "1");
google.load("search", "1");
As well as loading in Google's own framework with this main google.load method, you can also load other well-used frameworks such as jQuery, YUI, Prototype, Dojo, MooTools and others, or you can do what I chose to do and reference the relevant framework directly from Google's CDN (Content Delivery Network).
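
For example (the key and version numbers below are just placeholders), the loader itself is referenced with your key, and the other libraries can then come either through google.load or straight from the CDN:

<!-- the main Google loader, referenced with your free API key -->
<script type="text/javascript" src="http://www.google.com/jsapi?key=YOUR-API-KEY"></script>

<script type="text/javascript">
// other frameworks can be pulled in through the same loader...
google.load("jquery", "1.3.2");
</script>

<!-- ...or referenced directly from Google's CDN as I chose to do -->
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.3.2/jquery.min.js"></script>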


Targeting Visitors with Google's APIs

One of the very cool features I like about Google's API is that you get geographical information about your users as standard, such as longitude, latitude, country, region and city. This was great for me as it meant I could use this information to default the search criteria for my feed, blog, news and video searches. If you're in the UK and live near a major footballing city such as Liverpool, Manchester or London, then HattrickHeaven.com will default the searches with the names of the clubs related to those towns, such as Man UTD and Man City for Manchester.

Below is an example from the site of my Visitor object, which uses the google.loader object to set up details about the current user's location and browser language.

(function(){

V = H.Visitor = {

sysDefLanguage : "en", // the language I wrote the system in and that translations will be converted from (en = English)

Latitude : google.loader.ClientLocation.latitude,

Longitude : google.loader.ClientLocation.longitude,

CountryCode : google.loader.ClientLocation.address.country_code,

Country : google.loader.ClientLocation.address.country,

Region : google.loader.ClientLocation.address.region,

City : google.loader.ClientLocation.address.city,

BrowserLanguage : (navigator.language || navigator.browserLanguage || "en"), // the language currently set in the users browser, falling back to the system default of "en"

Language : "",

isEnglish : false

}

// set visitors language making sure "en-gb", "en-us" is converted to "en"
V.Language = H.ConvertLanguage(V.BrowserLanguage);
V.isEnglish = (V.Language=="en") ? true : false; // check for English

})();
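
Those details can then be used to seed the default search criteria, something along these lines (H.GetLocalTeams is just a made-up name to show the idea; the real site uses its own lookup of clubs by town):

// hypothetical usage - build a default search query from the visitor's location
var defaultQuery = "football news";

if(V.CountryCode == "GB" && V.City){
    // e.g for Manchester this might return "Man UTD Man City"
    defaultQuery = H.GetLocalTeams(V.City) + " news";
}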

Translating Content

Another cool feature is the ability to translate content from one language to another on demand. I make use of this on www.hattrickheaven.com to translate the headers and content that is delivered through Google's feed and search objects if the user's language is different from that set for the page (at the moment it's all English). You can see this in action on a specific page I created, http://www.hattrickheaven.com/spanish-news, which converts the content from English into Spanish once it has been inserted into the DOM. The code to do this is very simple: you just pass the text to convert, the language code the text is written in, the language code to translate it into, and a callback function to run once the translation has completed.
google.language.translate(txt, langFrom, langTo, function(result){
// On translation this method is called which will either run the function
// defined in destFunc or set the innerHTML for outputTo
self.TranslateComplete(result, destFunc, outputTo);
});
On www.hattrickheaven.com I have created my own wrapper object for the site which encapsulates a lot of the various Google options and makes it very easy for me to specify new content to search, translate and output. I have options which control the number of results to show, whether to ignore duplicate URLs, and whether to show just the link or also the snippet of content underneath. For example, the following is the code I use on the news page:

<script type="text/javascript">
// load google APIS I want to use
google.load("feeds", "1");
google.load("language", "1");
google.load("search", "1", H.Visitor.Language);

H.newsPageSetup = function(){
//set up 2 feed objects for the sidebar content
var blog = new H.Feeder();
var wnews = new H.Feeder();

blog.OutputNo = 12; // output 12 links
blog.FeedType = "blogs"; // set the type of feed
blog.getFeedOutputElement = "blogs"; // the node ID to output results
blog.findFeeds(); // find some relevant blogs, translate if necessary and output

wnews.FeedType = "news";
wnews.searchQuery = "World Cup 2010 News"; // overwrite the default search terms
wnews.ShowSnippet = true; // show the content snippet under the link
wnews.OutputNo = 5; // output 5 news articles
wnews.getFeedOutputElement = "worldcupnews"; // where to output news results
wnews.findFeeds(); // run news feed search, translate if necessary and output

// set up a search control to output a search box, paging for results etc
var news = new H.SearchControl();
news.controlType = "news"; // tell object to search for news
news.getFeedOutputElement = "news"; // where to output results in DOM
news.searchSetup(); // run search, translate and output results

// if visitor is not English then I want to translate some headers on this page
if(!V.isEnglish){
var sobj = new H.Search();
var arr = ["WorldCupNewsHeader","NewsHeader","BlogsHeader"];
sobj.TranslateContents(arr);
}

}
// On load run my initialise function
google.setOnLoadCallback(H.newsPageSetup,true);
</script>


As you will see if you take a look at the site, it's very easy to get some rich content up with very few lines of code. The only issue I currently have is that this functionality is all being done client-side with Javascript, which leads to two problems.

1. Roughly 10% of visitors (ignoring bots) have Javascript disabled by default. This means that apart from the league tables the site will look pretty bare.

2. Because the content is all loaded in using Javascript and is only visible in the DOM after the page has loaded, it means that for SEO purposes the source code is going to be pretty empty. I have a few ideas floating around regarding this and I will give more details if any of them come to fruition.

All in all I am pretty impressed with the framework, especially its simplicity, and hopefully others will feel the same way once they get stuck into developing with it.

