Thursday, 27 January 2011

2011 Browser Usage Stats

Browser Coverage and Other Visitor Statistics

I like to regularly check the web traffic stats for one of my largest systems to see what kind of browsers our users are visiting with, and I have previously posted reports showing IE maintaining its position at the top of the stats every time.

One of our sites is used by a large corporate company that heavily restricts the type of browser its workers can use to access the Internet, which means they have to use IE 6. Even accounting for that, it is surprising to see that IE 6 is still at the top of the browser usage report even though IE 8 has been out for a long time and IE 9 is on the way.

One other reason I can think of that explains why so many people appear to still be using IE 6 is that it seems to be the user agent of choice for spoofers and hackers. I have built an automated system that logs, identifies and then bans these bad bots and users, and I have built up quite a large database of known IP / agent combinations so I can regularly check what kind of tricks they are up to.

The latest batch of hackbots that I have spotted are using stripped-down, URL-encoded HTML with no quotes around attributes and no protocols in the links, e.g.



%3C%69%66%72%61%6D%65%20%73%72%63%3D%2F%2F%73%6F%6D%65%64%6F%64%67%79%73%69%74%65%2E%72%75%3E


which when URL decoded becomes


<iframe src=//somedodgysite.ru>

Even though there are no quotes around the src attribute and no protocol at the beginning of the URL, this HTML will still work, and dropping them is a common technique used by minifiers (including Google's) to cut down on the size of HTML files.

Obviously the whole point of this is to beat injection and hack tests that rely on pattern matching, in much the same way as those SQL injection attacks written in MiXeD CaSe aim to beat people who have forgotten to make their system's SQL injection detection routines case insensitive.
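As a rough illustration of the counter-measure, here is a minimal PHP sketch (it is not my production detection code and the patterns are just placeholders) that URL decodes incoming input a few times and then runs case-insensitive checks against it, so neither the encoding nor the MiXeD CaSe trick gets past the match:

// Minimal sketch: normalise the input before testing it so that URL encoding
// and mixed case tricks cannot slip past a naive pattern match.
function looks_like_injection($input){

    // decode %XX sequences a few times to catch double encoded payloads
    $decoded = $input;
    for($i = 0; $i < 3; $i++){
        $decoded = urldecode($decoded);
    }

    // placeholder patterns only - a real system would use a much larger, maintained list
    $patterns = array(
        '@<\s*iframe\b@i',       // injected iframes, with or without quotes or protocols
        '@<\s*script\b@i',       // injected script tags
        '@\bunion\s+select\b@i'  // classic SQL injection fragment
    );

    foreach($patterns as $pattern){
        if(preg_match($pattern, $decoded)){
            return true;
        }
    }

    return false;
}

// returns true for the encoded iframe example above
var_dump(looks_like_injection("%3C%69%66%72%61%6D%65%20%73%72%63%3D%2F%2F%73%6F%6D%65%64%6F%64%67%79%73%69%74%65%2E%72%75%3E"));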


Anyhow, here are the latest browser usage reports for the first month of 2011.


Top Browsers

Browser - Usage %
IE 6.0 - 48.12
IE 8.0 - 16.04
IE 7.0 - 12.14
Firefox 3.6 - 6.71
Chrome 8.0 - 4.82
IE 5.5 - 4.31
Safari 5.0 - 1.83
Firefox 3.0 - 1.06
Firefox 3.5 - 0.97
Safari 4.0 - 0.41
Opera 9.0 - 0.36
Opera 8.0 - 0.36
iPhone 4.2 - 0.34
Firefox 2.0 - 0.31
Mozilla 1.9 - 0.26
Iceweasel 3.0 - 0.26
iPhone 4.1 - 0.22
IE 9.0 - 0.21
BlackBerry - 0.19



Top Operating Systems

Operating System - Usage %
WinXP - 68.30
WinVista - 10.20
Win - 10.12
Win2000 - 4.87
MacOSX - 2.90
iPhone OSX - 1.23
Win2003 - 1.14
Linux - 0.66
WinME - 0.49
Win98 - 0.47
Debian - 0.26
WinNT - 0.24
Android - 0.22
BlackBerry - 0.17

Sunday, 23 January 2011

Strictly Software - Online CV

Are you looking for a Web Developer with 25 years of experience?

If you are a company or individual that requires an experienced systems developer for bespoke development work on websites, databases, scripts, plugins, bots and security then you should consider contacting Strictly Software for a formal quote.

Over my 25 years of development experience I have managed to acquire a wide range of technical skills that are very relevant in today's fast moving Internet world. A search on google.com will return many examples of my work from my publicly available free online tools and popular blog articles, to a number of popular WordPress plugins.

My skill-set covers everything from large scale database design, development and performance tuning to website development, auto-blogging, user tracking and black or white hat SEO techniques. A non-exhaustive list of my skills is below.
  • Database development and administration using MS SQL 6.5 - 2012 and MySQL.
  • Performance tuning including index optimisation, query plan analysis and caching.
  • Development of relational database systems, real time systems, EAV hybrids and systems built partially with automated code.
  • A good knowledge of system views which I have used to create a multitude of scripts to help locate, update and port data and clean up systems that have been hacked with SQL injections.
  • Automated reporting, analysis, diagnosis and correction scripts.
  • Front end development in C#, ASP.NET, PHP, Java, VB, VBA, ASP Classic and Server and Client Side JavaScript on a number of commercial and personal sites as well as a number of intranets.
  • Cross browser script development and the use of unobtrusive and progressive enhancement scripting techniques.
  • XML, HTML, XHTML, RSS and OPML parsing.
  • Web 2.0 development skills including RPC, RESTful methods and good knowledge of the issues surrounding internationalisation and multi byte character sets.
  • AJAX, JSON and using Object Orientated JavaScript.
  • Intermediate CSS skills and good DOM manipulation skills.
  • Good knowledge of writing bots, crawlers and screen scrapers.
  • Experience of hooking into popular APIs such as Twitter, Google or Betfair.

Not only have I developed a number of useful PHP plugins for Wordpress, including:

  • Strictly AutoTags, an automatic tagging tool that uses keyword density and HTML tag analysis to find appropriate words to be used as tags for articles.
  • Strictly Tweetbot, an automated content analysis plugin that uses OAuth to send tweets to multiple Twitter accounts.
  • Strictly System Check, a report plugin that checks for site uptime and runs a number of automatic fixes depending on identifiable problems.

I have also created a number of web tools using PHP which included:
  • Super Search - An anonymising search engine that let you search the top 3 search engines. It stopped working when one of my proxies was shut down, however it was a test to prove it could be done and it used my own language, called SCRAPE, that I had created.
  • MyIP - A browser connection analysis tool.
  • WebScan - A tool to scan a webpage and find out key info such as malware, trackers, outbound links, spam ratings, DNS checks and much more.





As well as the PRO Twitter plugins available for purchase, I have numerous free to use plugins, scripts, tools and online programs that let you do all sorts of things, like HTML encode JavaScript and decompress code that has been packed multiple times with Dean Edwards' packer.

I have also worked on a number of client side tools based on JavaScript and AJAX including my Getme.js example framework which I use alongside Sizzle for DOM manipulation.

I have personally identified a number of major problems with the common frameworks such as jQuery and Prototype, which is why I do not use other people's code unless I am required to, and I have written a number of articles about these problems which can be found here:



I also specialise in writing tools for automatic HTTP parsing and screen scraping and have built a number of objects in C#, PHP and JavaScript that enable me to scrape and reformat articles on the fly. I used to run a fully automated business which randomised data to create seemingly specific emails, and I created a great looking website that hooked into Betfair's API to show recent results, upcoming races and market prices. It also showed racing information and news articles obtained from various sites using my own SCRAPE bot, and had fully automated daily optimisation and maintenance as well as SEO optimisation and automatic advertising: when new articles were imported they were optimised before tweets and Facebook/LinkedIn/Tumblr and other social media posts were automatically sent out to bring in traffic.

As someone who has to maintain a system that receives millions of page hits a day I am constantly engaged in a battle against malicious bots and users who are trying to either hack my system or steal content. Therefore I have built up quite a large amount of knowledge on the best practices for identifying and preventing bad users, and this information also comes in useful when I have to make automated requests myself.

I have also written a number of articles on the topic of bad bots and scraping as well as how a bot should go about crawling a site so that it doesn't get banned including writing an example C# object to parse a Robots.txt file.

I also run a number of personal sites that have increased my own personal knowledge of successful SEO strategies and I have authored a number of articles about tips and tricks I have come across that help in this regard:




As well as writing over 100 technical articles for this blog I have developed a number of popular and useful tools such as:

  • Twitter Translator - One of the first sites to offer translation of Twitter messages on the fly in all the languages Google Translate offered. This had to be shut down due to Twitter changing their open API to a closed OAuth system.
  • HTML Encoder and Decoder - This very popular tool is based on a free object I created that allows users to HTML encode and decode in JavaScript. Unlike similar scripts my tool handles double encoding, which is a big issue when you are handling content taken from multiple sources that may or may not have been merged with other content or already encoded (a rough sketch of the idea follows this list).
  • Javascript Unpacker and Beautifier - This popular tool is used by myself and many others to unpack compressed or packed JavaScript code and reformat it into a readable format.
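The double encoding point is easier to show in code than in words. The actual tool is written in JavaScript, but here is a minimal PHP sketch of the same idea (illustrative only, the function name is mine): keep decoding until the string stops changing, so content that has been entity encoded once, twice or more always ends up fully decoded.

// Illustrative sketch: decode HTML entities repeatedly until the string stops
// changing, so double encoded input ends up the same as singly encoded input.
function decode_fully($text){
    do {
        $previous = $text;
        $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    } while($text !== $previous);

    return $text;
}

echo decode_fully('&amp;pound;100'); // double encoded input -> £100
echo decode_fully('&pound;100');     // singly encoded input -> £100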

I used to have an APACHE server, which I don't have anymore unfortunately, as it had a number of great tools and other sites including my own URL Shortener, a Spam Checker and a Pirate Bay Proxy checker script. These used to be easily accessed from my www.strictly-software.com site which, due to unfortunate circumstances, was moved to a French hosting company, OVH, who have blocked the owner from accessing it.

Due to crazy circumstances my site was hosted, along with its SSL certificates, on their server, which held a horse racing site I part owned as well as my own site. The horse racing site got sold and the owners built a new site on a different server, but they cannot stop OVH from charging them monthly fees for Windows/SQL Server/hosting etc due to a change in OVH's login process, where they now require people to log in via a link sent out to a specific email address.

I have no access to the address, nor do the new owners, as the old owners deleted it and it cannot be recreated. Trying to speak to OVH just results in them saying that they cannot change the email due to EU Data Protection laws. So this company is paying a huge monthly fee to keep a server running which only hosts my own site that I cannot even access to add a new SSL, or move the DB and code to another webserver. So that is why you don't get an https version of my site at the moment!

I used to work at a well known UK software development company based outside London.

I worked for this company for 11 years and during that time I was the architect of 3 versions of an award winning job recruitment software product. The software I developed ran over 500 different job boards and consisted of custom code (client and server) that was written from the ground up using Microsoft technologies.

The generic codebase I created and the back-end management system that controls it allow for extremely fast production times. Also, because of the generic design, the main time delay between specification and deployment for each new site is actually the front end design and not the database or business layer development, as it usually is with other systems.

I actually automated the process of creating a new site so that the non techies who do the project management could easily follow a stepped process to create a new job board by filling in a series of forms. These forms asked for information such as the new URL, the site name, any existing site to copy settings and categories from, as well as a site to copy the design from, which could then be changed.

I then automated the process of creating new database records, copying data from existing sites to the new one, updating IDs and linked lists, building text files such as ISAPI rewrite files and constant files, setting up folder structures and even creating the site in IIS so that at the end of the process there was a fully functional new site ready to be tweaked by the "colouring in" people e.g the designers LOL.

Using a mix between normalised relational structures and EAV for custom requirements alongside automatically generated de-normalised flat tables and matrices for high speed searching, the system I created perfectly straddles the sometimes problematic trade off many developers have to make between quick development, good solid design and high performance.

The skills I have learnt developing this software have proved that it is possible to maintain high traffic systems on Microsoft architecture utilising legacy technologies such as ASP Classic, along with a number of .NET web services and console scripts to send bulk emails out, handle incoming spam mail, import jobs via XML and perform many other external tasks.

By developing this system I learnt skills and techniques such as:
  • Database performance tuning.
  • Application layer performance tuning.
  • Developing highly secure websites.
  • Automated traffic analysis and real time logging, visitor fingerprinting and automatic banning techniques.
  • Object orientated and procedural development methodologies.
  • Caching, Minification, compression and other optimisation techniques.
  • JavaScript widget development, including a custom WYSIWYG editor, my own framework and many animated tools including geographical searching using Google's Maps API and our own data.
  • Automated tasks to report, analyse and fix potential issues.
  • Good coding practices including limited 3rd party COM object use, object re-use and other well known but sadly untaught tricks of the trade.
  • C# .NET coding for various external scripts that vastly sped up processes such as bulk mailing job alerts to thousands of users when compared to old VBScript techniques.
  • Use of NoSQL databases such as DTSearch for indexing our jobs and CVs and providing a fast search of the files using their COM interface.

A cursory look over this blog and my very old site www.strictly-software.com (which I cannot actually access anymore due to the OVH server issue mentioned earlier) will show you the wide variety of skills that I am trained in, and hopefully the depth of knowledge that my articles and examples deliver will prove that I know what I am talking about.

If you are interested in learning more about my work or are looking to hire someone for web or database development then you should contact me for further details.

If you are interested in having a performance audit on your legacy systems before considering whether to rewrite or tune up then you can also contact me at the details provided in the contact link in the footer.


Why I hate debugging in Internet Explorer 8

Debugging Javascript in Internet Explorer

Now with IE 8 and its new developer toolbar, debugging Javascript should have become a lot easier. There is now no need to load up Firebug Lite to get a console or use bookmarklets to inspect the DOM or view the generated source, and in theory this is all well and good.

However in practice I have found IE 8 to be so unusable that I have literally stopped using it during development unless I really have to.

When I do find myself having to test some code to ensure it works in IE I have to make a little prayer to the God of Geekdom before opening it up because I know that within the next 5 minutes I will have found myself killing the IE process within Task Manager a couple of times at the very least.

Not only does IE 8 consume a large portion of my CPU cycles, its very ill thought out event model makes debugging any kind of DOM manipulation a slow and painful chore.

Unlike proper browsers, IE's event model is single threaded, which means that only one event can occur at any point in time during a page's lifetime. This is why IE has the window.event object, as it holds the current event being processed at any point in time.

Many developers over the years have moaned at IE for this and have hoped that with each new release of the browser they would fix this odd behaviour. However, every time a new version is rolled out a lot of web developers are bitterly disappointed, because apparently Microsoft feels this quirky event model is a design feature to be enjoyed rather than a bug to be suffered, and they don't seem to have any intention whatsoever of fixing it or making it DOM 2 compliant at the very least.

I don't know why they cannot do what Opera does and just implement both event models. At least that way they could make all the proper web developers happy at the same time as all the sado-masochists who enjoy IE development.

This event model really comes into its own when you are trying to debug something using the new developer toolbar as very often you want to pipe debug messages to the console.

If you are just outputting the odd message whenever a button is clicked or a form is loaded then this can be okay, but if you attempt to do anything that involves fast moving events, such as moving elements around the page or tracking fast incrementing counters, then you will soon suffer the pain that comes with waiting many minutes for the console to catch up with the browser's actions.

Whereas other browsers such as Chrome or Firefox are perfectly capable of outputting lots of debug messages to the console at the same time as mousemove events are being fired, IE seems to either batch them all up and spit them out when the CPU drops below 50% (which might be never), or occasionally the whole browser will just crash.

At first I thought it was just me as my work PC is not the fastest machine in the office, but I have seen this problem on many other people's computers and I have also experienced it at home on my Sony VAIO laptop.

As a test I have created the following page which can be found here:


Try it out for yourself in a couple of browsers including IE 8 and see what you think. I haven't had the chance to test IE 9 yet so I don't know if the problem is the same with that version but I would be interested to know if anyone could test this for me.

The test is simple and just includes a mousemove event which collects the current mouse co-ordinates and pipes them to the console. It does this for a number of iterations which can be set with the input box.

I have found that IE will manage when this counter is set to anything less than 100, but putting it to 1000 or above just crashes or freezes my browser as well as killing the CPU.

Let me know what you think of this test and whether anyone else has major issues with IE and its console logging capabilities.

Monday, 3 January 2011

Handling UTF-8 characters when scraping

Handling incorrectly formatted characters when scraping

Scraping can be a bit of a nightmare as you cannot expect every web page to be written to the same standard, and therefore you will find most of the time is spent trying to handle discrepancies in formats, bad encoding and so on.

On one of my sites, noagendashownotes.com, I create backup links of original news stories so that if the original story gets taken down (which happens a lot) the original version is still available.

This means I have to create local versions of the remote files, and this isn't too much of a problem as it's not too hard to convert relative links to absolute and so on. One of the problems is sites that load content such as CSS with client side Javascript code, as it's virtually impossible with a simple server side scraping tool to work out what's going on when libraries are loading other libraries and file paths are built up with Javascript. However, luckily this doesn't happen too much so I am not too concerned about it.
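To give an idea of what the relative-to-absolute conversion involves, here is a simplified PHP sketch (the function name is mine, and it deliberately ignores details such as ../ segments and resolving against the page's directory, which a real version would handle):

// Simplified sketch: rewrite relative href/src values as absolute URLs when
// saving a local copy of a page. $base is the URL the page was scraped from.
function make_links_absolute($html, $base){

    $base = rtrim($base, '/');

    return preg_replace_callback(
        '@(href|src)=(["\'])(.*?)\2@i',
        function($m) use ($base){

            $url = $m[3];

            // leave absolute and protocol-relative URLs alone
            if(preg_match('@^(https?:)?//@i', $url)){
                return $m[0];
            }

            // root-relative links hang off the scheme and host only
            if(substr($url, 0, 1) == '/'){
                $parts = parse_url($base);
                $url = $parts['scheme'] . '://' . $parts['host'] . $url;
            }else{
                $url = $base . '/' . $url;
            }

            return $m[1] . '=' . $m[2] . $url . $m[2];
        },
        $html
    );
}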

One thing that does happen a lot though is character encoding issues caused by a webpage mismatching the character sets declared on the server and the client. This causes issues when you are scraping and saving to another file, because when the file is viewed in a browser as a static HTML file it only has the META charset tag value to go on.


Take a look at the HTML source code of such a page and you will see that it is using a META tag like the following

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> 

which tells the browser to treat the character set as ISO-8859-1 (the extended Latin version of ASCII).

However if you use something like HTTP Fox to examine the response headers you will see that the actual response charset is UTF-8. The page will be using server side code, e.g. PHP, JSP or ASPX, to set this like so

Response.Charset = "UTF-8"

or

header('Content-Type: text/html; charset=UTF-8');

Now when I scraped this page, saved it as a local file (with a UTF-8 encoding) and then viewed that local file in my browser, all the extended UTF-8 characters such as special quote marks or apostrophes appeared as the usual garbage, e.g.

New Yorkâ€™s governor.

instead of

New York’s governor

This is because the browser only has the HTML to tell it what character set to use and this has been set incorrectly to ISO instead of UTF-8.

Because the page is now a static HTML file rather than a dynamically generated page there is no server side code setting the Response Charset headers. Again HTTP Fox is useful for examining the response headers to prove a point.

I am pretty new to PHP and I searched around the web for a few suggestions on how to fix this, which included things like wrapping the file_get_contents function in an mb_convert_encoding call, e.g.

function file_get_contents_utf8($fn, $incpath, $context) {
    $content = file_get_contents($fn, $incpath, $context);
    return mb_convert_encoding($content, 'UTF-8',
        mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}

However this didn't solve the problem, so I came up with a method that did work for what I am trying to do (e.g. create static HTML versions of dynamic pages). This method involves using regular expressions to reformat the HTML so that any mismatched CHARSET settings are correctly set to UTF-8.

This function is designed to replace the value of any META CHARSET tag with UTF-8. It works with both of these formats (with single or double quotes)

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
OR

<meta charset='iso-8859-1' />

function ConvertHeaderToUTF8($html){

    // look for an existing charset value
    if(preg_match("@<meta[\s\S]+?charset=['\"]?(.+?)['\"]\s*/?>@i", $html, $match)){

        $charset = $match[1];

        // check value for UTF-8
        if($charset != "UTF-8"){

            // change it to UTF-8
            $html = preg_replace("@(^[\s\S]+?<meta[\s\S]+?charset=['\"]?)(.+?)(['\"]\s*/?>[\s\S]+$)@i", "$1UTF-8$3", $html);
        }
    }

    return $html;
}
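A typical call when saving the static copy would look roughly like this (the URL and file names here are just examples):

// fetch the remote page (in reality this goes through my scraping code)
$html = file_get_contents('http://www.example.com/news/story.html');

// make sure the META charset in the saved copy says UTF-8
$html = ConvertHeaderToUTF8($html);

// write the corrected markup out as a static, UTF-8 encoded HTML file
file_put_contents('/backups/story.html', $html);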



This solved the problem for me perfectly. If anyone else has another way of solving this issue without creating local PHP / ASP files that set a Response.Charset = "UTF-8" please let me know as I would be interested to hear about it.

Saturday, 1 January 2011

Strictly Tweetbot Wordpress Plugin

Automatically post tweets to multiple accounts with Strictly Tweetbot for Wordpress

If you are a frequent Wordpress and Twitter user then you might be interested in one of the plugins I have developed for Wordpress, called Strictly Tweetbot. This plugin is ideal for news aggregators or auto blogging sites which need to run 24/7 and maintain an online presence across multiple spheres with little oversight.

Most people who make use of Twitter will post a message or two about new articles, and it is a great way of getting content indexed quickly: as you will see from my own investigations, dozens of bots, SERPs and social media scrapers will visit your content as soon as a link is posted on Twitter.

Whilst there are many Twitter based plugins already available for Wordpress, I found that they didn't really meet all the requirements of my own auto blogging sites, which were:

  • The ability to post to multiple twitter accounts
  • The ability to post different messages to different accounts
  • The ability to post multiple messages to the same account
  • The ability to decide whether or not to post to an account by checking the content of the message. For example I have an IT related Twitter account, and on a certain site I only want to post messages to that account if the article in question contains certain IT keywords (a rough sketch of this idea follows the list).
  • The ability to convert post tags or categories into hash tags
  • Using the new OAuth method of authenticating Twitter accounts without having to register each blog that uses the plugin as an application and consumer.
  • And of course the ability to automatically shorten the URLs
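To give a feel for the content checking idea in that list, here is a stripped-down PHP sketch (illustrative only, not the plugin's actual code, and the keyword list is a placeholder) of deciding whether an article should be tweeted to a particular account:

// Illustrative sketch: only tweet to an account if the article contains at
// least one of the keywords configured for that account (case insensitive).
function should_post_to_account($content, array $keywords){
    foreach($keywords as $keyword){
        if(stripos($content, $keyword) !== false){
            return true;
        }
    }
    return false;
}

$it_keywords  = array('SQL', 'JavaScript', 'PHP', 'server');   // placeholder keyword list
$article_text = 'New article about tuning SQL Server indexes'; // example article content

if(should_post_to_account($article_text, $it_keywords)){
    // hand the message off to whatever sends the tweet for this account
    echo 'Article matches the IT account keywords, so a tweet would be sent.';
}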

As I didn't require some of the other features such as the ability to post a tweet from my home page or create daily or weekly twitter digest posts I came up with my own plugin Strictly Tweetbot.

Used on its own it's great for auto blogging but when combined with my Strictly Auto Tags plugin that finds new tags within articles it means all my tweets have relevant hash tags appended to them.

Another nice feature is the reporting tool that enables you to view from the admin panel the last messages that were posted by the plugin as well as any potential problems with Twitter.

You can download the latest version of the plugin from Wordpress at: wordpress.org/extend/plugins/strictly-tweetbot/.

If you like this plugin you might be interested in my other Wordpress plugins:





For full details check out my plugin page at www.strictly-software.com/plugins/strictly-wordpress-plugins