Sunday 26 April 2009

Write your own script compressor

Compression Techniques

I have just finished writing my own script compressor to handle compression on one of my larger systems. You may ask why I didn't just use one of the existing free compressors out there such as JSMin or YUI Compressor. Well, the answer is twofold:

1. I like to write my own code for the simple reason that I learn more about my craft this way. Although it takes time, and usually blood, sweat and tears, I always come out the other side with more knowledge about development practices in general. See my post on whether to use a framework or write your own for more details.

2. The existing compression tools didn't do everything I wanted, and I also found they had problems with certain syntax such as complex regular expression literals and conditional comments.

This article is just an overview of some of the things I discovered during the building of my script in case other people want to write their own.

The main features of my compressor are:
  • Takes a directory as its source and will output all compressed files to a destination folder replicating the structure of the source folder.
  • Handles multiple file types e.g. JS, PHP, ASP, INC, HTML, CSS.
  • A file such as an ASP file that contains in-line JS, HTML, CSS and ASP will have each "section" compressed according to that section's code type.
  • The usual compression technique of removing comments, excess white space and empty lines is carried out.
  • JS code is also compressed by having function parameters and local variables "minified" by replacing variable names with one character names. 
  • Global objects and common functions are also replaced with short versions.
  • Debug functions are removed.
  • JS functions are saved one per line so even the compressed file is nicely formatted.
  • Adds missing terminators ; to aid compression of functions to one line.
  • Corrects HTML so that it's valid XHTML.
  • Skips files that are already compressed.
  • Creates a log file detailing the files compressed and the compression rate.

Building a Compressor

I found out during this task that getting a fully working JavaScript compressor without the aid of a Java engine is not as simple as it may at first sound, especially if, like me, you want each function to appear on its own line and you are handling code that may not be correctly formatted in the first place (e.g. missing terminators ;).

If you just want to remove excess white space and comments then that's fine, but writing a "minifier" that renames variables with single-letter names is a bit trickier, as you need to identify all your functions and parameters correctly for this to work. Remember there are a multitude of ways of defining functions in JavaScript as well, so it's not just a case of looking for the word function followed by a name, e.g.:



function myFunc(var1, var2){ return }

var myFunc = function(var1, var2){ return }

SomeObject.prototype.myFunc = function(var1, var2){ return }

myFunc : function(var1, var2)

(function(){ code }())
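As a rough illustration, the forms above could be picked up with one pattern that also looks to the left of the function keyword for an assignment or object key. The helper name findFunctions and the pattern itself are assumptions made for this sketch, not code from the compressor, and it assumes comments and string literals have already been set aside.

```javascript
// Hypothetical helper: find every function definition and report the name it
// is known by (if any) together with its parameter names.
function findFunctions(source) {
  // Optional "name =" or "name :" prefix, then the keyword, an optional
  // declared name and the parameter list.
  var pattern = /(?:([\w$.]+)\s*[=:]\s*)?\bfunction\b\s*([\w$]*)\s*\(([^)]*)\)/g;
  var found = [];
  var match;
  while ((match = pattern.exec(source)) !== null) {
    var params = match[3].split(",")
      .map(function (p) { return p.trim(); })
      .filter(function (p) { return p.length > 0; });
    // Prefer the declared name, fall back to the assignment target or key
    found.push({ name: match[2] || match[1] || null, params: params });
  }
  return found;
}
```

An anonymous, immediately invoked function comes back with a null name and an empty parameter list, which is a reminder that not every function has a name to key off.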


Also, unless you create a global system that checks which JavaScript functions are being referenced by each file, you can only really minify local function variables and parameters. Otherwise you may change the name of a variable that is being referenced by a file you don't know about, causing an error.
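A minimal sketch of renaming only what is local to one function might look like the following. The helper minifyParams is hypothetical, and it assumes that string and regular expression literals have already been swapped for placeholders, that the body contains no one-letter names already, and that there are no more than 26 parameters.

```javascript
// Hypothetical helper: rename each parameter to a one-character name inside
// a single function body, leaving everything outside that body alone.
function minifyParams(params, body) {
  var letters = "abcdefghijklmnopqrstuvwxyz";
  var shortNames = [];
  params.forEach(function (name, index) {
    var shortName = letters.charAt(index);
    // (^|[^\w$.]) stops us touching property accesses such as obj.var1,
    // (?![\w$]) stops us touching longer names such as var10
    var pattern = new RegExp("(^|[^\\w$.])" + name + "(?![\\w$])", "g");
    body = body.replace(pattern, "$1" + shortName);
    shortNames.push(shortName);
  });
  return { params: shortNames, body: body };
}

var result = minifyParams(["var1", "var2"], "return var1 + var2 * obj.var1;");
// result.body is "return a + b * obj.var1;"
```

Because only the parameter list and the body of one function are touched, nothing that another file could reference by name is changed.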

If you want to create a very simple minification process you can concentrate on some common global objects such as the window, document and navigator objects. You can also create yourself a small function for document.getElementById and then update all references to that, e.g.

// short aliases for the common global objects
var _w=window,_d=document,_n=navigator;
// create wrapper function for document.getElementById
var _g = function(i){
return _d.getElementById(i);
};

So every time your site references document.getElementById, which is probably quite a lot, you save yourself 21 characters (23 for document.getElementById against 2 for _g). Add to that the common use of window and document and you will save a fair few bytes from these changes alone.
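Updating the references could be sketched as below. The function shortenGlobals is a made-up name for illustration; it assumes the aliases _w, _d, _n and _g from above are declared before the compressed code runs, that string literals have been set aside first, and that no local variable shadows window, document or navigator.

```javascript
// Hypothetical helper: swap long global references for the short aliases.
// The leading (^|[^\w$.]) group keeps us away from things like my.document.
function shortenGlobals(code) {
  return code
    // must run before the plain document. replacement below
    .replace(/(^|[^\w$.])document\.getElementById\s*\(/g, "$1_g(")
    .replace(/(^|[^\w$.])window\./g, "$1_w.")
    .replace(/(^|[^\w$.])document\./g, "$1_d.")
    .replace(/(^|[^\w$.])navigator\./g, "$1_n.");
}

var shortened = shortenGlobals(
  "var el = document.getElementById('x'); window.alert(document.title);"
);
// shortened is "var el = _g('x'); _w.alert(_d.title);"
```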


Store and Replace

I found the best approach was a layered approach to parsing each file, as files such as PHP or ASP will contain a mixture of both server-side and client-side script as well as HTML and maybe CSS defined in style tags. With the help of some good regular expressions I would look for each type of code using its identifying markers, for example the open and close style tags. I would then store each block of code in an array, parse it accordingly and then, when rebuilding the compressed file, re-insert the blocks in order. This got round issues where you have ASP script inside in-line JS script blocks within HTML inside an ASP page.

Each section would have its own function that compressed according to that language's syntax, but all would first store any string and regular expression literals in another array so that they stayed unaffected during compression, as you don't want to be removing white space and changing symbols within string literals.
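The store and replace idea can be sketched in a few lines. Everything here is an assumption made for illustration: the function compressPage, the ##BLOCK_n## placeholder format and the compressors object (one function per code type) are not taken from the real script, and only style and script blocks are handled.

```javascript
// Hypothetical sketch: pull <style> and <script> blocks out of a page,
// compress each with its own function, then put them back in order.
function compressPage(page, compressors) {
  var stored = [];
  // 1. Store each block and leave a numbered placeholder behind
  var blockPattern = /<(style|script)\b[^>]*>[\s\S]*?<\/\1>/gi;
  var shell = page.replace(blockPattern, function (block, tag) {
    stored.push({ type: tag.toLowerCase(), code: block });
    return "##BLOCK_" + (stored.length - 1) + "##";
  });
  // 2. Compress what is left, which is now plain HTML
  shell = compressors.html(shell);
  // 3. Re-insert each block, compressed according to its own type
  return shell.replace(/##BLOCK_(\d+)##/g, function (placeholder, index) {
    var item = stored[Number(index)];
    return compressors[item.type](item.code);
  });
}
```

The one rule the HTML compressor has to follow is that it leaves the placeholders alone, which is why an unusual marker such as ## is a sensible choice.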


Fun with regular expressions

Most of my compression was carried out using regular expressions but some tasks, such as identifying literals and comments, were done in loops. This is pretty easy when you're looking for string literals as you know they are going to be enclosed within double or single quotes, and you can combine this with a check for single and multi-line comments at the same time. However, in JS regular expressions can also be defined as literals, such as:

var re = /^\S\s+[^/].*?>/gi

You cannot just start at the first / and stop at the next unescaped / because unescaped slashes are allowed within character classes, e.g. [/], so you would end prematurely. You also cannot stop at the last / you come across, as you may have multiple statements on the same line such as:



var re = /^\S\s+[^/].*?>/gi, str = str.replace(/##BR##/gi,"<br />");


Plus, as you can see, the replacement value in the second statement has a forward slash inside it, so you would cut off half the replacement value and cause a syntax error. My first reaction was that it didn't matter if a bit more than I intended got stored as a literal until it was put back in, but if that extra bit of code also contains variable names that you want to minify you will run into problems.

I got round this problem by first using a regular expression to identify unescaped forward slashes within expressions and replacing them with a placeholder. I could then use another pattern to match string functions such as replace, match, search, split and compile, and another for literals and regular expression functions such as test and exec. Once the literal is stored I can put the replaced slashes back.
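For comparison, here is a sketch of a different way of finding where a regular expression literal ends: a character-by-character scan that tracks whether it is inside a character class, rather than the placeholder approach described above. The function findRegexEnd is hypothetical, and it assumes the caller already knows that the slash at the start index opens a literal and is not a division sign.

```javascript
// Hypothetical helper: given the index of the opening "/" of a regex literal,
// return the index just past its closing "/" and flags, or -1 if none found.
function findRegexEnd(code, start) {
  var inClass = false;
  for (var i = start + 1; i < code.length; i++) {
    var ch = code.charAt(i);
    if (ch === "\\") { i++; continue; }   // skip whatever is escaped
    if (ch === "\n") { return -1; }       // a literal cannot span lines
    if (ch === "[") {
      inClass = true;
    } else if (ch === "]") {
      inClass = false;
    } else if (ch === "/" && !inClass) {
      i++;
      // step over any flags such as g and i
      while (i < code.length && /[a-z]/i.test(code.charAt(i))) { i++; }
      return i;
    }
  }
  return -1;
}

var line = 'var re = /^\\S\\s+[^/].*?>/gi, str = str.replace(/##BR##/gi,"<br />");';
var first = line.indexOf("/");
var firstEnd = findRegexEnd(line, first);
// line.slice(first, firstEnd) is the whole first literal, flags included
```

The slash inside [^/] is ignored because the scan knows it is inside a class, and the slash in "<br />" is never reached because the scan stops at the end of each literal.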


Negative Matches

I also found that the previous technique of using a placeholder value before carrying out a regular expression match was very useful when dealing with complex negative matches. If you have a long string of text and you are trying to carry out a replacement except in certain instances then this is a good way of avoiding a complex negative pattern match.

For example, in my JS compression function I add extra terminators to the end of lines to make sure I can get a whole function on to the same line. However, doing this sometimes puts terminators in places they shouldn't be, such as inside certain brackets, so at the end I run some corrections which remove them. There are cases though where a terminator can legitimately appear inside an open bracket, such as a for expression without all its sections. Therefore, instead of using a complex negative match to do the excess terminator replacement I use a placeholder, e.g.

// put a placeholder in for the terminator I want to keep
// (the bracket must be escaped and the g flag catches every occurrence)
strJSCompressed = strJSCompressed.replace(/for\(;/g,"##_FOR_TERM_##");

// carry out the replacement of terminators
strJSCompressed = strJSCompressed.replace(/([\{\(\[,><\|&])(;)/g,"$1");

Then once the replacements have been carried out I put back all the original values that the placeholders were storing, e.g.

// put the original value back in to my code
strJSCompressed = strJSCompressed.replace(/##_FOR_TERM_##/g,"for(;");


This is a very useful technique when you want to avoid complex regular expressions that involve negative matches because, as you should know by now, complex patterns and long strings combine to cause high CPU usage!

Another good example is HTML comments. I want to strip all HTML comments apart from the following derivatives:

Server Side Includes e.g. <!-- #include virtual="/somefile.inc" -->

IE Conditional Comments e.g. <!--[if lt IE 7]> OR <![endif]-->

Server Side META Includes e.g. <!--METADATA TYPE="typelib" Blah -->

As you can see, trying to write one regular expression that strips all HTML comments from a file apart from those that start with #, [ or METADATA would involve some hardcore matching. It's very easy though to match each individual comment type first and replace it with a placeholder, then do my replace for everything between <!-- and -->, and then put the placeholder values back in.
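A sketch of those three steps, with an assumed function name (stripHtmlComments) and an assumed ##KEEP_n## placeholder format, might look like this. The keep pattern is only a guess at what "starts with #, [ or METADATA" needs to cover and would want testing against real pages.

```javascript
// Hypothetical sketch: strip HTML comments but keep server side includes,
// IE conditional comments and METADATA comments.
function stripHtmlComments(html) {
  var kept = [];
  // 1. Swap each comment we want to keep for a numbered placeholder
  var keepPattern =
    /<!--\s*(?:#|\[if\b|METADATA\b)[\s\S]*?-->|(?:<!--\s*)?<!\[endif\]-->/gi;
  var work = html.replace(keepPattern, function (comment) {
    kept.push(comment);
    return "##KEEP_" + (kept.length - 1) + "##";
  });
  // 2. Everything still sitting between <!-- and --> can now go
  work = work.replace(/<!--[\s\S]*?-->/g, "");
  // 3. Put the kept comments back where their placeholders are
  return work.replace(/##KEEP_(\d+)##/g, function (placeholder, index) {
    return kept[Number(index)];
  });
}
```

Each of the three patterns is simple on its own, which is the whole point: the placeholder does the job that a negative match would otherwise have to do.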

A tidy page is a godly page

As well as carrying out all the usual compression functions I also incorporated a number of replacements to tidy up the code. If you're going to loop through each page in a system then this seems like a good place to do such things as:

  • Make my HTML XHTML compliant by encoding characters, expanding attributes, making sure all attributes are quoted and that my tags are lower case and some other HTML related tweaks.
  • Removing comments from within SCRIPT blocks as they are not required anymore, as well as shortening SCRIPT tags down to the minimum e.g. removing the language and type attributes. Obviously this breaks your XHTML compliance but then again you can't have everything.
  • Combine multi-line string literals together into one variable.
  • Combine variable declarations (server and client side) into one declaration.
  • Remove certain function calls such as calls to my custom ShowDebug function that outputs messages for client and server side script. I always build my codebase with the debug statements built in rather than add them in later as it makes debugging quicker and easier. However on a production system these function calls are expensive and unnecessary and should be removed.
  • Remove excess white space, usually TABS within dynamic SQL strings. Obviously I don't do UPDATES or DELETES only SELECTS but I usually format my code with TABS.
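As an example of one of these tidy-ups, removing the debug calls could be sketched with a single replacement. The function removeDebugCalls is a made-up name; the pattern assumes string literals have already been stored away (so a bracket inside a message cannot confuse it), copes with only one level of nested brackets in the arguments, and assumes every ShowDebug call is a statement on its own rather than the sole body of an if without braces.

```javascript
// Hypothetical helper: remove calls to ShowDebug(...) along with the
// terminator that follows them.
function removeDebugCalls(code) {
  // (?:[^()]|\([^()]*\))* matches the arguments, allowing one nested call
  var pattern = /(^|[^\w$.])ShowDebug\s*\((?:[^()]|\([^()]*\))*\)\s*;?/g;
  return code.replace(pattern, "$1");
}

var cleaned = removeDebugCalls(
  'var a = 1; ShowDebug("a is " + a); run(a); ShowDebug(fmt(a));'
);
// cleaned no longer contains any ShowDebug calls but still calls run(a)
```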

Why Compress Server-Side Code?

Well yes, I know that even interpreted languages such as ASP get compiled into a token-based language which is cached by the web server, so there is not much scope for compression, but the smaller I can make the file size the better in terms of storage and maybe caching. Plus the main point of the server-side compression was to remove all my ShowDebug function calls to aid performance.


So can I get a copy?

Not at the moment, as I am in the process of testing it on a live system to iron out any bugs. At the moment the code is a script that I point at a directory and run. I am hoping to make a C# based Windows application version of it and then I might put a copy up on the site.

Should I use a framework?

When to use a framework and when to write your own code

I am always asked at work why I don't just use a framework such as jQuery, Prototype, YUI, MooTools etc. rather than spend time writing my own code, and it's a fair point. I have spent time looking at the major frameworks and it's all good code written by clever people, and if you haven't got the time to spend then I would definitely recommend using a library. Then again, if John Resig had thought like that then millions of people would be using YUI instead of jQuery and Microsoft would be packaging another library with Visual Studio to handle selectors instead.

Libraries are good for many reasons: they hide browser incompatibilities from the developer, and it's good for a team of developers to stick to a standard code base rather than all adding their own functions and bloating a site up with several versions of the same addEvent or toggleClass function.

The downside is that most libraries will contain lots of code that is never even used by the developer. If you're not even going to be using selectors to return DOM objects in your JavaScript and are just looking for a shorter version of document.getElementById then using $('#blah') is not the way to go. 

The other good thing about writing your own code is that you get to understand the language of your trade a whole lot better than if you just relied on a library. There is no better way, in my opinion, of learning anything than being thrown in at the deep end and having to sink or swim, so to speak. Yes, it takes much longer as you will have to read up about the early browser wars and compatibility issues, learn about objects and their properties and understand event models and script syntax, but it will make you a much better programmer, and when bugs appear due to a new version of Internet Explorer you won't have to wait for an update to your framework to be released.

As with all things it's swings and roundabouts, and just because I like to write my own code and know why things work the way they do does not mean I won't use a library. However, having spent the time researching the language for my own code has given me invaluable knowledge, and it helps being able to step through something like jQuery and actually understand what it's doing and why rather than just knowing that it works.

Sunday 5 April 2009

Problems with Firebug and raised errors

Firebug raising false errors or incorrect line numbers

I have noticed recently that Firebug seems to be giving me unhelpful error reports and it seems to be happening more and more. By unhelpful I mean in the way Internet Explorer used to be, when an error was raised with an invalid or incorrect line number. I always thought Firebug was pretty spot on for error reporting but within the last month or so I have been getting quite a lot of errors in the console that say the problem is in file X on line 120 when in fact, after stepping through the code, the problem is actually in file Z on line 45. Sometimes the error details reported in the console do not correspond exactly to those in the standard error console, and a lot of the time errors that appear in Firebug's console do not appear in the standard error console at all.


Possible Causes to spurious error messages

Conflict with Firefox Add-Ons

I have found problems in the past with certain Firefox add-ons that raise errors constantly or intermittently. One such add-on is the User Agent Switcher, which always raises a "UserAgentButton is null" error, and another, a brilliant add-on for Search Engine Optimisation called the SEO Toolbar, keeps raising the error "s4fToolbar_cache is undefined".

Both of these errors disappear once the add-on has been disabled or in the case of the SEO toolbar you can turn it off without having to restart your browser. I have quite a number of add-ons on my work PC and due to my experience of errors related to add-ons I wouldn't rule out some sort of clash that is generating incorrect error details. 


Syntax Problems

I have found that sometimes when there are compilation errors, usually due to missing or incorrect brackets, the error can be reported incorrectly. If a function hasn't been closed correctly due to a missing bracket then the calling function can sometimes be identified as the erroring function. However, a large number of the errors I am getting invalid details for do not meet these criteria.


Firefox or Firebug installations

Within the last couple of months I have had upgrades of both Firefox and Firebug, so it may be that some problem within the application is causing the invalid error details.

If anyone else has found similar issues recently with their Firebug installation it would be good to hear about it. My colleagues also use Firefox and Firebug, but without the various add-ons I have installed, and they do not have these problems. So at the moment I am of the opinion that the problem is most likely due to some sort of clash between add-ons, as I cannot see such a problem not being spotted and resolved before an upgraded version of Firebug was released. I came to expect this sort of behaviour from IE, and it was just understood by everyone that you couldn't get a proper error message when you needed one, so it's a tad annoying to say the least for me to start experiencing this issue with Firefox.