Sunday, 26 April 2009

Write your own script compressor

Compression Techniques

I have just finished writing my own script compressor to handle the compression on one of my larger systems. You may ask why I didn't just use one of the existing free compressors out there such as JSMin or YUI Compressor. Well, the answer is twofold:

1. I like to write my own code for the simple reason that I learn more about my craft this way. Although it takes time and usually blood, sweat and tears, I always come out the other side with more knowledge about development practices in general. See this link about whether to use a framework or write your own for more details.

2. The existing compression tools didn't do everything I wanted, and I also found they had issues with certain syntax such as complex regular expression literals and conditional comments.

This article is just an overview of some of the things I discovered during the building of my script in case other people want to write their own.

The main features of my compressor are:
  • Takes a directory as its source and will output all compressed files to a destination folder replicating the structure of the source folder.
  • Handles multiple file types, e.g. JS, PHP, ASP, INC, HTML, CSS.
  • A file such as an ASP file that contains in-line JS, HTML, CSS and ASP will have each "section" compressed according to that section's code type.
  • The usual compression technique of removing comments, excess white space and empty lines is carried out.
  • JS code is also compressed by having function parameters and local variables "minified" by replacing variable names with one character names. 
  • Global objects and common functions are also replaced with short versions.
  • Debug functions are removed.
  • JS functions are saved one per line so even the compressed file is nicely formatted.
  • Adds missing terminators ; to aid compression of functions to one line.
  • Corrects HTML so that it's valid XHTML.
  • Skips files that are already compressed.
  • Creates a log file detailing the files compressed and the compression rate.

Building a Compressor

I found out during this task that getting a fully working JavaScript compressor without the aid of a Java engine is not as simple as it may at first sound, especially if, like me, you want each function to appear on its own line and you are handling code that may not be correctly formatted in the first place (e.g. missing terminators ;).

If you just want to remove excess white space and comments then that's fine, but writing a "minifier" that renames variables with single-letter names is a bit trickier, as you need to identify all your functions and parameters correctly for this to work. Remember that there are a multitude of ways of defining functions in JavaScript, so it's not just a case of looking for the word function followed by a name, e.g.:



function myFunc(var1, var2){ return }

var myFunc = function(var1, var2){ return }

SomeObject.prototype.myFunc = function(var1, var2){ return }

myFunc : function(var1, var2){ return }

(function(){ /* code */ }())
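As a rough sketch of how those forms could be picked up (this is not the compressor's actual code), one pattern can treat the function name as optional and capture the parameter list. The helper name findFunctionParams is hypothetical, and the sketch assumes that comments, string literals and regular expression literals have already been removed or masked.

```javascript
// Hypothetical helper: list the parameters of every function definition in src.
// Assumes comments, string literals and regex literals are already masked.
function findFunctionParams(src) {
    // the word "function", an optional name, then the bracketed parameter list
    var re = /\bfunction\b\s*([A-Za-z_$][\w$]*)?\s*\(([^)]*)\)/g;
    var found = [], m;
    while ((m = re.exec(src)) !== null) {
        var params = m[2].replace(/\s+/g, "");
        found.push({
            name: m[1] || "(anonymous)",
            params: params === "" ? [] : params.split(",")
        });
    }
    return found;
}
```

Because the name is optional, assignments to variables, prototype members, object literal members and self-executing functions are all reported, just without a name.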


Also, unless you build a global system that checks which JavaScript functions and variables each file references, you can only safely minify local variables and function parameters. Otherwise you may rename a variable that is referenced by a file you don't know about, causing an error.

If you want to create a very simple minification process you can concentrate on some common global objects such as the window, document and navigator objects. You can also create yourself a small function for document.getElementById and then update all references to that, e.g.:
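A minimal sketch of the local-only renaming, under some loud assumptions: the input is the source of one function, literals and comments are already masked, there are no nested functions that reuse the parameter names, no existing local is already called a, b, c and so on, and there are at most 26 parameters. The name minifyParams is hypothetical.

```javascript
// Hypothetical sketch: rename the parameters of ONE function to a, b, c ...
// Assumes masked literals, no nested functions reusing the names,
// no clashing single-letter locals and no more than 26 parameters.
function minifyParams(fnSrc) {
    var m = /\(([^)]*)\)/.exec(fnSrc);
    var list = m ? m[1].replace(/\s+/g, "") : "";
    if (list === "") return fnSrc;
    var names = list.split(",");
    var map = {};
    for (var i = 0; i < names.length; i++) {
        map[names[i]] = String.fromCharCode(97 + i);
    }
    // one pass over every identifier; a leading dot marks a property access,
    // which must be left alone (obj.first is not the parameter first)
    return fnSrc.replace(/(\.?)([A-Za-z_$][\w$]*)/g, function (all, dot, id) {
        var isParam = dot === "" && Object.prototype.hasOwnProperty.call(map, id);
        return isParam ? map[id] : all;
    });
}
```

Doing the renaming in a single pass with a lookup table avoids the trap of renaming first to a and then renaming that a again when a later parameter is itself called a.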

var _w=window,_d=document,_n=navigator;
// create wrapper function
var _g = function(i){
    return _d.getElementById(i);
};

So every time your site references document.getElementById, which is probably quite a lot, you have saved yourself 21 characters (23 characters down to 2). Add to that the common use of window and document and you will save a fair few bytes from these changes alone.
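The swap itself could be sketched as a single replace that also reports how many characters it saved. The function name shortenGetById is hypothetical, and the sketch assumes string and regular expression literals are already masked and that the _g wrapper has been declared globally.

```javascript
// Hypothetical sketch: swap document.getElementById( for the _g( wrapper.
// Assumes literals are masked and _g is declared globally.
function shortenGetById(src) {
    // the leading group skips calls on some other document, e.g. frame.document
    var out = src.replace(/(^|[^.\w$])document\.getElementById\s*\(/g, "$1_g(");
    return { code: out, saved: src.length - out.length };
}

var result = shortenGetById('var a=document.getElementById("x"),b=document.getElementById("y");');
console.log(result.code + " (" + result.saved + " characters saved)");
```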


Store and Replace

I found that the best approach was to parse each file in layers, as files such as PHP or ASP will contain a mixture of server-side and client-side script as well as HTML and maybe CSS defined in style tags. With the help of some good regular expressions I would look for each type of code using its identifying markers, for example the opening and closing style tags. I would then store each block of code in an array, parse it accordingly and, when rebuilding the compressed file, re-insert the blocks in their original order. This got round issues where you have ASP script inside in-line JS script blocks within HTML inside an ASP page.

Each section would have its own function that compressed according to that language's syntax, but all of them would first store any string and regular expression literals in another array so that they stayed unaffected during compression, as you don't want to be removing white space and changing symbols within string literals.
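A minimal sketch of the store and replace idea for script and style blocks follows. The function names and the ##_BLOCK_n_## placeholder format are assumptions made for illustration, and a real file would need the same treatment for server-side delimiters.

```javascript
// Hypothetical sketch: park each <script> or <style> block in an array
// and leave a numbered placeholder behind.
function storeBlocks(src, store) {
    return src.replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, function (block) {
        store.push(block);
        return "##_BLOCK_" + (store.length - 1) + "_##";
    });
}

// Put every stored block back where its placeholder sits.
function restoreBlocks(src, store) {
    return src.replace(/##_BLOCK_(\d+)_##/g, function (all, index) {
        return store[Number(index)];
    });
}
```

Each stored entry can be compressed with the function for its own language before restoreBlocks is called, so the HTML pass never sees the script or CSS. Using a callback for the replacement also means a $ inside a stored block is not misread as a replacement pattern.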


Fun with regular expressions

Most of my compression was carried out using regular expressions, but some tasks such as identifying literals and comments were done in loops. This is pretty easy when you're looking for string literals, as you know they are going to be enclosed within double or single quotes, and you can combine this with a check for single and multi-line comments at the same time. However, in JS regular expressions can also be defined as literals, such as:

var re = /^\S\s+[^/].*?>/gi

You cannot just start at the first / and then stop at the next unescaped /, as you can use unescaped slashes within character classes, e.g. [/], so you would end prematurely. You also cannot stop at the last / you come across, as you may have multiple statements on the same line such as:



var re = /^\S\s+[^/].*?>/gi, str = str.replace(/##BR##/gi,"<br />");


Plus, as you can see, the replacement value in the second statement has a forward slash inside it, so you would cut off half the replacement value, causing a syntax error. At first I decided it didn't matter if a bit more than I intended got stored as a literal until it was put back in, but if that extra bit of code also contains variable names that you want to minify you will run into problems.

I got round this problem by first using a regular expression to identify unescaped forward slashes within expressions and replacing them with a placeholder. I could then use one pattern to match literals passed to string functions such as replace, match, search, split and compile, and another for literals used with regular expression functions such as test and exec. Once the literal is stored I can put the replaced slashes back.
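One possible reading of that approach, sketched with simplifying assumptions: slashes inside character classes are hidden behind a placeholder, after which a literal can safely run from its opening slash to the next unescaped slash. Instead of matching specific function names, this sketch only accepts a literal that follows =, (, a comma or a colon, which is a simplification of my own. The names and placeholder formats are hypothetical, and comments and string literals are assumed to have been dealt with already.

```javascript
var SLASH = "##_SLASH_##"; // placeholder text is an assumption

// Step 1: hide unescaped slashes that sit inside character classes such as [^/]
function hideClassSlashes(src) {
    return src.replace(/\[(?:\\.|[^\]\\\n])*\]/g, function (cls) {
        return cls.replace(/(\\.)|\//g, function (all, escaped) {
            return escaped ? escaped : SLASH;
        });
    });
}

// Step 2: store each literal, restoring its hidden slashes as it is stored
function storeRegexLiterals(src, store) {
    var hidden = hideClassSlashes(src);
    var pattern = /([=(,:]\s*)(\/(?:\\.|[^\/\\\n])+\/[gim]*)/g;
    var masked = hidden.replace(pattern, function (all, lead, literal) {
        store.push(literal.split(SLASH).join("/"));
        return lead + "##_RE_" + (store.length - 1) + "_##";
    });
    // any hidden slash that was not part of a literal goes back too
    return masked.split(SLASH).join("/");
}
```

With the two-statement line shown earlier, both literals end up in the store and the "<br />" replacement value is left in place, because its slash does not follow one of the accepted lead characters.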


Negative Matches

I also found that the previous technique of using a placeholder value before carrying out a regular expression match was very useful when dealing with complex negative matches. If you have a long string of text and you are trying to carry out a replacement except in certain instances, then this is a good way of avoiding a complex negative pattern match. For example, in my JS compression function I add extra terminators to the ends of lines to make sure I can get a whole function on to one line. However, doing this sometimes puts terminators in places they shouldn't be, so at the end I run some corrections which remove them from those places, such as directly after certain opening brackets. There are cases, though, where a terminator can legitimately appear directly after an opening bracket, such as a for statement with its first section left empty. Therefore, instead of using a complex negative match to do the excess terminator replacement, I use a placeholder, e.g.:

// put a placeholder in for the terminator I want to keep
strJSCompressed = strJSCompressed.replace(/for\(;/g,"##_FOR_TERM_##");

// carry out the replacement of terminators
strJSCompressed = strJSCompressed.replace(/([\{\(\[,><\|&])(;)/g,"$1");

Then once the replacements have been carried out I put back all the original values that the placeholders were storing, e.g.:


// put the original value back in place of the placeholder
strJSCompressed = strJSCompressed.replace(/##_FOR_TERM_##/g,"for(;");


This is a very useful technique when you want to avoid complex regular expressions that involve negative matches. As you should know by now, complex patterns and long strings combine to cause high CPU usage!

Another good example is HTML comments. I want to strip all HTML comments apart from the following derivatives:

Server Side Includes, e.g. <!-- #include virtual="/somefile.inc" -->

IE Conditional Comments, e.g. <!--[if lt IE 7]> OR <![endif]-->

Server Side META Includes, e.g. <!--METADATA TYPE="typelib" Blah -->

As you can see, trying to write one regular expression that strips all HTML comments from a file apart from those that start with #, [ or METADATA would involve some hardcore matching. It's very easy, though, to match each individual comment type first and replace it with a placeholder, then do my replace for everything between <!-- AND --> and then put the placeholder values back in.
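Those three steps can be sketched as below. The function name and the ##_KEEP_n_## placeholder format are assumptions, and only the comment forms listed above are protected, so other variants of conditional comments would need their own patterns.

```javascript
// Hypothetical sketch: strip HTML comments but keep SSI, conditional
// and METADATA comments by parking them behind placeholders first.
function stripHtmlComments(html) {
    var keep = [];
    var wanted = /<!--\s*#[\s\S]*?-->|<!--\[if[\s\S]*?\]>|<!\[endif\]-->|<!--\s*METADATA[\s\S]*?-->/gi;
    // 1. park the comments that must survive
    var out = html.replace(wanted, function (comment) {
        keep.push(comment);
        return "##_KEEP_" + (keep.length - 1) + "_##";
    });
    // 2. everything still between <!-- and --> can go
    out = out.replace(/<!--[\s\S]*?-->/g, "");
    // 3. put the parked comments back
    return out.replace(/##_KEEP_(\d+)_##/g, function (all, index) {
        return keep[Number(index)];
    });
}
```

Each of the three patterns stays simple, and none of them needs a negative match.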

A tidy page is a godly page

As well as carrying out all the usual compression functions I also incorporated a number of replacements to tidy up the code. If you're going to loop through each page in a system then this seems like a good place to do such things as:

  • Make my HTML XHTML compliant by encoding characters, expanding attributes, making sure all attributes are quoted and that my tags are lower case and some other HTML related tweaks.
  • Removing comments from within SCRIPT blocks, as they are not required any more, as well as shortening SCRIPT tags down to the minimum, e.g. removing the language and type attributes. Obviously this breaks your XHTML compliance, but then again you can't have everything.
  • Combine multi-line string literals together into one variable.
  • Combine variable declarations (server and client side) into one declaration.
  • Remove certain function calls such as calls to my custom ShowDebug function that outputs messages for client and server side script. I always build my codebase with the debug statements built in rather than add them in later as it makes debugging quicker and easier. However on a production system these function calls are expensive and unnecessary and should be removed.
  • Remove excess white space, usually TABS, within dynamic SQL strings. Obviously I don't do UPDATES or DELETES, only SELECTS, but I usually format my code with TABS.
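As an illustration of the debug removal item, here is a minimal sketch for client-side code. It assumes string literals are already masked (so any bracket inside the call belongs to code), that brackets inside the call nest at most one level deep, and that each ShowDebug call stands as a statement of its own rather than being the only statement after an if or else. The function name removeDebugCalls is hypothetical.

```javascript
// Hypothetical sketch: remove stand-alone ShowDebug(...) statements.
// Assumes masked string literals and at most one level of nested brackets.
function removeDebugCalls(src) {
    return src.replace(/\bShowDebug\s*\((?:[^()]|\([^()]*\))*\)\s*;?/g, "");
}

console.log(removeDebugCalls("var a=1;ShowDebug(a);run(a);"));
```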

Why Compress Server-Side Code?

Well yes, I know that even interpreted languages such as ASP get compiled into a token-based language which is cached by the web server, so there is not much scope for compression, but the smaller I can make the files the better in terms of storage and maybe caching. Plus, the main point of the server-side compression was to remove all my ShowDebug function calls to aid performance.


So can I get a copy?

Not at the moment, as I am in the process of testing it on a live system to iron out any bugs. At the moment the code is a script that I point at a directory and run. I am hoping to make a C#-based Windows application version of it and then I might put a copy up on the site.
