Showing posts with label Compression techniques. Show all posts

Saturday, 22 August 2009

Strictly Javascript Compressor

Introducing yet another Javascript compressor

If you read a previous article about creating a script compressor you will know that I like to write my own code purely for masochistic reasons and mainly due to a rare form of OCD that makes me unable to stop coding until something is complete. I am sure this is a rather common affliction amongst some in the coding community; however, it does have its benefits. Although creating a compressor from scratch is a long and painful process, it has the upside of increasing knowledge about code syntax, compression techniques, regular expressions, language quirks and much, much more. I wouldn't recommend doing it unless you have plenty of time and a desire to know the pain that regular expression based compression can bring.

The Strictly Software Javascript Compressor Tool

I have put a cut down version of my compressor tool up on my website which you can find at the following link: Strictly Javascript Compressor

The online tool only works with Javascript but I have made available a number of options which will allow you to customise the compression process.


Compression Options


The following is a list of all the advanced options available with the online version of the compressor.

Compression Modes: Simple or Complex

Complex mode will try to format the Javascript so that each function is on its own line in the output. If you have a function that contains other functions then only the outermost function will be formatted like this. To accomplish this the engine must try to auto-correct any lines that have missing terminators or brackets by inserting them. In most cases this should work, but I cannot guarantee it in 100% of cases because the engine is based on regular expressions. However, if you format your code correctly beforehand, ensuring all lines are terminated and IF, ELSE statements are wrapped in brackets, then you shouldn't get any problems.

Simple mode will not try to put each function on its own line, although it will condense multiple occurrences of brackets to one line. This mode will give a longer output as it will contain more carriage returns; however, it will be less likely to cause syntax errors.

Complex mode will result in a better compression rate than simple mode, sometimes by 10% or more.

Minify Functions

If you have specific global functions that you wish to replace with smaller names then you can provide a list of the functions to replace in the format

[function:minified,function:minified]


For example, if you want to replace all occurrences of the function getEl with a minified name of G and the function addEvent with the name A, then you can provide the following list:

[getEl:G,addEvent:A]


Important Notes:
  • You can only provide 20 functions to rename with this online tool.
  • As my minification process renames long variable and parameter names in local functions with single letter versions starting from a and incrementing up to zz, you should avoid using lower case letters for your minified function names to avoid clashes. It's recommended to use upper case letters or underscores.
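As a rough illustration of how a list in this format could be parsed and applied (this is a sketch, not the tool's actual code), the snippet below uses two hypothetical helpers, parseRenameList and applyRenames. It assumes string and regular expression literals have already been set aside, so a plain identifier match cannot corrupt text inside them.

```javascript
// parseRenameList and applyRenames are hypothetical names for this sketch.

// Turn "[getEl:G,addEvent:A]" into [["getEl","G"],["addEvent","A"]].
function parseRenameList(list) {
  var inner = list.trim().replace(/^\[|\]$/g, "");
  if (inner === "") return [];
  return inner.split(",").map(function (pair) {
    var parts = pair.split(":");
    if (parts.length !== 2) throw new Error("Bad pair: " + pair);
    return [parts[0].trim(), parts[1].trim()];
  });
}

// Replace each name where it appears as a whole identifier that is not a
// property of another object (so obj.getEl is left alone).
// Assumption: string and regex literals were set aside beforehand.
function applyRenames(code, pairs) {
  return pairs.reduce(function (out, pair) {
    // Escape characters such as $ that are legal in names but special in regex.
    var escaped = pair[0].replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    var pattern = new RegExp("(^|[^\\w$.])" + escaped + "(?![\\w$])", "g");
    return out.replace(pattern, function (match, before) {
      return before + pair[1];
    });
  }, code);
}

var renamed = applyRenames(
  "var x=getEl('a');addEvent(x);obj.getEl();getElement();",
  parseRenameList("[getEl:G, addEvent:A]")
);
// renamed is "var x=G('a');A(x);obj.getEl();getElement();"
```

Note that getElement is untouched because the match insists on a whole identifier, which is the kind of clash the notes above warn about.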


Minify Objects

In the same way that you can change function names, you can also provide up to 20 global objects to replace with minified names. For example, if you have global objects Debugger and System you could replace them with _D and _S respectively. Provide this list in the following format:

[object:minified, object:minified]

[Debugger:_D,System:_S]

Important Notes
  • You can only provide 20 objects to rename with this online tool.
  • As my minification process renames long variable and parameter names in local functions with single letter versions starting from a and incrementing up to zz, you should avoid using lower case letters for your minified object names to avoid clashes. It's recommended to use upper case letters or underscores.

Minify Global Objects

This option will replace some standard global objects with smaller versions. The objects it will replace are window, document and navigator. The engine will add the following line of code to the output:

var _w=window,_d=document,_n=navigator


Then it will replace all occurrences of window with _w and so on. Note how I have added underscores to the variable names so as not to conflict with the standard minification process of function parameters and variables.
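A minimal sketch of this option might look like the following. The function name aliasGlobals is made up for illustration, and the sketch assumes string literals have already been set aside and that none of the three names is used as an object literal key.

```javascript
// aliasGlobals is a hypothetical name; this is a sketch, not the tool's code.
function aliasGlobals(code) {
  var aliases = { window: "_w", document: "_d", navigator: "_n" };
  // Only match the bare identifiers, never properties such as my.document.
  var body = code.replace(
    /(^|[^\w$.])(window|document|navigator)(?![\w$])/g,
    function (match, before, name) { return before + aliases[name]; }
  );
  // Declare the aliases once, ahead of the rewritten code.
  return "var _w=window,_d=document,_n=navigator;" + body;
}

var aliased = aliasGlobals("document.title=window.name;var a=my.document;");
// aliased is "var _w=window,_d=document,_n=navigator;_d.title=_w.name;var a=my.document;"
```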

Create a Get Function

This option will replace all references to document.getElementById with a call to a minified function. As this is a very common reference in Javascript code on the web it will save considerable bytes and is very easy to do. The name of the function created will be decided by the value supplied for the Get Function Name option.

Note: If none is supplied the value will be G.

Get Function Name

This option is related to the previous option and is only available if you have decided to create a Get function. The value you supply will be used for the minified function name. For example, if you provide the value _S then the following function will be created and added to the compressed output:

_S=function(i){return document.getElementById(i)}
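The two Get function options could be sketched roughly as below. addGetFunction is a hypothetical name, the default of G follows the note above, and the sketch assumes string literals have been set aside first. Each rewritten reference shrinks from the 23 characters of document.getElementById to the length of the chosen name.

```javascript
// addGetFunction is a hypothetical name; this is a sketch, not the tool's code.
function addGetFunction(code, name) {
  var fn = name || "G"; // fall back to G when no name is supplied
  // Rewrite bare references only, leaving e.g. frame.document.getElementById alone.
  var body = code.replace(
    /(^|[^\w$.])document\.getElementById(?![\w$])/g,
    function (match, before) { return before + fn; }
  );
  // The wrapper is added after the replacement so it is not rewritten itself.
  return fn + "=function(i){return document.getElementById(i)};" + body;
}

var withGet = addGetFunction("var a=document.getElementById('x');", "_S");
// withGet is "_S=function(i){return document.getElementById(i)};var a=_S('x');"
```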


Remove ShowDebug Functions

This may be a very specific function catering to my own needs but it's something I would recommend to all developers. As you create your code you should build in calls to a debug function that will output messages to the console (e.g. Firebug, Firebug-lite, or the IE and Chrome consoles). This is a much better idea than having to add debug code once a bug has been found and it will save time in fixing the bug. However, you should also remove all these calls on a live production environment, as you will not want your users to view these messages, and even if you turn debugging off inside the function an unnecessary call to the function is still made. I always call my debug function ShowDebug. This option will remove all these calls from the code.
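One way to strip such calls is to walk the code and match brackets, since a single regular expression struggles with nested brackets in the arguments. The sketch below is an illustration rather than the tool's own method. removeCalls is a hypothetical name, and it assumes each call stands alone as a statement inside braces (removing the only statement of an unbraced if would change behaviour) and that the arguments contain no regular expression literals.

```javascript
// removeCalls is a hypothetical name; this is a sketch, not the tool's code.
function removeCalls(code, fnName) {
  var needle = fnName + "(";
  var out = "", i = 0, start;
  while ((start = code.indexOf(needle, i)) !== -1) {
    var before = code.slice(0, start);
    // Leave longer identifiers, methods and the function definition alone.
    if (/[\w$.]$/.test(before) || /\bfunction\s+$/.test(before)) {
      out += code.slice(i, start + needle.length);
      i = start + needle.length;
      continue;
    }
    // Walk to the matching close bracket, ignoring brackets inside strings.
    var depth = 1, j = start + needle.length, quote = null;
    while (j < code.length && depth > 0) {
      var ch = code.charAt(j);
      if (quote) {
        if (ch === "\\") j++;              // skip the escaped character
        else if (ch === quote) quote = null;
      } else if (ch === '"' || ch === "'") quote = ch;
      else if (ch === "(") depth++;
      else if (ch === ")") depth--;
      j++;
    }
    if (code.charAt(j) === ";") j++;       // swallow the terminator as well
    out += code.slice(i, start);
    i = j;
  }
  return out + code.slice(i);
}

var cleaned = removeCalls('a=1;ShowDebug("x) " + f(1));b=2;', "ShowDebug");
// cleaned is "a=1;b=2;"
```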

If you would like to know more about debugging and creating a custom debug object please read the following blog article.

Sunday, 26 April 2009

Write your own script compressor

Compression Techniques

I have just finished writing my own script compressor to handle the compression on one of my larger systems. You may ask why I didn't just use one of the existing free compressors out there like jsmin or YUI compressor. Well, the answer is twofold:

1. I like to write my own code for the simple reason that I learn more about my craft doing it this way, and although it takes time and usually blood, sweat and tears I always come out the other side with more knowledge about development practices in general. See this link about whether to use a framework or write your own for more details.

2. The existing compression tools didn't do all I wanted them to do and also I found certain issues with certain syntax such as complex regular expression literals and conditional comments.

This article is just an overview of some of the things I discovered during the building of my script in case other people want to write their own.

The main features of my compressor are:
  • Takes a directory as its source and will output all compressed files to a destination folder replicating the structure of the source folder.
  • Handles multiple file types e.g. JS, PHP, ASP, INC, HTML, CSS
  • A file such as an ASP file that contains in-line JS, HTML, CSS and ASP will have each "section" compressed according to that section's code type.
  • The usual compression technique of removing comments, excess white space and empty lines is carried out.
  • JS code is also compressed by having function parameters and local variables "minified" by replacing variable names with one character names. 
  • Global objects and common functions are also replaced with short versions.
  • Debug functions are removed.
  • JS functions are saved one per line so even the compressed file is nicely formatted.
  • Adds missing terminators ; to aid compression of functions to one line.
  • Corrects HTML so that it is valid XHTML.
  • Skips files that are already compressed.
  • Creates a log file detailing the files compressed and the compression rate.
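To give a feel for the parameter "minification" mentioned in the list, here is a minimal sketch that hands out short names in the order a, b ... z, aa ... zz and applies them to the parameters of a single function. The helper names (shortName, shortNames, minifyParams) are invented for this example. It assumes string literals have been set aside, that the function body has no existing identifiers or object literal keys that clash with the new names, and it skips the two-letter reserved words do, if and in.

```javascript
// shortName, shortNames and minifyParams are hypothetical helper names.

// 0 -> "a", 25 -> "z", 26 -> "aa", 701 -> "zz".
function shortName(index) {
  var letters = "abcdefghijklmnopqrstuvwxyz";
  if (index < 26) return letters.charAt(index);
  var rest = index - 26;
  if (rest >= 26 * 26) throw new RangeError("Ran out of short names");
  return letters.charAt(Math.floor(rest / 26)) + letters.charAt(rest % 26);
}

// Return the first `count` usable names, skipping reserved words.
function shortNames(count) {
  var reserved = { "do": true, "if": true, "in": true };
  var names = [];
  for (var i = 0; names.length < count; i++) {
    var name = shortName(i);
    if (!reserved[name]) names.push(name);
  }
  return names;
}

// Rename the parameters of one function in a single pass, so that swapping
// names (e.g. a parameter already called "a") cannot cause double renames.
function minifyParams(source) {
  var match = /^\s*function\s*[\w$]*\s*\(([^)]*)\)/.exec(source);
  if (!match || match[1].trim() === "") return source;
  var params = match[1].split(",").map(function (p) { return p.trim(); });
  var names = shortNames(params.length);
  var map = {};
  params.forEach(function (p, i) { map[p] = names[i]; });
  return source.replace(/(^|[^\w$.])([A-Za-z_$][\w$]*)/g, function (m, before, id) {
    var known = Object.prototype.hasOwnProperty.call(map, id);
    return before + (known ? map[id] : id);
  });
}

var small = minifyParams("function add(first, second){ return first + second; }");
// small is "function add(a, b){ return a + b; }"
```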

Building a Compressor

I found out during this task that getting a fully working Javascript compressor without the aid of a Java engine is not as simple as it may at first sound, especially if, like me, you want each function to appear on its own line and you are handling code that may not be correctly formatted in the first place (e.g. missing terminators ;).

If you just want to remove excess white space and comments then that's fine, but writing a "minifier" that renames variables with single letter names is a bit trickier, as you need to identify all your functions and parameters correctly for this to work. Remember there are a multitude of ways of defining functions in JavaScript as well, so it's not just a case of looking for the word function, e.g.



function myFunc(var1, var2){ return }

var myFunc = function(var1, var2){ return }

SomeObject.prototype.myFunc = function(var1, var2){ return }

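// as a property inside an object literal
myFunc : function(var1, var2){ return }

// as an immediately invoked anonymous function
(function(){ /* code */ }())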


Also unless you create a global system that checks which JavaScript functions are being referenced by each file then you can only really minify local function variables and parameters. Otherwise you may change the name of a variable that is being referenced by a file you don't know about causing an error.
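As a starting point for spotting the different definition styles listed above, a single pattern with an optional "name = " or "name : " prefix can pull out a name and parameter list for each one. This is only a sketch with a hypothetical function name, findFunctions. It assumes comments and string literals were set aside first, and it will mislabel edge cases such as a function used as the last branch of a ternary.

```javascript
// findFunctions is a hypothetical name; this is a sketch, not the tool's code.
function findFunctions(code) {
  // Group 1: name on the left of = or :   Group 2: declared name
  // Group 3: raw parameter list
  var pattern = /(?:([\w$.]+)\s*[=:]\s*)?\bfunction\b\s*([\w$]*)\s*\(([^)]*)\)/g;
  var found = [], match;
  while ((match = pattern.exec(code)) !== null) {
    found.push({
      name: match[2] || match[1] || "(anonymous)",
      params: match[3].split(",")
        .map(function (p) { return p.trim(); })
        .filter(function (p) { return p !== ""; })
    });
  }
  return found;
}

var sample = [
  "function myFunc(var1, var2){ return }",
  "var myFunc = function(var1, var2){ return }",
  "SomeObject.prototype.myFunc = function(var1, var2){ return }",
  "myFunc : function(var1, var2){ return }",
  "(function(){ /* code */ }())"
].join("\n");

var functionNames = findFunctions(sample).map(function (f) { return f.name; });
// functionNames is ["myFunc", "myFunc", "SomeObject.prototype.myFunc",
//                   "myFunc", "(anonymous)"]
```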

If you want to create a very simple minification process you can concentrate on some common global objects such as the window, document and navigator objects. You can also create yourself a small function for document.getElementById and then update all references to that e.g

var _w=window,_d=document,_n=navigator;
// create wrapper function
var _g = function(i){
    return _d.getElementById(i);
};

So every time your site references document.getElementById, which is probably quite a lot, you have saved yourself 21 characters (23 for document.getElementById against 2 for _g). Add to that the common use of window and document and you will save a fair few bytes from these changes alone.


Store and Replace

I found the best approach was a layered approach to parsing each file, as files such as PHP or ASP will contain a mixture of both server-side and client-side script as well as HTML and maybe CSS defined in style tags. With the help of some good regular expressions I would look for each type of code using its identifying markers, for example the open and close style tags. I would then store each block of code in an array, parse it accordingly and then, when rebuilding the compressed file, re-insert the blocks in order. This got round issues where you have ASP script inside in-line JS script blocks within HTML inside an ASP page.

Each section would have its own function that compressed according to that language's syntax, but all would first store any string and regular expression literals in another array so that they stayed unaffected during any compression, as you don't want to be removing white space and changing symbols within string literals.
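The store and replace idea can be sketched for string literals as follows. The helper names and the ##STR_n## placeholder format are invented for this example, and it assumes comments have already been removed (an apostrophe inside a comment would otherwise look like the start of a string).

```javascript
// storeLiterals, restoreLiterals and compressJs are hypothetical names.

// Swap every string literal for a numbered placeholder and remember it.
function storeLiterals(code) {
  var store = [];
  var pattern = /"(?:[^"\\\n]|\\[\s\S])*"|'(?:[^'\\\n]|\\[\s\S])*'/g;
  var stripped = code.replace(pattern, function (literal) {
    store.push(literal);
    return "##STR_" + (store.length - 1) + "##";
  });
  return { code: stripped, store: store };
}

// Put the stored literals back in place of their placeholders.
function restoreLiterals(code, store) {
  return code.replace(/##STR_(\d+)##/g, function (match, index) {
    return store[Number(index)];
  });
}

// Squeeze white space while the literals are safely out of the way.
function compressJs(code) {
  var held = storeLiterals(code);
  var squeezed = held.code
    .replace(/\s+/g, " ")
    .replace(/\s*([=;,(){}])\s*/g, "$1");
  return restoreLiterals(squeezed, held.store);
}

var squeezedJs = compressJs('var  msg = "a   b"  +  \'c  ;  d\' ;');
// squeezedJs is: var msg="a   b" + 'c  ;  d';
```

The white space and the semi-colon inside the two literals survive untouched, while everything around them is condensed.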


Fun with regular expressions

Most of my compression was carried out using regular expressions, but some tasks such as identifying literals and comments were done in loops. This is pretty easy when you're looking for string literals, as you know they are going to be enclosed within double or single quotes, and you can combine this with a check for single and multi line comments at the same time. However, in JS regular expressions can also be defined as literals such as

var re = /^\S\s+[^/].*?>/gi

You cannot just start at the first / and then stop at the next unescaped /, as you can use unescaped slashes within character classes, e.g. [/], so you would end prematurely. You also cannot stop at the last / you come across, as you may have multiple statements on the same line such as:



var re = /^\S\s+[^/].*?>/gi, str = str.replace(/##BR##/gi,"<br />");


Plus, as you can see, the replacement value in the second statement has a forward slash inside it, so you would cut off half the replacement value and cause a syntax error. At first I decided it didn't matter if a bit more than I intended got stored as a literal until it was put back in, but if that extra bit of code also contains variable names that you want to minify you will run into problems.

I got round this problem by first using a regular expression to identify unescaped forward slashes within expressions and replacing them with a placeholder. I could then use one pattern to match calls to string functions such as replace, match, search, split and compile, and another for literals used with regular expression functions such as test and exec. Once the literal is stored I can put the original slashes back.
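An alternative to the placeholder approach described above is a small character-by-character scan that tracks whether it is inside a character class. The sketch below shows that swapped-in technique, not the method used in my compressor. regexLiteralEnd is a hypothetical name, and the caller is assumed to already know that the slash at the given index opens a regular expression rather than being a division sign.

```javascript
// regexLiteralEnd is a hypothetical name for this sketch.
// Given the index of the opening "/" of a regex literal, return the index
// just past its closing "/" and any flags, or -1 if the line ends first.
function regexLiteralEnd(code, start) {
  var inClass = false;
  for (var i = start + 1; i < code.length; i++) {
    var ch = code.charAt(i);
    if (ch === "\n") return -1;            // literals cannot span lines
    if (ch === "\\") { i++; continue; }    // skip the escaped character
    if (ch === "[") inClass = true;
    else if (ch === "]") inClass = false;
    else if (ch === "/" && !inClass) {
      i++;
      while (i < code.length && /[a-z]/i.test(code.charAt(i))) i++; // flags
      return i;
    }
  }
  return -1;
}

var line = 'var re = /^\\S\\s+[^/].*?>/gi, str = str.replace(/##BR##/gi,"<br />");';
var firstStart = line.indexOf("/");
var firstEnd = regexLiteralEnd(line, firstStart);
var firstLiteral = line.slice(firstStart, firstEnd);
// firstLiteral is the text /^\S\s+[^/].*?>/gi
```

The slash inside [^/] is skipped because the scan is still inside the character class, and the slash in "<br />" is never reached.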


Negative Matches

I also found that the previous technique of using a placeholder before carrying out a regular expression match was very useful when dealing with complex negative matches. If you have a long string of text and you are trying to carry out a replacement except in certain instances, then this is a good way of avoiding a complex negative pattern match. For example, in my JS compression function I add extra terminators to the end of lines to make sure I can get a whole function onto the same line. However, doing this sometimes puts terminators in places they shouldn't be, so at the end I run some corrections which remove them from those places, such as inside certain brackets. There are cases, though, where a terminator can legitimately appear straight after an open bracket, such as a for expression without all its sections. Therefore, instead of using a complex negative match to do the excess terminator replacement I use a placeholder, e.g.

// put a placeholder in for the terminator I want to keep
strJSCompressed = strJSCompressed.replace(/for\(;/g,"##_FOR_TERM_##");

// carry out the replacement of terminators
strJSCompressed = strJSCompressed.replace(/([\{\(\[,><\|&])(;)/g,"$1");

Then once the replacements have been carried out I put back all the original values that the placeholders were storing, e.g.


// put the original value back in to my code
strJSCompressed = strJSCompressed.replace(/##_FOR_TERM_##/g,"for(;");


This is a very useful technique when you want to avoid complex regular expressions that involve negative matches. As you should know by now, complex patterns and long strings combine to cause high CPU!

Another good example is HTML comments. I want to strip all HTML comments apart from the following derivatives:

Server Side Includes e.g. <!--#include virtual="/somefile.inc"-->

IE Conditional Comments e.g. <!--[if lt IE 7]> OR <![endif]-->

Server Side META Includes e.g. <!--METADATA TYPE="typelib" Blah -->

As you can see, trying to write one regular expression that would handle multiple pattern matches within a file and strip all HTML comments apart from those that start with #, [ or METADATA would involve some hardcore matching. It's very easy, though, to match each individual comment type first and replace it with a placeholder, then do my replace for everything between <!-- AND --> and then put the placeholder values back in.
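The three steps for HTML comments can be sketched like this. The function name stripHtmlComments and the ##_KEEP_n_## placeholder format are made up for the example, and it assumes the comments to keep do not themselves contain nested comments.

```javascript
// stripHtmlComments is a hypothetical name; this is a sketch only.
function stripHtmlComments(html) {
  var kept = [];
  // 1. Set aside the comments that must survive: server side includes,
  //    IE conditional comments and METADATA includes.
  var keepPattern = /<!--\s*(?:#|\[if\b|METADATA\b)[\s\S]*?-->|(?:<!--)?<!\[endif\]-->/gi;
  var held = html.replace(keepPattern, function (comment) {
    kept.push(comment);
    return "##_KEEP_" + (kept.length - 1) + "_##";
  });
  // 2. Remove everything else between <!-- and -->.
  var stripped = held.replace(/<!--[\s\S]*?-->/g, "");
  // 3. Put the stored comments back in place of their placeholders.
  return stripped.replace(/##_KEEP_(\d+)_##/g, function (match, index) {
    return kept[Number(index)];
  });
}

var page = '<p>a</p><!-- note --><!--#include virtual="/x.inc"-->' +
  '<!--[if lt IE 7]><b>old</b><![endif]-->' +
  '<!--METADATA TYPE="typelib" --><!-- bye -->';
var tidied = stripHtmlComments(page);
// tidied keeps the include, the conditional block and the METADATA comment,
// and drops "<!-- note -->" and "<!-- bye -->".
```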

A tidy page is a godly page

As well as carrying out all the usual compression functions I also incorporated a number of replacements to tidy up the code. If you're going to loop through each page in a system then this seems like a good place to do such things as:

  • Make my HTML XHTML compliant by encoding characters, expanding attributes, making sure all attributes are quoted and that my tags are lower case and some other HTML related tweaks.
  • Removing comments from within SCRIPT blocks as they are not required anymore, as well as shortening SCRIPT tags down to the minimum, e.g. removing the language and type attributes. Obviously this breaks your XHTML compliance but then again you can't have everything.
  • Combine multi-line string literals together into one variable.
  • Combine variable declarations (server and client side) into one declaration.
  • Remove certain function calls such as calls to my custom ShowDebug function that outputs messages for client and server side script. I always build my codebase with the debug statements built in rather than add them in later as it makes debugging quicker and easier. However on a production system these function calls are expensive and unnecessary and should be removed.
  • Remove excess white space, usually TABS within dynamic SQL strings. Obviously I don't do UPDATES or DELETES only SELECTS but I usually format my code with TABS.

Why Compress Server-Side Code?

Well, yes, I know that even interpreted languages such as ASP get compiled into a token based language which is cached by the web server, so there is not much scope for compression, but the smaller I can make the file size the better in terms of storage and maybe caching. Plus the main point of the server-side compression was to remove all my ShowDebug function calls to aid performance.


So can I get a copy?

Not at the moment, as I am in the process of testing it on a live system to iron out any bugs. At the moment the code is a script that I point at a directory and run. I am hoping to make a C# based Windows application version of it and then I might put a copy up on the site.