Saturday 27 February 2010

The difference between a fast running script and a slow one

When logging errors is not a good idea

I like to log errors and debug messages when running scripts as they help diagnose problems not immediately apparent when the script is started. There is nothing worse than starting a script running and then coming into work on a Monday morning to find that it hasn't done what it was supposed to and not know why. A helpful log file with details of the operation that failed and the input parameter values can be worth its weight in gold.

However, logging is an overhead and can literally be the difference between a script that takes days to run and one that takes minutes. I recently had a script to run on a number of webservers that had to segment out hundreds of thousands of files (documents in the < 500KB range) into a new folder structure.

My ideal solution was to collate a list of the physical files in the folder I wanted to segment and pass that into my SELECT statement so that I could return a recordset containing only the files that actually existed. However, this was thwarted by an error message I had never come across before, which claimed the DB server didn't have the resources available to compile the necessary query plan. Apparently this was due to the complexity of my query, and it recommended reducing the number of joins. As I only had one join this was not a solution, and after a few more attempts with some covering indexes that also failed I tried another approach.

The solution was just to try to move the file, catch any error that was raised and then log it to a file. The script was set running and a day later it was still running, along with a large log file containing lots of "File not found" errors.

The script was eventually killed for another unrelated reason which pissed me off as I had to get the segmented folder structure up and running sharpish. I modified the script so that before each move it checked whether the file existed. I had initially thought that this check might itself be an overhead, as the actual move method would be doing the same thing only at a lower level, and therefore I would be duplicating the search for a file in a large folder. I had reckoned that just trying to do the move and then catching any error would be quicker.

However, because I was logging each error to a text file this was causing an I/O bottleneck and slowing down the procedure immensely. Adding the FileExists check round each move ensured that the script only attempted to move files that definitely existed, so no errors were raised, which meant no logging and no I/O overhead.
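As a rough sketch of the check-before-move approach, here it is in Python, used purely for illustration as the original script's language isn't stated. The function name, the folder arguments and the returned counts are all hypothetical. Missing files are simply counted and skipped rather than written to a log.

```python
import os
import shutil


def move_existing_files(names, source_dir, target_dir):
    """Move only the named files that exist in source_dir.

    Returns a (moved, skipped) tuple of counts. Nothing is logged for
    missing files, so a missing file costs one existence check and no
    extra disk write.
    """
    os.makedirs(target_dir, exist_ok=True)
    moved = 0
    skipped = 0
    for name in names:
        source_path = os.path.join(source_dir, name)
        # The existence check replaces the try/move/catch/log pattern.
        if not os.path.isfile(source_path):
            skipped += 1
            continue
        shutil.move(source_path, os.path.join(target_dir, name))
        moved += 1
    return moved, skipped
```

Returning the skipped count keeps a little of the diagnostic value of the old log without paying for a write on every missing file.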

I had forgone the nicety of knowing which files were not on the server any more but I had also reduced the script's running time to a mere 25 minutes. Therefore the lesson to be learned is that although logging is useful it can also be a major overhead, and if you can do without it you may just speed up your scripts.

Handling YouTube Videos in Feeds

Reformatting HTML to include YouTube video links

If you use XML feeds to import content into your site or blog you may have found that your blogging software or add-on either errs on the side of caution for security reasons or just messes up the import, and in doing so strips out any EMBED or OBJECT tags.

This is especially annoying when the content is for a blog or news site and the imported content was linking to YouTube videos as it means you have to go through all the posts by hand and reformat the HTML so that the original video is shown.

Luckily even though the actual OBJECT tag gets removed you are usually left with a link to the video on YouTube's site. This can be utilised in a simple regular expression so that you can reformat your content and replace this link with the actual OBJECT HTML.

The regular expression is pretty simple and looks for any anchors pointing to YouTube's site and captures the video ID in a sub group. This sub group is then used in the replace statement as the value for the PARAM and EMBED values that require it.
// use preg_replace to replace anchors with OBJECT/EMBED
// $input is the source HTML
// the video ID class allows digits 0-9, underscores and hyphens as well as letters

$pattern = '/<a href="http:\/\/www\.youtube\.com\/watch\?v=([A-Za-z0-9_-]+)">[\s\S]+?<\/a>/i';

$replace = '<object width="425" height="344"><param name="movie" value="http://www.youtube.com/v/$1&hl=en_GB&fs=1&"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/$1&hl=en_GB&fs=1&" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"></embed></object>';

$html = preg_replace($pattern, $replace, $input);

I am using PHP as an example here but the regular expression would be the same across all languages.
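To illustrate that portability, here is a rough Python equivalent. Treat it as a sketch only: the function and constant names are hypothetical, the ID character class (letters, digits, underscore, hyphen) is my assumption about what YouTube IDs contain, and the http://www.youtube.com/v/ID player address is an assumption based on the old Flash embed format.

```python
import re

# Anchor pointing at a YouTube watch page; group 1 captures the video ID.
YOUTUBE_ANCHOR = re.compile(
    r'<a href="http://www\.youtube\.com/watch\?v=([A-Za-z0-9_-]+)">[\s\S]+?</a>',
    re.IGNORECASE,
)

# \g<1> is replaced with the captured video ID.
EMBED_TEMPLATE = (
    '<object width="425" height="344">'
    r'<param name="movie" value="http://www.youtube.com/v/\g<1>&hl=en_GB&fs=1&"></param>'
    '<param name="allowFullScreen" value="true"></param>'
    '<param name="allowscriptaccess" value="always"></param>'
    r'<embed src="http://www.youtube.com/v/\g<1>&hl=en_GB&fs=1&" '
    'type="application/x-shockwave-flash" allowscriptaccess="always" '
    'allowfullscreen="true" width="425" height="344"></embed>'
    '</object>'
)


def restore_youtube_embeds(html):
    """Replace plain YouTube anchors in html with OBJECT/EMBED markup."""
    return YOUTUBE_ANCHOR.sub(EMBED_TEMPLATE, html)
```

HTML that contains no matching anchor is returned unchanged, so it should be safe to run over every imported post.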

Either wrap a call to this function in your own import procedure or, if you are using an add-on like WordPress's WP-o-Matic, utilise it in the rewrite section when setting up a new feed import.

Saturday 20 February 2010

Strictly Software's Anonymizing Super Search Tool

Introducing the Anonymous Super Search Tool

The following free online tool which I have created follows in the same vein as the wonderful anonymizing search tools such as Scroogle which give you the searching power of Google but without all the privacy issues.

If you don't know what the privacy issues are when you carry out a Google search then you should be aware of the following:

1. A search on Google, Yahoo and Bing is carried out with an HTTP GET request. This means that your search terms are visible in the address bar e.g:

This means that the log files on the web server that carried out the search contain a record of your search request as well as your IP address and browser details.

2. As well as being stored in the web server's log files, your browser's history and any tracking cookies used to deliver term-specific advertising, a history of your search can also be stored on any intermediate servers that the request passed through on the way. If you are carrying out your search at work or school then it's very likely that all your traffic goes through a proxy or firewall and therefore they will also have a history of your search terms in the log files of those servers.

3. Because your search terms are visible in the URL it is very easy to block certain terms as well as whole sites in real time. It's also easy for your employer, school, ISP or an authority with a warrant to find out whether people have been searching for things they shouldn't have been, e.g. porn or banned sites.

So now you know why searching through the main providers can be a privacy concern. Obviously you might not care who knows about your search for penis pumps, blow up dolls, latex fetish sites, pre-op transsexuals and whatever else you may enjoy looking for but if you do what can you do about it?

Well, from now on you can check out my new Super Search tool. Not only does it return the top 10 results from the three major search providers at the same time but it also keeps your anonymity by using proxies and various other methods.

Method 1.
All searches are passed through a built-in proxy chain containing at least three steps from your PC to the search engine's web server. This means that your employer, school or ISP will only ever see your request to my domain and not to the search engine. Also the search engines will only ever see traffic from the last proxy before it arrived at their server. Because the proxies use different IP addresses from your own machine there is no way for them to link a search request with your computer, only with the proxy server, which is used by many users.

Method 2.
The initial request on my search page is made through an HTTP POST not a GET. As it takes a lot of work, disk space and resources to log all POST data it's not done as standard by Apache and IIS when they log HTTP requests. Therefore search terms would never be stored on my web server even if I did enable logging (see next point).

Method 3.
The actual logging of HTTP requests has been disabled on my web server for this domain. I keep no history of any searches made through this tool. I also don't use Google Analytics or any other client side urchin tracker tools on this page. There is no way I can tell who came to my tool and what searches they did and if I cannot know then it means no-one else can find out from me...

Method 4.
When possible the search requests are passed through proxies that support HTTPS which means they get encrypted en route between my server and the next one.

Method 5.
The final results are passed through an anonymizer tool which means if you want to look at any result in more detail you can be assured that the request is not linked back to your IP address as you are viewing the website through a proxy.
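The GET versus POST point in Method 2 can be seen in a small Python sketch that builds both kinds of request without sending anything. The search.example address and the q parameter are hypothetical placeholders, not the tool's real endpoint.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint used only for illustration; nothing is sent.
SEARCH_URL = "http://search.example/search"


def build_get_request(url, params):
    """Search terms end up in the URL, which is what access logs record."""
    return Request(url + "?" + urlencode(params), method="GET")


def build_post_request(url, params):
    """Search terms travel in the request body, leaving the URL clean."""
    return Request(url, data=urlencode(params).encode("ascii"), method="POST")


terms = {"q": "private search terms"}
get_request = build_get_request(SEARCH_URL, terms)
post_request = build_post_request(SEARCH_URL, terms)

print(get_request.full_url)   # the query string is visible in the address
print(post_request.full_url)  # the address alone gives nothing away
```

Anything that records only the requested URL, such as a default access log or a browser history, sees the terms in the first case but not in the second.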

The tool is called Super Search and can be found here: Anon Super Search

Also on a side note, if you ever want to check the details a web proxy is actually passing around, to make sure it is safe, then the following info tool details browser related information such as IP, GEO Location, HTTP Headers, Proxy Forwarded values, Agent and Referrer info. It's basic but a good quick way of checking what details you are currently sending.
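For a rough idea of the sort of check such a tool performs on proxy forwarded values, the Python sketch below picks out request headers that commonly reveal a proxy or the original client. It is not the info tool's actual code. The function name is hypothetical and the header list (X-Forwarded-For, Forwarded, Via) is a small assumed sample, not an exhaustive one.

```python
# Headers that commonly expose the original client or the proxy chain.
PROXY_REVEALING_HEADERS = ("X-Forwarded-For", "Forwarded", "Via")


def find_proxy_leaks(headers):
    """Return the proxy-revealing headers present in a request.

    headers is a dict of header name to value; header names are
    matched case-insensitively.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        name: lowered[name.lower()]
        for name in PROXY_REVEALING_HEADERS
        if name.lower() in lowered
    }
```

An empty result only means none of the listed headers were sent, not that the proxy is guaranteed to be anonymous.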

Saturday 6 February 2010

Browser Survey Results

The results of the Strictly Software Browser Survey

If you have visited my site in the last couple of months you may have noticed the survey on Browser usage that popped up. It was geared towards developers to gather information about their favourite browsers in terms of developing, debugging and features.

I have let it run for a couple of months and now I am publishing the results. Thanks to everyone who took the time to answer the questions.

Question 1: Which browser do you use when developing new code

Firefox 56%
Internet Explorer 21%
Chrome 10%
Opera 6%
Safari 5%
Other Option... 2%

Question 2: Which browser has the best inbuilt features for developing

Firefox 44%
Chrome 19%
Internet Explorer 17%
Safari 8%
Opera 8%
Other Option... 3%

Question 3: Which browser do you use for personal web surfing

Firefox 50%
Chrome 22%
Internet Explorer 14%
Opera 8%
Safari 5%
Other Option... 1%

Question 4: Which developer tool do you consider has the best features

Firebug 64%
IE 8's developer toolbar 17%
Webkit's Developer Tools 7%
Other Option... 5% (the majority of which said Web-Developer add-on for Firefox)
DragonFly 4%
Firebug-Lite 2%

Question 5: Which one feature could you not live without

Element inspection 40%
Dynamic DOM manipulation 20%
View generated source 18%
Clear cookies and cache 14%
Disable Javascript 6%
Other Option... 2%

I don't think anything stood out for me as a major surprise in the answers I received. In any survey done over the last 10 years or so over the whole web surfing community Internet Explorer always comes out in top place for browsing but the majority of users questioned are not developers and a large percentage of those are still only using it because they actually believe the little blue e on their desktop IS the Internet.

This is shown by a recent NetApps survey that found IE 8 has just taken the top spot from IE 6 and that Firefox 3.5 has taken 3rd place, narrowly beating IE 7. This is not much of a surprise to me as my own reports from the 200+ sites I run always show IE 6, 7 and 8 taking the top 3 spots with over 75% of all usage between them.

However, when looking at browser usage for people in the industry, i.e. web developers and designers, Firefox always seems to take top spot purely because of its numerous add-ons and built-in features. I also know from personal experience and from colleagues that within the year or so Chrome has been around it has built up quite an on-line following. For pure web surfing I personally don't think it can be beaten for speed, simplicity and usability, and this is borne out by it taking 2nd place as the browser techies like to use when surfing.

Anyway thanks for taking part.