Monday, 25 May 2009

SEO - Search Engine Optimization

My two cents worth about Search Engine Optimisation

SEO is big bucks at the moment, and it seems to be one of those areas of the web that attracts lots of snake oil salesmen and "SEO experts" who will promise number 1 positioning on Google, Bing and Yahoo for $$$ per month.

It is an area I didn't pay much attention to when I started web development, mainly because I was not the person paying for the site and relying on leads coming from the web. However, as I have worked on more and more sites over the years, it's become blatantly apparent to me that SEO comes in two forms from a development or sales point of view. There are the forms of SEO that are basically good web development practice and come about naturally from having a good site structure and making the site usable, readable and accessible. Then there are the forms that people try to bolt onto a site afterwards, either as an afterthought or because an SEO expert has charged lots of money and has some dubious link-sharing schemes that are believed to work.


Cover the SEO Basics when developing the site


It's a lot harder to just "add some Search Engine Optimization" once a site has been developed, especially if you are developing generic systems that have to work for numerous clients. I am not an SEO expert and I don't claim to be; otherwise I would be charging you lots of money for this advice and making promises that are impossible to keep. However, following these basic tips can only help your site's SEO.

Make sure all links have title attributes on them and contain worthwhile content rather than words like "click here". The content within the anchor tags matters when those bots come a-crawling in the dead of night.
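
If you want a quick way to spot the offenders, something like the following rough sketch would do. The function name and the list of vague phrases are my own assumptions, and a regular expression is no substitute for a proper HTML parser, so treat it as a sanity check only.

```javascript
// Assumed list of link texts that tell a crawler nothing about the target page.
const VAGUE_PHRASES = ["click here", "here", "read more", "more", "link"];

// Hypothetical helper: returns every anchor whose visible text is a vague phrase.
function findVagueLinks(html) {
  const anchorPattern = /<a\b[^>]*>([\s\S]*?)<\/a>/gi;
  const vague = [];
  let match;
  while ((match = anchorPattern.exec(html)) !== null) {
    // Strip nested tags and collapse whitespace to get the visible text.
    const text = match[1].replace(/<[^>]+>/g, "").replace(/\s+/g, " ").trim();
    if (VAGUE_PHRASES.includes(text.toLowerCase())) {
      vague.push(match[0]);
    }
  }
  return vague;
}
```

Run it over the HTML of a page and rewrite the text of anything it returns so that it describes where the link goes.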

You should also make sure all images have ALT attributes on them as well as titles and make sure the content of both differ. As far as I know Googlebot will rate ALT content higher than title content but it cannot hurt to have both.

Make sure you make use of header tags to mark out important sections of your site, and try to use descriptive wording rather than "Section 1" etc. Also, as I'm sure you have noticed if you have read my blog before, I wrap keywords and keyword-rich sentences in strong tags. As I understand it, Google ranks emphasised content or content marked as strong over normal content, so as well as helping readers who skim to pick out the important parts, it tells Google which words are important in my article.

Write decent content and don't just fill up your pages with visible or non-visible spammy keywords. In the old days keyword density mattered when ranking content: once all noise words and other guff had been removed, what percentage of the page content was relevant keywords? Nowadays the bots are a lot cleverer and will penalise content that does this, as it looks like spam. It's also good for your users to have readable content, and you shouldn't strip out the words between keywords, as it makes the page less readable and you will lose out on the longer 3, 4 and 5 word indexable search terms.
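
For anyone who never had to calculate it, the old keyword density measure can be sketched roughly as below. The tiny noise word list and the function name are assumptions of mine, and I am taking the percentage over the words left once the noise words are removed, which is only one reading of how it used to be done.

```javascript
// Assumed, deliberately tiny noise word list; real lists are much longer.
const NOISE_WORDS = new Set(["a", "an", "and", "the", "of", "in", "to", "for", "is", "on"]);

// Hypothetical helper: percentage of the remaining words that equal the keyword.
function keywordDensity(text, keyword) {
  const words = text.toLowerCase().match(/[a-z0-9']+/g) || [];
  const meaningful = words.filter((w) => !NOISE_WORDS.has(w));
  if (meaningful.length === 0) return 0;
  const hits = meaningful.filter((w) => w === keyword.toLowerCase()).length;
  return (hits / meaningful.length) * 100;
}
```

It is useful for spotting a page that has gone overboard, not as a target to aim for.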

That said, it's always good to remove filler from your pages, for example by putting your CSS and Javascript code into external files when possible and removing large commented-out sections of HTML.

You should also aim to put your most important content at the top of the page so it's the first thing crawled. Try moving main menus and other content that can be positioned by CSS to the bottom of the file.

The same thing goes for links. If you have important links in the footer, such as links to site indexes, then try getting them higher up the HTML source. I have seen Google recommend a maximum of 100 links per page. Therefore a homepage that has its most important links at the bottom of the HTML source with 200+ links above them, e.g. links to searches, can be harmful even if not all of those links are visible. If you are using a tabbed interface to switch between tabs of links then the links will still be in the source, and if they are loaded in by Javascript on demand then that's no good at all, as crawlers don't run Javascript.
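
A crude way to check a page against that figure is to count the links in the raw source, hidden tabs and all. This is only a sketch: the function names are made up, and the regular expression will miscount on unusual markup.

```javascript
const MAX_LINKS_PER_PAGE = 100; // the guideline figure mentioned above

// Counts opening <a> tags that carry an href, whether or not they are visible.
function countLinks(html) {
  const matches = html.match(/<a\b[^>]*\bhref\s*=/gi);
  return matches ? matches.length : 0;
}

// Hypothetical helper: true when the page source goes over the guideline.
function exceedsLinkGuideline(html) {
  return countLinks(html) > MAX_LINKS_PER_PAGE;
}
```

Feed it the source exactly as the server sends it, since that is what the crawler sees.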

Items such as ISAPI URL rewriting are very good for SEO, plus they give sites nicer URLs to display. Using a site I have just worked on as an example, http://www.sugarjobs.co.uk/companies/adeptra-ltd is a much nicer URL for viewing a particular company profile than the underlying real URL, which could also be accessed as http://www.sugarjobs.co.uk/jobboard/cands/compview.asp?c=49
If the page can be reached by both URLs and you don't want to be penalised for duplicate content, then you should tell the search engines which one you want indexed by specifying a canonical link.
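
The canonical link is just a link element with rel="canonical" that goes in the head of every version of the page. A minimal sketch of generating it, with helper names of my own invention, might look like this:

```javascript
// Escapes characters that would break out of an HTML attribute value.
function escapeAttribute(value) {
  return value
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: builds the tag pointing at the URL you want indexed.
function canonicalLinkTag(preferredUrl) {
  return '<link rel="canonical" href="' + escapeAttribute(preferredUrl) + '" />';
}
```

Calling it with the rewritten company URL above gives a tag that both the friendly URL and the compview.asp version of the page should output.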

META tags such as the keywords tag are not considered as important as they once were, and having good keyword-rich content in the main section of the page is the way to go rather than filling up that META with hundreds of keywords. The description will still be used to help describe your page on search results pages, but some people seem to think that having control over the keywords META is the ultimate in SEO, whereas in reality it's probably ignored by most crawlers nowadays.

Set up a Sitemap straight away. Even if you don't want to use the search engines' tracking tools you should still set up an XML sitemap listing your site's pages along with their relative priority, how often they change, last modified date etc. You submit it through Google's Webmaster Tools, Yahoo's Site Explorer or Microsoft's Bing (or whatever they are calling it by the time you read this!) and it lets you specify which links Googlebot and other crawlers should look at when they come crawling. As well as registering your sitemap with the various webmaster tools accounts that the major search engines offer, you can specify a link to your sitemap in your robots.txt file. This will allow other crawlers to find the file and therefore access your important content. For example, in my own robots.txt file you can see I have added a link to my sitemap:
Sitemap: http://www.strictly-software.com/sitemap_110908.xml
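
To show how little work it is for a crawler to pick that line up, here is a rough sketch of reading the Sitemap entries out of a robots.txt file. The function name is made up and real crawlers will be more forgiving about formatting than this.

```javascript
// Hypothetical helper: returns every URL given on a "Sitemap:" line.
function findSitemaps(robotsTxt) {
  return robotsTxt
    .split(/\r?\n/)
    .map((line) => line.replace(/#.*$/, "").trim()) // drop comments
    .filter((line) => /^sitemap\s*:/i.test(line))
    .map((line) => line.slice(line.indexOf(":") + 1).trim());
}
```

A file can list more than one sitemap, which is why the function returns an array.
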
Use tools such as the wonderful SEOBook Toolbar, a Firefox add-on that combines numerous other free online SEO tools into one helpful toolbar. It lets you see your Page Ranking and compare your site on various keywords across the major search engines.

Use a text browser such as Lynx to see how your site would look to a crawler such as Yahoo's or Google's.


The Other form of SEO, Black Magic Optimization

The other form of Search engine optimization is what I would call "black magic SEO" and it comes in the form of SEO specialists that will charge you lots of money and make impossible claims about number one rankings in each search engine in the world etc etc.

The problem with SEO is that no-one knows exactly how Google and the others calculate their rankings so no-one can promise anything regarding search engine positioning.

There is Google's PageRank (PR), which is used alongside other forms of analysis. It basically means that if a site with a high PR links to your site and you do not link back to it, it tells Google that your site has higher authority than the linking site. If your site only links out to other sites but doesn't have any links coming in from relevant sites with a high PR then you are unlikely to get a high PR yourself. This is just one of the signals that Google uses to determine how high to place you in the rankings when a search is carried out.
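
To give a feel for why inbound links matter, here is the textbook simplified version of PageRank as a sketch. It is emphatically not what Google actually runs, as nobody outside Google knows that. The function name, the 0.85 damping factor default and the shape of the links object are all assumptions for illustration.

```javascript
// Textbook simplified PageRank; NOT Google's real, unpublished ranking formula.
// `links` maps each page to the pages it links out to (hypothetical data shape).
function pageRank(links, damping = 0.85, iterations = 50) {
  const pages = Object.keys(links);
  const n = pages.length;
  let ranks = {};
  pages.forEach((p) => { ranks[p] = 1 / n; });

  for (let i = 0; i < iterations; i++) {
    const next = {};
    pages.forEach((p) => { next[p] = (1 - damping) / n; });
    pages.forEach((p) => {
      const targets = links[p].filter((t) => t in ranks);
      if (targets.length === 0) {
        // A page with no outbound links shares its rank with every page.
        pages.forEach((q) => { next[q] += (damping * ranks[p]) / n; });
      } else {
        targets.forEach((t) => { next[t] += (damping * ranks[p]) / targets.length; });
      }
    });
    ranks = next;
  }
  return ranks;
}
```

If you pass it a small made-up graph such as { hub: ["you"], blog: ["you"], you: ["hub"] }, the page with two links coming in ends up with the highest score and the page with none coming in ends up with the lowest.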

Having lots of links coming in from sites that have nothing whatsoever to do with your site may help drive traffic but will probably not help your PR. Therefore engaging in all these link exchange systems is probably worth jack, as unless the content that links to your site is relevant or related in some way it's just seen as a link for a link's sake, i.e. spam.

Some "SEO specialists" promote special schemes that have automated 3-way linking between sites enrolled on the scheme. They know that just having two unrelated sites link to each other basically negates the PR, so they try to hide this by having your site A link to site B, which in turn links to site C, which then links back to you. The problem is obviously getting relevant sites linking to you rather than every Tom, Dick and Harry.

Also, advertising on other sites purely to get indexed links from that site to yours to increase PR may not work, because most of the large advert management systems output banner adverts using Javascript. Although the advert will appear on the site and drive traffic when people click it, you will not get the benefit of an indexed link, because when the crawlers come to index the page containing the advert the banner image and any link to your site won't be there.

Anyone who claims that they can get you to the top spot in Google is someone to avoid. The fact is that Google and the others are constantly changing the way they rank and what they penalise for, so something dubious that works currently could actually harm you down the line. For example, in the old days people would put hidden links on white backgrounds or position them out of sight so that the crawlers would hit them but the users wouldn't see them, which worked for a while until Google and the others cracked down and penalised for it.

Putting up any form of content specifically for a crawler is seen as dubious and will be penalised. They want to crawl the content that a normal user would see, and they have actually been known to mask their own identity (IP and User-Agent) when crawling your site so that they can check whether this is the case or not.

My advice would be to stick to the basics, don't pay anybody who makes any kind of promise about result ranking and avoid like the plague any scheme that is "unbeatable" and promises unrivalled PR within only a month or two.


Sunday, 10 May 2009

What is the point of client side security

Is hacking the DOM really hacking?

The nature of the browser is that it's a client side tool. Pages stored on web servers are downloaded and viewed in your browser, which stores all the files, images and scripts locally on your PC as temporary files. Therefore putting any kind of security on the client side is pointless, as anyone with a small working knowledge of Internet technology can bypass it. I don't want to link to the site in particular, but one appeared as a Google advert on my site the other day claiming to protect your whole website from theft, including your HTML source code. If you have a spare 30 minutes on your hands, have Firebug installed and do a search for code to protect HTML, you will be able to bypass the majority of the wonderful security claims with ease.

Examples of such attempts to use client side code to protect code or content include:

Trying to protect the HTML source code from being viewed or stolen. This includes the original right mouse click event blocker, used in the old days in the vain hope that people didn't realise they could just use the View Source option in the browser's menu bar instead of the context menu. The other option was just to save the whole web page from the File menu. Nowadays you can simply view the whole generated source in most developer tools, e.g. Firebug.

Some sites also generate their whole HTML source with Javascript in the first place, packing, encoding and obfuscating it on the way. The code is then run through a function to evaluate it and write it to the DOM. Shame that this can all be viewed without much effort in the Script tab of Firebug. Many tools also let you run your own scripts on any page, e.g. someone at work the other day didn't like the way news sites like the BBC always showed large monetary numbers as £10BN, so he added a regular expression into one of these tools to automatically change all occurrences to £10,000,000,000, as he thought the number looked bigger and more correct :) Stupid example I know, but anyhow.

Using special classes to prevent users from selecting content. This is commonly used on music lyric sites to prevent people copying and pasting the lyrics straight off. Shame Firebug lets you modify the DOM on the fly! Just find the class in question with the inspect tool, blank it out and there you go.

Multimedia sites that show content from TV shows that will remain unnamed but only allow users from the USA to view them. Using a proxy sometimes works, but for those Flash-loaded videos that don't play through a proxy you can use YSlow to find the base URI that the movie is loaded from and just load that up directly. To be honest I think these companies have got wise to the fact that people will try this, as they now insert location-specific adverts into the movies, which they never used to do. However it's still better than moving to the States!

Sites that pack and obfuscate their Javascript in the hope of preventing users from stealing their code. Obviously minification is good practice for reducing file size, but if you want to unpack some JS then you have a couple of options, and there may be valid reasons for doing so other than just wanting to see the code being run, e.g. preventing XSS attacks.

Option 1 is to use my script unpacker form, which lets you paste the packed code into a textarea, hit a button and then hey presto, you get the unpacked version in another textarea for you to use. It will also decode any encoded characters.

If you don't want to use my wonderful form (and I have no idea why you wouldn't) then Firefox comes to the rescue again. Copy the packed code, open the Javascript error console and paste the code into the input box at the top with the following added to the start of it:

// add eval=alert; to the beginning of the packed code
eval=alert;eval(function(p,a,c,k,e,r){e=String;if(!''.replace(/^/,String)){while(c--)r[c]=k[c]||c;k=[function(e){return r[e]}];e=function(){return'\\w+'};c=1};while(c--)if(k[c])p=p.replace(new RegExp('\\b'+e(c)+'\\b','g'),k[c]);return p}('3(0,1){4(0===1){2("5 6")}7{2("8 9")}}',10,10,'myvar1|myvar2|alert|function|if|well|done|else|not|bad'.split('|'),0,{}))

// unpacked returns
function(myvar1,myvar2){if(myvar1===myvar2){alert("well done")}else{alert("not bad")}}




Then hit Evaluate and the unpacked code will open in an alert box which you can then copy from. What the code is doing is basically changing the meaning of the function eval to alert, so that when the packed code runs within its eval statement, instead of executing the evaluated code it shows it in the alert.
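
If you would rather have the unpacked code as a string than in an alert box, the same idea can be sketched as a small function. The function name is made up. It relies on the built-in Function constructor to run the packed source with a parameter called eval that shadows the real one.

```javascript
// Hypothetical helper: runs packed code with `eval` swapped for a function
// that records its argument instead of executing it.
function captureEvalArgument(packedSource) {
  let captured = null;
  // Inside the generated function, `eval` is just a parameter name that
  // shadows the real eval, so the packed code hands us its unpacked string.
  const run = new Function("eval", packedSource);
  run(function (code) { captured = code; });
  return captured;
}
```

Bear in mind the packer's own decoding function still runs, exactly as it does with the console trick, so only use it on code you are happy to execute that far. If the packed source never calls eval, the function returns null.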

There are many more techniques which I won't go into, but the question then is why do people do it?

Well, the main reason is that people spend a lot of time creating websites and they don't want some clever script kiddy or professional site ripper to come along, steal their content and use it without permission.

People will also include whole sites within frames on their own sites nowadays, or just rip the whole thing, CSS, images, scripts and all, with the click of a button using one of too many available tools to count. I have personally seen 2 sites now, which I either worked on or know the person who did, appear on the net under a different URL with the same design, images, JS code, the lot, except that the wording was in Chinese.

The problem is that with every major browser now having developer tool sets like Firebug, the IE8 Developer Toolbar and Opera's Dragonfly, and with Firebug Lite available for those without built-in tools, it seems pretty pointless trying to put any client side security on your sites at all. Even if you don't want to be malicious and steal or inject anything, you can still modify the DOM, run your own Javascript, change the CSS and remove x, y and z.

All security measures related to user input should be handled on the server to prevent SQL injection and XSS hacks, but that's not to say that duplicating validation checks on the client isn't a good idea. For one thing it saves time if you can inform a user that they have entered something incorrectly before the page is submitted. No one likes to fill in a long form, submit it and wait whilst the slow network connection and bogged down server take too long to respond, only to be shown another page that says one of the following:
  • That user name is already in use please choose another one
  • Your email confirmation does not match
  • Your password is too short
  • You did not complete blah or blah
Things like this should be done client side if possible, using Ajax for checks that need database look ups such as user name availability tests.
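
As a rough sketch of the checks that don't need a database look up, something like the following could run before the form is submitted. The field names, the minimum password length and the function name are all assumptions, and every one of these checks still has to be repeated on the server.

```javascript
// Hypothetical minimum length; use whatever the server enforces.
const MIN_PASSWORD_LENGTH = 8;

// Hypothetical helper: returns a list of messages, empty when the form looks fine.
function validateSignup(form) {
  const errors = [];
  ["username", "email", "emailConfirm", "password"].forEach((field) => {
    if (!form[field] || form[field].trim() === "") {
      errors.push("You did not complete " + field);
    }
  });
  if (form.email && form.emailConfirm && form.email !== form.emailConfirm) {
    errors.push("Your email confirmation does not match");
  }
  if (form.password && form.password.length < MIN_PASSWORD_LENGTH) {
    errors.push("Your password is too short");
  }
  return errors;
}
```

The user name availability check is left out on purpose, as that one needs the Ajax call to the server.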

However, client side code that is purely there to prevent content from being accessed without consent seems pointless in the age of Firebug, the IE Developer Toolbar, Dragonfly and all the other cool add-ons. Obviously there is a large percentage of web users out there who wouldn't know the first thing about bypassing client side security code, and the blocking of the context menu would seem like black magic to them. Unfortunately for those wanting to protect their client side code, the people who do want to steal the content will have the skills to bypass all your client side cleverness. It may impress your boss and seem worth the $50 for about 10 minutes, until someone shows you how you can add your own Javascript to a page to override any functions already there for blocking events and checking for iframe positioning.

My only question would be: is it really hacking to modify the DOM to access or bypass certain features meant to protect the content on that page? I don't know what other people think about this, but I would say no, it's not. The HTML, images, script and CSS are on my computer at the point of me viewing them in whatever browser I am using. Unless I am trying to change or inject anything onto the web or database server to affect future site visitors, or trying to bypass challenge responses, I am not hacking the page.

I'd be interested to know what others think about that question.


Friday, 1 May 2009

System Tables - sys.processes

Analyse current processes with sysprocesses

The following SQL is based on an article about the sys.sysprocesses system view that I read on SQLServerCentral today, and is another good example of using SQL system views and the new DMVs (Dynamic Management Views).

I have combined some of the example code from the article into a helpful query for analysing your current processes to find long running queries that may be causing issues with your system. Read the comments within the code for more details.

-- Using the sys.sysprocesses system view to find current process details

DECLARE @oldStats TABLE( os_thread_id int, kernel_time bigint, usermode_time bigint)

/* Insert current threads

The KPID is useful in that it helps us tie up what has been passed to the operating system to run commands and is actually working.
Although the SPID is constant throughout the life of the connection a KPID is allocated to each task that needs to be carried out.

The KPID maps back to an actual windows thread and so it is possible using performance monitor to get actual physical statistics
about a task instead of the purely logical statistics which SQL shows through the CPU column.

The KPID is the actual o/s thread id and you can use the "Thread" performance counter using "ID Thread" and "% Processor Time" to
match the thread to the actual cpu stats.
*/
INSERT INTO @oldStats
SELECT os_thread_id, kernel_time, usermode_time
FROM sys.dm_os_threads
WHERE os_thread_id IN (SELECT KPID
FROM sys.sysprocesses
WHERE kpid <> 0
AND spid>50)

-- wait for 2 seconds
WAITFOR DELAY '0:0:2'

/* Compare previous data to our current processes to see which tasks are consuming
the most CPU.

If records appear with a KPID of 0 and Physical Time of NULL then it means the
O/S thread is no longer active.

Investigate processes that have high physical times, high CPU, long wait times, and blocked
*/
SELECT sp.KPID, sp.SPID, sp.CPU AS LogicalCPU
,(new.kernel_time + new.usermode_time) - (old.kernel_time + old.usermode_time) AS PhysicalTime
,waittime,lastwaittype,blocked
-- stmt_start and stmt_end are byte offsets into Unicode text, hence the divide by 2; SUBSTRING positions start at 1
,blockingSQL = CASE WHEN blocked > 0 AND blocked <> sp.SPID THEN (SELECT SUBSTRING((SELECT TEXT FROM fn_get_sql(sql_handle)), (stmt_start/2) + 1,
CASE stmt_end
WHEN -1 THEN LEN(CONVERT(VARCHAR(8000), (SELECT TEXT FROM fn_get_sql(sql_handle)))) - (stmt_start/2)
WHEN 0 THEN LEN(CONVERT(VARCHAR(8000), (SELECT TEXT FROM fn_get_sql(sql_handle))))
ELSE ((stmt_end - stmt_start)/2) + 1
END
) FROM sys.sysprocesses WHERE SPID = sp.blocked) ELSE NULL END
,last_batch,open_tran,sp.status,loginame,hostname,cmd
,(SELECT SUBSTRING((SELECT TEXT FROM fn_get_sql(sql_handle)), (stmt_start/2) + 1,
CASE stmt_end
WHEN -1 THEN LEN(CONVERT(VARCHAR(8000), (SELECT TEXT FROM fn_get_sql(sql_handle)))) - (stmt_start/2)
WHEN 0 THEN LEN(CONVERT(VARCHAR(8000), (SELECT TEXT FROM fn_get_sql(sql_handle))))
ELSE ((stmt_end - stmt_start)/2) + 1
END
) FROM sys.sysprocesses WHERE SPID = sp.SPID) as TSQL
FROM sys.sysprocesses SP
LEFT OUTER JOIN
@oldStats old
ON SP.kpid = old.os_thread_id
LEFT OUTER JOIN
sys.dm_os_threads new
ON sp.kpid = new.os_thread_id
ORDER BY PhysicalTime DESC



Store and Analyse Blocked Processes


To view your blocked processes in more detail, set up either a loop with a WAITFOR DELAY or a SQL Server Agent job that runs once a minute to log the output of the following SQL into a table. The SQL makes use of a recursive CTE to link together all the processes affected by a blocking action, which is useful for seeing both the action that caused the blocking and all the processes affected by it. You can view all databases on the server or filter by a particular database name or partial name.

DECLARE @DatabaseName nvarchar(255) --leave null to use current DB OR 'ALL' For all DBS
DECLARE @PROCESSES TABLE(SPID int, blockingSPID int, databaseName nvarchar(255), programName nvarchar(500), loginName nvarchar(255), ObjectName nvarchar(max), Definition nvarchar(max))
INSERT INTO @PROCESSES
SELECT s.spid, BlockingSPID = s.blocked, DatabaseName = DB_NAME(s.dbid),
s.program_name, s.loginame, ObjectName = OBJECT_NAME(objectid,s.dbid),
Definition = CAST(text AS VARCHAR(MAX))
FROM sys.sysprocesses s
CROSS APPLY
sys.dm_exec_sql_text (sql_handle)
WHERE s.spid > 50 AND
1 = CASE
WHEN @DatabaseName IS NULL AND s.dbid = db_id() THEN 1
WHEN @DatabaseName = 'ALL' THEN 1
WHEN COALESCE(@DatabaseName,'')<>'' AND DB_NAME(s.dbid) LIKE @DatabaseName + '%' THEN 1
END


;WITH Blocking(SPID, BlockingSPID, DatabaseName, BlockingStatement, RowNo, LevelRow)
AS
(
SELECT s.SPID, s.BlockingSPID, s.DatabaseName, s.Definition,
ROW_NUMBER() OVER(ORDER BY s.SPID),
0 AS LevelRow
FROM @PROCESSES s
JOIN @PROCESSES s1 ON s.SPID = s1.BlockingSPID
WHERE s.BlockingSPID = 0
UNION ALL
SELECT r.SPID, r.BlockingSPID, r.DatabaseName, r.Definition,
d.RowNo,
d.LevelRow + 1
FROM @PROCESSES r
JOIN Blocking d ON r.BlockingSPID = d.SPID
WHERE r.BlockingSPID > 0
)
SELECT * FROM Blocking
ORDER BY RowNo, LevelRow
