Sunday 8 February 2009

CAPTCHAS

CAPTCHAs, don't you just love completing them?

Everybody hates filling in CAPTCHAs, and even the most complex ones can be beaten. Spammers either use bots that apply OCR (optical character recognition) to take apart the image and work out the letters used or, for those CAPTCHAs that cannot be beaten that way, they link directly to the CAPTCHA on the site using it and offer free porn to users who complete it for them.

Obviously, for a spammer to go to that amount of effort to beat the CAPTCHA there has to be something worthwhile at the other end, like a free email account from which to send out spammy emails. The other major reason to use automated CAPTCHA breakers is to insert comment spam, mainly for links back to nefarious or malware-infected sites. So if all CAPTCHAs can be broken, either by bots or by humans doing the work for bots, is there any point in using them? Well, yes, I would say so: unless you are running a site that offers something worthwhile like email accounts, the chances are that even a simple CAPTCHA system will cut out a large percentage of spam requests.

However, a standard image-based CAPTCHA is not the only means of telling humans from bots, so here are some others.

Simple Robot Identification Tests

Use JavaScript to identify humans

The idea is to only allow humans, and not bots, to submit the form, so you could go down the simple route of using JavaScript to submit the form, as 99.9% of bots do not run script. However, you also have the problem that roughly 10% of your human traffic doesn't either. You could place a message at the bottom of the form asking the user to enable JavaScript to submit it, with the message itself hidden by JS so that users who have it enabled don't see it.


Identify robots using IP and User-Agent

For those bots you can positively identify as crawlers by IP and User-Agent, you can obviously prevent them from submitting the form. However, most spammers will spoof the agent and go through proxies and other cloaking mechanisms. Identifying a bot as a bot 100% of the time is the holy grail webmasters are seeking, and if there were a way of doing this, CAPTCHAs wouldn't be needed in the first place.
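As a rough sketch of the IP and User-Agent check, assuming you keep your own lists of crawler agent strings and network ranges (the values and the function name `is_known_crawler` below are placeholders made up for illustration, not real crawler data):

```python
import ipaddress

# Placeholder lists for illustration only: substitute the agents and
# network ranges you have actually identified as crawlers.
KNOWN_CRAWLER_AGENTS = ("googlebot", "msnbot")
KNOWN_CRAWLER_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]  # documentation range


def is_known_crawler(ip: str, user_agent: str) -> bool:
    """Return True only when both the IP and the User-Agent match the lists."""
    agent = user_agent.lower()
    agent_match = any(token in agent for token in KNOWN_CRAWLER_AGENTS)
    try:
        address = ipaddress.ip_address(ip)
    except ValueError:
        # An unparseable address cannot be positively identified.
        return False
    ip_match = any(address in network for network in KNOWN_CRAWLER_NETWORKS)
    return agent_match and ip_match
```

Requiring both signals keeps this to crawlers you can positively identify. A spoofed agent coming through a proxy will simply fail to match, which is why this check cannot replace the other techniques.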


Using CAPTCHAs to help digitise books for online use

This is a neat idea where you know that those 10 or more seconds spent deciphering the image have not been a total waste of your time. The reCAPTCHA is based on a small portion of text from a book. The image is a section from a scanned version of a page, and your answers are compared with other people's responses to validate the likelihood of a sentence being correct. This is a good idea because, even if you got one word in a sentence wrong, you could still pass the test if the rest of the words were correct: the system checks the answers you gave for new words that have not yet been validated alongside words it already knows the answer for.


Using hidden input fields to trick robots

This idea involves adding hidden input fields to your form which you want the robot to complete but not the human. When the form is submitted you check whether a value has been added to this field and if so you can block the request.

You can use either type="hidden" to hide the input or, preferably, CSS with a class name that maps to a style such as display:none. A bot can easily detect a type="hidden" input and ignore it, in the same way it can easily read inline styles to work out that a field is hidden. With a class name, however, it would have to read in the stylesheet to find out whether the class relates to a hidden style or not, which is obviously more effort but not impossible.

Also, the aim is to trick the bots into filling it out without tricking form auto-complete systems, such as Google's toolbar, into doing the same. I have found problems with older versions of the toolbar: when you give the input a name such as "EmailConfirm2" it would complete it, because the name contains a word used within the autofill profile. You could give the field a totally random name, but then a clever bot would ignore it, knowing it was a trick.

You can give it a name that relates to other visible form elements but prefix it or modify it slightly. Also make sure you place the field outside the flow of the other visible controls, as I have found that Google's latest toolbar will complete inputs hidden with CSS if they are placed between other visible elements, e.g. between Name and Email or between Address1 and Address2. Therefore this method is not totally reliable.
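The server-side half of the hidden field trick is tiny. Here is a minimal sketch, assuming the submitted form arrives as a dictionary of strings; the field name `contact_email_alt` and the function `looks_like_bot` are hypothetical names picked for illustration:

```python
# Hypothetical trap field: named like a real field but slightly modified,
# and hidden from humans with a CSS class in the page itself.
HONEYPOT_FIELD = "contact_email_alt"


def looks_like_bot(form: dict[str, str]) -> bool:
    """Return True when the hidden trap field came back with a value."""
    return form.get(HONEYPOT_FIELD, "").strip() != ""
```

A human never sees the field, so it should come back empty or missing. Because of the toolbar autofill problem, a filled trap field is probably best treated as a strong hint rather than absolute proof.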


Check submitted values for similarity

Also, a lot of spammers will submit the same value for all form fields. For example, on a simple registration form of Name, Email, Confirm Email, Password and Confirm Password, the spambot will enter an email address for all 5 controls. Unless you set up validation rules to ensure that email addresses are not used for passwords or names, the form would be accepted. What you could do is check whether the same value has been used for Name, Email and Password and block the user. Bear in mind that only a percentage of bots do this, and that I myself, when testing a new site, often supply the same email address for all parts of a registration form to quickly get on the system.
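A quick sketch of that similarity check, assuming the form is a dictionary and that the fields are called `name`, `email` and `password` (both the field names and the function name are made up for illustration):

```python
def has_repeated_value(form: dict[str, str],
                       fields: tuple[str, ...] = ("name", "email", "password")) -> bool:
    """Return True when every listed field holds the same non-empty value."""
    # Normalise for comparison only; the stored values are left untouched.
    values = [form.get(field, "").strip().lower() for field in fields]
    return values[0] != "" and len(set(values)) == 1
```

For example, `has_repeated_value({"name": "a@b.com", "email": "a@b.com", "password": "a@b.com"})` returns True. Empty fields are left to your normal required-field validation rather than being flagged here.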


Question and multi-part CAPTCHAS

This type, which is not as popular as the common or garden CAPTCHA, is where the user is asked a question about the image. You may have four numbered animals on the image and the question would be "Which one makes the noise mooo", and you would have to pick the image of the cow. Or the question may be "What colour is the sky", which you may answer "well, it's England in January, so it's grey" and then find yourself blocked. The problem is coming up with enough questions that can only be interpreted in one way, as you are basing your CAPTCHA on a subjective question that you hope everyone will answer the same way.

Another form of this CAPTCHA, which I have started using myself, is the combination CAPTCHA, where the distorted image holds a series of numbers. The user is then presented with a sum-based question built on those numbers, for example "Subtract the second number from the first and then multiply it by the third number". The sum and image are generated server side, with the answer held in a database against a key. Only the key and the question are passed to the client, and the user has to answer the question within a set time period and within a set number of attempts to pass. Not only does this method not use JavaScript, so it's available to all users, it also has the effect of weeding out users who cannot do simple math.
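To make the flow concrete, here is a minimal sketch of the server-side bookkeeping. It is not my production code: the in-memory dictionary stands in for the database table, the limits of 120 seconds and 3 attempts are assumed values, the function names are made up for illustration, and drawing the distorted image from the returned numbers is left out entirely.

```python
import random
import secrets
import time
from typing import Optional

MAX_AGE_SECONDS = 120  # assumed time limit
MAX_ATTEMPTS = 3       # assumed attempt limit

# Stand-in for the database table that maps a key to its answer.
_challenges: dict[str, dict] = {}


def create_challenge(now: Optional[float] = None) -> tuple[str, str, list[int]]:
    """Create a challenge and return (key, question, numbers).

    The numbers are for the server-side image renderer only; the client
    receives just the key, the question and the rendered image.
    """
    numbers = [random.randint(1, 9) for _ in range(3)]
    answer = (numbers[0] - numbers[1]) * numbers[2]
    key = secrets.token_urlsafe(16)
    _challenges[key] = {
        "answer": answer,
        "created": time.time() if now is None else now,
        "attempts": 0,
    }
    question = ("Subtract the second number from the first "
                "and then multiply it by the third number")
    return key, question, numbers


def check_answer(key: str, submitted: str, now: Optional[float] = None) -> bool:
    """Return True only for a correct, in-time answer within the attempt limit."""
    record = _challenges.get(key)
    if record is None:
        return False
    current = time.time() if now is None else now
    record["attempts"] += 1
    expired = current - record["created"] > MAX_AGE_SECONDS
    if expired or record["attempts"] > MAX_ATTEMPTS:
        _challenges.pop(key, None)  # the challenge is dead, discard it
        return False
    try:
        correct = int(submitted.strip()) == record["answer"]
    except ValueError:
        return False  # not a whole number, counts as a failed attempt
    if correct:
        del _challenges[key]  # each challenge can only be passed once
    return correct
```

Keeping the answer on the server means a bot gains nothing from reading the page source, since the key on its own says nothing about the sum.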

A variation on this math-based system, which I only read about tonight and so was pretty interested to see other people using, is a similar solution where the question is only shown to the user if JavaScript is disabled. If it's enabled, JavaScript is used to solve the CAPTCHA behind the scenes using some encrypted keys. This way the user is spared the agony of remembering their junior school times tables :). This solution is available as a plug-in, so it can be used by those not wanting to write their own.


So which one do I use?

The problem with the IT world is that it's full of people who like a challenge and want to prove they can do anything. Therefore you will always have developers who will spend time writing clever bots designed to beat any form of CAPTCHA. So, as with all security methods, the best approach is a layered one that makes use of multiple techniques. The idea is that the more hoops there are to jump through correctly, the more likely you are to trip up some of those devious spammers and hackers.
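One way to picture the layered approach is as a list of independent checks that every submission has to get past. This is only a sketch: the two layers shown and the field names they read (`contact_email_alt`, `name`, `email`, `password`) are hypothetical examples, and you would plug in whichever tests you actually use.

```python
from typing import Callable, Optional

Form = dict[str, str]
Check = Callable[[Form], bool]  # returns True when the submission looks suspicious


def first_failed_layer(form: Form, layers: list[tuple[str, Check]]) -> Optional[str]:
    """Run each layer in order and return the name of the first one that trips."""
    for name, looks_suspicious in layers:
        if looks_suspicious(form):
            return name
    return None  # every hoop was jumped through correctly


# Two hypothetical layers: a hidden trap field and a same-value test.
LAYERS: list[tuple[str, Check]] = [
    ("hidden_field", lambda form: form.get("contact_email_alt", "") != ""),
    ("same_value", lambda form: form.get("name", "") != ""
        and form.get("name") == form.get("email") == form.get("password")),
]
```

Returning the name of the layer that tripped, rather than a plain yes or no, lets you log which hoops are catching the most spam.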

If you can make your CAPTCHA solution slightly different from all the others then you also have a good chance of it not being beaten. Unless you are offering a golden honeypot on the other side of the submitted form, i.e. free email, there is no money in defeating it. If you are just a regular site then you want to stop the majority of spammers without making the form too much trouble for your users to complete. Remember that a lot of spammers are human as well, so you will never stop all spam.

2 comments:

Dark Politricks said...

A Spam comment if I ever saw a crawler scan the web looking for keywords then forming an email around it - 100% - It came from

Marks
1. The page in question does talk about CAPTCHAS but doesn't actually include one itself - a sure sign of a BOT -> SEARCH FOR KEYWORD "CAPTCHA" -> SEND EMAIL OUT HOPING SOMEONE FALLS FOR IT OR A % DO SHOW THAT WORD ON THE PAGE.
2. I don't use my own CAPTCHAS on this blog. Why would I need to when Google have their own? If they break, you can be sure their farm of chained-up techies kept in an underground bunker near Langley would be on to it ASAP.

To improve your code I would get your scanner BOT that returns sites that talk about CAPTCHAS to actually look for the HTML / JS that is used in RE-CAPTCHA so it's not so hit and miss. This can be done with a simple regular expression. ( ASK, PAY, RECEIVE -> :) )

Just so you can see the scam email I was sent....

"Hi Rob,

I was on this page http://blog.strictly-software.com/2009/02/captchas.html and I noticed your link to Recaptcha.net isn’t working. It looks like after ReCAPTCHA got integrated into Google’s main site they must have let the domain 404. You might want to update the link to https://www.google.com/recaptcha/intro/ if you can.

We have a guide to 10 CAPTCHA options that you might also be interested in - [REMOVED - AS YOU WERE SO BLATANT BUT 7/10 FOR TRYING] >> https://www.strictly-software.com. I know that many find Google’s Recaptcha to be a little annoying. Perhaps you could add a link to our list as well when you update your page?

I hope this helps. Please let me know if you have any questions.

Thanks,
Shannon"

Obviously I removed their link so no link juice is passed back to them.
