Monday 24 November 2008

Trying to detect spoofed user-agents

User Agent Spoofing

A lot of traffic comes from browsers that mask their real identity, either by using a user agent string belonging to a different browser or by using a random string that relates to no known browser at all. The reasons for doing this are manifold: malicious users may be trying to hide their identity to get round code that bans on user-agent, or to get round client side code that blocks certain browsers from using certain functionality. All of which is a good reason for using object detection rather than browser sniffing when deciding which code branch to run.

However, as you will probably know if you have tried anything beyond simple JavaScript coding, there are still times when you need to know the browser because object detection just isn't feasible, and trying to use object detection to work out the browser is, in my opinion, just as bad as using browser sniffing to work out which objects are supported.

Therefore when someone is using an agent switcher to mask the browser's real agent and you hit one of these moments, you may end up running code that raises errors. There is no foolproof way to spot whether an agent is spoofed, but if you do require this information one of the things you can do is compare the agent string with the objects that browser is known to support; if they don't match then you can confirm the spoof.

This form of spoof detection will only work if it's only the user agent string that has been changed, but some example checks include: for agents that say they are Opera, make sure window.opera exists and that both event models (document.addEventListener && document.attachEvent) are supported, as far as I know Opera is the only browser that supports both. For IE you shouldn't check document.all by itself, as you will actually find Firefox returns true for this, but you can check for window.ActiveXObject, the non-existence of addEventListener, and use JScript conditional compilation comments to test for JScript. Firefox should obviously not support JScript as it uses JavaScript.
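
As a rough illustration only, a minimal sketch of this kind of cross-check might look like the following. The function name isSpoofed is just for illustration, and it assumes the behaviour described above (old IE without addEventListener, Presto-era Opera supporting both event models), which newer browser versions may break:

    // Rough sketch - compare what the agent string claims against
    // objects that browser is known (not) to support.
    function isSpoofed() {
        var ua = navigator.userAgent.toLowerCase();

        // JScript conditional compilation: true in IE's JScript engine, false elsewhere
        var hasJScript = /*@cc_on!@*/false;

        if (ua.indexOf("opera") > -1) {
            // Real Opera exposes window.opera and supports both event models
            return !(window.opera && document.addEventListener && document.attachEvent);
        }
        if (ua.indexOf("msie") > -1) {
            // Real (pre IE 9) IE has ActiveX and JScript but no addEventListener
            return !(window.ActiveXObject && hasJScript && !document.addEventListener);
        }
        if (ua.indexOf("firefox") > -1) {
            // Real Firefox should have neither JScript nor ActiveX
            return hasJScript || !!window.ActiveXObject;
        }
        return false; // unknown agent - cannot tell
    }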

Those are just a few checks you could do; you are basically using object detection and agent sniffing together to make sure they match. They may not tell you the real browser being masked, but they can be used to tell you what it's not.

The idea of this is to make sure that in those cases where you have to branch on browser rather than object (see this previous article) you make the right choice and don't cause errors. Obviously you may decide that if the user is going to spoof the agent then they can be left to suffer any errors that come their way.

If you do require a lightweight browser detector that checks for user agent spoofing amongst the main browsers as well as support for advanced CSS, Flash and other properties then see this article.

7 comments:

Anonymous said...

So it will not work if the user "alters" all parts of the useragent string?
Would you be able to detect a user switching user agents with, for instance, the user agent switcher plug-in for FF using a user agent string like this:
Description: Iphone 3.0
User Agent: Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_0 like Mac OS X; en-us) AppleWebKit/528.18 (KHTML, like Gecko) Version/4.0 Mobile/7A341 Safari/528.16
App version: 5.0 (iPhone; U; CPU iPhone OS 3_0 like Mac OS X; en-us) AppleWebKit/528.18 (KHTML, like Gecko) Version/4.0 Mobile/7A341 Safari/528.16
Platform: Iphone
Vendor: Apple Computer, Inc.

Thanks

Rob Reid said...

The aim is not to rely on the useragent string alone but to try and detect which objects the browser (that the useragent string says, or pretends, it is) should OR should NOT support.

E.g. IE supports window.attachEvent whereas Firefox does not; it supports window.addEventListener.

Opera supports both.

So does IE 9.

But Opera has the window.opera object and IE 9 has a document.documentMode value of 9.
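
As a rough sketch of that last point (assuming IE 9 and Presto-era Opera behaviour as described above), telling the two apart when both event models are present might look like this:

    // Both IE 9 and Opera support addEventListener and attachEvent,
    // so use browser-specific objects to tell them apart.
    if (document.addEventListener && document.attachEvent) {
        if (window.opera) {
            // really Opera
        } else if (document.documentMode === 9) {
            // really IE 9 running in IE 9 document mode
        } else {
            // agent claims one of them but matches neither - possible spoof
        }
    }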

Firefox uses Javascript not JScript whereas IE uses JScript.

IE supports ActiveX but others don't.

As new versions of browser come out things change and it makes accurate detection harder to accomplish.

If a user can in any way "inject" JavaScript into your page (which of course they shouldn't be able to) they could "create" the necessary objects to go along with the browser they are spoofing, as JS lets you overwrite objects.

But it is probably impossible to come up with a totally foolproof solution for ALL cases. You might be able to detect certain spoofs, but if the user turns JavaScript off then you're stuck, as all you have is the useragent to go on.

I did write this article a long time ago, and now that IE9 supports standard JS objects like the DOM 2 event model (addEventListener) it becomes a lot harder. In your example of wanting to detect whether the user is really WebKit/iPhone you would need to be looking for objects that only that browser supports and others don't.

One way, if you cannot find any such objects (and I cannot think of any off the top of my head), is to use certain CSS styles to find out if the useragent is "real" or not.

For example with WebKit browsers (Chrome/Safari/iPhone) you could add a DIV into the DOM (using JS) and then try applying WebKit styles to the DIV, e.g. anything with -webkit in front of it such as -webkit-border-radius, and then use JS to detect the current style of that DIV to see if that style has actually been applied or not.
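
A minimal sketch of a simplified variant of that idea: instead of reading back a computed style, just check whether the style object even exposes the camel-cased -webkit- property (the looksLikeWebkit function name is illustrative, and it assumes that at the time only WebKit engines recognised -webkit- prefixed properties):

    // If the engine knows about WebKit styles the camel-cased property
    // exists on a DIV's style object; elsewhere it should not.
    function looksLikeWebkit() {
        var div = document.createElement("div");
        return "WebkitBorderRadius" in div.style;
    }

    if (/WebKit/i.test(navigator.userAgent) && !looksLikeWebkit()) {
        // Agent claims WebKit but the engine does not know WebKit styles - spoofed
    }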

If it has then you know it really is a WebKit browser, and if it hasn't then it's a spoofer, BUT a passing test could still be another WebKit browser spoofing, e.g. Chrome spoofing Safari.

Telling the difference between Chrome spoofing Safari and vice versa is a lot harder, as you would need to find JS objects OR CSS styles that are in one WebKit based browser and not the other. As they are almost identical I doubt you would find any (if you do please let me know).

That's just an idea but I have no code for it and as browsers become more standardised (all supporting the same css and JS) then it will become harder to detect differences.

I vaguely remember some other article on the web somewhere that used clever techniques to find out the "real" browser (or use a process of elimination to rule out browsers) but I don't know where it is.

This link will show you a way to detect the IE version but again with IE 9 it becomes quite hard to do 100% accurately.

http://blog.strictly-software.com/2009/03/detecting-ie-8-compatibility-modes-with.html

and I did write it a while back, so it might not even work!
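
For what it's worth, one hedged way to approximate the IE rendering mode from script (assuming IE 8+ behaviour, where document.documentMode reports the document mode in use) is simply:

    // document.documentMode exists in IE 8 and later and reports the
    // rendering mode actually in use (which may be lower than the browser
    // version if a compatibility mode is forced). Undefined elsewhere.
    var ieDocMode = document.documentMode; // e.g. 7, 8 or 9, or undefined
    if (typeof ieDocMode !== "undefined") {
        // Some flavour of IE 8+ (possibly rendering in an older mode)
    }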

Hopefully this info helps you.

Cristian said...

Detect all possible browsers like: proxies, desktops, tablets, mobiles and applications (java, symbian, android). This is the best code ever, enjoy ;) http://code.google.com/p/detect-real-user-agent/

Rob Reid said...

Well I am on an iPhone so that may be the reason, but I couldn't find any source code in the link you sent me.

Without seeing the code (and I will check when I'm next on a PC) I can only imagine you are either trying to emulate the browscap.ini system used by PHP & ASP, which only lists useragents and the features they are SUPPOSED to support and is easily defeated by an agent switcher, or it's a server side version of a massive IF statement checking all known possible agents.
As I can easily set up a transparent anonymous proxy server I can see no way in the world of detecting this kind of proxy, PLUS it has nothing to do with useragents.
You would need to maintain a massive list of known proxy IP addresses to detect proxies, and as most are servers being unwittingly used as a proxy until the owner finds out and turns on a firewall, an IP can be a proxy one day and not the next.
Therefore I would be interested in seeing some code as I don't believe it is possible to do.
Thanks for commenting

Rob Reid said...

I checked the code on my PC (downloaded the .zip) and as expected all you have done is a massive big IF statement trying to accommodate all known browsers at this point in time.

Reasons this won't work.

1. What if I put this in my user-agent switcher tool as the user-agent

Robs-Robob0T1

Answer - Your code wouldn't match it as it's not in there. New bots come out every day so you would have to keep this file updated every day.

2. Spoofers, spammers, hackers etc like to use random letters and numbers (gibberish) as user-agents e.g
??
__main__/0.1
+http://robot.vedens.de VEDENSBOT
['dsin Bot']
.NET Framework Test Client
*/Nutch-1.0

Those are just a few of the 1,046,291 useragents my own logger system has collected over the last 3 years.

Your IF statement would fail on 99% of these and even if you attempted to run a regex on all 1 million+ agents you would make the page so slow it would be unusable.

3. You are relying on what the user has put in the header as the useragent - I can change the request headers (user-agent, x-forwarded-for, content-type etc) to whatever I want when I make a request. Therefore I can be using IE 9 and change the useragent to Firefox 11, and your code would report that the user is using Firefox 11, which is obviously wrong.

4. You are missing out on all the default HTTP libraries that a lot of script kiddies use when scraping, e.g. CURL, WinHTTP, LWP, ColdFusion, whose library names are used as the agent by default when the code making the HTTP request doesn't set one.

These are all reasons why just looking at the useragent string is not a feasible way of detecting the "true" useragent of the user.

The article (which is old) was trying to use known JavaScript differences between browsers to find out a) what the agent is NOT and b) what the agent COULD be.

There is no 100% foolproof way of detecting agents, and as I said you cannot detect proxies from a useragent string, so I don't understand why you even mentioned that.

Any questions, ask me, but I wouldn't run 200+ regular expressions to try and find out a useragent as:

a) it won't work accurately (for anyone who is spoofing)

b) it will slow down your server - the more Regex you use - the slower it will get.

If you are trying to find out agents for banning purposes then

a) I would ban IE 6 as most hackers on our system seem to use this.

b) I would put the rules into your .htaccess file and combine them see >> http://blog.strictly-software.com/2010/04/banning-bad-bots-with-mod-rewrite.html

c) I would ban blank and very short agents (gibberish)

d) I would use JS to find the agent - if JS cannot be run there is a good chance the user is a BOT, so you can rule them out as human or show a CAPTCHA / BotTrap etc (see the sketch below).
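
A minimal sketch of the idea in d), assuming a hypothetical hidden form field named js_check, is to have JavaScript fill in a value that a non-JS bot never will:

    // Bot trap sketch: the hidden "js_check" field (a hypothetical name) is
    // left empty in the HTML and only filled in by JavaScript. Submissions
    // arriving without the value almost certainly came from a client that
    // never ran JS, i.e. most likely a BOT.
    window.onload = function () {
        var field = document.getElementById("js_check");
        if (field) {
            field.value = "js-ok-" + new Date().getTime();
        }
    };
    // Server side: if js_check is empty, show a CAPTCHA or treat the request as a bot.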

But as I said, there is no foolproof way of detecting the "real" useragent of any user, especially by sniffing the useragent header string alone.

Thanks for commenting though.

Anonymous said...

Hi,
Thanks for the information on detecting false user agents.
I just have one question: with the increase in technical sophistication the bots/frauds have also become more complex and difficult to detect. Do you think the methods mentioned in your post would still be effective considering it's almost a decade now since this post?

Rob Reid said...

Well it is an old article, and as most browsers are now standards compliant and support the same events/properties/functions, e.g. IE supporting features that FF/Chrome also do such as document.addEventListener, finding features that a single browser supports and others don't is that much harder. IE may still have legacy features in it that support old IE behaviour, which could help you distinguish between IE and other browsers (attachEvent, JScript instead of JavaScript, document.all and so on). However I haven't checked recently which of these features ARE still supported by modern IE browsers, but knowing Microsoft I presume they are still backward compatible for old sites.

The key thing to ask is WHY you need to find out the exact browser the user is using when nearly all of them run the same code. The only real reason would be for usage statistics rather than site functionality, which should be written in a standards compliant manner knowing that most browsers auto update and support the same code.

If that is the case then I would concentrate on weeding out BOTS from HUMANS rather than trying to use JS to find the real browser. To do this you could reasonably presume that the majority of BOTS don't support JavaScript (even though GoogleBOT does, but a whitelist of known GoogleBOT (and other SERP) IP addresses would help with this). You could also use JS to check the loading of objects such as Flash/Java/ActiveX or other modern objects - even an image.onload event that logs that an image has loaded might help distinguish between BOTS and HUMANS on browsers - and use this to handle the difference between humans and BOTS.

The speed at which the user loads pages can also be used to distinguish BOTS from humans, as BOTS not following a Crawl-delay Robots.txt command will try to hammer through web pages with very little time between requests, whilst humans will be reading and moving their mouse to click links to change page and will be much slower.

Basically the question to ask yourself is why you need to do this in the first place and whether all the extra work is justified if it is purely for usage stats. If it isn't for stats and is just to prevent BOTS from hammering your site, use a CAPTCHA system for form/comment posting that isn't easily beatable by OCR systems, such as a 2 stage system. A determined hacker wanting to scrape content or carry out some other activity on your site will tailor their BOT's code to your site by analyzing your defence mechanisms first. These articles may help you >
http://blog.strictly-software.com/2016/01/quick-htaccess-rules-and-vbs-asp.html and http://blog.strictly-software.com/2012/08/a-karmic-guide-for-scraping-without.html.
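
As a rough illustration of the image.onload and page-speed ideas above (the /pixel.gif image, /bot-check URL and 2 second threshold are illustrative assumptions only), something like this could log clients that run JS, load images and stay on the page longer than a typical bot:

    // Hypothetical bot/human signal: the client must run JS, load an image
    // and remain on the page for a couple of seconds before it is logged
    // as "probably human".
    var pageLoadedAt = new Date().getTime();
    var probe = new Image();
    probe.onload = function () {
        setTimeout(function () {
            var dwell = new Date().getTime() - pageLoadedAt;
            var flag = new Image();
            flag.src = "/bot-check?human=1&dwell=" + dwell; // hypothetical logging endpoint
        }, 2000);
    };
    probe.src = "/pixel.gif?t=" + pageLoadedAt; // tiny real image on your own server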