
Tuesday, 17 May 2016

Stopping BOTS - A Multi Layered Approach


By Strictly Software

Some people don't mind BOTS of all shapes and forms roaming their sites, but if you actually look into what they are doing, should you be worried about their actions?

Have you examined your log files lately to see what kind of BOTS are visiting and how much bandwidth they are using?

Here are a few of the reasons you might want to care about the type of actions carried out by automated crawlers (BOTS):

1. They eat bandwidth. Social media BOTS especially, which jump onto any link you post on Twitter and cause Twitter Rushes. This is where 50+ BOTS all hit your site at the same time and, if your server is not configured properly, can use up all your memory and freeze the system. There are plenty of articles about Twitter Rushes on this site if you use the search option down the right hand side to find more details.

2. Bandwidth costs money. If you are a one man band or don't want high server costs then why would you want social media BOTS, many of which provide no benefit to you, costing you money just so they can provide their own end users with a service?

3. Content theft. If a user-agent identifying itself as IE 6 is hitting a page a second, is it really a human using an old IE browser visiting that many pages? Of course not. However, for some reason IE 6 is the most popular user-agent used by script kiddies, scrapers and hackers - probably because they have just downloaded an old crawler script off the web and run it without the knowledge to edit the code and change the agent. Look for user-agents from the same IP hitting lots of pages per minute and ask yourself: are they helping your business, or just slowing your site down by not obeying your robots.txt Crawl-delay command?

4. Hacking. Automated hackbots scan the web looking for sites with old operating systems, old code and potential back doors. They then create a list of sites for their users and come back to penetrate these sites with SQL/XSS injection hacks. Some might show up as GET requests in the log file, but if they are tampering with FORM elements then any POSTED data containing hack vectors won't show up. Hiding key response headers such as your server brand and model and the scripting language you use is a good, simple measure to prevent your site's name ending up on this list of potential targets, and it can easily be configured in your system's config files.

Therefore you should have a defence against these types of automated BOTS. Of course you also have the human hacker who might find a site's contact form, view the source, tamper with the HTML and work out a way to modify it so he can send out mass emails from your server with a custom script. Again, security measures should be implemented to stop this. I am not going to talk about the basics of security when it comes to preventing XSS/SQL injection, but the site has many articles on the topic, and basic input sanitisation and database login security measures should stop these kinds of hack.

So if you do want to stop automated BOTS from submitting forms, registering on your site, applying for jobs and anything else your site might do, the following list might be helpful. It is just an off-the-top-of-my-head list I recently gave to someone on LinkedIn, but it could be helpful if expanded to your own requirements.

On my own sites I use a multi-pronged approach to stop BAD BOTS as well as bandwidth-wasting social media BOTS, hack bots and even manual hackers tampering with forms. It saves me money as well as increasing performance by only allowing legitimate users to use the site. By banning the 50%+ of my traffic which is of no benefit to me, I can give the useful 50% a better user experience.

1) We log (using JavaScript) whether the user has JavaScript enabled, e.g. an AJAX call on the 1st page they hit that sets a session cookie. As most BOTS don't use JavaScript we can assume that if they have JavaScript enabled they are "probably" human.
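Something along these lines is enough to do the job - note the /log-js endpoint and the js_ok cookie name are just illustrative placeholders, not my actual code:

// Rough sketch: flag JavaScript-capable visitors on their first page view.
// "/log-js" and the "js_ok" cookie are placeholders for your own back end.
(function () {
  if (document.cookie.indexOf("js_ok=1") !== -1) { return; } // already flagged
  var xhr = new XMLHttpRequest();
  xhr.open("GET", "/log-js", true); // server side marks this session as "JavaScript enabled"
  xhr.send();
  document.cookie = "js_ok=1; path=/"; // client side marker as a backup
})();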

2) We also use JavaScript (or the 1st page HTTP_ALL header in IE) to log whether Flash is enabled and the version. A combination of having Flash running and JavaScript is better than just JavaScript on its own.
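A rough client-side version of the Flash check looks something like this - the ActiveX route covers old IE, the plugins route covers everything else:

// Rough sketch of client-side Flash detection; returns a version string or null.
function detectFlashVersion() {
  try {
    // Old IE exposes Flash through an ActiveX control
    var ax = new ActiveXObject("ShockwaveFlash.ShockwaveFlash");
    return ax.GetVariable("$version"); // e.g. "WIN 11,2,202,235"
  } catch (e) {
    var plugin = navigator.plugins && navigator.plugins["Shockwave Flash"];
    return plugin ? plugin.description : null; // null = no Flash detected
  }
}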

3) I have my own logger DB that records browser fingerprints: IP, user-agent, JavaScript, Flash, HTTP settings, installed apps, browser extensions, operating system and other features that can almost uniquely identify a user. The problem of course is that an IP often changes, either through DHCP or the use of proxies, VPNs and VPS boxes hired for an hour or two. However, it does help in that I can use this combined data to look up in my historical visitor database what rating I gave them before, e.g. Human, BOT, SERP, Hacker, Spammer, Content Thief and so on. That way, if the IP has changed but the majority of the browser fingerprint hasn't, I can make an educated guess. If I am not 100% sure I then go into "unsure mode", where security features such as CAPTCHAs and BOT TRAPS are introduced just in case. I can then use Session variables, if cookies are enabled, to store the current status of the user (Human, BOT, Unknown etc), or, if cookies are not enabled, use my visitor table to log the browser footprint and current IP and do lookups on pages where I need to use defensive measures.
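As a very rough illustration, even a handful of traits joined together gives you a fingerprint string you can hash and store alongside the IP - my real logger records a lot more than this:

// Rough sketch: combine a few stable browser traits into a fingerprint string.
function browserFingerprint() {
  return [
    navigator.userAgent,
    navigator.language,
    screen.width + "x" + screen.height + "x" + screen.colorDepth,
    new Date().getTimezoneOffset(),
    navigator.plugins ? navigator.plugins.length : 0
  ].join("|"); // hash this and store it next to the IP and behaviour rating
}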

4) These Session/DB settings are then used to decide whether to increment banner hit counters, write out emails in images or with JavaScript so that only humans can see them (to prevent BOT email scrapers), and other defensive measures. If I know they are 100% human then I may choose not to deploy these measures.
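Writing emails out with JavaScript can be as simple as the rough sketch below - the address only exists in the page after the script has run, so source-scraping BOTS never see it (the names are just examples):

// Rough sketch: build a mailto link at runtime so scrapers reading the raw
// HTML never see a complete address.
function writeEmailLink(user, domain, container) {
  var addr = user + "@" + domain; // only assembled once JavaScript has run
  var a = document.createElement("a");
  a.href = "mailto:" + addr;
  a.textContent = addr;
  container.appendChild(a);
}
// e.g. writeEmailLink("info", "example.com", document.getElementById("contact"));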

5) On forms like contact forms I often use BOT TRAPS. These are input elements placed in the flow of the form, with names like email_extra, that are hidden with CSS only. If the BOT submits a value for this hidden input I don't process the form - or I do, but without carrying out the desired action, and without letting the BOT know that nothing happened.
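The server side check is then trivial - if the trap field comes back with anything in it you are almost certainly dealing with a BOT. A rough, framework-agnostic sketch, assuming the posted fields arrive as a plain object:

// Rough sketch: honeypot check. "email_extra" is the CSS-hidden trap field
// (hide it with CSS only, e.g. position it off screen - never with type=hidden).
function isBotSubmission(fields) {
  return typeof fields.email_extra === "string" && fields.email_extra.trim() !== "";
}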

6) A lot of forms (especially contact forms) can be submitted by just entering an email address for all fields (name, email, password etc). Therefore I check that the field values are different, e.g. not the same value for both the email AND password fields. I also ensure the name matches a name pattern with a regular expression.
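A rough sketch of both checks - the name pattern here is illustrative, not my production regular expression:

// Rough sketch: reject lazy BOT submissions where every field holds the same
// value, and names that don't look like names.
function looksAutomated(fields) {
  if (fields.email && (fields.email === fields.name || fields.email === fields.password)) {
    return true; // same value pasted into multiple fields
  }
  var namePattern = /^[A-Za-z][A-Za-z' .-]{1,60}$/; // example pattern only
  return fields.name ? !namePattern.test(fields.name) : false;
}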

7) I have built my own 2 stage CAPTCHA system which can be turned on or off on the fly for forms where I don't know if the user is 100% human, or I can decide to just always have it on. It is based around a maths question where the numbers are shown in 3 automatically created images, grey and blurry like normal CAPTCHAs. The user first has to extract the right numbers from the images and then carry out a sum using those numbers, e.g. add number 1 to number 2 and deduct number 3. This works very well as it requires a human brain to interpret the question, not just OCR techniques to extract the CAPTCHA image values. There are so many OCR breakers out there that a standard CAPTCHA, where you just type in the word shown in the picture, can easily be cracked automatically now.
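Leaving the image generation to one side, the question-and-answer half of the idea boils down to something like this rough sketch:

// Rough sketch: generate three numbers, the instruction and the expected answer.
// Drawing the numbers into grey, blurry images is deliberately left out here.
function makeMathsCaptcha() {
  var n = [0, 0, 0].map(function () { return Math.floor(Math.random() * 9) + 1; });
  return {
    numbers: n, // these three values get rendered into the CAPTCHA images
    question: "Add number 1 to number 2 and deduct number 3",
    answer: n[0] + n[1] - n[2] // compare against what the user submits
  };
}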

8) If there is a textarea on the form (contact, application etc) then I use my RUDE word table, which has hundreds of variants of rude words with the regular expression to detect each one stored next to it. This can obviously be updated to include pharmacy pill names, movie downloads, porn and other spam words.

9) I also have a number of basic regular expressions, if the user wants light detection, that check for strings such as "download your xxx now" or "buy xxx for just $£", and words like MP3s, Films, Porn, Cialis and other common spam words that would have no place on a site not selling such goods.
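As a rough illustration, the light detection is just a handful of patterns run over the text - these are examples only, not my real list, which lives in a database table:

// Rough sketch: light-touch spam phrase detection with regular expressions.
var spamPatterns = [
  /download\s+your\s+\w+\s+now/i,
  /buy\s+\w+\s+for\s+(just\s+)?[$£]\s*\d+/i,
  /\b(cialis|viagra|mp3s?|porn)\b/i
];
function containsSpam(text) {
  return spamPatterns.some(function (re) { return re.test(text); });
}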

10) I always log any blocking so I can weed out any false positives and refine the regular expressions etc.

11) I also have an incremental ban time, so the 1st time anyone gets banned it is for 1 hour, then 2 hours, then 4, then a day and so on. The more times they come back the longer they get banned.
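One simple way to implement that escalation is to double the ban length on each repeat offence and cap it somewhere sensible - the doubling scheme and the cap here are just examples:

// Rough sketch: escalating ban lengths - 1 hour, then 2, 4, 8... capped at a week.
function banDurationHours(previousBans) {
  return Math.min(Math.pow(2, previousBans), 24 * 7);
}
// 0 previous bans -> 1 hour, 1 -> 2 hours, 2 -> 4 hours and so on.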

12) Sometimes I use JavaScript and AJAX to submit the form instead of standard submit buttons. As JavaScript is so commonly used now (just look at Google), most people have it enabled, otherwise the majority of sites just wouldn't work or would only have minimal features. When a technique like this is used it would take a human hacker to analyse your page and then write a custom BOT just to hack that one form. To get round tampering you can use a rolling random key created server side, inserted into a hidden element with JavaScript on page load and then examined on form submission to ensure it is correct. If it's not, then the person has tampered with the form by submitting an old key instead of the new one and can be banned or blocked.
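The client half of that idea looks roughly like the sketch below. The /form-key endpoint, form id and field name are placeholders - the server creates the rolling key and must check it again when the form is posted:

// Rough sketch: fetch the rolling key on page load, drop it into a hidden
// input, then submit the form by AJAX rather than a normal POST.
document.addEventListener("DOMContentLoaded", function () {
  var form = document.getElementById("contactForm"); // placeholder id
  var xhr = new XMLHttpRequest();
  xhr.open("GET", "/form-key", true); // placeholder endpoint returning the key
  xhr.onload = function () {
    var input = document.createElement("input");
    input.type = "hidden";
    input.name = "form_key";
    input.value = xhr.responseText; // rolling key created server side
    form.appendChild(input);
  };
  xhr.send();
  form.addEventListener("submit", function (e) {
    e.preventDefault(); // no standard submit - send it with AJAX instead
    var post = new XMLHttpRequest();
    post.open("POST", form.getAttribute("action") || "/contact", true);
    post.send(new FormData(form)); // server rejects the post if form_key is stale
  });
});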

13) Another good way to stop automatic hack BOTS (ones that just roam the web looking for forms - contact forms especially - to try and submit and break out of to send emails) is to not output FORM tags in your server side code, but instead have compressed and obfuscated JavaScript that on page load converts the <div id="form">....</div> into a real FORM with an action, method etc. Anyone viewing the non-generated source code, as most BOTS do, won't see a FORM there to try to hack. Only a generated HTML source view (once the page has loaded) would show them this, which most BOTS never get to see.
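A rough sketch of that conversion: the served HTML only contains the <div id="form">....</div> with the inputs inside it, and JavaScript wraps them in a real FORM once the page has loaded (the action and method here are placeholders):

// Rough sketch: build the FORM element at runtime so BOTS reading the raw
// source never see a form to submit.
document.addEventListener("DOMContentLoaded", function () {
  var holder = document.getElementById("form"); // the <div id="form"> placeholder
  var form = document.createElement("form");
  form.method = "post";
  form.action = "/contact"; // placeholder handler
  while (holder.firstChild) {
    form.appendChild(holder.firstChild); // move the inputs into the new form
  }
  holder.parentNode.replaceChild(form, holder);
});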

14) Honeypots and robots.txt logging are also useful, e.g. log every hit to the robots.txt file and flag any BOTS that don't visit it before crawling your site. You can then make a decision to ban them for breaking your Terms Of Service, which should state that BOTS must obey your robots.txt rules.

15) BAD BOTS usually use the links in the DISALLOW section of robots.txt to crawl anyway, so putting a fake page in that list of URLs is a good idea. This page should be linked to from your site in a way that humans cannot see the link and accidentally visit it (and if they do, it should have a JavaScript link on it to let them get back to the site). However, BAD BOTS will see the link in the source and crawl it. As they have broken your TOS and followed a URL in your DISALLOW list they are being doubly "bad", so you have every right to send them off to a honeypot. Many exist on the web that either serve up email addresses for them to extract and then wait for an email to be sent to one of those addresses to prove they are an email scraper BOT, or send them into an unbreakable maze-like system which auto-generates pages on the fly so that the BOT just keeps going round in circles, crawling page after page and getting nowhere - basically wasting their own bandwidth.

16) HTACCESS rules in your .htaccess file should identify known bad BOTS as well as IE 6, 5 and 5.5 and send them off to a 403 page or a 404 so they don't realise they have been sprung. No-one in their right mind should be using these old IE browsers anymore, however most downloadable crawlers used by script kiddies still use IE 6 as a user-agent for some reason. My guess is that they were written so long ago that the code hasn't changed, or that people had to support IE 6 due to Intranets being built in that technology, e.g. using VBScript as the client side scripting language.

By using IE 6 as a UA they get access to all systems due to sites having to support that ancient, horrible browser. However, I ban blank user-agents, user-agents less than 10 characters long, any that contain known XSS/SQL injection vectors and so on. There is a good PHP WordPress plugin called Wordpress Firewall: if you turn on all its features and then examine the output in your .htaccess file, it will show you some useful rules, such as banning image hot linking, that you can then nick for your own file.

17) Sending bad BOTS back to their own server is always a good trick so that they get nowhere on your own site. Another good trick is to send them to a site that might scare the hell out of them once they realise they have been trying to hack or DDOS it, such as https://www.fbi.gov/wanted/cyber or the Met's Cyber Crime unit.

These are just a few of the security measures I use to stop BOTS. It is not a comprehensive list but a good starting point and these points can be expanded and automated depending on who you think is visiting your site.

Remember, most of these points are backed up with detailed articles on this site, so have a search if anything piques your interest.

Hope this helps.

By Strictly Software


© 2016 Strictly Software

Sunday, 25 January 2015

Returning BAD BOTS to where they came from


By Strictly-Software

Recently in some articles I mentioned some .htaccess rules for returning "BAD BOTS" - e.g. crawlers you don't like, such as anything claiming to be IE 6 because no-one would genuinely be using it anymore - back to where they came from.

Now the rule I was using was suggested by a commenter on a previous article, and it was to use the REMOTE_ADDR server variable to do this.

For example in a previous article (which I have now changed) about banning IE 5, 5.5 and IE 6, I originally suggested using this rule for banning all user-agents that were IE 5, 5.5 or IE 6.

RewriteCond %{HTTP_USER_AGENT} (MSIE\s6\.0|MSIE\s5\.0|MSIE\s5\.5) [NC]
RewriteRule .* http://%{REMOTE_ADDR} [L,R=301]

Now this rewrite rule uses the server variable {REMOTE_ADDR}, which holds the originating IP address from the HTTP request, to send anyone with IE 6 or below back to it.

It is the IP address you would normally see in your server's access logs when someone visits.

Problems with this rule

Now, when I changed the rules on one of my own sites to this and then started testing a work site by using a user-agent switcher add-on for Chrome, I ran into the problem that every time I went to my own site I was sent to my company's gateway router page.

I had turned the switcher off, but for some reason - either a bug in the plugin, a cookie or a session variable - my own site still believed I was on IE 6 and not the latest Chrome version, so every visit kicked me back to my company's gateway router page.

Therefore, after a clean up, a think and a talk with my server techie, he told me I should be using localhost instead of the REMOTE_ADDR IP address. The reason was that a lot of that traffic - hackers, HACKBOTS, spammers and so on - would end up hitting the gateway page of their ISP, which could look like an attempt to hack it.

These ISPs might get a bit pissed off with your website sending their gateway router pages swathes of traffic that could potentially harm them.

Therefore, to prevent getting letters in the post complaining that you are sending swathes of hackers to someone's home or phone ISP gateway - as a lot of phones and tablets use proxies for their browsers anyway - the answer is to send them back to their own localhost, 127.0.0.1.

Also, instead of using a 301 permanent redirect you should use a 302 temporary redirect, as that is the more appropriate status code to use.

Use this rule instead

Therefore the rule I now recommend for anyone wanting to ban all IE 5, 5.5 and 6 traffic is below.

RewriteCond %{HTTP_USER_AGENT} (MSIE\s6\.0|MSIE\s5\.0|MSIE\s5\.5) [NC]
RewriteRule .* http://127.0.0.1 [L,R=302]

This rewrite rule bans IE 5, 5.5 and IE 6.0 and sends the crawler back to the localhost on its own machine with a 302 redirect. You can obviously add other rules in for known BAD BOTS and SQL/XSS injection hacks as well.

This is a more appropriate rule as it is not a permanent redirect, such as you would use when a page has changed its name. Instead it is an invalid parameter or value in the HTTP request that is causing the visitor to be temporarily redirected to a new destination.

If the visitor changed its user-agent or parameters then it would get to the site and not be redirected with a 301 or a 302 status code, but instead get a 200 OK status code.

So remember: whilst an idea might seem good at first, until you fully test it and ensure it doesn't cause problems it might not be all that it seems.

Saturday, 10 January 2015

2 Quick Ways To Reduce Traffic On Your System


By Strictly-Software

Slowing The 3 Major SERP BOTs Down To Reduce Traffic

If you run a site with a lot of pages, good rankings, or one that tweets out a lot, e.g. whenever a post comes online, then you will probably get most of your traffic from the big 3 crawlers.

I know that whenever I check my access_log on my server to find out the top visiting IP addresses with a command like

grep "Jan/2015" access_log | sed 's/ - -.*//' | sort | uniq -c | sort -nr | less

I always find that the top IPs are the main 3 search engines' own BOTS (SERP = Search Engine Results Page), so I call them SERP BOTS:

GoogleBot: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Bing: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

Yahoo: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Without asking these BOTS to come NOW by doing things like refreshing your sitemap and pinging the SERPS, or tweeting out links, they will crawl your site in their own time and, with nothing to tell them to slow down, at their own speed. This could be once every second, and if so it could cause your site performance issues.

However, think about it logically: if you post a news article or a job advert then in reality it only needs to be crawled once by each SERP BOT for it to be indexed.

You don't really want it to be crawled every day and on every visit by these BOTS as the content HASN'T changed so there is really no need for the visit.

Now I don't know of a way to tell a BOT to only crawl a page if it has new content or has changed in some way; even if you had a sitemap system that only listed pages that were new or edited, the BOTS would still just visit your site and crawl it.

If you cannot add rel="nofollow" to internal links that point to duplicate content (which doesn't actually 100% guarantee the BOT won't crawl it anyway), then there are some things you can try if you find that your site is having performance problems or is under pressure from heavy loads.


Crawl-Delay

Now, this only used to be supported by BingBOT and then some smaller, newer search engines like Blekko.

However, in recent months, after some testing, I noticed that most major SERP BOTS apart from GoogleBOT now obey the command. To get Google to reduce their crawl rate you have to use Webmaster Tools and set the crawl rate from the control panel.

For instance, on one of my big news sites I have a Crawl-delay: 25 setting, and when I check my access log for those user-agents there is (roughly) a 25 second delay between each request.

Therefore extending this value will reduce the traffic load from the major visitors to your site, and it is easily done by adding it to your robots.txt file e.g.

Crawl-delay: 25

Banning IE 6

Now there is no logical reason in the world for any REAL person to be using this user-agent.

This browser was probably the worst browser in history due to the quirks within it that made web developers' jobs so hard. Even just between IE 5.5 and IE 7 there are so many differences from IE 6, and it is the reason IE 8 and 9 had all the settings for compatibility modes and browser modes.

It is also the reason Microsoft is going to scrap support for IE 7-9: all the jiggery-pokery they introduced just to handle the massive differences between IE 6 and their newer, standards-compliant browsers.

Anyone with a Windows computer nowadays should be on at least IE 10. Only if you're still on XP and haven't done any Windows Updates in about 5 years would you be a real IE 6 user.

Yesterday at work I ran a report on the most used Browsers that day. 

IE 6.0 came 4th!

It was below the 3 SERP BOTS I mentioned earlier and above the latest Chrome version.

Only on more detailed inspection with my custom logger/defence system, which analyses the behaviour of visitors rather than just assuming that because your agent is IE 6 you are actually human, could I see that these visitors were all BOTS.

I check for things like whether they can run JavaScript by using JavaScript to log that they can, in the same way as I do for Flash. These users had no JavaScript or Flash support, and the rate they went through pages was way too fast for a human.

The only reason I can think people are using this user-agent is that they are script kiddies who have downloaded an old crawling script whose default user-agent is IE 6 and they haven't changed it.

Either they don't have the skill or they are just lazy. However by banning all IE 6 visitors with a simple .htaccess rule like this you can reduce your traffic hugely.

RewriteCond %{HTTP_USER_AGENT} (MSIE\s6\.0|MSIE\s5\.0|MSIE\s5\.5) [NC]
RewriteRule .* http://127.0.0.1 [L,R=302]


This rewrite rule bans IE 5, 5.5 and IE 6.0 and sends the crawler back to the localhost on its own machine with a 302 redirect.

No normal person would be using these agents. There may be some Intranets from the 90's still using VBScript as a client side scripting language, but no modern site is designed with IE 6 in mind, so most sites you find will not handle IE 6 very well. Like Netscape Navigator it is an old browser, so don't worry about supporting it. By banning IE 6 and below you will find your traffic load going down a lot.

So two simple ideas to reduce your traffic load. Try them and see how much your site improves.

Thursday, 27 January 2011

2011 Browser Usage Stats

Browser Coverage and Other Visitor Statistics

I like to regularly check the web traffic stats of one of my largest systems to see what kind of browsers our users are visiting with, and I have previously posted reports which have shown IE maintaining its position at the top of the stats every time.

One of our sites is used by a large corporate company that heavily restricts the type of browser their workers can use to access the Internet, which means that they have to use IE 6. But even accounting for that, it is surprising to see that IE 6 is still at the top of the browser usage report, even though IE 8 has been out for a long time and IE 9 is on the way.

One other reason I can think of that explains why so many people are still using IE 6 is that it seems to be the user-agent of choice for spoofers and hackers. I have built an automated system that logs, identifies and then bans these bad bots and users, and I have built up quite a large database of known IP/agent combinations, so I can regularly check what kind of tricks they are up to.

The latest batch of hackbots that I have spotted are using stripped down, URL-encoded HTML without quotes around attributes and without protocols in the links, e.g.



%3C%69%66%72%61%6D%65%20%73%72%63%3D%2F%2F%73%6F%6D%65%64%6F%64%67%79%73%69%74%65%2E%72%75%3E


When URL decoded this becomes


<iframe src=//somedodgysite.ru>

Even though there are no quotes around the src attribute and no protocol at the beginning of the URL, this HTML will still work; leaving them out is a common technique used by minifiers (including Google's) to cut down on the size of HTML files.

Obviously the whole point of this is to beat injection and hack detection that relies on pattern matching, in a similar way to those SQL injection attacks written in MiXeD cAsE to beat people who have forgotten to make their system's SQL injection detection routines case-insensitive.
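The counter to this is simply to normalise the input before running your pattern checks - decode it and lower-case it so neither URL encoding nor MiXeD cAsE slips past the filter. A rough sketch:

// Rough sketch: normalise input before pattern matching so URL-encoded or
// mixed-case payloads don't slip past naive case-sensitive checks.
function normalise(input) {
  var decoded = input;
  try { decoded = decodeURIComponent(input); } catch (e) { /* malformed encoding - leave as-is */ }
  return decoded.toLowerCase();
}
// normalise("%3C%69%66%72%61%6D%65") -> "<iframe"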


Anyhow, here are the latest browser usage reports for the first month of 2011.


Top Browsers

Browser           Usage %
IE 6.0            48.12
IE 8.0            16.04
IE 7.0            12.14
Firefox 3.6        6.71
Chrome 8.0         4.82
IE 5.5             4.31
Safari 5.0         1.83
Firefox 3.0        1.06
Firefox 3.5        0.97
Safari 4.0         0.41
Opera 9.0          0.36
Opera 8.0          0.36
iPhone 4.2         0.34
Firefox 2.0        0.31
Mozilla 1.9        0.26
Iceweasel 3.0      0.26
iPhone 4.1         0.22
IE 9.0             0.21
BlackBerry         0.19



Top Operating Systems

Operating System   Usage %
WinXP              68.30
WinVista           10.20
Win                10.12
Win2000             4.87
MacOSX              2.90
iPhone OSX          1.23
Win2003             1.14
Linux               0.66
WinME               0.49
Win98               0.47
Debian              0.26
WinNT               0.24
Android             0.22
BlackBerry          0.17