Stopping BOTS - A Multi-Layered Approach
By Strictly Software
Some people don't mind BOTS of all shapes and forms roaming their sites, but if you actually look into what they are doing, should you be worried about their actions?
Have you examined your log files lately to see what kind of BOTS are visiting and how much bandwidth they are using?
Here are a few of the reasons you might want to care about the type of actions carried out by automated crawlers (BOTS):
1. They eat bandwidth. Social media BOTS in particular jump onto any link you post on Twitter, causing Twitter Rushes. This is where 50+ BOTS all hit your site at the same time and, if your server is not configured properly, can use up all your memory and freeze the system. There are plenty of articles about Twitter Rushes on this site if you use the search option down the right hand side to find more details.
2. Bandwidth costs money. If you are a one man band or don't want high server costs, why would you want social media BOTS, many of which provide no benefit to you, costing you money just so they can provide their own end users with a service?
3. Content theft. If a user-agent identifying itself as IE 6 is hitting a page a second, is it really a human using an old IE browser visiting that many pages? Of course not. Yet for some reason IE 6 is the most popular user-agent used by script kiddies, scrapers and hackers, probably because they have just downloaded an old crawler script off the web and run it without the knowledge to edit the code and change the agent. Look for user-agents from the same IP hitting lots of pages per minute and ask yourself: are they helping your business, or just slowing your site down by not obeying your robots.txt Crawl-delay directive?
4. Hacking. Automated hackbots scan the web looking for sites running old operating systems, old code and potential back doors. They then create a list of sites for their user, who comes back later to penetrate those sites with SQL/XSS injection attacks. Some attempts might show up as GET requests in the log file, but if they are tampering with FORM elements then any POSTED data containing hack vectors won't show up. Hiding key response headers such as your server brand and model and the scripting language you use is a good, simple measure to prevent your site's name ending up on this list of potential targets, and it can easily be configured in your system's config files.
Therefore you should have a defence against these types of automated BOTS. Of course you also have the human hacker who might find a site's contact form, view the source, tamper with the HTML and work out a way to modify it so he can send out mass emails from your server with a custom script. Again, security measures should be implemented to stop this. I am not going to cover the basics of preventing XSS/SQL injection here, but the site has many articles on the topic, and basic input sanitisation and database login security measures should stop these kinds of hack.
So if you do want to stop automated BOTS from submitting forms, registering on your site, applying for jobs and anything else your site might do, the following list might be helpful. It is just an off-the-top-of-my-head list I recently gave to someone on LinkedIn, but it could be helpful if expanded to your own requirements.
On my own sites I use a multi-pronged approach to stop BAD BOTS as well as bandwidth-wasting social media BOTS, hack bots and even manual hackers tampering with forms. It saves me money as well as increasing performance by letting only legitimate users use the site. By banning the 50%+ of my traffic that is of no benefit to me, I can give the useful traffic a better user experience.
1) We log (using Javascript) whether the user has Javascript enabled, e.g. an AJAX call on the 1st page they hit that sets a session cookie. As most BOTS don't use Javascript, we can assume that if they have Javascript enabled they are "probably" human.
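A minimal sketch of that first check is below, assuming a hypothetical /log-js endpoint that records the hit and sets the session cookie server side:

```javascript
// Point 1 sketch: fire a background request on the first page a visitor hits so
// the server can record "this visitor executed Javascript" and set a session cookie.
// The /log-js endpoint and the jschecked marker cookie are hypothetical names.
document.addEventListener("DOMContentLoaded", function () {
    if (!document.cookie.includes("jschecked=1")) {
        fetch("/log-js", { method: "POST", credentials: "same-origin" })
            .then(function (res) {
                if (res.ok) {
                    // Client-side marker so the call isn't repeated on every page view
                    document.cookie = "jschecked=1; path=/";
                }
            })
            .catch(function () { /* logging is best effort, ignore failures */ });
    }
});
```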
2) We also use Javascript (or the 1st page HTTP_ALL header in IE) to log whether Flash is enabled and the version. A combo of having Flash running and Javascript is better than just Javascript on its own.
3) I have my own logger DB that records browser fingerprints: IPs, user-agent, Javascript, Flash, HTTP settings, installed apps, browser extensions, operating system and other features that can almost uniquely identify a user. The problem of course is that an IP often changes, either through DHCP or the use of proxies, VPNs and VPS boxes hired for an hour or two. However it does help in that I can use this combination of data to look up in my historical visitor database what rating I gave them before, e.g. Human, BOT, SERP, Hacker, Spammer, Content Thief and so on. That way, if the IP has changed but the majority of the browser fingerprint hasn't, I can make an educated guess. If I am not 100% sure, however, I go into "unsure mode" where security features such as CAPTCHAs and BOT TRAPS are introduced just in case. I can then use Session variables, if cookies are enabled, to store the current status of the user (Human, BOT, Unknown etc), or, if cookies are not enabled, use my visitor table to log the browser footprint and current IP and do lookups on pages where I need to apply defensive measures.
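As a rough illustration of the kind of data that can be combined, here is a small sketch; a real fingerprint would use many more signals, and the /log-fingerprint endpoint is just a placeholder:

```javascript
// Point 3 sketch: join a few browser properties and hash them so a short token
// can be logged against the current IP and looked up on later visits.
async function buildFingerprint() {
    var parts = [
        navigator.userAgent,
        navigator.language,
        screen.width + "x" + screen.height + "x" + screen.colorDepth,
        Intl.DateTimeFormat().resolvedOptions().timeZone,
        navigator.hardwareConcurrency || "?",
        navigator.platform
    ];
    var data = new TextEncoder().encode(parts.join("||"));
    var digest = await crypto.subtle.digest("SHA-256", data);
    return Array.from(new Uint8Array(digest))
        .map(function (b) { return b.toString(16).padStart(2, "0"); })
        .join("");
}

// Send the hash to a (hypothetical) logging endpoint for rating lookups
buildFingerprint().then(function (hash) {
    fetch("/log-fingerprint", { method: "POST", body: hash, credentials: "same-origin" });
});
```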
4) These Session/DB settings are then used to decide whether to increment banner hit counters, write out emails in images or with Javascript so that only humans can see them (to prevent BOT email scrapers), and apply other defensive measures. If I know they are 100% human then I may choose not to deploy these measures.
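The "write emails out with Javascript" trick can be as simple as the sketch below; the element ID and address are placeholders:

```javascript
// Point 4 sketch: assemble the address at runtime so it never appears as plain
// text in the HTML source that email scraper BOTS read.
function writeEmailLink(elementId, user, domain) {
    var address = user + "@" + domain;              // only ever exists in the browser
    var link = document.createElement("a");
    link.href = "mailto:" + address;
    link.textContent = address;
    document.getElementById(elementId).appendChild(link);
}

// Usage: <span id="contact-email"></span> in the page, then:
writeEmailLink("contact-email", "info", "example.com");
```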
5) On forms like contact forms I often use BOT Traps. These are input elements placed in the flow of the form, with names like email_extra, that are hidden with CSS only. If a BOT submits a value for this hidden input I either don't process the form, or I do but without carrying out the desired action, never letting the BOT know that nothing happened.
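A minimal sketch of such a trap follows; the field name email_extra matches the example above, everything else (class name, helper function) is illustrative:

```javascript
// Point 5 sketch. The form contains an extra input real users never see:
//
//   <input type="text" name="email_extra" class="offscreen" autocomplete="off" tabindex="-1">
//   /* .offscreen { position: absolute; left: -9999px; } */
//
// Server side (any stack), a filled-in trap field means an automated submission.
function isTrapTriggered(formValues) {
    return typeof formValues.email_extra === "string" &&
           formValues.email_extra.trim() !== "";
}

// Usage: if isTrapTriggered() returns true, silently discard the submission but
// return the normal "thank you" page so the BOT never learns the trap exists.
```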
6) A lot of forms (especially contact forms) can be submitted by just entering an email address in every field (name, email, password etc). Therefore I check that the field values are different, e.g. not the same value for both the email AND password fields. I also ensure the name matches a name pattern with a regular expression.
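A sketch of those two checks, with example field names and an example name pattern rather than my exact rules:

```javascript
// Point 6 sketch: reject submissions where every field holds the same value,
// and make sure the name field actually looks like a name.
function looksLikeRealSubmission(fields) {
    var name = (fields.name || "").trim();
    var values = [fields.name, fields.email, fields.password].map(function (v) {
        return (v || "").trim().toLowerCase();
    });

    // The same email address pasted into every box is a classic lazy-BOT signature
    var allIdentical = values.every(function (v) { return v === values[0]; });
    if (allIdentical) return false;

    // Names: letters, spaces, hyphens and apostrophes only, 2 to 60 characters
    if (!/^[A-Za-z][A-Za-z' -]{1,59}$/.test(name)) return false;

    // A "name" that is itself an email address is another giveaway
    if (/@/.test(name)) return false;

    return true;
}
```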
7) I have built my own 2-stage CAPTCHA system which can be turned on or off on the fly for forms where I don't know if the user is 100% human, or I can decide to just always have it on. It is based around a maths question where the numbers are shown in 3 automatically created images, grey and blurry like normal CAPTCHAs. The user has to first extract the right numbers from the images and then carry out a randomly generated sum using those numbers, e.g. add number 1 to number 2 and deduct number 3. This works very well as it requires a human brain to interpret the question, not just OCR techniques to extract the values from the CAPTCHA images. There are so many OCR breakers out there that a standard CAPTCHA, where you simply enter the word shown in the picture, can easily be cracked automatically now.
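In outline, the server side of such a system might look like the sketch below; the image generation is omitted and the session storage is assumed:

```javascript
// Point 7 sketch: pick three numbers (rendered elsewhere as grey, blurry images),
// store the expected answer in the visitor's session, then verify it on submit.
function createMathsCaptcha(session) {
    var a = 1 + Math.floor(Math.random() * 9);
    var b = 1 + Math.floor(Math.random() * 9);
    var c = 1 + Math.floor(Math.random() * 9);
    // The visitor only ever sees the numbers as images, plus this instruction
    var question = "Add number 1 to number 2 and deduct number 3";
    session.captchaAnswer = a + b - c;
    return { numbers: [a, b, c], question: question };
}

function checkMathsCaptcha(session, submittedValue) {
    // OCR alone is not enough: the BOT would also have to parse and follow the instruction
    return parseInt(submittedValue, 10) === session.captchaAnswer;
}
```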
8) If there is a textarea on the form (contact, job application etc) then I use my RUDE word table, which holds hundreds of variants of rude words alongside the regular expressions used to detect them. This can obviously be extended to include pharmacy pill names, movie downloads, porn and other spam terms.
9) I also have a number of basic regular expressions, if the user wants light detection, that check for strings such as "download your xxx now" and "buy xxx for just $£", and words like MP3s, Films, Porn, Cialis and other common spam terms that have no place on a site not selling such goods.
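A cut-down sketch of the kind of pattern table used for points 8 and 9; these expressions are simple examples, not my actual list:

```javascript
// Points 8 and 9 sketch: run the message body through a table of spam/rude-word
// patterns and block the submission if any of them match.
var spamPatterns = [
    /\bv[i1l]agra\b/i,
    /\bc[i1]al[i1]s\b/i,
    /\bdownload\s+your\s+\w+\s+now\b/i,
    /\bbuy\s+\w+\s+for\s+(just\s+)?[$£]/i,
    /\b(mp3s?|porn)\b/i
];

function containsSpam(text) {
    return spamPatterns.some(function (pattern) { return pattern.test(text); });
}

// Usage
containsSpam("Buy Cialis for just $9.99");                  // true
containsSpam("I'd like to apply for the developer role");   // false
```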
10) I always log any blocking so I can weed out any false positives and refine the regular expressions etc.
11) I also have an incremental ban time, so the 1st time anyone gets banned it is for 1 hour, then 2, then 4, then a day and so on. The more times they come back, the longer they get banned.
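One simple way to implement this (the exact schedule is up to you) is to double the ban length on each repeat offence:

```javascript
// Point 11 sketch: 0 previous bans -> 1 hour, then 2, 4, 8 hours and so on.
// Where the previous ban count is stored (DB, cache) depends on your own stack.
function nextBanDurationHours(previousBanCount) {
    return Math.pow(2, previousBanCount);
}

// Example: a visitor being banned for the 4th time gets 8 hours
nextBanDurationHours(3);   // 8
```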
12) Sometimes I use Javascript and AJAX to submit the form instead of a standard submit button. As Javascript is so commonly used now (just look at Google), most people have it enabled, otherwise the majority of sites just wouldn't work or would have minimal features. When a technique like this is used, it would take a human hacker analysing your page and writing a custom BOT just to hack that one form. To counter that, you can use a rolling random key created server side, inserted into a hidden element with Javascript on page load and then examined on form submission to ensure it is correct. If it's not, then the person has tampered with the form by submitting an old key rather than the new one, and can be banned or blocked.
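A simplified sketch of the rolling key idea follows; the storage and field names are assumptions rather than my exact implementation:

```javascript
// Point 12 sketch: the server issues a one-time token, page-load Javascript copies
// it into a hidden input, and the server rejects any submission whose token does
// not match the one it issued for that session.
const crypto = require("crypto");   // Node's built-in crypto module

// Server side: create and remember a fresh token when rendering the form
function issueFormToken(session) {
    var token = crypto.randomUUID();
    session.formToken = token;
    return token;   // rendered into the page for the client-side script below
}

// Client side, on page load (in the browser):
//   document.getElementById("form_key").value = token;

// Server side, on submit: a missing, stale or replayed token means tampering
function isTokenValid(session, postedToken) {
    var ok = Boolean(postedToken) && postedToken === session.formToken;
    delete session.formToken;   // single use, so a replayed key fails next time
    return ok;
}
```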
13) Another good way to stop automatic hack BOTS (ones that just roam the web looking for forms, such as contact forms, to submit and break out of to send emails), is to not output FORM tags in your server side code but instead have compressed and obfuscated Javascript that, on page load, converts a <div id="form">....</div> into a real FORM with an action, method etc. Anyone viewing the non-generated source code, as most BOTS do, won't see a FORM there to try to hack. Only a generated HTML source view (once the page has loaded) would show them this, which most BOTS are not able to obtain.
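A bare-bones sketch of that conversion is below; in practice the script would be compressed and obfuscated, and the action URL is just an example:

```javascript
// Point 13 sketch: the static HTML only contains <div id="form">...</div> with the
// inputs inside it, so a BOT reading the raw source never sees a FORM to attack.
// This script builds the real form once the page has loaded.
document.addEventListener("DOMContentLoaded", function () {
    var placeholder = document.getElementById("form");
    if (!placeholder) return;

    var form = document.createElement("form");
    form.method = "post";
    form.action = "/contact-handler";   // example action, point it at your own handler

    // Move the placeholder's children (labels, inputs, button) into the new form
    while (placeholder.firstChild) {
        form.appendChild(placeholder.firstChild);
    }
    placeholder.replaceWith(form);
});
```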
14) Honeypots and robots.txt logging are also useful, e.g. log any hit to the robots.txt file and flag any BOTS that don't request it before crawling your site. You can then make a decision to ban them for breaking your Terms Of Service, which should state that BOTS must obey your robots.txt rules.
15) As BAD BOTS usually use the links in the DISALLOW section of robots.txt as a crawl list anyway, putting a fake page in that list is a good idea. This page should be linked to from your site in a way that humans cannot see the link and accidentally visit it (and if they do, it should have a Javascript link on it to let them get back to the site). However BAD BOTS will see the link in the source and crawl it. As they have broken your TOS and followed a URL in your DISALLOW list they are being doubly "bad", so you have the right to send them off to a honeypot. Many exist on the web that either serve up email addresses for the BOT to extract and then wait for an email to be sent to those addresses to prove it was an email scraper, or send the BOT into an unbreakable maze-like system that auto-generates pages on the fly so it just keeps going round in circles, crawling page after page, getting nowhere and wasting its own bandwidth.
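To illustrate points 14 and 15 together, here is a rough sketch using Express-style routes; the trap URL, honeypot path and in-memory sets are all stand-ins for whatever your own stack uses:

```javascript
// Points 14 and 15 sketch: note which IPs read robots.txt, and flag anything
// that requests the fake URL listed in its DISALLOW section.
const express = require("express");
const app = express();

var fetchedRobots = new Set();   // IPs that politely requested robots.txt first
var badBots = new Set();         // IPs that crawled the trap URL anyway

app.get("/robots.txt", function (req, res) {
    fetchedRobots.add(req.ip);
    res.type("text/plain").send("User-agent: *\nDisallow: /members-only-offers/\n");
});

app.get("/members-only-offers/", function (req, res) {
    // Only a crawler ignoring the DISALLOW rule (or following the hidden link) lands here
    badBots.add(req.ip);
    // Hand it off to a honeypot / endless auto-generated maze instead of real content
    res.redirect("/honeypot/page-1");
});

app.listen(3000);
```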
16) HTACCESS rules in your .htaccess file should identify known bad bots as well as IE 6, 5 and 5.5 and send them off to a 403 page, or a 404 so they don't realise they have been sprung. No-one in their right mind should be using these old IE browsers anymore, yet most downloadable crawlers used by script kiddies still use IE 6 as a user-agent for some reason. My guess is that they were written so long ago that the code hasn't changed, or that people had to support IE 6 due to Intranets being built in that technology, e.g. using VBScript as the client side scripting language.
By using IE 6 as a user-agent they get access to all systems, due to sites having to support that ancient horrible browser. However I ban blank user-agents, user-agents less than 10 characters long, any that contain known XSS/SQL injection vectors and so on. There is a good PHP Wordpress plugin called Wordpress Firewall that, if you turn on all the features and then examine the output in your .htaccess file, will show you some useful rules, such as banning image hot linking, that you can then nick for your own file.
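As a flavour of what such rules can look like in an Apache .htaccess file (the patterns are illustrative; build your own list from what you actually see in your logs):

```apache
# Sketch of point 16 style rules
RewriteEngine On

# Ban blank user-agents and the ancient IE versions beloved of script-kiddie crawlers
RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
RewriteCond %{HTTP_USER_AGENT} "MSIE [56]\." [NC]
RewriteRule .* - [F,L]

# Block obvious SQL/XSS injection vectors in the query string
RewriteCond %{QUERY_STRING} (<script|%3Cscript|union.*select|declare.*@) [NC]
RewriteRule .* - [F,L]
```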
17) Sending bad bots back to their own server is always a good trick so that they get nowhere on your own site. Another good trick is to send them to a site that might scare the hell out of them once they realise they have been trying to hack or DDOS it, such as https://www.fbi.gov/wanted/cyber or the Met's Cyber Crime department.
These are just a few of the security measures I use to stop BOTS. It is not a comprehensive list but a good starting point and these points can be expanded and automated depending on who you think is visiting your site.
Remember, most of these points are backed up with detailed articles on this site, so have a search if anything piques your interest.
Hope this helps.
By Strictly Software
© 2016 Strictly Software