Stopping BOTS - A Multi Layered Approach
By Strictly Software
Some people don't mind BOTS of all shapes and forms roaming their sites, but if you actually look into what they are doing, should you be worried about their actions?
Have you examined your log files lately to see what kind of BOTS are visiting and how much bandwidth they are using?
Here are a few of the reasons you might want to care about the type of actions carried out by automated crawlers (BOTS):
1. They eat bandwidth. Social media BOTS are especially bad as they jump onto any link you post on Twitter, causing Twitter Rushes. This is where 50+ BOTS all hit your site at the same time and, if your server is not configured properly, they could use up all your memory and freeze the system. There are plenty of articles about Twitter Rushes on this site; use the search option down the right hand side to find more details.
2. Bandwidth costs money. If you are a one-man band or don't want high server costs then why would you want social media BOTS, many of which provide no benefit to you, costing you money just so they can provide their own end users with a service?
3. Content theft. If a user-agent identifying itself as IE6 is hitting a page a second, is it really a human using an old IE browser visiting that many pages? Of course not. However, for some reason IE 6 is the most popular user-agent used by script kiddies, scrapers and hackers, probably because they have just downloaded an old crawler script off the web and run it without the knowledge to edit the code and change the agent. Look for user-agents from the same IP hitting lots of pages per minute and ask yourself: are they helping your business, or just slowing your site down by not obeying the Crawl-delay directive in your robots.txt?
4. Hacking. Automated hackbots scan the web looking for sites with old operating systems, old code and potential back doors. They then create a list of sites for their user and come back to penetrate these sites with SQL injection or XSS hacks. Some attacks might show up in GET requests in the log file, but if they are tampering with FORM elements then any POSTED data containing hack vectors won't show up. Hiding key response details such as your server brand and version and the scripting language you use is a good, simple measure to prevent your site's name ending up on this list of potential targets, and it can easily be set up in the config files on your system.
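As a rough illustration of point 3, the sketch below counts hits per IP per minute from access log lines. It assumes the log is in Common Log Format with English month names; the function name busy_clients and the threshold of 30 hits per minute are hypothetical choices for the example, not values taken from any particular setup.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the client IP and the timestamp at the start of a Common Log Format line.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')

def busy_clients(lines, max_per_minute=30):
    """Return {(ip, minute): hits} for clients that exceed max_per_minute."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        ip, stamp = match.groups()
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        # Bucket the hit by IP and by calendar minute.
        hits[(ip, when.strftime("%Y-%m-%d %H:%M"))] += 1
    return {key: count for key, count in hits.items() if count > max_per_minute}
```

Running this over a day's log gives a short list of IPs worth checking by hand against their user-agents before deciding whether to ban them.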
Therefore you should have a defence against these types of automated BOTS. Of course you also have the human hacker who might find a site's contact form, view the source, tamper with the HTML and work out a way to modify it so he can send out mass emails from your server with a custom script. Again, security measures should be implemented to stop this. I am not going to talk about the basics of security when it comes to preventing XSS/SQL injection, but the site has many articles on the topic, and basic input sanitisation and database login security measures should stop these kinds of hack.
So if you do want to stop automated BOTS from submitting forms, registering on your site, applying for jobs and anything else your site might do, the following list might be helpful. It is just an off the top of my head list I recently gave to someone on LinkedIn, but it could be helpful if expanded to your own requirements.
On my own sites I use a multi-pronged approach to stop BAD BOTS as well as bandwidth-wasting social media BOTS, hack bots and even manual hackers tampering with the forms. It saves me money and increases performance by allowing only legit users to use the site. By banning the 50%+ of my traffic which is of no benefit to me, I can give the remaining useful traffic a better user experience.
5) On forms like contact forms I often use BOT Traps. These are input elements that are in the flow of the form, with names like email_extra, that are hidden with CSS only. If the BOT submits a value for this hidden input I either don't submit the form, or I submit it without carrying out the desired action and don't let the BOT know that nothing happened.
6) A lot of forms (especially contact forms) can be submitted by just entering an email address for all fields (name, email, password etc). Therefore I check that the field values are different, e.g. not the same value for an email AND a password field. I also ensure the name matches a name pattern with a regular expression.
7) I have built my own 2 stage CAPTCHA system which can be turned on or off on the fly for forms where I don't know if the user is 100% human, OR I can decide to just always have it on. This is based around a maths question, where the numbers are in 3 automatically created images, grey and blurry like normal CAPTCHAs. The user has to first extract the right numbers from the images and then carry out an automatically generated sum with those numbers, e.g. add number 1 to number 2 and deduct number 3. This works very well as it requires a human brain to interpret the question, not just OCR techniques to extract the CAPTCHA image values. There are so many OCR breakers out there that a standard CAPTCHA where you enter the word on the picture can easily be cracked automatically now.
8) If there is a textarea on the form (contact, application etc.) then I use my RUDE word table, which has hundreds of variants of rude words with the regular expression to detect each one stored next to it. This can obviously be updated to include pharmacy pill names, movie downloads, porn and other spam words.
9) I also have a number of basic regular expressions, for when the user wants light detection, that check for certain strings such as "download your xxx now", "buy xxx for just $£", and words like MP3s, Films, Porn, Cialis and other common spam words that would have no place on a site not selling such goods.
10) I always log any blocking so I can weed out any false positives and refine the regular expressions etc.
11) I also have an incremental ban time, so the 1st time anyone gets banned it is for 1 hour, then 2, then 4, then a day etc. The more times they come back the longer they get banned.
14) Honeypots and Robots.txt logging are also useful, e.g. log every hit to the robots.txt file and flag any BOTS that don't visit it before crawling your site. You can then make a decision to ban them for breaking your Terms Of Service for BOTS, which should state that they must obey your Robots.txt rules.
16) Rules in your .htaccess file should identify known bad bots as well as IE 6, 5 and 5.5 and send them off to a 403 page, or a 404 so they don't realise they have been sprung. No-one in their right mind should be using these old IE browsers anymore, however most downloadable crawlers used by script kiddies still use IE 6 as a user-agent for some reason. My guess is that they were written so long ago that the code hasn't changed, or that people had to support IE 6 due to Intranets being built in that technology, e.g. using VBScript as the client side scripting language.
By using IE 6 as a UA they get access to all systems due to sites having to support that ancient horrible browser. However I ban blank user-agents, user-agents less than 10 characters long, any that contain known XSS/SQL injection vectors and so on. There is a good PHP WordPress plugin called WordPress Firewall that, if you turn on all the features and then examine the output in your .htaccess file, will show you some useful rules, such as banning image hot linking, that you can then nick for your own file.
17) Sending bad bots back to their own server is always a good trick so that they get nowhere on your own site. Another good trick is to send them to a site that might scare the hell out of them once they realise they have been trying to hack or DDOS it, such as https://www.fbi.gov/wanted/cyber or the METS Cyber Crime department.
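The form checks in points 5 and 6 could be sketched roughly as follows. This is only a minimal illustration: the function name looks_like_bot, the field name "name" and the name pattern are hypothetical, the pattern only allows basic Latin letters (so a real site would need something more forgiving), and forms with a "confirm password" field would need that field excluded from the duplicate check.

```python
import re

# Hypothetical name pattern: letters, spaces, apostrophes and hyphens only.
NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z' -]{1,59}$")

def looks_like_bot(form, trap_field="email_extra"):
    """Return a reason string if the submission looks automated, else None."""
    # A human never sees the CSS-hidden trap field, so any value suggests a BOT.
    if form.get(trap_field, "").strip():
        return "trap field filled"
    # BOTS often paste the same value (usually an email) into every field.
    values = [value.strip().lower()
              for key, value in form.items()
              if key != trap_field and value.strip()]
    if len(set(values)) < len(values):
        return "duplicate field values"
    # The name should look like a name, not an email address or a URL.
    if not NAME_PATTERN.match(form.get("name", "")):
        return "name does not match pattern"
    return None
```

When a reason comes back, the form handler can log it and return the normal "thank you" page without carrying out the action, so the BOT has no signal that it was caught.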
These are just a few of the security measures I use to stop BOTS. It is not a comprehensive list but a good starting point and these points can be expanded and automated depending on who you think is visiting your site.
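As one example of automating a point from the list, the incremental ban time in point 11 might look something like this. The first four steps (1 hour, 2 hours, 4 hours, a day) come from the list above; what happens after that is not stated, so the doubling used here for later offences is purely an assumption, as is the function name ban_duration.

```python
from datetime import timedelta

# First four steps come from the list above; anything beyond them is an assumption.
BAN_HOURS = [1, 2, 4, 24]

def ban_duration(offence_count):
    """Return the ban length for the Nth offence (1-based)."""
    if offence_count < 1:
        raise ValueError("offence_count must be 1 or more")
    if offence_count <= len(BAN_HOURS):
        return timedelta(hours=BAN_HOURS[offence_count - 1])
    # Assumed continuation: double the last scheduled ban for each further offence.
    extra = offence_count - len(BAN_HOURS)
    return timedelta(hours=BAN_HOURS[-1] * 2 ** extra)
```

Storing the offence count against the IP (or IP plus user-agent) alongside the ban expiry time is enough to drive this, and the same record doubles as the block log mentioned in point 10.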
Remember most of these points are backed up with detailed articles on this site, so have a search if anything piques your interest.
Hope this helps.
© 2016 Strictly Software