Saturday 10 January 2015

2 Quick Ways To Reduce Traffic On Your System

By Strictly-Software

Slowing The 3 Major SERP BOTs Down To Reduce Traffic

If you run a site with a lot of pages, good rankings, or a site that tweets out a lot, e.g. whenever a post comes online, then you will probably get most of your traffic from being crawled by the big 3 crawlers: GoogleBot, BingBot and Yahoo! Slurp.

I know that whenever I check my access_log on my server to find out the top visiting IP addresses with a command like

grep "Jan/2015" access_log | sed 's/ - -.*//' | sort | uniq -c | sort -nr | less

I always find the top IP's are the main 3 Search Engines own BOTS (SERP = Search Engine Results Page), so I call their BOTS SERP BOTS.

GoogleBot: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Bing: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

Yahoo: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
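If you would rather group the log by user-agent than by IP address, something like the rough Python sketch below would do it. It assumes your access_log is in Apache's "combined" format, where the user-agent is the last quoted field. The function name and the sample lines (which use made-up documentation IPs) are my own for illustration.

```python
from collections import Counter


def top_user_agents(log_lines, n=10):
    """Return the n most common user-agent strings with their hit counts."""
    counts = Counter()
    for line in log_lines:
        parts = line.split('"')
        # Combined format: ip - - [time] "request" status bytes "referer" "agent"
        # Splitting on the double quote puts the user-agent at index 5.
        if len(parts) >= 6:
            counts[parts[5]] += 1
    return counts.most_common(n)


# Made-up sample lines, for illustration only.
SAMPLE_LOG = [
    '192.0.2.1 - - [10/Jan/2015:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.2 - - [10/Jan/2015:10:00:01 +0000] "GET /b HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '192.0.2.1 - - [10/Jan/2015:10:00:02 +0000] "GET /c HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

if __name__ == "__main__":
    for agent, hits in top_user_agents(SAMPLE_LOG):
        print(hits, agent)
```

On a real server you would pass in the open log file instead of the sample list, e.g. top_user_agents(open("access_log")).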

Even without you asking these BOTS to come NOW, by doing things like refreshing your sitemap and pinging these SERPS or tweeting out links, they will crawl your site at a time of their own choosing, and with nothing to tell them to slow down they will crawl at their own speed. This could be once every second, and if so it could cause performance issues for your site.

However, think about it logically: if you post a news article or a job advert then in reality it only needs to be crawled once by each SERP BOT for it to be indexed.

You don't really want it to be crawled every day and on every visit by these BOTS as the content HASN'T changed so there is really no need for the visit.

Now I don't know a way of telling a BOT to crawl a page only if it's new content or has changed in some way. Even if you had a sitemap system that only listed pages that were new or edited, the BOTS would still just visit your site and crawl it.

Adding rel="nofollow" to internal links that point to duplicate content is one option, although it doesn't 100% guarantee that the BOT won't crawl the page anyway. If you cannot do that, there are some other things you can try if you find that your site is having performance problems or is under pressure from heavy loads.


Crawl-Delay

Now this used to be supported only by BingBOT and some smaller, newer search engines like Blekko.

However, in recent months, after some testing, I noticed that most major SERP BOTS apart from GoogleBOT now obey the command. To get Google to reduce its crawl rate you can use Webmaster Tools to set the crawl rate from the control panel.

For instance on one of my big news sites I have a Crawl-Delay: 25 setting and when I check my access log for those user-agents there is a 25 second (roughly) delay between each request.
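To check the gap between requests yourself, a sketch along these lines would work. It assumes the standard Apache timestamp layout, e.g. [10/Jan/2015:10:00:00 +0000], and simply looks for a token such as "bingbot" anywhere in the line. The function name and the sample lines are hypothetical.

```python
import re
from datetime import datetime

# Apache writes the request time inside the first pair of square brackets.
TIMESTAMP = re.compile(r"\[([^\]]+)\]")


def request_gaps(log_lines, agent_token):
    """Seconds between consecutive requests whose log line contains agent_token."""
    times = []
    for line in log_lines:
        if agent_token.lower() not in line.lower():
            continue
        match = TIMESTAMP.search(line)
        if match:
            times.append(
                datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
            )
    times.sort()
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


# Made-up sample lines, for illustration only.
BING = '"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"'
SAMPLE_LOG = [
    '192.0.2.2 - - [10/Jan/2015:10:00:00 +0000] "GET /a HTTP/1.1" 200 1 "-" ' + BING,
    '192.0.2.9 - - [10/Jan/2015:10:00:10 +0000] "GET /x HTTP/1.1" 200 1 "-" "Other"',
    '192.0.2.2 - - [10/Jan/2015:10:00:25 +0000] "GET /b HTTP/1.1" 200 1 "-" ' + BING,
    '192.0.2.2 - - [10/Jan/2015:10:00:51 +0000] "GET /c HTTP/1.1" 200 1 "-" ' + BING,
]

if __name__ == "__main__":
    print(request_gaps(SAMPLE_LOG, "bingbot"))
```

If the gaps for a given BOT hover around your Crawl-delay value then it is obeying the directive.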

Raising this value reduces the load from the major visitors to your site, and it is easily done by adding the directive to your robots.txt file under a User-agent line, e.g.

User-agent: *
Crawl-delay: 25
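As a quick sanity check that the file parses the way you expect, Python's standard urllib.robotparser module can read the directive back. This is only a sketch: it shows how one standards-style parser reads the file, not how any particular BOT behaves, and the helper name is my own. It needs a reasonably recent Python 3 (3.8 or later) for the "*" group to be used as the fallback.

```python
from urllib.robotparser import RobotFileParser

# The same two lines you would put in robots.txt.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 25
"""


def crawl_delay_for(robots_txt, user_agent):
    """Return the Crawl-delay (in seconds) that applies to user_agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


if __name__ == "__main__":
    print(crawl_delay_for(ROBOTS_TXT, "bingbot"))
```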

Banning IE 6

Now there is no logical reason in the world for any REAL person to be using this user-agent.

This was probably the worst browser in history due to the quirks within it that made web developers' jobs so hard. IE 6 differs in so many ways from its neighbours, IE 5.5 and IE 7, and it is the reason IE 8 and 9 had all those settings for compatibility modes and browser modes.

It is also the reason Microsoft is going to scrap support for IE 7-9: all the hokery pokery they introduced just to handle the massive differences between IE 6 and their new standards-compliant browsers.

Anyone with a Windows computer nowadays should be on at least IE 10. Only if you're still on XP and haven't done any Windows Updates for about 5 years would you be a real IE 6 user.

Yesterday at work I ran a report on the most used Browsers that day. 

IE 6.0 came 4th!

It was below the 3 SERP BOTS I mentioned earlier and above the latest Chrome version.

On more detailed inspection with my custom logger/defence system, which analyses the behaviour of visitors rather than just assuming that an IE 6 agent means an actual human, I could see these visitors were all BOTS.

I check for things like whether they can run JavaScript, by using JavaScript to log that they can, in the same way as I do for Flash. These users had no JavaScript or Flash support, and the rate at which they went through pages was way too fast for a human controller.
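The "too fast for a human" part of that check can be sketched as a simple sliding window over one visitor's request times. To be clear, this is not my actual logger, just an illustration of the idea, and the thresholds (more than 10 pages in any 10 seconds) are assumptions you would tune for your own site.

```python
def looks_automated(timestamps, max_pages=10, window_seconds=10.0):
    """True if more than max_pages requests fall inside any window_seconds span.

    timestamps: request times for ONE visitor, as seconds (e.g. epoch floats).
    max_pages and window_seconds are hypothetical thresholds.
    """
    ordered = sorted(timestamps)
    start = 0
    for end, current in enumerate(ordered):
        # Slide the start of the window forward until it spans
        # no more than window_seconds.
        while current - ordered[start] > window_seconds:
            start += 1
        if end - start + 1 > max_pages:
            return True
    return False


if __name__ == "__main__":
    bot_hits = [i * 0.2 for i in range(20)]     # 5 pages a second
    human_hits = [i * 30.0 for i in range(20)]  # one page every 30 seconds
    print(looks_automated(bot_hits), looks_automated(human_hits))
```

In practice you would combine a rate check like this with the JavaScript and Flash tests rather than rely on it alone.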

The only reason I can think of for people using this user-agent is that they are script kiddies who have downloaded an old crawling script whose default user-agent is IE 6, and they haven't changed it.

Either they don't have the skill or they are just lazy. However, by banning all IE 6 visitors with a simple .htaccess rule like this you can reduce your traffic hugely.

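RewriteEngine On
# If the user-agent claims to be IE 5.0, 5.5 or 6.0, redirect the request to localhost
RewriteCond %{HTTP_USER_AGENT} (MSIE\s6\.0|MSIE\s5\.0|MSIE\s5\.5) [NC]
RewriteRule .* http://127.0.0.1 [L,R=302]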


This rewrite rule bans IE 5.0, 5.5 and IE 6.0 and sends the crawler back to localhost on its own machine with a 302 redirect.
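Before putting a pattern like this live it is worth checking what it will and won't catch. The sketch below runs the same regular expression, case-insensitively like the [NC] flag, against a few user-agent strings. The helper name is made up for illustration, and Python's regex engine is standing in for Apache's here, which is fine for a pattern this simple.

```python
import re

# Same pattern as the .htaccess condition; IGNORECASE mirrors the [NC] flag.
BLOCKED_AGENTS = re.compile(r"(MSIE\s6\.0|MSIE\s5\.0|MSIE\s5\.5)", re.IGNORECASE)


def is_blocked(user_agent):
    """True if the user-agent string would match the ban pattern."""
    return BLOCKED_AGENTS.search(user_agent) is not None


if __name__ == "__main__":
    agents = [
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
        "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)",
        "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)",
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    ]
    for agent in agents:
        print(is_blocked(agent), agent)
```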

No normal person would be using these agents. There may be some Intranets from the 90's still using VBScript as a client-side scripting language, but no modern site is designed with IE 6 in the designer's mind. Most sites you find will not handle IE 6 very well, so like Netscape Navigator it is an old browser and you don't need to worry about site support for it. By banning just IE 6 and below you will find your traffic going down a lot.

So two simple ideas to reduce your traffic load. Try them and see how much your site improves.

2 comments:

dj.thd said...

This article is misleading and confusing. SERP is the acronym for Search Engines Result Page (each one of the result pages given by a Search engine when you do a Search)

Rob Reid said...

It shouldn't be confusing and misleading especially as I have the clear sentence a few lines down that says:

"I always find the top IP's are the main 3 Search Engines own BOTS (SERP = Search Engine Results Page), so I call their BOTS SERP BOTS."

So I explain that SERP stands for "Search Engine Results Page" and that I call BOTS belonging to those Search Engines, SERP BOTS.

If you work with BOTS and Crawlers all the time you need to distinguish between the different kinds of automated process, as there are so many, and therefore you need to give them names to stop confusion. In my job, where I work with stopping BOTS on jobboards, we use the common terms below:
-BOT - An automated process that accesses a webpage. It could be written in any language, e.g. C# / PHP, or be a pre-existing tool, e.g. CURL or WGet.
-HACK BOTS - Automated BOTS that try to hack the site by injecting SQL/XSS into forms etc.
-EMAIL BOTS - BOTS that try to scrape and collect email addresses. Honeypots are the best defense plus using obfuscation, encoding, and images/JavaScript to output email addresses on the page.
-SCRAPERS - BOTS that steal content without your permission.
-JOB RAPISTS - BOTS that come to steal content related to jobs, you will see I coined the term on Urban Dictionary > https://nb.urbandictionary.com/define.php?term=Job+Rapist.
-CRAWLERS - BOTS that just crawl the site, collecting links, scoping your site's hierarchy and looking for new content. This is a general term for non malicious BOTS.
-SERP BOTS - Crawlers owned by search engine companies such as GoogleBOT, BingBOT, YSlurp etc.
-SEO BOTS - Crawlers owned by SEO companies or to analyse your site for stats such as links going back to you, outbound links, PR etc.
-SPAM BOTS - BOTS that try to insert comment spam into your site through forms
-PROMO BOTS - BOTS that promote and collect content for another site. They usually are bandwidth wasters as they provide no benefit to your own site, cost you money in bandwidth, cause havoc due to Twitter Rushes, have no thought put into them by developers to prevent over-crawling (e.g visiting the same page constantly even when it's been recently accessed and the content hasn't changed), and unless they provide benefit to your site e.g TwitterBOT, LinkedIN, Facebook then you may find that blocking them saves your site a lot of money. Just run a test with a Twitter Rush (post a tweet with a link to a page on your site and then view the log file to see how many BOTS come to that link straight away >> http://blog.strictly-software.com/2010/09/analysing-bot-traffic-from-twitter.html) to see how many SEO BOTS are wasting bandwidth. Banning the majority of these, many of which are owned by online newspapers or sites that collect information from multiple sites to provide personal content for their users, can also be a good way to reduce traffic.

So I am clear with what I mean in the article and I use these terms to break down the different type of BOTS I encounter.