
Saturday, 10 January 2015

2 Quick Ways To Reduce Traffic On Your System


By Strictly-Software

Slowing The 3 Major SERP BOTs Down To Reduce Traffic

If you run a site with a lot of pages, good rankings, or one that tweets out a lot, e.g. whenever a post comes online, then you will probably get most of your traffic from the big 3 crawlers.

I know that whenever I check my access_log on my server to find out the top visiting IP addresses with a command like

grep "Jan/2015" access_log | sed 's/ - -.*//' | sort | uniq -c | sort -nr | less

I always find the top IPs belong to the main 3 search engines' own BOTS (SERP = Search Engine Results Page), which is why I call them SERP BOTS:

GoogleBot: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Bing: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

Yahoo: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Unless you ask these BOTS to come NOW by doing things like refreshing your sitemap and pinging the SERPS, or tweeting out links, they will crawl your site at a time of their own choosing, and with nothing telling them to slow down they will crawl at their own speed. This could be a request every second, and if so it could cause your site performance issues.

However, think about it logically: if you post a news article or a job advert then in reality it only needs to be crawled once by each SERP BOT for it to be indexed.

You don't really want it crawled every day and on every visit by these BOTS, as the content HASN'T changed, so there is really no need for the visit.

Now I don't know of a way to tell a BOT to crawl a page only if it's new content or it has changed in some way. Even if you had a sitemap system that only listed pages that were new or edited, the BOTS would still just visit your site and crawl it.

If you cannot add rel="nofollow" to internal links that point to duplicate content (which doesn't 100% guarantee the BOT won't crawl it anyway), then there are some things you can try if you find that your site is having performance problems or is under pressure from heavy loads.


Crawl-Delay

Now this used to be supported only by BingBOT and some smaller, newer search engines like Blekko.

However, in recent months after some testing I noticed that most major SERP BOTS apart from GoogleBOT now obey the command. To get Google to reduce its crawl rate you have to set it from the Webmaster Tools control panel instead.

For instance, on one of my big news sites I have a Crawl-delay: 25 setting, and when I check my access log for those user-agents there is (roughly) a 25 second delay between each request.

Therefore extending this value will reduce the traffic load from the biggest visitors to your site, and it is easily done by adding the directive to your robots.txt file e.g.

Crawl-delay: 25
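
For example, a minimal robots.txt sketch (the 25 second value is just what I use on that site, so tune it to your own traffic, and remember GoogleBOT ignores the directive so its rate has to be set in Webmaster Tools):

User-agent: bingbot
Crawl-delay: 25

User-agent: Slurp
Crawl-delay: 25

# everything else that honours the directive
User-agent: *
Crawl-delay: 25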

Banning IE 6

Now there is no logical reason in the world for any REAL person to be using this user-agent.

This browser was probably the worst browser in history due to the quirks within it that made web developers' jobs so hard. Even just between IE 5.5 and IE 7 there are so many differences, and IE 6 is the reason IE 8 and 9 had all those settings for compatibility modes and browser modes.

It is also the reason Microsoft is going to scrap support for IE 7-9: all that hokery pokery was introduced just to handle the massive differences between IE 6 and their new standards-compliant browsers.

Anyone with a Windows computer nowadays should be on at least IE 10. Only if you're still on XP and haven't run Windows Update for about 5 years would you be a real IE 6 user.

Yesterday at work I ran a report on the most used Browsers that day. 

IE 6.0 came 4th!

It was below the 3 SERP BOTS I mentioned earlier and above the latest Chrome version.

On more detailed inspection with my custom logger/defence system, which analyses the behaviour of visitors rather than just assuming that because your agent is IE 6 you are actually human, I could see these visitors were all BOTS.

I check for things like whether they can run JavaScript by using JavaScript itself to log that they can, in the same way as I do for Flash. These users had no JavaScript or Flash support, and the rate they went through pages was way too fast for a human controller.
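
As a very rough illustration of that idea (this is not my actual logger, and /log-js.php below is just a made-up example endpoint), you can fire a request from JavaScript so that only visitors with a working JavaScript engine ever get logged:

<script type="text/javascript">
// If this request arrives the visitor can clearly run JavaScript.
// A BOT faking an IE 6 user-agent without a JS engine will never trigger it.
// "/log-js.php" is a hypothetical logging endpoint used purely for this example.
var beacon = new Image();
beacon.src = "/log-js.php?page=" + encodeURIComponent(window.location.pathname) + "&r=" + new Date().getTime();
</script>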

The only reason I can think of for people using this user-agent is that they are script kiddies who have downloaded an old crawling script whose default user-agent is IE 6 and haven't changed it.

Either they don't have the skill or they are just lazy. However, by banning all IE 6 visitors with a simple .htaccess rule like this you can reduce your traffic hugely.

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (MSIE\s6\.0|MSIE\s5\.0|MSIE\s5\.5) [NC]
RewriteRule .* http://127.0.0.1 [L,R=302]


This rewrite rule bans IE 5, 5.5 and 6.0 and sends the crawler back to localhost on its own machine with a 302 redirect.

No normal person would be using these agents. There may be some intranets from the 90's still using VBScript as a client-side scripting language, but no modern site is designed with IE 6 in the designer's mind. Most sites you find will not handle IE 6 very well anyway; like Netscape Navigator it is an old browser, so don't worry about supporting it. By banning just IE 6 and below you will find your traffic going down a lot.

So two simple ideas to reduce your traffic load. Try them and see how much your site improves.

Monday, 20 May 2013

Some clever code for SEO that won't annoy your users

Highlighting words for SEO, turning them off for the users

You might notice in the right side bar I have two options under the settings tab "Un-Bold" and "Re-Bold".

If you try them out you will see what the options do. Basically unbolding any STRONG or BOLD tags or re-bolding them again.

The reason is simple. Bolding important words in either STRONG or BOLD tags is good for SEO. Having content in H1 - H6 tags is even better, and so are links - especially if they go to relevant and related content.

I don't claim to be the first person to start bolding important keywords and long-tail sentences for SEO purposes, but I was one of the first to catch on that the benefits were great.

Too much bolding and it looks like spam; too little and you might not get much benefit. You have two areas to cater for:

1. The SERP crawlers (Googlebot, BingBot, Yandex etc) which see the original source code of the page. They will just see words wrapped in normal STRONG and BOLD tags (see for yourself).

2. The users. If a user doesn't like the mix of bolded and non-bolded wording then they can use the settings to add a class to all STRONG and BOLD tags that takes away the font-weight of the element. You would only see this in the generated source code. Running the "Re-Bold" function after the first "Un-Bold" just removes the class that took away the font-weight, returning the element to its normal bolded state.

Therefore the code is aimed at both BOTS and users, and you can see a simple test page on my main site here: example to unbold and rebold with jQuery.

I have used jQuery for this only because it was simple to write; however, it wouldn't be too hard to rewrite in plain old JavaScript.

Another extension I have lost since updating this blog format, but which would be easy to add, is the use of a JavaScript-created cookie to store the user's last preference so that they don't have to keep clicking the "Un-Bold" option every time they visit the site (see the sketch after the code below).

As Blogger won't let you add server-side code to the blog you will need to do it all with JavaScript, but with the new Blogger layout (which I love by the way - unlike Google+) it is easy to add JavaScript (external and internal) plus CSS sections and link blocks to control the actions of your functions.

An example of the code is below and hopefully you can see how easy it is to use.

First I load in the latest version of jQuery from Google.

Then I use selectors to ensure I am only targeting the main content part of the page before I add or remove classes to STRONG or BOLD tags.

<style type="text/css">
.unbold{
 font-weight:normal;
}
</style>

<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js"></script>

<script>
// Add the "unbold" class to every STRONG and B tag inside the post content
function unbold()
{
 $(".entry-content").each(function(){
  $("strong",this).addClass("unbold");
  $("b",this).addClass("unbold");
 });
}

// Remove the class again to restore the normal bolded state
function bold()
{
 $(".entry-content").each(function(){
  $("strong",this).removeClass("unbold");
  $("b",this).removeClass("unbold");
 });
}
</script>
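
As a rough sketch of the cookie idea mentioned above (this is not the code I lost, just a plain example that re-uses the functions above; the cookie name "unboldpref" is made up), you could store the visitor's choice for 30 days and re-apply it on page load:

<script>
// Save the visitor's last choice ("unbold" or "bold") in a cookie for 30 days.
// "unboldpref" is just an example cookie name.
function savePreference(pref)
{
 var expires = new Date();
 expires.setDate(expires.getDate() + 30);
 document.cookie = "unboldpref=" + pref + "; expires=" + expires.toUTCString() + "; path=/";
}

// On page load re-apply the "Un-Bold" choice if the cookie says so.
$(function(){
 if(document.cookie.indexOf("unboldpref=unbold") !== -1){
  unbold();
 }
});
</script>

You would then just call savePreference("unbold") and savePreference("bold") at the end of the unbold() and bold() functions.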

So not only are you benefiting from SEO tweaks but you are letting your users turn it off if they feel it's a bit too much. Hey Presto!

Sunday, 3 March 2013

Stop BOTS and Scrapers from bringing your site down

Blocking Traffic using WebMin on LINUX at the Firewall

If you have read my survival guides on Wordpress you will see that you have to do a lot of work just to get a stable and fast site due to all the code that is included.

The Wordpress Survival Guide

  1. Wordpress Basics - Tools of the trade, useful commands, handling emergencies, banning bad traffic.
  2. Wordpress Performance - Caching, plugins, bottlenecks, Indexing, turning off features
  3. Wordpress Security - plugins, htaccess rules, denyhosts.


For instance, not only do you have to handle badly written plugins that could contain security holes and slow the performance of your site, but the core WordPress codebase is, in my opinion, a very badly written piece of code.

However they are slowly learning, and I remember that only a few versions back the home page was running over 200 queries, most of them returning single rows.

For example, if you used a plugin like Debug Queries you would see lots of SELECT statements on your homepage, each returning a single row for every post shown, as well as separate queries for the META data, categories and tags associated with each post.

So instead of one query that returned the whole data set for the page (post data, category, tag and meta data), the page would be filled with lots of single queries like this.

SELECT wp_posts.* FROM wp_posts WHERE ID IN (36800)

However they have improved their code, and a recent check of one of my sites showed that although they are still using separate queries for post, category/tag and meta data, they are at least getting all of the records in one go e.g.

SELECT wp_posts.* FROM wp_posts WHERE ID IN (36800,36799,36798,36797,36796)

So the total number of queries has dropped, which aids performance. However, in my opinion they could write one query for the whole page that returned all the data they needed, and hopefully in a future version they will.
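
As a rough illustration only (a sketch against the standard WordPress tables, not a query WordPress actually runs), a single combined query for those posts plus their terms and meta data might look something like this:

SELECT p.*, t.name AS term_name, tt.taxonomy, pm.meta_key, pm.meta_value
FROM wp_posts p
LEFT JOIN wp_postmeta pm ON pm.post_id = p.ID
LEFT JOIN wp_term_relationships tr ON tr.object_id = p.ID
LEFT JOIN wp_term_taxonomy tt ON tt.term_taxonomy_id = tr.term_taxonomy_id
LEFT JOIN wp_terms t ON t.term_id = tt.term_id
WHERE p.ID IN (36800,36799,36798,36797,36796);

The trade-off is that the result set comes back denormalised (the post columns are repeated for every term and meta row), so the code has to split it back up again, which is presumably one reason WordPress keeps the queries separate.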

However, one of the things that will kill a site like WordPress is the number of BOTS that hit you all day long. These could be good BOTS like GoogleBOT and BingBOT, which crawl your site to work out where it should appear in their search engines, or they could be social media BOTS that follow every link Twitter shows, or scrapers trying to steal your data.

One thing you can try to stop legitimate BOTS like Google and BING from hammering your site is to set up a Webmaster Tools account in Google and change the crawl rate to a much slower one.

You can also do the same with BING and their Webmaster Tools account. However, BING apparently also respects the robots.txt Crawl-delay command e.g.


Crawl-delay: 3


This supposedly tells BOTS that respect robots.txt commands to wait 3 seconds between each crawl. However, as far as I know only BING supports this at the moment, and it would be nice if more SERP BOTS did in future.

If you want a basic C# robots.txt parser that will tell you whether your agent can crawl a page on a site and extract any Sitemap command, then check out http://www.strictly-software.com/robotstxt. If you wanted to extend it to handle the Crawl-delay command it wouldn't be hard (line 175 in Robot.cs) to add, so that you could extract and respect it when crawling yourself.

Obviously you want all the SERP BOTS like GoogleBot and Bingbot to crawl you, but there are so many social media BOTS and spammers out there nowadays that they can literally hammer your site into the ground no matter how many caching plugins and .htaccess rules you put in to return 403 codes.

The best way to deal with traffic you don't want to hit your site is as high up the chain as possible. 

Just leaving WordPress to deal with it means the overhead of PHP code running, include files being loaded, regular expressions being run to test for harmful parameters, and so on.

Moving it up to the .htaccess level is better, but it still means your webserver has to process all the .htaccess rules in your file to decide whether or not to let the traffic through.

Therefore if you can move the worst offenders up to your firewall it will save any code below that level from running, and the TCP traffic is stopped before any regular expressions have to be run elsewhere.

Therefore what I tend to do is follow this process:


  • Use the Wordpress plugin "Limit Login Attempts" to log people trying to log in (without permission) to my WordPress website. This will log all the IP addresses that have attempted and failed, as well as those that have been blocked. It is a good starting list for your DENY HOSTS IP ban table.
  • Check the same IPs, as well as using the command: tail -n 10000 access_log|cut -f 1 -d ' '|sort|uniq -c|sort -nr|more  to see which IP addresses are visiting my site the most each day.
  • I then check the log files either in WebMin or in an SSH tool like PUTTY to see how many times they have been trying to visit my site. If I see lots of HEAD or POST/GET requests within a few seconds from the same IP I will investigate them further. I will do an nslookup and a whois and see how many times the IP address has been visiting the site (see the example commands after this list).
  • If they look suspicious, e.g. the same IP with multiple user-agents, or lots of requests within a short time period, I will consider banning them. Anyone who is using IE 6 as a user-agent is a good suspect (who uses IE 6 anymore apart from scrapers and hackers!)
  • I will then add them to my .htaccess file and return a [F] (403 status code) to all their requests.
  • If they keep hammering my site I will then move them from the DENY list in my .htaccess file and add them to my firewall and Deny Hosts table.
  • The aim is to move the most troublesome IPs and BOTS up the chain so they cause the least damage to your site.
  • Using PHP to block access is not good as it consumes memory and CPU; the .htaccess file is better but still requires APACHE to run the regular expressions on every DENY or [F] command. Therefore the most troublesome users should be moved up to the firewall level to cause the least server usage on your system.
  • Regularly shut down your APACHE server and run MySQL's REPAIR and OPTIMIZE options to de-frag your table indexes and ensure the tables are performing as well as possible. I have many articles on this site about other free tools which can help you increase your WordPress site's performance.
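
For example, digging into a suspect address from the access log (using 208.115.224.16 purely as a made-up example IP) boils down to a few commands:

nslookup 208.115.224.16
whois 208.115.224.16
# how many requests has it made in this log?
grep -c "208.115.224.16" access_log
# what is it actually requesting?
grep "208.115.224.16" access_log | tail -n 20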

In More Detail

You should regularly check the access log files for the IPs hitting your site the most, check them out with a reverse DNS tool to see where they come from, and if they are of no benefit to you (e.g. not a SERP or social media agent you want hitting your site) then add them to your .htaccess file under the DENY commands e.g.

order allow,deny
deny from 208.115.224.0/24
deny from 37.9.53.71

Then if I find they are still hammering my site after a week or a month of getting (and ignoring) 403 responses, I add them to the firewall in WebMin.


Blocking Traffic at the Firewall level

If you use LINUX and have WebMin installed it is pretty easy to do.

Just go to the WebMin panel and under the "Networking" menu there is an item called "Linux Firewall". Select that and a panel will open up with all the current rules for the IP addresses, ports and packets that are allowed or denied access to your server.

Choose the "Add Rule" command, or if you have an existing Deny rule set up it's quicker to just clone it and change the IP address. If you don't have any set up yet then you just need to do the following.

In the window that opens up just follow these steps to block an IP address from accessing your server.

In the Chain and Action Details Panel at the top:


Add a Rule Comment such as "Block 71.32.122.222 Some Horrible BOT"
In the Action to take option select "Drop"
In the Reject with ICMP Type select "Default"

In Condition Details Panel:

In source address of network select "Equals" and then add the IP address you want to ban e.g 71.32.122.222
In network protocol select "Equals" and then "TCP"

Hit "Save"

The rule should now be saved and your firewall should now ban all TCP traffic from that IP address by dropping any packets it receives as soon as it gets them.
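
If you prefer the command line to the WebMin GUI, the same sort of rule can be added directly with iptables (a sketch using the example IP from the steps above):

iptables -I INPUT -s 71.32.122.222 -p tcp -j DROP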

Watch as your performance improves and the number of 403 status codes in your access files drops - until the next horrible social media BOT comes on the scene and tries scraping all your data.

IMPORTANT NOTE

WebMin isn't very clear on this and I found out the hard way by noticing that IP addresses I had supposedly blocked were still appearing in my access log.

You need to make sure all your DENY RULES are above the default ALLOW rules in the table WebMin will show you.

Therefore your rules to block bad bots, and IP addresses that are hammering away at your server - which you can check in PUTTY with a command like this:
tail -n 10000 access_log|cut -f 1 -d ' '|sort|uniq -c|sort -nr|more 

Should be put above all your other commands e.g:


Drop If protocol is TCP and source is 91.207.8.110
Drop If protocol is TCP and source is 95.122.101.52
Accept If input interface is not eth0
Accept If protocol is TCP and TCP flags ACK (of ACK) are set
Accept If state of connection is ESTABLISHED
Accept If state of connection is RELATED
Accept If protocol is ICMP and ICMP type is echo-reply
Accept If protocol is ICMP and ICMP type is destination-unreachable
Accept If protocol is ICMP and ICMP type is source-quench
Accept If protocol is ICMP and ICMP type is parameter-problem
Accept If protocol is ICMP and ICMP type is time-exceeded
Accept If protocol is TCP and destination port is auth
Accept If protocol is TCP and destination port is 443
Accept If protocol is TCP and destination ports are 25,587
Accept If protocol is ICMP and ICMP type is echo-request
Accept If protocol is TCP and destination port is 80
Accept If protocol is TCP and destination port is 22
Accept If protocol is TCP and destination ports are 143,220,993,21,20
Accept If protocol is TCP and destination port is 10000


If you have added loads at the bottom then you might need to copy the IPTables list out to a text editor, change the order by putting all the DENY rules at the top, then re-save the whole IPTables list to your server and re-apply the firewall rules.

Or you can use the arrows by the side of each rule to move the rule up or down in the table - which is a very laborious task if you have lots of rules.
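
If you are comfortable in the shell, an alternative sketch is to insert the DROP rule at an explicit position so it sits above the ACCEPT rules, then check the order:

# insert the DROP rule at position 1 of the INPUT chain
iptables -I INPUT 1 -s 91.207.8.110 -p tcp -j DROP
# confirm the new rule order
iptables -L INPUT -n --line-numbers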

So if you find yourself still being hammered by IP addresses you thought you had blocked then check the order of your commands in your firewall and make sure they are at the top, NOT the bottom, of your list of IP addresses.