Wednesday, 12 November 2014

Twitter Rush caused by Tweet BOTs visiting your site

Twitter Traffic can cause a major hit on your server

If you are using Twitter to post tweets to whenever you blog or post an article you should know that a large number of BOTS will immediately hit your site as soon as they see the link.

This is what I call a Twitter Rush as you are causing a rush of traffic from posting a link on Twitter.

I did a test some months back and I like to regularly test how many hits I get whenever I post a link so that I can weed out the chaff from the wheat and set up rules to ban any BOTS I think are just wasting me money.

Most of these BOTS are also stupid.

If you post the same link to multiple Twitter accounts e.g by using my Wordpress plugin - Strictly Tweetbot then the same BOT will come to the same page multiple times.

Why? Who wrote such crap code and why don't they check before hitting a site that they haven't just crawled that link. I cannot believe the developers at Yahoo cannot write a BOT that works out they have just crawled a page before doing it two more times.

Some BOTS are obviously needed such as the major SERP search engines e.g Google or Bing but many are start up "social media" search engines and other such content scrapers, spammers. hackbots and bandwidth wasters.

Because of this I now 403 a lot of these BOTS or even send them back to the IP address they came from with an ISAPI rewrite rule as they don't provide me with any benefit and just steal bandwidth and cost me money.

RewriteCond %{HTTP_USER_AGENT} (?:Spider|MJ12bot|seomax|atomic|collect|e?mail|magnet|reaper|tools\.ua\.random|siphon|sweeper|harvest|(?:microsoft\surl\scontrol)|wolf) [NC]
RewriteRule .* http://%{REMOTE_ADDR} [L,R=301]

However if you are using my Strictly Tweet BOT plugin which can post multiple tweets to the same or multiple accounts then the new version allows you to pre-post a page which hopefully gets cached by the caching plugin you should be using (WP SUPER CACHE or W3 TOTAL CACHE etc) before the article is made public and the BOT looking at Twitter for URL's to scrape can get to it.

The aim is to get the page cached BEFORE multiple occurrences of BOTS hit the page. If the page is already cached then the load on your server should be a lot less than if every BOT's loading of the page was trying to cache the page at the same time (due to the quickness of their visit). 

However if you are auto blogging and using my TweetBOT you might be interested in Strictly TweetBOT PRO as it had extra features for people who are tweeting to multiple accounts or multiple tweet in different formats to the same account. These new features are all designed to reduce the hit from a Twitter Rush.

The paid for version allows you to do the following:

  • Make an HTTP request to the new post before Tweeting anything. If you have a caching plugin on your site then this should put the new post into the cache so that when the Twitter Rush comes they all hit a cached page and not a dynamically created one.
  • Add a query-string to the URL of the new post when making an HTTP request to aid caching. Some plugins like WP Super Cache allow you to force an uncached page to be loaded with a querystring. So this will enable the new page to be loaded and re-cached.
  • Delay tweeting for N seconds after making the HTTP request to cache your post. This will help you ensure that the post is in the cache before the Twitter Rush.
  • Add a delay between each Tweet that is sent out. If you are tweeting to multiple accounts you will cause multiple Twitter Rushes. Therefore staggering the hits aids performance.


Buy Now


I did a test this morning to see how much traffic was generated by a test post. I got almost 50 responses within 2 seconds!

50.18.132.28 - - [30/Nov/2011:07:04:41 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 200 471 "-" "bitlybot"
50.57.137.74 - - [30/Nov/2011:07:04:43 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "EventMachine HttpClient"
50.57.137.74 - - [30/Nov/2011:07:04:43 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "EventMachine HttpClient"
184.72.47.46 - - [30/Nov/2011:07:04:43 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "UnwindFetchor/1.0 (+http://www.gnip.com/)"
204.236.150.14 - - [30/Nov/2011:07:04:44 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 403 471 "-" "JS-Kit URL Resolver, http://js-kit.com/"
50.18.121.55 - - [30/Nov/2011:07:04:45 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "UnwindFetchor/1.0 (+http://www.gnip.com/)"
184.72.47.71 - - [30/Nov/2011:07:04:47 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "UnwindFetchor/1.0 (+http://www.gnip.com/)"
50.18.121.55 - - [30/Nov/2011:07:04:48 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "UnwindFetchor/1.0 (+http://www.gnip.com/)"
184.72.47.71 - - [30/Nov/2011:07:05:11 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "UnwindFetchor/1.0 (+http://www.gnip.com/)"
199.59.149.31 - - [30/Nov/2011:07:04:43 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 200 - "-" "Twitterbot/0.1"
107.20.160.159 - - [30/Nov/2011:07:04:43 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 200 - "-" "http://unshort.me/about.html"
46.20.47.43 - - [30/Nov/2011:07:05:11 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 403 369 "-" "Mozilla/5.0 (compatible"
199.59.149.165 - - [30/Nov/2011:07:04:43 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 200 28862 "-" "Twitterbot/1.0"
173.192.79.101 - - [30/Nov/2011:07:05:11 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 403 471 "-" "-"
50.18.121.55 - - [30/Nov/2011:07:05:11 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "UnwindFetchor/1.0 (+http://www.gnip.com/)"
46.20.47.43 - - [30/Nov/2011:07:05:11 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 403 369 "-" "Mozilla/5.0 (compatible"
199.59.149.31 - - [30/Nov/2011:07:05:11 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 200 - "-" "Twitterbot/0.1"
65.52.0.229 - - [30/Nov/2011:07:05:11 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 200 28863 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
66.228.54.132 - - [30/Nov/2011:07:04:43 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 200 106183 "-" "InAGist URL Resolver (http://inagist.com)"
199.59.149.165 - - [30/Nov/2011:07:05:11 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 200 28863 "-" "Twitterbot/1.0"
107.20.42.241 - - [30/Nov/2011:07:05:11 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 200 - "-" "PostRank/2.0 (postrank.com)"
65.52.0.229 - - [30/Nov/2011:07:05:11 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 200 28862 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
65.52.0.229 - - [30/Nov/2011:07:05:11 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 200 28862 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
110.169.128.180 - - [30/Nov/2011:07:05:11 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 200 28862 "http://twitter.com/" "User-Agent:Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-us) AppleWebKit/533.18.1"
74.112.131.128 - - [30/Nov/2011:07:05:15 +0000] "GET /some-test-url-I-posted/ HTTP/1.0" 200 106203 "-" "Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly/) Gecko/2009032608 Firefox/3.0.8"
74.112.131.131 - - [30/Nov/2011:07:05:16 +0000] "GET /some-test-url-I-posted/ HTTP/1.0" 200 106203 "-" "Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly/) Gecko/2009032608 Firefox/3.0.8"
74.112.131.127 - - [30/Nov/2011:07:05:16 +0000] "GET /some-test-url-I-posted/ HTTP/1.0" 200 106203 "-" "Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly/) Gecko/2009032608 Firefox/3.0.8"
74.112.131.128 - - [30/Nov/2011:07:05:16 +0000] "GET /some-test-url-I-posted/ HTTP/1.0" 200 106203 "-" "Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly/) Gecko/2009032608 Firefox/3.0.8"
74.112.131.131 - - [30/Nov/2011:07:05:17 +0000] "GET /some-test-url-I-posted/ HTTP/1.0" 200 106203 "-" "Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly/) Gecko/2009032608 Firefox/3.0.8"
74.112.131.128 - - [30/Nov/2011:07:05:17 +0000] "GET /some-test-url-I-posted/ HTTP/1.0" 200 106203 "-" "Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly/) Gecko/2009032608 Firefox/3.0.8"
107.20.78.114 - - [30/Nov/2011:07:05:18 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "MetaURI API/2.0 +metauri.com"
65.52.54.253 - - [30/Nov/2011:07:05:19 +0000] "GET /2011/11/hypocrisy-rush-drug-test-welfare-benefit-recipients HTTP/1.1" 403 470 "-" "-"
107.20.78.114 - - [30/Nov/2011:07:05:28 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "MetaURI API/2.0 +metauri.com"
65.52.62.87 - - [30/Nov/2011:07:05:30 +0000] "GET /2011/11/hypocrisy-rush-drug-test-welfare-benefit-recipients HTTP/1.1" 403 470 "-" "-"
107.20.78.114 - - [30/Nov/2011:07:05:43 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "MetaURI API/2.0 +metauri.com"
107.20.78.114 - - [30/Nov/2011:07:06:15 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "MetaURI API/2.0 +metauri.com"
199.59.149.165 - - [30/Nov/2011:07:04:45 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 200 28860 "-" "Twitterbot/1.0"
107.20.160.159 - - [30/Nov/2011:07:04:59 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 200 - "-" "http://unshort.me/about.html"
107.20.78.114 - - [30/Nov/2011:07:06:15 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "MetaURI API/2.0 +metauri.com"
199.59.149.31 - - [30/Nov/2011:07:04:46 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 200 - "-" "Twitterbot/0.1"
107.20.42.241 - - [30/Nov/2011:07:05:01 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 200 - "-" "PostRank/2.0 (postrank.com)"
107.20.42.241 - - [30/Nov/2011:07:05:07 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 200 - "-" "PostRank/2.0 (postrank.com)"
107.20.78.114 - - [30/Nov/2011:07:06:15 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "MetaURI API/2.0 +metauri.com"
107.20.78.114 - - [30/Nov/2011:07:06:17 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "MetaURI API/2.0 +metauri.com"
107.20.78.114 - - [30/Nov/2011:07:06:17 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "MetaURI API/2.0 +metauri.com"
50.16.51.20 - - [30/Nov/2011:07:06:21 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 200 - "-" "Summify (Summify/1.0.1; +http://summify.com)"
74.97.60.113 - - [30/Nov/2011:07:07:11 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 200 28889 "-" "Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101 Firefox/8.0"



Buy Now

5 comments:

Alec Berg said...

I've seen the same behavior to my http://Listi.net website. I found your blog on Google via one of the worst ip's uses a browser string of "EventMachine HttpClient" and sent my site 30 requests within a few seconds (ip 50.57.52.177) It took me a while to figure out what was happening. Some of the bots send the request 5 times at at the same time. I just posted to twitter and I got 69 hits from bots in a very short time frame. That doesn't include a number of bots that I've banned. I solved most of my problem by caching the web page I post and banning the worst behaving bots, where I was previously running queries to construct the page and it really hung my site.

Alec Berg said...

I've seen the same behavior to my http://Listi.net website. It took me a while to figure out what was happening. Some of the bots send the request 10-20 times within a few seconds. The worst bot is sending a browser string of "EventMachine HttpClient" and the ip addresses are from 50.57.52.177, 50.57.137.74. I just posted to twitter and I got 69 hits from bots in a very short time frame. That doesn't include a number of bots that I've banned. I solved most of my problem by caching the web page I post and banning the worst behaving bots. I was previously running queries to construct the page and it really hung my site. I feel a little better knowing someone else is having the same problem.

Rob Reid said...

I ban over 60% of my traffic now. Unless you want to sell in Russia or China then there is no need to let Yandex or Baiduspider crawl your site. Plus just adding them to your robots.txt won't do anything they just ignore it. Unless the BOTS give you value by sending you traffic later on I.e a job aggregator BOT then I just ban them.

Anonymous said...

Appreciating the hard work you put into your website and detailed information you present.
It's great to come across a blog every once in a while that isn't the same unwanted rehashed material.
Wonderful read! I've saved your site and I'm adding your RSS feeds to my
Google account.

Feel free to surf to my page - visit this
link ()

Mike Otgaar said...

Thanks for the user agent list. These bots are resource wasters.