Saturday, 3 March 2012

Twitter Rush caused by Tweet BOTs visiting your site

Twitter Traffic can cause a major hit on your server

If you are using Twitter to post tweets to whenever you blog or post an article you should know that a large number of BOTS will immediately hit your site as soon as they see the link.

I did a test some months back and I like to regularly test how many hits I get whenever I post a link so that I can weed out the chaff from the wheat and set up rules to ban any BOTS I think are just wasting me money.

Most of these BOTS are also stupid.

If you post the same link to multiple Twitter accounts e.g by using my Wordpress plugin - Strictly Tweetbot then the same BOT will come to the same page multiple times. Why? Who wrote such shit code and why don't they check before hitting a site that they haven't just crawled that link.

Some BOTS are obviously needed such as the major SERP search engines e.g Google or Bing but many are start up "social media" search engines and other such content scrapers.

I did a test this morning to see how much traffic was generated by a test post. I got almost 50 responses within 2 seconds.

Note that I 403 a lot of these BOTS as they don't provide me with any benefit and just steal bandwidth and cost me money.

However if you are using my Strictly Tweet BOT plugin which can post multiple tweets to the same or multiple accounts then the new version allows you to pre-post a page which hopefully gets cached by the caching plugin you should be using (WP SUPER CACHE or W3 TOTAL CACHE etc) before the article is made public and the BOT looking at Twitter for URL's to scrape can get to it.

The aim is to get the page cached BEFORE multiple occurrences of BOTS hit the page. If the page is already cached then the load on your server should be a lot less than if every BOT's loading of the page was trying to cache the page at the same time (due to the quickness of their visit). Read about it more here.



50.18.132.28 - - [30/Nov/2011:07:04:41 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 200 471 "-" "bitlybot"
50.57.137.74 - - [30/Nov/2011:07:04:43 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "EventMachine HttpClient"
50.57.137.74 - - [30/Nov/2011:07:04:43 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "EventMachine HttpClient"
184.72.47.46 - - [30/Nov/2011:07:04:43 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "UnwindFetchor/1.0 (+http://www.gnip.com/)"
204.236.150.14 - - [30/Nov/2011:07:04:44 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 403 471 "-" "JS-Kit URL Resolver, http://js-kit.com/"
50.18.121.55 - - [30/Nov/2011:07:04:45 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "UnwindFetchor/1.0 (+http://www.gnip.com/)"
184.72.47.71 - - [30/Nov/2011:07:04:47 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "UnwindFetchor/1.0 (+http://www.gnip.com/)"
50.18.121.55 - - [30/Nov/2011:07:04:48 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "UnwindFetchor/1.0 (+http://www.gnip.com/)"
184.72.47.71 - - [30/Nov/2011:07:05:11 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "UnwindFetchor/1.0 (+http://www.gnip.com/)"
199.59.149.31 - - [30/Nov/2011:07:04:43 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 200 - "-" "Twitterbot/0.1"
107.20.160.159 - - [30/Nov/2011:07:04:43 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 200 - "-" "http://unshort.me/about.html"
46.20.47.43 - - [30/Nov/2011:07:05:11 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 403 369 "-" "Mozilla/5.0 (compatible"
199.59.149.165 - - [30/Nov/2011:07:04:43 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 200 28862 "-" "Twitterbot/1.0"
173.192.79.101 - - [30/Nov/2011:07:05:11 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 403 471 "-" "-"
50.18.121.55 - - [30/Nov/2011:07:05:11 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "UnwindFetchor/1.0 (+http://www.gnip.com/)"
46.20.47.43 - - [30/Nov/2011:07:05:11 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 403 369 "-" "Mozilla/5.0 (compatible"
199.59.149.31 - - [30/Nov/2011:07:05:11 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 200 - "-" "Twitterbot/0.1"
65.52.0.229 - - [30/Nov/2011:07:05:11 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 200 28863 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
66.228.54.132 - - [30/Nov/2011:07:04:43 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 200 106183 "-" "InAGist URL Resolver (http://inagist.com)"
199.59.149.165 - - [30/Nov/2011:07:05:11 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 200 28863 "-" "Twitterbot/1.0"
107.20.42.241 - - [30/Nov/2011:07:05:11 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 200 - "-" "PostRank/2.0 (postrank.com)"
65.52.0.229 - - [30/Nov/2011:07:05:11 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 200 28862 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
65.52.0.229 - - [30/Nov/2011:07:05:11 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 200 28862 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
110.169.128.180 - - [30/Nov/2011:07:05:11 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 200 28862 "http://twitter.com/" "User-Agent:Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-us) AppleWebKit/533.18.1"
74.112.131.128 - - [30/Nov/2011:07:05:15 +0000] "GET /some-test-url-I-posted/ HTTP/1.0" 200 106203 "-" "Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly/) Gecko/2009032608 Firefox/3.0.8"
74.112.131.131 - - [30/Nov/2011:07:05:16 +0000] "GET /some-test-url-I-posted/ HTTP/1.0" 200 106203 "-" "Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly/) Gecko/2009032608 Firefox/3.0.8"
74.112.131.127 - - [30/Nov/2011:07:05:16 +0000] "GET /some-test-url-I-posted/ HTTP/1.0" 200 106203 "-" "Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly/) Gecko/2009032608 Firefox/3.0.8"
74.112.131.128 - - [30/Nov/2011:07:05:16 +0000] "GET /some-test-url-I-posted/ HTTP/1.0" 200 106203 "-" "Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly/) Gecko/2009032608 Firefox/3.0.8"
74.112.131.131 - - [30/Nov/2011:07:05:17 +0000] "GET /some-test-url-I-posted/ HTTP/1.0" 200 106203 "-" "Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly/) Gecko/2009032608 Firefox/3.0.8"
74.112.131.128 - - [30/Nov/2011:07:05:17 +0000] "GET /some-test-url-I-posted/ HTTP/1.0" 200 106203 "-" "Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly/) Gecko/2009032608 Firefox/3.0.8"
107.20.78.114 - - [30/Nov/2011:07:05:18 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "MetaURI API/2.0 +metauri.com"
65.52.54.253 - - [30/Nov/2011:07:05:19 +0000] "GET /2011/11/hypocrisy-rush-drug-test-welfare-benefit-recipients HTTP/1.1" 403 470 "-" "-"
107.20.78.114 - - [30/Nov/2011:07:05:28 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "MetaURI API/2.0 +metauri.com"
65.52.62.87 - - [30/Nov/2011:07:05:30 +0000] "GET /2011/11/hypocrisy-rush-drug-test-welfare-benefit-recipients HTTP/1.1" 403 470 "-" "-"
107.20.78.114 - - [30/Nov/2011:07:05:43 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "MetaURI API/2.0 +metauri.com"
107.20.78.114 - - [30/Nov/2011:07:06:15 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "MetaURI API/2.0 +metauri.com"
199.59.149.165 - - [30/Nov/2011:07:04:45 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 200 28860 "-" "Twitterbot/1.0"
107.20.160.159 - - [30/Nov/2011:07:04:59 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 200 - "-" "http://unshort.me/about.html"
107.20.78.114 - - [30/Nov/2011:07:06:15 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "MetaURI API/2.0 +metauri.com"
199.59.149.31 - - [30/Nov/2011:07:04:46 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 200 - "-" "Twitterbot/0.1"
107.20.42.241 - - [30/Nov/2011:07:05:01 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 200 - "-" "PostRank/2.0 (postrank.com)"
107.20.42.241 - - [30/Nov/2011:07:05:07 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 200 - "-" "PostRank/2.0 (postrank.com)"
107.20.78.114 - - [30/Nov/2011:07:06:15 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "MetaURI API/2.0 +metauri.com"
107.20.78.114 - - [30/Nov/2011:07:06:17 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "MetaURI API/2.0 +metauri.com"
107.20.78.114 - - [30/Nov/2011:07:06:17 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 403 - "-" "MetaURI API/2.0 +metauri.com"
50.16.51.20 - - [30/Nov/2011:07:06:21 +0000] "HEAD /some-test-url-I-posted/ HTTP/1.1" 200 - "-" "Summify (Summify/1.0.1; +http://summify.com)"
74.97.60.113 - - [30/Nov/2011:07:07:11 +0000] "GET /some-test-url-I-posted/ HTTP/1.1" 200 28889 "-" "Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101 Firefox/8.0"

5 comments:

Alec Berg said...

I've seen the same behavior to my http://Listi.net website. I found your blog on Google via one of the worst ip's uses a browser string of "EventMachine HttpClient" and sent my site 30 requests within a few seconds (ip 50.57.52.177) It took me a while to figure out what was happening. Some of the bots send the request 5 times at at the same time. I just posted to twitter and I got 69 hits from bots in a very short time frame. That doesn't include a number of bots that I've banned. I solved most of my problem by caching the web page I post and banning the worst behaving bots, where I was previously running queries to construct the page and it really hung my site.

Alec Berg said...

I've seen the same behavior to my http://Listi.net website. It took me a while to figure out what was happening. Some of the bots send the request 10-20 times within a few seconds. The worst bot is sending a browser string of "EventMachine HttpClient" and the ip addresses are from 50.57.52.177, 50.57.137.74. I just posted to twitter and I got 69 hits from bots in a very short time frame. That doesn't include a number of bots that I've banned. I solved most of my problem by caching the web page I post and banning the worst behaving bots. I was previously running queries to construct the page and it really hung my site. I feel a little better knowing someone else is having the same problem.

R Reid said...

I ban over 60% of my traffic now. Unless you want to sell in Russia or China then there is no need to let Yandex or Baiduspider crawl your site. Plus just adding them to your robots.txt won't do anything they just ignore it. Unless the BOTS give you value by sending you traffic later on I.e a job aggregator BOT then I just ban them.

Anonymous said...

Appreciating the hard work you put into your website and detailed information you present.
It's great to come across a blog every once in a while that isn't the same unwanted rehashed material.
Wonderful read! I've saved your site and I'm adding your RSS feeds to my
Google account.

Feel free to surf to my page - visit this
link ()

Mike Otgaar said...

Thanks for the user agent list. These bots are resource wasters.