Wednesday 29 September 2010

Analysing Bot Traffic from a Twitter Rush

Twitter Rush - Bot Traffic from Twitter

I blogged the other day about a link I found that listed the traffic that visits a site whenever a link to that site is posted on Twitter.

It seems that if you post a Tweet containing a link you cause a Twitter Rush: numerous BOTS, social media sites, SERPS and other sites notice the new link and all visit your site at the same time.

This is one reason I created Strictly TweetBOT PRO, which differs from my free version of Strictly TweetBOT in that it allows you to do the following:


  • Make an HTTP request to the new post before Tweeting anything. If you have a caching plugin on your site then this should put the new post into the cache, so that when the Twitter Rush comes the bots all hit a cached page and not a dynamically created one.
  • Add a query-string to the URL of the new post when making an HTTP request to aid caching. Some plugins like WP Super Cache allow you to force an un-cached page to be loaded with a query-string. This enables the new page to be loaded and re-cached.
  • Delay tweeting for N seconds after making the HTTP request to cache your post. This will help you ensure that the post is in the cache before the Twitter Rush.
  • Add a delay between each Tweet that is sent out. If you are tweeting to multiple accounts you will cause multiple Twitter Rushes. Therefore staggering the hits aids performance.
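The four options above boil down to a simple sequence: request the post, wait, then tweet to each account with a pause in between. The Python sketch below only illustrates that ordering and is not the plugin's code. The names `warm_then_tweet`, `http_get` and `send_tweet`, the `refresh=1` query-string and the default delays are all made up for the example, and `send_tweet` stands in for whatever actually posts to Twitter.

```python
import time
import urllib.request


def http_get(url, timeout=10):
    """Request a URL so that a caching plugin can store the rendered page."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def warm_then_tweet(post_url, accounts, send_tweet, *, cache_param="",
                    cache_delay=5, tweet_gap=30,
                    fetch=http_get, sleep=time.sleep):
    """Warm the cache, wait, then tweet to each account with a gap between."""
    # 1. Hit the new post first, optionally with a query-string that asks
    #    the caching plugin to rebuild the page (hypothetical parameter).
    warm_url = post_url + ("?" + cache_param if cache_param else "")
    fetch(warm_url)

    # 2. Give the cache time to be written before any bots arrive.
    sleep(cache_delay)

    # 3. Tweet to each account in turn, pausing between accounts so the
    #    rushes are staggered rather than stacked on top of each other.
    for position, account in enumerate(accounts):
        if position:
            sleep(tweet_gap)
        send_tweet(account, post_url)
```

Passing `fetch` and `sleep` in as parameters is just a convenience so the sequence can be tried out without touching the network or actually waiting.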

I have been carrying out performance tests on one of my LAMP sites and have been analysing this sort of data in some depth. I thought I would post an update with the actual traffic my own site receives when a link is Tweeted, which is listed below.

A few interesting points:

1. This traffic is instantaneous: the first item in the log file visiting the site has exactly the same time stamp as the WordPress URL that submitted the tweets to my Twitter account.

2. Yahoo seems to duplicate requests. This one tweet to a single post resulted in 3 requests from Yahoo's Slurp BOT, and they originated from two different IP addresses.

3. These bots are not very clever and don't seem to log the URLs they visit to prevent duplicate requests. Not only does Yahoo duplicate requests for a single account, but if you post the same link to multiple Twitter accounts you will get all this traffic for each account.

For example, when I posted the same link to 3 different Twitter accounts I received 57 requests (19 * 3). You would think these Bots would be clever enough to realise that they only need to visit a link once every so often, no matter which account posted it.

It just serves to prove my theory that most Twitter traffic is BOT related. 

BOTS following BOTS and Re-Tweeting and following traffic generated by other BOTS.

  • 128.242.241.133 - - [29/Sep/2010:21:06:45 +0000] "HEAD /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 200 - "-" "Twitterbot/0.1"
  • 216.24.142.47 - - [29/Sep/2010:21:06:47 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.0" 200 26644 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 OneRiot/1.0 (http://www.oneriot.com)"
  • 204.236.254.109 - - [29/Sep/2010:21:06:48 +0000] "HEAD /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 200 - "-" "PostRank/2.0 (postrank.com)"
  • 67.195.112.56 - - [29/Sep/2010:21:06:46 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.0" 200 100253 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
  • 72.30.142.220 - - [29/Sep/2010:21:06:47 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.0" 200 100253 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
  • 65.52.2.10 - - [29/Sep/2010:21:06:49 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 200 26643 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
  • 85.114.136.243 - - [29/Sep/2010:21:06:49 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 200 26634 "-" "Mozilla/5.0 (compatible; Windows NT 6.0) Gecko/20090624 Firefox/3.5 NjuiceBot"
  • 72.30.142.220 - - [29/Sep/2010:21:06:49 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.0" 200 100253 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
  • 89.151.113.134 - - [29/Sep/2010:21:06:49 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 200 100253 "-" "Mozilla/5.0 (compatible; MSIE 6.0b; Windows NT 5.0) Gecko/2009011913 Firefox/3.0.6 TweetmemeBot"
  • 67.202.63.158 - - [29/Sep/2010:21:06:54 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 200 26634 "-" "kame-rt (support@backtype.com)"
  • 38.113.234.180 - - [29/Sep/2010:21:06:57 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 200 100253 "-" "Voyager/1.0"
  • 74.112.128.61 - - [29/Sep/2010:21:07:03 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.0" 200 100253 "-" "Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly/) Gecko/2009032608 Firefox/3.0.8"
  • 64.233.172.20 - - [29/Sep/2010:21:07:10 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 200 26640 "-" "AppEngine-Google; (+http://code.google.com/appengine; appid: mapthislink)"
  • 208.94.147.190 - - [29/Sep/2010:21:07:17 +0000] "HEAD /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 200 - "http://longurl.org" "LongURL API"
  • 208.94.147.190 - - [29/Sep/2010:21:07:17 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 200 100253 "http://longurl.org" "LongURL API"
  • 66.249.65.166 - - [29/Sep/2010:21:07:25 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 200 26653 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
  • 64.12.237.17 - - [29/Sep/2010:21:07:32 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 403 455 "-" "Jakarta Commons-HttpClient/3.1"
  • 204.236.205.4 - - [29/Sep/2010:21:08:55 +0000] "HEAD /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 403 - "-" "Firefox"
  • 67.207.201.163 - - [29/Sep/2010:17:01:06 +0000] "GET /2010/09/capital-punishment-and-law-and-order/ HTTP/1.1" 403 473 "-" "Mozilla/5.0 (compatible; mxbot/1.0; +http://www.chainn.com/mxbot.html)"
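Log lines like the ones above are easy to tally per user-agent, which makes duplicate visitors such as Yahoo's Slurp stand out. The following is a small Python sketch, assuming the entries are in Apache's combined log format as they appear here. The names `LOG_LINE`, `parse_line` and `hits_per_agent` are invented for the example.

```python
import re
from collections import Counter

# Apache "combined" format:
# ip ident user [time] "method path protocol" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return the fields of one log entry as a dict, or None if it doesn't match."""
    match = LOG_LINE.search(line)
    return match.groupdict() if match else None


def hits_per_agent(lines):
    """Count how many requests each user-agent made."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry:
            counts[entry["agent"]] += 1
    return counts
```

Calling `hits_per_agent(open("access.log"))` (the file name is only a placeholder) and then `.most_common()` on the result lists the busiest agents first.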


Monday 27 September 2010

Twitter Traffic is causing high server loads

Wordpress and Twitter Generated Traffic

I came across this interesting article the other day over at cloudtesting.com which listed the user-agents of bots that would visit a link if it was posted on Twitter. I have copied the list below and it's quite amazing to think that as soon as you post a link to your site or blog on Twitter you will suddenly get hammered by X amount of bots.

I can definitely attest to the truthfulness of this behaviour as I am experiencing a similar problem with one of my LAMP Wordpress blogs. Whenever an article is posted I automatically post tweets to 2 (sometimes 3 depending on relevance) Twitter accounts with my new Strictly Tweetbot Wordpress plugin.

Therefore, when I import content at scheduled intervals throughout the day I can receive quite a sudden rush of bot traffic to my site, which often spikes my server load to unrecoverable levels.

  • @hourlypress
  • Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  • Mozilla/5.0 (compatible; abby/1.0; +http://www.ellerdale.com/crawler.html)
  • Mozilla/5.0 (compatible; MSIE 6.0b; Windows NT 5.0) Gecko/2009011913 Firefox/3.0.6 TweetmemeBot
  • Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)
  • Mozilla/5.0 (compatible; Feedtrace-bot/0.2; bot@feedtrace.com)
  • Mozilla/5.0 (compatible; mxbot/1.0; +http://www.chainn.com/mxbot.html)
  • User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7 (.NET CLR 3.5.30729)
  • Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0 OneRiot/1.0 (http://www.oneriot.com)
  • PostRank/2.0 (postrank.com)
  • Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0 Me.dium/1.0 (http://me.dium.com)
  • Mozilla/5.0 (compatible; VideoSurf_bot +http://www.videosurf.com/bot.html)
  • Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3
  • Mozilla/5.0 (compatible; page-store) [email:paul at page-store.com]

Personally I think this list might be out of date, as from what I have seen there are quite a few more agents that could be added to it, including bots from Tweetme and Bit.ly.

Currently, if I think a bot provides no benefit to me in terms of traffic and does nothing but steal my bandwidth and kill my server, I serve it a 403 using htaccess rules.

Before banning an agent, check your log files or stats to see whether any traffic is being referred by it. If you want the benefit without the constant hitting, try contacting the company behind the bot to see if they could change its behaviour. You never know, they may be relying on your content and be willing to tweak their code. We can all live in hope.
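One rough way to check for referred traffic is to count the referring hosts in the access log and see whether the bot's parent site ever shows up. The Python sketch below assumes Apache's combined log format, where the referrer is the first quoted field after the status code and byte count. The names `REFERER` and `referring_hosts` are made up for the example.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Matches: "METHOD path protocol" status bytes "referer"
REFERER = re.compile(r'"[A-Z]+ \S+ [^"]*" \d{3} \S+ "(?P<referer>[^"]*)"')


def referring_hosts(lines):
    """Count requests per referring host, ignoring empty ("-") referrers."""
    hosts = Counter()
    for line in lines:
        match = REFERER.search(line)
        if not match:
            continue
        referer = match.group("referer")
        if referer in ("", "-"):
            continue
        host = urlparse(referer).netloc.lower()
        if host:
            hosts[host] += 1
    return hosts
```

If a service's host never appears in the result over a reasonable period, its bot is probably costing more than it returns.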

Saturday 11 September 2010

Strictly System Checker - Wordpress Plugin

Updated Wordpress Plugin - Strictly System Check

With this Wordpress plugin I created you can ensure your Wordpress site is up 24/7, and be kept informed when it isn't, without even having to touch your server or install monitoring software.

I have just released version 1.0.2 which has some major new features including:

The option to check for fragmented indexes and to carry out an automated re-index using the OPTIMIZE command.
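As an illustration of the idea only (not the plugin's actual logic), a check like this could look at the free space MySQL reports for each table and build an OPTIMIZE statement for any table that has some. The Python sketch assumes the `(table_name, data_free)` pairs have already been read from `SHOW TABLE STATUS` or `information_schema.TABLES`. The function name and the `min_free_bytes` threshold are invented for the example, and treating any non-zero `Data_free` as "fragmented" is a simplification.

```python
def optimize_statements(table_stats, min_free_bytes=0):
    """Build OPTIMIZE TABLE statements for tables reporting reclaimable space.

    table_stats is an iterable of (table_name, data_free) pairs.
    """
    statements = []
    for name, data_free in table_stats:
        # Data_free can be NULL/None for some storage engines, so skip those.
        if data_free and data_free > min_free_bytes:
            quoted = name.replace("`", "``")  # escape backticks in identifiers
            statements.append(f"OPTIMIZE TABLE `{quoted}`;")
    return statements
```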

And most importantly I have migrated some of the key features that MySQL performance monitoring scripts such as MySQLReport use to the plugin so that you can now be kept informed of:
  • MySQL Database Uptime
  • Number of connections made since last restart, connections per hour.
  • The number of aborted connections.
  • Number of queries made since last restart, queries per hour.
  • The percentage of queries that are flagged as slow.
  • The number of joins being carried out without indexes.
  • The number of reads and writes.
You can find out more about this very useful plugin at my main site: www.strictly-software.com/plugins/strictly-system-check
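Figures like those in the list above can be derived from the counters MySQL exposes through `SHOW GLOBAL STATUS`. The Python sketch below shows the arithmetic only, assuming the counters have already been read into a dict. The status variable names (`Uptime`, `Connections`, `Aborted_connects`, `Questions`, `Slow_queries`, `Select_full_join`) are real MySQL ones, but the function name and the choice of which counters to use are assumptions and may not match what the plugin does.

```python
def summarise_status(status):
    """Turn raw MySQL status counters into per-hour and percentage figures."""
    uptime_seconds = int(status["Uptime"])
    uptime_hours = uptime_seconds / 3600
    connections = int(status["Connections"])
    questions = int(status["Questions"])
    slow = int(status["Slow_queries"])

    def per_hour(total):
        # Avoid dividing by zero straight after a restart.
        return total / uptime_hours if uptime_hours else 0.0

    return {
        "uptime_hours": uptime_hours,
        "connections": connections,
        "connections_per_hour": per_hour(connections),
        "aborted_connections": int(status["Aborted_connects"]),
        "queries": questions,
        "queries_per_hour": per_hour(questions),
        "slow_query_percent": 100 * slow / questions if questions else 0.0,
        # Joins that had to scan a table because no index could be used.
        "joins_without_indexes": int(status["Select_full_join"]),
    }
```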

It's a plugin I created out of necessity and one that has been 100% useful in keeping my site running. On those occasions when the site is down, I get to know about it before anyone complains.

Wednesday 8 September 2010

An issue with mysql_unbuffered_query, CONCAT and Wordpress

MySQL Problems related to mysql_unbuffered_query

I have been doing a lot of work with Wordpress lately, mainly developing a number of plugins to overcome issues with performance or a lack of features in existing plugins. One of the plugins I have been working on lately is an XML feed plugin in which I have tried to make use of the database a lot more than other plugins seem to want to.

However, for some time I have been experiencing an issue with one of the MySQL interface functions, mysql_unbuffered_query. This function is designed to speed up the retrieval of results, as records are returned to the client as soon as they are ready rather than after the whole recordset has been completed.
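The difference between the two approaches can be shown with a toy model that has nothing to do with MySQL's internals. In this Python sketch a generator stands in for the server producing rows: the "buffered" run collects every row before the client sees any, while the "unbuffered" run hands each row over as soon as it exists. All the function names are made up for the illustration.

```python
def server_rows(count, events):
    """Stand-in for the database producing rows one at a time."""
    for number in range(count):
        events.append(f"produced {number}")
        yield number


def consume(rows, events):
    """Stand-in for the client code handling each row."""
    for number in rows:
        events.append(f"handled {number}")


def buffered_run(count):
    """Fetch the whole recordset first, then process it."""
    events = []
    rows = list(server_rows(count, events))  # waits for every row
    consume(rows, events)
    return events


def unbuffered_run(count):
    """Process each row as soon as it is available."""
    events = []
    consume(server_rows(count, events), events)
    return events
```

With two rows, the buffered run records both "produced" events before any "handled" event, whereas the unbuffered run alternates between the two.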

Whilst this seems straightforward, I have come across an issue that seems to be directly linked to using this function and that affects queries which carry out certain replacement behaviour. In my case I am using a SELECT statement to CONCAT all the necessary columns for each row into one XML string. Rather than return each individual column by itself and then use PHP to build the string and call other database related functions, I am trying to save time and database calls by doing as much as possible in one statement. A cut down example of this SQL can be seen below, in which I join a number of columns together as well as inserting the relevant value (in this case the tag slug) into the URL.

SELECT CONCAT('<url><loc>',REPLACE('http://www.mysite.com/tag/%tag%/','%tag%',t.slug),'</loc><lastmod>',REPLACE(NOW(),' ','T'),'Z</lastmod><changefreq>always</changefreq> <priority>1.0</priority></url>') as XML
FROM wp_terms AS t
JOIN wp_term_taxonomy AS tt
ON t.term_id = tt.term_id
WHERE tt.taxonomy IN ('post_tag')
ORDER BY Name;

Nothing too complex about that at all. The URL string containing the placeholder is a permalink structure that is obtained before hand and one that can contain multiple placeholders and sections. For the sake of clarity I have kept it simple so it only makes one replacement.
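For reference, the per-row result the query is meant to produce can be reproduced outside the database. The Python sketch below mirrors the same steps (swap the %tag% placeholder for the slug, format the timestamp with a T and a trailing Z, wrap it all in the XML). The function name `sitemap_entry` is invented for the example and this is only an illustration of the intended output, not a fix for the problem described in this post.

```python
from datetime import datetime


def sitemap_entry(permalink, slug, now):
    """Build one <url> element the way the CONCAT/REPLACE query intends."""
    # Same as REPLACE(permalink, '%tag%', t.slug)
    loc = permalink.replace("%tag%", slug)
    # Same as CONCAT(REPLACE(NOW(), ' ', 'T'), 'Z')
    lastmod = now.strftime("%Y-%m-%dT%H:%M:%S") + "Z"
    return (
        f"<url><loc>{loc}</loc><lastmod>{lastmod}</lastmod>"
        "<changefreq>always</changefreq> <priority>1.0</priority></url>"
    )
```

Each call depends only on the slug passed in, so every row should end up with its own tag in the URL.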

When I run this query in Navicat, or from the website using the Wordpress SQL functions or the standard mysql_query functions, it runs correctly, returning all the rows with the appropriate tag values inserted into the %tag% place-holders within the XML, e.g.
<url><loc>http://www.mysite.com/tag/Sales/</loc><lastmod>2010-09-08T00:37:25Z</lastmod><changefreq>always</changefreq> <priority>1.0</priority></url>
<url><loc>http://www.mysite.com/tag/Spain/</loc><lastmod>2010-09-08T00:37:25Z</lastmod><changefreq>always</changefreq> <priority>1.0</priority></url>
<url><loc>http://www.mysite.com/tag/2020/</loc><lastmod>2010-09-08T00:37:25Z</lastmod><changefreq>always</changefreq> <priority>1.0</priority></url>


However, when I use mysql_unbuffered_query to run this, every row comes back containing the same data, e.g.
<url><loc>http://www.mysite.com/tag/Sales/</loc><lastmod>2010-09-08T00:37:25Z</lastmod><changefreq>always</changefreq> <priority>1.0</priority></url>
<url><loc>http://www.mysite.com/tag/Sales/</loc><lastmod>2010-09-08T00:37:25Z</lastmod><changefreq>always</changefreq> <priority>1.0</priority></url>
<url><loc>http://www.mysite.com/tag/Sales/</loc><lastmod>2010-09-08T00:37:25Z</lastmod><changefreq>always</changefreq> <priority>1.0</priority></url>



Even if I break the query down to something as simple as this REPLACE without the CONCAT, it still "misbehaves".

SELECT t.slug,REPLACE('http://www.mysite.com/tag/%tag%/','%tag%',t.slug)
FROM wp_terms AS t
JOIN wp_term_taxonomy AS tt ON
t.term_id = tt.term_id
WHERE tt.taxonomy IN ('post_tag')
ORDER BY Name;


and the results show the same value in the 2nd column for all rows, e.g.

Col 1    Col 2
Sales    http://www.mysite.com/tag/Sales
Spain    http://www.mysite.com/tag/Sales
2012     http://www.mysite.com/tag/Sales


It definitely seems to be connected to the execution of mysql_unbuffered_query. I can have Navicat open on my office PC, connected to my remote server, with all these queries running correctly, but as soon as I run the mysql_unbuffered_query query through the website from my laptop at home to build the XML, everything goes tits up. If I then hit the refresh button on the query open in Navicat on my office PC, which had been returning the correct results, the results come back as I have described, with every row from the second onwards displaying the value from the first row.

This is strange behaviour, as I am providing a link identifier parameter when opening the connection and executing the SQL for the mysql_unbuffered_query, and I presumed, rightly or wrongly, that this should have prevented issues like this.

I have yet to find a definitive answer to this problem, so if anybody knows please contact me with details. It looks as though the MySQL query engine has not finished processing the rows correctly when it starts to return them to the client. If the replacement of values and placeholders in the REPLACE function wasn't being carried out until the whole recordset was completed, rather than after each individual row, then this "might" explain it. As I am not an expert on the inner workings of MySQL I cannot say at this point in time. However, a quick solution to speed up some SQL has become more problematic than I would have thought it could.


Any information regarding this issue would be great.