Wednesday, 13 July 2011

PHP for obtaining the follower count of a Twitter Account

Get Twitter Follower Count using Regular Expressions

I came across this bit of PHP code the other day, which is meant to fetch the follower count of a Twitter user.

It seems like overkill to me and is a mixture of regular expressions, string parsing, callback functions and a lot of head scratching.

The author obviously knows that the follower count HAS to reside in a DOM element with the id follower_count, so why not use one single regular expression to target that element and return its guts, instead of all the DOM loading, callbacks and string parsing?

I might be missing something, and if so I'd be glad for someone to tell me, but this seemed like a long way to go about a simple scrape job.


// Get the number of twitter followers

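// Returns the text that is present in $long_string but missing from $short_string.
function string_getInsertedString($long_string,$short_string,$is_html=false){
    // Nothing was inserted if the short string is not actually shorter.
    if(strlen($short_string)>=strlen($long_string))return false;
    $insertion_length=strlen($long_string)-strlen($short_string);
    // Find the first position at which the two strings differ.
    for($i=0;$i<strlen($short_string);++$i){
        if($long_string[$i]!=$short_string[$i])break;
    }
    $inserted_string=substr($long_string,$i,$insertion_length);
    // The common prefix may swallow the element's opening '<'; move it back to the front.
    if($is_html && $inserted_string[$insertion_length-1]=='<'){
        $inserted_string='<'.substr($inserted_string,0,$insertion_length-1);
    }
    return $inserted_string;
}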

function DOMElement_getOuterHTML($document,$element){
    $html=$document->saveHTML();
    $element->parentNode->removeChild($element);
    $html2=$document->saveHTML();
    return string_getInsertedString($html,$html2,true);
}

function getFollowers($username){
    $x = file_get_contents("http://twitter.com/".$username);
    $doc = new DomDocument;
    @$doc->loadHTML($x);
    $ele = $doc->getElementById('follower_count');
    $innerHTML=preg_replace('/^<[^>]*>(.*)<[^>]*>$/',"\\1",DOMElement_getOuterHTML($doc,$ele));
    return $innerHTML;
}


// To display it

<?php echo getFollowers("username"); ?>



Here is the much shorter version I wrote. It works just as well, returning the follower count of the Twitter account whose username is passed into it.

function getFollowers($username){
    $url = "http://twitter.com/".$username;
    $count=0;

    $x = file_get_contents($url);

    preg_match("@<([a-z][^ ]*) id=\"follower_count\"[^>]+?>([0-9,]+)\s*</\\1>@i",$x,$match);

    if($match){
        $count = $match[2];
    }

    return $count;
}

// let's check how poorly my twitter account is followed!
echo "StrictlyTweets: " . getFollowers("StrictlyTweets") . "<br><br>";



As you can see from the regular expression, I match an opening HTML tag (they all start with < followed by a letter) and capture the tag name so that it can be used as a backreference for the closing tag later on. That way, if the HTML changes from a SPAN to a DIV, there will still be a match as long as the element has id="follower_count".
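To see the backreference at work without hitting Twitter, here is a minimal sketch that runs the same pattern against a few made-up HTML strings. The helper name extractFollowerCount and the sample markup are my own assumptions for illustration; they are not real Twitter output.

```php
<?php
// Hypothetical helper: applies the pattern from getFollowers() to an HTML
// string instead of a live page, so it can be tried offline.
function extractFollowerCount($html) {
    // Group 1 captures the tag name; \1 requires the closing tag to repeat it.
    $pattern = '@<([a-z][^ ]*) id="follower_count"[^>]+?>([0-9,]+)\s*</\1>@i';
    if (preg_match($pattern, $html, $match)) {
        return $match[2];
    }
    return null;
}

// Made-up sample markup (an assumption, not captured from Twitter).
$span = '<span id="follower_count" class="stats_count numeric">1,234 </span>';
$div  = '<div id="follower_count" class="stats_count numeric">1,234 </div>';
$bad  = '<span id="follower_count" class="stats_count numeric">1,234 </div>';
$bare = '<span id="follower_count">1,234</span>';

var_dump(extractFollowerCount($span)); // returns "1,234"
var_dump(extractFollowerCount($div));  // returns "1,234"
var_dump(extractFollowerCount($bad));  // returns null: closing tag differs from opening tag
var_dump(extractFollowerCount($bare)); // returns null: see the note below
```

One thing to keep in mind: as written, the pattern expects id to be the first attribute and needs at least one more character after it before the closing >, so a bare <span id="follower_count"> would not match.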

I could have loaded up the DOM, targeted the ID and then done some regex, but why bother when you can go straight for the jugular!
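For comparison, a DOM-based version does not need the outer-HTML diffing from the first listing either. The following is only a rough sketch under my own assumptions: the helper name extractFollowerCountViaDom and the sample markup are hypothetical, and it reads the element's text through the standard textContent property rather than with a regular expression.

```php
<?php
// Hypothetical helper, not part of the code discussed above: reads the
// element's text through the DOM instead of a regular expression.
function extractFollowerCountViaDom($html) {
    $doc = new DOMDocument();
    // Suppress warnings about imperfect real-world markup.
    @$doc->loadHTML($html);
    $ele = $doc->getElementById('follower_count');
    if ($ele === null) {
        return null; // no element with that id in the document
    }
    // textContent is the element's text without any tags.
    return trim($ele->textContent);
}

// Made-up sample markup (an assumption, not captured from Twitter).
$sample = '<html><body><span id="follower_count" class="stats_count numeric">1,234 </span></body></html>';
var_dump(extractFollowerCountViaDom($sample));
```

It is still more moving parts than the single preg_match call, which is why I prefer the regular expression for a job this small.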

1 comment:

UK Horse Racing Star said...

Please note that this code no longer works due to Twitter changing their code for the umpteenth time.

You now need to use OAuth to do anything with their API and simple scrapes of mobile sites are not possible anymore!