Wednesday, 13 July 2011

PHP for obtaining the follower count of a Twitter Account

Get Twitter Follower Count using Regular Expressions

I came across this bit of PHP code the other day which is aimed at getting the follower count of a Twitter user.

It seems like overkill to me and is a mixture of regular expressions, string parsing, callback functions and a lot of head scratching.

The author of the code obviously knows that the follower count HAS to reside in a DOM element with the id follower_count, so why not use one single regular expression to target that element and return its guts, instead of all the DOM loading, callbacks and string parsing?

I might be missing something, and if so I'd be glad for someone to tell me, but this seemed like a long way to go about a simple scrape job.


// Get the number of twitter followers

// Returns the text that is in $long_string but missing from $short_string
function string_getInsertedString($long_string,$short_string,$is_html=false){
    // nothing can have been inserted unless $long_string is the longer one
    if(strlen($short_string)>=strlen($long_string))return false;
    $insertion_length=strlen($long_string)-strlen($short_string);
    // walk forward to the first character where the two strings differ
    for($i=0;$i<strlen($short_string);++$i){
        if($long_string[$i]!=$short_string[$i])break;
    }
    $inserted_string=substr($long_string,$i,$insertion_length);
    // realign when the cut landed one character past the opening '<'
    if($is_html && $inserted_string[$insertion_length-1]=='<'){
        $inserted_string='<'.substr($inserted_string,0,$insertion_length-1);
    }
    return $inserted_string;
}

// Gets an element's outer HTML by diffing the document with and without it
function DOMElement_getOuterHTML($document,$element){
    $html=$document->saveHTML();
    $element->parentNode->removeChild($element);
    $html2=$document->saveHTML();
    return string_getInsertedString($html,$html2,true);
}

function getFollowers($username){
    $x = file_get_contents("http://twitter.com/".$username);
    $doc = new DomDocument;
    @$doc->loadHTML($x);
    $ele = $doc->getElementById('follower_count');
    // strip the outer tag pair so only the element's contents remain
    $innerHTML=preg_replace('/^<[^>]*>(.*)<[^>]*>$/',"\\1",DOMElement_getOuterHTML($doc,$ele));
    return $innerHTML;
}


// To display it

<?php echo getFollowers("username"); ?>



Here is the much shorter version I wrote. It works just as well, returning the follower count of the Twitter account whose username is passed into it.

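function getFollowers($username){
    $url = "http://twitter.com/".$username;
    $count = 0;

    // file_get_contents() returns false when the page cannot be fetched
    $x = file_get_contents($url);
    if($x === false){
        return $count;
    }

    // group 1 = tag name (reused for the closing tag), group 2 = the count
    if(preg_match("@<([a-z][^ ]*) id=\"follower_count\"[^>]+?>([0-9,]+)\s*</\\1>@i",$x,$match)){
        $count = $match[2];
    }

    return $count;
}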

// let's check how poorly my twitter account is followed!
echo "StrictlyTweets: " . getFollowers("StrictlyTweets") . "<br><br>";



As you can see from the regular expression, I match the opening HTML tag (tags all start with < followed by a letter) and capture the tag name so it can be reused as a backreference for the closing tag. That way, if the HTML changes from a SPAN to a DIV, there will still be a match as long as the element has id="follower_count" directly after the tag name.
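To see the backreference at work without hitting Twitter, here is a minimal sketch that runs the same pattern over a few strings. The helper name matchFollowerCount() and the sample markup (including the class names) are made up for illustration; they are not taken from the real profile page.

```php
<?php
// Hypothetical helper: applies the same pattern used in getFollowers()
// to a string and returns array(tag, count), or null when nothing matches.
function matchFollowerCount($html) {
    $pattern = '@<([a-z][^ ]*) id="follower_count"[^>]+?>([0-9,]+)\s*</\1>@i';
    if (preg_match($pattern, $html, $match)) {
        return array($match[1], $match[2]);
    }
    return null;
}

// Made-up snippets standing in for the profile page markup.
$samples = array(
    '<span id="follower_count" class="stats_count numeric">1,234 </span>',
    '<div id="follower_count" class="stats_count numeric">1,234</div>',
    // closing tag differs from the opening tag, so the backreference fails
    '<span id="follower_count" class="stats_count numeric">1,234</div>',
);

foreach ($samples as $html) {
    $result = matchFollowerCount($html);
    echo ($result ? $result[0] . ": " . $result[1] : "no match") . "\n";
}
```

The first two samples print the tag name and the count, while the third prints "no match" because \1 insists that the closing tag repeats whatever tag name was captured at the start.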

I could have loaded up the DOM, targeted the ID and then done some regex, but why bother when you can go straight for the jugular!
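For comparison, this is roughly what the DOM route could look like if you did want to go that way. It is only a sketch under my own assumptions: the function name getFollowersFromDom() is hypothetical, it takes the HTML as a string rather than fetching it, and it uses DOMXPath to find the id instead of the outer HTML trick in the code I found.

```php
<?php
// Hypothetical helper: DOM-based lookup of the same element, for comparison.
function getFollowersFromDom($html) {
    $doc = new DOMDocument();
    // '@' hides the warnings libxml raises on sloppy real-world HTML
    if (!@$doc->loadHTML($html)) {
        return 0;
    }
    $xpath = new DOMXPath($doc);
    // find any element, whatever its tag, carrying the follower_count id
    $nodes = $xpath->query('//*[@id="follower_count"]');
    if ($nodes === false || $nodes->length === 0) {
        return 0;
    }
    // textContent is the element's text with the markup stripped
    return trim($nodes->item(0)->textContent);
}

// Made-up markup standing in for a fetched profile page.
echo getFollowersFromDom('<html><body><span id="follower_count" class="stats_count numeric">1,234 </span></body></html>') . "\n";
```

It copes with the id appearing anywhere in the attribute list, but it means parsing the whole page just to read one number.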
