A Simple Twitter Search Parser with PHP

This article is obsolete. Now Twitter has a more complex API for tweets.

I was recently asked to aggregate tweets based on their hashtags using PHP (no Ajax), so I decided to turn this into a small tutorial that will hopefully enable you to build all sorts of XML parsers in PHP.

If you read my posts, you can see that I love simplicity. In all of my solutions, tips and tutorials, I strive for the simplest code that gets the job done and for the most straightforward explanation. This tutorial is no exception. It is my sincere hope that you’ll not just copy and paste the code into your project, but that you’ll actually understand it and be able to modify and extend it for your purposes.

Twitter has a search service at search.twitter.com. The search results are available as an Atom feed and this is how we’re going to use it. If you’re wondering why Atom instead of RSS, one can argue that despite the popularity of RSS 2.0, Atom is a superior format.

Building the parser

My goals for this little parser were as follows:

Show the tweets in the format “Full Name: text – time”
Show the sender’s avatar
Show relative time, e.g. “5 minutes ago”.
Open links in a new window
Limit the number of results (and process just the first page of results)
Filter tweets containing profanity
Style everything with CSS
Work with PHP 5.

First, I should stress that this code is written for PHP 5; it was not tested with PHP versions prior to 5.2.0.

I made this into a class, so that you can easily use it in your project. The complete class is listed further down; the snippets that follow walk through its parts.

To load and parse an XML file, the easiest method is simplexml_load_file(). However, Twitter is rather picky about request headers and doesn’t like requests whose user agent is not set the way it expects, so we’ll use curl instead.

    $ch = curl_init($this->searchURL . urlencode($q));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
    $response = curl_exec($ch);

Pretty simple. The search term is URL-encoded and appended to the Twitter search URL, and the result is loaded into the $response variable as a string. Also note that we’re making the request using the browser’s user agent.
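
As a rough sketch of the same step, the request can be wrapped in two small functions. The names buildSearchUrl() and fetchFeed() and the fallback user agent string are my own assumptions for illustration, not part of the class. The fallback is there because $_SERVER['HTTP_USER_AGENT'] is not set when a script runs from the command line.

```php
<?php
// Sketch only: buildSearchUrl() and fetchFeed() are hypothetical helpers.

// Append the URL-encoded search term to the base search URL.
function buildSearchUrl($baseUrl, $q)
{
    return $baseUrl . urlencode($q);
}

// Fetch a URL with curl and return the body as a string, or false on failure.
function fetchFeed($url, $fallbackAgent = 'Mozilla/5.0 (compatible; TweetParser)')
{
    // HTTP_USER_AGENT is missing when there is no browser (e.g. on the command line)
    $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : $fallbackAgent;

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);

    // the handle is released when $ch goes out of scope
    return curl_exec($ch);
}

// Only the URL is built here; no request is made.
echo buildSearchUrl('http://search.twitter.com/search.atom?lang=en&q=', 'from:nytimes');
```

Because urlencode() is applied for you, the search term should be passed unencoded (“from:nytimes”, not “from%3Anytimes”).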

Parsing the resulting string could not be easier:

      $xml = simplexml_load_string($response);
      $output = '';
      $tweets = 0;

      for($i=0; $i<count($xml->entry); $i++)
      {
        $crtEntry = $xml->entry[$i];
        $account  = $crtEntry->author->uri;
        $image    = $crtEntry->link[1]->attributes()->href;
        $tweet    = $crtEntry->content;
      }

So we can get the link to the poster account, the image and the tweet itself right away.
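
To try the parsing step without making a request, here is a small sketch that runs the same property lookups against a hand-written entry. The sample XML is made up and contains only the elements the code reads; the real feed carried more. It also adds two things the snippet above leaves out: a check for a failed parse, and string casts, since SimpleXML returns objects rather than plain strings.

```php
<?php
// Made-up sample feed, reduced to the elements the parser reads.
$response = <<<XML
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <published>2010-06-28T18:03:00Z</published>
    <link type="text/html" rel="alternate" href="http://twitter.com/jdoe/statuses/1"/>
    <content type="html">Hello &lt;b&gt;world&lt;/b&gt;</content>
    <link type="image/png" rel="image" href="http://example.com/avatar.png"/>
    <author>
      <name>jdoe (John Doe)</name>
      <uri>http://twitter.com/jdoe</uri>
    </author>
  </entry>
</feed>
XML;

$xml = simplexml_load_string($response);
if ($xml === false) {
    die('The response is not well-formed XML');
}

foreach ($xml->entry as $crtEntry) {
    // cast to string before storing or comparing the values
    $account = (string) $crtEntry->author->uri;
    $image   = (string) $crtEntry->link[1]->attributes()->href; // second <link> is the avatar
    $tweet   = (string) $crtEntry->content;

    echo $account, "\n", $image, "\n", $tweet, "\n";
}
```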

To get the name, we need a little parsing. The name is sent this way: “username (Full Name)”. I prefer to show just the full name, so I’m using a simple regexp:

        $this->realNamePattern = '/\((.*?)\)/';
        preg_match($this->realNamePattern, $crtEntry->author->name, $matches);
        $name = $matches[1];
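
One thing to watch for: if the name ever arrives without parentheses, $matches[1] is not set. A minimal sketch of a safer lookup follows; extractRealName() is a hypothetical helper name, and falling back to the raw value is just one possible choice.

```php
<?php
// extractRealName() is a hypothetical helper, not part of the class.
function extractRealName($authorName, $pattern = '/\((.*?)\)/')
{
    if (preg_match($pattern, $authorName, $matches)) {
        return $matches[1];
    }

    // no parentheses found: fall back to whatever was sent
    return $authorName;
}

echo extractRealName('jdoe (John Doe)'), "\n"; // John Doe
echo extractRealName('jdoe'), "\n";            // jdoe
```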

Next comes showing relative time instead of absolute. This is a matter of personal taste, but considering how quickly new tweets are added, it’s worth doing.

For this we’ll use two arrays, one with various interval names, the other with the number of seconds in that interval, e.g. an hour has 3600 seconds and so on.

    $this->intervalNames   = array('second', 'minute', 'hour', 'day', 'week', 'month', 'year');
    $this->intervalSeconds = array( 1,        60,       3600,   86400, 604800, 2630880, 31570560);

The idea is this: we calculate the difference in seconds between the current time and the tweet time, then walk the interval array from the largest value to the smallest until we find an interval that is no longer than our difference. For example, if the calculated difference is 173000 seconds, we start with the last value in the array, 31570560, and keep going until we reach 86400, which corresponds to the ‘day’ interval. Now we know the difference is at least one day but less than one week. Dividing the difference by the interval length, 173000/86400, gives 2.002, just a little over two days, which we round down to 2. If the rounded-down result is exactly 1, we must use the singular form, i.e. ‘day’, otherwise the plural, ‘days’.

So here’s the code that does all that:

        $time = 'just now';
        $secondsPassed = time() - strtotime($crtEntry->published);
        if ($secondsPassed>0)
        {
          // see what interval we are in
          for($j = count($this->intervalSeconds)-1; ($j >= 0); $j--)
          {
            $crtIntervalName = $this->intervalNames[$j];
            $crtInterval = $this->intervalSeconds[$j];

            if ($secondsPassed >= $crtInterval)
            {
              $value = floor($secondsPassed / $crtInterval);
              if ($value > 1)
                $crtIntervalName .= 's';

              $time = $value . ' ' . $crtIntervalName . ' ago';

              break;
            }
          }
        }
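
To check the arithmetic with a few concrete values, the same loop can be sketched as a stand-alone function. relativeTime() is a hypothetical name; the logic mirrors the snippet above, with the result of floor() cast to an integer.

```php
<?php
// relativeTime() is a hypothetical stand-alone version of the loop above.
function relativeTime($secondsPassed)
{
    $intervalNames   = array('second', 'minute', 'hour', 'day', 'week', 'month', 'year');
    $intervalSeconds = array(1, 60, 3600, 86400, 604800, 2630880, 31570560);

    // walk from the largest interval down to the smallest
    for ($j = count($intervalSeconds) - 1; $j >= 0; $j--) {
        if ($secondsPassed >= $intervalSeconds[$j]) {
            $value = (int) floor($secondsPassed / $intervalSeconds[$j]);
            $name  = $intervalNames[$j] . ($value > 1 ? 's' : '');

            return $value . ' ' . $name . ' ago';
        }
    }

    // zero or negative difference (e.g. the server clock is slightly behind)
    return 'just now';
}

echo relativeTime(173000), "\n"; // 2 days ago
echo relativeTime(60), "\n";     // 1 minute ago
echo relativeTime(0), "\n";      // just now
```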

Finally, there’s the filtering. Depending on your site’s audience you may or may not need such a filter; I’m including it just in case.

You’d have a list of banned words in an array, like this:

    $this->badWords = array('bannedword', 'anotherbannedword');

and the code:

        $foundBadWord = false;
        foreach ($this->badWords as $badWord)
        {
          if(stristr($tweet, $badWord) !== FALSE)
          {
            $foundBadWord = true;
            break;
          }
        }

        // skip this tweet containing a banned word
        if ($foundBadWord)
          continue;
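
The check can also be sketched as a small function that is easy to try on its own. containsBannedWord() is a hypothetical name, and I’m using stripos() here in place of stristr(), since only a yes/no answer is needed.

```php
<?php
// containsBannedWord() is a hypothetical helper wrapping the loop above.
function containsBannedWord($text, array $badWords)
{
    foreach ($badWords as $badWord) {
        // stripos() is case-insensitive and returns false when nothing is found;
        // compare with !== because a match at position 0 is also "falsy"
        if (stripos($text, $badWord) !== false) {
            return true;
        }
    }

    return false;
}

$badWords = array('bannedword', 'anotherbannedword');
var_dump(containsBannedWord('This has a BannedWord in it', $badWords)); // bool(true)
var_dump(containsBannedWord('A perfectly clean tweet', $badWords));     // bool(false)
```

Keep in mind that this is plain substring matching, so a harmless word that happens to contain a banned word will be filtered too.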

Now let’s put everything together:

The complete class

<?php

class twitter_class
{	
	function __construct() // PHP 5+ constructor; a method named after the class is no longer treated as one in PHP 8
	{
		$this->realNamePattern = '/\((.*?)\)/';
		$this->searchURL = 'http://search.twitter.com/search.atom?lang=en&q=';
		
		$this->intervalNames   = array('second', 'minute', 'hour', 'day', 'week', 'month', 'year');
		$this->intervalSeconds = array( 1,        60,       3600,   86400, 604800, 2630880, 31570560);
		
		$this->badWords = array('somebadword', 'anotherbadword');
	}

	function getTweets($q, $limit=15)
	{
		$output = '';

		// get the search result
		$ch= curl_init($this->searchURL . urlencode($q));

		curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
		curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
		$response = curl_exec($ch);

		if ($response !== FALSE)
		{
			$xml = simplexml_load_string($response);
	
			$output = '';
			$tweets = 0;
			
			for($i=0; $i<count($xml->entry); $i++)
			{
				$crtEntry = $xml->entry[$i];
				$account  = $crtEntry->author->uri;
				$image    = $crtEntry->link[1]->attributes()->href;
				$tweet    = $crtEntry->content;
	
				// skip tweets containing banned words
				$foundBadWord = false;
				foreach ($this->badWords as $badWord)
				{
					if(stristr($tweet, $badWord) !== FALSE)
					{
						$foundBadWord = true;
						break;
					}
				}
				
				$tweet = str_replace('<a href=', '<a target="_blank" href=', $tweet);
				
				// skip this tweet containing a banned word
				if ($foundBadWord)
					continue;

				// don't process any more tweets if at the limit
				if ($tweets==$limit)
					break;
				$tweets++;
	
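				// name is in this format "accountname (Real Name)"; fall back to the raw value if there is no match
				preg_match($this->realNamePattern, $crtEntry->author->name, $matches);
				$name = isset($matches[1]) ? $matches[1] : (string) $crtEntry->author->name;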
	
				// get the time passed between now and the time of tweet, don't allow for negative
				// (future) values that may have occurred if server time is wrong
				$time = 'just now';
				$secondsPassed = time() - strtotime($crtEntry->published);

				if ($secondsPassed>0)
				{
					// see what interval we are in
					for($j = count($this->intervalSeconds)-1; ($j >= 0); $j--)
					{
						$crtIntervalName = $this->intervalNames[$j];
						$crtInterval = $this->intervalSeconds[$j];
							
						if ($secondsPassed >= $crtInterval)
						{
							$value = floor($secondsPassed / $crtInterval);
							if ($value > 1)
								$crtIntervalName .= 's';
								
							$time = $value . ' ' . $crtIntervalName . ' ago';
							
							break;
						}
					}
				}
				
				$output .= '
				<div class="tweet">
					<div class="avatar">
						<a href="' . $account . '" target="_blank"><img src="' . $image .'"></a>
					</div>
					<div class="message">
						<span class="author"><a href="' . $account . '"  target="_blank">' . $name . '</a></span>: ' . 
						$tweet . 
						'<span class="time"> - ' . $time . '</span>
					</div>
				</div>';
			}
		}
		else
			$output = '<div class="tweet"><span class="error">' . curl_error($ch) . '</span></div>';
		
		curl_close($ch);
		return $output;
	}
}

?>

To use the class in another PHP file, you’d use it like this:

<?php
  require('twitter.class.php');
  $twitter = new twitter_class();
  echo $twitter->getTweets('search term', 10);
?>

This will show the latest 10 tweets for your query.

You can style the results any way you want. Styling is outside the scope of this tutorial, but you can look at the end of the class to see the HTML tags and classes that are generated.

Further improvement

Given the quasi-real-time nature of Twitter (depending on the topic, tweets get published every moment), you may want to use Ajax to load new tweets. You can give an id to each tweet (usually the timestamp) and modify the PHP to return only tweets newer than a given timestamp. You can then use either an Ajax library like jQuery, or Flash, to load and show the new tweets, and a few seconds later make a new request specifying the latest id.
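
On the PHP side, the “only newer tweets” part could look roughly like this. It is a sketch under my own assumptions: entriesNewerThan() is a hypothetical helper, $since is a Unix timestamp that the page would send back with each request, and the two-entry feed is made up for the example.

```php
<?php
// entriesNewerThan() is a hypothetical helper; $since is a Unix timestamp.
function entriesNewerThan(SimpleXMLElement $xml, $since)
{
    $newer = array();
    foreach ($xml->entry as $entry) {
        // keep only entries published after the timestamp the client already has
        if (strtotime((string) $entry->published) > $since) {
            $newer[] = $entry;
        }
    }

    return $newer;
}

// Made-up feed with two entries, one hour apart
$xml = simplexml_load_string(
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    . '<entry><published>2010-11-04T12:00:00Z</published><content>older</content></entry>'
    . '<entry><published>2010-11-04T13:00:00Z</published><content>newer</content></entry>'
    . '</feed>'
);

$since = strtotime('2010-11-04T12:30:00Z');
foreach (entriesNewerThan($xml, $since) as $entry) {
    echo $entry->content, "\n"; // prints "newer"
}
```

The rest of getTweets() would then build the HTML from the filtered entries only, and the page would remember the newest timestamp it has received.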

Armand Niculescu

Senior Full-stack developer and graphic designer with over 25 years of experience, Armand took on many challenges, from coding to project management and marketing.

24 Responses

Jeremy says:

June 28, 2010 at 18:03

Hey, great work. Thanks for writing this. Have you given any thought to how you would adapt this to parse feeds of geo searches, like ‘near:alabama within:50mi’ ? This returns a string like this: http://search.twitter.com/search.atom?geocode=40.75604%2C-73.986941%2C50.0mi&q=near%3Anyc+within%3A50mi which doesn’t work in the current configuration because of the geocode.
1. Armand Niculescu says:
  
  July 3, 2010 at 15:28
  
  I will look into it.
2. Armand Niculescu says:
  
  August 7, 2010 at 11:26
  
  When you do a normal search, you can use the simple NEAR operator; however, when using the ATOM feed, it always expects to have the geocode parameter, otherwise it throws an error.
  
  The only way I see to do it dynamically is to use a free Geolocation service like https://www.geonames.org, parse the NEAR parameter, make a request to their web service, get the coordinates and then make the Twitter search. Not really worth it in my opinion.
  
  However, if the search is always the same, you can edit the twitter.class.php and on line 8, hardcode the coordinates like this:
  $this->searchURL = 'http://search.twitter.com/search.atom?geocode=40.75604%2C-73.986941%2C50.0mi&q=near%3Anyc+within%3A50mi';
  and when you make the search, just send an empty string – echo $twitter->getTweets('', 15)
  
  It’s not an ideal solution, especially if you need more than one hardcoded search.
Josh says:

July 2, 2010 at 00:54

I love it, I just wish there were a way to specify which size of avatar you want to pull… As it is now, it’s pulling a 48×48 avatar but I need to pull a 44×44 instead.

I tried using timthumb, but even after adding a1.twimg.com, a2.twimg.com, and a3.twimg.com to the list of remote sites in the timthumb script, it still won’t work.

oh well.
1. Armand Niculescu says:
  
  July 3, 2010 at 15:30
  
  I am not familiar with timthumb but it would seem a serious overhead to resize the images on the fly.
  You could use CSS to either resize the thumbnails in the browsers or to clip/mask parts of the thumbnail…
Jay says:

July 6, 2010 at 20:10

Hi, this works wonderfully for normal search terms, but I’m having trouble getting tweets from a single user using the search operator “from:user” It returns a url like this:

http://search.twitter.com/search.atom?q=from%3Anytimes

I’m plugging in ‘from%3Anytimes’ as my search term, but I’m not getting any results. Am I doing something stupid?
1. Armand Niculescu says:
  
  August 7, 2010 at 11:36
  
  Sorry for my late reply, Jay.
  You should use the normal search term, e.g. “from:nytimes” as the php class does the URL encoding for you (%3A is the encoded value for “:”)
  
  echo $twitter->getTweets('from:nytimes', 15);
  will work just fine.
John says:

July 18, 2010 at 23:59

great post!!

Is there a way to get an array of found tweets, so that I can count() them? All I want is to output the number of found tweets against a search term.
1. Armand Niculescu says:
  
  August 7, 2010 at 11:38
  
  You can simply edit the twitter.class.php file.
  At line 30 you can simply write return count($xml->entry) and the class will return the number of tweets rather than the contents.
Jason says:

August 6, 2010 at 18:27

I was leaning toward client side javascript/Ajax to accomplish this, but I love what you have done with PHP instead. This is really clean code. Do I have permission to reuse and style it to for use with CSS? Also, Is there anyway to increase the tweet posts to last more than 2 days? I would like to see tweets on a #hashtag or keyword last for more than a week. Excellent work!
1. Armand Niculescu says:
  
  August 6, 2010 at 18:34
  
  Hi Jason,
  
  yes, feel free to use the code for any purpose, commercial or not. The timespan is dependent only on the number of tweets you want displayed. If you want to see old tweets, you’ll have to increase the number of displayed tweets.
Jason says:

August 6, 2010 at 19:17

Thank you Armand! I increased the number of tweets to 50:

getTweets('ifest', 50);
?>

However, it appears that after 48 hours the tweets begin to drop off the list. Here is the example I created: http://jhaag.us/twitter_class/ for the search term “ifest”. There were 8 tweets on there yesterday and today there are only two.
1. Armand Niculescu says:
  
  August 7, 2010 at 11:00
  
  Hi Jason,
  
  First of all, Twitter limits the search results set to 15. It’s possible to get more, but it’s a bit more complicated as a request has to be made for each 15 results page and we also need to keep track of IDs to prevent duplication of results. I will add such a feature if requested by more people.
  
  Second, your query for ‘ifest’ returns fewer than 15 results because by default the class filters for English language only. However, I see that people tweet in English even if the language code is set to something else, probably due to their Twitter clients. So, in twitter.class.php, at line 8, remove the lang=en& part.
Waqqas Hanafi says:

September 17, 2010 at 19:24

Armand, having more than 15 results will be very useful. Please consider adding it as a feature in your class.
Anthony says:

October 11, 2010 at 13:38

I’ve been looking for something like this for a while. Thanks a lot!
Your tutorial was very useful for me.
I am also kind of interested in more than 15 posts, but I may try integrating this with a MySql database so that I can store tweets and display them as needed.
Thanks again!
Anthony says:

October 11, 2010 at 16:48

Actually, if possible I would love to know how to style the “Twitter Search Term”. I can style everything else with CSS, but I can’t seem to find a way to style only the search term.
1. Armand Niculescu says:
  
  October 13, 2010 at 18:48
  
  The search term is inside a <b> tag. So you can use a CSS rule like
  .tweet b {color:red}
Anthony says:

October 13, 2010 at 18:58

Thank you so much!!!!!! I was trying to use php to style it! This is much easier 🙂
Tommy says:

November 4, 2010 at 12:24

Love your work…

I’ve got as far as reloading a div with AJAX and having that ‘div class = the time’ – any tips on the PHP for only fetching a tweet newer than the timestamp? Would appreciate any help 🙂
Madhusanka says:

November 17, 2010 at 10:41

Very simple and really helpful !!! Thanx lot Armand
keep it up this good work… 🙂
Madhusanka says:

November 17, 2010 at 11:35

hi, As you said, can you provide that mechanism in details(may be a new post) to get all the tweets for a search or geocode api rather than limiting it to 15. Really urgent !!!
Thanks again…
Jeremy says:

November 30, 2010 at 04:48

This appears to have been affected by changes to the Twitter API. I got it working again once I removed lang=en from line 8, so it just reads:

$this->searchURL = 'http://search.twitter.com/search.atom?&q=';
talha says:

January 3, 2011 at 11:03

Hi,
awesome tutorial. is there a way to get the total amount of tweets in a day for a keyword?
Thanks
Manoj Solanki says:

February 13, 2011 at 18:28

Thanks for this example code…..

Comments are closed.
