In this post we’ll see how to use Python to get the tweets we plotted in Rude Britannia and Exploring Rude Britannia. This is a code-centric post, so you have been warned! This is not a best practice post, simply my implementation.
I’ve written this post as if I’m talking you through the code. If you don’t have the patience you can get the full code here.
Tweepy does much of the heavy lifting
we’ll need to import a custom module for our Twitter authentication codes
the results are a JSON set which I store in a MongoDB
this pair are always useful…
we’ll need to read a CSV to build a search string
and let’s finally bring in this for timing how long stuff takes.
Authenticate with Twitter
To read tweets our script will need to sign into Twitter. To do this, create a Twitter account, go to https://apps.twitter.com/ and make some keys and access tokens. The access tokens don’t expire so you only need to do this once.
I think of the consumer keys as an entrance fee to a nightclub and the access tokens are the rubber stamp that tells the bouncer you’ve paid. Just like a club, one generous person can pay the entry for many people and if you cause any trouble the Twitter bouncers will kick you (and possibly your mates) out!
To get the location of tweets you must enable location broadcasting on your Twitter account, so you might want to turn this on now.
Stick the keys into the code below and you should be good to go. I keep this entire snippet out of my git repo but you may only want to exclude the keys.
For more information see the Tweepy documentation.
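As a rough guide, a credentials module of this kind might look like the minimal sketch below. The constant names, the `make_auth` helper and the placeholder values are assumptions for illustration only; `OAuthHandler` and `set_access_token` are Tweepy calls (newer Tweepy releases also expose the handler as `OAuth1UserHandler`).

```python
# Hypothetical credentials module. Keep it (or at least the four values)
# out of version control.
CONSUMER_KEY = "YOUR_CONSUMER_KEY"
CONSUMER_SECRET = "YOUR_CONSUMER_SECRET"
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"
ACCESS_TOKEN_SECRET = "YOUR_ACCESS_TOKEN_SECRET"


def make_auth():
    """Build a Tweepy OAuth handler from the constants above."""
    # Imported here so the module can be loaded without Tweepy installed.
    import tweepy

    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
    return auth
```

Nothing talks to Twitter until `make_auth()` is called, so the placeholders can sit in the file while you sort out your keys.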
We first load the authentication we created in the previous snippet
and use it to open a connection to the Twitter API.
We then connect to a local Mongo database called “twitter” that has a collection called “twitter_collection”.
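The two connections could be sketched roughly as follows. The helper names `connect_api` and `get_collection` are made up for this example; `tweepy.API(auth)` and dictionary-style access on a PyMongo client are real calls, and the database and collection names are the ones mentioned above.

```python
def connect_api(auth):
    """Open a connection to the Twitter API using an existing auth handler."""
    # Imported here so the sketch loads even without Tweepy installed.
    import tweepy

    return tweepy.API(auth)


def get_collection(client, db_name="twitter", collection_name="twitter_collection"):
    """Return the Mongo collection the tweets are stored in.

    `client` is assumed to be a pymongo.MongoClient (or anything that
    supports the same client[db][collection] lookup).
    """
    return client[db_name][collection_name]
```

With MongoDB running locally, passing `pymongo.MongoClient()` as `client` should connect to the default host and port.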
Find the UK
The Twitter API has places, and each place has a unique ID. A place can be anything: a country, a region, a city, a pub and so on. I only care about tweets that come from the UK, so I find all places that could be ‘United Kingdom’, eyeball the result set and get the ID. Once you have the ID you don’t need to query the places again.
If you want a different country change the ID accordingly.
This Stack Overflow answer helped me greatly.
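A place lookup along these lines might look like the sketch below. It assumes a Tweepy version that still offers the geo search call (`geo_search` in Tweepy 3.x, renamed `search_geo` in 4.x) and that Twitter still serves that endpoint for your account; the `find_places` helper is hypothetical.

```python
def find_places(api, name="United Kingdom", granularity="country"):
    """Return (id, full_name, place_type) for every place matching `name`."""
    # Tweepy 3.x calls this geo_search; Tweepy 4.x renamed it search_geo.
    search = getattr(api, "search_geo", None) or getattr(api, "geo_search")
    places = search(query=name, granularity=granularity)
    return [(place.id, place.full_name, place.place_type) for place in places]
```

Print the returned list, pick the entry that is clearly the country and keep its ID for the search string.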
Build a Search String
I load in this set of rude words but you can have whatever you want - it doesn’t even have to be rude.
We can now use the loaded words to build our search string. Be careful: searching for too many terms (more than 20 or so) will throw an error, which this code doesn’t handle.
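Here is one way the loading and joining could be done, as a sketch rather than the exact code. It assumes one word per row in the first CSV column and a query of the form `place:<id> (a OR b)`; the helper names, the `PLACE_ID` placeholder and the guard on the number of terms are all additions for illustration.

```python
import csv

MAX_TERMS = 20  # rough figure from the text; the real limit is Twitter's


def load_words(csv_file):
    """Return the non-empty values from the first column of an open CSV file."""
    return [row[0].strip() for row in csv.reader(csv_file) if row and row[0].strip()]


def build_search_string(words, place_id, max_terms=MAX_TERMS):
    """Join the words with OR and restrict the search to one place."""
    if len(words) > max_terms:
        raise ValueError(
            "too many search terms: %d (limit %d)" % (len(words), max_terms)
        )
    return "place:{} ({})".format(place_id, " OR ".join(words))
```

For example, `build_search_string(["bother", "blast"], "PLACE_ID")` gives `place:PLACE_ID (bother OR blast)`.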
Seed the Query
People are always making lots of new tweets, so we will be querying a constantly changing list. To create a reference point we grab any tweet. There’s nothing special about it, and it doesn’t even have to be one we care about; it’s simply there so we don’t keep querying the same tweets as the list changes.
The Twitter API gives us back a collection of tweets, so even though we’ve only asked for one tweet we have to treat the result as a set. We ‘loop through’ the set, convert it to JSON and get the ID of our seed tweet. This will be the biggest ID we query up to.
If you don’t leave enough time between executions, this seed method can return duplicate tweets. I just make sure to wait long enough between successive runs, and I also have some code that sorts out duplicates in my database. I’ll leave the proper solution as an exercise for the reader 😉
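Pulling the seed ID out of a result set can be sketched like this. It assumes the results are Tweepy status objects, whose raw data sits in the `_json` attribute, returned by something like `api.search(q=..., count=1)` in Tweepy 3.x (`search_tweets` in 4.x); the helper name is hypothetical.

```python
def seed_id_from(results):
    """Return the largest tweet ID found in a collection of Tweepy statuses."""
    ids = [status._json["id"] for status in results]
    if not ids:
        raise ValueError("the seed search returned no tweets")
    return max(ids)
```

Taking the maximum means the helper still works if you ask for more than one seed tweet.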
Query the API
We’re going to query the API about 20 times, each time getting 100 tweets - giving us around 2000 tweets. Should we desire, we can query the API up to the rate limit, getting 100 tweets with each query.
This performs the query against the API. A line further down the page decreases the number of requests left to perform.
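The overall loop might be shaped like the sketch below. `search` is assumed to be a callable such as Tweepy’s `api.search` that accepts `q`, `count` and `max_id`; `process_batch` is a hypothetical callback that stores the tweets and hands back the new reference ID.

```python
import time


def run_queries(search, query, seed_id, process_batch, requests_left=20, per_request=100):
    """Query repeatedly, walking backwards from seed_id.

    Returns the final reference ID and the elapsed time in seconds.
    """
    max_id = seed_id
    started = time.time()
    while requests_left > 0:
        batch = search(q=query, count=per_request, max_id=max_id)
        requests_left -= 1  # one request used, whether or not it returned anything
        if not batch:
            break  # nothing older left to fetch
        max_id = process_batch(batch, max_id)
    return max_id, time.time() - started
```

Raising `requests_left` is how you would query closer to the rate limit.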
Add Custom Meta-Data
We now convert the returned data to JSON and add the search string and a comment to the returned set. These prove invaluable for analysing the data because they let us know whether a type of tweet was returned through luck or because we actively searched for it.
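Tagging a tweet could be as simple as the sketch below. The field names `search_string` and `comment` and the helper itself are assumptions; the input is the tweet’s JSON as a plain dictionary.

```python
def annotate(tweet_json, search_string, comment):
    """Return a copy of the tweet with the search details attached."""
    annotated = dict(tweet_json)  # shallow copy, so the original dict is left alone
    annotated["search_string"] = search_string
    annotated["comment"] = comment
    return annotated
```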
Convert Location Boxes to Discrete Points
When users tweet, we’re told that the tweet came from somewhere inside a geographic box, but we don’t know precisely where. The boxes give us an idea of where the tweet came from without telling us the exact co-ordinates, so we can’t, for example, figure out the street someone lives on.
I’ve created a plot in R demonstrating how some of these boxes look around Manchester. It uses code similar to Rude Britannia. We can see that the boxes are unequal in size and often overlap or are nested. If you look closely you’ll see some points that look like dust on your screen; these are actually really small boxes.
We take the center of the box and record it against the tweet.
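Finding the center can be sketched as below. It assumes the box arrives in the shape Twitter uses for a place’s `bounding_box`: a `coordinates` list holding one ring of `[longitude, latitude]` corners. The helper name and the corner values in the example are made up.

```python
def box_center(bounding_box):
    """Return (longitude, latitude) for the middle of a bounding box."""
    corners = bounding_box["coordinates"][0]
    lons = [corner[0] for corner in corners]
    lats = [corner[1] for corner in corners]
    # The midpoint of the extremes is the center of the rectangle.
    return ((min(lons) + max(lons)) / 2, (min(lats) + max(lats)) / 2)


# Illustrative box only, not a real Twitter place.
example_box = {
    "type": "Polygon",
    "coordinates": [[[-2.0, 53.0], [-1.0, 53.0], [-1.0, 54.0], [-2.0, 54.0]]],
}
print(box_center(example_box))  # (-1.5, 53.5)
```

A really small box, the ‘dust’ in the plot, simply returns a point very close to its own corners.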
Insert into DB
We now insert the tweet into our database. This code attempts to replace an existing tweet (based on its Twitter ID); if the tweet doesn’t exist in our database we add it (upsert = True).
The documentation for collection.replace is here.
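With PyMongo 3 or later the replace-or-insert step might look like this sketch, where the call is spelled `replace_one`. The helper name is hypothetical, and matching on the tweet’s `id` field is an assumption about how the documents are keyed.

```python
def upsert_tweet(collection, tweet_json):
    """Replace the stored copy of a tweet, or insert it if it isn't there yet."""
    return collection.replace_one(
        {"id": tweet_json["id"]},  # match on the Twitter ID
        tweet_json,
        upsert=True,  # insert when nothing matches
    )
```

Because the filter uses the Twitter ID, running the script twice over the same tweets should overwrite documents rather than duplicate them.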
Seed the Next Query
As we process each tweet, we check if its ID is less than the reference point we seeded our query with. If it is, we move the reference point backwards so that our next query doesn’t return tweets we already have.
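The check itself is tiny; here is a sketch of it. Subtracting one is my assumption rather than something stated above: Twitter’s `max_id` is inclusive, so stopping exactly on the oldest tweet’s ID would fetch that tweet again. The helper name is hypothetical.

```python
def move_reference(last_id, tweet_id):
    """Return the reference point after seeing tweet_id."""
    # Only ever move backwards, and step just past the tweet we have seen.
    return min(last_id, tweet_id - 1)


# Walking through a batch of tweet IDs, starting from a seed of 100:
reference = 100
for tweet_id in [98, 95, 97]:
    reference = move_reference(reference, tweet_id)
print(reference)  # 94
```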
Thanks for reading. If you followed everything correctly you should now have a database of tweets that you can use to follow Rude Britannia.