How to access the Twitter API with Python

The Twitter API is a great way to get started with Data Mining and NLP using real-world data. Everything that is publicly available on the Twitter app is easy to collect and analyze, and the best part is that it’s completely free! I highly recommend following this excellent article, which contains sample code for collecting and analyzing tweets using Python.

In this post, I wanted to cover just a few practical steps that are often skipped over. Before you get started in python, you must:

  1. Create a Twitter Developer Account
  2. Create a Project, App, and API keys within your developer account
  3. Read the tweepy and Twitter documentation to understand what data Twitter is actually giving you for free

The Twitter API operates independently of the specific programming language you’ll be using to access it. These three steps will be the same regardless of which language you use to collect and analyze tweets. I’ve included one Python-specific library, tweepy, just to link to the documentation and provide things to consider when you’re getting started.

This is all fairly straightforward. I’ll link to a few additional articles here and here which contain the same information. These should contain all the information you need to get started with a personal data mining project.

Create a Twitter Developer Account

The first step is to sign up for a free Twitter Developer Account. It’s surprisingly easy to get access to the Twitter API and create a Twitter bot or a tweet-scraping tool. It just involves a short survey and a few days’ delay while Twitter approves the account.

Just be aware that Twitter can suspend or deny your developer access as easily as they suspend tweet authors from their accounts, so answer the survey honestly but don’t say anything that violates their developer terms of use. They mostly care about what kind of user data you’ll be collecting and whether you’re writing your code for commercial, government, or educational purposes.

Create a Project, App, and API keys within your account

Once you have access to your account, go to the Developer Portal and create a Project and Apps for your account. Every account must have one “Project”, and your “Apps” are just sets of credentials that fit inside your projects. Your free account can have up to 10 Apps, which means up to 10 connections to the Twitter API at any given time. This is important because establishing a permanent stream to collect real-time tweets requires the exclusive use of one set of API keys, so your free account will only be able to manage up to 10 streams at any given time. This is more than enough for a personal project, though.

To create an app from the developer portal, use the side bar menu to navigate to the “Projects & Apps Overview” page (Projects & Apps -> Overview). At the bottom of the page is a button labeled “Create App”.

There are five components to each set of Twitter API credentials:

  • Consumer Key (a.k.a. API key)
  • Consumer Secret (a.k.a. Secret key)
  • Access Token
  • Access Secret
  • Bearer Token

Each of these is a randomly generated key that is shown to you only once, so save them in a text file for reference and don’t lose them. If you are not prompted to create the Access Token and Secret when creating an App, go to the “Keys and tokens” page of your app by selecting the key icon next to the name of your app, then create a new Access Token and Secret by clicking the button on that page. You can also regenerate new credentials from this page if you ever lose your API keys.
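
As a quick illustration, here is a minimal sketch of how these credentials are typically wired into tweepy (this assumes tweepy 3.x; the credential strings are placeholders):

    import tweepy

    # Placeholder credentials -- substitute the values from your app's
    # "Keys and tokens" page, and never commit real keys to version control.
    CONSUMER_KEY = "YOUR_CONSUMER_KEY"
    CONSUMER_SECRET = "YOUR_CONSUMER_SECRET"
    ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"
    ACCESS_SECRET = "YOUR_ACCESS_SECRET"

    # OAuth 1a user-context authentication, used by most tweepy v3 endpoints
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    api = tweepy.API(auth)

    # Quick sanity check that the keys actually work
    print(api.verify_credentials().screen_name)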

Read the tweepy and Twitter documentation to understand what data Twitter is actually giving you for free

Now that you have your Twitter API keys, you can start collecting tweets! If you are using Python, Tweepy (GitHub link and documentation link) is the best open-source library for collecting tweets for free (however, tweepy doesn’t support premium or enterprise API features like 30-day search. For that, you’ll need something like the searchtweets library). If you look at the tweepy documentation, you’ll see that there are essentially three primary functionalities of tweepy: post, get, and stream. Everything you can do in the Twitter app, you can do with tweepy, and that includes the following: change your profile picture, post a tweet, and favorite or retweet other statuses. You can also get user account information (like name, self-described location, profile picture, and bio) or tweets (with metadata) by their unique identifier. Finally, and most importantly, you can stream tweets in real time by accessing the Twitter API’s filtered stream functionality. This is the least restrictive way to collect a large number of tweets for your Data Science project. The free search functionality that is also built into the Twitter API is very limited compared to filtered streams.
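
To make the “get” and “post” sides concrete, here is a hedged sketch using the api object from the authentication snippet above (again assuming tweepy 3.x; the screen name and tweet ID are just examples):

    # "Get": look up a user account and a single tweet by its unique ID
    user = api.get_user(screen_name="TwitterDev")
    print(user.name, user.location, user.description)  # name, location, bio

    tweet = api.get_status(20)  # 20 is the ID of the very first tweet
    print(tweet.text)

    # "Post": update your own timeline (left commented out so this sketch
    # doesn't tweet anything on your behalf by accident)
    # api.update_status("Hello from the Twitter API!")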

Here are a couple other useful references for defining the filters for your filtered streams:

https://developer.twitter.com/en/docs/twitter-api/v1/rules-and-filtering/search-operators

https://developer.twitter.com/en/docs/twitter-api/v1/tweets/filter-realtime/api-reference/post-statuses-filter
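
For the stream itself, a minimal sketch looks something like the following (this again assumes tweepy 3.x and the auth object created earlier; the track keywords are placeholders):

    import json
    import tweepy

    class SaveListener(tweepy.StreamListener):
        """Append each incoming tweet's raw JSON to a local file."""

        def on_status(self, status):
            with open("tweets.jsonl", "a") as f:
                f.write(json.dumps(status._json) + "\n")

        def on_error(self, status_code):
            # Returning False on a 420 disconnects instead of retrying,
            # so Twitter doesn't rate-limit the app even harder.
            if status_code == 420:
                return False

    stream = tweepy.Stream(auth=auth, listener=SaveListener())
    stream.filter(track=["python", "data science"])  # example filter keywords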

Final notes:

Now you’re ready to follow along with the Medium article that I linked to at the beginning of this post. In that article, they save the raw tweets in their original JSON format, which can save time on read compared to CSV files but increases storage requirements (it stores all tweet metadata, including metadata about the author’s user account). Keep in mind that Twitter stores tweets in document databases, which have flexible schemas, so if you want to save tweets in a CSV file, which is a common practice, you will need to enforce a schema on the tweets by extracting only the attributes that you need. Just keep in mind that not every tweet will have the attribute that you want to extract. This is where you need Python exception handling with try/except.
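
Here is a hedged sketch of that flattening step, assuming the raw JSON-lines file produced by the streaming sketch above and a handful of example attributes:

    import csv
    import json

    # Pick only the attributes you care about; everything else is dropped.
    fields = ["id_str", "created_at", "text", "user_screen_name"]

    with open("tweets.jsonl") as src, open("tweets.csv", "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(fields)
        for line in src:
            tweet = json.loads(line)
            try:
                writer.writerow([
                    tweet["id_str"],
                    tweet["created_at"],
                    tweet["text"],  # truncated for long tweets in the v1.1 payload
                    tweet["user"]["screen_name"],
                ])
            except KeyError:
                # Not every tweet carries every attribute; skip the ones that don't.
                continue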

The API can return original tweets, retweets, or a mixture of both. Tweets and retweets are parsed by tweepy into the same Status object class but with different attributes. See this section of the documentation for handling retweets in your stream.
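
A common way to tell them apart (an assumption about the v1.1 payload shape, where retweets carry a nested retweeted_status) looks like this:

    def original_text(status):
        """Return the text of the underlying tweet, whether or not it's a retweet."""
        if hasattr(status, "retweeted_status"):
            return status.retweeted_status.text
        return status.text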

Most of the Twitter documentation for the API is written as curl commands. If you aren’t familiar, curl (“client URL”) is an open-source command-line tool for executing HTTP requests over the Internet. It can issue the same GET and POST requests that your web browser does when you click on links or type queries into Google. In general, these types of HTTP requests involve passing a message (often JSON) containing your query to a URL, whose hostname is resolved by DNS and routed to the right server. The sender of the message (the web browser) then waits to receive a response (typically an HTML file, but it could be XML, JSON, etc.), which the browser renders for you. curl on the command line has this exact functionality, minus the rendering, and basically every major programming language has a library for executing these HTTP requests. For Python, the most popular one is called requests. Tweepy uses these commands under the hood. This type of message passing over a network, where the request is encoded by the sender, decoded by the receiver, and the sender waits for a response, is called a RESTful service in the API world. Just know that curl commands are the most platform- and language-independent way of issuing API requests over HTTP, which is why you’ll see them in the Twitter API documentation.
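
To make that concrete, here is roughly what one of those documented curl calls looks like when translated into Python’s requests library (this assumes the v1.1 users/show endpoint and a placeholder Bearer Token):

    import requests

    BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # placeholder; use the token from your app

    response = requests.get(
        "https://api.twitter.com/1.1/users/show.json",
        params={"screen_name": "TwitterDev"},
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
    )
    response.raise_for_status()
    print(response.json()["description"])  # the account's bio, parsed from JSON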

As an aside, the RESTful part of the API includes tweepy functions like get_status(), get_user(), and search(). Setting up a permanent stream is more like a microservice, where you are a listener subscribed to a topic in a message-passing queue. If you are familiar with Apache Kafka or other message-passing software, these concepts will feel familiar.

These resources should be enough to get you started. Happy Data Mining!
