Diving Into FiveThirtyEight's "Russian Troll Tweets" Dataset with BigQuery
•••FiveThityEight recently released a dataset of what is believed to be ~3 million tweets associated with “Russian trolls”. These tweets are designed to spread misinformation (let’s not mince words: lies), and ultimately influence voters. If you haven’t read the linked article, I highly suggest you do that before continuing on.
Exploring a ~700MB+ CSV file isn’t hugely practical (it’s since been sharded into < 100MB chunks), and so I’ve made the tweets available as a public dataset via Google’s BigQuery analytics engine. BigQuery has a sizeable free tier of 1TB per month, which should allow a fair bit of exploration, even if you’re a student or if paid services present a challenge for you.
Note: This isn’t a BigQuery & SQL tutorial: for that, take a look at the documentation.
If you’re already familiar with BigQuery & accessing public datasets, then you can simply run the below to start exploring the data:
#standardSQL
SELECT
author,
COUNT(*) AS tweets,
followers
FROM
`silverlock-bigquery.public_datasets.fivethirtyeight_troll_tweets`
GROUP BY
author,
followers
ORDER BY
tweets DESC,
followers DESC
For everyone else: read on.
Accessing the Dataset
We’re going to use the BigQuery web UI, so navigate to the BigQuery interface and select the project you want to access it from. You’ll see the fivethirtyeight_russian_troll_tweets
table appear on the left-hand-side, in the Resource tab. From there, you can inspect the table russian_troll_tweets
, look at the schema (also pasted below), and see a preview of the data.
name | type | mode |
---|---|---|
external_author_id | FLOAT | NULLABLE |
author | STRING | NULLABLE |
content | STRING | NULLABLE |
region | STRING | NULLABLE |
language | STRING | NULLABLE |
publish_date | TIMESTAMP | NULLABLE |
harvested_date | TIMESTAMP | NULLABLE |
following | INTEGER | NULLABLE |
followers | INTEGER | NULLABLE |
updates | INTEGER | NULLABLE |
post_type | STRING | NULLABLE |
account_type | STRING | NULLABLE |
new_june_2018 | INTEGER | NULLABLE |
retweet | INTEGER | NULLABLE |
account_category | STRING | NULLABLE |
So with the data above, what can we do? We can look at how these tweets were amplified (updates), what language the tweet was posted in (what audience was it for?), and the direct audience of the account (followers). We don’t get details on the followers themselves however, which makes it hard to know how impactful the reach was: is it trolls/bots followed by other trolls, or members of the general Twitter populace?
Analyzing It
OK, let’s take a quick look at the data to get you thinking about it. We’ll answer:
- Was there a specific account with a non-negligible fraction of tweets?
- Which months saw the most activity?
- Which tweets were the most amplified in each language?
-- Was there a specific account with a non-negligible fraction of tweets?
SELECT
author,
COUNT(*) AS count,
FORMAT("%.2f", COUNT(*) / (
SELECT
COUNT(*)
FROM
`silverlock-bigquery.public_datasets.fivethirtyeight_troll_tweets`) * 100) AS percent
FROM
`silverlock-bigquery.public_datasets.fivethirtyeight_troll_tweets`
GROUP BY
author
ORDER BY
percent DESC
LIMIT
10
The EXQUOTE
account was definitely a sizeable contributor, although there’s not an order-of-magnitude difference across the top 10.
author | count | percent |
---|---|---|
EXQUOTE | 59652 | 2.01 |
SCREAMYMONKEY | 44041 | 1.48 |
WORLDNEWSPOLI | 36974 | 1.24 |
AMELIEBALDWIN | 35371 | 1.19 |
TODAYPITTSBURGH | 33602 | 1.13 |
SPECIALAFFAIR | 32588 | 1.10 |
SEATTLE_POST | 30800 | 1.04 |
FINDDIET | 29038 | 0.98 |
KANSASDAILYNEWS | 28890 | 0.97 |
ROOMOFRUMOR | 28360 | 0.95 |
-- Which months saw the most activity?
SELECT
FORMAT("%d-%d", EXTRACT(month
FROM
publish_date), EXTRACT(year
FROM
publish_date) ) AS date,
COUNT(*) AS count
FROM
`silverlock-bigquery.public_datasets.fivethirtyeight_troll_tweets`
GROUP BY
date
ORDER BY
count DESC
LIMIT
10
Unsuprisingly here, we see October 2016 (just prior to the election on Nov 8th) feature prominently, as well August 2017, in which the North Korean conversation escalated immensely.
date | count |
---|---|
8-2017 | 191528 |
12-2016 | 155560 |
10-2016 | 152115 |
7-2015 | 145504 |
4-2017 | 136013 |
1-2017 | 135811 |
11-2015 | 132306 |
3-2017 | 128483 |
11-2016 | 123374 |
8-2015 | 119454 |
-- Which tweets were the most amplified (likes, retweets) by language?
SELECT
language,
content,
updates
FROM (
SELECT
language,
content,
updates,
RANK() OVER (PARTITION BY language ORDER BY updates DESC) AS tweet_rank
FROM
`silverlock-bigquery.public_datasets.fivethirtyeight_troll_tweets`
GROUP BY
language,
updates,
content ) troll_tweets
WHERE
tweet_rank = 1
GROUP BY
language,
content,
updates
ORDER BY
updates DESC
LIMIT
10
I’ll leave analyzing these tweets as an exercise to the reader, but they certainly appear to prey on the hot button issues in a few places. Also note that I’ve truncated the output here, for brevity. Also be mindful of any links you follow here: I have not vetted them.
language | truncated_content | updates |
---|---|---|
English | ‘@JustinTrudeau Mr. Trudeau, Canadian citizens dem | 166113 |
Turkish | KARMA, KARMA, KARMA!!! https://t.co/Eh5XUyILeJ | 165833 |
Catalan | ‘@HCDotNet Excellent! 🇺🇸👠🠻😆’ | 165751 |
Farsi (Persian) | Shameful https://t.co/rll2JrUzRI | 165468 |
Dutch | Trump’s tweets. #ThingsITrustMoreThanCNN https:/ | 165407 |
Norwegian | #2018PredictionsIn5Words Pro-Trump landslide | 165371 |
Vietnamese | So sad. @TitosVodka rocks!! https://t.co/sWtLlZxL5 | 164288 |
Lithuanian | Stump for Trump @Stump4TrumpPac https://t.co/S0NS9 | 164082 |
Estonian | #QAnon @Q #FOLLOWTHEWHITERABBIT 🠇 #FLYSIDFLY# | 163448 |
Croatian | ‘@FoxNews @rayann2320 @POTUS Bravo Mr President!!’ | 163126 |
Wrap
There’s a lot of data to explore here, but it’s also worth keeping in mind that three (3) million tweets is only a small fraction of tweets associated with this kind of content, and this kind of bounded data collection may have some subjectivity to it.
If you have any questions about the dataset itself, you should open an issue on FiveThirtyEight’s GitHub repository. As for questions about exploring it via BigQuery: feel free to tweet @elithrar with your questions or explorations!
Posted on 01 August 2018