One of my main interests is in analyzing user-generated data, whether that be comments, tweets, or check-ins. I have a side research project that I am working on related to abortion and public policy and so decided to use this homework assignment as a way to get myself started on analyzing the data from this project.
I did most of the work in Python, using the awesome libraries tweepy (Twitter API wrapper), matplotlib (plotting), pymongo (interface to a MongoDB database), and nltk (natural language toolkit). I used a Mongo database to store the data, though it wasn't strictly necessary (plain text files would easily suffice). I forgot to take into account how long it would take for the scripts to crunch through all the data, so when I got started last night, I quickly realized I'd better let the scripts run overnight and write up the post this morning.
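For the curious, the collection pipeline looked roughly like this. It's a minimal sketch: the credentials and the database/collection names are placeholders, and it assumes the pre-v4 tweepy API that was current at the time.

```python
import tweepy
from pymongo import MongoClient

# Placeholder credentials -- substitute your own Twitter API keys
CONSUMER_KEY, CONSUMER_SECRET = 'xxx', 'xxx'
ACCESS_TOKEN, ACCESS_SECRET = 'xxx', 'xxx'

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)

# 'abortion_study' and 'tweets' are made-up names for this sketch
tweets = MongoClient().abortion_study.tweets

for status in tweepy.Cursor(api.search, q='abortion').items():
    tweets.insert_one(status._json)  # store the raw JSON so nothing is lost
```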
My dataset consisted of 663,131 tweets related to abortion, collected over the course of 2013. To find tweets related to abortion, I looked for key terms such as “abortion”, “abort” + “baby”, “abort” + “birth”, “prolife”, “prochoice”, and some others, including common hashtags.
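The relevance filter amounted to something like the function below. The term list here is illustrative rather than my full list; note that hashtags match via the plain substring checks, since “#prolife” contains “prolife”.

```python
def is_abortion_related(text):
    """Rough keyword filter for abortion-related tweets (illustrative terms)."""
    t = text.lower()
    if any(term in t for term in ('abortion', 'prolife', 'prochoice')):
        return True
    # "abort" alone is too noisy (aborted missions, etc.), so require a
    # pregnancy-related word to co-occur
    return 'abort' in t and ('baby' in t or 'birth' in t)
```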
Here is some basic info on the tweets I collected:
Total Volume of Tweets over time (x=month of 2013, y=number of tweets):
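The plot itself comes from a simple matplotlib aggregation along these lines; the sketch assumes each stored tweet carries a parsed datetime in a created_at field, which depends on how you stored the data.

```python
from collections import Counter
import matplotlib.pyplot as plt

# Count tweets per month, reusing the mongo collection handle from above
counts = Counter(tweet['created_at'].month for tweet in tweets.find())

months = range(1, 13)
plt.bar(months, [counts[m] for m in months])
plt.xticks(months)
plt.xlabel('month of 2013')
plt.ylabel('number of tweets')
plt.show()
```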
As the plot shows, the volume varies quite a bit. Looking at the top words used each month, after removing stopwords (very common English words like “the” and “of”), we see the following (each entry shows a word and the number of times it appeared in that month's tweets; a sketch of how I tallied these follows the table):
Jan | ‘prolife’, 5783 | ‘women’, 2894 | ‘life’, 2540 | ‘roe’, 2460 | ‘baby’, 2276 |
Feb | ‘prolife’, 3300 | ‘women’, 1965 | ‘baby’, 1606 | ‘bill’, 1334 | ‘prochoice’, 1307 |
Mar | ‘prolife’, 3026 | ‘dakota’, 2528 | ‘north’, 2441 | ‘ban’, 2030 | ‘baby’, 1926 |
Apr | ‘gosnell’, 14940 | ‘prolife’, 6591 | ‘clinic’, 5682 | ‘trial’, 4586 | ‘baby’, 4081 |
May | ‘gosnell’, 6672 | ‘prolife’, 5483 | ‘murder’, 3401 | ‘doctor’, 3301 | ‘baby’, 3156 |
Jun | ‘texas’, 12705 | ‘bill’, 11218 | ‘women’, 6538 | ‘prolife’, 6530 | ‘filibuster’, 4771 |
Jul | ‘texas’, 14946 | ‘bill’, 11675 | ‘prolife’, 8196 | ‘women’, 7289 | ‘law’, 5142 |
Aug | ‘prolife’, 4958 | ‘women’, 3112 | ‘tcot’, 2493 | ‘prochoice’, 2343 | ‘like’, 1822 |
Sep | ‘prolife’, 3572 | ‘pope’, 2950 | ‘baby’, 2577 | ‘women’, 2238 | ‘church’, 1884 |
Oct | ‘prolife’, 3905 | ‘texas’, 3393 | ‘baby’, 2562 | ‘judge’, 2550 | ‘law’, 2272 |
Nov | ‘weeks’, 5254 | ‘texas’, 4721 | ‘baby’, 4634 | ‘prolife’, 4285 | ‘women’, 3596 |
Dec | ‘praytoendabortion’, 28535 | ‘prolife’, 5228 | ‘life’, 5082 | ‘women’, 4954 | ‘baby’, 4083 |
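The tallying was done along these lines, using nltk's English stopword list (my actual cleaning pass may have differed in the details; nltk's stopwords and punkt data need to be downloaded once via nltk.download):

```python
from collections import Counter, defaultdict
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop = set(stopwords.words('english'))
monthly = defaultdict(Counter)

for tweet in tweets.find():
    # Keep alphabetic, non-stopword tokens only
    words = [w for w in word_tokenize(tweet['text'].lower())
             if w.isalpha() and w not in stop]
    monthly[tweet['created_at'].month].update(words)

for month in sorted(monthly):
    print(month, monthly[month].most_common(5))
```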
Some of the volume is clearly driven by news events, such as Texas state senator Wendy Davis's filibuster in June to block a restrictive abortion bill. Other spikes are likely driven by Twitter campaigns (#praytoendabortion). This is also a good point at which to audit one's data and zoom in on weird findings to check that the data is properly cleaned. I didn't have time to do that here, but if I had, I would have looked at the tweets behind some of the weirder top 5's that I didn't understand, and either learned something new about abortion or found ways to remove the invalid tweets from the dataset.
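Auditing like that is cheap with the tweets sitting in Mongo; a quick case-insensitive regex query pulls up the tweets behind any suspicious term:

```python
# Inspect a sample of the tweets behind a surprising top word
for tweet in tweets.find({'text': {'$regex': 'praytoendabortion',
                                   '$options': 'i'}}).limit(20):
    print(tweet['text'])
```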
The last thing I did was to analyze the language in the tweets for moral values. This is part of a larger research project I am working on related to modeling ideology and linking it to policy change; you can see a more complete version of this kind of analysis, applied to same-sex marriage, here. Another important step, which I am skipping, is validation: trying to correlate the numbers crunched from the data with traditionally collected data, such as census or poll numbers.
To analyze moral values, I am using a supplemental LIWC dictionary, built by political psychologists and linguists, that attempts to match key words to underlying moral values. The five moral foundations we use (harm, fairness, ingroup loyalty, authority, and purity) are taken from research on moral foundations by social psychologist Jonathan Haidt and his colleagues. They are an attempt to understand the underlying values that people find important: do you care more about fairness, or more about loyalty and authority? Not surprisingly, these moral foundations correlate somewhat with liberal or conservative ideologies.
So, given the keywords ascribed to each moral foundation, I counted the relative occurrence of each foundation within every tweet and then averaged those per-tweet scores across all tweets within each month.
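The scoring step looks roughly like this; note that the real moral foundations dictionary uses word stems and wildcards, so the plain word sets below are stand-ins, not actual LIWC entries.

```python
from collections import defaultdict

# Illustrative stand-in lexicons -- NOT the real LIWC moral foundations words
FOUNDATIONS = {
    'harm':      {'kill', 'hurt', 'suffer', 'protect'},
    'fairness':  {'fair', 'equal', 'justice', 'rights'},
    'ingroup':   {'loyal', 'family', 'nation', 'together'},
    'authority': {'law', 'obey', 'order', 'respect'},
    'purity':    {'pure', 'sacred', 'sin', 'disgust'},
}

def relative_occurrence(words, lexicon):
    """Fraction of a tweet's words that hit a foundation's lexicon."""
    return sum(w in lexicon for w in words) / len(words) if words else 0.0

monthly_scores = defaultdict(lambda: defaultdict(list))
for tweet in tweets.find():
    words = tweet['text'].lower().split()
    for name, lexicon in FOUNDATIONS.items():
        monthly_scores[tweet['created_at'].month][name].append(
            relative_occurrence(words, lexicon))

# Average the per-tweet scores within each month
averages = {m: {f: sum(v) / len(v) for f, v in scores.items()}
            for m, scores in monthly_scores.items()}
```

The outcome: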
Moral Values Over Time (x=month in 2013, y=average relative occurrence in tweets)
Though we see more authority and harm language than the other three, this doesn't necessarily mean that people think more about those values relative to the others, because a keyword-based method can't be comprehensive. But we can track a single value over time. For instance, it's pretty notable how purity language jumps up in October; I didn't have time to dig into why, but that would be my next step.
Future work I intend to do includes looking at these traits broken up by state. You can do that by taking the free-text location field that people specify in their profile and trying to match it to a state (a naive version of that matching is sketched below). I would also want to look at several of the other LIWC categories and come up with more features of my own. Finally, it would be interesting to look at features over time, leading up to and after key events, for instance.
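A first cut at that state matching might look like this; profile locations are notoriously messy, so treat it only as a starting point (the state table is truncated here for space).

```python
# Truncated name->abbreviation table; a real version lists all 50 states
US_STATES = {'texas': 'TX', 'new york': 'NY', 'california': 'CA'}
STATE_ABBREVS = set(US_STATES.values())

def guess_state(location):
    """Best-effort match of a free-text profile location to a US state."""
    if not location:
        return None
    loc = location.lower()
    for name, abbrev in US_STATES.items():
        if name in loc:
            return abbrev
    # Fall back to trailing "City, TX"-style abbreviations
    tail = location.split(',')[-1].strip().upper()
    return tail if tail in STATE_ABBREVS else None
```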