For this week’s data journalism assignment I wanted to experiment with a data set that I’m already working with for my research at the Center for Civic Media. We are trying to see if we can extract place information from news articles to map where the news happens, which places news sources pay more or less attention to, and what that coverage looks like. Which parts of the globe do the BBC, the NYT and the Huffington Post each cover? In what proportions? Where don’t they cover? How do they cover those places: with what words and frames?
Check out the visualization here
To answer these questions, the first problem we are facing is a technical one. How do you get reliable place information from unstructured text like news articles? I’ve started evaluating different ways of doing this using a data set of 100 articles each from the New York Times, the BBC and the Huffington Post, taken randomly from the same time period. I wanted to use this week to go deeper with one promising technology for this work called CLAVIN, a Java-based geoparser that integrates with Stanford’s well-known Named Entity Recognizer. I also wanted to experiment further with D3, a JavaScript library for making beautiful infographics, maps and diagrams (here’s a link to the one I modified for this viz).
This may or may not be backwards (from the standpoint of journalism) but I wanted to create the visualization as a way of exploring the data set to see if there is a story. My idea was to create a kind of network map of which countries get mentioned in relation to which others and at what frequency, and to be able to compare those maps across the three news sources. Which countries get mentioned together more frequently? Does the country of origin of the news source affect the country pairs? Which countries get relatively few mentions? Does grouping country mentions like this show us anything we don’t already know?
The way the visualization works is that if two countries are mentioned in the same article they get a link. So if an article mentions the US, Canada and Mexico then there is a link between US-Canada, Canada-Mexico and Mexico-US. Links also have a weight. For example, if an additional article mentions the US and Mexico, then the Mexico-US link gets a weight of 2.
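The linking rule described above can be sketched in a few lines of Python. This is only an illustration, not the code behind the visualization: the function name build_links and the input format (one list of country names per article) are my own assumptions, and I am assuming a pair is counted once per article no matter how many times each country appears in it.

```python
from collections import Counter
from itertools import combinations


def build_links(articles):
    """Count how many articles mention each pair of countries together.

    `articles` is a hypothetical input format: one list of country
    names per article, as a geoparser might produce.
    """
    weights = Counter()
    for countries in articles:
        # Assumption: a pair counts once per article, so duplicates are
        # dropped. Sorting makes (Mexico, US) and (US, Mexico) the same key.
        for pair in combinations(sorted(set(countries)), 2):
            weights[pair] += 1
    return weights


# The example from the text: one article mentions all three countries,
# a second article mentions only the US and Mexico.
articles = [
    ["US", "Canada", "Mexico"],
    ["US", "Mexico"],
]
links = build_links(articles)
for pair, weight in sorted(links.items()):
    print(pair, weight)
```

With these two articles the Mexico-US link ends up with a weight of 2, while the other two links have a weight of 1.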
You can mouse over the country names in the visualization to see the breakdown of how many links a country has to other countries. This is not all that meaningful for countries like the US, the UK and France, which have an abundance of links. But it is more interesting for countries like Puerto Rico, Syria or Venezuela. For example, in the BBC, Syria is most often mentioned in relation to the US, Russia, UK and France. In the NYT, it is most referenced in relation to the US, Russia, China and Egypt. And in the HuffPo, Syria is mentioned in relation to the US, Russia, China and then tied for fourth place are Iran, Venezuela, Lebanon and Italy. So one place a network visualization like this might help us is in seeing which state actors a news source thinks matter in a particular ongoing story like the Syrian Revolution/Civil War.
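A per-country breakdown like the one shown on mouseover could be computed from the weighted links roughly as follows. This is a sketch under my own assumptions: the function names are hypothetical, the links are stored as a dictionary keyed by country pairs, and the countries "A", "B" and "C" with their weights are placeholders rather than real results.

```python
from collections import defaultdict


def partner_breakdown(weights):
    """Turn {(a, b): weight} into {country: {partner: weight}}."""
    breakdown = defaultdict(dict)
    for (a, b), weight in weights.items():
        # Links are undirected, so record the weight in both directions.
        breakdown[a][b] = weight
        breakdown[b][a] = weight
    return breakdown


def top_partners(breakdown, country, n=4):
    """Return the n most frequently co-mentioned partners of a country."""
    partners = breakdown.get(country, {})
    # Sort by descending weight, breaking ties alphabetically.
    return sorted(partners.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Placeholder data, not taken from the actual articles.
weights = {("A", "B"): 3, ("A", "C"): 1, ("B", "C"): 2}
breakdown = partner_breakdown(weights)
top = top_partners(breakdown, "A")
print(top)
```

Note that ties (like the four countries tied for fourth place with Syria in the HuffPo) are ordered alphabetically here, which is an arbitrary choice.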
And while this network visualization is beautiful (even my two-year-old son said “oooooh the sun”), it’s hard to discern meaningful patterns with so many blue lines crossing everywhere, so I created a link number slider to filter them. At higher thresholds like 15+ links (where only country links that have occurred 15 times or more appear), some interesting differences and similarities across the media sources emerge. For all three sources the highest numbers of links are between the US, the UK and Canada, with some countries in Western Europe (France, Italy) also in the mix. None of the three shows equivalent linkages for countries in South America, Asia, and Africa. At the 15-link threshold, the BBC doesn’t show a single link to or from any country on those continents. HuffPo and the NYT show links from the US to Argentina and Brazil, and both show several links from the US to countries on the Asian continent. But of the three sources at this threshold only the NYT shows a link between the US and an African country: Egypt.
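The slider’s filtering logic amounts to keeping only the links at or above a minimum weight. Here is a minimal sketch of that idea; the function name filter_links and the sample weights are made up for illustration and are not the actual counts from the data set.

```python
def filter_links(weights, min_weight):
    """Keep only links whose weight is at least `min_weight`."""
    return {pair: w for pair, w in weights.items() if w >= min_weight}


# Made-up weights for illustration only.
weights = {
    ("UK", "US"): 22,
    ("Canada", "US"): 15,
    ("Egypt", "US"): 14,
    ("Brazil", "US"): 3,
}
visible = filter_links(weights, 15)
print(sorted(visible))
```

Because the rule is “15 times or more”, a link with a weight of exactly 15 stays visible while one with a weight of 14 is hidden.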
So what is this telling us? One thing is, perhaps unsurprisingly, that national news sources tend to focus on themselves. Most of the heavily weighted links are between the news source’s own country and other nations. There are also high linkages between the home country and places where that country is at war (e.g. Iraq & Afghanistan for the NYT and the HuffPo) or where it is thinking of using military force (e.g. Syria or Iran, which show up for HuffPo at the 15-link threshold). Economic power seems to matter as well. In a 2003 paper called Global Attention Profiles, Ethan Zuckerman outlined a model of attention whereby media attention from international news organizations correlates with a country’s GDP. It would seem, at least at first glance, that the linkage model of attention follows this logic as well.
This is an experiment in data journalism
I’d like to just bookend this with a disclaimer that this is not meant to be conclusive in any way. First of all, the data sample (100 articles per source) is too small and covers too short a period of time to actually talk about long-term patterns of comparative geographic media coverage. Secondly, we still need to evaluate the performance of the geoparser for extracting places from news articles. There is certainly at least 10-20% error in how it is identifying places in the articles (which may account for why Italy shows up as having so many links? What is going on with that?). And finally, this was a way to experiment with turning articles into networks of geographic places. I’m still not entirely sure this is a useful methodology but I wanted to see what came out of the experiment to help assess whether it’s useful or not (and would love to hear thoughts and feedback to that effect).
My question for the class is – does it count as data journalism if you experiment with visualizing your data in various ways with various methodologies in order to dig up a story? My guess is that data journalism could happen and unfold in a variety of ways. It seems legit when working with large datasets to experiment with ways to explore that data in a preliminary way that may or may not lead to a narrative, but I’d love to talk more about that unfolding process.