Category Archives: Data Stories
Data on high schools in Boston
For this data piece I will tell the story of finding the answer to this seemingly simple question: “how many high schools are there in Boston?”
I am also calling every high school in Boston and telling the story of trying to collect data by cold calling school receptionists, documenting their responses to quantitative and qualitative questions such as “how many students attend your school?” and “what makes your school special?”
‘It is just an accident’
I could not concentrate on this assignment, mainly because my 8-year-old daughter was recently injured in PE class. Her face was cut below the eyes and she needed 11 stitches at the ER, where we were advised to consider cosmetic surgery after a year to reduce the visibility of the scar.
The PE teacher responsible accused me of having ‘nerves to write rude messages’ after I questioned the safety measures taken in the class. The vice principal of the Graham and Parks School said, ‘That is just an accident. No one is responsible. There are children breaking their legs and arms every year at schools’. Yes, indeed.
“In 2009, an estimated 2.6 million children aged 0–19 years were treated in U.S. EDs for sports- and recreation-related injuries,” classified as “unintentional injury”. According to a simple search I ran on the National Electronic Injury Surveillance System (NEISS), in 2012 alone (there are no later figures) a total of 573 children were injured in PE classes at schools in the US. Well, in 2014, one of the victims was my child.
Sorry for this highly personal and underdeveloped assignment. This was a limited attempt to connect data with the human dimension.
Visualizing GIFGIF by country
Kevin Hu & Travis Rich built a site called GIFGIF, which aims to crowd tag animated gifs with various emotions. From GIFGIF’s website: “An animated gif is a magical thing. It contains the power to convey emotion, empathy, and context in a subtle way that text or emoticons simply can’t. GIFGIF is a project to capture that magic with quantitative methods. Our goal is to create a tool that lets people explore the world of gifs by the emotions they evoke, rather than by manually entered tags.”
For this project, Kevin and I are building a map tool, along the lines of What We Watch, so that people can explore GIFGIF’s current dataset to see which gifs are most representative of certain emotions in each country.
GIFGIF’s data will soon be made publicly available through an API.
Data Story: Why People Take Free Online Courses (MOOCs)
Millions of people have signed up for Massive Open Online Courses, known as MOOCs. Early studies show that the majority of those who have signed up already have a college degree, and most do not opt to pay for a certificate to prove they passed the class. Put simply, they’re not looking to get college credit in any way. So I’m curious to dig deeper into what motivates these online “students.”
I am late to post because I’ve been digging around for a killer data set on this. I’ve made requests to HarvardX and to some researchers who have a large MOOC dataset, but so far no one has been willing to share their raw numbers. But HarvardX has published some demographic and survey data (not much). My sense, though, is that their data does not answer the question very well (most MOOC surveys only offer a few multiple choices on motivation).
So for the assignment, I’m focusing on playing around with fuzzier “data” – the student postings to forums in a MOOC. In many MOOCs, students post short introductions in the forums at the beginning of the term, usually saying why they are taking the course. I’ll analyze the intro discussion postings in one MOOC and group them into broad categories (my categories won’t capture everything, but there are definitely clear patterns in the responses).
My plan is to pick an astronomy course on edX that just started. https://courses.edx.org/courses/ANUx/ANU-ASTRO1x/1T2014/discussion/forum/i4x-edx-templates-course-Empty/threads/533080e801772bb02e00087f
There are only about 200 intro posts, so it should be doable in the short time frame.
I plan to pull out the one student post that best exemplifies each category I create. The interface will be a simple pie chart showing the percentage of posts giving each reason for taking a MOOC; when you click on a specific group/color, you’ll be taken to that person’s intro post so viewers can “meet” them.
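As a rough illustration of the tallying step behind the pie chart, here is a minimal Python sketch. It assumes each intro post has already been hand-labeled with one category; the function name and the category labels are made up for the example and are not taken from the course data.

```python
from collections import Counter


def category_shares(labels):
    """Return {category: percent of posts}, largest category first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cat: round(100 * n / total, 1) for cat, n in counts.most_common()}


# Hypothetical hand-assigned labels, one per intro post.
example_labels = [
    "curiosity", "career", "curiosity", "teaching",
    "curiosity", "career", "curiosity", "curiosity",
]
example_shares = category_shares(example_labels)
for category, percent in example_shares.items():
    print(f"{category}: {percent}%")
```

The percentages can go straight into whatever charting tool ends up drawing the pie; the representative post for each slice would still be chosen by hand.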
I’m certainly open to suggestions on tools, critique, etc.
Mapping the conflict in Syria
For the data story assignment, I would like to present data that I found on the current conflict in Syria while working on a GIS project. I will be using crowdsourced maps such as Syria Tracker and Open Street Map, which are based on the work of local volunteers, to map the conflict in Syria.
I am currently learning how to use GIS, which is the study of geospatial information, and I thought it would be interesting to use the geospatial data I am working with to tell a story: a story of conflict and its link to geographic features.
Syria Tracker has geospatial data on the number of deaths “resulting from the Assad regime”, recorded by volunteers on the ground since March 2011. Although this data must be taken with a grain of salt, it gives a good overview of patterns such as female as opposed to male casualties, and the location of casualties amongst the opposition forces and the population living within opposition control.
Open Street Map, on the other hand, gives an overview of the main roads, waterways and land cover in Syria. By overlaying different data sets, one could visualize whether there is any link between conflict density and proximity to roads, for example.
Below is an example of a crowdsourced map which shows the main roads in the country. I plan on producing a more comprehensive map for this assignment but please let me know if you have any suggestions for improvement.
— Elissar
Data project: Internet in Romania
I will look at data about internet penetration in Romania, and the paradox of being one of the countries with the fastest Internet connections yet having one of the lowest rates of penetration in Europe (second to last in the EU28, for sure).
I’m trying to tie that with what I can find on e-governance and the relationship citizens have with authorities through digital means. As a new generation of civic-minded activists is looking to online for offline change, I thought it’d be interesting to survey the infrastructure landscape.
I’ll use Eurostat data for the most part, and I’ll return with questions as they arise, which I’m sure they will.
How big is the gap between health news and research?
For my data storytelling assignment, I’d like to see how health journalism coverage in the US compares to mortality data and health-research spending. I hope to tell a story about whether there is adequate coverage in the US media of things that harm and kill people here, and how that maps against where the government invests in health research to find out whether there are under-covered areas of science.
NIH has data on spending by category for 2010 and beyond:
http://report.nih.gov/categorical_spending.aspx
CDC has mortality data by cause (most recent being 2010):
http://wonder.cdc.gov/mcd.html
I hope to work with a programmer to access all news articles related to health from LexisNexis and use automated clustering to identify the most covered topics.
I’ll probably use Many Eyes to visualize the data in a bubble chart so that comparisons among research spending, journalism, and mortality can be easily made. But suggestions welcome!
Data stories: Narrative of education
I am working on a data story based on the narrative of education in Pakistan, particularly how we talk about education and how the narrative has changed in recent times. My data corpora include education stories curated by Alif Ailaan, a Pakistani political advocacy group, and mainstream media streams curated through Media Cloud. I am going to look at the latent semantic structure of the text corpus provided by Alif Ailaan. In addition, I will juxtapose specific events with text highlights to understand the framing around the narrative of education.
A Look at Occupy Boston’s Mailing Lists
Part of an ongoing project by the author to describe Occupy Boston’s mailing lists using network analysis.
The Story
Can we learn something about a social movement by looking at the digital tools it uses to organize? The Occupy Movement was defined as much by its highly visible occupation tactic as by its use of new digital media to organize and mobilize. The success of the movement was really to inject new language into our society about inequality. Think the 1% and the 99%. This was achieved through a sustained campaign of media activism. Language was developed to describe the inequalities between the common man and the rich, embodied by Wall Street — the perpetrators of the recent global financial crisis, and various forms of media were created to get the message out. The occupations then served to keep that message in mainstream media as they attracted sustained coverage themselves for both good and bad reasons.
We see this play out in the network of mailing lists. Occupy Boston’s general Media list had the most messages posted to it during the period September 2011 – October 2012, consistent with what we would expect from a movement focused on media activism. In terms of expansiveness of user participation, the Ideas mailing list takes the crown; this is where much of the early intellectual labor on defining Occupy Boston’s mission and direction was hashed out. In the data, we also see a lot of overlap amongst the mailing lists. All but one list (OB Updates, which was a unidirectional announcement list) share many active users with other lists. The median degree is near 20; with 22 lists the maximum possible degree is 21, so this is almost a perfect mesh network. This suggests that these public mailing lists, although sometimes dedicated to very specific themes or “committees,” enjoyed a lot of interconnection. Between the major mailing lists (seen as an outer ring on the network), which are more general interest, we see 100+ shared users on their mutual edges. This number drops off for some of the more niche mailing lists and could represent a few key organizers or overzealous mailing list participants. A more qualitative study is needed to tell the rest of this story.
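For readers who want to check a “near mesh” claim like this on their own edge list, here is a small Python sketch that counts each list’s degree (the number of other lists it shares at least one user with) and takes the median. It is an illustration only: the list names and edges are toy data, not the Occupy Boston network, and the function name is made up.

```python
from statistics import median


def degrees(nodes, edges):
    """Map each node to the number of distinct neighbors it has."""
    neighbors = {node: set() for node in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {node: len(linked) for node, linked in neighbors.items()}


# Toy data: three lists that all share users, plus one announcement-only list.
toy_nodes = ["media", "ideas", "tech", "updates"]
toy_edges = [("media", "ideas"), ("media", "tech"), ("ideas", "tech")]

toy_degrees = degrees(toy_nodes, toy_edges)
max_possible = len(toy_nodes) - 1  # degree of every node in a full mesh
print(toy_degrees, "median:", median(toy_degrees.values()), "max:", max_possible)
```

The closer the median sits to `max_possible`, the closer the network is to a full mesh.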
Quick Statistics
- Mailing Lists: 22
- Total Messages: 36,303
- Total Users: 922 (unique email addresses)
How I Made the Network Graph
I downloaded the Mailman archives from September 2011 to October 2012 from Occupy Boston’s public mailing lists, i.e. those that do not require moderator access to join. I wrote a Python script to parse the archives, which are in a standard mbox format, into an SQLite database. I devised a schema with a standard set of ids for mailing lists and individual users, and used these ids to extract a network of users shared among different mailing lists with a simple SQL query, storing the resulting nodes (mailing lists) and edges (shared user relationships) in CSV files.
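The general shape of such a pipeline can be sketched with Python’s standard `mailbox` and `sqlite3` modules. This is not the original script: the schema, table names and function names below are assumptions made for illustration, the sketch assumes the From headers contain plain email addresses (obfuscated “user at domain” addresses would need normalizing first), and it counts shared users directly in SQL rather than emitting one edge per shared user.

```python
import mailbox
import sqlite3
from email.utils import parseaddr

# Hypothetical schema: one id per mailing list, one id per sender address.
SCHEMA = """
CREATE TABLE IF NOT EXISTS lists (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, email TEXT UNIQUE);
CREATE TABLE IF NOT EXISTS messages (
    id INTEGER PRIMARY KEY,
    list_id INTEGER REFERENCES lists(id),
    user_id INTEGER REFERENCES users(id)
);
"""

# Each pair of lists, with the number of distinct senders active on both.
SHARED_USERS = """
SELECT a.name, b.name, COUNT(*) AS shared
FROM (SELECT DISTINCT list_id, user_id FROM messages) AS x
JOIN (SELECT DISTINCT list_id, user_id FROM messages) AS y
  ON x.user_id = y.user_id AND x.list_id < y.list_id
JOIN lists AS a ON a.id = x.list_id
JOIN lists AS b ON b.id = y.list_id
GROUP BY x.list_id, y.list_id
ORDER BY a.name, b.name
"""


def get_id(conn, table, column, value):
    """Return the row id for value, inserting it first if it is new.

    table and column are fixed strings from this module, never user input.
    """
    conn.execute(f"INSERT OR IGNORE INTO {table} ({column}) VALUES (?)", (value,))
    row = conn.execute(f"SELECT id FROM {table} WHERE {column} = ?", (value,))
    return row.fetchone()[0]


def load_archive(conn, list_name, mbox_path):
    """Store one row per message, keyed by mailing list and sender."""
    list_id = get_id(conn, "lists", "name", list_name)
    for msg in mailbox.mbox(mbox_path):
        sender = parseaddr(str(msg.get("From", "")))[1].lower()
        if not sender:
            continue  # skip messages with no usable sender address
        user_id = get_id(conn, "users", "email", sender)
        conn.execute(
            "INSERT INTO messages (list_id, user_id) VALUES (?, ?)",
            (list_id, user_id),
        )
    conn.commit()


def shared_user_edges(conn):
    """Return (list_a, list_b, shared_user_count) rows, one per list pair."""
    return conn.execute(SHARED_USERS).fetchall()
```

A typical run would open a database with `sqlite3.connect(...)`, call `conn.executescript(SCHEMA)`, call `load_archive` once per downloaded archive, and write the rows from `shared_user_edges` to a CSV file.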
I imported the nodes and edges files into Gephi after hand-editing their column names to conform to Gephi’s standard. Gephi automatically aggregated the edges between nodes to create weighted edges representing the total number of shared users. I adjusted the layout in Gephi to represent the weighted edges using different thicknesses. The nodes were scaled by total users active in each mailing list, an attribute extracted from my database, and their color was scaled on a pale to dark red spectrum according to the total number of messages during the period of analysis, also extracted from the database. I used the ForceAtlas 2 layout algorithm, which forces the most central nodes out of the center for easier comprehension, and then hit the graph a few times with the Expansion layout algorithm to give extra space between nodes.
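The hand-editing and aggregation steps could also be done in code before import. The Python sketch below collapses repeated list pairs into weighted edges and writes them under the `Source`, `Target`, `Type` and `Weight` headers that Gephi’s spreadsheet import recognizes. The function names and sample pairs are made up for illustration, and this is not how the graph above was actually prepared.

```python
import csv
import io
from collections import Counter


def aggregate_edges(pairs):
    """Collapse repeated (list_a, list_b) pairs into weighted undirected edges."""
    weights = Counter(tuple(sorted(pair)) for pair in pairs)
    return sorted((a, b, weight) for (a, b), weight in weights.items())


def write_gephi_edges(edges, handle):
    """Write edges as CSV with the column headers Gephi expects."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Source", "Target", "Type", "Weight"])
    for source, target, weight in edges:
        writer.writerow([source, target, "Undirected", weight])


# Toy input: one row per shared user, in whichever order the pair appeared.
sample_pairs = [("media", "ideas"), ("ideas", "media"), ("media", "tech")]
sample_edges = aggregate_edges(sample_pairs)

buffer = io.StringIO()
write_gephi_edges(sample_edges, buffer)
print(buffer.getvalue())
```

Sorting each pair before counting treats A–B and B–A as the same edge, which matches the mutual nature of a shared-user relationship.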
Using the Sigmajs Exporter plugin, I exported the network so that it could be viewed on the web as an interactive visualization. I customized the default JavaScript and CSS in several ways to display the network graph more clearly. In the config.json file, I manipulated the graph properties to create greater contrasts between node sizes and edge weights, and adjusted the label threshold under drawing properties to ensure all nodes were labeled. I modified the sigma.js default for edge color, forcing edges to be a standard grey rather than the color of their source node. This corrects for what is actually an undirected network (shared user relationships are mutual) being interpreted as directed. Finally, I forced the “Information Pane” to display the edge weights (number of shared users) between the active node and its neighbors, next to their listed names.