Visualizing newspaper words

For this assignment I worked on a project that evolved and was included in a larger one. I’m currently working with my hometown university (ITESO) to understand what the newspapers are publishing about political candidates. With elections for mayors and the local congress 60 days away, the team in Mexico is collecting campaign-related news every day.

With the news as a data set, I processed it in Wordij, a tool that generates semantic networks from .txt files. The software also outputs a count of words as a CSV file. These files were processed in Excel and then visualized in Tableau. With all the data, we can run queries to find out how many times each newspaper has mentioned a candidate, and by adding data every day we are getting a larger picture of whom the newspapers are talking about.
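The kind of query described above (how many times each newspaper mentions a candidate) can be sketched in a few lines of Python. This is only an illustration, not the Wordij/Excel/Tableau pipeline we actually used: the function name `count_mentions`, the newspaper names and the sample sentences are all made up for the example.

```python
import re
from collections import Counter

def count_mentions(articles, terms):
    """Count how often each term appears in each newspaper's articles.

    articles: iterable of (newspaper, text) pairs.
    terms: candidate or party names to look for (matched as whole words,
           ignoring case).
    Returns {newspaper: Counter({term: count})}.
    """
    wanted = {t.lower(): t for t in terms}
    counts = {}
    for newspaper, text in articles:
        per_paper = counts.setdefault(newspaper, Counter())
        # \w+ is Unicode-aware in Python 3, so accented Spanish letters stay inside words
        for word in re.findall(r"\w+", text.lower()):
            if word in wanted:
                per_paper[wanted[word]] += 1
    return counts

# Hypothetical sample articles, for illustration only
sample = [
    ("Diario A", "Alfaro presentó su plan. Villanueva respondió a Alfaro."),
    ("Diario B", "Petersen y Villanueva debatieron ayer."),
    ("Diario A", "El PRI y el PAN criticaron a Alfaro."),
]
mentions = count_mentions(
    sample, ["Alfaro", "Villanueva", "Petersen", "PRI", "PAN", "MC"]
)
for newspaper, counter in sorted(mentions.items()):
    print(newspaper, dict(counter))
```

Matching whole words keeps short party names like “MC” from being counted inside longer words, at the cost of missing multi-word names.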

As part of another project I’m involved with that monitors the political campaigns, we decided to include the data-viz tool on the site. All the information is in Spanish, but anyone who wants to try the tool can use the right-side panel to adjust dates, set the frequency of words, or search for specific words. For example, search for the political parties “PRI”, “PAN” or “MC”, or for the candidates’ last names “Villanueva”, “Alfaro” or “Petersen”. The visualization is here.

Visualization showing queries for “Alfaro”, “Villanueva” and “Petersen”

Narrative of education in Pakistani media sources

The United Nations has recently announced that international donors have pledged $1 billion to provide education to millions of children in Pakistan. Nearly 25 million children are currently out of school in Pakistan, and about seven million of these children have yet to receive primary schooling, according to a recent report prepared by the Society for the Protection of the Rights of the Child (SPARC).

Education in Pakistan has long been in a state of crisis. After Musharraf’s regime, Pakistan resumed elections in 2008, and the media, the judiciary and other democratic institutions have strengthened since then. What does the narrative of education look like today, and what kind of discourse underlies it? These are the questions that we explore in this inquiry.

In order to understand the narrative of education in Pakistan, we applied an unsupervised learning algorithm to the text corpus provided by Alif Ailaan, an education advocacy group in Pakistan. The corpus comprises education stories curated from Pakistani media sources (including Dawn, The Express Tribune, Nation, The News and Pakistan Today) since Feb. 2013. The purpose of using an unsupervised learning algorithm was to delineate the underlying topical themes present in the text corpus.

We extracted five topic structures using our learning algorithm. The intuition behind the algorithm is that documents exhibit multiple topics. For instance, in a single document, ‘Malala’, ‘woman’ and ‘education’ are lumped together as one topic, and ‘federal’, ‘funding’ and ‘government’ are grouped into another. Using this technique, we extracted the keywords associated with the five topics that our algorithm discovered.

Below is a bubble graph of the entire topical space. Each bubble shows the proportional representation of a keyword in a topical cluster, and clusters are differentiated by color.

Topics from education corpus

Now we will look at each topic individually. We have labeled the first topic “Federal Education” because it loosely exhibits the discourse surrounding federal policies and issues on education, in the form of keywords like ‘federal’, ‘administration’, ‘CADD’ and ‘FDE’. Both the Capital Administration and Development Division (CADD) and the Federal Directorate of Education (FDE) are constitutional bodies that are responsible for federal functions on education.

Topic: Federal Education

We have labeled the second cluster “Higher Education” since it contains terms like ‘university’, ‘international’, ‘technology’, ‘faculty’, and ‘science’, which are characteristic of higher education in Pakistan. The Higher Education Commission of Pakistan (‘HEC’) is a constitutionally established institution that drives higher education efforts in Pakistan.

Topic: Higher Education

We have labeled the third cluster “Primary Education” because of terms like ‘child’, ‘primary’, ‘enrollment’, ‘school’, ‘literacy’, ‘teacher’, and ‘english’. Last year, successful primary enrollment drives took place at the provincial level in Pakistan to register out-of-school children in public schools.

Topic: Primary Education

The fourth cluster of topics, which we have labeled “Malala”, is the most telling one. Malala became “the spokesperson for a generation of girls” after being shot in the head by the Taliban. Almost half of rural young women in Pakistan have never attended school, according to a 2012-2013 UNESCO report. Malala is the only personal name that appears in the topical space on education in Pakistan. This cluster of words is also marked by tension between heterogeneous discourses in Pakistan, including Talibanization, religion, security, peace, rights, and gender, highlighting the disruptive power of the “Malala” narrative on the discourses around education.

Topic: Malala

Lastly, the fifth cluster of topics includes province-related terms such as ‘sindh’, ‘punjab’, ‘local’, ‘district’, and ‘provincial’. We have labeled this topic “Provinces and Education”.

Topic: Provinces and Education

In the chart below we show a timeline of the news stories curated in the Alif Ailaan corpus. Malala gave her first speech at the United Nations in July 2013; an increase in the number of stories on education in July could be related to that speech. Similarly, spikes in Aug. 2013 and Sept. 2013 could be explained by enrolment drives in Punjab and Khyber Pakhtunkhwa provinces. These campaigns aimed to enroll out-of-school children in public schools. Finally, the spike in Feb. 2014 could be related to the launch of the Annual Status of Education Report (ASER), which highlighted Pakistan’s education crisis and made headlines in national newspapers. An in-depth analysis of these correlations is needed to provide more concrete insights on these trends.

News stories timeline

In summary, these preliminary findings suggest that the current narrative of education in the Pakistani media landscape is rich and diverse and covers the entire gamut of concerns around the education crisis. The topics we discovered suggest that the media attention on education is produced by an active state of affairs.

GIFGIFmap

Background: Kevin Hu & Travis Rich built a site called GIFGIF, which aims to crowd tag animated gifs with various emotions. From GIFGIF’s website: “An animated gif is a magical thing. It contains the power to convey emotion, empathy, and context in a subtle way that text or emoticons simply can’t. GIFGIF is a project to capture that magic with quantitative methods. Our goal is to create a tool that lets people explore the world of gifs by the emotions they evoke, rather than by manually entered tags.” As we know, animated gifs are also a popular storytelling mechanism for social news and entertainment websites.

The cultural phenomenon of using animated gifs to express emotions has been the subject of numerous journalistic inquiries:

Fresh From the Internet’s Attic – NYTimes

Christina Hendricks on an Endless Loop: The Glorious GIF Renaissance – Slate.com

GIF hearts Tumblr: a fairytale for the Internet age – Wired.co.uk

Visualization project for this week: Kevin, Travis, and I built a map tool so people can explore GIFGIF’s current dataset to see which gifs are most representative of certain emotions across different countries. Out of 1.8 million votes, 1.4 million had IP data, which links each vote to the location of the voter. GIFGIFmap can be found here.

In a future version, we would like to show the top gifs per emotion that countries have in common with each other, and which top gifs are unique to each country (along the lines of What We Watch). However, there are limitations to the GIFGIF data set in terms of global coverage. For example, the top 21 countries account for 92% of the votes. Additionally, we excluded countries that had fewer than 10,000 total votes across all categories, so as to avoid making generalizations based on limited data. We chose to include the number of votes per country (per emotion) to make the data set more transparent in terms of representation.
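The two coverage checks described above (dropping countries under the 10,000-vote threshold and measuring how much of the data the top countries account for) can be sketched as follows. This is a rough illustration rather than the code behind GIFGIFmap: the function name `summarize_votes` and the vote counts in the example are hypothetical.

```python
MIN_VOTES = 10_000  # exclusion threshold mentioned in the post

def summarize_votes(votes_by_country, min_votes=MIN_VOTES, top_n=21):
    """Filter out low-coverage countries and measure vote concentration.

    votes_by_country: {country_code: total votes across all emotions}.
    Returns (included, top_share), where `included` keeps only countries
    with at least `min_votes` votes and `top_share` is the fraction of all
    votes cast from the `top_n` countries with the most votes.
    """
    total = sum(votes_by_country.values())
    included = {c: n for c, n in votes_by_country.items() if n >= min_votes}
    top = sorted(votes_by_country.values(), reverse=True)[:top_n]
    top_share = sum(top) / total if total else 0.0
    return included, top_share

# Made-up vote counts, for illustration only
example = {"US": 500_000, "GB": 120_000, "BR": 40_000, "IS": 9_000, "MT": 1_000}
included, top_share = summarize_votes(example, top_n=2)
print(sorted(included))            # countries that pass the threshold
print(f"{top_share:.1%} of votes come from the top 2 countries")
```

Showing the per-country vote count next to each result, as we do on the map, lets a reader judge how much weight a given country's top gif deserves.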

We think the tool we are building could complement existing stories about the phenomenon of using animated gifs to communicate (stories like the ones we linked to above).

These are some potential questions that we hope journalists could explore using a map interface to the GIFGIF dataset:

1) Do people from different countries interpret the emotional content of gifs differently?

2) If there are variances in interpretation, are there clusters of countries that have more similar interpretations? Do these match up with proximity, or immigration patterns?

3) What top gifs per emotion are unique to a given country?

Note: GIFGIF’s data will soon be made publicly available through an API.

Boston high schools: by the numbers

My Quest for Truth

It all started with a simple question: How many high schools are there in Boston?

High-schools.com lists “all public and private high schools located in Boston” and says there are 17. Greatschools.org lists 32 public and private high schools. US News says there are 32 schools just within the Boston Public School District. Wikipedia says 33. The Massachusetts Department of Education lists 42 public and private.

I compiled a list of 56.

Why the discrepancy over a seemingly basic question? Is it because

  • We can’t agree on what “high school” means?
  • We can’t agree on what “in Boston” means?

Charter schools, special education, adult education, vocational training, private schools, religious schools: there are many ways to designate what is and is not a “high school” that could explain the differences.
Boston public schools, Boston city limits, Greater Boston: the discrepancy may also be caused by varying definitions of what it means for a high school to be “in Boston.”

I aim to create an authoritative central portal that lists all high schools in Boston. I will continue exploring this in future assignments (talk to me if you want to collaborate!).

Cold Calling For Data

To preempt a similar situation when trying to figure out how many high school students there are in Boston, this time I chose a bottom-up rather than a top-down approach. I picked up the phone and began cold calling every high school on my list. I asked every school receptionist two questions:

  • How many students go to your school?
  • What makes your school special?

I chose these two questions because I thought they would be a good foundation to explore both quantitative and qualitative data, and the answers could give me potential follow-on questions if I continue focusing on Boston high schools.

Another Course to College: their Annual Report states 220 students; their receptionist told me 224.

Boston Adult Technical Academy: their Annual Report states 257 students; their receptionist told me 300.

Boston Arts Academy: their Annual Report states 420 students; their receptionist told me 400.

Boston International High School: their Annual Report states 359 students; their receptionist told me 500.

… and the list goes on. I could present more data, but I’m not sure what story I want it to tell yet. Yes, I could add up all the numbers and create “the authoritative Julia guide to how many high school students there are in Boston.” Yes, I could put together another “a-ha” moment showing the discrepancies in calculating this number across organizations and websites. But I don’t want to present a repeat of other dry, going-nowhere data pieces.

Telling a Story

I recently read the book Made to Stick: Why Some Ideas Survive and Others Die, which drove home for me the importance of telling a compelling story. With the school mapping project I am working on, I have been more focused on organizing and presenting the information and hoping others will find stories to tell, rather than having to tell the story myself. My model has been Wikipedia, which presents information in a way that is useful to the reader. Would you say that Wikipedia tells a story?

My aim has been to build a school mapping platform using data and communication tools that are informative and useful. I thought that would be enough. What I’m struggling with now is how to build a platform that tells a story, and what story I want it to tell.

Finding data on the Syrian conflict

Finding credible data on the conflict in Syria has been a difficult endeavour for both journalists and policy-makers. One approach many have found useful is crowdsourced maps. Syria Tracker and the Women Under Siege Syria chapter are the most noteworthy crowdsourcing initiatives that aim to map the conflict in Syria with the help of local volunteers.

Syria Tracker has considerable geospatial data on the number of civilian deaths, recorded by volunteers, and “resulting from the Assad regime” since March 2011. Although this dataset must be taken with a grain of salt, as it only represents the work of activists working against the government, it gives detailed accounts of the causes of death (air strike, gunshot, bomb or the use of chemical weapons) and the victim’s identity (gender and age).

Women Under Siege monitors acts of sexual violence reportedly committed against men and women. Open Street Map is another crowdsourcing initiative, on a global scale, to which geospatial experts contribute for the sake of good mapping. In Syria, Open Street Map offers comprehensive maps of the country’s main roads, natural resources and the locations of facilities (such as hospitals and schools).

By loading this geospatial data into GIS (geographic information system) software, in this case ArcMap, one can see whether different aspects of the conflict are correlated.

The following map, for example, shows the location of refugee camps surrounding the Syrian borders, the major border crossings into the country and the main IDP (Internally Displaced Persons) camps within the country. All camps are close to transit points and to major roads.

Refugee Camps and border

The next map shows the location of all of the IDP camps and the location of the main waterways (rivers and lakes) within Syria. This is important because it shows that the livelihoods of the internally displaced are closely linked to access to water (as you can see, most camps are situated near a waterway).

IDP Camps and Water

In the end, geospatial data on conflict areas (and in this case on Syria) is bountiful. However, this type of data is not readily accessible to the general public: I was able to acquire all of the data above for free, but one needs technical skills to make sense of it. GIS is one way, among many others, to spatially visualize data.

Data story: Internet use in Romania

The approach I took was to look at the data available on Internet use in Romania, compare it to countries in the EU28, interrogate it, and come up with potential stories. My initial interest was to find something on citizen involvement.

This post is more an account of the process than a finished story; it better reflects the lessons I learned.

My first and most immediate lesson is that tracking down good data and making a relevant data set, even when the information is publicly available, is time-consuming (especially when you use different sources). The other lesson was even more humbling. Once I gathered the data I needed, I realized that combining it, merging it, and illustrating the new set takes skills that I don’t yet have (both technically and creatively), and the learning curve was too steep to master for this assignment. I will keep trying.

What I used and looked through: data from Eurostat (the EU’s statistical office), ITU (UN’s information technology arm), UNESCO, Net Index etc. What was great, although it took some time to realize, is that ITU and Net Index make some of their data available on Google’s Public Data, which comes with handy visualization tools. Eurostat also creates visualizations, but they are less appealing.

My first step was to rank the percentage of individuals using the Internet in the EU28 (ITU, 2012), a dataset which has Romania coming in last. (The Eurostat numbers for 2013 show it has now passed Greece and Bulgaria, so one potential question would be whether the penetration rate is accelerating; it was almost flat in 1997-2003, the years when most countries had their boom.) Another interesting question that comes up looking at the data is whether EU accession (Romania and Bulgaria joined in 2007) has sped up Internet penetration. Countries now vying for accession (Turkey, Macedonia, Serbia) have even lower usage.

I then looked at download speed in the EU28 on Net Index, knowing I’d find the reverse. According to Net Index, Romania has the third fastest download speed in the world. This discrepancy remains staggering, and the potential causes/correlations are interesting to investigate: a #8 world ranking for originating attack traffic (Akamai data), a high level of piracy (BSA data), a strong engineering culture and a budding startup culture, and a hacker/cyber crime base.

Another question I’d explore is whether such low Internet use might be explained by the urban/rural divide, which is still about 50/50 (53/47 to be more precise) in Romania compared to the EU28 and, more interestingly, has held steady for the past 15 years. Most countries, according to UNESCO data, have experienced urban migration.

The speed/penetration difference is even more interesting if you look at other indicators in which Romania continues to be reliably at the bottom: e-commerce and regular use (daily and weekly). This data and the accompanying visualizations were generated from Eurostat data.

Ecommerce

RegularUse

Predictably, Romania also ranks last in e-governance/interaction with public authorities.

PublicAuthorities

PublicAuthoritiesMAP [The gray circle is that country’s level of interaction. The red outline is the EU28 average.]

Looking at the public’s interaction with government, a host of other questions and stories spring to mind:

  • what’s with the gigantic outlier in 2012? Is it a question of measurement? Did something happen? Was data misreported (intentionally or not)?
  • does this lack of opportunity appear in any candidate’s discourse/promises (presidential elections are slated for the fall)?
  • what explains these numbers? How do they explain the gigantic recent failure of the e-Romania portal, on which the Romanian government spent 8 million euros?
  • what does this mean for initiatives such as ReStart Romania, which aim to use technology to further public dialogue and change?

Other directions suggested by the data:

  • there seems to be a digital divide. Does it track along geography (as mentioned above)? What about income? How does it manifest itself in different generations (e.g. 50 percent of people between the ages of 35 and 45 go online every week, compared with the EU28 average of 82 percent)?
  • the computer/internet literacy market (both public and private). Eurostat data shows only 26 percent of Romanians have basic computer skills (the EU28 average is 60 percent). What programs exist? Are they working – why/why not?

Journalism check-up: Are reporters doing a good job of covering health?

By Ali and Julia

It’s no secret that journalists fall into many traps when covering the contradictory and sometimes convoluted area of health research. As a 2013 Columbia Journalism Review article—titled ‘Survival of the Wrongest’—summed up: “Even while following what are considered the guidelines of good science reporting, (journalists) still manage to write articles that grossly mislead the public, often in ways that can lead to poor health decisions with catastrophic consequences.”

This can take the form of reporting science out of context, misinterpreting conclusions, or missing big stories altogether. So we set out to gather data on the places where health journalism goes wrong.

We had a grim starting place: We looked at the leading causes of death in America and compared that to how well the most comprehensive national newspaper—The New York Times—covered related stories. We wanted to see whether public health issues that matter to people are under-reported.

First, we gathered mortality data from the CDC’s most recent National Vital Statistics Report, which included 2010 deaths:

Cause of death                                          Number of deaths  Percent of total deaths
All causes                                              2,468,435         100.0
Heart disease                                           597,689           24.2
Cancer                                                  574,743           23.3
Chronic lower respiratory diseases                      138,080           5.6
Stroke (cerebrovascular diseases)                       129,476           5.2
Accidents (unintentional injuries)                      120,859           4.9
Alzheimer’s disease                                     83,494            3.4
Diabetes                                                69,071            2.8
Nephritis, nephrotic syndrome, and nephrosis            50,476            2.0
Influenza and Pneumonia                                 50,097            2.0
Intentional self-harm (suicide)                         38,364            1.6
Septicemia                                              34,812            1.4
Chronic liver disease and cirrhosis                     31,903            1.3
Essential hypertension and hypertensive renal disease   26,634            1.1
Parkinson’s disease                                     22,032            0.9
Pneumonitis due to solids and liquids                   17,011            0.7
All other causes                                        483,694           19.6

Here, the leading causes of death are represented in a bubble chart; the biggest bubbles relate to America’s leading killers: heart disease, cancer, chronic lower respiratory disease, stroke, accidents, et cetera.

cause of death data

Then, we queried The New York Times corpus for key search terms related to the top 15 causes of death in America. Here, we found the number of 2010 stories that mention those key words:

Times stories in 2010  Keywords
1,630                  “cancer”
1,470                  “heart disease”
527                    “diabetes”
456                    “alzheimer”
331                    “suicide”
216                    “stroke”
214                    “parkinson’s”
183                    “accident”
121                    “liver disease”, “cirrhosis”
95                     “influenza”, “pneumonia”
88                     “hypertension”, “renal disease”
27                     “respiratory diseases”, “copd”
2                      “nephritis”
1                      “Septicemia”
1                      “Pneumonitis”

We then created an index to represent the media attention focused on America’s leading killers. We did this by dividing the number of New York Times stories by the number of deaths in America and then multiplying that number by 100,000. So: (New York Times stories/deaths)*100,000. Here’s what we found:

Cause of death                                          Media attention index
Parkinson’s disease                                     971
Intentional self-harm (suicide)                         863
Diabetes                                                763
Alzheimer’s disease                                     546
Chronic liver disease and cirrhosis                     379
Essential hypertension and hypertensive renal disease   330
Cancer                                                  284
Heart disease                                           246
Influenza and Pneumonia                                 190
Stroke (cerebrovascular diseases)                       167
Accidents (unintentional injuries)                      151
Chronic lower respiratory diseases                      20
Pneumonitis due to solids and liquids                   6
Nephritis, nephrotic syndrome, and nephrosis            4
Septicemia                                              3
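
For readers who want to check the arithmetic, here is a minimal Python sketch of the index calculation, (New York Times stories/deaths)*100,000, using the death and story counts from the tables above. It is a reconstruction for illustration, not the spreadsheet we actually worked in; the names `DEATHS`, `STORIES` and `attention_index` are ours, and rounding to the nearest whole number is an assumption.

```python
# 2010 deaths per cause, from the CDC table above
DEATHS = {
    "Heart disease": 597_689,
    "Cancer": 574_743,
    "Chronic lower respiratory diseases": 138_080,
    "Stroke (cerebrovascular diseases)": 129_476,
    "Accidents (unintentional injuries)": 120_859,
    "Alzheimer's disease": 83_494,
    "Diabetes": 69_071,
    "Nephritis, nephrotic syndrome, and nephrosis": 50_476,
    "Influenza and Pneumonia": 50_097,
    "Intentional self-harm (suicide)": 38_364,
    "Septicemia": 34_812,
    "Chronic liver disease and cirrhosis": 31_903,
    "Essential hypertension and hypertensive renal disease": 26_634,
    "Parkinson's disease": 22_032,
    "Pneumonitis due to solids and liquids": 17_011,
}

# 2010 Times stories matching each cause's keywords, from the table above
STORIES = {
    "Heart disease": 1_470,
    "Cancer": 1_630,
    "Chronic lower respiratory diseases": 27,
    "Stroke (cerebrovascular diseases)": 216,
    "Accidents (unintentional injuries)": 183,
    "Alzheimer's disease": 456,
    "Diabetes": 527,
    "Nephritis, nephrotic syndrome, and nephrosis": 2,
    "Influenza and Pneumonia": 95,
    "Intentional self-harm (suicide)": 331,
    "Septicemia": 1,
    "Chronic liver disease and cirrhosis": 121,
    "Essential hypertension and hypertensive renal disease": 88,
    "Parkinson's disease": 214,
    "Pneumonitis due to solids and liquids": 1,
}

def attention_index(stories, deaths):
    """Stories per 100,000 deaths, rounded to the nearest whole number."""
    return round(stories / deaths * 100_000)

index = {cause: attention_index(STORIES[cause], DEATHS[cause]) for cause in DEATHS}
ranking = sorted(index.items(), key=lambda item: item[1], reverse=True)
for cause, value in ranking:
    print(f"{value:>4}  {cause}")
```

Scaling by 100,000 simply turns very small ratios into readable whole numbers; it does not change the ranking.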

bubble of representation

As you can see, the big bubbles (Parkinson’s, suicide, diabetes, Alzheimer’s) suggest there’s a lot of coverage relative to the number of deaths, while barely visible bubbles mean these killers are under-covered by the media compared to mortality. If these data are correct, the third leading cause of death in America (COPD) was hardly covered in the newspaper, and neither was the fifth (accidents). Meanwhile, heart disease and cancer, the top killers, got relatively little attention when compared to Parkinson’s, Alzheimer’s, diabetes, and suicide.

So what does this mean?

The focus by the media on chronic diseases and diseases of aging—instead of, for example, accidents and COPD—probably reflects the interests of the more mature readership of the Times and the emphasis in newsrooms on “news you can use,” health journalism commentator Gary Schwitzer said.

He also offered another interpretation: This exercise may reflect the work of advocacy campaigns. “Maybe, in this sample, advocacy groups for Parkinson’s, liver disease, suicide, flu, diabetes, Alzheimer’s, et cetera, were just that much more successful in priming the pump by getting stuff in the New York Times.”

What’s more, our data might not be representative. Schwitzer noted that searching by key terms could turn up spurious correlations. For example, “Suicide showing up as a key word may mean that it comes from all sorts of general news stories. That may not be comparable to stroke showing up as a keyword from a stroke study. Yes, it’s what’s in the paper, but it’s not necessarily a comparison of what health care/medical/science journalists chose to report on.”

Limitations

Of course, our data have other limitations. In addition to the potential flaws of searching for key terms, we used New York Times coverage as a proxy for health coverage. As Schwitzer pointed out, “‘What we journalists cover’ doesn’t necessarily equate to ‘what the New York Times did.’ To some degree, yes, because of copycat journalism. But to a large degree, day in and day out, not so much.” Similarly, the data only reflect one year of coverage.

Health editor and Retraction Watch blogger Ivan Oransky wondered whether the quantity of studies on a given topic drives coverage. “There may simply be more studies and press releases about the subjects that New York Times are more likely to cover,” he said. “And if that’s the case, this is another good reminder why letting journals set the agenda can skew what reporters cover.”

Andre Picard, a long-time public health reporter at Canada’s Globe and Mail, asked whether reflecting causes of mortality was truly the best measure for quality health coverage: “Should our choice of story topics be based (or influenced) by the impact of a disease/condition on the impact of the population?”

Picard’s answer was ‘sort of.’ “We should base our story choices in part, on the impact of diseases/conditions on the population. But I’m not sure mortality is the best metric for judging impact and I’m really sure that we should pay a lot more attention to the causes of illness than to illnesses themselves. We do that a bit – smoking as a cause of heart disease and lung cancer, for example. But we tend to shy away from issues that don’t have medical treatments.” He added: “I think availability of treatments, more than anything else, influences our coverage.”

What may not get a lot of attention in the health news pages, even though it drives human health more than anything, are the “causes of the causes of disease” such as poverty, Picard said. “We know that income is the single biggest determinant of health, followed by education. But I’m betting ‘poverty’ wouldn’t even show up as a tiny blip on your chart of health story topics. The poor and uneducated are many times more likely to die of heart disease, cancer, COPD, suicide, car crashes, etc., you name it.”

Future research

We were also interested in seeing whether there’s a disconnect between public investment in research spending and mortality. To look at this question, we tallied the dollar amounts of research funding by disease category at the NIH in 2010, and compared those to the data on the top causes of death in America. We then created an index for the research/death ratios. The bigger bubbles—stroke, Parkinson’s, Alzheimer’s, heart disease—are areas with relatively more research funding compared to mortality. Again, diseases related to aging attracted funding, as did those related to cardiovascular health.

research spending

In summary, our findings raised more questions than answers. This exercise gave us a chance to reflect on what other metrics we could use to measure the quality of health journalism and better identify the gaps in health reporting. Considering the limitations of our data, we plan to gather a more robust data set so that we can be more confident in our findings and recommendations to journalists.