Data story: Internet use in Romania

The approach I took was to look at the data available on Internet use in Romania, compare it with other EU28 countries, interrogate it, and come up with potential stories. My initial interest was to find something on citizen involvement.

This post is more an account of the process than a finished story; that better reflects the lessons I learned.

My first and most immediate lesson is that tracking down good data and building a relevant data set – even when the information is publicly available – is time-consuming (especially when you use different sources). The other lesson was even more humbling. Once I had gathered the data I needed, I realized that combining it, merging it, and illustrating the new set takes skills that I don’t yet have (both technically & creatively), and the learning curve was too steep to master for this assignment. I will keep trying.

What I used and looked through: data from Eurostat (the EU’s statistical office), ITU (UN’s information technology arm), UNESCO, Net Index etc. What was great, although it took some time to realize, is that ITU and Net Index make some of their data available on Google’s Public Data, which comes with handy visualization tools. Eurostat also creates visualizations, but they are less appealing.

My first step was to rank the percentage of individuals using the Internet in EU28 (ITU, 2012), a dataset which has Romania coming in last. (The Eurostat numbers for 2013 show it has now passed Greece and Bulgaria, so one potential question would be whether the penetration rate is accelerating; it was almost flat in 1997-2003, the years when most countries had their boom.) Another interesting question that comes up looking at the data is whether EU accession (Romania and Bulgaria joined in 2007) has sped up Internet penetration. Countries now vying for accession (Turkey, Macedonia, Serbia) have even lower usage.

I then looked at download speed in EU28 on Net Index, knowing I’d find the reverse. According to Net Index, Romania has the third fastest download speed in the world. This discrepancy remains staggering, and the potential causes/correlations are interesting to investigate: Romania ranks #8 in the world in originating attack traffic (Akamai data) and has a high level of piracy (BSA data), a strong engineering culture, a budding startup culture, and a hacker/cyber crime base.

Another question I’d explore is whether such low Internet use might be explained by the urban/rural divide, which is still about 50/50 (53/47 to be more precise) in Romania compared to EU28 and, more interestingly, has held steady for the past 15 years. Most countries, according to UNESCO data, have experienced urban migration.

The speed/penetration difference is even more interesting if you look at other indicators in which Romania continues to be reliably at the bottom: e-commerce and regular use (daily & weekly). The accompanying visualizations were generated from Eurostat data.

[Chart: e-commerce]

[Chart: regular use]

Predictably, Romania also ranks last in e-governance/interaction with public authorities.

[Chart: interaction with public authorities]

[Map: interaction with public authorities. The gray circle is that country’s level of interaction. The red outline is the EU28 average.]

Looking at the public’s interaction with government, a host of other questions and stories spring to mind:

  • what’s with the gigantic outlier in 2012? Is it a question of measurement? Did something happen? Was data misreported (intentionally or not)?
  • does this lack of opportunity appear in any candidate’s discourse/promises (presidential elections are slated for the fall)?
  • what explains these numbers? How do they help explain the gigantic recent failure of the e-Romania portal, on which the Romanian government spent 8 million euros?
  • what does this mean for initiatives such as ReStart Romania, which aim to use technology to further public dialogue and change?

Other directions suggested by the data:

  • there seems to be a digital divide. Does it track along geography (as mentioned above)? What about income? How does it manifest itself in different generations (e.g. 50 percent of people between the ages of 35 and 45 go online every week, compared with the EU28 average of 82 percent)?
  • the computer/internet literacy market (both public and private). Eurostat data shows only 26 percent of Romanians have basic computer skills (the EU28 average is 60 percent). What programs exist? Are they working – why/why not?

Journalism check-up: Are reporters doing a good job of covering health?

By Ali and Julia

It’s no secret that journalists fall into many traps when covering the contradictory and sometimes convoluted area of health research. As a 2013 Columbia Journalism Review article—titled ‘Survival of the Wrongest’—summed up: “Even while following what are considered the guidelines of good science reporting, (journalists) still manage to write articles that grossly mislead the public, often in ways that can lead to poor health decisions with catastrophic consequences.”

This can take the form of reporting science out of context, misinterpreting conclusions, or missing big stories altogether. So we set out to gather data on the places where health journalism goes wrong.

We had a grim starting place: We looked at the leading causes of death in America and compared that to how well the most comprehensive national newspaper—The New York Times—covered related stories. We wanted to see whether public health issues that matter to people are under-reported.

First, we gathered mortality data from the CDC’s most recent National Vital Statistics Report, which included 2010 deaths:

Cause of death | Number of deaths | Percent of total deaths
All causes | 2,468,435 | 100
Heart disease | 597,689 | 24.2
Cancer | 574,743 | 23.3
Chronic lower respiratory diseases | 138,080 | 5.6
Stroke (cerebrovascular diseases) | 129,476 | 5.2
Accidents (unintentional injuries) | 120,859 | 4.9
Alzheimer’s disease | 83,494 | 3.4
Diabetes | 69,071 | 2.8
Nephritis, nephrotic syndrome, and nephrosis | 50,476 | 2
Influenza and Pneumonia | 50,097 | 2
Intentional self-harm (suicide) | 38,364 | 1.6
Septicemia | 34,812 | 1.4
Chronic liver disease and cirrhosis | 31,903 | 1.3
Essential hypertension and hypertensive renal disease | 26,634 | 1.1
Parkinson’s disease | 22,032 | 0.9
Pneumonitis due to solids and liquids | 17,011 | 0.7
All other causes | 483,694 | 19.6

Here, the leading causes of death are represented in a bubble chart; the biggest bubbles correspond to America’s leading killers: heart disease, cancer, chronic lower respiratory diseases, stroke, accidents, et cetera.

[Bubble chart: cause of death data]

Then, we queried The New York Times corpus for key search terms related to the top 15 causes of death in America. Here, we found the number of 2010 stories that mention those keywords:

Times stories in 2010 | Keywords
1,630 | “cancer”
1,470 | “heart disease”
527 | “diabetes”
456 | “alzheimer”
331 | “suicide”
216 | “stroke”
214 | “parkinson’s”
183 | “accident”
121 | “liver disease” “cirrhosis”
95 | “influenza” “pneumonia”
88 | “hypertension” “renal disease”
27 | “respiratory diseases” “copd”
2 | “nephritis”
1 | “Septicemia”
1 | “Pneumonitis”

We then created an index to represent the media attention focused on America’s leading killers. We did this by dividing the number of New York Times stories by the number of deaths in America and then multiplying that number by 100,000. So: (New York Times stories/deaths)*100,000. Here’s what we found:

Cause of death | Media attention index
Parkinson’s disease | 971
Intentional self-harm (suicide) | 863
Diabetes | 763
Alzheimer’s disease | 546
Chronic liver disease and cirrhosis | 379
Essential hypertension and hypertensive renal disease | 330
Cancer | 284
Heart disease | 246
Influenza and Pneumonia | 190
Stroke (cerebrovascular diseases) | 167
Accidents (unintentional injuries) | 151
Chronic lower respiratory diseases | 20
Pneumonitis due to solids and liquids | 6
Nephritis, nephrotic syndrome, and nephrosis | 4
Septicemia | 3
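
For readers who want to check the arithmetic, here is a minimal Python sketch of the index calculation, (New York Times stories/deaths)*100,000, using the death counts and story counts from the tables above. The function and variable names are ours for illustration, and rounding to the nearest whole number is an assumption about how the published values were rounded.

```python
# Deaths in 2010 and 2010 Times story counts, copied from the tables above.
deaths = {
    "Heart disease": 597_689,
    "Cancer": 574_743,
    "Chronic lower respiratory diseases": 138_080,
    "Stroke (cerebrovascular diseases)": 129_476,
    "Accidents (unintentional injuries)": 120_859,
    "Alzheimer's disease": 83_494,
    "Diabetes": 69_071,
    "Nephritis, nephrotic syndrome, and nephrosis": 50_476,
    "Influenza and Pneumonia": 50_097,
    "Intentional self-harm (suicide)": 38_364,
    "Septicemia": 34_812,
    "Chronic liver disease and cirrhosis": 31_903,
    "Essential hypertension and hypertensive renal disease": 26_634,
    "Parkinson's disease": 22_032,
    "Pneumonitis due to solids and liquids": 17_011,
}
stories = {
    "Heart disease": 1_470,
    "Cancer": 1_630,
    "Chronic lower respiratory diseases": 27,
    "Stroke (cerebrovascular diseases)": 216,
    "Accidents (unintentional injuries)": 183,
    "Alzheimer's disease": 456,
    "Diabetes": 527,
    "Nephritis, nephrotic syndrome, and nephrosis": 2,
    "Influenza and Pneumonia": 95,
    "Intentional self-harm (suicide)": 331,
    "Septicemia": 1,
    "Chronic liver disease and cirrhosis": 121,
    "Essential hypertension and hypertensive renal disease": 88,
    "Parkinson's disease": 214,
    "Pneumonitis due to solids and liquids": 1,
}


def media_attention_index(n_stories, n_deaths, per=100_000):
    """Stories per `per` deaths: (stories / deaths) * 100,000 by default."""
    return n_stories / n_deaths * per


# Rounding to whole numbers is our assumption, made to match the table.
index = {
    cause: round(media_attention_index(stories[cause], deaths[cause]))
    for cause in deaths
}

# Rank causes from most to least coverage relative to mortality.
ranked = sorted(index.items(), key=lambda item: item[1], reverse=True)
for cause, value in ranked:
    print(f"{cause}: {value}")
```

Because the index divides by deaths, a cause with few deaths and even a modest number of keyword hits (Parkinson’s, for example) lands near the top, which is worth keeping in mind when reading the ranking.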

[Bubble chart: media attention index]

As you can see, the big bubbles (Parkinson’s, suicide, diabetes, Alzheimer’s) indicate a lot of coverage relative to the number of deaths, while barely visible bubbles mean these killers are under-covered by the media relative to mortality. If these data are correct, the third leading cause of death in America, COPD, was hardly covered in the newspaper, and neither was the fifth leading cause of death (accidents). Meanwhile, heart disease and cancer, the top killers, got relatively little attention when compared to Parkinson’s, Alzheimer’s, diabetes, and suicide.

So what does this mean?

The focus by the media on chronic diseases and diseases of aging—instead of, for example, accidents and COPD—probably reflects the interests of the more mature readership of the Times and the emphasis in newsrooms on “news you can use,” health journalism commentator Gary Schwitzer said.

He also offered another interpretation: This exercise may reflect the work of advocacy campaigns. “Maybe, in this sample, advocacy groups for Parkinson’s, liver disease, suicide, flu, diabetes, Alzheimer’s, et cetera, were just that much more successful in priming the pump by getting stuff in the New York Times.”

What’s more, our data might not be representative. Schwitzer noted that searching by key terms could turn up spurious correlations. For example, “Suicide showing up as a key word may mean that it comes from all sorts of general news stories. That may not be comparable to stroke showing up as a keyword from a stroke study. Yes, it’s what’s in the paper, but it’s not necessarily a comparison of what health care/medical/science journalists chose to report on.”

Limitations

Of course, our data have other limitations. In addition to the potential flaws of searching for key terms, we used New York Times coverage as a proxy for health coverage. As Schwitzer pointed out, “‘What we journalists cover’ doesn’t necessarily equate to ‘what the New York Times did.’ To some degree, yes, because of copycat journalism. But to a large degree, day in and day out, not so much.” Similarly, the data only reflect one year of coverage.

Health editor and Retraction Watch blogger Ivan Oransky wondered whether the quantity of studies on a given topic drives coverage. “There may simply be more studies and press releases about the subjects that New York Times are more likely to cover,” he said. “And if that’s the case, this is another good reminder why letting journals set the agenda can skew what reporters cover.”

Andre Picard, a long-time public health reporter at Canada’s Globe and Mail, asked whether reflecting causes of mortality was truly the best measure for quality health coverage: “Should our choice of story topics be based (or influenced) by the impact of a disease/condition on the impact of the population?”

Picard’s answer was ‘sort of.’ “We should base our story choices in part, on the impact of diseases/conditions on the population. But I’m not sure mortality is the best metric for judging impact and I’m really sure that we should pay a lot more attention to the causes of illness than to illnesses themselves. We do that a bit – smoking as a cause of heart disease and lung cancer, for example. But we tend to shy away from issues that don’t have medical treatments.” He added: “I think availability of treatments, more than anything else, influences our coverage.”

What may not get a lot of attention in the health news pages, even though it drives human health more than anything, are the “causes of the causes of disease” such as poverty, Picard said. “We know that income is the single biggest determinant of health, followed by education. But I’m betting ‘poverty’ wouldn’t even show up as a tiny blip on your chart of health story topics. The poor and uneducated are many times more likely to die of heart disease, cancer, COPD, suicide, car crashes, etc., you name it.”

Future research

We were also interested in seeing whether there’s a disconnect between public investment in research spending and mortality. To look at this question, we tallied the dollar amounts of research funding by disease category at the NIH in 2010, and compared those to the data on the top causes of death in America. We then created an index for the research/death ratios. The bigger bubbles—stroke, Parkinson’s, Alzheimer’s, heart disease—are areas with relatively more research funding compared to mortality. Again, diseases related to aging attracted funding, as did those related to cardiovascular health.

[Bubble chart: research spending]

In summary, our findings raised more questions than answers. This exercise gave us a chance to reflect on what other metrics we could use to measure the quality of health journalism and better identify the gaps in health reporting. Considering the limitations of our data, we plan to gather a more robust data set so that we can be more confident in our findings and recommendations to journalists.

Sumo wrestlers and the change in Japanese society

I was away on a business trip and only started working on this on Sunday. I am trying to tell a story about sumo wrestlers and the change in Japanese society. I could not find the data in CSV form, so I am typing in the profile data of the top group from the ranking table, Banzukehyou, for the ’60s, the ’90s, and the latest one. I am working through the Google Fusion Tables tutorial at the very last moment.

w( ̄△ ̄;)w

[Images: Kunisada sumo triptych, c. 1860s; banzuke]

Will I make it in time...?

 

Mapping Election discrepancies

It has been a year since Kenya went to the polls. Though the polls were peaceful, there have been claims that the electoral body did not do its work professionally. I had the chance to work with data before the election, specifically on electoral delimitation and voter registration, and more data has been provided since the election. I would like to highlight some of the discrepancies before, during and after the election. The main purpose is to carry out a fully data-based audit of the election, using the data provided before, during and after it. I also seek to understand, if there were discrepancies, whether they were big enough for us to say conclusively that the election was bungled. The electoral body’s chairman received the ICPS Electoral Conflict Resolution Award last year, which also raised a storm. The two main camps have very divergent views on the election: one, led by Uhuru Kenyatta, believes they won the election fair and square, while the other, led by the former Prime Minister, believes their victory was stolen.

 

Data on high schools in Boston

For this data piece I will tell the story of finding the answer to this seemingly simple question: “how many high schools are there in Boston?”

I am also calling every high school in Boston and telling the story of trying to collect data by cold calling school receptionists, documenting their responses to quantitative and qualitative questions such as “how many students attend your school?” and “what makes your school special?”

‘It is just an accident’

I could not concentrate on this assignment, mainly because my 8-year-old daughter was recently injured in PE class. Her face was cut under her eyes and she needed 11 stitches at the ER, where it was suggested that we consider cosmetic surgery after one year to reduce the visibility of the scar.

The PE teacher responsible accused me of having the ‘nerves to write rude messages’ after I questioned the safety measures taken in the class. The vice principal of the Graham and Parks School said, ‘That is just an accident. No one is responsible. There are children breaking their legs and arms every year at schools’. Yes, indeed.

“In 2009, an estimated 2.6 million children aged 0–19 years were treated in U.S. EDs for sports- and recreation-related injuries,” which are classified as “unintentional injury”. According to a simple search I made of the National Electronic Injury Surveillance System (NEISS), in 2012 alone (there are no later figures) a total of 573 children were injured in PE classes at schools in the US. Well, in 2014, one of the victims was my child.

Sorry for this highly personal and underdeveloped assignment. This was a limited attempt to make a connection between data and the human dimension.

[Photos: Bahar accident, after; Bahar accident, before operation]

Visualizing GIFGIF by country

Kevin Hu & Travis Rich built a site called GIFGIF, which aims to crowd tag animated gifs with various emotions. From GIFGIF’s website: “An animated gif is a magical thing. It contains the power to convey emotion, empathy, and context in a subtle way that text or emoticons simply can’t. GIFGIF is a project to capture that magic with quantitative methods. Our goal is to create a tool that lets people explore the world of gifs by the emotions they evoke, rather than by manually entered tags.”

For this project, Kevin and I are building a map tool, along the lines of What We Watch, so that people can explore GIFGIF’s current dataset to see which gifs are most representative of certain emotions in each country.

GIFGIF’s data will soon be made publicly available through an API.

Data Story: Why People Take Free Online Courses (MOOCs)

Millions of people have signed up for Massive Open Online Courses, known as MOOCs. Early studies show that the majority of those who have signed up already have a college degree, and most do not opt to pay for a certificate to prove they passed the class. Put simply, they’re not looking to get college credit in any way. So I’m curious to dig deeper into what motivates these online “students.”

I am late to post because I’ve been digging around for a killer data set on this. I’ve made requests to HarvardX and to some researchers who have a large MOOC dataset, but so far no one has been willing to share their raw numbers. But HarvardX has published some demographic and survey data (not much). My sense, though, is that their data does not answer the question very well (most MOOC surveys only offer a few multiple choices on motivation).

So for the assignment, I’m focusing on playing around with fuzzier “data” – the student postings to forums in a MOOC. In many MOOCs, students post short introductions in the forums at the beginning of the term, usually saying why they are taking the course. I’ll analyze the intro discussion postings in one MOOC and group them into broad categories (my categories won’t capture everything, but there are definitely clear patterns in the responses).

My plan is to pick an astronomy course on edX that just started. https://courses.edx.org/courses/ANUx/ANU-ASTRO1x/1T2014/discussion/forum/i4x-edx-templates-course-Empty/threads/533080e801772bb02e00087f
There are only about 200 intro posts, so it should be do-able in the short time frame.

I plan to pull out the one student post that best exemplifies each category I create. The interface will be a simple pie chart showing the percentage for each reason for taking a MOOC; when you click on a specific group/color, you’ll be taken to the example intro post for that category, so viewers can “meet” that student.
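
As a rough sketch of the tallying step behind that pie chart, the Python below turns hand-coded posts into percentages per category and keeps one example post per category. Everything in it is a placeholder assumption: the post IDs and category names are invented for illustration (not real forum data), and taking the first post seen in each category merely stands in for choosing the best example by hand.

```python
from collections import Counter

# Placeholder hand-coded labels: (post_id, category).
# These IDs and categories are invented for illustration, not real data.
coded_posts = [
    ("post-001", "curiosity"),
    ("post-002", "career"),
    ("post-003", "curiosity"),
    ("post-004", "teaching"),
]


def category_shares(coded):
    """Return each category's share of all coded posts, in percent."""
    counts = Counter(category for _, category in coded)
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}


def representative_posts(coded):
    """Map each category to one example post.

    The first post seen per category is used here as a stand-in for
    a manually chosen best example.
    """
    examples = {}
    for post_id, category in coded:
        examples.setdefault(category, post_id)
    return examples


shares = category_shares(coded_posts)
examples = representative_posts(coded_posts)

for category, percent in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{category}: {percent:.1f}% (example: {examples[category]})")
```

The two dictionaries map directly onto the planned interface: `shares` supplies the pie slices and `examples` supplies the post each slice links to.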

I’m certainly open to suggestions on tools, critique, etc.