The American Community Survey – 3 in 1: explainer, engagement, data story

I have thought about creating a census fan page many times. Looking at data all day makes one appreciate the history, scale, and effort of this massive public endeavor. Not only does the census provide official guidance for the formulation of public funding and policy, it has over the years also ritualistically structured our understanding of our environment. Since 1790, the census has evolved not just to adapt to the massive increase in population (from under 4 million to 318 million today) and migration (from 5.1% urban to 81% in 2000); its format has also changed to reflect our attitudes. In this 3-part (hopefully) series of assignments and makeup assignments, I focus on explaining and visualizing the American Community Survey (ACS), a newer data offering of the census that is a yearly long-form survey of a 1% sample of the population.

Last summer, while interning at a newsroom, I built a Twitter bot based on the ACS, inspired by how nuanced and evocative the dataset is in its originally collected format. Each tweet is a person’s data reconstituted into a mini bio. In the year since, people have retweeted when an entry is absurd or sad, but most often when an entry reminded them of themselves or someone they know. It quickly became clear that narratives are more digestible than data plotted on a map. However, I was at a loss as to how to further this line of inquiry to include more data in bigger narratives.

Part of my research is to experiment with ways of making public data accessible so that individuals can make small incremental changes to improve their own environment. Many of these small daily decisions are driven by public data, but making the underlying data public is not always enough. While still plotting data on maps regularly, I started to think about narratives. Can algorithmically constructed narratives and narrative visualizations stand alone as long-form creative nonfiction?

There are so many wonderful public data projects out there that go the extra step. Socialexplorer does a great job of aggregating the data. Projects from timeLab show many examples of how census data has been used for a variety of purposes, even entertainment. And just last week, the macroconnections group unveiled a beautiful and massive effort to expose public datasets, one that takes data all the way into a story presentation.

Constraints are blessings…

It’s fortunate that I work in such a time and environment, but it is also very intimidating. What can I contribute to an already rich body of work where each endeavor normally requires many hours and even months of teamwork, not to mention the variety of skills involved? More selfishly, what can visual artists add to the conversation beyond simply dressing up the results? This series of 3 assignments is a start.

1. Explainer – the evolution of the census

Instead of focusing on how the population has changed, here is a visualization of how census questions have changed to reflect the attitudes and needs of the times. Unfortunately, it is unfinished and currently only covers 1790 to 1840.


View closeups here – 1790_1840

2. Engagement – how special are you?

I have been procrastinating by spending a lot of time on guessing the correlation. I think that BuzzFeed-type quizzes are one of the best data collection tools. Of course there is also this incredible NYT series. People who commented on the census bot often directly addressed tweets that described themselves. This is an experiment to get people to learn something about the data by allowing them to place themselves in it.

[Screenshots]

This is also still very much in progress.

3. Data Story

To be continued …

Is it cancer?


Cyberchondria refers to unfounded health concerns perpetuated by medical information found online. WebMD is a popular website and often a top search result for people seeking to self-diagnose conditions and symptoms. Its tendency to heighten concern about potential conditions and exaggerate the seriousness of symptoms is the subject of many jokes. In particular, articles online have pointed out how easy it is to arrive at a cancer diagnosis on the website.

We cannot determine the validity of the entire WebMD site by fact-checking the answers given by each page, but we can perhaps answer this question – given a symptom, how far away is a person from a diagnosis of cancer on WebMD?
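One way to make "how far away" concrete is to count link clicks: treat pages as nodes, links as directed edges, and take the shortest path from a symptom page to any cancer-related page. The sketch below uses a breadth-first search over a tiny made-up link graph; the page names and the `clicks_to_cancer` helper are hypothetical and are not taken from the experiment itself.

```python
from collections import deque

def clicks_to_cancer(graph, start, cancer_pages):
    """Return the fewest link clicks from `start` to any cancer page, or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if page in cancer_pages:
            return depth
        for target in graph.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return None  # no cancer page is reachable from this start page

# Made-up toy graph: each page maps to the pages it links to.
toy_graph = {
    "cough": ["cold", "bronchitis"],
    "cold": ["flu"],
    "bronchitis": ["lung-cancer"],
    "flu": [],
}
toy_cancer_pages = {"lung-cancer"}

distance_from_cough = clicks_to_cancer(toy_graph, "cough", toy_cancer_pages)
distance_from_flu = clicks_to_cancer(toy_graph, "flu", toy_cancer_pages)
```

Breadth-first search visits pages in order of click depth, so the first cancer page it meets is guaranteed to be the closest one.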

So here is an experiment that attempts to use the physical properties (text and links) of a website to determine its message. The goal is to investigate the structure and content of the site in order to determine if, and how much, it perpetuates the diagnosis of cancer.

The site is a big nest of links, so the scope is limited to the A-Z common topics page. This section lists 482 health-related topic pages, from Acid Reflux to Zoster (Herpes) Virus. The content examined is further limited to the main article of each condition.

The experiment looks at each page’s center content section for 2 things – cancer-related words (a limited list I found on the internet), and all the out-links from that section of the page. It then follows those links and keeps searching until it arrives at either a page with cancer keywords, a page with no links, or a page that is outside of WebMD.
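The stopping rules above can be sketched as a small breadth-first crawl. This is a minimal illustration rather than the scraper used here: the keyword list, the `fetch` and `is_internal` callables, and the toy site are all assumptions, and a real version would download and parse each page's main content instead of reading from a dictionary.

```python
import re
from collections import deque

# Hypothetical keyword list; the list actually used is not given in this post.
CANCER_WORDS = {"cancer", "tumor", "carcinoma"}

def crawl(start_urls, fetch, is_internal):
    """Follow links until a cancer page, a page with no links, or an off-site page.

    fetch(url) must return (main_text, out_links); is_internal(url) returns a bool.
    Returns a status label for every page seen and the list of directed edges.
    """
    status, edges = {}, []
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        if url in status:
            continue  # already visited
        if not is_internal(url):
            status[url] = "external"
            continue
        text, links = fetch(url)
        words = set(re.findall(r"[a-z]+", text.lower()))
        if words & CANCER_WORDS:
            status[url] = "cancer"
            continue  # stop expanding once cancer keywords appear
        if not links:
            status[url] = "no_links"
            continue
        status[url] = "no_cancer"
        for target in links:
            edges.append((url, target))
            queue.append(target)
    return status, edges

# Toy stand-in for the site: url -> (main text, out-links). Entirely made up.
TOY_SITE = {
    "/acid-reflux": ("Heartburn and acid reflux.", ["/cough", "/liver"]),
    "/cough": ("A persistent cough.", ["/lung", "http://example.com/ad"]),
    "/liver": ("The liver.", []),
    "/lung": ("Lung cancer symptoms.", ["/acid-reflux"]),
}
status, edges = crawl(["/acid-reflux"],
                      fetch=lambda url: TOY_SITE[url],
                      is_internal=lambda url: url.startswith("/"))
```

Keeping the edge list alongside the page labels is what makes it possible to turn the crawl into a network graph afterwards.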

Using this method, the simple web scraper picked up 9714 web pages. Of these,

  • 7976 pages have no cancer-related keywords on them.
  • 726 pages count as cancer-related conditions because keywords were found in the main content.
  • 1012 information pages had either no out-links, such as liver, or out-links that redirected to a sponsored page like this.
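As a quick check, the three counts add up to the 9714 pages collected, and they can be turned into shares of the total. The snippet uses only the numbers listed above; the category labels are shorthand of my own.

```python
# Page counts from the list above, keyed by shorthand labels.
counts = {
    "no cancer keywords": 7976,
    "cancer keywords in main content": 726,
    "no out-links or sponsored redirect": 1012,
}

total = sum(counts.values())
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label}: {pct}% of {total} pages")
```

That works out to roughly 82% of pages with no cancer keywords, 7.5% cancer-related, and about 10% dead ends.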

From the resulting pages I made a directed network graph with a force-directed layout, where each page is a node and each link between pages is an edge. The result is a rat’s nest. The cancer-related pages are colored in red. It is not immediately noticeable which categories of pages have more prominence. However, it is clear that there are central nodes in the network to which almost every page eventually leads.

[Screenshot: force-directed network graph of the crawled pages, with cancer-related pages in red]

I calculated PageRank for each page (node) to determine its prominence.

PageRank, the best-known part of the Google search algorithm and one of the algorithms that determines the order of search results, measures the relative importance of a page based on the links pointing to it. Below are the top 1000 pages by PageRank, in descending order. We can see that pages with cancer do not have the highest scores, and are distributed throughout the ranking.
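For readers who have not seen PageRank computed, here is a minimal power-iteration sketch over a toy link graph. It is not necessarily how the scores in this post were produced (graph libraries such as networkx ship their own `pagerank` function); the damping factor of 0.85 is the conventional default, and the page names are made up.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over {page: [pages it links to]}."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page gets a small baseline share from the "random jump".
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no out-links spreads its rank over every page.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Toy graph in which every topic page links to a shared disclaimer page.
toy_links = {
    "acid-reflux": ["disclaimer", "heartburn"],
    "heartburn": ["disclaimer"],
    "zoster": ["disclaimer"],
    "disclaimer": [],
}
ranks = pagerank(toy_links)
top_page = max(ranks, key=ranks.get)
```

In this toy graph the page that everything links to ends up with the highest score, which mirrors how a footer page linked from every article can dominate a ranking.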

[Screenshot: the top 1000 pages by PageRank, in descending order]

Unfortunately, this is a much more complicated project than I expected, so I can only tell you that, given what I have seen of the network, cancer-related pages do not act differently from or hold prominence over other topic pages. However, it is not clear whether the website’s list of conditions covers cancer-related topics proportionally more than it should. Nor is it clear, when a cancer diagnosis does occur, how much of it is driven by the behavior of the medical advice seeker, who may tend to travel the path toward the worst scenarios.

If WebMD is not about diagnosing cancer, then where are the most likely places that any given WebMD query will lead? A few pages stood out with significantly higher centrality and PageRank than the rest. These pages focus on 2 things – policy and medicine.

The page to which every page eventually leads is, as expected, the disclaimer, which states that WebMD’s contents “are for informational purposes only. The Content is not intended to be a substitute for professional medical advice, diagnosis, or treatment…”

An equally prominent page is a tool to identify medication. The drug index comes in 3rd, but has the most user input on the website, with its thousands of reviews of specific drugs.

Subsequent prominent pages serve similar purposes: the privacy policy and the conditions of use.

… to be continued

Media Diary – Jia

This assignment came at the perfect time – it’s a very busy month and I really need to improve my media diet. The goal of my media diary is to determine the media consumption routine that is most productive (towards dissertation research).

Diary:  see screenshots below or see interactive diary here

the key: to mimic a hand-drawn feel, I used hatch marks. The messier the mark, the less productive.

[Screenshot]

total days: 1 – 7, from last Wednesday; I went on an ill-timed vacation

[Screenshot]

productivity is highest in early to mid morning

[Screenshot]

overall I am not productive 🙁

[Screenshot]

I work at the lab or on the train

[Screenshot]

Parameters: The media I measured are the inputs I take in, not what I make (those are already tracked in a spreadsheet).

Collection: After using RescueTime for a few days, I decided against it because its automatic recordings didn’t allow me to reflect on what I consumed. The method I found most helpful was to record by hand as I went through the day. After the first few days, I adapted my notes into a Google Form (left) so that I could enter them directly from my phone in a standard format.

The format of my recording results in a spreadsheet with columns for date, time, media, productivity, who I am with, part of day, duration, place, and a short description.
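A spreadsheet in this shape can be summarized with a few lines of code. The sketch below is only an illustration: the column headers are my shorthand for the fields listed above, the rows are made up rather than taken from the diary, and it assumes productivity is recorded as a number from 1 to 5.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows; the headers are shorthand for the diary's columns.
SAMPLE = """date,time,media,productivity,with,part_of_day,duration,place,description
2016-02-10,08:30,paper,5,alone,morning,45,train,dissertation reading
2016-02-10,10:00,code docs,4,labmates,morning,60,lab,reference lookup
2016-02-10,21:00,Instagram,1,alone,evening,20,home,scrolling
2016-02-11,22:00,online shopping,2,alone,evening,30,home,browsing
"""

def mean_productivity(rows, key):
    """Average the productivity score for each distinct value of `key`."""
    totals = defaultdict(lambda: [0.0, 0])  # value -> [sum of scores, count]
    for row in rows:
        bucket = totals[row[key]]
        bucket[0] += float(row["productivity"])
        bucket[1] += 1
    return {value: total / count for value, (total, count) in totals.items()}

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
by_place = mean_productivity(rows, "place")
by_part_of_day = mean_productivity(rows, "part_of_day")
```

Grouping by `place` or `part_of_day` is the kind of comparison the hatch-mark views make visually.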

Visualization: I used this assignment to write a simple reusable visualization module and to experiment with opening up visualizations to others on GitHub. Inspired by “Dear Data”, I made a simple visualization that turns a spreadsheet into sortable hatch marks.

It is really still a work in progress.

Conclusions: I am pretty predictable. I look at Instagram and shop online throughout the day. I am most productive in the lab, on the train, and in the early mornings.

There is so much I want to do for this visualization: changing the labels, adding more notes, increasing the clarity with some modifications, and using Google spreadsheets directly instead of a downloaded spreadsheet. The repo I made for the project’s code is working (without the key panel).

the entered data, the input form

Jia Zhang

I’m a PhD student working with interactive maps and data visualizations here in the Media Lab. My background is in visual art and interaction design.

I am interested in how journalists use maps and visualizations. I am a big fan of Amanda Cox and cannot express how excited I am about her new role as editor of the Upshot. I want to build visualization tools for professional and amateur storytellers alike.

I would like to focus on becoming a better writer through this class. I am especially excited to learn from so many great journalists and writers in the class. I would also like to work on a few map/dataviz-based projects that will contribute to my dissertation research if the opportunity comes up.


Relevant skills: visual design, Python, front-end web dev, JavaScript, Processing, and Mandarin.

Interactive Graphics that Invite Participation

Participatory interactive graphics (?) are visualizations designed to change around data from individual readers. These graphics use information solicited from the user or the user’s computer as a lens through which complex or very general data is presented. This kind of interaction is increasingly important in storytelling.

Here are 3 types of stories I have come across that fall into this category.

1. Calculators and searchable interfaces – These searchable or adjustable interfaces allow users to glimpse the larger underlying system by answering each user’s targeted questions. Examples: “How The Internet* Talks” and “Is It Better to Rent or Buy?”

2. Draw your own/quizzes – These solicit educated guesses about trends from users, to involve them in thinking about the logic behind the trends and to increase impact. Example: “You Draw It: How Family Income Predicts Children’s College Chances”

3. Geolocating users – These use IP addresses to geolocate users and automatically center the view and accompanying text of the visualization on a user’s location. This helps in navigating general and comprehensive datasets that cover the whole country but are of interest to most readers only as smaller slices. Example: “The Best and Worst Places to Grow Up: How Your Area Compares”

I think these types of interaction are not only an important tool for storytelling online, but can affect larger patterns for reading online for several reasons.
1. They might be more readily shareable across social media because of how specific they are to the interest of a reader.
2. Commenting is problematic on many online articles. I think using this specific type of interaction can potentially serve as a filter for comment reading, and provide constructive directions for comment writing and discussion among readers.
3. Finally, this kind of interaction could serve as a dynamic filter for customizing out-links from the article and could affect recommendations.

There are discussions to be had on whether the data gathered from interacting with graphics should be used to cater content. I’m not sure yet how I feel about editorial decisions that might be increasingly challenged by the metrics of social media, and how this addition contributes to the discussion. I would like to know more about how feedback is currently weighted in the newsroom. Ultimately, this interaction may result in more stories being force-fit into a data-centric model that is worse than what we have now. There are also definitely issues with the quality of the data gathered from this type of interaction, which will be an interesting area of study once there is a large enough sample size.

I do believe experimenting with this type of input is ultimately worth it and could change the way we look at readers and frame select stories in a positive way. Actively using reader input is an important concept for storytelling. It is not new, but its adoption within interactive graphics has produced some very exciting recent use cases, and it is a topic that I would like to explore further.