Tracing the links of the Germanwings disaster

A week ago a German jet crashed into the Alps, killing all 150 people on board. For the first several hours after the tragedy it was considered an accident, but it is now apparent that the plane’s co-pilot, Andreas Lubitz, is responsible, and details continue to emerge about his past. As more facts surface, news outlets covering the tragedy have released them in incremental updates. These updates have touched on a wide variety of questions: Why was no one aware of or worried about his mental health issues? Should he have been flying a plane in the first place? Have suicide plane crashes happened before? How has small-town Germany (such as the town of the 16 high school students on board, or the pilot’s hometown) reacted to the horrific event?

When publishing these updates, publishers are often linking back to previous stories as a proxy for background information. The “original” story breaking the incident tends to be low on hyperlinks (such as the first link above, which only links to a Germany topic page) while later updates start to link back to archival stories for context. I was curious whether these internal, archival hyperlinks could be followed in order to automatically create a community of stories, one that touches on a variety of aspects of the incident. Links are rarely added to stories retroactively, so in general, following the links means traveling back in time. Could a crawler organize all the links for me, and present historical content (whether over the past 3 days or 10 years) for the Germanwings disaster?

I built a crawler that follows the inline, internal links in an article, and subsequently builds a graph spidering out from the source, storing metadata like link location and anchor text along the way. It doesn’t include navigational links, only links inside the article text; and it won’t follow links to YouTube or Wikipedia, just, for instance, the Times. This quickly builds up a dialogue of stories within a publisher’s archive, around one story; from here, it is easy to experiment with simple ranking algorithms like the most-cited, the oldest, or the longest article.
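As a rough illustration of the link-following step (a sketch, not the crawler's actual code), the snippet below pulls links out of paragraph text only, resolves them against the page URL, and keeps the ones that stay on the same host. The class and function names, the rule that "inside a `<p>` tag" stands in for "inside the article text", and the example.com URLs are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class InlineLinkParser(HTMLParser):
    """Collect href, anchor text, and position for links inside <p> tags."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished links, in document order
        self._p_depth = 0    # greater than 0 while inside paragraph text
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # anchor-text fragments of the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._p_depth += 1
        elif tag == "a" and self._p_depth > 0:
            # Navigation links outside paragraphs are never recorded.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._p_depth > 0:
            self._p_depth -= 1
        elif tag == "a" and self._href is not None:
            self.links.append({
                "href": self._href,
                "anchor": "".join(self._text).strip(),
                "position": len(self.links),
            })
            self._href = None


def internal_links(page_url, html):
    """Return graph edges for inline links on the same host as page_url."""
    parser = InlineLinkParser()
    parser.feed(html)
    host = urlparse(page_url).netloc
    edges = []
    for link in parser.links:
        target = urljoin(page_url, link["href"])  # resolve relative hrefs
        if urlparse(target).netloc == host:       # drop Wikipedia, YouTube, etc.
            edges.append({
                "source": page_url,
                "target": target,
                "anchor": link["anchor"],
                "position": link["position"],
            })
    return edges
```

Each returned edge carries the anchor text and the link's position among the inline links, which is the kind of metadata that later makes it possible to rank or annotate the stories in the graph.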

I chose three incremental update articles from March 30, one each from the Times, the Post, and the Guardian, all reporting that Lubitz was treated for suicidal tendencies:

For each of these three, I spidered out as far as they could go (though in the case of the Times that turned infinite, so I had to stop it somewhere).

New York Times

My first strategy was to simply look at the links that the article already contained. While the system can track links pointing in as well as out, this article only had outlinks; presumably this is because a) it was a very recent article at the time of the query, and b) we cannot be sure that we have all of the related stories from the given spider.

Clicking on a card will reveal the card’s links in turn–both inlinks and outlinks.

The “germanwings-crash.html” article had several links that formed a clear community, including archival stories about plane crashes from 1999 and 2005. The 1999 story was about an EgyptAir crash that has also been deemed a pilot suicide. This suggests that old related articles could surface from following hyperlinks, even if they were not initially tagged or indexed as being related. The 2005 crash is linked in the context of early speculation about the cause of the crash (cabin depressurization was initially considered). It is a less useful signal, but it could be useful in the right context.

This community of links is generally relevant, but it does veer into other territories sometimes. The Times’ large topic pages about France, Spain, and Germany all led the crawler towards stories about the Eurozone economy and the Charlie Hebdo shooting.

Washington Post

The Wapo article collected just 32 links, forming a small community. When I limited the spidering to just 3 levels out, it yielded 12 Germanwings stories covering various aspects of the incident, as well as two older ones, one of which is titled “Ten major international airlines disasters in the past 50 years.”
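Limiting the spidering to a fixed number of levels amounts to a breadth-first walk with a depth cap. The sketch below shows one way to do that; `get_links` is a hypothetical callback that returns the outlinks of a page, and the function name and return shape are my own choices rather than a description of the actual crawler.

```python
from collections import deque


def crawl(start, get_links, max_depth=3):
    """Breadth-first walk; returns {url: depth} for pages within max_depth hops."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if depth[url] == max_depth:
            continue  # pages at the depth limit are kept but not expanded
        for target in get_links(url):
            if target not in depth:  # skip pages already seen (handles cycles)
                depth[target] = depth[url] + 1
                queue.append(target)
    return depth
```

The depth cap is also what keeps a crawl from running forever on an archive where every story eventually links to every other.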

wapo_links
Click on the image to see the graph in Fusion Tables.

The Washington Post articles dipped the farthest back in the past, with tangential but still related events like the missing Malaysia Airlines flight and the debate over airline cell phone regulations.

The Guardian

The Guardian crawler pulled 59 links, including the widest variety of topic and entity pages. It also picked up author profile pages, though (e.g. http://www.theguardian.com/profile/melissa-davey). 32 of these links ended up being relevant Germanwings articles, which is far more than I expected to see…I wouldn’t have guessed the Guardian had published so many stories about it so quickly. These ranged from the forthcoming Lufthansa lawsuit to the safety of the Airbus.

guardian_links_germanwings
Click on the image to see the graph in Fusion Tables

The Guardian seems to have amassed the biggest network, and tellingly, they already have the dedicated topic page to show for it, even if it’s just a simple timeline format. The graph appears more clustered than Wapo’s, which was more sequential. But it doesn’t dip as far back in the past, and at one point, the crawler did find itself off-topic on a classical music tangent (the culprit was a story about an opera performance that honored the Germanwings victims).

Conclusion

In the end, the crawler worked well on a limited scope, but I found two problems for link-oriented recommendation and context provision:

  1. The links were often relevant, but it wasn’t clear why. More detail about the context surrounding each link is crucial. This could be served by previewing the paragraph on the page where the link occurs, so a reader could dive into the story itself. In short, a simple list wouldn’t be as detailed as a complete graph or more advanced views.
  2. The topic pages were important hubs, but also noisy and impermanent. Most NYT topic pages feature the most recent stories that have been tagged as such; this works better for a page like “Airbus SAS” than it does for “France.” A link-following algorithm therefore needs to treat topic pages with more nuance. Considering topic pages as “explainer” pages in their own right, one wonders how they could be improved or customized for a given event or incident.
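One possible way to treat topic pages with more nuance is to let their links count for less in a most-cited ranking, and to keep the hubs themselves out of the results. The sketch below is only an illustration of that idea: the `/topic/` URL rule, the 0.25 weight, and the function names are all invented assumptions, not how any publisher's site or my crawler actually works.

```python
from collections import defaultdict


def is_topic_page(url):
    # Hypothetical rule: assumes topic hubs live under a "/topic/" path.
    return "/topic/" in url


def rank_most_cited(edges, topic_weight=0.25):
    """Score stories by weighted inlinks; links from topic hubs count less."""
    scores = defaultdict(float)
    seen = set()
    for source, target in edges:
        if (source, target) in seen or is_topic_page(target):
            continue  # ignore duplicate edges and do not rank the hubs
        seen.add((source, target))
        scores[target] += topic_weight if is_topic_page(source) else 1.0
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

With a weighting like this, a story that is only reachable through the “France” hub would still appear, but below stories that other articles cite directly.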

Another wrinkle: I returned to the NYT article the next day after a few improvements to the crawler, and found that they had removed a crucial link from the article, one that connected it to the rest of the nodes. So already my data is outdated! This shows the fragility of a link-oriented recommendation scheme as it stands now.

An Object-Based Conversation with Bianca Datta

Bianca Datta spends a lot of time with objects. We all do, but not like her; she designs them, makes them, thinks about them, and responds to questions from prying interviewers about them.

Bianca is a product designer and first-year graduate student in the MIT Media Lab’s Object-Based Media group. I wanted to learn a bit about her design sense and the ways she relates to objects in particular, so I showed her seven objects that each sparked a conversation about different aspects and stages of her life.

01-maryland

I started off easy. Bianca is from Maryland—Montgomery County, not Baltimore, which most people mistakenly assume (or maybe they just don’t know any other cities in Maryland). She explained her home state as a “microcosm of the US,” which, looking at the state’s map, she attributed in part to its geography. The peninsula, the panhandle, and the two major metropolitan areas each form their own identity.

02-penn

Bianca then set off to Philadelphia to study at Penn’s School of Engineering. She knew that she wanted her work to have energy applications, and started off focusing on chemical engineering, but later found a home in Penn’s materials science lab as a Materials Science and Engineering major. She claims that chemical engineering didn’t work out because she is “not into math or physics,” which befuddled me. It’s all relative.

Bianca has many Penn mugs (all gifts) and paraphernalia, and when I ask which is her favorite, she ponders for a while: “that’s really tough. I have so many.” She settles on a hoodie that she got for being a residential advisor, which she likes not only for the color and comfort, but its associations; it reminds her of home, as well as camaraderie with her fellow RAs.

The mug itself also had significance: “I am really big on tea,” she says (she was late to our meeting because she was getting coffee). She associates tea with her family, and uses it as a way to connect with people; as an RA, she would offer tea to students to encourage them to stay and talk. Nowadays she organizes many of the Media Lab teas.

03-dormitron

At Penn, Bianca took a formative product design class that led her towards her current work. One of the projects in that class was Dormitron, an RFID-operated door, which would replace your dorm’s traditional key with an RFID chip, making your dorm’s door work like the key card in the campus entrance, or a bit like a hotel room.

Bianca first downplayed the project by saying “every year somebody does an RFID thing [for the class],” and mentioned that there are still barriers to wide adoption due to security liabilities. But she also insisted that her team’s product was better designed than others. Although she regrets not being able to participate in the product’s actual fabrication, it was her first opportunity to go from idea to product.

05-3m

Bianca spent one college summer in Minneapolis working for 3M, which introduced her to the corporate working world as a materials engineer. She was simultaneously impressed with the range of 3M and with their level of trust in her expertise and experience.

Her summer at 3M convinced her to go to graduate school, maybe to postpone the red tape (or poster tape?) of major corporations, and because she found that the most interesting work at 3M was being done by people with PhDs. It seemed like a good sign.

Although she was not working on improving 3M’s poster tape, she did have strong opinions: “I hate command hooks. They’re useless and always fall off the wall.” She points to 3M as proving that generic products are not all the same; her 3M sticky-notes stayed on longer and left less residue than the non-branded alternatives. Still, she notes, it’s not always worth the added cost.

06-flip

Along with Partnews RA Alexis Hope, Bianca designed a digital input/output device during the famed Media Lab class “How to Make Almost Anything.” The project was initially an excuse to try out the Processing programming environment, which allows for interesting visual effects. If you press a button, the background changes; this allowed them to switch between a “moon” view and a “sunrise” view for the object.

Bianca’s final and favorite How to Make project was a nap pod called DUSK, which she tells me currently exists and lives in the Media Lab, so I plan to find it and sleep in it tomorrow. For her the project was exciting because she made it from scratch; it was “in my head, and now it’s real, and it’s big, and I get to use it.”

07-stuffmatters

This book was on Bianca’s otherwise defunct Goodreads page, so she was surprised that I’d found it. On one hand Goodreads was just a “one-off thing” for her, but on the other, “this book is all about what I do.” It is a popular-scientific approach to materials and objects, with successive chapters on cement, paper, grass, and so on. Bianca’s current research examines how human beings relate and connect to materials; for instance, why we view some materials as stable, friendly, and durable, while others are considered foreign or cold. So this book is right up her alley.

Unsurprisingly, Bianca prefers paper reading over screen reading, which gives off the illusion of being “less serious.” But like most people, she makes plenty of concessions for the sake of digital convenience.

Bianca read Stuff Matters at Cambridge’s local Book Club for the Curious. As a first-year student, she felt like this connected her to the city and community. Whether tea mugs, hoodies, or books, Bianca associates her favorite objects with their social functions and associations. As an expert in things, her favorite things are the ones that connect her to her favorite people.

The Muddy cleans up

Last Christmas Eve, the graduate students of the Massachusetts Institute of Technology let out a collective cry of despair:

The venerable Muddy Charles pub has been MIT’s centrally-located campus bar since 1968. In a graduate scene where everything can seem a little stratified, the Muddy brings together the chemists and the math-heads, the Media Labbers and the Sloanies. The dark walls and carpet gave off a comfortably drab and unpretentious vibe. In a 2011 article in the MIT Technology Review, Kenrick Vazina put it succinctly: “On a campus dominated by cold concrete and hard science, the Muddy Charles pub exudes warmth.”

So the temporary closing of the Muddy was perturbing not only for temporary reasons (where would we get our $6 pitchers of High Life?); the idea of a renovation also threatened the tired charm of the place. A bright Muddy with a new floor would be, well, not muddy at all.

The new Muddy was unveiled at some point last week, to evidently little fanfare (“Last…Tuesday, I think?” says the bartender). Although the bartenders report high turnout despite the snow, its Facebook and Twitter accounts remain dormant. While a few super-secret email lists were abuzz with excitement at the reopening, few were discussing the renovation. To gauge the reactions of Muddy stalwarts and regulars, I decided to post up at the place for an afternoon with the ultimate peace offering: a pitcher of High Life.

IMG_2935
The entrance to the renovated Muddy Charles pub.

When I arrived, I found a photographer snapping pictures of the bar, and the alliteratively-named Mike the (Muddy) Manager posing. I’d been scooped! Fortunately, I soon learned that it was for promo photos rather than another Tech article. The Muddy was indeed looking to promote, even if its social media implied otherwise. In the meantime, yet another hopeful message popped up on Twitter, from a student who was still unaware that it had reopened: “Any update on this?”

IMG_2937
Floor not included.

Two students were chatting over a beer and discussing chemistry (the science kind), but soon left. A few others arrived: some curiously asked whether it was officially open again and then left; others asked and then stayed; still others were friends of the bartender. At no point before 4pm were there more than 6 people present. Given the Muddy’s famous warmth, it was strangely cold.

IMG_2936
Me in front of everyone’s two favorite words.

My “FREE BEER!” sign drew attention from Mike the Manager, who asked me to remove it because free beer is, apparently, illegal under Massachusetts state law. No appeals to “but I’m paying for it!” or “but those two words are beautiful together!” could sway him, and my strategy for enticing conversation was foiled. I’d also earned the ire of the manager, a potential source. Moreover, I had to drink this pitcher myself.

At 4pm the tables began to fill up. Some friends of the bartender came in, and we asked, “What do you think?”

“…I like the walls.”

“That was sincere.”

“No, I do like the walls! I’m just not sure about the floor.”

The reception to the new look was lukewarm. The maroon, yellow, and white walls were off-putting to several, though one Muddy denizen appreciated the MIT shade of maroon. Still, another remarked, “It’s not quite where I thought they would go. I expected something a little more dark.” A few people missed the dinginess.

“It looks like a Burger King,” muttered one friend who joined me.

Other reactions were more positive. One Muddy regular appreciated the light tones and the welcoming nature of the front foyer. “It’s easier to navigate the space,” she said. She also pointed out the new power outlets circling the space, which will give more afternoon regulars the chance to “study at the Muddy.”

The Muddy was almost not in this position to renovate in the first place. In 2010-11, MIT’s higher-ups considered a renovation of the Walker Memorial building, which houses the Muddy along with an event room, student clubs, and a top-floor gym. The plan was to turn it into a dedicated Music and Theatre Arts building, and it was unclear whether the Muddy would be invited to return. The Muddy could move, but its central location — in between the science and business hubs of MIT — is crucial to its identity and success, as attested to by Muddy fixture Joost Bonsen, who regularly holds “office hours” for his Media Ventures class at the pub, and turns his table into a serendipitous meet-and-greet for scientists and entrepreneurs.

Fortunately for the Muddy, the Walker renovation plans were postponed, and this Muddy renovation seems to signify that it’s here to stay, for the time being. Whatever your feelings on the renovation style, this is undoubtedly a reason to celebrate (with a pitcher of High Life).

Started, completed, and fueled by the Muddy Charles Pub, Tuesday, February 24, 2:15pm-6pm

A week in the clickhole

Publishers and advertisers are after our attention, but they can’t decide how to measure it. Clicks still hold sway, but recently there has been a focus on “attention minutes” — which, as far as I can tell, is marketing-speak for “time spent” — as a more nuanced and realistic measurement of a reader’s engagement. Champions of attention minutes claim that it is a win for both advertisers and readers, who will have a better idea of impact and incentives for more engaging content.

I’m skeptical because attention minutes still require a click in the first place, and headlines still battle for attention. I’ve been curious about how clicks and minutes compare to one another. If they match up perfectly, what makes attention minutes any different as a success metric? If they don’t match up, do my attention minutes seem like a better representation of my online engagement for the week?

Process

I started by trying to manually collect data on mundane and repetitive digital events like receiving new notifications, opening blank browser tabs, and absentmindedly checking my phone. These turned out to be really hard to record. I found myself changing my habits, and getting distracted by the act of writing it down. Measurement bias was coupled with measurement fatigue, and I would often forget or not have time to write things down.

I opted for more automated tracking methods. For the week of February 8 to February 14, I tracked my clicks using Chrome’s History, and my time spent online (my attention minutes) using RescueTime. In order to limit measurement bias, I avoided looking closely at the data until the end of the week. I didn’t track my phone use, but I tend to use my phone for emails, texts, and games rather than web activities.

Tracking clicks

Chrome stores the last 3 weeks of its history locally in a SQLite database, which made it easy to retrieve my click data for the week, though it took a little query magic to limit it to the week in question.
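For reference, the “query magic” is mostly a matter of timestamps: Chrome records visit times as microseconds since January 1, 1601 (UTC), so the week's boundaries have to be converted before filtering. The sketch below assumes the `urls` and `visits` tables with the columns shown, uses UTC boundaries for simplicity (local time would shift them), and is meant to be run against a copy of the History file, since Chrome typically locks the original while it is running. The function names are my own.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Chrome timestamps count microseconds from this epoch.
CHROME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def to_chrome_time(dt):
    """Convert an aware datetime to microseconds since 1601-01-01 UTC."""
    return (dt - CHROME_EPOCH) // timedelta(microseconds=1)


def visits_between(conn, start, end):
    """Return (url, title, hits) rows for visits with start <= time < end."""
    query = """
        SELECT urls.url, urls.title, COUNT(*) AS hits
        FROM visits JOIN urls ON visits.url = urls.id
        WHERE visits.visit_time >= ? AND visits.visit_time < ?
        GROUP BY urls.id
        ORDER BY hits DESC, urls.url
    """
    params = (to_chrome_time(start), to_chrome_time(end))
    return conn.execute(query, params).fetchall()


# Usage, assuming a copy of the History file at a hypothetical path:
#   conn = sqlite3.connect("History-copy")
#   week = visits_between(
#       conn,
#       datetime(2015, 2, 8, tzinfo=timezone.utc),
#       datetime(2015, 2, 15, tzinfo=timezone.utc),
#   )
```

The end boundary is exclusive, so February 15 at midnight covers everything through the end of February 14.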

I hit 4639 webpages over the course of the week, over 650 pages a day. These visits covered 1847 distinct pages, meaning I went to a single page 2.5 times on average.

I extracted the domain name from each visited URL (the registered name without its suffix, so “docs.google.com” becomes “google”) using the tldextract Python library, and found that I’d visited pages at 233 different domains. The breakdown of domains was as follows:

Screen Shot 2015-02-16 at 10.56.30
Fusion Tables Source

I visited Google more than domains #8-233 combined. The fact that I’m using Chrome’s history and visualizing it on Fusion Tables only drives home the point: for me, half of using the internet = using Google.
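The per-domain tally itself is simple to reproduce. The sketch below is a deliberately naive stand-in written with the standard library only, not the tldextract call used for the chart: it takes the second-to-last label of the hostname, which works for names like “docs.google.com” but would mislabel multi-part suffixes such as “.co.uk” (the case a public-suffix-aware library exists to handle). The function names are my own.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_label(url):
    """Naive domain label: second-to-last part of the hostname."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # "docs.google.com" -> "google"; single-label hosts are returned as-is.
    return labels[-2] if len(labels) >= 2 else host


def hits_by_domain(urls):
    """Count visits per domain label, most-visited first."""
    return Counter(domain_label(u) for u in urls).most_common()
```

Under this rule a shortened link on “t.co” is filed under a domain called simply “t”.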

But I was also a bit surprised by some of my click data here; five of my top seven most-visited pages were Google pages, and three of them were Google Docs.

Domain | Title | URL | Hits
google | MW2015 Paper – Google Docs | https://docs.google.com/document/d/1Tr8K… | 291
google |  | https://docs.google.com/document/d/1o-oL… | 155
google | Google | https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8 | 137
twitter | Twitter | https://twitter.com/ | 72
facebook | (2) Facebook | https://www.facebook.com/ | 52
google | Inbox (25) – liam.p.andrew@gmail.com – Gmail | https://mail.google.com/mail/u/0/#inbox | 39
google |  | https://docs.google.com/document/d/1bnMq… | 32
twitter | Liam Andrew (@mailbackwards) | Twitter | https://twitter.com/mailbackwards | 29
mit | MIT Libraries | http://libraries.mit.edu/ | 20
oclc | Main Menu: ILLiad — MIT Libraries | https://mit.illiad.oclc.org/illiad/illiad.dll | 18

I was collaborating on a few projects and papers in Docs this week, but I don’t think I hit these pages hundreds of times. A one-page handout that I prepared for a class was apparently visited 155 times, even though I opened and closed it within an hour. Is Google automatically refreshing these pages, treating them as link visits? This would throw off my results immensely. This also made me think about what counts as a “visit” to Facebook or Twitter. Clicks might work for publishers, but they are less well defined for platforms and applications.

I also noticed a domain simply called “t” that had 40 overall clicks; this turned out to be the “t.co” link-shortening mechanism on Twitter. It was interesting to discover that I’d clicked on 40 Twitter links this week, but I wondered whether it was tracking the final destination too. Does a “visit” include redirects? I’d need to investigate how Chrome stores its history.

Chrome does store detailed metadata about each individual visit (e.g. did you arrive via a link from another page? a bookmark? manually typed into the address box?), so an examination of this data would allow for a glimpse into what Chrome History is storing, as well as offering a deeper dive into my interactions with specific sites. I wonder, for instance, if advertisers should be more interested in sites that users are likely to manually type into an address box, rather than arriving from email or Facebook.
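As a starting point for that examination: each row in the `visits` table has a `transition` value, and to the best of my knowledge (based on Chromium's page transition definitions, which are worth checking against the Chrome version in use) the low byte holds the core type while the high bits flag things like redirects. The sketch below decodes only a few of the core types; the function name and the short labels are my own.

```python
CORE_MASK = 0xFF  # the low byte holds the core transition type

# A subset of core types; anything else is reported as "other".
CORE_TYPES = {
    0: "link",           # followed a link on another page
    1: "typed",          # typed into the address bar
    2: "auto_bookmark",  # opened from a bookmark or similar UI suggestion
    8: "reload",         # page reloaded, including session restore
}

# Qualifier bits marking visits that were part of a redirect.
SERVER_REDIRECT = 0x80000000
CLIENT_REDIRECT = 0x40000000


def describe_transition(value):
    """Return (core type label, whether the visit involved a redirect)."""
    core = CORE_TYPES.get(value & CORE_MASK, "other")
    redirected = bool(value & (SERVER_REDIRECT | CLIENT_REDIRECT))
    return core, redirected
```

Grouping a week of visits by these labels would show how many “visits” were reloads or redirect hops rather than deliberate clicks, which bears directly on the inflated Google Docs counts and the t.co question above.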

Tracking minutes

I tracked my time for the week using RescueTime. I first threw out any non-internet usage, which turned out to be 47% of my total computing time (primarily Mail, Acrobat, and Evernote). Some of these differences felt arbitrary; I might decide to download a PDF rather than read it on the web, or start a cover letter in Evernote rather than a Google doc, and these decisions would affect hours of my time.

This left me with 31 hours logged online, though it’s safe to say I wasn’t paying attention to my computer this whole time. RescueTime thinks I spent 1 hour browsing “newtab” this week, when in fact I was probably distracted by the real world.

A pie chart felt appropriate as a representation of my attention, with advertisers fighting over slices:

Screen Shot 2015-02-16 at 10.52.56
Fusion Tables source

I had to consolidate a number of rows to get data consistent with the click data, but when I did, I found that 7.5 hours, about 25% of my total online time, was on some Google app or another (3:17 on Google Docs). The next three were the New York Times (2:12), Facebook (1:40), and Twitter (1:14).

The two charts look similar, with some of the same characters and similar breakdowns, but there are a few differences. Attention minutes benefit the New York Times, while Google and MIT fall to a smaller share of the pie. Another interesting data point here is Wired, which only got one click all week, but 55 minutes of my attention (I believe I was lost in some court transcripts in the Dread Pirate Roberts case…a true attention hole). Wired wins out when measuring in this way…possibly more than it should.

My attention minutes look a bit more balanced than my clicks, but they still follow a sort of power-law distribution, with the majority of my time on just a few sites (my top 10 accounted for more than 50% of my total viewing time) and a long tail of sites that I only spent a few seconds on:

Screen Shot 2015-02-16 at 11.17.24
Fusion Tables source

I found that the few-seconds sites were the ones I used for quick facts and reference, while I spent more time on sites with full stories and articles. This seems to benefit publishers, but it actually might be a good goal for information and reference sites to reduce time spent on the site.
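The concentration figure (the share of total time taken by the top 10 sites) is easy to compute from a per-site total. The sketch below assumes a dictionary mapping each site to seconds spent; the function name is my own, and no real data from the week is included.

```python
def top_share(seconds_by_site, n=10):
    """Fraction of total time that went to the n sites with the most time."""
    ranked = sorted(seconds_by_site.values(), reverse=True)
    total = sum(ranked)
    # Guard against an empty log rather than dividing by zero.
    return sum(ranked[:n]) / total if total else 0.0
```

Running the same function on hit counts instead of seconds gives a directly comparable concentration figure for clicks.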

Combining the two

If we assume that both the click and time data are good and interchangeable (neither of which is necessarily true), then I spend an average of 23.19 seconds on any given page visit. Publishers like the New York Times and Wired win out when I measure minutes, while libraries and information providers like MIT and Google are more geared towards clicks.
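The average is just total time online divided by total page visits. As a quick check using the rounded figures quoted earlier (31 hours and 4639 visits), the sketch below gives roughly 24 seconds per visit; the small gap from 23.19 presumably comes from the unrounded logged total, which works out to a bit under 30 hours. The function names are my own.

```python
def seconds_per_visit(hours_online, page_visits):
    """Average time per page visit, in seconds."""
    return hours_online * 3600 / page_visits


def implied_hours(avg_seconds, page_visits):
    """Total hours online implied by an average and a visit count."""
    return avg_seconds * page_visits / 3600


approx = seconds_per_visit(31, 4639)   # from the rounded 31-hour figure
implied = implied_hours(23.19, 4639)   # hours implied by the quoted average
```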

Clicks and minutes both follow power laws, and the same sites generally sit at the top of both rankings, so the two metrics overlap a great deal. Time spent may or may not be a slightly more balanced view of my online habits, but it brings its own skews. In the end, the most balanced representation of my week online probably sits somewhere in between these two metrics.