A week ago a German jet crashed into the Alps, killing all 144 people on board. For the first several hours after the tragedy it was considered an accident, but it is now apparent that the plane’s co-pilot, Andreas Lubitz, is responsible, and details continue to emerge about his past. As more facts surface, news outlets covering the tragedy have released them in incremental updates. These updates have touched on a wide variety of questions: Why was no one aware of or worried about his mental health issues? Should he have been flying a plane in the first place? Have suicide plane crashes happened before? How has small-town Germany — such as the town of the 16 high school students on board or the pilot’s hometown — reacted to the horrific event?
When publishing these updates, publishers are often linking back to previous stories as a proxy for background information. The “original” story breaking the incident tends to be low on hyperlinks (such as the first link above, which only links to a Germany topic page) while later updates start to link back to archival stories for context. I was curious whether these internal, archival hyperlinks could be followed in order to automatically create a community of stories, one that touches on a variety of aspects of the incident. Links are rarely added to stories retroactively, so in general, following the links means traveling back in time. Could a crawler organize all the links for me, and present historical content (whether over the past 3 days or 10 years) for the Germanwings disaster?
I built a crawler that follows the inline, internal links in an article, and subsequently builds a graph spidering out from the source, storing metadata like link location and anchor text along the way. It doesn’t include navigational links, only links inside the article text; and it won’t follow links to YouTube or Wikipedia, just, for instance, the Times. This quickly builds up a dialogue of stories within a publisher’s archive, around one story; from here, it is easy to experiment with simple ranking algorithms like the most-cited, the oldest, or the longest article.
I chose three incremental update articles from March 30, one each from the Times, the Post, and the Guardian, all reporting that Lubitz was treated for suicidal tendencies:
- NYT http://www.nytimes.com/2015/03/31/world/europe/germanwings-copilot-andreas-lubitz.html
- Post http://www.washingtonpost.com/world/europe/officials-co-pilot-in-german-plane-crash-was-treated-for-suicidal-tendencies/2015/03/30/a7b1eb96-d6da-11e4-bf0b-f648b95a6488_story.html (why are the Post’s URLs so ugly? A subject for another explainer)
- Guardian http://www.theguardian.com/world/2015/mar/30/germanwings-co-pilot-andreas-lubitz-treated-suicidal-tendencies
For each of these three, I spidered out as far as they could go (though in the case of the Times that turned infinite, so I had to stop it somewhere).
New York Times
My first strategy was to simply look at the links that the article already contained. While the system can track links pointing in as well as out, this aticle only had outlinks; presumably this is because a) it was a very recent article at the time of the query, and b) we cannot be sure that we have all of the related stories from the given spider.
Clicking on a card will reveal the card’s links in turn–both inlinks and outlinks.
The “germanwings-crash.html” article had several links that formed a clear community, including archival stories about plane crashes from 1999 and 2005. The 1999 story was about an EgyptAir crash that has also been deemed a pilot suicide. This suggests that old related articles could surface from following hyperlinks, even if they were not initially tagged or indexed as being related. The 2005 crash is linked in the context of early speculation about the cause of the crash (cabin depressurization was initially considered). It is a less useful signal, but it could be useful in the right context.
This community of links is generally relevant, but it does veer into other territories sometimes. The Times’ large topic pages about France, Spain, and Germany all led the crawler towards stories about the Eurozone economy and the Charlie Hebdo shooting.
The Wapo article collected a community of just 32 links, forming a small community. When I limited the spidering to just 3 levels out, it yielded 12 Germanwings stories covering various aspects of the incident, as well as two older ones, one of which is titled “Ten major international airlines disasters in the past 50 years.”
The Washington Post articles dipped the farthest back in the past, with tangential but still related events like the missing Malaysia Airlines flight and the debate over airline cell phone regulations.
The Guardian crawler pulled 59 links, including the widest variety of topic and entity pages. It also picked up article author homepages though (e.g. http://www.theguardian.com/profile/melissa-davey). 32 of these links ended up being relevant Germanwings articles, which is well more than I expected to see…I wouldn’t have guessed the Guardian had published so many stories about it so quickly. These ranged from the forthcoming Lufthansa lawsuit to the safety of the Airbus.
The Guardian seems to have amassed the biggest network, and tellingly, they already have the dedicated topic page to show for it, even if it’s just a simple timeline format. The graph appears more clustered than Wapo’s, which was more sequential. But it doesn’t dip as far back in the past, and at one point, the crawler did find itself off-topic on a classical music tangent (the culprit was a story about an opera performance that honored the Germanwings victims).
In the end, the crawler worked well on a limited scope, but I found two problems for link-oriented recommendation and context provision:
- The links were often relevant, but it wasn’t clear why. More detail surrounding the context around the link is crucial. This could be served by previewing the paragraph on the page where the link occurs, so a reader could dive into the story itself. In short, a simple list wouldn’t be as detailed as a complete graph or more advanced views.
- The topic pages were important hubs, but also noisy and impermanent. Most NYT topic pages feature the most recent stories that have been tagged as such; this works better for a page like “Airbus SAS” than it does for “France.” As such, such an algorithm needs to treat topic pages with more nuance. Considering topic pages as “explainer” pages in their own right, one wonders how they could be improved or customized for a given event or incident.
Another wrinkle: I returned to the NYT article the next day after a few improvements to the crawler, and found that they had removed a crucial link from the article, one that connected it to the rest of the nodes. So already my data is outdated! This shows the fragility of a link-oriented recommendation scheme as it stands now.