Problems with WikiLeaks and the Start of a Solution

WikiLeaks has been an important step toward greater transparency and accountability. By publishing leaked documents from governments, corporations, and other institutions, WikiLeaks has revealed the power of transparency and openness online. That said, WikiLeaks has serious flaws in its own transparency, the openness of its releases, and its methods.

This is an exploration of some of the problems with WikiLeaks through a story of my attempt to analyze the WikiLeaks Global Intelligence Files release. The Global Intelligence Files (GI Files) are a collection of five million emails from the intelligence contractor Stratfor. These emails are from the years 2004 to 2011 and discuss Stratfor’s internal operations.

I was hoping to analyze Stratfor’s communication structures by making a network graph from the GI Files. Email is perfect for this sort of analysis. Someone made a tool called cable2graph for graph analysis of Cablegate, so I planned to adapt that tool for use with emails.
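As a rough illustration of the idea (this is not cable2graph itself, and the addresses are invented), a communication graph can be built from the From/To pairs of emails, weighting each edge by how often one person emailed another:

```python
from collections import Counter

# Hypothetical (from, to) pairs extracted from email headers.
# The addresses are invented for illustration.
emails = [
    ("analyst_a@stratfor.com", "analyst_b@stratfor.com"),
    ("analyst_b@stratfor.com", "analyst_c@stratfor.com"),
    ("analyst_a@stratfor.com", "analyst_c@stratfor.com"),
    ("analyst_a@stratfor.com", "analyst_b@stratfor.com"),
]

# Weighted directed edges: how often each sender emailed each recipient.
edge_weights = Counter(emails)

# A crude centrality measure: total emails sent plus received per address.
# People with high totals sit near the center of the communication structure.
degree = Counter()
for (sender, recipient), weight in edge_weights.items():
    degree[sender] += weight
    degree[recipient] += weight
```

A real analysis would feed edges like these into a graph library for proper centrality and clustering measures, but even raw send/receive counts hint at who sits where in an organization.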

Ideally, I would use all the Stratfor emails for this graph analysis. Unfortunately, WikiLeaks has not yet released all the Global Intelligence Files. They started publishing the GI Files on February 27th, 2012, over a year ago. Five million documents is a lot, but after fifteen months only a fraction of them have been released. It does not seem like WikiLeaks is seriously trying to release these documents: no sets of emails have been released since February 2013, and in most months only a handful of sets were published. Even accounting for time spent reviewing documents before release, the release of the GI Files is taking far too long.

The second issue with analyzing the Global Intelligence Files is WikiLeaks’s release strategy. WikiLeaks is working with “more than 25 media partners” to release the documents. These partners get access to the full set of documents. WikiLeaks releases GI Files emails only when a partner writes an article about them. WikiLeaks has released a few hundred sets of these emails so far, but most of these releases only contain a few documents.

This release strategy makes it incredibly difficult to find new stories in the released Stratfor emails. After all, the emails are only released once they have already been used in a story. This means there is no obvious way for most WikiLeaks supporters to help with the release and analysis of the GI Files. Partners are invited and must be “a journalist, Professor or Associate Professor at a University or an employee of a human rights organisation.” In some ways, this is worse than a paywall. Access to the GI Files is restricted, and there is no clear way for someone to get access.

Despite these difficulties, I thought using network graphs to examine the bigger picture of all the released emails might still reveal new information. WikiLeaks does very little analysis of most of its documents. For the GI Files, WikiLeaks seems to rely entirely on the analyses done by its media partners. Unfortunately, WikiLeaks makes it difficult to analyze most of its releases. A few, like Cablegate, are accessible in machine readable formats. People have used these formats to analyze the documents in interesting ways.

Most WikiLeaks releases are not accessible in machine readable formats. There are no machine readable versions of the Global Intelligence Files. As of this month, the US government has better policies for releasing machine readable data than WikiLeaks does. The new policy is a great step forward for the US government, but it is a huge failure for WikiLeaks to now be behind the US government on certain transparency policies.

To get the GI Files email data in a machine readable format, I wrote a scraper. I was able to scrape the email subjects, dates, IDs, and text. Then I ran into yet another issue. Where the email addresses in the To and From fields should have appeared in the HTML, there was a CloudFlare ScrapeShield script instead. The purpose of ScrapeShield is to stop spam bots from collecting email addresses from websites. This is a good thing, but ScrapeShield becomes problematic when it gets in the way of analyzing documents. Is it more important that Stratfor employees get less spam or that people can analyze the WikiLeaks documents? I would generally say the latter, since extra spam is a minor inconvenience (and most of it is caught by spam filters), but Stratfor employees have no say in this, so that complicates the situation.
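To illustrate the problem (with a made-up HTML fragment and hypothetical class names, not the site's actual markup), subjects and dates parse cleanly while the obfuscated From field yields no address:

```python
import re

# A hypothetical fragment of a GI Files email page (the class names are
# invented, not the site's actual markup). The From field contains only
# CloudFlare's ScrapeShield script instead of an address.
html = """
<div class="email">
  <span class="subject">Re: weekly intelligence summary</span>
  <span class="date">2011-02-14</span>
  <span class="from"><script>/* ScrapeShield obfuscation */</script></span>
</div>
"""

# Subjects and dates scrape cleanly...
subject = re.search(r'<span class="subject">(.*?)</span>', html).group(1)
date = re.search(r'<span class="date">(.*?)</span>', html).group(1)

# ...but the From field yields no email address, only the obfuscation script.
sender_html = re.search(r'<span class="from">(.*?)</span>', html, re.S).group(1)
has_address = bool(re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", sender_html))
```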

While not ideal, one solution would be for WikiLeaks to give its media partners access to machine readable data, or to offer some method of requesting it. Not only does WikiLeaks not give media partners access to machine readable data, it actively bans partners from scraping the documents. This ban on scraping and running scripts greatly limits the types of analysis partners can conduct. It also makes finding documents to write about more time consuming, where network and content analysis could surface interesting sets of emails faster. This restrictive partnership system may be why so few of the GI Files emails have been released so far.

Network analysis is not possible without the To and From fields of the emails. I found an approach that might recover them by converting the emails to PDFs and then scraping the PDFs, but this would be extremely difficult. PDF scraping is hard in itself, and automatically scraping thousands of slightly different PDFs may not be possible. Thus, I gave up on the network analysis idea.

WikiLeaks is not only failing to make it easy to help with document release and analysis; it actively impedes anyone who wants to help, including its own partners. While WikiLeaks has done some great things for transparency, the organization has serious problems with secrecy. This secrecy spreads to its releases and keeps documents closed behind walls. As the purpose of WikiLeaks is to encourage greater transparency and accountability through the release of restricted information, restricting access to that information again defeats the purpose.

WikiLeaks is struggling, and if it does not change its methods it will die. Regardless of what happens, some of WikiLeaks’s successes have shown the world the power of leaking and transparency. But these successful releases are not the norm. Most WikiLeaks releases, and those of other transparency initiatives, are rendered useless by the issues discussed above and others.

We can do better. I am not sure exactly what will work, but there are a few components I think a successful solution to these problems will need. I have been working on a transparency platform that addresses the issues described above, along with other problems I have noticed with leaking and other transparency initiatives.

When examining any leaked or released information, it helps to define what information the investigative group has and what information it needs. This defining stage seems to have two main parts, each of which tools could help with: defining investigative questions and defining the steps to answer those questions.

First, it may be helpful to define the questions the group wants to answer about the information. These questions will likely change throughout the process, but defining some questions upfront can help guide the investigation. A platform that lets people post, edit, and add answers to questions is a simple place to start for investigation of released documents.

Second, the group needs to define how they will answer each question. This could mean determining what information they need, how they will collect that information, and what types of analysis they will conduct with it. Again, these steps may change throughout the process, but a list of clearly defined steps to answer the investigation’s questions may be a good starting point. Clearly defined steps or tasks are also helpful because they make it clear how supporters can help with the investigation. That clear path to involvement alone would be a huge improvement over WikiLeaks, where supporters struggle to figure out how they can help. Simple task management software could allow people to define and allocate these steps for each question. It may even be possible to suggest steps based on the wording of the question or steps already entered. Suggestions like this would make it easier for people to figure out how to conduct the investigation.

Some investigations may start around a particular set of documents the group already has. This often seems to be the case with leaking and whistleblowing. In that case, it helps for the documents to be uploaded in one place so people can search and analyze them. The documents should also be uploaded in a machine readable format for analysis. These two goals can be accomplished with a combination of a searchable document storage/upload system and scrapers.
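A minimal sketch of the searchable-storage idea, using a toy inverted index over invented documents (a real system would use a proper search engine rather than this):

```python
from collections import defaultdict

# Hypothetical uploaded documents, keyed by document ID.
documents = {
    "doc1": "Interview notes: discussion of surveillance contracts.",
    "doc2": "Quarterly spending figures for the intelligence division.",
}

# A tiny inverted index: word -> set of document IDs containing that word.
index = defaultdict(set)
for doc_id, text in documents.items():
    cleaned = text.lower().replace(".", " ").replace(":", " ").replace(",", " ")
    for word in cleaned.split():
        index[word].add(doc_id)

def search(term):
    """Return IDs of stored documents containing the term (case-insensitive)."""
    return sorted(index.get(term.lower(), set()))
```

The point is only that once documents land in one machine readable store, search (and any later analysis) becomes a small amount of code rather than a manual effort.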

The group examining the released documents may want to collect related information like interviews, data sets, documents released by other organizations, or user-contributed data. Or perhaps someone has questions but no documents to start from yet. These additional documents could also be uploaded to the central storage/upload system to make it easier to search, combine, and analyze all the information. This document collection system could go a step further and make it easier to find documents, with options to pull information from common data sources like government data APIs, Wikipedia, and search results at the click of a button. Additionally, people helping with the investigation could use a browser plugin to send documents, or scrape web pages they find online, to the information storage system.

Sometimes a whistleblower may want to upload related documents anonymously. An anonymous submission system could be adapted to send documents directly to the storage/upload system. There could also be options for automatic redaction of names (or emails) in the documents submitted by whistleblowers (perhaps with a way some people can access the full data).
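A sketch of what the simplest form of such automatic redaction might look like, masking anything shaped like an email address (real redaction, especially of personal names, would need far more care than a regex):

```python
import re

# Matches most things shaped like an email address. Real redaction,
# especially of personal names, needs much more than this.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact_emails(text, placeholder="[REDACTED]"):
    """Return the text with email addresses masked before release."""
    return EMAIL_RE.sub(placeholder, text)

# A made-up submitted document line.
original = "Please forward this to jane.doe@example.com before Friday."
redacted = redact_emails(original)
```

Keeping the unredacted originals accessible only to vetted investigators, as suggested above, would then be a matter of access control on the storage system rather than of the redaction code itself.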

After all of this information is uploaded, it would be nice if it could be used outside this single investigation. Thus, it could be helpful to give the person uploading documents the option to share them with others using the same document collection system. This sharing system would make it easy to import documents and data into a new investigation or instance of the information collection and storage system.

Collecting and releasing documents only goes so far. To use information, people need to understand it. The analysis used to understand the documents includes anything from reading and discussing the documents to combining different forms of information and using content or network analysis programs. The most helpful type of analysis will vary based on the type of information available and questions asked.

Plenty of analysis tools exist, but a toolkit of many different analysis tools (existing and new) that could be easily linked together and use information from the document collection system would make them easier to use and more powerful. For example, users could set one tool to parse a set of data from the document collection system, pass the output to another tool that does content analysis to determine relevant Wikipedia pages, and then have a third tool pull dates from those pages. Another tool could then format those dates, along with the dates in the original data, into the format required by TimelineJS and embed the resulting timeline on the analysis/results page. Each of these tools could also be used on its own or with different tools. Such an analysis toolkit would make existing tools easier to use and more powerful by allowing people to hook them together without coding. People who can code could upload their own tools for others to use or modify existing ones.
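As a sketch of the last step in such a pipeline, here is hypothetical scraped email data reformatted into a simplified TimelineJS-style structure (a sketch only: the real TimelineJS schema has more required fields than shown here):

```python
import json

# Hypothetical scraped emails: (subject, ISO date) pairs.
emails = [
    ("Re: weekly intelligence summary", "2011-02-14"),
    ("Fwd: contract update", "2011-03-02"),
]

def to_timeline(emails):
    """Reformat emails into a simplified TimelineJS-style structure.

    A sketch only: the real TimelineJS schema has more required fields.
    """
    events = []
    for subject, iso_date in emails:
        year, month, day = iso_date.split("-")
        events.append({
            "startDate": f"{int(year)},{int(month)},{int(day)}",
            "headline": subject,
        })
    return {"timeline": {"type": "default", "date": events}}

timeline_json = json.dumps(to_timeline(emails))
```

In the toolkit described above, a step like this would be one small, reusable tool whose input comes from whatever tool ran before it.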

While the system described above may help with collecting and analyzing information, it may still be too time consuming and complicated for someone who just wants to learn about a situation and not take part in the investigation. With all the questions, documents, and analysis in one place, a program could easily take all of these and reformat them for a release page. This release page could have summaries and basic information on the findings at the top with the full details from the investigation, analysis, and full documents underneath. While the investigation system would be structured to make it easy to contribute, the release page would be structured to make the information easy to read and understand.

No matter how nice the release system and investigation platform are, few people will find the release pages on their own. People need to write articles about the release and share links to the release page. WikiLeaks’s release model of having media partners write articles is not all bad. I think media partners can be a helpful part of a disclosure system, so long as documents are not released only when media partners write articles, and partners have the tools to help with analysis.

Using the Information
Collecting documents, analyzing them, and releasing more understandable information is not helpful if no one uses the information for anything. Using released information can take many forms, so for now I am focusing mostly on the first steps. That said, the same structure that helps people define questions and steps to answer them could be used to set specific goals for change or greater awareness based on the released information.

Next Steps
Tools exist to help with all of the steps described above. Some people already use them to examine leaked or released documents. Unfortunately, these tools are often difficult to use, and identifying and using many different tools or methods quickly becomes cumbersome, so in many cases they are not used at all. I am building the system I described above. This system integrates both existing tools and new ones in a modular and extensible platform for collaborative investigation.

Hopefully this platform will at least make it easier for organizations already releasing or analyzing leaked and released documents to conduct good analyses. Ideally, this system will be directly integrated with sites releasing documents (both leaking organizations and government transparency initiatives) to provide a platform for collaborative investigation and civic engagement between different groups and individuals.

What I Made So Far
I built a prototype of the collaborative investigation platform for defining questions and steps to answer them (minus the automatic suggestions). The prototype also allows people to upload documents, but it does not yet include anything close to the full upload system I described. I also made a few small tools to help analyze the Global Intelligence Files: a scraper that can pull all the GI Files, specific releases, or single emails; a gem that automatically generates TimelineJS-compatible JSON from the scraped emails; and a modification of the upload system that embeds a timeline of emails from a single GI Files release when someone uploads the JSON of that release generated by the scraper.

For now, you have to run the scraper on ScraperWiki, but I hope to integrate it more directly with Transparency Toolkit in the future. The main Transparency Toolkit system can be tested on the demo site (where I’ve uploaded some of the GI Files) or downloaded from GitHub. The TimelineGen gem can be installed as a gem or downloaded from GitHub. Currently you can only make timelines directly on Transparency Toolkit from specific GI Files release pages, but TimelineGen has methods that can be used in more general cases (I’m still hooking them together manually for the timeline embed).

Here is a tutorial/demo video that shows how this works:

If you want to use Transparency Toolkit to make timelines of WikiLeaks GI Files releases, some instructions are below. At this point, it is helpful if you know the basics of how to use ScraperWiki.

1. Go to

2. Ask a question or set a goal by typing in the box at the top (or skip this if you want to add a task to an existing question).

3. Click the + button next to the question to which you want to add a task and add a task by typing in the box that says “Add a task”. Tasks are clearly actionable steps for answering the question, like making a timeline of a set of emails.

4. Go to the GI Files release page and click on one of the links to view a set of emails that was released. Copy the URL of the page with the set of emails.

5. Go to the GI Files scraper and scroll down to line 99, right under “#To get all emails for a single gifiles release:”. Then replace the URL in the getEmail(url) method with the URL you copied in step 4.

6. Save and run the scraper.

7. Click “Back to scraper overview” in the top right corner. Then click the Download menu and choose “As a JSON file”. Be sure to clear the old data before running the scraper again.

8. Go back to the task page you created on Transparency Toolkit in step 3. Click the “Contribute results from task” field and type anything you want about the results.

9. Click the Browse button and select the JSON you downloaded in step 7. Submit the results.

10. You should see a timeline of the emails on the release page you specified. You can see an example timeline here as well.

Validate a Social Media Movement

Rochelle Sharpe, Adrienne Debigare, and David Larochelle

It didn’t take long for conspiracy theories about the Boston Marathon bombers to take hold on social media.

On the same day Dzhokhar Tsarnaev was captured in Watertown, MA, the Twitter feed “FreeJahar” was created – and it’s been growing ever since.

“Look at his hair,” someone tweeted earlier today. “Would a terrorist have nice hair?”

While hundreds of Twitter followers and more than 6,000 fans of a Facebook page ask whether Jahar has been set up, a bigger question emerges about social media. How can the public possibly know whether the conspiracy theorists are a growing movement or merely a tiny band of people who’ve figured out how to hijack social media to amplify their message?

Certainly, the Jahar conspiracy theorists now have the nation’s attention, with large newspapers, like the New York Post, picking up the story. “Smitten teen girls stir up #FreeJahar mania for Boston marathon bombing suspect,” the Post’s headline blared last Sunday.

We propose a tool to assess the legitimacy of opinions spread on social media. It would essentially apply the age-old basic journalism questions (the who, what, when, where, and why) to assess the validity of social media posts.

How many supporters are needed before a movement is significant? The White House has struggled with this question with its We the People petitions. It initially required 5,000 signatures within 30 days to get a response. After being inundated with more petitions than it could handle, it raised the threshold to 25,000 and then to 100,000.

When covering an emerging social movement online, it is important to know who actually supports it. We developed a prototype tool to analyze the membership of the main Jahar Facebook group. Although the New York Post portrayed supporters as smitten teenage girls, we found that the Facebook group was overwhelmingly male. We also found that the overwhelming majority of members were likely US based. (The Facebook group went invite-only before we could perform additional analysis.)

Encouraged by the success of our prototype, we propose the creation of a more powerful suite of tools for analyzing social media movements. These tools would allow journalists to quickly analyze a social movement quantitatively rather than just qualitatively.

The proposed tool suite would be able to show the following:

  • Word clouds based on the contents of tweets containing a hashtag or posts in a Facebook group.

  • What social media services the conversation is taking place on.

  • How many users took the extra step of signing an online petition.

  • How users of a Twitter hashtag are connected.

  • What types of sources are being cited — research papers, mainstream media articles, blogs, memes, etc.

  • Where links in the conversation go.
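As a sketch of the first item, the word frequencies behind a word cloud can be computed from a handful of tweets (the tweets below are invented stand-ins for a real hashtag sample):

```python
import re
from collections import Counter

# A handful of invented tweets standing in for a real hashtag sample.
tweets = [
    "Look at his hair. Would a terrorist have nice hair? #FreeJahar",
    "The media got this wrong #FreeJahar",
    "#FreeJahar he was framed, look at the evidence",
]

def word_frequencies(tweets):
    """Count words (and hashtags) across tweets, ignoring case.

    These counts are exactly what a word cloud visualizes.
    """
    words = []
    for tweet in tweets:
        words.extend(re.findall(r"#?\w+", tweet.lower()))
    return Counter(words)

freq = word_frequencies(tweets)
```

A production version would need stop-word filtering and a real sample from the Twitter API, but the underlying computation is this simple.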



Attention Plotter: A Tool for Exploring Media Ecosystems

Attention Plotter is a d3.js-based tool for graphing and comparing volumes of content from multiple media sources and word frequencies within that content over a range of dates. It’s now part of the Controversy Mapper project at the MIT Center for Civic Media. And it’s available on GitHub:

[Attention Plotter screenshot]

Live demo of Attention Plotter using Trayvon Martin Data and TF-IDF:


Mind the Map: Toward a Handbook for Journalists

While browsing and studying maps for our final project, Catherine and I started compiling a mapping handbook for journalists. It’s far from complete, but we wanted to share an initial blog post and ask for your thoughts and feedback. Rather than focus on technical details (many of which are already covered in other blogs or places like the Data Journalism Handbook), we focused on questions a journalist might ask before making a map in the first place.

We’d love to hear your feedback, additional examples or questions.

How Close to Home? Crisis, Attention and Geographic Bias

For our final project, Luisa and I created a critical geography of the news coverage of the Boston Marathon bombings in comparison with other crises that happened that week. Our main blog post writing up our findings is on the Center for Civic Media’s blog. We also created a poster and an online tool where you can explore the maps and data related to this research.


The Marathon, the Manhunt and the Media

Folks, I’ve been collecting the best writing I can find on the media and the Boston Marathon bombing. I rolled up everything on a Readlist, which you can download as a single ebook.

This is a work-in-progress, so please let me know what I’ve missed. If you have a bunch of links to add, let me know and I’ll give you a link to edit.


Proposed Final Project: Analysis of the Boston Marathon Bombing

By Rochelle Sharpe, Adrienne Debigare, and David Larochelle

With the investigation into the Boston marathon bombings now an international inquiry, we want to explore how different countries are covering this continuing story.
We want to compare how the U.S. and Russian media are covering the investigation, examining newspaper and blog posts on the FBI’s handling of Russian intelligence about Tamerlan Tsarnaev. We will use Media Cloud as a base to explore coverage, perhaps designing graphics to make Media Cloud information more user friendly.

Alternatively, we may look at ways to use machine learning to help analyze bombing coverage. For instance, we might be able to use machine learning to group articles by topics or look at patterns of tweets.

One major problem is that Media Cloud’s Russian coverage is in Russian. So, we have found a student to help us do some translation.
