Problems with WikiLeaks and the Start of a Solution

WikiLeaks has been an important step toward greater transparency and accountability. By publishing leaked documents from governments, corporations, and other institutions, WikiLeaks has revealed the power of transparency and openness online. That said, WikiLeaks has serious flaws in its own transparency, the openness of its releases, and its methods.

This is an exploration of some of the problems with WikiLeaks through a story of my attempt to analyze the WikiLeaks Global Intelligence Files release. The Global Intelligence Files (GI Files) are a collection of five million emails from the intelligence contractor Stratfor. These emails are from the years 2004 to 2011 and discuss Stratfor’s internal operations.

I was hoping to analyze Stratfor’s communication structures by making a network graph from the GI Files. Email is perfect for this sort of analysis. Someone made a tool called cable2graph for graph analysis of Cablegate, so I planned to adapt that tool for use with emails.
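The planned analysis might look something like the sketch below, which builds a weighted who-emails-whom edge list in pure Python; a library like NetworkX could then compute proper centrality measures over these edges. The field names and addresses are made up for illustration, not taken from the actual GI Files data.

```python
from collections import Counter

# Toy messages standing in for parsed GI Files emails; the field
# names and addresses here are illustrative, not from the real data.
emails = [
    {"from": "analyst@example.com", "to": ["ops@example.com"]},
    {"from": "ops@example.com", "to": ["analyst@example.com", "watch@example.com"]},
    {"from": "analyst@example.com", "to": ["ops@example.com"]},
]

# Weight each directed edge by how many messages flow along it.
edge_weights = Counter()
for msg in emails:
    for recipient in msg["to"]:
        edge_weights[(msg["from"], recipient)] += 1

# Degree (messages sent plus received) is a crude centrality measure
# hinting at who sits in the middle of the organization.
degree = Counter()
for (sender, recipient), weight in edge_weights.items():
    degree[sender] += weight
    degree[recipient] += weight

print(edge_weights.most_common(1))
```

Even this crude tally surfaces the most active communication channel; richer analysis (betweenness, clustering) would come from feeding the same edge list into a graph library.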

Ideally, I would use all the Stratfor emails for this graph analysis. Unfortunately, WikiLeaks has not released all the Global Intelligence Files yet. They started publishing the GI Files on February 27th, 2012, over a year ago. Five million documents is a lot, but after fifteen months only a fraction of the documents have been released. It does not seem like WikiLeaks is seriously trying to release these documents. They have not released any sets of emails since February 2013, and in most months only a few sets of emails were published. Even accounting for time spent reviewing documents before release, the release of the GI Files is taking far too long.

The second issue with analyzing the Global Intelligence Files is WikiLeaks’s release strategy. WikiLeaks is working with “more than 25 media partners” to release the documents. These partners get access to the full set of documents. WikiLeaks releases GI Files emails only when a partner writes an article about them. WikiLeaks has released a few hundred sets of these emails so far, but most of these releases only contain a few documents.

This release strategy makes it incredibly difficult to find new stories in released Stratfor emails. After all, the emails are only released when they have been used in a story already. This means there is no obvious way for most WikiLeaks supporters to help with the release and analysis of the GI Files. Partners are invited and must be “a journalist, Professor or Associate Professor at a University or an employee of a human rights organisation.” In some ways, this is worse than a pay wall. Access to the GI Files is restricted and there is no clear way someone can get access.

Despite these difficulties, I thought using network graphs to examine the bigger picture of all the released emails may still reveal some new information. WikiLeaks does very little analysis of most of its documents. For the GI Files, WikiLeaks seems to rely entirely on the analyses done by their media partners. Unfortunately, WikiLeaks makes it difficult to analyze most of their releases. A few, like Cablegate, are accessible in machine readable formats. People have used these formats to analyze the documents in interesting ways.

Most WikiLeaks releases are not accessible in machine readable formats. There are no machine readable versions of the Global Intelligence Files. As of this month, the US government has better policies for releasing machine readable data than WikiLeaks. This new policy is a great step forward for the US government, but it is a huge failure for WikiLeaks that they are now behind the US government on certain transparency policies.

To get the GI Files email data in machine readable format, I wrote a scraper. I was able to scrape the email subjects, dates, IDs, and text. Then I ran into yet another issue. Where the email addresses in the to and from fields should have been in the HTML, there was a CloudFlare ScrapeShield script. The purpose of ScrapeShield is to stop spam bots from collecting email addresses from websites. This is a good thing, but ScrapeShield becomes problematic when it gets in the way of analysis of documents. Is it more important that Stratfor employees get less spam or that people can analyze the WikiLeaks documents? I would generally say the latter since extra spam is a minor inconvenience (and most is caught by spam filters), but Stratfor employees have no say in this, so that complicates the situation.
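For what it is worth, the obfuscation scheme CloudFlare used at the time was widely documented: the address is hex-encoded, with a one-byte XOR key prepended and every following byte XORed against that key. Assuming ScrapeShield used that scheme, a decoder takes only a few lines (the sample hex string below is one I encoded myself from a made-up address, not one taken from the GI Files pages):

```python
def decode_cf_email(encoded_hex):
    """Decode a CloudFlare-style obfuscated email hex string.

    The first byte is the XOR key; every following byte is one
    character of the address XORed with that key.
    """
    key = int(encoded_hex[:2], 16)
    return "".join(
        chr(int(encoded_hex[i:i + 2], 16) ^ key)
        for i in range(2, len(encoded_hex), 2)
    )

# "5a3b1a387439" decodes to the made-up address "a@b.c".
print(decode_cf_email("5a3b1a387439"))
```

A scraper could apply this to the protected spans in the HTML, though whether the GI Files pages expose the encoded hex in a scrapable attribute is an open question.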

While not ideal, one solution would be for WikiLeaks to give only their media partners access to machine readable data, or to have some method of requesting it. Not only does WikiLeaks not give media partners access to machine readable data, but it actively bans partners from scraping the documents. This ban on scraping and running scripts greatly limits the types of analysis partners can conduct. It also makes finding documents to write about more time consuming, since network and content analysis could surface interesting sets of emails faster. This restrictive partnership system may be why so few of the GI Files emails have been released so far.

Network analysis is not possible without the to and from fields of the emails. I found that it might be possible to recover the to and from addresses by converting the emails to PDFs and then scraping the PDFs, but this would be extremely difficult. PDF scraping is hard in itself, and automatically scraping thousands of slightly different PDFs may not be possible. Thus, I gave up on the network analysis idea.

WikiLeaks is not only failing to make it easy to help with document release and analysis; it actively impedes anyone who wants to help, including its own partners. While WikiLeaks has done some great things for transparency, the organization has serious problems with secrecy. This secrecy spreads to its releases and perpetuates closed documents behind walls. As the purpose of WikiLeaks is encouraging greater transparency and accountability through the release of restricted information, restricting access to that information again defeats the purpose.

WikiLeaks is dying, and if it does not change its methods it will die. Regardless of what happens, some of the successes of WikiLeaks have shown the world the power of leaking and transparency. But these successful releases are not the norm. Most WikiLeaks releases, and those of other transparency initiatives, are rendered useless by the issues discussed above and others.

We can do better. I am not sure exactly what will work, but there are a few tasks I think a successful solution to these problems will contain. I have been working on a transparency platform that addresses the issues described above and other problems I have noticed with leaking and other transparency initiatives.

When examining any leaked or released information, it helps to define what information the investigative group has and what information it needs. There seem to be two main parts to this defining stage that tools could help with: defining investigative questions and defining the steps to answer those questions.

First, it may be helpful to define the questions the group wants to answer about the information. These questions will likely change throughout the process, but defining some questions upfront can help guide the investigation. A platform that lets people post, edit, and add answers to questions is a simple place to start for investigation of released documents.

Second, the group needs to define how they will answer each question. This could mean determining what information they need, how they will collect that information, and what types of analysis they will conduct with the information. Again, these steps may change throughout the process, but a list of clearly defined steps to answer the investigation’s questions may be a good starting point. Clearly defined steps or tasks are also helpful because they make it clear how supporters can help with the investigation. That clear path to involvement alone would be a huge improvement over WikiLeaks where supporters struggle to figure out how they can help. Simple task management software could allow people to define and allocate these steps to answer each question. It may even be possible to suggest steps based on the wording of the question or steps already entered. Suggestions like this would make it easier for people to figure out how to conduct the investigation.

Some investigations may start around a particular set of documents the group has already. This often seems to be the case with leaking and whistleblowing. In this case, it may be helpful for these documents to be uploaded in one place so people can search and analyze them. These documents should also be uploaded in a machine readable format for analysis. These two goals can be accomplished with a combination of a searchable document storage/upload system and scrapers.
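A minimal version of that searchable store could be little more than a full-text index over the uploaded documents. The sketch below uses SQLite's FTS5 extension (bundled with most Python builds); the document titles and bodies are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 gives tokenized full-text search with no server to run.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany(
    "INSERT INTO docs (title, body) VALUES (?, ?)",
    [
        ("Release A", "emails discussing internal operations"),
        ("Release B", "notes from a meeting about travel plans"),
    ],
)

# MATCH searches every indexed column of the virtual table.
hits = conn.execute(
    "SELECT title FROM docs WHERE docs MATCH ?", ("operations",)
).fetchall()
print(hits)
```

A real storage/upload system would add metadata, access control, and scraped machine readable versions on top, but the core search problem is this small.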

The group examining the released documents may want to collect related information like interviews, data sets, documents released by other organizations, or user contributed data. Or maybe someone has questions but no documents to start with when seeking an answer. These additional documents could also be uploaded to the central storage/upload system to make it easier to search, combine, and analyze all the information. This document collection system could go a step further and make it easier to find documents, with options to pull information from common data sources like government data APIs, Wikipedia, and search results at the click of a button. Additionally, people helping with the investigation could use a browser plugin to easily send documents, or scrape web pages they find online, to the information storage system.

Sometimes a whistleblower may want to upload related documents anonymously. An anonymous submission system could be adapted to send documents directly to the storage/upload system. There could also be options for automatic redaction of names (or emails) in the documents submitted by whistleblowers (perhaps with a way some people can access the full data).
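Automatic redaction of addresses is, at its simplest, a regular-expression substitution. The pattern below is a rough sketch that will miss unusual address forms, and a real system would also need the reversible mapping mentioned above; the sample text is invented:

```python
import re

# Rough pattern for common email address shapes; deliberately simple
# and not a complete address grammar.
EMAIL = re.compile(r"[\w.+-]+@[\w.-]+\.\w+")

def redact(text):
    # A fuller system might replace each address with a stable token
    # so that vetted researchers could reverse the mapping.
    return EMAIL.sub("[REDACTED]", text)

print(redact("Forwarded by analyst@example.com to ops@example.com."))
```

Redacting personal names is much harder than redacting addresses and would likely need named-entity recognition plus human review.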

After all of this information is uploaded, it would be nice if it could be used outside this single investigation. Thus, it could be helpful to give the person uploading documents the option to share them with others using the same document collection system. This sharing system would make it easy to import documents and data into a new investigation or instance of the information collection and storage system.

Collecting and releasing documents only goes so far. To use information, people need to understand it. The analysis used to understand the documents includes anything from reading and discussing the documents to combining different forms of information and using content or network analysis programs. The most helpful type of analysis will vary based on the type of information available and questions asked.

Plenty of analysis tools exist, but a toolkit of many different analysis tools (existing and new) that allowed those tools to be easily linked together and fed information from the document collection system would make them easier to use and more powerful. For example, users could set one tool to parse a set of data from the document collection system, pass its output to another tool that does content analysis to determine relevant Wikipedia pages, and then have a third tool pull dates from those pages. Another tool could format those dates, along with the dates in the original data, into the format required by TimelineJS and embed TimelineJS with those dates on the analysis/results page. Each of these tools could also be used on its own or with different tools. Such an analysis toolkit would make existing tools easier to use and more powerful by allowing people to hook them together without coding. People who can code could upload their own tools for others to use or modify existing ones.
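As a toy version of that chaining, the sketch below wires two tiny "tools" together: one normalizes scraped rows, the next wraps them in TimelineJS-style JSON. The field names and the exact JSON layout are assumptions modeled on the classic TimelineJS format, not a schema taken from any real tool in the pipeline:

```python
import json

def parse_emails(raw_rows):
    # Tool 1: reduce scraped rows to (date, subject) pairs.
    return [(row["date"], row["subject"]) for row in raw_rows]

def to_timeline(pairs, headline="GI Files release"):
    # Tool 2: wrap the pairs in a TimelineJS-style document.
    # The "timeline"/"date" layout here is an assumed shape based on
    # the classic TimelineJS format.
    return {
        "timeline": {
            "headline": headline,
            "type": "default",
            "date": [
                {"startDate": d.replace("-", ","), "headline": s}
                for d, s in pairs
            ],
        }
    }

raw = [{"date": "2011-12-06", "subject": "Re: weekly report"}]
doc = to_timeline(parse_emails(raw))
print(json.dumps(doc, indent=2))
```

The point is less the JSON itself than the shape of the pipeline: each function consumes the previous one's output, so non-coders could rearrange them like building blocks.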

While the system described above may help with collecting and analyzing information, it may still be too time consuming and complicated for someone who just wants to learn about a situation and not take part in the investigation. With all the questions, documents, and analysis in one place, a program could easily take all of these and reformat them for a release page. This release page could have summaries and basic information on the findings at the top with the full details from the investigation, analysis, and full documents underneath. While the investigation system would be structured to make it easy to contribute, the release page would be structured to make the information easy to read and understand.

No matter how nice the release system and investigation platform are, few people will find the release pages on their own. People need to write articles about the release and share links to the release page. WikiLeaks's release model of having media partners write articles is not all bad. I think media partners can be a helpful part of a disclosure system, so long as documents are not released only when media partners write articles and partners have the tools to help with analysis.

Using the Information
Collecting documents, analyzing them, and releasing more understandable information is not helpful if no one uses the information for anything. Using the released information can take many forms, so I am focusing on the first steps mostly for now. That said, the same structure that helps people define questions and steps to answer them could be used to set specific goals for change or greater awareness based on the released information.

Next Steps
Tools exist to help with all of the steps described above. Some people already use these to examine leaked or released documents. Unfortunately, these tools are often difficult to use and identifying and using many different tools or methods quickly becomes cumbersome, so in many cases they are not used at all. I am building the system I described above. This system integrates both existing tools and new ones in a modular and extensible platform for collaborative investigation.

Hopefully this platform will at least make it easier for organizations already releasing or analyzing leaked and released documents to conduct good analyses. Ideally, this system will be directly integrated with sites releasing documents (both leaking organizations and government transparency initiatives) to provide a platform for collaborative investigation and civic engagement between different groups and individuals.

What I Made So Far
I built a prototype of the collaborative investigation platform for defining questions and steps to answer them (minus the automatic suggestions). It also allows people to upload documents, but it does not yet include anything close to the upload system I described. I also made a few small analysis tools to help analyze the Global Intelligence Files: a scraper that can pull all the GI Files, specific releases, or single emails; a gem that automatically generates TimelineJS-compatible JSON from the scraped emails; and a modification of the upload system that embeds a timeline of emails from a single GI Files release when someone uploads the JSON of that release generated by the scraper.

For now, you have to run the scraper on ScraperWiki, but I hope to integrate it more directly with Transparency Toolkit in the future. The main Transparency Toolkit system can be tested on the demo site (where I’ve uploaded some of the GI Files) or downloaded from Github. The TimelineGen gem can be used as a gem or downloaded from Github. Currently you can only make timelines directly on Transparency Toolkit from specific GI Files release pages, but TimelineGen has methods that can be used in more general cases (I’m still hooking them together manually for the timeline embed).

This is a tutorial/demo video that shows how this works:

If you want to use Transparency Toolkit to make timelines of WikiLeaks GI Files releases, some instructions are below. At this point, it is helpful if you know the basics of how to use ScraperWiki.

1. Go to

2. Ask a question or set a goal by typing in the box at the top (or skip this if you want to add a task to an existing question).

3. Click the + button next to the question to which you want to add a task and add a task by typing in the box that says “Add a task”. Tasks are clearly actionable steps for answering the question, like making a timeline of a set of emails.

4. Go to the GI Files release page and click on one of the links to view a set of emails that was released. Copy the URL of the page with the set of emails.

5. Go to the GI Files scraper and scroll down to line 99, right under “#To get all emails for a single gifiles release:”. Then replace the URL in the getEmail(url) method with the URL you copied in step 4.

6. Save and run the scraper.

7. Click “Back to scraper overview” in the top right corner. Then click the Download menu and choose “As a JSON file”. Be sure to clear the old data before running the scraper again.

8. Go back to the task page you created on Transparency Toolkit in step 3. Click the “Contribute results from task” field and type anything you want about the results.

9. Click the Browse button and select the JSON you downloaded in step 7, then submit the results.

10. You should see a timeline of the emails on the release page you specified. You can see an example timeline here as well.

Bradley Manning’s Pretrial Hearing

I made a Storify from tweets of people at Bradley Manning’s pretrial hearing today. You can find it here or below.

Computer Money Going Up (or the rise of the value of Bitcoin)

Note: This is an attempt to explain computer money (Bitcoin) going up using the ten hundred most used words (like in up-goer five).

These days, a lot of people are talking about computer money. Computer money is money with no group in the middle controlling it. Instead, everyone uses their computer to track each time someone uses computer money. This table tracking the use of computer money makes it possible to give others computer money and stops people from lying about how much computer money they have.

People can send computer money to others or get computer money themselves. You do not need to know who the other person is to send computer money. When computer money is used, the person getting the money tells the person giving them computer money a number to send it to. This number can be a different number every time so no one knows who gets the computer money. The person giving away the computer money uses another number they do not tell anyone to show the computer money is real and has not been changed. The computers of the other people using computer money then check to be sure the money is real and add the use of computer money to their tables.

More computer money is made when computers get hard problems right. A computer somewhere in the world finds the answer to one of these hard problems about every ten minutes. Many strong computers work on these problems, so it is hard to get computer money this way. The person who owns the computer that gets the problem right also gets the computer money. The number of computer money given for finding the answer to a problem goes down over time.

Computer money can be used to buy other types of money. For the past few weeks, the number of other types of money one piece of computer money can buy has been going up a lot. This makes some people worried that the pieces of other types of money one piece of computer money can buy will drop very low soon.

Why are the pieces of other types of money one piece of computer money can buy going up so much? There are a few possible reasons. More places are starting to accept computer money. Money is only money if it can be used to buy things. The more things money can buy, the more it can be used. Since more places accept computer money, more people are interested in using computer money.

As more people started using computer money, one state started to make people who use their computers to make computer money or help others use computer money tell them how they use the computer money. This is to stop people from hiding that they get money from places not allowed by the state.

At the same time, a different state needed to give another a lot of money. This state had to take money from the people who live there to pay the money back to the places they got it from. People in other states were worried about the same thing happening to them. This made computer money, which does not need a state in the middle controlling it, look good.

All of this caused computer money to get more attention. As computer money got more attention, more people started accepting it. One person even tried to offer his house for computer money. This attention, in turn, made computer money easier to use. When computer money is easier to use, the pieces of other types of money one piece of computer money can buy goes up. Now, all of the computer money can be used to buy more than ten hundred ten hundred ten hundred pieces of some other types of money.

Interview with Erhardt Graeff

For the interview assignment, I interviewed Erhardt Graeff. Below is an edited clip from the interview with Erhardt introducing himself and discussing his research interests, how he became interested in the area, his future plans, and why he is taking this class.

Interview Audio Clip

I also made a timeline with some of the events in Erhardt’s life. You can view this below or here.


The Future of News, Civic Engagement, and Everything

The future of news is tied directly to the future of how citizens are informed. This connection is confirmed by Starr as well as by Kovach and Rosenstiel. I would go a step further and say that the news is connected to civic engagement and participation. This is increasingly true as participatory and social media provide a way for anyone to join the discussion of events and articles. I think this trend toward greater participation and engagement with the news is a good thing and should continue, but I also think there is a place for traditional media.

Participatory media alone has two opposite problems: either there is so much information on a subject that it is impossible to fully understand the discussion and the topic itself, or not enough people talk about and analyze a subject. I think traditional media serves two key roles to balance out a mostly participatory news ecosystem. First, traditional media synthesizes and distills the discussions in participatory media and projects this information to a broader audience. Second, traditional media serves as a guidepost for what topics are important for citizens to discuss and investigate.

The traditional media is still in an incredibly powerful and potentially dangerous role. If anything, it is more powerful than before, as citizens intentionally or unintentionally base their own beliefs, which then spread further, on what they hear in the traditional media. As a result, it needs to be handled carefully. One example of an issue highlighted by Kovach and Rosenstiel is corporate influence on the news. We need to develop ways of minimizing and disclosing these external influences if traditional media is to have its ideas spread throughout participatory media.

Just as participatory media is balanced by traditional media, this issue with traditional media can be tempered by participatory media. In the early stages of a story, journalists can listen to the conversations in participatory media, understand them, and incorporate them into the story. We need to make this balancing cycle easier by developing tools and fostering collaboration between professional journalists and citizens.

Aaron Swartz proposed an interesting solution for the future of transparency and news. He suggested that journalists, bloggers, programmers, lobbyists, and people with all sorts of skills work together in investigative strike teams to understand and fix society’s problems. I think this is a great way for traditional and participatory media to benefit from each other in a way that results in not only increased access to information but also tangible improvements to the world.

I also think this is happening naturally in many ways. Leaking organizations are one example of this process. Many leaking organizations are independent institutions run by normal citizens who receive and verify information, find background and supplementary information, analyze documents, and work with traditional media partners to release and explain the information. At their best, leaking organizations work very much like Swartz’s investigative strike teams. We need to encourage investigative strike team-like partnerships, teach people how to make participatory media, and build structures to make it easier to get involved and understand all the information available. I have one specific proposal for doing that in leaking here.

MC’s Media Diary

Tracking media consumption is hard. We are constantly inundated with ads, TVs and music in the background, pictures, and other bits of media we may barely see. Sites with dynamic content, like Facebook and Twitter, are particularly hard to track. The content is varied and there is no lasting record of content viewed. This is problematic because I get a significant portion of my news through Twitter. I could track every tweet I load, but I do not read every tweet I load. The same goes for articles on a webpage with ads or multiple types of media on the page. I might not read the ads or the comments, but there is no way to automatically tell. Gaze tracking on the page may be one way to solve this problem.

It is probably possible to track media consumption well with some elaborate scheme and the right software. I have some ideas of how to do this, but I mostly stuck to tracking sites I visited on my computer plus the more major offline media consumption experiences. I also focused on content I consumed because, while I did produce content, I did not track time spent consuming or creating different types of media. While I spent a significant amount of time creating content, the number of things I made is insignificant when compared to the number I consumed.

I manually categorized the few thousand individual pages I visited and some offline media experiences in the past week. The graphs and discussions of each graph are below.
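Some of that manual categorization could be automated with simple substring rules over the visited URLs; the rules and URLs in the sketch below are invented for illustration, and a real run would need far more rules plus hand review of the leftovers:

```python
from collections import Counter

# Map URL substrings to categories; first matching rule wins.
# These rules are illustrative, not the ones I actually used.
RULES = [
    ("google.com/search", "searches"),
    ("twitter.com", "social networks"),
    ("/blog", "blogs"),
]

def categorize(url):
    for needle, category in RULES:
        if needle in url:
            return category
    return "other"

visits = [
    "https://www.google.com/search?q=gifiles",
    "https://twitter.com/home",
    "http://example.com/blog/post",
    "http://example.edu/dining-hours",
]
print(Counter(categorize(u) for u in visits))
```

The "other" bucket is where automation runs out and manual judgment takes over, which matches my experience doing this by hand.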

The above graph shows how I consume media. As I was primarily tracking links, I naturally consume most of my media on my computer. I also use my phone to quickly look things up and read the news while in transit. If I were fully able to track my Twitter usage, my phone's percentage would likely be higher.

At 0.5%, offline consumption is barely visible. Offline consumption includes paper books and handouts, classes, and lectures. Conversations, ads I see, music or TV programs in the background of a room, and other media I consume intentionally or accidentally would greatly increase my offline consumption. Unfortunately, I did not track all of those.

This graph highlights the limitations in my tracking method. That said, I do spend much of my time in front of my computer or phone visiting links. So what types of media do I consume?

Apparently I do a lot of searches. My searches category also includes searches on individual websites, but 33.2% of all pages I view is a lot. I did not track the content of my searches, but it probably is proportioned similarly to the other categories.

Another interesting finding is that I read more blogs than standard articles. I also view many school website pages as I check hours of food places, read assignments, upload school work, and check course registration.

There are a few things this tracks poorly. Books, TV, and movies have small slices because I consumed relatively few of them. In some ways, the less time a particular type of content takes to consume, the more of it I view. Thus, the most time consuming activities appear far less prominently than they should. For a different reason, social networks should have a bigger slice: tweets do not take long to read, so I read a lot of them, but each tweet does not add a page view.

The graph above shows the type of content I consume. The categories are based on which areas I viewed the most media about this week. After searches and social media, I consume the most media about free information, a category I created to include transparency, open access, leaking and disclosure, and other related areas. For most people, free info would probably not be a category. Instead, they might read a couple of related articles a week in US and world news. These categories are just what worked for me.

Likewise, I am not sure there is ever a normal or average week in the content or type of media I consume. Some people have discussed events, like the snow storm, that made their media consumption this week abnormal. I definitely looked at more weather pages than I usually do. That said, many weeks there is a story for which I search for many articles and cross-reference them so I can understand the whole situation. This constant searching for many articles regularly distorts what type of content I read.