MAS S61 final project

An Analysis of Sina Weibo Censorship Using WeiboScope Search Data

Starting at 4:28 PM May 19, 2012, I posted on my Sina Weibo account two names as well as the Chinese words for “Taiwanese independence.” The first name I posted was “Chen Guangcheng” (in English and Chinese), the blind lawyer who escaped house arrest in Shandong province and made his way to the U.S. Embassy in Beijing. The second name was “Bo Xilai” (in English and Chinese), the former Party Secretary of Chongqing who recently fell from power. Less than 14 hours later, I received a message from Sina Weibo’s system administrator informing me that my two posts on “Chen Guangcheng” were “inappropriate” and had been censored. While I can still see the two “Chen Guangcheng” posts on my Sina Weibo account page, no one else can. Surprisingly, my posts on “Bo Xilai” and “Taiwan independence” were not censored.

Herein lies the conundrum with censorship in China. We know that certain topics are censored from blogs hosted in China, Chinese search engines and Weibos. But we don’t know where the line lies, partly because the line is constantly moving. Baidu, Sina and Tencent could help identify the line by publishing a list of banned topics or keywords, but they don’t. Rather, they hire “monitoring editors” and rely on self-censorship to ensure that user-generated content does not run afoul of Chinese authorities.

Some computer scientists in academia have tried to make sense of censorship in Sina Weibo by analyzing the data. In March 2012, David Bamman, Brendan O’Connor and Noah Smith at Carnegie Mellon University published a paper entitled “Censorship and deletion practices in Chinese social media” in First Monday. After analyzing 56 million Sina Weibo messages, they found that more than 16% had been deleted. King-Wa Fu and Cedric Sam at the University of Hong Kong’s Journalism and Media Studies Centre have built the Weibo Scope Search, which archives deleted posts on Sina Weibo.

For my MIT Media Lab final project, I’ve tried to build on King-Wa Fu and Cedric Sam’s work by analyzing the data collected from the Weibo Scope Search to try to make some sense of Sina Weibo censorship. From its inception on February 1, 2012 to May 20, 2012, the Weibo Scope Search collected 12,032 deleted messages from Sina Weibo. The first thing I did was to simply plot all the deleted messages on a timeline from February 1, 2012 to May 20, 2012 and this is what I got:

My findings were consistent with the Carnegie Mellon team’s findings. There are spikes in Sina Weibo censorship as a result of media reports and rumors. During the Carnegie Mellon survey period from June 27, 2011 to September 30, 2011, there was a rumor that former President Jiang Zemin had passed away, causing a spike in Sina Weibo deletions. From February 1, 2012 to May 20, 2012, the following incidents in China caused the censors employed by Sina Weibo to work overtime:

  • February 6, 2012 – Chongqing Public Security Bureau head Wang Lijun goes to U.S. Consulate in Chengdu with information about the death of British businessman Neil Heywood that implicates Chongqing Party Secretary Bo Xilai.
  • March 8, 2012 – Chongqing Party Secretary Bo Xilai fails to show up at the National People’s Congress, sparking rumors that he has fallen from power.
  • March 15, 2012 – Bo Xilai is removed from post as Chongqing Party Secretary.
  • March 18, 2012 – A Ferrari crashed on Beijing’s Fourth Ring Road killing one and injuring two people.
  • April 22, 2012 – Blind lawyer Chen Guangcheng escapes from house arrest in Shandong province and makes his way to U.S. Embassy in Beijing.
  • May 14, 2012 – The Beijing Daily posts a message on its official Weibo charging that U.S. Ambassador to China Gary Locke is posing as an ordinary citizen and calls for Locke to disclose his wealth.

Interestingly, deletions of Sina Weibo messages tend to hit a low on Saturdays. I’m not too sure why that is, except that maybe censors want to take time off on weekends as well. If you want to maximize the length of time your message will remain on Sina Weibo, probably the best time to post it is after 11 PM on Friday night.

The second analysis I did with the Weibo Scope Search was to try to figure out how long it took the censors to delete messages on Sina Weibo. Each Sina Weibo post has a timestamp for when it was created. The Weibo Scope Search checks Sina Weibo’s timeline at most four times a day (but usually fewer due to limits that Sina Weibo imposes). Let’s say, for instance, a user posts a message on Sina Weibo at 8 AM. Weibo Scope Search checks Sina Weibo’s timeline at 9 AM, 3 PM, 9 PM, and 3 AM. If the message was deleted by the censor at 10 AM, it would show up with a Weibo Scope Search “deleted time” of 3 PM.
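The 8 AM example can be sketched in a few lines of Python. This is only an illustration under the assumption that the checks run at exactly the four times used in the example; the real schedule varies, and `detected_at` is a hypothetical helper name, not part of Weibo Scope Search.

```python
from datetime import datetime, time, timedelta

# Check times taken from the example in the text; the real schedule varies.
CHECK_TIMES = [time(3, 0), time(9, 0), time(15, 0), time(21, 0)]

def detected_at(deleted: datetime) -> datetime:
    """Return the first scheduled check at or after the actual deletion time."""
    for day_offset in (0, 1):
        day = deleted.date() + timedelta(days=day_offset)
        for t in CHECK_TIMES:
            candidate = datetime.combine(day, t)
            if candidate >= deleted:
                return candidate
    raise AssertionError("unreachable: a check always occurs within 24 hours")

posted = datetime(2012, 5, 20, 8, 0)
deleted = datetime(2012, 5, 20, 10, 0)
seen = detected_at(deleted)
print(seen)           # 2012-05-20 15:00:00
print(seen - posted)  # 7:00:00 apparent lifetime; the true lifetime was 2 hours
```

The gap between the apparent and true lifetime is why the recorded times are best read as upper bounds.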

shortest time 0:04:04 hours
longest time 3401:51:45 hours

The fastest a post was deleted on Sina Weibo was just over 4 minutes. The longest time it took for the censor to get around to deleting a message on Sina Weibo was over four months. For the posts created on May 20, 2012 and deleted on the same day, it took on average 11 hours for Weibo Scope Search to detect the deletion. It took the censors about 14 hours to delete my post “chen guangcheng.” Determining the average time it takes for censors to delete “irresponsible” messages is a bit tricky since we don’t have data on exactly how long it takes for each post to be deleted. Out of curiosity, I pulled up three messages that took over four months to delete to see what they said:

time created time deleted  hours  message
2011-12-29 00:30:41 2012-05-18 18:22:25 3401:51:45 “如果明年欧美名校在三四月份一起召开家长会的话,那么中国的十八大就很可能开不了了。”
“If the top universities in Europe and the U.S. hold their parent-teacher conference next March or April, then China will not be able to hold its 18th Party Congress then.”
2011-12-17 20:52:01 2012-04-27 20:40:12 3167:48:11 “【媿尔公侯高窃位,怆然世事急抢滩】国际盲人日当天@张海迪 通过私信@我是闻正兵 公布了她之前为光诚的努力。当年我和袁伟静嫂子向她求助时抱了很大期望,但坦率说从未得到她哪怕一个电话询问或慰问,这肯定谈不上“做了应该做的一切”。如今舆论环境更好而光诚处境则更糟,难道她一点努力都不能做吗?”
“On World Blind Day, paraplegic writer Zhang Haidi told Wen Zhengbin in a private letter that she did her utmost to help Chen Guangcheng. At the time, Chen Guangcheng’s wife Yuan Weijing and I asked her for help, but she didn’t even call us or ask us how we were doing. She didn’t do everything she should have. Today, Chen Guangcheng’s situation is even worse while there is greater openness for debate. Shouldn’t she have done more?”
2011-11-26 16:13:26 2012-03-28 07:55:28 2943:42:02 RT:”演藝界人士周星馳表支持唐英年參選下屆行政長官,欣賞他的處事,為人豁達開通。他又說,唐英年一點也不蠢,是有智慧的人,自己亦不會與蠢的人做朋友。對於唐英年有感情缺失,會否流失支持,周星馳認為並無關係,因為現時是選特首,不是選男朋友。″
“Hong Kong actor Stephen Chow said he supports Henry Tang’s bid to be the next Chief Executive of Hong Kong. Chow admires Tang’s way of doing things and open mindedness. Chow added that Tang is not stupid, but a smart person. Chow says that he wouldn’t be friends with stupid people. Regarding whether Tang’s infidelities will cause him to lose support, Chow says that it shouldn’t matter because people are voting for the Chief Executive, not choosing a boyfriend.”
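The hours column in the table above is the difference between the two timestamps. As a minimal sketch, assuming both timestamps are in the same time zone and use the format shown, it can be computed like this (`elapsed_hms` is a hypothetical helper name):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def elapsed_hms(created: str, deleted: str) -> str:
    """Format the time between creation and detected deletion as H:MM:SS."""
    delta = datetime.strptime(deleted, FMT) - datetime.strptime(created, FMT)
    total_seconds = int(delta.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"

print(elapsed_hms("2011-12-17 20:52:01", "2012-04-27 20:40:12"))  # 3167:48:11
```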

I’m not too sure why it took so long to delete the posts. Cedric Sam points out that the posts may have been in the Weibo Scope Search database to begin with and they just didn’t turn up until several months later. The researchers at University of Hong Kong’s Journalism and Media Studies Centre are constantly adding new Sina Weibo to their list. Or, they could have just turned on the deletion marking system in the Weibo Scope Search so that it would have caught some censored posts that weren’t caught before.

There is no way to tell for sure whether some of the posts were deleted by the users themselves instead of “monitoring editors.” Sina’s API returns two types of error messages: “Weibo does not exist” and “Permission denied.” We assume that when a post is deleted by the user, the “Weibo does not exist” error message comes up. When a post is censored, the “Permission denied” error message comes up. Weibo Scope Search keeps track of all the deleted posts that have the “Permission denied” error message.
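That assumption amounts to a simple classification rule. The sketch below is only an illustration: the two strings are the ones quoted above, but the exact wording returned by Sina’s API and the function name `classify_deletion` are assumptions.

```python
from typing import Optional

# Error strings as quoted in the text; the exact API wording is an assumption.
USER_DELETED = "Weibo does not exist"
CENSORED = "Permission denied"

def classify_deletion(error_message: Optional[str]) -> str:
    """Guess why a post is unavailable from the API error message."""
    if error_message is None:
        return "visible"
    lowered = error_message.lower()
    if CENSORED.lower() in lowered:
        return "censored"
    if USER_DELETED.lower() in lowered:
        return "deleted_by_user"
    return "unknown"

print(classify_deletion("Permission denied"))     # censored
print(classify_deletion("Weibo does not exist"))  # deleted_by_user
```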

If I had more time (and knew how to code), I would have liked to have analyzed more of the data that Weibo Scope Search came up with. Among the things I would have liked to explore are:

  • Geographic distribution of deleted messages on Sina Weibo – The Carnegie Mellon paper also looked at geographic distribution of censored Sina Weibo and found that messages issued from Tibet, Qinghai and Ningxia are deleted at a higher rate. Weibo Scope Search also had data on the city and province that each message originated from. However, I didn’t have enough time to figure out how to convert Sina’s city and province data into a form that could be plotted on a map.
  • Relationships between the most censored Sina Weibo accounts – Using Weibo Scope Search, we’re able to rank the 3,524 users whose Sina Weibo messages are being deleted from most to least. One thing I’d be interested in exploring is how many followers these Sina Weibo accounts have and whether they follow each other. It’s not clear to me if the censors have compiled a list of influential Sina Weibo accounts and are tracking them daily or the censors are using key word searches to figure out what to censor.
  • A deeper analysis into the most censored Chinese words on Sina Weibo – Several weeks ago, I did a word cloud of the most censored Chinese words on Sina Weibo to see what came up. By far, the most censored words were the Chinese words for “retweet” followed by “ha ha” or some variation. It makes sense, but it’s not very helpful. Given more time, I would have liked to dig a little deeper to see if there were any words or code words that consistently came up again and again after filtering out the “retweets,” “ha ha,” and other stop words.
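The stop-word filtering described in the last item could look roughly like the sketch below. It assumes the posts have already been split into words (Chinese text needs a separate segmentation step, which is not shown), and the sample posts are toy data made up for illustration, not Weibo Scope Search records.

```python
from collections import Counter

# Retweet markers and laughter to ignore when counting words.
STOP_WORDS = {"转发微博", "转发", "轉發微博", "哈哈", "哈哈哈", "呵呵", "嘻嘻"}

def top_words(tokenized_posts, n=10, stop_words=STOP_WORDS):
    """Count words across posts, skipping stop words, and return the n most common."""
    counts = Counter(
        word
        for post in tokenized_posts
        for word in post
        if word not in stop_words
    )
    return counts.most_common(n)

# Toy data: each post is a list of already-segmented words.
sample_posts = [["转发微博", "蜡烛"], ["哈哈", "围观", "蜡烛"], ["转发", "求证"]]
print(top_words(sample_posts, n=3))
```

Extending `STOP_WORDS` is the main lever: each pass through the results suggests more filler words to drop.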

How to Analyze WeiboScope Search Data

King-Wa Fu and Cedric Sam at the University of Hong Kong’s Journalism and Media Studies Centre have built a WeiboScope Search that sends all of the deleted Weibo posts to a server in Hong Kong and stores them.  However, the data is in JSON format, which looks like this:

To make sense of the data collected, we first need to clean it up. I used Google Refine to do this:

      1) Download + install Google Refine
      2) Click on “Create Project”
      3) Click on “Web Addresses (URLs)”

      4) Insert link
      5) Click “Next”

      6) Highlight the fields you’re interested in and left click the mouse.

    Google Refine should automatically put all the fields into columns:

      7) Click “Create Project”
      8) Click “Export”
      9) Click “Excel”
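For readers who prefer a script to the Google Refine steps, the same flattening can be sketched with Python’s standard library. This is an assumption-heavy illustration: it supposes the JSON is a top-level list of objects, and the field names `created_at`, `deleted` and `text` are hypothetical stand-ins for whatever fields the data actually uses.

```python
import csv
import io
import json

def json_to_csv(json_text, fields):
    """Turn a JSON list of objects into CSV text, keeping only the chosen fields."""
    records = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for record in records:
        # Missing fields become empty cells; extra fields are dropped.
        writer.writerow({field: record.get(field, "") for field in fields})
    return buffer.getvalue()

# Hypothetical record shape, for illustration only.
sample_json = (
    '[{"created_at": "2012-05-20 08:00:00", '
    '"deleted": "2012-05-20 15:00:00", "text": "example", "extra": 1}]'
)
print(json_to_csv(sample_json, ["created_at", "deleted", "text"]))
```

The resulting CSV text can be saved to a file and opened in Excel or Tableau in place of the exported spreadsheet.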

Now that we have the data formatted, we want to make sense of it.

The first project I did was to graph the deleted weibos on a timeline. My classmate Eugene Wu suggested that the best software to visualize the data is Tableau.

      1) Download + install Tableau
      2) Click on “Open Data”
      3) Under “Connect to Data: In a file”, click on “Microsoft Excel”
      4) When the “Excel Workbook Connection” window pops up, click “Ok”

      5) Change the format for the “created at” and “deleted” columns from “text” to “Date & time” by right clicking the mouse, selecting “Change Data Type” then “Date & time.”

      6) Go to the Dimensions box and drag the “deleted” data set to “Columns”
      7) Go to the Measures box and drag the “Number of Records” data set to “Rows”
      8) In the “Show Me” box, select the type of graph you want. Voila!
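The counting that Tableau does behind the timeline, the number of records per deletion date, can also be sketched in Python. This is a minimal illustration assuming the “deleted” column holds timestamps in the format shown in the table earlier; `deletions_per_day` is a hypothetical helper name.

```python
from collections import Counter
from datetime import datetime

def deletions_per_day(deleted_timestamps):
    """Count deleted posts per calendar day, sorted by date."""
    days = Counter(
        datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").date()
        for ts in deleted_timestamps
    )
    return sorted(days.items())

timestamps = [
    "2012-05-18 18:22:25",
    "2012-04-27 20:40:12",
    "2012-03-28 07:55:28",
]
for day, count in deletions_per_day(timestamps):
    print(day, count)
```

Grouping by `day.strftime("%A")` instead of the date would give the day-of-week pattern.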

Using Dataforager to Report on Pakistan

Several hours before my classmate J. Nathan Matias first showed me his new tool, Dataforager, on Sunday, Pakistan blocked Twitter, apparently because some tweets were urging people to join the third “Everybody Draw Muhammad Day” campaign on May 20, 2012. While I was playing around with Dataforager, I applied it to five news reports about Pakistan blocking (then unblocking) Twitter and came up with the table below:

newspaper | Washington Post | BBC | New York Times | Global Voices | Guardian
title | Pakistan blocks, then restores, Twitter access | Pakistan restores Twitter after block | Pakistan Blocks Twitter Over Cartoon Contest | Pakistan: Twitter Goes Through Weekend of Censorship | Pakistan blocks Twitter amid blasphemy fears
data forager results @Innovations @SenRehmanMalik @nytimesworld @FizaBatoolGilan @GdnPolitics
@fispahani @MarkLGoldberg @JonathanHaynes
@marvisirmed @SenRehmanMalik @mediaguardian
@sherryrehman @abidbeli
missed Twitterers Rehman Malik Fizza Batool Gilani Farieha Aziz
Cyril Almeida Arif Rafiq Imran Khan
Rehman Malik Emrys Schoemaker
Ali Dayan Hasan
Raza Rumi

The Washington Post and Global Voices do the best job of coding links in the story to interviewees’ Twitter accounts. The New York Times and Guardian cite sources as having Twitter accounts, but force the reader to search for each Twitter account.

One of J. Nathan Matias’ original intentions for Dataforager was to help users compile a list of experts to learn more about a particular subject or topic. Since I know very little about Pakistan besides the fact that it borders India and it’s where Osama Bin Laden was living for the past six years, I tested J. Nathan Matias’ theory to see if I could find enough information to write a story about Pakistan using the list of Twitterers compiled using Dataforager.

For this experiment, I used Dataforager to compile a list of experts on Pakistan on Twitter from Washington Post’s Pakistan blocks, then restores, Twitter access article by Richard Leiby and Storify’s Flurry of tweets in wake of Pakistani Twitter ban article by Annie Ali Khan.

From Annie Ali Khan’s Storify article, Dataforager pulled up a list of 17 Pakistanis on Twitter:

From Richard Leiby’s Washington Post article, Dataforager pulled up a list of 3 Pakistanis on Twitter:

Going through the tweets compiled from Dataforager on Tuesday morning, I found that both lists mention some kind of unrest in Karachi. Entrepreneur Mohammed Sumair Kolia tweeted that 8 people were killed and 30 injured in the unrest in Karachi. Pakistan’s Ambassador to the U.S. and former journalist Sherry Rehman tweeted that two journalists were injured during the firing at the Awami Tehrik rally in Karachi.

By themselves, the tweets aren’t enough to piece together what happened in Karachi. We know how many people were injured, but we still don’t know what happened, who did it, how it happened, and why it happened. Fortunately, the tweets use a #Karachi hashtag. Doing a “#Karachi” search on Twitter, I get a list of tweets about what happened in Karachi.

A sports reporter with GEO News, Faizan Lakhani, was the first one to tweet “Reports of firing in Boltun Market and old city area. ‪#Karachi‬” with a “#Karachi” hashtag.

Two and a half hours later, the Express Tribune posted on its web site and Twitter a live blog/news article about the riots that unfolded in Karachi after unidentified gunmen opened fire on an Awami Tehreek and Peoples Amn Committee rally.

MAS S61 final project 1st draft

For the final class project, I want to do something with the data collected from the University of Hong Kong Journalism and Media Studies Centre’s WeiboScope Search project. In class last week, Ethan Zuckerman suggested that one option may be to do an online art piece using the most censored Chinese words on Sina Weibo. Out of curiosity, I did a draft of the 100 most censored Chinese words on Sina Weibo to see what came up. Here’s a quick translation of the most censored Chinese words:

转发微博 retweet weibo (simplified Chinese)
转发 retweet
轉發微博 retweet weibo (traditional Chinese)
哈哈 ha ha
偷笑 smile
嘻嘻 hee hee
呵呵 he he
哈哈哈 ha ha ha
蜡烛 candle
吃惊 surprise
围观 crowd
话筒 microphone
思考 think
威武 mighty
求证 confirm
挖鼻屎 pick boogers

The most common words are the Chinese equivalent of “retweet” or “RT.” The next most common are expressions, such as “ha ha” or “anger.” It doesn’t make much sense that the 50 cent party are simply censoring emotions. I’ll need to figure out a way to dig one layer deeper.

MAS S61: assignment #6

I was going through my Twitter feed when I noticed Ethan Zuckerman’s tweet about this Associated Press article: Mali Coup leader stays put, despite sanctions.

This is the first coup d’etat in Mali since 1991, when then-lieutenant colonel Amadou Toumani Toure removed Moussa Traore from office in the “March Revolution.” Amadou Toumani Toure was elected president in 2002 and was scheduled to step down after the presidential election on April 29, 2012.

The origins of this “accidental coup” can be traced back to the rise of Tuareg rebels, who returned home to northern Mali heavily armed after fighting for Muammar el-Qaddafi in the Libyan civil war. Since Mali gained independence from France in 1960, the nomadic Tuareg people have fought for the creation of a new state three times: 1962-64, 1990-95 and 2006. The heavily armed Tuareg rebels helped reignite the separatist movement earlier this year.

In the months leading up to the coup, Mali soldiers have been frustrated by their lack of weaponry to fight the Tuareg rebels, who have captured several towns in northern Mali. The coup was ignited March 21, 2012, when the Mali defense minister Sadio Gassama visited the Kati military camp to try to defuse a planned protest by the Malian Army. The soldiers booed and threw stones at Sadio Gassama’s car, forcing him to temporarily seek refuge from his own troops. Later that day, Mali soldiers took over the presidential palace and Office of Radio and Television of Mali in Bamako, capital of Mali.

January 17, 2012
After the Malian government fails to engage in dialogue and sends troops to the Tuareg region of northern Mali, the National Movement for the Liberation of Azawad (MNLA) exchanges fire with Malian troops and takes over the town of Menaka
Source: MNLA

January 21, 2012
MNLA issues final appeal to evacuate all foreign nationals from Azawad
Source: MNLA

March 21, 2012
Mali’s Brigadier General Sadio Gassama visits the Kati military camp to defuse protests planned by soldiers in the camp for March 22 over the bad management of the conflict with the Tuareg rebellion in northern Mali

Malian soldiers boo and throw stones at Sadio Gassama’s car. Sadio Gassama is sequestered for his own safety, then released.

Malian soldiers storm presidential palace and the Office of Radio and Television of Mali in Bamako, capital of Mali.
Sources: New York Times, Reuters

March 22, 2012
National Committee for the Restoration of Democracy and State (CNRDR) leader Amadou Haya Sanogo appears on Malian TV to announce curfew and appeal for calm in Bamako.
Sources: Guardian, Youtube

France suspends all cooperation with Mali
Source: Foreign Ministry of France

March 23, 2012
African Union suspends Mali from African Union
Source: African Union

March 26, 2012
U.S. State Department cuts off aid to Mali. The U.S. provides $140 million annually in aid to Mali. The U.S. said it would continue to provide humanitarian and food aid to Mali, which means that half the annual aid could be cut by the sanctions.
Source: State Department

March 28, 2012
Mali’s deposed president Amadou Toumani Toure tells Radio France Internationale he is free and unharmed
Source: Radio France Internationale

Captain Amadou Haya Sanogo gives interview to Time magazine
Source: Time

March 29, 2012
The Economic Community of West African States (ECOWAS) dispatches a high-level delegation to Mali. ECOWAS gives Amadou Haya Sanogo 72 hours to hand over power to civilians or face an embargo on Mali.

April 1, 2012
Tuareg seize regional capitals Kidal, Gao and Timbuktu

April 3, 2012
UN Commission for Human Rights estimates that 200,000 people have fled northern Mali due to the fighting.
Source: UN

U.S. Imposes Visa Restrictions on Malian Mutineers
Source: State Department

State Department press statement on Political and Security Situation in Mali
Source: State Department

Mali Coup leader stays put, despite sanctions
Source: Associated Press

MAS S61: assignment #5


Last week, we visited Foshan to interview factories for a consulting project that Professor Huang Yasheng is doing for the Guangdong provincial government. One of our first stops in Foshan was the ginormous Shunde District Government Office, which the locals have dubbed the “Shunde White House.” The Communist Party of China has cited Shunde’s government office as one of the more extravagant government office buildings. It also got me wondering: Where did the Shunde district government get the money to build the government office building? I checked Shunde district government’s fiscal budget for the past decade and came up with this graph:

Shunde revenues

It looks like Shunde district government’s revenues have been growing because they have been collecting more income tax from the companies in the region. Shunde’s government has benefited from having white goods manufacturer Midea based there. Midea accounts for 70% of the township’s GDP. Last year, Midea paid 5.2 billion RMB ($823.6 million) in taxes or almost 60% of Shunde’s income tax revenues, according to the Beijiao Economy Promotion Bureau.

Shunde expenditures

Naturally, I then wondered where the Shunde District Government was spending all of its money (besides building huge government office buildings). Surprisingly, the number one expenditure by the Shunde District Government was in education. Last year, the Shunde district government spent 2.9 billion RMB or 22% of its total expenditures on education.
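The figures quoted so far allow a quick back-of-the-envelope check. The sketch below only rearranges numbers from the text; because the 22% share is rounded, the implied total is approximate, and `implied_total` is a hypothetical helper name.

```python
def implied_total(part: float, share: float) -> float:
    """Back out a total from one component and its share of that total."""
    return part / share

# Figures quoted in the text (RMB unless noted).
education_spending = 2.9e9   # 22% of total expenditures
education_share = 0.22
midea_tax_rmb = 5.2e9        # quoted as $823.6 million
midea_tax_usd = 823.6e6

total_expenditures = implied_total(education_spending, education_share)
exchange_rate = midea_tax_rmb / midea_tax_usd

print(f"Implied total expenditures: {total_expenditures / 1e9:.1f} billion RMB")  # 13.2
print(f"Implied exchange rate: {exchange_rate:.2f} RMB per USD")                  # 6.31
```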

Shunde deficit

When I combined the two graphs of Shunde District Government’s revenues and expenditures, it turned out that Shunde has had a deficit in 8 out of the past 10 years. The only two years when Shunde didn’t report a deficit were 2008 and 2009, which is a bit ironic since the financial crisis pushed most other governments further into debt.

A lot of local governments took on debt in 2008 and 2009 to invest in transportation infrastructure projects to get through the financial crisis. The National Audit Office came out with a report in June 2011 estimating that China’s local governments held a cumulative 10.7 trillion RMB ($1.7 trillion) in debt at the end of 2010. Some policymakers and academics in China have started to get a little concerned because 17.17% of the debt comes due across last year and this year.

MAS S61: assignment #4

On March 12, 2012, Republican presidential candidate Mitt Romney responded to President Barack Obama’s progress report on the Blueprint for a Secure Energy Future. Rather than take the White House’s infographics on gas prices as gospel, I checked the U.S. Energy Information Administration’s historical gas price data to verify Mitt Romney’s claim, “with gas prices setting new records,” and posted the findings below:

MAS S61: assignment #2

Four-hour challenge:
Start: 19:05 2/18/2012
Finish: 22:26 2/18/2012

CAMBRIDGE, MA — Harvard (23-3, 9-1 Ivy) beat Yale again 66-51 to tie the school’s record for victories. Harvard set the program’s record with 23 wins last season, when it tied for first place in the Ivy League.

With four games left in the season, Harvard is on track to claim sole ownership of the Ivy League title this season and make it into the NCAA tournament for the first time since its sole appearance in 1946. The Ivy League does not have a conference tournament so the top team in the league receives an automatic bid for the NCAA tournament next month.

Harvard also extended its home winning streak to 27 games, the second longest home court winning streak after Kentucky’s, which has also grown, to 50 games. Harvard is 10-0 at home and has yet to lose at Lavietes Pavilion this season.

Led by the inside-outside duo of guard Brandyn Curry and forward Keith Wright, Harvard dominated Yale in virtually every category: field goal percentage, 3-point field goal percentage, rebounds, assists, as well as points in the paint. The Crimson made 54.3% (25-46) of its shots and outrebounded Yale 33-22.

Keith Wright scored 10 points, hauled in 8 rebounds and tied Brian Cusworth’s school career block record of 147. Brandyn Curry led Harvard with 18 points and 5 assists. Guard Oliver McNally added 9 points and 2 assists while forward Kyle Casey chipped in 8 points.

Saturday evening’s loss to Harvard drops Yale’s record to 17-7 overall and 7-3 in the Ivy League. Yale was led by center Greg Mangano, who had 22 points, 11 rebounds and 5 blocked shots. Unfortunately, Mangano received little support from his teammates, none of whom scored more than 8 points.

Harvard climbed into the Top 25 men’s basketball teams for the first time this season. However, the Crimson fell out of the Top 25 after each loss. Harvard has lost three times this season: to University of Connecticut December 8, Fordham January 3 and Princeton February 11.

Next weekend, Harvard will play Princeton (15-10, 6-3 Ivy) and Penn (15-11, 7-2 Ivy) at home. After that, Harvard will have two more league games against Columbia (14-12, 3-7 Ivy) and Cornell (10-14, 5-5 Ivy).