Police corruption in South Africa: what does and doesn’t it tell us about the state of the country?

The Economist’s 500-word piece “South Africa’s Police: Bad cop, bad cop routine”, published March 9th, provides a glimpse into the dark side of this Southern African nation. Often referred to as “the darling [reconciliation model] of many legal policy makers in the international community” and as the richest nation in sub-Saharan Africa, South Africa also faces a host of issues that stem from immense corruption and weak rule of law.

Casual observers tend to believe that because Blacks have assumed positions of power in South Africa since 1994, the country’s Black majority is now better represented and considered in the country’s governance and policy-making processes. This might not be the case, however, and investigating various components of The Economist’s latest ZA article in more detail can serve to illustrate why.

As such, I am pulling quotes from “Bad cop, bad cop routine” and exploring them in greater detail in an attempt to paint a more complete picture of South Africa’s complex and interconnected social, economic, and political issues.

1. Townships are some of the most enduring and visible remnants of Apartheid in South Africa. Over 3 million Blacks, Indians, and Coloureds were forced to move from centers that Whites retained to barren and segregated regions. Today, although South Africa has been a democracy since 1994, it is one of the most unequal countries in the world (number 2 according to the CIA World Factbook). Accordingly, townships remain underdeveloped and are growing as the number of people in poverty increases. Crazily, the nation’s Gini coefficient is higher now than it was under the Apartheid regime. Although the African National Congress (ANC) is in power and its president is Black (Zulu), inequality in South Africa is still highly correlated with race. The size, ethnic composition, and socioeconomic indicators of townships in South Africa speak to the unfinished and faltering efforts of the government to combat Apartheid’s legacy.

2. Violence in South Africa is neither common solely between the government and its citizens nor just within the population. In the past several years there have also been frequent incidents of violence targeting immigrants from other African countries. A wave of attacks in 2011 was preceded by more serious xenophobic violence in 2008 and 2009. A comprehensive report by the UN’s Special Rapporteur on the human rights of migrants can be found here; it points to a lack of legal and government enforcement mechanisms as a key hurdle to addressing this serious human rights issue. That there is no reference to violence against African immigrants in this article is a surprising omission, even when considering the word limit.

3. South Africa’s telephone system is “the best developed and most modern in Africa”. I lived in South Africa in 2009-2010 and worked in Khayelitsha (the largest township in the Cape Town area); the reach of cell phones to corners of the country where even running water and regular electricity are absent astounded me. Of course, the potentially transformative power that mobile phones hold for countries in Africa is a trendy topic in development and business fields. But the prevalence of cell phones in ZA, combined with the fact that bystander footage is what sparked public outrage over this scandal, prompted me to wonder whether citizen journalism is also making strides in the context of weak rule of law and lack of government accountability. Interestingly, although I found a few more recent comments regarding NGOs that have initiated citizen journalism programs, the few academic or journalistic endeavors encouraging the phenomenon are out of date. The lack of investment in the topic may have something to do with the fact that despite prevalent mobile phone usage, internet penetration in ZA is relatively low and the cost of accessing the web is prohibitive. It is one thing to text on a Nokia and quite another to be connected to a vast network like the Internet through a device. Further, limits to internet access are connected to socioeconomic status in South Africa as they are in many other countries.

4. Regular violence on the part of the SAPS is shocking. Last year, police opened fire at an illegal strike of mine workers, killing 34 in one of the worst death tolls in violent protests since Apartheid ended. It is difficult to fathom this degree of police brutality in a democracy, and it speaks to larger issues about the lack of ability or will on the part of the government to do more to protect its citizens. Indeed, although the murder rate is falling according to official numbers, many South Africans do not believe the figures. Gun ownership is high, rape is incredibly common (one survey had 37.4% of men admitting to rape), and murders like that of white supremacist Eugene Terreblanche in 2010 highlight the race-inspired violence that still plagues the country.

5. The group was composed of South African lawyers, activists, and other professionals concerned with corruption in South Africa. It had indicted several high-ranking officials in the African National Congress (the party that has been in power since 1994) and was dissolved by an ANC-dominated parliament after accusing Jacob Zuma (now ANC and South African president) of corruption. Since then, no independent corruption monitoring body has been reinstated.

6. Five senior criminal-justice posts have gone unfilled for more than a year. The ANC’s corruption and laziness are blatant; police impunity is only one symptom of them. To many casual observers, ZA is

Reading “Bad cop, bad cop routine” provides a first glimpse into some of the mounting issues that South Africa faces right now, but it fails to elucidate how these are both a product and a cause of the government’s depressing failures that have resulted in corruption and ineptitude.


A Look at Occupy Boston’s Mailing Lists

Part of an ongoing project by the author to describe Occupy Boston’s mailing lists using network analysis.

Interactive Network Graph of Occupy Boston's Mailing Lists linked by their Shared Users

The Story
Can we learn something about a social movement by looking at the digital tools it uses to organize? The Occupy Movement was defined as much by its highly visible occupation tactic as by its use of new digital media to organize and mobilize. The success of the movement was really to inject new language into our society about inequality. Think the 1% and the 99%. This was achieved through a sustained campaign of media activism. Language was developed to describe the inequalities between the common man and the rich, embodied by Wall Street — the perpetrators of the recent global financial crisis, and various forms of media were created to get the message out. The occupations then served to keep that message in mainstream media as they attracted sustained coverage themselves for both good and bad reasons.

We see this play out in the network of mailing lists. Occupy Boston’s general Media list had the most messages posted in it during the period September 2011 – October 2012, consistent with what we would expect from a movement focused on media activism. In terms of expansiveness of user participation, the Ideas mailing list takes the crown; it is where much of the early intellectual labor on defining Occupy Boston’s mission and direction was hashed out. In the data, we also see a lot of overlap amongst the mailing lists. All but one list (OB Updates, which was a unidirectional announcement list) share many active users with other lists. The median degree is near 20 (with 22 lists, the maximum possible is 21), which is almost a perfect mesh network. This suggests that these public mailing lists, although sometimes dedicated to very specific themes or “committees,” enjoyed a lot of interconnection. Between the major mailing lists (seen as an outer ring on the network), which are more general interest, we see 100+ shared users on their mutual edges. This number drops off for some of the more niche mailing lists, where the shared users could represent a few key organizers or overzealous mailing list participants. A more qualitative study is needed to tell the rest of this story.

Quick Statistics

  • Mailing Lists: 22
  • Total Messages: 36,303
  • Total Users: 922 (unique email addresses)

Distribution of Total Users and Messages across Mailing Lists (left y-axis is Messages scale; right y-axis is Users scale)

How I Made the Network Graph
I downloaded the mailman archives from September 2011 to October 2012 from Occupy Boston’s public mailing lists, i.e. those that do not require moderator access to join. I wrote a Python script to parse the archives, which are in a standard mbox format, into an SQLite database. I devised a schema with a standard set of ids for mailing lists and individual users, and used these ids to extract a network of users shared among different mailing lists with a simple SQL query, storing resultant nodes (mailing lists) and edges (shared user relationships) in CSV files.
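The pipeline described above can be sketched roughly as follows. This is not the original script: the table and column names, the helper functions, and the choice to identify users by a lowercased `From` address are all assumptions made for illustration. It also assumes the `From` header holds an ordinary email address; archives that obfuscate addresses would need an extra cleanup step.

```python
import csv
import mailbox
import sqlite3
from email.utils import parseaddr

# Hypothetical schema: one id per mailing list, one id per user (email address).
SCHEMA = """
CREATE TABLE IF NOT EXISTS lists (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, email TEXT UNIQUE);
CREATE TABLE IF NOT EXISTS messages (
    id INTEGER PRIMARY KEY,
    list_id INTEGER REFERENCES lists(id),
    user_id INTEGER REFERENCES users(id)
);
"""


def open_db(path=":memory:"):
    """Open (or create) the SQLite database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def _get_id(conn, table, column, value):
    # table and column are fixed strings from this module, never user input.
    conn.execute(f"INSERT OR IGNORE INTO {table} ({column}) VALUES (?)", (value,))
    row = conn.execute(f"SELECT id FROM {table} WHERE {column} = ?", (value,)).fetchone()
    return row[0]


def add_message(conn, list_name, sender):
    """Record that `sender` posted one message to `list_name`."""
    list_id = _get_id(conn, "lists", "name", list_name)
    user_id = _get_id(conn, "users", "email", sender.strip().lower())
    conn.execute("INSERT INTO messages (list_id, user_id) VALUES (?, ?)", (list_id, user_id))


def load_mbox(conn, list_name, mbox_path):
    """Parse one mbox archive and store a row for each message with a sender."""
    for msg in mailbox.mbox(mbox_path):
        sender = parseaddr(str(msg.get("From", "")))[1]
        if sender:
            add_message(conn, list_name, sender)
    conn.commit()


def shared_user_edges(conn):
    """Return one (list_id, list_id) row per user active on both lists."""
    query = """
        SELECT a.list_id, b.list_id
        FROM (SELECT DISTINCT list_id, user_id FROM messages) AS a
        JOIN (SELECT DISTINCT list_id, user_id FROM messages) AS b
          ON a.user_id = b.user_id AND a.list_id < b.list_id
        ORDER BY a.list_id, b.list_id
    """
    return conn.execute(query).fetchall()


def write_edges_csv(edges, path):
    """Write the edge list with the Source/Target headers Gephi looks for."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Source", "Target"])
        writer.writerows(edges)
```

The query returns one row per shared user rather than a pre-computed weight, so a pair of lists with 100 shared users appears 100 times; a tool that merges parallel edges can then turn the repeats into a single weighted edge.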

I imported the nodes and edges files into Gephi after hand-editing their column names to conform to Gephi’s standard. Gephi automatically aggregated the edges between nodes to create weighted edges representing the total number of shared users. I adjusted the layout in Gephi to represent the weighted edges using different thicknesses. The nodes were scaled by total users active in each mailing list, an attribute extracted from my database, and their color was scaled on a pale to dark red spectrum according to the total number of messages during the period of analysis, also extracted from the database. I used the ForceAtlas 2 layout algorithm, which forces the most central nodes out of the center for easier comprehension, and then hit the graph a few times with the Expansion layout algorithm to give extra space between nodes.

Using the Sigmajs Exporter plugin, I exported the network so that it could be viewed on the web as an interactive visualization. I customized the default JavaScript and CSS in several ways to display the network graph more clearly. In the config.json file, I manipulated the graph properties to create greater contrasts between node sizes and edge weights, and adjusted the label threshold under drawing properties to ensure all nodes were labeled. I modified the sigma.js defaults for edge color by forcing edges to be a standard grey rather than the color of their source node. This corrects for what is actually an undirected network (shared user relationships are mutual) being interpreted as directed. Finally, I forced the “Information Pane” to display the edge weights (shared number of users) between the active node and its neighbors, next to their listed names.

Should I demand more from my daily horoscope?

Most mornings, I do the puzzles in the Boston Globe’s g section. These happen to share a page with the horoscopes, which I like to read aloud to anyone within earshot. However, over time, I grew suspicious. The same predictions, even the same turns of phrase, seemed to pop up again week after week. Someone always needed to keep a careful eye on their assets. Love was always “on the rise” for one person or another.

I wanted to find out if I was simply being sensitive, or if there really was meaningful repetition in the predictions.

THE DATA: http://www.uclick.com/client/bos/el/

The first drawback to starting with a question rather than a data set: I assumed that the online archive of horoscopes would be more robust. Unfortunately, I discovered that only the last two weeks (i.e., March 7 – 19) are accessible. I decided to go forward with my smaller set anyway — because I was still curious, and because two weeks seemed sufficient to at least explore my repetition hunch.

I wrote a small scraper to pull down all of the existing ‘scopes. (Shout-out to Harvard’s CS171 Visualization course and to the pattern.web Python module.) The data was then split two ways: by text alone, and by Zodiac sign.

If I wanted to do this more rigorously, I would need a good algorithm to suss out all possible repeating phrases. As it stands, I wrote a quick and dirty program to sort the text by individual words, sentences, and phrase pairs.
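One simple way to look for repeated phrases is to count word n-grams within each sentence. The sketch below is not the quick and dirty program mentioned above; the function names, the sentence-splitting rule, and the default phrase length are assumptions chosen for illustration.

```python
import re
from collections import Counter


def split_sentences(text):
    """Lowercase the text and return each sentence as a list of words."""
    parts = re.split(r"[.!?]+", text.lower())
    sentences = [re.findall(r"[a-z'’]+", part) for part in parts]
    return [words for words in sentences if words]


def ngrams(words, n):
    """Return every run of n consecutive words as a single string."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def repeated_phrases(horoscopes, n=4, min_count=2):
    """Count n-word phrases across all horoscopes and keep the repeated ones."""
    counts = Counter()
    for text in horoscopes:
        for words in split_sentences(text):
            counts.update(ngrams(words, n))
    return [(phrase, c) for phrase, c in counts.most_common() if c >= min_count]
```

Calling `repeated_phrases(texts, n=5)` on a list of horoscope strings returns the five-word phrases that occur at least twice, most frequent first. Running it for several values of `n` would surface both short and long repeats.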

The most common phrases were…
love is on the rise 7
love is highlighted 5
deception is apparent 4
romance is in the stars 3
love and romance are on the rise 2
love and romance are highlighted 2
love is in the stars 2
love and romance are in the stars 2

Then, using Many Eyes, I ran my text through a few different word visualizations. Many Eyes is an IBM site that allows novice data journalists to play with information in a fairly easy way.

Overall, it helped a lot that I knew exactly what I was looking for when I started this assignment. (Does a data story start with the data? Or with what you want out of the data?) However, I feel more like I excavated some fun facts than an actual “story.”

Regardless, I really want to scrape a whole year of horoscopes now.

Boston Restaurants: Perception vs. Reality

By Rochelle Sharpe and David Larochelle

Yely’s Coffee Shop in Jamaica Plain gets rave reviews for its authentic Dominican food.

“Best Latin take-out food in the area,” one Yelp writer gushed, urging people to “come here if you want yummy pork or juicy rotisserie chicken.” Others described the meals as “incredibly delicious” and “high-quality,” with one fan declaring: “I will go far, far out of my way for a plate of Yely’s rice with chicharones. . . Nowhere else measures up.”

But for Boston’s health inspectors, the restaurant has a more dubious distinction. Yely’s has had its business permit suspended more often for food safety violations than any other restaurant in the city during the past six years, with inspectors complaining of food preparers not washing their hands after coughing into them and dead mice decaying in traps on the kitchen floor.

Yely’s has lost its license five times since 2007, edging out My Thai Café Vegetarian & Bubble Tea Bistro in Chinatown, which had its license suspended four times, and Navarette Restaurant, a barbeque place in Dorchester, which lost its license three times. None of the other 254 restaurants that had their licenses temporarily suspended in Boston were shut down more than twice, according to city health inspection records.

Year after year, health inspectors found similar problems at Yely’s. In 2008, the first time the restaurant lost its license, inspectors complained about substandard equipment and widespread sanitation problems. “No hot water throughout,” the 2008 report said. “Owner is working on hand sink in basement.”

Workers were using wooden sticks to stir food, inspectors said, and thermometers to detect improper food temperatures were broken or missing. Standing water covered the floor, where there were broken tiles and sheets of cardboard being used as mats. “Food handler cough into hands and not wash hands or change gloves,” one inspector wrote, urging the restaurant owner to get safety training for all employees.

But the 2012 report found similar issues. Inspectors found shelves in the food cooler covered in rust and grime. “Observed employee serve food, handle money, answer phone, without stopping to wash hands,” the report said. As for mice problems, it said: “remove dead mice observed attached to four different traps in storage room in basement. . .Rodent droppings observed on floor of storage room in basement, on floor of walk-in cooler, on floor around hand sink . . .on counter tops, shelves, and various other locations.”

At My Thai Café and Navarette Restaurant, meanwhile, inspectors found a variety of sanitation and cooking problems. The city cited the Thai restaurant for improper refrigeration, leaving soiled dishes in the sink overnight, and workers smoking and leaving cigarette ashes in the kitchen sink. At Navarette, they discovered pork and beef that were only halfway cooked and a risk of food cross-contamination because workers were storing raw fish and pork above cut ham and cheese.

Overall, restaurants in Chinatown were shut down most often, with 14 restaurants in the tiny section of town having their licenses suspended during the past six years.

Partisanship in Congress and Time on Bench

DW-Nominate is a way to measure partisan behavior in Congress. I was interested in whether the current state of polarization is different from past eras, so I built two scatterplots showing the history of both houses of Congress:

Partisanship in the Senate.

Partisanship in the House.

Also, I was curious if Supreme Court justices are being appointed younger or serving longer than in the past, so I visualized that, too.

All of this was made with D3.

MIT Police Log for Feb.12–Mar.12 2013 in 6 seconds

For this week’s data journalism assignment, I chose to look at a month’s worth of police logs from the MIT police department. The data in the form of PDFs can be found at http://web.mit.edu/cp/www/crimlog.htm. I put the data into a less awful format, which you can find at https://docs.google.com/spreadsheet/ccc?key=0AlhEOMxfxhHtdENRakF4WUdPZnNsR3p6YzFEWGtTR1E&usp=sharing.

With the data, I created a short video using Vine.

Romania 2012: A Summer of Chaos, James Bond and Die Hard?

Scandals. Corruption. Plagiarism. Indictment. That’s the stuff of a good story. In the summer of 2012, it was also the stuff of what Princeton professor Kim Lane Scheppele calls a Romanian “political crisis”.

Elena and I tried to make sense of the hundreds of memes that appeared that summer on Reddit, Google and Elena’s Facebook feed by visualizing the memes using D3, the JavaScript library.

Below is a screenshot “teaser”. To see what we came up with, check out the full story.

A Comparative Experiment in Mapping the News

For this week’s data journalism assignment I wanted to experiment with a data set that I’m already working with for my research at the Center for Civic Media. We are trying to see if we can extract place information from news articles to map where the news happens, which places news sources pay more or less attention to, and what that coverage looks like. Where on the globe does the BBC vs. the NYT vs. the Huffington Post cover? In what proportions? Where don’t they cover? How do they cover those places – with what words and frames?

Check out the visualization here

To answer these questions, the first problem we are facing is a technical one. How do you get reliable place information from unstructured text like news articles? I’ve been starting to evaluate different ways of doing this using a data set of 100 articles each from the New York Times, the BBC and the Huffington Post taken randomly from the same time period. I wanted to use this week to go deeper with one promising technology for this work called CLAVIN, a Java-based geoparser that integrates with Stanford’s well-known Named Entity Recognizer. I also wanted to experiment further with D3, a JavaScript library that makes beautiful infographics, maps and diagrams (here’s a link to the one I modified for this viz).

This may or may not be backwards (from the standpoint of journalism) but I wanted to create the visualization as a way of exploring the data set to see if there is a story. My idea was to try to create a kind of network map of which countries get mentioned in relation to which others and at what frequency. And additionally, to be able to compare across the three news sources. Which countries get mentioned together more frequently? Does the country of origin of the news source affect the country pairs? Which countries get relatively very few mentions? Does grouping country mentions like this show us anything we don’t already know?

The way the visualization works is that if two countries are mentioned in the same article they get a link. So if an article mentions the US, Canada and Mexico then there is a link between US-Canada, Canada-Mexico and Mexico-US. Links also have a weight. For example, if an additional article mentions the US and Mexico, then the Mexico-US link gets a weight of 2.
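The linking rule above can be sketched in a few lines. This is an illustration rather than the code behind the visualization: it assumes each article has already been reduced to a list of country names by the geoparser, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def country_links(articles):
    """Count, for each pair of countries, how many articles mention both.

    `articles` is a list of country-name lists, one list per article.
    Pairs are stored in alphabetical order so US-Mexico and Mexico-US
    are the same undirected link.
    """
    weights = Counter()
    for countries in articles:
        # set() ensures a country repeated within one article counts once.
        for pair in combinations(sorted(set(countries)), 2):
            weights[pair] += 1
    return weights


def filter_links(weights, min_weight):
    """Keep only the links that occur at least `min_weight` times."""
    return {pair: w for pair, w in weights.items() if w >= min_weight}
```

With the example from the text, `country_links([["US", "Canada", "Mexico"], ["US", "Mexico"]])` gives the Mexico-US link a weight of 2 and the other two links a weight of 1.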

You can mouseover the country names in the visualization to see the breakdown of how many links a country has to other countries. This is not all that meaningful for countries like the US, the UK and France which have an abundance of links. But it is more interesting for countries like Puerto Rico, Syria or Venezuela. For example, Syria is most mentioned in the BBC in relation to the US, Russia, UK and France. In the NYT, it is most referenced in relation to the US, Russia, China and Egypt. And in the HuffPo, Syria is mentioned in relation to the US, Russia, China and then tied for fourth place are Iran, Venezuela, Lebanon and Italy. So one place a network visualization like this might help us is in framing who the state actors are that a news source thinks matters in a particular on-going story like the Syrian Revolution/Civil War.

New York Times - Countries with 15+ links between them

BBC - Countries with 15+ links between them

HuffPo - Countries with 15+ links between them

And while this network visualization is beautiful (even my two-year-old son said “oooooh the sun”), it’s hard to discern meaningful patterns with a lot of blue lines crossing everywhere so I created a link number slider to be able to filter them. At higher numbers like 15+ links (where only country links which have occurred 15 times or more will appear), some interesting differences and similarities across the media sources appear. For all three sources the highest number of links are between the US, the UK and Canada with some countries in Western Europe (France, Italy) also in the mix. All three sources show the highest number of links between the US/Canada and Western Europe and lack equivalent linkages for countries in South America, Asia, and Africa. At the 15-link threshold, the BBC doesn’t show a single link to or from any country on those continents. HuffPo and the NYT show links from the US to Argentina and Brazil and both show several links from the US to countries on the Asian continent. But of the three sources at this threshold only the NYT shows a link between the US and an African country: Egypt.

So what is this telling us? One thing is, perhaps unsurprisingly, that national news sources tend to focus on themselves. Most of the large numbers of links are between the news source’s own country and other nations. There are also high linkages between the home country and where that country is at war (e.g. Iraq & Afghanistan for the NYT and the HuffPo) or where it is thinking of using military force (e.g. Syria or Iran, which show up for HuffPo at the 15-link threshold). Economic power seems to matter as well. In a 2003 paper called Global Attention Profiles, Ethan Zuckerman outlined a model of attention whereby media attention from international news organizations correlates with a country’s GDP. It would seem, at least at first glance, that the linkage model of attention follows this logic as well.

This is an experiment in data journalism

I’d like to just bookend this with a disclaimer that this is not meant to be conclusive in any way. First of all, the data sample (100 articles per source) is too small and covers too short a period of time to actually talk about long-term patterns of comparative geographic media coverage. Secondly, we still need to evaluate the performance of the geoparser for extracting places from news articles. There is certainly at least 10-20% error in how it is identifying places in the articles (which may account for why Italy shows up as having so many links? What is going on with that?). And finally, this was a way to experiment with turning articles into networks of geographic places. I’m still not entirely sure this is a useful methodology but I wanted to see what came out of the experiment to help assess whether it’s useful or not (and would love to hear thoughts and feedback to that effect).

My question for the class is – does it count as data journalism if you experiment with visualizing your data in various ways with various methodologies in order to dig up a story? My guess is that data journalism could happen and unfold in a variety of ways. It seems legit when working with large datasets to experiment with ways to explore that data in a preliminary way that may or may not lead to a narrative, but I’d love to talk more about that unfolding process.



Americans Against Food Taxes

Wordle: Americans Against Food Taxes

The Americans Against Food Taxes website looks completely wholesome – almost literally depicting a Mom and Apple Pie approach to political argument. The home page features a picture of a diverse group of smiling Americans as well as a lunch box that contains grapes, a banana, and yes, an apple. The home page headings contain virtuous messages: “Smart Choices for Kids,” “Education Not Taxation,” and “Healthy Economy.” Who could be against these?

But, in fact, Americans Against Food Taxes is a front group for the American Beverage Association, and its website uses many advertising ploys that can be incredibly persuasive.

First, the group’s name is misleading. It implies that the group is against food taxes of all sorts, exaggerating the scope of its mission. But in fact, the text only discusses taxes on sugar-sweetened beverages. This tactic may draw in people who are concerned about taxes overall. They will be less likely to think about the much narrower issue of whether there may be health benefits to taxing sugar-laden drinks.

The website’s home page declares that 95,993 people have signed up to be part of the group, a classic example of social norms marketing. This number may convince individuals that it is socially acceptable – and desirable – to join this group. On the About Us page, casual readers may be left with the impression that this is a wholesome group of folks. The group says it is composed of “responsible individuals, financially strapped families, small and large businesses in communities across the country.”

But the Rudd Center for Food Policy and Obesity analyzed the 95,993 people who are members and discovered that 93% of them were somehow associated with food and beverage industry groups – and 83% had some affiliation with Coca-Cola.

“Is it really Americans Against Food Taxes or just Food Industry Against Food Taxes?” Rudd Center researchers have asked.

I was intrigued by the language used on the home page and so put it in a word cloud to see what words were used most often. It can be seen here:

The wordle shows that words like Americans and taxes are quite prominent, as well as different variations of the word “healthy.” Balanced, obesity, exercise, smart, and children also appear repeatedly. Yet the word “sugar” never appears and the word “soda” appears less frequently. The overall impression created by the words used can distract readers from the real mission of the group, which is to fight soda taxes.

The FACTS section of the website contains more pointed distortions of policy. It notes that the beverage industry was key in creating school beverage guidelines and points out that nearly 80% of schools are now in compliance with them.

In fact, groups like the American Public Health Association have criticized these guidelines, saying they’ve left gaping loopholes that allow kids to buy sugar-sweetened sports drinks and other unhealthy beverages.

Rochelle Sharpe