What does Hillary Clinton’s Inbox look like?

I, too, am tired of hearing about Hillary’s use of a private email server. On the other hand, it led to a pretty neat data set to unpack: a dump of emails she’s sent and received.

I played around with this data set a bit and was particularly interested in how different groups of people interacted with Hillary. Did men use shorter sentences than women, for example? Did her staffers send one-liners versus ambassadors who sent lengthy emails? Did she have interesting relationships with people we might not be familiar with?

I didn’t get a chance to answer all of these questions, but I ended up being interested in the way words in her email were clustered, and decided to come up with a visualization based on that.

For a simple representation to start, I created a scatter plot visualization using mpld3, which creates interactive matplotlib graphs for the browser. It’s clunky to navigate (you need to switch to a zoom-in mode, drag a rectangular portion of the graph to zoom in on, then switch again to the cursor mode to scroll over words), but it’s interesting to see which words appear together for a first step.

Isn’t it interesting that “bipartisan” appears well outside the main cluster of words?


Lesson learned along the way: visualizing text is hard. I found that the norm for text visualizations out there, such as word clouds or circle packing, was reductionist for some of the data I have, like topic models or k-means clustering.

While I didn’t create data visualizations for some of the questions I posed earlier, I do have some statistics:

For males:

6187 sentences
83764 word tokens
10762 word types
7.78 average tokens per type
13.54 average sentence length
5.01 average word token length
7.34 average word type length
Hapax legomena (words that appear only once – an indicator of vocabulary usage) comprise 49.60% of the types

For females:

22845 sentences
369517 word tokens
30386 word types
12.16 average tokens per type
16.17 average sentence length
4.94 average word token length
7.84 average word type length
Hapax legomena comprise 49.76% of the types
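
For anyone curious how numbers like these can be produced, here is a minimal sketch in Python. It is not the code I ran for the stats above: the regex-based sentence splitter and tokenizer are simplifying assumptions, and the function name text_stats is made up for illustration.

```python
import re
from collections import Counter


def text_stats(text):
    """Compute simple corpus statistics for a string of text."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Naive tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    types = list(counts)  # distinct words
    # Hapax legomena: types that occur exactly once.
    hapaxes = [word for word, n in counts.items() if n == 1]
    return {
        "sentences": len(sentences),
        "tokens": len(tokens),
        "types": len(types),
        "tokens_per_type": len(tokens) / len(types),
        "avg_sentence_length": len(tokens) / len(sentences),
        "avg_token_length": sum(len(t) for t in tokens) / len(tokens),
        "avg_type_length": sum(len(t) for t in types) / len(types),
        "hapax_share": len(hapaxes) / len(types),
    }


stats = text_stats("The cat sat. The dog sat too!")
```

A real run would want a proper tokenizer (NLTK, for example), and the function assumes the text is non-empty, since the averages divide by the token and type counts.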

The Panama Papers

In reaction to the story about the Panama Papers which captured my attention (like I’m sure it did everyone else’s), I created a short slideshow as a companion piece to this NPR article about the topic.

This was an exercise in keeping my companion piece brief: I wanted to highlight the actors in the story and what the consequences were for offshore accounts / shell companies. My call to action wasn’t very direct, though. Most of the implicated politicians are overseas, and journalists are still pursuing clients of Mossack Fonseca, so I urged people to vote this upcoming election cycle for politicians who are dedicated to tax reform.

Posted in All

AlphaGo and Machine Learning

This week, I decided to take the “hypertext is your friend” part of the assignment to heart. I, like Jorge, used FOLD to create my companion piece, which I felt was most intuitive for contextualizing information, especially with a topic like machine learning, which has a lot of moving parts.

I decided to create a companion piece for AlphaGo, the AI that defeated world Go champion Lee Sedol 4-1 about a week ago. I also attempted to give some context on another AI technology that’s gotten a lot of press — Microsoft’s Tay — and argued that we’re further from that science-fiction-robots-take-over-the-world reality than some think.

Check out my story on FOLD!


Turangalîla-Symphonie

I initially planned on reporting Ohio’s primary election results through Tweets, Facebook posts, etc. Then, I did a little soul-searching tonight and realized that politics was the last thing I wanted to talk about, let alone report on. Instead, in a last-ditch effort, I decided to dig through Instagram and Facebook to report on the New York Philharmonic’s performance of Messiaen’s Turangalîla-Symphonie. I also wanted to try my hand at audio, which seemed like a particularly well-suited medium for a piece about classical music.

Lessons learned: finding actualities is hard for events like classical music concerts. There isn’t a lot to go on, maybe because of the nature of classical music concerts (recording is prohibited) and the demographic of classical music concert-goers. In this particular case, it was also difficult to find dissenting opinions (everyone really loved the concert). In any case, here’s my audio recording:

 

To accompany, here are a few Instagram images, a few tweets, and a video of the performance mentioned at the end of the piece:

 

Here’s the ondes Martenot, the electronic instrument in the piece:

#latergram #nyphil #messiaen #turangalila #ondes #martenot

A photo posted by Taylor White (@tdwmdmfa) on

 

These next two links are from the New York Philharmonic: the first is an Instagram post from one of the librarians who compiles sheet music for the musicians (and in this case got to sit in on a rehearsal), and the second is the video of Quartet for the End of Time, whose audio I incorporated into my SoundCloud piece.


Messiaen’s “Quartet for the End of Time” at the Temple of Dendur

We’re live at The Temple of Dendur at The Metropolitan Museum of Art, New York. You’re watching Messiaen’s Quartet for the End of Time with Music Director Alan Gilbert on violin, Principal Cello Carter Brey, Principal Clarinet Anthony McGill, and Artist-in-Association Inon Barnatan on piano. NOTE: If you don’t hear sound, try going to around 1 min. 40 secs. in. #messiaenweek

Posted by New York Philharmonic on Sunday, March 13, 2016

Canva

I am not a designer. At all.

Luckily, Canva allows me to compensate for what I lack in design sense. It’s a fairly easy-to-use tool (even if it is restrictive).

Canva is a graphic design tool that uses a drag and drop UI to allow you to create anything from blog graphics to posters. The essence of it is that you choose a theme, add elements to that theme (like a grid structure, lines, icons and charts) to create an infographic.

Pros: It’s easy to use and it’s sleek. The icons are, for the most part, designed well and you can add your own images to build on what Canva provides.

Cons: If you’re looking to represent percentages that aren’t multiples of 25% in a chart, good luck. Canva, for the most part, provides only quarter increments for its graphs (so stats have to be 25%, 50%, 75% or 100%), and the bar graph sizing isn’t great: you essentially have to guesstimate the proportion of the bars you use.

If you’re willing to forgo a perfectly accurate data representation (for the bar graphs) and can live with using a pie / circle chart for quarter percentages only, Canva is a useful tool to display information beautifully. You can even be creative and forgo the typical pie chart to display stats in a more innovative format. Canva has several templates that provide decent inspiration, like this one:

Screen Shot 2016-02-17 at 4.56.15 PM

If you have any questions about using Canva or how I created my graphics for my media diary, let’s chat!

SQL (the quick and dirty way)

Navigating SQL is daunting at first, but completely doable, even if you have little programming experience. I’ll walk through how I ran SQL queries against my Chrome browsing data and the tools I used to do it.

Finding the SQL database

First, make sure you know where the SQL database is on your computer, and make sure you have the appropriate permissions to access it. For example, when I was accessing my Chrome browsing data, I needed to get to this location on my computer:
~/Library/Application\ Support/Google/Chrome/Default/History

Where “~” means my home directory and “History” is the name of the SQL database file.
If this is daunting, don’t worry! This is what you’d type into a command line. If you don’t know what a command line is, either Google it or come talk to me. I’d be happy to walk you through it. Note: if you’re looking to access your Chrome history database, make sure you’ve quit out of your Chrome browser first. I learned this the hard way.

Notably, the Chrome history database is a SQLite file. SQL is the query language; SQLite is one database engine that understands most of it and stores the whole database in a single file. For our intents and purposes, this is fine: the queries here are ordinary SQL. Just be sure you know what type of database file you’re working with before you start.

How to access and browse your database

So you have your database file. How do you access it and browse through it? The tool I use is sqlitebrowser, made specifically for SQLite. You can open a database file or even create your own. Once you open your database, it looks something like this:

Screen Shot 2016-02-17 at 4.33.18 PM

You can browse through the data row by row, view the structure of the database and execute SQL commands. The “table” dropdown refers to all the different tables in a database; for the Chrome history example, there’s a table for downloads, a table for URLs, and a table for visits.

It’s great that you can view all this information, but you’re probably also looking to make some meaning from this. To extract rows that are relevant, you’ll want to write a SQL query. I’m not going to go into the details of writing SQL queries here, but I do recommend W3Schools’ tutorial for the quick version, which should be good for most basic queries.
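
If you’d rather not install a GUI, the same table list can be pulled with Python’s built-in sqlite3 module. This is a minimal sketch, not what I actually used: list_tables is a made-up helper name, and it reads from a temporary copy of the file so the original is never locked or modified.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def list_tables(db_path):
    """Return the table names in a SQLite file, reading from a temporary copy."""
    with tempfile.TemporaryDirectory() as tmp:
        # Work on a copy so the original file is never locked or modified.
        copy_path = Path(tmp) / "copy.sqlite"
        shutil.copy(db_path, copy_path)
        conn = sqlite3.connect(copy_path)
        try:
            # sqlite_master is SQLite's built-in catalog of tables and indexes.
            rows = conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            ).fetchall()
        finally:
            conn.close()
    return [name for (name,) in rows]


# Example call, using the Chrome path from earlier in this post:
# history = Path.home() / "Library/Application Support/Google/Chrome/Default/History"
# print(list_tables(history))
```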

In general, your queries will follow a structure that’s something like this:
select *
from "urls"
where "last_visit_time" > 13099253131722513
and ("url" like '%facebook%'
or "url" like '%twitter%'
or "url" like '%github%'
or "url" like '%linkedin%')

“select” refers to which columns you’d like to select from the table (I just choose to display all columns by default), “from” refers to which table you’re using and “where” acts like a conditional: if the condition is true for a row, that row is included in the results. The “%” characters are wildcards that match any run of characters, so '%facebook%' matches any URL containing “facebook” (the other pattern characters are easiest to Google as needed). The parentheses matter, too: “and” binds more tightly than “or”, so without them the time filter would apply only to the Facebook pattern.
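
One thing worth checking in any query that mixes “and” with “or” is how the conditions get grouped, since “and” binds more tightly than “or”. Here is a small sketch that shows the difference, using Python’s built-in sqlite3 module and a throwaway in-memory table. The rows and timestamps are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (url TEXT, last_visit_time INTEGER)")
conn.executemany(
    "INSERT INTO urls VALUES (?, ?)",
    [
        ("https://www.facebook.com/", 200),  # recent, matches
        ("https://github.com/", 50),         # old, matches
        ("https://example.com/", 300),       # recent, does not match
    ],
)

# Without parentheses this reads as: (recent AND facebook) OR github.
loose = conn.execute(
    "SELECT url FROM urls WHERE last_visit_time > 100 "
    "AND url LIKE '%facebook%' OR url LIKE '%github%' ORDER BY url"
).fetchall()

# With parentheses this reads as: recent AND (facebook OR github).
strict = conn.execute(
    "SELECT url FROM urls WHERE last_visit_time > 100 "
    "AND (url LIKE '%facebook%' OR url LIKE '%github%') ORDER BY url"
).fetchall()

print(loose)   # includes the old GitHub row
print(strict)  # only the recent Facebook row
```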

Below is what’s returned when I run the SQL query I wrote above:

Screen Shot 2016-02-17 at 4.39.52 PM

You can also group together results (like I could’ve grouped by “url” to see how many of each type of URL I visited) and sort by a column.
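
Here is a rough sketch of what grouping and sorting could look like, again run through Python’s sqlite3 module against a toy in-memory table. The table contents and the site labels are assumptions made up for illustration, not my real history.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (url TEXT, visit_count INTEGER)")
conn.executemany(
    "INSERT INTO urls VALUES (?, ?)",
    [
        ("https://www.facebook.com/", 5),
        ("https://www.facebook.com/messages", 3),
        ("https://github.com/", 4),
        ("https://twitter.com/", 1),
    ],
)

# Label each row with a site name, add up visits per site, biggest first.
query = """
select case
         when url like '%facebook%' then 'facebook'
         when url like '%twitter%'  then 'twitter'
         when url like '%github%'   then 'github'
         else 'other'
       end as site,
       sum(visit_count) as visits
from urls
group by site
order by visits desc
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The “case” expression is there because grouping by the raw URL would give one group per distinct page rather than one per site.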

If you’re new to programming, this probably seems overwhelming, but I definitely think reading the tutorial and just playing around with some SQL queries will help you get the hang of it. I learned basic SQL by having a test database and running queries to figure out what did and didn’t work, as well as how syntax works.

If you have any questions about the post, about databases or about SQL, please reach out! I’d be happy to chat. 🙂

Sravanti’s Media Diary

My media journal started off with me meticulously detailing when, where, and what I was consuming, as I consumed it. I was idealistic in assuming I could do this for a full week with no problems — by the weekend, I found myself consuming media left, right and center and forgetting to record it.

Luckily, I had my browser history to pull from to fill in the gaps and I found some interesting — although not entirely surprising — results. Yes, I spend more time than I should on Facebook. I also spent an extraordinary amount of time on LinkedIn and GitHub this week, which I thought was interesting. Upon reflection, though, this makes sense: 1) I’m job-hunting and 2) I use GitHub for one of my other classes.

I also found my media consumption to center around a few events — I tend to find a subject and read lots about it, rather than read about a large breadth of topics. This week’s topics were dominated by Kanye West and Gilmore Girls (as tempted as I was to hide certain browsing details, I kept them in).

I combed through my Chrome browser history, taking a look at the history file on my computer, which is stored in SQLite format. I ran a few simple queries, like the one below, to get the percentage of my browser history that was social media related:

select *
from "urls"
where "last_visit_time" > 13099253131722513 -- timestamp from a week ago
and ("url" like '%facebook%'
or "url" like '%twitter%'
or "url" like '%github%'
or "url" like '%linkedin%')
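
That long number is a timestamp. Chrome stores visit times as microseconds since January 1, 1601 (UTC), so a bit of conversion is needed to go back and forth between it and a readable date. A minimal sketch, with helper names that are made up for illustration:

```python
from datetime import datetime, timedelta, timezone

# Chrome's history timestamps count microseconds from this moment.
CHROME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def chrome_time_to_datetime(microseconds):
    """Convert a Chrome history timestamp to a timezone-aware UTC datetime."""
    return CHROME_EPOCH + timedelta(microseconds=microseconds)


def datetime_to_chrome_time(moment):
    """Convert a timezone-aware datetime to a Chrome history timestamp."""
    delta = moment - CHROME_EPOCH
    # Integer arithmetic avoids floating-point rounding on such large values.
    return (delta.days * 86400 + delta.seconds) * 1_000_000 + delta.microseconds


cutoff = chrome_time_to_datetime(13099253131722513)
print(cutoff.isoformat())
```

To build a cutoff for “a week ago”, something like datetime_to_chrome_time(datetime.now(timezone.utc) - timedelta(days=7)) should do it.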

From this, I was able to get the percentage of links that were Facebook, Twitter, etc. You’ll notice that this doesn’t measure time spent on each page. I actually thought that this was fine, as throughout the week, I noticed I’m not one to scroll through Facebook too much – I just take a look at the top stories on my feed and then exit out — and probably don’t spend longer than 2-3 minutes per visit unless I’m messaging a friend. If I really wanted to, I could estimate the time by calculating the number of times I hit Facebook in the past week multiplied by my average time spent per visit (e.g. 401 visits to Facebook –> 16 hours, which is terrifying to think about).
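
The arithmetic behind that estimate is simple enough to sketch. The function names are made up, and the 2.4 minutes per visit is an assumed average that sits inside my 2-3 minute guess and lands on roughly 16 hours.

```python
def percent_share(part, total):
    """Percentage of `total` accounted for by `part`."""
    return 100 * part / total


def estimated_hours(visits, minutes_per_visit):
    """Rough time spent: number of visits times average minutes per visit."""
    return visits * minutes_per_visit / 60


low = estimated_hours(401, 2)     # about 13.4 hours
high = estimated_hours(401, 3)    # about 20 hours
guess = estimated_hours(401, 2.4) # about 16 hours
print(round(low, 1), round(guess, 1), round(high, 1))
```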

Another aspect of media consumption I looked into this week was seeing where I found the articles I read. Did they come from Facebook? Twitter? I found that most of them came from Twitter and links shared from a friend through Slack.

Interestingly, I found that beyond my social media consumption, my media consumption is largely driven by email. About 50% of my media consumption came from email! Related, but not necessarily media (which, by my definition, was communication for an audience that wasn’t private) was that I spent a lot of time on my calendar, organizing and adding appointments with peers.

To track my mobile phone usage, I ended up falling back on my phone’s battery usage stats to determine which apps I used to consume media. This is obviously a skewed metric: Snapchat uses much more data to transmit photos/videos (and so, presumably, more battery) than Twitter, which is much more text based. I found that I used Snapchat, Spotify, Twitter and Facebook the most on my phone, which is consistent with what I thought my mobile media consumption would look like. I also read about 75% of articles on my phone compared to the 25% I read on my computer. Because I tend to use my computer to either code or write extensively (and use my phone in almost all other cases), this also makes sense.

But enough of SQL code and words — I wanted to try my hand at infographics, so I decided to put together a couple of short ones: one for my computer media usage, the other for my mobile media usage.

1 2

Notably, these graphics and my earlier discussion don’t track any media that wasn’t consumed on my phone or my computer. I did read one print publication this week, which was my school newspaper, The Wellesley News. I also listened to roughly 150 tracks on Spotify and 50 tracks on SoundCloud. With regard to TV consumption, I watched a couple of episodes of Mozart in the Jungle and the first four episodes of Billions.

Snapchat – Discover stories, daily stories

Snapchat, the temporary media-sharing app, has quickly become commonplace among news outlets, both through the “Discover” feature and by posting daily stories, much like a typical Snapchat user.

The “Discover” feature was released about a year ago, expanding the app beyond sharing photos and videos between personal networks of friends. “Discover” stories allow users to explore daily stories by publication outlets such as Vice, Mashable, National Geographic, Vox and most recently, WSJ. Snapchat Discover stories are structured as a slideshow which the user can swipe through. Usually, each panel is accompanied by a story, which the user can choose to scroll down to read and share with their friends (notably, with the normal drawing / emoji annotations that a user can do with any other snap they send).

What separates Snapchat Discover stories from other web-based posting is that in Snapchat, you can’t link to content anywhere else: the user is forced to consume the story within Snapchat without clicking on external links or enlarging photos. From my personal observations, there seem to be two camps of Snapchat publishing philosophies: one, use each slide for a separate story (like reading the headlines from a newspaper) or two, focus on a particular issue or moment and present different perspectives on that particular story. For example, WSJ tends to present stories in the first format while Vox takes the second approach, using slides to present infographics on a story or quotes from interviewees.

Lately, publishers have been taking to Snapchat to engage with users in a more typical Snapchat fashion by posting daily stories. Users can add publishers by username (for example, “npr”). This particular use of Snapchat lends itself to a more interactive experience — for example, NPR will often explicitly solicit feedback from users viewing a Snapchat story, and users can send photos, videos or text in response. For example, most recently, NPR posted to Snapchat about the Bernie Sanders / Hillary Clinton meme, asking if it was sexist and to reply by “snapping” back. In addition, reporters tend to appear in these stories in a very casual format and basically have conversations (albeit in 10-second segments) about a story that they’re reporting on at the moment.

As for the implications Snapchat has for the future of news and storytelling, it’s clear that Snapchat encourages interactivity in a different way than other forms of social media: not only among friends, but now between users and publications, where there was previously a large barrier. Publishing daily “Discover” stories encourages publishers to be deliberate about using only a few slides and tailoring their content to millennials, an issue that is top of mind for publishers today.


Sravanti Tekumalla

Hi, I’m Sravanti!

11885691_10154334130142281_5778489736523912393_o (1)

I’m a current senior at Wellesley College studying computer science and I’m interested in the intersection between technology and journalism — specifically, how to apply my computer science knowledge to create tools that can help journalists parse data in a meaningful, clear way, whether that be through data analysis tools or data visualization tools.

I’m coming to this class after finishing up a stint as Editor of my college paper, The Wellesley News. During my time at The News, I also started an online team which created, and now maintains, our website as well as our social media presence.

Skills-wise, I have some reporting and editing experience from the journalism side. From the tech side of things, I’m good with Java, Python, JavaScript and web development-related things. I’m excited to learn a lot in this class, and to create with all of you!