In Defense of Skewed Data

Patrick Ruffini

With #skewed polling in the news this campaign season, I stand as a lonely voice for noisy, biased, self-reported, and yes, skewed data in the Presidential race.

This piece is not about the purported bias in public polling, though I could go on and on about the shoddy reporting and analysis about polls. It’s about all the people who are getting into the polling game (Engage included) by using social media and Internet data to try to get a fix on what’s going on in real time. This post is a field guide to these types of efforts, explaining where they’re useful, and how they do (and don’t) beat the polls that captivate the political class.

Why Scientific Polls Aren’t Enough

Let’s ask the first-order question here: Why?

Public opinion polls seem to be pretty good at forecasting the winners of elections, so why reinvent the wheel with newfangled metrics like tweets-per-minute or Facebook’s “people talking about this” number that aren’t scientific and whose subjects tend to be overly partisan and biased? Why study the Internet to figure out how public opinion is changing minute-by-minute?

On this, I still think George Mallory’s answer when asked why he would climb Mount Everest serves as a good guide: Because it’s there.

After more than a decade doing online activation, I can testify to the fact that there’s just about nothing users like better than answering polls. Millions of people answer online surveys every day, but the public polls released at the height of the election season reflect interviews with only a few thousand respondents per night.

In this age of abundant data, why is it getting harder and harder for pollsters to collect useful data on the electorate? Response rates to telephone surveys continue to plummet, and pollsters must recalibrate their methodologies to include cell phone-only households. You would think the Internet could pick up the slack here, but curators like Talking Points Memo won’t include online-only polls like YouGov in their averages. Technology hasn’t translated into a quantum leap forward in the volume of responses and quality of polling data.

As puzzling as I find anti-Internet bias, I have to concede there are some valid concerns here. To get a perfectly unbiased sample, you have to harass people over the phone because virtually all methodologies on the Internet are opt-in, and people who opt in to things are different from those who don’t. By definition, polls are about finding people who don’t already want to take them. Finding the rare person willing to sit through an interview and then balancing their responses is an expensive proposition. According to Chuck Todd, to do it right, NBC and the Wall Street Journal shell out between $40,000 and $60,000 per poll.


The end product of these polls (which come in more and less-expensive varieties) is between 500 and 3,000 interviews that, on average, reflect public opinion as of a few days ago. After Mitt Romney’s crushing win in the first debate, we did not know that he had moved slightly ahead in the race until more than a week later. This is partly due to how polling shifts play out over several news cycles, but also because of the delay in reporting poll samples. Gallup, for instance, uses a 7-day rolling sample, which means that the median interview took place four days before the poll was released. So, polls can be accurate as of 2 to 4 days ago, but not accurate as of now.
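To make the arithmetic behind that lag concrete, here is a minimal Python sketch. It assumes, purely for illustration, that a rolling poll completes an equal number of interviews each day and is released the day after the last batch. `interview_ages` and `release_lag` are hypothetical names, not anything Gallup publishes.

```python
from statistics import median

def interview_ages(window_days: int, release_lag: int = 1) -> list[int]:
    """Age in days, at release time, of each day's batch of interviews.

    Assumes one equal-sized batch per day and a release `release_lag`
    days after the most recent batch (both are illustrative assumptions).
    """
    return list(range(release_lag, release_lag + window_days))

ages = interview_ages(7)   # [1, 2, 3, 4, 5, 6, 7]
print(median(ages))        # 4: the middle interview is four days old
```

Under those assumptions, a shorter window narrows the lag but never removes it: a 3-day sample still has a median interview that is two days old.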

It’s also difficult (nigh impossible if the group is small enough) to reliably measure polling movement among different subgroups in the electorate systematically over time. The smaller the group gets, the more it’s anyone’s guess as to what the real numbers are.

It was easy for folks to get excited about the fact that Romney recently moved ahead of Obama among Jewish voters in the IBD/TIPP tracking poll by 44-40 — but the subsample of Jewish voters surveyed was no more than 25. In the most recent version of this poll, Obama leads among the same group by 78-22. Maybe there was actual movement, but more likely it was just the tiny sample size.
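A quick back-of-the-envelope check shows why a 25-person subsample can swing so wildly. The sketch below uses the textbook 95% margin of error for a proportion, which assumes a simple random sample. Real polls apply weighting that widens the margin further, so treat these numbers as a floor rather than as the pollster's own figures.

```python
import math

def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion p measured on n
    respondents, assuming simple random sampling (an idealization)."""
    return z * math.sqrt(p * (1 - p) / n)

# 25 is the subsample size cited above; 500 and 3,000 bracket a typical full poll.
for n in (25, 500, 3000):
    print(f"n={n}: +/- {100 * margin_of_error(n):.1f} points")
```

At n=25 the margin is close to 20 points on each candidate's share, which is enough noise to make even a very large swing hard to read.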

Yes, groups often pay more to poll specific demographics, but only a few times per election cycle. We are nowhere close, for instance, to having a RealClearPolitics average of, say, married women in Ohio, a measure that would be relevant to how campaigns actually spend money. And unless something drastically changes with how traditional polling is done, we will never have this. Ever.

The reason is that providing a balanced, unbiased sample is expensive. But what if you didn’t have to balance the sample?

This is where mining relevant streams of Internet data can help.

Social Media is the World’s Biggest Data Platform

We think of social media as the world’s biggest conversational platform. But it’s no slouch in the data department either.

Facebook users generate around 684,478 pieces of content per minute. Twitter users tweet 200 million times daily. This doesn’t even include the petabytes (exabytes? yottabytes?) of user account data on millions of websites, tied to demographics.

The sum total of these interactions speaks volumes about each of us as a person. Not every variable will be public about every person, but our tendency to interact socially online, the language we use, and what we post can reveal a great deal about our personalities, our values, and our political beliefs. And this is just from what we post publicly.

Pollsters and most journalists have shied away from analyzing this data for a few reasons. First, obviously, is privacy. Second, we still lack the processing power and analytical capability to usefully make sense of these large data sets. And third, the easy, topline queries are often misleading, reflecting certain skews in online phenomena, in the sites themselves, and in things that are fundamentally hard to control for, like media attention or virality. Noting that Obama leads by 3-to-1 on Facebook is not terribly interesting, because it could reflect his incumbent status, his global popularity, his four-year head start, or his cult-like status in 2008.

Nonetheless, if you ask the right questions, you can get at certain answers faster and with more granular data than a traditional poll.

To Get the Data, Embrace the Skew

During the Vice Presidential debate, Xbox Live polled viewers in real time on who they thought won. The answers may have been disheartening for the Romney-Ryan ticket: undecided voters on the platform thought Joe Biden won the debate by a 44 to 23 percent margin.

But the sample was skewed: Xbox viewers as a whole were voting for Obama over Romney by a 52-36 percent margin, while public polls show a tied race. Xbox users, as gamers, are typically younger, so even the undecided might be left-leaning. Data from our Trendsetter app, which measures the political affinities of page likers on Facebook, is consistent with these results, showing a roughly 60-40 pro-Obama Xbox skew.

Before we use this skew to summarily discard the results, consider this: each question got 30,000 responses, presumably tied to rich demographic information. This means that, within the Xbox community, you have large samples of hundreds of voters for each of dozens of different slices of the electorate.

These large sample sizes mean you can get an extremely granular view of opinion changing over time, especially when data is tied to real user accounts with demographic info. Even if we don’t re-weight the demographics from the Xbox poll back to the overall population, because of the sheer volume of data, there is intrinsic value in studying the data shifts and the patterns evidenced in the poll’s internals.

The overall skew doesn’t matter, because we aren’t interested in the toplines (the Presidential horserace number). Traditional polls do a good enough job of measuring those. What we’re interested in is measuring change among niche demographics and doing it in real time, without the 2-to-4 day delay. When it comes to measuring what happened in the last 24 hours, campaign polls give us no data or extremely rough data. Sheer volume means Internet data can do a better job of this, particularly if it can be confirmed across multiple data sets.
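One way to picture "measuring change, not toplines" is to compare the same demographic slice across two waves of a skewed panel. If the skew is roughly constant between waves, it largely cancels out of the difference. The Python sketch below is an illustration under that assumption. The group label, the "A"/"B" choices, and every number are made up, and the standard error treats respondents within a slice as a random sample, which an opt-in panel does not guarantee.

```python
import math
from collections import defaultdict

def share_by_group(responses):
    """responses: iterable of (group, choice) pairs.
    Returns {group: (share choosing "A", number of respondents)}."""
    counts = defaultdict(lambda: [0, 0])  # group -> [votes for A, total]
    for group, choice in responses:
        counts[group][1] += 1
        if choice == "A":
            counts[group][0] += 1
    return {g: (a / n, n) for g, (a, n) in counts.items()}

def shifts(before, after):
    """Change in A's share per group between two waves, with a rough
    standard error for the difference of two proportions."""
    b, a = share_by_group(before), share_by_group(after)
    out = {}
    for g in b.keys() & a.keys():  # only groups present in both waves
        (p0, n0), (p1, n1) = b[g], a[g]
        se = math.sqrt(p0 * (1 - p0) / n0 + p1 * (1 - p1) / n1)
        out[g] = (p1 - p0, se)
    return out

# Hypothetical data: 400 respondents from one slice in each wave.
group = "married women, Ohio"
wave1 = [(group, "A")] * 180 + [(group, "B")] * 220
wave2 = [(group, "A")] * 200 + [(group, "B")] * 200
change, se = shifts(wave1, wave2)[group]
print(f"{group}: {change:+.1%} (rough s.e. {se:.1%})")
```

The level (45% or 50%) says little about the electorate, because the panel is skewed. The movement between waves is the quantity of interest, and the standard error gives a rough sense of whether that movement exceeds the noise.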

In the recent debates, I polled my Twitter audience as to who they thought won. Most polls received between 200 and 1,000 responses, measured as retweets. Some tried to poke fun at this, given that my Twitter followers appear to skew 10-to-1 towards Romney based on the results. But my goal wasn’t to suggest that Romney was winning public opinion by 10-to-1. It was to collect as much data as fast as possible, extracting insight where appropriate. Last night, I asked people to indicate whether they thought each candidate was winning by a little or a lot. The data could suggest that Obama voters were a bit more enthusiastic about their guy’s performance, even though there were fewer of them (irrelevant for the purposes of this analysis).



As for the broader question of how well Twitter performed as a barometer during the debates, searches for “Romney winning” and “Obama winning” accurately predicted the results of snap polling done after each debate. They showed Romney dominating the first debate from 20 minutes in, while a more muddled back-and-forth picture emerged from the remaining two debates, also consistent with the polls. After the conventions, we outlined the case for how Twitter reactions to major speakers forecast the nightly movement in the polls, and found (with one or two exceptions) a clear correlation.

Twitter is full of biased and self-interested political actors, but it mirrors and reinforces the media narrative and thus public opinion. You can’t really measure undecided voters on Twitter, but you can tell which side’s partisans felt great, and which felt “Meh.” And you can quantify this in real time, getting ahead of the polls. Even with an unrepresentative sample, we’ve found it to be a good guide to broader opinion, but you have to drill down on specific search queries, eschew broad metrics like tweets-per-minute, and treat sentiment analysis with caution. For instance, we found that use of a candidate’s name in conjunction with “awesome” could be a better indicator of positive reaction to a candidate than positive sentiment scores.
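A specific-query metric of this kind is simple to compute once you have the tweets in hand. The sketch below is not our production tooling. It assumes you have already collected (timestamp, text) pairs by some means, and it counts, minute by minute, the tweets that mention both a candidate's name and a chosen keyword. The function name and the sample tweets are invented for illustration.

```python
import re
from collections import Counter
from datetime import datetime

def phrase_counts_per_minute(tweets, name, keyword):
    """tweets: iterable of (timestamp, text) pairs.
    Returns a Counter mapping each minute to the number of tweets that
    contain both `name` and `keyword` as whole words (case-insensitive)."""
    name_re = re.compile(rf"\b{re.escape(name)}\b", re.IGNORECASE)
    kw_re = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    counts = Counter()
    for ts, text in tweets:
        if name_re.search(text) and kw_re.search(text):
            # Truncate the timestamp to the minute to form the bucket key.
            counts[ts.replace(second=0, microsecond=0)] += 1
    return counts

# Made-up sample tweets, purely for illustration.
tweets = [
    (datetime(2012, 10, 3, 21, 20, 5), "Romney is awesome tonight"),
    (datetime(2012, 10, 3, 21, 20, 40), "romney: AWESOME answer on jobs"),
    (datetime(2012, 10, 3, 21, 21, 10), "Obama looks tired"),
]
for minute, n in sorted(phrase_counts_per_minute(tweets, "Romney", "awesome").items()):
    print(minute.strftime("%H:%M"), n)
```

Running the same count for each candidate and comparing the two series over time gives a partisan-enthusiasm signal that does not depend on a sentiment classifier.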

Towards the Hourly Tracking Poll

The 2012 elections won’t resolve the question of whether Big Data can predict election outcomes, but the approach holds great promise if we can embrace the heretical idea that balance isn’t the be-all and end-all while we mine insights from the depths of Internet data.

Your next project can be fast, cheap, and good: pick two. Opinion data can be fast, balanced, and big: pick two. By looking at absurdly large data sets, embracing the inherent skew of Xbox or Facebook users, and asking the right questions, you can get at things no poll can: subtle changes in the samples and among specific demographics, measured day by day, or even hour by hour.

Why should this be important, beyond feeding the media-political beast with near real-time analytics?

The political world has embraced real-time data everywhere else, in everything from voter ID calls, to fundraising emails, to online advertising. Why wouldn’t public opinion research work the same way? People like giving their opinion. Is there a way to better turn the responses of these willing participants into actionable data?

After all three debates, the political discussion quickly descended into meme graphics about Big Bird, binders, and bayonets. This was fed in part by a data-driven feedback loop of hardcore partisans on social media — combined with a complete absence of data about how these attacks worked in real time with undecideds. Interviews conducted after the fact showed these attacks fell flat with those voters, yet the memes went on for days. Real-time polling might mean less Big Bird — and more messaging that’s actually relevant in Ohio.