Big Bias

by Doug Allen

Kate Crawford warns against believing “that massive data sets and predictive analytics always reflect objective truth”:

Data and data sets are not objective; they are creations of human design. We give numbers their voice, draw inferences from them, and define their meaning through our interpretations. Hidden biases in both the collection and analysis stages present considerable risks, and are as important to the big-data equation as the numbers themselves.

For example, consider the Twitter data generated by Hurricane Sandy, more than 20 million tweets between October 27 and November 1. … The greatest number of tweets about Sandy came from Manhattan. This makes sense given the city’s high level of smartphone ownership and Twitter use, but it creates the illusion that Manhattan was the hub of the disaster. Very few messages originated from more severely affected locations, such as Breezy Point, Coney Island and Rockaway. As extended power blackouts drained batteries and limited cellular access, even fewer tweets came from the worst hit areas. In fact, there was much more going on outside the privileged, urban experience of Sandy that Twitter data failed to convey, especially in aggregate. We can think of this as a “signal problem”: Data are assumed to accurately reflect the social world, but there are significant gaps, with little or no signal coming from particular communities.

I would only add that these problems are not unique to big data, though a larger dataset makes them easier to overlook. In any data analysis, it is important to think not only about the data you have, but also about the data you don’t have.