Paper to be presented at Oxford Internet Institute’s “A Decade in Internet Time: Symposium
on the Dynamics of the Internet and Society” on September 21, 2011.
Six Provocations for Big Data
danah boyd
Microsoft Research
dmb@microsoft.com
Kate Crawford
University of New South Wales
k.crawford@unsw.edu.au
Technology is neither good nor bad; nor is it neutral...technology’s interaction with the
social ecology is such that technical developments frequently have environmental, social,
and human consequences that go far beyond the immediate purposes of the technical
devices and practices themselves.
Melvin Kranzberg (1986, p. 545)
We need to open a discourse – where there is no effective discourse now – about the
varying temporalities, spatialities and materialities that we might represent in our
databases, with a view to designing for maximum flexibility and allowing as possible for
an emergent polyphony and polychrony. Raw data is both an oxymoron and a bad idea; to
the contrary, data should be cooked with care.
Geoffrey Bowker (2005, pp. 183-184)
The era of Big Data has begun. Computer scientists, physicists, economists,
mathematicians, political scientists, bio-informaticists, sociologists, and many others are
clamoring for access to the massive quantities of information produced by and about
people, things, and their interactions. Diverse groups argue about the potential benefits
and costs of analyzing information from Twitter, Google, Verizon, 23andMe, Facebook,
Wikipedia, and every space where large groups of people leave digital traces and deposit
data. Significant questions emerge. Will large-scale analysis of DNA help cure diseases?
Or will it usher in a new wave of medical inequality? Will data analytics help make
people’s access to information more efficient and effective? Or will it be used to track
protesters in the streets of major cities? Will it transform how we study human
communication and culture, or narrow the palette of research options and alter what
‘research’ means? Some or all of the above?
Big Data is, in many ways, a poor term. As Lev Manovich (2011) observes, it has been
used in the sciences to refer to data sets large enough to require supercomputers, although
now vast sets of data can be analyzed on desktop computers with standard software.
There is little doubt that the quantities of data now available are indeed large, but that’s
not the most relevant characteristic of this new data ecosystem. Big Data is notable not
because of its size, but because of its relationality to other data. Due to efforts to mine
and aggregate data, Big Data is fundamentally networked. Its value comes from the
patterns that can be derived by making connections between pieces of data, about an
individual, about individuals in relation to others, about groups of people, or simply about
the structure of information itself.
Furthermore, Big Data is important because it refers to an analytic phenomenon playing
out in academia and industry. Rather than suggesting a new term, we are using Big Data
here because of its popular salience and because it is the phenomenon around Big Data
that we want to address. Big Data tempts some researchers to believe that they can see
everything at a 30,000-foot view. It is the kind of data that encourages the practice of
apophenia: seeing patterns where none actually exist, simply because massive quantities
of data can offer connections that radiate in all directions. Due to this, it is crucial to
begin asking questions about the analytic assumptions, methodological frameworks, and
underlying biases embedded in the Big Data phenomenon.
While databases have been aggregating data for over a century, Big Data is no longer just
the domain of actuaries and scientists. New technologies have made it possible for a
wide range of people – including humanities and social science academics, marketers,
governmental organizations, educational institutions, and motivated individuals – to
produce, share, interact with, and organize data. Massive data sets that were once
obscure and distinct are being aggregated and made easily accessible. Data is
increasingly digital air: the oxygen we breathe and the carbon dioxide that we exhale. It
can be a source of both sustenance and pollution.
How we handle the emergence of an era of Big Data is critical: while it is taking place in
an environment of uncertainty and rapid change, current decisions will have considerable
impact in the future. With the increased automation of data collection and analysis – as
well as algorithms that can extract and inform us of massive patterns in human behavior –
it is necessary to ask which systems are driving these practices, and which are regulating
them. In Code, Lawrence Lessig (1999) argues that systems are regulated by four forces:
the market, the law, social norms, and architecture – or, in the case of technology, code.
When it comes to Big Data, these four forces are at work and, frequently, at odds. The
market sees Big Data as pure opportunity: marketers use it to target advertising, insurance
providers want to optimize their offerings, and Wall Street bankers use it to take better readings of market temperament. Legislation has already been proposed to curb the
collection and retention of data, usually over concerns about privacy (for example, the Do
Not Track Online Act of 2011 in the United States). Features like personalization allow
rapid access to more relevant information, but they present difficult ethical questions and
fragment the public in problematic ways (Pariser 2011).
There are some significant and insightful studies currently being done that draw on Big
Data methodologies, particularly studies of practices in social network sites like
Facebook and Twitter. Yet, it is imperative that we begin asking critical questions about
what all this data means, who gets access to it, how it is deployed, and to what ends. With
Big Data come big responsibilities. In this essay, we are offering six provocations that we
hope can spark conversations about the issues of Big Data. Social and cultural researchers
have a stake in the computational culture of Big Data precisely because many of its
central questions are fundamental to our disciplines. Thus, we believe that it is time to
start critically interrogating this phenomenon, its assumptions, and its biases.
1. Automating Research Changes the Definition of Knowledge.
In the early decades of the 20th century, Henry Ford devised a manufacturing system of
mass production, using specialized machinery and standardized products. It quickly became the dominant vision of technological progress. Fordism meant
automation and assembly lines, and for decades onward, this became the orthodoxy of
manufacturing: out with skilled craftspeople and slow work, in with a new machine-made
era (Baca 2004). But it was more than just a new set of tools. The 20th century was
marked by Fordism at a cellular level: it produced a new understanding of labor, the
human relationship to work, and society at large.
Big Data not only refers to very large data sets and the tools and procedures used to
manipulate and analyze them, but also to a computational turn in thought and research
(Burkholder 1992). Just as Ford changed the way we made cars – and then transformed
work itself – Big Data has emerged as a system of knowledge that is already changing the
objects of knowledge, while also having the power to inform how we understand human
networks and community. ‘Change the instruments, and you will change the entire social
theory that goes with them,’ Latour reminds us (2009, p. 9).
We would argue that Big Data creates a radical shift in how we think about research. Commenting on computational social science, Lazer et al. argue that it offers 'the capacity to collect and analyze data with an unprecedented breadth and depth and scale' (2009, p. 722). But it is not just a matter of scale, nor is it enough to consider it in terms of
proximity, or what Moretti (2007) refers to as distant or close analysis of texts. Rather, it
is a profound change at the levels of epistemology and ethics. It reframes key questions
about the constitution of knowledge, the processes of research, how we should engage
with information, and the nature and the categorization of reality. Just as du Gay and
Pryke note that ‘accounting tools...do not simply aid the measurement of economic
activity, they shape the reality they measure’ (2002, pp. 12-13), so Big Data stakes out
new terrains of objects, methods of knowing, and definitions of social life.
Speaking in praise of what he terms ‘The Petabyte Age’, Chris Anderson, Editor-in-Chief
of Wired, writes:
This is a world where massive amounts of data and applied mathematics replace
every other tool that might be brought to bear. Out with every theory of human
behavior, from linguistics to sociology. Forget taxonomy, ontology, and
psychology. Who knows why people do what they do? The point is they do it, and
we can track and measure it with unprecedented fidelity. With enough data, the
numbers speak for themselves. (2008)

Do numbers speak for themselves? The answer, we think, is a resounding ‘no’.
Significantly, Anderson’s sweeping dismissal of all other theories and disciplines is a tell:
it reveals an arrogant undercurrent in many Big Data debates where all other forms of
analysis can be sidelined by production lines of numbers, privileged as having a direct
line to raw knowledge. Why people do things, write things, or make things is erased by
the sheer volume of numerical repetition and large patterns. This is not a space for
reflection or the older forms of intellectual craft. As David Berry (2011, p. 8) writes, Big
Data provides 'destabilising amounts of knowledge and information that lack the
regulating force of philosophy.’ Instead of philosophy – which Kant saw as the rational
basis for all institutions – ‘computationality might then be understood as an ontotheology,
creating a new ontological “epoch” as a new historical constellation of intelligibility’
(Berry 2011, p. 12).
We must ask difficult questions of Big Data’s models of intelligibility before they
crystallize into new orthodoxies. If we return to Ford, his innovation was using the
assembly line to break down interconnected, holistic tasks into simple, atomized,
mechanistic ones. He did this by designing specialized tools that strongly predetermined
and limited the action of the worker. Similarly, the specialized tools of Big Data also
have their own inbuilt limitations and restrictions. One is the issue of time. ‘Big Data is
about exactly right now, with no historical context that is predictive,’ observes Joi Ito, the
director of the MIT Media Lab (Bollier 2010, p. 19). Twitter and Facebook, for example, are Big Data sources that offer very poor archiving and search functions,
where researchers are much more likely to focus on something in the present or
immediate past – tracking reactions to an election, TV finale or natural disaster – because
of the sheer difficulty or impossibility of accessing older data.
If we are observing the automation of particular kinds of research functions, then we
must consider the inbuilt flaws of the machine tools. It is not enough to simply ask, as Anderson suggests, 'what can science learn from Google?'; we must also ask how Google and
the other harvesters of Big Data might change the meaning of learning, and what new
possibilities and new limitations may come with these systems of knowing.
2. Claims to Objectivity and Accuracy are Misleading.
‘Numbers, numbers, numbers,’ writes Latour (2010). ‘Sociology has been obsessed by
the goal of becoming a quantitative science.’ Yet sociology has never reached this goal,
in Latour’s view, because of where it draws the line between what is and is not
quantifiable knowledge in the social domain.
Big Data offers the humanistic disciplines a new way to claim the status of quantitative
science and objective method. It makes many more social spaces quantifiable. In reality,
working with Big Data is still subjective, and what it quantifies does not necessarily have
a closer claim on objective truth – particularly when considering messages from social
media sites. But there remains a mistaken belief that qualitative researchers are in the
business of interpreting stories and quantitative researchers are in the business
of producing facts. In this way, Big Data risks reinscribing established divisions in the
long running debates about scientific method.
The notion of objectivity has been a central question for the philosophy of science and
early debates about the scientific method (Durkheim 1895). Claims to objectivity suggest
an adherence to the sphere of objects, to things as they exist in and for themselves.
Subjectivity, on the other hand, is viewed with suspicion, colored as it is with various
forms of individual and social conditioning. The scientific method attempts to remove
itself from the subjective domain through the application of a dispassionate process
whereby hypotheses are proposed and tested, eventually resulting in improvements in
knowledge. Nonetheless, claims to objectivity are necessarily made by subjects and are
based on subjective observations and choices.
All researchers are interpreters of data. As Lisa Gitelman (2011) observes, data needs to
be imagined as data in the first instance, and this process of the imagination of data
entails an interpretative base: ‘every discipline and disciplinary institution has its own
norms and standards for the imagination of data.’ As computational scientists have
started engaging in acts of social science, there is a tendency to claim their work as the
business of facts and not interpretation. A model may be mathematically sound, an
experiment may seem valid, but as soon as a researcher seeks to understand what it
means, the process of interpretation has begun. The design decisions that determine what
will be measured also stem from interpretation.
For example, in the case of social media data, there is a ‘data cleaning’ process: making
decisions about what attributes and variables will be counted, and which will be ignored.
This process is inherently subjective. As Bollier explains,
As a large mass of raw information, Big Data is not self-explanatory. And yet the
specific methodologies for interpreting the data are open to all sorts of
philosophical debate. Can the data represent an ‘objective truth’ or is any
interpretation necessarily biased by some subjective filter or the way that data is
‘cleaned?’ (2010, p. 13)
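To make the subjectivity of 'cleaning' concrete, consider a minimal sketch of such a step; the field names, thresholds, and filtering rules below are hypothetical illustrations of the kinds of choices researchers make, not drawn from any particular study or API.

```python
# A minimal sketch of a social media 'data cleaning' step.
# All field names, thresholds, and rules here are hypothetical.

def clean_tweets(raw_tweets):
    """Filter and reduce raw tweet records before analysis.

    Each rule below is an interpretive decision, not a neutral fact:
    together they determine whose traces count as 'data'.
    """
    cleaned = []
    for tweet in raw_tweets:
        if tweet.get("lang") != "en":            # choice: drop non-English voices
            continue
        if tweet.get("is_retweet"):              # choice: treat retweets as noise
            continue
        if tweet.get("follower_count", 0) < 10:  # choice: treat small accounts as bots
            continue
        # choice: keep only the attributes we have decided to measure
        cleaned.append({
            "user": tweet["user_id"],
            "text": tweet["text"],
            "time": tweet["created_at"],
        })
    return cleaned

# Three records go in; the chosen filters let only one survive.
sample = [
    {"lang": "en", "is_retweet": False, "follower_count": 250,
     "user_id": "a1", "text": "hello", "created_at": "2011-09-21T10:00:00Z"},
    {"lang": "fr", "is_retweet": False, "follower_count": 500,
     "user_id": "b2", "text": "bonjour", "created_at": "2011-09-21T10:01:00Z"},
    {"lang": "en", "is_retweet": True, "follower_count": 80,
     "user_id": "c3", "text": "RT hello", "created_at": "2011-09-21T10:02:00Z"},
]
print(len(clean_tweets(sample)))  # -> 1: the dataset is already an interpretation
```

Each filter looks like routine housekeeping, yet together they decide in advance whose voices enter the analysis at all.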
In addition to this question, there is the issue of data errors. Large data sets from Internet
sources are often unreliable, prone to outages and losses, and these errors and gaps are
magnified when multiple data sets are used together. Social scientists have a long history
of asking critical questions about the collection of data and trying to account for any
biases in their data (Cain & Finch, 1981; Clifford & Marcus, 1986). This requires
understanding the properties and limits of a dataset, regardless of its size. A dataset may
have many millions of pieces of data, but this does not mean it is random or
representative. To make statistical claims about a dataset, we need to know where data is
coming from; it is similarly important to know and account for the weaknesses in that
data. Furthermore, researchers must be able to account for the biases in their
interpretation of the data. To do so requires recognizing that one’s identity and
perspective informs one’s analysis (Behar & Gordon, 1996).
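The gap between size and representativeness can be illustrated with a toy simulation; the population, opinion rate, and sampling skew below are assumed numbers for the sketch, not measurements of any real platform.

```python
# A toy simulation of why sheer size does not guarantee representativeness.
# The population, opinion rate, and sampling skew are all assumed numbers.
import random

random.seed(1)

# Hypothetical population of 1,000,000 people; 30% hold opinion X (coded 1).
population = [1] * 300_000 + [0] * 700_000

# A small but truly random sample.
random_sample = random.sample(population, 1_000)

# A much larger but biased sample: imagine a platform whose users skew
# toward opinion X, so holders of X are captured at a far higher rate.
biased_sample = [p for p in population
                 if random.random() < (0.9 if p == 1 else 0.18)]

true_rate = sum(population) / len(population)
print(f"true rate:                {true_rate:.3f}")
print(f"random sample (n=1,000):  {sum(random_sample) / len(random_sample):.3f}")
print(f"biased sample (n={len(biased_sample):,}): "
      f"{sum(biased_sample) / len(biased_sample):.3f}")
# The small random sample lands near 0.30; the sample that is hundreds of
# times larger lands near 0.68, because the collection mechanism is biased.
```

No quantity of additional biased data corrects the estimate; only an account of how the data was collected can.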
