Journal ArticleDOI

Data science and prediction

Vasant Dhar1
01 Dec 2013-Communications of The ACM (ACM)-Vol. 56, Iss: 12, pp 64-73
TL;DR: The author discusses how big data promises automated, actionable knowledge creation and predictive models for use by both humans and computers.
Abstract: Big data promises automated actionable knowledge creation and predictive models for use by both humans and computers.

Citations
Journal ArticleDOI
TL;DR: This work addresses key questions related to the explosion of interest in the emerging fields of big data, analytics, and data science and the strengths that the information systems (IS) community brings to this discourse.
Abstract: We address key questions related to the explosion of interest in the emerging fields of big data, analytics, and data science. We discuss the novelty of the fields and whether the underlying questions are fundamentally different, the strengths that the information systems (IS) community brings to this discourse, interesting research questions for IS scholars, the role of predictive and explanatory modeling, and how research in this emerging area should be evaluated for contribution and significance.

524 citations

Journal ArticleDOI
TL;DR: This editorial addresses both the collection and handling of big data and the analytical tools provided by data science for management scholars, and provides a primer or a “starter kit” for potential data science applications in management research.
Abstract: The recent advent of remote sensing, mobile technologies, novel transaction systems, and high-performance computing offers opportunities to understand trends, behaviors, and actions in a manner that has not been previously possible. Researchers can thus leverage “big data” that are generated from a plurality of sources including mobile transactions, wearable technologies, social media, ambient networks, and business transactions. An earlier Academy of Management Journal (AMJ) editorial explored the potential implications for data science in management research and highlighted questions for management scholarship as well as the attendant challenges of data sharing and privacy (George, Haas, & Pentland, 2014). This nascent field is evolving rapidly and at a speed that leaves scholars and practitioners alike attempting to make sense of the emergent opportunities that big data hold. With the promise of big data come questions about the analytical value and thus relevance of these data for theory development, including concerns over the context-specific relevance, its reliability and its validity. To address this challenge, data science is emerging as an interdisciplinary field that combines statistics, data mining, machine learning, and analytics to understand and explain how we can generate analytical insights and prediction models from structured and unstructured big data. Data science emphasizes the systematic study of the organization, properties, and analysis of data and their role in inference, including our confidence in the inference (Dhar, 2013). Whereas both big data and data science terms are often used interchangeably, “big data” refer to large and varied data that can be collected and managed, whereas “data science” develops models that capture, visualize, and analyze the underlying patterns in the data. In this editorial, we address both the collection and handling of big data and the analytical tools provided by data science for management scholars.
At the current time, practitioners suggest that data science applications tackle the three core elements of big data: volume, velocity, and variety (McAfee & Brynjolfsson, 2012; Zikopoulos & Eaton, 2011). “Volume” represents the sheer size of the dataset due to the aggregation of a large number of variables and an even larger set of observations for each variable. “Velocity” reflects the speed at which these data are collected and analyzed, whether in real time or near real time from sensors, sales transactions, social media posts, and sentiment data for breaking news and social trends. “Variety” in big data comes from the plurality of structured and unstructured data sources such as text, videos, networks, and graphics among others. The combinations of volume, velocity, and variety reveal the complex task of generating knowledge from big data, which often runs into millions of observations, and deriving theoretical contributions from such data. In this editorial, we provide a primer or a “starter kit” for potential data science applications in management research. We do so with a caveat that emerging fields outdate and improve upon methodologies while often supplanting them with new applications. Nevertheless, this primer can guide management scholars who wish to use data science techniques to reach better answers to existing questions or explore completely new research questions.

251 citations

Journal ArticleDOI
TL;DR: In this paper, the authors summarize evidence from studies of protest movements in the United States, Spain, Turkey, and Ukraine demonstrating that social media platforms facilitate the exchange of information that is vital to the coordination of protest activities, such as news about transportation, turnout, police presence, violence, medical services, and legal support.
Abstract: It is often claimed that social media platforms such as Facebook and Twitter are profoundly shaping political participation, especially when it comes to protest behavior. Whether or not this is the case, the analysis of “Big Data” generated by social media usage offers unprecedented opportunities to observe complex, dynamic effects associated with large‐scale collective action and social movements. In this article, we summarize evidence from studies of protest movements in the United States, Spain, Turkey, and Ukraine demonstrating that: (1) Social media platforms facilitate the exchange of information that is vital to the coordination of protest activities, such as news about transportation, turnout, police presence, violence, medical services, and legal support; (2) in addition, social media platforms facilitate the exchange of emotional and motivational contents in support of and opposition to protest activity, including messages emphasizing anger, social identification, group efficacy, and concerns about fairness, justice, and deprivation as well as explicitly ideological themes; and (3) structural characteristics of online social networks, which may differ as a function of political ideology, have important implications for information exposure and the success or failure of organizational efforts. Next, we issue a brief call for future research on a topic that is understudied but fundamental to appreciating the role of social media in facilitating political participation, namely friendship. In closing, we liken the situation confronted by researchers who are harvesting vast quantities of social media data to that of systems biologists in the early days of genome sequencing.

213 citations

Journal ArticleDOI
TL;DR: This article sets out to dissect BDA’s challenges and promises for IS research, and illustrates them by means of an exemplary study about predicting the helpfulness of 1.3 million online customer reviews.
Abstract: This essay discusses the use of big data analytics (BDA) as a strategy of enquiry for advancing information systems (IS) research. In broad terms, we understand BDA as the statistical modelling of ...

201 citations

Journal ArticleDOI
TL;DR: This article introduces the concept of process-structure-property (PSP) linkages and illustrates how the determination of PSPs is one of the main objectives of materials data science.
Abstract: The field of materials science and engineering is on the cusp of a digital data revolution. After reviewing the nature of data science and Big Data, we discuss the features of materials data that distinguish them from data in other fields. We introduce the concept of process-structure-property (PSP) linkages and illustrate how the determination of PSPs is one of the main objectives of materials data science. Then we review a selection of materials databases, as well as important aspects of materials data management, such as storage hardware, archiving strategies, and data access strategies. We introduce the emerging field of materials data analytics, which focuses on data-driven approaches to extract and curate materials knowledge from available data sets. The critical need for materials e-collaboration platforms is highlighted, and we conclude the article with a number of suggestions regarding the near-term future of the materials data science field.

199 citations

References
Book
28 Jul 2013
TL;DR: This book describes the important ideas in data mining, machine learning, and bioinformatics in a common conceptual framework; the emphasis is on concepts rather than mathematics, with a liberal use of color graphics.
Abstract: During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It is a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting, the first comprehensive treatment of this topic in any book. This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression and path algorithms for the lasso, non-negative matrix factorization, and spectral clustering. There is also a chapter on methods for “wide” data (p bigger than n), including multiple testing and false discovery rates. Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie co-developed much of the statistical modeling software and environment in R/S-PLUS and invented principal curves and surfaces. Tibshirani proposed the lasso and is co-author of the very successful An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools including CART, MARS, projection pursuit and gradient boosting.

19,261 citations

Journal ArticleDOI

11,905 citations

Book
01 Jan 1993
TL;DR: The authors axiomatize the connection between causal structure and probabilistic independence, explore several varieties of causal indistinguishability, formulate a theory of manipulation, and develop asymptotically reliable procedures for searching over equivalence classes of causal models.
Abstract: What assumptions and methods allow us to turn observations into causal knowledge, and how can even incomplete causal knowledge be used in planning and prediction to influence and control our environment? In this book Peter Spirtes, Clark Glymour, and Richard Scheines address these questions using the formalism of Bayes networks, with results that have been applied in diverse areas of research in the social, behavioral, and physical sciences. The authors show that although experimental and observational study designs may not always permit the same inferences, they are subject to uniform principles. They axiomatize the connection between causal structure and probabilistic independence, explore several varieties of causal indistinguishability, formulate a theory of manipulation, and develop asymptotically reliable procedures for searching over equivalence classes of causal models, including models of categorical data and structural equation models with and without latent variables. The authors show that the relationship between causality and probability can also help to clarify such diverse topics in statistics as the comparative power of experimentation versus observation, Simpson's paradox, errors in regression models, retrospective versus prospective sampling, and variable selection. The second edition contains a new introduction and an extensive survey of advances and applications that have appeared since the first edition was published in 1993.

4,863 citations

Journal ArticleDOI
TL;DR: This paper presents a universally applicable attitude and skill set that everyone, not just computer scientists, would be eager to learn and use.
Abstract: It represents a universally applicable attitude and skill set everyone, not just computer scientists, would be eager to learn and use.

4,819 citations