scispace - formally typeset
Search or ask a question
Topic

Data management

About: Data management is a research topic. Over the lifetime, 31574 publications have been published within this topic receiving 424326 citations.


Papers
More filters
Journal ArticleDOI
TL;DR: This paper presents an overview of the SDD-1 design and its solutions to the above problems.
Abstract: The declining cost of computer hardware and the increasing data processing needs of geographically dispersed organizations have led to substantial interest in distributed data management. SDD-1 is a distributed database management system currently being developed by Computer Corporation of America. Users interact with SDD-1 precisely as if it were a nondistributed database system because SDD-1 handles all issues arising from the distribution of data. These issues include distributed concurrency control, distributed query processing, resiliency to component failure, and distributed directory management. This paper presents an overview of the SDD-1 design and its solutions to the above problems.This paper is the first of a series of companion papers on SDD-1 (Bernstein and Shipman [2], Bernstein et al. [4], and Hammer and Shipman [14]).

253 citations

Journal ArticleDOI
TL;DR: This paper presents the 5Vs characteristics of big data and the technique and technology used to handle big data in a wide variety of scalable database tools and techniques.

253 citations

Journal ArticleDOI
TL;DR: NGSUtils is a suite of software tools for manipulating data common to next-generation sequencing experiments, such as FASTQ, BED and BAM format files, that provide a stable and modular platform for data management and analysis.
Abstract: Summary: NGSUtils is a suite of software tools for manipulating data common to next-generation sequencing experiments, such as FASTQ, BED and BAM format files. These tools provide a stable and modular platform for data management and analysis. Availability and implementation: NGSUtils is available under a BSD license and works on Mac OS X and Linux systems. Python 2.6+ and virtualenv are required. More information and source code may be obtained from the website: http://ngsutils.org. Contact: ude.iupui@uilnuy Supplemental information: Supplementary data are available at Bioinformatics online.

253 citations

Journal ArticleDOI
TL;DR: This editorial addresses both the collection and handling of big data and the analytical tools provided by data science for management scholars, and provides a primer or a “starter kit” for potential data science applications inmanagement research.
Abstract: The recent advent of remote sensing, mobile technologies, novel transaction systems, and highperformance computing offers opportunities to understand trends, behaviors, and actions in a manner that has not been previously possible. Researchers can thus leverage “big data” that are generated from a plurality of sources including mobile transactions, wearable technologies, social media, ambient networks, andbusiness transactions.An earlierAcademy of Management Journal (AMJ) editorial explored the potential implications for data science inmanagement research and highlighted questions for management scholarship as well as the attendant challenges of data sharing and privacy (George, Haas, & Pentland, 2014). This nascent field is evolving rapidly and at a speed that leaves scholars and practitioners alike attempting to make sense of the emergent opportunities that big datahold.With thepromiseof bigdata comequestions about the analytical value and thus relevance of these data for theory development—including concerns over the context-specific relevance, its reliability and its validity. To address this challenge, data science is emerging as an interdisciplinary field that combines statistics, data mining, machine learning, and analytics to understand and explainhowwecan generate analytical insights and prediction models from structured and unstructured big data. Data science emphasizes the systematic study of the organization, properties, and analysis of data and their role in inference, including our confidence in the inference (Dhar, 2013).Whereas both big data and data science terms are often used interchangeably, “big data” refer to large and varied data that can be collected and managed, whereas “data science” develops models that capture, visualize, andanalyze theunderlyingpatterns in thedata. In this editorial, we address both the collection and handling of big data and the analytical tools provided by data science for management scholars. At the current time, practitioners suggest that data science applications tackle the three core elements of big data: volume, velocity, and variety (McAfee & Brynjolfsson, 2012; Zikopoulos & Eaton, 2011). “Volume” represents the sheer size of the dataset due to the aggregation of a large number of variables and an even larger set of observations for each variable. “Velocity” reflects the speed atwhich these data are collected and analyzed, whether in real time or near real time from sensors, sales transactions, social media posts, and sentiment data for breaking news and social trends. “Variety” in big data comes from the plurality of structured and unstructured data sources such as text, videos, networks, and graphics among others. The combinations of volume, velocity, and variety reveal the complex task of generating knowledge from big data, which often runs into millions of observations, and deriving theoretical contributions from such data. In this editorial, we provide a primer or a “starter kit” for potential data science applications inmanagement research. We do so with a caveat that emerging fields outdate and improve uponmethodologies while often supplanting them with new applications. Nevertheless, this primer can guide management scholars who wish to use data science techniques to reach better answers to existing questions or explore completely new research questions.

251 citations

Journal IssueDOI
TL;DR: Characteristics of and requirements for scientific workflows as identified in a number of application projects are described, and some key features of Kepler and its underlying Ptolemy II system, planned extensions, and areas of future research are described.
Abstract: Many scientific disciplines are now data and information driven, and new scientific knowledge is often gained by scientists putting together data analysis and knowledge discovery ‘pipelines’. A related trend is that more and more scientific communities realize the benefits of sharing their data and computational services, and are thus contributing to a distributed data and computational community infrastructure (a.k.a. ‘the Grid’). However, this infrastructure is only a means to an end and ideally scientists should not be too concerned with its existence. The goal is for scientists to focus on development and use of what we call scientific workflows. These are networks of analytical steps that may involve, e.g., database access and querying steps, data analysis and mining steps, and many other steps including computationally intensive jobs on high-performance cluster computers. In this paper we describe characteristics of and requirements for scientific workflows as identified in a number of our application projects. We then elaborate on Kepler, a particular scientific workflow system, currently under development across a number of scientific data management projects. We describe some key features of Kepler and its underlying Ptolemy II system, planned extensions, and areas of future research. Kepler is a community-driven, open source project, and we always welcome related projects and new contributors to join. Copyright © 2005 John Wiley & Sons, Ltd.

250 citations


Network Information
Related Topics (5)
Information system
107.5K papers, 1.8M citations
90% related
Software
130.5K papers, 2M citations
88% related
Cluster analysis
146.5K papers, 2.9M citations
83% related
The Internet
213.2K papers, 3.8M citations
82% related
Cloud computing
156.4K papers, 1.9M citations
81% related
Performance
Metrics
No. of papers in the topic in previous years
YearPapers
2023218
2022485
2021959
20201,435
20191,745
20181,719