
Showing papers on "Data management" published in 2014


Journal ArticleDOI
TL;DR: A framework for the realization of smart cities through the Internet of Things (IoT), which encompasses the complete urban information system, from the sensory level and networking support structure through to data management and Cloud-based integration of respective systems and services, and forms a transformational part of the existing cyber-physical system.
Abstract: Increasing population density in urban centers demands adequate provision of services and infrastructure to meet the needs of city inhabitants, encompassing residents, workers, and visitors. The utilization of information and communications technologies to achieve this objective presents an opportunity for the development of smart cities, where city management and citizens are given access to a wealth of real-time information about the urban environment upon which to base decisions, actions, and future planning. This paper presents a framework for the realization of smart cities through the Internet of Things (IoT). The framework encompasses the complete urban information system, from the sensory level and networking support structure through to data management and Cloud-based integration of respective systems and services, and forms a transformational part of the existing cyber-physical system. This IoT vision for a smart city is applied to a noise mapping case study to illustrate a new method for existing operations that can be adapted for the enhancement and delivery of important city services.

1,178 citations


Journal ArticleDOI
TL;DR: A comprehensive and structured overview of a large set of interesting outlier definitions for various forms of temporal data, novel techniques, and application scenarios in which specific definitions and techniques have been widely used is provided.
Abstract: In the statistics community, outlier detection for time series data has been studied for decades. Recently, with advances in hardware and software technology, there has been a large body of work on temporal outlier detection from a computational perspective within the computer science community. In particular, advances in hardware technology have enabled the availability of various forms of temporal data collection mechanisms, and advances in software technology have enabled a variety of data management mechanisms. This has fueled the growth of different kinds of data sets such as data streams, spatio-temporal data, distributed streams, temporal networks, and time series data, generated by a multitude of applications. There arises a need for an organized and detailed study of the work done in the area of outlier detection with respect to such temporal datasets. In this survey, we provide a comprehensive and structured overview of a large set of interesting outlier definitions for various forms of temporal data, novel techniques, and application scenarios in which specific definitions and techniques have been widely used.
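The survey covers many families of detection techniques; purely as a minimal illustration of the time-series case, the sketch below flags points that deviate strongly from a trailing rolling mean (a simple z-score style detector, not a method taken from the survey; the data here is synthetic).

```python
import numpy as np

def rolling_zscore_outliers(series, window=20, threshold=3.0):
    """Flag points that deviate strongly from a trailing rolling mean.

    A toy deviation-based temporal outlier detector; the methods in the
    survey are far more sophisticated (streams, spatio-temporal data, etc.).
    """
    series = np.asarray(series, dtype=float)
    outliers = []
    for i in range(window, len(series)):
        hist = series[i - window:i]            # trailing window only
        mu, sigma = hist.mean(), hist.std()
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            outliers.append(i)
    return outliers

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(0, 1, 500)
    data[250] += 8.0                            # injected anomaly
    print(rolling_zscore_outliers(data))        # expected to include 250
```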

851 citations


Journal ArticleDOI
TL;DR: The data quality problem in the context of supply chain management (SCM) is introduced and methods for monitoring and controlling data quality are proposed and highlighted.

652 citations


Proceedings ArticleDOI
19 Jun 2014
TL;DR: The big data benchmark suite BigDataBench not only covers broad application scenarios but also includes diverse and representative data sets; the authors comprehensively characterize 19 big data workloads included in BigDataBench with varying data inputs.
Abstract: As architecture, systems, and data management communities pay greater attention to innovative big data systems and architecture, the pressure of benchmarking and evaluating these systems rises. However, the complexity, diversity, frequently changed workloads, and rapid evolution of big data systems raise great challenges in big data benchmarking. Considering the broad use of big data systems, for the sake of fairness, big data benchmarks must include diversity of data and workloads, which is the prerequisite for evaluating big data systems and architecture. Most of the state-of-the-art big data benchmarking efforts target evaluating specific types of applications or system software stacks, and hence they are not qualified for serving the purposes mentioned above.

529 citations


Book
30 Apr 2014
TL;DR: In this article, the authors outline the background and overall vision for the Internet of Things (IoT) and Machine-to-Machine (M2M) communications and services, including major standards.
Abstract: This book outlines the background and overall vision for the Internet of Things (IoT) and Machine-to-Machine (M2M) communications and services, including major standards. Key technologies are described, from the physical instrumentation of devices to the cloud infrastructures used to collect data, along with how to derive information and knowledge, how to integrate it into enterprise processes, and the relevant system architectures and regulatory requirements. Real-world service use case studies provide the hands-on knowledge needed to successfully develop and implement M2M and IoT technologies sustainably and profitably. Finally, the future vision for M2M technologies is described, including prospective changes in relevant standards. The book is written by experts in the technology and business aspects of Machine-to-Machine and the Internet of Things who have experience in implementing solutions.
Standards included: ETSI M2M, IEEE 802.15.4, 3GPP (GPRS, 3G, 4G), Bluetooth Low Energy/Smart, IETF 6LoWPAN, IETF CoAP, IETF RPL, Power Line Communication, Open Geospatial Consortium (OGC) Sensor Web Enablement (SWE), ZigBee, 802.11, Broadband Forum TR-069, Open Mobile Alliance (OMA) Device Management (DM), ISA100.11a, WirelessHART, M-BUS, Wireless M-BUS, KNX, RFID, Object Management Group (OMG) Business Process Modelling Notation (BPMN).
Key technologies for M2M and IoT covered: embedded systems hardware and software, devices and gateways, capillary and M2M area networks, local and wide area networking, M2M service enablement, IoT data management and data warehousing, data analytics and big data, complex event processing and stream analytics, knowledge discovery and management, business process and enterprise integration, Software as a Service and cloud computing.
The book combines technical explanations with design features of M2M/IoT and use cases to help readers develop solutions that work in the real world; gives a detailed description of the network architectures and technologies that form the basis of M2M and IoT; provides clear guidelines and examples of M2M and IoT use cases from real-world implementations such as Smart Grid, Smart Buildings, Smart Cities, Participatory Sensing, and Industrial Automation; and describes the vision for M2M and its evolution towards IoT.

488 citations


Journal ArticleDOI
24 Mar 2014
TL;DR: The HMORN VDW data model, its governance principles, data content, and quality assurance procedures are highlighted to help those wishing to implement a distributed interoperable health care data system.
Abstract: The HMO Research Network (HMORN) Virtual Data Warehouse (VDW) is a public, non-proprietary, research-focused data model implemented at 17 health care systems across the United States. The HMORN has created a governance structure and specified policies concerning the VDW's content, development, implementation, and quality assurance. Data extracted from the VDW have been used in thousands of studies published in peer-reviewed journal articles. Advances in software supporting care delivery and claims processing and the availability of new data sources have greatly expanded the data available for research, but substantially increased the complexity of data management. The VDW data model incorporates software and data advances to ensure that comprehensive, up-to-date data of known quality are available for research. VDW governance works to accommodate new data and system complexities. This article highlights the HMORN VDW data model, its governance principles, data content, and quality assurance procedures. Our goal is to share the VDW data model and its operations with those wishing to implement a distributed interoperable health care data system.

307 citations


Proceedings ArticleDOI
19 May 2014
TL;DR: Wang et al. as discussed by the authors proposed a secure kNN protocol that protects the confidentiality of the data, user's input query, and data access patterns, and empirically analyzed the efficiency of their protocols through various experiments.
Abstract: For the past decade, query processing on relational data has been studied extensively, and many theoretical and practical solutions to query processing have been proposed under various scenarios. With the recent popularity of cloud computing, users now have the opportunity to outsource their data as well as the data management tasks to the cloud. However, due to the rise of various privacy issues, sensitive data (e.g., medical records) need to be encrypted before being outsourced to the cloud. In addition, query processing tasks should be handled by the cloud; otherwise, there would be little point in outsourcing the data in the first place. Processing queries over encrypted data without the cloud ever decrypting the data is a very challenging task. In this paper, we focus on solving the k-nearest neighbor (kNN) query problem over an encrypted database outsourced to a cloud: a user issues an encrypted query record to the cloud, and the cloud returns the k closest records to the user. We first present a basic scheme and demonstrate that such a naive solution is not secure. To provide better security, we propose a secure kNN protocol that protects the confidentiality of the data, the user's input query, and the data access patterns. Also, we empirically analyze the efficiency of our protocols through various experiments. These results indicate that our secure protocol is very efficient on the user end, and this lightweight scheme allows a user to use any mobile device to perform the kNN query.
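The paper's cryptographic constructions are not reproduced here; the sketch below shows only the plaintext kNN query semantics that the secure protocol is designed to preserve, using Euclidean distance over made-up records (an assumption for illustration).

```python
import numpy as np

def knn_query(database, query, k=3):
    """Return the k records closest to the query under Euclidean distance.

    Plaintext reference semantics only; the paper's contribution is
    computing this result while records, the query, and access patterns
    remain hidden from the cloud.
    """
    db = np.asarray(database, dtype=float)
    q = np.asarray(query, dtype=float)
    dists = np.linalg.norm(db - q, axis=1)
    return db[np.argsort(dists)[:k]]

if __name__ == "__main__":
    records = [[70, 120], [65, 110], [90, 150], [72, 118], [88, 140]]
    print(knn_query(records, [71, 119], k=2))
```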

285 citations


Book ChapterDOI
19 Oct 2014
TL;DR: This work performs an in-depth experimental analysis showing that existing SPARQL benchmarks are not suitable for testing systems on diverse queries and varied workloads, and addresses these shortcomings with the Waterloo SPARQL Diversity Test Suite (WatDiv), which provides stress-testing tools for RDF data management systems.
Abstract: The Resource Description Framework (RDF) is a standard for conceptually describing data on the Web, and SPARQL is the query language for RDF. As RDF data continue to be published across heterogeneous domains and integrated at Web-scale such as in the Linked Open Data (LOD) cloud, RDF data management systems are being exposed to queries that are far more diverse and workloads that are far more varied. The first contribution of our work is an in-depth experimental analysis that shows existing SPARQL benchmarks are not suitable for testing systems for diverse queries and varied workloads. To address these shortcomings, our second contribution is the Waterloo SPARQL Diversity Test Suite (WatDiv) that provides stress testing tools for RDF data management systems. Using WatDiv, we have been able to reveal issues with existing systems that went unnoticed in evaluations using earlier benchmarks. Specifically, our experiments with five popular RDF data management systems show that they cannot deliver good performance uniformly across workloads. For some queries, there can be as much as five orders of magnitude difference between the query execution time of the fastest and the slowest system while the fastest system on one query may unexpectedly time out on another query. By performing a detailed analysis, we pinpoint these problems to specific types of queries and workloads.
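WatDiv ships its own data and query generators, which are not shown here; the fragment below is only a generic illustration of issuing a SPARQL query over RDF data from Python, using rdflib as a stand-in triple store (the graph and query are invented for the example).

```python
from rdflib import Graph

# Tiny in-memory RDF graph in Turtle; WatDiv generates much larger,
# schema-rich datasets and query templates of varying shapes.
turtle = """
@prefix ex: <http://example.org/> .
ex:alice ex:purchased ex:book1 .
ex:book1 ex:hasGenre ex:scifi .
ex:bob   ex:purchased ex:book2 .
ex:book2 ex:hasGenre ex:scifi .
"""

g = Graph()
g.parse(data=turtle, format="turtle")

# A small star-shaped query; benchmark queries vary widely in shape
# (linear, star, snowflake) and selectivity, which is the point of WatDiv.
q = """
PREFIX ex: <http://example.org/>
SELECT ?user ?item WHERE {
    ?user ex:purchased ?item .
    ?item ex:hasGenre ex:scifi .
}
"""
for user, item in g.query(q):
    print(user, item)
```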

274 citations


Journal ArticleDOI
09 Jul 2014-PLOS ONE
TL;DR: openPDS, as presented in this paper, is a personal metadata management framework that allows individuals to collect, store, and give third parties fine-grained access to their metadata; SafeAnswers, a new and practical way of protecting the privacy of metadata at the individual level, turns a hard anonymization problem into a more tractable security one.
Abstract: The rise of smartphones and web services has made possible the large-scale collection of personal metadata. Information about individuals' locations, phone call logs, or web searches is collected and used intensively by organizations and big data researchers. Metadata has, however, yet to realize its full potential. Privacy and legal concerns, as well as the lack of technical solutions for personal metadata management, are preventing metadata from being shared and reconciled under the control of the individual. This lack of access and control is furthermore fueling growing concerns, as it prevents individuals from understanding and managing the risks associated with the collection and use of their data. Our contribution is two-fold: (1) we describe openPDS, a personal metadata management framework that allows individuals to collect, store, and give third parties fine-grained access to their metadata; it has been implemented in two field studies; (2) we introduce and analyze SafeAnswers, a new and practical way of protecting the privacy of metadata at an individual level. SafeAnswers turns a hard anonymization problem into a more tractable security one. It allows services to ask questions whose answers are calculated against the metadata instead of trying to anonymize individuals' metadata. The dimensionality of the data shared with the services is reduced from high-dimensional metadata to low-dimensional answers that are less likely to be re-identifiable and to contain sensitive information. These answers can then be shared directly, individually or in aggregate. openPDS and SafeAnswers provide a new way of dynamically protecting personal metadata, thereby supporting the creation of smart data-driven services and data science research.
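openPDS is a full framework; the toy sketch below only conveys the SafeAnswers idea of computing a coarse, low-dimensional answer inside the individual's data store rather than exporting the raw metadata (all names and data in the sketch are hypothetical).

```python
# Hypothetical, minimal illustration of the SafeAnswers idea: a service
# submits a question, the answer is computed inside the user's personal
# data store, and only a coarse, low-dimensional result leaves it.

raw_location_log = [            # high-dimensional metadata, never shared
    {"day": d, "place": p}
    for d, p in [(1, "home"), (1, "office"), (2, "office"),
                 (3, "gym"), (4, "office"), (5, "office")]
]

def safe_answer_days_at(place, log):
    """Return only an aggregate count, not the underlying records."""
    return len({entry["day"] for entry in log if entry["place"] == place})

# The service learns a single number instead of the full trajectory.
print(safe_answer_days_at("office", raw_location_log))   # -> 4
```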

242 citations


Proceedings ArticleDOI
Barna Saha1, Divesh Srivastava1
19 May 2014
TL;DR: This tutorial presents recent results that are relevant to big data quality management, focusing on the two major dimensions of (i) discovering quality issues from the data itself and (ii) trading off accuracy vs. efficiency, and identifies a range of open problems for the community.
Abstract: In our Big Data era, data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Recent studies have shown that poor quality data is prevalent in large databases and on the Web. Since poor quality data can have serious consequences on the results of data analyses, the importance of veracity, the fourth 'V' of big data, is increasingly being recognized. In this tutorial, we highlight the substantial challenges that the first three 'V's, volume, velocity and variety, bring to dealing with veracity in big data. Due to the sheer volume and velocity of data, one needs to understand and (possibly) repair erroneous data in a scalable and timely manner. With the variety of data, often from a diversity of sources, data quality rules cannot be specified a priori; one needs to let the data "speak for itself" in order to discover the semantics of the data. This tutorial presents recent results that are relevant to big data quality management, focusing on the two major dimensions of (i) discovering quality issues from the data itself and (ii) trading off accuracy vs. efficiency, and identifies a range of open problems for the community.
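As a hedged, minimal example of letting the data speak for itself, the sketch below scans a toy table for violations of a candidate functional dependency (zip determines city); this is a much simpler procedure than the scalable discovery and repair techniques the tutorial surveys.

```python
from collections import defaultdict

# Toy records with a quality problem: zip code 10001 maps to two cities.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newyork"},       # likely dirty value
    {"zip": "94105", "city": "San Francisco"},
    {"zip": "94105", "city": "San Francisco"},
]

def fd_violations(rows, lhs, rhs):
    """Find values of `lhs` that map to more than one `rhs` value,
    i.e. violations of the candidate functional dependency lhs -> rhs."""
    seen = defaultdict(set)
    for r in rows:
        seen[r[lhs]].add(r[rhs])
    return {k: v for k, v in seen.items() if len(v) > 1}

print(fd_violations(rows, "zip", "city"))
# {'10001': {'New York', 'Newyork'}}  -> flag for repair
```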

203 citations


Journal ArticleDOI
TL;DR: This review introduces the basic concepts and procedures of machine-learning applications and envisages how machine learning could interface with Big Data technology to facilitate basic research and biotechnology in the plant sciences.

Journal ArticleDOI
TL;DR: This article aims to provide a comprehensive review of a wide range of proposals and systems that focus fundamentally on supporting distributed data management and processing using the MapReduce framework.
Abstract: MapReduce is a framework for processing and managing large-scale datasets in a distributed cluster, which has been used for applications such as generating search indexes, document clustering, access log analysis, and various other forms of data analytics. MapReduce adopts a flexible computation model with a simple interface consisting of map and reduce functions whose implementations can be customized by application developers. Since its introduction, a substantial amount of research effort has been directed toward making it more usable and efficient for supporting database-centric operations. In this article, we aim to provide a comprehensive review of a wide range of proposals and systems that focus fundamentally on supporting distributed data management and processing using the MapReduce framework.
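The canonical illustration of the map/reduce interface is word count; the sketch below simulates it in-process in Python (no cluster, no Hadoop), purely to show the shape of the two user-supplied functions.

```python
from collections import defaultdict
from itertools import chain

def map_fn(document):
    """map: emit (word, 1) for every word in one input record."""
    return [(word, 1) for word in document.split()]

def reduce_fn(word, counts):
    """reduce: aggregate all values emitted for one key."""
    return word, sum(counts)

def mapreduce(documents):
    # Shuffle phase: group intermediate pairs by key.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(d) for d in documents):
        groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(mapreduce(["big data systems", "data management systems"]))
# {'big': 1, 'data': 2, 'systems': 2, 'management': 1}
```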

BookDOI
22 Nov 2014
TL;DR: This book identifies and addresses new challenges in the field of database security, offering solid advice for practitioners and researchers in industry.
Abstract: The field of database security has expanded greatly, with the rapid development of global inter-networked infrastructure. Databases are no longer stand-alone systems accessible only to internal users of organizations. Today, businesses must allow selective access from different security domains. New data services emerge every day, bringing complex challenges to those whose job is to protect data security. The Internet and the web offer means for collecting and sharing data with unprecedented flexibility and convenience, presenting threats and challenges of their own. This book identifies and addresses these new challenges and more, offering solid advice for practitioners and researchers in industry.

Book
09 Apr 2014
TL;DR: Chapter topics include the importance of managing and sharing research data; the research data lifecycle; research data management planning; documenting and providing context for data; formatting and organizing data; storing and transferring data; legal and ethical issues in sharing data; and rights relating to research data.
Abstract: Table of contents: The importance of managing and sharing research data; The research data lifecycle; Research data management planning; Documenting and providing context for data; Formatting and organizing data; Storing and transferring data; Legal and ethical issues in sharing data; Rights relating to research data; Collaborative research: data management strategies for research teams and research managers; Making use of other people's research data: opportunities and limitations; Publishing and citing research data; Conclusion.

Patent
10 Dec 2014
TL;DR: In this article, a secure data exchange system that includes a data management facility and a plurality of data storage nodes is described, where the data is stored by a user of a first entity and comprises content and metadata.
Abstract: In embodiments, the disclosure provides a secure data exchange system that includes a data management facility and a plurality of data storage nodes. The data management facility manages content sharing between entities of data stored in the data storage nodes, wherein the data is stored by a user of a first entity and comprises content and metadata. The data management facility has access only to the metadata of the user data, and not to the content, for managing the data in the plurality of data storage nodes. The data management facility may be geographically distributed at a plurality of data management sites, and the data storage nodes may exist inside and outside of a firewall of the first entity.

Journal ArticleDOI
TL;DR: Because of the paradigm shift in the kinds of data being analyzed and how this data is used, big data can be considered to be a new, fourth generation of decision support data management.
Abstract: We have entered the big data era. Organizations are capturing, storing, and analyzing data that has high volume, velocity, and variety and comes from a variety of new sources, including social media, machines, log files, video, text, image, RFID, and GPS. These sources have strained the capabilities of traditional relational database management systems and spawned a host of new technologies, approaches, and platforms. The potential value of big data analytics is great and is clearly established by a growing number of studies. The keys to success with big data analytics include a clear business need, strong committed sponsorship, alignment between the business and IT strategies, a fact-based decision-making culture, a strong data infrastructure, the right analytical tools, and people skilled in the use of analytics. Because of the paradigm shift in the kinds of data being analyzed and how this data is used, big data can be considered to be a new, fourth generation of decision support data management. Though the business value from big data is great, especially for online companies like Google and Facebook, how it is being used is raising significant privacy concerns.

Journal ArticleDOI
TL;DR: Current research that takes advantage of "Big Data" in health and biomedical informatics applications is summarized, highlighting ongoing development of powerful new methods for turning that large-scale, and often complex, data into information that provides new insights into human health, in a range of different areas.
Abstract: Objectives: To summarise current research that takes advantage of "Big Data" in health and biomedical informatics applications. Methods: Survey of trends in this work, and exploration of literature describing how large-scale structured and unstructured data sources are being used to support applications from clinical decision making and health policy, to drug design and pharmacovigilance, and further to systems biology and genetics. Results: The survey highlights ongoing development of powerful new methods for turning that large-scale, and often complex, data into information that provides new insights into human health, in a range of different areas. Consideration of this body of work identifies several important paradigm shifts that are facilitated by Big Data resources and methods: in clinical and translational research, from hypothesis-driven research to data-driven research, and in medicine, from evidence-based practice to practice-based evidence. Conclusions: The increasing scale and availability of large quantities of health data require strategies for data management, data linkage, and data integration beyond the limits of many existing information systems, and substantial effort is underway to meet those needs. As our ability to make sense of that data improves, the value of the data will continue to increase. Health systems, genetics and genomics, population and public health: all areas of biomedicine stand to benefit from Big Data and the associated technologies.

Journal ArticleDOI
02 Sep 2014-PLOS ONE
TL;DR: An open-source and extensible R-based data client for pre-processed data from Firehose is presented, and results show that the RTCGAToolbox can facilitate data management for researchers interested in working with TCGA data.
Abstract: Background & Objective: Managing data from large-scale projects (such as The Cancer Genome Atlas (TCGA)) for further analysis is an important and time-consuming step for research projects. Several efforts, such as the Firehose project, make TCGA pre-processed data publicly available via web services and data portals, but this information must be managed, downloaded and prepared for subsequent steps. We have developed an open-source and extensible R-based data client for pre-processed data from Firehose, and demonstrate its use with sample case studies. Results show that our RTCGAToolbox can facilitate data management for researchers interested in working with TCGA data. The RTCGAToolbox can also be integrated with other analysis pipelines for further data processing.

Proceedings ArticleDOI
18 Jun 2014
TL;DR: A principled approach to provide explanations for answers to SQL queries based on intervention: removal of tuples from the database that significantly affect the query answers is introduced.
Abstract: As a consequence of the popularity of big data, many users with a variety of backgrounds seek to extract high level information from datasets collected from various sources and combined using data integration techniques. A major challenge for research in data management is to develop tools to assist users in explaining observed query outputs. In this paper we introduce a principled approach to provide explanations for answers to SQL queries based on intervention: removal of tuples from the database that significantly affect the query answers. We provide a formal definition of intervention in the presence of multiple relations which can interact with each other through foreign keys. First we give a set of recursive rules to compute the intervention for any given explanation in polynomial time (data complexity). Then we give simple and efficient algorithms based on SQL queries that can compute the top-K explanations by using standard database management systems under certain conditions. We evaluate the quality and performance of our approach by experiments on real datasets.
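As a rough, single-relation illustration of explanation by intervention (the paper's framework handles multiple relations joined through foreign keys and computes top-K explanations efficiently), the sketch below removes candidate groups of tuples and measures how much an aggregate query answer changes; the table and values are made up.

```python
import sqlite3

# Hypothetical single-table setup; the paper's approach covers multiple
# relations connected by foreign keys and ranks top-K explanations.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100), ("east", 400), ("west", 50),
                  ("west", 60), ("north", 90)])

QUERY = "SELECT SUM(amount) FROM sales"
baseline = conn.execute(QUERY).fetchone()[0]

# Intervention: for each candidate predicate, temporarily remove the
# matching tuples and measure how strongly the query answer changes.
regions = [r for (r,) in conn.execute("SELECT DISTINCT region FROM sales")]
explanations = []
for region in regions:
    conn.execute("SAVEPOINT trial")
    conn.execute("DELETE FROM sales WHERE region = ?", (region,))
    changed = conn.execute(QUERY).fetchone()[0] or 0.0
    explanations.append((region, baseline - changed))
    conn.execute("ROLLBACK TO trial")   # restore the deleted tuples
    conn.execute("RELEASE trial")

# Rank candidate explanations by their effect on the answer.
print(sorted(explanations, key=lambda e: -e[1]))
# [('east', 500.0), ('west', 110.0), ('north', 90.0)]
```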

Journal ArticleDOI
TL;DR: The way forward with the big data opportunity will require properly applied engineering principles to design studies and applications, to avoid preconceptions or over-enthusiasm, to fully exploit the available technologies, and to improve data processing and data management regulations.
Abstract: Big data are receiving increasing attention in biomedicine and healthcare. It is therefore important to understand why big data are assuming a crucial role for the biomedical informatics community. The capability of handling big data is becoming an enabler to carry out unprecedented research studies and to implement new models of healthcare delivery. Therefore, it is first necessary to deeply understand the four elements that constitute big data, namely Volume, Variety, Velocity, and Veracity, and their meaning in practice. Then, it is mandatory to understand where big data are present, and where they can be beneficially collected. There are research fields, such as translational bioinformatics, which need to rely on big data technologies to withstand the shock wave of data that is generated every day. Other areas, ranging from epidemiology to clinical care, can benefit from the exploitation of the large amounts of data that are nowadays available, from personal monitoring to primary care. However, building big data-enabled systems has significant implications in terms of reproducibility of research studies and management of privacy and data access; proper actions should be taken to deal with these issues. An interesting consequence of the big data scenario is the availability of new software, methods, and tools, such as map-reduce, cloud computing, and concept drift machine learning algorithms, which will not only contribute to big data research, but may be beneficial in many biomedical informatics applications. The way forward with the big data opportunity will require properly applied engineering principles to design studies and applications, to avoid preconceptions or over-enthusiasm, to fully exploit the available technologies, and to improve data processing and data management regulations.

Proceedings ArticleDOI
18 Jun 2014
TL;DR: This paper introduces AIDE, an Automatic Interactive Data Exploration framework that iteratively steers the user towards interesting data areas and predicts a query that retrieves the user's objects of interest, and provides interactive performance by limiting the user wait time per iteration to a few seconds on average.
Abstract: Interactive Data Exploration (IDE) is a key ingredient of a diverse set of discovery-oriented applications, including ones from scientific computing and evidence-based medicine. In these applications, data discovery is a highly ad hoc interactive process where users execute numerous exploration queries using varying predicates, aiming to balance the trade-off between collecting all relevant information and reducing the size of returned data. Therefore, there is a strong need to support these human-in-the-loop applications by assisting their navigation in the data to find interesting objects. In this paper, we introduce AIDE, an Automatic Interactive Data Exploration framework that iteratively steers the user towards interesting data areas and predicts a query that retrieves the user's objects of interest. Our approach leverages relevance feedback on database samples to model user interests and strategically collects more samples to refine the model while minimizing the user effort. AIDE integrates machine learning and data management techniques to provide effective data exploration results (matching the user's interests with high accuracy) as well as high interactive performance. It delivers highly accurate query predictions for very common conjunctive queries with very little user effort, while, given a reasonable number of samples, it can also predict complex conjunctive queries with high accuracy. Furthermore, it provides interactive performance by limiting the user wait time per iteration to a few seconds on average. Our user study indicates that AIDE is a practical exploration framework, as it significantly reduces the user effort and the total exploration time compared with the current state-of-the-art approach of manual exploration.
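AIDE's sampling and query-prediction strategies are the paper's contribution and are not reproduced here; the sketch below only gestures at the general human-in-the-loop pattern (label a batch of samples, fit a simple classifier, and read a predicate off the learned model), using a scikit-learn decision tree as a hypothetical stand-in and a simulated user.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)

# Hypothetical "database" of numeric objects with two attributes.
data = rng.uniform(0, 10, size=(500, 2))

# Simulated user: relevant objects lie in a rectangular region.
def user_labels(samples):
    return ((samples[:, 0] > 2) & (samples[:, 0] < 5) &
            (samples[:, 1] > 1) & (samples[:, 1] < 3)).astype(int)

# One round of "exploration": show random samples, collect feedback, and
# fit a model of the user's interest. AIDE does this iteratively and
# chooses where to sample next; this sketch uses a single random batch.
shown = data[rng.choice(len(data), size=120, replace=False)]
labels = user_labels(shown)

model = DecisionTreeClassifier(max_depth=3).fit(shown, labels)

# The tree's decision rules play the role of the predicted conjunctive
# query that would retrieve the user's objects of interest.
print(export_text(model, feature_names=["attr_x", "attr_y"]))
```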

Journal ArticleDOI
TL;DR: A prototype water resource management IIS is developed which integrates geoinformatics, EIS, and cloud services, and a novel approach to information management is proposed that allows any participant to act as a sensor as well as a contributor to the information warehouse.
Abstract: Water scarcity and floods are major challenges for human society, both present and future. Effective and scientific management of water resources requires a good understanding of water cycles, and a systematic integration of observations can lead to better prediction results. This paper presents an integrated approach to water resource management based on geoinformatics, including technologies such as Remote Sensing (RS), Geographical Information Systems (GIS), Global Positioning Systems (GPS), Enterprise Information Systems (EIS), and cloud services. The paper introduces a prototype IIS called the Water Resource Management Enterprise Information System (WRMEIS) that integrates functions such as data acquisition, data management and sharing, modeling, and knowledge management. A system called SFFEIS (Snowmelt Flood Forecasting Enterprise Information System) based on the WRMEIS structure has been implemented. It includes an operational database, Extraction-Transformation-Loading (ETL), an information warehouse, temporal and spatial analysis, simulation/prediction models, knowledge management, and other functions. In this study, a prototype water resource management IIS is developed which integrates geoinformatics, EIS, and cloud services. It also proposes a novel approach to information management that allows any participant to act as a sensor as well as a contributor to the information warehouse; both users and the public provide data and knowledge. This study highlights the crucial importance of a systematic approach toward IISs for effective resource and environment management.

Proceedings ArticleDOI
18 Jun 2014
TL;DR: It is argued that managing the feature selection process is a pressing data management challenge, and it is shown that it is possible to build a simple cost-based optimizer to automatically select a near-optimal execution plan for feature selection.
Abstract: There is an arms race in the data management industry to support analytics, in which one critical step is feature selection, the process of selecting a feature set that will be used to build a statistical model. Analytics is one of the biggest topics in data management, and feature selection is widely regarded as the most critical step of analytics; thus, we argue that managing the feature selection process is a pressing data management challenge. We study this challenge by describing a feature-selection language and a supporting prototype system that builds on top of current industrial R-integration layers. From our interactions with analysts, we learned that feature selection is an interactive, human-in-the-loop process, which means that feature selection workloads are rife with reuse opportunities. Thus, we study how to materialize portions of this computation using not only classical database materialization optimizations but also methods that have not previously been used in database optimization, including structural decomposition methods (like QR factorization) and warmstart. These new methods have no analog in traditional SQL systems, but they may be interesting for array and scientific database applications. On a diverse set of data sets and programs, we find that traditional database-style approaches that ignore these new opportunities are more than two orders of magnitude slower than an optimal plan in this new tradeoff space across multiple R-backends. Furthermore, we show that it is possible to build a simple cost-based optimizer to automatically select a near-optimal execution plan for feature selection.
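The system's optimizer and materialization strategies (QR reuse, warmstart, cost-based planning) go well beyond this, but the sketch below shows the basic reuse opportunity for linear models: materialize the Gram matrix X^T X and the vector X^T y once, then fit many candidate feature subsets by slicing those statistics instead of rescanning the data. The data and subsets are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 8
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(scale=0.1, size=n)

# Materialize sufficient statistics once (a single pass over the data).
G = X.T @ X          # d x d Gram matrix
c = X.T @ y          # d-vector

def fit_subset(features):
    """Least-squares fit restricted to a feature subset, using only the
    cached statistics; no further scans of X are needed."""
    idx = np.array(features)
    return np.linalg.solve(G[np.ix_(idx, idx)], c[idx])

# Exploring many candidate feature sets now touches only small matrices.
for subset in ([0, 1, 2], [0, 1, 2, 5], [3, 4]):
    print(subset, np.round(fit_subset(subset), 3))
```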

01 Jan 2014
TL;DR: Challenges to biodiversity data management along the data life cycle are described and the solution that is currently being developed within the GFBio project is sketched, a collaborative effort of nineteen German research institutions ranging from museums and archives to biodiversity researchers and computer scientists.
Abstract: Biodiversity research brings together the many facets of biological environmental research. Its data management is characterized by integration and is particularly challenging due to the large volume and tremendous heterogeneity of the data. At the same time, it is particularly important: A lot of the data is not reproducible. Once it is gone, potential knowledge that could have been gained from it is irrevocably lost. In this paper, we describe challenges to biodiversity data management along the data life cycle and sketch the solution that is currently being developed within the GFBio project, a collaborative effort of nineteen German research institutions ranging from museums and archives to biodiversity researchers and computer scientists.

Journal ArticleDOI
04 Dec 2014
TL;DR: It is observed that Big Data has now become a defining challenge of our time, and that the database research community is uniquely positioned to address it, with enormous opportunities to make transformative impact.
Abstract: Every few years a group of database researchers meets to discuss the state of database research, its impact on practice, and important new directions. This report summarizes the discussion and conclusions of the eighth such meeting, held October 14-15, 2013, in Irvine, California. It observes that Big Data has now become a defining challenge of our time, and that the database research community is uniquely positioned to address it, with enormous opportunities to make transformative impact. To do so, the report recommends significantly more attention to five research areas: scalable big/fast data infrastructures; coping with diversity in the data management landscape; end-to-end processing and understanding of data; cloud services; and managing the diverse roles of people in the data life cycle.

Book
14 Mar 2014
TL;DR: Research Methods in Public Administration and Public Management is a comprehensive guide to undertaking and using research in public management and administration; it is succinct rather than a complete survey of all research methods.
Abstract: Research in Public Administration and Public Management has distinctive features that influence the choice and application of research methods. The standard research methodologies of the social sciences can be difficult to follow in the complex world of the public sector. In a dynamic political context, the focus lies on solving societal problems whilst also using methodological principles to do scientifically sound research. The second edition of Research Methods in Public Administration and Public Management represents a comprehensive guide to undertaking and using research in Public Management and Administration. It is succinct but covers a wide variety of research strategies, including action research, experiments, case studies, desk research, systematic literature reviews and more. It pays attention to issues of design, sampling, research ethics and data management. This textbook explains the role of theory and also offers many international examples and practical exercises. It takes the reader through the journey of research, starting with the problem definition, choice of theory, research design options and tools to achieve impactful research. New and revised material includes, but is not limited to: A closer look at popular methods like the experiment and the systematic literature review; A deeper examination of research ethics and data management; New examples from a wide range of countries; Updated 'Further Reading' material and additional useful websites. This exciting new edition will be core reading for students at all levels as well as practitioners who are carrying out research on Public Management and Administration.

01 Jan 2014
TL;DR: Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers, designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.
Abstract: The term 'Big Data' describes innovative techniques and technologies to capture, store, distribute, manage and analyze petabyte- or larger-sized datasets with high velocity and diverse structures. Big data can be structured, unstructured or semi-structured, which conventional data management methods cannot handle effectively. Data is generated from a variety of sources and can arrive in the system at various rates. In order to process these large amounts of data in an inexpensive and efficient way, parallelism is used. Big Data is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it. Hadoop is the core platform for structuring Big Data, and solves the problem of making it useful for analytics purposes. Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.

Proceedings ArticleDOI
18 Jun 2014
TL;DR: This interactive demonstration will guide visitors through an exploration of several key Myria features by interfacing with the live system to analyze big datasets over the web.
Abstract: In this demonstration, we will showcase Myria, our novel cloud service for big data management and analytics designed to improve productivity. Myria's goal is for users to simply upload their data and for the system to help them be self-sufficient data science experts on their data -- self-serve analytics. Using a web browser, Myria users can upload data, author efficient queries to process and explore the data, and debug correctness and performance issues. Myria queries are executed on a scalable, parallel cluster that uses both state-of-the-art and novel methods for distributed query processing. Our interactive demonstration will guide visitors through an exploration of several key Myria features by interfacing with the live system to analyze big datasets over the web.

Proceedings ArticleDOI
18 Jun 2014
TL;DR: The design of a new scientific data analysis system that efficiently processes queries directly over data stored in the HDF5 file format is presented, which eliminates the tedious and error-prone data loading process, and makes the query results readily available to the next processing steps of the analysis workflow.
Abstract: Scientific experiments and large-scale simulations produce massive amounts of data. Many of these scientific datasets are arrays, and are stored in file formats such as HDF5 and NetCDF. Although scientific data management systems, such as SciDB, are designed to manipulate arrays, there are challenges in integrating these systems into existing analysis workflows. Major barriers include the expensive task of preparing and loading data before querying, and converting the final results to a format that is understood by the existing post-processing and visualization tools. As a consequence, integrating a data management system into an existing scientific data analysis workflow is time-consuming and requires extensive user involvement. In this paper, we present the design of a new scientific data analysis system that efficiently processes queries directly over data stored in the HDF5 file format. This design choice eliminates the tedious and error-prone data loading process, and makes the query results readily available to the next processing steps of the analysis workflow. Our design leverages the increasing main memory capacities found in supercomputers through bitmap indexing and in-memory query execution. In addition, query processing over the HDF5 data format can be effortlessly parallelized to utilize the ample concurrency available in large-scale supercomputers and modern parallel file systems. We evaluate the performance of our system on a large supercomputing system and experiment with both a synthetic dataset and a real cosmology observation dataset. Our system frequently outperforms the relational database system that the cosmology team currently uses, and is more than 10X faster than Hive when processing data in parallel. Overall, by eliminating the data loading step, our query processing system is more effective in supporting in situ scientific analysis workflows.
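The described system adds bitmap indexing and parallel in-memory execution; the sketch below shows only the underlying idea of querying HDF5 data in place with h5py and NumPy boolean masks, with no load step into a separate DBMS (the file, dataset names, and predicate are invented for the example).

```python
import h5py
import numpy as np

# Create a small HDF5 file standing in for simulation output.
with h5py.File("particles.h5", "w") as f:
    rng = np.random.default_rng(0)
    f.create_dataset("energy", data=rng.exponential(1.0, size=1_000_000))
    f.create_dataset("mass",   data=rng.normal(5.0, 1.0, size=1_000_000))

# Query the arrays directly from the file: no loading into a database.
# (The paper's system adds bitmap indexes and parallel execution so that
# highly selective queries avoid scanning entire datasets.)
with h5py.File("particles.h5", "r") as f:
    energy = f["energy"][:]
    mass = f["mass"][:]
    selected = (energy > 6.0) & (mass < 4.0)      # selection predicate
    print("qualifying records:", int(selected.sum()))
    print("mean energy of result:", float(energy[selected].mean()))
```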

Journal ArticleDOI
01 Aug 2014
TL;DR: This work reproduces performance and scalability benchmarking experiments of HBase and Cassandra that have been conducted by previous research and compares the results.
Abstract: Distributed database system performance benchmarks are an important source of information for decision makers who must select the right technology for their data management problems. Since important decisions rely on trustworthy experimental data, it is necessary to reproduce experiments and verify the results. We reproduce performance and scalability benchmarking experiments of HBase and Cassandra that have been conducted by previous research and compare the results. The scope of our reproduced experiments is extended with a performance evaluation of Cassandra on different Amazon EC2 infrastructure configurations, and an evaluation of Cassandra and HBase elasticity by measuring scaling speed and performance impact while scaling.