
Showing papers on "Data management" published in 2014


Journal ArticleDOI
TL;DR: A framework for the realization of smart cities through the Internet of Things (IoT), which encompasses the complete urban information system, from the sensory level and networking support structure through to data management and Cloud-based integration of respective systems and services, and forms a transformational part of the existing cyber-physical system.
Abstract: Increasing population density in urban centers demands adequate provision of services and infrastructure to meet the needs of city inhabitants, encompassing residents, workers, and visitors. The utilization of information and communications technologies to achieve this objective presents an opportunity for the development of smart cities, where city management and citizens are given access to a wealth of real-time information about the urban environment upon which to base decisions, actions, and future planning. This paper presents a framework for the realization of smart cities through the Internet of Things (IoT). The framework encompasses the complete urban information system, from the sensory level and networking support structure through to data management and Cloud-based integration of respective systems and services, and forms a transformational part of the existing cyber-physical system. This IoT vision for a smart city is applied to a noise mapping case study to illustrate a new method for existing operations that can be adapted for the enhancement and delivery of important city services.

1,178 citations


Journal ArticleDOI
TL;DR: A comprehensive and structured overview of a large set of interesting outlier definitions for various forms of temporal data, novel techniques, and application scenarios in which specific definitions and techniques have been widely used is provided.
Abstract: In the statistics community, outlier detection for time series data has been studied for decades. Recently, with advances in hardware and software technology, there has been a large body of work on temporal outlier detection from a computational perspective within the computer science community. In particular, advances in hardware technology have enabled the availability of various forms of temporal data collection mechanisms, and advances in software technology have enabled a variety of data management mechanisms. This has fueled the growth of different kinds of data sets such as data streams, spatio-temporal data, distributed streams, temporal networks, and time series data, generated by a multitude of applications. There arises a need for an organized and detailed study of the work done in the area of outlier detection with respect to such temporal datasets. In this survey, we provide a comprehensive and structured overview of a large set of interesting outlier definitions for various forms of temporal data, novel techniques, and application scenarios in which specific definitions and techniques have been widely used.
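The survey covers many families of detection techniques; purely as a minimal illustration of the time-series case, the sketch below flags points that deviate strongly from a trailing rolling mean (a simple z-score style detector, not a method taken from the survey; the data here is synthetic).

```python
import numpy as np

def rolling_zscore_outliers(series, window=20, threshold=3.0):
    """Flag points that deviate strongly from a trailing rolling mean.

    A toy deviation-based temporal outlier detector; the methods in the
    survey are far more sophisticated (streams, spatio-temporal data, etc.).
    """
    series = np.asarray(series, dtype=float)
    outliers = []
    for i in range(window, len(series)):
        hist = series[i - window:i]            # trailing window only
        mu, sigma = hist.mean(), hist.std()
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            outliers.append(i)
    return outliers

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(0, 1, 500)
    data[250] += 8.0                            # injected anomaly
    print(rolling_zscore_outliers(data))        # expected to include 250
```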

851 citations


Journal ArticleDOI
TL;DR: The data quality problem in the context of supply chain management (SCM) is introduced and methods for monitoring and controlling data quality are proposed and highlighted.

652 citations


Proceedings ArticleDOI
19 Jun 2014
TL;DR: The big data benchmark suite BigDataBench not only covers broad application scenarios but also includes diverse and representative data sets; the authors comprehensively characterize 19 big data workloads included in BigDataBench with varying data inputs.
Abstract: As architecture, systems, and data management communities pay greater attention to innovative big data systems and architecture, the pressure of benchmarking and evaluating these systems rises. However, the complexity, diversity, frequently changed workloads, and rapid evolution of big data systems raise great challenges in big data benchmarking. Considering the broad use of big data systems, for the sake of fairness, big data benchmarks must include diversity of data and workloads, which is the prerequisite for evaluating big data systems and architecture. Most of the state-of-the-art big data benchmarking efforts target evaluating specific types of applications or system software stacks, and hence they are not qualified for serving the purposes mentioned above.

529 citations


Book
30 Apr 2014
TL;DR: In this article, the authors outline the background and overall vision for the Internet of Things (IoT) and Machine-to-Machine (M2M) communications and services, including major standards.
Abstract: This book outlines the background and overall vision for the Internet of Things (IoT) and Machine-to-Machine (M2M) communications and services, including major standards. Key technologies are described, from the physical instrumentation of devices to the cloud infrastructures used to collect data, along with how to derive information and knowledge, how to integrate it into enterprise processes, and the relevant system architectures and regulatory requirements. Real-world service use case studies provide the hands-on knowledge needed to successfully develop and implement M2M and IoT technologies sustainably and profitably. Finally, the future vision for M2M technologies is described, including prospective changes in relevant standards. The book is written by experts in the technology and business aspects of Machine-to-Machine and the Internet of Things who have experience in implementing solutions.
Standards included: ETSI M2M, IEEE 802.15.4, 3GPP (GPRS, 3G, 4G), Bluetooth Low Energy/Smart, IETF 6LoWPAN, IETF CoAP, IETF RPL, Power Line Communication, Open Geospatial Consortium (OGC) Sensor Web Enablement (SWE), ZigBee, 802.11, Broadband Forum TR-069, Open Mobile Alliance (OMA) Device Management (DM), ISA100.11a, WirelessHART, M-BUS, Wireless M-BUS, KNX, RFID, Object Management Group (OMG) Business Process Modelling Notation (BPMN).
Key technologies for M2M and IoT covered: embedded systems hardware and software, devices and gateways, capillary and M2M area networks, local and wide area networking, M2M service enablement, IoT data management and data warehousing, data analytics and big data, complex event processing and stream analytics, knowledge discovery and management, business process and enterprise integration, Software as a Service and cloud computing.
The book combines technical explanations with design features of M2M/IoT and use cases to help readers develop solutions that work in the real world; gives a detailed description of the network architectures and technologies that form the basis of M2M and IoT; provides clear guidelines and examples of M2M and IoT use cases from real-world implementations such as Smart Grid, Smart Buildings, Smart Cities, Participatory Sensing, and Industrial Automation; and describes the vision for M2M and its evolution towards IoT.

488 citations


Journal ArticleDOI
24 Mar 2014
TL;DR: The HMORN VDW data model, its governance principles, data content, and quality assurance procedures are highlighted to help those wishing to implement a distributed interoperable health care data system.
Abstract: The HMO Research Network (HMORN) Virtual Data Warehouse (VDW) is a public, non-proprietary, research-focused data model implemented at 17 health care systems across the United States. The HMORN has created a governance structure and specified policies concerning the VDW's content, development, implementation, and quality assurance. Data extracted from the VDW have been used in thousands of studies published in peer-reviewed journal articles. Advances in software supporting care delivery and claims processing and the availability of new data sources have greatly expanded the data available for research, but substantially increased the complexity of data management. The VDW data model incorporates software and data advances to ensure that comprehensive, up-to-date data of known quality are available for research. VDW governance works to accommodate new data and system complexities. This article highlights the HMORN VDW data model, its governance principles, data content, and quality assurance procedures. Our goal is to share the VDW data model and its operations with those wishing to implement a distributed interoperable health care data system.

307 citations


Proceedings ArticleDOI
19 May 2014
TL;DR: Wang et al. as discussed by the authors proposed a secure kNN protocol that protects the confidentiality of the data, user's input query, and data access patterns, and empirically analyzed the efficiency of their protocols through various experiments.
Abstract: For the past decade, query processing on relational data has been studied extensively, and many theoretical and practical solutions to query processing have been proposed under various scenarios. With the recent popularity of cloud computing, users now have the opportunity to outsource their data as well as the data management tasks to the cloud. However, due to the rise of various privacy issues, sensitive data (e.g., medical records) need to be encrypted before being outsourced to the cloud. In addition, query processing tasks should be handled by the cloud; otherwise, there would be little point in outsourcing the data in the first place. Processing queries over encrypted data without the cloud ever decrypting the data is a very challenging task. In this paper, we focus on solving the k-nearest neighbor (kNN) query problem over an encrypted database outsourced to a cloud: a user issues an encrypted query record to the cloud, and the cloud returns the k closest records to the user. We first present a basic scheme and demonstrate that such a naive solution is not secure. To provide better security, we propose a secure kNN protocol that protects the confidentiality of the data, the user's input query, and the data access patterns. Also, we empirically analyze the efficiency of our protocols through various experiments. These results indicate that our secure protocol is very efficient on the user end, and this lightweight scheme allows a user to use any mobile device to perform the kNN query.
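The paper's cryptographic constructions are not reproduced here; the sketch below shows only the plaintext kNN query semantics that the secure protocol is designed to preserve, using Euclidean distance over made-up records (an assumption for illustration).

```python
import numpy as np

def knn_query(database, query, k=3):
    """Return the k records closest to the query under Euclidean distance.

    Plaintext reference semantics only; the paper's contribution is
    computing this result while records, the query, and access patterns
    remain hidden from the cloud.
    """
    db = np.asarray(database, dtype=float)
    q = np.asarray(query, dtype=float)
    dists = np.linalg.norm(db - q, axis=1)
    return db[np.argsort(dists)[:k]]

if __name__ == "__main__":
    records = [[70, 120], [65, 110], [90, 150], [72, 118], [88, 140]]
    print(knn_query(records, [71, 119], k=2))
```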

285 citations


Book ChapterDOI
19 Oct 2014
TL;DR: This work performs an in-depth experimental analysis showing that existing SPARQL benchmarks are not suitable for testing systems on diverse queries and varied workloads, and addresses these shortcomings with the Waterloo SPARQL Diversity Test Suite (WatDiv), which provides stress-testing tools for RDF data management systems.
Abstract: The Resource Description Framework (RDF) is a standard for conceptually describing data on the Web, and SPARQL is the query language for RDF. As RDF data continue to be published across heterogeneous domains and integrated at Web-scale such as in the Linked Open Data (LOD) cloud, RDF data management systems are being exposed to queries that are far more diverse and workloads that are far more varied. The first contribution of our work is an in-depth experimental analysis that shows existing SPARQL benchmarks are not suitable for testing systems for diverse queries and varied workloads. To address these shortcomings, our second contribution is the Waterloo SPARQL Diversity Test Suite (WatDiv) that provides stress testing tools for RDF data management systems. Using WatDiv, we have been able to reveal issues with existing systems that went unnoticed in evaluations using earlier benchmarks. Specifically, our experiments with five popular RDF data management systems show that they cannot deliver good performance uniformly across workloads. For some queries, there can be as much as five orders of magnitude difference between the query execution time of the fastest and the slowest system while the fastest system on one query may unexpectedly time out on another query. By performing a detailed analysis, we pinpoint these problems to specific types of queries and workloads.
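WatDiv ships its own data and query generators, which are not shown here; the fragment below is only a generic illustration of issuing a SPARQL query over RDF data from Python, using rdflib as a stand-in triple store (the graph and query are invented for the example).

```python
from rdflib import Graph

# Tiny in-memory RDF graph in Turtle; WatDiv generates much larger,
# schema-rich datasets and query templates of varying shapes.
turtle = """
@prefix ex: <http://example.org/> .
ex:alice ex:purchased ex:book1 .
ex:book1 ex:hasGenre ex:scifi .
ex:bob   ex:purchased ex:book2 .
ex:book2 ex:hasGenre ex:scifi .
"""

g = Graph()
g.parse(data=turtle, format="turtle")

# A small star-shaped query; benchmark queries vary widely in shape
# (linear, star, snowflake) and selectivity, which is the point of WatDiv.
q = """
PREFIX ex: <http://example.org/>
SELECT ?user ?item WHERE {
    ?user ex:purchased ?item .
    ?item ex:hasGenre ex:scifi .
}
"""
for user, item in g.query(q):
    print(user, item)
```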

274 citations


Journal ArticleDOI
09 Jul 2014-PLOS ONE
TL;DR: openPDS, as presented in this paper, is a personal metadata management framework that allows individuals to collect, store, and give third parties fine-grained access to their metadata; SafeAnswers, a new and practical way of protecting the privacy of metadata at the individual level, turns a hard anonymization problem into a more tractable security one.
Abstract: The rise of smartphones and web services has made possible the large-scale collection of personal metadata. Information about individuals' locations, phone call logs, or web searches is collected and used intensively by organizations and big data researchers. Metadata has, however, yet to realize its full potential. Privacy and legal concerns, as well as the lack of technical solutions for personal metadata management, are preventing metadata from being shared and reconciled under the control of the individual. This lack of access and control is furthermore fueling growing concerns, as it prevents individuals from understanding and managing the risks associated with the collection and use of their data. Our contribution is two-fold: (1) we describe openPDS, a personal metadata management framework that allows individuals to collect, store, and give third parties fine-grained access to their metadata; it has been implemented in two field studies; (2) we introduce and analyze SafeAnswers, a new and practical way of protecting the privacy of metadata at an individual level. SafeAnswers turns a hard anonymization problem into a more tractable security one. It allows services to ask questions whose answers are calculated against the metadata instead of trying to anonymize individuals' metadata. The dimensionality of the data shared with the services is reduced from high-dimensional metadata to low-dimensional answers that are less likely to be re-identifiable and to contain sensitive information. These answers can then be shared directly, individually or in aggregate. openPDS and SafeAnswers provide a new way of dynamically protecting personal metadata, thereby supporting the creation of smart data-driven services and data science research.
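openPDS is a full framework; the toy sketch below only conveys the SafeAnswers idea of computing a coarse, low-dimensional answer inside the individual's data store rather than exporting the raw metadata (all names and data in the sketch are hypothetical).

```python
# Hypothetical, minimal illustration of the SafeAnswers idea: a service
# submits a question, the answer is computed inside the user's personal
# data store, and only a coarse, low-dimensional result leaves it.

raw_location_log = [            # high-dimensional metadata, never shared
    {"day": d, "place": p}
    for d, p in [(1, "home"), (1, "office"), (2, "office"),
                 (3, "gym"), (4, "office"), (5, "office")]
]

def safe_answer_days_at(place, log):
    """Return only an aggregate count, not the underlying records."""
    return len({entry["day"] for entry in log if entry["place"] == place})

# The service learns a single number instead of the full trajectory.
print(safe_answer_days_at("office", raw_location_log))   # -> 4
```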

242 citations


Proceedings ArticleDOI
Barna Saha1, Divesh Srivastava1
19 May 2014
TL;DR: This tutorial presents recent results that are relevant to big data quality management, focusing on the two major dimensions of (i) discovering quality issues from the data itself and (ii) trading off accuracy vs. efficiency, and identifies a range of open problems for the community.
Abstract: In our Big Data era, data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Recent studies have shown that poor quality data is prevalent in large databases and on the Web. Since poor quality data can have serious consequences on the results of data analyses, the importance of veracity, the fourth 'V' of big data, is increasingly being recognized. In this tutorial, we highlight the substantial challenges that the first three 'V's, volume, velocity and variety, bring to dealing with veracity in big data. Due to the sheer volume and velocity of data, one needs to understand and (possibly) repair erroneous data in a scalable and timely manner. With the variety of data, often from a diversity of sources, data quality rules cannot be specified a priori; one needs to let the data "speak for itself" in order to discover the semantics of the data. This tutorial presents recent results that are relevant to big data quality management, focusing on the two major dimensions of (i) discovering quality issues from the data itself and (ii) trading off accuracy vs. efficiency, and identifies a range of open problems for the community.
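As a hedged, minimal example of letting the data speak for itself, the sketch below scans a toy table for violations of a candidate functional dependency (zip determines city); this is a much simpler procedure than the scalable discovery and repair techniques the tutorial surveys.

```python
from collections import defaultdict

# Toy records with a quality problem: zip code 10001 maps to two cities.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newyork"},       # likely dirty value
    {"zip": "94105", "city": "San Francisco"},
    {"zip": "94105", "city": "San Francisco"},
]

def fd_violations(rows, lhs, rhs):
    """Find values of `lhs` that map to more than one `rhs` value,
    i.e. violations of the candidate functional dependency lhs -> rhs."""
    seen = defaultdict(set)
    for r in rows:
        seen[r[lhs]].add(r[rhs])
    return {k: v for k, v in seen.items() if len(v) > 1}

print(fd_violations(rows, "zip", "city"))
# {'10001': {'New York', 'Newyork'}}  -> flag for repair
```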

203 citations


Journal ArticleDOI
TL;DR: This review introduces the basic concepts and procedures of machine-learning applications and envisages how machine learning could interface with Big Data technology to facilitate basic research and biotechnology in the plant sciences.

Journal ArticleDOI
TL;DR: This article aims to provide a comprehensive review of a wide range of proposals and systems that focus fundamentally on supporting distributed data management and processing using the MapReduce framework.
Abstract: MapReduce is a framework for processing and managing large-scale datasets in a distributed cluster, which has been used for applications such as generating search indexes, document clustering, access log analysis, and various other forms of data analytics. MapReduce adopts a flexible computation model with a simple interface consisting of map and reduce functions whose implementations can be customized by application developers. Since its introduction, a substantial amount of research effort has been directed toward making it more usable and efficient for supporting database-centric operations. In this article, we aim to provide a comprehensive review of a wide range of proposals and systems that focus fundamentally on supporting distributed data management and processing using the MapReduce framework.
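The canonical illustration of the map/reduce interface is word count; the sketch below simulates it in-process in Python (no cluster, no Hadoop), purely to show the shape of the two user-supplied functions.

```python
from collections import defaultdict
from itertools import chain

def map_fn(document):
    """map: emit (word, 1) for every word in one input record."""
    return [(word, 1) for word in document.split()]

def reduce_fn(word, counts):
    """reduce: aggregate all values emitted for one key."""
    return word, sum(counts)

def mapreduce(documents):
    # Shuffle phase: group intermediate pairs by key.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(d) for d in documents):
        groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(mapreduce(["big data systems", "data management systems"]))
# {'big': 1, 'data': 2, 'systems': 2, 'management': 1}
```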

BookDOI
22 Nov 2014
TL;DR: This book identifies and addresses new challenges in the field of database security, offering solid advice for practitioners and researchers in industry.
Abstract: The field of database security has expanded greatly, with the rapid development of global inter-networked infrastructure. Databases are no longer stand-alone systems accessible only to internal users of organizations. Today, businesses must allow selective access from different security domains. New data services emerge every day, bringing complex challenges to those whose job is to protect data security. The Internet and the web offer means for collecting and sharing data with unprecedented flexibility and convenience, presenting threats and challenges of their own. This book identifies and addresses these new challenges and more, offering solid advice for practitioners and researchers in industry.

Book
09 Apr 2014
TL;DR: Chapter topics include the importance of managing and sharing research data; the research data lifecycle; research data management planning; documenting and providing context for data; formatting and organizing data; storing and transferring data; legal and ethical issues in sharing data; and rights relating to research data.
Abstract: Table of contents: The importance of managing and sharing research data; The research data lifecycle; Research data management planning; Documenting and providing context for data; Formatting and organizing data; Storing and transferring data; Legal and ethical issues in sharing data; Rights relating to research data; Collaborative research: data management strategies for research teams and research managers; Making use of other people's research data: opportunities and limitations; Publishing and citing research data; Conclusion.

Patent
10 Dec 2014
TL;DR: In this article, a secure data exchange system that includes a data management facility and a plurality of data storage nodes is described, where the data is stored by a user of a first entity and comprises content and metadata.
Abstract: In embodiments, the disclosure provides a secure data exchange system that includes a data management facility and a plurality of data storage nodes. The data management facility manages content sharing between entities of data stored in the data storage nodes, wherein the data is stored by a user of a first entity and comprises content and metadata. The data management facility has access only to the metadata of the user data, and not to the content, for managing the data in the plurality of data storage nodes. The data management facility may be geographically distributed at a plurality of data management sites, and the data storage nodes may exist inside and outside of a firewall of the first entity.

Journal ArticleDOI
TL;DR: Because of the paradigm shift in the kinds of data being analyzed and how this data is used, big data can be considered to be a new, fourth generation of decision support data management.
Abstract: We have entered the big data era. Organizations are capturing, storing, and analyzing data that has high volume, velocity, and variety and comes from a variety of new sources, including social media, machines, log files, video, text, image, RFID, and GPS. These sources have strained the capabilities of traditional relational database management systems and spawned a host of new technologies, approaches, and platforms. The potential value of big data analytics is great and is clearly established by a growing number of studies. The keys to success with big data analytics include a clear business need, strong committed sponsorship, alignment between the business and IT strategies, a fact-based decision-making culture, a strong data infrastructure, the right analytical tools, and people skilled in the use of analytics. Because of the paradigm shift in the kinds of data being analyzed and how this data is used, big data can be considered to be a new, fourth generation of decision support data management. Though the business value from big data is great, especially for online companies like Google and Facebook, how it is being used is raising significant privacy concerns.

Journal ArticleDOI
TL;DR: Current research that takes advantage of "Big Data" in health and biomedical informatics applications is summarized, highlighting ongoing development of powerful new methods for turning that large-scale, and often complex, data into information that provides new insights into human health, in a range of different areas.
Abstract: Objectives: To summarise current research that takes advantage of "Big Data" in health and biomedical informatics applications. Methods: Survey of trends in this work, and exploration of literature describing how large-scale structured and unstructured data sources are being used to support applications from clinical decision making and health policy, to drug design and pharmacovigilance, and further to systems biology and genetics. Results: The survey highlights ongoing development of powerful new methods for turning that large-scale, and often complex, data into information that provides new insights into human health, in a range of different areas. Consideration of this body of work identifies several important paradigm shifts that are facilitated by Big Data resources and methods: in clinical and translational research, from hypothesis-driven research to data-driven research, and in medicine, from evidence-based practice to practice-based evidence. Conclusions: The increasing scale and availability of large quantities of health data require strategies for data management, data linkage, and data integration beyond the limits of many existing information systems, and substantial effort is underway to meet those needs. As our ability to make sense of that data improves, the value of the data will continue to increase. Health systems, genetics and genomics, population and public health: all areas of biomedicine stand to benefit from Big Data and the associated technologies.

Journal ArticleDOI
02 Sep 2014-PLOS ONE
TL;DR: An open-source and extensible R-based data client for pre-processed data from Firehose is presented, and results show that the RTCGAToolbox can facilitate data management for researchers interested in working with TCGA data.
Abstract: Background & Objective: Managing data from large-scale projects (such as The Cancer Genome Atlas (TCGA)) for further analysis is an important and time-consuming step for research projects. Several efforts, such as the Firehose project, make TCGA pre-processed data publicly available via web services and data portals, but this information must be managed, downloaded and prepared for subsequent steps. We have developed an open-source and extensible R-based data client for pre-processed data from Firehose, and demonstrate its use with sample case studies. Results show that our RTCGAToolbox can facilitate data management for researchers interested in working with TCGA data. The RTCGAToolbox can also be integrated with other analysis pipelines for further data processing.

Proceedings ArticleDOI
18 Jun 2014
TL;DR: A principled approach to provide explanations for answers to SQL queries based on intervention: removal of tuples from the database that significantly affect the query answers is introduced.
Abstract: As a consequence of the popularity of big data, many users with a variety of backgrounds seek to extract high level information from datasets collected from various sources and combined using data integration techniques. A major challenge for research in data management is to develop tools to assist users in explaining observed query outputs. In this paper we introduce a principled approach to provide explanations for answers to SQL queries based on intervention: removal of tuples from the database that significantly affect the query answers. We provide a formal definition of intervention in the presence of multiple relations which can interact with each other through foreign keys. First we give a set of recursive rules to compute the intervention for any given explanation in polynomial time (data complexity). Then we give simple and efficient algorithms based on SQL queries that can compute the top-K explanations by using standard database management systems under certain conditions. We evaluate the quality and performance of our approach by experiments on real datasets.
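As a rough, single-relation illustration of explanation by intervention (the paper's framework handles multiple relations joined through foreign keys and computes top-K explanations efficiently), the sketch below removes candidate groups of tuples and measures how much an aggregate query answer changes; the table and values are made up.

```python
import sqlite3

# Hypothetical single-table setup; the paper's approach covers multiple
# relations connected by foreign keys and ranks top-K explanations.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100), ("east", 400), ("west", 50),
                  ("west", 60), ("north", 90)])

QUERY = "SELECT SUM(amount) FROM sales"
baseline = conn.execute(QUERY).fetchone()[0]

# Intervention: for each candidate predicate, temporarily remove the
# matching tuples and measure how strongly the query answer changes.
regions = [r for (r,) in conn.execute("SELECT DISTINCT region FROM sales")]
explanations = []
for region in regions:
    conn.execute("SAVEPOINT trial")
    conn.execute("DELETE FROM sales WHERE region = ?", (region,))
    changed = conn.execute(QUERY).fetchone()[0] or 0.0
    explanations.append((region, baseline - changed))
    conn.execute("ROLLBACK TO trial")   # restore the deleted tuples
    conn.execute("RELEASE trial")

# Rank candidate explanations by their effect on the answer.
print(sorted(explanations, key=lambda e: -e[1]))
# [('east', 500.0), ('west', 110.0), ('north', 90.0)]
```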

Journal ArticleDOI
TL;DR: The way forward with the big data opportunity will require properly applied engineering principles to design studies and applications, to avoid preconceptions or over-enthusiasm, to fully exploit the available technologies, and to improve data processing and data management regulations.
Abstract: Big data are receiving increasing attention in biomedicine and healthcare. It is therefore important to understand why big data are assuming a crucial role for the biomedical informatics community. The capability of handling big data is becoming an enabler to carry out unprecedented research studies and to implement new models of healthcare delivery. Therefore, it is first necessary to deeply understand the four elements that constitute big data, namely Volume, Variety, Velocity, and Veracity, and their meaning in practice. Then, it is mandatory to understand where big data are present, and where they can be beneficially collected. There are research fields, such as translational bioinformatics, which need to rely on big data technologies to withstand the shock wave of data that is generated every day. Other areas, ranging from epidemiology to clinical care, can benefit from the exploitation of the large amounts of data that are nowadays available, from personal monitoring to primary care. However, building big data-enabled systems has significant implications in terms of reproducibility of research studies and management of privacy and data access; proper actions should be taken to deal with these issues. An interesting consequence of the big data scenario is the availability of new software, methods, and tools, such as map-reduce, cloud computing, and concept drift machine learning algorithms, which will not only contribute to big data research, but may be beneficial in many biomedical informatics applications. The way forward with the big data opportunity will require properly applied engineering principles to design studies and applications, to avoid preconceptions or over-enthusiasm, to fully exploit the available technologies, and to improve data processing and data management regulations.

Proceedings ArticleDOI
18 Jun 2014
TL;DR: This paper introduces AIDE, an Automatic Interactive Data Exploration framework that iteratively steers the user towards interesting data areas and predicts a query that retrieves the user's objects of interest, and provides interactive performance by limiting the user wait time per iteration to a few seconds on average.
Abstract: Interactive Data Exploration (IDE) is a key ingredient of a diverse set of discovery-oriented applications, including ones from scientific computing and evidence-based medicine. In these applications, data discovery is a highly ad hoc interactive process where users execute numerous exploration queries using varying predicates, aiming to balance the trade-off between collecting all relevant information and reducing the size of returned data. Therefore, there is a strong need to support these human-in-the-loop applications by assisting their navigation in the data to find interesting objects. In this paper, we introduce AIDE, an Automatic Interactive Data Exploration framework that iteratively steers the user towards interesting data areas and predicts a query that retrieves the user's objects of interest. Our approach leverages relevance feedback on database samples to model user interests and strategically collects more samples to refine the model while minimizing the user effort. AIDE integrates machine learning and data management techniques to provide effective data exploration results (matching the user's interests with high accuracy) as well as high interactive performance. It delivers highly accurate query predictions for very common conjunctive queries with very little user effort, while, given a reasonable number of samples, it can also predict complex conjunctive queries with high accuracy. Furthermore, it provides interactive performance by limiting the user wait time per iteration to a few seconds on average. Our user study indicates that AIDE is a practical exploration framework, as it significantly reduces the user effort and the total exploration time compared with the current state-of-the-art approach of manual exploration.
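AIDE's sampling and query-prediction strategies are the paper's contribution and are not reproduced here; the sketch below only gestures at the general human-in-the-loop pattern (label a batch of samples, fit a simple classifier, and read a predicate off the learned model), using a scikit-learn decision tree as a hypothetical stand-in and a simulated user.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)

# Hypothetical "database" of numeric objects with two attributes.
data = rng.uniform(0, 10, size=(500, 2))

# Simulated user: relevant objects lie in a rectangular region.
def user_labels(samples):
    return ((samples[:, 0] > 2) & (samples[:, 0] < 5) &
            (samples[:, 1] > 1) & (samples[:, 1] < 3)).astype(int)

# One round of "exploration": show random samples, collect feedback, and
# fit a model of the user's interest. AIDE does this iteratively and
# chooses where to sample next; this sketch uses a single random batch.
shown = data[rng.choice(len(data), size=120, replace=False)]
labels = user_labels(shown)

model = DecisionTreeClassifier(max_depth=3).fit(shown, labels)

# The tree's decision rules play the role of the predicted conjunctive
# query that would retrieve the user's objects of interest.
print(export_text(model, feature_names=["attr_x", "attr_y"]))
```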

Journal ArticleDOI
TL;DR: A prototype water resource management IIS is developed which integrates geoinformatics, EIS, and cloud services, and a novel approach to information management is proposed that allows any participant to act as a sensor as well as a contributor to the information warehouse.
Abstract: Water scarcity and floods are major challenges for human society, both present and future. Effective and scientific management of water resources requires a good understanding of water cycles, and a systematic integration of observations can lead to better prediction results. This paper presents an integrated approach to water resource management based on geoinformatics, including technologies such as Remote Sensing (RS), Geographical Information Systems (GIS), Global Positioning Systems (GPS), Enterprise Information Systems (EIS), and cloud services. The paper introduces a prototype IIS called the Water Resource Management Enterprise Information System (WRMEIS) that integrates functions such as data acquisition, data management and sharing, modeling, and knowledge management. A system called SFFEIS (Snowmelt Flood Forecasting Enterprise Information System) based on the WRMEIS structure has been implemented. It includes an operational database, Extraction-Transformation-Loading (ETL), an information warehouse, temporal and spatial analysis, simulation/prediction models, knowledge management, and other functions. In this study, a prototype water resource management IIS is developed which integrates geoinformatics, EIS, and cloud services. It also proposes a novel approach to information management that allows any participant to act as a sensor as well as a contributor to the information warehouse; both users and the public provide data and knowledge. This study highlights the crucial importance of a systematic approach toward IISs for effective resource and environment management.

Proceedings ArticleDOI
18 Jun 2014
TL;DR: It is argued that managing the feature selection process is a pressing data management challenge, and it is shown that it is possible to build a simple cost-based optimizer to automatically select a near-optimal execution plan for feature selection.
Abstract: There is an arms race in the data management industry to support analytics, in which one critical step is feature selection, the process of selecting a feature set that will be used to build a statistical model. Analytics is one of the biggest topics in data management, and feature selection is widely regarded as the most critical step of analytics; thus, we argue that managing the feature selection process is a pressing data management challenge. We study this challenge by describing a feature-selection language and a supporting prototype system that builds on top of current industrial R-integration layers. From our interactions with analysts, we learned that feature selection is an interactive, human-in-the-loop process, which means that feature selection workloads are rife with reuse opportunities. Thus, we study how to materialize portions of this computation using not only classical database materialization optimizations but also methods that have not previously been used in database optimization, including structural decomposition methods (like QR factorization) and warmstart. These new methods have no analog in traditional SQL systems, but they may be interesting for array and scientific database applications. On a diverse set of data sets and programs, we find that traditional database-style approaches that ignore these new opportunities are more than two orders of magnitude slower than an optimal plan in this new tradeoff space across multiple R-backends. Furthermore, we show that it is possible to build a simple cost-based optimizer to automatically select a near-optimal execution plan for feature selection.
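The system's optimizer and materialization strategies (QR reuse, warmstart, cost-based planning) go well beyond this, but the sketch below shows the basic reuse opportunity for linear models: materialize the Gram matrix X^T X and the vector X^T y once, then fit many candidate feature subsets by slicing those statistics instead of rescanning the data. The data and subsets are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 8
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(scale=0.1, size=n)

# Materialize sufficient statistics once (a single pass over the data).
G = X.T @ X          # d x d Gram matrix
c = X.T @ y          # d-vector

def fit_subset(features):
    """Least-squares fit restricted to a feature subset, using only the
    cached statistics; no further scans of X are needed."""
    idx = np.array(features)
    return np.linalg.solve(G[np.ix_(idx, idx)], c[idx])

# Exploring many candidate feature sets now touches only small matrices.
for subset in ([0, 1, 2], [0, 1, 2, 5], [3, 4]):
    print(subset, np.round(fit_subset(subset), 3))
```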

01 Jan 2014
TL;DR: Challenges to biodiversity data management along the data life cycle are described and the solution that is currently being developed within the GFBio project is sketched, a collaborative effort of nineteen German research institutions ranging from museums and archives to biodiversity researchers and computer scientists.
Abstract: Biodiversity research brings together the many facets of biological environmental research. Its data management is characterized by integration and is particularly challenging due to the large volume and tremendous heterogeneity of the data. At the same time, it is particularly important: A lot of the data is not reproducible. Once it is gone, potential knowledge that could have been gained from it is irrevocably lost. In this paper, we describe challenges to biodiversity data management along the data life cycle and sketch the solution that is currently being developed within the GFBio project, a collaborative effort of nineteen German research institutions ranging from museums and archives to biodiversity researchers and computer scientists.

Journal ArticleDOI
04 Dec 2014
TL;DR: It is observed that Big Data has now become a defining challenge of our time, and that the database research community is uniquely positioned to address it, with enormous opportunities to make transformative impact.
Abstract: Every few years a group of database researchers meets to discuss the state of database research, its impact on practice, and important new directions. This report summarizes the discussion and conclusions of the eighth such meeting, held October 14-15, 2013, in Irvine, California. It observes that Big Data has now become a defining challenge of our time, and that the database research community is uniquely positioned to address it, with enormous opportunities to make transformative impact. To do so, the report recommends significantly more attention to five research areas: scalable big/fast data infrastructures; coping with diversity in the data management landscape; end-to-end processing and understanding of data; cloud services; and managing the diverse roles of people in the data life cycle.

Book
14 Mar 2014
TL;DR: Research Methods in Public Administration and Public Management is a comprehensive guide to undertaking and using research in public management and administration; it is succinct rather than a complete survey of all research methods.
Abstract: Research in Public Administration and Public Management has distinctive features that influence the choice and application of research methods. The standard research methodologies of the social sciences can be difficult to follow in the complex world of the public sector. In a dynamic political context, the focus lies on solving societal problems whilst also using methodological principles to do scientifically sound research. The second edition of Research Methods in Public Administration and Public Management represents a comprehensive guide to undertaking and using research in Public Management and Administration. It is succinct but covers a wide variety of research strategies, including action research, experiments, case studies, desk research, systematic literature reviews and more. It pays attention to issues of design, sampling, research ethics and data management. This textbook explains the role of theory and also offers many international examples and practical exercises. It takes the reader through the journey of research, starting with the problem definition, choice of theory, research design options and tools to achieve impactful research. New and revised material includes, but is not limited to: A closer look at popular methods like the experiment and the systematic literature review; A deeper examination of research ethics and data management; New examples from a wide range of countries; Updated 'Further Reading' material and additional useful websites. This exciting new edition will be core reading for students at all levels as well as practitioners who are carrying out research on Public Management and Administration.

01 Jan 2014
TL;DR: Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers, designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.
Abstract: The term 'Big Data' describes innovative techniques and technologies to capture, store, distribute, manage and analyze petabyte- or larger-sized datasets with high velocity and diverse structures. Big data can be structured, unstructured or semi-structured, which conventional data management methods cannot handle effectively. Data is generated from a variety of sources and can arrive in the system at various rates. In order to process these large amounts of data in an inexpensive and efficient way, parallelism is used. Big Data is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it. Hadoop is the core platform for structuring Big Data, and solves the problem of making it useful for analytics purposes. Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.

Proceedings ArticleDOI
18 Jun 2014
TL;DR: This interactive demonstration will guide visitors through an exploration of several key Myria features by interfacing with the live system to analyze big datasets over the web.
Abstract: In this demonstration, we will showcase Myria, our novel cloud service for big data management and analytics designed to improve productivity. Myria's goal is for users to simply upload their data and for the system to help them be self-sufficient data science experts on their data -- self-serve analytics. Using a web browser, Myria users can upload data, author efficient queries to process and explore the data, and debug correctness and performance issues. Myria queries are executed on a scalable, parallel cluster that uses both state-of-the-art and novel methods for distributed query processing. Our interactive demonstration will guide visitors through an exploration of several key Myria features by interfacing with the live system to analyze big datasets over the web.

Proceedings ArticleDOI
18 Jun 2014
TL;DR: The design of a new scientific data analysis system that efficiently processes queries directly over data stored in the HDF5 file format is presented, which eliminates the tedious and error-prone data loading process, and makes the query results readily available to the next processing steps of the analysis workflow.
Abstract: Scientific experiments and large-scale simulations produce massive amounts of data. Many of these scientific datasets are arrays, and are stored in file formats such as HDF5 and NetCDF. Although scientific data management systems, such as SciDB, are designed to manipulate arrays, there are challenges in integrating these systems into existing analysis workflows. Major barriers include the expensive task of preparing and loading data before querying, and converting the final results to a format that is understood by the existing post-processing and visualization tools. As a consequence, integrating a data management system into an existing scientific data analysis workflow is time-consuming and requires extensive user involvement. In this paper, we present the design of a new scientific data analysis system that efficiently processes queries directly over data stored in the HDF5 file format. This design choice eliminates the tedious and error-prone data loading process, and makes the query results readily available to the next processing steps of the analysis workflow. Our design leverages the increasing main memory capacities found in supercomputers through bitmap indexing and in-memory query execution. In addition, query processing over the HDF5 data format can be effortlessly parallelized to utilize the ample concurrency available in large-scale supercomputers and modern parallel file systems. We evaluate the performance of our system on a large supercomputing system and experiment with both a synthetic dataset and a real cosmology observation dataset. Our system frequently outperforms the relational database system that the cosmology team currently uses, and is more than 10X faster than Hive when processing data in parallel. Overall, by eliminating the data loading step, our query processing system is more effective in supporting in situ scientific analysis workflows.
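The described system adds bitmap indexing and parallel in-memory execution; the sketch below shows only the underlying idea of querying HDF5 data in place with h5py and NumPy boolean masks, with no load step into a separate DBMS (the file, dataset names, and predicate are invented for the example).

```python
import h5py
import numpy as np

# Create a small HDF5 file standing in for simulation output.
with h5py.File("particles.h5", "w") as f:
    rng = np.random.default_rng(0)
    f.create_dataset("energy", data=rng.exponential(1.0, size=1_000_000))
    f.create_dataset("mass",   data=rng.normal(5.0, 1.0, size=1_000_000))

# Query the arrays directly from the file: no loading into a database.
# (The paper's system adds bitmap indexes and parallel execution so that
# highly selective queries avoid scanning entire datasets.)
with h5py.File("particles.h5", "r") as f:
    energy = f["energy"][:]
    mass = f["mass"][:]
    selected = (energy > 6.0) & (mass < 4.0)      # selection predicate
    print("qualifying records:", int(selected.sum()))
    print("mean energy of result:", float(energy[selected].mean()))
```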

Journal ArticleDOI
01 Aug 2014
TL;DR: This work reproduces performance and scalability benchmarking experiments of HBase and Cassandra that have been conducted by previous research and compares the results.
Abstract: Distributed database system performance benchmarks are an important source of information for decision makers who must select the right technology for their data management problems. Since important decisions rely on trustworthy experimental data, it is necessary to reproduce experiments and verify the results. We reproduce performance and scalability benchmarking experiments of HBase and Cassandra that have been conducted by previous research and compare the results. The scope of our reproduced experiments is extended with a performance evaluation of Cassandra on different Amazon EC2 infrastructure configurations, and an evaluation of Cassandra and HBase elasticity by measuring scaling speed and performance impact while scaling.