
Showing papers on "Data management" published in 2016


Journal ArticleDOI
TL;DR: A combined IoT-based system for smart city development and urban planning using Big Data analytics is proposed, consisting of various types of sensor deployments, including smart home sensors, vehicular networking, weather and water sensors, smart parking sensors, and surveillance objects.

701 citations


Book ChapterDOI
01 Jan 2016
TL;DR: A general-purpose prototype Data Stream Management System (DSMS), also called STREAM, is built that supports a large class of declarative continuous queries over continuous streams and traditional stored data sets.
Abstract: Traditional database management systems are best equipped to run one-time queries over finite stored data sets. However, many modern applications such as network monitoring, financial analysis, manufacturing, and sensor networks require long-running, or continuous, queries over continuous unbounded streams of data. In the STREAM project at Stanford, we are investigating data management and query processing for this class of applications. As part of the project we are building a general-purpose prototype Data Stream Management System (DSMS), also called STREAM, that supports a large class of declarative continuous queries over continuous streams and traditional stored data sets. The STREAM prototype targets environments where streams may be rapid, stream characteristics and query loads may vary over time, and system resources may be limited.
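
The contrast between one-time and continuous queries is easy to illustrate. Below is a minimal, hypothetical Python sketch (not the STREAM system or its query language) of a continuous query that maintains a sliding-window average over an unbounded stream:

```python
import itertools
from collections import deque

def sliding_avg(stream, window_size=5):
    """Continuous query: emit the mean of the last `window_size` items
    for every new element arriving on the unbounded stream."""
    window = deque(maxlen=window_size)  # old tuples expire automatically
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)

# A stored data set is queried once; a stream is queried continuously.
sensor_readings = itertools.cycle([20.0, 21.5, 19.8, 22.1])  # stand-in for an unbounded source
for avg in itertools.islice(sliding_avg(sensor_readings), 10):
    print(f"windowed average: {avg:.2f}")
```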

510 citations


Book
01 Jan 2016
TL;DR: The author examines the impacts of IT on Individuals, Organizations, and Society through the lens of information technology in the Digital Economy and the web revolution.
Abstract: PART I: IT IN THE ORGANIZATION. 1. Information Technology in the Digital Economy. 2. Information Technologies: Concepts and Management. 3. Strategic Information Systems for Competitive Advantage. PART II: THE WEB REVOLUTION. 4. Network Computing: Discovery, Communication, and Collaboration. 5. E-Business and E-Commerce. 6. Mobile, Wireless, and Pervasive Computing. 7. Transaction Processing, Functional Applications, CRM, and Integration. 8. Supply Chain Management and Enterprise Resource Planning. 9. Online Planning and Business Process Redesign. PART IV: MANAGERIAL AND DECISION SUPPORT SYSTEMS. 10. Knowledge Management. 11. Data Management: Warehousing, Analyzing, Mining, and Visualization. 12. Management Decision Support and Intelligent Systems. 13. Information Technology Economics. 14. Building Information Systems. 15. Managing Information Resources and IT Security. 16. Impacts of IT on Individuals, Organizations, and Society. TECHNOLOGY GUIDES: ONLINE AT WWW.WILEY.COM/COLLEGE/TURBAN. T1. Hardware. T2. Software. T3. Data and Databases. T4. Telecommunications. T5. The Internet and the Web. Glossary. Photo Credits. Global Index. Name/Subject Index.

453 citations


Journal ArticleDOI
TL;DR: Popular IoT cloud platforms are surveyed in light of how they address several service domains, such as application development, device management, system management, heterogeneity management, data management, tools for analysis, deployment, monitoring, visualization, and research.

413 citations


Journal ArticleDOI
TL;DR: iPlant’s platform permits researchers to easily deposit and share their data and deploy new computational tools and analysis workflows, allowing the broader community to easily use and reuse those data and computational analyses.
Abstract: The iPlant Collaborative provides life science research communities access to comprehensive, scalable, and cohesive computational infrastructure for data management; identity management; collaboration tools; and cloud, high-performance, high-throughput computing. iPlant provides training, learning material, and best practice resources to help all researchers make the best use of their data, expand their computational skill set, and effectively manage their data and computation when working as distributed teams. iPlant’s platform permits researchers to easily deposit and share their data and deploy new computational tools and analysis workflows, allowing the broader community to easily use and reuse those data and computational analyses.

283 citations


Journal ArticleDOI
TL;DR: The free and open-source R package camtrapR, a new toolbox for flexible and efficient management of data generated in camera trap-based wildlife studies, is described; it should be most useful to researchers and practitioners who regularly handle large amounts of camera trapping data.
Abstract: Camera trapping is a widely applied method to study mammalian biodiversity and is still gaining popularity. It can quickly generate large amounts of data which need to be managed in an efficient and transparent way that links data acquisition with analytical tools. We describe the free and open-source R package camtrapR, a new toolbox for flexible and efficient management of data generated in camera trap-based wildlife studies. The package implements a complete workflow for processing camera trapping data. It assists in image organization, species and individual identification, data extraction from images, tabulation and visualization of results and export of data for subsequent analyses. There is no limitation to the number of images stored in this data management system; the system is portable and compatible across operating systems. The functions provide extensive automation to minimize data entry mistakes and, apart from species and individual identification, require minimal manual user input. Species and individual identification are performed outside the R environment, either via tags assigned in dedicated image management software or by moving images into species directories. Input for occupancy and (spatial) capture–recapture analyses for density and abundance estimation, for example in the R packages unmarked or secr, is computed in a flexible and reproducible manner. In addition, survey summary reports can be generated, spatial distributions of records can be plotted and exported to GIS software, and single- and two-species activity patterns can be visualized. camtrapR allows for streamlined and flexible camera trap data management and should be most useful to researchers and practitioners who regularly handle large amounts of camera trapping data.
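
camtrapR itself is an R package; purely as a language-neutral illustration of the record-table idea it implements, the Python sketch below tabulates species records from a directory layout in which images have been moved into per-station, per-species folders (file modification time stands in for the EXIF timestamp a real workflow would read):

```python
import csv
import os
from datetime import datetime

def build_record_table(station_dir, out_csv="records.csv"):
    """Walk station/species/image.jpg folders and tabulate one record per image.
    A real camera-trap workflow would read the EXIF DateTimeOriginal tag; the
    file modification time is used here only to keep the sketch stdlib-only."""
    rows = []
    for station in sorted(os.listdir(station_dir)):
        for species in sorted(os.listdir(os.path.join(station_dir, station))):
            species_path = os.path.join(station_dir, station, species)
            for image in sorted(os.listdir(species_path)):
                mtime = os.path.getmtime(os.path.join(species_path, image))
                rows.append({"station": station, "species": species, "image": image,
                             "datetime": datetime.fromtimestamp(mtime).isoformat()})
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["station", "species", "image", "datetime"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```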

255 citations


Journal ArticleDOI
TL;DR: This editorial addresses both the collection and handling of big data and the analytical tools provided by data science for management scholars, and provides a primer or a "starter kit" for potential data science applications in management research.
Abstract: The recent advent of remote sensing, mobile technologies, novel transaction systems, and high-performance computing offers opportunities to understand trends, behaviors, and actions in a manner that has not been previously possible. Researchers can thus leverage "big data" that are generated from a plurality of sources including mobile transactions, wearable technologies, social media, ambient networks, and business transactions. An earlier Academy of Management Journal (AMJ) editorial explored the potential implications for data science in management research and highlighted questions for management scholarship as well as the attendant challenges of data sharing and privacy (George, Haas, & Pentland, 2014). This nascent field is evolving rapidly and at a speed that leaves scholars and practitioners alike attempting to make sense of the emergent opportunities that big data hold. With the promise of big data come questions about the analytical value and thus relevance of these data for theory development, including concerns over the context-specific relevance, its reliability and its validity. To address this challenge, data science is emerging as an interdisciplinary field that combines statistics, data mining, machine learning, and analytics to understand and explain how we can generate analytical insights and prediction models from structured and unstructured big data. Data science emphasizes the systematic study of the organization, properties, and analysis of data and their role in inference, including our confidence in the inference (Dhar, 2013). Whereas both big data and data science terms are often used interchangeably, "big data" refer to large and varied data that can be collected and managed, whereas "data science" develops models that capture, visualize, and analyze the underlying patterns in the data. In this editorial, we address both the collection and handling of big data and the analytical tools provided by data science for management scholars. At the current time, practitioners suggest that data science applications tackle the three core elements of big data: volume, velocity, and variety (McAfee & Brynjolfsson, 2012; Zikopoulos & Eaton, 2011). "Volume" represents the sheer size of the dataset due to the aggregation of a large number of variables and an even larger set of observations for each variable. "Velocity" reflects the speed at which these data are collected and analyzed, whether in real time or near real time from sensors, sales transactions, social media posts, and sentiment data for breaking news and social trends. "Variety" in big data comes from the plurality of structured and unstructured data sources such as text, videos, networks, and graphics among others. The combinations of volume, velocity, and variety reveal the complex task of generating knowledge from big data, which often runs into millions of observations, and deriving theoretical contributions from such data. In this editorial, we provide a primer or a "starter kit" for potential data science applications in management research. We do so with a caveat that emerging fields outdate and improve upon methodologies while often supplanting them with new applications. Nevertheless, this primer can guide management scholars who wish to use data science techniques to reach better answers to existing questions or explore completely new research questions.

251 citations


Journal ArticleDOI
TL;DR: This paper surveys and synthesizes a wide spectrum of existing studies on crowdsourced data management and outlines key factors that need to be considered to improve crowdsourced data management.
Abstract: Many important data management and analytics tasks cannot be completely addressed by automated processes. These tasks, such as entity resolution, sentiment analysis, and image recognition can be enhanced through the use of human cognitive ability. Crowdsourcing platforms are an effective way to harness the capabilities of people (i.e., the crowd) to apply human computation for such tasks. Thus, crowdsourced data management has become an area of increasing interest in research and industry. We identify three important problems in crowdsourced data management. (1) Quality Control: Workers may return noisy or incorrect results, so effective techniques are required to achieve high quality; (2) Cost Control: The crowd is not free, and cost control aims to reduce the monetary cost; (3) Latency Control: The human workers can be slow, particularly compared to automated computing time scales, so latency-control techniques are required. There has been significant work addressing these three factors for designing crowdsourced tasks, developing crowdsourced data manipulation operators, and optimizing plans consisting of multiple operators. In this paper, we survey and synthesize a wide spectrum of existing studies on crowdsourced data management. Based on this analysis we then outline key factors that need to be considered to improve crowdsourced data management.
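
As a concrete example of the quality-control problem, the sketch below (illustrative only, not drawn from the survey) aggregates redundant worker answers by majority vote, the simplest of the quality-control techniques such surveys cover:

```python
from collections import Counter

def majority_vote(answers):
    """Quality control by redundancy: assign each task to several workers
    and keep the most common answer. `answers` maps task -> list of labels."""
    results = {}
    for task, labels in answers.items():
        winner, count = Counter(labels).most_common(1)[0]
        results[task] = (winner, count / len(labels))  # answer + agreement ratio
    return results

# Hypothetical labeling tasks; a third worker overrides one noisy answer.
worker_answers = {
    "img_17_is_cat": ["yes", "yes", "no"],
    "img_18_is_cat": ["no", "no", "no"],
}
print(majority_vote(worker_answers))
# {'img_17_is_cat': ('yes', 0.67), 'img_18_is_cat': ('no', 1.0)} (approx.)
```

Cost control then becomes the question of how few such redundant assignments are needed, and latency control of how long to wait for slow workers before aggregating.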

240 citations


Journal ArticleDOI
TL;DR: It is concluded that administrative datasets have the potential to contribute to the development of high-quality and impactful social science research, and should not be overlooked in the emerging field of big data.

228 citations


Journal ArticleDOI
Yoshiko Atsuta
TL;DR: The introduction of the Second-Generation Transplant Registry Unified Management Program (TRUMP2) is intended to improve data quality, enable more efficient data management, and expand possible uses of data, as it is capable of building a more complex relational database.
Abstract: Collection and analysis of information on diseases and post-transplant courses of allogeneic hematopoietic stem cell transplant recipients have played important roles in improving therapeutic outcomes in hematopoietic stem cell transplantation. Efficient, high-quality data collection systems are essential. The introduction of the Second-Generation Transplant Registry Unified Management Program (TRUMP2) is intended to improve data quality and enable more efficient data management. The TRUMP2 system will also expand possible uses of data, as it is capable of building a more complex relational database. The construction of an accessible system for adequate data utilization by researchers would promote greater research activity. Study approval and management processes and authorship guidelines also need to be organized within this context. Quality control of processes for data manipulation and analysis will also affect study outcomes. Shared scripts have been introduced to define variables according to standard definitions for quality control and for improving the efficiency of registry studies using TRUMP data.

227 citations


Journal ArticleDOI
01 Aug 2016
TL;DR: Magellan is novel in four important aspects: it provides how-to guides that tell users what to do in each EM scenario, step by step, and provides tools to help users do these steps; the tools seek to cover the entire EM pipeline, not just matching and blocking as current EM systems do.
Abstract: Entity matching (EM) has been a long-standing challenge in data management. Most current EM works focus only on developing matching algorithms. We argue that far more efforts should be devoted to building EM systems. We discuss the limitations of current EM systems, then present as a solution Magellan, a new kind of EM system. Magellan is novel in four important aspects. (1) It provides how-to guides that tell users what to do in each EM scenario, step by step. (2) It provides tools to help users do these steps; the tools seek to cover the entire EM pipeline, not just matching and blocking as current EM systems do. (3) Tools are built on top of the data analysis and Big Data stacks in Python, allowing Magellan to borrow a rich set of capabilities in data cleaning, IE, visualization, learning, etc. (4) Magellan provides a powerful scripting environment to facilitate interactive experimentation and quick "patching" of the system. We describe research challenges raised by Magellan, then present extensive experiments with 44 students and users at several organizations that show the promise of the Magellan approach.
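
To make the blocking-then-matching pipeline concrete, here is a hedged pandas sketch of the two EM steps the abstract names; it illustrates the general idea only and is not Magellan's actual API (the column names, blocking key, and threshold are invented):

```python
import pandas as pd

def jaccard(a, b):
    """Token-set Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def match(table_a, table_b, threshold=0.3):
    """Entity matching in two steps: block on zip code to prune the full
    cross product, then score the surviving pairs with Jaccard similarity."""
    candidates = table_a.merge(table_b, on="zip", suffixes=("_a", "_b"))  # blocking
    candidates["sim"] = [jaccard(a, b) for a, b in
                         zip(candidates["name_a"], candidates["name_b"])]  # matching
    return candidates[candidates["sim"] >= threshold]

a = pd.DataFrame({"name": ["Dave Smith", "Joe Wilson"], "zip": ["53703", "53706"]})
b = pd.DataFrame({"name": ["David Smith", "J. Wilson"], "zip": ["53703", "53706"]})
print(match(a, b))  # only same-zip pairs are ever compared
```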

Proceedings ArticleDOI
26 Jun 2016
TL;DR: Constance is a Data Lake system with sophisticated metadata management over raw data extracted from heterogeneous data sources that discovers, extracts, and summarizes the structural metadata from the data sources, and annotates data and metadata with semantic information to avoid ambiguities.
Abstract: As the challenge of our time, Big Data still has many research hassles, especially the variety of data. The high diversity of data sources often results in information silos, a collection of non-integrated data management systems with heterogeneous schemas, query languages, and APIs. Data Lake systems have been proposed as a solution to this problem, by providing a schema-less repository for raw data with a common access interface. However, just dumping all data into a data lake without any metadata management, would only lead to a 'data swamp'. To avoid this, we propose Constance, a Data Lake system with sophisticated metadata management over raw data extracted from heterogeneous data sources. Constance discovers, extracts, and summarizes the structural metadata from the data sources, and annotates data and metadata with semantic information to avoid ambiguities. With embedded query rewriting engines supporting structured data and semi-structured data, Constance provides users a unified interface for query processing and data exploration. During the demo, we will walk through each functional component of Constance. Constance will be applied to two real-life use cases in order to show attendees the importance and usefulness of our generic and extensible data lake system.
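
The kind of structural-metadata discovery Constance performs can be sketched, very loosely, as schema summarization over raw records; the following illustrative Python (not Constance's implementation) infers field names, value types, and occurrence counts from heterogeneous JSON documents:

```python
import json
from collections import defaultdict

def summarize_structure(raw_docs):
    """Discover structural metadata from schema-less records: for each field,
    record the value types seen and how often the field occurs."""
    fields = defaultdict(lambda: {"types": set(), "count": 0})
    for doc in raw_docs:
        for key, value in json.loads(doc).items():
            fields[key]["types"].add(type(value).__name__)
            fields[key]["count"] += 1
    return dict(fields)

# Two raw documents with overlapping but inconsistent structure.
raw = ['{"id": 1, "name": "pump", "temp": 71.3}',
       '{"id": "A-2", "name": "valve"}']
for field, meta in summarize_structure(raw).items():
    print(field, meta)  # e.g. id {'types': {'int', 'str'}, 'count': 2}
```

Conflicting types for the same field (as with `id` above) are exactly the ambiguities that semantic annotation and query rewriting then have to resolve.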

Journal ArticleDOI
TL;DR: In this article, the authors present a review of the quality management methods, tools or practices used in conjunction with sustainable development initiatives and highlight the need to move beyond existing standards and management systems to enable more radical improvements, and the need for empirical evidence of the effect of integrated management systems on environmental performance.

Journal ArticleDOI
TL;DR: This paper presents the first Big Data based architecture for construction waste analytics, validated for exploratory analytics of 200,000 waste disposal records from 900 completed projects; the analysis reveals that existing waste management software classifies the bulk of construction waste as mixed waste, which exposes poor waste data management.
Abstract: In recent times, the construction industry has endured pressure to take drastic steps to minimise waste. Waste intelligence advocates retrospective measures to manage waste after it is produced. Existing waste intelligence based waste management software is fundamentally limited and cannot facilitate stakeholders in controlling wasteful activities. Paradoxically, despite a great amount of effort, the waste being produced by the construction industry is escalating. This undesirable situation motivates a radical change from waste intelligence to waste analytics, in which waste is to be tackled proactively right at design time through sophisticated big data technologies. This paper highlights that waste minimisation at design (a.k.a. designing-out waste) is a data-driven and computationally intensive challenge. The aim of this paper is to propose a Big Data architecture for construction waste analytics. To this end, existing literature on big data technologies is reviewed to identify the critical components of the proposed Big Data based waste analytics architecture. At the crux, graph-based components are used: in particular, a graph database (Neo4J) is adopted to store highly voluminous and diverse datasets. To complement it, Spark, a highly resilient graph processing system, is employed. Provisions for extensions through Building Information Modelling (BIM) are also considered for synergy and greater adoption. This symbiotic integration of technologies enables a vibrant environment for design exploration and optimisation to tackle construction waste. The main contribution of this paper is that it presents, to the best of our knowledge, the first Big Data based architecture for construction waste analytics. The architecture is validated for exploratory analytics of 200,000 waste disposal records from 900 completed projects. It is revealed that existing waste management software classifies the bulk of construction waste as mixed waste, which exposes poor waste data management. The findings of this paper will be of interest, more generally, to researchers who are seeking to develop big data based simulation tools in similar non-trivial applications.
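
As a rough illustration of the graph-based storage the architecture adopts, the snippet below uses the official neo4j Python driver with an invented node/relationship schema; the connection details and the (Project)-[:DISPOSED]->(WasteRecord) model are assumptions for the sketch, not the paper's actual data model:

```python
from neo4j import GraphDatabase  # pip install neo4j

# Hypothetical local instance and credentials.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_disposal(tx, project, waste_type, tonnes):
    """Store one waste disposal record as a node linked to its project."""
    tx.run("MERGE (p:Project {name: $project}) "
           "CREATE (p)-[:DISPOSED]->(:WasteRecord {type: $waste_type, tonnes: $tonnes})",
           project=project, waste_type=waste_type, tonnes=tonnes)

with driver.session() as session:
    session.execute_write(add_disposal, "Site-A", "mixed", 12.5)
driver.close()
```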

BookDOI
01 Jan 2016
TL;DR: Management of Information Technology in Health presents a very special challenge because it is an area that can be highly controversial and very costly, often with considerable dissatisfaction and with significant levels of failure.
Abstract: Management of Information Technology (IT) in Health presents a very special challenge. The IT industry itself is a very dynamic, rapidly evolving field, with a continuous stream of new technologies bringing new possibilities and challenges. However, IT is not merely a technological issue, as it is generally found that many of the most difficult problems concern the way the technology relates to the organization. If technology is going to be effective, then in most cases substantial changes in the way the organization operates are necessary (Coombs et al. 1992). These changes involve many conflicting factors, which can be political, managerial, industrial, or cultural, and may require substantial changes in skills and roles. Implementation is much more than a technical process, and involves the skills of politicians, salesmen, project managers and organizational change agents (Keen 1991). It is an area that can be highly controversial and very costly, often with considerable dissatisfaction and with significant levels of failure (Sauer 1993). In addition to these problems, health is possibly one of the most complex of environments, which makes the management of information in the health industry extremely demanding.

Journal ArticleDOI
TL;DR: Several near‐term opportunities for federal agencies, as well as the broader scientific and management community, are highlighted that will help accelerate sensor development, build and leverage sites within a national network, and develop open data standards and data management protocols that are key to realizing the benefits of a large‐scale, integrated monitoring network.
Abstract: Sensors and enabling technologies are becoming increasingly important tools for water quality monitoring and associated water resource management decisions. In particular, nutrient sensors are of interest because of the well-known adverse effects of nutrient enrichment on coastal hypoxia, harmful algal blooms, and impacts to human health. Accurate and timely information on nutrient concentrations and loads is integral to strategies designed to minimize risk to humans and manage the underlying drivers of water quality impairment. Using nitrate sensors as the primary example, we highlight the types of applications in freshwater and coastal environments that are likely to benefit from continuous, real-time nutrient data. The concurrent emergence of new tools to integrate, manage, and share large datasets is critical to the successful use of nutrient sensors and has made it possible for the field of continuous monitoring to rapidly move forward. We highlight several near-term opportunities for federal agencies, as well as the broader scientific and management community, that will help accelerate sensor development, build and leverage sites within a national network, and develop open data standards and data management protocols that are key to realizing the benefits of a large-scale, integrated monitoring network. Investing in these opportunities will provide new information to guide management and policies designed to protect and restore our nation's water resources.

Journal ArticleDOI
TL;DR: An innovative architecture for collecting and accessing large amount of data generated by medical sensor networks is proposed and an effective and flexible security mechanism that guarantees confidentiality, integrity as well as fine-grained access control to outsourced medical data is proposed.

Journal ArticleDOI
TL;DR: Views on social network data-based recommender systems are expressed by considering the usage of various recommendation algorithms, functionalities of systems, different types of interfaces, filtering techniques, and artificial intelligence techniques.
Abstract: Rapid growth of the web and its applications has created a colossal importance for recommender systems. Being applied in various domains, recommender systems were designed to generate suggestions such as items or services based on user interests. However, recommender systems face many issues that diminish their effectiveness. Integrating powerful data management techniques into recommender systems can address such issues, and the quality of recommendations can be increased significantly. Recent research on recommender systems reveals an idea of utilizing social network data to enhance traditional recommender systems with better prediction and improved accuracy. This paper expresses views on social network data-based recommender systems by considering the usage of various recommendation algorithms, functionalities of systems, different types of interfaces, filtering techniques, and artificial intelligence techniques. After examining the depths of objectives, methodologies, and data sources of the existing models, the paper helps anyone interested in the development of travel recommendation systems and facilitates future research directions. We have also proposed a location recommendation system based on a social pertinent trust walker (SPTW) and compared the results with the existing baseline random walk models. Later, we have enhanced the SPTW model to provide recommendations for groups of users. The results obtained from the experiments have been presented.
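
The random-walk baselines such models are compared against can be illustrated with a small, self-contained sketch: a generic trust-weighted walk with restarts (not the SPTW algorithm itself; the graph, weights, and restart probability are invented):

```python
import random
from collections import Counter

def trust_walk_recommend(trust, liked, start, steps=10000, restart=0.15):
    """Random walk over a trust graph: from the current user, follow a trust
    edge with probability proportional to its weight, restarting at the
    target user now and then; tally items liked by the users visited."""
    scores, user = Counter(), start
    for _ in range(steps):
        if random.random() < restart or not trust.get(user):
            user = start  # restart keeps the walk near the target user
        else:
            friends, weights = zip(*trust[user].items())
            user = random.choices(friends, weights=weights)[0]
        if user != start:
            scores.update(liked.get(user, []))
    seen = set(liked.get(start, []))
    return [item for item, _ in scores.most_common() if item not in seen][:3]

trust = {"ann": {"bob": 0.9, "eve": 0.1}, "bob": {"ann": 0.9}, "eve": {}}
liked = {"ann": ["cafe_1"], "bob": ["park_3", "museum_2"], "eve": ["bar_9"]}
print(trust_walk_recommend(trust, liked, "ann"))  # bob's items rank above eve's
```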

Journal ArticleDOI
TL;DR: The road map of a distributed modeling framework for plant-wide process monitoring is introduced, based on which the whole plant-wide process is decomposed into different blocks, and statistical data models are constructed in those blocks.
Abstract: With the growing complexity of the modern industrial process, monitoring large-scale plant-wide processes has become quite popular. Unlike traditional processes, the measured data in the plant-wide process pose great challenges to information capture, data management, and storage. More importantly, it is difficult to efficiently interpret the information hidden within those data. In this paper, the road map of a distributed modeling framework for plant-wide process monitoring is introduced. Based on this framework, the whole plant-wide process is decomposed into different blocks, and statistical data models are constructed in those blocks. For online monitoring, the results obtained from different blocks are integrated through the decision fusion algorithm. A detailed case study is carried out for performance evaluation of the plant-wide monitoring method. Research challenges and perspectives are discussed and highlighted for future work.
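
A drastically simplified numpy sketch of the decompose-then-fuse idea follows; the per-block statistic, control limit, and OR-fusion rule are invented stand-ins, not the paper's statistical models or its decision fusion algorithm:

```python
import numpy as np

def block_alarm(train, sample, n_sigma=3.0):
    """Monitor one block with a simple Shewhart-style statistic:
    alarm if any variable leaves its mean +/- n_sigma band."""
    mu, sigma = train.mean(axis=0), train.std(axis=0)
    return np.any(np.abs(sample - mu) > n_sigma * sigma)

def plantwide_decision(blocks_train, blocks_sample):
    """Decision fusion: decompose the plant into blocks, monitor each,
    and raise a plant-wide alarm if any block alarms (OR-fusion)."""
    return any(block_alarm(tr, s) for tr, s in zip(blocks_train, blocks_sample))

rng = np.random.default_rng(0)
blocks_train = [rng.normal(0, 1, (500, 4)), rng.normal(5, 2, (500, 6))]  # two process blocks
normal_sample = [np.zeros(4), np.full(6, 5.0)]
faulty_sample = [np.zeros(4), np.full(6, 20.0)]  # fault confined to block 2
print(plantwide_decision(blocks_train, normal_sample))  # False
print(plantwide_decision(blocks_train, faulty_sample))  # True
```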

Journal ArticleDOI
TL;DR: This paper proposes an adaptive data collection approach on the biosensor node level that uses an early warning score system to optimize data transmission and estimate the sensing frequency in real time, and presents a data fusion model on the coordinator level using a decision matrix and fuzzy set theory.
Abstract: In the past few years, wireless body sensor networks (WBSNs) emerged as a low-cost solution for healthcare applications. In WBSNs, biosensors periodically collect physiological measurements and send them to the coordinator, where the data fusion process takes place. However, processing the huge amount of data captured by the limited-lifetime biosensors and taking the right decisions when there is an emergency are major challenges in WBSNs. In this paper, we introduce a biosensor data management framework, spanning from data collection to decision making. First, we propose an adaptive data collection approach on the biosensor node level. This approach uses an early warning score system to optimize data transmission and estimates the sensing frequency in real time. Second, we present a data fusion model on the coordinator level using a decision matrix and fuzzy set theory. To evaluate our approach, we conducted multiple series of simulations on real sensor data. The results show that our approach reduces the amount of collected data, while maintaining data integrity. In addition, we show the impact of sampling and filtering data on the accuracy of the taken decisions and compare our data fusion approach with a basic decision tree algorithm.
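
The adaptive-collection idea is easy to sketch: the sensing frequency rises with the patient's risk score. The toy below is illustrative only; the score bands and intervals are invented, not the paper's early warning score system:

```python
def heart_rate_score(hr):
    """Toy early-warning-style score for heart rate (bands are invented)."""
    if 60 <= hr <= 100:
        return 0          # normal
    if 40 <= hr < 60 or 100 < hr <= 130:
        return 2          # concerning
    return 3              # critical

def next_sampling_interval(recent_readings, base_interval_s=60):
    """Adapt the sensing frequency to the warning score: sample rarely when
    the vital sign is stable, densely when it drifts toward an emergency."""
    score = max(heart_rate_score(hr) for hr in recent_readings)
    return base_interval_s / (2 ** score)  # 60 s, 15 s, or 7.5 s here

print(next_sampling_interval([72, 75, 70]))    # 60.0 -> relaxed monitoring
print(next_sampling_interval([72, 118, 125]))  # 15.0 -> tightened monitoring
```

Sampling less often when readings are stable is what saves the limited-lifetime biosensor's energy while preserving the data that matter for decisions.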

Journal ArticleDOI
TL;DR: In this article, the authors explore how visual strategy and performance management techniques impact performance measurement and management practices of organizations, and present a novel visual performance management approach that is developed and implemented in qualitative case studies with seven manufacturing SMEs across Europe.
Abstract: The purpose of this paper was to explore how visual strategy and performance management techniques impact performance measurement and management practices of organisations. A novel visual performance management approach is developed and implemented in qualitative case studies with seven manufacturing SMEs across Europe. The implementation cases demonstrate that visual management systems serve to support ongoing strategy development and implementation, facilitate performance measurement and review, enable people engagement, improve internal and external communication, enhance collaboration and integration, support the development of a continuous improvement culture and foster innovation. Additional explorative and longitudinal research is required to understand the long-term impact of such approaches in both small and larger organisations.

Journal ArticleDOI
TL;DR: This article is a practical guide to conducting big data research, covering data management, acquisition, processing, and analytics (including key supervised and unsupervised learning data mining methods), accompanied by walkthrough tutorials on data acquisition, text analysis with latent Dirichlet allocation topic modeling, and classification with support vector machines.
Abstract: The massive volume of data that now covers a wide variety of human behaviors offers researchers in psychology an unprecedented opportunity to conduct innovative theory- and data-driven field research. This article is a practical guide to conducting big data research, covering data management, acquisition, processing, and analytics (including key supervised and unsupervised learning data mining methods). It is accompanied by walkthrough tutorials on data acquisition, text analysis with latent Dirichlet allocation topic modeling, and classification with support vector machines. Big data practitioners in academia, industry, and the community have built a comprehensive base of tools and knowledge that makes big data research accessible to researchers in a broad range of fields. However, big data research does require knowledge of software programming and a different analytical mindset. For those willing to acquire the requisite skills, innovative analyses of unexpected or previously untapped data sources can offer fresh ways to develop, test, and extend theories. When conducted with care and respect, big data research can become an essential complement to traditional research.
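
In the spirit of the article's tutorials, here is a compact, hedged sketch of the two analyses it walks through, LDA topic modeling and SVM classification, using scikit-learn on invented placeholder documents:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ["stock market trading gains", "market prices fall on trading floor",
        "team wins final match", "coach praises team after match"]
labels = ["finance", "finance", "sports", "sports"]  # toy labels

vec = CountVectorizer()
X = vec.fit_transform(docs)  # bag-of-words counts, the input both methods expect

# Unsupervised: LDA discovers latent topics without using the labels.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # per-document topic proportions
print(doc_topics.round(2))

# Supervised: an SVM learns to classify documents from the labels.
clf = LinearSVC().fit(X, labels)
print(clf.predict(vec.transform(["market trading update"])))  # likely ['finance']
```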

Journal ArticleDOI
01 Sep 2016
TL;DR: This paper discusses the limitations of current EM systems, presents Magellan, a new kind of EM system that addresses these limitations, and proposes demonstration scenarios that show the promise of the Magellan approach.
Abstract: Entity matching (EM) has been a long-standing challenge in data management. Most current EM works, however, focus only on developing matching algorithms. We argue that far more efforts should be devoted to building EM systems. We discuss the limitations of current EM systems, then present Magellan, a new kind of EM system that addresses these limitations. Magellan is novel in four important aspects. (1) It provides a how-to guide that tells users what to do in each EM scenario, step by step. (2) It provides tools to help users do these steps; the tools seek to cover the entire EM pipeline, not just matching and blocking as current EM systems do. (3) Tools are built on top of the data science stacks in Python, allowing Magellan to borrow a rich set of capabilities in data cleaning, IE, visualization, learning, etc. (4) Magellan provides a powerful scripting environment to facilitate interactive experimentation and allow users to quickly write code to "patch" the system. We have extensively evaluated Magellan with 44 students and users at various organizations. In this paper we propose demonstration scenarios that show the promise of the Magellan approach.

Book
03 Sep 2016
TL;DR: This text presents a step-by-step methodology to understand and exploit mobility data: collecting and cleansing data, storage in Moving Object Database engines, indexing, processing, analyzing and mining mobility data.
Abstract: This text integrates different mobility data handling processes, from database management to multi-dimensional analysis and mining, into a unified presentation driven by the spectrum of requirements raised by real-world applications. It presents a step-by-step methodology to understand and exploit mobility data: collecting and cleansing data, storage in Moving Object Database (MOD) engines, indexing, processing, analyzing and mining mobility data. Emerging issues, such as semantic and privacy-aware querying and mining as well as distributed data processing, are also covered. Theoretical presentation is smoothly interchanged with hands-on exercises and case studies involving an actual MOD engine. The authors are established experts who address both theoretical and practical dimensions of the field and also present valuable prototype software. The background context, clear explanations and sample exercises make this an ideal textbook for graduate students studying database management, data mining and geographic information systems.

Journal ArticleDOI
TL;DR: An overview is provided of the key technical issues related to problem framing and the ability of resource managers to learn from their experience.

Journal ArticleDOI
TL;DR: The proposed solution is flexible, dynamic, has a high semantic content, and considers both virtual product models and feedback data from the physical product along its whole lifecycle (digital product twin).

Journal ArticleDOI
TL;DR: This paper evaluates how well the two main privacy models used in anonymization meet the requirements of big data, namely composability, low computational cost, and linkability.
Abstract: This paper explores the challenges raised by big data in privacy-preserving data management. First, we examine the conflicts raised by big data with respect to preexisting concepts of private data management, such as consent, purpose limitation, transparency and individual rights of access, rectification and erasure. Anonymization appears as the best tool to mitigate such conflicts, and it is best implemented by adhering to a privacy model with precise privacy guarantees. For this reason, we evaluate how well the two main privacy models used in anonymization (k-anonymity and ε-differential privacy) meet the requirements of big data, namely composability, low computational cost and linkability.
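
Of the two models the paper evaluates, ε-differential privacy has the more compact mechanism; the sketch below adds Laplace noise to a count query and illustrates sequential composition of the privacy budget (the data and queries are invented):

```python
import numpy as np

def dp_count(data, predicate, epsilon):
    """epsilon-differentially private count: a counting query has
    sensitivity 1, so Laplace noise with scale 1/epsilon suffices."""
    true_count = sum(1 for x in data if predicate(x))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

ages = [23, 37, 41, 58, 62, 29, 45]
# Composability: two queries at epsilon = 0.5 each consume a total
# privacy budget of epsilon = 1.0 under sequential composition.
print(dp_count(ages, lambda a: a >= 40, epsilon=0.5))
print(dp_count(ages, lambda a: a < 40, epsilon=0.5))
```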

Proceedings ArticleDOI
01 Sep 2016
TL;DR: BigDAWG is a polystore system designed to work on complex problems that naturally span different processing or storage engines, handling datasets with different underlying data models such as unstructured text, relational data, time series waveforms, and imagery.
Abstract: Organizations are often faced with the challenge of providing data management solutions for large, heterogeneous datasets that may have different underlying data and programming models. For example, a medical dataset may have unstructured text, relational data, time series waveforms and imagery. Trying to fit such datasets in a single data management system can have adverse performance and efficiency effects. As a part of the Intel Science and Technology Center on Big Data, we are developing a polystore system designed for such problems. BigDAWG (short for the Big Data Analytics Working Group) is a polystore system designed to work on complex problems that naturally span across different processing or storage engines. BigDAWG provides an architecture that supports diverse database systems working with different data models, support for the competing notions of location transparency and semantic completeness via islands and a middleware that provides a uniform multi-island interface. Initial results from a prototype of the BigDAWG system applied to a medical dataset validate polystore concepts. In this article, we will describe polystore databases, the current BigDAWG architecture and its application on the MIMIC II medical dataset, initial performance results and our future development plans.
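
The island idea, a uniform interface that routes each query to an engine matching its data model, can be caricatured in a few lines. The dispatcher below is purely illustrative: the island names, engine bindings, and routing rule are invented for the sketch, not BigDAWG's middleware:

```python
# Each "island" pairs a data model with an engine that implements it.
ISLANDS = {
    "RELATIONAL": lambda q: f"[sql-engine] {q}",
    "ARRAY":      lambda q: f"[array-engine] {q}",
    "TEXT":       lambda q: f"[text-engine] {q}",
}

def execute(query):
    """Uniform multi-island interface: the island prefix chooses the
    data model; the middleware routes the body to a matching engine."""
    island, _, body = query.partition("(")
    return ISLANDS[island.strip()](body.rstrip(") ").strip())

print(execute("RELATIONAL( SELECT avg(hr) FROM vitals )"))
print(execute("TEXT( nurse notes mentioning sepsis )"))
```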


Journal ArticleDOI
TL;DR: AIDE, an Automatic Interactive Data Exploration framework that assists users in discovering interesting new data patterns and eliminates expensive ad-hoc exploratory queries, provides interactive performance as it limits the user wait time per iteration of exploration to less than a few seconds.
Abstract: In this paper, we argue that database systems be augmented with an automated data exploration service that methodically steers users through the data in a meaningful way. Such an automated system is crucial for deriving insights from complex datasets found in many big data applications such as scientific and healthcare applications as well as for reducing the human effort of data exploration. Towards this end, we present AIDE, an Automatic Interactive Data Exploration framework that assists users in discovering new interesting data patterns and eliminates expensive ad-hoc exploratory queries. AIDE relies on a seamless integration of classification algorithms and data management optimization techniques that collectively strive to accurately learn the user interests based on their relevance feedback on strategically collected samples. We present a number of exploration techniques as well as optimizations that minimize the number of samples presented to the user while offering interactive performance. AIDE can deliver highly accurate query predictions for very common conjunctive queries with small user effort while, given a reasonable number of samples, it can predict with high accuracy complex disjunctive queries. It provides interactive performance as it limits the user wait time per iteration of exploration to less than a few seconds.
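
The explore-label-learn loop at the heart of such systems can be sketched with scikit-learn. This is a generic relevance-feedback loop under invented sampling rules (a grid seed plus uncertainty sampling), not AIDE's actual exploration strategy:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
data = rng.uniform(0, 100, size=(5000, 2))         # the unexplored data space
user_interest = lambda p: p[0] > 60 and p[1] < 30  # hidden "true" query region

# Seed with a coarse grid of strategically spread samples.
X = [[x, y] for x in range(5, 100, 20) for y in range(5, 100, 20)]
y = [user_interest(p) for p in X]                  # user marks relevant/irrelevant

clf = DecisionTreeClassifier(max_depth=4)
for _ in range(5):                                 # a few feedback iterations
    clf.fit(X, y)
    # Next, show the user the points the current model is least certain about.
    uncertainty = np.abs(clf.predict_proba(data)[:, -1] - 0.5)
    for i in np.argsort(uncertainty)[:10]:
        X.append(list(data[i]))
        y.append(user_interest(data[i]))

# The learned tree now approximates the user's (conjunctive) interest region.
print(clf.predict([[80, 10], [10, 80]]))  # expected: [ True  False ]
```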