Showing papers on "Data management published in 2017"


Journal ArticleDOI
TL;DR: The results of this study show that the technologies of cloud and big data can be used to enhance the performance of the healthcare system so that humans can then enjoy various smart healthcare applications and services.
Abstract: Advances in information technology have driven great progress in healthcare technologies across many domains. However, these new technologies have also made healthcare data not only much bigger but also much more difficult to handle and process. Moreover, because the data are created by a variety of devices within a short time span, they are stored in heterogeneous formats and generated at high velocity, which can, to a large extent, be regarded as a big data problem. To provide more convenient healthcare services and environments, this paper proposes a cyber-physical system for patient-centric healthcare applications and services, called Health-CPS, built on cloud and big data analytics technologies. This system consists of a data collection layer with a unified standard, a data management layer for distributed storage and parallel computing, and a data-oriented service layer. The results of this study show that cloud and big data technologies can be used to enhance the performance of the healthcare system so that people can enjoy various smart healthcare applications and services.

682 citations
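As a rough illustration of the three-layer design described above, the following minimal Python sketch passes records from a collection layer through a management layer to a service layer; all class, method, and field names are invented for illustration, not taken from the paper.

import json

class DataCollectionLayer:
    # "Unified standard": normalize heterogeneous device records.
    def collect(self, raw_records):
        return [{"patient": r["pid"], "kind": r["kind"], "value": r["val"]}
                for r in raw_records]

class DataManagementLayer:
    # Stand-in for the distributed storage / parallel computing layer.
    def __init__(self):
        self.store = []
    def ingest(self, records):
        self.store.extend(records)
    def query(self, patient):
        return [r for r in self.store if r["patient"] == patient]

class DataServiceLayer:
    # Data-oriented services built on top of the management layer.
    def __init__(self, mgmt):
        self.mgmt = mgmt
    def patient_summary(self, patient):
        records = self.mgmt.query(patient)
        return {"patient": patient, "n_records": len(records)}

mgmt = DataManagementLayer()
mgmt.ingest(DataCollectionLayer().collect([{"pid": "p1", "kind": "hr", "val": 72}]))
print(json.dumps(DataServiceLayer(mgmt).patient_summary("p1")))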


Journal ArticleDOI
TL;DR: In this article, the authors investigate the relationships among knowledge management systems, open innovation, knowledge management capacity, and innovation capacity in the context of the Internet of Things (IoT).

422 citations


Journal ArticleDOI
TL;DR: The fundamental data management techniques employed to ensure consistency, interoperability, granularity, and reusability of the data generated by the underlying IoT for smart cities are described.
Abstract: Integrating the various embedded devices and systems in our environment enables an Internet of Things (IoT) for a smart city. The IoT will generate a tremendous amount of data that can be leveraged for safety, efficiency, and infotainment applications and services for city residents. The management of this voluminous data through its lifecycle is fundamental to the realization of smart cities. Therefore, in contrast to existing surveys on smart cities, we provide a data-centric perspective, describing the fundamental data management techniques employed to ensure consistency, interoperability, granularity, and reusability of the data generated by the underlying IoT for smart cities. Essentially, the data lifecycle in a smart city depends on tightly coupled data management with cross-cutting layers of data security and privacy, and supporting infrastructure. Therefore, we further identify techniques employed for data security and privacy, and discuss the networking and computing technologies that enable smart cities. We highlight the achievements in realizing various aspects of smart cities, present the lessons learned, and identify limitations and research challenges.

390 citations


Journal ArticleDOI
TL;DR: A general overview of the requirements and system architectures of disaster management systems is presented, and state-of-the-art data-driven techniques that have been applied to improving situation awareness and to addressing users’ information needs in disaster management are summarized.
Abstract: Improving disaster management and recovery techniques is a national priority, given the huge toll of man-made and natural calamities. Data-driven disaster management aims to apply advanced data collection and analysis technologies to achieve more effective and responsive disaster management, and has undergone considerable progress in the last decade. However, to the best of our knowledge, no existing work both summarizes recent progress and suggests future directions for this emerging research area. To remedy this situation, we provide a systematic treatment of the recent developments in data-driven disaster management. Specifically, we first present a general overview of the requirements and system architectures of disaster management systems and then summarize state-of-the-art data-driven techniques that have been applied to improving situation awareness and to addressing users’ information needs in disaster management. We also discuss and categorize general data-mining and machine-learning techniques in disaster management. Finally, we recommend several research directions for further investigation.

364 citations


Proceedings ArticleDOI
03 Nov 2017
TL;DR: A blockchain-based design for the IoT that provides distributed access control and data management, empowering users with ownership of their data, and that facilitates the storage of time-series IoT data at the edge of the network via a locality-aware decentralized storage system managed with blockchain technology.
Abstract: Today the cloud plays a central role in storing, processing, and distributing data. Despite contributing to the rapid development of IoT applications, the current cloud-centric IoT architecture has led to a myriad of isolated data silos that hinder the full potential of holistic data-driven analytics within the IoT. In this paper, we present a blockchain-based design for the IoT that brings distributed access control and data management. We depart from the current trust model, which delegates access control of our data to a centralized trusted authority, and instead empower the users with data ownership. Our design is tailored for IoT data streams and enables secure data sharing. We enable secure and resilient access control management by utilizing the blockchain as an auditable and distributed access control layer on top of the storage layer. We facilitate the storage of time-series IoT data at the edge of the network via a locality-aware decentralized storage system that is managed with blockchain technology. Our system is agnostic of the physical storage nodes and also supports the use of cloud storage resources as storage nodes.

330 citations
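The core idea of using the blockchain as an auditable access-control layer in front of the storage layer can be caricatured in a few lines of Python; the ledger, grant format, and lookup below are hypothetical simplifications, not the paper's actual protocol.

import hashlib, time

ledger = []   # append-only list standing in for the blockchain

def append_entry(entry):
    # Hash-chain each grant to its predecessor so the log is tamper-evident.
    prev = ledger[-1]["hash"] if ledger else "0" * 64
    body = f'{prev}|{entry["owner"]}|{entry["grantee"]}|{entry["stream"]}|{entry["ts"]}'
    entry["prev"] = prev
    entry["hash"] = hashlib.sha256(body.encode()).hexdigest()
    ledger.append(entry)

def grant(owner, grantee, stream):
    append_entry({"owner": owner, "grantee": grantee,
                  "stream": stream, "ts": time.time()})

def is_authorized(grantee, stream):
    # The ledger is the auditable source of truth for access decisions.
    return any(e["grantee"] == grantee and e["stream"] == stream for e in ledger)

storage = {"sensor-42": [21.5, 21.7, 21.6]}   # stand-in for an edge storage node

def read_stream(requester, stream):
    if not is_authorized(requester, stream):
        raise PermissionError(f"{requester} has no grant for {stream}")
    return storage[stream]

grant("alice", "bob", "sensor-42")
print(read_stream("bob", "sensor-42"))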


Journal ArticleDOI
TL;DR: The Collaborative Computational Project for Electron cryo-Microscopy (CCP-EM) as mentioned in this paper has developed a software framework which enables easy access to a range of programs and utilities.
Abstract: As part of its remit to provide computational support to the cryo-EM community, the Collaborative Computational Project for Electron cryo-Microscopy (CCP-EM) has produced a software framework which enables easy access to a range of programs and utilities. The resulting software suite incorporates contributions from different collaborators by encapsulating them in Python task wrappers, which are then made accessible via a user-friendly graphical user interface as well as a command-line interface suitable for scripting. The framework includes tools for project and data management. An overview of the design of the framework is given, together with a survey of the functionality at different levels. The current CCP-EM suite has particular strength in the building and refinement of atomic models into cryo-EM reconstructions, which is described in detail.

257 citations


Proceedings Article
01 Jan 2017
TL;DR: A framework for managing and sharing EMR data for cancer patient care using blockchain is proposed, which can significantly reduce the turnaround time for EMR sharing, improve decision making for medical care, and reduce the overall cost.
Abstract: Electronic medical records (EMRs) are critical, highly sensitive private information in healthcare, and need to be frequently shared among peers. Blockchain provides a shared, immutable and transparent history of all the transactions to build applications with trust, accountability and transparency. This provides a unique opportunity to develop a secure and trustable EMR data management and sharing system using blockchain. In this paper, we present our perspectives on blockchain based healthcare data management, in particular, for EMR data sharing between healthcare providers and for research studies. We propose a framework on managing and sharing EMR data for cancer patient care. In collaboration with Stony Brook University Hospital, we implemented our framework in a prototype that ensures privacy, security, availability, and fine-grained access control over EMR data. The proposed work can significantly reduce the turnaround time for EMR sharing, improve decision making for medical care, and reduce the overall cost.

247 citations
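One way to picture the integrity side of such a design: keep bulk EMR data off-chain and anchor a digest of each shared record on the ledger so recipients can verify what they fetch. A minimal sketch with invented identifiers and no real blockchain, not the paper's prototype:

import hashlib, json

chain = {}   # record_id -> on-chain digest (stand-in for the blockchain)

def share_record(record_id, record, off_chain_store):
    payload = json.dumps(record, sort_keys=True).encode()
    chain[record_id] = hashlib.sha256(payload).hexdigest()  # anchor on-chain
    off_chain_store[record_id] = payload                    # bulk data off-chain

def fetch_and_verify(record_id, off_chain_store):
    payload = off_chain_store[record_id]
    if hashlib.sha256(payload).hexdigest() != chain[record_id]:
        raise ValueError("EMR was modified after sharing")
    return json.loads(payload)

store = {}
share_record("emr-7", {"patient": "p1", "dx": "C50.9"}, store)
print(fetch_and_verify("emr-7", store))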


Journal ArticleDOI
TL;DR: The background and state of the art of scholarly data management and relevant technologies are examined, and data analysis methods, such as statistical analysis, social network analysis, and content analysis for dealing with big scholarly data are reviewed.
Abstract: With the rapid growth of digital publishing, harvesting, managing, and analyzing scholarly information have become increasingly challenging. The term Big Scholarly Data is coined for the rapidly growing scholarly data, which contains information including millions of authors, papers, citations, figures, tables, as well as scholarly networks and digital libraries. Nowadays, various scholarly data can be easily accessed and powerful data analysis technologies are being developed, which enable us to look into science itself with a new perspective. In this paper, we examine the background and state of the art of big scholarly data. We first introduce the background of scholarly data management and relevant technologies. Second, we review data analysis methods, such as statistical analysis, social network analysis, and content analysis for dealing with big scholarly data. Finally, we look into representative research issues in this area, including scientific impact evaluation, academic recommendation, and expert finding. For each issue, the background, main challenges, and latest research are covered. These discussions aim to provide a comprehensive review of this emerging area. This survey paper concludes with a discussion of open issues and promising future directions.

234 citations
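As a concrete taste of the network-analysis methods this survey covers, the sketch below ranks papers in a tiny, fabricated citation graph by PageRank, assuming the networkx package is available:

import networkx as nx

g = nx.DiGraph()  # an edge A -> B means "A cites B"
g.add_edges_from([("p1", "p3"), ("p2", "p3"), ("p3", "p4"), ("p2", "p4")])

# PageRank over the citation graph as a simple impact score.
for paper, score in sorted(nx.pagerank(g).items(), key=lambda kv: -kv[1]):
    print(paper, round(score, 3))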


Proceedings Article
01 Jan 2017
TL;DR: The architecture of Peloton, the first self-driving DBMS, is presented; it enables new optimizations that are important for modern high-performance DBMSs but are not possible today because the complexity of managing these systems has surpassed the abilities of human experts.
Abstract: In the last two decades, both researchers and vendors have built advisory tools to assist database administrators (DBAs) in various aspects of system tuning and physical design. Most of this previous work, however, is incomplete because these tools still require humans to make the final decisions about any changes to the database, and they are reactionary measures that fix problems after they occur. What is needed for a truly “self-driving” database management system (DBMS) is a new architecture that is designed for autonomous operation. This is different from earlier attempts because all aspects of the system are controlled by an integrated planning component that not only optimizes the system for the current workload, but also predicts future workload trends so that the system can prepare itself accordingly. With this, the DBMS can support all of the previous tuning techniques without requiring a human to determine the right way and proper time to deploy them. It also enables new optimizations that are important for modern high-performance DBMSs, but which are not possible today because the complexity of managing these systems has surpassed the abilities of human experts. This paper presents the architecture of Peloton, the first self-driving DBMS. Peloton’s autonomic capabilities are now possible due to algorithmic advancements in deep learning, as well as improvements in hardware and adaptive database architectures.

220 citations
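The planning loop described here can be caricatured as "forecast the workload, then act before it arrives." A toy sketch follows, with an invented forecaster, threshold, and action that are in no way Peloton's actual algorithms:

def forecast(history, window=3):
    # Simple moving-average forecast of queries per hour.
    recent = history[-window:]
    return sum(recent) / len(recent)

def plan(history, index_built):
    predicted = forecast(history)
    if predicted > 1000 and not index_built:
        return "build index now, before the peak"
    if predicted < 100 and index_built:
        return "drop unused index to save write overhead"
    return "no action"

print(plan([800, 950, 1400], index_built=False))  # acts ahead of the peak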


Posted Content
TL;DR: In this article, the authors present a blockchain-based design for the IoT that brings distributed access control and data management, departing from the current trust model that delegates access control of user data to a centralized trusted authority and instead empowering users with data ownership.
Abstract: Today the cloud plays a central role in storing, processing, and distributing data. Despite contributing to the rapid development of IoT applications, the current cloud-centric IoT architecture has led to a myriad of isolated data silos that hinder the full potential of holistic data-driven analytics within the IoT. In this paper, we present a blockchain-based design for the IoT that brings distributed access control and data management. We depart from the current trust model, which delegates access control of our data to a centralized trusted authority, and instead empower the users with data ownership. Our design is tailored for IoT data streams and enables secure data sharing. We enable secure and resilient access control management by utilizing the blockchain as an auditable and distributed access control layer on top of the storage layer. We facilitate the storage of time-series IoT data at the edge of the network via a locality-aware decentralized storage system that is managed with blockchain technology. Our system is agnostic of the physical storage nodes and also supports the use of cloud storage resources as storage nodes.

219 citations


Posted Content
TL;DR: In this paper, a framework for managing and sharing electronic medical records (EMRs) for cancer patient care is proposed, which can significantly reduce the turnaround time for EMR sharing, improve decision making for medical care, and reduce the overall cost.
Abstract: Electronic medical records (EMRs) are critical, highly sensitive private information in healthcare, and need to be frequently shared among peers. Blockchain provides a shared, immutable and transparent history of all the transactions to build applications with trust, accountability and transparency. This provides a unique opportunity to develop a secure and trustable EMR data management and sharing system using blockchain. In this paper, we present our perspectives on blockchain based healthcare data management, in particular, for EMR data sharing between healthcare providers and for research studies. We propose a framework on managing and sharing EMR data for cancer patient care. In collaboration with Stony Brook University Hospital, we implemented our framework in a prototype that ensures privacy, security, availability, and fine-grained access control over EMR data. The proposed work can significantly reduce the turnaround time for EMR sharing, improve decision making for medical care, and reduce the overall cost.

Journal ArticleDOI
TL;DR: In this paper, a set of measures and interpretive structural modelling methods were proposed to identify the driving and dependence powers in sustainable supply chain management within the context of knowledge management, so as to improve the performance of firms from the textile industry in Vietnam.

Proceedings ArticleDOI
05 Jun 2017
TL;DR: This paper proposes a blockchain platform architecture for clinical trials and precision medicine, discusses various design aspects, and provides some insights into the technology requirements and challenges.
Abstract: This paper proposes a blockchain platform architecture for clinical trials and precision medicine, discusses various design aspects, and provides some insights into the technology requirements and challenges. We identify four new system architecture components that need to be built on top of a traditional blockchain and discuss their technology challenges in our blockchain platform: (a) a blockchain-based general distributed and parallel computing paradigm component to devise and study parallel computing methodology for big data analytics, (b) a blockchain application data management component for data integrity, big data integration, and the integration of disparate medical data, (c) a verifiable anonymous identity management component for identity privacy for both people and Internet of Things (IoT) devices and for secure data access to make patient-centric medicine possible, and (d) a trusted data sharing management component to enable a trusted medical data ecosystem for collaborative research.

Proceedings ArticleDOI
09 May 2017
TL;DR: The goal of the tutorial is to bring forth data-management issues that arise in the context of machine learning pipelines deployed in production, draw connections to prior work in the database literature, and outline the open research questions that are not addressed by prior art.
Abstract: The tutorial discusses data-management issues that arise in the context of machine learning pipelines deployed in production. Informed by our own experience with such large-scale pipelines, we focus on issues related to understanding, validating, cleaning, and enriching training data. The goal of the tutorial is to bring forth these issues, draw connections to prior work in the database literature, and outline the open research questions that are not addressed by prior art.
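A flavor of the training-data validation the tutorial covers: check incoming rows against a declared schema before the pipeline runs. The schema format and checks below are illustrative only, not a specific library's API:

def validate(rows, schema):
    # schema maps column -> (expected type, min value, max value)
    errors = []
    for i, row in enumerate(rows):
        for col, (typ, lo, hi) in schema.items():
            v = row.get(col)
            if v is None:
                errors.append(f"row {i}: missing {col}")
            elif not isinstance(v, typ) or not (lo <= v <= hi):
                errors.append(f"row {i}: bad {col}={v!r}")
    return errors

schema = {"age": (int, 0, 120), "income": (float, 0.0, 1e7)}
rows = [{"age": 34, "income": 52000.0}, {"age": -2, "income": 48000.0}]
print(validate(rows, schema))   # flags the negative age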

Journal ArticleDOI
TL;DR: In this paper, the authors evaluate the potential of building information modelling (BIM) as a tool to support the visualisation and management of a building's performance, demonstrating a method for the capture, collation, and linking of data stored across the currently disparate BIM and building management system (BMS) data environments.

Journal ArticleDOI
28 Oct 2017-Sensors
TL;DR: A two-day workshop held in Los Angeles gathered practitioners who work with low-cost air quality sensors to share knowledge developed from a variety of pilot projects, in hopes of advancing collective knowledge about how best to use low-cost air quality sensors.
Abstract: In May 2017, a two-day workshop was held in Los Angeles (California, U.S.A.) to gather practitioners who work with low-cost sensors used to make air quality measurements. The community of practice included individuals from academia, industry, non-profit groups, community-based organizations, and regulatory agencies. The group gathered to share knowledge developed from a variety of pilot projects in hopes of advancing the collective knowledge about how best to use low-cost air quality sensors. Panel discussion topics included: (1) best practices for deployment and calibration of low-cost sensor systems, (2) data standardization efforts and database design, (3) advances in sensor calibration, data management, and data analysis and visualization, and (4) lessons learned from research/community partnerships to encourage purposeful use of sensors and create change/action. Panel discussions summarized knowledge advances and project successes while also highlighting the questions, unresolved issues, and technological limitations that still remain within the low-cost air quality sensor arena.
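As a concrete example of the calibration practices discussed at the workshop, a common first step is fitting a linear correction of a low-cost sensor against a co-located reference monitor; the readings below are fabricated, and NumPy is assumed to be available:

import numpy as np

sensor = np.array([12.1, 18.4, 25.0, 31.2, 40.5])     # low-cost PM2.5 readings
reference = np.array([10.0, 15.8, 22.1, 27.9, 36.4])  # regulatory monitor

# Least-squares fit: reference = slope * sensor + intercept
slope, intercept = np.polyfit(sensor, reference, 1)
corrected = slope * sensor + intercept
print(f"correction: ref = {slope:.2f} * sensor + {intercept:.2f}")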

Journal ArticleDOI
TL;DR: The range of RDM activities explored in this study is positioned on a “landscape maturity model” that reflects current and planned research data services and practice in academic libraries, representing a “snapshot” of current developments and a baseline for future research.
Abstract: This article reports an international study of research data management (RDM) activities, services, and capabilities in higher education libraries. It presents the results of a survey covering higher education libraries in Australia, Canada, Germany, Ireland, the Netherlands, New Zealand, and the UK. The results indicate that libraries have provided leadership in RDM, particularly in advocacy and policy development. Service development is still limited, focused especially on advisory and consultancy services (such as data management planning support and data-related training) rather than technical services (such as provision of a data catalog and curation of active data). Data curation skills development is underway in libraries, but skills and capabilities are not consistently in place and remain a concern. Other major challenges include resourcing, working with other support services, and achieving “buy-in” from researchers and senior managers. Results are compared with previous studies in order to assess trends and relative maturity levels. The range of RDM activities explored in this study is positioned on a “landscape maturity model,” which reflects current and planned research data services and practice in academic libraries, representing a “snapshot” of current developments and a baseline for future research.

Journal ArticleDOI
TL;DR: In this paper, the impact of Big Data on industrial operations and its organisational implications are explored; based on a review of the existing literature, the authors identify the main fields of action for operations management related to data processing.
Abstract: The ongoing digital transformation of industry has so far mostly been studied from the perspective of cyber-physical systems solutions as drivers of change. In this paper, we turn the focus to the changes in data management resulting from the introduction of new digital technologies in industry. So far, data processing activities in operations management have usually been organised according to the existing business structures inside and in-between companies. With the increasing importance of Big Data in the context of the digital transformation, the opposite will be the case: business structures will evolve based on the potential to develop value streams offered by new data processing solutions. Based on a review of the extant literature, we identify the main fields of action for operations management related to data processing. In particular, we explore the impact of Big Data on industrial operations and its organisational implications.

Journal ArticleDOI
TL;DR: Some of the most widely used, most accessible, and most powerful tools available for the researcher interested in conducting EDM/LA research are highlighted.
Abstract: In recent years, a wide array of tools have emerged for the purposes of conducting educational data mining (EDM) and/or learning analytics (LA) research. In this article, we hope to highlight some ...

Journal ArticleDOI
TL;DR: An overview of data management for smart grids is provided, the added value of Big Data technologies for this kind of data is summarized, and the technical requirements, the tools and the main steps to implement Big Data solutions in the smart grid context are discussed.
Abstract: A smart grid is an intelligent electricity grid that optimizes the generation, distribution and consumption of electricity through the introduction of Information and Communication Technologies on the electricity grid. In essence, smart grids bring profound changes to the information systems that drive them: new information flows coming from the electricity grid, new players such as decentralized producers of renewable energies, new uses such as electric vehicles and connected houses, and new communicating equipment such as smart meters, sensors and remote control points. All this will cause a deluge of data that energy companies will have to face. Big Data technologies offer suitable solutions for utilities, but the decision about which Big Data technology to use is critical. In this paper, we provide an overview of data management for smart grids, summarise the added value of Big Data technologies for this kind of data, and discuss the technical requirements, the tools and the main steps to implement Big Data solutions in the smart grid context.
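A typical building block of such smart-grid data pipelines is windowed aggregation of meter readings; the sketch below computes per-meter hourly consumption totals from invented data:

from collections import defaultdict

readings = [("meter-1", "10:00", 0.4), ("meter-1", "10:30", 0.5),
            ("meter-2", "10:15", 1.1), ("meter-1", "11:00", 0.3)]

hourly = defaultdict(float)
for meter, ts, kwh in readings:
    hour = ts.split(":")[0]            # bucket each reading by hour of day
    hourly[(meter, hour)] += kwh

for (meter, hour), total in sorted(hourly.items()):
    print(meter, f"{hour}:00", round(total, 2), "kWh")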

Proceedings ArticleDOI
01 Apr 2017
TL;DR: This paper surveys and synthesizes a wide spectrum of existing studies on crowdsourced data management and outlines key factors that need to be considered to improve crowdsourced data management.
Abstract: Many important data management and analytics tasks cannot be completely addressed by automated processes. These tasks, such as entity resolution, sentiment analysis, and image recognition, can be enhanced through the use of human cognitive ability. Crowdsourcing is an effective way to harness the capabilities of people (i.e., the crowd) to apply human computation to such tasks. Thus, crowdsourced data management has become an area of increasing interest in research and industry. We identify three important problems in crowdsourced data management: (1) quality control: workers may return noisy or incorrect results, so effective techniques are required to achieve high quality; (2) cost control: the crowd is not free, and cost control aims to reduce the monetary cost; (3) latency control: human workers can be slow, particularly compared to automated computing time scales, so latency-control techniques are required. There has been significant work addressing these three factors for designing crowdsourced tasks, developing crowdsourced data manipulation operators, and optimizing plans consisting of multiple operators. We survey and synthesize a wide spectrum of existing studies on crowdsourced data management.
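One of the simplest quality-control techniques the survey covers is aggregating redundant worker answers by majority vote, sketched below on invented data:

from collections import Counter

answers = {
    "task-1": ["cat", "cat", "dog"],
    "task-2": ["dog", "dog", "dog"],
}

for task, labels in answers.items():
    # Majority vote; the vote share is a crude confidence estimate.
    label, votes = Counter(labels).most_common(1)[0]
    confidence = votes / len(labels)
    print(task, label, f"{confidence:.2f}")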

Journal ArticleDOI
TL;DR: The most pressing unmet needs of BIO PIs are training in data integration, data management, and scaling analyses for HPC—acknowledging that data science skills will be required to build a deeper understanding of life.
Abstract: In a 2016 survey of 704 National Science Foundation (NSF) Biological Sciences Directorate principal investigators (BIO PIs), nearly 90% indicated they are currently or will soon be analyzing large data sets. BIO PIs considered a range of computational needs important to their work, including high performance computing (HPC), bioinformatics support, multistep workflows, updated analysis software, and the ability to store, share, and publish data. Previous studies in the United States and Canada emphasized infrastructure needs. However, BIO PIs said the most pressing unmet needs are training in data integration, data management, and scaling analyses for HPC, acknowledging that data science skills will be required to build a deeper understanding of life. This portends a growing data knowledge gap in biology and challenges institutions and funding agencies to redouble their support for computational training in biology.

Journal ArticleDOI
TL;DR: The proposed system realizes lightweight data encryption, lightweight keyword trapdoor generation and lightweight data recovery, which leaves very few computations to the user's terminal and requires much less communication cost.
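The general idea behind keyword trapdoors in searchable encryption can be sketched with a keyed hash: the server matches trapdoors against an encrypted index without learning the keywords. This is a generic illustration, not this paper's construction:

import hmac, hashlib

KEY = b"shared-secret-key"   # held by the data owner and authorized users

def trapdoor(keyword):
    # Deterministic keyed digest; the server sees only digests, never keywords.
    return hmac.new(KEY, keyword.encode(), hashlib.sha256).hexdigest()

index = {trapdoor("diabetes"): ["doc-3", "doc-9"]}   # encrypted search index
print(index.get(trapdoor("diabetes"), []))           # server-side lookup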

Journal ArticleDOI
01 Dec 2017-BMJ Open
TL;DR: The adoption of the recommendations in this document would help to promote and support data sharing and reuse among researchers, adequately inform trial participants and protect their rights, and provide effective and efficient systems for preparing, storing and accessing data.
Abstract: Objectives We examined major issues associated with sharing of individual clinical trial data and developed a consensus document on providing access to individual participant data from clinical trials, using a broad interdisciplinary approach. Design and methods This was a consensus-building process among the members of a multistakeholder task force, involving a wide range of experts (researchers, patient representatives, methodologists, information technology experts, and representatives from funders, infrastructures and standards development organisations). An independent facilitator supported the process using the nominal group technique. The consensus was reached in a series of three workshops held over 1 year, supported by exchange of documents and teleconferences within focused subgroups when needed. This work was set within the Horizon 2020-funded project CORBEL (Coordinated Research Infrastructures Building Enduring Life-science Services) and coordinated by the European Clinical Research Infrastructure Network. Thus, the focus was on non-commercial trials and the perspective mainly European. Outcome We developed principles and practical recommendations on how to share data from clinical trials. Results The task force reached consensus on 10 principles and 50 recommendations, representing the fundamental requirements of any framework used for the sharing of clinical trials data. The document covers the following main areas: making data sharing a reality (eg, cultural change, academic incentives, funding), consent for data sharing, protection of trial participants (eg, de-identification), data standards, rights, types and management of access (eg, data request and access models), data management and repositories, discoverability, and metadata. Conclusions The adoption of the recommendations in this document would help to promote and support data sharing and reuse among researchers, adequately inform trial participants and protect their rights, and provide effective and efficient systems for preparing, storing and accessing data. The recommendations now need to be implemented and tested in practice. Further work needs to be done to integrate these proposals with those from other geographical areas and other academic domains.
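One recommendation area, protection of trial participants through de-identification, can be illustrated by replacing direct identifiers with salted pseudonyms before data leave the custodian; the field names and salt handling below are a simplified sketch, not the task force's specification:

import hashlib

SALT = b"project-specific-salt"   # kept by the data custodian, never shared
DIRECT_IDENTIFIERS = {"name", "address", "phone"}

def deidentify(record):
    # Drop direct identifiers and pseudonymize the participant ID.
    shared = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    pid = record["participant_id"].encode()
    shared["participant_id"] = hashlib.sha256(SALT + pid).hexdigest()[:12]
    return shared

row = {"participant_id": "T-0042", "name": "A. Person", "age": 57, "arm": "B"}
print(deidentify(row))   # age and arm survive; name and the real ID do not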

Journal ArticleDOI
25 Apr 2017-PeerJ
TL;DR: It is confirmed that only a minority of biomedical journals require data sharing and that higher Impact Factors are significantly associated with journals that have a data sharing requirement; most data sharing policies did not provide specific guidance on the practices that ensure data is maximally available and reusable.
Abstract: BACKGROUND There is wide agreement in the biomedical research community that research data sharing is a primary ingredient for ensuring that science is more transparent and reproducible. Publishers could play an important role in facilitating and enforcing data sharing; however, many journals have not yet implemented data sharing policies and the requirements vary widely across journals. This study set out to analyze the pervasiveness and quality of data sharing policies in the biomedical literature. METHODS The online authors’ instructions and editorial policies for 318 biomedical journals were manually reviewed to analyze each journal's data sharing requirements and characteristics. The data sharing policies were ranked using a rubric to determine if data sharing was required, recommended, required only for omics data, or not addressed at all. The data sharing method and licensing recommendations were examined, as well as any mention of reproducibility or similar concepts. The data was analyzed for patterns relating to publishing volume, Journal Impact Factor, and the publishing model (open access or subscription) of each journal. RESULTS A total of 11.9% of journals analyzed explicitly stated that data sharing was required as a condition of publication. A total of 9.1% of journals required data sharing, but did not state that it would affect publication decisions. 23.3% of journals had a statement encouraging authors to share their data but did not require it. A total of 9.1% of journals mentioned data sharing indirectly, and only 14.8% addressed protein, proteomic, and/or genomic data sharing. There was no mention of data sharing in 31.8% of journals. Impact factors were significantly higher for journals with the strongest data sharing policies compared to all other data sharing criteria. Open access journals were not more likely to require data sharing than subscription journals. DISCUSSION Our study confirmed earlier investigations which observed that only a minority of biomedical journals require data sharing, and a significant association between higher Impact Factors and journals with a data sharing requirement. Moreover, while 65.7% of the journals in our study that required data sharing addressed the concept of reproducibility, as with earlier investigations, we found that most data sharing policies did not provide specific guidance on the practices that ensure data is maximally available and reusable.

Journal ArticleDOI
TL;DR: This work presents an architecture that integrates cloud and fog computing in the 5G environment, working in collaboration with advanced technologies such as SDN and NFV under the network service chaining (NSC) model, and compares core and edge computing with respect to the type of hypervisors, virtualization, security, and node heterogeneity.
Abstract: In the last few years, we have seen an exponential increase in the number of Internet-enabled devices, which has resulted in the popularity of fog and cloud computing among end users. End users expect high data rates coupled with secure data access for various applications executed either at the edge (fog computing) or in the core network (cloud computing). However, the bidirectional data flow between the end users and the devices located at either the edge or core may cause congestion at the cloud data centers, which are used mainly for data storage and data analytics. The high mobility of devices (e.g., vehicles) may also pose additional challenges with respect to data availability and processing at the core data centers. Hence, there is a need to have most of the resources available at the edge of the network to ensure the smooth execution of end-user applications. Considering the challenges of future user demands, we present an architecture that integrates cloud and fog computing in the 5G environment, working in collaboration with advanced technologies such as SDN and NFV under the network service chaining (NSC) model. The NSC service model helps to automate the virtual resources by chaining them in a series for fast computing in both computing technologies. The proposed architecture also supports data analytics and management with respect to device mobility. Moreover, we also compare core and edge computing with respect to the type of hypervisors, virtualization, security, and node heterogeneity. By focusing on node heterogeneity at the edge or core in the 5G environment, we also present security challenges and possible types of attacks on the data shared between different devices in the 5G environment.

Proceedings ArticleDOI
09 May 2017
TL;DR: This tutorial provides a comprehensive review of systems for advanced analytics, integrating ML algorithms and languages with existing data systems such as RDBMSs, and adapting data management-inspired techniques to new systems that target ML workloads.
Abstract: Large-scale data analytics using statistical machine learning (ML), popularly called advanced analytics, underpins many modern data-driven applications. The data management community has been working for over a decade on tackling data management-related challenges that arise in ML workloads, and has built several systems for advanced analytics. This tutorial provides a comprehensive review of such systems and analyzes key data management challenges and techniques. We focus on three complementary lines of work: (1) integrating ML algorithms and languages with existing data systems such as RDBMSs, (2) adapting data management-inspired techniques such as query optimization, partitioning, and compression to new systems that target ML workloads, and (3) combining data management and ML ideas to build systems that improve ML lifecycle-related tasks. Finally, we identify key open data management challenges for future research in this important area.

Journal ArticleDOI
TL;DR: This paper presents ideas for a new generation of agricultural system models that could meet the needs of a growing community of end-users, exemplified by a set of Use Cases, and proposes an implementation strategy that would link a “pre-competitive” space for model development to a “competitive” space for knowledge product development, and through private-public partnerships for new data infrastructure.

Journal ArticleDOI
TL;DR: The nature of data literacy is described and the related skills are enumerated; the application of phenomenographic approaches to data literacy and its relationship to the digital humanities are identified as subjects for further investigation.
Abstract: This paper describes data literacy and emphasizes its importance. Data literacy is vital for researchers who need to become data literate science workers and also for (potential) data management pr...

Journal ArticleDOI
TL;DR: The MERRA Analytic Services (MERRA/AS) as mentioned in this paper are an example of cloud-enabled climate analytics-as-a-service (CAaaS) built on this principle, which enables MapReduce analytics over NASA's Modern-Era Retrospective Analysis for Research and Applications (MERRA) data collection.
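In the spirit of the MapReduce analytics such a service exposes, the sketch below computes monthly mean temperatures with an explicit map, shuffle, and reduce over fabricated records; it illustrates the programming model only, not MERRA/AS itself:

from collections import defaultdict

records = [("2017-01", 278.1), ("2017-01", 279.3), ("2017-02", 280.0)]

def mapper(month, temp_k):
    # Emit (key, value) pairs: month -> temperature in kelvin.
    yield month, temp_k

groups = defaultdict(list)            # shuffle: group values by key
for month, temp in records:
    for k, v in mapper(month, temp):
        groups[k].append(v)

def reducer(month, temps):
    return month, sum(temps) / len(temps)   # monthly mean temperature

print([reducer(m, ts) for m, ts in sorted(groups.items())])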