Topic

Data management

About: Data management is a research topic. Over its lifetime, 31,574 publications have been published on this topic, receiving 424,326 citations.


Papers
Journal Article
02 Sep 2014-PLOS ONE
TL;DR: Presents RTCGAToolbox, an open source and extensible R-based data client for pre-processed TCGA data from Firehose; sample case studies show that it can facilitate data management for researchers interested in working with TCGA data.
Abstract: Background & Objective: Managing data from large-scale projects (such as The Cancer Genome Atlas (TCGA)) for further analysis is an important and time-consuming step for research projects. Several efforts, such as the Firehose project, make TCGA pre-processed data publicly available via web services and data portals, but this information must be managed, downloaded and prepared for subsequent steps. We have developed an open source and extensible R-based data client for pre-processed data from Firehose, and demonstrate its use with sample case studies. Results show that our RTCGAToolbox can facilitate data management for researchers interested in working with TCGA data. The RTCGAToolbox can also be integrated with other analysis pipelines for further data processing.

148 citations

Journal Article
01 Dec 2000
TL;DR: The theoretical issues concerning the problem of answering queries using views, which is to find efficient methods of answering a query using a set of previously materialized views over the database, are surveyed.
Abstract: The problem of answering queries using views is to find efficient methods of answering a query using a set of previously materialized views over the database, rather than accessing the database relations. The problem has recently received significant attention because of its relevance to a wide variety of data management problems, such as query optimization, the maintenance of physical data independence, data integration and data warehousing. This article surveys the theoretical issues concerning the problem of answering queries using views.
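To make the idea concrete, here is a minimal sketch of answering a query from a view instead of from the base relation. It is not taken from the surveyed article. The `orders` table, the `v_eu_orders` view, and the sample rows are all hypothetical, and because SQLite has no materialized views, the view is emulated with `CREATE TABLE ... AS SELECT`.

```python
import sqlite3


def build_demo_db():
    """Create a hypothetical base relation and one materialized view over it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            ("alice", "EU", 250.0),
            ("bob", "EU", 40.0),
            ("carol", "EU", 120.0),
            ("dave", "US", 300.0),
        ],
    )
    # Assumption: a plain table stands in for a materialized view,
    # since SQLite does not support materialized views natively.
    conn.execute(
        "CREATE TABLE v_eu_orders AS "
        "SELECT customer, amount FROM orders WHERE region = 'EU'"
    )
    return conn


# The query as posed against the base relation.
ORIGINAL_QUERY = (
    "SELECT customer FROM orders "
    "WHERE region = 'EU' AND amount > 100 ORDER BY customer"
)

# An equivalent rewriting that touches only the view: the region filter
# is already applied in the view, so only the amount filter remains.
REWRITTEN_QUERY = (
    "SELECT customer FROM v_eu_orders WHERE amount > 100 ORDER BY customer"
)


def run(conn, sql):
    """Return the first column of every result row as a list."""
    return [row[0] for row in conn.execute(sql)]
```

The rewriting is valid here because the view keeps every column the query still needs (`customer`, `amount`) and its filter is implied by the query's own filter. Deciding this in general is the problem the article surveys.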

148 citations

Book
01 Jan 2000
TL;DR: Network management: principles and practice.
Abstract: Network management: principles and practice (Center for Agricultural Information Technology and Information Dissemination).

148 citations

Proceedings Article
27 May 2015
TL;DR: A new approach named factorized learning is introduced that pushes ML computations through joins and avoids redundancy in both I/O and computations and is often substantially faster than the alternatives, but is not always the fastest, necessitating a cost-based approach.
Abstract: Enterprise data analytics is a booming area in the data management industry. Many companies are racing to develop toolkits that closely integrate statistical and machine learning techniques with data management systems. Almost all such toolkits assume that the input to a learning algorithm is a single table. However, most relational datasets are not stored as single tables due to normalization. Thus, analysts often perform key-foreign key joins before learning on the join output. This strategy of learning after joins introduces redundancy avoided by normalization, which could lead to poorer end-to-end performance and maintenance overheads due to data duplication. In this work, we take a step towards enabling and optimizing learning over joins for a common class of machine learning techniques called generalized linear models that are solved using gradient descent algorithms in an RDBMS setting. We present alternative approaches to learn over a join that are easy to implement over existing RDBMSs. We introduce a new approach named factorized learning that pushes ML computations through joins and avoids redundancy in both I/O and computations. We study the tradeoff space for all our approaches both analytically and empirically. Our results show that factorized learning is often substantially faster than the alternatives, but is not always the fastest, necessitating a cost-based approach. We also discuss extensions of all our approaches to multi-table joins as well as to Hive.
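As a rough illustration of pushing a gradient computation through a key-foreign key join, the sketch below compares a gradient computed over joined rows with one computed in separate passes over the two tables. It is not the authors' implementation. The toy schema (rows of `S` are `(y, xs, fk)`, and `R` maps a key to a feature list `xr`), the function names, and the choice of a least-squares loss are assumptions made for illustration.

```python
def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def grad_over_join(S, R, w_s, w_r):
    """Baseline: least-squares gradient computed row by row over the join."""
    g_s = [0.0] * len(w_s)
    g_r = [0.0] * len(w_r)
    for y, xs, fk in S:
        xr = R[fk]  # the R features are re-read for every matching S row
        err = dot(w_s, xs) + dot(w_r, xr) - y
        for i, v in enumerate(xs):
            g_s[i] += err * v
        for j, v in enumerate(xr):
            g_r[j] += err * v
    return g_s, g_r


def grad_factorized(S, R, w_s, w_r):
    """Same gradient, but each R tuple's features are processed only twice."""
    # Pass 1 over R: partial inner product for each R tuple.
    partial = {rid: dot(w_r, xr) for rid, xr in R.items()}

    # Pass over S: reuse the partial products and sum the errors per key.
    g_s = [0.0] * len(w_s)
    err_sum = {rid: 0.0 for rid in R}
    for y, xs, fk in S:
        err = dot(w_s, xs) + partial[fk] - y
        for i, v in enumerate(xs):
            g_s[i] += err * v
        err_sum[fk] += err

    # Pass 2 over R: scale each R tuple's features by its summed error.
    g_r = [0.0] * len(w_r)
    for rid, xr in R.items():
        for j, v in enumerate(xr):
            g_r[j] += err_sum[rid] * v
    return g_s, g_r
```

Both functions return the same gradient up to floating-point rounding. The factorized version avoids repeating the work on `R` features once per matching `S` row, which is the redundancy the abstract refers to. Whether that saving pays off depends on the data, in line with the abstract's remark that a cost-based choice is needed.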

147 citations

Journal Article
TL;DR: An interconnected set of software and services called caCORE, whose caBIO component implements an object-oriented model of the biomedical domain and provides Java, Simple Object Access Protocol and HTTP-XML application programming interfaces, has been used to develop scientific applications that bring together data from distinct genomic and clinical science sources.
Abstract: Motivation: Sites with substantive bioinformatics operations are challenged to build data processing and delivery infrastructure that provides reliable access and enables data integration. Locally generated data must be processed and stored such that relationships to external data sources can be presented. Consistency and comparability across data sets requires annotation with controlled vocabularies and, further, metadata standards for data representation. Programmatic access to the processed data should be supported to ensure the maximum possible value is extracted. Confronted with these challenges at the National Cancer Institute Center for Bioinformatics, we decided to develop a robust infrastructure for data management and integration that supports advanced biomedical applications. Results: We have developed an interconnected set of software and services called caCORE. Enterprise Vocabulary Services (EVS) provide controlled vocabulary, dictionary and thesaurus services. The Cancer Data Standards Repository (caDSR) provides a metadata registry for common data elements. Cancer Bioinformatics Infrastructure Objects (caBIO) implements an object-oriented model of the biomedical domain and provides Java, Simple Object Access Protocol and HTTP–XML application programming interfaces. caCORE has been used to develop scientific applications that bring together data from distinct genomic and clinical science sources. Availability: caCORE downloads and web interfaces can be accessed from links on the caCORE web site (http://ncicb.nci.nih.gov/core). caBIO software is distributed under an open source license that permits unrestricted academic and commercial use. Vocabulary and metadata content in the EVS and caDSR, respectively, is similarly unrestricted, and is available through web applications and FTP downloads.

147 citations


Network Information
Related Topics (5)
Information system: 107.5K papers, 1.8M citations, 90% related
Software: 130.5K papers, 2M citations, 88% related
Cluster analysis: 146.5K papers, 2.9M citations, 83% related
The Internet: 213.2K papers, 3.8M citations, 82% related
Cloud computing: 156.4K papers, 1.9M citations, 81% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2023  218
2022  485
2021  959
2020  1,435
2019  1,745
2018  1,719