Topic

Data management

About: Data management is a research topic. Over its lifetime, 31,574 publications have been published within this topic, receiving 424,326 citations.

Papers

Journal Article•DOI•

RTCGAToolbox: a new tool for exporting TCGA Firehose data.

Mehmet Kemal Samur¹•Institutions (1)

Harvard University¹

02 Sep 2014-PLOS ONE

TL;DR: Presents an open source and extensible R-based data client for pre-processed data from Firehose; results show that RTCGAToolbox can facilitate data management for researchers interested in working with TCGA data.

Abstract: Background & Objective: Managing data from large-scale projects, such as The Cancer Genome Atlas (TCGA), for further analysis is an important and time-consuming step for research projects. Several efforts, such as the Firehose project, make TCGA pre-processed data publicly available via web services and data portals, but this information must be managed, downloaded and prepared for subsequent steps. We have developed an open source and extensible R-based data client for pre-processed data from Firehose, and demonstrate its use with sample case studies. Results show that our RTCGAToolbox can facilitate data management for researchers interested in working with TCGA data. The RTCGAToolbox can also be integrated with other analysis pipelines for further data processing.

148 citations

Journal Article•DOI•

Theory of answering queries using views

Alon Halevy¹•Institutions (1)

University of Washington¹

01 Dec 2000

TL;DR: The theoretical issues concerning the problem of answering queries using views, which is to find efficient methods of answering a query using a set of previously materialized views over the database, are surveyed.

Abstract: The problem of answering queries using views is to find efficient methods of answering a query using a set of previously materialized views over the database, rather than accessing the database relations. The problem has recently received significant attention because of its relevance to a wide variety of data management problems, such as query optimization, the maintenance of physical data independence, data integration and data warehousing. This article surveys the theoretical issues concerning the problem of answering queries using views.

148 citations
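
As a rough illustration of the idea summarized in the entry above, the following sketch answers the same query once from a base relation and once from a previously materialized view. The table `orders`, the view `v_orders_2000`, and all values are hypothetical, and SQLite has no materialized views, so a table filled from a query stands in for one. This is a toy instance of a query rewriting, not a method taken from the survey.

```python
import sqlite3

# In-memory database with a hypothetical base relation and a stand-in
# for a materialized view (a table populated from a query).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL, year INTEGER);
INSERT INTO orders VALUES (1, 'ann', 10.0, 2000), (2, 'bob', 20.0, 2000), (3, 'ann', 5.0, 1999);
CREATE TABLE v_orders_2000 AS
    SELECT id, customer, amount FROM orders WHERE year = 2000;
""")

# Original query, expressed over the base relation.
q_base = """
SELECT customer, SUM(amount) FROM orders
WHERE year = 2000 GROUP BY customer ORDER BY customer
"""

# Equivalent rewriting that touches only the view: the view already
# applied the selection year = 2000 and kept every column the query needs.
q_view = """
SELECT customer, SUM(amount) FROM v_orders_2000
GROUP BY customer ORDER BY customer
"""

base_result = conn.execute(q_base).fetchall()
view_result = conn.execute(q_view).fetchall()
print(base_result == view_result)
```

The rewriting is valid here only because the view's selection matches the query's and the view retains the `customer` and `amount` columns; a view that dropped `amount` could not answer this query.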

Book•

Network Management Principles and Practice

Mani M. Subramanian

01 Jan 2000

TL;DR: Network management: principles and practice.

Abstract: Network management: principles and practice (Center for Agricultural Information Technology and Information Services).

148 citations

Proceedings Article•DOI•

Learning Generalized Linear Models Over Normalized Data

Arun Kumar¹, Jeffrey F. Naughton¹, Jignesh M. Patel¹•Institutions (1)

University of Wisconsin-Madison¹

27 May 2015

TL;DR: Introduces factorized learning, an approach that pushes ML computations through joins and avoids redundancy in both I/O and computation; it is often substantially faster than the alternatives but not always the fastest, which necessitates a cost-based approach.

Abstract: Enterprise data analytics is a booming area in the data management industry. Many companies are racing to develop toolkits that closely integrate statistical and machine learning techniques with data management systems. Almost all such toolkits assume that the input to a learning algorithm is a single table. However, most relational datasets are not stored as single tables due to normalization. Thus, analysts often perform key-foreign key joins before learning on the join output. This strategy of learning after joins introduces redundancy avoided by normalization, which could lead to poorer end-to-end performance and maintenance overheads due to data duplication. In this work, we take a step towards enabling and optimizing learning over joins for a common class of machine learning techniques called generalized linear models that are solved using gradient descent algorithms in an RDBMS setting. We present alternative approaches to learn over a join that are easy to implement over existing RDBMSs. We introduce a new approach named factorized learning that pushes ML computations through joins and avoids redundancy in both I/O and computations. We study the tradeoff space for all our approaches both analytically and empirically. Our results show that factorized learning is often substantially faster than the alternatives, but is not always the fastest, necessitating a cost-based approach. We also discuss extensions of all our approaches to multi-table joins as well as to Hive.

147 citations
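
The core intuition behind pushing computation through a key-foreign key join, as described in the entry above, can be sketched as follows. A linear model's inner product over a joined row splits into a part from the fact table and a part from the dimension table, and the second part can be computed once per dimension tuple and reused. The tables `S` and `R`, the weights, and the function names are hypothetical toy values, and the sketch covers only the inner-product step, not the paper's full gradient descent procedure or its cost model.

```python
# Hypothetical toy data. R is the dimension table (rid -> features);
# S is the fact table, each row holding a foreign key into R plus its own features.
R = {
    1: [1.0, 2.0],
    2: [0.5, -1.0],
}
S = [
    (1, [3.0]),
    (1, [4.0]),
    (2, [5.0]),
    (2, [6.0]),
]
w_S = [0.1]        # weights for the S-side features
w_R = [0.2, 0.3]   # weights for the R-side features


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def scores_materialized(S, R, w_S, w_R):
    """Join first: build each wide row, then compute w . x on it."""
    w = w_S + w_R
    return [dot(w, x_S + R[fk]) for fk, x_S in S]


def scores_factorized(S, R, w_S, w_R):
    """Compute the R-side partial inner product once per R tuple and reuse it."""
    partial_R = {rid: dot(w_R, x_R) for rid, x_R in R.items()}
    return [dot(w_S, x_S) + partial_R[fk] for fk, x_S in S]


joined = scores_materialized(S, R, w_S, w_R)
factored = scores_factorized(S, R, w_S, w_R)
print(all(abs(a - b) < 1e-9 for a, b in zip(joined, factored)))
```

In this toy case the R-side product is evaluated twice instead of four times; the saving grows with the number of S rows that share each R tuple.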

Journal Article•DOI•

caCORE: a common infrastructure for cancer informatics.

Peter A. Covitz¹, Frank W. Hartel², Carl F. Schaefer², Sherri de Coronado², Gilberto Fragoso², Himanso Sahni², Scott Gustafson², Kenneth H. Buetow²•Institutions (2)

United States Department of Health and Human Services¹, National Institutes of Health²

12 Dec 2003-Bioinformatics

TL;DR: An interconnected set of software and services called caCORE, whose caBIO component implements an object-oriented model of the biomedical domain and provides Java, Simple Object Access Protocol and HTTP-XML application programming interfaces, has been used to develop scientific applications that bring together data from distinct genomic and clinical science sources.

Abstract: Motivation: Sites with substantive bioinformatics operations are challenged to build data processing and delivery infrastructure that provides reliable access and enables data integration. Locally generated data must be processed and stored such that relationships to external data sources can be presented. Consistency and comparability across data sets requires annotation with controlled vocabularies and, further, metadata standards for data representation. Programmatic access to the processed data should be supported to ensure the maximum possible value is extracted. Confronted with these challenges at the National Cancer Institute Center for Bioinformatics, we decided to develop a robust infrastructure for data management and integration that supports advanced biomedical applications. Results: We have developed an interconnected set of software and services called caCORE. Enterprise Vocabulary Services (EVS) provide controlled vocabulary, dictionary and thesaurus services. The Cancer Data Standards Repository (caDSR) provides a metadata registry for common data elements. Cancer Bioinformatics Infrastructure Objects (caBIO) implements an object-oriented model of the biomedical domain and provides Java, Simple Object Access Protocol and HTTP–XML application programming interfaces. caCORE has been used to develop scientific applications that bring together data from distinct genomic and clinical science sources. Availability: caCORE downloads and web interfaces can be accessed from links on the caCORE web site (http://ncicb.nci.nih.gov/core). caBIO software is distributed under an open source license that permits unrestricted academic and commercial use. Vocabulary and metadata content in the EVS and caDSR, respectively, is similarly unrestricted, and is available through web applications and FTP downloads.

147 citations

Metrics

Papers: 32,259

Citations: 465,338

No. of papers in the topic in previous years
Year	Papers
2023	218
2022	485
2021	959
2020	1,435
2019	1,745
2018	1,719