
Data management

About: Data management is a research topic. Over its lifetime, 31,574 publications have been published within this topic, receiving 424,326 citations.


Papers
Journal ArticleDOI
TL;DR: ProHits is a complete open source software solution for MS-based interaction proteomics that manages the entire pipeline from raw MS data files to fully annotated protein-protein interaction datasets and can accommodate multiple instruments within a facility, multiple user groups, multiple laboratory locations, and any number of parallel projects.
Abstract: Affinity purification coupled with mass spectrometric identification (AP-MS) is now a method of choice for charting novel protein-protein interactions, and has been applied to a large number of both small scale and high-throughput studies1. However, general and intuitive computational tools for sample tracking, AP-MS data analysis, and annotation have not kept pace with rapid methodological and instrument improvements. To address this need, we developed the ProHits LIMS platform. ProHits is a complete open source software solution for MS-based interaction proteomics that manages the entire pipeline from raw MS data files to fully annotated protein-protein interaction datasets. ProHits was designed to provide an intuitive user interface from the biologist's perspective, and can accommodate multiple instruments within a facility, multiple user groups, multiple laboratory locations, and any number of parallel projects. ProHits can manage all project scales, and supports common experimental pipelines, including those utilizing gel-based separation, gel-free analysis, and multi-dimensional protein or peptide separation. ProHits is a web-based application written in PHP that uses a MySQL database on a dedicated server. The complete ProHits software solution consists of two main components: a Data Management module, and an Analyst module (Fig. 1a; see Supplementary Fig. 1 for data structure tables). These modules are supported by an Admin Office module, in which projects, instruments, user permissions and protein databases are managed (Supplementary Fig. 2). A simplified version of the software suite (“ProHits Lite”), consisting only of the Analyst module and Admin Office, is also available for users with pre-existing data management solutions or who receive pre-computed search results from analyses performed in a core MS facility (Supplementary Fig. 3). 
A step-by-step installation package, installation guide and user manual (see Supplementary Information) are available on the ProHits website (www.prohitsMS.com). Figure 1 Overview of ProHits. (a) Modular organisation of ProHits. The Data Management module backs up all raw mass spectrometry data from acquisition computers, and handles data conversion and database searches. The Analyst module organizes data by project, bait, ... In the Data Management module, raw data from all mass spectrometers in a facility or user group are copied to a single secure storage location in a scheduled manner. Data are organized in an instrument-specific manner, with folder and file organization mirroring the organization on the acquisition computer. ProHits also assigns unique identifiers to each folder and file. Log files and visual indicators of current connection status assist in monitoring the entire system. The Data Management module monitors the use of each instrument for reporting purposes (Supplementary Fig. 4–5). Raw MS files can be automatically converted to appropriate file formats using the open source ProteoWizard converters (http://proteowizard.sourceforge.net/). Converted files may be subjected to manual or automated database searches, followed by statistical analysis of the search results, according to any user-defined schedule; search engine parameters are also recorded to facilitate reporting and compliance with MIAPE guidelines2. Mascot3, X!Tandem4 and the TransProteomics Pipeline (TPP5) are fully integrated with ProHits via linked search engine servers (Supplementary Fig. 6–7). The Analyst module organizes data by project, bait, experiment and/or sample, for gel-based or gel-free approaches (Fig. 1a; for description of a gel-based project, see Supplementary Fig. 8). To create and analyze a gel-free affinity purification sample, the user specifies the bait gene name and species. 
ProHits automatically retrieves the amino acid sequence and other annotation from its associated database. Bait annotation may then be modified as necessary, for example to specify the presence of an epitope tag or mutation (Supplementary Fig. 9). A comprehensive annotation page tracks experimental details (Supplementary Fig. 10), including descriptions of the Sample, Affinity Purification protocol, Peptide Preparation methodology, and LC-MS/MS procedures. Controlled vocabulary lists for experimental descriptions can be added via drop-down menus to facilitate compliance with annotation guidelines such as MIAPE6 and MIMIx7, and to facilitate the organization and retrieval of data files. Free text notes for cross-referencing laboratory notebook pages, adding experimental details not captured in other sections, describing deviations from reference protocols and links to gel images or other file types may be added in the Experimental Detail page. Once an experiment is created, multiple samples may be linked to it, for example technical replicates of the same sample, or chromatographic fractions derived from the same preparation. All baits, experiments, samples and protocols are assigned unique identifiers. Once a sample is created, it is linked to both the relevant raw files and database search results. For multiple samples in HTP projects, automatic sample annotation may be established by using a standardized file naming system (Supplementary Fig. 11), or files may be manually linked. Alternatively, search results obtained outside of ProHits (with the X!Tandem or Mascot search engines) can be manually imported into the Analyst module (Supplementary Fig. 12). The ProHits Lite version enables uploading of external search results for users with an established MS data management system. 
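The automatic sample annotation via a standardized file naming system described above can be sketched as a small parser. This is an illustrative Python sketch (ProHits itself is written in PHP); the naming convention, regex, and function name are assumptions for demonstration, not ProHits' actual format:

```python
import re

# Hypothetical naming convention: <project>_<bait>_<replicate>.raw
# (illustrative only; a real deployment would make this configurable).
NAME_PATTERN = re.compile(r"^(?P<project>[^_]+)_(?P<bait>[^_]+)_(?P<rep>\d+)\.raw$")

def annotate_from_filename(filename):
    """Derive sample annotation from a standardized raw-file name,
    the way automatic sample linking in a HTP project might work."""
    m = NAME_PATTERN.match(filename)
    if m is None:
        return None  # fall back to manual linking
    return {"project": m.group("project"),
            "bait": m.group("bait"),
            "replicate": int(m.group("rep"))}
```

A file that does not match the convention simply falls through to manual linking, mirroring the two paths the text describes.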
In the Analyst module, mass spectrometry data can be explored in an intuitive manner, and results from individual samples, experiments or baits can be viewed and filtered (Supplementary Fig. 13–14). A user interface enables alignment of data from multiple baits or MS analyses using the Comparison tool. Data from individual MS runs, or derived from any user-defined sample group, are selected for visualization in a tabular format, for side-by-side comparisons (Fig. 1b; Supplementary Fig. 15–17). In the Comparison view, control groups and individual baits, experiments or samples are displayed by column. Proteins identified in each MS run or group of runs are displayed by row, and each cell corresponds to a putative protein hit, according to user-specified database search score cutoff. Cells display spectral count number, unique peptides, scores from search engines, and/or protein coverage information; a mouse-over function reveals all associated data for each cell in the table. For each protein displayed in the Comparison view, an associated Peptide link (Fig. 1b) may also be selected to reveal information such as sequence, location, spectral counts, and score, for each associated peptide. Importantly, all search results can be filtered. For example, ProHits allows for the removal of non-specific background proteins from the hit list, as defined by negative controls, search engine score thresholds, or contaminant lists. Links to the external NCBI and BioGRID8 databases are provided for each hit to facilitate data interpretation. Overlap with published interaction data housed in the BioGRID database8 can be displayed to allow immediate identification of new interaction partners. A flexible export function enables visualization in a graphical format with Cytoscape9, in which spectral counts, unique peptides, and search engine scores can be visualized as interaction edge attributes. 
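The Comparison view described above is essentially a protein-by-bait matrix with background filtering. A minimal sketch of that idea in Python, assuming a simple nested-dict input; the function name and data layout are illustrative, not ProHits' actual API:

```python
def comparison_table(runs, min_spectra=2, background=frozenset()):
    """Build a protein-by-bait spectral-count matrix and drop
    non-specific background proteins, in the spirit of a
    Comparison view.

    runs: {bait_name: {protein: spectral_count}}
    """
    proteins = sorted(
        {p for counts in runs.values() for p, n in counts.items()
         if n >= min_spectra and p not in background}
    )
    # one row per protein, one column per bait/run
    return {p: {bait: runs[bait].get(p, 0) for bait in runs}
            for p in proteins}
```

Background proteins (e.g. common contaminants such as keratins) vanish from every row, while a score threshold removes weak one-off identifications.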
The Analyst module also includes advanced search functions, bulk export functions for filtered or unfiltered data, and management of experimental protocols and background lists (e.g. Supplementary Fig. 18–20). Deposition of all mass spectrometry-associated data in public repositories is likely to become mandatory for publication of proteomics experiments2, 7, 10. Open access to raw files is essential for data reanalysis and cross-platform comparison; however, data submission to public repositories can be laborious due to strict formatting requirements. ProHits facilitates extraction of the necessary details in compliance with current standards, and generates Proteomics Standards Initiative (PSI) v2.5-compliant reports11, either in the MITAB format for BioGRID8 or in XML format for submission to IMEx consortium databases12, including IntAct13 (Supplementary Fig. 21). MS raw files associated with a given project can also be easily retrieved and grouped for submission to data repositories such as Tranche14. ProHits was developed to manage many large-scale in-house projects, including a systematic analysis of kinase and phosphatase interactions in yeast, consisting of 986 affinity purifications15. Smaller-scale projects from individual laboratories are readily handled in a similar manner. Examples of AP-MS data from both yeast and mammalian projects are provided in a demonstration version of ProHits at www.prohitsMS.com, and in Supplementary documents. The modular architecture of ProHits will accommodate additional new features, as dictated by future experimental and analytical needs. Although ProHits has been designed to handle protein interaction data, simple modifications of the open source code will enable straightforward adaptation to other proteomics workflows.
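The PSI-MITAB 2.5 export mentioned above is a tab-delimited format with 15 fixed columns, where "-" marks unknown values and controlled-vocabulary terms carry psi-mi identifiers. A simplified sketch of generating one such line in Python (not ProHits' actual export code; the chosen CV terms and defaults are illustrative):

```python
def mitab25_row(bait_id, prey_id, pubmed="-", taxid="9606"):
    """Emit one PSI-MITAB 2.5 line (15 tab-separated columns)
    for a bait-prey pair; '-' marks fields left unspecified."""
    cols = [
        f"uniprotkb:{bait_id}",            # unique id, interactor A
        f"uniprotkb:{prey_id}",            # unique id, interactor B
        "-", "-",                          # alternative ids A/B
        "-", "-",                          # aliases A/B
        'psi-mi:"MI:0004"(affinity chromatography technology)',
        "-",                               # first author
        f"pubmed:{pubmed}",
        f"taxid:{taxid}", f"taxid:{taxid}",
        'psi-mi:"MI:0915"(physical association)',
        "-",                               # source database
        "-",                               # interaction identifier
        "-",                               # confidence score
    ]
    return "\t".join(cols)
```

Because the column order is fixed by the specification, emitting rows this way keeps the file loadable by any MITAB-aware repository tool.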

214 citations

Journal ArticleDOI
TL;DR: A unified yet flexible national citizen science program aimed at tracking invasive species location, abundance, and control efforts could be designed using centralized data sharing and management tools, and a prototype for such a system is presented.
Abstract: Limited resources make it difficult to effectively document, monitor, and control invasive species across large areas, resulting in large gaps in our knowledge of current and future invasion patterns. We surveyed 128 citizen science program coordinators and interviewed 15 of them to evaluate their potential role in filling these gaps. Many programs collect data on invasive species and are willing to contribute these data to public databases. Although resources for education and monitoring are readily available, groups generally lack tools to manage and analyze data. Potential users of these data also retain concerns over data quality. We discuss how to address these concerns about citizen scientist data and programs while preserving the advantages they afford. A unified yet flexible national citizen science program aimed at tracking invasive species location, abundance, and control efforts could be designed using centralized data sharing and management tools. Such a system could meet the needs of multiple stakeholders while allowing efficiencies of scale, greater standardization of methods, and improved data quality testing and sharing. Finally, we present a prototype for such a system (see www.citsci.org).
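The "improved data quality testing" a centralized system could offer amounts to automated validation of incoming reports. A minimal sketch of such checks in Python, assuming hypothetical field names (these are not from citsci.org):

```python
from datetime import date

def validate_observation(obs):
    """Basic quality checks a centralized citizen-science portal
    might run on an invasive-species report before accepting it."""
    errors = []
    for field in ("species", "lat", "lon", "date"):
        if field not in obs:
            errors.append(f"missing field: {field}")
    if "lat" in obs and not -90 <= obs["lat"] <= 90:
        errors.append("latitude out of range")
    if "lon" in obs and not -180 <= obs["lon"] <= 180:
        errors.append("longitude out of range")
    if "date" in obs and obs["date"] > date.today():
        errors.append("observation date in the future")
    return errors
```

Centralizing even simple checks like these gives every member program the same baseline of data quality without requiring each to build its own tooling.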

214 citations

Journal ArticleDOI
01 Aug 2016
TL;DR: Magellan is novel in four important aspects: it provides how-to guides that tell users what to do in each EM scenario, step by step, and provides tools to help users do these steps; the tools seek to cover the entire EM pipeline, not just matching and blocking as current EM systems do.
Abstract: Entity matching (EM) has been a long-standing challenge in data management. Most current EM works focus only on developing matching algorithms. We argue that far more efforts should be devoted to building EM systems. We discuss the limitations of current EM systems, then present as a solution Magellan, a new kind of EM system. Magellan is novel in four important aspects. (1) It provides how-to guides that tell users what to do in each EM scenario, step by step. (2) It provides tools to help users do these steps; the tools seek to cover the entire EM pipeline, not just matching and blocking as current EM systems do. (3) Tools are built on top of the data analysis and Big Data stacks in Python, allowing Magellan to borrow a rich set of capabilities in data cleaning, IE, visualization, learning, etc. (4) Magellan provides a powerful scripting environment to facilitate interactive experimentation and quick "patching" of the system. We describe research challenges raised by Magellan, then present extensive experiments with 44 students and users at several organizations that show the promise of the Magellan approach.
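The blocking-then-matching pipeline the abstract refers to can be illustrated with a toy example: blocking restricts comparisons to pairs sharing a cheap key, then a similarity function scores the surviving candidates. This Python sketch shows the idea only; the function names and the simple Jaccard matcher are illustrative, not Magellan's actual API:

```python
from collections import defaultdict

def block_by_key(table_a, table_b, key):
    """Blocking step: only pair up records that share a blocking
    key (e.g. same zip code), shrinking the candidate set from
    |A| x |B| to something tractable."""
    buckets = defaultdict(list)
    for b in table_b:
        buckets[b[key]].append(b)
    return [(a, b) for a in table_a for b in buckets.get(a[key], [])]

def match(pairs, threshold=0.3):
    """Matching step: score each candidate pair with a cheap
    Jaccard similarity over name tokens and keep likely matches."""
    def jaccard(x, y):
        sx, sy = set(x.lower().split()), set(y.lower().split())
        return len(sx & sy) / len(sx | sy) if sx | sy else 0.0
    return [(a, b) for a, b in pairs
            if jaccard(a["name"], b["name"]) >= threshold]
```

A real EM system would replace the Jaccard heuristic with a learned matcher, but the two-stage shape of the pipeline is the same.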

214 citations

Journal ArticleDOI
TL;DR: Two models are presented for measuring knowledge management performance and knowledge management behaviours: a performance framework based on the balanced scorecard approach, and a behaviour framework that identifies levels of practice demonstrated by individuals.
Abstract: Measuring the business benefits of knowledge management is difficult, even more so for public sector agencies whose outcomes are social benefits rather than simple profit. Describes an approach for measuring the performance of knowledge management strategies for a public sector agency in Victoria, Australia. Knowledge management is defined as those actions which support collaboration and integration. Two models are presented for measuring knowledge management performance and knowledge management behaviours: a performance framework based on the balanced scorecard approach, and a behaviour framework that identifies levels of practice demonstrated by individuals. The Knowledge Management Performance Scorecard maps the objectives for knowledge management across the balanced scorecard’s key result areas. The Knowledge Management Behaviour Framework identifies seven levels of knowledge management skills for demonstrating collaborative behaviour. The framework also outlines what might be typical behaviours of managers and the roles they would assume in relation to individuals at each level.

214 citations

Journal ArticleDOI
01 Aug 2009
TL;DR: Modern data management applications often require integrating available data sources and providing a uniform interface for users to access data from different sources, and such requirements have been driving fruitful research on data integration over the last two decades.
Abstract: The amount of information produced in the world increases by 30% every year and this rate will only go up. With advanced network technology, more and more sources are available either over the Internet or in enterprise intranets. Modern data management applications, such as setting up Web portals, managing enterprise data, managing community data, and sharing scientific data, often require integrating available data sources and providing a uniform interface for users to access data from different sources; such requirements have been driving fruitful research on data integration over the last two decades [11, 13].
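The "uniform interface over heterogeneous sources" idea above is classically realized with a mediator and per-source wrappers: each wrapper translates a uniform query into its source's native terms and maps results back to a common schema. A minimal Python sketch of the pattern, with invented class names (this is the general virtual-integration idea, not any specific system):

```python
class Mediator:
    """Routes a uniform query to every registered source wrapper
    and concatenates the mapped results."""
    def __init__(self):
        self.wrappers = []

    def register(self, wrapper):
        self.wrappers.append(wrapper)

    def query(self, **criteria):
        results = []
        for w in self.wrappers:
            results.extend(w.search(criteria))  # source-specific translation
        return results

class ListWrapper:
    """Wraps an in-memory list of dicts standing in for one source;
    field_map translates uniform field names to source column names."""
    def __init__(self, rows, field_map):
        self.rows, self.field_map = rows, field_map

    def search(self, criteria):
        out = []
        for row in self.rows:
            uniform = {u: row[s] for u, s in self.field_map.items()}
            if all(uniform.get(k) == v for k, v in criteria.items()):
                out.append(uniform)
        return out
```

The caller never sees that one source calls the field "ttl" and another "name"; schema differences are absorbed entirely by the wrappers.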

213 citations


Network Information
Related Topics (5)
- Information system: 107.5K papers, 1.8M citations (90% related)
- Software: 130.5K papers, 2M citations (88% related)
- Cluster analysis: 146.5K papers, 2.9M citations (83% related)
- The Internet: 213.2K papers, 3.8M citations (82% related)
- Cloud computing: 156.4K papers, 1.9M citations (81% related)
Performance Metrics
No. of papers in the topic in previous years:

Year: Papers
2023: 218
2022: 485
2021: 959
2020: 1,435
2019: 1,745
2018: 1,719