
Showing papers on "Data management published in 2010"


Proceedings ArticleDOI
06 Jun 2010
TL;DR: This talk describes a set of motivating examples and uses them to explain the features of SciDB, explaining the novel storage manager, array data model, query language, and extensibility frameworks.
Abstract: SciDB [4, 3] is a new open-source data management system intended primarily for use in application domains that involve very large (petabyte) scale array data; for example, scientific applications such as astronomy, remote sensing and climate modeling, bio-science information management, risk management systems in financial applications, and the analysis of web log data. In this talk we will describe our set of motivating examples and use them to explain the features of SciDB. We then briefly give an overview of the project 'in flight', explaining our novel storage manager, array data model, query language, and extensibility frameworks.
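To make the array data model concrete, the sketch below (plain Python with NumPy, not SciDB code; the chunk size and array contents are invented) shows how a dense array can be partitioned into chunks so that a window aggregate reads only the chunks it overlaps, which is the kind of access pattern an array storage manager is built around.

```python
import numpy as np

# A toy chunked 2-D array store, illustrating the idea behind an
# array data model: data are partitioned into fixed-size chunks so
# that window/slice queries touch only the chunks they overlap.
CHUNK = 4  # chunk edge length (assumption for illustration)

class ChunkedArray:
    def __init__(self, data):
        self.shape = data.shape
        self.chunks = {}
        for i in range(0, data.shape[0], CHUNK):
            for j in range(0, data.shape[1], CHUNK):
                self.chunks[(i, j)] = data[i:i+CHUNK, j:j+CHUNK].copy()

    def window_mean(self, r0, r1, c0, c1):
        """Mean over the sub-array [r0:r1, c0:c1], reading only overlapping chunks."""
        total, count = 0.0, 0
        for (i, j), chunk in self.chunks.items():
            if i >= r1 or j >= c1 or i + CHUNK <= r0 or j + CHUNK <= c0:
                continue  # chunk does not overlap the query window
            sub = chunk[max(r0 - i, 0):r1 - i, max(c0 - j, 0):c1 - j]
            total += sub.sum()
            count += sub.size
        return total / count

grid = ChunkedArray(np.arange(64, dtype=float).reshape(8, 8))
print(grid.window_mean(2, 6, 2, 6))
```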

420 citations


Journal ArticleDOI
TL;DR: A survey of powerful visualization techniques, from the obvious to the obscure, to help developers and marketers better understand the power of visualization.
Abstract: Thanks to advances in sensing, networking, and data management, our society is producing digital information at an astonishing rate. According to one estimate, in 2010 alone we will generate 1,200 exabytes -- 60 million times the content of the Library of Congress. Within this deluge of data lies a wealth of valuable information on how we conduct our businesses, governments, and personal lives.

333 citations


Journal ArticleDOI
TL;DR: A matrix-based k-means clustering strategy for data placement in scientific cloud workflows is proposed that dynamically clusters newly generated datasets to the most appropriate data centres, based on dependencies, during the runtime stage.
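As a rough illustration of the clustering idea in such a strategy (not the authors' algorithm; the dependency matrix, dataset count, and number of data centres below are invented), datasets can be described by rows of a dependency matrix and grouped with k-means so that strongly dependent datasets land in the same data centre:

```python
import numpy as np

# Toy illustration of dependency-based data placement: each dataset is
# described by a row of a dependency matrix (how strongly it is used
# together with every other dataset), and k-means groups datasets so
# that strongly dependent ones are placed in the same data centre.
rng = np.random.default_rng(0)
n_datasets, n_centres = 12, 3                     # assumed sizes
dependency = rng.integers(0, 5, (n_datasets, n_datasets)).astype(float)
dependency = (dependency + dependency.T) / 2      # dependencies are symmetric

def kmeans(points, k, iters=20):
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((points[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = points[labels == c].mean(axis=0)
    return labels

placement = kmeans(dependency, n_centres)
for centre in range(n_centres):
    print(f"data centre {centre}: datasets {np.where(placement == centre)[0].tolist()}")
```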

325 citations


Patent
07 Sep 2010
TL;DR: In this article, a unified approach to data management that enables compliance, legal and IT personnel to focus efforts on, e.g., a single data repository, is described, where users can define and utilize information governance policies that help automate and systematize different compliance tasks.
Abstract: Systems and methods of electronic document handling permit organizations to comply with legal or regulatory requirements, electronic discovery and legal hold requirements, and/or other business requirements. The systems described provide a unified approach to data management that enables compliance, legal and IT personnel to focus efforts on, e.g., a single data repository. The systems permit users to define and utilize information governance policies that help automate and systematize different compliance tasks. In some examples, organizations may push data in any third-party data format to the systems described herein. The systems may permit compliance or IT personnel to detect when a legally sensitive production file has been changed or deleted. The systems may also provide a unified dashboard user interface. From a dashboard interface, users may perform searches, participate in collaborative data management workflows, obtain data management reports, and adjust policies. Other elements and features are disclosed herein.

290 citations


Proceedings ArticleDOI
10 Jun 2010
TL;DR: G-Store is designed and implemented using a key-value store as an underlying substrate to provide efficient, scalable, and transactional multi key access, while preserving the desired properties of key-value stores.
Abstract: Cloud computing has emerged as a preferred platform for deploying scalable web-applications. With the growing scale of these applications and the data associated with them, scalable data management systems form a crucial part of the cloud infrastructure. Key-Value stores -- such as Bigtable, PNUTS, Dynamo, and their open source analogues-- have been the preferred data stores for applications in the cloud. In these systems, data is represented as Key-Value pairs, and atomic access is provided only at the granularity of single keys. While these properties work well for current applications, they are insufficient for the next generation web applications -- such as online gaming, social networks, collaborative editing, and many more -- which emphasize collaboration. Since collaboration by definition requires consistent access to groups of keys, scalable and consistent multi key access is critical for such applications. We propose the Key Group abstraction that defines a relationship between a group of keys and is the granule for on-demand transactional access. This abstraction allows the Key Grouping protocol to collocate control for the keys in the group to allow efficient access to the group of keys. Using the Key Grouping protocol, we design and implement G-Store which uses a key-value store as an underlying substrate to provide efficient, scalable, and transactional multi key access. Our implementation using a standard key-value store and experiments using a cluster of commodity machines show that G-Store preserves the desired properties of key-value stores, while providing multi key access functionality at a very low overhead.
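A minimal sketch of the multi-key idea, assuming a toy in-memory key-value store (this is not G-Store's Key Grouping protocol, only an illustration of collocating control for a group of keys behind a single point of coordination so that multi-key updates commit or abort together):

```python
import threading

# Toy sketch of multi-key transactional access layered on a key-value
# store: a "key group" collocates control for its keys behind one lock,
# so updates to all keys in the group commit or abort together.
class KeyValueStore:
    def __init__(self):
        self.data = {}

class KeyGroup:
    def __init__(self, store, keys):
        self.store, self.keys, self.lock = store, set(keys), threading.Lock()

    def transact(self, update_fn):
        """Apply update_fn to a private copy of the group's keys, then
        install the result atomically; an abort leaves the store untouched."""
        with self.lock:
            snapshot = {k: self.store.data.get(k) for k in self.keys}
            try:
                updated = update_fn(dict(snapshot))
            except Exception:
                return False                      # abort: no partial writes
            self.store.data.update({k: v for k, v in updated.items() if k in self.keys})
            return True

store = KeyValueStore()
store.data.update({"player:1": 100, "player:2": 50})
game = KeyGroup(store, ["player:1", "player:2"])

def transfer(state):                              # e.g. an online-gaming credit transfer
    state["player:1"] -= 30
    state["player:2"] += 30
    return state

game.transact(transfer)
print(store.data)   # {'player:1': 70, 'player:2': 80}
```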

268 citations


Patent
27 Dec 2010
TL;DR: In this paper, a system and method for using a snapshot as a data source is described, where the system stores a snapshot and an associated data structure or index to storage media to create a secondary copy of a volume of data.
Abstract: A system and method for using a snapshot as a data source is described. In some cases, the system stores a snapshot and an associated data structure or index to storage media to create a secondary copy of a volume of data. In some cases, the associated index includes application specific data about a file system or other application that created the data to identify the location of the data. The associated index may include three entries, and may be used to facilitate the recovery of data via the snapshot. The snapshot may be used by ancillary applications to perform various functions, such as content indexing, data classification, deduplication, e-discovery, and other functions.

240 citations


Proceedings ArticleDOI
04 Nov 2010
TL;DR: This paper presents a model for smart grid data management based on specific characteristics of cloud computing, such as distributed data management for real-time data gathering, parallel processing forreal-time information retrieval, and ubiquitous access.
Abstract: This paper presents a model for smart grid data management based on specific characteristics of cloud computing, such as distributed data management for real-time data gathering, parallel processing for real-time information retrieval, and ubiquitous access. Applying the cloud computing model meets the requirements of data- and computing-intensive smart grid applications. We gathered these requirements by analyzing the set of well-known smart grid use cases, most of which demand flexible collaboration across organizational boundaries of network operators and energy service providers as well as the active participation of the end user. Hence, preserving confidentiality and privacy, whilst processing the massive amounts of smart grid data, is of paramount importance in the design of the proposed Smart Grid Data Cloud.

230 citations


Journal ArticleDOI
TL;DR: ProHits is a complete open source software solution for MS-based interaction proteomics that manages the entire pipeline from raw MS data files to fully annotated protein-protein interaction datasets and can accommodate multiple instruments within a facility, multiple user groups, multiple laboratory locations, and any number of parallel projects.
Abstract: Affinity purification coupled with mass spectrometric identification (AP-MS) is now a method of choice for charting novel protein-protein interactions, and has been applied to a large number of both small scale and high-throughput studies1. However, general and intuitive computational tools for sample tracking, AP-MS data analysis, and annotation have not kept pace with rapid methodological and instrument improvements. To address this need, we developed the ProHits LIMS platform. ProHits is a complete open source software solution for MS-based interaction proteomics that manages the entire pipeline from raw MS data files to fully annotated protein-protein interaction datasets. ProHits was designed to provide an intuitive user interface from the biologist's perspective, and can accommodate multiple instruments within a facility, multiple user groups, multiple laboratory locations, and any number of parallel projects. ProHits can manage all project scales, and supports common experimental pipelines, including those utilizing gel-based separation, gel-free analysis, and multi-dimensional protein or peptide separation. ProHits is a client-based HTML program written in PHP that runs a MySQL database on a dedicated server. The complete ProHits software solution consists of two main components: a Data Management module, and an Analyst module (Fig. 1a; see Supplementary Fig. 1 for data structure tables). These modules are supported by an Admin Office module, in which projects, instruments, user permissions and protein databases are managed (Supplementary Fig. 2). A simplified version of the software suite (“ProHits Lite”), consisting only of the Analyst module and Admin Office, is also available for users with pre-existing data management solutions or who receive pre-computed search results from analyses performed in a core MS facility (Supplementary Fig. 3). A step-by-step installation package, installation guide and user manual (see Supplementary Information) are available on the ProHits website (www.prohitsMS.com).
Figure 1: Overview of ProHits. (a) Modular organisation of ProHits. The Data Management module backs up all raw mass spectrometry data from acquisition computers, and handles data conversion and database searches. The Analyst module organizes data by project, bait, ...
In the Data Management module, raw data from all mass spectrometers in a facility or user group are copied to a single secure storage location in a scheduled manner. Data are organized in an instrument-specific manner, with folder and file organization mirroring the organization on the acquisition computer. ProHits also assigns unique identifiers to each folder and file. Log files and visual indicators of current connection status assist in monitoring the entire system. The Data Management module monitors the use of each instrument for reporting purposes (Supplementary Fig. 4–5). Raw MS files can be automatically converted to appropriate file formats using the open source ProteoWizard converters (http://proteowizard.sourceforge.net/). Converted files may be subjected to manual or automated database searches, followed by statistical analysis of the search results, according to any user-defined schedule; search engine parameters are also recorded to facilitate reporting and compliance with MIAPE guidelines2. Mascot3, X!Tandem4 and the TransProteomics Pipeline (TPP5) are fully integrated with ProHits via linked search engine servers (Supplementary Fig. 6–7).
The Analyst module organizes data by project, bait, experiment and/or sample, for gel-based or gel-free approaches (Fig. 1a; for description of a gel-based project, see Supplementary Fig. 8). To create and analyze a gel-free affinity purification sample, the user specifies the bait gene name and species. ProHits automatically retrieves the amino acid sequence and other annotation from its associated database. Bait annotation may then be modified as necessary, for example to specify the presence of an epitope tag or mutation (Supplementary Fig. 9). A comprehensive annotation page tracks experimental details (Supplementary Fig. 10), including descriptions of the Sample, Affinity Purification protocol, Peptide Preparation methodology, and LC-MS/MS procedures. Controlled vocabulary lists for experimental descriptions can be added via drop-down menus to facilitate compliance with annotation guidelines such as MIAPE6 and MIMIx7, and to facilitate the organization and retrieval of data files. Free text notes for cross-referencing laboratory notebook pages, adding experimental details not captured in other sections, describing deviations from reference protocols and links to gel images or other file types may be added in the Experimental Detail page. Once an experiment is created, multiple samples may be linked to it, for example technical replicates of the same sample, or chromatographic fractions derived from the same preparation. All baits, experiments, samples and protocols are assigned unique identifiers. Once a sample is created, it is linked to both the relevant raw files and database search results. For multiple samples in HTP projects, automatic sample annotation may be established by using a standardized file naming system (Supplementary Fig. 11), or files may be manually linked. Alternatively, search results obtained outside of ProHits (with the X!Tandem or Mascot search engines) can be manually imported into the Analyst module (Supplementary Fig. 12). The ProHits Lite version enables uploading of external search results for users with an established MS data management system. In the Analyst module, mass spectrometry data can be explored in an intuitive manner, and results from individual samples, experiments or baits can be viewed and filtered (Supplementary Fig. 13–14). A user interface enables alignment of data from multiple baits or MS analyses using the Comparison tool. Data from individual MS runs, or derived from any user-defined sample group, are selected for visualization in a tabular format, for side-by-side comparisons (Fig. 1b; Supplementary Fig. 15–17). In the Comparison view, control groups and individual baits, experiments or samples are displayed by column. Proteins identified in each MS run or group of runs are displayed by row, and each cell corresponds to a putative protein hit, according to user-specified database search score cutoff. Cells display spectral count number, unique peptides, scores from search engines, and/or protein coverage information; a mouse-over function reveals all associated data for each cell in the table. For each protein displayed in the Comparison view, an associated Peptide link (Fig. 1b) may also be selected to reveal information such as sequence, location, spectral counts, and score, for each associated peptide. Importantly, all search results can be filtered. 
For example, ProHits allows for the removal of non-specific background proteins from the hit list, as defined by negative controls, search engine score thresholds, or contaminant lists. Links to the external NCBI and BioGRID8 databases are provided for each hit to facilitate data interpretation. Overlap with published interaction data housed in the BioGRID database8 can be displayed to allow immediate identification of new interaction partners. A flexible export function enables visualization in a graphical format with Cytoscape9, in which spectral counts, unique peptides, and search engine scores can be visualized as interaction edge attributes. The Analyst module also includes advanced search functions, bulk export functions for filtered or unfiltered data, and management of experimental protocols and background lists (e.g. Supplementary Fig. 18–20). Deposition of all mass spectrometry-associated data in public repositories is likely to become mandatory for publication of proteomics experiments2, 7, 10. Open access to raw files is essential for data reanalysis and cross-platform comparison; however, data submission to public repositories can be laborious due to strict formatting requirements. ProHits facilitates extraction of the necessary details in compliance with current standards, and generates Proteomics Standards Initiative (PSI) v2.5 compliant reports11, either in the MITAB format for BioGRID8 or in XML format for submission to IMEx consortium databases12, including IntAct13 (Supplementary Fig. 21). MS raw files associated with a given project can also be easily retrieved and grouped for submission to data repositories such as Tranche14. ProHits has been used to manage many large-scale in-house projects, including a systematic analysis of kinase and phosphatase interactions in yeast, consisting of 986 affinity purifications15. Smaller-scale projects from individual laboratories are readily handled in a similar manner. Examples of AP-MS data from both yeast and mammalian projects are provided in a demonstration version of ProHits at www.prohitsMS.com, and in Supplementary documents. The modular architecture of ProHits will accommodate additional new features, as dictated by future experimental and analytical needs. Although ProHits has been designed to handle protein interaction data, simple modifications of the open source code will enable straightforward adaptation to other proteomics workflows.
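As a rough illustration of the kind of background filtering described for the Comparison view (the protein names, spectral counts, and thresholds below are invented, and this is not ProHits code):

```python
# Toy illustration of filtering AP-MS hits against negative controls,
# contaminant lists and a score/count threshold, the kind of background
# removal described above. All numbers and protein names are invented.
bait_hits = {"PROT_A": 45, "PROT_B": 12, "PROT_C": 3, "KERATIN": 60}
control_hits = {"KERATIN": 55, "PROT_C": 4}        # counts from negative controls
contaminants = {"KERATIN"}                          # user-maintained contaminant list

def filter_hits(hits, controls, contaminants, min_counts=5, fold_over_control=2.0):
    kept = {}
    for protein, counts in hits.items():
        if protein in contaminants or counts < min_counts:
            continue
        if counts < fold_over_control * controls.get(protein, 0):
            continue
        kept[protein] = counts
    return kept

print(filter_hits(bait_hits, control_hits, contaminants))
# {'PROT_A': 45, 'PROT_B': 12}
```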

214 citations


Journal ArticleDOI
TL;DR: A unified yet flexible national citizen science program aimed at tracking invasive species location, abundance, and control efforts could be designed using centralized data sharing and management tools, and a prototype for such a system is presented.
Abstract: Limited resources make it difficult to effectively document, monitor, and control invasive species across large areas, resulting in large gaps in our knowledge of current and future invasion patterns. We surveyed 128 citizen science program coordinators and interviewed 15 of them to evaluate their potential role in filling these gaps. Many programs collect data on invasive species and are willing to contribute these data to public databases. Although resources for education and monitoring are readily available, groups generally lack tools to manage and analyze data. Potential users of these data also retain concerns over data quality. We discuss how to address these concerns about citizen scientist data and programs while preserving the advantages they afford. A unified yet flexible national citizen science program aimed at tracking invasive species location, abundance, and control efforts could be designed using centralized data sharing and management tools. Such a system could meet the needs of multiple stakeholders while allowing efficiencies of scale, greater standardization of methods, and improved data quality testing and sharing. Finally, we present a prototype for such a system (see www.citsci.org).

214 citations


Proceedings ArticleDOI
06 Jun 2010
TL;DR: Ricardo is part of the eXtreme Analytics Platform (XAP) project at the IBM Almaden Research Center, and rests on a decomposition of data-analysis algorithms into parts executed by the R statistical analysis system and parts handled by the Hadoop data management system.
Abstract: Many modern enterprises are collecting data at the most detailed level possible, creating data repositories ranging from terabytes to petabytes in size. The ability to apply sophisticated statistical analysis methods to this data is becoming essential for marketplace competitiveness. This need to perform deep analysis over huge data repositories poses a significant challenge to existing statistical software and data management systems. On the one hand, statistical software provides rich functionality for data analysis and modeling, but can handle only limited amounts of data; e.g., popular packages like R and SPSS operate entirely in main memory. On the other hand, data management systems - such as MapReduce-based systems - can scale to petabytes of data, but provide insufficient analytical functionality. We report our experiences in building Ricardo, a scalable platform for deep analytics. Ricardo is part of the eXtreme Analytics Platform (XAP) project at the IBM Almaden Research Center, and rests on a decomposition of data-analysis algorithms into parts executed by the R statistical analysis system and parts handled by the Hadoop data management system. This decomposition attempts to minimize the transfer of data across system boundaries. Ricardo contrasts with previous approaches, which try to get along with only one type of system, and allows analysts to work on huge datasets from within a popular, well supported, and powerful analysis environment. Because our approach avoids the need to re-implement either statistical or data-management functionality, it can be used to solve complex problems right now.
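The decomposition can be sketched as follows (a toy, in-process illustration rather than Ricardo itself: the "map" and "reduce" functions stand in for Hadoop jobs, and the final fitting step stands in for the R side):

```python
# Sketch of the decomposition idea: the data-parallel part (computing
# sufficient statistics over a large dataset) runs where the data lives,
# and only the small aggregates cross the system boundary to the
# statistical environment, which fits the model.
import random

random.seed(1)
data = [(x, 3.0 * x + random.gauss(0, 0.5)) for x in range(10000)]  # toy (x, y) pairs

def map_partition(partition):
    # Runs on the data management side: per-partition sufficient statistics.
    n = len(partition)
    sx = sum(x for x, _ in partition)
    sy = sum(y for _, y in partition)
    sxx = sum(x * x for x, _ in partition)
    sxy = sum(x * y for x, y in partition)
    return (n, sx, sy, sxx, sxy)

def reduce_stats(stats):
    return tuple(sum(s[i] for s in stats) for i in range(5))

partitions = [data[i::4] for i in range(4)]        # pretend these are HDFS splits
n, sx, sy, sxx, sxy = reduce_stats([map_partition(p) for p in partitions])

# "Analyst side": fit simple least squares from the tiny aggregate.
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
print(round(slope, 3), round(intercept, 3))        # close to 3.0 and 0.0
```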

207 citations


Journal ArticleDOI
TL;DR: The current knowledge in the management and analysis of data in disaster situations is surveyed, and the challenges and future research directions are presented.

Book
01 Jan 2010
TL;DR: The Fraunhofer Knowledge Management Audit (FKM-Audit), as discussed by the authors, is presented as part of a business-process-oriented approach to knowledge management, alongside surveys of knowledge management practice and company case studies.
Abstract: Table of Contents: 1 Introduction / I: Design Fields / 2 Business Process Oriented Knowledge Management / 3 The Fraunhofer Knowledge Management Audit (FKM-Audit) / 4 Motivation for Knowledge Management / 5 Role Models, Human Resources and Strategy / 6 Knowledge Management Tools / 7 Intellectual Capital: Measuring Knowledge Management / II: Survey / 8 Delphi Study on the Future of Knowledge Management - Overview of the Results / 9 A Survey on Good Practices in Knowledge Management in European Companies / 10 How German Companies Employ Knowledge Management. An OECD Survey on Usage, Motivations and Effects / III: Case Studies / 11 Knowledge Management - Results of a Benchmarking Study / 12 Knowledge Management: The Holistic Approach of Arthur D. Little, Inc. / 13 The Aventis Approach to Knowledge Management: Locating Inhouse Expertise for Immediate Anytime, Anywhere Availability / 14 Cultural Change Triggers Best Practice Sharing - British Aerospace plc. / 15 Knowledge Management and Customer Orientation - Hewlett Packard Austria / 16 Knowledge Management in a Global Company - IBM Global Services / 17 Open Minded Corporate Culture and Management Supports the Sharing of External and Internal Knowledge - Phonak / 18 Sharing Process Knowledge in Production Environments - Roche Diagnostics - Laboratory Systems / 19 KnowledgeSharing@MED - Enabling Knowledge Sharing by Turning Knowledge into Business / IV: KM - Made in Europe / 20 Building Communities. Organizational Knowledge Management within the European Commission's Information Society Technologies Programme / List of Figures / References / Recommended Further Readings / Editors / Contributors

Proceedings ArticleDOI
06 Jun 2010
TL;DR: This paper characterizes such users and applications and highlights the resulting principles, such as seamless Web integration, emphasis on ease of use, and incentives for data sharing, that underlie the design of Fusion Tables.
Abstract: It has long been observed that database management systems focus on traditional business applications, and that few people use a database management system outside their workplace. Many have wondered what it will take to enable the use of data management technology by a broader class of users and for a much wider range of applications. Google Fusion Tables represents an initial answer to the question of how data management functionality that focused on enabling new users and applications would look in today's computing environment. This paper characterizes such users and applications and highlights the resulting principles, such as seamless Web integration, emphasis on ease of use, and incentives for data sharing, that underlie the design of Fusion Tables. We describe key novel features, such as the support for data acquisition, collaboration, visualization, and web-publishing.

Book
19 Jan 2010
TL;DR: The integrated Rule-Oriented Data System implements the data management framework required to support policy-based data management and is a highly extensible and tunable system that can enforce management policies, automate administrative tasks, and periodically validate assessment criteria.
Abstract: Policy-based data management enables the creation of community-specific collections. Every collection is created for a purpose. The purpose defines the set of properties that will be associated with the collection. The properties are enforced by management policies that control the execution of procedures that are applied whenever data are ingested or accessed. The procedures generate state information that defines the outcome of enforcing the management policy. The state information can be queried to validate assessment criteria and verify that the required collection properties have been conserved. The integrated Rule-Oriented Data System implements the data management framework required to support policy-based data management. Policies are turned into computer actionable Rules. Procedures are composed from a Micro-service-oriented architecture. The result is a highly extensible and tunable system that can enforce management policies, automate administrative tasks, and periodically validate assessment criteria. Table of Contents: Introduction / Integrated Rule-Oriented Data System / iRODS Architecture / Rule-Oriented Programming / The iRODS Rule System / iRODS Micro-services / Example Rules / Extending iRODS / Appendix A: iRODS Shell Commands / Appendix B: Rulegen Grammar / Appendix C: Exercises / Author Biographies
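A minimal sketch of the policy-based pattern, with invented rule names and events (this mimics the idea of rules firing on ingest and recording queryable state information; it is not the iRODS rule language or its micro-service API):

```python
import hashlib, time

# Minimal sketch of policy-based data management: a management policy is
# expressed as rules (procedures) that fire on events such as "ingest",
# and each procedure records state information that can later be queried
# to validate assessment criteria.
state = []   # state information generated by enforcing policies

def checksum_on_ingest(name, payload):
    digest = hashlib.sha256(payload).hexdigest()
    state.append({"object": name, "property": "checksum", "value": digest,
                  "time": time.time()})

def replicate_on_ingest(name, payload):
    state.append({"object": name, "property": "replicas", "value": 2,
                  "time": time.time()})

POLICY = {"ingest": [checksum_on_ingest, replicate_on_ingest]}

def ingest(name, payload):
    for rule in POLICY["ingest"]:      # enforce every rule attached to the event
        rule(name, payload)

ingest("climate/run42.nc", b"example bytes")

# Assessment criterion: every ingested object must have a recorded checksum.
objects = {s["object"] for s in state}
assert all(any(s["object"] == o and s["property"] == "checksum" for s in state)
           for o in objects)
print(f"{len(state)} state records; assessment criterion satisfied")
```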

Journal ArticleDOI
28 Sep 2010-PLOS ONE
TL;DR: The LONI Pipeline features include distributed grid-enabled infrastructure, virtualized execution environment, efficient integration, data provenance, validation and distribution of new computational tools, automated data format conversion, and an intuitive graphical user interface.
Abstract: Modern computational neuroscience employs diverse software tools and multidisciplinary expertise to analyze heterogeneous brain data. The classical problems of gathering meaningful data, fitting specific models, and discovering appropriate analysis and visualization tools give way to a new class of computational challenges—management of large and incongruous data, integration and interoperability of computational resources, and data provenance. We designed, implemented and validated a new paradigm for addressing these challenges in the neuroimaging field. Our solution is based on the LONI Pipeline environment [3], [4], a graphical workflow environment for constructing and executing complex data processing protocols. We developed study-design, database and visual language programming functionalities within the LONI Pipeline that enable the construction of complete, elaborate and robust graphical workflows for analyzing neuroimaging and other data. These workflows facilitate open sharing and communication of data and metadata, concrete processing protocols, result validation, and study replication among different investigators and research groups. The LONI Pipeline features include distributed grid-enabled infrastructure, virtualized execution environment, efficient integration, data provenance, validation and distribution of new computational tools, automated data format conversion, and an intuitive graphical user interface. We demonstrate the new LONI Pipeline features using large scale neuroimaging studies based on data from the International Consortium for Brain Mapping [5] and the Alzheimer's Disease Neuroimaging Initiative [6]. User guides, forums, instructions and downloads of the LONI Pipeline environment are available at http://pipeline.loni.ucla.edu.

Proceedings ArticleDOI
09 Apr 2010
TL;DR: This paper proposes four data mining models for the Internet of Things: a multi-layer data mining model, a distributed data mining model, a Grid-based data mining model, and a data mining model from a multi-technology integration perspective.
Abstract: In this paper, we propose four data mining models for the Internet of Things: a multi-layer data mining model, a distributed data mining model, a Grid-based data mining model, and a data mining model from a multi-technology integration perspective. Among them, the multi-layer model includes four layers: 1) a data collection layer, 2) a data management layer, 3) an event processing layer, and 4) a data mining service layer. The distributed data mining model can solve the problems that arise from depositing data at different sites. The Grid-based data mining model allows a Grid framework to realize the functions of data mining. The data mining model from the multi-technology integration perspective describes the corresponding framework for the future Internet. Several key issues in data mining for the IoT are also discussed.
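A toy sketch of the multi-layer model follows (the sensor readings, threshold, and "mining" step are invented; the point is only how the four layers hand data to one another):

```python
# Toy sketch of the four-layer model: data collection -> data management
# -> event processing -> data mining service.
import statistics

def data_collection_layer():
    return [{"sensor": "s1", "temp": t} for t in (21.0, 21.4, 35.2, 21.1, 36.0)]

def data_management_layer(readings):
    # e.g. store/index readings; here we simply group them per sensor
    store = {}
    for r in readings:
        store.setdefault(r["sensor"], []).append(r["temp"])
    return store

def event_processing_layer(store, threshold=30.0):
    return [(sensor, t) for sensor, temps in store.items() for t in temps if t > threshold]

def data_mining_service_layer(store):
    return {sensor: statistics.mean(temps) for sensor, temps in store.items()}

readings = data_collection_layer()
store = data_management_layer(readings)
print("events:", event_processing_layer(store))
print("model:", data_mining_service_layer(store))
```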

Patent
16 Nov 2010
TL;DR: In this paper, the authors present a data management engine for performing data management functions, including at least a back-up function to create a backup copy of data from a first storage pool to another storage pool using difference information between time states.
Abstract: Systems and methods for backing-up data from a first storage pool to a second storage pool using difference information between time states are disclosed. The system has a data management engine for performing data management functions, including at least a back-up function to create a back-up copy of data. By executing a sequence of snapshot operations to create point-in-time images of application data on a first storage pool, each successive point-in-time image corresponding to a specific, successive time-state of the application data, a series of snapshots is created. The snapshots are then used to create difference information indicating which application data has changed and the content of the changed application data for the corresponding time state. This difference information is then sent to a second storage pool to create a back-up copy of data for the current time-state.
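The core idea can be sketched in a few lines (block identifiers and contents below are invented; this is an illustration of shipping difference information between time-states, not the patented system):

```python
# Minimal sketch: successive point-in-time snapshots of the application data
# are compared, and only the difference information (which blocks changed,
# and their new content) is sent to the second storage pool.
def diff(prev_snapshot, curr_snapshot):
    changed = {b: data for b, data in curr_snapshot.items()
               if prev_snapshot.get(b) != data}
    deleted = [b for b in prev_snapshot if b not in curr_snapshot]
    return {"changed": changed, "deleted": deleted}

def apply_diff(backup, difference):
    backup.update(difference["changed"])
    for b in difference["deleted"]:
        backup.pop(b, None)

snap_t0 = {"blk0": b"alpha", "blk1": b"beta", "blk2": b"gamma"}
snap_t1 = {"blk0": b"alpha", "blk1": b"BETA!", "blk3": b"delta"}

backup_pool = dict(snap_t0)              # full copy for the first time-state
apply_diff(backup_pool, diff(snap_t0, snap_t1))
assert backup_pool == snap_t1
print("second pool now matches time-state t1")
```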

Journal ArticleDOI
TL;DR: In this paper, the authors examined the relationship between knowledge management, human resource management, and typical knowledge learning goals of an accredited business education program and proposed a theoretical model illustrating how these relationships might overlap.
Abstract: Much has been written on the importance of knowledge management, the challenges facing organizations, and the important human resource management activities involved in assuring the acquisition and transfer of knowledge. Higher business education plays an important role in preparing students to assume the knowledge management and human resource roles so necessary to organizations. The authors examined the relationship between knowledge management, human resource management, and typical knowledge learning goals of an accredited business education program. A theoretical model is presented, illustrating how these relationships might overlap. The model proposes a linkage between knowledge management tenets, human resource management activities in organizations, and Bloom's Revised Taxonomy for planning and evaluating educational goals.


Journal ArticleDOI
TL;DR: A general direction of development is suggested, based on a modular software architecture with a spatial database at its core, where interoperability, data model design and integration with remote-sensing data sources play an important role in successful GPS data handling.
Abstract: To date, the processing of wildlife location data has relied on a diversity of software and file formats. Data management and the following spatial and statistical analyses were undertaken in multiple steps, involving many time-consuming importing/exporting phases. Recent technological advancements in tracking systems have made large, continuous, high-frequency datasets of wildlife behavioural data available, such as those derived from the global positioning system (GPS) and other animal-attached sensor devices. These data can be further complemented by a wide range of other information about the animals' environment. Management of these large and diverse datasets for modelling animal behaviour and ecology can prove challenging, slowing down analysis and increasing the probability of mistakes in data handling. We address these issues by critically evaluating the requirements for good management of GPS data for wildlife biology. We highlight that dedicated data management tools and expertise are needed. We explore current research in wildlife data management. We suggest a general direction of development, based on a modular software architecture with a spatial database at its core, where interoperability, data model design and integration with remote-sensing data sources play an important role in successful GPS data handling.
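As a minimal stand-in for the "spatial database at the core" idea (a real deployment would use a spatial database such as PostGIS; the SQLite table, GPS fixes, and bounding-box query below are invented for illustration):

```python
import sqlite3

# GPS fixes from animal-borne sensors go into one table, and analyses query
# the database directly instead of shuffling files between tools.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE gps_fix (
    animal_id TEXT, acquired_at TEXT, lon REAL, lat REAL, temperature REAL)""")
con.executemany("INSERT INTO gps_fix VALUES (?, ?, ?, ?, ?)", [
    ("roe_deer_01", "2010-05-01T06:00", 11.05, 46.01, 7.5),
    ("roe_deer_01", "2010-05-01T07:00", 11.07, 46.02, 8.1),
    ("roe_deer_02", "2010-05-01T06:30", 11.40, 46.20, 6.9),
])

# Which animals used the study area (a bounding box), and how many fixes each?
rows = con.execute("""
    SELECT animal_id, COUNT(*) FROM gps_fix
    WHERE lon BETWEEN 11.0 AND 11.1 AND lat BETWEEN 46.0 AND 46.1
    GROUP BY animal_id""").fetchall()
print(rows)   # [('roe_deer_01', 2)]
```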

Journal ArticleDOI
TL;DR: This paper reviewed studies of online and blended learning in management-oriented disciplines and management-related topics and concluded that although several multi-course studies have been published, there is ample opportunity for research within the respective management disciplines.
Abstract: This paper reviews studies of online and blended learning in management-oriented disciplines and management-related topics. The review shows that over the last decade, this emerging field has seen dramatic conceptual, methodological, and analytical advances. However, these advances have progressed within the particular disciplines at uneven rates. Studies examining courses in Organizational Behavior and Strategic Management have seen the most progress, with courses in Human Resources, Operations Management, and International Management receiving lesser attention. To date, studies of courses in Entrepreneurship are next to non-existent. Our review suggests that although several multi-course studies have been published, there is ample opportunity for research within the respective management disciplines. We also suggest topics and methodological issues requiring further study, including stronger delineations between online and blended management education; further examination of participant characteristics,...

Journal ArticleDOI
TL;DR: This round table paper suggests goals for data sharing and a work plan for reaching them, and challenges respondents to move beyond well intentioned but largely aspirational data sharing plans.
Abstract: Epidemiologists and public health researchers are moving very slowly in the data sharing revolution, and agencies that maintain global health databases are reluctant to share data too. Once investments in infrastructure have been made, recycling and combining data provide access to maximum knowledge for minimal additional cost. By refusing to share data, researchers are slowing progress towards reducing illness and death and are denying a public good to taxpayers who support most of the research. Funders of public health research are beginning to call for change and developing data sharing policies. However they are not yet adequately addressing the obstacles that underpin the failure to share data. These include professional structures that reward publication of analysis but not of data, and funding streams and career paths that continue to undervalue critical data management work. Practical issues need to be sorted out too: how and where should data be stored for the long term, who will control access, and who will pay for those services? Existing metadata standards need to be extended to cope with health data. These obstacles have been known for some time; most can be overcome in the field of public health just as they have been overcome in other fields. However no institution has taken the lead in defining a work plan and carving up the tasks and the bill. In this round table paper, we suggest goals for data sharing and a work plan for reaching them, and challenge respondents to move beyond well intentioned but largely aspirational data sharing plans.

Proceedings ArticleDOI
10 Jun 2010
TL;DR: The inner workings of Fusion Tables are described, including the storage of data in the system and the tight integration with the Google Maps infrastructure.
Abstract: Google Fusion Tables is a cloud-based service for data management and integration. Fusion Tables enables users to upload tabular data files (spreadsheets, CSV, KML), currently of up to 100MB. The system provides several ways of visualizing the data (e.g., charts, maps, and timelines) and the ability to filter and aggregate the data. It supports the integration of data from multiple sources by performing joins across tables that may belong to different users. Users can keep the data private, share it with a select set of collaborators, or make it public and thus crawlable by search engines. The discussion feature of Fusion Tables allows collaborators to conduct detailed discussions of the data at the level of tables and individual rows, columns, and cells. This paper describes the inner workings of Fusion Tables, including the storage of data in the system and the tight integration with the Google Maps infrastructure.

Proceedings ArticleDOI
13 Nov 2010
TL;DR: This paper ran experiments using three typical workflow applications on Amazon's EC2, and investigated some of the ways in which data can be managed for workflows in the cloud.
Abstract: Efficient data management is a key component in achieving good performance for scientific workflows in distributed environments. Workflow applications typically communicate data between tasks using files. When tasks are distributed, these files are either transferred from one computational node to another, or accessed through a shared storage system. In grids and clusters, workflow data is often stored on network and parallel file systems. In this paper we investigate some of the ways in which data can be managed for workflows in the cloud. We ran experiments using three typical workflow applications on Amazon's EC2. We discuss the various storage and file systems we used, describe the issues and problems we encountered deploying them on EC2, and analyze the resulting performance and cost of the workflows.

Patent
11 Oct 2010
TL;DR: In this paper, a host driver embedded in an application server connects an application and its data to a cluster and captures real-time data transactions, preferably in the form of an event journal that is provided to the data management system.
Abstract: A “forward” delta data management technique uses a “sparse” index associated with a delta file to achieve both delta management efficiency and to eliminate read latency while accessing history data. The invention may be implemented advantageously in a data management system that provides real-time data services to data sources associated with a set of application host servers. A host driver embedded in an application server connects an application and its data to a cluster. The host driver captures real-time data transactions, preferably in the form of an event journal that is provided to the data management system. In particular, the driver functions to translate traditional file/database/block I/O into a continuous, application-aware, output data stream. A given application-aware data stream is processed through a multi-stage data reduction process to produce a compact data representation from which an “any point-in-time” reconstruction of the original data can be made.
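A toy sketch of a delta log with a sparse index follows (the event layout, index spacing, and helper names are invented; the point is that the index lets a reader seek close to a history record instead of scanning the whole log):

```python
import io, json

# Change events are appended to a delta log, and every Nth event's byte
# offset is recorded in a small sparse index, so locating history data
# seeks near the right spot rather than reading the log from the start.
SPARSE_EVERY = 100

delta_log = io.BytesIO()
sparse_index = []            # list of (event_number, byte_offset)

def append_event(event_number, event):
    if event_number % SPARSE_EVERY == 0:
        sparse_index.append((event_number, delta_log.tell()))
    delta_log.write((json.dumps(event) + "\n").encode())

for n in range(1000):
    append_event(n, {"n": n, "key": f"k{n % 7}", "value": n * n})

def find_event(target_n):
    """Locate one event's record, seeking from the nearest sparse-index entry
    instead of scanning the whole log (this is what removes read latency)."""
    start_n, offset = max((e for e in sparse_index if e[0] <= target_n),
                          key=lambda e: e[0])
    delta_log.seek(offset)
    for line in delta_log:
        event = json.loads(line)
        if event["n"] == target_n:
            return event
    return None

print(find_event(457))
```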

Journal ArticleDOI
TL;DR: Within this deluge of data lies a wealth of valuable information on how we conduct our businesses, governments, and personal lives; to put that information to good use, we must find ways to explore, relate, and communicate the data meaningfully.
Abstract: Thanks to advances in sensing, networking, and data management, our society is producing digital information at an astonishing rate. According to one estimate, in 2010 alone we will generate 1,200 exabytes -- 60 million times the content of the Library of Congress. Within this deluge of data lies a wealth of valuable information on how we conduct our businesses, governments, and personal lives. To put the information to good use, we must find ways to explore, relate, and communicate the data meaningfully.

Journal ArticleDOI
01 May 2010
TL;DR: The IIAS is used to close the loop between the insulin pump and the continuous glucose monitoring system, by providing the pump with the appropriate insulin infusion rate in order to keep the patient's glucose levels within predefined limits.
Abstract: SMARTDIAB is a platform designed to support the monitoring, management, and treatment of patients with type 1 diabetes mellitus (T1DM), by combining state-of-the-art approaches in the fields of database (DB) technologies, communications, simulation algorithms, and data mining. SMARTDIAB consists mainly of two units: 1) the patient unit (PU); and 2) the patient management unit (PMU), which communicate with each other for data exchange. The PMU can be accessed by the PU through the internet using devices, such as PCs/laptops with direct internet access or mobile phones via a Wi-Fi/General Packet Radio Service access network. The PU consists of an insulin pump for subcutaneous insulin infusion to the patient and a continuous glucose measurement system. The aforementioned devices running a user-friendly application gather patient-related information and transmit it to the PMU. The PMU consists of a diabetes data management system (DDMS), a decision support system (DSS) that provides risk assessment for long-term diabetes complications, and an insulin infusion advisory system (IIAS), which reside on a Web server. The DDMS can be accessed by both medical personnel and patients, with appropriate security access rights and front-end interfaces. The DDMS, apart from being used for data storage/retrieval, also provides advanced tools for the intelligent processing of the patient's data, supporting the physician in decision making regarding the patient's treatment. The IIAS is used to close the loop between the insulin pump and the continuous glucose monitoring system, by providing the pump with the appropriate insulin infusion rate in order to keep the patient's glucose levels within predefined limits. The pilot version of SMARTDIAB has already been implemented, while the platform's evaluation in a clinical environment is in progress.
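As a deliberately naive sketch of a closed loop of this kind (this is not the IIAS algorithm; the constants are invented and the rule is for illustration only, not clinical use):

```python
# Toy closed-loop sketch: the continuous glucose reading drives an
# infusion-rate recommendation that tries to keep glucose inside
# predefined limits, using a simple proportional rule.
TARGET_MG_DL = 110.0
LOW, HIGH = 70.0, 180.0          # predefined limits (invented)
BASAL_U_PER_H, GAIN = 0.8, 0.01  # invented controller constants

def infusion_rate(glucose_mg_dl):
    if glucose_mg_dl < LOW:
        return 0.0                            # suspend insulin when glucose is low
    return max(0.0, BASAL_U_PER_H + GAIN * (glucose_mg_dl - TARGET_MG_DL))

for reading in (65, 95, 140, 210):
    print(reading, "mg/dL ->", round(infusion_rate(reading), 2), "U/h")
```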

Patent
20 May 2010
TL;DR: A computer program is disclosed that provides a secure workflow environment through a cloud computing facility, wherein the secure workflow environment may be adapted to provide a plurality of users with a workspace adaptable to provide secure document management and secure communications management, where the users comprise at least two classes of user, including a participant and a subscriber.
Abstract: Disclosed is a computer program that provides a secure workflow environment through a cloud computing facility, wherein the secure workflow environment may be adapted to (1) provide a plurality of users with a workspace adaptable to provide secure document management and secure communications management, wherein the users comprise at least two classes of user, including a participant and a subscriber, the subscriber having control authority within the workspace that exceeds that of the participant and the participant having control over at least some of the participant's own interactions with the workspace, (2) maintain a secure instance of each communication provided by each of the users such that each communication can be managed, (3) maintain a secure instance of each document interaction provided by each user such that each interaction can be managed; and extending the secure workflow environment to the users through a secure network connection.

Journal ArticleDOI
TL;DR: A workflow management system named Ergatis enables users to build, execute and monitor pipelines for computational analysis of genomics data; it was designed to be accessible to a broad class of users and provides a user-friendly, web-based interface.
Abstract: Motivation: The growth of sequence data has been accompanied by an increasing need to analyze data on distributed computer clusters. The use of these systems for routine analysis requires scalable and robust software for data management of large datasets. Software is also needed to simplify data management and make large-scale bioinformatics analysis accessible and reproducible to a wide class of target users. Results: We have developed a workflow management system named Ergatis that enables users to build, execute and monitor pipelines for computational analysis of genomics data. Ergatis contains preconfigured components and template pipelines for a number of common bioinformatics tasks such as prokaryotic genome annotation and genome comparisons. Outputs from many of these components can be loaded into a Chado relational database. Ergatis was designed to be accessible to a broad class of users and provides a user-friendly, web-based interface. Ergatis supports high-throughput batch processing on distributed compute clusters and has been used for data management in a number of genome annotation and comparative genomics projects. Availability: Ergatis is an open-source project and is freely available at http://ergatis.sourceforge.net Contact: jorvis@users.sourceforge.net

Patent
19 Jan 2010
TL;DR: In this article, a Personal Data Propagation Environment (PDP) is proposed to facilitate the propagation of personal data between secure personal data stores and various consumers of the personal data items.
Abstract: Methods and systems for facilitating the propagation of personal data are provided. Example embodiments provide a Personal Data Propagation Environment ("PDP environment"), which facilitates the propagation of personal data items between secure personal data stores and various consumers of the personal data items. In one embodiment, the PDP environment includes a personal data manager and a personal data subscriber. The personal data manager manages personal data items on a secure data store associated with a user computing device. The personal data manager provides access to personal data items stored on the secure data store in accordance with a personal data subscription associated with the personal data subscriber. This abstract is provided to comply with rules requiring an abstract, and it is submitted with the intention that it will not be used to interpret or limit the scope or meaning of the claims.