Journal ArticleDOI

'Big data', Hadoop and cloud computing in genomics

01 Oct 2013 - Journal of Biomedical Informatics (J Biomed Inform) - Vol. 46, Iss. 5, pp. 774-781
TL;DR: An overview of cloud computing and big data technologies, how such expertise can be used to deal with biology's big data sets, and a review of current Hadoop usage within the bioinformatics community.
About: This article was published in the Journal of Biomedical Informatics on 2013-10-01 and is currently open access. It has received 403 citations to date. The article focuses on the topics: Data-intensive computing & Big data.
Citations
Journal ArticleDOI
TL;DR: The definition, characteristics, and classification of big data are introduced, along with some discussion of cloud computing, and research challenges are investigated with a focus on scalability, availability, data integrity, data transformation, data quality, data heterogeneity, privacy, legal and regulatory issues, and governance.

2,141 citations

Journal ArticleDOI
TL;DR: The proposed MeDShare system is blockchain-based and provides data provenance, auditing, and control for shared medical data in cloud repositories among big data entities, employing smart contracts and an access control mechanism to track the behavior of the data.
Abstract: The dissemination of patients’ medical records results in diverse risks to patients’ privacy, as malicious activities on these records cause severe damage to the reputation, finances, and other interests of all parties related directly or indirectly to the data. Current methods to effectively manage and protect medical records have proved insufficient. In this paper, we propose MeDShare, a system that addresses the issue of medical data sharing among medical big data custodians in a trust-less environment. The system is blockchain-based and provides data provenance, auditing, and control for shared medical data in cloud repositories among big data entities. MeDShare monitors entities that access data for malicious use from a data custodian system. In MeDShare, data transitions and sharing from one entity to the other, along with all actions performed on the MeDShare system, are recorded in a tamper-proof manner. The design employs smart contracts and an access control mechanism to effectively track the behavior of the data and revoke access to offending entities upon detection of a violation of permissions on data. The performance of MeDShare is comparable to current cutting-edge solutions for data sharing among cloud service providers. By implementing MeDShare, cloud service providers and other data guardians will be able to achieve data provenance and auditing while sharing medical data with entities such as research and medical institutions with minimal risk to data privacy.
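
The tamper-proof recording described in the abstract can be illustrated with a hash chain, in which every log entry commits to the hash of the previous one, so that editing any recorded action invalidates all later hashes. The sketch below is a minimal, single-process illustration of that property, assuming SHA-256 over a JSON serialization; it is far simpler than MeDShare's actual blockchain and smart-contract machinery, and the entity and record names are hypothetical.

```python
# Minimal tamper-evident audit log via hash chaining (illustrative sketch only;
# MeDShare's real design uses a blockchain and smart contracts).
import hashlib
import json
import time

class AuditLog:
    """Append-only log; each entry embeds the previous entry's hash."""

    def __init__(self):
        self.entries = []
        self.last_hash = "0" * 64  # genesis value

    def record(self, entity, action, record_id):
        entry = {
            "entity": entity,          # hypothetical entity name
            "action": action,          # e.g. "read", "share"
            "record_id": record_id,
            "timestamp": time.time(),
            "prev_hash": self.last_hash,
        }
        # Hash the entry body; the result chains this entry to its predecessor.
        entry["hash"] = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()
        ).hexdigest()
        self.last_hash = entry["hash"]
        self.entries.append(entry)

    def verify(self):
        """Recompute the chain; any edited entry breaks every later link."""
        prev = "0" * 64
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            digest = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()
            ).hexdigest()
            if entry["prev_hash"] != prev or digest != entry["hash"]:
                return False
            prev = entry["hash"]
        return True

log = AuditLog()
log.record("research_institute_A", "read", "patient_42")
log.record("hospital_B", "share", "patient_42")
assert log.verify()  # tampering with entries[0] would make this fail
```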

819 citations


Cites background from "'Big data', Hadoop and cloud comput..."

  • ...The increased popularity of cloud services has drawn the interest of users, spanning from patients, medical institutions, and research institutions to big corporations, to store their acquired data on cloud repositories [10], [11]....


Journal ArticleDOI
TL;DR: In this article, the authors look at how data-driven techniques are playing a big role in deciphering processing-structure-property-performance relationships in materials, with illustrative examples of both forward models (property prediction) and inverse models (materials discovery).
Abstract: Our ability to collect “big data” has greatly surpassed our capability to analyze it, underscoring the emergence of the fourth paradigm of science, which is data-driven discovery. The need for data informatics is also emphasized by the Materials Genome Initiative (MGI), further boosting the emerging field of materials informatics. In this article, we look at how data-driven techniques are playing a big role in deciphering processing-structure-property-performance relationships in materials, with illustrative examples of both forward models (property prediction) and inverse models (materials discovery). Such analytics can significantly reduce time-to-insight and accelerate cost-effective materials discovery, which is the goal of MGI.
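
As a concrete illustration of the forward-model direction described above (predicting a property from processing and structure descriptors), the sketch below fits a regression model on synthetic data; the three descriptors and the target property are invented for illustration and are not taken from the article or the MGI.

```python
# Forward-model sketch: predict a material property from numeric descriptors.
# All data here are synthetic placeholders for real processing/structure features.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))  # e.g. composition fraction, anneal temp, grain size (scaled)
y = 2.0 * X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)  # toy property

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
print("held-out R^2:", model.score(X_test, y_test))
```

An inverse model runs the other way, searching descriptor space for candidates whose predicted property meets a target, which is the materials-discovery use the abstract mentions.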

705 citations

Journal ArticleDOI
TL;DR: By selectively analyzing the literature, this paper systematically surveys how the adoption of Industry 4.0 technologies (and their integration), applied to the health domain, is changing the way traditional services and products are provided.

431 citations

Journal ArticleDOI
TL;DR: This study comprehensively surveys and classifies the various attributes of Big Data, including its nature, definitions, rapid growth rate, volume, management, analysis, and security, and proposes a data life cycle that uses the technologies and terminologies of Big Data.
Abstract: Big Data has gained much attention from academia and the IT industry. In the digital and computing world, information is generated and collected at a rate that rapidly exceeds the boundary range. Currently, over 2 billion people worldwide are connected to the Internet, and over 5 billion individuals own mobile phones. By 2020, 50 billion devices are expected to be connected to the Internet. At this point, predicted data production will be 44 times greater than that in 2009. As information is transferred and shared at light speed on optic fiber and wireless networks, the volume of data and the speed of market growth increase. However, the fast growth rate of such large data generates numerous challenges, such as the rapid growth of data, transfer speed, diverse data, and security. Nonetheless, Big Data is still in its infancy, and the domain has not been reviewed in general. Hence, this study comprehensively surveys and classifies the various attributes of Big Data, including its nature, definitions, rapid growth rate, volume, management, analysis, and security. This study also proposes a data life cycle that uses the technologies and terminologies of Big Data. Future research directions in this field are determined based on opportunities and several open issues in the Big Data domain. These research directions facilitate the exploration of the domain and the development of optimal techniques to address Big Data.
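
The abstract's growth figures imply a compound annual rate that is easy to check; assuming the 44-fold increase refers to the 2009 to 2020 span quoted above:

```python
# Back-of-envelope check of the quoted figures: 44x growth from 2009 to 2020.
factor, years = 44, 2020 - 2009
annual = factor ** (1 / years) - 1
print(f"44x over {years} years implies ~{annual:.0%} compound growth per year")
# roughly 41% per year, i.e. data volume about doubling every two years
```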

419 citations


Cites background or methods from "'Big data', Hadoop and cloud comput..."

  • ...The functionality of MapReduce has been discussed in detail by [56, 57]....


  • ...To scale the processing of Big Data, map and reduce functions can be performed on small subsets of large datasets [56, 57]....

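The two excerpts above describe the core MapReduce dataflow: a map function runs independently on small splits of a large dataset, a shuffle groups the intermediate pairs by key, and a reduce function aggregates each group. A minimal single-machine sketch of that flow follows, using k-mer counting as an illustrative task (the task and data are invented; Hadoop runs the same steps distributed across a cluster):

```python
# MapReduce dataflow sketch: count 3-mers across "splits" of sequence data.
from collections import defaultdict
from itertools import chain

def map_phase(split):
    """Emit (k-mer, 1) pairs for one small subset of the data."""
    for seq in split:
        for i in range(len(seq) - 2):
            yield seq[i:i + 3], 1

def reduce_phase(pairs):
    """Sum the counts per key, as the reducers do after the shuffle."""
    counts = defaultdict(int)
    for kmer, n in pairs:
        counts[kmer] += n
    return dict(counts)

splits = [["GATTACA", "ACAGATT"], ["TTACAGA"]]  # two toy input splits
pairs = chain.from_iterable(map_phase(s) for s in splits)
print(reduce_phase(pairs))
```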

References
Journal ArticleDOI
TL;DR: The GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
Abstract: Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS—the 1000 Genome pilot alone includes nearly five terabases—make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
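The coverage calculator mentioned in the abstract fits this map/reduce style naturally: map each aligned read to the genomic positions it covers, then reduce by summing the contributions per position. The sketch below shows the idea with reads simplified to (start, length) pairs; it is an illustration of the pattern, not the GATK API.

```python
# Depth-of-coverage in map/reduce style, with reads as toy (start, length) pairs.
from collections import Counter

reads = [(100, 5), (102, 5), (103, 4)]  # toy aligned reads

def map_read(read):
    """Map step: the positions one read covers."""
    start, length = read
    return range(start, start + length)

coverage = Counter()                # reduce step: per-position sums
for read in reads:
    coverage.update(map_read(read))

print(sorted(coverage.items()))     # position 103 is covered by all three reads
```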

20,557 citations


"'Big data', Hadoop and cloud comput..." refers methods in this paper

  • ...Contrail - relies on the graph-theoretic framework of de Bruijn graphs [79]
      CloudBrush - a distributed genome assembler based on string graphs [80]
    RNA sequence analysis:
      Myrna - a cloud computing pipeline for calculating differential gene expression in large RNA sequence datasets [48]
      FX - an RNA sequence analysis tool for the estimation of gene expression levels and genomic variant calling [34]
      Eoulsan - an integrated and flexible solution for RNA sequence data analysis of differential expression [81]
    Sequence file management:
      Hadoop-BAM - a novel library for scalable manipulation of aligned next-generation sequencing data [82]
      SeqWare - a tool set for next-generation genome sequencing technologies, including a LIMS, Pipeline and Query Engine [35]
      GATK - a genome analysis tool-kit for next-generation resequencing data [43]
    Phylogenetic analysis:
      MrsRF - a scalable, efficient multi-core algorithm that uses MapReduce to quickly calculate the all-to-all Robinson-Foulds (RF) distance between large numbers of trees [83]
      Nephele - a set of tools that use the complete composition vector algorithm to cluster sequences into genotypes based on a distance measure [84]
    GPU bioinformatics software:
      GPU-BLAST - an accelerated version of NCBI-BLAST that uses a general-purpose graphics processing unit (GPU), designed to rapidly manipulate and alter memory to accelerate overall algorithm processing [85]
      SOAP3 - a short-read alignment algorithm that uses the multi-processors in a graphics processing unit to achieve ultra-fast alignments [86]
    Search engine implementation:
      Hydra - a protein sequence database search engine specifically designed to run efficiently on the Hadoop MapReduce framework [87]
      CloudBlast - scalable BLAST in the cloud [88]
    Miscellaneous:
      BioDoop - a set of tools with modules for handling Fasta streams, wrappers for BLAST, converting sequences between different formats, and so on [89]
      BlueSNP - an algorithm for computationally intensive analyses, feasible for large genotype-phenotype datasets [90]
      Quake - DNA sequence error detection and correction in sequence reads [91]
      YunBe - a gene set analysis algorithm for biomarker identification in the cloud [92]
      PeakRanger - a multi-purpose peak caller software package for detecting regions from chromatin immunoprecipitation (ChIP) sequence experiments [93]
    ...particular has also blossomed in pioneering cloud and big data technologies in the biological research and medical space....


  • ...One of the first MapReduce projects applied in the biotechnology space resulted in the Genome Analysis Tool Kit (GATK) [43]....


  • ...GATK - a genome analysis tool-kit for next-generation resequencing data [43]...

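The table above notes that Contrail builds on de Bruijn graphs, in which nodes are (k-1)-mers and each k-mer observed in the reads contributes an edge from its prefix to its suffix. A minimal construction sketch follows, with toy reads and none of the error correction or distributed execution that Contrail itself provides:

```python
# Toy de Bruijn graph: nodes are (k-1)-mers, edges come from observed k-mers.
from collections import defaultdict

def de_bruijn(reads, k=4):
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: k-mer prefix -> suffix
    return graph

for node, nbrs in sorted(de_bruijn(["GATTACA", "TTACAGA"]).items()):
    print(node, "->", sorted(nbrs))
```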

Journal ArticleDOI
01 Jan 1998
TL;DR: Integrated circuits will lead to such wonders as home computers or at least terminals connected to a central computer, automatic controls for automobiles, and personal portable communications equipment as mentioned in this paper. But the biggest potential lies in the production of large systems.
Abstract: The future of integrated electronics is the future of electronics itself. The advantages of integration will bring about a proliferation of electronics, pushing this science into many new areas. Integrated circuits will lead to such wonders as home computers—or at least terminals connected to a central computer—automatic controls for automobiles, and personal portable communications equipment. The electronic wristwatch needs only a display to be feasible today. But the biggest potential lies in the production of large systems. In telephone communications, integrated circuits in digital filters will separate channels on multiplex equipment. Integrated circuits will also switch telephone circuits and perform data processing. Computers will be more powerful, and will be organized in completely different ways. For example, memories built of integrated electronics may be distributed throughout the machine instead of being concentrated in a central unit. In addition, the improved reliability made possible by integrated circuits will allow the construction of larger processing units. Machines similar to those in existence today will be built at lower costs and with faster turnaround.

9,647 citations


Book
13 May 2011
TL;DR: The amount of data in the world has been exploding, and analyzing large data sets will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus, according to research by MGI and McKinsey.
Abstract: The amount of data in our world has been exploding, and analyzing large data sets—so-called big data— will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus, according to research by MGI and McKinsey's Business Technology Office. Leaders in every sector will have to grapple with the implications of big data, not just a few data-oriented managers. The increasing volume and detail of information captured by enterprises, the rise of multimedia, social media, and the Internet of Things will fuel exponential growth in data for the foreseeable future.

4,700 citations


"'Big data', Hadoop and cloud comput..." refers methods in this paper

  • ...In the healthcare sector, according to the McKinsey Global Institute, if big data is used effectively, the US healthcare sector could make $300 billion in savings per annum, reducing expenditure by 8% [22]....


Journal ArticleDOI
TL;DR: An interactive system, Galaxy, that combines the power of existing genome annotation databases with a simple Web portal to enable users to search remote resources, combine data from independent queries, and visualize the results.
Abstract: Accessing and analyzing the exponentially expanding genomic sequence and functional data pose a challenge for biomedical researchers. Here we describe an interactive system, Galaxy, that combines the power of existing genome annotation databases with a simple Web portal to enable users to search remote resources, combine data from independent queries, and visualize the results. The heart of Galaxy is a flexible history system that stores the queries from each user; performs operations such as intersections, unions, and subtractions; and links to other computational tools. Galaxy can be accessed at http://g2.bx.psu.edu.
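The intersections, unions, and subtractions that Galaxy's history system performs operate on sets of genomic intervals. The sketch below shows the intersection case, assuming BED-style half-open (chrom, start, end) tuples; it illustrates the operation only and is not Galaxy's implementation.

```python
# Interval intersection over BED-style half-open (chrom, start, end) tuples.
def intersect(a, b):
    out = []
    for chrom_a, s1, e1 in a:
        for chrom_b, s2, e2 in b:
            if chrom_a == chrom_b and s1 < e2 and s2 < e1:  # overlap test
                out.append((chrom_a, max(s1, s2), min(e1, e2)))
    return out

genes = [("chr1", 100, 200), ("chr2", 50, 80)]
peaks = [("chr1", 150, 300)]
print(intersect(genes, peaks))  # -> [('chr1', 150, 200)]
```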

2,071 citations


"'Big data', Hadoop and cloud comput..." refers methods in this paper

  • ...It comes with a user-friendly Graphical User Interface (GUI), along with over 100 pre-installed bioinformatics tools, including Galaxy [31], BioPerl, BLAST, Bioconductor, Glimmer, GeneSpring, ClustalW and EMBOSS utilities, amongst others....
