Author

Yaru Chen

Bio: Yaru Chen is an academic researcher. The author has contributed to research in topics: Cancer & Medicine. The author has an h-index of 4, co-authored 9 publications receiving 2551 citations.
Topics: Cancer, Medicine, Germline mutation, Indel, Germline

Papers
Journal ArticleDOI
TL;DR: Fastp is developed as an ultra-fast FASTQ preprocessor with useful quality control and data-filtering features that can perform quality control, adapter trimming, quality filtering, per-read quality pruning and many other operations with a single scan of the FASTQ data.
Abstract: Motivation Quality control and preprocessing of FASTQ files are essential to providing clean data for downstream analysis. Traditionally, a different tool is used for each operation, such as quality control, adapter trimming and quality filtering. These tools are often insufficiently fast as most are developed using high-level programming languages (e.g. Python and Java) and provide limited multi-threading support. Reading and loading data multiple times also renders preprocessing slow and I/O inefficient. Results We developed fastp as an ultra-fast FASTQ preprocessor with useful quality control and data-filtering features. It can perform quality control, adapter trimming, quality filtering, per-read quality pruning and many other operations with a single scan of the FASTQ data. This tool is developed in C++ and has multi-threading support. Based on our evaluation, fastp is 2-5 times faster than other FASTQ preprocessing tools such as Trimmomatic or Cutadapt despite performing far more operations than similar tools. Availability and implementation The open-source code and corresponding instructions are available at https://github.com/OpenGene/fastp.
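The single-scan idea described in this abstract can be illustrated with a minimal Python sketch. This is not fastp's implementation (fastp is written in C++); the function names `parse_fastq` and `single_pass_qc`, the Phred+33 encoding and the mean-quality threshold of 20 are all assumptions made for illustration. The point is only that statistics gathering and filtering can share one pass over the records instead of re-reading the file once per tool.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str]  # (read name, sequence, quality string)


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield records from well-formed, 4-line-per-record FASTQ text."""
    it = iter(lines)
    for header in it:
        seq = next(it).rstrip("\n")
        next(it)  # skip the '+' separator line
        qual = next(it).rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual  # drop the leading '@'


def single_pass_qc(
    lines: Iterable[str], min_mean_q: float = 20.0, offset: int = 33
) -> Tuple[List[Record], Dict[str, int]]:
    """Collect QC statistics and filter reads in one scan (hypothetical helper)."""
    stats = {"total": 0, "passed": 0, "bases": 0}
    kept: List[Record] = []
    for name, seq, qual in parse_fastq(lines):
        # Statistics are updated for every read, kept or not.
        stats["total"] += 1
        stats["bases"] += len(seq)
        # Quality filtering happens in the same loop iteration.
        scores = [ord(c) - offset for c in qual]
        mean_q = sum(scores) / len(scores) if scores else 0.0
        if mean_q >= min_mean_q:
            stats["passed"] += 1
            kept.append((name, seq, qual))
    return kept, stats
```

A real preprocessor would stream the kept reads to an output file rather than hold them in a list; the list is used here only to keep the sketch short.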

7,461 citations

Posted ContentDOI
01 Mar 2018-bioRxiv
TL;DR: Fastp is developed as an ultra-fast FASTQ preprocessor with useful quality control and data-filtering features that can perform quality control, adapter trimming, quality filtering, per-read quality cutting, and many other operations with a single scan of the FASTQ data.
Abstract: Motivation: Quality control and preprocessing of FASTQ files are essential to providing clean data for downstream analysis. Traditionally, a different tool is used for each operation, such as quality control, adapter trimming, and quality filtering. These tools are often insufficiently fast as most are developed using high level programming languages (e.g., Python and Java) and provide limited multithreading support. Reading and loading data multiple times also renders preprocessing slow and I/O inefficient. Results: We developed fastp as an ultra-fast FASTQ preprocessor with useful quality control and data-filtering features. It can perform quality control, adapter trimming, quality filtering, per read quality cutting, and many other operations with a single scan of the FASTQ data. It also supports unique molecular identifier preprocessing, poly tail trimming, output splitting, and base correction for paired-end data. It can automatically detect adapters for single-end and paired-end FASTQ data. This tool is developed in C++ and has multithreading support. Based on our evaluation, fastp is 2 to 5 times faster than other FASTQ preprocessing tools such as Trimmomatic or Cutadapt despite performing far more operations than similar tools. Availability and Implementation: The open-source code and corresponding instructions are available at https://github.com/OpenGene/fastp

4,300 citations

Journal ArticleDOI
TL;DR: This paper presents gencore, an ultra-fast, simple, lightweight but powerful tool for duplicate removal and sequence error suppression of NGS data, which is the only duplicate-removing tool that generates both informative HTML and JSON reports.
Abstract: Removing duplicates might be considered a well-resolved problem in the next-generation sequencing (NGS) data processing domain. However, as NGS technology gains more recognition in clinical applications, researchers have started to pay more attention to its sequencing errors, and prefer to remove these errors while performing deduplication. Recently, a new technology called unique molecular identifier (UMI) has been developed to better identify sequencing reads derived from different DNA fragments. Most existing duplicate-removing tools cannot handle UMI-integrated data. Some modern tools can work with UMIs, but are usually slow and use too much memory. Furthermore, existing tools rarely report rich statistical results, which are very important for quality control and downstream analysis. These unmet requirements drove us to develop an ultra-fast, simple, lightweight but powerful tool for duplicate removal and sequence error suppression, with features for handling UMIs and reporting informative results. This paper presents gencore, an efficient tool for duplicate removal and sequence error suppression of NGS data. This tool clusters the mapped sequencing reads and merges the reads in each cluster to generate a single consensus read. While the consensus read is generated, the random errors introduced by library construction and sequencing can be removed. This error-suppressing feature makes gencore very suitable for detecting ultra-low frequency mutations from deep sequencing data. When unique molecular identifier (UMI) technology is applied, gencore can use the UMIs to identify the reads derived from the same original DNA fragment. Gencore reports statistical results in both HTML and JSON formats. The HTML report contains many interactive figures plotting coverage and duplication statistics. The JSON report contains all the statistical results and can be parsed by downstream programs. Compared with conventional tools like Picard and SAMtools, gencore greatly reduces the output data’s mapping mismatches, which are mostly caused by errors. Compared with newer tools like UMI-Reducer and UMI-tools, gencore runs much faster, uses less memory, generates better consensus reads and provides simpler interfaces. To the best of our knowledge, gencore is the only duplicate-removing tool that generates both informative HTML and JSON reports. This tool is available at: https://github.com/OpenGene/gencore
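The cluster-then-merge step described in this abstract can be sketched in a few lines of Python. This is a deliberately simplified illustration, not gencore's algorithm: it assumes reads are already mapped and reduced to hypothetical `(chrom, start, umi, sequence)` tuples, groups them by exact key, and builds the consensus by a per-position majority vote with ties going to the base seen first. The names `cluster_reads`, `consensus` and `deduplicate` are invented for this example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Read = Tuple[str, int, str, str]  # (chrom, start, umi, sequence), a simplified stand-in
Key = Tuple[str, int, str]


def cluster_reads(reads: Iterable[Read]) -> Dict[Key, List[str]]:
    """Group reads that share mapping position and UMI."""
    clusters: Dict[Key, List[str]] = defaultdict(list)
    for chrom, start, umi, seq in reads:
        clusters[(chrom, start, umi)].append(seq)
    return clusters


def consensus(seqs: List[str]) -> str:
    """Majority vote per column over the shortest common length."""
    length = min(len(s) for s in seqs)
    return "".join(
        Counter(s[i] for s in seqs).most_common(1)[0][0] for i in range(length)
    )


def deduplicate(reads: Iterable[Read]) -> Dict[Key, str]:
    """Collapse each cluster to one consensus read, which also votes out random errors."""
    return {key: consensus(seqs) for key, seqs in cluster_reads(reads).items()}
```

With three reads in one cluster reading `ACGT`, `ACGA` and `ACGT`, the vote at the last position is two to one for `T`, so the isolated `A` is treated as an error and removed from the consensus.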

41 citations

Journal ArticleDOI
TL;DR: Plausible genetic susceptibility was found in 4.7% of lung cancer patients, suggesting the germline mutations of the P and LP groups were risk factors for lung cancer, and somatic mutation analysis revealed no significant difference in tumor mutation burden among the groups, although a trend of lower TMB in the pathogenic group was found.
Abstract: Background Germline variations may contribute to lung cancer susceptibility besides environmental factors. The influence of germline mutations on lung cancer susceptibility and their correlation with somatic mutations has not been systematically investigated. Methods In this study, germline mutations from 1,026 non-small cell lung cancer (NSCLC) patients were analyzed with a 58-gene next-generation sequencing (NGS) panel containing known hereditary cancer-related genes, and were categorized based on American College of Medical Genetics and Genomics (ACMG) guidelines in pathogenicity, and the corresponding somatic mutations were analyzed using a 605-gene NGS panel containing known cancer-related genes. Results Plausible genetic susceptibility was found in 4.7% of lung cancer patients, in which 14 patients with pathogenic mutations (P group) and 34 patients with likely-pathogenic mutations (LP group) were identified. The ratio of the first degree relatives with lung cancer history of the P groups was significantly higher than the Non-P group (P=0.009). The ratio of lung cancer patients with history of other cancers was higher in P (P=0.0007) or LP (P=0.017) group than the Non-P group. Pathogenic mutations fell most commonly in BRCA2, followed by CHEK2 and ATM. Likely-pathogenic mutations fell most commonly in NTRK1 and EXT2, followed by BRIP1 and PALB2. These genes are involved in DNA repair, cell cycle regulation and tumor suppression. By comparing the germline mutation frequency from this study with that from the whole population or East Asian population (gnomAD database), we found that the overall odds ratio (OR) for P or LP group was 17.93 and 15.86, respectively, when compared with the whole population, and was 2.88 and 3.80, respectively, when compared with the East Asian population, suggesting the germline mutations of the P and LP groups were risk factors for lung cancer. 
Somatic mutation analysis revealed no significant difference in tumor mutation burden (TMB) among the groups, although a trend of lower TMB in the pathogenic group was found. The SNV/INDEL mutation frequency of TP53 in the P group was significantly lower than the other two groups, and the copy number variation (CNV) mutation frequency of PIK3CA and MET was significantly higher than the Non-P group. Pathway enrichment analysis found no significant difference in aberrant pathways among the three groups. Conclusions A proportion of 4.7% of patients carrying germline variants may be potentially linked to increased susceptibility to lung cancer. Patients with pathogenic germline mutations exhibited stronger family history and higher lung cancer risk.
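The odds ratios quoted in this abstract compare carrier frequency in the patient cohort with that in a reference population. The abstract does not state how they were computed; the standard definition from a 2x2 table is shown here, with the counts a, b, c and d introduced purely for illustration (a and b are carriers and non-carriers among patients, c and d are carriers and non-carriers in the reference population).

```latex
\mathrm{OR} = \frac{a / b}{c / d} = \frac{a\,d}{b\,c}
```

An OR above 1 indicates that carriers are over-represented among patients relative to the reference population.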

29 citations

Posted ContentDOI
19 Dec 2018-bioRxiv
TL;DR: This paper presents gencore, an efficient tool that eliminates errors and duplicates in next-generation sequencing (NGS) data and greatly reduces the output data’s mapping mismatches, which are mostly caused by errors.
Abstract: Summary This paper presents gencore, an efficient tool to eliminate errors and duplicates in next-generation sequencing (NGS) data. This tool clusters the mapped sequencing reads and merges each cluster to generate one consensus read. If the data has unique molecular identifiers (UMIs), gencore uses them to identify the reads derived from the same original DNA fragment. Compared with the conventional tool Picard, gencore greatly reduces the output data’s mapping mismatches, which are mostly caused by errors. This error-suppressing feature makes gencore very suitable for detecting ultra-low frequency mutations from deep sequencing data. Compared with Picard, gencore is about 3X faster and uses much less memory. Availability and Implementation gencore is an open-source tool written in C++. It is hosted on GitHub: https://github.com/OpenGene/gencore Contact chen@haplox.com

4 citations


Cited by
Journal ArticleDOI
TL;DR: The expression of proinflammatory genes, especially chemokines, was markedly elevated in COVID-19 cases compared to community-acquired pneumonia patients and healthy controls, suggesting that SARS-CoV-2 infection causes hypercytokinemia.

767 citations

Journal ArticleDOI
TL;DR: In this article, using monoclonal antibodies (mAbs), animal immune sera, human convalescent sera and human sera from recipients of the BNT162b2 mRNA vaccine, the authors report the impact on antibody neutralization of a panel of authentic SARS-CoV-2 variants including a B.1.1.7 isolate, chimeric strains with South African or Brazilian spike genes and isogenic recombinant viral variants.
Abstract: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused the global COVID-19 pandemic. Rapidly spreading SARS-CoV-2 variants may jeopardize newly introduced antibody and vaccine countermeasures. Here, using monoclonal antibodies (mAbs), animal immune sera, human convalescent sera and human sera from recipients of the BNT162b2 mRNA vaccine, we report the impact on antibody neutralization of a panel of authentic SARS-CoV-2 variants including a B.1.1.7 isolate, chimeric strains with South African or Brazilian spike genes and isogenic recombinant viral variants. Many highly neutralizing mAbs engaging the receptor-binding domain or N-terminal domain and most convalescent sera and mRNA vaccine-induced immune sera showed reduced inhibitory activity against viruses containing an E484K spike mutation. As antibodies binding to spike receptor-binding domain and N-terminal domain demonstrate diminished neutralization potency in vitro against some emerging variants, updated mAb cocktails targeting highly conserved regions, enhancement of mAb potency or adjustments to the spike sequences of vaccines may be needed to prevent loss of protection in vivo.

716 citations

Journal ArticleDOI
25 Jan 2021-Science
TL;DR: In this article, the authors map how all mutations to the receptor binding domain (RBD) of SARS-CoV-2 affect binding by the antibodies in the REGN-COV2 cocktail and the antibody LY-CoV016.
Abstract: Antibodies are a potential therapy for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), but the risk of the virus evolving to escape them remains unclear. Here we map how all mutations to the receptor binding domain (RBD) of SARS-CoV-2 affect binding by the antibodies in the REGN-COV2 cocktail and the antibody LY-CoV016. These complete maps uncover a single amino acid mutation that fully escapes the REGN-COV2 cocktail, which consists of two antibodies, REGN10933 and REGN10987, targeting distinct structural epitopes. The maps also identify viral mutations that are selected in a persistently infected patient treated with REGN-COV2 and during in vitro viral escape selections. Finally, the maps reveal that mutations escaping the individual antibodies are already present in circulating SARS-CoV-2 strains. These complete escape maps enable interpretation of the consequences of mutations observed during viral surveillance.

620 citations

Journal ArticleDOI
TL;DR: The SARS-CoV-2 Omicron BA.1 variant, which emerged in 2021, has multiple mutations in its spike protein; a marked change in its antigenicity increases its evasion of therapeutic monoclonal and vaccine-elicited polyclonal neutralizing antibodies after two doses.
Abstract: The SARS-CoV-2 Omicron BA.1 variant emerged in 2021 and has multiple mutations in its spike protein. Here we show that the spike protein of Omicron has a higher affinity for ACE2 compared with Delta, and a marked change in its antigenicity increases Omicron's evasion of therapeutic monoclonal and vaccine-elicited polyclonal neutralizing antibodies after two doses. mRNA vaccination as a third vaccine dose rescues and broadens neutralization. Importantly, the antiviral drugs remdesivir and molnupiravir retain efficacy against Omicron BA.1. Replication was similar for Omicron and Delta virus isolates in human nasal epithelial cultures. However, in lung cells and gut cells, Omicron demonstrated lower replication. Omicron spike protein was less efficiently cleaved compared with Delta. The differences in replication were mapped to the entry efficiency of the virus on the basis of spike-pseudotyped virus assays. The defect in entry of Omicron pseudotyped virus to specific cell types effectively correlated with higher cellular RNA expression of TMPRSS2, and deletion of TMPRSS2 affected Delta entry to a greater extent than Omicron. Furthermore, drug inhibitors targeting specific entry pathways demonstrated that the Omicron spike inefficiently uses the cellular protease TMPRSS2, which promotes cell entry through plasma membrane fusion, with greater dependency on cell entry through the endocytic pathway. Consistent with suboptimal S1/S2 cleavage and inability to use TMPRSS2, syncytium formation by the Omicron spike was substantially impaired compared with the Delta spike. The less efficient spike cleavage of Omicron at S1/S2 is associated with a shift in cellular tropism away from TMPRSS2-expressing cells, with implications for altered pathogenesis.

577 citations

Journal ArticleDOI
07 May 2020-Nature
TL;DR: It is shown that a coronavirus isolated from a Malayan pangolin has 100%, 98.6%, 97.8% and 90.7% amino acid identity with SARS-CoV-2 in the E, M, N and S proteins, respectively, which suggests that the latter may have originated from a recombination event involving SARS-related coronaviruses from bats and pangolins.
Abstract: The current outbreak of coronavirus disease-2019 (COVID-19) poses unprecedented challenges to global health. The new coronavirus responsible for this outbreak, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), shares high sequence identity to SARS-CoV and a bat coronavirus, RaTG13. Although bats may be the reservoir host for a variety of coronaviruses, it remains unknown whether SARS-CoV-2 has additional host species. Here we show that a coronavirus, which we name pangolin-CoV, isolated from a Malayan pangolin has 100%, 98.6%, 97.8% and 90.7% amino acid identity with SARS-CoV-2 in the E, M, N and S proteins, respectively. In particular, the receptor-binding domain of the S protein of pangolin-CoV is almost identical to that of SARS-CoV-2, with one difference in a noncritical amino acid. Our comparative genomic analysis suggests that SARS-CoV-2 may have originated in the recombination of a virus similar to pangolin-CoV with one similar to RaTG13. Pangolin-CoV was detected in 17 out of the 25 Malayan pangolins that we analysed. Infected pangolins showed clinical signs and histological changes, and circulating antibodies against pangolin-CoV reacted with the S protein of SARS-CoV-2. The isolation of a coronavirus from pangolins that is closely related to SARS-CoV-2 suggests that these animals have the potential to act as an intermediate host of SARS-CoV-2. This newly identified coronavirus from pangolins, the most-trafficked mammal in the illegal wildlife trade, could represent a future threat to public health if wildlife trade is not effectively controlled.

570 citations