Showing papers by "Mark Gerstein" published in 2023


Proceedings Article
08 May 2023
TL;DR: This paper proposed a few-shot in-context learning (ICL) approach to the MEDIQA-2023 Dialogue2Note shared task, which achieved excellent results in terms of ROUGE-1 F1, BERTScore F1 (deberta-xlarge-mnli), and BLEURT, with scores of 0.4011, 0.7058, and 0.5421, respectively.
Abstract: This paper presents our contribution to the MEDIQA-2023 Dialogue2Note shared task, encompassing both subtask A and subtask B. We approach the task as a dialogue summarization problem and implement two distinct pipelines: (a) a fine-tuning of a pre-trained dialogue summarization model and GPT-3, and (b) few-shot in-context learning (ICL) using a large language model, GPT-4. Both methods achieve excellent results in terms of ROUGE-1 F1, BERTScore F1 (deberta-xlarge-mnli), and BLEURT, with scores of 0.4011, 0.7058, and 0.5421, respectively. Additionally, we predict the associated section headers using RoBERTa and SciBERT based classification models. Our team ranked fourth among all teams, while each team is allowed to submit three runs as part of their submission. We also utilize expert annotations to demonstrate that the notes generated through the ICL GPT-4 are better than all other baselines. The code for our submission is available.
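To make the headline metric concrete, the following is a minimal sketch of how a ROUGE-1 F1 score can be computed for one generated note against one reference note. It assumes lowercase whitespace tokenization with no stemming, which is simpler than what standard ROUGE toolkits do, so it will not reproduce the reported scores exactly. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate summary and a reference."""
    # Assumption: lowercase whitespace tokenization, no stemming.
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a token is matched at most as often as it occurs
    # in both texts (multiset intersection).
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    generated = "the patient reports chest pain"
    reference = "patient reports mild chest pain"
    print(round(rouge1_f1(generated, reference), 4))
```

A corpus-level score would typically be the mean of the per-note scores.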

3 citations


Journal Article
TL;DR: In this paper, a de novo generation framework, coined PepPPO, is presented to characterize binding motifs for any given MHC Class I protein by generating repertoires of the peptides it presents.
Abstract: Motivation: MHC Class I protein plays an important role in immunotherapy by presenting immunogenic peptides to anti-tumor immune cells. The repertoires of peptides for various MHC Class I proteins are distinct, which can be reflected by their diverse binding motifs. To characterize binding motifs for MHC Class I proteins, in vitro experiments have been conducted to screen peptides with high binding affinities to hundreds of given MHC Class I proteins. However, considering tens of thousands of known MHC Class I proteins, conducting in vitro experiments for extensive MHC proteins is infeasible, and thus a more efficient and scalable way to characterize binding motifs is needed. Results: We presented a de novo generation framework, coined PepPPO, to characterize binding motifs for any given MHC Class I protein by generating repertoires of the peptides it presents. PepPPO leverages a reinforcement learning agent with a mutation policy to mutate random input peptides into positive presented ones. Using PepPPO, we characterized binding motifs for around 10 000 known human MHC Class I proteins with and without experimental data. These computed motifs demonstrated high similarities with those derived from experimental data. In addition, we found that the motifs could be used for the rapid screening of neoantigens at a much lower time cost than previous deep-learning methods. Availability and implementation: The software can be found at https://github.com/minrq/pMHC. Supplementary information: Supplementary data are available at Bioinformatics online.

1 citation


Proceedings Article
TL;DR: The authors used reinforcement learning to enhance factual consistency and align with human annotator preferences for clinical studies summarization, and achieved state-of-the-art results on the same dataset.
Abstract: In the rapidly evolving landscape of medical research, accurate and concise summarization of clinical studies is crucial to support evidence-based practice. This paper presents a novel approach to clinical studies summarization, leveraging reinforcement learning to enhance factual consistency and align with human annotator preferences. Our work focuses on two tasks: Conclusion Generation and Review Generation. We train a CONFIT summarization model that outperforms GPT-3 and previous state-of-the-art models on the same datasets, and we collect expert and crowd-worker annotations to evaluate the quality and factual consistency of the generated summaries. These annotations enable us to measure the correlation of various automatic metrics, including modern factual evaluation metrics like QAFactEval, with human-assessed factual consistency. By employing top-correlated metrics as objectives for a reinforcement learning model, we demonstrate improved factuality in generated summaries that are preferred by human annotators.

1 citation


Journal Article
TL;DR: The authors explored the role of MIGs and MiS in cancer, taking prostate cancer (PCa) as an exemplar. Small interfering RNA knocking down U6atac was ∼50% more efficient in models of advanced therapy-resistant PCa compared with standard antiandrogen therapy.

1 citation


Journal Article
TL;DR: In this article, the authors estimate the bedtimes of Reddit users from the time stamps of their posts, test inference validity against survey data, and release their model as an R package (The R Foundation).
Abstract: Background: Individuals with later bedtimes have an increased risk of difficulties with mood and substances. To investigate the causes and consequences of late bedtimes and other sleep patterns, researchers are exploring social media as a data source. Pioneering studies inferred sleep patterns directly from social media data. While innovative, these efforts are variously unscalable, context dependent, confined to specific sleep parameters, or rest on untested assumptions, and none of the reviewed studies apply to the popular Reddit platform or release software to the research community. Objective: This study builds on this prior work. We estimate the bedtimes of Reddit users from the time stamps of their posts, test inference validity against survey data, and release our model as an R package (The R Foundation). Methods: We included 159 sufficiently active Reddit users with known time zones and known, nonanomalous bedtimes, together with the time stamps of their 2.1 million posts. The model’s form was chosen by visualizing the aggregate distribution of the timing of users’ posts relative to their reported bedtimes. The chosen model represents a user’s frequency of Reddit posting by time of day, with a flat portion before bedtime and a quadratic depletion that begins near the user’s bedtime, with parameters fitted to the data. This model estimates the bedtimes of individual Reddit users from the time stamps of their posts. Model performance is assessed through k-fold cross-validation. We then apply the model to estimate the bedtimes of 51,372 sufficiently active, nonbot Reddit users with known time zones from the time stamps of their 140 million posts. Results: The Pearson correlation between expected and observed Reddit posting frequencies in our model was 0.997 on aggregate data. On average, posting starts declining 45 minutes before bedtime, reaches a nadir 4.75 hours after bedtime that is 87% lower than the daytime rate, and returns to baseline 10.25 hours after bedtime. The Pearson correlation between inferred and reported bedtimes for individual users was 0.61 (P<.001). In 90 of 159 cases (56.6%), our estimate was within 1 hour of the reported bedtime; 128 cases (80.5%) were within 2 hours. There was equivalent accuracy in hold-out sets versus training sets of k-fold cross-validation, arguing against overfitting. The model was more accurate than a random forest approach. Conclusions: We uncovered a simple, reproducible relationship between Reddit users’ reported bedtimes and the time of day when high daytime posting rates transition to low nighttime posting rates. We captured this relationship in a model that estimates users’ bedtimes from the time stamps of their posts. Limitations include applicability only to users who post frequently, the requirement for time zone data, and limits on generalizability. Nonetheless, it is a step forward for inferring the sleep parameters of social media users passively at scale. Our model and precomputed estimated bedtimes of 50,000 Reddit users are freely available.
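The shape described in this abstract (a flat daytime rate with a quadratic dip around bedtime) can be illustrated with a short sketch. This is not the released R package. It assumes a symmetric parabola built from the reported averages (decline starting 0.75 hours before bedtime, an 87% drop at the nadir, recovery 10.25 hours after bedtime) and a simple least-squares grid search in place of the authors' fitting procedure. The names `posting_rate` and `estimate_bedtime` are hypothetical.

```python
def posting_rate(hours_after_bedtime: float, baseline: float = 1.0,
                 dip_start: float = -0.75, dip_end: float = 10.25,
                 max_drop: float = 0.87) -> float:
    """Relative posting rate at a time measured in hours from bedtime."""
    # Wrap the time onto a 24-hour window that begins at dip_start.
    t = (hours_after_bedtime - dip_start) % 24.0 + dip_start
    if t >= dip_end:
        return baseline  # flat daytime portion
    mid = (dip_start + dip_end) / 2   # nadir of the parabola
    half = (dip_end - dip_start) / 2  # half-width of the dip
    depth = max_drop * (1 - ((t - mid) / half) ** 2)
    return baseline * (1 - depth)


def estimate_bedtime(hourly_counts: list[float], step: float = 0.25) -> float:
    """Grid-search the bedtime (hours after local midnight) that best fits.

    hourly_counts[h] is assumed to hold the number of posts made during
    local clock hour h (0 to 23).
    """
    total = sum(hourly_counts)
    observed = [c / total for c in hourly_counts]
    best_bedtime, best_error = 0.0, float("inf")
    for i in range(int(24 / step)):
        bedtime = i * step
        # Evaluate the model at the midpoint of each clock hour.
        expected = [posting_rate(h + 0.5 - bedtime) for h in range(24)]
        norm = sum(expected)
        error = sum((o - e / norm) ** 2 for o, e in zip(observed, expected))
        if error < best_error:
            best_bedtime, best_error = bedtime, error
    return best_bedtime


if __name__ == "__main__":
    # Synthetic user whose true bedtime is 23:00.
    counts = [posting_rate(h + 0.5 - 23.0) for h in range(24)]
    print(estimate_bedtime(counts))
```

With the default parameters the nadir falls 4.75 hours after bedtime, matching the reported average, because a symmetric parabola bottoms out midway between the start and end of the dip.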

1 citation


Posted Content
14 May 2023-bioRxiv
TL;DR: Cameron et al. proposed an ensemble learning methodology that uses multiple pickers to find consensus particles for particle identification from micrographs, using integer linear programming (ILP) to select particles.
Abstract: Cryo-EM (cryogenic electron microscopy) particle identification from micrographs (i.e., picking) is challenging due to the low signal-to-noise ratio and lack of ground truth for particle locations. Moreover, current computational methods (“pickers”) identify different particle sets, complicating the selection of the best-suited picker for a protein of interest. Here, we present REPIC, an ensemble learning methodology that uses multiple pickers to find consensus particles. REPIC identifies consensus particles by framing its task as a graph problem and using integer linear programming to select particles. REPIC picks high-quality particles when the best picker is not known a priori and for known difficult-to-pick particles (e.g., TRPV1). Reconstructions using consensus particles achieve resolutions comparable to those from particles picked by experts, without the need for downstream particle filtering. Overall, our results show REPIC requires minimal (often no) manual picking and significantly reduces the burden on cryo-EM users for picker selection and particle picking. Availability https://github.com/ccameron/REPIC
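The consensus idea in this abstract (treat picked particles as graph vertices, then select a consistent set) can be sketched in a few lines. This is a toy illustration and not the REPIC implementation: the square boxes, the Jaccard overlap measure, the 0.3 threshold, and the mean-overlap weights are all assumptions, and an exhaustive search stands in for integer linear programming, so it only suits very small inputs.

```python
from itertools import combinations, product


def jaccard(a, b, size):
    """Jaccard overlap of two equal-sized square boxes given by their centers."""
    dx = max(0.0, size - abs(a[0] - b[0]))
    dy = max(0.0, size - abs(a[1] - b[1]))
    inter = dx * dy
    return inter / (2 * size * size - inter)


def candidate_cliques(picks, size, threshold=0.3):
    """Enumerate cliques with one particle per picker, all pairs overlapping.

    picks[k] is the list of (x, y) box centers reported by picker k.
    Returns (members, weight) pairs; members are (picker, index) vertices.
    """
    cliques = []
    for combo in product(*[range(len(p)) for p in picks]):
        members = list(enumerate(combo))
        overlaps = [jaccard(picks[k1][i1], picks[k2][i2], size)
                    for (k1, i1), (k2, i2) in combinations(members, 2)]
        if overlaps and min(overlaps) >= threshold:
            weight = sum(overlaps) / len(overlaps)
            cliques.append((frozenset(members), weight))
    return cliques


def select_consensus(cliques):
    """Choose vertex-disjoint cliques of maximum total weight.

    Brute force over subsets; a stand-in for the ILP step on tiny inputs.
    """
    best, best_weight = [], 0.0
    for r in range(1, len(cliques) + 1):
        for subset in combinations(cliques, r):
            used = [v for members, _ in subset for v in members]
            if len(used) != len(set(used)):
                continue  # a picked particle would be used twice
            weight = sum(w for _, w in subset)
            if weight > best_weight:
                best, best_weight = list(subset), weight
    return best


if __name__ == "__main__":
    picks = [
        [(10, 10), (50, 50)],  # picker 0
        [(11, 10), (80, 80)],  # picker 1
        [(10, 12), (51, 49)],  # picker 2
    ]
    for members, weight in select_consensus(candidate_cliques(picks, size=20)):
        print(sorted(members), round(weight, 3))
```

In this example only the particle near (10, 10) is found by all three pickers, so it is the single consensus particle.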

Journal Article
TL;DR: In this article, a unified perspective on relating variant impact to various genomic disorders is presented, and the authors argue that properly addressing them will require a more unified vocabulary and approach across disease communities.

Journal Article
TL;DR: In this article, a consensus algorithm is proposed that combines a set of particle identification algorithms to reduce the false positives introduced by individual methods, whose results differ because of differences in their algorithms and model-training strategies.

Posted Content
16 May 2023-bioRxiv
TL;DR: In this paper, the authors introduce a gene and transcript annotation framework that uses triplets representing the transcript start site, exon junction chain, and transcript end site of each transcript.
Abstract: The majority of mammalian genes encode multiple transcript isoforms that result from differential promoter use, changes in exonic splicing, and alternative 3’ end choice. Detecting and quantifying transcript isoforms across tissues, cell types, and species has been extremely challenging because transcripts are much longer than the short reads normally used for RNA-seq. By contrast, long-read RNA-seq (LR-RNA-seq) gives the complete structure of most transcripts. We sequenced 264 LR-RNA-seq PacBio libraries totaling over 1 billion circular consensus reads (CCS) for 81 unique human and mouse samples. We detect at least one full-length transcript from 87.7% of annotated human protein coding genes and a total of 200,000 full-length transcripts, 40% of which have novel exon junction chains. To capture and compute on the three sources of transcript structure diversity, we introduce a gene and transcript annotation framework that uses triplets representing the transcript start site, exon junction chain, and transcript end site of each transcript. Using triplets in a simplex representation demonstrates how promoter selection, splice pattern, and 3’ processing are deployed across human tissues, with nearly half of multitranscript protein coding genes showing a clear bias toward one of the three diversity mechanisms. Evaluated across samples, the predominantly expressed transcript changes for 74% of protein coding genes. In evolution, the human and mouse transcriptomes are globally similar in types of transcript structure diversity, yet among individual orthologous gene pairs, more than half (57.8%) show substantial differences in mechanism of diversification in matching tissues. This initial large-scale survey of human and mouse long-read transcriptomes provides a foundation for further analyses of alternative transcript usage, and is complemented by short-read and microRNA data on the same samples and by epigenome data elsewhere in the ENCODE4 collection.
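As an illustration of the triplet idea, the sketch below represents each transcript by its start site, exon junction chain, and end site, then places a gene on a simplex from the number of distinct values seen for each component. The plain normalization used here (counts divided by their sum) is an assumption made for illustration and may differ from the weighting used in the paper. The identifiers and the names `Triplet`, `simplex_coordinates`, and `dominant_mechanism` are hypothetical.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    tss: str  # transcript start site identifier
    ec: str   # exon junction chain identifier
    tes: str  # transcript end site identifier


def simplex_coordinates(transcripts: list[Triplet]) -> tuple[float, float, float]:
    """Proportions of distinct start sites, junction chains, and end sites."""
    if not transcripts:
        raise ValueError("a gene needs at least one transcript")
    n_tss = len({t.tss for t in transcripts})
    n_ec = len({t.ec for t in transcripts})
    n_tes = len({t.tes for t in transcripts})
    total = n_tss + n_ec + n_tes
    # The three coordinates are non-negative and sum to 1.
    return (n_tss / total, n_ec / total, n_tes / total)


def dominant_mechanism(transcripts: list[Triplet]) -> str:
    """Name the component with the largest simplex coordinate."""
    labels = ("promoter selection", "splice pattern", "3' processing")
    coords = simplex_coordinates(transcripts)
    return labels[coords.index(max(coords))]


if __name__ == "__main__":
    gene = [
        Triplet("tss_1", "ec_1", "tes_1"),
        Triplet("tss_2", "ec_1", "tes_1"),
        Triplet("tss_3", "ec_1", "tes_1"),
    ]
    print(simplex_coordinates(gene), dominant_mechanism(gene))
```

A gene whose transcripts differ only in their start sites lands near the promoter-selection corner of the simplex, which is the kind of bias toward one diversity mechanism that the abstract describes.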

Journal Article
TL;DR: The authors showed that aging and cancer share a common epigenetic replication signature, which was modeled using DNA methylation from extensively passaged immortalized human cells in vitro and tested on clinical tissues.
Abstract: Aging is a leading risk factor for cancer. While it is proposed that age-related accumulation of somatic mutations drives this relationship, it is likely not the full story. We show that aging and cancer share a common epigenetic replication signature, which we modeled using DNA methylation from extensively passaged immortalized human cells in vitro and tested on clinical tissues. This signature, termed CellDRIFT, increased with age across multiple tissues, distinguished tumor from normal tissue, was escalated in normal breast tissue from cancer patients, and was transiently reset upon reprogramming. In addition, within-person tissue differences were correlated with predicted lifetime tissue-specific stem cell divisions and tissue-specific cancer risk. Our findings suggest that age-related replication may drive epigenetic changes in cells and could push them toward a more tumorigenic state.

Journal Article
TL;DR: In this article, the authors note that the presence of membrane proteins in cryogenic electron microscopy (cryo-EM) imaging data limits particle identification and alignment, because the variability and strong contrast of membranes override the weaker particle signal.

Journal Article
12 May 2023-Science
TL;DR: A quantum computing primer offers insights into the technology's most promising potential applications, including quantum teleportation and quantum supercomputing.
Abstract: A quantum computing primer offers insights into the technology’s most promising potential applications.

Journal Article
TL;DR: In this article, the authors extend the exRNA Atlas resource by mapping exRNAs carried by extracellular RBPs (exRBPs) across human biofluids, presenting a resource for the community.
Abstract: Although the role of RNA binding proteins (RBPs) in extracellular RNA (exRNA) biology is well established, their exRNA cargo and distribution across biofluids are largely unknown. To address this gap, we extend the exRNA Atlas resource by mapping exRNAs carried by extracellular RBPs (exRBPs). This map was developed through an integrative analysis of ENCODE enhanced crosslinking and immunoprecipitation (eCLIP) data (150 RBPs) and human exRNA profiles (6,930 samples). Computational analysis and experimental validation identified exRBPs in plasma, serum, saliva, urine, cerebrospinal fluid, and cell-culture-conditioned medium. exRBPs carry exRNA transcripts from small non-coding RNA biotypes, including microRNA (miRNA), piRNA, tRNA, small nuclear RNA (snRNA), small nucleolar RNA (snoRNA), Y RNA, and lncRNA, as well as protein-coding mRNA fragments. Computational deconvolution of exRBP RNA cargo reveals associations of exRBPs with extracellular vesicles, lipoproteins, and ribonucleoproteins across human biofluids. Overall, we mapped the distribution of exRBPs across human biofluids, presenting a resource for the community.

Journal Article
TL;DR: The authors found that users are especially likely to be active on Reddit after their bedtime (and therefore awake) on nights that they posted to Reddit shortly before bedtime, especially if they posted multiple times or in high-engagement forums that night.

Journal Article
TL;DR: In this paper, a statistical modeling approach called MLCrosstalk, based on latent Dirichlet allocation, was developed to construct the full interactome of SARS-CoV-2 infection, which incorporates human microRNAs (miRNAs), additional human protein-coding genes, and exogenous microbes.
Abstract: The COVID-19 pandemic caused by the SARS-CoV-2 virus has resulted in millions of deaths worldwide. The disease presents with various manifestations that can vary in severity and long-term outcomes. Previous efforts have contributed to the development of effective strategies for treatment and prevention by uncovering the mechanism of viral infection. We now know all the direct protein–protein interactions that occur during the lifecycle of SARS-CoV-2 infection, but it is critical to move beyond these known interactions to a comprehensive understanding of the “full interactome” of SARS-CoV-2 infection, which incorporates human microRNAs (miRNAs), additional human protein-coding genes, and exogenous microbes. Potentially, this will help in developing new drugs to treat COVID-19, differentiating the nuances of long COVID, and identifying histopathological signatures in SARS-CoV-2-infected organs. To construct the full interactome, we developed a statistical modeling approach called MLCrosstalk (multiple-layer crosstalk) based on latent Dirichlet allocation. MLCrosstalk integrates data from multiple sources, including microbes, human protein-coding genes, miRNAs, and human protein–protein interactions. It constructs "topics" that group SARS-CoV-2 with genes and microbes based on similar patterns of co-occurrence across patient samples. We use these topics to infer linkages between SARS-CoV-2 and protein-coding genes, miRNAs, and microbes. We then refine these initial linkages using network propagation to contextualize them within a larger framework of network and pathway structures. Using MLCrosstalk, we identified genes in the IL1-processing and VEGFA–VEGFR2 pathways that are linked to SARS-CoV-2. We also found that Rothia mucilaginosa and Prevotella melaninogenica are positively and negatively correlated with SARS-CoV-2 abundance, a finding corroborated by analysis of single-cell sequencing data.

Journal Article
TL;DR: The authors reported that there is no funding associated with the work featured in this article.
Abstract: Additional information. Funding: The author(s) reported there is no funding associated with the work featured in this article.

Journal Article
TL;DR: In this article, the first wave of the COVID-19 pandemic was modeled by determining regional connectivity from phylogenetic sequence information (i.e., "genetic connectivity"), in addition to traditional epidemiologic and demographic parameters.
Abstract: For the COVID-19 pandemic, viral transmission has been documented in many historical and geographical contexts. Nevertheless, few studies have explicitly modeled the spatiotemporal flow based on genetic sequences to develop mitigation strategies. Additionally, thousands of SARS-CoV-2 genomes have been sequenced with associated records, an unprecedented amount during a single outbreak, potentially providing a rich source for such spatiotemporal analysis. Here, in a case study of seven states, we model the first wave of the outbreak by determining regional connectivity from phylogenetic sequence information (i.e., "genetic connectivity"), in addition to traditional epidemiologic and demographic parameters. Our study shows nearly all of the initial outbreak can be traced to a few lineages, rather than disconnected outbreaks, indicative of a mostly continuous initial viral flow. While the geographic distance from hotspots is initially important in the modeling, genetic connectivity becomes increasingly significant later in the first wave. Moreover, our model predicts that isolated local strategies (e.g., relying on herd immunity) can negatively impact neighboring regions, suggesting more efficient mitigation is possible with unified, cross-border interventions. Finally, our results suggest that a few targeted interventions based on connectivity can have an effect similar to that of an overall lockdown. They also suggest that while successful lockdowns are very effective in mitigating an outbreak, less disciplined lockdowns quickly decrease in effectiveness. Our study provides a framework for combining phylodynamic and computational methods to identify targeted interventions.

Posted Content
06 Mar 2023-medRxiv
TL;DR: In this paper, the authors uniformly process and systematically characterize gene, isoform, and splicing quantitative trait loci (xQTLs) in 672 fetal brain samples from unique subjects across multiple ancestral populations.
Abstract: Genomic regulatory elements active in the developing human brain are notably enriched in genetic risk for neuropsychiatric disorders, including autism spectrum disorder (ASD), schizophrenia, and bipolar disorder. However, prioritizing the specific risk genes and candidate molecular mechanisms underlying these genetic enrichments has been hindered by the lack of a single unified large-scale gene regulatory atlas of human brain development. Here, we uniformly process and systematically characterize gene, isoform, and splicing quantitative trait loci (xQTLs) in 672 fetal brain samples from unique subjects across multiple ancestral populations. We identify 15,752 genes harboring a significant xQTL and map 3,739 eQTLs to a specific cellular context. We observe a striking drop in gene expression and splicing heritability as the human brain develops. Isoform-level regulation, particularly in the second trimester, mediates the greatest proportion of heritability across multiple psychiatric GWAS, compared with eQTLs. Via colocalization and TWAS, we prioritize biological mechanisms for ~60% of GWAS loci across five neuropsychiatric disorders, nearly two-fold that observed in the adult brain. Finally, we build a comprehensive set of developmentally regulated gene and isoform co-expression networks capturing unique genetic enrichments across disorders. Together, this work provides a comprehensive view of genetic regulation across human brain development as well as the stage- and cell type-informed mechanistic underpinnings of neuropsychiatric disorders.