
Showing papers by "Sarvnaz Karimi published in 2020"


Proceedings ArticleDOI
01 Jul 2020
TL;DR: This work proposes a simple, effective transition-based model with generic neural encoding for discontinuous NER that recognizes discontinuous mentions without sacrificing accuracy on continuous mentions.
Abstract: Unlike widely used Named Entity Recognition (NER) data sets in generic domains, biomedical NER data sets often contain mentions consisting of discontinuous spans. Conventional sequence tagging techniques encode Markov assumptions that are efficient but preclude recovery of these mentions. We propose a simple, effective transition-based model with generic neural encoding for discontinuous NER. Through extensive experiments on three biomedical data sets, we show that our model can effectively recognize discontinuous mentions without sacrificing the accuracy on continuous mentions.

46 citations
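To make the transition-based idea concrete, here is a minimal, hypothetical sketch of how a sequence of transition actions can assemble a mention whose tokens are not contiguous. The action names (SHIFT, OUT, COMPLETE) and the derivation are illustrative simplifications, not the paper's exact transition inventory, and the neural encoder that scores candidate actions is omitted entirely.

```python
# Illustrative sketch only: a toy transition system that can build a
# discontinuous mention by skipping gap tokens. Not the paper's exact actions.

def run_transitions(tokens, actions):
    """Apply an action sequence and return recognised (possibly discontinuous) mentions."""
    buffer = list(tokens)   # tokens not yet processed, left to right
    stack = []              # tokens collected for the mention being built
    mentions = []           # completed mentions, each a tuple of tokens
    for action in actions:
        if action == "SHIFT":       # next token belongs to the current mention
            stack.append(buffer.pop(0))
        elif action == "OUT":       # next token is skipped (gap or non-mention token)
            buffer.pop(0)
        elif action == "COMPLETE":  # close the current mention, keeping any gap it spans
            mentions.append(tuple(stack))
            stack = []
    return mentions

tokens = ["severe", "joint", "and", "muscle", "pain"]
# One derivation of the discontinuous mention "joint ... pain":
actions = ["OUT", "SHIFT", "OUT", "OUT", "SHIFT", "COMPLETE"]
print(run_transitions(tokens, actions))  # [('joint', 'pain')]
```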


Journal ArticleDOI
TL;DR: This research presents a four-step architecture that filters tweets for relevance and monitors them with a time-between-events algorithm for early detection of acute disease events, detecting the 2016 Melbourne thunderstorm asthma outbreak up to 9 hours before the time mentioned in the official report.
Abstract: Background Melbourne, Australia, witnessed a thunderstorm asthma outbreak on 21 November 2016, resulting in over 8,000 hospital admissions by 6 P.M. This is a typical acute disease event. Because the time to respond is short for acute disease events, an algorithm based on time between events has shown promise. The shorter the time between consecutive incidents of the disease, the more likely an outbreak is underway. Social media posts such as tweets can be used as input to the monitoring algorithm. However, due to the large volume of tweets, a large number of alerts may be produced. We refer to this problem as alert swamping. Methods We present a four-step architecture for the early detection of the acute disease event, using social media posts (tweets) on Twitter. To curb alert swamping, the first three steps of the algorithm ensure the relevance of the tweets. The fourth step is a monitoring algorithm based on time between events. We experiment with a dataset of tweets posted in Melbourne from 2014 to 2016, focusing on the thunderstorm asthma outbreak in Melbourne in November 2016. Results Out of our 18 experiment combinations, three detected the thunderstorm asthma outbreak up to 9 hours before the time mentioned in the official report, and five were able to detect it before the first news report. Conclusions With appropriate checks against alert swamping in place and the use of a monitoring algorithm based on time between events, tweets can provide early alerts for an acute disease event such as thunderstorm asthma.

30 citations
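As a rough illustration of the time-between-events idea behind the monitoring step, the sketch below raises an alert whenever consecutive relevant posts arrive unusually close together. The fixed gap threshold and the toy timestamps are assumptions for the example; the paper's monitoring algorithm is based on time between events, but its exact alerting rule may differ.

```python
from datetime import datetime, timedelta

def time_between_events_alerts(timestamps, gap_threshold=timedelta(minutes=10)):
    """Flag an alert whenever two consecutive relevant posts arrive within `gap_threshold`.

    The constant threshold is illustrative; a real monitor would be tuned or
    model inter-arrival times statistically.
    """
    timestamps = sorted(timestamps)
    alerts = []
    for earlier, later in zip(timestamps, timestamps[1:]):
        if later - earlier < gap_threshold:
            alerts.append(later)   # short gap => possible outbreak signal
    return alerts

posts = [datetime(2016, 11, 21, 18, 0) + timedelta(minutes=m) for m in (0, 3, 5, 40, 42)]
print(time_between_events_alerts(posts))  # alerts at the closely spaced posts
```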


Proceedings ArticleDOI
01 Nov 2020
TL;DR: This work pretrains two models on tweets and forum text respectively, empirically demonstrates the effectiveness of these two resources, and investigates how similarity measures can be used to nominate in-domain pretraining data.
Abstract: Recent studies on domain-specific BERT models show that effectiveness on downstream tasks can be improved when models are pretrained on in-domain data. Often, the pretraining data used in these models are selected based on their subject matter, e.g., biology or computer science. Given the range of applications using social media text, and its unique language variety, we pretrain two models on tweets and forum text respectively, and empirically demonstrate the effectiveness of these two resources. In addition, we investigate how similarity measures can be used to nominate in-domain pretraining data. We publicly release our pretrained models at https://bit.ly/35RpTf0.

27 citations
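One simple way to realise the idea of using similarity measures to nominate pretraining data is to compare corpus vocabularies; the sketch below ranks candidate corpora by the Jaccard overlap of their most frequent word types with a target-task corpus. The measure and the toy corpora are assumptions for illustration, not necessarily the measures used in the paper.

```python
from collections import Counter

def vocab_jaccard(corpus_a, corpus_b, top_k=10000):
    """Jaccard overlap of the top-k most frequent word types in two corpora."""
    def top_vocab(corpus):
        counts = Counter(tok.lower() for doc in corpus for tok in doc.split())
        return {w for w, _ in counts.most_common(top_k)}
    a, b = top_vocab(corpus_a), top_vocab(corpus_b)
    return len(a & b) / len(a | b)

# Toy target-task texts and candidate pretraining corpora (invented examples):
task_corpus = ["my dr switched my meds lol", "anyone else get headaches on this?"]
candidates = {"tweets": ["lol feeling sick af", "new meds today"],
              "abstracts": ["We report a randomized controlled trial of a new drug"]}
ranked = sorted(candidates, key=lambda name: vocab_jaccard(task_corpus, candidates[name]),
                reverse=True)
print(ranked)  # candidate corpora ordered by similarity to the target task
```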


Journal ArticleDOI
17 Mar 2020-PLOS ONE
TL;DR: This work demonstrates the value of social media for automated surveillance of infectious diseases such as the West Africa Ebola epidemic by experimenting with two variations of an existing surveillance architecture: one that aggregates tweets related to different symptoms together, and one that considers tweets about each symptom separately.
Abstract: First reported in March 2014, an Ebola epidemic impacted West Africa, most notably Liberia, Guinea and Sierra Leone. We demonstrate the value of social media for automated surveillance of infectious diseases such as the West Africa Ebola epidemic. We experiment with two variations of an existing surveillance architecture: the first aggregates tweets related to different symptoms together, while the second considers tweets about each symptom separately and then aggregates the set of alerts generated by the architecture. Using a dataset of tweets posted from the affected region from 2011 to 2014, we obtain alerts in December 2013, which is three months prior to the official announcement of the epidemic. Among the two variations, the second, which produces a restricted but useful set of alerts, can potentially be applied to other infectious disease surveillance and alert systems.

21 citations
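The two architecture variations can be pictured as follows: pool all symptom-related tweets into one stream before monitoring, or monitor each symptom stream separately and then aggregate the resulting alerts. The sketch below uses a toy gap-based monitor as a stand-in for the paper's surveillance algorithm; the function names and thresholds are assumptions for illustration.

```python
def monitor(timestamps, max_gap=3600.0):
    """Toy monitor: alert whenever two posts arrive within `max_gap` seconds."""
    ts = sorted(timestamps)
    return [b for a, b in zip(ts, ts[1:]) if b - a < max_gap]

def variation_pooled(streams):
    """Variation 1: aggregate tweets for all symptoms, then monitor once."""
    pooled = [t for stream in streams.values() for t in stream]
    return monitor(pooled)

def variation_per_symptom(streams):
    """Variation 2: monitor each symptom separately, then aggregate the alerts."""
    return {symptom: monitor(stream) for symptom, stream in streams.items()}

# Timestamps in seconds for two symptom streams (invented data):
streams = {"fever": [0.0, 1800.0, 90000.0], "vomiting": [2000.0, 2400.0]}
print(variation_pooled(streams))
print(variation_per_symptom(streams))
```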


Journal ArticleDOI
TL;DR: The system provides an end-to-end machine learning-based solution that achieves results comparable to the state of the art, which relies on hand-crafted rules or data-centric engineered features.

14 citations


Journal ArticleDOI
TL;DR: An evaluation on the TREC Precision Medicine benchmarks indicates that the approach using the BERT model pre-trained on scientific abstracts and clinical notes achieves state-of-the-art results, on par with highly specialised, manually optimised heuristic models.

13 citations


Journal ArticleDOI
TL;DR: This work proposes probabilistic aggregation models for review ratings, based on the Dirichlet distribution, to combat data sparsity in reviews, and exploits “helpfulness” social information and time to filter noisy reviews and aggregate ratings into a consensus opinion.
Abstract: The star‐rating mechanism of customer reviews is used universally by the online population to compare and select merchants, movies, products, and services. The consensus opinion from aggregation of star ratings is used as a proxy for item quality. Online reviews are noisy and effective aggregation of star ratings to accurately reflect the “true quality” of products and services is challenging. The mean‐rating aggregation model is widely used and other aggregation models are also proposed. These existing aggregation models rely on a large number of reviews to tolerate noise. However, many products rarely have reviews. We propose probabilistic aggregation models for review ratings based on the Dirichlet distribution to combat data sparsity in reviews. We further propose to exploit the “helpfulness” social information and time to filter noisy reviews and effectively aggregate ratings to compute the consensus opinion. Our experiments on an Amazon data set show that our probabilistic aggregation models based on “helpfulness” achieve better performance than the statistical and heuristic baseline approaches.

7 citations
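A minimal sketch of a Dirichlet-based rating aggregator is shown below: each star category receives prior pseudo-counts, so items with only a handful of reviews are not dominated by a single rating, and optional helpfulness weights scale the observed counts. The symmetric prior and the weighting scheme are illustrative assumptions, not the paper's exact models.

```python
def dirichlet_expected_rating(star_counts, prior=1.0, helpfulness_weights=None):
    """Expected star rating under a Dirichlet-multinomial model.

    `star_counts[k]` is how many reviews gave k+1 stars. A symmetric prior of
    `prior` pseudo-counts per category smooths sparse items; optional
    helpfulness weights scale each category's counts.
    """
    stars = range(1, len(star_counts) + 1)
    if helpfulness_weights is None:
        helpfulness_weights = [1.0] * len(star_counts)
    alpha = [prior + c * w for c, w in zip(star_counts, helpfulness_weights)]
    total = sum(alpha)
    return sum(s * a / total for s, a in zip(stars, alpha))

# A product with a single 5-star review is pulled toward the middle,
# unlike its raw mean of 5.0:
print(dirichlet_expected_rating([0, 0, 0, 0, 1]))  # ~3.33 with a uniform prior
```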


Journal ArticleDOI
TL;DR: The authors used transformers to extract disease mentions from clinical notes, rule-based methods and coreference resolution techniques to extract family member (FM) information, and transfer learning strategies to improve the annotation of diseases.
Abstract: Background: The prognosis, diagnosis, and treatment of many genetic disorders and familial diseases significantly improve if the family history (FH) of a patient is known. Such information is often written in the free text of clinical notes. Objective: The aim of this study is to develop automated methods that enable access to FH data through natural language processing. Methods: We performed information extraction by using transformers to extract disease mentions from notes. We also experimented with rule-based methods for extracting family member (FM) information from text and coreference resolution techniques. We evaluated different transfer learning strategies to improve the annotation of diseases. We provided a thorough error analysis of the contributing factors that affect such information extraction systems. Results: Our experiments showed that the combination of domain-adaptive pretraining and intermediate-task pretraining achieved an F1 score of 81.63% for the extraction of diseases and FMs from notes when it was tested on a public shared task data set from the National Natural Language Processing Clinical Challenges (N2C2), providing a statistically significant improvement over the baseline (P<.001). In comparison, in the 2019 N2C2/Open Health Natural Language Processing Shared Task, the median F1 score of all 17 participating teams was 76.59%. Conclusions: Our approach, which leverages a state-of-the-art named entity recognition model for disease mention detection coupled with a hybrid method for FM mention detection, achieved an effectiveness that was close to that of the top 3 systems participating in the 2019 N2C2 FH extraction challenge, with only the top system convincingly outperforming our approach in terms of precision.

6 citations
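To illustrate the hybrid design, the sketch below shows only the rule-based family-member side of such a pipeline; the disease mentions would come from a transformer NER model, which is not included here. The lexicon and regular expression are assumptions for the example, not the rules used in the study.

```python
import re

# Hypothetical lexicon and pattern for family-member (FM) mention extraction.
FM_TERMS = r"(?:mother|father|sister|brother|aunt|uncle|grandmother|grandfather|cousin|son|daughter)"
SIDE = r"(?:maternal|paternal)"
FM_PATTERN = re.compile(rf"\b(?P<side>{SIDE})?\s*(?P<relative>{FM_TERMS})\b", re.IGNORECASE)

def extract_family_members(note_text):
    """Return (relative, side-of-family) pairs mentioned in a clinical note."""
    mentions = []
    for match in FM_PATTERN.finditer(note_text):
        side = match.group("side").lower() if match.group("side") else None
        mentions.append((match.group("relative").lower(), side))
    return mentions

note = "Her maternal grandmother had breast cancer; her brother is healthy."
print(extract_family_members(note))  # [('grandmother', 'maternal'), ('brother', None)]
```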



Posted Content
TL;DR: This work proposes a novel method for neural retrieval and demonstrates its effectiveness on the TREC COVID search task, which aims to help scientists, clinicians, policy makers and others with similar information needs find reliable answers in the scientific literature.
Abstract: Finding answers related to a pandemic of a novel disease raises new challenges for information seeking and retrieval, as the new information becomes available gradually. The TREC COVID search track aims to assist in creating search tools to aid scientists, clinicians, policy makers and others with similar information needs in finding reliable answers from the scientific literature. We experiment with different ranking algorithms as part of our participation in this challenge. We propose a novel method for neural retrieval, and demonstrate its effectiveness on the TREC COVID search.

5 citations


Journal ArticleDOI
TL;DR: The A2A search and benchmarking tool is a public online tool for searching over biomedical literature, guided by the NIST setup of the relevant TREC evaluation tasks in genomics, clinical decision support, and precision medicine.
Abstract: Finding relevant literature is crucial for many biomedical research activities and in the practice of evidence-based medicine. Search engines such as PubMed provide a means to search and retrieve published literature, given a query. However, they are limited in how users can control the processing of queries and articles (or, as we call them, documents) by the search engine. To give this control to both biomedical researchers and computer scientists working in biomedical information retrieval, we introduce a public online tool for searching over biomedical literature. Our setup is guided by the NIST setup of the relevant TREC evaluation tasks in genomics, clinical decision support, and precision medicine. To provide benchmark results for some of the most common biomedical information retrieval strategies, such as querying MeSH subject headings with a specific weight or querying over the title of the articles only, we present our evaluations on public datasets. Our experiments report well-known information retrieval metrics such as precision at a cutoff of ranked documents. We introduce the A2A search and benchmarking tool, which is publicly available for researchers who want to explore different search strategies over published biomedical literature. We outline several query formulation strategies and present their evaluations with known human judgements for a large pool of topics, from genomics to precision medicine.
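As an example of the kind of query formulation strategy the benchmarks cover, the sketch below builds a weighted-field query that boosts MeSH subject headings relative to titles, or restricts search to titles only. The Lucene-style syntax, field names and boost values are assumptions for illustration, not the A2A tool's actual query language.

```python
def build_field_query(terms, field_boosts):
    """Build a Lucene-style boolean query string with per-field boosts."""
    clauses = []
    for field, boost in field_boosts.items():
        for term in terms:
            clauses.append(f'{field}:"{term}"^{boost}')
    return " OR ".join(clauses)

# Strategy: query MeSH headings with a higher weight than titles (illustrative fields):
print(build_field_query(["melanoma", "BRAF"], {"mesh_headings": 2.0, "title": 1.0}))
# Strategy: title-only search:
print(build_field_query(["melanoma", "BRAF"], {"title": 1.0}))
```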

Posted Content
TL;DR: This paper proposes a transition-based model with generic neural encoding for discontinuous NER that effectively recognizes discontinuous mentions without sacrificing accuracy on continuous mentions, achieving state-of-the-art performance on three biomedical data sets.
Abstract: Unlike widely used Named Entity Recognition (NER) data sets in generic domains, biomedical NER data sets often contain mentions consisting of discontinuous spans. Conventional sequence tagging techniques encode Markov assumptions that are efficient but preclude recovery of these mentions. We propose a simple, effective transition-based model with generic neural encoding for discontinuous NER. Through extensive experiments on three biomedical data sets, we show that our model can effectively recognize discontinuous mentions without sacrificing the accuracy on continuous mentions.

01 Dec 2020
TL;DR: In this paper, the authors investigate how to better rank information for pandemic information retrieval, and propose a novel end-to-end method for neural retrieval and demonstrate its effectiveness on the TREC COVID search.
Abstract: Finding information related to a pandemic of a novel disease raises new challenges for information seeking and retrieval, as the new information becomes available gradually. We investigate how to better rank information for pandemic information retrieval. We experiment with different ranking algorithms and propose a novel end-to-end method for neural retrieval, and demonstrate its effectiveness on the TREC COVID search. This work could lead to a search system that aids scientists, clinicians, policymakers and others in finding reliable answers from the scientific literature.
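A generic retrieve-then-rerank skeleton of the kind commonly used for this task is sketched below; it is not the paper's specific end-to-end neural method. The TF-IDF first stage and the stubbed cross-encoder scorer are placeholders showing where a trained neural re-ranker would sit.

```python
import math
from collections import Counter

def tfidf_rank(query, docs):
    """First-stage retrieval: rank document indices by a simple TF-IDF dot product."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc.lower().split()))
    def score(doc):
        tf = Counter(doc.lower().split())
        return sum(tf[t] * math.log((n + 1) / (1 + df[t])) for t in query.lower().split())
    return sorted(range(n), key=lambda i: score(docs[i]), reverse=True)

def neural_rerank(query, docs, candidates, top_k=100):
    """Re-rank the top candidates; replace `cross_encoder_score` with a real model."""
    def cross_encoder_score(q, d):
        return len(set(q.lower().split()) & set(d.lower().split()))  # stub scorer
    pool = candidates[:top_k]
    return sorted(pool, key=lambda i: cross_encoder_score(query, docs[i]), reverse=True)

docs = ["masks reduce transmission of the novel coronavirus",
        "a review of influenza vaccines",
        "remdesivir trial results in covid-19 patients"]
query = "covid-19 treatment trial"
first_stage = tfidf_rank(query, docs)
print(neural_rerank(query, docs, first_stage))  # re-ranked document indices
```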

Posted Content
TL;DR: In this article, two models were pretrained on tweets and forum text, respectively; the effectiveness of these two resources was demonstrated empirically, and similarity measures were investigated as a way to nominate in-domain pretraining data.
Abstract: Recent studies on domain-specific BERT models show that effectiveness on downstream tasks can be improved when models are pretrained on in-domain data. Often, the pretraining data used in these models are selected based on their subject matter, e.g., biology or computer science. Given the range of applications using social media text, and its unique language variety, we pretrain two models on tweets and forum text respectively, and empirically demonstrate the effectiveness of these two resources. In addition, we investigate how similarity measures can be used to nominate in-domain pretraining data. We publicly release our pretrained models at https://bit.ly/35RpTf0.

Proceedings Article
01 Jan 2020
TL;DR: This work examined two mechanisms for incorporating treatment within the query formulation strategy for DFR: (1) a concatenation of disease, gene and treatment fields; and (2) a concatenation of disease and gene fields while filtering out documents where treatment terms were absent, combining both strategies with re-rankers trained either directly on the TREC PM 2017-2019 retrieval tasks or on a treatment-augmented version of these tasks.
Abstract: TREC Precision Medicine (PM) focuses on providing high-quality evidence from the biomedical literature for clinicians treating cancer patients. Our experiments focus on incorporating treatment into search. We established a promising baseline using PM 2017-2018 datasets for training and 2019 for validation. Our baseline consisted of a base-ranking step using Divergence From Randomness (DFR) scoring that used disease and gene as queries and an aggregated text field to represent documents, followed by a BERT-based neural reranker. We examined two mechanisms for incorporating the treatment within the query formulation strategy for DFR: (1) a concatenation of disease, gene and treatment fields; and (2) a concatenation of disease and gene fields, but filtering out the documents where treatment terms were absent. We experimented with both strategies in combination with re-rankers trained either directly on the TREC PM 2017-2019 retrieval tasks, or trained on a treatment-augmented version of these tasks. We obtained the best results using boolean retrieval for treatment terms with a re-ranker trained on non-augmented TREC PM datasets. Our top-ranking run achieved 0.530, 0.565, 0.436 for infNDCG, P@10, RPrec, respectively. TREC medians for these metrics were 0.432, 0.465, and 0.326.
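The two treatment-handling strategies can be sketched over a toy collection as below, with a plain term-overlap score standing in for the DFR base ranker and the BERT re-ranking step omitted. Document texts and query terms are invented for illustration.

```python
def overlap_score(query_terms, doc):
    """Toy relevance score: number of query terms present in the document."""
    return len(set(query_terms) & set(doc.lower().split()))

def strategy_concatenate(disease, gene, treatment, docs):
    """(1) Put disease, gene and treatment terms all into the query."""
    query = [t.lower() for t in (disease + gene + treatment)]
    return sorted(docs, key=lambda d: overlap_score(query, d), reverse=True)

def strategy_boolean_filter(disease, gene, treatment, docs):
    """(2) Query disease and gene only, but drop documents with no treatment term."""
    query = [t.lower() for t in (disease + gene)]
    kept = [d for d in docs if any(t.lower() in d.lower().split() for t in treatment)]
    return sorted(kept, key=lambda d: overlap_score(query, d), reverse=True)

docs = ["vemurafenib improves survival in braf mutant melanoma",
        "the genetics of braf mutations in melanoma",
        "dietary factors and melanoma risk"]
print(strategy_concatenate(["melanoma"], ["braf"], ["vemurafenib"], docs))
print(strategy_boolean_filter(["melanoma"], ["braf"], ["vemurafenib"], docs))
```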