
Showing papers by "Sarvnaz Karimi published in 2020"


Proceedings ArticleDOI
01 Jul 2020
TL;DR: This work proposes a simple, effective transition-based model with generic neural encoding for discontinuous NER that recognizes discontinuous mentions without sacrificing accuracy on continuous mentions.
Abstract: Unlike widely used Named Entity Recognition (NER) data sets in generic domains, biomedical NER data sets often contain mentions consisting of discontinuous spans. Conventional sequence tagging techniques encode Markov assumptions that are efficient but preclude recovery of these mentions. We propose a simple, effective transition-based model with generic neural encoding for discontinuous NER. Through extensive experiments on three biomedical data sets, we show that our model can effectively recognize discontinuous mentions without sacrificing the accuracy on continuous mentions.

46 citations
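To make the transition-based idea concrete, here is a minimal, hypothetical sketch of how a sequence of transition actions can assemble a mention whose tokens are not contiguous. The action names (SHIFT, OUT, COMPLETE) and the derivation are illustrative simplifications, not the paper's exact transition inventory, and the neural encoder that scores candidate actions is omitted entirely.

```python
# Illustrative sketch only: a toy transition system that can build a
# discontinuous mention by skipping gap tokens. Not the paper's exact actions.

def run_transitions(tokens, actions):
    """Apply an action sequence and return recognised (possibly discontinuous) mentions."""
    buffer = list(tokens)   # tokens not yet processed, left to right
    stack = []              # tokens collected for the mention being built
    mentions = []           # completed mentions, each a tuple of tokens
    for action in actions:
        if action == "SHIFT":       # next token belongs to the current mention
            stack.append(buffer.pop(0))
        elif action == "OUT":       # next token is skipped (gap or non-mention token)
            buffer.pop(0)
        elif action == "COMPLETE":  # close the current mention, keeping any gap it spans
            mentions.append(tuple(stack))
            stack = []
    return mentions

tokens = ["severe", "joint", "and", "muscle", "pain"]
# One derivation of the discontinuous mention "joint ... pain":
actions = ["OUT", "SHIFT", "OUT", "OUT", "SHIFT", "COMPLETE"]
print(run_transitions(tokens, actions))  # [('joint', 'pain')]
```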


Journal ArticleDOI
TL;DR: This research presents a four-step architecture that filters tweets for relevance and monitors them with a time-between-events algorithm for early detection of acute disease events, detecting the 2016 Melbourne thunderstorm asthma outbreak up to 9 hours before the time mentioned in the official report.
Abstract: Background Melbourne, Australia, witnessed a thunderstorm asthma outbreak on 21 November 2016, resulting in over 8,000 hospital admissions by 6 P.M. This is a typical acute disease event. Because the time to respond is short for acute disease events, an algorithm based on time between events has shown promise. The shorter the time between consecutive incidents of the disease, the more likely an outbreak is underway. Social media posts such as tweets can be used as input to the monitoring algorithm. However, due to the large volume of tweets, a large number of alerts may be produced. We refer to this problem as alert swamping. Methods We present a four-step architecture for the early detection of the acute disease event, using social media posts (tweets) on Twitter. To curb alert swamping, the first three steps of the algorithm ensure the relevance of the tweets. The fourth step is a monitoring algorithm based on time between events. We experiment with a dataset of tweets posted in Melbourne from 2014 to 2016, focusing on the thunderstorm asthma outbreak in Melbourne in November 2016. Results Out of our 18 experiment combinations, three detected the thunderstorm asthma outbreak up to 9 hours before the time mentioned in the official report, and five were able to detect it before the first news report. Conclusions With appropriate checks against alert swamping in place and the use of a monitoring algorithm based on time between events, tweets can provide early alerts for an acute disease event such as thunderstorm asthma.

30 citations
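As a rough illustration of the time-between-events idea behind the monitoring step, the sketch below raises an alert whenever consecutive relevant posts arrive unusually close together. The fixed gap threshold and the toy timestamps are assumptions for the example; the paper's monitoring algorithm is based on time between events, but its exact alerting rule may differ.

```python
from datetime import datetime, timedelta

def time_between_events_alerts(timestamps, gap_threshold=timedelta(minutes=10)):
    """Flag an alert whenever two consecutive relevant posts arrive within `gap_threshold`.

    The constant threshold is illustrative; a real monitor would be tuned or
    model inter-arrival times statistically.
    """
    timestamps = sorted(timestamps)
    alerts = []
    for earlier, later in zip(timestamps, timestamps[1:]):
        if later - earlier < gap_threshold:
            alerts.append(later)   # short gap => possible outbreak signal
    return alerts

posts = [datetime(2016, 11, 21, 18, 0) + timedelta(minutes=m) for m in (0, 3, 5, 40, 42)]
print(time_between_events_alerts(posts))  # alerts at the closely spaced posts
```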


Proceedings ArticleDOI
01 Nov 2020
TL;DR: This work pretrains two models on tweets and forum text respectively, empirically demonstrates the effectiveness of these two resources, and investigates how similarity measures can be used to nominate in-domain pretraining data.
Abstract: Recent studies on domain-specific BERT models show that effectiveness on downstream tasks can be improved when models are pretrained on in-domain data. Often, the pretraining data used in these models are selected based on their subject matter, e.g., biology or computer science. Given the range of applications using social media text, and its unique language variety, we pretrain two models on tweets and forum text respectively, and empirically demonstrate the effectiveness of these two resources. In addition, we investigate how similarity measures can be used to nominate in-domain pretraining data. We publicly release our pretrained models at https://bit.ly/35RpTf0.

27 citations
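One simple way to realise the idea of using similarity measures to nominate pretraining data is to compare corpus vocabularies; the sketch below ranks candidate corpora by the Jaccard overlap of their most frequent word types with a target-task corpus. The measure and the toy corpora are assumptions for illustration, not necessarily the measures used in the paper.

```python
from collections import Counter

def vocab_jaccard(corpus_a, corpus_b, top_k=10000):
    """Jaccard overlap of the top-k most frequent word types in two corpora."""
    def top_vocab(corpus):
        counts = Counter(tok.lower() for doc in corpus for tok in doc.split())
        return {w for w, _ in counts.most_common(top_k)}
    a, b = top_vocab(corpus_a), top_vocab(corpus_b)
    return len(a & b) / len(a | b)

# Toy target-task texts and candidate pretraining corpora (invented examples):
task_corpus = ["my dr switched my meds lol", "anyone else get headaches on this?"]
candidates = {"tweets": ["lol feeling sick af", "new meds today"],
              "abstracts": ["We report a randomized controlled trial of a new drug"]}
ranked = sorted(candidates, key=lambda name: vocab_jaccard(task_corpus, candidates[name]),
                reverse=True)
print(ranked)  # candidate corpora ordered by similarity to the target task
```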


Journal ArticleDOI
17 Mar 2020-PLOS ONE
TL;DR: This work demonstrates the value of social media for automated surveillance of infectious diseases such as the West Africa Ebola epidemic by experimenting with two variations of an existing surveillance architecture: one that aggregates tweets related to different symptoms together, and one that considers tweets about each symptom separately.
Abstract: First reported in March 2014, an Ebola epidemic impacted West Africa, most notably Liberia, Guinea and Sierra Leone. We demonstrate the value of social media for automated surveillance of infectious diseases such as the West Africa Ebola epidemic. We experiment with two variations of an existing surveillance architecture: the first aggregates tweets related to different symptoms together, while the second considers tweets about each symptom separately and then aggregates the set of alerts generated by the architecture. Using a dataset of tweets posted from the affected region from 2011 to 2014, we obtain alerts in December 2013, which is three months prior to the official announcement of the epidemic. Among the two variations, the second, which produces a restricted but useful set of alerts, can potentially be applied to other infectious disease surveillance and alert systems.

21 citations
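The two architecture variations can be pictured as follows: pool all symptom-related tweets into one stream before monitoring, or monitor each symptom stream separately and then aggregate the resulting alerts. The sketch below uses a toy gap-based monitor as a stand-in for the paper's surveillance algorithm; the function names and thresholds are assumptions for illustration.

```python
def monitor(timestamps, max_gap=3600.0):
    """Toy monitor: alert whenever two posts arrive within `max_gap` seconds."""
    ts = sorted(timestamps)
    return [b for a, b in zip(ts, ts[1:]) if b - a < max_gap]

def variation_pooled(streams):
    """Variation 1: aggregate tweets for all symptoms, then monitor once."""
    pooled = [t for stream in streams.values() for t in stream]
    return monitor(pooled)

def variation_per_symptom(streams):
    """Variation 2: monitor each symptom separately, then aggregate the alerts."""
    return {symptom: monitor(stream) for symptom, stream in streams.items()}

# Timestamps in seconds for two symptom streams (invented data):
streams = {"fever": [0.0, 1800.0, 90000.0], "vomiting": [2000.0, 2400.0]}
print(variation_pooled(streams))
print(variation_per_symptom(streams))
```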


Journal ArticleDOI
TL;DR: The system provides an end-to-end machine learning-based solution that achieves results comparable to the state of the art, which relies on hand-crafted rules or data-centric engineered features.

14 citations


Journal ArticleDOI
TL;DR: An evaluation on the TREC Precision Medicine benchmarks indicates that the approach using the BERT model pre-trained on scientific abstracts and clinical notes achieves state-of-the-art results, on par with highly specialised, manually optimised heuristic models.

13 citations


Journal ArticleDOI
TL;DR: This work proposes probabilistic aggregation models for review ratings, based on the Dirichlet distribution, to combat data sparsity in reviews, and exploits “helpfulness” social information and time to filter noisy reviews and aggregate ratings into a consensus opinion.
Abstract: The star‐rating mechanism of customer reviews is used universally by the online population to compare and select merchants, movies, products, and services. The consensus opinion from aggregation of star ratings is used as a proxy for item quality. Online reviews are noisy and effective aggregation of star ratings to accurately reflect the “true quality” of products and services is challenging. The mean‐rating aggregation model is widely used and other aggregation models are also proposed. These existing aggregation models rely on a large number of reviews to tolerate noise. However, many products rarely have reviews. We propose probabilistic aggregation models for review ratings based on the Dirichlet distribution to combat data sparsity in reviews. We further propose to exploit the “helpfulness” social information and time to filter noisy reviews and effectively aggregate ratings to compute the consensus opinion. Our experiments on an Amazon data set show that our probabilistic aggregation models based on “helpfulness” achieve better performance than the statistical and heuristic baseline approaches.

7 citations
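A minimal sketch of a Dirichlet-based rating aggregator is shown below: each star category receives prior pseudo-counts, so items with only a handful of reviews are not dominated by a single rating, and optional helpfulness weights scale the observed counts. The symmetric prior and the weighting scheme are illustrative assumptions, not the paper's exact models.

```python
def dirichlet_expected_rating(star_counts, prior=1.0, helpfulness_weights=None):
    """Expected star rating under a Dirichlet-multinomial model.

    `star_counts[k]` is how many reviews gave k+1 stars. A symmetric prior of
    `prior` pseudo-counts per category smooths sparse items; optional
    helpfulness weights scale each category's counts.
    """
    stars = range(1, len(star_counts) + 1)
    if helpfulness_weights is None:
        helpfulness_weights = [1.0] * len(star_counts)
    alpha = [prior + c * w for c, w in zip(star_counts, helpfulness_weights)]
    total = sum(alpha)
    return sum(s * a / total for s, a in zip(stars, alpha))

# A product with a single 5-star review is pulled toward the middle,
# unlike its raw mean of 5.0:
print(dirichlet_expected_rating([0, 0, 0, 0, 1]))  # ~3.33 with a uniform prior
```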


Journal ArticleDOI
TL;DR: The authors used transformers to extract disease mentions from clinical notes, rule-based methods and coreference resolution techniques to extract family member (FM) information, and transfer learning strategies to improve the annotation of diseases.
Abstract: Background: The prognosis, diagnosis, and treatment of many genetic disorders and familial diseases significantly improve if the family history (FH) of a patient is known. Such information is often written in the free text of clinical notes. Objective: The aim of this study is to develop automated methods that enable access to FH data through natural language processing. Methods: We performed information extraction by using transformers to extract disease mentions from notes. We also experimented with rule-based methods for extracting family member (FM) information from text and coreference resolution techniques. We evaluated different transfer learning strategies to improve the annotation of diseases. We provided a thorough error analysis of the contributing factors that affect such information extraction systems. Results: Our experiments showed that the combination of domain-adaptive pretraining and intermediate-task pretraining achieved an F1 score of 81.63% for the extraction of diseases and FMs from notes when it was tested on a public shared task data set from the National Natural Language Processing Clinical Challenges (N2C2), providing a statistically significant improvement over the baseline (P<.001). In comparison, in the 2019 N2C2/Open Health Natural Language Processing Shared Task, the median F1 score of all 17 participating teams was 76.59%. Conclusions: Our approach, which leverages a state-of-the-art named entity recognition model for disease mention detection coupled with a hybrid method for FM mention detection, achieved an effectiveness that was close to that of the top 3 systems participating in the 2019 N2C2 FH extraction challenge, with only the top system convincingly outperforming our approach in terms of precision.

6 citations
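To illustrate the hybrid design, the sketch below shows only the rule-based family-member side of such a pipeline; the disease mentions would come from a transformer NER model, which is not included here. The lexicon and regular expression are assumptions for the example, not the rules used in the study.

```python
import re

# Hypothetical lexicon and pattern for family-member (FM) mention extraction.
FM_TERMS = r"(?:mother|father|sister|brother|aunt|uncle|grandmother|grandfather|cousin|son|daughter)"
SIDE = r"(?:maternal|paternal)"
FM_PATTERN = re.compile(rf"\b(?P<side>{SIDE})?\s*(?P<relative>{FM_TERMS})\b", re.IGNORECASE)

def extract_family_members(note_text):
    """Return (relative, side-of-family) pairs mentioned in a clinical note."""
    mentions = []
    for match in FM_PATTERN.finditer(note_text):
        side = match.group("side").lower() if match.group("side") else None
        mentions.append((match.group("relative").lower(), side))
    return mentions

note = "Her maternal grandmother had breast cancer; her brother is healthy."
print(extract_family_members(note))  # [('grandmother', 'maternal'), ('brother', None)]
```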



Posted Content
TL;DR: This work proposes a novel method for neural retrieval and demonstrates its effectiveness on the TREC COVID search task, which aims to help scientists, clinicians, policy makers and others with similar information needs find reliable answers in the scientific literature.
Abstract: Finding answers related to a pandemic of a novel disease raises new challenges for information seeking and retrieval, as the new information becomes available gradually. The TREC COVID search track aims to assist in creating search tools to aid scientists, clinicians, policy makers and others with similar information needs in finding reliable answers from the scientific literature. We experiment with different ranking algorithms as part of our participation in this challenge. We propose a novel method for neural retrieval, and demonstrate its effectiveness on the TREC COVID search.

5 citations


Journal ArticleDOI
TL;DR: The A2A search and benchmarking tool is a public online tool for searching over biomedical literature, guided by the NIST setup of the relevant TREC evaluation tasks in genomics, clinical decision support, and precision medicine.
Abstract: Finding relevant literature is crucial for many biomedical research activities and in the practice of evidence-based medicine. Search engines such as PubMed provide a means to search and retrieve published literature, given a query. However, they are limited in how users can control the processing of queries and articles (or, as we call them, documents) by the search engine. To give this control to both biomedical researchers and computer scientists working in biomedical information retrieval, we introduce a public online tool for searching over biomedical literature. Our setup is guided by the NIST setup of the relevant TREC evaluation tasks in genomics, clinical decision support, and precision medicine. To provide benchmark results for some of the most common biomedical information retrieval strategies, such as querying MeSH subject headings with a specific weight or querying over the title of the articles only, we present our evaluations on public datasets. Our experiments report well-known information retrieval metrics such as precision at a cutoff of ranked documents. We introduce the A2A search and benchmarking tool, which is publicly available for researchers who want to explore different search strategies over published biomedical literature. We outline several query formulation strategies and present their evaluations with known human judgements for a large pool of topics, from genomics to precision medicine.
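As an example of the kind of query formulation strategy the benchmarks cover, the sketch below builds a weighted-field query that boosts MeSH subject headings relative to titles, or restricts search to titles only. The Lucene-style syntax, field names and boost values are assumptions for illustration, not the A2A tool's actual query language.

```python
def build_field_query(terms, field_boosts):
    """Build a Lucene-style boolean query string with per-field boosts."""
    clauses = []
    for field, boost in field_boosts.items():
        for term in terms:
            clauses.append(f'{field}:"{term}"^{boost}')
    return " OR ".join(clauses)

# Strategy: query MeSH headings with a higher weight than titles (illustrative fields):
print(build_field_query(["melanoma", "BRAF"], {"mesh_headings": 2.0, "title": 1.0}))
# Strategy: title-only search:
print(build_field_query(["melanoma", "BRAF"], {"title": 1.0}))
```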

Posted Content
TL;DR: This paper proposes a transition-based model with generic neural encoding for discontinuous NER that effectively recognizes discontinuous mentions without sacrificing accuracy on continuous mentions, achieving state-of-the-art performance on three biomedical data sets.
Abstract: Unlike widely used Named Entity Recognition (NER) data sets in generic domains, biomedical NER data sets often contain mentions consisting of discontinuous spans. Conventional sequence tagging techniques encode Markov assumptions that are efficient but preclude recovery of these mentions. We propose a simple, effective transition-based model with generic neural encoding for discontinuous NER. Through extensive experiments on three biomedical data sets, we show that our model can effectively recognize discontinuous mentions without sacrificing the accuracy on continuous mentions.

01 Dec 2020
TL;DR: In this paper, the authors investigate how to better rank information for pandemic information retrieval, and propose a novel end-to-end method for neural retrieval and demonstrate its effectiveness on the TREC COVID search.
Abstract: Finding information related to a pandemic of a novel disease raises new challenges for information seeking and retrieval, as the new information becomes available gradually. We investigate how to better rank information for pandemic information retrieval. We experiment with different ranking algorithms and propose a novel end-to-end method for neural retrieval, and demonstrate its effectiveness on the TREC COVID search. This work could lead to a search system that aids scientists, clinicians, policymakers and others in finding reliable answers from the scientific literature.
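A generic retrieve-then-rerank skeleton of the kind commonly used for this task is sketched below; it is not the paper's specific end-to-end neural method. The TF-IDF first stage and the stubbed cross-encoder scorer are placeholders showing where a trained neural re-ranker would sit.

```python
import math
from collections import Counter

def tfidf_rank(query, docs):
    """First-stage retrieval: rank document indices by a simple TF-IDF dot product."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc.lower().split()))
    def score(doc):
        tf = Counter(doc.lower().split())
        return sum(tf[t] * math.log((n + 1) / (1 + df[t])) for t in query.lower().split())
    return sorted(range(n), key=lambda i: score(docs[i]), reverse=True)

def neural_rerank(query, docs, candidates, top_k=100):
    """Re-rank the top candidates; replace `cross_encoder_score` with a real model."""
    def cross_encoder_score(q, d):
        return len(set(q.lower().split()) & set(d.lower().split()))  # stub scorer
    pool = candidates[:top_k]
    return sorted(pool, key=lambda i: cross_encoder_score(query, docs[i]), reverse=True)

docs = ["masks reduce transmission of the novel coronavirus",
        "a review of influenza vaccines",
        "remdesivir trial results in covid-19 patients"]
query = "covid-19 treatment trial"
first_stage = tfidf_rank(query, docs)
print(neural_rerank(query, docs, first_stage))  # re-ranked document indices
```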

Posted Content
TL;DR: In this article, two models were pretrained on tweets and forum text, respectively; the effectiveness of these two resources was demonstrated empirically, and similarity measures were investigated as a way to nominate in-domain pretraining data.
Abstract: Recent studies on domain-specific BERT models show that effectiveness on downstream tasks can be improved when models are pretrained on in-domain data. Often, the pretraining data used in these models are selected based on their subject matter, e.g., biology or computer science. Given the range of applications using social media text, and its unique language variety, we pretrain two models on tweets and forum text respectively, and empirically demonstrate the effectiveness of these two resources. In addition, we investigate how similarity measures can be used to nominate in-domain pretraining data. We publicly release our pretrained models at https://bit.ly/35RpTf0.

Proceedings Article
01 Jan 2020
TL;DR: This work examined two mechanisms for incorporating treatment within the query formulation strategy for DFR: (1) a concatenation of disease, gene and treatment fields; and (2) a concatenation of disease and gene fields while filtering out documents where treatment terms were absent, combining both strategies with re-rankers trained either directly on the TREC PM 2017-2019 retrieval tasks or on a treatment-augmented version of these tasks.
Abstract: TREC Precision Medicine (PM) focuses on providing high-quality evidence from the biomedical literature for clinicians treating cancer patients. Our experiments focus on incorporating treatment into search. We established a promising baseline using PM 2017-2018 datasets for training and 2019 for validation. Our baseline consisted of a base-ranking step using Divergence From Randomness (DFR) scoring that used disease and gene as queries and an aggregated text field to represent documents, followed by a BERT-based neural reranker. We examined two mechanisms for incorporating the treatment within the query formulation strategy for DFR: (1) a concatenation of disease, gene and treatment fields; and (2) a concatenation of disease and gene fields, but filtering out the documents where treatment terms were absent. We experimented with both strategies in combination with re-rankers trained either directly on the TREC PM 2017-2019 retrieval tasks, or trained on a treatment-augmented version of these tasks. We obtained the best results using boolean retrieval for treatment terms with a re-ranker trained on non-augmented TREC PM datasets. Our top-ranking run achieved 0.530, 0.565, 0.436 for infNDCG, P@10, RPrec, respectively. TREC medians for these metrics were 0.432, 0.465, and 0.326.
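The two treatment-handling strategies can be sketched over a toy collection as below, with a plain term-overlap score standing in for the DFR base ranker and the BERT re-ranking step omitted. Document texts and query terms are invented for illustration.

```python
def overlap_score(query_terms, doc):
    """Toy relevance score: number of query terms present in the document."""
    return len(set(query_terms) & set(doc.lower().split()))

def strategy_concatenate(disease, gene, treatment, docs):
    """(1) Put disease, gene and treatment terms all into the query."""
    query = [t.lower() for t in (disease + gene + treatment)]
    return sorted(docs, key=lambda d: overlap_score(query, d), reverse=True)

def strategy_boolean_filter(disease, gene, treatment, docs):
    """(2) Query disease and gene only, but drop documents with no treatment term."""
    query = [t.lower() for t in (disease + gene)]
    kept = [d for d in docs if any(t.lower() in d.lower().split() for t in treatment)]
    return sorted(kept, key=lambda d: overlap_score(query, d), reverse=True)

docs = ["vemurafenib improves survival in braf mutant melanoma",
        "the genetics of braf mutations in melanoma",
        "dietary factors and melanoma risk"]
print(strategy_concatenate(["melanoma"], ["braf"], ["vemurafenib"], docs))
print(strategy_boolean_filter(["melanoma"], ["braf"], ["vemurafenib"], docs))
```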