
Showing papers by "Sarvnaz Karimi published in 2015"


Journal ArticleDOI
TL;DR: A new richly annotated corpus of medical forum posts on patient-reported Adverse Drug Events (ADEs) is presented; it contains text that is largely written in colloquial language and often deviates from formal English grammar and punctuation rules.

217 citations


Journal ArticleDOI
TL;DR: In order to highlight the importance of contributions made by computer scientists in this area so far, the existing approaches are categorized and reviewed, and, most importantly, areas where more research should be undertaken are identified.
Abstract: We review data mining and related computer science techniques that have been studied in the area of drug safety to identify signals of adverse drug reactions from different data sources, such as spontaneous reporting databases, electronic health records, and medical literature. Development of such techniques has become more crucial for public heath, especially with the growth of data repositories that include either reports of adverse drug reactions, which require fast processing for discovering signals of adverse reactions, or data sources that may contain such signals but require data or text mining techniques to discover them. In order to highlight the importance of contributions made by computer scientists in this area so far, we categorize and review the existing approaches, and most importantly, we identify areas where more research should be undertaken.

124 citations


Proceedings Article
25 Jul 2015
TL;DR: The described system uses natural language processing and data mining techniques to extract situation awareness information from Twitter messages generated during various disasters and crises.
Abstract: Social media platforms, such as Twitter, offer a rich source of real-time information about real-world events, particularly during mass emergencies. Sifting valuable information from social media provides useful insight into time-critical situations for emergency officers to understand the impact of hazards and act on emergency responses in a timely manner. This work focuses on analyzing Twitter messages generated during natural disasters, and shows how natural language processing and data mining techniques can be utilized to extract situation awareness information from Twitter. We present key relevant approaches that we have investigated including burst detection, tweet filtering and classification, online clustering, and geotagging.
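The burst detection step listed above can be sketched as a simple rolling z-score test over per-bin tweet counts; the window size, threshold, and `detect_bursts` helper below are illustrative assumptions, not the authors' actual method or parameters.

```python
from collections import deque

def detect_bursts(counts, window=24, threshold=3.0):
    """Flag time bins whose tweet count exceeds the rolling mean of the
    previous `window` bins by more than `threshold` standard deviations."""
    history = deque(maxlen=window)
    bursts = []
    for t, count in enumerate(counts):
        if len(history) == window:
            mean = sum(history) / window
            var = sum((x - mean) ** 2 for x in history) / window
            std = var ** 0.5 or 1.0  # guard against a zero-variance window
            if (count - mean) / std > threshold:
                bursts.append(t)
        history.append(count)
    return bursts
```

In a real system the flagged bins would then feed the downstream filtering, classification, and clustering stages.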

46 citations


Journal ArticleDOI
TL;DR: The high accuracy and low cost of the classification methods allow for an effective means for automatic and real-time surveillance of diabetes, influenza, pneumonia and HIV deaths.
Abstract: Death certificates provide an invaluable source for mortality statistics which can be used for surveillance and early warnings of increases in disease activity and to support the development and monitoring of prevention or response strategies. However, their value can be realised only if accurate, quantitative data can be extracted from death certificates, an aim hampered by both the volume and variable nature of certificates written in natural language. This study aims to develop a set of machine learning and rule-based methods to automatically classify death certificates according to four high impact diseases of interest: diabetes, influenza, pneumonia and HIV.

44 citations


01 Jan 2015
TL;DR: In this paper, a set of machine learning and rule-based methods were used to automatically classify death certificates according to four high impact diseases of interest: diabetes, influenza, pneumonia and HIV.
Abstract: Background: Death certificates provide an invaluable source for mortality statistics which can be used for surveillance and early warnings of increases in disease activity and to support the development and monitoring of prevention or response strategies. However, their value can be realised only if accurate, quantitative data can be extracted from death certificates, an aim hampered by both the volume and variable nature of certificates written in natural language. This study aims to develop a set of machine learning and rule-based methods to automatically classify death certificates according to four high impact diseases of interest: diabetes, influenza, pneumonia and HIV. Methods: Two classification methods are presented: i) a machine learning approach, where detailed features (terms, term n-grams and SNOMED CT concepts) are extracted from death certificates and used to train a set of supervised machine learning models (Support Vector Machines); and ii) a set of keyword-matching rules. These methods were used to identify the presence of diabetes, influenza, pneumonia and HIV in a death certificate. An empirical evaluation was conducted using 340,142 death certificates, divided between training and test sets, covering deaths from 2000–2007 in New South Wales, Australia. Precision and recall (positive predictive value and sensitivity) were used as evaluation measures, with F-measure providing a single, overall measure of effectiveness. A detailed error analysis was performed on classification errors. Results: Classification of diabetes, influenza, pneumonia and HIV was highly accurate (F-measure 0.96). More fine-grained ICD-10 classification effectiveness was more variable but still high (F-measure 0.80). The error analysis revealed that word variations as well as certain word combinations adversely affected classification. In addition, anomalies in the ground truth likely led to an underestimation of the effectiveness. Conclusions: The high accuracy and low cost of the classification methods allow for an effective means for automatic and real-time surveillance of diabetes, influenza, pneumonia and HIV deaths. In addition, the methods are generally applicable to other diseases of interest and to other sources of medical free-text besides death certificates.
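The keyword-matching rules (method ii) can be sketched roughly as below; the keyword lists and the `classify_certificate` helper are illustrative stand-ins, not the rules used in the study.

```python
import re

# Illustrative keyword lists for the four target diseases;
# the study's actual rule set is not reproduced here.
RULES = {
    "diabetes": ["diabetes", "diabetic"],
    "influenza": ["influenza", "flu"],
    "pneumonia": ["pneumonia", "pneumonitis"],
    "hiv": ["hiv", "aids", "human immunodeficiency"],
}

def classify_certificate(text):
    """Return the set of target diseases whose keywords appear in the
    free-text certificate. Word boundaries avoid false hits such as
    'flu' matching inside 'fluid'."""
    found = set()
    for disease, keywords in RULES.items():
        for k in keywords:
            if re.search(r"\b" + re.escape(k) + r"\b", text, re.IGNORECASE):
                found.add(disease)
                break
    return found
```

The machine learning variant (method i) would instead turn each certificate into term and n-gram features and train an SVM per disease.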

32 citations


Posted Content
TL;DR: This study is the first to systematically examine the effect of popular concept extraction methods in the area of signal detection for adverse reactions, and shows that the choice of algorithm or controlled vocabulary has a significant impact on concept extraction, which will impact the overall signal detection process.
Abstract: Social media is becoming an increasingly important source of information to complement traditional pharmacovigilance methods. In order to identify signals of potential adverse drug reactions, it is necessary to first identify medical concepts in the social media text. Most of the existing studies use dictionary-based methods which are not evaluated independently from the overall signal detection task. We compare different approaches to automatically identify and normalise medical concepts in consumer reviews in medical forums. Specifically, we implement several dictionary-based methods popular in the relevant literature, as well as a method we suggest based on a state-of-the-art machine learning method for entity recognition. MetaMap, a popular biomedical concept extraction tool, is used as a baseline. Our evaluations were performed in a controlled setting on a common corpus which is a collection of medical forum posts annotated with concepts and linked to controlled vocabularies such as MedDRA and SNOMED CT. To our knowledge, our study is the first to systematically examine the effect of popular concept extraction methods in the area of signal detection for adverse reactions. We show that the choice of algorithm or controlled vocabulary has a significant impact on concept extraction, which will impact the overall signal detection process. We also show that our proposed machine learning approach significantly outperforms all the other methods in identification of both adverse reactions and drugs, even when trained with a relatively small set of annotated text.
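A minimal sketch of the dictionary-based family of methods the paper compares against: greedy longest-match lookup of token spans in a controlled vocabulary. The toy lexicon below is illustrative; the codes are not real MedDRA mappings.

```python
# Toy lexicon mapping surface phrases to controlled-vocabulary codes
# (codes are made up for illustration).
TOY_LEXICON = {
    "headache": "MedDRA:C001",
    "stomach pain": "MedDRA:C002",
    "nausea": "MedDRA:C003",
}

def extract_concepts(text, lexicon=TOY_LEXICON, max_len=4):
    """Greedy longest-match concept extraction: scan left to right,
    preferring the longest token span present in the lexicon."""
    tokens = text.lower().split()
    spans, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                spans.append((phrase, lexicon[phrase]))
                i += n
                break
        else:
            i += 1
    return spans
```

The machine learning alternative the paper proposes would instead label token sequences with a trained entity recogniser rather than rely on exact dictionary hits.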

24 citations


Proceedings Article
27 Jun 2015
TL;DR: This work focuses on analyzing Twitter messages generated during natural disasters, and shows how natural language processing and data mining techniques can be utilized to extract situation awareness information from Twitter.
Abstract: Social media platforms, such as Twitter, offer a rich source of real-time information about real-world events, particularly during mass emergencies. Sifting valuable information from social media provides useful insight into time-critical situations for emergency officers to understand the impact of hazards and act on emergency responses in a timely manner. This work focuses on analyzing Twitter messages generated during natural disasters, and shows how natural language processing and data mining techniques can be utilized to extract situation awareness information from Twitter. We present key relevant approaches that we have investigated including burst detection, tweet filtering and classification, online clustering, and geotagging.

14 citations


Journal ArticleDOI
TL;DR: By ignoring the statistical dependence of text messages published on social media, standard cross-validation can produce misleading conclusions in a machine learning task; this work explores alternative evaluation methods that explicitly deal with statistical dependence in text.
Abstract: In recent years, many studies have been published on data collected from social media, especially microblogs such as Twitter. However, rather few of these studies have considered evaluation methodologies that take into account the statistically dependent nature of such data, which breaks the theoretical conditions for using cross-validation. Despite concerns raised in the past about using cross-validation for data of similar characteristics, such as time series, some of these studies evaluate their work using standard k-fold cross-validation. Through experiments on Twitter data collected during a two-year period that includes disastrous events, we show that by ignoring the statistical dependence of the text messages published in social media, standard cross-validation can result in misleading conclusions in a machine learning task. We explore alternative evaluation methods that explicitly deal with statistical dependence in text. Our work also raises concerns for any other data for which similar conditions might hold.
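One dependence-aware alternative to random k-fold splitting is forward-chaining evaluation, where each test fold is strictly later in time than all of its training data. The sketch below assumes simple expanding-window folds and is not necessarily the authors' exact protocol.

```python
def time_based_folds(items, timestamps, n_folds=5):
    """Yield (train, test) splits ordered by time: fold k trains on the
    first k chronological slices and tests on the next one, so no test
    item precedes any training item."""
    order = sorted(range(len(items)), key=lambda i: timestamps[i])
    fold_size = len(items) // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train = [items[i] for i in order[:k * fold_size]]
        test = [items[i] for i in order[k * fold_size:(k + 1) * fold_size]]
        yield train, test
```

Unlike shuffled k-fold, this keeps temporally (and hence topically) dependent tweets from leaking between training and test sets.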

13 citations


Proceedings ArticleDOI
22 Oct 2015
TL;DR: CADEminer, a system that mines consumer reviews on medications in order to facilitate discovery of drug side effects that may not have been identified in clinical trials, is introduced.
Abstract: We introduce CADEminer, a system that mines consumer reviews on medications in order to facilitate discovery of drug side effects that may not have been identified in clinical trials. CADEminer utilises search and natural language processing techniques to (a) extract mentions of side effects, and other relevant concepts such as drug names and diseases in reviews; (b) normalise the extracted mentions to their unified representation in ontologies such as SNOMED CT and MedDRA; (c) identify relationships between extracted concepts, such as a drug caused a side effect; (d) search in authoritative lists of known drug side effects to identify whether or not the extracted side effects are new and therefore require further investigation; and finally (e) provide statistics and visualisation of the data.
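The five stages (a)–(e) form a pipeline, which can be sketched abstractly as below. The function names, their composition, and the known-ADE list are hypothetical illustrations, not CADEminer's actual API.

```python
# Stand-in for an authoritative list of known drug side effects (stage d).
KNOWN_ADES = {("ibuprofen", "stomach pain")}

def mine_review(text, extract, normalise, relate):
    """Hypothetical composition of the pipeline stages described above."""
    mentions = extract(text)                                # (a) extract mentions
    concepts = [normalise(m) for m in mentions]             # (b) map to SNOMED CT / MedDRA
    relations = relate(concepts)                            # (c) drug-caused-effect pairs
    novel = [r for r in relations if r not in KNOWN_ADES]   # (d) drop already-known effects
    return {"relations": relations, "novel": novel}         # (e) feed stats/visualisation
```

Each stage could be swapped independently, e.g. replacing the extractor with a trained entity recogniser while keeping the rest of the pipeline unchanged.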

12 citations