Author

Sarvnaz Karimi

Other affiliations: University of Melbourne, RMIT University, NICTA ...
Bio: Sarvnaz Karimi is an academic researcher from the Commonwealth Scientific and Industrial Research Organisation. The author has contributed to research in topics: Computer science & Transliteration. The author has an h-index of 22 and has co-authored 94 publications receiving 1842 citations. Previous affiliations of Sarvnaz Karimi include University of Melbourne & RMIT University.


Papers
Book ChapterDOI
11 Oct 2006
TL;DR: A new model of Persian is introduced that takes into account the habit of shortening, or even omitting, runs of English vowels, which makes transliteration of Persian particularly difficult for phonetic based methods.
Abstract: Persian is an Indo-European language written using Arabic script, and is an official language of Iran, Afghanistan, and Tajikistan. Transliteration of Persian to English (that is, the character-by-character mapping of a Persian word that is not readily available in a bilingual dictionary) is an unstudied problem. In this paper we make three novel contributions. First, we present performance comparisons of existing grapheme-based transliteration methods on English to Persian. Second, we discuss the difficulties in establishing a corpus for studying transliteration. Finally, we introduce a new model of Persian that takes into account the habit of shortening, or even omitting, runs of English vowels. This trait makes transliteration of Persian particularly difficult for phonetic-based methods. This new model outperforms the existing grapheme-based methods on Persian, exhibiting a 24% relative increase in transliteration accuracy measured using the top-5 criteria.
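The abstract reports a relative increase in top-5 accuracy without spelling out either term. The sketch below shows one common reading, assumed here rather than taken from the paper: a source word counts as correct if any of its five highest-ranked candidates matches an accepted reference, and the relative increase is the gain divided by the baseline. The function names and the toy words are hypothetical.

```python
from typing import Sequence, Set


def top_k_accuracy(
    candidates: Sequence[Sequence[str]],
    references: Sequence[Set[str]],
    k: int = 5,
) -> float:
    """Fraction of source words with a correct transliteration in the top k.

    candidates[i] is a ranked list of system outputs for word i;
    references[i] is the set of accepted transliterations for word i.
    """
    hits = sum(
        1
        for ranked, accepted in zip(candidates, references)
        if any(c in accepted for c in ranked[:k])
    )
    return hits / len(candidates)


def relative_increase(baseline: float, new: float) -> float:
    """Relative (not absolute) improvement of `new` over `baseline`."""
    return (new - baseline) / baseline


# Hypothetical toy data, not taken from the paper.
system_output = [["tehran", "tahran"], ["shiraz"], ["esfehan", "isfahan"]]
accepted = [{"tehran"}, {"sheeraz"}, {"isfahan"}]
accuracy = top_k_accuracy(system_output, accepted, k=5)
```

Under this reading, a move from 50% to 62% top-5 accuracy would be a 24% relative increase but only a 12-point absolute one; those two figures are illustrative and do not come from the paper.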

24 citations

Journal ArticleDOI
TL;DR: This survey discusses approaches for epidemic intelligence that use textual datasets, referring to it as “text-based epidemic intelligence,” and views past work in terms of two broad categories: health mention classification and health event detection.
Abstract: Epidemic intelligence deals with the detection of outbreaks using formal (such as hospital records) and informal sources (such as user-generated text on the web) of information. In this survey, we discuss approaches for epidemic intelligence that use textual datasets, referring to it as “text-based epidemic intelligence.” We view past work in terms of two broad categories: health mention classification (selecting relevant text from a large volume) and health event detection (predicting epidemic events from a collection of relevant text). The focus of our discussion is the underlying computational linguistic techniques in the two categories. The survey also provides details of the state of the art in annotation techniques, resources, and evaluation strategies for epidemic intelligence.

23 citations

Posted Content
TL;DR: This work describes NNE—a fine-grained, nested named entity dataset over the full Wall Street Journal portion of the Penn Treebank, which comprises 279,795 mentions of 114 entity types with up to 6 layers of nesting.
Abstract: Named entity recognition (NER) is widely used in natural language processing applications and downstream tasks. However, most NER tools target flat annotation from popular datasets, eschewing the semantic information available in nested entity mentions. We describe NNE---a fine-grained, nested named entity dataset over the full Wall Street Journal portion of the Penn Treebank (PTB). Our annotation comprises 279,795 mentions of 114 entity types with up to 6 layers of nesting. We hope the public release of this large dataset for English newswire will encourage development of new techniques for nested NER.

22 citations

Proceedings ArticleDOI
24 Oct 2011
TL;DR: The derivation presented here for expected 1-call@k provides a novel theoretical perspective on the emergence of diversity via a latent subtopic model of relevance --- an idea underlying both ambiguous and faceted subtopic retrieval that have been used to motivate diverse retrieval.
Abstract: It has been previously observed that optimization of the 1-call@k relevance objective (i.e., a set-based objective that is 1 if at least one document is relevant, otherwise 0) empirically correlates with diverse retrieval. In this paper, we proceed one step further and show theoretically that greedily optimizing expected 1-call@k w.r.t. a latent subtopic model of binary relevance leads to a diverse retrieval algorithm sharing many features of existing diversification approaches. This new result is complementary to a variety of diverse retrieval algorithms derived from alternate rank-based relevance criteria such as average precision and reciprocal rank. As such, the derivation presented here for expected 1-call@k provides a novel theoretical perspective on the emergence of diversity via a latent subtopic model of relevance --- an idea underlying both ambiguous and faceted subtopic retrieval that have been used to motivate diverse retrieval.
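The 1-call@k objective defined in the abstract can be written down directly. This is a minimal sketch of the metric itself, assuming binary relevance labels given in ranked order; it does not reproduce the paper's greedy optimization or its latent subtopic model, and the function name is hypothetical.

```python
from typing import Sequence


def one_call_at_k(relevance: Sequence[int], k: int) -> int:
    """Return 1 if at least one of the top-k ranked documents is relevant, else 0.

    relevance[i] is the binary relevance label (0 or 1) of the document at rank i.
    """
    return int(any(relevance[:k]))


# Hypothetical ranking: the first relevant document sits at rank 3.
ranking = [0, 0, 1, 0, 1]
print(one_call_at_k(ranking, 2))  # 0
print(one_call_at_k(ranking, 3))  # 1
```

Because the score saturates once a single relevant document appears, placing a second document on the same subtopic adds nothing, which is one intuition for why optimizing this objective tends to favor diverse result sets.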

22 citations

Journal ArticleDOI
17 Mar 2020-PLOS ONE
TL;DR: This work demonstrates the value of social media for automated surveillance of infectious diseases such as the West Africa Ebola epidemic by experimenting with two variations of an existing surveillance architecture: one aggregates tweets related to different symptoms together, and the other considers tweets about each symptom separately.
Abstract: First reported in March 2014, an Ebola epidemic impacted West Africa, most notably Liberia, Guinea and Sierra Leone. We demonstrate the value of social media for automated surveillance of infectious diseases such as the West Africa Ebola epidemic. We experiment with two variations of an existing surveillance architecture: the first aggregates tweets related to different symptoms together, while the second considers tweets about each symptom separately and then aggregates the set of alerts generated by the architecture. Using a dataset of tweets posted from the affected region from 2011 to 2014, we obtain alerts in December 2013, which is three months prior to the official announcement of the epidemic. Among the two variations, the second, which produces a restricted but useful set of alerts, can potentially be applied to other infectious disease surveillance and alert systems.

21 citations


Cited by
Journal ArticleDOI
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. 
Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

13,246 citations

Proceedings ArticleDOI
23 Apr 2020
TL;DR: It is consistently found that multi-phase adaptive pretraining offers large gains in task performance, and it is shown that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable.
Abstract: Language models pretrained on text from a wide variety of sources form the foundation of today’s NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task. We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains, under both high- and low-resource settings. Moreover, adapting to the task’s unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining. Finally, we show that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable. Overall, we consistently find that multi-phase adaptive pretraining offers large gains in task performance.

1,532 citations

Journal ArticleDOI
TL;DR: It is found that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art.
Abstract: Deep learning describes a class of machine learning algorithms that are capable of combining raw inputs into layers of intermediate features. These algorithms have recently shown impressive results across a variety of domains. Biology and medicine are data-rich disciplines, but the data are complex and often ill-understood. Hence, deep learning techniques may be particularly well suited to solve problems of these fields. We examine applications of deep learning to a variety of biomedical problems (patient classification, fundamental biological processes and treatment of patients) and discuss whether deep learning will be able to transform these tasks or if the biomedical sphere poses unique challenges. Following from an extensive literature review, we find that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art. Even though improvements over previous baselines have been modest in general, the recent progress indicates that deep learning methods will provide valuable means for speeding up or aiding human investigation. Though progress has been made linking a specific neural network's prediction to input features, understanding how users should interpret these models to make testable hypotheses about the system under study remains an open challenge. Furthermore, the limited amount of labelled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning enabling changes at both bench and bedside with the potential to transform several areas of biology and medicine.

1,491 citations