Author

Paul Cook

Bio: Paul Cook is an academic researcher at the University of New Brunswick. He has contributed to research in topics including SemEval and language models, has an h-index of 26, and has co-authored 74 publications receiving 2,938 citations. His previous affiliations include the University of Toronto and the University of Melbourne.


Papers
Proceedings ArticleDOI
01 Jul 2015
TL;DR: This work proposes a learning method that needs less data, based on the observation that there are underlying shared structures across languages, and exploits cues from a different source language in order to guide the learning process.
Abstract: Training a high-accuracy dependency parser requires a large treebank. However, these are costly and time-consuming to build. We propose a learning method that needs less data, based on the observation that there are underlying shared structures across languages. We exploit cues from a different source language in order to guide the learning process. Our model saves at least half of the annotation effort to reach the same accuracy compared with using the purely supervised method.

355 citations

Journal ArticleDOI
TL;DR: This paper presents an integrated geolocation prediction framework, evaluates the impact of non-geotagged tweets, language, and user-declared metadata on geolocation prediction, and discusses how users differ in terms of their geolocatability.
Abstract: Geographical location is vital to geospatial applications like local search and event detection. In this paper, we investigate and improve on the task of text-based geolocation prediction of Twitter users. Previous studies on this topic have typically assumed that geographical references (e.g., gazetteer terms, dialectal words) in a text are indicative of its author's location. However, these references are often buried in informal, ungrammatical, and multilingual data, and are therefore non-trivial to identify and exploit. We present an integrated geolocation prediction framework and investigate what factors impact on prediction accuracy. First, we evaluate a range of feature selection methods to obtain "location indicative words". We then evaluate the impact of nongeotagged tweets, language, and user-declared metadata on geolocation prediction. In addition, we evaluate the impact of temporal variance on model generalisation, and discuss how users differ in terms of their geolocatability. We achieve state-of-the-art results for the text-based Twitter user geolocation task, and also provide the most extensive exploration of the task to date. Our findings provide valuable insights into the design of robust, practical text-based geolocation prediction systems.
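The notion of "location indicative words" above can be illustrated with a toy sketch. This is not one of the paper's actual feature-selection methods; the data, city labels, and the simple concentration score are all hypothetical:

```python
from collections import Counter, defaultdict

# Toy corpus of (city, tweet) pairs; data and cities are illustrative only.
tweets = [
    ("melbourne", "tram delays on swanston street again"),
    ("melbourne", "coffee near swanston is the best"),
    ("london",    "the tube is packed this morning"),
    ("london",    "mind the gap on the tube"),
    ("melbourne", "great coffee this morning"),
    ("london",    "coffee queue near the tube"),
]

def location_indicative_words(tweets, min_count=2):
    """Rank words by how concentrated their usage is in one city --
    a crude stand-in for the feature-selection statistics the paper evaluates."""
    word_city = defaultdict(Counter)
    for city, text in tweets:
        for w in set(text.split()):  # count each word once per tweet
            word_city[w][city] += 1
    scores = {}
    for w, cities in word_city.items():
        total = sum(cities.values())
        if total >= min_count:
            scores[w] = max(cities.values()) / total  # P(most likely city | word)
    return sorted(scores.items(), key=lambda kv: -kv[1])

for word, score in location_indicative_words(tweets)[:4]:
    print(word, round(score, 2))
```

On this toy data, place-specific terms like "swanston" and "tube" score 1.0, while general words like "coffee" spread across cities and score lower, which is the intuition behind filtering for location indicative words.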

328 citations

Proceedings Article
Timothy Baldwin, Paul Cook, Marco Lui, Andrew MacKinlay, Li Wang
01 Oct 2013
TL;DR: This work investigates just how linguistically noisy or otherwise social media text is over a range of sources, in the form of YouTube comments, Twitter posts, web user forum posts, blog posts and Wikipedia, which are compared to a reference corpus of edited English text.
Abstract: While various claims have been made about text in social media being noisy, there has never been a systematic study to investigate just how linguistically noisy or otherwise it is over a range of social media sources. We explore this question empirically over popular social media text types, in the form of YouTube comments, Twitter posts, web user forum posts, blog posts and Wikipedia, which we compare to a reference corpus of edited English text. We first extract various descriptive statistics from each data type (including the distribution of languages, average sentence length and proportion of out-of-vocabulary words), and then investigate the proportion of grammatical sentences in each, based on a linguistically-motivated parser. We also investigate the relative similarity between different data types.
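One of the descriptive statistics mentioned above, the proportion of out-of-vocabulary words, is straightforward to compute. A minimal sketch with a toy vocabulary (the actual study derives its vocabulary from a reference corpus of edited English):

```python
def oov_proportion(tokens, vocab):
    """Proportion of tokens not found in a reference vocabulary -- one of
    the descriptive statistics compared across social-media sources."""
    oov = [t for t in tokens if t.lower() not in vocab]
    return len(oov) / len(tokens)

# Toy vocabulary and a typical noisy microblog message.
vocab = {"see", "you", "tomorrow", "before", "the", "game"}
tweet = "c u tmrw b4 the game".split()
print(round(oov_proportion(tweet, vocab), 2))  # 4 of 6 tokens are OOV
```

Higher OOV proportions for Twitter and YouTube than for Wikipedia are exactly the kind of contrast such a statistic surfaces.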

234 citations

Proceedings Article
12 Jul 2012
TL;DR: This paper proposes a method for constructing a dictionary of lexical variants of known words that facilitates lexical normalisation via simple string substitution and shows that a dictionary-based approach achieves state-of-the-art performance for both F-score and word error rate on a standard dataset.
Abstract: Microblog normalisation methods often utilise complex models and struggle to differentiate between correctly-spelled unknown words and lexical variants of known words. In this paper, we propose a method for constructing a dictionary of lexical variants of known words that facilitates lexical normalisation via simple string substitution (e.g., tomorrow for tmrw). We use context information to generate possible variant and normalisation pairs and then rank these by string similarity. Highly-ranked pairs are selected to populate the dictionary. We show that a dictionary-based approach achieves state-of-the-art performance for both F-score and word error rate on a standard dataset. Compared with other methods, this approach offers a fast, lightweight and easy-to-use solution, and is thus suitable for high-volume microblog pre-processing.
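The pipeline described in the abstract (generate candidate pairs, rank by string similarity, keep the highest-ranked pairs, then normalise by substitution) can be sketched as follows. The candidate sets here are hand-picked stand-ins for the context-derived candidates the paper generates, and `difflib` is just one possible string-similarity measure:

```python
import difflib

# Hypothetical (variant -> candidate normalisations) sets, standing in for
# pairs generated from distributional context.
candidates = {
    "tmrw": ["tomorrow", "term"],
    "msg": ["message", "miss"],
}

def build_dictionary(candidates):
    """Keep the candidate with the highest string similarity per variant."""
    norm = {}
    for variant, forms in candidates.items():
        norm[variant] = max(
            forms,
            key=lambda f: difflib.SequenceMatcher(None, variant, f).ratio(),
        )
    return norm

def normalise(text, norm):
    """Lexical normalisation via simple string substitution."""
    return " ".join(norm.get(tok, tok) for tok in text.split())

norm = build_dictionary(candidates)
print(normalise("send msg tmrw", norm))  # send message tomorrow
```

Once the dictionary is built offline, normalisation at run time is a dictionary lookup per token, which is what makes the approach fast and suitable for high-volume pre-processing.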

203 citations

Journal ArticleDOI
TL;DR: This article develops statistical measures that each model a specific property of idiomatic expressions by looking at their actual usage patterns in text, and uses some of the measures in a token identification task to distinguish idiomatic and literal usages of potentially idiomatic expressions in context.
Abstract: Idiomatic expressions are plentiful in everyday language, yet they remain mysterious, as it is not clear exactly how people learn and understand them. They are of special interest to linguists, psycholinguists, and lexicographers, mainly because of their syntactic and semantic idiosyncrasies as well as their unclear lexical status. Despite a great deal of research on the properties of idioms in the linguistics literature, there is not much agreement on which properties are characteristic of these expressions. Because of their peculiarities, idiomatic expressions have mostly been overlooked by researchers in computational linguistics. In this article, we look into the usefulness of some of the identified linguistic properties of idioms for their automatic recognition. Specifically, we develop statistical measures that each model a specific property of idiomatic expressions by looking at their actual usage patterns in text. We use these statistical measures in a type-based classification task where we automatically separate idiomatic expressions (expressions with a possible idiomatic interpretation) from similar-on-the-surface literal phrases (for which no idiomatic interpretation is possible). In addition, we use some of the measures in a token identification task where we distinguish idiomatic and literal usages of potentially idiomatic expressions in context.
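As a rough illustration of what such a statistical measure can look like, here is a simplified lexical-fixedness-style score, not the article's exact formulation, computed from hypothetical co-occurrence counts. The intuition: idioms like "spill the beans" resist substitution of their noun, so the canonical pair's association greatly exceeds that of its lexical variants:

```python
import math

# Toy co-occurrence counts (all numbers are hypothetical) for the pair
# "spill the beans" and variants formed by substituting related nouns.
pair_count = {("spill", "beans"): 80, ("spill", "peas"): 2, ("spill", "lentils"): 1}
verb_count = {"spill": 200}
noun_count = {"beans": 300, "peas": 250, "lentils": 100}
N = 100_000  # total verb-noun pairs in the toy corpus

def pmi(v, n):
    """Pointwise mutual information of a verb-noun pair."""
    joint = pair_count.get((v, n), 0)
    if joint == 0:
        return float("-inf")
    return math.log2(joint * N / (verb_count[v] * noun_count[n]))

def lexical_fixedness(v, n, variant_nouns):
    """PMI of the canonical pair minus the mean PMI of its lexical
    variants: idioms resist substitution, so the gap is large."""
    variant_pmis = [pmi(v, vn) for vn in variant_nouns]
    return pmi(v, n) - sum(variant_pmis) / len(variant_pmis)

print(round(lexical_fixedness("spill", "beans", ["peas", "lentils"]), 2))
```

A large positive score suggests an idiomatic type; a literal phrase such as "spill the water" would tolerate substitution and score near zero.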

188 citations


Cited by
Posted Content
TL;DR: This article seeks to help ML practitioners apply MTL by shedding light on how MTL works and providing guidelines for choosing appropriate auxiliary tasks, particularly in deep neural networks.
Abstract: Multi-task learning (MTL) has led to successes in many applications of machine learning, from natural language processing and speech recognition to computer vision and drug discovery. This article aims to give a general overview of MTL, particularly in deep neural networks. It introduces the two most common methods for MTL in Deep Learning, gives an overview of the literature, and discusses recent advances. In particular, it seeks to help ML practitioners apply MTL by shedding light on how MTL works and providing guidelines for choosing appropriate auxiliary tasks.
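Of the two common MTL schemes in deep learning that the survey introduces, hard parameter sharing is the simpler. A minimal NumPy sketch (layer sizes and task names are illustrative) in which a single shared hidden layer feeds task-specific output heads:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hard parameter sharing: one shared representation layer, one output
# head per task. Gradients from every task would update W_shared.
W_shared = rng.normal(size=(8, 4))          # shared hidden layer
heads = {"pos_tagging": rng.normal(size=(4, 5)),   # 5 output classes
         "chunking": rng.normal(size=(4, 3))}      # 3 output classes

def forward(x, task):
    h = np.tanh(x @ W_shared)               # features shared across tasks
    return h @ heads[task]                  # task-specific projection

x = rng.normal(size=(2, 8))                 # batch of 2 inputs
print(forward(x, "pos_tagging").shape)
print(forward(x, "chunking").shape)
```

The design choice the survey discusses is exactly this split: sharing the representation regularises it toward features useful for all tasks, which is why choosing good auxiliary tasks matters.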

2,202 citations

01 Jan 2006
TL;DR: This book discusses the development of English as a global language in the 20th century, and the aspects of that development that have changed since the publication of the first edition.
Abstract: A catalogue record for this book is available from the British Library. ISBN 0 521 82347 1 (hardback); ISBN 0 521 53032 6 (paperback). Contents: List of tables; Preface to the second edition; Preface to the first edition. 1. Why a global language? (What is a global language?; What makes a global language?; Why do we need a global language?; What are the dangers of a global language?; Could anything stop a global language?; A critical era). 2. Why English? The historical context (Origins; America; Canada; The Caribbean; Australia and New Zealand; South Africa; South Asia; Former colonial Africa; Southeast Asia and the South Pacific; A world view).

1,857 citations

01 Jan 2005
TL;DR: In “Constructing a Language,” Tomasello presents a contrasting theory of how the child acquires language: language development is enabled not by a universal grammar, but by two sets of cognitive skills, resulting from biological/phylogenetic adaptations, that are fundamental to the ontogenetic origins of language.
Abstract: Child psychiatrists, pediatricians, and other child clinicians need to have a solid understanding of child language development. There are at least four important reasons that make this necessary. First, slowing, arrest, and deviation of language development are highly associated with, and complicate the course of, child psychopathology. Second, language competence plays a crucial role in emotional and mood regulation, evaluation, and therapy. Third, language deficits are the most frequent underpinning of the learning disorders, ubiquitous in our clinical populations. Fourth, clinicians should not confuse the rich linguistic and dialectal diversity of our clinical populations with abnormalities in child language development. The challenge for the clinician becomes, then, how to get immersed in the captivating field of child language acquisition without getting overwhelmed by its conceptual and empirical complexity. In the past 50 years and since the seminal works of Roger Brown, Jerome Bruner, and Catherine Snow, child language researchers (often known as developmental psycholinguists) have produced a remarkable body of knowledge. Linguists such as Chomsky and philosophers such as Grice have strongly influenced the science of child language. One of the major tenets of Chomskian linguistics (known as generative grammar) is that children’s capacity to acquire language is “hardwired” with “universal grammar”—an innate language acquisition device (LAD), a language “instinct”—at its core. 
This view is in part supported by the assertion that the linguistic input that children receive is relatively dismal and of poor quality relative to the high quantity and quality of output that they manage to produce after age 2 and that only an advanced, innate capacity to decode and organize linguistic input can enable them to “get from here (prelinguistic infant) to there (linguistic child).” In “Constructing a Language,” Tomasello presents a contrasting theory of how the child acquires language: It is not a universal grammar that allows for language development. Rather, human cognition universals of communicative needs and vocal-auditory processing result in some language universals, such as nouns and verbs as expressions of reference and predication (p. 19). The author proposes that two sets of cognitive skills resulting from biological/phylogenetic adaptations are fundamental to the ontogenetic origins of language. These sets of inherited cognitive skills are intention-reading, on the one hand, and pattern-finding, on the other. Intention-reading skills encompass the prelinguistic infant’s capacities to share attention to outside events with other persons, establishing joint attentional frames, to understand other people’s communicative intentions, and to imitate the adult’s communicative intentions (an intersubjective form of imitation that requires symbolic understanding and perspective-taking). Pattern-finding skills include the ability of infants as young as 7 months old to analyze concepts and percepts (most relevant here, auditory or speech percepts) and create concrete or abstract categories that contain analogous items. Tomasello, a most prominent developmental scientist with research foci on child language acquisition and on social cognition and social learning in children and primates, succinctly and clearly introduces the major points of his theory and his views on the origins of language in the initial chapters.
In subsequent chapters, he delves into the details by covering most language acquisition domains, namely, word (lexical) learning, syntax and morphology, and conversation, narrative, and extended discourse. Although one of the remaining domains (pragmatics) is at the core of his theory and permeates the text throughout, the relative paucity of passages explicitly devoted to discussing acquisition and pro…

1,757 citations

Journal ArticleDOI

764 citations

01 Jan 2010
TL;DR: The Stanford typed dependencies representation was designed to provide a simple description of the grammatical relationships in a sentence that can easily be understood and effectively used by people without linguistic expertise who want to extract textual relations.
Abstract: The Stanford typed dependencies representation was designed to provide a simple description of the grammatical relationships in a sentence that can easily be understood and effectively used by people without linguistic expertise who want to extract textual relations. In particular, rather than the phrase structure representations that have long dominated in the computational linguistic community, it represents all sentence relationships uniformly as typed dependency relations: that is, as triples of a relation between pairs of words, such as “the subject of distributes is Bell.” Our experience is that this simple, uniform representation is quite accessible to non-linguists thinking about tasks involving information extraction from text and is quite effective in relation extraction applications.
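The triple representation described above is easy to mirror in code. A minimal sketch, where the nsubj triple comes from the abstract's own example ("the subject of distributes is Bell") and the second relation is purely illustrative:

```python
from collections import namedtuple

# A typed dependency is a triple: relation(governor, dependent).
Dep = namedtuple("Dep", ["relation", "governor", "dependent"])

deps = [
    Dep("nsubj", "distributes", "Bell"),      # from the abstract's example
    Dep("dobj", "distributes", "products"),   # illustrative only
]

for d in deps:
    print(f"{d.relation}({d.governor}, {d.dependent})")
```

Because every relationship is a uniform triple, extracting, filtering, or joining relations (e.g., all nsubj edges) reduces to simple operations over a list of tuples, which is what makes the representation convenient for relation extraction.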

750 citations