
Showing papers by "Paul Cook published in 2016"


Proceedings ArticleDOI
01 Aug 2016
TL;DR: This paper proposes supervised and unsupervised approaches, based on word embeddings, to identifying token instances of verb–noun idiomatic combinations (VNICs): idioms consisting of a verb with a noun in its direct object position.
Abstract: Verb–noun idiomatic combinations (VNICs) are idioms consisting of a verb with a noun in its direct object position. Usages of these expressions can be ambiguous between an idiomatic usage and a literal combination. In this paper we propose supervised and unsupervised approaches, based on word embeddings, to identifying token instances of VNICs. Our proposed supervised and unsupervised approaches perform better than the supervised and unsupervised approaches of Fazly et al. (2009), respectively.

1 Verb–noun Idiomatic Combinations

Much research on multiword expressions (MWEs) in natural language processing (NLP) has focused on various type-level prediction tasks, e.g., MWE extraction (e.g., Church and Hanks, 1990; Smadja, 1993; Lin, 1999) — i.e., determining which MWE types are present in a given corpus (Baldwin and Kim, 2010) — and compositionality prediction (e.g., McCarthy et al., 2003; Reddy et al., 2011; Salehi et al., 2014). However, word combinations can be ambiguous between literal combinations and MWEs. For example, consider the following two usages of the expression hit the roof:

1. I think Paula might hit the roof if you start ironing.
2. When the blood hit the roof of the car I realised it was serious.

The first example of hit the roof is an idiomatic usage, while the second is a literal combination (these examples, and the idiomaticity judgements, are taken from Cook et al. (2008)). MWE identification is the task of determining which token instances in running text are MWEs (Baldwin and Kim, 2010). Although there has been relatively less work on MWE identification than on other type-level MWE prediction tasks, it is nevertheless important for NLP applications such as machine translation that must be able to distinguish MWEs from literal combinations in context. Some recent work has focused on token-level identification of a wide range of types of MWEs and other multiword units (e.g., Newman et al., 2012; Schneider et al., 2014; Brooke et al., 2014).
Many studies, however, have taken a word sense disambiguation–inspired approach to MWE identification (e.g., Birke and Sarkar, 2006; Katz and Giesbrecht, 2006; Li et al., 2010), treating literal combinations and MWEs as different word senses, and have exploited linguistic knowledge of MWEs (e.g., Patrick and Fletcher, 2005; Uchiyama et al., 2005; Hashimoto and Kawahara, 2008; Fazly et al., 2009; Fothergill and Baldwin, 2012).

In this study we focus on English verb–noun idiomatic combinations (VNICs). VNICs are formed from a verb with a noun in its direct object position. They are a common and productive type of English idiom, and occur cross-lingually (Fazly et al., 2009). VNICs tend to be relatively lexico-syntactically fixed, e.g., whereas hit the roof is ambiguous between literal and idiomatic meanings, hit the roofs and a roof was hit are most likely to be literal usages.

Fazly et al. (2009) exploit this property in their unsupervised approach, referred to as CFORM. They define lexico-syntactic patterns for VNIC token instances based on the noun’s determiner (e.g., a, the, or possibly no determiner), the number of the noun (singular or plural), and the verb’s voice (active or passive). They propose a statistical method for automatically determining a given VNIC type’s canonical idiomatic form, based on the frequency of its usage in these
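The canonical-form (CFORM) idea can be sketched as follows: given counts of a VNIC's lexico-syntactic variants, select as canonical any variant whose frequency sits well above the mean. The counts and the z-score threshold below are invented for illustration; this is a minimal sketch of the statistic, not Fazly et al.'s exact method.

```python
from math import sqrt

def canonical_forms(variant_counts, threshold=1.0):
    """Return the variant(s) of a VNIC whose frequency z-score
    exceeds a threshold, a rough stand-in for a canonical-form test.
    variant_counts maps each lexico-syntactic variant to its corpus
    frequency; all counts here are invented for illustration."""
    n = len(variant_counts)
    mean = sum(variant_counts.values()) / n
    std = sqrt(sum((c - mean) ** 2 for c in variant_counts.values()) / n)
    return [v for v, c in variant_counts.items()
            if std and (c - mean) / std > threshold]

counts = {"hit the roof": 80, "hit a roof": 3,
          "hit roofs": 2, "roof was hit": 5}
print(canonical_forms(counts))  # ['hit the roof']
```

Token instances matching the canonical form would then be labelled idiomatic, and other variants literal.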

23 citations


Proceedings Article
01 May 2016
TL;DR: A number of measures of corpus similarity were evaluated; overall, the topic modelling approach did not improve on a chi-square method that had previously been found to work well for measuring corpus similarity.
Abstract: Web corpora are often constructed automatically, and their contents are therefore often not well understood. One technique for assessing the composition of such a web corpus is to empirically measure its similarity to a reference corpus whose composition is known. In this paper we evaluate a number of measures of corpus similarity, including a method based on topic modelling which has not been previously evaluated for this task. To evaluate these methods we use known-similarity corpora that have been previously used for this purpose, as well as a number of newly-constructed known-similarity corpora targeting differences in genre, topic, time, and region. Our findings indicate that, overall, the topic modelling approach did not improve on a chi-square method that had previously been found to work well for measuring corpus similarity.
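The chi-square method referred to in the abstract compares the word-frequency profiles of two corpora, with lower scores indicating more similar corpora. The sketch below, with toy token lists and an assumed `top_n` cutoff, illustrates the idea rather than the paper's exact implementation.

```python
from collections import Counter

def chi_square_similarity(corpus_a, corpus_b, top_n=500):
    """Chi-square corpus comparison over the most frequent words of
    the combined corpora: lower values mean more similar corpora.
    Inputs are token lists; real use would compare large corpora."""
    fa, fb = Counter(corpus_a), Counter(corpus_b)
    na, nb = sum(fa.values()), sum(fb.values())
    words = [w for w, _ in (fa + fb).most_common(top_n)]
    chi2 = 0.0
    for w in words:
        oa, ob = fa[w], fb[w]
        # Expected counts if both corpora drew from one distribution.
        ea = (oa + ob) * na / (na + nb)
        eb = (oa + ob) * nb / (na + nb)
        chi2 += (oa - ea) ** 2 / ea + (ob - eb) ** 2 / eb
    return chi2

print(chi_square_similarity("the cat sat".split(),
                            "the cat sat".split()))  # 0.0
```

Identical corpora score exactly zero, since every observed count matches its expectation.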

19 citations


Proceedings ArticleDOI
01 Jun 2016
TL;DR: This paper considers several approaches to predicting semantic textual similarity using word embeddings, as well as methods for forming embeddings for larger units of text, and compares these methods to several baselines, finding that none of them outperform the baselines.
Abstract: In this paper we consider several approaches to predicting semantic textual similarity using word embeddings, as well as methods for forming embeddings for larger units of text. We compare these methods to several baselines, and find that none of them outperform the baselines. We then consider both a supervised and an unsupervised approach to combining these methods, which achieve modest improvements over the baselines.
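One simple way of forming embeddings for larger units of text is to average the word vectors of a sentence and compare sentences by cosine similarity. The tiny hand-made vectors below are assumptions for illustration only; a real system would use pretrained embeddings.

```python
from math import sqrt

# Toy word vectors; a real system would load pretrained embeddings.
VECS = {
    "dog": [1.0, 0.2, 0.0],
    "cat": [0.9, 0.3, 0.1],
    "car": [0.0, 0.1, 1.0],
    "barks": [0.8, 0.5, 0.0],
    "drives": [0.1, 0.0, 0.9],
}

def sentence_vector(tokens):
    """Embed a sentence as the mean of its known word vectors, one of
    the simplest ways of embedding a larger unit of text."""
    vs = [VECS[t] for t in tokens if t in VECS]
    return [sum(dim) / len(vs) for dim in zip(*vs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

s1 = sentence_vector("the dog barks".split())
s2 = sentence_vector("a cat barks".split())
s3 = sentence_vector("the car drives".split())
print(cosine(s1, s2) > cosine(s1, s3))  # True
```

Averaging is a common baseline for this task; the paper's point is that such baselines are hard to beat.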

7 citations


Proceedings Article
01 Dec 2016
TL;DR: This paper describes the first attempt to learn the MWE inventory of a “surprise” language for which the authors have no explicit prior knowledge of MWE patterns, certainly no annotated MWE data, and not even a parallel corpus.
Abstract: Much previous research on multiword expressions (MWEs) has focused on the token- and type-level tasks of MWE identification and extraction, respectively. Such studies typically target known prevalent MWE types in a given language. This paper describes the first attempt to learn the MWE inventory of a “surprise” language for which we have no explicit prior knowledge of MWE patterns, certainly no annotated MWE data, and not even a parallel corpus. Our proposed model is trained on a treebank with MWE relations of a source language, and can be applied to the monolingual corpus of the surprise language to identify its MWE construction types.

3 citations


01 Jan 2016
TL;DR: Preliminary results are described from an ongoing experiment wherein two large unstructured text corpora are classified by topic domain (or subject area); error analysis indicates that a revised classification scheme and larger gold standard corpora will likely lead to a substantial increase in accuracy.
Abstract: In this paper, we describe preliminary results from an ongoing experiment wherein we classify two large unstructured text corpora—a web corpus and a newspaper corpus—by topic domain (or subject area). Our primary goal is to develop a method that allows for the reliable annotation of large crawled web corpora with the metadata required by many corpus linguists. We are especially interested in designing an annotation scheme whose categories are both intuitively interpretable by linguists and firmly rooted in the distribution of lexical material in the documents. Since we use data from a web corpus and a more traditional corpus, we also contribute to the important field of corpus comparison and corpus evaluation. Technically, we use (unsupervised) topic modeling to automatically induce topic distributions over gold standard corpora that were manually annotated for 13 coarse-grained topic domains. In a second step, we apply supervised machine learning to learn the manually annotated topic domains, using the previously induced topics as features. We achieve around 70% accuracy in 10-fold cross-validation. An analysis of the errors clearly indicates, however, that a revised classification scheme and larger gold standard corpora will likely lead to a substantial increase in accuracy.
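The two-step pipeline described (induce topic distributions, then learn annotated domains from them) can be sketched as below. The topic-proportion features are invented and stand in for the output of a topic model, and a nearest-centroid rule stands in for the paper's supervised learner; neither is the paper's actual setup.

```python
# Invented topic-proportion feature vectors (in the paper's pipeline
# these would come from an unsupervised topic model), paired with
# manually annotated topic-domain labels.
TRAIN = [
    ([0.8, 0.1, 0.1], "science"),
    ([0.7, 0.2, 0.1], "science"),
    ([0.1, 0.8, 0.1], "sports"),
    ([0.2, 0.7, 0.1], "sports"),
]

def centroids(data):
    """Average the feature vectors of each label's documents."""
    by_label = {}
    for feats, label in data:
        by_label.setdefault(label, []).append(feats)
    return {lab: [sum(col) / len(rows) for col in zip(*rows)]
            for lab, rows in by_label.items()}

def classify(feats, cents):
    """Assign the label whose centroid is closest (squared distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(cents, key=lambda lab: dist(feats, cents[lab]))

cents = centroids(TRAIN)
print(classify([0.75, 0.15, 0.1], cents))  # science
```

In practice the learner would be evaluated with 10-fold cross-validation over the gold standard corpora, as in the abstract.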

3 citations


Proceedings Article
01 May 2016
TL;DR: A set of nine domain- and application-specific categories for out-of-vocabulary terms is developed, and a supervised classifier for these categories is proposed, drawing on features based on word embeddings and linguistic knowledge of common properties of out-of-vocabulary terms.
Abstract: In this paper we consider the problem of out-of-vocabulary term classification in web forum text from the automotive domain. We develop a set of nine domain- and application-specific categories for out-of-vocabulary terms. We then propose a supervised approach to classify out-of-vocabulary terms according to these categories, drawing on features based on word embeddings, and linguistic knowledge of common properties of out-of-vocabulary terms. We show that the features based on word embeddings are particularly informative for this task. The categories that we predict could serve as a preliminary, automatically-generated source of lexical knowledge about out-of-vocabulary terms. Furthermore, we show that this approach can be adapted to give a semi-automated method for identifying out-of-vocabulary terms of a particular category, automotive named entities, that is of particular interest to us.
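Surface features of the kind that might encode "linguistic knowledge of common properties of out-of-vocabulary terms" can be sketched as below. This feature set is hypothetical, chosen for illustration, and is not the paper's actual feature set (which also includes word-embedding features).

```python
import re

def oov_features(term):
    """Extract simple surface features of an out-of-vocabulary term.
    The features here are illustrative guesses at the kind of
    linguistic cues such a classifier might use."""
    return {
        "has_digit": any(ch.isdigit() for ch in term),      # e.g. model names
        "all_caps": term.isupper(),                         # e.g. acronyms
        "has_hyphen": "-" in term,                          # compounds
        "repeated_char": bool(re.search(r"(.)\1\1", term)), # e.g. "sooo"
        "length": len(term),
    }

print(oov_features("F150"))
```

Such binary and numeric features would be concatenated with embedding-based features and fed to a supervised classifier.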

2 citations