Showing papers by "Paul Cook" published in 2017

Open Access

Journal Article

Building and evaluating web corpora representing national varieties of English

Paul Cook¹, Laurel J. Brinton²

University of New Brunswick¹, University of British Columbia²

06 Jan 2017

TL;DR: This article builds web corpora from national top-level domains corresponding to countries in which English is widely spoken and demonstrates, through a case study on the analysis of Canadianisms, that these corpora could be valuable lexicographical resources.

Abstract: Corpora are essential resources for language studies, as well as for training statistical natural language processing systems. Although very large English corpora have been built, only relatively small corpora are available for many varieties of English. National top-level domains (e.g., .au, .ca) could be exploited to automatically build web corpora, but it is unclear whether such corpora would reflect the corresponding national varieties of English; i.e., would a web corpus built from the .ca domain correspond to Canadian English? In this article we build web corpora from national top-level domains corresponding to countries in which English is widely spoken. We then carry out statistical analyses of these corpora in terms of keywords, measures of corpus comparison based on the Chi-square test and spelling variants, and the frequencies of words known to be marked in particular varieties of English. We find evidence that the web corpora indeed reflect the corresponding national varieties of English. We then demonstrate, through a case study on the analysis of Canadianisms, that these corpora could be valuable lexicographical resources.
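The Chi-square-based keyword comparison mentioned in the abstract can be illustrated with a minimal sketch. It assumes a plain 2x2 Pearson chi-square over raw token counts; the function names `chi_square_2x2` and `rank_keywords` are hypothetical and are not taken from the article, whose exact measures may differ.

```python
from collections import Counter


def chi_square_2x2(freq_a, total_a, freq_b, total_b):
    """Pearson chi-square statistic for one word's frequency in two corpora.

    Rows are the two corpora; columns are "this word" vs. "all other tokens".
    """
    observed = [
        [freq_a, total_a - freq_a],
        [freq_b, total_b - freq_b],
    ]
    grand_total = total_a + total_b
    row_totals = [total_a, total_b]
    col_totals = [freq_a + freq_b, grand_total - freq_a - freq_b]

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected > 0:  # skip empty cells to avoid division by zero
                statistic += (observed[i][j] - expected) ** 2 / expected
    return statistic


def rank_keywords(tokens_a, tokens_b, top_n=5):
    """Rank words by how strongly their frequency differs between corpora."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)
    scores = {
        word: chi_square_2x2(counts_a[word], total_a, counts_b[word], total_b)
        for word in set(counts_a) | set(counts_b)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]
```

In this toy setup, words such as spelling variants that occur in only one of the two token lists receive high scores, while words with equal relative frequency in both score zero.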

19 citations

Proceedings Article

Deep Learning Models For Multiword Expression Identification

Waseem Gharbieh¹, Virendrakumar C. Bhavsar, Paul Cook²

LG Electronics¹, University of New Brunswick²

01 Aug 2017

TL;DR: This paper proposes the first deep learning models for token-level identification of MWEs, and considers a layered feedforward network, a recurrent neural network, and convolutional neural networks.

Abstract: Multiword expressions (MWEs) are lexical items that can be decomposed into multiple component words, but have properties that are unpredictable with respect to their component words. In this paper we propose the first deep learning models for token-level identification of MWEs. Specifically, we consider a layered feedforward network, a recurrent neural network, and convolutional neural networks. In experimental results we show that convolutional neural networks are able to outperform the previous state-of-the-art for MWE identification, with a convolutional neural network with three hidden layers giving the best performance.

17 citations

Proceedings Article

Supervised and unsupervised approaches to measuring usage similarity

Milton King, Paul Cook

01 Apr 2017

TL;DR: This paper proposes unsupervised approaches to USim based on embeddings for words, contexts, and sentences, and achieves state-of-the-art results over two USim datasets.

Abstract: Usage similarity (USim) is an approach to determining word meaning in context that does not rely on a sense inventory. Instead, pairs of usages of a target lemma are rated on a scale. In this paper we propose unsupervised approaches to USim based on embeddings for words, contexts, and sentences, and achieve state-of-the-art results over two USim datasets. We further consider supervised approaches to USim, and find that although they outperform unsupervised approaches, they are unable to generalize to lemmas that are unseen in the training data.
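As a rough illustration of an unsupervised, embedding-based USim score, the sketch below averages the word vectors of each usage context and compares the two averages by cosine similarity. This is only one simple possibility under stated assumptions: the `embeddings` dictionary and the function names are hypothetical, and the paper's own word, context, and sentence embedding models are not reproduced here.

```python
import math


def average_embedding(tokens, embeddings):
    """Mean of the vectors for tokens found in `embeddings`, or None if none are."""
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(vec[k] for vec in vectors) / len(vectors) for k in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def usage_similarity(context_a, context_b, embeddings):
    """Score two usages of a target lemma by the cosine of their averaged contexts."""
    vec_a = average_embedding(context_a, embeddings)
    vec_b = average_embedding(context_b, embeddings)
    if vec_a is None or vec_b is None:
        return 0.0  # assumption: treat contexts with no known words as dissimilar
    return cosine(vec_a, vec_b)
```

Because nothing here is trained on rated usage pairs, the same function applies to any lemma, which is the property the abstract contrasts with supervised approaches.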

4 citations