Showing papers by "Paul Cook" published in 2017


Journal ArticleDOI
06 Jan 2017
TL;DR: This article builds web corpora from national top-level domains corresponding to countries in which English is widely spoken and demonstrates, through a case study on the analysis of Canadianisms, that these corpora could be valuable lexicographical resources.
Abstract: Corpora are essential resources for language studies, as well as for training statistical natural language processing systems. Although very large English corpora have been built, only relatively small corpora are available for many varieties of English. National top-level domains (e.g., .au, .ca) could be exploited to automatically build web corpora, but it is unclear whether such corpora would reflect the corresponding national varieties of English; i.e., would a web corpus built from the .ca domain correspond to Canadian English? In this article we build web corpora from national top-level domains corresponding to countries in which English is widely spoken. We then carry out statistical analyses of these corpora in terms of keywords, measures of corpus comparison based on the Chi-square test and spelling variants, and the frequencies of words known to be marked in particular varieties of English. We find evidence that the web corpora indeed reflect the corresponding national varieties of English. We then demonstrate, through a case study on the analysis of Canadianisms, that these corpora could be valuable lexicographical resources.

19 citations
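
The abstract above mentions measures of corpus comparison based on the chi-square test. A minimal sketch of one such measure follows, assuming a simple formulation in which the statistic is summed over the most frequent words of the two corpora combined. The function name `chi_square_distance`, the `top_n` parameter, and the toy token lists are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def chi_square_distance(tokens_a, tokens_b, top_n=500):
    """Chi-square statistic over the top_n most frequent words of two corpora.

    Assumption: both token lists are non-empty. A larger value suggests the
    two corpora differ more in their word frequency distributions.
    """
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)
    combined = freq_a + freq_b

    chi2 = 0.0
    for word, joint_count in combined.most_common(top_n):
        # Expected counts if both corpora were drawn from the same distribution.
        expected_a = joint_count * total_a / (total_a + total_b)
        expected_b = joint_count * total_b / (total_a + total_b)
        chi2 += (freq_a[word] - expected_a) ** 2 / expected_a
        chi2 += (freq_b[word] - expected_b) ** 2 / expected_b
    return chi2


# Hypothetical toy "corpora" that differ only in spelling variants.
ca_tokens = "the colour of the centre".split()
us_tokens = "the color of the center".split()

same = chi_square_distance(ca_tokens, ca_tokens)
different = chi_square_distance(ca_tokens, us_tokens)
print(same, different)
```

In this toy case the shared function words contribute nothing to the statistic, and the spelling variants account for all of the difference. Real web corpora would need tokenization and size normalization choices that this sketch leaves out.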


Proceedings ArticleDOI
01 Aug 2017
TL;DR: This paper proposes the first deep learning models for token-level identification of MWEs, and considers a layered feedforward network, a recurrent neural network, and convolutional neural networks.
Abstract: Multiword expressions (MWEs) are lexical items that can be decomposed into multiple component words, but have properties that are unpredictable with respect to their component words. In this paper we propose the first deep learning models for token-level identification of MWEs. Specifically, we consider a layered feedforward network, a recurrent neural network, and convolutional neural networks. In experimental results we show that convolutional neural networks are able to outperform the previous state-of-the-art for MWE identification, with a convolutional neural network with three hidden layers giving the best performance.

17 citations


Proceedings ArticleDOI
01 Apr 2017
TL;DR: This paper proposes unsupervised approaches to USim based on embeddings for words, contexts, and sentences, and achieves state-of-the-art results over two USim datasets.
Abstract: Usage similarity (USim) is an approach to determining word meaning in context that does not rely on a sense inventory. Instead, pairs of usages of a target lemma are rated on a scale. In this paper we propose unsupervised approaches to USim based on embeddings for words, contexts, and sentences, and achieve state-of-the-art results over two USim datasets. We further consider supervised approaches to USim, and find that although they outperform unsupervised approaches, they are unable to generalize to lemmas that are unseen in the training data.

4 citations
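
The unsupervised, embedding-based approach to USim described in the abstract above can be illustrated with a small sketch. It assumes that each usage is represented by the average of the word embeddings in its context and that two usages are compared by cosine similarity. The listing does not say how the paper combines embeddings, so the averaging step, the function names, and the two-dimensional `TOY_EMBEDDINGS` table are all hypothetical.

```python
import math

# Hypothetical 2-d embeddings, invented for illustration only.
TOY_EMBEDDINGS = {
    "bank": [1.0, 1.0],
    "river": [0.0, 1.0],
    "water": [0.0, 1.0],
    "money": [1.0, 0.0],
    "loan": [1.0, 0.0],
}


def average_embedding(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token in the context has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def usage_similarity(context_a, context_b, embeddings):
    """Score how similar two usages of a target lemma are, with no sense inventory."""
    return cosine(
        average_embedding(context_a, embeddings),
        average_embedding(context_b, embeddings),
    )


# Two usages of "bank" near water versus one near money.
close_pair = usage_similarity(["river", "bank"], ["water", "bank"], TOY_EMBEDDINGS)
distant_pair = usage_similarity(["river", "bank"], ["money", "bank"], TOY_EMBEDDINGS)
print(close_pair, distant_pair)
```

Because no labelled data is involved, a score like this can be computed for any lemma with embeddings available, which fits the abstract's observation that supervised models fail to generalize to lemmas unseen in training.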