
Showing papers by "Kevin Duh" published in 2014


Proceedings ArticleDOI
08 Sep 2014
TL;DR: This paper presents an exploratory analysis of methods for studying and visualizing changes in word meaning over time, and proposes a framework for exploring semantic change at the lexical level, at the contrastive-pair level, and at the sentiment orientation level.
Abstract: Recently, large amounts of historical texts have been digitized and made accessible to the public. Thanks to this, it has become possible for the first time to analyze the evolution of language using automatic approaches. In this paper, we show the results of an exploratory analysis aiming to investigate methods for studying and visualizing changes in word meaning over time. In particular, we propose a framework for exploring semantic change at the lexical level, at the contrastive-pair level, and at the sentiment orientation level. We demonstrate several kinds of NLP approaches that together give users a deeper understanding of word evolution. We use two diachronic corpora that are currently the largest available historical language corpora. Our results indicate that the task is feasible and that satisfactory outcomes can already be achieved using simple approaches.

118 citations


Proceedings ArticleDOI
01 Jun 2014
TL;DR: It is shown how a basic T2S system that performs on par with phrase-based systems can be improved by 2.6-4.6 BLEU, greatly exceeding existing state-of-the-art methods.
Abstract: While tree-to-string (T2S) translation theoretically holds promise for efficient, accurate translation, in previous reports T2S systems have often proven inferior to other machine translation (MT) methods such as phrase-based or hierarchical phrase-based MT. In this paper, we attempt to clarify the reason for this performance gap by investigating a number of peripheral elements that affect the accuracy of T2S systems, including parsing, alignment, and search. Based on detailed experiments on the English-Japanese and Japanese-English pairs, we show how a basic T2S system that performs on par with phrase-based systems can be improved by 2.6-4.6 BLEU, greatly exceeding existing state-of-the-art methods. These results indicate that T2S systems indeed hold much promise, but the above-mentioned elements must be taken seriously in construction of these systems.

37 citations


Posted Content
TL;DR: The authors investigate the hypothesis that word representations should incorporate both distributional and relational semantics, and employ the Alternating Direction Method of Multipliers (ADMM) to flexibly optimise a distributional objective on raw text and a relational objective on WordNet.
Abstract: We investigate the hypothesis that word representations ought to incorporate both distributional and relational semantics. To this end, we employ the Alternating Direction Method of Multipliers (ADMM), which flexibly optimizes a distributional objective on raw text and a relational objective on WordNet. Preliminary results on knowledge base completion, analogy tests, and parsing show that word representations trained on both objectives can give improvements in some cases.

34 citations
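
The ADMM idea in the entry above can be illustrated on a toy problem. This is a minimal sketch, not the authors' objectives: the scalars `a` and `b` are hypothetical stand-ins for the optima of a distributional and a relational objective, each objective is assumed to be a simple quadratic, and `admm_consensus` and `rho` are illustrative names. ADMM keeps one copy of the parameter per objective and uses a dual variable to pull the two copies into agreement.

```python
def admm_consensus(a, b, rho=1.0, iters=100):
    """Scaled-form ADMM for: min 0.5*(w-a)^2 + 0.5*(v-b)^2  subject to  w == v.

    w plays the role of the 'distributional' copy, v the 'relational' copy
    (both assumptions for illustration), and u is the scaled dual variable.
    """
    w = v = u = 0.0
    for _ in range(iters):
        # Minimise the first objective plus the penalty (rho/2)*(w - v + u)^2.
        w = (a + rho * (v - u)) / (1.0 + rho)
        # Minimise the second objective plus the same penalty, now over v.
        v = (b + rho * (w + u)) / (1.0 + rho)
        # Dual update: accumulate the remaining disagreement between copies.
        u += w - v
    return w, v


if __name__ == "__main__":
    print(admm_consensus(1.0, 3.0))
```

With equal weights on both quadratics, the two copies should settle at the midpoint of `a` and `b`; real word-embedding objectives would replace the closed-form updates with gradient-based solvers.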


Proceedings ArticleDOI
01 Apr 2014
TL;DR: This paper develops two types of supertags that encode information about head position and dependency relations in different levels of granularity and proposes a transition-based dependency parser that incorporates the predictions from a CRF-based supertagger as new features.
Abstract: Transition-based dependency parsing systems can utilize rich feature representations. However, in practice, features are generally limited to combinations of lexical tokens and part-of-speech tags. In this paper, we investigate richer features based on supertags, which represent lexical templates extracted from a dependency-structure-annotated corpus. First, we develop two types of supertags that encode information about head position and dependency relations at different levels of granularity. Then, we propose a transition-based dependency parser that incorporates the predictions from a CRF-based supertagger as new features. On the standard English Penn Treebank corpus, we show that our supertag features achieve parsing improvements of 1.3% in unlabeled attachment, 2.07% in root attachment, and 3.94% in complete tree accuracy.

22 citations


Proceedings Article
01 May 2014
TL;DR: The usefulness of incorporating large unlabelled corpora and a dictionary for this task is demonstrated, and two synthetic word parsers significantly outperform the baseline (a pipeline method).
Abstract: Synthetic word analysis is a potentially important but relatively unexplored problem in Chinese natural language processing. Two issues with the conventional pipeline methods involving word segmentation are (1) the lack of a common segmentation standard and (2) the poor segmentation performance on OOV words. These issues may be circumvented if we adopt the view of character-based parsing, providing both internal structures to synthetic words and global structure to sentences in a seamless fashion. However, the accuracy of synthetic word parsing is not yet satisfactory, due to the lack of research. In view of this, we propose and present experiments on several synthetic word parsers. Additionally, we demonstrate the usefulness of incorporating large unlabelled corpora and a dictionary for this task. Our parsers significantly outperform the baseline (a pipeline method).

8 citations


Proceedings Article
14 Dec 2014
TL;DR: The authors investigate the hypothesis that word representations should incorporate both distributional and relational semantics, and employ the Alternating Direction Method of Multipliers (ADMM) to flexibly optimise a distributional objective on raw text and a relational objective on WordNet.
Abstract: We investigate the hypothesis that word representations ought to incorporate both distributional and relational semantics. To this end, we employ the Alternating Direction Method of Multipliers (ADMM), which flexibly optimizes a distributional objective on raw text and a relational objective on WordNet. Preliminary results on knowledge base completion, analogy tests, and parsing show that word representations trained on both objectives can give improvements in some cases.

8 citations




Proceedings ArticleDOI
01 Apr 2014
TL;DR: A simple and effective cross-lingual approach to identifying collocations is introduced, based on the observation that true collocations, which cannot be translated word for word, will exhibit very different association scores before and after literal translation.
Abstract: We introduce a simple and effective cross-lingual approach to identifying collocations. This approach is based on the observation that true collocations, which cannot be translated word for word, will exhibit very different association scores before and after literal translation. Our experiments in Japanese demonstrate that our cross-lingual association measure can successfully exploit the combination of bilingual dictionary and large monolingual corpora, outperforming monolingual association measures.

4 citations
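
The observation in the entry above (association scores differ sharply before and after literal translation) can be sketched with toy counts. The abstract does not say which association measure is used; pointwise mutual information (PMI) here is an assumption, and the function names and all counts are hypothetical.

```python
import math


def pmi(pair_count, count_x, count_y, total):
    """Pointwise mutual information from raw counts (all counts assumed > 0)."""
    return math.log2((pair_count * total) / (count_x * count_y))


def crosslingual_gap(src_counts, tgt_counts):
    """PMI of a word pair in the source corpus minus the PMI of its
    word-for-word dictionary translation in the target corpus.
    A large gap suggests the pair does not translate literally."""
    return pmi(*src_counts) - pmi(*tgt_counts)


if __name__ == "__main__":
    # Hypothetical counts: (pair_count, count_x, count_y, corpus_size).
    collocation = crosslingual_gap((50, 100, 200, 100_000), (2, 400, 500, 100_000))
    compositional = crosslingual_gap((40, 100, 200, 100_000), (40, 100, 200, 100_000))
    print(collocation, compositional)
```

In this toy setting the pair whose literal translation co-occurs only at chance level gets a large gap, while the pair that is equally associated in both languages gets a gap of zero.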


Journal ArticleDOI
TL;DR: An in-depth analysis of a large corpus of curated microblog data is performed and a novel method based on a learning-to-rank framework is proposed that increases the curator’s productivity and breadth of perspective by suggesting which novel microblogs should be added to the curated content.
Abstract: Social media such as microblogs have become so pervasive that it is now possible to use them as sensors for real-world events and memes. While much recent research has focused on developing automatic methods for filtering and summarizing these data streams, we explore a different trend called social curation. In contrast to automatic methods, social curation is characterized as a human-in-the-loop and sometimes crowd-sourced mechanism for exploiting social media as sensors. Although social curation web services like Togetter, Naver Matome and Storify are gaining popularity, little academic research has studied the phenomenon. In this paper, our goal is to investigate the phenomenon and potential of this new field of social curation. First, we perform an in-depth analysis of a large corpus of curated microblog data. We seek to understand why and how people participate in this laborious curation process. We then explore new ways in which information retrieval and machine learning technologies can be used to assist curators. In particular, we propose a novel method based on a learning-to-rank framework that increases the curator’s productivity and breadth of perspective by suggesting which novel microblogs should be added to the curated content.

2 citations


01 Jan 2014
TL;DR: This paper presents NTT-NAIST SMT systems for English-German and German-English MT tasks of the IWSLT 2014 evaluation campaign, based on generalized minimum Bayes risk system combination of three SMT systems using the forest-to-string, syntactic preordering, and phrase-based translation formalisms.
Abstract: This paper presents NTT-NAIST SMT systems for English-German and German-English MT tasks of the IWSLT 2014 evaluation campaign. The systems are based on generalized minimum Bayes risk system combination of three SMT systems using the forest-to-string, syntactic preordering, and phrase-based translation formalisms. Individual systems employ training data selection for domain adaptation, truecasing, compound word splitting (for German-English), interpolated n-gram language models, and hypotheses rescoring using recurrent neural network language models.
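
As a rough illustration of the minimum Bayes risk idea behind the system combination mentioned above, the sketch below performs plain MBR selection over a pooled list of hypotheses. It is not the generalized variant used in the paper: the Jaccard token overlap is an assumed stand-in for a real gain function such as sentence-level BLEU, and the hypotheses and probabilities are made up.

```python
def overlap(hyp, ref):
    """Jaccard overlap of token sets; a toy stand-in for a gain such as BLEU."""
    a, b = set(hyp.split()), set(ref.split())
    return len(a & b) / len(a | b) if (a | b) else 1.0


def mbr_select(hyps):
    """hyps: list of (sentence, probability) pairs pooled from several systems.

    Returns the sentence with the highest expected gain, where the
    expectation treats every pooled hypothesis as a possible reference.
    """
    def expected_gain(candidate):
        return sum(p * overlap(candidate, ref) for ref, p in hyps)

    return max(hyps, key=lambda item: expected_gain(item[0]))[0]


if __name__ == "__main__":
    pooled = [("the cat sat", 0.4), ("the cat sits", 0.3), ("the cat sat down", 0.3)]
    print(mbr_select(pooled))
```

The selected output is the hypothesis most similar, on average, to everything else in the pool, which is why MBR tends to favor translations that several systems roughly agree on.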

01 Jan 2014
TL;DR: It is shown how a basic T2S system that performs on par with phrase-based systems can be improved by 2.6-4.6 BLEU, greatly exceeding existing state-of-the-art methods.
Abstract: While tree-to-string (T2S) translation theoretically holds promise for efficient, accurate translation, in previous reports T2S systems have often proven inferior to other machine translation (MT) methods such as phrase-based or hierarchical phrase-based MT. In this paper, we attempt to clarify the reason for this performance gap by investigating a number of peripheral elements that affect the accuracy of T2S systems, including parsing, alignment, and search. Based on detailed experiments on the English-Japanese and Japanese-English pairs, we show how a basic T2S system that performs on par with phrase-based systems can be improved by 2.6-4.6 BLEU, greatly exceeding existing state-of-the-art methods. These results indicate that T2S systems indeed hold much promise, but the above-mentioned elements must be taken seriously in construction of these systems.

Proceedings ArticleDOI
01 Apr 2014
TL;DR: A simple and effective method to improve automatic word alignment by pre-removing unalignable words is proposed, and improvements on hierarchical MT systems in both translation directions are shown.
Abstract: Professional human translators usually do not employ the concept of word alignments, producing translations ‘sense-for-sense’ instead of ‘word-for-word’. This suggests that unalignable words may be prevalent in the parallel text used for machine translation (MT). We analyze this phenomenon in depth for Chinese-English translation. We further propose a simple and effective method to improve automatic word alignment by pre-removing unalignable words, and show improvements on hierarchical MT systems in both translation directions.

1 Motivation
It is generally acknowledged that absolute equivalence between two languages is impossible, since concept lexicalization varies across languages. Major translation theories thus argue that texts should be translated ‘sense-for-sense’ instead of ‘word-for-word’ (Nida, 1964). This suggests that unalignable words may be an issue for the parallel text used to train current statistical machine translation (SMT) systems. Although existing automatic word alignment methods have some mechanism to handle the lack of exact word-for-word alignment (e.g. null probabilities, fertility in the IBM models (Brown et al., 1993)), they may be too coarse-grained to model the ‘sense-for-sense’ translations created by professional human translators.