
Showing papers by "Kevin Duh" published in 2013


Proceedings Article
01 Aug 2013
TL;DR: It is found that neural language models are indeed viable tools for data selection: while the improvements are varied, they are fast to train on small in-domain data and can sometimes substantially outperform conventional n-grams.
Abstract: Data selection is an effective approach to domain adaptation in statistical machine translation. The idea is to use language models trained on small in-domain text to select similar sentences from large general-domain corpora, which are then incorporated into the training data. Substantial gains have been demonstrated in previous work employing standard n-gram language models. Here, we explore the use of neural language models for data selection. We hypothesize that the continuous vector representation of words in neural language models makes them more effective than n-grams for modeling unknown word contexts, which are prevalent in general-domain text. In a comprehensive evaluation of 4 language pairs (English to German, French, Russian, Spanish), we found that neural language models are indeed viable tools for data selection: while the improvements are varied (i.e. 0.1 to 1.7 gains in BLEU), they are fast to train on small in-domain data and can sometimes substantially outperform conventional n-grams.
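A common instantiation of this selection step is Moore-Lewis cross-entropy difference: each general-domain sentence is ranked by how much better the in-domain language model scores it than a general-domain contrast model, and only the top-ranked portion is kept. The sketch below shows that ranking step with an abstract per-word cross-entropy function; the scoring interface, the selection fraction, and the use of a contrast model are assumptions for illustration, and the paper's contribution is to compute such scores with neural language models rather than n-grams.

```python
# Illustrative sketch of cross-entropy-difference data selection
# (Moore-Lewis style). The language-model interface is hypothetical:
# any model exposing average per-word cross-entropy will do.

from typing import Callable, Iterable, List, Tuple

# xent(sentence) -> average per-word cross-entropy under some language model.
ScoreFn = Callable[[List[str]], float]

def rank_by_xent_difference(pool: Iterable[List[str]],
                            in_domain_xent: ScoreFn,
                            general_xent: ScoreFn) -> List[Tuple[float, List[str]]]:
    """Rank general-domain sentences; lower score = more in-domain-like."""
    scored = [(in_domain_xent(s) - general_xent(s), s) for s in pool]
    scored.sort(key=lambda x: x[0])
    return scored

def select_top(pool, in_domain_xent, general_xent, fraction=0.1):
    """Keep the best-scoring fraction of the pool as extra training data."""
    ranked = rank_by_xent_difference(pool, in_domain_xent, general_xent)
    cutoff = max(1, int(len(ranked) * fraction))
    return [sent for _, sent in ranked[:cutoff]]
```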

129 citations


01 Oct 2013
TL;DR: The experimental results show that the k-best language model and the statistical machine translation model generate almost all the correction candidates but with very low precision; reranking the candidates with the SVM classifier yields higher precision at the cost of a small drop in recall.
Abstract: We describe the Nara Institute of Science and Technology (NAIST) spelling check system in the shared task. Our system contains three components: a word-segmentation-based language model to generate correction candidates; a statistical machine translation model to provide correction candidates; and a Support Vector Machine (SVM) classifier to rerank the candidates provided by the previous two components. The experimental results show that the k-best language model and the statistical machine translation model generate almost all the correction candidates, but with very low precision. However, by using the SVM classifier to rerank the candidates, we obtain higher precision with only a small drop in recall. To address the low-resource problem of Chinese spelling check, we generate 2 million artificial training sentences by simply replacing characters in the provided training sentences with characters from the confusion set.
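The artificial-data step described at the end of the abstract is simple enough to sketch: corrupt a correct sentence by swapping one of its characters for a member of its confusion set, and pair the corrupted sentence with the original. The confusion-set format (a dict from character to replacements) and the sampling scheme below are assumptions for illustration, not the NAIST implementation.

```python
# Sketch of confusion-set-based error generation for Chinese spelling check.
# `confusion` maps a character to characters it is commonly confused with;
# the dict format is assumed here (the shared task distributes such a set).

import random
from typing import Dict, List, Optional, Tuple

def corrupt(sentence: str,
            confusion: Dict[str, List[str]],
            rng: random.Random) -> Optional[Tuple[str, int]]:
    """Replace one confusable character; return (corrupted sentence, position)."""
    positions = [i for i, ch in enumerate(sentence) if confusion.get(ch)]
    if not positions:
        return None  # nothing in this sentence can be corrupted
    i = rng.choice(positions)
    wrong = rng.choice(confusion[sentence[i]])
    return sentence[:i] + wrong + sentence[i + 1:], i

def make_training_pairs(sentences: List[str],
                        confusion: Dict[str, List[str]],
                        n_pairs: int,
                        seed: int = 0) -> List[Tuple[str, str]]:
    """Generate (corrupted, correct) pairs until n_pairs have been collected."""
    rng = random.Random(seed)
    pairs: List[Tuple[str, str]] = []
    while len(pairs) < n_pairs:
        sent = rng.choice(sentences)
        result = corrupt(sent, confusion, rng)
        if result is not None:
            pairs.append((result[0], sent))
    return pairs
```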

38 citations


Proceedings Article
01 Aug 2013
TL;DR: A flexible and effective framework for extracting a bilingual dictionary from comparable corpora, based on a novel combination of topic modeling and word alignment techniques, which is shown to reliably extract high-precision translation pairs under a wide variety of comparable data conditions.
Abstract: We propose a flexible and effective framework for extracting a bilingual dictionary from comparable corpora. Our approach is based on a novel combination of topic modeling and word alignment techniques. Intuitively, our approach works by converting a comparable document-aligned corpus into a parallel topic-aligned corpus, then learning word alignments using co-occurrence statistics. This topic-aligned corpus is similar in structure to the sentence-aligned corpus frequently used in statistical machine translation, enabling us to exploit advances in word alignment research. Unlike much previous work, our framework does not require any language-specific knowledge for initialization. Furthermore, our framework attempts to handle polysemy by allowing multiple translation probability models for each word. On a large-scale Wikipedia corpus, we demonstrate that our framework reliably extracts high-precision translation pairs on a wide variety of comparable data conditions.
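The core idea can be made concrete: run a topic model over the document-aligned corpus, collect each language's words per topic into pseudo-parallel "topic documents", and estimate translation probabilities from cross-language co-occurrence within topics. The sketch below uses raw topic-level co-occurrence counts and an assumed input format; the paper instead learns alignments with standard word-alignment machinery over the topic-aligned corpus, so treat this purely as an illustration of the intuition.

```python
# Sketch: from per-topic word distributions in two languages to a
# co-occurrence-based translation table. Input format is assumed:
# topics_src[k] / topics_tgt[k] are lists of (word, count) for topic k.

from collections import defaultdict
from typing import Dict, List, Tuple

def translation_table(topics_src: List[List[Tuple[str, int]]],
                      topics_tgt: List[List[Tuple[str, int]]],
                      ) -> Dict[str, Dict[str, float]]:
    """Estimate p(target | source) from topic-level co-occurrence."""
    cooc = defaultdict(lambda: defaultdict(float))
    for src_topic, tgt_topic in zip(topics_src, topics_tgt):
        tgt_total = sum(c for _, c in tgt_topic) or 1
        for s_word, s_count in src_topic:
            for t_word, t_count in tgt_topic:
                # weight by how prominent both words are in this topic
                cooc[s_word][t_word] += s_count * (t_count / tgt_total)
    table = {}
    for s_word, counts in cooc.items():
        total = sum(counts.values())
        table[s_word] = {t: c / total for t, c in counts.items()}
    return table

def top_translations(table, s_word, k=5):
    """Return the k most probable translation candidates for s_word."""
    return sorted(table.get(s_word, {}).items(), key=lambda x: -x[1])[:k]
```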

25 citations


Proceedings Article
15 Mar 2013
TL;DR: This paper describes a multilingual study on how much information is contained in a single post of microblog text from Twitter in 26 different languages, using entropy as the criterion for quantifying “how much is said” in a tweet.
Abstract: This paper describes a multilingual study on how much information is contained in a single post of microblog text from Twitter in 26 different languages. In order to answer this question in a quantitative fashion, we take an information-theoretic approach, using entropy as our criterion for quantifying “how much is said” in a tweet. Our results find that, as expected, languages with larger character sets such as Chinese and Japanese contain more information per character than other languages. However, we also find that, somewhat surprisingly, information per character does not have a strong correlation with information per microblog post, as authors of microblog posts in languages with more information per character do not necessarily use all of the space allotted to them. Finally, we examine the relative importance of a number of factors that contribute to whether a language has more or less information content in each character or post, and also compare the information content of microblog text with more traditional text from Wikipedia.
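The basic quantity behind such a study is easy to make concrete: estimate the per-character entropy of a language from a text sample and combine it with typical post length to get bits per post. The unigram estimate below is only a first-order illustration; the paper's entropy estimates and Twitter data handling are more involved, so numbers from this sketch should not be read as reproducing its results.

```python
# First-order (unigram) estimate of information per character and per post.
# A serious estimate would use a higher-order or neural character model;
# this only shows the shape of the calculation.

import math
from collections import Counter
from typing import Iterable

def unigram_entropy_bits_per_char(text: str) -> float:
    """H = -sum_c p(c) log2 p(c) over characters of the sample."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def bits_per_post(posts: Iterable[str]) -> float:
    """Entropy rate times average post length, in bits."""
    posts = list(posts)
    sample = "".join(posts)
    avg_len = sum(len(p) for p in posts) / len(posts)
    return unigram_entropy_bits_per_char(sample) * avg_len

# Toy comparison: a character-dense sample vs. an alphabetic sample.
if __name__ == "__main__":
    dense = ["信息密度高的帖子", "每个字符包含更多信息"]
    sparse = ["a short post in english", "another short english post"]
    print(bits_per_post(dense), bits_per_post(sparse))
```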

23 citations


Proceedings Article
01 Oct 2013
TL;DR: A compositional model where both predicate and argument are allowed to modify each other's meaning representations while generating the overall semantics, which readily addresses some major challenges with current vector space models.
Abstract: We present a novel vector space model for semantic co-compositionality. Inspired by Generative Lexicon Theory (Pustejovsky, 1995), our goal is a compositional model where both predicate and argument are allowed to modify each other's meaning representations while generating the overall semantics. This readily addresses some major challenges with current vector space models, notably the polysemy issue and the use of one representation per word type. We implement co-compositionality using prototype projections on predicates/arguments and show that this is effective in adapting their word representations. We further cast the model as a neural network and propose an unsupervised algorithm to jointly train word representations with co-compositionality. The model achieves the best result to date (ρ = 0.47) on the semantic similarity task of transitive verbs (Grefenstette and Sadrzadeh, 2011).
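As a rough picture of what "adapting a representation via a prototype" can look like, the sketch below shifts a predicate vector toward a prototype built from the nearest neighbours of its argument (and vice versa) before composing. This is an illustrative reconstruction under assumed details (cosine neighbours, a mixing weight alpha, additive composition); the paper's actual projection operation and its neural-network training are not reproduced here.

```python
# Illustrative sketch of prototype-based adaptation of word vectors
# before composition. Details (k, alpha, additive composition) are assumptions.

import numpy as np

def nearest_neighbors(vec: np.ndarray, vocab: dict, k: int = 10) -> list:
    """k most cosine-similar words to `vec` in `vocab` (word -> vector)."""
    words = list(vocab)
    mat = np.stack([vocab[w] for w in words])
    sims = mat @ vec / (np.linalg.norm(mat, axis=1) * np.linalg.norm(vec) + 1e-9)
    return [words[i] for i in np.argsort(-sims)[:k]]

def prototype(vec: np.ndarray, vocab: dict, k: int = 10) -> np.ndarray:
    """Average of the k nearest-neighbour vectors: a soft 'sense prototype'."""
    return np.mean([vocab[w] for w in nearest_neighbors(vec, vocab, k)], axis=0)

def co_compose(pred: str, arg: str, vocab: dict, alpha: float = 0.7) -> np.ndarray:
    """Adapt predicate toward the argument's prototype and vice versa, then add."""
    p, a = vocab[pred], vocab[arg]
    p_adapted = alpha * p + (1 - alpha) * prototype(a, vocab)
    a_adapted = alpha * a + (1 - alpha) * prototype(p, vocab)
    return p_adapted + a_adapted
```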

22 citations


Journal ArticleDOI
TL;DR: This work proposes a framework for assisting human editors in managing information disparity across multilingual document collections, using tools from machine translation and machine learning, and concludes that the system is effective in a variety of scenarios.
Abstract: Information disparity is a major challenge with multilingual document collections. When documents are dynamically updated in a distributed fashion, information content among different language editions may gradually diverge. We propose a framework for assisting human editors to manage this information disparity, using tools from machine translation and machine learning. Given source and target documents in two different languages, our system automatically identifies information nuggets that are new with respect to the target and suggests positions to place their translations. We perform both real-world experiments and large-scale simulations on Wikipedia documents and conclude our system is effective in a variety of scenarios.
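In outline, the nugget-detection step can be thought of as cross-lingual novelty detection: translate (or project) each source sentence, compare it against every target sentence, and flag the ones with no sufficiently similar counterpart, then suggest a position near the best-matching target context. The sketch below uses a placeholder `translate` function and plain token-overlap similarity; both are assumptions standing in for the MT and machine-learning components the paper actually uses.

```python
# Sketch of "new information nugget" detection between a source-language
# document and its target-language counterpart. `translate` is a stand-in
# for a real MT system; similarity here is simple token overlap (Jaccard).

from typing import Callable, List, Tuple

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / (len(ta | tb) or 1)

def find_new_nuggets(src_sents: List[str],
                     tgt_sents: List[str],
                     translate: Callable[[str], str],
                     threshold: float = 0.3) -> List[Tuple[str, int]]:
    """Return (translated nugget, suggested target position) pairs."""
    suggestions = []
    for sent in src_sents:
        hyp = translate(sent)
        sims = [jaccard(hyp, t) for t in tgt_sents]
        if max(sims, default=0.0) < threshold:
            # No close counterpart: treat as new content. Suggest inserting
            # after the most similar target sentence (a crude heuristic).
            pos = max(range(len(sims)), key=lambda i: sims[i]) + 1 if sims else 0
            suggestions.append((hyp, pos))
    return suggestions
```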

19 citations


Journal ArticleDOI
TL;DR: A novel reordering method for efficient two-step Japanese-to-English statistical machine translation (SMT) that separates reordering from lexical translation and solves it afterwards, reducing the decoding time of accurate but slow syntax-based SMT by approximating it through an intermediate Head-Final English (HFE) representation.
Abstract: This article proposes a novel reordering method for efficient two-step Japanese-to-English statistical machine translation (SMT) that isolates reordering from SMT and solves it after lexical translation. This reordering problem, called post-ordering, is solved as an SMT problem from Head-Final English (HFE) to English. HFE is English reordered by syntax-based rules, an approach that has been used very successfully for reordering in English-to-Japanese SMT. The proposed method carries this advantage over to the reverse direction, Japanese-to-English, and solves the post-ordering problem with accurate syntax-based SMT using target-language syntax. Two-step SMT with the proposed post-ordering empirically reduces the decoding time of the accurate but slow syntax-based SMT, which it approximates well via the intermediate HFE. In Japanese-to-English patent translation experiments, the proposed method speeds up syntax-based SMT decoding by about six times while maintaining comparable translation accuracy.
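The intermediate representation is easiest to grasp through the transformation that defines it: Head-Final English is English re-linearized so that each head follows its dependents, mimicking Japanese word order. The sketch below performs that head-final linearization on a toy dependency tree; the actual HFE definition includes additional handcrafted rules (e.g. inserting pseudo-particles), so this is only the core idea, not the paper's full procedure.

```python
# Sketch: head-final linearization of a dependency tree, the core idea
# behind Head-Final English (HFE). Each head is emitted after all of its
# dependents, which approximates Japanese (head-final) word order.

from typing import Dict, List

def head_final_order(heads: Dict[int, int], words: List[str]) -> List[str]:
    """heads[i] = index of word i's head (root has head -1); 0-indexed."""
    children: Dict[int, List[int]] = {i: [] for i in range(len(words))}
    root = 0
    for i, h in heads.items():
        if h == -1:
            root = i
        else:
            children[h].append(i)

    def emit(node: int) -> List[str]:
        out: List[str] = []
        for child in sorted(children[node]):  # keep dependents in surface order
            out.extend(emit(child))
        out.append(words[node])               # head comes last
        return out

    return emit(root)

# "John hit the ball" -> dependents first, heads last:
# ['John', 'the', 'ball', 'hit']
print(head_final_order({0: 1, 1: -1, 2: 3, 3: 1},
                       ["John", "hit", "the", "ball"]))
```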

11 citations


01 Jan 2013
TL;DR: This work systematically examines two different kinds of word representations, namely Brown clusters and word embeddings induced from a neural language model, on the task of dependency parsing of web text.
Abstract: Parsing web text is becoming progressively more important for many applications in natural language processing, such as machine translation, information retrieval, and sentiment analysis. Current syntactic parsing research has focused on canonical data such as newswire. When evaluated on standard benchmarks such as the Wall Street Journal data set, current state-of-the-art parsers achieve accuracies well above 90%; however, their accuracy drops dramatically, to barely over 80%, when they are applied to new domains such as web data. In order to make progress in the many applications that rely on parsing, we need robust parsers that can handle such texts. One approach that has recently become popular is to use unsupervised word representations as extra features. Koo et al. [1] have shown that unsupervised clustering features are effective for improving dependency parsing. Turian et al. [2] examined clustering and unsupervised word embedding features on chunking and named entity recognition tasks. Unsupervised word embeddings are dense, low-dimensional, real-valued vectors representing words, often induced by neural language models. Both studies show that these word representation features lead to performance improvements. Because these word representations are induced by unsupervised methods, they are well suited to new domains such as the web, which has an enormous amount of unlabeled data but little labeled data. In this paper we investigate the effect of unsupervised word representation features on dependency parsing of web text. We consider two different kinds of word representations, namely Brown clustering and word embeddings induced from a neural language model. To the best of our knowledge, this is the first work that systematically examines these word representations on the task of dependency parsing of web text.
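In practice, "using word representations as extra features" amounts to augmenting the parser's feature templates for each token. The sketch below shows one common recipe: Brown-cluster bit-string prefixes as discrete features and binned embedding dimensions as discretized features. The template names, prefix lengths, and binning scheme are typical choices, not necessarily those used in this work.

```python
# Sketch: augmenting per-token parser features with word representations.
# `brown` maps word -> bit-string cluster path (e.g. "01101110"),
# `embeddings` maps word -> list of floats. Template names are illustrative.

from typing import Dict, List

def representation_features(word: str,
                            brown: Dict[str, str],
                            embeddings: Dict[str, List[float]],
                            prefix_lengths=(4, 6, 10),
                            n_bins: int = 5) -> List[str]:
    feats = []
    # Brown clusters: use bit-string prefixes at several granularities.
    path = brown.get(word)
    if path is not None:
        for p in prefix_lengths:
            feats.append(f"brown{p}={path[:p]}")
    # Embeddings: discretize each dimension into a few bins so a linear
    # model (as in most parsers of that era) can use them as indicators.
    vec = embeddings.get(word)
    if vec is not None:
        for d, val in enumerate(vec):
            # assumes embedding values roughly in [-1, 1]; clamp otherwise
            bin_id = max(0, min(n_bins - 1, int((val + 1.0) / 2.0 * n_bins)))
            feats.append(f"emb{d}=bin{bin_id}")
    return feats
```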

10 citations


Proceedings Article
01 Aug 2013
TL;DR: This model assumes that the alignment variables have a tree structure which is isomorphic to the target dependency tree and models the distortion probability based on the source dependency tree, thereby incorporating the syntactic structure from both sides of the parallel sentences.
Abstract: We propose a novel unsupervised word alignment model based on the Hidden Markov Tree (HMT) model. Our model assumes that the alignment variables have a tree structure which is isomorphic to the target dependency tree and models the distortion probability based on the source dependency tree, thereby incorporating the syntactic structure from both sides of the parallel sentences. In English-Japanese word alignment experiments, our model outperformed an IBM Model 4 baseline by over 3 points in alignment error rate. While our model was sensitive to posterior thresholds, it also showed performance comparable to that of HMM alignment models.
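To give a feel for what a tree-based distortion term looks like: the probability of aligning a target word to source position s, given that its dependency parent is aligned to s_parent, can be made to decay with the distance between s and s_parent in the source dependency tree. The exponential parameterization below is an assumption for illustration only; the paper's distortion distribution is estimated within the HMT model, not fixed like this.

```python
# Sketch of a source-tree-distance-based distortion term, illustrating how
# both dependency trees can enter an alignment model. Not the paper's model.

import math
from collections import deque
from typing import Dict, List

def source_tree_distance(heads: Dict[int, int], a: int, b: int) -> int:
    """Number of dependency edges between source positions a and b (BFS)."""
    adjacency: Dict[int, List[int]] = {i: [] for i in heads}
    for child, head in heads.items():
        if head != -1:
            adjacency[child].append(head)
            adjacency[head].append(child)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return len(heads)  # disconnected: maximal penalty

def distortion_prob(heads: Dict[int, int], s: int, s_parent: int,
                    candidates: List[int], lam: float = 0.5) -> float:
    """Normalized exp(-lam * tree distance) over candidate source positions."""
    weights = {c: math.exp(-lam * source_tree_distance(heads, c, s_parent))
               for c in candidates}
    return weights[s] / sum(weights.values())
```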

9 citations


Proceedings Article
01 Jun 2013
TL;DR: This paper examines tuning for statistical machine translation (SMT) with respect to multiple evaluation metrics and proposes several novel methods for tuning towards multiple objectives, including some based on ensemble decoding methods.
Abstract: This paper examines tuning for statistical machine translation (SMT) with respect to multiple evaluation metrics. We propose several novel methods for tuning towards multiple objectives, including some based on ensemble decoding methods. Pareto-optimality is a natural way to think about multi-metric optimization (MMO), and our methods can effectively combine several Pareto-optimal solutions, obviating the need to choose one. Our best-performing ensemble tuning method is a new algorithm for multi-metric optimization that searches for Pareto-optimal ensemble models. We study the effectiveness of our methods through experiments on both single-reference and multiple-reference datasets. Our experiments show simultaneous gains across several metrics (BLEU, RIBES), without any significant reduction in other metrics. This contrasts with traditional tuning, where gains are usually limited to a single metric. Our human evaluation results confirm that, in order to produce better MT output, optimizing multiple metrics is better than optimizing only one.
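Pareto-optimality is the organizing concept here: a candidate is kept only if no other candidate is at least as good on every metric and strictly better on one. The helper below computes that frontier for a set of per-candidate metric vectors (higher is better); it is a generic utility illustrating the concept, not the paper's tuning algorithm, which searches for Pareto-optimal ensemble models inside the optimizer.

```python
# Generic Pareto-frontier computation over metric vectors (higher = better).

from typing import Dict, List, Sequence

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if it is >= everywhere and > somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(candidates: Dict[str, Sequence[float]]) -> List[str]:
    """Return the names of candidates not dominated by any other candidate."""
    frontier = []
    for name, scores in candidates.items():
        if not any(dominates(other, scores)
                   for other_name, other in candidates.items()
                   if other_name != name):
            frontier.append(name)
    return frontier

# Example with hypothetical (BLEU, RIBES) scores for three tuned weight sets.
print(pareto_frontier({
    "weights_A": (24.1, 68.3),
    "weights_B": (23.7, 69.0),
    "weights_C": (23.5, 68.1),   # dominated by weights_A
}))  # -> ['weights_A', 'weights_B']
```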

9 citations


01 Jan 2013
TL;DR: This paper presents NTT-NAIST SMT systems for the English-German and German-English MT tasks of the IWSLT 2013 evaluation campaign, based on generalized minimum Bayes risk system combination of three SMT systems.
Abstract: This paper presents NTT-NAIST SMT systems for the English-German and German-English MT tasks of the IWSLT 2013 evaluation campaign. The systems are based on generalized minimum Bayes risk system combination of three SMT systems: forest-to-string, hierarchical phrase-based, and phrase-based with pre-ordering. The individual SMT systems include data selection for domain adaptation, rescoring using recurrent neural network language models, interpolated language models, and compound word splitting (only for German-English).
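System combination via minimum Bayes risk can be pictured as pooling the n-best lists of the component systems and picking the hypothesis with the highest expected similarity to the rest, weighted by normalized model scores. The sketch below implements that selection step with an abstract `gain` function (sentence-level BLEU in a real setup); the pooling details, weighting, and the "generalized" variant used in the paper are not reproduced here.

```python
# Sketch of minimum-Bayes-risk hypothesis selection over a pooled n-best
# list. `gain(h, r)` should be a sentence-level similarity such as BLEU;
# here it is left abstract. Posterior weights are softmax-normalized scores.

import math
from typing import Callable, List, Tuple

def mbr_select(pooled: List[Tuple[str, float]],
               gain: Callable[[str, str], float],
               temperature: float = 1.0) -> str:
    """Pick the hypothesis maximizing expected gain against the pool."""
    hyps = [h for h, _ in pooled]
    scores = [s / temperature for _, s in pooled]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    posteriors = [w / z for w in weights]

    def expected_gain(h: str) -> float:
        return sum(p * gain(h, r) for r, p in zip(hyps, posteriors))

    return max(hyps, key=expected_gain)

# Toy gain: word-overlap ratio (a stand-in for sentence-level BLEU).
def overlap(h: str, r: str) -> float:
    hs, rs = set(h.split()), set(r.split())
    return len(hs & rs) / (len(hs | rs) or 1)
```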


Proceedings Article
01 Oct 2013
TL;DR: The findings suggest that orthogonal information in the form of base-chunk constituents is much more helpful than additional N-best dependency parse features for improving dependency-based SRL in practice.
Abstract: Semantic Role Labeling (SRL) is an important task since it benefits a wide range of natural language processing applications. Given a sentence, the task of SRL is to identify arguments for a predicate (target verb or noun) and assign semantically meaningful labels to them. Dependency-parsing-based methods have achieved much success in SRL. However, due to errors in dependency parsing, there remains a large performance gap between SRL based on oracle parses and SRL based on automatic parses in practice. In light of this, this paper investigates what additional information is necessary to close this gap. Is it worthwhile to introduce additional dependency information in the form of N-best parse features, or is it better to incorporate orthogonal non-dependency information (base chunk constituents)? We compare the above features in an SRL system that achieves state-of-the-art results on the CoNLL 2009 Chinese task corpus. Our findings suggest that orthogonal information in the form of constituents is much more helpful in improving dependency-based SRL in practice.
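Concretely, "orthogonal constituent information" enters such a system as extra features on each candidate argument, derived from a base-chunk (shallow constituent) analysis rather than from the dependency parse. The sketch below shows one plausible shape for such features (the chunk label covering the argument, whether the argument exactly covers a chunk, a rough chunk path to the predicate); the exact templates used in the paper are not specified here, so these are assumptions for illustration.

```python
# Sketch: base-chunk (shallow constituent) features for a candidate
# argument in dependency-based SRL. `chunks` is a list of
# (start, end, label) spans, e.g. (3, 5, "NP"); indices are token positions.

from typing import List, Optional, Tuple

Chunk = Tuple[int, int, str]  # [start, end) span with a label like "NP", "VP"

def chunk_of(pos: int, chunks: List[Chunk]) -> Optional[Chunk]:
    for start, end, label in chunks:
        if start <= pos < end:
            return (start, end, label)
    return None

def constituent_features(arg_start: int, arg_end: int, pred_pos: int,
                         chunks: List[Chunk]) -> List[str]:
    feats = []
    head_chunk = chunk_of(arg_start, chunks)
    if head_chunk:
        feats.append(f"arg_chunk={head_chunk[2]}")
        # Does the candidate argument exactly cover a base chunk?
        feats.append(f"arg_is_chunk={(head_chunk[0], head_chunk[1]) == (arg_start, arg_end)}")
    # Crude "chunk path": labels of chunks lying between argument and predicate.
    lo, hi = min(arg_end, pred_pos), max(arg_start, pred_pos)
    path = [label for start, end, label in chunks if start >= lo and end <= hi + 1]
    feats.append("chunk_path=" + ("-".join(path) if path else "NONE"))
    return feats
```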
