
Showing papers by "Kevin Duh" published in 2013


Proceedings Article
01 Aug 2013
TL;DR: It is found that neural language models are indeed viable tools for data selection: while the improvements are varied, they are fast to train on small in-domain data and can sometimes substantially outperform conventional n-grams.
Abstract: Data selection is an effective approach to domain adaptation in statistical machine translation. The idea is to use language models trained on small in-domain text to select similar sentences from large general-domain corpora, which are then incorporated into the training data. Substantial gains have been demonstrated in previous work employing standard n-gram language models. Here, we explore the use of neural language models for data selection. We hypothesize that the continuous vector representation of words in neural language models makes them more effective than n-grams for modeling unknown word contexts, which are prevalent in general-domain text. In a comprehensive evaluation of 4 language pairs (English to German, French, Russian, Spanish), we found that neural language models are indeed viable tools for data selection: while the improvements are varied (i.e. 0.1 to 1.7 gains in BLEU), they are fast to train on small in-domain data and can sometimes substantially outperform conventional n-grams.
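A common instantiation of this selection step is Moore-Lewis cross-entropy difference: each general-domain sentence is ranked by how much better the in-domain language model scores it than a general-domain contrast model, and only the top-ranked portion is kept. The sketch below shows that ranking step with an abstract per-word cross-entropy function; the scoring interface, the selection fraction, and the use of a contrast model are assumptions for illustration, and the paper's contribution is to compute such scores with neural language models rather than n-grams.

```python
# Illustrative sketch of cross-entropy-difference data selection
# (Moore-Lewis style). The language-model interface is hypothetical:
# any model exposing average per-word cross-entropy will do.

from typing import Callable, Iterable, List, Tuple

# xent(sentence) -> average per-word cross-entropy under some language model.
ScoreFn = Callable[[List[str]], float]

def rank_by_xent_difference(pool: Iterable[List[str]],
                            in_domain_xent: ScoreFn,
                            general_xent: ScoreFn) -> List[Tuple[float, List[str]]]:
    """Rank general-domain sentences; lower score = more in-domain-like."""
    scored = [(in_domain_xent(s) - general_xent(s), s) for s in pool]
    scored.sort(key=lambda x: x[0])
    return scored

def select_top(pool, in_domain_xent, general_xent, fraction=0.1):
    """Keep the best-scoring fraction of the pool as extra training data."""
    ranked = rank_by_xent_difference(pool, in_domain_xent, general_xent)
    cutoff = max(1, int(len(ranked) * fraction))
    return [sent for _, sent in ranked[:cutoff]]
```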

129 citations


01 Oct 2013
TL;DR: The experimental results show that the k-best language model and the statistical machine translation model generate almost all the correction candidates but with very low precision; reranking the candidates with the SVM classifier yields higher precision at the cost of a small drop in recall.
Abstract: We describe the Nara Institute of Science and Technology (NAIST) spelling check system in the shared task. Our system contains three components: a word-segmentation-based language model to generate correction candidates; a statistical machine translation model to provide correction candidates; and a Support Vector Machine (SVM) classifier to rerank the candidates provided by the previous two components. The experimental results show that the k-best language model and the statistical machine translation model generate almost all the correction candidates, but with very low precision. However, by using the SVM classifier to rerank the candidates, we obtain higher precision with only a small drop in recall. To address the low-resource problem of Chinese spelling check, we generate 2 million artificial training sentences by simply replacing characters in the provided training sentences with characters from the confusion set.
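The artificial-data step described at the end of the abstract is simple enough to sketch: corrupt a correct sentence by swapping one of its characters for a member of its confusion set, and pair the corrupted sentence with the original. The confusion-set format (a dict from character to replacements) and the sampling scheme below are assumptions for illustration, not the NAIST implementation.

```python
# Sketch of confusion-set-based error generation for Chinese spelling check.
# `confusion` maps a character to characters it is commonly confused with;
# the dict format is assumed here (the shared task distributes such a set).

import random
from typing import Dict, List, Optional, Tuple

def corrupt(sentence: str,
            confusion: Dict[str, List[str]],
            rng: random.Random) -> Optional[Tuple[str, int]]:
    """Replace one confusable character; return (corrupted sentence, position)."""
    positions = [i for i, ch in enumerate(sentence) if confusion.get(ch)]
    if not positions:
        return None  # nothing in this sentence can be corrupted
    i = rng.choice(positions)
    wrong = rng.choice(confusion[sentence[i]])
    return sentence[:i] + wrong + sentence[i + 1:], i

def make_training_pairs(sentences: List[str],
                        confusion: Dict[str, List[str]],
                        n_pairs: int,
                        seed: int = 0) -> List[Tuple[str, str]]:
    """Generate (corrupted, correct) pairs until n_pairs have been collected."""
    rng = random.Random(seed)
    pairs: List[Tuple[str, str]] = []
    while len(pairs) < n_pairs:
        sent = rng.choice(sentences)
        result = corrupt(sent, confusion, rng)
        if result is not None:
            pairs.append((result[0], sent))
    return pairs
```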

38 citations


Proceedings Article
01 Aug 2013
TL;DR: A flexible and effective framework for extracting a bilingual dictionary from comparable corpora, based on a novel combination of topic modeling and word alignment techniques, which is shown to reliably extract high-precision translation pairs under a wide variety of comparable data conditions.
Abstract: We propose a flexible and effective framework for extracting a bilingual dictionary from comparable corpora. Our approach is based on a novel combination of topic modeling and word alignment techniques. Intuitively, our approach works by converting a comparable document-aligned corpus into a parallel topic-aligned corpus, then learning word alignments using co-occurrence statistics. This topic-aligned corpus is similar in structure to the sentence-aligned corpus frequently used in statistical machine translation, enabling us to exploit advances in word alignment research. Unlike much previous work, our framework does not require any language-specific knowledge for initialization. Furthermore, our framework attempts to handle polysemy by allowing multiple translation probability models for each word. On a large-scale Wikipedia corpus, we demonstrate that our framework reliably extracts high-precision translation pairs on a wide variety of comparable data conditions.
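The core idea can be made concrete: run a topic model over the document-aligned corpus, collect each language's words per topic into pseudo-parallel "topic documents", and estimate translation probabilities from cross-language co-occurrence within topics. The sketch below uses raw topic-level co-occurrence counts and an assumed input format; the paper instead learns alignments with standard word-alignment machinery over the topic-aligned corpus, so treat this purely as an illustration of the intuition.

```python
# Sketch: from per-topic word distributions in two languages to a
# co-occurrence-based translation table. Input format is assumed:
# topics_src[k] / topics_tgt[k] are lists of (word, count) for topic k.

from collections import defaultdict
from typing import Dict, List, Tuple

def translation_table(topics_src: List[List[Tuple[str, int]]],
                      topics_tgt: List[List[Tuple[str, int]]],
                      ) -> Dict[str, Dict[str, float]]:
    """Estimate p(target | source) from topic-level co-occurrence."""
    cooc = defaultdict(lambda: defaultdict(float))
    for src_topic, tgt_topic in zip(topics_src, topics_tgt):
        tgt_total = sum(c for _, c in tgt_topic) or 1
        for s_word, s_count in src_topic:
            for t_word, t_count in tgt_topic:
                # weight by how prominent both words are in this topic
                cooc[s_word][t_word] += s_count * (t_count / tgt_total)
    table = {}
    for s_word, counts in cooc.items():
        total = sum(counts.values())
        table[s_word] = {t: c / total for t, c in counts.items()}
    return table

def top_translations(table, s_word, k=5):
    """Return the k most probable translation candidates for s_word."""
    return sorted(table.get(s_word, {}).items(), key=lambda x: -x[1])[:k]
```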

25 citations


Proceedings Article
15 Mar 2013
TL;DR: This paper describes a multilingual study on how much information is contained in a single post of microblog text from Twitter in 26 different languages, using entropy as the criterion for quantifying “how much is said” in a tweet.
Abstract: This paper describes a multilingual study on how much information is contained in a single post of microblog text from Twitter in 26 different languages. In order to answer this question in a quantitative fashion, we take an information-theoretic approach, using entropy as our criterion for quantifying “how much is said” in a tweet. Our results find that, as expected, languages with larger character sets such as Chinese and Japanese contain more information per character than other languages. However, we also find that, somewhat surprisingly, information per character does not have a strong correlation with information per microblog post, as authors of microblog posts in languages with more information per character do not necessarily use all of the space allotted to them. Finally, we examine the relative importance of a number of factors that contribute to whether a language has more or less information content in each character or post, and also compare the information content of microblog text with more traditional text from Wikipedia.
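The basic quantity behind such a study is easy to make concrete: estimate the per-character entropy of a language from a text sample and combine it with typical post length to get bits per post. The unigram estimate below is only a first-order illustration; the paper's entropy estimates and Twitter data handling are more involved, so numbers from this sketch should not be read as reproducing its results.

```python
# First-order (unigram) estimate of information per character and per post.
# A serious estimate would use a higher-order or neural character model;
# this only shows the shape of the calculation.

import math
from collections import Counter
from typing import Iterable

def unigram_entropy_bits_per_char(text: str) -> float:
    """H = -sum_c p(c) log2 p(c) over characters of the sample."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def bits_per_post(posts: Iterable[str]) -> float:
    """Entropy rate times average post length, in bits."""
    posts = list(posts)
    sample = "".join(posts)
    avg_len = sum(len(p) for p in posts) / len(posts)
    return unigram_entropy_bits_per_char(sample) * avg_len

# Toy comparison: a character-dense sample vs. an alphabetic sample.
if __name__ == "__main__":
    dense = ["信息密度高的帖子", "每个字符包含更多信息"]
    sparse = ["a short post in english", "another short english post"]
    print(bits_per_post(dense), bits_per_post(sparse))
```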

23 citations


Proceedings Article
01 Oct 2013
TL;DR: A compositional model where both predicate and argument are allowed to modify each other's meaning representations while generating the overall semantics, which readily addresses some major challenges with current vector space models.
Abstract: We present a novel vector space model for semantic co-compositionality. Inspired by Generative Lexicon Theory (Pustejovsky, 1995), our goal is a compositional model where both predicate and argument are allowed to modify each other's meaning representations while generating the overall semantics. This readily addresses some major challenges with current vector space models, notably the polysemy issue and the use of one representation per word type. We implement co-compositionality using prototype projections on predicates/arguments and show that this is effective in adapting their word representations. We further cast the model as a neural network and propose an unsupervised algorithm to jointly train word representations with co-compositionality. The model achieves the best result to date (ρ = 0.47) on the semantic similarity task of transitive verbs (Grefenstette and Sadrzadeh, 2011).
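As a rough picture of what "adapting a representation via a prototype" can look like, the sketch below shifts a predicate vector toward a prototype built from the nearest neighbours of its argument (and vice versa) before composing. This is an illustrative reconstruction under assumed details (cosine neighbours, a mixing weight alpha, additive composition); the paper's actual projection operation and its neural-network training are not reproduced here.

```python
# Illustrative sketch of prototype-based adaptation of word vectors
# before composition. Details (k, alpha, additive composition) are assumptions.

import numpy as np

def nearest_neighbors(vec: np.ndarray, vocab: dict, k: int = 10) -> list:
    """k most cosine-similar words to `vec` in `vocab` (word -> vector)."""
    words = list(vocab)
    mat = np.stack([vocab[w] for w in words])
    sims = mat @ vec / (np.linalg.norm(mat, axis=1) * np.linalg.norm(vec) + 1e-9)
    return [words[i] for i in np.argsort(-sims)[:k]]

def prototype(vec: np.ndarray, vocab: dict, k: int = 10) -> np.ndarray:
    """Average of the k nearest-neighbour vectors: a soft 'sense prototype'."""
    return np.mean([vocab[w] for w in nearest_neighbors(vec, vocab, k)], axis=0)

def co_compose(pred: str, arg: str, vocab: dict, alpha: float = 0.7) -> np.ndarray:
    """Adapt predicate toward the argument's prototype and vice versa, then add."""
    p, a = vocab[pred], vocab[arg]
    p_adapted = alpha * p + (1 - alpha) * prototype(a, vocab)
    a_adapted = alpha * a + (1 - alpha) * prototype(p, vocab)
    return p_adapted + a_adapted
```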

22 citations


Journal ArticleDOI
TL;DR: This work proposes a framework for assisting human editors in managing information disparity across multilingual document collections, using tools from machine translation and machine learning, and concludes that the system is effective in a variety of scenarios.
Abstract: Information disparity is a major challenge with multilingual document collections. When documents are dynamically updated in a distributed fashion, information content among different language editions may gradually diverge. We propose a framework for assisting human editors to manage this information disparity, using tools from machine translation and machine learning. Given source and target documents in two different languages, our system automatically identifies information nuggets that are new with respect to the target and suggests positions to place their translations. We perform both real-world experiments and large-scale simulations on Wikipedia documents and conclude our system is effective in a variety of scenarios.
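In outline, the nugget-detection step can be thought of as cross-lingual novelty detection: translate (or project) each source sentence, compare it against every target sentence, and flag the ones with no sufficiently similar counterpart, then suggest a position near the best-matching target context. The sketch below uses a placeholder `translate` function and plain token-overlap similarity; both are assumptions standing in for the MT and machine-learning components the paper actually uses.

```python
# Sketch of "new information nugget" detection between a source-language
# document and its target-language counterpart. `translate` is a stand-in
# for a real MT system; similarity here is simple token overlap (Jaccard).

from typing import Callable, List, Tuple

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / (len(ta | tb) or 1)

def find_new_nuggets(src_sents: List[str],
                     tgt_sents: List[str],
                     translate: Callable[[str], str],
                     threshold: float = 0.3) -> List[Tuple[str, int]]:
    """Return (translated nugget, suggested target position) pairs."""
    suggestions = []
    for sent in src_sents:
        hyp = translate(sent)
        sims = [jaccard(hyp, t) for t in tgt_sents]
        if max(sims, default=0.0) < threshold:
            # No close counterpart: treat as new content. Suggest inserting
            # after the most similar target sentence (a crude heuristic).
            pos = max(range(len(sims)), key=lambda i: sims[i]) + 1 if sims else 0
            suggestions.append((hyp, pos))
    return suggestions
```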

19 citations


Journal ArticleDOI
TL;DR: A novel reordering method for efficient two-step Japanese-to-English statistical machine translation (SMT) that separates reordering from lexical translation and solves it afterwards, reducing the decoding time of accurate but slow syntax-based SMT by approximating it through an intermediate Head-Final English (HFE) representation.
Abstract: This article proposes a novel reordering method for efficient two-step Japanese-to-English statistical machine translation (SMT) that isolates reordering from SMT and solves it after lexical translation. This reordering problem, called post-ordering, is solved as an SMT problem from Head-Final English (HFE) to English. HFE is English reordered by syntax-based rules, an approach that has been used very successfully for reordering in English-to-Japanese SMT. The proposed method carries this advantage over to the reverse direction, Japanese-to-English, and solves the post-ordering problem with accurate syntax-based SMT using target-language syntax. Two-step SMT with the proposed post-ordering empirically reduces the decoding time of the accurate but slow syntax-based SMT, which it approximates well via the intermediate HFE. In Japanese-to-English patent translation experiments, the proposed method speeds up syntax-based SMT decoding by about six times while maintaining comparable translation accuracy.
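The intermediate representation is easiest to grasp through the transformation that defines it: Head-Final English is English re-linearized so that each head follows its dependents, mimicking Japanese word order. The sketch below performs that head-final linearization on a toy dependency tree; the actual HFE definition includes additional handcrafted rules (e.g. inserting pseudo-particles), so this is only the core idea, not the paper's full procedure.

```python
# Sketch: head-final linearization of a dependency tree, the core idea
# behind Head-Final English (HFE). Each head is emitted after all of its
# dependents, which approximates Japanese (head-final) word order.

from typing import Dict, List

def head_final_order(heads: Dict[int, int], words: List[str]) -> List[str]:
    """heads[i] = index of word i's head (root has head -1); 0-indexed."""
    children: Dict[int, List[int]] = {i: [] for i in range(len(words))}
    root = 0
    for i, h in heads.items():
        if h == -1:
            root = i
        else:
            children[h].append(i)

    def emit(node: int) -> List[str]:
        out: List[str] = []
        for child in sorted(children[node]):  # keep dependents in surface order
            out.extend(emit(child))
        out.append(words[node])               # head comes last
        return out

    return emit(root)

# "John hit the ball" -> dependents first, heads last:
# ['John', 'the', 'ball', 'hit']
print(head_final_order({0: 1, 1: -1, 2: 3, 3: 1},
                       ["John", "hit", "the", "ball"]))
```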

11 citations


01 Jan 2013
TL;DR: This work systematically examines two different kinds of word representations, namely Brown clusters and word embeddings induced from a neural language model, on the task of dependency parsing of web text.
Abstract: Parsing web text is becoming progressively more important for many applications in natural language processing, such as machine translation, information retrieval, and sentiment analysis. Current syntactic parsing research has focused on canonical data such as newswire. When evaluated on standard benchmarks such as the Wall Street Journal data set, current state-of-the-art parsers achieve accuracies well above 90%; however, their accuracy drops dramatically, to barely over 80%, when they are applied to new domains such as web data. In order to make progress in the many applications that rely on parsing, we need robust parsers that can handle such texts. One approach that has recently become popular is to use unsupervised word representations as extra features. Koo et al. [1] have shown that unsupervised clustering features are effective for improving dependency parsing. Turian et al. [2] examined clustering and unsupervised word embedding features on chunking and named entity recognition tasks. Unsupervised word embeddings are dense, low-dimensional, real-valued vectors representing words, often induced by neural language models. Both studies show that these word representation features lead to performance improvements. Because these word representations are induced by unsupervised methods, they are well suited to new domains such as the web, which has an enormous amount of unlabeled data but little labeled data. In this paper we investigate the effect of unsupervised word representation features on dependency parsing of web text. We consider two different kinds of word representations, namely Brown clustering and word embeddings induced from a neural language model. To the best of our knowledge, this is the first work that systematically examines these word representations on the task of dependency parsing of web text.
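In practice, "using word representations as extra features" amounts to augmenting the parser's feature templates for each token. The sketch below shows one common recipe: Brown-cluster bit-string prefixes as discrete features and binned embedding dimensions as discretized features. The template names, prefix lengths, and binning scheme are typical choices, not necessarily those used in this work.

```python
# Sketch: augmenting per-token parser features with word representations.
# `brown` maps word -> bit-string cluster path (e.g. "01101110"),
# `embeddings` maps word -> list of floats. Template names are illustrative.

from typing import Dict, List

def representation_features(word: str,
                            brown: Dict[str, str],
                            embeddings: Dict[str, List[float]],
                            prefix_lengths=(4, 6, 10),
                            n_bins: int = 5) -> List[str]:
    feats = []
    # Brown clusters: use bit-string prefixes at several granularities.
    path = brown.get(word)
    if path is not None:
        for p in prefix_lengths:
            feats.append(f"brown{p}={path[:p]}")
    # Embeddings: discretize each dimension into a few bins so a linear
    # model (as in most parsers of that era) can use them as indicators.
    vec = embeddings.get(word)
    if vec is not None:
        for d, val in enumerate(vec):
            # assumes embedding values roughly in [-1, 1]; clamp otherwise
            bin_id = max(0, min(n_bins - 1, int((val + 1.0) / 2.0 * n_bins)))
            feats.append(f"emb{d}=bin{bin_id}")
    return feats
```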

10 citations


Proceedings Article
01 Aug 2013
TL;DR: This model assumes that the alignment variables have a tree structure which is isomorphic to the target dependency tree and models the distortion probability based on the source dependency tree, thereby incorporating the syntactic structure from both sides of the parallel sentences.
Abstract: We propose a novel unsupervised word alignment model based on the Hidden Markov Tree (HMT) model. Our model assumes that the alignment variables have a tree structure which is isomorphic to the target dependency tree and models the distortion probability based on the source dependency tree, thereby incorporating the syntactic structure from both sides of the parallel sentences. In English-Japanese word alignment experiments, our model outperformed an IBM Model 4 baseline by over 3 points in alignment error rate. While our model was sensitive to posterior thresholds, it also showed performance comparable to that of HMM alignment models.
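To give a feel for what a tree-based distortion term looks like: the probability of aligning a target word to source position s, given that its dependency parent is aligned to s_parent, can be made to decay with the distance between s and s_parent in the source dependency tree. The exponential parameterization below is an assumption for illustration only; the paper's distortion distribution is estimated within the HMT model, not fixed like this.

```python
# Sketch of a source-tree-distance-based distortion term, illustrating how
# both dependency trees can enter an alignment model. Not the paper's model.

import math
from collections import deque
from typing import Dict, List

def source_tree_distance(heads: Dict[int, int], a: int, b: int) -> int:
    """Number of dependency edges between source positions a and b (BFS)."""
    adjacency: Dict[int, List[int]] = {i: [] for i in heads}
    for child, head in heads.items():
        if head != -1:
            adjacency[child].append(head)
            adjacency[head].append(child)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return len(heads)  # disconnected: maximal penalty

def distortion_prob(heads: Dict[int, int], s: int, s_parent: int,
                    candidates: List[int], lam: float = 0.5) -> float:
    """Normalized exp(-lam * tree distance) over candidate source positions."""
    weights = {c: math.exp(-lam * source_tree_distance(heads, c, s_parent))
               for c in candidates}
    return weights[s] / sum(weights.values())
```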

9 citations


Proceedings Article
01 Jun 2013
TL;DR: This paper examines tuning for statistical machine translation (SMT) with respect to multiple evaluation metrics and proposes several novel methods for tuning towards multiple objectives, including some based on ensemble decoding methods.
Abstract: This paper examines tuning for statistical machine translation (SMT) with respect to multiple evaluation metrics. We propose several novel methods for tuning towards multiple objectives, including some based on ensemble decoding methods. Pareto-optimality is a natural way to think about multi-metric optimization (MMO), and our methods can effectively combine several Pareto-optimal solutions, obviating the need to choose one. Our best-performing ensemble tuning method is a new algorithm for multi-metric optimization that searches for Pareto-optimal ensemble models. We study the effectiveness of our methods through experiments on both single-reference and multiple-reference datasets. Our experiments show simultaneous gains across several metrics (BLEU, RIBES), without any significant reduction in other metrics. This contrasts with traditional tuning, where gains are usually limited to a single metric. Our human evaluation results confirm that, in order to produce better MT output, optimizing multiple metrics is better than optimizing only one.
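Pareto-optimality is the organizing concept here: a candidate is kept only if no other candidate is at least as good on every metric and strictly better on one. The helper below computes that frontier for a set of per-candidate metric vectors (higher is better); it is a generic utility illustrating the concept, not the paper's tuning algorithm, which searches for Pareto-optimal ensemble models inside the optimizer.

```python
# Generic Pareto-frontier computation over metric vectors (higher = better).

from typing import Dict, List, Sequence

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if it is >= everywhere and > somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(candidates: Dict[str, Sequence[float]]) -> List[str]:
    """Return the names of candidates not dominated by any other candidate."""
    frontier = []
    for name, scores in candidates.items():
        if not any(dominates(other, scores)
                   for other_name, other in candidates.items()
                   if other_name != name):
            frontier.append(name)
    return frontier

# Example with hypothetical (BLEU, RIBES) scores for three tuned weight sets.
print(pareto_frontier({
    "weights_A": (24.1, 68.3),
    "weights_B": (23.7, 69.0),
    "weights_C": (23.5, 68.1),   # dominated by weights_A
}))  # -> ['weights_A', 'weights_B']
```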

9 citations


01 Jan 2013
TL;DR: This paper presents NTT-NAIST SMT systems for the English-German and German-English MT tasks of the IWSLT 2013 evaluation campaign, based on generalized minimum Bayes risk system combination of three SMT systems.
Abstract: This paper presents NTT-NAIST SMT systems for the English-German and German-English MT tasks of the IWSLT 2013 evaluation campaign. The systems are based on generalized minimum Bayes risk system combination of three SMT systems: forest-to-string, hierarchical phrase-based, and phrase-based with pre-ordering. The individual SMT systems include data selection for domain adaptation, rescoring using recurrent neural network language models, interpolated language models, and compound word splitting (only for German-English).
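System combination via minimum Bayes risk can be pictured as pooling the n-best lists of the component systems and picking the hypothesis with the highest expected similarity to the rest, weighted by normalized model scores. The sketch below implements that selection step with an abstract `gain` function (sentence-level BLEU in a real setup); the pooling details, weighting, and the "generalized" variant used in the paper are not reproduced here.

```python
# Sketch of minimum-Bayes-risk hypothesis selection over a pooled n-best
# list. `gain(h, r)` should be a sentence-level similarity such as BLEU;
# here it is left abstract. Posterior weights are softmax-normalized scores.

import math
from typing import Callable, List, Tuple

def mbr_select(pooled: List[Tuple[str, float]],
               gain: Callable[[str, str], float],
               temperature: float = 1.0) -> str:
    """Pick the hypothesis maximizing expected gain against the pool."""
    hyps = [h for h, _ in pooled]
    scores = [s / temperature for _, s in pooled]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    posteriors = [w / z for w in weights]

    def expected_gain(h: str) -> float:
        return sum(p * gain(h, r) for r, p in zip(hyps, posteriors))

    return max(hyps, key=expected_gain)

# Toy gain: word-overlap ratio (a stand-in for sentence-level BLEU).
def overlap(h: str, r: str) -> float:
    hs, rs = set(h.split()), set(r.split())
    return len(hs & rs) / (len(hs | rs) or 1)
```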


Proceedings Article
01 Oct 2013
TL;DR: The findings suggest that orthogonal information in the form of base-chunk constituents is much more helpful than additional N-best dependency parse features for improving dependency-based SRL in practice.
Abstract: Semantic Role Labeling (SRL) is an important task since it benefits a wide range of natural language processing applications. Given a sentence, the task of SRL is to identify arguments for a predicate (target verb or noun) and assign semantically meaningful labels to them. Dependency-parsing-based methods have achieved much success in SRL. However, due to errors in dependency parsing, there remains a large performance gap between SRL based on oracle parses and SRL based on automatic parses in practice. In light of this, this paper investigates what additional information is necessary to close this gap. Is it worthwhile to introduce additional dependency information in the form of N-best parse features, or is it better to incorporate orthogonal non-dependency information (base chunk constituents)? We compare the above features in an SRL system that achieves state-of-the-art results on the CoNLL 2009 Chinese task corpus. Our findings suggest that orthogonal information in the form of constituents is much more helpful in improving dependency-based SRL in practice.
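Concretely, "orthogonal constituent information" enters such a system as extra features on each candidate argument, derived from a base-chunk (shallow constituent) analysis rather than from the dependency parse. The sketch below shows one plausible shape for such features (the chunk label covering the argument, whether the argument exactly covers a chunk, a rough chunk path to the predicate); the exact templates used in the paper are not specified here, so these are assumptions for illustration.

```python
# Sketch: base-chunk (shallow constituent) features for a candidate
# argument in dependency-based SRL. `chunks` is a list of
# (start, end, label) spans, e.g. (3, 5, "NP"); indices are token positions.

from typing import List, Optional, Tuple

Chunk = Tuple[int, int, str]  # [start, end) span with a label like "NP", "VP"

def chunk_of(pos: int, chunks: List[Chunk]) -> Optional[Chunk]:
    for start, end, label in chunks:
        if start <= pos < end:
            return (start, end, label)
    return None

def constituent_features(arg_start: int, arg_end: int, pred_pos: int,
                         chunks: List[Chunk]) -> List[str]:
    feats = []
    head_chunk = chunk_of(arg_start, chunks)
    if head_chunk:
        feats.append(f"arg_chunk={head_chunk[2]}")
        # Does the candidate argument exactly cover a base chunk?
        feats.append(f"arg_is_chunk={(head_chunk[0], head_chunk[1]) == (arg_start, arg_end)}")
    # Crude "chunk path": labels of chunks lying between argument and predicate.
    lo, hi = min(arg_end, pred_pos), max(arg_start, pred_pos)
    path = [label for start, end, label in chunks if start >= lo and end <= hi + 1]
    feats.append("chunk_path=" + ("-".join(path) if path else "NONE"))
    return feats
```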
