
Showing papers by "Kevin Duh" published in 2011


Proceedings Article
19 Jun 2011
TL;DR: It is argued that the cross-lingual adaptation problem is qualitatively different from other (monolingual) adaptation problems in NLP; thus new adaptation algorithms ought to be considered.
Abstract: Recent advances in Machine Translation (MT) have brought forth a new paradigm for building NLP applications in low-resource scenarios. To build a sentiment classifier for a language with no labeled resources, one can translate labeled data from another language, then train a classifier on the translated text. This can be viewed as a domain adaptation problem, where labeled translations and test data have some mismatch. Various prior works have achieved positive results using this approach. In this opinion piece, we take a step back and make some general statements about cross-lingual adaptation problems. First, we claim that domain mismatch is not caused by MT errors, and accuracy degradation will occur even in the case of perfect MT. Second, we argue that the cross-lingual adaptation problem is qualitatively different from other (monolingual) adaptation problems in NLP; thus new adaptation algorithms ought to be considered. This paper will describe a series of carefully designed experiments that led us to these conclusions.

62 citations


Proceedings Article
01 Nov 2011
TL;DR: This paper proposes a linear-time algorithm to extract the pre-ordering rules from word-aligned HPSG-tree-to-string pairs and a bottom-up algorithm to apply the extracted rules to HPSG trees to yield target-language-style source sentences.
Abstract: Word ordering remains an essential problem for translating between languages with substantial structural differences, such as SOV and SVO languages. In this paper, we propose to automatically extract pre-ordering rules from predicate-argument structures. A pre-ordering rule records the relative position mapping of a predicate word and its argument phrases from the source language side to the target language side. We propose 1) a linear-time algorithm to extract the pre-ordering rules from word-aligned HPSG-tree-to-string pairs and 2) a bottom-up algorithm to apply the extracted rules to HPSG trees to yield target-language-style source sentences. Experimental results are reported for large-scale English-to-Japanese translation, showing significant improvements of BLEU score compared with the baseline SMT systems.
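To make the idea of a pre-ordering rule concrete, here is a minimal sketch of bottom-up rule application on a toy predicate-argument tree. The tree encoding (a list of `(role, child)` pairs), the role names, and the function `apply_preordering` are assumptions made for illustration; the paper works on HPSG trees and extracts its rules from word-aligned data, neither of which is reproduced here.

```python
def apply_preordering(node, rules):
    """Reorder a toy predicate-argument tree bottom-up.

    node  : a source word (str) or a list of (role, child) pairs in source order
    rules : maps a tuple of roles in source order to a tuple in target order
            (hypothetical rule format, assumed for this sketch)
    """
    if isinstance(node, str):  # leaf: a single source word
        return [node]
    source_order = tuple(role for role, _ in node)
    # Reorder every child first, so rules are applied from the bottom up.
    spans = {role: apply_preordering(child, rules) for role, child in node}
    # Without a matching rule, keep the source order unchanged.
    target_order = rules.get(source_order, source_order)
    return [word for role in target_order for word in spans[role]]


# One SVO -> SOV rule: move the predicate after its second argument.
rules = {("ARG1", "PRED", "ARG2"): ("ARG1", "ARG2", "PRED")}
tree = [
    ("ARG1", "John"),
    ("PRED", "said"),
    ("ARG2", [("ARG1", "Mary"), ("PRED", "bought"), ("ARG2", "books")]),
]
reordered = apply_preordering(tree, rules)
print(" ".join(reordered))
```

The sketch assumes each role occurs at most once per node; a real system would need a richer rule key than the bare role sequence.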

37 citations



Journal ArticleDOI
TL;DR: This paper examines whether additional unlabeled data, which is easy to obtain, can be used to improve supervised algorithms, and proposes a simple yet flexible transductive meta-algorithm, which improves over supervised algorithms on the TREC and OHSUMED tasks from the LETOR dataset.

21 citations


01 Jan 2011
TL;DR: Details of the NTT-UT system in the NTCIR-9 PatentMT task are described, which includes syntactic pre-ordering, forest-to-string translation, and the use of external resources for domain adaptation and target language modeling.
Abstract: This paper describes details of the NTT-UT system in the NTCIR-9 PatentMT task. One of its key technologies is system combination; the final translation hypotheses are chosen from n-bests by different SMT systems in a Minimum Bayes Risk (MBR) manner. Each SMT system includes different technology: syntactic pre-ordering, forest-to-string translation, and the use of external resources for domain adaptation and target language modeling.

21 citations


Proceedings Article
01 Nov 2011
TL;DR: This work argues that common MBR implementations are actually not correct, and introduces Generalized MBR, which parameterizes the loss function in MBR and allows it to be optimized in the given hypothesis space of multiple systems.
Abstract: Minimum Bayes Risk (MBR) has been used as a decision rule for both single-system decoding and system combination in machine translation. For system combination, we argue that common MBR implementations are actually not correct, since probabilities in the hypothesis space cannot be reliably estimated. These implementations achieve the effect of consensus decoding (which may be beneficial in its own right), but do not reduce Bayes Risk in the true Bayesian sense. We introduce Generalized MBR, which parameterizes the loss function in MBR and allows it to be optimized in the given hypothesis space of multiple systems. This extension better approximates the true Bayes Risk decision rule and empirically improves over MBR, even in cases where the combined systems are of mixed quality.
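For readers unfamiliar with the baseline decision rule, the following is a minimal sketch of plain MBR selection over a pooled n-best list. The names `mbr_select` and `unigram_loss`, the toy unigram-F1 loss (a stand-in for a BLEU-based loss), and the uniform-probability default are assumptions for illustration. This is not the Generalized MBR proposed in the paper; with uniform weights it behaves as the consensus decoding the abstract mentions.

```python
from collections import Counter


def unigram_loss(hyp, ref):
    """Toy loss: 1 - unigram F1 between two token lists."""
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 1.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 1.0 - 2 * precision * recall / (precision + recall)


def mbr_select(hypotheses, probs=None, loss=unigram_loss):
    """Return the hypothesis with minimum expected loss over the list."""
    if probs is None:
        # Uniform weights: every hypothesis counts equally as a pseudo-reference.
        probs = [1.0 / len(hypotheses)] * len(hypotheses)

    def risk(hyp):
        return sum(p * loss(hyp, other) for other, p in zip(hypotheses, probs))

    return min(hypotheses, key=risk)


nbest = [
    "the cat sat on the mat".split(),
    "the cat sat on a mat".split(),
    "a dog stood on the rug".split(),
]
best = mbr_select(nbest)
print(" ".join(best))
```

Passing explicit `probs` changes the outcome, which is why unreliable probability estimates across systems matter for system combination.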

20 citations


Book ChapterDOI
20 Feb 2011
TL;DR: A framework to assist Wikipedia editors to transfer information among different languages is proposed and can be easily generalised and applied to other multi-lingual corpora.
Abstract: We propose a framework to assist Wikipedia editors to transfer information among different languages. Firstly, with the help of some machine translation tools, we analyse the texts in two different language editions of an article and identify information that is only available in one edition. Next, we propose an algorithm to look for the most probable position in the other edition where the new information can be inserted. We show that our method can accurately suggest positions for new information. Our proposal is beneficial to both readers and editors of Wikipedia, and can be easily generalised and applied to other multi-lingual corpora.

13 citations


01 Jan 2011
TL;DR: The Linear Ordering Problem (LOP) based reordering model was applied to Japanese-to-English translation to deal with the substantial difference in the word order between the two languages.
Abstract: This paper describes the patent translation system submitted for the NTCIR-9 PatentMT task. We applied the Linear Ordering Problem (LOP) based reordering model [16] to Japanese-to-English translation to deal with the substantial difference in the word order between the two languages.

9 citations


Proceedings Article
01 Nov 2011
TL;DR: This work proposes an alternative approach based on particle swarm optimization (PSO), which can easily exploit the fast growth of distributed computing to obtain solutions quickly and reduce the parameter tuning time from 10 hours to 40 minutes with no degradation in BLEU-score.
Abstract: The direct optimization of a translation metric is an integral part of building state-of-the-art SMT systems. Unfortunately, widely used translation metrics such as BLEU-score are non-smooth, non-convex, and non-trivial to optimize. Thus, standard optimizers such as minimum error rate training (MERT) can be extremely time-consuming, leading to a slow turnaround rate for SMT research and experimentation. We propose an alternative approach based on particle swarm optimization (PSO), which can easily exploit the fast growth of distributed computing to obtain solutions quickly. For example, in our experiments on NIST 2008 Chinese-to-English data with 512 cores, we demonstrate a speed increase of up to 15x and reduce the parameter tuning time from 10 hours to 40 minutes with no degradation in BLEU-score.
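The sketch below shows a basic, single-process particle swarm optimizer to illustrate why the method lends itself to distributed computing: within one iteration, every particle is evaluated independently. The function `pso_minimize`, its hyperparameter values, and the toy `sphere` objective are assumptions for illustration; the paper optimizes BLEU over SMT feature weights on a cluster, which is not reproduced here.

```python
import random


def pso_minimize(objective, dim, n_particles=30, n_iters=200,
                 inertia=0.7, c_personal=1.5, c_global=1.5, seed=0):
    """Basic synchronous PSO; returns (best_position, best_value)."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(n_iters):
        # Move each particle toward its own best and the swarm's best.
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (pbest[i][d] - pos[i][d])
                             + c_global * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
        # These evaluations do not depend on each other, so in a distributed
        # setting each one could run on a separate worker.
        vals = [objective(p) for p in pos]
        for i, val in enumerate(vals):
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Toy smooth objective standing in for a negated translation metric."""
    return sum(v * v for v in x)


best_pos, best_val = pso_minimize(sphere, dim=3)
print(best_val)
```

Unlike BLEU, the toy objective is smooth and convex; PSO needs only function values, not gradients, which is what makes it applicable to non-smooth metrics.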

6 citations


Proceedings ArticleDOI
23 Oct 2011
TL;DR: This paper presents a constrained form of bilinear logistic regression for diagnosis problems in which each of J entities produces the same K features, yet only overall faults from the ensemble are reported.
Abstract: In this paper, we address a pattern of diagnosis problems in which each of J entities produces the same K features, yet we are only informed of overall faults from the ensemble. Furthermore, we suspect that only certain entities and certain features are leading to the problem. The task, then, is to reliably identify which entities and which features are at fault. Such problems are particularly prevalent in the world of computer systems, in which a datacenter with hundreds of machines, each with the same performance counters, occasionally produces overall faults. In this paper, we present a means of using a constrained form of bilinear logistic regression for diagnosis in such problems. The bilinear treatment allows us to represent the scenarios with J+K instead of JK parameters, resulting in more easily interpretable results and far fewer false positives compared to treating the parameters independently. We develop statistical tests to determine which features and entities, if any, may be responsible for the labeled faults, and use false discovery rate (FDR) analysis to ensure that our values are meaningful. We show results in comparison to ordinary logistic regression (with L1 regularization) on two scenarios: a synthetic dataset based on a model of faults in a datacenter, and a real problem of finding problematic processes/features based on user-reported hangs.
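The J+K versus JK parameter count can be seen in a small sketch of the bilinear scoring function. The function `bilinear_logit_prob`, the `bias` term, and the toy weights and data are assumptions for illustration; the paper's constraints, fitting procedure, and statistical tests are not shown.

```python
import math


def bilinear_logit_prob(X, entity_w, feature_w, bias=0.0):
    """P(fault) = sigmoid(sum_j sum_k entity_w[j] * X[j][k] * feature_w[k] + bias)."""
    score = bias
    for j, row in enumerate(X):
        for k, x in enumerate(row):
            score += entity_w[j] * x * feature_w[k]
    return 1.0 / (1.0 + math.exp(-score))


J, K = 3, 4  # 3 entities (e.g. machines), 4 features (e.g. performance counters)
X = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 0.0],  # entity 1 shows a high value on feature 2
    [0.0, 0.0, 0.0, 0.0],
]
entity_w = [0.0, 1.0, 0.0]        # hypothetical weights: only entity 1 implicated
feature_w = [0.0, 0.0, 1.0, 0.0]  # hypothetical weights: only feature 2 implicated

p_fault = bilinear_logit_prob(X, entity_w, feature_w, bias=-2.0)
n_bilinear_params = J + K  # one weight per entity plus one per feature
n_full_params = J * K      # one weight per (entity, feature) pair
print(p_fault, n_bilinear_params, n_full_params)
```

Because each weight vector is indexed by entity or by feature alone, a nonzero entry points directly at a suspect entity or feature, which is the interpretability benefit the abstract describes.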

5 citations




Journal ArticleDOI
TL;DR: A framework that can be used to assist authors and editors in collaborative writing to perform cross-lingual editing is described.
Abstract: Wikis have enabled Web users to author and edit documents in a collaborative manner. In many cases such as Wikipedia and Wikibooks, they have been used to host a set of parallel or comparable documents written in different languages. While a wiki provides an environment in which editors can work together efficiently, maintaining a set of multi-lingual documents is still a very demanding task for the editors. When documents in different languages are maintained by different groups of editors, it is even harder for a contributor to determine whether some information is missing and should be translated from one language to another. In this article, we briefly describe a framework that can be used to assist authors and editors in collaborative writing to perform cross-lingual editing.

Journal ArticleDOI
01 Jan 2011
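TL;DR: This paper describes a method for improving word sense disambiguation accuracy through automatic expansion of the training data, evaluated on the SemEval-2010 Japanese word sense disambiguation task.
Abstract: This paper describes a method for improving the accuracy of word sense disambiguation by automatically expanding the training data. We used the SemEval-2010 Japanese word sense disambiguation task for evaluation. We first present the results obtained by training only on the distributed training data. We then present the results of attempts to automatically expand the training data using a variety of corpora, such as dictionary example sentences, sense banks other than the distributed data, and unlabeled corpora. With automatically acquired training data, we obtained an accuracy of 79.5%. Furthermore, by controlling the upper limit on the amount of added training data according to the difficulty of each target word, we obtained a best accuracy of 80.0%.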