Showing papers by "Kevin Duh" published in 2011

Open Access

Proceedings Article•

Is Machine Translation Ripe for Cross-Lingual Sentiment Classification?

Kevin Duh¹, Akinori Fujino¹, Masaaki Nagata¹•Institutions (1)

19 Jun 2011

TL;DR: It is argued that the cross-lingual adaptation problem is qualitatively different from other (monolingual) adaptation problems in NLP; thus new adaptation algorithms ought to be considered.

Abstract: Recent advances in Machine Translation (MT) have brought forth a new paradigm for building NLP applications in low-resource scenarios. To build a sentiment classifier for a language with no labeled resources, one can translate labeled data from another language, then train a classifier on the translated text. This can be viewed as a domain adaptation problem, where labeled translations and test data have some mismatch. Various prior works have achieved positive results using this approach. In this opinion piece, we take a step back and make some general statements about cross-lingual adaptation problems. First, we claim that domain mismatch is not caused by MT errors, and accuracy degradation will occur even in the case of perfect MT. Second, we argue that the cross-lingual adaptation problem is qualitatively different from other (monolingual) adaptation problems in NLP; thus new adaptation algorithms ought to be considered. This paper will describe a series of carefully designed experiments that led us to these conclusions.

62 citations

Proceedings Article•

Extracting Pre-ordering Rules from Predicate-Argument Structures

Xianchao Wu¹, Katsuhito Sudoh², Kevin Duh³, Hajime Tsukada², Masaaki Nagata²•Institutions (3)

University of Tokyo¹, Nippon Telegraph and Telephone², Nara Institute of Science and Technology³

01 Nov 2011

TL;DR: This paper proposes a linear-time algorithm to extract the pre-ordering rules from word-aligned HPSG-tree-to-string pairs and a bottom-up algorithm to apply the extracted rules to HPSG trees to yield target-language-style source sentences.

Abstract: Word ordering remains an essential problem for translating between languages with substantial structural differences, such as SOV and SVO languages. In this paper, we propose to automatically extract pre-ordering rules from predicate-argument structures. A pre-ordering rule records the relative position mapping of a predicate word and its argument phrases from the source language side to the target language side. We propose 1) a linear-time algorithm to extract the pre-ordering rules from word-aligned HPSG-tree-to-string pairs and 2) a bottom-up algorithm to apply the extracted rules to HPSG trees to yield target-language-style source sentences. Experimental results are reported for large-scale English-to-Japanese translation, showing significant improvements in BLEU score compared with the baseline SMT systems.
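
As a rough illustration of what applying a pre-ordering rule bottom-up might look like, the sketch below reorders a toy predicate-argument tree. Everything in it is an assumption made for illustration: the `RULES` table, the role labels, the tree encoding, and the `preorder` function are hypothetical, and the code covers neither HPSG parsing nor the paper's linear-time rule extraction.

```python
# Hypothetical rule table: source-side role sequence -> target-side order.
# The single rule below moves the predicate after its arguments (SVO -> SOV).
RULES = {
    ("ARG1", "PRED", "ARG2"): ("ARG1", "ARG2", "PRED"),
}


def preorder(node):
    """Reorder a toy tree bottom-up.

    A node is either a word (str) or a list of (role, child) pairs.
    Roles are assumed to be unique within one node.
    """
    if isinstance(node, str):
        return [node]
    # Reorder every child subtree first, then reorder this node's children.
    reordered = {role: preorder(child) for role, child in node}
    source_roles = tuple(role for role, _ in node)
    target_roles = RULES.get(source_roles, source_roles)  # no rule: keep order
    return [word for role in target_roles for word in reordered[role]]


tree = [
    ("ARG1", "she"),
    ("PRED", "said"),
    ("ARG2", [("ARG1", "he"), ("PRED", "reads"), ("ARG2", "books")]),
]
print(" ".join(preorder(tree)))  # she he books reads said
```

The embedded clause is reordered before the outer clause, which is the bottom-up order the abstract describes; nodes without a matching rule keep their source order.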

37 citations

Post-ordering in Statistical Machine Translation

Katsuhito Sudoh, Xianchao Wu, Kevin Duh, Hajime Tsukada, Masaaki Nagata

19 Sep 2011

24 citations

Journal Article•DOI•

Semi-supervised ranking for document retrieval

Kevin Duh¹, Katrin Kirchhoff¹•Institutions (1)

University of Washington¹

01 Apr 2011-Computer Speech & Language

TL;DR: This paper examines whether additional unlabeled data, which is easy to obtain, can be used to improve supervised algorithms, and proposes a simple yet flexible transductive meta-algorithm, which improves over supervised algorithms on the TREC and OHSUMED tasks from the LETOR dataset.

21 citations

NTT-UT Statistical Machine Translation in NTCIR-9 PatentMT

Katsuhito Sudoh, Kevin Duh, Hajime Tsukada, Masaaki Nagata, Xianchao Wu, Takuya Matsuzaki¹, Jun'ichi Tsujii²•Institutions (2)

University of Tokyo¹, Microsoft²

01 Jan 2011

TL;DR: Details of the NTT-UT system in the NTCIR-9 PatentMT task are described; its component systems include syntactic pre-ordering, forest-to-string translation, and the use of external resources for domain adaptation and target language modeling.

Abstract: This paper describes details of the NTT-UT system in the NTCIR-9 PatentMT task. One of its key technologies is system combination; the final translation hypotheses are chosen from the n-best lists of different SMT systems in a Minimum Bayes Risk (MBR) manner. Each SMT system includes a different technology: syntactic pre-ordering, forest-to-string translation, and the use of external resources for domain adaptation and target language modeling.

21 citations

Proceedings Article•

Generalized Minimum Bayes Risk System Combination

Kevin Duh¹, Katsuhito Sudoh², Xianchao Wu³, Hajime Tsukada², Masaaki Nagata²•Institutions (3)

Nara Institute of Science and Technology¹, Nippon Telegraph and Telephone², University of Tokyo³

01 Nov 2011

TL;DR: This work argues that common MBR implementations are actually not correct, and introduces Generalized MBR, which parameterizes the loss function in MBR and allows it to be optimized in the given hypothesis space of multiple systems.

Abstract: Minimum Bayes Risk (MBR) has been used as a decision rule for both single-system decoding and system combination in machine translation. For system combination, we argue that common MBR implementations are actually not correct, since probabilities in the hypothesis space cannot be reliably estimated. These implementations achieve the effect of consensus decoding (which may be beneficial in its own right), but do not reduce Bayes Risk in the true Bayesian sense. We introduce Generalized MBR, which parameterizes the loss function in MBR and allows it to be optimized in the given hypothesis space of multiple systems. This extension better approximates the true Bayes Risk decision rule and empirically improves over MBR, even in cases where the combined systems are of mixed quality.
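
For readers unfamiliar with the decision rule being discussed, here is a minimal sketch of plain MBR selection over a pooled hypothesis list. It is not the Generalized MBR proposed in the paper. The toy loss (1 minus unigram F1), the example sentences, and the function names `unigram_loss` and `mbr_select` are assumptions chosen only for illustration.

```python
from collections import Counter


def unigram_loss(hyp: str, ref: str) -> float:
    """Toy loss: 1 minus unigram F1 (a simple stand-in for 1 - BLEU)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 1.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 1.0 - 2 * precision * recall / (precision + recall)


def mbr_select(hypotheses, probs, loss=unigram_loss):
    """Return the hypothesis with the lowest expected loss under `probs`."""
    def risk(candidate):
        return sum(p * loss(candidate, other)
                   for other, p in zip(hypotheses, probs))
    return min(hypotheses, key=risk)


# Hypotheses pooled from several systems (invented examples).
pooled = ["the cat sat on the mat", "the cat sat on a mat", "a dog stood there"]

# With uniform probabilities the rule picks the "most central" hypothesis,
# which is the consensus effect the abstract mentions.
consensus = mbr_select(pooled, [1 / 3, 1 / 3, 1 / 3])

# With a peaked distribution the same rule follows the probable hypothesis.
peaked = mbr_select(pooled, [0.9, 0.05, 0.05])
```

The two calls differ only in the probabilities, which shows why unreliable probability estimates across systems change what the rule actually computes.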

20 citations

Book Chapter•DOI•

Providing cross-lingual editing assistance to Wikipedia editors

Ching-man Au Yeung¹, Kevin Duh¹, Masaaki Nagata¹•Institutions (1)

Nippon Telegraph and Telephone¹

20 Feb 2011

TL;DR: A framework to assist Wikipedia editors to transfer information among different languages is proposed and can be easily generalised and applied to other multi-lingual corpora.

Abstract: We propose a framework to assist Wikipedia editors to transfer information among different languages. Firstly, with the help of some machine translation tools, we analyse the texts in two different language editions of an article and identify information that is only available in one edition. Next, we propose an algorithm to look for the most probable position in the other edition where the new information can be inserted. We show that our method can accurately suggest positions for new information. Our proposal is beneficial to both readers and editors of Wikipedia, and can be easily generalised and applied to other multi-lingual corpora.

13 citations

Learning of Linear Ordering Problems and its Application to J-E Patent Translation in NTCIR-9 PatentMT

Shuhei Kondo¹, Mamoru Komachi¹, Yuji Matsumoto¹, Katsuhito Sudoh, Kevin Duh, Hajime Tsukada•Institutions (1)

Nara Institute of Science and Technology¹

01 Jan 2011

TL;DR: The Linear Ordering Problem (LOP) based reordering model was applied to Japanese-to-English translation to deal with the substantial difference in the word order between the two languages.

Abstract: This paper describes the patent translation system submitted for the NTCIR-9 PatentMT task. We applied the Linear Ordering Problem (LOP) based reordering model [16] to Japanese-to-English translation to deal with the substantial difference in the word order between the two languages.
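
The generic Linear Ordering Problem can be stated compactly: given a score for placing item a before item b, find the permutation that maximizes the total score over all ordered pairs. The sketch below solves a three-item instance by brute force. The score matrix is invented, the function names are hypothetical, and nothing here reflects how the submitted system learns its scores or searches over orderings.

```python
from itertools import permutations


def lop_score(order, pref):
    """Sum pref[a][b] over every pair where a is placed before b."""
    return sum(pref[a][b]
               for i, a in enumerate(order)
               for b in order[i + 1:])


def solve_lop_bruteforce(pref):
    """Try every permutation; only feasible for a handful of items."""
    n = len(pref)
    return max(permutations(range(n)),
               key=lambda order: lop_score(order, pref))


# Hypothetical preference scores for three source words (indices 0, 1, 2):
# pref[a][b] is the reward for putting word a before word b.
pref = [[0, 1, 5],
        [4, 0, 1],
        [0, 3, 0]]
best_order = solve_lop_bruteforce(pref)
print(best_order, lop_score(best_order, pref))  # (1, 0, 2) 10
```

Brute force is exponential in sentence length, so a practical reordering model would need a more efficient search; the example only makes the objective concrete.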

9 citations

Proceedings Article•

Distributed Minimum Error Rate Training of SMT using Particle Swarm Optimization

Jun Suzuki¹, Kevin Duh², Masaaki Nagata³•Institutions (3)

Kyoto University¹, Nara Institute of Science and Technology², Nippon Telegraph and Telephone³

01 Nov 2011

TL;DR: This work proposes an alternative approach based on particle swarm optimization (PSO), which can easily exploit the fast growth of distributed computing to obtain solutions quickly and reduce the parameter tuning time from 10 hours to 40 minutes with no degradation in BLEU-score.

Abstract: The direct optimization of a translation metric is an integral part of building state-of-the-art SMT systems. Unfortunately, widely used translation metrics such as BLEU-score are non-smooth, non-convex, and non-trivial to optimize. Thus, standard optimizers such as minimum error rate training (MERT) can be extremely time-consuming, leading to a slow turnaround rate for SMT research and experimentation. We propose an alternative approach based on particle swarm optimization (PSO), which can easily exploit the fast growth of distributed computing to obtain solutions quickly. For example, in our experiments on NIST 2008 Chinese-to-English data with 512 cores, we demonstrate a speed increase of up to 15x and reduce the parameter tuning time from 10 hours to 40 minutes with no degradation in BLEU-score.
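
To make the optimizer concrete, the following is a minimal, single-process sketch of textbook particle swarm optimization on a toy objective. It is not the distributed method from the paper: `toy_metric` is a hypothetical piecewise-constant function standing in for a non-smooth metric such as BLEU, and the swarm size, coefficients, and search range are arbitrary assumptions.

```python
import random


def toy_metric(weights):
    """Hypothetical piecewise-constant objective, highest near (1.0, -2.0)."""
    x, y = weights
    return -(round(abs(x - 1.0), 1) + round(abs(y + 2.0), 1))


def pso_maximize(objective, dim, n_particles=20, iters=60, seed=0,
                 inertia=0.7, c_personal=1.5, c_global=1.5):
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]              # each particle's best position
    best_val = [objective(p) for p in pos]
    g = max(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]  # best position seen by the swarm
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                # Pull toward the particle's own best and the swarm's best.
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * rng.random() * (best_pos[i][d] - pos[i][d])
                             + c_global * rng.random() * (g_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            # Only needs this particle's position, so evaluations could be
            # farmed out to separate workers in a distributed setting.
            val = objective(pos[i])
            if val > best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
            if val > g_val:
                g_pos, g_val = pos[i][:], val
    return g_pos, g_val


best_weights, best_score = pso_maximize(toy_metric, dim=2)
```

The update uses only objective values, never gradients, which is why this family of methods can be applied to non-smooth metrics.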

6 citations

Proceedings Article•DOI•

BLR-D: applying bilinear logistic regression to factored diagnosis problems

Sumit Basu¹, John Dunagan¹, Kevin Duh, Kiran-Kumar Muniswamy-Reddy²•Institutions (2)

Microsoft¹, Harvard University²

23 Oct 2011

TL;DR: This paper presents a means of using a constrained form of bilinear logistic regression for diagnosis in a pattern of diagnosis problems in which each of J entities produces the same K features, yet the authors are only informed of overall faults from the ensemble.

Abstract: In this paper, we address a pattern of diagnosis problems in which each of J entities produces the same K features, yet we are only informed of overall faults from the ensemble. Furthermore, we suspect that only certain entities and certain features are leading to the problem. The task, then, is to reliably identify which entities and which features are at fault. Such problems are particularly prevalent in the world of computer systems, in which a datacenter with hundreds of machines, each with the same performance counters, occasionally produces overall faults. In this paper, we present a means of using a constrained form of bilinear logistic regression for diagnosis in such problems. The bilinear treatment allows us to represent the scenarios with J+K instead of JK parameters, resulting in more easily interpretable results and far fewer false positives compared to treating the parameters independently. We develop statistical tests to determine which features and entities, if any, may be responsible for the labeled faults, and use false discovery rate (FDR) analysis to ensure that our values are meaningful. We show results in comparison to ordinary logistic regression (with L1 regularization) on two scenarios: a synthetic dataset based on a model of faults in a datacenter, and a real problem of finding problematic processes/features based on user-reported hangs.
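
The parameter saving mentioned above can be seen in a small sketch of a generic bilinear logistic model, where one weight per entity and one weight per feature combine multiplicatively. This is an assumption-laden illustration: the function name, the toy data, and the bias term are hypothetical, and the specific constraints, statistical tests, and FDR analysis from the paper are not reproduced.

```python
import math


def bilinear_fault_probability(X, entity_w, feature_w, bias=0.0):
    """sigmoid(sum_j sum_k entity_w[j] * X[j][k] * feature_w[k] + bias)."""
    score = bias
    for j, row in enumerate(X):
        for k, value in enumerate(row):
            score += entity_w[j] * value * feature_w[k]
    return 1.0 / (1.0 + math.exp(-score))


J, K = 3, 4  # 3 entities (e.g. machines), 4 features (e.g. counters)

# One observation: a J x K matrix of feature readings (invented values).
X = [[0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 5.0, 0.0],   # entity 1 shows a large value on feature 2
     [0.0, 0.0, 0.0, 0.0]]

entity_w = [0.0, 1.0, 0.0]        # J weights: only entity 1 is implicated
feature_w = [0.0, 0.0, 1.0, 0.0]  # K weights: only feature 2 is implicated

p_fault = bilinear_fault_probability(X, entity_w, feature_w, bias=-2.0)

n_bilinear = J + K  # 7 weights in the bilinear form
n_full = J * K      # 12 weights if every (entity, feature) pair had its own
```

Because the weights factor into an entity part and a feature part, a nonzero entry in `entity_w` or `feature_w` points directly at a suspect entity or feature, which is the interpretability benefit the abstract refers to.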

5 citations

Extracting Pre-ordering Rules from Chunk-based Dependency Trees for Japanese-to-English Translation.

Xianchao Wu, Katsuhito Sudoh, Kevin Duh, Hajime Tsukada, Masaaki Nagata

19 Sep 2011

Alignment Inference and Bayesian Adaptation for Machine Translation

Kevin Duh, Katsuhito Sudoh, Tomoharu Iwata, Hajime Tsukada

19 Sep 2011

Journal Article•DOI•

Assisting cross-lingual editing in collaborative writing

Ching-man Au Yeung¹, Kevin Duh¹, Masaaki Nagata¹•Institutions (1)

Nippon Telegraph and Telephone¹

01 Apr 2011-ACM Sigweb Newsletter

TL;DR: A framework that can be used to assist authors and editors in collaborative writing to perform cross-lingual editing is described.

Abstract: Wikis have enabled Web users to author and edit documents in a collaborative manner. In many cases such as Wikipedia and Wikibooks, they have been used to host a set of parallel or comparable documents written in different languages. While a wiki provides an environment in which editors can work together efficiently, maintaining a set of multi-lingual documents is still a very demanding task for the editors. When documents in different languages are maintained by different groups of editors, it is even harder for a contributor to determine whether some information is missing and should be translated from one language to another. In this article, we briefly describe a framework that can be used to assist authors and editors in collaborative writing to perform cross-lingual editing.

Journal Article•DOI•

Effectiveness of Automatic Expansion of Training Data for Japanese Word Sense Disambiguation

Sanae Fujita, Kevin Duh, Akinori Fujino, Hirotoshi Taira, Hiroyuki Shindo

01 Jan 2011

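TL;DR: The SemEval-2010 Japanese word sense disambiguation task was used for evaluation. This paper describes a method for improving word sense disambiguation accuracy by automatically expanding the training data.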

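Abstract: This paper describes a method for improving the accuracy of word sense disambiguation by automatically expanding the training data. The SemEval-2010 Japanese word sense disambiguation task was used for evaluation. We first present the results obtained when training only on the distributed training data. We then present the results of attempts to automatically expand the training data using a variety of corpora, including example sentences from dictionaries, sense banks other than the distributed data, and unlabeled corpora. By automatically acquiring training data, we achieved an accuracy of 79.5%. Furthermore, by controlling the upper limit on the amount of added training data according to the difficulty of each target word, we achieved a maximum accuracy of 80.0%.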
