Author

Eiichiro Sumita

Bio: Eiichiro Sumita is an academic researcher from the National Institute of Information and Communications Technology. The author has contributed to research in topics: Machine translation & Rule-based machine translation. The author has an h-index of 40, and has co-authored 401 publications receiving 6385 citations.
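
For reference, the h-index quoted above is the largest h such that h of the author's papers have at least h citations each. A minimal sketch of the computation, with made-up citation counts:

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    h = 0
    for i, c in enumerate(sorted(citations, reverse=True), start=1):
        if c >= i:
            h = i
        else:
            break
    return h

# Example: five papers with these citation counts give an h-index of 3.
print(h_index([10, 8, 5, 2, 1]))  # -> 3
```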


Papers
Proceedings Article
01 May 2002
TL;DR: Introduces a bilingual travel conversation corpus of spoken language and a broad-coverage bilingual basic expression corpus, compares the vocabulary and expression characteristics of the two corpora, and concludes that a much greater variety of expressions is needed.
Abstract: At ATR Spoken Language Translation Research Laboratories, we are building a broad-coverage bilingual corpus to study corpus-based speech translation technologies for the real world. There are three important points to consider in designing and constructing a corpus for future speech translation research. The first is to have a variety of speech samples, with a wide range of pronunciations and speakers. The second is to have data for a variety of situations. The third is to have a variety of expressions. This paper reports our trials and discusses the methodology. First, we introduce a bilingual travel conversation (TC) corpus of spoken languages and a broad-coverage bilingual basic expression (BE) corpus. TC and BE are designed to be complementary. TC is a collection of transcriptions of bilingual spoken dialogues, while BE is a collection of Japanese sentences and their English translations. Whereas TC covers a small domain, BE covers a wide variety of domains. We compare the characteristics of vocabulary and expressions between these two corpora and suggest that we need a much greater variety of expressions. One promising approach might be to collect paraphrases representing various different expressions generated by many people for similar concepts.
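
The vocabulary comparison the abstract describes can be approximated with simple corpus statistics. A minimal sketch, assuming whitespace-tokenized text and hypothetical file names standing in for the TC and BE corpora (not the paper's actual methodology):

```python
from collections import Counter

def vocab(path):
    """Count word types in a corpus stored one sentence per line."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    return counts

tc = vocab("tc.en.txt")   # hypothetical travel-conversation corpus file
be = vocab("be.en.txt")   # hypothetical basic-expression corpus file
overlap = set(tc) & set(be)
print(f"TC types: {len(tc)}, BE types: {len(be)}, shared: {len(overlap)}")
print(f"share of TC vocabulary also in BE: {len(overlap) / len(tc):.1%}")
```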

284 citations

01 Dec 2011
TL;DR: An overview of the Patent Machine Translation Task (PatentMT) at NTCIR-9 is given by describing the test collection, evaluation methods, and evaluation results.
Abstract: This paper gives an overview of the Patent Machine Translation Task (PatentMT) at NTCIR-9 by describing the test collection, evaluation methods, and evaluation results. We organized three patent machine translation subtasks: Chinese to English, Japanese to English, and English to Japanese. For these subtasks, we provided large-scale test collections, including training data, development data and test data. In total, 21 research groups participated and 130 runs were submitted. We conducted human evaluations for adequacy and acceptability, as well as automatic evaluation.
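
As an illustration of the automatic-evaluation side, corpus-level BLEU can be computed with the sacrebleu library (a generic sketch with toy data; PatentMT's official scoring pipeline may have differed):

```python
import sacrebleu  # pip install sacrebleu

# Hypothetical system outputs and references for a patent MT subtask.
hypotheses = ["the valve is closed by the spring .",
              "a semiconductor device is described ."]
references = [["the valve is closed by a spring .",
               "a semiconductor device is disclosed ."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```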

188 citations

Proceedings Article
01 May 2016
TL;DR: Describes the details of ASPEC (Asian Scientific Paper Excerpt Corpus), the first large-scale parallel corpus in the scientific paper domain.
Abstract: In this paper, we describe the details of the ASPEC (Asian Scientific Paper Excerpt Corpus), which is the first large-scale parallel corpus in the scientific paper domain. ASPEC was constructed in the Japanese-Chinese machine translation project conducted between 2006 and 2010 using the Special Coordination Funds for Promoting Science and Technology. It consists of a Japanese-English scientific paper abstract corpus of approximately 3 million parallel sentences (ASPEC-JE) and a Chinese-Japanese scientific paper excerpt corpus of approximately 0.68 million parallel sentences (ASPEC-JC). ASPEC is used as the official dataset for the machine translation evaluation workshop WAT (Workshop on Asian Translation).
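
Corpora like this are commonly distributed as sentence-aligned text. A minimal sketch for iterating over such a corpus, assuming two line-aligned files with hypothetical names (the actual ASPEC release packaging may differ):

```python
from itertools import islice

def parallel_pairs(src_path, tgt_path):
    """Yield (source, target) sentence pairs from two line-aligned files."""
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        for s, t in zip(src, tgt):
            yield s.strip(), t.strip()

# Hypothetical file names standing in for an ASPEC-JE split.
for ja, en in islice(parallel_pairs("train.ja", "train.en"), 3):
    print(ja, "=>", en)
```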

185 citations

Proceedings ArticleDOI
18 Jun 1991
TL;DR: Explains the basic idea of EBMT, implements a prototype system for a translation problem that is difficult for conventional Rule-Based Machine Translation (RBMT), illustrates the experiment in detail, and discusses EBMT's broad applicability to several translation problems that are difficult for RBMT.
Abstract: EBMT (Example-Based Machine Translation) is proposed. EBMT retrieves similar examples (pairs of source phrases, sentences, or texts and their translations) from a database of examples, adapting the examples to translate a new input. EBMT has the following features: (1) It is easily upgraded simply by inputting appropriate examples to the database; (2) It assigns a reliability factor to the translation result; (3) It is accelerated effectively by both indexing and parallel computing; (4) It is robust because of best-match reasoning; and (5) It well utilizes translator expertise. A prototype system has been implemented to deal with a difficult translation problem for conventional Rule-Based Machine Translation (RBMT), i.e., translating Japanese noun phrases of the form "N1 no N2" into English. The system has achieved about a 78% success rate on average. This paper explains the basic idea of EBMT, illustrates the experiment in detail, explains the broad applicability of EBMT to several difficult translation problems for RBMT and discusses the advantages of integrating EBMT with RBMT.
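
To make the retrieve-and-adapt idea concrete, here is a toy best-match retrieval sketch (an illustration only; the actual system used richer similarity measures over "N1 no N2" noun phrases, and the adaptation step that would substitute the new noun is omitted for brevity):

```python
import difflib

# Toy example database: (Japanese source, English translation) pairs.
EXAMPLES = [
    ("kyoto no shashin", "a photograph of Kyoto"),
    ("isu no ashi", "the legs of a chair"),
    ("mokuyoubi no kaigi", "the meeting on Thursday"),
]

def translate(source):
    """Return the translation of the closest example and a reliability factor."""
    scored = [(difflib.SequenceMatcher(None, source, src).ratio(), tgt)
              for src, tgt in EXAMPLES]
    reliability, translation = max(scored)   # best-match reasoning
    return translation, reliability

print(translate("nara no shashin"))  # -> ('a photograph of Kyoto', ~0.77)
```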

180 citations

Proceedings ArticleDOI
01 Sep 2017
TL;DR: Two instance weighting methods, i.e., sentence weighting and domain weighting with a dynamic weight learning strategy, are proposed for NMT domain adaptation, and empirical results show that the proposed methods can substantially improve NMT performance.
Abstract: Instance weighting has been widely applied to phrase-based machine translation domain adaptation. However, it is challenging to apply directly to Neural Machine Translation (NMT), because NMT is not a linear model. In this paper, two instance weighting methods, i.e., sentence weighting and domain weighting with a dynamic weight learning strategy, are proposed for NMT domain adaptation. Empirical results on the IWSLT English-German/French tasks show that the proposed methods can substantially improve NMT performance by 2.7 to 6.7 BLEU points, outperforming the existing baselines by 1.6 to 3.6 BLEU points.
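
The sentence-weighting idea can be expressed as a per-sentence scaling of the training loss. A minimal PyTorch-style sketch of the general technique (my illustration, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def weighted_nmt_loss(logits, targets, sentence_weights, pad_id=0):
    """Cross-entropy summed per sentence, scaled by an instance weight.

    logits: (batch, seq_len, vocab), targets: (batch, seq_len),
    sentence_weights: (batch,), e.g. in-domain sentences get weight > 1.
    """
    token_nll = F.cross_entropy(
        logits.transpose(1, 2), targets, ignore_index=pad_id, reduction="none"
    )                                   # per-token loss, shape (batch, seq_len)
    sent_nll = token_nll.sum(dim=1)     # per-sentence loss, shape (batch,)
    return (sentence_weights * sent_nll).mean()

# Toy shapes: batch of 2 sentences, length 5, vocabulary of 100.
logits = torch.randn(2, 5, 100)
targets = torch.randint(1, 100, (2, 5))
weights = torch.tensor([1.5, 0.5])      # up-weight the in-domain sentence
print(weighted_nmt_loss(logits, targets, weights))
```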

157 citations


Cited by
Journal ArticleDOI
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories.

First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules.

Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs.

Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules.

Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically.

Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).
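
The mail-filtering scenario in the fourth category is easy to sketch with a standard text classifier. A generic scikit-learn illustration with made-up messages (not anything from the paper):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical training set: messages the user kept (0) or rejected (1).
messages = ["meeting moved to 3pm", "win a free prize now",
            "quarterly report attached", "cheap pills limited offer"]
labels = [0, 1, 0, 1]

# Bag-of-words features feeding a naive Bayes classifier.
filter_model = make_pipeline(CountVectorizer(), MultinomialNB())
filter_model.fit(messages, labels)

print(filter_model.predict(["free offer just for you",
                            "agenda for tomorrow's meeting"]))  # expected: [1 0]
```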

13,246 citations

Proceedings ArticleDOI
25 Jun 2007
TL;DR: An open-source toolkit for statistical machine translation whose novel contributions are support for linguistically motivated factors, confusion network decoding, and efficient data formats for translation models and language models.
Abstract: We describe an open-source toolkit for statistical machine translation whose novel contributions are (a) support for linguistically motivated factors, (b) confusion network decoding, and (c) efficient data formats for translation models and language models. In addition to the SMT decoder, the toolkit also includes a wide variety of tools for training, tuning and applying the system to many translation tasks.
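
For context, the toolkit's decoder (Moses) is conventionally run as a command-line filter over tokenized input. A hedged sketch of driving it from Python via a subprocess (hypothetical config path; exact flags depend on the installed version):

```python
import subprocess

def moses_translate(sentences, ini_path="moses.ini"):
    """Pipe tokenized sentences through the Moses decoder, one per line."""
    proc = subprocess.run(
        ["moses", "-f", ini_path],        # decoder binary and its model config
        input="\n".join(sentences),
        capture_output=True, text=True, check=True,
    )
    return proc.stdout.splitlines()

# Assumes a trained model at model/moses.ini and pre-tokenized input.
print(moses_translate(["das ist ein test"], ini_path="model/moses.ini"))
```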

6,008 citations

Journal ArticleDOI
TL;DR: The survey discusses distance functions, smoothing parameters, weighting functions, local model structures, regularization of the estimates and bias, assessing predictions, handling noisy data and outliers, improving the quality of predictions by tuning fit parameters, and applications of locally weighted learning.
Abstract: This paper surveys locally weighted learning, a form of lazy learning and memory-based learning, and focuses on locally weighted linear regression. The survey discusses distance functions, smoothing parameters, weighting functions, local model structures, regularization of the estimates and bias, assessing predictions, handling noisy data and outliers, improving the quality of predictions by tuning fit parameters, interference between old and new data, implementing locally weighted learning efficiently, and applications of locally weighted learning. A companion paper surveys how locally weighted learning can be used in robot learning and control.
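
Locally weighted linear regression, the survey's focus, fits a separate weighted least-squares problem at each query point. A compact NumPy sketch using a Gaussian kernel (one common choice of weighting function) on synthetic 1-D data:

```python
import numpy as np

def lwr_predict(x_query, X, y, bandwidth=0.5):
    """Predict y at x_query via weighted least squares around the query point."""
    # Gaussian kernel weights: nearby training points dominate the local fit.
    w = np.exp(-((X - x_query) ** 2) / (2 * bandwidth ** 2))
    A = np.column_stack([np.ones_like(X), X])      # design matrix [1, x]
    W = np.diag(w)
    # Solve the weighted normal equations: A^T W A beta = A^T W y.
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    return beta[0] + beta[1] * x_query

rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 50)
y = np.sin(X) + 0.1 * rng.standard_normal(50)
print(lwr_predict(1.0, X, y))   # close to sin(1.0) ~= 0.84
```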

1,863 citations

01 Jan 2006
TL;DR: This book discusses the development of English as a global language in the 20th Century and some of the aspects of its development that have changed since the publication of the first edition.
Abstract: (Book front matter.) A catalogue record for this book is available from the British Library. ISBN 0 521 82347 1 hardback; ISBN 0 521 53032 6 paperback. Contents: List of tables; Preface to the second edition; Preface to the first edition; 1. Why a global language? (What is a global language?; What makes a global language?; Why do we need a global language?; What are the dangers of a global language?; Could anything stop a global language?; A critical era); 2. Why English? The historical context (Origins; America; Canada; The Caribbean; Australia and New Zealand; South Africa; South Asia; Former colonial Africa; Southeast Asia and the South Pacific; A world view).

1,857 citations

Proceedings Article
07 Dec 2015
TL;DR: This article used the continuity of text from books to train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage, which can produce highly generic sentence representations that are robust and perform well in practice.
Abstract: We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage. Sentences that share semantic and syntactic properties are thus mapped to similar vector representations. We next introduce a simple vocabulary expansion method to encode words that were not seen as part of training, allowing us to expand our vocabulary to a million words. After training our model, we extract and evaluate our vectors with linear models on 8 tasks: semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification and 4 benchmark sentiment and subjectivity datasets. The end result is an off-the-shelf encoder that can produce highly generic sentence representations that are robust and perform well in practice.
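
The vocabulary expansion step can be pictured as a linear map from a large pre-trained word-embedding space into the encoder's own embedding space, fit by least squares on the words the two vocabularies share. A schematic NumPy sketch with random stand-in matrices and toy dimensions (not the released code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: 2000 shared words with 300-d pre-trained word vectors
# and 620-d encoder embeddings learned during training.
w2v_shared = rng.standard_normal((2000, 300))
enc_shared = rng.standard_normal((2000, 620))

# Fit W minimizing ||w2v_shared @ W - enc_shared||^2 over the shared words.
W, *_ = np.linalg.lstsq(w2v_shared, enc_shared, rcond=None)

# A word unseen in training but present in the pre-trained vocabulary can
# now be mapped into the encoder's space, expanding the usable vocabulary.
unseen_word_vec = rng.standard_normal(300)
projected = unseen_word_vec @ W           # 620-d embedding for the new word
print(projected.shape)                    # (620,)
```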

1,802 citations