Author

Steve Finch

Bio: Steve Finch is an academic researcher from West. The author has an h-index of 2 and has co-authored 2 publications that have received 123 citations.

Papers
Report
12 Apr 2000
TL;DR: A model of statistical word-level mapping for comparable corpora is presented. Preliminary results are >92% accurate, suggesting the model is feasible, but it needs improvement and cross-linguistic testing before its significance can be assessed.
Abstract: In this paper, we present a model of statistical word-level mapping for comparable corpora. The approach is based on the assumption that if two terms have close distributional profiles, their corresponding translations' distributional profiles should be close in a comparable corpus. The proposed model is described. A preliminary investigation on intralanguage comparable corpora is laid out. The preliminary results are >92% accurate, suggesting the feasibility of the model. The model needs to undergo some improvements and should be tested cross-linguistically before assessing its significance.
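Illustration: the assumption about "close distributional profiles" can be made concrete with a small sketch. This is not the paper's model, which the abstract does not specify. It assumes a profile is a bag of neighbouring words inside a fixed window and that closeness is cosine similarity. The helper names `context_profile` and `cosine`, the window size, and the two toy corpora are all hypothetical.

```python
from collections import Counter
from math import sqrt


def context_profile(tokens, target, window=2):
    """Count the words occurring within `window` positions of each `target` occurrence."""
    profile = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                profile[tokens[j]] += 1
    return profile


def cosine(p, q):
    """Cosine similarity between two sparse count profiles (0.0 if either is empty)."""
    dot = sum(p[w] * q[w] for w in p.keys() & q.keys())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


# Toy "comparable corpora" (assumed data, same language, different wording).
corpus_a = "the doctor treated the patient in the hospital".split()
corpus_b = "the physician treated the patient in the clinic".split()

doctor = context_profile(corpus_a, "doctor")
sim_same = cosine(doctor, context_profile(corpus_b, "physician"))
sim_diff = cosine(doctor, context_profile(corpus_b, "clinic"))

# A term is mapped to the candidate whose profile is closest to its own.
best = max(["physician", "clinic"],
           key=lambda w: cosine(doctor, context_profile(corpus_b, w)))
```

On this toy data, "doctor" and "physician" share the same neighbours, so `best` selects "physician" over "clinic". A realistic setup would need far larger corpora and some weighting of the raw counts.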

114 citations


Cited by
Book
18 Jan 2010
TL;DR: This introductory text to statistical machine translation (SMT) provides all of the theories and methods needed to build a statistical machine translator, such as Google Language Tools and Babelfish, and the companion website provides open-source corpora and tool-kits.
Abstract: This introductory text to statistical machine translation (SMT) provides all of the theories and methods needed to build a statistical machine translator, such as Google Language Tools and Babelfish. In general, statistical techniques allow automatic translation systems to be built quickly for any language-pair using only translated texts and generic software. With increasing globalization, statistical machine translation will be central to communication and commerce. Based on courses and tutorials, and classroom-tested globally, it is ideal for instruction or self-study, for advanced undergraduates and graduate students in computer science and/or computational linguistics, and researchers in natural language processing. The companion website provides open-source corpora and tool-kits.

1,538 citations

Proceedings Article
06 Jul 2001
TL;DR: This model transforms a source-language parse tree into a target-language string by applying stochastic operations at each node, and produces word alignments that are better than those produced by IBM Model 5.
Abstract: We present a syntax-based statistical translation model. Our model transforms a source-language parse tree into a target-language string by applying stochastic operations at each node. These operations capture linguistic differences such as word order and case marking. Model parameters are estimated in polynomial time using an EM algorithm. The model produces word alignments that are better than those produced by IBM Model 5.

924 citations

Journal Article
TL;DR: A maximum entropy classifier is trained that, given a pair of sentences, can reliably determine whether or not they are translations of each other and can be applied with great benefit to language pairs for which only scarce resources are available.
Abstract: We present a novel method for discovering parallel sentences in comparable, non-parallel corpora. We train a maximum entropy classifier that, given a pair of sentences, can reliably determine whether or not they are translations of each other. Using this approach, we extract parallel data from large Chinese, Arabic, and English non-parallel newspaper corpora. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system. We also show that a good-quality MT system can be built from scratch by starting with a very small parallel corpus (100,000 words) and exploiting a large non-parallel corpus. Thus, our method can be applied with great benefit to language pairs for which only scarce resources are available.

471 citations

Proceedings Article
06 Jul 2002
TL;DR: An unsupervised method for word sense disambiguation that exploits translation correspondences in parallel corpora is presented, using pseudo-translations, created by machine translation systems, in order to make possible the evaluation of the approach against a standard test set.
Abstract: We present an unsupervised method for word sense disambiguation that exploits translation correspondences in parallel corpora. The technique takes advantage of the fact that cross-language lexicalizations of the same concept tend to be consistent, preserving some core element of its semantics, and yet also variable, reflecting differing translator preferences and the influence of context. Working with parallel corpora introduces an extra complication for evaluation, since it is difficult to find a corpus that is both sense tagged and parallel with another language; therefore we use pseudo-translations, created by machine translation systems, in order to make possible the evaluation of the approach against a standard test set. The results demonstrate that word-level translation correspondences are a valuable source of information for sense disambiguation.

250 citations

Proceedings Article
12 Jul 2002
TL;DR: This paper presents work on the task of constructing a word-level translation lexicon purely from unrelated monolingual corpora and combines various clues such as cognates, similar context, preservation of word similarity, and word frequency to create a German-English noun lexicon.
Abstract: This paper presents work on the task of constructing a word-level translation lexicon purely from unrelated monolingual corpora. We combine various clues such as cognates, similar context, preservation of word similarity, and word frequency. Experimental results for the construction of a German-English noun lexicon are reported. Noun translation accuracy of 39% scored against a parallel test corpus could be achieved.

245 citations