Proceedings ArticleDOI

An algorithm for the better assessment of machine translation

01 May 2017, pp. 395-399
TL;DR: An algorithm that incorporates language-model modules such as synonym replacement, root-word extraction and shallow parsing, and that, when applied to English-to-Hindi translation, gives better evaluation results than algorithms that do not incorporate these modules.
Abstract: Machine Translation, often referred to by the acronym MT, is an important field of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another. At its most basic level, MT performs simple substitution of atomic words in one natural language with words in another. Numerous systems are available around the world for assessing the translations produced by various translation systems. Within India too, a large number of such evaluation systems are available, and a great deal of research is still going on to develop a better evaluation system that can match the results produced by human evaluators. The main challenge before Indian researchers is that evaluation systems which give excellent results for translations of foreign languages (such as German, French, Chinese, etc.) do not give comparable results for translations of Indian languages (Hindi, Tamil, Telugu, Punjabi, etc.). Hence these evaluation systems cannot be applied as-is to evaluate machine translations of Indian languages. Indian languages require a novel approach because of the relatively unrestricted order of words within a word group. In this paper, we present an algorithm (incorporating language-model modules such as synonym replacement, root-word extraction and shallow parsing) which, when applied to English-to-Hindi translation, gives better evaluation results than algorithms that do not incorporate these modules. Our study is limited to the English-Hindi language pair, and testing is carried out on a corpus from the agriculture domain.
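As a rough illustration of the kind of pipeline described above, the sketch below normalizes candidate and reference tokens with a hypothetical synonym table and a crude suffix-stripping "root word" step before computing a unigram overlap score. Every resource and name here is an invented stand-in, not the authors' implementation, and the shallow-parsing module is omitted for brevity.

```python
SYNONYMS = {          # hypothetical Hindi synonym map: surface form -> canonical form
    "जल": "पानी",
    "कृषक": "किसान",
}

def normalize(tokens):
    """Map tokens to canonical synonyms, then strip a few toy inflectional endings."""
    canon = [SYNONYMS.get(t, t) for t in tokens]
    suffixes = ("ों", "ें", "ओं")          # toy "root word extraction"
    roots = []
    for t in canon:
        for s in suffixes:
            if t.endswith(s) and len(t) > len(s) + 1:
                t = t[: -len(s)]
                break
        roots.append(t)
    return roots

def overlap_score(candidate, reference):
    """Unigram F1 between the normalized candidate and reference token lists."""
    cand, ref = normalize(candidate), normalize(reference)
    common = sum(min(cand.count(w), ref.count(w)) for w in set(cand))
    if common == 0:
        return 0.0
    p, r = common / len(cand), common / len(ref)
    return 2 * p * r / (p + r)

cand = "किसान खेत में जल देता है".split()
ref = "कृषक खेत में पानी देता है".split()
print(overlap_score(cand, ref))   # 1.0: synonym and root normalization bridge the surface differences
```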
Citations
Book ChapterDOI
01 Jan 2023
TL;DR: This paper used machine translation evaluation metrics such as TER (Translation Error Rate), METEOR (Metric for Evaluation of Translation with Explicit Ordering), BLEU (Bilingual Evaluation Understudy), and NIST (National Institute of Standards and Technology) for English-Hindi translation.
Abstract: Language is the primary mode of communication, and communication is the only way to convey our thoughts and emotions to others. However, there are many languages we cannot speak, and we cannot learn all of them quickly, which is why machine translation systems were invented to help us communicate with anyone from anywhere. Researchers started working on machine translation systems in the 1950s and have since developed various techniques to make communication easy. Machine Translation Evaluation (MTE) methodology checks the accuracy of the translations produced by machine translation tools. It is extremely important because, while a translation system is being developed, it constantly checks the performance and helps us make appropriate changes to the system to improve accuracy. Mostly this is done by comparing the output of machine translation systems with translation(s) produced by human beings, although there are also techniques that do not require any reference sentences. In our project, we use four automated machine translation evaluation metrics, namely TER (Translation Error Rate), METEOR (Metric for Evaluation of Translation with Explicit Ordering), BLEU (Bilingual Evaluation Understudy), and NIST (National Institute of Standards and Technology), for English-Hindi translation.
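A minimal sketch of how the four metrics named in this work could be computed for a single English-to-Hindi hypothesis/reference pair, assuming the third-party packages sacrebleu (BLEU, TER) and nltk (NIST, METEOR) are installed. The sentence pair is invented; depending on the nltk version, meteor_score may expect raw strings rather than token lists and needs the WordNet data downloaded.

```python
import sacrebleu
from nltk.translate.nist_score import sentence_nist
from nltk.translate.meteor_score import meteor_score

hypothesis = "किसान खेत में पानी दे रहा है"
reference = "कृषक खेत में जल दे रहा है"

# BLEU and TER over a (one-sentence) corpus; sacrebleu handles tokenization itself.
bleu = sacrebleu.corpus_bleu([hypothesis], [[reference]])
ter = sacrebleu.corpus_ter([hypothesis], [[reference]])

# NIST and METEOR from nltk work on pre-tokenized sentences.
hyp_tok, ref_tok = hypothesis.split(), reference.split()
nist = sentence_nist([ref_tok], hyp_tok, n=3)   # keep the n-gram order low for short sentences
meteor = meteor_score([ref_tok], hyp_tok)       # exact/stem/synonym matching; WordNet is English-centric

print(f"BLEU={bleu.score:.1f}  TER={ter.score:.1f}  NIST={nist:.3f}  METEOR={meteor:.3f}")
```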
References
Proceedings ArticleDOI
06 Jul 2002
TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
Abstract: Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that cannot be reused. We propose a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run. We present this method as an automated understudy to skilled human judges which substitutes for them when there is need for quick or frequent evaluations.
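The method summarized here (BLEU) scores a candidate by clipped n-gram precision against the references, combined through a geometric mean with a brevity penalty. Below is a deliberately simplified, illustrative reimplementation; the name toy_bleu and the smoothing floor are inventions of this sketch, and real evaluations should rely on an established implementation.

```python
import math
from collections import Counter

def toy_bleu(candidate, references, max_n=4):
    """Simplified BLEU: clipped n-gram precisions, geometric mean, brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        max_ref = Counter()
        for ref in references:
            for g, c in Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(max(clipped, 1e-9) / total)   # floor avoids log(0) on toy inputs
    ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / max(len(candidate), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

cand = "the cat is on the mat".split()
refs = ["the cat sat on the mat".split()]
print(toy_bleu(cand, refs))   # near zero: the candidate shares no 4-gram with the reference
```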

21,126 citations

Proceedings ArticleDOI
24 Mar 2002
TL;DR: DARPA commissioned NIST to develop an MT evaluation facility based on the IBM work; this utility is now available from NIST and serves as the primary evaluation measure for TIDES MT research.
Abstract: Evaluation is recognized as an extremely helpful forcing function in Human Language Technology R&D. Unfortunately, evaluation has not been a very powerful tool in machine translation (MT) research because it requires human judgments and is thus expensive and time-consuming and not easily factored into the MT research agenda. However, at the July 2001 TIDES PI meeting in Philadelphia, IBM described an automatic MT evaluation technique that can provide immediate feedback and guidance in MT research. Their idea, which they call an "evaluation understudy", compares MT output with expert reference translations in terms of the statistics of short sequences of words (word N-grams). The more of these N-grams that a translation shares with the reference translations, the better the translation is judged to be. The idea is elegant in its simplicity. But far more important, IBM showed a strong correlation between these automatically generated scores and human judgments of translation quality. As a result, DARPA commissioned NIST to develop an MT evaluation facility based on the IBM work. This utility is now available from NIST and serves as the primary evaluation measure for TIDES MT research.
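The refinement the NIST measure adds to plain n-gram matching is that each shared n-gram is weighted by its information content in the reference data, info(w1..wn) = log2(count(w1..w_{n-1}) / count(w1..wn)). The toy below computes only that weight over a tiny invented reference set; it is a hedged illustration, not the NIST scoring tool itself.

```python
import math
from collections import Counter

def ngram_counts(sentences, n):
    """Count all word n-grams of order n across a list of tokenized sentences."""
    counts = Counter()
    for s in sentences:
        counts.update(tuple(s[i:i + n]) for i in range(len(s) - n + 1))
    return counts

def info_weight(ngram, ref_sentences):
    """log2(count of the (n-1)-gram prefix / count of the full n-gram) over the references."""
    n = len(ngram)
    full = ngram_counts(ref_sentences, n)[ngram]
    if full == 0:
        return 0.0
    if n == 1:
        prefix = sum(len(s) for s in ref_sentences)       # unigram case: total word count
    else:
        prefix = ngram_counts(ref_sentences, n - 1)[ngram[:-1]]
    return math.log2(prefix / full)

refs = ["the farmer waters the field".split(),
        "the farmer ploughs the field".split(),
        "the river floods the field".split()]
print(info_weight(("the", "farmer"), refs))   # ~1.585
print(info_weight(("the", "field"), refs))    # 1.0: a more predictable continuation carries less information
```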

1,734 citations

Proceedings Article
01 Jan 2006
TL;DR: It is found that NIST scores correlate best with human judgments, but that all automatic metrics the authors examined are biased in favour of generators that select on the basis of frequency alone.
Abstract: We consider the evaluation problem in Natural Language Generation (NLG) and present results for evaluating several NLG systems with similar functionality, including a knowledge-based generator and several statistical systems. We compare evaluation results for these systems by human domain experts, human non-experts, and several automatic evaluation metrics, including NIST, BLEU, and ROUGE. We find that NIST scores correlate best (>0.8) with human judgments, but that all automatic metrics we examined are biased in favour of generators that select on the basis of frequency alone. We conclude that automatic evaluation of NLG systems has considerable potential, in particular where high-quality reference texts and only a small number of human evaluators are available. However, in general it is probably best for automatic evaluations to be supported by human based evaluations, or at least by studies that demonstrate that a particular metric correlates well with human judgments in a given domain.
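The comparison reported here reduces to correlating each metric's scores with human ratings across systems. A dependency-free Pearson correlation sketch follows; both score vectors are invented for illustration.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

human  = [3.2, 4.1, 2.5, 4.6, 3.8]        # hypothetical mean human ratings, one per system
metric = [0.31, 0.42, 0.22, 0.47, 0.35]   # hypothetical automatic scores for the same systems
print(round(pearson(human, metric), 3))   # values near 1.0 mean the metric tracks human judgments
```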

211 citations

Book ChapterDOI
28 Sep 2004
TL;DR: This work shows that correlation with human judgments is highest when almost all of the weight is assigned to recall, and shows that stemming is significantly beneficial not just to simpler unigram precision and recall based metrics, but also to BLEU and NIST.
Abstract: Recent research has shown that a balanced harmonic mean (F1 measure) of unigram precision and recall outperforms the widely used BLEU and NIST metrics for Machine Translation evaluation in terms of correlation with human judgments of translation quality. We show that significantly better correlations can be achieved by placing more weight on recall than on precision. While this may seem unexpected, since BLEU and NIST focus on n-gram precision and disregard recall, our experiments show that correlation with human judgments is highest when almost all of the weight is assigned to recall. We also show that stemming is significantly beneficial not just to simpler unigram precision and recall based metrics, but also to BLEU and NIST.
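The finding can be expressed as a weighted harmonic mean of unigram precision P and recall R, F_alpha = P*R / (alpha*P + (1-alpha)*R), where alpha = 0.5 recovers the balanced F1 and alpha close to 1 puts almost all of the weight on recall. This is a common parameterization sketched for illustration; the paper's exact constants may differ.

```python
def weighted_f(precision, recall, alpha=0.9):
    """Recall-weighted harmonic mean; alpha=0.5 is F1, alpha near 1 is almost pure recall."""
    if precision == 0 or recall == 0:
        return 0.0
    return precision * recall / (alpha * precision + (1 - alpha) * recall)

print(round(weighted_f(0.7, 0.5, alpha=0.5), 3))  # 0.583, the balanced F1
print(round(weighted_f(0.7, 0.5, alpha=0.9), 3))  # 0.515, pulled toward recall
```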

128 citations

Journal ArticleDOI
TL;DR: An EBMT system that can be dubbed the “purest ever built” is designed, implemented and assessed: it strictly does not make any use of variables, templates or patterns, does not have any explicit transfer component, and does not require any preprocessing or training of the aligned examples.
Abstract: We have designed, implemented and assessed an EBMT system that can be dubbed the "purest ever built": it strictly does not make any use of variables, templates or patterns, does not have any explicit transfer component, and does not require any preprocessing or training of the aligned examples. It uses only a specific operation, proportional analogy, that implicitly neutralizes divergences between languages and captures lexical and syntactic variations along the paradigmatic and syntagmatic axes without explicitly decomposing sentences into fragments. Exactly the same genuine implementation of such a core engine was evaluated on different tasks and language pairs. To begin with, we compared our system on two tasks of a previous MT evaluation campaign to rank it among other current state-of-the-art systems. Then, we illustrated the "universality" of our system by participating in a recent MT evaluation campaign, with exactly the same core engine, for a wide variety of language pairs. Finally, we studied the influence of extra data like dictionaries and paraphrases on the system performance.
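The core operation described here, proportional analogy, solves "a is to b as c is to d" directly on strings. The toy solver below handles only a single prefix/suffix/infix rewrite and is invented purely for illustration; the system's actual analogy engine is far more general.

```python
def solve_analogy(a, b, c):
    """Toy solver for 'a : b :: c : ?' restricted to one simple rewrite."""
    p = 0                                   # longest common prefix of a and b
    while p < min(len(a), len(b)) and a[p] == b[p]:
        p += 1
    s = 0                                   # longest common suffix of the remainders
    while s < min(len(a), len(b)) - p and a[len(a) - 1 - s] == b[len(b) - 1 - s]:
        s += 1
    a_core, b_core = a[p:len(a) - s], b[p:len(b) - s]
    if a_core:                              # substitution: swap a's changed core for b's
        return c.replace(a_core, b_core, 1) if a_core in c else None
    if s == 0:                              # insertion at the end, e.g. walk : walked
        return c + b_core
    if p == 0:                              # insertion at the start, e.g. happy : unhappy
        return b_core + c
    return None                             # anything fancier is outside this toy

print(solve_analogy("walk", "walked", "talk"))        # talked
print(solve_analogy("unhappy", "happy", "unlucky"))   # lucky
print(solve_analogy("happy", "unhappy", "lucky"))     # unlucky
```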

103 citations