Showing papers on "Computer-assisted translation" published in 2017


Journal ArticleDOI
TL;DR: This work proposes a simple solution to use a single Neural Machine Translation (NMT) model to translate between multiple languages using a shared wordpiece vocabulary, and introduces an artificial token at the beginning of the input sentence to specify the required target language.
Abstract: We propose a simple solution to use a single Neural Machine Translation (NMT) model to translate between multiple languages. Our solution requires no changes to the model architecture from a standard NMT system but instead introduces an artificial token at the beginning of the input sentence to specify the required target language. Using a shared wordpiece vocabulary, our approach enables Multilingual NMT using a single model. On the WMT’14 benchmarks, a single multilingual model achieves comparable performance for English→French and surpasses state-of-the-art results for English→German. Similarly, a single multilingual model surpasses state-of-the-art results for French→English and German→English on WMT’14 and WMT’15 benchmarks, respectively. On production corpora, multilingual models of up to twelve language pairs allow for better translation of many individual pairs. Our models can also learn to perform implicit bridging between language pairs never seen explicitly during training, showing that transfer learning and zero-shot translation are possible for neural translation. Finally, we show analyses that hint at a universal interlingua representation in our models and show some interesting examples when mixing languages.

1,288 citations
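
The target-language token described in the abstract above can be illustrated with a few lines of Python. This is only a sketch of the input preprocessing, not of the NMT model itself; the function name `add_target_token` and the exact `<2xx>` token spelling are assumptions made for illustration.

```python
def add_target_token(source_sentence: str, target_lang: str) -> str:
    """Prepend an artificial token that names the desired target language."""
    # The "<2xx>" token format is an assumption made for this sketch.
    return f"<2{target_lang}> {source_sentence}"


# The same source sentence is routed to different target languages
# purely by changing the leading token; the model input is otherwise unchanged.
examples = [
    add_target_token("Hello, how are you?", "es"),
    add_target_token("Hello, how are you?", "de"),
]
```

Because the language choice lives entirely in the input text, a single model with a shared vocabulary can in principle serve every language pair without architectural changes.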


Journal ArticleDOI
TL;DR: The first attention-based neural-MT for multi-way, multilingual translation is proposed and it outperforms strong conventional statistical machine translation systems on Turkish-English and Uzbek-English by incorporating the resources of other language pairs.

85 citations


Proceedings ArticleDOI
01 Sep 2017
TL;DR: This work proposes to use lexical formality models to control the formality level of machine translation output and demonstrates the effectiveness of this approach in empirical evaluations, as measured by automatic metrics and human assessments.
Abstract: Stylistic variations of language, such as formality, carry speakers’ intention beyond literal meaning and should be conveyed adequately in translation. We propose to use lexical formality models to control the formality level of machine translation output. We demonstrate the effectiveness of our approach in empirical evaluations, as measured by automatic metrics and human assessments.

70 citations


Journal ArticleDOI
TL;DR: For translation graduates to serve as professional post-editors in the language industry, content must be embedded in multiple courses across the curriculum, rather than concentrating the material in a stand-alone course or module.
Abstract: Graduates of translation programmes increasingly encounter machine translation in the language industry. In response to this identified market need, translation education programmes have begun to i...

44 citations


Posted Content
TL;DR: It is found that while SMT remains the best option for low-resource settings, this method can produce acceptable translations with only 70000 tokens of training data, a level where the baseline NMT system fails completely.
Abstract: Neural machine translation (NMT) approaches have improved the state of the art in many machine translation settings over the last couple of years, but they require large amounts of training data to produce sensible output. We demonstrate that NMT can be used for low-resource languages as well, by introducing more local dependencies and using word alignments to learn sentence reordering during translation. In addition to our novel model, we also present an empirical evaluation of low-resource phrase-based statistical machine translation (SMT) and NMT to investigate the lower limits of the respective technologies. We find that while SMT remains the best option for low-resource settings, our method can produce acceptable translations with only 70000 tokens of training data, a level where the baseline NMT system fails completely.

37 citations


Journal ArticleDOI
TL;DR: This paper describes an ongoing project to compile a 10-million-word Arabic–English parallel corpus as a resource for translation training and language teaching; the bidirectional corpus can be used to compare translated and source language and identify differences.
Abstract: Parallel corpora can be defined as collections of aligned, translated texts of two or more languages. They play a major role in translation and contrastive studies, and are also becoming popular in translation training and language teaching, with the advent of the data-driven learning (DDL) approach. Despite their significance, however, Arabic seems to lack a satisfactory general-use parallel corpus resource. The literature describes few Arabic–English parallel corpora, and these few are usually inaccurate and/or expensive. Some are small in size, while others are restricted in terms of genre, failing to meet the requirements of many academics and researchers. This paper describes an ongoing project at the College of Languages and Translation, King Saud University, to compile a 10-million-word Arabic–English parallel corpus to be used as a resource for translation training and language teaching. The bidirectional corpus can be used to compare translated and source language and identify differences. The corpus has been manually verified at different stages, including translation, text segmentation, alignment, and file preparation; it is available as full-text in XML format and through a user-friendly web interface that provides a concordancer to support bilingual search queries and several filtering options.

36 citations


Journal ArticleDOI
TL;DR: A definition of translation literality is developed that is based on the syntactic and semantic similarity of the source and the target texts and it is found that non-literality makes from-scratch translation and post-editing difficult.
Abstract: The paper develops a definition of translation literality that is based on the syntactic and semantic similarity of the source and the target texts. We provide theoretical and empirical evidence that absolute literal translations are easy to produce. Based on a multilingual corpus of alternative translations we investigate the effects of cross-lingual syntactic and semantic distance on translation production times and find that non-literality makes from-scratch translation and post-editing difficult. We show that statistical machine translation systems encounter even more difficulties with non-literality.

34 citations


OtherDOI
18 Feb 2017

29 citations


Proceedings ArticleDOI
01 Sep 2017
TL;DR: In order to take Machine Translation to another level, it will need to judge output not based on a single reference translation, but based on notions of fluency and of adequacy – ideally with reference to the source text.
Abstract: As the quality of Machine Translation (MT) improves, research on improving discourse in automatic translations becomes more viable. This has resulted in an increase in the amount of work on discourse in MT. However, many of the existing models and metrics have yet to integrate these insights. Part of this is due to the evaluation methodology, based as it is largely on matching to a single reference. At a time when MT is increasingly being used in a pipeline for other tasks, the semantic element of the translation process needs to be properly integrated into the task. Moreover, in order to take MT to another level, it will need to judge output not based on a single reference translation, but based on notions of fluency and of adequacy – ideally with reference to the source text.

29 citations


Journal ArticleDOI
TL;DR: This work evaluates the proposed framework that, taking as input a small set of parallel documents, gathers domain-specific bilingual terms and injects them into an SMT system to enhance translation quality, and compares two terminology injection methods that can be easily used at run-time without altering the normal activity of an SMT system.
Abstract: This work focuses on the extraction and integration of automatically aligned bilingual terminology into a Statistical Machine Translation (SMT) system in a Computer Aided Translation scenario. We evaluate the proposed framework that, taking as input a small set of parallel documents, gathers domain-specific bilingual terms and injects them into an SMT system to enhance translation quality. Therefore, we investigate several strategies to extract and align terminology across languages and to integrate it in an SMT system. We compare two terminology injection methods that can be easily used at run-time without altering the normal activity of an SMT system: XML markup and cache-based model. We test the cache-based model on two different domains (information technology and medical) in English, Italian and German, showing significant improvements ranging from 2.23 to 6.78 BLEU points over a baseline SMT system and from 0.05 to 3.03 compared to the widely-used XML markup approach.

20 citations


Journal ArticleDOI
Łucja Biel
TL;DR: How comparable corpora may be used in the classroom to increase the communicative dimension of legal translations, an aspect which tends to be neglected in training, is demonstrated.
Abstract: The objective of this paper is to demonstrate how comparable corpora may be used in the classroom to increase the communicative dimension of legal translations, an aspect which tends to be neglecte...

Journal ArticleDOI
TL;DR: This work presents one of these new interactive protocols, which allows the user to validate all correct word sequences in a translation hypothesis, and compares it against the classical prefix-based approach.
Abstract: Machine translation systems require human revision to obtain high-quality translations. Interactive methods provide an efficient human–computer collaboration, notably increasing productivity. Recently, new interactive protocols have been proposed, seeking a more effective user interaction with the system. In this work, we present one of these new protocols, which allows the user to validate all correct word sequences in a translation hypothesis. Thus, the left-to-right barrier from most of the existing protocols is broken. We compare this protocol against the classical prefix-based approach, obtaining a significant reduction of the user effort in a simulated environment. Additionally, we experiment with the use of confidence measures to select the word the user should correct at each iteration, reaching the conclusion that the order in which words are corrected does not affect the overall effort.
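
The difference between the two protocols in the abstract above can be sketched for a simulated user who compares a hypothesis against a reference translation. This is a rough illustration, not the authors' simulation: the function names are hypothetical, and using `difflib.SequenceMatcher` to find the correct word sequences is an assumption of this sketch.

```python
from difflib import SequenceMatcher


def correct_prefix(hypothesis, reference):
    """Prefix-based protocol: only the longest correct prefix can be validated."""
    prefix = []
    for hyp_word, ref_word in zip(hypothesis, reference):
        if hyp_word != ref_word:
            break
        prefix.append(hyp_word)
    return prefix


def correct_segments(hypothesis, reference):
    """Segment-based protocol: correct word sequences anywhere in the hypothesis."""
    matcher = SequenceMatcher(None, hypothesis, reference, autojunk=False)
    return [
        hypothesis[block.a:block.a + block.size]
        for block in matcher.get_matching_blocks()
        if block.size > 0  # the final block is a zero-length sentinel
    ]


hyp = "the house green is very big".split()
ref = "the green house is very big".split()
prefix = correct_prefix(hyp, ref)
segments = correct_segments(hyp, ref)
```

In this toy case the prefix-based user can validate only the first word, while the segment-based user can also keep the correct tail of the hypothesis, which is the intuition behind the reported reduction in user effort.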

Proceedings ArticleDOI
01 Jul 2017
TL;DR: This work built and released the first evaluation corpus for Japanese paraphrase identification, which comprises 655 sentence pairs, and proposes a novel sentential paraphrase acquisition method, which focuses on acquiring both non-trivial positive and negative instances.
Abstract: We propose a novel sentential paraphrase acquisition method. To build a well-balanced corpus for Paraphrase Identification, we especially focus on acquiring both non-trivial positive and negative instances. We use multiple machine translation systems to generate positive candidates and a monolingual corpus to extract negative candidates. To collect non-trivial instances, the candidates are uniformly sampled by word overlap rate. Finally, annotators judge whether the candidates are either positive or negative. Using this method, we built and released the first evaluation corpus for Japanese paraphrase identification, which comprises 655 sentence pairs.
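
Sampling candidates uniformly by word overlap rate, as mentioned in the abstract above, might look like the following sketch. Several details are assumptions: the overlap rate is taken to be the Jaccard overlap of word sets, sentences are assumed to be whitespace-tokenized already, and the bin count and per-bin sample size are arbitrary illustrative parameters.

```python
import random
from collections import defaultdict


def word_overlap_rate(sent_a, sent_b):
    """Jaccard overlap between the word sets of two sentences (assumed definition)."""
    words_a, words_b = set(sent_a.split()), set(sent_b.split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def sample_uniformly_by_overlap(pairs, num_bins=5, per_bin=2, seed=0):
    """Bucket candidate pairs by overlap rate, then draw up to per_bin from each bucket."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for pair in pairs:
        rate = word_overlap_rate(*pair)
        # A rate of exactly 1.0 is placed in the last bin.
        index = min(int(rate * num_bins), num_bins - 1)
        bins[index].append(pair)
    sampled = []
    for index in sorted(bins):
        bucket = bins[index]
        sampled.extend(rng.sample(bucket, min(per_bin, len(bucket))))
    return sampled
```

Drawing the same number of pairs from every overlap band keeps high-overlap and low-overlap candidates equally represented, so that neither trivially similar nor trivially different pairs dominate the pool sent to annotators.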

Posted Content
TL;DR: This work describes the authors' recently developed neural machine translation (NMT) system and benchmarks it against their own statistical machine translation (SMT) system as well as two other general purpose online engines (statistical and neural).
Abstract: We describe our recently developed neural machine translation (NMT) system and benchmark it against our own statistical machine translation (SMT) system as well as two other general purpose online engines (statistical and neural). We present automatic and human evaluation results of the translation output provided by each system. We also analyze the effect of sentence length on the quality of output for SMT and NMT systems.

Posted Content
07 Sep 2017
TL;DR: A clear advantage of NMT methods over SMT in domain adaptation and terminology injection is observed; nevertheless, because of the specific and unique terminological expressions, subword segmentation within NMT does not outperform a word-based neural translation model.
Abstract: Our work presented in this paper focuses on the translation of domain-specific expressions represented in semantically structured resources, like ontologies or knowledge graphs. To make knowledge accessible beyond language borders, these resources need to be translated into different languages. The challenge of translating labels or terminological expressions represented in ontologies lies in the highly specific vocabulary and the lack of contextual information, which can guide a machine translation system to translate ambiguous words into the targeted domain. Due to these challenges, we train and translate the terminological expressions in the medical and financial domain with statistical as well as with neural machine translation methods. We evaluate the translation quality of domain-specific expressions with translation systems trained on a generic dataset and experiment with domain adaptation using terminological expressions. Furthermore, we perform experiments on the injection of external knowledge into the translation systems. Through these experiments, we observed a clear advantage in domain adaptation and terminology injection of NMT methods over SMT. Nevertheless, because of the specific and unique terminological expressions, subword segmentation within NMT does not outperform a word-based neural translation model.

Posted Content
TL;DR: This paper proposes a method that enables NMT to translate patent sentences comprising a large vocabulary of technical terms, and trains an NMT system on bilingual data wherein technical terms are replaced with technical term tokens; this allows it to translate most of the source sentences except technical terms.
Abstract: Neural machine translation (NMT), a new approach to machine translation, has achieved promising results comparable to those of traditional approaches such as statistical machine translation (SMT). Despite its recent success, NMT cannot handle a larger vocabulary because training complexity and decoding complexity proportionally increase with the number of target words. This problem becomes even more serious when translating patent documents, which contain many technical terms that are observed infrequently. In NMTs, words that are out of vocabulary are represented by a single unknown token. In this paper, we propose a method that enables NMT to translate patent sentences comprising a large vocabulary of technical terms. We train an NMT system on bilingual data wherein technical terms are replaced with technical term tokens; this allows it to translate most of the source sentences except technical terms. Further, we use it as a decoder to translate source sentences with technical term tokens and replace the tokens with technical term translations using SMT. We also use it to rerank the 1,000-best SMT translations on the basis of the average of the SMT score and that of the NMT rescoring of the translated sentences with technical term tokens. Our experiments on Japanese-Chinese patent sentences show that the proposed NMT system achieves a substantial improvement of up to 3.1 BLEU points and 2.3 RIBES points over traditional SMT systems and an improvement of approximately 0.6 BLEU points and 0.8 RIBES points over an equivalent NMT system without our proposed technique.
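
The idea of replacing technical terms with tokens before NMT and restoring their translations afterwards can be sketched as below. This is a simplified illustration under stated assumptions: the `<TTn>` token format and the function names are hypothetical, plain substring matching stands in for real term detection, and a lookup dictionary stands in for the SMT system that translates the terms in the paper. English strings are used only for readability.

```python
def mask_terms(sentence, terms):
    """Replace each known technical term with an indexed placeholder token."""
    mapping = {}
    # Longer terms first, so a term that contains another is masked as a whole.
    for term in sorted(terms, key=len, reverse=True):
        if term in sentence:
            token = f"<TT{len(mapping) + 1}>"
            sentence = sentence.replace(term, token)
            mapping[token] = term
    return sentence, mapping


def restore_terms(translated, mapping, term_translations):
    """Swap placeholder tokens for term translations (a dictionary stands in for SMT)."""
    for token, source_term in mapping.items():
        translated = translated.replace(token, term_translations[source_term])
    return translated


sentence = "the bridge interface controls the optical transceiver"
terms = ["bridge interface", "optical transceiver"]
masked, mapping = mask_terms(sentence, terms)
```

The NMT system then only has to translate the masked sentence, whose vocabulary is small because every rare technical term has collapsed into one of a handful of placeholder tokens.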

Journal ArticleDOI
TL;DR: An English to Malay EBMT system is presented to demonstrate the practical application of the structural semantics, which is used to support deeper semantic similarity measurement and impose structural constraints in translation examples selection.
Abstract: The main tasks in Example-based Machine Translation (EBMT) comprise source text decomposition, followed by translation example matching and selection, and finally adaptation and recombination of the target translation. As natural language is ambiguous in nature, the preservation of the source text’s meaning throughout these processes is complex and challenging. A structural semantics is introduced as an attempt at a meaning-based approach to improve the EBMT system. The structural semantics is used to support deeper semantic similarity measurement and impose structural constraints in translation examples selection. A semantic compositional structure is derived from the structural semantics of the selected translation examples. This semantic compositional structure serves as a representation structure to preserve the consistency and integrity of the input sentence’s meaning structure throughout the recombination process. In this paper, an English to Malay EBMT system is presented to demonstrate the practical application of this structural semantics. Evaluation of the translation test results shows that the new translation framework based on the structural semantics has outperformed the previous EBMT framework.

Journal ArticleDOI
TL;DR: Results show that it is more appropriate to use translation and/or Arabic-expanding techniques with technical terms derived from common linguistic roots in the source language (SL) to preserve the integrity and authenticity of Arabic as a target language (TL) at a time of a marked increase in the number of SL technical terms.
Abstract: The main aim of this paper is to explore the techniques used in translating English technical terms into Arabic in the Microsoft Terminology Collection (MTC) (English-Arabic) as an example of comprehensive multilingual resources of technical terminology on the Web. MTC is a well-known online IT-glossary available on the Microsoft Language Portal in over ninety languages. It provides users with the opportunity to perform quick searches between different languages and to download files that integrate with Microsoft products and computer-assisted translation (CAT) tools. Some examples of MTC terms in Arabic are examined by the researcher to identify the kinds of translation strategies that MTC follows in order to translate technical terms into Arabic as well as the appropriateness of these strategies to their translation situations through comparison of different translations for the same SL term. The analysis of selected examples from MTC shows that in the Arabic translations of technical terms, MTC uses translation, Arabicisation, and Arabic-expanding techniques inconsistently, either in providing more than one translation for a standard technical term within the same translation situation or in using different translation strategies for similar technical terms in similar translation situations. Results show that it is more appropriate to use translation and/or Arabic-expanding techniques (mainly derivation and compounding) with technical terms derived from common linguistic roots in the source language (SL) to preserve the integrity and authenticity of Arabic as a target language (TL) at a time of a marked increase in the number of SL technical terms, while methods of Arabicisation should only be used with SL proper nouns or any word derived from them to solve problems of non-equivalence at word level between Arabic and English.

Journal ArticleDOI
TL;DR: The qualitative examination of the target language texts revealed that the three MT systems had errors and non-errors in rendering gender-bound constructs from English to Arabic, and that errors transpired in certain co-textual environments.

Journal ArticleDOI
TL;DR: A system for automatic terminology extraction and automatic detection of the equivalent terms in the target language to be used alongside a computer assisted translation (CAT) tool that provides term candidates and their translations in an automatic way each time the translator goes from one segment to the next one.
Abstract: In this paper we present a system for automatic terminology extraction and automatic detection of the equivalent terms in the target language to be used alongside a computer assisted translation (CAT) tool that provides term candidates and their translations in an automatic way each time the translator goes from one segment to the next one. The system uses several sources of information: the text from the segment being translated and from the whole translation project, the translation memories assigned to the project and a translation phrase table from a statistical machine translation system. It also uses the terminological database assigned to the project in order to avoid presenting already known terms. The use of translation phrase tables allows us to use very large parallel corpora in a very efficient way. We have used Moses to calculate and to consult the translation phrase tables. The program is written in Python and it can be used with any CAT tool. In our experiments we have used OmegaT, a well-known open source CAT tool. Evaluation results for English–Spanish and for three subjects (politics, finance, and medicine) are presented.

Proceedings ArticleDOI
01 Sep 2017
TL;DR: This paper describes the neural machine translation systems of the University of Latvia, University of Zurich and University of Tartu, which participated in the WMT 2017 shared task on news translation by building systems for two language pairs, based on an attentional encoder-decoder, using BPE subword segmentation.
Abstract: This paper describes the neural machine translation systems of the University of Latvia, University of Zurich and University of Tartu. We participated in the WMT 2017 shared task on news translation by building systems for two language pairs: English↔German and English↔Latvian. Our systems are based on an attentional encoder-decoder, using BPE subword segmentation. We experimented with backtranslating the monolingual news corpora and filtering out the best translations as additional training data, enforcing named entity translation from a dictionary of parallel named entities, penalizing over- and under-translated sentences, and combining output from multiple NMT systems with SMT. The described methods give 0.7–1.8 BLEU point improvements over our baseline systems.
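
For readers unfamiliar with the BPE subword segmentation mentioned in the abstract above, the following is a minimal sketch of how byte-pair-encoding merge operations are learned from word frequencies. It is a generic textbook-style illustration, not the systems' actual preprocessing; the function name `learn_bpe`, the `</w>` end-of-word marker, and the toy vocabulary are assumptions.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a {word: frequency} dictionary."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the most frequent pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab


toy_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(toy_freqs, num_merges=3)
```

Frequent character sequences such as a shared suffix end up as single symbols, while rare words remain split into smaller pieces, which is what lets an NMT system handle an open vocabulary.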

Journal ArticleDOI
07 Jul 2017-Sendebar
TL;DR: The method used in teaching specialised translation in the English Language Translation Master’s programme at Masaryk University is described, with the first results of the research examining a learner corpus of translations from Czech into English.
Abstract: This paper describes the method used in teaching specialised translation in the English Language Translation Master’s programme at Masaryk University. After a brief description of the courses, the focus shifts to translation learner corpora (TLC) compiled in the new Hypal interface, which can be integrated in Moodle. Student translations are automatically aligned (with possible adjustments), PoS (part-of-speech) tagged, and manually error-tagged. Personal student reports based on error statistics for individual translations to show students’ progress throughout the term or during their studies in the four-semester programme can be easily generated. Using the data from the pilot run of the new software, the paper concludes with the first results of the research examining a learner corpus of translations from Czech into English.

Journal ArticleDOI
TL;DR: This paper finds that translators seem to have a flexible and pragmatic attitude towards TCI, adapting to the tool's imperfections and accommodating its resistances.
Abstract: Today technology is part and parcel of professional translation, and translation has therefore been characterised as Translator-Computer Interaction (TCI) (O’Brien 2012). Translation is increasingly carried out using Translation Memory (TM) systems which incorporate machine translation (MT), referred to as MT-assisted TM translation, and in this type of tool, translators switch between editing TM matches and post-editing MT matches. It is generally assumed that translators’ attitudes towards technology impact on this interaction with the technology. Drawing on Eagly/Chaiken’s (1995) definition of attitudes as evaluations of entities with favour or disfavour and on qualitative data from a workplace study of TCI, conducted as part of a PhD dissertation (Bundgaard 2017) and partly reported on in Bundgaard et al. (2016), this paper explores translator attitudes towards TCI in the form of MT-assisted TM translation. In doing so, the paper has a particular focus on the disfavour towards TCI expressed by translators. Moreover, inspired by Olohan (2011), who applies Pickering’s “mangle of practice” theory and analyses resistance and accommodation in TCI, the paper focuses on how translators accommodate resistances offered by the tool. The study shows that the translators express disfavour towards MT in many respects, but also acknowledge positive aspects of the technology and expect MT to play a significant role in their future working lives. The translators do not make many positive or negative comments about TM which might indicate that TM is a completely integrated part of their processes. The translators seem to have a flexible and pragmatic attitude towards TCI, adapting to the tool’s imperfections and accommodating its resistances.

Journal ArticleDOI
26 Sep 2017
TL;DR: Findings in a controlled study carried out to examine the possible benefits of editing Machine Translation and Translation Memory outputs when translating from English to Welsh contradict supposed similarities between translation quality in terms of style and post-editing Machine Translation.
Abstract: This article reports on a controlled study carried out to examine the possible benefits of editing Machine Translation and Translation Memory outputs when translating from English to Welsh. Using software capable of timing the translation process per segment, 8 professional translators each translated 75 sentences of differing match percentage, and post-edited a further 25 segments of Machine Translation. Basing the final analysis on 800 sentences and 17,440 words, the use of Fuzzy Matches in the 70-99% match range, Exact Matches and Statistical Machine Translation was found to significantly speed up the translation process. Significant correlations were also found between the processing time data of Exact Matches and Machine Translation post-editing, rather than between Fuzzy Matches and Machine Translation as expected. Two experienced translators were then asked to rate all translations for fidelity, grammaticality and style, whereby it was found that the use of translation technology either did not negatively affect translation quality compared to manual translation, or its use actually improved final quality in some cases. As well as confirming the findings of research in relation to translation technology, these findings also contradict supposed similarities between translation quality in terms of style and post-editing Machine Translation.
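
The match percentages referred to in the abstract above can be illustrated with a simple word-level similarity score. This is only a sketch: commercial Translation Memory tools use their own scoring formulas, so the use of `difflib.SequenceMatcher`, the function names, and the example segments are all assumptions. Only the 70-99% fuzzy band and the exact-match category come from the abstract.

```python
from difflib import SequenceMatcher


def match_percentage(new_segment: str, tm_segment: str) -> int:
    """Word-level similarity between a new segment and a TM segment, as a percentage."""
    new_words, tm_words = new_segment.split(), tm_segment.split()
    ratio = SequenceMatcher(None, new_words, tm_words, autojunk=False).ratio()
    return round(ratio * 100)


def match_category(percentage: int) -> str:
    """Assign a segment to the match bands named in the study."""
    if percentage == 100:
        return "exact match"
    if 70 <= percentage <= 99:
        return "fuzzy match (70-99%)"
    return "below the 70% band"


score = match_percentage("Click the Save button", "Click the Save icon")
category = match_category(score)
```

Here three of the four words in each segment agree, so the similarity ratio is 2 × 3 / 8 = 0.75 and the segment falls in the fuzzy band. As a side check on the study design, 8 translators × (75 + 25) segments gives the 800 sentences analysed.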

Posted Content
TL;DR: This paper proposes the use of synthetic methods for extending a low-resource corpus and apply it to a multi-source neural machine translation model, and shows the improvement of machine translation performance through corpus extension using the synthetic method.
Abstract: In machine translation, we often try to collect resources to improve performance. However, most of the language pairs, such as Korean-Arabic and Korean-Vietnamese, do not have enough resources to train machine translation systems. In this paper, we propose the use of synthetic methods for extending a low-resource corpus and apply it to a multi-source neural machine translation model. We showed the improvement of machine translation performance through corpus extension using the synthetic method. We specifically focused on how to create source sentences that can make better target sentences, including the use of synthetic methods. We found that the corpus extension could also improve the performance of multi-source neural machine translation. We showed the corpus extension and multi-source model to be efficient methods for a low-resource language pair. Furthermore, when both methods were used together, we found better machine translation performance.

13 Jul 2017
TL;DR: It is argued that the Translator’s Amanuensis 2020 could benefit from existing Translation Studies concepts: the study of translation problems, translation competence models, and the ethics and sociology of translation.
Abstract: This paper is an exercise of imagination. Based on Kay’s (1980) inspiring idea of a translator’s amanuensis, we attempt to describe a post-editing tool that enables ubiquitous translation (Cronin 2010). We argue that a parallelism exists between media remediation (Bolter and Grusin 1999) and the shifting phase translation is undergoing, with machine translation post-editing having an impact on the global workflow of translated content. We take the hybridisation of traditional and machine translation processes as a starting point to envisage the features of forthcoming translation technologies. Results of previous surveys helped us to select features expected to play a central role: versatile devices to which we broadly refer as displayers would enable ubiquity; a relevant knowledge feature would provide human translators with a well-assorted repertoire of reliable sources; and an effort prediction feature would provide post-editors with reliable estimates of how much work lay ahead. Interacting with the Translator’s Amanuensis 2020 would not always be straightforward, however. Translators will have to adapt to richer ways of reading and visualising information. Ultimately, we argue that the Translator’s Amanuensis 2020 could benefit from existing Translation Studies concepts: the study of translation problems, translation competence models, and the ethics and sociology of translation.

Journal ArticleDOI
TL;DR: An integrated functional approach to translating business texts is suggested on the basis of analyzing semantic and morphological features of actual text content and also on axiological and epistemic semantic features that bring to light subjective modality.
Abstract: This article draws on the example of business texts to consider practical aspects of the distortion of meaning in translation from one language to another in the available machine translation (MT) systems and their underlying approach based on word-by-word translation. An integrated functional approach to translating business texts is suggested on the basis of analyzing semantic and morphological features of actual text content and also on axiological and epistemic semantic features that bring to light subjective modality. The suggested technique is used to develop an algorithm of business text MT that makes it possible to resolve the word-by-word translation issue and conveys the meanings of short texts. Cases of testing the suggested technique and the derived algorithm are considered for the Russian–English language pair.

Journal ArticleDOI
TL;DR: The article focuses on the technology used (Rule-Based Machine Translation) and on some of the rules created, as well as on the orthographic model used for Sardinian.
Abstract: This paper describes the process of creation of the first machine translation system from Italian to Sardinian, a Romance language spoken on the island of Sardinia in the Mediterranean. The project was carried out by a team of translators and computational linguists. The article focuses on the technology used (Rule-Based Machine Translation) and on some of the rules created, as well as on the orthographic model used for Sardinian.

Journal ArticleDOI
TL;DR: The paper addresses the challenges of MT and the solution efforts made in this direction, and provides approaches for effective Hindi-to-English Machine Translation that can help make MT systems inexpensive and easy to implement.
Abstract: Objectives: To provide approaches for effective Hindi-to-English Machine Translation (MT) that can help make MT systems inexpensive and easy to implement. Methods/Statistical Analysis: The structures of the Hindi and English languages have been studied thoroughly. The possible steps towards natural languages have also been studied. The methods, rules, approaches, tools, resources etc. related to MT have been discussed in detail. Findings: MT is an idea for the automatic translation of a language. India is a country full of diversity in culture and languages. More than 20 regional languages are spoken, along with several dialects. Hindi is widely spoken in all the states of the country. A lot of literature, poetry and valuable texts are available in Hindi, which gives opportunities to retranslate them into English. However, the new generation is learning English rapidly and is also keen to learn it in a simplified, lucid manner. Several efforts have been made in this direction. A large number of approaches and solutions exist for MT, yet there is still huge scope. The paper addresses the challenges of MT and the solution efforts made in this direction. This motivates researchers to implement new Hindi-to-English machine translation systems. Application/Improvements: Efficient, inexpensive and easy translation of available Hindi literature, poetry and other valuable texts into English. Children can easily learn the culture through poetry and literature; hence, machine translation of these will have a wonderful impact.