Duygu Ataman
Researcher at New York University
Publications - 22
Citations - 263
Duygu Ataman is an academic researcher at New York University. The author has contributed to research on the topics of machine translation and vocabulary. The author has an h-index of 8 and has co-authored 19 publications receiving 206 citations. Previous affiliations of Duygu Ataman include Fondazione Bruno Kessler and the University of Zurich.
Papers
Journal ArticleDOI
Linguistically Motivated Vocabulary Reduction for Neural Machine Translation from Turkish to English
TL;DR: This paper proposes a new vocabulary reduction method for NMT that can reduce the vocabulary of a given input corpus at any rate while also considering the morphological properties of the language. The method achieves a significant improvement of 2.3 BLEU points over the conventional vocabulary reduction technique, showing that it can provide better accuracy in open-vocabulary translation of morphologically rich languages.
Proceedings ArticleDOI
Compositional Representation of Morphologically-Rich Input for Neural Machine Translation
Duygu Ataman,Marcello Federico +1 more
TL;DR: The authors propose to replace the source-language embedding layer of NMT with a bi-directional recurrent neural network that generates compositional representations of the input at any desired level of granularity.
Proceedings Article
An Evaluation of Two Vocabulary Reduction Methods for Neural Machine Translation
Duygu Ataman,Marcello Federico +1 more
TL;DR: An extensive evaluation of two unsupervised vocabulary reduction methods in NMT: the well-known byte-pair encoding (BPE) and linguistically motivated vocabulary reduction (LMVR), a segmentation method that also considers the morphological properties of subwords.
Posted Content
Compositional Representation of Morphologically-Rich Input for Neural Machine Translation
Duygu Ataman,Marcello Federico +1 more
TL;DR: The authors propose to replace the source-language embedding layer of NMT with a bi-directional recurrent neural network that generates compositional representations of the input at any desired level of granularity.
Posted Content
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Isaac Caswell,Julia Kreutzer,Lisa Wang,Ahsan Wahab,Daan van Esch,Nasanbayar Ulzii-Orshikh,Allahsera Auguste Tapo,Nishant Subramani,Artem Sokolov,Claytone Sikasote,Monang Setyawan,Supheakmungkol Sarin,Sokhar Samb,Benoît Sagot,Clara E. Rivera,Annette Rios,Isabel Papadimitriou,Salomey Osei,Pedro Javier Ortiz Suárez,Iroro Orife,Kelechi Ogueji,Rubungo Andre Niyongabo,Toan Q. Nguyen,Mathias Müller,André Müller,Shamsuddeen Hassan Muhammad,Nanda Muhammad,Ayanda Mnyakeni,Jamshidbek Mirzakhalov,Tapiwanashe Matangira,Colin Leong,Nze Lawson,Sneha Kudugunta,Yacine Jernite,Mathias Jenny,Orhan Firat,Bonaventure F. P. Dossou,Sakhile Dlamini,Nisansa de Silva,Sakine Çabuk Ballı,Stella Biderman,Alessia Battisti,Ahmed Baruwa,Ankur Bapna,Pallavi Baljekar,Israel Abebe Azime,Ayodele Awokoya,Duygu Ataman,Orevaoghene Ahia,Oghenefego Ahia,Sweta Agrawal,Mofetoluwa Adeyemi +51 more
TL;DR: In this paper, the authors manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4) and audit the correctness of language codes in a sixth (JW300).