Duygu Ataman

Researcher at New York University

Publications -  22
Citations -  263

Duygu Ataman is an academic researcher from New York University. The author has contributed to research in topics: Machine translation & Vocabulary. The author has an h-index of 8, co-authored 19 publications receiving 206 citations. Previous affiliations of Duygu Ataman include Fondazione Bruno Kessler & University of Zurich.

Papers
Journal ArticleDOI

Linguistically Motivated Vocabulary Reduction for Neural Machine Translation from Turkish to English

TL;DR: This paper proposes a new vocabulary reduction method for NMT that can reduce the vocabulary of a given input corpus at any rate while also considering the morphological properties of the language. It achieves a significant improvement of 2.3 BLEU points over the conventional vocabulary reduction technique, showing that it can provide better accuracy in open-vocabulary translation of morphologically rich languages.
Proceedings ArticleDOI

Compositional Representation of Morphologically-Rich Input for Neural Machine Translation

TL;DR: The authors propose to replace the source-language embedding layer of NMT with a bi-directional recurrent neural network that generates compositional representations of the input at any desired level of granularity.
Proceedings Article

An Evaluation of Two Vocabulary Reduction Methods for Neural Machine Translation

TL;DR: The paper presents an extensive evaluation of two unsupervised vocabulary reduction methods in NMT: the well-known byte-pair encoding (BPE) and linguistically motivated vocabulary reduction (LMVR), a segmentation method that also considers morphological properties of subwords.
Posted Content

Compositional Representation of Morphologically-Rich Input for Neural Machine Translation

TL;DR: The authors propose to replace the source-language embedding layer of NMT with a bi-directional recurrent neural network that generates compositional representations of the input at any desired level of granularity.
Posted Content

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

TL;DR: In this paper, the authors manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4) and audit the correctness of language codes in a sixth (JW300).