Open Access Proceedings Article

Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering

TLDR
This article proposes a method to improve cross-lingual Question Answering performance without requiring additional annotated data, leveraging Question Generation models to produce synthetic samples in a cross-lingual fashion.
Abstract
Coupled with the availability of large-scale datasets, deep learning architectures have enabled rapid progress on the Question Answering task. However, most of these datasets are in English, and the performance of state-of-the-art multilingual models is significantly lower when evaluated on non-English data. Given high data collection costs, it is not realistic to obtain annotated data for every language one wishes to support. We propose a method to improve Cross-lingual Question Answering performance without requiring additional annotated data, leveraging Question Generation models to produce synthetic samples in a cross-lingual fashion. We show that the proposed method significantly outperforms baselines trained on English data only. We report a new state of the art on four multilingual datasets: MLQA, XQuAD, SQuAD-it and PIAF (fr).
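As a rough illustration of the augmentation idea, the sketch below generates a synthetic question for a target-language passage with a seq2seq question-generation model. The checkpoint name and prompt format are hypothetical placeholders, not the paper's exact setup.

```python
# Minimal sketch of cross-lingual synthetic QA augmentation.
# NOTE: "org/multilingual-qg" is a hypothetical checkpoint name; any
# multilingual seq2seq question-generation model could stand in for it.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL = "org/multilingual-qg"  # hypothetical placeholder
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

def synthesize_sample(passage: str, answer: str) -> dict:
    """Generate a question for a known answer span in a passage."""
    # One common QG input format: mark the answer, then give the context.
    prompt = f"answer: {answer} context: {passage}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=64)
    question = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    # The (passage, question, answer) triple becomes a synthetic QA sample
    # for training a QA model in the passage's language.
    return {"context": passage, "question": question, "answers": [answer]}
```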

Citations
Posted Content

One Question Answering Model for Many Languages with Cross-lingual Dense Passage Retrieval

TL;DR: This article proposes a cross-lingual open-retrieval answer generation model that can answer questions in many languages, even when language-specific annotated data or knowledge sources are unavailable.
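For intuition, the toy snippet below shows the retrieval step such open-retrieval systems rely on: passages are ranked by the similarity of their dense embeddings to the question embedding. Random vectors stand in for real encoder outputs.

```python
# Toy dense-retrieval scoring: rank passages by dot product with the question.
import numpy as np

rng = np.random.default_rng(0)
question_vec = rng.standard_normal(768)          # stand-in question embedding
passage_vecs = rng.standard_normal((1000, 768))  # stand-in passage embeddings

scores = passage_vecs @ question_vec             # one score per passage
top_k = np.argsort(-scores)[:5]                  # best-matching passage indices
print(top_k)
```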
Book Chapter

The Impact of Cross-Lingual Adjustment of Contextual Word Representations on Zero-Shot Transfer

Yibo Ding
TL;DR: This article used a parallel corpus to make embeddings of related words across languages similar to each other, and showed that fine-tuning leads to forgetting some of that cross-lingual alignment information.
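A toy version of the adjustment objective (my paraphrase, not the paper's exact loss) simply penalizes the distance between embeddings of word pairs aligned in a parallel corpus:

```python
# Toy cross-lingual adjustment loss: pull embeddings of aligned word
# pairs together. Random tensors stand in for real contextual embeddings.
import torch
import torch.nn.functional as F

src = torch.randn(32, 768, requires_grad=True)  # source-language word vectors
tgt = torch.randn(32, 768)                      # their aligned translations

loss = F.mse_loss(src, tgt)  # smaller loss = better-aligned spaces
loss.backward()              # in a real setup this would update the encoder
```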
References
Proceedings Article

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
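For extractive QA, that "one additional output layer" is a span-prediction head. A minimal sketch with the Hugging Face transformers library follows; the head is randomly initialized here and would need fine-tuning on SQuAD-style data before its predictions mean anything.

```python
# Minimal sketch: BERT plus a span-prediction head for extractive QA.
from transformers import AutoTokenizer, BertForQuestionAnswering

name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = BertForQuestionAnswering.from_pretrained(name)  # head is untrained

question = "Who wrote Hamlet?"
context = "Hamlet is a tragedy written by William Shakespeare."
inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)

# The head emits start/end logits over tokens; the argmax pair is the span.
start = outputs.start_logits.argmax()
end = outputs.end_logits.argmax()
print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))
```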
Proceedings Article

The PageRank Citation Ranking: Bringing Order to the Web

TL;DR: This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them, and shows how to compute PageRank efficiently for large numbers of pages.
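For concreteness, here is a self-contained power-iteration sketch of PageRank on a three-page toy graph, with the usual damping factor of 0.85:

```python
# Power-iteration PageRank on a toy link graph.
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0]}  # page -> pages it links to
n, d = 3, 0.85                       # page count, damping factor

# Column-stochastic matrix: M[j, i] = 1/outdegree(i) if i links to j.
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = (1 - d) / n + d * M @ rank

print(rank)  # importance scores; they sum to ~1
```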
Proceedings Article

SQuAD: 100,000+ Questions for Machine Comprehension of Text

TL;DR: The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage.
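The key structural property, that every answer is a character-indexed span of the passage, is easy to see in a single sample; the passage below is illustrative, laid out in SQuAD's usual fields:

```python
# Illustrative SQuAD-style sample: the answer is a span of the context,
# located by a character offset.
sample = {
    "context": (
        "The Normans were the people who in the 10th and 11th centuries "
        "gave their name to Normandy, a region in France."
    ),
    "question": "In what country is Normandy located?",
    "answers": {"text": ["France"], "answer_start": [104]},
}

start = sample["answers"]["answer_start"][0]
text = sample["answers"]["text"][0]
assert sample["context"][start : start + len(text)] == text
```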
Proceedings Article

Unsupervised Cross-lingual Representation Learning at Scale

TL;DR: This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and demonstrates for the first time the possibility of multilingual modeling without sacrificing per-language performance.
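This model family (e.g., XLM-R) is what the abstract's zero-shot baselines build on: fine-tune on English QA data, then evaluate directly on other languages. A skeletal version of that recipe:

```python
# Skeleton of the zero-shot cross-lingual QA recipe: one multilingual
# encoder, English-only fine-tuning, direct evaluation in other languages.
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name)

# 1) Fine-tune `model` on English SQuAD (training loop omitted).
# 2) Run inference on MLQA / XQuAD examples with no target-language labels;
#    the shared multilingual representations carry the transfer.
```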
Proceedings Article

Teaching Machines to Read and Comprehend

TL;DR: This paper defines a new methodology that resolves the supervised-data bottleneck in reading comprehension, providing large-scale data that enables the development of attention-based deep neural networks which learn to read real documents and answer complex questions with minimal prior knowledge of language structure.
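That methodology builds cloze-style questions from news articles and their bullet-point summaries; a toy rendering of the construction (with invented example strings):

```python
# Toy cloze-style sample construction in the spirit of the paper's method:
# blank out an entity in a summary to form (document, query, answer).
document = "The painting was sold at auction in London for a record price."
summary = "Record price paid for a painting in London."
entity = "London"

query = summary.replace(entity, "@placeholder")
sample = {"document": document, "query": query, "answer": entity}
print(sample["query"])  # "Record price paid for a painting in @placeholder."
```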