Proceedings ArticleDOI

WikiDocsAligner: An Off-the-Shelf Wikipedia Documents Alignment Tool

TLDR
WikiDocsAligner, an off-the-shelf tool for aligning Wikipedia articles, is presented. It can easily align Wikipedia documents in any language pair, and it sheds light on Wikipedia as a source of language resources for Arabic dialects.
Abstract
Wikipedia is an attractive source of comparable corpora in many languages. Most researchers develop their own scripts to perform the document alignment task, which takes time and effort. In this paper, we present WikiDocsAligner, an off-the-shelf tool for aligning Wikipedia articles. WikiDocsAligner does not require researchers to import or export interlanguage link databases. The user only needs to download the Wikipedia dumps (interlanguage links and articles) and provide them to the tool, which performs the alignment. The software can easily be used to align Wikipedia documents in any language pair. Finally, we use WikiDocsAligner to align comparable documents from Arabic Wikipedia and Egyptian Wikipedia, shedding light on Wikipedia as a source of language resources for Arabic dialects. The resources produced are useful because demand for Arabic and dialectal language resources has increased over the last decade.
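The alignment step described in the abstract can be illustrated with a minimal Python sketch. This is not WikiDocsAligner's actual implementation: the function name `align_documents`, the `AlignedPair` type, the in-memory data layout, and the toy titles are all assumptions made for illustration. The tuple layout loosely mirrors the `ll_from`, `ll_lang`, and `ll_title` columns of the MediaWiki `langlinks` table, and `arz` is the language code of Egyptian Arabic Wikipedia.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class AlignedPair(NamedTuple):
    """One pair of comparable articles (hypothetical container type)."""
    source_title: str
    target_title: str
    source_text: str
    target_text: str


def align_documents(
    langlinks: Iterable[Tuple[int, str, str]],
    source_articles: Dict[int, Tuple[str, str]],
    target_articles: Dict[str, str],
    target_lang: str,
) -> List[AlignedPair]:
    """Pair source and target articles using interlanguage links.

    langlinks:       (source_page_id, language_code, target_title) tuples
    source_articles: source_page_id -> (title, text)
    target_articles: target_title -> text
    """
    pairs: List[AlignedPair] = []
    for page_id, lang, target_title in langlinks:
        if lang != target_lang:
            continue  # link points to another language edition
        source = source_articles.get(page_id)
        target_text = target_articles.get(target_title)
        if source is None or target_text is None:
            continue  # one side is missing from the article dumps
        source_title, source_text = source
        pairs.append(
            AlignedPair(source_title, target_title, source_text, target_text)
        )
    return pairs


# Toy data standing in for parsed dumps (all titles and texts are made up).
langlinks = [
    (1, "arz", "Cairo_arz"),
    (1, "en", "Cairo"),
    (2, "arz", "Nile_arz"),   # target article absent below
    (3, "arz", "Giza_arz"),   # source article absent below
]
source_articles = {
    1: ("Cairo_ar", "MSA text about Cairo"),
    2: ("Nile_ar", "MSA text about the Nile"),
}
target_articles = {
    "Cairo_arz": "Egyptian text about Cairo",
    "Giza_arz": "Egyptian text about Giza",
}

pairs = align_documents(langlinks, source_articles, target_articles, "arz")
for pair in pairs:
    print(pair.source_title, "<->", pair.target_title)
```

In this sketch, links whose source or target article is missing are skipped silently; a real tool would also have to parse the dump files themselves, which is omitted here.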


Citations
Journal ArticleDOI

Arabic natural language processing: An overview

TL;DR: This paper presents a survey of 90 recent research papers (74% of which were published after 2015). It classifies the work done on the three varieties of Arabic, concentrating on both Arabic and Arabizi, and associates each work with its publicly available resources whenever available.
Journal ArticleDOI

Systematic Literature Review of Dialectal Arabic: Identification and Detection

TL;DR: The authors conducted a systematic literature review that is intended to give insight into the most and least popular research areas, dialects, machine learning approaches, neural network input features, data types, datasets, system evaluation criteria, publication venues, and publication trends.

UPV-UMA at CheckThat! Lab: verifying Arabic claims using a cross lingual approach

TL;DR: A cross-lingual approach to detecting the factuality of claims is presented. It uses three main steps (evidence retrieval, evidence ranking, and textual entailment) to achieve the best performance in subtask-D.
Journal ArticleDOI

Exploiting Comparable Corpora to Enhance Bilingual Lexicon Induction from Monolingual Corpora

TL;DR: A two-stage framework is proposed that learns bilingual lexicons from monolingual corpora, enhanced with comparable corpora, without any additional resources. Experimental results show that the proposed method improves the accuracy obtained from monolingual corpora and outperforms previous methods.
References
Proceedings ArticleDOI

Data Structures for Statistical Computing in Python

Wes McKinney
TL;DR: pandas is a new library which aims to facilitate working with data sets common to finance, statistics, and other related fields and to provide a set of fundamental building blocks for implementing statistical models.
Proceedings ArticleDOI

A Multidialectal Parallel Corpus of Arabic

TL;DR: This paper presents the first multidialectal Arabic parallel corpus, a collection of 2,000 sentences in Standard Arabic, Egyptian, Tunisian, Jordanian, Palestinian and Syrian Arabic, in addition to English, a very valuable resource that has many potential applications such as Arabic dialect identification and machine translation.
Proceedings ArticleDOI

Using Twitter to Collect a Multi-Dialectal Corpus of Arabic

TL;DR: In this paper, the authors described the collection and classification of a multi-dialectal corpus of Arabic based on the geographical information of tweets, and extracted tweets that have dialectal word(s).
Proceedings Article

A Multi-Dialect, Multi-Genre Corpus of Informal Written Arabic

TL;DR: This work is the most diverse corpus of dialectal Arabic in both the source of the content and the number of dialects, and extends the Arabic dialect identification task to the Iraqi and Maghrebi dialects and improves the results of Zaidan and Callison-Burch (2011a).
Book ChapterDOI

Cross-Dialectal Arabic Processing

TL;DR: An Arabic multi-dialect study is presented, covering dialects from both the Maghreb and the Middle East, which the authors compare to Modern Standard Arabic (MSA).