Fine-Grained Arabic Dialect Identification

Open Access · Proceedings Article

- pp 1332-1344

TLDR

This paper presents the first results on a fine-grained dialect classification task covering 25 specific cities from across the Arab World, in addition to Standard Arabic. The authors build several classification systems and explore a large space of features.

Abstract:

Previous work on the problem of Arabic Dialect Identification typically targeted coarse-grained five dialect classes plus Standard Arabic (6-way classification). This paper presents the first results on a fine-grained dialect classification task covering 25 specific cities from across the Arab World, in addition to Standard Arabic – a very challenging task. We build several classification systems and explore a large space of features. Our results show that we can identify the exact city of a speaker at an accuracy of 67.9% for sentences with an average length of 7 words (a 9% relative error reduction over the state-of-the-art technique for Arabic dialect identification) and reach more than 90% when we consider 16 words. We also report on additional insights from a data analysis of similarity and difference across Arabic dialects.
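
The relationship between the reported 67.9% accuracy and the 9% relative error reduction can be checked with a little arithmetic. The sketch below is illustrative only: the abstract does not state the baseline accuracy, so the value computed here is an inference from two rounded figures, and the function names are hypothetical.

```python
def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Fraction of the baseline's errors that the new system removes."""
    baseline_err = 1.0 - baseline_acc
    new_err = 1.0 - new_acc
    return (baseline_err - new_err) / baseline_err


def implied_baseline_accuracy(new_acc: float, rer: float) -> float:
    """Invert the formula above: the baseline accuracy implied by a new
    accuracy and a relative error reduction (RER)."""
    new_err = 1.0 - new_acc
    baseline_err = new_err / (1.0 - rer)
    return 1.0 - baseline_err


# Figures quoted in the abstract; both are rounded, so the result is approximate.
new_acc = 0.679   # accuracy on sentences averaging 7 words
rer = 0.09        # relative error reduction over the prior technique

baseline = implied_baseline_accuracy(new_acc, rer)
print(f"implied baseline accuracy: {baseline:.3f}")
```

Under these assumptions the earlier technique would sit at roughly 65% accuracy on the same data, which is an estimate and not a number reported in the abstract.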

Citations

Proceedings Article

CAMeL Tools: An Open Source Python Toolkit for Arabic Natural Language Processing

Ossama Obeid, +9 more

TL;DR: Describes the design of CAMeL Tools and the functionalities it provides, including utilities for pre-processing, morphological modeling, dialect identification, named entity recognition, and sentiment analysis.

Proceedings ArticleDOI

The MADAR Shared Task on Arabic Fine-Grained Dialect Identification

Houda Bouamor, +2 more

TL;DR: This shared task is the first to target a large set of dialect labels at the city and country levels and was organized as part of The Fourth Arabic Natural Language Processing Workshop, collocated with ACL 2019.

Journal ArticleDOI

Arabic natural language processing: An overview

Imane Guellil, +4 more

- 01 Jun 2021 -

Journal of King Saud University - Comput...

TL;DR: This study presents and classifies the work done on the three varieties of Arabic, concentrating on both Arabic and Arabizi, and associates each work with its publicly available resources whenever available.

Journal ArticleDOI

Arabic natural language processing: An overview

Imane Guellil, +4 more

- 07 Mar 2019 -

arXiv: Computation and Language

TL;DR: This survey covers 90 recent research papers (74% of which were published after 2015). It classifies the work done on the three varieties of Arabic, concentrating on both Arabic and Arabizi, and associates each work with its publicly available resources whenever available.

NADI 2020: The First Nuanced Arabic Dialect Identification Shared Task

Muhammad Abdul-Mageed, +3 more

TL;DR: The second Nuanced Arabic Dialect Identification Shared Task (NADI 2021) was the first shared task to include four subtasks: country-level Modern Standard Arabic (MSA) identification (Subtask 1.1), country-level dialect identification, province-level MSA identification, and province-level sub-dialect identification (Subtask 2.2).

References

Book

Introduction to Information Retrieval

Christopher D. Manning, +2 more

TL;DR: In this article, the authors present an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents; methods for evaluating systems; and an introduction to the use of machine learning methods on text collections.

Proceedings Article

KenLM: Faster and Smaller Language Model Queries

Kenneth Heafield

TL;DR: KenLM is a library that implements two data structures for efficient language model queries, reducing both time and memory costs and is integrated into the Moses, cdec, and Joshua translation systems.

Book

Introduction to Arabic Natural Language Processing

Nizar Habash

TL;DR: The goal is to introduce Arabic linguistic phenomena and review the state-of-the-art in Arabic processing to provide system developers and researchers in natural language processing and computational linguistics with the necessary background information for working with the Arabic language.

Proceedings Article

MADAMIRA: A Fast, Comprehensive Tool for Morphological Analysis and Disambiguation of Arabic

Arfath Pasha, +8 more

TL;DR: MADAMIRA is a system for morphological analysis and disambiguation of Arabic. It combines some of the best aspects of two previously commonly used systems for Arabic processing in a more streamlined Java implementation that is more robust, portable, and extensible, and faster than its ancestors by more than an order of magnitude.

Proceedings Article

Crowdsourcing Translation: Professional Quality from Non-Professionals

Omar F. Zaidan, +1 more

TL;DR: Proposes a set of features that model both the translations and the translators, such as country of residence, LM perplexity of the translation, edit rate from the other translations, and (optionally) calibration against professional translators.

Related Papers (5)

The MADAR Arabic Dialect Corpus and Lexicon

Houda Bouamor, +10 more

Arabic dialect identification

Omar F. Zaidan, +1 more

- 01 Mar 2014 -

Computational Linguistics

The Arabic Online Commentary Dataset: an Annotated Dataset of Informal Arabic with High Dialectal Content

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Sentence Level Dialect Identification in Arabic