Proceedings ArticleDOI

Findings of the VarDial Evaluation Campaign 2017

TL;DR: The VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects, which was organized as part of the fourth edition of the VarDial workshop at EACL’2017, is presented.
Abstract: We present the results of the VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects, which we organized as part of the fourth edition of the VarDial workshop at EACL’2017. This year, we included four shared tasks: Discriminating between Similar Languages (DSL), Arabic Dialect Identification (ADI), German Dialect Identification (GDI), and Cross-lingual Dependency Parsing (CLP). A total of 19 teams submitted runs across the four tasks, and 15 of them wrote system description papers.


Citations
Proceedings ArticleDOI
01 Jan 2017
TL;DR: The task and evaluation methodology are defined, the preparation of the data sets is described, the main results are reported and analyzed, and a brief categorization of the approaches of the participating systems is provided.
Abstract: The Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets. In 2017, the task was devoted to learning dependency parsers for a large number of languages, in a real-world setting without any gold-standard annotation on input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. In this paper, we define the task and evaluation methodology, describe how the data sets were prepared, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.

281 citations

Proceedings Article
20 Aug 2018
TL;DR: The results and the findings of the Second VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects and Indo-Aryan Language Identification are presented.
Abstract: We present the results and the findings of the Second VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects. The campaign was organized as part of the fifth edition of the VarDial workshop, collocated with COLING’2018. This year, the campaign included five shared tasks, including two task re-runs – Arabic Dialect Identification (ADI) and German Dialect Identification (GDI) –, and three new tasks – Morphosyntactic Tagging of Tweets (MTT), Discriminating between Dutch and Flemish in Subtitles (DFS), and Indo-Aryan Language Identification (ILI). A total of 24 teams submitted runs across the five shared tasks, and contributed 22 system description papers, which were included in the VarDial workshop proceedings and are referred to in this report.

113 citations


Cites background or methods from "Findings of the VarDial Evaluation ..."

  • ...Last year, in the second edition of the ADI task (Zampieri et al., 2017), we offered the input represented as (i) automatic text transcriptions generated using large-vocabulary speech recognition (LVCSR), and (ii) acoustic features....

  • ...For training and development, we released the same data as for last year’s VarDial evaluation campaign (Zampieri et al., 2017)....

  • ...The second iteration of the ADI task (Zampieri et al., 2017) introduced multi-modality for dialect identification, using i-vectors for the acoustic representation in addition to lexical features....

  • ...This year’s second edition of the VarDial Evaluation Campaign was preceded by the first edition of the campaign in 2017 with four shared tasks (Zampieri et al., 2017)....

  • ...The previous GDI task was part of the first VarDial evaluation campaign (Zampieri et al., 2017)....

Proceedings Article
01 Aug 2018
TL;DR: This paper presents the first results on a fine-grained dialect classification task covering 25 specific cities from across the Arab World, in addition to Standard Arabic, and builds several classification systems and explores a large space of features.
Abstract: Previous work on the problem of Arabic Dialect Identification typically targeted coarse-grained five dialect classes plus Standard Arabic (6-way classification). This paper presents the first results on a fine-grained dialect classification task covering 25 specific cities from across the Arab World, in addition to Standard Arabic – a very challenging task. We build several classification systems and explore a large space of features. Our results show that we can identify the exact city of a speaker at an accuracy of 67.9% for sentences with an average length of 7 words (a 9% relative error reduction over the state-of-the-art technique for Arabic dialect identification) and reach more than 90% when we consider 16 words. We also report on additional insights from a data analysis of similarity and difference across Arabic dialects.

107 citations


Cites background from "Findings of the VarDial Evaluation ..."

  • ...For instance, several evaluation campaigns were dedicated to discriminating between language varieties (Malmasi et al., 2016; Zampieri et al., 2017)....

  • ...More recently, discriminating between Arabic Dialects has been the goal of a dedicated shared task (Zampieri et al., 2017; Malmasi et al., 2016), encouraging researchers to submit systems to recognize the dialect of speech transcripts along with acoustic features for dialects of four main regions: Egyptian, Gulf, Levantine and North African, in addition to MSA....

  • ...Character n-grams have shown to be the most effective in language and dialect identification tasks (Zampieri et al., 2017)....
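The snippet above notes that character n-grams are among the most effective features for language and dialect identification. As a rough illustration of the idea, here is a minimal, dependency-free sketch of a character n-gram profile classifier — the toy data and function names are hypothetical, and actual shared-task systems used far richer features and learned classifiers:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Extract overlapping character n-grams, padding with spaces to capture word edges."""
    padded = f" {text.strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(labeled_texts, n=3):
    """Build one character n-gram frequency profile (a Counter) per class label."""
    profiles = {}
    for label, text in labeled_texts:
        profiles.setdefault(label, Counter()).update(char_ngrams(text, n))
    return profiles

def classify(text, profiles, n=3):
    """Score each class by the relative frequency of the text's n-grams in its profile."""
    grams = char_ngrams(text, n)
    def score(label):
        profile = profiles[label]
        total = sum(profile.values()) or 1
        return sum(profile[g] for g in grams) / total
    return max(profiles, key=score)
```

Because n-grams capture sub-word spelling regularities, even closely related varieties that share most of their vocabulary tend to separate on characteristic character sequences.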

Proceedings Article
01 May 2020
TL;DR: The design of CAMeL Tools and the functionalities it provides are described, including utilities for pre-processing, morphological modeling, dialect identification, named entity recognition, and sentiment analysis.
Abstract: We present CAMeL Tools, a collection of open-source tools for Arabic natural language processing in Python. CAMeL Tools currently provides utilities for pre-processing, morphological modeling, Dialect Identification, Named Entity Recognition and Sentiment Analysis. In this paper, we describe the design of CAMeL Tools and the functionalities it provides.

98 citations


Cites background from "Findings of the VarDial Evaluation ..."

  • ...More recently, discriminating between Arabic dialects has been the goal of a dedicated shared task (Zampieri et al., 2017), encouraging researchers to submit systems to recognize the dialect of speech transcripts along with acoustic features for dialects of four main regions: Egyptian, Gulf, Levantine and North African, in addition to MSA....

Proceedings ArticleDOI
01 Aug 2019
TL;DR: This shared task is the first to target a large set of dialect labels at the city and country levels and was organized as part of The Fourth Arabic Natural Language Processing Workshop, collocated with ACL 2019.
Abstract: In this paper, we present the results and findings of the MADAR Shared Task on Arabic Fine-Grained Dialect Identification. This shared task was organized as part of The Fourth Arabic Natural Language Processing Workshop, collocated with ACL 2019. The shared task includes two subtasks: the MADAR Travel Domain Dialect Identification subtask (Subtask 1) and the MADAR Twitter User Dialect Identification subtask (Subtask 2). This shared task is the first to target a large set of dialect labels at the city and country levels. The data for the shared task was created or collected under the Multi-Arabic Dialect Applications and Resources (MADAR) project. A total of 21 teams from 15 countries participated in the shared task.

94 citations


Cites background from "Findings of the VarDial Evaluation ..."

  • ...In (Salameh et al., 2018), the Corpus 6 test set corresponds to the 2,000 sentences from Corpus 26 corresponding to the Corpus 6’s five cities and MSA....

  • ...The participants were provided with a dataset from the MADAR corpus (Bouamor et al., 2018), a large-scale collection of parallel sentences in the travel domain covering the dialects of 25 cities from the Arab World in addition to MSA (Table 1 shows the list of cities)....

  • ...The goal of this subtask is to classify written Arabic sentences into one of 26 labels representing the specific city dialect of the sentences, or MSA....

  • ...An example of a 27-way parallel sentence (25 cities plus MSA and English) extracted from Corpus 26 is given in Table 2....

  • ...We refer to it as Corpus 6 (5 cities plus MSA)....

References
Proceedings Article
01 Jan 1994
TL;DR: Query expansion used terms from the top documents retrieved by a pilot search on topic terms, and much of the work involved investigating plausible methods of applying Okapi-style weighting to phrases.
Abstract: City submitted two runs each for the automatic ad hoc, very large collection track, automatic routing and Chinese track, and took part in the interactive and filtering tracks. The method used was: expansion using terms from the top documents retrieved by a pilot search on topic terms. Additional runs seem to show that we would have done better without expansion. Two runs using the method of city96al were also submitted for the Very Large Collection track. The training database and its relevant documents were partitioned into three parts. Working on a pool of terms extracted from the relevant documents for one partition, an iterative procedure added or removed terms and/or varied their weights. After each change in query content or term weights, a score was calculated by using the current query to search a second portion of the training database and evaluating the results against the corresponding set of relevant documents. Methods were compared by evaluating queries predictively against the third training partition. Queries from different methods were then merged and the results evaluated in the same way. Two runs were submitted, one based on character searching and the other on words or phrases. Much of the work involved investigating plausible methods of applying Okapi-style weighting to phrases.

2,459 citations

Proceedings Article
01 May 2012
TL;DR: The paper reports on new data sets and their features, additional annotation tools and models provided from the website, and essential interfaces and on-line services included in the OPUS project.
Abstract: This paper presents the current status of OPUS, a growing language resource of parallel corpora and related tools. The focus in OPUS is to provide freely available data sets in various formats together with basic annotation to be useful for applications in computational linguistics, translation studies and cross-linguistic corpus studies. In this paper, we report about new data sets and their features, additional annotation tools and models provided from the website and essential interfaces and on-line services included in the project.

1,559 citations


"Findings of the VarDial Evaluation ..." refers methods in this paper

  • ...For the constrained setup, we also provided parallel datasets coming from OPUS (Tiedemann, 2012) that could be used for training cross-lingual parsers in any way....


Proceedings Article
01 May 2016
TL;DR: This paper describes v1 of the universal guidelines, the underlying design principles, and the currently available treebanks for 33 languages, highlighting the need for cross-linguistically consistent annotation to support sound comparative evaluation and cross-lingual learning experiments.
Abstract: Cross-linguistically consistent annotation is necessary for sound comparative evaluation and cross-lingual learning experiments. It is also useful for multilingual system development and comparative linguistic studies. Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework. In this paper, we describe v1 of the universal guidelines, the underlying design principles, and the currently available treebanks for 33 languages.

1,111 citations


"Findings of the VarDial Evaluation ..." refers methods in this paper

  • ...We do so by simulating the resource-poor situation by selecting language pairs from the Universal Dependencies (UD) project (Nivre et al., 2016) that match the setup and come close to a realistic case for the approach (using UD release 1.4)....
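The snippet above uses Universal Dependencies treebanks, which are distributed in the CoNLL-U format: one token per line with ten tab-separated columns, a blank line between sentences, and `#`-prefixed metadata comments. As a minimal sketch (covering only the ID, FORM, UPOS, HEAD, and DEPREL columns; real UD files also contain multiword-token and empty-node lines, which this skips), a reader might look like:

```python
def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences, each a list of token dicts."""
    sentences, tokens = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # Blank line terminates the current sentence.
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        if line.startswith("#"):
            continue  # sentence-level metadata comment
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 3.1)
        tokens.append({"id": int(cols[0]), "form": cols[1], "upos": cols[3],
                       "head": int(cols[6]), "deprel": cols[7]})
    if tokens:
        sentences.append(tokens)
    return sentences
```

The HEAD and DEPREL fields recovered here are exactly what dependency-parsing evaluation compares between system output and gold annotation.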

Proceedings Article
01 Aug 2013
TL;DR: A new collection of treebanks with homogeneous syntactic dependency annotation for six languages: German, English, Swedish, Spanish, French and Korean is presented, made freely available in order to facilitate research on multilingual dependency parsing.
Abstract: We present a new collection of treebanks with homogeneous syntactic dependency annotation for six languages: German, English, Swedish, Spanish, French and Korean. To show the usefulness of such a resource, we present a case study of crosslingual transfer parsing with more reliable evaluation than has been possible before. This ‘universal’ treebank is made freely available in order to facilitate research on multilingual dependency parsing.

489 citations


"Findings of the VarDial Evaluation ..." refers methods in this paper

  • ...Transfer learning and annotation projection are popular approaches in this field and various techniques and models have been proposed in the literature in particular in connection with dependency parsing (Hwa et al., 2005; McDonald et al., 2013; Täckström et al., 2012; Tiedemann, 2014)....

Journal ArticleDOI
TL;DR: Parallel text is used to help create syntactic annotation in more languages: the English side of a parallel corpus is annotated, the analysis is projected to the second language, and a stochastic analyzer is trained on the resulting noisy annotations.
Abstract: Broad coverage, high quality parsers are available for only a handful of languages. A prerequisite for developing broad coverage parsers for more languages is the annotation of text with the desired linguistic representations (also known as “treebanking”). However, syntactic annotation is a labor intensive and time-consuming process, and it is difficult to find linguistically annotated text in sufficient quantities. In this article, we explore using parallel text to help solving the problem of creating syntactic annotation in more languages. The central idea is to annotate the English side of a parallel corpus, project the analysis to the second language, and then train a stochastic analyzer on the resulting noisy annotations. We discuss our background assumptions, describe an initial study on the “projectability” of syntactic relations, and then present two experiments in which stochastic parsers are developed with minimal human intervention via projection from English.

384 citations


"Findings of the VarDial Evaluation ..." refers methods in this paper

  • ...Transfer learning and annotation projection are popular approaches in this field and various techniques and models have been proposed in the literature in particular in connection with dependency parsing (Hwa et al., 2005; McDonald et al., 2013; Täckström et al., 2012; Tiedemann, 2014)....