Author

Nizar Habash

Bio: Nizar Habash is an academic researcher from New York University Abu Dhabi. He has contributed to research in topics including machine translation and Modern Standard Arabic, has an h-index of 52, and has co-authored 279 publications receiving 9,818 citations. Previous affiliations of Nizar Habash include Birzeit University and Columbia University.


Papers
Book
Nizar Habash
30 Aug 2010
TL;DR: The goal is to provide system developers and researchers in natural language processing and computational linguistics with the necessary background for working with the Arabic language, by introducing Arabic linguistic phenomena and reviewing the state of the art in Arabic processing.
Abstract: The Arabic language has recently become the focus of an increasing number of projects in natural language processing (NLP) and computational linguistics (CL). In this book, I try to provide NLP/CL system developers and researchers (computer scientists and linguists alike) with the necessary background information for working with Arabic. I discuss various Arabic linguistic phenomena and review the state-of-the-art in Arabic processing.

715 citations

Proceedings Article
01 May 2014
TL;DR: MADAMIRA is a system for morphological analysis and disambiguation of Arabic that combines some of the best aspects of two widely used earlier systems for Arabic processing with a more streamlined Java implementation that is more robust, portable, and extensible, and is faster than its predecessors by more than an order of magnitude.
Abstract: In this paper, we present MADAMIRA, a system for morphological analysis and disambiguation of Arabic that combines some of the best aspects of two previously commonly used systems for Arabic processing, MADA (Habash and Rambow, 2005; Habash et al., 2009; Habash et al., 2013) and AMIRA (Diab et al., 2007). MADAMIRA improves upon the two systems with a more streamlined Java implementation that is more robust, portable, extensible, and is faster than its ancestors by more than an order of magnitude. We also discuss an online demo (see http://nlp.ldeo.columbia.edu/madamira/) that highlights these aspects.

570 citations

Proceedings Article
25 Jun 2005
TL;DR: An approach to using a morphological analyzer for tokenizing and morphologically tagging (including part-of-speech tagging) Arabic words in a single process, learning classifiers for individual morphological features as well as ways of using these classifiers to choose among entries in the analyzer's output.
Abstract: We present an approach to using a morphological analyzer for tokenizing and morphologically tagging (including part-of-speech tagging) Arabic words in one process. We learn classifiers for individual morphological features, as well as ways of using these classifiers to choose among entries from the output of the analyzer. We obtain accuracy rates on all tasks in the high nineties.
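The selection step lends itself to a compact illustration: independently trained per-feature classifiers each predict a value for one morphological feature, and the analyzer entry most consistent with those predictions is chosen. Below is a minimal Python sketch of that idea, with hypothetical feature names and values (not the authors' actual code or feature set):

```python
def score(analysis, predictions):
    """Count how many classifier-predicted feature values an analysis matches."""
    return sum(analysis.get(feat) == val for feat, val in predictions.items())

def choose(analyses, predictions):
    """Pick the analyzer output entry most consistent with the classifiers."""
    return max(analyses, key=lambda a: score(a, predictions))

# Candidate analyses for one (hypothetical) ambiguous Arabic word
analyses = [
    {"pos": "noun", "gender": "fem", "number": "sg"},
    {"pos": "verb", "gender": "fem", "number": "sg"},
    {"pos": "noun", "gender": "masc", "number": "pl"},
]
# Per-feature predictions from independently trained classifiers
predictions = {"pos": "noun", "gender": "fem", "number": "sg"}
print(choose(analyses, predictions))  # first entry: all three features agree
```

The sketch shows only the agreement-counting core; weighting the individual classifiers and breaking ties is where a real system earns its accuracy.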

501 citations

Book Chapter
01 Jan 2007
TL;DR: This chapter introduces the transliteration scheme used to represent Arabic characters in this book and presents guidelines for Arabic pronunciation using this transliteration scheme.
Abstract: This chapter introduces the transliteration scheme used to represent Arabic characters in this book. The scheme is a one-to-one transliteration of the Arabic script that is complete, easy to read, and consistent with Arabic computer encodings. We present guidelines for Arabic pronunciation using this transliteration scheme and discuss various idiosyncrasies of Arabic orthography.
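To make the one-to-one, reversible property concrete, here is a toy romanization table in Python using a few symbols from the widely used Buckwalter convention; this is illustrative only, not necessarily the exact scheme defined in the book:

```python
# Toy one-to-one romanization table (Buckwalter-style symbols); illustrative
# only, not necessarily the exact scheme defined in this chapter.
TABLE = {
    "\u0627": "A",  # ا  alif
    "\u0628": "b",  # ب  ba
    "\u062a": "t",  # ت  ta
    "\u062b": "v",  # ث  tha
    "\u0634": "$",  # ش  shin
}

def transliterate(text):
    # One character in, one symbol out: the mapping stays reversible,
    # which is the point of a one-to-one transliteration.
    return "".join(TABLE.get(ch, ch) for ch in text)

print(transliterate("\u0634\u0628\u0627\u0628"))  # $bAb (شباب, "youth")
```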

322 citations

Proceedings Article
01 Jan 2017
TL;DR: The task and evaluation methodology are defined, the preparation of the data sets is described, the main results are reported and analyzed, and a brief categorization of the participating systems' approaches is provided.
Abstract: The Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets. In 2017, the task was devoted to learning dependency parsers for a large number of languages, in a real-world setting without any gold-standard annotation on input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. In this paper, we define the task and evaluation methodology, describe how the data sets were prepared, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.
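The shared task's primary ranking metric was the labeled attachment score (LAS): the fraction of words whose predicted syntactic head and dependency label are both correct. A minimal sketch with toy data (not the official evaluation script, which also handles tokenization mismatches):

```python
def las(gold, pred):
    """Labeled attachment score: fraction of words whose head AND label match."""
    assert len(gold) == len(pred)
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# (head, deprel) per word for a five-word sentence; heads are 1-based, 0 = root
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (5, "det"), (3, "nmod")]
pred = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "det"), (3, "nmod")]
print(las(gold, pred))  # 0.8: one word has the right label but the wrong head
```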

281 citations


Cited by
Journal Article
TL;DR: It is argued that weighted, alpha-like coefficients, traditionally less used than kappa-like measures in computational linguistics, may be more appropriate for many corpus annotation tasks, but that their use makes the interpretation of the value of the coefficient even harder.
Abstract: This article is a survey of methods for measuring agreement among corpus annotators. It exposes the mathematics and underlying assumptions of agreement coefficients, covering Krippendorff's alpha as well as Scott's pi and Cohen's kappa; discusses the use of coefficients in several annotation tasks; and argues that weighted, alpha-like coefficients, traditionally less used than kappa-like measures in computational linguistics, may be more appropriate for many corpus annotation tasks, but that their use makes the interpretation of the value of the coefficient even harder.
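To ground the terminology, here is a minimal Python sketch of Cohen's kappa on toy labels; the weighted, alpha-like coefficients the article favors generalize this by weighting different kinds of disagreement instead of counting all disagreements equally:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement uses each annotator's own label distribution;
    # Scott's pi would instead pool the two distributions.
    dist_a, dist_b = Counter(labels_a), Counter(labels_b)
    expected = sum(dist_a[k] * dist_b[k] for k in dist_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two annotators labeling ten items with three categories
a = ["pos", "pos", "neg", "neg", "pos", "neu", "neg", "pos", "neu", "pos"]
b = ["pos", "neg", "neg", "neg", "pos", "neu", "pos", "pos", "neu", "pos"]
print(round(cohens_kappa(a, b), 3))  # 0.677: observed 0.8 vs. chance 0.38
```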

1,324 citations

Proceedings Article
01 Aug 2017
TL;DR: Crowdsourcing on Amazon Mechanical Turk was used to label a large Twitter training dataset, along with additional test sets of Twitter and SMS messages, for both subtasks: A, an expression-level subtask, and B, a message-level subtask.
Abstract: This paper describes the fifth year of the Sentiment Analysis in Twitter task. SemEval-2017 Task 4 continues with a rerun of the subtasks of SemEval-2016 Task 4, which include identifying the overall sentiment of the tweet, sentiment towards a topic with classification on a two-point and on a five-point ordinal scale, and quantification of the distribution of sentiment towards a topic across a number of tweets: again on a two-point and on a five-point ordinal scale. Compared to 2016, we made two changes: (i) we introduced a new language, Arabic, for all subtasks, and (ii) we made available information from the profiles of the Twitter users who posted the target tweets. The task continues to be very popular, with a total of 48 teams participating this year.

1,107 citations

Proceedings Article
16 Mar 2020
TL;DR: This work introduces Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Stanza features a language-agnostic, fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition.
Abstract: We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. We have trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora, and show that the same neural architecture generalizes well and achieves competitive performance on all languages tested. Additionally, Stanza includes a native Python interface to the widely used Java Stanford CoreNLP software, which further extends its functionality to cover other tasks such as coreference resolution and relation extraction. Source code, documentation, and pretrained models for 66 languages are available at https://stanfordnlp.github.io/stanza/.
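A minimal usage sketch of the pipeline described above (processor list abbreviated; NER and other processors can be added the same way):

```python
import stanza

# One-time model download, then a fully neural pipeline for English.
stanza.download("en")
nlp = stanza.Pipeline("en", processors="tokenize,mwt,pos,lemma,depparse")

doc = nlp("Stanza parses raw text into Universal Dependencies trees.")
for sent in doc.sentences:
    for word in sent.words:
        # head is the 1-based index of the syntactic parent (0 = root)
        print(word.text, word.lemma, word.upos, word.head, word.deprel)
```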

1,040 citations

Proceedings Article
01 Jun 2013
TL;DR: A simple log-linear reparameterization of IBM Model 2 that overcomes problems arising from Model 1's strong assumptions and Model 2's overparameterization is presented.
Abstract: We present a simple log-linear reparameterization of IBM Model 2 that overcomes problems arising from Model 1's strong assumptions and Model 2's overparameterization. Efficient inference, likelihood evaluation, and parameter estimation algorithms are provided. Training the model is consistently ten times faster than Model 4. On three large-scale translation tasks, systems built using our alignment model outperform IBM Model 4. An open-source implementation of the alignment model described in this paper is available from http://github.com/clab/fast_align.
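For context, the reparameterization replaces Model 2's full position-distortion table with a single tension parameter λ that favors the diagonal, plus a null-alignment probability p_0. A sketch of its form, following the paper's notation, with i ranging over the m target positions and j over the n source positions:

```latex
% One tension parameter \lambda and a null-alignment probability p_0
% replace Model 2's full table over alignment positions.
\[
h(i, j, m, n) = -\left|\frac{i}{m} - \frac{j}{n}\right|
\]
\[
\delta(a_i = j \mid i, m, n) =
\begin{cases}
  p_0 & \text{if } j = 0,\\[4pt]
  (1 - p_0)\,\dfrac{e^{\lambda\, h(i, j, m, n)}}{Z_\lambda(i, m, n)} & \text{if } 1 \le j \le n.
\end{cases}
\]
```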

1,006 citations