Proceedings Article

Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop

25 Jun 2005, pp. 573-580
TL;DR: An approach to using a morphological analyzer for tokenizing and morphologically tagging Arabic words in one process using classifiers for individual morphological features, as well as ways of using these classifiers to choose among entries from the output of the analyzer.
Abstract: We present an approach to using a morphological analyzer for tokenizing and morphologically tagging (including part-of-speech tagging) Arabic words in one process. We learn classifiers for individual morphological features, as well as ways of using these classifiers to choose among entries from the output of the analyzer. We obtain accuracy rates on all tasks in the high nineties.
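
The selection step mentioned in the abstract (using per-feature classifiers to choose among analyzer entries) can be pictured with a small Python sketch. This is only an illustration under assumptions: the feature names (`pos`, `conj`, `det`, `num`), the toy analyses, and the plain agreement-count rule are hypothetical and are not taken from the paper's actual models.

```python
# Hypothetical sketch: pick the analyzer entry that agrees with the most
# per-feature classifier predictions. Feature names and values are invented
# for illustration only.

def choose_analysis(analyses, predictions):
    """Return the analysis matching the largest number of predicted features."""
    def agreement(analysis):
        # Count the features whose predicted value equals the analysis value.
        return sum(
            1 for feature, value in predictions.items()
            if analysis.get(feature) == value
        )
    # max() keeps the first candidate when several are tied.
    return max(analyses, key=agreement)


# Candidate analyses for one word, as a morphological analyzer might list them.
analyses = [
    {"pos": "N", "conj": "no", "det": "yes", "num": "sg"},
    {"pos": "V", "conj": "no", "det": "no", "num": "sg"},
    {"pos": "N", "conj": "yes", "det": "no", "num": "pl"},
]

# One predicted value per morphological feature (one classifier each).
predictions = {"pos": "N", "conj": "no", "det": "yes", "num": "pl"}

best = choose_analysis(analyses, predictions)
```

Because the choice is restricted to entries the analyzer produced, the selected analysis is always internally consistent even when individual classifiers disagree with each other.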
Citations
Book
Nizar Habash
30 Aug 2010
TL;DR: The goal is to introduce Arabic linguistic phenomena and review the state-of-the-art in Arabic processing to provide system developers and researchers in natural language processing and computational linguistics with the necessary background information for working with the Arabic language.
Abstract: The Arabic language has recently become the focus of an increasing number of projects in natural language processing (NLP) and computational linguistics (CL). In this book, I try to provide NLP/CL system developers and researchers (computer scientists and linguists alike) with the necessary background information for working with Arabic. I discuss various Arabic linguistic phenomena and review the state-of-the-art in Arabic processing.

715 citations

Proceedings Article
01 May 2014
TL;DR: MADAMIRA is a system for morphological analysis and disambiguation of Arabic that combines some of the best aspects of two previously commonly used systems for Arabic processing with a more streamlined Java implementation that is more robust, portable, extensible, and is faster than its ancestors by more than an order of magnitude.
Abstract: In this paper, we present MADAMIRA, a system for morphological analysis and disambiguation of Arabic that combines some of the best aspects of two previously commonly used systems for Arabic processing, MADA (Habash and Rambow, 2005; Habash et al., 2009; Habash et al., 2013) and AMIRA (Diab et al., 2007). MADAMIRA improves upon the two systems with a more streamlined Java implementation that is more robust, portable, extensible, and is faster than its ancestors by more than an order of magnitude. We also discuss an online demo (see http://nlp.ldeo.columbia.edu/madamira/) that highlights these aspects.

570 citations


Cites methods from "Arabic Tokenization, Part-of-Speech..."

  • ...In this paper, we focus on two systems that are commonly used by researchers in Arabic NLP: MADA (Habash and Rambow, 2005; Roth et al., 2008; Habash et al., 2009; Habash et al., 2013) and AMIRA (Diab et al., 2007)....

    [...]

Journal Article
TL;DR: The Arabic language presents researchers and developers of natural language processing (NLP) applications for Arabic text and speech with serious challenges; this article describes some of these challenges and presents some solutions that would guide current and future practitioners in the field of Arabic natural language processing (ANLP).
Abstract: The Arabic language presents researchers and developers of natural language processing (NLP) applications for Arabic text and speech with serious challenges. The purpose of this article is to describe some of these challenges and to present some solutions that would guide current and future practitioners in the field of Arabic natural language processing (ANLP). We begin with general features of the Arabic language in Sections 1, 2, and 3 and then we move to more specific properties of the language in the rest of the article. In Section 1 of this article we highlight the significance of the Arabic language today and describe its general properties. Section 2 presents the feature of Arabic Diglossia showing how the sociolinguistic aspects of the Arabic language differ from those of other languages. The stability of Arabic Diglossia and its implications for ANLP applications are discussed and ways to deal with this problematic property are proposed. Section 3 deals with the properties of the Arabic script and the explosion of ambiguity that results from the absence of short vowel representations and overt case markers in contemporary Arabic texts. We present in Section 4 specific features of the Arabic language such as the nonconcatenative property of Arabic morphology, Arabic as an agglutinative language, Arabic as a pro-drop language, and the challenge these properties pose to ANLP. We also present solutions that have already been adopted by some pioneering researchers in the field. In Section 5 we point to the lack of formal and explicit grammars of Modern Standard Arabic, which impedes the progress of more advanced ANLP systems. In Section 6 we draw our conclusion.

481 citations

Proceedings Article
04 Jun 2006
TL;DR: The results show that given large amounts of training data, splitting off only proclitics performs best, and choosing the appropriate preprocessing produces a significant increase in BLEU score if there is a change in genre between training and test data.
Abstract: In this paper, we study the effect of different word-level preprocessing decisions for Arabic on SMT quality. Our results show that given large amounts of training data, splitting off only proclitics performs best. However, for small amounts of training data, it is best to apply English-like tokenization using part-of-speech tags, and sophisticated morphological analysis and disambiguation. Moreover, choosing the appropriate preprocessing produces a significant increase in BLEU score if there is a change in genre between training and test data.

271 citations


Cites background from "Arabic Tokenization, Part-of-Speech..."

  • ...MADA, The Morphological Analysis and Disambiguation for Arabic tool, is an off-the-shelf resource for Arabic disambiguation (Habash and Rambow, 2005)....

    [...]

01 Jan 2006
TL;DR: The approach towards building a lexical resource in Standard Arabic will be based on the design and contents of the universally accepted Princeton WordNet and will be mappable straightforwardly onto PWN 2.0 and EuroWordNet, enabling translation on the lexical level to English and dozens of other languages.
Abstract: Arabic is the official language of hundreds of millions of people in twenty Middle East and northern African countries, and is the religious language of all Muslims of various ethnicities around the world. Surprisingly little has been done in the field of computerised language and lexical resources. It is therefore motivating to develop an Arabic (WordNet) lexical resource that discovers the richness of Arabic as described in Elkateb (2005). This paper describes our approach towards building a lexical resource in Standard Arabic. Arabic WordNet (AWN) will be based on the design and contents of the universally accepted Princeton WordNet (PWN) and will be mappable straightforwardly onto PWN 2.0 and EuroWordNet (EWN), enabling translation on the lexical level to English and dozens of other languages. Several tools specific to this task will be developed. AWN will be a linguistic resource with a deep formal semantic foundation. Besides the standard wordnet representation of senses, word meanings are defined with a machine understandable semantics in first order logic. The basis for this semantics is the Suggested Upper Merged Ontology (SUMO) and its associated domain ontologies. We will greatly extend the ontology and its set of mappings to provide formal terms and definitions equivalent to each synset.

227 citations


Cites background from "Arabic Tokenization, Part-of-Speech..."

  • ...Arabic words in bilingual resources must be normalized and lemmatized (Diab et al. 2004, Habash and Rambow 2005) but vowels and diacritics must be maintained....

    [...]

  • ...These include English, German, Czech, Italian, Hindi (Western character set) and Chinese (traditional characters and pinyin)....

    [...]

References
Proceedings Article
02 May 2004
TL;DR: A Support Vector Machine (SVM) based approach to automatically tokenize, tag and annotate base phrases (BPs) in Arabic text and adapt highly accurate tools that have been developed for English text and apply them to Arabic text.
Abstract: To date, there are no fully automated systems addressing the community's need for fundamental language processing tools for Arabic text. In this paper, we present a Support Vector Machine (SVM) based approach to automatically tokenize (segmenting off clitics), part-of-speech (POS) tag and annotate base phrases (BPs) in Arabic text. We adapt highly accurate tools that have been developed for English text and apply them to Arabic text. Using standard evaluation metrics, we report that the SVM-TOK tokenizer achieves an F-score (β=1) of 99.12, the SVM-POS tagger achieves an accuracy of 95.49%, and the SVM-BP chunker yields an F-score (β=1) of 92.08.

368 citations


"Arabic Tokenization, Part-of-Speech..." refers background or methods or result in this paper

  • ...Diab et al. (2004) report a score of 95.5% for all tokens on a test corpus drawn from ATB1, thus their figure is comparable to our score of 97.6%....

    [...]

  • ...The only work on Arabic tagging that uses a corpus for training and evaluation (that we are aware of), (Diab et al., 2004), does not use a morphological analyzer....

    [...]

  • ...We map our best solutions as chosen by the Maj model in Section 6 to the English tagset, and we furthermore assume (as do Diab et al. (2004)) the gold standard tokenization....

    [...]

  • ...While there have been many publications on computational morphological analysis for Arabic (see (Al-Sughaiyer and Al-Kharashi, 2004) for an excellent overview), to our knowledge only Diab et al. (2004) perform a large-scale corpus-based evaluation of their approach....

    [...]

Proceedings Article
William W. Cohen
04 Aug 1996
TL;DR: It is argued that many decision tree and rule learning algorithms can be easily extended to set-valued features, and it is shown by example that many real-world learning problems can be efficiently and naturally represented with set-valued features.
Abstract: In most learning systems examples are represented as fixed-length "feature vectors", the components of which are either real numbers or nominal values. We propose an extension of the feature-vector representation that allows the value of a feature to be a set of strings; for instance, to represent a small white and black dog with the nominal features size and species and the set-valued feature color, one might use a feature vector with size=small, species=canis-familiaris and color={white, black}. Since we make no assumptions about the number of possible set elements, this extension of the traditional feature-vector representation is closely connected to Blum's "infinite attribute" representation. We argue that many decision tree and rule learning algorithms can be easily extended to set-valued features. We also show by example that many real-world learning problems can be efficiently and naturally represented with set-valued features; in particular, text categorization problems and problems that arise in propositionalizing first-order representations lend themselves to set-valued features.
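
The dog example from the abstract can be written out as a short Python sketch. The representation below (a dict whose `color` value is a Python set) and the helper `rule_matches` are assumptions made for illustration; they show one plausible way a rule condition could test membership in a set-valued feature and are not the implementation described in the paper.

```python
# Hypothetical sketch of a feature vector with one set-valued feature.
dog = {
    "size": "small",                 # nominal feature
    "species": "canis-familiaris",   # nominal feature
    "color": {"white", "black"},     # set-valued feature
}


def rule_matches(example, conditions):
    """Check a conjunction of (feature, required_value) conditions.

    Nominal features are tested for equality; set-valued features are
    tested for membership of the required element.
    """
    for feature, required in conditions:
        value = example[feature]
        if isinstance(value, (set, frozenset)):
            if required not in value:
                return False
        elif value != required:
            return False
    return True


# A rule such as "size = small AND white in color" matches the example.
matched = rule_matches(dog, [("size", "small"), ("color", "white")])
```

Membership tests of this kind make no assumption about how many elements a set may contain, which is what lets the same representation cover bag-of-words text features.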

281 citations


"Arabic Tokenization, Part-of-Speech..." refers methods in this paper

  • ...We use Ripper (Cohen, 1996) to learn a rule-based classifier (Rip) to determine whether an analysis from the morphological analyzer is a “good” or a “bad” analysis....

    [...]

  • ...(The reason we use Ripper here is because it allows us to learn lower bounds for the confidence score features, which are real-valued.)...

    [...]

Journal Article
TL;DR: This paper introduces, classifies, and surveys Arabic morphological analysis techniques, and summarizes and organizes the information available in the literature in an attempt to motivate researchers to look into these techniques and try to develop more advanced ones.
Abstract: After several decades of heavy research activity on English stemmers, Arabic morphological analysis techniques have become a popular area of research. The Arabic language is one of the Semitic languages; it exhibits a very systematic but complex morphological structure based on root-pattern schemes. As a consequence, a survey of such techniques is all the more necessary. The aim of this paper is to summarize and organize the information available in the literature in an attempt to motivate researchers to look into these techniques and try to develop more advanced ones. This paper introduces, classifies, and surveys Arabic morphological analysis techniques. Furthermore, conclusions, open areas, and future directions are provided at the end.

231 citations

Proceedings Article
07 Jul 2003
TL;DR: A Basket Mining algorithm is extended to convert a kernel-based classifier into a simple and fast linear classifier; experimental results show that the new classifiers are about 30 to 300 times faster than the standard kernel-based classifiers.
Abstract: Kernel-based learning (e.g., Support Vector Machines) has been successfully applied to many hard problems in Natural Language Processing (NLP). In NLP, although feature combinations are crucial to improving performance, they are heuristically selected. Kernel methods change this situation. The merit of the kernel methods is that effective feature combination is implicitly expanded without loss of generality and without increasing the computational costs. Kernel-based text analysis shows an excellent performance in terms of accuracy; however, these methods are usually too slow to apply to large-scale text analysis. In this paper, we extend a Basket Mining algorithm to convert a kernel-based classifier into a simple and fast linear classifier. Experimental results on English BaseNP Chunking, Japanese Word Segmentation and Japanese Dependency Parsing show that our new classifiers are about 30 to 300 times faster than the standard kernel-based classifiers.

228 citations


"Arabic Tokenization, Part-of-Speech..." refers methods in this paper

  • ...We use Yamcha (Kudo and Matsumoto, 2003), an implementation of support vector machines which includes Viterbi decoding. As training features, we use two sets....

    [...]

Proceedings Article
11 Jul 2002
TL;DR: The paper presents a rapid method of developing a shallow Arabic morphological analyzer based on automatically derived rules and statistics that will only be concerned with generating the possible roots of any given Arabic word.
Abstract: The paper presents a rapid method of developing a shallow Arabic morphological analyzer. The analyzer will only be concerned with generating the possible roots of any given Arabic word. The analyzer is based on automatically derived rules and statistics. For evaluation, the analyzer is compared to a commercially available Arabic Morphological Analyzer.

189 citations


"Arabic Tokenization, Part-of-Speech..." refers background in this paper

  • ...Darwish (2003) discusses unsupervised identification of roots; as mentioned above, we leave root identification to future work....

    [...]