Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop

doi:10.3115/1219840.1219911


Proceedings Article•DOI•


Nizar Habash¹, Owen Rambow¹•Institutions (1)

Columbia University¹

25 Jun 2005-pp 573-580

TL;DR: An approach to using a morphological analyzer for tokenizing and morphologically tagging Arabic words in one process using classifiers for individual morphological features, as well as ways of using these classifiers to choose among entries from the output of the analyzer.


Abstract: We present an approach to using a morphological analyzer for tokenizing and morphologically tagging (including part-of-speech tagging) Arabic words in one process. We learn classifiers for individual morphological features, as well as ways of using these classifiers to choose among entries from the output of the analyzer. We obtain accuracy rates on all tasks in the high nineties.
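The selection step described in the abstract can be pictured with a minimal sketch. It assumes that each analyzer entry is a dictionary of feature values and that every per-feature classifier has already produced one predicted value. The feature names, the values, and the simple agreement count are illustrative assumptions, not the paper's actual feature set or combination scheme.

```python
from typing import Dict, List

Analysis = Dict[str, str]  # hypothetical representation of one analyzer entry


def agreement(analysis: Analysis, predicted: Dict[str, str]) -> int:
    """Count the features on which an analysis matches the classifier predictions."""
    return sum(1 for feat, value in predicted.items() if analysis.get(feat) == value)


def choose_analysis(candidates: List[Analysis], predicted: Dict[str, str]) -> Analysis:
    """Return the candidate that agrees with the most predictions (first wins ties)."""
    if not candidates:
        raise ValueError("the analyzer returned no candidates")
    return max(candidates, key=lambda a: agreement(a, predicted))


# Made-up feature values for one ambiguous word (not taken from the paper).
candidates = [
    {"pos": "N", "conj": "no", "det": "no", "num": "sg"},
    {"pos": "V", "conj": "yes", "det": "no", "num": "sg"},
]
predicted = {"pos": "V", "conj": "yes", "det": "no", "num": "pl"}
best = choose_analysis(candidates, predicted)
```

Because the chosen entry carries all of its features together, one selection yields a value for every feature at once, even when an individual classifier (here the hypothetical `num` prediction) disagrees with every candidate.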


Citations


Book•

Introduction to Arabic Natural Language Processing


Nizar Habash¹•Institutions (1)

Columbia University¹

30 Aug 2010

TL;DR: The goal is to introduce Arabic linguistic phenomena and review the state-of-the-art in Arabic processing to provide system developers and researchers in natural language processing and computational linguistics with the necessary background information for working with the Arabic language.

Abstract: The Arabic language has recently become the focus of an increasing number of projects in natural language processing (NLP) and computational linguistics (CL). In this book, I try to provide NLP/CL system developers and researchers (computer scientists and linguists alike) with the necessary background information for working with Arabic. I discuss various Arabic linguistic phenomena and review the state-of-the-art in Arabic processing.

715 citations

Proceedings Article•

MADAMIRA: A Fast, Comprehensive Tool for Morphological Analysis and Disambiguation of Arabic


Arfath Pasha¹, Mohamed Al-Badrashiny², Mona Diab², Ahmed El Kholy¹, Ramy Eskander¹, Nizar Habash¹, Manoj Pooleery¹, Owen Rambow¹, Ryan M. Roth•Institutions (2)

Columbia University¹, George Washington University²

01 May 2014

TL;DR: MADAMIRA is a system for morphological analysis and disambiguation of Arabic that combines some of the best aspects of two previously commonly used systems for Arabic processing with a more streamlined Java implementation that is more robust, portable, extensible, and is faster than its ancestors by more than an order of magnitude.


Abstract: In this paper, we present MADAMIRA, a system for morphological analysis and disambiguation of Arabic that combines some of the best aspects of two previously commonly used systems for Arabic processing, MADA (Habash and Rambow, 2005; Habash et al., 2009; Habash et al., 2013) and AMIRA (Diab et al., 2007). MADAMIRA improves upon the two systems with a more streamlined Java implementation that is more robust, portable, extensible, and is faster than its ancestors by more than an order of magnitude. We also discuss an online demo (see http://nlp.ldeo.columbia.edu/madamira/) that highlights these aspects.


570 citations

Cites methods from "Arabic Tokenization, Part-of-Speech..."

...In this paper, we focus on two systems that are commonly used by researchers in Arabic NLP: MADA (Habash and Rambow, 2005; Roth et al., 2008; Habash et al., 2009; Habash et al., 2013) and AMIRA (Diab et al., 2007)....
[...]

Journal Article•DOI•

Arabic Natural Language Processing: Challenges and Solutions


Ali Farghaly¹, Khaled Shaalan²•Institutions (2)

Monterey Institute of International Studies¹, British University in Dubai²

01 Dec 2009-ACM Transactions on Asian Language Information Processing

TL;DR: The Arabic language presents researchers and developers of natural language processing (NLP) applications for Arabic text and speech with serious challenges; this article describes some of these challenges and presents some solutions that would guide current and future practitioners in the field of Arabic natural language processing (ANLP).

Abstract: The Arabic language presents researchers and developers of natural language processing (NLP) applications for Arabic text and speech with serious challenges. The purpose of this article is to describe some of these challenges and to present some solutions that would guide current and future practitioners in the field of Arabic natural language processing (ANLP). We begin with general features of the Arabic language in Sections 1, 2, and 3 and then we move to more specific properties of the language in the rest of the article. In Section 1 of this article we highlight the significance of the Arabic language today and describe its general properties. Section 2 presents the feature of Arabic Diglossia showing how the sociolinguistic aspects of the Arabic language differ from other languages. The stability of Arabic Diglossia and its implications for ANLP applications are discussed and ways to deal with this problematic property are proposed. Section 3 deals with the properties of the Arabic script and the explosion of ambiguity that results from the absence of short vowel representations and overt case markers in contemporary Arabic texts. We present in Section 4 specific features of the Arabic language such as the nonconcatenative property of Arabic morphology, Arabic as an agglutinative language, Arabic as a pro-drop language, and the challenge these properties pose to ANLP. We also present solutions that have already been adopted by some pioneering researchers in the field. In Section 5 we point out the lack of formal and explicit grammars of Modern Standard Arabic which impedes the progress of more advanced ANLP systems. In Section 6 we draw our conclusion.

481 citations

Proceedings Article•DOI•

Arabic Preprocessing Schemes for Statistical Machine Translation


Nizar Habash¹, Fatiha Sadat²•Institutions (2)

Columbia University¹, National Research Council²

04 Jun 2006

TL;DR: The results show that given large amounts of training data, splitting off only proclitics performs best, and choosing the appropriate preprocessing produces a significant increase in BLEU score if there is a change in genre between training and test data.

Abstract: In this paper, we study the effect of different word-level preprocessing decisions for Arabic on SMT quality. Our results show that given large amounts of training data, splitting off only proclitics performs best. However, for small amounts of training data, it is best to apply English-like tokenization using part-of-speech tags, and sophisticated morphological analysis and disambiguation. Moreover, choosing the appropriate preprocessing produces a significant increase in BLEU score if there is a change in genre between training and test data.

271 citations

Cites background from "Arabic Tokenization, Part-of-Speech..."

...MADA, The Morphological Analysis and Disambiguation for Arabic tool, is an off-the-shelf resource for Arabic disambiguation (Habash and Rambow, 2005)....
[...]

Introducing the Arabic WordNet project


Christiane Fellbaum, M. Alkhalifa, W. Black, P.J.T.M. Vossen

01 Jan 2006

TL;DR: The approach towards building a lexical resource in Standard Arabic will be based on the design and contents of the universally accepted Princeton WordNet and will be mappable straightforwardly onto PWN 2.0 and EuroWordNet, enabling translation on the lexical level to English and dozens of other languages.


Abstract: Arabic is the official language of hundreds of millions of people in twenty Middle East and northern African countries, and is the religious language of all Muslims of various ethnicities around the world. Surprisingly little has been done in the field of computerised language and lexical resources. It is therefore motivating to develop an Arabic (WordNet) lexical resource that discovers the richness of Arabic as described in Elkateb (2005). This paper describes our approach towards building a lexical resource in Standard Arabic. Arabic WordNet (AWN) will be based on the design and contents of the universally accepted Princeton WordNet (PWN) and will be mappable straightforwardly onto PWN 2.0 and EuroWordNet (EWN), enabling translation on the lexical level to English and dozens of other languages. Several tools specific to this task will be developed. AWN will be a linguistic resource with a deep formal semantic foundation. Besides the standard wordnet representation of senses, word meanings are defined with a machine understandable semantics in first order logic. The basis for this semantics is the Suggested Upper Merged Ontology (SUMO) and its associated domain ontologies. We will greatly extend the ontology and its set of mappings to provide formal terms and definitions equivalent to each synset.


227 citations

Cites background from "Arabic Tokenization, Part-of-Speech..."

...Arabic words in bilingual resources must be normalized and lemmatized (Diab et al. 2004, Habash and Rambow 2005) but vowels and diacritics must be maintained....
[...]
...These include English, German, Czech, Italian, Hindi (Western character set) and Chinese (traditional characters and pinyin)....
[...]


References


Proceedings Article•DOI•

Automatic tagging of Arabic text: from raw text to base phrase chunks


Mona Diab¹, Kadri Hacioglu², Dan Jurafsky¹•Institutions (2)

Stanford University¹, University of Colorado Boulder²

02 May 2004

TL;DR: A Support Vector Machine (SVM) based approach to automatically tokenize, tag and annotate base phrases (BPs) in Arabic text and adapt highly accurate tools that have been developed for English text and apply them to Arabic text.


Abstract: To date, there are no fully automated systems addressing the community's need for fundamental language processing tools for Arabic text. In this paper, we present a Support Vector Machine (SVM) based approach to automatically tokenize (segmenting off clitics), part-of-speech (POS) tag and annotate base phrases (BPs) in Arabic text. We adapt highly accurate tools that have been developed for English text and apply them to Arabic text. Using standard evaluation metrics, we report that the SVM-TOK tokenizer achieves an Fβ=1 score of 99.12, the SVM-POS tagger achieves an accuracy of 95.49%, and the SVM-BP chunker yields an Fβ=1 score of 92.08.
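For readers unfamiliar with the Fβ=1 metric quoted above: it is the harmonic mean of precision and recall. A minimal sketch follows; the function name and the counts are made up for illustration and are not the paper's data.

```python
def f_beta(true_pos: int, false_pos: int, false_neg: int, beta: float = 1.0) -> float:
    """F-beta score computed from raw counts; beta=1 weights precision and recall equally."""
    if true_pos == 0:
        return 0.0  # no correct predictions: precision and recall are both zero
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative counts only: 90 correct items, 10 spurious, 10 missed.
score = f_beta(true_pos=90, false_pos=10, false_neg=10)
```

With equal precision and recall, as in these made-up counts, the F1 score equals both of them.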


368 citations

"Arabic Tokenization, Part-of-Speech..." refers background or methods or result in this paper

...Diab et al. (2004) report a score of 95.5% for all tokens on a test corpus drawn from ATB1, thus their figure is comparable to our score of 97.6%....
[...]
...The only work on Arabic tagging that uses a corpus for training and evaluation (that we are aware of), (Diab et al., 2004), does not use a morphological analyzer....
[...]
...We map our best solutions as chosen by the Maj model in Section 6 to the English tagset, and we furthermore assume (as do Diab et al. (2004)) the gold standard tokenization....
[...]
...While there have been many publications on computational morphological analysis for Arabic (see (Al-Sughaiyer and Al-Kharashi, 2004) for an excellent overview), to our knowledge only Diab et al. (2004) perform a large-scale corpus-based evaluation of their approach....
[...]

Proceedings Article•

Learning trees and rules with set-valued features


William W. Cohen¹•Institutions (1)

AT&T Labs¹

04 Aug 1996

TL;DR: It is argued that many decision tree and rule learning algorithms can be easily extended to set-valued features, and it is shown by example that many real-world learning problems can be efficiently and naturally represented with set-valued features.


Abstract: In most learning systems examples are represented as fixed-length "feature vectors", the components of which are either real numbers or nominal values. We propose an extension of the feature-vector representation that allows the value of a feature to be a set of strings; for instance, to represent a small white and black dog with the nominal features size and species and the set-valued feature color, one might use a feature vector with size=small, species=canis-familiaris and color={white, black}. Since we make no assumptions about the number of possible set elements, this extension of the traditional feature-vector representation is closely connected to Blum's "infinite attribute" representation. We argue that many decision tree and rule learning algorithms can be easily extended to set-valued features. We also show by example that many real-world learning problems can be efficiently and naturally represented with set-valued features; in particular, text categorization problems and problems that arise in propositionalizing first-order representations lend themselves to set-valued features.
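The dog example from the abstract can be written down as a small sketch. The dictionary layout and the `contains` helper are assumptions made for illustration; they are not taken from the paper or from any rule learner's actual interface.

```python
from typing import Dict, FrozenSet, Union

# A feature value is nominal (str), real-valued (float), or a set of strings.
FeatureValue = Union[str, float, FrozenSet[str]]
Example = Dict[str, FeatureValue]

# The small white and black dog described in the abstract.
dog: Example = {
    "size": "small",
    "species": "canis-familiaris",
    "color": frozenset({"white", "black"}),
}


def contains(example: Example, feature: str, element: str) -> bool:
    """Membership test on a set-valued feature; False for other feature types."""
    value = example.get(feature)
    return isinstance(value, frozenset) and element in value
```

A learner extended to such features could use membership tests like `contains(dog, "color", "white")` as rule conditions, without fixing the set of possible elements in advance.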


281 citations

"Arabic Tokenization, Part-of-Speech..." refers methods in this paper

...We use Ripper (Cohen, 1996) to learn a rule-based classifier (Rip) to determine whether an analysis from the morphological analyzer is a “good” or a “bad” analysis....
[...]
...(The reason we use Ripper here is because it allows us to learn lower bounds for the confidence score features, which are real-valued.)...
[...]

Journal Article•DOI•

Arabic morphological analysis techniques: a comprehensive survey


Imad A. Al-Sughaiyer, Ibrahim A. Al-Kharashi

01 Feb 2004-Journal of the Association for Information Science and Technology

TL;DR: This paper introduces, classifies, and surveys Arabic morphological analysis techniques, and summarizes and organizes the information available in the literature in an attempt to motivate researchers to look into these techniques and try to develop more advanced ones.

Abstract: After several decades of heavy research activity on English stemmers, Arabic morphological analysis techniques have become a popular area of research. The Arabic language is one of the Semitic languages; it exhibits a very systematic but complex morphological structure based on root-pattern schemes. As a consequence, survey of such techniques proves to be more necessary. The aim of this paper is to summarize and organize the information available in the literature in an attempt to motivate researchers to look into these techniques and try to develop more advanced ones. This paper introduces, classifies, and surveys Arabic morphological analysis techniques. Furthermore, conclusions, open areas, and future directions are provided at the end.

231 citations

Proceedings Article•DOI•

Fast Methods for Kernel-Based Text Analysis


Taku Kudo¹, Yuji Matsumoto¹•Institutions (1)

Nara Institute of Science and Technology¹

07 Jul 2003

TL;DR: A Basket Mining algorithm is extended to convert a kernel-based classifier into a simple and fast linear classifier; experimental results show that the new classifiers are about 30 to 300 times faster than the standard kernel-based classifiers.

Abstract: Kernel-based learning (e.g., Support Vector Machines) has been successfully applied to many hard problems in Natural Language Processing (NLP). In NLP, although feature combinations are crucial to improving performance, they are heuristically selected. Kernel methods change this situation. The merit of the kernel methods is that effective feature combination is implicitly expanded without loss of generality and without increasing the computational costs. Kernel-based text analysis shows an excellent performance in terms of accuracy; however, these methods are usually too slow to apply to large-scale text analysis. In this paper, we extend a Basket Mining algorithm to convert a kernel-based classifier into a simple and fast linear classifier. Experimental results on English BaseNP Chunking, Japanese Word Segmentation and Japanese Dependency Parsing show that our new classifiers are about 30 to 300 times faster than the standard kernel-based classifiers.

228 citations

"Arabic Tokenization, Part-of-Speech..." refers methods in this paper

...We use Yamcha (Kudo and Matsumoto, 2003), an implementation of support vector machines which includes Viterbi decoding. As training features, we use two sets....
[...]

Proceedings Article•DOI•

Building a Shallow Arabic Morphological Analyser in One Day


Kareem Darwish¹•Institutions (1)

University of Maryland, College Park¹

11 Jul 2002

TL;DR: The paper presents a rapid method of developing a shallow Arabic morphological analyzer based on automatically derived rules and statistics that will only be concerned with generating the possible roots of any given Arabic word.


Abstract: The paper presents a rapid method of developing a shallow Arabic morphological analyzer. The analyzer will only be concerned with generating the possible roots of any given Arabic word. The analyzer is based on automatically derived rules and statistics. For evaluation, the analyzer is compared to a commercially available Arabic Morphological Analyzer.


189 citations