Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop
Citations
715 citations
570 citations
Cites methods from "Arabic Tokenization, Part-of-Speech..."
...Abstract In this paper, we present MADAMIRA, a system for morphological analysis and disambiguation of Arabic that combines some of the best aspects of two previously commonly used systems for Arabic processing, MADA (Habash and Rambow, 2005; Habash et al., 2009; Habash et al., 2013) and AMIRA (Diab et al....
[...]
...In this paper, we focus on two systems that are commonly used by researchers in Arabic NLP: MADA (Habash and Rambow, 2005; Roth et al., 2008; Habash et al., 2009; Habash et al., 2013) and AMIRA (Diab et al., 2007)....
[...]
481 citations
271 citations
Cites background from "Arabic Tokenization, Part-of-Speech..."
...MADA, The Morphological Analysis and Disambiguation for Arabic tool, is an off-the-shelf resource for Arabic disambiguation (Habash and Rambow, 2005)....
[...]
227 citations
Cites background from "Arabic Tokenization, Part-of-Speech..."
...Arabic words in bilingual resources must be normalized and lemmatized (Diab et al. 2004, Habash and Rambow 2005) but vowels and diacritics must be maintained....
[...]
...These include English, German, Czech, Italian, Hindi (Western character set) and Chinese (traditional characters and pinyin)....
[...]
References
368 citations
"Arabic Tokenization, Part-of-Speech..." refers background or methods or result in this paper
...Diab et al. (2004) report a score of 95.5% for all tokens on a test corpus drawn from ATB1, thus their figure is comparable to our score of 97.6%....
[...]
...The only work on Arabic tagging that uses a corpus for training and evaluation (that we are aware of), (Diab et al., 2004), does not use a morphological analyzer....
[...]
...We map our best solutions as chosen by the Maj model in Section 6 to the English tagset, and we furthermore assume (as do Diab et al. (2004)) the gold standard tokenization....
[...]
...While there have been many publications on computational morphological analysis for Arabic (see (Al-Sughaiyer and Al-Kharashi, 2004) for an excellent overview), to our knowledge only Diab et al. (2004) perform a large-scale corpus-based evaluation of their approach....
[...]
281 citations
"Arabic Tokenization, Part-of-Speech..." refers methods in this paper
...• We use Ripper (Cohen, 1996) to learn a rule-based classifier (Rip) to determine whether an analysis from the morphological analyzer is a “good” or a “bad” analysis....
[...]
...(The reason we use Ripper here is because it allows us to learn lower bounds for the confidence score features, which are real-valued.)...
[...]
231 citations
228 citations
"Arabic Tokenization, Part-of-Speech..." refers methods in this paper
...We use Yamcha (Kudo and Matsumoto, 2003), an implementation of support vector machines which includes Viterbi decoding. As training features, we use two sets....
[...]
189 citations
"Arabic Tokenization, Part-of-Speech..." refers background in this paper
...Darwish (2003) discusses unsupervised identification of roots; as mentioned above, we leave root identification to future work....
[...]