Topic

Quranic Arabic Corpus

About: Quranic Arabic Corpus is a research topic. Over its lifetime, 25 publications have been published within this topic, receiving 419 citations.

Papers
Proceedings Article
01 May 2010
TL;DR: A multi-stage approach to the morphological annotation of Quranic Arabic is discussed, comprising automatic morphological tagging using diacritic edit-distance, two-pass manual verification, and online collaborative annotation.
Abstract: The Quranic Arabic Corpus (http://corpus.quran.com) is an annotated linguistic resource with multiple layers of annotation including morphological segmentation, part-of-speech tagging, and syntactic analysis using dependency grammar. The motivation behind this work is to produce a resource that enables further analysis of the Quran, the 1,400 year old central religious text of Islam. This paper describes a new approach to morphological annotation of Quranic Arabic, a genre difficult to compare with other forms of Arabic. Processing Quranic Arabic is a unique challenge from a computational point of view, since the vocabulary and spelling differ from Modern Standard Arabic. The Quranic Arabic Corpus differs from other Arabic computational resources in adopting a tagset that closely follows traditional Arabic grammar. We made this decision in order to leverage a large body of existing historical grammatical analysis, and to encourage online collaborative annotation. In this paper, we discuss how the unique challenge of morphological annotation of Quranic Arabic is solved using a multi-stage approach. The different stages include automatic morphological tagging using diacritic edit-distance, two-pass manual verification, and online collaborative annotation. This process is evaluated to validate the appropriateness of the chosen methodology.

104 citations
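
The abstract above names "diacritic edit-distance" as the basis for automatic tagging but does not spell out the procedure. The following is only a minimal sketch of one plausible reading, not the authors' method: it assumes a Quranic token is matched against candidate vocalized forms produced by some analyzer, keeping candidates with the same consonantal skeleton and picking the one whose diacritized spelling has the smallest Levenshtein distance to the token. The function names (`strip_diacritics`, `edit_distance`, `best_candidate`) and the candidate list are hypothetical.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Drop combining marks (Arabic short vowels, shadda, sukun, etc.)."""
    return "".join(ch for ch in word if not unicodedata.combining(ch))


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance over code points (letters and diacritics)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def best_candidate(token: str, candidates: list[str]) -> str:
    """Assumed selection rule: prefer candidates sharing the token's
    undiacritized skeleton, then minimize edit distance on the full form."""
    skeleton = strip_diacritics(token)
    same_skeleton = [c for c in candidates if strip_diacritics(c) == skeleton]
    pool = same_skeleton or list(candidates)
    return min(pool, key=lambda c: edit_distance(token, c))


# Hypothetical candidates for the skeleton k-t-b, written with escapes
# so the order of letters and diacritics is unambiguous.
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba
KUTIBA = "\u0643\u064F\u062A\u0650\u0628\u064E"  # kutiba
KUTUB = "\u0643\u064F\u062A\u064F\u0628"         # kutub
token = "\u0643\u064E\u062A\u064E\u0628"         # kataba without the final vowel
chosen = best_candidate(token, [KUTUB, KUTIBA, KATABA])
```

In this reading, the chosen candidate's analysis would seed the tag for the token, and the manual verification passes mentioned in the abstract would catch cases where the nearest form is still the wrong one.
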

Journal Article
01 Mar 2013
TL;DR: A new approach to linguistic annotation of an Arabic corpus is presented: online supervised collaboration using a multi-stage approach. The effectiveness of the chosen annotation methodology is evaluated.
Abstract: The Quranic Arabic Corpus (http://corpus.quran.com) is a collaboratively constructed linguistic resource initiated at the University of Leeds, with multiple layers of annotation including part-of-speech tagging, morphological segmentation (Dukes and Habash 2010) and syntactic analysis using dependency grammar (Dukes and Buckwalter 2010). The motivation behind this work is to produce a resource that enables further analysis of the Quran, the 1,400 year-old central religious text of Islam. This project contrasts with other Arabic treebanks by providing a deep linguistic model based on the historical traditional grammar known as iʿrāb (إعراب). By adapting this well-known canon of Quranic grammar into a familiar tagset, it is possible to encourage online annotation by Arabic linguists and Quranic experts. This article presents a new approach to linguistic annotation of an Arabic corpus: online supervised collaboration using a multi-stage approach. The different stages include automatic rule-based tagging, initial manual verification, and online supervised collaborative proofreading. A popular website attracting thousands of visitors per day, the Quranic Arabic Corpus has approximately 100 unpaid volunteer annotators each suggesting corrections to existing linguistic tagging. To ensure a high-quality resource, a small number of expert annotators are promoted to a supervisory role, allowing them to review or veto suggestions made by other collaborators. The Quran also benefits from a large body of existing historical grammatical analysis, which may be leveraged during this review. In this paper we evaluate and report on the effectiveness of the chosen annotation methodology. We also discuss the unique challenges of annotating Quranic Arabic online and describe the custom linguistic software used to aid collaborative annotation.

90 citations

Proceedings Article
28 Mar 2010
TL;DR: The Quranic Arabic Dependency Treebank (QADT) is presented, along with a report on how online collaborative annotation was used to bring together Quranic scholars and Arabic language experts to ensure a high level of accuracy for grammatical analysis of the entire Quran.
Abstract: The Quran is a significant religious text, followed by the 1.5 billion believers of the Islamic faith worldwide. The text dates to 610–632 CE and is written in Quranic Arabic, the direct ancestor language of the Modern Standard Arabic in use today. This paper presents the Quranic Arabic Dependency Treebank (QADT) and reports on the approaches and solutions used to apply Natural Language Processing to the unique and challenging language of the Quran. This project differs from other Arabic treebanks by providing a deep computational linguistic model based on historical traditional Arabic grammar. The treebank is part of the Quranic Arabic Corpus (http://corpus.quran.com), a popular free Arabic resource developed at the University of Leeds. Motivated by the importance of the Quran as a central religious text, we also report on how online collaborative annotation was used to bring together Quranic scholars and Arabic language experts to ensure a high level of accuracy for grammatical analysis of the entire Quran.

55 citations

Proceedings Article
01 May 2010
TL;DR: The treebank is presented, the choice of syntactic representation is explained, and key parts of the annotation guidelines are highlighted, which are especially important to promote consistency for a corpus which is being developed through online collaboration.
Abstract: The Quranic Arabic Dependency Treebank (QADT) is part of the Quranic Arabic Corpus (http://corpus.quran.com), an online linguistic resource organized by the University of Leeds, and developed through online collaborative annotation. The website has become a popular study resource for Arabic and the Quran, and is now used by over 1,500 researchers and students daily. This paper presents the treebank, explains the choice of syntactic representation, and highlights key parts of the annotation guidelines. The text being analyzed is the Quran, the central religious book of Islam, written in classical Quranic Arabic (c. 600 CE). To date, all 77,430 words of the Quran have a manually verified morphological analysis, and syntactic analysis is in progress. 11,000 words of Quranic Arabic have been syntactically annotated as part of a gold standard treebank. Annotation guidelines are especially important to promote consistency for a corpus which is being developed through online collaboration, since often many people will participate from different backgrounds and with different levels of linguistic expertise. The treebank is available online for collaborative correction to improve accuracy, with suggestions reviewed by expert Arabic linguists, and compared against existing published books of Quranic Syntax.

54 citations

01 Jan 2011
TL;DR: A review of Artificial Intelligence and Corpus Linguistics research at Leeds University on Arabic and the Quran and a proposal for further research: the Quranic Knowledge Map.
Abstract: We review a range of Artificial Intelligence and Corpus Linguistics research at Leeds University on Arabic and the Quran, which has produced a range of software and corpus datasets for research on Modern Standard Arabic and more recently Quranic Arabic. Our work on Quranic Arabic corpus linguistics has attracted widespread interest, not only from Arabic linguists but also from Quranic students, and the general public. We see a great potential impact of Artificial Intelligence modelling of the Quran. This leads us to present a proposal for further research: the Quranic Knowledge Map.

53 citations

Network Information
Related Topics (5)
Parsing
21.5K papers, 545.4K citations
84% related
Natural language
31.1K papers, 806.8K citations
82% related
Machine translation
22.1K papers, 574.4K citations
82% related
Sentiment analysis
22.1K papers, 460.8K citations
80% related
Query expansion
17.5K papers, 452.7K citations
80% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2021    3
2020    1
2019    3
2018    4
2017    3
2016    2