Open Access | Proceedings Article

An HMM Approach to Vowel Restoration in Arabic and Hebrew

TLDR
Hidden Markov Models are shown to be a useful tool for vowel restoration in Semitic languages; the technique does not require any language-specific knowledge to be embedded in the model and generalizes well to both Hebrew and Arabic.
Abstract
Semitic languages pose a problem to Natural Language Processing since most of the vowels are omitted from written prose, resulting in considerable ambiguity at the word level. However, while reading text, native speakers can generally vocalize each word based on their familiarity with the lexicon and the context of the word. Methods for vowel restoration in previous work involving morphological analysis concentrated on a single language and relied on a parsed corpus that is difficult to create for many Semitic languages. We show that Hidden Markov Models are a useful tool for the task of vowel restoration in Semitic languages. Our technique is simple to implement, does not require any language-specific knowledge to be embedded in the model, and generalizes well to both Hebrew and Arabic. Using a publicly available version of the Bible and the Qur'an as corpora, we achieve a success rate of 86% for restoring the exact vowel pattern in Arabic and 81% in Hebrew. For Hebrew, we also report an 87% success rate for restoring the correct phonetic value of the words.
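
The idea in the abstract can be illustrated with a minimal sketch, assuming a word-bigram HMM in which the hidden states are vocalized words and each observation is the same word with its vowels removed. Everything here is hypothetical illustration rather than the authors' implementation: the function names, the Latin-transliterated toy corpus, the use of the letters a/e/i/o/u as stand-ins for diacritics, and the add-alpha smoothing (a simpler substitute for the back-off and Good-Turing estimators listed in the references).

```python
import math
from collections import defaultdict

VOWELS = set("aeiou")  # assumption: toy stand-in for real vowel diacritics


def strip_vowels(word):
    """Map a vocalized word to its unvocalized (observed) form."""
    return "".join(ch for ch in word if ch not in VOWELS)


def train(sentences):
    """Count word bigrams and record which vocalized words share each bare form."""
    bigram = defaultdict(lambda: defaultdict(int))
    candidates = defaultdict(set)
    for sentence in sentences:
        prev = "<s>"  # sentence-start marker
        for word in sentence:
            bigram[prev][word] += 1
            candidates[strip_vowels(word)].add(word)
            prev = word
    return bigram, candidates


def log_prob(bigram, prev, word, vocab_size, alpha=0.1):
    """Add-alpha smoothed log P(word | prev); a simplification, not Katz back-off."""
    row = bigram.get(prev, {})
    count = row.get(word, 0)
    total = sum(row.values())
    return math.log((count + alpha) / (total + alpha * vocab_size))


def restore(observed, bigram, candidates, vocab_size):
    """Viterbi search for the most likely vocalized sequence.

    Emissions are deterministic: a vocalized word can only emit its own
    vowel-stripped form, so only transition scores are compared.
    """
    best = {"<s>": (0.0, [])}  # state -> (log score, best path ending there)
    for obs in observed:
        # Unseen bare forms are passed through unchanged.
        states = candidates.get(obs, {obs})
        new_best = {}
        for state in sorted(states):
            score, path = max(
                (
                    (p_score + log_prob(bigram, prev, state, vocab_size), p_path)
                    for prev, (p_score, p_path) in best.items()
                ),
                key=lambda item: item[0],
            )
            new_best[state] = (score, path + [state])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Hypothetical toy corpus: "kataba" and "kutub" both reduce to "ktb".
corpus = [
    ["huwa", "kataba", "risala"],
    ["hadhihi", "kutub", "jadida"],
]
bigram, candidates = train(corpus)
vocab_size = len({w for s in corpus for w in s})

print(restore(["hw", "ktb"], bigram, candidates, vocab_size))
print(restore(["hdhh", "ktb"], bigram, candidates, vocab_size))
```

In this toy setting the ambiguous form "ktb" is resolved by the preceding word alone, which is the kind of contextual disambiguation the abstract attributes to native readers. No language-specific rules appear in the sketch; only counts from a vocalized corpus are used.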

Citations
Proceedings Article

Maximum Entropy Based Restoration of Arabic Diacritics

TL;DR: A maximum entropy approach for restoring diacritics in a document; the model can easily integrate and make effective use of diverse types of information, including a wide array of lexical, segment-based and part-of-speech tag features.
Proceedings Article

Automatic diacritization of Arabic for acoustic modeling in speech recognition

TL;DR: Investigates various procedures for automatically inserting the missing diacritics into the transcription, using acoustic information in combination with different levels of morphological and contextual constraints.
Proceedings Article

Arabic Diacritization Using Weighted Finite-State Transducers

TL;DR: A novel algorithm for restoring symbols of Arabic without short vowels and additional diacritics is presented, using a cascade of probabilistic finite-state transducers trained on the Arabic treebank, integrating a word-based language model, a letter-based language model, and an extremely simple morphological model.
Journal Article

Automatic diacritization of Arabic text using recurrent neural networks

TL;DR: A recurrent neural network is trained to transcribe undiacritized Arabic text into fully diacritized sentences, using a deep bidirectional long short-term memory network that builds high-level linguistic abstractions of text and exploits long-range context in both input directions.
Proceedings Article

Arabic Diacritics based Steganography

TL;DR: The proposed approach uses eight different diacritical symbols in Arabic to hide binary bits in the original cover media and extract data by reading the diacritics from the document and translating them back to binary.
References
Journal Article

Estimation of probabilities from sparse data for the language model component of a speech recognizer

TL;DR: The model offers, via a nonlinear recursive procedure, a computation and space efficient solution to the problem of estimating probabilities from sparse data, and compares favorably to other proposed methods.
Book

Statistical Language Learning

TL;DR: Charniak presents statistical language processing from an artificial intelligence point of view, in a text for researchers and scientists with a traditional computer science background; the approach is grounded in real text and therefore promises to produce usable results.
Journal Article

Good-Turing frequency estimation without tears

TL;DR: The Simple Good–Turing estimator is defined, which is straightforward to use and performs well, both in absolute terms and relative to the approaches just discussed and to other, more sophisticated techniques.
Proceedings Article

Arabic finite-state morphological analysis and generation

TL;DR: A large-scale system that performs morphological analysis and generation of on-line Arabic words represented in the standard orthography, whether fully voweled, partially voweled or unvoweled, using Xerox Finite-State Morphology tools.
Proceedings Article

Similarity-Based Estimation of Word Cooccurrence Probabilities

TL;DR: A probabilistic word association model based on distributional word similarity is described, and it is applied to improving probability estimates for unseen word bigrams in a variant of Katz's back-off model.