An HMM Approach to Vowel Restoration in Arabic and Hebrew

doi:10.3115/1118637.1118641

Open Access, Proceedings Article

- pp 1-7

TLDR

Hidden Markov Models are shown to be a useful tool for vowel restoration in Semitic languages. The technique does not require any language-specific knowledge to be embedded in the model and generalizes well to both Hebrew and Arabic.

Abstract:

Semitic languages pose a problem for Natural Language Processing because most vowels are omitted from written prose, resulting in considerable ambiguity at the word level. When reading text, however, native speakers can generally vocalize each word based on their familiarity with the lexicon and the context of the word. Previous methods for vowel restoration that involved morphological analysis concentrated on a single language and relied on a parsed corpus, which is difficult to create for many Semitic languages. We show that Hidden Markov Models are a useful tool for the task of vowel restoration in Semitic languages. Our technique is simple to implement, does not require any language-specific knowledge to be embedded in the model, and generalizes well to both Hebrew and Arabic. Using publicly available versions of the Bible and the Qur'an as corpora, we achieve a success rate of 86% for restoring the exact vowel pattern in Arabic and 81% in Hebrew. For Hebrew, we also report an 87% success rate for restoring the correct phonetic value of the words.
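
The abstract does not spell out the model, so the following is only a minimal sketch of how a word-level HMM could restore vowels, under stated assumptions: hidden states are vocalized words, observations are their vowel-stripped forms, transitions are bigram probabilities with add-k smoothing (a simplification chosen here for brevity), and decoding uses the Viterbi algorithm. Latin vowels stand in for Arabic or Hebrew diacritics, and the names `strip_vowels`, `train`, and `restore` are hypothetical, not taken from the paper.

```python
import math
from collections import defaultdict

VOWELS = set("aeiou")  # toy stand-in for diacritic marks (assumption)

def strip_vowels(word):
    """Map a vocalized word to its unvocalized (observed) form."""
    return "".join(ch for ch in word if ch not in VOWELS)

def train(sentences, k=0.1):
    """Count bigrams over vocalized words and build a skeleton -> candidates lexicon."""
    bigrams = defaultdict(lambda: defaultdict(int))
    lexicon = defaultdict(set)
    vocab = set()
    for sent in sentences:
        prev = "<s>"  # sentence-start state
        for word in sent:
            bigrams[prev][word] += 1
            lexicon[strip_vowels(word)].add(word)
            vocab.add(word)
            prev = word

    def log_trans(prev, word):
        # Add-k smoothed bigram log-probability (smoothing choice is an assumption).
        total = sum(bigrams[prev].values())
        return math.log((bigrams[prev][word] + k) / (total + k * len(vocab)))

    return lexicon, log_trans

def restore(observed, lexicon, log_trans):
    """Viterbi search for the most probable vocalized word sequence."""
    # Emission is deterministic here: a state can only emit its own skeleton.
    # Skeletons never seen in training are passed through unchanged.
    candidates = [sorted(lexicon.get(o, {o})) for o in observed]
    best = {"<s>": (0.0, [])}  # state -> (log-probability, best path so far)
    for cands in candidates:
        nxt = {}
        for word in cands:
            score, path = max(
                ((s + log_trans(prev, word), p) for prev, (s, p) in best.items()),
                key=lambda t: t[0],
            )
            nxt[word] = (score, path + [word])
        best = nxt
    return max(best.values(), key=lambda t: t[0])[1]

# Tiny vocalized training corpus; "bat" and "bit" share the skeleton "bt".
corpus = [
    ["the", "bat", "flew"],
    ["wait", "bit", "more"],
    ["the", "bat", "slept"],
]
lexicon, log_trans = train(corpus)
print(restore(["th", "bt", "flw"], lexicon, log_trans))
print(restore(["wt", "bt", "mr"], lexicon, log_trans))
```

In this toy setup the ambiguous skeleton `bt` is resolved by its neighbors: the bigram counts favor `bat` after `the` and `bit` after `wait`. A real system would train on a vocalized corpus such as those named in the abstract and would need a more careful treatment of unseen words and sparse bigrams.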

Citations

Proceedings Article

Maximum Entropy Based Restoration of Arabic Diacritics

Imed Zitouni, +2 more

TL;DR: A maximum entropy approach for restoring diacritics in a document. The model can easily integrate and make effective use of diverse types of information, including a wide array of lexical, segment-based and part-of-speech tag features.

Proceedings Article

Automatic diacritization of Arabic for acoustic modeling in speech recognition

Dimitra Vergyri, +1 more

TL;DR: Various procedures are investigated for automatically inserting the missing diacritics into the transcription, using acoustic information in combination with different levels of morphological and contextual constraints.

Proceedings Article

Arabic Diacritization Using Weighted Finite-State Transducers

Rani Nelken, +1 more

TL;DR: A novel algorithm for restoring symbols of Arabic without short vowels and additional diacritics is presented, using a cascade of probabilistic finite-state transducers trained on the Arabic treebank, integrating a word-based language model, a letter-based language model, and an extremely simple morphological model.

Journal Article

Automatic diacritization of Arabic text using recurrent neural networks

Gheith A. Abandah, +5 more

- 01 Jun 2015 -

International Journal on Document Analys...

TL;DR: A recurrent neural network is trained to transcribe undiacritized Arabic text with fully diacritized sentences using a deep bidirectional long short-term memory network that builds high-level linguistic abstractions of text and exploits long-range context in both input directions.

Proceedings Article

Arabic Diacritics based Steganography

M.A. Aabed, +3 more

TL;DR: The proposed approach uses eight different diacritical symbols in Arabic to hide binary bits in the original cover media and extract data by reading the diacritics from the document and translating them back to binary.

References

Journal Article

Estimation of probabilities from sparse data for the language model component of a speech recognizer

S. Katz

- 01 Mar 1987 -

IEEE Transactions on Acoustics, Speech, ...

TL;DR: The model offers, via a nonlinear recursive procedure, a computation and space efficient solution to the problem of estimating probabilities from sparse data, and compares favorably to other proposed methods.

Book

Statistical Language Learning

Eugene Charniak

TL;DR: In this book, Charniak presents statistical language processing from an artificial intelligence point of view, in a text for researchers and scientists with a traditional computer science background. The approach is grounded in real text and therefore promises to produce usable results.

Journal Article

Good-Turing frequency estimation without tears

William A. Gale, +1 more

- 01 Jan 1995 -

Journal of Quantitative Linguistics

TL;DR: The Simple Good–Turing estimator is defined, which is straightforward to use and performs well, absolutely and relative both to the approaches just discussed and to other, more sophisticated techniques.

Proceedings Article

Arabic finite-state morphological analysis and generation

Kenneth R. Beesley

TL;DR: A large-scale system that performs morphological analysis and generation of on-line Arabic words represented in the standard orthography, whether fully voweled, partially voweled or unvoweled, using Xerox Finite-State Morphology tools.

Proceedings Article

Similarity-Based Estimation of Word Cooccurrence Probabilities

Ido Dagan, +2 more

TL;DR: A probabilistic word association model based on distributional word similarity is described, and it is applied to improving probability estimates for unseen word bigrams in a variant of Katz's back-off model.

Related Papers (5)

Maximum Entropy Based Restoration of Arabic Diacritics

Arabic Diacritization Using Weighted Finite-State Transducers

Arabic Diacritization through Full Morphological Tagging

Automatic diacritization of Arabic for acoustic modeling in speech recognition

Automatic diacritization of Arabic text using recurrent neural networks