Journal ISSN: 1530-0226

ACM Transactions on Asian Language Information Processing 

Association for Computing Machinery
About: ACM Transactions on Asian Language Information Processing is an academic journal that publishes mainly in the areas of machine translation and example-based machine translation. Its ISSN is 1530-0226. Over its lifetime, the journal has published 225 papers, which have received 5925 citations.


Papers
Journal ArticleDOI
TL;DR: The Arabic language presents researchers and developers of natural language processing (NLP) applications for Arabic text and speech with serious challenges; this article describes some of these challenges and presents solutions intended to guide current and future practitioners in the field of Arabic natural language processing (ANLP).
Abstract: The Arabic language presents researchers and developers of natural language processing (NLP) applications for Arabic text and speech with serious challenges. The purpose of this article is to describe some of these challenges and to present some solutions that would guide current and future practitioners in the field of Arabic natural language processing (ANLP). We begin with general features of the Arabic language in Sections 1, 2, and 3 and then move to more specific properties of the language in the rest of the article. In Section 1 we highlight the significance of the Arabic language today and describe its general properties. Section 2 presents the feature of Arabic diglossia, showing how the sociolinguistic aspects of the Arabic language differ from those of other languages. The stability of Arabic diglossia and its implications for ANLP applications are discussed, and ways to deal with this problematic property are proposed. Section 3 deals with the properties of the Arabic script and the explosion of ambiguity that results from the absence of short vowel representations and overt case markers in contemporary Arabic texts. We present in Section 4 specific features of the Arabic language such as the nonconcatenative property of Arabic morphology, Arabic as an agglutinative language, Arabic as a pro-drop language, and the challenge these properties pose to ANLP. We also present solutions that have already been adopted by some pioneering researchers in the field. In Section 5 we point to the lack of formal and explicit grammars of Modern Standard Arabic, which impedes the progress of more advanced ANLP systems. In Section 6 we draw our conclusion.

481 citations
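The ambiguity explosion described in Section 3 of that article can be made concrete with a toy lookup. The readings of the consonantal skeleton "ktb" below are standard textbook examples, but the table, function, and glosses are illustrative sketches, not taken from the article.

```python
# Toy illustration (not from the article) of the ambiguity caused by
# omitting short vowels in written Arabic: one undiacritized skeleton
# maps to several distinct vocalized words.
AMBIGUITY_TABLE = {
    # undiacritized form -> possible vocalized readings (transliterated)
    "ktb": [
        ("kataba", "he wrote"),        # perfective active verb
        ("kutiba", "it was written"),  # perfective passive verb
        ("kutub", "books"),            # plural noun
    ],
}

def analyses(undiacritized: str):
    """Return every candidate reading for an undiacritized token."""
    return AMBIGUITY_TABLE.get(undiacritized, [])

if __name__ == "__main__":
    for reading, gloss in analyses("ktb"):
        print(f"ktb -> {reading} ({gloss})")
```

A real morphological analyzer faces the same one-to-many mapping, compounded by attached clitics and missing case markers.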

Journal ArticleDOI
TL;DR: An ontology of time is being developed for describing the temporal content of Web pages and the temporal properties of Web services, which covers topological properties of instants and intervals, measures of duration, and the meanings of clock and calendar terms.
Abstract: In connection with the DAML project for bringing about the Semantic Web, an ontology of time is being developed for describing the temporal content of Web pages and the temporal properties of Web services. This ontology covers topological properties of instants and intervals, measures of duration, and the meanings of clock and calendar terms.

449 citations
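A minimal sketch of what such temporal descriptions look like in practice, written with rdflib against the W3C OWL-Time vocabulary (the successor of the DAML time ontology discussed above). The ex: resource names are invented for illustration; only the time: terms come from the published vocabulary.

```python
# Describe an event as an interval bounded by two instants, using
# OWL-Time classes and properties. Resource names under ex: are made up.
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

TIME = Namespace("http://www.w3.org/2006/time#")
EX = Namespace("http://example.org/")

g = Graph()
g.bind("time", TIME)
g.bind("ex", EX)

# A meeting modelled as an interval with a beginning and an end instant.
g.add((EX.meeting, RDF.type, TIME.Interval))
g.add((EX.start, RDF.type, TIME.Instant))
g.add((EX.end, RDF.type, TIME.Instant))
g.add((EX.start, TIME.inXSDDateTime,
       Literal("2024-01-15T09:00:00", datatype=XSD.dateTime)))
g.add((EX.end, TIME.inXSDDateTime,
       Literal("2024-01-15T10:00:00", datatype=XSD.dateTime)))
g.add((EX.meeting, TIME.hasBeginning, EX.start))
g.add((EX.meeting, TIME.hasEnd, EX.end))

print(g.serialize(format="turtle"))
```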

Journal ArticleDOI
TL;DR: Experiments show that classifiers using features from the message header alone can achieve comparable or better performance than filters utilizing body features only, which implies that message headers can be a reliable and powerfully discriminative feature source for spam filtering.
Abstract: This paper evaluates five supervised learning methods in the context of statistical spam filtering. We study the impact of different feature pruning methods and feature set sizes on each learner's performance using cost-sensitive measures. It is observed that the significance of feature selection varies greatly from classifier to classifier. In particular, we found that the support vector machine, AdaBoost, and the maximum entropy model are top performers in this evaluation, sharing similar characteristics: they are not sensitive to the feature selection strategy, scale easily to very high feature dimensions, and achieve good performance across different datasets. In contrast, naive Bayes, a commonly used classifier in spam filtering, is found to be sensitive to feature selection methods on small feature sets, and fails to function well in scenarios where false positives are penalized heavily. The experiments also suggest that aggressive feature pruning should be avoided when building filters to be used in applications where legitimate mail is assigned a cost much higher than spam (such as λ = 999), so as to maintain better-than-baseline performance. An interesting finding is the effect of mail headers on spam filtering, which is often ignored in previous studies. Experiments show that classifiers using features from the message header alone can achieve comparable or better performance than filters utilizing body features only. This implies that message headers can be a reliable and powerfully discriminative feature source for spam filtering.

340 citations
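The header-versus-body comparison in that paper can be sketched with scikit-learn, assuming a labelled corpus where header text and body text are available separately per message; the variable names, classifier choice, and pruning threshold below are placeholders, not the paper's exact setup.

```python
# Train the same bag-of-words filter on header-only and body-only text
# and compare cross-validated scores, with a chi-square pruning step.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def score(texts, labels, k_features=1000):
    """Cross-validated accuracy of a simple statistical spam filter."""
    model = make_pipeline(
        TfidfVectorizer(),
        SelectKBest(chi2, k=k_features),  # feature pruning
        MultinomialNB(),
    )
    return cross_val_score(model, texts, labels, cv=5).mean()

# headers, bodies, labels stand in for a real labelled corpus:
# print("header-only:", score(headers, labels))
# print("body-only:  ", score(bodies, labels))
```

Swapping MultinomialNB for an SVM or boosting classifier reproduces the kind of learner comparison the paper reports.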

Journal ArticleDOI
TL;DR: A separable mixture model (SMM) is adopted to estimate the similarity between an input sentence and the EARs of each emotional state, and a dialog system focusing on students' daily expressions is constructed.
Abstract: This study presents a novel approach to automatic emotion recognition from text. First, emotion generation rules (EGRs) are manually deduced from psychology to represent the conditions for generating emotion. Based on the EGRs, the emotional state of each sentence can be represented as a sequence of semantic labels (SLs) and attributes (ATTs); SLs are defined as the domain-independent features, while ATTs are domain-dependent. The emotion association rules (EARs), represented by SLs and ATTs for each emotion, are automatically derived from the sentences in an emotional text corpus using the Apriori algorithm. Finally, a separable mixture model (SMM) is adopted to estimate the similarity between an input sentence and the EARs of each emotional state. Since some features defined in this approach are domain-dependent, a dialog system focusing on students' daily expressions is constructed, and only three emotional states, happy, unhappy, and neutral, are considered for performance evaluation. According to the results of the experiments, given the domain corpus, the proposed approach is promising and can easily be ported to other domains.

245 citations
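The Apriori-style step of that approach, deriving label patterns that co-occur with each emotion, can be sketched with a tiny frequent-itemset count. The semantic labels, sentences, and support threshold below are invented for illustration and are not the paper's data.

```python
# Count combinations of semantic labels that appear frequently in
# sentences carrying a given emotion tag; combinations above the
# support threshold play the role of emotion association rules.
from collections import Counter
from itertools import combinations

corpus = [
    (["praise", "exam", "pass"], "happy"),
    (["praise", "friend"], "happy"),
    (["fail", "exam"], "unhappy"),
    (["fail", "homework"], "unhappy"),
    (["exam", "tomorrow"], "neutral"),
]

MIN_SUPPORT = 2  # minimum number of supporting sentences

def frequent_label_sets(corpus, emotion, max_size=2):
    """Return label combinations meeting MIN_SUPPORT for one emotion."""
    counts = Counter()
    for labels, tag in corpus:
        if tag != emotion:
            continue
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(set(labels)), size):
                counts[combo] += 1
    return {c: n for c, n in counts.items() if n >= MIN_SUPPORT}

for emotion in ("happy", "unhappy", "neutral"):
    print(emotion, frequent_label_sets(corpus, emotion))
```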

Journal ArticleDOI
TL;DR: This article presents a unified approach to Chinese statistical language modeling, which automatically and consistently gathers a high-quality training data set from the Web, creates a high-quality lexicon, segments the training data using this lexicon, and compresses the language model by using the maximum likelihood principle, which is consistent with trigram model training.
Abstract: This article presents a unified approach to Chinese statistical language modeling (SLM). Applying SLM techniques like trigram language models to Chinese is challenging because (1) there is no standard definition of words in Chinese; (2) word boundaries are not marked by spaces; and (3) there is a dearth of training data. Our unified approach automatically and consistently gathers a high-quality training data set from the Web, creates a high-quality lexicon, segments the training data using this lexicon, and compresses the language model, all by using the maximum likelihood principle, which is consistent with trigram model training. We show that each of the methods leads to improvements over standard SLM, and that the combined method yields the best pinyin conversion result reported.

172 citations
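The lexicon-driven, maximum-likelihood segmentation idea at the core of that approach can be sketched as a short dynamic program: pick the split of a character string that maximizes the product of word probabilities. The lexicon and probabilities below are toy values, and the article's full method additionally learns the lexicon from Web data and trains trigram models.

```python
# Maximum-likelihood word segmentation over a unigram lexicon.
import math

LEXICON = {"北京": 0.02, "北": 0.005, "京": 0.004,
           "大学": 0.03, "大": 0.01, "学": 0.01}

def segment(text):
    """best[i] = (log-probability, word list) for the prefix of length i."""
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - 4), end):  # cap word length at 4
            word = text[start:end]
            if word in LEXICON and best[start][1] is not None:
                lp = best[start][0] + math.log(LEXICON[word])
                if lp > best[end][0]:
                    best[end] = (lp, best[start][1] + [word])
    return best[-1][1]

print(segment("北京大学"))  # -> ['北京', '大学']
```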

Network Information
Related Journals (5)
arXiv: Computation and Language (24.8K papers, 481.5K citations, 87% related)
Computational Linguistics (1.4K papers, 154.8K citations, 82% related)
Information Processing and Management (3.8K papers, 151.6K citations, 81% related)
Speech Communication (2.6K papers, 119K citations, 78% related)
ACM Transactions on Information Systems (1K papers, 104.4K citations, 78% related)
Performance Metrics
No. of papers from the journal in previous years:
Year    Papers
2014    18
2013    15
2012    18
2011    20
2010    14
2009    18