
Showing papers in "International Journal of Computer Processing of Languages in 2005"


Journal Article•DOI•
TL;DR: The past, present and future research development in temporal information extraction is presented, ranging from temporal expression extraction and annotation to temporal reasoning and understanding.
Abstract: Research on temporal information extraction was regarded as a subtask of named entity recognition in the 1990s. Since then, the scope of this research has broadened, ranging from temporal expression extraction and annotation to temporal reasoning and understanding. The area is now a hot NLP topic, and its results are applicable to question answering, information extraction, text summarization, etc. This paper presents the past, present and future research development in temporal information extraction.

26 citations


Journal Article•DOI•
TL;DR: A system called BRIDJE (Bi-directional Retriever/Information Distiller for Japanese and English), which achieved many gold-medal performances at the recent NTCIR (a.k.a. "Asian TREC") workshop, is described.
Abstract: This paper briefly describes Toshiba Knowledge Media Laboratory's recent research efforts for effective information retrieval and access. Firstly, I will mention the main research topics that are being tackled by our information access group, including document retrieval, speech-input/multimedia question answering, and evaluation metrics. Secondly, I will focus on the problem of cross-language information retrieval and access, and describe a system called BRIDJE (Bi-directional Retriever/Information Distiller for Japanese and English), which achieved many gold-medal performances at the recent NTCIR (a.k.a. "Asian TREC") workshop. Finally, I will conclude the paper by mentioning some unsolved problems and suggesting possible directions for future Information Access research.

12 citations


Journal Article•DOI•
TL;DR: Analysis either on demand or on a longitudinal basis provides a critical source of information heretofore neither readily nor economically obtainable for a range of meaningful purposes.
Abstract: Typical news coverage contains both objective facts and subjective sentiments. This is especially true of coverage of newsworthy individuals and organizations, and of media opinion on strategic subjects. Analysis either on demand or on a longitudinal basis provides a critical source of information heretofore neither readily nor economically obtainable for a range of meaningful purposes. One application is the monitoring of positive or negative summative news coverage of targeted subjects.

9 citations


Journal Article•DOI•
TL;DR: This paper investigates a novel method for generating an individual's handwritten Chinese character font, using stroke correspondence between a reference character database and a compressed character database obtained by vector quantization, and demonstrates that the generated fonts successfully reflect the user's individual handwriting.
Abstract: In this paper, we investigate a novel method for generating an individual's handwritten Chinese character font, using stroke correspondence between a reference character database and a compressed character database obtained by vector quantization. Chinese characters are composed of combinations of radicals. A radical may be separated into several strokes, with each stroke corresponding to two or more common strokes. Paying attention to these characteristics of Chinese characters and the strokes that form them, we treat each stroke as a vector and compress the stroke patterns using vector quantization, achieving a compression rate of 1.27%. We performed evaluation experiments using both subjective and objective criteria involving 26 subjects and demonstrated that the generated fonts successfully reflect the user's individual handwriting.
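As a rough illustration of the compression step (not the authors' implementation), the sketch below resamples each stroke to a fixed number of points, flattens it into a vector, and builds a codebook with k-means vector quantization via SciPy; every constant and name here is an assumption for illustration only.

import numpy as np
from scipy.cluster.vq import kmeans2

N_POINTS = 16       # points per resampled stroke (assumed)
CODEBOOK_SIZE = 64  # number of codewords (assumed)

def stroke_to_vector(points):
    # Resample a stroke, given as a list of (x, y) pairs, to N_POINTS
    # and flatten it into a single vector.
    pts = np.asarray(points, dtype=float)
    t = np.linspace(0.0, 1.0, len(pts))
    t_new = np.linspace(0.0, 1.0, N_POINTS)
    x = np.interp(t_new, t, pts[:, 0])
    y = np.interp(t_new, t, pts[:, 1])
    return np.concatenate([x, y])

def build_codebook(strokes):
    # Quantize all stroke vectors; afterwards each stroke is stored only
    # as an index into the codebook, which is what yields the heavy
    # compression the abstract reports.
    vectors = np.stack([stroke_to_vector(s) for s in strokes])
    codebook, labels = kmeans2(vectors, CODEBOOK_SIZE, minit='++')
    return codebook, labels  # labels[i] replaces strokes[i]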

6 citations


Journal Article•DOI•
TL;DR: A new approach to English-to-Bangla translation is described, based on a special amalgamated architecture of transformer and rule-based NLE architectures along with various linguistic knowledge components.
Abstract: This paper describes a new approach to English-to-Bangla translation. The resulting system, BANGANUBAD, takes a paragraph of English sentences as input and produces equivalent Bangla sentences. BANGANUBAD comprises a preprocessor, a morphological recursive parser, a semantic parser using an English word ontology for context disambiguation, an electronic lexicon associated with grammatical information, and a discourse processor. It also employs a lexical disambiguation analyzer. The system does not rely on a stochastic approach; rather, it is based on a special amalgamated architecture of transformer and rule-based NLE architectures along with various linguistic knowledge components.

5 citations


Journal Article•DOI•
TL;DR: Automatic placement of phoneme boundaries in a speech waveform using an explicit statistical model of phoneme boundaries is proposed, and studies show that HNM is capable of synthesizing all vowels and diphones with good quality.
Abstract: Most of the Indian-language Text-To-Speech (TTS) synthesis systems designed to date are based on the concatenation of acoustic units. The prime challenge is the selection of proper units and their elegant concatenation. Due to the limitations of current automated techniques based on Hidden Markov Models (HMM) and Dynamic Time Warping (DTW), manual verification and labeling are often essential. This paper proposes automatic placement of phoneme boundaries in a speech waveform using an explicit statistical model of phoneme boundaries. In the first step we apply the Harmonic plus Noise Model (HNM), and we then refine the boundary placement by searching, in a region near the estimated boundary, for the best match with a predefined boundary model using a technique like ESNOLA. This technique is applied for effective concatenation, which results in smooth output. Studies show that HNM is capable of synthesizing all vowels and diphones with good quality, which can remarkably reduce the size of the database. Further, pitch-synchronous analysis is performed and the Glottal Closure Instants (GCI) are accurately calculated. The quality of the synthesized speech improves if these units are obtained from the glottal signal rather than from processing the speech signal. A VCV database has to be developed for every Indian language, as we have done for Oriya, one of the official languages of the Republic of India, in our case study.
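As a minimal sketch of the HNM idea only (the paper's analysis/synthesis and the ESNOLA-style boundary refinement are considerably more involved), one frame of speech can be modeled as a sum of sinusoids at multiples of the fundamental frequency plus a scaled noise component; all parameter values below are invented for illustration.

import numpy as np

def hnm_frame(f0, harmonic_amps, noise_gain, sr=16000, dur=0.02):
    # Voiced part: harmonics at integer multiples of f0.
    t = np.arange(int(sr * dur)) / sr
    voiced = sum(a * np.sin(2 * np.pi * (k + 1) * f0 * t)
                 for k, a in enumerate(harmonic_amps))
    # Noise part: scaled white noise (real HNM shapes and band-limits it).
    noise = noise_gain * np.random.randn(len(t))
    return voiced + noise

frame = hnm_frame(f0=120.0, harmonic_amps=[1.0, 0.5, 0.25], noise_gain=0.05)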

5 citations


Journal Article•DOI•
TL;DR: The results show that about 83% of the 10,000 test Arabic words can be uniquely represented using seven broad phonetic classes for consonants and six classes for vowels, and the paper discusses the implications for a large vocabulary speech recognition system.
Abstract: This paper presents a new approach to large vocabulary Arabic speech recognition based on exploiting the morphological structure of the Arabic language. In this model, word discrimination is achieved by a hybrid analysis scheme, where vowels are described in detail while consonants are classified into broad phonetic classes. Different phonetic classification strategies are used to describe two large vocabulary lexicons. The results show that about 83% of the 10,000 test Arabic words can be uniquely represented using seven broad phonetic classes for consonants and six classes for vowels. In this case, the maximum number of words sharing the same phonetic labelling is 6. The paper summarises the results of ten different phonetic classification schemes and discusses their implications for a large vocabulary speech recognition system.
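The labelling scheme can be pictured with a small sketch: consonants collapse into broad classes while vowels stay distinct, and a word is uniquely represented if no other word in the lexicon maps to the same label string. The class inventory below is invented for illustration and does not reproduce the paper's Arabic classes.

from collections import Counter

# Toy broad classes for consonants (assumed, not the paper's inventory).
CONSONANT_CLASS = {'b': 'STOP', 't': 'STOP', 'd': 'STOP', 'k': 'STOP',
                   's': 'FRIC', 'z': 'FRIC', 'f': 'FRIC',
                   'm': 'NASAL', 'n': 'NASAL',
                   'r': 'LIQUID', 'l': 'LIQUID'}
VOWELS = {'a', 'i', 'u', 'aa', 'ii', 'uu'}  # six vowel classes (assumed)

def phonetic_label(word_phonemes):
    # Vowels keep their identity; consonants collapse to broad classes.
    return '-'.join(p if p in VOWELS else CONSONANT_CLASS.get(p, 'OTHER')
                    for p in word_phonemes)

def uniqueness_rate(lexicon):
    # Fraction of words whose label string is shared with no other word.
    counts = Counter(phonetic_label(w) for w in lexicon)
    return sum(1 for w in lexicon
               if counts[phonetic_label(w)] == 1) / len(lexicon)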

5 citations


Journal Article•DOI•
TL;DR: Experimental results showed that the proposed method is a feasible way to construct a bilingual dictionary for any two languages.
Abstract: We present a method for constructing a Japanese-Chinese bilingual dictionary from a Japanese-English dictionary and an English-Chinese dictionary, using English as an intermediate language. To select correct translations from among a large number of candidates, we have developed a method of ranking candidate translations using three sources of information: the number of English translations in common, the part of speech, and Japanese kanji information. Experimental results showed that the proposed method is a feasible way to construct a bilingual dictionary for any two languages.
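The first of the three ranking cues, the number of shared English translations, can be sketched as follows; the other two cues (part-of-speech agreement and kanji information) are omitted, and the dictionaries here are toy stand-ins.

def rank_candidates(ja_word, ja_en, en_zh):
    # ja_en: Japanese word -> set of English glosses
    # en_zh: English word  -> set of Chinese translations
    scores = {}
    for en in ja_en.get(ja_word, set()):
        for zh in en_zh.get(en, set()):
            # A Chinese candidate scores once per shared English pivot.
            scores[zh] = scores.get(zh, 0) + 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

ja_en = {'犬': {'dog', 'hound'}}
en_zh = {'dog': {'狗'}, 'hound': {'狗', '猎犬'}}
print(rank_candidates('犬', ja_en, en_zh))  # [('狗', 2), ('猎犬', 1)]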

5 citations


Journal Article•DOI•
TL;DR: This paper presents a Chinese unknown word identification system based on a local bigram model which is simple as well as feasible, since its algorithmic complexity is low and it requires relatively little training data.
Abstract: This paper presents a Chinese unknown word identification system based on a local bigram model. Our word segmentation system generally employs a statistics-based unigram model, but to identify unknown words we take advantage of their contextual information and apply a bigram model locally. By adjusting the interpolation weight, which is derived from a smoothing method, we combine these two models of different orders. As a simplification of the bigram model, this method is simple as well as feasible, since its algorithmic complexity is low and it requires relatively little training data. The results of our experiments show the solution is effective.
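The interpolation the abstract mentions can be pictured with a minimal sketch: the locally applied bigram probability is mixed with the global unigram probability through a weight lambda (fixed by hand here; the paper derives it from a smoothing method). The counts are toy values.

def interpolated_prob(w1, w2, unigram, bigram, total, lam=0.7):
    # Global unigram estimate and locally applied bigram estimate.
    p_uni = unigram.get(w2, 0) / total
    p_bi = bigram.get((w1, w2), 0) / unigram.get(w1, 1)
    # Linear interpolation of the two models.
    return lam * p_bi + (1 - lam) * p_uni

unigram = {'北京': 10, '大学': 8}
bigram = {('北京', '大学'): 6}
print(interpolated_prob('北京', '大学', unigram, bigram, total=1000))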

4 citations


Journal Article•DOI•
TL;DR: The acoustic signal quality requirements for efficient speech recognition are put forward, and an intelligent mechanism for modifying the regular input speech signal format is suggested for significant improvement in speech recognition.
Abstract: While projects such as AgileTV, the Nuance XML platforms, and Microsoft Speech Server 2004 are in the news, there is still demand for a speech recognition engine with a better word error rate (WER). This article puts forward the acoustic signal quality requirements for efficient speech recognition, arguing that the major thrust is on the acoustics of speech recognition. It also surveys the performance of various speech recognition engines in the industry, the techniques they adopt to obtain a quality acoustic signal from the speaker for efficient results (in terms of a lower WER), and the external factors that make recognition less robust by degrading signal quality. To tackle the problem, we suggest an intelligent mechanism for modifying the regular input speech signal format to significantly improve speech recognition.

3 citations


Journal Article•DOI•
TL;DR: A text summarization approach that clusters text units before extracting summary sentences is proposed; experiments show that the approach improves the quality of summarization.
Abstract: We propose a text summarization approach that clusters text units before extracting summary sentences. Text units are formed by combining sentences based on rhetorical structure information. The rhetorical structure information we use is that which is immediately recognizable at the surface level, making the approach as language-independent as possible. Experiments conducted with both Korean and English text collections show that the approach improves the quality of summarization.
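A rough cluster-then-extract sketch, with TF-IDF vectors and k-means standing in for the paper's rhetorical-structure-based units and its own clustering, might look like this; one representative sentence is taken per cluster.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def summarize(units, n_clusters=3):
    # Cluster the text units, then pick the unit closest to each centroid.
    X = TfidfVectorizer().fit_transform(units)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    dists = km.transform(X)  # distance of every unit to every centroid
    summary = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        summary.append(units[members[np.argmin(dists[members, c])]])
    return summary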

Journal Article•DOI•
TL;DR: A method based on hidden Markov models is proposed for constructing a Thai spelling recognition system from an existing continuous speech corpus, with utterance speed adjusted to compensate for the speed difference between spelling utterances and continuous speech utterances.
Abstract: Spelling recognition provides an alternative input method for computer systems and enhances a speech recognizer's ability to cope with incorrectly recognized words and out-of-vocabulary words. This paper presents a general framework for Thai speech recognition enhanced with spelling recognition. Towards the implementation of Thai spelling recognition, Thai alphabets and their spelling methods are analyzed. A method based on hidden Markov models is proposed for constructing a Thai spelling recognition system from an existing continuous speech corpus. To compensate for the speed difference between spelling utterances and continuous speech utterances, the utterance speed is adjusted. Two language models, bigram and trigram, are used to investigate the performance of spelling recognition under three different environments: closed-type, open-type and mixed-type language models. Using the 1.25-times-stretched training utterances under the mixed-type language model, the system achieves 87.37% correctness and 87.18% accuracy for the bigram, and up to 91.12% correctness and 90.80% accuracy for the trigram.
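One plausible reading of the 1.25-times stretch (the paper may implement it differently) is a resampling of the frame sequence of each continuous-speech utterance before HMM training, approximating the slower pace of spelled-out letters:

import numpy as np

def stretch_frames(features, rate=1.25):
    # features: (n_frames, n_dims) array of acoustic feature vectors.
    # Returns roughly rate * n_frames frames via linear interpolation.
    n = features.shape[0]
    idx = np.linspace(0, n - 1, int(round(n * rate)))
    lo = np.floor(idx).astype(int)
    hi = np.minimum(lo + 1, n - 1)
    frac = (idx - lo)[:, None]
    return (1 - frac) * features[lo] + frac * features[hi]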

Journal Article•DOI•
TL;DR: A new applicable categorization of the Korean modality system, viz. tense, aspect, mood, negation, and voice, is proposed through a contrastive analysis of Chinese and Korean from the viewpoint of a practical MT system.
Abstract: To generate a proper Korean predicate, a natural modal expression is the most important factor for a machine translation (MT) system. Tense, aspect, mood, negation, and voice are the major constituents related to modal expression. The linguistic encoding of a modal expression is quite different between Chinese and Korean in terms of linguistic typology and genealogy. In this paper, a new applicable categorization of the Korean modality system, viz. tense, aspect, mood, negation, and voice, is proposed through a contrastive analysis of Chinese and Korean from the viewpoint of a practical MT system. In order to precisely determine the modal expression, effective feature selection frameworks for Chinese are presented with a variety of machine learning methods. As a result, our proposed approach achieved an accuracy of 83.10%.

Journal Article•DOI•
TL;DR: Experimental results using DUC data and the Telecommunication Corpus show that the proposed method improves the accuracy of decomposing human-written summary sentences.
Abstract: This paper proposes a new method of enhancing the accuracy of a decomposition task by using position checking and a semantic measure for each word within a summary document. The proposed model is an extension of the Hidden Markov Model for the problem of decomposing human-written summaries. Experimental results using DUC data and the Telecommunication Corpus show that the proposed method improves the accuracy of decomposing human-written summary sentences.

Journal Article•DOI•
TL;DR: Two methods of identifying Oriental languages among four language groups (Oriental, Roman, Cyrillic, and Arabic) are described, one based on features extracted from the shapes of words and letters and the other on global analysis of text pieces using Gabor filters.
Abstract: Increasing amounts of paper documents are produced and received by many organizations. Frequently, they have to be digitized for electronic archiving and later information retrieval or data mining, which requires scanning and OCR. Since OCR techniques are language dependent, the language of the original document must first be identified automatically. This paper describes two methods of identifying Oriental languages among four language groups, i.e. Oriental, Roman, Cyrillic, and Arabic. One method is based on features extracted from the shapes of words and letters, while the other is based on global analysis of text pieces using Gabor filters. Experimental results on hundreds of both clean and noisy documents indicate that the proposed classification approaches are quite promising. The use of linguistic analysis to enhance the results is also discussed.
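The Gabor-filter route can be sketched as follows: the text block is filtered at a few frequencies and orientations, and the mean response energies form a feature vector for a script classifier. scikit-image's gabor filter is used here as a stand-in for the paper's filter bank, whose parameters are not reproduced.

import numpy as np
from skimage.filters import gabor

def gabor_features(text_block, frequencies=(0.1, 0.2), n_orient=4):
    # text_block: 2D grayscale image of a block of text.
    feats = []
    for f in frequencies:
        for k in range(n_orient):
            real, imag = gabor(text_block, frequency=f,
                               theta=k * np.pi / n_orient)
            # Mean magnitude of the complex response at this band.
            feats.append(np.sqrt(real ** 2 + imag ** 2).mean())
    return np.array(feats)  # input to e.g. an Oriental/Roman classifier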

Journal Article•DOI•
TL;DR: This paper empirically investigates the impact of translation probabilities on retrieval effectiveness in direct disambiguation approaches, comparing cross-lingual query formulation techniques involving translation probabilities and examining the relationship between top n translations and retrieval effectiveness.
Abstract: Translation ambiguity is a major problem in dictionary-based cross-language information retrieval. To attack the problem, indirect disambiguation approaches, which do not explicitly resolve translation ambiguity, rely on query-structuring techniques such as a structured Boolean model and Pirkola's method. Direct disambiguation approaches try to assign translation probabilities to translation equivalents, normally by employing co-occurrence statistics of target language terms from target documents as disambiguation clues. Thus far, translation probabilities have not been well explored in terms of statistical query translation models, query formulation, or cross-lingual retrieval models, etc. In order to study the impact of translation probabilities on retrieval effectiveness in direct disambiguation approaches, this paper empirically investigates the following issues: different disambiguation factors affecting the calculation of translation probabilities, the comparison of cross-lingual query formulation techniques involving translation probabilities, the relationship between the accuracy of translation disambiguation and retrieval effectiveness, and the relationship between top n translations and retrieval effectiveness.
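One simple formulation of direct disambiguation (illustrative only; the paper compares several) weights each translation's contribution to the document score by its translation probability:

def score_document(doc_terms, query, trans_probs, term_weight):
    # trans_probs: source term -> {translation: probability}
    # term_weight: any within-document weighting, e.g. raw term frequency.
    score = 0.0
    for s in query:
        for t, p in trans_probs.get(s, {}).items():
            score += p * term_weight(t, doc_terms)
    return score

def tf_weight(term, doc_terms):
    return doc_terms.count(term)

doc = ['银行', '利率', '银行']
print(score_document(doc, ['bank'], {'bank': {'银行': 0.8, '河岸': 0.2}},
                     tf_weight))  # 0.8 * 2 = 1.6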

Journal Article•DOI•
TL;DR: A novel model for improving the performance of Domain Dictionary-based text categorization, named the Self-Partition Model (SPM), which groups candidate words into predefined clusters generated according to the structure of the Domain Dictionary.
Abstract: In this paper, we present a novel model for improving the performance of Domain Dictionary-based text categorization, named the Self-Partition Model (SPM). SPM groups candidate words into predefined clusters, which are generated according to the structure of the Domain Dictionary. Using these learned clusters as features, we propose a novel text representation. The experimental results show that a text categorization system based on this representation performs better than the Domain Dictionary-based system, and also better than a Bag-of-Words-based system when the number of features and the size of the training corpus are both small.

Journal Article•DOI•
TL;DR: Methods are proposed that find similar words having the longest overlap with an input word, achieving 86% character accuracy and 53% word accuracy in an English-to-Korean transliteration test.
Abstract: In this paper, we present methods of transliteration and back-transliteration. In Korean technical documents and web documents, many English and Japanese words are transliterated into Korean words. These transliterated words are usually technical terms and proper nouns, so they are hard to find in a dictionary; an automatic transliteration system is therefore needed. Previous transliteration models restrict the information length to two or three letters per letter. However, most transliteration phenomena cannot be explained by a single standard rule, especially in Korean: various factors, such as the origin of a word and the profession of its users, affect each transliteration, and restricting the information length may lose the discriminative information of each transliteration rule. In this paper, we propose methods that find similar words having the longest overlap with an input word. To find similar words without losing each transliteration rule, phoneme chunks with no length limit are used. By merging phoneme chunks, an input word is transliterated. With the proposed method, we obtained 86% character accuracy and 53% word accuracy in an English-to-Korean transliteration test.
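A toy sketch of chunk-based transliteration: the input is covered greedily by the longest chunks present in a learned chunk table, and the Korean sides are concatenated. The table below is invented for illustration; the paper learns chunks of unbounded length from data.

# Toy English-to-Korean chunk table (assumed for illustration).
CHUNK_TABLE = {'da': '다', 'ta': '타', 'data': '데이터',
               'sys': '시스', 'tem': '템'}

def transliterate(word):
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest chunk first
            if word[i:j] in CHUNK_TABLE:
                out.append(CHUNK_TABLE[word[i:j]])
                i = j
                break
        else:
            i += 1  # no chunk covers this letter; skip it
    return ''.join(out)

print(transliterate('datasystem'))  # '데이터' + '시스' + '템'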