
Showing papers in "International Journal of Computer Processing of Languages in 2005"


Journal Article•DOI•
TL;DR: The past, present and future research development in temporal information extraction is presented, ranging from temporal expression extraction and annotation to temporal reasoning and understanding.
Abstract: Research on temporal information extraction was regarded as a subtask of named entity recognition in the 1990s. Since then, the scope of this research has broadened, ranging from temporal expression extraction and annotation to temporal reasoning and understanding. The area is now a hot NLP topic, and its results are applicable to question answering, information extraction, text summarization, etc. This paper presents the past, present and future research development in temporal information extraction.

26 citations


Journal Article•DOI•
TL;DR: A system called BRIDJE (Bi-directional Retriever/Information Distiller for Japanese and English), which achieved many gold-medal performances at the recent NTCIR (a.k.a. "Asian TREC") workshop, is described.
Abstract: This paper briefly describes Toshiba Knowledge Media Laboratory's recent research efforts for effective information retrieval and access. Firstly, I will mention the main research topics that are being tackled by our information access group, including document retrieval, speech-input/multimedia question answering, and evaluation metrics. Secondly, I will focus on the problem of cross-language information retrieval and access, and describe a system called BRIDJE (Bi-directional Retriever/Information Distiller for Japanese and English), which achieved many gold-medal performances at the recent NTCIR (a.k.a. "Asian TREC") workshop. Finally, I will conclude the paper by mentioning some unsolved problems and suggesting possible directions for future Information Access research.

12 citations


Journal Article•DOI•
TL;DR: Analysis either on demand or on a longitudinal basis provides a critical source of information heretofore neither readily nor economically obtainable for a range of meaningful purposes.
Abstract: Typical news coverage contains both objective facts and subjective sentiments. This is especially true of coverage of newsworthy individuals and organizations, and of media opinion on strategic subjects. Analysis either on demand or on a longitudinal basis provides a critical source of information heretofore neither readily nor economically obtainable for a range of meaningful purposes. One application is the monitoring of positive or negative summative news coverage of targeted subjects.

9 citations


Journal Article•DOI•
TL;DR: This paper investigates a novel method for generating an individual's handwritten Chinese character font, using stroke correspondence between a reference character database and a compressed character database obtained by vector quantization, and demonstrates that the generated fonts successfully reflect the user's individual handwriting.
Abstract: In this paper, we investigate a novel method for generating an individual's handwritten Chinese character font, using stroke correspondence between a reference character database and a compressed character database obtained by vector quantization. Chinese characters are composed of combinations of radicals. A radical may be separated into several strokes, with each stroke corresponding to two or more common strokes. Paying attention to these characteristics of Chinese characters and the strokes that form them, we treat each stroke as a vector and compress the stroke patterns using vector quantization, achieving a compression rate of 1.27%. We performed evaluation experiments using both subjective and objective criteria involving 26 subjects and demonstrated that the generated fonts successfully reflect the user's individual handwriting.
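As a rough illustration of the compression step (not the authors' implementation), the sketch below resamples each stroke to a fixed number of points, flattens it into a vector, and builds a codebook with k-means vector quantization via SciPy; every constant and name here is an assumption for illustration only.

import numpy as np
from scipy.cluster.vq import kmeans2

N_POINTS = 16       # points per resampled stroke (assumed)
CODEBOOK_SIZE = 64  # number of codewords (assumed)

def stroke_to_vector(points):
    # Resample a stroke, given as a list of (x, y) pairs, to N_POINTS
    # and flatten it into a single vector.
    pts = np.asarray(points, dtype=float)
    t = np.linspace(0.0, 1.0, len(pts))
    t_new = np.linspace(0.0, 1.0, N_POINTS)
    x = np.interp(t_new, t, pts[:, 0])
    y = np.interp(t_new, t, pts[:, 1])
    return np.concatenate([x, y])

def build_codebook(strokes):
    # Quantize all stroke vectors; afterwards each stroke is stored only
    # as an index into the codebook, which is what yields the heavy
    # compression the abstract reports.
    vectors = np.stack([stroke_to_vector(s) for s in strokes])
    codebook, labels = kmeans2(vectors, CODEBOOK_SIZE, minit='++')
    return codebook, labels  # labels[i] replaces strokes[i]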

6 citations


Journal Article•DOI•
TL;DR: A new approach to English-to-Bangla translation is described, based on a special amalgamated architecture of transformer and rule-based NLE architectures along with various linguistic knowledge components.
Abstract: This paper describes a new approach to English-to-Bangla translation. The resulting system, BANGANUBAD, takes a paragraph of English sentences as input and produces equivalent Bangla sentences. BANGANUBAD comprises a preprocessor, a morphological recursive parser, a semantic parser using an English word ontology for context disambiguation, an electronic lexicon associated with grammatical information, and a discourse processor. It also employs a lexical disambiguation analyzer. The system does not rely on a stochastic approach; rather, it is based on a special amalgamated architecture of transformer and rule-based NLE architectures along with various linguistic knowledge components.

5 citations


Journal Article•DOI•
TL;DR: Automatic placement of phoneme boundaries in a speech waveform using an explicit statistical model of phoneme boundaries is proposed, and studies show that HNM is capable of synthesizing all vowels and diphones with good quality.
Abstract: Most of the Indian-language Text-To-Speech (TTS) synthesis systems designed to date are based on the concatenation of acoustic units. The prime challenge is the selection of proper units and their elegant concatenation. Due to the limitations of current automated techniques based on Hidden Markov Models (HMM) and Dynamic Time Warping (DTW), manual verification and labeling are often essential. This paper proposes automatic placement of phoneme boundaries in a speech waveform using an explicit statistical model of phoneme boundaries. In the first step we apply the Harmonic plus Noise Model (HNM), and we then refine the boundary placement by searching, in a region near the estimated boundary, for the best match with a predefined boundary model using a technique like ESNOLA. This technique is applied for effective concatenation, which results in smooth output. Studies show that HNM is capable of synthesizing all vowels and diphones with good quality, which can remarkably reduce the size of the database. Further, pitch-synchronous analysis is performed and the Glottal Closure Instants (GCI) are accurately calculated. The quality of the synthesized speech improves if these units are obtained from the glottal signal rather than from processing the speech signal. A VCV database has to be developed for every Indian language, as we have done for Oriya, one of the official languages of the Republic of India, in our case study.
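As a minimal sketch of the HNM idea only (the paper's analysis/synthesis and the ESNOLA-style boundary refinement are considerably more involved), one frame of speech can be modeled as a sum of sinusoids at multiples of the fundamental frequency plus a scaled noise component; all parameter values below are invented for illustration.

import numpy as np

def hnm_frame(f0, harmonic_amps, noise_gain, sr=16000, dur=0.02):
    # Voiced part: harmonics at integer multiples of f0.
    t = np.arange(int(sr * dur)) / sr
    voiced = sum(a * np.sin(2 * np.pi * (k + 1) * f0 * t)
                 for k, a in enumerate(harmonic_amps))
    # Noise part: scaled white noise (real HNM shapes and band-limits it).
    noise = noise_gain * np.random.randn(len(t))
    return voiced + noise

frame = hnm_frame(f0=120.0, harmonic_amps=[1.0, 0.5, 0.25], noise_gain=0.05)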

5 citations


Journal Article•DOI•
TL;DR: The results show that about 83% of the 10,000 test Arabic words can be uniquely represented using seven broad phonetic classes for consonants and six classes for vowels, and the paper discusses the implications for a large vocabulary speech recognition system.
Abstract: This paper presents a new approach to large vocabulary Arabic speech recognition based on exploiting the morphological structure of the Arabic language. In this model, word discrimination is achieved by a hybrid analysis scheme, where vowels are described in detail while consonants are classified into broad phonetic classes. Different phonetic classification strategies are used to describe two large vocabulary lexicons. The results show that about 83% of the 10,000 test Arabic words can be uniquely represented using seven broad phonetic classes for consonants and six classes for vowels. In this case, the maximum number of words sharing the same phonetic labelling is 6. The paper summarises the results of ten different phonetic classification schemes and discusses their implications for a large vocabulary speech recognition system.
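The labelling scheme can be pictured with a small sketch: consonants collapse into broad classes while vowels stay distinct, and a word is uniquely represented if no other word in the lexicon maps to the same label string. The class inventory below is invented for illustration and does not reproduce the paper's Arabic classes.

from collections import Counter

# Toy broad classes for consonants (assumed, not the paper's inventory).
CONSONANT_CLASS = {'b': 'STOP', 't': 'STOP', 'd': 'STOP', 'k': 'STOP',
                   's': 'FRIC', 'z': 'FRIC', 'f': 'FRIC',
                   'm': 'NASAL', 'n': 'NASAL',
                   'r': 'LIQUID', 'l': 'LIQUID'}
VOWELS = {'a', 'i', 'u', 'aa', 'ii', 'uu'}  # six vowel classes (assumed)

def phonetic_label(word_phonemes):
    # Vowels keep their identity; consonants collapse to broad classes.
    return '-'.join(p if p in VOWELS else CONSONANT_CLASS.get(p, 'OTHER')
                    for p in word_phonemes)

def uniqueness_rate(lexicon):
    # Fraction of words whose label string is shared with no other word.
    counts = Counter(phonetic_label(w) for w in lexicon)
    return sum(1 for w in lexicon
               if counts[phonetic_label(w)] == 1) / len(lexicon)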

5 citations


Journal Article•DOI•
TL;DR: Experimental results showed that the proposed method is a feasible way to construct a bilingual dictionary for any two languages.
Abstract: We present a method for constructing a Japanese-Chinese bilingual dictionary from a Japanese-English dictionary and an English-Chinese dictionary, using English as an intermediate language. To select correct translations from among a large number of candidates, we have developed a method of ranking candidate translations using three sources of information: the number of English translations in common, the part of speech, and Japanese kanji information. Experimental results showed that the proposed method is a feasible way to construct a bilingual dictionary for any two languages.
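The first of the three ranking cues, the number of shared English translations, can be sketched as follows; the other two cues (part-of-speech agreement and kanji information) are omitted, and the dictionaries here are toy stand-ins.

def rank_candidates(ja_word, ja_en, en_zh):
    # ja_en: Japanese word -> set of English glosses
    # en_zh: English word  -> set of Chinese translations
    scores = {}
    for en in ja_en.get(ja_word, set()):
        for zh in en_zh.get(en, set()):
            # A Chinese candidate scores once per shared English pivot.
            scores[zh] = scores.get(zh, 0) + 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

ja_en = {'犬': {'dog', 'hound'}}
en_zh = {'dog': {'狗'}, 'hound': {'狗', '猎犬'}}
print(rank_candidates('犬', ja_en, en_zh))  # [('狗', 2), ('猎犬', 1)]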

5 citations


Journal Article•DOI•
TL;DR: This paper presents a Chinese unknown word identification system based on a local bigram model which is simple as well as feasible, since its algorithmic complexity is low and it requires relatively little training data.
Abstract: This paper presents a Chinese unknown word identification system based on a local bigram model. Our word segmentation system generally employs a statistics-based unigram model, but to identify unknown words we take advantage of their contextual information and apply a bigram model locally. By adjusting the interpolation weight, which is derived from a smoothing method, we combine these two models of different orders. As a simplification of the bigram model, this method is simple as well as feasible, since its algorithmic complexity is low and it requires relatively little training data. The results of our experiments show the solution is effective.
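The interpolation the abstract mentions can be pictured with a minimal sketch: the locally applied bigram probability is mixed with the global unigram probability through a weight lambda (fixed by hand here; the paper derives it from a smoothing method). The counts are toy values.

def interpolated_prob(w1, w2, unigram, bigram, total, lam=0.7):
    # Global unigram estimate and locally applied bigram estimate.
    p_uni = unigram.get(w2, 0) / total
    p_bi = bigram.get((w1, w2), 0) / unigram.get(w1, 1)
    # Linear interpolation of the two models.
    return lam * p_bi + (1 - lam) * p_uni

unigram = {'北京': 10, '大学': 8}
bigram = {('北京', '大学'): 6}
print(interpolated_prob('北京', '大学', unigram, bigram, total=1000))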

4 citations


Journal Article•DOI•
TL;DR: The acoustic signal quality requirements for efficient speech recognition are put forward, and an intelligent mechanism for modifying the regular input speech signal format is suggested for significant improvement in speech recognition.
Abstract: While projects such as AgileTV, the Nuance XML platforms, and Microsoft Speech Server 2004 are in the news, there is still demand for a speech recognition engine with a better word error rate (WER). This article puts forward the acoustic signal quality requirements for efficient speech recognition, arguing that the major thrust is on the acoustics of speech recognition. It also surveys the performance of various speech recognition engines in the industry, the techniques they adopt to obtain a quality acoustic signal from the speaker for efficient results (in terms of a lower WER), and the external factors that make recognition less robust by degrading signal quality. To tackle the problem, we suggest an intelligent mechanism for modifying the regular input speech signal format to significantly improve speech recognition.

3 citations


Journal Article•DOI•
TL;DR: A text summarization approach that clusters text units before extracting summary sentences is proposed; experiments show that the approach improves the quality of summarization.
Abstract: We propose a text summarization approach that clusters text units before extracting summary sentences. Text units are formed by combining sentences based on rhetorical structure information. The rhetorical structure information we use is that which is immediately recognizable at the surface level, making the approach as language-independent as possible. Experiments conducted with both Korean and English text collections show that the approach improves the quality of summarization.
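A rough cluster-then-extract sketch, with TF-IDF vectors and k-means standing in for the paper's rhetorical-structure-based units and its own clustering, might look like this; one representative sentence is taken per cluster.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def summarize(units, n_clusters=3):
    # Cluster the text units, then pick the unit closest to each centroid.
    X = TfidfVectorizer().fit_transform(units)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    dists = km.transform(X)  # distance of every unit to every centroid
    summary = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        summary.append(units[members[np.argmin(dists[members, c])]])
    return summary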

Journal Article•DOI•
TL;DR: A method based on hidden Markov models is proposed for constructing a Thai spelling recognition system from an existing continuous speech corpus, with utterance speed adjusted to compensate for the speed difference between spelling utterances and continuous speech utterances.
Abstract: Spelling recognition provides an alternative input method for computer systems and enhances a speech recognizer's ability to cope with incorrectly recognized words and out-of-vocabulary words. This paper presents a general framework for Thai speech recognition enhanced with spelling recognition. Towards the implementation of Thai spelling recognition, Thai alphabets and their spelling methods are analyzed. A method based on hidden Markov models is proposed for constructing a Thai spelling recognition system from an existing continuous speech corpus. To compensate for the speed difference between spelling utterances and continuous speech utterances, the utterance speed is adjusted. Two language models, bigram and trigram, are used to investigate the performance of spelling recognition under three different environments: closed-type, open-type and mixed-type language models. Using the 1.25-times-stretched training utterances under the mixed-type language model, the system achieves 87.37% correctness and 87.18% accuracy for the bigram, and up to 91.12% correctness and 90.80% accuracy for the trigram.
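One plausible reading of the 1.25-times stretch (the paper may implement it differently) is a resampling of the frame sequence of each continuous-speech utterance before HMM training, approximating the slower pace of spelled-out letters:

import numpy as np

def stretch_frames(features, rate=1.25):
    # features: (n_frames, n_dims) array of acoustic feature vectors.
    # Returns roughly rate * n_frames frames via linear interpolation.
    n = features.shape[0]
    idx = np.linspace(0, n - 1, int(round(n * rate)))
    lo = np.floor(idx).astype(int)
    hi = np.minimum(lo + 1, n - 1)
    frac = (idx - lo)[:, None]
    return (1 - frac) * features[lo] + frac * features[hi]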

Journal Article•DOI•
TL;DR: A new applicable categorization of the Korean modality system, viz. tense, aspect, mood, negation, and voice, is proposed through a contrastive analysis of Chinese and Korean from the viewpoint of a practical MT system.
Abstract: To generate a proper Korean predicate, a natural modal expression is the most important factor for a machine translation (MT) system. Tense, aspect, mood, negation, and voice are the major constituents related to modal expression. The linguistic encoding of a modal expression is quite different between Chinese and Korean in terms of linguistic typology and genealogy. In this paper, a new applicable categorization of the Korean modality system, viz. tense, aspect, mood, negation, and voice, is proposed through a contrastive analysis of Chinese and Korean from the viewpoint of a practical MT system. In order to precisely determine the modal expression, effective feature selection frameworks for Chinese are presented with a variety of machine learning methods. As a result, our proposed approach achieved an accuracy of 83.10%.

Journal Article•DOI•
TL;DR: Experimental results using DUC data and the Telecommunication Corpus show that the proposed method improves the accuracy of decomposing human-written summary sentences.
Abstract: This paper proposes a new method of enhancing the accuracy of a decomposition task by using position checking and a semantic measure for each word within a summary document. The proposed model is an extension of the Hidden Markov Model for the problem of decomposing human-written summaries. Experimental results using DUC data and the Telecommunication Corpus show that the proposed method improves the accuracy of decomposing human-written summary sentences.

Journal Article•DOI•
TL;DR: Two methods of identifying Oriental languages among four language groups (Oriental, Roman, Cyrillic, and Arabic) are described, one based on features extracted from the shapes of words and letters and the other on global analysis of text pieces using Gabor filters.
Abstract: Increasing amounts of paper documents are produced and received by many organizations. Frequently, they have to be digitized for electronic archiving and later information retrieval or data mining, which requires scanning and OCR. Since OCR techniques are language dependent, the language of the original document must first be identified automatically. This paper describes two methods of identifying Oriental languages among four language groups, i.e. Oriental, Roman, Cyrillic, and Arabic. One method is based on features extracted from the shapes of words and letters, while the other is based on global analysis of text pieces using Gabor filters. Experimental results on hundreds of both clean and noisy documents indicate that the proposed classification approaches are quite promising. The use of linguistic analysis to enhance the results is also discussed.
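The Gabor-filter route can be sketched as follows: the text block is filtered at a few frequencies and orientations, and the mean response energies form a feature vector for a script classifier. scikit-image's gabor filter is used here as a stand-in for the paper's filter bank, whose parameters are not reproduced.

import numpy as np
from skimage.filters import gabor

def gabor_features(text_block, frequencies=(0.1, 0.2), n_orient=4):
    # text_block: 2D grayscale image of a block of text.
    feats = []
    for f in frequencies:
        for k in range(n_orient):
            real, imag = gabor(text_block, frequency=f,
                               theta=k * np.pi / n_orient)
            # Mean magnitude of the complex response at this band.
            feats.append(np.sqrt(real ** 2 + imag ** 2).mean())
    return np.array(feats)  # input to e.g. an Oriental/Roman classifier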

Journal Article•DOI•
TL;DR: This paper empirically investigates the impact of translation probabilities on retrieval effectiveness in direct disambiguation approaches, comparing cross-lingual query formulation techniques involving translation probabilities and examining the relationship between top n translations and retrieval effectiveness.
Abstract: Translation ambiguity is a major problem in dictionary-based cross-language information retrieval. To attack the problem, indirect disambiguation approaches, which do not explicitly resolve translation ambiguity, rely on query-structuring techniques such as a structured Boolean model and Pirkola's method. Direct disambiguation approaches try to assign translation probabilities to translation equivalents, normally by employing co-occurrence statistics of target language terms from target documents as disambiguation clues. Thus far, translation probabilities have not been well explored in terms of statistical query translation models, query formulation, or cross-lingual retrieval models, etc. In order to study the impact of translation probabilities on retrieval effectiveness in direct disambiguation approaches, this paper empirically investigates the following issues: different disambiguation factors affecting the calculation of translation probabilities, the comparison of cross-lingual query formulation techniques involving translation probabilities, the relationship between the accuracy of translation disambiguation and retrieval effectiveness, and the relationship between top n translations and retrieval effectiveness.
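One simple formulation of direct disambiguation (illustrative only; the paper compares several) weights each translation's contribution to the document score by its translation probability:

def score_document(doc_terms, query, trans_probs, term_weight):
    # trans_probs: source term -> {translation: probability}
    # term_weight: any within-document weighting, e.g. raw term frequency.
    score = 0.0
    for s in query:
        for t, p in trans_probs.get(s, {}).items():
            score += p * term_weight(t, doc_terms)
    return score

def tf_weight(term, doc_terms):
    return doc_terms.count(term)

doc = ['银行', '利率', '银行']
print(score_document(doc, ['bank'], {'bank': {'银行': 0.8, '河岸': 0.2}},
                     tf_weight))  # 0.8 * 2 = 1.6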

Journal Article•DOI•
TL;DR: A novel model for improving the performance of Domain Dictionary-based text categorization, named the Self-Partition Model (SPM), which groups candidate words into predefined clusters generated according to the structure of the Domain Dictionary.
Abstract: In this paper, we present a novel model for improving the performance of Domain Dictionary-based text categorization, named the Self-Partition Model (SPM). SPM groups candidate words into predefined clusters, which are generated according to the structure of the Domain Dictionary. Using these learned clusters as features, we propose a novel text representation. The experimental results show that a text categorization system based on this representation performs better than the Domain Dictionary-based system, and also better than a Bag-of-Words-based system when the number of features and the size of the training corpus are both small.

Journal Article•DOI•
TL;DR: Methods are proposed that find similar words having the longest overlap with an input word, achieving 86% character accuracy and 53% word accuracy in an English-to-Korean transliteration test.
Abstract: In this paper, we present methods of transliteration and back-transliteration. In Korean technical documents and web documents, many English and Japanese words are transliterated into Korean words. These transliterated words are usually technical terms and proper nouns, so they are hard to find in a dictionary; an automatic transliteration system is therefore needed. Previous transliteration models restrict the information length to two or three letters per letter. However, most transliteration phenomena cannot be explained by a single standard rule, especially in Korean: various factors, such as the origin of a word and the profession of its users, affect each transliteration, and restricting the information length may lose the discriminative information of each transliteration rule. In this paper, we propose methods that find similar words having the longest overlap with an input word. To find similar words without losing each transliteration rule, phoneme chunks with no length limit are used. By merging phoneme chunks, an input word is transliterated. With the proposed method, we obtained 86% character accuracy and 53% word accuracy in an English-to-Korean transliteration test.
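A toy sketch of chunk-based transliteration: the input is covered greedily by the longest chunks present in a learned chunk table, and the Korean sides are concatenated. The table below is invented for illustration; the paper learns chunks of unbounded length from data.

# Toy English-to-Korean chunk table (assumed for illustration).
CHUNK_TABLE = {'da': '다', 'ta': '타', 'data': '데이터',
               'sys': '시스', 'tem': '템'}

def transliterate(word):
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest chunk first
            if word[i:j] in CHUNK_TABLE:
                out.append(CHUNK_TABLE[word[i:j]])
                i = j
                break
        else:
            i += 1  # no chunk covers this letter; skip it
    return ''.join(out)

print(transliterate('datasystem'))  # '데이터' + '시스' + '템'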