
Showing papers in "International Journal of Computer Processing of Languages in 2001"


Journal Article•DOI•
TL;DR: The structure of written Thai is highly ambiguous, requiring more sophisticated techniques than are necessary for comparable IE tasks in most European languages, together with large amounts of domain knowledge to cope with these ambiguities.
Abstract: The development of an information extraction (IE) system for Thai documents raises a number of issues which are not important for IE in English and other European languages. We describe the characteristics of written Thai, the problems they pose, and our approach to the Thai IE system. The structure of written Thai is highly ambiguous, requiring more sophisticated techniques than are necessary to perform comparable IE tasks in most European languages, together with large amounts of domain knowledge to cope with these ambiguities. The basic characteristic of this system is to provide different natural language components to assess the surface structure of the documents. These components include word segmentation, identification of terms with specific lexical structures, and part-of-speech tagging. Further analysis performs shallow parsing over the relevant regions that contain the trigger terms or patterns specified in the extraction templates. Finally, the information of interest is extracted from the parse trees according to predefined concept definitions, and the user is returned a list of answers for each concept.
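The abstract names word segmentation as the first Thai-specific component but does not specify the algorithm used. A common baseline for segmenting scripts written without word delimiters is dictionary-based longest matching; the sketch below, with a three-word toy lexicon that is purely illustrative, shows the idea.

```python
def longest_match_segment(text, lexicon):
    """Greedy longest-match word segmentation for text written without
    spaces: at each position take the longest lexicon entry that matches,
    falling back to a single character for out-of-vocabulary material."""
    max_len = max(map(len, lexicon))
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character: emit as-is
            i += 1
    return tokens

# Toy Thai lexicon: "I", "go", "school"; the input carries no spaces.
lexicon = {"ฉัน", "ไป", "โรงเรียน"}
segments = longest_match_segment("ฉันไปโรงเรียน", lexicon)
```

Real Thai segmentation is considerably harder — competing segmentations are exactly the ambiguity the paper's domain knowledge addresses — but longest matching illustrates the surface-structure step.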

15 citations


Journal Article•DOI•
TL;DR: This work argues that the newly proposed transliteration approach is more advantageous for the resolution of the word mismatch problem than the previously proposed back-transliteration approach, and results support this argument.
Abstract: In Korean text these days, the use of English words, with or without phonetic transcriptions, is growing rapidly. To make matters worse, the Korean transliteration of an English word may vary greatly. The mixed use of English words and their various transliterations in the same document or document collection may cause severe word mismatch problems in Korean information retrieval. There are two possible approaches to tackling this problem: the transliteration method and the back-transliteration method. We argue that our newly proposed transliteration approach is more advantageous for resolving the word mismatch problem than the previously proposed back-transliteration approach. Our information retrieval experiments support this argument.
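The paper's transliteration method is not spelled out in this abstract; one hedged reading is that English terms are expanded with their known Korean transliteration variants at query (or index) time, so that a document written either way still matches. A minimal sketch, with an assumed hand-built variant table:

```python
def expand_with_transliterations(terms, variant_table):
    """Expand each English query term with its Korean transliteration
    variants so documents using either written form are matched.
    The table here is a hand-built toy; a real system would generate
    variants automatically."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(variant_table.get(term, []))
    return expanded

# "digital" is commonly transliterated both as 디지털 and (older) 디지탈.
variants = {"digital": ["디지털", "디지탈"], "computer": ["컴퓨터"]}
query = expand_with_transliterations(["digital", "computer"], variants)
```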

12 citations


Journal Article•DOI•
TL;DR: The Semantic Conceptual Model (SCM) is defined for the domain specific knowledge representations and extraction strategies and is used to formulate natural language queries in NChiql, a Chinese natural language interface to databases.
Abstract: Numerous natural language interfaces to databases (NLIDBs) developed in the mid-1980s demonstrated impressive characteristics in certain application areas, but NLIDBs did not gain the expected commercial acceptance. We argue that this was due to two major reasons: limited portability and poor usability. This paper describes the design and implementation of NChiql, a Chinese natural language interface to databases. We define the Semantic Conceptual Model (SCM) for domain-specific knowledge representation and extraction strategies. The SCM depicts database semantics and is used to formulate natural language queries. Further, we present a novel method for processing Chinese natural language queries. Experiments show that NChiql has good usability and high correctness.

12 citations


Journal Article•DOI•
TL;DR: This article proposes the method FNLU (Filtering based on Natural Language Understanding) including the algorithms for Extracting Typical Phrase, Calculating feature vector, Mining threshold vector, Objective judging and Subjective judging.
Abstract: Web document filtering is an important aspect of information security. The traditional strategy based on simple keyword matching often leads to low discrimination accuracy. This article proposes FNLU (Filtering based on Natural Language Understanding), comprising algorithms for extracting typical phrases, calculating feature vectors, mining threshold vectors, objective judging and subjective judging. The experimental results show that the algorithms are efficient.
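The abstract only names the FNLU algorithms. As an illustration of the general idea — a feature vector over typical phrases compared against a mined threshold vector, rather than simple keyword matching — here is a hedged sketch; the phrase counts, thresholds, and decision rule below are assumptions, not the paper's definitions.

```python
def feature_vector(tokens, typical_phrases):
    """Count how often each typical phrase occurs in the document."""
    return [tokens.count(p) for p in typical_phrases]

def objective_judge(vector, threshold_vector):
    """Flag the document when any feature meets its mined threshold
    (one plausible decision rule; the paper's rule is not given here)."""
    return any(v >= t for v, t in zip(vector, threshold_vector))

phrases = ["restricted term", "banned topic"]
doc = ["this", "page", "mentions", "banned topic", "twice", "banned topic"]
vec = feature_vector(doc, phrases)
flagged = objective_judge(vec, [1, 2])
```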

10 citations


Journal Article•DOI•
TL;DR: With the aid of pseudo-relevance feedback, MT-based Japanese-English CLIR can be as effective as "best-case" monolingual retrieval.
Abstract: This paper evaluates the effectiveness of cross-language information retrieval (CLIR) based on machine translation (MT), where the search requests are in Japanese and the documents are in English [24]. Our experiments use a subcollection of the TREC test collections, with two bilingual researchers separately translating the TREC requests into Japanese. Our main findings are as follows: (1) With the aid of pseudo-relevance feedback, MT-based Japanese-English CLIR can be as effective as "best-case" monolingual retrieval. In particular, although poor MT quality often leads to poor initial CLIR performance, pseudo-relevance feedback can alleviate the harm in many cases; (2) The manual request translation process that is inherent in conventional CLIR performance evaluation can be a more dominant factor than query length and overall MT quality.
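Pseudo-relevance feedback, the key ingredient in finding (1), can be sketched Rocchio-style: assume the top-ranked documents are relevant and move the query toward their term centroid. The weights and expansion count below are illustrative defaults, not the paper's settings.

```python
from collections import Counter

def prf_expand(query_terms, top_docs, alpha=1.0, beta=0.75, n_expand=5):
    """Rocchio-style pseudo-relevance feedback: reweight the query with
    the term centroid of the (assumed relevant) top-ranked documents."""
    weights = Counter({t: alpha for t in query_terms})
    for doc in top_docs:
        for term, count in Counter(doc).items():
            weights[term] += beta * count / len(top_docs)
    # keep the original terms plus the highest-weighted new terms
    new_terms = [t for t, _ in weights.most_common() if t not in query_terms]
    return list(query_terms) + new_terms[:n_expand]

top = [["translation", "retrieval", "japanese"],
       ["retrieval", "feedback", "japanese"]]
expanded = prf_expand(["retrieval"], top)
```

Terms that recur across the assumed-relevant documents (here "japanese") outrank one-off terms, which is how feedback can recover from a poor initial translation.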

10 citations


Journal Article•DOI•
TL;DR: The weighted Boolean models using fuzzy and p-norm measures, as well as the vector space model using the cosine measure, are extended for processing hybrid terms to achieve better average precision over those using words.
Abstract: Retrieval effectiveness depends on how terms are extracted and indexed. In Chinese text (and others such as Japanese and Korean), there are no spaces to delimit words. Indexing with hybrid terms (i.e. words and bigrams) has been shown to achieve the best precision amongst homogeneous terms, at a lower storage cost than indexing with bigrams alone. However, this was tested only with conjunctive queries. Here, we extend the weighted Boolean models using fuzzy and p-norm measures, as well as the vector space model using the cosine measure, to process hybrid terms. Our evaluation shows that all IR models using hybrid terms achieve better average precision than those using words. Across different recall values, the weighted Boolean model using fuzzy measures with hybrid terms achieves consistently about 8% higher precision than the same model using words. The vector space model using the cosine measure with hybrid terms achieved the best improvement in average recall and precision.
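A hedged sketch of hybrid-term extraction: dictionary words where the lexicon matches (longest first), with character bigrams covering the out-of-vocabulary stretches. The lexicon and fallback policy are illustrative; the paper's exact indexing scheme is not reproduced here.

```python
def hybrid_terms(text, lexicon):
    """Extract hybrid index terms: lexicon words where they match
    (longest first), overlapping character bigrams over unmatched
    stretches. A final unmatched character yields no term in this
    simplified sketch."""
    terms, i = [], 0
    while i < len(text):
        word = next((text[i:j] for j in range(len(text), i, -1)
                     if j - i > 1 and text[i:j] in lexicon), None)
        if word:
            terms.append(word)
            i += len(word)
        else:
            if i + 1 < len(text):
                terms.append(text[i:i + 2])  # bigram fallback
            i += 1
    return terms
```

For example, with the toy lexicon {"中国", "北京"}, the unsegmented string "中国人北京" yields the word terms plus one bigram bridging the out-of-vocabulary character.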

6 citations


Journal Article•DOI•
TL;DR: The culture-dependent characteristics of Chinese business processes are identified and taken into account in the MAWM model, which adopts a decentralized modeling style based on the conventional business perspective on labor division and cooperation.
Abstract: This paper proposes a new workflow model (MAWM) targeted at Chinese business processing. The culture-dependent characteristics of Chinese business processes are identified and taken into account in the MAWM model. MAWM adopts a decentralized modeling style based on the conventional business perspective on labor division and cooperation. This idea is reflected in the concept of agent-workflow. Activity-based and communication-based modeling methods are integrated in the process model and used to model different types of business activities. In addition, the model is formalized using process algebra, and a simplified algorithm is proposed for safety verification of workflow processes.

6 citations


Journal Article•DOI•
TL;DR: An alternative approach using the (nonparametric) lambda statistic LB is examined, which overcomes spurious association problems and the averaging effect of mutual information and is concluded that the statistic is more suitable for exhaustive contextual models (e.g. variable N-gram models).
Abstract: Context windows are important for a variety of natural language analysis and processing tasks. A trade-off exists between task performance and the size of the context. Lucassen and Mercer used mutual information to determine the size of the context for English text. We apply the same technique to determine the context window size for Chinese text. In addition, we use the association score proposed by Church, which is directly related to the prediction ability of units in the context. To reduce the effects of spurious associations, the association score value at the N% quantile is used instead of the maximum, and association scores derived from low-frequency occurrences (i.e. fewer than 5) are discarded. A window size of 9 characters was found to be large enough for most associations between characters themselves, and between words themselves. An alternative approach using the (nonparametric) lambda statistic LB is examined, which overcomes the spurious association problem and the averaging effect of mutual information. We conclude that this statistic is more suitable for exhaustive contextual models (e.g. variable N-gram models), whereas the association score is more suitable for non-exhaustive contextual models (e.g. identification of collocations).
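Church's association score is a mutual-information-style statistic. A minimal sketch for characters co-occurring at distance d is below; the paper additionally takes the N% quantile of the scores and discards pairs seen fewer than 5 times, and the four-character corpus here is a toy.

```python
import math
from collections import Counter

def association_at_distance(text, d, min_count=1):
    """Pointwise mutual information log2(P(x,y) / (P(x)P(y))) for
    character pairs (x, y) occurring d positions apart; pairs seen
    fewer than min_count times are discarded (the paper cuts off at 5)."""
    pairs = Counter((text[i], text[i + d]) for i in range(len(text) - d))
    chars = Counter(text)
    n_pairs, n_chars = sum(pairs.values()), len(text)
    scores = {}
    for (x, y), c in pairs.items():
        if c < min_count:
            continue
        scores[(x, y)] = math.log2((c / n_pairs) /
                                   ((chars[x] / n_chars) * (chars[y] / n_chars)))
    return scores

scores = association_at_distance("abab", 1)
```

Computing this for each distance d and watching where the scores flatten out is one way to pick a window size, which is the role the statistic plays in the paper.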

5 citations


Journal Article•DOI•
TL;DR: This paper reviews a variety of recently proposed features of signatures, presents two new features that establish good verification rates, and proposes an off-line Chinese signature verification system based on this combination of features.
Abstract: Signature verification is one of the important methodologies for personal identification. Because of its wide application in security practices, quite a number of signature verification techniques based on various features have been proposed. The features utilized in a signature verification system directly affect its performance. In this paper, we review a variety of recently proposed signature features and present two new features. Each of these features is examined to evaluate its performance for Chinese signature verification. We then choose and combine the features that establish good verification rates. As a result, we propose an off-line Chinese signature verification system based on this combination of features. The experimental results show that the proposed system achieves good performance in terms of verification rate.

3 citations


Journal Article•DOI•
TL;DR: In an example task for retrieval of Mandarin Chinese broadcast news data, the content-based language models, either trained on automatic transcriptions of spoken documents or adapted from baseline language models using automatic transcripts of spoken Documents, were used to create more accurate recognition results and indexing terms from both spoken documents and speech queries.
Abstract: Spoken document retrieval (SDR) has been extensively studied in recent years because of its potential use in navigating large multimedia collections in the near future. This paper presents a novel concept of applying content-based language models to spoken document retrieval. In an example task for retrieval of Mandarin Chinese broadcast news data, the content-based language models, either trained on automatic transcriptions of spoken documents or adapted from baseline language models using automatic transcriptions of spoken documents, were used to create more accurate recognition results and indexing terms from both spoken documents and speech queries. We report on some interesting findings obtained in this research.

3 citations


Journal Article•DOI•
TL;DR: A Question-Answering (QA) system in Korean that uses a predictive answer indexer, which saves response time because the QA system need not extract and score answer candidates at retrieval time.
Abstract: We propose a Korean Question-Answering (QA) system that uses a predictive answer indexer. The predictive answer indexer first extracts all answer candidates in a document at indexing time. Then it assigns scores to the adjacent content words that are closely related to each answer candidate. Next, it stores the weighted content words with each candidate in a database. Using this technique, along with a complementary analysis of questions, the proposed QA system saves response time because it does not need to extract and score answer candidates at retrieval time. If the QA system is combined with a traditional information retrieval system, it can improve document retrieval precision for closed-class questions with minimal loss of retrieval time.
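A minimal sketch of the predictive-indexer idea, under stated assumptions: candidates are recognized by a caller-supplied predicate, context words are weighted by inverse distance, and answering reduces to a lookup plus scoring — no candidate extraction happens at retrieval time. The names, weighting, and example are illustrative, not the paper's.

```python
from collections import defaultdict

def build_answer_index(docs, is_candidate, window=3):
    """Indexing time: store every answer candidate together with its
    nearby content words, weighted by inverse distance."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for i, tok in enumerate(tokens):
            if is_candidate(tok):
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                context = {tokens[j]: 1.0 / abs(i - j)
                           for j in range(lo, hi) if j != i}
                index[tok].append((doc_id, context))
    return index

def answer(index, question_words):
    """Retrieval time: rank stored candidates by how strongly their
    indexed context overlaps the question words -- a lookup, not a scan."""
    best, best_score = None, 0.0
    for cand, entries in index.items():
        for _, context in entries:
            score = sum(context.get(w, 0.0) for w in question_words)
            if score > best_score:
                best, best_score = cand, score
    return best

docs = {"d1": ["seoul", "is", "the", "capital", "of", "korea"]}
idx = build_answer_index(docs, lambda t: t in {"seoul", "korea"})
```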

Journal Article•DOI•
TL;DR: This paper attempts to improve the performance of a statistical method by integrating it with the transformation-based error-driven learning (TEL) technique and shows the TEL algorithm to be effective in improving the precision.
Abstract: Noun phrases are commonly used for generating index terms for information retrieval systems. Therefore, we need an effective noun phrase extraction method. In this paper, we propose an approach to extract maximal noun phrases from Chinese text. Although previous studies have proposed noun phrase extraction, most of them are only applicable to Western languages. To the best of our knowledge, very few have handled Chinese text. Many existing approaches for Western languages make use of statistical methods. However, due to the complicated structure of maximal Chinese noun phrases, pure statistical approaches are not effective. We attempt to improve the performance of a statistical method by integrating it with the transformation-based error-driven learning (TEL) technique. Our methodology comprises two modules. The first module applies a statistical method to extract Chinese noun phrases; its performance, in terms of precision and recall, is investigated. The second module applies the TEL algorithm to further refine the output of the first module. The TEL algorithm automatically learns a set of transformation rules to fix the errors found by comparing the output of the first module with a correctly annotated corpus. The learned rules can then be applied, one by one, to sentences in any corpus to correct the errors. The TEL algorithm is shown to be effective in improving precision.
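The TEL module's core loop can be sketched as greedy error-driven rule selection. The rule template below ("change tag A to B when the next word is W") is one illustrative template, not the paper's rule set, and the example uses POS-style tags for brevity.

```python
def learn_tel_rules(tokens, initial, gold, max_rules=10):
    """Transformation-based error-driven learning, minimal form:
    repeatedly pick the rule 'change tag A to B when the next word
    is W' with the best net error reduction, and apply it."""
    tags, rules = list(initial), []
    for _ in range(max_rules):
        gains = {}
        for i in range(len(tokens) - 1):          # propose fixes at errors
            if tags[i] != gold[i]:
                rule = (tags[i], tokens[i + 1], gold[i])
                gains[rule] = gains.get(rule, 0) + 1
        for i in range(len(tokens) - 1):          # penalize collateral damage
            if tags[i] == gold[i]:
                for (frm, nxt, to) in list(gains):
                    if tags[i] == frm and tokens[i + 1] == nxt and to != gold[i]:
                        gains[(frm, nxt, to)] -= 1
        if not gains:
            break
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break
        frm, nxt, to = best
        for i in range(len(tokens) - 1):
            if tags[i] == frm and tokens[i + 1] == nxt:
                tags[i] = to
        rules.append(best)
    return rules, tags

tokens = ["the", "dog", "runs", "fast"]
gold = ["DT", "NN", "VB", "RB"]
rules, fixed = learn_tel_rules(tokens, ["DT", "VB", "VB", "RB"], gold)
```

The learned rules are an ordered list, applied one after another, which is what lets them be reused on new, unannotated sentences.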

Journal Article•DOI•
TL;DR: It is shown that within the TREC 5&6 Chinese corpus and retrieval environment, 74% of monolingual effectiveness is achievable for short queries of a few English words, and 85% for long queries of paragraph sizes.
Abstract: We investigated using the LDC English/Chinese bilingual wordlists for English-Chinese cross language retrieval. It is shown that the Chinese-to-English wordlist can be considered as both a phrase and word dictionary, and is preferable to the English-to-Chinese version in terms of phrase translation and word translation selection. Additional techniques such as frequency-based term selection, translation set weighting and term co-occurrence data were employed. Experiments show that within the TREC 5&6 Chinese corpus and retrieval environment, 74% of monolingual effectiveness is achievable for short queries of a few English words, and 85% for long queries of paragraph sizes.
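Frequency-based translation selection, one of the techniques named above, can be sketched as: for each source term with dictionary entries, keep the candidate translation that is most frequent in the target-language collection. The two-entry dictionary and the counts below are toy assumptions, not the LDC wordlist.

```python
def translate_query(terms, bilingual_dict, target_freq):
    """Pick, for each source term that has dictionary entries, the
    candidate translation occurring most often in the target collection;
    terms without entries are dropped."""
    out = []
    for term in terms:
        candidates = bilingual_dict.get(term, [])
        if candidates:
            out.append(max(candidates, key=lambda c: target_freq.get(c, 0)))
    return out

# 银行 = bank (finance), 河岸 = riverbank; corpus frequency disambiguates.
lexicon = {"bank": ["银行", "河岸"], "interest": ["利息"]}
freq = {"银行": 120, "河岸": 7, "利息": 40}
translated = translate_query(["bank", "interest", "oov"], lexicon, freq)
```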

Journal Article•DOI•
TL;DR: Experimental results demonstrate that the complexity of Chinese parsing is greatly reduced using the rules learnt by the proposed method of Chinese grammar rule learning.
Abstract: A new method of learning Chinese grammar rules is put forward in this paper. The key point of this method is to use part-of-speech (POS), semantic and contextual information together in learning and expressing Chinese grammar rules. In this way, not only can Context-Free Grammar (CFG) rules be learnt, but ambiguous POS structures can also be identified automatically. Furthermore, non-ambiguous semantic rules and forbidden rules can be produced from ambiguous rules by using semantic and contextual information. Experimental results demonstrate that the complexity of Chinese parsing is greatly reduced using the different rules learnt by our method.

Journal Article•DOI•
TL;DR: In order to manage constraints among different enterprise information sources effectively, the properties of distributed constraints are presented, and Active Rules are incorporated to support constraints of enterprise-wide information processing.
Abstract: With the explosive growth of the World Wide Web, research on constraints across different databases has become increasingly important. How to interoperate heterogeneous Virtual Enterprise (VE) information sources, and how to enforce enterprise logic across multi-enterprise information sources, are problems attracting attention. Existing approaches define enterprise logic in advance, which results in poor flexibility, reusability and practicability. To solve these problems, a cooperative constraint description for different kinds of distributed constraints is proposed. The constraints are classified through the Constraint Definition Language LC. To manage these constraints among different enterprise information sources effectively, the properties of distributed constraints are presented, and active rules are incorporated to support constraints in enterprise-wide information processing. Further, some novel implementation issues are presented, including enterprise event classification, static and dynamic definition methods for the Enterprise Event Object (EEO), and the scheduler model in ViaScope, a CORBA-based VE information integration system.

Journal Article•DOI•
TL;DR: The experimental results show the extraction of the information about location and width can improve the compression performance effectively, and therefore the proposed method obtains better compression ratios than others.
Abstract: Digitizing a calligraphic piece as a bi-level image has been shown to be appropriate. To preserve aesthetic features, lossless or near-lossless compression is preferred. In this paper, we propose a lossless compression method for bi-level calligraphic images. Each segment of contiguous foreground-color pixels is abbreviated into a representative pixel, and the number of pruned pixels is recorded. The location information and width information are then decomposed. To reduce redundancy, a location propagation approach and a width inheritance approach are applied separately. The proposed method has been compared with several lossless compression methods, namely WinZip, run-length coding, New S-tree, lossless pattern matching (LPM), and JBIG2. The experimental results show that extracting the location and width information improves compression performance effectively, and the proposed method therefore obtains better compression ratios than the others. Moreover, the proposed method is very simple and efficient.
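The segment-abbreviation step can be sketched as run extraction on one image row: each run of contiguous foreground pixels is reduced to its location (start) and its width, which is exactly the information the paper's location-propagation and width-inheritance stages then compress further. A hedged sketch of that first step, not the full method:

```python
def encode_row(row):
    """Encode one row of a bi-level image as (location, width) pairs:
    each run of contiguous foreground pixels (1s) is reduced to its
    starting position and its length."""
    runs, i = [], 0
    while i < len(row):
        if row[i] == 1:
            j = i
            while j < len(row) and row[j] == 1:
                j += 1
            runs.append((i, j - i))
            i = j
        else:
            i += 1
    return runs

def decode_row(runs, width):
    """Rebuild the row exactly from its runs -- the encoding is lossless."""
    row = [0] * width
    for start, length in runs:
        for k in range(start, start + length):
            row[k] = 1
    return row

runs = encode_row([0, 1, 1, 0, 1])
restored = decode_row(runs, 5)
```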

Journal Article•DOI•
TL;DR: The previously ignored ontology used in the ALICE system is described in detail, and its importance in the lexically interlingual, syntactically transfer-based Mini-Transfer approach is reconfirmed.
Abstract: This article presents the design of a new architecture for ontology-based multi-lingual spoken dialogue systems. At the center of the architecture is the domain ontology. Around the ontology are four processing modules: the human interface module, the intelligence core module, the content interface module, and the fusion module. In this article, we first review an earlier work on Chinese-English machine translation, the Mini-Transfer approach, and a corresponding system called ALICE. Then, the previously ignored ontology used in the ALICE system is described in detail, and its importance in the lexically interlingual, syntactically transfer-based Mini-Transfer approach is reconfirmed. After that, we propose the new ontology-based architecture for multi-lingual (cross-lingual) spoken dialogue systems. Finally, we point out some research issues and future directions for ontology-based multi-lingual (cross-lingual) spoken dialogue systems, especially for Chinese and other oriental languages.

Journal Article•DOI•
TL;DR: Mel'cuk's meaning-text model was adopted as the linguistic model in the Korean generation part, in which the complicated task of Korean generation is broken into logically independent subtasks, making the system highly modularized and robust.
Abstract: This paper describes how to generate high-quality Korean sentences from intermediate meaning representations. In previous research, there was no clear-cut separation between syntactic and morphological processing, and lexical information and rules were tightly coupled. This makes it difficult to enhance portability and extensibility, or to handle complex linguistic phenomena in a systematic manner. We adopted Mel'cuk's meaning-text model as the linguistic model in the Korean generation part, so that the complicated task of Korean generation can be broken into logically independent subtasks, making the system highly modularized and robust. In a Korean generation experiment, the system showed a fidelity rate of 90% and an intelligibility rate of 85%, a promising result considering the difficulty of generating various linguistic phenomena.

Journal Article•DOI•
TL;DR: The experimental results show that the image quality of the proposed method is better, in each phase, than that of the traditional TSVQ and SMTSVQ methods for Chinese calligraphy images.
Abstract: Thanks to the efforts of calligraphers throughout the dynasties of China, a considerable number of valuable calligraphy works has accumulated. These calligraphies now face the natural trend of digitization. However, digital imagery requires a great number of bits and a large-capacity transmission channel. This burden can be alleviated by progressive image transmission (PIT) techniques. In this paper, a new progressive image transmission technique for Chinese calligraphy is proposed. The proposed method exploits the facts that Chinese calligraphy images contain only black and white, and that pixels of the same color tend to cluster together. The experimental results show that the image quality of our method is better, in each phase, than that of the related traditional TSVQ and SMTSVQ methods for Chinese calligraphy. Our method is thus effective for transmitting Chinese calligraphy images progressively.