
Showing papers in "ACM Transactions on Asian Language Information Processing in 2006"


Journal ArticleDOI
TL;DR: A separable mixture model (SMM) is adopted to estimate the similarity between an input sentence and the EARs of each emotional state, and a dialog system focusing on students' daily expressions is constructed.
Abstract: This study presents a novel approach to automatic emotion recognition from text. First, emotion generation rules (EGRs) are manually deduced from psychology to represent the conditions for generating emotion. Based on the EGRs, the emotional state of each sentence can be represented as a sequence of semantic labels (SLs) and attributes (ATTs); SLs are defined as the domain-independent features, while ATTs are domain-dependent. The emotion association rules (EARs) represented by SLs and ATTs for each emotion are automatically derived from the sentences in an emotional text corpus using the Apriori algorithm. Finally, a separable mixture model (SMM) is adopted to estimate the similarity between an input sentence and the EARs of each emotional state. Since some features defined in this approach are domain-dependent, a dialog system focusing on the students' daily expressions is constructed, and only three emotional states, happy, unhappy, and neutral, are considered for performance evaluation. According to the results of the experiments, given the domain corpus, the proposed approach is promising and easily ported to other domains.
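
The abstracts in this issue come without code, so here is a minimal, hypothetical sketch of the rule-mining step described above: deriving candidate emotion association rules from sentences represented as sets of semantic labels (SLs) and attributes (ATTs). It replaces the level-wise pruning of the true Apriori algorithm with brute-force enumeration, and the label names and support threshold are invented for illustration.

```python
from itertools import combinations
from collections import Counter

def frequent_label_sets(sentences, min_support=2, max_size=3):
    """Brute-force stand-in for Apriori: count every combination of
    semantic labels / attributes (up to max_size) across the sentences
    of one emotional state and keep those reaching min_support."""
    counts = Counter()
    for labels in sentences:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(labels), size):
                counts[combo] += 1
    return {combo: c for combo, c in counts.items() if c >= min_support}

# Toy usage with hypothetical label names for sentences tagged "happy".
happy = [{"SL:praise", "ATT:exam"},
         {"SL:praise", "ATT:friend"},
         {"SL:praise", "ATT:exam", "SL:gain"}]
print(frequent_label_sets(happy))
# e.g. {('SL:praise',): 3, ('ATT:exam',): 2, ('ATT:exam', 'SL:praise'): 2}
```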

245 citations


Journal ArticleDOI
TL;DR: Three statistical query translation models that focus on the resolution of query translation ambiguities are presented; all three have a positive impact on query translation and lead to significant improvements in CLIR performance over the simple dictionary-based translation method.
Abstract: Query translation is an important task in cross-language information retrieval (CLIR), which aims to determine the best translation words and weights for a query. This article presents three statistical query translation models that focus on the resolution of query translation ambiguities. All the models assume that the selection of the translation of a query term depends on the translations of other terms in the query. They differ in the way linguistic structures are detected and exploited. The co-occurrence model treats a query as a bag of words and uses all the other terms in the query as the context for translation disambiguation. The other two models exploit linguistic dependencies among terms. The noun phrase (NP) translation model detects NPs in a query, and translates each NP as a unit by assuming that the translation of a term only depends on other terms within the same NP. Similarly, the dependency translation model detects and translates dependency triples, such as verb-object, as units. The evaluations show that linguistic structures always lead to more precise translations. The experiments of CLIR on TREC Chinese collections show that all three models have a positive impact on query translation and lead to significant improvements of CLIR performance over the simple dictionary-based translation method. The best results are obtained by combining the three models.
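
As a rough illustration of the co-occurrence model only (not the authors' implementation), the sketch below greedily picks, for each query term, the dictionary translation that co-occurs most strongly with the candidate translations of the other terms. The dictionary format and the `cooc` scoring function are assumptions.

```python
def disambiguate_by_cooccurrence(query_terms, dictionary, cooc):
    """Greedy translation selection using the rest of the query as context.

    dictionary: source term -> list of candidate target translations
    cooc(a, b): co-occurrence score of two target-language words,
                e.g. mutual information estimated from a target corpus
    """
    chosen = {}
    for term in query_terms:
        best, best_score = None, float("-inf")
        for cand in dictionary.get(term, []):
            # Context = candidate translations of every other query term.
            score = sum(cooc(cand, other)
                        for other_term in query_terms if other_term != term
                        for other in dictionary.get(other_term, []))
            if score > best_score:
                best, best_score = cand, score
        if best is not None:
            chosen[term] = best
    return chosen
```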

37 citations


Journal ArticleDOI
TL;DR: A new approach to aligning bilingual NEs in parallel corpora by incorporating statistical models with multiple knowledge sources, which model the process of translating an English NE phrase into a Chinese equivalent using lexical translation/transliteration probabilities for word translation and alignment probabilities for word reordering.
Abstract: Named entity (NE) extraction is one of the fundamental tasks in natural language processing (NLP). Although many studies have focused on identifying NEs within monolingual documents, aligning NEs in bilingual documents has not been investigated extensively due to the complexity of the task. In this article we introduce a new approach to aligning bilingual NEs in parallel corpora by incorporating statistical models with multiple knowledge sources. In our approach, we model the process of translating an English NE phrase into a Chinese equivalent using lexical translation/transliteration probabilities for word translation and alignment probabilities for word reordering. The method involves automatically learning phrase alignment and acquiring word translations from a bilingual phrase dictionary and parallel corpora, and automatically discovering transliteration transformations from a training set of name-transliteration pairs. The method also involves language-specific knowledge functions, including handling abbreviations, recognizing Chinese personal names, and expanding acronyms. At runtime, the proposed models are applied to each source NE in a pair of bilingual sentences to generate and evaluate the target NE candidates; the source and target NEs are then aligned based on the computed probabilities. Experimental results demonstrate that the proposed approach, which integrates statistical models with extra knowledge sources, is highly feasible and offers significant improvement in performance compared to our previous work, as well as the traditional approach of IBM Model 4.
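
The following sketch shows one way a translation-plus-reordering score for a candidate target NE could be computed. It is IBM Model 1-flavored rather than the authors' exact model, and both probability functions (and their signatures) are assumed to be supplied.

```python
import math

def candidate_ne_score(source_words, target_words, trans_prob, align_prob):
    """Log-score of a (source NE, candidate target NE) pair combining
    lexical translation/transliteration probabilities with word-reordering
    (alignment) probabilities.

    trans_prob(e, c):        P(c | e) for English word e, Chinese word c
    align_prob(i, j, m, n):  probability that source position i maps to
                             target position j given NE lengths m and n
    (both signatures are assumptions for this sketch).
    """
    m, n = len(source_words), len(target_words)
    score = 0.0
    for i, e in enumerate(source_words):
        # Marginalize over the possible target positions of this word.
        p = sum(trans_prob(e, c) * align_prob(i, j, m, n)
                for j, c in enumerate(target_words))
        score += math.log(max(p, 1e-12))   # guard against zero probability
    return score
```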

36 citations


Journal ArticleDOI
TL;DR: This paper proposes a transliteration model that dynamically uses both graphemes and phonemes, particularly the correspondence between them, and achieves better performance than has been reported for other models.
Abstract: Machine transliteration is an automatic method for converting words in one language into phonetically equivalent ones in another language. There has been growing interest in the use of machine transliteration to assist machine translation and information retrieval. Three types of machine transliteration models (grapheme-based, phoneme-based, and hybrid) have been proposed. Surprisingly, there have been few reports of efforts to utilize the correspondence between source graphemes and source phonemes, although this correspondence plays an important role in machine transliteration. Furthermore, little work has been reported on ways to dynamically handle source graphemes and phonemes. In this paper, we propose a transliteration model that dynamically uses both graphemes and phonemes, particularly the correspondence between them. With this model, we have achieved better performance (improvements of about 15% to 41% in English-to-Korean transliteration and about 16% to 44% in English-to-Japanese transliteration) than has been reported for other models.

30 citations


Journal ArticleDOI
TL;DR: The method combines text-based aspects such as lexical, syntactic, and contextual similarities between terms; lexical and syntactic links show high precision and low recall, while contextual similarities yield significantly higher recall with moderate precision.
Abstract: Discovering links and relationships is one of the main challenges in biomedical research, as scientists are interested in uncovering entities that have similar functions, take part in the same processes, or are coregulated. This article discusses the extraction of such semantically related entities (represented by domain terms) from biomedical literature. The method combines various text-based aspects, such as lexical, syntactic, and contextual similarities between terms. Lexical similarities are based on the level of sharing of word constituents. Syntactic similarities rely on expressions (such as term enumerations and conjunctions) in which a sequence of terms appears as a single syntactic unit. Finally, contextual similarities are based on automatic discovery of relevant contexts shared among terms. The approach is evaluated using the Genia resources, and the results of experiments are presented. Lexical and syntactic links have shown high precision and low recall, while contextual similarities have resulted in significantly higher recall with moderate precision. By combining the three metrics, we achieved F measures of 68% for semantically related terms and 37% for highly related entities.
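
As a small illustration of the lexical component only (the syntactic and contextual metrics are not reproduced), a Dice-style overlap of word constituents could look like the sketch below; the exact weighting used in the article is not specified here.

```python
def lexical_similarity(term_a, term_b):
    """Dice coefficient over the word constituents of two multiword terms."""
    a = set(term_a.lower().split())
    b = set(term_b.lower().split())
    if not a or not b:
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))

print(lexical_similarity("protein kinase C", "protein kinase inhibitor"))  # ~0.67
```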

29 citations


Journal ArticleDOI
TL;DR: This article presents an empirical study of four techniques for adapting language models, including a maximum a posteriori (MAP) method and three discriminative training models, in the application of Japanese Kana-Kanji conversion, and tries to interpret the results in terms of the character error rate (CER) by correlating them with the characteristics of the adaptation domain, measured by using the information-theoretic notion of cross entropy.
Abstract: This article presents an empirical study of four techniques for adapting language models, including a maximum a posteriori (MAP) method and three discriminative training models, in the application of Japanese Kana-Kanji conversion. We compare the performance of these methods from various angles by adapting the baseline model to four adaptation domains. In particular, we attempt to interpret the results in terms of the character error rate (CER) by correlating them with the characteristics of the adaptation domain, measured by using the information-theoretic notion of cross entropy. We show that such a metric correlates well with the CER performance of the adaptation methods, and also show that the discriminative methods are not only superior to a MAP-based method in achieving larger CER reduction, but also in having fewer side effects and being more robust against the similarity between background and adaptation domains.
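
A minimal sketch of the cross-entropy measure used to characterize an adaptation domain, assuming a simple unigram background model (the article works with full language models and Kana-Kanji conversion data):

```python
import math

def cross_entropy(background_prob, adaptation_tokens):
    """Bits per token of an adaptation-domain sample under the background
    model; higher values indicate a domain farther from the background.

    background_prob(token) -> probability under the background model
    (a unigram stand-in for this sketch).
    """
    total = 0.0
    for tok in adaptation_tokens:
        p = max(background_prob(tok), 1e-12)   # floor to avoid log(0)
        total -= math.log2(p)
    return total / max(len(adaptation_tokens), 1)
```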

24 citations


Journal ArticleDOI
TL;DR: This study shows that language modeling is a suitable framework to implement basic inference operations in IR effectively, and that the integration of term relationships into the language modeling framework can consistently improve retrieval effectiveness compared with traditional language models.
Abstract: Language modeling (LM) has been widely used in IR in recent years. An important operation in LM is smoothing of the document language model. However, the current smoothing techniques merely redistribute a portion of term probability according to their frequency of occurrences only in the whole document collection. No relationships between terms are considered and no inference is involved. In this article, we propose several inferential language models capable of inference using term relationships. The inference operation is carried out through a semantic smoothing either on the document model or query model, resulting in document or query expansion. The proposed models implement some of the logical inference capabilities proposed in the previous studies on logical models, but with necessary simplifications in order to make them tractable. They are a good compromise between inference power and efficiency. The models have been tested on several TREC collections, both in English and Chinese. It is shown that the integration of term relationships into the language modeling framework can consistently improve the retrieval effectiveness compared with the traditional language models. This study shows that language modeling is a suitable framework to implement basic inference operations in IR effectively.
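
A hypothetical sketch of semantic smoothing on the document model: probability mass is allowed to flow to a query term from related terms occurring in the document, and the result is then interpolated with the collection model. The relation weights and interpolation constants are invented for illustration and are not the article's estimates.

```python
def smoothed_doc_prob(term, doc_counts, doc_len, relations, coll_prob,
                      alpha=0.5, lam=0.2):
    """P(term | document) with inference through term relationships.

    doc_counts: term -> raw count in the document
    relations:  term -> {related_term: P(term | related_term)}
    coll_prob:  term -> collection (background) probability
    alpha, lam: hypothetical interpolation weights
    """
    p_ml = doc_counts.get(term, 0) / max(doc_len, 1)
    # Inferential part: mass reaching `term` from related terms in the doc.
    p_rel = sum(w * doc_counts.get(r, 0) / max(doc_len, 1)
                for r, w in relations.get(term, {}).items())
    p_doc = (1 - alpha) * p_ml + alpha * p_rel
    return (1 - lam) * p_doc + lam * coll_prob(term)
```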

24 citations


Journal ArticleDOI
TL;DR: This work proposes a method for labelling prepositional phrases according to two different semantic role classifications, as contained in the Penn treebank and the CoNLL 2004 Semantic Role Labeling data set.
Abstract: We propose a method for labelling prepositional phrases according to two different semantic role classifications, as contained in the Penn treebank and the CoNLL 2004 Semantic Role Labeling data set. Our results illustrate the difficulties in determining preposition semantics, but also demonstrate the potential for PP semantic role labelling to improve the performance of a holistic semantic role labelling system.

20 citations


Journal ArticleDOI
TL;DR: It is concluded that the VSM with unique (adaptive) pivoted document-length normalization is effective for Chinese IR and that its retrieval effectiveness is comparable to that of other competitive retrieval models with or without PRF for the reference test collections used in this evaluation.
Abstract: The vector space model (VSM) is one of the most widely used information retrieval (IR) models in both academia and industry. It was less effective at the Chinese ad hoc retrieval tasks than other retrieval models in the NTCIR-3 evaluation workshop, but comparable to those in the NTCIR-4 and NTCIR-5 workshops. We do not know whether the lower level performance was due to the VSM's inherent deficiencies or to a less effective normalization of document length. Hence we evaluated the VSM with various pivoted normalizations of document length using the NTCIR-3 collection for confirmation. We found that VSM's retrieval effectiveness with pivoted normalization was comparable to other competitive retrieval models (for example, 2-Poisson), and that VSM's retrieval speed with pivoted normalization was similar to competitive retrieval models (2-Poisson). We proposed a novel adaptive scheme that automatically estimates the (near) best parameters for pivoted document-length normalization based on query size; the new normalization is called adaptive pivoted document-length normalization. This scheme achieved good retrieval effectiveness, sometimes for short (title) queries and sometimes for long queries, without manually adjusting parameter values. We found that unique, adaptive pivoted normalization can enhance fixed pivoted normalizations for different test collections (TREC-5 and TREC-6). We also evaluated the VSM with the adaptive pivoted normalization using the pseudo-relevance feedback (PRF) and found that this type of VSM performs similarly to the competitive retrieval models (2-Poisson) with PRF. Hence, we conclude that the VSM with unique (adaptive) pivoted document-length normalization is effective for Chinese IR and that its retrieval effectiveness is comparable to that of other competitive retrieval models with or without PRF for the reference test collections used in this evaluation.
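
For reference, a standard pivoted document-length normalization (in the style of Singhal et al.) is sketched below. The article's adaptive variant would set the slope and pivot automatically from the query size, which is not reproduced here, and the tf damping shown is just one common choice.

```python
import math

def pivoted_vsm_score(query_terms, doc_tf, doc_len, avg_doc_len, idf, slope=0.2):
    """Vector-space score with pivoted document-length normalization."""
    norm = (1.0 - slope) * avg_doc_len + slope * doc_len
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0:
            continue
        # Sublinear tf damping commonly paired with pivoted normalization.
        score += (1.0 + math.log(1.0 + math.log(tf))) * idf(t)
    return score / norm
```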

16 citations


Journal ArticleDOI
TL;DR: An integrated knowledge-mining system for the domain of biomedicine is presented, in which automatic term recognition, term clustering, information retrieval, and visualization are combined to facilitate knowledge acquisition from documents and aid knowledge discovery through terminology-based similarity calculation and visualization of automatically structured knowledge.
Abstract: In this article we present an integrated knowledge-mining system for the domain of biomedicine, in which automatic term recognition, term clustering, information retrieval, and visualization are combined. The primary objective of this system is to facilitate knowledge acquisition from documents and aid knowledge discovery through terminology-based similarity calculation and visualization of automatically structured knowledge. This system also supports the integration of different types of databases and simultaneous retrieval of different types of knowledge. In order to accelerate knowledge discovery, we also propose a visualization method for generating similarity-based knowledge maps. The method is based on real-time terminology-based knowledge clustering and categorization and allows users to observe real-time generated knowledge maps, graphically. Lastly, we discuss experiments using the GENIA corpus to assess the practicality and applicability of the system.

14 citations


Journal ArticleDOI
TL;DR: A two-phase biomedical NE-recognition method based on an ME model is presented: it first recognizes biomedical terms and then assigns appropriate semantic classes to the recognized terms; morphological patterns extracted from the training data are used as features for learning the ME-based classifiers.
Abstract: In this paper, we present a two-phase biomedical NE-recognition method based on an ME model: we first recognize biomedical terms and then assign appropriate semantic classes to the recognized terms. In the two-phase NE-recognition method, the performance of the term-recognition phase is very important, because the semantic classification is performed on the region identified at the recognition phase. In this study, in order to improve the performance of term recognition, we try to incorporate lexical knowledge into pre- and postprocessing of the term-recognition phase. In the preprocessing step, we use domain-salient words as lexical knowledge obtained by corpus comparison. In the postprocessing step, we utilize χ²-based collocations gained from the Medline corpus. In addition, we use morphological patterns extracted from the training data as features for learning the ME-based classifiers. Experimental results show that the performance of NE-recognition can be improved by utilizing such lexical knowledge.
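
The χ²-based collocation filter could be computed from a 2×2 contingency table as in the sketch below (the standard Pearson χ² statistic for bigrams; the counts are assumed to come from a corpus such as Medline, and the article's exact setup may differ).

```python
def chi_square_collocation(n_w1w2, n_w1, n_w2, n_total):
    """Pearson chi-square statistic for the bigram (w1, w2).

    n_w1w2:  count of w1 immediately followed by w2
    n_w1:    count of bigrams whose first word is w1
    n_w2:    count of bigrams whose second word is w2
    n_total: total number of bigrams in the corpus
    """
    o11 = n_w1w2
    o12 = n_w1 - n_w1w2
    o21 = n_w2 - n_w1w2
    o22 = n_total - n_w1 - n_w2 + n_w1w2
    num = n_total * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den if den else 0.0
```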

Journal ArticleDOI
TL;DR: Experiments show that the proposed method could make good use of temporal information in news stories, and it consistently outperforms the baseline centroid algorithm and other algorithms which consider temporal relatedness.
Abstract: Temporal information is an important attribute of a topic, and a topic usually exists in a limited period. Therefore, many researchers have explored the utilization of temporal information in topic detection and tracking (TDT). They use either a story's publication time or temporal expressions in text to derive temporal relatedness between two stories or a story and a topic. However, past research neglects the fact that people tend to express a time with different granularities as time lapses. Based on a careful investigation of temporal information in news streams, we propose a new strategy with time granularity reasoning for utilizing temporal information in topic tracking. A set of topic times, which as a whole represent the temporal attribute of a topic, are distinguished from others in the given on-topic stories. The temporal relatedness between a story and a topic is then determined by the highest coreference level between each time in the story and each topic time where the coreference level between a test time and a topic time is inferred from the two times themselves, their granularities, and the time distance between the topic time and the publication time of the story where the test time appears. Furthermore, the similarity value between an incoming story and a topic, that is the likelihood that a story is on-topic, can be adjusted only when the new story is both temporally and semantically related to the target topic. Experiments on two different TDT corpora show that our proposed method could make good use of temporal information in news stories, and it consistently outperforms the baseline centroid algorithm and other algorithms which consider temporal relatedness.
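
The granularity-aware matching could be sketched roughly as below: two times are compared only down to the coarser of their two granularities, and finer agreement yields a higher coreference level. The three-level scale and the omission of the publication-time distance are simplifications of this sketch, not the paper's exact definition.

```python
from datetime import date

GRAN_ORDER = {"year": 1, "month": 2, "day": 3}

def coreference_level(test_time, test_gran, topic_time, topic_gran):
    """Compare two dates at the coarser of their granularities;
    return 3 (same day), 2 (same month), 1 (same year), or 0."""
    shared = min(GRAN_ORDER[test_gran], GRAN_ORDER[topic_gran])
    if shared >= 3 and test_time == topic_time:
        return 3
    if shared >= 2 and (test_time.year, test_time.month) == (topic_time.year, topic_time.month):
        return 2
    if test_time.year == topic_time.year:
        return 1
    return 0

# A day-granular mention and a month-granular topic time match at month level.
print(coreference_level(date(2006, 3, 5), "day", date(2006, 3, 1), "month"))  # 2
```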

Journal ArticleDOI
TL;DR: This paper presents a statistical framework for dictionary-based CLIR that estimates the translation probabilities of query words based on the monolingual word co-occurrence statistics, and presents two realizations of the proposed framework that exploit different metrics for the coherence measurement between a translation of a query word and the theme of the entire query.
Abstract: Resolving ambiguity in the process of query translation is crucial to cross-language information retrieval (CLIR), given the short length of queries. This problem is even more challenging when only a bilingual dictionary is available, which is the focus of our work described here. In this paper, we will present a statistical framework for dictionary-based CLIR that estimates the translation probabilities of query words based on the monolingual word co-occurrence statistics. In addition, we will present two realizations of the proposed framework, i.e., the “maximum coherence model” and the “spectral query-translation model,” that exploit different metrics for the coherence measurement between a translation of a query word and the theme of the entire query. Compared to previous work on dictionary-based CLIR, the proposed framework is advantageous in three aspects: (1) Translation probabilities are calculated explicitly to capture the uncertainty in translating queries; (2) translations of all query words are estimated simultaneously rather than independently; and (3) the formulated problem can be solved efficiently with a unique optimal solution. Empirical studies with Chinese-English cross-language information retrieval using TREC datasets have shown that the proposed models achieve a relative 10%-50% improvement, compared to other approaches that also exploit word co-occurrence statistics for query translation disambiguation.

Journal ArticleDOI
TL;DR: This paper proposes a novel approach to align the ontologies at the node level, a language-independent, corpus-based method that borrows from techniques used in information retrieval and machine translation and shows its efficiency by applying it to two very different ontologies in very different languages.
Abstract: The growing importance of multilingual information retrieval and machine translation has made multilingual ontologies extremely valuable resources. Since the construction of an ontology from scratch is a very expensive and time-consuming undertaking, it is attractive to consider ways of automatically aligning monolingual ontologies, which already exist for many of the world's major languages. Previous research exploited similarity in the structure of the ontologies to align, or manually created bilingual resources. These approaches cannot be used to align ontologies with vastly different structures and can only be applied to much studied language pairs for which expensive resources are already available. In this paper, we propose a novel approach to align the ontologies at the node level: Given a concept represented by a particular word sense in one ontology, our task is to find the best corresponding word sense in the second language ontology. To this end, we present a language-independent, corpus-based method that borrows from techniques used in information retrieval and machine translation. We show its efficiency by applying it to two very different ontologies in very different languages: the Mandarin Chinese HowNet and the American English WordNet. Moreover, we propose a methodology to measure bilingual corpora comparability and show that our method is robust enough to use noisy nonparallel bilingual corpora efficiently, when clean parallel corpora are not available.
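
In the spirit of the corpus-based, language-independent method described above (the details here are assumptions, not the authors' algorithm), one could represent each word sense by a context vector built from a comparable corpus, map one side into the other language through a seed dictionary, and align a sense to the target-ontology sense with the most similar context:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as Counters."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def best_target_sense(source_context, target_contexts, seed_dict):
    """Map a source-language context vector through a seed bilingual
    dictionary, then pick the target sense with the most similar context.

    source_context:  Counter of source-language context words
    target_contexts: {target_sense_id: Counter of target context words}
    seed_dict:       source word -> list of target translations
    """
    mapped = Counter()
    for word, weight in source_context.items():
        for trans in seed_dict.get(word, []):
            mapped[trans] += weight
    if not target_contexts:
        return None
    return max(target_contexts.items(), key=lambda kv: cosine(mapped, kv[1]))[0]
```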

Journal ArticleDOI
Jung-jae Kim, Jong Cheol Park
TL;DR: This work presents a system that extracts contrastive information from expressions of negation by identifying those syntactic structures with natural language processing techniques and with additional linguistic resources for semantics, and applies the system to the biological interactions as extracted by the biomedical information-extraction system in order to enrich proteome databases with contrastive information.
Abstract: Expressions of negation in the biomedical literature often encode information of contrast as a means for explaining significant differences between the objects that are so contrasted. We show that such information gives additional insights into the nature of the structures and/or biological functions of these objects, leading to valuable knowledge for subcategorization of protein families by the properties that the involved proteins do not have in common. Based on the observation that the expressions of negation employ mostly predictable syntactic structures that can be characterized by subclausal coordination and by clause-level parallelism, we present a system that extracts such contrastive information by identifying those syntactic structures with natural language processing techniques and with additional linguistic resources for semantics. The implemented system shows the performance of 85.7% precision and 61.5% recall, including 7.7% partial recall, or an F score of 76.6. We apply the system to the biological interactions as extracted by our biomedical information-extraction system in order to enrich proteome databases with contrastive information.

Journal ArticleDOI
TL;DR: A supervised machine-learning method was adopted to reduce the human effort in making extraction rules in order to obtain a highly domain-portable system, and the classical trade-off between precision and recall was overcome by using an event component verification method.
Abstract: Many previous biological event-extraction systems were based on hand-crafted rules that were tuned to a specific biological application domain. But manually constructing and tuning the rules are time-consuming processes and make the systems less portable. So supervised machine-learning methods were developed to generate the extraction rules automatically, but accepting the trade-off between precision and recall (high recall with low precision, and vice versa) is a barrier to improving performance. To make matters worse, a text in the biological domain is more complex because it often contains more than two biological events in a sentence, and one event in a noun chunk can be an entity for the other event. As a result, there are as yet no systems that give a good performance in extracting events in biological domains by using supervised machine learning. To overcome the limitations of previous systems and the complexity of biological texts, we present the following new ideas. First, we adopted a supervised machine-learning method to reduce the human effort in making extraction rules in order to obtain a highly domain-portable system. Second, we overcame the classical trade-off between precision and recall by using an event component verification method. Thus, machine learning occurs in two phases in our architecture. In the first phase, the system focuses on improving recall in extracting events between biological entities during a supervised machine-learning period. After extracting the biological events with automatically learned rules, in the second phase the system removes incorrect biological events by verifying the extracted event components with a maximum entropy (ME) classification method. In other words, the system targets high recall in the first phase and tries to achieve high precision with a classifier in the second phase. Finally, we improved a supervised machine-learning algorithm so that it could learn a rule in a noun chunk and a rule extending throughout a sentence at two different levels, separately, for nested biological events.

Journal ArticleDOI
TL;DR: The results show that knowledge about the use of normative honorific expressions in textbooks is similar to that demonstrated by the younger subject groups but differs from that of the older subject groups.
Abstract: We investigated, via experiment, knowledge of normative honorific expressions as used in textbooks and in practice by people. Forty subjects divided into four groups according to age (younger/older) and gender (male/female) participated in the experiments. The results show that knowledge about the use of normative honorific expressions in textbooks is similar to that demonstrated by the younger subject groups, but differs from that of the older subject groups. The knowledge of the older subjects was more complex than that shown in textbooks or demonstrated by the younger subjects. A model that can identify misuse of honorific expressions in sentences is the framework for this investigation. The model is minimal, but could represent 76% to 92% of the subjects' knowledge regarding each honorific element. This model will be useful in the development of computer-aided systems to help teach how honorific expressions should be used.

Journal ArticleDOI
TL;DR: Through retrieval experiments using the Japanese test collection for information retrieval systems, it is shown that these two methods offer superior retrieval effectiveness compared with the TF-IDF method, and are effective with different databases and diverse search topic sets.
Abstract: Two Japanese-language information retrieval (IR) methods that enhance retrieval effectiveness by utilizing the relationships between words are proposed. The first method uses dependency relationships between words in a sentence. The second method uses proximity relationships, particularly information about the ordered co-occurrence of words in a sentence, to approximate the dependency relationships between them. A Structured Index has been constructed for these two methods, which represents the dependency relationships between words in a sentence as a set of binary trees. The Structured Index is created by morphological analysis and dependency analysis based on simple template matching and compound noun analysis derived from word statistics. Through retrieval experiments using the Japanese test collection for information retrieval systems (NTCIR-1, the NACSIS Test Collection for IR systems), it is shown that these two methods offer superior retrieval effectiveness compared with the TF-IDF method, and are effective with different databases and diverse search topic sets. There is little difference in retrieval effectiveness between these two methods.
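
The proximity-based approximation could be as simple as the ordered co-occurrence check below (the window size and tokenization are assumptions; the article builds a Structured Index of binary trees rather than scanning sentences at query time).

```python
def ordered_cooccurrence(tokens, modifier, head, max_dist=5):
    """True if `modifier` precedes `head` within `max_dist` tokens in the
    sentence, used here as a cheap stand-in for a dependency between them."""
    mod_pos = [i for i, t in enumerate(tokens) if t == modifier]
    head_pos = [i for i, t in enumerate(tokens) if t == head]
    return any(0 < j - i <= max_dist for i in mod_pos for j in head_pos)
```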

Journal ArticleDOI
TL;DR: This article presents a data-driven approach that can deal with such difficult data instances by discovering and emphasizing important conjunctions or associations of statistics hidden in the training data.
Abstract: Discriminative sequential learning models like Conditional Random Fields (CRFs) have achieved significant success in several areas such as natural language processing or information extraction. Their key advantage is the ability to capture various nonindependent and overlapping features of inputs. However, several unexpected pitfalls have a negative influence on the model's performance; these mainly come from a high imbalance among classes, irregular phenomena, and potential ambiguity in the training data. This article presents a data-driven approach that can deal with such difficult data instances by discovering and emphasizing important conjunctions or associations of statistics hidden in the training data. Discovered associations are then incorporated into these models to deal with difficult data instances. Experimental results of phrase-chunking and named entity recognition using CRFs show a significant improvement in accuracy. In addition to the technical perspective, our approach also highlights a potential connection between association mining and statistical learning by offering an alternative strategy to enhance learning performance with interesting and useful patterns discovered from large datasets.
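
A simplified sketch of turning discovered associations into CRF observation features: atomic features active at a position are combined into conjunction features, optionally restricted to the combinations found important by association mining. The feature names and selection criterion are illustrative only, not the article's procedure.

```python
from itertools import combinations

def conjunction_features(base_features, max_order=2, frequent=None):
    """Add conjunction ("A&B") features to a list of atomic features.

    base_features: strings active at the current position, e.g. "w=Bank"
    frequent:      optional set of tuples kept by association mining;
                   if given, only those conjunctions are emitted
    """
    feats = sorted(set(base_features))
    conj = []
    for order in range(2, max_order + 1):
        for combo in combinations(feats, order):
            if frequent is None or combo in frequent:
                conj.append("&".join(combo))
    return feats + conj

print(conjunction_features(["w=Bank", "pos=NNP", "cap=True"]))
```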

Journal ArticleDOI
TL;DR: A brief overview of some major aspects of explicating reasoning in NLP is given, and the articles included in this special issue are summarized.
Abstract: For any application related to natural language processing (NLP), reasoning has been recognized as a necessary underlying aspect. Much of the existing work in NLP deals with specific NLP problems in a highly heuristic manner rather than from an explicit reasoning perspective. Recently, there have been developments on models that allow reasoning in NLP, such as language models and logical models. The goal of this special issue is to present high-quality contributions that integrate the reasoning involved in different areas of natural language processing at theoretical and/or practical levels. In this article, we give a brief overview of some major aspects of explicating reasoning in NLP and summarize the articles included in this special issue.