
Showing papers in "ACM Transactions on Asian Language Information Processing in 2005"


Journal ArticleDOI
TL;DR: This paper explores a new, inexpensive Flexible PRF method, called Selective Sampling, which is unique in that it can skip documents in the initial ranked output to look for more “novel” pseudo-relevant documents.
Abstract: Although Pseudo-Relevance Feedback (PRF) is a widely used technique for enhancing average retrieval performance, it may actually hurt performance for around one-third of a given set of topics. To enhance the reliability of PRF, Flexible PRF has been proposed, which adjusts the number of pseudo-relevant documents and/or the number of expansion terms for each topic. This paper explores a new, inexpensive Flexible PRF method, called Selective Sampling, which is unique in that it can skip documents in the initial ranked output to look for more “novel” pseudo-relevant documents. While Selective Sampling is only comparable to Traditional PRF in terms of average performance and reliability, per-topic analyses show that Selective Sampling outperforms Traditional PRF almost as often as Traditional PRF outperforms Selective Sampling. Thus, treating the top P documents as relevant is often not the best strategy. However, predicting when Selective Sampling outperforms Traditional PRF appears to be as difficult as predicting when a PRF method fails. For example, our per-topic analyses show that even the proportion of truly relevant documents in the pseudo-relevant set is not necessarily a good performance predictor.
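The abstract does not spell out the selection rule, so the following is only a rough sketch of the "skipping for novelty" idea, assuming a cosine-similarity criterion: documents are taken from the initial ranking in order, but a document too similar to those already selected is skipped. The threshold, the similarity measure, and all names are illustrative assumptions, not the authors' actual Selective Sampling procedure.

```python
# Hedged sketch of novelty-based pseudo-relevant document selection.
# The cosine criterion and max_sim threshold are assumptions, not the
# paper's actual skipping rule.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(w * b[t] for t, w in a.items())  # Counter returns 0 for absent terms
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_pseudo_relevant(ranked_docs, p=10, max_sim=0.8):
    """Walk down the initial ranking; skip documents too similar to the
    ones already selected, until p pseudo-relevant documents are found."""
    selected = []
    for doc in ranked_docs:              # each doc: Counter of term freqs
        if all(cosine(doc, s) < max_sim for s in selected):
            selected.append(doc)
        if len(selected) == p:
            break
    return selected
```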

83 citations


Journal ArticleDOI
Hisao Mase, Tadataka Matsubayashi, Yuichi Ogawa, Makoto Iwayama, Tadaaki Oshio
TL;DR: Evaluation results using test sets of the NTCIR4 Patent Retrieval Task show that the methods are effective, though the degree of the effectiveness varies depending on the test sets.
Abstract: The importance of patents is increasing in global society. In preparing a patent application, it is essential to search for related patents that may invalidate the invention. However, it is time-consuming to identify them among the millions of patents. This article proposes a patent-retrieval method that considers a claim structure for a more accurate search for invalidity. This method uses a claim text as input; it consists of two retrieval stages. In stage 1, general text analysis and retrieval methods are applied to improve recall. In stage 2, the top N documents retrieved in stage 1 are rearranged to improve precision by applying text analysis and retrieval methods using the claim structure. Our two-stage retrieval introduces five precision-oriented analysis and retrieval methods: query-term extraction from a portion of a claim text that describes the characteristics of a claim; query term-weighting without term frequency; query term-weighting with “measurement terms”; text retrieval using only claims as a target; and calculating the relevant score by “partially” adding scores in stage 2 to those in stage 1. Evaluation results using test sets of the NTCIR4 Patent Retrieval Task show that our methods are effective, though the degree of the effectiveness varies depending on the test sets.
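As an illustration of the two-stage flow, here is a minimal sketch in which both scorers are simplistic term-overlap stand-ins (the paper's analysis methods are far richer) and the partial-addition weight lam is an assumed parameter.

```python
# Toy sketch of two-stage patent retrieval: stage 1 is recall-oriented,
# stage 2 reranks the top n by "partially" adding a precision-oriented,
# claim-structure-aware score. Overlap scoring is a stand-in only.
def overlap(query_terms, doc_terms):
    return len(set(query_terms) & set(doc_terms))

def two_stage_retrieve(claim_terms, characteristic_terms, docs, n=1000, lam=0.5):
    # Stage 1: rank the whole collection by plain term overlap (recall).
    stage1 = sorted(docs, key=lambda d: overlap(claim_terms, d), reverse=True)[:n]
    # Stage 2: rerank the top n, adding a score based on terms from the
    # portion of the claim that describes its characteristics (precision).
    return sorted(stage1,
                  key=lambda d: overlap(claim_terms, d)
                              + lam * overlap(characteristic_terms, d),
                  reverse=True)
```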

66 citations


Journal ArticleDOI
TL;DR: This work considers four types of causal relations---cause, effect, precond(ition) and means---mainly based on agents' volitionality, as proposed in the research field of discourse understanding, as well as creating a computational model that is able to classify samples according to the causal relation.
Abstract: In this paper, we deal with automatic knowledge acquisition from text, specifically the acquisition of causal relations. A causal relation is the relation existing between two events such that one event causes (or enables) the other event, such as “hard rain causes flooding” or “taking a train requires buying a ticket.” In previous work these relations have been classified into several types based on a variety of points of view. In this work, we consider four types of causal relations---cause, effect, precond(ition) and means---mainly based on agents' volitionality, as proposed in the research field of discourse understanding. The idea behind knowledge acquisition is to use resultative connective markers, such as “because,” “but,” and “if” as linguistic cues. However, there is no guarantee that a given connective marker always signals the same type of causal relation. Therefore, we need to create a computational model that is able to classify samples according to the causal relation. To examine how accurately we can automatically acquire causal knowledge, we attempted an experiment using Japanese newspaper articles, focusing on the resultative connective “tame.” By using machine-learning techniques, we achieved 80% recall with over 95% precision for the cause, precond, and means relations, and 30% recall with 90% precision for the effect relation. Furthermore, the classification results suggest that one can expect to acquire over 27,000 instances of causal relations from 1 year of Japanese newspaper articles.
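The paper's classifier and feature set are not reproduced here; purely as a sketch of the setup (supervised classification of connective-marker instances into the four relation types), the example below substitutes scikit-learn bag-of-words features and logistic regression, both swapped-in choices, with hypothetical toy data.

```python
# Hedged sketch: classify clause pairs joined by a connective such as
# "tame" into cause/effect/precond/means. Features and learner are
# swapped-in choices; the training strings are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["heavy rain fell TAME the river flooded",
               "he bought a ticket TAME he could take the train"]
train_labels = ["cause", "precond"]

clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)
print(clf.predict(["the road froze TAME cars slid"]))
```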

58 citations


Journal ArticleDOI
TL;DR: This work presents an overview of the retrieval effectiveness of nine vector-space and two probabilistic models that perform monolingual searches in the Chinese, Japanese, Korean, and English languages and addresses basic problems related to multilingual searches.
Abstract: Based on the NTCIR-4 test-collection, our first objective is to present an overview of the retrieval effectiveness of nine vector-space and two probabilistic models that perform monolingual searches in the Chinese, Japanese, Korean, and English languages. Our second goal is to analyze the relative merits of the various automated and freely available tools to translate the English-language topics into Chinese, Japanese, or Korean, and then submit the resultant query in order to retrieve pertinent documents written in one of the three Asian languages. We also demonstrate how bilingual searches could be improved by applying both the combined query translation strategies and data-fusion approaches. Finally, we address basic problems related to multilingual searches, in which queries written in English are used to search documents written in the English, Chinese, Japanese, and Korean languages.

54 citations


Journal ArticleDOI
TL;DR: Results indicate that the proposed approach outperformed the FAQ-Finder system in medical FAQ retrieval and the expectation-maximization algorithm is employed to estimate the optimal mixing weights in the probabilistic mixture model.
Abstract: This investigation presents an approach to domain-specific FAQ (frequently-asked question) retrieval using independent aspects. The data analysis classifies the questions in the collected QA (question-answer) pairs into ten question types in accordance with question stems. The answers in the QA pairs are then paragraphed and clustered using latent semantic analysis and the K-means algorithm. For semantic representation of the aspects, a domain-specific ontology is constructed based on WordNet and HowNet. A probabilistic mixture model is then used to interpret the query and QA pairs based on independent aspects; hence the retrieval process can be viewed as a maximum likelihood estimation problem. The expectation-maximization (EM) algorithm is employed to estimate the optimal mixing weights in the probabilistic mixture model. Experimental results indicate that the proposed approach outperformed the FAQ-Finder system in medical FAQ retrieval.
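A minimal sketch of the EM step mentioned above, assuming the per-aspect likelihoods of each observation have already been computed, so that only the mixing weights remain to be estimated; the toy numbers are illustrative.

```python
# EM for the mixing weights of a fixed-component mixture: the E-step
# computes each aspect's responsibility per observation, the M-step
# re-averages those responsibilities into new weights.
import numpy as np

def em_mixing_weights(comp, iters=100):
    n, k = comp.shape                 # rows: observations, cols: aspects
    w = np.full(k, 1.0 / k)           # start from uniform weights
    for _ in range(iters):
        joint = comp * w                                 # w_k * p_k(x), shape (n, k)
        resp = joint / joint.sum(axis=1, keepdims=True)  # E-step posteriors
        w = resp.sum(axis=0) / n                         # M-step update
    return w

comp = np.array([[0.9, 0.1], [0.8, 0.3], [0.2, 0.7]])    # toy likelihoods
print(em_mixing_weights(comp))
```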

50 citations


Journal ArticleDOI
TL;DR: This paper proposed a method to calculate sentence importance using scores, for responses to multiple questions, generated by a Question-Answering engine, and described the integration of this method with a generic multi-document summarization system.
Abstract: In recent years, answer-focused summarization has gained attention as a technology complementary to information retrieval and question answering. In order to realize multi-document summarization focused by multiple questions, we propose a method to calculate sentence importance using scores, for responses to multiple questions, generated by a Question-Answering engine. Further, we describe the integration of this method with a generic multi-document summarization system. The evaluation results demonstrate that the proposed method outperforms not only several baselines but also other participants' systems at the evaluation workshop NTCIR4 TSC3 Formal Run. However, it should be noted that some of the other systems do not use the information of questions.
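The abstract does not fix how the per-question QA scores are combined into one importance value; the sketch below simply sums them, with a toy overlap scorer standing in for the QA engine.

```python
# Sketch: a sentence's importance is aggregated from the scores a
# QA engine assigns it as a response to each question. Summation and
# the toy scorer are assumptions, not the paper's combination rule.
def sentence_importance(sentence, questions, qa_score):
    return sum(qa_score(q, sentence) for q in questions)

toy_qa_score = lambda q, s: len(set(q.lower().split()) & set(s.lower().split()))
print(sentence_importance(
    "The quake struck Kobe in January.",
    ["When did the quake strike?", "Where did the quake strike?"],
    toy_qa_score))
```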

29 citations


Journal ArticleDOI
TL;DR: The anaphora resolution process reverses the order of the steps in the classification-then-search model proposed by Ng and Cardie [2002b], inheriting all the advantages of that model.
Abstract: We propose a machine learning-based approach to noun-phrase anaphora resolution that combines the advantages of previous learning-based models while overcoming their drawbacks. Our anaphora resolution process reverses the order of the steps in the classification-then-search model proposed by Ng and Cardie [2002b], inheriting all the advantages of that model. We conducted experiments on resolving noun-phrase anaphora in Japanese. The results show that with the selection-then-classification-based modifications, our proposed model outperforms earlier learning-based approaches.

29 citations


Journal ArticleDOI
TL;DR: A new segmentation-free technique for automatic translation of Chinese OOV terms using the web and the effects of distance factor and window size when using a hidden Markov model to provide disambiguation are investigated.
Abstract: Cross-lingual information retrieval allows users to query mixed-language collections or to probe for documents written in an unfamiliar language. A major difficulty for cross-lingual information retrieval is the detection and translation of out-of-vocabulary (OOV) terms; for OOV terms in Chinese, another difficulty is segmentation. At NTCIR-4, we explored methods for translation and disambiguation for OOV terms when using a Chinese query on an English collection. We have developed a new segmentation-free technique for automatic translation of Chinese OOV terms using the web. We have also investigated the effects of distance factor and window size when using a hidden Markov model to provide disambiguation. Our experiments show these methods significantly improve effectiveness; in conjunction with our post-translation query expansion technique, effectiveness approaches that of monolingual retrieval.

25 citations


Journal ArticleDOI
TL;DR: A novel content-based join model for data streams (closed captions of videos or TV programs) and Web pages based on the concept of topic structures is proposed and a mechanism based on this model for retrieving complementary Web pages to augment the content of video or television programs is proposed.
Abstract: A great deal of technology has been developed to help people access the information they require. With advances in the availability of information, information-seeking activities are becoming more sophisticated. This means that information technology must move to the next stage, i.e., enable users to acquire information from multiple perspectives to satisfy diverse needs. For instance, with the spread of digital broadcasting and broadband Internet connection services, infrastructure for the integration of TV programs and the Internet has been developed that enables users to acquire information from different media at the same time to improve information quality and the level of detail. In this paper, we propose a novel content-based join model for data streams (closed captions of videos or TV programs) and Web pages based on the concept of topic structures. We then propose a mechanism based on this model for retrieving complementary Web pages to augment the content of video or television programs. One of the most notable features of this complementary retrieval mechanism is that the retrieved information is not just similar to the video or TV program, but also provides additional information. In addition, we introduce an application system called WebTelop, which augments the content of TV programs in real time by using complementary Web pages. We also describe some experimental results.

22 citations


Journal ArticleDOI
TL;DR: A good method for corpus correction is developed and applied to a verb modality corpus for machine translation, using the maximum-entropy and decision-list methods as machine-learning methods.
Abstract: In recent years, various types of tagged corpora have been constructed and much research using tagged corpora has been done. However, tagged corpora contain errors, which impedes the progress of research. Therefore, the correction of errors in corpora is an important research issue. In this study we investigate the correction of such errors, which we call corpus correction. Using machine-learning methods, we applied corpus correction to a verb modality corpus for machine translation. We used the maximum-entropy and decision-list methods as machine-learning methods. We compared several kinds of methods for corpus correction in our experiments, and determined which is most effective by using a statistical test. We obtained several noteworthy findings: (1) Precision was almost the same for both detection and correction, so it is more convenient to do both correction and detection, rather than detection only. (2) In general, the maximum-entropy method worked better than the decision-list method; but the two methods had almost the same precision for the top 50 pieces of extracted data when closed data was used. (3) In terms of precision, the use of closed data was better than the use of open data; however, in terms of the total number of extracted errors, the use of open data was better than the use of closed data. Based on our analysis of these results, we developed a good method for corpus correction. We confirmed the effectiveness of our method by carrying out experiments on machine translation. As corpus-based machine translation continues to be developed, the corpus correction we discuss in this article should prove to be increasingly significant.

15 citations


Journal ArticleDOI
TL;DR: The authors propose an efficient retrieval method for a sentence-wise EBMT that efficiently retrieves the most similar sentences using the measure of edit-distance without omissions and employs search-space division, word graphs, and an A* search algorithm.
Abstract: An Example-Based Machine Translation (EBMT) system, whose translation example unit is a sentence, can produce an accurate and natural translation if translation examples similar enough to an input sentence are retrieved. Such a system, however, suffers from the problem of narrow coverage. To reduce the problem, a large-scale parallel corpus is required and, therefore, an efficient method is needed to retrieve translation examples from a large-scale corpus. The authors propose an efficient retrieval method for a sentence-wise EBMT using edit-distance. The proposed retrieval method efficiently retrieves the most similar sentences using the measure of edit-distance without omissions. The proposed method employs search-space division, word graphs, and an A* search algorithm. The performance of the EBMT was evaluated through Japanese-to-English translation experiments using a bilingual corpus comprising hundreds of thousands of sentences from a travel conversation domain. The EBMT system achieved a high-quality translation ability by using a large corpus and also achieved efficient processing by using the proposed retrieval method.
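The efficiency machinery (search-space division, word graphs, A* search) is not shown here; the sketch below implements only the underlying measure, word-level edit distance, with a naive exhaustive scan, to make explicit what the proposed method computes without omissions.

```python
# Word-level edit distance (classic dynamic program) and a naive
# most-similar-example scan. The paper's A*-based machinery replaces
# this scan with an efficient, omission-free search.
def edit_distance(a, b):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[m][n]

def most_similar_example(input_words, examples):
    # examples: list of word sequences from the parallel corpus
    return min(examples, key=lambda e: edit_distance(input_words, e))
```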

Journal ArticleDOI
TL;DR: In this paper, the authors empirically examine what kinds of abilities are needed for question answering systems in information access dialogues, and propose a challenge for evaluating those abilities objectively and quantitatively.
Abstract: There are strong expectations for the use of question answering technologies in information access dialogues, such as for information gathering and browsing. In this paper, we empirically examine what kinds of abilities are needed for question answering systems in such situations, and propose a challenge for evaluating those abilities objectively and quantitatively. We also show that existing technologies have the potential to address this challenge. From the empirical study, we found that questions that have values and names as answers account for a majority in realistic information-gathering situations and that those sequences of questions contain a wide range of reference expressions and are sometimes complicated by the inclusion of subdialogues and focus shifts. The challenge proposed is not only novel as an evaluation of the handling of information access dialogues, but also includes several valuable ideas such as categorization and characterization of information access dialogues, and introduces three measures to evaluate various aspects in addressing list-type questions and reference test sets for evaluating context-processing ability in isolation.

Journal ArticleDOI
TL;DR: A new method that combines the probabilistic IR and the Boolean IR models is proposed and a new IR system---called appropriate Boolean query reformulation for information retrieval (ABRIR)--- is introduced based on these two methods and the Okapi system.
Abstract: Even though a Boolean query can express the information need precisely enough to select relevant documents, it is not easy to construct an appropriate Boolean query that covers all relevant documents. To utilize a Boolean query effectively, a mechanism to retrieve as many relevant documents as possible is therefore required. In accordance with this requirement, we propose a method for modifying a given Boolean query by using information from a relevant document set. The retrieval results, however, may deteriorate if some important query terms are removed by this reformulation. A further mechanism is thus required in order to use other query terms that are useful for finding more relevant documents, but are not strictly required in relevant documents. To meet this requirement, we propose a new method that combines the probabilistic IR and the Boolean IR models. We also introduce a new IR system---called appropriate Boolean query reformulation for information retrieval (ABRIR)---based on these two methods and the Okapi system. ABRIR uses both a word index and a phrase index formed from combinations of two adjacent noun words. The effectiveness of these two methods was confirmed using the NTCIR-4 Web test collection.
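One simple way to picture the combination of the two models (this is an illustration, not ABRIR's actual formula) is to let a Boolean query in conjunctive normal form act as a bonus on top of a probabilistic score:

```python
# Illustrative combination: the probabilistic model ranks, while
# satisfying the Boolean query (an AND of OR-groups) adds a bonus.
# The additive bonus is an assumption, not ABRIR's scoring.
def combined_score(doc_terms, or_groups, prob_score, bonus=1.0):
    satisfied = all(any(t in doc_terms for t in group) for group in or_groups)
    return prob_score + (bonus if satisfied else 0.0)

print(combined_score({"boolean", "query", "retrieval"},
                     [{"boolean"}, {"retrieval", "search"}],
                     prob_score=2.3))
```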

Journal ArticleDOI
TL;DR: A Persian synthesizer is developed that includes an innovative text analyzer module and a new model (SEHMM) is used as a postprocessor to compensate for errors generated by the neural network.
Abstract: The feasibility of converting text into speech using an inexpensive computer with minimal memory is of great interest. Speech synthesizers have been developed for many popular languages (e.g., English, Chinese, Spanish, French, etc.), but designing a speech synthesizer for a language is largely dependent on the language structure. In this article, we develop a Persian synthesizer that includes an innovative text analyzer module. In the synthesizer, the text is segmented into words and, after preprocessing, each word is processed by a neural network. In addition to preprocessing, a new model (SEHMM) is used as a postprocessor to compensate for errors generated by the neural network. The performance of the proposed model is verified and the intelligibility of the synthetic speech is assessed via listening tests.

Journal ArticleDOI
TL;DR: The authors' experiments show that named entity/numerical expression recognition and word sense-based answer extraction mainly contributed to the performance and a new proximity-based document retrieval module that performs better than BM25 is developed.
Abstract: Twenty-five Japanese Question Answering systems participated in NTCIR QAC2 subtask 1. Of these, our system SAIQA-QAC2 performed the best: MRR = 0.607. SAIQA-QAC2 is an improvement on our previous system SAIQA-Ii that achieved MRR = 0.46 for QAC1. We mainly improved the answer-type determination module and the retrieval module. In general, a fine-grained answer taxonomy improves QA performance but it is difficult to build an accurate answer extraction module for the fine-grained taxonomy because Machine Learning methods require a huge training corpus and hand-crafted rules are hard to maintain. Therefore, we built a fine-grained system by using a coarse-grained named entity recognizer and a Japanese lexicon “Nihongo Goi-taikei.” Our experiments show that named entity/numerical expression recognition and word sense-based answer extraction mainly contributed to the performance. In addition, we developed a new proximity-based document retrieval module that performs better than BM25. We also compared its performance with MultiText, a conventional proximity-based retrieval method developed for QA.

Journal ArticleDOI
TL;DR: This paper proposes a method to improve chronological ordering by resolving the precedent information of arranged sentences, combines the refinement algorithm with topical segmentation and chronological ordering, and demonstrates that the proposed method significantly improves chronological sentence ordering.
Abstract: It is necessary to determine a proper arrangement of extracted sentences to generate a well-organized summary from multiple documents. This paper describes our Multi-Document Summarization (MDS) system for TSC-3. It specifically addresses an approach to coherent sentence ordering for MDS. An impediment to the use of chronological ordering, which is widely used by conventional summarization systems, is that it arranges sentences without considering the presupposed information of each sentence. We propose a method to improve chronological ordering by resolving the precedent information of the sentences being arranged. Combining the refinement algorithm with topical segmentation and chronological ordering, we present our experiments and metrics for testing the effectiveness of MDS tasks. Results demonstrate that the proposed method significantly improves chronological sentence ordering. At the end of the paper, we also report an outline and evaluation of the important sentence extraction and redundant clause elimination integrated in our MDS system.
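As a rough sketch of the refinement idea (the paper's algorithm and its presupposition analysis are not reproduced), one can start from chronological order and repair any sentence whose presupposed antecedent ends up after it:

```python
# Sketch: chronological ordering with a precedence repair pass.
# `presupposes(s, t)` is a placeholder for the paper's analysis and is
# assumed acyclic; sentences carry a date and an in-article position.
def order_sentences(sentences, presupposes):
    ordered = sorted(sentences, key=lambda s: (s["date"], s["position"]))
    i = 0
    while i < len(ordered):
        later = [j for j in range(i + 1, len(ordered))
                 if presupposes(ordered[i], ordered[j])]
        if later:
            s = ordered.pop(i)
            ordered.insert(later[-1], s)  # place s after its last antecedent
        else:
            i += 1
    return ordered
```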

Journal ArticleDOI
TL;DR: A method to introduce A* search control in a sentential matching mechanism for Japanese question-answering systems in order to reduce the turnaround time while maintaining the accuracy of the answers is proposed.
Abstract: We have proposed a method to introduce A* search control in a sentential matching mechanism for Japanese question-answering systems in order to reduce the turnaround time while maintaining the accuracy of the answers. Using this method, preprocessing need not be performed on a document database and we may use any information retrieval systems by writing a simple wrapper program. However, the disadvantage is that the accuracy is not sufficiently high and the mean reciprocal rank (MRR) is approximately 0.3 in NTCIR3 QAC1, an evaluation workshop for question-answering systems. In order to improve the accuracy, we propose several measures of the degree of sentence matching and a variant of a voting method. Both of them can be integrated with our system of controlled search. Using these techniques, the system achieves a higher MRR of 0.5 in the evaluation workshop NTCIR4 QAC2.

Journal ArticleDOI
Sumio Fujita
TL;DR: It is found that TF*IDF performed similarly to the language modeling runs against the patent collection by controlling the document length normalization, whereas the language modeling approach does not perform as well as TF*IDF, despite calibration against the CLIR J-J collection.
Abstract: NTCIR-4 experiments of the CLIR J-J (Japanese monolingual newspaper retrieval) and patent tasks are described, focusing on comparative studies of two test collections and two retrieval approaches in view of document length hypotheses. TF*IDF outperformed the language modeling approach in the CLIR J-J task whereas the language modeling approach performed better in the patent task. Two different document length hypotheses behind two tasks/collections are assumed by analyzing document length distributions of relevant/retrieved documents in the NTCIR-3 and -4 collections. Given these hypotheses, TF*IDF is easily adapted to patent retrieval tasks. Document length prior probabilities are applied to the language modeling approach. For the patent task, task-specific techniques, such as IPC priors and different indexing strategies, are evaluated and reported. To facilitate retrieval from large patent collections, a simple distributed search strategy is applied and found to be efficient, despite a slight deterioration of effectiveness. We found that TF*IDF performed similarly to the language modeling runs against the patent collection by controlling the document length normalization, whereas the language modeling approach does not perform as well as TF*IDF, despite calibration against the CLIR J-J collection. The different characteristics of the document lengths of the two test collections are illustrated through comparative studies.
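One plausible reading of "document length prior probabilities are applied to the language modeling approach" is sketched below: Dirichlet-smoothed query likelihood plus a log length prior. The log-normal prior and every parameter value are assumptions for illustration.

```python
# Query-likelihood scoring with an added document-length prior.
# collection_p maps each term to its collection-level probability and
# is assumed to cover all query terms; priors/parameters are illustrative.
import math

def lm_score(query_terms, doc_tf, doc_len, collection_p,
             mu=2000.0, prior_mean=6.5, prior_sigma=1.0):
    loglik = sum(math.log((doc_tf.get(t, 0) + mu * collection_p[t])
                          / (doc_len + mu))            # Dirichlet smoothing
                 for t in query_terms)
    # Log-normal length prior, log P(D), up to an additive constant.
    log_prior = -((math.log(doc_len) - prior_mean) ** 2) / (2 * prior_sigma ** 2)
    return loglik + log_prior
```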

Journal ArticleDOI
TL;DR: The authors report on Korean monolingual, Chinese-Korean English-as-pivot bilingual, and Chinese-English bilingual CLIR experiments using MT software augmented with Web-based entity-oriented translation as resources in the NTCIR-4 environment.
Abstract: We report on Korean monolingual, Chinese-Korean English-as-pivot bilingual, and Chinese-English bilingual CLIR experiments using MT software augmented with Web-based entity-oriented translation as resources in the NTCIR-4 environment. Simple stemming is helpful in improving bigram indexing for Korean retrieval. For word indexing, keeping nouns only is preferable. Web-based translation reduces untranslated terms left over after MT and substantially improves CLIR results. Translation concatenation is found to consistently improve CLIR effectiveness, while combining retrieval lists from bigram and word indexing is also helpful. A method to disambiguate multiple MT outputs using a log likelihood ratio threshold was tested. Depending on the nature of the title or description queries, bigram-only or combined retrieval, and relaxed or rigid evaluations, direct bilingual CLIR returned an average precision of 71--79% (English-Korean) and 76--84% (Chinese-English) of the corresponding Korean-Korean and English-English monolingual results. Using English as a pivot in Chinese-Korean CLIR provides about 55--65% of the effectiveness that Korean alone does. Entity/terminology translation at the pivot language stage accounts for a large portion of this deficiency. A topic with a comparatively worse Chinese-English bilingual result does not necessarily continue to under-perform (after further transitive Korean translation) at the Korean retrieval level.

Journal ArticleDOI
TL;DR: The main contribution of this article is the investigation of different strategies, their interactions in both monolingual and bilingual retrieval tasks, and their respective contributions to operational retrieval systems in the context of NTCIR-4.
Abstract: At the NTCIR-4 workshop, Justsystem Corporation (JSC) and Clairvoyance Corporation (CC) collaborated in the cross-language retrieval task (CLIR). Our goal was to evaluate the performance and robustness of our recently developed commercial-grade CLIR systems for English and Asian languages. The main contribution of this article is the investigation of different strategies, their interactions in both monolingual and bilingual retrieval tasks, and their respective contributions to operational retrieval systems in the context of NTCIR-4. We report results of Japanese and English monolingual retrieval and results of Japanese-to-English bilingual retrieval. In monolingual retrieval analysis, we examine two special properties of the NTCIR experimental design (two levels of relevance and identical queries in multiple languages) and explore how they interact with strategies of our retrieval system, including pseudo-relevance feedback, multi-word term down-weighting, and term weight merging strategies. Our analysis shows that the choice of language (English or Japanese) does not have a significant impact on retrieval performance. Query expansion is slightly more effective with relaxed judgments than with rigid judgments. For better retrieval performance, weights of multi-word terms should be lowered. In the bilingual retrieval analysis, we aim to identify robust strategies that are effective when used alone and when used in combination with other strategies. We examine cross-lingual specific strategies such as translation disambiguation and translation structuring, as well as general strategies such as pseudo-relevance feedback and multi-word term down-weighting. For shorter title topics, pseudo-relevance feedback is a major performance enhancer, but translation structuring affects retrieval performance negatively when used alone or in combination with other strategies. All experimented strategies improve retrieval performance for the longer description topics, with pseudo-relevance feedback and translation structuring as the major contributors.

Journal ArticleDOI
TL;DR: This article uses automatically extracted short terms from document sets to build indexes and to perform initial retrieval with the short terms in both the query and documents, and uses long terms extracted from the document collection to reorder the top N retrieved documents to improve precision.
Abstract: In this article we describe our approach to Chinese information retrieval, where a query is a short natural language description. First, we use automatically extracted short terms from document sets to build indexes and use the short terms in both the query and documents to do initial retrieval. Next, we use long terms extracted from the document collection to reorder the top N retrieved documents to improve precision. Finally, we acquire the relevant terms of the short terms from the Internet and the top retrieved documents and use them to do query expansion. Experiments on the NTCIR-4 CLIR Chinese SLIR sub-collection show that document reranking can both improve the retrieval performance on its own and make a significant contribution to query expansion. The experiments also show that the extended query expansion proposed in this article is more effective than the standard Rocchio query expansion.
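For reference, the standard Rocchio expansion that serves as the article's baseline can be written in a few lines; the extended expansion, which additionally draws terms from the Web and the reranked documents, is not reproduced here.

```python
# Standard Rocchio: move the query vector toward relevant documents and
# away from non-relevant ones. Both document lists are assumed non-empty;
# queries and documents are term-weight Counters.
from collections import Counter

def rocchio(query, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    new_q = Counter({t: alpha * w for t, w in query.items()})
    for d in rel_docs:
        for t, w in d.items():
            new_q[t] += beta * w / len(rel_docs)
    for d in nonrel_docs:
        for t, w in d.items():
            new_q[t] -= gamma * w / len(nonrel_docs)
    return Counter({t: w for t, w in new_q.items() if w > 0})
```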

Journal ArticleDOI
TL;DR: The ACM TALIP Special Issues on NTCIR-4 contain fourteen papers selected from the papers submitted by researchers involved in the Fourth NTCIR Workshop (NTCIR-4), designed to enhance research in information access technologies such as information retrieval (IR), question answering, summarization, text mining, and so on.
Abstract: The ACM TALIP Special Issues on NTCIR-4 contain fourteen papers selected from the papers submitted by researchers involved in the Fourth NTCIR Workshop (NTCIR-4). The NTCIR Workshops are a series of evaluation workshops designed to enhance research in information access technologies such as information retrieval (IR), question answering, summarization, text mining, and so on, by providing large-scale evaluation infrastructures and a forum for researchers interested in cross-system comparisons and in exchanging research ideas in an informal atmosphere. Because fundamental text processing, such as indexing, includes language-dependent procedures, the NTCIR project, which began in late 1997, has placed emphasis on East Asian languages such as Japanese, Chinese, and Korean (and on English documents published in Asia), and its series of workshops has attracted international participation. An NTCIR workshop is held about once every one and a half years. Because we respect all the interactions among the participants, we consider the entire process from initial document release to the final meeting as the workshop. Each workshop selects several research areas, called “tasks,” or “challenges” for the more challenging tasks. A task may consist of more than one subtask. From the beginning of the project, the tasks were selected with a focus along two directions: (a) laboratory-type testing of IR systems, and (b) evaluating challenging technologies. For testing IR systems, we placed emphasis on East Asian languages and on testing various document genres. For the challenging issues, we looked at the technologies that utilize “information” in documents; the intersection of IR and natural language processing; and at the methodologies and metrics that provide more realistic and reliable evaluations, with special attention to users’ information-seeking tasks. NTCIR-4 focused on five tasks: Cross-Lingual Information Retrieval (CLIR); Patent Retrieval (PATENT); Question Answering (QAC); Text Summarization (TSC); and Web Retrieval (WEB).

Journal ArticleDOI
TL;DR: A new method for estimating the degree of satisfaction of the selectional restriction for a word combination from a tagged corpus, based on the multiple regression model with independent variables that correspond to modifiers is proposed.
Abstract: A selectional restriction specifies what combinations of words are semantically valid in a particular syntactic construction. This is one of the basic and important pieces of knowledge in natural language processing and has been used for syntactic and word sense disambiguation. In the case of acquiring the selectional restriction for many combinations of words from a corpus, it is necessary to estimate whether or not a word combination that is not observed in the corpus satisfies the selectional restriction. This paper proposes a new method for estimating the degree of satisfaction of the selectional restriction for a word combination from a tagged corpus, based on the multiple regression model. The independent variables of this model correspond to modifiers. Unlike in a conventional multiple regression analysis, the independent variables are also parameters to be learned. We experiment on estimating the degree of satisfaction of the selectional restriction for Japanese word combinations 〈noun, postpositional-particle, verb〉. The experimental results indicate that our method can estimate the degree of satisfaction even for word combinations rarely observed in the corpus, and that the accuracy of syntactic disambiguation using the co-occurrences estimated by our method is higher than that obtained using co-occurrence probabilities smoothed by previous methods.
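As background, an ordinary multiple regression fit on toy data is shown below; the paper's key twist, treating the independent variables themselves as parameters to be learned, is beyond this sketch.

```python
# Ordinary least-squares fit of degrees of satisfaction from modifier
# variables (toy data; the paper also learns the variables themselves).
import numpy as np

X = np.array([[1.0, 0.2, 0.7],    # rows: word combinations
              [1.0, 0.9, 0.1],    # cols: intercept + modifier variables
              [1.0, 0.5, 0.5],
              [1.0, 0.1, 0.9]])
y = np.array([0.6, 0.8, 0.7, 0.3])   # observed degrees of satisfaction
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(X @ coef)                      # estimated degrees for the same rows
```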