
Showing papers by "Hideki Isozaki published in 2003"


Proceedings ArticleDOI
11 Jul 2003
TL;DR: This paper proposes a method for zero pronoun resolution that combines ranking rules with machine learning: the ranking rules are simple and effective, while machine learning can take more factors into account.
Abstract: Anaphora resolution is one of the most important research topics in Natural Language Processing. In English, overt pronouns such as she and definite noun phrases such as the company are anaphors that refer to preceding entities (antecedents). In Japanese, anaphors are often omitted, and these omissions are called zero pronouns. There are two major approaches to zero pronoun resolution: the heuristic approach and the machine learning approach. Since we have to take various factors into consideration, it is difficult to find a good combination of heuristic rules. Therefore, the machine learning approach is attractive, but it requires a large amount of training data. In this paper, we propose a method that combines ranking rules and machine learning. The ranking rules are simple and effective, while machine learning can take more factors into account. From the results of our experiments, this combination gives better performance than either of the two previous approaches.
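The combination described in this abstract might be sketched as follows. This is a minimal illustration, not the paper's actual system: the heuristic rules, the candidate features, and the learned scoring function are all hypothetical stand-ins.

```python
# Sketch: combining heuristic ranking rules with a machine-learned score
# for selecting the antecedent of a zero pronoun. All rules, features,
# and weights here are illustrative assumptions, not the paper's.

def rule_rank(candidate):
    """Heuristic salience rank: lower is better (preferred)."""
    rank = 0
    if not candidate["is_subject"]:   # prefer subjects (a common heuristic)
        rank += 2
    rank += candidate["distance"]     # prefer closer antecedents
    return rank

def ml_score(candidate, weights):
    """Stand-in for a trained classifier's confidence (e.g. an SVM margin)."""
    return sum(weights[f] for f, v in candidate.items()
               if f in weights and v)

def resolve(candidates, weights):
    # The rules produce a coarse ranking; the ML score then decides among
    # the top-ranked candidates, letting it weigh additional factors.
    best_rank = min(rule_rank(c) for c in candidates)
    tied = [c for c in candidates if rule_rank(c) == best_rank]
    return max(tied, key=lambda c: ml_score(c, weights))
```

The design point is that the rules prune the candidate set cheaply, so the learned model only has to discriminate among plausible antecedents rather than all noun phrases.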

43 citations


Proceedings ArticleDOI
06 Apr 2003
TL;DR: An interactive approach for spoken interactive ODQA systems that derives disambiguating queries (DQs) to draw out additional information, which helps to pinpoint the exact answer and to compensate for information lost to recognition errors.
Abstract: Recently, open-domain question answering (ODQA) systems that extract an exact answer from large text corpora based on text input have been intensively investigated. However, the information in the first question input by a user is usually not enough to yield the desired answer, so interactions that collect additional information to accomplish QA are needed. This paper proposes an interactive approach for spoken interactive ODQA systems. When the reliabilities of the answer hypotheses obtained by an ODQA system are low, the system automatically derives disambiguating queries (DQs) that draw out additional information. The additional information elicited by the DQs should help to distinguish the exact answer effectively and to compensate for information lost through recognition errors. In our spoken interactive ODQA system, SPIQA, spoken questions are recognized by an ASR system, and DQs are automatically generated to disambiguate the transcribed questions. We confirmed the appropriateness of the derived DQs by comparing them with manually prepared ones.
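The control flow the abstract describes — answer directly when confident, otherwise ask a disambiguating query — might look like the following sketch. The threshold value and the hypothesis format are assumptions, not SPIQA's actual interface.

```python
# Sketch of the interaction policy: answer when the ODQA system is
# confident, otherwise ask a disambiguating query (DQ) to elicit more
# information. The cutoff and data shapes are illustrative only.

CONFIDENCE_THRESHOLD = 0.6  # assumed reliability cutoff

def next_action(hypotheses):
    """hypotheses: list of (answer, reliability) pairs from the ODQA system."""
    if not hypotheses:
        return ("ask_dq", None)
    answer, reliability = max(hypotheses, key=lambda h: h[1])
    if reliability >= CONFIDENCE_THRESHOLD:
        return ("answer", answer)
    # Low reliability: request additional information from the user.
    return ("ask_dq", answer)
```

In a spoken system the low-reliability branch matters twice over: it covers both genuinely ambiguous questions and questions degraded by ASR errors.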

25 citations


Proceedings ArticleDOI
07 Jul 2003
TL;DR: A spoken interactive ODQA system that derives disambiguating queries (DQs) to draw out additional information, reconstructs the user's initial question by combining it with that additional information, and uses the reconstructed question for answer extraction.
Abstract: We have been investigating an interactive approach to open-domain QA (ODQA) and have constructed a spoken interactive ODQA system, SPIQA. The system derives disambiguating queries (DQs) that draw out additional information. To test the usefulness of the additional information requested by the DQs, the system reconstructs the user's initial question by combining the additional information with the question, and the reconstructed question is then used for answer extraction. Experimental results revealed the potential of the generated DQs.
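The reconstruction step could be as simple as merging the DQ response into the original question; the word-level merge below is only one plausible reading of "combining", not the system's documented method.

```python
# Sketch: reconstructing the user's question by merging the initial
# transcription with the additional information elicited by a DQ.
# Deduplicating shared words is an illustrative assumption.

def reconstruct(question_words, additional_words):
    """Append new content words from the DQ response to the question."""
    merged = list(question_words)
    for w in additional_words:
        if w not in merged:      # avoid duplicating words already present
            merged.append(w)
    return merged
```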

15 citations


Patent
05 Nov 2003
TL;DR: In this paper, given a document set belonging to a certain domain of a document DB, a word-string extracting device extracts word strings, applies a statistical (chi-square) test comparing each word string's occurrence between the documents in that domain and the others, confirms the word strings characteristic of the domain against a threshold, and scores sentences by applying predetermined weights to the confirmed word strings, so that high-scoring sentences can be extracted from the documents belonging to the domain.
Abstract: PROBLEM TO BE SOLVED: To compute a score that takes into account not only individual words but also combinations of words. SOLUTION: When a document set belonging to a certain domain of a document DB 10 is given, a word-string extracting device extracts word strings and performs a statistical (chi-square) test on each word string, comparing its occurrence between the document group included in the given domain and the other documents. Word strings whose statistic exceeds a threshold are confirmed as characteristic of the domain. Scores are then calculated by applying predetermined weights to the confirmed word strings, and sentences with high scores are extracted from the documents belonging to the domain. COPYRIGHT: (C)2005,JPO&NCIPI
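A 2x2 chi-square test between in-domain and out-of-domain document counts is a standard way to select domain-characteristic terms, and is the most plausible reading of the machine-translated patent text; whether the patent's statistic is exactly this is an assumption. A minimal sketch:

```python
# Sketch: a 2x2 chi-square statistic for deciding whether a word string
# is characteristic of a domain, comparing its document frequency inside
# vs. outside the domain. The significance threshold is illustrative.

def chi_square(in_with, in_without, out_with, out_without):
    """2x2 chi-square over (in/out of domain) x (contains/lacks string)."""
    n = in_with + in_without + out_with + out_without
    row1, row2 = in_with + in_without, out_with + out_without
    col1, col2 = in_with + out_with, in_without + out_without
    stat = 0.0
    for obs, r, c in [(in_with, row1, col1), (in_without, row1, col2),
                      (out_with, row2, col1), (out_without, row2, col2)]:
        expected = r * c / n
        stat += (obs - expected) ** 2 / expected
    return stat

THRESHOLD = 3.84  # chi-square critical value at p = 0.05, 1 degree of freedom

def is_characteristic(in_with, in_without, out_with, out_without):
    return chi_square(in_with, in_without, out_with, out_without) > THRESHOLD
```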

5 citations


Journal ArticleDOI
10 Jan 2003
TL;DR: This paper proposes a method for extracting important sentences from a set of topic-related documents using a Support Vector Machine, and shows that it outperforms the Lead and TF·IDF baselines.
Abstract: In recent years, with the spread of the Internet and high-capacity magnetic storage devices, vast quantities of electronic documents have proliferated. Against this background, expectations for document summarization technology have been rising. In particular, if it became possible to summarize a whole set of documents related to a single topic, the burden on human readers could be greatly reduced. In this paper, we therefore propose a method that targets document sets directly related to a specific topic and extracts important sentences using a machine learning technique. As the extraction method, we use a Support Vector Machine, a machine learning technique that has recently attracted attention in natural language processing research. We prepared document sets for 12 topics selected from one year (1999) of the Mainichi Shimbun newspaper, and created three gold-standard data sets by having different annotators manually extract important sentences at summarization rates of 10%, 30%, and 50% of the total number of sentences for each topic. Evaluation experiments on these data sets showed that the sentence extraction accuracy of the proposed method is higher than that of the Lead method and the TF·IDF method. We also found that redundancy reduction, conventionally considered effective for multi-document summarization, is not necessarily effective when sentences are the extraction unit.
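The extraction step can be sketched as scoring each sentence with a linear decision function (w·x + b), as a trained linear SVM would, and keeping the top fraction dictated by the summarization rate. The feature set and weights below are placeholders, not the paper's (which uses a much richer feature set).

```python
# Sketch of SVM-based important-sentence extraction: map each sentence
# to a feature vector, score it with a linear decision function, and
# keep the top `rate` fraction. Features and weights are illustrative.

def features(sentence, position, n_sentences):
    return {
        "position": 1.0 - position / max(n_sentences - 1, 1),  # earlier = higher
        "length": min(len(sentence.split()) / 20.0, 1.0),
        "has_topic_word": 1.0 if "summarization" in sentence.lower() else 0.0,
    }

def svm_score(x, w, b=0.0):
    """Linear decision function; w would come from SVM training."""
    return sum(w[k] * v for k, v in x.items()) + b

def extract(sentences, w, rate=0.5):
    """Keep the top `rate` fraction of sentences, in document order."""
    scored = [(svm_score(features(s, i, len(sentences)), w), i, s)
              for i, s in enumerate(sentences)]
    k = max(1, round(len(sentences) * rate))
    keep = sorted(sorted(scored, reverse=True)[:k], key=lambda t: t[1])
    return [s for _, _, s in keep]
```

The `rate` parameter mirrors the paper's 10%/30%/50% summarization rates: the same trained scorer serves all three, only the cutoff changes.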

2 citations


Proceedings Article
01 Jan 2003
TL;DR: This work developed two scoring methods: a heuristic one that simply counts verbs and their derived words, which are important for specifying the function of a query gene or its product, and one that uses a machine learning technique to score documents.
Abstract: Our system consists of two steps. The first step retrieves documents using a keyword search, and the second step scores each document retrieved in the previous step and creates an output file for the TREC submission. The database provided by TREC consists of more than 500,000 PubMed abstracts, but fewer than 50 documents are relevant for most queries, so applying scoring methods to all 500,000 abstracts would create a lot of noise. In the first step, we therefore refined the document set with a simple keyword search. For the second step, we developed two methods. The first method (Method 1) uses a heuristic scoring system that simply counts the number of verbs and their derived words, which are important for specifying the function of a query gene or its product. The second method (Method 2) uses a machine learning technique to score documents.
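Method 1's heuristic might be sketched as below. The verb-stem list and the prefix-matching shortcut (to catch derived words such as "activation" or "inhibitor") are illustrative assumptions; the actual vocabulary used for the TREC run is not given here.

```python
# Sketch of Method 1: score a retrieved abstract by counting verbs and
# their derived words that describe gene/protein function. The stem
# list and prefix matching are placeholders for the real vocabulary.

FUNCTION_VERB_STEMS = ["activat", "inhibit", "regulat", "encod", "bind"]

def heuristic_score(abstract):
    """Count tokens derived from function-describing verb stems,
    e.g. 'activates', 'activation', 'inhibitor', 'binding'."""
    tokens = abstract.lower().split()
    return sum(1 for t in tokens
               if any(t.startswith(stem) for stem in FUNCTION_VERB_STEMS))
```

Because the first-step keyword search has already shrunk the candidate set, even this very cheap count can rank the survivors without being swamped by irrelevant abstracts.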