
Showing papers by "Hideki Isozaki published in 2008"


Proceedings Article
01 Dec 2008
TL;DR: Provides evidence that using more unlabeled data in semi-supervised learning can improve the performance of Natural Language Processing tasks such as part-of-speech tagging, syntactic chunking, and named entity recognition.
Abstract: This paper provides evidence that the use of more unlabeled data in semi-supervised learning can improve the performance of Natural Language Processing (NLP) tasks, such as part-of-speech tagging, syntactic chunking, and named entity recognition. We first propose a simple yet powerful semi-supervised discriminative model appropriate for handling large scale unlabeled data. Then, we describe experiments performed on widely used test collections, namely, PTB III data, CoNLL’00 and ’03 shared task data for the above three NLP tasks, respectively. We incorporate up to 1G-words (one billion tokens) of unlabeled data, which is the largest amount of unlabeled data ever used for these tasks, to investigate the performance improvement. In addition, our results are superior to the best reported results for all of the above test collections.
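The paper's actual model combines a discriminative learner with generative components estimated from up to a billion tokens of unlabeled text. As a much simpler illustration of how unlabeled corpora can feed a supervised tagger, the sketch below derives log-frequency-bin features from raw tokens; the binning scheme and function name are illustrative, not from the paper.

```python
import math
from collections import Counter

def frequency_bin_features(tokens, num_bins=5):
    """Map each word to a log-frequency bin computed from unlabeled text.
    Such corpus-derived bins can be added as extra features when training
    a supervised tagger or chunker on a small labeled set."""
    counts = Counter(tokens)
    max_log = max(math.log(c + 1) for c in counts.values())
    return {w: min(num_bins - 1, int(num_bins * math.log(c + 1) / max_log))
            for w, c in counts.items()}
```

Frequent words land in high bins and rare words in low bins, so the feature generalizes across words never seen in the labeled data.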

159 citations


Proceedings Article
01 Jan 2008
TL;DR: NAZEQA, a Japanese why-QA system based on the proposed corpus-based approach, clearly outperforms a baseline that uses hand-crafted patterns with a Mean Reciprocal Rank (top-5) of 0.305, making it presumably the best-performing fully implemented why-QA system.
Abstract: This paper proposes a corpus-based approach for answering why-questions. Conventional systems use hand-crafted patterns to extract and evaluate answer candidates. However, such hand-crafted patterns are likely to have low coverage of causal expressions, and it is also difficult to assign suitable weights to the patterns by hand. In our approach, causal expressions are automatically collected from corpora tagged with semantic relations. From the collected expressions, features are created to train an answer candidate ranker that maximizes the QA performance with regard to the corpus of why-questions and answers. NAZEQA, a Japanese why-QA system based on our approach, clearly outperforms a baseline that uses hand-crafted patterns with a Mean Reciprocal Rank (top-5) of 0.305, making it presumably the best-performing fully implemented why-QA system.
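The abstract describes training an answer-candidate ranker over features derived from causal expressions. A minimal sketch of such a ranker, assuming a pairwise perceptron update (the paper does not specify this particular learner; feature names here are invented for illustration):

```python
def score(weights, features):
    """Linear score of a candidate given its feature dict."""
    return sum(weights.get(f, 0.0) * v for f, v in features.items())

def train_ranker(question_data, epochs=10, lr=0.1):
    """Pairwise perceptron: for each question, push the correct answer
    candidate's score above each incorrect candidate's score.
    question_data: list of (positive_features, [negative_features, ...])."""
    weights = {}
    for _ in range(epochs):
        for pos, negs in question_data:
            for neg in negs:
                if score(weights, pos) <= score(weights, neg):
                    for f, v in pos.items():
                        weights[f] = weights.get(f, 0.0) + lr * v
                    for f, v in neg.items():
                        weights[f] = weights.get(f, 0.0) - lr * v
    return weights
```

After training, candidates carrying causal-expression features would score higher than candidates matched only by surface overlap.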

73 citations


Proceedings Article
01 Jan 2008
TL;DR: Presents a classifier design method for multi-label categorization based on model combination and F1-score maximization; experiments confirmed that the method is especially useful for datasets with many combinations of category labels.
Abstract: Text categorization is a fundamental task in natural language processing, and is generally defined as a multi-label categorization problem, where each text document is assigned to one or more categories. We focus on providing good statistical classifiers with a generalization ability for multi-label categorization and present a classifier design method based on model combination and F1-score maximization. In our formulation, we first design multiple models for binary classification per category. Then, we combine these models to maximize the F1-score of a training dataset. Our experimental results confirmed that our proposed method was useful especially for datasets where there were many combinations of category labels.
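The abstract describes combining per-category binary models so as to maximize F1 on the training set. A minimal sketch of the F1-maximization step, assuming per-category scores are already available and simplifying the combination to a single shared decision threshold tuned by grid search (the paper's actual model combination is richer):

```python
def micro_f1(pred, gold):
    """Micro-averaged F1 over documents; pred/gold are lists of label sets."""
    tp = sum(len(p & g) for p, g in zip(pred, gold))
    fp = sum(len(p - g) for p, g in zip(pred, gold))
    fn = sum(len(g - p) for p, g in zip(pred, gold))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def tune_threshold(scores, gold, grid):
    """Pick the decision threshold that maximizes training-set F1.
    scores: per-document dicts mapping category -> binary-model score."""
    best_t, best_f = grid[0], -1.0
    for t in grid:
        pred = [{c for c, s in doc.items() if s >= t} for doc in scores]
        f = micro_f1(pred, gold)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f
```

Optimizing the threshold for F1 rather than accuracy matters in multi-label settings, where most category decisions are negative.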

44 citations


Proceedings ArticleDOI
01 Dec 2008
TL;DR: This analysis shows that empathic utterances by users are strong indicators of increasing closeness and user satisfaction, and self-disclosure by users increases when users have positive preferences on topics being discussed.
Abstract: To build trust or cultivate long-term relationships with users, conversational systems need to perform social dialogue. To date, research has primarily focused on the overall effect of social dialogue in human-computer interaction, leading to little work on the effects of individual linguistic phenomena within social dialogue. This paper investigates such individual effects through dialogue experiments. Focusing on self-disclosure and empathic utterances (agreement and disagreement), we empirically calculate their contributions to the dialogue quality. Our analysis shows that (1) empathic utterances by users are strong indicators of increasing closeness and user satisfaction, (2) the system's empathic utterances are effective for inducing empathy from users, and (3) self-disclosure by users increases when users have positive preferences on topics being discussed.

35 citations


Journal ArticleDOI
TL;DR: NAZEQA, a Japanese why-QA system based on the approach, clearly outperforms baselines with a Mean Reciprocal Rank (top-5) of 0.223 when sentences are used as answers and an MRR (top-5) of 0.326 when paragraphs are used as answers, making it presumably the best-performing fully implemented why-QA system.
Abstract: This article describes our approach for answering why-questions that we initially introduced at NTCIR-6 QAC-4. The approach automatically acquires causal expression patterns from relation-annotated corpora by abstracting text spans annotated with a causal relation and by mining syntactic patterns that are useful for distinguishing sentences annotated with a causal relation from those annotated with other relations. We use these automatically acquired causal expression patterns to create features to represent answer candidates, and use these features together with other possible features related to causality to train an answer candidate ranker that maximizes the QA performance with regard to the corpus of why-questions and answers. NAZEQA, a Japanese why-QA system based on our approach, clearly outperforms baselines with a Mean Reciprocal Rank (top-5) of 0.223 when sentences are used as answers and with an MRR (top-5) of 0.326 when paragraphs are used as answers, making it presumably the best-performing fully implemented why-QA system. Experimental results also verified the usefulness of the automatically acquired causal expression patterns.
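The evaluation metric used above, Mean Reciprocal Rank restricted to the top five answers, is straightforward to compute. The sketch below shows the standard definition (function name is ours):

```python
def mrr_top5(first_correct_ranks):
    """Mean Reciprocal Rank (top-5) over a set of questions.
    Each entry is the 1-based rank of the first correct answer in the
    system's ranked list, or None if no correct answer appears at all;
    answers ranked below 5 contribute zero."""
    total = sum(1.0 / r for r in first_correct_ranks
                if r is not None and r <= 5)
    return total / len(first_correct_ranks)
```

For example, four questions whose first correct answers appear at ranks 1, 2, (not found), and 4 yield an MRR (top-5) of (1 + 1/2 + 0 + 1/4) / 4.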

26 citations


01 Jan 2008
TL;DR: Designs a multi-label classification system for the NTCIR-7 Patent Mining Task, based on a machine learning approach, that employs one logistic regression model per International Patent Classification (IPC) code to determine the IPC code assignment of research papers.
Abstract: We design a multi-label classification system based on a machine learning approach for the NTCIR-7 Patent Mining Task. In our system, we employ a logistic regression model for each International Patent Classification (IPC) code that determines the IPC code assignment of research papers. The logistic regression models are trained by using patent documents provided by the task organizers. To mitigate the overfitting of the logistic regression models to the patent documents, we design the feature vectors of the patent documents with feature weighting and component selection methods utilizing a research paper set. Using a test collection for the Japanese subtask of the NTCIR-7 Patent Mining Task, we confirmed the effectiveness of our multi-label classification system.
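The abstract's component selection and feature weighting use a research-paper set to keep patent-trained models from overfitting to patent-specific wording. A minimal sketch of that idea, assuming bag-of-words features (the selection rule and weighting here are illustrative stand-ins for the paper's methods):

```python
from collections import Counter

def build_features(patent_tokens, paper_tokens):
    """Component selection: drop terms never seen in the research-paper set.
    Feature weighting: scale each surviving term's patent count by its
    research-paper frequency, favoring vocabulary shared across domains."""
    paper_counts = Counter(paper_tokens)
    return {t: c * paper_counts[t]
            for t, c in Counter(patent_tokens).items()
            if t in paper_counts}
```

Terms that occur only in patents (e.g. claim boilerplate) are filtered out, so the per-IPC-code classifiers rely on vocabulary that transfers to research papers.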

11 citations


01 Jan 2008
TL;DR: Describes NTT SMT System 2008, presented at the patent translation task (PAT-MT) in NTCIR-7, and demonstrates a strong hierarchical phrase-based baseline for the PAT-MT English/Japanese translations.
Abstract: This paper describes NTT SMT System 2008 presented at the patent translation task (PAT-MT) in NTCIR-7. For PAT-MT, we submitted our strong baseline system faithfully following a hierarchical phrase-based statistical machine translation approach [2]. Hierarchical phrase-based SMT is based on a synchronous CFG in which paired source/target rules are synchronously applied starting from the initial symbol. Decoding is realized by CYK-style bottom-up parsing on the source side, with each derivation representing a translation candidate. We demonstrate the strong baseline for the PAT-MT English/Japanese translations.
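The synchronous-CFG mechanism described above rewrites paired source/target rules in lockstep, which lets a single rule reorder material between languages. The toy sketch below applies rules greedily top-down with one nonterminal slot `X1` (a stand-in for real CYK decoding over millions of learned rules; the rule table and romanized Japanese outputs are invented for illustration):

```python
# Each rule pairs a source pattern with a target pattern; X1 marks the
# shared nonterminal slot that is rewritten synchronously on both sides.
RULES = [
    ("give X1 to me", "watashi ni X1 o kudasai"),  # note the reordering
    ("the patent", "tokkyo"),
    ("the document", "bunsho"),
]

def translate(src):
    """Greedy top-down synchronous rule application (toy decoder)."""
    for s_pat, t_pat in RULES:
        if "X1" in s_pat:
            pre, post = s_pat.split("X1")
            if (src.startswith(pre) and src.endswith(post)
                    and len(src) > len(pre) + len(post)):
                inner = translate(src[len(pre):len(src) - len(post)])
                if inner is not None:
                    return t_pat.replace("X1", inner)
        elif src == s_pat:
            return t_pat
    return None
```

Because `X1` sits in different positions on the source and target sides, one derivation captures the long-distance reordering that plain phrase-based models handle poorly.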

2 citations


01 Jan 2008
TL;DR: Describes a new rule-based English question analyzer that extracts English query terms, which are translated into Japanese using translation dictionaries, building on the technologies used in past NTCIR systems for QAC and CLQA.
Abstract: This paper describes our Complex Cross-Lingual Question Answering (CCLQA) system based on the technologies used in our past NTCIR systems for QAC and CLQA. We implemented a new rule-based English question analyzer to extract English query terms, which are translated into Japanese by translation dictionaries. For DEFINITION, BIOGRAPHY, and EVENT questions, we reused our definition module for QAC-4. For RELATIONSHIP questions, we developed a new module based on our why-QA approach for QAC-4. When these modules were not applicable, a simple sentence retriever was used. According to the organizers’ evaluation results, although our EN-JA system performed rather poorly due to the low coverage of the translation dictionaries, our JA-JA system achieved the second best score among the four participants.
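The abstract attributes the EN-JA system's weak performance to dictionary coverage: query terms absent from the translation dictionaries simply cannot reach the Japanese retriever. A minimal sketch of that dictionary-lookup step (the dictionary contents and function name are invented for illustration):

```python
SAMPLE_DICT = {
    "earthquake": ["jishin"],
    "damage": ["higai", "songai"],
}

def translate_terms(english_terms, dictionary):
    """Translate extracted query terms via a bilingual dictionary.
    Terms missing from the dictionary are dropped, which is exactly
    the coverage problem the evaluation results exposed."""
    translated, missed = [], []
    for term in english_terms:
        if term in dictionary:
            translated.extend(dictionary[term])
        else:
            missed.append(term)
    return translated, missed
```

Tracking the `missed` list makes the coverage gap measurable per question, which helps explain why the monolingual JA-JA configuration fared much better.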