
Showing papers by "Hiroyuki Kitagawa published in 2005"


Proceedings ArticleDOI
27 Nov 2005
TL;DR: This paper presents a novel robust solution that detects high-dimensional outliers based on user examples and tolerates incorrect inputs; it studies the behavior of projections of a few such examples to discover further objects that stand out in the projections where many of the examples are outlying.
Abstract: Detecting outliers is an important problem. Most of its applications typically possess high dimensional datasets. In high dimensional space, the data becomes sparse, which implies that every object can be regarded as an outlier from the point of view of similarity. Furthermore, a fundamental issue is that the notion of which objects are outliers typically varies between users, problem domains, or even datasets. In this paper, we present a novel robust solution which detects high dimensional outliers based on user examples and tolerates incorrect inputs. It studies the behavior of projections of a few such examples to discover further objects that stand out in the projections where many of the examples are outlying. Our experiments on both real and synthetic datasets demonstrate the ability of the proposed method to detect outliers corresponding to the user examples.
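
No code accompanies the abstract, but the projection-search idea can be sketched roughly as follows. This is a minimal illustration rather than the authors' algorithm: it assumes axis-parallel two-dimensional projections and a simple z-score outlyingness measure, and the function names and data are hypothetical.

```python
# Hypothetical sketch: find the axis-parallel projection where the user's
# example objects are most outlying, then report the objects that stand out
# most in that projection. Not the paper's actual algorithm.
from itertools import combinations
import numpy as np

def outlyingness(X_proj):
    """Per-object z-score deviation from the mean within a projection."""
    mu, sigma = X_proj.mean(axis=0), X_proj.std(axis=0) + 1e-9
    return np.abs((X_proj - mu) / sigma).sum(axis=1)

def example_based_outliers(X, example_idx, dim=2, top_k=10):
    best_score, best_subspace = -np.inf, None
    for subspace in combinations(range(X.shape[1]), dim):
        scores = outlyingness(X[:, list(subspace)])
        ex_score = scores[example_idx].mean()   # how outlying the examples look here
        if ex_score > best_score:
            best_score, best_subspace = ex_score, subspace
    scores = outlyingness(X[:, list(best_subspace)])
    return best_subspace, np.argsort(-scores)[:top_k]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
X[:5, [1, 4]] += 6                              # plant outliers in dimensions 1 and 4
subspace, outliers = example_based_outliers(X, example_idx=[0, 1], dim=2)
print(subspace, outliers)                       # expect subspace (1, 4), indices 0..4 on top
```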

52 citations


Journal ArticleDOI
TL;DR: A novel solution to detecting outliers in high dimensional datasets based on user examples is presented; it discovers the hidden view of the outliers and picks out further objects that stand out in the projection where the examples stand out most.
Abstract: Detecting outliers is an important problem in applications such as fraud detection, financial analysis, health monitoring and so on. It is typical of most such applications to possess high dimensional datasets. Many recent approaches detect outliers according to some reasonable, pre-defined concept of an outlier (e.g., distance-based, density-based, etc.). Most of these concepts are proximity-based, defining an outlier by its relationship to the rest of the data. However, in high dimensional space, the data becomes sparse, which implies that every object can be regarded as an outlier from the point of view of similarity. Furthermore, a fundamental issue is that the notion of which objects are outliers typically varies between users, problem domains, or even datasets. In this paper, we present a novel solution to this problem by detecting outliers based on user examples for high dimensional datasets. By studying the behavior of projections of a few such outlier examples in the dataset, the proposed method discovers the hidden view of the outliers and picks out further objects that stand out in the projection where the examples stand out most. Our experiments on both real and synthetic datasets demonstrate the ability of the proposed method to detect outliers that match users' intentions.

16 citations


Proceedings ArticleDOI
13 Mar 2005
TL;DR: A novel topic activation analysis scheme that incorporates both document arrival rate and relevance is proposed to address the first problem; an incremental scheme more appropriate for a document streaming environment is also presented.
Abstract: With the advance of network technology in recent years, the dissemination and exchange of massive numbers of documents have become commonplace. Accordingly, the importance of content analysis techniques is increasing. Topic analysis in large-scale document streams such as e-mails and news articles is an important research issue. This paper addresses techniques for "topic activation analysis" in document streams. For example, when news articles with a strong relationship to a given topic arrive frequently in a news stream, we can regard the activation level of the topic as high. In [1], Kleinberg proposed a method for analyzing document streams. Although the main objective of his method was to detect bursts of topics, it can also be used for topic activation analysis. His method, however, has a serious limitation in that it only looks at the arrival rate of documents and ignores the degree of relevance of each document. Another limitation is that his method is "batch-oriented." This paper first proposes a novel topic activation analysis scheme that incorporates both document arrival rate and relevance, addressing the first problem. It then presents an incremental scheme more appropriate for a document streaming environment. The proposed schemes are validated by experiments using real CNN news articles.
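
As a rough illustration of the first idea (the paper builds on Kleinberg's burst model, and its actual formulation may differ substantially), an activation level reflecting both arrival rate and relevance can be maintained incrementally with exponential decay; the class, the half-life parameter, and the relevance values below are assumptions.

```python
# Hedged illustration (not the paper's exact formulation): an exponentially
# decayed activation score combines how often documents arrive with how
# relevant each one is, and can be updated incrementally per document.
import math

class TopicActivation:
    def __init__(self, half_life=1800.0):        # seconds until activation halves
        self.decay = math.log(2) / half_life
        self.activation = 0.0
        self.last_time = None

    def update(self, arrival_time, relevance):
        """Decay the old activation, then add the new document's relevance."""
        if self.last_time is not None:
            self.activation *= math.exp(-self.decay * (arrival_time - self.last_time))
        self.activation += relevance              # e.g., cosine similarity to the topic
        self.last_time = arrival_time
        return self.activation

monitor = TopicActivation()
for t, rel in [(0, 0.9), (60, 0.7), (4000, 0.2)]:  # frequent, relevant docs raise the level
    print(t, round(monitor.update(t, rel), 3))
```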

13 citations


01 Jan 2005
TL;DR: A method to detect topics in text data using feature vectors obtained by Singular Value Decomposition, clustering, and Independent Component Analysis is examined.
Abstract: Topic detection is an important subject when voluminous text data is sent continuously to a user. We examine a method to detect topics in text data using feature vectors. Feature vectors represent the main distribution of data and they are obtained by various data analysis methods. This paper examines three methods: Singular Value Decomposition (SVD), clustering, and Independent Component Analysis (ICA). SVD and clustering are popular existing methods. Clustering, especially, is applied in many topic detection methods. ICA was recently developed in signal processing research. In applications related to text data, however, ICA has not been compared with SVD and clustering, nor has its relationship with them been explored. This paper reports comparative experiments for these three methods and then shows their properties as they apply to text data.
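
A minimal harness for this kind of comparison can be written with scikit-learn; the toy corpus, the number of topics, and all parameter settings below are illustrative assumptions, not the paper's experimental setup.

```python
# Illustrative comparison of SVD, clustering, and ICA for extracting topic
# feature vectors from a TF-IDF term-document matrix (toy data).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import FastICA, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["stocks fell sharply today", "the market rallied on strong earnings",
        "heavy rain flooded the city", "storm warnings issued for the coast"]
k = 2                                            # assumed number of topics

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
terms = np.array(vec.get_feature_names_out())

# 1) SVD: topic vectors are the leading right singular vectors.
svd_topics = TruncatedSVD(n_components=k, random_state=0).fit(X).components_
# 2) Clustering: topic vectors are cluster centroids.
km_topics = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).cluster_centers_
# 3) ICA: topic vectors are statistically independent components.
ica_topics = FastICA(n_components=k, random_state=0).fit(X.toarray()).components_

for name, topics in [("SVD", svd_topics), ("KMeans", km_topics), ("ICA", ica_topics)]:
    for t in topics:                             # show the 3 strongest terms per vector
        print(name, terms[np.argsort(-np.abs(t))[:3]])
```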

9 citations


01 Jan 2005
TL;DR: This paper explains how the software tool that finds the destinations (new URLs) of Web pages after they are moved works internally, and shows some results of applying it to the problem of finding new locations of real Web pages.
Abstract: We are developing a software tool that finds the destinations (new URLs) of Web pages after the pages are moved. A key feature of the tool is that it tries to find “reliable Web links,” that is, links that are always kept up to date. We believe this is a new approach to finding new URLs for Web pages. This paper explains how the tool works internally and shows some results of its application to the problem of finding new locations of real Web pages.
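
The abstract does not spell out the internals; the sketch below is a hypothetical reconstruction of the “reliable Web links” idea, with invented data structures (outlink-liveness snapshots, anchor-text matching) that are assumptions rather than the tool's actual design.

```python
# Hypothetical sketch: prefer pages whose outlinks were consistently alive in
# past snapshots ("reliable links"), and use their current outlinks to guess
# the new URL of a moved page. All data structures here are invented.
def reliability(snapshots):
    """Fraction of past snapshots in which all of a page's outlinks were alive.
    Each snapshot is a list of (url, alive) pairs."""
    ok = sum(1 for links in snapshots if all(alive for _, alive in links))
    return ok / len(snapshots)

def find_new_url(old_url, old_anchor, crawl, min_reliability=0.8):
    """crawl maps page -> (snapshots, current_links); current links are
    (url, anchor_text) pairs."""
    votes = {}
    for page, (snapshots, current_links) in crawl.items():
        r = reliability(snapshots)
        if r < min_reliability:                  # skip poorly maintained pages
            continue
        for url, anchor in current_links:
            if url != old_url and anchor == old_anchor:
                votes[url] = votes.get(url, 0.0) + r
    return max(votes, key=votes.get) if votes else None

crawl = {"hub.example/links": (
            [[("a.example/x", True)], [("a.example/x", True)]],  # past snapshots
            [("a.example/new-x", "X Project")])}                 # current outlinks
print(find_new_url("a.example/x", "X Project", crawl))           # a.example/new-x
```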

8 citations


Proceedings ArticleDOI
08 Apr 2005
TL;DR: Three methods to detect topics in text data using feature vectors are examined: singular value decomposition, clustering, and independent component analysis (ICA).
Abstract: Topic detection is an important subject when voluminous text data is sent continuously to a user. We examine a method to detect topics in text data using feature vectors. Feature vectors represent the main distribution of data and they are obtained by various data analysis methods. This paper examines three methods: singular value decomposition (SVD), clustering, and independent component analysis (ICA). SVD and clustering are popular existing methods. Clustering, especially, is applied in many topic detection methods. ICA was recently developed in signal processing research. In applications related to text data, however, ICA has not been compared with SVD and clustering, nor has its relationship with them been explored. This paper reports comparative experiments for these three methods and then shows their properties as they apply to text data.

8 citations


Proceedings ArticleDOI
05 Apr 2005
TL;DR: This paper extends the research group's multiple query optimization method for continuous queries so that the system automatically estimates the optimal clustering parameter value and iteratively adjusts it even if the properties of the underlying data streams change dramatically.
Abstract: Continuous queries are widely recognized as a scheme for processing queries over data streams, and efficient methods for processing multiple continuous queries are needed. Our research group has proposed a multiple query optimization method for continuous queries. In our method, the system forms clusters of queries with similar execution patterns and derives query plans that share the results of common operators. Our previous experiments have shown that a parameter value in the clustering phase controls the division of clusters and has a great impact on query processing efficiency. However, the optimal parameter value must be decided by trial and error. This paper extends our previous work. The proposed method automatically estimates the optimal value and iteratively adjusts it even if the properties of the underlying data streams change dramatically.
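
As a hedged illustration of the adjustment loop (the authors' estimator is not reproduced here), a simple hill-climbing tuner over a measured-cost function could look like the following; the cost surface and step schedule are invented.

```python
# Hedged sketch (not the authors' estimator): tune the clustering parameter
# by hill climbing on the measured per-window processing cost, re-probing
# periodically so the value tracks changes in the stream.
def tune_parameter(theta, step, measure_cost, rounds=20):
    """measure_cost(theta) -> observed processing cost under that clustering."""
    cost = measure_cost(theta)
    for _ in range(rounds):
        trial = theta + step
        trial_cost = measure_cost(trial)
        if trial_cost < cost:            # keep moving in the improving direction
            theta, cost = trial, trial_cost
        else:
            step = -step / 2             # overshoot: reverse and shrink the step
    return theta

# Toy convex cost surface with its optimum at theta = 0.4 (illustrative only).
print(round(tune_parameter(0.1, 0.1, lambda t: (t - 0.4) ** 2), 3))
```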

5 citations


Journal ArticleDOI
TL;DR: This article proposes a method employing taxonomy-based search services, such as Web directories, to facilitate searches in any Web search interface that supports Boolean queries, and develops new fast classification learning algorithms for this purpose.
Abstract: Introducing context into a user query is an effective way to improve search effectiveness. In this article we propose a method employing taxonomy-based search services, such as Web directories, to facilitate searches in any Web search interface that supports Boolean queries. The proposed method enables one to convey the current search context on the taxonomy of a taxonomy-based search service to searches conducted with the Web search interfaces. The basic idea is to learn the search context in the form of a Boolean condition that is commonly accepted by many Web search interfaces, and to use that condition to modify the user query before forwarding it to the Web search interfaces. To guarantee that the modified query can always be processed by the Web search interfaces, and to make the method adaptive to different user requirements on search result effectiveness, we have developed new fast classification learning algorithms.
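
A toy sketch of the underlying idea, under the assumption that the learned context is a disjunction of discriminative terms ANDed onto the query; the scoring rule and the corpora below are illustrative, not the article's fast learning algorithms.

```python
# Toy sketch: learn a disjunction of terms that separates the current
# category's documents from the rest, then AND it onto the user query.
# The scoring rule and corpora are invented for illustration.
from collections import Counter

def learn_boolean_context(category_docs, other_docs, max_terms=3):
    pos = Counter(w for d in category_docs for w in set(d.lower().split()))
    neg = Counter(w for d in other_docs for w in set(d.lower().split()))
    def score(w):                                # document-frequency contrast
        return pos[w] / len(category_docs) - neg.get(w, 0) / max(len(other_docs), 1)
    terms = sorted(pos, key=score, reverse=True)[:max_terms]
    return "(" + " OR ".join(terms) + ")"

def contextualize(query, category_docs, other_docs):
    """Modify the query so any Boolean-capable search interface keeps the context."""
    return query + " AND " + learn_boolean_context(category_docs, other_docs)

sports = ["league match score", "team wins the match"]
other = ["stock market report today", "election results report"]
print(contextualize("jaguar", sports, other))    # e.g. jaguar AND (match OR ...)
```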

4 citations


Proceedings ArticleDOI
05 Apr 2005
TL;DR: This paper explains how the software tool that finds the destinations (new URLs) of Web pages after they are moved works internally, and shows some results of applying it to the problem of finding new locations of real Web pages.
Abstract: We are developing a software tool that finds the destinations (new URLs) of Web pages after the pages are moved. A key feature of the tool is that it tries to find "reliable Web links," that is, links that are always kept up to date. We believe this is a new approach to finding new URLs for Web pages. This paper explains how the tool works internally and shows some results of its application to the problem of finding new locations of real Web pages.

4 citations


Book ChapterDOI
28 Mar 2005
TL;DR: This paper proposes a distributed algorithm to detect outliers in large and distributed datasets, employing the distance-based outlier definition based on the distance of a point to its kth nearest neighbor.
Abstract: This paper proposes a distributed algorithm to detect outliers in large and distributed datasets. The algorithm employs the distance-based definition of outliers, which is based on the distance of a point to its kth nearest neighbor, and declares the top n points in this ranking to be outliers. To the best of our knowledge, this is the first proposal of a distributed outlier detection algorithm for shared-nothing multiple-processor computing environments. The algorithm has four phases. First, in each processing node, it partitions the input dataset into disjoint subsets and prunes entire partitions as soon as it is determined that they cannot contain outliers. Second, it applies a global filtering technique to collect global candidate partitions from the local candidate partitions of each processing node. Third, it uses a load balancing algorithm to balance the number of local candidate partitions. Finally, it identifies the outliers from each processing node.
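
For concreteness, the outlier definition the algorithm ranks by can be stated in a plain single-node sketch; the paper's contribution (the distributed pruning, global filtering, and load balancing) is omitted here, and the brute-force distance computation is deliberately naive.

```python
# Plain single-node statement of the distance-based outlier definition:
# score each point by the distance to its kth nearest neighbor and report
# the top n scores. The distributed machinery of the paper is omitted.
import numpy as np

def top_n_outliers(X, k=3, n=5):
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # all pairwise distances
    np.fill_diagonal(d, np.inf)
    knn_dist = np.sort(d, axis=1)[:, k - 1]      # distance to the kth nearest neighbor
    return np.argsort(-knn_dist)[:n]             # top n by that distance

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(200, 3)), rng.normal(6.0, 1.0, size=(5, 3))])
print(top_n_outliers(X))                         # expect the planted points 200..204
```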

4 citations


Book ChapterDOI
22 Aug 2005
TL;DR: This paper proposes a method called LocalRank to rank web pages by integrating the web and a user database containing information on a specific geographical area, constructing a linked graph structure from the entries contained in the database.
Abstract: In this paper, we propose a method called LocalRank to rank web pages by integrating the web and a user database containing information on a specific geographical area. LocalRank is a rank value that assesses a web page's degree of relevance to the database entries, considering geographical locality, as well as its popularity in a local web space. In our method, we first construct a linked graph structure using the entries contained in the database. The nodes of this graph consist of database entries and their related web pages. The edges of the graph are composed of semantic links, including geographical links, between these nodes, in addition to conventional hyperlinks. A link analysis is then performed to compute a LocalRank value for each node. LocalRank can represent the user's interest since this graph effectively integrates the web and the user database. Our experimental results for a local restaurant database show that local web pages related to the database entries are ranked highly by our method.
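
A schematic version of the link-analysis step, assuming a PageRank-style iteration over the combined graph; the edge construction and any weighting in LocalRank are richer than this toy adjacency matrix suggests.

```python
# Schematic PageRank-style iteration over the combined graph; LocalRank's
# actual edge construction and weighting are richer than this toy example.
import numpy as np

def local_rank(adj, d=0.85, iters=50):
    """adj[i][j] = 1 if node i links to node j via a hyperlink or a semantic
    (e.g., geographical) edge; returns one rank value per node."""
    A = np.asarray(adj, dtype=float)
    out = A.sum(axis=1, keepdims=True)
    P = np.divide(A, out, out=np.zeros_like(A), where=out > 0)  # row-stochastic
    n = len(A)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        dangling = r[out.ravel() == 0].sum() / n  # spread mass of sink nodes
        r = (1 - d) / n + d * (P.T @ r + dangling)
    return r

# Toy graph: node 0 is a restaurant database entry, nodes 1-3 are web pages.
adj = [[0, 1, 1, 0],
       [0, 0, 0, 1],
       [1, 0, 0, 1],
       [1, 0, 0, 0]]
print(np.round(local_rank(adj), 3))
```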


Proceedings ArticleDOI
08 Apr 2005
TL;DR: This paper proposes an approach to extract spatial information hubs from the Web: it extracts geographic information from Web pages to create spatial nodes and spatial links, and then conducts a link analysis based on the extended link structures.
Abstract: Recently, Web mining, which tries to find useful knowledge in the vast number of Web pages, has attracted a lot of research interest. In addition, it is becoming an essential task to provide Web pages related to a user-specified geographic area. In this paper, we propose an approach to extract spatial information hubs from the Web. A spatial information hub is a Web page which is related to a specified geographic area and has much local information and/or many hyperlinks to local Web pages. In the traditional approach to Web link analysis, the importance and quality of pages are judged only by their contents and hyperlink structures. In contrast, we take their geographic localities into consideration. In our approach, we first extract geographic information from Web pages to create spatial nodes and spatial links, and then conduct a link analysis based on the extended link structures. We also show that our approach works well based on experiments.
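
One plausible reading of the analysis, sketched under the assumption that HITS-style hub scores are computed over hyperlinks augmented with spatial links; the matrices below are toy data and the combination of the two edge sets is an assumption.

```python
# Hedged sketch: HITS-style hub scores over hyperlinks augmented with
# "spatial links" (edges assumed to connect pages about the same area).
import numpy as np

def spatial_hits(hyperlinks, spatial_links, iters=30):
    A = np.asarray(hyperlinks, float) + np.asarray(spatial_links, float)
    hubs = np.ones(len(A))
    for _ in range(iters):
        auth = A.T @ hubs                # good authorities are cited by good hubs
        auth /= np.linalg.norm(auth)
        hubs = A @ auth                  # good hubs point to good authorities
        hubs /= np.linalg.norm(hubs)
    return hubs, auth

hyper   = [[0, 1, 1], [0, 0, 0], [0, 1, 0]]     # ordinary hyperlinks
spatial = [[0, 0, 1], [0, 0, 1], [0, 0, 0]]     # pages tied to the same area
hubs, auth = spatial_hits(hyper, spatial)
print(np.round(hubs, 3), np.round(auth, 3))     # page 0 emerges as the hub
```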

Patent
22 Aug 2005
TL;DR: In this paper, the authors propose a link authority determination device capable of determining proper link authorities in as short a time and with as little labor as possible: Web pages satisfying predetermined conditions are selected as link authority candidates from among the pages collected by a Web page collecting means.
Abstract: PROBLEM TO BE SOLVED: To provide a link authority determination device capable of determining proper link authorities in as short a time and with as little labor as possible. SOLUTION: A link authority candidate determining means 5 selects a plurality of Web pages satisfying predetermined conditions as link authority candidates from among the Web pages collected by a Web page collecting means 3. A ranking means 17 of a link authority determining means 13 ranks the candidate Web pages in ascending order of their percentage of link shortage. A final determination means 19 selects one or more of the top-ranked Web pages as link authorities from the ranking produced by the ranking means 17. COPYRIGHT: (C)2007,JPO&INPIT
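
A rough Python rendering of the ranking step, under the assumption that "link shortage" means the fraction of a page's outlinks that are dead or outdated; the tuple format and the candidate counts are hypothetical.

```python
# Hypothetical rendering of the patent's ranking step: candidates with the
# smallest fraction of missing (dead or outdated) outlinks rank first.
def rank_link_authorities(candidates, top=3):
    """candidates: list of (url, dead_links, total_links) tuples."""
    def shortage(c):
        url, dead, total = c
        return dead / total if total else 1.0
    ranked = sorted(candidates, key=shortage)    # fewest missing links first
    return [url for url, _, _ in ranked[:top]]

pages = [("a.example", 1, 50), ("b.example", 10, 40), ("c.example", 0, 30)]
print(rank_link_authorities(pages, top=2))       # ['c.example', 'a.example']
```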