
Showing papers by "Michalis Vazirgiannis published in 2007"


Proceedings ArticleDOI
15 Apr 2007
TL;DR: This paper addresses the efficient computation of subspace skyline queries in large-scale peer-to-peer (P2P) networks, where the dataset is horizontally distributed across the peers, and proposes a threshold-based algorithm, called SKYPEER, which forwards skyline query requests among peers in such a way that the amount of transferred data is significantly reduced.
Abstract: Skyline query processing has received considerable attention in the recent past. Mainly, the skyline query is used to find a set of non-dominated data points in a multidimensional dataset. While most previous work has assumed a centralized setting, in this paper we address the efficient computation of subspace skyline queries in large-scale peer-to-peer (P2P) networks, where the dataset is horizontally distributed across the peers. Relying on a super-peer architecture, we propose a threshold-based algorithm, called SKYPEER, which forwards skyline query requests among peers in such a way that the amount of transferred data is significantly reduced. For efficient subspace skyline processing, we extend the notion of domination by defining the extended skyline set, which contains all data elements that are necessary to answer a skyline query in any arbitrary subspace. We prove that our algorithm provides exact answers and we present optimization techniques to reduce communication cost and execution time. Finally, we provide an extensive experimental evaluation showing that SKYPEER performs efficiently and provides a viable solution when a large degree of distribution is required.
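
To make the two dominance notions concrete, here is a minimal Python sketch of the classical skyline and the extended skyline described above, assuming smaller values are better in every dimension; the paper's distributed, threshold-based SKYPEER protocol is not reproduced here.

```python
def dominates(p, q):
    """Classical dominance: p is no worse than q everywhere and better somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def strictly_dominates(p, q):
    """Extended dominance: p is strictly better than q in every dimension."""
    return all(a < b for a, b in zip(p, q))

def skyline(points, dom=dominates):
    """Points not dominated by any other point under the given relation."""
    return [p for p in points if not any(dom(q, p) for q in points if q != p)]

points = [(1, 5), (2, 2), (2, 3), (5, 1), (3, 3)]
print(skyline(points))                          # classical skyline
print(skyline(points, dom=strictly_dominates))  # extended skyline (a superset)
```

Note that (2, 3) survives only in the extended skyline: it is dominated by (2, 2) but not strictly dominated, which is exactly why the extended set suffices for skylines in any subspace.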

136 citations


Proceedings Article
06 Jan 2007
TL;DR: A new unsupervised WSD algorithm is proposed, based on generating Spreading Activation Networks (SANs) from the senses of a thesaurus and the relations between them, along with a new method of assigning weights to the networks' links.
Abstract: Most word sense disambiguation (WSD) methods require large quantities of manually annotated training data and/or do not exploit fully the semantic relations of thesauri. We propose a new unsupervised WSD algorithm, which is based on generating Spreading Activation Networks (SANs) from the senses of a thesaurus and the relations between them. A new method of assigning weights to the networks' links is also proposed. Experiments show that the algorithm outperforms previous unsupervised approaches to WSD.
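
As a rough illustration of the spreading-activation idea (not the paper's actual network construction or weighting scheme), the toy sketch below activates context words and lets activation flow along weighted links to candidate senses; the sense accumulating the most activation wins. The graph and weights are assumptions made for the example.

```python
def spread_activation(graph, seeds, decay=0.5, iterations=3):
    """graph: {node: [(neighbor, weight), ...]}; seeds start with activation 1.0."""
    activation = {node: 0.0 for node in graph}
    for s in seeds:
        activation[s] = 1.0
    for _ in range(iterations):
        new = dict(activation)
        for node, links in graph.items():
            for neighbor, weight in links:
                new[neighbor] += decay * weight * activation[node]
        activation = new
    return activation

# Toy network: context words link to the senses they support (weights assumed).
graph = {
    "money": [("bank/finance", 0.8)],
    "river": [("bank/geo", 0.8)],
    "bank/finance": [], "bank/geo": [],
}
act = spread_activation(graph, seeds=["money"])
print(max(["bank/finance", "bank/geo"], key=act.get))  # -> bank/finance
```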

88 citations


Journal ArticleDOI
TL;DR: This paper describes an unsupervised approach for decentralized and distributed generation of SONs (DESENT), and through simulations and analytical cost models the claims regarding performance, scalability, and quality are verified.
Abstract: The current approach in web searching, i.e., using centralized search engines, raises issues that question their future applicability: 1) coverage and scalability, 2) freshness, and 3) information monopoly. Performing web search using a P2P architecture that consists of the actual web servers has the potential to tackle those issues. In order to achieve the desired performance and scalability, as well as to enhance search quality relative to centralized search engines, semantic overlay networks (SONs) connecting peers storing semantically related information can be employed. The lack of global content/topology knowledge in a P2P system is the key challenge in forming SONs, and this paper describes an unsupervised approach for decentralized and distributed generation of SONs (DESENT). Through simulations and analytical cost models we verify our claims regarding performance, scalability, and quality.
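
The sketch below, a loose illustration rather than DESENT's actual distributed protocol, shows the underlying SON idea: peers whose content profiles are similar are grouped, so queries can later be routed only to the relevant overlay. Peer profiles as term-frequency dictionaries and the greedy single-pass clustering are assumptions made for brevity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency dicts."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def form_sons(peers, threshold=0.3):
    """peers: {peer_id: term_freq_dict}; returns lists of peer ids (one per SON)."""
    clusters = []  # each cluster: (representative profile, [peer_ids])
    for pid, profile in peers.items():
        for centroid, members in clusters:
            if cosine(profile, centroid) >= threshold:
                members.append(pid)
                break
        else:
            clusters.append((dict(profile), [pid]))
    return [members for _, members in clusters]

peers = {"p1": {"music": 3, "jazz": 2}, "p2": {"jazz": 3, "music": 1},
         "p3": {"soccer": 4, "league": 1}}
print(form_sons(peers))  # -> [['p1', 'p2'], ['p3']]
```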

61 citations


Proceedings Article
23 Sep 2007
TL;DR: This paper presents SIMPEER, a novel framework that dynamically clusters peer data in order to build distributed routing information at the super-peer level, reducing communication cost, network latency, bandwidth consumption and computational overhead at each individual peer.
Abstract: This paper addresses the efficient processing of similarity queries in metric spaces, where data is horizontally distributed across a P2P network. The proposed approach does not rely on arbitrary data movement; hence each peer joining the network autonomously stores its own data. We present SIMPEER, a novel framework that dynamically clusters peer data in order to build distributed routing information at the super-peer level. SIMPEER allows the evaluation of range and nearest neighbor queries in a distributed manner that reduces communication cost, network latency, bandwidth consumption and computational overhead at each individual peer. SIMPEER utilizes a set of distributed statistics and guarantees that all objects similar to the query are retrieved, without necessarily flooding the network during query processing. The statistics are employed for estimating an adequate query radius for k-nearest neighbor queries, transforming them into range queries. Our experimental evaluation employs both real-world and synthetic data collections, and our results show that SIMPEER performs efficiently, even in the case of a high degree of distribution.
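
The heart of such cluster-based routing is a pruning test: a super-peer holding (center, radius) summaries of remote clusters can skip any cluster that provably cannot contain a match. The sketch below shows that triangle-inequality test, assuming Euclidean data; SIMPEER's full machinery (radius estimation, distributed statistics) is not reproduced.

```python
import math

def clusters_to_contact(query, radius, clusters, dist=math.dist):
    """clusters: [(center, cluster_radius), ...] summarizing remote peers.
    A cluster may hold a match only if dist(q, center) - cluster_radius <= radius,
    by the triangle inequality."""
    return [i for i, (center, c_radius) in enumerate(clusters)
            if dist(query, center) - c_radius <= radius]

clusters = [((0.0, 0.0), 1.0), ((10.0, 10.0), 2.0)]
print(clusters_to_contact(query=(1.0, 1.0), radius=0.5, clusters=clusters))  # [0]
```

Only cluster 0 is contacted: the second cluster's summary proves every object it holds is farther than the query range, so the query is never forwarded there.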

60 citations


Journal ArticleDOI
TL;DR: UPR, a PageRank-style algorithm which combines usage data and link analysis techniques for assigning probabilities to Web pages based on their importance in the Web site's navigational graph, is presented, and it is shown that this approach results in more objective and representative predictions than those produced by pure usage-based approaches.
Abstract: The continuous growth in the size and use of the World Wide Web imposes new methods of design and development of online information services. The need to predict users' needs in order to improve the usability and user retention of a Web site is evident and can be addressed by personalizing it. Recommendation algorithms aim at proposing “next” pages to users based on their current visit and past users' navigational patterns. In the vast majority of related algorithms, however, only the usage data is used to produce recommendations, disregarding the structural properties of the Web graph. Thus, pages that are important in terms of PageRank authority score may be underrated. In this work, we present UPR, a PageRank-style algorithm which combines usage data and link analysis techniques for assigning probabilities to Web pages based on their importance in the Web site's navigational graph. We propose the application of a localized version of UPR (l-UPR) to personalized navigational subgraphs for online Web page ranking and recommendation. Moreover, we propose a hybrid probabilistic predictive model based on Markov models, in which link analysis is used to assign prior probabilities to the model's states. We show, through experimentation, that this approach results in more objective and representative predictions than those produced by pure usage-based approaches.
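
As a hedged sketch of the general idea of blending usage data into link analysis (not UPR's exact formulation), the snippet below runs a PageRank-style power iteration in which transition probabilities come from observed click counts rather than a uniform split over out-links.

```python
def usage_pagerank(usage, d=0.85, iterations=50):
    """usage: {page: {next_page: click_count}} derived from the site's access logs.
    Dangling pages simply leak rank in this simplified sketch."""
    pages = set(usage) | {q for outs in usage.values() for q in outs}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - d) / len(pages) for p in pages}
        for p, outs in usage.items():
            total = sum(outs.values())
            for q, clicks in outs.items():
                # transition probability weighted by observed usage, not 1/outdegree
                new[q] += d * rank[p] * clicks / total
        rank = new
    return rank

logs = {"home": {"products": 80, "about": 20}, "products": {"home": 50}}
print(sorted(usage_pagerank(logs).items(), key=lambda kv: -kv[1]))
```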

25 citations


Proceedings ArticleDOI
08 May 2007
TL;DR: This work presents an efficiently computable normalization for PageRank scores that makes them comparable across graphs, and shows that the normalized PageRank scores are robust to non-local changes in the graph, unlike the standard PageRank measure.
Abstract: PageRank is the best known technique for link-based importance ranking. The computed importance scores, however, are not directly comparable across different snapshots of an evolving graph. We present an efficiently computable normalization for PageRank scores that makes them comparable across graphs. Furthermore, we show that the normalized PageRank scores are robust to non-local changes in the graph, unlike the standard PageRank measure.
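
One simple way to make raw PageRank scores comparable across graphs of different sizes, offered here as an assumption rather than the paper's exact scheme, is to divide each score by the smallest attainable value (1 - d) / n, so a score reads as a multiple of the baseline score of a page with no in-links.

```python
def normalize_pagerank(scores, d=0.85):
    """Rescale raw PageRank scores by the minimum attainable score (1 - d) / n.
    Illustrative normalization; the paper's scheme may differ."""
    n = len(scores)
    baseline = (1 - d) / n
    return {page: s / baseline for page, s in scores.items()}

snapshot = {"a": 0.5, "b": 0.3, "c": 0.2}
print(normalize_pagerank(snapshot))  # values comparable across snapshots of different n
```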

23 citations


Journal ArticleDOI
TL;DR: In this paper, context-aware web service discovery is proposed to enable the provision of the most appropriate services at the right location and time in a mobile peer-to-peer environment.
Abstract: In modern heterogeneous environments, such as mobile, pervasive and ad-hoc networks, architectures based on web services offer an attractive solution for effective communication and interoperation. In such dynamic and rapidly evolving environments, efficient web service discovery is an important task. Usually this task is based on input/output parameters or other functional attributes; however, this does not guarantee the validity or successful utilization of retrieved web services. Instead, non-functional attributes, such as device power features, computational resources and connectivity status, that characterize the context of both service providers and consumers play an important role in the quality and usability of discovery results. In this paper we introduce context-awareness in web service discovery, enabling the provision of the most appropriate services at the right location and time. We focus on context-based caching and routing for improving web service discovery in a mobile peer-to-peer environment. We conducted a thorough experimental study using our prototype implementation based on the JXTA framework, while simulations were employed to test the scalability of the approach. We illustrate the advantages that this approach offers, both by evaluating the context-based cache performance and by comparing the efficiency of location-based routing to broadcast-based approaches.
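
A minimal sketch of context-aware matchmaking, with attribute names and weights that are purely illustrative assumptions (the paper's model, caching and routing layers are richer): candidates failing the functional match are dropped, and the rest are ranked by non-functional context.

```python
def context_score(service, query):
    """Combine functional matching with assumed context attributes."""
    if query["operation"] not in service["operations"]:
        return 0.0  # functional mismatch: never returned
    battery = service["battery"]                          # fraction remaining
    bandwidth = min(service["bandwidth_kbps"] / 1000.0, 1.0)
    proximity = 1.0 / (1.0 + service["distance_m"] / 100.0)
    return 0.4 * battery + 0.3 * bandwidth + 0.3 * proximity  # weights assumed

services = [
    {"operations": {"print"}, "battery": 0.9, "bandwidth_kbps": 500, "distance_m": 40},
    {"operations": {"print"}, "battery": 0.2, "bandwidth_kbps": 2000, "distance_m": 400},
]
query = {"operation": "print"}
print(max(services, key=lambda s: context_score(s, query)))  # nearby, well-powered peer
```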

19 citations


Proceedings ArticleDOI
20 Jun 2007
TL;DR: The contribution of the proposed framework is a standardized workflow aiming at the integration of data produced by various hospitals into a consistent data warehouse, together with a mechanism that detects hidden and previously unknown patterns in large datasets, in terms of association rules, which can provide surveillance warnings.
Abstract: One of the most important functions in a hospital's infection control program is the surveillance of antibiotic resistance. Several traditional methods used to measure it do not provide adequate results for further analysis. Data mining techniques, such as association rules, have been used in the past and have successfully led to the discovery of interesting patterns in public health data. In this work, we present the architecture of a novel framework which integrates data from multiple hospitals, discovers association rules, stores them in a data warehouse for future analysis and provides anytime accessibility through an intuitive Web interface. We implemented the proposed architecture as a Web application and evaluated it using data from the WHONET software installed in many Greek hospitals that belong to the "Greek System for Surveillance of Antimicrobial Resistance" network. The contribution of the proposed framework is a standardized workflow aiming at the integration of data produced by various hospitals into a consistent data warehouse, together with a mechanism that detects hidden and previously unknown patterns in large datasets, in terms of association rules, which can provide surveillance warnings.
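
To show the support/confidence mechanics behind such surveillance rules, here is a minimal, self-contained sketch on toy susceptibility-test records; the fields and thresholds are illustrative and do not reflect WHONET's schema.

```python
from itertools import combinations

def rules(transactions, min_support=0.3, min_confidence=0.7):
    """Mine pairwise rules a => b (one direction checked, for brevity)."""
    n = len(transactions)
    def support(itemset):
        return sum(itemset <= t for t in transactions) / n
    items = {i for t in transactions for i in t}
    found = []
    for a, b in combinations(sorted(items), 2):
        s = support({a, b})
        if s >= min_support and s / support({a}) >= min_confidence:
            found.append((a, b, s, s / support({a})))
    return found

records = [frozenset(t) for t in [
    {"E.coli", "ampicillin-R"}, {"E.coli", "ampicillin-R"}, {"E.coli", "ampicillin-R"},
    {"E.coli", "ampicillin-S"}, {"K.pneumoniae", "ampicillin-R"}]]
for a, b, s, c in rules(records):
    print(f"{a} => {b}  support={s:.2f} confidence={c:.2f}")
```

A rule such as "E.coli => ampicillin-R" crossing its thresholds is exactly the kind of pattern the framework would surface as a surveillance warning.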

10 citations


Book ChapterDOI
05 Jul 2007
TL;DR: The core scientific and technological objectives of PIRES, a scalable decentralized and distributed infrastructure for building a search engine for image content capitalizing on P2P technology, are presented.
Abstract: The World Wide Web provides an enormous amount of images easily accessible to everybody. The main challenge is to provide efficient search mechanisms for image content that are truly scalable and can support full coverage of web content. In this paper, we present an architecture that adopts the peer-to-peer (P2P) paradigm for indexing, searching and ranking of image content. The ultimate goal of our architecture is to provide an adaptive search mechanism for image content, enhanced with learning, relying on image features, user-defined annotations and user feedback. Thus, we present PIRES, a scalable, decentralized and distributed infrastructure for building a search engine for image content, capitalizing on P2P technology. In the following, we first present the core scientific and technological objectives of PIRES and then present some preliminary experimental results from our prototype.
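
As a speculative illustration of one way image content could be placed in a P2P index (PIRES's actual indexing, annotation and feedback mechanisms are richer), the sketch below quantizes a global feature vector into a coarse code word and hashes it onto a peer, DHT-style, so indexing and querying the same features land on the same node.

```python
import hashlib

def code_word(features, bins=4):
    """Quantize each feature in [0, 1) into `bins` levels (assumed representation)."""
    return tuple(min(int(f * bins), bins - 1) for f in features)

def responsible_peer(features, peers):
    """Map the code word onto the peer ring via hashing."""
    digest = hashlib.sha1(str(code_word(features)).encode()).hexdigest()
    return peers[int(digest, 16) % len(peers)]

peers = ["peer-a", "peer-b", "peer-c"]
print(responsible_peer([0.12, 0.83, 0.40], peers))  # where to index / route the query
```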

9 citations


Book ChapterDOI
17 Sep 2007
TL;DR: This paper focuses on two well studied algorithms, LSI and PCA, and proposes a feature selection process that provably guarantees the stability of their outputs; it utilizes bootstrapping confidence intervals for assessing the statistical accuracy of the input sample matrices, and matrix perturbation theory in order to relate the statistical accuracy to the stability of eigenvectors.
Abstract: The stability of sample-based algorithms is a concept commonly used for parameter tuning and validity assessment. In this paper we focus on two well-studied algorithms, LSI and PCA, and propose a feature selection process that provably guarantees the stability of their outputs. The feature selection process is performed such that the level of (statistical) accuracy of the LSI/PCA input matrices is adequate for computing meaningful (stable) eigenvectors. The feature selection process "sparsifies" LSI/PCA, resulting in the projection of the instances on the eigenvectors of a principal submatrix of the original input matrix, thus producing sparse factor loadings that are linear combinations solely of the selected features. We utilize bootstrapping confidence intervals for assessing the statistical accuracy of the input sample matrices, and matrix perturbation theory in order to relate the statistical accuracy to the stability of eigenvectors. Experiments on several UCI datasets empirically verify our approach.
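
The phenomenon being controlled can be illustrated empirically: bootstrap the sample, recompute the leading eigenvector each time, and check how much it rotates (|cosine| near 1 means stable). This sketch only demonstrates the measurement; the paper's contribution is the feature selection process with formal guarantees via bootstrap confidence intervals and perturbation bounds.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 0] += 3 * X[:, 1]  # correlated features give a well-separated top eigenvector

def top_eigenvector(data):
    cov = np.cov(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues
    return eigvecs[:, -1]                     # eigenvector of the largest one

reference = top_eigenvector(X)
cosines = []
for _ in range(100):
    sample = X[rng.integers(0, len(X), size=len(X))]  # bootstrap resample
    cosines.append(abs(reference @ top_eigenvector(sample)))
print(f"mean |cos| across bootstraps: {np.mean(cosines):.3f}")  # near 1.0 = stable
```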

8 citations


Book ChapterDOI
01 Nov 2007
TL;DR: This paper addresses the issue of representing and quantifying web ranking trends as a measure of web pages, and proposes normalized measures of ranking trends that are comparable among web graph snapshots of different sizes.
Abstract: One of the grand research and industrial challenges in recent years is efficient web search, which inherently involves the issue of page ranking. In this paper we address the issue of representing and quantifying web ranking trends as a measure of web pages. We study the rank position of a web page among different snapshots of the web graph and propose normalized measures of ranking trends that are comparable among web graph snapshots of different sizes. We define the rank change rate (racer) as a measure quantifying the evolution of the web graph. Thereafter, we examine different ways to aggregate the rank change rates and quantify the trends over a group of web pages. We outline the problem of identifying highly dynamic web pages and discuss possible future work. In our experimental evaluation we study the dynamics of web pages, especially those highly ranked.
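
Since rank positions are not directly comparable between snapshots of different sizes, a normalized change measure is needed; the stand-in below, an illustrative assumption rather than the paper's exact racer formula, compares size-normalized rank positions across two snapshots.

```python
def rank_change_rate(rank_old, rank_new, n_old, n_new):
    """rank_*: 1-based rank positions; n_*: number of pages in each snapshot.
    Positive values indicate an upward trend after size normalization."""
    return rank_old / n_old - rank_new / n_new

# A page climbing from position 50 of 1000 to position 40 of 2000:
print(rank_change_rate(rank_old=50, rank_new=40, n_old=1000, n_new=2000))  # 0.03
```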