Showing papers by "Michalis Vazirgiannis published in 2012"


Proceedings ArticleDOI
16 Jul 2012
TL;DR: A reverse approach is proposed, where the set of strictly required data items to fill in the application form can be computed on the user's side, leading to a significant reduction of the quantity of personal data filled in application forms while still reaching the same decision.
Abstract: Application forms are often used by companies and administrations to collect personal data about applicants and tailor services to their specific situation. For example, tax rates, social care, or personal loans are usually calibrated based on a set of personal data collected through application forms. In the eyes of privacy laws and directives, the set of personal data collected to achieve a service must be restricted to the minimum necessary. This reduces the impact of data breaches, in the interest of both service providers and applicants. In this article, we study the problem of limiting data collection in those application forms used to collect data and subsequently feed decision-making processes. In practice, the set of data collected is far in excess of what is needed, because application forms are filled in without any means to know what data will really impact the decision. To overcome this problem, we propose a reverse approach, where the set of strictly required data items to fill in the application form can be computed on the user's side. We formalize the underlying NP-hard optimization problem, propose algorithms to compute a solution, and validate them with experiments. Our proposal leads to a significant reduction of the quantity of personal data filled in application forms while still reaching the same decision. Keywords: privacy principle; limited collection; automated form filling.
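To make the reverse approach concrete, here is a minimal sketch assuming the decision process can be modelled as a disjunction of conjunctive rules over boolean data items (an assumption for illustration; the paper formalizes a more general NP-hard optimization and proposes dedicated algorithms rather than the exhaustive search shown here):

```python
from itertools import combinations

def minimum_exposure(user_data, rules):
    """user_data: dict attribute -> bool (what the applicant could fill in).
    rules: list of frozensets of attributes; the decision is positive if
    every attribute of at least one rule is revealed and True."""
    attrs = [a for a, v in user_data.items() if v]     # only True items can help
    for k in range(len(attrs) + 1):                    # try the smallest subsets first
        for subset in combinations(attrs, k):
            revealed = set(subset)
            if any(rule <= revealed for rule in rules):
                return revealed                        # minimal exposure found
    return None                                        # the positive decision cannot be reached

# Hypothetical rule set: a loan is granted if (salary_ok and permanent_job)
# or (homeowner and no_debt).
rules = [frozenset({"salary_ok", "permanent_job"}),
         frozenset({"homeowner", "no_debt"})]
applicant = {"salary_ok": True, "permanent_job": True,
             "homeowner": True, "no_debt": False, "married": True}
print(minimum_exposure(applicant, rules))   # {'salary_ok', 'permanent_job'}
```

The exponential search over subsets mirrors the NP-hardness of the problem; for realistically sized forms, approximate or pruned algorithms such as those validated in the paper are required.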

16 citations


Proceedings ArticleDOI
20 Nov 2012
TL;DR: This paper investigates the correlation between the social network communities as defined by a community detection algorithm and the Facebook pages annotated as Likes by its users, and proves that Likes constitute a criterion of distinction among the communities.
Abstract: In this paper we investigate the correlation between the social network communities, as defined by a community detection algorithm, and the Facebook pages annotated as Likes by its users. Our goal is twofold. First, we aim to examine the relation between the underlying social dynamic, as expressed indirectly by a community structure, and the users' characteristics represented by Likes. Second, to evaluate the outcome of the community detection algorithm. To the best of our knowledge, this is the first study of the correlation between community structure and users' Likes in Facebook. Using a standard crawling method, such as Breadth First Search, we collect: a) several snapshots of a subgraph of Facebook, b) the users' Likes in Web and Facebook pages, and c) the pages' categories as classified by the owner of the page. We study several graph samples along with their community structure. The experimental results demonstrate that, in the case of users' Likes, the correlation ranges from small to medium between communities and the whole population, while it is even smaller between communities. Moreover, there is a high correlation in terms of Likes' categories between the different communities and between communities and the whole population. This fact shows that Likes constitute a criterion of distinction among the communities and verifies the intuition that led us towards this research.
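As a rough illustration of this kind of analysis, the sketch below detects communities with a standard modularity-based algorithm and compares each community's Like-category distribution against that of the whole population with a rank correlation. The choice of detection algorithm, the data layout (user -> list of page categories) and the correlation measure are assumptions for illustration, not the exact setup of the paper:

```python
from collections import Counter
from networkx.algorithms.community import greedy_modularity_communities
from scipy.stats import spearmanr

def category_profile(users, likes):
    """likes: dict user -> list of page categories; returns category counts."""
    return Counter(c for u in users for c in likes.get(u, []))

def community_vs_population(graph, likes):
    communities = greedy_modularity_communities(graph)
    categories = sorted({c for cats in likes.values() for c in cats})
    population = category_profile(graph.nodes(), likes)
    pop_vector = [population[c] for c in categories]
    for i, com in enumerate(communities):
        profile = category_profile(com, likes)
        com_vector = [profile[c] for c in categories]
        rho, _ = spearmanr(pop_vector, com_vector)
        print(f"community {i}: {len(com)} users, rank correlation vs population = {rho:.2f}")
```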

9 citations


Proceedings ArticleDOI
17 Apr 2012
TL;DR: A new feature scoring algorithm for web page term extraction that takes into account the freshness of keywords in the current page as a means of capturing shifting user interests, and achieves more than 70% precision at 20 extracted keywords.
Abstract: Keyword extraction from web pages is essential to various text mining tasks including contextual advertising, recommendation selection, user profiling and personalization. For example, extracted keywords in contextual advertising are used to match advertisements with the web page currently browsed by a user. Most keyword extraction methods rely mainly on the content of a single web page, ignoring the browsing history of a user, and hence potentially leading to the same advertisements or recommendations. In this work we propose a new feature scoring algorithm for web page term extraction that, assuming a recent browsing history per user, takes into account the freshness of keywords in the current page as a means of capturing the user's shifting interests. We propose BM25H, a variant of the BM25 scoring function, implemented on the client side, that takes into account the user's browsing history and suggests keywords relevant to the currently browsed page but also fresh with respect to the user's recent browsing history. In this way, for each web page we obtain a set of keywords representing the time-shifting interests of the user. BM25H avoids repetitions of keywords which may simply be domain-specific stop-words, or may result in matching the same ads or similar recommendations. Our experimental results show that BM25H achieves more than 70% precision at 20 extracted keywords (based on blind human evaluation) and outperforms our baselines (TF and BM25 scoring functions), while it succeeds in keeping extracted keywords fresh compared to recent user history.
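One plausible way to realize the freshness idea is sketched below: terms of the current page get a BM25-style weight, damped for terms that already dominated the recent history. This is not the authors' BM25H formula (the abstract does not spell it out); it only illustrates the mechanism:

```python
import math
from collections import Counter

K1, B = 1.5, 0.75   # usual BM25 constants

def history_aware_scores(page_tokens, history_pages, top_k=20):
    """page_tokens: token list of the current page;
    history_pages: list of token lists from recently visited pages."""
    tf = Counter(page_tokens)
    avg_len = (sum(len(p) for p in history_pages) + len(page_tokens)) / (len(history_pages) + 1)
    n_docs = len(history_pages) + 1
    scores = {}
    for term, freq in tf.items():
        df = 1 + sum(term in p for p in history_pages)           # document frequency
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        norm = freq * (K1 + 1) / (freq + K1 * (1 - B + B * len(page_tokens) / avg_len))
        hist_freq = sum(p.count(term) for p in history_pages)
        freshness = 1.0 / (1.0 + hist_freq)                      # damp recently seen terms
        scores[term] = idf * norm * freshness
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```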

8 citations


Journal Article
TL;DR: The Minimum Exposure Project aims at proposing an analysis, framework and implementation of an important privacy principle, called Limited Data Collection, which is based on the principle of informed consent.
Abstract: When requesting bank loans, social care, tax reduction, and many other services, individuals are required to fill in application forms with hundreds of data items. It is possible, however, to drastically reduce the set of completed fields without impacting the final decision. The Minimum Exposure Project investigates this issue. It aims at proposing an analysis, framework and implementation of an important privacy principle, called Limited Data Collection.

6 citations


Posted Content
TL;DR: In this article, the authors propose a methodology, an architecture, and a fully functional framework for semi and fully automated creation, monitoring, and optimization of cost-efficient pay-per-click campaigns with budget constraints.
Abstract: Creating and monitoring competitive and cost-effective pay-per-click advertisement campaigns through the web-search channel is a resource-demanding task in terms of expertise and effort. Assisting or even automating the work of an advertising specialist will have an unrivaled commercial value. In this paper we propose a methodology, an architecture, and a fully functional framework for semi- and fully-automated creation, monitoring, and optimization of cost-efficient pay-per-click campaigns with budget constraints. The campaign creation module automatically generates keywords based on the content of the web page to be advertised, extended with corresponding ad-texts. These keywords are used to automatically create the campaigns, with all the appropriate values set. The campaigns are uploaded to the auctioneer platform and start running. The optimization module learns from existing campaign statistics and from the strategies applied in previous periods in order to invest optimally in the next period. The objective is to maximize the performance (i.e., clicks, actions) under the current budget constraint. The fully functional prototype is experimentally evaluated on real-world Google AdWords campaigns and shows promising behavior with regard to campaign performance statistics, as it systematically outperforms the competing manually maintained campaigns.
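A toy sketch of the optimization step is shown below: the next period's budget is reallocated towards keywords that delivered the most clicks per unit of cost in the previous period. The actual module learns from richer statistics and previously applied strategies; this greedy rule only illustrates the objective of maximizing clicks under a budget constraint (the keyword names and the minimum-spend parameter are hypothetical):

```python
def reallocate_budget(stats, budget, min_spend=0.05):
    """stats: dict keyword -> (clicks, cost) observed in the previous period."""
    efficiency = {kw: clicks / cost for kw, (clicks, cost) in stats.items() if cost > 0}
    total = sum(efficiency.values()) or 1.0
    plan = {kw: max(min_spend, budget * eff / total) for kw, eff in efficiency.items()}
    scale = budget / sum(plan.values())          # keep the plan within the budget
    return {kw: round(amount * scale, 2) for kw, amount in plan.items()}

print(reallocate_budget({"car rental": (120, 60.0), "cheap cars": (30, 45.0)}, budget=100.0))
# {'car rental': 75.0, 'cheap cars': 25.0}
```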

5 citations


Proceedings ArticleDOI
12 Aug 2012
TL;DR: A system that supports the visual exploration of collaboration networks by providing an easy-to-use interface to query for the fractional core index of an author, to see who the closest equally or higher-ranked co-authors are, and explore the entire co-authorship network in an incremental manner.
Abstract: We demonstrate a system that supports the visual exploration of collaboration networks. The system leverages the notion of fractional cores introduced in earlier work to rank vertices in a collaboration network and filter vertices' neighborhoods. Fractional cores build on the idea of graph degeneracy as captured by the notion of k-cores in graph theory and extend it to undirected edge-weighted graphs. In a co-authorship network, for instance, the fractional core index of an author intuitively reflects the degree of collaboration with equally or higher-ranked authors. Our system has been deployed on a real-world co-authorship network derived from DBLP, demonstrating that the idea of fractional cores can be applied even to large-scale networks. The system provides an easy-to-use interface to query for the fractional core index of an author, to see who the closest equally or higher-ranked co-authors are, and to explore the entire co-authorship network in an incremental manner.
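The exact fractional-core index is defined in the earlier work referenced above; the sketch below only illustrates the underlying degeneracy idea extended to edge-weighted graphs, by peeling vertices whose total edge weight to the remaining graph falls below a threshold (a simplification, not the paper's definition):

```python
import networkx as nx

def weighted_core(graph, threshold):
    """Iteratively remove vertices whose weighted degree drops below threshold."""
    g = graph.copy()
    changed = True
    while changed:
        changed = False
        for v in list(g.nodes()):
            strength = sum(d.get("weight", 1) for _, _, d in g.edges(v, data=True))
            if strength < threshold:
                g.remove_node(v)
                changed = True
    return g

# Hypothetical co-authorship graph: edge weight = number of joint papers.
g = nx.Graph()
g.add_weighted_edges_from([("a", "b", 3), ("b", "c", 1), ("c", "d", 1), ("a", "c", 2)])
print(sorted(weighted_core(g, threshold=3).nodes()))   # ['a', 'b', 'c']
```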

5 citations


Proceedings ArticleDOI
10 Dec 2012
TL;DR: A prototype and a functional web application for semi- and fully-automated creation, monitoring, and management of cost-efficient pay-per-click campaigns with budget constraints, which shows promising behavior with regard to campaign performance statistics, systematically outperforming the competing manually created and/or monitored campaigns.
Abstract: Creating and monitoring a competitive and cost-effective pay-per-click advertisement campaign through the web-search channel is a resource-demanding task in terms of human expertise and effort. Assisting or even automating the work of an advertising specialist will have an unrivaled commercial value. In this demonstration we present a prototype and a functional web application for semi- and fully-automated creation, monitoring, and management of cost-efficient pay-per-click campaigns with budget constraints. The prototype is experimentally evaluated on real-world Google AdWords campaigns and shows promising behavior with regard to campaign performance statistics, systematically outperforming the competing manually created and/or monitored campaigns.

5 citations


04 Dec 2012
TL;DR: In this article, the authors address the case of decision-making processes based on sets of classifiers, typically multi-label classifiers, and propose an approach, termed Minimum Exposure, to reduce the quantity of information provided by users in order to protect their privacy, reduce processing costs for the organization, and limit financial loss in the event of a data breach.
Abstract: Administrative services such as social care, tax reduction, and many others that rely on complex decision processes request individuals to provide large amounts of private data items in order to calibrate their proposal to the specific situation of the applicant. This data is subsequently processed and stored by the organization. However, not all of the requested information is needed to reach the same decision. We have recently proposed an approach, termed Minimum Exposure, to reduce the quantity of information provided by users, in order to protect their privacy, reduce processing costs for the organization, and limit financial loss in the case of a data breach. In this paper, we address the case of decision-making processes based on sets of classifiers, typically multi-label classifiers. We propose a practical implementation using state-of-the-art multi-label classifiers, and analyze the effectiveness of our solution on several real multi-label data sets.
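As an illustrative sketch (not the paper's actual algorithm), consider decisions made by a set of linear binary classifiers: a data item can be withheld if, for any value it might take within known bounds, every classifier still reaches the same decision for this applicant. A greedy pass tries to mask one feature at a time, whereas the real problem is an NP-hard optimization over all subsets; all names and bounds below are hypothetical:

```python
def can_mask(x, masked, classifiers, bounds):
    """classifiers: list of (weights, bias); bounds[i]: (min, max) of feature i."""
    for w, b in classifiers:
        target = sum(wi * xi for wi, xi in zip(w, x)) + b >= 0   # decision on full data
        lo = hi = b                                              # worst-case score interval
        for i, (wi, xi) in enumerate(zip(w, x)):
            if i in masked:
                vals = (wi * bounds[i][0], wi * bounds[i][1])
                lo += min(vals); hi += max(vals)
            else:
                lo += wi * xi; hi += wi * xi
        if target and lo < 0:        # positive decision no longer guaranteed
            return False
        if not target and hi >= 0:   # negative decision no longer guaranteed
            return False
    return True

def greedy_minimum_exposure(x, classifiers, bounds):
    masked = set()
    for i in range(len(x)):
        if can_mask(x, masked | {i}, classifiers, bounds):
            masked.add(i)
    return [i for i in range(len(x)) if i not in masked]   # indices that must be revealed

# Hypothetical example: a benefit is granted when income - debt >= 0.
clf = [((1.0, -1.0), 0.0)]
print(greedy_minimum_exposure([15.0, 2.0], clf, bounds=[(0, 20), (0, 10)]))   # [0]
```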

4 citations


Journal ArticleDOI
TL;DR: The 2011 edition of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) was held in Athens, Greece, during September 5–9, 2011 and the proceedings were published in three volumes of the Springer’s Lecture Notes in Artificial Intelligence series.
Abstract: The 2011 edition of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) was held in Athens, Greece, during September 5–9, 2011. Ten years after the first edition of this joint conference, ECML PKDD 2011 continued to provide a common forum for the closely related fields of machine learning and data mining. Apart from six plenary invited talks, four invited talks for the industrial session, a demo session, six tutorials and eleven co-located workshops, the main technical sessions comprised the presentation of 121 peer-reviewed papers selected by the program committee from 599 full-paper submissions. ECML PKDD 2011 was a highly selective conference and the proceedings were published in three volumes of the Springer’s Lecture Notes in Artificial Intelligence series (Gunopulos et al. 2011a, 2011b, 2011c). Authors of the best ten machine learning papers presented at the conference were invited to submit a significantly extended version of their paper to this special issue. The selection was made by the Program Chairs on the basis of their exceptional scientific quality and high impact on the field, as indicated by conference reviewers. In this special issue you will find seven papers which have been accepted after two or three rounds of peer-reviewing according to the journal criteria. The diversity of topics addressed in these papers reflects the significant progress being made by the machine learning community in the theoretical understanding of the principles underlying knowledge discov-

2 citations


Book ChapterDOI
29 May 2012
TL;DR: This work develops a set of features which are combined in a scoring function to select the named entity of the Web page owner, and formulates the problem as a classification problem in which a pair of a Web page and a named entity is classified as being associated or not.
Abstract: Entity-based applications, such as expert search or online social networks where users search for persons, require high-quality datasets of named entity references. Obtaining such high-quality datasets can be achieved by automatically extracting metadata from Web pages. In this work, we focus on the identification of the named entity that corresponds to the owner of a particular Web page, for example, a home page or an organizational staff Web page. More specifically, from a set of named entities that have already been extracted from a Web page, we identify the one which corresponds to the owner of the home page. First, we develop a set of features which are combined in a scoring function to select the named entity of the Web page owner. Second, we formulate the problem as a classification problem in which a pair of a Web page and named entity is classified as being associated or not. We evaluate the proposed approaches on a set of Web pages in which we have previously identified named entities. Our experimental results show that we can identify the named entity corresponding to the owner of a home page with accuracy over 90%.
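A small sketch of the scoring-function approach, with a few hand-picked features (presence in the page title, overlap with the URL, mention frequency, position of the first mention). The feature set and weights below are illustrative assumptions, not the ones evaluated in the paper:

```python
import re

def owner_score(entity, page_title, page_url, page_text):
    """Score a candidate named entity as the likely owner of the page."""
    text = page_text.lower()
    tokens = [t for t in re.split(r"\W+", entity.lower()) if t]
    in_title = any(t in page_title.lower() for t in tokens)
    in_url = any(t in page_url.lower() for t in tokens)
    freq = text.count(entity.lower())
    first = text.find(entity.lower())
    early = 1.0 - first / max(len(text), 1) if first >= 0 else 0.0
    return 2.0 * in_title + 1.5 * in_url + 0.5 * freq + early

def pick_owner(candidates, page_title, page_url, page_text):
    return max(candidates, key=lambda e: owner_score(e, page_title, page_url, page_text))
```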

2 citations



Journal ArticleDOI
TL;DR: The experimental results demonstrate that the long time states are sensitive to initial conditions and exhibit the following patterns: Homogeneous: the market prices of the two goods are equal and the buyers split their budget equally between the goods.
Abstract: In this paper, we study the effect of diffusion on the evolution of a market consisting of two infinitely divisible goods and buyers with constant elasticity of substitution (CES) utility functions. In consecutive time periods, the buyers' preferences depend on the actions taken by their neighbours in the network. We investigate the properties of the long time states, where a market state is defined by the market equilibrium prices and goods allocation. The experimental results demonstrate that the long time states are sensitive to initial conditions and exhibit the following patterns. Homogeneous: the market prices of the two goods are equal and the buyers split their budget equally between the goods. Heterogeneous: the buyers' bids on the two goods differ. Periodic: the buyers' bids oscillate with stable oscillation width. Moreover, we present the critical values where a phase transition occurs between homogeneous, heterogeneous and periodic states.
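For reference, the standard form of the CES utility and the induced spending shares in a Fisher-type market are given below (textbook notation; the paper's own notation and the neighbour-driven update of the buyers' preferences may differ):

```latex
\begin{align}
u_i(x_{i1}, x_{i2}) &= \bigl(\alpha_{i1}\, x_{i1}^{\rho} + \alpha_{i2}\, x_{i2}^{\rho}\bigr)^{1/\rho},
  \qquad \rho < 1,\ \rho \neq 0,\ \sigma = \tfrac{1}{1-\rho}, \\
b_{ij} &= m_i \,
  \frac{\alpha_{ij}^{\sigma}\, p_j^{\,1-\sigma}}{\sum_{k} \alpha_{ik}^{\sigma}\, p_k^{\,1-\sigma}}
  \qquad \text{(spending of buyer $i$ with budget $m_i$ on good $j$ at prices $p$),} \\
p_j &= \sum_i b_{ij} \qquad \text{(market-clearing prices for unit supply of each good).}
\end{align}
```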

Proceedings ArticleDOI
16 Sep 2012
TL;DR: S-Suite is proposed, a fully implemented and operational SOA architecture that mediates between brokers and service providers; it handles the full life cycle of reservations, enabling automatic reservation treatment and incorporating the most advanced functional features demanded by brokers and service providers.
Abstract: The car rental business is one with substantial budgets due to its popularity in tourism and business travel worldwide. The broker / service provider model is the dominant one, with brokers searching and negotiating with several providers for each reservation request. This implies a workload that would overwhelm the participating parties. Moreover, the reservation life cycle in this model is a complex process involving exhaustive details and constraints that have to be met before a reservation is confirmed and deployed. In this paper we propose S-Suite, a fully implemented and operational SOA architecture that mediates between brokers and service providers. It handles the full life cycle of reservations, enabling automatic reservation treatment and incorporating the most advanced functional features demanded by brokers and service providers. The benefits of the system are multiple: (a) efficiency and transparency, and (b) optimal matching between reservation demands and service offers at a local level.
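The local matching mentioned above can be pictured with the toy sketch below, where each reservation request is matched to the cheapest provider offer that satisfies its constraints. The deployed system handles a far richer life cycle (negotiation, confirmation, cancellation), and all field names here are hypothetical:

```python
def match_reservations(requests, offers):
    """requests: list of dicts with id, location, group; offers: list of dicts
    with provider, location, group, cars, daily_rate."""
    matches = {}
    for req in requests:
        feasible = [o for o in offers
                    if o["location"] == req["location"]
                    and o["group"] == req["group"]
                    and o["cars"] > 0]
        if feasible:
            best = min(feasible, key=lambda o: o["daily_rate"])
            matches[req["id"]] = best["provider"]
            best["cars"] -= 1               # consume one car from that provider's offer
    return matches
```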