
Showing papers in "SIGKDD Explorations" (2011)


Journal ArticleDOI
TL;DR: This work describes and evaluates a system that uses phone-based accelerometers to perform activity recognition, the task of identifying the physical activity a user is performing; the system has a wide range of applications, including automatic customization of the mobile device's behavior based upon the user's activity.
Abstract: Mobile devices are becoming increasingly sophisticated, and the latest generation of smart cell phones now incorporates many diverse and powerful sensors. These sensors include GPS sensors, vision sensors (i.e., cameras), audio sensors (i.e., microphones), light sensors, temperature sensors, direction sensors (i.e., magnetic compasses), and acceleration sensors (i.e., accelerometers). The availability of these sensors in mass-marketed communication devices creates exciting new opportunities for data mining and data mining applications. In this paper we describe and evaluate a system that uses phone-based accelerometers to perform activity recognition, a task which involves identifying the physical activity a user is performing. To implement our system we collected labeled accelerometer data from twenty-nine users as they performed daily activities such as walking, jogging, climbing stairs, sitting, and standing, and then aggregated this time series data into examples that summarize the user activity over 10-second intervals. We then used the resulting training data to induce a predictive model for activity recognition. This work is significant because the activity recognition model permits us to gain useful knowledge about the habits of millions of users passively, just by having them carry cell phones in their pockets. Our work has a wide range of applications, including automatic customization of the mobile device's behavior based upon a user's activity (e.g., sending calls directly to voicemail if a user is jogging) and generating a daily/weekly activity profile to determine if a user (perhaps an obese child) is performing a healthy amount of exercise.

2,417 citations
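
As a rough illustration of the pipeline this abstract describes, the sketch below windows synthetic tri-axial accelerometer data into 10-second examples, extracts simple per-axis summary features, and trains a classifier. The 20 Hz sampling rate, the features, and the random-forest model are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch of the windowing-and-classify pipeline described above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
SAMPLE_HZ = 20            # assumed sampling rate (not from the paper)
WINDOW = 10 * SAMPLE_HZ   # 10-second windows, as in the abstract

def make_signal(freq, noise, n_windows):
    """Generate toy tri-axial accelerometer data for one 'activity'."""
    t = np.arange(n_windows * WINDOW) / SAMPLE_HZ
    base = np.sin(2 * np.pi * freq * t)
    return np.stack([base + noise * rng.standard_normal(t.size)
                     for _ in range(3)], axis=1)   # shape (samples, 3 axes)

def window_features(signal):
    """Summarize each 10-second window by per-axis mean and std."""
    wins = signal.reshape(-1, WINDOW, 3)
    return np.hstack([wins.mean(axis=1), wins.std(axis=1)])

# Two toy activities, e.g. 'walking' (slow) vs 'jogging' (faster, noisier).
X = np.vstack([window_features(make_signal(1.0, 0.3, 50)),
               window_features(make_signal(2.5, 0.6, 50))])
y = np.array([0] * 50 + [1] * 50)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```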


Journal ArticleDOI
TL;DR: An overview is given of the state-of-the-art privacy-preserving techniques for continuous LBS and trajectory data publication, two problems that have increasingly drawn attention from the research community and industry.
Abstract: The ubiquity of mobile devices with global positioning functionality (e.g., GPS and AGPS) and Internet connectivity (e.g., 3G and Wi-Fi) has resulted in widespread development of location-based services (LBS). Typical examples of LBS include local business search, e-marketing, social networking, and automotive traffic monitoring. Although LBS provide valuable services for mobile users, revealing their private locations to potentially untrusted LBS providers poses privacy concerns. In general, there are two types of LBS, namely, snapshot and continuous LBS. For snapshot LBS, a mobile user only needs to report its current location to a service provider once to get its desired information. On the other hand, a mobile user has to report its location to a service provider in a periodic or on-demand manner to obtain its desired continuous LBS. Protecting user location privacy for continuous LBS is more challenging than for snapshot LBS because adversaries may use the spatial and temporal correlations in the user's location samples to infer the user's location information with higher certainty. Such user location trajectories are also very important for many applications, e.g., business analysis, city planning, and intelligent transportation. However, publishing such location trajectories to the public or a third party for data analysis could pose serious privacy concerns. Privacy protection in continuous LBS and trajectory data publication has increasingly drawn attention from the research community and industry. In this survey, we give an overview of the state-of-the-art privacy-preserving techniques for these two problems.

210 citations


Journal ArticleDOI
TL;DR: This essay presents several important and under-discussed challenges to using active learning well in practice, and encourages researchers in active learning to focus even more attention on how practitioners might actually use active learning.
Abstract: Despite the tremendous level of adoption of machine learning techniques in real-world settings, and the large volume of research on active learning, active learning techniques have been slow to gain substantial traction in practical applications. This slow adoption is at odds with active learning's promise of reduced model-development costs and increased performance on a model-development budget. This essay presents several important and under-discussed challenges to using active learning well in practice. We hope this paper can serve as a call to arms for researchers in active learning: an encouragement to focus even more attention on how practitioners might actually use active learning.

93 citations


Journal ArticleDOI
TL;DR: PiRi, a privacy-aware framework for participatory sensing (PS) systems, is proposed; it enables participation of users without compromising their privacy, and extensive experiments verify the efficiency of the approach.
Abstract: With the abundance and ubiquity of mobile devices, a new class of applications is emerging, called participatory sensing (PS), where people can contribute data (e.g., images, video) collected by their mobile devices to central data servers. However, privacy concerns are becoming a major impediment to the success of many participatory sensing systems. While several privacy-preserving techniques exist in the context of conventional location-based services, they are not directly applicable to PS systems because of the extra information that PS systems can collect from their participants. In this paper, we formally define the problem of privacy in PS systems and identify its unique challenges, assuming an untrusted central data server model. We propose PiRi, a privacy-aware framework for PS systems that enables participation of the users without compromising their privacy. Our extensive experiments verify the efficiency of our approach.

81 citations


Journal ArticleDOI
TL;DR: This work addresses the topic-based community key-member extraction problem, for which the proposed method combines both text mining and social network analysis techniques.
Abstract: The study of extremist groups and their interaction is a crucial task in order to maintain homeland security and peace. Tools such as social network analysis and text mining have contributed to their understanding in order to develop counter-terrorism applications. This work addresses the topic-based community key-member extraction problem, for which our method combines both text mining and social network analysis techniques. This is achieved by first applying latent Dirichlet allocation to build two topic-based social networks in online forums: one oriented towards the thread creator's point of view, and the other oriented towards the repliers of the overall forum. Then, using different network analysis measures, topic-based key members are evaluated, using as a benchmark a social network built from a plain representation of the network of posts. Experiments were successfully performed using an English-language forum available in the Dark Web portal.

81 citations
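
A loose sketch of the combined text-mining/social-network-analysis idea: assign forum posts to LDA topics, link users who post in the same topic, and rank members by a centrality measure. The toy data, the edge-construction rule, and the centrality choice are illustrative assumptions; the paper's creator-oriented and replier-oriented networks are more elaborate.

```python
# Illustrative sketch: LDA topics over posts, then centrality per user.
import itertools

import networkx as nx
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

posts = [("alice", "attack plans border crossing"),
         ("bob",   "border crossing routes maps"),
         ("carol", "religious texts interpretation"),
         ("dave",  "texts interpretation debate"),
         ("alice", "maps routes logistics")]

users, texts = zip(*posts)
X = CountVectorizer().fit_transform(texts)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_of_post = lda.fit_transform(X).argmax(axis=1)

# Connect users whose posts fall in the same topic, then rank by centrality.
g = nx.Graph()
g.add_nodes_from(set(users))
for topic in range(2):
    active = sorted({u for u, t in zip(users, topic_of_post) if t == topic})
    g.add_edges_from(itertools.combinations(active, 2))

for user, score in sorted(nx.degree_centrality(g).items(), key=lambda kv: -kv[1]):
    print(f"{user}: {score:.2f}")   # higher score = candidate key member
```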


Journal ArticleDOI
TL;DR: Several anonymity models are introduced, then some of the proposed techniques to enforce trajectory anonymity are described, discussing their merits and limitations, and challenging open problems that need attention are identified.
Abstract: Recent years have witnessed pervasive use of location-aware devices such as GSM mobile phones, GPS-enabled PDAs, location sensors, and active RFID tags. The use of these devices generates a huge collection of spatio-temporal data, variously called moving object data, trajectory data, or mobility data. These data can be used for various data analysis purposes such as city traffic control, mobility management, urban planning, and location-based service advertisements. Clearly, the spatio-temporal data so collected may help an attacker to discover personal and sensitive information like user habits, social customs, and religious and sexual preferences of individuals. Consequently, it raises serious concerns about privacy. Simply replacing users' real identifiers (name, SSN, etc.) with pseudonyms is insufficient to guarantee anonymity. The problem is that, due to the existence of quasi-identifiers, i.e., spatio-temporal data points that can be linked to external information to re-identify individuals, the attacker may be able to trace the anonymous spatio-temporal data back to individuals. In this survey, we discuss recent advances in anonymity-preserving data publishing of moving object databases in an off-line fashion. We first introduce several anonymity models; then we describe in detail some of the proposed techniques to enforce trajectory anonymity, discussing their merits and limitations. We conclude by identifying challenging open problems that need attention.

72 citations


Journal ArticleDOI
Ron Kohavi, Roger Longbotham
TL;DR: This work shares several real examples of unexpected results and lessons learned from online controlled experiments, which are now run frequently at large scale using software capabilities like ramp-up (exposure control) on server farms with millions of users.
Abstract: Controlled experiments, also called randomized experiments and A/B tests, have had a profound influence on multiple fields, including medicine, agriculture, manufacturing, and advertising. Offline controlled experiments have been well studied and documented since Sir Ronald A. Fisher led the development of statistical experimental design while working at the Rothamsted Agricultural Experimental Station in England in the 1920s. With the growth of the World Wide Web and web services, online controlled experiments are being used frequently, utilizing software capabilities like ramp-up (exposure control) and running experiments on large server farms with millions of users. We share several real examples of unexpected results and lessons learned.

68 citations
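
As a minimal worked example of analyzing an online controlled experiment of the kind the abstract discusses, the sketch below runs a standard two-proportion z-test comparing conversion rates in control (A) and treatment (B). The counts are made up for illustration, and the test choice is an assumption, not the authors' methodology.

```python
# Two-proportion z-test for a hypothetical A/B test.
from math import sqrt
from statistics import NormalDist

conv_a, n_a = 1_210, 100_000   # control: conversions / users (made up)
conv_b, n_b = 1_318, 100_000   # treatment: conversions / users (made up)

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)           # pooled rate under H0
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))       # two-sided test

print(f"lift={p_b - p_a:+.4%}, z={z:.2f}, p={p_value:.4f}")
```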


Journal ArticleDOI
TL;DR: The aim of the paper is to provide a brief survey of the attack scenarios, the privacy guarantees, and the data transformations employed to protect user privacy in real time, and to provide an overview of the cases that are covered by existing research.
Abstract: The rapid advance in handheld communication devices and the appearance of smartphones have allowed users to connect to the Internet and surf the WWW while they are moving around the city or traveling. Location-based services have been developed to deliver content that is adjusted to the current user location. Social networks have also responded to the challenge of users who can access the Internet from any place in the city, and location-based social networks like Foursquare have become very popular in a short period of time. The popularity of these applications is linked to the significant advantages they offer: users can exploit live location-based information to take dynamic decisions on issues like transportation, identification of places of interest, or even the opportunity to meet a friend or an associate in a nearby location. A side effect of sharing location-based information is that it exposes the user to substantial privacy-related threats. Revealing the user's location carelessly can prove to be embarrassing, harmful professionally, or even dangerous. Research in the data management field has put significant effort into anonymization techniques that obfuscate spatial information in order to hide the identity of the user or her exact location. Privacy guarantees and anonymization algorithms become increasingly sophisticated, offering better and more efficient protection in data publishing and data exchange. Still, it is not yet clear what the greatest dangers to user privacy are and which are the most realistic privacy-breaching scenarios. The aim of this paper is to provide a brief survey of the attack scenarios, the privacy guarantees, and the data transformations employed to protect user privacy in real time. The paper focuses mostly on providing an overview of the privacy models investigated in the literature and less on the algorithms and their scaling capabilities. The models and the attack scenarios are classified and compared in order to provide an overview of the cases that are covered by existing research.

45 citations


Journal ArticleDOI
TL;DR: This paper finds that LSI yields poor retrieval accuracy on the TREC 2, 7, 8, and 2004 collections, and derives novel scoring methods that implement the ideas of query expansion and score regularization in the LSI framework.
Abstract: The aim of latent semantic indexing (LSI) is to uncover the relationships between terms, hidden concepts, and documents. LSI uses the matrix factorization technique known as singular value decomposition (SVD). In this paper, we apply LSI to standard benchmark collections. We find that LSI yields poor retrieval accuracy on the TREC 2, 7, 8, and 2004 collections. We believe that the negative result is robust, because we try more LSI variants than any previous work. First, we show that using Okapi BM25 weights for terms in documents improves the performance of LSI. Second, we derive novel scoring methods that implement the ideas of query expansion and score regularization in the LSI framework. Third, we show how to combine the BM25 method with LSI methods. All proposed methods are evaluated experimentally on the four TREC collections mentioned above. The experiments show that the new variants of LSI improve upon previous LSI methods. Nevertheless, no way of using LSI achieves a worthwhile improvement in retrieval accuracy over BM25.

40 citations
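
A bare-bones LSI sketch under simple assumptions: factor a raw term-document count matrix with truncated SVD and rank documents by cosine similarity to a query in the latent space. It uses plain counts rather than the BM25 weighting the paper advocates; the corpus and parameters are toy examples, not the paper's setup.

```python
# Minimal LSI retrieval: SVD of the term-document matrix, then cosine ranking.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat",
        "dogs and cats are pets",
        "stock markets fell sharply",
        "investors sold shares in the market"]

vec = CountVectorizer()
X = vec.fit_transform(docs)              # document-term counts
svd = TruncatedSVD(n_components=2, random_state=0)
doc_latent = svd.fit_transform(X)        # documents in the latent space

query = vec.transform(["cats and dogs"])
q_latent = svd.transform(query)          # project the query the same way
scores = cosine_similarity(q_latent, doc_latent).ravel()
print("ranking:", scores.argsort()[::-1])   # best-matching docs first
```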


Journal ArticleDOI
TL;DR: It is shown that CV creates predictions that have an 'inverse' ranking with AUC well below 0.25 using features that were initially entirely unpredictive and models that can only perform monotonic transformations.
Abstract: A number of times when using cross-validation (CV) for classification/probability estimation, we have observed surprisingly low AUCs on real data with very few positive examples. AUC, the area under the ROC curve, measures ranking ability: it is the probability that a positive example receives a higher model score than a negative example. Intuition seems to suggest that no reasonable methodology should ever result in a model with an AUC significantly below 0.5. The focus of this paper is not on the estimator properties of CV (bias/variance/significance), but rather on the properties of the 'holdout' predictions from which the CV performance of a model is calculated. We show that CV creates predictions that have an 'inverse' ranking with AUC well below 0.25, using features that were initially entirely unpredictive and models that can only perform monotonic transformations. In the extreme, combining CV with bagging (repeated averaging of out-of-sample predictions) generates 'holdout' predictions with perfectly opposite rankings on random data. While this would raise immediate suspicion upon inspection, we would like to caution the data mining community against using CV for stacking or in currently popular ensemble methods. These methods can reverse the predictions by assigning negative weights, producing a model that appears to have close to perfect predictability when in reality the data was random.

26 citations
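
The inverse-ranking effect is easy to reproduce. Below is a minimal simulation, assuming leave-one-out CV and a 'model' that can only output the training fold's base rate: holding out a positive lowers the training mean, so every negative outscores every positive and the holdout AUC collapses toward 0. This is a sketch of the phenomenon, not the paper's exact experiments.

```python
# Leave-one-out CV on purely random labels yields inversely ranked holdout
# predictions when the model just outputs the training fold's positive rate.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, n_pos = 100, 10                       # few positives, as in the paper
y = np.zeros(n)
y[rng.choice(n, n_pos, replace=False)] = 1

# Holding out example i: the 'model' predicts the mean label of the rest.
preds = np.array([(y.sum() - y[i]) / (n - 1) for i in range(n)])

# Positives get (n_pos-1)/(n-1), negatives get n_pos/(n-1): perfectly inverted.
print("AUC of holdout predictions:", roc_auc_score(y, preds))  # 0.0
```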


Journal ArticleDOI
TL;DR: The Eighth Workshop on Mining and Learning with Graphs (MLG) was held at KDD 2010 in Washington DC and brought together a variety of researchers interested in analyzing data that is best represented as a graph.
Abstract: The Eighth Workshop on Mining and Learning with Graphs (MLG) was held at KDD 2010 in Washington DC. It brought together a variety of researchers interested in analyzing data that is best represented as a graph. Examples include the WWW, social networks, biological networks, communication networks, and many others. The importance of being able to effectively mine and learn from such data is growing, as more and more structured and semi-structured data becomes available. This is a problem across widely different fields such as economics, statistics, social science, physics, and computer science, and it is studied within a variety of sub-disciplines of machine learning and data mining, including graph mining, graphical models, kernel theory, and statistical relational learning. The objective of this workshop was to bring together practitioners from these various fields and areas to foster a rich discussion of which problems we work on, how we frame them in the context of graphs, which tools and algorithms we apply, and our general findings and lessons learned. This year's workshop was very successful, with well over 100 attendees, excellent keynote speakers, and strong papers. This is a rapidly growing area and we believe that this community is only in its infancy. We hope that readers will join us next year for MLG 2011.

Journal ArticleDOI
TL;DR: A framework for mining subjectively interesting pattern sets is suggested, based on encoding prior information in a model of the data miner's state of mind and on searching for a pattern set that is maximally informative while efficient to convey to the data miner.
Abstract: This paper suggests a framework for mining subjectively interesting pattern sets that is based on two components: (1) the encoding of prior information in a model for the data miner's state of mind; (2) the search for a pattern set that is maximally informative while efficient to convey to the data miner. We illustrate the framework with an instantiation for tile patterns in binary databases where prior information on the row and column marginals is available. This approach implements step (1) above by constructing the MaxEnt model with respect to the prior information [2, 3], and step (2) by relying on concepts from information and coding theory. We provide a brief overview of a number of possible extensions and future research challenges, including a key challenge related to the design of empirical evaluations for subjective interestingness measures.
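
A hedged sketch of step (1): under the usual independence assumptions, the MaxEnt model for a binary database with given expected row and column sums factorizes into independent Bernoullis with p_ij = sigmoid(a_i + b_j), and the parameters can be fitted by gradient ascent on the margin-matching conditions. The fitting procedure and learning rate below are illustrative, not the method of [2, 3].

```python
# Fit the row/column-marginal MaxEnt background model for a binary matrix.
import numpy as np

def maxent_tile_model(row_sums, col_sums, iters=2000, lr=0.1):
    a = np.zeros(len(row_sums))          # one parameter per row
    b = np.zeros(len(col_sums))          # one parameter per column
    for _ in range(iters):
        p = 1 / (1 + np.exp(-(a[:, None] + b[None, :])))
        a += lr * (row_sums - p.sum(axis=1))   # match expected row sums
        b += lr * (col_sums - p.sum(axis=0))   # match expected column sums
    return p

D = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1]])
p = maxent_tile_model(D.sum(axis=1), D.sum(axis=0))
print(np.round(p, 2))   # cell probabilities under the background model
```

Tiles whose density deviates strongly from these background probabilities are then the subjectively surprising, i.e., informative, ones.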

Journal ArticleDOI
TL;DR: It is concluded that analyzing a Web page's DOM-structure is not sufficient for the general list finding task.
Abstract: The discovery and extraction of general lists on the Web continues to be an important problem facing the Web mining community. There have been numerous studies that claim to automatically extract structured data (i.e. lists, record sets, tables, etc.) from the Web for various purposes. Our own recent experiences have shown that the list-finding methods used as part of these larger frameworks do not generalize well and therefore ought to be reevaluated. This paper briefly describes some of the current approaches, and tests them on various list-pages. Based on our findings, we conclude that analyzing a Web page's DOM-structure is not sufficient for the general list finding task.
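
To make the critique concrete, here is the kind of DOM-only extraction the paper evaluates, sketched with BeautifulSoup (an assumed tool, not necessarily what the surveyed systems use): it finds markup lists but is blind to 'visual' lists built from styled <div>s, which is exactly the paper's point.

```python
# DOM-structure list finding: scan for list-like tags only.
from bs4 import BeautifulSoup

html = """
<ul><li>alpha</li><li>beta</li></ul>
<div class="row">gamma</div><div class="row">delta</div>
"""
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all(["ul", "ol", "table"]):
    items = [li.get_text(strip=True) for li in tag.find_all("li")]
    print("found list:", items)          # only the <ul> is detected

# The two styled <div> rows form a visual list this DOM scan never sees.
```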

Journal ArticleDOI
TL;DR: The issues related to trace norm minimization are analyzed, with the unexpected finding that trace norm minimization often does not work as well as expected.
Abstract: In recent years, compressive sensing has attracted intensive attention in the fields of statistics, automatic control, data mining, and machine learning. It assumes sparsity of the dataset and proposes that the whole dataset can be reconstructed from just a small set of observed samples. One important approach in compressive sensing is trace norm minimization, which can minimize the rank of the data matrix under some conditions. For example, in collaborative filtering, we are given a small set of observed item ratings from some users and want to predict the missing values in the rating matrix. It is assumed that the users' ratings are affected by only a few factors, so the resulting rating matrix should be of low rank. In this paper, we analyze the issues related to trace norm minimization and find, unexpectedly, that it often does not work as well as expected.
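
For concreteness, here is a compact sketch of trace-norm-motivated matrix completion in the singular-value-thresholding style: iteratively shrink the singular values of the current estimate and reimpose the observed entries. The shrinkage threshold and iteration count are illustrative assumptions, not the exact algorithms the paper analyzes.

```python
# Matrix completion via iterative singular value shrinkage (a sketch).
import numpy as np

def complete(M, observed, tau=2.0, iters=200):
    X = np.where(observed, M, 0.0)       # start from the observed entries
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = U @ np.diag(np.maximum(s - tau, 0)) @ Vt   # shrink singular values
        X[observed] = M[observed]                      # keep known ratings
    return X

rng = np.random.default_rng(0)
true = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 20))  # rank 3
mask = rng.random(true.shape) < 0.5                                 # 50% observed
est = complete(true, mask)
print("RMSE on missing entries:",
      np.sqrt(np.mean((est - true)[~mask] ** 2)))
```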


Journal ArticleDOI
TL;DR: This report provides a description of the first international workshop on this emerging topic --- SIGKDD MultiClust10: Discovering, Summarizing and Using Multiple Clusterings, which was held in Washington DC, on July 25th 2010.
Abstract: Traditional clustering focuses on finding a single best clustering solution from data. However, a single data set can often be interpreted in different ways. This is particularly true of the complex data that has become prevalent in the data mining community: text, video, images, and biological data, to name a few. It is thus of practical interest to find all possible alternative and interesting clustering solutions in the data. Recently there has been increasing interest in developing algorithms to discover multiple clustering solutions from complex data. This report provides a description of the first international workshop on this emerging topic --- SIGKDD MultiClust10: Discovering, Summarizing and Using Multiple Clusterings, which was held in Washington DC on July 25th, 2010. The workshop program consisted of three invited talks and presentations of four full research papers and three short papers.

Journal ArticleDOI
TL;DR: A general model to achieve resource and quality awareness for stream mining algorithms in dynamic setups is proposed by classifying influencing parameters and quality measures as components of a multiobjective optimization problem.
Abstract: Data streams have become ubiquitous in recent years and are handled on a variety of platforms, ranging from dedicated high-end servers to battery-powered mobile sensors. Data stream processing is therefore required to work under virtually any dynamic resource constraints. Few stream mining algorithms are capable of adapting to given constraints, and none of them relates the resource adaptation to the resulting output quality. In this paper, we propose a general model to achieve resource and quality awareness for stream mining algorithms in dynamic setups. General applicability is ensured by classifying influencing parameters and quality measures as components of a multiobjective optimization problem. Using CluStream as an example algorithm, we demonstrate the practicability of the proposed model.

Journal ArticleDOI
TL;DR: This work tests whether the performance of a regular pairwise classifier can be improved when additional information about the hierarchical class structure is added to the training sets, and surprisingly, the additional information seems to hurt the performance.
Abstract: The goal of this work was to test whether the performance of a regular pairwise classifier can be improved when additional information about the hierarchical class structure is added to the training sets. Somewhat surprisingly, the additional information seems to hurt performance. We attribute this to the fact that the structure of the class hierarchy is not reflected in the distribution of the instances.


Journal ArticleDOI
TL;DR: A summary of the workshop on Useful Patterns (UP'10) held in conjunction with the ACM SIGKDD 2010, on July 25th in Washington, DC, USA, is provided.
Abstract: We provide a summary of the workshop on Useful Patterns (UP'10) held in conjunction with the ACM SIGKDD 2010, on July 25th in Washington, DC, USA. We report in detail on the motivation, goals, and the research issues addressed in the talks at this full-day workshop. More information can be found at: http://www.usefulpatterns.org.

Journal ArticleDOI
TL;DR: This paper considers fully decentralized algorithms that locally answer continuous aggregate queries on the number of distinct events, the total number of events, and the second frequency moment in distributed data streams.
Abstract: In several emerging applications, data is collected in massive streams at several distributed points of observation. A basic and challenging task is to allow every node to monitor a neighbourhood of interest by issuing continuous aggregate queries on the streams observed in its vicinity. This class of algorithms is fully decentralized and diffusive in nature: collecting all data at a few central nodes of the network is infeasible in networks of low-capability devices or in the presence of massive data sets. The main difficulty in designing diffusive algorithms is to cope with duplicate detections. These arise both from the observation of the same event at several nodes of the network and from receipt of the same aggregated information along multiple paths of diffusion. In this paper, we consider fully decentralized algorithms that locally answer continuous aggregate queries on the number of distinct events, the total number of events, and the second frequency moment in the scenario outlined above. The proposed algorithms use sublinear space at every node, in the worst case or under realistic distributions. We also propose strategies that minimize the communication needed to update the aggregates when new events are observed. We experimentally evaluate the efficiency and accuracy of our algorithms in realistic simulated scenarios.
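
One standard way to cope with the duplicate-detection problem the abstract highlights is a duplicate-insensitive sketch. The toy Flajolet-Martin-style distinct counter below is a sketch under assumptions (the hash and constants are illustrative), not the paper's algorithm: adding an event twice is a no-op, and sketches arriving over multiple diffusion paths merge by bitwise OR without double counting.

```python
# Duplicate-insensitive distinct counting with a tiny FM-style bitmap sketch.
import hashlib

class FMSketch:
    def __init__(self, bits=32):
        self.bitmap = 0
        self.bits = bits

    def _rho(self, event):
        h = int(hashlib.sha1(event.encode()).hexdigest(), 16)
        h |= 1 << self.bits               # guard bit caps the scan
        return (h & -h).bit_length() - 1  # index of the lowest set bit

    def add(self, event):                 # idempotent: duplicates are no-ops
        self.bitmap |= 1 << self._rho(event)

    def merge(self, other):               # OR-merge handles multi-path receipt
        self.bitmap |= other.bitmap

    def estimate(self):
        r = 0
        while self.bitmap >> r & 1:       # first unset bit position
            r += 1
        return int(2 ** r / 0.77351)      # classic FM correction constant

a, b = FMSketch(), FMSketch()
for e in ["e1", "e2", "e3", "e1"]:        # "e1" observed twice: no effect
    a.add(e)
for e in ["e2", "e4"]:
    b.add(e)
a.merge(b)                                # same info via two paths is safe
print("distinct estimate:", a.estimate())
```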

Journal ArticleDOI
TL;DR: A report is provided for the ACM SIGKDD community about the 2010 Workshop on Algorithms for Modern Massive Data Sets (MMDS 2010), its origin in MMDS 2006 and MMDS 2008, and future directions for this interdisciplinary research area.
Abstract: A report is provided for the ACM SIGKDD community about the 2010 Workshop on Algorithms for Modern Massive Data Sets (MMDS 2010), its origin in MMDS 2006 and MMDS 2008, and future directions for this interdisciplinary research area.

Journal ArticleDOI
TL;DR: This report summarizes the First International Workshop on Novel Data Stream Pattern Mining held at the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, on July 25 2010 in Washington, DC.
Abstract: This report summarizes the First International Workshop on Novel Data Stream Pattern Mining held at the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, on July 25 2010 in Washington, DC.