
Showing papers presented at "Knowledge Discovery and Data Mining in 2006"


Proceedings ArticleDOI
20 Aug 2006
TL;DR: A Cutting-Plane Algorithm for training linear SVMs that provably has training time O(sn) for classification problems and O(sn log(n)) for ordinal regression problems, and is several orders of magnitude faster than decomposition methods like SVMlight for large datasets.
Abstract: Linear Support Vector Machines (SVMs) have become one of the most prominent machine learning techniques for high-dimensional sparse data commonly encountered in applications like text classification, word-sense disambiguation, and drug design. These applications involve a large number of examples n as well as a large number of features N, while each example has only s non-zero features.
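The cutting-plane training loop behind this bound can be sketched briefly: aggregate all examples that currently violate the margin into a single "most violated" constraint, add it to a working set, and re-solve a small problem over that set until no constraint is violated by more than a tolerance. The sketch below is only an illustration under my own simplifications; the inner solver is a crude subgradient routine standing in for the paper's small QP, and all names are mine.

```python
import numpy as np

def most_violated_constraint(w, X, y):
    """Aggregate all examples with margin < 1 into one 'cutting plane'."""
    c = (y * (X @ w) < 1.0).astype(float)           # violated examples
    a = (c[:, None] * y[:, None] * X).mean(axis=0)  # constraint normal
    b = c.mean()                                    # constraint offset
    return a, b

def solve_reduced_problem(constraints, dim, C, steps=2000, lr=0.01):
    """Subgradient solver for min 1/2||w||^2 + C * max_k max(0, b_k - w.a_k).
    (A stand-in for the small QP solved in the real algorithm.)"""
    w = np.zeros(dim)
    for t in range(steps):
        viol = [b - w @ a for a, b in constraints]
        k = int(np.argmax(viol))
        g = w.copy()
        if viol[k] > 0:
            g -= C * constraints[k][0]
        w -= lr / np.sqrt(t + 1) * g
    return w

def cutting_plane_svm(X, y, C=1.0, eps=1e-3, max_iter=50):
    n, d = X.shape
    w = np.zeros(d)
    constraints = []
    for _ in range(max_iter):
        a, b = most_violated_constraint(w, X, y)
        xi = max([bk - w @ ak for ak, bk in constraints], default=0.0)
        if b - w @ a <= max(xi, 0.0) + eps:   # nothing violated by more than eps
            break
        constraints.append((a, b))
        w = solve_reduced_problem(constraints, d, C)
    return w

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=200))
w = cutting_plane_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```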

2,173 citations


Proceedings ArticleDOI
20 Aug 2006
TL;DR: This work presents a method for "compressing" large, complex ensembles into smaller, faster models, usually without significant loss in performance.
Abstract: Often the best performing supervised learning models are ensembles of hundreds or thousands of base-level classifiers. Unfortunately, the space required to store this many classifiers, and the time required to execute them at run-time, prohibits their use in applications where test sets are large (e.g. Google), where storage space is at a premium (e.g. PDAs), and where computational power is limited (e.g. hearing aids). We present a method for "compressing" large, complex ensembles into smaller, faster models, usually without significant loss in performance.
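The compression idea is essentially teacher-student training: let the large ensemble label a pool of unlabeled (possibly synthetic) points and fit a small model to those labels. A rough sketch follows, assuming scikit-learn; the jittered pseudo-data and the decision-tree student are my simplifications, and the paper's synthetic-data generator and choice of compressed model differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 1. Train the large, accurate (but slow/heavy) ensemble.
ensemble = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# 2. Generate extra unlabeled points near the training data and label them
#    with the ensemble (a crude stand-in for the paper's synthetic-data step).
rng = np.random.default_rng(0)
X_unlabeled = X_tr + 0.05 * X_tr.std(axis=0) * rng.normal(size=X_tr.shape)
X_mimic = np.vstack([X_tr, X_unlabeled])
y_mimic = ensemble.predict(X_mimic)          # the ensemble acts as the teacher

# 3. Train a much smaller model to mimic the ensemble's labels.
compressed = DecisionTreeClassifier(max_depth=8, random_state=0).fit(X_mimic, y_mimic)

print("ensemble accuracy:  ", ensemble.score(X_te, y_te))
print("compressed accuracy:", compressed.score(X_te, y_te))
```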

2,091 citations


Proceedings ArticleDOI
20 Aug 2006
TL;DR: It is found that the propensity of individuals to join communities, and of communities to grow rapidly, depends in subtle ways on the underlying network structure, and decision-tree techniques are used to identify the most significant structural determinants of these properties.
Abstract: The processes by which communities come together, attract new members, and develop over time are a central research issue in the social sciences - political movements, professional organizations, and religious denominations all provide fundamental examples of such communities. In the digital domain, on-line groups are becoming increasingly prominent due to the growth of community and social networking sites such as MySpace and LiveJournal. However, the challenge of collecting and analyzing large-scale time-resolved data on social groups and communities has left most basic questions about the evolution of such groups largely unresolved: what are the structural features that influence whether individuals will join communities, which communities will grow rapidly, and how do the overlaps among pairs of communities change over time? Here we address these questions using two large sources of data: friendship links and community membership on LiveJournal, and co-authorship and conference publications in DBLP. Both of these datasets provide explicit user-defined communities, where conferences serve as proxies for communities in DBLP. We study how the evolution of these communities relates to properties such as the structure of the underlying social networks. We find that the propensity of individuals to join communities, and of communities to grow rapidly, depends in subtle ways on the underlying network structure. For example, the tendency of an individual to join a community is influenced not just by the number of friends he or she has within the community, but also crucially by how those friends are connected to one another. We use decision-tree techniques to identify the most significant structural determinants of these properties. We also develop a novel methodology for measuring movement of individuals between communities, and show how such movements are closely aligned with changes in the topics of interest within the communities.

2,001 citations


Proceedings ArticleDOI
20 Aug 2006
TL;DR: An LDA-style topic model is presented that captures not only the low-dimensional structure of data, but also how the structure changes over time, showing improved topics, better timestamp prediction, and interpretable trends.
Abstract: This paper presents an LDA-style topic model that captures not only the low-dimensional structure of data, but also how the structure changes over time. Unlike other recent work that relies on Markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps, and for each generated document, the mixture distribution over topics is influenced by both word co-occurrences and the document's timestamp. Thus, the meaning of a particular topic can be relied upon as constant, but the topics' occurrence and correlations change significantly over time. We present results on nine months of personal email, 17 years of NIPS research papers and over 200 years of presidential state-of-the-union addresses, showing improved topics, better timestamp prediction, and interpretable trends.

1,327 citations


Proceedings ArticleDOI
20 Aug 2006
TL;DR: The best performing methods are the ones based on random walks and "forest fire"; they match very accurately both static and evolutionary graph patterns, with sample sizes down to about 15% of the original graph.
Abstract: Given a huge real graph, how can we derive a representative sample? There are many known algorithms to compute interesting measures (shortest paths, centrality, betweenness, etc.), but several of them become impractical for large graphs. Thus graph sampling is essential. The natural questions to ask are (a) which sampling method to use, (b) how small can the sample size be, and (c) how to scale up the measurements of the sample (e.g., the diameter) to get estimates for the large graph. The deeper, underlying question is subtle: how do we measure success? We answer the above questions, and test our answers by thorough experiments on several diverse datasets, spanning thousands of nodes and edges. We consider several sampling methods, propose novel methods to check the goodness of sampling, and develop a set of scaling laws that describe relations between the properties of the original graph and the sample. In addition to the theoretical contributions, the practical conclusions from our work are: sampling strategies based on edge selection do not perform well; simple uniform random node selection performs surprisingly well. Overall, the best performing methods are the ones based on random walks and "forest fire"; they match very accurately both static and evolutionary graph patterns, with sample sizes down to about 15% of the original graph.
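For concreteness, here is a minimal sketch of one of the well-performing strategies, random-walk sampling: walk the graph, jump back to the starting node with a small restart probability, and keep the visited nodes until roughly 15% of the graph is covered. The parameter values and the stuck-walk handling are my own choices, not the paper's exact procedure.

```python
import random
import networkx as nx

def random_walk_sample(G, target_fraction=0.15, restart_prob=0.15, seed=0):
    """Sample nodes by a random walk that jumps back to its start node with
    probability `restart_prob`; keep walking until ~target_fraction of the
    nodes have been visited, then return the induced subgraph."""
    rng = random.Random(seed)
    target = int(target_fraction * G.number_of_nodes())
    start = rng.choice(list(G.nodes()))
    current, visited = start, {start}
    stuck = 0
    while len(visited) < target:
        if rng.random() < restart_prob:
            current = start
        else:
            nbrs = list(G.neighbors(current))
            current = rng.choice(nbrs) if nbrs else start
        if current in visited:
            stuck += 1
            if stuck > 100 * G.number_of_nodes():
                start = rng.choice(list(G.nodes()))   # walk got stuck; restart elsewhere
                stuck = 0
        else:
            visited.add(current)
            stuck = 0
    return G.subgraph(visited).copy()

G = nx.barabasi_albert_graph(5000, 3, seed=1)
S = random_walk_sample(G, target_fraction=0.15)
print(S.number_of_nodes(), "of", G.number_of_nodes(), "nodes sampled")
```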

1,290 citations


Proceedings ArticleDOI
20 Aug 2006
TL;DR: This work provides a new approach to evaluating the quality of clustering on words using class aggregate distribution and multi-peak distribution, provides new rules for updating $F, S, G$, and proves the convergence of these algorithms.
Abstract: Currently, most research on nonnegative matrix factorization (NMF) focuses on the 2-factor $X=FG^T$ factorization. We provide a systematic analysis of 3-factor $X=FSG^T$ NMF. While unconstrained 3-factor NMF is equivalent to unconstrained 2-factor NMF, constrained 3-factor NMF brings new features to constrained 2-factor NMF. We study the orthogonality constraint because it leads to a rigorous clustering interpretation. We provide new rules for updating $F, S, G$ and prove the convergence of these algorithms. Experiments on 5 datasets and a real world case study are performed to show the capability of bi-orthogonal 3-factor NMF on simultaneously clustering rows and columns of the input data matrix. We provide a new approach to evaluating the quality of clustering on words using class aggregate distribution and multi-peak distribution. We also provide an overview of various NMF extensions and examine their relationships.
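To make the factorization shape concrete, below is a generic multiplicative-update sketch for the unconstrained tri-factorization $X \approx FSG^T$ under a Frobenius objective. The paper's contribution is the bi-orthogonal variant with its own provably convergent update rules for $F, S, G$; those exact rules are not reproduced here, so treat this as a structural illustration only.

```python
import numpy as np

def tri_factor_nmf(X, k1, k2, n_iter=500, eps=1e-9, seed=0):
    """Unconstrained 3-factor NMF  X ~ F S G^T  via multiplicative updates
    (a generic sketch; the paper's bi-orthogonal variant adds orthogonality
    constraints on F and G and uses its own update rules)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    F = rng.random((n, k1))
    S = rng.random((k1, k2))
    G = rng.random((m, k2))
    for _ in range(n_iter):
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
    return F, S, G

X = np.abs(np.random.default_rng(1).normal(size=(60, 40)))
F, S, G = tri_factor_nmf(X, k1=5, k2=4)
print("relative reconstruction error:",
      np.linalg.norm(X - F @ S @ G.T) / np.linalg.norm(X))
```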

1,211 citations


Proceedings ArticleDOI
20 Aug 2006
TL;DR: Yale is described, a free open-source environment for KDD and machine learning which provides a rich variety of methods, allows rapid prototyping for new applications, makes costly re-implementations unnecessary, and offers extensive functionality for process evaluation and optimization.
Abstract: KDD is a complex and demanding task. While a large number of methods have been established for numerous problems, many challenges remain to be solved. New tasks emerge requiring the development of new methods or processing schemes. As in software development, the development of such solutions demands careful analysis, specification, implementation, and testing. Rapid prototyping is an approach which allows crucial design decisions to be made as early as possible. A rapid prototyping system should support maximal re-use and innovative combinations of existing methods, as well as simple and quick integration of new ones. This paper describes Yale, a free open-source environment for KDD and machine learning. Yale provides a rich variety of methods which allows rapid prototyping for new applications and makes costly re-implementations unnecessary. Additionally, Yale offers extensive functionality for process evaluation and optimization, which is a crucial property for any KDD rapid prototyping tool. Following the paradigm of visual programming eases the design of processing schemes. While the graphical user interface supports interactive design, the underlying XML representation enables automated applications after the prototyping phase. After a discussion of the key concepts of Yale, we illustrate the advantages of rapid prototyping for KDD on case studies ranging from data pre-processing to result visualization. These case studies cover tasks like feature engineering, text mining, data stream mining and tracking drifting concepts, ensemble methods, and distributed data mining. This variety of applications is also reflected in a broad user base; we counted more than 40,000 downloads during the last twelve months.

1,151 citations


Proceedings ArticleDOI
20 Aug 2006
TL;DR: A simple model of network growth is presented, characterizing users as either passive members of the network; inviters who encourage offline friends and acquaintances to migrate online; and linkers who fully participate in the social evolution of the network.
Abstract: In this paper, we consider the evolution of structure within large online social networks. We present a series of measurements of two such networks, together comprising in excess of five million people and ten million friendship links, annotated with metadata capturing the time of every event in the life of the network. Our measurements expose a surprising segmentation of these networks into three regions: singletons who do not participate in the network; isolated communities which overwhelmingly display star structure; and a giant component anchored by a well-connected core region which persists even in the absence of stars. We present a simple model of network growth which captures these aspects of component structure. The model follows our experimental results, characterizing users as either passive members of the network; inviters who encourage offline friends and acquaintances to migrate online; and linkers who fully participate in the social evolution of the network.

1,151 citations


Proceedings ArticleDOI
20 Aug 2006
TL;DR: This work presents a generic framework for clustering data over time, and discusses evolutionary versions of two widely-used clustering algorithms within this framework: k-means and agglomerative hierarchical clustering.
Abstract: We consider the problem of clustering data over time. An evolutionary clustering should simultaneously optimize two potentially conflicting criteria: first, the clustering at any point in time should remain faithful to the current data as much as possible; and second, the clustering should not shift dramatically from one timestep to the next. We present a generic framework for this problem, and discuss evolutionary versions of two widely-used clustering algorithms within this framework: k-means and agglomerative hierarchical clustering. We extensively evaluate these algorithms on real data sets and show that our algorithms can simultaneously attain both high accuracy in capturing today's data, and high fidelity in reflecting yesterday's clustering.
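A minimal instance of this framework is a temporally smoothed k-means whose objective adds a penalty for moving centroids away from the previous timestep's centroids. The sketch below assumes that specific formulation (a squared-distance history cost with weight gamma); the paper's exact snapshot-quality and history-cost definitions may differ.

```python
import numpy as np

def evolutionary_kmeans(X, prev_centroids, k, gamma=1.0, n_iter=50, seed=0):
    """One timestep of temporally smoothed k-means:
    minimize sum_i ||x_i - mu_{c(i)}||^2 + gamma * sum_j ||mu_j - prev_mu_j||^2.
    (A minimal instance of the snapshot-quality vs. history-cost tradeoff;
    the paper's formulation and cluster matching may differ.)"""
    rng = np.random.default_rng(seed)
    if prev_centroids is None:
        centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    else:
        centroids = prev_centroids.copy()
    for _ in range(n_iter):
        # E-step: assign each point to its nearest centroid
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # M-step: pull each centroid toward both its points and its old position
        for j in range(k):
            pts = X[labels == j]
            if prev_centroids is None:
                centroids[j] = pts.mean(axis=0) if len(pts) else centroids[j]
            else:
                centroids[j] = (pts.sum(axis=0) + gamma * prev_centroids[j]) / (len(pts) + gamma)
    return centroids, labels

# toy stream: two timesteps with slowly drifting clusters
rng = np.random.default_rng(1)
X_t0 = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
X_t1 = X_t0 + 0.3                      # the data drifts a little
mu0, _ = evolutionary_kmeans(X_t0, None, k=2)
mu1, _ = evolutionary_kmeans(X_t1, mu0, k=2, gamma=5.0)
print("centroid shift between timesteps:", np.linalg.norm(mu1 - mu0, axis=1))
```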

686 citations


Proceedings ArticleDOI
20 Aug 2006
TL;DR: It is proved that the optimal (α, k)-anonymity problem is NP-hard, and a local-recoding algorithm is proposed which is more scalable and results in less data distortion.
Abstract: Privacy preservation is an important issue in the release of data for mining purposes. The k-anonymity model has been introduced for protecting individual identification. Recent studies show that a more sophisticated model is necessary to protect the association of individuals to sensitive information. In this paper, we propose an (α, k)-anonymity model to protect both identifications and relationships to sensitive information in data. We discuss the properties of the (α, k)-anonymity model. We prove that the optimal (α, k)-anonymity problem is NP-hard. We first present an optimal global-recoding method for the (α, k)-anonymity problem. Next we propose a local-recoding algorithm which is more scalable and results in less data distortion. The effectiveness and efficiency are shown by experiments. We also describe how the model can be extended to more general cases.
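As commonly stated, (α, k)-anonymity requires every quasi-identifier group to contain at least k records and no sensitive value to exceed a fraction α within any group. The checker below encodes that reading of the definition on a toy generalized table; it is only a validity check, whereas the paper's contribution is the recoding algorithms that produce such tables.

```python
from collections import Counter, defaultdict

def satisfies_alpha_k_anonymity(rows, quasi_ids, sensitive, alpha, k):
    """Check a table against (alpha, k)-anonymity as commonly defined:
    every quasi-identifier combination appears at least k times, and within
    each such group no sensitive value exceeds a fraction alpha."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in quasi_ids)
        groups[key].append(row[sensitive])
    for key, values in groups.items():
        if len(values) < k:
            return False, f"group {key} has only {len(values)} rows (< k={k})"
        most_common = Counter(values).most_common(1)[0][1]
        if most_common / len(values) > alpha:
            return False, f"group {key} exceeds alpha={alpha} for a sensitive value"
    return True, "table is (alpha, k)-anonymous"

# toy generalized table: quasi-identifiers already recoded into ranges
table = [
    {"age": "2*", "zip": "478**", "disease": "flu"},
    {"age": "2*", "zip": "478**", "disease": "cold"},
    {"age": "2*", "zip": "478**", "disease": "flu"},
    {"age": "3*", "zip": "479**", "disease": "cancer"},
    {"age": "3*", "zip": "479**", "disease": "flu"},
    {"age": "3*", "zip": "479**", "disease": "cold"},
]
print(satisfies_alpha_k_anonymity(table, ["age", "zip"], "disease", alpha=0.7, k=3))
```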

676 citations


Proceedings ArticleDOI
20 Aug 2006
TL;DR: This paper proposes sparse random projections, an approximate algorithm for estimating distances between pairs of points in a high-dimensional vector space that multiplies A by a random matrix R in R^{D x k}, reducing the D dimensions down to just k for speeding up the computation.
Abstract: There has been considerable interest in random projections, an approximate algorithm for estimating distances between pairs of points in a high-dimensional vector space. Let A in R^{n x D} be our n points in D dimensions. The method multiplies A by a random matrix R in R^{D x k}, reducing the D dimensions down to just k for speeding up the computation. R typically consists of entries drawn from the standard normal N(0,1). It is well known that random projections preserve pairwise distances (in expectation). Achlioptas proposed sparse random projections by replacing the N(0,1) entries in R with entries in {-1, 0, 1} with probabilities {1/6, 2/3, 1/6}, achieving a threefold speedup in processing time. We recommend using R with entries in {-1, 0, 1} with probabilities {1/(2√D), 1 - 1/√D, 1/(2√D)} for achieving a significant √D-fold speedup, with little loss in accuracy.
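The recommended projection matrix is fully specified by the probabilities above, so it is easy to sketch. The √s/√k scaling used below is the usual choice that keeps squared distances unbiased; treat it as my assumption rather than a quote from the paper.

```python
import numpy as np

def sparse_random_projection(A, k, seed=0):
    """Project an n x D matrix A down to k dimensions with the very sparse
    projection matrix described in the abstract: entries in {-1, 0, +1} with
    probabilities 1/(2*sqrt(D)), 1 - 1/sqrt(D), 1/(2*sqrt(D))."""
    n, D = A.shape
    s = np.sqrt(D)                      # on average ~sqrt(D) non-zero entries per column of R
    rng = np.random.default_rng(seed)
    u = rng.random((D, k))
    R = np.zeros((D, k))
    R[u < 1 / (2 * s)] = 1.0
    R[u > 1 - 1 / (2 * s)] = -1.0
    # sqrt(s)/sqrt(k) scaling keeps E||proj(a) - proj(b)||^2 = ||a - b||^2
    return (A @ R) * np.sqrt(s) / np.sqrt(k)

# sanity check: pairwise distances are roughly preserved
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 10000))
P = sparse_random_projection(A, k=500)
orig = np.linalg.norm(A[0] - A[1])
proj = np.linalg.norm(P[0] - P[1])
print(f"original distance {orig:.1f}  projected distance {proj:.1f}")
```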

Proceedings ArticleDOI
20 Aug 2006
TL;DR: The dynamic tensor analysis (DTA) method, and its variants are introduced, which provides a compact summary for high-order and high-dimensional data, and it also reveals the hidden correlations.
Abstract: How do we find patterns in author-keyword associations, evolving over time? Or in Data Cubes, with product-branch-customer sales information? Matrix decompositions, like principal component analysis (PCA) and variants, are invaluable tools for mining, dimensionality reduction, feature selection, rule identification in numerous settings like streaming data, text, graphs, social networks and many more. However, they have only two orders, like author and keyword, in the above example. We propose to envision such higher order data as tensors, and tap the vast literature on the topic. However, these methods do not necessarily scale up, let alone operate on semi-infinite streams. Thus, we introduce the dynamic tensor analysis (DTA) method, and its variants. DTA provides a compact summary for high-order and high-dimensional data, and it also reveals the hidden correlations. Algorithmically, we designed DTA very carefully so that it is (a) scalable, (b) space efficient (it does not need to store the past) and (c) fully automatic with no need for user defined parameters. Moreover, we propose STA, a streaming tensor analysis method, which provides a fast, streaming approximation to DTA. We implemented all our methods, and applied them in two real settings, namely, anomaly detection and multi-way latent semantic indexing. We used two real, large datasets, one on network flow data (100GB over 1 month) and one from DBLP (200MB over 25 years). Our experiments show that our methods are fast, accurate and that they find interesting patterns and outliers on the real datasets.

Proceedings ArticleDOI
20 Aug 2006
TL;DR: A new plagiarism detection tool, called GPLAG, is developed, which detects plagiarism by mining program dependence graphs (PDGs) and is more effective than state-of-the-art tools for plagiarism detection.
Abstract: Along with the blossoming of open source projects comes the convenience for software plagiarism. A company, if less self-disciplined, may be tempted to plagiarize some open source projects for its own products. Although current plagiarism detection tools appear sufficient for academic use, they nevertheless fall short of fighting serious plagiarists. For example, disguises like statement reordering and code insertion can effectively confuse these tools. In this paper, we develop a new plagiarism detection tool, called GPLAG, which detects plagiarism by mining program dependence graphs (PDGs). A PDG is a graphic representation of the data and control dependencies within a procedure. Because PDGs are nearly invariant during plagiarism, GPLAG is more effective than state-of-the-art tools for plagiarism detection. In order to make GPLAG scalable to large programs, a statistical lossy filter is proposed to prune the plagiarism search space. An experimental study shows that GPLAG is both effective and efficient: it detects plagiarism that easily slips past existing tools, and it usually takes only a few seconds to find (simulated) plagiarism in programs having thousands of lines of code.

Proceedings ArticleDOI
20 Aug 2006
TL;DR: This paper proposes a simple framework to specify the utility of attributes and develops two simple yet efficient heuristic local recoding methods for utility-based anonymization, which outperform the state-of-the-art multidimensional global recoding methods in both discernability and query answering accuracy.
Abstract: Privacy is becoming a more and more serious concern in applications involving microdata. Recently, efficient anonymization has attracted much research work. Most of the previous methods use global recoding, which maps the domains of the quasi-identifier attributes to generalized or changed values. However, global recoding may not always achieve effective anonymization in terms of discernability and query answering accuracy using the anonymized data. Moreover, anonymized data is often intended for analysis. As is well accepted in many analytical applications, different attributes in a data set may have different utility in the analysis. The utility of attributes has not been considered in the previous methods. In this paper, we study the problem of utility-based anonymization. First, we propose a simple framework to specify the utility of attributes. The framework covers both numeric and categorical data. Second, we develop two simple yet efficient heuristic local recoding methods for utility-based anonymization. Our extensive performance study using both real data sets and synthetic data sets shows that our methods outperform the state-of-the-art multidimensional global recoding methods in both discernability and query answering accuracy. Furthermore, our utility-based method can boost the quality of analysis using the anonymized data.

Proceedings ArticleDOI
20 Aug 2006
TL;DR: This paper presents a novel approach to outlier detection based on classification, which is superior to other methods based on the same reduction to classification, but using standard classification methods, and shows that it is competitive to the state-of-the-art outlier Detection methods in the literature.
Abstract: Most existing approaches to outlier detection are based on density estimation methods. There are two notable issues with these methods: one is the lack of explanation for outlier flagging decisions, and the other is the relatively high computational requirement. In this paper, we present a novel approach to outlier detection based on classification, in an attempt to address both of these issues. Our approach is based on two key ideas. First, we present a simple reduction of outlier detection to classification, via a procedure that involves applying classification to a labeled data set containing artificially generated examples that play the role of potential outliers. Once the task has been reduced to classification, we then invoke a selective sampling mechanism based on active learning to the reduced classification problem. We empirically evaluate the proposed approach using a number of data sets, and find that our method is superior to other methods based on the same reduction to classification, but using standard classification methods. We also show that it is competitive to the state-of-the-art outlier detection methods in the literature based on density estimation, while significantly improving the computational complexity and explanatory power.
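The reduction itself is easy to sketch: treat the real data as one class, generate artificial points spread over the data's bounding box as the "outlier" class, train a classifier, and read off each real point's predicted outlier probability. The sketch below (using scikit-learn) shows only this reduction; the paper's selective-sampling/active-learning refinement is omitted, and the uniform artificial distribution is my simplification.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# "Real" unlabeled data: a dense blob plus a few genuine outliers.
normal = rng.normal(0, 1, size=(500, 2))
outliers = rng.uniform(-6, 6, size=(10, 2))
X_real = np.vstack([normal, outliers])

# Reduction to classification: real points get label 0, artificially
# generated uniform points play the role of potential outliers (label 1).
X_art = rng.uniform(X_real.min(0) - 1, X_real.max(0) + 1, size=(len(X_real), 2))
X = np.vstack([X_real, X_art])
y = np.r_[np.zeros(len(X_real)), np.ones(len(X_art))]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Score real points by the classifier's "outlier" probability; the paper's
# active-learning loop would refine which examples the classifier trains on.
scores = clf.predict_proba(X_real)[:, 1]
top10 = np.argsort(scores)[-10:]
print("points flagged as most outlying:\n", X_real[top10])
```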

Book ChapterDOI
09 Apr 2006
TL;DR: A simple but effective measure on local outliers based on a symmetric neighborhood relationship that considers both neighbors and reverse neighbors of an object when estimating its density distribution and shows that it is more effective in ranking outliers.
Abstract: Mining outliers in a database is to find exceptional objects that deviate from the rest of the data set. Besides classical outlier analysis algorithms, recent studies have focused on mining local outliers, i.e., outliers that have a density distribution significantly different from their neighborhood. The estimation of the density distribution at the location of an object has so far been based on the density distribution of its k-nearest neighbors [2,11]. However, when outliers are in locations where the density distributions in the neighborhood are significantly different, for example, in the case of objects from a sparse cluster close to a denser cluster, this may result in a wrong estimation. To avoid this problem, we propose a simple but effective measure on local outliers based on a symmetric neighborhood relationship. The proposed measure considers both the neighbors and reverse neighbors of an object when estimating its density distribution. As a result, the outliers so discovered are more meaningful. To compute such local outliers efficiently, several mining algorithms are developed that detect top-n outliers based on our definition. A comprehensive performance evaluation and analysis shows that our methods are not only efficient in computation but also more effective in ranking outliers.
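A rough rendering of the symmetric-neighborhood idea: score each object by the average density of its k-nearest plus reverse k-nearest neighbors relative to its own density, with density taken as the inverse of the k-distance. Treat the exact formula as an assumption on my part; the paper's definition and its efficient top-n algorithms differ in detail.

```python
import numpy as np

def influenced_outlierness(X, k=5):
    """Score each point by comparing the average density of its 'influence
    space' (k-nearest neighbors plus reverse k-nearest neighbors) with its
    own density, in the spirit of the symmetric-neighborhood measure."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    knn = np.argsort(d, axis=1)[:, :k]                 # k-nearest neighbors
    kdist = np.sort(d, axis=1)[:, k - 1]               # distance to the k-th neighbor
    density = 1.0 / kdist
    rnn = [set() for _ in range(n)]                    # reverse k-nearest neighbors
    for i in range(n):
        for j in knn[i]:
            rnn[j].add(i)
    scores = np.empty(n)
    for i in range(n):
        influence = set(knn[i]) | rnn[i]
        scores[i] = np.mean([density[j] for j in influence]) / density[i]
    return scores

rng = np.random.default_rng(0)
dense = rng.normal(0, 0.3, size=(100, 2))
sparse = rng.normal(4, 1.5, size=(30, 2))
outlier = np.array([[2.0, 0.0]])                       # sits between the two clusters
X = np.vstack([dense, sparse, outlier])
scores = influenced_outlierness(X, k=5)
print("top-3 outlier indices:", np.argsort(scores)[-3:], "(the injected point is index", len(X) - 1, ")")
```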

Book ChapterDOI
09 Apr 2006
TL;DR: In this paper, the authors investigate a large person-to-person recommendation network, consisting of four million people who made sixteen million recommendations on half a million products, and discover novel patterns: the distribution of cascade sizes is approximately heavy-tailed; cascades tend to be shallow, but occasional large bursts of propagation can occur.
Abstract: Information cascades are phenomena in which individuals adopt a new action or idea due to influence by others. As such a process spreads through an underlying social network, it can result in widespread adoption overall. We consider information cascades in the context of recommendations, and in particular study the patterns of cascading recommendations that arise in large social networks. We investigate a large person-to-person recommendation network, consisting of four million people who made sixteen million recommendations on half a million products. Such a dataset allows us to pose a number of fundamental questions: What kinds of cascades arise frequently in real life? What features distinguish them? We enumerate and count cascade subgraphs on large directed graphs; as one component of this, we develop a novel efficient heuristic based on graph isomorphism testing that scales to large datasets. We discover novel patterns: the distribution of cascade sizes is approximately heavy-tailed; cascades tend to be shallow, but occasional large bursts of propagation can occur. The relative abundance of different cascade subgraphs suggests subtle properties of the underlying social network and recommendation process.

Proceedings ArticleDOI
20 Aug 2006
TL;DR: Wall-clock timing results on the DBLP dataset show that the proposed approximation achieves good accuracy at about a 6:1 speedup, and experiments confirm that the method naturally deals with multi-source queries and that the resulting subgraphs agree with intuition.
Abstract: Given Q nodes in a social network (say, an authorship network), how can we find the node/author that is the center-piece and has direct or indirect connections to all, or most, of them? For example, this node could be the common advisor, or someone who started the research area that the Q nodes belong to. Isomorphic scenarios appear in law enforcement (find the master-mind criminal, connected to all current suspects), gene regulatory networks (find the protein that participates in pathways with all or most of the given Q proteins), viral marketing and many more. Connection subgraphs are an important first step, handling the case of Q=2 query nodes: the connection subgraph algorithm finds the b intermediate nodes that provide a good connection between the two original query nodes. Here we generalize the challenge in multiple dimensions: first, we allow more than two query nodes; second, we allow a whole family of queries, ranging from 'OR' to 'AND', with 'softAND' in-between; finally, we design and compare a fast approximation, and study the quality/speed trade-off. We also present experiments on the DBLP dataset. The experiments confirm that our proposed method naturally deals with multi-source queries and that the resulting subgraphs agree with our intuition. Wall-clock timing results on the DBLP dataset show that our proposed approximation achieves good accuracy at about a 6:1 speedup.
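One way to make the 'AND'/'OR' query family concrete is to compute random-walk-with-restart proximity from each query node and combine the per-node scores multiplicatively (AND-like) or additively (OR-like). The sketch below does only that scoring step on a toy graph; extracting the actual connecting subgraph, the 'softAND' semantics, and the fast approximation are the paper's work and are not reproduced here.

```python
import numpy as np

def rwr_scores(A, source, restart=0.15, n_iter=100):
    """Random-walk-with-restart proximity from `source` to every node."""
    W = A / A.sum(axis=0, keepdims=True)        # column-normalized adjacency
    r = np.zeros(len(A))
    e = np.zeros(len(A))
    e[source] = 1.0
    for _ in range(n_iter):
        r = (1 - restart) * W @ r + restart * e
    return r

def centerpiece_scores(A, query_nodes, mode="AND"):
    """Combine per-query proximities: product ~ 'AND', sum ~ 'OR'.
    (Scoring only; the paper also extracts a small connecting subgraph.)"""
    R = np.array([rwr_scores(A, q) for q in query_nodes])
    return R.prod(axis=0) if mode == "AND" else R.sum(axis=0)

# toy graph: two triangles bridged by node 3
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 5), (5, 6), (6, 4)]
n = 7
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
scores = centerpiece_scores(A, query_nodes=[0, 6], mode="AND")
scores[[0, 6]] = 0                               # ignore the query nodes themselves
print("center-piece candidate:", scores.argmax())
```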

Proceedings ArticleDOI
20 Aug 2006
TL;DR: This work proposes a semi-supervised technique for building time series classifiers and shows that special considerations must be made to make them both efficient and effective for the time series domain.
Abstract: The problem of time series classification has attracted great interest in the last decade. However, current research assumes the existence of large amounts of labeled training data. In reality, such data may be very difficult or expensive to obtain. For example, it may require the time and expertise of cardiologists, space launch technicians, or other domain specialists. As in many other domains, there are often copious amounts of unlabeled data available. For example, the PhysioBank archive contains gigabytes of ECG data. In this work we propose a semi-supervised technique for building time series classifiers. While such algorithms are well known in text domains, we will show that special considerations must be made to make them both efficient and effective for the time series domain. We evaluate our work with a comprehensive set of experiments on diverse data sources including electrocardiograms, handwritten documents, and video datasets. The experimental results demonstrate that our approach requires only a handful of labeled examples to construct accurate classifiers.

Proceedings ArticleDOI
20 Aug 2006
TL;DR: This work proposes the framework MONIC for modeling and tracking of cluster transitions, which encompasses changes that involve more than one cluster, thus allowing for insights on cluster change in the whole clustering.
Abstract: There is much recent work on detecting and tracking change in clusters, often based on the study of the spatiotemporal properties of a cluster. For the many applications where cluster change is relevant, among them customer relationship management, fraud detection and marketing, it is also necessary to provide insights about the nature of cluster change: Is a cluster corresponding to a group of customers simply disappearing or are its members migrating to other clusters? Is a new emerging cluster reflecting a new target group of customers or does it rather consist of existing customers whose preferences shift? To answer such questions, we propose the framework MONIC for modeling and tracking of cluster transitions. Our cluster transition model encompasses changes that involve more than one cluster, thus allowing for insights on cluster change in the whole clustering. Our transition tracking mechanism is not based on the topological properties of clusters, which are only available for some types of clustering, but on the contents of the underlying data stream. We present our first results on monitoring cluster transitions over the ACM digital library.

Proceedings ArticleDOI
20 Aug 2006
TL;DR: A new mathematical and computational framework is proposed that enables analysis of dynamic social networks and that explicitly makes use of information about when social interactions occur.
Abstract: Finding patterns of social interaction within a population has wide-ranging applications including: disease modeling, cultural and information transmission, and behavioral ecology. Social interactions are often modeled with networks. A key characteristic of social interactions is their continual change. However, most past analyses of social networks are essentially static in that all information about the time that social interactions take place is discarded. In this paper, we propose a new mathematical and computational framework that enables analysis of dynamic social networks and that explicitly makes use of information about when social interactions occur.

Proceedings ArticleDOI
20 Aug 2006
TL;DR: This paper studies statistical language modeling based methods to mine contextual information from long-term search history and exploit it for a more accurate estimate of the query language model.
Abstract: Long-term search history contains rich information about a user's search preferences, which can be used as search context to improve retrieval performance. In this paper, we study statistical language modeling based methods to mine contextual information from long-term search history and exploit it for a more accurate estimate of the query language model. Experiments on real web search data show that the algorithms are effective in improving search accuracy for both fresh and recurring queries. The best performance is achieved when using clickthrough data of past searches that are related to the current query.
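A minimal version of the idea is to estimate a unigram language model from the user's past queries and clicks and interpolate it with the current query's model before ranking. The sketch below assumes simple maximum-likelihood unigram models, a fixed interpolation weight, and a crude overlap-based ranking score; the paper's estimation methods (including the use of clickthrough data) are more elaborate.

```python
from collections import Counter

def unigram_lm(texts):
    """Maximum-likelihood unigram model over a list of strings."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def contextual_query_model(query, history, lam=0.7):
    """Interpolate the current query's model with a model estimated from the
    user's long-term search history (past queries and clicked titles)."""
    q_lm = unigram_lm([query])
    h_lm = unigram_lm(history)
    vocab = set(q_lm) | set(h_lm)
    return {w: lam * q_lm.get(w, 0.0) + (1 - lam) * h_lm.get(w, 0.0) for w in vocab}

def score(doc, query_model, mu=0.01):
    """Crude overlap between the query model and the document's unigram model
    (not the KL-divergence ranking usually used in LM retrieval)."""
    doc_lm = unigram_lm([doc])
    return sum(p * (doc_lm.get(w, 0.0) + mu) for w, p in query_model.items())

history = ["java memory model", "java garbage collection tuning", "jvm heap size"]
qm = contextual_query_model("java performance", history)
docs = ["Tuning garbage collection for Java performance",
        "Java island travel guide and coffee tours"]
print(sorted(docs, key=lambda d: -score(d, qm)))
```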

Proceedings ArticleDOI
20 Aug 2006
TL;DR: The experimental results indicate that the proposed time-varying Poisson model provides a robust and accurate framework for adaptively and autonomously learning how to separate unusual bursty events from traces of normal human activity.
Abstract: Time-series of count data are generated in many different contexts, such as web access logging, freeway traffic monitoring, and security logs associated with buildings. Since this data measures the aggregated behavior of individual human beings, it typically exhibits a periodicity in time on a number of scales (daily, weekly,etc.) that reflects the rhythms of the underlying human activity and makes the data appear non-homogeneous. At the same time, the data is often corrupted by a number of bursty periods of unusual behavior such as building events, traffic accidents, and so forth. The data mining problem of finding and extracting these anomalous events is made difficult by both of these elements. In this paper we describe a framework for unsupervised learning in this context, based on a time-varying Poisson process model that can also account for anomalous events. We show how the parameters of this model can be learned from count time series using statistical estimation techniques. We demonstrate the utility of this model on two datasets for which we have partial ground truth in the form of known events, one from freeway traffic data and another from building access data, and show that the model performs significantly better than a non-probabilistic, threshold-based technique. We also describe how the model can be used to investigate different degrees of periodicity in the data, including systematic day-of-week and time-of-day effects, and make inferences about the detected events (e.g., popularity or level of attendance). Our experimental results indicate that the proposed time-varying Poisson model provides a robust and accurate framework for adaptively and autonomously learning how to separate unusual bursty events from traces of normal human activity.
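The periodic-rate part of the model is easy to illustrate: fit a time-of-week Poisson rate from historical counts and flag time slots whose observed count has a tiny tail probability under that rate. The sketch below does exactly that on simulated data; the paper's full model additionally accounts for anomalous events with a separate probabilistic component, which is omitted here.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)
weeks, hours = 8, 7 * 24

# Simulated counts with a daily rhythm (busy at midday) plus one injected event.
hourly_profile = 5 + 20 * np.exp(-((np.arange(24) - 13) ** 2) / 18.0)
lam_true = np.tile(hourly_profile, 7 * weeks)
counts = rng.poisson(lam_true).reshape(weeks, hours)
counts[3, 2 * 24 + 14] += 120            # bursty event: week 3, day 2, 14:00

# Estimate the periodic (time-of-week) Poisson rate by averaging across weeks,
# then flag slots whose count is extremely unlikely under that rate.
lam_hat = counts.mean(axis=0)
pvals = poisson.sf(counts - 1, lam_hat)  # P[X >= observed count]
week, slot = np.unravel_index(pvals.argmin(), pvals.shape)
print(f"most anomalous slot: week {week}, day {slot // 24}, hour {slot % 24}")
```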

Proceedings ArticleDOI
20 Aug 2006
TL;DR: The lossy join is introduced, a negative property in relational database design, as a way to hide the join relationship among releases, and a scalable and practical solution to the sequential anonymization problem is proposed.
Abstract: An organization makes a new release as new information becomes available, releases a tailored view for each data request, or releases sensitive information and identifying information separately. The availability of related releases sharpens the identification of individuals by a global quasi-identifier consisting of attributes from related releases. Since it is not an option to anonymize previously released data, the current release must be anonymized to ensure that a global quasi-identifier is not effective for identification. In this paper, we study the sequential anonymization problem under this assumption. A key question is how to anonymize the current release so that it cannot be linked to previous releases yet remains useful for its own release purpose. We introduce the lossy join, a negative property in relational database design, as a way to hide the join relationship among releases, and propose a scalable and practical solution.

Proceedings ArticleDOI
20 Aug 2006
TL;DR: A suite of anonymization algorithms that produce an anonymous view based on a target class of workloads, consisting of one or more data mining tasks, as well as selection predicates are provided.
Abstract: Protecting data privacy is an important problem in microdata distribution. Anonymization algorithms typically aim to protect individual privacy, with minimal impact on the quality of the resulting data. While the bulk of previous work has measured quality through one-size-fits-all measures, we argue that quality is best judged with respect to the workload for which the data will ultimately be used.This paper provides a suite of anonymization algorithms that produce an anonymous view based on a target class of workloads, consisting of one or more data mining tasks, as well as selection predicates. An extensive experimental evaluation indicates that this approach is often more effective than previous anonymization techniques.

Proceedings ArticleDOI
20 Aug 2006
TL;DR: This paper proposes and studies different attributes derived from user profiles for their utility in attack detection and shows that a machine learning classification approach that includes attributesderived from attack models is more successful than more generalized detection algorithms previously studied.
Abstract: Collaborative recommender systems are highly vulnerable to attack. Attackers can use automated means to inject a large number of biased profiles into such a system, resulting in recommendations that favor or disfavor given items. Since collaborative recommender systems must be open to user input, it is difficult to design a system that cannot be so attacked. Researchers studying robust recommendation have therefore begun to identify types of attacks and study mechanisms for recognizing and defeating them. In this paper, we propose and study different attributes derived from user profiles for their utility in attack detection. We show that a machine learning classification approach that includes attributes derived from attack models is more successful than more generalized detection algorithms previously studied.

Proceedings ArticleDOI
Jun Zhu, Zaiqing Nie, Ji-Rong Wen, Bo Zhang, Wei-Ying Ma
20 Aug 2006
TL;DR: It is shown that separately extracting data records and attributes is highly ineffective and a probabilistic model to perform these two tasks simultaneously is proposed and it can also incorporate hierarchical interactions which are very important for Web data extraction.
Abstract: Recent work has shown the feasibility and promise of template-independent Web data extraction. However, existing approaches use decoupled strategies - attempting to do data record detection and attribute labeling in two separate phases. In this paper, we show that separately extracting data records and attributes is highly ineffective and propose a probabilistic model to perform these two tasks simultaneously. In our approach, record detection can benefit from the availability of semantics required in attribute labeling and, at the same time, the accuracy of attribute labeling can be improved when data records are labeled in a collective manner. The proposed model is called Hierarchical Conditional Random Fields. It can efficiently integrate all useful features by learning their importance, and it can also incorporate hierarchical interactions which are very important for Web data extraction. We empirically compare the proposed model with existing decoupled approaches for product information extraction, and the results show significant improvements in both record detection and attribute labeling.

Proceedings ArticleDOI
Seung-Taek Park, David M. Pennock, Omid Madani, Nathan Good, Dennis DeCoste
20 Aug 2006
TL;DR: This work improves the scalability and performance of a previous approach to handling cold-start situations that uses filterbots, or surrogate users that rate items based only on user or item attributes, and shows that introducing a very small number of simple filterbots helps make CF algorithms more robust.
Abstract: The goal of a recommender system is to suggest items of interest to a user based on historical behavior of a community of users. Given detailed enough history, item-based collaborative filtering (CF) often performs as well or better than almost any other recommendation method. However, in cold-start situations - where a user, an item, or the entire system is new - simple non-personalized recommendations often fare better. We improve the scalability and performance of a previous approach to handling cold-start situations that uses filterbots, or surrogate users that rate items based only on user or item attributes. We show that introducing a very small number of simple filterbots helps make CF algorithms more robust. In particular, adding just seven global filterbots improves both user-based and item-based CF in cold-start user, cold-start item, and cold-start system settings. Performance is better when data is scarce, performance is no worse when data is plentiful, and algorithm efficiency is negligibly affected. We systematically compare a non-personalized baseline, user-based CF, item-based CF, and our bot-augmented user- and item-based CF algorithms using three data sets (Yahoo! Movies, MovieLens, and EachMovie) with the normalized MAE metric in three types of cold-start situations. The advantage of our "naive filterbot" approach is most pronounced for the Yahoo! data, the sparsest of the three data sets.
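The filterbot trick is simple to illustrate: append surrogate users whose ratings are a deterministic function of item attributes to the ratings matrix, then run ordinary collaborative filtering on the augmented matrix. The toy sketch below uses one "genre bot" per genre and item-based cosine similarity; the seven global bots and the CF variants studied in the paper are different, so treat the specifics as illustrative only.

```python
import numpy as np

# Toy ratings matrix: rows = real users, cols = items, 0 = unrated.
ratings = np.array([
    [5, 0, 0, 0, 0],
    [0, 4, 0, 0, 0],
    [0, 0, 0, 5, 0],
], dtype=float)
item_genre = np.array(["action", "action", "drama", "drama", "action"])

# Filterbots: surrogate users whose ratings depend only on item attributes.
bots = np.array([[5.0 if g == genre else 1.0 for g in item_genre]
                 for genre in ["action", "drama"]])
augmented = np.vstack([ratings, bots])

def item_item_cosine(R):
    """Cosine similarity between item columns (zeros kept as-is in this toy)."""
    norms = np.linalg.norm(R, axis=0) + 1e-9
    sim = (R.T @ R) / np.outer(norms, norms)
    np.fill_diagonal(sim, 0.0)
    return sim

def predict(R, sim, user, item):
    """Similarity-weighted average of the user's existing ratings."""
    rated = R[user] > 0
    w = sim[item, rated]
    return float(w @ R[user, rated] / (np.abs(w).sum() + 1e-9))

sim_plain = item_item_cosine(ratings)
sim_bots = item_item_cosine(augmented)
# Cold-start-ish case: user 0 has rated only item 0; predict item 4 (same genre).
print("without filterbots:", predict(ratings, sim_plain, user=0, item=4))
print("with filterbots:   ", predict(ratings, sim_bots, user=0, item=4))
```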

Book ChapterDOI
09 Apr 2006
TL;DR: An algorithm that enhances the K-means algorithm to handle data uncertainty is presented; experimental results show that a clustering algorithm that takes uncertainty into account can produce more accurate results.
Abstract: Data uncertainty is an inherent property in various applications due to reasons such as outdated sources or imprecise measurement. When data mining techniques are applied to these data, their uncertainty has to be considered to obtain high quality results. We present UK-means clustering, an algorithm that enhances the K-means algorithm to handle data uncertainty. We apply UK-means to the particular pattern of moving-object uncertainty. Experimental results show that by considering uncertainty, a clustering algorithm can produce more accurate results.
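UK-means replaces the point-to-centroid distance of k-means with the expected distance between an uncertain object and each centroid. The sketch below approximates that expectation with Monte-Carlo samples from each object's uncertainty region; the paper evaluates the expectation from the objects' probability density functions, so the sampling here is only a stand-in for the idea.

```python
import numpy as np

def uk_means(samples, k, n_iter=50, seed=0):
    """K-means over uncertain objects: each object is a set of samples from its
    uncertainty distribution, and assignment uses the *expected* distance to
    each centroid (a Monte-Carlo stand-in for the pdf integral)."""
    rng = np.random.default_rng(seed)
    means = np.array([s.mean(axis=0) for s in samples])      # object centers
    centroids = means[rng.choice(len(samples), size=k, replace=False)].copy()
    labels = np.zeros(len(samples), dtype=int)
    for _ in range(n_iter):
        # expected distance E||X_i - c_j|| estimated from each object's samples
        ed = np.array([[np.mean(np.linalg.norm(s - c, axis=1)) for c in centroids]
                       for s in samples])
        labels = ed.argmin(axis=1)
        for j in range(k):
            members = [samples[i] for i in range(len(samples)) if labels[i] == j]
            if members:
                centroids[j] = np.vstack(members).mean(axis=0)
    return labels, centroids

# toy uncertain objects: two groups of moving objects with Gaussian position noise
rng = np.random.default_rng(1)
objects = [rng.normal(center, 0.5, size=(30, 2))
           for center in ([0, 0],) * 20 + ([4, 4],) * 20]
labels, _ = uk_means(objects, k=2)
print("cluster sizes:", np.bincount(labels))
```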

Proceedings ArticleDOI
20 Aug 2006
TL;DR: The authors used deep linguistic structures instead of surface text patterns to extract pairs of a given semantic relation from text documents and applied them to a corpus to find new pairs, and demonstrated the benefits of their approach by extensive experiments with their prototype system LEILA.
Abstract: The World Wide Web provides a nearly endless source of knowledge, which is mostly given in natural language. A first step towards exploiting this data automatically could be to extract pairs of a given semantic relation from text documents - for example all pairs of a person and her birthdate. One strategy for this task is to find text patterns that express the semantic relation, to generalize these patterns, and to apply them to a corpus to find new pairs. In this paper, we show that this approach profits significantly when deep linguistic structures are used instead of surface text patterns. We demonstrate how linguistic structures can be represented for machine learning, and we provide a theoretical analysis of the pattern matching approach. We show the benefits of our approach by extensive experiments with our prototype system LEILA.