Showing papers in "IEEE Transactions on Knowledge and Data Engineering in 2011"

PDF

Open Access

Journal Article•DOI•

Estimating the Helpfulness and Economic Impact of Product Reviews: Mining Text and Reviewer Characteristics

[...]

Anindya Ghose¹, Panagiotis G. Ipeirotis¹•Institutions (1)

01 Oct 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: In this paper, the impact of reviews on economic outcomes like product sales and how different factors affect social outcomes such as their perceived usefulness was examined, and it was shown that the extent of subjectivity, informativeness, readability, and linguistic correctness in reviews matters in influencing sales and perceived usefulness.

...read moreread less

Abstract: With the rapid growth of the Internet, the ability of users to create and publish content has created active electronic communities that provide a wealth of product information. However, the high volume of reviews that are typically published for a single product makes harder for individuals as well as manufacturers to locate the best reviews and understand the true underlying quality of a product. In this paper, we reexamine the impact of reviews on economic outcomes like product sales and see how different factors affect social outcomes such as their perceived usefulness. Our approach explores multiple aspects of review text, such as subjectivity levels, various measures of readability and extent of spelling errors to identify important text-based features. In addition, we also examine multiple reviewer-level features such as average usefulness of past reviews and the self-disclosed identity measures of reviewers that are displayed next to a review. Our econometric analysis reveals that the extent of subjectivity, informativeness, readability, and linguistic correctness in reviews matters in influencing sales and perceived usefulness. Reviews that have a mixture of objective, and highly subjective sentences are negatively associated with product sales, compared to reviews that tend to include only subjective or only objective information. However, such reviews are rated more informative (or helpful) by other users. By using Random Forest-based classifiers, we show that we can accurately predict the impact of reviews on sales and their perceived usefulness. We examine the relative importance of the three broad feature categories: “reviewer-related” features, “review subjectivity” features, and “review readability” features, and find that using any of the three feature sets results in a statistically equivalent performance as in the case of using all available features. This paper is the first study that integrates econometric, text mining, and predictive modeling techniques toward a more complete analysis of the information captured by user-generated online reviews in order to estimate their helpfulness and economic impact.

...read moreread less

1,014 citations

Journal Article•DOI•

Random k-Labelsets for Multilabel Classification

[...]

Grigorios Tsoumakas¹, Ioannis Katakis¹, Ioannis Vlahavas¹•Institutions (1)

Aristotle University of Thessaloniki¹

01 Jul 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: Empirical evidence indicates that RAkEL manages to improve substantially over LP, especially in domains with large number of labels and exhibits competitive performance against other high-performing multilabel learning methods.

...read moreread less

Abstract: A simple yet effective multilabel learning method, called label powerset (LP), considers each distinct combination of labels that exist in the training set as a different class value of a single-label classification task. The computational efficiency and predictive performance of LP is challenged by application domains with large number of labels and training examples. In these cases, the number of classes may become very large and at the same time many classes are associated with very few training examples. To deal with these problems, this paper proposes breaking the initial set of labels into a number of small random subsets, called labelsets and employing LP to train a corresponding classifier. The labelsets can be either disjoint or overlapping depending on which of two strategies is used to construct them. The proposed method is called RAkEL (RAndom k labELsets), where k is a parameter that specifies the size of the subsets. Empirical evidence indicates that RAkEL manages to improve substantially over LP, especially in domains with large number of labels and exhibits competitive performance against other high-performing multilabel learning methods.

...read moreread less

795 citations

Journal Article•DOI•

Discovering Activities to Recognize and Track in a Smart Environment

[...]

Parisa Rashidi¹, Diane J. Cook¹, Lawrence B. Holder¹, Maureen Schmitter-Edgecombe¹•Institutions (1)

Washington State University¹

01 Apr 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This paper introduces an automated approach to activity tracking that identifies frequent activities that naturally occur in an individual's routine and can then track the occurrence of regular activities to monitor functional health and to detect changes in anindividual's patterns and lifestyle.

...read moreread less

Abstract: The machine learning and pervasive sensing technologies found in smart homes offer unprecedented opportunities for providing health monitoring and assistance to individuals experiencing difficulties living independently at home. In order to monitor the functional health of smart home residents, we need to design technologies that recognize and track activities that people normally perform as part of their daily routines. Although approaches do exist for recognizing activities, the approaches are applied to activities that have been preselected and for which labeled training data are available. In contrast, we introduce an automated approach to activity tracking that identifies frequent activities that naturally occur in an individual's routine. With this capability, we can then track the occurrence of regular activities to monitor functional health and to detect changes in an individual's patterns and lifestyle. In this paper, we describe our activity mining and tracking approach, and validate our algorithms on data collected in physical smart environments.

...read moreread less

468 citations

Journal Article•DOI•

Classification and Novel Class Detection in Concept-Drifting Data Streams under Time Constraints

[...]

Mohammad M. Masud¹, Jing Gao², Latifur Khan¹, Jiawei Han², Bhavani Thuraisingham¹ - Show less +1 more•Institutions (2)

University of Texas at Dallas¹, University of Illinois at Urbana–Champaign²

01 Jun 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This work proposes a data stream classification technique that integrates a novel class detection mechanism into traditional classifiers, enabling automatic detection of novel classes before the true labels of the novel class instances arrive.

...read moreread less

Abstract: Most existing data stream classification techniques ignore one important aspect of stream data: arrival of a novel class. We address this issue and propose a data stream classification technique that integrates a novel class detection mechanism into traditional classifiers, enabling automatic detection of novel classes before the true labels of the novel class instances arrive. Novel class detection problem becomes more challenging in the presence of concept-drift, when the underlying data distributions evolve in streams. In order to determine whether an instance belongs to a novel class, the classification model sometimes needs to wait for more test instances to discover similarities among those instances. A maximum allowable wait time Tc is imposed as a time constraint to classify a test instance. Furthermore, most existing stream classification approaches assume that the true label of a data point can be accessed immediately after the data point is classified. In reality, a time delay Tl is involved in obtaining the true label of a data point since manual labeling is time consuming. We show how to make fast and correct classification decisions under these constraints and apply them to real benchmark data. Comparison with state-of-the-art stream classification techniques prove the superiority of our approach.

...read moreread less

362 citations

Journal Article•DOI•

Locally Consistent Concept Factorization for Document Clustering

[...]

Deng Cai¹, Xiaofei He¹, Jiawei Han²•Institutions (2)

Zhejiang University¹, University of Illinois at Urbana–Champaign²

01 Jun 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: A new approach to extract the document concepts which are consistent with the manifold geometry such that each concept corresponds to a connected component is proposed, which is called Locally Consistent Concept Factorization (LCCF).

...read moreread less

Abstract: Previous studies have demonstrated that document clustering performance can be improved significantly in lower dimensional linear subspaces. Recently, matrix factorization-based techniques, such as Nonnegative Matrix Factorization (NMF) and Concept Factorization (CF), have yielded impressive results. However, both of them effectively see only the global euclidean geometry, whereas the local manifold geometry is not fully considered. In this paper, we propose a new approach to extract the document concepts which are consistent with the manifold geometry such that each concept corresponds to a connected component. Central to our approach is a graph model which captures the local geometry of the document submanifold. Thus, we call it Locally Consistent Concept Factorization (LCCF). By using the graph Laplacian to smooth the document-to-concept mapping, LCCF can extract concepts with respect to the intrinsic manifold structure and thus documents associated with the same concept can be well clustered. The experimental results on TDT2 and Reuters-21578 have shown that the proposed approach provides a better representation and achieves better clustering results in terms of accuracy and mutual information.

...read moreread less

335 citations

Journal Article•DOI•

Differential Privacy via Wavelet Transforms

[...]

Xiaokui Xiao¹, Guozhang Wang², Johannes Gehrke²•Institutions (2)

Nanyang Technological University¹, Cornell University²

01 Aug 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This paper develops a data publishing technique that ensures ∈-differential privacy while providing accurate answers for range-count queries, i.e., count queries where the predicate on each attribute is a range.

...read moreread less

Abstract: Privacy-preserving data publishing has attracted considerable research interest in recent years. Among the existing solutions, ∈-differential privacy provides the strongest privacy guarantee. Existing data publishing methods that achieve ∈-differential privacy, however, offer little data utility. In particular, if the output data set is used to answer count queries, the noise in the query answers can be proportional to the number of tuples in the data, which renders the results useless. In this paper, we develop a data publishing technique that ensures ∈-differential privacy while providing accurate answers for range-count queries, i.e., count queries where the predicate on each attribute is a range. The core of our solution is a framework that applies wavelet transforms on the data before adding noise to it. We present instantiations of the proposed framework for both ordinal and nominal data, and we provide a theoretical analysis on their privacy and utility guarantees. In an extensive experimental study on both real and synthetic data, we show the effectiveness and efficiency of our solution.

...read moreread less

309 citations

Journal Article•DOI•

IR-Tree: An Efficient Index for Geographic Document Search

[...]

Zhisheng Li, Ken C. K. Lee¹, Baihua Zheng, Wang-Chien Lee², Dik Lun Lee³, Xufa Wang⁴ - Show less +2 more•Institutions (4)

University of Massachusetts Dartmouth¹, Pennsylvania State University², Hong Kong University of Science and Technology³, University of Science and Technology of China⁴

01 Apr 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: An efficient index, called IR-tree, is proposed that together with a top-k document search algorithm facilitates four major tasks in document searches, namely, 1) spatial filtering, 2) textual filtering, 3) relevance computation, and 4) document ranking in a fully integrated manner.

...read moreread less

Abstract: Given a geographic query that is composed of query keywords and a location, a geographic search engine retrieves documents that are the most textually and spatially relevant to the query keywords and the location, respectively, and ranks the retrieved documents according to their joint textual and spatial relevances to the query. The lack of an efficient index that can simultaneously handle both the textual and spatial aspects of the documents makes existing geographic search engines inefficient in answering geographic queries. In this paper, we propose an efficient index, called IR-tree, that together with a top-k document search algorithm facilitates four major tasks in document searches, namely, 1) spatial filtering, 2) textual filtering, 3) relevance computation, and 4) document ranking in a fully integrated manner. In addition, IR-tree allows searches to adopt different weights on textual and spatial relevance of documents at the runtime and thus caters for a wide variety of applications. A set of comprehensive experiments over a wide range of scenarios has been conducted and the experiment results demonstrate that IR-tree outperforms the state-of-the-art approaches for geographic document searches.

...read moreread less

270 citations

Journal Article•DOI•

A Privacy-Preserving Remote Data Integrity Checking Protocol with Data Dynamics and Public Verifiability

[...]

Zhuo Hao¹, Sheng Zhong², Nenghai Yu¹•Institutions (2)

University of Science and Technology of China¹, University at Buffalo²

01 Sep 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This paper adapts Sebé et al.'s protocol to support public verifiability and shows the correctness and security of the protocol, and demonstrates that the proposed protocol has a good performance.

...read moreread less

Abstract: Remote data integrity checking is a crucial technology in cloud computing. Recently, many works focus on providing data dynamics and/or public verifiability to this type of protocols. Existing protocols can support both features with the help of a third-party auditor. In a previous work, Sebe et al. propose a remote data integrity checking protocol that supports data dynamics. In this paper, we adapt Sebe et al.'s protocol to support public verifiability. The proposed protocol supports public verifiability without help of a third-party auditor. In addition, the proposed protocol does not leak any private information to third-party verifiers. Through a formal analysis, we show the correctness and security of the protocol. After that, through theoretical analysis and experimental results, we demonstrate that the proposed protocol has a good performance.

...read moreread less

270 citations

Journal Article•DOI•

Missing Value Estimation for Mixed-Attribute Data Sets

[...]

Xiaofeng Zhu¹, Shichao Zhang², Zhi Jin³, Zili Zhang⁴, Zhuoming Xu⁵ - Show less +1 more•Institutions (5)

University of Queensland¹, Zhejiang Normal University², Peking University³, Southwest University⁴, Hohai University⁵

01 Jan 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This paper first proposes two consistent estimators for discrete and continuous missing target values, respectively, and then, a mixture-kernel-based iterative estimator is advocated to impute mixed-attribute data sets.

...read moreread less

Abstract: Missing data imputation is a key issue in learning from incomplete data. Various techniques have been developed with great successes on dealing with missing values in data sets with homogeneous attributes (their independent attributes are all either continuous or discrete). This paper studies a new setting of missing data imputation, i.e., imputing missing data in data sets with heterogeneous attributes (their independent attributes are of different types), referred to as imputing mixed-attribute data sets. Although many real applications are in this setting, there is no estimator designed for imputing mixed-attribute data sets. This paper first proposes two consistent estimators for discrete and continuous missing target values, respectively. And then, a mixture-kernel-based iterative estimator is advocated to impute mixed-attribute data sets. The proposed method is evaluated with extensive experiments compared with some typical algorithms, and the result demonstrates that the proposed approach is better than these existing imputation methods in terms of classification accuracy and root mean square error (RMSE) at different missing ratios.

...read moreread less

267 citations

Journal Article•DOI•

Heuristics-Based Query Processing for Large RDF Graphs Using Cloud Computing

[...]

Mohammad Farhan Husain¹, James P. McGlothlin², Mohammad M. Masud², Latifur Khan², Bhavani Thuraisingham² - Show less +1 more•Institutions (2)

Amazon.com¹, University of Texas at Dallas²

01 Sep 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This paper describes a framework that is built using Hadoop to store and retrieve large numbers of RDF triples by exploiting the cloud computing paradigm and shows that this framework is scalable and efficient and can handle large amounts of R DF data, unlike traditional approaches.

...read moreread less

Abstract: Semantic web is an emerging area to augment human reasoning. Various technologies are being developed in this arena which have been standardized by the World Wide Web Consortium (W3C). One such standard is the Resource Description Framework (RDF). Semantic web technologies can be utilized to build efficient and scalable systems for Cloud Computing. With the explosion of semantic web technologies, large RDF graphs are common place. This poses significant challenges for the storage and retrieval of RDF graphs. Current frameworks do not scale for large RDF graphs and as a result do not address these challenges. In this paper, we describe a framework that we built using Hadoop to store and retrieve large numbers of RDF triples by exploiting the cloud computing paradigm. We describe a scheme to store RDF data in Hadoop Distributed File System. More than one Hadoop job (the smallest unit of execution in Hadoop) may be needed to answer a query because a single triple pattern in a query cannot simultaneously take part in more than one join in a single Hadoop job. To determine the jobs, we present an algorithm to generate query plan, whose worst case cost is bounded, based on a greedy approach to answer a SPARQL Protocol and RDF Query Language (SPARQL) query. We use Hadoop's MapReduce framework to answer the queries. Our results show that we can store large RDF graphs in Hadoop clusters built with cheap commodity class hardware. Furthermore, we show that our framework is scalable and efficient and can handle large amounts of RDF data, unlike traditional approaches.

...read moreread less

225 citations

Journal Article•DOI•

Energy Time Series Forecasting Based on Pattern Sequence Similarity

[...]

Francisco Martinez Alvarez¹, Alicia Troncoso¹, José C. Riquelme, Jesús S. Aguilar–Ruiz¹•Institutions (1)

Pablo de Olavide University¹

01 Aug 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This paper presents a new approach to forecast the behavior of time series based on similarity of pattern sequences, avoiding the use of real values of the time series until the last step of the prediction process.

...read moreread less

Abstract: This paper presents a new approach to forecast the behavior of time series based on similarity of pattern sequences. First, clustering techniques are used with the aim of grouping and labeling the samples from a data set. Thus, the prediction of a data point is provided as follows: first, the pattern sequence prior to the day to be predicted is extracted. Then, this sequence is searched in the historical data and the prediction is calculated by averaging all the samples immediately after the matched sequence. The main novelty is that only the labels associated with each pattern are considered to forecast the future behavior of the time series, avoiding the use of real values of the time series until the last step of the prediction process. Results from several energy time series are reported and the performance of the proposed method is compared to that of recently published techniques showing a remarkable improvement in the prediction.

...read moreread less

Journal Article•DOI•

Optimizing Multiway Joins in a Map-Reduce Environment

[...]

Foto N. Afrati¹, Jeffrey D. Ullman²•Institutions (2)

National and Kapodistrian University of Athens¹, Stanford University²

01 Sep 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This work identifies the “map-key,” the set of attributes that identify the Reduce process to which a Map process must send a particular tuple, and studies the problem of optimizing the shares, given a fixed number of Reduce processes.

...read moreread less

Abstract: Implementations of map-reduce are being used to perform many operations on very large data. We examine strategies for joining several relations in the map-reduce environment. Our new approach begins by identifying the “map-key,” the set of attributes that identify the Reduce process to which a Map process must send a particular tuple. Each attribute of the map-key gets a “share,” which is the number of buckets into which its values are hashed, to form a component of the identifier of a Reduce process. Relations have their tuples replicated in limited fashion, the degree of replication depending on the shares for those map-key attributes that are missing from their schema. We study the problem of optimizing the shares, given a fixed number of Reduce processes. An algorithm for detecting and fixing problems where a variable is mistakenly included in the map-key is given. Then, we consider two important special cases: chain joins and star joins. In each case, we are able to determine the map-key and determine the shares that yield the least replication. While the method we propose is not always superior to the conventional way of using map-reduce to implement joins, there are some important cases involving large-scale data where our method wins, including: 1) analytic queries in which a very large fact table is joined with smaller dimension tables, and 2) queries involving paths through graphs with high out-degree, such as the Web or a social network.

...read moreread less

Journal Article•DOI•

A Web Search Engine-Based Approach to Measure Semantic Similarity between Words

[...]

Danushka Bollegala¹, Yutaka Matsuo¹, Mitsuru Ishizuka¹•Institutions (1)

University of Tokyo¹

01 Jul 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This work proposes an empirical method to estimate semantic similarity using page counts and text snippets retrieved from a web search engine for two words, and proposes a novel pattern extraction algorithm and a pattern clustering algorithm that significantly improves the accuracy in a community mining task.

...read moreread less

Abstract: Measuring the semantic similarity between words is an important component in various tasks on the web such as relation extraction, community mining, document clustering, and automatic metadata extraction. Despite the usefulness of semantic similarity measures in these applications, accurately measuring semantic similarity between two words (or entities) remains a challenging task. We propose an empirical method to estimate semantic similarity using page counts and text snippets retrieved from a web search engine for two words. Specifically, we define various word co-occurrence measures using page counts and integrate those with lexical patterns extracted from text snippets. To identify the numerous semantic relations that exist between two given words, we propose a novel pattern extraction algorithm and a pattern clustering algorithm. The optimal combination of page counts-based co-occurrence measures and lexical pattern clusters is learned using support vector machines. The proposed method outperforms various baselines and previously proposed web-based semantic similarity measures on three benchmark data sets showing a high correlation with human ratings. Moreover, the proposed method significantly improves the accuracy in a community mining task.

...read moreread less

Journal Article•DOI•

Laplacian Regularized Gaussian Mixture Model for Data Clustering

[...]

Xiaofei He¹, Deng Cai¹, Yuanlong Shao¹, Hujun Bao¹, Jiawei Han² - Show less +1 more•Institutions (2)

Zhejiang University¹, University of Illinois at Urbana–Champaign²

01 Sep 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This paper introduces a regularized probabilistic model based on manifold structure for data clustering, called Laplacian regularized Gaussian Mixture Model (LapGMM), which is modeled by a nearest neighbor graph, and the graph structure is incorporated in the maximum likelihood objective function.

...read moreread less

Abstract: Gaussian Mixture Models (GMMs) are among the most statistically mature methods for clustering. Each cluster is represented by a Gaussian distribution. The clustering process thereby turns to estimate the parameters of the Gaussian mixture, usually by the Expectation-Maximization algorithm. In this paper, we consider the case where the probability distribution that generates the data is supported on a submanifold of the ambient space. It is natural to assume that if two points are close in the intrinsic geometry of the probability distribution, then their conditional probability distributions are similar. Specifically, we introduce a regularized probabilistic model based on manifold structure for data clustering, called Laplacian regularized Gaussian Mixture Model (LapGMM). The data manifold is modeled by a nearest neighbor graph, and the graph structure is incorporated in the maximum likelihood objective function. As a result, the obtained conditional probability distribution varies smoothly along the geodesics of the data manifold. Experimental results on real data sets demonstrate the effectiveness of the proposed approach.

...read moreread less

Journal Article•DOI•

Discovering Conditional Functional Dependencies

[...]

Wenfei Fan¹, Floris Geerts¹, Jianzhong Li², Ming Xiong²•Institutions (2)

University of Edinburgh¹, Bell Labs²

01 May 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This paper provides three methods for CFD discovery, based on techniques for mining closed item sets, and is used to discover constant CFDs, namely, CFDs with constant patterns only.

...read moreread less

Abstract: This paper investigates the discovery of conditional functional dependencies (CFDs). CFDs are a recent extension of functional dependencies (FDs) by supporting patterns of semantically related constants, and can be used as rules for cleaning relational data. However, finding quality CFDs is an expensive process that involves intensive manual effort. To effectively identify data cleaning rules, we develop techniques for discovering CFDs from relations. Already hard for traditional FDs, the discovery problem is more difficult for CFDs. Indeed, mining patterns in CFDs introduces new challenges. We provide three methods for CFD discovery. The first, referred to as CFDMiner, is based on techniques for mining closed item sets, and is used to discover constant CFDs, namely, CFDs with constant patterns only. Constant CFDs are particularly important for object identification, which is essential to data cleaning and data integration. The other two algorithms are developed for discovering general CFDs. One algorithm, referred to as CTANE, is a levelwise algorithm that extends TANE, a well-known algorithm for mining FDs. The other, referred to as FastCFD, is based on the depth-first approach used in FastFD, a method for discovering FDs. It leverages closed-item-set mining to reduce the search space. As verified by our experimental study, CFDMiner can be multiple orders of magnitude faster than CTANE and FastCFD for constant CFD discovery. CTANE works well when a given relation is large, but it does not scale well with the arity of the relation. FastCFD is far more efficient than CTANE when the arity of the relation is large; better still, leveraging optimization based on closed-item-set mining, FastCFD also scales well with the size of the relation. These algorithms provide a set of cleaning-rule discovery tools for users to choose for different applications.

...read moreread less

Journal Article•DOI•

Decision Trees for Uncertain Data

[...]

Smith Tsang¹, Ben Kao¹, Kevin Y. Yip², Wai-Shing Ho¹, Sau Dan Lee¹ - Show less +1 more•Institutions (2)

University of Hong Kong¹, Yale University²

01 Jan 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This work discovers that the accuracy of a decision tree classifier can be much improved if the "complete information" of a data item (taking into account the probability density function (pdf)) is utilized.

...read moreread less

Abstract: Traditional decision tree classifiers work with data whose values are known and precise. We extend such classifiers to handle data with uncertain information. Value uncertainty arises in many applications during the data collection process. Example sources of uncertainty include measurement/quantization errors, data staleness, and multiple repeated measurements. With uncertainty, the value of a data item is often represented not by one single value, but by multiple values forming a probability distribution. Rather than abstracting uncertain data by statistical derivatives (such as mean and median), we discover that the accuracy of a decision tree classifier can be much improved if the "complete information" of a data item (taking into account the probability density function (pdf)) is utilized. We extend classical decision tree building algorithms to handle data tuples with uncertain values. Extensive experiments have been conducted which show that the resulting classifiers are more accurate than those using value averages. Since processing pdfs is computationally more costly than processing single values (e.g., averages), decision tree construction on uncertain data is more CPU demanding than that for certain data. To tackle this problem, we propose a series of pruning techniques that can greatly improve construction efficiency.

...read moreread less

Journal Article•DOI•

Data Leakage Detection

[...]

Panagiotis Papadimitriou¹, Hector Garcia-Molina¹•Institutions (1)

Stanford University¹

01 Jan 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This work proposes data allocation strategies (across the agents) that improve the probability of identifying leakages and can also inject “realistic but fake” data records to further improve the chances of detecting leakage and identifying the guilty party.

...read moreread less

Abstract: We study the following problem: A data distributor has given sensitive data to a set of supposedly trusted agents (third parties). Some of the data are leaked and found in an unauthorized place (e.g., on the web or somebody's laptop). The distributor must assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. We propose data allocation strategies (across the agents) that improve the probability of identifying leakages. These methods do not rely on alterations of the released data (e.g., watermarks). In some cases, we can also inject “realistic but fake” data records to further improve our chances of detecting leakage and identifying the guilty party.

...read moreread less

Journal Article•DOI•

Efficient Relevance Feedback for Content-Based Image Retrieval by Mining User Navigation Patterns

[...]

Ja-Hwung Su¹, Wei-Jyun Huang¹, Philip S. Yu², Vincent S. Tseng¹•Institutions (2)

National Cheng Kung University¹, University of Illinois at Chicago²

01 Mar 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This paper proposes a novel method, Navigation-Pattern-based Relevance Feedback (NPRF), to achieve the high efficiency and effectiveness of CBIR in coping with the large-scale image data and reveals that NPRF outperforms other existing methods significantly in terms of precision, coverage, and number of feedbacks.

...read moreread less

Abstract: Nowadays, content-based image retrieval (CBIR) is the mainstay of image retrieval systems. To be more profitable, relevance feedback techniques were incorporated into CBIR such that more precise results can be obtained by taking user's feedbacks into account. However, existing relevance feedback-based CBIR methods usually request a number of iterative feedbacks to produce refined search results, especially in a large-scale image database. This is impractical and inefficient in real applications. In this paper, we propose a novel method, Navigation-Pattern-based Relevance Feedback (NPRF), to achieve the high efficiency and effectiveness of CBIR in coping with the large-scale image data. In terms of efficiency, the iterations of feedback are reduced substantially by using the navigation patterns discovered from the user query log. In terms of effectiveness, our proposed search algorithm NPRFSearch makes use of the discovered navigation patterns and three kinds of query refinement strategies, Query Point Movement (QPM), Query Reweighting (QR), and Query Expansion (QEX), to converge the search space toward the user's intention effectively. By using NPRF method, high quality of image retrieval on RF can be achieved in a small number of feedbacks. The experimental results reveal that NPRF outperforms other existing methods significantly in terms of precision, coverage, and number of feedbacks.

...read moreread less

Journal Article•DOI•

Selecting Attributes for Sentiment Classification Using Feature Relation Networks

[...]

Ahmed Abbasi¹, Zhu Zhang², Hsinchun Chen²•Institutions (2)

University of Wisconsin–Milwaukee¹, University of Arizona²

01 Mar 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: A rule-based multivariate text feature selection method called Feature Relation Network (FRN) that considers semantic information and also leverages the syntactic relationships between n-gram features to efficiently enable the inclusion of extended sets of heterogeneous n- gram features for enhanced sentiment classification.

...read moreread less

Abstract: A major concern when incorporating large sets of diverse n-gram features for sentiment classification is the presence of noisy, irrelevant, and redundant attributes. These concerns can often make it difficult to harness the augmented discriminatory potential of extended feature sets. We propose a rule-based multivariate text feature selection method called Feature Relation Network (FRN) that considers semantic information and also leverages the syntactic relationships between n-gram features. FRN is intended to efficiently enable the inclusion of extended sets of heterogeneous n-gram features for enhanced sentiment classification. Experiments were conducted on three online review testbeds in comparison with methods used in prior sentiment classification research. FRN outperformed the comparison univariate, multivariate, and hybrid feature selection methods; it was able to select attributes resulting in significantly better classification accuracy irrespective of the feature subset sizes. Furthermore, by incorporating syntactic information about n-gram relations, FRN is able to select features in a more computationally efficient manner than many multivariate and hybrid techniques.

...read moreread less

Journal Article•DOI•

MAP-JOIN-REDUCE: Toward Scalable and Efficient Data Analysis on Large Clusters

[...]

Dawei Jiang¹, Anthony K. H. Tung¹, Gang Chen²•Institutions (2)

National University of Singapore¹, Zhejiang University²

01 Sep 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This paper introduces Map-Join-Reduce, a system that extends and improves MapReduce runtime framework to efficiently process complex data analysis tasks on large clusters and presents a new data processing strategy which performs filtering-join-aggregation tasks in two successive Map Reduce jobs.

...read moreread less

Abstract: Data analysis is an important functionality in cloud computing which allows a huge amount of data to be processed over very large clusters. MapReduce is recognized as a popular way to handle data in the cloud environment due to its excellent scalability and good fault tolerance. However, compared to parallel databases, the performance of MapReduce is slower when it is adopted to perform complex data analysis tasks that require the joining of multiple data sets in order to compute certain aggregates. A common concern is whether MapReduce can be improved to produce a system with both scalability and efficiency. In this paper, we introduce Map-Join-Reduce, a system that extends and improves MapReduce runtime framework to efficiently process complex data analysis tasks on large clusters. We first propose a filtering-join-aggregation programming model, a natural extension of MapReduce's filtering-aggregation programming model. Then, we present a new data processing strategy which performs filtering-join-aggregation tasks in two successive MapReduce jobs. The first job applies filtering logic to all the data sets in parallel, joins the qualified tuples, and pushes the join results to the reducers for partial aggregation. The second job combines all partial aggregation results and produces the final answer. The advantage of our approach is that we join multiple data sets in one go and thus avoid frequent checkpointing and shuffling of intermediate results, a major performance bottleneck in most of the current MapReduce-based systems. We benchmark our system against Hive, a state-of-the-art MapReduce-based data warehouse on a 100-node cluster on Amazon EC2 using TPC-H benchmark. The results show that our approach significantly boosts the performance of complex analysis queries.

...read moreread less

Journal Article•DOI•

A Fuzzy Self-Constructing Feature Clustering Algorithm for Text Classification

[...]

Jung-Yi Jiang¹, Ren-Jia Liou¹, Shie-Jue Lee¹•Institutions (1)

National Sun Yat-sen University¹

01 Mar 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: A fuzzy similarity-based self-constructing algorithm for feature clustering that can run faster and obtain better extracted features than other methods, and the user need not specify the number of extracted features in advance.

...read moreread less

Abstract: Feature clustering is a powerful method to reduce the dimensionality of feature vectors for text classification. In this paper, we propose a fuzzy similarity-based self-constructing algorithm for feature clustering. The words in the feature vector of a document set are grouped into clusters, based on similarity test. Words that are similar to each other are grouped into the same cluster. Each cluster is characterized by a membership function with statistical mean and deviation. When all the words have been fed in, a desired number of clusters are formed automatically. We then have one extracted feature for each cluster. The extracted feature, corresponding to a cluster, is a weighted combination of the words contained in the cluster. By this algorithm, the derived membership functions match closely with and describe properly the real distribution of the training data. Besides, the user need not specify the number of extracted features in advance, and trial-and-error for determining the appropriate number of extracted features can then be avoided. Experimental results show that our method can run faster and obtain better extracted features than other methods.

...read moreread less

Journal Article•DOI•

Kernelized Fuzzy Rough Sets and Their Applications

[...]

Qinghua Hu¹, Daren Yu¹, Witold Pedrycz², Degang Chen•Institutions (2)

Harbin Institute of Technology¹, University of Alberta²

01 Nov 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This study integrates kernel functions with fuzzy rough set models and proposes two types of kernelized fuzzy rough sets, and extends the measures existing in classical rough sets to evaluate the approximation quality and approximation abilities of the attributes.

...read moreread less

Abstract: Kernel machines and rough sets are two classes of commonly exploited learning techniques. Kernel machines enhance traditional learning algorithms by bringing opportunities to deal with nonlinear classification problems, rough sets introduce a human-focused way to deal with uncertainty in learning problems. Granulation and approximation play a pivotal role in rough sets-based learning and reasoning. However, a way how to effectively generate fuzzy granules from data has not been fully studied so far. In this study, we integrate kernel functions with fuzzy rough set models and propose two types of kernelized fuzzy rough sets. Kernel functions are employed to compute the fuzzy T-equivalence relations between samples, thus generating fuzzy information granules in the approximation space. Subsequently fuzzy granules are used to approximate the classification based on the concepts of fuzzy lower and upper approximations. Based on the models of kernelized fuzzy rough sets, we extend the measures existing in classical rough sets to evaluate the approximation quality and approximation abilities of the attributes. We discuss the relationship between these measures and feature evaluation function ReliefF, and augment the ReliefF algorithm to enhance the robustness of these proposed measures. Finally, we apply these measures to evaluate and select features for classification problems. The experimental results help quantify the performance of the KFRS.

...read moreread less

Journal Article•DOI•

On the Design and Analysis of the Privacy-Preserving SVM Classifier

[...]

Keng-Pei Lin¹, Ming-Syan Chen¹•Institutions (1)

National Taiwan University¹

01 Nov 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This paper proposes an approach to postprocess the SVM classifier to transform it to a privacy-preserving classifier which does not disclose the private content of support vectors, and introduces the Privacy-Preserving SVM Classifier (abbreviated as PPSVC), designed for the commonly used Gaussian kernel function.

...read moreread less

Abstract: The support vector machine (SVM) is a widely used tool in classification problems. The SVM trains a classifier by solving an optimization problem to decide which instances of the training data set are support vectors, which are the necessarily informative instances to form the SVM classifier. Since support vectors are intact tuples taken from the training data set, releasing the SVM classifier for public use or shipping the SVM classifier to clients will disclose the private content of support vectors. This violates the privacy-preserving requirements for some legal or commercial reasons. The problem is that the classifier learned by the SVM inherently violates the privacy. This privacy violation problem will restrict the applicability of the SVM. To the best of our knowledge, there has not been work extending the notion of privacy preservation to tackle this inherent privacy violation problem of the SVM classifier. In this paper, we exploit this privacy violation problem, and propose an approach to postprocess the SVM classifier to transform it to a privacy-preserving classifier which does not disclose the private content of support vectors. The postprocessed SVM classifier without exposing the private content of training data is called Privacy-Preserving SVM Classifier (abbreviated as PPSVC). The PPSVC is designed for the commonly used Gaussian kernel function. It precisely approximates the decision function of the Gaussian kernel SVM classifier without exposing the sensitive attribute values possessed by support vectors. By applying the PPSVC, the SVM classifier is able to be publicly released while preserving privacy. We prove that the PPSVC is robust against adversarial attacks. The experiments on real data sets show that the classification accuracy of the PPSVC is comparable to the original SVM classifier.

...read moreread less

Journal Article•DOI•

A Pattern Mining Approach to Sensor-Based Human Activity Recognition

[...]

Tao Gu¹, Liang Wang², Zhanqing Wu², Xianping Tao², Jian Lu² - Show less +1 more•Institutions (2)

University of Southern Denmark¹, Nanjing University²

01 Sep 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This paper proposes a novel pattern mining approach to recognize sequential, interleaved, and concurrent activities in a unified framework and exploits Emerging Pattern-a discriminative pattern that describes significant changes between classes of data-to identify sensor features for classifying activities.

...read moreread less

Abstract: Recognizing human activities from sensor readings has recently attracted much research interest in pervasive computing due to its potential in many applications, such as assistive living and healthcare. This task is particularly challenging because human activities are often performed in not only a simple (i.e., sequential), but also a complex (i.e., interleaved or concurrent) manner in real life. Little work has been done in addressing complex issues in such a situation. The existing models of interleaved and concurrent activities are typically learning-based. Such models lack of flexibility in real life because activities can be interleaved and performed concurrently in many different ways. In this paper, we propose a novel pattern mining approach to recognize sequential, interleaved, and concurrent activities in a unified framework. We exploit Emerging Pattern-a discriminative pattern that describes significant changes between classes of data-to identify sensor features for classifying activities. Different from existing learning-based approaches which require different training data sets for building activity models, our activity models are built upon the sequential activity trace only and can be applied to recognize both simple and complex activities. We conduct our empirical studies by collecting real-world traces, evaluating the performance of our algorithm, and comparing our algorithm with static and temporal models. Our results demonstrate that, with a time slice of 15 seconds, we achieve an accuracy of 90.96 percent for sequential activity, 88.1 percent for interleaved activity, and 82.53 percent for concurrent activity.

...read moreread less

Journal Article•DOI•

A Hidden Topic-Based Framework toward Building Applications with Short Web Documents

[...]

Xuan-Hieu Phan¹, Cam-Tu Nguyen¹, Dieu-Thu Le², Le-Minh Nguyen³, Susumu Horiguchi¹, Quang-Thuy Ha⁴ - Show less +2 more•Institutions (4)

Tohoku University¹, University of Trento², Japan Advanced Institute of Science and Technology³, Vietnam National University, Hanoi⁴

01 Jul 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: A hidden topic-based framework for processing short and sparse documents on the Web that common hidden topics discovered from large external data sets (universal data sets), when included, can make short documents less sparse and more topic-oriented.

...read moreread less

Abstract: This paper introduces a hidden topic-based framework for processing short and sparse documents (e.g., search result snippets, product descriptions, book/movie summaries, and advertising messages) on the Web. The framework focuses on solving two main challenges posed by these kinds of documents: 1) data sparseness and 2) synonyms/homonyms. The former leads to the lack of shared words and contexts among documents while the latter are big linguistic obstacles in natural language processing (NLP) and information retrieval (IR). The underlying idea of the framework is that common hidden topics discovered from large external data sets (universal data sets), when included, can make short documents less sparse and more topic-oriented. Furthermore, hidden topics from universal data sets help handle unseen data better. The proposed framework can also be applied for different natural languages and data domains. We carefully evaluated the framework by carrying out two experiments for two important online applications (Web search result classification and matching/ranking for contextual advertising) with large-scale universal data sets and we achieved significant results.

...read moreread less

Journal Article•DOI•

Efficient Periodicity Mining in Time Series Databases Using Suffix Trees

[...]

Faraz Rasheed¹, Mohammed Alshalalfa¹, Reda Alhajj¹•Institutions (1)

University of Calgary¹

01 Jan 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This paper presents an algorithm which can detect symbol, sequence (partial), and segment (full cycle) periodicity in time series and is noise resilient; it is generally more time-efficient and noise-resilient than existing algorithms.

...read moreread less

Abstract: Periodic pattern mining or periodicity detection has a number of applications, such as prediction, forecasting, detection of unusual activities, etc. The problem is not trivial because the data to be analyzed are mostly noisy and different periodicity types (namely symbol, sequence, and segment) are to be investigated. Accordingly, we argue that there is a need for a comprehensive approach capable of analyzing the whole time series or in a subsection of it to effectively handle different types of noise (to a certain degree) and at the same time is able to detect different types of periodic patterns; combining these under one umbrella is by itself a challenge. In this paper, we present an algorithm which can detect symbol, sequence (partial), and segment (full cycle) periodicity in time series. The algorithm uses suffix tree as the underlying data structure; this allows us to design the algorithm such that its worstcase complexity is O(k.n2), where k is the maximum length of periodic pattern and n is the length of the analyzed portion (whole or subsection) of the time series. The algorithm is noise resilient; it has been successfully demonstrated to work with replacement, insertion, deletion, or a mixture of these types of noise. We have tested the proposed algorithm on both synthetic and real data from different domains, including protein sequences. The conducted comparative study demonstrate the applicability and effectiveness of the proposed algorithm; it is generally more time-efficient and noise-resilient than existing algorithms.

...read moreread less

Journal Article•DOI•

A Personalized Ontology Model for Web Information Gathering

[...]

Xiaohui Tao¹, Yuefeng Li¹, Ning Zhong²•Institutions (2)

Queensland University of Technology¹, Maebashi Institute of Technology²

01 Apr 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: A personalized ontology model is proposed for knowledge representation and reasoning over user profiles that learns ontological user profiles from both a world knowledge base and user local instance repositories.

...read moreread less

Abstract: As a model for knowledge description and formalization, ontologies are widely used to represent user profiles in personalized web information gathering. However, when representing user profiles, many models have utilized only knowledge from either a global knowledge base or a user local information. In this paper, a personalized ontology model is proposed for knowledge representation and reasoning over user profiles. This model learns ontological user profiles from both a world knowledge base and user local instance repositories. The ontology model is evaluated by comparing it against benchmark models in web information gathering. The results show that this ontology model is successful.

...read moreread less

Journal Article•DOI•

Classification Using Streaming Random Forests

[...]

H Abdulsalam¹, David B. Skillicorn², Patrick Martin²•Institutions (2)

Kuwait University¹, Queen's University²

01 Jan 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: A stream classification algorithm that is online, running in amortized O(1) time, able to handle intermittent arrival of labeled records, and able to adjust its parameters to respond to changing class boundaries (“concept drift”) in the data stream is introduced.

...read moreread less

Abstract: We consider the problem of data stream classification, where the data arrive in a conceptually infinite stream, and the opportunity to examine each record is brief. We introduce a stream classification algorithm that is online, running in amortized O(1) time, able to handle intermittent arrival of labeled records, and able to adjust its parameters to respond to changing class boundaries (“concept drift”) in the data stream. In addition, when blocks of labeled data are short, the algorithm is able to judge internally whether the quality of models updated from them is good enough for deployment on unlabeled records, or whether further labeled records are required. Unlike most proposed stream-classification algorithms, multiple target classes can be handled. Experimental results on real and synthetic data show that accuracy is comparable to a conventional classification algorithm that sees all of the data at once and is able to make multiple passes over it.

...read moreread less

Journal Article•DOI•

Text Clustering with Seeds Affinity Propagation

[...]

Renchu Guan¹, Xiaohu Shi¹, Maurizio Marchese², Chen Yang¹, Yanchun Liang¹ - Show less +1 more•Institutions (2)

Jilin University¹, University of Trento²

01 Apr 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: A new similarity metric that captures the structural information of texts, and a novel seed construction method to improve the semisupervised clustering process are presented, which show that the proposed similarity metric is more effective in text clustering.

...read moreread less

Abstract: Based on an effective clustering algorithm-Affinity Propagation (AP)-we present in this paper a novel semisupervised text clustering algorithm, called Seeds Affinity Propagation (SAP). There are two main contributions in our approach: 1) a new similarity metric that captures the structural information of texts, and 2) a novel seed construction method to improve the semisupervised clustering process. To study the performance of the new algorithm, we applied it to the benchmark data set Reuters-21578 and compared it to two state-of-the-art clustering algorithms, namely, k-means algorithm and the original AP algorithm. Furthermore, we have analyzed the individual impact of the two proposed contributions. Results show that the proposed similarity metric is more effective in text clustering (F-measures ca. 21 percent higher than in the AP algorithm) and the proposed semisupervised strategy achieves both better clustering results and faster convergence (using only 76 percent iterations of the original AP). The complete SAP algorithm obtains higher F-measure (ca. 40 percent improvement over k-means and AP) and lower entropy (ca. 28 percent decrease over k-means and AP), improves significantly clustering execution time (20 times faster) in respect that k-means, and provides enhanced robustness compared with all other methods.

...read moreread less

Journal Article•DOI•

Optimal Service Pricing for a Cloud Cache

[...]

Verena Kantere¹, Debabrata Dash, Grégory François², S. Kyriakopoulou², Anastasia Ailamaki² - Show less +1 more•Institutions (2)

Cyprus University of Technology¹, École Normale Supérieure²

01 Sep 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: A novel price-demand model designed for a cloud cache and a dynamic pricing scheme for queries executed in the cloud cache is proposed that employs a novel method that estimates the correlations of the cache services in an time-efficient manner.

...read moreread less

Abstract: Cloud applications that offer data management services are emerging. Such clouds support caching of data in order to provide quality query services. The users can query the cloud data, paying the price for the infrastructure they use. Cloud management necessitates an economy that manages the service of multiple users in an efficient, but also, resource-economic way that allows for cloud profit. Naturally, the maximization of cloud profit given some guarantees for user satisfaction presumes an appropriate price-demand model that enables optimal pricing of query services. The model should be plausible in that it reflects the correlation of cache structures involved in the queries. Optimal pricing is achieved based on a dynamic pricing scheme that adapts to time changes. This paper proposes a novel price-demand model designed for a cloud cache and a dynamic pricing scheme for queries executed in the cloud cache. The pricing solution employs a novel method that estimates the correlations of the cache services in an time-efficient manner. The experimental study shows the efficiency of the solution.

...read moreread less