
Showing papers in "Knowledge and Information Systems in 2013"


Journal ArticleDOI
TL;DR: Several synthetic datasets are employed for this purpose, aiming at reviewing the performance of feature selection methods in the presence of a growing number of irrelevant features, noise in the data, redundancy and interaction between attributes, as well as a small ratio between the number of samples and the number of features.
Abstract: With the advent of high dimensionality, adequate identification of relevant features of the data has become indispensable in real-world scenarios. In this context, the importance of feature selection is beyond doubt and different methods have been developed. However, with such a vast body of algorithms available, choosing the adequate feature selection method is not an easy question to resolve, and it is necessary to check their effectiveness in different situations. Nevertheless, the assessment of relevant features is difficult in real datasets, so an interesting option is to use artificial data. In this paper, several synthetic datasets are employed for this purpose, aiming at reviewing the performance of feature selection methods in the presence of a growing number of irrelevant features, noise in the data, redundancy and interaction between attributes, as well as a small ratio between the number of samples and the number of features. Seven filters, two embedded methods, and two wrappers are applied over eleven synthetic datasets, tested by four classifiers, so as to be able to choose a robust method, paving the way for its application to real datasets.

637 citations


Journal ArticleDOI
TL;DR: The literature is surveyed to highlight recent advances in transfer learning for activity recognition, and existing approaches to transfer-based activity recognition are characterized by sensor modality, by differences between source and target environments, by data availability, and by type of information that is transferred.
Abstract: Many intelligent systems that focus on the needs of a human require information about the activities being performed by the human. At the core of this capability is activity recognition, which is a challenging and well-researched problem. Activity recognition algorithms require substantial amounts of labeled training data yet need to perform well under very diverse circumstances. As a result, researchers have been designing methods to identify and utilize subtle connections between activity recognition datasets, or to perform transfer-based activity recognition. In this paper, we survey the literature to highlight recent advances in transfer learning for activity recognition. We characterize existing approaches to transfer-based activity recognition by sensor modality, by differences between source and target environments, by data availability, and by type of information that is transferred. Finally, we present some grand challenges for the community to consider as this field is further developed.

395 citations


Journal ArticleDOI
TL;DR: This survey intends to provide a high-level summarization for active learning and to motivate interested readers to consider instance-selection approaches for designing effective active learning solutions.
Abstract: Active learning aims to train an accurate prediction model with minimum cost by labeling the most informative instances. In this paper, we survey existing works on active learning from an instance-selection perspective and classify them into two categories with a progressive relationship: (1) active learning merely based on uncertainty of independent and identically distributed (IID) instances, and (2) active learning by further taking into account instance correlations. Using the above categorization, we summarize major approaches in the field, along with their technical strengths/weaknesses, followed by a simple runtime performance comparison, and discussion about emerging active learning applications and instance-selection challenges therein. This survey intends to provide a high-level summarization for active learning and to motivate interested readers to consider instance-selection approaches for designing effective active learning solutions.

302 citations
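To make the first category concrete, the sketch below is a minimal, assumed illustration of uncertainty sampling over IID instances (not code from the survey): a scikit-learn classifier repeatedly queries the pool instance with the lowest top-class probability.

```python
# Minimal uncertainty-sampling sketch (illustrative, not from the survey).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
pool = [i for i in range(len(X)) if i not in labeled]

for _ in range(20):                          # 20 querying rounds
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[pool])
    confidence = proba.max(axis=1)           # low top-class probability
    query = pool[int(np.argmin(confidence))] # = most informative instance
    labeled.append(query)                    # the oracle labels it
    pool.remove(query)

print("accuracy after querying:", model.score(X, y))
```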


Journal ArticleDOI
TL;DR: Novel topic-aware influence-driven propagation models that are more accurate in describing real-world cascades than the standard (i.e., topic-blind) propagation models studied in the literature are introduced.
Abstract: The study of influence-driven propagations in social networks and its exploitation for viral marketing purposes has recently received a great deal of attention. However, regardless of the fact that users' authoritativeness, expertise, trust and influence are evidently topic-dependent, the research on social influence has surprisingly largely overlooked this aspect. In this article, we study social influence from a topic modeling perspective. We introduce novel topic-aware influence-driven propagation models that, as we show in our experiments, are more accurate in describing real-world cascades than the standard (i.e., topic-blind) propagation models studied in the literature. In particular, we first propose simple topic-aware extensions of the well-known Independent Cascade and Linear Threshold models. However, these propagation models have a very large number of parameters, which could lead to overfitting. Therefore, we propose a different approach explicitly modeling authoritativeness, influence and relevance under a topic-aware perspective. Instead of considering user-to-user influence, the proposed model focuses on user authoritativeness and interests in a topic, leading to a drastic reduction in the number of parameters of the model. We devise methods to learn the parameters of the models from a data set of past propagations. Our experimentation confirms the high accuracy of the proposed models and learning schemes.

257 citations
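To illustrate the topic-aware idea, here is a minimal sketch, under assumed data structures, of a topic-aware Independent Cascade step: each edge carries per-topic influence probabilities, mixed by the propagating item's topic distribution. The graph, probabilities, and mixture are invented for illustration.

```python
# Sketch of a topic-aware Independent Cascade (illustrative only).
import random

# Per-topic influence probabilities on each edge: p[(u, v)][z].
p = {("a", "b"): [0.8, 0.1], ("b", "c"): [0.2, 0.7], ("a", "c"): [0.1, 0.1]}
gamma = [0.3, 0.7]  # topic mixture of the item being propagated (assumed)

def edge_prob(u, v):
    # Topic-blind IC uses a single p_uv; the topic-aware variant mixes
    # per-topic probabilities with the item's topic distribution.
    return sum(g * pz for g, pz in zip(gamma, p[(u, v)]))

def cascade(seeds):
    active, frontier = set(seeds), list(seeds)
    while frontier:
        u = frontier.pop()
        for (s, t) in p:
            if s == u and t not in active and random.random() < edge_prob(s, t):
                active.add(t)
                frontier.append(t)
    return active

random.seed(0)
print(cascade({"a"}))
```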


Journal ArticleDOI
TL;DR: In this article, an unsupervised outlier detection approach for wireless sensor networks is proposed, which is flexible with respect to the outlier definition and uses only single-hop communication, thus permitting very simple node failure detection and message reliability assurance mechanisms.
Abstract: To address the problem of unsupervised outlier detection in wireless sensor networks, we develop an approach that (1) is flexible with respect to the outlier definition, (2) computes the result in-network to reduce both bandwidth and energy consumption, (3) uses only single-hop communication, thus permitting very simple node failure detection and message reliability assurance mechanisms (e.g., carrier-sense), and (4) seamlessly accommodates dynamic updates to data. We examine performance by simulation, using real sensor data streams. Our results demonstrate that our approach is accurate and imposes reasonable communication and power consumption demands.

217 citations


Journal ArticleDOI
TL;DR: A new SVDD-based approach to detect outliers on uncertain data that outperforms state-of-the-art outlier detection techniques and reduces the contribution of the examples with the lowest confidence scores to the construction of the decision boundary.
Abstract: Outlier detection is an important problem that has been studied within diverse research areas and application domains. Most existing methods are based on the assumption that an example can be exactly categorized as either a normal class or an outlier. However, in many real-life applications, data are uncertain in nature due to various errors or partial completeness. This data uncertainty makes the detection of outliers far more difficult than it is from clearly separable data. The key challenge of handling uncertain data in outlier detection is how to reduce the impact of uncertain data on the learned distinctive classifier. This paper proposes a new SVDD-based approach to detect outliers on uncertain data. The proposed approach operates in two steps. In the first step, a pseudo-training set is generated by assigning a confidence score to each input example, which indicates the likelihood of an example tending toward the normal class. In the second step, the generated confidence score is incorporated into the support vector data description training phase to construct a global decision boundary that reduces the contribution of the least confident examples.

127 citations
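A hedged sketch of the two-step idea follows. scikit-learn's OneClassSVM, a close relative of SVDD, stands in for the paper's model because its fit method accepts per-example sample weights; the k-nearest-neighbor confidence heuristic is an assumption, not the paper's scoring formula.

```python
# Two-step sketch: confidence scoring, then weighted one-class training.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),    # mostly normal points
               rng.uniform(-6, 6, (10, 2))])  # a few uncertain/outlying ones

# Step 1: pseudo-training set - score each example by how tightly packed
# its neighborhood is (closer neighbors => more likely a normal example).
dists, _ = NearestNeighbors(n_neighbors=6).fit(X).kneighbors(X)
conf = 1.0 / (1.0 + dists[:, 1:].mean(axis=1))

# Step 2: low-confidence examples contribute less to the decision boundary.
model = OneClassSVM(gamma=0.2, nu=0.1).fit(X, sample_weight=conf)
print("flagged outliers:", int((model.predict(X) == -1).sum()))
```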


Journal ArticleDOI
TL;DR: Devising a mechanism for computing the semantic similarity of the OSM geographic classes can help alleviate this semantic gap, and empirical evidence supports the usage of co-citation algorithms—SimRank showing the highest plausibility—to compute concept similarity in a crowdsourced semantic network.
Abstract: In recent years, a web phenomenon known as Volunteered Geographic Information (VGI) has produced large crowdsourced geographic data sets. OpenStreetMap (OSM), the leading VGI project, aims at building an open-content world map through user contributions. OSM semantics consists of a set of properties (called 'tags') describing geographic classes, whose usage is defined by project contributors on a dedicated Wiki website. Because of its simple and open semantic structure, the OSM approach often results in noisy and ambiguous data, limiting its usability for analysis in information retrieval, recommender systems and data mining. Devising a mechanism for computing the semantic similarity of the OSM geographic classes can help alleviate this semantic gap. The contribution of this paper is twofold. It consists of (1) the development of the OSM Semantic Network by means of a web crawler tailored to the OSM Wiki website; this semantic network can be used to compute semantic similarity through co-citation measures, providing a novel semantic tool for OSM and GIS communities; and (2) a study of the cognitive plausibility (i.e., the ability to replicate human judgement) of co-citation algorithms when applied to the computation of semantic similarity of geographic concepts. Empirical evidence supports the usage of co-citation algorithms—SimRank showing the highest plausibility—to compute concept similarity in a crowdsourced semantic network.

126 citations
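As an illustration of the co-citation approach, the sketch below runs NetworkX's SimRank implementation over a toy stand-in for the OSM Semantic Network; the nodes and wiki links are invented, and the real network would come from the paper's crawler.

```python
# SimRank over a toy stand-in for the OSM Semantic Network (illustrative).
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("Map_Features", "Key:highway"),
    ("Map_Features", "Key:building"),
    ("Key:highway", "highway=footway"),
    ("Key:highway", "highway=cycleway"),
    ("Key:building", "building=church"),
])

# Two classes are similar if they are linked from similar pages
# (the co-citation intuition behind SimRank).
sim = nx.simrank_similarity(G, importance_factor=0.8)
print(sim["highway=footway"]["highway=cycleway"])  # high: shared in-link
print(sim["highway=footway"]["building=church"])   # lower
```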


Journal ArticleDOI
TL;DR: This paper presents and analyzes several typical uncertain queries, such as skyline queries, top-k queries, nearest-neighbor queries, aggregate queries, join queries, range queries, and threshold queries over uncertain data, and summarizes the main features of uncertain queries.
Abstract: Uncertain data now exist widely in many practical applications, such as sensor networks, RFID networks, location-based services, and mobile object management. Query processing over uncertain data, as an important aspect of uncertain data management, has received increasing attention in the database field. Uncertain query processing poses inherent challenges and demands non-traditional techniques, due to the data uncertainty. This paper surveys this interesting and still evolving research area in the current database community, so that readers can easily obtain an overview of the state-of-the-art techniques. We first provide an overview of data uncertainty, including uncertainty types, probability representation models, and sources of probabilities. We next outline the current major types of uncertain queries and summarize the main features of uncertain queries. In particular, we present and analyze several typical uncertain queries in detail, such as skyline queries, top-k queries, nearest-neighbor queries, aggregate queries, join queries, range queries, and threshold queries over uncertain data. Finally, we present many interesting research topics on uncertain queries that have not yet been explored.

114 citations


Journal ArticleDOI
TL;DR: A travel route recommendation method that makes use of the photographers’ histories as held by social photo-sharing sites and outputs a set of personalized travel plans that match the user’s preference, present location, spare time and transportation means.
Abstract: We propose a travel route recommendation method that makes use of the photographers' histories as held by social photo-sharing sites. Assuming that the collection of each photographer's geotagged photos is a sequence of visited locations, photo-sharing sites are important sources for gathering the location histories of tourists. By following their location sequences, we can find representative and diverse travel routes that link key landmarks. Recommendations are performed by our photographer behavior model, which estimates the probability of a photographer visiting a landmark. We incorporate user preference and present location information into the probabilistic behavior model by combining topic models and Markov models. Based on the photographer behavior model, the proposed route recommendation method outputs a set of personalized travel plans that match the user's preference, present location, spare time and transportation means. We demonstrate the effectiveness of the proposed method using an actual large-scale geotag dataset held by Flickr in terms of the prediction accuracy of travel behavior.

111 citations
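The Markov component of such a behavior model is easy to sketch; the toy visit sequences below stand in for geotagged photo streams, and the topic-model side, spare-time and transportation constraints are omitted.

```python
# First-order Markov model of landmark transitions (Markov component only).
from collections import Counter, defaultdict

trips = [  # toy visit sequences extracted from geotagged photos
    ["tower", "museum", "park"],
    ["tower", "park", "cathedral"],
    ["museum", "park", "cathedral"],
]

counts = defaultdict(Counter)
for seq in trips:
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1  # count observed landmark-to-landmark moves

def next_landmark_probs(here):
    total = sum(counts[here].values())
    return {b: c / total for b, c in counts[here].items()}

print(next_landmark_probs("park"))  # {'cathedral': 1.0} on this toy data
```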


Journal ArticleDOI
TL;DR: The refined notion of conditional non-discrimination in classifier design is introduced and it is shown that some of the differences in decisions across the sensitive groups can be explainable and are hence tolerable.
Abstract: Recently, the following discrimination-aware classification problem was introduced. Historical data used for supervised learning may contain discrimination, for instance, with respect to gender. The question addressed by discrimination-aware techniques is, given a sensitive attribute, how to train classifiers on such historical data so that they are discrimination-free with respect to that sensitive attribute. Existing techniques that deal with this problem aim at removing all discrimination and do not take into account that part of the discrimination may be explainable by other attributes. For example, in a job application, the education level of a job candidate could be such an explainable attribute. If the data contain many highly educated male candidates and only a few highly educated women, a difference in acceptance rates between women and men does not necessarily reflect gender discrimination, as it could be explained by the different levels of education. Even though selecting on education level would result in more males being accepted, a difference with respect to such a criterion would not be considered to be undesirable, nor illegal. Current state-of-the-art techniques, however, do not take such gender-neutral explanations into account and tend to overreact and actually start reverse discriminating, as we will show in this paper. Therefore, we introduce and analyze the refined notion of conditional non-discrimination in classifier design. We show that some of the differences in decisions across the sensitive groups can be explainable and are hence tolerable. We therefore develop methodology for quantifying the explainable discrimination and algorithmic techniques for removing the illegal discrimination when one or more attributes are considered as explanatory. Experimental evaluation on synthetic and real-world classification datasets demonstrates that the new techniques are superior to the old ones in this new context, as they succeed in removing almost exclusively the undesirable discrimination, while leaving the explainable differences unchanged, allowing for differences in decisions as long as they are explainable.

104 citations
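The split between explainable and illegal discrimination can be illustrated with a toy calculation in the spirit of the paper (the exact decomposition below is an assumption, not the authors' formula): compare the overall acceptance-rate gap with the gap expected if acceptance depended on the explanatory attribute alone.

```python
# Toy decomposition: overall gap vs. gap explainable by education.
import pandas as pd

df = pd.DataFrame({
    "gender":   ["m"] * 6 + ["f"] * 6,
    "educ":     ["hi", "hi", "hi", "hi", "lo", "lo",
                 "hi", "hi", "lo", "lo", "lo", "lo"],
    "accepted": [1, 1, 1, 0, 0, 0,
                 1, 0, 0, 0, 0, 0],
})

overall = (df[df.gender == "m"].accepted.mean()
           - df[df.gender == "f"].accepted.mean())

# Explainable part: difference expected if acceptance depended on
# education alone, weighted by each group's education distribution.
rate_by_educ = df.groupby("educ").accepted.mean()
explainable = sum(
    (df[df.gender == "m"].educ.value_counts(normalize=True).get(e, 0)
     - df[df.gender == "f"].educ.value_counts(normalize=True).get(e, 0))
    * rate_by_educ[e]
    for e in rate_by_educ.index
)
print(f"overall={overall:.2f} explainable={explainable:.2f} "
      f"illegal={overall - explainable:.2f}")
```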


Journal ArticleDOI
TL;DR: A cube model is designed to explicitly describe the relationship among providers, consumers and Web services, along with a Standard Deviation based Hybrid Collaborative Filtering (SD-HCF) for Web Service Recommendation (WSRec) and an Inverse consumer Frequency based User Collaborative Filtering (IF-UCF), whose comparison against plain UCF indicates the effectiveness of adding inverse consumer frequency.
Abstract: Web service recommendation has become a hot yet fundamental research topic in service computing. The most popular technique is Collaborative Filtering (CF) based on a user-item matrix. However, it cannot well capture the relationship between Web services and providers. To address this issue, we first design a cube model to explicitly describe the relationship among providers, consumers and Web services. We then present a Standard Deviation based Hybrid Collaborative Filtering (SD-HCF) for Web Service Recommendation (WSRec) and an Inverse consumer Frequency based User Collaborative Filtering (IF-UCF) for Potential Consumers Recommendation (PCRec). Finally, the decision-making process of bidirectional recommendation is provided for both providers and consumers. Sets of experiments are conducted on real-world data provided by PlanetLab. In the experiment phase, we show how the parameters of SD-HCF impact the prediction quality and demonstrate that SD-HCF achieves much better recommendation quality than existing methods, including user-based CF, item-based CF and general HCF. Experimental comparison between IF-UCF and UCF indicates the effectiveness of adding inverse consumer frequency to UCF.

Journal ArticleDOI
TL;DR: This paper proposes a novel method for unsupervised feature selection, which efficiently selects features in a greedy manner and presents a novel algorithm for greedily minimizing the reconstruction error based on the features selected so far.
Abstract: Reducing the dimensionality of the data has been a challenging task in data mining and machine learning applications. In these applications, the existence of irrelevant and redundant features negatively affects the efficiency and effectiveness of different learning algorithms. Feature selection is one of the dimension reduction techniques, which has been used to allow a better understanding of data and improve the performance of other learning tasks. Although the selection of relevant features has been extensively studied in supervised learning, feature selection in the absence of class labels is still a challenging task. This paper proposes a novel method for unsupervised feature selection, which efficiently selects features in a greedy manner. The paper first defines an effective criterion for unsupervised feature selection that measures the reconstruction error of the data matrix based on the selected subset of features. The paper then presents a novel algorithm for greedily minimizing the reconstruction error based on the features selected so far. The greedy algorithm is based on an efficient recursive formula for calculating the reconstruction error. Experiments on real data sets demonstrate the effectiveness of the proposed algorithm in comparison with the state-of-the-art methods for unsupervised feature selection.
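A naive sketch of the greedy criterion follows: at each step, add the feature whose inclusion most reduces the least-squares reconstruction error of the full data matrix. The paper's efficient recursive update for this error is not reproduced; this version simply recomputes it from scratch.

```python
# Greedy unsupervised feature selection by reconstruction error (naive).
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 10))  # rank-3 data

def recon_error(A, cols):
    S = A[:, cols]                                # selected feature columns
    coef, *_ = np.linalg.lstsq(S, A, rcond=None)  # project A onto span(S)
    return np.linalg.norm(A - S @ coef) ** 2

selected = []
for _ in range(3):
    rest = [j for j in range(A.shape[1]) if j not in selected]
    best = min(rest, key=lambda j: recon_error(A, selected + [j]))
    selected.append(best)

print("selected:", selected, "error:", round(recon_error(A, selected), 8))
```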

Journal ArticleDOI
TL;DR: The formalization of a semantic-enriched KDD process for supporting meaningful pattern interpretations of human behavior, based on the integration of inductive reasoning and deductive reasoning, is described.
Abstract: The widespread use of mobile devices is producing a huge amount of trajectory data, making the discovery of movement patterns possible, which are crucial for understanding human behavior. Significant advances have been made with regard to knowledge discovery, but the process now needs to be extended bearing in mind the emerging field of behavior informatics. This paper describes the formalization of a semantic-enriched KDD process for supporting meaningful pattern interpretations of human behavior. Our approach is based on the integration of inductive reasoning (movement pattern discovery) and deductive reasoning (human behavior inference). We describe the implemented Athena system, which supports such a process, along with the experimental results on two different application domains related to traffic and recreation management.

Journal ArticleDOI
TL;DR: UMCourt is described, a project built around two sub-fields of AI research: Multi-agent Systems and Case-Based Reasoning, aimed at fostering the development of tools for ODR, to develop autonomous tools that can increase the effectiveness of the dispute resolution processes.
Abstract: The growing use of Information Technology in the commercial arena leads to an urgent need to find alternatives to traditional dispute resolution. New tools from fields such as artificial intelligence (AI) should be considered in the process of developing novel online dispute resolution (ODR) platforms, in order to make the litigation process simpler and faster, and suited to the new virtual environments. In this work, we describe UMCourt, a project built around two sub-fields of AI research: Multi-agent Systems and Case-Based Reasoning, aimed at fostering the development of tools for ODR. This is then used to accomplish several objectives, from suggesting solutions to new disputes based on the observation of past similar disputes, to the improvement of the negotiation and mediation processes that may follow. The main objective of this work is to develop autonomous tools that can increase the effectiveness of the dispute resolution processes, namely by increasing the amount of meaningful information that is available for the parties.

Journal ArticleDOI
TL;DR: This paper defines a novel D-core framework and devises a wealth of novel metrics used to evaluate the collaboration features of directed graphs, extending the classic graph-theoretic notion of k-cores for undirected graphs to directed ones.
Abstract: Community detection and evaluation is an important task in graph mining. In many cases, a community is defined as a subgraph characterized by dense connections or interactions between its nodes. A variety of measures have been proposed to evaluate different quality aspects of such communities—in most cases ignoring the directed nature of edges. In this paper, we introduce novel metrics for evaluating the collaborative nature of directed graphs—a property not captured by single-node metrics or by other established community evaluation metrics. In order to accomplish this objective, we capitalize on the concept of graph degeneracy and define a novel D-core framework, extending the classic graph-theoretic notion of k-cores for undirected graphs to directed ones. Based on the D-core, which essentially can be seen as a measure of the robustness of a community under degeneracy, we devise a wealth of novel metrics used to evaluate graph collaboration features of directed graphs. We applied the D-core approach to large synthetic and real-world graphs such as Wikipedia, DBLP, and ArXiv and report interesting results at the graph as well as at the node level.
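Read as a directed analogue of k-cores, a (k, l) D-core keeps only nodes with in-degree at least k and out-degree at least l, and can be computed by iterative peeling. The sketch below is the standard peeling construction, not the paper's exact implementation.

```python
# (k, l)-core ("D-core") extraction by iterative peeling.
import networkx as nx

def d_core(G, k, l):
    H = G.copy()
    while True:
        drop = [n for n in H
                if H.in_degree(n) < k or H.out_degree(n) < l]
        if not drop:
            return H
        H.remove_nodes_from(drop)  # removal may expose new violators

G = nx.gnp_random_graph(60, 0.12, directed=True, seed=1)
core = d_core(G, k=3, l=3)
print(f"(3,3)-core has {core.number_of_nodes()} nodes")
```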

Journal ArticleDOI
TL;DR: This paper presents a signed-distance-based method for determining the objective importance of criteria and handling fuzzy, multiple criteria group decision-making problems in a flexible and intelligent way using interval type-2 trapezoidal fuzzy numbers.
Abstract: Interval type-2 fuzzy sets are associated with greater imprecision and more ambiguities than ordinary fuzzy sets. This paper presents a signed-distance-based method for determining the objective importance of criteria and handling fuzzy, multiple criteria group decision-making problems in a flexible and intelligent way. These advantages arise from the method’s use of interval type-2 trapezoidal fuzzy numbers to represent alternative ratings and the importance of various criteria. An integrated approach to determine the overall importance of the criteria is also developed using the subjective information provided by decision-makers and the objective information delivered by the decision matrix. In addition, a linear programming model is developed to estimate criterion weights and to extend the proposed multiple criteria decision analysis method. Finally, the feasibility and effectiveness of the proposed methods are illustrated by a group decision-making problem of patient-centered medicine in basilar artery occlusion.

Journal ArticleDOI
TL;DR: A new approach for finding overlapping clusters given pairwise similarities of objects is introduced, which relaxes the problem of correlation clustering by allowing an object to be assigned to more than one cluster.
Abstract: We introduce a new approach for finding overlapping clusters given pairwise similarities of objects. In particular, we relax the problem of correlation clustering by allowing an object to be assigned to more than one cluster. At the core of our approach is an optimization problem in which each data point is mapped to a small set of labels, representing membership in different clusters. The objective is to find a mapping so that the given similarities between objects agree as much as possible with similarities taken over their label sets. The number of labels can vary across objects. To define a similarity between label sets, we consider two measures: (i) a 0–1 function indicating whether the two label sets have non-zero intersection and (ii) the Jaccard coefficient between the two label sets. The algorithm we propose is an iterative local-search method. The definitions of label set similarity give rise to two non-trivial optimization problems, which, for the measures of set-intersection and Jaccard, we solve using a greedy strategy and non-negative least squares, respectively. We also develop a distributed version of our algorithm based on the BSP model and implement it using a Pregel framework. Our algorithm uses as input pairwise similarities of objects and can thus be applied when clustering structured objects for which feature vectors are not available. As a proof of concept, we apply our algorithms on three different and complex application domains: trajectories, amino-acid sequences, and textual documents.
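A minimal sketch of the local search with the 0-1 set-intersection measure is given below; the pairwise similarities, label universe, and the cap on candidate label sets are invented for illustration, and the sweep may of course stop at a local optimum.

```python
# Local-search sketch for overlapping correlation clustering with the
# 0/1 set-intersection similarity between label sets.
from itertools import combinations

sims = {  # given pairwise similarities (1 = should share a label)
    (0, 1): 1, (1, 2): 1, (2, 3): 1, (0, 2): 0, (0, 3): 0, (1, 3): 0,
}
labels = {i: {0} for i in range(4)}  # start: everyone in cluster 0
universe = [0, 1, 2]                 # assumed number of available labels

def cost(i, L):
    c = 0
    for (a, b), s in sims.items():
        if i in (a, b):
            other = labels[b if a == i else a]
            agree = int(bool(L & other))  # set-intersection similarity
            c += abs(s - agree)
    return c

for _ in range(5):  # a few greedy reassignment sweeps
    for i in labels:
        candidates = [set(c) for r in (1, 2)
                      for c in combinations(universe, r)]
        labels[i] = min(candidates, key=lambda L: cost(i, L))
print(labels)
```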

Journal ArticleDOI
TL;DR: This paper proposes a novel document clustering framework that is designed to induce a document organization from the identification of cohesive groups of segment-based portions of the original documents.
Abstract: Document clustering has been recognized as a central problem in text data management. Such a problem becomes particularly challenging when document contents are characterized by subtopical discussions that are not necessarily relevant to each other. Existing methods for document clustering have traditionally assumed that a document is an indivisible unit for text representation and similarity computation, which may not be appropriate to handle documents with multiple topics. In this paper, we address the problem of multi-topic document clustering by leveraging the natural composition of documents in text segments that are coherent with respect to the underlying subtopics. We propose a novel document clustering framework that is designed to induce a document organization from the identification of cohesive groups of segment-based portions of the original documents. We empirically give evidence of the significance of our segment-based approach on large collections of multi-topic documents, and we compare it to conventional methods for document clustering.

Journal ArticleDOI
TL;DR: This paper develops and evaluates an automatic keyphrase extraction system for scientific documents and shows the efficiency and effectiveness of the refined candidate set and demonstrates that the new features improve the accuracy of the system.
Abstract: Automatic keyphrase extraction techniques play an important role for many tasks including indexing, categorizing, summarizing, and searching. In this paper, we develop and evaluate an automatic keyphrase extraction system for scientific documents. Compared with previous work, our system concentrates on two important issues: (1) more precise location for potential keyphrases: a new candidate phrase generation method is proposed based on the core word expansion algorithm, which can reduce the size of the candidate set by about 75% without increasing the computational complexity; (2) overlap elimination for the output list: when a phrase and its sub-phrases coexist as candidates, an inverse document frequency feature is introduced for selecting the proper granularity. Additional new features are added for phrase weighting. Experiments based on real-world datasets were carried out to evaluate the proposed system. The results show the efficiency and effectiveness of the refined candidate set and demonstrate that the new features improve the accuracy of the system. The overall performance of our system compares favorably with other state-of-the-art keyphrase extraction systems.

Journal ArticleDOI
TL;DR: This paper introduces the concept of distance graph representations of text data that preserve information about the relative ordering and distance between the words in the graphs and provide a much richer representation in terms of sentence structure of the underlying data.
Abstract: The rapid proliferation of the World Wide Web has increased the importance and prevalence of text as a medium for dissemination of information. A variety of text mining and management algorithms have been developed in recent years, such as clustering, classification, indexing, and similarity search. Almost all these applications use the well-known vector-space model for text representation and analysis. While the vector-space model has proven itself to be an effective and efficient representation for mining purposes, it does not preserve information about the ordering of the words in the representation. In this paper, we will introduce the concept of distance graph representations of text data. Such representations preserve information about the relative ordering and distance between the words in the graphs and provide a much richer representation in terms of sentence structure of the underlying data. Recent advances in graph mining and hardware capabilities of modern computers enable us to process more complex representations of text. We will see that such an approach has clear advantages from a qualitative perspective. It enables knowledge discovery from text that is not possible with a pure vector-space representation, because the distance graph loses much less information about the ordering of the underlying words. Furthermore, this representation does not require the development of new mining and management techniques. This is because the technique can also be converted into a structural version of the vector-space representation, which allows the use of all existing tools for text. In addition, existing techniques for graph and XML data can be directly leveraged with this new representation. Thus, a much wider spectrum of algorithms is available for processing this representation. We will apply this technique to a variety of mining and management applications and show its advantages and richness in exploring the structure of the underlying text documents.
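The construction itself is simple to sketch: in a distance graph of order k, each word links to the words that follow it within k positions, with edge weights counting co-occurrences. The version below is an assumed minimal one that ignores tokenization and stop-word details.

```python
# Build an order-k distance graph: edge (u, v) counts how often v
# follows u within k word positions.
from collections import Counter

def distance_graph(text, k=2):
    words = text.lower().split()
    edges = Counter()
    for i, u in enumerate(words):
        for j in range(i + 1, min(i + k + 1, len(words))):
            edges[(u, words[j])] += 1
    return edges

g = distance_graph("the cat sat on the mat the cat slept", k=2)
print(g[("the", "cat")], g[("cat", "sat")])  # 2 1
```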

Journal ArticleDOI
TL;DR: Two types of algorithms, namely level-wise and tree-based methods, are proposed for mining high-utility mobile sequential patterns, and the results show that the proposed algorithms outperform the state-of-the-art mobile sequential pattern algorithms and that the tree-based algorithms deliver better performance than the level-wise ones under various conditions.
Abstract: Mining user behavior patterns in mobile environments is an emerging topic in data mining fields with wide applications. By integrating moving paths with purchasing transactions, one can find the sequential purchasing patterns with the moving paths, which are called mobile sequential patterns of the mobile users. Mobile sequential patterns can be applied not only for planning mobile commerce environments but also for analyzing and managing online shopping websites. However, unit profits and purchased numbers of the items are not considered in traditional framework of mobile sequential pattern mining. Thus, the patterns with high utility (i.e., profit here) cannot be found. In view of this, we aim at integrating mobile data mining with utility mining for finding high-utility mobile sequential patterns in this study. Two types of algorithms, namely level-wise and tree-based methods, are proposed for mining high-utility mobile sequential patterns. A series of analyses and comparisons on the performance of the two different types of algorithms are conducted through experimental evaluations. The results show that the proposed algorithms outperform the state-of-the-art mobile sequential pattern algorithms and that the tree-based algorithms deliver better performance than the level-wise ones under various conditions.

Journal ArticleDOI
TL;DR: A rule-based privacy model is introduced that allows data publishers to express fine-grained protection requirements for both identity and sensitive information disclosure, and two anonymization algorithms are developed that significantly outperform the state-of-the-art in terms of retaining data utility, while achieving good protection and scalability.
Abstract: Transaction data are increasingly used in applications, such as marketing research and biomedical studies. Publishing these data, however, may risk privacy breaches, as they often contain personal information about individuals. Approaches to anonymizing transaction data have been proposed recently, but they may produce excessively distorted and inadequately protected solutions. This is because these approaches do not consider privacy requirements that are common in real-world applications in a realistic and flexible manner, and attempt to safeguard the data only against either identity disclosure or sensitive information inference. In this paper, we propose a new approach that overcomes these limitations. We introduce a rule-based privacy model that allows data publishers to express fine-grained protection requirements for both identity and sensitive information disclosure. Based on this model, we also develop two anonymization algorithms. Our first algorithm works in a top-down fashion, employing an efficient strategy to recursively generalize data with low information loss. Our second algorithm uses sampling and a combination of top-down and bottom-up generalization heuristics, which greatly improves scalability while maintaining low information loss. Extensive experiments show that our algorithms significantly outperform the state-of-the-art in terms of retaining data utility, while achieving good protection and scalability.

Journal ArticleDOI
TL;DR: This work develops an efficient algorithm to extract a hierarchy of overlapping communities and demonstrates the promising potential of the proposed approach in real-world applications.
Abstract: A recent surge of participatory web and social media has created a new laboratory for studying human relations and collective behavior on an unprecedented scale. In this work, we study the predictive power of social connections to determine the preferences or behaviors of individuals such as whether a user supports a certain political view, whether one likes a product, whether she would like to vote for a presidential candidate, etc. Since an actor is likely to participate in multiple different communities with each regulating the actor’s behavior in varying degrees, and a natural hierarchy might exist between these communities, we propose to zoom into a network at multiple different resolutions and determine which communities reflect a targeted behavior. We develop an efficient algorithm to extract a hierarchy of overlapping communities. Empirical results on social media networks demonstrate the promising potential of the proposed approach in real-world applications.

Journal ArticleDOI
TL;DR: This paper presents an innovative mathematical model for improving the accuracy of RTLSs, focusing on the mitigation of the ground reflection effect by using multilayer perceptron artificial neural networks.
Abstract: Wireless sensor networks (WSNs) have become much more relevant in recent years, mainly because they can be used in a wide diversity of applications. Real-time locating systems (RTLSs) are one of the most promising applications based on WSNs and represent a currently growing market. Specifically, WSNs are an ideal alternative to develop RTLSs aimed at indoor environments where existing global navigation satellite systems, such as the global positioning system, do not work correctly due to the blockage of the satellite signals. However, accuracy in indoor RTLSs is still a problem requiring novel solutions. One of the main challenges is to deal with the problems that arise from the effects of the propagation of radiofrequency waves, such as attenuation, diffraction, reflection and scattering. These effects can lead to other undesired problems, such as multipath. When the ground is responsible for wave reflections, multipath can be modeled as the ground reflection effect. This paper presents an innovative mathematical model for improving the accuracy of RTLSs, focusing on the mitigation of the ground reflection effect by using multilayer perceptron artificial neural networks.
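As a hedged illustration of the learning setup (not the paper's model), the sketch below trains a scikit-learn multilayer perceptron to map a reflection-distorted range measurement back to the true distance; the synthetic two-ray-like distortion is an invented stand-in for the ground reflection effect.

```python
# MLP correction of a reflection-distorted range estimate (illustrative).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
true_dist = rng.uniform(1, 30, 2000)
# Measurement = true distance plus a distance-dependent ripple (mimicking
# constructive/destructive ground reflection) and noise.
measured = true_dist + 1.5 * np.sin(true_dist) + rng.normal(0, 0.3, 2000)

mlp = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000,
                   random_state=0).fit(measured.reshape(-1, 1), true_dist)

print(mlp.predict(np.array([[5.0], [12.0], [25.0]])))  # corrected estimates
```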

Journal ArticleDOI
TL;DR: A new method is developed for multiple attribute group decision-making problems under an uncertain environment, in which the information about attribute weights is incompletely known or completely unknown, and each decision maker's information is expressed by an interval-valued fuzzy soft set.
Abstract: In this paper, we develop a new method for multiple attribute group decision-making problems under an uncertain environment, in which the information about attribute weights is incompletely known or completely unknown, and each decision maker's information is expressed by an interval-valued fuzzy soft set. Moreover, this paper takes account of the decision makers' attitude toward risk. In order to get the weight vector of the attributes, we construct the score matrix of the final fuzzy soft set. From the score matrix and the given attribute weight information, we establish an optimization model to determine the weights of attributes. For the special situations where the information about attribute weights is completely unknown, we establish another optimization model. By solving this model, we get a simple and exact formula, which can be used to determine the attribute weights. According to these models, a method based on interval-valued fuzzy soft sets, which considers the decision makers' risk attitude under an uncertain environment, is given to rank the alternatives. Finally, a numerical example is used to illustrate the applicability of the proposed approach.

Journal ArticleDOI
TL;DR: The results show that the approach can not only find the same optimal solution as the champion system of the WS-Challenge competition but also provide more alternative solutions with optimal QoS for users.
Abstract: This paper proposes a novel approach based on the planning-graph to solve the top-k QoS-aware automatic composition problem of semantic Web services. The approach includes three sequential stages: a forward search stage to generate a planning-graph, which greatly reduces the search space of the following two stages; an optimal local QoS calculating stage to compute all the optimal local QoS values of services in the planning-graph; and a backward search stage to find the top-k composed services with optimal QoS values according to the planning-graph and the optimal QoS values. In order to validate the approach, experiments are carried out based on the test sets offered by the WS-Challenge competition 2009. The results show that the approach can not only find the same optimal solution as the champion system of the competition but also provide more alternative solutions with optimal QoS for users.

Journal ArticleDOI
TL;DR: This paper proposes a new approach called MaxSegment for a two-dimensional space when the L2-norm is used and extends this algorithm to other variations of the MaxBRNN problem, such as the MaxBRNN problem with other metric spaces and a three-dimensional space.
Abstract: Maximizing bichromatic reverse nearest neighbor (MaxBRNN) is a variant of bichromatic reverse nearest neighbor (BRNN). The purpose of the MaxBRNN problem is to find an optimal region that maximizes the size of BRNNs. This problem has many real applications, such as location planning and profile-based marketing. The best-known algorithm for the MaxBRNN problem is called MaxOverlap. In this paper, we study the MaxBRNN problem and propose a new approach called MaxSegment for a two-dimensional space when the L2-norm is used. Then, we extend our algorithm to other variations of the MaxBRNN problem, such as the MaxBRNN problem with other metric spaces and a three-dimensional space. Finally, we conducted experiments on real and synthetic datasets to compare our proposed algorithm with existing algorithms. The experimental results verify the efficiency of our proposed approach.

Journal ArticleDOI
TL;DR: This paper considers inductive conformal prediction in the context of random tree ensembles such as random forests, which have been noted to perform favorably across problems.
Abstract: Obtaining an indication of confidence of predictions is desirable for many data mining applications. Predictions complemented with confidence levels can inform on the certainty or extent of reliability that may be associated with the prediction. This can be useful in varied application contexts where model outputs form the basis for potentially costly decisions, and in general across risk sensitive applications. The conformal prediction framework presents a novel approach for obtaining valid confidence measures associated with predictions from machine learning algorithms. Confidence levels are obtained from the underlying algorithm, using a non-conformity measure which indicates how ‘atypical’ a given example set is. The non-conformity measure is a key to determining the usefulness and efficiency of the approach. This paper considers inductive conformal prediction in the context of random tree ensembles like random forests, which have been noted to perform favorably across problems. Focusing on classification tasks, and considering realistic data contexts including class imbalance, we develop non-conformity measures for assessing the confidence of predicted class labels from random forests. We examine the performance of these measures on multiple data sets. Results demonstrate the usefulness and validity of the measures, their relative differences, and highlight the effectiveness of conformal prediction random forests for obtaining predictions with associated confidence.
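A minimal sketch of inductive conformal prediction with a random forest follows, using one minus the forest's predicted probability of a class as the non-conformity measure; this is one plausible choice and not necessarily the exact measure developed in the paper.

```python
# Inductive conformal prediction with a random forest (sketch).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.5,
                                              random_state=0)
X_cal, X_te, y_cal, y_te = train_test_split(X_rest, y_rest, test_size=0.5,
                                            random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
# Non-conformity on the calibration set: 1 - probability of the true class.
cal = 1 - rf.predict_proba(X_cal)[np.arange(len(y_cal)), y_cal]

eps = 0.1  # significance level: prediction sets are valid ~90% of the time
for x, true in zip(X_te[:5], y_te[:5]):
    proba = rf.predict_proba([x])[0]
    region = [c for c, p in enumerate(proba)
              if (np.sum(cal >= 1 - p) + 1) / (len(cal) + 1) > eps]
    print(true, region)
```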

Journal ArticleDOI
TL;DR: A weight-based semantics is proposed for querying an inconsistent KB, which defines an answer of a conjunctive query as a tuple of individuals whose substitution for the variables in the query head makes the query body entailed by any subbase of the KB consisting of the intensional knowledge and a weight-maximally consistent subset of the extensional knowledge.
Abstract: Non-standard query mechanisms that work under inconsistency are required in some important description logic (DL)-based applications, including those involving an inconsistent DL knowledge base (KB) whose intensional knowledge is consistent but is violated by its extensional knowledge. This paper proposes a weight-based semantics for querying such an inconsistent KB. This semantics defines an answer of a conjunctive query posed upon an inconsistent KB as a tuple of individuals whose substitution for the variables in the query head makes the query body entailed by any subbase of the KB consisting of the intensional knowledge and a weight-maximally consistent subset of the extensional knowledge. A novel computational method for this semantics is proposed, which works for extensionally reduced SHIQ KBs and conjunctive queries without non-distinguished variables. The method first compiles the given KB to a propositional program; then, for any given conjunctive query, it reduces the problem of computing all answers of the given query to a set of propositional satisfiability (SAT) problems with PB-constraints, which are then solved by SAT solvers. A decomposition-based framework for optimizing the method is also proposed. The feasibility of this method is demonstrated in our experiments.

Journal ArticleDOI
TL;DR: This work develops a novel pattern mining approach to mine a set of pairs of communities that behave in opposite ways with one another, and focuses on extracting a compact lossless representation based on the concept of closed patterns to prevent exploding the number of mined antagonistic communities.
Abstract: Antagonistic communities refer to groups of people with opposite tastes, opinions, and factions within a community. Given a set of interactions among people in a community, we develop a novel pattern mining approach to mine a set of antagonistic communities. In particular, based on a set of user-specified thresholds, we extract a set of pairs of communities that behave in opposite ways with one another. We focus on extracting a compact lossless representation based on the concept of closed patterns to prevent an explosion in the number of mined antagonistic communities. We also present a variation of the algorithm using a divide-and-conquer strategy to handle large datasets when main memory is inadequate. The scalability of our approach is tested on synthetic datasets of various sizes, mined using various parameters. Case studies on Amazon, Epinions, and Slashdot datasets further show the efficiency and the utility of our approach in extracting antagonistic communities from social interactions.