
Showing papers in "Knowledge and Information Systems in 2017"


Journal ArticleDOI
TL;DR: This survey article enumerates, categorizes, and compares many of the methods that have been proposed to detect change points in time series, and presents some grand challenges for the community to consider.
Abstract: Change points are abrupt variations in time series data. Such abrupt changes may represent transitions that occur between states. Detection of change points is useful in modelling and prediction of time series and is found in application areas such as medical condition monitoring, climate change detection, speech and image analysis, and human activity analysis. This survey article enumerates, categorizes, and compares many of the methods that have been proposed to detect change points in time series. The methods examined include both supervised and unsupervised algorithms that have been introduced and evaluated. We introduce several criteria to compare the algorithms. Finally, we present some grand challenges for the community to consider.

788 citations


Journal ArticleDOI
TL;DR: This review paper presents a selection of challenges of particular current interest, such as feature selection for high-dimensional small sample size data, large-scale data, and secure feature selection, as well as some representative applications of feature selection.
Abstract: Feature selection is one of the key problems for machine learning and data mining. In this review paper, a brief historical background of the field is given, followed by a selection of challenges of particular current interest, such as feature selection for high-dimensional small sample size data, large-scale data, and secure feature selection. Along with these challenges, some hot topics for feature selection have emerged, e.g., stable feature selection, multi-view feature selection, distributed feature selection, multi-label feature selection, online feature selection, and adversarial feature selection. The recent advances in these topics are then surveyed in this paper. For each topic, the existing problems are analyzed, and then current solutions to these problems are presented and discussed. Besides these topics, some representative applications of feature selection are also introduced, such as applications in bioinformatics, social media, and multimedia retrieval.

219 citations


Journal ArticleDOI
TL;DR: This work substantiates its points with extensive experiments, using clustering and outlier detection methods with and without index acceleration, and discusses what one can learn from evaluations, whether experiments are properly designed, and what kind of conclusions one should avoid.
Abstract: Any paper proposing a new algorithm should come with an evaluation of efficiency and scalability (particularly when we are designing methods for "big data"). However, there are several (more or less serious) pitfalls in such evaluations. We would like to point the attention of the community to these pitfalls. We substantiate our points with extensive experiments, using clustering and outlier detection methods with and without index acceleration. We discuss what we can learn from evaluations, whether experiments are properly designed, and what kind of conclusions we should avoid. We close with some general recommendations but maintain that the design of fair and conclusive experiments will always remain a challenge for researchers and an integral part of the scientific endeavor.

184 citations


Journal ArticleDOI
TL;DR: A novel algorithm named EFIM (EFficient high-utility Itemset Mining), which introduces several new ideas to more efficiently discover high-utility itemsets and is in general two to three orders of magnitude faster than the state-of-the-art algorithms.
Abstract: In recent years, high-utility itemset mining has emerged as an important data mining task. However, it remains computationally expensive both in terms of runtime and memory consumption. It is thus an important challenge to design more efficient algorithms for this task. In this paper, we address this issue by proposing a novel algorithm named EFIM (EFficient high-utility Itemset Mining), which introduces several new ideas to more efficiently discover high-utility itemsets. EFIM relies on two new upper bounds named revised sub-tree utility and local utility to more effectively prune the search space. It also introduces a novel array-based utility counting technique named Fast Utility Counting to calculate these upper bounds in linear time and space. Moreover, to reduce the cost of database scans, EFIM proposes efficient database projection and transaction merging techniques named High-utility Database Projection and High-utility Transaction Merging (HTM), also performed in linear time. An extensive experimental study on various datasets shows that EFIM is in general two to three orders of magnitude faster than the state-of-the-art algorithms d2HUP, HUI-Miner, HUP-Miner, FHM and UP-Growth+ on dense datasets and performs quite well on sparse datasets. Moreover, a key advantage of EFIM is its low memory consumption.
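
To illustrate the flavor of array-based utility counting, the sketch below computes a simple per-item utility bound (the transaction-weighted utility) in one pass over the database using one bin per item. It is a hedged illustration of linear-time utility counting in general, not EFIM's exact revised sub-tree or local utility bounds.

```python
from collections import defaultdict

def utility_bin_counts(database):
    """Array/dict-based utility counting: one pass over the database, one bin per item.

    database: list of transactions, each a list of (item, utility) pairs.
    Returns, per item, the transaction-weighted utility (the sum of the utilities of
    every transaction containing the item), a simple upper bound used to prune items.
    """
    bins = defaultdict(int)
    for transaction in database:
        tu = sum(u for _, u in transaction)   # transaction utility
        for item, _ in transaction:
            bins[item] += tu
    return bins

# Items whose bound falls below min_util can never appear in a high-utility itemset.
db = [[("a", 5), ("b", 2)], [("a", 4), ("c", 3)], [("b", 1), ("c", 6)]]
min_util = 10
promising = {i for i, bound in utility_bin_counts(db).items() if bound >= min_util}
print(promising)
```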

169 citations


Journal ArticleDOI
TL;DR: Significant contributions made in recent years are emphasized, including progress on modern sentence extraction approaches that improve concept coverage, information diversity and content coherence, as well as attempts from summarization frameworks that integrate sentence compression, and more abstractive systems that are able to produce completely new sentences.
Abstract: The task of automatic document summarization aims at generating short summaries for originally long documents. A good summary should cover the most important information of the original document or a cluster of documents, while being coherent, non-redundant and grammatically readable. Numerous approaches for automatic summarization have been developed to date. In this paper, we give a self-contained, broad overview of recent progress made for document summarization within the last 5 years. Specifically, we emphasize significant contributions made in recent years that represent the state-of-the-art of document summarization, including progress on modern sentence extraction approaches that improve concept coverage, information diversity and content coherence, as well as attempts from summarization frameworks that integrate sentence compression, and more abstractive systems that are able to produce completely new sentences. In addition, we review progress made for document summarization in domains, genres and applications that are different from traditional settings. We also point out some of the latest trends and highlight a few possible future directions.

143 citations


Journal ArticleDOI
TL;DR: A novel transfer learning and domain adaptation approach, referred to as visual domain adaptation (VDA), which reduces the difference between the joint marginal and conditional distributions across domains in an unsupervised manner where no label is available in the test set.
Abstract: One of the serious challenges in computer vision and image classification is learning an accurate classifier for a new unlabeled image dataset when no labeled training data are available. Transfer learning and domain adaptation are two outstanding solutions that tackle this challenge by employing available datasets, even those with significant differences in distribution and properties, and transferring knowledge from a related domain to the target domain. The main difference between these two solutions lies in their primary assumptions about changes in the marginal and conditional distributions: transfer learning focuses on problems with the same marginal distribution and different conditional distributions, while domain adaptation deals with the opposite conditions. Most prior works have exploited these two learning strategies separately for the domain shift problem, where training and test sets are drawn from different distributions. In this paper, we exploit joint transfer learning and domain adaptation to cope with domain shift problems in which the distribution difference is significantly large, particularly for vision datasets. We therefore put forward a novel transfer learning and domain adaptation approach, referred to as visual domain adaptation (VDA). Specifically, VDA reduces the difference between the joint marginal and conditional distributions across domains in an unsupervised manner, where no label is available in the test set. Moreover, VDA constructs condensed domain-invariant clusters in the embedding representation to separate the various classes alongside the domain transfer. In this work, we employ pseudo target label refinement to iteratively converge to the final solution. Employing an iterative procedure along with a novel optimization problem creates a robust and effective representation for adaptation across domains. Extensive experiments on 16 real vision datasets with different difficulties verify that VDA significantly outperforms state-of-the-art methods on the image classification problem.
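
Methods in this family typically quantify distribution differences with the maximum mean discrepancy (MMD). The snippet below shows the empirical linear-kernel MMD between source and target features as a minimal, generic illustration; VDA's actual objective (joint marginal and conditional terms, pseudo-label refinement, clustering regularization) is considerably more involved.

```python
import numpy as np

def mmd_linear(Xs, Xt):
    """Empirical maximum mean discrepancy with a linear kernel: the squared distance
    between the source and target feature means. A smaller value indicates better
    alignment of the marginal distributions."""
    delta = Xs.mean(axis=0) - Xt.mean(axis=0)
    return float(delta @ delta)

# Toy usage: the gap shrinks as the two feature clouds are brought closer together.
Xs = np.random.randn(100, 16) + 1.0   # source features
Xt = np.random.randn(120, 16) - 1.0   # target features
print(mmd_linear(Xs, Xt))
```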

125 citations


Journal ArticleDOI
TL;DR: A methodical analysis of task scheduling in cloud and grid computing is presented based on swarm intelligence and bio-inspired techniques, and will enable readers to choose a suitable approach for suggesting better schemes for scheduling users' applications.
Abstract: Heterogeneous distributed computing systems are emerging as platforms for executing scientific and computationally intensive applications. Cloud computing in this context describes a paradigm to deliver resources such as computing and storage on an on-demand basis using a pay-per-use model. These resources are managed by data centers and dynamically provisioned to users based on their availability, demand and the quality parameters required to be satisfied. The scheduling of tasks onto distributed and virtual resources is a main concern that can affect the performance of the system. In the literature, a lot of work has been done considering cost and makespan as the parameters affecting the scheduling of dependent tasks. Prior work has discussed the various challenges affecting the performance of dependent task scheduling but did not consider storage cost and failure-rate-related challenges. This paper accomplishes a review of the use of meta-heuristic techniques for scheduling tasks in cloud computing. We present a taxonomy and a comparative review of these algorithms. A methodical analysis of task scheduling in cloud and grid computing is presented based on swarm intelligence and bio-inspired techniques. This work will enable readers to choose a suitable approach for suggesting better schemes for scheduling users' applications. Future research issues are also suggested in this work.

114 citations


Journal ArticleDOI
TL;DR: The neuro-fuzzy approach was used to detect the most important variables which affect the wind speed according to the fractal dimensions, and the main goal was to investigate the influence of terrain roughness length and different heights of the wind speed on the wind speed prediction.
Abstract: Fluctuation of wind speed affects wind energy systems since the potential wind power is proportional to the cube of wind speed. Hence, precise prediction of wind speed is very important for improving the performance of such systems. Due to the unstable behavior of the wind speed above different terrains, in this study fractal characteristics of the wind speed series were analyzed. According to the self-similarity characteristic and scale invariance, fractal extrapolation interpolation prediction can be performed by extending the fractal characteristic from an internal interval to an external interval. Afterward, a neuro-fuzzy technique was applied to the fractal data because of the high nonlinearity of the data. The neuro-fuzzy approach was used to detect the most important variables which affect the wind speed according to the fractal dimensions. The main goal was to investigate the influence of terrain roughness length and different heights of the wind speed on the wind speed prediction.

96 citations


Journal ArticleDOI
TL;DR: This paper analyzes the properties of an incremental algorithm that uses a sorted tree structure with a sliding window to compute AUC with forgetting, called prequential AUC, and shows that the proposed measure is statistically consistent with AUC computed traditionally on streams without drift and comparably fast to existing evaluation procedures.
Abstract: Modern data-driven systems often require classifiers capable of dealing with streaming imbalanced data and concept changes. The assessment of learning algorithms in such scenarios is still a challenge, as existing online evaluation measures focus on efficiency, but are susceptible to class ratio changes over time. In case of static data, the area under the receiver operating characteristics curve, or simply AUC, is a popular measure for evaluating classifiers both on balanced and imbalanced class distributions. However, the characteristics of AUC calculated on time-changing data streams have not been studied. This paper analyzes the properties of our recent proposal, an incremental algorithm that uses a sorted tree structure with a sliding window to compute AUC with forgetting. The resulting evaluation measure, called prequential AUC, is studied in terms of: visualization over time, processing speed, differences compared to AUC calculated on blocks of examples, and consistency with AUC calculated traditionally. Simulation results show that the proposed measure is statistically consistent with AUC computed traditionally on streams without drift and comparably fast to existing evaluation procedures. Finally, experiments on real-world and synthetic data showcase characteristic properties of prequential AUC compared to classification accuracy, G-mean, Kappa, Kappa M, and recall when used to evaluate classifiers on imbalanced streams with various difficulty factors.
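
For intuition, the following sketch computes AUC over a sliding window of (score, label) pairs; forgetting happens by dropping the oldest example once the window is full. It uses a quadratic pairwise comparison for clarity, whereas the paper's sorted-tree structure achieves the same result far more efficiently.

```python
from collections import deque

def window_auc(window):
    """AUC over (score, label) pairs via the rank-sum (Mann-Whitney) formulation."""
    pos = [s for s, y in window if y == 1]
    neg = [s for s, y in window if y == 0]
    if not pos or not neg:
        return float("nan")
    # Count score pairs where a positive outranks a negative (ties count 0.5).
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def prequential_auc(stream, window_size=1000):
    """Yield the AUC after each example, using only the most recent window_size examples."""
    window = deque(maxlen=window_size)   # old examples are forgotten automatically
    for score, label in stream:
        window.append((score, label))
        yield window_auc(window)
```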

91 citations


Journal ArticleDOI
TL;DR: Different performance evaluation parameters such as precision, recall, f-measure, and accuracy have been considered to evaluate the performance of the proposed approach on two different datasets, i.e., the IMDb dataset and the polarity dataset.
Abstract: It is common practice for users or customers to share their comments or reviews about products on different social networking sites. An analyst usually needs to process these reviews properly to obtain meaningful information from them. Classification of the sentiments associated with reviews is one of these processing steps. The reviews are typically written in text format, and while processing them, each word of a review is considered as a feature. Thus, feature selection needs to be carried out to select the best features from the set of all features. In this paper, a machine learning algorithm, namely the support vector machine, is used to select the best features from the training data. These features are then given as input to an artificial neural network for further processing. Different performance evaluation parameters such as precision, recall, f-measure, and accuracy have been considered to evaluate the performance of the proposed approach on two different datasets, i.e., the IMDb dataset and the polarity dataset.
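
One plausible reading of this pipeline, sketched below with scikit-learn, is to extract TF-IDF features, keep the features weighted highly by a linear SVM, and feed them to a small feed-forward network. The dataset loading and the hyper-parameters shown are assumptions, not the paper's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel

# SVM-based feature selection followed by an ANN classifier (illustrative sketch).
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(min_df=2)),                       # each word becomes a feature
    ("select", SelectFromModel(LinearSVC(C=1.0, max_iter=5000))),  # keep SVM-weighted features
    ("ann", MLPClassifier(hidden_layer_sizes=(100,), max_iter=300)),
])

# pipeline.fit(train_reviews, train_labels)
# pipeline.score(test_reviews, test_labels)   # precision/recall/F1 via sklearn.metrics
```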

82 citations


Journal ArticleDOI
TL;DR: This research proposes a semi-supervised sentiment analysis approach that incorporates lexicon-based methodology with machine learning in order to improve sentiment analysis performance.
Abstract: An immense amount of data has become available with the advent of social media in the last decade. This data can be used for sentiment analysis and decision making. The data present on blogs, news/review sites, social networks, etc., are so enormous that manual labeling is not feasible and an automatic approach is required for their analysis. The sentiment of the masses can be understood by analyzing this large-scale and opinion-rich data. The major issues in the application of automated approaches are data unavailability, data sparsity, domain independence and inadequate performance. This research proposes a semi-supervised sentiment analysis approach that incorporates lexicon-based methodology with machine learning in order to improve sentiment analysis performance. Mathematical models such as information gain and cosine similarity are employed to revise the sentiment scores defined in SentiWordNet. This research also emphasizes the importance of nouns and employs them as semantic features along with other parts of speech. The evaluation of performance measures and comparison with state-of-the-art techniques show that the proposed approach is superior.

Journal ArticleDOI
TL;DR: Current state-of-the-art cloud resource allocation schemes are extensively reviewed to highlight their strengths and weaknesses and a thematic taxonomy is presented based on resource allocation optimization objectives to classify the existing literature.
Abstract: Cloud computing has emerged as a popular computing model to process data and execute computationally intensive applications in a pay-as-you-go manner. Due to the ever-increasing demand for cloud-based applications, it is becoming difficult to efficiently allocate resources according to user requests while satisfying the service-level agreement between service providers and consumers. Furthermore, cloud resource heterogeneity, the unpredictable nature of workload, and the diversified objectives of cloud actors further complicate resource allocation in the cloud computing environment. Consequently, both the industry and academia have commenced substantial research efforts to efficiently handle the aforementioned multifaceted challenges with cloud resource allocation. The lack of a comprehensive review covering the resource allocation aspects of optimization objectives, design approaches, optimization methods, target resources, and instance types has motivated a review of existing cloud resource allocation schemes. In this paper, current state-of-the-art cloud resource allocation schemes are extensively reviewed to highlight their strengths and weaknesses. Moreover, a thematic taxonomy is presented based on resource allocation optimization objectives to classify the existing literature. The cloud resource allocation schemes are analyzed based on the thematic taxonomy to highlight the commonalities and deviations among them. Finally, several opportunities are suggested for the design of optimal resource allocation schemes.

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a fast, efficient, and parallel framework for counting k-node graphlets, which leverages a number of theoretical combinatorial arguments that allow them to obtain significant improvement on the scalability of graphlet counting.
Abstract: From social science to biology, numerous applications often rely on graphlets for intuitive and meaningful characterization of networks. While graphlets have witnessed a tremendous success and impact in a variety of domains, there has yet to be a fast and efficient framework for computing the frequencies of these subgraph patterns. However, existing methods are not scalable to large networks with billions of nodes and edges. In this paper, we propose a fast, efficient, and parallel framework as well as a family of algorithms for counting k-node graphlets. The proposed framework leverages a number of theoretical combinatorial arguments that allow us to obtain significant improvement on the scalability of graphlet counting. For each edge, we count a few graphlets and obtain the exact counts of others in constant time using the combinatorial arguments. On a large collection of 300+ networks from a variety of domains, our graphlet counting strategies are on average 460× faster than existing methods. This brings new opportunities to investigate the use of graphlets on much larger networks and newer applications as we show in the experiments. To the best of our knowledge, this paper provides the largest graphlet computations to date.
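
A tiny example of the combinatorial-argument style for 3-node graphlets per edge: only the triangle count requires a neighbourhood intersection, and the open-wedge count then follows in constant time from the endpoint degrees. This is a simplified sketch; the paper's framework covers larger graphlets and parallel execution.

```python
import networkx as nx

def edge_3node_graphlets(G):
    """For each edge, count triangles directly and derive open wedges in constant time."""
    counts = {}
    for u, v in G.edges():
        tri = len(set(G[u]) & set(G[v]))                          # triangles containing (u, v)
        wedge = (G.degree(u) - 1) + (G.degree(v) - 1) - 2 * tri   # open 2-paths through (u, v)
        counts[(u, v)] = {"triangle": tri, "wedge": wedge}
    return counts

print(edge_3node_graphlets(nx.karate_club_graph())[(0, 1)])
```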

Journal ArticleDOI
TL;DR: A new architecture called a cascading network is proposed that is capable of distributing a deep neural network between a local device and the cloud while keeping the required communication network traffic to a minimum and allows for an early-stopping mechanism during the recall phase of the network.
Abstract: Most of the research on deep neural networks so far has been focused on obtaining higher accuracy levels by building increasingly large and deep architectures. Training and evaluating these models is only feasible when large amounts of resources such as processing power and memory are available. Typical applications that could benefit from these models are, however, executed on resource-constrained devices. Mobile devices such as smartphones already use deep learning techniques, but they often have to perform all processing on a remote cloud. We propose a new architecture called a cascading network that is capable of distributing a deep neural network between a local device and the cloud while keeping the required communication network traffic to a minimum. The network begins processing on the constrained device, and only relies on the remote part when the local part does not provide an accurate enough result. The cascading network allows for an early-stopping mechanism during the recall phase of the network. We evaluated our approach in an Internet of Things context where a deep neural network adds intelligence to a large number of heterogeneous connected devices. This technique enables a whole variety of autonomous systems where sensors, actuators and computing nodes can work together. We show that the cascading architecture allows for a substantial improvement in evaluation speed on constrained devices while the loss in accuracy is kept to a minimum.
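
A minimal sketch of the early-stopping cascade, assuming both the on-device and the cloud model expose a predict_proba-style interface (a hypothetical signature): the device answers locally when its confidence clears a threshold and offloads to the cloud otherwise.

```python
import numpy as np

def cascade_predict(x, local_model, cloud_model, confidence=0.9):
    """Early-stopping cascade: answer locally when confident, otherwise offload.

    local_model / cloud_model are assumed to expose predict_proba(x) -> class
    probabilities. Only the input (or an intermediate feature vector) would need
    to be transmitted in the offloaded case, keeping network traffic low.
    """
    local_probs = local_model.predict_proba(x)
    if np.max(local_probs) >= confidence:
        return int(np.argmax(local_probs)), "local"   # early stop on the device
    cloud_probs = cloud_model.predict_proba(x)        # fall back to the remote part
    return int(np.argmax(cloud_probs)), "cloud"
```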

Journal ArticleDOI
TL;DR: An efficient algorithm named fast algorithm for mining discriminative high utility patterns (DHUPs) with strong frequency affinity (FDHUP) is proposed to efficiently discover DHUPs by considering both the utility and frequency affinity constraints.
Abstract: Recently, high utility pattern mining (HUPM) has been extensively studied. Many approaches for HUPM have been proposed in recent years, but most of them aim at mining HUPs without any consideration for their frequency. This has the major drawback that any combination of a low utility item with a very high utility pattern is regarded as a HUP, even if this combination has low affinity and contains items that rarely co-occur. Thus, frequency should be a key criterion to select HUPs. To address this issue, and derive high utility interesting patterns (HUIPs) with strong frequency affinity, the HUIPM algorithm was proposed. However, it recursively constructs a series of conditional trees to produce candidates and then derive the HUIPs. This procedure is time-consuming and may lead to a combinatorial explosion when the minimum utility threshold is set relatively low. In this paper, an efficient algorithm named fast algorithm for mining discriminative high utility patterns (DHUPs) with strong frequency affinity (FDHUP) is proposed to efficiently discover DHUPs by considering both the utility and frequency affinity constraints. Two compact structures named EI-table and FU-tree and three pruning strategies are introduced in the proposed algorithm to reduce the search space, and efficiently and effectively discover DHUPs. An extensive experimental study shows that the proposed FDHUP algorithm considerably outperforms the state-of-the-art HUIPM algorithm in terms of execution time, memory consumption, and scalability.

Journal ArticleDOI
TL;DR: A simple and efficient algorithm for k-degree anonymity in large networks by considering the neighbourhood centrality score of each edge, which preserves the most important edges of the network, reducing the information loss and increasing the data utility.
Abstract: The problem of anonymization in large networks and the utility of released data are considered in this paper. Although there are some anonymization methods for networks, most of them cannot be applied in large networks because of their complexity. In this paper, we devise a simple and efficient algorithm for k-degree anonymity in large networks. Our algorithm constructs a k-degree anonymous network by the minimum number of edge modifications. We compare our algorithm with other well-known k-degree anonymous algorithms and demonstrate that information loss in real networks is lowered. Moreover, we consider the edge relevance in order to improve the data utility on anonymized networks. By considering the neighbourhood centrality score of each edge, we preserve the most important edges of the network, reducing the information loss and increasing the data utility. An evaluation of clustering processes is performed on our algorithm, proving that edge neighbourhood centrality increases data utility. Lastly, we apply our algorithm to different large real datasets and demonstrate their efficiency and practical utility.
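
The core of k-degree anonymity is making every degree value occur at least k times. The greedy sketch below anonymizes only the degree sequence; realising the new sequence with a minimum number of edge modifications, guided by edge neighbourhood centrality, is the part the paper's algorithm addresses.

```python
def k_anonymous_degree_sequence(degrees, k):
    """Make a degree sequence k-anonymous: every degree value is shared by >= k nodes.

    Greedy sketch: sort degrees in descending order, cut them into groups of at least k,
    and raise every degree in a group to the group's maximum.
    """
    order = sorted(range(len(degrees)), key=lambda i: -degrees[i])
    target = list(degrees)
    i = 0
    while i < len(order):
        group = order[i:i + k]
        if len(order) - (i + k) < k:          # fold a too-small remainder into the last group
            group = order[i:]
        gmax = max(degrees[j] for j in group)
        for j in group:
            target[j] = gmax
        i += len(group)
    return target

print(k_anonymous_degree_sequence([5, 4, 4, 3, 2, 2, 1], k=2))   # -> [5, 5, 4, 4, 2, 2, 2]
```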

Journal ArticleDOI
TL;DR: This paper proposes a localized transportation mode choice model, with which it can dynamically predict the bus travel demand for different bus routing by taking into account both bus and taxi travel demands, and identifies region pairs with flawed bus routes which are effectively optimized using this approach.
Abstract: Optimal planning for public transportation is one of the keys helping to bring a sustainable development and a better quality of life in urban areas. Compared to private transportation, public transportation uses road space more efficiently and produces fewer accidents and emissions. However, in many cities people prefer to take private transportation other than public transportation due to the inconvenience of public transportation services. In this paper, we focus on the identification and optimization of flawed region pairs with problematic bus routing to improve utilization efficiency of public transportation services, according to people's real demand for public transportation. To this end, we first provide an integrated mobility pattern analysis between the location traces of taxicabs and the mobility records in bus transactions. Based on the mobility patterns, we propose a localized transportation mode choice model, with which we can dynamically predict the bus travel demand for different bus routing by taking into account both bus and taxi travel demands. This model is then used for bus routing optimization which aims to convert as many people from private transportation to public transportation as possible given budget constraints on the bus route modification. We also leverage the model to identify region pairs with flawed bus routes, which are effectively optimized using our approach. To validate the effectiveness of the proposed methods, extensive studies are performed on real-world data collected in Beijing which contains 19 million taxi trips and 10 million bus trips.

Journal ArticleDOI
TL;DR: This survey aims to provide a thorough review of a wide range of result diversification techniques, including various definitions of diversification, corresponding algorithms, diversification techniques designed for specific applications including databases, search engines, recommendation systems, graphs, time series and data streams, as well as result diversification systems.
Abstract: Nowadays, in information systems such as web search engines and databases, diversity is becoming increasingly essential and is receiving more and more attention for improving users' satisfaction. In this sense, query result diversification is of vital importance and well worth researching. Some issues, such as the definition of diversification and efficient diverse query processing, are particularly challenging to handle in information systems. Many researchers have focused on various dimensions of the diversification problem. In this survey, we aim to provide a thorough review of a wide range of result diversification techniques, including various definitions of diversification, corresponding algorithms, diversification techniques designed for specific applications including databases, search engines, recommendation systems, graphs, time series and data streams, as well as result diversification systems. We also propose some open research directions, which are challenging and have not been explored up till now, to improve the quality of query results.

Journal ArticleDOI
TL;DR: This paper first estimates how difficult it is to cluster an object by constructing the co-association matrix that summarizes the base clustering results, and then embed the corresponding information as weights associated with objects in a framework called Weighted-Object Ensemble Clustering (WOEC).
Abstract: Ensemble clustering has attracted increasing attention in recent years. Its goal is to combine multiple base clusterings into a single consensus clustering of increased quality. Most of the existing ensemble clustering methods treat each base clustering and each object as equally important, while some approaches make use of weights associated with clusters or clusterings when assembling the different base clusterings. Boosting algorithms developed for classification have led to the idea of considering weighted objects during the clustering process. However, not much effort has been put toward incorporating weighted objects into the consensus process. To fill this gap, in this paper, we propose a framework called Weighted-Object Ensemble Clustering (WOEC). We first estimate how difficult it is to cluster an object by constructing the co-association matrix that summarizes the base clustering results, and we then embed the corresponding information as weights associated with objects. We propose three different consensus techniques to leverage the weighted objects. All three reduce the ensemble clustering problem to a graph partitioning one. We experimentally demonstrate the gain in performance that our WOEC methodology achieves with respect to state-of-the-art ensemble clustering methods, as well as its stability and robustness.
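
A small sketch of the first step: build the co-association matrix from the base clusterings and turn per-object agreement into a weight. The uncertainty-based weight used here is an illustrative choice, not necessarily the paper's exact weighting scheme.

```python
import numpy as np

def object_weights(base_clusterings):
    """Co-association matrix and a per-object 'difficulty' weight.

    base_clusterings: list of label arrays, one array per base clustering.
    C[i, j] is the fraction of base clusterings placing objects i and j in the same
    cluster; entries near 0.5 indicate disagreement, so the average uncertainty of an
    object's co-associations is used as its weight (harder objects get larger weights).
    """
    labels = np.asarray(base_clusterings)                  # shape: (n_clusterings, n_objects)
    C = np.mean([np.equal.outer(l, l) for l in labels], axis=0)
    uncertainty = 1.0 - 2.0 * np.abs(C - 0.5)              # 0 when unanimous, 1 when split 50/50
    return uncertainty.mean(axis=1)

print(object_weights([[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]))
```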

Journal ArticleDOI
TL;DR: This paper presents an approach to detect and minimize the violations of the so-called conservativity principle where novel subsumption entailments between named concepts in one of the input ontologies are considered as unwanted.
Abstract: In order to enable interoperability between ontology-based systems, ontology matching techniques have been proposed. However, when the generated mappings lead to undesired logical consequences, their usefulness may be diminished. In this paper, we present an approach to detect and minimize the violations of the so-called conservativity principle where novel subsumption entailments between named concepts in one of the input ontologies are considered as unwanted. The practical applicability of the proposed approach is experimentally demonstrated on the datasets from the Ontology Alignment Evaluation Initiative.

Journal ArticleDOI
TL;DR: This paper classifies smart city data in sensitive, quasi-sensitive, and open/public levels and then suggests different strategies to process and publish the data within these categories, including data collection, cleansing, anonymization, and publishing.
Abstract: Smart city data come from heterogeneous sources including various types of the Internet of Things such as traffic, weather, pollution, noise, and portable devices. They are characterized with diverse quality issues and with different types of sensitive information. This makes data processing and publishing challenging. In this paper, we propose a framework to streamline smart city data management, including data collection, cleansing, anonymization, and publishing. The paper classifies smart city data in sensitive, quasi-sensitive, and open/public levels and then suggests different strategies to process and publish the data within these categories. The paper evaluates the framework using a real-world smart city data set, and the results verify its effectiveness and efficiency. The framework can be a generic solution to manage smart city data.

Journal ArticleDOI
TL;DR: Experimental results show that the proposed algorithm mines high utility patterns more efficiently than the state-of-the-art algorithms.
Abstract: High utility pattern mining has been studied as an essential topic in the field of pattern mining in order to satisfy requirements of many real-world applications that need to process non-binary databases including item importance such as market analysis. In this paper, we propose an efficient algorithm with a novel indexed list-based data structure for mining high utility patterns. Previous approaches first generate an enormous number of candidate patterns on the basis of overestimation methods in their mining processes and then identify actual high utility patterns from the candidates through an additional database scan, which leads to high computational overheads. Although several list-based algorithms to discover high utility patterns without candidate generation have been suggested in recent years, they require a large number of comparison operations. Our method facilitates efficient mining of high utility patterns with the proposed indexed list by effectively reducing the total number of such operations. Moreover, we develop two techniques based on this novel data structure to further enhance the mining performance of the proposed method. Experimental results on real and synthetic datasets show that the proposed algorithm mines high utility patterns more efficiently than the state-of-the-art algorithms.

Journal ArticleDOI
TL;DR: A detailed survey of recent applications of business analytics to churn, with a focus on computational intelligence methods, is provided, preceded by an in-depth discussion of churn within the context of customer continuity management.
Abstract: Globalization processes and market deregulation policies are rapidly changing the competitive environments of many economic sectors. The appearance of new competitors and technologies leads to an increase in competition and, with it, a growing preoccupation among service-providing companies with creating stronger customer bonds. In this context, anticipating the customer's intention to abandon the provider, a phenomenon known as churn, becomes a competitive advantage. Such anticipation can be the result of the correct application of information-based knowledge extraction in the form of business analytics. In particular, the use of intelligent data analysis, or data mining, for the analysis of market surveyed information can be of great assistance to churn management. In this paper, we provide a detailed survey of recent applications of business analytics to churn, with a focus on computational intelligence methods. This is preceded by an in-depth discussion of churn within the context of customer continuity management. The survey is structured according to the stages identified as basic for the building of the predictive models of churn, as well as according to the different types of predictive methods employed and the business areas of their application.

Journal ArticleDOI
TL;DR: The main contributions of the proposed method are twofold: based on the interests and activities of users on networks, some small communities of similar users are discovered, and then, by using social relations, the discovered communities are extended.
Abstract: Recently, social networking sites have been offering a rich resource of heterogeneous data. The analysis of such data can lead to the discovery of unknown information and relations in these networks. The detection of communities including 'similar' nodes is a challenging topic in the analysis of social network data, and it has been widely studied in the social networking community in the context of the underlying graph structure. Online social networks, in addition to having graph structures, include effective user information within the networks. Using this information helps to enhance the quality of community discovery. In this study, a method of community discovery is provided. Besides the communication among nodes, content information is used as well to improve the quality of the discovered communities. This is a new approach based on frequent patterns and the actions of users on networks, particularly social networking sites where users carry out their preferred activities. The main contributions of the proposed method are twofold: First, based on the interests and activities of users on networks, some small communities of similar users are discovered, and then, by using social relations, the discovered communities are extended. The F-measure is used to evaluate the results on two real-world datasets (Blogcatalog and Flickr), demonstrating that the proposed method improves the quality of community detection.

Journal ArticleDOI
TL;DR: The results show that the graph-based approach is able to handle the specification, integration and analysis of enterprise models represented with different modelling languages, and that the integration challenge resides in defining appropriate mapping functions between the schemas.
Abstract: Enterprise models assist the governance and transformation of organizations through the specification, communication and analysis of strategy, goals, processes, information, along with the underlying application and technological infrastructure. Such models cross-cut different concerns and are often conceptualized using domain-specific modelling languages. This paper explores the application of graph-based semantic techniques to specify, integrate and analyse multiple, heterogeneous enterprise models. In particular, the proposal described in this paper (1) specifies enterprise models as ontological schemas, (2) uses transformation mapping functions to integrate the ontological schemas and (3) analyses the integrated schemas with graph querying and logical inference. The proposal is evaluated through a scenario that integrates three distinct enterprise modelling languages: the business model canvas, e3value, and the business layer of the ArchiMate language. The results show, on the one hand, that the graph-based approach is able to handle the specification, integration and analysis of enterprise models represented with different modelling languages and, on the other, that the integration challenge resides in defining appropriate mapping functions between the schemas.

Journal ArticleDOI
TL;DR: In this paper, the density function is defined as the distance along the shortest path between each majority instance and a minority-cluster pseudo-centroid in an underlying cluster graph.
Abstract: Class imbalance is a challenging problem that manifests as unsatisfactory classification performance on a minority class. A trivial classifier tends to misclassify minority instances because of their tiny fraction. In this paper, our density function is defined as the distance along the shortest path between each majority instance and a minority-cluster pseudo-centroid in an underlying cluster graph. A short path implies highly overlapping dense minority instances. In contrast, a long path indicates a sparsity of instances. A new under-sampling algorithm is proposed to eliminate majority instances with low distances because these instances are insignificant and obscure the classification boundary in the overlapping region. The results show predictive improvements on a minority class from various classifiers on different UCI datasets.
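
A rough sketch of the idea under stated assumptions: approximate the cluster graph with a k-nearest-neighbour graph, use the minority instance nearest to the minority mean as a pseudo-centroid, and drop the majority instances whose shortest-path distance to it is smallest. The graph construction and the removal ratio are illustrative choices, not the paper's exact procedure.

```python
import numpy as np
import networkx as nx
from sklearn.neighbors import kneighbors_graph

def graph_undersample(X, y, removal_ratio=0.3, k=5):
    """Remove the majority instances closest (along graph paths) to a minority pseudo-centroid."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    A = kneighbors_graph(X, k, mode="distance")          # weighted k-NN graph
    G = nx.from_scipy_sparse_array(A)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    # Minority instance nearest to the minority mean acts as the pseudo-centroid.
    centroid = minority[np.argmin(np.linalg.norm(X[minority] - X[minority].mean(0), axis=1))]
    dist = nx.single_source_dijkstra_path_length(G, int(centroid), weight="weight")
    ranked = sorted(majority, key=lambda i: dist.get(i, np.inf))   # closest first
    drop = set(ranked[: int(removal_ratio * len(majority))])
    keep = np.array([i for i in range(len(y)) if i not in drop])
    return X[keep], y[keep]
```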

Journal ArticleDOI
TL;DR: A comprehensive review of data analytics paradigms for intrusion detection along with an overview of techniques that apply contextual information in a layered manner with consistent, coherent, and feasible evidence toward the correct prediction of cyber-attacks is presented.
Abstract: Research in cyber-security has demonstrated that dealing with cyber-attacks is by no means an easy task. One particular limitation of existing research originates from the uncertainty of information that is gathered to discover attacks. This uncertainty is partly due to the lack of attack prediction models that utilize contextual information to analyze activities that target computer networks. The focus of this paper is a comprehensive review of data analytics paradigms for intrusion detection along with an overview of techniques that apply contextual information for intrusion detection. A new research taxonomy is introduced consisting of several dimensions of data mining techniques, which create attack prediction models. The survey reveals the need to use multiple categories of contextual information in a layered manner with consistent, coherent, and feasible evidence toward the correct prediction of cyber-attacks.

Journal ArticleDOI
TL;DR: Analysis of the intrinsic complexity of several microarray datasets with and without feature selection, and its connection with the empirical results obtained by four widely used classifiers, demonstrates that a correlation exists between microarray data complexity and the classification error rates.
Abstract: Data complexity analysis enables an understanding of whether classification performance could be affected, not by algorithm limitations, but by intrinsic data characteristics. Microarray datasets based on high numbers of gene expressions combined with small sample sizes represent a particular challenge for machine learning researchers. This type of data also has other particularities that may negatively affect the generalization capacity of classifiers, such as overlaps between classes and class imbalance. Making use of several complexity measures, we analyzed the intrinsic complexity of several microarray datasets with and without feature selection and then explored the connection with the empirical results obtained by four widely used classifiers. Experimental results for 21 binary and multiclass datasets demonstrate that a correlation exists between microarray data complexity and the classification error rates.
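
As a concrete example of such a measure, Fisher's discriminant ratio (often called F1 in the data-complexity literature) scores the best single-feature separation between two classes; the paper uses a set of complexity measures, of which this is just one representative.

```python
import numpy as np

def fisher_discriminant_ratio(X, y):
    """Fisher's discriminant ratio (the F1 complexity measure) for a binary dataset.

    Higher values mean the classes are well separated along at least one feature;
    values near zero indicate heavy class overlap, which tends to raise error rates.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0) + 1e-12   # avoid division by zero
    return float(np.max(num / den))
```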

Journal ArticleDOI
TL;DR: This paper proposes a new online transfer learning algorithm that merges and leverages the classifiers of the source and target domain with an ensemble method and demonstrates that the algorithm outperforms the compared baseline algorithms.
Abstract: Transfer learning aims to enhance performance in a target domain by exploiting useful information from auxiliary or source domains when the labeled data in the target domain are insufficient or difficult to acquire. In some real-world applications, the data of source domain are provided in advance, but the data of target domain may arrive in a stream fashion. This kind of problem is known as online transfer learning. In practice, there can be several source domains that are related to the target domain. The performance of online transfer learning is highly associated with selected source domains, and simply combining the source domains may lead to unsatisfactory performance. In this paper, we seek to promote classification performance in a target domain by leveraging labeled data from multiple source domains in online setting. To achieve this, we propose a new online transfer learning algorithm that merges and leverages the classifiers of the source and target domain with an ensemble method. The mistake bound of the proposed algorithm is analyzed, and the comprehensive experiments on three real-world data sets illustrate that our algorithm outperforms the compared baseline algorithms.
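
A hedged sketch of one way such an ensemble can work: pretrained source classifiers and an online target classifier vote with weights that are multiplicatively decreased when a member errs, while the target model keeps learning from the stream. The exact weight update and mistake bound in the paper differ from this simplified version.

```python
import numpy as np
from sklearn.linear_model import Perceptron

class OnlineTransferEnsemble:
    """Weighted vote over pretrained source classifiers and an online target classifier."""

    def __init__(self, source_models, classes, beta=0.8):
        self.sources = list(source_models)        # already trained on the source domains
        self.target = Perceptron()                # learned online on the target stream
        self.classes = classes
        self.beta = beta
        self.w = np.ones(len(self.sources) + 1)   # one weight per ensemble member
        self.target_ready = False

    def _members(self):
        return self.sources + ([self.target] if self.target_ready else [])

    def predict(self, x):                         # x: a single example as a 2D array
        votes = {}
        for weight, model in zip(self.w, self._members()):
            label = model.predict(x)[0]
            votes[label] = votes.get(label, 0.0) + weight
        return max(votes, key=votes.get)

    def update(self, x, y):                       # y: true label as a length-1 array
        for i, model in enumerate(self._members()):
            if model.predict(x)[0] != y[0]:
                self.w[i] *= self.beta            # penalise the mistaken member
        self.target.partial_fit(x, y, classes=self.classes)
        self.target_ready = True
```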

Journal ArticleDOI
TL;DR: This paper proposes SGMatch, which represents graphs in smaller units called graphlets and develops a matching technique to leverage this representation; SGMatch substantially improves the performance of current state-of-the-art techniques for larger query graphs with different structures, i.e., cliques, paths or subgraphs.
Abstract: Graphs are natural candidates for modeling application domains, such as social networks, pattern recognition, citation networks, or protein---protein interactions. One of the most challenging tasks in managing graphs is subgraph matching over data graphs, which attempts to find one-to-one correspondences, called solutions, among the query and data nodes. To compute solutions, most contemporary techniques use backtracking and recursion. An open research question is whether graphs can be matched based on parts and local solutions can be combined to reach a global matching. In this paper, we present an approach based on graph decomposition called SGMatch to match graphs. We represent graphs in smaller units called graphlets and develop a matching technique to leverage this representation. Pruning strategies use a new notion of edge covering called minimum hub cover and metadata, such as statistics and inverted indices, to reduce the number of matching candidates. Our evaluation of SGMatch versus contemporary algorithms, i.e., VF2, GraphQL, QuickSI, GADDI, or SPath, shows that SGMatch substantially improves the performance of current state-of-the-art techniques for larger query graphs with different structures, i.e., cliques, paths or subgraphs.