
Showing papers in "Knowledge Based Systems in 2014"


Journal ArticleDOI
TL;DR: A new user similarity model is presented to improve recommendation performance when only a few ratings are available to calculate the similarities for each user; it considers not only the local context information of user ratings but also the global preference of user behavior.
Abstract: We first analyze the shortcomings of the existing similarity measures in collaborative filtering. Second, we propose a new user similarity model to overcome these drawbacks. We compare the new model with many other similarity measures on real data sets, and experiments show that the new model reaches better performance than many existing similarity measures. Collaborative filtering has become one of the most widely used approaches to providing personalized services for users. The key to this approach is to find similar users or items using the user-item rating matrix so that the system can show recommendations to users. However, most methods of this kind are based on similarity measures such as cosine, the Pearson correlation coefficient, and mean squared difference. These measures are not very effective, especially under cold-start conditions. This paper presents a new user similarity model to improve recommendation performance when only a few ratings are available to calculate the similarities for each user. The model considers not only the local context information of user ratings but also the global preference of user behavior. Experiments on three real data sets are conducted and compared with many state-of-the-art similarity measures. The results show the superiority of the new similarity model in recommendation performance.
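For context, the baseline similarity measures this abstract criticizes can be sketched in a few lines. The toy ratings below are illustrative only; note how the Pearson correlation computed over just two co-rated items saturates at 1.0, which is exactly the cold-start weakness the paper targets:

```python
import math

def cosine_sim(ra, rb):
    """Cosine similarity over the items both users rated.
    ra, rb: dicts mapping item id -> rating."""
    common = set(ra) & set(rb)
    if not common:
        return 0.0
    num = sum(ra[i] * rb[i] for i in common)
    den = math.sqrt(sum(ra[i] ** 2 for i in common)) * \
          math.sqrt(sum(rb[i] ** 2 for i in common))
    return num / den if den else 0.0

def pearson_sim(ra, rb):
    """Pearson correlation over co-rated items."""
    common = set(ra) & set(rb)
    if len(common) < 2:
        return 0.0
    ma = sum(ra[i] for i in common) / len(common)
    mb = sum(rb[i] for i in common) / len(common)
    num = sum((ra[i] - ma) * (rb[i] - mb) for i in common)
    den = math.sqrt(sum((ra[i] - ma) ** 2 for i in common)) * \
          math.sqrt(sum((rb[i] - mb) ** 2 for i in common))
    return num / den if den else 0.0

u = {"i1": 5, "i2": 3, "i3": 4}
v = {"i1": 4, "i2": 2, "i4": 5}
print(round(cosine_sim(u, v), 3))  # computed on the two co-rated items only
```

With only two co-rated items both measures report near-perfect similarity regardless of the users' global behavior, which motivates adding a global-preference term.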

528 citations


Journal ArticleDOI
TL;DR: A Back Propagation neural network based on Particle Swarm Optimization (PSO-BP) is combined with comprehensive input parameter selection; the resulting method achieves much better forecast performance than the basic back propagation neural network and an ARIMA model.
Abstract: As a clean and renewable energy source, wind energy has been gaining increasing global attention. Wind speed forecasting is of great significance for the wind energy domain: planning and design of wind farms, wind farm operation control, wind power prediction, power grid operation scheduling, and more. Many wind speed forecasting algorithms have been proposed to improve prediction accuracy; few of them, however, have studied how to select input parameters carefully to achieve the desired results. After introducing a Back Propagation neural network based on Particle Swarm Optimization (PSO-BP), this paper details a method called IS-PSO-BP that combines PSO-BP with comprehensive Input parameter Selection (IS). To evaluate the forecast performance of the proposed approach, this paper uses daily average wind speed data from Jiuquan and 6-hourly wind speed data from Yumen, Gansu, China from 2001 to 2006 as a case study. The experimental results clearly show that, for these two datasets, the proposed method achieves much better forecast performance than the basic back propagation neural network and an ARIMA model.
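The PSO component of PSO-BP can be illustrated with a minimal generic particle swarm optimizer. This is a sketch of standard PSO on a toy objective, not the paper's IS-PSO-BP; in the paper the objective would be the network's forecast error and each particle would encode BP weights and thresholds:

```python
import random

def pso(f, dim, n_particles=30, iters=300, lo=-5.0, hi=5.0,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal particle swarm optimizer minimizing a continuous objective f."""
    rnd = random.Random(seed)
    pos = [[rnd.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]               # each particle's best position
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # swarm's best position
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rnd.random(), rnd.random()
                # velocity: inertia + cognitive pull + social pull
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

sphere = lambda x: sum(v * v for v in x)  # toy objective with minimum 0 at origin
best, best_val = pso(sphere, dim=3)
```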

419 citations


Journal ArticleDOI
TL;DR: A novel spam detection method focused on reducing the false positive error of mislabeling nonspam as spam; experiments demonstrate that the MBPSO is superior to GA, RSA, PSO, and BPSO in classification performance, and that wrappers are more effective than filters with regard to classification performance indexes.
Abstract: In this paper, we propose a novel spam detection method that focuses on reducing the false positive error of mislabeling nonspam as spam. First, we used the wrapper-based feature selection method to extract crucial features. Second, the decision tree was chosen as the classifier model with C4.5 as the training algorithm. Third, a cost matrix was introduced to give different weights to the two error types, i.e., the false positive and the false negative errors; we define a weight parameter a to adjust the relative importance of the two error types. Fourth, K-fold cross validation was employed to reduce out-of-sample error. Finally, the binary PSO with mutation operator (MBPSO) was used as the subset search strategy. Our experimental dataset contains 6000 emails collected during 2012. We conducted a Kolmogorov–Smirnov hypothesis test on the capital-run-length related features and found that all the p values were less than 0.001. Afterwards, we found that a = 7 was the most appropriate value in our model. Among seven meta-heuristic algorithms, we demonstrated that the MBPSO is superior to GA, RSA, PSO, and BPSO in terms of classification performance. The sensitivity, specificity, and accuracy of the decision tree with feature selection by MBPSO were 91.02%, 97.51%, and 94.27%, respectively. We also compared the MBPSO with conventional feature selection methods such as SFS and SBS; the results showed that the MBPSO performs better than both. We further demonstrated that wrappers are more effective than filters with regard to classification performance indexes. It was clearly shown that the proposed method is effective, and that it can reduce the false positive error without compromising the sensitivity and accuracy values.
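The cost-matrix idea, giving false positives (ham flagged as spam) a times the weight of false negatives, can be sketched directly; the confusion-matrix counts below are made up for illustration, not taken from the paper:

```python
def weighted_cost(tp, fp, tn, fn, a=7):
    """Cost-sensitive error: each false positive (ham flagged as spam)
    costs a times as much as a false negative (spam let through)."""
    return a * fp + fn

def sensitivity(tp, fn):
    """True positive rate: share of spam correctly caught."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: share of ham correctly passed."""
    return tn / (tn + fp)

# invented counts for illustration
cost = weighted_cost(tp=500, fp=10, tn=400, fn=30)
```

A search strategy such as MBPSO would then pick the feature subset whose classifier minimizes this weighted cost rather than the raw error count.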

372 citations


Journal ArticleDOI
TL;DR: Results show that at individual stock, sector and index levels, the models with sentiment analysis outperform the bag-of-words model in both validation set and independent testing set, and the models which use sentiment polarity cannot provide useful predictions.
Abstract: Financial news articles are believed to have an impact on stock price returns. Previous works model news pieces in a bag-of-words space, which analyzes the latent relationship between word statistical patterns and stock price movements. However, news sentiment, an important link in the chain mapping word patterns to price movements, has rarely been touched on. In this paper, we first implement a generic stock price prediction framework and plug in six different models with different analyzing approaches. To take a step further, we use the Harvard psychological dictionary and the Loughran–McDonald financial sentiment dictionary to construct a sentiment space. Textual news articles are then quantitatively measured and projected onto the sentiment space. The instance labeling method is rigorously discussed and tested. We evaluate the models' prediction accuracy and empirically compare their performance at different market classification levels. Experiments are conducted on five years of historical Hong Kong Stock Exchange prices and news articles. Results show that (1) at individual stock, sector and index levels, the models with sentiment analysis outperform the bag-of-words model in both the validation set and the independent testing set; (2) the models which use sentiment polarity cannot provide useful predictions; (3) there is a minor difference between the models using the two different sentiment dictionaries.
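The projection of an article onto a sentiment space can be sketched as counting dictionary hits. The tiny word lists here are invented stand-ins for the Harvard and Loughran–McDonald dictionaries, and the two-dimensional space is a simplification of the paper's construction:

```python
# toy stand-ins for entries in a financial sentiment dictionary
POSITIVE = {"gain", "growth", "profit", "strong", "surge"}
NEGATIVE = {"loss", "decline", "weak", "lawsuit", "risk"}

def sentiment_vector(text):
    """Project an article onto a 2-D sentiment space:
    (positive word rate, negative word rate)."""
    words = text.lower().split()
    if not words:
        return (0.0, 0.0)
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos / len(words), neg / len(words))

v = sentiment_vector("strong profit growth despite lawsuit risk")
```

A prediction model would consume such vectors (rather than raw word counts) as features for each news article.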

368 citations


Journal ArticleDOI
TL;DR: A structured and comprehensive overview of the literature in the field of Web Data Extraction, grouping applications into two main classes, namely applications at the Enterprise level and at the Social Web level, where Web Data Extraction techniques make it possible to gather the large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users.
Abstract: Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for performing data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential of cross-fertilization, i.e., the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain in other domains.

364 citations


Journal ArticleDOI
TL;DR: A novel paradigm for concept-level sentiment analysis that merges linguistics, common-sense computing, and machine learning to improve the accuracy of tasks such as polarity detection; by allowing sentiments to flow from concept to concept based on the dependency relations of the input sentence, it achieves a better understanding of the contextual role of each concept within the sentence and, hence, a polarity detector that outperforms state-of-the-art statistical methods.
Abstract: The Web is evolving through an era in which the opinions of users are becoming increasingly important and valuable. The distillation of knowledge from the huge amount of unstructured information on the Web can be a key factor for tasks such as social media marketing, branding, product positioning, and corporate reputation management. These online social data, however, remain hardly accessible to computers, as they are specifically meant for human consumption. The automatic analysis of online opinions involves a deep understanding of natural language text by machines, from which we are still very far. To this end, concept-level sentiment analysis aims to go beyond a mere word-level analysis of text and provide novel approaches to opinion mining and sentiment analysis that enable a more efficient passage from (unstructured) textual information to (structured) machine-processable data. A recent knowledge-based technology in this context is sentic computing, which relies on the ensemble application of common-sense computing and the psychology of emotions to infer the conceptual and affective information associated with natural language. Sentic computing, however, is limited by the richness of the knowledge base and by the fact that the bag-of-concepts model, despite being more sophisticated than bag-of-words, misses out on important discourse structure information that is key for properly detecting the polarity conveyed by natural language opinions. In this work, we introduce a novel paradigm for concept-level sentiment analysis that merges linguistics, common-sense computing, and machine learning to improve the accuracy of tasks such as polarity detection.
By allowing sentiments to flow from concept to concept based on the dependency relations of the input sentence, we achieve a better understanding of the contextual role of each concept within the sentence and, hence, obtain a polarity detection engine that outperforms state-of-the-art statistical methods.

325 citations


Journal ArticleDOI
TL;DR: The objective is to develop an interval type-2 fuzzy AHP method together with a new ranking method for type-2 fuzzy sets, and to apply the proposed method to a supplier selection problem.
Abstract: The membership functions of type-1 fuzzy sets have no uncertainty associated with them. Although type-2 fuzzy sets require far more arithmetic operations than type-1 fuzzy sets, they generalize type-1 fuzzy sets and systems so that more uncertainty in defining membership functions can be handled. A type-2 fuzzy set lets us incorporate the uncertainty of membership functions into fuzzy set theory. Some fuzzy multicriteria methods have recently been extended by using type-2 fuzzy sets. The Analytic Hierarchy Process (AHP) is a widely used multicriteria method that can take various and conflicting criteria into account at the same time. Our objective is to develop an interval type-2 fuzzy AHP method together with a new ranking method for type-2 fuzzy sets. We apply the proposed method to a supplier selection problem.
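For reference, the crisp AHP machinery that the paper extends can be sketched with the row geometric mean approximation of the priority vector; the pairwise judgments below are invented, and the interval type-2 fuzzy version replaces these crisp values with fuzzy numbers and adds the proposed ranking method:

```python
import math

def ahp_priorities(M):
    """Approximate the AHP priority vector of a pairwise comparison
    matrix using the row geometric mean method."""
    n = len(M)
    gm = [math.prod(row) ** (1.0 / n) for row in M]  # geometric mean per row
    s = sum(gm)
    return [g / s for g in gm]                        # normalize to sum 1

# invented judgments: criterion A is 3x as important as B, 5x as important
# as C; B is 2x as important as C (reciprocals fill the lower triangle)
M = [[1,     3,   5],
     [1 / 3, 1,   2],
     [1 / 5, 1 / 2, 1]]
w = ahp_priorities(M)
```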

318 citations


Journal ArticleDOI
TL;DR: This paper makes a full summary, analysis and evaluation of the current literature on FDP from four aspects: the definition of financial distress in the new century, FDP modeling, sampling approaches for FDP, and featuring approaches for FDP.
Abstract: As a hot topic, financial distress prediction (FDP), also called corporate failure prediction or bankruptcy prediction, plays an important role in decision-making in various areas, including accounting, finance, business, and engineering. Since academic research on FDP has gone on for nearly eighty years, there is an abundant literature on this topic, which may appear chaotic and confusing to researchers in the field. This paper contributes to the existing review research by providing a full summary, analysis and evaluation of the current FDP literature. The literature is reviewed from the following four aspects: the definition of financial distress in the new century, FDP modeling, sampling approaches for FDP, and featuring approaches for FDP. Taking the new state-of-the-art techniques in this area into account, FDP modeling is classified and reviewed in the following groups: modeling with a pure single classifier, modeling with a hybrid single classifier, modeling by ensemble techniques, dynamic FDP modeling, and modeling with group decision-making techniques. Sampling methods for FDP are classified and reviewed in the following paired groups: training sampling and testing sampling, single-industry sampling and cross-industry sampling, balanced sampling and imbalanced sampling. Featuring methods for FDP are categorized and reviewed as qualitative selection and the combination of qualitative and quantitative selection. We comment on the current research within each category and propose further research topics. This review is valuable for guiding research and application in the area.

259 citations


Journal ArticleDOI
TL;DR: A social network analysis (SNA) trust-consensus based group decision making model with interval-valued fuzzy reciprocal preference relations (IFRPRs) is investigated, whose main novelty is that it determines the importance degree of experts by combining two reliable resources: trust degree (TD) and consensus level (CL).
Abstract: A social network analysis (SNA) trust-consensus based group decision making model with interval-valued fuzzy reciprocal preference relation (IFRPR) is investigated. The main novelty of this model is that it determines the importance degree of experts by combining two reliable resources: trust degree (TD) and consensus level (CL). To do that, an interval-valued fuzzy SNA methodology to represent and model trust relationships between experts and to compute the trust degree of each expert is developed. The multiplicative consistency property of IFRPR is also investigated, and the consistency indexes for the three different levels of an IFRPR are defined. Additionally, similarity indexes of IFRPR are defined to measure the level of agreement among the group of experts. The consensus level is derived by combining both the consistency index and the similarity index, and it is used to guide a feedback mechanism to support experts in changing their opinions to achieve a consensus solution with a high degree of consistency. Finally, a quantifier guided non-dominance possibility degree (QGNDPD) based prioritisation method to derive the final trust-consensus based solution is proposed.

249 citations


Journal ArticleDOI
TL;DR: Results demonstrate that this novel method to incorporate social trust information (i.e., trusted neighbors explicitly specified by users) in providing recommendations outperforms other counterparts both in terms of accuracy and coverage.
Abstract: Providing high quality recommendations is important for e-commerce systems to assist users in making effective selection decisions from a plethora of choices. Collaborative filtering is a widely accepted technique to generate recommendations based on the ratings of like-minded users. However, it suffers from several inherent issues such as data sparsity and cold start. To address these problems, we propose a novel method called "Merge" to incorporate social trust information (i.e., trusted neighbors explicitly specified by users) in providing recommendations. Specifically, ratings of a user's trusted neighbors are merged to complement and represent the preferences of the user and to find other users with similar preferences (i.e., similar users). In addition, the quality of merged ratings is measured by the confidence considering the number of ratings and the ratio of conflicts between positive and negative opinions. Further, the rating confidence is incorporated into the computation of user similarity. The prediction for a given item is generated by aggregating the ratings of similar users. Experimental results based on three real-world data sets demonstrate that our method outperforms other counterparts both in terms of accuracy and coverage.
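The merging step can be sketched as follows. The confidence formula here is a simplified stand-in for the paper's measure, which likewise grows with the number of ratings and shrinks with the ratio of conflicting positive and negative opinions; the positivity threshold is also an assumption:

```python
def merge_ratings(neighbors, item):
    """Merge an item's ratings from a user's trusted neighbors.
    neighbors: list of dicts mapping item id -> rating (1-5 scale assumed).
    Returns (merged rating, confidence), or None if nobody rated the item."""
    ratings = [nb[item] for nb in neighbors if item in nb]
    if not ratings:
        return None
    merged = sum(ratings) / len(ratings)
    pos = sum(r >= 3 for r in ratings)        # positive opinions (threshold assumed)
    neg = len(ratings) - pos
    conflict = min(pos, neg) / len(ratings)   # share of minority opinions
    # more ratings -> higher confidence; more conflict -> lower confidence
    confidence = (len(ratings) / (len(ratings) + 1)) * (1 - conflict)
    return merged, confidence

neighbors = [{"i1": 4}, {"i1": 5, "i2": 2}, {"i2": 1}]
merged, conf = merge_ratings(neighbors, "i1")
```

The merged ratings then stand in for the cold-start user's own profile when computing similarity to other users.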

233 citations


Journal ArticleDOI
TL;DR: Results suggest that retinal image processing is a valid approach for automatic DR screening; testing on the publicly available Messidor database shows that 90% sensitivity, 91% specificity, 90% accuracy, and a 0.989 AUC are achieved in a disease/no-disease setting.
Abstract: In this paper, an ensemble-based method for the screening of diabetic retinopathy (DR) is proposed. This approach is based on features extracted from the output of several retinal image processing algorithms, such as image-level (quality assessment, pre-screening, AM/FM), lesion-specific (microaneurysms, exudates) and anatomical (macula, optic disk) components. The actual decision about the presence of the disease is then made by an ensemble of machine learning classifiers. We have tested our approach on the publicly available Messidor database, where 90% sensitivity, 91% specificity, 90% accuracy, and a 0.989 AUC are achieved in a disease/no-disease setting. These results are highly competitive in this field and suggest that retinal image processing is a valid approach for automatic DR screening.
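The final ensemble decision can be sketched as a majority vote over component classifiers; the feature names and thresholds below are invented for illustration and do not reflect the paper's actual learned models:

```python
def majority_vote(classifiers, features):
    """Ensemble decision: each classifier votes disease (1) / no disease (0);
    the majority wins."""
    votes = [clf(features) for clf in classifiers]
    return int(sum(votes) * 2 > len(votes))

# toy component classifiers built on extracted lesion features (names assumed)
clf_microaneurysm = lambda f: int(f["microaneurysm_count"] > 2)
clf_exudate = lambda f: int(f["exudate_area"] > 0.01)
clf_quality = lambda f: int(f["quality_score"] < 0.5)

ensemble = [clf_microaneurysm, clf_exudate, clf_quality]
patient = {"microaneurysm_count": 5, "exudate_area": 0.002, "quality_score": 0.9}
decision = majority_vote(ensemble, patient)  # only one vote for disease
```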

Journal ArticleDOI
TL;DR: A novel approach for sentiment classification based on meta-level features is proposed, which boosts existing sentiment classification of subjectivity and polarity detection on Twitter and offers a more global insight of the resource components for the complex task of classifying human emotion and opinion.
Abstract: People react to events, topics and entities by expressing their personal opinions and emotions. These reactions can correspond to a wide range of intensities, from very mild to strong. An adequate processing and understanding of these expressions has been the subject of research in several fields, such as business and politics. In this context, Twitter sentiment analysis, which is the task of automatically identifying and extracting subjective information from tweets, has received increasing attention from the Web mining community. Twitter provides an extremely valuable insight into human opinions, as well as new challenging Big Data problems. These problems include the processing of massive volumes of streaming data, as well as the automatic identification of human expressiveness within short text messages. In that area, several methods and lexical resources have been proposed in order to extract sentiment indicators from natural language texts at both syntactic and semantic levels. These approaches address different dimensions of opinions, such as subjectivity, polarity, intensity and emotion. This article is the first study of how these resources, which are focused on different sentiment scopes, complement each other. With this purpose we identify scenarios in which some of these resources are more useful than others. Furthermore, we propose a novel approach for sentiment classification based on meta-level features. This supervised approach boosts existing sentiment classification of subjectivity and polarity detection on Twitter. Our results show that the combination of meta-level features provides significant improvements in performance. However, we observe that there are important differences that rely on the type of lexical resource, the dataset used to build the model, and the learning strategy. Experimental results indicate that manually generated lexicons are focused on emotional words, being very useful for polarity prediction. 
On the other hand, lexicons generated with automatic methods include neutral words, introducing noise in the detection of subjectivity. Our findings indicate that polarity and subjectivity prediction are different dimensions of the same problem, but they need to be addressed using different subspace features. Lexicon-based approaches are recommendable for polarity, and stylistic part-of-speech based approaches are meaningful for subjectivity. With this research we offer a more global insight of the resource components for the complex task of classifying human emotion and opinion.

Journal ArticleDOI
TL;DR: The TODIM (an acronym in Portuguese of interactive and multi-criteria decision making) method is extended, which is based on prospect theory and can effectively capture the decision maker's psychological behavior, to solve this type of problems under hesitant fuzzy environment.
Abstract: Hesitant fuzzy sets (HFSs) are used to deal with situations in which the decision makers hesitate among several values when assessing an indicator, alternative, variable, etc. Recently, multi-criteria decision making (MCDM) problems with hesitant fuzzy information have received increasing attention and many corresponding MCDM methods have been developed, but none of them takes the decision maker's psychological behavior into account. In this study, we extend the TODIM (an acronym in Portuguese of interactive and multi-criteria decision making) method, which is based on prospect theory and can effectively capture the decision maker's psychological behavior, to solve this type of problem under a hesitant fuzzy environment. First, we develop two novel measured functions for comparing the magnitude of hesitant fuzzy elements and interval-valued hesitant fuzzy elements, which are more reasonable and effective than the existing measured functions. Then, we calculate the dominance degree of each alternative relative to the others based on the novel measured functions and distance measures. By aggregating these dominance degrees, we can further obtain the overall value of each alternative and thereby rank the alternatives. Finally, a decision making problem concerning the evaluation and ranking of service quality among domestic airlines is used to illustrate the validity and applicability of the proposed method.

Journal ArticleDOI
TL;DR: An extensive evaluation of similarity measures for time series classification following the aforementioned principles is provided, showing the equivalence, in terms of accuracy, of a number of measures, but with one single candidate outperforming the rest.
Abstract: Time series are ubiquitous, and a measure to assess their similarity is a core part of many computational systems. In particular, the similarity measure is the most essential ingredient of time series clustering and classification systems. Because of this importance, countless approaches to estimate time series similarity have been proposed. However, there is a lack of comparative studies using empirical, rigorous, quantitative, and large-scale assessment strategies. In this article, we provide an extensive evaluation of similarity measures for time series classification following the aforementioned principles. We consider 7 different measures coming from alternative measure 'families', and 45 publicly-available time series data sets coming from a wide variety of scientific domains. We focus on out-of-sample classification accuracy, but in-sample accuracies and parameter choices are also discussed. Our work is based on rigorous evaluation methodologies and includes the use of powerful statistical significance tests to derive meaningful conclusions. The obtained results show the equivalence, in terms of accuracy, of a number of measures, but with one single candidate outperforming the rest. Such findings, together with the followed methodology, invite researchers in the field to adopt more consistent evaluation criteria and to make more informed decisions regarding the baseline measures to which new developments should be compared.
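Although the abstract does not name the winning measure here, dynamic time warping (DTW) is the classic elastic measure included in such comparisons, and it illustrates what an elastic time series similarity measure looks like:

```python
def dtw(a, b):
    """Dynamic time warping distance between two sequences, with
    squared point-wise cost and an unconstrained warping window."""
    inf = float("inf")
    n, m = len(a), len(b)
    # D[i][j]: cost of the best alignment of a[:i] with b[:j]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

# the second series is a time-shifted copy; DTW aligns them at zero cost,
# whereas a point-wise (Euclidean) comparison would not
print(dtw([0, 1, 2, 1, 0], [0, 0, 1, 2, 1, 0]))
```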

Journal ArticleDOI
TL;DR: In the proposed IFFO, a new control parameter is introduced to tune the search scope around the swarm location adaptively, and a new solution generating method is developed to enhance the accuracy and convergence rate of the algorithm.
Abstract: This paper presents an improved fruit fly optimization (IFFO) algorithm for solving continuous function optimization problems. In the proposed IFFO, a new control parameter is introduced to tune the search scope around its swarm location adaptively. A new solution generating method is developed to enhance accuracy and convergence rate of the algorithm. Extensive computational experiments and comparisons are carried out based on a set of 29 benchmark functions from the literature. The computational results show that the proposed IFFO not only significantly improves the basic fruit fly optimization algorithm but also performs much better than five state-of-the-art harmony search algorithms.
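The adaptive search-scope idea can be sketched as follows; this is a simplified greedy variant for illustration, not the paper's exact IFFO update rule:

```python
import random

def iffo(f, dim, iters=200, pop=20, r0=1.0, seed=1):
    """Sketch of improved fruit fly optimization: candidate solutions are
    sampled around the swarm's best location within a search radius that
    shrinks linearly over the iterations (the adaptive search-scope idea)."""
    rnd = random.Random(seed)
    best = [rnd.uniform(-5, 5) for _ in range(dim)]
    best_val = f(best)
    for t in range(iters):
        r = r0 * (1 - t / iters)  # adaptive search radius: wide early, fine late
        for _ in range(pop):
            cand = [x + rnd.uniform(-r, r) for x in best]
            val = f(cand)
            if val < best_val:
                best, best_val = cand, val
    return best, best_val

sphere = lambda x: sum(v * v for v in x)  # toy benchmark with minimum 0
best, val = iffo(sphere, dim=3)
```

Shrinking the radius trades exploration for exploitation over time, which is the mechanism the paper credits for the improved accuracy and convergence rate.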

Journal ArticleDOI
TL;DR: A novel GA based clustering technique is proposed that is capable of automatically finding the right number of clusters and identifying the right genes through a novel initial population selection approach; with the help of its novel fitness function and gene rearrangement operation, it produces high quality cluster centers.
Abstract: Many existing clustering techniques, including K-Means, require the user to input the number of clusters, and it is often extremely difficult for a user to accurately estimate the number of clusters in a data set. Genetic algorithms (GAs) generally determine the number of clusters automatically. However, they typically choose the genes and the number of genes randomly. If we can identify the right genes in the initial population, then GAs have a better chance of producing a high quality clustering result than when the genes are chosen randomly. We propose a novel GA based clustering technique that is capable of automatically finding the right number of clusters and identifying the right genes through a novel initial population selection approach. With the help of our novel fitness function and gene rearrangement operation, it produces high quality cluster centers. The centers are then fed into K-Means as initial seeds in order to produce an even higher quality clustering solution by allowing the initial seeds to readjust as needed. Our experimental results indicate a statistically significant superiority (according to the sign test analysis) of our technique over five recent techniques on the twenty natural data sets used in this study, based on six evaluation criteria.
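The hand-off from the GA stage to K-Means can be sketched with a seeded Lloyd's algorithm, where the GA-produced centers (stand-in values below) are allowed to readjust:

```python
def kmeans(points, seeds, iters=10):
    """Lloyd's K-Means started from given seed centers (e.g. the centers
    produced by the GA stage), letting them readjust to the data."""
    centers = [list(s) for s in seeds]
    for _ in range(iters):
        # assign each point to its nearest center
        groups = [[] for _ in centers]
        for p in points:
            k = min(range(len(centers)),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            groups[k].append(p)
        # move each center to the mean of its assigned points
        for k, g in enumerate(groups):
            if g:
                centers[k] = [sum(col) / len(g) for col in zip(*g)]
    return centers

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
c = kmeans(pts, seeds=[(0.5, 0.5), (10.5, 10.5)])  # stand-in "GA" seeds
```

Good seeds matter because K-Means only refines locally; starting near the true cluster centers avoids the poor local optima that random initialization can fall into.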

Journal ArticleDOI
TL;DR: The results obtained in this study indicate that the proposed FA-MSVR method is a promising alternative for forecasting interval-valued financial time series.
Abstract: Highly accurate interval forecasting of a stock price index is fundamental to successfully making a profit when making investment decisions, by providing a range of values rather than a point estimate. In this study, we investigate the possibility of forecasting an interval-valued stock price index series over short and long horizons using multi-output support vector regression (MSVR). Furthermore, this study proposes a firefly algorithm (FA)-based approach, built on the established MSVR, for determining the parameters of MSVR (abbreviated as FA-MSVR). Three globally traded broad market indices are used to compare the performance of the proposed FA-MSVR method with selected counterparts. The quantitative and comprehensive assessments are performed on the basis of statistical criteria, economic criteria, and computational cost. In terms of statistical criteria, we compare the out-of-sample forecasting using goodness-of-forecast measures and testing approaches. In terms of economic criteria, we assess the relative forecast performance with a simple trading strategy. The results obtained in this study indicate that the proposed FA-MSVR method is a promising alternative for forecasting interval-valued financial time series.

Journal ArticleDOI
TL;DR: EmoSenticSpace, a new framework for affective common-sense reasoning that extends WordNet-Affect and SenticNet by providing both emotion labels and polarity scores for a large set of natural language concepts, is proposed.
Abstract: Emotions play a key role in natural language understanding and sensemaking. Pure machine learning usually fails to recognize and interpret emotions in text accurately. The need for knowledge bases that give access to semantics and sentics (the conceptual and affective information) associated with natural language is growing exponentially in the context of big social data analysis. To this end, this paper proposes EmoSenticSpace, a new framework for affective common-sense reasoning that extends WordNet-Affect and SenticNet by providing both emotion labels and polarity scores for a large set of natural language concepts. The framework is built by means of fuzzy c-means clustering and support-vector-machine classification, and takes into account a number of similarity measures, including point-wise mutual information and emotional affinity. EmoSenticSpace was tested on three emotion-related natural language processing tasks, namely sentiment analysis, emotion recognition, and personality detection. In all cases, the proposed framework outperforms the state-of-the-art. In particular, the direct evaluation of EmoSenticSpace against psychological features provided in the benchmark ISEAR dataset shows a 92.15% agreement.

Journal ArticleDOI
TL;DR: The feedback mechanism is proved to converge to unanimous consensus when all experts are provided with recommendations and these are fully implemented, and an IRPR fuzzy majority based quantifier-guided non-dominance degree based prioritisation method using the associated score reciprocal preference relation is proposed to obtain the final solution of consensus.
Abstract: The mathematical modelling and representation of Tanino’s multiplicative transitivity property to the case of intuitionistic reciprocal preference relations (IRPRs) is derived via Zadeh’s extension principle and the representation theorem of fuzzy sets. This result guarantees the correct generalisation of the multiplicative transitivity property of reciprocal preference relations (RPRs), and it allows the multiplicative consistency (MC) property of IRPRs to be defined. The MC property used in decision making problems is threefold: (1) to develop a consistency based procedure to estimate missing values in IRPRs using an indirect chain of alternatives; (2) to quantify the consistency index (CI) of preferences provided by experts; and (3) to build a novel consistency based induced ordered weighted averaging (MC-IOWA) operator that associates a higher contribution in the aggregated value to the more consistent information. These three uses are implemented in developing a consensus model for GDM problems with incomplete IRPRs in which the level of agreement between the experts’ individual IRPRs and the collective IRPR, which is referred here as the proximity index (PI), is combined with the CI to design a feedback mechanism to support experts to change some of their preference values using simple advice rules that aim at increasing the level of agreement while, at the same time, keeping a high degree of consistency. In the presence of missing information, the feedback mechanism implements the consistency based procedure to produce appropriate estimate values of the missing ones based on the given information provided by the experts. Under the assumption of constant CI values, the feedback mechanism is proved to converge to unanimous consensus when all experts are provided with recommendations and these are fully implemented.
Additionally, a visual representation of each expert’s consensus position within the group before and after implementing the feedback advice is also provided, which helps experts to revisit their evaluations and make changes if considered appropriate to achieve a higher consensus level. Finally, an IRPR fuzzy majority based quantifier-guided non-dominance degree based prioritisation method using the associated score reciprocal preference relation is proposed to obtain the final solution of consensus.
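In the crisp case, the multiplicative transitivity underlying the MC property is Tanino's condition p_ij·p_jk·p_ki = p_ji·p_kj·p_ik. A minimal sketch of the resulting missing-value estimate through an intermediate alternative, with hypothetical preference values (the paper lifts this to intuitionistic preferences via the extension principle):

```python
# Hedged sketch: Tanino's multiplicative transitivity for crisp reciprocal
# preference relations, used to estimate a missing value p_ij from a chain
# i -> k -> j. Input values are hypothetical.
def estimate_missing(p_ik, p_kj):
    # p_ij / p_ji = (p_ik / p_ki) * (p_kj / p_jk), with reciprocity p_ji = 1 - p_ij
    num = p_ik * p_kj
    den = num + (1 - p_ik) * (1 - p_kj)
    return num / den if den else 0.5

p = estimate_missing(0.7, 0.6)
```

Because both chained preferences favour the first alternative, the estimate lies above either input, which is the intended reinforcing behaviour of multiplicative transitivity.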

Journal ArticleDOI
TL;DR: A multi-objective approach for feature selection and its application to an unsupervised clustering procedure based on Growing Hierarchical Self-Organising Maps (GHSOMs) that includes a new method for unit labelling and efficient determination of the winning unit is considered.
Abstract: Feature selection is an important and active issue in clustering and classification problems. By choosing an adequate feature subset, the dimensionality of a dataset can be reduced, which contributes to decreasing the classification computational complexity and to improving classifier performance by avoiding redundant or irrelevant features. Although feature selection can be formally defined as an optimisation problem with only one objective, namely the classification accuracy obtained by using the selected feature subset, in recent years some multi-objective approaches to this problem have been proposed. These select features that improve not only the classification accuracy but also the generalisation capability, in the case of supervised classifiers, or counterbalance the bias toward lower or higher numbers of features that some methods used to validate the clustering/classification exhibit, in the case of unsupervised classifiers. The main contribution of this paper is a multi-objective approach for feature selection and its application to an unsupervised clustering procedure based on Growing Hierarchical Self-Organising Maps (GHSOMs) that includes a new method for unit labelling and efficient determination of the winning unit. In the network anomaly detection problem considered here, this multi-objective approach makes it possible not only to differentiate between normal and anomalous traffic but also among different anomalies. The efficiency of our proposals has been evaluated by using the well-known DARPA/NSL-KDD datasets, which contain extracted features and labelled attacks from around 2 million connections. The selected feature sets computed in our experiments provide detection rates up to 99.8% with normal traffic and up to 99.6% with anomalous traffic, as well as accuracy values up to 99.12%.
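At the core of any multi-objective feature-selection search is a Pareto-dominance filter over the competing objectives, here (higher accuracy, fewer features). A minimal sketch with hypothetical candidate subsets (the paper's GHSOM-based search and validation indices are not reproduced):

```python
# Hedged sketch of a Pareto-dominance filter for multi-objective feature
# selection. Candidate tuples (accuracy, n_features, label) are hypothetical.
def pareto_front(candidates):
    front = []
    for acc, nf, label in candidates:
        # a candidate is dominated if some other candidate is at least as
        # accurate with no more features, and strictly better in one objective
        dominated = any(a >= acc and f <= nf and (a > acc or f < nf)
                        for a, f, _ in candidates)
        if not dominated:
            front.append((acc, nf, label))
    return front

front = pareto_front([(0.95, 10, 'A'), (0.93, 4, 'B'),
                      (0.90, 4, 'C'), (0.96, 12, 'D')])
```

Subset C is dominated by B (same size, lower accuracy) and is discarded; A, B, and D represent different accuracy/compactness trade-offs and all survive.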

Journal ArticleDOI
TL;DR: A decision model which consists of seven criteria and four alternatives is built, AHP (Analytic Hierarchy Process) integrated Grey-TOPSIS (Technique for Order Preference by Similarity to Ideal Solution) method is proposed, and applied in a Turkish foreign trade company.
Abstract: A Content Management System (CMS) is an information system that allows publishing, editing, and modifying content over the internet through a central interface. With the evolution of the internet and related communication technologies, CMS has become a key information technology (IT) for organizations to communicate with their internal and external environments. Just like any other IT project, the selection of a CMS involves various tangible and intangible criteria which contain uncertainty and incomplete information. In this paper the selection of a CMS among available alternatives is regarded as a multi-criteria decision making problem. A decision model which consists of seven criteria and four alternatives is built, an AHP (Analytic Hierarchy Process) integrated Grey-TOPSIS (Technique for Order Preference by Similarity to Ideal Solution) method is proposed, and applied in a Turkish foreign trade company. In the proposed model, the weights of the criteria are determined by the AHP method and the alternatives are evaluated by Grey-TOPSIS. Due to the uncertainties, grey numbers are used for the evaluations of the alternatives. A one-at-a-time sensitivity analysis is also provided in order to monitor the robustness of the method. In addition, the effects of using different distance functions, such as the Manhattan, Euclidean and Minkowski distance functions, on the results are examined.
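The TOPSIS step can be sketched in its classic crisp form; the paper uses grey numbers and AHP-derived weights, which are simplified here to crisp hypothetical values:

```python
import math

# Hedged sketch of classic crisp TOPSIS. The decision matrix, weights, and
# benefit/cost flags below are hypothetical; the paper's grey-number
# extension is not reproduced.
def topsis(matrix, weights, benefit):
    # matrix: alternatives x criteria; benefit[j] is True when criterion j
    # is a benefit criterion (higher is better), False for a cost criterion
    m, n = len(matrix), len(matrix[0])
    norm = [math.sqrt(sum(matrix[i][j] ** 2 for i in range(m))) for j in range(n)]
    v = [[weights[j] * matrix[i][j] / norm[j] for j in range(n)] for i in range(m)]
    ideal = [(max if benefit[j] else min)(v[i][j] for i in range(m)) for j in range(n)]
    worst = [(min if benefit[j] else max)(v[i][j] for i in range(m)) for j in range(n)]
    scores = []
    for i in range(m):
        d_pos = math.sqrt(sum((v[i][j] - ideal[j]) ** 2 for j in range(n)))
        d_neg = math.sqrt(sum((v[i][j] - worst[j]) ** 2 for j in range(n)))
        scores.append(d_neg / (d_pos + d_neg))  # relative closeness to the ideal
    return scores

# three hypothetical CMS alternatives scored on three criteria (third is a cost)
scores = topsis([[7, 9, 9], [8, 7, 8], [9, 6, 8]],
                [0.5, 0.3, 0.2], [True, True, False])
```

The alternative with the highest closeness score is ranked first; swapping in Manhattan or Minkowski distances only changes the two distance computations.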

Journal ArticleDOI
TL;DR: This paper studies how under certain circumstances the wrapper FSS process can be speeded up by embedding the classifier into the wrapper algorithm, instead of dealing with it as a black-box.
Abstract: This paper deals with the problem of wrapper feature subset selection (FSS) in classification-oriented datasets with a (very) large number of attributes. In high-dimensional datasets with thousands of variables, wrapper FSS becomes a laborious computational process because of the amount of CPU time it requires. In this paper we study how under certain circumstances the wrapper FSS process can be speeded up by embedding the classifier into the wrapper algorithm, instead of dealing with it as a black-box. Our proposal is based on the combination of the NB classifier (which is known to be largely beneficial for FSS) with incremental wrapper FSS algorithms. The merit of this approach is analyzed both theoretically and experimentally, and the results show an impressive speed-up for the embedded FSS process.
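Independently of the embedding speed-up, the incremental wrapper skeleton itself can be sketched as follows. This is a toy categorical naive Bayes with leave-one-out wrapper scoring on hypothetical data; the paper's contribution is to update the NB sufficient statistics incrementally instead of retraining from scratch, which this sketch does not reproduce:

```python
import math
from collections import Counter

def nb_accuracy(X, y, feats):
    # leave-one-out accuracy of a categorical naive Bayes restricted to
    # the feature subset `feats`, with Laplace smoothing
    correct = 0
    for i in range(len(X)):
        tr = [j for j in range(len(X)) if j != i]
        best_c, best_lp = None, -math.inf
        for c in set(y):
            idx = [j for j in tr if y[j] == c]
            lp = math.log(len(idx) / len(tr))
            for f in feats:
                counts = Counter(X[j][f] for j in idx)
                card = len({X[j][f] for j in range(len(X))})
                lp += math.log((counts[X[i][f]] + 1) / (len(idx) + card))
            if lp > best_lp:
                best_c, best_lp = c, lp
        correct += best_c == y[i]
    return correct / len(X)

def iwss(X, y):
    # incremental wrapper subset selection: rank features by individual
    # wrapper score, then keep each next feature only if it improves
    order = sorted(range(len(X[0])), key=lambda f: -nb_accuracy(X, y, [f]))
    sel, best = [order[0]], nb_accuracy(X, y, [order[0]])
    for f in order[1:]:
        acc = nb_accuracy(X, y, sel + [f])
        if acc > best:
            sel, best = sel + [f], acc
    return sel, best

X = [[0, 1], [0, 0], [1, 1], [1, 0], [0, 1], [1, 0]]  # feature 0 predictive, 1 noise
y = [0, 0, 1, 1, 0, 1]
sel, best = iwss(X, y)
```

On this toy data the perfectly predictive feature 0 is selected alone, since adding the noise feature cannot improve the wrapper score.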

Journal ArticleDOI
TL;DR: It is shown that new topics appearing in Twitter can be detected right after their occurrence, and it is observed that the topics emerged earlier in Twitter than in Google Trends.
Abstract: In this work, we present a system called PoliTwi, which was designed to detect emerging political topics (Top Topics) in Twitter sooner than other standard information channels. The recognized Top Topics are shared with the wider public via different channels. For the analysis, we collected about 4,000,000 tweets before and during the 2013 parliamentary election in Germany, from April until September 2013. It is shown that new topics appearing in Twitter can be detected right after their occurrence. Moreover, we have compared our results to Google Trends and observed that the topics emerged earlier in Twitter than in Google Trends. Finally, we show how these topics can be used to extend existing knowledge bases (web ontologies or semantic networks), which are required for concept-level sentiment analysis. For this, we utilized special Twitter hashtags, called sentiment hashtags, used by the German community during the parliamentary election.
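A burst-style detector of this kind can be sketched by comparing hashtag frequencies in a recent window against a baseline window. Tweets, threshold, and minimum count below are hypothetical; PoliTwi's actual pipeline is considerably more elaborate:

```python
from collections import Counter

# Hedged sketch of burst-based Top Topic detection: a hashtag is emerging
# when its recent frequency outgrows its baseline frequency. All example
# tweets and parameters are hypothetical.
def top_topics(baseline_tweets, recent_tweets, ratio=3.0, min_count=2):
    def hashtags(tweets):
        return Counter(w.lower() for t in tweets for w in t.split()
                       if w.startswith('#'))
    base, recent = hashtags(baseline_tweets), hashtags(recent_tweets)
    # +1 smoothing so previously unseen hashtags do not divide by zero
    return [h for h, c in recent.items()
            if c >= min_count and c / (base[h] + 1) >= ratio]

topics = top_topics(
    ["#spd rally today", "nice weather"],
    ["#neuland trending", "#neuland everywhere", "#neuland again", "#spd talk"])
```

Here "#neuland" is flagged because it is frequent in the recent window and absent from the baseline, while "#spd" stays below the minimum count.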

Journal ArticleDOI
TL;DR: A novel fruit fly optimization algorithm (nFOA) is proposed to solve the semiconductor final testing scheduling problem (SFTSP) and a cooperative search process is developed to simulate the information communication behavior among fruit flies.
Abstract: In this paper, a novel fruit fly optimization algorithm (nFOA) is proposed to solve the semiconductor final testing scheduling problem (SFTSP). First, a new encoding scheme is presented to represent solutions reasonably, and a new decoding scheme is presented to map solutions to feasible schedules. Second, the nFOA uses multiple fruit fly groups during the evolution process to enhance the parallel search ability of the FOA. According to the characteristics of the SFTSP, a smell-based search operator and a vision-based search operator are designed for the groups to stress exploitation. Third, to simulate the information communication behavior among fruit flies, a cooperative search process is developed to stress exploration. The cooperative search process includes a modified improved precedence operation crossover (IPOX) and a modified multipoint preservative crossover (MPX) based on two popular structures of the flexible job shop scheduling problem. Moreover, the influence of the parameter setting is investigated by using the Taguchi design-of-experiments (DOE) method, and suitable values are determined for the key parameters. Finally, computational test results on benchmark instances and comparisons to some existing algorithms are provided, which demonstrate the effectiveness and the efficiency of the nFOA in solving the SFTSP.
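The basic continuous FOA that the paper adapts to scheduling alternates a smell-based random search around the swarm location with a vision-based move to the best position found. A minimal single-group sketch on a toy objective (all parameters are illustrative, and the paper's encoding/decoding schemes and crossovers are not reproduced):

```python
import random

# Hedged sketch of the basic fruit fly optimization algorithm minimising a
# toy continuous function. Population size, step, and iteration counts are
# illustrative only.
def foa(f, dim=2, flies=20, iters=200, step=1.0, seed=1):
    rng = random.Random(seed)
    loc = [rng.uniform(-10.0, 10.0) for _ in range(dim)]  # random swarm start
    best_x, best_val = loc[:], f(loc)
    for _ in range(iters):
        for _ in range(flies):
            # smell-based search: random candidate around the swarm location
            cand = [c + rng.uniform(-step, step) for c in loc]
            val = f(cand)
            if val < best_val:
                best_x, best_val = cand, val
        loc = best_x[:]  # vision-based step: the swarm flies to the best spot
    return best_x, best_val

x, v = foa(lambda p: sum(t * t for t in p))  # minimise the sphere function
```

The nFOA replaces the random perturbation with problem-specific smell and vision operators over schedules and runs several such groups cooperatively.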

Journal ArticleDOI
TL;DR: This study shows that the proposed framework can avoid the internal inconsistency issue that arises when using the transformation functions among different preference representation structures, and that it satisfies the Pareto principle of social choice theory.
Abstract: This study proposes a direct consensus framework for multiperson decision making (MPDM) with different preference representation structures (preference orderings, utility functions, multiplicative preference relations and fuzzy preference relations). In this framework, the individual selection methods, associated with different preference representation structures, are used to obtain individual preference vectors of alternatives. Then, the standardized individual preference vectors are aggregated into a collective preference vector. Finally, based on the collective preference vector, the feedback adjustment rules, associated with different preference representation structures, are presented to help the decision makers reach consensus. This study shows that the proposed framework satisfies two desirable properties: (i) it avoids the internal inconsistency issue that arises when using the transformation functions among different preference representation structures; (ii) it satisfies the Pareto principle of social choice theory. The results in this study are helpful in completing Chiclana et al.'s MPDM framework with different preference representation structures.
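The aggregation stage can be sketched as follows, with hypothetical preference vectors and uniform expert weights (the structure-specific selection methods and feedback rules are omitted):

```python
# Hedged sketch of the aggregation stage: individual preference vectors,
# whatever structure they came from, are standardised to sum to one and
# averaged into a collective preference vector. Inputs are hypothetical.
def collective_preference(vectors):
    std = []
    for vec in vectors:
        s = sum(vec)
        std.append([v / s for v in vec])  # standardise each expert's vector
    n = len(std[0])
    return [sum(vec[j] for vec in std) / len(std) for j in range(n)]

# three experts' preference vectors over three alternatives
coll = collective_preference([[4, 2, 2], [0.5, 0.3, 0.2], [6, 3, 1]])
best = coll.index(max(coll))
```

Standardising before aggregating is what lets vectors derived from orderings, utilities, and preference relations be combined on a common scale.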

Journal ArticleDOI
TL;DR: With respect to the multiple attribute group decision making problems in which the attribute values take the form of the 2-dimension uncertain linguistic information, the method based on some power generalized aggregation operators is proposed, and two examples are given to verify the developed approach and to demonstrate its effectiveness.
Abstract: The 2-dimension uncertain linguistic variables add a subjective evaluation of the reliability of the evaluation results given by decision makers, so they can better express fuzzy information. At the same time, the power average (PA) operator has the characteristic of capturing the correlations of the aggregated arguments. In this paper, we propose some power aggregation operators, including the 2-dimension uncertain linguistic power generalized aggregation (2DULPGA) operator and the 2-dimension uncertain linguistic power generalized weighted aggregation (2DULPGWA) operator, and discuss some of their properties and special cases. Finally, for multiple attribute group decision making problems in which the attribute values take the form of 2-dimension uncertain linguistic information, a method based on these power generalized aggregation operators is proposed, and two examples are given to verify the developed approach and to demonstrate its effectiveness.
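The crisp power average that these operators build on weights each argument by its total support from the others, so outlying values contribute less. A minimal sketch, with the support function Sup(a, b) modelled as 1 minus the normalised distance (one common choice; the paper's linguistic version differs):

```python
# Hedged sketch of Yager's crisp power average (PA) operator. The support
# function below, 1 - |a - b| / span, is one illustrative choice.
def power_average(values):
    span = (max(values) - min(values)) or 1.0
    n = len(values)
    # T(a_i): total support that a_i receives from all other arguments
    t = [sum(1 - abs(values[i] - values[j]) / span
             for j in range(n) if j != i) for i in range(n)]
    w = [1 + t[i] for i in range(n)]
    return sum(w[i] * values[i] for i in range(n)) / sum(w)

vals = [3.0, 3.2, 3.1, 9.0]  # three close arguments and one outlier
pa = power_average(vals)
```

The outlier 9.0 receives little support from the cluster around 3, so the power average falls below the plain arithmetic mean.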

Journal ArticleDOI
TL;DR: A fraud detection method based on user account visualization and threshold-type detection is proposed, together with a method for setting the detection threshold on the basis of the SOM U-matrix.
Abstract: We propose a fraud detection method based on user account visualization and threshold-type detection. The visualization technique employed in our approach is the Self-Organizing Map (SOM). Since the SOM technique in its original form visualizes only vectors, while the user accounts are represented in our work as matrices storing collections of records reflecting the users' sequential activities, we propose a method for visualizing such matrices on the SOM grid, which constitutes the main contribution of this paper. Furthermore, we propose a method for setting the detection threshold on the basis of the SOM U-matrix. The results of the conducted experimental study on real data in three different research fields confirm the advantages and effectiveness of the proposed approach.
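The vector-input SOM that the paper extends can be sketched minimally on a one-dimensional grid. The data, grid size, and learning schedule are hypothetical, and the paper's matrix-visualization and U-matrix thresholding are not reproduced:

```python
import math
import random

# Hedged sketch of a tiny 1-D Self-Organizing Map on toy 2-D data.
def train_som(data, units=4, iters=500, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    w = [[rng.random() for _ in range(dim)] for _ in range(units)]
    for t in range(iters):
        x = data[rng.randrange(len(data))]
        b = bmu(w, x)
        lr = 0.5 * (1 - t / iters)                 # decaying learning rate
        for i in range(units):
            h = math.exp(-((i - b) ** 2) / 2.0)    # Gaussian neighbourhood
            for k in range(dim):
                w[i][k] += lr * h * (x[k] - w[i][k])
    return w

def bmu(w, x):
    # best-matching unit: the grid unit whose weight vector is closest to x
    return min(range(len(w)),
               key=lambda i: sum((w[i][k] - x[k]) ** 2 for k in range(len(x))))

data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]  # two toy clusters
weights = train_som(data)
```

After training, the two clusters map to different units of the grid, which is the property the U-matrix-based thresholding then exploits.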

Journal ArticleDOI
TL;DR: A novel framework named CSTrust is designed for conducting cloud service trustworthiness evaluation by combining QoS prediction and customer satisfaction estimation, which considers how to improve the accuracy of QoS value prediction on quantitative trustworthy attributes.
Abstract: The collection and combination of assessment data in trustworthiness evaluation of cloud services is challenging, notably because QoS values may be missing in offline evaluation situations due to the time-consuming and costly cloud service invocation. Considering the fact that many trustworthiness evaluation problems require not only objective measurement but also subjective perception, this paper designs a novel framework named CSTrust for conducting cloud service trustworthiness evaluation by combining QoS prediction and customer satisfaction estimation. The proposed framework considers how to improve the accuracy of QoS value prediction on quantitative trustworthy attributes, as well as how to estimate the customer satisfaction of a target cloud service by taking advantage of the perception ratings on qualitative attributes. The proposed methods are validated through simulations, demonstrating that CSTrust can effectively predict assessment data and produce trustworthiness evaluation results.
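A neighbourhood-based prediction of a missing QoS entry, of the kind such frameworks build on, can be sketched as follows. The matrix, the cosine similarity, and the weighting are hypothetical simplifications of whatever CSTrust actually uses:

```python
import math

# Hedged sketch of user-based collaborative prediction for a missing QoS
# value. matrix[user][service] holds observed QoS in [0, 1], None if missing.
def predict(matrix, u, s):
    def sim(a, b):
        common = [j for j in range(len(matrix[a]))
                  if matrix[a][j] is not None and matrix[b][j] is not None]
        if not common:
            return 0.0
        num = sum(matrix[a][j] * matrix[b][j] for j in common)
        da = math.sqrt(sum(matrix[a][j] ** 2 for j in common))
        db = math.sqrt(sum(matrix[b][j] ** 2 for j in common))
        return num / (da * db) if da and db else 0.0
    # similarity-weighted average over users who observed service s
    neigh = [(sim(u, v), v) for v in range(len(matrix))
             if v != u and matrix[v][s] is not None]
    wsum = sum(w for w, _ in neigh)
    if wsum == 0:
        return None
    return sum(w * matrix[v][s] for w, v in neigh) / wsum

qos = [[0.9, 0.8, None], [0.9, 0.8, 0.7], [0.2, 0.1, 0.9]]  # hypothetical
pred = predict(qos, 0, 2)
```

The prediction for user 0 on service 2 is pulled toward the value observed by the most similar user, avoiding a costly live invocation of the service.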

Journal ArticleDOI
TL;DR: The HOHFS is a genuine extension of the HFS that enables us to define the membership of a given element in terms of several possible generalized types of fuzzy sets (G-Type FSs).
Abstract: In this study, we extend the hesitant fuzzy set (HFS) to its higher order type and refer to it as the higher order hesitant fuzzy set (HOHFS). The HOHFS is a genuine extension of the HFS that enables us to define the membership of a given element in terms of several possible generalized types of fuzzy sets (G-Type FSs). The rationale behind the HOHFS arises when decision makers are not satisfied with providing exact values for the membership degrees, so that the HFS is not applicable. To show that HOHFSs perform well in decision making, we first introduce some information measures for HOHFSs and then apply them to multiple attribute decision making with higher order hesitant fuzzy information.
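As a base case for such information measures, the standard hesitant normalised Hamming distance between two ordinary HFS membership-value lists can be sketched; HOHFS measures replace the plain numbers below with G-Type fuzzy sets:

```python
# Hedged sketch of the hesitant normalised Hamming distance between two
# HFS membership-value lists (ordinary HFS base case, hypothetical values).
def hfs_distance(h1, h2):
    a, b = sorted(h1), sorted(h2)
    # extend the shorter list by repeating its maximum (the optimistic rule)
    n = max(len(a), len(b))
    a += [a[-1]] * (n - len(a))
    b += [b[-1]] * (n - len(b))
    return sum(abs(x - y) for x, y in zip(a, b)) / n

d = hfs_distance([0.2, 0.4, 0.6], [0.3, 0.5])
```

The length-matching rule is needed because hesitant elements may hold different numbers of possible membership degrees; a pessimistic variant repeats the minimum instead.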

Journal ArticleDOI
TL;DR: Two methods are proposed to select the Gaussian kernel parameters in OCSVM: in the first, the parameters are selected using information from the farthest and the nearest neighbors of each sample; in the second, the parameters are determined by detecting the "tightness" of the decision boundaries.
Abstract: As one of the methods for solving one-class classification (OCC) problems, one-class support vector machines (OCSVM) have been applied to fault detection in recent years. Among all the kernels available for OCSVM, the Gaussian kernel is the most commonly used. The selection of the Gaussian kernel parameters greatly influences classifier performance, and it remains an open problem. In this paper, two methods are proposed to select the Gaussian kernel parameters in OCSVM: in the first, the parameters are selected using information from the farthest and the nearest neighbors of each sample; in the second, the parameters are determined by detecting the "tightness" of the decision boundaries. The two proposed methods are tested on UCI data sets and the Tennessee Eastman Process benchmark data sets. The results show that the two proposed methods can select suitable parameters for the Gaussian kernel, enabling the resulting OCSVM models to perform well on fault detection.
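In the spirit of the first method, a kernel width can be derived from each sample's nearest- and farthest-neighbour distances. The placement rule below (geometric mean of the two averages) is illustrative only; the exact rule in the paper differs:

```python
import math

# Hedged sketch: choose a Gaussian kernel width sigma from nearest- and
# farthest-neighbour distances of each sample. The combination rule here
# is an illustrative assumption, not the paper's exact formula.
def select_sigma(X):
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    near, far = [], []
    for i, a in enumerate(X):
        d = [dist(a, b) for j, b in enumerate(X) if j != i]
        near.append(min(d))
        far.append(max(d))
    # place sigma between the mean nearest and mean farthest distances
    return math.sqrt((sum(near) / len(near)) * (sum(far) / len(far)))

sigma = select_sigma([[0, 0], [0, 1], [1, 0], [5, 5]])  # hypothetical samples
```

Widths far below the nearest-neighbour scale overfit each training point, while widths far above the farthest-neighbour scale blur the boundary; placing sigma between the two scales avoids both extremes.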