
Showing papers presented at "SOCO-CISIS-ICEUTE in 2014"


Book ChapterDOI
01 Jan 2014
TL;DR: This paper proposes a content-based approach to filtering spam tweets, using the text of the tweet together with machine learning and compression algorithms to filter undesired messages.
Abstract: Twitter has become one of the most used social networks. And, as happens with every popular medium, it is prone to misuse. In this context, spam on Twitter has emerged in recent years, becoming an important problem for its users. Several approaches have appeared that are able to determine whether a user is a spammer or not. However, these blacklisting systems cannot filter every spam message, and a spammer may create another account and resume sending spam. In this paper, we propose a content-based approach to filter spam tweets. We use the text in the tweet together with machine learning and compression algorithms to filter these undesired tweets.

44 citations
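The abstract does not detail the compression-based classifier; a common technique in this family is classification via the normalized compression distance (NCD). Below is a minimal sketch, assuming gzip as the compressor; the reference corpora are hypothetical and for illustration only.

```python
import gzip

def clen(s: str) -> int:
    """Length of the gzip-compressed UTF-8 encoding of s."""
    return len(gzip.compress(s.encode("utf-8")))

def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings."""
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def classify(tweet: str, spam_corpus: list, ham_corpus: list) -> str:
    """Label a tweet by its mean NCD to each reference corpus."""
    d_spam = sum(ncd(tweet, t) for t in spam_corpus) / len(spam_corpus)
    d_ham = sum(ncd(tweet, t) for t in ham_corpus) / len(ham_corpus)
    return "spam" if d_spam < d_ham else "ham"

# Hypothetical reference tweets, for illustration only.
spam = ["WIN a FREE iPhone now!!! click http://spam.example",
        "Get 1000 followers instantly, visit http://spam.example"]
ham = ["Heading to the conference, talk starts at 9am",
       "Great paper on anomaly detection today"]
print(classify("FREE followers click here http://spam.example", spam, ham))
```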


Book ChapterDOI
01 Jan 2014
TL;DR: A new method based on anomaly detection that extracts the strings contained in application files in order to detect malware is proposed.
Abstract: The usage of mobile phones has increased in our lives because they offer nearly the same functionality as a personal computer. Specifically, Android is one of the most widespread mobile operating systems. Indeed, its app store is one of the most visited, and the number of applications available for this platform has also increased. However, as happens with any popular service, it is prone to misuse, and the number of malware samples has increased dramatically in recent months. Thus, we propose a new method based on anomaly detection that extracts the strings contained in application files in order to detect malware.

31 citations
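The abstract does not give the scoring details; a minimal sketch of the general idea follows, using a vocabulary of strings extracted from known-benign applications and the fraction of unseen strings as the anomaly score. The file names and threshold are assumptions, not the paper's setup.

```python
import re

PRINTABLE = re.compile(rb"[ -~]{4,}")  # runs of >= 4 printable ASCII bytes

def extract_strings(path: str) -> set:
    """Extract printable string literals from a binary application file."""
    with open(path, "rb") as f:
        data = f.read()
    return {m.group().decode("ascii") for m in PRINTABLE.finditer(data)}

def anomaly_score(sample: set, benign_vocab: set) -> float:
    """Fraction of the sample's strings never seen in benign apps."""
    if not sample:
        return 1.0
    return len(sample - benign_vocab) / len(sample)

# Hypothetical paths; a real pipeline would unpack the APK's dex/resources.
benign_vocab = set().union(*(extract_strings(p) for p in ["app1.bin", "app2.bin"]))
score = anomaly_score(extract_strings("suspect.bin"), benign_vocab)
print("flag as malware" if score > 0.6 else "looks normal")  # threshold assumed
```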


Book ChapterDOI
01 Jan 2014
TL;DR: A structured resource would allow researchers and industry professionals to write relatively simple queries to retrieve all the information regarding transcriptions of any accident, instead of the thousands of abstracts provided by querying the unstructured corpus.
Abstract: The development of automatic methods to produce usable structured information from unstructured text sources is extremely valuable to the oil and gas industry. A structured resource would allow researchers and industry professionals to write relatively simple queries to retrieve all the information regarding transcriptions of any accident. Instead of the thousands of abstracts provided by querying the unstructured corpus, queries on the structured corpus would return a few hundred well-formed results.

29 citations


Book ChapterDOI
01 Jan 2014
TL;DR: The objective of this work is to propose k-nearest neighbor (kNN) regression as a geo-imputation preprocessing step for pattern-label-based short-term wind prediction on spatio-temporal wind data sets, and to show that kNN regression is the best of the compared imputation methods.
Abstract: The shift from traditional energy systems to distributed systems of energy suppliers and consumers, and the volatility of renewable power, imply the need for effective short-term prediction models. These machine learning models are based on measured sensor information. In practice, sensors might fail for several reasons, and the prediction models naturally cannot work properly with incomplete patterns. If the imputation method that completes the missing data is not appropriately chosen, a bias may be introduced. The objective of this work is to propose k-nearest neighbor (kNN) regression as a geo-imputation preprocessing step for pattern-label-based short-term wind prediction on spatio-temporal wind data sets. The approach is compared to three other methods. The evaluation is based on four turbines and their neighbors from the NREL Western Wind Data Set, with values missing uniformly at random. The results show that kNN regression is the best of the compared imputation methods.

24 citations
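As a rough illustration of kNN regression used for geo-imputation, the sketch below fills a turbine's missing readings from the simultaneous readings of neighbouring turbines. The data is synthetic and the feature layout and k are assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
# Synthetic wind speeds: rows = time steps, columns = 4 neighbouring turbines;
# the target turbine is spatially correlated with its neighbours.
neighbours = rng.gamma(2.0, 4.0, size=(500, 4))
target = neighbours.mean(axis=1) + rng.normal(0, 0.5, 500)

# Uniformly distributed missing values in the target turbine's series.
missing = rng.random(500) < 0.2
X_train, y_train = neighbours[~missing], target[~missing]

knn = KNeighborsRegressor(n_neighbors=5, weights="distance")
knn.fit(X_train, y_train)
imputed = knn.predict(neighbours[missing])
print("imputation RMSE:",
      np.sqrt(np.mean((imputed - target[missing]) ** 2)))
```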


Book ChapterDOI
01 Jan 2014
TL;DR: In the presented method, it is proposed to use statistical relationships between predicted and original network traffic to determine whether the examined trace is normal or under attack; the efficiency of the method is verified on an extended set of real benchmark traces.
Abstract: In this paper, we present network anomaly detection based on the ARFIMA model. We propose an estimation method that uses the Hyndman-Khandakar algorithm to estimate the polynomial parameters and the Haslett and Raftery algorithm to estimate the differencing parameters. The optimal values of the model parameters are chosen on the basis of information criteria representing a compromise between model consistency and the size of its estimation error. In the presented method, we propose to use statistical relationships between the predicted and original network traffic to determine whether the examined trace is normal or under attack. The efficiency of our method is verified on an extended set of real benchmark traces. The reported experimental results confirm the efficiency of the presented method.

24 citations
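The prediction-residual idea can be sketched compactly. Since common Python libraries have no built-in ARFIMA, the sketch below uses a plain ARIMA model as a stand-in; the model order, synthetic traffic, and 3-sigma threshold are all assumptions.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
# Synthetic traffic volume with an injected anomaly (e.g., a flood) at t >= 180.
traffic = 100 + 10 * np.sin(np.arange(200) / 10) + rng.normal(0, 2, 200)
traffic[180:] += 40

train, test = traffic[:150], traffic[150:]
model = ARIMA(train, order=(2, 1, 2)).fit()
pred = model.forecast(steps=len(test))

# Flag points whose deviation from the prediction exceeds 3 training sigmas.
sigma = np.std(model.resid)
alerts = np.abs(test - pred) > 3 * sigma
print("anomalous steps:", np.flatnonzero(alerts) + 150)
```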


Book ChapterDOI
01 Jan 2014
TL;DR: The aim of the article is to discuss selected process modelling methods in the supply chain, using the example of one of the coordination mechanisms, i.e. contracting.
Abstract: Structures of a supply chain nature are "multi-actor" systems. They grapple with a lack of synchronized tasks, a lack of internal rationality and, often, cohesion, as well as uncertainty. Modern supply chains are often a series of enterprises and actions that are weakly connected with each other. Enterprises are also more involved in internal integration than in external cooperation within the framework of the supply chain. For this reason, designing logistics processes in these types of structures seems to be an unusually difficult task. The aim of the article is to discuss selected process modelling methods in the supply chain, using the example of one of the coordination mechanisms, i.e. contracting.

20 citations


Book ChapterDOI
01 Jan 2014
TL;DR: This paper presents a Denial of Service tool that belongs to the Slow DoS Attacks category, describes in detail the attack functioning and compares the proposed threat with a similar one known as slowloris, showing the enhancements provided by the proposed tool.
Abstract: In recent years, with the advent of the Internet, cyberwarfare operations have moved from the battlefield to cyberspace, locally or remotely executing sabotage or espionage operations in order to weaken the enemy. Among the technologies and methods used during cyberwarfare actions, Denial of Service attacks are executed to reduce the availability of a particular service on a network. In this paper we present a Denial of Service tool that belongs to the Slow DoS Attacks category. We describe the attack's functioning in detail and compare the proposed threat with a similar one known as slowloris, showing the enhancements provided by the proposed tool.

19 citations


Proceedings Article
01 Jan 2014
TL;DR: A hybrid algorithm based on the Heterogeneous Earliest Finish Time heuristic and a genetic algorithm, combining the best characteristics of both approaches, is proposed in this paper; its efficiency is shown experimentally for variable workloads in a dynamically changing heterogeneous computational environment.
Abstract: Optimal workflow scheduling is one of the most important issues in heterogeneous distributed computational environments. Existing heuristic and evolutionary scheduling algorithms have their advantages and disadvantages. In this work we propose a hybrid algorithm based on the Heterogeneous Earliest Finish Time heuristic and a genetic algorithm that combines the best characteristics of both approaches. We also experimentally show its efficiency for variable workloads in a dynamically changing heterogeneous computational environment.

19 citations
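The HEFT heuristic orders tasks by their "upward rank": the task's mean execution cost plus the most expensive path to an exit task. A compact sketch of the ranking phase on a toy DAG follows; the task graph and cost tables are illustrative, and a GA could then refine the processor assignment as in the hybrid described above.

```python
from functools import lru_cache

# Toy workflow DAG: task -> successors, with mean computation and
# mean communication costs (HEFT averages these over processors).
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
comp = {"A": 10, "B": 8, "C": 12, "D": 6}            # mean execution cost
comm = {("A", "B"): 4, ("A", "C"): 2,
        ("B", "D"): 3, ("C", "D"): 5}                # mean transfer cost

@lru_cache(maxsize=None)
def upward_rank(task: str) -> float:
    """rank_u(t) = comp(t) + max over successors s of (comm(t,s) + rank_u(s))."""
    if not succ[task]:
        return comp[task]
    return comp[task] + max(comm[(task, s)] + upward_rank(s)
                            for s in succ[task])

# HEFT schedules tasks in decreasing upward rank.
order = sorted(succ, key=upward_rank, reverse=True)
print(order)  # ['A', 'C', 'B', 'D'] for these costs
```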


Book ChapterDOI
01 Jan 2014
TL;DR: The main objective of this work is to develop a short-term predictive model, based on neural networks, of the electricity demand for the CIESOL research center; the experiments show fast prediction with acceptable final results on real data.
Abstract: Energy efficiency in buildings is a topic that is being widely studied. In order to achieve energy efficiency it is necessary to perform both a proper management of the electric demand and an optimal exploitation of renewable sources, using appropriate control strategies. The main objective of this paper is to develop a short-term predictive model, based on neural networks, of the electricity demand for the CIESOL research center. The performed experiments, using different techniques for weather forecasting, show fast prediction with acceptable final results on real data, obtaining a maximum root mean squared error of 5% on validation data with a short-term prediction horizon of 60 minutes.

16 citations
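A minimal sketch of a neural short-term demand predictor in the same spirit, using lagged demand plus a weather covariate as inputs. The data is synthetic and the architecture and features are assumptions, not CIESOL's actual model.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
hours = np.arange(2000)
temp = 20 + 8 * np.sin(2 * np.pi * hours / 24)               # weather proxy
demand = (50 + 0.8 * temp + 5 * np.sin(2 * np.pi * hours / 24 + 1)
          + rng.normal(0, 1, hours.size))

# Features: demand at t-1..t-3 plus forecast temperature at t; target: demand at t.
X = np.column_stack([demand[2:-1], demand[1:-2], demand[:-3], temp[3:]])
y = demand[3:]
split = 1600
model = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)
model.fit(X[:split], y[:split])

rmse = np.sqrt(np.mean((model.predict(X[split:]) - y[split:]) ** 2))
print(f"validation RMSE: {rmse:.2f}")
```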


Book ChapterDOI
01 Jan 2014
TL;DR: This research extends investigations of the concept of a chaos-driven evolutionary algorithm, Differential Evolution, by embedding a set of six discrete dissipative chaotic systems as chaotic pseudo-random number generators for DE.
Abstract: This research extends investigations of the concept of a chaos-driven evolutionary algorithm, Differential Evolution (DE). This paper is aimed at embedding a set of six discrete dissipative chaotic systems in the form of chaotic pseudo-random number generators for DE. Repeated simulations were performed on a set of two shifted benchmark test functions in higher dimensions. Finally, the obtained results are compared with canonical DE.

15 citations
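The six maps used in the paper are not reproduced here; as a minimal sketch of the mechanism, the logistic map below stands in as the chaotic pseudo-random number generator feeding DE's crossover decisions. The map choice and DE parameters are assumptions, and index selection still uses uniform draws in this sketch.

```python
import numpy as np

def logistic_stream(x=0.7, r=4.0):
    """Chaotic pseudo-random numbers in (0,1) from the logistic map."""
    while True:
        x = r * x * (1 - x)
        yield x

def chaos_de(f, dim=10, pop=20, gens=200, F=0.8, CR=0.9):
    rand = logistic_stream()
    X = np.array([[next(rand) * 10 - 5 for _ in range(dim)] for _ in range(pop)])
    fit = np.apply_along_axis(f, 1, X)
    for _ in range(gens):
        for i in range(pop):
            a, b, c = np.random.choice([j for j in range(pop) if j != i], 3, False)
            mutant = X[a] + F * (X[b] - X[c])
            # The chaotic stream replaces uniform draws in the binomial crossover.
            mask = np.array([next(rand) < CR for _ in range(dim)])
            trial = np.where(mask, mutant, X[i])
            if (ft := f(trial)) < fit[i]:
                X[i], fit[i] = trial, ft
    return X[fit.argmin()], fit.min()

best, val = chaos_de(lambda x: np.sum(x ** 2))  # sphere test function
print(val)
```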


Book ChapterDOI
01 Jan 2014
TL;DR: This paper focuses on detecting SQLIA (SQL Injection Attacks) and XSS (Cross Site Scripting), modelling normal traffic with the use of regular expressions and achieving very good results on the large benchmark CISC’10 database.
Abstract: In this paper we present our further research results concerning detection of cyber attacks targeted at the application layer. In particular we focus on detecting SQLIA (SQL Injection Attacks) and XSS (Cross Site Scripting). In our approach, we model normal traffic (HTTP requests) with the use of regular expressions. We report very good results achieved on the large benchmark CISC’10 database and compare them to other solutions.
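A minimal sketch of modelling normal HTTP request parameters with regular expressions, in the whitelist spirit described above. The patterns and parameter names are illustrative, not the paper's learned model.

```python
import re

# Patterns describing legitimate values per request parameter (assumed).
NORMAL = {
    "id":    re.compile(r"^\d{1,6}$"),
    "name":  re.compile(r"^[A-Za-z .'-]{1,40}$"),
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
}

def is_anomalous(params: dict) -> bool:
    """Flag requests with unknown parameters or out-of-model values."""
    for key, value in params.items():
        pattern = NORMAL.get(key)
        if pattern is None or not pattern.fullmatch(value):
            return True
    return False

print(is_anomalous({"id": "42", "name": "Alice"}))           # False
print(is_anomalous({"id": "1 OR 1=1--"}))                    # True: SQLi-like
print(is_anomalous({"name": "<script>alert(1)</script>"}))   # True: XSS-like
```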

Book ChapterDOI
01 Jan 2014
TL;DR: The main contribution of this paper is to prove that advanced soft-computing techniques are a feasible solution to be implemented on reasonably priced μC -based embedded platforms.
Abstract: This paper presents an approach to merge three elements that are usually not thought to be combined in one application: evolutionary computing running on reasonably priced microcontrollers (μC) for fast real-time control systems. A Multi Objective Genetic Algorithm (MOGA) is implemented on a 180 MHz μC. A fourth element, a Neural Network (NN) for supporting the evaluation function by predicting the response of the controlled system, is also implemented. Computational performance and the influence of a variety of factors are discussed. The results open a whole new spectrum of applications with great potential to benefit from multivariable and multiobjective intelligent control methods in which the hybridization of different soft-computing techniques could be present. The main contribution of this paper is to prove that advanced soft-computing techniques are a feasible solution to be implemented on reasonably priced μC-based embedded platforms.

Book ChapterDOI
01 Jan 2014
TL;DR: The utilization of chaotic pseudo-random number generators based on six selected discrete chaotic maps is proposed to enhance the performance of the newly proposed multiple-choice-strategy-based PSO algorithm.
Abstract: In this paper, we propose the utilization of chaotic pseudo-random number generators based on six selected discrete chaotic maps to enhance the performance of the newly proposed multiple choice strategy based PSO algorithm. This research represents a continuation of previous successful experiments with the fusion of the PSO algorithm and chaotic systems. The performance of the proposed algorithm is tested on a set of four test functions. The obtained promising results are presented, discussed and compared against the basic PSO strategy with inertia weight.

Book ChapterDOI
01 Jan 2014
TL;DR: A comparison between the classic MLR-based methodology and common regression techniques in machine learning (neural networks, regression trees, support vector machines, nearest neighbour, and ensembles such as random forests) shows that support vector regression statistically outperforms the rest of the techniques when feature selection is applied.
Abstract: Light Detection and Ranging (LiDAR) is a remote sensor able to extract vertical information from sensed objects. LiDAR-derived information is nowadays used to develop environmental models for describing fire behaviour or quantifying biomass stocks in forest areas. A multiple linear regression (MLR) with previous stepwise feature selection is the most common method in the literature to develop LiDAR-derived models. MLR defines the relation between the set of field measurements and the statistics extracted from a LiDAR flight. Machine learning has recently received increasing attention as a way to improve classic MLR results. Unfortunately, few studies have compared the quality of the multiple machine learning approaches. This paper presents a comparison between the classic MLR-based methodology and common regression techniques in machine learning (neural networks, regression trees, support vector machines, nearest neighbour, and ensembles such as random forests). The selected techniques are applied to real LiDAR data from two areas in the province of Lugo (Galicia, Spain). The results show that support vector regression statistically outperforms the rest of the techniques when feature selection is applied. However, its performance cannot be said to be statistically different from that of random forests when previous feature selection is skipped.
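A compact sketch of the kind of comparison reported, pitting support vector regression against a random forest with cross-validated RMSE. The data here is a synthetic stand-in; the real study uses field measurements and LiDAR statistics from Lugo.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 12))          # stand-in for LiDAR-derived statistics
y = X[:, 0] * 3 + X[:, 1] ** 2 + rng.normal(0, 0.5, 200)  # stand-in biomass

models = {
    "SVR": make_pipeline(StandardScaler(), SVR(C=10)),
    "RandomForest": RandomForestRegressor(n_estimators=200, random_state=0),
}
for name, model in models.items():
    rmse = -cross_val_score(model, X, y, cv=5,
                            scoring="neg_root_mean_squared_error")
    print(f"{name}: RMSE {rmse.mean():.3f} +/- {rmse.std():.3f}")
```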

Book ChapterDOI
01 Jan 2014
TL;DR: An optimization based on genetic algorithms for both feature selection and model tuning is presented to improve the prediction of set points in industrial lines, demonstrating that the reranking makes the process of obtaining parsimonious models more efficient and straightforward without reducing performance.
Abstract: An optimization based on genetic algorithms for both feature selection and model tuning is presented to improve the prediction of set points in industrial lines. The objective is the development of an automatic procedure that efficiently generates parsimonious prediction models with higher generalisation capacity. These models can achieve higher accuracy in predictions, maintaining the high quality of products while working with continual changes in the production cycle. The proposed method deals with three strict restrictions: few individuals per population, a low number of holds and runs in the model validation procedure, and a reduced number of maximum generations. To fulfill these restrictions, we propose to include in the optimization a reranking of the individuals by their complexity when no significant difference is found between the values of their fitness functions. The method is applied to develop support vector machines for predicting three temperature set points in the annealing furnace of a continuous hot-dip galvanising line. The results demonstrate that the reranking makes the process of obtaining parsimonious models more efficient and straightforward, without reducing performance.
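The key mechanism, reranking individuals by complexity when their fitness values are not significantly different, can be sketched as follows. A simple tolerance-based tie test stands in for the paper's statistical comparison.

```python
def parsimony_sort(population, fitness, complexity, tol=1e-3):
    """Sort by fitness, but break near-ties in favour of simpler models.

    population : list of candidate models (e.g., SVM configurations)
    fitness    : fitness value per individual (lower is better)
    complexity : complexity measure per individual (e.g., number of features)
    tol        : fitness difference below which two individuals 'tie'
    """
    ranked = sorted(zip(fitness, complexity, population))
    out, i = [], 0
    while i < len(ranked):
        j = i
        # Group individuals whose fitness is within tol of the group leader.
        while j < len(ranked) and ranked[j][0] - ranked[i][0] <= tol:
            j += 1
        out.extend(sorted(ranked[i:j], key=lambda t: t[1]))  # simpler first
        i = j
    return [p for _, _, p in out]

pop = ["modelA", "modelB", "modelC"]
print(parsimony_sort(pop, fitness=[0.100, 0.1005, 0.30], complexity=[9, 4, 2]))
# ['modelB', 'modelA', 'modelC']: B wins the near-tie by being simpler
```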

Book ChapterDOI
01 Jan 2014
TL;DR: A Host-based Packet Header Anomaly Detection (HbPHAD) model that is proficient in pinpointing suspicious packet header behaviour based on statistical analysis is proposed; it achieves an intrusion detection rate above 90% on ISCX 2012 and is capable of detecting 40 attack types from the DARPA 1999 benchmark dataset.
Abstract: The exposure of network packets to recurrent cyber intrusions has lately raised the need for modelling various statistical-based anomaly detection methods. Theoretically, statistical-based anomaly detection attracts researchers' attention, but in practice, low intrusion detection rates remain an open challenge. Thus, a Host-based Packet Header Anomaly Detection (HbPHAD) model that is proficient in pinpointing suspicious packet header behaviour based on statistical analysis is proposed in this paper. We perform scoring using the Relative Percentage Ratio (RPR) to compute normal scores, integrate Linear Regression Analysis (LRA) to distinguish the degree of packet behaviour (i.e. suspicious or not suspicious), and use Cohen's d (effect size) to pre-define the optimal threshold. HbPHAD is an effective statistical-based anomaly detection method for pinpointing suspicious behaviour precisely. The experiments validate that HbPHAD is effective, correctly detecting suspicious packets with an intrusion detection rate above 90% on the ISCX 2012 dataset, and it is capable of detecting 40 attack types from the DARPA 1999 benchmark dataset.
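The effect-size step can be sketched simply. Below, Cohen's d between normal and suspicious score samples gauges class separability before a threshold is fixed; the scores are synthetic, and HbPHAD's actual RPR and LRA scoring is not reproduced.

```python
import numpy as np

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Effect size between two samples using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled

rng = np.random.default_rng(4)
normal_scores = rng.normal(0.2, 0.05, 300)      # header scores, benign packets
suspicious_scores = rng.normal(0.6, 0.10, 60)   # header scores, attack packets

d = cohens_d(suspicious_scores, normal_scores)
print(f"Cohen's d = {d:.2f}")                   # large d => separable classes
# A threshold midway between the class means is reasonable when d is large.
threshold = (normal_scores.mean() + suspicious_scores.mean()) / 2
print("alert on score >", round(threshold, 3))
```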

Book ChapterDOI
01 Jan 2014
TL;DR: This paper illustrates different approaches to detecting one particular covert channel technique: DNS tunneling.
Abstract: The use of covert-channel methods to bypass security policies has been increasing in recent years. Malicious users neutralize security restrictions by encapsulating protocols like peer-to-peer, chat or HTTP proxy into allowed protocols like DNS or HTTP. This paper illustrates different approaches to detecting one particular covert channel technique: DNS tunneling.
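One widely used detection signal for DNS tunneling is the unusually long, high-entropy labels that encoded payloads produce. A minimal sketch follows; the length and entropy thresholds are chosen for illustration.

```python
import math
from collections import Counter

def entropy(s: str) -> float:
    """Shannon entropy of a string in bits per character."""
    counts = Counter(s)
    return -sum(c / len(s) * math.log2(c / len(s)) for c in counts.values())

def looks_like_tunnel(qname: str, max_label=30, min_entropy=3.5) -> bool:
    """Flag queries with very long or very random-looking labels."""
    labels = qname.rstrip(".").split(".")
    payload = max(labels, key=len)
    return len(payload) > max_label or entropy(payload) > min_entropy

print(looks_like_tunnel("www.example.com"))                            # False
print(looks_like_tunnel("dGhpcyBpcyBleGZpbHRyYXRlZA.t.evil.example"))  # True
```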

Book ChapterDOI
01 Jan 2014
TL;DR: The Linux kernel has become widely adopted in mobile devices and cloud services; in parallel, its abuse and misuse by attackers and malicious users have grown, which has increased the attention paid to kernel security through the deployment of kernel protection mechanisms.
Abstract: The Linux kernel has become widely adopted in mobile devices and cloud services; in parallel, its abuse and misuse by attackers and malicious users have grown. This has increased the attention paid to kernel security through the deployment of kernel protection mechanisms. Kernel-based attacks require reliability, which is achieved through the information gathering stage, where the attacker gathers enough information about the target to succeed. The taxonomy of kernel vulnerabilities includes information leaks, a class of vulnerabilities that permit access to the kernel memory layout and contents. Information leaks can improve attack reliability by allowing the attacker to read sensitive kernel data and bypass kernel-based protections.

Book ChapterDOI
01 Jan 2014
TL;DR: A new mathematical model is introduced to study the spread of a bluetooth mobile malware, a compartmental model where the mobile devices are classified into four types: susceptibles, carriers, exposed and infectious; its dynamic is governed by means of a couple of two-dimensional cellular automata.
Abstract: There is an unstoppable rise in the number of smartphones worldwide and, as several applications require an Internet access, these mobile devices are exposed to the malicious effects of malware. Of particular interest is malware propagated over bluetooth connections, since it infects devices in its proximity much as a biological virus does. The main goal of this work is to introduce a new mathematical model to study the spread of bluetooth mobile malware. Specifically, it is a compartmental model where the mobile devices are classified into four types: susceptible, carrier, exposed and infectious; its dynamic is governed by means of a couple of two-dimensional cellular automata. Some computer simulations are shown and analyzed in order to determine how a proximity mobile malware might spread under different conditions.
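A minimal sketch of the compartmental idea on a two-dimensional cellular automaton, with susceptible, carrier, exposed, and infectious states and simple probabilistic local rules. The transition probabilities are illustrative, not the paper's calibrated dynamics.

```python
import numpy as np

S, C, E, I = 0, 1, 2, 3          # susceptible, carrier, exposed, infectious
rng = np.random.default_rng(5)
grid = np.full((50, 50), S)
grid[25, 25] = I                 # one initially infectious device

def step(g, p_expose=0.3, p_infect=0.4, p_carrier=0.1):
    new = g.copy()
    infectious = (g == I)
    # Count infectious devices among the 4 bluetooth-range neighbours.
    n = (np.roll(infectious, 1, 0) + np.roll(infectious, -1, 0) +
         np.roll(infectious, 1, 1) + np.roll(infectious, -1, 1))
    contact = (g == S) & (n > 0)
    r = rng.random(g.shape)
    new[contact & (r < p_carrier)] = C            # infected but not spreading
    new[contact & (r >= p_carrier) & (r < p_carrier + p_expose)] = E
    new[(g == E) & (rng.random(g.shape) < p_infect)] = I
    return new

for _ in range(40):
    grid = step(grid)
print("infectious devices after 40 steps:", int((grid == I).sum()))
```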

Book ChapterDOI
01 Jan 2014
TL;DR: This paper uses two different binary particle swarm optimization techniques to cryptanalyze the knapsack PKC, arguing that modern Computational Intelligence (CI) techniques can provide efficient cryptanalytic results.
Abstract: The security of most Public Key Cryptosystems (PKC) proposed in the literature relies on the difficulty of the integer factorization problem or the discrete logarithm problem. However, using Shor's [19] algorithm the problems can be solved in an acceptable amount of time via 'quantum computers'. Therefore in this context the knapsack (more accurately, subset sum problem (SSP)) based PKC is reconsidered as a viable option by the cryptography community. However, before considering the practicability of this cryptosystem, there is a growing need to cryptanalyze it using all presently available techniques, in order to guarantee its security. We believe that modern Computational Intelligence (CI) techniques can provide efficient cryptanalytic results (because of the new aspects that have been incorporated into CI techniques). In this paper, we use two different binary particle swarm optimization techniques to cryptanalyze the knapsack PKC. The results obtained via extensive testing are promising and proficient. We present, discuss and compare the effectiveness of the proposed work in the results section.
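A compact sketch of binary PSO applied to a toy subset sum instance, using the standard sigmoid transfer function to binarize velocities. This is a toy key-recovery stand-in under assumed swarm parameters, not the paper's exact attack.

```python
import numpy as np

rng = np.random.default_rng(6)
weights = np.array([23, 42, 7, 19, 88, 31, 5, 64])   # toy public knapsack
target = 23 + 7 + 88 + 5                             # ciphertext = subset sum

def fitness(bits):
    return abs(int(bits @ weights) - target)         # 0 means plaintext found

n, dim = 30, len(weights)
X = rng.integers(0, 2, (n, dim))
V = rng.normal(0, 1, (n, dim))
pbest, pfit = X.copy(), np.array([fitness(x) for x in X])
g = pbest[pfit.argmin()].copy()

for _ in range(200):
    r1, r2 = rng.random((n, dim)), rng.random((n, dim))
    V = 0.7 * V + 1.5 * r1 * (pbest - X) + 1.5 * r2 * (g - X)
    X = (rng.random((n, dim)) < 1 / (1 + np.exp(-V))).astype(int)  # sigmoid
    fit = np.array([fitness(x) for x in X])
    better = fit < pfit
    pbest[better], pfit[better] = X[better], fit[better]
    g = pbest[pfit.argmin()].copy()

print("recovered bits:", g, "error:", fitness(g))
```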

Book ChapterDOI
01 Jan 2014
TL;DR: This paper discusses the techniques used today to search for patterns and vulnerabilities within software, identifying possible solutions to this issue, and examines their effectiveness in finding bugs.
Abstract: Error detection in software is a problem that causes the loss of large amounts of money in updates and patches. Many programmers spend their time correcting code instead of programming new features for their applications. This makes early detection of software errors essential. In both the fields of static analysis and model checking, great advances are being made to find errors in software before products are released. Although model checking techniques are more dedicated to finding malware, they can be adapted to find errors in software. In this article we discuss the techniques used today to search for patterns and vulnerabilities within software, in order to identify possible solutions to this issue. We examine the problem from the point of view of the algorithms involved and their effectiveness in finding bugs. Although there are similar surveys, none of them addresses the comparison of the best static analysis algorithms against the best mathematical logic languages for model checking, two fields that are becoming very important in the search for errors in software.

Book ChapterDOI
01 Jan 2014
TL;DR: Power consumption, cord area, tensile strength and tensile stress were modelled with quadratic regression models using Response Surface Methodology (RSM) and were compared with regression models based on DM (linear regression (LR), isotonic regression (IR), Gaussian processes (GP), artificial neural networks (ANN), support vector machines (SVM) and regression trees (RT).
Abstract: Gas Metal Arc Welding (GMAW) is an industrial process commonly used in manufacturing welded products. This manufacturing process is normally done by an industrial robot, which is controlled through the parameters of speed, current and voltage. These control parameters strongly influence the residual stress and the strength of the welded joint, as well as the total cost of manufacturing the welded components. Residual stress and tensile strength are commonly obtained via standardized hole-drilling and tensile tests which are very expensive to routinely carry out during the mass production of welded products. Over the past few decades, researchers have concentrated on improving the quality of manufacturing welded products using experimental analysis or trial-and-error results, but the cost of this methodology has proved unacceptable. Likewise, regression models based on Data Mining (DM) techniques have been used to improve various manufacturing processes, but usually require a relatively large amount of data in order to obtain acceptable results. By contrast, multiple response surface (MRS) methodology is a method for modelling and optimizing, which aims to identify the combination of input parameters that give the best output responses with a reduced number of data sets. In this paper, power consumption, cord area, tensile strength and tensile stress were modelled with quadratic regression (QR) models using Response Surface Methodology (RSM) and were compared with regression models based on DM (linear regression (LR), isotonic regression (IR), Gaussian processes (GP), artificial neural networks (ANN), support vector machines (SVM) and regression trees (RT)). The optimization of the parameters was conducted using RSM with quadratic regression and desirability functions, and was achieved when the residual stresses and power consumption were as low as possible, while strength and process speed were as high as possible.
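As a brief illustration of the quadratic regression step in RSM, the sketch below fits a second-order response surface for a single response against two welding parameters. The data and parameter ranges are synthetic assumptions; the study models four responses against speed, current, and voltage.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(7)
speed = rng.uniform(5, 15, 30)            # welding speed (mm/s), assumed range
current = rng.uniform(100, 250, 30)       # welding current (A), assumed range
# Synthetic 'power consumption' response with curvature and an interaction term.
power = 2*speed + 0.01*current**2 - 0.05*speed*current + rng.normal(0, 5, 30)

X = np.column_stack([speed, current])
rsm = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                    LinearRegression())
rsm.fit(X, power)
print("R^2:", round(rsm.score(X, power), 3))
print("predicted power at speed=10, current=180:",
      round(float(rsm.predict([[10, 180]])[0]), 1))
```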

Book ChapterDOI
01 Jan 2014
TL;DR: A simple procedure has been developed for use in the presentation of techniques and their results in the field of routing problems, and all the good practices to follow are introduced step by step.
Abstract: Researchers who investigate any field related to computational algorithms (defining new algorithms or improving existing ones) face large difficulties when evaluating their work. Comparisons among different scientific works in this area are often difficult, due to the ambiguity or lack of detail in the presentation of the work or its results. In many cases, a replication of the work done by others is required, which means a waste of time and a delay in research advances. After suffering this problem on many occasions, a simple procedure has been developed for use in the presentation of techniques and their results in the field of routing problems. In this paper the procedure is described in detail, and all the good practices to follow are introduced step by step. Although these good practices can be applied to any type of combinatorial optimization problem, this study focuses on routing problems. This field has been chosen due to its importance in the real world and its great relevance in the literature.

Book ChapterDOI
01 Jan 2014
TL;DR: This article presents a proposed approach for analysing students’ behaviour in the system based on their profiles and on the similarity between students’ profiles; it uses principles from process mining, and the visualization of relations between students and groups of students is done using graph theory.
Abstract: E-learning is a method of education which usually uses Learning Management Systems and the internet environment to ensure the maintenance of courses and to support the educational process. Moodle, one such widely used system, provides several statistical tools to analyse students’ behaviour in the system. However, none of these tools provides visualisation of the relations between students or their clustering into groups based on similar behaviour. This article presents a proposed approach for analysing students’ behaviour in the system based on their profiles and on the similarity between students’ profiles. The approach uses principles from process mining, and the visualization of relations between students and groups of students is done using graph theory.
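A minimal sketch of the similarity-graph step: cosine similarity between students' activity profiles, with edges added above a threshold and connected components read off as behaviour groups. The profiles and threshold are hypothetical illustrations.

```python
import numpy as np
import networkx as nx

# Hypothetical activity profiles: counts of Moodle actions per category
# (e.g., resource views, forum posts, quiz attempts, assignment uploads).
profiles = {
    "s1": np.array([40, 2, 10, 5]),
    "s2": np.array([38, 1, 12, 6]),
    "s3": np.array([5, 30, 2, 1]),
    "s4": np.array([4, 28, 1, 2]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

G = nx.Graph()
G.add_nodes_from(profiles)
students = list(profiles)
for i, u in enumerate(students):
    for v in students[i + 1:]:
        sim = cosine(profiles[u], profiles[v])
        if sim > 0.95:                      # similarity threshold (assumed)
            G.add_edge(u, v, weight=sim)

print(list(nx.connected_components(G)))    # behaviour-based student groups
```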

Book ChapterDOI
01 Jan 2014
TL;DR: A theoretical model is presented that allows the design of longer sequences with higher linear span than in previous DLFSR schemes, and determines the constant relationship between period and linear span for these structures.
Abstract: Many proposals of pseudorandom sequence generators and stream ciphers employ linear feedback shift registers with dynamic feedback (DLFSR) as the main module to increase the period and linear span of the involved m-sequences. In this paper, we present a theoretical model that allows the design of longer sequences with higher linear span than in previous DLFSR schemes. The model determines the constant relationship between period and linear span for these structures. The more complex sequences obtained here improve on the proposals based on LFSRs with dynamic feedback found in the literature.
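A minimal sketch of the DLFSR idea: an LFSR whose feedback taps are switched dynamically by a small control rule, lengthening the output relative to a fixed feedback. The two tap sets and the switching rule are illustrative, not the paper's construction.

```python
def lfsr_step(state, taps, nbits=8):
    """One Fibonacci-LFSR step: feedback bit = XOR of the tapped bits."""
    fb = 0
    for t in taps:
        fb ^= (state >> t) & 1
    return ((state << 1) | fb) & ((1 << nbits) - 1)

def dlfsr(seed=0x5A, n=16):
    """LFSR with dynamic feedback: switch tap sets every 8 output bits."""
    polys = [(7, 5, 4, 3), (7, 3, 2, 1)]     # two tap sets (assumed primitive)
    state, out = seed, []
    for i in range(n):
        taps = polys[(i // 8) % len(polys)]  # control unit selects the taps
        state = lfsr_step(state, taps)
        out.append(state & 1)
    return out

print(dlfsr())   # keystream bits from the dynamically switched register
```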

Book ChapterDOI
01 Jan 2014
TL;DR: This work proposes to apply multi-objective evolutionary algorithms in order to obtain a set of non-dominated solutions, from which the final users would choose the one to be ultimately carried out.
Abstract: This work deals with a multi-objective formulation of the Container Loading Problem, which is commonly encountered in the transportation and wholesaling industries. The goal of the problem is to load the items (boxes) that provide the highest total volume and weight in the container, without exceeding the container limits. These two objectives are conflicting because the volume of a box is usually not proportional to its weight. Most of the proposals in the literature simplify the problem by converting it into a mono-objective problem. However, in this work we propose to apply multi-objective evolutionary algorithms in order to obtain a set of non-dominated solutions, from which the final users would choose the one to be ultimately carried out. To apply evolutionary approaches we have defined a representation scheme for the candidate solutions, a set of evolutionary operators and a method to generate and evaluate the candidate solutions. The obtained results improve previous results in the literature and demonstrate the importance of the evaluation heuristic to be applied.
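The core multi-objective notion, keeping the non-dominated loadings in (volume, weight), can be sketched compactly. The candidate loadings below are toy values; real candidates would come from the evolutionary operators described above.

```python
def dominates(a, b):
    """True if loading a is at least as good as b in both objectives
    (total volume, total weight) and strictly better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated(loadings):
    """Filter to the Pareto front of candidate container loadings."""
    return [a for a in loadings
            if not any(dominates(b, a) for b in loadings if b is not a)]

# Toy candidate loadings as (volume used in m^3, weight loaded in t).
candidates = [(60, 18), (55, 22), (58, 20), (50, 19), (61, 17)]
print(non_dominated(candidates))
# [(60, 18), (55, 22), (58, 20), (61, 17)] -- (50, 19) is dominated
```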

Book ChapterDOI
Tien Pham1, Wanli Ma1, Dat Tran1, Phuoc Nguyen1, Dinh Q. Phung1 
01 Jan 2014
TL;DR: This paper proposes an EEG-based authentication method, which is simple to implement and easy to use, that takes advantage of EEG artifacts generated by a number of purposely designed voluntary facial muscle movements, which can be single or combined depending on the level of security required.
Abstract: Recently, electroencephalography (EEG) has been considered as a potential new type of user authentication, with many security advantages: it is difficult to fake, impossible to observe or intercept, unique, and requires a live person for recording. The difficulty is that EEG signals are very weak and subject to contamination from many artifact signals. For applications in human health, true EEG signals, free of contamination, are highly desirable; but for the purposes of authentication, where stable and repeatable patterns from the source signals are critical, the origins of the signals are of less concern. In this paper, we propose an EEG-based authentication method, which is simple to implement and easy to use, that takes advantage of EEG artifacts generated by a number of purposely designed voluntary facial muscle movements. These tasks can be single or combined, depending on the level of security required. Our experiment showed that using EEG artifacts for user authentication in multilevel security systems is promising.

Book ChapterDOI
01 Jan 2014
TL;DR: A mathematical model to find the best electrical interconnection configuration of the wind farm turbines and the substation is proposed and the results are compared with the ground solution.
Abstract: Nowadays, wind energy has an important role in the challenges of clean energy supply. It is the fastest growing energy source, with an increasing annual rate of 20%. This scenario motivates the development of an optimization design tool to find optimal layouts for wind farms. This paper proposes a mathematical model to find the best electrical interconnection configuration of the wind farm turbines and the substation. The goal is to minimize the installation costs, which include cable costs and cable installation costs, while respecting technical constraints. This problem corresponds to a capacitated minimum spanning tree with additional constraints. The proposed methodology is applied to a real case study and the results are compared with the ground solution.

Book ChapterDOI
01 Jan 2014
TL;DR: The experiment shows very good results in the detection layer; for the classification layer, 88% of false positives were successfully labeled as normal traffic connections, and 79% of DoS and Probe attacks were labeled correctly.
Abstract: A multi-agent artificial immune system for network intrusion detection and classification is proposed and tested in this paper. The multi-layer detection and classification process is proposed to be executed on each agent, for each host in the network. The experiment shows very good results in the detection layer, where 90% of anomalies are detected. For the classification layer, 88% of false positives were successfully labeled as normal traffic connections, and 79% of DoS and Probe attacks were labeled correctly. An analysis is given for future work to enhance results for poorly represented attacks.

Book ChapterDOI
01 Jan 2014
TL;DR: Experimental analysis, carried out on a large malware dataset, prove that the method is capable of outperforming other state-of-the-art algorithms, and hence is an effective approach for the problem of imbalanced malware detection.
Abstract: Malware detection is among the most extensively developed areas for computer security. Unauthorized, malicious software can cause expensive damage to both private users and companies. It can destroy the computer, breach the privacy of user and result in loss of valuable data. The amount of data uploaded and downloaded each day makes almost impossible for manual screening of each incoming software package. That is why there is a need for effective intelligent filters, that can automatically dichotomize between the safe and dangerous applications. The number of malware programs, that are faced by the detection system, is typically much smaller than the number of desired programs. Therefore, we have to deal with the imbalanced classification problem, in which standard classification algorithms tend to fail. In this paper, we present a novel ensemble, based on cost-sensitive decision trees. Individual classifiers are constructed according to an established cost matrix and trained on random feature subspaces to ensure, that they are mutually complementary. Instead of using a fixed cost matrix we derive its parameters via ROC analysis. An evolutionary algorithm is being applied for simultaneous classifier selection and assignment of committee member weights for the fusion process. Experimental analysis, carried out on a large malware dataset, prove that our method is capable of outperforming other state-of-the-art algorithms, and hence is an effective approach for the problem of imbalanced malware detection.