Author
Hai Thanh Nguyen
Other affiliations: Telenor, Nha Trang University, Gjøvik University College
Bio: Hai Thanh Nguyen is an academic researcher from the Norwegian University of Science and Technology. The author has contributed to research in topics: Computer science & Feature selection. The author has an h-index of 13 and has co-authored 35 publications receiving 607 citations. Previous affiliations of Hai Thanh Nguyen include Telenor and Nha Trang University.
Papers
25 Mar 2010
TL;DR: Experiments show that the proposed automatic feature selection procedure outperforms the best-first and genetic-algorithm search strategies by removing many more redundant features while preserving, or even improving, classification accuracy.
Abstract: The quality of the feature selection algorithm is one of the most important factors that affect the effectiveness of an intrusion detection system (IDS). Reducing the number of relevant traffic features without negatively affecting classification accuracy is a goal that greatly improves the overall effectiveness of the IDS. Obtaining a good feature set automatically, without involving expert knowledge, is a complex task. In this paper, we propose an automatic feature selection procedure based on the filter method used in machine learning. In particular, we focus on Correlation Feature Selection (CFS). By transforming the CFS optimization problem into a polynomial mixed 0-1 fractional programming problem, and by introducing additional variables into the transformed problem, we obtain a new mixed 0-1 linear programming problem whose numbers of constraints and variables are linear in the number of features in the full set. The mixed 0-1 linear programming problem can then be solved by means of a branch-and-bound algorithm. Our feature selection algorithm was compared experimentally with the best-first-CFS and the genetic-algorithm-CFS methods regarding feature selection capabilities. The classification accuracy obtained after feature selection, using the C4.5 and BayesNet classifiers over the KDD CUP'99 IDS benchmarking data set, was also tested. Experiments show that our proposed method outperforms the best-first and genetic-algorithm search strategies by removing many more redundant features while preserving, or even improving, classification accuracy.
81 citations
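As a side note on the method above: the CFS merit that the optimization maximizes has a simple closed form over a candidate feature subset. The sketch below, assuming Pearson correlation as the correlation measure (classic CFS uses symmetrical uncertainty for discrete attributes) and synthetic data, illustrates the quantity being optimized; it is not the paper's implementation.

```python
# A sketch of the CFS merit the optimization targets, assuming Pearson
# correlation as the correlation measure (classic CFS uses symmetrical
# uncertainty for discrete attributes); data and names are illustrative.
import numpy as np

def cfs_merit(X, y, subset):
    """Merit_S = k * mean(r_cf) / sqrt(k + k*(k-1) * mean(r_ff)) for the
    features indexed by `subset`."""
    k = len(subset)
    if k == 0:
        return 0.0
    # average absolute feature-class correlation
    r_cf = np.mean([abs(np.corrcoef(X[:, i], y)[0, 1]) for i in subset])
    # average absolute feature-feature correlation over distinct pairs
    if k == 1:
        r_ff = 0.0
    else:
        pairs = [(a, b) for idx, a in enumerate(subset) for b in subset[idx + 1:]]
        r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1]) for a, b in pairs])
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)

# Score two candidate subsets on synthetic data: the class depends on
# features 0 and 1, so adding the irrelevant feature 4 lowers the merit.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)
print(cfs_merit(X, y, [0, 1]), cfs_merit(X, y, [0, 1, 4]))
```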
TL;DR: Experiments show that the proposed automatic feature selection procedure outperforms the best-first and genetic-algorithm search strategies by removing many more redundant features while preserving, or even improving, classification accuracy.
Abstract: In this paper, the authors propose a new feature selection procedure for intrusion detection, based on the filter method used in machine learning. They focus on Correlation Feature Selection (CFS) and transform the problem of feature selection by means of the CFS measure into a mixed 0-1 linear programming problem whose numbers of constraints and variables are linear in the number of features in the full set. The mixed 0-1 linear programming problem can then be solved using a branch-and-bound algorithm. This feature selection algorithm was compared experimentally with the best-first-CFS and the genetic-algorithm-CFS methods regarding feature selection capabilities. Classification accuracies obtained after feature selection, using the C4.5 and BayesNet classifiers over the KDD CUP'99 dataset, were also tested. Experiments show that the authors' method outperforms the best-first-CFS and the genetic-algorithm-CFS methods by removing many more redundant features while keeping or improving the classification accuracies.
75 citations
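The key device in the transformation described above is that products of binary selection variables can be replaced by auxiliary variables bound by linear constraints, after which an off-the-shelf branch-and-bound solver applies. The toy model below shows only that linearization step on a made-up quadratic objective; it omits the fractional-programming transformation needed for the actual CFS ratio, and it assumes the open-source PuLP modeller (with its bundled CBC solver) is available.

```python
# Illustration of linearizing products of binary variables so a quadratic
# 0-1 objective becomes a mixed 0-1 linear program solvable by branch-and-
# bound. The toy objective below is NOT the paper's CFS objective; it only
# demonstrates the product-linearization idea. Assumes the PuLP package.
import pulp

n = 4
r_c = [0.9, 0.7, 0.2, 0.1]                     # toy feature-class correlations
r_ff = [[0.0, 0.8, 0.1, 0.0],                  # toy feature-feature correlations
        [0.8, 0.0, 0.2, 0.1],
        [0.1, 0.2, 0.0, 0.3],
        [0.0, 0.1, 0.3, 0.0]]

prob = pulp.LpProblem("toy_cfs_like_selection", pulp.LpMaximize)
x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(n)]
z = {(i, j): pulp.LpVariable(f"z{i}_{j}", lowBound=0, upBound=1)
     for i in range(n) for j in range(i + 1, n)}

# Reward feature-class correlation, penalize pairwise redundancy.
prob += pulp.lpSum(r_c[i] * x[i] for i in range(n)) \
        - pulp.lpSum(r_ff[i][j] * z[i, j] for (i, j) in z)

# z_ij behaves like the product x_i * x_j under these standard constraints.
for (i, j), zij in z.items():
    prob += zij <= x[i]
    prob += zij <= x[j]
    prob += zij >= x[i] + x[j] - 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([int(xi.value()) for xi in x])           # selected feature indicators
```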
TL;DR: A Bayesian personalized ranking method for heterogeneous implicit feedback (BPRH) is proposed, whereby items are first classified into different types according to the actions they received and their correlations are quantified.
Abstract: Personalized recommendation for online service systems aims to predict potential demand by analysing user preference. User preference can be inferred from heterogeneous implicit feedback (i.e. various user actions), especially when explicit feedback (i.e. ratings) is not available. However, most methods either focus solely on homogeneous implicit feedback (i.e. the target action), e.g., purchases on shopping websites and forwards on Twitter, or handle heterogeneous implicit feedback without investigating its special characteristics. In this paper, we adopt two typical actions in online service systems, i.e., view and like, as auxiliary feedback to enhance recommendation performance, whereby we propose a Bayesian personalized ranking method for heterogeneous implicit feedback (BPRH). Specifically, items are first classified into different types according to the actions they received. Then, by analysing the co-occurrence of different types of actions, which is one of the fundamental characteristics of heterogeneous implicit feedback systems, we quantify their correlations, based on which the differences in users' preference among different types of items are investigated. An adaptive sampling strategy is also proposed to tackle the unbalanced correlation among different actions. Extensive experimentation on three real-world datasets demonstrates that our approach significantly outperforms state-of-the-art algorithms.
66 citations
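For orientation, the pairwise ranking criterion that BPRH builds on is standard BPR. The sketch below implements only plain matrix-factorization BPR with uniform negative sampling; the item typing, action-correlation quantification and adaptive sampling that distinguish BPRH are omitted, and all sizes and names are illustrative.

```python
# A sketch of the plain BPR update that BPRH extends. Assumptions: matrix
# factorization, one positive and one uniformly sampled negative item per
# step; BPRH's item typing, action-correlation weights and adaptive sampling
# are omitted, and all sizes are illustrative.
import numpy as np

def bpr_step(U, V, u, i, j, lr=0.05, reg=0.01):
    """One SGD step pushing score(u, i) above score(u, j)."""
    u_f = U[u].copy()
    x_uij = u_f @ (V[i] - V[j])              # current score difference
    g = 1.0 / (1.0 + np.exp(x_uij))          # gradient of -log sigmoid(x_uij)
    U[u] += lr * (g * (V[i] - V[j]) - reg * u_f)
    V[i] += lr * (g * u_f - reg * V[i])
    V[j] += lr * (-g * u_f - reg * V[j])

rng = np.random.default_rng(0)
n_users, n_items, k = 50, 100, 8
U = rng.normal(scale=0.1, size=(n_users, k))
V = rng.normal(scale=0.1, size=(n_items, k))
# Observed (user, item) target actions, e.g. purchases; views/likes would be
# the auxiliary feedback that BPRH additionally exploits.
observed = {(u, int(rng.integers(n_items))) for u in range(n_users)}
pairs = sorted(observed)
for _ in range(10000):
    u, i = pairs[rng.integers(len(pairs))]
    j = int(rng.integers(n_items))
    if (u, j) not in observed:
        bpr_step(U, V, u, i, j)
```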
08 Jun 2011
TL;DR: This paper conducts experiments on the publicly available ECML/PKDD-2007 dataset and generates a new CSIC-2010 dataset to determine appropriate instances of the GeFS measure for feature selection and uses different classifiers to test the detection accuracies.
Abstract: Feature selection for filtering HTTP traffic in Web application firewalls (WAFs) is an important task. We focus on the Generic-Feature-Selection (GeFS) measure [4], which was successfully tested on low-level packet filters, i.e., the KDD CUP'99 dataset. However, the performance of the GeFS measure in analyzing high-level HTTP traffic is still unknown. In this paper we study the GeFS measure for WAFs. We conduct experiments on the publicly available ECML/PKDD-2007 dataset. Since this dataset does not target any real Web application, we additionally generate our new CSIC-2010 dataset. We analyze the statistical properties of both datasets to provide more insight into their nature and quality. Subsequently, we determine appropriate instances of the GeFS measure for feature selection. We use different classifiers to test the detection accuracies. The experiments show that we can remove 63% of the irrelevant and redundant features from the original dataset while reducing the detection accuracy of the WAFs by only 0.12%.
61 citations
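The dataset analysis mentioned above (choosing an appropriate GeFS instance from the statistical properties of the traffic features) can be caricatured as follows. The threshold and decision rule here are assumptions for illustration only, not the paper's procedure: a CFS-style instance is usually preferred when feature relations are largely linear, an mRMR-style instance otherwise.

```python
# An illustration of inspecting a feature matrix to pick a GeFS instance:
# CFS when feature relations look largely linear, mRMR otherwise. The
# threshold and decision rule are assumptions made for illustration, not the
# paper's procedure.
import numpy as np

def suggest_gefs_instance(X, strong=0.5):
    """Return 'CFS' if many feature pairs are (close to) linearly related,
    otherwise 'mRMR'."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    iu = np.triu_indices(X.shape[1], k=1)           # distinct feature pairs
    frac_linear = np.mean(corr[iu] > strong)
    return "CFS" if frac_linear > 0.5 else "mRMR"

rng = np.random.default_rng(0)
# Columns sharing a strong common factor -> mostly linear relations.
linear_like = rng.normal(size=(500, 1)) @ np.ones((1, 6)) + 0.1 * rng.normal(size=(500, 6))
print(suggest_gefs_instance(linear_like))           # prints 'CFS'
```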
TL;DR: The proposed solution based on spectral minutiae is evaluated against other comparison strategies on three different datasets of wrist and palm dorsal vein samples and shows a competitive biometric performance while producing features that are compatible with state-of-the-art template protection systems.
Abstract: Similar to biometric fingerprint recognition, characteristic minutiae points - here end and branch points - can be extracted from skeletonised vein images to distinguish individuals. An approach to extract those vein minutiae and to transform them into a fixed-length, translation- and scale-invariant representation, in which rotations can be easily compensated, is presented in this study. The proposed solution based on spectral minutiae is evaluated against other comparison strategies on three different datasets of wrist and palm dorsal vein samples. The authors' analysis shows a competitive biometric performance while producing features that are compatible with state-of-the-art template protection systems. In addition, a modified and more distinctive, but not transform- or rotation-invariant, representation is proposed and evaluated.
52 citations
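The end and branch points used as vein minutiae can be located on a one-pixel-wide skeleton by counting 8-connected neighbours: one neighbour marks a line end, three or more mark a branch. The sketch below shows only this detection step, not the fixed-length spectral-minutiae encoding evaluated in the paper; production code typically uses the more robust crossing-number test.

```python
# A sketch of end/branch point detection on a binary vein skeleton by counting
# 8-connected neighbours; the fixed-length spectral-minutiae encoding from the
# paper is a further step not shown here.
import numpy as np

def vein_minutiae(skeleton):
    """skeleton: 2-D boolean array, True on the one-pixel-wide vein skeleton."""
    sk = skeleton.astype(np.uint8)
    padded = np.pad(sk, 1)
    # Neighbour count for every pixel (sum of the 8 shifted copies).
    neigh = sum(np.roll(np.roll(padded, dy, 0), dx, 1)
                for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                if (dy, dx) != (0, 0))[1:-1, 1:-1]
    ends = np.argwhere(skeleton & (neigh == 1))       # exactly one neighbour
    branches = np.argwhere(skeleton & (neigh >= 3))   # three or more neighbours
    return ends, branches

# Tiny example: a T-shaped skeleton with three line ends and a junction.
# Plain neighbour counting may flag more than one pixel around the junction;
# real systems use the crossing-number test instead.
img = np.zeros((7, 7), dtype=bool)
img[3, 1:6] = True    # horizontal stroke
img[4:6, 3] = True    # vertical stroke meeting it
ends, branches = vein_minutiae(img)
print(ends)
print(branches)
```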
Cited by
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).
13,246 citations
TL;DR: This paper provides a structured and comprehensive overview of the various facets of network anomaly detection so that a researcher can quickly become familiar with every aspect of the field.
Abstract: Network anomaly detection is an important and dynamic research area. Many network intrusion detection methods and systems (NIDS) have been proposed in the literature. In this paper, we provide a structured and comprehensive overview of various facets of network anomaly detection so that a researcher can become quickly familiar with every aspect of network anomaly detection. We present attacks normally encountered by network intrusion detection systems. We categorize existing network anomaly detection methods and systems based on the underlying computational techniques used. Within this framework, we briefly describe and compare a large number of network anomaly detection methods and systems. In addition, we also discuss tools that can be used by network defenders and datasets that researchers in network anomaly detection can use. We also highlight research directions in network anomaly detection.
971 citations
TL;DR: This survey classifies the security threats and challenges for IoT networks by evaluating existing defense techniques, and provides a comprehensive review of NIDSs that deploy different aspects of learning techniques for IoT, unlike other top surveys, which target traditional systems.
Abstract: Pervasive growth of the Internet of Things (IoT) is visible across the globe. The 2016 Dyn cyberattack exposed the critical fault lines among smart networks. Security of IoT has become a critical concern. The danger posed by infested Internet-connected Things not only affects the security of IoT but also threatens the complete Internet ecosystem, since attackers can exploit the vulnerable Things (smart devices) deployed as botnets. Mirai malware compromised video surveillance devices and paralyzed the Internet via distributed denial-of-service attacks. In the recent past, security attack vectors have evolved both ways, in terms of complexity and diversity. Hence, to identify and prevent or detect novel attacks, it is important to analyze techniques in the IoT context. This survey classifies the IoT security threats and challenges for IoT networks by evaluating existing defense techniques. Our main focus is on network intrusion detection systems (NIDSs); hence, this paper reviews existing NIDS implementation tools and datasets as well as free and open-source network sniffing software. It then surveys, analyzes, and compares state-of-the-art NIDS proposals in the IoT context in terms of architecture, detection methodologies, validation strategies, treated threats, and algorithm deployments. The review deals with both traditional and machine learning (ML) NIDS techniques and discusses future directions. In this survey, our focus is on IoT NIDSs deployed via ML, since learning algorithms have a good success rate in security and privacy. The survey provides a comprehensive review of NIDSs deploying different aspects of learning techniques for IoT, unlike other top surveys targeting traditional systems. We believe that this paper will be useful for academia and industry research: first, to identify IoT threats and challenges; second, to implement their own NIDS; and finally, to propose new smart techniques in the IoT context considering IoT limitations. Moreover, the survey will enable security practitioners to differentiate IoT NIDSs from traditional ones.
494 citations
TL;DR: Empirical results show that the selected reduced attributes give better performance for designing an IDS that is efficient and effective for network intrusion detection.
Abstract: Intrusion detection is the process of monitoring and analyzing the events occurring in a computer system in order to detect signs of security problems. Today most intrusion detection approaches focus on the issues of feature selection or reduction, since some features are irrelevant and redundant, which results in a lengthy detection process and degrades the performance of an intrusion detection system (IDS). The purpose of this study is to identify important reduced input features for building an IDS that is computationally efficient and effective. For this, we investigate the performance of three standard feature selection methods: Correlation-based Feature Selection, Information Gain and Gain Ratio. In this paper we propose the Feature Vitality Based Reduction Method to identify important reduced input features. We then apply the efficient naive Bayes classifier to the reduced datasets for intrusion detection. Empirical results show that the selected reduced attributes give better performance in designing an IDS that is efficient and effective for network intrusion detection.
397 citations
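The Information Gain and Gain Ratio scores compared in this study have simple entropy-based definitions for discrete attributes; the sketch below, with illustrative toy data and the assumption that continuous attributes are discretized first, shows how features would be ranked by each score.

```python
# A sketch of Information Gain and Gain Ratio for discrete-valued features;
# continuous attributes are assumed to be discretized first, and the toy data
# are illustrative only.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(feature, labels):
    values, counts = np.unique(feature, return_counts=True)
    weights = counts / counts.sum()
    conditional = sum(w * entropy(labels[feature == v]) for v, w in zip(values, weights))
    return entropy(labels) - conditional

def gain_ratio(feature, labels):
    split_info = entropy(feature)      # intrinsic information of the split
    return 0.0 if split_info == 0 else info_gain(feature, labels) / split_info

# A feature that matches the class perfectly vs. a weakly related one.
y  = np.array([0, 0, 1, 1, 0, 1])
f1 = np.array([0, 0, 1, 1, 0, 1])
f2 = np.array([0, 1, 0, 1, 1, 0])
print(info_gain(f1, y), gain_ratio(f1, y))
print(info_gain(f2, y), gain_ratio(f2, y))
```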