TL;DR: This review surveys approaches to binary classification and treats sockpuppet detection as a binary classification task.
Abstract: In the field of information extraction and retrieval, binary classification is the process of classifying a given document or account into predefined classes. Sockpuppet detection is a binary classification task in which each account is labeled either sockpuppet or non-sockpuppet. Sockpuppets have become a significant issue, because a person can adopt a fake identity for a specific purpose or for malicious use. Text categorization is also performed with binary classification. This research synthesizes work on binary classification and discusses various approaches to it.
TL;DR: The results and comparative study showed that the current work improved on the previous accuracy score in predicting heart disease, and that integrating the machine learning model presented in this study with medical information systems would be useful for predicting HF or other diseases from live data collected from patients.
Abstract: In the current era, Heart Failure (HF) is one of the common diseases that can lead to dangerous situations. Every year almost 26 million patients are affected by this disease. From the point of view of heart consultants and surgeons, it is difficult to predict heart failure at the right time. Fortunately, classification and prediction models are available that can aid the medical field and illustrate how to use medical data efficiently. This paper aims to improve HF prediction accuracy using the UCI heart disease dataset. To this end, multiple machine learning approaches are used to understand the data and predict the chance of HF in a medical database. Furthermore, the results and comparative study showed that the current work improved on the previous accuracy score in predicting heart disease. Integrating the machine learning model presented in this study with medical information systems would be useful for predicting HF or other diseases from live data collected from patients.
118 citations
Cites methods from "Machine Learning: A Review on Binar..."
...Mainly, the experiment design built using binary classification, which is the process of categorizing the dataset according to predefined classes, which has been widely used in applying machine learning algorithms [29]....
TL;DR: A novel time series-based approach for the early identification of increases in hypertension to discriminate between cardiovascular high-risk and low-risk hypertensive patients through the analyses of electrocardiographic holter signals achieves excellent results compared with the state-of-the-art.
TL;DR: This paper presents a study about using binary classifiers with NNs together with a perceptual linear prediction (PLP) method for feature extraction to increase the classification rate of voice commands captured using a throat microphone, comparing this method with a single NN.
Abstract: Multi-class pattern classification has many applications, including speech recognition, and it is not easy to extend from two-class neural networks (NNs). This paper presents a study about using binary classifiers with NNs together with a perceptual linear prediction (PLP) method for feature extraction to increase the classification rate of voice commands captured using a throat microphone, comparing this method with a single NN. Because the researched literature contains no other data set of voice commands captured with a throat microphone in Brazilian Portuguese, we created a data set of isolated voice commands with utterances captured from 150 people (men and women). All the voice samples are in Brazilian Portuguese, and they are the digits “0” through “9” and the words “Ok” and “Cancel”. The results show that the throat microphone is robust in noisy environments, achieving a 95.4% hit rate in our speech recognition system with multiple NNs using the one-against-all approach, better than a single NN, which reaches 91.88%. This result is very representative, since both classifiers obtained high hit rates. However, training the multiple NNs requires 535% more time than training the single NN. The best PLP extraction order is 9 or 10 for voice samples captured by the throat microphone; it was also observed that the poorly stressed vowels and fricative-like words “3” and “7” in Portuguese confuse the classifier.
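The one-against-all approach named in this abstract can be illustrated with a minimal sketch. It is not the authors' system: a simple perceptron stands in for each neural network, and the 2-D points and the three labels are hypothetical toy data standing in for PLP feature vectors. One binary classifier is trained per class (that class versus all others), and a new sample is assigned to the class whose classifier gives the highest score.

```python
def score(weights, x):
    """Linear score of point x; the last weight is the bias."""
    return sum(w * xi for w, xi in zip(weights, x)) + weights[-1]


def train_perceptron(points, targets, max_epochs=1000):
    """Train one binary perceptron; targets are +1 (this class) or -1 (rest)."""
    weights = [0.0] * (len(points[0]) + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in zip(points, targets):
            if t * score(weights, x) <= 0:  # misclassified or on the boundary
                for i, xi in enumerate(x):
                    weights[i] += t * xi
                weights[-1] += t
                mistakes += 1
        if mistakes == 0:  # every training point is on the correct side
            break
    return weights


def train_one_against_all(points, labels):
    """Train one binary classifier per class: that class versus all others."""
    return {
        c: train_perceptron(points, [1 if l == c else -1 for l in labels])
        for c in sorted(set(labels))
    }


def predict(models, x):
    """Pick the class whose binary classifier scores x highest."""
    return max(models, key=lambda c: score(models[c], x))


# Hypothetical toy data: three well-separated groups of 2-D feature vectors.
POINTS = [(0, 0), (1, 0), (0, 1), (5, 0), (6, 0), (5, 1), (0, 5), (0, 6), (1, 5)]
LABELS = ["ok"] * 3 + ["cancel"] * 3 + ["zero"] * 3

if __name__ == "__main__":
    models = train_one_against_all(POINTS, LABELS)
    print([predict(models, p) for p in POINTS])
```

The sketch also suggests where the extra training cost reported in the abstract comes from: one model must be trained per class instead of a single model for all classes.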
12 citations
Cites methods from "Machine Learning: A Review on Binar..."
...Those techniques are typically used to classify data into groups, making the system understand how to distinguish such groups, allowing the classification of new data within this set of groups [48], [49]....
TL;DR: This paper was able to achieve good accuracy and less variation with Discriminant Analysis as compared to many commonly used classification algorithms with training accuracy reaching 97.37% and testing accuracy of 95.92% using Quadratic Discriminant Analysis.
Abstract: Many prognostication methodologies have been formulated for early detection of Polycystic Ovary Syndrome, also known as PCOS, using Machine Learning. PCOS is a binary classification problem. Dimensionality Reduction methods greatly affect the performance of Machine Learning, and using a Supervised Dimensionality Reduction method can give us a new edge in tackling this problem. In this paper we present Discriminant Analysis in different dimensions, in Linear and Quadratic form, for binary classification along with metrics. We were able to achieve good accuracy and less variation with Discriminant Analysis as compared to many commonly used classification algorithms, with training accuracy reaching 97.37% and testing accuracy of 95.92% using Quadratic Discriminant Analysis. The paper also gives an analysis of the data with visualizations for a deeper understanding of the problem.
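To make the quadratic form of Discriminant Analysis concrete, here is a minimal one-feature sketch. It is an illustration under stated assumptions, not the paper's pipeline: the single feature, the sample values, and the labels "negative" and "positive" are hypothetical. Each class is modeled as a Gaussian with its own mean and variance, and a point is assigned to the class with the largest discriminant score, -0.5·log(variance) - (x - mean)² / (2·variance) + log(prior).

```python
import math
from statistics import mean, variance


def fit_qda(samples):
    """samples maps label -> list of 1-D values; returns label -> (mean, variance, prior)."""
    total = sum(len(values) for values in samples.values())
    return {
        label: (mean(values), variance(values), len(values) / total)
        for label, values in samples.items()
    }


def discriminant(x, mu, var, prior):
    """Quadratic discriminant score of x for one class (per-class variance)."""
    return -0.5 * math.log(var) - (x - mu) ** 2 / (2 * var) + math.log(prior)


def predict(model, x):
    """Return the label with the highest discriminant score."""
    return max(model, key=lambda label: discriminant(x, *model[label]))


# Hypothetical toy data: a tight class near 0 and a wide class near 5.
TOY = {
    "negative": [-0.2, 0.0, 0.1, 0.3, -0.1],
    "positive": [2.0, 4.0, 5.0, 6.0, 8.0],
}

if __name__ == "__main__":
    model = fit_qda(TOY)
    for x in (0.0, 5.0, -3.0):
        print(x, predict(model, x))
```

Because each class keeps its own variance, the decision region is not a single threshold: in this toy data a point far to the left of the tight class (x = -3.0) is assigned to the wide class, which a linear discriminant with a shared variance would not do.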
TL;DR: An historical review of experimental and computational approaches employed for the characterisation of essential genes in eukaryotes is undertaken, with a particular focus on model ecdysozoans (C. elegans and D. melanogaster), and the possible applicability of ML-approaches to organisms such as socioeconomically important parasites is discussed.
TL;DR: This paper examines the ability of existing host-based anti-virus products to provide semantically meaningful information about the malicious software and tools used by attackers and proposes a new classification technique that describes malware behavior in terms of system state changes rather than in sequences or patterns of system calls.
Abstract: Numerous attacks, such as worms, phishing, and botnets, threaten the availability of the Internet, the integrity of its hosts, and the privacy of its users. A core element of defense against these attacks is anti-virus (AV) software--a service that detects, removes, and characterizes these threats. The ability of these products to successfully characterize these threats has far-reaching effects--from facilitating sharing across organizations, to detecting the emergence of new threats, and assessing risk in quarantine and cleanup. In this paper, we examine the ability of existing host-based anti-virus products to provide semantically meaningful information about the malicious software and tools (or malware) used by attackers. Using a large, recent collection of malware that spans a variety of attack vectors (e.g., spyware, worms, spam), we show that different AV products characterize malware in ways that are inconsistent across AV products, incomplete across malware, and that fail to be concise in their semantics. To address these limitations, we propose a new classification technique that describes malware behavior in terms of system state changes (e.g., files written, processes created) rather than in sequences or patterns of system calls. To address the sheer volume of malware and diversity of its behavior, we provide a method for automatically categorizing these profiles of malware into groups that reflect similar classes of behaviors and demonstrate how behavior-based clustering provides a more direct and effective way of classifying and analyzing Internet malware.
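The idea of grouping malware by system state changes can be sketched as follows. This is a hedged illustration only: the abstract does not specify the distance measure or clustering algorithm, so Jaccard similarity with single-linkage grouping is swapped in here, and the sample names, state-change strings, and the 0.4 threshold are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of state changes."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def cluster_profiles(profiles, threshold=0.4):
    """Single-linkage grouping: a sample joins every cluster holding a similar member."""
    clusters = []
    for name, changes in profiles.items():
        merged = [name]
        remaining = []
        for cluster in clusters:
            if any(jaccard(changes, profiles[m]) >= threshold for m in cluster):
                merged.extend(cluster)  # similar to a member: merge clusters
            else:
                remaining.append(cluster)
        clusters = remaining + [sorted(merged)]
    return sorted(clusters)


# Hypothetical behavior profiles: each sample is a set of observed state changes.
PROFILES = {
    "sample_a": {"file_written:x.dll", "process_created:svchost", "registry_set:run_x"},
    "sample_b": {"file_written:x.dll", "process_created:svchost", "registry_set:run_y"},
    "sample_c": {"process_created:mailer", "network:smtp"},
    "sample_d": {"process_created:mailer", "network:smtp", "file_written:log.txt"},
}

if __name__ == "__main__":
    for group in cluster_profiles(PROFILES):
        print(group)
```

Representing each sample as a set of state changes means that two samples with different system-call sequences but the same effects on the host still land in the same group.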
602 citations
"Machine Learning: A Review on Binar..." refers background in this paper
...al.[15] has explained that anti-virus is incomplete in that it fails to detect or provide labels of the malware samples....
TL;DR: The empirical analysis indicates that the utilization of keyword-based representation of text documents in conjunction with ensemble learning can enhance the predictive performance and scalability of text classification schemes, which is of practical importance in the application fields of text classification.
Abstract: Text classification is a domain with high dimensional feature space. Extracting the keywords as the features can be extremely useful in text classification. An empirical analysis of five statistical keyword extraction methods. A comprehensive analysis of classifier and keyword extraction ensembles. For ACM collection, a classification accuracy of 93.80% with Bagging ensemble of Random Forest. Automatic keyword extraction is an important research direction in text mining, natural language processing and information retrieval. Keyword extraction enables us to represent text documents in a condensed way. The compact representation of documents can be helpful in several applications, such as automatic indexing, automatic summarization, automatic classification, clustering and filtering. For instance, text classification is a domain with high dimensional feature space challenge. Hence, extracting the most important/relevant words about the content of the document and using these keywords as the features can be extremely useful. In this regard, this study examines the predictive performance of five statistical keyword extraction methods (most frequent measure based keyword extraction, term frequency-inverse sentence frequency based keyword extraction, co-occurrence statistical information based keyword extraction, eccentricity-based keyword extraction and TextRank algorithm) on classification algorithms and ensemble methods for scientific text document classification (categorization). In the study, a comprehensive study of comparing base learning algorithms (Naive Bayes, support vector machines, logistic regression and Random Forest) with five widely utilized ensemble methods (AdaBoost, Bagging, Dagging, Random Subspace and Majority Voting) is conducted. To the best of our knowledge, this is the first empirical analysis, which evaluates the effectiveness of statistical keyword extraction methods in conjunction with ensemble learning algorithms.
The classification schemes are compared in terms of classification accuracy, F-measure and area under curve values. To validate the empirical analysis, two-way ANOVA test is employed. The experimental analysis indicates that Bagging ensemble of Random Forest with the most-frequent based keyword extraction method yields promising results for text classification. For ACM document collection, the highest average predictive performance (93.80%) is obtained with the utilization of the most frequent based keyword extraction method with Bagging ensemble of Random Forest algorithm. In general, Bagging and Random Subspace ensembles of Random Forest yield promising results. The empirical analysis indicates that the utilization of keyword-based representation of text documents in conjunction with ensemble learning can enhance the predictive performance and scalability of text classification schemes, which is of practical importance in the application fields of text classification.
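Bagging, one of the ensemble methods compared in this abstract, can be sketched in a few lines. This is a simplified illustration rather than the study's setup: a one-feature decision stump is swapped in for the Random Forest base learner, and the data, the number of estimators, and the seed are hypothetical. Each base learner is trained on a bootstrap sample (drawn with replacement), and the ensemble predicts by majority vote.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Fit a one-feature decision stump: (threshold, label_left, label_right)."""
    majority = Counter(ys).most_common(1)[0][0]
    best = (float("inf"), majority, majority)  # constant-prediction fallback
    best_errors = sum(y != majority for y in ys)
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0]
        errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
        if errors < best_errors:
            best, best_errors = (t, l_lab, r_lab), errors
    return best


def fit_bagging(xs, ys, n_estimators=25, seed=0):
    """Train each stump on a bootstrap sample of the training data."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def predict(stumps, x):
    """Majority vote over the individual stump predictions."""
    votes = [left if x <= t else right for t, left, right in stumps]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical toy data: one numeric feature, two classes.
XS = [1, 2, 3, 4, 5, 11, 12, 13, 14, 15]
YS = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

if __name__ == "__main__":
    ensemble = fit_bagging(XS, YS)
    print(predict(ensemble, 3), predict(ensemble, 13))
```

Averaging the votes of learners trained on different resamples reduces the variance of any single learner, which is the usual motivation for Bagging.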
445 citations
"Machine Learning: A Review on Binar..." refers methods in this paper
...al.[19] proposed ensemble approach such as Adaboost, Bagging, Dagging, Random Subspaces and majority Voting....
TL;DR: A framework for detecting new malicious code in executable files can be designed to achieve very high accuracy while maintaining low false positives (i.e. misclassifying benign files as malicious) and should include training of multiple classifiers on various types of features, as well as an active learning mechanism to maintain high detection accuracy.
302 citations
Additional excerpts
...al.[10] has addressed different challenges i....
TL;DR: It is shown that using a large feature set, it is possible to distinguish regular documents from deceptive documents with 96.6% accuracy (F-measure) and an analysis of linguistic features that can be modified to hide writing style is presented.
Abstract: In digital forensics, questions often arise about the authors of documents: their identity, demographic background, and whether they can be linked to other documents. The field of stylometry uses linguistic features and machine learning techniques to answer these questions. While stylometry techniques can identify authors with high accuracy in non-adversarial scenarios, their accuracy is reduced to random guessing when faced with authors who intentionally obfuscate their writing style or attempt to imitate that of another author. While these results are good for privacy, they raise concerns about fraud. We argue that some linguistic features change when people hide their writing style and by identifying those features, stylistic deception can be recognized. The major contribution of this work is a method for detecting stylistic deception in written documents. We show that using a large feature set, it is possible to distinguish regular documents from deceptive documents with 96.6% accuracy (F-measure). We also present an analysis of linguistic features that can be modified to hide writing style.
276 citations
"Machine Learning: A Review on Binar..." refers methods in this paper
...al.[5] Author has proposed a method to detect stylistic deception in written document....
TL;DR: This paper presents the first classification method integrating static and dynamic features into a single test and concludes that to achieve acceptable accuracy in classifying the latest malware, some older malware should be included in the set of data.