Posted Content

Harvesting SSL Certificate Data to Identify Web-Fraud

TL;DR: In this paper, the authors conduct a comprehensive study of SSL certificates and, drawing on extensive measurements, build a classifier that detects web-fraud domains, such as those used for typosquatting and phishing, with high accuracy.
Abstract: Web-fraud is one of the most unpleasant features of today's Internet. Two well-known examples of fraudulent activities on the web are phishing and typosquatting. Their effects range from relatively benign (such as unwanted ads) to downright sinister (especially when typosquatting is combined with phishing). This paper presents a novel technique to detect web-fraud domains that utilize HTTPS. To this end, we conduct the first comprehensive study of SSL certificates. We analyze certificates of legitimate and popular domains and those used by fraudulent ones. Drawing from extensive measurements, we build a classifier that detects such malicious domains with high accuracy.
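
The abstract stops short of the classifier itself. As a rough sketch of the general approach (the feature set and the random-forest choice are illustrative assumptions, not the authors' exact pipeline), certificate-derived features can feed a standard classifier:

```python
# Sketch: classify domains as fraudulent or legitimate from
# SSL-certificate features. Feature names and model choice are
# illustrative assumptions, not the paper's exact pipeline.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-certificate features: validity period in days,
# self-signed flag, issuer reputation score, subject-CN/domain match.
X = np.array([
    [365, 0, 0.9, 1],    # legitimate-looking certificate
    [3650, 1, 0.1, 0],   # long-lived, self-signed, CN mismatch
    # ... many more labeled examples would go here
])
y = np.array([0, 1])     # 0 = legitimate, 1 = fraudulent

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.predict([[730, 1, 0.2, 0]]))  # score an unseen certificate
```
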
Citations
Proceedings ArticleDOI
26 May 2015
TL;DR: This work proposes a machine-learning approach to detect phishing websites using features from their X.509 public key certificates, and illustrates that this certificate-based approach greatly increases the difficulty of masquerading undetected for phishers, with single millisecond delays for users.
Abstract: We propose a machine-learning approach to detect phishing websites using features from their X.509 public key certificates. We show that its efficacy extends beyond HTTPS-enabled sites. Our solution enables immediate local identification of phishing sites. As such, this serves as an important complement to the existing server-based anti-phishing mechanisms which predominantly use blacklists. Blacklisting suffers from several inherent drawbacks in terms of correctness, timeliness, and completeness. Due to the potentially significant lag prior to site blacklisting, there is a window of opportunity for attackers. Other local client-side phishing detection approaches also exist, but primarily rely on page content or URLs, which are arguably easier to manipulate by attackers. We illustrate that our certificate-based approach greatly increases the difficulty of masquerading undetected for phishers, with single millisecond delays for users. We further show that this approach works not only against HTTPS-enabled phishing attacks, but also detects HTTP phishing attacks with port 443 enabled.
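
The abstract does not publish the feature extractor; as a rough sketch, certificate fields of the kind described can be pulled from a PEM-encoded certificate with Python's `cryptography` library (which fields matter is an assumption here):

```python
# Sketch: extract candidate features from an X.509 certificate.
# The chosen fields are an assumption; the paper's feature set may differ.
from cryptography import x509
from cryptography.x509.oid import NameOID

def cert_features(pem_bytes: bytes) -> dict:
    cert = x509.load_pem_x509_certificate(pem_bytes)
    subject_cn = cert.subject.get_attributes_for_oid(NameOID.COMMON_NAME)
    issuer_cn = cert.issuer.get_attributes_for_oid(NameOID.COMMON_NAME)
    return {
        # Validity span in days; unusually short or long spans can be a signal.
        "validity_days": (cert.not_valid_after - cert.not_valid_before).days,
        "subject_cn": subject_cn[0].value if subject_cn else None,
        "issuer_cn": issuer_cn[0].value if issuer_cn else None,
        # Self-signed certificates have identical subject and issuer.
        "self_signed": cert.subject == cert.issuer,
        "sig_hash": cert.signature_hash_algorithm.name,
    }
```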

51 citations

Journal ArticleDOI
17 Sep 2016
TL;DR: This work presents a method, developed from a large and timely collection of certificates, for detecting rogue certificates issued by trusted CAs; classification is automated by building machine-learning models with deep neural networks (DNNs).
Abstract: Rogue certificates are valid certificates issued by a legitimate certificate authority (CA) that are nonetheless untrustworthy; yet trusted by web browsers and users. With the current public key infrastructure, there exists a window of vulnerability between the time a rogue certificate is issued and when it is detected. Rogue certificates from recent compromises have been trusted for as long as weeks before detection and revocation. Previous proposals to close this window of vulnerability require changes in the infrastructure, Internet protocols, or end user experience. We present a method for detecting rogue certificates from trusted CAs developed from a large and timely collection of certificates. This method automates classification by building machine-learning models with Deep Neural Networks (DNN). Despite the scarcity of rogue instances in the dataset, DNN produced a classification method that is proven both in simulation and in the July 2014 compromise of the India CCA. We report the details of the classification method and illustrate that it is repeatable, such as with datasets obtained from crawling. We describe the classification performance under our current research deployment.
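
The abstract gives no architecture details; a minimal sketch of a feed-forward network for this task, with the scarcity of rogue examples handled by up-weighting the positive class, might look as follows (layer sizes, learning rate, and the class weight are all assumptions):

```python
# Sketch: small feed-forward network for rogue-certificate detection.
# Architecture and hyperparameters are assumptions, not the paper's setup.
import torch
import torch.nn as nn

n_features = 32                      # assumed certificate feature vector size
model = nn.Sequential(
    nn.Linear(n_features, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1),                # single logit: rogue vs. benign
)
# Rogue certificates are scarce, so up-weight the positive class.
loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([100.0]))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(x: torch.Tensor, y: torch.Tensor) -> float:
    """One gradient step on a batch of feature vectors x and 0/1 labels y."""
    optimizer.zero_grad()
    loss = loss_fn(model(x).squeeze(1), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```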

38 citations

Proceedings ArticleDOI
23 Feb 2014
TL;DR: This work evaluates the actual level of adherence to the CA/Browser Forum guidelines over time, as well as the impact of each violation, by inspecting a large collection of certificates gathered from Web crawls and automatically deriving profile templates that characterize the makeup of certificates per issuer.
Abstract: A string of recent attacks against the global public key infrastructure (PKI) has brought to light weaknesses in the certification authority (CA) system. In response, the CA/Browser Forum, a consortium of certification authorities and browser vendors, published in 2011 a set of requirements applicable to all certificates intended for use on the Web and issued after July 1st, 2012, following the successful adoption of the extended validation guidelines in 2007. We evaluate the actual level of adherence to the CA/Browser Forum guidelines over time, as well as the impact of each violation, by inspecting a large collection of certificates gathered from Web crawls. We further refine our analysis by automatically deriving profile templates that characterize the makeup of certificates per issuer. By integrating these templates with violation statistics, we are able to depict the practices of certification authorities worldwide, and thus to monitor the PKI and proactively detect major violations. Our method also provides new means of assessing the trustworthiness of SSL certificates used on the Web.
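
As a toy illustration of the profile-template idea (the paper's derivation is more elaborate), a template can be taken as the modal value of each certificate field per issuer, with certificates flagged when they deviate from it:

```python
# Sketch: derive per-issuer profile templates as the most common value of
# each certificate field, then report fields where a certificate deviates.
from collections import Counter, defaultdict

def build_templates(certs):
    """certs: iterable of dicts, each with an 'issuer' key plus field values."""
    by_issuer = defaultdict(lambda: defaultdict(Counter))
    for c in certs:
        for field, value in c.items():
            if field != "issuer":
                by_issuer[c["issuer"]][field][value] += 1
    return {
        issuer: {f: counts.most_common(1)[0][0] for f, counts in fields.items()}
        for issuer, fields in by_issuer.items()
    }

def deviations(cert, templates):
    """Fields where this certificate differs from its issuer's template."""
    template = templates.get(cert["issuer"], {})
    return [f for f, v in template.items() if cert.get(f) != v]
```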

36 citations

Journal ArticleDOI
TL;DR: This paper shows how the concepts of learning automata (LA) can be used to verify SSL certificates, and proposes an LA-based system that can classify SSL certificates as safe or unsafe.
Abstract: With the rapid evolution of the Internet, security has become a major area of concern and, consequently, an interesting research area. Different applications transmit sensitive information over the Internet, which creates increased chances for attackers to look into every piece of data, unless it is secured using a secure socket layer (SSL) certificate. However, present SSL certificates also face challenges because of various attacks, and these certificates need to be verified before transmitting information. In this paper, we show how the concepts of learning automata (LA) can be used to verify SSL certificates. The proposed LA-based system can detect safe or unsafe SSL certificates. The LA reward/penalty scheme is used to build the trust value for SSL certificates. Copyright © 2013 John Wiley & Sons, Ltd.
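
The abstract does not spell out the reward/penalty scheme; a minimal sketch, assuming a linear reward-penalty update with illustrative learning rates, could maintain a per-certificate trust value like this:

```python
# Sketch: linear reward-penalty update for a certificate trust value.
# Learning rates and the 0.5 prior are illustrative assumptions.
def update_trust(trust: float, verified_ok: bool,
                 reward_rate: float = 0.1, penalty_rate: float = 0.1) -> float:
    if verified_ok:
        return trust + reward_rate * (1.0 - trust)  # reward: move toward 1
    return trust * (1.0 - penalty_rate)             # penalty: move toward 0

trust = 0.5                                  # neutral prior for a new certificate
for outcome in [True, True, False, True]:    # results of successive verifications
    trust = update_trust(trust, outcome)
print(f"trust = {trust:.3f}")
```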

17 citations

Journal ArticleDOI
12 Nov 2018
TL;DR: This paper proposes a model that amalgamates multiple approaches to detect phishing URLs, typosquatting and phoneme-based domains, and that suggests the legitimate domain targeted by attackers.
Abstract: Purpose: This paper aims to propose a model entitled MMSPhiD (multidimensional similarity metrics model for screen reader user to phishing detection) that amalgamates multiple approaches to detect phishing URLs. Design/methodology/approach: The model consists of three major components: a machine learning-based approach, a typosquatting-based approach and a phoneme-based approach. The major objectives of the proposed model are detecting phishing URLs, typosquatting and phoneme-based domains, and suggesting the legitimate domain targeted by attackers. Findings: The results of the experiment show that the MMSPhiD model can successfully detect phishing with 99.03 per cent accuracy. In addition, this paper has analyzed 20 leading domains from Alexa and identified 1,861 registered typosquatting and 543 phoneme-based domains. Research limitations/implications: The proposed model uses machine learning together with a list-based approach; building and maintaining the list is a limitation. Practical implications: The results of the experiments demonstrate that the model achieved higher performance due to the incorporation of multi-dimensional filters. Social implications: This paper incorporates the accessibility needs of persons with visual impairments and provides an accessible anti-phishing approach. Originality/value: This paper assists persons with visual impairments in detecting phoneme-based phishing domains.
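
As a minimal sketch of the typosquatting component alone (the whitelist and distance threshold are illustrative assumptions; MMSPhiD combines this with machine-learning and phoneme-based filters):

```python
# Sketch: flag candidate typosquatting domains by edit distance to a list
# of legitimate domains, and suggest the likely target domain.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

LEGITIMATE = ["google.com", "paypal.com", "amazon.com"]  # illustrative list

def likely_target(domain: str, max_distance: int = 2):
    """Return the closest legitimate domain if within max_distance edits."""
    best = min(LEGITIMATE, key=lambda d: levenshtein(domain, d))
    dist = levenshtein(domain, best)
    return best if 0 < dist <= max_distance else None

print(likely_target("paypa1.com"))  # -> paypal.com
```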

11 citations

References
Journal ArticleDOI
01 Oct 2001
TL;DR: Internal estimates monitor error, strength, and correlation; these are used to show the response to increasing the number of features used in the splitting, and the ideas are also applicable to regression.
Abstract: Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International Conference, 1996, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
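
A brief illustration of the out-of-bag (internal) estimates and variable importance the abstract refers to, using scikit-learn's random forest on synthetic data:

```python
# Sketch: out-of-bag error estimate and variable importance from a
# random forest; the dataset is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=200,
                                max_features="sqrt",  # random feature subset per split
                                oob_score=True,       # internal error estimate
                                random_state=0)
forest.fit(X, y)
print("OOB accuracy:", forest.oob_score_)
print("importances:", forest.feature_importances_)
```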

79,257 citations

Book
15 Oct 1992
TL;DR: A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.
Abstract: From the Publisher: Classifier systems play a major role in machine learning and knowledge-based systems, and Ross Quinlan's work on ID3 and C4.5 is widely acknowledged to have made some of the most significant contributions to their development. This book is a complete guide to the C4.5 system as implemented in C for the UNIX environment. It contains a comprehensive guide to the system's use, the source code (about 8,800 lines), and implementation notes. The source code and sample datasets are also available on a 3.5-inch floppy diskette for a Sun workstation. C4.5 starts with large sets of cases belonging to known classes. The cases, described by any mixture of nominal and numeric properties, are scrutinized for patterns that allow the classes to be reliably discriminated. These patterns are then expressed as models, in the form of decision trees or sets of if-then rules, that can be used to classify new cases, with emphasis on making the models understandable as well as accurate. The system has been applied successfully to tasks involving tens of thousands of cases described by hundreds of properties. The book starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting. Advantages and disadvantages of the C4.5 approach are discussed and illustrated with several case studies. This book and software should be of interest to developers of classification-based intelligent systems and to students in machine learning and expert systems courses.
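
scikit-learn implements CART rather than C4.5, but a tree grown with the entropy criterion and cost-complexity pruning is a rough stand-in for the information-gain trees the book describes:

```python
# Sketch: an entropy-criterion decision tree as a C4.5-like stand-in
# (scikit-learn implements CART, not C4.5 itself).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy",  # information gain, as in ID3/C4.5
                              ccp_alpha=0.01,       # cost-complexity pruning
                              random_state=0)
tree.fit(X, y)
print(export_text(tree))  # the learned tree as readable if-then rules
```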

21,674 citations

Journal ArticleDOI
01 Aug 1996
TL;DR: Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy.
Abstract: Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The aggregation averages over the versions when predicting a numerical outcome and does a plurality vote when predicting a class. The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets. Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy.
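
A short illustration of bagging an unstable predictor (a decision tree) with scikit-learn, on synthetic data:

```python
# Sketch: bagging as described above - bootstrap replicates of the
# training set, one tree per replicate, plurality vote at prediction time.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
single = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                           random_state=0)          # 50 bootstrap replicates
print("single tree:", cross_val_score(single, X, y).mean())
print("bagged trees:", cross_val_score(bagged, X, y).mean())
```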

16,118 citations

Journal ArticleDOI
01 Aug 1997
TL;DR: The model studied can be interpreted as a broad, abstract extension of the well-studied on-line prediction model to a general decision-theoretic setting, and it is shown that the multiplicative weight-update Littlestone-Warmuth rule can be adapted to this model, yielding bounds that are slightly weaker in some cases, but applicable to a considerably more general class of learning problems.
Abstract: In the first part of the paper we consider the problem of dynamically apportioning resources among a set of options in a worst-case on-line framework. The model we study can be interpreted as a broad, abstract extension of the well-studied on-line prediction model to a general decision-theoretic setting. We show that the multiplicative weight-update Littlestone-Warmuth rule can be adapted to this model, yielding bounds that are slightly weaker in some cases, but applicable to a considerably more general class of learning problems. We show how the resulting learning algorithm can be applied to a variety of problems, including gambling, multiple-outcome prediction, repeated games, and prediction of points in R^n. In the second part of the paper we apply the multiplicative weight-update technique to derive a new boosting algorithm. This boosting algorithm does not require any prior knowledge about the performance of the weak learning algorithm. We also study generalizations of the new boosting algorithm to the problem of learning functions whose range, rather than being binary, is an arbitrary finite set or a bounded segment of the real line.
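
A compact sketch of the Hedge-style multiplicative weight update from the first part of the paper (the loss data and the value of beta are illustrative):

```python
# Sketch: Hedge multiplicative weight update. Each round every expert
# incurs a loss in [0, 1] and its weight is scaled by beta**loss.
import numpy as np

def hedge(loss_matrix: np.ndarray, beta: float = 0.8) -> np.ndarray:
    """loss_matrix: shape (n_rounds, n_experts). Returns final weights."""
    w = np.ones(loss_matrix.shape[1])
    for losses in loss_matrix:
        w *= beta ** losses   # higher loss -> larger weight reduction
        w /= w.sum()          # keep a probability distribution
    return w

rng = np.random.default_rng(0)
losses = rng.random((100, 3))
losses[:, 0] *= 0.3           # expert 0 is consistently better
print(hedge(losses))          # most weight concentrates on expert 0
```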

15,813 citations

Journal ArticleDOI
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

13,246 citations