Proceedings Article

The Ghost in the Browser: Analysis of Web-based Malware

Niels Provos, Dean McNamee, Panayiotis Mavrommatis, Ke Wang, Nagendra Modadugu
10 Apr 2007, pp. 4-4
TL;DR: This work identifies the four prevalent mechanisms used to inject malicious content on popular web sites: web server security, user contributed content, advertising and third-party widgets, and presents examples of abuse found on the Internet.
Abstract: As more users are connected to the Internet and conduct their daily activities electronically, computer users have become the target of an underground economy that infects hosts with malware or adware for financial gain. Unfortunately, even a single visit to an infected web site enables the attacker to detect vulnerabilities in the user's applications and force the download of a multitude of malware binaries. Frequently, this malware allows the adversary to gain full control of the compromised systems, leading to the exfiltration of sensitive information or installation of utilities that facilitate remote control of the host. We believe that such behavior is similar to our traditional understanding of botnets. However, the main difference is that web-based malware infections are pull-based and that the resulting command feedback loop is looser. To characterize the nature of this rising threat, we identify the four prevalent mechanisms used to inject malicious content on popular web sites: web server security, user contributed content, advertising and third-party widgets. For each of these areas, we present examples of abuse found on the Internet. Our aim is to present the state of malware on the Web and emphasize the importance of this rising threat.
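As a rough, hypothetical illustration of the page-level heuristics that motivate this kind of study (this is not the authors' pipeline, which processes crawled pages at scale with MapReduce, as the reference snippets below note), the Python sketch that follows flags two common injection indicators: zero-sized iframes pulling in third-party content and long escaped strings typical of obfuscated payloads. The patterns and thresholds are assumptions for illustration only.

# Hypothetical heuristic pre-filter for pages that may host drive-by downloads.
# Not the paper's system; patterns and thresholds are illustrative assumptions.
import re

HIDDEN_IFRAME = re.compile(r'<iframe[^>]+(?:width|height)\s*=\s*["\']?0', re.IGNORECASE)
ESCAPED_PAYLOAD = re.compile(r'(?:%u[0-9a-fA-F]{4}|\\x[0-9a-fA-F]{2}){32,}')

def suspicious_signals(html: str) -> list[str]:
    """Return heuristic reasons why a page might deserve closer analysis."""
    signals = []
    if HIDDEN_IFRAME.search(html):
        signals.append("zero-sized iframe (possible silent redirect to an exploit host)")
    if ESCAPED_PAYLOAD.search(html):
        signals.append("long escaped byte string (possible obfuscated payload)")
    if html.lower().count("unescape(") > 3:
        signals.append("repeated unescape() calls (common obfuscation idiom)")
    return signals

if __name__ == "__main__":
    page = '<iframe src="http://third-party.invalid/x" width=0 height=0></iframe>'
    print(suspicious_signals(page))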


Citations
Journal ArticleDOI
TL;DR: This survey aims to provide a general, comprehensive, and structured overview of the state-of-the-art methods for anomaly detection in data represented as graphs, and gives a general framework for the algorithms categorized under various settings.
Abstract: Detecting anomalies in data is a vital task, with numerous high-impact applications in areas such as security, finance, health care, and law enforcement. While numerous techniques have been developed in past years for spotting outliers and anomalies in unstructured collections of multi-dimensional points, with graph data becoming ubiquitous, techniques for structured graph data have been of focus recently. As objects in graphs have long-range correlations, a suite of novel technology has been developed for anomaly detection in graph data. This survey aims to provide a general, comprehensive, and structured overview of the state-of-the-art methods for anomaly detection in data represented as graphs. As a key contribution, we give a general framework for the algorithms categorized under various settings: unsupervised versus (semi-)supervised approaches, for static versus dynamic graphs, for attributed versus plain graphs. We highlight the effectiveness, scalability, generality, and robustness aspects of the methods. What is more, we stress the importance of anomaly attribution and highlight the major techniques that facilitate digging out the root cause, or the 'why', of the detected anomalies for further analysis and sense-making. Finally, we present several real-world applications of graph-based anomaly detection in diverse domains, including financial, auction, computer traffic, and social networks. We conclude our survey with a discussion on open theoretical and practical challenges in the field.
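To make one of the surveyed settings concrete, here is a minimal Python sketch of unsupervised anomaly detection on a plain, static graph: score nodes by how far their degree deviates from the graph-wide mean. This illustrates the general idea only, not any specific algorithm from the survey, and it assumes the networkx library is available.

# Minimal sketch of one unsupervised, plain-graph setting: flag nodes whose
# degree deviates sharply from the rest of the graph. Illustration only.
import math
import networkx as nx

def degree_outliers(g: nx.Graph, z_threshold: float = 3.0) -> list:
    degrees = dict(g.degree())
    values = list(degrees.values())
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0
    # Nodes whose degree z-score exceeds the threshold are reported as anomalies.
    return [n for n, d in degrees.items() if (d - mean) / std > z_threshold]

if __name__ == "__main__":
    g = nx.erdos_renyi_graph(200, 0.02, seed=1)
    g.add_edges_from(("hub", n) for n in range(100))  # inject an obvious anomaly
    print(degree_outliers(g))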

998 citations


Cites background from "The ghost in the browser analysis o..."


  • ...…cargo shipments [Das and Schneider, 2007, Eberle and Holder, 2007] malware/spyware detection [Invernizzi and Comparetti, 2012, Ma et al., 2009, Provos et al., 2007], false advertising [Lee et al., 2010], data-center monitoring [Li et al., 2011b], insider threat [Eberle and Holder, 2009],…...

    [...]

Journal ArticleDOI
TL;DR: An overview of techniques based on dynamic analysis that are used to analyze potentially malicious samples and analysis programs that employ these techniques to assist human analysts in assessing whether a given sample deserves closer manual inspection due to its unknown malicious behavior is provided.
Abstract: Anti-virus vendors are confronted with a multitude of potentially malicious samples today. Receiving thousands of new samples every day is not uncommon. The signatures that detect confirmed malicious threats are mainly still created manually, so it is important to discriminate between samples that pose a new unknown threat and those that are mere variants of known malware. This survey article provides an overview of techniques based on dynamic analysis that are used to analyze potentially malicious samples. It also covers analysis programs that employ these techniques to assist human analysts in assessing, in a timely and appropriate manner, whether a given sample deserves closer manual inspection due to its unknown malicious behavior.

815 citations

Posted Content
TL;DR: A comprehensive survey of the state-of-the-art methods for anomaly detection in data represented as graphs can be found in this article, where the authors highlight the effectiveness, scalability, generality, and robustness aspects of the methods.
Abstract: Detecting anomalies in data is a vital task, with numerous high-impact applications in areas such as security, finance, health care, and law enforcement. While numerous techniques have been developed in past years for spotting outliers and anomalies in unstructured collections of multi-dimensional points, with graph data becoming ubiquitous, techniques for structured graph data have been of focus recently. As objects in graphs have long-range correlations, a suite of novel technology has been developed for anomaly detection in graph data. This survey aims to provide a general, comprehensive, and structured overview of the state-of-the-art methods for anomaly detection in data represented as graphs. As a key contribution, we provide a comprehensive exploration of both data mining and machine learning algorithms for these detection tasks. We give a general framework for the algorithms categorized under various settings: unsupervised vs. (semi-)supervised approaches, for static vs. dynamic graphs, for attributed vs. plain graphs. We highlight the effectiveness, scalability, generality, and robustness aspects of the methods. What is more, we stress the importance of anomaly attribution and highlight the major techniques that facilitate digging out the root cause, or the 'why', of the detected anomalies for further analysis and sense-making. Finally, we present several real-world applications of graph-based anomaly detection in diverse domains, including financial, auction, computer traffic, and social networks. We conclude our survey with a discussion on open theoretical and practical challenges in the field.

703 citations

Proceedings ArticleDOI
26 Apr 2010
TL;DR: A novel approach to the detection and analysis of malicious JavaScript code is presented that uses a number of features and machine-learning techniques to establish the characteristics of normal JavaScript code and is able to identify anomalous JavaScript code by emulating its behavior and comparing it to the established profiles.
Abstract: JavaScript is a browser scripting language that allows developers to create sophisticated client-side interfaces for web applications. However, JavaScript code is also used to carry out attacks against the user's browser and its extensions. These attacks usually result in the download of additional malware that takes complete control of the victim's platform, and are, therefore, called "drive-by downloads." Unfortunately, the dynamic nature of the JavaScript language and its tight integration with the browser make it difficult to detect and block malicious JavaScript code. This paper presents a novel approach to the detection and analysis of malicious JavaScript code. Our approach combines anomaly detection with emulation to automatically identify malicious JavaScript code and to support its analysis. We developed a system that uses a number of features and machine-learning techniques to establish the characteristics of normal JavaScript code. Then, during detection, the system is able to identify anomalous JavaScript code by emulating its behavior and comparing it to the established profiles. In addition to identifying malicious code, the system is able to support the analysis of obfuscated code and to generate detection signatures for signature-based systems. The system has been made publicly available and has been used by thousands of analysts.
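As a toy illustration of the anomaly-detection idea in this work (the published system emulates JavaScript behavior in an instrumented browser environment, which is far beyond this sketch), the Python snippet below derives a few static features from scripts, learns their statistics on benign samples, and scores new scripts by how far they deviate. The feature choices and scoring rule are assumptions for illustration only.

# Toy sketch of feature-based anomaly scoring for scripts; not Cova et al.'s system.
import re
from statistics import mean, pstdev

STRING_LITERAL = re.compile(r'"[^"]*"|\'[^\']*\'')

def features(js: str) -> list[float]:
    strings = STRING_LITERAL.findall(js)
    longest = max((len(s) for s in strings), default=0)
    return [
        float(longest),               # very long string literals often hide payloads
        float(js.count("eval(")),     # evaluation of generated code
        float(js.count("unescape(")), # classic deobfuscation primitive
        float(len(js)),
    ]

def build_profile(benign_scripts: list[str]) -> list[tuple[float, float]]:
    columns = zip(*(features(s) for s in benign_scripts))
    return [(mean(col), pstdev(col) or 1.0) for col in columns]

def anomaly_score(js: str, profile: list[tuple[float, float]]) -> float:
    # Sum of per-feature z-scores: higher means further from the benign profile.
    return sum(abs(v - m) / s for v, (m, s) in zip(features(js), profile))

if __name__ == "__main__":
    benign = ['var x = "hello"; alert(x);', 'function add(a, b) { return a + b; }']
    profile = build_profile(benign)
    suspicious = 'eval(unescape("' + "%u9090" * 200 + '"))'
    print(anomaly_score(suspicious, profile))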

639 citations


Cites background from "The ghost in the browser analysis o..."

  • ...…and Malicious JavaScript Code. Marco Cova, Christopher Kruegel, and Giovanni Vigna, University of California, Santa Barbara {marco,chris,vigna}@cs.ucsb.edu. ABSTRACT: JavaScript is a browser scripting language that allows developers to create sophisticated client-side interfaces for web applications....

    [...]

Book
07 Jun 2012
TL;DR: This publication provides an overview of the security and privacy challenges pertinent to public cloud computing and points out considerations organizations should take when outsourcing data, applications, and infrastructure to a public cloud environment.
Abstract: NIST Special Publication 800-144 - Cloud computing can and does mean different things to different people. The common characteristics most interpretations share are on-demand scalability of highly available and reliable pooled computing resources, secure access to metered services from nearly anywhere, and displacement of data and services from inside to outside the organization. While aspects of these characteristics have been realized to a certain extent, cloud computing remains a work in progress. This publication provides an overview of the security and privacy challenges pertinent to public cloud computing and points out considerations organizations should take when outsourcing data, applications, and infrastructure to a public cloud environment.

634 citations

References
Journal ArticleDOI
Jeffrey Dean, Sanjay Ghemawat
06 Dec 2004
TL;DR: This paper presents MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
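The model described above boils down to a map function that emits intermediate key-value pairs, a shuffle that groups values by key, and a reduce function that merges each group. The single-process Python sketch below mimics that flow for the kind of per-site URL pruning mentioned in the citation snippets further down; it is a toy illustration of the programming model, not Google's distributed runtime, and the site keying and per-site limit are assumed for the example.

# Toy, single-process illustration of the MapReduce programming model.
from collections import defaultdict
from urllib.parse import urlparse

def map_phase(url: str):
    yield urlparse(url).netloc, url          # emit an intermediate key-value pair

def reduce_phase(site: str, urls: list[str], limit: int = 2) -> list[str]:
    return urls[:limit]                      # merge the values for one key

def run(urls: list[str]) -> dict:
    grouped = defaultdict(list)              # the "shuffle": group by intermediate key
    for url in urls:
        for key, value in map_phase(url):
            grouped[key].append(value)
    return {site: reduce_phase(site, vals) for site, vals in grouped.items()}

if __name__ == "__main__":
    sample = [
        "http://a.example/page1", "http://a.example/page2",
        "http://a.example/page3", "http://b.example/index",
    ]
    print(run(sample))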

20,309 citations


"The ghost in the browser analysis o..." refers methods in this paper

  • ...The MapReduce allows us to prune several billion URLs into a few million....

    [...]

  • ...We can further reduce the resulting number of URLs by sampling on a per-site basis; implemented as another MapReduce....

    [...]

  • ...MapReduce is a programming model that operates in two stages: the Map stage takes a sequence of key-value pairs as input and produces a sequence of intermediate key-value pairs as output....

    [...]

  • ...MapReduce: Simplified Data Processing on Large Clusters....

    [...]

  • ...In the first phase we employ MapReduce [5] to process all the crawled web pages for properties indicative of exploits....

    [...]

Journal ArticleDOI
Jeffrey Dean, Sanjay Ghemawat
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.

17,663 citations

Proceedings Article
07 Jul 2005
TL;DR: This paper outlines the origins and structure of bots and botnets and uses data from the operator community, the Internet Motion Sensor project, and a honeypot experiment to illustrate the botnet problem today and describes a system to detect botnets that utilize advanced command and control systems by correlating secondary detection data from multiple sources.
Abstract: Global Internet threats are undergoing a profound transformation from attacks designed solely to disable infrastructure to those that also target people and organizations. Behind these new attacks is a large pool of compromised hosts sitting in homes, schools, businesses, and governments around the world. These systems are infected with a bot that communicates with a bot controller and other bots to form what is commonly referred to as a zombie army or botnet. Botnets are a very real and quickly evolving problem that is still not well understood or studied. In this paper we outline the origins and structure of bots and botnets and use data from the operator community, the Internet Motion Sensor project, and a honeypot experiment to illustrate the botnet problem today. We then study the effectiveness of detecting botnets by directly monitoring IRC communication or other command and control activity and show a more comprehensive approach is required. We conclude by describing a system to detect botnets that utilize advanced command and control systems by correlating secondary detection data from multiple sources.

588 citations


"The ghost in the browser analysis o..." refers background in this paper

  • ...Unlike traditional botnets [4] that use push-based infection to increase their population, web-based malware infection follows a pull-based model and usually provides a looser feedback loop....

    [...]

Proceedings Article
01 Feb 2006
TL;DR: The design and implementation of the Strider HoneyMonkey Exploit Detection System is described, which consists of a pipeline of “monkey programs” running possibly vulnerable browsers on virtual machines with different patch levels and patrolling the Web to seek out and classify web sites that exploit browser vulnerabilities.
Abstract: Internet attacks that use malicious web sites to install malware programs by exploiting browser vulnerabilities are a serious emerging threat. In response, we have developed an automated web patrol system to automatically identify and monitor these malicious sites. We describe the design and implementation of the Strider HoneyMonkey Exploit Detection System, which consists of a pipeline of “monkey programs” running possibly vulnerable browsers on virtual machines with different patch levels and patrolling the Web to seek out and classify web sites that exploit browser vulnerabilities. Within the first month of utilizing this system, we identified 752 unique URLs hosted on 288 web sites that could successfully exploit unpatched Windows XP machines. The system automatically constructed topology graphs based on traffic redirection to capture the relationship between the exploit sites. This allowed us to identify several major players who are responsible for a large number of exploit pages. By monitoring these 752 exploit-URLs on a daily basis, we discovered a malicious web site that was performing zero-day exploits of the unpatched javaprxy.dll vulnerability and was operating behind 25 exploit-URLs. It was confirmed as the first “in-the-wild” zero-day exploit of this vulnerability that was reported to the Microsoft Security Response Center. Additionally, by scanning the most popular one million URLs as classified by a search engine, we found over seven hundred exploit-URLs, many of which serve popular content related to celebrities, song lyrics, wallpapers, video game cheats, and wrestling.
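The abstract mentions building topology graphs from observed traffic redirection to expose the major players behind many exploit pages. A minimal Python sketch of that idea follows: given recorded (source URL, destination URL) redirections, rank destination hosts by how many distinct source sites funnel traffic to them. The data format and ranking rule are assumptions for illustration, not the Strider HoneyMonkey implementation.

# Hypothetical ranking of redirect targets by the number of distinct source sites.
from collections import defaultdict
from urllib.parse import urlparse

def rank_redirect_targets(redirects: list[tuple[str, str]]) -> list[tuple[str, int]]:
    sources_per_target = defaultdict(set)
    for src_url, dst_url in redirects:
        sources_per_target[urlparse(dst_url).netloc].add(urlparse(src_url).netloc)
    ranked = [(host, len(srcs)) for host, srcs in sources_per_target.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    observed = [
        ("http://lyrics.example/a", "http://exploit.example/e1"),
        ("http://wallpapers.example/b", "http://exploit.example/e1"),
        ("http://cheats.example/c", "http://other.example/x"),
    ]
    print(rank_redirect_targets(observed))  # exploit.example is ranked first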

434 citations

Proceedings Article
01 Jan 2005
TL;DR: This paper performs a large-scale, longitudinal study of the Web, sampling both executables and conventional Web pages for malicious objects, and quantifies the density of spyware, the types of threats, and the most dangerous Web zones in which spyware is likely to be encountered.
Abstract: Malicious spyware poses a significant threat to desktop security and integrity. This paper examines that threat from an Internet perspective. Using a crawler, we performed a large-scale, longitudinal study of the Web, sampling both executables and conventional Web pages for malicious objects. Our results show the extent of spyware content. For example, in a May 2005 crawl of 18 million URLs, we found spyware in 13.4% of the 21,200 executables we identified. At the same time, we found scripted “drive-by download” attacks in 5.9% of the Web pages we processed. Our analysis quantifies the density of spyware, the types of threats, and the most dangerous Web zones in which spyware is likely to be encountered. We also show the frequency with which specific spyware programs were found in the content we crawled. Finally, we measured changes in the density of spyware over time; e.g., our October 2005 crawl saw a substantial reduction in the presence of drive-by download attacks, compared with those we detected in May.

272 citations


"The ghost in the browser analysis o..." refers background in this paper

  • ...Moshchuk et al. conducted a study of spyware on the web by crawling 18 million URLs in May 2005 [7]....

    [...]