Author

Justin Ma

Bio: Justin Ma is an academic researcher from the University of California, San Diego. The author has contributed to research on topics including Semantic URL and The Internet. The author has an h-index of 10 and has co-authored 11 publications receiving 2,281 citations. Previous affiliations of Justin Ma include the University of California, Berkeley.

Papers
Proceedings ArticleDOI
28 Jun 2009
TL;DR: This paper describes an approach to this problem based on automated URL classification, using statistical methods to discover the tell-tale lexical and host-based properties of malicious Web site URLs.
Abstract: Malicious Web sites are a cornerstone of Internet criminal activities. As a result, there has been broad interest in developing systems to prevent the end user from visiting such sites. In this paper, we describe an approach to this problem based on automated URL classification, using statistical methods to discover the tell-tale lexical and host-based properties of malicious Web site URLs. These methods are able to learn highly predictive models by extracting and automatically analyzing tens of thousands of features potentially indicative of suspicious URLs. The resulting classifiers obtain 95-99% accuracy, detecting large numbers of malicious Web sites from their URLs, with only modest false positives.

806 citations
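The lexical-feature approach the abstract describes can be illustrated with a short sketch. The snippet below is a minimal, hypothetical example and not the authors' system: it tokenizes the hostname and path of each URL, builds sparse bag-of-words features, and trains a linear classifier with scikit-learn. The tiny labeled dataset is invented purely for illustration.

```python
# Minimal sketch of lexical URL classification (not the paper's full system,
# which also uses host-based features). The labeled URLs below are invented.
import re
from urllib.parse import urlparse

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def url_tokens(url: str) -> list[str]:
    """Split a URL into hostname and path tokens, the kind of lexical
    features the paper extracts."""
    parsed = urlparse(url)
    host_tokens = parsed.netloc.split(".")
    path_tokens = re.split(r"[/._?=&-]+", parsed.path + " " + parsed.query)
    return [t for t in host_tokens + path_tokens if t]

urls = [
    "http://www.example.com/index.html",          # benign (invented)
    "http://paypal.example-login.biz/verify.php", # malicious (invented)
]
labels = [0, 1]  # 0 = benign, 1 = malicious

# Sparse bag-of-words over URL tokens feeds a discriminative linear model,
# mirroring the paper's use of large sparse lexical feature vectors.
model = make_pipeline(CountVectorizer(analyzer=url_tokens),
                      LogisticRegression())
model.fit(urls, labels)
print(model.predict(["http://secure-login.example-verify.biz/account.php"]))
```

In the paper's actual setting, tens of thousands of such features (plus host-based ones like WHOIS and DNS properties) are extracted automatically; the pipeline shape, however, is the same.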

Proceedings ArticleDOI
14 Jun 2009
TL;DR: It is demonstrated that recently-developed online algorithms can be as accurate as batch techniques, achieving classification accuracies up to 99% over a balanced data set.
Abstract: This paper explores online learning approaches for detecting malicious Web sites (those involved in criminal scams) using lexical and host-based features of the associated URLs. We show that this application is particularly appropriate for online algorithms as the size of the training data is larger than can be efficiently processed in batch and because the distribution of features that typify malicious URLs is changing continuously. Using a real-time system we developed for gathering URL features, combined with a real-time source of labeled URLs from a large Web mail provider, we demonstrate that recently-developed online algorithms can be as accurate as batch techniques, achieving classification accuracies up to 99% over a balanced data set.

567 citations
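The streaming setup the abstract describes can be sketched as a simple training loop. The paper evaluates dedicated online algorithms (such as confidence-weighted learning); the snippet below merely illustrates the online pattern using scikit-learn's SGDClassifier and partial_fit, with invented placeholder features and labels.

```python
# Sketch of the online (streaming) training loop; SGDClassifier stands in
# for the online algorithms evaluated in the paper. Data below is invented.
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="log_loss")  # logistic loss, updated incrementally
classes = np.array([0, 1])            # must be declared up front for partial_fit

rng = np.random.default_rng(0)
for day in range(5):  # each iteration mimics one day's labeled URL feed
    X_day = rng.random((100, 20))         # placeholder feature vectors
    y_day = rng.integers(0, 2, size=100)  # placeholder labels
    clf.partial_fit(X_day, y_day, classes=classes)

# The model can score new URLs immediately after each update, adapting
# as the distribution of malicious-URL features drifts over time.
print(clf.predict(rng.random((3, 20))))
```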

Journal ArticleDOI
20 Oct 2005
TL;DR: This paper has built a prototype honeyfarm system, called Potemkin, that exploits virtual machines, aggressive memory sharing, and late binding of resources to achieve the goal of improving honeypot scalability while still closely emulating the execution behavior of individual Internet hosts.
Abstract: The rapid evolution of large-scale worms, viruses, and botnets has made Internet malware a pressing concern. Such infections are at the root of modern scourges including DDoS extortion, on-line identity theft, SPAM, phishing, and piracy. However, the most widely used tools for gathering intelligence on new malware -- network honeypots -- have forced investigators to choose between monitoring activity at a large scale or capturing behavior with high fidelity. In this paper, we describe an approach to minimize this tension and improve honeypot scalability by up to six orders of magnitude while still closely emulating the execution behavior of individual Internet hosts. We have built a prototype honeyfarm system, called Potemkin, that exploits virtual machines, aggressive memory sharing, and late binding of resources to achieve this goal. While still an immature implementation, Potemkin has emulated over 64,000 Internet honeypots in live test runs, using only a handful of physical servers.

356 citations

Journal ArticleDOI
TL;DR: This article develops a real-time system for gathering URL features and is able to train an online classifier that detects malicious Web sites with 99% accuracy over a balanced dataset.
Abstract: Malicious Web sites are a cornerstone of Internet criminal activities. The dangers of these sites have created a demand for safeguards that protect end-users from visiting them. This article explores how to detect malicious Web sites from the lexical and host-based features of their URLs. We show that this problem lends itself naturally to modern algorithms for online learning. Online algorithms not only process large numbers of URLs more efficiently than batch algorithms, they also adapt more quickly to new features in the continuously evolving distribution of malicious URLs. We develop a real-time system for gathering URL features and pair it with a real-time feed of labeled URLs from a large Web mail provider. From these features and labels, we are able to train an online classifier that detects malicious Web sites with 99% accuracy over a balanced dataset.

216 citations

Proceedings ArticleDOI
25 Oct 2006
TL;DR: This work analyzes three alternative mechanisms using statistical and structural content models for automatically identifying traffic that uses the same application-layer protocol, relying solely on flow content, and evaluates each mechanism's classification performance using real-world traffic traces from multiple sites.
Abstract: Network managers are inevitably called upon to associate network traffic with particular applications. Indeed, this operation is critical for a wide range of management functions ranging from debugging and security to analytics and policy support. Traditionally, managers have relied on application adherence to a well established global port mapping: Web traffic on port 80, mail traffic on port 25 and so on. However, a range of factors - including firewall port blocking, tunneling, dynamic port allocation, and a bloom of new distributed applications - has weakened the value of this approach. We analyze three alternative mechanisms using statistical and structural content models for automatically identifying traffic that uses the same application-layer protocol, relying solely on flow content. In this manner, known applications may be identified regardless of port number, while traffic from one unknown application will be identified as distinct from another. We evaluate each mechanism's classification performance using real-world traffic traces from multiple sites.

214 citations
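The "statistical content models" in the abstract can be illustrated with a toy sketch: summarize each flow by the byte-value distribution of its first payload bytes and feed that to a classifier, so protocols are identified from content rather than port numbers. This is a hypothetical simplification of the paper's models, and the example flows below are invented.

```python
# Toy sketch of a statistical content model for protocol identification:
# each flow is summarized by the byte-value histogram of its first payload
# bytes. A simplification of the paper's models; the flows are invented.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def byte_histogram(payload: bytes, n: int = 64) -> np.ndarray:
    """Normalized 256-bin histogram over the first n payload bytes."""
    head = np.frombuffer(payload[:n], dtype=np.uint8)
    hist = np.bincount(head, minlength=256).astype(float)
    return hist / max(hist.sum(), 1.0)

flows = [b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n",    # HTTP-like
         b"HELO mail.example.com\r\nMAIL FROM:<a@b>\r\n",   # SMTP-like
         b"GET /index.html HTTP/1.0\r\n\r\n",
         b"EHLO relay.example.org\r\nRCPT TO:<c@d>\r\n"]
labels = ["http", "smtp", "http", "smtp"]

X = np.stack([byte_histogram(f) for f in flows])
clf = GaussianNB().fit(X, labels)
# Classify a new flow purely from its content, independent of port number.
print(clf.predict([byte_histogram(b"GET /img.png HTTP/1.1\r\n\r\n")]))
```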


Cited by
Journal ArticleDOI
TL;DR: This survey paper looks at emerging research into the application of Machine Learning techniques to IP traffic classification - an inter-disciplinary blend of IP networking and data mining techniques.
Abstract: The research community has begun looking for IP traffic classification techniques that do not rely on 'well known' TCP or UDP port numbers, or interpreting the contents of packet payloads. New work is emerging on the use of statistical traffic characteristics to assist in the identification and classification process. This survey paper looks at emerging research into the application of Machine Learning (ML) techniques to IP traffic classification - an inter-disciplinary blend of IP networking and data mining techniques. We provide context and motivation for the application of ML techniques to IP traffic classification, and review 18 significant works that cover the dominant period from 2004 to early 2007. These works are categorized and reviewed according to their choice of ML strategies and primary contributions to the literature. We also discuss a number of key requirements for the employment of ML-based traffic classifiers in operational IP networks, and qualitatively critique the extent to which the reviewed works meet these requirements. Open issues and challenges in the field are also discussed.

1,519 citations

Journal ArticleDOI
TL;DR: This survey aims to provide a general, comprehensive, and structured overview of the state-of-the-art methods for anomaly detection in data represented as graphs, and gives a general framework for the algorithms categorized under various settings.
Abstract: Detecting anomalies in data is a vital task, with numerous high-impact applications in areas such as security, finance, health care, and law enforcement. While numerous techniques have been developed in past years for spotting outliers and anomalies in unstructured collections of multi-dimensional points, with graph data becoming ubiquitous, techniques for structured graph data have been of focus recently. As objects in graphs have long-range correlations, a suite of novel technology has been developed for anomaly detection in graph data. This survey aims to provide a general, comprehensive, and structured overview of the state-of-the-art methods for anomaly detection in data represented as graphs. As a key contribution, we give a general framework for the algorithms categorized under various settings: unsupervised versus (semi-)supervised approaches, for static versus dynamic graphs, for attributed versus plain graphs. We highlight the effectiveness, scalability, generality, and robustness aspects of the methods. What is more, we stress the importance of anomaly attribution and highlight the major techniques that facilitate digging out the root cause, or the 'why', of the detected anomalies for further analysis and sense-making. Finally, we present several real-world applications of graph-based anomaly detection in diverse domains, including financial, auction, computer traffic, and social networks. We conclude our survey with a discussion on open theoretical and practical challenges in the field.

998 citations
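One family of methods the survey categorizes -- unsupervised detection on static, plain graphs -- can be illustrated with a toy sketch: compute simple structural features per node (degree and egonet edge count) and flag nodes whose features deviate sharply from the bulk. This is a hypothetical simplification in the spirit of the egonet-feature methods the survey covers, not any specific surveyed algorithm; networkx is assumed and the graph is synthetic.

```python
# Toy sketch of unsupervised anomaly detection on a plain static graph:
# nodes are scored by how far their structural features sit from the bulk.
# A simplification of egonet-feature methods; the graph is synthetic.
import networkx as nx
import numpy as np

G = nx.barabasi_albert_graph(200, 2, seed=1)
G.add_edges_from((999, v) for v in range(40))  # inject one abnormally dense hub

features = []
for v in G.nodes:
    ego = nx.ego_graph(G, v)  # the node plus its direct neighborhood
    features.append([G.degree(v), ego.number_of_edges()])
X = np.array(features, dtype=float)

# Robust z-score per feature; a node's anomaly score is its largest deviation.
med = np.median(X, axis=0)
mad = np.median(np.abs(X - med), axis=0) + 1e-9
scores = np.max(np.abs(X - med) / mad, axis=1)

top = sorted(zip(scores, list(G.nodes)), reverse=True)[:3]
print(top)  # the injected hub (node 999) should rank near the top
```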

Proceedings ArticleDOI
28 Jun 2009
TL;DR: This paper describes an approach to this problem based on automated URL classification, using statistical methods to discover the tell-tale lexical and host-based properties of malicious Web site URLs.
Abstract: Malicious Web sites are a cornerstone of Internet criminal activities. As a result, there has been broad interest in developing systems to prevent the end user from visiting such sites. In this paper, we describe an approach to this problem based on automated URL classification, using statistical methods to discover the tell-tale lexical and host-based properties of malicious Web site URLs. These methods are able to learn highly predictive models by extracting and automatically analyzing tens of thousands of features potentially indicative of suspicious URLs. The resulting classifiers obtain 95-99% accuracy, detecting large numbers of malicious Web sites from their URLs, with only modest false positives.

806 citations

Posted Content
TL;DR: A comprehensive survey of the state-of-the-art methods for anomaly detection in data represented as graphs can be found in this article, where the authors highlight the effectiveness, scalability, generality, and robustness aspects of the methods.
Abstract: Detecting anomalies in data is a vital task, with numerous high-impact applications in areas such as security, finance, health care, and law enforcement. While numerous techniques have been developed in past years for spotting outliers and anomalies in unstructured collections of multi-dimensional points, with graph data becoming ubiquitous, techniques for structured graph data have been of focus recently. As objects in graphs have long-range correlations, a suite of novel technology has been developed for anomaly detection in graph data. This survey aims to provide a general, comprehensive, and structured overview of the state-of-the-art methods for anomaly detection in data represented as graphs. As a key contribution, we provide a comprehensive exploration of both data mining and machine learning algorithms for these detection tasks. We give a general framework for the algorithms categorized under various settings: unsupervised vs. (semi-)supervised approaches, for static vs. dynamic graphs, for attributed vs. plain graphs. We highlight the effectiveness, scalability, generality, and robustness aspects of the methods. What is more, we stress the importance of anomaly attribution and highlight the major techniques that facilitate digging out the root cause, or the 'why', of the detected anomalies for further analysis and sense-making. Finally, we present several real-world applications of graph-based anomaly detection in diverse domains, including financial, auction, computer traffic, and social networks. We conclude our survey with a discussion on open theoretical and practical challenges in the field.

703 citations