
Showing papers by "Lawrence K. Saul published in 2011"


Journal ArticleDOI
TL;DR: This article develops a real-time system for gathering URL features and is able to train an online classifier that detects malicious Web sites with 99% accuracy over a balanced dataset.
Abstract: Malicious Web sites are a cornerstone of Internet criminal activities. The dangers of these sites have created a demand for safeguards that protect end-users from visiting them. This article explores how to detect malicious Web sites from the lexical and host-based features of their URLs. We show that this problem lends itself naturally to modern algorithms for online learning. Online algorithms not only process large numbers of URLs more efficiently than batch algorithms, they also adapt more quickly to new features in the continuously evolving distribution of malicious URLs. We develop a real-time system for gathering URL features and pair it with a real-time feed of labeled URLs from a large Web mail provider. From these features and labels, we are able to train an online classifier that detects malicious Web sites with 99% accuracy over a balanced dataset.
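As a rough illustration of the online-learning setup described in this abstract (not the authors' actual feature pipeline or learning algorithm), one could stream hashed lexical features of URLs into an incrementally trained linear classifier. The feature choices, parameters, and URLs below are hypothetical.

```python
# Minimal sketch of online classification of URLs from lexical features.
# Illustrative only; the paper's system gathers richer lexical and host-based
# features in real time and may use a different online learning algorithm.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Hash character n-grams of the raw URL string into a sparse feature vector.
vectorizer = HashingVectorizer(analyzer="char_wb", ngram_range=(3, 5),
                               n_features=2**20, alternate_sign=False)
clf = SGDClassifier(alpha=1e-6)

def update(url_batch, label_batch):
    """One online update from a batch of labeled URLs (1 = malicious)."""
    X = vectorizer.transform(url_batch)
    clf.partial_fit(X, label_batch, classes=[0, 1])

def predict(urls):
    return clf.predict(vectorizer.transform(urls))

# Hypothetical usage with a streaming feed of labeled URLs:
update(["http://example.com/login",
        "http://phish.example.biz/verify?acct=1"], [0, 1])
print(predict(["http://phish.example.biz/verify?acct=2"]))
```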

216 citations


Proceedings Article
14 Jun 2011
TL;DR: This paper explores a generalization of conditional random fields (CRFs) in which binary stochastic hidden units appear between the data and the labels, and derives efficient algorithms for inference and learning in these models by observing that the hidden units are conditionally independent given the data and the labels.
Abstract: The paper explores a generalization of conditional random fields (CRFs) in which binary stochastic hidden units appear between the data and the labels. Hidden-unit CRFs are potentially more powerful than standard CRFs because they can represent nonlinear dependencies at each frame. The hidden units in these models also learn to discover latent distributed structure in the data that improves classification. We derive efficient algorithms for inference and learning in these models by observing that the hidden units are conditionally independent given the data and the labels. Finally, we show that hidden-unit CRFs perform well in experiments on a range of tasks, including optical character recognition, text classification, protein structure prediction, and part-of-speech tagging.
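A rough sketch of the key computational point in this abstract: because the binary hidden units are conditionally independent given the data and the label, summing them out at each frame factorizes into a product of per-unit two-term sums (softplus terms in the log domain). The specific parameterization below (per-unit input weights W and label couplings V) is an assumption modeled on the abstract, not the paper's definitive form.

```python
import numpy as np

def frame_label_scores(x_t, W, V):
    """Per-frame score of each label after summing out K binary hidden units.

    x_t : (d,) observation at frame t
    W   : (K, d) hidden-unit input weights      (assumed parameterization)
    V   : (K, L) hidden-unit/label couplings    (assumed parameterization)

    The sum over 2^K hidden configurations collapses to K independent
    two-term sums, i.e. a sum of softplus terms in the log domain.
    """
    a = W @ x_t                                            # (K,) per-unit activations
    # log( sum over z_k in {0,1} of exp(z_k * (a_k + V[k, y])) ) = softplus(...)
    return np.logaddexp(0.0, a[:, None] + V).sum(axis=0)   # (L,) one score per label

# Hypothetical dimensions: d=10 features, K=50 hidden units, L=4 labels.
rng = np.random.default_rng(0)
scores = frame_label_scores(rng.normal(size=10),
                            rng.normal(size=(50, 10)),
                            rng.normal(size=(50, 4)))
print(scores.shape)   # (4,)
```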

79 citations


Proceedings Article
12 Dec 2011
TL;DR: This paper uses maximum covariance unfolding (MCU), a manifold learning algorithm for simultaneous dimensionality reduction of data from different input modalities, to analyze EEG-fMRI data and develops a fast implementation based on ideas from spectral graph theory.
Abstract: We propose maximum covariance unfolding (MCU), a manifold learning algorithm for simultaneous dimensionality reduction of data from different input modalities. Given high dimensional inputs from two different but naturally aligned sources, MCU computes a common low dimensional embedding that maximizes the cross-modal (inter-source) correlations while preserving the local (intra-source) distances. In this paper, we explore two applications of MCU. First we use MCU to analyze EEG-fMRI data, where an important goal is to visualize the fMRI voxels that are most strongly correlated with changes in EEG traces. To perform this visualization, we augment MCU with an additional step for metric learning in the high dimensional voxel space. Second, we use MCU to perform cross-modal retrieval of matched image and text samples from Wikipedia. To manage large applications of MCU, we develop a fast implementation based on ideas from spectral graph theory. These ideas transform the original problem for MCU, one of semidefinite programming, into a simpler problem in semidefinite quadratic linear programming.
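One natural way to write the optimization sketched in this abstract, modeled on maximum variance unfolding; the paper's exact objective and constraints may differ. Here K is the Gram matrix of the joint embedding of the two aligned modalities, and its off-diagonal block K_xy carries the cross-modal covariance.

```latex
% Sketch of an MCU-style semidefinite program (assumed form, modeled on MVU):
% maximize the cross-modal covariance carried by the off-diagonal block of a
% joint Gram matrix K, while preserving local distances within each modality.
\begin{align*}
\max_{K \succeq 0}\quad & \operatorname{tr}(K_{xy}) \\
\text{s.t.}\quad
  & K_{ii} - 2K_{ij} + K_{jj} = \|x_i - x_j\|^2, && (i,j) \in \mathcal{N}_x
      \;\text{(neighbors in modality 1)} \\
  & K_{ii} - 2K_{ij} + K_{jj} = \|y_i - y_j\|^2, && (i,j) \in \mathcal{N}_y
      \;\text{(neighbors in modality 2)} \\
  & \textstyle\sum_{ij} K_{ij} = 0 && \text{(centering)}
\end{align*}
```

The fast implementation mentioned in the abstract reformulates such a semidefinite program as a smaller semidefinite quadratic linear program using ideas from spectral graph theory.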

31 citations


Proceedings ArticleDOI
21 Oct 2011
TL;DR: By analyzing the full content of individual Web pages, this work more than halves the error rate obtained by a comparably trained classifier that only extracts features from URLs.
Abstract: The physical world is rife with cues that allow us to distinguish between safe and unsafe situations. By contrast, the Internet offers a much more ambiguous environment; hence many users are unable to distinguish a scam from a legitimate Web page. To help address this problem, we explore how to train classifiers that can automatically identify malicious Web pages based on clues from their textual content, structural tags, page links, visual appearance, and URLs. Using a contemporary labeled data feed from a large Web mail provider, we extract such features and demonstrate how they can be used to improve classification accuracy over previous, more constrained approaches. In particular, by analyzing the full content of individual Web pages, we more than halve the error rate obtained by a comparably trained classifier that only extracts features from URLs. By training classifiers on different sets of features, we are further able to assess the strength of clues provided by these different sources of information.
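A rough sketch of how several feature views of a page could be combined into one classifier, in the spirit of this abstract. The views, field names, and toy pages below are hypothetical; the paper's feature set (textual content, structural tags, page links, visual appearance, URLs) is richer.

```python
# Sketch: combining several hashed feature views of a Web page.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from scipy.sparse import hstack

views = {
    "url":   HashingVectorizer(analyzer="char_wb", ngram_range=(3, 5), n_features=2**18),
    "text":  HashingVectorizer(n_features=2**18),
    "tags":  HashingVectorizer(n_features=2**16),   # e.g. space-joined HTML tag names
    "links": HashingVectorizer(n_features=2**16),   # e.g. domains of outgoing links
}

def featurize(pages):
    """pages: list of dicts with keys matching the views above."""
    return hstack([vec.transform([p[name] for p in pages])
                   for name, vec in views.items()]).tocsr()

clf = LogisticRegression(max_iter=1000)

# Hypothetical training data: two toy pages with binary labels (1 = malicious).
pages = [
    {"url": "http://example.com/", "text": "welcome to our store",
     "tags": "html head body div a", "links": "example.com"},
    {"url": "http://verify-account.example.biz/", "text": "confirm your password now",
     "tags": "html body form input", "links": "example.biz"},
]
clf.fit(featurize(pages), [0, 1])
print(clf.predict(featurize(pages)))
```

Training separate classifiers on each view, as the abstract describes, is then a matter of fitting one model per vectorizer instead of stacking them.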

29 citations


Posted Content
TL;DR: A recently proposed family of positive-definite kernels that mimic the computation in large neural networks is investigated using tools from differential geometry; specifically, the geometry of surfaces in Hilbert space that are induced by these kernels is analyzed.
Abstract: We investigate a recently proposed family of positive-definite kernels that mimic the computation in large neural networks. We examine the properties of these kernels using tools from differential geometry; specifically, we analyze the geometry of surfaces in Hilbert space that are induced by these kernels. When this geometry is described by a Riemannian manifold, we derive results for the metric, curvature, and volume element. Interestingly, though, we find that the simplest kernel in this family does not admit such an interpretation. We explore two variations of these kernels that mimic computation in neural networks with different activation functions. We experiment with these new
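The family referred to here appears to be the arc-cosine kernels of Cho and Saul. Below is a minimal numpy sketch of the degree-0 and degree-1 members, assuming their standard closed forms; the paper's variations for other activation functions are not reproduced.

```python
import numpy as np

def arc_cosine_kernel(x, y, n=1):
    """Arc-cosine kernel of degree n (only n in {0, 1} sketched here).

    These kernels mimic infinitely wide one-layer networks with threshold
    (n=0) or rectified-linear (n=1) activations. Closed forms assumed:
    k_n(x, y) = (1/pi) * |x|^n * |y|^n * J_n(theta), with
    J_0 = pi - theta and J_1 = sin(theta) + (pi - theta) * cos(theta).
    """
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    cos_t = np.clip(x @ y / (nx * ny), -1.0, 1.0)
    theta = np.arccos(cos_t)
    if n == 0:
        J = np.pi - theta
    elif n == 1:
        J = np.sin(theta) + (np.pi - theta) * cos_t
    else:
        raise NotImplementedError("only degrees 0 and 1 sketched here")
    return (1.0 / np.pi) * (nx * ny) ** n * J

x, y = np.array([1.0, 0.5]), np.array([0.2, 0.9])
print(arc_cosine_kernel(x, y, n=0), arc_cosine_kernel(x, y, n=1))
```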

20 citations


Proceedings ArticleDOI
21 Oct 2011
TL;DR: This paper shows how to discover clusters of abuse tasks using latent Dirichlet allocation (LDA), an unsupervised method for topic modeling in large corpora of text.
Abstract: Web services such as Google, Facebook, and Twitter are recurring victims of abuse, and their plight will only worsen as more attackers are drawn to their large user bases. Many attackers hire cheap, human labor to actualize their schemes, connecting with potential workers via crowdsourcing and freelancing sites such as Mechanical Turk and Freelancer.com. To identify solicitations for abuse jobs, these Web sites need ways to distinguish these tasks from ordinary jobs. In this paper, we show how to discover clusters of abuse tasks using latent Dirichlet allocation (LDA), an unsupervised method for topic modeling in large corpora of text. Applying LDA to hundreds of thousands of unlabeled job postings from Freelancer.com, we find that it discovers clusters of related abuse jobs and identifies the prevalent words that distinguish them. Finally, we use the clusters from LDA to profile the population of workers who bid on abuse jobs and the population of buyers who post their project descriptions.
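A small sketch of the topic-modeling step described here, using scikit-learn's LDA to surface clusters of abuse-related tasks. The postings below are hypothetical stand-ins for the Freelancer.com data; the number of topics and other parameters are illustrative.

```python
# Sketch: topic modeling of job postings with LDA, then inspecting the most
# probable words per topic to spot clusters of abuse-related jobs.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

postings = [
    "need 500 gmail accounts created with phone verification",
    "post my ad to 1000 craigslist cities daily",
    "design a logo for my coffee shop",
    "write product descriptions for an online store",
    "bulk facebook likes and followers needed asap",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(postings)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(X)          # per-posting topic mixtures

# The prevalent words of each topic are how related jobs can be characterized.
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-5:][::-1]
    print(f"topic {k}:", ", ".join(terms[i] for i in top))
```

The per-posting topic mixtures (doc_topics) can then be joined with worker and buyer metadata to profile who bids on and who posts the abuse-related clusters, as the abstract describes.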

13 citations


Posted Content
TL;DR: This work shows how to incorporate information from labeled examples into nonnegative matrix factorization (NMF), a popular unsupervised learning algorithm for dimensionality reduction, and finds that the low dimensional representations discovered yield more accurate classifiers than both ordinary and transductive SVMs trained in the original input space.
Abstract: We show how to incorporate information from labeled examples into nonnegative matrix factorization (NMF), a popular unsupervised learning algorithm for dimensionality reduction. In addition to mapping the data into a space of lower dimensionality, our approach aims to preserve the nonnegative components of the data that are important for classification. We identify these components from the support vectors of large-margin classifiers and derive iterative updates to preserve them in a semi-supervised version of NMF. These updates have a simple multiplicative form like their unsupervised counterparts; they are also guaranteed at each iteration to decrease their loss function, a weighted sum of I-divergences that captures the trade-off between unsupervised and supervised learning. We evaluate these updates for dimensionality reduction when they are used as a precursor to linear classification. In this role, we find that they yield much better performance than their unsupervised counterparts. We also find one unexpected benefit of the low dimensional representations discovered by our approach: often they yield more accurate classifiers than both ordinary and transductive SVMs trained in the original input space.
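For context, the standard unsupervised multiplicative updates for NMF under the I-divergence are sketched below. The paper's semi-supervised variant adds weighted I-divergence terms that preserve components identified from the support vectors of large-margin classifiers; those extra terms are not reproduced here.

```python
import numpy as np

def nmf_i_divergence(V, r, n_iter=200, eps=1e-9, seed=0):
    """Standard (unsupervised) multiplicative updates for NMF, V ~ W @ H,
    minimizing the I-divergence. Shown only as a baseline for the
    semi-supervised updates described in the abstract above."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + eps
    H = rng.random((r, m)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (np.ones_like(V) @ H.T + eps)
    return W, H

V = np.abs(np.random.default_rng(0).normal(size=(20, 30)))
W, H = nmf_i_divergence(V, r=5)
print(W.shape, H.shape)   # (20, 5) (5, 30)
```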

11 citations


Dissertation
01 Jan 2011
TL;DR: This dissertation explores the use of sequential, mistake-driven updates for online learning and acoustic feature adaptation in large margin HMMs, and finds that online updates for large margin training not only converge faster than analogous batch optimizations, but also yield lower phone error rates than approaches that do not attempt to enforce a large margin.
Abstract: Over the last two decades, large margin methods have yielded excellent performance on many tasks. The theoretical properties of large margin methods have been intensively studied and are especially well-established for support vector machines (SVMs). However, the scalability of large margin methods remains an issue due to the amount of computation they require. This is especially true for applications involving sequential data. In this thesis we are motivated by the problem of automatic speech recognition (ASR) whose large-scale applications involve training and testing on extremely large data sets. The acoustic models used in ASR are based on continuous-density hidden Markov models (CD-HMMs). Researchers in ASR have focused on discriminative training of HMMs, which leads to models with significantly lower error rates. More recently, building on the successes of SVMs and various extensions thereof in the machine learning community, a number of researchers in ASR have also explored large margin methods for discriminative training of HMMs. This dissertation aims to apply various large margin methods developed in the machine learning community to the challenging large-scale problems that arise in ASR. Specifically, we explore the use of sequential, mistake-driven updates for online learning and acoustic feature adaptation in large margin HMMs. The updates are applied to the parameters of acoustic models after the decoding of individual training utterances. For large margin training, the updates attempt to separate the log-likelihoods of correct and incorrect transcriptions by an amount proportional to their Hamming distance. For acoustic feature adaptation, the updates attempt to improve recognition by linearly transforming the features computed by the front end. We evaluate acoustic models trained in this way on the TIMIT speech database. We find that online updates for large margin training not only converge faster than analogous batch optimizations, but also yield lower phone error rates than approaches that do not attempt to enforce a large margin. We conclude this thesis with a discussion of future research directions, highlighting in particular the challenges of scaling our approach to the most difficult problems in large-vocabulary continuous speech recognition.
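A schematic of the mistake-driven, margin-based update described in this abstract, with the acoustic model abstracted into a generic parameter vector and scoring function. The decoder, scoring functions, margin scale, and step size below are placeholders, not the thesis's actual parameterization.

```python
def online_large_margin_update(theta, utterance, correct, decode, score, grad_score,
                               rho=1.0, lr=0.01):
    """One mistake-driven update after decoding a single training utterance.

    theta      : model parameters, e.g. acoustic model parameters (flat array)
    correct    : the correct transcription (sequence of phone labels)
    decode     : decode(theta, utterance) -> hypothesized transcription
    score      : score(theta, utterance, transcription) -> log-likelihood
    grad_score : gradient of score with respect to theta
    rho        : required margin per unit of Hamming distance
    All of these callables are placeholders for the HMM-specific machinery.
    """
    hyp = decode(theta, utterance)
    hamming = sum(a != b for a, b in zip(correct, hyp))
    margin = rho * hamming
    violation = margin - (score(theta, utterance, correct)
                          - score(theta, utterance, hyp))
    if violation > 0:   # the margin is not met: take a corrective step
        theta = theta + lr * (grad_score(theta, utterance, correct)
                              - grad_score(theta, utterance, hyp))
    return theta
```

Applied after the decoding of each individual training utterance, updates of this general shape give the sequential, online behavior that the thesis contrasts with batch optimization.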

3 citations