
Showing papers by "Richard Lethin published in 2018"


Proceedings ArticleDOI
01 Sep 2018
TL;DR: Results are presented showing the billion-scale scalability of this novel implementation and the high level of interpretability in the components produced, suggesting that coupled, all-at-once tensor decompositions on Apache Spark represent a promising framework for large-scale, unsupervised pattern discovery.
Abstract: As the scale of unlabeled data rises, it becomes increasingly valuable to perform scalable, unsupervised data analysis. Tensor decompositions, which have been empirically successful at finding meaningful cross-dimensional patterns in multidimensional data, are a natural candidate to test for scalability and meaningful pattern discovery in these massive real-world datasets. Furthermore, the production of big data of different types necessitates the ability to mine patterns across disparate sources. The coupled tensor decomposition framework captures this idea by supporting the decomposition of several tensors from different data sources together. We present a scalable implementation of coupled tensor decomposition on Apache Spark. We introduce nonnegativity and sparsity constraints, and perform all-at-once quasi-Newton optimization of all factor matrix parameters. We present results showing the billion-scale scalability of this novel implementation and also demonstrate the high level of interpretability in the components produced, suggesting that coupled, all-at-once tensor decompositions on Apache Spark represent a promising framework for large-scale, unsupervised pattern discovery.
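To make the underlying model concrete: a CP decomposition approximates a tensor as a sum of rank-one components. The paper performs all-at-once quasi-Newton optimization on Apache Spark; the toy sketch below instead uses plain alternating least squares with clipping to impose nonnegativity on a small dense NumPy tensor, purely to illustrate what a nonnegative CP decomposition computes. All function names here are illustrative, not from ENSIGN or the paper.

```python
import numpy as np

def unfold(T, mode):
    """Matricize a tensor along the given mode (C-order columns)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(A, B):
    """Column-wise Khatri-Rao product of A (I x R) and B (J x R) -> (I*J x R)."""
    R = A.shape[1]
    return (A[:, None, :] * B[None, :, :]).reshape(-1, R)

def ncp_als(T, rank, iters=300, seed=0):
    """Nonnegative CP of a 3-way tensor via alternating least squares.

    Each sweep solves an unconstrained least-squares problem per mode and
    clips negatives to zero -- a crude stand-in for the constrained
    quasi-Newton optimization the paper actually uses.
    """
    rng = np.random.default_rng(seed)
    factors = [rng.random((dim, rank)) + 0.1 for dim in T.shape]
    for _ in range(iters):
        for mode in range(3):
            # Khatri-Rao of the two fixed factors, in original mode order,
            # matches the C-order unfolding convention used above.
            others = [factors[m] for m in range(3) if m != mode]
            M = khatri_rao(others[0], others[1])
            sol, *_ = np.linalg.lstsq(M, unfold(T, mode).T, rcond=None)
            factors[mode] = np.maximum(sol.T, 1e-9)  # nonnegativity by clipping
    return factors
```

On an exactly nonnegative low-rank tensor, this recovers a near-perfect fit; the paper's contribution is doing the coupled, constrained version of this at billion-nonzero scale on Spark.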

6 citations


Proceedings ArticleDOI
01 Sep 2018
TL;DR: A large-scale implementation of the proposed scheme integrated within the ENSIGN tensor analysis package is discussed, and the performance of the framework is evaluated, in terms of computational efficiency and ability to discover emerging components, on a real cyber dataset.
Abstract: We present Streaming CP Update, an algorithmic framework for updating CP tensor decompositions that possesses the capability of identifying emerging components and can produce decompositions of large, sparse tensors streaming along multiple modes at a low computational cost. We discuss a large-scale implementation of the proposed scheme integrated within the ENSIGN tensor analysis package, and we evaluate and demonstrate the performance of the framework, in terms of computational efficiency and ability to discover emerging components, on a real cyber dataset.
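The paper's Streaming CP Update algorithm is not reproduced here, but one standard building block of streaming CP methods is easy to show: when new slices arrive along the time mode, hold the non-temporal factor matrices fixed and solve a least-squares problem for the new time-factor rows. The sketch below is a minimal, hypothetical illustration of that step, not the paper's scheme (which also handles multiple streaming modes and emerging components).

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Khatri-Rao product of A (J x R) and B (K x R) -> (J*K x R)."""
    R = A.shape[1]
    return (A[:, None, :] * B[None, :, :]).reshape(-1, R)

def append_time_rows(new_slices, B, C):
    """Given new slices of shape (t, J, K) arriving along the time mode and
    fixed non-temporal factors B (J x R) and C (K x R), solve least squares
    for the t new rows of the time-mode factor matrix."""
    t = new_slices.shape[0]
    M = khatri_rao(B, C)              # (J*K) x R design matrix
    X = new_slices.reshape(t, -1)     # each row: one vectorized slice
    # Each new time row a_t minimizes ||x_t - M a_t||.
    A_new, *_ = np.linalg.lstsq(M, X.T, rcond=None)
    return A_new.T
```

For sparse, large tensors the paper's implementation avoids forming dense slices at all; this dense version only shows the algebra of the update.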

6 citations


Proceedings ArticleDOI
01 Sep 2018
TL;DR: Dijkstra's multiresolution algorithm delivers an amortized cost of $O(\vert V\vert + \vert E\vert)$, where $V$ and $E$ are the sets of vertices and edges, respectively.
Abstract: Multiresolution priority queues are recently introduced data structures that can trade off a small bounded error for faster performance. When used to implement the frontier set in label setting algorithms, they provide a new mathematical approach to classic graph problems such as the computation of shortest paths or minimum spanning trees. To understand how they work, this paper presents a mathematical study of multiresolution label setting algorithms. The theory is general in that the classic mathematical results correspond to the particular case where the problem's resolution is infinitely small. The concept of multiresolution helps break the information theoretic barriers of the problem by achieving lower time complexity at the cost of introducing a bounded error. Such error is proven independent of the graph size, a relevant feature in uniform-cost search algorithms where graphs can be infinitely large. Properly tuned, Dijkstra's multiresolution algorithm delivers an amortized cost of $O(\vert V\vert + \vert E\vert)$, where $V$ and $E$ are the sets of vertices and edges, respectively. Benchmarks show speedups of 53% and 26% when applied to Dijkstra's and A* algorithms on a graph of US roads with 87,575 vertices and 121,961 edges.
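The core idea can be sketched compactly: quantize priorities into buckets of a fixed resolution, so that push and pop touch a bucket rather than maintain a total order over all keys. Extracted keys can then be off by at most one resolution step, which is the bounded error the paper analyzes. The toy class below is a hypothetical illustration of that bucketing idea (not the paper's data structure), driven through a standard lazy-deletion Dijkstra.

```python
import heapq

class MultiresolutionPQ:
    """Toy priority queue that quantizes keys to a fixed resolution.

    Items within a bucket are unordered, so a popped key may exceed the true
    minimum by up to one resolution step -- the bounded error traded for speed.
    """
    def __init__(self, resolution):
        self.res = resolution
        self.buckets = {}   # quantized key -> list of (priority, item)
        self.keys = []      # heap of quantized keys currently nonempty
    def push(self, priority, item):
        q = int(priority // self.res)
        if q not in self.buckets:
            self.buckets[q] = []
            heapq.heappush(self.keys, q)
        self.buckets[q].append((priority, item))
    def pop(self):
        q = self.keys[0]
        entry = self.buckets[q].pop()
        if not self.buckets[q]:
            del self.buckets[q]
            heapq.heappop(self.keys)
        return entry
    def __bool__(self):
        return bool(self.keys)

def dijkstra(adj, src, resolution):
    """Dijkstra with lazy deletion over the multiresolution frontier.
    With resolution below the minimum edge weight, results are exact."""
    dist = {src: 0.0}
    pq = MultiresolutionPQ(resolution)
    pq.push(0.0, src)
    while pq:
        d, u = pq.pop()
        if d > dist.get(u, float('inf')):
            continue  # stale entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                pq.push(nd, v)
    return dist
```

With a coarser resolution, labels may settle slightly out of order, producing distances off by a bounded amount; that error/speed dial is exactly what the paper formalizes.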

2 citations


Proceedings ArticleDOI
09 May 2018
TL;DR: This work presents an approach to automated component clustering and classification based on the Latent Dirichlet Allocation (LDA) topic modeling technique and shows example applications to representative cybersecurity and geospatial datasets.
Abstract: Tensor decompositions are a class of algorithms used for unsupervised pattern discovery. Structured, multidimensional datasets are encoded as tensors and decomposed into discrete, coherent patterns captured as weighted collections of high-dimensional vectors known as components. Tensor decompositions have recently shown promising results when addressing problems related to data comprehension and anomaly discovery in cybersecurity and intelligence analysis. However, analysis of Big Data tensor decompositions is currently a critical bottleneck owing to the volume and variety of unlabeled patterns that are produced. We present an approach to automated component clustering and classification based on the Latent Dirichlet Allocation (LDA) topic modeling technique and show example applications to representative cybersecurity and geospatial datasets.
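The paper's component-clustering pipeline is not public, but the LDA technique it builds on can be shown in miniature. The sketch below is a minimal collapsed Gibbs sampler for LDA over bag-of-word-id "documents" (in the paper's setting, a document would be a discretized tensor component); all names are illustrative.

```python
import numpy as np

def lda_gibbs(docs, n_topics, n_vocab, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for Latent Dirichlet Allocation.

    docs: list of documents, each a list of integer word ids.
    Returns document-topic and topic-word count matrices, from which the
    usual posterior estimates are formed by normalizing with the priors.
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, n_topics))        # document-topic counts
    nkw = np.zeros((n_topics, n_vocab))  # topic-word counts
    nk = np.zeros(n_topics)              # total tokens per topic
    z = []                               # current topic assignment per token
    for d, doc in enumerate(docs):
        zd = rng.integers(n_topics, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]              # remove token, then resample its topic
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + n_vocab * beta)
                t = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    return ndk, nkw
```

Clustering then amounts to grouping documents (components) by their dominant topic in `ndk`, which is the automated labeling step the paper applies at scale.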

2 citations


Proceedings ArticleDOI
01 Nov 2018
TL;DR: The proposed method of Dirichlet-Categorical inference provides a novel, powerful framework for elephant flow detection that is both highly accurate and probabilistically meaningful.
Abstract: The problem of elephant flow detection is a longstanding research area with the goal of quickly identifying flows in a network that are large enough to affect the quality of service of smaller flows. Past work in this field has largely been either domain-specific, based on thresholds for a specific flow size metric, or required several hyperparameters, reducing their ease of adaptation to the great variety of traffic distributions present in real-world networks. In this paper, we present an approach to elephant flow detection that avoids these limitations, utilizing the rigorous framework of Bayesian inference. By observing packets sampled from the network, we use Dirichlet-Categorical inference to calculate a posterior distribution that explicitly captures our uncertainty about the sizes of each flow. We then use this posterior distribution to find the most likely subset of elephant flows under this probabilistic model. Our algorithm rapidly converges to the optimal sampling rate at a speed $O(1/n)$, where $n$ is the number of packet samples received, and the only hyperparameter required is the targeted detection likelihood, defined as the probability of correctly inferring all the elephant flows. Compared to the state of the art based on a static sampling rate, we show a reduction in error rate by a factor of 20. The proposed method of Dirichlet-Categorical inference provides a novel, powerful framework for elephant flow detection that is both highly accurate and probabilistically meaningful.
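The conjugacy at the heart of this approach is simple to state: with a Dirichlet prior over flow-share proportions and Categorical observations (sampled packets), the posterior is again a Dirichlet whose parameters are the prior plus the per-flow packet counts. The sketch below illustrates that update and a Monte Carlo estimate of per-flow elephant probability; the threshold definition and function names are illustrative assumptions, not the paper's exact decision rule.

```python
import numpy as np

def flow_posterior(counts, alpha=1.0, threshold=0.1, n_samples=20000, seed=0):
    """Dirichlet-Categorical inference on flow sizes from sampled packets.

    counts[i] -- packets sampled from flow i. The posterior over the vector
    of flow shares is Dirichlet(alpha + counts). We return the posterior
    mean share per flow and, via sampling, the posterior probability that
    each flow's share exceeds `threshold` (a hypothetical elephant cutoff).
    """
    rng = np.random.default_rng(seed)
    post = np.asarray(counts, dtype=float) + alpha  # conjugate update
    samples = rng.dirichlet(post, size=n_samples)   # draws from the posterior
    p_elephant = (samples > threshold).mean(axis=0)
    return post / post.sum(), p_elephant
```

Because the posterior concentrates at rate $O(1/n)$ in the number of samples, the probability estimates sharpen quickly, which is what lets the paper's algorithm tune its sampling rate against a single target detection likelihood.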

2 citations


Journal ArticleDOI
TL;DR: In this article, the authors present five algorithms and data structures (long queue emulation, lockless bimodal queues, tail early dropping, LFN tables, and multiresolution priority queues) designed to optimize the process of analyzing network traffic.

1 citation


Patent
13 Nov 2018
TL;DR: In this article, packets are transferred between producer and consumer stages via a fixed-size storage that both can access without locking it; to facilitate selective consumption of packets, the consumer can transition between awake and sleep modes, consuming packets in the awake mode only.
Abstract: In a network system, an application receiving packets can consume one or more packets in two or more stages, where the second and later stages can selectively consume some but not all of the packets consumed by the preceding stage. Packets are transferred between two consecutive stages, called producer and consumer, via a fixed-size storage. Both the producer and the consumer can access the storage without locking it and, to facilitate selective consumption of the packets by the consumer, the consumer can transition between awake and sleep modes, where the packets are consumed in the awake mode only. The producer may also switch between awake and sleep modes. Lockless access is made possible by controlling the operation of the storage by the producer and the consumer both according to the mode of the consumer, which is communicated via a shared memory location.
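The index discipline that makes such lockless access possible can be shown in a toy single-producer/single-consumer ring buffer: each side writes only its own index, and the consumer's mode is read from a shared flag. The single-threaded sketch below only illustrates that discipline; it is not the patented mechanism, and a real concurrent implementation would additionally need memory-ordering guarantees that plain Python does not express.

```python
class SPSCRing:
    """Toy single-producer/single-consumer ring buffer sketch.

    `tail` is written only by the producer and `head` only by the consumer,
    which is what makes a lockless scheme possible in real shared memory.
    `consumer_awake` stands in for the shared memory location through which
    the consumer's mode is communicated to the producer.
    """
    def __init__(self, capacity):
        # One slot is sacrificed so a full buffer is distinguishable from empty.
        self.buf = [None] * (capacity + 1)
        self.head = 0            # next slot to consume (consumer-owned)
        self.tail = 0            # next slot to fill (producer-owned)
        self.consumer_awake = True
    def produce(self, pkt):
        nxt = (self.tail + 1) % len(self.buf)
        if nxt == self.head:
            return False         # full: caller drops or backs off
        self.buf[self.tail] = pkt
        self.tail = nxt
        return True
    def consume(self):
        if not self.consumer_awake or self.head == self.tail:
            return None          # asleep, or nothing to consume
        pkt = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        return pkt
```

Selective, staged consumption as described in the patent would chain several such storages, with each stage consuming only the packets it needs from the previous one.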
Abstract: In a network system, an application receiving packets can consume one or more packets in two or more stages, where the second and the later stages can selectively consume some but not all of the packets consumed by the preceding stage. Packets are transferred between two consecutive stages, called producer and consumer, via a fixed-size storage. Both the producer and the consumer can access the storage without locking it and, to facilitate selective consumption of the packets by the consumer, the consumer can transition between awake and sleep modes, where the packets are consumed in the awake mode only. The producer may also switch between awake and sleep modes. Lockless access is made possible by controlling the operation of the storage by the producer and the consumer both according to the mode of the consumer, which is communicated via a shared memory location.