Showing papers by "Richard Lethin published in 2018"

PDF

Open Access

Proceedings Article•DOI•

All-at-once Decomposition of Coupled Billion-scale Tensors in Apache Spark

[...]

Aditya Gudibanda, Thomas Henretty, Muthu Baskaran, James Ezick, Richard Lethin - Show less +1 more

01 Sep 2018

TL;DR: Results are presented showing the billion-scale scalability of this novel implementation and the high level of interpretability in the components produced, suggesting that coupled, all-at-once tensor decompositions on Apache Spark represent a promising framework for large-scale, unsupervised pattern discovery.

...read moreread less

Abstract: As the scale of unlabeled data rises, it becomes increasingly valuable to perform scalable, unsupervised data analysis. Tensor decompositions, which have been empirically successful at finding meaningful cross-dimensional patterns in multidimensional data, are a natural candidate to test for scalability and meaningful pattern discovery in these massive real-world datasets. Furthermore, the production of big data of different types necessitates the ability to mine patterns across disparate sources. The coupled tensor decomposition framework captures this idea by supporting the decomposition of several tensors from different data sources together. We present a scalable implementation of coupled tensor decomposition on Apache Spark. We introduce nonnegativity and sparsity constraints, and perform all-at-once quasi-Newton optimization of all factor matrix parameters. We present results showing the billion-scale scalability of this novel implementation and also demonstrate the high level of interpretability in the components produced, suggesting that coupled, all-at-once tensor decompositions on Apache Spark represent a promising framework for large-scale, unsupervised pattern discovery.

...read moreread less

6 citations

Proceedings Article•DOI•

Computationally Efficient CP Tensor Decomposition Update Framework for Emerging Component Discovery in Streaming Data

[...]

Pierre-David Letourneau, Muthu Baskaran, Thomas Henretty, James Ezick, Richard Lethin - Show less +1 more

01 Sep 2018

TL;DR: A large-scale implementation of the proposed scheme integrated within the ENSIGN tensor analysis package is discussed, and the performance of the framework is evaluated, in terms of computational efficiency and ability to discover emerging components, on a real cyber dataset.

...read moreread less

Abstract: We present Streaming CP Update, an algorithmic framework for updating CP tensor decompositions that possesses the capability of identifying emerging components and can produce decompositions of large, sparse tensors streaming along multiple modes at a low computational cost. We discuss a large-scale implementation of the proposed scheme integrated within the ENSIGN tensor analysis package, and we evaluate and demonstrate the performance of the framework, in terms of computational efficiency and ability to discover emerging components, on a real cyber dataset.

...read moreread less

6 citations

Proceedings Article•DOI•

Accelerating Dijkstra's Algorithm Using Multiresolution Priority Queues

[...]

Jordi Ros-Giralt, Alan Commike, Peter Cullen, Richard Lethin

01 Sep 2018

TL;DR: Dijkstra's multiresolution algorithm delivers an amortized cost of $O(\vert V\vert +\vert E\vert)$, where $V and $E$ are the set of vertices and edges, respectively.

...read moreread less

Abstract: Multiresolution priority queues are recently introduced data structures that can trade-off a small bounded error for faster performance. When used to implement the frontier set in label setting algorithms, they provide a new mathematical approach to classic graph problems such as the computation of shortest paths or minimum spanning trees. To understand how they work, this paper presents a mathematical study of multiresolution label setting algorithms. The theory is general in that the classic mathematical results respond to the particular case where the problem's resolution is infinitely small. The concept of multiresolution helps break the information theoretic barriers of the problem by achieving lower time complexity at the cost of introducing a bounded error. Such error is proven independent of the graph size, a relevant feature in uniform-cost search algorithms where graphs can be infinitely large. Properly tuned, Dijkstra's multiresolution algorithm delivers an amortized cost of $O(\vert V\vert +\vert E\vert)$ , where $V$ and $E$ are the set of vertices and edges, respectively. Benchmarks show speedups of 53% and 26% when applied to Dijkstra's and A* algorithms on a graph of US roads with 87,575 vertices and 121,961 edges.

...read moreread less

2 citations

Proceedings Article•DOI•

Topic modeling for analysis of big data tensor decompositions

[...]

Thomas Henretty, M. Harper Langston, Muthu Baskaran, James Ezick, Richard Lethin - Show less +1 more

09 May 2018

TL;DR: This work presents an approach to automated component clustering and classification based on the Latent Dirichlet Allocation (LDA) topic modeling technique and shows example applications to representative cybersecurity and geospatial datasets.

...read moreread less

Abstract: Tensor decompositions are a class of algorithms used for unsupervised pattern discovery. Structured, multidimensional datasets are encoded as tensors and decomposed into discrete, coherent patterns captured as weighted collections of high-dimensional vectors known as components. Tensor decompositions have recently shown promising results when addressing problems related to data comprehension and anomaly discovery in cybersecurity and intelligence analysis. However, analysis of Big Data tensor decompositions is currently a critical bottleneck owing to the volume and variety of unlabeled patterns that are produced. We present an approach to automated component clustering and classification based on the Latent Dirichlet Allocation (LDA) topic modeling technique and show example applications to representative cybersecurity and geospatial datasets.

...read moreread less

2 citations

Proceedings Article•DOI•

Fast Detection of Elephant Flows with Dirichlet-Categorical Inference

[...]

Aditya Gudibanda, Jordi Ros-Giralt, Alan Commike, Richard Lethin

01 Nov 2018

TL;DR: The proposed method of Dirichlet-Categorical inference provides a novel, powerful framework to elephant flow detection that is both highly accurate and probabilistically meaningful.

...read moreread less

Abstract: The problem of elephant flow detection is a longstanding research area with the goal of quickly identifying flows in a network that are large enough to affect the quality of service of smaller flows. Past work in this field has largely been either domain-specific, based on thresholds for a specific flow size metric, or required several hyperparameters, reducing their ease of adaptation to the great variety of traffic distributions present in real-world networks. In this paper, we present an approach to elephant flow detection that avoids these limitations, utilizing the rigorous framework of Bayesian inference. By observing packets sampled from the network, we use \textit{Dirichlet-Categorical inference} to calculate a posterior distribution that explicitly captures our uncertainty about the sizes of each flow. We then use this posterior distribution to find the most likely subset of elephant flows under this probabilistic model. Our algorithm rapidly converges to the optimal sampling rate at a speed $O(1/n)$, where $n$ is the number of packet samples received, and the only hyperparameter required is the targeted \textit{detection likelihood}, defined as the probability of correctly inferring all the elephant flows. Compared to the state-of-the-art based on static sampling rate, we show a reduction in error rate by a factor of 20 times. The proposed method of Dirichlet-Categorical inference provides a novel, powerful framework to elephant flow detection that is both highly accurate and probabilistically meaningful.

...read moreread less

2 citations

Journal Article•DOI•

Algorithms and data structures to accelerate network analysis

[...]

Jordi Ros-Giralt, Alan Commike, Peter Cullen, Richard Lethin

01 Sep 2018-Future Generation Computer Systems

TL;DR: In this article, the authors present five algorithms and data structures (long queue emulation, lockless bimodal queues, tail early dropping, LFN tables, and multiresolution priority queues) designed to optimize the process of analyzing network traffic.

...read moreread less

1 citations

Patent•

Systems and methods for providing lockless bimodal queues for selective packet capture

[...]

Jordi Ros-Giralt, Alan Commike, Peter Cullen, Richard Lethin

13 Nov 2018

TL;DR: In this article, a fixed-size storage is used for both the producer and the consumer to access the storage without locking it and, to facilitate selective consumption of the packets by the consumer, the consumer can transition between awake and sleep modes where the packets are consumed in the awake mode only.

...read moreread less

Abstract: In a network system, an application receiving packets can consume one or more packets in two or more stages, where the second and the later stages can selectively consume some but not all of the packets consumed by the preceding stage. Packets are transferred between two consecutive stages, called producer and consumer, via a fixed-size storage. Both the producer and the consumer can access the storage without locking it and, to facilitate selective consumption of the packets by the consumer, the consumer can transition between awake and sleep modes, where the packets are consumed in the awake mode only. The producer may also switch between awake and sleep modes. Lockless access is made possible by controlling the operation of the storage by the producer and the consumer both according to the mode of the consumer, which is communicated via a shared memory location.

...read moreread less