scispace - formally typeset
Search or ask a question

Showing papers on "Graph (abstract data type) published in 2016"


Posted Content
TL;DR: A scalable approach for semi-supervised learning on graph-structured data that is based on an efficient variant of convolutional neural networks which operate directly on graphs which outperforms related methods by a significant margin.
Abstract: We present a scalable approach for semi-supervised learning on graph-structured data that is based on an efficient variant of convolutional neural networks which operate directly on graphs. We motivate the choice of our convolutional architecture via a localized first-order approximation of spectral graph convolutions. Our model scales linearly in the number of graph edges and learns hidden layer representations that encode both local graph structure and features of nodes. In a number of experiments on citation networks and on a knowledge graph dataset we demonstrate that our approach outperforms related methods by a significant margin.

15,696 citations


Posted Content
TL;DR: The variational graph auto-encoder (VGAE) is introduced, a framework for unsupervised learning on graph-structured data based on the variational auto- Encoder (VAE) that can naturally incorporate node features, which significantly improves predictive performance on a number of benchmark datasets.
Abstract: We introduce the variational graph auto-encoder (VGAE), a framework for unsupervised learning on graph-structured data based on the variational auto-encoder (VAE). This model makes use of latent variables and is capable of learning interpretable latent representations for undirected graphs. We demonstrate this model using a graph convolutional network (GCN) encoder and a simple inner product decoder. Our model achieves competitive results on a link prediction task in citation networks. In contrast to most existing models for unsupervised learning on graph-structured data and link prediction, our model can naturally incorporate node features, which significantly improves predictive performance on a number of benchmark datasets.

2,044 citations


Proceedings Article
19 Jun 2016
TL;DR: In this article, a semi-supervised learning framework based on graph embeddings is proposed, where given a graph between instances, an embedding for each instance is trained to jointly predict the class label and the neighborhood context in the graph.
Abstract: We present a semi-supervised learning framework based on graph embeddings. Given a graph between instances, we train an embedding for each instance to jointly predict the class label and the neighborhood context in the graph. We develop both transductive and inductive variants of our method. In the transductive variant of our method, the class labels are determined by both the learned embeddings and input feature vectors, while in the inductive variant, the embeddings are defined as a parametric function of the feature vectors, so predictions can be made on instances not seen during training. On a large and diverse set of benchmark tasks, including text classification, distantly supervised entity extraction, and entity classification, we show improved performance over many of the existing models.

1,012 citations


Journal ArticleDOI
TL;DR: Molecular graph convolutions are described, a machine learning architecture for learning from undirected graphs, specifically small molecules, that represent a new paradigm in ligand-based virtual screening with exciting opportunities for future improvement.
Abstract: Molecular "fingerprints" encoding structural information are the workhorse of cheminformatics and machine learning in drug discovery applications. However, fingerprint representations necessarily emphasize particular aspects of the molecular structure while ignoring others, rather than allowing the model to make data-driven decisions. We describe molecular "graph convolutions", a machine learning architecture for learning from undirected graphs, specifically small molecules. Graph convolutions use a simple encoding of the molecular graph---atoms, bonds, distances, etc.---which allows the model to take greater advantage of information in the graph structure. Although graph convolutions do not outperform all fingerprint-based methods, they (along with other graph-based methods) represent a new paradigm in ligand-based virtual screening with exciting opportunities for future improvement.

974 citations


Proceedings Article
12 Feb 2016
TL;DR: A novel model for learning graph representations, which generates a low-dimensional vector representation for each vertex by capturing the graph structural information directly, and which outperforms other stat-of-the-art models in such tasks.
Abstract: In this paper, we propose a novel model for learning graph representations, which generates a low-dimensional vector representation for each vertex by capturing the graph structural information. Different from other previous research efforts, we adopt a random surfing model to capture graph structural information directly, instead of using the sampling-based method for generating linear sequences proposed by Perozzi et al. (2014). The advantages of our approach will be illustrated from both theorical and empirical perspectives. We also give a new perspective for the matrix factorization method proposed by Levy and Goldberg (2014), in which the pointwise mutual information (PMI) matrix is considered as an analytical solution to the objective function of the skip-gram model with negative sampling proposed by Mikolov et al. (2013). Unlike their approach which involves the use of the SVD for finding the low-dimensitonal projections from the PMI matrix, however, the stacked denoising autoencoder is introduced in our model to extract complex features and model non-linearities. To demonstrate the effectiveness of our model, we conduct experiments on clustering and visualization tasks, employing the learned vertex representations as features. Empirical results on datasets of varying sizes show that our model outperforms other stat-of-the-art models in such tasks.

919 citations


Proceedings Article
04 Nov 2016
TL;DR: This paper used a larger but more thoroughly regularized parser with biaffine classifiers to predict arcs and labels, achieving 95.7% UAS and 94.1% LAS on the most popular English PTB dataset.
Abstract: This paper builds off recent work from Kiperwasser & Goldberg (2016) using neural attention in a simple graph-based dependency parser. We use a larger but more thoroughly regularized parser than other recent BiLSTM-based approaches, with biaffine classifiers to predict arcs and labels. Our parser gets state of the art or near state of the art performance on standard treebanks for six different languages, achieving 95.7% UAS and 94.1% LAS on the most popular English PTB dataset. This makes it the highest-performing graph-based parser on this benchmark---outperforming Kiperwasser Goldberg (2016) by 1.8% and 2.2%---and comparable to the highest performing transition-based parser (Kuncoro et al., 2016), which achieves 95.8% UAS and 94.6% LAS. We also show which hyperparameter choices had a significant effect on parsing accuracy, allowing us to achieve large gains over other graph-based approaches.

841 citations


Journal ArticleDOI
TL;DR: In this article, molecular graph convolutions, a machine learning architecture for learning from undirected graphs, specifically small molecules, are described. But they do not outperform all fingerprint-based methods, and they represent a new paradigm in ligand-based virtual screening with exciting opportunities for future improvement.
Abstract: Molecular “fingerprints” encoding structural information are the workhorse of cheminformatics and machine learning in drug discovery applications. However, fingerprint representations necessarily emphasize particular aspects of the molecular structure while ignoring others, rather than allowing the model to make data-driven decisions. We describe molecular graph convolutions, a machine learning architecture for learning from undirected graphs, specifically small molecules. Graph convolutions use a simple encoding of the molecular graph—atoms, bonds, distances, etc.—which allows the model to take greater advantage of information in the graph structure. Although graph convolutions do not outperform all fingerprint-based methods, they (along with other graph-based methods) represent a new paradigm in ligand-based virtual screening with exciting opportunities for future improvement.

796 citations


Journal ArticleDOI
TL;DR: The Stanford Network Analysis Platform (SNAP) as mentioned in this paper is a general-purpose, high-performance system that provides easy-to-use, highlevel operations for analysis and manipulation of large networks.
Abstract: Large networks are becoming a widely used abstraction for studying complex systems in a broad set of disciplines, ranging from social-network analysis to molecular biology and neuroscience. Despite an increasing need to analyze and manipulate large networks, only a limited number of tools are available for this task. Here, we describe the Stanford Network Analysis Platform (SNAP), a general-purpose, high-performance system that provides easy-to-use, high-level operations for analysis and manipulation of large networks. We present SNAP functionality, describe its implementational details, and give performance benchmarks. SNAP has been developed for single big-memory machines, and it balances the trade-off between maximum performance, compact in-memory graph representation, and the ability to handle dynamic graphs in which nodes and edges are being added or removed over time. SNAP can process massive networks with hundreds of millions of nodes and billions of edges. SNAP offers over 140 different graph algorithms that can efficiently manipulate large graphs, calculate structural properties, generate regular and random graphs, and handle attributes and metadata on nodes and edges. Besides being able to handle large graphs, an additional strength of SNAP is that networks and their attributes are fully dynamic; they can be modified during the computation at low cost. SNAP is provided as an open-source library in C++ as well as a module in Python. We also describe the Stanford Large Network Dataset, a set of social and information real-world networks and datasets, which we make publicly available. The collection is a complementary resource to our SNAP software and is widely used for development and benchmarking of graph analytics algorithms.

689 citations


Journal ArticleDOI
04 Nov 2016-Science
TL;DR: It is shown that an optical processing approach based on a network of coupled optical pulses in a ring fiber can be used to model and optimize large-scale Ising systems, and a coherent Ising machine outperformed simulated annealing in terms of accuracy and computation time for a 2000-node complete graph.
Abstract: The analysis and optimization of complex systems can be reduced to mathematical problems collectively known as combinatorial optimization. Many such problems can be mapped onto ground-state search problems of the Ising model, and various artificial spin systems are now emerging as promising approaches. However, physical Ising machines have suffered from limited numbers of spin-spin couplings because of implementations based on localized spins, resulting in severe scalability problems. We report a 2000-spin network with all-to-all spin-spin couplings. Using a measurement and feedback scheme, we coupled time-multiplexed degenerate optical parametric oscillators to implement maximum cut problems on arbitrary graph topologies with up to 2000 nodes. Our coherent Ising machine outperformed simulated annealing in terms of accuracy and computation time for a 2000-node complete graph.

555 citations


Proceedings Article
12 Feb 2016
TL;DR: This work develops two versions of the Constrained Laplacian Rank (CLR) method, based upon the L1-norm and the L2-norm, which yield two new graph-based clustering objectives and derives optimization algorithms to solve them.
Abstract: Graph-based clustering methods perform clustering on a fixed input data graph. If this initial construction is of low quality then the resulting clustering may also be of low quality. Moreover, existing graph-based clustering methods require post-processing on the data graph to extract the clustering indicators. We address both of these drawbacks by allowing the data graph itself to be adjusted as part of the clustering procedure. In particular, our Constrained Laplacian Rank (CLR) method learns a graph with exactly k connected components (where k is the number of clusters). We develop two versions of this method, based upon the L1-norm and the L2-norm, which yield two new graph-based clustering objectives. We derive optimization algorithms to solve these objectives. Experimental results on synthetic datasets and real-world benchmark datasets exhibit the effectiveness of this new graph-based clustering method.

546 citations


Journal ArticleDOI
TL;DR: This study uses the Lancichinetti-Fortunato-Radicchi benchmark graph to test eight state-of-the-art community detection algorithms and provides guidelines that help to choose the most adequate community detection algorithm for a given network.
Abstract: Many community detection algorithms have been developed to uncover the mesoscopic properties of complex networks. However how good an algorithm is, in terms of accuracy and computing time, remains still open. Testing algorithms on real-world network has certain restrictions which made their insights potentially biased: the networks are usually small, and the underlying communities are not defined objectively. In this study, we employ the Lancichinetti-Fortunato-Radicchi benchmark graph to test eight state-of-the-art algorithms. We quantify the accuracy using complementary measures and algorithms' computing time. Based on simple network properties and the aforementioned results, we provide guidelines that help to choose the most adequate community detection algorithm for a given network. Moreover, these rules allow uncovering limitations in the use of specific algorithms given macroscopic network properties. Our contribution is threefold: firstly, we provide actual techniques to determine which is the most suited algorithm in most circumstances based on observable properties of the network under consideration. Secondly, we use the mixing parameter as an easily measurable indicator of finding the ranges of reliability of the different algorithms. Finally, we study the dependency with network size focusing on both the algorithm's predicting power and the effective computing time.

Proceedings ArticleDOI
16 May 2016
TL;DR: In this article, a Bayesian convolutional neural network is used to regress the 6-DOF camera pose from a single RGB image, which is trained in an end-to-end manner with no need of additional engineering or graph optimisation.
Abstract: We present a robust and real-time monocular six degree of freedom visual relocalization system. We use a Bayesian convolutional neural network to regress the 6-DOF camera pose from a single RGB image. It is trained in an end-to-end manner with no need of additional engineering or graph optimisation. The algorithm can operate indoors and outdoors in real time, taking under 6ms to compute. It obtains approximately 2m and 6° accuracy for very large scale outdoor scenes and 0.5m and 10° accuracy indoors. Using a Bayesian convolutional neural network implementation we obtain an estimate of the model's relocalization uncertainty and improve state of the art localization accuracy on a large scale outdoor dataset. We leverage the uncertainty measure to estimate metric relocalization error and to detect the presence or absence of the scene in the input image. We show that the model's uncertainty is caused by images being dissimilar to the training dataset in either pose or appearance.

Journal ArticleDOI
TL;DR: The proposed general Laplacian regularized low-rank representation framework for data representation takes advantage of the graph regularizer and can represent the global low-dimensional structures, but also capture the intrinsic non-linear geometric information in data.
Abstract: Low-rank representation (LRR) has recently attracted a great deal of attention due to its pleasing efficacy in exploring low-dimensional subspace structures embedded in data. For a given set of observed data corrupted with sparse errors, LRR aims at learning a lowest-rank representation of all data jointly. LRR has broad applications in pattern recognition, computer vision and signal processing. In the real world, data often reside on low-dimensional manifolds embedded in a high-dimensional ambient space. However, the LRR method does not take into account the non-linear geometric structures within data, thus the locality and similarity information among data may be missing in the learning process. To improve LRR in this regard, we propose a general Laplacian regularized low-rank representation framework for data representation where a hypergraph Laplacian regularizer can be readily introduced into, i.e., a Non-negative Sparse Hyper-Laplacian regularized LRR model (NSHLRR). By taking advantage of the graph regularizer, our proposed method not only can represent the global low-dimensional structures, but also capture the intrinsic non-linear geometric information in data. The extensive experimental results on image clustering, semi-supervised image classification and dimensionality reduction tasks demonstrate the effectiveness of the proposed method.

Proceedings Article
09 Jul 2016
TL;DR: This paper proposes a novel framework via the reformulation of the standard spectral learning model, which can be used for multiview clustering and semisupervised tasks and achieves comparable performance with the state-of-the-art approaches.
Abstract: Graph-based approaches have been successful in unsupervised and semi-supervised learning. In this paper, we focus on the real-world applications where the same instance can be represented by multiple heterogeneous features. The key point of utilizing the graph-based knowledge to deal with this kind of data is to reasonably integrate the different representations and obtain the most consistent manifold with the real data distributions. In this paper, we propose a novel framework via the reformulation of the standard spectral learning model, which can be used for multiview clustering and semisupervised tasks. Unlike other methods in the literature, the proposed methods can learn an optimal weight for each graph automatically without introducing an additive parameter as previous methods do. Furthermore, our objective under semisupervised learning is convex and the global optimal result will be obtained. Extensive empirical results on different real-world data sets demonstrate that the proposed methods achieve comparable performance with the state-of-the-art approaches and can be used more practically.

Book ChapterDOI
08 Oct 2016
TL;DR: Wang et al. as mentioned in this paper proposed the Graph Long Short-Term Memory (Graph LSTM) network, which is the generalization of LSTMs from sequential data or multi-dimensional data to general graph-structured data.
Abstract: By taking the semantic object parsing task as an exemplar application scenario, we propose the Graph Long Short-Term Memory (Graph LSTM) network, which is the generalization of LSTM from sequential data or multi-dimensional data to general graph-structured data. Particularly, instead of evenly and fixedly dividing an image to pixels or patches in existing multi-dimensional LSTM structures (e.g., Row, Grid and Diagonal LSTMs), we take each arbitrary-shaped superpixel as a semantically consistent node, and adaptively construct an undirected graph for each image, where the spatial relations of the superpixels are naturally used as edges. Constructed on such an adaptive graph topology, the Graph LSTM is more naturally aligned with the visual patterns in the image (e.g., object boundaries or appearance similarities) and provides a more economical information propagation route. Furthermore, for each optimization step over Graph LSTM, we propose to use a confidence-driven scheme to update the hidden and memory states of nodes progressively till all nodes are updated. In addition, for each node, the forgets gates are adaptively learned to capture different degrees of semantic correlation with neighboring nodes. Comprehensive evaluations on four diverse semantic object parsing datasets well demonstrate the significant superiority of our Graph LSTM over other state-of-the-art solutions.

Proceedings ArticleDOI
11 Apr 2016
TL;DR: The LargeVis is proposed, a technique that first constructs an accurately approximated K-nearest neighbor graph from the data and then layouts the graph in the low-dimensional space and easily scales to millions of high-dimensional data points.
Abstract: We study the problem of visualizing large-scale and high-dimensional data in a low-dimensional (typically 2D or 3D) space. Much success has been reported recently by techniques that first compute a similarity structure of the data points and then project them into a low-dimensional space with the structure preserved. These two steps suffer from considerable computational costs, preventing the state-of-the-art methods such as the t-SNE from scaling to large-scale and high-dimensional data (e.g., millions of data points and hundreds of dimensions). We propose the LargeVis, a technique that first constructs an accurately approximated K-nearest neighbor graph from the data and then layouts the graph in the low-dimensional space. Comparing to t-SNE, LargeVis significantly reduces the computational cost of the graph construction step and employs a principled probabilistic model for the visualization step, the objective of which can be effectively optimized through asynchronous stochastic gradient descent with a linear time complexity. The whole procedure thus easily scales to millions of high-dimensional data points. Experimental results on real-world data sets demonstrate that the LargeVis outperforms the state-of-the-art methods in both efficiency and effectiveness. The hyper-parameters of LargeVis are also much more stable over different data sets.

Proceedings Article
02 May 2016
TL;DR: In this article, the authors propose a primal-dual framework to learn the graph structure underlying a set of smooth signals under the smoothness assumption that trace(X^TLX) is small.
Abstract: We propose a framework that learns the graph structure underlying a set of smooth signals. Given a real m by n matrix X whose rows reside on the vertices of an unknown graph, we learn the edge weights w under the smoothness assumption that trace(X^TLX) is small. We show that the problem is a weighted l-1 minimization that leads to naturally sparse solutions. We point out how known graph learning or construction techniques fall within our framework and propose a new model that performs better than the state of the art in many settings. We present efficient, scalable primal-dual based algorithms for both our model and the previous state of the art, and evaluate their performance on artificial and real data.

Proceedings ArticleDOI
13 Aug 2016
TL;DR: FRAUDAR is proposed, an algorithm that is camouflage-resistant, provides upper bounds on the effectiveness of fraudsters, and is effective in real-world data.
Abstract: Given a bipartite graph of users and the products that they review, or followers and followees, how can we detect fake reviews or follows? Existing fraud detection methods (spectral, etc.) try to identify dense subgraphs of nodes that are sparsely connected to the remaining graph. Fraudsters can evade these methods using camouflage, by adding reviews or follows with honest targets so that they look "normal". Even worse, some fraudsters use hijacked accounts from honest users, and then the camouflage is indeed organic. Our focus is to spot fraudsters in the presence of camouflage or hijacked accounts. We propose FRAUDAR, an algorithm that (a) is camouflage-resistant, (b) provides upper bounds on the effectiveness of fraudsters, and (c) is effective in real-world data. Experimental results under various attacks show that FRAUDAR outperforms the top competitor in accuracy of detecting both camouflaged and non-camouflaged fraud. Additionally, in real-world experiments with a Twitter follower-followee graph of 1.47 billion edges, FRAUDAR successfully detected a subgraph of more than 4000 detected accounts, of which a majority had tweets showing that they used follower-buying services.

Posted Content
TL;DR: The Graph Long Short-Term Memory network is proposed, which is the generalization of LSTM from sequential data or multi-dimensional data to general graph-structured data.
Abstract: By taking the semantic object parsing task as an exemplar application scenario, we propose the Graph Long Short-Term Memory (Graph LSTM) network, which is the generalization of LSTM from sequential data or multi-dimensional data to general graph-structured data. Particularly, instead of evenly and fixedly dividing an image to pixels or patches in existing multi-dimensional LSTM structures (e.g., Row, Grid and Diagonal LSTMs), we take each arbitrary-shaped superpixel as a semantically consistent node, and adaptively construct an undirected graph for each image, where the spatial relations of the superpixels are naturally used as edges. Constructed on such an adaptive graph topology, the Graph LSTM is more naturally aligned with the visual patterns in the image (e.g., object boundaries or appearance similarities) and provides a more economical information propagation route. Furthermore, for each optimization step over Graph LSTM, we propose to use a confidence-driven scheme to update the hidden and memory states of nodes progressively till all nodes are updated. In addition, for each node, the forgets gates are adaptively learned to capture different degrees of semantic correlation with neighboring nodes. Comprehensive evaluations on four diverse semantic object parsing datasets well demonstrate the significant superiority of our Graph LSTM over other state-of-the-art solutions.

Proceedings ArticleDOI
27 Jun 2016
TL;DR: A new energy on the vertices of a regularly sampled spatiotemporal bilateral grid is designed, which can be solved efficiently using a standard graph cut label assignment, and implicitly approximates long-range, spatio-temporal connections between pixels while still containing only a small number of variables and only local graph edges.
Abstract: In this work, we propose a novel approach to video segmentation that operates in bilateral space. We design a new energy on the vertices of a regularly sampled spatiotemporal bilateral grid, which can be solved efficiently using a standard graph cut label assignment. Using a bilateral formulation, the energy that we minimize implicitly approximates long-range, spatio-temporal connections between pixels while still containing only a small number of variables and only local graph edges. We compare to a number of recent methods, and show that our approach achieves state-of-the-art results on multiple benchmarks in a fraction of the runtime. Furthermore, our method scales linearly with image size, allowing for interactive feedback on real-world high resolution video.

Journal ArticleDOI
TL;DR: Reactive Vega is presented, a system architecture that provides the first robust and comprehensive treatment of declarative visual and interaction design for data visualization and the results of benchmark studies indicate superior interactive performance to both D3 and the original, non-reactive Vega system.
Abstract: We present Reactive Vega, a system architecture that provides the first robust and comprehensive treatment of declarative visual and interaction design for data visualization. Starting from a single declarative specification, Reactive Vega constructs a dataflow graph in which input data, scene graph elements, and interaction events are all treated as first-class streaming data sources. To support expressive interactive visualizations that may involve time-varying scalar, relational, or hierarchical data, Reactive Vega's dataflow graph can dynamically re-write itself at runtime by extending or pruning branches in a data-driven fashion. We discuss both compile- and run-time optimizations applied within Reactive Vega, and share the results of benchmark studies that indicate superior interactive performance to both D3 and the original, non-reactive Vega system.

Journal ArticleDOI
TL;DR: A novel classification scheme is introduced by distinguishing between methods for deterministic and random graphs for a better understanding of the methods, their challenges and, finally, for applying the methods efficiently in an interdisciplinary setting of data science to solve a particular problem involving comparative network analysis.

Journal ArticleDOI
TL;DR: The recent advances in this forefront and rapidly evolving field of reconstructing nonlinear and complex dynamical systems from measured data or time series are reviewed, aiming to cover topics such as compressive sensing, noised-induced dynamical mapping, perturbations, reverse engineering, synchronization, inner composition alignment, global silencing and Granger Causality.

Journal ArticleDOI
TL;DR: Factorized graph matching (FGM) is proposed, which factorizes the large pairwise affinity matrix into smaller matrices that encode the local structure of each graph and the Pairwise affinity between edges and four are the benefits that follow.
Abstract: Graph matching (GM) is a fundamental problem in computer science, and it plays a central role to solve correspondence problems in computer vision. GM problems that incorporate pairwise constraints can be formulated as a quadratic assignment problem (QAP). Although widely used, solving the correspondence problem through GM has two main limitations: (1) the QAP is NP-hard and difficult to approximate; (2) GM algorithms do not incorporate geometric constraints between nodes that are natural in computer vision problems. To address aforementioned problems, this paper proposes factorized graph matching (FGM). FGM factorizes the large pairwise affinity matrix into smaller matrices that encode the local structure of each graph and the pairwise affinity between edges. Four are the benefits that follow from this factorization: (1) There is no need to compute the costly (in space and time) pairwise affinity matrix; (2) The factorization allows the use of a path-following optimization algorithm, that leads to improved optimization strategies and matching performance; (3) Given the factorization, it becomes straight-forward to incorporate geometric transformations (rigid and non-rigid) to the GM problem. (4) Using a matrix formulation for the GM problem and the factorization, it is easy to reveal commonalities and differences between different GM methods. The factorization also provides a clean connection with other matching algorithms such as iterative closest point; Experimental results on synthetic and real databases illustrate how FGM outperforms state-of-the-art algorithms for GM. The code is available at http://humansensing.cs.cmu.edu/fgm .

Journal Article
TL;DR: In this paper, the authors consider two closely related problems: planted clustering and submatrix localization, and show that the space of the model parameters (cluster/submatrix size, edge probabilities and the mean of the submatrices) can be partitioned into four disjoint regions corresponding to decreasing statistical and computational complexities.
Abstract: We consider two closely related problems: planted clustering and submatrix localization. In the planted clustering problem, a random graph is generated based on an underlying cluster structure of the nodes; the task is to recover these clusters given the graph. The submatrix localization problem concerns locating hidden submatrices with elevated means inside a large real-valued random matrix. Of particular interest is the setting where the number of clusters/submatrices is allowed to grow unbounded with the problem size. These formulations cover several classical models such as planted clique, planted densest subgraph, planted partition, planted coloring, and the stochastic block model, which are widely used for studying community detection, graph clustering and bi-clustering. For both problems, we show that the space of the model parameters (cluster/submatrix size, edge probabilities and the mean of the submatrices) can be partitioned into four disjoint regions corresponding to decreasing statistical and computational complexities: (1) the impossible regime, where all algorithms fail; (2) the hard regime, where the computationally expensive Maximum Likelihood Estimator (MLE) succeeds; (3) the easy regime, where the polynomial-time convexified MLE succeeds; (4) the simple regime, where a local counting/thresholding procedure succeeds. Moreover, we show that each of these algorithms provably fails in the harder regimes. Our results establish the minimax recovery limits, which are tight up to universal constants and hold even with a growing number of clusters/submatrices, and provide order-wise stronger performance guarantees for polynomial-time algorithms than previously known. Our study demonstrates the tradeoffs between statistical and computational considerations, and suggests that the minimax limits may not be achievable by polynomial-time algorithms.

Journal ArticleDOI
TL;DR: Experiments show that the proposed hybrid assembly approach is able to assemble mammalian-sized genomes orders of magnitude more quickly than existing methods without consuming a lot of memory, while saving about half of the sequencing cost.
Abstract: The highly anticipated transition from next generation sequencing (NGS) to third generation sequencing (3GS) has been difficult primarily due to high error rates and excessive sequencing cost. The high error rates make the assembly of long erroneous reads of large genomes challenging because existing software solutions are often overwhelmed by error correction tasks. Here we report a hybrid assembly approach that simultaneously utilizes NGS and 3GS data to address both issues. We gain advantages from three general and basic design principles: (i) Compact representation of the long reads leads to efficient alignments. (ii) Base-level errors can be skipped; structural errors need to be detected and corrected. (iii) Structurally correct 3GS reads are assembled and polished. In our implementation, preassembled NGS contigs are used to derive the compact representation of the long reads, motivating an algorithmic conversion from a de Bruijn graph to an overlap graph, the two major assembly paradigms. Moreover, since NGS and 3GS data can compensate for each other, our hybrid assembly approach reduces both of their sequencing requirements. Experiments show that our software is able to assemble mammalian-sized genomes orders of magnitude more quickly than existing methods without consuming a lot of memory, while saving about half of the sequencing cost.

Posted Content
TL;DR: In this paper, the authors introduce a model of the future computation of the network graph, which predicts what the result of the modelled subgraph will produce using only local information.
Abstract: Training directed neural networks typically requires forward-propagating data through a computation graph, followed by backpropagating error signal, to produce weight updates. All layers, or more generally, modules, of the network are therefore locked, in the sense that they must wait for the remainder of the network to execute forwards and propagate error backwards before they can be updated. In this work we break this constraint by decoupling modules by introducing a model of the future computation of the network graph. These models predict what the result of the modelled subgraph will produce using only local information. In particular we focus on modelling error gradients: by using the modelled synthetic gradient in place of true backpropagated error gradients we decouple subgraphs, and can update them independently and asynchronously i.e. we realise decoupled neural interfaces. We show results for feed-forward models, where every layer is trained asynchronously, recurrent neural networks (RNNs) where predicting one's future gradient extends the time over which the RNN can effectively model, and also a hierarchical RNN system with ticking at different timescales. Finally, we demonstrate that in addition to predicting gradients, the same framework can be used to predict inputs, resulting in models which are decoupled in both the forward and backwards pass -- amounting to independent networks which co-learn such that they can be composed into a single functioning corporation.

Proceedings ArticleDOI
14 Jun 2016
TL;DR: EmptyHeaded as discussed by the authors is a high-level graph processing engine that supports a rich datalog-like query language and achieves performance comparable to that of low-level engines by introducing a new join algorithm that satisfies strong theoretical guarantees.
Abstract: There are two types of high-performance graph processing engines: low- and high-level engines. Low-level engines (Galois, PowerGraph, Snap) provide optimized data structures and computation models but require users to write low-level imperative code, hence ensuring that efficiency is the burden of the user. In high-level engines, users write in query languages like datalog (SociaLite) or SQL (Grail). High-level engines are easier to use but are orders of magnitude slower than the low-level graph engines. We present EmptyHeaded, a high-level engine that supports a rich datalog-like query language and achieves performance comparable to that of low-level engines. At the core of EmptyHeaded’s design is a new class of join algorithms that satisfy strong theoretical guarantees, but have thus far not achieved performance comparable to that of specialized graph processing engines. To achieve high performance, EmptyHeaded introduces a new join engine architecture, including a novel query optimizer and execution engine that leverage single-instruction multiple data (SIMD) parallelism. With this architecture, EmptyHeaded outperforms high-level approaches by up to three orders of magnitude on graph pattern queries, PageRank, and Single-Source Shortest Paths (SSSP) and is an order of magnitude faster than many low-level baselines. We validate that EmptyHeaded competes with the best-of-breed low-level engine (Galois), achieving comparable performance on PageRank and at most 3× worse performance on SSSP. Finally, we show that the EmptyHeaded design can easily be extended to accommodate a standard resource description framework (RDF) workload, the LUBM benchmark. On the LUBM benchmark, we show that EmptyHeaded can compete with and sometimes outperform two high-level, but specialized RDF baselines (TripleBit and RDF-3X), while outperforming MonetDB by up to three orders of magnitude and LogicBlox by up to two orders of magnitude.

Journal ArticleDOI
TL;DR: A fast local anchor embedding method, which reformulates the optimization of local weights and obtains an analytical solution, and a new adjacency matrix among anchors by considering the commonly linked datapoints, which leads to a more effective normalized graph Laplacian over anchors.
Abstract: Many graph-based semi-supervised learning methods for large datasets have been proposed to cope with the rapidly increasing size of data, such as Anchor Graph Regularization (AGR). This model builds a regularization framework by exploring the underlying structure of the whole dataset with both datapoints and anchors. Nevertheless, AGR still has limitations in its two components: (1) in anchor graph construction, the estimation of the local weights between each datapoint and its neighboring anchors could be biased and relatively slow; and (2) in anchor graph regularization, the adjacency matrix that estimates the relationship between datapoints, is not sufficiently effective. In this paper, we develop an Efficient Anchor Graph Regularization (EAGR) by tackling these issues. First, we propose a fast local anchor embedding method, which reformulates the optimization of local weights and obtains an analytical solution. We show that this method better reconstructs datapoints with anchors and speeds up the optimizing process. Second, we propose a new adjacency matrix among anchors by considering the commonly linked datapoints, which leads to a more effective normalized graph Laplacian over anchors. We show that, with the novel local weight estimation and normalized graph Laplacian, EAGR is able to achieve better classification accuracy with much less computational costs. Experimental results on several publicly available datasets demonstrate the effectiveness of our approach.

Journal ArticleDOI
TL;DR: This study represents the first algebraic topological analysis of structural connectomics and connectomics-based spatio-temporal activity in a biologically realistic neural microcircuit and shows promise for more general applications in network science.
Abstract: A recent publication provides the network graph for a neocortical microcircuit comprising 8 million connections between 31,000 neurons. Since traditional graph-theoretical methods may not be sufficient to understand the immense complexity of such a biological network, we explored whether methods from algebraic topology could provide a new perspective on its structural and functional organization. Structural topological analysis revealed that directed graphs representing connectivity among neurons in the microcircuit deviated significantly from different varieties of randomized graph. In particular, the directed graphs contained in the order of 10 simplices groups of neurons with all-to-all directed connectivity. Some of these simplices contained up to 8 neurons, making them the most extreme neuronal clustering motif ever reported. Functional topological analysis of simulated neuronal activity in the microcircuit revealed novel spatio-temporal metrics that provide an effective classification of functional responses to qualitatively different stimuli. This study represents the first algebraic topological analysis of structural connectomics and connectomics-based spatio-temporal activity in a biologically realistic neural microcircuit. The methods used in the study show promise for more general applications in network science.