Proceedings Article

A Consistent Histogram Estimator for Exchangeable Graph Models

21 Jun 2014 - pp. 208-216
TL;DR: A histogram estimator of a graphon that is provably consistent and numerically efficient is proposed, based on a sorting-and-smoothing (SAS) algorithm, which first sorts the empirical degree of a graph, then smooths the sorted graph using total variation minimization.
Abstract: Exchangeable graph models (ExGM) subsume a number of popular network models. The mathematical object that characterizes an ExGM is termed a graphon. Finding scalable estimators of graphons, provably consistent, remains an open issue. In this paper, we propose a histogram estimator of a graphon that is provably consistent and numerically efficient. The proposed estimator is based on a sorting-and-smoothing (SAS) algorithm, which first sorts the empirical degree of a graph, then smooths the sorted graph using total variation minimization. The consistency of the SAS algorithm is proved by leveraging sparsity concepts from compressed sensing.
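As a rough illustration of the pipeline described above, here is a minimal Python sketch of the SAS steps: sort nodes by empirical degree, block-average the sorted adjacency matrix into a histogram, then smooth by total variation. The bin count h, the TV weight, and the use of scikit-image's Chambolle solver are illustrative assumptions, not the authors' implementation.

import numpy as np
from skimage.restoration import denoise_tv_chambolle

def sas_estimate(A, h=20):
    # Step 1: sort nodes by empirical degree.
    order = np.argsort(A.sum(axis=1))
    B = A[np.ix_(order, order)].astype(float)
    # Step 2: histogram step -- block-average into an h x h matrix.
    k = A.shape[0] // h                       # nodes per bin
    H = B[:k * h, :k * h].reshape(h, k, h, k).mean(axis=(1, 3))
    # Step 3: smooth the sorted histogram by total variation minimization.
    return denoise_tv_chambolle(H, weight=0.1)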


Citations
Journal ArticleDOI
TL;DR: This paper establishes the optimal rate of convergence for graphon estimation in a Hölder class with smoothness $\alpha$, which is, surprisingly, identical to the classical nonparametric rate.
Abstract: Network analysis is becoming one of the most active research areas in statistics. Significant advances have been made recently on developing theories, methodologies and algorithms for analyzing networks. However, there has been little fundamental study on optimal estimation. In this paper, we establish the optimal rate of convergence for graphon estimation. For the stochastic block model with $k$ clusters, we show that the optimal rate under the mean squared error is $n^{-1}\log k+k^2/n^2$. The minimax upper bound improves the existing results in the literature through a technique of solving a quadratic equation. When $k\leq\sqrt{n\log n}$, as the number of clusters $k$ grows, the minimax rate grows slowly, with only a logarithmic order $n^{-1}\log k$. A key step in establishing the lower bound is to construct a novel subset of the parameter space and then apply Fano's lemma, from which we see a clear distinction between the nonparametric graphon estimation problem and classical nonparametric regression, due to the lack of identifiability of the order of nodes in exchangeable random graph models. As an immediate application, we consider nonparametric graphon estimation in a Hölder class with smoothness $\alpha$. When the smoothness $\alpha\geq1$, the optimal rate of convergence is $n^{-1}\log n$, independent of $\alpha$, while for $\alpha\in(0,1)$, the rate is $n^{-2\alpha/(\alpha+1)}$, which is, to our surprise, identical to the classical nonparametric rate.
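Collecting the rates stated in the abstract in one place: for the stochastic block model with $k$ clusters the minimax mean squared error is

\[ \frac{\log k}{n} + \frac{k^2}{n^2}, \]

while for graphons in a Hölder class with smoothness $\alpha$ the optimal rate is

\[ \begin{cases} n^{-1}\log n, & \alpha \geq 1, \\ n^{-2\alpha/(\alpha+1)}, & \alpha \in (0,1). \end{cases} \]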

179 citations


Cites background from "A Consistent Histogram Estimator fo..."

  • ...[9] Stanley H Chan and Edoardo M Airoldi....


  • ...Though various algorithms have been proposed and analyzed [10, 44, 51, 2, 9], it is not clear whether the convergence rates obtained in these works can be improved, and not clear what the differences and connections are between nonparametric graphon estimation and classical nonparametric regression....


  • ...The work by [9] obtained the rate n−1 log n for estimating a Lipschitz f , but they imposed strong assumptions on f ....


Journal ArticleDOI
TL;DR: In this paper, the authors consider the problem of statistical estimation of the matrix of connection probabilities based on the observations of the adjacency matrix of the network and derive optimal rates of estimation.
Abstract: Inhomogeneous random graph models encompass many network models such as stochastic block models and latent position models. We consider the problem of statistical estimation of the matrix of connection probabilities based on observations of the adjacency matrix of the network. Taking the stochastic block model as an approximation, we construct estimators of network connection probabilities: the ordinary block-constant least squares estimator and its restricted version. We show that they satisfy oracle inequalities with respect to the block-constant oracle. As a consequence, we derive optimal rates of estimation of the probability matrix. Our results cover the important setting of sparse networks. Another consequence is the establishment of upper bounds on the minimax risks for graphon estimation in the $L_2$ norm when the probability matrix is sampled according to a graphon model. These bounds include an additional term accounting for the "agnostic" error induced by the variability of the latent unobserved variables of the graphon model. In this setting, the optimal rates are influenced not only by the bias and variance components, as in usual nonparametric problems, but also by a third component: the agnostic error. The results shed light on the differences between estimation under the empirical loss (the probability matrix estimation) and under the integrated loss (the graphon estimation).
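As a minimal sketch of the block-constant least squares idea, assuming the block labels z are given (the estimators analyzed in the paper also optimize over the partition, which is the hard part; names below are illustrative):

import numpy as np

def block_constant_lse(A, z):
    # For a fixed partition z, least squares over block-constant matrices
    # reduces to the within-block mean of the adjacency matrix A.
    k = int(z.max()) + 1
    Theta = np.zeros(A.shape)
    for a in range(k):
        for b in range(k):
            ia, ib = np.where(z == a)[0], np.where(z == b)[0]
            Theta[np.ix_(ia, ib)] = A[np.ix_(ia, ib)].mean()
    return Theta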

139 citations

Journal ArticleDOI
TL;DR: In this article, a neighbourhood smoothing method is proposed to estimate the expectation of the adjacency matrix directly, without making the structural assumptions that graphon estimation requires; the method has a competitive mean squared error rate and outperforms many benchmark methods for link prediction.
Abstract: The estimation of probabilities of network edges from the observed adjacency matrix has important applications to the prediction of missing links and to network denoising. It is usually addressed by estimating the graphon, a function that determines the matrix of edge probabilities, but this is ill-defined without strong assumptions on the network structure. Here we propose a novel computationally efficient method, based on neighbourhood smoothing, to estimate the expectation of the adjacency matrix directly, without making the structural assumptions that graphon estimation requires. The neighbourhood smoothing method requires little tuning, has a competitive mean squared error rate and outperforms many benchmark methods for link prediction in simulated and real networks.
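A heavily simplified Python sketch of the neighbourhood smoothing idea: for each node, average the adjacency rows of the nodes with the most similar connectivity profiles. The Euclidean profile distance and the quantile q are illustrative choices; the paper uses a tailored dissimilarity measure and a theory-driven neighbourhood size.

import numpy as np

def neighbourhood_smoothing(A, q=0.3):
    n = A.shape[0]
    # Pairwise distances between connectivity profiles (rows of A).
    D = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=2)
    P = np.empty((n, n))
    for i in range(n):
        nbrs = D[i] <= np.quantile(D[i], q)   # neighbourhood of node i
        P[i] = A[nbrs].mean(axis=0)           # smooth row i over the neighbourhood
    return (P + P.T) / 2                      # symmetrize the estimate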

74 citations

Proceedings Article
01 Apr 2015
TL;DR: This work introduces a new dataset of 150 computer science papers along with ground truth labels for the locations of the figures, tables and captions within them and demonstrates a caption-to-figure matching component that is effective even in cases where individual captions are adjacent to multiple figures.
Abstract: Identifying and extracting figures and tables along with their captions from scholarly articles is important both as a way of providing tools for article summarization, and as part of larger systems that seek to gain deeper, semantic understanding of these articles. While many "off-the-shelf" tools exist that can extract embedded images from these documents, e.g. PDFBox, Poppler, etc., these tools are unable to extract tables, captions, and figures composed of vector graphics. Our proposed approach analyzes the structure of individual pages of a document by detecting chunks of body text, and locates the areas wherein figures or tables could reside by reasoning about the empty regions within that text. This method can extract a wide variety of figures because it does not make strong assumptions about the format of the figures embedded in the document, as long as they can be differentiated from the main article's text. Our algorithm also demonstrates a caption-to-figure matching component that is effective even in cases where individual captions are adjacent to multiple figures. Our contribution also includes methods for leveraging particular consistency and formatting assumptions to identify titles, body text and captions within each article. We introduce a new dataset of 150 computer science papers along with ground truth labels for the locations of the figures, tables and captions within them. Our algorithm achieves 96% precision at 92% recall when tested against this dataset, surpassing previous state of the art. We release our dataset, code, and evaluation scripts on our project website for enabling future research.
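The claim that captions adjacent to multiple figures can still be matched correctly suggests treating matching as a global assignment rather than picking the nearest figure for each caption independently. A toy sketch of that idea on box centres (the distance cost and all names are assumptions, not the paper's scoring function):

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_captions(caption_centres, figure_centres):
    # Minimizing the total caption-to-figure distance jointly disambiguates
    # a caption that sits next to several candidate figure regions.
    cost = np.linalg.norm(
        caption_centres[:, None, :] - figure_centres[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
    return dict(zip(rows, cols))              # caption index -> figure index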

68 citations


Cites background from "A Consistent Histogram Estimator fo..."

  • ...com/ Figure 1: A scholarly document (left, page from (Chan and Airoldi 2014)), and the same document with body text masked with filled boxes, captions masked with empty boxes, and tables and figures removed (right)....


Proceedings Article
02 Apr 2014
TL;DR: A specific estimator is built using the proposed 3-step procedure, which combines probability matrix estimation by Universal Singular Value Thresholding (USVT) with empirical degree sorting of the observed adjacency matrix, and it is proved that this estimator is consistent.
Abstract: Exchangeable graph models (ExGM) are a nonparametric approach to modeling network data that subsumes a number of popular models. The key object that defines an ExGM is often referred to as a graphon, or graph kernel. Here, we make three contributions to advance the theory of estimation of graphons. We determine conditions under which a unique canonical representation for a graphon exists and it is identifiable. We propose a 3-step procedure to estimate the canonical graphon of any ExGM that satisfies these conditions. We then focus on a specific estimator, built using the proposed 3-step procedure, which combines probability matrix estimation by Universal Singular Value Thresholding (USVT) and empirical degree sorting of the observed adjacency matrix. We prove that this estimator is consistent. We illustrate how the proposed theory and methods can be used to develop hypothesis testing procedures for models of network data.
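A minimal sketch of the two main steps named above, USVT followed by empirical degree sorting; the threshold constant follows Chatterjee's $(2+\eta)\sqrt{n}$ rule, and the function names are illustrative:

import numpy as np

def usvt(A, eta=0.01):
    # Universal Singular Value Thresholding: zero out singular values
    # below (2 + eta) * sqrt(n), then clip entries back into [0, 1].
    n = A.shape[0]
    U, s, Vt = np.linalg.svd(A)
    s[s < (2 + eta) * np.sqrt(n)] = 0.0
    return np.clip((U * s) @ Vt, 0.0, 1.0)

def sorted_usvt(A):
    # USVT estimate of the probability matrix, with rows and columns
    # rearranged by empirical degree sorting.
    order = np.argsort(A.sum(axis=1))
    return usvt(A)[np.ix_(order, order)]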

56 citations


Cites background or methods from "A Consistent Histogram Estimator fo..."

  • ...This observation motivated recent follow-up work (Chan and Airoldi, 2014)....


  • ...A detailed investigation of adding this third step in the estimation of canonical graphon is discussed in a follow-up paper (Chan and Airoldi, 2014)....


  • ...recent papers focus on this direction (Bickel and Chen, 2009; Miller, Griffiths, and Jordan, 2009; Lloyd, Orbanz, Ghahramani, and Roy, 2012; Choi, Wolfe, and Airoldi, 2012; Azari and Airoldi, 2012; Chatterjee, 2012; Tang, Sussman, and Priebe, 2013; Wolfe and Olhede, 2013; Latouche and Robin, 2013; Orbanz and Roy, 2013; Airoldi, Costa, and Chan, 2013; Chan, Costa, and Airoldi, 2013; Chan and Airoldi, 2014), but one of the deficiencies for this formulation is that the resulting estimate always lacks the global structural information to the generating graphon....


  • ...…results also suggest that, if the canonical graphon W is believed to be smooth, then a smoothing algorithm like total variation minimization method (Chan, Khoshabeh, Gibson, Gill, and Nguyen, 2011) could be applied to get a further reduction of estimation errors (e.g., see Chan and Airoldi, 2014)....


  • ...While, adding a third smoothing step is helpful in these two specific examples, we note that the histogram estimator recently proposed by Chan and Airoldi (2014) requires an additional smoothness assumption on the underlying canonical graphon W ....


References
Journal ArticleDOI
TL;DR: This work develops a class of models where the probability of a relation between actors depends on the positions of individuals in an unobserved “social space,” and proposes Markov chain Monte Carlo procedures for making inference on latent positions and the effects of observed covariates.
Abstract: Network models are widely used to represent relational information among interacting units. In studies of social networks, recent emphasis has been placed on random graph models where the nodes usually represent individual social actors and the edges represent the presence of a specified relation between actors. We develop a class of models where the probability of a relation between actors depends on the positions of individuals in an unobserved “social space.” We make inference for the social space within maximum likelihood and Bayesian frameworks, and propose Markov chain Monte Carlo procedures for making inference on latent positions and the effects of observed covariates. We present analyses of three standard datasets from the social networks literature, and compare the method to an alternative stochastic blockmodeling approach. In addition to improving on model fit for these datasets, our method provides a visual and interpretable model-based spatial representation of social relationships and improv...
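For concreteness, the distance version of this latent space model ties the edge probability for actors $i$ and $j$ to their positions $z_i, z_j$ in the latent space and to observed pair covariates $x_{ij}$ (notation follows common presentations of the model):

\[ \operatorname{logit} P(y_{ij} = 1 \mid z_i, z_j, x_{ij}) = \alpha + \beta^{\top} x_{ij} - \lVert z_i - z_j \rVert. \]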

2,027 citations


"A Consistent Histogram Estimator fo..." refers background in this paper

  • ...…models include the exponential random graph model Wasserman (2005); Hunter and Handcock (2006), the stochastic blockmodel Nowicki and Snijders (2001), the mixed membership model Airoldi et al. (2008), the latent space model Hoff et al. (2002), the graphlet Azari and Airoldi (2012) and many others....


  • ...…in Table 1, we note that graphon no. 1 w(u, v) = uv is a special case of the eigenmodel Hoff (2008), graphon no. 5 w(u, v) = 1/(1 + exp{−10(u2 + v2)}) is a variation of the logistic model presented in Chatterjee, and graphon no. 6 w(u, v) = |u− v| is the latent distance model Hoff et al. (2002)....


Journal ArticleDOI
TL;DR: In this article, the authors introduce a class of variance allocation models for pairwise measurements, called mixed membership stochastic blockmodels, which combine global parameters that instantiate dense patches of connectivity (blockmodel) with local parameters (mixed membership), and develop a general variational inference algorithm for fast approximate posterior inference.
Abstract: Consider data consisting of pairwise measurements, such as presence or absence of links between pairs of objects. These data arise, for instance, in the analysis of protein interactions and gene regulatory networks, collections of author-recipient email, and social networks. Analyzing pairwise measurements with probabilistic models requires special assumptions, since the usual independence or exchangeability assumptions no longer hold. Here we introduce a class of variance allocation models for pairwise measurements: mixed membership stochastic blockmodels. These models combine global parameters that instantiate dense patches of connectivity (blockmodel) with local parameters that instantiate node-specific variability in the connections (mixed membership). We develop a general variational inference algorithm for fast approximate posterior inference. We demonstrate the advantages of mixed membership stochastic blockmodels with applications to social networks and protein interaction networks.
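In standard notation (a sketch of the generative process, not quoted from the paper), each node $i$ draws a membership vector $\pi_i$, and each directed pair draws block indicators that are combined through the blockmodel matrix $B$:

\[ \pi_i \sim \mathrm{Dirichlet}(\alpha), \qquad z_{i \to j} \sim \mathrm{Multinomial}(\pi_i), \qquad z_{i \leftarrow j} \sim \mathrm{Multinomial}(\pi_j), \]
\[ y_{ij} \sim \mathrm{Bernoulli}\!\left( z_{i \to j}^{\top} B \, z_{i \leftarrow j} \right). \]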

1,803 citations

Posted Content
TL;DR: The mixed membership stochastic blockmodel extends blockmodels for relational data to ones which capture mixed membership latent relational structure, thus providing an object-specific low-dimensional representation.
Abstract: Observations consisting of measurements on relationships for pairs of objects arise in many settings, such as protein interaction and gene regulatory networks, collections of author-recipient email, and social networks. Analyzing such data with probabilistic models can be delicate because the simple exchangeability assumptions underlying many boilerplate models no longer hold. In this paper, we describe a latent variable model of such data called the mixed membership stochastic blockmodel. This model extends blockmodels for relational data to ones which capture mixed membership latent relational structure, thus providing an object-specific low-dimensional representation. We develop a general variational inference algorithm for fast approximate posterior inference. We explore applications to social and protein interaction networks.

1,546 citations

Journal ArticleDOI
TL;DR: In this article, a statistical approach to a posteriori blockmodeling for digraphs and valued digraphs is proposed, which assumes that the vertices of the digraph are partitioned into several unobserved (latent) classes and that the probability distribution of the relation between two vertices depends only on the classes to which they belong.
Abstract: A statistical approach to a posteriori blockmodeling for digraphs and valued digraphs is proposed. The probability model assumes that the vertices of the digraph are partitioned into several unobserved (latent) classes and that the probability distribution of the relation between two vertices depends only on the classes to which they belong. A Bayesian estimator based on Gibbs sampling is proposed. The basic model is not identified, because class labels are arbitrary. The resulting identifiability problems are solved by restricting inference to the posterior distributions of invariant functions of the parameters and the vertex class membership. In addition, models are considered where class labels are identified by prior distributions for the class membership of some of the vertices. The model is illustrated by an example from the social networks literature (Kapferer's tailor shop).
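The core assumption can be written compactly: with latent class labels $z_i$, the relation between two vertices depends only on their classes,

\[ P(y_{ij} = y \mid z_i = a, z_j = b) = \eta_{ab}(y), \]

so inference targets the class memberships $z$ and the class-pair distributions $\eta$, both handled by Gibbs sampling in the paper.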

1,223 citations


"A Consistent Histogram Estimator fo..." refers methods in this paper

  • ...…of these parametric models include the exponential random graph model Wasserman (2005); Hunter and Handcock (2006), the stochastic blockmodel Nowicki and Snijders (2001), the mixed membership model Airoldi et al. (2008), the latent space model Hoff et al. (2002), the graphlet Azari and…...


Journal ArticleDOI
TL;DR: OptSpace, as described in this paper, reconstructs an $n\alpha \times n$ matrix of rank $r$ from a uniformly random subset of its entries with probability larger than $1 - 1/n^3$, generalizing a celebrated result of Friedman-Kahn-Szemeredi and Feige-Ofek on the spectra of sparse random matrices.
Abstract: Let $M$ be an $n\alpha \times n$ matrix of rank $r$, and assume that a uniformly random subset $E$ of its entries is observed. We describe an efficient algorithm, which we call OptSpace, that reconstructs $M$ from $|E| = O(rn)$ observed entries with relative root mean square error $\mathrm{RMSE} \leq C(\alpha)\,(nr/|E|)^{1/2}$ with probability larger than $1 - 1/n^3$. Further, if $r = O(1)$ and $M$ is sufficiently unstructured, then OptSpace reconstructs it exactly from $|E| = O(n \log n)$ entries with probability larger than $1 - 1/n^3$. This settles (in the case of bounded rank) a question left open by Candès and Recht and improves over the guarantees for their reconstruction algorithm. The complexity of our algorithm is $O(|E| r \log n)$, which opens the way to its use for massive data sets. In the process of proving these statements, we obtain a generalization of a celebrated result by Friedman-Kahn-Szemeredi and Feige-Ofek on the spectrum of sparse random matrices.
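A minimal sketch of the spectral step that OptSpace begins with, given a boolean mask of observed entries; the full algorithm adds trimming of over-represented rows/columns and gradient descent on the Grassmann manifold, and the names here are illustrative:

import numpy as np

def spectral_step(M_obs, mask, r):
    # Rescale the zero-filled observation so it is unbiased for M,
    # then project onto rank-r matrices via a truncated SVD.
    scale = mask.size / mask.sum()
    U, s, Vt = np.linalg.svd(M_obs * mask * scale, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]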

1,195 citations