
Showing papers by "Ali H. Sayed published in 2019"


Journal ArticleDOI
TL;DR: In this paper, a distributed optimization strategy with guaranteed exact convergence is developed for a broad class of left-stochastic combination policies. The method applies to locally balanced combination matrices, which are more general than conventional doubly stochastic matrices and endow the algorithm with faster convergence rates, more flexible step-size choices, and improved privacy-preserving properties.
Abstract: This paper develops a distributed optimization strategy with guaranteed exact convergence for a broad class of left-stochastic combination policies. The resulting exact diffusion strategy is shown in Part II of this paper to have a wider stability range and superior convergence performance than the EXTRA strategy. The exact diffusion method is applicable to locally balanced left-stochastic combination matrices which, compared to the conventional doubly stochastic matrix, are more general and able to endow the algorithm with faster convergence rates, more flexible step-size choices, and improved privacy-preserving properties. The derivation of the exact diffusion strategy relies on reformulating the aggregate optimization problem as a penalized problem and resorting to a diagonally weighted incremental construction. Detailed stability and convergence analyses are pursued in Part II of this paper and are facilitated by examining the evolution of the error dynamics in a transformed domain. Numerical simulations illustrate the theoretical conclusions.
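
For concreteness, here is a minimal NumPy sketch of the adapt-correct-combine form of exact diffusion on quadratic local costs. The ring topology, step-size, and the use of a doubly stochastic matrix (a special case of the locally balanced left-stochastic policies the paper allows) are illustrative assumptions, not the paper's general setting.

```python
import numpy as np

np.random.seed(0)
N, d, mu = 5, 3, 0.01                    # agents, dimension, step-size
H = [np.random.randn(10, d) for _ in range(N)]
b = [np.random.randn(10) for _ in range(N)]

def grad(k, w):                          # gradient of J_k(w) = 0.5||H_k w - b_k||^2
    return H[k].T @ (H[k] @ w - b[k])

# Ring topology with uniform weights; doubly stochastic for simplicity.
A = np.zeros((N, N))
for k in range(N):
    A[k, k] = A[k, (k - 1) % N] = A[k, (k + 1) % N] = 1 / 3
A_bar = (A + np.eye(N)) / 2              # averaged combination matrix

w = np.zeros((N, d))
psi_prev = w.copy()
for _ in range(20000):
    psi = np.array([w[k] - mu * grad(k, w[k]) for k in range(N)])  # adapt
    phi = psi + w - psi_prev                                       # correct
    w = A_bar @ phi                                                # combine
    psi_prev = psi

w_star = np.linalg.solve(sum(Hk.T @ Hk for Hk in H),
                         sum(H[k].T @ b[k] for k in range(N)))
print(np.abs(w - w_star).max())          # every agent near the exact minimizer
```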

113 citations


Journal ArticleDOI
TL;DR: In this paper, the exact diffusion algorithm was developed to remove the bias that is characteristic of distributed solutions for deterministic optimization problems; the algorithm was shown to be applicable to a larger set of locally balanced left-stochastic combination policies than the set of doubly stochastic policies.
Abstract: Part I of this paper developed the exact diffusion algorithm to remove the bias that is characteristic of distributed solutions for deterministic optimization problems. The algorithm was shown to be applicable to the set of locally balanced left-stochastic combination policies, which is larger than the set of doubly stochastic policies. These balanced policies endow the algorithm with faster convergence rates, more flexible step-size choices, and better privacy-preserving properties. In this Part II, we examine the convergence and stability properties of exact diffusion in some detail and establish its linear convergence rate. We also show that it has a wider stability range than the EXTRA consensus solution, meaning that it is stable for a wider range of step-sizes and can, therefore, attain faster convergence rates. Analytical examples and numerical simulations illustrate the theoretical findings.

60 citations


Journal ArticleDOI
TL;DR: In this paper, a distributed variance-reduced strategy is developed for a collection of interacting agents connected by a graph topology; it is shown to converge linearly to the exact solution and to be more memory efficient than alternative algorithms.
Abstract: This paper develops a distributed variance-reduced strategy for a collection of interacting agents that are connected by a graph topology. The resulting diffusion-AVRG (where AVRG stands for “amortized variance-reduced gradient”) algorithm is shown to have linear convergence to the exact solution, and is more memory efficient than alternative algorithms. When a batch implementation is employed, it is observed in simulations that diffusion-AVRG is more computationally efficient than exact diffusion or EXTRA, while maintaining almost the same communication efficiency.
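
The sketch below illustrates what variance reduction buys in this setting: a gradient estimate whose noise vanishes at the minimizer, so a constant step-size yields exact linear convergence. It is a single-agent, SVRG-style stand-in rather than the paper's AVRG recursion (AVRG amortizes the full-gradient pass across the epoch under random reshuffling, which is the source of its memory savings, and is combined with diffusion across agents).

```python
import numpy as np

np.random.seed(1)
N, d, mu, epochs = 100, 3, 0.01, 100
X, y = np.random.randn(N, d), np.random.randn(N)

def grad_n(w, n):                        # gradient of the n-th least-squares loss
    return (X[n] @ w - y[n]) * X[n]

w = np.zeros(d)
for _ in range(epochs):
    w0 = w.copy()                        # snapshot at the start of the epoch
    g_full = np.mean([grad_n(w0, n) for n in range(N)], axis=0)
    for n in np.random.permutation(N):   # random reshuffling over the data
        # variance-reduced direction: its noise vanishes as w -> minimizer
        w -= mu * (grad_n(w, n) - grad_n(w0, n) + g_full)

w_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(w - w_star))        # -> essentially zero
```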

59 citations


Posted Content
TL;DR: It is established that the diffusion learning strategy continues to yield meaningful estimates in non-convex scenarios, in the sense that the iterates at the individual agents will cluster in a small region around the network centroid.
Abstract: Driven by the need to solve increasingly complex optimization problems in signal processing and machine learning, there has been increasing interest in understanding the behavior of gradient-descent algorithms in non-convex environments. Most available works on distributed non-convex optimization problems focus on the deterministic setting where exact gradients are available at each agent. In this work and its Part II, we consider stochastic cost functions, where exact gradients are replaced by stochastic approximations and the resulting gradient noise persistently seeps into the dynamics of the algorithm. We establish that the diffusion learning strategy continues to yield meaningful estimates in non-convex scenarios, in the sense that the iterates at the individual agents will cluster in a small region around the network centroid. We use this insight to motivate a short-term model for network evolution over a finite horizon. In Part II [2] of this work, we leverage this model to establish descent of the diffusion strategy through saddle points in $O(1/\mu)$ steps and the return of approximately second-order stationary points in a polynomial number of iterations.

49 citations


Posted Content
TL;DR: In Part I of this work it was established that agents cluster around a network centroid; expected descent in non-convex environments was established in the large-gradient regime, and a short-term model was introduced to examine the dynamics over finite-time horizons.
Abstract: The diffusion strategy for distributed learning from streaming data employs local stochastic gradient updates along with exchange of iterates over neighborhoods. In Part I [2] of this work we established that agents cluster around a network centroid and proceeded to study the dynamics of this point. We established expected descent in non-convex environments in the large-gradient regime and introduced a short-term model to examine the dynamics over finite-time horizons. Using this model, we establish in this work that the diffusion strategy is able to escape from strict saddle-points in $O(1/\mu)$ iterations; it is also able to return approximately second-order stationary points in a polynomial number of iterations. Relative to prior works on the polynomial escape from saddle-points, most of which focus on centralized perturbed or stochastic gradient descent, our approach requires less restrictive conditions on the gradient noise process.

48 citations


Journal ArticleDOI
TL;DR: In this paper, the authors study the problem of learning in scenarios involving both large datasets and large-dimensional feature spaces, and propose new and effective distributed solutions with guaranteed convergence to the minimizer at a linear rate under strong convexity.
Abstract: This paper studies the problem of learning in scenarios involving both large datasets and large-dimensional feature spaces. The feature information is assumed to be spread across agents in a network, where each agent observes some of the features. Through local cooperation, the agents are supposed to interact with each other to solve an inference problem and converge towards the global minimizer of an empirical risk. We study this problem exclusively in the primal domain, and propose new and effective distributed solutions with guaranteed convergence to the minimizer at a linear rate under strong convexity. This is achieved by combining a dynamic diffusion construction, a pipeline strategy, and variance-reduced techniques. Simulation results illustrate the conclusions.

48 citations


Journal ArticleDOI
TL;DR: In this paper, the problem of inferring whether an agent is directly influenced by another agent over a network was studied, and it was shown that for any given fraction of observable agents, the interacting and non-interacting agent pairs split into two separate clusters as the network size increases.
Abstract: This paper studies the problem of inferring whether an agent is directly influenced by another agent over a network. Agent $i$ influences agent $j$ if they are connected (according to the network topology), and if agent $j$ uses the data from agent $i$ to update its online learning algorithm. The solution of this inference task is challenging for two main reasons. First, only the output of the learning algorithm is available to the external observer that must perform the inference based on these indirect measurements. Second, only output measurements from a fraction of the network agents are available, with the total number of agents itself also unknown. The main focus of this paper is ascertaining under these demanding conditions whether consistent tomography is possible, namely, whether it is possible to reconstruct the interaction profile of the observable portion of the network, with negligible error as the network size increases. We establish a critical achievability result, namely, that for symmetric combination policies and for any given fraction of observable agents, the interacting and non-interacting agent pairs split into two separate clusters as the network size increases. This remarkable property then enables the application of clustering algorithms to identify the interacting agents influencing the observations. We provide a set of numerical experiments that verify the results for finite network sizes and time horizons. The numerical experiments show that the results hold for asymmetric combination policies as well, which is particularly relevant in the context of causation.

39 citations


Posted Content
TL;DR: This work designs a proximal gradient decentralized algorithm whose fixed point coincides with the desired minimizer, and provides a concise proof that establishes its linear convergence.
Abstract: Decentralized optimization is a powerful paradigm that finds applications in engineering and learning design. This work studies decentralized composite optimization problems with non-smooth regularization terms. Most existing gradient-based proximal decentralized methods are known to converge to the optimal solution with sublinear rates, and it remains unclear whether this family of methods can achieve global linear convergence. To tackle this problem, this work assumes the non-smooth regularization term is common across all networked agents, which is the case for many machine learning problems. Under this condition, we design a proximal gradient decentralized algorithm whose fixed point coincides with the desired minimizer. We then provide a concise proof that establishes its linear convergence. In the absence of the non-smooth term, our analysis technique covers the well-known EXTRA algorithm and provides useful bounds on the convergence rate and step-size.
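
As a rough illustration of the algorithmic template (not the paper's bias-corrected recursion, which is what achieves linear convergence), the sketch below runs a plain proximal diffusion iteration in which every agent applies the proximal map of the common non-smooth term; the term is assumed here to be lam*||w||_1, whose proximal map is soft-thresholding, and the topology and costs are illustrative.

```python
import numpy as np

np.random.seed(2)
N, d, mu, lam = 4, 5, 0.01, 0.1
H = [np.random.randn(20, d) for _ in range(N)]
b = [np.random.randn(20) for _ in range(N)]
A = np.full((N, N), 1.0 / N)             # fully connected averaging weights

def soft_threshold(v, tau):              # prox of tau * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

w = np.zeros((N, d))
for _ in range(5000):
    psi = np.array([w[k] - mu * H[k].T @ (H[k] @ w[k] - b[k])
                    for k in range(N)])    # local gradient step
    w = soft_threshold(A @ psi, mu * lam)  # combine, then shared prox

print(w[0])                              # agents agree on a sparse estimate
```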

34 citations


Proceedings Article
01 May 2019
TL;DR: In this article, a proximal gradient decentralized algorithm whose fixed point coincides with the desired minimizer is proposed, together with a concise proof that establishes its linear convergence; in the absence of the non-smooth term, the analysis also covers the well-known EXTRA algorithm.
Abstract: Decentralized optimization is a powerful paradigm that finds applications in engineering and learning design. This work studies decentralized composite optimization problems with non-smooth regularization terms. Most existing gradient-based proximal decentralized methods are known to converge to the optimal solution with sublinear rates, and it remains unclear whether this family of methods can achieve global linear convergence. To tackle this problem, this work assumes the non-smooth regularization term is common across all networked agents, which is the case for many machine learning problems. Under this condition, we design a proximal gradient decentralized algorithm whose fixed point coincides with the desired minimizer. We then provide a concise proof that establishes its linear convergence. In the absence of the non-smooth term, our analysis technique covers the well-known EXTRA algorithm and provides useful bounds on the convergence rate and step-size.

30 citations


Journal ArticleDOI
TL;DR: In this article, the authors show that random reshuffling outperforms uniform sampling by establishing explicitly that the iterates approach a smaller neighborhood of size $O(\mu^2)$ around the minimizer rather than $O(\mu)$.
Abstract: In empirical risk optimization, it has been observed that stochastic gradient implementations that rely on random reshuffling of the data achieve better performance than implementations that rely on sampling the data uniformly. Recent works have pursued justifications for this behavior by examining the convergence rate of the learning process under diminishing step sizes. This work focuses on the constant step-size case and strongly convex loss functions. In this case, convergence is guaranteed only to a small neighborhood of the optimizer, albeit at a linear rate. The analysis establishes analytically that random reshuffling outperforms uniform sampling by showing explicitly that iterates approach a smaller neighborhood of size $O(\mu^2)$ around the minimizer rather than $O(\mu)$. Furthermore, we derive an analytical expression for the steady-state mean-square-error performance of the algorithm, which helps clarify in greater detail the differences between sampling with and without replacement. We also explain the periodic behavior that is observed in random reshuffling implementations.
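
The claim is easy to probe numerically. The sketch below runs constant step-size SGD on a strongly convex least-squares cost under the two sampling schemes; the problem sizes and step-size are illustrative, and the paper derives the exact steady-state expressions.

```python
import numpy as np

np.random.seed(3)
N, d, mu = 200, 4, 0.01
X, y = np.random.randn(N, d), np.random.randn(N)
w_star = np.linalg.lstsq(X, y, rcond=None)[0]

def run(reshuffle, epochs=500):
    w = np.zeros(d)
    for _ in range(epochs):
        order = (np.random.permutation(N) if reshuffle
                 else np.random.randint(0, N, N))   # with replacement
        for n in order:
            w -= mu * (X[n] @ w - y[n]) * X[n]
    return np.linalg.norm(w - w_star) ** 2          # squared error at epoch end

print("uniform sampling:  ", run(False))  # steady-state error on the order of mu
print("random reshuffling:", run(True))   # noticeably smaller, on the order of mu^2
```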

27 citations


Proceedings ArticleDOI
01 May 2019
TL;DR: Analytical formulas are obtained that reveal how the agents’ detection capability and the network topology interplay to influence the asymptotic beliefs of the agents.
Abstract: We consider a distributed social learning problem where a network of agents is interested in selecting one among a finite number of hypotheses. The data collected by the agents might be heterogeneous, meaning that different sub-networks might observe data generated by different hypotheses. For example, some sub-networks might be receiving (or even intentionally generating) data from a fake hypothesis and will bias the rest of the network via social influence. This work focuses on a two-step diffusion algorithm where each agent: i) first updates its belief function individually using its private data; ii) then computes a new belief function by exponentiating a linear combination of the log-beliefs of its neighbors. We obtain analytical formulas that reveal how the agents’ detection capability and the network topology interplay to influence the asymptotic beliefs of the agents. Some interesting behaviors arise, such as the "mind-control" effect or the "truth-is-somewhere-in-between" effect.
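
The two-step update described above can be sketched compactly. The priors, likelihoods, and combination weights below are illustrative placeholders.

```python
import numpy as np

def local_update(belief, likelihood):
    """Step i): Bayesian update of the belief with the private observation."""
    psi = belief * likelihood
    return psi / psi.sum()

def combine(log_psis, weights):
    """Step ii): exponentiate a linear combination of neighbors' log-beliefs."""
    log_mu = weights @ log_psis
    mu = np.exp(log_mu - log_mu.max())   # subtract max for numerical stability
    return mu / mu.sum()

# Two agents, three hypotheses, one diffusion step.
beliefs = np.full((2, 3), 1 / 3)                 # uniform priors
likelihoods = np.array([[0.6, 0.3, 0.1],         # likelihood of agent 0's data
                        [0.2, 0.5, 0.3]])        # likelihood of agent 1's data
A = np.array([[0.7, 0.3],
              [0.3, 0.7]])                       # combination weights

psis = np.array([local_update(b, l) for b, l in zip(beliefs, likelihoods)])
beliefs = np.array([combine(np.log(psis), A[k]) for k in range(2)])
print(beliefs)                                   # updated beliefs per agent
```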

Proceedings ArticleDOI
01 Jan 2019
TL;DR: Three matrix estimators are proposed, namely, the Granger, the one-lag correlation, and the residual estimators, which, when followed by a universal clustering algorithm, are shown to retrieve the true subgraph in the limit of large network sizes.
Abstract: In this work we consider the problem of learning an Erdős–Rényi graph over a diffusion network when: i) data from only a limited subset of nodes are available (partial observation); and ii) the inferential goal is to discover the graph of interconnections linking the accessible nodes (local structure learning). We propose three matrix estimators, namely, the Granger, the one-lag correlation, and the residual estimators, which, when followed by a universal clustering algorithm, are shown to retrieve the true subgraph in the limit of large network sizes. Remarkably, it is seen that a fundamental role is played by the uniform concentration of node degrees, rather than by sparsity.
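
A sketch of the three estimators, assuming a first-order diffusion model y_i = A y_{i-1} + x_i with unit-variance inputs and access only to a probed subset of nodes. The final thresholding step is a toy stand-in for the universal clustering algorithm, and all model parameters are illustrative.

```python
import numpy as np

np.random.seed(4)
n, T, probed = 30, 20000, slice(0, 10)
W = (np.random.rand(n, n) < 0.2).astype(float)   # Erdos-Renyi support
np.fill_diagonal(W, 0)
A = 0.5 * W / (1 + W.sum(axis=1, keepdims=True)) # stable combination matrix
np.fill_diagonal(A, 0.3)

Y = np.zeros((T, n))
for t in range(1, T):                            # diffusion recursion
    Y[t] = A @ Y[t - 1] + np.random.randn(n)

Z = Y[:, probed]                                 # partial observation
R0 = Z[:-1].T @ Z[:-1] / (T - 1)                 # zero-lag correlation
R1 = Z[1:].T @ Z[:-1] / (T - 1)                  # one-lag correlation

granger  = R1 @ np.linalg.inv(R0)                # Granger estimator
onelag   = R1                                    # one-lag correlation estimator
residual = R1 - R0                               # residual estimator

M = np.abs(granger)
np.fill_diagonal(M, 0)
thr = 0.5 * (M.max() + M[M > 0].min())           # toy two-cluster split
est = (M > thr).astype(int)
print("edge errors:", int((est != W[probed, probed]).sum()))  # often small for long horizons
```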

Journal ArticleDOI
TL;DR: This letter shows how to blend real-time adaptation with graph filtering and a generalized regularization framework, resulting in a graph diffusion strategy for distributed learning over multitask networks.
Abstract: This letter proposes a general regularization framework for inference over multitask networks. The optimization approach relies on minimizing a global cost consisting of the aggregate sum of individual costs regularized by a term that allows the incorporation of global information about the graph structure and the individual parameter vectors into the solution of the inference problem. An adaptive strategy, which responds to streaming data and employs stochastic approximations in place of actual gradient vectors, is devised and studied. Methods allowing the distributed implementation of the regularization step are also discussed. This letter shows how to blend real-time adaptation with graph filtering and a generalized regularization framework, resulting in a graph diffusion strategy for distributed learning over multitask networks.

Posted Content
TL;DR: This work shows that a more relaxed relative bound on the gradient noise variance is sufficient to ensure efficient escape from saddle-points without the need to inject additional noise, employ alternating step-sizes or rely on a global dispersive noise assumption.
Abstract: Recent years have seen increased interest in performance guarantees of gradient descent algorithms for non-convex optimization. A number of works have uncovered that gradient noise plays a critical role in the ability of gradient descent recursions to efficiently escape saddle-points and reach second-order stationary points. Most available works limit the gradient noise component to be bounded with probability one or sub-Gaussian and leverage concentration inequalities to arrive at high-probability results. We present an alternate approach, relying primarily on mean-square arguments and show that a more relaxed relative bound on the gradient noise variance is sufficient to ensure efficient escape from saddle-points without the need to inject additional noise, employ alternating step-sizes or rely on a global dispersive noise assumption, as long as a gradient noise component is present in a descent direction for every saddle-point.

Posted Content
TL;DR: This work examines a distributed learning problem where the agents of a network form their beliefs about certain hypotheses of interest by means of a diffusion strategy and examines the feasibility of topology learning for two useful classes of problems.
Abstract: We consider a social learning problem, where a network of agents is interested in selecting one among a finite number of hypotheses. We focus on weakly-connected graphs where the network is partitioned into a sending part and a receiving part. The data collected by the agents might be heterogeneous. For example, some sub-networks might intentionally generate data from a fake hypothesis in order to influence other agents. The social learning task is accomplished via a diffusion strategy where each agent: i) updates individually its belief using its private data; ii) computes a new belief by exponentiating a linear combination of the log-beliefs of its neighbors. First, we examine what agents learn over weak graphs (social learning problem). We obtain analytical formulas for the beliefs at the different agents, which reveal how the agents' detection capability and the network topology interact to influence the beliefs. In particular, the formulas allow us to predict when a leader-follower behavior is possible, where some sending agents can control the mind of the receiving agents by forcing them to choose a particular hypothesis. Second, we consider the dual or reverse learning problem that reveals how agents learned: given a stream of beliefs collected at a receiving agent, we would like to discover the global influence that any sending component exerts on this receiving agent (topology learning problem). A remarkable and perhaps unexpected interplay between social and topology learning is observed: given $H$ hypotheses and $S$ sending components, topology learning can be feasible when $H\geq S$. The latter being only a necessary condition, we examine the feasibility of topology learning for two useful classes of problems. The analysis reveals that a critical element to enable faithful topology learning is the diversity in the statistical models of the sending sub-networks.

Proceedings ArticleDOI
12 May 2019
TL;DR: This paper proposes an iterative and distributed implementation of the projection step, which runs in parallel with the gradient descent update, and establishes that, for small step-sizes µ, the proposed distributed adaptive strategy leads to small estimation errors on the order of µ.
Abstract: This paper considers optimization problems over networks where agents have individual objectives to meet, or individual parameter vectors to estimate, subject to subspace constraints that enforce the objectives across the network to lie in a low-dimensional subspace. This constrained formulation includes consensus optimization as a special case, and allows for more general task relatedness models such as smoothness. While such formulations can be solved via projected gradient descent, the resulting algorithm is not distributed. Motivated by the centralized solution, we propose an iterative and distributed implementation of the projection step, which runs in parallel with the gradient descent update. We establish that, for small step-sizes µ, the proposed distributed adaptive strategy leads to small estimation errors on the order of µ.
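
The sketch below illustrates the idea in its simplest special case (consensus over scalar parameters): the exact projection onto the constraint subspace is replaced by a combination matrix whose powers converge to that projector, so the projection is realized iteratively and locally. The resulting estimates land within O(µ) of the projected solution, as the abstract states; the topology and costs are illustrative.

```python
import numpy as np

np.random.seed(5)
N, mu = 6, 0.05
U = np.ones((N, 1)) / np.sqrt(N)         # consensus subspace (special case)
P = U @ U.T                              # exact projector onto the subspace

# Doubly stochastic ring: satisfies A @ U = U and A^i -> P.
A = np.zeros((N, N))
for k in range(N):
    A[k, k] = 0.5
    A[k, (k - 1) % N] = A[k, (k + 1) % N] = 0.25

targets = np.random.randn(N)             # J_k(w) = 0.5 * (w - target_k)^2
w = np.zeros(N)
for _ in range(2000):
    w = A @ (w - mu * (w - targets))     # combination step replaces P

print(w)                                 # entries within O(mu) of the average
print(targets.mean())
```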

Proceedings ArticleDOI
01 Jan 2019
TL;DR: This work establishes that the diffusion algorithm continues to yield meaningful estimates in these more challenging, non-convex environments, in the sense that despite the distributed implementation, restricted to local interactions, individual agents cluster in a small region around a common and well-defined vector.
Abstract: Driven by the need to solve increasingly complex optimization problems in signal processing and machine learning, recent years have seen rising interest in the behavior of gradient-descent based algorithms in non-convex environments. Most of the works on distributed non-convex optimization focus on the deterministic setting, where exact gradients are available at each agent. In this work, we consider stochastic cost functions, where exact gradients are replaced by stochastic approximations and the resulting gradient noise persistently seeps into the dynamics of the algorithm. We establish that the diffusion algorithm continues to yield meaningful estimates in these more challenging, non-convex environments, in the sense that (a) despite the distributed implementation, restricted to local interactions, individual agents cluster in a small region around a common and well-defined vector, which will carry the interpretation of a network centroid, and (b) the network centroid inherits many properties of the centralized, stochastic gradient descent recursion, including the return of an O(µ)-mean-square-stationary point in at most O(1/µ²) iterations.

Journal ArticleDOI
TL;DR: In this paper, the authors derived approximate performance expressions for the network nodes that match well with the simulated results for a wide range of system parameters and revealed an important interplay between continuous adaptation under constant step-size learning and the binary nature of the messages exchanged with neighbors.
Abstract: This paper studies the operation of multi-agent networks engaged in binary decision tasks, and derives performance expressions and performance operating curves under challenging conditions with some revealing insights. One of the main challenges in the analysis is that agents are only allowed to exchange one-bit messages, and the information at each agent therefore consists of both continuous and discrete components. Due to this mixed nature, the steady-state distribution of the state of each agent cannot be inferred from direct application of central limit arguments. Instead, the behavior of the continuous component is characterized in integral form by using a log-characteristic function, while the behavior of the discrete component is characterized by means of an asymmetric Bernoulli convolution. By exploiting these results, this paper derives reliable approximate performance expressions for the network nodes that match well with the simulated results for a wide range of system parameters. The results also reveal an important interplay between continuous adaptation under constant step-size learning and the binary nature of the messages exchanged with neighbors.

Posted Content
TL;DR: The analysis reveals that the fundamental property enabling consistent graph learning is the statistical concentration of node degrees, and this claim is proved for three matrix estimators.
Abstract: This work examines the problem of graph learning over a diffusion network when data can be collected from a limited portion of the network (partial observability). The main question is to establish technical guarantees of consistent recovery of the subgraph of probed network nodes, i) despite the presence of unobserved nodes; and ii) under different connectivity regimes, including the dense regime where the probed nodes are influenced by many connections coming from the unobserved ones. We ascertain that suitable estimators of the combination matrix (i.e., the matrix that quantifies the pairwise interaction between nodes) possess an identifiability gap that enables the discrimination between connected and disconnected nodes. Fundamental conditions are established under which the subgraph of monitored nodes can be recovered, with high probability as the network size increases, through universal clustering algorithms. This claim is proved for three matrix estimators: i) the Granger estimator that adapts to the partial observability setting the solution that is exact under full observability; ii) the one-lag correlation matrix; and iii) the residual estimator based on the difference between two consecutive time samples. A detailed characterization of the asymptotic behavior of these estimators is established in terms of an error bias and of the identifiability gap, and a sample complexity analysis is performed to establish how the number of samples scales with the network size to achieve consistent learning. Comparison among the estimators is performed through illustrative examples that show how estimators that are not optimal in the full observability regime can outperform the Granger estimator in the partial observability regime. The analysis reveals that the fundamental property enabling consistent graph learning is the statistical concentration of node degrees.

Proceedings ArticleDOI
01 Jan 2019
TL;DR: A fully decentralized algorithm for policy evaluation with off-policy learning and linear function approximation that achieves linear convergence with $O(1)$ memory requirements and allows all agents to converge to the optimal solution.
Abstract: In this paper we develop a fully decentralized algorithm for policy evaluation with off-policy learning and linear function approximation. The proposed algorithm is of the variance reduced kind and achieves linear convergence with $O(1)$ memory requirements. We consider the case where a collection of agents have distinct and fixed size datasets gathered following different behavior policies (none of which is required to explore the full state space) and they all collaborate to evaluate a common target policy. The network approach allows all agents to converge to the optimal solution even in situations where no agent can converge on its own without cooperation. We provide simulations to illustrate the effectiveness of the method in a Linear Quadratic Regulator (LQR) problem.

Posted Content
26 Mar 2019
TL;DR: It is shown that the correction step in exact diffusion can lead to better steady-state performance than traditional methods in stochastic settings, providing an affirmative answer to whether bias-correction remains beneficial when gradients are noisy.
Abstract: Various bias-correction methods such as EXTRA, gradient tracking methods, and exact diffusion have been proposed recently to solve distributed {\em deterministic} optimization problems. These methods employ constant step-sizes and converge linearly to the {\em exact} solution under proper conditions. However, their performance under stochastic and adaptive settings is less explored. It is still unknown {\em whether}, {\em when} and {\em why} these bias-correction methods can outperform their traditional counterparts (such as consensus and diffusion) with noisy gradients and constant step-sizes. This work studies the performance of exact diffusion under the stochastic and adaptive setting, and provides conditions under which exact diffusion achieves better steady-state mean-square-deviation (MSD) performance than traditional algorithms without bias-correction. In particular, it is proven that this superiority is more evident over sparsely connected network topologies such as lines, cycles, or grids. Conditions are also provided under which the exact diffusion method merely matches, or may even underperform, traditional methods. Simulations are provided to validate the theoretical findings.

Posted Content
TL;DR: In this paper, an iterative and distributed solution was proposed that responds to streaming data and employs stochastic approximations in place of actual gradient vectors, which are generally unavailable.
Abstract: Part I of this paper considered optimization problems over networks where agents have individual objectives to meet, or individual parameter vectors to estimate, subject to subspace constraints that require the objectives across the network to lie in low-dimensional subspaces. Starting from the centralized projected gradient descent, an iterative and distributed solution was proposed that responds to streaming data and employs stochastic approximations in place of actual gradient vectors, which are generally unavailable. We examined the second-order stability of the learning algorithm and we showed that, for small step-sizes $\mu$, the proposed strategy leads to small estimation errors on the order of $\mu$. This Part II examines steady-state performance. The results reveal explicitly the influence of the gradient noise, data characteristics, and subspace constraints, on the network performance. The results also show that in the small step-size regime, the iterates generated by the distributed algorithm achieve the centralized steady-state performance.

Proceedings ArticleDOI
01 Dec 2019
TL;DR: In this article, it is shown that the correction step in exact diffusion can lead to better steady-state performance than traditional methods in stochastic settings, providing an affirmative answer to whether bias-correction remains beneficial when gradients are noisy.
Abstract: Various bias-correction methods such as EXTRA, DIGing, and exact diffusion have been proposed recently to solve distributed deterministic optimization problems. These methods employ constant step-sizes and converge linearly to the exact solution under proper conditions. However, their performance under stochastic and adaptive settings remains unclear. It is still unknown whether bias-correction is beneficial in stochastic settings. By studying exact diffusion and examining its steady-state performance under stochastic scenarios, this paper provides affirmative results. It is shown that the correction step in exact diffusion can lead to better steady-state performance than traditional methods.

Posted Content
TL;DR: In this paper, a primal-dual framework was proposed for non-smooth decentralized multi-agent optimization problems, where the agents aim at minimizing a sum of local strongly-convex smooth components plus a common nonsmooth term.
Abstract: This work studies a class of non-smooth decentralized multi-agent optimization problems where the agents aim at minimizing a sum of local strongly-convex smooth components plus a common non-smooth term. We propose a general primal-dual algorithmic framework that unifies many existing state-of-the-art algorithms. We establish linear convergence of the proposed method to the exact solution in the presence of the non-smooth term. Moreover, for the more general class of problems with agent-specific non-smooth terms, we show that linear convergence cannot be achieved (in the worst case) for the class of algorithms that uses the gradients and the proximal mappings of the smooth and non-smooth parts, respectively. We further provide a numerical counterexample that shows how some state-of-the-art algorithms fail to converge linearly for strongly-convex objectives and different local non-smooth terms.

Posted Content
TL;DR: In this paper, the primal-descent dual-ascent gradient method is revisited for the solution of equality constrained optimization problems and the authors provide a short proof that establishes the linear (exponential) convergence of the algorithm for smooth strongly-convex cost functions and study its relation to the non-incremental implementation.
Abstract: In this work, we revisit a classical incremental implementation of the primal-descent dual-ascent gradient method used for the solution of equality constrained optimization problems. We provide a short proof that establishes the linear (exponential) convergence of the algorithm for smooth strongly-convex cost functions and study its relation to the non-incremental implementation. We also study the effect of the augmented Lagrangian penalty term on the performance of distributed optimization algorithms for the minimization of aggregate cost functions over multi-agent networks.
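
A small runnable sketch of the incremental variant for an equality-constrained strongly convex quadratic, where "incremental" means the dual-ascent step uses the freshly updated primal iterate rather than the previous one. The problem data and step-size are illustrative.

```python
import numpy as np

np.random.seed(6)
d, m, mu = 5, 2, 0.02
Q = np.random.randn(d, d)
Q = Q @ Q.T / d + np.eye(d)              # f(x) = 0.5 x'Qx - p'x, strongly convex
p = np.random.randn(d)
B, c = np.random.randn(m, d), np.random.randn(m)   # constraint Bx = c

x, lam = np.zeros(d), np.zeros(m)
for _ in range(50000):
    x = x - mu * (Q @ x - p + B.T @ lam)  # primal descent on the Lagrangian
    lam = lam + mu * (B @ x - c)          # dual ascent using the updated x

# Compare with the exact KKT solution.
K = np.block([[Q, B.T], [B, np.zeros((m, m))]])
sol = np.linalg.solve(K, np.concatenate([p, c]))
print(np.linalg.norm(x - sol[:d]), np.linalg.norm(B @ x - c))  # both ~ 0
```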

Posted Content
TL;DR: In this paper, a distributed strategy for Pareto optimization of an aggregate cost consisting of regularized risks is developed and studied, where each risk is modeled as the expectation of some loss function with unknown probability distribution while the regularizers are assumed deterministic.
Abstract: The purpose of this work is to develop and study a distributed strategy for Pareto optimization of an aggregate cost consisting of regularized risks. Each risk is modeled as the expectation of some loss function with unknown probability distribution while the regularizers are assumed deterministic, but are not required to be differentiable or even continuous. The individual, regularized, cost functions are distributed across a strongly-connected network of agents and the Pareto optimal solution is sought by appealing to a multi-agent diffusion strategy. To this end, the regularizers are smoothed by means of infimal convolution and it is shown that the Pareto solution of the approximate, smooth problem can be made arbitrarily close to the solution of the original, non-smooth problem. Performance bounds are established under conditions that are weaker than assumed before in the literature, and hence applicable to a broader class of adaptation and learning problems.
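
In its simplest instance, the smoothing device mentioned above (infimal convolution with a scaled quadratic, i.e., the Moreau envelope) turns the absolute-value regularizer into the Huber function. The sketch below verifies that the approximation gap vanishes with the smoothing parameter delta, mirroring the statement that the smoothed solution can be made arbitrarily close to the original one.

```python
import numpy as np

def huber(x, delta):
    """Moreau envelope of |x|: inf_u ( |u| + (x - u)^2 / (2*delta) )."""
    return np.where(np.abs(x) <= delta,
                    x ** 2 / (2 * delta),
                    np.abs(x) - delta / 2)

def huber_grad(x, delta):
    """Gradient of the envelope: usable inside a diffusion strategy."""
    return np.clip(x / delta, -1.0, 1.0)

x = np.linspace(-2, 2, 401)
for delta in (1.0, 0.1, 0.01):
    gap = np.max(np.abs(huber(x, delta) - np.abs(x)))
    print(delta, gap)                    # gap = delta / 2 -> 0
```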

Posted Content
TL;DR: The article surveys recent advances on this challenging learning problem and related questions: Despite the presence of unobserved nodes, can partial observations still be sufficient to discover the graph linking the probed nodes?
Abstract: Many optimization, inference and learning tasks can be accomplished efficiently by means of decentralized processing algorithms where the network topology (i.e., the graph) plays a critical role in enabling the interactions among neighboring nodes. There is a large body of literature examining the effect of the graph structure on the performance of decentralized processing strategies. In this article, we examine the inverse problem and consider the reverse question: How much information does observing the behavior at the nodes of a graph convey about the underlying topology? For large-scale networks, the difficulty in addressing such inverse problems is compounded by the fact that usually only a limited fraction of the nodes can be probed, giving rise to a second important question: Despite the presence of unobserved nodes, can partial observations still be sufficient to discover the graph linking the probed nodes? The article surveys recent advances on this challenging learning problem and related questions.

Journal ArticleDOI
26 Sep 2019
TL;DR: In this article, the authors derived and analyzed an online learning strategy for tracking the average of time-varying distributed signals by relying on randomized coordinate-descent updates, and provided a convergence analysis for the proposed methods.
Abstract: This work derives and analyzes an online learning strategy for tracking the average of time-varying distributed signals by relying on randomized coordinate-descent updates. During each iteration, each agent selects or observes a random entry of the observation vector, and different agents may select different entries of their observations before engaging in a consultation step. Careful coordination of the interactions among agents is necessary to avoid bias and ensure convergence. We provide a convergence analysis for the proposed methods, and illustrate the results by means of simulations.

Proceedings ArticleDOI
01 May 2019
TL;DR: A fully distributed algorithm for learning the optimal policy in a multi-agent cooperative reinforcement learning scenario that is of the stochastic primal-dual kind and can be shown to converge even when used in conjunction with a wide class of function approximators.
Abstract: This work presents a fully distributed algorithm for learning the optimal policy in a multi-agent cooperative reinforcement learning scenario. We focus on games that can only be solved through coordinated team work. We consider situations in which $K$ players interact simultaneously with an environment and with each other to attain a common goal. In the algorithm, agents only communicate with other agents in their immediate neighborhood and choose their actions independently of one another based only on local information. Learning is done off-policy, which results in high data efficiency. The proposed algorithm is of the stochastic primal-dual kind and can be shown to converge even when used in conjunction with a wide class of function approximators.