scispace - formally typeset
Search or ask a question

Showing papers in "arXiv: Learning in 2008"


Posted Content
TL;DR: In this paper, the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS) is defined, and the test statistic can be computed in quadratic time, although efficient linear time approximations are available.
Abstract: We propose a framework for analyzing and comparing distributions, allowing us to design statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS). We present two tests based on large deviation bounds for the test statistic, while a third is based on the asymptotic distribution of this statistic. The test statistic can be computed in quadratic time, although efficient linear time approximations are available. Several classical metrics on distributions are recovered when the function space used to compute the difference in expectations is allowed to be more general (eg. a Banach space). We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.

1,259 citations


Posted Content
TL;DR: In this paper, it was shown that a concept class is learnable by a local algorithm if and only if it is learnedable in the statistical query (SQ) model.
Abstract: Learning problems form an important category of computational tasks that generalizes many of the computations researchers apply to large real-life data sets. We ask: what concept classes can be learned privately, namely, by an algorithm whose output does not depend too heavily on any one input or specific training example? More precisely, we investigate learning algorithms that satisfy differential privacy, a notion that provides strong confidentiality guarantees in contexts where aggregate information is released about a database containing sensitive information about individuals. We demonstrate that, ignoring computational constraints, it is possible to privately agnostically learn any concept class using a sample size approximately logarithmic in the cardinality of the concept class. Therefore, almost anything learnable is learnable privately: specifically, if a concept class is learnable by a (non-private) algorithm with polynomial sample complexity and output size, then it can be learned privately using a polynomial number of samples. We also present a computationally efficient private PAC learner for the class of parity functions. Local (or randomized response) algorithms are a practical class of private algorithms that have received extensive investigation. We provide a precise characterization of local private learning algorithms. We show that a concept class is learnable by a local algorithm if and only if it is learnable in the statistical query (SQ) model. Finally, we present a separation between the power of interactive and noninteractive local learning algorithms.

537 citations


Posted Content
TL;DR: This work proposes a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss and finds for datasets with large numbers of features, substantial sparsity is discoverable.
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of online learning algorithms with convex loss functions. This method has several essential properties: The degree of sparsity is continuous -- a parameter controls the rate of sparsification from no sparsification to total sparsification. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular $L_1$-regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online learning guarantees. The approach works well empirically. We apply the approach to several datasets and find that for datasets with large numbers of features, substantial sparsity is discoverable.

411 citations


Posted Content
TL;DR: A new spectral norm is designed that encodes this a priori assumption that tasks are clustered into groups, which are unknown beforehand, and that tasks within a group have similar weight vectors, resulting in a new convex optimization formulation for multi-task learning.
Abstract: In multi-task learning several related tasks are considered simultaneously, with the hope that by an appropriate sharing of information across tasks, each task may benefit from the others. In the context of learning linear functions for supervised classification or regression, this can be achieved by including a priori information about the weight vectors associated with the tasks, and how they are expected to be related to each other. In this paper, we assume that tasks are clustered into groups, which are unknown beforehand, and that tasks within a group have similar weight vectors. We design a new spectral norm that encodes this a priori assumption, without the prior knowledge of the partition of tasks into groups, resulting in a new convex optimization formulation for multi-task learning. We show in simulations on synthetic examples and on the IEDB MHC-I binding dataset, that our approach outperforms well-known convex methods for multi-task learning, as well as related non convex methods dedicated to the same problem.

409 citations


Posted Content
TL;DR: This paper explicitly map every bag to an undirected graph and design a graph kernel for distinguishing the positive and negative bags and implicitly construct graphs by deriving affinity matrices and propose an efficient graph kernel considering the clique information.
Abstract: Multi-instance learning attempts to learn from a training set consisting of labeled bags each containing many unlabeled instances. Previous studies typically treat the instances in the bags as independently and identically distributed. However, the instances in a bag are rarely independent, and therefore a better performance can be expected if the instances are treated in an non-i.i.d. way that exploits the relations among instances. In this paper, we propose a simple yet effective multi-instance learning method, which regards each bag as a graph and uses a specific kernel to distinguish the graphs by considering the features of the nodes as well as the features of the edges that convey some relations among instances. The effectiveness of the proposed method is validated by experiments.

368 citations


Posted Content
TL;DR: A theoretical analysis of sample selection bias correction based on the novel concept of distributional stability which generalizes the existing concept of point-based stability and can be used to analyze other importance weighting techniques and their effect on accuracy when using a distributionally stable algorithm.
Abstract: This paper presents a theoretical analysis of sample selection bias correction. The sample bias correction technique commonly used in machine learning consists of reweighting the cost of an error on each training point of a biased sample to more closely reflect the unbiased distribution. This relies on weights derived by various estimation techniques based on finite samples. We analyze the effect of an error in that estimation on the accuracy of the hypothesis returned by the learning algorithm for two estimation techniques: a cluster-based estimation technique and kernel mean matching. We also report the results of sample bias correction experiments with several data sets using these techniques. Our analysis is based on the novel concept of distributional stability which generalizes the existing concept of point-based stability. Much of our work and proof techniques can be used to analyze other importance weighting techniques and their effect on accuracy when using a distributionally stable algorithm.

255 citations


Posted Content
TL;DR: The extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.
Abstract: For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the l1-norm or the block l1-norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to non linear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.

233 citations


Posted Content
TL;DR: This work presents a convex formulation of dictionary learning for sparse signal decomposition that introduces an explicit trade-off between size and sparsity of the decomposition of rectangular matrices and compares the estimation abilities of the convex and nonconvex approaches.
Abstract: We present a convex formulation of dictionary learning for sparse signal decomposition. Convexity is obtained by replacing the usual explicit upper bound on the dictionary size by a convex rank-reducing term similar to the trace norm. In particular, our formulation introduces an explicit trade-off between size and sparsity of the decomposition of rectangular matrices. Using a large set of synthetic examples, we compare the estimation abilities of the convex and non-convex approaches, showing that while the convex formulation has a single local minimum, this may lead in some cases to performance which is inferior to the local minima of the non-convex formulation.

148 citations


Posted Content
TL;DR: In this article, a unified and elementary treatment is given in these noise settings for two $\ell_1$ minimization methods: the Dantzig selector and ''ell_2'' minimization with an ''ell'' constraint.
Abstract: This article considers constrained $\ell_1$ minimization methods for the recovery of high dimensional sparse signals in three settings: noiseless, bounded error and Gaussian noise. A unified and elementary treatment is given in these noise settings for two $\ell_1$ minimization methods: the Dantzig selector and $\ell_1$ minimization with an $\ell_2$ constraint. The results of this paper improve the existing results in the literature by weakening the conditions and tightening the error bounds. The improvement on the conditions shows that signals with larger support can be recovered accurately. This paper also establishes connections between restricted isometry property and the mutual incoherence property. Some results of Candes, Romberg and Tao (2006) and Donoho, Elad, and Temlyakov (2006) are extended.

135 citations


Posted Content
TL;DR: In this paper, a polynomial-time algorithm for the exact computation of lowest energy (ground) states, worst margin violators, log partition functions, and marginal edge probabilities in certain binary undirected graphical models is presented.
Abstract: We give polynomial-time algorithms for the exact computation of lowest-energy (ground) states, worst margin violators, log partition functions, and marginal edge probabilities in certain binary undirected graphical models. Our approach provides an interesting alternative to the well-known graph cut paradigm in that it does not impose any submodularity constraints; instead we require planarity to establish a correspondence with perfect matchings (dimer coverings) in an expanded dual graph. We implement a unified framework while delegating complex but well-understood subproblems (planar embedding, maximum-weight perfect matching) to established algorithms for which efficient implementations are freely available. Unlike graph cut methods, we can perform penalized maximum-likelihood as well as maximum-margin parameter estimation in the associated conditional random fields (CRFs), and employ marginal posterior probabilities as well as maximum a posteriori (MAP) states for prediction. Maximum-margin CRF parameter estimation on image denoising and segmentation problems shows our approach to be efficient and effective. A C++ implementation is available from this http URL

80 citations


Posted Content
TL;DR: In this article, a unified framework for graph kernels is presented, which includes graph diffusion kernels, marginalized graph kernels, and geometric kernel on graphs, special cases of which include the random walk graph kernel and regularization on graphs.
Abstract: We present a unified framework to study graph kernels, special cases of which include the random walk graph kernel \citep{GaeFlaWro03,BorOngSchVisetal05}, marginalized graph kernel \citep{KasTsuIno03,KasTsuIno04,MahUedAkuPeretal04}, and geometric kernel on graphs \citep{Gaertner02} Through extensions of linear algebra to Reproducing Kernel Hilbert Spaces (RKHS) and reduction to a Sylvester equation, we construct an algorithm that improves the time complexity of kernel computation from $O(n^6)$ to $O(n^3)$ When the graphs are sparse, conjugate gradient solvers or fixed-point iterations bring our algorithm into the sub-cubic domain Experiments on graphs from bioinformatics and other application domains show that it is often more than a thousand times faster than previous approaches We then explore connections between diffusion kernels \citep{KonLaf02}, regularization on graphs \citep{SmoKon03}, and graph kernels, and use these connections to propose new graph kernels Finally, we show that rational kernels \citep{CorHafMoh02,CorHafMoh03,CorHafMoh04} when specialized to graphs reduce to the random walk graph kernel

Posted Content
TL;DR: It is proved that the optimal assignment kernel, proposed recently as an attempt to embed labeled graphs and more generally tuples of basic data to a Hilbert space, is in fact not always positive definite.
Abstract: We prove that the optimal assignment kernel, proposed recently as an attempt to embed labeled graphs and more generally tuples of basic data to a Hilbert space, is in fact not always positive definite

Posted Content
TL;DR: An extension of principal component analysis (PCA) and a new algorithm for clustering points in \Rn based on it that is affine-invariant and nearly the best possible is presented, improving known results substantially.
Abstract: We present a new algorithm for clustering points in R^n. The key property of the algorithm is that it is affine-invariant, i.e., it produces the same partition for any affine transformation of the input. It has strong guarantees when the input is drawn from a mixture model. For a mixture of two arbitrary Gaussians, the algorithm correctly classifies the sample assuming only that the two components are separable by a hyperplane, i.e., there exists a halfspace that contains most of one Gaussian and almost none of the other in probability mass. This is nearly the best possible, improving known results substantially. For k > 2 components, the algorithm requires only that there be some (k-1)-dimensional subspace in which the emoverlap in every direction is small. Here we define overlap to be the ratio of the following two quantities: 1) the average squared distance between a point and the mean of its component, and 2) the average squared distance between a point and the mean of the mixture. The main result may also be stated in the language of linear discriminant analysis: if the standard Fisher discriminant is small enough, labels are not needed to estimate the optimal subspace for projection. Our main tools are isotropic transformation, spectral projection and a simple reweighting technique. We call this combination isotropic PCA.

Posted Content
TL;DR: This article used importance weighting to correct sampling bias, and by controlling the variance, they were able to give rigorous label complexity bounds for the learning process and showed that this approach reduces the label complexity required to achieve good predictive performance on many learning problems.
Abstract: We present a practical and statistically consistent scheme for actively learning binary classifiers under general loss functions. Our algorithm uses importance weighting to correct sampling bias, and by controlling the variance, we are able to give rigorous label complexity bounds for the learning process. Experiments on passively labeled data show that this approach reduces the label complexity required to achieve good predictive performance on many learning problems.

Posted Content
TL;DR: It is proved that under a natural separation condition (bounds on the smallest singular value of the HMM parameters), there is an efficient and provably correct algorithm for learning HMMs.
Abstract: Hidden Markov Models (HMMs) are one of the most fundamental and widely used statistical tools for modeling discrete time series. In general, learning HMMs from data is computationally hard (under cryptographic assumptions), and practitioners typically resort to search heuristics which suffer from the usual local optima issues. We prove that under a natural separation condition (bounds on the smallest singular value of the HMM parameters), there is an efficient and provably correct algorithm for learning HMMs. The sample complexity of the algorithm does not explicitly depend on the number of distinct (discrete) observations---it implicitly depends on this quantity through spectral properties of the underlying HMM. This makes the algorithm particularly applicable to settings with a large number of observations, such as those in natural language processing where the space of observation is sometimes the words in a language. The algorithm is also simple, employing only a singular value decomposition and matrix multiplications.

Posted Content
TL;DR: In this article, a general framework of semi-supervised dimensionality reduction for manifold learning is presented, which naturally generalizes existing supervised and unsupervised learning frameworks which apply the spectral decomposition.
Abstract: We present a general framework of semi-supervised dimensionality reduction for manifold learning which naturally generalizes existing supervised and unsupervised learning frameworks which apply the spectral decomposition. Algorithms derived under our framework are able to employ both labeled and unlabeled examples and are able to handle complex problems where data form separate clusters of manifolds. Our framework offers simple views, explains relationships among existing frameworks and provides further extensions which can improve existing algorithms. Furthermore, a new semi-supervised kernelization framework called ``KPCA trick'' is proposed to handle non-linear problems.

Posted Content
TL;DR: In this paper entropy based methods are compared and used to measure structural diversity of an ensemble of 21 classifiers, mainly applied in ecology, whereby species counts are used as a measure of diversity.
Abstract: In this paper entropy based methods are compared and used to measure structural diversity of an ensemble of 21 classifiers. This measure is mostly applied in ecology, whereby species counts are used as a measure of diversity. The measures used were Shannon entropy, Simpsons and the Berger Parker diversity indexes. As the diversity indexes increased so did the accuracy of the ensemble. An ensemble dominated by classifiers with the same structure produced poor accuracy. Uncertainty rule from information theory was also used to further define diversity. Genetic algorithms were used to find the optimal ensemble by using the diversity indices as the cost function. The method of voting was used to aggregate the decisions.

Posted Content
TL;DR: This paper presents a detailed asymptotic analysis of model consistency of the Lasso, and presents a novel variable selection algorithm, referred to as the Bolasso, which is compared favorably to other linear regression methods on synthetic data and datasets from the UCI machine learning repository.
Abstract: We consider the least-square linear regression problem with regularization by the l1-norm, a problem usually referred to as the Lasso. In this paper, we present a detailed asymptotic analysis of model consistency of the Lasso. For various decays of the regularization parameter, we compute asymptotic equivalents of the probability of correct model selection (i.e., variable selection). For a specific rate decay, we show that the Lasso selects all the variables that should enter the model with probability tending to one exponentially fast, while it selects all other variables with strictly positive probability. We show that this property implies that if we run the Lasso for several bootstrapped replications of a given sample, then intersecting the supports of the Lasso bootstrap estimates leads to consistent model selection. This novel variable selection algorithm, referred to as the Bolasso, is compared favorably to other linear regression methods on synthetic data and datasets from the UCI machine learning repository.

Posted Content
Alina Beygelzimer1, John Langford2
TL;DR: The Offset Tree is an optimal reduction to binary classification, allowing one to reuse any existing, fully supervised binary classification algorithm in this partial information setting, and is also computationally optimal, both at training and test time.
Abstract: We present an algorithm, called the Offset Tree, for learning to make decisions in situations where the payoff of only one choice is observed, rather than all choices. The algorithm reduces this setting to binary classification, allowing one to reuse of any existing, fully supervised binary classification algorithm in this partial information setting. We show that the Offset Tree is an optimal reduction to binary classification. In particular, it has regret at most $(k-1)$ times the regret of the binary classifier it uses (where $k$ is the number of choices), and no reduction to binary classification can do better. This reduction is also computationally optimal, both at training and test time, requiring just $O(\log_2 k)$ work to train on an example or make a prediction. Experiments with the Offset Tree show that it generally performs better than several alternative approaches.

Posted Content
TL;DR: The authors used text from news articles to predict intraday price movements of financial assets using support vector machines and developed an analytic center cutting plane method to solve the kernel learning problem efficiently.
Abstract: We show how text from news articles can be used to predict intraday price movements of financial assets using support vector machines. Multiple kernel learning is used to combine equity returns with text as predictive features to increase classification performance and we develop an analytic center cutting plane method to solve the kernel learning problem efficiently. We observe that while the direction of returns is not predictable using either text or returns, their size is, with text features producing significantly better performance than historical returns alone.

Posted Content
TL;DR: This work introduces an efficient parallel implementation of an support vector regression solver, based on the Gaussian Belief Propagation algorithm (GaBP), demonstrating the applicability of this algorithm for large scale distributed computing systems.
Abstract: Support vector machines (SVMs) are an extremely successful type of classification and regression algorithms. Building an SVM entails solving a constrained convex quadratic programming problem, which is quadratic in the number of training samples. We introduce an efficient parallel implementation of an support vector regression solver, based on the Gaussian Belief Propagation algorithm (GaBP). In this paper, we demonstrate that methods from the complex system domain could be utilized for performing efficient distributed computation. We compare the proposed algorithm to previously proposed distributed and single-node SVM solvers. Our comparison shows that the proposed algorithm is just as accurate as these solvers, while being significantly faster, especially for large datasets. We demonstrate scalability of the proposed algorithm to up to 1,024 computing nodes and hundreds of thousands of data points using an IBM Blue Gene supercomputer. As far as we know, our work is the largest parallel implementation of belief propagation ever done, demonstrating the applicability of this algorithm for large scale distributed computing systems.

Posted Content
TL;DR: In this paper, the regret and Bayes risk of a large collection of arms is minimized by a policy that alternates between exploration and exploitation phases, and the phase-based policy is also shown to be effective if the set of arms satisfies a strong convexity condition.
Abstract: We consider bandit problems involving a large (possibly infinite) collection of arms, in which the expected reward of each arm is a linear function of an $r$-dimensional random vector $\mathbf{Z} \in \mathbb{R}^r$, where $r \geq 2$. The objective is to minimize the cumulative regret and Bayes risk. When the set of arms corresponds to the unit sphere, we prove that the regret and Bayes risk is of order $\Theta(r \sqrt{T})$, by establishing a lower bound for an arbitrary policy, and showing that a matching upper bound is obtained through a policy that alternates between exploration and exploitation phases. The phase-based policy is also shown to be effective if the set of arms satisfies a strong convexity condition. For the case of a general set of arms, we describe a near-optimal policy whose regret and Bayes risk admit upper bounds of the form $O(r \sqrt{T} \log^{3/2} T)$.

Journal ArticleDOI
TL;DR: In this article, the authors present the current state of a work in progress, whose objective is to better understand the effects of factors that significantly influence the performance of Latent Semantic Analysis (LSA).
Abstract: This paper presents the current state of a work in progress, whose objective is to better understand the effects of factors that significantly influence the performance of Latent Semantic Analysis (LSA). A difficult task, which consists in answering (French) biology Multiple Choice Questions, is used to test the semantic properties of the truncated singular space and to study the relative influence of main parameters. A dedicated software has been designed to fine tune the LSA semantic space for the Multiple Choice Questions task. With optimal parameters, the performances of our simple model are quite surprisingly equal or superior to those of 7th and 8th grades students. This indicates that semantic spaces were quite good despite their low dimensions and the small sizes of training data sets. Besides, we present an original entropy global weighting of answers' terms of each question of the Multiple Choice Questions which was necessary to achieve the model's success.

Posted Content
TL;DR: In this paper, the equivalence of robustness and regularization was shown to be equivalent to a robust optimization formulation for regularized support vector machines (SVMs), which has implications for both algorithms and analysis.
Abstract: We consider regularized support vector machines (SVMs) and show that they are precisely equivalent to a new robust optimization formulation. We show that this equivalence of robust optimization and regularization has implications for both algorithms, and analysis. In terms of algorithms, the equivalence suggests more general SVM-like algorithms for classification that explicitly build in protection to noise, and at the same time control overfitting. On the analysis front, the equivalence of robustness and regularization, provides a robust optimization interpretation for the success of regularized SVMs. We use the this new robustness interpretation of SVMs to give a new proof of consistency of (kernelized) SVMs, thus establishing robustness as the reason regularized SVMs generalize well.

Posted Content
TL;DR: This work proposes two natural learning paradigms and proves that, on input random samples generated i.i.d. by any distribution, they are guaranteed to converge to the optimal separator for that distribution.
Abstract: We define a novel, basic, unsupervised learning problem - learning the lowest density homogeneous hyperplane separator of an unknown probability distribution. This task is relevant to several problems in machine learning, such as semi-supervised learning and clustering stability. We investigate the question of existence of a universally consistent algorithm for this problem. We propose two natural learning paradigms and prove that, on input unlabeled random samples generated by any member of a rich family of distributions, they are guaranteed to converge to the optimal separator for that distribution. We complement this result by showing that no learning algorithm for our task can achieve uniform learning rates (that are independent of the data generating distribution).

Journal ArticleDOI
TL;DR: In this article, a client-server architecture to simultaneously solve multiple learning tasks from distributed datasets is described, where each client is associated with an individual learning task and the associated dataset of examples, and the role of the server is to collect data in realtime from the clients and codify the information in a common database.
Abstract: A client-server architecture to simultaneously solve multiple learning tasks from distributed datasets is described. In such architecture, each client is associated with an individual learning task and the associated dataset of examples. The goal of the architecture is to perform information fusion from multiple datasets while preserving privacy of individual data. The role of the server is to collect data in real-time from the clients and codify the information in a common database. The information coded in this database can be used by all the clients to solve their individual learning task, so that each client can exploit the informative content of all the datasets without actually having access to private data of others. The proposed algorithmic framework, based on regularization theory and kernel methods, uses a suitable class of mixed effect kernels. The new method is illustrated through a simulated music recommendation system.

Posted Content
TL;DR: Three popular learners, namely, "neighborhood component analysis", "large margin nearest neighbors" and "discriminant neighborhood embedding", which do not have kernel versions are kernelized in order to improve their classification performances.
Abstract: This paper focuses on the problem of kernelizing an existing supervised Mahalanobis distance learner. The following features are included in the paper. Firstly, three popular learners, namely, "neighborhood component analysis", "large margin nearest neighbors" and "discriminant neighborhood embedding", which do not have kernel versions are kernelized in order to improve their classification performances. Secondly, an alternative kernelization framework called "KPCA trick" is presented. Implementing a learner in the new framework gains several advantages over the standard framework, e.g. no mathematical formulas and no reprogramming are required for a kernel implementation, the framework avoids troublesome problems such as singularity, etc. Thirdly, while the truths of representer theorems are just assumptions in previous papers related to ours, here, representer theorems are formally proven. The proofs validate both the kernel trick and the KPCA trick in the context of Mahalanobis distance learning. Fourthly, unlike previous works which always apply brute force methods to select a kernel, we investigate two approaches which can be efficiently adopted to construct an appropriate kernel for a given dataset. Finally, numerical results on various real-world datasets are presented.

Journal ArticleDOI
TL;DR: This paper combines the quantum game with the problem of data clustering, and then develops a quantum-game-based clustering algorithm, in which data points in a dataset are considered as players who can make decisions and implement quantum strategies in quantum games.
Abstract: Enormous successes have been made by quantum algorithms during the last decade. In this paper, we combine the quantum game with the problem of data clustering, and then develop a quantum-game-based clustering algorithm, in which data points in a dataset are considered as players who can make decisions and implement quantum strategies in quantum games. After each round of a quantum game, each player's expected payoff is calculated. Later, he uses a link-removing-and-rewiring (LRR) function to change his neighbors and adjust the strength of links connecting to them in order to maximize his payoff. Further, algorithms are discussed and analyzed in two cases of strategies, two payoff matrixes and two LRR functions. Consequently, the simulation results have demonstrated that data points in datasets are clustered reasonably and efficiently, and the clustering algorithms have fast rates of convergence. Moreover, the comparison with other algorithms also provides an indication of the effectiveness of the proposed approach.

Posted Content
TL;DR: Two online variants of the basic CEM are provided, together with a proof of convergence, of the cross-entropy method, a simple but efficient method for global optimization.
Abstract: The cross-entropy method is a simple but efficient method for global optimization. In this paper we provide two online variants of the basic CEM, together with a proof of convergence.

Posted Content
TL;DR: Novel and distinct stability-based generalization bounds for stationary phi-mixing and beta- Mixing sequences are proved, which can be viewed as the first theoretical basis for the use of these algorithms in non-i.i.d. scenarios.
Abstract: Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive tight generalization bounds that are tailored to specific learning algorithms by exploiting their particular properties. However, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed. In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence. This paper studies the scenario where the observations are drawn from a stationary phi-mixing or beta-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a dependence between observations weakening over time. We prove novel and distinct stability-based generalization bounds for stationary phi-mixing and beta-mixing sequences. These bounds strictly generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby extending the use of stability-bounds to non-i.i.d. scenarios. We also illustrate the application of our phi-mixing generalization bounds to general classes of learning algorithms, including Support Vector Regression, Kernel Ridge Regression, and Support Vector Machines, and many other kernel regularization-based and relative entropy-based regularization algorithms. These novel bounds can thus be viewed as the first theoretical basis for the use of these algorithms in non-i.i.d. scenarios.