Showing papers in "arXiv: Learning in 2008"

Posted Content•

A Kernel Method for the Two-Sample Problem

Arthur Gretton, Karsten M. Borgwardt¹, Malte J. Rasch², Bernhard Schölkopf, Alexander J. Smola³•Institutions (3)

Ludwig Maximilian University of Munich¹, Graz University of Technology², NICTA³

15 May 2008-arXiv: Learning

TL;DR: This paper proposes a two-sample test whose statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS); the statistic can be computed in quadratic time, although efficient linear-time approximations are available.

Abstract: We propose a framework for analyzing and comparing distributions, allowing us to design statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS). We present two tests based on large deviation bounds for the test statistic, while a third is based on the asymptotic distribution of this statistic. The test statistic can be computed in quadratic time, although efficient linear time approximations are available. Several classical metrics on distributions are recovered when the function space used to compute the difference in expectations is allowed to be more general (e.g., a Banach space). We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.

1,259 citations
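
The quadratic-time statistic described in this entry can be illustrated with a minimal sketch. It assumes a Gaussian kernel and the biased estimate of the squared statistic; the function names and the bandwidth `sigma` are illustrative choices, not taken from the paper.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel between two equal-length vectors (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2_biased(xs, ys, kernel=gaussian_kernel):
    """Biased estimate of the squared kernel two-sample statistic.

    Every pair of points is visited once per term, so the cost is quadratic
    in the sample sizes.
    """
    m, n = len(xs), len(ys)
    kxx = sum(kernel(a, b) for a in xs for b in xs) / (m * m)
    kyy = sum(kernel(a, b) for a in ys for b in ys) / (n * n)
    kxy = sum(kernel(a, b) for a in xs for b in ys) / (m * n)
    return kxx + kyy - 2.0 * kxy
```

Calling `mmd2_biased` on two samples from the same distribution should give a value near zero, while well-separated samples give a clearly positive value. Turning the value into a test still requires a threshold, which this sketch does not compute.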

Posted Content•

What Can We Learn Privately

Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, Adam Smith

06 Mar 2008-arXiv: Learning

TL;DR: This paper shows that a concept class is learnable by a local algorithm if and only if it is learnable in the statistical query (SQ) model.

Abstract: Learning problems form an important category of computational tasks that generalizes many of the computations researchers apply to large real-life data sets. We ask: what concept classes can be learned privately, namely, by an algorithm whose output does not depend too heavily on any one input or specific training example? More precisely, we investigate learning algorithms that satisfy differential privacy, a notion that provides strong confidentiality guarantees in contexts where aggregate information is released about a database containing sensitive information about individuals. We demonstrate that, ignoring computational constraints, it is possible to privately agnostically learn any concept class using a sample size approximately logarithmic in the cardinality of the concept class. Therefore, almost anything learnable is learnable privately: specifically, if a concept class is learnable by a (non-private) algorithm with polynomial sample complexity and output size, then it can be learned privately using a polynomial number of samples. We also present a computationally efficient private PAC learner for the class of parity functions. Local (or randomized response) algorithms are a practical class of private algorithms that have received extensive investigation. We provide a precise characterization of local private learning algorithms. We show that a concept class is learnable by a local algorithm if and only if it is learnable in the statistical query (SQ) model. Finally, we present a separation between the power of interactive and noninteractive local learning algorithms.

537 citations
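
The local (randomized response) algorithms mentioned in this entry can be illustrated with the classic one-bit mechanism. This is a minimal sketch of textbook randomized response, not the paper's construction; the function names are hypothetical.

```python
import math
import random


def randomize_bit(bit, epsilon, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), otherwise flip it."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < p_truth else 1 - bit


def estimate_mean(reports, epsilon):
    """Debias the average of randomized reports to estimate the true mean of the bits."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    # E[report] = (1 - p) + true_mean * (2p - 1), so invert that affine map.
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)
```

Each individual randomizes their own bit before sending it, so the collector never sees raw data. The aggregate is still recoverable because the noise has a known distribution.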

Posted Content•

Sparse Online Learning via Truncated Gradient

John Langford, Lihong Li, Tong Zhang

28 Jun 2008-arXiv: Learning

TL;DR: This work proposes a general method called truncated gradient to induce sparsity in the weights of online learning algorithms with convex loss functions, and finds that for datasets with large numbers of features, substantial sparsity is discoverable.

Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of online learning algorithms with convex loss functions. This method has several essential properties: The degree of sparsity is continuous -- a parameter controls the rate of sparsification from no sparsification to total sparsification. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular $L_1$-regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online learning guarantees. The approach works well empirically. We apply the approach to several datasets and find that for datasets with large numbers of features, substantial sparsity is discoverable.

411 citations
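
The truncation idea in this entry can be sketched as follows: take an ordinary gradient step, then shrink small weights toward zero by a fixed amount without letting them cross zero. The parameter names (`eta` for the learning rate, `g` for the sparsification rate, `theta` for the truncation threshold, `K` for the truncation period) are assumptions made for this sketch.

```python
def truncate(v, alpha, theta):
    """Shrink v toward zero by alpha if |v| <= theta; leave larger weights untouched."""
    if 0.0 <= v <= theta:
        return max(0.0, v - alpha)
    if -theta <= v < 0.0:
        return min(0.0, v + alpha)
    return v


def truncated_gradient_step(w, grad, eta, g, theta, step, K=1):
    """One online update: gradient step, then truncation on every K-th step."""
    w = [wj - eta * gj for wj, gj in zip(w, grad)]
    if step % K == 0:
        # Truncating every K steps uses an amount scaled by K to compensate.
        w = [truncate(wj, eta * K * g, theta) for wj in w]
    return w
```

Setting `g = 0` recovers plain online gradient descent, and larger `g` drives more weights exactly to zero, which matches the continuous sparsity control described in the abstract.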

Posted Content•

Clustered Multi-Task Learning: A Convex Formulation

Laurent Jacob, Francis Bach¹, Jean-Philippe Vert•Institutions (1)

French Institute for Research in Computer Science and Automation¹

11 Sep 2008-arXiv: Learning

TL;DR: A new spectral norm is designed that encodes the a priori assumption that tasks are clustered into groups, which are unknown beforehand, and that tasks within a group have similar weight vectors, resulting in a new convex optimization formulation for multi-task learning.

Abstract: In multi-task learning several related tasks are considered simultaneously, with the hope that by an appropriate sharing of information across tasks, each task may benefit from the others. In the context of learning linear functions for supervised classification or regression, this can be achieved by including a priori information about the weight vectors associated with the tasks, and how they are expected to be related to each other. In this paper, we assume that tasks are clustered into groups, which are unknown beforehand, and that tasks within a group have similar weight vectors. We design a new spectral norm that encodes this a priori assumption, without the prior knowledge of the partition of tasks into groups, resulting in a new convex optimization formulation for multi-task learning. We show in simulations on synthetic examples and on the IEDB MHC-I binding dataset that our approach outperforms well-known convex methods for multi-task learning, as well as related non-convex methods dedicated to the same problem.

409 citations

Posted Content•

Multi-Instance Learning by Treating Instances As Non-I.I.D. Samples

Zhi-Hua Zhou¹, Yuyin Sun¹, Yu-Feng Li¹•Institutions (1)

Nanjing University¹

12 Jul 2008-arXiv: Learning

TL;DR: This paper explicitly maps every bag to an undirected graph and designs a graph kernel for distinguishing positive and negative bags; it also implicitly constructs graphs by deriving affinity matrices and proposes an efficient graph kernel that considers clique information.

Abstract: Multi-instance learning attempts to learn from a training set consisting of labeled bags each containing many unlabeled instances. Previous studies typically treat the instances in the bags as independently and identically distributed. However, the instances in a bag are rarely independent, and therefore better performance can be expected if the instances are treated in a non-i.i.d. way that exploits the relations among instances. In this paper, we propose a simple yet effective multi-instance learning method, which regards each bag as a graph and uses a specific kernel to distinguish the graphs by considering the features of the nodes as well as the features of the edges that convey some relations among instances. The effectiveness of the proposed method is validated by experiments.

368 citations

Posted Content•

Sample Selection Bias Correction Theory

Corinna Cortes¹, Mehryar Mohri², Michael Riley¹, Afshin Rostamizadeh²•Institutions (2)

Google¹, Courant Institute of Mathematical Sciences²

19 May 2008-arXiv: Learning

TL;DR: A theoretical analysis of sample selection bias correction based on the novel concept of distributional stability which generalizes the existing concept of point-based stability and can be used to analyze other importance weighting techniques and their effect on accuracy when using a distributionally stable algorithm.

Abstract: This paper presents a theoretical analysis of sample selection bias correction. The sample bias correction technique commonly used in machine learning consists of reweighting the cost of an error on each training point of a biased sample to more closely reflect the unbiased distribution. This relies on weights derived by various estimation techniques based on finite samples. We analyze the effect of an error in that estimation on the accuracy of the hypothesis returned by the learning algorithm for two estimation techniques: a cluster-based estimation technique and kernel mean matching. We also report the results of sample bias correction experiments with several data sets using these techniques. Our analysis is based on the novel concept of distributional stability which generalizes the existing concept of point-based stability. Much of our work and proof techniques can be used to analyze other importance weighting techniques and their effect on accuracy when using a distributionally stable algorithm.

255 citations

Posted Content•

Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

Francis Bach¹•Institutions (1)

French Institute for Research in Computer Science and Automation¹

09 Sep 2008-arXiv: Learning

TL;DR: The extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.

Abstract: For supervised and unsupervised learning, positive definite kernels allow the use of large and potentially infinite-dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the l1-norm or the block l1-norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to nonlinear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.

233 citations

Posted Content•

Convex Sparse Matrix Factorizations

Francis Bach, Julien Mairal, Jean Ponce

09 Dec 2008-arXiv: Learning

TL;DR: This work presents a convex formulation of dictionary learning for sparse signal decomposition that introduces an explicit trade-off between size and sparsity of the decomposition of rectangular matrices and compares the estimation abilities of the convex and nonconvex approaches.

Abstract: We present a convex formulation of dictionary learning for sparse signal decomposition. Convexity is obtained by replacing the usual explicit upper bound on the dictionary size by a convex rank-reducing term similar to the trace norm. In particular, our formulation introduces an explicit trade-off between size and sparsity of the decomposition of rectangular matrices. Using a large set of synthetic examples, we compare the estimation abilities of the convex and non-convex approaches, showing that while the convex formulation has a single local minimum, this may lead in some cases to performance which is inferior to the local minima of the non-convex formulation.

148 citations

Posted Content•

On Recovery of Sparse Signals via $\ell_1$ Minimization

T. Tony Cai, Guangwu Xu, Jun Zhang

01 May 2008-arXiv: Learning

TL;DR: In this article, a unified and elementary treatment is given in the noiseless, bounded error and Gaussian noise settings for two $\ell_1$ minimization methods: the Dantzig selector and $\ell_1$ minimization with an $\ell_2$ constraint.

Abstract: This article considers constrained $\ell_1$ minimization methods for the recovery of high dimensional sparse signals in three settings: noiseless, bounded error and Gaussian noise. A unified and elementary treatment is given in these noise settings for two $\ell_1$ minimization methods: the Dantzig selector and $\ell_1$ minimization with an $\ell_2$ constraint. The results of this paper improve the existing results in the literature by weakening the conditions and tightening the error bounds. The improvement on the conditions shows that signals with larger support can be recovered accurately. This paper also establishes connections between restricted isometry property and the mutual incoherence property. Some results of Candes, Romberg and Tao (2006) and Donoho, Elad, and Temlyakov (2006) are extended.

135 citations
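
For orientation, the two $\ell_1$ minimization methods named in this entry are usually written as below. The notation is an assumption made for this sketch (linear model $y = X\beta + z$ with an $n \times p$ matrix $X$, tuning constants $\lambda$ and $\epsilon$) and may differ from the paper's own symbols.

```latex
% Assumed linear model: y = X\beta + z, with X an n x p matrix.
\begin{align*}
\text{(Dantzig selector)} \quad
  & \hat{\beta}^{DS} = \arg\min_{\gamma \in \mathbb{R}^p} \|\gamma\|_1
    \quad \text{subject to} \quad \|X^{\top}(y - X\gamma)\|_\infty \le \lambda, \\
\text{($\ell_2$ constraint)} \quad
  & \hat{\beta}^{\ell_2} = \arg\min_{\gamma \in \mathbb{R}^p} \|\gamma\|_1
    \quad \text{subject to} \quad \|y - X\gamma\|_2 \le \epsilon, \\
\text{(mutual incoherence)} \quad
  & \mu = \max_{i \ne j} |\langle x_i, x_j \rangle|
    \quad \text{for unit-norm columns } x_i \text{ of } X.
\end{align*}
```

In the noiseless setting both constraints reduce to exact recovery constraints by taking $\lambda = 0$ or $\epsilon = 0$.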

Posted Content•

Efficient Exact Inference in Planar Ising Models

Nicol N. Schraudolph¹, Dmitry Kamenetsky¹•Institutions (1)

Australian National University¹

24 Oct 2008-arXiv: Learning

TL;DR: In this paper, a polynomial-time algorithm for the exact computation of lowest energy (ground) states, worst margin violators, log partition functions, and marginal edge probabilities in certain binary undirected graphical models is presented.

Abstract: We give polynomial-time algorithms for the exact computation of lowest-energy (ground) states, worst margin violators, log partition functions, and marginal edge probabilities in certain binary undirected graphical models. Our approach provides an interesting alternative to the well-known graph cut paradigm in that it does not impose any submodularity constraints; instead we require planarity to establish a correspondence with perfect matchings (dimer coverings) in an expanded dual graph. We implement a unified framework while delegating complex but well-understood subproblems (planar embedding, maximum-weight perfect matching) to established algorithms for which efficient implementations are freely available. Unlike graph cut methods, we can perform penalized maximum-likelihood as well as maximum-margin parameter estimation in the associated conditional random fields (CRFs), and employ marginal posterior probabilities as well as maximum a posteriori (MAP) states for prediction. Maximum-margin CRF parameter estimation on image denoising and segmentation problems shows our approach to be efficient and effective. A C++ implementation is available from this http URL

80 citations

Posted Content•

Graph Kernels

S. V. N. Vishwanathan, Karsten M. Borgwardt, Imre Risi Kondor, Nicol N. Schraudolph

01 Jul 2008-arXiv: Learning

TL;DR: In this article, a unified framework for studying graph kernels is presented, special cases of which include the random walk graph kernel, the marginalized graph kernel, and the geometric kernel on graphs; connections to diffusion kernels and regularization on graphs are also explored.

Abstract: We present a unified framework to study graph kernels, special cases of which include the random walk graph kernel \citep{GaeFlaWro03,BorOngSchVisetal05}, marginalized graph kernel \citep{KasTsuIno03,KasTsuIno04,MahUedAkuPeretal04}, and geometric kernel on graphs \citep{Gaertner02}. Through extensions of linear algebra to Reproducing Kernel Hilbert Spaces (RKHS) and reduction to a Sylvester equation, we construct an algorithm that improves the time complexity of kernel computation from $O(n^6)$ to $O(n^3)$. When the graphs are sparse, conjugate gradient solvers or fixed-point iterations bring our algorithm into the sub-cubic domain. Experiments on graphs from bioinformatics and other application domains show that it is often more than a thousand times faster than previous approaches. We then explore connections between diffusion kernels \citep{KonLaf02}, regularization on graphs \citep{SmoKon03}, and graph kernels, and use these connections to propose new graph kernels. Finally, we show that rational kernels \citep{CorHafMoh02,CorHafMoh03,CorHafMoh04} when specialized to graphs reduce to the random walk graph kernel.
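
The fixed-point iteration mentioned in the abstract can be sketched for a geometric random walk kernel. This sketch assumes unlabeled graphs given as plain adjacency matrices, uniform start and stop distributions, and a decay factor `lam` small enough for the iteration to converge; the function name and these choices are illustrative, not the paper's implementation.

```python
def random_walk_kernel(A1, A2, lam=0.1, tol=1e-10, max_iter=1000):
    """Geometric random walk kernel via fixed-point iteration x = p + lam * W x.

    W is the Kronecker product of A1 and A2, but it is never formed: x is kept
    as an n1-by-n2 table and W is applied as two small matrix products.
    """
    n1, n2 = len(A1), len(A2)
    p = 1.0 / (n1 * n2)  # uniform start (and stop) probability, an assumption
    x = [[p] * n2 for _ in range(n1)]
    for _ in range(max_iter):
        # t = A1 @ x
        t = [[sum(A1[i][k] * x[k][l] for k in range(n1)) for l in range(n2)]
             for i in range(n1)]
        # new_x = p + lam * (t @ A2^T)
        new_x = [[p + lam * sum(A2[j][l] * t[i][l] for l in range(n2))
                  for j in range(n2)] for i in range(n1)]
        delta = max(abs(new_x[i][j] - x[i][j])
                    for i in range(n1) for j in range(n2))
        x = new_x
        if delta < tol:
            break
    # Inner product with the uniform stopping distribution.
    return p * sum(sum(row) for row in x)
```

Avoiding the explicit product graph is what keeps each iteration cubic rather than quartic in the number of nodes for dense inputs. The iteration only converges when `lam` is below the reciprocal of the largest eigenvalue magnitude of the product graph.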

Posted Content•

The optimal assignment kernel is not positive definite

Jean-Philippe Vert

26 Jan 2008-arXiv: Learning

TL;DR: It is proved that the optimal assignment kernel, proposed recently as an attempt to embed labeled graphs and more generally tuples of basic data to a Hilbert space, is in fact not always positive definite.

Abstract: We prove that the optimal assignment kernel, proposed recently as an attempt to embed labeled graphs and more generally tuples of basic data to a Hilbert space, is in fact not always positive definite.

Posted Content•

Isotropic PCA and Affine-Invariant Clustering

S. Charles Brubaker, Santosh Vempala

22 Apr 2008-arXiv: Learning

TL;DR: An extension of principal component analysis (PCA), and a new clustering algorithm for points in $\mathbb{R}^n$ based on it, are presented; the algorithm is affine-invariant, and its separation requirement for a mixture of two Gaussians is nearly the best possible, improving known results substantially.

Abstract: We present a new algorithm for clustering points in R^n. The key property of the algorithm is that it is affine-invariant, i.e., it produces the same partition for any affine transformation of the input. It has strong guarantees when the input is drawn from a mixture model. For a mixture of two arbitrary Gaussians, the algorithm correctly classifies the sample assuming only that the two components are separable by a hyperplane, i.e., there exists a halfspace that contains most of one Gaussian and almost none of the other in probability mass. This is nearly the best possible, improving known results substantially. For k > 2 components, the algorithm requires only that there be some (k-1)-dimensional subspace in which the overlap in every direction is small. Here we define overlap to be the ratio of the following two quantities: 1) the average squared distance between a point and the mean of its component, and 2) the average squared distance between a point and the mean of the mixture. The main result may also be stated in the language of linear discriminant analysis: if the standard Fisher discriminant is small enough, labels are not needed to estimate the optimal subspace for projection. Our main tools are isotropic transformation, spectral projection and a simple reweighting technique. We call this combination isotropic PCA.

Posted Content•

Importance Weighted Active Learning

Alina Beygelzimer¹, Sanjoy Dasgupta², John Langford³•Institutions (3)

IBM¹, University of California, San Diego², Yahoo!³

29 Dec 2008-arXiv: Learning

TL;DR: This work uses importance weighting to correct sampling bias and, by controlling the variance, gives rigorous label complexity bounds for the learning process; experiments show that the approach reduces the label complexity required to achieve good predictive performance on many learning problems.

Abstract: We present a practical and statistically consistent scheme for actively learning binary classifiers under general loss functions. Our algorithm uses importance weighting to correct sampling bias, and by controlling the variance, we are able to give rigorous label complexity bounds for the learning process. Experiments on passively labeled data show that this approach reduces the label complexity required to achieve good predictive performance on many learning problems.
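
The bias correction at the heart of this approach can be illustrated with a small sketch: a label is requested with some probability, and each labeled example is then weighted by the inverse of that probability so that weighted losses remain unbiased. In this simplified version the query probability depends only on the example, which is an assumption of the sketch; the names are hypothetical.

```python
import random


def importance_weighted_sample(stream, query_prob, rng):
    """Yield (x, y, weight) for queried examples, with weight = 1 / query probability."""
    for x, y in stream:
        p = query_prob(x)
        if rng.random() < p:
            yield x, y, 1.0 / p


def weighted_loss(hypothesis, labeled, loss, n_total):
    """Importance-weighted average loss over all n_total examples seen, queried or not."""
    return sum(w * loss(hypothesis(x), y) for x, y, w in labeled) / n_total
```

Dividing by the total number of examples seen, rather than the number queried, is what keeps the estimate unbiased. Small query probabilities inflate the weights and hence the variance, which is why the abstract stresses controlling the variance.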

Posted Content•

A Spectral Algorithm for Learning Hidden Markov Models

Daniel Hsu¹, Sham M. Kakade², Tong Zhang¹•Institutions (2)

Rutgers University¹, University of Pennsylvania²

26 Nov 2008-arXiv: Learning

TL;DR: It is proved that under a natural separation condition (bounds on the smallest singular value of the HMM parameters), there is an efficient and provably correct algorithm for learning HMMs.

Abstract: Hidden Markov Models (HMMs) are one of the most fundamental and widely used statistical tools for modeling discrete time series. In general, learning HMMs from data is computationally hard (under cryptographic assumptions), and practitioners typically resort to search heuristics which suffer from the usual local optima issues. We prove that under a natural separation condition (bounds on the smallest singular value of the HMM parameters), there is an efficient and provably correct algorithm for learning HMMs. The sample complexity of the algorithm does not explicitly depend on the number of distinct (discrete) observations---it implicitly depends on this quantity through spectral properties of the underlying HMM. This makes the algorithm particularly applicable to settings with a large number of observations, such as those in natural language processing where the space of observation is sometimes the words in a language. The algorithm is also simple, employing only a singular value decomposition and matrix multiplications.

Posted Content•

A Unified Semi-Supervised Dimensionality Reduction Framework for Manifold Learning

Ratthachat Chatpatanasiri¹, Boonserm Kijsirikul¹•Institutions (1)

Chulalongkorn University¹

06 Apr 2008-arXiv: Learning

TL;DR: In this article, a general framework of semi-supervised dimensionality reduction for manifold learning is presented, which naturally generalizes existing supervised and unsupervised learning frameworks which apply the spectral decomposition.

Abstract: We present a general framework of semi-supervised dimensionality reduction for manifold learning which naturally generalizes existing supervised and unsupervised learning frameworks which apply the spectral decomposition. Algorithms derived under our framework are able to employ both labeled and unlabeled examples and are able to handle complex problems where data form separate clusters of manifolds. Our framework offers simple views, explains relationships among existing frameworks and provides further extensions which can improve existing algorithms. Furthermore, a new semi-supervised kernelization framework called ``KPCA trick'' is proposed to handle non-linear problems.

Posted Content•

The use of entropy to measure structural diversity

Lesedi Masisi¹, V. Nelwamondo¹, Tshilidzi Marwala¹•Institutions (1)

University of the Witwatersrand¹

20 Oct 2008-arXiv: Learning

TL;DR: In this paper, entropy-based methods are compared and used to measure the structural diversity of an ensemble of 21 classifiers; such measures are mostly applied in ecology, where species counts are used as a measure of diversity.

Abstract: In this paper, entropy-based methods are compared and used to measure the structural diversity of an ensemble of 21 classifiers. This measure is mostly applied in ecology, whereby species counts are used as a measure of diversity. The measures used were the Shannon entropy, Simpson's and Berger-Parker diversity indices. As the diversity indices increased, so did the accuracy of the ensemble. An ensemble dominated by classifiers with the same structure produced poor accuracy. The uncertainty rule from information theory was also used to further define diversity. Genetic algorithms were used to find the optimal ensemble by using the diversity indices as the cost function. The method of voting was used to aggregate the decisions.
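
The three diversity measures named in the abstract can be computed from counts of how many classifiers share each structure. The sketch below uses the standard ecological definitions; the paper may use variants (for example, complements or reciprocals of these quantities), and the function names are illustrative.

```python
import math


def proportions(counts):
    """Convert raw counts per structure into proportions, skipping empty groups."""
    total = sum(counts)
    return [c / total for c in counts if c > 0]


def shannon_entropy(counts):
    """Shannon entropy in nats; higher means a more diverse ensemble."""
    return -sum(p * math.log(p) for p in proportions(counts))


def simpson_index(counts):
    """Sum of squared proportions; lower means a more diverse ensemble."""
    return sum(p * p for p in proportions(counts))


def berger_parker_index(counts):
    """Proportion of the most common structure; lower means more diverse."""
    return max(proportions(counts))
```

For an ensemble of 21 classifiers split evenly across three structures, the counts would be `[7, 7, 7]`; an ensemble dominated by one structure, such as `[19, 1, 1]`, scores as less diverse on all three measures.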

Posted Content•

Bolasso: model consistent Lasso estimation through the bootstrap

Francis Bach¹•Institutions (1)

French Institute for Research in Computer Science and Automation¹

08 Apr 2008-arXiv: Learning

TL;DR: This paper presents a detailed asymptotic analysis of model consistency of the Lasso, and presents a novel variable selection algorithm, referred to as the Bolasso, which is compared favorably to other linear regression methods on synthetic data and datasets from the UCI machine learning repository.

Abstract: We consider the least-square linear regression problem with regularization by the l1-norm, a problem usually referred to as the Lasso. In this paper, we present a detailed asymptotic analysis of model consistency of the Lasso. For various decays of the regularization parameter, we compute asymptotic equivalents of the probability of correct model selection (i.e., variable selection). For a specific rate decay, we show that the Lasso selects all the variables that should enter the model with probability tending to one exponentially fast, while it selects all other variables with strictly positive probability. We show that this property implies that if we run the Lasso for several bootstrapped replications of a given sample, then intersecting the supports of the Lasso bootstrap estimates leads to consistent model selection. This novel variable selection algorithm, referred to as the Bolasso, is compared favorably to other linear regression methods on synthetic data and datasets from the UCI machine learning repository.
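
The bootstrap-and-intersect procedure described in the abstract can be sketched as follows. The Lasso solver here is a plain coordinate descent written for this illustration, and the regularization level `lam`, the number of replications `n_boot`, and the function names are assumptions rather than the paper's settings.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used by the coordinate descent update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n) * ||y - Xw||^2 + lam * ||w||_1."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    resid = list(y)  # residual y - Xw, starting from w = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(d)]
    for _ in range(n_iter):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes feature j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_wj = soft_threshold(rho, lam) / col_sq[j]
            diff = new_wj - w[j]
            if diff != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * diff
                w[j] = new_wj
    return w


def bolasso_support(X, y, lam, n_boot=32, seed=0):
    """Intersect the Lasso supports obtained on bootstrap resamples of (X, y)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    support = set(range(d))
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        w = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        support &= {j for j, wj in enumerate(w) if wj != 0.0}
    return support
```

The intersection keeps only variables selected in every replication, which is how the relevant variables (selected with probability tending to one) are separated from the irrelevant ones (each dropped in at least some replications).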

Posted Content•

The Offset Tree for Learning with Partial Labels

Alina Beygelzimer¹, John Langford²•Institutions (2)

IBM¹, Yahoo!²

21 Dec 2008-arXiv: Learning

TL;DR: The Offset Tree is an optimal reduction to binary classification, allowing one to reuse any existing, fully supervised binary classification algorithm in this partial information setting, and is also computationally optimal, both at training and test time.

Abstract: We present an algorithm, called the Offset Tree, for learning to make decisions in situations where the payoff of only one choice is observed, rather than all choices. The algorithm reduces this setting to binary classification, allowing one to reuse any existing, fully supervised binary classification algorithm in this partial information setting. We show that the Offset Tree is an optimal reduction to binary classification. In particular, it has regret at most $(k-1)$ times the regret of the binary classifier it uses (where $k$ is the number of choices), and no reduction to binary classification can do better. This reduction is also computationally optimal, both at training and test time, requiring just $O(\log_2 k)$ work to train on an example or make a prediction. Experiments with the Offset Tree show that it generally performs better than several alternative approaches.

Posted Content•

Predicting Abnormal Returns From News Using Text Classification

Ronny Luss¹, Alexandre d'Aspremont¹•Institutions (1)

Princeton University¹

16 Sep 2008-arXiv: Learning

TL;DR: The authors used text from news articles to predict intraday price movements of financial assets using support vector machines and developed an analytic center cutting plane method to solve the kernel learning problem efficiently.

Abstract: We show how text from news articles can be used to predict intraday price movements of financial assets using support vector machines. Multiple kernel learning is used to combine equity returns with text as predictive features to increase classification performance and we develop an analytic center cutting plane method to solve the kernel learning problem efficiently. We observe that while the direction of returns is not predictable using either text or returns, their size is, with text features producing significantly better performance than historical returns alone.

Posted Content•

A Gaussian Belief Propagation Solver for Large Scale Support Vector Machines

Danny Bickson, Elad Yom-Tov, Danny Dolev

09 Oct 2008-arXiv: Learning

TL;DR: This work introduces an efficient parallel implementation of a support vector regression solver, based on the Gaussian Belief Propagation algorithm (GaBP), demonstrating the applicability of this algorithm for large scale distributed computing systems.

Abstract: Support vector machines (SVMs) are an extremely successful type of classification and regression algorithms. Building an SVM entails solving a constrained convex quadratic programming problem, which is quadratic in the number of training samples. We introduce an efficient parallel implementation of a support vector regression solver, based on the Gaussian Belief Propagation algorithm (GaBP). In this paper, we demonstrate that methods from the complex system domain could be utilized for performing efficient distributed computation. We compare the proposed algorithm to previously proposed distributed and single-node SVM solvers. Our comparison shows that the proposed algorithm is just as accurate as these solvers, while being significantly faster, especially for large datasets. We demonstrate scalability of the proposed algorithm to up to 1,024 computing nodes and hundreds of thousands of data points using an IBM Blue Gene supercomputer. As far as we know, our work is the largest parallel implementation of belief propagation ever done, demonstrating the applicability of this algorithm for large scale distributed computing systems.

Posted Content•

Linearly Parameterized Bandits

Paat Rusmevichientong¹, John N. Tsitsiklis²•Institutions (2)

Cornell University¹, Massachusetts Institute of Technology²

18 Dec 2008-arXiv: Learning

TL;DR: In this paper, bandit problems with a large (possibly infinite) collection of arms are studied; when the set of arms is the unit sphere, a policy that alternates between exploration and exploitation phases attains regret and Bayes risk matching the lower bound, and the phase-based policy is also shown to be effective if the set of arms satisfies a strong convexity condition.

Abstract: We consider bandit problems involving a large (possibly infinite) collection of arms, in which the expected reward of each arm is a linear function of an $r$-dimensional random vector $\mathbf{Z} \in \mathbb{R}^r$, where $r \geq 2$. The objective is to minimize the cumulative regret and Bayes risk. When the set of arms corresponds to the unit sphere, we prove that the regret and Bayes risk is of order $\Theta(r \sqrt{T})$, by establishing a lower bound for an arbitrary policy, and showing that a matching upper bound is obtained through a policy that alternates between exploration and exploitation phases. The phase-based policy is also shown to be effective if the set of arms satisfies a strong convexity condition. For the case of a general set of arms, we describe a near-optimal policy whose regret and Bayes risk admit upper bounds of the form $O(r \sqrt{T} \log^{3/2} T)$.

Journal Article•DOI•

Effect of Tuned Parameters on a LSA MCQ Answering Model

Alain Lifchitz, Sandra Jhean-Larose¹, Guy Denhière¹•Institutions (1)

New York City Landmarks Preservation Commission¹

02 Nov 2008-arXiv: Learning

TL;DR: In this article, the authors present the current state of a work in progress, whose objective is to better understand the effects of factors that significantly influence the performance of Latent Semantic Analysis (LSA).

Abstract: This paper presents the current state of a work in progress, whose objective is to better understand the effects of factors that significantly influence the performance of Latent Semantic Analysis (LSA). A difficult task, which consists in answering (French) biology Multiple Choice Questions, is used to test the semantic properties of the truncated singular space and to study the relative influence of the main parameters. Dedicated software has been designed to fine-tune the LSA semantic space for the Multiple Choice Questions task. With optimal parameters, the performance of our simple model is, quite surprisingly, equal or superior to that of 7th and 8th grade students. This indicates that semantic spaces were quite good despite their low dimensions and the small sizes of training data sets. In addition, we present an original entropy global weighting of the answers' terms for each question of the Multiple Choice Questions, which was necessary to achieve the model's success.

Posted Content•

Robustness and Regularization of Support Vector Machines

Huan Xu¹, Constantine Caramanis², Shie Mannor¹,³•Institutions (3)

McGill University¹, University of Texas at Austin², Technion – Israel Institute of Technology³

25 Mar 2008-arXiv: Learning

TL;DR: In this paper, regularized support vector machines (SVMs) are shown to be precisely equivalent to a robust optimization formulation, an equivalence that has implications for both algorithms and analysis.

Abstract: We consider regularized support vector machines (SVMs) and show that they are precisely equivalent to a new robust optimization formulation. We show that this equivalence of robust optimization and regularization has implications for both algorithms and analysis. In terms of algorithms, the equivalence suggests more general SVM-like algorithms for classification that explicitly build in protection against noise, and at the same time control overfitting. On the analysis front, the equivalence of robustness and regularization provides a robust optimization interpretation for the success of regularized SVMs. We use this new robustness interpretation of SVMs to give a new proof of consistency of (kernelized) SVMs, thus establishing robustness as the reason regularized SVMs generalize well.
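
The abstract does not state the formulation itself. As a sketch of the form such an equivalence takes, assuming a non-separable training sample $(x_i, y_i)_{i=1}^m$, hinge loss, a norm $\|\cdot\|$ with dual norm $\|\cdot\|^{*}$, and a perturbation budget $c$ (all notation introduced here for illustration), the robust and regularized problems can be written side by side:

```latex
% Robust formulation: worst-case hinge loss over perturbations sharing a budget c
\min_{w,b}\ \max_{(\delta_1,\dots,\delta_m)\in\mathcal{T}}\
  \sum_{i=1}^{m} \max\bigl[\,1 - y_i\bigl(\langle w,\, x_i - \delta_i\rangle + b\bigr),\ 0\,\bigr],
\qquad
\mathcal{T} = \Bigl\{(\delta_1,\dots,\delta_m) : \sum_{i=1}^{m} \|\delta_i\|^{*} \le c\Bigr\}

% Regularized formulation: the budget c reappears as the regularization weight
\min_{w,b}\ c\,\|w\| + \sum_{i=1}^{m} \max\bigl[\,1 - y_i\bigl(\langle w,\, x_i\rangle + b\bigr),\ 0\,\bigr]
```

The intuition is visible for a single sample: the adversary maximizes $y_i \langle w, \delta_i \rangle$ subject to $\|\delta_i\|^{*} \le c$, and by the definition of the dual norm that maximum is exactly $c\,\|w\|$, which is the regularization term.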

Posted Content•

Learning Low-Density Separators

Shai Ben-David¹, Tyler Lu¹, Dávid Pál¹, Miroslava Sotakova²•Institutions (2)

University of Waterloo¹, Aarhus University²

19 May 2008-arXiv: Learning

TL;DR: This work proposes two natural learning paradigms and proves that, on input unlabeled random samples generated by any member of a rich family of distributions, they are guaranteed to converge to the optimal separator for that distribution.

Abstract: We define a novel, basic, unsupervised learning problem - learning the lowest density homogeneous hyperplane separator of an unknown probability distribution. This task is relevant to several problems in machine learning, such as semi-supervised learning and clustering stability. We investigate the question of existence of a universally consistent algorithm for this problem. We propose two natural learning paradigms and prove that, on input unlabeled random samples generated by any member of a rich family of distributions, they are guaranteed to converge to the optimal separator for that distribution. We complement this result by showing that no learning algorithm for our task can achieve uniform learning rates (that are independent of the data generating distribution).

Journal Article•DOI•

Client-server multi-task learning from distributed datasets

Francesco Dinuzzo, Gianluigi Pillonetto, Giuseppe De Nicolao

22 Dec 2008-arXiv: Learning

TL;DR: In this article, a client-server architecture to simultaneously solve multiple learning tasks from distributed datasets is described, where each client is associated with an individual learning task and the associated dataset of examples, and the role of the server is to collect data in real time from the clients and codify the information in a common database.

Abstract: A client-server architecture to simultaneously solve multiple learning tasks from distributed datasets is described. In such an architecture, each client is associated with an individual learning task and the associated dataset of examples. The goal of the architecture is to perform information fusion from multiple datasets while preserving the privacy of individual data. The role of the server is to collect data in real time from the clients and codify the information in a common database. The information coded in this database can be used by all the clients to solve their individual learning tasks, so that each client can exploit the informative content of all the datasets without actually having access to the private data of others. The proposed algorithmic framework, based on regularization theory and kernel methods, uses a suitable class of mixed effect kernels. The new method is illustrated through a simulated music recommendation system.

Posted Content•

On Kernelization of Supervised Mahalanobis Distance Learners

Ratthachat Chatpatanasiri, Teesid Korsrilabutr, Pasakorn Tangchanachaianan, Boonserm Kijsirikul

09 Apr 2008-arXiv: Learning

TL;DR: Three popular learners, namely "neighborhood component analysis", "large margin nearest neighbors" and "discriminant neighborhood embedding", which do not have kernel versions, are kernelized in order to improve their classification performance.

Abstract: This paper focuses on the problem of kernelizing an existing supervised Mahalanobis distance learner. The paper makes the following contributions. Firstly, three popular learners, namely "neighborhood component analysis", "large margin nearest neighbors" and "discriminant neighborhood embedding", which do not have kernel versions, are kernelized in order to improve their classification performance. Secondly, an alternative kernelization framework called the "KPCA trick" is presented. Implementing a learner in the new framework offers several advantages over the standard framework: for example, no mathematical formulas and no reprogramming are required for a kernel implementation, and the framework avoids troublesome problems such as singularity. Thirdly, whereas previous related papers merely assume that representer theorems hold, here they are formally proven. The proofs validate both the kernel trick and the KPCA trick in the context of Mahalanobis distance learning. Fourthly, unlike previous works, which always apply brute-force methods to select a kernel, we investigate two approaches that can be efficiently adopted to construct an appropriate kernel for a given dataset. Finally, numerical results on various real-world datasets are presented.

Journal Article•DOI•

A Novel Clustering Algorithm Based on Quantum Games

Qiang Li, Yan He, Jing-ping Jiang

03 Dec 2008-arXiv: Learning

TL;DR: This paper combines the quantum game with the problem of data clustering, and then develops a quantum-game-based clustering algorithm, in which data points in a dataset are considered as players who can make decisions and implement quantum strategies in quantum games.

Abstract: Quantum algorithms have achieved enormous success during the last decade. In this paper, we combine the quantum game with the problem of data clustering, and then develop a quantum-game-based clustering algorithm, in which data points in a dataset are considered as players who can make decisions and implement quantum strategies in quantum games. After each round of a quantum game, each player's expected payoff is calculated. The player then uses a link-removing-and-rewiring (LRR) function to change his neighbors and adjust the strength of the links connecting to them in order to maximize his payoff. Further, algorithms are discussed and analyzed for two cases of strategies, two payoff matrices and two LRR functions. The simulation results demonstrate that data points in datasets are clustered reasonably and efficiently, and that the clustering algorithms have fast rates of convergence. Moreover, the comparison with other algorithms also provides an indication of the effectiveness of the proposed approach.

Posted Content•

Online variants of the cross-entropy method

Istvan Szita, András Lörincz

14 Jan 2008-arXiv: Learning

TL;DR: Two online variants of the basic cross-entropy method (CEM), a simple but efficient method for global optimization, are provided, together with a proof of convergence.

Abstract: The cross-entropy method is a simple but efficient method for global optimization. In this paper we provide two online variants of the basic CEM, together with a proof of convergence.
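
For orientation, the basic (batch) cross-entropy method that the online variants build on can be sketched in a few lines. This is a minimal illustration, not the paper's online algorithms: the Gaussian sampling distribution, the function name `cross_entropy_minimize`, and all parameter names and default values are assumptions chosen for the example.

```python
import random
import statistics


def cross_entropy_minimize(f, mu=0.0, sigma=5.0, n_samples=50,
                           n_elite=10, n_iters=40, seed=0):
    """Minimize a one-dimensional function f with a basic batch CEM.

    Each iteration draws samples from N(mu, sigma), keeps the n_elite
    samples with the lowest f value, and refits mu and sigma to them.
    """
    rng = random.Random(seed)
    for _ in range(n_iters):
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        elite = sorted(samples, key=f)[:n_elite]  # best samples first
        mu = statistics.mean(elite)
        # Small floor keeps sigma positive once the elite set collapses.
        sigma = statistics.pstdev(elite) + 1e-6
    return mu


# Usage: the minimizer of (x - 3)^2 is x = 3.
x_min = cross_entropy_minimize(lambda x: (x - 3.0) ** 2)
```

The batch form refits the distribution only after a full population has been evaluated; an online variant would instead update the distribution parameters incrementally as samples arrive.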

Posted Content•

Stability Bound for Stationary Phi-mixing and Beta-mixing Processes

Mehryar Mohri, Afshin Rostamizadeh

11 Nov 2008-arXiv: Learning

TL;DR: Novel and distinct stability-based generalization bounds for stationary phi-mixing and beta-mixing sequences are proved, which can be viewed as the first theoretical basis for the use of these algorithms in non-i.i.d. scenarios.

Abstract: Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive tight generalization bounds that are tailored to specific learning algorithms by exploiting their particular properties. However, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed. In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence. This paper studies the scenario where the observations are drawn from a stationary phi-mixing or beta-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a dependence between observations weakening over time. We prove novel and distinct stability-based generalization bounds for stationary phi-mixing and beta-mixing sequences. These bounds strictly generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby extending the use of stability bounds to non-i.i.d. scenarios. We also illustrate the application of our phi-mixing generalization bounds to general classes of learning algorithms, including Support Vector Regression, Kernel Ridge Regression, Support Vector Machines, and many other kernel regularization-based and relative entropy-based regularization algorithms. These novel bounds can thus be viewed as the first theoretical basis for the use of these algorithms in non-i.i.d. scenarios.
