# Showing papers in "arXiv: Learning in 2006"

•

TL;DR: In this paper, the authors use clickthrough data to learn ranked retrieval functions for web search results, using query chains to generate new types of preference judgments from search engine logs, thus taking advantage of user intelligence in reformulating queries.

Abstract: This paper presents a novel approach for using clickthrough data to learn ranked retrieval functions for web search results. We observe that users searching the web often perform a sequence, or chain, of queries with a similar information need. Using query chains, we generate new types of preference judgments from search engine logs, thus taking advantage of user intelligence in reformulating queries. To validate our method we perform a controlled user study comparing generated preference judgments to explicit relevance judgments. We also implemented a real-world search engine to test our approach, using a modified ranking SVM to learn an improved ranking function from preference data. Our results demonstrate significant improvements in the ranking given by the search engine. The learned rankings outperform both a static ranking function, as well as one trained without considering query chains.

493 citations
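The preference-generation idea in the abstract above can be illustrated with a minimal Python sketch. The two rules used here ("a clicked result is preferred over unclicked results ranked above it" within a query, and "a result clicked for a later query is preferred over unclicked results shown for an earlier query" across a chain) are simplified assumptions for illustration; the abstract does not list the paper's actual rules, and the names `click_skip_preferences` and `chain_preferences` are hypothetical.

```python
from typing import List, Set, Tuple

# One step of a query chain: (query text, ranked result ids, clicked result ids).
ChainStep = Tuple[str, List[str], Set[str]]
# A preference judgment: for `query`, `preferred` should rank above `other`.
Preference = Tuple[str, str, str]


def click_skip_preferences(ranking: List[str], clicked: Set[str]) -> List[Tuple[str, str]]:
    """Within one query: each clicked result beats every unclicked result above it."""
    prefs = []
    for pos, doc in enumerate(ranking):
        if doc in clicked:
            prefs.extend((doc, above) for above in ranking[:pos] if above not in clicked)
    return prefs


def chain_preferences(chain: List[ChainStep]) -> List[Preference]:
    """Assumed chain rule: a click on a later query also beats the unclicked
    results that were shown for each earlier query in the same chain."""
    prefs: List[Preference] = []
    # Single-query preferences first.
    for query, ranking, clicked in chain:
        prefs.extend((query, good, bad) for good, bad in click_skip_preferences(ranking, clicked))
    # Cross-query preferences, attributed to the earlier query.
    for j, (_, later_ranking, later_clicked) in enumerate(chain):
        for doc in later_ranking:
            if doc not in later_clicked:
                continue
            for earlier_query, earlier_ranking, earlier_clicked in chain[:j]:
                prefs.extend(
                    (earlier_query, doc, skipped)
                    for skipped in earlier_ranking
                    if skipped not in earlier_clicked and skipped != doc
                )
    return prefs


example_chain = [
    ("jaguar", ["a", "b", "c"], set()),
    ("jaguar car", ["d", "e"], {"e"}),
]
example_prefs = chain_preferences(example_chain)
```

Judgments of this form, triples of (query, preferred document, other document), are the kind of pairwise training data a ranking SVM consumes.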

•

TL;DR: The effectiveness of applying the kernel method to canonical correlation analysis is investigated as an efficient way to improve on this linear method.

Abstract: Canonical correlation analysis is a technique to extract common features from a pair of multivariate data. In complex situations, however, it does not extract useful features because of its linearity. On the other hand, the kernel method used in support vector machines is an efficient approach to improve such a linear method. In this paper, we investigate the effectiveness of applying the kernel method to canonical correlation analysis. Keywords: multivariate analysis, multimodal data, kernel method, regularization

488 citations

•

TL;DR: This work develops a new collaborative filtering (CF) method that combines previously known users' preferences (standard CF) with product/user attributes (classical function approximation) to predict a given user's interest in a particular product.

Abstract: We develop a new collaborative filtering (CF) method that combines both previously known users' preferences, i.e. standard CF, as well as product/user attributes, i.e. classical function approximation, to predict a given user's interest in a particular product. Our method is a generalized low rank matrix completion problem, where we learn a function whose inputs are pairs of vectors; the standard low rank matrix completion problem is a special case where the inputs to the function are the row and column indices of the matrix. We solve this generalized matrix completion problem using tensor product kernels, for which we also formally generalize standard kernel properties. Benchmark experiments on movie ratings show the advantages of our generalized matrix completion method over the standard matrix completion one with no information about movies or people, as well as over standard multi-task or single task learning methods.

117 citations
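A tensor product kernel on (user, item) pairs, as mentioned in the abstract above, multiplies a kernel between users by a kernel between items. The sketch below plugs such a kernel into plain kernel ridge regression. This is a swapped-in, simpler learner chosen for illustration, not the paper's low-rank method, and the RBF kernels, attribute vectors, ratings and function names are all assumptions.

```python
import numpy as np


def rbf_kernel(X, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)


def pair_gram(K_user, K_item, pairs_a, pairs_b):
    """Tensor product kernel: k((u, i), (u', i')) = K_user[u, u'] * K_item[i, i']."""
    ua, ia = pairs_a[:, 0], pairs_a[:, 1]
    ub, ib = pairs_b[:, 0], pairs_b[:, 1]
    return K_user[np.ix_(ua, ub)] * K_item[np.ix_(ia, ib)]


def fit_predict(K_user, K_item, train_pairs, ratings, test_pairs, reg=1e-6):
    """Kernel ridge regression on (user, item) pairs (illustrative learner)."""
    G = pair_gram(K_user, K_item, train_pairs, train_pairs)
    alpha = np.linalg.solve(G + reg * np.eye(len(ratings)), ratings)
    return pair_gram(K_user, K_item, test_pairs, train_pairs) @ alpha


# Hypothetical attributes: one feature per user and per item.
user_attrs = np.array([[0.0], [1.0], [2.0]])
item_attrs = np.array([[0.0], [1.5]])
K_user, K_item = rbf_kernel(user_attrs), rbf_kernel(item_attrs)

train_pairs = np.array([[0, 0], [1, 0], [1, 1], [2, 1]])  # (user index, item index)
ratings = np.array([5.0, 4.0, 2.0, 1.0])
unseen_pairs = np.array([[0, 1], [2, 0]])

train_fit = fit_predict(K_user, K_item, train_pairs, ratings, train_pairs)
unseen_pred = fit_predict(K_user, K_item, train_pairs, ratings, unseen_pairs)

# With identity (Dirac) kernels the inputs reduce to bare row/column indices.
dirac_pred = fit_predict(np.eye(3), np.eye(2), train_pairs, ratings, unseen_pairs)
```

With identity kernels the inputs carry only row and column indices, the special case the abstract mentions. Because this ridge sketch has no low-rank constraint, it then predicts zero for every unseen pair, which shows why attribute kernels (or a rank constraint) are needed to generalise.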

•

TL;DR: This paper is an investigation on two methods to improve generalization in GP-based learning: the selection of the best-of-run individuals using a three data sets methodology, and the application of parsimony pressure in order to reduce the complexity of the solutions.

Abstract: Fitness functions based on test cases are very common in Genetic Programming (GP). This process can be assimilated to a learning task, with the inference of models from a limited number of samples. This paper is an investigation on two methods to improve generalization in GP-based learning: 1) the selection of the best-of-run individuals using a three data sets methodology, and 2) the application of parsimony pressure in order to reduce the complexity of the solutions. Results using GP in a binary classification setup show that while the accuracy on the test sets is preserved, with less variance compared to baseline results, the mean tree size obtained with the tested methods is significantly reduced.

54 citations

•

TL;DR: A novel approach to semi-supervised learning based on statistical physics is presented. It samples classifications with a multicanonical Markov chain Monte-Carlo algorithm and has a straightforward probabilistic interpretation, which allows for soft assignments of points to classes and for coping with yet unseen class types.

Abstract: We present a novel approach to semi-supervised learning which is based on statistical physics. Most of the former work in the field of semi-supervised learning classifies the points by minimizing a certain energy function, which corresponds to a minimal k-way cut solution. In contrast to these methods, we estimate the distribution of classifications, instead of the sole minimal k-way cut, which yields more accurate and robust results. Our approach may be applied to all energy functions used for semi-supervised learning. The method is based on sampling using a Multicanonical Markov chain Monte-Carlo algorithm, and has a straightforward probabilistic interpretation, which allows for soft assignments of points to classes, and also to cope with yet unseen class types. The suggested approach is demonstrated on a toy data set and on two real-life data sets of gene expression.

39 citations

•

TL;DR: The intermediary MA output by DEES are studied, and it is shown that they compute rational series which converge absolutely to one and which can be used to provide stochastic languages that closely estimate the target.

Abstract: Given a finite set of words w1,...,wn independently drawn according to a fixed unknown distribution law P called a stochastic language, a usual goal in Grammatical Inference is to infer an estimate of P in some class of probabilistic models, such as Probabilistic Automata (PA). Here, we study the class of rational stochastic languages, which consists of stochastic languages that can be generated by Multiplicity Automata (MA) and which strictly includes the class of stochastic languages generated by PA. Rational stochastic languages have a minimal normal representation which may be very concise, and whose parameters can be efficiently estimated from stochastic samples. We design an efficient inference algorithm DEES which aims at building a minimal normal representation of the target. Despite the fact that no recursively enumerable class of MA computes exactly the set of rational stochastic languages over Q, we show that DEES strongly identifies this set in the limit. We study the intermediary MA output by DEES and show that they compute rational series which converge absolutely to one and which can be used to provide stochastic languages which closely estimate the target.

34 citations

•

TL;DR: A model of user behavior is created by drawing upon user studies in laboratory and real-world settings and it is found that learning from implicit feedback can be surprisingly robust.

Abstract: This paper evaluates the robustness of learning from implicit feedback in web search. In particular, we create a model of user behavior by drawing upon user studies in laboratory and real-world settings. The model is used to understand the effect of user behavior on the performance of a learning algorithm for ranked retrieval. We explore a wide range of possible user behaviors and find that learning from implicit feedback can be surprisingly robust. This complements previous results that demonstrated our algorithm's effectiveness in a real-world search engine application.

31 citations

•

TL;DR: Strengths and limitations of the direct approach to competitive on-line prediction via metric entropy are discussed, including comparisons to other approaches.

Abstract: Competitive on-line prediction (also known as universal prediction of individual sequences) is a strand of learning theory avoiding making any stochastic assumptions about the way the observations are generated. The predictor's goal is to compete with a benchmark class of prediction rules, which is often a proper Banach function space. Metric entropy provides a unifying framework for competitive on-line prediction: the numerous known upper bounds on the metric entropy of various compact sets in function spaces readily imply bounds on the performance of on-line prediction strategies. This paper discusses strengths and limitations of the direct approach to competitive on-line prediction via metric entropy, including comparisons to other approaches.

26 citations

•

NICTA

TL;DR: It is shown that the average reward U for m → ∞ and the future discounted reward V for k → ∞ are asymptotically equal, provided both limits exist, and that if the effective horizon grows linearly with k or faster, then the existence of the limit of U implies that the limit of V exists.

Abstract: Consider an agent interacting with an environment in cycles. In every interaction cycle the agent is rewarded for its performance. We compare the average reward U from cycle 1 to m (average value) with the future discounted reward V from cycle k to infinity (discounted value). We consider essentially arbitrary (non-geometric) discount sequences and arbitrary reward sequences (non-MDP environments). We show that asymptotically U for m->infinity and V for k->infinity are equal, provided both limits exist. Further, if the effective horizon grows linearly with k or faster, then existence of the limit of U implies that the limit of V exists. Conversely, if the effective horizon grows linearly with k or slower, then existence of the limit of V implies that the limit of U exists.

14 citations
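The two quantities compared in the abstract above can be computed numerically for a concrete reward sequence. The sketch below is illustrative only: it assumes the discounted value is normalised by the remaining discount mass (the abstract does not spell out the normalisation), uses an alternating 0/1 reward sequence and a quadratic discount 1/i^2 (a non-geometric discount with a linearly growing effective horizon), and truncates the infinite sum at a finite length N.

```python
def average_value(rewards, m):
    """Average reward U over cycles 1..m (cycle i is stored at index i - 1)."""
    return sum(rewards[:m]) / m


def discounted_value(rewards, discounts, k):
    """Normalised future discounted reward V from cycle k onward.

    Assumption: V is divided by the remaining discount mass, so that U and V
    live on the same scale. The infinite sum is truncated at len(rewards).
    """
    tail_rewards = rewards[k - 1:]
    tail_discounts = discounts[k - 1:]
    weighted = sum(g * r for g, r in zip(tail_discounts, tail_rewards))
    return weighted / sum(tail_discounts)


N = 200_000  # truncation length for the infinite sums
rewards = [1.0 if i % 2 == 0 else 0.0 for i in range(1, N + 1)]  # reward on even cycles
discounts = [1.0 / i ** 2 for i in range(1, N + 1)]               # quadratic discount

U = average_value(rewards, 10_000)
V = discounted_value(rewards, discounts, 1_000)
```

For this sequence both values sit close to one half, in line with the stated asymptotic equality.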

•

TL;DR: In this article, an algorithm for collaboratively training regularized kernel least-squares regression estimators is derived, which can be viewed as an application of successive orthogonal projection algorithms and its convergence properties are investigated in a simplified theoretical setting.

Abstract: This paper addresses the problem of distributed learning under communication constraints, motivated by distributed signal processing in wireless sensor networks and data mining with distributed databases. After formalizing a general model for distributed learning, an algorithm for collaboratively training regularized kernel least-squares regression estimators is derived. Noting that the algorithm can be viewed as an application of successive orthogonal projection algorithms, its convergence properties are investigated and the statistical behavior of the estimator is discussed in a simplified theoretical setting.

14 citations

•

TL;DR: To express the practically interesting goodness of fit of individual models for individual data sets the authors have to rely on Kolmogorov complexity.

Abstract: Approximation of the optimal two-part MDL code for given data, through successive monotonically length-decreasing two-part MDL codes, has the following properties: (i) computation of each step may take arbitrarily long; (ii) we may not know when we reach the optimum, or whether we will reach the optimum at all; (iii) the sequence of models generated may not monotonically improve the goodness of fit; but (iv) the model associated with the optimum has (almost) the best goodness of fit. To express the practically interesting goodness of fit of individual models for individual data sets we have to rely on Kolmogorov complexity.

•

TL;DR: The class of stationary prediction strategies is introduced and a prediction algorithm that asymptotically performs as well as the best continuous stationary strategy is constructed.

Abstract: In this paper we introduce the class of stationary prediction strategies and construct a prediction algorithm that asymptotically performs as well as the best continuous stationary strategy. We make mild compactness assumptions but no stochastic assumptions about the environment. In particular, no assumption of stationarity is made about the environment, and the stationarity of the considered strategies only means that they do not depend explicitly on time; we argue that it is natural to consider only stationary strategies even for highly non-stationary environments.

•

TL;DR: In this paper, the authors considered the problem of finding the conditions on local absolute continuity which are sufficient for sequence prediction and formulated some open questions to outline a direction for finding conditions on classes of measures for which prediction is possible.

Abstract: Suppose we are given two probability measures on the set of one-way infinite finite-alphabet sequences and consider the question when one of the measures predicts the other, that is, when conditional probabilities converge (in a certain sense) when one of the measures is chosen to generate the sequence. This question may be considered a refinement of the problem of sequence prediction in its most general formulation: for a given class of probability measures, does there exist a measure which predicts all of the measures in the class? To address this problem, we find some conditions on local absolute continuity which are sufficient for prediction and which generalize several different notions which are known to be sufficient for prediction. We also formulate some open questions to outline a direction for finding the conditions on classes of measures for which prediction is possible.

•

TL;DR: This work proposes and analyzes a new vantage point for the learning of mixtures of Gaussians: namely, the PAC-style model of learning probability distributions introduced by Kearns et al.

Abstract: We propose and analyze a new vantage point for the learning of mixtures of Gaussians: namely, the PAC-style model of learning probability distributions introduced by Kearns et al. [13]. Here the task is to construct a hypothesis mixture of Gaussians that is statistically indistinguishable from the actual mixture generating the data; specifically, the KL divergence should be at most ε. In this scenario, we give a poly(n/ε) time algorithm that learns the class of mixtures of any constant number of axis-aligned Gaussians in R^n. Our algorithm makes no assumptions about the separation between the means of the Gaussians, nor does it have any dependence on the minimum mixing weight. This is in contrast to learning results known in the “clustering” model, where such assumptions are unavoidable. Our algorithm relies on the method of moments, and a subalgorithm developed in [8] for a discrete mixture-learning problem.

•

TL;DR: A new model for discriminants is proposed, using copula functions, that provides a rich and generalized class of decision boundaries that significantly boost the classification accuracy especially for high dimensional feature spaces.

Abstract: A useful method for representing Bayesian classifiers is through discriminant functions. Here, using copula functions, we propose a new model for discriminants. This model provides a rich and generalized class of decision boundaries. These decision boundaries significantly boost the classification accuracy especially for high dimensional feature spaces. We strengthen our analysis through simulation results.

•

TL;DR: This work is interested in extending the learning from interpretations setting introduced by L. De Raedt that extends to relational representations the classical propositional (or attribute-value) concept learning from examples framework.

Abstract: We investigate here concept learning from incomplete examples. Our first purpose is to discuss to what extent logical learning settings have to be modified in order to cope with data incompleteness. More precisely we are interested in extending the learning from interpretations setting introduced by L. De Raedt that extends to relational representations the classical propositional (or attribute-value) concept learning from examples framework. We are inspired here by ideas presented by H. Hirsh in a work extending the Version space inductive paradigm to incomplete data. H. Hirsh proposes to slightly modify the notion of solution when dealing with incomplete examples: a solution has to be a hypothesis compatible with all pieces of information concerning the examples. We identify two main classes of incompleteness. First, uncertainty deals with our state of knowledge concerning an example. Second, generalization (or abstraction) deals with what part of the description of the example is sufficient for the learning purpose. These two main sources of incompleteness can be mixed up when only part of the useful information is known. We discuss a general learning setting, referred to as "learning from possibilities", that formalizes these ideas, then we present a more specific learning setting, referred to as "assumption-based learning", that copes with examples whose uncertainty can be reduced when considering contextual information outside of the proper description of the examples. Assumption-based learning is illustrated on a recent work concerning the prediction of a consensus secondary structure common to a set of RNA sequences.

•

TL;DR: The notion of residual language of a stochastic language is defined and used to investigate properties of several subclasses of rational stochastic languages, and some connections between properties of rational stochastic languages and results obtained in the field of probabilistic grammatical inference are shown.

Abstract: The goal of the present paper is to provide a systematic and comprehensive study of rational stochastic languages over a semiring $K \in \{Q, Q^+, R, R^+\}$. A rational stochastic language is a probability distribution over a free monoid $\Sigma^*$ which is rational over K, that is, which can be generated by a multiplicity automaton with parameters in K. We study the relations between the classes of rational stochastic languages $S^{rat}_K(\Sigma)$. We define the notion of residual of a stochastic language and we use it to investigate properties of several subclasses of rational stochastic languages. Lastly, we study the representation of rational stochastic languages by means of multiplicity automata.

•

TL;DR: A theoretical study of pseudo-stochastic rational languages, the languages output by DEES, showing for example that this class is decidable within polynomial time.

Abstract: In probabilistic grammatical inference, a usual goal is to infer a good approximation of an unknown distribution P called a stochastic language. The estimate of P stands in some class of probabilistic models such as probabilistic automata (PA). In this paper, we focus on probabilistic models based on multiplicity automata (MA). The stochastic languages generated by MA are called rational stochastic languages; they strictly include stochastic languages generated by PA; they also admit a very concise canonical representation. Despite the fact that this class is not recursively enumerable, it is efficiently identifiable in the limit by using the algorithm DEES, introduced by the authors in a previous paper. However, the identification is not proper and before the convergence of the algorithm, DEES can produce MA that do not define stochastic languages. Nevertheless, it is possible to use these MA to define stochastic languages. We show that they belong to a broader class of rational series, that we call pseudo-stochastic rational languages. The aim of this paper is twofold. First we provide a theoretical study of pseudo-stochastic rational languages, the languages output by DEES, showing for example that this class is decidable within polynomial time. Second, we have carried out a lot of experiments in order to compare DEES to classical inference algorithms such as ALERGIA and MDI. They show that DEES outperforms them in most cases.

•

L'Abri

TL;DR: Lambek grammars are presented, a formalism issued from categorial grammars which, although not as expressive as needed for a full formalization of natural languages, is particularly suited to easily implement a natural interface between syntax and semantics.

Abstract: We present basic notions of Gold's "learnability in the limit" paradigm, first presented in 1967, a formalization of the cognitive process by which a native speaker gets to grasp the underlying grammar of his/her own native language by being exposed to well formed sentences generated by that grammar. Then we present Lambek grammars, a formalism issued from categorial grammars which, although not as expressive as needed for a full formalization of natural languages, is particularly suited to easily implement a natural interface between syntax and semantics. In the last part of this work, we present a learnability result for Rigid Lambek grammars from structured examples.

•

TL;DR: It is shown that Solomonoff’s model possesses many desirable properties: Fast convergence and strong bounds, and in contrast to most classical continuous prior densities has no zero p(oste)rior problem.

Abstract: Solomonoff completed the Bayesian framework by providing a rigorous, unique, formal, and universal choice for the model class and the prior. We discuss in breadth how and in which sense universal (non-i.i.d.) sequence prediction solves various (philosophical) problems of traditional Bayesian sequence prediction. We show that Solomonoff's model possesses many desirable properties: Fast convergence and strong bounds, and in contrast to most classical continuous prior densities has no zero p(oste)rior problem, i.e. can confirm universal hypotheses, is reparametrization and regrouping invariant, and avoids the old-evidence and updating problem. It even performs well (actually better) in non-computable environments.

•

TL;DR: It is shown that thresholds of parities, a natural class encompassing DNFs, cannot be learned efficiently in the Noise Sensitivity model using only statistical queries, and that a cyclic version of the Random Walk model allows efficient learning of polynomially weighted thresholds of parities.

Abstract: In a recent breakthrough, [Bshouty et al., 2005] obtained the first passive-learning algorithm for DNFs under the uniform distribution. They showed that DNFs are learnable in the Random Walk and Noise Sensitivity models. We extend their results in several directions. We first show that thresholds of parities, a natural class encompassing DNFs, cannot be learned efficiently in the Noise Sensitivity model using only statistical queries. In contrast, we show that a cyclic version of the Random Walk model allows to learn efficiently polynomially weighted thresholds of parities. We also extend the algorithm of Bshouty et al. to the case of Unions of Rectangles, a natural generalization of DNFs to $\{0,...,b-1\}^n$.

•

TL;DR: An approach to the classification problem of machine learning, based on building local classification rules, is developed. The algorithm has polynomial complexity in the typical case and integrates attribute-level selection with rule searching and an original strategy for resolving conflicting rules.

Abstract: An approach to the classification problem of machine learning, based on building local classification rules, is developed. The local rules are considered as projections of the global classification rules to the event we want to classify. A massive global optimization algorithm is used for optimization of the quality criterion. The algorithm, which has polynomial complexity in the typical case, is used to find all high-quality local rules. The other distinctive feature of the algorithm is the integration of attribute-level selection (for ordered attributes) with rule searching and an original strategy for resolving conflicting rules. The algorithm is practical; it was tested on a number of data sets from the UCI repository, and a comparison with other prediction techniques is presented.

•

TL;DR: Assuming a loss function that is convex in the prediction, this work constructs a prediction strategy universal for the class of Markov prediction strategies, not necessarily continuous; allowing randomization removes the requirement of convexity.

Abstract: Assuming that the loss function is convex in the prediction, we construct a prediction strategy universal for the class of Markov prediction strategies, not necessarily continuous. Allowing randomization, we remove the requirement of convexity.

•

TL;DR: In this article, the author reviews some popular kinds of prediction, noting that different predictors (such as people, computer programs, and probabilistic theories) can pursue very different goals, and argues that the theory of competitive on-line learning can benefit from the kinds of prediction that are now foreign to it.

Abstract: Prediction is a complex notion, and different predictors (such as people, computer programs, and probabilistic theories) can pursue very different goals. In this paper I will review some popular kinds of prediction and argue that the theory of competitive on-line learning can benefit from the kinds of prediction that are now foreign to it.

•

TL;DR: A model is proposed for the design of an information system (IS) so that the solutions or responses the system gives to an information search problem (ISP) are based on the user's behaviours and correspond to the needs of the user.

Abstract: In this paper, our aim is to propose a model that helps in the efficient use of an information system by users, within the organization represented by the IS, in order to resolve their decisional problems. In other words we want to aid the user within an organization in obtaining the information that corresponds to his needs (informational needs that result from his decisional problems). This type of information system is what we refer to as an economic intelligence system because of its support for the economic intelligence (EI) processes of the organization. Our assumption is that every EI process begins with the identification of the decisional problem, which is translated into an informational need. This need is then translated into one or many information search problems (ISP). We also assume that an ISP is expressed in terms of the user's expectations and that these expectations determine the activities or the behaviours of the user when he/she uses an IS. The model we are proposing is used for the conception of the IS so that the process of retrieving solution(s), or the responses given by the system to an ISP, is based on these behaviours and corresponds to the needs of the user.

•

TL;DR: This analysis suggests that the matryoshka leverages probabilistic weak classifiers more efficiently than simple decision trees.

Abstract: We present a theory of boosting probabilistic classifiers. We place ourselves in the situation of a user who only provides a stopping parameter and a probabilistic weak learner/classifier and compare three types of boosting algorithms: probabilistic Adaboost, decision tree, and tree of trees of ... of trees, which we call matryoshka. "Nested tree," "embedded tree" and "recursive tree" are also appropriate names for this algorithm, which is one of our contributions. Our other contribution is the theoretical analysis of the algorithms, in which we give training error bounds. This analysis suggests that the matryoshka leverages probabilistic weak classifiers more efficiently than simple decision trees.

•

TL;DR: It is shown that the generalization error of a student can temporarily become smaller than that of a moving teacher, even if the student only uses examples from the moving teacher.

Abstract: In the framework of on-line learning, a learning machine might move around a teacher due to the differences in structures or output functions between the teacher and the learning machine. In this paper we analyze the generalization performance of a new student supervised by a moving machine. A model composed of a fixed true teacher, a moving teacher, and a student is treated theoretically using statistical mechanics, where the true teacher is a nonmonotonic perceptron and the others are simple perceptrons. Calculating the generalization errors numerically, we show that the generalization error of a student can temporarily become smaller than that of a moving teacher, even if the student only uses examples from the moving teacher. However, the generalization error of the student eventually becomes equal to that of the moving teacher. This behavior is qualitatively different from that of a linear model.

•

TL;DR: In this article, sufficient conditions are given on the class of environments under which an agent exists that attains the best asymptotic reward for any environment in the class. The authors analyze how tight these conditions are and how they relate to different probabilistic assumptions known in reinforcement learning and related fields, such as Markov Decision Processes and mixing conditions.

Abstract: We address the problem of reinforcement learning in which observations may exhibit an arbitrary form of stochastic dependence on past observations and actions. The task for an agent is to attain the best possible asymptotic reward where the true generating environment is unknown but belongs to a known countable family of environments. We find some sufficient conditions on the class of environments under which an agent exists which attains the best asymptotic reward for any environment in the class. We analyze how tight these conditions are and how they relate to different probabilistic assumptions known in reinforcement learning and related fields, such as Markov Decision Processes and mixing conditions.

•

TL;DR: This paper addresses the issue of policy evaluation in Markov Decision Processes using linear function approximation, and provides a unified view of algorithms such as TD(lambda), LSTD(lambda), iLSTD, and residual-gradient TD.

Abstract: This paper addresses the issue of policy evaluation in Markov Decision Processes, using linear function approximation. It provides a unified view of algorithms such as TD(lambda), LSTD(lambda), iLSTD, residual-gradient TD. It is asserted that they all consist in minimizing a gradient function and differ by the form of this function and their means of minimizing it. Two new schemes are introduced in that framework: Full-gradient TD which uses a generalization of the principle introduced in iLSTD, and EGD TD, which reduces the gradient by successive equi-gradient descents. These three algorithms form a new intermediate family with the interesting property of making much better use of the samples than TD while keeping a gradient descent scheme, which is useful for complexity issues and optimistic policy iteration.
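As a point of reference for the baseline named in the abstract above, here is a minimal sketch of textbook TD(lambda) policy evaluation with linear function approximation and accumulating eligibility traces. It is not the paper's Full-gradient TD or EGD TD scheme; the three-state deterministic chain, the one-hot features, the step size and the function name `td_lambda` are all assumptions made for illustration.

```python
import numpy as np


def td_lambda(episodes, features, n_features, gamma=0.9, lam=0.8, alpha=0.1):
    """Textbook TD(lambda) with linear values v(s) = w . features(s).

    `episodes` is a list of trajectories, each a list of
    (state, reward, next_state) with next_state None at termination.
    """
    w = np.zeros(n_features)
    for episode in episodes:
        z = np.zeros(n_features)  # eligibility trace, reset at each episode start
        for state, reward, next_state in episode:
            phi = features(state)
            v = w @ phi
            v_next = 0.0 if next_state is None else w @ features(next_state)
            delta = reward + gamma * v_next - v  # temporal-difference error
            z = gamma * lam * z + phi            # accumulate the trace
            w = w + alpha * delta * z            # gradient-style weight update
    return w


def one_hot(state, n_states=3):
    """Tabular features, the simplest instance of linear function approximation."""
    phi = np.zeros(n_states)
    phi[state] = 1.0
    return phi


# Hypothetical deterministic chain 0 -> 1 -> 2 -> terminal, reward 1 on the last step.
episode = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, None)]
weights = td_lambda([episode] * 500, one_hot, n_features=3)
```

On this chain the true discounted values are 0.81, 0.9 and 1.0 for states 0, 1 and 2, and the learned weights settle at those values.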

•

TL;DR: In this article, the authors proposed a nonlinear blind source separation (BSS) method to find a source time series with local velocity cross correlations that vanish everywhere in stimulus state space.

Abstract: Given a time series of multicomponent measurements of an evolving stimulus, nonlinear blind source separation (BSS) seeks to find a "source" time series, comprised of statistically independent combinations of the measured components. In this paper, we seek a source time series with local velocity cross correlations that vanish everywhere in stimulus state space. However, in an earlier paper the local velocity correlation matrix was shown to constitute a metric on state space. Therefore, nonlinear BSS maps onto a problem of differential geometry: given the metric observed in the measurement coordinate system, find another coordinate system in which the metric is diagonal everywhere. We show how to determine if the observed data are separable in this way, and, if they are, we show how to construct the required transformation to the source coordinate system, which is essentially unique except for an unknown rotation that can be found by applying the methods of linear BSS. Thus, the proposed technique solves nonlinear BSS in many situations or, at least, reduces it to linear BSS, without the use of probabilistic, parametric, or iterative procedures. This paper also describes a generalization of this methodology that performs nonlinear independent subspace separation. In every case, the resulting decomposition of the observed data is an intrinsic property of the stimulus' evolution in the sense that it does not depend on the way the observer chooses to view it (e.g., the choice of the observing machine's sensors). In other words, the decomposition is a property of the evolution of the "real" stimulus that is "out there" broadcasting energy to the observer. The technique is illustrated with analytic and numerical examples.