Showing papers in "Journal of Machine Learning Research in 2002"


Journal ArticleDOI
TL;DR: Experimental results showing that employing the active learning method can significantly reduce the need for labeled training instances in both the standard inductive and transductive settings are presented.
Abstract: Support vector machines have met with significant success in numerous real-world learning tasks. However, like most machine learning algorithms, they are generally applied using a randomly selected training set classified in advance. In many settings, we also have the option of using pool-based active learning. Instead of using a randomly selected training set, the learner has access to a pool of unlabeled instances and can request the labels for some number of them. We introduce a new algorithm for performing active learning with support vector machines, i.e., an algorithm for choosing which instances to request next. We provide a theoretical motivation for the algorithm using the notion of a version space. We present experimental results showing that employing our active learning method can significantly reduce the need for labeled training instances in both the standard inductive and transductive settings.
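A minimal sketch of one common way to instantiate this idea is shown below: repeatedly query the unlabeled pool instance closest to the current hyperplane. It uses scikit-learn's SVC, assumes X_pool is a NumPy array, and takes the labeling oracle as a callable; the querying rule and all names here are illustrative rather than the paper's exact criterion.

import numpy as np
from sklearn.svm import SVC

def margin_based_active_learning(X_init, y_init, X_pool, oracle, n_queries=10, C=1.0):
    """Greedily query the pool instance closest to the current SVM hyperplane.
    `oracle(x)` is an assumed callable returning the true label of a pool point."""
    X_l, y_l = list(X_init), list(y_init)
    remaining = list(range(len(X_pool)))
    clf = SVC(kernel="linear", C=C)
    for _ in range(n_queries):
        clf.fit(np.array(X_l), np.array(y_l))
        # Absolute decision value = (scaled) distance to the separating hyperplane.
        scores = np.abs(clf.decision_function(X_pool[remaining]))
        pick = remaining[int(np.argmin(scores))]      # most "uncertain" instance
        X_l.append(X_pool[pick])
        y_l.append(oracle(X_pool[pick]))              # request its label
        remaining.remove(pick)
    return clf

Querying near the boundary roughly halves the version space at each step, which is the kind of theoretical motivation the abstract refers to.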

3,212 citations


Journal Article
TL;DR: This paper describes the algorithmic implementation of multiclass kernel-based vector machines using a generalized notion of the margin to multiclass problems, and describes an efficient fixed-point algorithm for solving the reduced optimization problems and proves its convergence.
Abstract: In this paper we describe the algorithmic implementation of multiclass kernel-based vector machines. Our starting point is a generalized notion of the margin to multiclass problems. Using this notion we cast multiclass categorization problems as a constrained optimization problem with a quadratic objective function. Unlike most previous approaches, which typically decompose a multiclass problem into multiple independent binary classification tasks, our notion of margin yields a direct method for training multiclass predictors. By using the dual of the optimization problem we are able to incorporate kernels with a compact set of constraints and decompose the dual problem into multiple optimization problems of reduced size. We describe an efficient fixed-point algorithm for solving the reduced optimization problems and prove its convergence. We then discuss technical details that yield significant running time improvements for large datasets. Finally, we describe various experiments with our approach comparing it to previously studied kernel-based methods. Our experiments indicate that for multiclass problems we attain state-of-the-art accuracy.
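As a concrete illustration of the generalized margin, the sketch below evaluates the resulting multiclass hinge loss in its linear, primal form: the score of the correct class must exceed every other class score by a margin of one. The paper itself works with the kernelized dual; shapes and names here are illustrative.

import numpy as np

def multiclass_hinge_loss(W, X, y):
    """W: (n_classes, n_features); X: (n_samples, n_features); y: integer labels."""
    scores = X @ W.T                                   # (n_samples, n_classes)
    correct = scores[np.arange(len(y)), y]             # score of the true class
    margins = scores + 1.0
    margins[np.arange(len(y)), y] -= 1.0               # no margin requirement against itself
    # Per-example loss: max over classes of (score + margin) minus the true-class score.
    return np.maximum(0.0, margins.max(axis=1) - correct).mean()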

2,214 citations


Journal ArticleDOI
TL;DR: These notions of stability for learning algorithms are defined and it is shown how to use these notions to derive generalization error bounds based on the empirical error and the leave-one-out error.
Abstract: We define notions of stability for learning algorithms and show how to use these notions to derive generalization error bounds based on the empirical error and the leave-one-out error. The methods we use can be applied in the regression framework as well as in the classification one when the classifier is obtained by thresholding a real-valued function. We study the stability properties of large classes of learning algorithms such as regularization based algorithms. In particular we focus on Hilbert space regularization and Kullback-Leibler regularization. We demonstrate how to apply the results to SVM for regression and classification.

1,690 citations


Journal ArticleDOI
TL;DR: In this paper, a Gaussian kernel based clustering method using support vector machines (SVM) is proposed to find the minimal enclosing sphere, which can separate into several components, each enclosing a separate cluster of points.
Abstract: We present a novel clustering method using the approach of support vector machines. Data points are mapped by means of a Gaussian kernel to a high dimensional feature space, where we search for the minimal enclosing sphere. This sphere, when mapped back to data space, can separate into several components, each enclosing a separate cluster of points. We present a simple algorithm for identifying these clusters. The width of the Gaussian kernel controls the scale at which the data is probed while the soft margin constant helps cope with outliers and overlapping clusters. The structure of a dataset is explored by varying the two parameters, maintaining a minimal number of support vectors to assure smooth cluster boundaries. We demonstrate the performance of our algorithm on several datasets.
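The clustering step hinges on the feature-space distance from a mapped point to the centre of the minimal enclosing sphere. A small sketch of that quantity is given below, written in terms of the Gaussian kernel and dual coefficients beta that are assumed to come from solving the sphere QP (not shown here).

import numpy as np

def rbf(a, b, q):
    # Gaussian kernel exp(-q * ||a - b||^2), broadcasting over the last axis.
    return np.exp(-q * np.sum((a - b) ** 2, axis=-1))

def squared_radius(x, X_sv, beta, q):
    """Squared feature-space distance from the image of x to the sphere centre.
    x: query point (d,); X_sv: support points (n, d); beta: dual coefficients (n,)."""
    k_xx = 1.0                                         # for a Gaussian kernel, K(x, x) = 1
    k_xs = rbf(X_sv, x, q)                             # K(x_j, x) for all support points
    K_ss = rbf(X_sv[:, None, :], X_sv[None, :, :], q)  # Gram matrix of the support points
    return k_xx - 2.0 * beta @ k_xs + beta @ K_ss @ beta

In the paper, two points are assigned to the same cluster when the segment connecting them stays inside the sphere, i.e. when this radius stays below the sphere radius along the segment.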

1,389 citations


Journal ArticleDOI
TL;DR: The SVM approach as represented by Schoelkopf was superior to all the methods except the neural network one, where it was, although occasionally worse, essentially comparable.
Abstract: We implemented versions of the SVM appropriate for one-class classification in the context of information retrieval. The experiments were conducted on the standard Reuters data set. For the SVM implementation we used both a version of Schoelkopf et al. and a somewhat different version of one-class SVM based on identifying "outlier" data as representative of the second class. We report on experiments with different kernels for both of these implementations and with different representations of the data, including binary vectors, tf-idf representation and a modification called "Hadamard" representation. We then compared these with one-class versions of the following algorithms: prototype (Rocchio), nearest neighbor, naive Bayes, and a natural one-class neural network classification method based on "bottleneck" compression generated filters. The SVM approach as represented by Schoelkopf was superior to all the methods except the neural network one, where it was, although occasionally worse, essentially comparable. However, the SVM methods turned out to be quite sensitive to the choice of representation and kernel in ways which are not well understood; for the time being, therefore, the neural network approach remains the most robust.

1,293 citations


Journal ArticleDOI
TL;DR: A novel kernel for comparing two text documents is introduced: an inner product in the feature space of all subsequences of length k, which can be efficiently evaluated by a dynamic programming technique.
Abstract: We propose a novel approach for categorizing text documents based on the use of a special kernel. The kernel is an inner product in the feature space generated by all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences that are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique. Experimental comparisons of the performance of the kernel compared with a standard word feature space kernel (Joachims, 1998) show positive results on modestly sized datasets. The case of contiguous subsequences is also considered for comparison with the subsequences kernel with different decay factors. For larger documents and datasets the paper introduces an approximation technique that is shown to deliver good approximations efficiently for large datasets.
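The feature map can be written down directly, which makes the kernel easy to state even though it is expensive to compute this way. The brute-force sketch below evaluates the definition literally, enumerating index tuples and weighting each matched subsequence by the decay factor raised to the span it occupies in each text; the paper's dynamic-programming recursion computes the same value efficiently.

from itertools import combinations

def subseq_kernel(s, t, k, lam):
    """Unnormalized k-length subsequence kernel, computed directly from its definition."""
    def feats(doc):
        phi = {}
        for idx in combinations(range(len(doc)), k):   # all increasing index tuples
            u = "".join(doc[i] for i in idx)           # the subsequence itself
            span = idx[-1] - idx[0] + 1                # length it occupies in the text
            phi[u] = phi.get(u, 0.0) + lam ** span
        return phi
    ps, pt = feats(s), feats(t)
    return sum(w * pt.get(u, 0.0) for u, w in ps.items())

# Example: kernel value between two short strings (only the shared "ca" contributes).
print(subseq_kernel("cat", "car", 2, 0.5))

This enumeration is exponential in k and document length, which is exactly the cost the paper's recursion avoids.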

1,281 citations


Journal ArticleDOI
TL;DR: The theoretical description of the kernel PLS algorithm is given and the algorithm is experimentally compared with the existing kernel PCR and kernel ridge regression techniques to demonstrate that on the data sets employed Kernel PLS achieves the same results as kernel PCR but uses significantly fewer, qualitatively different components.
Abstract: A family of regularized least squares regression models in a Reproducing Kernel Hilbert Space is extended by the kernel partial least squares (PLS) regression model. Similar to principal components regression (PCR), PLS is a method based on the projection of input (explanatory) variables to the latent variables (components). However, in contrast to PCR, PLS creates the components by modeling the relationship between input and output variables while maintaining most of the information in the input variables. PLS is useful in situations where the number of explanatory variables exceeds the number of observations and/or a high level of multicollinearity among those variables is assumed. Motivated by this fact we will provide a kernel PLS algorithm for construction of nonlinear regression models in possibly high-dimensional feature spaces. We give the theoretical description of the kernel PLS algorithm and we experimentally compare the algorithm with the existing kernel PCR and kernel ridge regression techniques. We will demonstrate that on the data sets employed kernel PLS achieves the same results as kernel PCR but uses significantly fewer, qualitatively different components.

898 citations


Journal ArticleDOI
TL;DR: In this paper, the authors consider using a score equivalent criterion in conjunction with a heuristic search algorithm to perform model selection or model averaging, and show that more sophisticated search algorithms are likely to benefit much more.
Abstract: Two Bayesian-network structures are said to be equivalent if the set of distributions that can be represented with one of those structures is identical to the set of distributions that can be represented with the other. Many scoring criteria that are used to learn Bayesian-network structures from data are score equivalent; that is, these criteria do not distinguish among networks that are equivalent. In this paper, we consider using a score equivalent criterion in conjunction with a heuristic search algorithm to perform model selection or model averaging. We argue that it is often appropriate to search among equivalence classes of network structures as opposed to the more common approach of searching among individual Bayesian-network structures. We describe a convenient graphical representation for an equivalence class of structures, and introduce a set of operators that can be applied to that representation by a search algorithm to move among equivalence classes. We show that our equivalence-class operators can be scored locally, and thus share the computational efficiency of traditional operators defined for individual structures. We show experimentally that a greedy model-selection algorithm using our representation yields slightly higher-scoring structures than the traditional approach without any additional time overhead, and we argue that more sophisticated search algorithms are likely to benefit much more.
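The equivalence relation itself has a simple combinatorial test (Verma and Pearl): two DAGs are equivalent iff they have the same skeleton and the same v-structures. The sketch below implements that test for DAGs given as edge lists; the paper's contribution is to search directly over a graphical representation of these equivalence classes rather than over individual DAGs.

def skeleton(edges):
    # Undirected version of the edge set.
    return {frozenset(e) for e in edges}

def v_structures(edges):
    # Collect triples a -> c <- b where a and b are not adjacent.
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)
    skel = skeleton(edges)
    vs = set()
    for c, ps in parents.items():
        for a in ps:
            for b in ps:
                if a < b and frozenset((a, b)) not in skel:
                    vs.add((a, c, b))
    return vs

def markov_equivalent(edges1, edges2):
    return skeleton(edges1) == skeleton(edges2) and v_structures(edges1) == v_structures(edges2)

# Example: a -> b <- c is not equivalent to a -> b -> c (different v-structures).
print(markov_equivalent([("a", "b"), ("c", "b")], [("a", "b"), ("b", "c")]))  # False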

711 citations


Journal Article
Shai Fine1, Katya Scheinberg1
TL;DR: This work shows that for a low rank kernel matrix it is possible to design a better interior point method (IPM) in terms of storage requirements as well as computational complexity and derives an upper bound on the change in the objective function value based on the approximation error and the number of active constraints (support vectors).
Abstract: SVM training is a convex optimization problem which scales with the training set size rather than the feature space dimension. While this is usually considered to be a desired quality, in large scale problems it may cause training to be impractical. The common techniques to handle this difficulty basically build a solution by solving a sequence of small scale subproblems. Our current effort is concentrated on the rank of the kernel matrix as a source for further enhancement of the training procedure. We first show that for a low rank kernel matrix it is possible to design a better interior point method (IPM) in terms of storage requirements as well as computational complexity. We then suggest an efficient use of a known factorization technique to approximate a given kernel matrix by a low rank matrix, which in turn will be used to feed the optimizer. Finally, we derive an upper bound on the change in the objective function value based on the approximation error and the number of active constraints (support vectors). This bound is general in the sense that it holds regardless of the approximation method.
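One standard factorization of the kind the paper refers to is a pivoted incomplete Cholesky decomposition, which approximates the kernel matrix K by a low-rank factor G with K ≈ G Gᵀ that can then be handed to a low-rank-aware optimizer. The sketch below uses illustrative defaults for the rank and tolerance, not values from the paper.

import numpy as np

def incomplete_cholesky(K, max_rank, tol=1e-8):
    """Pivoted incomplete Cholesky: returns G such that K is approximately G @ G.T."""
    n = K.shape[0]
    G = np.zeros((n, max_rank))
    d = np.diag(K).astype(float).copy()          # residual diagonal
    for j in range(max_rank):
        i = int(np.argmax(d))                    # pivot on the largest residual
        if d[i] <= tol:                          # residual trace is (numerically) zero
            return G[:, :j]
        G[:, j] = (K[:, i] - G[:, :j] @ G[i, :j]) / np.sqrt(d[i])
        d -= G[:, j] ** 2
        d[i] = 0.0                               # pivot row is now exactly represented
    return G

Stopping either at a fixed rank or when the residual trace falls below a tolerance gives the kind of controlled approximation error that the bound in the abstract is phrased in terms of.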

695 citations


Journal ArticleDOI
TL;DR: It is shown that the soft margin algorithms with universal kernels are consistent for a large class of classification problems including some kind of noisy tasks provided that the regularization parameter is chosen well.
Abstract: In this article we study the generalization abilities of several classifiers of support vector machine (SVM) type using a certain class of kernels that we call universal. It is shown that the soft margin algorithms with universal kernels are consistent for a large class of classification problems, including some kinds of noisy tasks, provided that the regularization parameter is chosen well. In particular we derive a simple sufficient condition for this parameter in the case of Gaussian RBF kernels. On the one hand our considerations are based on an investigation of an approximation property---the so-called universality---of the used kernels that ensures that all continuous functions can be approximated by certain kernel expressions. This approximation property also gives a new insight into the role of kernels in these and other algorithms. On the other hand the results are achieved by a precise study of the underlying optimization problems of the classifiers. Furthermore, we show consistency for the maximal margin classifier as well as for the soft margin SVMs in the presence of large margins. In this case it turns out that even constant regularization parameters ensure consistency for the soft margin SVMs. Finally we prove that even for simple, noise-free classification problems, SVMs with polynomial kernels can behave arbitrarily badly.

681 citations


Journal ArticleDOI
TL;DR: An empirical evaluation of round robin classification, implemented as a wrapper around the Ripper rule learning algorithm, on 20 multi-class datasets from the UCI database repository shows that the technique is very likely to improve Ripper's classification accuracy without having a high risk of decreasing it.
Abstract: In this paper, we discuss round robin classification (aka pairwise classification), a technique for handling multi-class problems with binary classifiers by learning one classifier for each pair of classes. We present an empirical evaluation of the method, implemented as a wrapper around the Ripper rule learning algorithm, on 20 multi-class datasets from the UCI database repository. Our results show that the technique is very likely to improve Ripper's classification accuracy without having a high risk of decreasing it. More importantly, we give a general theoretical analysis of the complexity of the approach and show that its run-time complexity is below that of the commonly used one-against-all technique. These theoretical results are not restricted to rule learning but are also of interest to other communities where pairwise classification has recently received some attention. Furthermore, we investigate its properties as a general ensemble technique and show that round robin classification with C5.0 may improve C5.0's performance on multi-class problems. However, this improvement does not reach the performance increase of boosting, and a combination of boosting and round robin classification does not produce any gain over conventional boosting. Finally, we show that the performance of round robin classification can be further improved by a straightforward integration with bagging.
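The basic construction is easy to reproduce with any binary base learner. The sketch below uses scikit-learn-style base learners and majority voting at prediction time; it illustrates the scheme itself, not the paper's Ripper-based implementation.

import numpy as np
from itertools import combinations

def train_round_robin(X, y, make_base_learner):
    """Train one binary classifier per pair of classes."""
    classes = np.unique(y)
    models = {}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)               # only the two classes of this round
        clf = make_base_learner()
        clf.fit(X[mask], y[mask])
        models[(a, b)] = clf
    return classes, models

def predict_round_robin(classes, models, X):
    """Each pairwise model casts one vote per example; the majority wins."""
    votes = np.zeros((len(X), len(classes)), dtype=int)
    index = {c: i for i, c in enumerate(classes)}
    for clf in models.values():
        for row, p in enumerate(clf.predict(X)):
            votes[row, index[p]] += 1
    return classes[np.argmax(votes, axis=1)]

# Usage with any binary learner, e.g.:
#   from sklearn.tree import DecisionTreeClassifier
#   classes, models = train_round_robin(X_train, y_train, DecisionTreeClassifier)
#   y_hat = predict_round_robin(classes, models, X_test)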

Journal ArticleDOI
TL;DR: It is shown that when the dissimilarities used in practice are far from ideal and the performance of the nearest neighbor rule suffers from its sensitivity to noisy examples, other, more global classification techniques are preferable.
Abstract: Usually, objects to be classified are represented by features. In this paper, we discuss an alternative object representation based on dissimilarity values. If such distances separate the classes well, the nearest neighbor method offers a good solution. However, dissimilarities used in practice are usually far from ideal and the performance of the nearest neighbor rule suffers from its sensitivity to noisy examples. We show that other, more global classification techniques are preferable to the nearest neighbor rule in such cases. For classification purposes, two different ways of using generalized dissimilarity kernels are considered. In the first one, distances are isometrically embedded in a pseudo-Euclidean space and the classification task is performed there. In the second approach, classifiers are built directly on distance kernels. Both approaches are described theoretically and then compared using experiments with different dissimilarity measures and datasets including degraded data simulating the problem of missing values.

Journal ArticleDOI
TL;DR: This work proposes a method for generating artificial outliers, uniformly distributed in a hypersphere around the target set, and obtains an efficient estimate for the volume covered by the one-class classifiers.
Abstract: In one-class classification, one class of data, called the target class, has to be distinguished from the rest of the feature space. It is assumed that only examples of the target class are available. This classifier has to be constructed such that objects not originating from the target set, by definition outlier objects, are not classified as target objects. In previous research the support vector data description (SVDD) is proposed to solve the problem of one-class classification. It models a hypersphere around the target set, and by the introduction of kernel functions, more flexible descriptions are obtained. In the original optimization of the SVDD, two parameters have to be given beforehand by the user. To automatically optimize the values for these parameters, the error on both the target and outlier data has to be estimated. Because no outlier examples are available, we propose a method for generating artificial outliers, uniformly distributed in a hypersphere. A (relatively) efficient estimate for the volume covered by the one-class classifiers is obtained, and with it an estimate of the outlier error. Results are shown for artificial data and for real world data.
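Drawing points uniformly from a hypersphere is the one non-obvious ingredient; a standard recipe (Gaussian direction, radius scaled by U^(1/d)) and the resulting volume estimate are sketched below, with `accepts` standing in for an arbitrary trained one-class classifier. The names are illustrative, not the paper's.

import numpy as np

def uniform_in_hypersphere(n, d, center, radius, rng=None):
    """Draw n points uniformly from the d-dimensional ball around `center`."""
    rng = np.random.default_rng(rng)
    directions = rng.standard_normal((n, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)   # uniform on the sphere
    radii = radius * rng.uniform(size=(n, 1)) ** (1.0 / d)            # uniform in the ball
    return center + radii * directions

def covered_volume_fraction(accepts, outliers):
    """Fraction of artificial outliers accepted by the classifier; multiplied by the
    sphere volume this estimates the volume covered by the one-class description."""
    return np.mean([accepts(z) for z in outliers])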

Journal ArticleDOI
TL;DR: This paper investigates two closely related methods for deriving upper bounds on covering numbers for linear function classes: one relying on the so-called Maurey's lemma and one using techniques from the mistake bound framework in online learning.
Abstract: Recently, sample complexity bounds have been derived for problems involving linear functions such as neural networks and support vector machines. In many of these theoretical studies, the concept of covering numbers played an important role. It is thus useful to study covering numbers for linear function classes. In this paper, we investigate two closely related methods to derive upper bounds on these covering numbers. The first method, already employed in some earlier studies, relies on the so-called Maurey's lemma; the second method uses techniques from the mistake bound framework in online learning. We compare results from these two methods, as well as their consequences in some learning formulations.

Journal ArticleDOI
TL;DR: An algorithm is presented that allows unnecessary support vectors to be recognized and eliminated while leaving the solution otherwise unchanged, and in most cases the procedure leads to a reduction in the number of support vectors.
Abstract: This paper demonstrates that standard algorithms for training support vector machines generally produce solutions with a greater number of support vectors than are strictly necessary. An algorithm is presented that allows unnecessary support vectors to be recognized and eliminated while leaving the solution otherwise unchanged. The algorithm is applied to a variety of benchmark data sets (for both classification and regression) and in most cases the procedure leads to a reduction in the number of support vectors. In some cases the reduction is substantial.

Journal Article
TL;DR: In this paper, an incremental learning algorithm called ALMA_p (Approximate Large Margin algorithm w.r.t. norm p) was proposed, which takes O((p-1)/(α²γ²)) corrections to separate the data with p-norm margin larger than (1-α)γ, where γ is the (normalized) p-norm margin of the data.
Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm p ≥ 2 for a set of linearly separable data. Our algorithm, called ALMA_p (Approximate Large Margin algorithm w.r.t. norm p), takes O((p-1)/(α²γ²)) corrections to separate the data with p-norm margin larger than (1-α)γ, where γ is the (normalized) p-norm margin of the data. ALMA_p avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's Perceptron algorithm. We performed extensive experiments on both real-world and artificial datasets. We compared ALMA_2 (i.e., ALMA_p with p = 2) to standard Support Vector Machines (SVM) and to two incremental algorithms: the Perceptron algorithm and Li and Long's ROMMA. The accuracy levels achieved by ALMA_2 are superior to those achieved by the Perceptron algorithm and ROMMA, but slightly inferior to SVM's. On the other hand, ALMA_2 is considerably faster and easier to implement than standard SVM training algorithms. When learning sparse target vectors, ALMA_p with p > 2 largely outperforms Perceptron-like algorithms, such as ALMA_2.

Journal Article
TL;DR: A general statistical model for text chunking is proposed, based on a generalization of the Winnow algorithm that provides reliable confidence estimates for its classification predictions, and the system is shown to achieve state of the art performance in text chunking with less computational cost than previous systems.
Abstract: This paper describes a text chunking system based on a generalization of the Winnow algorithm. We propose a general statistical model for text chunking which we then convert into a classification problem. We argue that the Winnow family of algorithms is particularly suitable for solving classification problems arising from NLP applications, due to their robustness to irrelevant features. However, in theory, Winnow may not converge for linearly non-separable data. To remedy this problem, we employ a generalization of the original Winnow method. An additional advantage of the new algorithm is that it provides reliable confidence estimates for its classification predictions. This property is required in our statistical modeling approach. We show that our system achieves state of the art performance in text chunking with less computational cost than previous systems.
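For reference, the classical Winnow update that the system generalizes is a few lines of multiplicative reweighting over Boolean features, which is what gives the family its robustness to many irrelevant features. The sketch below shows only that baseline, not the paper's regularized variant or its confidence estimates.

import numpy as np

def winnow_train(X, y, alpha=2.0, epochs=5):
    """Classical (positive) Winnow. X: (n_samples, n_features) with 0/1 entries; y: 0/1 labels."""
    n_features = X.shape[1]
    w = np.ones(n_features)
    theta = n_features                          # standard threshold choice
    for _ in range(epochs):
        for x, label in zip(X, y):
            pred = 1 if w @ x >= theta else 0
            if pred == label:
                continue
            if label == 1:                      # promotion on a missed positive
                w[x == 1] *= alpha
            else:                               # demotion on a false positive
                w[x == 1] /= alpha
    return w, theta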

Journal ArticleDOI
Tong Zhang1, Vijay S. Iyengar1
TL;DR: This paper proposes the use of linear classifiers in a model-based recommender system and experimental results indicate that these linear models are well suited for this application.
Abstract: Recommender systems use historical data on user preferences and other available data on users (for example, demographics) and items (for example, taxonomy) to predict items a new user might like. Applications of these methods include recommending items for purchase and personalizing the browsing experience on a web-site. Collaborative filtering methods have focused on using just the history of user preferences to make the recommendations. These methods have been categorized as memory-based if they operate over the entire data to make predictions and as model-based if they use the data to build a model which is then used for predictions. In this paper, we propose the use of linear classifiers in a model-based recommender system. We compare our method with another model-based method using decision trees and with memory-based methods using data from various domains. Our experimental results indicate that these linear models are well suited for this application. They outperform a commonly proposed memory-based method in accuracy and also have a better tradeoff between off-line and on-line computational requirements.

Journal ArticleDOI
TL;DR: The authors presented memory-based learning approaches to shallow parsing and applied these to five tasks: base noun phrase identification, arbitrary base phrase recognition, clause detection, noun phrase parsing and full parsing.
Abstract: We present memory-based learning approaches to shallow parsing and apply these to five tasks: base noun phrase identification, arbitrary base phrase recognition, clause detection, noun phrase parsing and full parsing. We use feature selection techniques and system combination methods for improving the performance of the memory-based learner. Our approach is evaluated on standard data sets and the results are compared with those of other systems. This reveals that our approach works well for base phrase identification while its application towards recognizing embedded structures leaves some room for improvement.

Journal ArticleDOI
TL;DR: This work presents SUBDUE and the development of its clustering functionalities and develops a new metric for comparing structurally-defined clusterings in both structured and unstructured data.
Abstract: Hierarchical conceptual clustering has proven to be a useful, although under-explored, data mining technique. A graph-based representation of structural information combined with a substructure discovery technique has been shown to be successful in knowledge discovery. The SUBDUE substructure discovery system provides one such combination of approaches. This work presents SUBDUE and the development of its clustering functionalities. Several examples are used to illustrate the validity of the approach both in structured and unstructured domains, as well as to compare SUBDUE to the Cobweb clustering algorithm. We also develop a new metric for comparing structurally-defined clusterings. Results show that SUBDUE successfully discovers hierarchical clusterings in both structured and unstructured data.

Journal ArticleDOI
TL;DR: This paper formalizes the learning-curve sampling method and its associated cost-benefit tradeoff in terms of decision theory, and describes the application of this method to the task of model-based clustering via the expectation-maximization (EM) algorithm.
Abstract: We examine the learning-curve sampling method, an approach for applying machine-learning algorithms to large data sets. The approach is based on the observation that the computational cost of learning a model increases as a function of the sample size of the training data, whereas the accuracy of a model has diminishing improvements as a function of sample size. Thus, the learning-curve sampling method monitors the increasing costs and performance as larger and larger amounts of data are used for training, and terminates learning when future costs outweigh future benefits. In this paper, we formalize the learning-curve sampling method and its associated cost-benefit tradeoff in terms of decision theory. In addition, we describe the application of the learning-curve sampling method to the task of model-based clustering via the expectation-maximization (EM) algorithm. In experiments on three real data sets, we show that the learning-curve sampling method produces models that are nearly as accurate as those trained on complete data sets, but with dramatically reduced learning times. Finally, we describe an extension of the basic learning-curve approach for model-based clustering that results in an additional speedup. This extension is based on the observation that the shape of the learning curve for a given model and data set is roughly independent of the number of EM iterations used during training. Thus, we run EM for only a few iterations to decide how many cases to use for training, and then run EM to full convergence once the number of cases is selected.
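The control loop is easy to state in code. The sketch below assumes cost and benefit have already been mapped onto a common utility scale (which is what the decision-theoretic formalization in the paper provides) and uses placeholder names for the callbacks and schedule.

def learning_curve_sampling(train_and_score, cost_of, data, schedule):
    """train_and_score(sample) -> utility of the model trained on `sample`;
    cost_of(n) -> training cost for n cases, assumed to be on the same utility scale."""
    prev_benefit = prev_cost = None
    chosen_n = schedule[0]
    for n in schedule:                          # e.g. [1000, 2000, 4000, ...]
        benefit = train_and_score(data[:n])
        cost = cost_of(n)
        if prev_benefit is not None and (benefit - prev_benefit) <= (cost - prev_cost):
            break                               # future costs outweigh future benefits
        chosen_n, prev_benefit, prev_cost = n, benefit, cost
    return chosen_n

In the model-based clustering extension described in the abstract, the scoring step would run EM for only a few iterations to trace the curve, and EM would be run to full convergence only once the sample size has been chosen.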

Journal Article
TL;DR: A unified technique to solve different shallow parsing tasks as a tagging problem using a Hidden Markov Model-based approach (HMM), which constructs a Specialized HMM which gives more complete contextual models.
Abstract: We present a unified technique to solve different shallow parsing tasks as a tagging problem using a Hidden Markov Model-based approach (HMM). This technique consists of the incorporation of the relevant information for each task into the models. To do this, the training corpus is transformed to take into account this information. In this way, no change is necessary for either the training or tagging process, so it allows for the use of a standard HMM approach. Taking into account this information, we construct a Specialized HMM which gives more complete contextual models. We have tested our system on chunking and clause identification tasks using different specialization criteria. The results obtained are in line with the results reported for most of the relevant state-of-the-art approaches.

Journal ArticleDOI
TL;DR: A characterization of learnability with extended statistical queries is developed, and it is shown that this characterization, when applied to SQ-Dρ, is tight in terms of learning parity functions.
Abstract: The Kushilevitz-Mansour (KM) algorithm is an algorithm that finds all the "large" Fourier coefficients of a Boolean function. It is the main tool for learning decision trees and DNF expressions in the PAC model with respect to the uniform distribution. The algorithm requires access to the membership query (MQ) oracle. The access is often unavailable in learning applications and thus the KM algorithm cannot be used. We significantly weaken this requirement by producing an analogue of the KM algorithm that uses extended statistical queries (SQ) (SQs in which the expectation is taken with respect to a distribution given by a learning algorithm). We restrict the set of distributions that a learning algorithm may use for its statistical queries to be a set of product distributions with each bit being 1 with probability ρ, 1/2 or 1-ρ for a constant ρ with 0 < ρ < 1/2 (we denote the resulting model by SQ-Dρ). Our analogue finds all the "large" Fourier coefficients of degree lower than c log(n) (we call it the Bounded Sieve (BS)). We use BS to learn decision trees and, by adapting Freund's boosting technique, we give an algorithm that learns DNF in SQ-Dρ. An important property of the model is that its algorithms can be simulated by MQs with persistent noise. With some modifications BS can also be simulated by MQs with product attribute noise (i.e., for a query x the oracle changes every bit of x with some constant probability and calculates the value of the target function at the resulting point) and classification noise. This implies learnability of decision trees and weak learnability of DNF with this non-trivial noise. In the second part of this paper we develop a characterization of learnability with these extended statistical queries. We show that our characterization, when applied to SQ-Dρ, is tight in terms of learning parity functions. We extend the result given by Blum et al. by proving that there is a class learnable in the PAC model with random classification noise but not learnable in SQ-Dρ.
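To make the object concrete: the Fourier coefficients in question are f_hat(S) = E_x[f(x) * chi_S(x)], where chi_S(x) = (-1)^(sum of x_i over i in S) and f maps {0,1}^n to {-1,+1}. The brute-force sketch below enumerates all coefficients above a threshold, which is exactly the exponential cost that KM (with membership queries) and the paper's Bounded Sieve (with extended statistical queries) avoid.

from itertools import combinations, product

def large_fourier_coefficients(f, n, threshold):
    """f maps a tuple in {0,1}^n to +1/-1; returns {S: f_hat(S)} for coefficients above threshold."""
    points = list(product((0, 1), repeat=n))
    out = {}
    for size in range(n + 1):
        for S in combinations(range(n), size):
            total = sum(f(x) * (-1) ** sum(x[i] for i in S) for x in points)
            coeff = total / len(points)          # empirical average over all inputs
            if abs(coeff) >= threshold:
                out[S] = coeff
    return out

# Example: parity on bits {0, 2} of a 3-bit input has a single coefficient of 1.
parity = lambda x: (-1) ** (x[0] ^ x[2])
print(large_fourier_coefficients(parity, 3, 0.5))    # {(0, 2): 1.0}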

Journal ArticleDOI
TL;DR: This article introduces the problem of partial or shallow parsing (assigning partial syntactic structure to sentences), explains why it is an important natural language processing (NLP) task, and suggests future directions for machine learning of shallow parsing.
Abstract: This article introduces the problem of partial or shallow parsing (assigning partial syntactic structure to sentences) and explains why it is an important natural language processing (NLP) task. The complexity of the task makes Machine Learning an attractive option in comparison to the handcrafting of rules. On the other hand, because of the same task complexity, shallow parsing makes an excellent benchmark problem for evaluating machine learning algorithms. We sketch the origins of shallow parsing as a specific task for machine learning of language, and introduce the articles accepted for this special issue, a representative sample of current research in this area. Finally, future directions for machine learning of shallow parsing are suggested.

Journal ArticleDOI
TL;DR: A measure of complexity for data dependent hypothesis classes is defined and data dependent versions of bounds on error deviance and estimation error are provided and a structural risk minimization procedure over data dependent hierarchies is provided.
Abstract: We extend the VC theory of statistical learning to data dependent spaces of classifiers. This theory can be viewed as a decomposition of classifier design into two components; the first component is a restriction to a data dependent hypothesis class and the second is empirical risk minimization within that class. We define a measure of complexity for data dependent hypothesis classes and provide data dependent versions of bounds on error deviance and estimation error. We also provide a structural risk minimization procedure over data dependent hierarchies and prove consistency. We use this theory to provide a framework for studying the trade-offs between performance and computational complexity in classifier design. As a consequence we obtain a new family of classifiers with dimension independent performance bounds and efficient learning procedures.

Journal Article
TL;DR: Three data-driven publicly available part-of-speech taggers are applied to shallow parsing of Swedish texts, and special attention is directed to the taggers' sensitivity to different types of linguistic information included in learning, as well as their sensitivity to the size and the various types of training data sets.
Abstract: Three data-driven publicly available part-of-speech taggers are applied to shallow parsing of Swedish texts. The phrase structure is represented by nine types of phrases in a hierarchical structure containing labels for every constituent type the token belongs to in the parse tree. The encoding is based on the concatenation of the phrase tags on the path from lowest to higher nodes. Various linguistic features are used in learning; the taggers are trained on the basis of lexical information only, part-of-speech only, and a combination of both, to predict the phrase structure of the tokens with or without part-of-speech. Special attention is directed to the taggers' sensitivity to different types of linguistic information included in learning, as well as the taggers' sensitivity to the size and the various types of training data sets. The method can be easily transferred to other languages.

Journal Article
Hervé Déjean1
TL;DR: A top-down inductive system, ALLiS, for learning linguistic structures is presented, with a specific mechanism, refinement, which enables learning rules and their exceptions; experiments demonstrate that linguistic knowledge improves learning.
Abstract: We present in this article a top-down inductive system, ALLiS, for learning linguistic structures. Two difficulties came up during the development of the system: the presence of a significant amount of noise in the data and the presence of linguistically motivated exceptions. It is then a challenge for an inductive system to learn rules from this kind of data. This leads us to add a specific mechanism, refinement, which enables learning rules and their exceptions. In the first part of this article we evaluate the usefulness of this device and show that it improves results when learning linguistic structures. In the second part, we explore how to improve the efficiency of the system by using prior knowledge. Since Natural Language is a strongly structured object, it may be important to investigate whether linguistic knowledge can help to make natural language learning more efficient and accurate. This article presents some experiments demonstrating that linguistic knowledge improves learning. The system has been applied to the shared task of the CoNLL'00 workshop.

Journal ArticleDOI
TL;DR: This work investigates two different notions of "size" which appear naturally in Statistical Learning Theory and presents quantitative estimates on the fat-shattering dimension and on the covering numbers of convex hulls of sets of functions, given the necessary data on the original sets.
Abstract: We investigate two different notions of "size" which appear naturally in Statistical Learning Theory. We present quantitative estimates on the fat-shattering dimension and on the covering numbers of convex hulls of sets of functions, given the necessary data on the original sets. The proofs we present are relatively simple since they do not require extensive background in convex geometry.

Journal Article
TL;DR: This work investigates the performance of four shallow parsers trained on various types of artificially noisy material, shows that they are surprisingly robust to synthetic noise, and addresses the question of whether naturally occurring disfluencies undermine performance more than a change in distribution does.
Abstract: Shallow parsers are usually assumed to be trained on noise-free material, drawn from the same distribution as the testing material. However, when either the training set is noisy or else drawn from a different distribution, performance may be degraded. Using the parsed Wall Street Journal, we investigate the performance of four shallow parsers (maximum entropy, memory-based learning, N-grams and ensemble learning) trained using various types of artificially noisy material. Our first set of results shows that shallow parsers are surprisingly robust to synthetic noise, with performance gradually decreasing as the rate of noise increases. Further results show that no single shallow parser performs best in all noise situations. Final results show that simple, parser-specific extensions can improve noise-tolerance. Our second set of results addresses the question of whether naturally occurring disfluencies undermine performance more than a change in distribution does. Results using the parsed Switchboard corpus suggest that, although naturally occurring disfluencies might harm performance, differences in distribution between the training set and the testing set are more significant.