
Showing papers on "Model selection published in 2002"


Journal ArticleDOI
TL;DR: The novelty of the approach is that it does not use a model selection criterion to choose one among a set of preestimated candidate models; instead, it seamlessly integrates estimation and model selection in a single algorithm.
Abstract: This paper proposes an unsupervised algorithm for learning a finite mixture model from multivariate data. The adjective "unsupervised" is justified by two properties of the algorithm: 1) it is capable of selecting the number of components and 2) unlike the standard expectation-maximization (EM) algorithm, it does not require careful initialization. The proposed method also avoids another drawback of EM for mixture fitting: the possibility of convergence toward a singular estimate at the boundary of the parameter space. The novelty of our approach is that we do not use a model selection criterion to choose one among a set of preestimated candidate models; instead, we seamlessly integrate estimation and model selection in a single algorithm. Our technique can be applied to any type of parametric mixture model for which it is possible to write an EM algorithm; in this paper, we illustrate it with experiments involving Gaussian mixtures. These experiments testify for the good performance of our approach.
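For contrast, here is a minimal sketch of the conventional workflow the paper argues against, in which a separate mixture is fitted for each candidate order and a criterion (BIC here) picks among them; the use of scikit-learn and the toy data are illustrative assumptions, not the authors' implementation.

```python
# Conventional baseline (what the paper avoids): fit one Gaussian mixture per
# candidate number of components and select the fit with the lowest BIC.
# The proposed algorithm instead prunes components during a single EM-like run.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)),    # toy two-component data
               rng.normal(4.0, 1.0, (150, 2))])

candidates = range(1, 7)
fits = {k: GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
        for k in candidates}
best_k = min(candidates, key=lambda k: fits[k].bic(X))
print("BIC-selected number of components:", best_k)
```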

2,182 citations


Journal ArticleDOI
TL;DR: A form of k-fold cross validation for evaluating prediction success is proposed for presence/available RSF models, which involves calculating the correlation between RSF ranks and area-adjusted frequencies for a withheld sub-sample of data.
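A rough sketch of the evaluation idea described above, assuming withheld "used" locations and "available" locations scored by a fitted RSF; the equal-area binning and the area adjustment shown here are simplifying assumptions rather than the paper's exact recipe.

```python
# Spearman correlation between RSF bin rank and area-adjusted use frequency
# for withheld data; binning and area adjustment are simplified assumptions.
import numpy as np
from scipy.stats import spearmanr

def cv_rank_correlation(scores_used, scores_available, n_bins=10):
    # Bin edges chosen so each bin covers roughly equal "available" area.
    edges = np.quantile(scores_available, np.linspace(0.0, 1.0, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    used = np.histogram(scores_used, bins=edges)[0]
    avail = np.histogram(scores_available, bins=edges)[0]
    adj_freq = used / np.maximum(avail, 1)        # use per unit available area
    return spearmanr(np.arange(1, n_bins + 1), adj_freq)
```

A model that ranks habitat well should yield a strong positive correlation across the withheld folds.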

2,107 citations


Journal ArticleDOI
TL;DR: A new approach to automatic business forecasting based on an extended range of exponential smoothing methods that allows the easy calculation of the likelihood, the AIC and other model selection criteria, and the computation of prediction intervals for each method.
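The selection step referred to above reduces to comparing penalized likelihoods across candidate smoothing methods; a generic illustration (placeholder numbers, not from the paper) follows.

```python
# Generic illustration of the model-selection step: given the maximized
# log-likelihood and parameter count of each fitted exponential smoothing
# variant, compute AIC and keep the minimizer. The fitting itself is assumed
# to be done elsewhere; the numbers below are placeholders.
def aic(log_likelihood, n_params):
    return -2.0 * log_likelihood + 2.0 * n_params

fitted = {
    "simple":       {"loglik": -512.3, "k": 2},   # placeholder values
    "Holt linear":  {"loglik": -498.7, "k": 4},
    "damped trend": {"loglik": -497.9, "k": 5},
}
best = min(fitted, key=lambda name: aic(fitted[name]["loglik"], fitted[name]["k"]))
print("AIC-selected method:", best)
```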

873 citations


Journal ArticleDOI
TL;DR: In this paper, the authors consider using a score-equivalent criterion in conjunction with a heuristic search algorithm to perform model selection or model averaging, show that searching among equivalence classes of network structures yields slightly higher-scoring structures with no additional time overhead, and argue that more sophisticated search algorithms are likely to benefit even more.
Abstract: Two Bayesian-network structures are said to be equivalent if the set of distributions that can be represented with one of those structures is identical to the set of distributions that can be represented with the other. Many scoring criteria that are used to learn Bayesian-network structures from data are score equivalent; that is, these criteria do not distinguish among networks that are equivalent. In this paper, we consider using a score equivalent criterion in conjunction with a heuristic search algorithm to perform model selection or model averaging. We argue that it is often appropriate to search among equivalence classes of network structures as opposed to the more common approach of searching among individual Bayesian-network structures. We describe a convenient graphical representation for an equivalence class of structures, and introduce a set of operators that can be applied to that representation by a search algorithm to move among equivalence classes. We show that our equivalence-class operators can be scored locally, and thus share the computational efficiency of traditional operators defined for individual structures. We show experimentally that a greedy model-selection algorithm using our representation yields slightly higher-scoring structures than the traditional approach without any additional time overhead, and we argue that more sophisticated search algorithms are likely to benefit much more.

711 citations


Book
01 Jan 2002
TL;DR: In this book, the authors present an introduction to the analysis of variance and the meaning of p-values and confidence intervals, as well as results about variances of sample means.
Abstract: Why use this book 1. An introduction to the analysis of variance 2. Regression 3. Models, parameters and GLMs 4. Using more than one explanatory variable 5. Designing experiments - keeping it simple 6. Combining continuous and categorical variables 7. Interactions - getting more complex 8. Checking the models A: Independence 9. Checking the models B: The other three assumptions 10. Model selection I: Principles of model choice and designed experiments 11. Model selection II: Data sets with several explanatory variables 12. Random effects 13. Categorical data 14. What lies beyond? Answers to exercises Revision section: The basics Appendix I: The meaning of p-values and confidence intervals Appendix II: Analytical results about variances of sample means Appendix III: Probability distributions Bibliography

597 citations


Journal ArticleDOI
TL;DR: The nonconcave penalized likelihood approach of Fan and Li (2001) is extended to the Cox proportional hazards model and the Cox proportional hazards frailty model, two commonly used semiparametric models in survival analysis, and new variable selection procedures are proposed for these two models.
Abstract: A class of variable selection procedures for parametric models via nonconcave penalized likelihood was proposed in Fan and Li (2001a). It has been shown there that the resulting procedures perform as well as if the subset of significant variables were known in advance. Such a property is called an oracle property. The proposed procedures were illustrated in the context of linear regression, robust linear regression and generalized linear models. In this paper, the nonconcave penalized likelihood approach is extended further to the Cox proportional hazards model and the Cox proportional hazards frailty model, two commonly used semi-parametric models in survival analysis. As a result, new variable selection procedures for these two commonly-used models are proposed. It is demonstrated how the rates of convergence depend on the regularization parameter in the penalty function. Further, with a proper choice of the regularization parameter and the penalty function, the proposed estimators possess an oracle property. Standard error formulae are derived and their accuracies are empirically tested. Simulation studies show that the proposed procedures are more stable in prediction and more effective in computation than the best subset variable selection, and they reduce model complexity as effectively as the best subset variable selection. Compared with the LASSO, which is the penalized likelihood method with the $L_1$ -penalty, proposed by Tibshirani, the newly proposed approaches have better theoretic properties and finite sample performance.
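For reference, the SCAD penalty of Fan and Li (2001) that underlies this approach is usually specified through its derivative, and the estimator maximizes a penalized log (partial) likelihood; the notation below is standard rather than copied from the paper.

```latex
% SCAD penalty via its derivative, for \theta > 0 (a > 2; a = 3.7 is suggested):
p'_\lambda(\theta) = \lambda \left\{ I(\theta \le \lambda)
    + \frac{(a\lambda - \theta)_+}{(a - 1)\lambda}\, I(\theta > \lambda) \right\}
% Penalized estimation: maximize the log (partial) likelihood minus the penalty
\hat\beta = \arg\max_{\beta}\; \ell(\beta) - n \sum_{j=1}^{d} p_\lambda(|\beta_j|)
```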

570 citations


Journal ArticleDOI
TL;DR: In this paper, the authors use Bayesian model averaging to analyze the sample evidence on return predictability in the presence of model uncertainty and show that the out-of-sample performance of the Bayesian approach is superior to that of model selection criteria.
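The averaging referred to here rests on the standard Bayesian model averaging identities, stated below for context rather than quoted from the paper: predictions average over candidate models, weighted by posterior model probabilities.

```latex
p(\Delta \mid D) = \sum_{k} p(\Delta \mid M_k, D)\, p(M_k \mid D),
\qquad
p(M_k \mid D) = \frac{p(D \mid M_k)\, p(M_k)}{\sum_{j} p(D \mid M_j)\, p(M_j)}
```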

468 citations


Journal ArticleDOI
TL;DR: The numerically integrated state-space (NISS) method is proposed for fitting models to time series of population abundances that incorporate both process noise and observation error in a likelihood framework.
Abstract: We evaluate a method for fitting models to time series of population abundances that incorporates both process noise and observation error in a likelihood framework. The method follows the probability logic of the Kalman filter, but whereas the Kalman filter applies to linear, Gaussian systems, we implement the full probability calculations numerically so that any nonlinear, non-Gaussian model can be used. We refer to the method as the "numerically integrated state-space (NISS) method" and compare it to two common methods used to analyze nonlinear time series in ecology: least squares with only process noise (LSPN) and least squares with only observation error (LSOE). We compare all three methods by fitting Beverton-Holt and Ricker models to many replicate model-generated time series of length 20 with several parameter choices. For the Ricker model we chose parameters for which the deterministic part of the model produces a stable equilibrium, a two-cycle, or a four-cycle. For each set of parameters we used three process-noise and observation-error scenarios: large standard deviation (0.2) for both, and large for one but small (0.05) for the other. The NISS method had lower estimator bias and variance than the other methods in nearly all cases. The only exceptions were for the Ricker model with stable-equilibrium parameters, in which case the LSPN and LSOE methods had lower bias when noise variances most closely met their assumptions. For the Beverton-Holt model, the NISS method was much less biased and more precise than the other methods. We also evaluated the utility of each method for model selection by fitting simulated data to both models and using information criteria for selection. The NISS and LSOE methods showed a strong bias toward selecting the Ricker over the Beverton-Holt, even when data were generated with the Beverton-Holt. It remains unclear whether the LSPN method is generally superior for model selection or has fortuitously better biases in this particular case. These results suggest that information criteria are best used with caution for nonlinear population models with short time series. Finally we evaluated the convergence of likelihood ratios to theoretical asymptotic distributions. Agreement with asymptotic distributions was very good for stable-point Ricker parameters, less accurate for two-cycle and four-cycle Ricker parameters, and least accurate for the Beverton-Holt model. The numerically integrated state-space method has a number of advantages over least squares methods and offers a useful tool for connecting models and data in ecology.
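To make the simulation design concrete, here is a sketch of generating one replicate Ricker series with both process noise and observation error; log-normal noise on both components is an assumption consistent with the standard deviations quoted above, not a transcription of the authors' code.

```python
# Illustrative simulation in the spirit of the study design: a Ricker model
# with multiplicative (log-normal) process noise and observation error.
# Exact noise structure and parameter values here are assumptions.
import numpy as np

def simulate_ricker(n_steps=20, r=1.0, K=100.0, sd_proc=0.2, sd_obs=0.2,
                    n0=50.0, seed=0):
    rng = np.random.default_rng(seed)
    n_true = np.empty(n_steps)
    n_true[0] = n0
    for t in range(1, n_steps):
        # Ricker dynamics with process noise added on the log scale
        n_true[t] = n_true[t - 1] * np.exp(r * (1.0 - n_true[t - 1] / K)
                                           + rng.normal(0.0, sd_proc))
    # Observation error applied independently at each time step
    n_obs = n_true * np.exp(rng.normal(0.0, sd_obs, size=n_steps))
    return n_true, n_obs
```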

438 citations


Journal ArticleDOI
TL;DR: A method of selecting among mathematical models of cognition known as minimum description length is introduced, which provides an intuitive and theoretically well-grounded understanding of why one model should be chosen.
Abstract: The question of how one should decide among competing explanations of data is at the heart of the scientific enterprise. Computational models of cognition are increasingly being advanced as explanations of behavior. The success of this line of inquiry depends on the development of robust methods to guide the evaluation and selection of these models. This article introduces a method of selecting among mathematical models of cognition known as minimum description length, which provides an intuitive and theoretically well-grounded understanding of why one model should be chosen. A central but elusive concept in model selection, complexity, can also be derived with the method. The adequacy of the method is demonstrated in 3 areas of cognitive modeling: psychophysics, information integration, and categorization. How should one choose among competing theoretical explanations of data? This question is at the heart of the scientific enterprise, regardless of whether verbal models are being tested in an experimental setting or computational models are being evaluated in simulations. A number of criteria have been proposed to assist in this endeavor, summarized nicely by Jacobs and Grainger (1994). They include (a) plausibility (are the assumptions of the model biologically and psychologically plausible?); (b) explanatory adequacy (is the theoretical explanation reasonable and consistent with what is known?); (c) interpretability (do the model and its parts, e.g., parameters, make sense? are they understandable?); (d) descriptive adequacy (does the model provide a good description of the observed data?); (e) generalizability (does the model predict well the characteristics of data that will be observed in the future?); and (f) complexity (does the model capture the phenomenon in the least complex, i.e., simplest, possible manner?). The relative importance of these criteria may vary with the types of models being compared.
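A common form of the MDL criterion used in this literature is the Fisher-information approximation to the normalized maximum likelihood code length; the expression below is standard and is given for orientation, not quoted from the article.

```latex
% k = number of parameters, n = sample size, I(\theta) = Fisher information of one
% observation; the last two terms quantify model complexity.
\mathrm{MDL} = -\ln f(y \mid \hat\theta)
    + \frac{k}{2} \ln \frac{n}{2\pi}
    + \ln \int_{\Theta} \sqrt{\det I(\theta)}\, d\theta
```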

437 citations


Journal ArticleDOI
TL;DR: In this article, the authors focus on measuring the generalizability of a model's data-fitting abilities, which should be the goal of model selection, and introduce selection methods that factor in these properties when measuring fit.

408 citations


Journal ArticleDOI
TL;DR: This work considers the problem of identifying the genetic loci contributing to variation in a quantitative trait, with data on an experimental cross, and discusses the use of model selection ideas to identify QTLs in experimental crosses.
Abstract: Summary. We consider the problem of identifying the genetic loci (called quantitative trait loci (QTLs)) contributing to variation in a quantitative trait, with data on an experimental cross. A large number of different statistical approaches to this problem have been described; most make use of multiple tests of hypotheses, and many consider models allowing only a single QTL. We feel that the problem is best viewed as one of model selection. We discuss the use of model selection ideas to identify QTLs in experimental crosses. We focus on a back-cross experiment, with strictly additive QTLs, and concentrate on identifying QTLs, considering the estimation of their effects and precise locations of secondary importance. We present the results of a simulation study to compare the performances of the more prominent methods.

Journal ArticleDOI
TL;DR: A family of (nested) dose-response models is introduced herein that can be used for describing the change in any continuous endpoint as a function of dose, and a member from this family of models may be selected using the likelihood ratio test as a criterion, to prevent overparameterization.


Journal ArticleDOI
TL;DR: The authors generalize Vuong's (1989) asymptotically normal tests for model selection in several important directions, for example by allowing incompletely parametrized models such as econometric models defined by moment conditions.
Abstract: This paper generalizes Vuong (1989) asymptotically normal tests for model selection in several important directions. First, it allows for incompletely parametrized models such as econometric models defined by moment conditions. Second, it allows for a broad class of estimation methods that includes most estimators currently used in practice. Third, it considers model selection criteria other than the models’ likelihoods such as the mean squared errors of prediction. Fourth, the proposed tests are applicable to possibly misspecified nonlinear dynamic models with weakly dependent heterogeneous data. Cases where the estimation methods optimize the model selection criteria are distinguished from cases where they do not. We also consider the estimation of the asymptotic variance of the difference between the competing models’ selection criteria, which is necessary to our tests. Finally, we discuss conditions under which our tests are valid. It is seen that the competing models must be essentially nonnested.
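For context, the classical Vuong statistic being generalized compares the models' pointwise log-likelihoods; under the null that the two (strictly non-nested) models are equally close to the truth in Kullback-Leibler divergence,

```latex
T_n = \frac{1}{\sqrt{n}\,\hat\omega_n}
      \sum_{i=1}^{n} \ln \frac{f(y_i \mid \hat\theta_n)}{g(y_i \mid \hat\gamma_n)}
      \;\xrightarrow{d}\; N(0, 1),
% where \hat\omega_n^2 is the sample variance of the pointwise log-likelihood ratios.
```

The paper extends this type of asymptotically normal comparison to selection criteria other than the likelihood and to moment-condition models.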

Journal ArticleDOI
Philip H. S. Torr1
TL;DR: This paper explores ways of automating the model selection process with specific emphasis on the least squares problem of fitting manifolds to data points, illustrated with respect to epipolar geometry.
Abstract: Computer vision often involves estimating models from visual input. Sometimes it is possible to fit several different models or hypotheses to a set of data, and a decision must be made as to which is most appropriate. This paper explores ways of automating the model selection process with specific emphasis on the least squares problem of fitting manifolds (in particular algebraic varieties, e.g. lines, algebraic curves, planes etc.) to data points, illustrated with respect to epipolar geometry. The approach is Bayesian and the contribution is threefold: first, a new Bayesian description of the problem is laid out that supersedes the author's previous maximum likelihood formulations; this formulation reveals some hidden elements of the problem. Second, an algorithm, 'MAPSAC', is provided to obtain the robust MAP estimate of an arbitrary manifold. Third, a Bayesian model selection paradigm is proposed; the Bayesian formulation of the manifold fitting problem uncovers an elegant solution, for which a new method, 'GRIC', for approximating the posterior probability of each putative model is derived. This approximation bears some similarity to the penalized likelihoods used by AIC, BIC and MDL; however, it is far more accurate in situations involving large numbers of latent variables whose number increases with the data. This is demonstrated both empirically and theoretically.

Journal ArticleDOI
TL;DR: A comparison between Wold's R criterion and AIC for the selection of the number of latent variables to include in a PLS model that will form the basis of a multivariate statistical process control representation is undertaken based on a simulation study.

01 Jan 2002
TL;DR: This verified that there were small, but significant, elevation and topographic aspect effects in the data, when calculated from a 10 km resolution DEM, providing a physical explanation for the short range correlation identified by the two dimensional analysis in the companion paper.
Abstract: Thin plate smoothing splines incorporating varying degrees of topographic dependence were used to interpolate 100 daily rainfall values, with the degree of data smoothing determined by minimizing the generalised cross validation. Analyses were performed on the square roots of the rainfall values. Model calibration was made difficult by short range correlation and the small size of the data set. Short range correlation was partially overcome by removing one point from each of the five closest pairs of data points. An additional five representative points were removed to make up a set of 10 withheld points to assess model error. Three dimensional spline functions of position and elevation, from digital elevation models of varying resolution, were used to assess the optimum scaling of elevation and an optimum DEM resolution of 10 km. A linear sub-model, depending on the two horizontal components of the unit normal to the scaled DEM, was used to form a five dimensional partial spline model which identified a south western aspect effect. This model also had slightly smaller estimated predictive error. The model was validated by reference to the prevailing upper atmosphere wind field and by comparing predictive accuracies on 367 withheld data points. Model selection was further validated by fitting the various spline models to the 367 data points and using the 100 data points to assess model error. This verified that there were small, but significant, elevation and topographic aspect effects in the data, when calculated from a 10 km resolution DEM, providing a physical explanation for the short range correlation identified by the two dimensional analysis in the companion paper.
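The smoothing level in this kind of analysis is typically chosen by minimizing generalised cross validation over the smoothing parameter; the standard expression (not specific to this study) is:

```latex
% A(\lambda) is the influence ("hat") matrix of the spline fit at smoothing level \lambda
\mathrm{GCV}(\lambda) = \frac{n\,\lVert (I - A(\lambda))\, y \rVert^{2}}
                             {\left[\operatorname{tr}(I - A(\lambda))\right]^{2}}
```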

Book
14 Jun 2002
TL;DR: This work focuses on the development of a Multi-Variable Model for Software Development Productivity and its applications in the context of Software Development Project Data Management.
Abstract: Preface. 1. Data Analysis Methodology. Graphs. Tables. Correlation Analysis. Stepwise Regression Analysis. Numerical Variable Checks. Categorical Variable Checks. Testing the Residuals. Detecting Influential Observations. 2. Case Study: Software Development Productivity. Creation of New Variables. Data Modifications. Identifying Subsets of Categorical Variables. Model Selection. Graphs. Tables. Correlation Analysis. Stepwise Regression Analysis. Numerical Variable Checks. Categorical Variable Checks. Testing the Residuals. Detecting Influential Observations. 3. Case Study: Time to Market. Model Selection. Graphs. Tables. Correlation Analysis. Stepwise Regression Analysis. Numerical Variable Checks. Categorical Variable Checks. Testing the Residuals. Detecting Influential Observations. 4. Case Study: Developing a Software Development Cost Model. Choice of Data. Model Selection. Graphs. Tables. Correlation Analysis. Stepwise Regression Analysis. Numerical Variable Checks. Categorical Variable Checks. Testing the Residuals. Detecting Influential Observations. Common Accuracy Statistics. Boxplots of Estimation Error. Wilcoxon Signed-Rank Test. Accuracy Segmentation. The 95% Confidence Interval. Identifying Subsets of Categorical Variables. Model Selection. Building the Multi-Variable Model. Checking the Models. Measuring Estimation Accuracy. Comparison of 1991 and 1993 Models. Management Implications. 5. Case Study: Software Maintenance Cost Drivers. It's the Results That Matter. Cost Drivers of Annual Corrective Maintenance (by Katrina D. Maxwell and Pekka Forselius). From Data to Knowledge. Variable and Model Selection. Preliminary Analyses. Building the Multi-Variable Model. Checking the Model. Extracting the Equation. Interpreting the Equation. Accuracy of Model Prediction. The Telon Analysis. Further Analyses. Final Comments. 6. What You Need to Know About Statistics. Describing Individual Variables. The Normal Distribution. Overview of Sampling Theory. Other Probability Distributions. Identifying Relationships in the Data. Comparing Two Estimation Models. Final Comments. Appendix A. Raw Software Development Project Data. Appendix B. Validated Software Development Project Data. Appendix C. Validated Software Maintenance Project Data. Index.

Journal ArticleDOI
TL;DR: In this article, the authors demonstrate that model selection is more easily performed using the deviance information criterion (DIC), which combines a Bayesian measure-of-fit with a measure of model complexity.
Abstract: Bayesian methods have been efficient in estimating parameters of stochastic volatility models for analyzing financial time series. Recent advances made it possible to fit stochastic volatility models of increasing complexity, including covariates, leverage effects, jump components and heavy-tailed distributions. However, a formal model comparison via Bayes factors remains difficult. The main objective of this paper is to demonstrate that model selection is more easily performed using the deviance information criterion (DIC). It combines a Bayesian measure-of-fit with a measure of model complexity. We illustrate the performance of DIC in discriminating between various different stochastic volatility models using simulated data and daily returns data on the S&P100 index.
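For reference, the criterion advocated here is computed from posterior draws of the deviance; in the standard notation of Spiegelhalter et al.,

```latex
D(\theta) = -2 \ln p(y \mid \theta), \qquad
p_D = \overline{D} - D(\bar\theta), \qquad
\mathrm{DIC} = \overline{D} + p_D = D(\bar\theta) + 2 p_D
% \overline{D} is the posterior mean deviance and \bar\theta the posterior mean of the
% parameters; smaller DIC indicates a better trade-off between fit and complexity.
```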

Journal ArticleDOI
TL;DR: In this article, a new class of functional models in which smoothing splines are used to model fixed effects as well as random effects is introduced, which inherit the flexibility of the linear mixed effects models in handling complex designs and correlation structures.
Abstract: In this article, a new class of functional models in which smoothing splines are used to model fixed effects as well as random effects is introduced. The linear mixed effects models are extended to nonparametric mixed effects models by introducing functional random effects, which are modeled as realizations of zero-mean stochastic processes. The fixed functional effects and the random functional effects are modeled in the same functional space, which guarantees that the population-average and subject-specific curves have the same smoothness property. These models inherit the flexibility of the linear mixed effects models in handling complex designs and correlation structures, can include continuous covariates as well as dummy factors in both the fixed and random design matrices, and include the nested curves models as special cases. Two estimation procedures are proposed. The first estimation procedure exploits the connection between linear mixed effects models and smoothing splines and can be fitted using existing software. The second procedure is a sequential estimation procedure using Kalman filtering. This algorithm avoids inversion of large dimensional matrices and therefore can be applied to large data sets. A generalized maximum likelihood (GML) ratio test is proposed for inference and model selection. An application to comparison of cortisol profiles is used as an illustration.

Journal ArticleDOI
TL;DR: This paper evaluates the properties of a joint and sequential estimation procedure for estimating the parameters of single and multiple threshold models via the introduction of a model selection based procedure that allows the estimation of both the unknown parameters and their number to be performed jointly.

Journal ArticleDOI
TL;DR: This work proposes an approach using cross-validation predictive densities to obtain expected utility estimates and Bayesian bootstrap to obtain samples from their distributions, and discusses the probabilistic assumptions made and properties of two practical cross-validation methods, importance sampling and k-fold cross-validation.
Abstract: In this work, we discuss practical methods for the assessment, comparison, and selection of complex hierarchical Bayesian models. A natural way to assess the goodness of the model is to estimate its future predictive capability by estimating expected utilities. Instead of just making a point estimate, it is important to obtain the distribution of the expected utility estimate because it describes the uncertainty in the estimate. The distributions of the expected utility estimates can also be used to compare models, for example, by computing the probability of one model having a better expected utility than some other model. We propose an approach using cross-validation predictive densities to obtain expected utility estimates and Bayesian bootstrap to obtain samples from their distributions. We also discuss the probabilistic assumptions made and properties of two practical cross-validation methods, importance sampling and k-fold cross-validation. As illustrative examples, we use multilayer perceptron neural networks and gaussian processes with Markov chain Monte Carlo sampling in one toy problem and two challenging real-world problems.
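One standard form of the importance-sampling leave-one-out estimator discussed here uses draws $\theta^{(s)}$ from the full-data posterior to approximate each leave-one-out predictive density:

```latex
p(y_i \mid x_i, D_{-i}) \;\approx\;
\left( \frac{1}{S} \sum_{s=1}^{S} \frac{1}{p(y_i \mid x_i, \theta^{(s)})} \right)^{-1}
% i.e., importance weights proportional to 1 / p(y_i \mid x_i, \theta^{(s)});
% k-fold cross-validation refits the model on each training fold instead.
```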

Journal ArticleDOI
TL;DR: This paper proposes a Bayesian approach for finding and fitting parametric treed models, in particular focusing on Bayesian treed regression, and illustrates the potential of this approach by a cross-validation comparison of predictive performance with neural nets, MARS, and conventional trees on simulated and real data sets.
Abstract: When simple parametric models such as linear regression fail to adequately approximate a relationship across an entire set of data, an alternative may be to consider a partition of the data, and then use a separate simple model within each subset of the partition. Such an alternative is provided by a treed model which uses a binary tree to identify such a partition. However, treed models go further than conventional trees (e.g. CART, C4.5) by fitting models rather than a simple mean or proportion within each subset. In this paper, we propose a Bayesian approach for finding and fitting parametric treed models, in particular focusing on Bayesian treed regression. The potential of this approach is illustrated by a cross-validation comparison of predictive performance with neural nets, MARS, and conventional trees on simulated and real data sets.

Journal ArticleDOI
TL;DR: An adaptive model selection procedure is proposed that uses a data-adaptive complexity penalty based on a concept of generalized degrees of freedom; by combining the benefits of a class of nonadaptive procedures, it approximates the best performance of this class of procedures across a variety of different situations.
Abstract: Most model selection procedures use a fixed penalty penalizing an increase in the size of a model. These nonadaptive selection procedures perform well only in one type of situation. For instance, Bayesian information criterion (BIC) with a large penalty performs well for “small” models and poorly for “large” models, and Akaike's information criterion (AIC) does just the opposite. This article proposes an adaptive model selection procedure that uses a data-adaptive complexity penalty based on a concept of generalized degrees of freedom. The proposed procedure, combining the benefit of a class of nonadaptive procedures, approximates the best performance of this class of procedures across a variety of different situations. This class includes many well-known procedures, such as AIC, BIC, Mallows's Cp, and risk inflation criterion (RIC). The proposed procedure is applied to wavelet thresholding in nonparametric regression and variable selection in least squares regression. Simulation results and an asymptotic...
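Generalized degrees of freedom, the quantity on which the adaptive penalty is based, is commonly defined (following Ye, 1998) as the total sensitivity of the fitted values to the observations; the form below is one common statement, not necessarily the exact definition used in this article.

```latex
% For y ~ N(\mu, \sigma^2 I) and a fitting procedure with fitted values \hat\mu(y):
\mathrm{GDF} = \sum_{i=1}^{n}
    \frac{\partial\, \mathbb{E}_{\mu}\!\left[\hat\mu_i(y)\right]}{\partial \mu_i}
  = \frac{1}{\sigma^{2}} \sum_{i=1}^{n} \operatorname{Cov}\!\left(\hat\mu_i(y),\, y_i\right)
% In practice this is typically estimated by Monte Carlo perturbation of the data.
```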

Journal ArticleDOI
TL;DR: In this article, a systematic approach is described for determining the minimum level of model complexity required to predict runoff in New Zealand catchments, with minimal calibration, at decreasing timescales.
Abstract: [1] A systematic approach is described for determining the minimum level of model complexity required to predict runoff in New Zealand catchments, with minimal calibration, at decreasing timescales. Starting with a lumped conceptual model representing the most basic hydrological processes needed to capture water balance, model complexity is systematically increased in response to demonstrated deficiencies in model predictions until acceptable accuracy is achieved. Sensitivity and error analyses are performed to determine the dominant physical controls on streamflow variability. It is found that dry catchments are sensitive to a threshold storage parameter, producing inaccurate results with little confidence, while wet catchments are relatively insensitive, producing more accurate results with more confidence. Sensitivity to the threshold parameter is well correlated with climate and timescale, and in combination with the results of two previous studies, this allowed the postulation of a qualitative relationship between model complexity, timescale, and the climatic dryness index (DI). This relationship can provide an a priori understanding of the model complexity required to accurately predict streamflow with confidence in small catchments under given climate and timescales and a conceptual framework for model selection. The objective of the paper is therefore not to present a perfect model for any of the catchments studied but rather to present a systematic approach to modeling based on making inferences from data that can be applied with respect to different model designs, catchments and timescales.

Journal ArticleDOI
TL;DR: In this article, a fast algorithm for updating regressions in the Markov chain Monte Carlo searches for posterior inference is developed, allowing many more variables than observations to be considered, which can greatly aid the interpretation of the model.
Abstract: When a number of distinct models contend for use in prediction, the choice of a single model can offer rather unstable predictions. In regression, stochastic search variable selection with Bayesian model averaging offers a cure for this robustness issue but at the expense of requiring very many predictors. Here we look at Bayes model averaging incorporating variable selection for prediction. This offers similar mean-square errors of prediction but with a vastly reduced predictor space. This can greatly aid the interpretation of the model. It also reduces the cost if measured variables have costs. The development here uses decision theory in the context of the multivariate general linear model. In passing, this reduced predictor space Bayes model averaging is contrasted with single-model approximations. A fast algorithm for updating regressions in the Markov chain Monte Carlo searches for posterior inference is developed, allowing many more variables than observations to be contemplated. We discuss the merits of absolute rather than proportionate shrinkage in regression, especially when there are more variables than observations. The methodology is illustrated on a set of spectroscopic data used for measuring the amounts of different sugars in an aqueous solution.

Proceedings Article
01 Jan 2002
TL;DR: A fully automated pattern search methodology for model selection of support vector machines (SVMs) for regression and classification is developed; it has proven to be very effective on benchmark tests and in high-variance drug design domains with high potential for overfitting.
Abstract: We develop a fully-automated pattern search methodology for model selection of support vector machines (SVMs) for regression and classification. Pattern search (PS) is a derivative-free optimization method suitable for low-dimensional optimization problems for which it is difficult or impossible to calculate derivatives. This methodology was motivated by an application in drug design in which regression models are constructed based on a few high-dimensional exemplars. Automatic model selection in such underdetermined problems is essential to avoid overfitting and overestimates of generalization capability caused by selecting parameters based on testing results. We focus on SVM model selection for regression based on leave-one-out (LOO) and cross-validated estimates of mean squared error, but the search strategy is applicable to any model criterion. Because the resulting error surface produces an extremely noisy map of the model quality with many local minima, the generalization capacity of any single locally optimal model exhibits high variance. Thus several locally optimal SVM models are generated and then bagged or averaged to produce the final SVM. This strategy of pattern search combined with model averaging has proven to be very effective on benchmark tests and in high-variance drug design domains with high potential for overfitting.
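A minimal sketch of the core idea follows, assuming a compass-style pattern search over log-scaled SVR hyperparameters scored by cross-validated mean squared error; scikit-learn is used purely for illustration, and the bagging of several local optima described above is omitted.

```python
# Derivative-free compass/pattern search over (log C, log gamma) for SVR,
# scored by cross-validated MSE. Generic illustration, not the authors' code.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

def cv_mse(log_params, X, y):
    C, gamma = np.exp(log_params)
    scores = cross_val_score(SVR(C=C, gamma=gamma), X, y, cv=5,
                             scoring="neg_mean_squared_error")
    return -scores.mean()

def pattern_search(X, y, start=(0.0, -2.3), step=1.0, min_step=0.05, max_iter=50):
    x = np.asarray(start, dtype=float)            # starting point in log space
    best = cv_mse(x, X, y)
    directions = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]], dtype=float)
    for _ in range(max_iter):
        improved = False
        for d in directions:                      # poll the compass directions
            cand = x + step * d
            val = cv_mse(cand, X, y)
            if val < best:
                x, best, improved = cand, val, True
                break
        if not improved:                          # shrink the mesh on failure
            step *= 0.5
            if step < min_step:
                break
    return np.exp(x), best                        # (C, gamma) and its CV-MSE
```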

Proceedings ArticleDOI
11 Aug 2002
TL;DR: Performance evaluations exhibit clear superiority of the proposed method with its improved document clustering and model selection accuracies.
Abstract: In this paper, we propose a document clustering method that strives to achieve: (1) a high accuracy of document clustering, and (2) the capability of estimating the number of clusters in the document corpus (i.e. the model selection capability). To accurately cluster the given document corpus, we employ a richer feature set to represent each document, and use the Gaussian Mixture Model (GMM) together with the Expectation-Maximization (EM) algorithm to conduct an initial document clustering. From this initial result, we identify a set of discriminative features for each cluster, and refine the initially obtained document clusters by voting on the cluster label of each document using this discriminative feature set. This self-refinement process of discriminative feature identification and cluster label voting is iteratively applied until the convergence of document clusters. On the other hand, the model selection capability is achieved by introducing randomness in the cluster initialization stage, and then discovering a value C for the number of clusters N by which running the document clustering process for a fixed number of times yields sufficiently similar results. Performance evaluations exhibit clear superiority of the proposed method with its improved document clustering and model selection accuracies. The evaluations also demonstrate how each feature as well as the cluster refinement process contribute to the document clustering accuracy.
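A hedged sketch of the stability-based selection of the cluster count: for each candidate count, GMM/EM clustering is repeated from random initializations and the agreement between runs is measured. The use of scikit-learn, the adjusted Rand index as the similarity measure, and the omission of the discriminative-feature refinement step are all simplifying assumptions.

```python
# Choose the number of clusters whose repeated GMM/EM runs agree most strongly.
import itertools
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

def stability_score(X, n_clusters, n_runs=5, seed=0):
    labels = [GaussianMixture(n_components=n_clusters, init_params="random",
                              random_state=seed + r).fit_predict(X)
              for r in range(n_runs)]
    pairs = itertools.combinations(labels, 2)      # compare every pair of runs
    return np.mean([adjusted_rand_score(a, b) for a, b in pairs])

def select_n_clusters(X, candidates=range(2, 11)):
    return max(candidates, key=lambda c: stability_score(X, c))
```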

Posted Content
TL;DR: The goodness-of-fit of an individual model with respect to individual data is precisely quantified, and it is shown that, within the obvious constraints, every graph is realized by the structure function of some data.
Abstract: In 1974 Kolmogorov proposed a non-probabilistic approach to statistics and model selection. Let data be finite binary strings and models be finite sets of binary strings. Consider model classes consisting of models of given maximal (Kolmogorov) complexity. The "structure function" of the given data expresses the relation between the complexity level constraint on a model class and the least log-cardinality of a model in the class containing the data. We show that the structure function determines all stochastic properties of the data: for every constrained model class it determines the individual best-fitting model in the class irrespective of whether the "true" model is in the model class considered or not. In this setting, this happens with certainty, rather than with high probability as in the classical case. We precisely quantify the goodness-of-fit of an individual model with respect to individual data. We show that, within the obvious constraints, every graph is realized by the structure function of some data. We determine the (un)computability properties of the various functions contemplated and of the "algorithmic minimal sufficient statistic."
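In symbols, the structure function of a string x maps a complexity budget to the log-size of the best-fitting finite set within that budget:

```latex
h_x(\alpha) = \min_{S} \left\{ \log |S| \;:\; x \in S,\; K(S) \le \alpha \right\}
% K(S) is the Kolmogorov complexity of the finite set S; a minimizing S is the
% best-fitting model for x at complexity level \alpha.
```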

Book ChapterDOI
TL;DR: The analysis of the leukemia data from the Whitehead/MIT group is a discriminant analysis, and it is observed that the performance of most of the weighted (model-averaged) predictors on the testing set is gradually reduced as more genes are included, but a clear cutoff that separates good and bad prediction performance is not found.
Abstract: The analysis of the leukemia data from the Whitehead/MIT group is a discriminant analysis (also called supervised learning). Among thousands of genes whose expression levels are measured, not all are needed for discriminant analysis: a gene may either not contribute to the separation of two types of tissues/cancers, or it may be redundant because it is highly correlated with other genes. There are two theoretical frameworks in which variable selection (or gene selection in our case) can be addressed. The first is model selection, and the second is model averaging. We have carried out model selection using Akaike information criterion and Bayesian information criterion with logistic regression (discrimination, prediction, or classification) to determine the number of genes that provide the best model. These model selection criteria set upper limits of 22∼25 and 12∼13 genes for this data set with 38 samples, and the best model consists of only one (no. 4847, zyxin) or two genes. We have also carried out model averaging over the best single-gene logistic predictors using three different weights: maximized likelihood, prediction rate on training set, and equal weight. We have observed that the performance of most of these weighted predictors on the testing set is gradually reduced as more genes are included, but a clear cutoff that separates good and bad prediction performance is not found.
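A hedged sketch of the single-gene screening step with AIC follows (BIC would be analogous); statsmodels is assumed for the maximum-likelihood logistic fits, the array names are placeholders, and the model-averaging step discussed above is omitted. Note that a strongly separating gene such as zyxin may trigger perfect-separation warnings in practice.

```python
# Fit a one-gene logistic regression per gene and rank genes by AIC.
import numpy as np
import statsmodels.api as sm

def rank_genes_by_aic(expression, labels):
    """expression: (n_samples, n_genes) array; labels: binary (n_samples,) array."""
    aics = []
    for j in range(expression.shape[1]):
        X = sm.add_constant(expression[:, j])     # intercept + one gene
        fit = sm.Logit(labels, X).fit(disp=0)     # maximum likelihood fit
        aics.append(fit.aic)                      # AIC = -2*loglik + 2*k
    return np.argsort(aics)                       # best (smallest AIC) first
```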