Author

Tom F. Wilderjans

Bio: Tom F. Wilderjans is an academic researcher at Leiden University. The author has contributed to research on topics including cluster analysis and computer science. The author has an h-index of 16 and has co-authored 47 publications receiving 680 citations. Previous affiliations of Tom F. Wilderjans include VU University Amsterdam and Katholieke Universiteit Leuven.


Papers
Journal ArticleDOI
TL;DR: The wide applicability of the CHull method is demonstrated by showing how it can be used to solve various model selection problems in the context of PCA, reduced K-means, best-subset regression, and partial least squares regression.
Abstract: When analyzing data, researchers are often confronted with a model selection problem (e.g., determining the number of components/factors in principal components analysis [PCA]/factor analysis or identifying the most important predictors in a regression analysis). To tackle such a problem, researchers may apply some objective procedure, like parallel analysis in PCA/factor analysis or stepwise selection methods in regression analysis. A drawback of these procedures is that they can only be applied to the model selection problem at hand. An interesting alternative is the CHull model selection procedure, which was originally developed for multiway analysis (e.g., multimode partitioning). However, the key idea behind the CHull procedure—identifying a model that optimally balances model goodness of fit/misfit and model complexity—is quite generic. Therefore, the procedure may also be used when applying many other analysis techniques. The aim of this article is twofold. First, we demonstrate the wide applicability of the CHull method by showing how it can be used to solve various model selection problems in the context of PCA, reduced K-means, best-subset regression, and partial least squares regression. Moreover, a comparison of CHull with standard model selection methods for these problems is performed. Second, we present the CHULL software, which may be downloaded from http://ppw.kuleuven.be/okp/software/CHULL/, to assist the user in applying the CHull procedure.
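A rough sketch of the generic CHull idea (not the authors' CHULL software, whose hull construction and tie-breaking are more refined): given a complexity value and a goodness-of-fit value per candidate model, keep the models on the upper boundary of the (complexity, fit) plot and select the one with the largest scree-test ratio. The helper name and the toy numbers below are hypothetical.

```python
import numpy as np

def chull_select(complexity, fit):
    """Toy CHull-style selection: higher `fit` is better, lower `complexity` is simpler."""
    order = np.argsort(complexity)
    c, f = np.asarray(complexity, float)[order], np.asarray(fit, float)[order]

    # Keep only models whose fit improves on every simpler retained model ...
    hull = [0]
    for i in range(1, len(c)):
        if f[i] > f[hull[-1]]:
            hull.append(i)
    # ... then drop points lying below the line through their neighbours (upper hull).
    changed = True
    while changed and len(hull) > 2:
        changed = False
        for j in range(1, len(hull) - 1):
            i0, i1, i2 = hull[j - 1], hull[j], hull[j + 1]
            interp = f[i0] + (f[i2] - f[i0]) * (c[i1] - c[i0]) / (c[i2] - c[i0])
            if f[i1] < interp:
                del hull[j]
                changed = True
                break

    # Scree-test ratio: fit gain per unit complexity before vs. after each model.
    best, best_ratio = hull[0], -np.inf
    for j in range(1, len(hull) - 1):
        i0, i1, i2 = hull[j - 1], hull[j], hull[j + 1]
        num = (f[i1] - f[i0]) / (c[i1] - c[i0])
        den = (f[i2] - f[i1]) / (c[i2] - c[i1])
        ratio = num / den if den > 0 else np.inf
        if ratio > best_ratio:
            best, best_ratio = i1, ratio
    return order[best]

# Toy example: fit levels off after three components, so index 2 is returned.
print(chull_select(complexity=[1, 2, 3, 4, 5], fit=[0.40, 0.62, 0.75, 0.78, 0.80]))
```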

107 citations

Journal ArticleDOI
TL;DR: Sparse simultaneous component analysis is a useful method for data integration: first, simultaneous analyses of multiple blocks offer advantages over sequential and separate analyses and second, interpretation of the results is highly facilitated by their sparseness.
Abstract: High-throughput data are complex, and methods that reveal the structure underlying the data are most useful. Principal component analysis, frequently implemented as a singular value decomposition, is a popular technique in this respect. Nowadays, the challenge is often to reveal structure in several sources of information (e.g., transcriptomics, proteomics) that are available for the same biological entities under study. Simultaneous component methods are most promising in this respect. However, the interpretation of the principal and simultaneous components is often daunting because contributions of each of the biomolecules (transcripts, proteins) have to be taken into account. We propose a sparse simultaneous component method that makes many of the parameters redundant by shrinking them to zero. It includes principal component analysis, sparse principal component analysis, and ordinary simultaneous component analysis as special cases. Several penalties can be tuned that account in different ways for the block structure present in the integrated data. This yields known sparse approaches such as the lasso, the ridge penalty, the elastic net, the group lasso, the sparse group lasso, and the elitist lasso. In addition, the algorithmic results can be easily transposed to the context of regression. Metabolomics data obtained with two measurement platforms for the same set of Escherichia coli samples are used to illustrate the proposed methodology and the properties of different penalties with respect to sparseness across and within data blocks. Sparse simultaneous component analysis is a useful method for data integration: first, simultaneous analyses of multiple blocks offer advantages over sequential and separate analyses; second, interpretation of the results is greatly facilitated by their sparseness. The approach offered is flexible and allows the block structure to be taken into account in different ways. As such, structures can be found that are exclusively tied to one data platform (group lasso approach) as well as structures that involve all data platforms (elitist lasso approach). The additional file contains a MATLAB implementation of the sparse simultaneous component method.
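As a rough illustration of the idea (not the block-penalized method or the MATLAB code from the paper), the sketch below runs ordinary sparse PCA on the column-wise concatenation of two simulated data blocks and inspects how the nonzero loadings of each component are spread over the blocks. scikit-learn's SparsePCA only offers a plain lasso-type penalty, so the group and elitist lasso variants discussed in the paper are not covered; block names and sizes are hypothetical.

```python
import numpy as np
from sklearn.decomposition import SparsePCA
from sklearn.preprocessing import StandardScaler

# Two hypothetical data blocks measured on the same 50 samples
# (e.g., two metabolomics platforms); column counts differ per block.
rng = np.random.default_rng(0)
block1 = rng.normal(size=(50, 30))
block2 = rng.normal(size=(50, 80))

# Simultaneous component analysis starts from the column-wise concatenation
# of the (standardized) blocks, so all blocks share the same component scores.
X = np.hstack([StandardScaler().fit_transform(b) for b in (block1, block2)])

# Lasso-penalized loadings: plain sparse PCA on the concatenated matrix.
model = SparsePCA(n_components=2, alpha=1.0, random_state=0)
scores = model.fit_transform(X)     # shared component scores (50 x 2)
loadings = model.components_        # sparse loadings (2 x 110)

# Inspect sparseness per block to see which block a component is tied to.
n1 = block1.shape[1]
for k, w in enumerate(loadings):
    print(f"component {k}: nonzero loadings "
          f"block1={np.count_nonzero(w[:n1])}, block2={np.count_nonzero(w[n1:])}")
```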

54 citations

Journal ArticleDOI
TL;DR: The findings support the assumptions regarding the heterogeneity of obesity and the association between temperament subtypes and psychopathology.
Abstract: Objective This study aimed to investigate temperament subtypes in obese patients. Methods Ninety-three bariatric surgery candidates and 63 obese inpatients from a psychotherapy unit answered the Behavioral Inhibition System/Behavioral Activation System Scale (BIS/BAS), the Effortful Control subscale of the Adult Temperament Questionnaire-Short Form (ATQ-EC), and questionnaires for eating disorder, depressive and attention deficit hyperactivity disorder (ADHD) symptoms, and completed neurocognitive testing for executive functions. Binge eating disorder and impulse control disorders were diagnosed using interviews. Results A latent profile analysis using BIS/BAS and ATQ-EC scores revealed a ‘resilient/high functioning’ cluster (n = 88) showing high ATQ-EC and low BIS/BAS scores and an ‘emotionally dysregulated/undercontrolled’ cluster (n = 68) with low ATQ-EC and high BIS/BAS scores. Patients from the ‘emotionally dysregulated/undercontrolled’ cluster showed more eating disorder, depressive and ADHD symptoms, and poorer performance in the labyrinth task. Conclusion The findings support the assumptions regarding the heterogeneity of obesity and the association between temperament subtypes and psychopathology.
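For readers unfamiliar with latent profile analysis: it essentially amounts to fitting a Gaussian mixture to continuous indicator variables. The sketch below uses simulated scores standing in for the BIS/BAS and ATQ-EC data (all numbers hypothetical) and selects the number of profiles by BIC.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Hypothetical questionnaire scores for 156 patients:
# columns = BIS, BAS, ATQ-EC (the study used the actual subscale scores).
rng = np.random.default_rng(1)
scores = np.vstack([
    rng.normal([2.0, 2.5, 4.5], 0.5, size=(88, 3)),  # resilient/high functioning
    rng.normal([3.5, 3.8, 2.5], 0.5, size=(68, 3)),  # dysregulated/undercontrolled
])
X = StandardScaler().fit_transform(scores)

# Latent profile analysis ~ Gaussian mixture on continuous indicators;
# pick the number of profiles with the lowest BIC.
fits = {k: GaussianMixture(n_components=k, n_init=10, random_state=0).fit(X)
        for k in range(1, 5)}
best_k = min(fits, key=lambda k: fits[k].bic(X))
labels = fits[best_k].predict(X)
print(f"profiles selected: {best_k}, profile sizes: {np.bincount(labels)}")
```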

47 citations

Journal ArticleDOI
TL;DR: The CHull (Ceulemans & Kiers, 2006) method, which also balances model fit and complexity, is presented as an interesting alternative model selection strategy for MFA.
Abstract: Mixture analysis is commonly used for clustering objects on the basis of multivariate data. When the data contain a large number of variables, regular mixture analysis may become problematic, because a large number of parameters need to be estimated for each cluster. To tackle this problem, the mixtures-of-factor-analyzers (MFA) model was proposed, which combines clustering with exploratory factor analysis. MFA model selection is rather intricate, as both the number of clusters and the number of underlying factors have to be determined. To this end, the Akaike (AIC) and Bayesian (BIC) information criteria are often used. AIC and BIC try to identify a model that optimally balances model fit and model complexity. In this article, the CHull (Ceulemans & Kiers, 2006) method, which also balances model fit and complexity, is presented as an interesting alternative model selection strategy for MFA. In an extensive simulation study, the performances of AIC, BIC, and CHull were compared. AIC performs poorly and systematically selects overly complex models, whereas BIC performs slightly better than CHull when considering the best model only. However, when taking model selection uncertainty into account by looking at the first three models retained, CHull outperforms BIC. This especially holds in more complex, and thus more realistic, situations (e.g., more clusters, factors, noise in the data, and overlap among clusters).
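A minimal sketch of the AIC/BIC selection step, using a plain Gaussian mixture as a stand-in because scikit-learn has no mixtures-of-factor-analyzers implementation; the fit/complexity pairs printed at the end are the kind of input a CHull-style procedure would work on. Data and parameter choices are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Hypothetical data with 3 clusters in 6 variables; a plain Gaussian mixture
# stands in for the candidate MFA models being compared.
X, _ = make_blobs(n_samples=400, centers=3, n_features=6, random_state=0)
d = X.shape[1]

models = {k: GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
          for k in range(1, 7)}

aic_best = min(models, key=lambda k: models[k].aic(X))
bic_best = min(models, key=lambda k: models[k].bic(X))
print(f"AIC selects k={aic_best}, BIC selects k={bic_best}")

# A CHull-style alternative balances fit against complexity: total
# log-likelihood versus the number of free parameters (means + full
# covariances + mixing weights).
fit = {k: m.score(X) * len(X) for k, m in models.items()}
complexity = {k: k * d + k * d * (d + 1) // 2 + (k - 1) for k in models}
for k in models:
    print(f"k={k}: loglik={fit[k]:.1f}, free parameters={complexity[k]}")
```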

46 citations

Journal ArticleDOI
TL;DR: The main benefits of the DISCO-SCA GUI are that it is easy to use, strongly facilitates the choice of model selection parameters (such as the number of mechanisms and their status as being common or distinctive), and is freely available.
Abstract: Behavioral researchers often obtain information about the same set of entities from different sources. A main challenge in the analysis of such data is to reveal, on the one hand, the mechanisms underlying all of the data blocks under study and, on the other hand, the mechanisms underlying a single data block or a few such blocks only (i.e., common and distinctive mechanisms, respectively). A method called DISCO-SCA has been proposed by which such mechanisms can be found. The goal of this article is to make the DISCO-SCA method more accessible, in particular for applied researchers. To this end, first we will illustrate the different steps in a DISCO-SCA analysis, with data stemming from the domain of psychiatric diagnosis. Second, we will present in this article the DISCO-SCA graphical user interface (GUI). The main benefits of the DISCO-SCA GUI are that it is easy to use, strongly facilitates the choice of model selection parameters (such as the number of mechanisms and their status as being common or distinctive), and is freely available.
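A rough sketch of the common-versus-distinctive idea only (not the DISCO-SCA algorithm or its GUI, which rotate the simultaneous components toward a target structure): components are extracted from the concatenated blocks and heuristically labeled by how their loading variance is split over the blocks. The simulated data and the 25%/75% threshold are hypothetical.

```python
import numpy as np

# Two hypothetical data blocks on the same 40 entities (e.g., two symptom
# questionnaires); block 1 carries only the common structure, block 2 also
# carries a distinctive source.
rng = np.random.default_rng(2)
common = rng.normal(size=(40, 1))
distinct = rng.normal(size=(40, 1))
block1 = common @ rng.normal(size=(1, 12)) + 0.3 * rng.normal(size=(40, 12))
block2 = (common @ rng.normal(size=(1, 20))
          + distinct @ rng.normal(size=(1, 20))
          + 0.3 * rng.normal(size=(40, 20)))

# Simultaneous components via SVD of the column-centered concatenation.
X = np.hstack([block1 - block1.mean(0), block2 - block2.mean(0)])
U, s, Vt = np.linalg.svd(X, full_matrices=False)
n1 = block1.shape[1]

# Heuristic labeling: how is each component's loading variance split over blocks?
for k in range(2):
    v = Vt[k]
    share1 = np.sum(v[:n1] ** 2) / np.sum(v ** 2)
    status = "common" if 0.25 < share1 < 0.75 else "distinctive"
    print(f"component {k}: block-1 loading share = {share1:.2f} -> {status}")
```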

44 citations


Cited by
Journal Article
TL;DR: In this article, the authors explore the effect of dimensionality on the nearest neighbor problem and show that under a broad set of conditions (much broader than independent and identically distributed dimensions), as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point.
Abstract: We explore the effect of dimensionality on the nearest neighbor problem. We show that under a broad set of conditions (much broader than independent and identically distributed dimensions), as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point. To provide a practical perspective, we present empirical results on both real and synthetic data sets that demonstrate that this effect can occur for as few as 10-15 dimensions. These results should not be interpreted to mean that high-dimensional indexing is never meaningful; we illustrate this point by identifying some high-dimensional workloads for which this effect does not occur. However, our results do emphasize that the methodology used almost universally in the database literature to evaluate high-dimensional indexing techniques is flawed, and should be modified. In particular, most such techniques proposed in the literature are not evaluated versus simple linear scan, and are evaluated over workloads for which nearest neighbor is not meaningful. Often, even the reported experiments, when analyzed carefully, show that linear scan would outperform the techniques being proposed on the workloads studied in high (10-15) dimensionality.
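The concentration effect is easy to reproduce numerically; below is a minimal sketch under the simplest condition (i.i.d. uniform dimensions), showing the nearest-to-farthest distance ratio approaching 1 as the dimensionality grows.

```python
import numpy as np

# For i.i.d. uniform data, the nearest and farthest neighbors of a query point
# become nearly equidistant as the number of dimensions grows.
rng = np.random.default_rng(0)
for d in [2, 10, 15, 100, 1000]:
    data = rng.random((10_000, d))
    query = rng.random(d)
    dist = np.linalg.norm(data - query, axis=1)
    print(f"d={d:5d}  nearest/farthest distance ratio = {dist.min() / dist.max():.3f}")
```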

1,992 citations

Journal ArticleDOI
20 Aug 2015
TL;DR: In this paper, “diversity” is introduced as a key concept, and a number of data-driven solutions based on matrix and tensor decompositions are discussed, emphasizing how they account for diversity across the data sets.
Abstract: In various disciplines, information about the same phenomenon can be acquired from different types of detectors, at different conditions, in multiple experiments or subjects, among others. We use the term “modality” for each such acquisition framework. Due to the rich characteristics of natural phenomena, it is rare that a single modality provides complete knowledge of the phenomenon of interest. The increasing availability of several modalities reporting on the same system introduces new degrees of freedom, which raise questions beyond those related to exploiting each modality separately. As we argue, many of these questions, or “challenges,” are common to multiple domains. This paper deals with two key issues: “why we need data fusion” and “how we perform it.” The first issue is motivated by numerous examples in science and technology, followed by a mathematical framework that showcases some of the benefits that data fusion provides. In order to address the second issue, “diversity” is introduced as a key concept, and a number of data-driven solutions based on matrix and tensor decompositions are discussed, emphasizing how they account for diversity across the data sets. The aim of this paper is to provide the reader, regardless of his or her community of origin, with a taste of the vastness of the field, the prospects, and the opportunities that it holds.
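As a minimal illustration of the shared-factor idea behind such data fusion (not any specific method from the paper), the sketch below jointly factorizes two simulated modalities observed on the same samples with a common sample-factor matrix, using alternating least squares. All sizes and names are hypothetical.

```python
import numpy as np

# Toy coupled matrix factorization: modalities X1 (n x p1) and X2 (n x p2)
# observed on the same n samples are modeled as X1 ~ A @ B1.T and X2 ~ A @ B2.T
# with a shared sample-factor matrix A, fitted by alternating least squares.
rng = np.random.default_rng(0)
n, p1, p2, r = 100, 20, 30, 3
A_true = rng.normal(size=(n, r))
X1 = A_true @ rng.normal(size=(r, p1)) + 0.1 * rng.normal(size=(n, p1))
X2 = A_true @ rng.normal(size=(r, p2)) + 0.1 * rng.normal(size=(n, p2))

A = rng.normal(size=(n, r))
for _ in range(100):
    # Update modality-specific loadings for fixed shared factor A ...
    B1 = np.linalg.lstsq(A, X1, rcond=None)[0].T
    B2 = np.linalg.lstsq(A, X2, rcond=None)[0].T
    # ... then update A against both modalities at once.
    A = np.linalg.lstsq(np.vstack([B1, B2]), np.hstack([X1, X2]).T, rcond=None)[0].T

err = np.linalg.norm(X1 - A @ B1.T) ** 2 + np.linalg.norm(X2 - A @ B2.T) ** 2
total = np.linalg.norm(X1) ** 2 + np.linalg.norm(X2) ** 2
print(f"relative residual: {err / total:.4f}")
```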

673 citations

Journal ArticleDOI
TL;DR: It is shown how an effort account of pupil dilation can explain these findings, and future directions to further corroborate this account are discussed in the context of recent theories on cognitive control and effort and their potential neurobiological substrates.
Abstract: Pupillometry research has experienced an enormous revival in the last two decades. Here we briefly review the surge of recent studies on task-evoked pupil dilation in the context of cognitive control tasks, with the primary aim of evaluating the feasibility of using pupil dilation as an index of effort exertion, rather than task demand or difficulty. Our review shows that across the three cognitive control domains of updating, switching, and inhibition, increases in task demands typically lead to increases in pupil dilation. Studies show a diverging pattern with respect to the relationship between pupil dilation and performance, and we show how an effort account of pupil dilation can provide an explanation of these findings. We also discuss future directions to further corroborate this account in the context of recent theories on cognitive control and effort and their potential neurobiological substrates.

371 citations

Journal ArticleDOI
TL;DR: This survey presents some of the most widely used tensor decompositions, providing the key insights behind them, and summarizing them from a practitioner’s point of view.
Abstract: Tensors and tensor decompositions are very powerful and versatile tools that can model a wide variety of heterogeneous, multiaspect data. As a result, tensor decompositions, which extract useful latent information out of multiaspect data tensors, have witnessed increasing popularity and adoption by the data mining community. In this survey, we present some of the most widely used tensor decompositions, providing the key insights behind them, and summarizing them from a practitioner’s point of view. We then provide an overview of a very broad spectrum of applications where tensors have been instrumental in achieving state-of-the-art performance, ranging from social network analysis to brain data analysis, and from web mining to healthcare. Subsequently, we present recent algorithmic advances in scaling tensor decompositions up to today’s big data, outlining the existing systems and summarizing the key ideas behind them. Finally, we conclude with a list of challenges and open problems that outline exciting future research directions.
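A minimal CP (PARAFAC) example using the TensorLy library, assuming a recent version in which parafac and cp_to_tensor are available; the tensor sizes and rank are hypothetical.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

# Build a noisy rank-3 tensor from known factors and recover a rank-3 CP model.
rng = np.random.default_rng(0)
factors_true = [rng.normal(size=(dim, 3)) for dim in (30, 40, 20)]
tensor = tl.cp_to_tensor((np.ones(3), factors_true))
tensor += 0.01 * rng.normal(size=tensor.shape)

weights, factors = parafac(tl.tensor(tensor), rank=3, n_iter_max=200)
reconstruction = tl.cp_to_tensor((weights, factors))
rel_error = tl.norm(tensor - reconstruction) / tl.norm(tensor)
print(f"relative reconstruction error: {rel_error:.4f}")
```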

347 citations