
Papers by Kai Puolamäki published in 2016


Journal Article
TL;DR: The significance estimates of various statistical tests are compared in a controlled resampling experiment and in a practical setting, studying differences between texts produced by male and female fiction writers in the British National Corpus; the authors conclude that significance testing can be used to find consequential differences between corpora.
Abstract: Finding out whether a word occurs significantly more often in one text or corpus than in another is an important question in analysing corpora. As noted by Kilgarriff (Language is never, ever, ever, random. Corpus Linguistics and Linguistic Theory, 2005; 1(2): 263–76), the use of the χ² and log-likelihood ratio tests is problematic in this context, as they are based on the assumption that all samples are statistically independent of each other. However, words within a text are not independent. As pointed out in Kilgarriff (Comparing corpora. International Journal of Corpus Linguistics, 2001; 6(1): 1–37) and Paquot and Bestgen (Distinctive words in academic writing: a comparison of three statistical tests for keyword extraction. In Jucker, A., Schreier, D., and Hundt, M. (eds), Corpora: Pragmatics and Discourse. Amsterdam: Rodopi, 2009, pp. 247–69), it is possible to represent the data differently and employ other tests, such that we assume independence at the level of texts rather than individual words. This allows us to account for the distribution of words within a corpus. In this article, we compare the significance estimates of various statistical tests in a controlled resampling experiment and in a practical setting, studying differences between texts produced by male and female fiction writers in the British National Corpus. We find that the choice of the test, and hence the data representation, matters. We conclude that significance testing can be used to find consequential differences between corpora, but that assuming independence between all words may lead to overestimating the significance of the observed differences, especially for poorly dispersed words. We recommend the use of the t-test, Wilcoxon rank-sum test, or bootstrap test for comparing word frequencies across corpora.
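To make the recommendation concrete, here is a minimal sketch of text-level significance testing in Python, using SciPy's t-test and Wilcoxon rank-sum test plus a simple bootstrap; the per-text frequencies and corpus labels are invented for illustration, and each text contributes one observation so that independence is assumed between texts rather than between word occurrences:

```python
# Minimal sketch: compare per-text relative frequencies of one word across
# two corpora. The numbers below are illustrative placeholders.
import numpy as np
from scipy import stats

freq_a = np.array([0.0012, 0.0009, 0.0015, 0.0011, 0.0008])  # e.g. corpus A texts
freq_b = np.array([0.0005, 0.0007, 0.0004, 0.0009, 0.0006])  # e.g. corpus B texts

# Welch's t-test and Wilcoxon rank-sum test on the per-text frequencies.
t_stat, t_p = stats.ttest_ind(freq_a, freq_b, equal_var=False)
w_stat, w_p = stats.ranksums(freq_a, freq_b)

# Simple two-sided bootstrap test of the difference in mean frequency.
rng = np.random.default_rng(0)
observed = freq_a.mean() - freq_b.mean()
pooled = np.concatenate([freq_a, freq_b])
boot = np.array([
    rng.choice(pooled, len(freq_a), replace=True).mean()
    - rng.choice(pooled, len(freq_b), replace=True).mean()
    for _ in range(10_000)
])
boot_p = (np.abs(boot) >= abs(observed)).mean()
print(t_p, w_p, boot_p)
```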

86 citations


Journal Article
14 Jul 2016 - PLOS ONE
TL;DR: This study investigates synchrony in physiological signals between collaborating computer science students performing pair-programming exercises in a classroom environment and finds clear physiological compliance in the collaborating dyads' heart-rate variability signals.
Abstract: It is known that periods of intense social interaction result in shared patterns in collaborators' physiological signals. However, applied quantitative research on collaboration is hindered by the scarcity of objective metrics of teamwork effectiveness. Indeed, especially in the domain of productive, ecologically valid activity such as programming, there is a lack of evidence for the most effective, affordable, and reliable measures of collaboration quality. In this study we investigate synchrony in physiological signals between collaborating computer science students performing pair-programming exercises in a classroom environment. We recorded electrocardiography over the course of a 60-minute programming session, using lightweight physiological sensors. We employ correlation of heart-rate variability features to study the social psychophysiological compliance of the collaborating students. We found clear physiological compliance in the collaborating dyads' heart-rate variability signals. Furthermore, the dyads' self-reported workload was associated with the physiological compliance. Our results show the viability of a novel approach to field measurement using lightweight devices in an uncontrolled environment, and suggest that self-reported collaboration quality can be assessed via physiological signals.
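As a rough illustration of the compliance measure, the sketch below correlates windowed heart-rate-variability features of two collaborators. The RMSSD feature and window length are common choices but are assumptions here (the paper's exact feature set may differ), and the RR-interval series are synthetic placeholders:

```python
# Sketch: physiological compliance as correlation of windowed HRV features.
import numpy as np

def rmssd(rr_ms):
    """Root mean square of successive differences of RR intervals (ms)."""
    diffs = np.diff(rr_ms)
    return np.sqrt(np.mean(diffs ** 2))

def windowed_rmssd(rr_ms, window=30):
    """RMSSD over consecutive windows of `window` RR intervals."""
    return np.array([rmssd(rr_ms[i:i + window])
                     for i in range(0, len(rr_ms) - window, window)])

# rr_a, rr_b: RR-interval series (ms) of the two members of a dyad.
rng = np.random.default_rng(1)
rr_a = 800 + rng.normal(0, 50, 2000).cumsum() * 0.01  # synthetic placeholder
rr_b = 820 + rng.normal(0, 50, 2000).cumsum() * 0.01  # synthetic placeholder

feat_a, feat_b = windowed_rmssd(rr_a), windowed_rmssd(rr_b)
n = min(len(feat_a), len(feat_b))
compliance = np.corrcoef(feat_a[:n], feat_b[:n])[0, 1]
print(f"dyad HRV compliance (Pearson r): {compliance:.2f}")
```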

34 citations


Book Chapter
19 Sep 2016
TL;DR: A novel generic method for interactive visual exploration of high-dimensional data that employs data randomization with constraints to allow users to flexibly and intuitively express their interests or beliefs using visual interactions that correspond to exactly defined constraints.
Abstract: Data visualization and iterative/interactive data mining are attracting rapidly growing attention, both in research and in industry. However, integrated methods and tools that combine advanced visualization and data mining techniques are rare, and those that exist are often specialized to a single problem or domain. In this paper, we introduce a novel generic method for interactive visual exploration of high-dimensional data. In contrast to most visualization tools, it is not based on the traditional dogma of manually zooming and rotating the data. Instead, the tool initially presents the user with an ‘interesting’ projection of the data and then employs data randomization with constraints to allow users to flexibly and intuitively express their interests or beliefs using visual interactions that correspond to exactly defined constraints. The constraints expressed by the user are then taken into account by a projection-finding algorithm to compute a new ‘interesting’ projection, a process that can be iterated until the user runs out of time or finds that the constraints explain everything she needs from the data. We present the tool by means of two case studies, one controlled study on synthetic data and another on real census data. The data and software related to this paper are available at http://www.interesting-patterns.net/forsied/interactive-visual-data-exploration-with-subjective-feedback/.
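A conceptual sketch of the constrained-randomization step, under the simplifying assumption that a constraint pins down a set of rows the user has already 'explained'; the function and variable names are illustrative, not the tool's API:

```python
# Sketch: randomization with user constraints. Rows the user has marked as
# understood keep their values; unconstrained rows are permuted column-wise,
# so any remaining data-vs-randomized difference is unexplained structure.
import numpy as np

def constrained_randomize(X, constrained_rows, rng):
    """Permute each column of X over rows *not* covered by a constraint."""
    Xr = X.copy()
    free = np.setdiff1d(np.arange(len(X)), constrained_rows)
    for j in range(X.shape[1]):
        Xr[free, j] = rng.permutation(X[free, j])
    return Xr

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
marked = np.arange(20)                      # rows the user marked as understood
X_random = constrained_randomize(X, marked, rng)
# A projection separating X from X_random now highlights structure the user
# has not yet accounted for.
```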

13 citations


Book Chapter
19 Sep 2016
TL;DR: SIDE, a tool for Subjective and Interactive Visual Data Exploration, lets users explore high-dimensional data via subjectively informative 2D data visualizations; by representing the user's belief state as a set of projection tiles, it offers an efficient way to interactively explore yet-unknown features of complex high-dimensional datasets.
Abstract: We present SIDE, a tool for Subjective and Interactive Visual Data Exploration, which lets users explore high-dimensional data via subjectively informative 2D data visualizations. Many existing visual analytics tools are either restricted to specific problems and domains or aim to find visualizations that align with the user's beliefs about the data. In contrast, our generic tool computes data visualizations that are surprising given the user's current understanding of the data. The user's belief state is represented as a set of projection tiles. This user-awareness offers users an efficient way to interactively explore yet-unknown features of complex high-dimensional datasets.
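As a hedged illustration of what "surprising given the user's current understanding" can mean, the sketch below finds the two directions along which the data's variance most exceeds that of a randomized background, via whitening followed by PCA; this is a simplification, not necessarily SIDE's exact projection search:

```python
# Sketch: a 'surprising' 2D projection as the directions maximizing the
# variance of the real data relative to a randomized background.
import numpy as np

def surprising_projection(X, X_background):
    """Whiten by the background covariance, then take the top-2 PCA
    directions of the whitened data; columns are directions in data space."""
    cov_bg = np.cov(X_background, rowvar=False)
    evals, evecs = np.linalg.eigh(cov_bg)
    whiten = evecs @ np.diag(1.0 / np.sqrt(np.maximum(evals, 1e-12))) @ evecs.T
    Xw = (X - X.mean(axis=0)) @ whiten
    _, _, Vt = np.linalg.svd(Xw, full_matrices=False)
    return whiten @ Vt[:2].T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:100, 0] += 3.0                                  # planted cluster structure
X_bg = rng.normal(size=(200, 5))                   # fully randomized background
coords = X @ surprising_projection(X, X_bg)        # 2D view to show the user
```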

8 citations


Posted Content
TL;DR: A novel method is presented, based on statistical significance testing, that can be used to test whether a dataset has been generated by a given factorised class-conditional joint distribution, where the distribution is parametrised by a partition of its attributes.
Abstract: In many data exploration tasks it is meaningful to identify groups of attribute interactions that are specific to a variable of interest. For instance, in a dataset where the attributes are medical markers and the variable of interest (class variable) is binary, indicating presence/absence of disease, we would like to know which medical markers interact with respect to the binary class label. These interactions are useful in several practical applications, for example, to gain insight into the structure of the data, in feature selection, and in data anonymisation. We present a novel method, based on statistical significance testing, that can be used to test whether the dataset has been generated by a given factorised class-conditional joint distribution, where the distribution is parametrised by a partition of its attributes. Furthermore, we provide a method, named ASTRID, for automatically finding a partition of attributes describing the distribution that has generated the data. State-of-the-art classifiers are utilised to capture the interactions present in the data by systematically breaking attribute interactions and observing the effect of this breaking on classifier performance. We empirically demonstrate the utility of the proposed method with examples using real and synthetic data.
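The interaction-breaking idea can be sketched as follows: permute each group of attributes independently within each class, which preserves the groups' class-conditional marginals but destroys between-group interactions, then compare classifier accuracy before and after. The grouping, classifier, and toy data below are illustrative assumptions, not ASTRID's implementation:

```python
# Sketch: break between-group attribute interactions within each class and
# watch the classifier's cross-validated accuracy drop.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def break_interactions(X, y, groups, rng):
    """Permute rows of each attribute group independently within each class,
    keeping the class-conditional marginals of each group intact."""
    Xb = X.copy()
    for cls in np.unique(y):
        idx = np.where(y == cls)[0]
        for g in groups:
            Xb[np.ix_(rng.permutation(idx), g)] = X[np.ix_(idx, g)]
    return Xb

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 6))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # class depends on a 0-1 interaction
clf = RandomForestClassifier(random_state=0)

full_acc = cross_val_score(clf, X, y, cv=5).mean()
# Splitting attributes 0 and 1 apart destroys the interaction the class
# depends on, so accuracy should fall toward chance level.
Xb = break_interactions(X, y, groups=[[0], [1], [2, 3, 4, 5]], rng=rng)
broken_acc = cross_val_score(clf, Xb, y, cv=5).mean()
print(full_acc, broken_acc)
```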

4 citations


Book Chapter
19 Sep 2016
TL;DR: The task of finding combinations of temporal segments and subsets of sequences where an event of interest, like a particular hashtag, has an increased occurrence probability is formulated as a novel matrix tiling problem, and two algorithms for solving it are proposed.
Abstract: Event sequences are ubiquitous, e.g., in finance, medicine, and social media. Often the same underlying phenomenon, such as television advertisements during the Super Bowl, is reflected in independent event sequences, like those of different Twitter users. It is hence of interest to find combinations of temporal segments and subsets of sequences where an event of interest, like a particular hashtag, has an increased occurrence probability. Such patterns allow exploration of the event sequences in terms of their evolving temporal dynamics, and provide more fine-grained insights into the data than, for example, straightforward clustering can reveal. We formulate the task of finding such patterns as a novel matrix tiling problem, and propose two algorithms for solving it. Our first algorithm is a greedy set-cover heuristic, while in the second approach we view the problem as time-series segmentation. We apply the algorithms on real and artificial datasets and obtain promising results. The software related to this paper is available at https://github.com/bwrc/semigeom-r.
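In the spirit of the greedy set-cover heuristic, the following sketch scans candidate time segments of a sequences-by-time-bins count matrix and greedily picks (segment, sequence-subset) tiles where event rates clearly exceed each sequence's baseline; the scoring and thresholds are ad hoc stand-ins for the paper's formulation:

```python
# Sketch: greedily cover high-rate (sequence subset x time segment) tiles in
# a count matrix C of shape (n_sequences, n_time_bins).
import numpy as np

def greedy_tiles(C, min_len=2, max_len=10, lift=2.0, n_tiles=3):
    n_seq, n_bins = C.shape
    base = C.mean(axis=1, keepdims=True) + 1e-9   # per-sequence baseline rate
    covered = np.zeros_like(C, dtype=bool)
    tiles = []
    for _ in range(n_tiles):
        best = None
        for s in range(n_bins):
            for e in range(s + min_len, min(s + max_len, n_bins) + 1):
                seg_rate = C[:, s:e].mean(axis=1, keepdims=True)
                rows = np.where((seg_rate >= lift * base).ravel())[0]
                if len(rows) == 0:
                    continue
                # favor the tile covering the most not-yet-covered cells
                gain = (~covered[np.ix_(rows, range(s, e))]).sum()
                if best is None or gain > best[0]:
                    best = (gain, rows, s, e)
        if best is None or best[0] == 0:
            break
        _, rows, s, e = best
        covered[np.ix_(rows, range(s, e))] = True
        tiles.append((rows.tolist(), (s, e)))
    return tiles
```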

3 citations


Posted Content
TL;DR: Clustering is a widely used unsupervised learning method for finding structure in the data, but the resulting clusters are typically presented without any guarantees on their robustness.
Abstract: Clustering is a widely used unsupervised learning method for finding structure in the data. However, the resulting clusters are typically presented without any guarantees on their robustness; slight ...
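Since the abstract is truncated here, the following is only a generic sketch of the problem it raises, quantifying cluster robustness: a standard device is to measure how consistently pairs of points co-cluster across bootstrap resamples. This is illustrative and not necessarily the paper's method:

```python
# Sketch: pairwise co-clustering rates under bootstrap resampling. Rates
# near 1 (or 0) mean a pair is robustly together (or apart); middling
# values flag unstable cluster assignments.
import numpy as np
from sklearn.cluster import KMeans

def cocluster_stability(X, k=3, n_boot=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    together = np.zeros((n, n))
    counted = np.zeros((n, n))
    for _ in range(n_boot):
        idx = rng.choice(n, size=n, replace=True)
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[idx])
        lab = np.full(n, -1)
        lab[idx] = labels              # last occurrence wins; fine for a sketch
        present = np.unique(idx)
        p = lab[present]
        same = p[:, None] == p[None, :]
        counted[np.ix_(present, present)] += 1
        together[np.ix_(present, present)] += same
    return together / np.maximum(counted, 1)

# Three well-separated blobs should yield rates close to 0 or 1.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0, 5, 10)])
stability = cocluster_stability(X)
```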

2 citations


Proceedings Article
01 Jan 2016
TL;DR: SIDE, a generic tool for Subjective Interactive Data Exploration, lets users explore high-dimensional data via subjectively informative two-dimensional data visualizations and allows them to flexibly and intuitively express their interests or beliefs using visual interactions that update/constrain a background model of the data.
Abstract: Data visualization and iterative/interactive data mining are attracting rapidly growing attention, both in research and in industry. However, integrated methods and tools that combine advanced visualization and/or interaction with data mining techniques are rare, and those that exist are specialized to a single problem or domain. We present SIDE, a generic tool for Subjective Interactive Data Exploration, which lets users explore high-dimensional data via subjectively informative two-dimensional data visualizations. In contrast to most visualization tools, it is not based on the traditional dogma of manually zooming and rotating the data. Instead, the tool initially presents the user with an ‘interesting’ projection, and then allows users to flexibly and intuitively express their interests or beliefs using visual interactions that update/constrain a background model of the data. The constraints expressed by the user are then taken into account by a projection-finding algorithm employing data randomization to compute a new ‘interesting’ projection. This process can be iterated until the user runs out of time or finds that the difference between the randomized data and the real data is no longer interesting. We present the tool by means of two case studies, one controlled study on synthetic data and another on real census data.
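A high-level skeleton of the iterative loop described above; the two helper functions are deliberately simplified stubs, not SIDE's actual API, and the interface step that collects user feedback is omitted:

```python
# Skeleton of the explore-randomize-project loop, with stubbed components.
import numpy as np

def randomize_with_constraints(X, constraints, rng):
    """Surrogate data consistent with what the user already knows (stub:
    constraints are ignored here and all columns are shuffled)."""
    Xr = X.copy()
    for j in range(X.shape[1]):
        Xr[:, j] = rng.permutation(Xr[:, j])
    return Xr

def find_projection(X, X_background):
    """Pick a 2D projection where X and the background differ most
    (stub: plain PCA of the real data)."""
    _, _, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
    return Vt[:2].T

rng = np.random.default_rng(4)
X, constraints = rng.normal(size=(200, 8)), []
while True:
    X_bg = randomize_with_constraints(X, constraints, rng)
    P = find_projection(X, X_bg)
    # show X @ P and X_bg @ P to the user and collect marked patterns ...
    new_constraints = []          # ... from the interface (omitted here)
    if not new_constraints:       # user sees no remaining difference: stop
        break
    constraints += new_constraints
```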

1 citation


Proceedings Article
06 Sep 2016
TL;DR: This workshop brings together a cross-domain group of individuals to discuss and contribute to the problem of using mobile gaze tracking for inferring user action, advance the sharing of data and analysis algorithms as well as device solutions, and increase understanding of behavioral aspects of gaze-action sequences in natural environments and AR/VR applications.
Abstract: Gaze tracking in psychological, cognitive, and user interaction studies has recently evolved toward mobile solutions, as they enable direct assessment of users' visual attention in natural environments and in augmented and virtual reality (AR/VR) applications. Productive approaches to analyzing and predicting user actions with gaze data require a multidisciplinary effort involving experts in cognitive and behavioral sciences, machine vision, and machine learning. This workshop brings together a cross-domain group of individuals to (i) discuss and contribute to the problem of using mobile gaze tracking for inferring user action, (ii) advance the sharing of data and analysis algorithms as well as device solutions, and (iii) increase understanding of behavioral aspects of gaze-action sequences in natural environments and AR/VR applications.

1 citation


Journal Article
TL;DR: A new efficient algorithm, termed cocoreg, is proposed for the extraction of variation common to all datasets in a given collection of arbitrary size; it extends redundancy analysis to more than two datasets.
Abstract: In many data analysis tasks it is important to understand the relationships between different datasets. Several methods exist for this task, but many of them are limited to two datasets and linear relationships. In this paper, we propose a new efficient algorithm, termed cocoreg, for the extraction of variation common to all datasets in a given collection of arbitrary size. cocoreg extends redundancy analysis to more than two datasets, utilizing chains of regression functions to extract the shared variation in the original data space. The algorithm can be used with any linear or non-linear regression function, which makes it robust, straightforward, fast, and easy to implement and use. We empirically demonstrate the efficacy of shared variation extraction using the cocoreg algorithm on five artificial and three real datasets.
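A small sketch of the chained-regression idea: propagate the first dataset through regressions X1 → X2 → … → Xk → X1, so that only variation present in every dataset survives the round trip. The full algorithm averages over multiple chains and allows arbitrary regressors; plain linear regression and a single chain stand in here as simplifying assumptions:

```python
# Sketch: one regression chain around the dataset collection and back.
import numpy as np
from sklearn.linear_model import LinearRegression

def cocoreg_chain(datasets):
    """Propagate datasets[0] through regressions fitted between consecutive
    datasets, ending back in datasets[0]'s space. Rows are shared samples."""
    Z = datasets[0]
    hops = datasets + [datasets[0]]              # X1 -> X2 -> ... -> Xk -> X1
    for src, dst in zip(hops[:-1], hops[1:]):
        Z = LinearRegression().fit(src, dst).predict(Z)
    return Z

# Toy demo: one signal shared by all three datasets plus private noise.
rng = np.random.default_rng(5)
common = rng.normal(size=(500, 1))
datasets = [np.hstack([common, rng.normal(size=(500, 2))]) for _ in range(3)]
shared = cocoreg_chain(datasets)   # should retain mainly the common column
```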

1 citation