
Papers by Kai Puolamäki published in 2016


Journal Article
TL;DR: The significance estimates of various statistical tests are compared in a controlled resampling experiment and in a practical setting, studying differences between texts produced by male and female fiction writers in the British National Corpus; the authors conclude that significance testing can be used to find consequential differences between corpora.
Abstract: Finding out whether a word occurs significantly more often in one text or corpus than in another is an important question in analysing corpora. As noted by Kilgarriff (Language is never, ever, ever, random. Corpus Linguistics and Linguistic Theory, 2005; 1(2): 263–76), the use of the χ² and log-likelihood ratio tests is problematic in this context, as they are based on the assumption that all samples are statistically independent of each other. However, words within a text are not independent. As pointed out in Kilgarriff (Comparing corpora. International Journal of Corpus Linguistics, 2001; 6(1): 1–37) and Paquot and Bestgen (Distinctive words in academic writing: a comparison of three statistical tests for keyword extraction. In Jucker, A., Schreier, D., and Hundt, M. (eds), Corpora: Pragmatics and Discourse. Amsterdam: Rodopi, 2009, pp. 247–69), it is possible to represent the data differently and employ other tests, such that we assume independence at the level of texts rather than individual words. This allows us to account for the distribution of words within a corpus. In this article, we compare the significance estimates of various statistical tests in a controlled resampling experiment and in a practical setting, studying differences between texts produced by male and female fiction writers in the British National Corpus. We find that the choice of the test, and hence the data representation, matters. We conclude that significance testing can be used to find consequential differences between corpora, but that assuming independence between all words may lead to overestimating the significance of the observed differences, especially for poorly dispersed words. We recommend the use of the t-test, Wilcoxon rank-sum test, or bootstrap test for comparing word frequencies across corpora.
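To make the recommendation concrete, here is a minimal sketch of text-level significance testing in Python, using SciPy's t-test and Wilcoxon rank-sum test plus a simple bootstrap; the per-text frequencies and corpus labels are invented for illustration, and each text contributes one observation so that independence is assumed between texts rather than between word occurrences:

```python
# Minimal sketch: compare per-text relative frequencies of one word across
# two corpora. The numbers below are illustrative placeholders.
import numpy as np
from scipy import stats

freq_a = np.array([0.0012, 0.0009, 0.0015, 0.0011, 0.0008])  # e.g. corpus A texts
freq_b = np.array([0.0005, 0.0007, 0.0004, 0.0009, 0.0006])  # e.g. corpus B texts

# Welch's t-test and Wilcoxon rank-sum test on the per-text frequencies.
t_stat, t_p = stats.ttest_ind(freq_a, freq_b, equal_var=False)
w_stat, w_p = stats.ranksums(freq_a, freq_b)

# Simple two-sided bootstrap test of the difference in mean frequency.
rng = np.random.default_rng(0)
observed = freq_a.mean() - freq_b.mean()
pooled = np.concatenate([freq_a, freq_b])
boot = np.array([
    rng.choice(pooled, len(freq_a), replace=True).mean()
    - rng.choice(pooled, len(freq_b), replace=True).mean()
    for _ in range(10_000)
])
boot_p = (np.abs(boot) >= abs(observed)).mean()
print(t_p, w_p, boot_p)
```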

86 citations


Journal Article
14 Jul 2016 - PLOS ONE
TL;DR: This study investigates synchrony in physiological signals between collaborating computer science students performing pair-programming exercises in a classroom environment and finds clear physiological compliance in the collaborating dyads' heart-rate variability signals.
Abstract: It is known that periods of intense social interaction result in shared patterns in collaborators' physiological signals. However, applied quantitative research on collaboration is hindered by the scarcity of objective metrics of teamwork effectiveness. Indeed, especially in the domain of productive, ecologically valid activity such as programming, there is a lack of evidence for the most effective, affordable, and reliable measures of collaboration quality. In this study we investigate synchrony in physiological signals between collaborating computer science students performing pair-programming exercises in a classroom environment. We recorded electrocardiography over the course of a 60-minute programming session, using lightweight physiological sensors. We employ correlation of heart-rate variability features to study the social psychophysiological compliance of the collaborating students. We found clear physiological compliance in the collaborating dyads' heart-rate variability signals. Furthermore, the dyads' self-reported workload was associated with the physiological compliance. Our results show the viability of a novel approach to field measurement using lightweight devices in an uncontrolled environment, and suggest that self-reported collaboration quality can be assessed via physiological signals.
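As a rough illustration of the compliance measure, the sketch below correlates windowed heart-rate-variability features of two collaborators. The RMSSD feature and window length are common choices but are assumptions here (the paper's exact feature set may differ), and the RR-interval series are synthetic placeholders:

```python
# Sketch: physiological compliance as correlation of windowed HRV features.
import numpy as np

def rmssd(rr_ms):
    """Root mean square of successive differences of RR intervals (ms)."""
    diffs = np.diff(rr_ms)
    return np.sqrt(np.mean(diffs ** 2))

def windowed_rmssd(rr_ms, window=30):
    """RMSSD over consecutive windows of `window` RR intervals."""
    return np.array([rmssd(rr_ms[i:i + window])
                     for i in range(0, len(rr_ms) - window, window)])

# rr_a, rr_b: RR-interval series (ms) of the two members of a dyad.
rng = np.random.default_rng(1)
rr_a = 800 + rng.normal(0, 50, 2000).cumsum() * 0.01  # synthetic placeholder
rr_b = 820 + rng.normal(0, 50, 2000).cumsum() * 0.01  # synthetic placeholder

feat_a, feat_b = windowed_rmssd(rr_a), windowed_rmssd(rr_b)
n = min(len(feat_a), len(feat_b))
compliance = np.corrcoef(feat_a[:n], feat_b[:n])[0, 1]
print(f"dyad HRV compliance (Pearson r): {compliance:.2f}")
```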

34 citations


Book Chapter
19 Sep 2016
TL;DR: A novel generic method for interactive visual exploration of high-dimensional data that employs data randomization with constraints to allow users to flexibly and intuitively express their interests or beliefs using visual interactions that correspond to exactly defined constraints.
Abstract: Data visualization and iterative/interactive data mining are attracting rapidly growing attention, both in research and in industry. However, integrated methods and tools that combine advanced visualization and data mining techniques are rare, and those that exist are often specialized to a single problem or domain. In this paper, we introduce a novel generic method for interactive visual exploration of high-dimensional data. In contrast to most visualization tools, it is not based on the traditional dogma of manually zooming and rotating the data. Instead, the tool initially presents the user with an ‘interesting’ projection of the data and then employs data randomization with constraints to allow users to flexibly and intuitively express their interests or beliefs using visual interactions that correspond to exactly defined constraints. The constraints expressed by the user are then taken into account by a projection-finding algorithm to compute a new ‘interesting’ projection, a process that can be iterated until the user runs out of time or finds that the constraints explain everything she needs from the data. We present the tool by means of two case studies, one controlled study on synthetic data and another on real census data. The data and software related to this paper are available at http://www.interesting-patterns.net/forsied/interactive-visual-data-exploration-with-subjective-feedback/.
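A conceptual sketch of the constrained-randomization step, under the simplifying assumption that a constraint pins down a set of rows the user has already 'explained'; the function and variable names are illustrative, not the tool's API:

```python
# Sketch: randomization with user constraints. Rows the user has marked as
# understood keep their values; unconstrained rows are permuted column-wise,
# so any remaining data-vs-randomized difference is unexplained structure.
import numpy as np

def constrained_randomize(X, constrained_rows, rng):
    """Permute each column of X over rows *not* covered by a constraint."""
    Xr = X.copy()
    free = np.setdiff1d(np.arange(len(X)), constrained_rows)
    for j in range(X.shape[1]):
        Xr[free, j] = rng.permutation(X[free, j])
    return Xr

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
marked = np.arange(20)                      # rows the user marked as understood
X_random = constrained_randomize(X, marked, rng)
# A projection separating X from X_random now highlights structure the user
# has not yet accounted for.
```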

13 citations


Book Chapter
19 Sep 2016
TL;DR: SIDE, a tool for Subjective and Interactive Visual Data Exploration, lets users explore high-dimensional data via subjectively informative 2D data visualizations; by representing the user's belief state as a set of projection tiles, it offers an efficient way to interactively explore yet-unknown features of complex high-dimensional datasets.
Abstract: We present SIDE, a tool for Subjective and Interactive Visual Data Exploration, which lets users explore high-dimensional data via subjectively informative 2D data visualizations. Many existing visual analytics tools are either restricted to specific problems and domains or aim to find visualizations that align with the user's beliefs about the data. In contrast, our generic tool computes data visualizations that are surprising given the user's current understanding of the data. The user's belief state is represented as a set of projection tiles. This user-awareness offers users an efficient way to interactively explore yet-unknown features of complex high-dimensional datasets.
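As a hedged illustration of what "surprising given the user's current understanding" can mean, the sketch below finds the two directions along which the data's variance most exceeds that of a randomized background, via whitening followed by PCA; this is a simplification, not necessarily SIDE's exact projection search:

```python
# Sketch: a 'surprising' 2D projection as the directions maximizing the
# variance of the real data relative to a randomized background.
import numpy as np

def surprising_projection(X, X_background):
    """Whiten by the background covariance, then take the top-2 PCA
    directions of the whitened data; columns are directions in data space."""
    cov_bg = np.cov(X_background, rowvar=False)
    evals, evecs = np.linalg.eigh(cov_bg)
    whiten = evecs @ np.diag(1.0 / np.sqrt(np.maximum(evals, 1e-12))) @ evecs.T
    Xw = (X - X.mean(axis=0)) @ whiten
    _, _, Vt = np.linalg.svd(Xw, full_matrices=False)
    return whiten @ Vt[:2].T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:100, 0] += 3.0                                  # planted cluster structure
X_bg = rng.normal(size=(200, 5))                   # fully randomized background
coords = X @ surprising_projection(X, X_bg)        # 2D view to show the user
```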

8 citations


Posted Content
TL;DR: A novel method is presented, based on statistical significance testing, that can be used to test whether a dataset has been generated by a given factorised class-conditional joint distribution, where the distribution is parametrised by a partition of its attributes.
Abstract: In many data exploration tasks it is meaningful to identify groups of attribute interactions that are specific to a variable of interest. For instance, in a dataset where the attributes are medical markers and the variable of interest (class variable) is binary, indicating presence/absence of disease, we would like to know which medical markers interact with respect to the binary class label. These interactions are useful in several practical applications, for example, to gain insight into the structure of the data, in feature selection, and in data anonymisation. We present a novel method, based on statistical significance testing, that can be used to test whether the dataset has been generated by a given factorised class-conditional joint distribution, where the distribution is parametrised by a partition of its attributes. Furthermore, we provide a method, named ASTRID, for automatically finding a partition of attributes describing the distribution that has generated the data. State-of-the-art classifiers are utilised to capture the interactions present in the data by systematically breaking attribute interactions and observing the effect of this breaking on classifier performance. We empirically demonstrate the utility of the proposed method with examples using real and synthetic data.
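The interaction-breaking idea can be sketched as follows: permute each group of attributes independently within each class, which preserves the groups' class-conditional marginals but destroys between-group interactions, then compare classifier accuracy before and after. The grouping, classifier, and toy data below are illustrative assumptions, not ASTRID's implementation:

```python
# Sketch: break between-group attribute interactions within each class and
# watch the classifier's cross-validated accuracy drop.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def break_interactions(X, y, groups, rng):
    """Permute rows of each attribute group independently within each class,
    keeping the class-conditional marginals of each group intact."""
    Xb = X.copy()
    for cls in np.unique(y):
        idx = np.where(y == cls)[0]
        for g in groups:
            Xb[np.ix_(rng.permutation(idx), g)] = X[np.ix_(idx, g)]
    return Xb

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 6))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # class depends on a 0-1 interaction
clf = RandomForestClassifier(random_state=0)

full_acc = cross_val_score(clf, X, y, cv=5).mean()
# Splitting attributes 0 and 1 apart destroys the interaction the class
# depends on, so accuracy should fall toward chance level.
Xb = break_interactions(X, y, groups=[[0], [1], [2, 3, 4, 5]], rng=rng)
broken_acc = cross_val_score(clf, Xb, y, cv=5).mean()
print(full_acc, broken_acc)
```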

4 citations


Book Chapter
19 Sep 2016
TL;DR: The task of finding combinations of temporal segments and subsets of sequences where an event of interest, like a particular hashtag, has an increased occurrence probability is formulated as a novel matrix tiling problem, and two algorithms for solving it are proposed.
Abstract: Event sequences are ubiquitous, e.g., in finance, medicine, and social media. Often the same underlying phenomenon, such as television advertisements during the Super Bowl, is reflected in independent event sequences, like those of different Twitter users. It is hence of interest to find combinations of temporal segments and subsets of sequences where an event of interest, like a particular hashtag, has an increased occurrence probability. Such patterns allow exploration of the event sequences in terms of their evolving temporal dynamics, and provide more fine-grained insights into the data than, for example, straightforward clustering can reveal. We formulate the task of finding such patterns as a novel matrix tiling problem, and propose two algorithms for solving it. Our first algorithm is a greedy set-cover heuristic, while in the second approach we view the problem as time-series segmentation. We apply the algorithms on real and artificial datasets and obtain promising results. The software related to this paper is available at https://github.com/bwrc/semigeom-r.
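In the spirit of the greedy set-cover heuristic, the following sketch scans candidate time segments of a sequences-by-time-bins count matrix and greedily picks (segment, sequence-subset) tiles where event rates clearly exceed each sequence's baseline; the scoring and thresholds are ad hoc stand-ins for the paper's formulation:

```python
# Sketch: greedily cover high-rate (sequence subset x time segment) tiles in
# a count matrix C of shape (n_sequences, n_time_bins).
import numpy as np

def greedy_tiles(C, min_len=2, max_len=10, lift=2.0, n_tiles=3):
    n_seq, n_bins = C.shape
    base = C.mean(axis=1, keepdims=True) + 1e-9   # per-sequence baseline rate
    covered = np.zeros_like(C, dtype=bool)
    tiles = []
    for _ in range(n_tiles):
        best = None
        for s in range(n_bins):
            for e in range(s + min_len, min(s + max_len, n_bins) + 1):
                seg_rate = C[:, s:e].mean(axis=1, keepdims=True)
                rows = np.where((seg_rate >= lift * base).ravel())[0]
                if len(rows) == 0:
                    continue
                # favor the tile covering the most not-yet-covered cells
                gain = (~covered[np.ix_(rows, range(s, e))]).sum()
                if best is None or gain > best[0]:
                    best = (gain, rows, s, e)
        if best is None or best[0] == 0:
            break
        _, rows, s, e = best
        covered[np.ix_(rows, range(s, e))] = True
        tiles.append((rows.tolist(), (s, e)))
    return tiles
```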

3 citations


Posted Content
TL;DR: Clustering is a widely used unsupervised learning method for finding structure in the data, but the resulting clusters are typically presented without any guarantees on their robustness.
Abstract: Clustering is a widely used unsupervised learning method for finding structure in the data. However, the resulting clusters are typically presented without any guarantees on their robustness; slight ...
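Since the abstract is truncated here, the following is only a generic sketch of the problem it raises, quantifying cluster robustness: a standard device is to measure how consistently pairs of points co-cluster across bootstrap resamples. This is illustrative and not necessarily the paper's method:

```python
# Sketch: pairwise co-clustering rates under bootstrap resampling. Rates
# near 1 (or 0) mean a pair is robustly together (or apart); middling
# values flag unstable cluster assignments.
import numpy as np
from sklearn.cluster import KMeans

def cocluster_stability(X, k=3, n_boot=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    together = np.zeros((n, n))
    counted = np.zeros((n, n))
    for _ in range(n_boot):
        idx = rng.choice(n, size=n, replace=True)
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[idx])
        lab = np.full(n, -1)
        lab[idx] = labels              # last occurrence wins; fine for a sketch
        present = np.unique(idx)
        p = lab[present]
        same = p[:, None] == p[None, :]
        counted[np.ix_(present, present)] += 1
        together[np.ix_(present, present)] += same
    return together / np.maximum(counted, 1)

# Three well-separated blobs should yield rates close to 0 or 1.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0, 5, 10)])
stability = cocluster_stability(X)
```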

2 citations


Proceedings Article
01 Jan 2016
TL;DR: SIDE, a generic tool for Subjective Interactive Data Exploration, lets users explore high-dimensional data via subjectively informative two-dimensional data visualizations and allows them to flexibly and intuitively express their interests or beliefs using visual interactions that update/constrain a background model of the data.
Abstract: Data visualization and iterative/interactive data mining are attracting rapidly growing attention, both in research and in industry. However, integrated methods and tools that combine advanced visualization and/or interaction with data mining techniques are rare, and those that exist are specialized to a single problem or domain. We present SIDE, a generic tool for Subjective Interactive Data Exploration, which lets users explore high-dimensional data via subjectively informative two-dimensional data visualizations. In contrast to most visualization tools, it is not based on the traditional dogma of manually zooming and rotating the data. Instead, the tool initially presents the user with an ‘interesting’ projection, and then allows users to flexibly and intuitively express their interests or beliefs using visual interactions that update/constrain a background model of the data. The constraints expressed by the user are then taken into account by a projection-finding algorithm employing data randomization to compute a new ‘interesting’ projection. This process can be iterated until the user runs out of time or finds that the difference between the randomized data and the real data is no longer interesting. We present the tool by means of two case studies, one controlled study on synthetic data and another on real census data.
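A high-level skeleton of the iterative loop described above; the two helper functions are deliberately simplified stubs, not SIDE's actual API, and the interface step that collects user feedback is omitted:

```python
# Skeleton of the explore-randomize-project loop, with stubbed components.
import numpy as np

def randomize_with_constraints(X, constraints, rng):
    """Surrogate data consistent with what the user already knows (stub:
    constraints are ignored here and all columns are shuffled)."""
    Xr = X.copy()
    for j in range(X.shape[1]):
        Xr[:, j] = rng.permutation(Xr[:, j])
    return Xr

def find_projection(X, X_background):
    """Pick a 2D projection where X and the background differ most
    (stub: plain PCA of the real data)."""
    _, _, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
    return Vt[:2].T

rng = np.random.default_rng(4)
X, constraints = rng.normal(size=(200, 8)), []
while True:
    X_bg = randomize_with_constraints(X, constraints, rng)
    P = find_projection(X, X_bg)
    # show X @ P and X_bg @ P to the user and collect marked patterns ...
    new_constraints = []          # ... from the interface (omitted here)
    if not new_constraints:       # user sees no remaining difference: stop
        break
    constraints += new_constraints
```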

1 citation


Proceedings Article
06 Sep 2016
TL;DR: This workshop brings together a cross-domain group of individuals to discuss and contribute to the problem of using mobile gaze tracking for inferring user action, advance the sharing of data and analysis algorithms as well as device solutions, and increase understanding of behavioral aspects of gaze-action sequences in natural environments and AR/VR applications.
Abstract: Gaze tracking in psychological, cognitive, and user interaction studies has recently evolved toward mobile solutions, as they enable direct assessment of users' visual attention in natural environments and in augmented and virtual reality (AR/VR) applications. Productive approaches to analyzing and predicting user actions with gaze data require a multidisciplinary effort involving experts in cognitive and behavioral sciences, machine vision, and machine learning. This workshop brings together a cross-domain group of individuals to (i) discuss and contribute to the problem of using mobile gaze tracking for inferring user action, (ii) advance the sharing of data and analysis algorithms as well as device solutions, and (iii) increase understanding of behavioral aspects of gaze-action sequences in natural environments and AR/VR applications.

1 citation


Journal Article
TL;DR: A new efficient algorithm, termed cocoreg, is proposed for the extraction of variation common to all datasets in a given collection of arbitrary size; it extends redundancy analysis to more than two datasets.
Abstract: In many data analysis tasks it is important to understand the relationships between different datasets. Several methods exist for this task, but many of them are limited to two datasets and linear relationships. In this paper, we propose a new efficient algorithm, termed cocoreg, for the extraction of variation common to all datasets in a given collection of arbitrary size. cocoreg extends redundancy analysis to more than two datasets, utilizing chains of regression functions to extract the shared variation in the original data space. The algorithm can be used with any linear or non-linear regression function, which makes it robust, straightforward, fast, and easy to implement and use. We empirically demonstrate the efficacy of shared variation extraction using the cocoreg algorithm on five artificial and three real datasets.
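A small sketch of the chained-regression idea: propagate the first dataset through regressions X1 → X2 → … → Xk → X1, so that only variation present in every dataset survives the round trip. The full algorithm averages over multiple chains and allows arbitrary regressors; plain linear regression and a single chain stand in here as simplifying assumptions:

```python
# Sketch: one regression chain around the dataset collection and back.
import numpy as np
from sklearn.linear_model import LinearRegression

def cocoreg_chain(datasets):
    """Propagate datasets[0] through regressions fitted between consecutive
    datasets, ending back in datasets[0]'s space. Rows are shared samples."""
    Z = datasets[0]
    hops = datasets + [datasets[0]]              # X1 -> X2 -> ... -> Xk -> X1
    for src, dst in zip(hops[:-1], hops[1:]):
        Z = LinearRegression().fit(src, dst).predict(Z)
    return Z

# Toy demo: one signal shared by all three datasets plus private noise.
rng = np.random.default_rng(5)
common = rng.normal(size=(500, 1))
datasets = [np.hstack([common, rng.normal(size=(500, 2))]) for _ in range(3)]
shared = cocoreg_chain(datasets)   # should retain mainly the common column
```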

1 citation