
Showing papers by "Kai Puolamäki published in 2019"


Posted Content
TL;DR: The rapidly growing research landscape of low-cost sensor technologies for air quality monitoring and their calibration using machine learning techniques is surveyed, open research challenges are identified, and directions for future research are presented.
Abstract: The significance of air pollution and the problems associated with it are fueling deployments of air quality monitoring stations worldwide. The most common approach for air quality monitoring is to rely on environmental monitoring stations, which unfortunately are very expensive both to acquire and to maintain. Hence environmental monitoring stations are typically sparsely deployed, resulting in limited spatial resolution for measurements. Recently, low-cost air quality sensors have emerged as an alternative that can improve the granularity of monitoring. The use of low-cost air quality sensors, however, presents several challenges: they suffer from cross-sensitivities between different ambient pollutants; they can be affected by external factors, such as traffic, weather changes, and human behavior; and their accuracy degrades over time. Periodic re-calibration can improve the accuracy of low-cost sensors, particularly with machine-learning-based calibration, which has shown great promise due to its capability to calibrate sensors in-field. In this article, we survey the rapidly growing research landscape of low-cost sensor technologies for air quality monitoring and their calibration using machine learning techniques. We also identify open research challenges and present directions for future research.
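As a rough, illustrative sketch of the kind of machine-learning calibration the survey covers (not code from the article; the feature names, the synthetic data, and the random-forest choice are assumptions), a low-cost sensor's raw readings can be regressed against co-located reference measurements:

```python
# Illustrative sketch (not from the article): calibrating a low-cost sensor
# against a co-located reference station with a regression model.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

# Hypothetical data: raw sensor channels (target gas, a cross-sensitive gas,
# temperature, relative humidity) and the reference-station concentration.
n = 5000
X = rng.normal(size=(n, 4))                    # raw_no2, raw_o3, temp, rh
y = 2.0 * X[:, 0] - 0.5 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(scale=0.2, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

print("MAE on held-out data:", mean_absolute_error(y_test, model.predict(X_test)))
```

In a field deployment the target values would come from a nearby environmental monitoring station rather than from a synthetic model as here.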

54 citations


Proceedings ArticleDOI
25 Jul 2019
TL;DR: It is shown that it is possible to evaluate the significance of patterns also during exploratory analysis, and that the knowledge of the analyst can be leveraged to improve statistical power by reducing the number of simultaneous comparisons.
Abstract: In this paper we consider the following important problem: when we explore data visually and observe patterns, how can we determine their statistical significance? Patterns observed in exploratory analysis are traditionally met with scepticism, since the hypotheses are formulated while viewing the data, rather than before doing so. In contrast to this belief, we show that it is, in fact, possible to evaluate the significance of patterns also during exploratory analysis, and that the knowledge of the analyst can be leveraged to improve statistical power by reducing the number of simultaneous comparisons. We develop a principled framework for determining the statistical significance of visually observed patterns. Furthermore, we show how the significance of visual patterns observed during iterative data exploration can be determined. We perform an empirical investigation on real and synthetic tabular data and time series, using different test statistics and methods for generating surrogate data. We conclude that the proposed framework allows determining the significance of visual patterns during exploratory analysis.
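A minimal sketch of the surrogate-data idea, using a generic permutation test rather than the authors' framework (the test statistic and the synthetic data are illustrative assumptions):

```python
# Minimal sketch of testing an observed pattern against surrogate data
# (a generic permutation test; not the framework proposed in the paper).
import numpy as np

rng = np.random.default_rng(1)

def test_statistic(x, y):
    """Example statistic: absolute Pearson correlation of the two columns."""
    return abs(np.corrcoef(x, y)[0, 1])

# Hypothetical data in which the analyst 'sees' a relation between x and y.
n = 200
x = rng.normal(size=n)
y = 0.3 * x + rng.normal(size=n)

observed = test_statistic(x, y)

# Surrogates: permute y to break the relation while keeping the marginals.
n_surrogates = 999
surrogate_stats = np.array([
    test_statistic(x, rng.permutation(y)) for _ in range(n_surrogates)
])

# One-sided empirical p-value with the usual +1 correction.
p_value = (1 + np.sum(surrogate_stats >= observed)) / (n_surrogates + 1)
print(f"observed statistic = {observed:.3f}, empirical p-value = {p_value:.3f}")
```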

14 citations


Book ChapterDOI
28 Oct 2019
TL;DR: This paper develops a robust regression method for finding the largest subset in the data that can be approximated using a sparse linear model to a given precision, and shows that the problem is NP-hard and hard to approximate.
Abstract: Real-world datasets are often characterised by outliers, points far from the majority of the points, which might negatively influence modelling of the data. In data analysis it is hence important to use methods that are robust to outliers. In this paper we develop a robust regression method for finding the largest subset in the data that can be approximated using a sparse linear model to a given precision. We show that the problem is NP-hard and hard to approximate. We present an efficient algorithm, termed SLISE, to find solutions to the problem. Our method extends current state-of-the-art robust regression methods, especially in terms of scalability on large datasets. Furthermore, we show that our method can be used to yield interpretable explanations for individual decisions by opaque, black-box classifiers. Our approach solves shortcomings in other recent explanation methods by not requiring sampling of new data points and by being usable without modifications across various data domains. We demonstrate our method using both synthetic and real-world regression and classification problems.
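To make the problem setting concrete, the toy sketch below searches for the largest subset of points that a linear model fits within a residual tolerance epsilon. It uses a naive RANSAC-style random search and is not the SLISE algorithm; the data and the tolerance are illustrative assumptions.

```python
# Toy illustration of the problem setting (largest subset within residual
# tolerance epsilon of a linear model); this is a naive RANSAC-style search,
# NOT the SLISE algorithm from the paper.
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: inliers on a line plus gross outliers.
n_in, n_out = 180, 20
X = rng.uniform(-3, 3, size=(n_in + n_out, 1))
y = np.empty(n_in + n_out)
y[:n_in] = 1.5 * X[:n_in, 0] + rng.normal(scale=0.1, size=n_in)
y[n_in:] = rng.uniform(-10, 10, size=n_out)          # outliers

epsilon = 0.3
best_count, best_model = -1, None

for _ in range(500):
    # Fit a candidate line through a random minimal sample of two points.
    idx = rng.choice(len(y), size=2, replace=False)
    A = np.column_stack([X[idx, 0], np.ones(2)])
    coef, intercept = np.linalg.lstsq(A, y[idx], rcond=None)[0]

    residuals = np.abs(y - (coef * X[:, 0] + intercept))
    count = int(np.sum(residuals <= epsilon))
    if count > best_count:
        best_count, best_model = count, (coef, intercept)

print("largest subset within epsilon:", best_count, "of", len(y))
print("model:", best_model)
```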

9 citations


Posted Content
TL;DR: A principled framework is proposed for interactive visual exploration of relations in data, through views most informative given the user's current knowledge and objectives; its dimensionality reduction method reduces to PCA in the limit of no background knowledge and with generic objectives.
Abstract: Efficient explorative data analysis systems must take into account both what a user knows and wants to know. This paper proposes a principled framework for interactive visual exploration of relations in data, through views most informative given the user's current knowledge and objectives. The user can input pre-existing knowledge of relations in the data and also formulate specific exploration interests, which are then taken into account in the exploration. The idea is to steer the exploration process towards the interests of the user, instead of showing uninteresting or already known relations. The user's knowledge is modelled by a distribution over data sets parametrised by subsets of rows and columns of data, called tile constraints. We provide a computationally efficient implementation of this concept based on constrained randomisation. Furthermore, we describe a novel dimensionality reduction method for finding the views most informative to the user, which at the limit of no background knowledge and with generic objectives reduces to PCA. We show that the method is suitable for interactive use and is robust to noise, outperforms standard projection pursuit visualisation methods, and gives understandable and useful results in analysis of real-world data. We provide an open-source implementation of the framework.
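The abstract notes that, in the limit of no background knowledge and with generic objectives, the most informative view reduces to PCA. A minimal sketch of that baseline case only (not the proposed framework; the synthetic data is an assumption):

```python
# Minimal sketch of the stated limit case: with no background knowledge and
# generic objectives the most informative view reduces to ordinary PCA.
# This reproduces only that baseline, not the proposed framework itself.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)

# Hypothetical data matrix (rows = observations, columns = attributes).
X = rng.normal(size=(300, 10))
X[:, 0] += 3.0 * X[:, 1]                       # inject one strong linear relation

view = PCA(n_components=2).fit_transform(X)    # 2-D view to show the user
print("projected shape:", view.shape)
```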

4 citations


Journal ArticleDOI
TL;DR: Critical cognitive limitations in how people utilise data are demonstrated, and general cognitive ergonomics guidelines for design are needed to support the utilisation of data and improve data-based decision-making.
Abstract: Today’s ever-increasing amount of data places new demands on cognitive ergonomics and requires new design ideas to ensure successful human–data interaction. Our aim was to identify the cognitive factors that require attention when designing systems to improve decision-making based on large amounts of data. ...

4 citations


Posted Content
13 Dec 2019
TL;DR: This article presents low-cost sensor technologies, surveys and assesses machine-learning-based techniques for their calibration, and presents open questions and directions for future research.
Abstract: In recent years, interest in monitoring air quality has been growing. Traditional environmental monitoring stations are very expensive, both to acquire and to maintain; therefore, their deployment is generally very sparse. This is a problem when trying to generate air quality maps with a fine spatial resolution. Given the general interest in air quality monitoring, low-cost air quality sensors have become an active area of research and development. Low-cost air quality sensors can be deployed at a finer level of granularity than traditional monitoring stations. Furthermore, they can be portable and mobile. Low-cost air quality sensors, however, present some challenges: they suffer from cross-sensitivities between different ambient pollutants; they can be affected by external factors such as traffic, weather changes, and human behavior; and their accuracy degrades over time. Some promising machine learning approaches can help us obtain highly accurate measurements with low-cost air quality sensors. In this article, we present low-cost sensor technologies, and we survey and assess machine-learning-based techniques for their calibration. We conclude by presenting open questions and directions for future research.
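As a hedged illustration of the accuracy degradation over time mentioned above (not an experiment from the article; the drift model and window sizes are assumptions), a calibration model fitted on an early time window can be evaluated on later windows:

```python
# Illustrative sketch (not from the article): accuracy of a calibration model
# trained on an early period tends to degrade when evaluated on later periods
# if the sensor signal drifts.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(4)

n = 4000
t = np.arange(n)
raw = rng.normal(size=n)
drift = 0.0005 * t                               # hypothetical slow sensor drift
reference = 2.0 * raw + rng.normal(scale=0.2, size=n)
sensor = raw + drift                             # measured signal drifts over time

X = sensor.reshape(-1, 1)
model = LinearRegression().fit(X[:1000], reference[:1000])   # calibrate early

for start in (1000, 2000, 3000):
    window = slice(start, start + 1000)
    mae = mean_absolute_error(reference[window], model.predict(X[window]))
    print(f"window starting at t={start}: MAE = {mae:.3f}")
```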

3 citations


Proceedings ArticleDOI
10 Sep 2019
TL;DR: It is shown that accuracy of judgement decreased as the amount of information increased, and that judgement was affected by irrelevant information, which demonstrates critical cognitive limitations when people utilise data and suggests a cognitive bias in data-based decision-making.
Abstract: Today's ever-increasing amount of data places new demands on cognitive ergonomics and requires new design ideas to ensure successful human–data interaction. Our aim is to identify the cognitive factors that require attention when designing systems to improve decision-making based on large amounts of data. We designed an experiment that simulates the typical cognitive demands people encounter in data analysis situations. We demonstrate some essential cognitive limitations using a behavioural experiment with 20 participants. The studied task presented the participants with critical and noncritical attributes that contained information on two groups of people. They had to select the response option (group) with a higher frequency of critical attributes. The results showed that accuracy of judgement decreased as the amount of information increased, and that judgement was affected by irrelevant information. Our results thus demonstrate critical cognitive limitations when people utilise data and suggest a cognitive bias in data-based decision-making. Therefore, when designing for cognition, we should consider the human cognitive limitations that are manifested in a data analysis context and develop general cognitive ergonomics guidelines for design to support the utilisation of data and improve data-based decision-making.

2 citations


Posted Content
TL;DR: This paper presents an efficient framework for estimating the generalization error of regression functions, applicable to any family of regression functions when the ground truth is unknown, and finds that it performs robustly and is useful for detecting concept drift in datasets in several real-world domains.
Abstract: Regression analysis is a standard supervised machine learning method used to model an outcome variable in terms of a set of predictor variables. In most real-world applications we do not know the true value of the outcome variable being predicted outside the training data, i.e., the ground truth is unknown. It is hence not straightforward to directly observe when the estimate from a model potentially is wrong, due to phenomena such as overfitting and concept drift. In this paper we present an efficient framework for estimating the generalization error of regression functions, applicable to any family of regression functions when the ground truth is unknown. We present a theoretical derivation of the framework and empirically evaluate its strengths and limitations. We find that it performs robustly and is useful for detecting concept drift in datasets in several real-world domains.
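The sketch below is not the framework proposed in the paper; it only illustrates, on assumed synthetic data, one generic way to flag potential concept drift without ground truth, by monitoring disagreement between regressors trained on bootstrap resamples:

```python
# Rough illustration (not the framework from the paper): flagging potential
# concept drift without ground truth by monitoring disagreement between
# regressors trained on bootstrap resamples of the training data.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)

# Hypothetical training data and two 'deployment' batches, one drifted.
X_train = rng.normal(size=(500, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)
X_same = rng.normal(size=(200, 3))                    # same distribution
X_drift = rng.normal(loc=3.0, size=(200, 3))          # shifted distribution

models = []
for _ in range(20):
    idx = rng.integers(0, len(y_train), size=len(y_train))   # bootstrap sample
    models.append(Ridge(alpha=1.0).fit(X_train[idx], y_train[idx]))

def disagreement(X):
    preds = np.stack([m.predict(X) for m in models])   # (n_models, n_points)
    return preds.std(axis=0).mean()

print("disagreement on in-distribution batch:", disagreement(X_same))
print("disagreement on drifted batch:        ", disagreement(X_drift))
```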

2 citations


Book ChapterDOI
16 Sep 2019
TL;DR: This work formulates an information criterion for supervised human-guided data exploration to find the most informative views about the class structure of the data by taking both the user’s current knowledge and objectives into account and shows that the method gives understandable and useful results when analysing real-world datasets.
Abstract: An exploratory data analysis system should be aware of what a user already knows and what the user wants to know of the data. Otherwise it is impossible to provide the user with truly informative and useful views of the data. In our recently introduced framework for human-guided data exploration (Puolamaki et al. [20]), both the user’s knowledge and objectives are modelled as distributions over data, parametrised by tile constraints. This makes it possible to show the users the most informative views given their current knowledge and objectives. Often the data, however, comes with a class label and the user is interested only in the features that are informative about the class. In non-interactive settings there exist dimensionality reduction methods, such as supervised PCA (Barshan et al. [1]), to make such visualisations, but no such method takes the user’s knowledge or objectives into account. Here, we formulate an information criterion for supervised human-guided data exploration to find the most informative views about the class structure of the data by taking both the user’s current knowledge and objectives into account. We study experimentally the scalability of our method for interactive use, and stability with respect to the size of the class of interest. We show that our method gives understandable and useful results when analysing real-world datasets, and a comparison to SPCA demonstrates the effect of the user’s background knowledge. The implementation will be released as an open source software library.
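A compact sketch of the supervised PCA baseline (Barshan et al.) that the abstract compares against; this is the comparison baseline, not the proposed supervised human-guided exploration method, and the delta kernel on class labels and the synthetic data are assumptions:

```python
# Compact sketch of the supervised PCA baseline (Barshan et al.) mentioned in
# the abstract; this is the comparison baseline, not the proposed method.
import numpy as np

def supervised_pca(X, y, n_components=2):
    """Project rows of X onto directions maximising HSIC-style dependence
    with the labels y, using a delta kernel on the class labels."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n                # centering matrix
    L = (y[:, None] == y[None, :]).astype(float)       # delta kernel on labels
    Q = X.T @ H @ L @ H @ X                            # symmetric d x d matrix
    eigvals, eigvecs = np.linalg.eigh(Q)
    U = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return (X - X.mean(axis=0)) @ U

# Hypothetical labelled data: two classes separated along one attribute.
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 5))
y = (X[:, 2] > 0).astype(int)
Z = supervised_pca(X, y)
print("supervised-PCA view shape:", Z.shape)
```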

Journal ArticleDOI
TL;DR: CycleSampler, an efficient property-preserving Markov chain Monte Carlo method for generating surrogate networks in which edge weights are constrained to intervals and vertex strengths are preserved exactly, is presented.
Abstract: In many domains it is necessary to generate surrogate networks, e.g., for hypothesis testing of different properties of a network. Generating surrogate networks typically requires that different properties of the network are preserved, e.g., edges may not be added or deleted and edge weights may be restricted to certain intervals. In this paper we present an efficient property-preserving Markov chain Monte Carlo method termed CycleSampler for generating surrogate networks in which (1) edge weights are constrained to intervals and vertex strengths are preserved exactly, and (2) edge and vertex strengths are both constrained to intervals. These two types of constraints cover a wide variety of practical use cases. The method is applicable to both undirected and directed graphs. We empirically demonstrate the efficiency of the CycleSampler method on real-world data sets. We provide an implementation of CycleSampler in R, with parts implemented in C.
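A schematic illustration of the cycle-update idea behind such samplers: perturbing edge weights with alternating signs along an even-length cycle keeps every vertex strength exact, and a move is rejected if any weight would leave its interval. This is not the CycleSampler implementation (which is available in R); the graph, the intervals, and the proposal scale are assumptions.

```python
# Schematic illustration (not the CycleSampler implementation, which is in R):
# perturbing edge weights along an even-length cycle with alternating +delta /
# -delta keeps every vertex strength exact; the move is rejected if any weight
# would leave its allowed interval.
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical 4-cycle with edge weights and per-edge weight intervals.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
weights = np.array([1.0, 2.0, 1.5, 0.5])
intervals = np.array([[0.0, 3.0]] * 4)               # [low, high] per edge

def vertex_strengths(weights):
    s = np.zeros(4)
    for (u, v), w in zip(edges, weights):
        s[u] += w
        s[v] += w
    return s

def cycle_step(weights, scale=0.5):
    """One Metropolis-style move: alternate +delta/-delta around the cycle."""
    delta = rng.normal(scale=scale)
    signs = np.array([+1.0, -1.0, +1.0, -1.0])
    proposal = weights + signs * delta
    inside = np.all((proposal >= intervals[:, 0]) & (proposal <= intervals[:, 1]))
    return proposal if inside else weights            # reject if out of bounds

before = vertex_strengths(weights)
for _ in range(1000):
    weights = cycle_step(weights)
print("max strength change:", np.abs(vertex_strengths(weights) - before).max())
```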