
Showing papers by "Sasu Tarkoma published in 2019"


Posted Content
TL;DR: The rapidly growing research landscape of low-cost sensor technologies for air quality monitoring and their calibration using machine learning techniques is surveyed, and open research challenges and directions for future research are identified.
Abstract: The significance of air pollution and the problems associated with it are fueling deployments of air quality monitoring stations worldwide. The most common approach for air quality monitoring is to rely on environmental monitoring stations, which unfortunately are very expensive both to acquire and to maintain. Hence environmental monitoring stations are typically sparsely deployed, resulting in limited spatial resolution for measurements. Recently, low-cost air quality sensors have emerged as an alternative that can improve the granularity of monitoring. The use of low-cost air quality sensors, however, presents several challenges: they suffer from cross-sensitivities between different ambient pollutants; they can be affected by external factors, such as traffic, weather changes, and human behavior; and their accuracy degrades over time. Periodic re-calibration can improve the accuracy of low-cost sensors, particularly with machine-learning-based calibration, which has shown great promise due to its capability to calibrate sensors in-field. In this article, we survey the rapidly growing research landscape of low-cost sensor technologies for air quality monitoring and their calibration using machine learning techniques. We also identify open research challenges and present directions for future research.
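As an illustration of the calibration idea in its simplest form (not any specific method from the survey), the following sketch fits an ordinary-least-squares model mapping a hypothetical low-cost NO2 sensor's raw reading, plus temperature and humidity as covariates to absorb cross-sensitivities, to co-located reference measurements. All data below is synthetic.

```python
import numpy as np

# Synthetic measurement campaign for a hypothetical low-cost NO2 sensor.
rng = np.random.default_rng(0)
n = 200
raw_no2 = rng.uniform(10, 60, n)   # raw sensor reading (ppb)
temp = rng.uniform(5, 30, n)       # ambient temperature (C)
rh = rng.uniform(30, 90, n)        # relative humidity (%)
# Synthetic reference values with temperature/humidity interference.
reference = 0.8 * raw_no2 - 0.3 * temp + 0.1 * rh + rng.normal(0, 1, n)

# Design matrix with an intercept; least-squares calibration fit.
X = np.column_stack([np.ones(n), raw_no2, temp, rh])
coef, *_ = np.linalg.lstsq(X, reference, rcond=None)

calibrated = X @ coef
rmse = float(np.sqrt(np.mean((calibrated - reference) ** 2)))
print(f"RMSE after calibration: {rmse:.2f} ppb")
```

Real machine-learning calibration pipelines use richer models (random forests, neural networks) and periodic re-fitting, but the structure — raw signal plus environmental covariates regressed against a reference — is the same.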

54 citations


Proceedings ArticleDOI
22 Jul 2019
TL;DR: A 44-day measurement campaign is conducted to assess the performance of low-cost air quality monitors under different environmental conditions; the results show that their accuracy is sufficient for applications relying on variations in air quality index values, such as hot spot detection.
Abstract: Air pollution is a major problem in urban areas, where high population density is accompanied by excess anthropogenic emissions that impact the environment and harm health. Highly accurate air quality monitoring stations have been used to monitor the severity of the problem and warn citizens. However, air quality can vary sharply even within the same city block, and pollution exposure can vary by as much as 30% between individuals living in the same residence. Therefore, a dense deployment of air quality sensors is needed to detect these variations and protect citizens from overexposure. Low-cost air quality sensors make it possible to densely instrument a city and detect hot spots as they happen. However, thus far limited information exists on their accuracy and practicability. In this paper, we conduct a 44-day measurement campaign to assess the performance of low-cost air quality monitors under different environmental conditions. As a practical use case, we consider pollution hot spot detection. Our results show that the mean error of low-cost sensors is small, but the variation in error is significantly larger than with reference sensors. We also show that the accuracy is sufficient for applications relying on variations in air quality index values, such as hot spot detection.

28 citations


Journal ArticleDOI
TL;DR: The first independent and large-scale study of retention rates and usage trends is conducted on a dataset of app-usage data from a community of 339,842 users and more than 213,667 apps, and a novel app-usage trend measure is developed that provides instantaneous information about the popularity of an application.
Abstract: Popularity of mobile apps is traditionally measured by metrics such as the number of downloads, installations, or user ratings. A problem with these measures is that they reflect usage only indirectly. Indeed, retention rates, i.e., the number of days users continue to interact with an installed app, have been suggested to predict successful app lifecycles. We conduct the first independent and large-scale study of retention rates and usage trends on a dataset of app-usage data from a community of 339,842 users and more than 213,667 apps. Our analysis shows that, on average, applications lose 65% of their users in the first week, while very popular applications (top 100) lose only 35%. It also reveals, however, that many applications have more complex usage behaviour patterns due to seasonality, marketing, or other factors. To capture such effects, we develop a novel app-usage trend measure which provides instantaneous information about the popularity of an application. Analysis of our data using this trend filter shows that roughly 40% of all apps never gain more than a handful of users (Marginal apps). Less than 0.1% of the remaining 60% are constantly popular (Dominant apps), 1% have a quick drain of usage after an initial steep rise (Expired apps), and 6% continuously rise in popularity (Hot apps). From these, we can distinguish, for instance, trendsetters from copycat apps. We conclude by demonstrating that usage behaviour trend information can be used to develop better mobile app recommendations.
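The week-one retention metric discussed above can be sketched as follows; the log schema and helper are illustrative, not the paper's dataset format.

```python
from datetime import date, timedelta

# Hypothetical usage log: app -> user -> dates of interaction.
usage = {
    "app_a": {
        "u1": [date(2019, 1, 1), date(2019, 1, 9)],   # active after week 1
        "u2": [date(2019, 1, 1), date(2019, 1, 3)],   # churned within week 1
        "u3": [date(2019, 1, 2), date(2019, 1, 15)],  # active after week 1
    },
}

def week_one_retention(users):
    """Fraction of users seen again more than 7 days after first use."""
    retained = sum(
        1 for days in users.values()
        if any(d > min(days) + timedelta(days=7) for d in days)
    )
    return retained / len(users)

rate = week_one_retention(usage["app_a"])
print(f"week-1 retention: {rate:.2f}")   # 2 of 3 users retained
```

An "average" app by the paper's numbers would show roughly 0.35 here, while a top-100 app would show around 0.65.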

19 citations


Proceedings ArticleDOI
11 Mar 2019
TL;DR: A traffic morphing technique is proposed that shapes network traffic, making it more difficult to identify IoT devices and their activities; it provides protection against traffic analysis attacks and prevents privacy leakages for smart home users.
Abstract: Traffic analysis attacks allow an attacker to infer sensitive information about users by analyzing network traffic of user devices. These attacks are passive in nature and are difficult to detect. In this paper, we demonstrate that an adversary, with access to upstream traffic from a smart home network, can identify the device types and user interactions with IoT devices, with significant confidence. These attacks are practical even when device traffic is encrypted because they only utilize statistical properties, such as traffic rates, for analysis. In order to mitigate the privacy implications of traffic analysis attacks, we propose a traffic morphing technique, which shapes network traffic thus making it more difficult to identify IoT devices and their activities. Our evaluation shows that the proposed technique provides protection against traffic analysis attacks and prevents privacy leakages for smart home users.
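One simple form of traffic morphing (a sketch of the general idea, not the paper's exact algorithm) is to pad every packet up to the next fixed size bucket, so that per-device packet-size distributions become less distinctive to a passive observer.

```python
# Fixed padding buckets; real deployments would also shape timing.
BUCKETS = [128, 256, 512, 1024, 1500]

def morph(sizes):
    """Pad each packet length to the smallest bucket that fits it."""
    return [next(b for b in BUCKETS if s <= b) for s in sizes]

observed = [60, 130, 400, 1200]   # hypothetical IoT packet lengths
print(morph(observed))            # [128, 256, 512, 1500]
```

The trade-off is bandwidth overhead from the padding versus how much the size distribution is flattened; timing statistics would additionally need cover traffic or delayed transmission to hide.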

19 citations


Proceedings ArticleDOI
13 May 2019
TL;DR: The results demonstrate that high energy consumption and high latency decrease the likelihood of retaining an app, and a model for predicting retention based on performance metrics is developed that generalizes well across application categories, locations and other factors moderating the effect of performance.
Abstract: We contribute by quantifying the effect of network latency and battery consumption on mobile app performance and retention, i.e., users' decisions to continue or stop using apps. We perform our analysis by fusing two large-scale crowdsensed datasets collected by piggybacking on information captured by mobile apps. We find that app performance has an impact on its retention rate. Our results demonstrate that high energy consumption and high latency decrease the likelihood of retaining an app. Conversely, we show that reducing latency or energy consumption does not guarantee a higher likelihood of retention as long as they are within reasonable standards of performance. However, we also demonstrate that what is considered reasonable depends on what users have been accustomed to, with device and network characteristics and app category playing a role. As our second contribution, we develop a model for predicting retention based on performance metrics. We demonstrate the benefits of our model through empirical benchmarks which show that our model not only predicts retention accurately, but generalizes well across application categories, locations and other factors moderating the effect of performance.

18 citations


Proceedings ArticleDOI
22 Jul 2019
TL;DR: A feasibility study considers measurements collected from a smart office environment with a dense deployment of motion detectors, correlating the motion-detector measurements against air quality values, and demonstrates that there is indeed a connection between the extent of movement and PM2.5 concentration.
Abstract: Poor indoor air quality is a significant burden to society that can cause health issues and decrease productivity. According to research, indoor air quality is intrinsically linked with human activity and mobility. Indeed, mobility is directly linked with the transfer of small particles (e.g., PM2.5), and the extent of activity affects the production of CO2. Currently, however, estimating indoor air quality is difficult, requiring the deployment of highly specialized sensing devices that need to be carefully placed and maintained. In this paper, we contribute by examining the suitability of infrastructure-based motion detectors for indoor air quality estimation. Such sensors are increasingly being deployed into smart environments, e.g., to control lighting and ventilation for energy management purposes. Being able to take advantage of these sensors would thus provide a cost-effective solution for indoor air quality monitoring without the need to deploy additional sensors. We perform a feasibility study considering measurements collected from a smart office environment with a dense deployment of motion detectors, correlating the measurements obtained from motion detectors against air quality values. We consider two main pollutants, PM2.5 and CO2, and demonstrate that there is indeed a connection between the extent of movement and PM2.5 concentration. However, for CO2, no relationship can be established, mostly due to difficulties in distinguishing between people passing by and those residing long-term in the environment.
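The correlation analysis between movement and PM2.5 can be illustrated with a toy computation; the values below are synthetic, chosen only to show how such a relationship would be quantified.

```python
import numpy as np

# Synthetic hourly data: motion-detector event counts vs. PM2.5 readings.
motion = np.array([2, 5, 9, 14, 20, 26, 31, 35], dtype=float)  # events/hour
noise = np.array([0.2, -0.1, 0.3, -0.2, 0.1, 0.0, -0.3, 0.2])
pm25 = 3.0 + 0.4 * motion + noise                               # ug/m^3

# Pearson correlation between activity and particulate concentration.
r = np.corrcoef(motion, pm25)[0, 1]
print(f"Pearson r = {r:.3f}")   # strongly positive for this toy data
```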

17 citations


Journal ArticleDOI
28 Dec 2019 - Sensors
TL;DR: An input-adaptive proxy is proposed that selects input variables from other air quality variables based on their correlation coefficients with the output variable; it manages to give a full continuous BC estimate and can be further extended to estimate other air quality parameters.
Abstract: Missing data has been a challenge in air quality measurement. In this study, we develop an input-adaptive proxy, which selects input variables from other air quality variables based on their correlation coefficients with the output variable. The proxy uses an ordinary least squares regression model with robust optimization and limits the input variables to a maximum of three to avoid overfitting. The adaptive proxy learns from the data set and generates the best model as evaluated by the adjusted coefficient of determination (adjR2). In case of missing data in the input variables, the proposed adaptive proxy then uses the second-best model until all the missing data gaps are filled. As a case study, we estimated black carbon (BC) concentration by using the input-adaptive proxy at two sites in Helsinki, which respectively represent street canyon and urban background scenarios. Accumulation mode, traffic counts, nitrogen dioxide and lung deposited surface area are found as input variables in the top-ranked models. In contrast to a traditional proxy, which yields 20–80% of the data, the input-adaptive proxy manages to give a full continuous BC estimate. The newly developed adaptive proxy also gives generally accurate BC estimates (street canyon: adjR2 = 0.86–0.94; urban background: adjR2 = 0.74–0.91) depending on the season and day of the week. Due to its flexibility and reliability, the adaptive proxy can be further extended to estimate other air quality parameters. It can also act as an air quality virtual sensor supporting on-site measurements in the future.
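The input-selection step can be sketched as follows: rank candidate predictors by absolute correlation with the target and keep at most three before fitting ordinary least squares. The variable names echo the abstract, but all values are synthetic and this is only a simplified reading of the method.

```python
import numpy as np

# Synthetic candidate predictors (names mirror the abstract).
rng = np.random.default_rng(1)
n = 300
candidates = {
    "accumulation_mode": rng.normal(size=n),
    "traffic_count": rng.normal(size=n),
    "no2": rng.normal(size=n),
    "lung_deposited_sa": rng.normal(size=n),
}
# Synthetic black-carbon target driven by three of the four inputs.
bc = (2.0 * candidates["accumulation_mode"]
      + 1.5 * candidates["no2"]
      + 0.8 * candidates["traffic_count"]
      + rng.normal(0, 0.2, n))

# Rank by |correlation with target|, cap at three inputs (overfitting guard).
ranked = sorted(candidates,
                key=lambda k: abs(np.corrcoef(candidates[k], bc)[0, 1]),
                reverse=True)
selected = ranked[:3]

# OLS fit on the selected inputs, scored by adjusted R^2.
X = np.column_stack([np.ones(n)] + [candidates[k] for k in selected])
coef, *_ = np.linalg.lstsq(X, bc, rcond=None)
resid = bc - X @ coef
adj_r2 = 1 - (resid @ resid / (n - X.shape[1])) / np.var(bc, ddof=1)
print(selected, f"adjR2 = {adj_r2:.2f}")
```

When one of the selected inputs goes missing, the proxy described above falls back to the next-best model fitted on the remaining available variables.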

16 citations


Journal ArticleDOI
TL;DR: This paper proposes a performance prediction framework, called d-Simplexed, to build performance models with varied configurable parameters on Spark, and takes inspiration from the field of Computational Geometry to construct a d-dimensional mesh using Delaunay Triangulation over a selected set of features.
Abstract: Big Data processing systems (e.g., Spark) have a number of resource configuration parameters, such as memory size, CPU allocation, and the number of running nodes. Regular users and even expert administrators struggle to understand the mutual relation between different parameter configurations and the overall performance of the system. In this paper, we address this challenge by proposing a performance prediction framework, called d-Simplexed, to build performance models with varied configurable parameters on Spark. We take inspiration from the field of Computational Geometry to construct a d-dimensional mesh using Delaunay Triangulation over a selected set of features. From this mesh, we predict execution time for various feature configurations. To minimize the time and resources spent building a bootstrap model with a large number of configuration values, we propose an adaptive sampling technique that allows us to collect as few training points as required. Our evaluation on a cluster of computers using the WordCount, PageRank, Kmeans, and Join workloads in the HiBench benchmarking suite shows that we can achieve an estimation error rate of less than 5% while sampling less than 1% of the data.
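The geometric idea can be illustrated with SciPy, whose LinearNDInterpolator builds a Delaunay triangulation over the sample points and predicts an unseen point by linear interpolation inside the enclosing simplex. The features and runtimes below are invented, and this is not the d-Simplexed implementation itself.

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator

# Invented training data: (memory in GB, executor cores) -> runtime (s).
samples = np.array([[2, 1], [2, 4], [8, 1], [8, 4], [4, 2]], dtype=float)
runtime = np.array([120.0, 60.0, 90.0, 40.0, 70.0])

# Builds a Delaunay triangulation internally; queries interpolate
# barycentrically within the simplex containing the query point.
predict = LinearNDInterpolator(samples, runtime)
print(predict(4.0, 2.0))   # exact at a sampled configuration (~70.0)
print(predict(5.0, 2.5))   # interpolated inside the convex hull
```

Queries outside the convex hull of the training points return NaN, which is one reason adaptive sampling of the configuration space matters: new samples are needed where the mesh has no coverage or where the model's error is still high.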

14 citations


Proceedings ArticleDOI
09 Dec 2019
TL;DR: An approach is proposed that combines application-level partitioning and packet steering with a programmable NIC; it can reduce latency and improve throughput because it utilizes multicore systems efficiently, while applications can improve their partitioning scheme without impacting clients.
Abstract: A single CPU core is not fast enough to process packets arriving from the network on commodity NICs. Applications are therefore turning to application-level partitioning and NIC offload to exploit parallelism on multicore systems and relieve the CPU. Although NIC offload techniques are not new, programmable NICs have emerged as a way for custom packet processing offload. However, it is not clear what parts of the application should be offloaded to a programmable NIC for improving parallelism. We propose an approach that combines application-level partitioning and packet steering with a programmable NIC. Applications partition data in DRAM between CPU cores, and steer requests to the correct core by parsing L7 packet headers on a programmable NIC. This approach improves request-level parallelism but keeps the partitioning scheme transparent to clients. We believe this approach can reduce latency and improve throughput because it utilizes multicore systems efficiently, and applications can improve their partitioning scheme without impacting clients.

13 citations


Proceedings ArticleDOI
Yuxing Chen, Jiaheng Lu, Chen Chen, Mohammad A. Hoque, Sasu Tarkoma
03 Nov 2019
TL;DR: A simulation-based cost model is proposed to predict the performance of Spark jobs accurately; low-cost training is achieved by taking advantage of Monte Carlo (MC) simulation, which uses a small amount of data and resources to make reliable predictions for larger datasets and clusters.
Abstract: Spark is one of the prevalent big data analytical platforms. Configuring proper resource provisioning for Spark jobs is challenging but essential for organizations to save time, achieve high resource utilization, and remain cost-effective. In this paper, we study the challenge of determining the proper parameter values that meet the performance requirements of workloads while minimizing both resource cost and resource utilization time. We propose a simulation-based cost model to predict the performance of jobs accurately. We achieve low-cost training by taking advantage of a simulation framework, namely Monte Carlo (MC) simulation, which uses a small amount of data and resources to make a reliable prediction for larger datasets and clusters. The salient feature of our method is that it allows us to incur a low training cost while obtaining an accurate prediction. Through experiments with six benchmark workloads, we demonstrate that the cost model yields an average prediction error of less than 7% and that the recommendation achieves up to 5x resource cost savings.
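A toy Monte Carlo runtime prediction in the spirit described above (not the paper's cost model): draw per-task durations from a profile measured on a small sample run, then repeatedly simulate greedy scheduling of the full job over a fixed number of executor slots. The profile values and cluster size are invented.

```python
import random

random.seed(42)
# Per-task durations (s) profiled on a small data sample.
observed_task_s = [1.8, 2.1, 2.4, 1.9, 2.6, 2.2, 2.0, 2.3]

def simulate_job(n_tasks, n_slots, trials=200):
    """Average simulated makespan of n_tasks tasks over n_slots slots."""
    estimates = []
    for _ in range(trials):
        slots = [0.0] * n_slots
        for _ in range(n_tasks):
            i = slots.index(min(slots))       # next task -> earliest-free slot
            slots[i] += random.choice(observed_task_s)
        estimates.append(max(slots))          # job ends when the last slot does
    return sum(estimates) / len(estimates)

predicted = simulate_job(n_tasks=400, n_slots=16)
print(f"predicted runtime: {predicted:.1f} s")
```

The appeal of the approach is that only the small profiling run touches real resources; scaling to larger datasets or clusters just changes n_tasks and n_slots in the simulation.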

10 citations


Proceedings ArticleDOI
20 May 2019
TL;DR: This work develops communication modules using off-the-shelf components for visible light and ultrasound that increase the network capacity and the robustness of network connections across IoT devices, and provide efficient means to enable distance-bounding services.
Abstract: The number of deployed Internet of Things (IoT) devices is steadily increasing to manage and interact with community assets of smart cities, such as transportation systems and power plants. This may lead to degraded network performance due to the growing amount of network traffic and connections generated by various IoT devices. To tackle these issues, one promising direction is to leverage the physical proximity of communicating devices and inter-device communication to achieve low latency, bandwidth efficiency, and resilient services. In this work, we aim at enhancing the performance of indoor IoT communication (e.g., smart homes, SOHO) by taking advantage of emerging technologies such as visible light and ultrasound. This approach increases the network capacity and robustness of network connections across IoT devices, and provides efficient means to enable distance-bounding services. We have developed communication modules using off-the-shelf components for visible light and ultrasound and evaluate their network performance and energy consumption. In addition, we show the efficacy of our communication modules by applying them in a practical indoor IoT scenario to realize secure IoT group communication.

Proceedings ArticleDOI
13 May 2019
TL;DR: This work proposes a structure for an OS called parakernel, which eliminates most OS abstractions and provides interfaces for applications to leverage the full potential of the underlying hardware.
Abstract: I/O is getting faster in servers that have fast programmable NICs and non-volatile main memory operating close to the speed of DRAM, but single-threaded CPU speeds have stagnated. Applications cannot take advantage of modern hardware capabilities when using interfaces built around abstractions that assume I/O to be slow. We therefore propose a structure for an OS called parakernel, which eliminates most OS abstractions and provides interfaces for applications to leverage the full potential of the underlying hardware. The parakernel facilitates application-level parallelism by securely partitioning the resources and multiplexing only those resources that are not partitioned.

Journal ArticleDOI
TL;DR: This work proposes formulating the problem of private data release through probabilistic modeling, and demonstrates empirically, in an epidemiological study, that statistical discoveries can be reliably reproduced from the synthetic data.
Abstract: Differential privacy allows quantifying the privacy loss resulting from accessing sensitive personal data. Repeated accesses to the underlying data incur increasing loss. Releasing data as privacy-preserving synthetic data would avoid this limitation, but would leave open the problem of designing what kind of synthetic data to release. We propose formulating the problem of private data release through probabilistic modelling. This approach transforms the problem of designing the synthetic data into choosing a model for the data, and also allows incorporating prior knowledge, which improves the quality of the synthetic data. We demonstrate empirically, in an epidemiological study, that statistical discoveries can be reliably reproduced from the synthetic data. We expect the method to have broad use in creating high-quality anonymized data twins of key data sets for research.
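For contrast with the probabilistic-modelling approach above, a much simpler baseline for private synthetic data is a noisy-histogram release under the Laplace mechanism: perturb the histogram counts once, then sample synthetic records only from the noisy counts. Everything here — the bins, the epsilon value, the stand-in data — is illustrative.

```python
import math
import random

random.seed(7)
ages = [random.randint(20, 59) for _ in range(5000)]   # stand-in sensitive data

BINS = [(20, 29), (30, 39), (40, 49), (50, 59)]
EPSILON = 1.0   # privacy budget for this single release

def laplace(scale):
    """Sample Laplace(0, scale) via the inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

# Adding/removing one person changes one bin count by 1: sensitivity 1.
noisy = [max(0.0, sum(lo <= a <= hi for a in ages) + laplace(1 / EPSILON))
         for lo, hi in BINS]

# Sample a synthetic cohort of roughly the original size from the noisy
# histogram; only the noisy counts ever touch the raw data.
total = sum(noisy)
synthetic = [random.uniform(lo, hi)
             for (lo, hi), c in zip(BINS, noisy)
             for _ in range(round(5000 * c / total))]
print(len(synthetic), "synthetic records")
```

The paper's point is that choosing a better generative model (and encoding prior knowledge in it) produces synthetic data that preserves far more of the statistical structure than such a flat histogram does.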

Proceedings ArticleDOI
01 Mar 2019
TL;DR: DoubleEcho, a context-based copresence verification technique, leverages acoustic room impulse response (RIR) to mitigate the context-manipulation attacks that affect authentication and access control systems.
Abstract: Copresence verification based on context can improve usability and strengthen security of many authentication and access control systems. By sensing and comparing their surroundings, two or more devices can tell whether they are copresent and use this information to make access control decisions. To the best of our knowledge, all context-based copresence verification mechanisms to date are susceptible to context-manipulation attacks. In such attacks, a distributed adversary replicates the same context at the (different) locations of the victim devices, and induces them to believe that they are copresent. In this paper we propose DoubleEcho, a context-based copresence verification technique that leverages acoustic Room Impulse Response (RIR) to mitigate context-manipulation attacks. In DoubleEcho, one device emits a wide-band audible chirp and all participating devices record reflections of the chirp from the surrounding environment. Since RIR is, by its very nature, dependent on the physical surroundings, it constitutes a unique location signature that is hard for an adversary to replicate. We evaluate DoubleEcho by collecting RIR data with various mobile devices and in a range of different locations. We show that DoubleEcho mitigates context-manipulation attacks whereas all other approaches to date are entirely vulnerable to such attacks. DoubleEcho detects copresence (or lack thereof) in roughly 2 seconds and works on commodity devices.

Journal ArticleDOI
TL;DR: Crowd replication is developed as a novel sensor-assisted method for quantifying human behavior within public spaces, together with a novel, highly accurate pedestrian sensing solution for reconstructing movement trajectories from sensor traces captured during the replication process.
Abstract: A central challenge for public space design is to evaluate whether a given space promotes different types of activities. In this article, as our first contribution, we develop crowd replication as a novel sensor-assisted method for quantifying human behavior within public spaces. In crowd replication, a researcher is tasked with recording the behavior of people using a space while being instrumented with a mobile device that captures a sensor trace of the replicated movements and activities. Through mathematical modeling, behavioral indicators extracted from the replicated trajectories can be extrapolated to represent a larger target population. As our second contribution, we develop a novel highly accurate pedestrian sensing solution for reconstructing movement trajectories from sensor traces captured during the replication process. Our key insight is to tailor sensing to characteristics of the researcher performing replication, which allows reconstruction to operate robustly against variations in pace and other walking characteristics. We validate crowd replication through a case study carried out within a representative example of a metropolitan-scale public space. Our results show that crowd-replicated data closely mirrors human dynamics in public spaces and reduces overall data collection effort while producing high-quality indicators about behaviors and activities of people within the space. We also validate our pedestrian modeling approach through extensive benchmarks, demonstrating that our approach can reconstruct movement trajectories with high accuracy and robustness (median error below 1%). Finally, we demonstrate that our contributions enable capturing detailed indicators of liveliness, extent of social interaction, and other factors indicative of public space quality.

Proceedings ArticleDOI
01 Sep 2019
TL;DR: An experimental evaluation shows that a key-value store implemented with application-level partitioning and inter-thread messaging reduces tail latency by up to 71% compared to baseline Memcached running on commodity hardware and Linux, although the thread-per-core approach is held back by request steering and OS interfaces.
Abstract: The response time of an online service depends on the tail latency of a few of the applications it invokes in parallel to satisfy the requests. The individual applications are composed of one or more threads to fully utilize the available CPU cores, but this approach can incur serious overheads. The thread-per-core architecture has emerged to reduce these overheads, but it also has its challenges in thread synchronization and OS interfaces. Applications can mitigate both issues with different techniques, but their impact on application tail latency is an open question. We measure the impact of the thread-per-core architecture on application tail latency by implementing a key-value store that uses application-level partitioning and inter-thread messaging, and comparing its tail latency to Memcached, which uses a traditional key-value store design. We show in an experimental evaluation that our approach reduces tail latency by up to 71% compared to baseline Memcached running on commodity hardware and Linux. However, we observe that the thread-per-core approach is held back by request steering and OS interfaces, and it could be further improved with NIC hardware offload.
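The application-level partitioning idea can be sketched in a few lines: each key hashes to exactly one owning core, so a request steered to that core never contends with the others on the hot path. This is a simplified model of the technique, not the paper's implementation.

```python
from zlib import crc32

# One private table per core; no locks needed because each key has
# exactly one owner and requests are steered to it.
N_CORES = 4
shards = [dict() for _ in range(N_CORES)]

def owner(key: bytes) -> int:
    """Stable key -> core mapping used for request steering."""
    return crc32(key) % N_CORES

def put(key: bytes, value) -> None:
    shards[owner(key)][key] = value   # only the owning core writes here

def get(key: bytes):
    return shards[owner(key)].get(key)

put(b"user:42", "alice")
put(b"session:7", "token-xyz")
print(get(b"user:42"))   # "alice"
```

The paper's observation is that the steering itself (deciding which core a request belongs to, and handing it over) becomes the bottleneck, which is why pushing the steering into NIC hardware is attractive.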

Posted Content
13 Dec 2019
TL;DR: This article presents low-cost sensor technologies, surveys and assesses machine-learning-based techniques for their calibration, and presents open questions and directions for future research.
Abstract: In recent years, interest in monitoring air quality has been growing. Traditional environmental monitoring stations are very expensive, both to acquire and to maintain, therefore their deployment is generally very sparse. This is a problem when trying to generate air quality maps with a fine spatial resolution. Given the general interest in air quality monitoring, low-cost air quality sensors have become an active area of research and development. Low-cost air quality sensors can be deployed at a finer level of granularity than traditional monitoring stations. Furthermore, they can be portable and mobile. Low-cost air quality sensors, however, present some challenges: they suffer from cross-sensitivities between different ambient pollutants; they can be affected by external factors such as traffic, weather changes, and human behavior; and their accuracy degrades over time. Some promising machine learning approaches can help us obtain highly accurate measurements with low-cost air quality sensors. In this article, we present low-cost sensor technologies, and we survey and assess machine learning-based calibration techniques for their calibration. We conclude by presenting open questions and directions for future research.

Proceedings ArticleDOI
20 May 2019
TL;DR: μslicing is introduced, where each μslice is tied to a specific pair of ingress and egress nodes of a slice, and the bandwidth consumed by each μslice is optimized by co-locating its VNF instances that communicate the most.
Abstract: 5G networks leverage network slices for serving use cases with different and potentially conflicting requirements. Current approaches for composing network slices largely overlook the presence of multiple ingress and egress nodes in each network slice. This results in inefficient resource usage even when co-locating the Virtual Network Functions (VNFs) of a slice. The network is thus quickly saturated, and no more use cases can be served. We address this issue by introducing μslices, where each μslice is tied to a specific pair of ingress and egress nodes of a slice. Then, we optimize the bandwidth consumed by each μslice by co-locating its VNF instances that communicate the most. We observed that enabling μslicing saved about two times more control and data plane traffic and also resulted in a more even distribution of computational and link load.

Book ChapterDOI
04 Dec 2019
TL;DR: This work proposes scalable air quality monitoring by leveraging low-cost air pollution sensors, artificial intelligence methods, and versatile connectivity provided by 4G/5G, and describes pilot deployments for testing the developed sensing technologies in Helsinki, Finland.
Abstract: Air pollution has become a global challenge during the growth of megacities, which drives the deployment of air quality monitoring in order to understand and mitigate district level air pollution. Currently, air pollution monitoring mainly relies on high-end accurate reference stations, which are usually stationary and expensive. Thus, the air quality monitoring deployments are typically coarse grained with only a very small number of stations in a city. We propose scalable air quality monitoring by leveraging low-cost air pollution sensors, artificial intelligence methods, and versatile connectivity provided by 4G/5G. We describe pilot deployments for testing the developed sensing technologies in three different locations in Helsinki, Finland.

Patent
08 Oct 2019
TL;DR: A method for generating an estimation of earth's gravity is presented: a stability value is determined from acceleration data values and the magnitude of orientation change derived from orientation data values, compared to a threshold, and the gravity estimate is generated only when the device is sufficiently stable.
Abstract: Disclosed is a method for generating an estimation of earth's gravity. The method includes: obtaining one or more acceleration data values and one or more orientation data values over a period of time; generating magnitude of orientation change from the orientation data values; determining a stability value based on the acceleration data values and the magnitude of orientation change; comparing the determined stability value to a threshold value; and generating an estimation of earth's gravity over the period of time on the basis of the acceleration data values if the comparison indicates that the determined stability value is below the threshold value. Also disclosed is an apparatus implementing the method.
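A hedged sketch of the claimed idea: treat a window of samples as stable when the acceleration magnitude varies little and the orientation barely changes, and only then average the accelerometer samples as the gravity estimate. The stability formula and threshold below are invented for illustration, not taken from the patent.

```python
import math

# Sample window: accelerometer readings (m/s^2) and device orientation.
accel = [(0.02, 0.01, 9.81), (0.01, -0.02, 9.80), (0.00, 0.02, 9.82)]
orientation_deg = [45.0, 45.1, 44.9]

def estimate_gravity(accel, orientation, stability_threshold=0.5):
    orient_change = max(orientation) - min(orientation)
    mags = [math.sqrt(x * x + y * y + z * z) for x, y, z in accel]
    accel_spread = max(mags) - min(mags)
    stability = orient_change + accel_spread   # one simple stability value
    if stability >= stability_threshold:
        return None   # too much motion: refuse to estimate
    n = len(accel)
    return tuple(sum(s[i] for s in accel) / n for i in range(3))

g = estimate_gravity(accel, orientation_deg)
print(g)   # averaged gravity vector; z component close to 9.81
```

The gate matters because averaging accelerometer samples while the device is rotating or accelerating would mix motion forces into the gravity estimate.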

Posted Content
TL;DR: This article proposes two efficient Bloom Multifilters, called Bloom Matrix and Bloom Vector, which are space efficient and answer queries with a set of identifiers for multiple set matching problems, and shows that space efficiency can be optimized further according to the distribution of labels among the sets (uniform or Zipf).
Abstract: A Bloom filter is a space-efficient probabilistic data structure for checking elements' membership in a set. Given multiple sets, however, a standard Bloom filter is not sufficient when looking for the items to which an element or a set of input elements belong. In this article, we solve the multiple set matching problem by proposing two efficient Bloom Multifilters called Bloom Matrix and Bloom Vector. Both of them are space efficient and answer queries with a set of identifiers for multiple set matching problems. We show that the space efficiency can be optimized further according to the distribution of labels among multiple sets: uniform and Zipf. While both of them are space efficient, Bloom Vector can efficiently exploit a Zipf distribution of data for further space reduction. Our results also highlight that basic ADD and LOOKUP operations on Bloom Matrix are faster than on Bloom Vector. However, Bloom Matrix does not meet the theoretical false positive rate of less than 10^-2 for LOOKUP operations if the represented data or the labels are not uniformly distributed among the multiple sets. Consequently, we introduce Bloom Test, which uses Bloom Matrix as the pre-filter structure to determine which structure is suitable for improved performance with an arbitrary input dataset.
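A minimal multi-set lookup in the spirit of a Bloom Multifilter: one plain Bloom filter per labelled set, with LOOKUP returning the labels of every set whose filter reports the element. This is a naive baseline rather than the Bloom Matrix or Bloom Vector layouts; the filter size and hash count are illustrative.

```python
from hashlib import sha256

M, K = 1024, 3   # bits per filter, hash functions per item

def positions(item: str):
    """Derive K bit positions from independent salted hashes."""
    return [int.from_bytes(sha256(f"{i}:{item}".encode()).digest()[:4], "big") % M
            for i in range(K)]

def add(filt, item):
    for p in positions(item):
        filt[p] = 1

def contains(filt, item):
    # False positives possible, false negatives not.
    return all(filt[p] for p in positions(item))

filters = {"fruits": [0] * M, "colors": [0] * M}
for w in ("apple", "orange"):
    add(filters["fruits"], w)
for w in ("orange", "red"):
    add(filters["colors"], w)

def lookup(item):
    """Return all set labels whose filter matches the item."""
    return sorted(label for label, f in filters.items() if contains(f, item))

print(lookup("orange"))   # ['colors', 'fruits']
```

The Bloom Matrix and Bloom Vector structures improve on this baseline by sharing bit positions across the labelled sets, which is what makes their space usage sensitive to whether labels are uniformly or Zipf distributed.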
