Showing papers in "Proceedings of the ACM on interactive, mobile, wearable and ubiquitous technologies in 2022"


Journal ArticleDOI
TL;DR: MicroFluID as discussed by the authors is a novel RFID artifact based on a multiple-chip structure and microfluidic switches, which informs the input state by directly reading variable ID information instead of retrieving primitive signals.
Abstract: RFID has been widely used for activity and gesture recognition in emerging interaction paradigms given its low cost, lightweight, and pervasiveness. However, current learning-based approaches to RFID sensing require significant effort in data collection, feature extraction, and model training. To save data processing effort, we present MicroFluID, a novel RFID artifact based on a multiple-chip structure and microfluidic switches, which informs the input state by directly reading variable ID information instead of retrieving primitive signals. Fabricated on flexible substrates, four types of microfluidic switch circuits are designed to respond to external physical events, including pressure, bend, temperature, and gravity. By default, the chips are disconnected from the circuit owing to reserved gaps in the transmission line. When an external input or status change occurs, conductive liquid floating in the microfluidic channels fills the gap(s), creating a connection to certain chip(s). In prototyping the device, we conducted a series of simulations and experiments to explore the feasibility of the multi-chip tag design, key fabrication parameters, interaction performance, and users' perceptions.

44 citations


Journal ArticleDOI
TL;DR: This paper presents a novel technique called Collaborative Self-Supervised Learning (ColloSSL) which leverages unlabeled data collected from multiple devices worn by a user to learn high-quality features of the data, and experimental results show that ColloSSL outperforms both fully-supervised and semi-supervised learning techniques in the majority of experiment settings.
Abstract: A major bottleneck in training robust Human-Activity Recognition (HAR) models is the need for large-scale labeled sensor datasets. Because labeling large amounts of sensor data is an expensive task, unsupervised and semi-supervised learning techniques have emerged that can learn good features from the data without requiring any labels. In this paper, we extend this line of research and present a novel technique called Collaborative Self-Supervised Learning (ColloSSL) which leverages unlabeled data collected from multiple devices worn by a user to learn high-quality features of the data. A key insight that underpins the design of ColloSSL is that unlabeled sensor datasets simultaneously captured by multiple devices can be viewed as natural transformations of each other, and leveraged to generate a supervisory signal for representation learning. We present three technical innovations to extend conventional self-supervised learning algorithms to a multi-device setting: a Device Selection approach which selects positive and negative devices to enable contrastive learning, a Contrastive Sampling algorithm which samples positive and negative examples in a multi-device setting, and a loss function called Multi-view Contrastive Loss which extends standard contrastive loss to a multi-device setting. Our experimental results on three multi-device datasets show that ColloSSL outperforms both fully-supervised and semi-supervised learning techniques in the majority of experiment settings, resulting in an absolute increase of up to 7.9% in F1 score compared to the best performing baselines. We also show that ColloSSL outperforms the fully-supervised methods in a low-data regime, by using just one-tenth of the available labeled data in the best case.
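ColloSSL's exact Device Selection and Contrastive Sampling procedures are described in the paper; as a rough illustration of the general idea they build on (an InfoNCE-style contrastive objective in which time-aligned windows from other body-worn devices serve as positives and misaligned or distant-device windows as negatives), here is a minimal NumPy sketch. The function name, shapes, and temperature value are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def multi_device_contrastive_loss(anchor, positives, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor embedding (illustrative, not ColloSSL's code).

    anchor:    (d,) embedding of a sensor window from the anchor device
    positives: (P, d) embeddings of time-aligned windows from positive devices
    negatives: (N, d) embeddings of misaligned windows / negative devices
    """
    def l2norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    a, pos, neg = l2norm(anchor), l2norm(positives), l2norm(negatives)
    pos_sim = pos @ a / temperature          # cosine similarities to positives
    neg_sim = neg @ a / temperature          # cosine similarities to negatives
    # -log( exp(pos) / (exp(pos) + sum(exp(neg))) ), averaged over positives
    losses = [-s + np.log(np.exp(s) + np.exp(neg_sim).sum()) for s in pos_sim]
    return float(np.mean(losses))

# toy usage: 64-dim embeddings, 2 positive and 5 negative windows
rng = np.random.default_rng(0)
print(multi_device_contrastive_loss(rng.normal(size=64),
                                    rng.normal(size=(2, 64)),
                                    rng.normal(size=(5, 64))))
```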

28 citations


Journal ArticleDOI
TL;DR: This work constructed four design resources for reflection: temporal perspective, conversation, comparison and discovery, and identified design patterns in past digital artefacts that implement the resources.
Abstract: Reflection is a commonly addressed design goal in commercial systems and in Human-Computer Interaction (HCI) research. Yet, it is still unclear what tools are at the disposal of designers who want to build systems that support reflection. Understanding the design space of reflection support systems and the interaction techniques that can foster reflection is necessary to enable building technologies that contribute to the users' well-being. In order to gain additional insight into how interactive artefacts foster reflection, we investigated past research prototypes and reflection-supporting smartphone applications (apps). Through a structured literature review and an analysis of app reviews, we constructed four design resources for reflection: temporal perspective, conversation, comparison and discovery. We also identified design patterns in past digital artefacts that implement the resources. Our work constitutes intermediate-level knowledge that is intended to inspire future technologies that better support reflection.

18 citations


Journal ArticleDOI
TL;DR: This paper assesses the progress of self-supervised HAR research by introducing a framework that performs a multi-faceted exploration of model performance, and utilizes this framework to assess seven state-of-the-art self- supervised methods for HAR, leading to the formulation of insights into the properties of these techniques and to establish their value towards learning representations for diverse scenarios.
Abstract: The emergence of self-supervised learning in the field of wearables-based human activity recognition (HAR) has opened up opportunities to tackle the most pressing challenges in the field, namely to exploit unlabeled data to derive reliable recognition systems for scenarios where only small amounts of labeled training samples can be collected. As such, self-supervision, i.e., the paradigm of 'pretrain-then-finetune', has the potential to become a strong alternative to the predominant end-to-end training approaches, let alone hand-crafted features for the classic activity recognition chain. Recently, a number of contributions have been made that introduced self-supervised learning into the field of HAR, including Multi-task self-supervision, Masked Reconstruction, CPC, and SimCLR, to name but a few. With the initial success of these methods, the time has come for a systematic inventory and analysis of the potential self-supervised learning has for the field. This paper provides exactly that. We assess the progress of self-supervised HAR research by introducing a framework that performs a multi-faceted exploration of model performance. We organize the framework into three dimensions, each containing three constituent criteria, such that each dimension captures specific aspects of performance, including the robustness to differing source and target conditions, the influence of dataset characteristics, and the feature space characteristics. We utilize this framework to assess seven state-of-the-art self-supervised methods for HAR, leading to the formulation of insights into the properties of these techniques and to establish their value towards learning representations for diverse scenarios.

17 citations


Journal ArticleDOI
TL;DR: A system called LASense is proposed, which can significantly increase the sensing range for fine-grained human activities using a single pair of speaker and microphone using a virtual transceiver idea that purely leverages delicate signal processing techniques in software.
Abstract: Acoustic signals have been widely adopted in sensing fine-grained human activities, including respiration monitoring, finger tracking, eye blink detection, etc. One major challenge for acoustic sensing is the extremely limited sensing range, which becomes even more severe when sensing fine-grained activities. Different from the prior efforts that adopt multiple microphones and/or advanced deep learning techniques for long sensing range, we propose a system called LASense, which can significantly increase the sensing range for fine-grained human activities using a single pair of speaker and microphone. To achieve this, LASense introduces a virtual transceiver idea that purely leverages delicate signal processing techniques in software. To demonstrate the effectiveness of LASense, we apply the proposed approach to three fine-grained human activities, i.e., respiration, finger tapping and eye blink. For respiration monitoring, we significantly increase the sensing range from the state-of-the-art 2 m to 6 m. For finer-grained finger tapping and eye blink detection, we increase the state-of-the-art sensing range by 150% and 80%, respectively.

16 citations


Journal ArticleDOI
TL;DR: GoPose is environment-independent and is highly accurate in constructing 3D poses for mobile users and achieves around 4.7cm of accuracy under various scenarios including tracking unseen activities and under NLoS scenarios.
Abstract: This paper presents GoPose, a 3D skeleton-based human pose estimation system that uses WiFi devices at home. Our system leverages the WiFi signals reflected off the human body for 3D pose estimation. In contrast to prior systems that need specialized hardware or dedicated sensors, our system does not require a user to wear or carry any sensors and can reuse the WiFi devices that already exist in a home environment for mass adoption. To realize such a system, we leverage the 2D AoA spectrum of the signals reflected from the human body and deep learning techniques. In particular, the 2D AoA spectrum is proposed to locate different parts of the human body as well as to enable environment-independent pose estimation. Deep learning is incorporated to model the complex relationship between the 2D AoA spectrums and the 3D skeletons of the human body for pose tracking. We evaluate GoPose in different home environments with various activities performed by multiple users. The evaluation shows that GoPose is environment-independent and highly accurate in constructing 3D poses for mobile users, achieving around 4.7 cm of accuracy under various scenarios, including tracking unseen activities and under NLoS scenarios.
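The construction of the 2D AoA spectrum in GoPose is specific to the paper; as a generic, hedged illustration of how a 2D (azimuth-elevation) angle spectrum can be computed from antenna-array snapshots with conventional Bartlett beamforming, assuming a small uniform rectangular array with half-wavelength spacing (array geometry, grid resolution, and function name are assumptions, not GoPose's algorithm):

```python
import numpy as np

def aoa_spectrum_2d(snapshots, nx=3, ny=3, spacing=0.5):
    """Bartlett (conventional beamforming) 2D angle spectrum, illustrative only.

    snapshots: (nx*ny, num_samples) complex baseband samples; antenna (ix, iy)
               maps to row ix*ny + iy; spacing is in wavelengths."""
    R = snapshots @ snapshots.conj().T / snapshots.shape[1]      # spatial covariance
    az_grid = np.deg2rad(np.arange(-90, 91, 2))
    el_grid = np.deg2rad(np.arange(0, 91, 2))
    ix, iy = np.meshgrid(np.arange(nx), np.arange(ny), indexing="ij")
    spectrum = np.zeros((len(az_grid), len(el_grid)))
    for i, az in enumerate(az_grid):
        for j, el in enumerate(el_grid):
            # phase progression across the planar array for direction (az, el)
            phase = 2j * np.pi * spacing * (ix * np.cos(el) * np.cos(az)
                                            + iy * np.cos(el) * np.sin(az))
            a = np.exp(phase).reshape(-1)                        # steering vector
            spectrum[i, j] = np.real(a.conj() @ R @ a)           # beamformer output power
    return az_grid, el_grid, spectrum
```

Peaks in such a spectrum indicate the directions of dominant reflections; per the abstract, GoPose feeds 2D AoA spectra (rather than raw signals) into the downstream deep model to locate body parts and reduce environment dependence.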

14 citations


Journal ArticleDOI
TL;DR: The results of an online study comparing visualizations and their combinations on these three levels indicate the presence of overtrust in AVs.
Abstract: The successful introduction of automated vehicles (AVs) depends on the user's acceptance. To gain acceptance, the intended user must trust the technology, which itself relies on an appropriate understanding. Visualizing internal processes could aid in this. For example, the functional hierarchy of autonomous vehicles distinguishes between perception, prediction, and maneuver planning. In each of these stages, visualizations including possible uncertainties (or errors) are possible. Therefore, we report the results of an online study (N=216) comparing visualizations and their combinations on these three levels using a pre-recorded real-world video with visualizations shown on a simulated augmented reality windshield. Effects on trust, cognitive load, situation awareness, and perceived safety were measured. Situation Prediction-related visualizations were perceived as worse than the remaining levels. Based on a negative evaluation of the visualization, the abilities of the AV were also judged worse. In general, the results indicate the presence of overtrust in AVs.

14 citations


Journal ArticleDOI
TL;DR: A quality-oriented signal processing framework is proposed that maximizes the contribution of the high-quality signal segments and minimizes the impact of low- quality signal segments to improve the performance of gesture recognition applications.
Abstract: WiFi-based gesture recognition has emerged in recent years and attracts extensive attention from researchers. Recognizing gestures via WiFi signals is feasible because a human gesture introduces a time series of variations to the received raw signal. The major challenge for building a ubiquitous gesture recognition system is that the mapping between each gesture and the series of signal variations is not unique: exactly the same gesture performed at different locations or with different orientations towards the transceivers generates entirely different gesture signals (variations). To remove the location dependency, prior work proposes to use gesture-level location-independent features to characterize the gesture instead of directly matching the signal variation pattern. We observe that gesture-level features cannot fully remove the location dependency since the signal qualities inside each gesture differ and also depend on the location. Therefore, we divide the signal time series of each gesture into segments according to their qualities and propose customized signal processing techniques to handle them separately. To realize this goal, we characterize the signal's sensing quality by building a mathematical model that links the gesture signal with the ambient noise, from which we further derive a unique metric, i.e., the error of dynamic phase index (EDP-index), to quantitatively describe the sensing quality of the signal segments of each gesture. We then propose a quality-oriented signal processing framework that maximizes the contribution of the high-quality signal segments and minimizes the impact of low-quality signal segments to improve the performance of gesture recognition applications. We develop a prototype on COTS WiFi devices. The extensive experimental results demonstrate that our system can recognize gestures with an accuracy of more than 94% on average, with significant improvements over state-of-the-art systems.

14 citations


Journal ArticleDOI
TL;DR: This paper proposes FLAME, a user-centered FL training approach to counter statistical and system heterogeneity in MDEs, and bring consistency in inference performance across devices.
Abstract: Federated Learning (FL) enables distributed training of machine learning models while keeping personal data on user devices private. While we witness increasing applications of FL in the area of mobile sensing, such as human activity recognition (HAR), FL has not been studied in the context of a multi-device environment (MDE), wherein each user owns multiple data-producing devices. With the proliferation of mobile and wearable devices, MDEs are increasingly becoming popular in ubicomp settings, therefore necessitating the study of FL in them. FL in MDEs is characterized by data that are not independent and identically distributed (non-IID) across clients, complicated by the presence of both user and device heterogeneities. Further, ensuring efficient utilization of system resources on FL clients in an MDE remains an important challenge. In this paper, we propose FLAME, a user-centered FL training approach to counter statistical and system heterogeneity in MDEs and bring consistency in inference performance across devices. FLAME features (i) user-centered FL training utilizing the time alignment across devices from the same user; (ii) accuracy- and efficiency-aware device selection; and (iii) model personalization to devices. We also present an FL evaluation testbed with realistic energy drain and network bandwidth profiles, and a novel class-based data partitioning scheme to extend existing HAR datasets to a federated setup. Our experiment results on three multi-device HAR datasets show that FLAME outperforms various baselines, with 4.3-25.8% higher F1 scores, 1.02-2.86× greater energy efficiency, and up to 2.06× speedup in convergence to target accuracy through fair distribution of the FL workload.
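FLAME's accuracy- and efficiency-aware device selection and per-device personalization are the paper's contributions; the sketch below only illustrates the weighted federated-averaging step that such a system builds on. The function name, weight semantics, and the toy example are assumptions, not FLAME's implementation.

```python
import numpy as np

def federated_average(client_updates, client_weights):
    """Weighted FedAvg over client model parameters (illustrative sketch).

    client_updates: list of dicts {param_name: np.ndarray}, one per selected device
    client_weights: non-negative aggregation weights (e.g., local sample counts,
                    or a score that also reflects accuracy/energy, as a FLAME-like
                    device selection might produce)."""
    w = np.asarray(client_weights, dtype=float)
    w = w / w.sum()
    global_model = {}
    for name in client_updates[0]:
        stacked = np.stack([upd[name] for upd in client_updates], axis=0)
        global_model[name] = np.tensordot(w, stacked, axes=1)   # weighted average
    return global_model

# toy round: two devices of one user contribute a tiny two-parameter model
clients = [{"w": np.ones((3, 3)), "b": np.zeros(3)},
           {"w": 3 * np.ones((3, 3)), "b": np.ones(3)}]
print(federated_average(clients, client_weights=[100, 300])["w"][0, 0])  # 2.5
```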

13 citations


Journal ArticleDOI
TL;DR: A new metric named SSNR (sensing-signal-to-noise-ratio) is proposed to quantify the sensing capability of WiFi systems and it is demonstrated that by properly placing the transmitter and receiver, the coverage of human walking sensing can be expanded by around 200%.
Abstract: WiFi-based contactless sensing has found numerous applications in the fields of smart home and health care owing to its low-cost, non-intrusive and privacy-preserving characteristics. While promising in many aspects, the limited sensing range and interference issues still exist, hindering the adoption of WiFi sensing in the real world. In this paper, inspired by the SNR (signal-to-noise ratio) metric in communication theory, we propose a new metric named SSNR (sensing-signal-to-noise-ratio) to quantify the sensing capability of WiFi systems. We theoretically model the effect of transmitter-receiver distance on sensing coverage. We show that in the LoS scenario, the sensing coverage area first increases from a small oval to a maximal one and then decreases. When the transmitter-receiver distance further increases, the coverage area is separated into two ovals located around the two transceivers respectively. We demonstrate that, instead of applying complex signal processing schemes or advanced hardware, just properly placing the transmitter and receiver can greatly mitigate the two well-known issues in WiFi sensing (i.e., small range and severe interference). Specifically, by properly placing the transmitter and receiver, the coverage of human walking sensing can be expanded by around 200%. By increasing the transmitter-receiver distance, a target's fine-grained respiration can still be accurately sensed with one interferer sitting just 0.5 m away.

13 citations


Journal ArticleDOI
TL;DR: DeXAR is proposed, a novel methodology to transform sensor data into semantic images to take advantage of XAI methods based on Convolutional Neural Networks (CNN) and generate explanations in natural language from the resulting heat maps.
Abstract: The sensor-based recognition of Activities of Daily Living (ADLs) in smart-home environments is an active research area, with relevant applications in healthcare and ambient assisted living. The application of Explainable Artificial Intelligence (XAI) to ADLs recognition has the potential of making this process trusted, transparent and understandable. The few works that investigated this problem considered only interpretable machine learning models. In this work, we propose DeXAR, a novel methodology to transform sensor data into semantic images to take advantage of XAI methods based on Convolutional Neural Networks (CNN). We apply different XAI approaches for deep learning and, from the resulting heat maps, we generate explanations in natural language. In order to identify the most effective XAI method, we performed extensive experiments on two different datasets, with both a common-knowledge and a user-based evaluation. The results of a user study show that the white-box XAI method based on prototypes is the most effective.

Journal ArticleDOI
TL;DR: This paper introduces semantic-aware Mixup that considers the activity semantic ranges to overcome the semantic inconsistency brought by domain differences and introduces the large margin loss to enhance the discrimination of Mixup to prevent misclassification brought by noisy virtual labels.
Abstract: It is expensive and time-consuming to collect sufficient labeled data to build human activity recognition (HAR) models. Training on existing data often makes the model biased towards the distribution of the training data, thus the model might perform terribly on test data with different distributions. Although existing efforts on transfer learning and domain adaptation try to solve the above problem, they still need access to unlabeled data on the target domain, which may not be possible in real scenarios. Few works pay attention to training a model that can generalize well to unseen target domains for HAR. In this paper, we propose a novel method called Semantic-Discriminative Mixup (SDMix) for generalizable cross-domain HAR. Firstly, we introduce semantic-aware Mixup that considers the activity semantic ranges to overcome the semantic inconsistency brought by domain differences. Secondly, we introduce the large margin loss to enhance the discrimination of Mixup to prevent misclassification brought by noisy virtual labels. Comprehensive generalization experiments on five public datasets demonstrate that our SDMix substantially outperforms the state-of-the-art approaches with 6% average accuracy improvement on cross-person, cross-dataset, and cross-position HAR.
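SDMix's semantic-aware mixing range and large-margin loss are the core contributions and are not reproduced here; as a point of reference, the vanilla mixup augmentation that SDMix extends can be sketched as follows (shapes and the beta parameter are illustrative assumptions):

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2, rng=None):
    """Vanilla mixup for sensor windows x of shape (batch, time, channels) and
    one-hot labels y_onehot of shape (batch, classes).  SDMix additionally
    constrains the mixing by activity semantic ranges and trains with a large
    margin loss, which this sketch does not implement."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)                 # mixing coefficient
    perm = rng.permutation(len(x))               # random partner for each sample
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix

# toy usage: 8 windows of 100 samples x 3 accelerometer axes, 5 activity classes
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 100, 3))
y = np.eye(5)[rng.integers(0, 5, size=8)]
x_mix, y_mix = mixup_batch(x, y, rng=rng)
```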

Journal ArticleDOI
TL;DR: It is anticipated that EarCommand can serve as an efficient, intelligent speech interface for hands-free operation, which could significantly improve the quality and convenience of interactions.
Abstract: Intelligent speech interfaces have been developing rapidly to support the growing demands for convenient control of and interaction with wearable/earable and portable devices. To avoid privacy leakage during speech interactions and strengthen resistance to ambient noise, silent speech interfaces have been widely explored to enable people's interaction with mobile/wearable devices without audible sounds. However, most existing silent speech solutions require either restricted background illumination or hand involvement to hold the device or perform gestures. In this study, we propose a novel earphone-based, hands-free silent speech interaction approach, named EarCommand. Our technique discovers the relationship between the deformation of the ear canal and the movements of the articulator and takes advantage of this link to recognize different silent speech commands. Our system achieves a WER (word error rate) of 10.02% for word-level recognition and 12.33% for sentence-level recognition when tested on human subjects with 32 word-level commands and 25 sentence-level commands, which indicates the effectiveness of inferring silent speech commands. Moreover, EarCommand shows high reliability and robustness in a variety of configuration settings and environmental conditions. It is anticipated that EarCommand can serve as an efficient, intelligent speech interface for hands-free operation, which could significantly improve the quality and convenience of interactions.

Journal ArticleDOI
TL;DR: This paper approaches SLT as a spatio-temporal machine translation task and proposes a wearable-based system, WearSign, to enable direct translation from the sign-induced sensory signals into spoken texts and includes the synthetic pairs into the training process, which enables the network to learn better sequence-to-sequence mapping.
Abstract: Sign language translation (SLT) is considered the core technology for breaking the communication barrier between deaf and hearing people. However, most studies only focus on recognizing the sequence of sign gestures (sign language recognition (SLR)), ignoring the significant difference in linguistic structure between sign language and spoken language. In this paper, we approach SLT as a spatio-temporal machine translation task and propose a wearable-based system, WearSign, to enable direct translation from sign-induced sensory signals into spoken texts. WearSign leverages a smartwatch and an armband of ElectroMyoGraphy (EMG) sensors to capture the sophisticated sign gestures. In the design of the translation network, considering the significant modality and linguistic gap between sensory signals and spoken language, we design a multi-task encoder-decoder framework which uses sign glosses (sign gesture labels) for intermediate supervision to guide the end-to-end training. In addition, due to the lack of sufficient training data, the performance of prior studies usually degrades drastically on sentences with complex structures or ones unseen in the training set. To tackle this, we borrow the idea of back-translation and leverage the much more abundant spoken language data to synthesize paired sign language data. We include the synthetic pairs in the training process, which enables the network to learn better sequence-to-sequence mappings as well as generate more fluent spoken language sentences. We construct an American Sign Language (ASL) dataset consisting of 250 commonly used sentences gathered from 15 volunteers. WearSign achieves 4.7% and 8.6% word error rate (WER) in user-independent tests and unseen sentence tests, respectively. We also implement a real-time version of WearSign which runs fully on a smartphone with low latency and energy overhead.

Journal ArticleDOI
TL;DR: TinyOdom is introduced, a framework for training and deploying neural inertial models on URC hardware and a magnetometer, physics, and velocity-centric sequence learning formulation robust to preceding inertial perturbations that significantly improve localization performance even with notably lightweight models.
Abstract: Deep inertial sequence learning has shown promising odometric resolution over model-based approaches for trajectory estimation in GPS-denied environments. However, existing neural inertial dead-reckoning frameworks are not suitable for real-time deployment on ultra-resource-constrained (URC) devices due to substantial memory, power, and compute bounds. Current deep inertial odometry techniques also suffer from gravity pollution, high-frequency inertial disturbances, varying sensor orientation, heading rate singularity, and failure in altitude estimation. In this paper, we introduce TinyOdom, a framework for training and deploying neural inertial models on URC hardware. TinyOdom exploits hardware- and quantization-aware Bayesian neural architecture search (NAS) and a temporal convolutional network (TCN) backbone to train lightweight models targeted towards URC devices. In addition, we propose a magnetometer-, physics-, and velocity-centric sequence learning formulation robust to preceding inertial perturbations. We also expand 2D sequence learning to 3D using a model-free barometric g-h filter robust to inertial and environmental variations. We evaluate TinyOdom for a wide spectrum of inertial odometry applications and target hardware against competing methods. Specifically, we consider four applications: pedestrian, animal, aerial, and underwater vehicle dead-reckoning. Across different applications, TinyOdom reduces the size of neural inertial models by 31× to 134× with 2.5 m to 12 m error in 60 seconds, enabling the direct deployment of models on URC devices while maintaining or exceeding the localization resolution of the state-of-the-art. The proposed barometric filter tracks altitude within ±0.1 m and is robust to inertial disturbances and ambient dynamics. Finally, our ablation study shows that the introduced magnetometer-, physics-, and velocity-centric sequence learning formulation significantly improves localization performance even with notably lightweight models.
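The model-free barometric g-h filter mentioned in the abstract is a classic fixed-gain tracker; a generic version is sketched below (gains, initial state, and the toy usage are illustrative assumptions, and TinyOdom's exact variant may differ):

```python
def gh_filter(measurements, dt, g=0.2, h=0.02, z0=0.0, v0=0.0):
    """Generic g-h (alpha-beta) filter, e.g. for smoothing barometric altitude.

    measurements: iterable of noisy altitude readings (m)
    dt:           sampling interval (s)
    g, h:         position / velocity correction gains (tuning constants)
    Returns the list of filtered altitude estimates."""
    z_est, v_est, out = z0, v0, []
    for z_meas in measurements:
        z_pred = z_est + v_est * dt          # predict with current velocity estimate
        residual = z_meas - z_pred
        z_est = z_pred + g * residual        # correct position
        v_est = v_est + h * residual / dt    # correct velocity
        out.append(z_est)
    return out

# toy usage: a slow 0.1 m/s climb observed through noisy barometer readings
import random
random.seed(0)
readings = [0.1 * t + random.gauss(0, 0.3) for t in range(50)]
smoothed = gh_filter(readings, dt=1.0)
```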

Journal ArticleDOI
TL;DR: This paper utilizes wide-area LoRa signals to sense soil moisture without the need for dedicated soil moisture sensors, and develops a delicate chirp ratio approach to cancel out the phase offset caused by transceiver unsynchronization and extract accurate phase information.
Abstract: Soil moisture sensing is one of the most important components in smart agriculture. It plays a critical role in increasing crop yields and reducing water waste. However, existing commercial soil moisture sensors are either expensive or inaccurate, limiting their real-world deployment. In this paper, we utilize wide-area LoRa signals to sense soil moisture without the need for dedicated soil moisture sensors. Different from the traditional usage of LoRa in smart agriculture, which is only for sensor data transmission, we leverage the LoRa signal itself as a powerful sensing tool. The key insight is that the dielectric permittivity of soil, which is closely related to soil moisture, can be obtained from phase readings of LoRa signals. Therefore, the antennas of a LoRa node can be placed in the soil to capture signal phase readings for soil moisture measurements. Though promising, it is non-trivial to extract accurate phase information due to the unsynchronization of the LoRa transmitter and receiver. In this work, we propose to include a low-cost switch to equip the LoRa node with two antennas to address the issue. We develop a delicate chirp ratio approach to cancel out the phase offset caused by transceiver unsynchronization and extract accurate phase information. The proposed system design has multiple unique advantages, including high accuracy, robustness against motion interference, and a large sensing range for large-scale deployment in smart agriculture. Experiments with commodity LoRa nodes show that our system can accurately estimate soil moisture at an average error of 3.1%, achieving performance comparable to high-end commodity soil moisture sensors. Field studies show that the proposed system can accurately sense soil moisture even when the LoRa gateway is 100 m away from the LoRa node, enabling wide-area soil moisture sensing for the first time.
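The chirp-ratio processing itself is specific to the paper; the underlying trick of cancelling a common transceiver phase offset by relating the signals seen via two antennas of the same node can be illustrated generically (array names, the single-tone toy signal, and the conjugate-multiply formulation are assumptions, not the paper's method):

```python
import numpy as np

def relative_phase(samples_ant_a, samples_ant_b):
    """Estimate the path-induced phase difference between two antennas of one
    node while cancelling their common (unknown) transceiver phase offset.
    A simplified stand-in for ratio-style processing on LoRa chirps."""
    ratio = samples_ant_a * np.conj(samples_ant_b)   # common offset cancels here
    return np.angle(np.mean(ratio))                  # residual path phase difference

# toy check: a 0.7 rad path-induced difference survives a random common offset
rng = np.random.default_rng(1)
common = np.exp(1j * rng.uniform(0, 2 * np.pi, size=1000))   # unsynchronized offset
a = common * np.exp(1j * 0.7)     # antenna A: offset plus extra path phase (e.g. via soil)
b = common * np.exp(1j * 0.0)     # antenna B: offset only
print(relative_phase(a, b))       # ~0.7
```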

Journal ArticleDOI
TL;DR: DiverSense is able to accurately monitor human respiration even when the sensing signal is below the noise floor, and therefore boosts the sensing range to 40 meters, a 3× improvement over the current state-of-the-art.
Abstract: The ubiquity of Wi-Fi infrastructure has facilitated the development of a range of Wi-Fi based sensing applications. Wi-Fi sensing relies on weak signal reflections from the human target and thus only supports a limited sensing range, which significantly hinders the real-world deployment of the proposed sensing systems. To extend the sensing range, traditional algorithms focus on suppressing the noise introduced by the imperfect Wi-Fi hardware. This paper takes a different direction and proposes to enhance the quality of the sensing signal by fully exploiting the signal diversity provided by the Wi-Fi hardware. We propose DiverSense, a system that combines the sensing signal received from all subcarriers and all antennas in the array to fully utilize the spatial and frequency diversity. To guarantee the diversity gain after signal combining, we also propose a time-diversity-based signal alignment algorithm to align the phases of the multiple received sensing signals. We implement the proposed methods in a respiration monitoring system using commodity Wi-Fi devices and evaluate the performance in diverse environments. Extensive experimental results demonstrate that DiverSense is able to accurately monitor human respiration even when the sensing signal is below the noise floor, and therefore boosts the sensing range to 40 meters, which is a 3× improvement over the current state-of-the-art. DiverSense also works robustly under NLoS scenarios, e.g., it is able to accurately monitor respiration even when the human and the Wi-Fi transceivers are separated by two concrete walls with wooden doors.
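DiverSense's time-diversity-based alignment is described in the paper; the basic idea of rotating each subcarrier/antenna stream onto a common phase reference before summing them, so that the weak reflection adds coherently, can be sketched as follows (a simplified stand-in, with assumed shapes and a toy breathing signal):

```python
import numpy as np

def align_and_combine(streams):
    """Coherently combine complex sensing streams of shape (num_streams, num_samples).

    Each stream is rotated by its estimated constant phase offset relative to
    the first stream and then summed; this is a simplified illustration, not
    DiverSense's actual alignment algorithm."""
    ref = streams[0]
    combined = np.zeros_like(ref)
    for s in streams:
        offset = np.angle(np.mean(s * np.conj(ref)))   # phase offset w.r.t. reference
        combined += s * np.exp(-1j * offset)
    return combined

# toy usage: 30 subcarriers observing the same weak breathing signal plus noise
rng = np.random.default_rng(0)
t = np.arange(2000) / 100.0
breath = 0.05 * np.exp(1j * 2 * np.pi * 0.25 * t)             # 15 breaths per minute
streams = np.stack([breath * np.exp(1j * rng.uniform(0, 2 * np.pi))
                    + 0.05 * (rng.normal(size=t.size) + 1j * rng.normal(size=t.size))
                    for _ in range(30)])
combined = align_and_combine(streams)
```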

Journal ArticleDOI
TL;DR: A Behaviour Pattern Disentanglement (BPD) framework, which can disentangle the behavior patterns from the irrelevant noises such as personal styles or environmental noises, etc, and can be used on top of existing deep learning approaches for feature refinement.
Abstract: In wearable-based human activity recognition (HAR) research, one of the major challenges is the large intra-class variability problem. The collected activity signal is often, if not always, coupled with noise or bias caused by personal, environmental, or other factors, making it difficult to learn effective features for HAR tasks, especially with inadequate data. To address this issue, in this work, we propose a Behaviour Pattern Disentanglement (BPD) framework, which can disentangle the behavior patterns from irrelevant noise such as personal styles or environmental noise. Based on a disentanglement network, we designed several loss functions and used an adversarial training strategy for optimization, which can disentangle activity signals from the irrelevant noise with the least dependency (between them) in the feature space. Our BPD framework is flexible, and it can be used on top of existing deep learning (DL) approaches for feature refinement. Extensive experiments were conducted on four public HAR datasets, and the promising results of our proposed BPD scheme suggest its flexibility and effectiveness. This is an open-source project, and the code can be found at http://github.com/Jie-su/BPD.

Journal ArticleDOI
TL;DR: In this article, the authors studied the effect of geographical diversity on mood inference models and showed that partially personalized country-specific models performed the best, yielding area under the receiver operating characteristic curve (AUROC) scores in the range 0.78-0.98 for two-class (negative vs. positive valence) and 0.76-0.94 for three-class (negative vs. neutral vs. positive valence) inference.
Abstract: Mood inference with mobile sensing data has been studied in ubicomp literature over the last decade. This inference enables context-aware and personalized user experiences in general mobile apps and valuable feedback and interventions in mobile health apps. However, even though model generalization issues have been highlighted in many studies, the focus has always been on improving the accuracies of models using different sensing modalities and machine learning techniques, with datasets collected in homogeneous populations. In contrast, less attention has been given to studying the performance of mood inference models to assess whether models generalize to new countries. In this study, we collected a mobile sensing dataset with 329K self-reports from 678 participants in eight countries (China, Denmark, India, Italy, Mexico, Mongolia, Paraguay, UK) to assess the effect of geographical diversity on mood inference models. We define and evaluate country-specific (trained and tested within a country), continent-specific (trained and tested within a continent), country-agnostic (tested on a country not seen in training data), and multi-country (trained and tested with multiple countries) approaches trained on sensor data for two mood inference tasks with population-level (non-personalized) and hybrid (partially personalized) models. We show that partially personalized country-specific models perform the best, yielding area under the receiver operating characteristic curve (AUROC) scores in the range 0.78-0.98 for two-class (negative vs. positive valence) and 0.76-0.94 for three-class (negative vs. neutral vs. positive valence) inference. Further, with the country-agnostic approach, we show that models do not perform well compared to country-specific settings, even when models are partially personalized. We also show that continent-specific models outperform multi-country models in the case of Europe. Overall, we uncover generalization issues of mood inference models to new countries and how the geographical similarity of countries might impact mood inference.
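The paper's features and model choices are its own; purely to make the evaluation protocol concrete, a hedged sketch of the country-specific and country-agnostic settings with a generic classifier and AUROC scoring (all names, the classifier, and the 70/30 split are assumptions) could look like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def country_specific_auroc(X, y, country, target, seed=0):
    """Train and test a non-personalized two-class valence model within one country."""
    rng = np.random.default_rng(seed)
    idx = np.where(country == target)[0]
    rng.shuffle(idx)
    cut = int(0.7 * len(idx))                       # 70/30 train/test split
    train, test = idx[:cut], idx[cut:]
    clf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X[train], y[train])
    return roc_auc_score(y[test], clf.predict_proba(X[test])[:, 1])

def country_agnostic_auroc(X, y, country, held_out, seed=0):
    """Train on every country except `held_out`, test on `held_out`."""
    train, test = country != held_out, country == held_out
    clf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X[train], y[train])
    return roc_auc_score(y[test], clf.predict_proba(X[test])[:, 1])
```

Here X, y, and country are parallel arrays of sensor-derived features, binary valence labels, and country codes; the hybrid (partially personalized) models in the paper additionally use participant-specific information, which this sketch omits.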

Journal ArticleDOI
TL;DR: This article evaluated the cross-dataset generalizability of longitudinal passive sensing data from smartphones and wearable devices for human behavior modeling, using depression detection as an application, and showed that individual differences (both within and between populations) may play the most important role in the cross-dataset generalization challenge.
Abstract: There is a growing body of research revealing that longitudinal passive sensing data from smartphones and wearable devices can capture daily behavior signals for human behavior modeling, such as depression detection. Most prior studies build and evaluate machine learning models using data collected from a single population. However, to ensure that a behavior model can work for a larger group of users, its generalizability needs to be verified on multiple datasets from different populations. We present the first work evaluating cross-dataset generalizability of longitudinal behavior models, using depression detection as an application. We collect multiple longitudinal passive mobile sensing datasets with over 500 users from two institutes over a two-year span, leading to four institute-year datasets. Using the datasets, we closely re-implemented and evaluated nine prior depression detection algorithms. Our experiment reveals the lack of model generalizability of these methods. We also implement eight recently popular domain generalization algorithms from the machine learning community. Our results indicate that these methods also do not generalize well on our datasets, with barely any advantage over the naive baseline of guessing the majority class. We then present two new algorithms with better generalizability. Our new algorithm, Reorder, significantly and consistently outperforms existing methods on most cross-dataset generalization setups. However, the overall advantage is incremental and still leaves great room for improvement. Our analysis reveals that individual differences (both within and between populations) may play the most important role in the cross-dataset generalization challenge. Finally, we provide an open-source benchmark platform, GLOBEM (short for Generalization of LOngitudinal BEhavior Modeling), to consolidate all 19 algorithms. GLOBEM can support researchers in using, developing, and evaluating different longitudinal behavior modeling methods. We call for researchers' attention to model generalizability evaluation for future longitudinal human behavior modeling studies.

Journal ArticleDOI
TL;DR: The effectiveness of the LOS-Net is demonstrated using sensor data collected from workers in actual factories and a logistics center, and it is shown that it can achieve state-of-the-art performance.
Abstract: This study presents a new neural network model for recognizing manual works using body-worn accelerometers in industrial settings, named Lightweight Ordered-work Segmentation Network (LOS-Net). In industrial domains, a human worker typically repetitively performs a set of predefined processes, with each process consisting of a sequence of activities in a predefined order. State-of-the-art activity recognition models, such as encoder-decoder models, have numerous trainable parameters, making their training difficult in industrial domains because of the consequent substantial cost for preparing a large amount of labeled data. In contrast, the LOS-Net is designed to be trained on a limited amount of training data. Specifically, the decoder in the LOS-Net has few trainable parameters and is designed to capture only the necessary information for precise recognition of ordered works. These are (i) the boundary information between consecutive activities, because a transition in the performed activities is generally associated with the trend change of the sensor data collected during the manual works and (ii) long-term context regarding the ordered works, e.g., information about the previous and next activity, which is useful for recognizing the current activity. This information is obtained by introducing a module that can collect it at distant time steps using few trainable parameters. Moreover, the LOS-Net can refine the activity estimation by the decoder by incorporating prior knowledge regarding the order of activities. We demonstrate the effectiveness of the LOS-Net using sensor data collected from workers in actual factories and a logistics center, and show that it can achieve state-of-the-art performance.

Journal ArticleDOI
TL;DR: This paper proposes SonicBot, a system that enables contact-free acoustic sensing under device motion by proposing a sequence of signal processing schemes to eliminate the impact of device motion and then obtain clean target movement information that is previously overwhelmed by device movement.
Abstract: Recent years have witnessed increasing attention from both academia and industry on contact-free acoustic sensing. Due to the pervasiveness of audio devices and fine granularity of acoustic sensing, it has been applied in numerous fields, including human-computer interaction and contact-free health sensing. Though promising, the limited working range hinders the wide adoption of acoustic sensing in real life. To break the sensing range limit, we propose to deploy the acoustic device on a moving platform (i.e., a robot) to support applications that require larger coverage and continuous sensing. In this paper, we propose SonicBot, a system that enables contact-free acoustic sensing under device motion. We propose a sequence of signal processing schemes to eliminate the impact of device motion and then obtain clean target movement information that is previously overwhelmed by device movement. We implement SonicBot using commercial audio devices and conduct extensive experiments to evaluate the performance of the proposed system. Experiment results show that our system can achieve a median error of 1.11 cm and 1.31 mm for coarse-grained and fine-grained tracking, respectively. To showcase the applicability of our proposed system in real-world settings, we perform two field studies, including coarse-grained gesture sensing and fine-grained respiration monitoring when the acoustic device moves along with a robot.

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a novel signature synthesis algorithm based on the observed specular reflection model of a human body, and an effective cross-modal deep metric learning model is introduced to deal with interference caused by unsynchronized data across radars and cameras.
Abstract: Human identification is a key requirement for many applications in everyday life, such as personalized services, automatic surveillance, continuous authentication, and contact tracing during pandemics. This work studies the problem of cross-modal human re-identification (ReID), in response to the regular human movements across camera-allowed regions (e.g., streets) and camera-restricted regions (e.g., offices) deployed with heterogeneous sensors. By leveraging the emerging low-cost RGB-D cameras and mmWave radars, we propose the first-of-its-kind vision-RF system for cross-modal multi-person ReID at the same time. Firstly, to address the fundamental inter-modality discrepancy, we propose a novel signature synthesis algorithm based on the observed specular reflection model of a human body. Secondly, an effective cross-modal deep metric learning model is introduced to deal with interference caused by unsynchronized data across radars and cameras. Through extensive experiments in both indoor and outdoor environments, we demonstrate that our proposed system is able to achieve ~92.5% top-1 accuracy and ~97.5% top-5 accuracy out of 56 volunteers. We also show that our proposed system is able to robustly re-identify subjects even when multiple subjects are present in the sensors' field of view.
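The paper's signature synthesis and cross-modal metric learning model are its own contributions; as a generic illustration of the kind of objective used in cross-modal deep metric learning (a triplet loss pulling radar and camera embeddings of the same person together), here is a hedged NumPy sketch with assumed names and margin:

```python
import numpy as np

def cross_modal_triplet_loss(radar_emb, cam_pos_emb, cam_neg_emb, margin=0.3):
    """Generic triplet loss: each radar-derived embedding should be closer to the
    camera embedding of the same person than to that of a different person by at
    least `margin`.  Illustrative only; the paper's loss and mining may differ."""
    d_pos = np.linalg.norm(radar_emb - cam_pos_emb, axis=-1)   # same-person distance
    d_neg = np.linalg.norm(radar_emb - cam_neg_emb, axis=-1)   # different-person distance
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))

# toy batch of 16 triplets with 128-dimensional embeddings
rng = np.random.default_rng(0)
loss = cross_modal_triplet_loss(rng.normal(size=(16, 128)),
                                rng.normal(size=(16, 128)),
                                rng.normal(size=(16, 128)))
```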

Journal ArticleDOI
TL;DR: COCOA (Cross mOdality COntrastive leArning), a self-supervised model that employs a novel objective function to learn quality representations from multisensor data by computing the cross-correlation between different data modalities and minimizing the similarity between irrelevant instances is proposed.
Abstract: Self-Supervised Learning (SSL) is a new paradigm for learning discriminative representations without labeled data, and has reached comparable or even state-of-the-art results in comparison to supervised counterparts. Contrastive Learning (CL) is one of the most well-known approaches in SSL that attempts to learn general, informative representations of data. CL methods have been mostly developed for applications in computer vision and natural language processing where only a single sensor modality is used. A majority of pervasive computing applications, however, exploit data from a range of different sensor modalities. While existing CL methods are limited to learning from one or two data sources, we propose COCOA (Cross mOdality COntrastive leArning), a self-supervised model that employs a novel objective function to learn quality representations from multisensor data by computing the cross-correlation between different data modalities and minimizing the similarity between irrelevant instances. We evaluate the effectiveness of COCOA against eight recently introduced state-of-the-art self-supervised models and two supervised baselines across five public datasets. We show that COCOA achieves superior classification performance to all other approaches. Also, COCOA is far more label-efficient than the other baselines, including the fully supervised model, using only one-tenth of the available labeled data.
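COCOA's objective spans an arbitrary number of modalities and is defined precisely in the paper; the two-modality intuition (time-aligned embeddings from different modalities should agree, mismatched ones should not) can be sketched roughly as follows, with all names and the weighting being assumptions:

```python
import numpy as np

def cross_modal_agreement_loss(z_a, z_b, weight=0.5):
    """Rough two-modality sketch in the spirit of cross-modality contrastive
    learning: row i of z_a and z_b come from the same time window but different
    sensor modalities.  Not COCOA's actual objective."""
    def l2norm(z):
        return z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-8)

    sim = l2norm(z_a) @ l2norm(z_b).T              # (batch, batch) cosine similarities
    aligned = np.mean(np.diag(sim))                # same-window pairs: pull together
    off_diag = sim - np.diag(np.diag(sim))
    mismatched = np.sum(np.abs(off_diag)) / (sim.size - len(sim))   # push apart
    return float(-aligned + weight * mismatched)

# toy usage: embeddings of 32 windows from two modalities, 64 dimensions each
rng = np.random.default_rng(0)
print(cross_modal_agreement_loss(rng.normal(size=(32, 64)), rng.normal(size=(32, 64))))
```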

Journal ArticleDOI
TL;DR: It was shown that a mixture of financial and altruistic benefits was important in eliciting data contribution, that most of the participants were less concerned about open dataset collection, and that their perceived sensitivity of each type of sensor data did not change over time.
Abstract: Collecting large-scale mobile and wearable sensor datasets from daily contexts is essential in developing machine learning models for enabling everyday affective computing applications. However, there is a lack of knowledge on data contributors' perceived benefits and risks of participating in open dataset collection projects. To bridge this gap, we conducted an in-situ study on building an open dataset with mobile and wearable devices for affective computing research (N = 100, 4 weeks). Our study results showed that a mixture of financial and altruistic benefits was important in eliciting data contribution. Sensor-specific risks were largely associated with the revelation of personal traits and social behaviors. However, most of the participants were less concerned about open dataset collection, and their perceived sensitivity of each type of sensor data did not change over time. We further discuss alternative approaches to promote data contributors' motivations and suggest design guidelines to alleviate potential privacy concerns in mobile open dataset collection.

Journal ArticleDOI
TL;DR: A novel mobile sensing system that leverages both front and rear cameras on a smartphone to generate high-quality self-supervised labels for training personalized contactless camera-based PPG models, and significantly outperforms the state-of-the-art on-device supervised training and few-shot adaptation methods.
Abstract: Camera-based contactless photoplethysmography refers to a set of popular techniques for contactless physiological measurement. The current state-of-the-art neural models are typically trained in a supervised manner using videos accompanied by gold-standard physiological measurements. However, they often generalize poorly to out-of-domain examples (i.e., videos that are unlike those in the training set). Personalizing models can help improve model generalizability, but many personalization techniques still require some gold-standard data. To help alleviate this dependency, in this paper, we present a novel mobile sensing system called MobilePhys, the first mobile personalized remote physiological sensing system, which leverages both front and rear cameras on a smartphone to generate high-quality self-supervised labels for training personalized contactless camera-based PPG models. To evaluate the robustness of MobilePhys, we conducted a user study with 39 participants who completed a set of tasks under different mobile devices, lighting conditions/intensities, motion tasks, and skin types. Our results show that MobilePhys significantly outperforms the state-of-the-art on-device supervised training and few-shot adaptation methods. Through extensive user studies, we further examine how MobilePhys performs in complex real-world settings. We envision that calibrated or personalized camera-based contactless PPG models generated from our proposed dual-camera mobile sensing system will open the door for numerous future applications such as smart mirrors, fitness, and mobile health applications.

Journal ArticleDOI
TL;DR: NLoc is presented, a reliable non-line-of-sight localization system that overcomes the above limitations and incorporates novel algorithms to remove random ToF offsets due to lack of synchronization and compensate target orientation that determines the geometric features, for accurate location estimates.
Abstract: The past decade's research in RF indoor localization has led to technologies with decimeter-level accuracy under controlled experimental settings. However, existing solutions are not reliable in challenging environments with rich multipath and various occlusions. The errors can be 3-5 times larger than in settings with clear LoS paths. In addition, when the direct path is completely blocked, such approaches generate wrong location estimates. In this paper, we present NLoc, a reliable non-line-of-sight localization system that overcomes the above limitations. The key innovation of NLoc is to convert multipath reflections into virtual direct paths to enhance localization performance. To this end, NLoc first extracts reliable multi-dimensional parameters by characterizing phase variations. Then, it models the relation between the target location and the geometric features of multipath reflections to obtain virtual direct paths. Finally, it incorporates novel algorithms to remove random ToF offsets due to the lack of synchronization and to compensate for the target orientation that determines the geometric features, for accurate location estimates. We implement NLoc on commercial off-the-shelf WiFi devices. Our experiments in multipath-challenged environments with dozens of obstacles and occlusions demonstrate that NLoc outperforms state-of-the-art approaches by 44% at the median and 200% at the 90th percentile.