Recently, a new acoustic model based on deep neural networks (DNN) has been introduced. While the DNN has generated significant improvements over GMM-based systems on several tasks, there has been no evaluation of the robustness of such systems to environmental distortion. In this paper, we investigate the noise robustness of DNN-based acoustic models and find that they can match state-of-the-art performance on the Aurora 4 task without any explicit noise compensation. This performance can be further improved by incorporating information about the environment into DNN training using a new method called noise-aware training. When combined with the recently proposed dropout training technique, a 7.5% relative improvement over the previously best published result on this task is achieved using only a single decoding pass and no additional decoding complexity compared to a standard DNN.

/pdf/an-investigation-of-deep-neural-networks-for-noise-robust-tczu9r4vyf.pdf

An investigation of deep neural networks for noise robust speech recognition
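
The abstract does not say how environment information reaches the network. One plausible reading, sketched below purely as an assumption, is to append a per-utterance noise estimate to every input window. The names `estimate_noise` and `noise_aware_inputs`, the choice of averaging the leading frames, and the context size are all hypothetical and not taken from the paper.

```python
from typing import List

Frame = List[float]


def estimate_noise(frames: List[Frame], n_lead: int = 3) -> Frame:
    """Crude noise estimate: average of the first n_lead frames.

    Assumption: the utterance starts with speech-free frames.
    """
    lead = frames[:n_lead]
    dim = len(frames[0])
    return [sum(f[d] for f in lead) / len(lead) for d in range(dim)]


def noise_aware_inputs(frames: List[Frame], context: int = 1,
                       n_lead: int = 3) -> List[List[float]]:
    """Build one network input per frame: a context window plus the noise estimate."""
    noise = estimate_noise(frames, n_lead)
    last = len(frames) - 1
    inputs = []
    for t in range(len(frames)):
        window: List[float] = []
        for k in range(t - context, t + context + 1):
            idx = min(max(k, 0), last)  # repeat edge frames at utterance boundaries
            window.extend(frames[idx])
        inputs.append(window + noise)  # same noise vector appended to every window
    return inputs
```

With a context of one frame on each side and two-dimensional features, each input has 3 x 2 + 2 = 8 values; the network itself is unchanged apart from the wider input layer.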

Digital twin can be defined as a virtual representation of a physical asset enabled through data and simulators for real-time prediction, optimization, monitoring, controlling, and improved decision making. Recent advances in computational pipelines, multiphysics solvers, artificial intelligence, big data cybernetics, data processing and management tools bring the promise of digital twins and their impact on society closer to reality. Digital twinning is now an important and emerging trend in many applications. Also referred to as a computational megamodel, device shadow, mirrored system, avatar or a synchronized virtual prototype, there can be no doubt that a digital twin plays a transformative role not only in how we design and operate cyber-physical intelligent systems, but also in how we advance the modularity of multi-disciplinary systems to tackle fundamental barriers not addressed by the current, evolutionary modeling practices. In this work, we review the recent status of methodologies and techniques related to the construction of digital twins mostly from a modeling perspective. Our aim is to provide a detailed coverage of the current challenges and enabling technologies along with recommendations and reflections for various stakeholders.

/pdf/digital-twin-values-challenges-and-enablers-from-a-modeling-3hpoelaogn.pdf

Digital Twin: Values, Challenges and Enablers From a Modeling Perspective

New waves of consumer-centric applications, such as voice search and voice interaction with mobile devices and home entertainment systems, increasingly require automatic speech recognition (ASR) to be robust to the full range of real-world noise and other acoustic distorting conditions. Despite its practical importance, however, the inherent links between and distinctions among the myriad of methods for noise-robust ASR have yet to be carefully studied in order to advance the field further. To this end, it is critical to establish a solid, consistent, and common mathematical foundation for noise-robust ASR, which is lacking at present. This article is intended to fill this gap and to provide a thorough overview of modern noise-robust techniques for ASR developed over the past 30 years. We emphasize methods that are proven to be successful and that are likely to sustain or expand their future applicability. We distill key insights from our comprehensive overview in this field and take a fresh look at a few old problems, which nevertheless are still highly relevant today. Specifically, we have analyzed and categorized a wide range of noise-robust techniques using five different criteria: 1) feature-domain vs. model-domain processing, 2) the use of prior knowledge about the acoustic environment distortion, 3) the use of explicit environment-distortion models, 4) deterministic vs. uncertainty processing, and 5) the use of acoustic models trained jointly with the same feature enhancement or model adaptation process used in the testing stage. With this taxonomy-oriented review, we equip the reader with the insight to choose among techniques and with the awareness of the performance-complexity tradeoffs. The pros and cons of using different noise-robust ASR techniques in practical application scenarios are provided as a guide to interested practitioners. The current challenges and future research directions in this field are also carefully analyzed.

/pdf/an-overview-of-noise-robust-automatic-speech-recognition-u8zbz18oa3.pdf

An overview of noise-robust automatic speech recognition

Gesture is becoming an increasingly popular means of interacting with computers. However, it is still relatively costly to deploy robust gesture recognition sensors in existing mobile platforms. We present SoundWave, a technique that leverages the speaker and microphone already embedded in most commodity devices to sense in-air gestures around the device. To do this, we generate an inaudible tone, which gets frequency-shifted when it reflects off moving objects like the hand. We measure this shift with the microphone to infer various gestures. In this note, we describe the phenomena and detection algorithm, demonstrate a variety of gestures, and present an informal evaluation on the robustness of this approach across different devices and people.

/pdf/soundwave-using-the-doppler-effect-to-sense-gestures-2l9xzfigh1.pdf

SoundWave: using the doppler effect to sense gestures
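
The frequency shift described in the abstract follows the standard Doppler relation for a tone reflected off a moving object. The sketch below works through the arithmetic. The 18 kHz tone, the 0.5 m/s hand speed, and the function names are illustrative assumptions; the abstract only says the tone is inaudible.

```python
SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C


def reflected_shift(f0: float, v: float, c: float = SPEED_OF_SOUND) -> float:
    """Frequency shift (Hz) of a tone f0 reflected off an object moving at v m/s.

    Positive v means the object approaches the device. The reflected tone is
    f0 * (c + v) / (c - v), which is close to f0 + 2 * v * f0 / c when v << c.
    """
    return f0 * (c + v) / (c - v) - f0


def velocity_from_shift(f0: float, shift: float, c: float = SPEED_OF_SOUND) -> float:
    """Invert reflected_shift: recover the object's velocity from the measured shift."""
    return c * shift / (2.0 * f0 + shift)


# Illustrative numbers (assumptions): 18 kHz pilot tone, hand moving at 0.5 m/s.
shift = reflected_shift(18_000.0, 0.5)
approx = 2.0 * 0.5 * 18_000.0 / SPEED_OF_SOUND
```

For these values the shift is on the order of 50 Hz, so the spectrum around the tone has to be resolved at a bin width well below that for the gesture to be visible.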

Unlike previous studies on the Metaverse based on Second Life, the current Metaverse is based on the social value of Generation Z that online and offline selves are not different. With the technological development of deep learning-based high-precision recognition models and natural generation models, Metaverse is being strengthened with various factors, from mobile-based always-on access to connectivity with reality using virtual currency. The integration of enhanced social activities and neural-net methods requires a new definition of Metaverse suitable for the present, different from the previous Metaverse. This paper divides the concepts and essential techniques necessary for realizing the Metaverse into three components (i.e., hardware, software, and contents) and three approaches (i.e., user interaction, implementation, and application) rather than marketing or hardware approach to conduct a comprehensive analysis. Furthermore, we describe essential methods based on three components and techniques to Metaverse's representative Ready Player One, Roblox, and Facebook research in the domain of films, games, and studies. Finally, we summarize the limitations and directions for implementing the immersive Metaverse as social influences, constraints, and open challenges.

/pdf/a-metaverse-taxonomy-components-applications-and-open-1ix4u7e7.pdf

A Metaverse: Taxonomy, Components, Applications, and Open Challenges

We explore options to use Transformer networks in neural transducer for end-to-end speech recognition. Transformer networks use self-attention for sequence modeling and come with advantages in parallel computation and capturing contexts. We propose 1) using VGGNet with causal convolution to incorporate positional information and reduce frame rate for efficient inference, and 2) using truncated self-attention to enable streaming for Transformer and reduce computational complexity. All experiments are conducted on the public LibriSpeech corpus. The proposed Transformer-Transducer outperforms neural transducer with LSTM/BLSTM networks and achieves word error rates of 6.37% on the test-clean set and 15.30% on the test-other set, while remaining streamable, compact with 45.7M parameters for the entire system, and computationally efficient with complexity of O(T), where T is the input sequence length.

Transformer-Transducer: End-to-End Speech Recognition with Self-Attention
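
Truncated self-attention can be illustrated with a banded attention mask: each frame attends only to a fixed number of past and future frames, so the per-frame cost is bounded and the total cost grows linearly in T. The sketch below is a minimal illustration under assumed window sizes; `truncated_mask` and `masked_softmax` are hypothetical helpers, and the paper's actual context lengths are not given in the abstract.

```python
import math
from typing import List


def truncated_mask(T: int, left: int, right: int) -> List[List[bool]]:
    """mask[t][s] is True when frame t may attend to frame s."""
    return [[t - left <= s <= t + right for s in range(T)] for t in range(T)]


def masked_softmax(scores: List[float], allowed: List[bool]) -> List[float]:
    """Softmax over the allowed positions only; masked positions get weight 0."""
    peak = max(x for x, ok in zip(scores, allowed) if ok)  # for numerical stability
    exps = [math.exp(x - peak) if ok else 0.0 for x, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative window (assumption): 2 frames of history, 1 frame of look-ahead.
mask = truncated_mask(T=5, left=2, right=1)
weights = masked_softmax([0.0] * 5, mask[3])
```

Every row of the mask has at most `left + right + 1` True entries regardless of T, and a small `right` keeps the look-ahead latency low enough for streaming.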

This paper presents a new device based on ultrasonic sensors to recognize one-handed gestures. The device uses three ultrasonic receivers and a single transmitter. Gestures are characterized through the Doppler frequency shifts they generate in reflections of an ultrasonic tone emitted by the transmitter. We show that this setup can be used to classify simple one-handed gestures with high accuracy. The ultrasonic Doppler-based device is very inexpensive ($20 USD for the whole setup, including the acquisition system) and computationally efficient compared to most traditional devices (e.g., video). These gestures could potentially be used to control and drive a device.

/pdf/one-handed-gesture-recognition-using-ultrasonic-doppler-24mb740mpt.pdf

One-handed gesture recognition using ultrasonic Doppler sonar

A person's gait is a characteristic that might be employed to identify him/her automatically. Conventionally, automatic systems for gait-based identification of subjects employ video and image processing to characterize gait. In this paper we present an Acoustic Doppler Sensor (ADS) based technique for the characterization of gait. The ADS is a very inexpensive sensor that can be built using off-the-shelf components for under $20 USD at today's prices. We show that remarkably good gait recognition is possible with the ADS.

/pdf/acoustic-doppler-sonar-for-gait-recogination-103bhzlp0h.pdf

Acoustic Doppler sonar for gait recogination

Several properties differentiate ultrasonic Doppler sensing from other sensing techniques: high frame rate, low computational overhead, instantaneous velocity readings, and range independence. Also, because it isn't vision-based, it might open doors to sensing in once-taboo locations.

Ultrasonic Doppler Sensing in HCI

In this paper we present a novel use of an acoustic Doppler sonar for multi-modal speaker identification. An ultrasonic emitter directs a 40 kHz tone toward the speaker. Reflections from the speaker's face are recorded as the speaker talks. The frequency of the tone is modified by the velocity of the facial structures it is reflected by. The received ultrasonic signal thus contains an entire spectrum of frequencies representing the set of all velocities of facial components. The pattern of frequencies in the reflected signal is observed to be typical of the speaker. The captured ultrasonic signal is synchronously analyzed with the corresponding voice signal to extract specific characteristics that can be used to identify the speaker. Experiments show that this information can result in significant improvements in speaker identification accuracy both under clean conditions and in noise.

/pdf/ultrasonic-doppler-sensor-for-speaker-recognition-4nhs9bbsaw.pdf

Kaustubh Kalgaonkar

Papers

Transformer-Transducer: End-to-End Speech Recognition with Self-Attention

One-handed gesture recognition using ultrasonic Doppler sonar

Acoustic Doppler sonar for gait recogination

Ultrasonic Doppler Sensing in HCI

Ultrasonic Doppler sensor for speaker recognition