
Showing papers by "Tim Fingscheidt" published in 2022


Proceedings ArticleDOI
02 Mar 2022
TL;DR: A novel adversarial perturbation detection scheme based on multi-task perception of complex vision tasks (i.e., depth estimation and semantic segmentation) is proposed, together with a novel edge consistency loss between all three modalities that improves their initial consistency and thereby supports the detection scheme.
Abstract: While deep neural networks (DNNs) achieve impressive performance on environment perception tasks, their sensitivity to adversarial perturbations limits their use in practical applications. In this paper, we (i) propose a novel adversarial perturbation detection scheme based on multi-task perception of complex vision tasks (i.e., depth estimation and semantic segmentation). Specifically, adversarial perturbations are detected by inconsistencies between extracted edges of the input image, the depth output, and the segmentation output. To further improve this technique, we (ii) develop a novel edge consistency loss between all three modalities, thereby improving their initial consistency, which in turn supports our detection scheme. We verify our detection scheme's effectiveness by employing various known attacks and image noises. In addition, we (iii) develop a multi-task adversarial attack, aiming at fooling both tasks as well as our detection scheme. Experimental evaluation on the Cityscapes and KITTI datasets shows that, under an assumption of a 5% false positive rate, up to 100% of images are correctly detected as adversarially perturbed, depending on the strength of the perturbation. Code is available at https://github.com/ifnspaml/AdvAttackDet. A short video at https://youtu.be/KKa6gOyWmH4 provides qualitative results.
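
The edge-consistency check at the heart of the detection scheme can be illustrated with a minimal NumPy sketch (the gradient-based edge extractor, the cosine-similarity score, and the threshold are illustrative assumptions, not the implementation from the linked repository):

```python
import numpy as np

def edge_map(x):
    """Gradient-magnitude edges of a 2-D array, normalized to [0, 1]."""
    gy, gx = np.gradient(x.astype(np.float64))
    mag = np.hypot(gx, gy)
    return mag / (mag.max() + 1e-8)

def edge_consistency(a, b):
    """Cosine similarity between two flattened edge maps."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def is_perturbed(image_gray, depth, seg_ids, threshold=0.5):
    """Flag an input as adversarial if the edges of the input image, the
    depth output, and the segmentation output disagree too strongly."""
    e_img, e_depth = edge_map(image_gray), edge_map(depth)
    e_seg = edge_map(seg_ids.astype(np.float64))  # class boundaries
    score = min(edge_consistency(e_img, e_depth),
                edge_consistency(e_img, e_seg),
                edge_consistency(e_depth, e_seg))
    return score < threshold  # low consistency -> likely perturbed
```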

5 citations


Journal ArticleDOI
02 Mar 2022
TL;DR: This work expands a source-free UDA approach into a continual and therefore online-capable UDA on a single-image basis for semantic segmentation, modifying the source domain statistics in the batch normalization layers using target domain images in an unsupervised fashion, which yields consistent performance improvements during inference.
Abstract: Environment perception in autonomous driving vehicles often heavily relies on deep neural networks (DNNs), which are subject to domain shifts, leading to a significantly decreased performance during DNN deployment. Usually, this problem is addressed by unsupervised domain adaptation (UDA) approaches trained either simultaneously on source and target domain datasets or even source-free only on target data in an offline fashion. In this work, we further expand a source-free UDA approach to a continual and therefore online-capable UDA on a single-image basis for semantic segmentation. Accordingly, our method only requires the pre-trained model from the supplier (trained in the source domain) and the current (unlabeled target domain) camera image. Our method Continual BatchNorm Adaptation (CBNA) modifies the source domain statistics in the batch normalization layers, using target domain images in an unsupervised fashion, which yields consistent performance improvements during inference. Thereby, in contrast to existing works, our approach can be applied to improve a DNN continuously on a single-image basis during deployment without access to source data, without algorithmic delay, and nearly without computational overhead. We show the consistent effectiveness of our method across a wide variety of source/target domain settings for semantic segmentation. Code is available at https://github.com/ifnspaml/CBNA.
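
The single-image batch normalization adaptation can be sketched in PyTorch as follows (a minimal sketch; the blending factor alpha and the hook-based realization are assumptions and simplify the paper's CBNA procedure):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def cbna_forward(model, image, alpha=0.1):
    """Inference on a single image with adapted BatchNorm statistics:
    blend the stored source-domain running statistics with the
    statistics of the current (unlabeled) target image."""
    hooks = []

    def make_hook(bn):
        def hook(module, inp):
            x = inp[0]  # (1, C, H, W)
            mu = x.mean(dim=(0, 2, 3))
            var = x.var(dim=(0, 2, 3), unbiased=False)
            module.running_mean.mul_(1 - alpha).add_(alpha * mu)
            module.running_var.mul_(1 - alpha).add_(alpha * var)
        return hook

    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            hooks.append(m.register_forward_pre_hook(make_hook(m)))
    model.eval()  # BN layers then use the adapted running statistics
    out = model(image)
    for h in hooks:
        h.remove()
    return out
```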

4 citations


Journal Article
TL;DR: This work considers downlink spatial multiplexing enabled by the RIS for weighted sum-rate (WSR) maximization and proposes a method to discretize the continuous phase shifts, which reduces hardware complexity requirements at an acceptable performance loss.
Abstract: Reconfigurable intelligent surface (RIS) is an emerging technology for future wireless communication systems. In this work, we consider downlink spatial multiplexing enabled by the RIS for weighted sum-rate (WSR) maximization. In the literature, most solutions use alternating gradient-based optimization, which has moderate performance, high complexity, and limited scalability. We propose to solve this problem with a fully convolutional network (FCN), which was originally designed for semantic segmentation of images. The rectangular shape of the RIS and the spatial correlation of channels with adjacent RIS antennas due to the short distance between them encourage us to apply it to the RIS configuration. We design a set of channel features that includes both cascaded channels via the RIS and the direct channel. In the base station (BS), the differentiable minimum mean squared error (MMSE) precoder is used for pretraining, and the weighted minimum mean squared error (WMMSE) precoder is then applied for fine-tuning, which is nondifferentiable and more complex, but achieves a better performance. Evaluation results show that the proposed solution has higher performance and allows for a faster evaluation than the baselines. Hence, it scales better to a large number of antennas, advancing the RIS one step closer to practical deployment.
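
The idea of treating the RIS configuration like a dense image-prediction problem can be sketched in PyTorch (layer sizes, feature count, and the tanh phase parametrization are illustrative assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class RISPhaseFCN(nn.Module):
    """Sketch: map per-element channel features, arranged on the
    rectangular RIS grid, to one phase shift per RIS element. The input
    features would stack, e.g., real and imaginary parts of the cascaded
    and direct channels."""
    def __init__(self, in_ch=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1),  # one output channel: the phase
        )

    def forward(self, feats):  # feats: (B, in_ch, H, W) channel features
        return torch.pi * torch.tanh(self.net(feats))  # phases in (-pi, pi)
```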

3 citations


Journal ArticleDOI
01 Jun 2022
TL;DR: A generic way to generate datasets for training amodal semantic segmentation methods is presented, and a baseline method is proposed and evaluated on Amodal Cityscapes, showing its applicability in automotive environment perception.
Abstract: Amodal perception refers to the ability of humans to imagine the entire shapes of occluded objects. This gives humans an advantage in keeping track of everything that is going on, especially in crowded situations. Typical perception functions, however, lack amodal perception abilities and are therefore at a disadvantage in situations with occlusions. Complex urban driving scenarios often experience many different types of occlusions and, therefore, amodal perception for automated vehicles is an important task to investigate. In this paper, we consider the task of amodal semantic segmentation and propose a generic way to generate datasets to train amodal semantic segmentation methods. We use this approach to generate an amodal Cityscapes dataset. Moreover, we propose and evaluate a method as a baseline on Amodal Cityscapes, showing its applicability for amodal semantic segmentation in automotive environment perception. We provide the means to re-generate this dataset on GitHub: https://github.com/ifnspaml/AmodalCityscapes
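
The dataset-generation principle, pasting occluders into an image while keeping the original labels underneath, can be sketched as follows (a minimal NumPy sketch under simplifying assumptions; the released pipeline may differ in detail):

```python
import numpy as np

def paste_occluder(img, sem, occ_img, occ_mask, occ_class):
    """Create an amodal training sample: the visible label map sees the
    pasted occluder, while the amodal label map keeps the original
    classes underneath it.

    img, occ_img: (H, W, 3) uint8 images; sem: (H, W) class IDs;
    occ_mask: (H, W) boolean mask of the pasted instance."""
    amodal = sem.copy()                 # includes what is now occluded
    visible = sem.copy()
    out = img.copy()
    out[occ_mask] = occ_img[occ_mask]   # paste the occluder pixels
    visible[occ_mask] = occ_class       # visible labels: occluder class
    return out, visible, amodal
```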

2 citations


Journal Article
TL;DR: It is found that noise injection significantly reduces the generation of watermarks and thus allows the recognition of highly relevant classes such as “traffic signs”, which are hardly detected by the ERFNet baseline.
Abstract: In recent years, semantic segmentation has benefited from various works in computer vision. Inspired by the very versatile CycleGAN architecture, we combine semantic segmentation with the concept of cycle consistency to enable a multitask training protocol. However, learning is largely prevented by the so-called steganography effect, which expresses itself as watermarks in the latent segmentation domain, making image reconstruction too easy a task. To combat this, we propose a noise injection, based either on quantization noise or on Gaussian noise addition, to avoid this disadvantageous information flow in the cycle architecture. We find that noise injection significantly reduces the generation of watermarks and thus allows the recognition of highly relevant classes such as “traffic signs”, which are hardly detected by the ERFNet baseline. We report mIoU and PSNR results on the Cityscapes dataset for semantic segmentation and image reconstruction, respectively. The proposed methodology achieves an mIoU improvement on the Cityscapes validation set of 5.7% absolute over the same CycleGAN without noise injection, and still 4.9% absolute over the ERFNet non-cyclic baseline.
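
Both noise-injection variants amount to a small perturbation of the latent segmentation before it enters the reconstruction path, as in this PyTorch sketch (the noise level and the number of quantization levels are illustrative assumptions):

```python
import torch

def inject_noise(seg_probs, mode="gauss", sigma=0.1, levels=8):
    """Perturb the latent segmentation to block the steganographic
    side channel in the cycle (sketch). 'gauss' adds Gaussian noise;
    'quant' rounds to a coarse grid with a straight-through gradient."""
    if mode == "gauss":
        return seg_probs + sigma * torch.randn_like(seg_probs)
    q = torch.round(seg_probs * (levels - 1)) / (levels - 1)
    return seg_probs + (q - seg_probs).detach()  # straight-through
```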

1 citation


Journal ArticleDOI
08 Oct 2022
TL;DR: The 3DHD CityScenes dataset provides a large-scale high-definition (HD) map, annotated in the three spatial dimensions of globally referenced, high-density LiDAR point clouds collected in urban domains.
Abstract: In this paper, we present 3DHD CityScenes - a new dataset with the most comprehensive, large-scale high-definition (HD) map to date, annotated in the three spatial dimensions of globally referenced, high-density LiDAR point clouds collected in urban domains. The HD map covers a wide variety of map element types, for instance, traffic signs and lights, construction site elements such as cones and fences, markings, lanes, and relations between map elements. Our presented dataset is suitable for numerous perception tasks, such as 3D object detection or map deviation detection. Furthermore, we address the example task of detecting traffic signs in LiDAR point clouds, proposing a novel method based on a deep neural network. Our architecture, named 3DHDNet, specifically allows for the individual detection of vertically stacked signs. 3DHDNet significantly outperforms two state-of-the-art architectures that we selected for comparison. Our method achieves an F1 score, recall, and precision of 0.83, 0.76, and 0.90, respectively, and may serve as a baseline for future approaches. The dataset is available at https://www.hi-drive.eu/Data for both commercial and non-commercial use.

1 citation


Journal ArticleDOI
TL;DR: This paper explores relaxed attention, a simple and easy-to-implement smoothing of the attention weights, yielding a two-fold improvement to the general transformer architecture: first, relaxed attention provides regularization when applied to the self-attention layers in the encoder, and second, it naturally supports the integration of an external language model as it suppresses the implicitly learned internal language model by relaxing the cross attention in the decoder.
Abstract: The powerful modeling capabilities of all-attention-based transformer architectures often cause overfitting and - for natural language processing tasks - lead to an implicitly learned internal language model in the autoregressive transformer decoder, complicating the integration of external language models. In this paper, we explore relaxed attention, a simple and easy-to-implement smoothing of the attention weights, yielding a two-fold improvement to the general transformer architecture: First, relaxed attention provides regularization when applied to the self-attention layers in the encoder. Second, we show that it naturally supports the integration of an external language model as it suppresses the implicitly learned internal language model by relaxing the cross attention in the decoder. We demonstrate the benefit of relaxed attention across several tasks with clear improvement in combination with recent benchmark approaches. Specifically, we exceed the former state-of-the-art performance of 26.90% word error rate on the largest public lip-reading LRS3 benchmark with a word error rate of 26.31%, and we achieve a top-performing BLEU score of 37.67 on the IWSLT14 (DE→EN) machine translation task without external language models and virtually no additional model parameters. Code and models will be made publicly available.
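
The relaxation itself is a one-line change to scaled dot-product attention, as this PyTorch sketch shows (the coefficient value is illustrative, not the paper's tuned setting):

```python
import torch

def relaxed_attention(q, k, v, gamma=0.1):
    """Scaled dot-product attention with relaxed (smoothed) weights:
    blend the softmax attention matrix with a uniform distribution."""
    d = q.shape[-1]
    att = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    n = att.shape[-1]
    att = (1.0 - gamma) * att + gamma / n  # relax toward uniform weights
    return att @ v
```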

1 citation


Proceedings ArticleDOI
04 May 2022
TL;DR: Only marginal benefits are obtained from an intrusive PESQNet compared to a DNS trained with the non-intrusive PESQNet; hence, as with ACR listening tests, the PESQNet does not necessarily require a clean speech reference input, opening the possibility of using real data for DNS training.
Abstract: Perceptual evaluation of speech quality (PESQ) requires a clean speech reference as input, but predicts the results from (reference-free) absolute category rating (ACR) tests. In this work, we train a fully convolutional recurrent neural network (FCRN) as a deep noise suppression (DNS) model, with either a non-intrusive or an intrusive PESQNet, where only the latter has access to a clean speech reference. The PESQNet is used as a mediator providing a perceptual loss during the DNS training to maximize the PESQ score of the enhanced speech signal. For the intrusive PESQNet, we investigate two topologies, called early-fusion (EF) and middle-fusion (MF) PESQNet, and compare them to the non-intrusive PESQNet to evaluate and quantify the benefits of employing a clean speech reference input during DNS training. Detailed analyses show that the DNS model trained with the MF-intrusive PESQNet outperforms the Interspeech 2021 DNS Challenge baseline and the one trained with an MSE loss by 0.23 and 0.12 PESQ points, respectively. Furthermore, we can show that only marginal benefits are obtained compared to the DNS trained with the non-intrusive PESQNet. Therefore, as with ACR listening tests, the PESQNet does not necessarily require a clean speech reference input, opening the possibility of using real data for DNS training.
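
A single DNS update with the PESQNet acting as mediator might look as follows (a PyTorch sketch with placeholder modules; the negated-score loss is a simplification of the paper's perceptual loss):

```python
import torch

def dns_training_step(dns, pesqnet, optimizer, noisy, clean=None):
    """One deep-noise-suppression update driven by a frozen PESQNet.
    The non-intrusive PESQNet scores the enhanced signal alone; an
    intrusive (EF/MF) variant would additionally consume `clean`."""
    enhanced = dns(noisy)
    score = pesqnet(enhanced) if clean is None else pesqnet(enhanced, clean)
    loss = -score.mean()  # maximize the predicted PESQ score
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```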

1 citation


Journal ArticleDOI
01 Jun 2022
TL;DR: This paper investigates the influence of several data design choices regarding training and validation of deep driving models trainable in an end-to-end fashion, including how the amount of training data influences the final driving performance.
Abstract: The emergence of data-driven machine learning (ML) has facilitated significant progress in many complicated tasks such as highly-automated driving. While much effort is put into improving the ML models and learning algorithms in such applications, little focus is put on how the training data and/or validation setting should be designed. In this paper, we investigate the influence of several data design choices regarding training and validation of deep driving models trainable in an end-to-end fashion. Specifically, (i) we investigate how the amount of training data influences the final driving performance, and which performance limitations are induced through currently used mechanisms to generate training data. (ii) Further, we show by correlation analysis which validation design enables the driving performance measured during validation to generalize well to unknown test environments. (iii) Finally, we investigate the effect of random seeding and non-determinism, giving insights into which reported improvements can be deemed significant. Our evaluations using the popular CARLA simulator provide recommendations regarding data generation and driving route selection for an efficient future development of end-to-end driving models.

1 citation


Journal ArticleDOI
01 Jun 2022
TL;DR: A generic codebook-free approach to quantization in learned image compression, called one-hot max (OHM, Ω) quantization, reorganizes the feature space into an additional dimension, along which vector quantization yields one-hot vectors by comparing activations.
Abstract: We propose a generic approach to quantization without codebook in learned image compression, called one-hot max (OHM, Ω) quantization. It reorganizes the feature space, resulting in an additional dimension, along which vector quantization yields one-hot vectors by comparing activations. Furthermore, we show how to integrate Ω quantization into a compression system with bitrate adaptation, i.e., full control over bitrate during inference. We perform experiments on both MNIST and Kodak and report on rate-distortion trade-offs, comparing with the integer rounding reference. For low bitrates (< 0.4 bpp), our proposed quantizer yields better performance while also exhibiting other advantageous training and inference properties. Code is available at https://github.com/ifnspaml/OHMQ.
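
The Ω quantizer can be sketched compactly in PyTorch (the group size and the straight-through gradient are illustrative assumptions):

```python
import torch

def ohm_quantize(z, k=4):
    """One-hot max (OHM) sketch: view the channel axis as groups of
    size k (the additional dimension) and replace each group by a
    one-hot vector at its maximum activation."""
    b, c, h, w = z.shape
    assert c % k == 0
    zg = z.view(b, c // k, k, h, w)
    onehot = torch.zeros_like(zg).scatter_(
        2, zg.argmax(dim=2, keepdim=True), 1.0)
    out = zg + (onehot - zg).detach()  # straight-through gradient
    return out.view(b, c, h, w)
```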

1 citation


Book ChapterDOI
TL;DR: In this article, the authors investigate the joint prediction of amodal and visible semantic segmentation masks and whether both perception tasks benefit from a joint training approach, showing that the proposed joint training outperforms the separately trained networks in terms of mean intersection over union.
Abstract: Amodal perception is the ability to hallucinate full shapes of (partially) occluded objects. While natural to humans, learning-based perception methods often only focus on the visible parts of scenes. This constraint is critical for safe automated driving since detection capabilities of perception methods are limited when faced with (partial) occlusions. Moreover, corner cases can emerge from occlusions while the perception method is oblivious. In this work, we investigate the possibilities of joint prediction of amodal and visible semantic segmentation masks. More precisely, we investigate whether both perception tasks benefit from a joint training approach. We report our findings on both the Cityscapes and the Amodal Cityscapes dataset. The proposed joint training outperforms the separately trained networks in terms of mean intersection over union in amodal areas of the masks by 6.84% absolute, while even slightly improving the visible segmentation performance.
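
A shared-encoder, two-decoder realization of the joint training can be sketched in PyTorch (architecture and loss weighting are assumptions, not necessarily the authors' exact setup):

```python
import torch.nn as nn

class JointAmodalSeg(nn.Module):
    """Sketch: one shared encoder feeding separate decoders for the
    visible and the amodal segmentation masks."""
    def __init__(self, encoder, dec_visible, dec_amodal):
        super().__init__()
        self.encoder = encoder
        self.dec_visible = dec_visible
        self.dec_amodal = dec_amodal

    def forward(self, x):
        f = self.encoder(x)
        return self.dec_visible(f), self.dec_amodal(f)

def joint_loss(pred_v, pred_a, gt_v, gt_a, w=1.0):
    """Sum of per-task cross-entropy losses (weight w is assumed)."""
    ce = nn.CrossEntropyLoss(ignore_index=255)  # 255: unlabeled pixels
    return ce(pred_v, gt_v) + w * ce(pred_a, gt_a)
```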

Proceedings ArticleDOI
09 May 2022
TL;DR: The proposed DNN model builds upon a fully convolutional recurrent network (FCRN), introduces scalability over various bandwidths up to a fullband (FB) system (48 kHz sampling rate), and shows robustness even under highly delayed echo and dynamic echo path changes.
Abstract: Although today’s speech communication systems support various bandwidths from narrowband to super-wideband and beyond, state-of-the-art DNN methods for acoustic echo cancellation (AEC) lack modularity and bandwidth scalability. Our proposed DNN model builds upon a fully convolutional recurrent network (FCRN) and introduces scalability over various bandwidths up to a fullband (FB) system (48 kHz sampling rate). This modular approach allows separate wideband (WB) pre-training of mask-based AEC and postfilter stages with dedicated losses, followed by a joint training of them on FB data. A third lightweight blind bandwidth extension stage is separately trained on FB data, flexibly allowing the WB postfilter output to be extended towards higher bandwidths until reaching FB. Thereby, higher frequency noise and echo are reliably suppressed. On the ICASSP 2022 Acoustic Echo Cancellation Challenge blind test set we report a competitive performance, showing robustness even under highly delayed echo and dynamic echo path changes.

Journal ArticleDOI
01 Jun 2022
TL;DR: This paper proposes a novel per-image performance prediction for semantic segmentation, with no need for additional sensors, additional training data, or dedicated retraining, and demonstrates its effectiveness with a new state-of-the-art benchmark on both KITTI and Cityscapes for image-only input methods.
Abstract: In supervised learning, a deep neural network’s performance is measured using ground truth data. In semantic segmentation, ground truth data is sparse, requires an expensive annotation process, and, most importantly, it is not available during online operation. To tackle this problem, recent works propose various forms of performance prediction. However, they either rely on inference data histograms, additional sensors, or additional training data. In this paper, we propose a novel per-image performance prediction for semantic segmentation, with (i) no need for additional sensors (sensor efficiency), (ii) no need for additional training data (data efficiency), and (iii) no need for a dedicated retraining of the semantic segmentation (training efficiency). Specifically, we extend an already trained semantic segmentation network having fixed parameters with an image reconstruction decoder. After training and a subsequent regression, the image reconstruction quality is evaluated to predict the semantic segmentation performance. We demonstrate our method’s effectiveness with a new state-of-the-art benchmark both on KITTI and Cityscapes for image-only input methods, on Cityscapes even surpassing a LiDAR-supported benchmark.
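
The inference-time prediction pipeline can be sketched as follows (PyTorch; all module names are placeholders, and the PSNR-to-performance regressor is assumed to have been fitted beforehand as described):

```python
import torch

@torch.no_grad()
def predict_segmentation_quality(seg_encoder, recon_decoder, regressor, image):
    """Sketch: reconstruct the input from the frozen segmentation
    features, measure the reconstruction PSNR, and map it to a
    predicted segmentation performance (e.g., mIoU)."""
    feats = seg_encoder(image)           # frozen, already trained
    recon = recon_decoder(feats)         # added reconstruction decoder
    mse = torch.mean((recon - image) ** 2)
    psnr = 10.0 * torch.log10(1.0 / (mse + 1e-12))  # images in [0, 1]
    return regressor(psnr.view(1, 1))    # regression: PSNR -> mIoU
```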

Proceedings ArticleDOI
05 Sep 2022
TL;DR: This work shows that a restriction of the employed temporal context in the self-attention layers of a CNN-based network architecture is crucial for good speech enhancement performance and proposes to combine restricted attention with a subsampled attention variant that considers long-term context with a lower temporal resolution.
Abstract: The multi-head attention mechanism, which has been successfully applied in, e.g., machine translation and ASR, was also found to be a promising approach for temporal modeling in speech enhancement DNNs. Since speech enhancement can be expected to take less profit from long-term temporal context than machine translation or ASR, we propose to employ self-attention with modified context access. We first show that a restriction of the employed temporal context in the self-attention layers of a CNN-based network architecture is crucial for good speech enhancement performance. Furthermore, we propose to combine restricted attention with a subsampled attention variant that considers long-term context with a lower temporal resolution, which helps to effectively consider both long- and short-term context. We show that our proposed attention-based network outperforms similar networks using RNNs for temporal modeling as well as a strong reference method using unrestricted attention.