Kazuhito Koishida
Researcher at Microsoft
Publications - 54
Citations - 1439
Kazuhito Koishida is an academic researcher at Microsoft. He has contributed to research on topics including encoders and speech coding. The author has an h-index of 18 and has co-authored 51 publications receiving 1,190 citations.
Papers
Journal Article
Text-Independent Speaker Verification Based on Triplet Convolutional Neural Network Embeddings
TL;DR: This study investigates a novel text-independent speaker verification framework based on the triplet loss and a very deep convolutional neural network architecture, in which a fixed-length speaker-discriminative embedding is learned from sparse speech features and used as the feature representation for speaker verification (SV) tasks.
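The triplet objective this summary refers to can be sketched minimally. The following is an illustrative NumPy implementation of a standard triplet hinge loss over embeddings, assuming squared-Euclidean distance and a hypothetical margin value; the paper's exact distance metric and margin are not reproduced here.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on fixed-length embeddings.

    Pulls the anchor toward the positive (same speaker) and pushes it
    away from the negative (different speaker) by at least `margin`.
    The function name and margin are illustrative, not from the paper.
    """
    d_ap = np.sum((anchor - positive) ** 2)  # anchor-positive distance
    d_an = np.sum((anchor - negative) ** 2)  # anchor-negative distance
    return max(d_ap - d_an + margin, 0.0)
```

The loss is zero once the negative is sufficiently farther from the anchor than the positive, so training only focuses on triplets that still violate the margin.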
Patent
Bitstream syntax for multi-process audio decoding
TL;DR: In this article, an audio decoder provides a combination of decoding components, including components implementing baseband decoding, spectral peak decoding, frequency extension decoding, and channel extension decoding techniques, together with a bitstream syntax scheme that permits the various decoding components to extract the appropriate parameters for their respective decoding techniques.
Patent
Sub-band voice codec with multi-stage codebooks and redundant coding
TL;DR: In this article, techniques and tools for coding and decoding audio information are described; for example, redundant coded information for decoding a current frame includes signal history information associated with only a portion of a previous frame.
Posted Content
MMTM: Multimodal Transfer Module for CNN Fusion
TL;DR: A simple neural network module, named the Multimodal Transfer Module (MMTM), leverages knowledge from multiple modalities in convolutional neural networks and improves the recognition accuracy of well-known multimodal networks.
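The fusion idea behind MMTM can be sketched in a squeeze-and-excitation style: pool each modality's feature map to a channel descriptor, mix the descriptors through a shared projection, and re-weight each modality's channels. The NumPy sketch below is a simplified single-example illustration under these assumptions; the weight matrices and function name are hypothetical, not the paper's released code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mmtm_fuse(feat_a, feat_b, W_joint, W_a, W_b):
    """Squeeze-and-excitation style cross-modal fusion (illustrative).

    feat_a: (C_a, H, W) and feat_b: (C_b, H, W) feature maps for one example.
    """
    # Squeeze: global average pool each modality to a channel descriptor.
    z_a = feat_a.mean(axis=(1, 2))                 # shape (C_a,)
    z_b = feat_b.mean(axis=(1, 2))                 # shape (C_b,)
    # Shared bottleneck over the concatenated descriptors.
    joint = np.concatenate([z_a, z_b]) @ W_joint
    # Excitation: per-modality channel gates (2*sigmoid keeps the
    # identity mapping reachable when the gate output is at its midpoint).
    g_a = 2.0 * sigmoid(joint @ W_a)               # shape (C_a,)
    g_b = 2.0 * sigmoid(joint @ W_b)               # shape (C_b,)
    # Recalibrate each modality's channels with its gate.
    return feat_a * g_a[:, None, None], feat_b * g_b[:, None, None]
```

Because the module only rescales channels, it can be dropped between the intermediate layers of two existing unimodal CNN streams without changing their feature-map shapes.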
Patent
LPC-harmonic vocoder with superframe structure
TL;DR: In this paper, an enhanced low-bit-rate parametric voice coder is presented that groups a number of frames from an underlying frame-based vocoder, such as MELP, into a superframe structure.