
Showing papers in "IEEE Journal of Selected Topics in Signal Processing in 2017"


Journal ArticleDOI
TL;DR: The proposed hybrid CTC/attention end-to-end ASR is applied to two large-scale ASR benchmarks, and exhibits performance that is comparable to conventional DNN/HMM ASR systems based on the advantages of both multiobjective learning and joint decoding without linguistic resources.
Abstract: Conventional automatic speech recognition (ASR) based on a hidden Markov model (HMM)/deep neural network (DNN) is a very complicated system consisting of various modules such as acoustic, lexicon, and language models. It also requires linguistic resources, such as a pronunciation dictionary, tokenization, and phonetic context-dependency trees. On the other hand, end-to-end ASR has become a popular alternative to greatly simplify the model-building process of conventional ASR systems by representing complicated modules with a single deep network architecture, and by replacing the use of linguistic resources with a data-driven learning method. There are two major types of end-to-end architectures for ASR; attention-based methods use an attention mechanism to perform alignment between acoustic frames and recognized symbols, and connectionist temporal classification (CTC) uses Markov assumptions to efficiently solve sequential problems by dynamic programming. This paper proposes hybrid CTC/attention end-to-end ASR, which effectively utilizes the advantages of both architectures in training and decoding. During training, we employ the multiobjective learning framework to improve robustness and achieve fast convergence. During decoding, we perform joint decoding by combining both attention-based and CTC scores in a one-pass beam search algorithm to further eliminate irregular alignments. Experiments with English (WSJ and CHiME-4) tasks demonstrate the effectiveness of the proposed multiobjective learning over both the CTC and attention-based encoder–decoder baselines. Moreover, the proposed method is applied to two large-scale ASR benchmarks (spontaneous Japanese and Mandarin Chinese), and exhibits performance that is comparable to conventional DNN/HMM ASR systems based on the advantages of both multiobjective learning and joint decoding without linguistic resources.

724 citations
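
Illustrative sketch (not taken from the paper): the abstract above describes combining a CTC objective and an attention objective during training, and combining CTC and attention scores during beam search. The snippet below assumes a simple linear interpolation with a weight `lam`; the exact form, the function names, and the toy scores are assumptions made for illustration only.

```python
import math


def multiobjective_loss(loss_ctc, loss_att, lam=0.5):
    """Interpolate a CTC loss and an attention loss (assumed linear form)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * loss_ctc + (1.0 - lam) * loss_att


def joint_score(logp_ctc, logp_att, lam=0.5):
    """Combine per-hypothesis log-probabilities from both branches."""
    return lam * logp_ctc + (1.0 - lam) * logp_att


def best_hypothesis(hyps, lam=0.5):
    """Pick the hypothesis with the highest joint score.

    hyps is a list of (text, logp_ctc, logp_att) tuples.
    """
    return max(hyps, key=lambda h: joint_score(h[1], h[2], lam))[0]


# Toy hypotheses (made-up numbers): the second one mimics an attention
# decoder that loops on a word, which the CTC branch scores poorly.
hyps = [
    ("the cat sat", math.log(0.30), math.log(0.20)),
    ("the cat sat sat sat", math.log(0.01), math.log(0.40)),
]

print(best_hypothesis(hyps, lam=0.5))
print(best_hypothesis(hyps, lam=0.0))
```

With `lam=0.0` only the attention score is used and the looping hypothesis wins; with `lam=0.5` the CTC score penalizes it, which is the kind of irregular alignment the joint decoding is meant to suppress.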


Journal ArticleDOI
TL;DR: This work proposes an emotion recognition system using auditory and visual modalities, in which a convolutional neural network extracts features from the speech and a deep residual network of 50 layers handles the visual modality.
Abstract: Automatic affect recognition is a challenging task due to the various modalities emotions can be expressed with. Applications can be found in many domains including multimedia retrieval and human–computer interaction. In recent years, deep neural networks have been used with great success in determining emotional states. Inspired by this success, we propose an emotion recognition system using auditory and visual modalities. To capture the emotional content for various styles of speaking, robust features need to be extracted. To this purpose, we utilize a convolutional neural network (CNN) to extract features from the speech, while for the visual modality a deep residual network of 50 layers is used. In addition to the importance of feature extraction, a machine learning algorithm needs also to be insensitive to outliers while being able to model the context. To tackle this problem, long short-term memory networks are utilized. The system is then trained in an end-to-end fashion where—by also taking advantage of the correlations of each of the streams—we manage to significantly outperform, in terms of concordance correlation coefficient, traditional approaches based on auditory and visual handcrafted features for the prediction of spontaneous and natural emotions on the RECOLA database of the AVEC 2016 research challenge on emotion recognition.

495 citations
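
Illustrative sketch (not taken from the paper): the abstract above reports results in terms of the concordance correlation coefficient (CCC). The function below uses the standard definition, 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), with population (divide-by-n) statistics; the function name and the toy inputs are assumptions for illustration.

```python
def concordance_cc(pred, gold):
    """Concordance correlation coefficient between two equal-length sequences."""
    if len(pred) != len(gold) or len(pred) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    # Population variances and covariance (divide by n).
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for identical constant sequences")
    return 2.0 * cov / denom


print(concordance_cc([1, 2, 3], [1, 2, 3]))  # perfect agreement
print(concordance_cc([1, 2, 3], [2, 3, 4]))  # perfectly correlated but shifted
```

Unlike Pearson correlation, the CCC drops below 1 for the shifted prediction, because it penalizes bias in the mean as well as lack of correlation.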


Journal ArticleDOI
TL;DR: A comprehensive overview and discussion of research in light field image processing is presented, covering basic light field representation and theory, acquisition, super-resolution, depth estimation, compression, editing, processing algorithms for light field display, and computer vision applications of light field data.
Abstract: Light field imaging has emerged as a technology allowing to capture richer visual information from our world. As opposed to traditional photography, which captures a 2D projection of the light in the scene integrating the angular domain, light fields collect radiance from rays in all directions, demultiplexing the angular information lost in conventional photography. On the one hand, this higher dimensional representation of visual data offers powerful capabilities for scene understanding, and substantially improves the performance of traditional computer vision problems such as depth sensing, post-capture refocusing, segmentation, video stabilization, material classification, etc. On the other hand, the high-dimensionality of light fields also brings up new challenges in terms of data capture, data compression, content editing, and display. Taking these two elements together, research in light field image processing has become increasingly popular in the computer vision, computer graphics, and signal processing communities. In this paper, we present a comprehensive overview and discussion of research in this field over the past 20 years. We focus on all aspects of light field image processing, including basic light field representation and theory, acquisition, super-resolution, depth estimation, compression, editing, processing algorithms for light field display, and computer vision applications of light field data.

412 citations


Journal ArticleDOI
TL;DR: A blind image evaluator based on a convolutional neural network (BIECON) is proposed that follows the FR-IQA behavior using the local quality maps as intermediate targets for conventional neural networks, which leads to NR-IQA prediction accuracy that is comparable with that of state-of-the-art FR-IQA methods.
Abstract: In general, owing to the benefits obtained from original information, full-reference image quality assessment (FR-IQA) achieves relatively higher prediction accuracy than no-reference image quality assessment (NR-IQA). By fully utilizing reference images, conventional FR-IQA methods have been investigated to produce objective scores that are close to subjective scores. In contrast, NR-IQA does not consider reference images; thus, its performance is inferior to that of FR-IQA. To alleviate this accuracy discrepancy between FR-IQA and NR-IQA methods, we propose a blind image evaluator based on a convolutional neural network (BIECON). To imitate FR-IQA behavior, we adopt the strong representation power of a deep convolutional neural network to generate a local quality map, similar to FR-IQA. To obtain the best results from the deep neural network, replacing hand-crafted features with automatically learned features is necessary. To apply the deep model to the NR-IQA framework, three critical problems must be resolved: 1) lack of training data; 2) absence of local ground truth targets; and 3) different purposes of feature learning. BIECON follows the FR-IQA behavior using the local quality maps as intermediate targets for conventional neural networks, which leads to NR-IQA prediction accuracy that is comparable with that of state-of-the-art FR-IQA methods.

364 citations


Journal ArticleDOI
TL;DR: This paper proposes a novel framework for learning/estimating graphs from data, which includes formulation of various graph learning problems, their probabilistic interpretations, and associated algorithms.
Abstract: Graphs are fundamental mathematical structures used in various fields to represent data, signals, and processes. In this paper, we propose a novel framework for learning/estimating graphs from data. The proposed framework includes (i) formulation of various graph learning problems, (ii) their probabilistic interpretations, and (iii) associated algorithms. Specifically, graph learning problems are posed as the estimation of graph Laplacian matrices from some observed data under given structural constraints (e.g., graph connectivity and sparsity level). From a probabilistic perspective, the problems of interest correspond to maximum a posteriori parameter estimation of Gaussian–Markov random field models, whose precision (inverse covariance) is a graph Laplacian matrix. For the proposed graph learning problems, specialized algorithms are developed by incorporating the graph Laplacian and structural constraints. The experimental results demonstrate that the proposed algorithms outperform the current state-of-the-art methods in terms of accuracy and computational efficiency.

310 citations


Journal ArticleDOI
TL;DR: A review of postevaluation studies conducted using the same dataset illustrates the rapid progress stemming from ASVspoof and outlines the need for further investigation.
Abstract: Concerns regarding the vulnerability of automatic speaker verification (ASV) technology against spoofing can undermine confidence in its reliability and form a barrier to exploitation. The absence of competitive evaluations and the lack of common datasets has hampered progress in developing effective spoofing countermeasures. This paper describes the ASV Spoofing and Countermeasures (ASVspoof) initiative, which aims to fill this void. Through the provision of a common dataset, protocols, and metrics, ASVspoof promotes a sound research methodology and fosters technological progress. This paper also describes the ASVspoof 2015 dataset, evaluation, and results with detailed analyses. A review of postevaluation studies conducted using the same dataset illustrates the rapid progress stemming from ASVspoof and outlines the need for further investigation. Priority future research directions are presented in the scope of the next ASVspoof evaluation planned for 2017.

177 citations


Journal ArticleDOI
TL;DR: This paper proposes the random frequency diverse array (RFDA) and two kinds of signal processing algorithms for it: matched filtering, and compressive sensing, because the new approach can be regarded as a sparse and random sampling of target information in the spatial-frequency domain.
Abstract: In this paper, we propose a new type of array antenna, termed the random frequency diverse array (RFDA), for an uncoupled indication of target direction and range with low system complexity. In RFDA, each array element has a narrow bandwidth and a randomly assigned carrier frequency. The beampattern of the array is shown to be stochastic but thumbtack-like, and its stochastic characteristics, such as the mean, variance, and asymptotic distribution are derived analytically. Based on these two features, we propose two kinds of algorithms for signal processing. One is matched filtering, due to the beampattern's good characteristics. The other is compressive sensing, because the new approach can be regarded as a sparse and random sampling of target information in the spatial-frequency domain. Fundamental limits, such as the Cramér–Rao bound and the observing matrix's mutual coherence, are provided as performance guarantees of the new array structure. The features and performances of RFDA are verified with numerical results.

172 citations


Journal ArticleDOI
TL;DR: The research progress of time/frequency modulated array studies is reviewed and the most recent advances are discussed, along with their technical challenges, especially in signal processing aspects.
Abstract: Time and frequency modulated arrays have numerous application areas including radar, navigation, and communications. Specifically, a time modulated array can create a beampattern with low sidelobes via connecting and disconnecting the antenna elements from the feed network, while the frequency modulated frequency diverse array produces a range-dependent pattern. In this paper, we aim to introduce these advanced arrays to the signal processing community so that more investigations in terms of theory, methods, and applications, can be facilitated. The research progress of time/frequency modulated array studies is reviewed and the most recent advances are discussed. Moreover, potential applications in radar and communications are presented, along with their technical challenges, especially in signal processing aspects.

170 citations


Journal ArticleDOI
Zhengfang Duanmu, Kai Zeng, Kede Ma, Abdul Rehman, Zhou Wang
TL;DR: This work builds a streaming video database and carries out a subjective user study to investigate the human responses to the combined effect of video compression, initial buffering, and stalling, and proposes a novel QoE prediction approach named Streaming QoE Index that accounts for the instantaneous quality degradation due to perceptual video presentation impairment, the playback stalling events, and the instantaneous interactions between them.
Abstract: With the rapid growth of streaming media applications, there has been a strong demand for quality-of-experience (QoE) measurement and QoE-driven video delivery technologies. Most existing methods rely on bitrate and global statistics of stalling events for QoE prediction. This is problematic for two reasons. First, using the same bitrate to encode different video content results in drastically different presentation quality. Second, the interactions between video presentation quality and playback stalling experiences are not accounted for. In this work, we first build a streaming video database and carry out a subjective user study to investigate the human responses to the combined effect of video compression, initial buffering, and stalling. We then propose a novel QoE prediction approach named Streaming QoE Index that accounts for the instantaneous quality degradation due to perceptual video presentation impairment, the playback stalling events, and the instantaneous interactions between them. Experimental results show that the proposed model is in close agreement with subjective opinions and significantly outperforms existing QoE models. The proposed model provides a highly effective and efficient means for QoE prediction in video streaming services.

144 citations


Journal ArticleDOI
TL;DR: A novel machine learning method based on latent factor models and probabilistic matrix factorization is proposed to discover course relevance, which is important for constructing efficient base predictors.
Abstract: Accurately predicting students’ future performance based on their ongoing academic records is crucial for effectively carrying out necessary pedagogical interventions to ensure students’ on-time and satisfactory graduation. Although there is a rich literature on predicting student performance when solving problems or studying for courses using data-driven approaches, predicting student performance in completing degrees (e.g., college programs) is much less studied and faces new challenges: 1) Students differ tremendously in terms of backgrounds and selected courses; 2) courses are not equally informative for making accurate predictions; and 3) students’ evolving progress needs to be incorporated into the prediction. In this paper, we develop a novel machine learning method for predicting student performance in degree programs that is able to address these key challenges. The proposed method has two major features. First, a bilayered structure comprising multiple base predictors and a cascade of ensemble predictors is developed for making predictions based on students’ evolving performance states. Second, a data-driven approach based on latent factor models and probabilistic matrix factorization is proposed to discover course relevance, which is important for constructing efficient base predictors. Through extensive simulations on an undergraduate student dataset collected over three years at University of California, Los Angeles, we show that the proposed method achieves superior performance to benchmark approaches.

140 citations


Journal ArticleDOI
TL;DR: A novel neural learning framework, termed Graph-CNNs, is proposed that can handle both homogeneous and heterogeneous graph data while retaining the benefits of traditional CNN successes.
Abstract: Convolutional neural networks (CNNs) have recently led to incredible breakthroughs on a variety of pattern recognition problems. Banks of finite-impulse response filters are learned on a hierarchy of layers, each contributing more abstract information than the previous layer. The simplicity and elegance of the convolutional filtering process makes them perfect for structured problems, such as image, video, or voice, where vertices are homogeneous in the sense of number, location, and strength of neighbors. The vast majority of classification problems, for example in the pharmaceutical, homeland security, and financial domains are unstructured. As these problems are formulated into unstructured graphs, the heterogeneity of these problems, such as number of vertices, number of connections per vertex, and edge strength, cannot be tackled with standard convolutional techniques. We propose a novel neural learning framework that is capable of handling both homogeneous and heterogeneous data while retaining the benefits of traditional CNN successes. Recently, researchers have proposed variations of CNNs that can handle graph data. In an effort to create learnable filter banks of graphs, these methods either induce constraints on the data or require preprocessing. As opposed to spectral methods, our framework, which we term Graph-CNNs, defines filters as polynomials of functions of the graph adjacency matrix. Graph-CNNs can handle both heterogeneous and homogeneous graph data, including graphs having entirely different vertex or edge sets. We perform experiments to validate the applicability of Graph-CNNs to a variety of structured and unstructured classification problems and demonstrate state-of-the-art results on document and molecule classification problems.
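
Illustrative sketch (not taken from the paper): the abstract above says the filters are defined as polynomials of functions of the graph adjacency matrix. The snippet below applies a plain polynomial in the adjacency matrix itself, y = Σ_k h_k A^k x, to a vertex signal. Taking the "function" to be the identity, using fixed rather than learned coefficients, and all names here are assumptions for illustration.

```python
def mat_vec(A, x):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def polynomial_graph_filter(A, x, coeffs):
    """Return sum_k coeffs[k] * A^k x without forming matrix powers."""
    y = [0.0] * len(x)
    power = list(x)  # A^0 x
    for h in coeffs:
        y = [yi + h * pi for yi, pi in zip(y, power)]
        power = mat_vec(A, power)  # advance to the next power of A applied to x
    return y


# Path graph on three vertices: 0 - 1 - 2.
A = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
x = [1.0, 0.0, 0.0]  # impulse on vertex 0

print(polynomial_graph_filter(A, x, [1.0, 0.5]))
print(polynomial_graph_filter(A, x, [1.0, 0.5, 0.25]))
```

A degree-K polynomial only mixes values from vertices at most K hops apart, so the degree plays the role that the kernel size plays in an ordinary convolution.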

Journal ArticleDOI
TL;DR: This paper investigates the genuine-spoofing discriminative ability from the back-end stage, utilizing recent advancements in deep-learning research, and proposes a novel spoofing detection system that simultaneously employs convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
Abstract: In this study, we explore the use of deep-learning approaches for spoofing detection in speaker verification. Most spoofing detection systems that have achieved recent success employ hand-crafted features with specific spoofing prior knowledge, which may limit the feasibility to unseen spoofing attacks. We aim to investigate the genuine-spoofing discriminative ability from the back-end stage, utilizing recent advancements in deep-learning research. In this paper, alternative network architectures are exploited to target spoofed speech. Based on this analysis, a novel spoofing detection system, which simultaneously employs convolutional neural networks (CNNs) and recurrent neural networks (RNNs) is proposed. In this framework, CNN is treated as a convolutional feature extractor applied on the speech input. On top of the CNN processed output, recurrent networks are employed to capture long-term dependencies across the time domain. Novel features including Teager energy operator critical band autocorrelation envelope, perceptual minimum variance distortionless response, and a more general spectrogram are also investigated as inputs to our proposed deep-learning frameworks. Experiments using the ASVspoof 2015 Corpus show that the integrated CNN–RNN framework achieves state-of-the-art single-system performance. The addition of score-level fusion further improves system robustness. A detailed analysis shows that our proposed approach can potentially compensate for the issue due to short duration test utterances, which is also an issue in the evaluation corpus.

Journal ArticleDOI
TL;DR: This paper addresses the general case of directed graphs and proposes an alternative approach that builds the graph Fourier basis as the set of orthonormal vectors that minimize a continuous extension of the graph cut size, known as the Lovász extension.
Abstract: The analysis of signals defined over a graph is relevant in many applications, such as social and economic networks, big data or biological networks, and so on. A key tool for analyzing these signals is the so-called graph Fourier transform (GFT). Alternative definitions of GFT have been suggested in the literature, based on the eigen-decomposition of either the graph Laplacian or adjacency matrix. In this paper, we address the general case of directed graphs and we propose an alternative approach that builds the graph Fourier basis as the set of orthonormal vectors that minimize a continuous extension of the graph cut size, known as the Lovász extension. To cope with the nonconvexity of the problem, we propose two alternative iterative optimization methods, properly devised for handling orthogonality constraints. Finally, we extend the method to minimize a continuous relaxation of the balanced cut size. The formulated problem is again nonconvex, and we propose an efficient solution method based on an explicit–implicit gradient algorithm.

Journal ArticleDOI
TL;DR: A survey of psychophysiology-based assessment for quality of experience (QoE) in advanced multimedia technologies provides a classification of methods relevant to QoE and describes related psychological processes, experimental design considerations, and signal analysis techniques.
Abstract: We present a survey of psychophysiology-based assessment for quality of experience (QoE) in advanced multimedia technologies. We provide a classification of methods relevant to QoE and describe related psychological processes, experimental design considerations, and signal analysis techniques. We summarize multimodal techniques and discuss several important aspects of psychophysiology-based QoE assessment, including the synergies with psychophysical assessment and the need for standardized experimental design. This survey is not considered to be exhaustive but serves as a guideline for those interested to further explore this emerging field of research.

Journal ArticleDOI
TL;DR: The covariance matrix of the received signals corresponding to all sensors and employed frequencies is formulated to generate a space-frequency virtual difference coarray, and a fast algorithm with a lower computational complexity based on the multitask Bayesian compressive sensing approach is developed.
Abstract: Different from conventional phased-array radars, the frequency diverse array (FDA) radar offers a range-dependent beampattern capability that is attractive in various applications. The spatial and range resolutions of an FDA radar are fundamentally limited by the array geometry and the frequency offset. In this paper, we overcome this limitation by introducing a novel sparsity-based multitarget localization approach incorporating both coprime arrays and coprime frequency offsets. The covariance matrix of the received signals corresponding to all sensors and employed frequencies is formulated to generate a space-frequency virtual difference coarray. By using $\mathcal{O}(M+N)$ antennas and $\mathcal{O}(M+N)$ frequencies, the proposed coprime arrays with coprime frequency offsets enable the localization of up to $\mathcal{O}(M^2N^2)$ targets with a resolution of $\mathcal{O}(1/(MN))$ in angle and range domains, where $M$ and $N$ are coprime integers. The joint direction-of-arrival (DOA) and range estimation is cast as a two-dimensional sparse reconstruction problem and is solved within the Bayesian compressive sensing framework. We also develop a fast algorithm with a lower computational complexity based on the multitask Bayesian compressive sensing approach. Simulation results demonstrate the superiority of the proposed approach in terms of DOA-range resolution, localization accuracy, and the number of resolvable targets.
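
Illustrative sketch (not taken from the paper): the abstract above relies on a virtual difference coarray built from coprime sensor positions and coprime frequency offsets. The snippet below shows the idea in one dimension only, using the textbook prototype coprime layout (N sensors spaced M units apart plus M sensors spaced N units apart, sharing the origin). The layout, the function names, and the unit spacing are assumptions; the paper's exact configuration is not given in the abstract.

```python
from math import gcd


def coprime_positions(M, N):
    """Sensor positions (in unit spacings) of a prototype coprime array."""
    if gcd(M, N) != 1:
        raise ValueError("M and N must be coprime")
    sub1 = {M * n for n in range(N)}  # N sensors spaced M units apart
    sub2 = {N * m for m in range(M)}  # M sensors spaced N units apart
    return sorted(sub1 | sub2)


def difference_coarray(positions):
    """All distinct pairwise position differences (virtual sensor lags)."""
    return sorted({p - q for p in positions for q in positions})


positions = coprime_positions(2, 3)
lags = difference_coarray(positions)
print(positions)  # physical sensors
print(lags)       # virtual lags available from second-order statistics
```

For M = 2 and N = 3, four physical sensors yield nine distinct lags, which is why covariance-based processing on the coarray can resolve more targets than there are physical elements. For larger M and N this prototype layout leaves holes in the lag set (for example, lag 7 is missing for M = 3, N = 4). The same construction applies to a set of coprime frequency offsets.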

Journal ArticleDOI
Kai Qiu, Xianghui Mao, Xinyue Shen, Xiaohan Wang, Tiejian Li, Yuantao Gu
TL;DR: A new batch reconstruction method of time-varying graph signals is proposed by exploiting the smoothness of the temporal difference signals, and the uniqueness of the solution to the corresponding optimization problem is theoretically analyzed.
Abstract: Signal processing on graphs is an emerging research field dealing with signals living on an irregular domain that is captured by a graph, and has been applied to sensor networks, machine learning, climate analysis, etc. Existing works on sampling and reconstruction of graph signals mainly studied static bandlimited signals. However, many real-world graph signals are time-varying, and they evolve smoothly, so instead of the signals themselves being bandlimited or smooth on graph, it is more reasonable that their temporal differences are smooth on graph. In this paper, a new batch reconstruction method of time-varying graph signals is proposed by exploiting the smoothness of the temporal difference signals, and the uniqueness of the solution to the corresponding optimization problem is theoretically analyzed. Furthermore, driven by practical applications faced with real-time requirements, huge size of data, lack of computing center, or communication difficulties between two nonneighboring vertices, an online distributed method is proposed by applying local properties of the temporal difference operator and the graph Laplacian matrix. Experiments on a variety of synthetic and real-world datasets demonstrate the excellent performance of the proposed methods.
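
Illustrative sketch (not taken from the paper): the abstract above argues that temporal differences of a time-varying graph signal can be smooth on the graph even when the signal itself is not. The snippet below measures smoothness with the graph Laplacian quadratic form, Σ_(i,j) w_ij (x_i − x_j)², applied to successive differences. The paper's actual reconstruction objective is not stated in the abstract; the function names and the toy data are assumptions.

```python
def laplacian_quadratic(edges, x):
    """x^T L x for an undirected weighted graph given as (i, j, w) edges."""
    return sum(w * (x[i] - x[j]) ** 2 for i, j, w in edges)


def temporal_difference_smoothness(edges, signals):
    """Sum of Laplacian quadratic forms of x_t - x_(t-1) over time."""
    total = 0.0
    for prev, cur in zip(signals, signals[1:]):
        diff = [c - p for c, p in zip(cur, prev)]
        total += laplacian_quadratic(edges, diff)
    return total


# Path graph 0 - 1 - 2 with unit weights.
edges = [(0, 1, 1.0), (1, 2, 1.0)]

# Each snapshot varies sharply across vertices, but every vertex
# increases by the same amount from one time step to the next.
signals = [
    [10.0, -5.0, 3.0],
    [11.0, -4.0, 4.0],
    [12.0, -3.0, 5.0],
]

print(laplacian_quadratic(edges, signals[0]))          # snapshot is not smooth
print(temporal_difference_smoothness(edges, signals))  # differences are smooth
```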

Journal ArticleDOI
TL;DR: An enhanced three-dimensional localization technique is proposed for the case with severe range ambiguity problem, which evidently reduces the dimensions of the processor and efficiently suppresses clutter in practical applications.
Abstract: High pulse repetition frequency incurs range ambiguity in radar systems, which in turn results in clutter suppression performance degradation and parameter estimation ambiguities. To tackle this issue, this paper proposes an adaptive range-angle-Doppler processing approach with airborne frequency diverse array (FDA) for multiple-input multiple-output (MIMO) radar. The FDA employs a small frequency increment across array elements and introduces additional controllable degrees-of-freedom (DOFs) in range dimension in the transmit antenna. Thus, it is able to perform range-angle-Doppler processing by exploiting the DOFs in transmit, receive, and pulse dimensions in the FDA-MIMO radar. By properly designing the frequency increment of the FDA, the clutter spectra of different ambiguous range regions can be discriminable in the transmit-receive spatial domains. As a result, multiple beams are formed in the transmit spatial, receive spatial, and Doppler domains and clutters from different range regions can be suppressed. An enhanced three-dimensional localization technique is proposed for the case with severe range ambiguity problem, which evidently reduces the dimensions of the processor and efficiently suppresses clutter in practical applications. Several numerical examples are presented to verify the effectiveness of the proposed approach.

Journal ArticleDOI
TL;DR: A novel method for predicting the evolution of a student's grade in massive open online courses (MOOCs) by incorporating another, richer form of data collected from each student into the machine learning feature set, and using that to train a time series neural network that learns from both prior performance and clickstream data.
Abstract: We present a novel method for predicting the evolution of a student's grade in massive open online courses (MOOCs). Performance prediction is particularly challenging in MOOC settings due to per-student assessment response sparsity and the need for personalized models. Our method overcomes these challenges by incorporating another, richer form of data collected from each student—lecture video-watching clickstreams—into the machine learning feature set, and using that to train a time series neural network that learns from both prior performance and clickstream data. Through evaluation on two MOOC datasets, we find that our algorithm outperforms a baseline of average past performance by more than 60% on average, and a lasso regression baseline by more than 15%. Moreover, the gains are higher when the student has answered fewer questions, underscoring their ability to provide instructors with early detection of struggling and/or advanced students. We also show that despite these gains, when taken alone, none of the behavioral features are particularly correlated with performance, emphasizing the need to consider their combined effect and nonlinear predictors. Finally, we discuss how course instructors can use these predictive learning analytics to stage student interventions.

Journal ArticleDOI
TL;DR: This paper proposes a unified architecture for end-to-end automatic speech recognition (ASR) to encompass microphone-array signal processing such as a state-of-the-art neural beamformer within the end-to-end framework, and demonstrates the effectiveness of the proposed method on multichannel ASR benchmarks in noisy environments.
Abstract: This paper proposes a unified architecture for end-to-end automatic speech recognition (ASR) to encompass microphone-array signal processing such as a state-of-the-art neural beamformer within the end-to-end framework. Recently, the end-to-end ASR paradigm has attracted great research interest as an alternative to conventional hybrid paradigms with deep neural networks and hidden Markov models. Using this novel paradigm, we simplify ASR architecture by integrating such ASR components as acoustic, phonetic, and language models with a single neural network and optimize the overall components for the end-to-end ASR objective: generating a correct label sequence. Although most existing end-to-end frameworks have mainly focused on ASR in clean environments, our aim is to build more realistic end-to-end systems in noisy environments. To handle such challenging noisy ASR tasks, we study multichannel end-to-end ASR architecture, which directly converts multichannel speech signal to text through speech enhancement. This architecture allows speech enhancement and ASR components to be jointly optimized to improve the end-to-end ASR objective and leads to an end-to-end framework that works well in the presence of strong background noise. We elaborate the effectiveness of our proposed method on the multichannel ASR benchmarks in noisy environments (CHiME-4 and AMI). The experimental results show that our proposed multichannel end-to-end system obtained performance gains over the conventional end-to-end baseline with enhanced inputs from a delay-and-sum beamformer (i.e., BeamformIT) in terms of character error rate. In addition, further analysis shows that our neural beamformer, which is optimized only with the end-to-end ASR objective, successfully learned a noise suppression function.
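
Illustrative sketch (not taken from the paper): the abstract above compares against a delay-and-sum beamformer baseline. The snippet below shows the basic delay-and-sum operation with known integer sample delays: each channel is advanced by its arrival delay and the aligned channels are averaged. Real tools such as BeamformIt also estimate the delays and weight the channels; the known delays, the function name, and the toy signals here are assumptions.

```python
def delay_and_sum(channels, delays):
    """Align channels by integer sample delays and average them.

    channels: list of equal-length sample lists, one per microphone.
    delays:   arrival delay of the source at each microphone, in samples.
    """
    n = len(channels[0])
    out = []
    for t in range(n):
        acc = 0.0
        count = 0
        for ch, d in zip(channels, delays):
            idx = t + d  # advance the channel to undo its delay
            if 0 <= idx < n:
                acc += ch[idx]
                count += 1
        # Average only over channels that have a sample at this position.
        out.append(acc / count if count else 0.0)
    return out


source = [0, 1, 0, -1, 0, 0]
mic0 = source                # source arrives with no delay
mic1 = [0, 0, 1, 0, -1, 0]   # same source, one sample later

print(delay_and_sum([mic0, mic1], [0, 1]))
```

Once the channels are aligned, the source adds coherently while uncorrelated noise on each microphone averages down, which is the enhancement the baseline provides before recognition.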

Journal ArticleDOI
TL;DR: A light field compression scheme based on a novel homography-based low-rank approximation method called HLRA, which shows substantial peak signal to noise ratio gain of the compression algorithm, as well as the accuracy of the proposed parameter prediction model, especially for real light fields.
Abstract: This paper describes a light field compression scheme based on a novel homography-based low-rank approximation method called HLRA. The HLRA method jointly searches for the set of homographies best aligning the light field views and for the low-rank approximation matrices. The light field views are aligned using either one global homography or multiple homographies depending on how much the disparity across views varies from one depth plane to the other. The light field low-rank representation is then compressed using high efficiency video coding (HEVC). The best pair of rank and quantization parameters of the coding scheme, for a given target bit rate, is predicted with a model defined as a function of light field disparity and texture features. The results are compared with those obtained by directly applying HEVC on the light field views restructured as a pseudovideo sequence. The experiments using different datasets show substantial peak signal to noise ratio (PSNR)-rate gain of our compression algorithm, as well as the accuracy of the proposed parameter prediction model, especially for real light fields. A scalable extension of the coding scheme is finally proposed.

Journal ArticleDOI
Chenghao Wang, Jingwei Xu, Guisheng Liao, Xuefei Xu, Yuhong Zhang
TL;DR: A range ambiguity resolution approach for HRWS SAR imaging using frequency diverse array (FDA), which employs a set of slightly different carrier frequencies, capable of distinguishing the range ambiguous echoes in the spatial frequency domain.
Abstract: In spaceborne synthetic aperture radar (SAR), it is a challenging problem to realize high resolution and wide swath imaging (HRWS) due to the conflict between Doppler and range ambiguities. To mitigate this conflict, a range ambiguity resolution approach for HRWS SAR imaging using frequency diverse array (FDA) is proposed in this paper. The FDA employs a set of slightly different carrier frequencies, each of which is emitted by an individual array element. Frequency diversity introduces wave-path difference among the array elements, thus resulting in the range-angle-dependent property of transmit steering vector. Utilizing the extra degrees-of-freedom in range domain, FDA is capable of distinguishing the range ambiguous echoes in the spatial frequency domain. In our approach, the range ambiguous echoes are compensated by range dependence compensation (RDC) technique in the transmit spatial frequency domain. In the sequel, the range ambiguous echoes are separated by using a series of transmit beamformers as the echoes from different range regions are discriminable. Finally, traditional imaging processing is performed on the reconstructed unambiguous data to achieve HRWS imaging. Simulation results have verified the effectiveness of the proposed approach.
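
The range-angle dependence of the FDA transmit steering vector can be illustrated numerically. The sketch below is not the authors' formulation: it assumes a uniform linear array, a linear per-element frequency offset `delta_f`, one-way propagation, and the common narrowband approximation in which element m has phase 2π·m·(d·sinθ/λ0 − Δf·r/c). All names and parameter values are hypothetical.

```python
import cmath
import math

C = 299_792_458.0  # speed of light in m/s


def fda_steering_vector(num_elements, f0, delta_f, spacing, range_m, angle_rad):
    """Approximate FDA transmit steering vector (assumed one-way, narrowband model)."""
    wavelength = C / f0
    vector = []
    for m in range(num_elements):
        # Angle term (as in a phased array) minus the range term that the
        # per-element frequency offset m * delta_f introduces.
        phase = 2.0 * math.pi * m * (
            spacing * math.sin(angle_rad) / wavelength - delta_f * range_m / C
        )
        vector.append(cmath.exp(1j * phase))
    return vector


def correlation(a, b):
    """Normalized magnitude of the inner product of two steering vectors."""
    inner = sum(x * y.conjugate() for x, y in zip(a, b))
    return abs(inner) / len(a)


# Hypothetical parameters, chosen only for illustration.
F0 = 10e9
DELTA_F = 3e3
SPACING = 0.5 * C / F0
R_NEAR = 20_000.0
R_FAR = R_NEAR + 0.5 * C / DELTA_F  # same angle, different range region

near = fda_steering_vector(8, F0, DELTA_F, SPACING, R_NEAR, 0.0)
far = fda_steering_vector(8, F0, DELTA_F, SPACING, R_FAR, 0.0)
# With delta_f = 0 the array degenerates to a phased array: no range dependence.
pa_near = fda_steering_vector(8, F0, 0.0, SPACING, R_NEAR, 0.0)
pa_far = fda_steering_vector(8, F0, 0.0, SPACING, R_FAR, 0.0)
```

Under these assumptions, `correlation(near, far)` is close to zero while `correlation(pa_near, pa_far)` equals one, which is the property that lets transmit beamformers discriminate echoes arriving from different range regions at the same angle.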

Journal ArticleDOI
Kartik Audhkhasi, Andrew Rosenberg, Abhinav Sethy, Bhuvana Ramabhadran, Brian Kingsbury
TL;DR: This E2E ASR-free KWS system performs respectably despite lacking a conventional ASR system and trains much faster.
Abstract: Conventional keyword search (KWS) systems for speech databases match the input text query to the set of word hypotheses generated by an automatic speech recognition (ASR) system from utterances in the database. Hence, such KWS systems attempt to solve the complex problem of ASR as a precursor. Training an ASR system itself is a time-consuming process requiring transcribed speech data. Our prior work presented an ASR-free end-to-end system that needed minimal supervision and trained significantly faster than an ASR-based KWS system. The ASR-free KWS system consisted of three subsystems. The first subsystem was a recurrent neural network based acoustic encoder that extracted a finite-dimensional embedding of the speech utterance. The second subsystem was a query encoder that produced an embedding of the input text query. The acoustic and query embeddings were input to a feedforward neural network that predicted whether the query occurred in the acoustic utterance or not. This paper extends our prior work in several ways. First, we significantly improve upon our previous ASR-free KWS results by nearly 20% relative through improvements to the acoustic encoder. Next, we show that it is possible to train the acoustic encoder on languages other than the language of interest with only a small drop in KWS performance. Finally, we attempt to predict the location of the detected keywords by training a location-sensitive KWS network.

Journal ArticleDOI
TL;DR: This paper aims at the evaluation of perceived visual quality of light field images and at comparing the performance of a few state-of-the-art algorithms for light field image compression, by means of a set of objective and subjective quality assessments.
Abstract: The recent advances in light field imaging, supported among others by the introduction of commercially available cameras, e.g., Lytro or Raytrix, are changing the ways in which visual content is captured and processed. Efficient storage and delivery systems for light field images must rely on compression algorithms. Several methods to compress light field images have been proposed recently. However, in-depth evaluations of compression algorithms have rarely been reported. This paper aims at the evaluation of perceived visual quality of light field images and at comparing the performance of a few state-of-the-art algorithms for light field image compression. First, a processing chain for light field image compression and decompression is defined for two typical use cases, professional and consumer. Then, five light field compression algorithms are compared by means of a set of objective and subjective quality assessments. An interactive methodology recently introduced by the authors, as well as a passive methodology, is used to perform these evaluations. The results provide a useful benchmark for future development of compression solutions for light field images.

Journal ArticleDOI
TL;DR: An example-based super-resolution algorithm for light fields is described, which allows the increase of the spatial resolution of the different views in a consistent manner across all subaperture images of the light field.
Abstract: Light field imaging has emerged as a very promising technology in the field of computational photography. Cameras are becoming commercially available for capturing real-world light fields. However, capturing high spatial resolution light fields remains technologically challenging, and the images rendered from real light fields have today a significantly lower spatial resolution compared to traditional two-dimensional (2-D) cameras. This paper describes an example-based super-resolution algorithm for light fields, which allows the increase of the spatial resolution of the different views in a consistent manner across all subaperture images of the light field. The algorithm learns linear projections between subspaces of reduced dimension in which reside patch-volumes extracted from the light field. The method is extended to cope with angular super-resolution, where 2-D patches of intermediate subaperture images are approximated from neighboring subaperture images using multivariate ridge regression. Experimental results show significant quality improvement when compared to state-of-the-art single-image super-resolution methods applied on each view separately, as well as when compared to a recent light field super-resolution technique based on deep learning.

Journal ArticleDOI
TL;DR: The present paper broadens the kernel-based graph function estimation framework to reconstruct time-evolving functions over possibly time-varying topologies, and includes a novel kernel Kalman filter, developed to reconstruct space-time functions at affordable computational cost.
Abstract: Graph-based methods pervade the inference toolkits of numerous disciplines including sociology, biology, neuroscience, physics, chemistry, and engineering. A challenging problem encountered in this context pertains to determining the attributes of a set of vertices given those of another subset at possibly different time instants. Leveraging spatiotemporal dynamics can drastically reduce the number of observed vertices, and hence the sampling cost. Alleviating the limited flexibility of the existing approaches, the present paper broadens the kernel-based graph function estimation framework to reconstruct time-evolving functions over possibly time-evolving topologies. This approach inherits the versatility and generality of kernel-based methods, for which no knowledge on distributions or second-order statistics is required. Systematic guidelines are provided to construct two families of space-time kernels with complementary strengths: the first facilitates judicious control of regularization on a space-time frequency plane, whereas the second accommodates time-varying topologies. Batch and online estimators are also put forth. The latter comprise a novel kernel Kalman filter, developed to reconstruct space-time functions at affordable computational cost. Numerical tests with real datasets corroborate the merits of the proposed methods relative to competing alternatives.

Journal ArticleDOI
TL;DR: The synthesis of frequency diverse arrays able to achieve time-invariant spatial focusing performance in the short-range is addressed and two time-modulated optimized frequency offset schemes are presented.
Abstract: The synthesis of frequency diverse arrays (FDAs) able to achieve time-invariant spatial focusing performance in the short-range is addressed. Two time-modulated optimized frequency offset schemes are presented that allow focusing the energy at a single target location and multiple target locations, respectively. A set of numerical examples is reported and discussed to validate the effectiveness of the proposed solutions also in comparison with the already proposed FDA design approaches.

Journal ArticleDOI
TL;DR: This work systematically studies the neural machine translation context vectors, i.e., output of the encoder, and their power as an interlingua representation of a sentence, and assesses their quality and effectiveness by measuring similarities across translations, as well as semantically related and semantically unrelated sentence pairs.
Abstract: End-to-end neural machine translation has overtaken statistical machine translation in terms of translation quality for some language pairs, especially those with large amounts of parallel data. Besides this palpable improvement, neural networks provide several new properties. A single system can be trained to translate between many languages at almost no additional cost other than training time. Furthermore, internal representations learned by the network serve as a new semantic representation of words—or sentences—which, unlike standard word embeddings, are learned in an essentially bilingual or even multilingual context. In view of these properties, the contribution of the present paper is twofold. First, we systematically study the neural machine translation (NMT) context vectors, i.e., output of the encoder, and their power as an interlingua representation of a sentence. We assess their quality and effectiveness by measuring similarities across translations, as well as semantically related and semantically unrelated sentence pairs. Second, as extrinsic evaluation of the first point, we identify parallel sentences in comparable corpora, obtaining an $F_1=98.2\%$ on data from a shared task when using only NMT context vectors. Using context vectors jointly with similarity measures, $F_1$ reaches $98.9\%$.
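
Measuring similarity between sentence representations, as described above, can be sketched as follows. The abstract does not say how per-token context vectors are combined or which similarity measure is used, so mean pooling and cosine similarity are assumptions here, and the vectors are made-up toy values rather than encoder outputs.

```python
import math


def mean_pool(context_vectors):
    """Average per-token context vectors into one sentence vector (assumed pooling)."""
    dim = len(context_vectors[0])
    count = len(context_vectors)
    return [sum(vec[i] for vec in context_vectors) / count for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_match(source_vec, candidates):
    """Index of the candidate sentence vector most similar to source_vec."""
    scores = [cosine_similarity(source_vec, cand) for cand in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data standing in for encoder outputs of one source sentence
# and three candidate target sentences.
source_sentence = mean_pool([[1.0, 0.0, 1.0], [3.0, 0.0, 1.0]])
candidate_sentences = [[0.0, 1.0, 0.0], [4.0, 0.0, 2.0], [1.0, 1.0, 1.0]]
match_index = best_match(source_sentence, candidate_sentences)
```

In a parallel-sentence mining setting, a pair would typically be accepted only if its similarity exceeds a tuned threshold; that threshold is not specified in the abstract.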

Journal ArticleDOI
TL;DR: Extensive numerical evaluations illustrate that the proposed spatio-temporal precoder-based multiantenna waveform design can facilitate good multiuser link performance, despite the extremely simple 1-bit ADCs in the receivers, hence being one possible enabling technology for the future low-complexity IoT networks.
Abstract: Internet-of-Things (IoT) refers to a high-density network of low-cost, low-bitrate terminals and sensors where low energy consumption is also one central feature. As the power budget of classical receiver chains is dominated by the high-resolution analog-to-digital converters (ADCs), there is a growing interest toward deploying receiver architectures with reduced bit or even 1-bit ADCs. In this paper, we study waveform design, optimization, and detection aspects of multiuser massive MIMO downlink, where user terminals adopt very simple 1-bit ADCs with oversampling. In order to achieve spectral efficiency higher than 1 bit/s/Hz per real dimension, and per receiver antenna, we propose a two-stage precoding structure, namely, a novel quantization precoder followed by maximum-ratio transmission or zero-forcing-type spatial channel precoder which jointly form the multiuser multiantenna transmit waveform. The quantization precoder outputs are designed and optimized, under appropriate transmitter and receiver filter bandwidth constraints, to provide controlled intersymbol interference enabling the input symbols to be uniquely detected from 1-bit quantized observations with a low-complexity symbol detector in the absence of noise. An additional optimization constraint is also imposed in the quantization precoder design to increase the robustness against noise and residual interuser interference (IUI). The purpose of the spatial channel precoder, in turn, is to suppress the IUI and provide high beamforming gains such that good symbol error rates can be achieved in the presence of noise and interference. Extensive numerical evaluations illustrate that the proposed spatio-temporal precoder-based multiantenna waveform design can facilitate good multiuser link performance, despite the extremely simple 1-bit ADCs in the receivers, hence being one possible enabling technology for the future low-complexity IoT networks.
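
The roles of the spatial channel precoder and the 1-bit ADC can be illustrated with a toy single-user example. This sketch covers only maximum-ratio transmission followed by symbol-rate 1-bit quantization; it does not implement the paper's quantization precoder or oversampling, and the channel values and function names are hypothetical.

```python
import math


def mrt_precoder(channel):
    """Maximum-ratio transmission weights for one user: w = conj(h) / ||h||."""
    norm = math.sqrt(sum(abs(h) ** 2 for h in channel))
    return [h.conjugate() / norm for h in channel]


def receive(channel, weights, symbol, noise=0j):
    """Scalar observation at a single-antenna user after precoding."""
    return sum(h * w for h, w in zip(channel, weights)) * symbol + noise


def one_bit_adc(sample):
    """Keep only the signs of the in-phase and quadrature components."""
    real = 1.0 if sample.real >= 0 else -1.0
    imag = 1.0 if sample.imag >= 0 else -1.0
    return complex(real, imag)


# Hypothetical 4-antenna channel and a QPSK symbol.
h = [0.3 + 0.4j, -0.5 + 0.1j, 0.2 - 0.7j, 0.6 + 0.2j]
w = mrt_precoder(h)
qpsk_symbol = complex(1.0, -1.0) / math.sqrt(2.0)
observed = receive(h, w, qpsk_symbol, noise=0.05 - 0.05j)
detected = one_bit_adc(observed)
```

Because MRT makes the effective channel gain real and positive (equal to the channel norm), the signs of the received sample match those of the transmitted QPSK symbol. With symbol-rate sampling the receiver recovers at most one bit per real dimension, which is the limit the oversampled design in the abstract aims to exceed.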

Journal ArticleDOI
TL;DR: Zhang et al. proposed a pseudo-sequence-based two-dimensional (2-D) hierarchical coding structure for light-field image compression, which decomposes the light-field image into multiple views and organizes them into a 2-D coding structure according to the spatial coordinates of the corresponding microlens.
Abstract: In this paper, we propose a pseudo-sequence-based two-dimensional (2-D) hierarchical coding structure for light-field image compression. In the proposed scheme, we first decompose the light-field image into multiple views and organize them into a 2-D coding structure according to the spatial coordinates of the corresponding microlens. Then, we develop three algorithms to optimize the 2-D coding structure. First, we propose a 2-D hierarchical coding structure with a limited number of reference frames to exploit the intercorrelations among various views. To be more specific, we divide all the views into four quadrants, and all the views are encoded one quadrant after another to reduce the reference buffer size as much as possible. Inside each quadrant, all the views are encoded hierarchically in both horizontal and vertical directions to fully exploit the correlations among different views. Second, we propose to use the distance between the current view and its reference views instead of the picture order count difference as the criterion for selecting better reference frames for each inter view. The distance-based criterion is also applied to the motion vector scaling process to obtain more accurate motion vector predictors. Third, an optimal bit allocation algorithm that takes into account the influence of each view on the subsequently encoded views is proposed to further exploit the intercorrelations among various views and improve coding efficiency. The entire scheme is implemented in the reference software of high efficiency video coding. The experimental results demonstrate that the proposed novel pseudo-sequence-based 2-D hierarchical structure can achieve a maximum of 28.4% bit-rate savings compared with the previous pseudo-sequence-based light-field image compression method.
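
The distance-based criterion described above can be sketched as follows. This is a simplified illustration, not the codec implementation: views are identified by hypothetical `(row, col)` grid coordinates, Euclidean distance is assumed as the distance measure, and motion vector scaling is reduced to multiplying by a ratio of view distances in place of a ratio of picture order count differences.

```python
import math


def view_distance(view_a, view_b):
    """Euclidean distance between two views given as (row, col) coordinates."""
    return math.hypot(view_a[0] - view_b[0], view_a[1] - view_b[1])


def select_references(current, encoded_views, max_refs):
    """Pick the max_refs already-encoded views closest to the current view.

    sorted() is stable, so equally distant views keep their encoding order.
    """
    ranked = sorted(encoded_views, key=lambda view: view_distance(current, view))
    return ranked[:max_refs]


def scale_motion_vector(mv, dist_current_ref, dist_neighbor_ref):
    """Scale a neighbor's motion vector by the ratio of view distances (assumed rule)."""
    factor = dist_current_ref / dist_neighbor_ref
    return (mv[0] * factor, mv[1] * factor)


# Toy example: view (2, 2) is being encoded after four other views.
already_encoded = [(0, 0), (2, 0), (2, 1), (1, 2)]
references = select_references((2, 2), already_encoded, max_refs=2)
predictor = scale_motion_vector((4.0, -2.0), dist_current_ref=1.0, dist_neighbor_ref=2.0)
```

In a one-dimensional pseudo-sequence the picture order count says little about how far apart two views are on the microlens grid, which is why a geometric distance can be a better proxy for inter-view correlation.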

Journal ArticleDOI
TL;DR: The results show that the countermeasures based on the proposed features outperform other spectral features for both known and unknown attacks.
Abstract: Recent advancements in voice conversion (VC) and speech synthesis (SS) research make speech-based biometric systems highly prone to spoofing attacks. This can provoke an increase in false acceptance rate in such systems and requires countermeasures to mitigate such spoofing attacks. In this paper, we first study the characteristics of synthetic speech vis-a-vis natural speech and then propose a set of novel short-term spectral features that can efficiently capture the discriminative information between them. The proposed features are computed using inverted frequency warping scale and overlapped block transformation of filter bank log energies. Our study presents a detailed analysis of antispoofing performance with respect to the variations in the warping scale for inverted frequency and block size for the block transform. For performance analysis, a Gaussian mixture model (GMM) based synthetic speech detector is used as a classifier on a stand-alone basis and also integrated with automatic speaker verification (ASV) systems. For ASV systems, standard mel-frequency cepstral coefficients are used as features while GMM with universal background model and i-vector are used as classifiers. The experiments are conducted on ten different kinds of synthetic data from the ASVspoof 2015 corpus. The results show that the countermeasures based on the proposed features outperform other spectral features for both known and unknown attacks. An average equal error rate (EER) of 0.00% has been achieved for nine attacks that use VC or SS speech, and the best performance of 7.12% EER is achieved for the remaining natural speech concatenation-based spoofing attack.
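
For readers unfamiliar with the metric, the equal error rate reported above can be approximated from detector scores as sketched below. This is a generic illustration with made-up scores, not the evaluation code used in the paper; it assumes that a higher score means "more likely natural speech" and takes the EER at the threshold where false acceptance and false rejection rates are closest.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate EER and its threshold from two lists of detector scores."""
    best = None
    for threshold in sorted(set(genuine_scores) | set(spoof_scores)):
        # False acceptance: a spoofed trial scored at or above the threshold.
        far = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
        # False rejection: a genuine trial scored below the threshold.
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, threshold)
    return best[1], best[2]


# Made-up scores: one genuine trial overlaps with the spoofed ones.
overlapping_eer, overlapping_threshold = equal_error_rate(
    [0.9, 0.8, 0.7, 0.35], [0.1, 0.2, 0.3, 0.4]
)
# Perfectly separable scores give an EER of zero.
separable_eer, _ = equal_error_rate([0.9, 0.8], [0.1, 0.2])
```

An EER of 0.00% therefore means that some threshold separates every genuine trial from every spoofed trial in the evaluated set.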