Showing papers in "IEEE Transactions on Multimedia" in 2019
TL;DR: Single image super-resolution (SISR), a notoriously challenging ill-posed problem that aims to obtain a high-resolution output from one of its low-resolution versions, has recently seen powerful deep learning algorithms achieve state-of-the-art performance; this survey reviews representative deep learning-based SISR methods.
Abstract: Single image super-resolution (SISR) is a notoriously challenging ill-posed problem that aims to obtain a high-resolution output from one of its low-resolution versions. Recently, powerful deep learning algorithms have been applied to SISR and have achieved state-of-the-art performance. In this survey, we review representative deep learning-based SISR methods and group them into two categories according to their contributions to two essential aspects of SISR: The exploration of efficient neural network architectures for SISR and the development of effective optimization objectives for deep SISR learning. For each category, a baseline is first established, and several critical limitations of the baseline are summarized. Then, representative works on overcoming these limitations are presented based on their original content, as well as our critical exposition and analyses, and relevant comparisons are conducted from a variety of perspectives. Finally, we conclude this review with some current challenges and future trends in SISR that leverage deep learning algorithms.
528 citations
TL;DR: This work develops a continuous sign language (SL) recognition framework with deep neural networks, which directly transcribes videos of SL sentences to sequences of ordered gloss labels; the proposed architecture adopts deep convolutional neural networks with stacked temporal fusion layers as the feature extraction module.
Abstract: This work develops a continuous sign language (SL) recognition framework with deep neural networks, which directly transcribes videos of SL sentences to sequences of ordered gloss labels. Previous methods dealing with continuous SL recognition usually employ hidden Markov models with limited capacity to capture the temporal information. In contrast, our proposed architecture adopts deep convolutional neural networks with stacked temporal fusion layers as the feature extraction module, and bidirectional recurrent neural networks as the sequence learning module. We propose an iterative optimization process for our architecture to fully exploit the representation capability of deep neural networks with limited data. We first train the end-to-end recognition model for alignment proposal, and then use the alignment proposal as strong supervisory information to directly tune the feature extraction module. This training process can run iteratively to achieve improvements on the recognition performance. We further contribute by exploring the multimodal fusion of RGB images and optical flow in sign language. Our method is evaluated on two challenging SL recognition benchmarks, and outperforms the state of the art by a relative improvement of more than 15% on both databases.
229 citations
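To make the pipeline above concrete, the following is a minimal sketch of a per-frame CNN feeding stacked temporal fusion (1-D convolution) layers and a bidirectional recurrent sequence module. The layer sizes, the tiny CNN stand-in, and the LSTM choice are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a CNN + temporal-fusion + bidirectional-RNN recognizer.
# All dimensions and the backbone are illustrative assumptions.
import torch
import torch.nn as nn

class FrameFeatureExtractor(nn.Module):
    """2D CNN per frame followed by 1D temporal convolutions (temporal fusion)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.cnn = nn.Sequential(                      # tiny stand-in for a deep CNN
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.temporal = nn.Sequential(                 # stacked temporal fusion layers
            nn.Conv1d(64, feat_dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
        )

    def forward(self, frames):                         # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        x = self.cnn(frames.flatten(0, 1)).flatten(1)  # (B*T, 64)
        x = x.view(b, t, -1).transpose(1, 2)           # (B, 64, T)
        return self.temporal(x).transpose(1, 2)        # (B, T', feat_dim)

class SignRecognizer(nn.Module):
    """Bidirectional recurrent sequence module over fused frame features."""
    def __init__(self, feat_dim=512, hidden=256, num_glosses=1000):
        super().__init__()
        self.extractor = FrameFeatureExtractor(feat_dim)
        self.rnn = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_glosses)

    def forward(self, frames):
        feats = self.extractor(frames)
        out, _ = self.rnn(feats)
        return self.classifier(out)                    # per-step gloss logits

logits = SignRecognizer()(torch.randn(2, 16, 3, 112, 112))  # shape (2, 8, 1000)
```

In the iterative training described in the abstract, per-step gloss alignments produced by such a model would then be fed back as supervision to tune the feature extraction module.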
TL;DR: A low-rank representation model is employed to learn a shared sample representation coefficient matrix to generate the affinity graph, while diversity regularization is used to learn the optimal weights for each view, which can suppress the redundancy and enhance the diversity among different feature views.
Abstract: With the ability to exploit the internal structure of data, graph-based models have received a lot of attention and have achieved great success in multiview subspace clustering for multimedia data. Most of the existing methods individually construct an affinity graph for each single view and fuse the result obtained from each single graph. However, the common representation shared by different views and the complementary diversity across these views are not efficiently exploited. In addition, noise and outliers are often mixed in original data, which adversely degrades the clustering performance of many existing methods. In this paper, we propose addressing these issues by learning a joint affinity graph for multiview subspace clustering based on a low-rank representation with diversity regularization and a rank constraint. Specifically, a low-rank representation model is employed to learn a shared sample representation coefficient matrix to generate the affinity graph. At the same time, we use diversity regularization to learn the optimal weights for each view, which can suppress the redundancy and enhance the diversity among different feature views. In addition, the cluster number is used to promote affinity graph learning by using a rank constraint. The final clustering result is obtained by using normalized cuts on the learned affinity graph. An efficient algorithm based on an augmented Lagrangian multiplier with alternating direction minimization is carefully designed to solve the resulting optimization problem. Extensive experiments on various real-world datasets are conducted, and the results clearly demonstrate the effectiveness of the proposed algorithm.
194 citations
TL;DR: The proposed Siamese model is end-to-end trainable to jointly learn comparable hidden representations for paired pedestrian videos and their similarity value, outperforming state-of-the-art methods.
Abstract: Video-based person re-identification (re-id) is a central application in surveillance systems with a significant concern in security. Matching persons across disjoint camera views in their video fragments is inherently challenging due to the large visual variations and uncontrolled frame rates. There are two steps crucial to person re-id, namely, discriminative feature learning and metric learning. However, existing approaches consider the two steps independently, and they do not make full use of the temporal and spatial information in the videos. In this paper, we propose a Siamese attention architecture that jointly learns spatiotemporal video representations and their similarity metrics. The network extracts local convolutional features from regions of each frame and enhances their discriminative capability by focusing on distinct regions when measuring the similarity with another pedestrian video. The attention mechanism is embedded into spatial gated recurrent units to selectively propagate relevant features and memorize their spatial dependencies through the network. The model essentially learns which parts (where) from which frames (when) are relevant and distinctive for matching persons and attaches higher importance therein. The proposed Siamese model is end-to-end trainable to jointly learn comparable hidden representations for paired pedestrian videos and their similarity value. Extensive experiments on three benchmark datasets show the effectiveness of each component of the proposed deep network while outperforming state-of-the-art methods.
190 citations
TL;DR: A hybrid deep-learning-based anomaly detection scheme for suspicious flow detection in the context of social multimedia is proposed to enhance the reliability of software-defined networks (SDNs).
Abstract: The continuous development and usage of multimedia-based applications and services have contributed to the exponential growth of social multimedia traffic. In this context, secure transmission of data plays a critical role in realizing all of the key requirements of social multimedia networks such as reliability, scalability, quality of information, and quality of service (QoS). Thus, a trust-based paradigm for multimedia analytics is highly desired to meet the increasing user requirements and deliver more timely and actionable insights. In this regard, software-defined networks (SDNs) play a vital role; however, several factors, such as runtime security and energy-aware networking, limit their capabilities to facilitate efficient network control and management. Thus, with a view to enhancing the reliability of the SDN, a hybrid deep-learning-based anomaly detection scheme for suspicious flow detection in the context of social multimedia is proposed. It consists of the following two modules: 1) an anomaly detection module that leverages an improved restricted Boltzmann machine and a gradient descent-based support vector machine to detect abnormal activities, and 2) an end-to-end data delivery module to satisfy the strict QoS requirements of the SDN, that is, high bandwidth and low latency. Finally, the proposed scheme has been experimentally evaluated on both real-time and benchmark datasets to prove its effectiveness and efficiency in terms of anomaly detection and data delivery essential for social multimedia. Further, a large-scale analysis over a Carnegie Mellon University (CMU)-based insider threat dataset has been conducted to identify its performance in terms of detecting malicious events such as identity theft, profile cloning, and confidential data collection.
189 citations
TL;DR: This work learns a cross-modality bridging dictionary for the deep and complete understanding of a vast quantity of web images and proposes a knowledge-based concept transferring algorithm to discover the underlying relations of different categories.
Abstract: The understanding of web images has been a hot research topic in both artificial intelligence and multimedia content analysis domains. The web images are composed of various complex foregrounds and backgrounds, which makes the design of an accurate and robust learning algorithm a challenging task. To solve this significant problem, first, we learn a cross-modality bridging dictionary for the deep and complete understanding of a vast quantity of web images. The proposed algorithm maps the visual features into the semantic concept probability distribution, which can construct a global semantic description for images while preserving the local geometric structure. To discover and model the occurrence patterns between intra- and inter-categories, multi-task learning is introduced for formulating the objective with a Capped-$\ell_{1}$ penalty, which can obtain the optimal solution with a higher probability and outperform the traditional convex function-based methods. Second, we propose a knowledge-based concept transferring algorithm to discover the underlying relations of different categories. This distribution probability transfer among categories can bring a more robust global feature representation and enable the image semantic representation to generalize better as the scenario becomes larger. Experimental comparisons and performance discussion with classical methods on the ImageNet, Caltech-256, SUN397, and Scene15 datasets show the effectiveness of our proposed method at three traditional image understanding tasks.
169 citations
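For reference, the Capped-$\ell_{1}$ penalty mentioned above is commonly defined (in generic notation that may differ from the paper's) as

$$P_{\varepsilon}(\mathbf{w}) = \sum_{i} \min\left(|w_{i}|,\ \varepsilon\right),$$

which behaves like the convex $\ell_{1}$ norm for small coefficients but caps the contribution of large ones at the threshold $\varepsilon$, reducing the estimation bias of plain $\ell_{1}$ regularization at the cost of convexity.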
TL;DR: A novel method, named deep comprehensive multipatches aggregation convolutional neural networks (CNNs), is proposed to solve the FER problem; it is a deep framework that mainly consists of two CNN branches.
Abstract: Facial expression recognition (FER) has long been a challenging task in computer vision. In this paper, we propose a novel method, named deep comprehensive multipatches aggregation convolutional neural networks (CNNs), to solve the FER problem. The proposed method is a deep framework that mainly consists of two CNN branches. One branch extracts local features from image patches while the other extracts holistic features from the whole expressional image. In the model, local features depict expressional details and holistic features characterize the high-level semantic information of an expression. We aggregate both local and holistic features before making classification. These two types of hierarchical features represent expressions in different scales. Compared with most current methods with a single type of feature, the model can represent expressions more comprehensively. Additionally, in the training stage, a novel pooling strategy named expressional transformation-invariant pooling is proposed for handling nuisance variations, such as rotations and noise. Extensive experiments are conducted on the widely used Extended Cohn-Kanade (CK+) and Japanese Female Facial Expression (JAFFE) expression datasets, and the recognition results are reported.
157 citations
TL;DR: A framework based on scene graphs for image captioning that leverages both visual features and semantic knowledge in structured scene graphs and introduces a hierarchical-attention-based module to learn discriminative features for word generation at each time step.
Abstract: Automatically describing the content of an image has been attracting considerable research attention in the multimedia field. To represent the content of an image, many approaches directly utilize convolutional neural networks (CNNs) to extract visual representations, which are fed into recurrent neural networks to generate natural language. Recently, some approaches have detected semantic concepts from images and then encoded them into high-level representations. Although substantial progress has been achieved, most of the previous methods treat entities in images individually, thus lacking structured information that provides important cues for image captioning. In this paper, we propose a framework based on scene graphs for image captioning. Scene graphs contain abundant structured information because they not only depict object entities in images but also present pairwise relationships. To leverage both visual features and semantic knowledge in structured scene graphs, we extract CNN features from the bounding box offsets of object entities for visual representations, and extract semantic relationship features from triples (e.g., man riding bike ) for semantic representations. After obtaining these features, we introduce a hierarchical-attention-based module to learn discriminative features for word generation at each time step. The experimental results on benchmark datasets demonstrate the superiority of our method compared with several state-of-the-art methods.
150 citations
TL;DR: This paper devises location-customized caching schemes to maximize the total content hit rate, and demonstrates that the proposed algorithms can be applied to scenarios with different noise features and are able to make adaptive caching decisions, achieving a content hit rate comparable to that of the hindsight optimal strategy.
Abstract: Mobile edge caching aims to enable content delivery within the radio access network, which effectively alleviates the backhaul burden and reduces response time. To fully exploit edge storage resources, the most popular contents should be identified and cached. Observing that user demands on certain contents vary greatly at different locations, this paper devises location-customized caching schemes to maximize the total content hit rate. Specifically, a linear model is used to estimate the future content hit rate. For the case with zero-mean noise, a ridge regression-based online algorithm with positive perturbation is proposed. Regret analysis indicates that the hit rate achieved by the proposed algorithm asymptotically approaches that of the optimal caching strategy in the long run. When the noise structure is unknown, an $H_{\infty }$ filter-based online algorithm is devised by taking a prescribed threshold as input, which guarantees prediction accuracy even under the worst-case noise process. Both online algorithms require no training phases and, hence, are robust to the time-varying user demands. The estimation errors of both algorithms are numerically analyzed. Moreover, extensive experiments using real-world datasets are conducted to validate the applicability of the proposed algorithms. It is demonstrated that those algorithms can be applied to scenarios with different noise features, and are able to make adaptive caching decisions, achieving a content hit rate that is comparable to that via the hindsight optimal strategy.
143 citations
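The following numpy sketch illustrates the general flavor of such an online, regression-based caching decision: a ridge estimate of per-content hit rates plus a positive (optimistic) perturbation that encourages exploration. The feature construction, the perturbation form, and the update rule are illustrative assumptions, not the paper's exact algorithm.

```python
# Sketch of an online ridge-regression popularity estimator with a positive
# perturbation term, in the spirit of the caching scheme above.
import numpy as np

class OnlineRidgeCache:
    def __init__(self, dim, cache_size, lam=1.0, alpha=0.5):
        self.A = lam * np.eye(dim)      # regularized Gram matrix
        self.b = np.zeros(dim)          # feature/response accumulator
        self.cache_size = cache_size
        self.alpha = alpha              # strength of the optimistic perturbation

    def select(self, features):
        """features: (num_contents, dim) context vector for each candidate content."""
        theta = np.linalg.solve(self.A, self.b)            # ridge estimate
        A_inv = np.linalg.inv(self.A)
        bonus = np.sqrt(np.einsum('nd,dk,nk->n', features, A_inv, features))
        scores = features @ theta + self.alpha * bonus     # predicted hits + perturbation
        return np.argsort(scores)[-self.cache_size:]       # cache top-scoring contents

    def update(self, features, hits, cached_ids):
        """Observe realized hit counts for the cached contents and update the model."""
        for i in cached_ids:
            x = features[i]
            self.A += np.outer(x, x)
            self.b += hits[i] * x
```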
TL;DR: A data embedding method (PBTL–DE) is proposed to embed secret data into an image by exploiting spatial redundancy within small image blocks, and it is then extended to a PBTL-based reversible data hiding method in encrypted images (PBTL–RDHEI).
Abstract: This paper first introduces a parametric binary tree labeling scheme (PBTL) to label image pixels in two different categories. Using PBTL, a data embedding method (PBTL–DE) is proposed to embed secret data into an image by exploiting spatial redundancy within small image blocks. We then apply PBTL–DE to the encrypted domain and propose a PBTL-based reversible data hiding method in encrypted images (PBTL–RDHEI). PBTL–RDHEI is a separable and reversible method in which both the original image and the secret data can be recovered and extracted losslessly and independently. Experimental results and analysis show that PBTL–RDHEI is able to achieve average embedding rates as large as 1.752 bpp and 2.003 bpp when the block size is set to $2\times 2$ and $3\times 3$, respectively.
131 citations
TL;DR: Results demonstrate that the proposed FuseGAN produces accurate decision maps for focus regions in multi-focus images, such that the fused images are superior to those of 11 recent state-of-the-art algorithms, not only in visual perception but also in quantitative analysis in terms of five metrics.
Abstract: We study the problem of multi-focus image fusion, where the key challenge is detecting the focused regions accurately among multiple partially focused source images. Inspired by the success of the conditional generative adversarial network (cGAN) on image-to-image tasks, we propose a novel FuseGAN to perform the images-to-image mapping for multi-focus image fusion. To satisfy the requirement of dual input-to-one output, the encoder of the generator in FuseGAN is designed as a Siamese network. The least square GAN objective is employed to enhance the training stability of FuseGAN, resulting in an accurate confidence map for focus region detection. Also, we exploit the convolutional conditional random fields technique on the confidence map to reach a refined final decision map for better focus region detection. Moreover, due to the lack of a large-scale standard dataset, we synthesize a sufficiently large multi-focus image dataset based on the public natural image dataset PASCAL VOC 2012, where we utilize a normalized disk point spread function to simulate the defocus and separate the background and foreground in the synthesis for each image. We conduct extensive experiments on two public datasets to verify the effectiveness of the proposed method. Results demonstrate that the proposed method presents accurate decision maps for focus regions in multi-focus images, such that the fused images are superior to those of 11 recent state-of-the-art algorithms, not only in visual perception, but also in quantitative analysis in terms of five metrics.
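For context, the least-squares GAN objective referred to above replaces the usual cross-entropy adversarial loss with squared errors (conditioning of $D$ and $G$ on the source image pair is omitted here for brevity):

$$\min_{D}\ \tfrac{1}{2}\,\mathbb{E}_{x\sim p_{\mathrm{data}}}\big[(D(x)-1)^{2}\big] + \tfrac{1}{2}\,\mathbb{E}_{z}\big[D(G(z))^{2}\big], \qquad \min_{G}\ \tfrac{1}{2}\,\mathbb{E}_{z}\big[(D(G(z))-1)^{2}\big].$$

Penalizing samples by their squared distance from the real/fake targets still yields gradients for samples the discriminator classifies confidently, which is what improves training stability.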
TL;DR: A novel multiview-based network architecture that combines convolutional neural networks with long short-term memory (LSTM) to exploit the correlative information from multiple views for 3-D shape recognition and retrieval is proposed.
Abstract: Shape representation for 3-D models is an important topic in computer vision, multimedia analysis, and computer graphics. Recent multiview-based methods demonstrate promising performance for 3-D shape recognition and retrieval. However, most multiview-based methods ignore the correlations of multiple views or suffer from high computational cost. In this paper, we propose a novel multiview-based network architecture for 3-D shape recognition and retrieval. Our network combines convolutional neural networks (CNNs) with long short-term memory (LSTM) to exploit the correlative information from multiple views. Well-pretrained CNNs with residual connections are first used to extract a low-level feature of each view image rendered from a 3-D shape. Then, an LSTM and a sequence voting layer are employed to aggregate these features into a shape descriptor. The highway network and a three-step training strategy are also adopted to boost the optimization of the deep network. Experimental results on two public datasets demonstrate that the proposed method achieves promising performance for 3-D shape recognition and state-of-the-art performance for 3-D shape retrieval.
TL;DR: COCO-CN is a dataset that enriches MS-COCO with manually written Chinese sentences and tags, providing a unified and challenging platform for cross-lingual image tagging, captioning, and retrieval.
Abstract: This paper contributes to cross-lingual image annotation and retrieval in terms of data and baseline methods. We propose COCO-CN , a novel dataset enriching MS-COCO with manually written Chinese sentences and tags. For effective annotation acquisition, we develop a recommendation-assisted collective annotation system, automatically providing an annotator with several tags and sentences deemed to be relevant with respect to the pictorial content. Having 20 342 images annotated with 27 218 Chinese sentences and 70 993 tags, COCO-CN is currently the largest Chinese–English dataset that provides a unified and challenging platform for cross-lingual image tagging, captioning, and retrieval. We develop conceptually simple yet effective methods per task for learning from cross-lingual resources. Extensive experiments on the three tasks justify the viability of the proposed dataset and methods. Data and code are publicly available at https://github.com/li-xirong/coco-cn .
TL;DR: An in-depth study of head pose estimation is conducted and a quaternion-based multiregression loss method achieves state-of-the-art performance on the AFLW2000, AFLW test set, and AFW datasets and is closing the gap with methods that utilize depth information on the BIWI dataset.
Abstract: Head pose estimation has attracted immense research interest recently, as its inherent information significantly improves the performance of face-related applications such as face alignment and face recognition. In this paper, we conduct an in-depth study of head pose estimation and present a multiregression loss function, an $L2$ regression loss combined with an ordinal regression loss, to train a convolutional neural network (CNN) that is dedicated to estimating head poses from RGB images without depth information. The ordinal regression loss is utilized to address the nonstationary property observed as the facial features change with respect to different head pose angles and learn robust features. The $L2$ regression loss leverages these features to provide precise angle predictions for input images. To avoid the ambiguity problem in the commonly used Euler angle representation, we further formulate the head pose estimation problem in quaternions. Our quaternion-based multiregression loss method achieves state-of-the-art performance on the AFLW2000, AFLW test set, and AFW datasets and is closing the gap with methods that utilize depth information on the BIWI dataset.
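As a small illustration of the quaternion target representation discussed above, the snippet below converts Euler angles to a unit quaternion (the ZYX yaw-pitch-roll convention is an assumption); regressing such four-component targets avoids the ambiguity of the Euler representation.

```python
# Standard Euler-to-quaternion conversion (ZYX convention assumed) used here
# only to illustrate the quaternion pose targets; it is not the paper's code.
import numpy as np

def euler_to_quaternion(yaw, pitch, roll):
    cy, sy = np.cos(yaw / 2), np.sin(yaw / 2)
    cp, sp = np.cos(pitch / 2), np.sin(pitch / 2)
    cr, sr = np.cos(roll / 2), np.sin(roll / 2)
    w = cr * cp * cy + sr * sp * sy
    x = sr * cp * cy - cr * sp * sy
    y = cr * sp * cy + sr * cp * sy
    z = cr * cp * sy - sr * sp * cy
    return np.array([w, x, y, z])

q = euler_to_quaternion(np.radians(30), np.radians(-10), np.radians(5))
assert np.isclose(np.linalg.norm(q), 1.0)   # unit quaternion
```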
TL;DR: A unified Spatio-Temporal Attention Networks (STAN) model is proposed in the context of multiple modalities; it differs from conventional attention-based deep networks in that its temporal attention provides principled and global guidance across different modalities and video segments.
Abstract: Recognizing actions in videos is not a trivial task because video is an information-intensive medium and includes multiple modalities. Moreover, on each modality, an action may only appear at some spatial regions, or only part of the temporal video segments may contain the action. A valid question is how to locate the attended spatial areas and selective video segments for action recognition. In this paper, we devise a general attention neural cell, called AttCell, that estimates the attention probability not only at each spatial location but also for each video segment in a temporal sequence. With AttCell, a unified Spatio-Temporal Attention Networks (STAN) model is proposed in the context of multiple modalities. Specifically, STAN extracts the feature map of one convolutional layer as the local descriptors on each modality and pools the extracted descriptors with the spatial attention measured by AttCell as a representation of each segment. Then, we concatenate the representation on each modality to seek a consensus on the temporal attention, a priori, to holistically fuse the combined representation of video segments into the video representation for recognition. Our model differs from conventional attention-based deep networks in that our temporal attention provides principled and global guidance across different modalities and video segments. Extensive experiments are conducted on four public datasets (UCF101, CCV, THUMOS14, and Sports-1M); our STAN consistently achieves superior results over several state-of-the-art techniques. More remarkably, we validate and demonstrate the effectiveness of our proposal when capitalizing on different numbers of modalities.
TL;DR: In this article, an RNN is trained by using as features the angles formed by the finger bones of the human hands, acquired by a Leap Motion controller sensor, and the proposed method, including the effectiveness of the selected angles, was initially tested by creating a very challenging dataset composed of a large number of gestures defined by American Sign Language.
Abstract: Hand gesture recognition is still a topic of great interest for the computer vision community. In particular, sign language and semaphoric hand gestures are two foremost areas of interest due to their importance in human–human communication and human–computer interaction, respectively. Any hand gesture can be represented by sets of feature vectors that change over time. Recurrent neural networks (RNNs) are suited to analyzing this type of set thanks to their ability to model the long-term contextual information of temporal sequences. In this paper, an RNN is trained by using as features the angles formed by the finger bones of the human hands. The selected features, acquired by a Leap Motion controller sensor, are chosen because the majority of human hand gestures produce joint movements that generate truly characteristic angles. The proposed method, including the effectiveness of the selected angles, was initially tested by creating a very challenging dataset composed of a large number of gestures defined by American Sign Language. On the latter, an accuracy of over 96% was achieved. Afterwards, by using the Shape Retrieval Contest (SHREC) dataset, a wide collection of semaphoric hand gestures, the method was also shown to outperform in accuracy competing approaches from the current literature.
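The kind of joint-angle feature described above can be illustrated as the angle between two adjacent finger-bone direction vectors. The Leap Motion SDK exposes per-bone directions; the array values below are an assumed toy example, not real sensor output.

```python
# Illustrative bone-angle feature: angle between two adjacent bone directions.
import numpy as np

def bone_angle(proximal, intermediate):
    """Angle (radians) between two bone direction vectors."""
    u = proximal / np.linalg.norm(proximal)
    v = intermediate / np.linalg.norm(intermediate)
    return np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))

# Example: directions (x, y, z) of two adjacent bones of the index finger.
theta = bone_angle(np.array([0.0, 1.0, 0.2]), np.array([0.1, 0.8, 0.6]))
```

A sequence of such angles per frame forms the feature vectors fed to the RNN.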
TL;DR: A DHA quality evaluation method is proposed by integrating several dehazing-relevant features, including image structure recovering, color rendition, and over-enhancement of low-contrast areas; it works for both types of images but is further improved for aerial images by incorporating their specific characteristics.
Abstract: To enhance the visibility and usability of images captured in hazy conditions, many image dehazing algorithms (DHAs) have been proposed. With so many image DHAs, there is a need to evaluate and compare these DHAs. Due to the lack of the reference haze-free images, DHAs are generally evaluated qualitatively using real hazy images. But it is possible to perform quantitative evaluation using synthetic hazy images since the reference haze-free images are available and full-reference (FR) image quality assessment (IQA) measures can be utilized. In this paper, we follow this strategy and study DHA evaluation using synthetic hazy images systematically. We first build a synthetic haze removing quality (SHRQ) database. It consists of two subsets: regular and aerial image subsets, which include 360 and 240 dehazed images created from 45 and 30 synthetic hazy images using 8 DHAs, respectively. Since aerial imaging is an important application area of dehazing, we create an aerial image subset specifically. We then carry out subjective quality evaluation study on these two subsets. We observe that taking DHA evaluation as an exact FR IQA process is questionable, and the state-of-the-art FR IQA measures are not effective for DHA evaluation. Thus, we propose a DHA quality evaluation method by integrating some dehazing-relevant features, including image structure recovering, color rendition, and over-enhancement of low-contrast areas. The proposed method works for both types of images, but we further improve it for aerial images by incorporating its specific characteristics. Experimental results on two subsets of the SHRQ database validate the effectiveness of the proposed measures.
TL;DR: A robust and discriminative pedestrian image descriptor, namely the Global–Local-Alignment Descriptor (GLAD), is proposed, together with an efficient indexing and retrieval framework that performs offline relevance mining to eliminate the huge person ID redundancy in the gallery set and accelerate the online Re-ID procedure.
Abstract: The huge variance of human pose and the misalignment of detected human images significantly increase the difficulty of pedestrian image matching in person Re-Identification (Re-ID). Moreover, the massive visual data being produced by surveillance video cameras requires highly efficient person Re-ID systems. To solve the first problem, this work proposes a robust and discriminative pedestrian image descriptor, namely, the Global–Local-Alignment Descriptor (GLAD). For the second problem, this work treats person Re-ID as image retrieval and proposes an efficient indexing and retrieval framework. GLAD explicitly leverages the local and global cues in the human body to generate a discriminative and robust representation. It consists of part extraction and descriptor learning modules, where several part regions are first detected and then deep neural networks are designed for representation learning on both the local and global regions. A hierarchical indexing and retrieval framework is designed to perform offline relevance mining to eliminate the huge person ID redundancy in the gallery set, and accelerate the online Re-ID procedure. Extensive experimental results on widely used public benchmark datasets show GLAD achieves competitive accuracy compared to the state-of-the-art methods. On a large-scale person Re-ID dataset containing more than 520 K images, our retrieval framework significantly accelerates the online Re-ID procedure while also improving Re-ID accuracy. Therefore, this work has the potential to work better on person Re-ID tasks in real scenarios.
TL;DR: The proposed semi-supervised model, named adaptive semi-supervised feature selection for cross-modal retrieval, uses semantic regression to strengthen the neighboring relationship between data with the same semantics, and an efficient joint optimization algorithm is proposed to update the mapping matrices and the label matrix for unlabeled data simultaneously and iteratively.
Abstract: In order to exploit the abundant potential information of the unlabeled data and contribute to analyzing the correlation among heterogeneous data, we propose a semi-supervised model named adaptive semi-supervised feature selection for cross-modal retrieval. First, we utilize semantic regression to strengthen the neighboring relationship between data with the same semantics. The correlation between heterogeneous data can be optimized by keeping the pairwise closeness when learning the common latent space. Second, we adopt a graph-based constraint to predict accurate labels for unlabeled data; it can also keep the geometric structure consistency between the label space and the feature space of heterogeneous data in the common latent space. Finally, an efficient joint optimization algorithm is proposed to update the mapping matrices and the label matrix for unlabeled data simultaneously and iteratively. It makes samples from different classes lie far apart, while samples from the same class lie as close as possible. Meanwhile, the ${l_{2,1}}$-norm constraint is used for feature selection and outlier reduction when the mapping matrices are learned. In addition, we propose learning different mapping matrices corresponding to different sub-tasks to emphasize the semantic and structural information of query data. Experimental results on three datasets demonstrate that our method performs better than the state-of-the-art methods.
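For reference, the ${l_{2,1}}$-norm of a $d \times c$ mapping matrix $W$ is (in generic notation that may differ from the paper's)

$$\|W\|_{2,1} = \sum_{i=1}^{d} \sqrt{\sum_{j=1}^{c} W_{ij}^{2}},$$

i.e., the sum of the $\ell_{2}$ norms of its rows; minimizing it drives entire rows toward zero, so features whose rows vanish are effectively discarded, which is what makes this constraint suitable for joint feature selection and outlier reduction.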
TL;DR: This paper proposes an efficient lossy coding solution for the geometry of static point clouds using an octree-based approach for a base layer and a graph-based transform approach for the enhancement layer where an Inter-layer residual is coded.
Abstract: Recently, 3D visual representation models such as light fields and point clouds are becoming popular due to their capability to represent the real world in a more complete and immersive way, paving the road for new and more advanced visual experiences. The point cloud representation model is able to efficiently represent the surface of objects/scenes by means of a set of 3D points and associated attributes and is increasingly being used from autonomous cars to augmented reality. Emerging imaging sensors have made it easier to perform richer and denser point cloud acquisitions, notably with millions of points, making it impossible to store and transmit these very high amounts of data without appropriate coding. This bottleneck has raised the need for efficient point cloud coding solutions in order to offer more immersive visual experiences and better quality of experience to the users. In this context, this paper proposes an efficient lossy coding solution for the geometry of static point clouds. The proposed coding solution uses an octree-based approach for a base layer and a graph-based transform approach for the enhancement layer where an Inter-layer residual is coded. The performance assessment shows very significant compression gains regarding the state-of-the-art, especially for the most relevant lower and medium rates.
TL;DR: A novel multimodal deep binary reconstruction model is proposed, which can be trained to simultaneously model the correlation across modalities and learn the binary hashing codes, and the model can be easily optimized by a standard gradient descent optimizer.
Abstract: To satisfy the huge storage space and organization capacity requirements in addressing big multimodal data, hashing techniques have been widely employed to learn binary representations in cross-modal retrieval tasks. However, optimizing the hashing objective under the necessary binary constraint is truly a difficult problem. A common strategy is to relax the constraint and perform individual binarizations over the learned real-valued representations. In this paper, in contrast to conventional two-stage methods, we propose to directly learn the binary codes, where the model can be easily optimized by a standard gradient descent optimizer. However, before that, we present a theoretical guarantee of the effectiveness of the multimodal network in preserving the inter- and intra-modal consistencies. Based on this guarantee, a novel multimodal deep binary reconstruction model is proposed, which can be trained to simultaneously model the correlation across modalities and learn the binary hashing codes. To generate binary codes and to avoid the tiny gradient problem, a novel activation function first scales the input activations to suitable scopes and, then, feeds them to the tanh function to build the hashing layer. Such a composite function is named adaptive tanh . Both linear and nonlinear scaling methods are proposed and shown to generate efficient codes after training the network. Extensive ablation studies and comparison experiments are conducted for the image2text and text2image retrieval tasks; the method is found to outperform several state-of-the-art deep-learning methods with respect to different evaluation metrics.
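A minimal sketch of the adaptive tanh idea described above: a scaling step followed by tanh so that activations approach ±1 without saturating gradients too early. The learnable scalar scale and the layer sizes are illustrative assumptions; the paper also proposes a nonlinear scaling variant.

```python
# Hedged sketch of an "adaptive tanh" hashing activation: rescale, then tanh.
import torch
import torch.nn as nn

class AdaptiveTanh(nn.Module):
    def __init__(self, init_scale=1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init_scale))  # learned scaling factor

    def forward(self, x):
        return torch.tanh(self.scale * x)   # outputs approach ±1 as the scale grows

hash_layer = nn.Sequential(nn.Linear(1024, 64), AdaptiveTanh())
codes = torch.sign(hash_layer(torch.randn(8, 1024)))   # 64-bit binary codes at test time
```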
TL;DR: A dataset for facilitating audio-visual analysis of music performances that comprises 44 simple multi-instrument classical music pieces assembled from coordinated but separately recorded performances of individual tracks is introduced.
Abstract: We introduce a dataset for facilitating audio-visual analysis of music performances. The dataset comprises 44 simple multi-instrument classical music pieces assembled from coordinated but separately recorded performances of individual tracks. For each piece, we provide the musical score in MIDI format, the audio recordings of the individual tracks, the audio and video recording of the assembled mixture, and ground-truth annotation files including frame-level and note-level transcriptions. We describe our methodology for the creation of the dataset, particularly highlighting our approaches to address the challenges involved in maintaining synchronization and expressiveness. We demonstrate the high quality of synchronization achieved with our proposed approach by comparing the dataset with existing widely used music audio datasets. We anticipate that the dataset will be useful for the development and evaluation of existing music information retrieval (MIR) tasks, as well as for novel multimodal tasks. We benchmark two existing MIR tasks (multipitch analysis and score-informed source separation) on the dataset and compare them with other existing music audio datasets. In addition, we consider two novel multimodal MIR tasks (visually informed multipitch analysis and polyphonic vibrato analysis) enabled by the dataset and provide evaluation measurements and baseline systems for future comparisons (from our recent work). Finally, we propose several emerging research directions that the dataset enables.
TL;DR: This paper presents a novel unsupervised deep feature learning algorithm for the abnormal event detection problem, and introduces the quadruplet concept to model the multilevel similarity structure, which could be used to construct a generalized triplet loss for training the C3D network.
Abstract: Abnormal event detection in large videos is an important task in research and industrial applications, which has attracted considerable attention in recent years. Existing methods usually solve this problem by extracting local features and then learning an outlier detection model on training videos. However, most previous approaches merely employ hand-crafted visual features, which is a clear disadvantage due to their limited representation capacity. In this paper, we present a novel unsupervised deep feature learning algorithm for the abnormal event detection problem. To exploit the spatiotemporal information of the inputs, we utilize the deep three-dimensional convolutional network (C3D) to perform feature extraction. Then, the key problem is how to train the C3D network without any category labels. Here, we employ the sparse coding results of the hand-crafted features generated from the inputs to guide the unsupervised feature learning. Specifically, we define a multilevel similarity relationship between these inputs according to the statistical information of the shared atoms. In the following, we introduce the quadruplet concept to model the multilevel similarity structure, which could be used to construct a generalized triplet loss for training the C3D network. Furthermore, the C3D network could be utilized to generate the features for sparse coding again, and this pipeline could be iterated for several times. By jointly optimizing between the sparse coding and the unsupervised feature learning, we can obtain robust and rich feature representations. Based on the learned representations, the sparse reconstruction error is applied to predicting the anomaly score of each testing input. Experiments on several publicly available video surveillance datasets in comparison with a number of existing works demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods.
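The multilevel similarity idea can be illustrated with a quadruplet-style ranking loss that enforces an ordered distance structure among an anchor, a strongly similar sample, a weakly similar sample, and a dissimilar one. The margins and the exact generalized triplet formulation in the paper may differ from this hedged sketch.

```python
# Hedged sketch of a quadruplet ranking loss enforcing d(a,p) < d(a,m) < d(a,n).
import torch
import torch.nn.functional as F

def quadruplet_loss(anchor, pos, mid, neg, margin1=0.5, margin2=0.5):
    d_ap = F.pairwise_distance(anchor, pos)   # strongly similar pair
    d_am = F.pairwise_distance(anchor, mid)   # weakly similar pair
    d_an = F.pairwise_distance(anchor, neg)   # dissimilar pair
    loss = F.relu(d_ap - d_am + margin1) + F.relu(d_am - d_an + margin2)
    return loss.mean()

a, p, m, n = (torch.randn(32, 128) for _ in range(4))   # toy C3D embeddings
loss = quadruplet_loss(a, p, m, n)
```

Here the similarity levels would come from the sparse-coding statistics of shared atoms, as described in the abstract.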
TL;DR: The proposed end-to-end framework does not require explicit symbol segmentation or a predefined expression grammar for parsing, and it demonstrates the strong complementarity between offline information with static-image input and online information with ink-trajectory input by blending a fully convolutional network-based watcher into TAP.
Abstract: In this paper, we introduce Track, Attend, and Parse (TAP), an end-to-end approach based on neural networks for online handwritten mathematical expression recognition (OHMER). The architecture of TAP consists of a tracker and a parser. The tracker employs a stack of bidirectional recurrent neural networks with gated recurrent units (GRUs) to model the input handwritten traces, which can fully utilize the dynamic trajectory information in OHMER. Following the tracker, the parser adopts a GRU equipped with guided hybrid attention (GHA) to generate notations. The proposed GHA is composed of a coverage-based spatial attention, a temporal attention, and an attention guider. Moreover, we demonstrate the strong complementarity between offline information with static-image input and online information with ink-trajectory input by blending a fully convolutional network-based watcher into TAP. Inherently, unlike traditional methods, this end-to-end framework does not require explicit symbol segmentation or a predefined expression grammar for parsing. Validated on a benchmark published by the CROHME competition, the proposed approach outperforms the state-of-the-art methods and achieves the best reported results, with expression recognition accuracies of 61.16% on CROHME 2014 and 57.02% on CROHME 2016, using only the official training dataset.
TL;DR: A user-centric video transmission mechanism based on device-to-device communications that allows mobile users to cache and share videos between each other, in a cooperative manner, to achieve a QoE-guaranteed video streaming service in a cellular network.
Abstract: The ever-increasing demand for videos on mobile devices poses a significant challenge to existing cellular network infrastructures. To cope with the challenge, we propose a user-centric video transmission mechanism based on device-to-device communications that allows mobile users to cache and share videos with each other in a cooperative manner. The proposed solution jointly considers users’ similarity in accessing videos, users’ sharing willingness, users’ location distribution, and users’ quality of experience (QoE) requirements, in order to achieve a QoE-guaranteed video streaming service in a cellular network. Specifically, a service set consisting of several service providers and mobile users is dynamically configured to provide timely service according to the probability of successful service. Numerical results show that when the numbers of providers and demanded videos are 40 and 2, respectively, the improved user experience rate in the proposed solution is approximately 85%, and the data offload rate on the base station(s) is about 78%.
TL;DR: This paper develops a robust fusion center with virtual and real zones to make a global decision based on preliminary candidate targets generated by each detector, and mitigates the sensitivity of missed detections in the generalized covariance intersection fusion process, thereby improving the fusion performance and tracking consistency.
Abstract: In this paper, we propose a multi-level cooperative fusion approach to address the online multiple human tracking problem in a Gaussian mixture probability hypothesis density (GM-PHD) filter framework. The proposed fusion approach consists essentially of three steps. First, we integrate two human detectors with different characteristics (full-body and body-parts), and investigate their complementary benefits for tracking multiple targets. For each detector domain, we then propose a novel discriminative correlation matching model, and fuse it with spatio-temporal information to address ambiguous identity association in the GM-PHD filter. Finally, we develop a robust fusion center with virtual and real zones to make a global decision based on preliminary candidate targets generated by each detector. This center also mitigates the sensitivity of missed detections in the generalized covariance intersection fusion process, thereby improving the fusion performance and tracking consistency. Experiments on the MOTChallenge Benchmark demonstrate that the proposed method achieves improved performance over other state-of-the-art RFS-based tracking methods.
TL;DR: A new framework is introduced in this paper to remotely estimate the HR under realistic conditions by combining spatial and temporal filtering and a convolutional neural network and shows better performance compared with the benchmark on the MMSE-HR dataset in terms of both the average HR estimation and short-time HR estimation.
Abstract: With the increase in health consciousness, noninvasive body monitoring has aroused interest among researchers. As one of the most important pieces of physiological information, researchers have remotely estimated the heart rate (HR) from facial videos in recent years. Although progress has been made over the past few years, there are still some limitations, like the processing time increasing with accuracy and the lack of comprehensive and challenging datasets for use and comparison. Recently, it was shown that HR information can be extracted from facial videos by spatial decomposition and temporal filtering. Inspired by this, a new framework is introduced in this paper to remotely estimate the HR under realistic conditions by combining spatial and temporal filtering and a convolutional neural network. Our proposed approach shows better performance compared with the benchmark on the MMSE-HR dataset in terms of both the average HR estimation and short-time HR estimation. High consistency in short-time HR estimation is observed between our method and the ground truth.
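To illustrate the temporal filtering step mentioned above, the sketch below band-passes a per-frame skin-color signal to a plausible heart-rate band and reads off the dominant frequency. The band limits (0.7-4 Hz, i.e. 42-240 bpm), the filter order, and the toy green-channel signal are common choices assumed for illustration, not the paper's exact pipeline.

```python
# Band-pass temporal filtering of a per-frame color signal for HR estimation.
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass_hr(signal, fs, low=0.7, high=4.0, order=4):
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype='band')
    return filtfilt(b, a, signal)       # zero-phase band-pass filter

fs = 30.0                               # video frame rate (assumed)
raw = np.random.randn(900)              # e.g. mean green-channel value per frame (toy)
pulse = bandpass_hr(raw, fs)
hr_bpm = 60 * np.argmax(np.abs(np.fft.rfft(pulse))) * fs / len(pulse)
```

In the described framework, the CNN then operates on such spatially decomposed and temporally filtered signals rather than on a single frequency-peak estimate.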
TL;DR: A novel Multitask Learning Algorithm for cross-Domain Image Captioning (MLADIC) is introduced, which is a multitask system that simultaneously optimizes two coupled objectives via a dual learning mechanism: image captioning and text-to-image synthesis, with the hope that by leveraging the correlation of the two dual tasks, it is able to enhance the image captioning performance in the target domain.
Abstract: Recent artificial intelligence research has witnessed great interest in automatically generating text descriptions of images, which is known as the image captioning task. Remarkable success has been achieved on domains where a large number of paired data in multimedia are available. Nevertheless, annotating sufficient data is labor-intensive and time-consuming, establishing significant barriers for adapting the image captioning systems to new domains. In this study, we introduce a novel Multitask Learning Algorithm for cross-Domain Image Captioning (MLADIC). MLADIC is a multitask system that simultaneously optimizes two coupled objectives via a dual learning mechanism: image captioning and text-to-image synthesis, with the hope that by leveraging the correlation of the two dual tasks, we are able to enhance the image captioning performance in the target domain. Concretely, the image captioning task is trained with an encoder–decoder model (i.e., CNN-LSTM) to generate textual descriptions of the input images. The image synthesis task employs the conditional generative adversarial network (C-GAN) to synthesize plausible images based on text descriptions. In C-GAN, a generative model $G$ synthesizes plausible images given text descriptions, and a discriminative model $D$ tries to distinguish the images in training data from the images generated by $G$. The adversarial process can eventually guide $G$ to generate plausible and high-quality images. To bridge the gap between different domains, a two-step strategy is adopted in order to transfer knowledge from the source domains to the target domains. First, we pre-train the model to learn the alignment between the neural representations of images and those of text data with the sufficient labeled source domain data. Second, we fine-tune the learned model by leveraging the limited image–text pairs and unpaired data in the target domain. We conduct extensive experiments to evaluate the performance of MLADIC by using MSCOCO as the source domain data, and using Flickr30k and Oxford-102 as the target domain data. The results demonstrate that MLADIC achieves substantially better performance than the strong competitors for the cross-domain image captioning task.
TL;DR: Simulation results show that the proposed PVRV streaming system can improve the streaming performance in both energy efficiency and the quality of received viewport over the state-of-the-art schemes.
Abstract: Panoramic virtual reality video (PVRV) is becoming increasingly popular since it offers a true immersive experience. However, the ultra-high resolution of PVRV requires significant bandwidth and ultra-low latency for PVRV streaming, something that makes challenging the extension of this application to mobile networks. Besides bandwidth, the frequent perspective viewport rendering induces a heavy computational load on battery-constrained mobile devices. To attack these problems jointly, this paper proposes a PVRV streaming system that is designed for modern multiconnectivity-based millimeter wave (mmWave) cellular networks in conjunction with mobile edge computing (MEC). First, mmWave is deployed to support the high bandwidth needs of PVRV streaming. Next, the multiple mmWave links that tend to suffer from outages are coupled with a sub-6 GHz link to ensure disruption-free wireless communication. With the help of an MEC server, the tradeoff among link adaptation, transcoding-based chunk quality adaptation, and viewport rendering offloading is sought to improve the wireless bandwidth utilization and mobile device's energy efficiency. Simulation results show that the proposed scheme can improve the streaming performance in both energy efficiency and the quality of received viewport over the state-of-the-art schemes.
TL;DR: This work empirically shows that current NR-IQA methods are inconsistent with human visual perception when predicting the relative quality of image pairs with different image contents, and puts forward a new NR- IQA method based on semantic feature aggregation (SFA) to alleviate the impact of image content variation.
Abstract: Image content variation is a typical and challenging problem in no-reference image-quality assessment (NR-IQA). This work pays special attention to the impact of image content variation on NR-IQA methods. To better analyze this impact, we focus on blur-dominated distortions to exclude the impacts of distortion-type variations. We empirically show that current NR-IQA methods are inconsistent with human visual perception when predicting the relative quality of image pairs with different image contents. In view of deep semantic features of pretrained image classification neural networks always containing discriminative image content information, we put forward a new NR-IQA method based on semantic feature aggregation (SFA) to alleviate the impact of image content variation. Specifically, instead of resizing the image, we first crop multiple overlapping patches over the entire distorted image to avoid introducing geometric deformations. Then, according to an adaptive layer selection procedure, we extract deep semantic features by leveraging the power of a pretrained image classification model for its inherent content-aware property. After that, the local patch features are aggregated using several statistical structures. Finally, a linear regression model is trained for mapping the aggregated global features to image-quality scores. The proposed method, SFA, is compared with nine representative blur-specific NR-IQA methods, two general-purpose NR-IQA methods, and two extra full-reference IQA methods on Gaussian blur images (with and without Gaussian noise/JPEG compression) and realistic blur images from multiple databases, including LIVE, TID2008, TID2013, MLIVE1, MLIVE2, BID, and CLIVE. Experimental results show that SFA is superior to the state-of-the-art NR methods on all seven databases. It is also verified that deep semantic features play a crucial role in addressing image content variation, and this provides a new perspective for NR-IQA.
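The overall SFA pipeline can be sketched as follows; the backbone, patch size and stride, the choice of aggregation statistics, and the toy data are illustrative assumptions (the paper selects feature layers adaptively and uses a pretrained classification network).

```python
# Sketch of the SFA flow: crop overlapping patches, extract deep features,
# aggregate them with simple statistics, and regress quality scores linearly.
import torch
import torchvision.models as models
from sklearn.linear_model import LinearRegression

def extract_sfa_feature(image, backbone, patch=224, stride=128):
    """image: (3, H, W) tensor; returns an aggregated global feature vector."""
    patches = (image.unfold(1, patch, stride)      # overlapping patches, no resizing
                     .unfold(2, patch, stride)
                     .permute(1, 2, 0, 3, 4)
                     .reshape(-1, 3, patch, patch))
    with torch.no_grad():
        feats = backbone(patches)                  # (num_patches, feat_dim)
    return torch.cat([feats.mean(0), feats.std(0),
                      feats.min(0).values, feats.max(0).values])

backbone = models.resnet50()                       # use pretrained weights in practice
backbone.fc = torch.nn.Identity()                  # keep penultimate (semantic) features
backbone.eval()

images = [torch.rand(3, 384, 512) for _ in range(4)]                 # toy distorted images
X = torch.stack([extract_sfa_feature(im, backbone) for im in images]).numpy()
y = [3.2, 1.5, 4.1, 2.7]                                             # toy quality scores
regressor = LinearRegression().fit(X, y)                             # maps features to quality
```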