
Showing papers in "IEEE MultiMedia in 2021"


Journal ArticleDOI
TL;DR: The proposed IWAN model detects sarcasm by focusing on the word-level incongruity between modalities via a scoring mechanism; it not only achieves state-of-the-art performance on the MUStARD dataset but also offers the advantage of interpretability.
Abstract: Sarcasm is a sophisticated linguistic phenomenon that commonly manifests on social media platforms, posing a great challenge for opinion mining systems. Therefore, multimodal sarcasm detection, which aims to understand the implied sentiment in a video, has gained increasing attention. However, previous works mostly focus on multimodal feature fusion without explicitly modeling the incongruity between modalities, such as expressing verbal compliments while rolling one's eyes, which is an obvious cue for detecting sarcasm. In this article, we propose the incongruity-aware attention network (IWAN), which detects sarcasm by focusing on the word-level incongruity between modalities via a scoring mechanism. This scoring mechanism assigns larger weights to words with incongruent modalities. Experimental results demonstrate the effectiveness of the proposed IWAN model, which not only achieves state-of-the-art performance on the MUStARD dataset but also offers the advantage of interpretability.
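
To make the scoring idea concrete, here is a minimal NumPy sketch of word-level incongruity weighting (our illustration under simplified assumptions: negative cosine similarity as the incongruity score and random placeholder features), not the authors' actual network:

```python
import numpy as np

def incongruity_weighted_pooling(text_feats, visual_feats):
    """Toy word-level incongruity scoring.

    text_feats, visual_feats: (num_words, dim) arrays of aligned per-word
    features from the text and visual streams. Returns the per-word
    weights and the pooled utterance representation.
    """
    # Cosine similarity per word; low similarity = high incongruity.
    t = text_feats / (np.linalg.norm(text_feats, axis=1, keepdims=True) + 1e-8)
    v = visual_feats / (np.linalg.norm(visual_feats, axis=1, keepdims=True) + 1e-8)
    cos_sim = np.sum(t * v, axis=1)                  # (num_words,)
    incongruity = -cos_sim                           # higher = more incongruent

    # Softmax turns incongruity scores into attention weights, so words
    # whose modalities disagree receive larger weights.
    w = np.exp(incongruity - incongruity.max())
    w = w / w.sum()

    pooled = (w[:, None] * np.concatenate([text_feats, visual_feats], axis=1)).sum(axis=0)
    return w, pooled

words_t = np.random.randn(7, 32)   # hypothetical 7-word utterance
words_v = np.random.randn(7, 32)
weights, utterance_vec = incongruity_weighted_pooling(words_t, words_v)
print(weights.round(3), utterance_vec.shape)
```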

20 citations


Journal ArticleDOI
TL;DR: An affective-motion imaging approach that cumulates the rapid, short-lived variations of microexpressions into a single response is proposed; experimental results show that the proposed network outperforms state-of-the-art microexpression recognition (MER) approaches by a significant margin.
Abstract: Microexpressions are hard to spot because they are fleeting, involuntary movements of facial muscles, and interpreting microemotions from video clips is a challenging task. In this article, we propose an affective-motion imaging approach that cumulates the rapid, short-lived variational information of microexpressions into a single response. Moreover, we propose AffectiveNet, an affective-motion feature learning network that can perceive subtle changes and learns the most discriminative dynamic features to describe the emotion classes. AffectiveNet comprises two blocks: the MICRoFeat block and the MFL block. The MICRoFeat block conserves scale-invariant features, allowing the network to capture both coarse and fine edge variations, whereas the MFL block learns microlevel dynamic variations from two different intermediate convolutional layers. The effectiveness of the proposed network is tested on four datasets using two experimental setups: person-independent and cross-dataset validation. The results show that the proposed network outperforms state-of-the-art MER approaches by a significant margin.
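
As a rough illustration of the affective-motion imaging step (not the authors' exact formulation), the sketch below collapses a grayscale clip into a single motion response map by accumulating absolute inter-frame differences; the clip here is a random placeholder:

```python
import numpy as np

def affective_motion_image(frames):
    """Collapse a microexpression clip into one motion response map.

    frames: (T, H, W) grayscale clip as a float array. Accumulates
    absolute inter-frame differences so the fleeting facial-muscle motion
    is summarized in a single image.
    """
    frames = frames.astype(np.float32)
    diffs = np.abs(np.diff(frames, axis=0))            # (T-1, H, W)
    motion = diffs.sum(axis=0)
    # Normalize to [0, 1] so it can be fed to a 2-D CNN like any image.
    return (motion - motion.min()) / (motion.max() - motion.min() + 1e-8)

clip = np.random.rand(30, 112, 112)   # hypothetical 30-frame face clip
ami = affective_motion_image(clip)
print(ami.shape, float(ami.min()), float(ami.max()))
```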

14 citations


Journal ArticleDOI
TL;DR: This article outlines a software component, to be integrated into authoring tools, that uses content analysis assistance to indicate moments of sensory effect activation according to author preferences; it is expected to considerably reduce the effort of synchronizing audiovisual content with sensory effects.
Abstract: Synchronization of sensory effects with multimedia content is a nontrivial and error-prone task that can discourage authoring of mulsemedia applications. Although there are authoring tools that perform some automatic authoring of sensory effect metadata, the analysis techniques they use are generally not sufficient to identify complex components that may be related to sensory effects. In this article, we present a new method that allows the semiautomatic definition of sensory effects in an authoring tool. We outline a software component, to be integrated into authoring tools, that uses content analysis assistance to indicate moments of sensory effect activation according to author preferences. The proposed method was implemented in the STEVE 2.0 authoring tool, and an evaluation was performed to assess the precision of the generated sensory effects in comparison with human authoring. This solution is expected to considerably reduce the effort of synchronizing audiovisual content with sensory effects, in particular by easing the author's repetitive task of synchronizing recurring effects with lengthy media.

11 citations


Journal ArticleDOI
TL;DR: In this article, the authors proposed a general framework to achieve multichannel steganography based on the modern steganographic paradigm and designed two schemes to achieve separable and sequential multichannel steganography, respectively, with the proposed framework.
Abstract: In this article, we focus on a new concept called multichannel steganography, in which a sender can transmit different secret data to multiple receivers via the same cover image. First, we propose a general framework to achieve multichannel steganography based on the modern steganographic paradigm. Then, we design two schemes that, respectively, achieve separable and sequential multichannel steganography within the proposed framework. In the separable scheme, a receiver can extract the secret data intended for him or her using the corresponding data-hiding key; the receiver cannot confirm the existence of the other parts of the secret data and, therefore, cannot extract them. In the sequential scheme, the multiple receivers are dependent: the receiver with the highest priority extracts data first, then the receiver with the second priority can extract data when authorized by the highest-priority receiver, and so on. In addition, the two schemes achieve multichannel steganography without decreasing undetectability.
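
For intuition only, the toy sketch below illustrates the separable idea with naive LSB replacement: each receiver's channel owns a disjoint, pseudo-randomly chosen pixel subset, so one payload can be read without touching the others. The channel layout convention, seeds, and LSB embedding are our assumptions; the paper's framework builds on the modern steganographic paradigm and provides properties this toy does not (e.g., undetectability and concealment of the other channels' existence):

```python
import numpy as np

def channel_positions(shape, layout_seed, channel, n_bits, capacity=1024):
    """Pixel indices for one receiver's channel (toy convention: a shared
    layout seed permutes all pixels; channel i owns a contiguous slice)."""
    order = np.random.default_rng(layout_seed).permutation(int(np.prod(shape)))
    start = channel * capacity
    return order[start:start + n_bits]

def embed(cover, payloads, layout_seed=7, capacity=1024):
    """Write each channel's bits into the LSBs of its own pixel subset."""
    stego = cover.copy().ravel()
    for ch, bits in enumerate(payloads):
        pos = channel_positions(cover.shape, layout_seed, ch, len(bits), capacity)
        stego[pos] = (stego[pos] & 0xFE) | bits       # overwrite LSBs
    return stego.reshape(cover.shape)

def extract(stego, layout_seed, channel, n_bits, capacity=1024):
    pos = channel_positions(stego.shape, layout_seed, channel, n_bits, capacity)
    return stego.ravel()[pos] & 1

cover = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
msg_a = np.random.randint(0, 2, 100, dtype=np.uint8)
msg_b = np.random.randint(0, 2, 80, dtype=np.uint8)
stego = embed(cover, [msg_a, msg_b])
assert np.array_equal(extract(stego, 7, 0, 100), msg_a)
assert np.array_equal(extract(stego, 7, 1, 80), msg_b)
```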

10 citations


Journal ArticleDOI
TL;DR: A novel neighborhood adaptive distortion metric to be used in the training loss function, which allows significantly improving the rate-distortion performance with commonly used objective quality metrics and an explicit quantization approach at the training and coding times to generate varying rate/quality with a single trained deep learning coding model.
Abstract: As the interest in deep learning tools continues to rise, new multimedia research fields begin to discover its potential. Both image and point cloud coding are good examples of technologies where deep learning-based solutions have recently displayed very competitive performance. In this context, this article brings two novel contributions to the point cloud geometry coding state of the art: first, a novel neighborhood adaptive distortion metric to be used in the training loss function, which allows significantly improving the rate-distortion performance with commonly used objective quality metrics; second, an explicit quantization approach at training and coding time to generate varying rate/quality with a single trained deep learning coding model, effectively reducing the training complexity and storage requirements. The result is an improved deep learning-based point cloud geometry coding solution, which is both more compression efficient and less demanding in training complexity and storage.
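
The following sketch is only a guess at the spirit of a neighborhood-adaptive distortion term (the paper's actual metric differs in detail): a weighted binary cross-entropy over a voxel occupancy block in which each voxel's weight grows with the occupancy of its local neighborhood. The window size, weighting rule, and data are our assumptions:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def neighborhood_adaptive_bce(pred, target, k=3, alpha=2.0, eps=1e-7):
    """Weighted binary cross-entropy over a voxel occupancy block.

    pred, target: (D, H, W) arrays of predicted probabilities and 0/1
    labels. Each voxel's weight grows with the fraction of occupied
    voxels in its k x k x k neighborhood, so errors near dense geometry
    cost more than errors in empty space.
    """
    neigh_occ = uniform_filter(target.astype(np.float32), size=k)   # in [0, 1]
    w = 1.0 + alpha * neigh_occ
    bce = -(target * np.log(pred + eps) + (1 - target) * np.log(1 - pred + eps))
    return float((w * bce).mean())

pred = np.random.rand(16, 16, 16)                      # placeholder predictions
target = (np.random.rand(16, 16, 16) > 0.9).astype(np.float32)
print(neighborhood_adaptive_bce(pred, target))
```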

9 citations


Journal ArticleDOI
TL;DR: This method selects the four most discriminative regions of interest (ROIs) and determines the appearance of MEs using the proposed feature and a novel decision criterion based on the characteristics of the optical flow angle and magnitude.
Abstract: Microexpression (ME) spotting is a crucial step in emotion analysis for detecting people's true emotions. However, the short duration, small motion amplitude, and limited number of samples make accurate spotting and ME locating challenging. To address these problems, we make two contributions in this article: an ME dataset, SDU2, and a spotting method. The SDU2 dataset contains 1602 video clips of hybrid expressions labeled by professional psychologists, covering six main categories of emotions with a balanced distribution. Our ME spotting method is based on a magnitude- and angle-combined optical flow feature, exploiting the angle information that has been overlooked by other spotting methods. In this method, we select the four most discriminative regions of interest (ROIs) and determine the appearance of MEs using the proposed feature and a novel decision criterion based on the characteristics of the optical flow angle and magnitude. We have conducted experiments on the SDU2 and CASME II datasets. The results demonstrate that our method achieves much better spotting accuracy than other state-of-the-art methods.
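
To illustrate the raw ingredients of such a feature, the sketch below computes dense optical flow magnitude and angle with OpenCV and scores one ROI by the mean magnitude of pixels whose direction stays near a dominant direction; the ROI, dominant direction, and any spotting threshold are placeholders, not the authors' decision criterion:

```python
import cv2
import numpy as np

def flow_mag_ang(prev_gray, curr_gray):
    """Dense optical flow between two frames, as magnitude and angle maps."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    fx = np.ascontiguousarray(flow[..., 0])
    fy = np.ascontiguousarray(flow[..., 1])
    mag, ang = cv2.cartToPolar(fx, fy)       # angle in radians
    return mag, ang

def roi_motion_score(mag, ang, roi, main_direction, tol=np.pi / 4):
    """Toy cue for one ROI: mean magnitude of pixels whose flow direction
    stays within `tol` of the ROI's dominant direction. An ME onset could
    be flagged when this score exceeds a threshold over consecutive
    frames (the threshold choice here would be ours, not the paper's)."""
    y0, y1, x0, x1 = roi
    m = mag[y0:y1, x0:x1]
    a = ang[y0:y1, x0:x1]
    aligned = np.abs(np.angle(np.exp(1j * (a - main_direction)))) < tol
    return float(m[aligned].mean()) if aligned.any() else 0.0

prev = np.random.randint(0, 256, (128, 128), dtype=np.uint8)
curr = np.random.randint(0, 256, (128, 128), dtype=np.uint8)
mag, ang = flow_mag_ang(prev, curr)
print(roi_motion_score(mag, ang, roi=(20, 60, 30, 80), main_direction=0.0))
```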

9 citations


Journal ArticleDOI
TL;DR: In this paper, an end-to-end real-time stereo matching network (RTSMNet) is proposed, which consists of three modules; its initial disparity estimation module is a compact three-dimensional convolution architecture that rapidly produces a low-resolution (LR) disparity map.
Abstract: In this article, we propose an end-to-end real-time stereo matching network (RTSMNet). RTSMNet consists of three modules. The global and local feature extraction (GLFE) module captures hierarchical context information and generates the coarse cost volume. The initial disparity estimation module is a compact three-dimensional convolution architecture aiming to produce the low-resolution (LR) disparity map rapidly. The feature-guided spatial attention upsampling module takes the LR disparity map and the shared features from the GLFE module as guidance: it first estimates residual disparity values, and then an attention mechanism generates context-aware adaptive kernels for each upsampled pixel. The adaptive kernels place higher attention weights on reliable areas, which significantly reduces blurred edges and recovers thin structures. The proposed networks achieve 66–175 fps on a 2080 Ti and 11–42 fps on edge computing devices, with accuracy competitive with state-of-the-art methods on multiple benchmarks.

8 citations


Journal ArticleDOI
TL;DR: A novel reinforcement learning network that keeps track of the gradual emotional changes from every utterance throughout the conversation and uses this information for each utterance’s emotion detection.
Abstract: In this article, we propose a novel reinforcement learning network that keeps track of the gradual emotional changes from every utterance throughout the conversation and uses this information for each utterance’s emotion detection. Concretely, we first establish an agent and, then, utilize sliding windows to extract the accumulated emotional information before the current utterance. We define the concatenation of accumulated emotional information and the contextual information as the state of the reinforcement learning framework. The action of the established agent is formulated as the emotional label of the current utterance. On this basis, we formulate the progressive emotional interaction process throughout the conversation as a sequential decision problem and solve it with the reinforcement learning framework. Detailed evaluations on the published multimodal MELD dataset demonstrate the effectiveness of our approach.
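
A minimal sketch of how such a state could be assembled, under our simplified reading (one-hot averaging of the last few predicted labels as the accumulated emotional information, and a random placeholder for the contextual features):

```python
import numpy as np

def build_state(emotion_history, context_feat, window=3, num_classes=7):
    """State = [accumulated emotion info over a sliding window ; context].

    emotion_history: list of predicted label ids for past utterances.
    context_feat:    feature vector of the current utterance's context.
    The last `window` predictions are one-hot encoded and averaged, then
    concatenated with the contextual features.
    """
    recent = emotion_history[-window:]
    onehot = np.zeros((max(len(recent), 1), num_classes))
    for i, lab in enumerate(recent):
        onehot[i, lab] = 1.0
    accumulated = onehot.mean(axis=0)          # (num_classes,)
    return np.concatenate([accumulated, context_feat])

state = build_state([2, 2, 5], np.random.randn(128))
print(state.shape)   # (135,) = 7 accumulated-emotion dims + 128 context dims
```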

7 citations


Journal ArticleDOI
TL;DR: A region-adaptive two-shot network (RATNet) that follows a coarse-to-fine framework and derives the final prediction by adaptively aggregating the results after illumination modification and detail restoration, whose region-variant weights are jointly optimized by maximizing the similarity between the fused result and the haze-free counterpart.
Abstract: Single image dehazing is the key to enhancing image visibility in outdoor scenes, which facilitates human observation and computer recognition. Existing approaches generally utilize a one-shot strategy that indiscriminately applies the same filters to all local regions. However, because they neglect inhomogeneous illumination and detail distortion, their dehazed results easily suffer from underfiltering or overfiltering across different regions. To tackle this issue, we propose a region-adaptive two-shot network (RATNet) that follows a coarse-to-fine framework. First, a lightweight subnetwork is applied to perform regular global filtering and obtain an initially restored image. Then, a two-branch subnetwork is put forward whose branches separately refine its illumination and detail. Finally, we derive the final prediction by adaptively aggregating the results after illumination modification and detail restoration, whose region-variant weights are jointly optimized by maximizing the similarity between the fused result and the haze-free counterpart. Extensive experiments validate the superiority of the proposed algorithm.
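
The final aggregation step can be pictured with the small sketch below, which fuses the two branch outputs with a per-pixel weight map; in the paper these weights are learned, whereas here they are simply supplied as input:

```python
import numpy as np

def region_adaptive_fusion(illum_branch, detail_branch, weight_map):
    """Fuse the two refined outputs with a per-pixel weight map.

    illum_branch, detail_branch: (H, W, 3) restored images in [0, 1].
    weight_map: (H, W) values in [0, 1]; regions with weight near 1 take
    more from the illumination branch, near 0 from the detail branch.
    """
    w = weight_map[..., None]                  # broadcast over channels
    return w * illum_branch + (1.0 - w) * detail_branch

a = np.random.rand(64, 64, 3)    # placeholder illumination-refined image
b = np.random.rand(64, 64, 3)    # placeholder detail-refined image
w = np.random.rand(64, 64)       # placeholder region-variant weights
fused = region_adaptive_fusion(a, b, w)
print(fused.shape)
```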

7 citations


Journal ArticleDOI
Tianrong Rao, Jie Li, Xiaoyu Wang, Yibo Sun, Hong Chen
TL;DR: A novel multiscale graph convolutional network (GCN) based on landmark graphs extracted from facial images is proposed that outperforms the traditional deep learning frameworks and achieves more stable performance on different datasets.
Abstract: Recognizing emotion through facial expression is now widely applied in our daily lives. Therefore, facial expression recognition (FER) is attracting increasing research interest in the fields of artificial intelligence and multimedia. With the development of convolutional neural networks (CNNs), end-to-end deep learning frameworks for FER have achieved great success on large-scale datasets. However, these works still face the problems of redundant information and data bias, which noticeably decrease FER performance. In this article, we propose a novel multiscale graph convolutional network (GCN) based on landmark graphs extracted from facial images. The proposed method is evaluated on several popular datasets. The results show that it outperforms traditional deep learning frameworks and achieves more stable performance across datasets.

7 citations


Journal ArticleDOI
TL;DR: This article proposes novel Transformer-based multimodal frameworks for the navigator and speaker, respectively, where the multihead self-attention with the residual connection is used to carry the information flow.
Abstract: Prior works in vision-and-language navigation (VLN) focus on using long short-term memory (LSTM) to carry the flow of information in either the navigation model (navigator) or the instruction-generating model (speaker). The outstanding capability of LSTM to process inter-modal interactions has been widely verified; however, LSTM neglects intra-modal interactions, which negatively affects either the navigator or the speaker. Attention-based Transformers perform well in sequence-to-sequence translation, but a Transformer structure applied directly to VLN has yet to prove satisfactory. In this paper, we propose novel Transformer-based multimodal frameworks for the navigator and speaker, respectively. In our frameworks, multihead self-attention with residual connections is used to carry the information flow. Specifically, we set a switch to prevent them from directly entering the information flow in our navigator framework. In experiments, we verify the effectiveness of our proposed approach and show significant performance advantages over the baselines.

Journal ArticleDOI
TL;DR: In this article, a multimodal dataset consisting of 180 videos with accompanying audio recordings and transcripts, featuring 88 politicians categorized by political party, was used to detect deception in political statements.
Abstract: Political statements are carefully crafted to garner public support for a particular ideology. These statements are often biased and sometimes misleading. Separating fact from fiction has proven to be a difficult task, generally accomplished by cross-checking political statements against an impartial and trustworthy news source. In this article, we make three contributions. First, we compile a novel multimodal dataset, which consists of 180 videos with accompanying audio recordings and transcripts, featuring 88 politicians categorized by political party. To our knowledge, this is the second multimodal deception detection dataset from real-life data and the first in the political field. Second, we extract features from the linguistic, visual, and acoustic modalities to develop a system capable of discriminating between truthful and deceptive political statements. Finally, we perform an extensive analysis on different multimodal features to identify the behavioral patterns used by politicians when it comes to deception.

Journal ArticleDOI
Haiying Xia, Changyuan Li, Yumei Tan, Lingyun Li, Shuxiang Song
TL;DR: A simple and efficient framework is proposed that learns more discriminative expression features from scrambled facial images: the input image is divided into equal-sized local subregions, which are shuffled randomly within a certain range to obtain a damaged image and increase the difficulty of recognition.
Abstract: The most discriminative expression features are mostly concentrated in local key facial regions. Thus, we propose a simple and efficient framework that can learn more discriminative expression features from scrambled facial images. Specifically, we first divide the input image into local subregions of the same size and shuffle them randomly within a certain range to obtain a damaged image, which increases the difficulty of recognition. Then, the original image and the damaged image are fed to the network. A channel attention module is exploited to highlight effective features and suppress irrelevant ones. Simultaneously, during the reconstruction phase, a region alignment model is appended to establish the semantic correlation between subregions, aiming to restore the original spatial layout of the local subregions in the original image. Extensive experiments on the RAF-DB and FERPlus datasets demonstrate that our proposed method significantly outperforms state-of-the-art methods without any external facial expression pretraining.
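
The scrambling step can be illustrated with the sketch below, which splits an image into an equal-size tile grid and shuffles tiles only within a local range using a rank-with-jitter trick; the grid size, range, and shuffling rule are our assumptions and may differ from the authors' implementation:

```python
import numpy as np

def scramble_image(img, grid=7, k=2, seed=0):
    """Split an image into a grid x grid mosaic of equal tiles and shuffle
    each tile only within a local range of about k slots."""
    h, w = img.shape[:2]
    th, tw = h // grid, w // grid
    img = img[:th * grid, :tw * grid]          # drop remainder pixels
    tiles = [img[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
             for r in range(grid) for c in range(grid)]

    rng = np.random.default_rng(seed)
    # Rank-with-jitter: each tile's sort key is its original index plus
    # noise in [-k, k], so tiles move only a few slots from their origin.
    keys = np.arange(grid * grid) + rng.uniform(-k, k, grid * grid)
    order = np.argsort(keys)

    rows = [np.concatenate([tiles[order[r * grid + c]] for c in range(grid)], axis=1)
            for r in range(grid)]
    return np.concatenate(rows, axis=0)

face = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)  # placeholder
damaged = scramble_image(face)
print(damaged.shape)   # (224, 224, 3)
```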

Journal ArticleDOI
TL;DR: The approach combines multiple human co-occurring modalities and two interpretations of context, and uses multiplicative fusion to combine the modality and context channels, learning to focus on the more informative input channels and suppress the others for every incoming datapoint.
Abstract: We present a learning model for multimodal context-aware emotion recognition. Our approach combines multiple human co-occurring modalities (such as facial, audio, textual, and pose/gait) and two interpretations of context. To gather and encode background semantic information for the first context interpretation from the input image/video, we use a self-attention-based CNN. Similarly, to model the sociodynamic interactions among people (the second context interpretation) in the input image/video, we use depth maps. We use multiplicative fusion to combine the modality and context channels, which learns to focus on the more informative input channels and suppress the others for every incoming datapoint. We demonstrate the efficiency of our model on four benchmark emotion recognition datasets (IEMOCAP, CMU-MOSEI, EMOTIC, and GroupWalk). Our model outperforms state-of-the-art (SOTA) learning methods, with an average 5%–9% improvement across all the datasets. We also perform ablation studies to motivate the importance of multimodality, context, and multiplicative fusion.
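
As a simplified picture of multiplicative fusion (a product-of-experts style combination, not the paper's exact training objective), the sketch below combines per-modality class probabilities with per-sample channel weights; a weight near zero effectively suppresses a noisy channel:

```python
import numpy as np

def multiplicative_fusion(prob_per_modality, reliability):
    """Combine per-modality class probabilities multiplicatively.

    prob_per_modality: (M, num_classes) probabilities from M channels.
    reliability: length-M per-sample weights (learned in the paper,
    supplied by hand here); a weight near 0 suppresses that channel.
    """
    probs = np.asarray(prob_per_modality)        # (M, num_classes)
    w = np.asarray(reliability)[:, None]         # (M, 1)
    log_fused = (w * np.log(probs + 1e-8)).sum(axis=0)
    fused = np.exp(log_fused - log_fused.max())  # stable renormalization
    return fused / fused.sum()

face  = np.array([0.7, 0.2, 0.1])    # hypothetical 3-class outputs
voice = np.array([0.3, 0.4, 0.3])
text  = np.array([0.1, 0.1, 0.8])    # noisy channel for this datapoint
print(multiplicative_fusion([face, voice, text], reliability=[1.0, 1.0, 0.2]))
```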

Journal ArticleDOI
TL;DR: In this paper, the authors analyzed and determined the factors that have an impact on the intensive use of Instagram and its relationship with smartphone addiction and self-esteem, and found that there are statistically significant differences in the intensity of Instagram usage and smartphone addiction according to the individual's employment status and the level of studies completed.
Abstract: This study aims to analyze and determine the factors that have an impact on the intensive use of Instagram and its relationship with smartphone addiction and self-esteem. A total of 389 Instagram users aged between 18 and 57 (M = 23.98; SD = 5.37) completed an online survey based on three standardized scales. The findings of the study suggest that there are statistically significant differences in the intensive use of Instagram and smartphone addiction, according to the individual's employment status and the level of studies completed. Furthermore, the multiple linear regression analysis established age and time spent on the social network as predictors of the intensive use of Instagram and smartphone addiction. Finally, the structural equation model showed a positive correlation between making an intensive use of Instagram and smartphone addiction, and a negative correlation between smartphone addiction and self-esteem.

Journal ArticleDOI
TL;DR: Three image and two text classification models are employed to recognize emotions and two prediction emotion synthesis methods are proposed to synthesize the outputs of multiple models, resulting in emotion distribution, confusion, and transfer.
Abstract: Emotion is a valuable signal, useful for many applications such as public opinion detection and psychological disease prediction. Emotion recognition on multimodal data has attracted extensive attention. Although modified model structures and multimodal feature fusion methods have contributed greatly to emotion recognition, little attention has been paid to mining implicit emotion relationships. In this article, the implicit emotion relationship consists of emotion distribution, confusion, and transfer. Emotion distribution allows multiple emotions in one sample, while confusion and transfer describe prediction confusion and bias. To mine implicit emotion relationships in multimodal data, this article employs three image and two text classification models to recognize emotions. Two prediction emotion synthesis methods (optimal prediction emotion synthesis and majority prediction emotion synthesis) are proposed to synthesize the outputs of the multiple models. Based on the results of the two emotion synthesis methods, the emotion distribution over samples is obtained. Emotion confusion and transfer among different emotion samples are analyzed using relative entropy and the Jensen–Shannon divergence. Implicit emotion relationship mining has potential not only for interpreting model performance, but also for guiding the development of emotion recognition as prior knowledge. Finally, we take a topic scenario as an instance to mine implicit emotion relationships.
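
The two divergences used in the confusion and transfer analysis can be computed directly; the sketch below implements relative entropy (KL divergence) and the Jensen–Shannon divergence with NumPy on hypothetical emotion distributions:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Relative entropy D(p || q) between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    """Jensen-Shannon divergence: symmetrized, smoothed KL via the mixture."""
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical averaged prediction distributions for two emotion groups.
group_a = [0.70, 0.15, 0.10, 0.05]
group_b = [0.55, 0.30, 0.10, 0.05]
print(kl(group_a, group_b), js(group_a, group_b))
```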

Journal ArticleDOI
TL;DR: An end-to-end expression-guided generative adversarial network (EGGAN) that synthesizes an image with the expected expression given a continuous expression label and a structured latent code, and can edit and synthesize continuous intermediate expressions between source and target expressions.
Abstract: Fine-grained facial expression manipulation aims at changing the expression of an image without altering facial identity. Most current expression manipulation methods are based on a discrete expression label, which mainly manipulates the holistic expression and neglects details. To handle these problems, we propose an end-to-end expression-guided generative adversarial network (EGGAN), which synthesizes an image with the expected expression given a continuous expression label and a structured latent code. In particular, an adversarial autoencoder is used to translate a source image into a structured latent space. The encoded latent code and the target expression label are input to a conditional GAN to synthesize an image with the target expression. Moreover, a perceptual loss and a multiscale structural similarity loss are introduced to preserve facial identity and global shape during expression manipulation. Extensive experiments demonstrate that our approach can edit fine-grained expressions and synthesize continuous intermediate expressions between source and target expressions.

Journal ArticleDOI
TL;DR: A face spoofing detection method by learning to fuse high-frequency and low-frequency features, in an effort to improve the generalization capability and fill up the domain gap between training and testing when the antispoofing is practically conducted in unseen scenarios.
Abstract: In this article, we propose a face spoofing detection method by learning to fuse high-frequency (HF) and low-frequency (LF) features, in an effort to improve the generalization capability and fill up the domain gap between training and testing when the antispoofing is practically conducted in unseen scenarios. In particular, the proposed face antispoofing model consists of two streams that extract HF and LF components of a facial image with three high-pass and three low-pass filters. Moreover, considering the fact that spoofing features exist in different feature levels, we train our network with a novel multiscale triplet loss. The cross-frequency spatial attention module further enables the two streams to communicate and exchange information with each other. Finally, the outputs of the two streams are fused with a weighting strategy for final classification. Extensive experiments conducted on intra- and cross-database settings show the superiority of the proposed scheme.
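
The frequency decomposition at the input of the two streams can be sketched as follows, using three Gaussian low-pass filters and their residuals as the high-pass parts; the filter choice and widths are our assumptions rather than the paper's exact filters:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def split_frequencies(img, sigmas=(1.0, 2.0, 4.0)):
    """Decompose a face image into low-frequency (LF) and high-frequency
    (HF) components with three Gaussian low-pass filters of different
    widths; each HF part is the residual (image minus its blur). The LF
    stack would feed one stream and the HF stack the other.
    """
    img = img.astype(np.float32)
    lows = [gaussian_filter(img, sigma=s) for s in sigmas]
    highs = [img - low for low in lows]
    return np.stack(lows), np.stack(highs)

face = np.random.rand(112, 112).astype(np.float32)   # placeholder face crop
lf, hf = split_frequencies(face)
print(lf.shape, hf.shape)    # (3, 112, 112) each
```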

Journal ArticleDOI
TL;DR: A continuous style embedding is added to the general formulation of the variational autoencoder (VAE) to allow users to condition on the style of the generated music.
Abstract: Classically, the style of music generated by deep learning models is governed by the training dataset. In this article, we improve on this by adding a continuous style embedding $z_s$ to the general formulation of the variational autoencoder (VAE), allowing users to condition on the style of the generated music. For this purpose, we explore and compare two different methods of integrating $z_s$ into the VAE. In the literature on conditional generative modeling, disentanglement of attributes from the latent space is often associated with better generative performance. In our experiments, we find that this is not the case with our proposed model. Empirically and from a music theory perspective, we show that our proposed model can generate better music samples than a baseline model that uses a discrete style label. The source code and generated samples are available at https://github.com/daQuincy/DeepMusicvStyle .

Journal ArticleDOI
TL;DR: In this article, a lightweight deep architecture of approximately 1 MB is proposed for emotion recognition in video through the interaction of visual, audio, and language information in an end-to-end learning manner, with three key points: a lightweight feature extractor, an attention strategy, and an adaptive loss.
Abstract: This work presents an approach for emotion recognition in video through the interaction of visual, audio, and language information in an end-to-end learning manner, with three key points: 1) a lightweight feature extractor, 2) an attention strategy, and 3) an adaptive loss. We propose a lightweight deep architecture of approximately 1 MB for feature extraction, the most crucial part of emotion recognition systems. Temporal relationships among features are explored with a temporal convolutional network instead of an RNN-based architecture, to leverage parallelism and avoid the vanishing-gradient problem. The attention strategy is employed to adjust the knowledge of the temporal networks along the time dimension and to learn each modality's contribution to the final results. The interaction between the modalities is also investigated when training with the adaptive objective function, which adjusts the network's gradient. Experimental results on a large-scale Korean emotion recognition dataset demonstrate the superiority of our method when employing the attention mechanism and adaptive loss during training.

Journal ArticleDOI
TL;DR: This work proposes a new state representation learning scheme with Adjacent State Consistency Loss (ASC Loss), based on the hypothesis that the distance between adjacent states is smaller than that of far apart ones, since scenes in videos generally evolve smoothly.
Abstract: Through well-designed optimization paradigms and deep neural networks as feature extractors, deep reinforcement learning (DRL) algorithms learn optimal policies on discrete and continuous action spaces. However, this capability is restricted by low sampling efficiency. By inspecting the importance of feature extraction in DRL, we find that state feature learning is one of the key obstacles to sampling efficiently. To this end, we propose a new state representation learning scheme with an adjacent state consistency loss (ASC loss). The loss is based on the hypothesis that the distance between adjacent states is smaller than that between far-apart ones, since scenes in videos generally evolve smoothly. We exploit the ASC loss as an auxiliary to the RL loss in the training phase to boost state feature learning, and we evaluate it on existing DRL algorithms as well as a behavioral cloning algorithm. Experiments on Atari games and MuJoCo continuous control tasks demonstrate the effectiveness of our scheme.
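
Under our reading of the hypothesis, the ASC loss can be sketched as a triplet-style hinge on encoded states: an embedding should lie closer to its temporal neighbor than to a far-apart state. The margin and the NumPy formulation below are illustrative only:

```python
import numpy as np

def asc_loss(phi_t, phi_t1, phi_far, margin=1.0):
    """Adjacent-state-consistency loss as a triplet-style hinge.

    phi_t, phi_t1, phi_far: (batch, dim) encoded states for time t, its
    neighbor t+1, and a randomly sampled far-apart time step. The loss is
    zero once each adjacent pair is closer than the far pair by `margin`.
    """
    d_adj = np.linalg.norm(phi_t - phi_t1, axis=1)
    d_far = np.linalg.norm(phi_t - phi_far, axis=1)
    return float(np.maximum(0.0, d_adj - d_far + margin).mean())

b, dim = 32, 64   # placeholder batch of encoded states
print(asc_loss(np.random.randn(b, dim),
               np.random.randn(b, dim),
               np.random.randn(b, dim)))
```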

Journal ArticleDOI
TL;DR: In this paper, a multimodal event-aware network is proposed to analyze sentiment from Weibos that contain multiple modalities, i.e., text and images, to obtain more discriminative representations, based on which they simultaneously perceive the event and sentiment in a multitask framework.
Abstract: Considering the applications of sentiment analysis in decision-making and personalized advertising, we adopt it in tourism. Specifically, we perform sentiment analysis on posted Weibos about passengers' experiences in civil aviation travel. Different travel events can influence passengers' sentiment; e.g., a flight delay may cause negative sentiment. Inspired by this observation, we propose a novel multimodal event-aware network to analyze sentiment from Weibos that contain multiple modalities, i.e., text and images. We first extract features from each modality and then model the cross-modal associations to obtain more discriminative representations, based on which we simultaneously perceive the event and the sentiment in a multitask framework. Extensive experiments demonstrate that the proposed method outperforms existing state-of-the-art approaches.

Journal ArticleDOI
TL;DR: A novel multilevel attention network is proposed to hierarchically learn an efficient feature embedding for vehicle re-ID, guaranteeing intraclass compactness and interclass separability.
Abstract: The rapid development and popularization of video surveillance highlight the critical and challenging problem of vehicle reidentification, which suffers from the limited interinstance discrepancy between different vehicle identities and large intrainstance differences of the same vehicle. In this article, we propose a novel multilevel attention network to hierarchically learn an efficient feature embedding for vehicle re-ID. Three kinds of attention are designed in the network: hard local-level attention to localize vehicle salient parts, soft pixel-level attention to refine attended pixels both globally and locally, and spatial attention to enhance the encoder's spatial awareness of salient regions within the windscreen area. Multigrain features are subsequently learned from semantic awareness to spatial awareness, guaranteeing the intraclass compactness and interclass separability for vehicle re-ID. Extensive experiments validate the effectiveness of each attention component and demonstrate that our approach outperforms state-of-the-art re-ID methods on two challenging datasets: VehicleID and Vehicle-1M.

Journal ArticleDOI
TL;DR: In this article, the authors explored behavioral and biometric data as a way to enhance the usefulness of virtual reality and augmented reality (VR/AR) applications, with potential in affective learning, resource generation, and developer tools.
Abstract: Multimedia is one of the key drivers improving virtual reality and augmented reality (VR/AR), which are promising to reform human–computer interaction in the future with lower-cost and all-in-one headsets containing powerful hardware. Advances in multimedia research on video compression and human–computer interfaces have further enhanced the immersion and efficiency of experiences on the platform. However, many VR/AR experiences are still very difficult to build using traditional engineering methods and many available behavioral and biometric data have not been well explored. Further research in the multimedia community is needed to enhance the usefulness of these systems, with potential in affective learning, resource generation, and developer tools.

Journal ArticleDOI
TL;DR: A novel multiple instance learning (MIL) based model, VQA-MIL, is developed, which dynamically adjusts block weights with a block-wise attention module and enriches the features of video bags with an MI Pooling layer; it outperforms popular state-of-the-art no-reference VQA methods on NUDVs.
Abstract: Each part of a nonuniform distorted video (NUDV) has a unique degree of distortion. When NUDV blocks are used as inputs, traditional machine-learning-based video quality assessment (VQA) methods frequently do not work effectively, because these methods directly assign the label of the entire video to its blocks, making the block labels unreliable. We propose the video bag, a collection of video blocks, to deal with this unreliability. We develop a novel multiple instance learning (MIL) based model, VQA-MIL, which dynamically adjusts block weights with a block-wise attention module and enriches the features of video bags with an MI Pooling layer. Furthermore, we apply the mixup data-augmentation strategy to address the lack of human labels in common video datasets. We test our method on LIVE and CSIQ, and on a relatively large-scale dataset, named NUDV-KT, that we have collected. The results show that our method outperforms popular state-of-the-art no-reference VQA methods on NUDVs.
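
A minimal sketch of attention-based pooling over the blocks of one video bag is shown below (the standard attention-MIL formulation, with toy, unlearned parameters); the paper's block-wise attention module and MI Pooling layer are more elaborate:

```python
import numpy as np

def attention_mil_pool(block_feats, w, v):
    """Attention pooling over the blocks (instances) of one video bag.

    block_feats: (num_blocks, dim) per-block features.
    v: (dim, hidden) and w: (hidden,) would be learned parameters; here
    they are plain inputs. Each block gets a weight, and the bag feature
    is the weighted sum of block features.
    """
    scores = np.tanh(block_feats @ v) @ w          # (num_blocks,)
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()
    return alpha, alpha @ block_feats              # bag-level feature

blocks = np.random.randn(16, 256)                  # 16 blocks in one bag
alpha, bag_feat = attention_mil_pool(blocks,
                                     w=np.random.randn(64),
                                     v=np.random.randn(256, 64))
print(alpha.shape, bag_feat.shape)                 # (16,) (256,)
```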

Journal ArticleDOI
TL;DR: This work proposes the first method to predict the Satisfied User Ratio (SUR) for symmetrically and asymmetrically compressed stereoscopic images, and exploits the properties of binocular vision.
Abstract: The satisfied user ratio (SUR) for a given distortion level is the fraction of subjects who cannot perceive a quality difference between the original image and its compressed version. By predicting the SUR, one can determine the highest distortion level that saves bit rate while still guaranteeing good visual quality. We propose the first method to predict the SUR for symmetrically and asymmetrically compressed stereoscopic images. Unlike SUR prediction techniques for two-dimensional images and videos, our method exploits the properties of binocular vision. We first extract features that characterize image quality and image content. Then, we use gradient boosting decision trees to reduce the number of features and train a regression model that learns a mapping from the features to the SUR values. Experimental results on the SIAT-JSSI and SIAT-JASI datasets show high SUR prediction accuracy for H.265 All-Intra and JPEG2000 symmetrically and asymmetrically compressed stereoscopic images.
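
The regression stage can be sketched with scikit-learn's gradient boosting trees: fit on quality/content features, use the learned feature importances to keep a reduced feature set, and refit. The data below are random placeholders, not the SIAT features or SUR labels:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Placeholder training set: rows are (quality + content) feature vectors
# for compressed stereo pairs, targets are SUR values in [0, 1].
X = np.random.rand(500, 40)
y = np.random.rand(500)

model = GradientBoostingRegressor(n_estimators=300, max_depth=3,
                                  learning_rate=0.05)
model.fit(X, y)

# Keep only the most important features (feature reduction with the
# trees), then train the final regressor on the reduced set.
keep = np.argsort(model.feature_importances_)[::-1][:15]
model_small = GradientBoostingRegressor(n_estimators=300, max_depth=3,
                                        learning_rate=0.05).fit(X[:, keep], y)
print(model_small.predict(X[:5, keep]))
```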

Journal ArticleDOI
TL;DR: In this paper, the authors proposed an efficient framework for 3-D anomaly detection in power grids using UAV-based aerial images; the framework uses context information to adapt to the different nonlinear movements that are unavoidable in aerial imaging.
Abstract: Critical utility infrastructure such as power grids is vast (spanning hundreds of kilometers), linear, and operated 24×7 throughout the year. Maintenance inspections using low-cost unmanned aerial vehicles (UAVs) and aerial imaging are therefore gaining popularity. To keep the framework low cost, the camera used is neither of high quality nor part of a stereo rig, and the sensors used are limited in variety and efficiency. 3-D reconstruction of a power grid helps to improve access, detect anomalies (damage), and reduce projection error. Depth estimation of wiry objects, such as powerlines, against a cluttered background is challenging; the clutter includes trees, pavement, greenery patches, and man-made objects. In this article, we propose an efficient framework for 3-D anomaly detection in power grids using UAV-based aerial images. The framework uses context information to adapt to the different nonlinear movements that are unavoidable in aerial imaging. The proposed work is tested on real data captured with a low-cost setup consisting of a non-stereo-rig aerial camera and a mini-UAV.

Journal ArticleDOI
TL;DR: A new CNN-based post-processing approach is proposed and integrated with two state-of-the-art coding standards, VVC and AV1.
Abstract: In recent years, video compression techniques have been significantly challenged by the rapidly increased demands associated with high quality and immersive video content. Among various compression tools, post-processing can be applied on reconstructed video content to mitigate visible compression artefacts and to enhance overall perceptual quality. Inspired by advances in deep learning, we propose a new CNN-based post-processing approach, which has been integrated with two state-of-the-art coding standards, VVC and AV1. The results show consistent coding gains on all tested sequences at various spatial resolutions, with average bit rate savings of 4.0% and 5.8% against original VVC and AV1 respectively (based on the assessment of PSNR). This network has also been trained with perceptually inspired loss functions, which have further improved reconstruction quality based on perceptual quality assessment (VMAF), with average coding gains of 13.9% over VVC and 10.5% against AV1.

Journal ArticleDOI
TL;DR: This article focuses on the sentiment-aware emoji insertion task, which predicts multiple emojis and their positions in a sentence conditioned on the plain texts and sentiment polarities, and forms the insertion process as a sequence tagging task and applies a BERT-BiLSTM-CRF model.
Abstract: Due to the booming popularity of online social networks, emojis have been widely used in online communication. As nonverbal language units, emojis help to convey emotions and express feelings. In this article, we focus on the sentiment-aware emoji insertion task, which predicts multiple emojis and their positions in a sentence conditioned on the plain texts and sentiment polarities. To facilitate future research in this field, we construct a large-scale emoji insertion corpus named “MultiEmoji,” which contains 420 000 English posts with at least one emoji per post. We formulate the insertion process as a sequence tagging task and apply a BERT-BiLSTM-CRF model to the insertion of emojis. Extensive experiments illustrate that our model outperforms existing methods by a large margin.
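
One simple way to picture the sequence-tagging formulation (our simplified reading; the paper's tag scheme may differ) is to convert an emoji-bearing post into plain tokens plus per-token tags naming the emoji to insert after each token:

```python
def to_tagging_example(tokens_with_emojis, emoji_vocab):
    """Convert a post containing emojis into a sequence-tagging example:
    each plain-text token is tagged with the emoji that follows it, or
    'O' if none. A tagger such as BERT-BiLSTM-CRF would then be trained
    on the resulting (tokens, tags) pairs.
    """
    tokens, tags = [], []
    for tok in tokens_with_emojis:
        if tok in emoji_vocab:
            if tags:                      # attach emoji to previous token
                tags[-1] = tok
            continue
        tokens.append(tok)
        tags.append("O")
    return tokens, tags

post = ["had", "the", "best", "day", "😍", "at", "the", "beach", "🏖️"]
print(to_tagging_example(post, emoji_vocab={"😍", "🏖️"}))
# (['had', 'the', 'best', 'day', 'at', 'the', 'beach'],
#  ['O', 'O', 'O', '😍', 'O', 'O', '🏖️'])
```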

Journal ArticleDOI
TL;DR: Multimedia tools, techniques, and applications have been developed to facilitate the recovery, resilience, and management of COVID-19, including pandemic status monitoring and impact prediction, enhancing public awareness, telehealth, and more.
Abstract: Coronavirus Disease 2019 (COVID-19) has been affecting most of the countries and impacting almost every aspect of people's lives. More than one hundred million confirmed cases and two million deaths have been reported due to COVID-19 as of February 2021. While our society suffers an unanticipated epidemic, researchers and engineers have developed various technologies to manage this global emergency. Specifically, multimedia tools, techniques, and applications have been developed and played essential roles in facilitating the recovery, resilience, and management of COVID-19, including pandemic status monitoring and impact prediction, enhancing public awareness and telehealth, etc. However, there are many challenges that require further investigation and research to better manage COVID-19 and prepare for future pandemics.