Showing papers in "ACM Transactions on Multimedia Computing, Communications, and Applications in 2020"

PDF

Open Access

Journal Article•DOI•

Dual-path Convolutional Image-Text Embeddings with Instance Loss

[...]

Zhedong Zheng¹, Liang Zheng², Michael Garrett³, Yi Yang¹, Mingliang Xu⁴, Yi-Dong Shen⁵ - Show less +2 more•Institutions (5)

University of Technology, Sydney¹, Australian National University², Edith Cowan University³, Zhengzhou University⁴, Chinese Academy of Sciences⁵

19 May 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: Zhang et al. as discussed by the authors proposed an end-to-end dual-path convolutional network to learn the image and text representations, which is based on an unsupervised assumption that each image/text group can be viewed as a class.

...read moreread less

Abstract: Matching images and sentences demands a fine understanding of both modalities. In this article, we propose a new system to discriminatively embed the image and text to a shared visual-textual space. In this field, most existing works apply the ranking loss to pull the positive image/text pairs close and push the negative pairs apart from each other. However, directly deploying the ranking loss on heterogeneous features (i.e., text and image features) is less effective, because it is hard to find appropriate triplets at the beginning. So the naive way of using the ranking loss may compromise the network from learning inter-modal relationship. To address this problem, we propose the instance loss, which explicitly considers the intra-modal data distribution. It is based on an unsupervised assumption that each image/text group can be viewed as a class. So the network can learn the fine granularity from every image/text group. The experiment shows that the instance loss offers better weight initialization for the ranking loss, so that more discriminative embeddings can be learned. Besides, existing works usually apply the off-the-shelf features, i.e., word2vec and fixed visual feature. So in a minor contribution, this article constructs an end-to-end dual-path convolutional network to learn the image and text representations. End-to-end learning allows the system to directly learn from the data and fully utilize the supervision. On two generic retrieval datasets (Flickr30k and MSCOCO), experiments demonstrate that our method yields competitive accuracy compared to state-of-the-art methods. Moreover, in language-based person retrieval, we improve the state of the art by a large margin. The code has been made publicly available.

...read moreread less

161 citations

Journal Article•DOI•

Machine Learning Techniques for the Diagnosis of Alzheimer’s Disease: A Review

[...]

Muhammad Tanveer¹, Bharat Richhariya¹, R. U. Khan¹, Aamir Rashid², Pritee Khanna³, Mukesh Prasad⁴, Chin-Teng Lin⁴ - Show less +3 more•Institutions (4)

Indian Institute of Technology Indore¹, National Institute of Standards and Technology², Indian Institute of Information Technology, Design and Manufacturing, Jabalpur³, University of Technology, Sydney⁴

15 Apr 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: A large number of novel and efficient automated techniques are needed for early diagnosis of Alzheimer’s disease, and many novel approaches to diagnosis are being developed.

...read moreread less

Abstract: Alzheimer’s disease is an incurable neurodegenerative disease primarily affecting the elderly population. Efficient automated techniques are needed for early diagnosis of Alzheimer’s. Many novel approaches are proposed by researchers for classification of Alzheimer’s disease. However, to develop more efficient learning techniques, better understanding of the work done on Alzheimer’s is needed. Here, we provide a review on 165 papers from 2005 to 2019, using various feature extraction and machine learning techniques. The machine learning techniques are surveyed under three main categories: support vector machine (SVM), artificial neural network (ANN), and deep learning (DL) and ensemble methods. We present a detailed review on these three approaches for Alzheimer’s with possible future directions.

...read moreread less

128 citations

Journal Article•DOI•

Depth Image Denoising Using Nuclear Norm and Learning Graph Model

[...]

Chenggang Yan¹, Li Zhisheng¹, Yongbing Zhang², Yutao Liu², Xiangyang Ji², Yongdong Zhang³ - Show less +2 more•Institutions (3)

Hangzhou Dianzi University¹, Tsinghua University², University of Science and Technology of China³

16 Dec 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: A group-based nuclear norm and learning graph (GNNLG) model, where for each patch, the most similar patches within a searching window are found, which is superior to other current state-of-the-art denoising methods in both subjective and objective criterion.

...read moreread less

Abstract: Depth image denoising is increasingly becoming the hot research topic nowadays, because it reflects the three-dimensional scene and can be applied in various fields of computer vision. But the depth images obtained from depth camera usually contain stains such as noise, which greatly impairs the performance of depth-related applications. In this article, considering that group-based image restoration methods are more effective in gathering the similarity among patches, a group-based nuclear norm and learning graph (GNNLG) model was proposed. For each patch, we find and group the most similar patches within a searching window. The intrinsic low-rank property of the grouped patches is exploited in our model. In addition, we studied the manifold learning method and devised an effective optimized learning strategy to obtain the graph Laplacian matrix, which reflects the topological structure of image, to further impose the smoothing priors to the denoised depth image. To achieve fast speed and high convergence, the alternating direction method of multipliers is proposed to solve our GNNLG. The experimental results show that the proposed method is superior to other current state-of-the-art denoising methods in both subjective and objective criterion.

...read moreread less

125 citations

Journal Article•DOI•

DenseNet-201-Based Deep Neural Network with Composite Learning Factor and Precomputation for Multiple Sclerosis Classification

[...]

Shuihua Wang¹, Yudong Zhang²•Institutions (2)

Loughborough University¹, University of Leicester²

20 Jun 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: Convolutional neural network can achieve good results, but tuning hyperparameters of CNN needs expert know-how, so expert knows how to tune the parameters of the network.

...read moreread less

Abstract: (Aim) Multiple sclerosis is a neurological condition that may cause neurologic disability. Convolutional neural network can achieve good results, but tuning hyperparameters of CNN needs expert knowledge and are difficult and time-consuming. To identify multiple sclerosis more accurately, this article proposed a new transfer-learning-based approach. (Method) DenseNet-121, DenseNet-169, and DenseNet-201 neural networks were compared. In addition, we proposed the use of a composite learning factor (CLF) that assigns different learning factor to three types of layers: early frozen layers, middle layers, and late replaced layers. How to allocate layers into those three layers remains a problem. Hence, four transfer learning settings (viz., Settings A, B, C, and D) were tested and compared. A precomputation method was utilized to reduce the storage burden and accelerate the program. (Results) We observed that DenseNet-201-D (the layers from CP to T3 are frozen, the layers of D4 are updated with learning factor of 1, and the final new layers of FCL are randomly initialized with learning factor of 10) can achieve the best performance. The sensitivity, specificity, and accuracy of DenseNet-201-D was 98.27± 0.58, 98.35± 0.69, and 98.31± 0.53, respectively. (Conclusion) Our method gives better performances than state-of-the-art approaches. Furthermore, this composite learning rate gives superior results to traditional simple learning factor (SLF) strategy.

...read moreread less

119 citations

Journal Article•DOI•

Adaptive Exploration for Unsupervised Person Re-identification

[...]

Yuhang Ding¹, Hehe Fan², Mingliang Xu³, Yi Yang²•Institutions (3)

Southern University of Science and Technology¹, University of Technology, Sydney², Zhengzhou University³

17 Feb 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: In this article, an adaptive exploration (AE) method is proposed to address the domain shift problem for re-ID in an unsupervised manner, where a non-parametric classifier with a feature memory is exploited to encourage person images to move far away from each other.

...read moreread less

Abstract: Due to domain bias, directly deploying a deep person re-identification (re-ID) model trained on one dataset often achieves considerably poor accuracy on another dataset. In this article, we propose an Adaptive Exploration (AE) method to address the domain-shift problem for re-ID in an unsupervised manner. Specifically, in the target domain, the re-ID model is inducted to (1) maximize distances between all person images and (2) minimize distances between similar person images. In the first case, by treating each person image as an individual class, a non-parametric classifier with a feature memory is exploited to encourage person images to move far away from each other. In the second case, according to a similarity threshold, our method adaptively selects neighborhoods for each person image in the feature space. By treating these similar person images as the same class, the non-parametric classifier forces them to stay closer. However, a problem of the adaptive selection is that, when an image has too many neighborhoods, it is more likely to attract other images as its neighborhoods. As a result, a minority of images may select a large number of neighborhoods while a majority of images has only a few neighborhoods. To address this issue, we additionally integrate a balance strategy into the adaptive selection. We evaluate our methods with two protocols. The first one is called “target-only re-ID”, in which only the unlabeled target data is used for training. The second one is called “domain adaptive re-ID”, in which both the source data and the target data are used during training. Experimental results on large-scale re-ID datasets demonstrate the effectiveness of our method. Our code has been released at https://github.com/dyh127/Adaptive-Exploration-for-Unsupervised-Person-Re-Identification.

...read moreread less

89 citations

Journal Article•DOI•

Exploring Deep Learning for View-Based 3D Model Retrieval

[...]

Zan Gao¹, Yinming Li¹, Shaohua Wan²•Institutions (2)

Tianjin University of Technology¹, Zhongnan University of Economics and Law²

17 Feb 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: This work systematically evaluates the performance of deep learning features in view-based 3D model retrieval on four popular datasets (ETH, NTU60, PSB, and MVRED) by different kinds of similarity measure methods, and it is clear that theseDeep learning features can consistently outperform all of the hand-crafted features, and they are also more robust than the Handcrafted features when different degrees of noise are added into the image.

...read moreread less

Abstract: In recent years, view-based 3D model retrieval has become one of the research focuses in the field of computer vision and machine learning. In fact, the 3D model retrieval algorithm consists of feature extraction and similarity measurement, and the robust features play a decisive role in the similarity measurement. Although deep learning has achieved comprehensive success in the field of computer vision, deep learning features are used for 3D model retrieval only in a small number of works. To the best of our knowledge, there is no benchmark to evaluate these deep learning features. To tackle this problem, in this work we systematically evaluate the performance of deep learning features in view-based 3D model retrieval on four popular datasets (ETH, NTU60, PSB, and MVRED) by different kinds of similarity measure methods. In detail, the performance of hand-crafted features and deep learning features are compared, and then the robustness of deep learning features is assessed. Finally, the difference between single-view deep learning features and multi-view deep learning features is also evaluated. By quantitatively analyzing the performances on different datasets, it is clear that these deep learning features can consistently outperform all of the hand-crafted features, and they are also more robust than the hand-crafted features when different degrees of noise are added into the image. The exploration of latent relationships among different views in multi-view deep learning network architectures shows that the performance of multi-view deep learning outperforms that of single-view deep learning features with low computational complexity.

...read moreread less

88 citations

Journal Article•DOI•

Securing Multimedia by Using DNA-Based Encryption in the Cloud Computing Environment

[...]

Suyel Namasudra¹, Rupak Chakraborty², Abhishek Majumder³, Nageswara Rao Moparthi⁴•Institutions (4)

National Institute of Technology, Patna¹, Guru Nanak Institute of Technology², Tripura University³, K L University⁴

16 Dec 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: A novel DNA-based encryption scheme is proposed in this article for protecting multimedia files in the cloud computing environment and the efficiency of the proposed scheme over some well-known existing schemes is shown.

...read moreread less

Abstract: Today, the size of a multimedia file is increasing day by day from gigabytes to terabytes or even petabytes, mainly because of the evolution of a large amount of real-time data. As most of the multimedia files are transmitted through the internet, hackers and attackers try to access the users’ personal and confidential data without any authorization. Thus, maintaining a strong security technique has become a significant concerned to protect the personal information. Deoxyribonucleic Acid (DNA) computing is an advanced field for improving security, which is based on the biological concept of DNA. A novel DNA-based encryption scheme is proposed in this article for protecting multimedia files in the cloud computing environment. Here, a 1024-bit secret key is generated based on DNA computing and the user's attributes and password to encrypt any multimedia file. To generate the secret key, the decimal encoding rule, American Standard Code for Information Interchange value, DNA reference key, and complementary rule are used, which enable the system to protect the multimedia file against many security attacks. Experimental results, as well as theoretical analyses, show the efficiency of the proposed scheme over some well-known existing schemes.

...read moreread less

72 citations

Journal Article•DOI•

Inception U-Net Architecture for Semantic Segmentation to Identify Nuclei in Microscopy Cell Images

[...]

Narinder Singh Punn¹, Sonali Agarwal¹•Institutions (1)

Indian Institute of Information Technology, Allahabad¹

17 Feb 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: An inception U-Net architecture for automating nuclei detection in microscopy cell image analysis with increasing applications of deep learning in biomedical image analysis is introduced.

...read moreread less

Abstract: With the increasing applications of deep learning in biomedical image analysis, in this article we introduce an inception U-Net architecture for automating nuclei detection in microscopy cell images of varying size and modality to help unlock faster cures, inspired from Kaggle Data Science Bowl Challenge 2018 (KDSB18). This study follows from the fact that most of the analysis requires nuclei detection as the starting phase for getting an insight into the underlying biological process and further diagnosis. The proposed architecture consists of a switch normalization layer, convolution layers, and inception layers (concatenated 1x1, 3x3, and 5x5 convolution and the hybrid of a max and Hartley spectral pooling layer) connected in the U-Net fashion for generating the image masks. This article also illustrates the model perception of image masks using activation maximization and filter map visualization techniques. A novel objective function segmentation loss is proposed based on the binary cross entropy, dice coefficient, and intersection over union loss functions. The intersection over union score, loss value, and pixel accuracy metrics evaluate the model over the KDSB18 dataset. The proposed inception U-Net architecture exhibits quite significant results as compared to the original U-Net and recent U-Net++ architecture.

...read moreread less

65 citations

Journal Article•DOI•

Attention-Based Modality-Gated Networks for Image-Text Sentiment Analysis

[...]

Feiran Huang¹, Kaimin Wei¹, Jian Weng¹, Zhoujun Li²•Institutions (2)

Jinan University¹, Beihang University²

05 Jul 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: A novel method is proposed—Attention-Based Modality-Gated Networks (AMGN)—to exploit the correlation between the modalities of images and texts and extract the discriminative features for multimodal sentiment analysis, and demonstrates the superiority of this approach through comparison with state-of-the-art models.

...read moreread less

Abstract: Sentiment analysis of social multimedia data has attracted extensive research interest and has been applied to many tasks, such as election prediction and products evaluation. Sentiment analysis of one modality (e.g., text or image) has been broadly studied. However, not much attention has been paid to the sentiment analysis of multimodal data. Different modalities usually have information that is complementary. Thus, it is necessary to learn the overall sentiment by combining the visual content with text description. In this article, we propose a novel method—Attention-Based Modality-Gated Networks (AMGN)—to exploit the correlation between the modalities of images and texts and extract the discriminative features for multimodal sentiment analysis. Specifically, a visual-semantic attention model is proposed to learn attended visual features for each word. To effectively combine the sentiment information on the two modalities of image and text, a modality-gated LSTM is proposed to learn the multimodal features by adaptively selecting the modality that presents stronger sentiment information. Then a semantic self-attention model is proposed to automatically focus on the discriminative features for sentiment classification. Extensive experiments have been conducted on both manually annotated and machine weakly labeled datasets. The results demonstrate the superiority of our approach through comparison with state-of-the-art models.

...read moreread less

52 citations

Journal Article•DOI•

Cloud Gaming with Foveated Video Encoding

[...]

Gazi Illahi¹, Thomas van Gemert¹, Matti Siekkinen², Enrico Masala³, Antti Oulasvirta¹, Antti Ylä-Jääski¹ - Show less +2 more•Institutions (3)

Aalto University¹, University of Helsinki², Polytechnic University of Turin³

17 Feb 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: A cloud gaming FVE prototype that is game-agnostic and requires no modifications to the underlying game engine is provided and results suggest that it is possible to find a “sweet spot” for the encoding parameters so the users hardly notice the presence of foveated encoding but at the same time the scheme yields most of the achievable bandwidth savings.

...read moreread less

Abstract: Cloud gaming enables playing high-end games, originally designed for PC or game console setups, on low-end devices such as netbooks and smartphones, by offloading graphics rendering to GPU-powered cloud servers. However, transmitting the high-resolution video requires a large amount of network bandwidth, even though it is a compressed video stream. Foveated video encoding (FVE) reduces the bandwidth requirement by taking advantage of the non-uniform acuity of human visual system and by knowing where the user is looking. Based on a consumer-grade real-time eye tracker and an open source cloud gaming platform, we provide a cloud gaming FVE prototype that is game-agnostic and requires no modifications to the underlying game engine. In this article, we describe the prototype and its evaluation through measurements with representative games from different genres to understand the effect of parametrization of the FVE scheme on bandwidth requirements and to understand its feasibility from the latency perspective. We also present results from a user study on first-person shooter games. The results suggest that it is possible to find a “sweet spot” for the encoding parameters so the users hardly notice the presence of foveated encoding but at the same time the scheme yields most of the achievable bandwidth savings.

...read moreread less

42 citations

Journal Article•DOI•

Recurrent Attention Network with Reinforced Generator for Visual Dialog

[...]

Hehe Fan¹, Linchao Zhu¹, Yi Yang¹, Fei Wu²•Institutions (2)

University of Technology, Sydney¹, Zhejiang University²

05 Jul 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: A dialog network is used to memorize the temporal context and an attention processor to parse the spatial context in Visual Dialog to reinforce G at the sentence level using the discriminative model (D), which aims to select the right answer from a few candidates to ameliorate the problem.

...read moreread less

Abstract: In Visual Dialog, an agent has to parse temporal context in the dialog history and spatial context in the image to hold a meaningful dialog with humans. For example, to answer “what is the man on her left wearing?” the agent needs to (1) analyze the temporal context in the dialog history to infer who is being referred to as “her,” (2) parse the image to attend “her,” and (3) uncover the spatial context to shift the attention to “her left” and check the apparel of the man. In this article, we use a dialog network to memorize the temporal context and an attention processor to parse the spatial context. Since the question and the image are usually very complex, which makes it difficult for the question to be grounded with a single glimpse, the attention processor attends to the image multiple times to better collect visual information. In the Visual Dialog task, the generative decoder (G) is trained under the word-by-word paradigm, which suffers from the lack of sentence-level training. We propose to reinforce G at the sentence level using the discriminative model (D), which aims to select the right answer from a few candidates, to ameliorate the problem. Experimental results on the VisDial dataset demonstrate the effectiveness of our approach.

...read moreread less

Journal Article•DOI•

Data Hiding: Current Trends, Innovation and Potential Challenges

[...]

Amit Singh¹•Institutions (1)

National Institute of Technology, Patna¹

16 Dec 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: Data hiding approaches have received much attention in a number of application areas, however, those approaches unable to solve many issues that necessary to be measured in future investigations.

...read moreread less

Abstract: With the widespread growth of digital information and improved internet technologies, the demand for improved information security techniques has significantly increased due to privacy leakage, identity theft, illegal copying, and data distribution. Because of this, data hiding approaches have received much attention in several application areas. However, those approaches are unable to solve many issues that are necessary to measure in future investigations. This survey provides a comprehensive survey on data hiding techniques and their new trends for solving new challenges in real-world applications. The notable applications are telemedicine, 3D objects, mobile devices, cloud/distributed computing and data mining environments, chip and hardware protection, cyber physical systems, internet traffic, fusion of watermarking and encryption, joint compression and watermarking, biometric watermarking, watermarking at the physical layer, and many other perspectives. Further, the potential issues that existing approaches of data hiding face are identified. I believe that this survey will provide a valuable source of information for finding research directions for fledgling researchers and developers.

...read moreread less

Journal Article•DOI•

A Benchmark Dataset and Comparison Study for Multi-modal Human Action Analytics

[...]

Jiaying Liu¹, Sijie Song¹, Chunhui Liu¹, Yanghao Li¹, Yueyu Hu¹ - Show less +1 more•Institutions (1)

Peking University¹

19 May 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: In this paper, a large-scale benchmark for multi-modal human action analytics, namely the PKU Multi-Modal Dataset (PKU-MMD), is introduced.

...read moreread less

Abstract: Large-scale benchmarks provide a solid foundation for the development of action analytics. Most of the previous activity benchmarks focus on analyzing actions in RGB videos. There is a lack of large-scale and high-quality benchmarks for multi-modal action analytics. In this article, we introduce PKU Multi-Modal Dataset (PKU-MMD), a new large-scale benchmark for multi-modal human action analytics. It consists of about 28,000 action instances and 6.2 million frames in total and provides high-quality multi-modal data sources, including RGB, depth, infrared radiation (IR), and skeletons. To make PKU-MMD more practical, our dataset comprises two subsets under different settings for action understanding, namely Part I and Part II. Part I contains 1,076 untrimmed video sequences with 51 action classes performed by 66 subjects, while Part II contains 1,009 untrimmed video sequences with 41 action classes performed by 13 subjects. Compared to Part I, Part II is more challenging due to short action intervals, concurrent actions and heavy occlusion. PKU-MMD can be leveraged in two scenarios: action recognition with trimmed video clips and action detection with untrimmed video sequences. For each scenario, we provide benchmark performance on both subsets by conducting different methods with different modalities under two evaluation protocols, respectively. Experimental results show that PKU-MMD is a significant challenge to many state-of-the-art methods. We further illustrate that the features learned on PKU-MMD can be well transferred to other datasets. We believe this large-scale dataset will boost the research in the field of action analytics for the community.

...read moreread less

Journal Article•DOI•

Analysis of the Security of Internet of Multimedia Things

[...]

Zhihan Lv¹, Liang Qiao², Houbing Song³•Institutions (3)

Warsaw University of Technology¹, Qingdao University², Embry–Riddle Aeronautical University³

16 Dec 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: It can be found that ABAH can avoid the communication overhead and privacy leakage caused by the revocation list, ensure the integrity of batch verification information, meet the security performance of the vehicular ad hoc network under the Internet of Things, and protect the privacy of users from being disclosed.

...read moreread less

Abstract: To study the security performance of the Internet of multimedia things on the privacy protection of user identity, behavior trajectory, and preference under the new information technology industry wave, in this study, aiming at the problems of the sharing of Internet of things perception data and the exposure of users’ privacy information, the Anonymous Batch Authentication Scheme (ABAH) for privacy protection is designed. Hash-based Message Authentication Code is used to cancel the list-checking process and analyze its security performance. Compared with the methods of elliptic curve digital signature algorithm, Bayes least-square method, identity-based bulk verification, anonymous batch authentication and key protocol, conditional privacy authentication scheme, and expert message authentication protocol, the transmission delay, packet loss rate, and computation cost are studied without considering the undo list and during the undo check. The results show that with the increase of information size, the transmission delay and packet loss rate also increase, and the transmission delay of ABAH increases by about 15%, while the correlation between speed and transmission delay is small. In the case of the same amount of validation information, ABAH has the highest validation efficiency, and it still has an efficient validation effect in the case of invalid information. The message packet loss rate for ABAH is always 0 when the undo check validation overhead is considered. It can be found that ABAH can avoid the communication overhead and privacy leakage caused by the revocation list, ensure the integrity of batch verification information, meet the security performance of the vehicular ad hoc network under the Internet of Things, and protect the privacy of users from being disclosed.

...read moreread less

Journal Article•DOI•

EGroupNet: A Feature-enhanced Network for Age Estimation with Novel Age Group Schemes

[...]

Mingxing Duan¹, Kenli Li¹, Aijia Ouyang, Khin Nandar Win¹, Keqin Li², Qi Tian³ - Show less +2 more•Institutions (3)

Hunan University¹, State University of New York System², University of Texas at San Antonio³

19 May 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: A hierarchic approach referred to as EGroupNet for age prediction, which includes two main stages, feature enhancement via excavating the correlations among age-related attributes and age estimation based on different age group schemes, is proposed.

...read moreread less

Abstract: Although age estimation is easily affected by smiling, race, gender, and other age-related attributes, most of the researchers did not pay attention to the correlations among these attributes. Moreover, many researchers perform age estimation from a wide range of age; however, conducting an age prediction over a narrow age range may achieve better results. This article proposes a hierarchic approach referred to as EGroupNet for age prediction. The method includes two main stages, i.e., feature enhancement via excavating the correlations among age-related attributes and age estimation based on different age group schemes. First, we apply the multi-task learning model to learn multiple face attributes simultaneously to obtain discriminative features of different attributes. Second, we project the outputs of fully connected layers of several subnetworks into a highly correlated matrix space via the correlation learning process. Third, we classify these enhanced features into narrow age groups using two Extreme Learning Machine models. Finally, we make predictions based on the results of the age groups mergence. We conduct a large number of experiments on MORPH-II, LAP-2016 dataset, and Adience benchmark. The mean absolute errors of the two different settings on MORPH-II are 2.48 and 2.13 years, respectively; the normal score (e) on the LAP-2016 dataset is 0.3578; and the accuracy of age prediction on Adience benchmark is 0.6978.

...read moreread less

Journal Article•DOI•

Region-Level Visual Consistency Verification for Large-Scale Partial-Duplicate Image Search

[...]

Zhili Zhou¹, Q. M. Jonathan Wu², Yimin Yang³, Xingming Sun¹•Institutions (3)

Nanjing University of Information Science and Technology¹, University of Windsor², Lakehead University³

19 May 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: Most recent large-scale image search approaches build on a bag-of-visual-words model, in which local features are quantized and then efficiently matched between images.

...read moreread less

Abstract: Most recent large-scale image search approaches build on a bag-of-visual-words model, in which local features are quantized and then efficiently matched between images. However, the limited discriminability of local features and the BOW quantization errors cause a lot of mismatches between images, which limit search accuracy. To improve the accuracy, geometric verification is popularly adopted to identify geometrically consistent local matches for image search, but it is hard to directly use these matches to distinguish partial-duplicate images from non-partial-duplicate images. To address this issue, instead of simply identifying geometrically consistent matches, we propose a region-level visual consistency verification scheme to confirm whether there are visually consistent region (VCR) pairs between images for partial-duplicate search. Specifically, after the local feature matching, the potential VCRs are constructed via mapping the regions segmented from candidate images to a query image by utilizing the properties of the matched local features. Then, the compact gradient descriptor and convolutional neural network descriptor are extracted and matched between the potential VCRs to verify their visual consistency to determine whether they are VCRs. Moreover, two fast pruning algorithms are proposed to further improve efficiency. Extensive experiments demonstrate the proposed approach achieves higher accuracy than the state of the art and provide comparable efficiency for large-scale partial-duplicate search tasks.

...read moreread less

Journal Article•DOI•

Am I Done? Predicting Action Progress in Videos

[...]

Federico Becattini¹, Tiberio Uricchio¹, Lorenzo Seidenari¹, Lamberto Ballan², Alberto Del Bimbo¹ - Show less +1 more•Institutions (2)

University of Florence¹, University of Padua²

16 Dec 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: In this paper, the authors proposed a novel approach, named ProgressNet, capable of predicting when an action takes place in a video, where it is located within the frames, and how far it has progressed during its execution.

...read moreread less

Abstract: In this article, we deal with the problem of predicting action progress in videos. We argue that this is an extremely important task, since it can be valuable for a wide range of interaction applications. To this end, we introduce a novel approach, named ProgressNet, capable of predicting when an action takes place in a video, where it is located within the frames, and how far it has progressed during its execution. To provide a general definition of action progress, we ground our work in the linguistics literature, borrowing terms and concepts to understand which actions can be the subject of progress estimation. As a result, we define a categorization of actions and their phases. Motivated by the recent success obtained from the interaction of Convolutional and Recurrent Neural Networks, our model is based on a combination of the Faster R-CNN framework, to make framewise predictions, and LSTM networks, to estimate action progress through time. After introducing two evaluation protocols for the task at hand, we demonstrate the capability of our model to effectively predict action progress on the UCF-101 and J-HMDB datasets.

...read moreread less

Journal Article•DOI•

Random Forest with Self-Paced Bootstrap Learning in Lung Cancer Prognosis

[...]

Qingyong Wang¹, Yun Zhou¹, Weiping Ding², Zhiguo Zhang³, Khan Muhammad⁴, Zehong Cao⁵ - Show less +2 more•Institutions (5)

National University of Defense Technology¹, Nantong University², Shenzhen University³, Sejong University⁴, University of Tasmania⁵

13 Apr 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: This study proposes an ensemble learning with random forest approach to improving the model classification performance by selecting multi-classifiers and investigates the sampling strategy by gradually embedding from high- to low-quality samples by self-paced learning.

...read moreread less

Abstract: Training gene expression data with supervised learning approaches can provide an alarm sign for early treatment of lung cancer to decrease death rates. However, the samples of gene features involve lots of noises in a realistic environment. In this study, we present a random forest with self-paced learning bootstrap for improvement of lung cancer classification and prognosis based on gene expression data. To be specific, we propose an ensemble learning with random forest approach to improving the model classification performance by selecting multi-classifiers. Then, we investigate the sampling strategy by gradually embedding from high- to low-quality samples by self-paced learning. The experimental results based on five public lung cancer datasets show that our proposed method could select significant genes exactly, which improves classification performance compared to that of existing approaches. We believe that our proposed method has the potential to assist doctors in gene selections and lung cancer prognosis.

...read moreread less

Journal Article•DOI•

Blind Image Quality Assessment by Natural Scene Statistics and Perceptual Characteristics

[...]

Yutao Liu¹, Ke Gu², Xiu Li¹, Yongbing Zhang¹•Institutions (2)

Tsinghua University¹, Beijing University of Technology²

25 Aug 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: Zhang et al. as mentioned in this paper investigated the natural scene statistics (NSS) and perceptual characteristics of the human brain for visual perception and designed a set of quality-aware features to characterize the image quality effectively.

...read moreread less

Abstract: Opinion-unaware blind image quality assessment (OU BIQA) refers to establishing a blind quality prediction model without using the expensive subjective quality scores, which is a highly promising direction in the BIQA research. In this article, we focus on OU BIQA and propose a novel OU BIQA method. Specifically, in our proposed method, we deeply investigate the natural scene statistics (NSS) and the perceptual characteristics of the human brain for visual perception. Accordingly, a set of quality-aware NSS and perceptual characteristics-related features are designed to characterize the image quality effectively. For inferring the image quality, we learn a pristine multivariate Gaussian (MVG) model on a collection of pristine images, which serves as the reference information for quality evaluation. At last, the quality of a new given image is defined by measuring the divergence between its MVG model and the learned pristine MVG model. Thorough experiments performed on seven popular image databases demonstrate that the proposed OU BIQA method delivers superior performance to the state-of-the-art OU BIQA methods. The Matlab source code of the proposed method will be made publicly available at https://github.com/YT2015?tab=;repositories.

...read moreread less

Journal Article•DOI•

Active Balancing Mechanism for Imbalanced Medical Data in Deep Learning–Based Classification Models

[...]

Hongyi Zhang¹, Haoke Zhang¹, Sandeep Pirbhulal², Wanqing Wu³, Victor Hugo C. de Albuquerque⁴ - Show less +1 more•Institutions (4)

Xiamen University of Technology¹, Chinese Academy of Sciences², Sun Yat-sen University³, University of Fortaleza⁴

12 Mar 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: A large number of under-sampling techniques, particularly those that consume more time and suffer from loss of samples containing critical information during imbala...

...read moreread less

Abstract: Imbalanced data always has a serious impact on a predictive model, and most under-sampling techniques consume more time and suffer from loss of samples containing critical information during imbalanced data processing, especially in the biomedical field. To solve these problems, we developed an active balancing mechanism (ABM) based on valuable information contained in the biomedical data. ABM adopts the Gaussian naive Bayes method to estimate the object samples and entropy as a query function to evaluate sample information and only retains valuable samples of the majority class to achieve under-sampling. The Physikalisch Technische Bundesanstalt diagnostic electrocardiogram (ECG) database, including 5,173 normal ECG samples and 26,654 myocardial infarction ECG samples, is applied to verify the validity of ABM. At imbalance rates of 13 and 5, experimental results reveal that ABM takes 7.7 seconds and 13.2 seconds, respectively. Both results are significantly faster than five conventional under-sampling methods. In addition, at the imbalance rate of 13, ABM-based data obtained the highest accuracy of 92.23% and 97.52% using support vector machines and modified convolutional neural networks (MCNNs) with eight layers, respectively. At the imbalance rate of 5, the processed data by ABM also achieved the best accuracy of 92.31% and 98.46% based on support vector machines and MCNNs, respectively. Furthermore, ABM has better performance than two compared methods in F1-measure, G-means, and area under the curve. Consequently, ABM could be a useful and effective approach to deal with imbalanced data in general, particularly biomedical myocardial infarction ECG datasets, and the MCNN can also achieve higher performance compared to the state of the art.

...read moreread less

Journal Article•DOI•

Human Activity Recognition from Multiple Sensors Data Using Multi-fusion Representations and CNNs

[...]

Farzan Majeed Noori¹, Michael Riegler, Zia Uddin¹, Jim Torresen¹•Institutions (1)

University of Oslo¹

19 May 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: With the emerging interest in the ubiquitous sensing field, it has become possible to build assistive technologies for persons during their daily life activities to provide personalized feedback an...

...read moreread less

Abstract: With the emerging interest in the ubiquitous sensing field, it has become possible to build assistive technologies for persons during their daily life activities to provide personalized feedback and services. For instance, it is possible to detect an individual’s behavioral pattern (e.g., physical activity, location, and mood) by using sensors embedded in smart-watches and smartphones. The multi-sensor environments also come with some challenges, such as how to fuse and combine different sources of data. In this article, we explore several methods of fusion for multi-representations of data from sensors. Furthermore, multiple representations of sensor data were generated and then fused using data-level, feature-level, and decision-level fusions. The presented methods were evaluated using three publicly available human activity recognition (HAR) datasets. The presented approaches utilize Deep Convolutional Neural Networks (CNNs). A generic architecture for fusion of different sensors is proposed. The proposed method shows promising performance, with the best results reaching an overall accuracy of 98.4% for the Context-Awareness via Wrist-Worn Motion Sensors (HANDY) dataset and 98.7% for the Wireless Sensor Data Mining (WISDM version 1.1) dataset. Both results outperform previous approaches.

...read moreread less

Journal Article•DOI•

Privacy Protection for Medical Data Sharing in Smart Healthcare

[...]

Liming Fang¹, Changchun Yin¹, Juncen Zhu¹, Chunpeng Ge¹, Muhammad Tanveer², Alireza Jolfaei³, Zehong Cao⁴ - Show less +3 more•Institutions (4)

Nanjing University of Aeronautics and Astronautics¹, Indian Institute of Technology Indore², Macquarie University³, University of Tasmania⁴

16 Dec 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: A new data-sharing framework and a data access control mechanism is proposed to support the selective sharing of electronic medical records from different medical institutions between different doctors and ensures that privacy concerns are taken into account when processing requests for access to patients’ medical information.

...read moreread less

Abstract: In virtue of advances in smart networks and the cloud computing paradigm, smart healthcare is transforming. However, there are still challenges, such as storing sensitive data in untrusted and controlled infrastructure and ensuring the secure transmission of medical data, among others. The rapid development of watermarking provides opportunities for smart healthcare. In this article, we propose a new data-sharing framework and a data access control mechanism. The applications are submitted by the doctors, and the data is processed in the medical data center of the hospital, stored in semi-trusted servers to support the selective sharing of electronic medical records from different medical institutions between different doctors. Our approach ensures that privacy concerns are taken into account when processing requests for access to patients’ medical information. For accountability, after data is modified or leaked, both patients and doctors must add digital watermarks associated with their identification when uploading data. Extensive analytical and experimental results are presented that show the security and efficiency of our proposed scheme.

...read moreread less

Journal Article•DOI•

Few-shot Food Recognition via Multi-view Representation Learning

[...]

Shuqiang Jiang¹, Weiqing Min¹, Lyu Yongqiang, Linhu Liu¹•Institutions (1)

Chinese Academy of Sciences¹

14 Jul 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: A Multi-View Few-Shot Learning (MVFSL) framework to explore additional ingredient information for few-shot food recognition, and extends another two types of networks, namely, Siamese Network and Matching Network, by introducing ingredient information.

...read moreread less

Abstract: This article considers the problem of few-shot learning for food recognition. Automatic food recognition can support various applications, e.g., dietary assessment and food journaling. Most existing works focus on food recognition with large numbers of labelled samples, and fail to recognize food categories with few samples. To address this problem, we propose a Multi-View Few-Shot Learning (MVFSL) framework to explore additional ingredient information for few-shot food recognition. Besides category-oriented deep visual features, we introduce ingredient-supervised deep network to extract ingredient-oriented features. As general and intermediate attributes of food, ingredient-oriented features are informative and complementary to category-oriented features, and thus they play an important role in improving food recognition. Particularly in few-shot food recognition, ingredient information can bridge the gap between disjoint training categories and test categories. To take advantage of ingredient information, we fuse these two kinds of features by first combining their feature maps from their respective deep networks and then convolving combined feature maps. Such convolution is further incorporated into a multi-view relation network, which is capable of comparing pairwise images to enable fine-grained feature learning. MVFSL is trained in an end-to-end fashion for joint optimization on two types of feature learning subnetworks and relation subnetworks. Extensive experiments on different food datasets have consistently demonstrated the advantage of MVFSL in multi-view feature fusion. Furthermore, we extend another two types of networks, namely, Siamese Network and Matching Network, by introducing ingredient information for few-shot food recognition. Experimental results have also shown that introducing ingredient information into these two networks can improve the performance of few-shot food recognition.

...read moreread less

Journal Article•DOI•

Shuffled ImageNet Banks for Video Event Detection and Search

[...]

Pascal Mettes¹, Dennis C. Koelma¹, Cees G. M. Snoek¹•Institutions (1)

University of Amsterdam¹

19 May 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: The proposed bank is a concept bank with an order of magnitude more concepts compared to standard ImageNet banks for event detection and event search, with state-of-the-art performance on the challenging TRECVID Multimedia Event Detection and Ad-Hoc Video Search benchmarks.

...read moreread less

Abstract: This article aims for the detection and search of events in videos, where video examples are either scarce or even absent during training. To enable such event detection and search, ImageNet concept banks have shown to be effective. Rather than employing the standard concept bank of 1,000 ImageNet classes, we leverage the full 21,841-class dataset. We identify two problems with using the full dataset: (i) there is an imbalance between the number of examples per concept, and (ii) not all concepts are equally relevant for events. In this article, we propose to balance large-scale image hierarchies for pre-training. We shuffle concepts based on bottom-up and top-down operations to overcome the problems of example imbalance and concept relevance. Using this strategy, we arrive at the shuffled ImageNet bank, a concept bank with an order of magnitude more concepts compared to standard ImageNet banks. Compared to standard ImageNet pre-training, our shuffles result in more discriminative representations to train event models from the limited video event examples. For event search, the broad range of concepts enable a closer match between textual queries of events and concept detections in videos. Experimentally, we show the benefit of the proposed bank for event detection and event search, with state-of-the-art performance for both tasks on the challenging TRECVID Multimedia Event Detection and Ad-Hoc Video Search benchmarks.

...read moreread less

Journal Article•DOI•

Deep Triplet Neural Networks with Cluster-CCA for Audio-Visual Cross-Modal Retrieval

[...]

Donghuo Zeng¹, Yi Yu¹, Keizo Oyama¹•Institutions (1)

National Institute of Informatics¹

14 Jul 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: A better representation by constructing a deep triplet neural network with triplet loss for optimal projections can be generated to maximize correlation in the shared subspace and positive examples and negative examples are used in the learning stage to improve the capability of embedding learning between audio and video.

...read moreread less

Abstract: Cross-modal retrieval aims to retrieve data in one modality by a query in another modality, which has been a very interesting research issue in the field of multimedia, information retrieval, and computer vision, and database. Most existing works focus on cross-modal retrieval between text-image, text-video, and lyrics-audio. Little research addresses cross-modal retrieval between audio and video due to limited audio-video paired datasets and semantic information. The main challenge of the audio-visual cross-modal retrieval task focuses on learning joint embeddings from a shared subspace for computing the similarity across different modalities, where generating new representations is to maximize the correlation between audio and visual modalities space. In this work, we propose TNN-C-CCA, a novel deep triplet neural network with cluster canonical correlation analysis, which is an end-to-end supervised learning architecture with an audio branch and a video branch. We not only consider the matching pairs in the common space but also compute the mismatching pairs when maximizing the correlation. In particular, two significant contributions are made. First, a better representation by constructing a deep triplet neural network with triplet loss for optimal projections can be generated to maximize correlation in the shared subspace. Second, positive examples and negative examples are used in the learning stage to improve the capability of embedding learning between audio and video. Our experiment is run over fivefold cross validation, where average performance is applied to demonstrate the performance of audio-video cross-modal retrieval. The experimental results achieved on two different audio-visual datasets show that the proposed learning architecture with two branches outperforms existing six canonical correlation analysis–based methods and four state-of-the-art-based cross-modal retrieval methods.

...read moreread less

Journal Article•DOI•

Hybrid Wolf-Bat Algorithm for Optimization of Connection Weights in Multi-layer Perceptron

[...]

Utkarsh Agrawal¹, Jatin Arora¹, Rahul Singh¹, Deepak Gupta¹, Ashish Khanna¹, Aditya Khamparia² - Show less +2 more•Institutions (2)

Maharaja Agrasen Institute of Technology¹, Lovely Professional University²

13 Apr 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: In a neural network, the weights act as parameters to determine the output(s) from a set of inputs to find the activation values of nodes of a layer from the values of the pre-existing nodes.

...read moreread less

Abstract: In a neural network, the weights act as parameters to determine the output(s) from a set of inputs. The weights are used to find the activation values of nodes of a layer from the values of the previous layer. Finding the ideal set of these weights for training a Multi-layer Perceptron neural network such that it minimizes the classification error is a widely known optimization problem. The presented article proposes a Hybrid Wolf-Bat algorithm, a novel optimization algorithm, as a solution to solve the discussed problem. The proposed algorithm is a hybrid of two already existing nature-inspired algorithms, Grey Wolf Optimization algorithm and Bat algorithm. The novel introduced approach is tested on ten different datasets of the medical field, obtained from the UCI machine learning repository. The performance of the proposed algorithm is compared with the recently developed nature-inspired algorithms: Grey Wolf Optimization algorithm, Cuckoo Search, Bat Algorithm, and Whale Optimization Algorithm, along with the standard Back-propagation training method available in the literature. The obtained results demonstrate that the proposed method outperforms other bio-inspired algorithms in terms of both speed of convergence and accuracy.

...read moreread less

Journal Article•DOI•

Constrained LSTM and Residual Attention for Image Captioning

[...]

Liang Yang¹, Haifeng Hu¹, Songlong Xing¹, Xinlong Lu¹•Institutions (1)

Sun Yat-sen University¹

05 Jul 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: A model is proposed that aligns the language model to certain visual structure and also constrains it with a specific part-of-speech template and develops a residual attention mechanism to simultaneously focus on the pre-extracted visual objects and unextracted regions in an image.

...read moreread less

Abstract: Visual structure and syntactic structure are essential in images and texts, respectively. Visual structure depicts both entities in an image and their interactions, whereas syntactic structure in texts can reflect the part-of-speech constraints between adjacent words. Most existing methods either use visual global representation to guide the language model or generate captions without considering the relationships of different entities or adjacent words. Thus, their language models lack relevance in both visual and syntactic structure. To solve this problem, we propose a model that aligns the language model to certain visual structure and also constrains it with a specific part-of-speech template. In addition, most methods exploit the latent relationship between words in a sentence and pre-extracted visual regions in an image yet ignore the effects of unextracted regions on predicted words. We develop a residual attention mechanism to simultaneously focus on the pre-extracted visual objects and unextracted regions in an image. Residual attention is capable of capturing precise regions of an image corresponding to the predicted words considering both the effects of visual objects and unextracted regions. The effectiveness of our entire framework and each proposed module are verified on two classical datasets: MSCOCO and Flickr30k. Our framework is on par with or even better than the state-of-the-art methods and achieves superior performance on COCO captioning Leaderboard.

...read moreread less

Journal Article•DOI•

SDN-Assisted DDoS Defense Framework for the Internet of Multimedia Things

[...]

Kshira Sagar Sahoo, Deepak Puthal¹•Institutions (1)

Newcastle University¹

16 Dec 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: The Internet of things (IoT) is visualized as a fundamental networking model which bridge the gap between the cyber and real-world entity.

...read moreread less

Abstract: The Internet of Things is visualized as a fundamental networking model that bridges the gap between the cyber and real-world entity. Uniting the real-world object with virtualization technology is opening further opportunities for innovation in nearly every individual’s life. Moreover, the usage of smart heterogeneous multimedia devices is growing extensively. These multimedia devices that communicate among each other through the Internet form a unique paradigm called the Internet of Multimedia Things (IoMT). As the volume of the collected data in multimedia application increases, the security, reliability of communications, and overall quality of service need to be maintained. Primarily, distributed denial of service attacks unveil the pervasiveness of vulnerabilities in IoMT systems. However, the Software Defined Network (SDN) is a new network architecture that has the central visibility of the entire network, which helps to detect any attack effectively. In this regard, the combination of SDN and IoMT, termed SD-IoMT, has the immense ability to improve the network management and security capabilities of the IoT system. This article proposes an SDN-assisted two-phase detection framework, namely SD-IoMT-Protector, in which the first phase utilizes the entropy technique as the detection metric to verify and alert about the malicious traffic. The second phase has trained with an optimized machine learning technique for classifying different attacks. The outcomes of the experimental results signify the usefulness and effectiveness of the proposed framework for addressing distributed denial of service issues of the SD-IoMT system.

...read moreread less

Journal Article•DOI•

Learning Shared Semantic Space with Correlation Alignment for Cross-Modal Event Retrieval

[...]

Zhenguo Yang¹, Zehang Lin², Peipei Kang², Jianming Lv³, Qing Li⁴, Wenyin Liu² - Show less +2 more•Institutions (4)

City University of Hong Kong¹, Guangdong University of Technology², South China University of Technology³, Hong Kong Polytechnic University⁴

17 Feb 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: The effectiveness of S3CA, which aligns nonlinear correlations of multimodal data distributions in deep neural networks designed for heterogeneous data, is outperforming the state-of-the-art methods.

...read moreread less

Abstract: In this article, we propose to learn shared semantic space with correlation alignment (S3CA) for multimodal data representations, which aligns nonlinear correlations of multimodal data distributions in deep neural networks designed for heterogeneous data. In the context of cross-modal (event) retrieval, we design a neural network with convolutional layers and fully connected layers to extract features for images, including images on Flickr-like social media. Simultaneously, we exploit a fully connected neural network to extract semantic features for text documents, including news articles from news media. In particular, nonlinear correlations of layer activations in the two neural networks are aligned with correlation alignment during the joint training of the networks. Furthermore, we project the multimodal data into a shared semantic space for cross-modal (event) retrieval, where the distances between heterogeneous data samples can be measured directly. In addition, we contribute a Wiki-Flickr Event dataset, where the multimodal data samples are not describing each other in pairs like the existing paired datasets, but all of them are describing semantic events. Extensive experiments conducted on both paired and unpaired datasets manifest the effectiveness of S3CA, outperforming the state-of-the-art methods.

...read moreread less

Journal Article•DOI•

Multichannel Attention Refinement for Video Question Answering

[...]

Yueting Zhuang¹, Dejing Xu¹, Xin Yan¹, Wenzhuo Cheng¹, Zhou Zhao¹, Shiliang Pu, Jun Xiao¹ - Show less +3 more•Institutions (1)

Zhejiang University¹

12 Mar 2020-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: Video Question Answering is the extension of image question answering (ImageQA) in the video domain and methods are required to give the correct answer after analyzing the provided video.

...read moreread less

Abstract: Video Question Answering (VideoQA) is the extension of image question answering (ImageQA) in the video domain. Methods are required to give the correct answer after analyzing the provided video and question in this task. Comparing to ImageQA, the most distinctive part is the media type. Both tasks require the understanding of visual media, but VideoQA is much more challenging, mainly because of the complexity and diversity of videos. Particularly, working with the video needs to model its inherent temporal structure and analyze the diverse information it contains. In this article, we propose to tackle the task from a multichannel perspective. Appearance, motion, and audio features are extracted from the video, and question-guided attentions are refined to generate the expressive clues that support the correct answer. We also incorporate the relevant text information acquired from Wikipedia as an attempt to extend the capability of the method. Experiments on TGIF-QA and ActivityNet-QA datasets show the advantages of our method compared to existing methods. We also demonstrate the effectiveness and interpretability of our method by analyzing the refined attention weights during the question-answering procedure.

...read moreread less