
Showing papers in "ACM Transactions on Multimedia Computing, Communications, and Applications in 2018"


Journal ArticleDOI
TL;DR: A progressive unsupervised learning (PUL) method to transfer pretrained deep representations to unseen domains and demonstrates that PUL outputs discriminative features that improve the re-ID accuracy.
Abstract: The superiority of deeply learned pedestrian representations has been reported in the very recent literature on person re-identification (re-ID). In this article, we consider the more pragmatic issue of learning a deep feature with no or only a few labels. We propose a progressive unsupervised learning (PUL) method to transfer pretrained deep representations to unseen domains. Our method is easy to implement and can be viewed as an effective baseline for unsupervised re-ID feature learning. Specifically, PUL iterates between (1) pedestrian clustering and (2) fine-tuning of the convolutional neural network (CNN) to improve the initialization model trained on an irrelevant labeled dataset. Since the clustering results can be very noisy, we add a selection operation between the clustering and fine-tuning. At the beginning, when the model is weak, the CNN is fine-tuned on a small number of reliable examples located near the cluster centroids in the feature space. As the model becomes stronger, more images are adaptively selected as CNN training samples in subsequent iterations. Progressively, pedestrian clustering and the CNN model improve simultaneously until convergence. This process is naturally formulated as self-paced learning. We then point out promising directions that may lead to further improvement. Extensive experiments on three large-scale re-ID datasets demonstrate that PUL outputs discriminative features that improve the re-ID accuracy. Our code has been released at https://github.com/hehefan/Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning.
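
For concreteness, here is a minimal sketch of one clustering / selection / fine-tuning round in the spirit of PUL, not the authors' released code: extract_features and fine_tune are hypothetical placeholders for the CNN, and in the actual self-paced scheme the keep ratio grows across iterations.

```python
import numpy as np
from sklearn.cluster import KMeans

def pul_iteration(images, extract_features, fine_tune, n_ids=100, keep_ratio=0.5):
    """One clustering / selection / fine-tuning round (images: array indexable by mask)."""
    feats = extract_features(images)                      # (N, D) CNN embeddings
    km = KMeans(n_clusters=n_ids, n_init=10).fit(feats)   # pedestrian clustering
    dists = np.linalg.norm(feats - km.cluster_centers_[km.labels_], axis=1)
    thresh = np.quantile(dists, keep_ratio)               # self-paced selection threshold
    reliable = dists <= thresh                            # keep samples near their centroid
    fine_tune(images[reliable], km.labels_[reliable])     # fine-tune CNN on pseudo-labels
    return reliable.mean()                                # fraction of samples actually used
```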

488 citations


Journal ArticleDOI
TL;DR: It is demonstrated that Bi-LSTM models achieve highly competitive performance on both caption generation and image-sentence retrieval even without integrating an additional mechanism (e.g., object detection, attention model), and that multi-task learning is beneficial for increasing model generality and performance.
Abstract: Generating a novel and descriptive caption of an image is drawing increasing interest in the computer vision, natural language processing, and multimedia communities. In this work, we propose an end-to-end trainable deep bidirectional LSTM (Bi-LSTM (Long Short-Term Memory)) model to address the problem. By combining a deep convolutional neural network (CNN) and two separate LSTM networks, our model is capable of learning long-term visual-language interactions by making use of history and future context information in a high-level semantic space. We also explore deep multimodal bidirectional models, in which we increase the depth of the nonlinearity transition in different ways to learn hierarchical visual-language embeddings. Data augmentation techniques such as multi-crop, multi-scale, and vertical mirror are proposed to prevent overfitting when training deep models. To understand how our models “translate” an image to a sentence, we visualize and qualitatively analyze the evolution of Bi-LSTM internal states over time. The effectiveness and generality of the proposed models are evaluated on four benchmark datasets: Flickr8K, Flickr30K, MSCOCO, and Pascal1K. We demonstrate that Bi-LSTM models achieve highly competitive performance on both caption generation and image-sentence retrieval even without integrating an additional mechanism (e.g., object detection, attention model). Our experiments also show that multi-task learning is beneficial for increasing model generality and performance. We also demonstrate that transfer learning with the Bi-LSTM model significantly outperforms previous methods on the Pascal1K dataset.
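
The following is a minimal sketch of a CNN-plus-bidirectional-LSTM captioner of the kind described, assuming precomputed CNN image features; the layer sizes and the way the image feature is fed in are illustrative choices, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class BiLSTMCaptioner(nn.Module):
    def __init__(self, vocab_size, img_dim=2048, embed_dim=512, hidden=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)   # map CNN feature to word space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, vocab_size)    # fuse forward/backward states

    def forward(self, img_feat, captions):
        # prepend the projected image feature as the first "token" of the sequence
        img_tok = self.img_proj(img_feat).unsqueeze(1)          # (B, 1, embed_dim)
        x = torch.cat([img_tok, self.embed(captions)], dim=1)   # (B, T+1, embed_dim)
        h, _ = self.bilstm(x)                                   # history + future context
        return self.out(h)                                      # per-step vocabulary logits
```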

113 citations


Journal ArticleDOI
TL;DR: This work proposes an image captioning approach in which a generative recurrent neural network can focus on different parts of the input image during the generation of the caption, by exploiting the conditioning given by a saliency prediction model on which parts of the image are salient and which are contextual.
Abstract: Image captioning has been recently gaining a lot of attention thanks to the impressive achievements shown by deep captioning architectures, which combine Convolutional Neural Networks to extract image representations and Recurrent Neural Networks to generate the corresponding captions. At the same time, a significant research effort has been dedicated to the development of saliency prediction models, which can predict human eye fixations. Even though saliency information could be useful to condition an image captioning architecture, by providing an indication of what is salient and what is not, research is still struggling to incorporate these two techniques. In this work, we propose an image captioning approach in which a generative recurrent neural network can focus on different parts of the input image during the generation of the caption, by exploiting the conditioning given by a saliency prediction model on which parts of the image are salient and which are contextual. We show, through extensive quantitative and qualitative experiments on large-scale datasets, that our model achieves superior performance with respect to captioning baselines with and without saliency and to different state-of-the-art approaches combining saliency and captioning.

90 citations


Journal ArticleDOI
TL;DR: QoE management is addressed in the context of ongoing developments, such as the move to softwarized networks, the exploitation of big data analytics and machine learning, and the steady rise of new and immersive services.
Abstract: Quality of Experience (QoE) has received much attention over the past years and has become a prominent issue for delivering services and applications. A significant amount of research has been devoted to understanding, measuring, and modelling QoE for a variety of media services. The next logical step is to actively exploit that accumulated knowledge to improve and manage the quality of multimedia services, while at the same time ensuring efficient and cost-effective network operations. Moreover, with many different players involved in the end-to-end service delivery chain, identifying the root causes of QoE impairments and finding effective solutions for meeting the end users’ requirements and expectations in terms of service quality is a challenging and complex problem. In this article, we survey state-of-the-art findings and present emerging concepts and challenges related to managing QoE for networked multimedia services. Going beyond a number of previously published survey articles addressing the topic of QoE management, we address QoE management in the context of ongoing developments, such as the move to softwarized networks, the exploitation of big data analytics and machine learning, and the steady rise of new and immersive services (e.g., augmented and virtual reality). We address the implications of such paradigm shifts in terms of new approaches in QoE modeling and the need for novel QoE monitoring and management infrastructures.

88 citations


Journal ArticleDOI
TL;DR: An adaptive fractional-pixel ME skipping scheme is proposed for low-complexity HEVC ME, which reduces ME encoding time by an average of 63.22% while coding efficiency is maintained.
Abstract: High-Efficiency Video Coding (HEVC) efficiently addresses the storage and transmission problems of high-definition videos, especially 4K videos. Variable-size Prediction Unit (PU)-based Motion Estimation (ME) contributes a significant compression rate to the HEVC encoder but also generates a huge computation load. This high encoding complexity prevents widespread adoption of the HEVC encoder in multimedia systems. In this article, an adaptive fractional-pixel ME skipping scheme is proposed for low-complexity HEVC ME. First, based on the properties of the variable-size PU-based ME process and the content-partition relationship among variable-size PUs, all inter-PU modes during the encoding of a coding unit are classified into a root-type PU mode and children-type PU modes. Then, according to the ME result of the root-type PU mode, the fractional-pixel ME of its children-type PU modes is adaptively skipped. Simulation results show that, compared to the original ME in the HEVC reference software, the proposed algorithm reduces ME encoding time by an average of 63.22% while coding efficiency is maintained.

66 citations


Journal ArticleDOI
TL;DR: A spatial-temporal network architecture based on consensus-voting has been proposed to explicitly model the long-term structure of the video sequence and to reduce estimation variance when confronted with comprehensive inter-class variations.
Abstract: In this article, we focus on isolated gesture recognition and explore different modalities by involving RGB, depth, and saliency streams for inspection. Our goal is to push the boundary of this realm even further by proposing a unified framework that exploits the advantages of multi-modality fusion. Specifically, a spatial-temporal network architecture based on consensus-voting has been proposed to explicitly model the long-term structure of the video sequence and to reduce estimation variance when confronted with comprehensive inter-class variations. In addition, a three-dimensional depth-saliency convolutional network is aggregated in parallel to capture subtle motion characteristics. Extensive experiments are conducted to analyze the performance of each component, and our proposed approach achieves the best results on two public benchmarks, ChaLearn IsoGD and RGBD-HuDaAct, outperforming the closest competitor by margins of over 10% and 15%, respectively. Our project and code will be released at https://davidsonic.github.io/index/acm_tomm_2017.html.
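
As a minimal illustration of the consensus-voting idea, assuming per-segment class scores from a hypothetical segment network:

```python
import numpy as np

def consensus_vote(segment_scores):
    # segment_scores: (n_segments, n_classes) softmax outputs from the segment model
    clip_score = np.mean(segment_scores, axis=0)     # averaging reduces estimation variance
    return int(np.argmax(clip_score)), clip_score    # clip-level label and fused scores
```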

64 citations


Journal ArticleDOI
TL;DR: This article presents baselines for the artistic domain with a new benchmark dataset, dubbed OmniArt, featuring over 2 million images with rich structured metadata, and establishes baseline scores on multiple tasks such as artist attribution, creation period estimation, and type, style, and school prediction.
Abstract: Baselines are the starting point of any quantitative multimedia research, and benchmarks are essential for pushing those baselines further. In this article, we present baselines for the artistic domain with a new benchmark dataset featuring over 2 million images with rich structured metadata dubbed OmniArt. OmniArt contains annotations for dozens of attribute types and features semantic context information through concepts, IconClass labels, color information, and (limited) object-level bounding boxes. For our dataset we establish and present baseline scores on multiple tasks such as artist attribution, creation period estimation, type, style, and school prediction. In addition to our metadata related experiments, we explore the color spaces of art through different types and evaluate a transfer learning object recognition pipeline.

59 citations


Journal ArticleDOI
TL;DR: Experimental results demonstrate that the proposed approach outperforms traditional multiple classifier solutions based on uniform weighting, and outperforms recent state-of-the-art approaches.
Abstract: In this article, we address the problem of recognizing an event from a single related picture. Given the large number of event classes and the limited information contained in a single shot, the problem is known to be particularly hard. To achieve a reliable detection, we propose a combination of multiple classifiers, and we compare three alternative strategies to fuse the results of each classifier, namely: (i) induced order weighted averaging operators, (ii) genetic algorithms, and (iii) particle swarm optimization. Each method is aimed at determining the optimal weights to be assigned to the decision scores yielded by different deep models, according to the relevant optimization strategy. Experimental tests have been performed on three event recognition datasets, evaluating the performance of various deep models, both alone and selectively combined. Experimental results demonstrate that the proposed approach outperforms traditional multiple classifier solutions based on uniform weighting, and outperforms recent state-of-the-art approaches.
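
A rough sketch of the weighted score-fusion step follows; SciPy's differential evolution is used here as a stand-in for the GA/PSO/ordered-weighting optimizers compared in the paper, so treat it as an illustration of the objective rather than the authors' procedure.

```python
import numpy as np
from scipy.optimize import differential_evolution

def fuse(scores, weights):
    # scores: list of (N, C) per-model decision-score matrices; weights: one weight per model
    w = np.abs(weights) / (np.abs(weights).sum() + 1e-12)   # normalize to a convex combination
    return sum(wi * s for wi, s in zip(w, scores))

def fit_weights(scores, labels):
    def neg_acc(w):                                          # objective: validation accuracy
        pred = fuse(scores, w).argmax(axis=1)
        return -(pred == labels).mean()
    res = differential_evolution(neg_acc, bounds=[(0, 1)] * len(scores), seed=0)
    return res.x                                             # learned fusion weights
```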

45 citations


Journal ArticleDOI
TL;DR: OpenFace, as discussed by the authors, is an open-source face recognition system that approaches state-of-the-art accuracy; integrated with interframe tracking, it enables a mechanism for denaturing video streams that selectively blurs faces according to specified policies at full frame rates.
Abstract: We show how to build the components of a privacy-aware, live video analytics ecosystem from the bottom up, starting with OpenFace, our new open-source face recognition system that approaches state-of-the-art accuracy. Integrating OpenFace with interframe tracking, we build RTFace, a mechanism for denaturing video streams that selectively blurs faces according to specified policies at full frame rates. This enables privacy management for live video analytics while providing a secure approach for handling retrospective policy exceptions. Finally, we present a scalable, privacy-aware architecture for large camera networks using RTFace and show how it can be an enabler for a vibrant ecosystem and marketplace of privacy-aware video streams and analytics services.
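
A toy sketch of policy-based face denaturing in the spirit of RTFace is shown below; it uses OpenCV's stock Haar face detector as a stand-in for OpenFace recognition plus interframe tracking, and the allow callback is a hypothetical policy hook.

```python
import cv2

_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def denature(frame, allow=lambda face_box: False):
    """Blur every detected face unless the policy callback allows it to stay visible."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in _detector.detectMultiScale(gray, 1.1, 5):
        if not allow((x, y, w, h)):            # policy decides which faces remain
            frame[y:y+h, x:x+w] = cv2.GaussianBlur(frame[y:y+h, x:x+w], (51, 51), 0)
    return frame
```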

42 citations


Journal ArticleDOI
TL;DR: A taxonomy of representation methods is proposed, distinguishing between spatial and temporal modeling of the data, and the survey focuses on the analysis and recognition of 3D humans from static and dynamic 3D data.
Abstract: Computer Vision and Multimedia solutions are now offering an increasing number of applications ready for use by end users in everyday life. Many of these applications are centered on the detection, representation, and analysis of the face and body. Methods based on 2D images and videos are the most widespread, but there is a recent trend that successfully extends the study to 3D human data as acquired by a new generation of 3D acquisition devices. Based on these premises, in this survey, we provide an overview of the newly designed techniques that exploit 3D human data and also outline the most promising current and future research directions. In particular, we first propose a taxonomy of the representation methods, distinguishing between spatial and temporal modeling of the data. Then, we focus on the analysis and recognition of 3D humans from 3D static and dynamic data, considering many applications for body and face.

39 citations


Journal ArticleDOI
TL;DR: This article proposes a novel and effective approach to FER using multi-modal 2D and 3D videos, which encodes both static and dynamic clues via a scattering convolution network, and adopts Multiple Kernel Learning to combine the features in the 2D and 3D modalities and compute similarities to predict the expression label.
Abstract: Facial Expression Recognition (FER) is one of the most important topics in the domain of computer vision and pattern recognition, and it has attracted increasing attention for its scientific challenges and application potential. In this article, we propose a novel and effective approach to FER using multi-modal two-dimensional (2D) and 3D videos, which encodes both static and dynamic clues via a scattering convolution network. First, a shape-based detection method is introduced to locate the start and the end of an expression in videos; segment its onset, apex, and offset states; and sample the important frames for emotion analysis. Second, the frames in the apex state of 2D videos are represented by scattering, conveying static texture details. Those of 3D videos are processed in a similar way, but to highlight static shape details, several geometric maps in terms of multiple-order differential quantities, i.e., Normal Maps and Shape Index Maps, are generated as the input of scattering, instead of the original smooth facial surfaces. Third, the average of neighboring samples centred at each key texture frame or shape map in the onset state is computed, and the scattering features extracted from all the average samples of 2D and 3D videos are then concatenated to capture dynamic texture and shape cues, respectively. Finally, Multiple Kernel Learning is adopted to combine the features in the 2D and 3D modalities and compute similarities to predict the expression label. Thanks to the scattering descriptor, the proposed approach not only encodes distinct local texture and shape variations of different expressions, as several milestone operators such as SIFT and HOG do, but also captures subtle information hidden in high frequencies in both channels, which is quite crucial for better distinguishing expressions that are easily confused. The validation is conducted on the BU-4DFE and BP-4D databases, and the accuracies reached are very competitive, indicating its competence on this task.

Journal ArticleDOI
TL;DR: A survey is presented of research works on adaptive video streaming, together with a classification based on where the optimization takes place, which goes beyond client-based heuristics to investigate the usage of server- and network-assisted architectures and of new application and transport layer protocols.
Abstract: Video streaming applications currently dominate Internet traffic. Particularly, HTTP Adaptive Streaming (HAS) has emerged as the dominant standard for streaming videos over the best-effort Internet, thanks to its capability of matching the video quality to the available network resources. In HAS, the video client is equipped with a heuristic that dynamically decides the most suitable quality to stream the content, based on information such as the perceived network bandwidth or the video player buffer status. The goal of this heuristic is to optimize the quality as perceived by the user, the so-called Quality of Experience (QoE). Despite the many advantages brought by the adaptive streaming principle, optimizing users’ QoE is far from trivial. Current heuristics are still suboptimal when sudden bandwidth drops occur, especially in wireless environments, thus leading to freezes in the video playout, the main factor influencing users’ QoE. This issue is aggravated in case of live events, where the player buffer has to be kept as small as possible in order to reduce the playout delay between the user and the live signal. In light of the above, in recent years, several works have been proposed with the aim of extending the classical purely client-based structure of adaptive video streaming, in order to fully optimize users’ QoE. In this article, a survey is presented of research works on this topic together with a classification based on where the optimization takes place. This classification goes beyond client-based heuristics to investigate the usage of server- and network-assisted architectures and of new application and transport layer protocols. In addition, we outline the major challenges currently arising in the field of multimedia delivery, which are going to be of extreme relevance in future years.
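
To make the client-side heuristic concrete, here is a toy throughput-and-buffer-based bitrate selector; it is illustrative only and does not correspond to any specific algorithm surveyed in the article.

```python
def select_bitrate(ladder_kbps, throughput_kbps, buffer_s, safety=0.8, low_buffer_s=5.0):
    """Pick the highest representation sustainable by the measured throughput,
    backing off when the playout buffer runs low to avoid freezes."""
    budget = throughput_kbps * safety
    if buffer_s < low_buffer_s:                  # low buffer: be conservative
        budget *= buffer_s / low_buffer_s
    feasible = [b for b in sorted(ladder_kbps) if b <= budget]
    return feasible[-1] if feasible else min(ladder_kbps)

# Example: a 4000 kbps link with a 3 s buffer drops to a safe low rendition.
print(select_bitrate([500, 1500, 3000, 6000], throughput_kbps=4000, buffer_s=3.0))
```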

Journal ArticleDOI
TL;DR: The recent advances on delay-sensitive video computations in the cloud, which are crucial to cloud-assisted conversational video services, such as cloud gaming, Virtual Reality (VR), Augmented Reality (AR), and telepresence are surveyed.
Abstract: While cloud servers provide a tremendous amount of resources for networked video applications, most successful stories of cloud-assisted video applications are presentational video services, such as YouTube and NetFlix. This article surveys the recent advances on delay-sensitive video computations in the cloud, which are crucial to cloud-assisted conversational video services, such as cloud gaming, Virtual Reality (VR), Augmented Reality (AR), and telepresence. Supporting conversational video services with cloud resources is challenging because most cloud servers are far away from the end users while these services incur the following stringent requirements: high bandwidth, short delay, and high heterogeneity. In this article, we cover the literature with a top-down approach: from applications and experience, to architecture and management, and to optimization in and outside of the cloud. We also point out major open challenges, hoping to stimulate more research activities in this emerging and exciting direction.

Journal ArticleDOI
TL;DR: This article compares different approaches to estimating synchronization among multiple spectators’ signals, namely pairwise, group, and overall synchronization measures, for detecting aesthetic highlights in movies, and finds that pairwise synchronization measures perform the most accurately, independently of highlight category and movie genre.
Abstract: Detection of aesthetic highlights is a challenge for understanding the affective processes taking place during movie watching. In this article, we study spectators’ responses to movie aesthetic stimuli in a social context. Moreover, we seek to uncover the emotional component of aesthetic highlights in movies. Our assumption is that synchronized physiological and behavioral reactions of spectators occur during these highlights because: (i) aesthetic choices of filmmakers are made to elicit specific emotional reactions (e.g., special effects, empathy, and compassion toward a character) and (ii) watching a movie together causes spectators’ affective reactions to be synchronized through emotional contagion. We compare different approaches to estimating synchronization among multiple spectators’ signals, namely pairwise, group, and overall synchronization measures, to detect aesthetic highlights in movies. The results show that an unsupervised architecture relying on synchronization measures is able to capture different properties of spectators’ synchronization and detect aesthetic highlights based on both spectators’ electrodermal and acceleration signals. We find that pairwise synchronization measures perform the most accurately, independently of highlight category and movie genre. Moreover, we observe that electrodermal signals have more discriminative power than acceleration signals for highlight detection.
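
A simple version of a pairwise synchronization measure, mean pairwise Pearson correlation over sliding windows, could look like the sketch below; the window and hop sizes are arbitrary, and the paper's group and overall measures are not shown.

```python
import numpy as np
from itertools import combinations

def pairwise_sync(signals, win=256, hop=64):
    # signals: (n_spectators, n_samples), e.g., electrodermal activity traces
    out = []
    for start in range(0, signals.shape[1] - win + 1, hop):
        w = signals[:, start:start + win]
        corrs = [np.corrcoef(w[i], w[j])[0, 1]          # correlation of each spectator pair
                 for i, j in combinations(range(len(w)), 2)]
        out.append(np.nanmean(corrs))                   # window-level synchronization score
    return np.array(out)   # high values mark candidate aesthetic highlights
```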

Journal ArticleDOI
TL;DR: In this paper, a statistical approach for extracting robust gait features directly from raw data by a modification of Linear Discriminant Analysis with Maximum Margin Criterion was proposed, which outperformed 13 relevant methods based on geometric features and a method to learn the features by a combination of principal component analysis and linear discriminant analysis.
Abstract: Gait recognition from motion capture data, as a pattern classification discipline, can be improved by the use of machine learning. This article contributes to the state of the art with a statistical approach for extracting robust gait features directly from raw data by a modification of Linear Discriminant Analysis with the Maximum Margin Criterion. Experiments on the CMU MoCap database show that the suggested method outperforms 13 relevant methods based on geometric features and a method that learns the features by a combination of Principal Component Analysis and Linear Discriminant Analysis. The methods are evaluated in terms of the distribution of biometric templates in the respective feature spaces, expressed by a number of class-separability coefficients and classification metrics. Results also indicate a high portability of the learned features, which means that we can learn what aspects of walk people generally differ in and extract those as general gait features. Recognizing people without needing group-specific features is convenient, as particular people might not always provide annotated learning data. As a contribution to reproducible research, our evaluation framework and database have been made publicly available. This research makes motion capture technology directly applicable to human recognition.
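
For reference, a generic Maximum Margin Criterion projection (leading eigenvectors of S_b - S_w) can be sketched as follows; this is the textbook form, not necessarily the paper's exact modification.

```python
import numpy as np

def mmc_projection(X, y, n_dims):
    # X: (N, D) raw gait samples, y: (N,) identity labels
    mean = X.mean(axis=0)
    Sb = np.zeros((X.shape[1], X.shape[1]))
    Sw = np.zeros_like(Sb)
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)   # between-class scatter
        Sw += (Xc - mc).T @ (Xc - mc)                    # within-class scatter
    vals, vecs = np.linalg.eigh(Sb - Sw)                 # symmetric matrix, eigh is safe
    return vecs[:, np.argsort(vals)[::-1][:n_dims]]      # directions maximizing the margin
```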

Journal ArticleDOI
TL;DR: The potential of utilizing affect or emotion recognition research in AEH models is explored and the conceptual Emotion-based E-learning Model (EEM) with the proposed emotion recognition framework is proposed for future work.
Abstract: Adaptive Educational Hypermedia (AEH) e-learning models aim to personalize educational content and learning resources based on the needs of an individual learner. The Adaptive Hypermedia Architecture (AHA) is a specific implementation of the AEH model that exploits the cognitive characteristics of learner feedback to adapt resources accordingly. However, besides cognitive feedback, the learning realm generally includes both the affective and emotional feedback of the learner, which is often neglected in the design of e-learning models. This article aims to explore the potential of utilizing affect or emotion recognition research in AEH models. To this end, an emotion recognition framework referred to as Multiple Kernel Learning Decision Tree Weighted Kernel Alignment (MKLDT-WFA) is proposed. The MKLDT-WFA has two merits over classical MKL. First, the WFA component preserves only the relevant kernel weights to reduce redundancy and improve discrimination between emotion classes. Second, training via the decision tree reduces the misclassification issues associated with SimpleMKL. The proposed work has been evaluated on different emotion datasets and the results confirm its good performance. Finally, the conceptual Emotion-based E-learning Model (EEM) with the proposed emotion recognition framework is proposed for future work.

Journal ArticleDOI
TL;DR: This article considers the problem of designing a secure, robust, high-fidelity, storage-efficient image-sharing scheme over Facebook, a representative OSN that is widely accessed and proposes a DCT-domain image encryption/decryption framework that is robust against these lossy operations.
Abstract: Sharing images online has become extremely easy and popular due to the ever-increasing adoption of mobile devices and online social networks (OSNs). The privacy issues arising from image sharing over OSNs have received significant attention in recent years. In this article, we consider the problem of designing a secure, robust, high-fidelity, storage-efficient image-sharing scheme over Facebook, a representative OSN that is widely accessed. To accomplish this goal, we first conduct an in-depth investigation of the manipulations that Facebook performs on uploaded images. Assisted by such knowledge, we propose a DCT-domain image encryption/decryption framework that is robust against these lossy operations. As verified theoretically and experimentally, superior performance in terms of data privacy, quality of the reconstructed images, and storage cost can be achieved.

Journal ArticleDOI
TL;DR: A Spatially Coherent feature-learning method for Pose-invariant Facial Expression Recognition (SC-PFER), which outperforms current state-of-the-art FER methods and introduces a linkage structure over the learning-based features and the corresponding geometry information of each key region to encode the dependencies of the regions.
Abstract: Feature learning has enjoyed much attention and achieved good performance in recent studies of image processing. Unlike the training conditions often assumed there, far less labeled data is available for training emotion classification systems. In addition, current feature learning is typically performed on an entire face image without considering the dependency between features. These approaches ignore the fact that faces are structured and that neighboring features are dependent. Thus, the learned features lack the power to describe visually coherent facial images. Our method is therefore designed with the goal of simplifying the problem domain by removing expression-irrelevant factors from the input images, with a key region-based mechanism, in an effort to reduce the amount of data required to effectively train feature-learning methods. Meanwhile, we can construct geometric constraints between the key regions and their detected positions. To this end, we introduce a Spatially Coherent feature-learning method for Pose-invariant Facial Expression Recognition (SC-PFER). In our model, we first perform face frontalization through a 3D pose-normalization technique, which normalizes poses while preserving identity information by synthesizing frontal faces for facial images with arbitrary views. Subsequently, we select a sequence of key regions around 51 key points in the synthetic frontal face images for efficient unsupervised feature learning. Finally, we introduce a linkage structure over the learning-based features and the corresponding geometry information of each key region to encode the dependencies of the regions. Our method, on the whole, does not require training multiple models for each specific pose and avoids separate training and parameter tuning for each pose. The proposed framework has been evaluated on two benchmark databases, BU-3DFE and SFEW, for pose-invariant Facial Expression Recognition (FER). The experimental results demonstrate that our algorithm outperforms current state-of-the-art FER methods. Specifically, our model achieves an improvement of 1.72% and 1.11% in FER accuracy, on average, on BU-3DFE and SFEW, respectively.

Journal ArticleDOI
Weiqi Luo, Haodong Li, Qi Yan, Rui Yang, Jiwu Huang
TL;DR: An improved audio steganalytic feature set derived from both the time and Mel-frequency domains for detecting some typical steganography in the time domain, including LSB matching, Hide4PGP, and Steghide is designed.
Abstract: Digital multimedia steganalysis has attracted wide attention over the past decade. Currently, there are many algorithms for detecting image steganography. However, little research has been devoted to audio steganalysis. Since the statistical properties of image and audio files are quite different, features that are effective in image steganalysis may not be effective for audio. In this article, we design an improved audio steganalytic feature set derived from both the time and Mel-frequency domains for detecting some typical steganography in the time domain, including LSB matching, Hide4PGP, and Steghide. The experimental results, evaluated on different audio sources, including various music and speech clips of different complexity, show that the proposed features significantly outperform the existing ones. Moreover, we use the proposed features to detect and further identify some typical audio operations that would probably be used in audio tampering. The extensive experimental results show that the proposed features also outperform the related forensic methods, especially when the length of the audio clip is small, such as audio clips with 800 samples. This is very important in real forensic situations.
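
As a loose illustration of combining time-domain and Mel-frequency information, the sketch below computes simple residual statistics plus MFCC summaries; the actual feature design in the paper differs, so these specific statistics are assumptions.

```python
import numpy as np
import librosa

def stego_features(audio, sr=16000, n_mfcc=20):
    """Concatenate time-domain residual statistics with Mel-frequency summaries."""
    residual = np.diff(audio.astype(float))           # emphasize LSB-scale perturbations
    time_stats = [residual.mean(), residual.std(),
                  np.abs(residual).mean(), (residual ** 2).mean()]
    mfcc = librosa.feature.mfcc(y=audio.astype(float), sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([time_stats, mfcc.mean(axis=1), mfcc.std(axis=1)])
```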

Journal ArticleDOI
TL;DR: This article begins with the measurement of individual viewing behavior from two aspects: the temporal characteristics and user interest, and investigates the predictability of video popularity as a collective user behavior through early views using the ARMA model.
Abstract: Understanding streaming user behavior is crucial to the design of large-scale Video-on-Demand (VoD) systems. In this article, we begin with the measurement of individual viewing behavior from two aspects: temporal characteristics and user interest. We observe that active users spend more hours on each active day and that their daily request time distribution is more scattered than that of less active users, while the inter-view time distribution differs negligibly between the two groups. A common interest in popular videos and the latest uploaded videos is observed in both groups. We then investigate the predictability of video popularity, as a collective user behavior, from early views. In light of the limitations of classical approaches, the Autoregressive-Moving-Average (ARMA) model is employed to forecast the popularity dynamics of individual videos at fine-grained time scales, thus achieving much higher prediction accuracy. When applied to video caching, the ARMA-assisted Least Frequently Used (LFU) algorithm outperforms Least Recently Used (LRU) by 11--16% and the well-tuned LFU by 6--13% in terms of hit ratio, and it is only 2--4% inferior to the offline LFU.
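
A minimal sketch of ARMA-based popularity forecasting for a single video, using statsmodels' ARIMA with d=0 as the ARMA implementation; the order shown is illustrative, not the one tuned in the paper.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def forecast_views(view_counts, steps=6, order=(2, 0, 1)):
    """Fit ARMA(p, q) (ARIMA with d=0) to early view counts and predict future slots."""
    model = ARIMA(np.asarray(view_counts, dtype=float), order=order)
    fitted = model.fit()
    return fitted.forecast(steps=steps)   # predicted views for the next time slots
```

The forecasts would then drive the cache, e.g., by ranking videos by predicted rather than historical request frequency in an LFU-style policy.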

Journal ArticleDOI
TL;DR: This article addresses the cross-domain (i.e., street and shop) clothing retrieval problem and investigates its real-world applications for online clothing shopping, focusing on learning an effective feature-embedding model to generate robust and discriminative feature representation across domains.
Abstract: In this article, we address the cross-domain (i.e., street and shop) clothing retrieval problem and investigate its real-world applications for online clothing shopping. It is a challenging problem due to the large discrepancy between street and shop domain images. We focus on learning an effective feature-embedding model to generate robust and discriminative feature representations across domains. Existing triplet embedding models achieve promising results by finding an embedding metric in which the distance between negative pairs is larger than the distance between positive pairs plus a margin. However, existing methods do not sufficiently address the challenges of the cross-domain clothing retrieval scenario. First, the intradomain and cross-domain data relationships need to be considered simultaneously. Second, the numbers of matched and nonmatched cross-domain pairs are unbalanced. To address these challenges, we propose a deep cross-triplet embedding algorithm together with a cross-triplet sampling strategy. Extensive experimental evaluations clearly demonstrate the effectiveness of the proposed algorithms. Furthermore, we investigate two novel online shopping applications, clothing try-on and accessories recommendation, based on a unified cross-domain clothing retrieval framework.
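
The margin-based triplet objective underlying such embedding models can be sketched as follows; the paper's cross-triplet sampling strategy is abstracted away here.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.3):
    # anchor: street-photo embeddings; positive/negative: shop-image embeddings
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    # push the matched shop image closer than any non-matched one by at least the margin
    return F.relu(d_pos - d_neg + margin).mean()
```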

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a method to estimate contrast enhancement operations from a single image, which takes advantage of the nature of contrast enhancement as a mapping between pixel values and the distinct characteristics it introduces to the image pixel histogram.
Abstract: Inconsistency in contrast enhancement can be used to expose image forgeries. In this work, we describe a new method to estimate contrast enhancement operations from a single image. Our method takes advantage of the nature of contrast enhancement as a mapping between pixel values and of the distinct characteristics it introduces to the image pixel histogram. Our method recovers the original pixel histogram and the contrast enhancement simultaneously from a single image with an iterative algorithm. Unlike previous works, our method is robust in the presence of additive noise perturbations that are used to hide the traces of contrast enhancement. Furthermore, we also develop an effective method to detect image regions that have undergone contrast enhancement transformations different from those applied to the rest of the image, and we use this method to detect composite images. We perform extensive experimental evaluations to demonstrate the efficacy and efficiency of our method.
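
As background, a simple histogram-spectrum indicator of contrast enhancement is sketched below: value mappings leave peak/gap artifacts that show up as high-frequency energy in the histogram's DFT. This is a classical cue, not the paper's iterative histogram-recovery algorithm, and the cutoff is an arbitrary choice.

```python
import numpy as np

def enhancement_score(gray_image, cutoff=32):
    """Return a rough contrast-enhancement indicator from the pixel-value histogram."""
    hist, _ = np.histogram(gray_image, bins=256, range=(0, 256))
    spectrum = np.abs(np.fft.fft(hist / (hist.sum() + 1e-12)))
    return spectrum[cutoff:128].sum()   # large values suggest peak/gap artifacts
```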

Journal ArticleDOI
TL;DR: To detect automatic information broadcast in OSN, a wavelet-based model is developed that classifies users as being human, legitimate robot, or malicious robot, as a result of spectral patterns obtained from users’ textual content.
Abstract: Social interactions take place in environments that influence people’s behaviours and perceptions. Nowadays, the users of Online Social Networks (OSNs) generate a massive amount of content based on social interactions. However, OSNs’ wide popularity and ease of access have created a perfect scenario for practicing malicious activities, compromising their reliability. To detect automatic information broadcast in OSNs, we developed a wavelet-based model that classifies users as being human, legitimate robot, or malicious robot, based on spectral patterns obtained from users’ textual content. We create the feature vector from the Discrete Wavelet Transform along with a weighting scheme called Lexicon-based Coefficient Attenuation. In particular, we induce a classification model using the Random Forest algorithm over two real Twitter datasets. The corresponding results show that the developed model achieved an average accuracy of 94.47% across two different scenarios: a single-theme one and a miscellaneous one.
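
A rough sketch of the wavelet-feature plus Random Forest pipeline follows, assuming each user is summarized by a numeric activity series; the Lexicon-based Coefficient Attenuation weighting is omitted, and the per-band statistics are assumptions.

```python
import numpy as np
import pywt
from sklearn.ensemble import RandomForestClassifier

def dwt_features(series, wavelet="db4", level=3):
    """Summarize each DWT sub-band of a user's activity series with a few statistics."""
    coeffs = pywt.wavedec(np.asarray(series, dtype=float), wavelet, level=level)
    return np.concatenate([[c.mean(), c.std(), np.abs(c).max()] for c in coeffs])

def train_detector(user_series, labels):
    # labels: human / legitimate robot / malicious robot
    X = np.vstack([dwt_features(s) for s in user_series])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    return clf.fit(X, labels)
```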

Journal ArticleDOI
TL;DR: This article develops (and open-sources) a Facebook application, named YouQ, as an experimental platform for studying individual experience of videos, and shows that subjective experiments based on YouQ can produce reliable results as compared to a controlled laboratory experiment.
Abstract: The next generation of multimedia services has to be optimized in a personalized way, taking user factors into account for the evaluation of individual experience. Previous works have investigated the influence of user factors mostly in controlled laboratory environments, which often include a limited number of users and fail to reflect real-life environments. Social media, especially Facebook, provide an interesting alternative for Internet-based subjective evaluation. In this article, we develop (and open-source) a Facebook application, named YouQ, as an experimental platform for studying individual experience of videos. Our results show that subjective experiments based on YouQ can produce reliable results as compared to a controlled laboratory experiment. Additionally, YouQ has the ability to collect user information automatically from Facebook, which can be used for modeling individual experience.

Journal ArticleDOI
TL;DR: This article investigates whether it is possible to enhance a tourist's archaeological experience, which is often derived from only scarce remains, by using an audio augmented reality (AAR) system to recreate the soundscape of a medieval archaeological site.
Abstract: This article investigates the use of an audio augmented reality (AAR) system to recreate the soundscape of a medieval archaeological site. The aim of our work was to explore whether it is possible to enhance a tourist's archaeological experience, which is often derived from only scarce remains. We developed a smartphone-based AAR system, which uses location and orientation sensors to synthesize the soundscape of a site and plays it to the user via headphones. We recreated the ancient soundscape of a medieval archaeological site in Croatia and tested it in situ on two groups of participants using the soundwalk method. One test group performed the soundwalk while listening to the recreated soundscape using the AAR system, while the second control group did not use the AAR equipment. We measured the experiences of the participants using two methods: the standard soundwalk questionnaire and affective computing equipment for detecting the emotional state of participants. The results of both test methods show that participants who were listening to the ancient soundscape using our AAR system experienced higher arousal than those visiting the site without AAR.

Journal ArticleDOI
TL;DR: The evaluation suggests that users generally appreciate the idea of FEPs, and that it can effectively help novice and moderately experienced users in crafting film sequences with little training.
Abstract: This article introduces Film Editing Patterns (FEP), a language to formalize film editing practices and stylistic choices found in movies. FEP constructs are constraints, expressed over one or more shots from a movie sequence, that characterize changes in cinematographic visual properties, such as shot sizes, camera angles, or the layout of actors on the screen. We present the vocabulary of the FEP language, introduce its usage in analyzing styles from annotated film data, and describe how it can support users in the creative design of film sequences in 3D. More specifically, (i) we define the FEP language, (ii) we present an application to craft filmic sequences from 3D animated scenes that uses FEPs as a high-level means to select cameras and perform cuts between cameras following best practices in cinema, and (iii) we evaluate the benefits of FEPs through user experiments in which professional filmmakers and amateurs had to create cinematographic sequences. The evaluation suggests that users generally appreciate the idea of FEPs and that it can effectively help novice and moderately experienced users craft film sequences with little training.

Journal ArticleDOI
TL;DR: QEM4VR is presented, a high-fidelity mesh simplification algorithm specifically designed for VR that addresses the deficiencies of prior quadric error metric (QEM) approaches by leveraging the insight that the most relevant boundary edges lie along curvatures while linear boundary edges can be collapsed.
Abstract: With the increasing accessibility of mobile head-mounted displays (HMDs), mobile virtual reality (VR) systems are finding applications in various areas. However, mobile HMDs are highly constrained, with limited graphics processing units (GPUs), low processing power, and little onboard memory. Hence, VR developers must be cognizant of the number of polygons contained within their virtual environments to avoid rendering at low frame rates and inducing simulator sickness. The most robust and rapid approach to keeping the overall number of polygons low is to use mesh simplification algorithms to create low-poly versions of pre-existing, high-poly models. Unfortunately, most existing mesh simplification algorithms cannot adequately handle meshes with many boundaries or nonmanifold meshes, which are common attributes of many 3D models. In this article, we present QEM4VR, a high-fidelity mesh simplification algorithm specifically designed for VR. This algorithm addresses the deficiencies of prior quadric error metric (QEM) approaches by leveraging the insight that the most relevant boundary edges lie along curvatures, while linear boundary edges can be collapsed. Additionally, our algorithm preserves key surface properties, such as normals, texture coordinates, colors, and materials, as it preprocesses 3D models and generates their low-poly approximations offline. We evaluated the effectiveness of our QEM4VR algorithm by comparing its simplified-mesh results to those of prior QEM variations in terms of geometric approximation error, texture error, progressive approximation errors, frame rate impact, and perceptual quality measures. We found that QEM4VR consistently yielded simplified meshes with less geometric approximation error and texture error than the prior QEM variations. It afforded better frame rates than QEM variations with boundary preservation constraints, which create unnecessary lower bounds on overall polygon count reduction. Our evaluation revealed that QEM4VR did not fare well in terms of existing perceptual distance measurements, but human-based inspections demonstrate that these algorithmic measurements are not suitable substitutes for actual human perception. In turn, we present a user-based methodology for evaluating the perceptual qualities of mesh simplification algorithms.
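
For context, the basic quadric error metric that QEM-style simplifiers build on can be sketched as below: each vertex accumulates the plane quadrics of its incident triangles, and an edge contraction costs v^T(Q1+Q2)v at the chosen target position. The VR-specific boundary handling of QEM4VR is not reproduced here.

```python
import numpy as np

def plane_quadric(p0, p1, p2):
    """Fundamental 4x4 error quadric of the plane through a triangle."""
    n = np.cross(p1 - p0, p2 - p0)
    n = n / (np.linalg.norm(n) + 1e-12)
    plane = np.append(n, -n.dot(p0))        # [a, b, c, d] with ax + by + cz + d = 0
    return np.outer(plane, plane)

def edge_cost(Q1, Q2, v_target):
    """Quadric error of contracting an edge whose endpoints carry quadrics Q1 and Q2."""
    v = np.append(v_target, 1.0)            # homogeneous coordinates
    return float(v @ (Q1 + Q2) @ v)
```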

Journal ArticleDOI
TL;DR: Extensive experiments on challenging real-world datasets demonstrate that both the robust contrastive loss and the multitask fine-tuning scheme are effective, leading to very promising results with a time cost suitable for mobile product search scenarios.
Abstract: Features extracted by deep networks have been popular in many visual search tasks. This article studies deep network structures and training schemes for mobile visual search. The goal is to learn an effective yet portable feature representation that is suitable for bridging the domain gap between mobile user photos and (mostly) professionally taken product images while keeping the computational cost acceptable for mobile-based applications. The technical contributions are twofold. First, we propose an alternative of the contrastive loss popularly used for training deep Siamese networks, namely robust contrastive loss, where we relax the penalty on some positive and negative pairs to alleviate overfitting. Second, a simple multitask fine-tuning scheme is leveraged to train the network, which not only utilizes knowledge from the provided training photo pairs but also harnesses additional information from the large ImageNet dataset to regularize the fine-tuning process. Extensive experiments on challenging real-world datasets demonstrate that both the robust contrastive loss and the multitask fine-tuning scheme are effective, leading to very promising results with a time cost suitable for mobile product search scenarios.
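
One plausible reading of a "robust" (relaxed) contrastive loss is sketched below: positive pairs within a small slack and negative pairs beyond the margin contribute no penalty. This is an interpretation for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def relaxed_contrastive_loss(x1, x2, same, margin=1.0, pos_slack=0.1):
    # x1, x2: (B, D) embeddings of photo/product pairs; same: (B,) 1 if matched, else 0
    d = F.pairwise_distance(x1, x2)
    pos = F.relu(d - pos_slack) ** 2            # don't force positives all the way to distance 0
    neg = F.relu(margin - d) ** 2               # only penalize negatives inside the margin
    return torch.where(same.bool(), pos, neg).mean()
```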

Journal ArticleDOI
TL;DR: A novel image captioning model with Affective Guiding and Selective Attention Mechanism named AG-SAM is proposed, which aims to bridge the affective gap between image captioning and the emotional response elicited by the image.
Abstract: Image captioning is an increasingly important problem at the intersection of artificial intelligence, computer vision, and natural language processing. Recent works have revealed that it is possible for a machine to generate meaningful and accurate sentences for images. However, most existing methods ignore the latent emotional information in an image. In this article, we propose a novel image captioning model with Affective Guiding and Selective Attention Mechanism, named AG-SAM. In our method, we aim to bridge the affective gap between image captioning and the emotional response elicited by the image. First, we introduce affective components that capture higher-level concepts encoded in images into AG-SAM. Hence, our language model can be adapted to generate sentences that are more passionate and emotive. In addition, a selective gate acting on the attention mechanism controls how much visual information AG-SAM needs. Experimental results show that our model outperforms most existing methods, clearly reflecting an association between images and emotional components that is usually ignored in existing works.

Journal ArticleDOI
TL;DR: The goal is to infer the class label information of 3D human actions with partial observation of temporally incomplete action executions using a stochastic process called dynamic marked point process (DMP) to model the 3D action as temporal dynamic patterns, where both timing and strength information are captured.
Abstract: Action recognition is an important research problem of human motion analysis (HMA). In recent years, 3D observation-based action recognition has been receiving increasing interest in the multimedia and computer vision communities, due to the recent advent of cost-effective sensors, such as depth camera Kinect. This work takes this one step further, focusing on early recognition of ongoing 3D human actions, which is beneficial for a large variety of time-critical applications, e.g., gesture-based human machine interaction, somatosensory games, and so forth. Our goal is to infer the class label information of 3D human actions with partial observation of temporally incomplete action executions. By considering 3D action data as multivariate time series (m.t.s.) synchronized to a shared common clock (frames), we propose a stochastic process called dynamic marked point process (DMP) to model the 3D action as temporal dynamic patterns, where both timing and strength information are captured. To achieve even more early and better accuracy of recognition, we also explore the temporal dependency patterns between feature dimensions. A probabilistic suffix tree is constructed to represent sequential patterns among features in terms of the variable-order Markov model (VMM). Our approach and several baselines are evaluated on five 3D human action datasets. Extensive results show that our approach achieves superior performance for early recognition of 3D human actions.