
Showing papers in "ACM Transactions on Multimedia Computing, Communications, and Applications in 2017"


Journal ArticleDOI
TL;DR: This paper proposes a Siamese network that simultaneously computes the identification loss and the verification loss, so that the network learns a discriminative embedding and a similarity measurement at the same time.
Abstract: In this article, we revisit two popular convolutional neural networks in person re-identification (re-ID): verification and identification models. The two models have their respective advantages and limitations due to different loss functions. Here, we shed light on how to combine the two models to learn more discriminative pedestrian descriptors. Specifically, we propose a Siamese network that simultaneously computes the identification loss and verification loss. Given a pair of training images, the network predicts the identities of the two input images and whether they belong to the same identity. Our network learns a discriminative embedding and a similarity measurement at the same time, thus making full use of the re-ID annotations. Our method can be easily applied to different pretrained networks. Albeit simple, the learned embedding improves the state-of-the-art performance on two public person re-ID benchmarks. Further, we show that our architecture can also be applied to image retrieval. The code is available at https://github.com/layumi/2016_person_re-ID.

662 citations
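
The combination of the two losses described in this abstract can be illustrated with a small sketch. It is not the authors' implementation (their code is at the URL above); it only shows how two identification cross-entropy terms and one verification cross-entropy term might be summed for a single image pair. The function names, the `weight` parameter, and the convention that the verification head outputs two scores (different, same) are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the target class."""
    return -math.log(softmax(logits)[target])

def joint_loss(id_logits_a, id_logits_b, label_a, label_b, verif_logits, weight=1.0):
    """Identification loss for both images plus a weighted verification loss.

    verif_logits is assumed to hold two scores:
    index 0 = different identity, index 1 = same identity.
    """
    same = 1 if label_a == label_b else 0
    ident = cross_entropy(id_logits_a, label_a) + cross_entropy(id_logits_b, label_b)
    verif = cross_entropy(verif_logits, same)
    return ident + weight * verif
```

With uninformative (all-zero) logits over three identities, the loss reduces to 2·ln 3 + ln 2, which is a convenient sanity check for the wiring.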


Journal ArticleDOI
TL;DR: The state of the art in this exciting research area is reported, looking back to the evolution of neural networks, and arriving at the most recent results in terms of methodologies, technologies, and applications for mobile environments.
Abstract: Deep Learning (DL) has become a crucial technology for multimedia computing. It offers a powerful instrument to automatically produce high-level abstractions of complex multimedia data, which can be exploited in a number of applications, including object detection and recognition, speech-to-text, media retrieval, multimodal data analysis, and so on. The availability of affordable large-scale parallel processing architectures, and the sharing of effective open-source code implementing the basic learning algorithms, caused a rapid diffusion of DL methodologies, bringing a number of new technologies and applications that outperform, in most cases, traditional machine learning technologies. In recent years, the possibility of implementing DL technologies on mobile devices has attracted significant attention. Thanks to this technology, portable devices may become smart objects capable of learning and acting. The path toward these exciting future scenarios, however, entails a number of important research challenges. DL architectures and algorithms are not well adapted to the storage and computation resources of a mobile device. Therefore, there is a need for new generations of mobile processors and chipsets, small footprint learning and inference algorithms, new models of collaborative and distributed processing, and a number of other fundamental building blocks. This survey reports the state of the art in this exciting research area, looking back to the evolution of neural networks, and arriving at the most recent results in terms of methodologies, technologies, and applications for mobile environments.

124 citations


Journal ArticleDOI
TL;DR: This paper proposes a self-trained subspace learning paradigm for person re-ID that effectively utilizes both labeled and unlabeled data to learn a discriminative subspace where person images across disjoint camera views can be easily matched.
Abstract: Despite the promising progress made in recent years, person re-identification (re-ID) remains a challenging task due to the complex variations in human appearances from different camera views. For this challenging problem, a large variety of algorithms have been developed in the fully supervised setting, requiring access to a large amount of labeled training data. However, the main bottleneck for fully supervised re-ID is the limited availability of labeled training samples. To address this problem, we propose a self-trained subspace learning paradigm for person re-ID that effectively utilizes both labeled and unlabeled data to learn a discriminative subspace where person images across disjoint camera views can be easily matched. The proposed approach first constructs pseudo-pairwise relationships among unlabeled persons using the k-nearest neighbors algorithm. Then, with the pseudo-pairwise relationships, the unlabeled samples can be easily combined with the labeled samples to learn a discriminative projection by solving an eigenvalue problem. In addition, we refine the pseudo-pairwise relationships iteratively, which further improves learning performance. A multi-kernel embedding strategy is also incorporated into the proposed approach to cope with the non-linearity in a person’s appearance and explore the complementation of multiple kernels. In this way, the performance of person re-ID can be greatly enhanced when training data are insufficient. Experimental results on six widely used datasets demonstrate the effectiveness of our approach, and its performance can be comparable to the reported results of most state-of-the-art fully supervised methods while using much fewer labeled data.

86 citations
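
The first step named in this abstract, building pseudo-pairwise relationships among unlabeled samples with k-nearest neighbors, can be sketched as follows. This is a simplified illustration, not the paper's procedure: it ignores camera views, uses plain squared Euclidean distance, and treats a pair as related if either sample is among the other's k nearest neighbors. All names are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def pseudo_pairs(features, k):
    """Return index pairs (i, j) with i < j that are linked by a k-NN relation.

    A pair is included when j is among the k nearest neighbors of i,
    or i is among the k nearest neighbors of j (an assumption of this sketch).
    """
    pairs = set()
    for i, f in enumerate(features):
        others = [j for j in range(len(features)) if j != i]
        others.sort(key=lambda j: squared_distance(f, features[j]))
        for j in others[:k]:
            pairs.add((min(i, j), max(i, j)))
    return pairs
```

In an iterative scheme like the one the abstract describes, these pairs would presumably be recomputed in the learned subspace after each projection update.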


Journal ArticleDOI
TL;DR: This survey explores crowd analysis as it relates to two primary research areas: crowd statistics and behavior understanding, and survey methods for counting individuals and approximating the density of the crowd.
Abstract: Crowd video analysis has applications in crowd management, public space design, and visual surveillance. Example tasks potentially aided by automated analysis include anomaly detection (such as a person walking against the grain of traffic or rapid assembly/dispersion of groups of people), population and density measurements, and interactions between groups of people. This survey explores crowd analysis as it relates to two primary research areas: crowd statistics and behavior understanding. First, we survey methods for counting individuals and approximating the density of the crowd. Second, we showcase research efforts on behavior understanding as related to crowds. These works focus on identifying groups, interactions within small groups, and abnormal activity detection such as riots and bottlenecks in large crowds. Works presented in this section also focus on tracking groups of individuals, either as a single entity or a subset of individuals within the frame of reference. Finally, a summary of datasets available for crowd activity video research is provided.

69 citations


Journal ArticleDOI
TL;DR: Experimental results show that this approach can achieve 0.6% to 6.7% relative improvements over state-of-the-art methods in terms of the F-measure and Precision metrics, which demonstrates the effectiveness of the proposed approach.
Abstract: Saliency detection has recently received increasing research interest on using high-dimensional datasets beyond two-dimensional images. Despite the many available capturing devices and algorithms, there still exists a wide spectrum of challenges that need to be addressed to achieve accurate saliency detection. Inspired by the success of the light-field technique, in this article, we propose a new computational scheme to detect salient regions by integrating multiple visual cues from light-field images. First, saliency prior maps are generated from several light-field features based on superpixel-level intra-cue distinctiveness, such as color, depth, and flow inherited from different focal planes and multiple viewpoints. Then, we introduce the location prior to enhance the saliency maps. These maps will finally be merged into a single map using a random-search-based weighting strategy. Besides, we refine the object details by employing a two-stage saliency refinement to obtain the final saliency map. In addition, we present a more challenging benchmark dataset for light-field saliency analysis, named HFUT-Lytro, which consists of 255 light fields with a range from 53 to 64 images generated from each light-field image, therein spanning multiple occurrences of saliency detection challenges such as occlusions, cluttered background, and appearance changes. Experimental results show that our approach can achieve 0.6% to 6.7% relative improvements over state-of-the-art methods in terms of the F-measure and Precision metrics, which demonstrates the effectiveness of the proposed approach.

59 citations


Journal ArticleDOI
TL;DR: A mobile food recognition system that uses the picture of the food, taken by the user's mobile device, to recognize multiple food items in the same meal, such as steak and potatoes on the same plate, to estimate the calorie and nutrition of the meal.
Abstract: In this article, we propose a mobile food recognition system that uses the picture of the food, taken by the user’s mobile device, to recognize multiple food items in the same meal, such as steak and potatoes on the same plate, to estimate the calorie and nutrition of the meal. To speed up and make the process more accurate, the user is asked to quickly identify the general area of the food by drawing a bounding circle on the food picture by touching the screen. The system then uses image processing and computational intelligence for food item recognition. The advantage of recognizing items, instead of the whole meal, is that the system can be trained with only single item food images. At the training stage, we first use region proposal algorithms to generate candidate regions and extract the convolutional neural network (CNN) features of all regions. Second, we perform region mining to select positive regions for each food category using maximum cover by our proposed submodular optimization method. At the testing stage, we first generate a set of candidate regions. For each region, a classification score is computed based on its extracted CNN features and predicted food names of the selected regions. Since fast response is one of the important parameters for the user who wants to eat the meal, certain heavy computational parts of the application are offloaded to the cloud. Hence, the processes of food recognition and calorie estimation are performed in cloud server. Our experiments, conducted with the FooDD dataset, show an average recall rate of 90.98%, precision rate of 93.05%, and accuracy of 94.11% compared to 50.8% to 88% accuracy of other existing food recognition systems.

55 citations


Journal ArticleDOI
TL;DR: A Tucker deep computation model is proposed by using the Tucker decomposition to compress the weight tensors in the fully connected layers for multimedia feature learning and a learning algorithm based on the back-propagation strategy is devised to train the parameters of the Tucker deep computation model.
Abstract: Recently, the deep computation model, as a tensor deep learning model, has achieved superior performance for multimedia feature learning. However, the conventional deep computation model involves a large number of parameters. Typically, training a deep computation model with millions of parameters needs high-performance servers with large-scale memory and powerful computing units, limiting the growth of the model size for multimedia feature learning on common devices such as portable CPUs and conventional desktops. To tackle this problem, this article proposes a Tucker deep computation model by using the Tucker decomposition to compress the weight tensors in the fully connected layers for multimedia feature learning. Furthermore, a learning algorithm based on the back-propagation strategy is devised to train the parameters of the Tucker deep computation model. Finally, the performance of the Tucker deep computation model is evaluated by comparing with the conventional deep computation model on two representative multimedia datasets, that is, CUAVE and SNAE2, in terms of accuracy drop, parameter reduction, and speedup in the experiments. Results imply that the Tucker deep computation model can achieve a large parameter reduction and speedup with a small accuracy drop for multimedia feature learning.

53 citations
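
To make the parameter reduction concrete, the sketch below counts parameters for a dense weight tensor and for its Tucker-factorized form (one core tensor plus one factor matrix per mode). The counting formula is the standard one for a Tucker decomposition; the tensor shape and ranks in the usage note are invented for illustration and are not taken from the paper.

```python
from math import prod

def dense_params(shape):
    """Number of entries in a full weight tensor of the given shape."""
    return prod(shape)

def tucker_params(shape, ranks):
    """Entries in a Tucker factorization: a core of size prod(ranks)
    plus one (dimension x rank) factor matrix per mode."""
    if len(shape) != len(ranks):
        raise ValueError("shape and ranks must have the same number of modes")
    return prod(ranks) + sum(dim * rank for dim, rank in zip(shape, ranks))
```

For a hypothetical 64 x 64 x 64 weight tensor with ranks (8, 8, 8), the dense form holds 262,144 values while the Tucker form holds 2,048, a 128-fold reduction. The accuracy cost of such a compression depends on the data and is what the paper evaluates.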


Journal ArticleDOI
TL;DR: This work proposes an adaptive HMM method to obtain the hidden state number of each sign by affinity propagation clustering and discovers that inherent latent change states of each sign are related not only to the number of key gestures and body poses but also to their translation relationships.
Abstract: In sign language recognition (SLR) with multimodal data, a sign word can be represented by multiple features, for which there exist an intrinsic property and a mutually complementary relationship among them. To fully explore those relationships, we propose an online early-late fusion method based on the adaptive Hidden Markov Model (HMM). In terms of the intrinsic property, we discover that inherent latent change states of each sign are related not only to the number of key gestures and body poses but also to their translation relationships. We propose an adaptive HMM method to obtain the hidden state number of each sign by affinity propagation clustering. For the complementary relationship, we propose an online early-late fusion scheme. The early fusion (feature fusion) is dedicated to preserving useful information to achieve a better complementary score, while the late fusion (score fusion) uncovers the significance of those features and aggregates them in a weighting manner. Different from classical fusion methods, the fusion is query adaptive. For different queries, after feature selection (including the combined feature), the fusion weight is inversely proportional to the area under the curve of the normalized query score list for each selected feature. The whole fusion process is effective and efficient. Experiments verify the effectiveness on the signer-independent SLR with large vocabulary. Compared either on different dataset sizes or to different SLR models, our method demonstrates consistent and promising performance.

50 citations
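
The query-adaptive late-fusion rule in this abstract (weight inversely proportional to the area under the normalized score curve) can be sketched as below. The abstract does not specify the normalization or the integration rule, so min-max normalization of the descending-sorted scores and a unit-spacing trapezoidal rule are assumptions of this sketch, as are the function names.

```python
def normalized_area(scores):
    """Area under the descending, min-max normalized score curve."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to form a curve")
    ordered = sorted(scores, reverse=True)
    lo, hi = ordered[-1], ordered[0]
    if hi == lo:
        normalized = [1.0] * len(ordered)
    else:
        normalized = [(s - lo) / (hi - lo) for s in ordered]
    # Trapezoidal rule with unit spacing between ranked candidates.
    return sum((a + b) / 2.0 for a, b in zip(normalized, normalized[1:]))

def fusion_weights(score_lists):
    """One weight per feature, proportional to 1 / area, normalized to sum to 1."""
    inverse = [1.0 / normalized_area(s) for s in score_lists]
    total = sum(inverse)
    return [v / total for v in inverse]
```

The intuition is that a feature whose score list drops sharply after the top candidate (small area) is more decisive for this query and therefore receives a larger weight than a feature whose scores stay flat.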


Journal ArticleDOI
TL;DR: In this article, an iterative logistic regression is used to select and weight the contributions of each projection and perform the matching between the two views, which obtains comparable performance on the VIPeR and PRID 450s datasets and improves on the PRID and CUHK01 datasets with respect to the state of the art.
Abstract: In this article, we introduce a method to overcome one of the main challenges of person reidentification in multicamera networks, namely cross-view appearance changes. The proposed solution addresses the extreme variability of person appearance in different camera views by exploiting multiple feature representations. For each feature, kernel canonical correlation analysis with different kernels is employed to learn several projection spaces in which the appearance correlation between samples of the same person observed from different cameras is maximized. An iterative logistic regression is finally used to select and weight the contributions of each projection and perform the matching between the two views. Experimental evaluation shows that the proposed solution obtains comparable performance on the VIPeR and PRID 450s datasets and improves on the PRID and CUHK01 datasets with respect to the state of the art.

47 citations


Journal ArticleDOI
TL;DR: An application-layer multiplexing scheme for teleoperation systems with multimodal feedback (video, audio, and haptics) that gives high priority to the haptic signal and applies a preemptive-resume scheduling strategy to stream the audio and video data.
Abstract: This article proposes an application-layer multiplexing scheme for teleoperation systems with multimodal feedback (video, audio, and haptics). The available transmission resources are carefully allocated to avoid delay-jitter for the haptic signal potentially caused by the size and arrival time of the video and audio data. The multiplexing scheme gives high priority to the haptic signal and applies a preemptive-resume scheduling strategy to stream the audio and video data. The proposed approach estimates the available transmission rate in real time and adapts the video bitrate, data throughput, and force buffer size accordingly. Furthermore, the proposed scheme detects sudden transmission rate drops and applies congestion control to avoid abrupt delay increases and converge promptly to the altered transmission rate. The performance of the proposed scheme is measured objectively in terms of end-to-end signal latencies, packet rates, and peak signal-to-noise ratio (PSNR) for visual quality. Moreover, peak-delay and convergence time measurements are carried out to investigate the performance of the congestion control mode of the system.

41 citations
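
The scheduling idea in this abstract (haptic samples always go first, and audio/video transmission is interrupted and later resumed) can be illustrated with a slot-based toy simulation. It is a deliberately simplified model, not the proposed scheme: the channel rate is fixed, one haptic sample is sent per slot, and there is no rate estimation or congestion control. All names and numbers are hypothetical.

```python
from collections import deque

def multiplex(slots, slot_bytes, haptic_bytes, media_frames):
    """Simulate preemptive-resume multiplexing over fixed-size slots.

    In every slot the haptic sample is sent first; leftover bytes continue
    the media frame at the head of the queue, resuming where it stopped.
    Returns a list of (slot_index, frame_name) completion events.
    """
    queue = deque([name, size] for name, size in media_frames)
    completed = []
    for slot in range(slots):
        budget = slot_bytes - haptic_bytes  # bytes left after the haptic sample
        while budget > 0 and queue:
            frame = queue[0]
            sent = min(budget, frame[1])
            frame[1] -= sent
            budget -= sent
            if frame[1] == 0:
                completed.append((slot, frame[0]))
                queue.popleft()
    return completed
```

For example, with 100-byte slots and 20-byte haptic samples, a 200-byte video frame followed by a 40-byte audio frame both finish in the third slot (index 2), while a haptic sample is never delayed by more than its own slot.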


Journal ArticleDOI
TL;DR: This article addresses the problem of creating a smart audio guide that adapts to the actions and interests of museum visitors by proposing the use of a compact Convolutional Neural Network that performs object classification and localization.
Abstract: In this article, we address the problem of creating a smart audio guide that adapts to the actions and interests of museum visitors. As an autonomous agent, our guide perceives the context and is able to interact with users in an appropriate fashion. To do so, it understands what the visitor is looking at, if the visitor is moving inside the museum hall, or if he or she is talking with a friend. The guide performs automatic recognition of artworks, and it provides configurable interface features to improve the user experience and the fruition of multimedia materials through semi-automatic interaction. Our smart audio guide is backed by a computer vision system capable of working in real time on a mobile device, coupled with audio and motion sensors. We propose the use of a compact Convolutional Neural Network (CNN) that performs object classification and localization. Using the same CNN features computed for these tasks, we also perform robust artwork recognition. To improve the recognition accuracy, we perform additional video processing using shape-based filtering, artwork tracking, and temporal filtering. The system has been deployed on an NVIDIA Jetson TK1 and an NVIDIA Shield Tablet K1 and tested in a real-world environment (Bargello Museum of Florence).

Journal ArticleDOI
TL;DR: This article proposes a robust watermarking framework for HEVC-encoded video using an informed detector and shows that the proposed work effectively limits the increase in video bitrate and degradation in perceptual quality.
Abstract: Digital watermarking has received much attention in recent years as a promising solution to copyright protection. Video watermarking in the compressed domain has gained importance since videos are stored and transmitted in a compressed format. This decreases the overhead to fully decode and re-encode the video for embedding and extraction of the watermark. High Efficiency Video Coding (HEVC/H.265) is the latest and most efficient video compression standard and a successor to H.264 Advanced Video Coding. In this article, we propose a robust watermarking framework for HEVC-encoded video using an informed detector. A readable watermark is embedded invisibly in P frames for better perceptual quality. Our framework imposes security and robustness by selecting appropriate blocks using a random key and the spatio-temporal characteristics of the compressed video. A detailed analysis of the strengths of different compressed domain features is performed for implementing the watermarking framework. We experimentally demonstrate the utility of the proposed work. The results show that the proposed work effectively limits the increase in video bitrate and degradation in perceptual quality. The proposed framework is robust against re-encoding and image processing attacks.

Journal ArticleDOI
TL;DR: A novel, holistic multimedia system aiming to tackle automatic analysis of video from gastrointestinal (GI) endoscopy that combines filters using machine learning, image recognition, and extraction of global and local image features and is by far leading in terms of real-time performance and efficient resource consumption.
Abstract: Holistic medical multimedia systems covering end-to-end functionality from data collection to aided diagnosis are highly needed, but rare. In many hospitals, the potential value of multimedia data collected through routine examinations is not recognized. Moreover, the availability of the data is limited, as the health care personnel may not have direct access to stored data. However, medical specialists interact with multimedia content daily through their everyday work and have an increasing interest in finding ways to use it to facilitate their work processes. In this article, we present a novel, holistic multimedia system aiming to tackle automatic analysis of video from gastrointestinal (GI) endoscopy. The proposed system comprises the whole pipeline, including data collection, processing, analysis, and visualization. It combines filters using machine learning, image recognition, and extraction of global and local image features. The novelty is primarily in this holistic approach and its real-time performance, where we automate a complete algorithmic GI screening process. We built the system in a modular way to make it easily extendable to analyze various abnormalities, and we made it efficient in order to run in real time. The conducted experimental evaluation proves that the detection and localization accuracy are comparable or even better than existing systems, but it is by far leading in terms of real-time performance and efficient resource consumption.

Journal ArticleDOI
TL;DR: A Video Control Plane is built that enforces Video Quality Fairness among concurrent video flows generated by heterogeneous client devices and a max-min fairness optimization problem is solved at runtime.
Abstract: This article investigates several network-assisted streaming approaches that rely on active cooperation between video streaming applications and the network. We build a Video Control Plane that enforces Video Quality Fairness among concurrent video flows generated by heterogeneous client devices. For this purpose, a max-min fairness optimization problem is solved at runtime. We compare two approaches to actuate the optimal solution in a Software-Defined Networking (SDN) network: The first one allocates network bandwidth slices to video flows, and the second one guides video players in the video bitrate selection. We assess performance through several QoE-related metrics, such as Video Quality Fairness, video quality, and switching frequency. The impact of client-side adaptation algorithms is also investigated.
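
As background for the max-min fairness problem mentioned in this abstract, the sketch below implements the classic progressive-filling allocation of a shared link capacity among flows with known demands. Note the simplification: the paper optimizes fairness in video quality across heterogeneous devices, whereas this sketch is max-min fair in raw bandwidth only. The function name and inputs are hypothetical.

```python
def max_min_allocation(capacity, demands):
    """Max-min fair split of `capacity` among flows with the given demands.

    Flows are served in order of increasing demand; a flow whose demand fits
    within the current equal share is fully satisfied, and the remainder is
    re-divided among the flows that are still unsatisfied.
    """
    allocation = [0.0] * len(demands)
    remaining = float(capacity)
    active = sorted(range(len(demands)), key=lambda i: demands[i])
    while active:
        share = remaining / len(active)
        smallest = active[0]
        if demands[smallest] <= share:
            allocation[smallest] = float(demands[smallest])
            remaining -= demands[smallest]
            active.pop(0)
        else:
            # No remaining flow can be fully satisfied: split equally.
            for i in active:
                allocation[i] = share
            break
    return allocation
```

With a capacity of 10 and demands of 2, 2.6, 4, and 5, the two small flows receive what they ask for and the two large flows each receive 2.7.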

Journal ArticleDOI
TL;DR: The experimental results validate the efficacy of the proposed image encryption scheme against various kinds of possible attacks, tested with a variety of images and found the tamper detection accuracy to be satisfactorily high for most of the tampering scenarios.
Abstract: The benefits of high-end computation infrastructure facilities provided by cloud-based multimedia systems are attracting people all around the globe. However, such cloud-based systems possess security issues as third party servers become involved in them. Rendering data in an unreadable form so that no information is revealed to the cloud data centers will serve as the best solution to these security issues. One such image encryption scheme based on a Permutation Ordered Binary Number System has been proposed in this work. It distributes the image information in totally random shares, which can be stored at the cloud data centers. Further, the proposed scheme authenticates the shares at the pixel level. If any tampering is done at the cloud servers, the scheme can accurately identify the altered pixels via authentication bits and localizes the tampered area. The tampered portion is also reflected back in the reconstructed image that is obtained at the authentic user end. The experimental results validate the efficacy of the proposed scheme against various kinds of possible attacks, tested with a variety of images. The tamper detection accuracy has been computed on a pixel basis and found to be satisfactorily high for most of the tampering scenarios.

Journal ArticleDOI
TL;DR: This article proposes an efficient motion detection and tracking scheme for encrypted H.264/AVC video bitstreams, which has the advantages of requiring only a small storage of the encrypted video and has a low computational cost for both encryption and detection.
Abstract: Performing detection on surveillance videos contributes significantly to the goals of safety and security. However, performing detection on unprotected surveillance video may reveal the privacy of innocent people in the video. Therefore, striking a proper balance between maintaining personal privacy while enhancing the feasibility of detection is an important issue. One promising solution to this problem is to encrypt the surveillance videos and perform detection on the encrypted videos. Most existing encrypted signal processing methods focus on still images or small data volumes; however, because videos are typically much larger, investigating how to process encrypted videos is a significant challenge. In this article, we propose an efficient motion detection and tracking scheme for encrypted H.264/AVC video bitstreams, which does not require the previous decryption on the encrypted video. The main idea is to first estimate motion information from the bitstream structure and codeword length and, then, propose a region update (RU) algorithm to deal with the loss and error drifting of motion caused by the video encryption. The RU algorithm is designed based on the prior knowledge that the object motion in the video is continuous in space and time. Compared to the existing scheme, which is based on video encryption that occurs at the pixel level, the proposed scheme has the advantages of requiring only a small storage of the encrypted video and has a low computational cost for both encryption and detection. Experimental results show that our scheme performs better regarding detection accuracy and execution speed. Moreover, the proposed scheme can work with more than one format-compliant video encryption method, provided that the positions of the macroblocks can be extracted from the encrypted video bitstream. 
Due to the coupling of video stream encryption and detection algorithms, our scheme can be directly connected to the video stream output (e.g., surveillance cameras) without requiring any camera modifications.

Journal ArticleDOI
TL;DR: A compressed domain watermarking scheme is proposed for H.265/HEVC bit stream that can handle drift error propagation both for intra- and interprediction process and shows adequate robustness against recompression attack as well as common image processing attacks while maintaining decent visual quality.
Abstract: It has been observed in the recent literature that the drift error due to watermarking degrades the visual quality of the embedded video. The existing drift error handling strategies for recent video standards such as H.264 may not be directly applicable for upcoming high-definition video standards (such as High Efficiency Video Coding (HEVC)) due to different compression architecture. In this article, a compressed domain watermarking scheme is proposed for H.265/HEVC bit stream that can handle drift error propagation both for intra- and interprediction process. Additionally, the proposed scheme shows adequate robustness against recompression attack as well as common image processing attacks while maintaining decent visual quality. A comprehensive set of experiments has been carried out to justify the efficacy of the proposed scheme over the existing literature.

Journal ArticleDOI
TL;DR: This article presents a novel multimedia hashing framework, called Label Preserving Multimedia Hashing (LPMH), which is competitive with state-of-the-art methods in both speed and accuracy for multimedia similarity search.
Abstract: Learning-based hashing has been researched extensively in the past few years due to its great potential in fast and accurate similarity search among huge volumes of multimedia data. In this article, we present a novel multimedia hashing framework, called Label Preserving Multimedia Hashing (LPMH) for multimedia similarity search. In LPMH, a general optimization method is used to learn the joint binary codes of multiple media types by explicitly preserving semantic label information. Compared with existing hashing methods which are typically developed under and thus restricted to some specific objective functions, the proposed optimization strategy is not tied to any specific loss function and can easily incorporate bit balance constraints to produce well-balanced binary codes. Specifically, our formulation leads to a set of Binary Integer Programming (BIP) problems that have exact solutions both with and without bit balance constraints. These problems can be solved extremely fast and the solution can easily scale up to large-scale datasets. In the hash function learning stage, the boosted decision trees algorithm is utilized to learn multiple media-specific hash functions that can map heterogeneous data sources into a homogeneous Hamming space for cross-media retrieval. We have comprehensively evaluated the proposed method using a range of large-scale datasets in both single-media and cross-media retrieval tasks. The experimental results demonstrate that LPMH is competitive with state-of-the-art methods in both speed and accuracy.
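
Once items from different media types are mapped into a common Hamming space, as this abstract describes, retrieval reduces to ranking binary codes by Hamming distance. The sketch below shows only that final search step, with codes stored as Python integers; learning the codes and hash functions (the BIP problems and boosted decision trees) is outside its scope. The item names and codes in the usage note are invented.

```python
def hamming(a, b):
    """Number of differing bits between two binary codes stored as integers."""
    return bin(a ^ b).count("1")

def search(query_code, database, top_k):
    """Return the names of the top_k items closest to the query in Hamming space.

    `database` maps item names to integer codes; ties are broken by name
    so that the ranking is deterministic.
    """
    ranked = sorted(
        database.items(),
        key=lambda item: (hamming(query_code, item[1]), item[0]),
    )
    return [name for name, _ in ranked[:top_k]]
```

For instance, with a database `{"img_a": 0b1010, "txt_b": 0b1000, "img_c": 0b0101}` and query `0b1011`, the distances are 1, 2, and 3, so the image and the text item can be ranked together regardless of their original modality.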

Journal ArticleDOI
TL;DR: This survey identifies three core elements of SSI and delivers a timely discussion on SSI oriented around the screen, the smart device, and the interaction modality.
Abstract: The meeting of pervasive screens and smart devices has witnessed the birth of screen-smart device interaction (SSI), a key enabler to many novel interactive use cases. Most current surveys focus on direct human-screen interaction, and to the best of our knowledge, none have studied state-of-the-art SSI. This survey identifies three core elements of SSI and delivers a timely discussion on SSI oriented around the screen, the smart device, and the interaction modality. Two evaluation metrics (i.e., interaction latency and accuracy) have been adopted and refined to match the evaluation criterion of SSI. The bottlenecks that hinder the further advancement of the current SSI in connection with these metrics are studied. Last, future research challenges and opportunities are highlighted in the hope of inspiring continuous research efforts to realize the next generation of SSI.

Journal ArticleDOI
TL;DR: PLACID is an automated PLatform for Accelerator CreatIon for DCNNs that enables generation of an accelerator with the highest throughput for a given DCNN on a specific target FPGA platform and shows that architectures synthesized by PLACID yield 2× higher throughput density than the best competing approach.
Abstract: Deep Convolutional Neural Networks (DCNNs) exhibit remarkable performance in a number of pattern recognition and classification tasks. Modern DCNNs involve many millions of parameters and billions of operations. Inference using such DCNNs, if implemented as software running on an embedded processor, results in considerable execution time and energy consumption, which is prohibitive in many mobile applications. Field-programmable gate array (FPGA)-based acceleration of DCNN inference is a promising approach to improve both energy consumption and classification throughput. However, the engineering effort required for development and verification of an optimized FPGA-based architecture is significant. In this article, we present PLACID, an automated PLatform for Accelerator CreatIon for DCNNs. PLACID uses an analytical approach to characterization and exploration of the implementation space. PLACID enables generation of an accelerator with the highest throughput for a given DCNN on a specific target FPGA platform. Subsequently, it generates an RTL level architecture in Verilog, which can be passed onto commercial tools for FPGA implementation. PLACID is fully automated, and reduces the accelerator design time from a few months down to a few hours. Experimental results show that architectures synthesized by PLACID yield 2× higher throughput density than the best competing approach.

Journal ArticleDOI
TL;DR: A novel framework that can produce a visual description of a tourist attraction by choosing the most diverse pictures from community-contributed datasets, which describe different details of the queried location is presented.
Abstract: In this article, we present a novel framework that can produce a visual description of a tourist attraction by choosing the most diverse pictures from community-contributed datasets, which describe different details of the queried location. The main strength of the proposed approach is its flexibility that permits us to filter out non-relevant images and to obtain a reliable set of diverse and relevant images by first clustering similar images according to their textual descriptions and their visual content and then extracting images from different clusters according to a measure of the user’s credibility. Clustering is based on a two-step process, where textual descriptions are used first and the clusters are then refined according to the visual features. The degree of diversification can be further increased by exploiting users’ judgments on the results produced by the proposed algorithm through a novel approach, where users provide not only relevance feedback but also diversity feedback. Experimental results performed on the MediaEval 2015 “Retrieving Diverse Social Images” dataset show that the proposed framework can achieve very good performance both in the case of automatic retrieval of diverse images and in the case of the exploitation of the users’ feedback. The effectiveness of the proposed approach has also been confirmed by a small case study involving a number of real users.

Journal ArticleDOI
TL;DR: A fast image search framework, named DeepSearch, which makes complex image search based on CNNs feasible on mobile phones by significantly speeding up the CNN models, making CNN-based image search practical on common smart phones.
Abstract: Content-based image retrieval (CBIR) is one of the most important applications of computer vision. In recent years, there have been many important advances in the development of CBIR systems, especially Convolutional Neural Networks (CNNs) and other deep-learning techniques. On the other hand, current CNN-based CBIR systems suffer from the high computational complexity of CNNs. This problem becomes more severe as mobile applications become more and more popular. The current practice is to deploy the entire CBIR system on the server side while the client side only serves as an image provider. This architecture can increase the computational burden on the server side, which needs to process thousands of requests per second. Moreover, sending images has the potential of personal information leakage. As the need for mobile search expands, concerns about privacy are growing. In this article, we propose a fast image search framework, named DeepSearch, which makes complex image search based on CNNs feasible on mobile phones. To implement the huge computation of CNN models, we present a tensor Block Term Decomposition (BTD) approach as well as a nonlinear response reconstruction method to accelerate the CNNs involved in object detection and feature extraction. Extensive experiments on the ImageNet dataset and the Alibaba Large-scale Image Search Challenge dataset show that the proposed accelerating approach BTD can significantly speed up the CNN models and further makes CNN-based image search practical on common smart phones.

Journal ArticleDOI
TL;DR: A video bitrate adaptation and prediction mechanism based on Fuzzy logic for HAS players, which takes into consideration the estimate of available network bandwidth as well as the predicted buffer occupancy level in order to proactively and intelligently respond to current conditions is proposed.
Abstract: The Hypertext Transfer Protocol (HTTP) Adaptive Streaming (HAS) has now become ubiquitous and accounts for a large amount of video delivery over the Internet. But since the Internet is prone to bandwidth variations, HAS's up and down switching between different video bitrates to keep up with bandwidth variations leads to a reduction in Quality of Experience (QoE). In this article, we propose a video bitrate adaptation and prediction mechanism based on Fuzzy logic for HAS players, which takes into consideration the estimate of available network bandwidth as well as the predicted buffer occupancy level in order to proactively and intelligently respond to current conditions. This leads to two contributions: First, it allows HAS players to take appropriate actions, sooner than existing methods, to prevent playback interruptions caused by buffer underrun, reducing the ON-OFF traffic phenomena associated with current approaches and increasing the QoE. Second, it facilitates fair sharing of bandwidth among competing players at the bottleneck link. We present the implementation of our proposed mechanism and provide both empirical/QoE analysis and performance comparison with existing work. Our results show that, compared to existing systems, our system has (1) better fairness among multiple competing players by almost 50% on average and as much as 80% as indicated by Jain's fairness index and (2) better perceived quality of video by almost 8% on average and as much as 17%, according to the estimated Mean Opinion Score (eMOS) model.
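
The fairness figures above are reported with Jain's fairness index, which for n allocations x_1..x_n is (sum of x_i)^2 / (n * sum of x_i^2) and ranges from 1/n (one player takes everything) to 1 (equal shares). The following is a minimal sketch of that formula only; the function name and the sample bitrates are illustrative assumptions and are not taken from the article.

```python
def jain_fairness_index(allocations):
    """Jain's fairness index for a list of non-negative allocations.

    Returns a value in [1/n, 1]; 1 means every player gets the same share.
    """
    n = len(allocations)
    if n == 0:
        raise ValueError("need at least one allocation")
    total = sum(allocations)
    sum_of_squares = sum(x * x for x in allocations)
    if sum_of_squares == 0:
        # All allocations are zero; treating this as perfectly equal
        # is a convention chosen for this sketch.
        return 1.0
    return (total * total) / (n * sum_of_squares)


# Hypothetical per-player bitrates in Mbit/s (not measured data).
equal_share = jain_fairness_index([2.0, 2.0, 2.0, 2.0])
one_player_takes_all = jain_fairness_index([8.0, 0.0, 0.0, 0.0])
print(equal_share, one_player_takes_all)
```

With four players, the index falls to 1/4 when a single player receives the whole link and rises to 1 when the link is split evenly.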

Journal ArticleDOI
TL;DR: This article proposes to analyze and evaluate one of the famous engines in the market, that is, “Unity 3D”, and proposes a test-bed to evaluate the CPU and GPU consumption per frame and per module for nine representative games on three platforms, namely, a stand-alone computer, embedded systems, and web players.
Abstract: Mobile gaming is an emerging concept wherein gamers use mobile devices, like smartphones and tablets, to play best-seller games. Compared to dedicated gaming boxes or PCs, these devices still fall short when executing new, complex 3D video games with rich immersion. Three novel solutions, relying on cloud computing infrastructure, namely, computation offloading, cloud gaming, and client-server architecture, will represent the next generation of game engine architecture aiming at improving the gaming experience. The basis of these solutions is the distribution of the game code over different devices (including set-top boxes, PCs, and servers). In order to know how the game code should be distributed, advanced knowledge of game engines is required. Consequently, dissecting and analyzing game engine performance will help to better understand how to move in these new directions (i.e., distribute game code), an analysis that is so far missing in the literature. Aiming at filling this gap, we propose in this article to analyze and evaluate one of the famous engines in the market, that is, “Unity 3D.” We begin by detailing the architecture and the game logic of game engines. Then, we propose a test-bed to evaluate the CPU and GPU consumption per frame and per module for nine representative games on three platforms, namely, a stand-alone computer, embedded systems, and web players. Based on the obtained results and observations, we build a valued graph of each module composing the Unity 3D architecture, which reflects the internal flow and CPU consumption. Finally, we compare these architectures in terms of CPU consumption.

Journal ArticleDOI
TL;DR: A literature-based analytical study of what kind of issues location-based game design faces, and how they can be solved, and presents O-Mopsi game that combines physical activity with problem solving.
Abstract: Location-based games have been around since 2000, but only recently, when PokemonGo came to market, did it become clear that they can reach wide popularity. In this article, we perform a literature-based analytical study of what kind of issues location-based game design faces, and how they can be solved. We study how to use and verify the location, the role of the games as exergames, use in education, and study technical and safety issues. As a case study, we present the O-Mopsi game, which combines physical activity with problem solving. It includes three challenges: (1) navigating to the next target, (2) deciding the order of targets, (3) physical movement. All of them are unavoidable and relevant. For guiding the players, we use three types of multimedia: images (targets and maps), sound (user guidance), and GPS (for positioning). We discuss motivational aspects, analysis of the playing, and content creation. The quality of experiences is reported based on playing in SciFest Science festivals during 2011-2016.

Journal ArticleDOI
TL;DR: The Semantic Event Retrieval System is presented which shows the importance of high-level concepts in a vocabulary for the retrieval of complex and generic high-level events and uses a novel concept selection method (i-w2v) based on semantic embeddings.
Abstract: Searching in digital video data for high-level events, such as a parade or a car accident, is challenging when the query is textual and lacks visual example images or videos. Current research in deep neural networks is highly beneficial for the retrieval of high-level events using visual examples, but without examples it is still hard to (1) determine which concepts are useful to pre-train (Vocabulary challenge) and (2) which pre-trained concept detectors are relevant for a certain unseen high-level event (Concept Selection challenge). In our article, we present our Semantic Event Retrieval System which (1) shows the importance of high-level concepts in a vocabulary for the retrieval of complex and generic high-level events and (2) uses a novel concept selection method (i-w2v) based on semantic embeddings. Our experiments on the international TRECVID Multimedia Event Detection benchmark show that a diverse vocabulary including high-level concepts improves performance on the retrieval of high-level events in videos and that our novel method outperforms a knowledge-based concept selection method.

Journal ArticleDOI
TL;DR: These studies show that Random Forests-based prediction models achieve high accuracy for both the INRS audiovisual quality dataset and other publicly available comparable datasets.
Abstract: In order to mechanically predict audiovisual quality in interactive multimedia services, we have developed machine learning-based no-reference parametric models. We have compared Decision Trees-based ensemble methods, Genetic Programming and Deep Learning models that have one and more hidden layers. We have used the Institut national de la recherche scientifique (INRS) audiovisual quality dataset specifically designed to include ranges of parameters and degradations typically seen in real-time communications. Decision Trees-based ensemble methods have outperformed both Deep Learning- and Genetic Programming-based models in terms of Root-Mean-Square Error (RMSE) and Pearson correlation values. We have also trained and developed models on various publicly available datasets and have compared our results with those of these original models. Our studies show that Random Forests-based prediction models achieve high accuracy for both the INRS audiovisual quality dataset and other publicly available comparable datasets.
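
The two figures of merit named above, RMSE and Pearson correlation, can be computed as in the short sketch below. The function names and the sample quality scores are illustrative assumptions; they are not values from the INRS dataset or from the article.

```python
import math


def rmse(predicted, observed):
    """Root-mean-square error between two equally long score lists."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(sum(squared_errors) / len(predicted))


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical subjective scores and model predictions (made-up numbers).
observed_scores = [1.0, 2.0, 3.0, 4.0]
predicted_scores = [1.5, 2.5, 3.5, 4.5]
error = rmse(predicted_scores, observed_scores)
correlation = pearson(predicted_scores, observed_scores)
print(error, correlation)
```

The example is chosen to show why both metrics are reported together: a constant offset leaves the correlation perfect while the RMSE still registers the bias.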

Journal ArticleDOI
TL;DR: Through mathematical analysis, it is shown that transmission failure differentiation, or transmission collision detection, helps a node to efficiently reserve a time slot even with a large number of nodes contending for time slots, and significantly improves the performance of D-TDMA protocols.
Abstract: The increasing number of road accidents has led to the evolution of vehicular ad hoc networks (VANETs), which allow vehicles and roadside infrastructure to continuously broadcast safety messages, including necessary information to avoid undesired events on the road. To support reliable broadcast of safety messages, distributed time division multiple access (D-TDMA) protocols are proposed for medium access control in VANETs. Existing D-TDMA protocols react to a transmission failure without distinguishing whether the failure comes from a transmission collision or from a poor radio channel condition, resulting in degraded performance. In this article, we present the importance of transmission failure differentiation due to a poor channel or due to a transmission collision for D-TDMA protocols in vehicular networks. We study the effects of such a transmission failure differentiation on the performance of a node when reserving a time slot to access the transmission channel. Furthermore, we propose a method for transmission failure differentiation, employing the concept of deep-learning techniques, for a node to decide whether to release or continue using its acquired time slot. The proposed method is based on the application of a Markov chain model to estimate the channel state when a transmission failure occurs. The Markov model parameters are dynamically updated by each node (i.e., vehicle or roadside unit) based on information included in the safety messages that are periodically received from neighboring nodes. In addition, from the D-TDMA protocol headers of received messages, a node approximately determines the error in estimating the channel state based on the proposed Markov model and then uses this channel estimation error to further improve subsequent channel state estimations. 
Through mathematical analysis, we show that transmission failure differentiation, or transmission collision detection, helps a node to efficiently reserve a time slot even with a large number of nodes contending for time slots. Furthermore, through extensive simulations in a highway scenario, we demonstrate that the proposed solution significantly improves the performance of D-TDMA protocols by reducing unnecessary contention on the available time slots, thus increasing the number of nodes having unique time slots for successful broadcast of safety messages.
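
To give a feel for Markov-chain channel estimation, the sketch below uses a two-state (good/bad) chain and propagates the probability that the channel is bad. The abstract does not specify the chain's structure, so the two-state form, the transition probabilities, and all names here are assumptions for illustration and should not be read as the authors' model.

```python
def next_bad_probability(prob_bad, p_good_to_bad, p_bad_to_good):
    """One step of a two-state Markov chain: probability the channel is bad."""
    stays_bad = prob_bad * (1.0 - p_bad_to_good)
    turns_bad = (1.0 - prob_bad) * p_good_to_bad
    return stays_bad + turns_bad


def stationary_bad_probability(p_good_to_bad, p_bad_to_good):
    """Long-run probability of the bad state for the same chain."""
    if p_good_to_bad + p_bad_to_good == 0:
        raise ValueError("chain never changes state; no unique stationary value")
    return p_good_to_bad / (p_good_to_bad + p_bad_to_good)


# Hypothetical transition probabilities (not taken from the article).
P_GOOD_TO_BAD = 0.1
P_BAD_TO_GOOD = 0.3

belief_bad = 0.0  # start by assuming the channel is good
for _ in range(200):
    belief_bad = next_bad_probability(belief_bad, P_GOOD_TO_BAD, P_BAD_TO_GOOD)

long_run_bad = stationary_bad_probability(P_GOOD_TO_BAD, P_BAD_TO_GOOD)
print(belief_bad, long_run_bad)
```

In a scheme of this kind, a node could compare the estimated bad-channel probability against a threshold when deciding whether a failure is more likely a collision; how the article actually makes that decision is not stated in the abstract.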

Journal ArticleDOI
TL;DR: Experimental results show that the proposed algorithm can significantly improve the accuracy of the CTU-level rate control and thus the coding performance; the proposed scheme consistently outperforms HM 16.0 and other state-of-the-art algorithms in a variety of testing configurations.
Abstract: Rate control is a crucial consideration in high-efficiency video coding (HEVC). The estimation of model parameters is very important for coding tree unit (CTU)-level rate control, as it will significantly affect bit allocation and thus coding performance. However, the estimation of the model parameters in CTU-level rate control sometimes fails because of inadequate consideration of the correlation between the model parameters and the complexity characteristics. In this study, we establish a novel complexity correlation-based CTU-level rate control for HEVC. First, we formulate the model parameter estimation scheme as a multivariable estimation problem. Second, based on the complexity correlation of the neighbouring CTUs, an optimal direction is selected from five directions for reference CTU set selection during model parameter estimation to further improve the prediction accuracy of the complexity of the current CTU. Third, to improve their precision, the relationship between the model parameters and the complexity of the reference CTU set in the optimal direction is established by using the least squares (LS) method, and the model parameters are solved via the estimated complexity of the current CTU. Experimental results show that the proposed algorithm can significantly improve the accuracy of the CTU-level rate control and thus the coding performance; the proposed scheme consistently outperforms HM 16.0 and other state-of-the-art algorithms in a variety of testing configurations. More specifically, up to 8.4% and on average 6.4% BD-Rate reduction is achieved compared to HM 16.0 and up to 4.7% and an average of 3.4% BD-Rate reduction is achieved compared to other algorithms, with only a slight complexity overhead.
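
The least squares step can be pictured with an ordinary straight-line fit: observe (complexity, parameter) pairs on a reference set, fit a line, then evaluate it at the estimated complexity of the current unit. The abstract does not give the functional form the authors fit, so the linear relationship, the variable names, and the numbers below are assumptions made only for this sketch.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical reference-set data: complexity measure vs. model parameter.
reference_complexity = [1.0, 2.0, 3.0, 4.0]
reference_parameter = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(reference_complexity, reference_parameter)

# Evaluate the fitted line at an assumed complexity estimate for the current unit.
estimated_complexity = 5.0
predicted_parameter = slope * estimated_complexity + intercept
print(slope, intercept, predicted_parameter)
```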

Journal ArticleDOI
TL;DR: A framework for authoring tactile cues (tactile gestures as used in this article) and enabling automatic rendering of said gestures to intensify emotional reactions in an immersive film experience is presented.
Abstract: The film industry continuously strives to make visitors’ movie experience more immersive and thus, more captivating. This is realized through larger screens, sophisticated speaker systems, and high quality 2D and 3D content. Moreover, a recent trend in the film industry is to incorporate multiple interaction modalities, such as 4D film, to simulate rain, wind, vibration, and heat, in order to intensify viewers’ emotional reactions. In this context, humans’ sense of touch possesses significant potential for intensifying emotional reactions for the film experience beyond audio-visual sensory modalities. This article presents a framework for authoring tactile cues (tactile gestures as used in this article) and enabling automatic rendering of said gestures to intensify emotional reactions in an immersive film experience. To validate the proposed framework, we conducted an experimental study where tactile gestures are designed and evaluated for the ability to intensify four emotional reactions: high valence-high arousal, high valence-low arousal, low valence-high arousal, and low valence-low arousal. Using a haptic jacket, participants felt tactile gestures that are synchronized with the audio-visual contents of a film. Results demonstrated that (1) any tactile feedback generated a positive user experience; (2) the tactile feedback intensifies emotional reactions when the audio-visual stimuli elicit clear emotional responses, except for low arousal emotional responses, since tactile gestures seem to always generate excitement; (3) purposed tactile gestures do not seem to significantly outperform randomized tactile gestures for intensifying specific emotional reactions; and (4) using a haptic jacket is not distracting for the users.