
Showing papers in "ACM Transactions on Multimedia Computing, Communications, and Applications in 2016"


Journal ArticleDOI
TL;DR: This article proposes a novel deep feature learning paradigm based on social collective intelligence, which can be acquired from the inexhaustible social multimedia content on the Web, in particular, largely social images and tags, and offers an easy-to-use implementation.
Abstract: Feature representation for visual content is the key to the progress of many fundamental applications such as annotation and cross-modal retrieval. Although recent advances in deep feature learning offer a promising route towards these tasks, they are limited in application domains where high-quality and large-scale training data are expensive to obtain. In this article, we propose a novel deep feature learning paradigm based on social collective intelligence, which can be acquired from the inexhaustible social multimedia content on the Web, in particular, largely social images and tags. Differing from existing feature learning approaches that rely on high-quality image-label supervision, our weak supervision is acquired by mining the visual-semantic embeddings from noisy, sparse, and diverse social image collections. The resultant image-word embedding space can be used to (1) fine-tune deep visual models for low-level feature extractions and (2) seek sparse representations as high-level cross-modal features for both image and text. We offer an easy-to-use implementation for the proposed paradigm, which is fast and compatible with any state-of-the-art deep architecture. Extensive experiments on several benchmarks demonstrate that the cross-modal features learned by our paradigm significantly outperform others in various applications such as content-based retrieval, classification, and image captioning.

121 citations


Journal ArticleDOI
TL;DR: This work designs a purely distributed D2D video distribution scheme without the monitoring of any central controller and provides a practical implementation of the scheme, which does not need to know the video availability, user demand, and device mobility.
Abstract: As video traffic has dominated the data flow of smartphones, traditional cellular communications face substantial transmission challenges. In this work, we study mobile device-to-device (D2D) video distribution that leverages the storage and communication capacities of smartphones. In such a mobile distributed framework, D2D communication represents an opportunistic process to selectively store and transmit local videos to meet the future demand of others. The performance is measured by the service time, which denotes the elapsed period for fulfilling the demand, and the corresponding implementation of each device depends on the video’s demand, availability, and size. The main contributions of this work lie in (1) considering the impact of video size in a practical mobile D2D video distribution scenario and proposing a general global estimation of the video distribution based on limited and local observations; (2) designing a purely distributed D2D video distribution scheme without the monitoring of any central controller; and (3) providing a practical implementation of the scheme, which does not need to know the video availability, user demand, and device mobility. Numerical results have demonstrated the efficiency and robustness of the proposed scheme.

116 citations


Journal ArticleDOI
TL;DR: A new analytical model to investigate the performance of SDN in the presence of the bursty and correlated arrivals modelled by the Markov Modulated Poisson Process (MMPP) is presented and the Quality-of-Service performance metrics are derived based on the developed analytical model.
Abstract: Software-Defined Networking (SDN) is an emerging architecture for the next-generation Internet, providing unprecedented network programmability to handle the explosive growth of big data driven by the popularisation of smart mobile devices and the pervasiveness of content-rich multimedia applications. In order to quantitatively investigate the performance characteristics of SDN networks, several research efforts from both simulation experiments and analytical modelling have been reported in the current literature. Among those studies, analytical modelling has demonstrated its superiority in terms of cost-effectiveness in the evaluation of large-scale networks. However, for analytical tractability and simplification, existing analytical models are derived based on the unrealistic assumptions that the network traffic follows the Poisson process, which is suitable to model nonbursty text data, and the data plane of SDN is modelled by one simplified Single-Server Single-Queue (SSSQ) system. Recent measurement studies have shown that, due to the features of heavy volume and high velocity, the multimedia big data generated by real-world multimedia applications reveals the bursty and correlated nature in the network transmission. With the aim of capturing such features of realistic traffic patterns and obtaining a comprehensive and deeper understanding of the performance behaviour of SDN networks, this article presents a new analytical model to investigate the performance of SDN in the presence of the bursty and correlated arrivals modelled by the Markov Modulated Poisson Process (MMPP). The Quality-of-Service performance metrics in terms of the average latency and average network throughput of the SDN networks are derived based on the developed analytical model. To consider a realistic multiqueue system of forwarding elements, a Priority-Queue (PQ) system is adopted to model the SDN data plane. To address the challenging problem of obtaining the key performance metrics, for example, queue-length distribution of a PQ system with a given service capacity, a versatile methodology extending the Empty Buffer Approximation (EBA) method is proposed to facilitate the decomposition of such a PQ system to two SSSQ systems. The validity of the proposed model is demonstrated through extensive simulation experiments. To illustrate its application, the developed model is then utilised to study the strategy of the network configuration and resource allocation in SDN networks.

74 citations
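
Illustrative sketch: the paper's analytical model is not reproduced here, but the traffic assumption it rests on is easy to demonstrate. The short Python simulation below (all rates invented for illustration) generates arrivals from a two-state MMPP and feeds them into a single-server FIFO queue to estimate average latency; the paper's priority-queue data plane and EBA decomposition go well beyond this.

import random

random.seed(1)

# Illustrative two-state MMPP parameters (not from the paper):
# arrival rates in each modulating state and the state-transition rates.
LAMBDA = [2.0, 20.0]      # packets/s in the "calm" and "bursty" states
SWITCH = [0.1, 0.5]       # rate of leaving state 0 and state 1
MU = 25.0                 # exponential service rate of the forwarding element
HORIZON = 20_000.0        # simulated seconds

def mmpp_arrivals(horizon):
    """Generate arrival times of a 2-state Markov Modulated Poisson Process."""
    t, state, arrivals = 0.0, 0, []
    next_switch = random.expovariate(SWITCH[state])
    while t < horizon:
        gap = random.expovariate(LAMBDA[state])
        if t + gap < next_switch:
            t += gap
            arrivals.append(t)
        else:                      # the modulating chain jumps to the other state
            t = next_switch
            state = 1 - state
            next_switch = t + random.expovariate(SWITCH[state])
    return arrivals

def fifo_latency(arrivals):
    """Average sojourn time (queueing + service) in a single-server FIFO queue."""
    free_at, total = 0.0, 0.0
    for a in arrivals:
        start = max(a, free_at)
        free_at = start + random.expovariate(MU)
        total += free_at - a
    return total / len(arrivals)

arr = mmpp_arrivals(HORIZON)
print(f"{len(arr)} arrivals, average latency = {fifo_latency(arr) * 1000:.2f} ms")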


Journal ArticleDOI
TL;DR: This article presents a system to automatically generate visual-textual presentation layouts by investigating a set of aesthetic design principles, through which an average user can easily create visually appealing layouts, and demonstrates that the designs achieve the best reading experience compared with the reimplementation of parts of existing state-of-the-art designs.
Abstract: Visual-textual presentation layout (e.g., digital magazine cover, poster, PowerPoint slides, and any other rich media), which combines a beautiful image and overlaid readable text, can result in an eye-catching touch to attract users’ attention. The design of visual-textual presentation layouts is therefore becoming ubiquitous in both commercially printed publications and online digital magazines. However, handcrafting aesthetically compelling layouts still remains challenging for many small businesses and amateur users. This article presents a system to automatically generate visual-textual presentation layouts by investigating a set of aesthetic design principles, through which an average user can easily create visually appealing layouts. The system is equipped with a set of topic-dependent layout templates and a computational framework integrating high-level aesthetic principles (in a top-down manner) and low-level image features (in a bottom-up manner). The layout templates, designed with prior knowledge from domain experts, define spatial layouts, semantic colors, harmonic color models, and font emotion and size constraints. We formulate the typography as an energy optimization problem by minimizing the cost of text intrusion, the utility of visual space, and the mismatch of information importance in perception and semantics, constrained by the automatically selected template and further preserving color harmonization. We demonstrate that our designs achieve the best reading experience compared with the reimplementation of parts of existing state-of-the-art designs through a series of user studies.

70 citations
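
Illustrative sketch: the following toy Python snippet mimics the idea of treating typography as energy minimization. It scores every candidate position of a fixed-size text box on a synthetic saliency map using a made-up cost (text intrusion plus a crude placement prior) and picks the minimum; the paper's actual energy terms, templates, and color-harmonization constraints are not modelled.

import numpy as np

rng = np.random.default_rng(0)

# Toy "saliency" map standing in for the image analysis (higher = more important pixels).
H, W = 60, 90
saliency = rng.random((H, W))
saliency[20:40, 30:60] += 2.0          # pretend the main subject sits here

BOX_H, BOX_W = 12, 30                  # size of the text block to place

def energy(y, x, w_intrusion=1.0, w_prior=0.2):
    """Toy layout energy: penalise covering salient pixels, mildly prefer the upper third."""
    intrusion = saliency[y:y + BOX_H, x:x + BOX_W].sum()
    prior_penalty = abs(y - H // 6)    # crude stand-in for template/aesthetic priors
    return w_intrusion * intrusion + w_prior * prior_penalty

best = min(
    ((energy(y, x), y, x)
     for y in range(H - BOX_H)
     for x in range(W - BOX_W)),
    key=lambda t: t[0],
)
print(f"place text box at row={best[1]}, col={best[2]} (energy={best[0]:.1f})")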


Journal ArticleDOI
TL;DR: This work introduces an adaptation algorithm for HTTP-based live streaming called LOLYPOP (short for low-latency prediction-based adaptation), which is designed to operate with a transport latency of a few seconds, and leverages Transmission Control Protocol throughput predictions on multiple time scales.
Abstract: Recently, Hypertext Transfer Protocol (HTTP)-based adaptive streaming has become the de facto standard for video streaming over the Internet. It allows clients to dynamically adapt media characteristics to the varying network conditions to ensure a high quality of experience (QoE)—that is, minimize playback interruptions while maximizing video quality at a reasonable level of quality changes. In the case of live streaming, this task becomes particularly challenging due to the latency constraints. The challenge further increases if a client uses a wireless access network, where the throughput is subject to considerable fluctuations. Consequently, live streams often exhibit latencies of up to 20 to 30 seconds. In the present work, we introduce an adaptation algorithm for HTTP-based live streaming called LOLYPOP (short for low-latency prediction-based adaptation), which is designed to operate with a transport latency of a few seconds. To reach this goal, LOLYPOP leverages Transmission Control Protocol throughput predictions on multiple time scales, from 1 to 10 seconds, along with estimations of the relative prediction error distributions. In addition to satisfying the latency constraint, the algorithm heuristically maximizes the QoE by maximizing the average video quality as a function of the number of skipped segments and quality transitions. To select an efficient prediction method, we studied the performance of several time series prediction methods in IEEE 802.11 wireless access networks. We evaluated LOLYPOP under a large set of experimental conditions, limiting the transport latency to 3 seconds, against a state-of-the-art adaptation algorithm called FESTIVE. We observed that the average selected video representation index is by up to a factor of 3 higher than with the baseline approach. We also observed that LOLYPOP is able to reach points from a broader region in the QoE space, and thus it is better adjustable to the user profile or service provider requirements.

56 citations
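
Illustrative sketch: the Python fragment below shows the core idea of prediction-based rate selection under a latency budget, not the published LOLYPOP algorithm. It discounts a throughput prediction by an assumed quantile of its relative error and picks the highest representation whose pessimistic download time still fits; all bitrates, durations, and thresholds are invented.

# Illustrative numbers only; LOLYPOP's actual decision logic is more involved.
BITRATES_KBPS = [400, 750, 1500, 2500, 4000]   # available representations
SEGMENT_SEC = 2.0                               # segment duration
LATENCY_BUDGET_SEC = 3.0                        # target transport latency

def select_representation(predicted_tput_kbps, rel_error_quantile):
    """Highest bitrate whose pessimistic download time still meets the budget."""
    # Discount the throughput prediction by an upper quantile of its relative error.
    safe_tput = predicted_tput_kbps / (1.0 + rel_error_quantile)
    best = BITRATES_KBPS[0]
    for rate in BITRATES_KBPS:
        download_time = rate * SEGMENT_SEC / safe_tput
        if download_time <= LATENCY_BUDGET_SEC - SEGMENT_SEC:
            best = rate
    return best

print(select_representation(predicted_tput_kbps=3200, rel_error_quantile=0.25))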


Journal ArticleDOI
TL;DR: The proposed framework first localizes the moving-sounding objects through multimodal analysis and generates an audio attention map, in which a greater value denotes a higher probability of a position being the sound source, and then calculates the spatial and temporal attention maps using only the visual modality.
Abstract: In this article, we propose to predict human eye fixation through incorporating both audio and visual cues. Traditional visual attention models generally make the utmost of stimuli’s visual features, yet they bypass all audio information. In the real world, however, we not only direct our gaze according to visual saliency, but also are attracted by salient audio cues. Psychological experiments show that audio has an influence on visual attention, and subjects tend to be attracted by the sound sources. Therefore, we propose fusing both audio and visual information to predict eye fixation. In our proposed framework, we first localize the moving, sound-generating objects through multimodal analysis and generate an audio attention map. Then, we calculate the spatial and temporal attention maps using the visual modality. Finally, the audio, spatial, and temporal attention maps are fused to generate the final audiovisual saliency map. The proposed method is applicable to scenes containing moving, sound-generating objects. We gather a set of video sequences and collect eye-tracking data under an audiovisual test condition. Experiment results show that we can achieve better eye fixation prediction performance when taking both audio and visual cues into consideration, especially in some typical scenes in which object motion and audio are highly correlated.

50 citations
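
Illustrative sketch: the fusion step can be pictured with a few lines of Python. The snippet below normalizes and linearly combines audio, spatial, and temporal attention maps with assumed weights; how each individual map is computed in the paper is not reproduced.

import numpy as np

def normalize(m):
    """Scale a map to [0, 1] so the three cues are comparable before fusion."""
    m = m.astype(float)
    return (m - m.min()) / (m.max() - m.min() + 1e-8)

def fuse_saliency(audio_map, spatial_map, temporal_map, w=(0.4, 0.3, 0.3)):
    """Weighted combination of audio, spatial, and temporal attention maps."""
    maps = [normalize(audio_map), normalize(spatial_map), normalize(temporal_map)]
    fused = sum(wi * mi for wi, mi in zip(w, maps))
    return normalize(fused)

rng = np.random.default_rng(0)
a, s, t = (rng.random((36, 64)) for _ in range(3))
print(fuse_saliency(a, s, t).shape)   # (36, 64), values in [0, 1]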


Journal ArticleDOI
TL;DR: This work provides the first comprehensive study of gamecast sharing sites, including commercial streaming-based sites such as Amazon’s Twitch.tv and community-maintained replay-based sites such as WoTreplays, and investigates basic characteristics and analyzes the activities of their creators and spectators.
Abstract: Online gaming franchises such as World of Tanks, Defense of the Ancients, and StarCraft have attracted hundreds of millions of users who, apart from playing the game, also socialize with each other through gaming and viewing gamecasts. As a form of User Generated Content (UGC), gamecasts play an important role in user entertainment and gamer education. They deserve the attention of both industrial partners and the academic communities, corresponding to the large amount of revenue involved and the interesting research problems associated with UGC sites and social networks. Although previous work has put much effort into analyzing general UGC sites such as YouTube, relatively little is known about the gamecast sharing sites. In this work, we provide the first comprehensive study of gamecast sharing sites, including commercial streaming-based sites such as Amazon’s Twitch.tv and community-maintained replay-based sites such as WoTreplays. We collect and share a novel dataset on WoTreplays that includes more than 380,000 game replays, shared by more than 60,000 creators with more than 1.9 million gamers. Together with an earlier published dataset on Twitch.tv, we investigate basic characteristics of gamecast sharing sites, and we analyze the activities of their creators and spectators. Among our results, we find that (i) WoTreplays and Twitch.tv are both fast-consumed repositories, with millions of gamecasts being uploaded, viewed, and soon forgotten; (ii) both the gamecasts and the creators exhibit highly skewed popularity, with a significant heavy tail phenomenon; and (iii) the upload and download preferences of creators and spectators are different: while the creators emphasize their individual skills, the spectators appreciate team-wise tactics. Our findings provide important knowledge for infrastructure and service improvement, for example, in the design of proper resource allocation mechanisms that consider future gamecasting and in the tuning of incentive policies that further help player retention.

48 citations


Journal ArticleDOI
TL;DR: With a suitably selected parameter η, the size of the processed video is effectively reduced without lowering the quality of experience, and the simulation shows that the model has a steady performance and is powerful enough for continuously growing multimedia big data.
Abstract: In the age of multimedia big data, the popularity of mobile devices has grown at an unprecedented rate, data is being generated faster than ever before, and Internet traffic is rapidly increasing, not only in volume but also in heterogeneity. Therefore, data processing and network overload have become two urgent problems. To address these problems, extensive papers have been published on image analysis using deep learning, but only a few works have exploited this approach for video analysis. In this article, a hybrid-stream model is proposed to solve these problems for video analysis. Functionality of this model covers Data Preprocessing, Data Classification, and Data-Load-Reduction Processing. Specifically, an improved Convolutional Neural Network (CNN) classification algorithm is designed to evaluate the importance of each video frame and video clip to enhance classification precision. Then, a reliable keyframe extraction mechanism will recognize the importance of each frame or clip, and decide whether to abandon it automatically by a series of correlation operations. The model will reduce data load to a dynamic threshold changed by σ, control the input size of the video in mobile Internet, and thus reduce network overload. Through experimental simulations, we find that the size of the processed video has been effectively reduced and the quality of experience (QoE) has not been lowered due to a suitably selected parameter η. The simulation also shows that the model has a steady performance and is powerful enough for continuously growing multimedia big data.

47 citations
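
Illustrative sketch: the data-load-reduction step can be illustrated with the Python fragment below, in which placeholder importance scores stand in for the paper's CNN outputs and frames below a dynamic, σ-scaled threshold are dropped; the threshold rule here is an assumption, not the paper's exact mechanism.

import numpy as np

rng = np.random.default_rng(42)

# Placeholder per-frame importance scores; in the paper these come from a CNN classifier.
scores = rng.random(300)

def reduce_frames(scores, sigma=0.5):
    """Keep frames whose importance exceeds a dynamic, sigma-scaled threshold."""
    threshold = scores.mean() + sigma * scores.std()
    return np.where(scores >= threshold)[0]

kept = reduce_frames(scores, sigma=0.5)
print(f"kept {len(kept)} of {len(scores)} frames "
      f"({100 * len(kept) / len(scores):.1f}% of the original load)")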


Journal ArticleDOI
TL;DR: A novel method for minimizing the end-to-end latency within a cloud gaming data center is proposed, along with a time-efficient Lagrangian Relaxation heuristic algorithm as a practical solution.
Abstract: Gaming on demand is an emerging service that has recently started to garner prominence in the gaming industry. Cloud-based video games provide affordable, flexible, and high-performance solutions for end-users with constrained computing resources and enable them to play high-end graphic games on low-end thin clients. Despite its advantages, cloud gaming's Quality of Experience (QoE) suffers from high and varying end-to-end delay. Since a significant part of computational processing, including game rendering and video compression, is performed in data centers, controlling the transfer of information within the cloud has an important impact on the quality of cloud gaming services. In this article, a novel method for minimizing the end-to-end latency within a cloud gaming data center is proposed. We formulate an optimization problem for reducing delay, and propose a Lagrangian Relaxation (LR) time-efficient heuristic algorithm as a practical solution. Simulation results indicate that the heuristic method can provide close-to-optimal solutions. Also, the proposed model reduces end-to-end delay and delay variation by almost 11% and 13.5%, respectively, and outperforms the existing server-centric and network-centric models. As a byproduct, our proposed method also achieves better fairness among multiple competing players by almost 45%, on average, in comparison with existing methods.

40 citations


Journal ArticleDOI
TL;DR: The proposed method for digitally simulating the sensation of taste by utilizing electrical stimulation on the human tongue does not require any chemical solutions and facilitates further research opportunities in several directions including human-computer interaction, virtual reality, food and beverage, as well as medicine.
Abstract: Among the five primary senses, the sense of taste is the least explored as a form of digital media applied in Human-Computer Interfaces. This article presents an experimental instrument, the Digital Lollipop, for digitally simulating the sensation of taste (gustation) by utilizing electrical stimulation on the human tongue. The system is capable of manipulating the properties of electric currents (magnitude, frequency, and polarity) to formulate different stimuli. To evaluate the effectiveness of this method, the system was experimentally tested in two studies. The first experiment was conducted using separate regions of the human tongue to record occurrences of basic taste sensations and their respective intensity levels. The results indicate occurrences of sour, salty, bitter, and sweet sensations from different regions of the tongue. One of the major discoveries of this experiment was that the sweet taste emerges via an inverse-current mechanism, which deserves further research in the future. The second study was conducted to compare natural and artificial (virtual) sour taste sensations and examine the possibility of effectively controlling the artificial sour taste at three intensity levels (mild, medium, and strong). The proposed method is attractive since it does not require any chemical solutions and facilitates further research opportunities in several directions including human-computer interaction, virtual reality, food and beverage, as well as medicine.

39 citations


Journal ArticleDOI
TL;DR: A new semantically aware photo retargeting method is proposed that shrinks a photo according to region semantics, together with a probabilistic model that enforces the spatial layout of a retargeted photo to be maximally similar to those from the training photos.
Abstract: With the popularity of mobile devices, photo retargeting has become a useful technique that adapts a high-resolution photo onto a low-resolution screen. Conventional approaches are limited in two aspects. The first factor is the de-emphasized role of semantic content that is many times more important than low-level features in photo aesthetics. Second is the importance of image spatial modeling: toward a semantically reasonable retargeted photo, the spatial distribution of objects within an image should be accurately learned. To solve these two problems, we propose a new semantically aware photo retargeting that shrinks a photo according to region semantics. The key technique is a mechanism transferring semantics of noisy image labels (inaccurate labels predicted by a learner like an SVM) into different image regions. In particular, we first project the local aesthetic features (graphlets in this work) onto a semantic space, wherein image labels are selectively encoded according to their noise level. Then, a category-sharing model is proposed to robustly discover the semantics of each image region. The model is motivated by the observation that the semantic distribution of graphlets from images tagged by a common label remains stable in the presence of noisy labels. Thereafter, a spatial pyramid is constructed to hierarchically encode the spatial layout of graphlet semantics. Based on this, a probabilistic model is proposed to enforce the spatial layout of a retargeted photo to be maximally similar to those from the training photos. Experimental results show that (1) noisy image labels predicted by different learners can improve the retargeting performance, according to both qualitative and quantitative analysis, and (2) the category-sharing model stays stable even when 32.36% of image labels are incorrectly predicted.

Journal ArticleDOI
TL;DR: A model is presented for network coverage probability and average rate analysis of D2D communication overlaying a two-tier downlink cellular network, where nineteen macro base stations, with pico base stations placed at the end points of macro cell borders, are employed according to the 3GPP specifications.
Abstract: Device-to-device (D2D) communication, which utilizes mobile devices located within close proximity for direct connection and data exchange, holds great promise for improving energy and spectrum efficiency of mobile multimedia in 5G networks. It has been observed that most available D2D-based works considered only the single-cell scenario with a single BS. Such scenario-based schemes, although tractable and able to illustrate the relationship between D2D links and cellular links, failed to take into account the distribution of surrounding base stations and user equipments (UEs), as well as the accumulated interference from ongoing transmissions in other cells. Furthermore, the single-tier network with one BS considered in available works is far from the real 5G scenario in which multi-tier BSs are heterogeneously distributed among the whole network area. In light of such observations, we present in this article a model for network coverage probability and average rate analysis in a D2D communication overlaying a two-tier downlink cellular network, where nineteen macro base stations (MBSs) with pico base stations (PBSs) placed at the end points of macro cell (hexagon) borders are employed according to the 3GPP specifications, and mobile users are spatially distributed according to the homogeneous Poisson Point Process model. Each mobile UE is able to establish a D2D link with adjacent UEs or connect to a nearby macro or pico base station. Stochastic geometric analysis is adopted to characterize the intratier interference distribution within the MBS-tier, PBS-tier, and D2D-tier, based on which network coverage probability and per-user average rate are derived with a careful consideration of important issues such as threshold value, SINR value, user density, content hit rate, spectrum allocation, and cell coverage range. Our results show that, even for the overlaying case, D2D communication can significantly improve network coverage probability and per-user average downlink rate. Another finding is that the frequency allocation for D2D communications should be carefully tuned according to network settings, which may result in totally different varying behaviors for the per-user average rate.
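
Illustrative sketch: the flavour of the stochastic geometric analysis can be conveyed by a Monte Carlo estimate. The Python snippet below drops base stations as a homogeneous Poisson Point Process, assumes Rayleigh fading and nearest-BS association, and estimates the probability that the SINR at the origin exceeds a threshold; the paper's two-tier, D2D-overlaid model with 3GPP deployment is substantially richer, and all parameter values here are invented.

import numpy as np

rng = np.random.default_rng(0)

def coverage_probability(bs_density=1e-5, alpha=4.0, sinr_thr_db=0.0,
                         noise=1e-13, area_side=10_000.0, trials=2000):
    """Estimate P(SINR > threshold) for a user at the origin served by its nearest BS."""
    thr = 10 ** (sinr_thr_db / 10)
    covered = 0
    for _ in range(trials):
        n_bs = rng.poisson(bs_density * area_side ** 2)
        if n_bs == 0:
            continue
        xy = rng.uniform(-area_side / 2, area_side / 2, size=(n_bs, 2))
        d = np.hypot(xy[:, 0], xy[:, 1])
        fading = rng.exponential(1.0, size=n_bs)        # Rayleigh fading power
        rx = fading * d ** (-alpha)                     # unit transmit power
        serving = np.argmin(d)
        interference = rx.sum() - rx[serving]
        if rx[serving] / (interference + noise) > thr:
            covered += 1
    return covered / trials

print(f"coverage probability ~ {coverage_probability():.2f}")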

Journal ArticleDOI
TL;DR: This article proposes a unified YouTube video recommendation solution by transferring and integrating users’ rich social and content information in Twitter network and shows that the proposed cross-network collaborative solution achieves superior performance not only in terms of accuracy, but also in improving the diversity and novelty of the recommended videos.
Abstract: Online video sharing sites are increasingly encouraging their users to connect to the social network venues such as Facebook and Twitter, with goals to boost user interaction and better disseminate the high-quality video content. This in turn provides huge possibilities to conduct cross-network collaboration for personalized video recommendation. However, very few efforts have been devoted to leveraging users’ social media profiles in the auxiliary network to capture and personalize their video preferences, so as to recommend videos of interest. In this article, we propose a unified YouTube video recommendation solution by transferring and integrating users’ rich social and content information in Twitter network. While general recommender systems often suffer from typical problems like cold-start and data sparsity, our proposed recommendation solution is able to effectively learn from users’ abundant auxiliary information on Twitter for enhanced user modeling and well address the typical problems in a unified framework. In this framework, two stages are mainly involved: (1) auxiliary-network data transfer, where user preferences are transferred from an auxiliary network by learning cross-network knowledge associations; and (2) cross-network data integration, where transferred user preferences are integrated with the observed behaviors on a target network in an adaptive fashion. Experimental results show that the proposed cross-network collaborative solution achieves superior performance not only in terms of accuracy, but also in improving the diversity and novelty of the recommended videos.

Journal ArticleDOI
TL;DR: It can be concluded that the presence of the audio has the ability to mask larger synchronization skews between the other media components in olfaction-enhanced multimedia presentations.
Abstract: Media-rich content plays a vital role in consumer applications today, as these applications try to find new and interesting ways to engage their users. Video, audio, and the more traditional forms of media content continue to dominate with respect to the use of media content to enhance the user experience. Tactile interactivity has also now become widely popular in modern computing applications, while our olfactory and gustatory senses continue to have a limited role. However, in recent times, there have been significant advancements regarding the use of olfactory media content (i.e., smell), and there are a variety of devices now available to enable its computer-controlled emission. This paper explores the impact of the audio stream on user perception of olfactory-enhanced video content in the presence of skews between the olfactory and video media. This research uses the results from two experimental studies of user-perceived quality of olfactory-enhanced multimedia, where audio was present and absent, respectively. Specifically, the paper shows that the user Quality of Experience (QoE) is generally higher in the absence of audio for nearly perfectly synchronized olfactory-enhanced multimedia presentations (i.e., an olfactory media skew within [−10s, +10s]); however, for greater olfactory media skews (ranging between [−30s, −10s] and [+10s, +30s]) user QoE is higher when the audio stream is present. It can be concluded that the presence of the audio has the ability to mask larger synchronization skews between the other media components in olfaction-enhanced multimedia presentations.

Journal ArticleDOI
TL;DR: A large-scale study of one of the most popular Porn 2.0 websites: YouPorn reveals a global delivery infrastructure that is repeatedly crawled to collect statistics and characterise the corpus, as well as inspecting popularity trends and how they relate to other features, for example, categories and ratings.
Abstract: Today, the Internet is a large multimedia delivery infrastructure, with websites such as YouTube appearing at the top of most measurement studies. However, most traffic studies have ignored an important domain: adult multimedia distribution. Whereas, traditionally, such services were provided primarily via bespoke websites, recently these have converged towards what is known as “Porn 2.0”. These services allow users to upload, view, rate, and comment on videos for free (much like YouTube). Despite their scale, we still lack even a basic understanding of their operation. This article addresses this gap by performing a large-scale study of one of the most popular Porn 2.0 websites: YouPorn. Our measurements reveal a global delivery infrastructure that we have repeatedly crawled to collect statistics (on 183k videos). We use this data to characterise the corpus, as well as to inspect popularity trends and how they relate to other features, for example, categories and ratings. To explore our discoveries further, we use a small-scale user study, highlighting key system implications.

Journal ArticleDOI
TL;DR: This article proposes a collectiveness-measuring method that quantifies collectiveness accurately and keeps high consistency with human perception.
Abstract: Crowd systems have motivated a surge of interest in many areas of multimedia, as they contain plenty of information about crowd scenes. In crowd systems, individuals tend to exhibit collective behaviors, and the motion of all those individuals is called collective motion. As a comprehensive descriptor of collective motion, collectiveness has been proposed to reflect the degree of individuals moving as an entirety. Nevertheless, existing works mostly have limitations in correctly finding the individuals of a crowd system and precisely capturing the various relationships between individuals, both of which are essential to measure collectiveness. In this article, we propose a collectiveness-measuring method that is capable of quantifying collectiveness accurately. Our main contributions are threefold: (1) we compute relatively accurate collectiveness by making the tracked feature points represent the individuals more precisely with a point selection strategy; (2) we jointly investigate the spatial-temporal information of individuals and utilize it to characterize the topological relationship between individuals by manifold learning; (3) we propose a stability descriptor to deal with the irregular individuals, which influence the calculation of collectiveness. Intensive experiments on the simulated and real world datasets demonstrate that the proposed method is able to compute relatively accurate collectiveness and keep high consistency with human perception.

Journal ArticleDOI
TL;DR: A double-cipher scheme is proposed to implement nonlocal means (NLM) denoising in encrypted images; experiments show that the quality of denoised images in the encrypted domain is comparable to that obtained in the plain domain.
Abstract: Signal processing in the encrypted domain becomes a desired technique to protect privacy of outsourced data in cloud. In this article, we propose a double-cipher scheme to implement nonlocal means (NLM) denoising in encrypted images. In this scheme, one ciphertext is generated by the Paillier scheme, which enables the mean filter, and the other is obtained by a privacy-preserving transform, which enables the nonlocal search. By the privacy-preserving transform, the cloud server can search the similar pixel blocks in the ciphertexts with the same speed as in the plaintexts; thus, the proposed method can be executed fast. To enhance the security, we randomly permutate both ciphertexts. To reduce the denoising complexity caused by random permutation, a random NLM method is exploited in the encrypted domain. The experimental results show that the quality of denoised images in the encrypted domain is comparable to that obtained in the plain domain.
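
Illustrative sketch: the reason an additively homomorphic cipher enables the mean-filter part of the scheme can be shown with a toy Paillier example in Python (tiny primes, purely for illustration and in no way secure; this is not the paper's double-cipher construction). Multiplying ciphertexts corresponds to adding plaintexts, so a server can accumulate a pixel block's sum without learning any pixel value.

import math
import random

random.seed(7)

# Toy Paillier keypair (tiny, well-known primes; illustration only -- NOT secure).
p, q = 999_983, 1_000_003
n = p * q
n2 = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)

def L(u):
    return (u - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)

def encrypt(m):
    r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

# Additive homomorphism: the product of ciphertexts decrypts to the sum of plaintexts,
# so a cloud server can sum a pixel block without ever seeing the pixel values.
block = [12, 200, 37, 90, 90, 41, 55, 60, 128]        # a 3x3 pixel block
enc_sum = 1
for pixel in block:
    enc_sum = (enc_sum * encrypt(pixel)) % n2

assert decrypt(enc_sum) == sum(block)
print("mean of the block recovered by the owner:", decrypt(enc_sum) / len(block))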

Journal ArticleDOI
TL;DR: The Ghost Detector, an educational location-based museum game for children, is designed, implemented, and evaluated, demonstrating the viability of a seamful approach to BLE, that is, revealing and exploiting problems and turning them into a part of the actual experience.
Abstract: The application of mobile computing is currently altering patterns of our behavior to a greater degree than perhaps any other invention. In combination with the introduction of power-efficient wireless communication technologies, such as Bluetooth Low Energy (BLE), designers are today increasingly empowered to shape the way we interact with our physical surroundings and thus build entirely new experiences. However, our evaluations of BLE and its abilities to facilitate mobile location-based experiences in public environments revealed a number of potential problems. Most notably, the position and orientation of the user in combination with various environmental factors, such as crowds of people traversing the space, were found to cause major fluctuations of the received BLE signal strength. These issues are rendering a seamless functioning of any location-based application practically impossible. Instead of achieving seamlessness by eliminating these technical issues, we thus choose to advocate the use of a seamful approach, that is, to reveal and exploit these problems and turn them into a part of the actual experience. In order to demonstrate the viability of this approach, we designed, implemented, and evaluated the Ghost Detector—an educational location-based museum game for children. By presenting a qualitative evaluation of this game and by motivating our design decisions, this article provides insight into some of the challenges and possible solutions connected to the process of developing location-based BLE-enabled experiences for public cultural spaces.

Journal ArticleDOI
TL;DR: The goal of this article is to adopt unlabeled videos with the help of text descriptions to learn an embedding function, which can be used to extract more effective semantic features from videos when only a few labeled samples are available for video recognition.
Abstract: Content-based video understanding is extremely difficult due to the semantic gap between low-level vision signals and the various semantic concepts (object, action, and scene) in videos. Though feature extraction from videos has achieved significant progress, most of the previous methods rely only on low-level features, such as the appearance and motion features. Recently, visual-feature extraction has been improved significantly with machine-learning algorithms, especially deep learning. However, there is still not enough work focusing on extracting semantic features from videos directly. The goal of this article is to adopt unlabeled videos with the help of text descriptions to learn an embedding function, which can be used to extract more effective semantic features from videos when only a few labeled samples are available for video recognition. To achieve this goal, we propose a novel embedding convolutional neural network (ECNN). We evaluate our algorithm by comparing its performance on three challenging benchmarks with several popular state-of-the-art methods. Extensive experimental results show that the proposed ECNN consistently and significantly outperforms the existing methods.

Journal ArticleDOI
TL;DR: A novel scheme of progressive visual cryptography with four or more unexpanded as well as meaningful shares has been proposed, and a novel and efficient Candidate Block Replacement preprocessing approach and a basis matrix creation algorithm have been introduced.
Abstract: The traditional k-out-of-n Visual Cryptography (VC) scheme is the conception of “all or nothing” for n participants to share a secret image. The original secret image can be visually revealed only when a subset of k or more shares are superimposed together, but if the number of stacked shares is less than k, nothing will be revealed. On the other hand, a Progressive Visual Cryptography (PVC) scheme differs from the traditional VC with respect to decoding. In PVC, clarity and contrast of the decoded secret image will be increased progressively with the number of stacked shares. Much of the existing state-of-the-art research on PVC has problems with pixel expansion and random pattern of the shares. In this article, a novel scheme of progressive visual cryptography with four or more unexpanded as well as meaningful shares has been proposed. For this, a novel and efficient Candidate Block Replacement preprocessing approach and a basis matrix creation algorithm have also been introduced. The proposed method also eliminates many unnecessary encryption constraints like a predefined codebook for encoding and decoding the secret image, restriction on the number of participants, and so on. From the experiments, it is observed that the reconstruction probability of black pixels in the decoded image corresponding to the black pixel in the secret image is always 1, whereas that of white pixels is 0.5 irrespective of the meaningful contents visible in the shares, thus ensuring that the contrast is always 50%. Therefore, a reconstructed image can be easily identified by a human visual system without any computation.
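
Illustrative sketch: the reconstruction probabilities quoted above (black pixels always reconstructed, white pixels with probability 0.5, hence 50% contrast) are exactly what a minimal probabilistic, non-expanded (2, 2) scheme produces. The Python toy below demonstrates that behaviour; it is not the paper's PVC construction with meaningful shares.

import random

random.seed(3)

# A tiny binary secret: 1 = black, 0 = white.
secret = [
    [1, 1, 1, 0, 0],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 0, 0],
]

def make_shares(secret):
    """Size-invariant probabilistic (2,2) shares (not the paper's PVC construction)."""
    s1, s2 = [], []
    for row in secret:
        r1, r2 = [], []
        for pixel in row:
            b = random.randint(0, 1)
            r1.append(b)
            # White pixel: identical bits -> stacking reproduces b (black half the time).
            # Black pixel: complementary bits -> stacking is always black.
            r2.append(b if pixel == 0 else 1 - b)
        s1.append(r1)
        s2.append(r2)
    return s1, s2

def stack(a, b):
    """Superimposing transparencies acts as a pixel-wise OR."""
    return [[x | y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

s1, s2 = make_shares(secret)
for row in stack(s1, s2):
    print("".join("#" if v else "." for v in row))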

Journal ArticleDOI
TL;DR: A novel second-order deep architecture with the Field Effect Restricted Boltzmann Machine is designed, which models the reliability of the delivered information according to the availability of the features and can jointly determine the classification boundary and estimate the missing features.
Abstract: Image recognition with incomplete data is a well-known hard problem in computer vision and machine learning. This article proposes a novel deep learning technique called Field Effect Bilinear Deep Networks (FEBDN) for this problem. To address the difficulties of recognizing incomplete data, we design a novel second-order deep architecture with the Field Effect Restricted Boltzmann Machine, which models the reliability of the delivered information according to the availability of the features. Based on this new architecture, we propose a new three-stage learning procedure with field effect bilinear initialization, field effect abstraction and estimation, and global fine-tuning with missing features adjustment. By integrating the reliability of features into the new learning procedure, the proposed FEBDN can jointly determine the classification boundary and estimate the missing features. FEBDN has demonstrated impressive performance on recognition and estimation tasks in various standard datasets.

Journal ArticleDOI
TL;DR: A fractal-based VoIP steganographic approach was proposed to realize covert VoIP communications in the presence of packet loss, and the experimental results indicated that the speech quality degradation increased with the escalating packet-loss level.
Abstract: The last few years have witnessed an explosive growth in the research of information hiding in multimedia objects, but few studies have taken into account packet loss in multimedia networks. As one of the most popular real-time services in the Internet, Voice over Internet Protocol (VoIP) contributes to a large part of network traffic for its advantages of real time, high flow, and low cost. So packet loss is inevitable in multimedia networks and affects the performance of VoIP communications. In this study, a fractal-based VoIP steganographic approach was proposed to realize covert VoIP communications in the presence of packet loss. In the proposed scheme, secret data to be hidden were divided into blocks after being encrypted with the block cipher, and each block of the secret data was then embedded into VoIP streaming packets. The VoIP packets went through a packet-loss system based on Gilbert model which simulates a real network situation. And a prediction model based on fractal interpolation was built to decide whether a VoIP packet was suitable for data hiding. The experimental results indicated that the speech quality degradation increased with the escalating packet-loss level. The average variance of speech quality metrics (PESQ score) between the “no-embedding” speech samples and the “with-embedding” stego-speech samples was about 0.717, and the variances narrowed with the increasing packet-loss level. Both the average PESQ scores and the SNR values of stego-speech samples and the data-retrieving rates had almost the same varying trends when the packet-loss level increased, indicating that the success rate of the fractal prediction model played an important role in the performance of covert VoIP communications.
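
Illustrative sketch: the Gilbert channel used to simulate packet loss is simple to reproduce. The Python snippet below (transition probabilities invented) generates the bursty per-packet loss pattern that such a two-state model yields, which is what the stego-speech samples were subjected to at varying loss levels.

import random

random.seed(0)

# Illustrative Gilbert-model parameters (not taken from the paper):
# packets are lost while the channel is in the bad state.
P_GOOD_TO_BAD = 0.05
P_BAD_TO_GOOD = 0.40

def gilbert_loss(n_packets):
    """Return a per-packet loss pattern produced by the two-state Gilbert channel."""
    state_bad, pattern = False, []
    for _ in range(n_packets):
        if state_bad:
            state_bad = random.random() > P_BAD_TO_GOOD     # stay bad or recover
        else:
            state_bad = random.random() < P_GOOD_TO_BAD     # stay good or degrade
        pattern.append(state_bad)                           # True = packet lost
    return pattern

losses = gilbert_loss(10_000)
print(f"simulated loss rate: {100 * sum(losses) / len(losses):.2f}% "
      f"(bursty, unlike independent Bernoulli loss)")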

Journal ArticleDOI
TL;DR: This work proposes a novel yet practical iterative algorithm to predict virality timing, in which the correlation between the timing and growth of content popularity is captured by using its own big data naturally generated from users’ sharing.
Abstract: Predicting content going viral in social networks is attractive for viral marketing, advertisement, entertainment, and other applications, but it remains a challenge in the big data era today. Previous works mainly focus on predicting the possible popularity of content rather than the timing of reaching such popularity. This work proposes a novel yet practical iterative algorithm to predict virality timing, in which the correlation between the timing and growth of content popularity is captured by using its own big data naturally generated from users’ sharing. Such data is not only able to correlate the dynamics and associated timings in social cascades of viral content but also can be useful to self-correct the predicted timing against the actual timing of the virality in each iterative prediction. The proposed prediction algorithm is verified by datasets from two popular social networks—Twitter and Digg—as well as two synthesized datasets with extreme network densities and infection rates. With about 50% of the required content virality data available (i.e., halfway before reaching its actual virality timing), the error of the predicted timing is proven to be bounded within a 40% deviation from the actual timing. To the best of our knowledge, this is the first work that predicts content virality timing iteratively by capturing social cascades dynamics.

Journal ArticleDOI
TL;DR: This work proposes a novel method for the automatic localization of points of interest depicted in photos taken by people across the world, which exploits the geographic coordinates and the compass direction supplied by modern cameras, while accounting for possible measurement errors due to the variability in accuracy of the sensors that produced them.
Abstract: Points of interest are an important requirement for location-based services, yet they are editorially curated and maintained, either professionally or through community. Beyond the laborious manual annotation task, further complications arise as points of interest may appear, relocate, or disappear over time, and may be relevant only to specific communities. To assist, complement, or even replace manual annotation, we propose a novel method for the automatic localization of points of interest depicted in photos taken by people across the world. Our technique exploits the geographic coordinates and the compass direction supplied by modern cameras, while accounting for possible measurement errors due to the variability in accuracy of the sensors that produced them. We statistically demonstrate that our method significantly outperforms techniques from the research literature on the task of estimating the geographic coordinates and geographic footprints of points of interest in various cities, even when photos are involved in the estimation process that do not show the point of interest at all.
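
Illustrative sketch: with geographic coordinates and compass directions available, a point of interest can be estimated as the point closest, in a least-squares sense, to all camera-to-subject bearing rays. The Python snippet below does this on a flat local coordinate frame with made-up camera positions and noisy bearings; the paper's estimator additionally models sensor-accuracy variability.

import math
import numpy as np

def bearing_to_dir(bearing_deg):
    """Compass bearing (degrees clockwise from north) to a local east/north unit vector."""
    rad = math.radians(bearing_deg)
    return np.array([math.sin(rad), math.cos(rad)])

def localize_poi(positions, bearings_deg):
    """Least-squares point closest to all camera->subject rays (flat local coordinates)."""
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for pos, brg in zip(positions, bearings_deg):
        d = bearing_to_dir(brg)
        proj = np.eye(2) - np.outer(d, d)   # projection orthogonal to the ray direction
        A += proj
        b += proj @ np.asarray(pos, float)
    return np.linalg.solve(A, b)

# Three photos taken around a POI at roughly (50 m east, 80 m north) of a local origin.
cams = [(0.0, 0.0), (120.0, 0.0), (60.0, 160.0)]
brgs = [30.0, 320.0, 188.0]      # noisy compass readings pointing toward the POI
print(localize_poi(cams, brgs))  # roughly [50, 80]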

Journal ArticleDOI
TL;DR: This article proposes a new method for automatically parsing fashion images with high processing efficiency and significantly less training time by applying a modification of MRFs, named reweighted MRF (RW-MRF), which resolves the problem of over-smoothing infrequent labels.
Abstract: Previous image parsing methods usually model the problem in a conditional random field which describes a statistical model learned from a training dataset and then processes a query image using the conditional probability. However, for clothing images, fashion items have a large variety of layering and configuration, and it is hard to learn a certain statistical model of features that apply to general cases. In this article, we take fashion images as an example to show how Markov Random Fields (MRFs) can outperform Conditional Random Fields when the application does not follow a certain statistical model learned from the training data set. We propose a new method for automatically parsing fashion images with high processing efficiency and significantly less training time by applying a modification of MRFs, named reweighted MRF (RW-MRF), which resolves the problem of over-smoothing infrequent labels. We further enhance RW-MRF with an occlusion prior and a background prior to resolve two other common problems in clothing parsing: occlusion and background spill. Our experimental results indicate that our proposed clothing parsing method significantly improves processing time and training time over state-of-the-art methods, while ensuring comparable parsing accuracy and improving label recall rate.

Journal ArticleDOI
TL;DR: This work proposes a robust method to detect whether a given image has undergone filtering-based enhancement (linear or nonlinear), possibly followed by format conversion after forgery, and finds that the proposed technique is superior in most cases.
Abstract: The availability of intelligent image editing techniques and antiforensic algorithms makes it convenient to manipulate an image and to hide the artifacts that might have been produced in the process. Real world forgeries are generally followed by the application of enhancement techniques such as filtering and/or conversion of the image format to suppress the forgery artifacts. Though several techniques evolved in the direction of detecting some of these manipulations, additional operations like recompression, nonlinear filtering, and other antiforensic methods during forgery are not deeply investigated. Toward this, we propose a robust method to detect whether a given image has undergone filtering-based enhancement (linear or nonlinear), possibly followed by format conversion after forgery. In the proposed method, JPEG quantization noise is obtained using natural image prior and quantization noise models. Transition probability features extracted from the quantization noise are used for machine learning based detection and classification. We test the effectiveness of the algorithm in classifying the class of the filter applied and the efficacy in detecting filtering in low resolution images. Experiments are performed to compare the performance of the proposed technique with state-of-the-art forensic filtering detection algorithms. It is found that the proposed technique is superior in most of the cases. Also, experiments against popular antiforensic algorithms show the counter antiforensic robustness of the proposed technique.


Journal ArticleDOI
TL;DR: A layered structure for screen coding and rendering is proposed to deliver diverse screen content to the client side with an adaptive frame rate to enable screen sharing among multiple devices with high fidelity and responsive interaction.
Abstract: The pervasive computing environment and wide network bandwidth provide users more opportunities to share screen content among multiple devices. In this article, we introduce a remote display system to enable screen sharing among multiple devices with high fidelity and responsive interaction. In the developed system, the frame-level screen content is compressed and transmitted to the client side for screen sharing, and the instant control inputs are simultaneously transmitted to the server side for interaction. Even though the screen responds immediately to the control messages and updates at a high frame rate on the server side, it is difficult to update the screen content with low delay and a high frame rate on the client side due to the non-negligible time consumed by whole-screen frame compression, transmission, and display buffer updating. To address this critical problem, we propose a layered structure for screen coding and rendering to deliver diverse screen content to the client side with an adaptive frame rate. More specifically, the interaction content with small-region screen updates is compressed by a blockwise screen codec and rendered at a high frame rate to achieve smooth interaction, while the natural video screen content is compressed by a standard video codec and rendered at a regular frame rate for a smooth video display. Experimental results with real applications demonstrate that the proposed system can successfully reduce transmission bandwidth cost and interaction delay during screen sharing. Especially for user interaction in small regions, the proposed system can achieve a higher frame rate than most previous counterparts.
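
Illustrative sketch: the routing decision behind the layered structure can be imitated in a few lines of Python. The snippet compares consecutive frames block by block and sends frames with only a small dirty region through a (notional) blockwise screen path at a high frame rate, falling back to a video path otherwise; block size and threshold are assumptions, and no actual codec is invoked.

import numpy as np

BLOCK = 16                  # block size in pixels
VIDEO_THRESHOLD = 0.30      # if >30% of blocks changed, treat the frame as video content

def changed_blocks(prev, curr):
    """Indices of BLOCK x BLOCK tiles that differ between two grayscale frames."""
    h, w = curr.shape
    dirty = []
    for y in range(0, h - h % BLOCK, BLOCK):
        for x in range(0, w - w % BLOCK, BLOCK):
            if not np.array_equal(prev[y:y + BLOCK, x:x + BLOCK],
                                  curr[y:y + BLOCK, x:x + BLOCK]):
                dirty.append((y, x))
    return dirty

def route_frame(prev, curr):
    """Decide which coding layer a frame should take."""
    dirty = changed_blocks(prev, curr)
    total = (curr.shape[0] // BLOCK) * (curr.shape[1] // BLOCK)
    if len(dirty) / total <= VIDEO_THRESHOLD:
        return "blockwise screen codec (high frame rate)", dirty
    return "standard video codec (regular frame rate)", dirty

rng = np.random.default_rng(1)
prev = rng.integers(0, 256, (480, 640), dtype=np.uint8)
curr = prev.copy()
curr[100:140, 200:300] ^= 0xFF          # a small interactive update (e.g., a menu)
print(route_frame(prev, curr)[0])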

Journal ArticleDOI
TL;DR: Results show that the proposed approach provides network connectivity independence to users with mobile apps when Internet connectivity is unavailable and that it significantly improves the overall system performance and the service provided for a given mobile application.
Abstract: Network architectures based on mobile devices and wireless communications present several constraints (e.g., processor, energy storage, bandwidth, etc.) that affect the overall network performance. Cooperation strategies have been considered as a solution to address these network limitations. In the presence of unstable network infrastructures, mobile nodes cooperate with each other, forwarding data and performing other specific network functionalities. This article proposes a generalized incentive-based cooperation solution for mobile services and applications called MobiCoop. This reputation-based scheme includes an application framework for mobile applications that uses a Web service to handle all the nodes' reputations and network permissions. The main goal of MobiCoop is to provide Internet services to mobile devices without network connectivity through cooperation with neighbor devices. The article includes a performance evaluation study of MobiCoop considering both a real scenario (using a prototype) and a simulation-based study. Results show that the proposed approach provides network connectivity independence to users with mobile apps when Internet connectivity is unavailable. It is concluded that MobiCoop significantly improves the overall system performance and the service provided for a given mobile application.

Journal ArticleDOI
TL;DR: This work proposes RAVO, a novel and efficient algorithm based on linear programming with proven optimality gap that achieves close-to-optimal performance, outperforming other advanced schemes significantly (often by multiple times).
Abstract: We consider providing large-scale Netflix-like video-on-demand (VoD) service on a cloud platform, where cloud proxy servers are placed close to user pools. Videos may have heterogeneous popularity at different geo-locations. A repository provides video backup for the network, and the proxy servers collaboratively store and stream videos. To deploy the VoD cloud, the content provider rents resources consisting of link capacities among servers, server storage, and server processing capacity to handle remote requests. We study how to minimize the deployment cost by jointly optimizing video management (in terms of video placement and retrieval at servers) and resource allocation (in terms of link, storage, and processing capacities), subject to a certain user delay requirement on video access. We first formulate the joint optimization problem and show that it is NP-hard. To address it, we propose Resource allocation And Video management Optimization (RAVO), a novel and efficient algorithm based on linear programming with proven optimality gap. For a large video pool, we propose a video clustering algorithm to substantially reduce the run-time computational complexity without compromising performance. Using extensive simulation and trace-driven real data, we show that RAVO achieves close-to-optimal performance, outperforming other advanced schemes significantly (often by multiple times).
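
Illustrative sketch: a drastically reduced relative of the joint optimization can be written as a small linear program. The Python example below (requires SciPy; one video, two proxies, invented costs and demands, relaxed storage indicators) trades off storage, processing, and repository-link costs while meeting all demand; RAVO's actual formulation, delay requirement, and clustering step are far richer.

import numpy as np
from scipy.optimize import linprog

# Toy instance: one video, two proxy servers, one repository; all numbers invented.
demand = np.array([5.0, 1.0])   # expected concurrent streams near proxy 1 and proxy 2
c_store, c_link, c_proc = 3.0, 2.0, 0.5

# Variables: [x1, x2, y1, y2, z1, z2]
#   x_i: fraction of region i's demand streamed from its local proxy
#   y_i: fraction fetched from the central repository
#   z_i: (relaxed) indicator that proxy i stores a copy of the video
cost = np.concatenate([c_proc * demand, c_link * demand, [c_store, c_store]])

A_eq = [[1, 0, 1, 0, 0, 0],      # x1 + y1 = 1  (region 1 demand fully served)
        [0, 1, 0, 1, 0, 0]]      # x2 + y2 = 1
b_eq = [1, 1]
A_ub = [[1, 0, 0, 0, -1, 0],     # x1 <= z1  (can only stream locally if stored)
        [0, 1, 0, 0, 0, -1]]     # x2 <= z2
b_ub = [0, 0]

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * 6, method="highs")
x1, x2, y1, y2, z1, z2 = res.x
print(f"proxy 1 serves {x1:.0%} locally, proxy 2 fetches {y2:.0%} from the repository; "
      f"total cost = {res.fun:.1f}")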