
Showing papers in "ACM Transactions on Multimedia Computing, Communications, and Applications in 2015"


Journal ArticleDOI
TL;DR: A novel rate adaptation algorithm, capable of increasing clients’ Quality of Experience (QoE) and achieving fairness in a multiclient setting, is proposed, which can improve fairness up to 80% compared to state-of-the-art HAS heuristics in a scenario with three networks.
Abstract: HTTP Adaptive Streaming (HAS) is quickly becoming the de facto standard for video streaming services. In HAS, each video is temporally segmented and stored in different quality levels. Rate adaptation heuristics, deployed at the video player, allow the most appropriate level to be dynamically requested, based on the current network conditions. It has been shown that today’s heuristics underperform when multiple clients consume video at the same time, due to fairness issues among clients. Concretely, this means that different clients negatively influence each other as they compete for shared network resources. In this article, we propose a novel rate adaptation algorithm called FINEAS (Fair In-Network Enhanced Adaptive Streaming), capable of increasing clients’ Quality of Experience (QoE) and achieving fairness in a multiclient setting. A key element of this approach is an in-network system of coordination proxies in charge of facilitating fair resource sharing among clients. The strength of this approach is threefold. First, fairness is achieved without explicit communication among clients and thus no significant overhead is introduced into the network. Second, the system of coordination proxies is transparent to the clients, that is, the clients do not need to be aware of its presence. Third, the HAS principle is maintained, as the in-network components only provide the clients with new information and suggestions, while the rate adaptation decision remains the sole responsibility of the clients themselves. We evaluate this novel approach through simulations, under highly variable bandwidth conditions and in several multiclient scenarios. We show how the proposed approach can improve fairness up to 80% compared to state-of-the-art HAS heuristics in a scenario with three networks, each containing 30 clients streaming video at the same time.
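As background for the rate adaptation heuristics the article improves on, a basic client-side, throughput-based rule can be sketched as follows. This is a generic illustration, not the FINEAS algorithm; the bitrate ladder and safety margin are assumed values.

```python
# Generic throughput-based HAS rate adaptation (illustrative, not FINEAS).
BITRATES_KBPS = [300, 750, 1500, 3000, 6000]  # hypothetical quality levels

def select_quality(measured_throughput_kbps, margin=0.8):
    """Pick the highest bitrate below a safety fraction of measured throughput."""
    budget = measured_throughput_kbps * margin
    chosen = BITRATES_KBPS[0]            # always fall back to the lowest level
    for rate in BITRATES_KBPS:
        if rate <= budget:
            chosen = rate
    return chosen

print(select_quality(2000))  # 1500: highest level under 0.8 * 2000 kbps
```

Heuristics of this kind adapt greedily per client, which is exactly what causes the fairness problems the article addresses: each client reacts only to its own throughput estimate, with no coordination.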

114 citations


Journal ArticleDOI
TL;DR: This article proposes a new benchmark for INSTance-level visual object REtrieval and REcognition (INSTRE) and comprehensively evaluates several popular algorithms for the large-scale object retrieval problem with multiple evaluation metrics.
Abstract: Over the last several decades, research on visual object retrieval and recognition has achieved fast and remarkable success. However, while category-level tasks prevail in the community, instance-level tasks (especially recognition) have not yet received adequate focus. Applications such as content-based search engines and robot vision systems have raised awareness of the need to bring instance-level tasks into a more realistic and challenging scenario. Motivated by the limited scope of existing instance-level datasets, in this article we propose a new benchmark for INSTance-level visual object REtrieval and REcognition (INSTRE). Compared with existing datasets, INSTRE has the following major properties: (1) balanced data scale, (2) more diverse intraclass instance variations, (3) cluttered and less contextual backgrounds, (4) object localization annotation for each image, (5) well-manipulated double-labelled images for measuring the multiple-object (within one image) case. We quantify and visualize the merits of the INSTRE data, and extensively compare them against existing datasets. Then, on INSTRE, we comprehensively evaluate several popular algorithms for the large-scale object retrieval problem with multiple evaluation metrics. Experimental results show that all the methods suffer a performance drop on INSTRE, showing that this remains a challenging problem. Finally, we integrate these algorithms into a simple yet efficient scheme for recognition and compare it with classification-based methods. Importantly, we introduce the realistic multiobject recognition problem. All experiments are conducted in both the single-object and multiple-object cases.

80 citations


Journal ArticleDOI
TL;DR: This survey aims to provide researchers with a state-of-the-art overview of various techniques for multi-camera coordination and control (MC3) that have been adopted in surveillance systems.
Abstract: The use of multiple heterogeneous cameras is becoming more common in today's surveillance systems. In order to perform surveillance tasks, effective coordination and control in multi-camera systems is very important and is attracting significant research attention. This survey aims to provide researchers with a state-of-the-art overview of various techniques for multi-camera coordination and control (MC3) that have been adopted in surveillance systems. The existing literature on MC3 is presented through several classifications based on the applicable architectures, frameworks and the associated surveillance tasks. Finally, a discussion is presented on the open problems in the surveillance area that can be solved effectively using MC3, together with future directions in MC3 research.

77 citations


Journal ArticleDOI
TL;DR: This article takes advantage of the user behavior of requesting videos from the top of the related list provided by YouTube to improve the performance of YouTube caches and recommends that local caches reorder the related lists associated with YouTube videos, presenting the cached content above noncached content.
Abstract: In this article, we take advantage of the user behavior of requesting videos from the top of the related list provided by YouTube to improve the performance of YouTube caches. We recommend that local caches reorder the related lists associated with YouTube videos, presenting cached content above noncached content. We argue that viewers are more likely to select content from the top of the related list than from the bottom, so pushing content already in the cache to the top of the related list would increase the likelihood of cached content being chosen. To verify that the position on the list really is a more dominant selection criterion than the content itself, we conducted a user study with 40 YouTube-using volunteers who were presented with random related lists in their everyday YouTube use. After confirming our assumption, we analyze the benefits of our approach by an investigation based on two traces collected from a university campus. Our analysis shows that the proposed reordering approach for related lists would lead to a 2 to 5 times increase in cache hit rate compared to an approach without reordering. This increase in hit rate would reduce server load and backend bandwidth usage, which in turn reduces the latency in streaming the video requested by the viewer and has the potential to improve the overall performance of YouTube's content distribution system. An analysis of YouTube's recommendation system reveals that related lists are created from a small pool of videos, which increases the potential for caching content from related lists and reordering based on the content in the cache.
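The reordering step itself is simple to sketch: cached videos move to the top while the relative order within each group is preserved. This is a hedged illustration of the idea; the video identifiers and the stable-order choice are assumptions, not YouTube's API.

```python
def reorder_related(related_list, cached_ids):
    """Move cached videos to the top, preserving relative order otherwise."""
    cached = [v for v in related_list if v in cached_ids]
    uncached = [v for v in related_list if v not in cached_ids]
    return cached + uncached

# Hypothetical related list "a".."d" with "a" and "c" already in the cache:
print(reorder_related(["a", "b", "c", "d"], {"c", "a"}))  # ['a', 'c', 'b', 'd']
```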

76 citations


Journal ArticleDOI
TL;DR: The proposed BMM-SLDA can effectively exploit the multimodality and multiclass property of social events jointly, makes use of supervised category label information to classify multiclass social events directly, and is suitable for large-scale data analysis, using a boosting weighted sampling strategy to iteratively select a small subset of data to efficiently train the corresponding topic models.
Abstract: With the rapidly increasing popularity of social media sites (e.g., Flickr, YouTube, and Facebook), it is convenient for users to share their own comments on many social events, which successfully facilitates social event generation, sharing and propagation and results in a large amount of user-contributed media data (e.g., images, videos, and text) for a wide variety of real-world events of different types and scales. As a consequence, it has become more and more difficult to find interesting events in massive social media data, a capability that is useful to users or governments for browsing, searching and monitoring social events. To deal with these issues, we propose a novel boosted multimodal supervised Latent Dirichlet Allocation (BMM-SLDA) for social event classification by integrating a supervised topic model, denoted as multimodal supervised Latent Dirichlet Allocation (mm-SLDA), into a boosting framework. Our proposed BMM-SLDA has a number of advantages. (1) Our mm-SLDA can effectively exploit the multimodality and the multiclass property of social events jointly, and makes use of the supervised category label information to classify multiclass social events directly. (2) It is suitable for large-scale data analysis, utilizing a boosting weighted sampling strategy to iteratively select a small subset of data to efficiently train the corresponding topic models. (3) It effectively exploits the social event structure through the document weight distribution with classification error, and can iteratively learn new topic models to correct previously misclassified event documents. We evaluate our BMM-SLDA on a real-world dataset and show extensive experimental results, which demonstrate that our model outperforms state-of-the-art methods.

69 citations


Journal ArticleDOI
TL;DR: This article formulates an integer linear program that maximizes users' average satisfaction, taking into account network dynamics, type of video content, and user population characteristics, and proposes a few theoretical guidelines that can be used, in realistic settings, to choose the encoding parameters based on the user characteristics, the network capacity and the type of video content.
Abstract: Adaptive streaming addresses the increasing and heterogeneous demand of multimedia content over the Internet by offering several encoded versions for each video sequence. Each version (or representation) is characterized by a resolution and a bit rate, and it is aimed at a specific set of users, like TV or mobile phone clients. While most existing works on adaptive streaming deal with effective playout-buffer control strategies on the client side, in this article we take a providers' perspective and propose solutions to improve user satisfaction by optimizing the set of available representations. We formulate an integer linear program that maximizes users' average satisfaction, taking into account network dynamics, type of video content, and user population characteristics. The solution of the optimization is a set of encoding parameters corresponding to the representations set that maximizes user satisfaction. We evaluate this solution by simulating multiple adaptive streaming sessions characterized by realistic network statistics, showing that the proposed solution outperforms commonly used vendor recommendations, in terms of user satisfaction but also in terms of fairness and outage probability. The simulation results show that video content information as well as network constraints and users' statistics play a crucial role in selecting proper encoding parameters to provide fairness among users and to reduce network resource usage. We finally propose a few theoretical guidelines that can be used, in realistic settings, to choose the encoding parameters based on the user characteristics, the network capacity and the type of video content.
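The core optimization can be illustrated with a toy version: exhaustively pick the set of k bitrates that maximizes average user satisfaction over a sampled user population. The candidate bitrates, bandwidth samples, and log-utility below are illustrative assumptions, not the paper's ILP formulation.

```python
# Toy representation-set selection (brute force, illustrative values).
from itertools import combinations
from math import log

CANDIDATES = [300, 750, 1500, 3000, 6000]   # candidate bitrates (kbps)
USER_BW = [500, 900, 2000, 2500, 5000]      # sampled user bandwidths (kbps)

def satisfaction(bitrate):
    return log(1 + bitrate)                 # diminishing-returns utility

def best_representation_set(k):
    """Exhaustively search the k-subset of CANDIDATES with best average utility."""
    def avg_satisfaction(reps):
        total = 0.0
        for bw in USER_BW:
            feasible = [r for r in reps if r <= bw]   # each user streams the
            total += satisfaction(max(feasible)) if feasible else 0.0
        return total / len(USER_BW)                   # best level they can get
    return max(combinations(CANDIDATES, k), key=avg_satisfaction)

print(best_representation_set(2))  # (300, 1500)
```

The real problem adds network dynamics and content-type terms and is solved as an ILP, but the tradeoff is the same: representations must cover the bandwidth distribution of the actual user population.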

67 citations


Journal ArticleDOI
TL;DR: Using the proposed method, it is shown that several image enhancement operations such as noise removal, antialiasing, edge and contrast enhancement, and dehazing can be performed in encrypted domain with near-zero loss in accuracy and minimal computation and data overhead.
Abstract: Cloud-based multimedia systems are becoming increasingly common. These systems offer not only storage facility, but also high-end computing infrastructure which can be used to process data for various analysis tasks ranging from low-level data quality enhancement to high-level activity and behavior identification operations. However, cloud data centers, being third party servers, are often prone to information leakage, raising security and privacy concerns. In this article, we present a Shamir's secret sharing based method to enhance the quality of encrypted image data over cloud. Using the proposed method we show that several image enhancement operations such as noise removal, antialiasing, edge and contrast enhancement, and dehazing can be performed in encrypted domain with near-zero loss in accuracy and minimal computation and data overhead. Moreover, the proposed method is proven to be information theoretically secure.
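The building block of the proposed method, (k, n) Shamir secret sharing over a prime field, can be sketched as follows. This is a minimal generic sketch; the field size and demo values are illustrative, and the paper's encrypted-domain image enhancement machinery is not shown.

```python
# Minimal (k, n) Shamir secret sharing over a prime field (illustrative).
import random

PRIME = 2**61 - 1  # a Mersenne prime; the field size is an assumed choice

def make_shares(secret, k, n):
    """Split `secret` into n shares; any k of them reconstruct it."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(k - 1)]
    def poly(x):
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
    return [(x, poly(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        # pow(den, PRIME - 2, PRIME) is the modular inverse (Fermat's little theorem)
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret

shares = make_shares(4242, k=3, n=5)
print(reconstruct(shares[:3]))  # 4242
```

The information-theoretic security the article relies on comes from this scheme: fewer than k shares reveal nothing about the secret, so each cloud server holding one share of each pixel learns nothing about the image.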

50 citations


Journal ArticleDOI
TL;DR: This work proposes a dynamic user modeling strategy to tackle personalized video recommendation issues in the multimedia sharing platform YouTube, by transferring knowledge from the social textual stream-based platform Twitter.
Abstract: Traditional personalized video recommendation methods focus on utilizing the user profile or user history behaviors to model user interests, which follows a static strategy and fails to capture the swift shift of users' short-term interests. According to our cross-platform data analysis, information emerges and propagates faster in social textual stream-based platforms than in multimedia sharing platforms at the micro (user) level. Inspired by this, we propose a dynamic user modeling strategy to tackle personalized video recommendation in the multimedia sharing platform YouTube, by transferring knowledge from the social textual stream-based platform Twitter. In particular, the cross-platform video recommendation strategy is divided into two steps. (1) Real-time hot topic detection: the hot topics that users are currently following are extracted from users' tweets and utilized to obtain related videos on YouTube. (2) Time-aware video recommendation: for the target user on YouTube, the obtained videos are ranked by considering the user profile on YouTube, a time factor, and a quality factor to generate the final recommendation list. In this way, the short-term (hot topics) and long-term (user profile) interests of users are jointly considered. Carefully designed experiments demonstrate the advantages of the proposed method.

47 citations


Journal ArticleDOI
TL;DR: The quaternion polar harmonic transform (QPHT) for invariant color image watermarking is introduced in this article, which can be seen as the generalization of PHT for gray-level images and is shown that the QPHT can be obtained from the PHT of each color channel.
Abstract: Designing a robust color image watermarking scheme against geometric distortions is a challenging task. Moments and moment invariants have become a powerful tool in robust image watermarking owing to their image description capability and geometric invariance property. However, existing moment-based watermarking schemes were mainly designed for gray images rather than color images, and detection quality and robustness are lowered when the watermark is directly embedded into the luminance component or the three color channels of a color image. Furthermore, the imperceptibility of the embedded watermark is not well guaranteed. Based on the algebra of quaternions and the polar harmonic transform (PHT), we introduce the quaternion polar harmonic transform (QPHT) for invariant color image watermarking in this article, which can be seen as a generalization of the PHT for gray-level images. It is shown that the QPHT can be obtained from the PHT of each color channel. We derive and analyze the rotation, scaling, and translation (RST) invariance property of the QPHT. We also discuss the problem of color image watermarking using the QPHT. Experimental results are provided to illustrate the efficiency of the proposed color image watermarking against geometric distortions and common image processing operations (including color attacks).

47 citations


Journal ArticleDOI
TL;DR: In this article, the authors proposed a new method for improving the presentation of subtitles in video (e.g., TV and movies) by placing on-screen subtitles next to the respective speakers to allow the viewer to follow the visual content while simultaneously reading the subtitles.
Abstract: We propose a new method for improving the presentation of subtitles in video (e.g., TV and movies). With conventional subtitles, the viewer has to constantly look away from the main viewing area to read the subtitles at the bottom of the screen, which disrupts the viewing experience and causes unnecessary eyestrain. Our method places on-screen subtitles next to the respective speakers to allow the viewer to follow the visual content while simultaneously reading the subtitles. We use novel identification algorithms to detect the speakers based on audio and visual information. Then the placement of the subtitles is determined using global optimization. A comprehensive usability study indicated that our subtitle placement method outperformed both conventional fixed-position subtitling and another previous dynamic subtitling method in terms of enhancing the overall viewing experience and reducing eyestrain.

41 citations


Journal ArticleDOI
TL;DR: A compact representation for scalable object retrieval from few generic object regions is proposed with a fusion of learning-based features and aggregated SIFT features and is evaluated on two public ground-truth datasets with promising results.
Abstract: In content-based visual object retrieval, image representation is one of the fundamental issues in improving retrieval performance. Existing works adopt either local SIFT-like features or holistic features, and may suffer sensitivity to noise or poor discrimination power. In this article, we propose a compact representation for scalable object retrieval from few generic object regions. The regions are identified with a general object detector and are described with a fusion of learning-based features and aggregated SIFT features. Further, we compress feature representation in large-scale image retrieval scenarios. We evaluate the performance of the proposed method on two public ground-truth datasets, with promising results. Experimental results on a million-scale image database demonstrate superior retrieval accuracy with efficiency gain in both computation and memory usage.

Journal ArticleDOI
TL;DR: This article proposes a fast mode decision algorithm for 3D-HEVC that jointly exploits the inter-view coding mode correlation, theinter-component (texture-depth) correlation and theInter-level correlation in the quadtree structure of HEVC.
Abstract: 3D High Efficiency Video Coding (3D-HEVC) is an extension of the HEVC standard for coding of multiview videos and depth maps. It inherits the same quadtree coding structure as HEVC for both components, which allows recursively splitting into four equal-sized coding units (CU). One of 11 different prediction modes is chosen to code a CU in inter-frames. Similar to the joint model of H.264/AVC, the mode decision process in HM (reference software of HEVC) is performed using all the possible depth levels and prediction modes to find the one with the least rate distortion cost using a Lagrange multiplier. Furthermore, both motion estimation and disparity estimation need to be performed in the encoding process of 3D-HEVC. Those tools achieve high coding efficiency, but lead to a significant computational complexity. In this article, we propose a fast mode decision algorithm for 3D-HEVC. Since multiview videos and their associated depth maps represent the same scene, at the same time instant, their prediction modes are closely linked. Furthermore, the prediction information of a CU at the depth level X is strongly related to that of its parent CU at the depth level X-1 in the quadtree coding structure of HEVC since two corresponding CUs from two neighboring depth levels share similar video characteristics. The proposed algorithm jointly exploits the inter-view coding mode correlation, the inter-component (texture-depth) correlation and the inter-level correlation in the quadtree structure of 3D-HEVC. Experimental results show that our algorithm saves 66% encoder runtime on average with only a 0.2% BD-Rate increase on coded views and a 1.3% BD-Rate increase on synthesized views.

Journal ArticleDOI
TL;DR: Qualitative and quantitative evaluation results demonstrate the effectiveness of the Robust Cross-Platform Multimedia Co-Clustering (RCPMM-CC) method on emerging topic detection and elaboration using multimedia streams cross different online platforms.
Abstract: With the explosive growth of online media platforms in recent years, it has become more and more attractive to provide users with a solution for emerging topic detection and elaboration. This poses a real challenge to both industrial and academic researchers because of the overwhelming information available in multiple modalities and with large outlier noise. This article provides a method for emerging topic detection and elaboration using multimedia streams across different online platforms. Specifically, Twitter, New York Times and Flickr are selected to represent the microblog, news portal and image sharing platforms. Emerging keywords are first extracted from Twitter using aging theory. Then, to overcome the short message length inherent to microblogs, Robust Cross-Platform Multimedia Co-Clustering (RCPMM-CC) is proposed to detect emerging topics with three novelties: 1) the data from different media platforms are in multiple modalities; 2) the coclustering is processed on a pairwise correlated structure, in which the three involved media platforms are pairwise dependent; 3) noninformative samples are automatically pruned away during coclustering. In the last step, cross-platform elaboration, we enrich each emerging topic with samples from New York Times and Flickr by computing the implicit links between social topics and samples from selected news and Flickr image clusters, which are obtained by RCPMM-CC. Qualitative and quantitative evaluation results demonstrate the effectiveness of our method.

Journal ArticleDOI
TL;DR: This work introduces a system to automatically generate a summarization from multiple user generated videos and present their salience to viewers in an enjoyable manner, and proposes a probabilistic model to evaluate the aesthetic quality of each user generated video.
Abstract: In recent years, with the rapid development of camera technology and portable devices, we have witnessed a flourish of user generated videos, which are gradually reshaping the traditional professional-video-oriented media market. The volume of user generated videos in repositories is increasing at a rapid rate. In today's video retrieval systems, a simple query can return many videos, which seriously increases the viewing burden. To manage these retrieval results and provide viewers with an efficient way to browse, we introduce a system that automatically generates a summarization from multiple user generated videos and presents their salience to viewers in an enjoyable manner. Among multiple consumer videos, we find their qualities to be highly diverse due to various factors such as a photographer's experience or environmental conditions at the time of capture. This diversity inspires us to include a video quality evaluation component in the summarization, since videos with poor quality can seriously degrade the viewing experience. We first propose a probabilistic model to evaluate the aesthetic quality of each user generated video. This model compares the rich aesthetics information from several well-known photo databases with generic unlabeled consumer videos, under a human perception component indicating the correlation between a video and its constituent frames. Subjective studies were carried out, with the results indicating that our method is reliable. Then a novel graph-based formulation is proposed for the multi-video summarization task. Desirable summarization criteria are incorporated as graph attributes and the problem is solved through a dynamic programming framework. Comparisons with several state-of-the-art methods demonstrate that our algorithm performs better in generating a skimming video that preserves the essential scenes from the original multiple input videos, with smooth transitions among consecutive segments and appealing overall aesthetics.

Journal ArticleDOI
TL;DR: The experimental results show that the proposed approach has better identification accuracy than an MPEG-7-based scheme for distorted and noisy audio and, compared with other schemes, uses fewer bits with comparable performance.
Abstract: This article proposes to use the relative distances between adjacent envelope peaks detected in stereo audio as fingerprints for copy identification. The matching algorithm used is the rough longest common subsequence (RLCS) algorithm. The experimental results show that the proposed approach has better identification accuracy than an MPEG-7 based scheme for distorted and noisy audio. When compared with other schemes, the proposed scheme uses fewer bits with comparable performance. The proposed fingerprints can also be used in conjunction with the MPEG-7 based scheme for lower computational burden.
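The matching step is built on the rough longest common subsequence (RLCS), a relaxation of the classic LCS that tolerates approximate matches. As a hedged illustration, here is the exact-match LCS that RLCS relaxes, applied to hypothetical peak-distance fingerprints (the sequences are made-up values, not real audio features).

```python
def lcs_length(a, b):
    """Classic dynamic-programming LCS length; RLCS relaxes the exact x == y
    match to a tolerance-based one, which this sketch does not implement."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

query = [3, 5, 2, 7, 4]      # hypothetical peak-distance fingerprint
reference = [3, 5, 9, 2, 4]
print(lcs_length(query, reference))  # 4: the subsequence 3, 5, 2, 4
```

A long common subsequence relative to the fingerprint lengths indicates a likely copy; distortion mainly perturbs individual peak distances, which the rough variant absorbs.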

Journal ArticleDOI
TL;DR: A comprehensive overview of techniques related to the pipeline from 3D sensing to printing is provided and several sensing, postprocessing, and printing techniques available from both commercial deployments and published research are introduced.
Abstract: Three-dimensional (3D) sensing and printing technologies have reshaped our world in recent years. In this article, a comprehensive overview of techniques related to the pipeline from 3D sensing to printing is provided. We compare the latest 3D sensors and 3D printers and introduce several sensing, postprocessing, and printing techniques available from both commercial deployments and published research. In addition, we demonstrate several devices, software, and experimental results of our related projects to further elaborate details of this process. A case study is conducted to further illustrate the possible tradeoffs during the process of this pipeline. Current progress, future research trends, and potential risks of 3D technologies are also discussed.

Journal ArticleDOI
TL;DR: Experimental restoration results via qualitative and quantitative evaluations show that the proposed approach can provide higher haze-removal efficacy for images captured in varied weather conditions than can the other state-of-the-art approaches.
Abstract: Haze removal is the process by which horizontal obscuration is eliminated from hazy images captured during inclement weather. Images captured in natural environments with varied weather conditions frequently exhibit localized light sources or color-shift effects. The occurrence of these effects presents a difficult challenge for hazy image restoration, with which many traditional restoration methods cannot adequately contend. In this article, we present a new image haze removal approach based on Fisher's linear discriminant-based dual dark channel prior scheme in order to solve the problems associated with the presence of localized light sources and color shifts, and thereby achieve effective restoration. Experimental restoration results via qualitative and quantitative evaluations show that our proposed approach can provide higher haze-removal efficacy for images captured in varied weather conditions than can the other state-of-the-art approaches.
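The scheme builds on dark channel computation: the per-pixel minimum over the color channels and a local patch, which is near zero for haze-free outdoor regions. Below is a minimal pure-Python sketch of that classic building block only; the patch size and list-of-lists image format are illustrative, and the paper's Fisher's linear discriminant-based dual-prior machinery is not shown.

```python
def dark_channel(image, patch=3):
    """image: H x W x 3 nested lists of floats in [0, 1].
    Returns the per-pixel minimum over RGB and a patch x patch neighbourhood."""
    h, w = len(image), len(image[0])
    # Per-pixel minimum across the three color channels.
    pixel_min = [[min(image[y][x]) for x in range(w)] for y in range(h)]
    r = patch // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Minimum over the clipped patch window around (x, y).
            out[y][x] = min(
                pixel_min[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            )
    return out
```

In dark-channel-based dehazing, large values in this map signal haze (or bright light sources, the failure case this article targets), and they drive the transmission estimate used for restoration.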

Journal ArticleDOI
TL;DR: This article proposes to leverage crowdsourced data from social multimedia applications that host tags of diverse semantics to build a spatio-temporal tag repository, consequently acting as input to the auto-annotation approach.
Abstract: Videos are increasingly geotagged and used in practical and powerful GIS applications. However, video search and management operations are typically supported by manual textual annotations, which are subjective and laborious. Therefore, research has been conducted to automate or semi-automate this process. Since a diverse vocabulary for video annotations is of paramount importance towards good search results, this article proposes to leverage crowdsourced data from social multimedia applications that host tags of diverse semantics to build a spatio-temporal tag repository, consequently acting as input to our auto-annotation approach. In particular, to build the tag store, we retrieve the necessary data from several social multimedia applications, mine both the spatial and temporal features of the tags, and then refine and index them accordingly. To better integrate the tag repository, we extend our previous approach by leveraging the temporal characteristics of videos as well. Moreover, we set up additional ranking criteria on the basis of tag similarity, popularity and location bias. Experimental results demonstrate that, by making use of such a tag repository, the generated tags have a wide range of semantics, and the resulting rankings are more consistent with human perception.

Journal ArticleDOI
TL;DR: Algorithms are developed and shown to be quite effective for detecting fake views, which are often created with robots (or botnets) under significant financial incentive.
Abstract: Online video-on-demand (VoD) services invariably maintain a view count for each video they serve, and it has become an important currency for various stakeholders, from viewers, to content owners, advertisers, and the online service providers themselves. There is often significant financial incentive to use a robot (or a botnet) to artificially create fake views. How can we detect fake views? Can we detect them (and stop them) efficiently? What is the extent of fake views with current VoD service providers? These are the questions we study in this article. We develop some algorithms and show that they are quite effective for this problem.

Journal ArticleDOI
TL;DR: A new dynamic voltage and frequency scaling (DVFS) scheme that allocates a frequency and a workload to each CPU with the aim of minimizing power consumption while meeting all transcoding deadlines is proposed.
Abstract: Recent popular streaming services such as TV Everywhere, N-Screen, and dynamic adaptive streaming over HTTP (DASH) need to deliver content to the wide range of devices, requiring video content to be transcoded into different versions. Transcoding tasks require a lot of computation, and each task typically has its own real-time constraint. These make it difficult to manage transcoding, but the more efficient use of energy in servers is an imperative. We characterize transcoding workloads in terms of deadlines and computation times, and propose a new dynamic voltage and frequency scaling (DVFS) scheme that allocates a frequency and a workload to each CPU with the aim of minimizing power consumption while meeting all transcoding deadlines. This scheme has been simulated, and also implemented in a Linux transcoding server, in which a frontend node distributes transcoding requests to heterogeneous backend nodes. This required a new protocol for communication between nodes, a DVFS management scheme to reduce power consumption and thread management and scheduling schemes which ensure that transcoding deadlines are met. Power measurements show that this approach can reduce system-wide energy consumption by 17% to 31%, compared with the Linux Ondemand governor.
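The core DVFS idea can be sketched as picking the lowest CPU frequency that still finishes a transcoding job before its deadline, since lower frequency means lower power. The frequency ladder and cycle counts below are illustrative assumptions, not the paper's allocation scheme.

```python
# Pick the minimum CPU frequency meeting a transcoding deadline (illustrative).
FREQS_MHZ = [800, 1200, 1600, 2000, 2400]  # hypothetical available P-states

def lowest_feasible_freq(cycles_millions, deadline_s):
    """Return the minimum frequency (MHz) that meets the deadline, or None.

    Execution time is modeled as cycles / frequency; scanning frequencies in
    ascending order returns the lowest-power feasible choice first.
    """
    for f in FREQS_MHZ:
        if cycles_millions / f <= deadline_s:
            return f
    return None

print(lowest_feasible_freq(3000, 2.0))  # 1600: 3000/1600 = 1.875 s <= 2 s
```

The full scheme additionally balances workloads across CPUs and heterogeneous backend nodes, but each placement decision reduces to a feasibility check of this form.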

Journal ArticleDOI
TL;DR: This article proposes multilevel visual features for extracting spectrogram textures and their temporal variations and a confidence-based late fusion is proposed for combining the acoustic and visual features.
Abstract: Most music genre classification approaches extract acoustic features from frames to capture timbre information, leading to the common framework of bag-of-frames analysis. However, time-frequency analysis is also vital for modeling music genres. This article proposes multilevel visual features for extracting spectrogram textures and their temporal variations. A confidence-based late fusion is proposed for combining the acoustic and visual features. The experimental results indicated that the proposed method achieved an accuracy improvement of approximately 14% and 2% on the world's largest benchmark dataset (MASD) and the Unique dataset, respectively. In particular, the proposed approach won the Music Information Retrieval Evaluation eXchange (MIREX) music genre classification contests from 2011 to 2013, demonstrating the feasibility and necessity of combining acoustic and visual features for classifying music genres.
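Confidence-based late fusion can be illustrated generically: each classifier's class-posterior vector is weighted by a confidence estimate (here simply its maximum posterior) before combining. This is a sketch of the idea; the paper's exact confidence measure and weighting may differ:

```python
def confidence_late_fusion(acoustic_probs, visual_probs):
    """Fuse two classifiers' class posteriors, weighting each classifier
    by its confidence (taken here as its maximum posterior). Returns the
    index of the predicted class. Generic sketch of confidence-based
    late fusion, not the paper's exact rule."""
    wa, wv = max(acoustic_probs), max(visual_probs)
    total = wa + wv
    fused = [(wa * a + wv * v) / total
             for a, v in zip(acoustic_probs, visual_probs)]
    return fused.index(max(fused))
```

When the visual classifier is highly confident (0.9) and the acoustic one lukewarm (0.6), the fused decision follows the visual vote.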

Journal ArticleDOI
TL;DR: A boosted multifeature learning (BMFL) approach to iteratively learn multiple representations within a boosting procedure for unsupervised domain adaptation, and demonstrates that the proposed BMFL algorithm performs favorably against state-of-the-art domain adaptation methods.
Abstract: Conventional learning algorithms assume that the training data and test data share a common distribution. However, this assumption greatly hinders the practical application of the learned model for cross-domain data analysis in multimedia. To deal with this issue, transfer-learning-based techniques should be adopted. As a typical form of transfer learning, domain adaptation has been extensively studied recently due to its theoretical value and practical interest. In this article, we propose a boosted multifeature learning (BMFL) approach to iteratively learn multiple representations within a boosting procedure for unsupervised domain adaptation. The proposed BMFL method has a number of properties. (1) It reuses all instances with different weights assigned by the previous boosting iteration and avoids discarding labeled instances as in conventional methods. (2) It models the instance weight distribution effectively by considering the classification error and the domain similarity, which facilitates learning new feature representations to correct the previously misclassified instances. (3) It learns multiple different feature representations to effectively bridge the source and target domains. We evaluate BMFL by comparing its performance on three applications: image classification, sentiment classification, and spam filtering. Extensive experimental results demonstrate that the proposed BMFL algorithm performs favorably against state-of-the-art domain adaptation methods.
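Property (2) above can be sketched with a hypothetical boosting-style weight update in the spirit of BMFL (not the paper's exact formula): misclassified instances are up-weighted exponentially, and instances judged more similar to the target domain receive larger weights, so the next learned representation focuses on hard, target-relevant examples:

```python
import math

def update_weights(weights, errors, similarities, eta=1.0):
    """Reweight source instances for the next boosting round.
    `errors` are 0/1 misclassification flags from the current round;
    `similarities` in (0, 1] score each instance's closeness to the
    target domain. Hypothetical update rule for illustration."""
    new = [w * math.exp(eta * e) * s
           for w, e, s in zip(weights, errors, similarities)]
    z = sum(new)
    return [w / z for w in new]  # renormalize to a distribution
```

Starting from uniform weights, a misclassified instance ends up with a strictly larger share of the distribution than a correctly classified one of equal domain similarity.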

Journal ArticleDOI
TL;DR: It is argued that the energy spent in designing autonomous camera control systems is not spent in vain, and two low-complexity servoing methods are presented that can compete, in terms of user experience, with recordings from an expert operator with several years of experience.
Abstract: In this article, we argue that the energy spent in designing autonomous camera control systems is not spent in vain. We present a real-time virtual camera system that can create smooth camera motion. Similar systems are frequently benchmarked with the human operator as the best possible reference; however, we avoid a priori assumptions in our evaluations. Our main question is simple: can we design algorithms to steer a virtual camera that compete, in terms of user experience, with recordings from an expert operator with several years of experience? In this respect, we present two low-complexity servoing methods that are explored in two user studies. The results from the user studies give a promising answer to the question pursued. Furthermore, all components of the system meet the real-time requirements on commodity hardware. The growing capabilities of both hardware and network in mobile devices give us hope that this system can be deployed to mobile users in the near future. Moreover, the design of the presented system takes into account that services to concurrent users must be supported.
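As a flavor of what "low-complexity servoing" for smooth camera motion can mean, consider a generic proportional controller with a speed cap: each frame the camera moves a fraction of the remaining distance to its target, never faster than a limit, which yields smooth ease-in motion. This is a hypothetical controller for illustration, not one of the two methods evaluated in the article:

```python
def servo_step(pos, target, gain=0.1, max_step=5.0):
    """One frame of proportional camera servoing along one axis:
    move `gain` of the remaining error toward `target`, capped at
    `max_step` units per frame to bound camera speed."""
    step = gain * (target - pos)
    if abs(step) > max_step:
        step = max_step if step > 0 else -max_step
    return pos + step
```

Far from the target, the cap dominates and the camera pans at constant speed; near the target, the proportional term takes over and motion decelerates smoothly.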

Journal ArticleDOI
TL;DR: A hybrid retrieval method based on the integration of the visual (content) and geographic (context) information, which is shown to achieve significant improvements in the authors' experiments is proposed.
Abstract: Due to the ubiquity of sensor-equipped smartphones, it has become increasingly feasible for users to capture videos together with associated geographic metadata, for example, the location and the orientation of the camera. Such contextual information creates new opportunities for the organization and retrieval of geo-referenced videos. In this study we explore the task of landmark retrieval through the analysis of two types of state-of-the-art techniques, namely media-content-based and geocontext-based retrievals. For the content-based method, we choose the Spatial Pyramid Matching (SPM) approach combined with two advanced coding methods: Sparse Coding (SC) and Locality-Constrained Linear Coding (LLC). For the geo-based method, we present the Geo Landmark Visibility Determination (GeoLVD) approach, which computes the visibility of a landmark based on intersections of a camera's field-of-view (FOV) and the landmark's geometric information available from Geographic Information Systems (GIS) and services. We first compare the retrieval results of the two methods, and discuss the strengths and weaknesses of each approach in terms of precision, recall, and execution time. Next we analyze the factors that affect the effectiveness of the content-based and the geo-based methods, respectively. Finally we propose a hybrid retrieval method based on the integration of the visual (content) and geographic (context) information, which is shown to achieve significant improvements in our experiments. We believe that the results and observations in this work will inform the design of future geo-referenced video retrieval systems, improve our understanding of selecting the most appropriate visual features for indexing and searching, and help in selecting the most suitable methods for retrieval under different conditions.
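The geo-based visibility test can be sketched in simplified planar form: a landmark (reduced here to a single point, though GIS footprints are polygons) is visible if it falls inside the camera's viewable sector, defined by position, compass heading, FOV angle, and maximum visible range. This is an illustration of the GeoLVD idea, not the paper's full implementation:

```python
import math

def landmark_visible(cam, heading_deg, fov_deg, max_range, landmark):
    """Planar FOV-sector test: True if `landmark` (x, y) lies within
    `max_range` of camera position `cam` and within half the FOV angle
    of the camera's compass heading (0 degrees = north)."""
    dx, dy = landmark[0] - cam[0], landmark[1] - cam[1]
    if math.hypot(dx, dy) > max_range:
        return False
    bearing = math.degrees(math.atan2(dx, dy)) % 360
    diff = (bearing - heading_deg + 180) % 360 - 180  # signed angle in [-180, 180)
    return abs(diff) <= fov_deg / 2
```

A camera at the origin facing north with a 60-degree FOV sees a landmark 50 m due north, but not one due east or one beyond its visible range.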

Journal ArticleDOI
TL;DR: The results show that the proposed framework learns a novel computational model that effectively encodes an inter-user definition of pleasantness and generalizes well to new photo datasets of different themes and sizes not used in the learning.
Abstract: In this article, we consider how to automatically create pleasing photo collages by placing a set of images on a limited canvas area. The task is formulated as an optimization problem. Unlike existing state-of-the-art approaches, we exploit subjective experiments to model and learn pleasantness from user preferences. To this end, we design an experimental framework for the identification of the criteria that need to be taken into account to generate a pleasing photo collage. Five different thematic photo datasets are used to create collages using state-of-the-art criteria. A first subjective experiment, in which several subjects evaluated the collages, shows that different criteria are involved in the subjective definition of pleasantness. We then identify new global and local criteria and design algorithms to quantify them. The relative importance of these criteria is automatically learned by exploiting the user preferences, and new collages are generated. To validate our framework, we performed several psycho-visual experiments involving different users. The results show that the proposed framework learns a novel computational model that effectively encodes an inter-user definition of pleasantness. The learned definition of pleasantness generalizes well to new photo datasets of different themes and sizes not used in the learning. Moreover, compared with two state-of-the-art approaches, the collages created using our framework are preferred by the majority of the users.

Journal ArticleDOI
TL;DR: Simulation results for LTE uplink transmission show that significant gains in perceived video quality can be achieved by the cross-layer resource optimization scheme and the distributed optimization at the mobile producers can further improve the user experience across the different types of video consumers.
Abstract: We study the problem of resource-efficient uplink distribution of user-generated video content over fourth-generation mobile networks. This is challenged by (1) the capacity-limited and time-variant uplink channel, (2) the resource-hungry upstreamed videos and their dynamically changing complexity, and (3) the different playout times of the video consumers. To address these issues, we propose a systematic approach for quality-of-experience (QoE)-based resource optimization and uplink transmission of multiuser generated video content. More specifically, we present an analytical model for distributed scalable video transmission at the mobile producers which considers these constraints. This is complemented by a multiuser cross-layer optimizer in the mobile network which determines the transmission capacity for each mobile terminal under current cell load and radio conditions. Both optimal and low-complexity solutions are presented. Simulation results for LTE uplink transmission show that significant gains in perceived video quality can be achieved by our cross-layer resource optimization scheme. In addition, the distributed optimization at the mobile producers can further improve the user experience across the different types of video consumers.

Journal ArticleDOI
TL;DR: This article proposes a framework for creating user-preference-aware music medleys from users' music collections that treats the medley generation process as an audio version of a musical dice game.
Abstract: This article proposes a framework for creating user-preference-aware music medleys from users' music collections. We treat the medley generation process as an audio version of a musical dice game. Once the user's collection has been analyzed, the system is able to generate various pleasing medleys. This flexibility allows users to create medleys according to specified conditions, such as the medley structure or the must-use clips. Even users without musical knowledge can compose medley songs from their favorite tracks. The effectiveness of the system has been evaluated through both objective and subjective experiments on the individual components of the system.
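The "musical dice game" metaphor can be sketched as a stochastic chain over audio clips: at each step the next clip is drawn with probability proportional to a pairwise compatibility score (for example, key and tempo agreement). This is an illustrative sketch only; the paper's analysis and transition model may differ:

```python
import random

def generate_medley(clips, compat, length, start=0, seed=42):
    """Chain `length` clips together, starting from `clips[start]`,
    drawing each successor with probability proportional to the
    compatibility score compat[current][next]. Hypothetical scores;
    a real system would derive them from audio analysis."""
    rng = random.Random(seed)  # seeded for reproducible medleys
    medley, cur = [clips[start]], start
    for _ in range(length - 1):
        weights = [compat[cur][j] for j in range(len(clips))]
        cur = rng.choices(range(len(clips)), weights=weights)[0]
        medley.append(clips[cur])
    return medley
```

With two clips whose only nonzero compatibility is with each other, the generated medley simply alternates between them; richer score matrices yield varied but musically coherent sequences.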

Journal ArticleDOI
TL;DR: A tool to simulate digital video re-acquisition using a digital video camera, which allows for an easy setup and calibration of all the simulated artifacts, using sample/reference pairs of images and videos.
Abstract: To support the development of any system that includes the generation and evaluation of camcorder copies, as well as to provide a common benchmark for robustness against camcorder copies, we present a tool to simulate digital video re-acquisition using a digital video camera. By resampling each video frame, we simulate the typical artifacts occurring in a camcorder copy: geometric modifications (aspect ratio changes, cropping, perspective and lens distortion), temporal sampling artifacts (due to different frame rates, shutter speeds, rolling shutters, or playback), spatial and color subsampling (rescaling, filtering, Bayer color filter array), and processing steps (automatic gain control, automatic white balance). We also support the simulation of camera movement (e.g., a hand-held camera) and background insertion. Furthermore, we allow for an easy setup and calibration of all the simulated artifacts, using sample/reference pairs of images and videos. Specifically, temporal subsampling effects are analyzed in detail to create realistic frame blending artifacts in the simulated copies. We carefully evaluated our entire camcorder simulation system and found that the models we developed describe and match the real artifacts quite well.
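The frame-blending artifact mentioned above arises when the recording camera's exposure straddles two source frames; it can be approximated by linearly blending the neighboring source frames at each fractional source timestamp. A minimal sketch of that idea, with frames represented as flat lists of pixel values (not the tool's actual model):

```python
import math

def resample_frame(frames, t_src):
    """Produce a re-acquired frame at fractional source time `t_src`
    by linearly blending the two neighboring source frames, mimicking
    frame-blending artifacts from mismatched frame rates or shutter."""
    i = int(math.floor(t_src))
    frac = t_src - i
    a = frames[i]
    b = frames[min(i + 1, len(frames) - 1)]  # clamp at the last frame
    return [(1 - frac) * pa + frac * pb for pa, pb in zip(a, b)]
```

Capturing exactly between two frames yields a 50/50 ghosted blend, the hallmark artifact of camcordered copies at mismatched frame rates.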

Journal ArticleDOI
TL;DR: This work proposes Area of Simulation (AoS), a new scalability mechanism, which combines and extends the mechanisms of AoI and EBLS, and designs an AoS-based architecture, which can operate with an order of magnitude more avatars and a larger virtual world without exceeding the resource capacity of players' computers.
Abstract: Although Multi-Avatar Distributed Virtual Environments (MAVEs) such as Real-Time Strategy (RTS) games entertain hundreds of millions of online players daily, their current designs do not scale. For example, even popular RTS games such as the StarCraft series support in a single game instance only up to 16 players and only a few hundred avatars loosely controlled by these players, which is a consequence of the Event-Based Lockstep Simulation (EBLS) scalability mechanism they employ. Through empirical analysis, we show that a single Area of Interest (AoI), which is a scalability mechanism that is sufficient for single-avatar virtual environments (such as Role-Playing Games), also cannot meet the scalability demands of MAVEs. To enable scalable MAVEs, in this work we propose Area of Simulation (AoS), a new scalability mechanism, which combines and extends the mechanisms of AoI and EBLS. Unlike traditional AoI approaches, which employ only update-based operational models, our AoS mechanism uses both event-based and update-based operational models to manage not single, but multiple areas of interest. Unlike EBLS, which is traditionally used to synchronize the entire virtual world, our AoS mechanism synchronizes only selected areas of the virtual world. We further design an AoS-based architecture, which is able to use both our AoS and traditional AoI mechanisms simultaneously, dynamically trading off consistency guarantees for scalability. We implement and deploy this architecture, and we demonstrate that it can operate with an order of magnitude more avatars and a larger virtual world without exceeding the resource capacity of players' computers.

Journal ArticleDOI
TL;DR: A content-aware (CA) priority marking and layer dropping scheme is proposed, which is lightweight both in terms of architecture and computation and very close to the performance obtained with the computation and signaling-intensive QoE optimization schemes.
Abstract: The increasing popularity of mobile video streaming applications has led to a high volume of video traffic in mobile networks. As the base station (for instance, the eNB in LTE networks) has limited physical resources, it can be overloaded by this traffic. This problem can be addressed by using Scalable Video Coding (SVC), which allows the eNB to drop layers of the video streams to dynamically adapt the bitrate. The impact of bitrate adaptation on the Quality of Experience (QoE) for the users depends on the content characteristics of videos. As the current mobile network architectures do not support the eNB in obtaining video content information, QoE optimization schemes with explicit signaling of content information have been proposed. These schemes, however, require the eNB or a specific optimization module to process the video content on the fly in order to extract the required information. This increases the computation and signaling overhead significantly, raising the OPEX for mobile operators. To address this issue, in this article, a content-aware (CA) priority marking and layer dropping scheme is proposed. The CA priority indicates a transmission order for the layers of all transmitted videos across all users, resulting from a comparison of their utility versus rate characteristics. The CA priority values can be determined at the P-GW on the fly, allowing mobile operators to control the priority marking process. Alternatively, they can be determined offline at the video servers, avoiding real-time computation in the core network. The eNB can perform content-aware SVC layer dropping using only the priority values. No additional content processing is required. The proposed scheme is lightweight both in terms of architecture and computation. The improvement in QoE is substantial and very close to the performance obtained with the computation- and signaling-intensive QoE optimization schemes.
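The priority-marking idea of ranking layers by their utility-versus-rate characteristics can be sketched as a greedy ordering: every SVC layer of every video is scored by its marginal utility per bit, and layers are marked in decreasing order of that score, so the eNB can drop from the lowest-priority end upward. This is an illustrative sketch (inter-layer decoding dependencies are ignored for brevity) and not the paper's exact marking rule:

```python
def mark_priorities(videos):
    """Content-aware priority marking sketch. `videos` maps a video id
    to a list of (utility_gain, rate) tuples, one per SVC layer. Layers
    are ranked across all videos by marginal utility per bit; priority
    0 is transmitted first, and the eNB drops from the highest-numbered
    priority upward under overload."""
    layers = [(vid, idx, gain / rate)
              for vid, layer_list in videos.items()
              for idx, (gain, rate) in enumerate(layer_list)]
    layers.sort(key=lambda t: t[2], reverse=True)
    return {(vid, idx): prio for prio, (vid, idx, _) in enumerate(layers)}
```

With hypothetical numbers, a base layer delivering 10 utility units per bit outranks another video's base layer at 5, which in turn outranks the first video's enhancement layer at 2; in practice base layers tend to dominate the ranking, which keeps the ordering decodable.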