Showing papers in "IEEE Transactions on Multimedia in 2013"
TL;DR: A novel distance metric, Finger-Earth Mover's Distance (FEMD), is proposed; because it matches only the finger parts rather than the whole hand, it can better distinguish hand gestures with slight differences.
Abstract: The recently developed depth sensors, e.g., the Kinect sensor, have provided new opportunities for human-computer interaction (HCI). Although great progress has been made by leveraging the Kinect sensor, e.g., in human body tracking, face recognition and human action recognition, robust hand gesture recognition remains an open problem. Compared to the entire human body, the hand is a smaller object with more complex articulations and is more easily affected by segmentation errors. It is thus a very challenging problem to recognize hand gestures. This paper focuses on building a robust part-based hand gesture recognition system using the Kinect sensor. To handle the noisy hand shapes obtained from the Kinect sensor, we propose a novel distance metric, Finger-Earth Mover's Distance (FEMD), to measure the dissimilarity between hand shapes. As it matches only the finger parts rather than the whole hand, it can better distinguish hand gestures with slight differences. Extensive experiments demonstrate that our hand gesture recognition system is accurate (a 93.2% mean accuracy on a challenging 10-gesture dataset), efficient (0.0750 s per frame on average), robust to hand articulations, distortions and orientation or scale changes, and can work in uncontrolled environments (cluttered backgrounds and varying lighting conditions). The superiority of our system is further demonstrated in two real-life HCI applications.
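The paper's FEMD itself is defined over finger-part signatures; as a loose, simplified stand-in (not the paper's exact metric), the following sketch scores hand dissimilarity as a 1-D earth mover's distance between hypothetical finger-angle signatures, weighted by finger lengths, with an extra penalty for unmatched finger mass:

```python
# A loose sketch of an EMD-style shape distance, NOT the paper's exact FEMD.
import numpy as np
from scipy.stats import wasserstein_distance

def finger_emd(angles_a, lengths_a, angles_b, lengths_b, penalty=0.5):
    """Hypothetical FEMD-like score: EMD over detected finger angles,
    weighted by finger lengths, plus a penalty for unmatched fingers."""
    emd = wasserstein_distance(angles_a, angles_b,
                               u_weights=lengths_a, v_weights=lengths_b)
    # Penalize the difference in total finger "mass" (e.g., a missing finger).
    return emd + penalty * abs(np.sum(lengths_a) - np.sum(lengths_b))

# Two gestures: three fingers vs. two fingers (angles in radians).
g1 = (np.array([0.5, 1.2, 2.0]), np.array([1.0, 1.0, 0.8]))
g2 = (np.array([0.6, 1.9]), np.array([1.0, 0.9]))
print(finger_emd(*g1, *g2))
```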
693 citations
TL;DR: It is found that standard natural language processing techniques can perform well for social streams on very focused topics, but novel techniques designed to mine the temporal distribution of concepts are needed to handle more heterogeneous streams containing multiple stories evolving in parallel.
Abstract: Online social and news media generate rich and timely information about real-world events of all kinds. However, the huge amount of data available, along with the breadth of the user base, requires a substantial effort of information filtering to successfully drill down to relevant topics and events. Trending topic detection is therefore a fundamental building block to monitor and summarize information originating from social sources. A wide variety of methods and variables exist, and they greatly affect the quality of results. We compare six topic detection methods on three Twitter datasets related to major events, which differ in their time scale and topic churn rate. We observe how the nature of the event considered, the volume of activity over time, the sampling procedure and the pre-processing of the data all greatly affect the quality of detected topics, which also depends on the type of detection method used. We find that standard natural language processing techniques can perform well for social streams on very focused topics, but novel techniques designed to mine the temporal distribution of concepts are needed to handle more heterogeneous streams containing multiple stories evolving in parallel. One of the novel topic detection methods we propose, based on n-gram co-occurrence and topic ranking, consistently achieves the best performance across all these conditions, thus being more reliable than other state-of-the-art techniques.
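To make the co-occurrence idea concrete, here is a minimal sketch (not the paper's algorithm) that surfaces candidate topics as the most frequently co-occurring term pairs in a batch of tweets; a real system would add normalization, stopword filtering, and temporal ranking:

```python
# Minimal co-occurrence topic sketch over a toy tweet batch.
from collections import Counter
from itertools import combinations

def cooccurrence_topics(tweets, top_k=5):
    pair_counts = Counter()
    for text in tweets:
        # Crude tokenization; real pipelines normalize and drop stopwords.
        terms = sorted(set(w.lower() for w in text.split() if len(w) > 3))
        pair_counts.update(combinations(terms, 2))
    return pair_counts.most_common(top_k)

tweets = ["Goal! Spain scores in the final",
          "Spain scores again, what a final",
          "Traffic jam downtown this morning"]
print(cooccurrence_topics(tweets))
```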
423 citations
TL;DR: A fast CU size decision algorithm for HM that can significantly reduce computational complexity while maintaining almost the same RD performance as the original HEVC encoder is proposed.
Abstract: The emerging high efficiency video coding standard (HEVC) adopts the quadtree-structured coding unit (CU). Each CU allows recursive splitting into four equal sub-CUs. At each depth level (CU size), the test model of HEVC (HM) performs motion estimation (ME) with different sizes including 2N × 2N, 2N × N, N × 2N and N × N. The ME process in HM is performed over all possible depth levels and prediction modes to find the one with the least rate-distortion (RD) cost using a Lagrange multiplier. This achieves the highest coding efficiency but requires very high computational complexity. In this paper, we propose a fast CU size decision algorithm for HM. Since the optimal depth level is highly content-dependent, it is not efficient to use all levels. We can determine the CU depth range (including the minimum and maximum depth levels) and skip specific depth levels rarely used in the previous frame and in neighboring CUs. In addition, the proposed algorithm introduces early termination methods based on motion homogeneity checking, RD cost checking and SKIP mode checking to skip ME on unnecessary CU sizes. Experimental results demonstrate that the proposed algorithm can significantly reduce computational complexity while maintaining almost the same RD performance as the original HEVC encoder.
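An illustrative sketch of the two pruning ideas (the function names, stub RD costs and threshold are hypothetical, not the paper's): restrict the depth search range from the co-located and neighboring CUs, then early-terminate when the SKIP-mode cost is already low:

```python
def rd_cost_skip(cu, depth):            # stubs standing in for the encoder's
    return 8.0 - depth                  # real RD-cost computations
def rd_cost_me(cu, depth, mode):
    return 5.0 + depth * 0.5

def cu_depth_range(colocated_depth, neighbor_depths):
    depths = [colocated_depth] + list(neighbor_depths)
    return max(0, min(depths) - 1), min(3, max(depths) + 1)

def encode_cu(cu, skip_rd_threshold=6.0):
    min_d, max_d = cu_depth_range(cu["coloc_depth"], cu["neighbor_depths"])
    best = None
    for depth in range(min_d, max_d + 1):      # pruned depth range
        cost = rd_cost_skip(cu, depth)
        if cost < skip_rd_threshold:           # SKIP-mode early termination
            return (depth, "SKIP", cost)
        for mode in ("2Nx2N", "2NxN", "Nx2N", "NxN"):
            cost = rd_cost_me(cu, depth, mode)
            if best is None or cost < best[2]:
                best = (depth, mode, cost)
    return best

print(encode_cu({"coloc_depth": 1, "neighbor_depths": [1, 2]}))
```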
406 citations
TL;DR: A novel fusion framework is proposed for multimodal medical images based on non-subsampled contourlet transform (NSCT) to enable more accurate analysis of multimodality images.
Abstract: Multimodal medical image fusion, as a powerful tool for clinical applications, has developed with the advent of various imaging modalities in medical imaging. The main motivation is to capture the most relevant information from the sources into a single output, which plays an important role in medical diagnosis. In this paper, a novel fusion framework is proposed for multimodal medical images based on the non-subsampled contourlet transform (NSCT). The source medical images are first transformed by NSCT, followed by combining their low- and high-frequency components. Two different fusion rules based on phase congruency and directive contrast are proposed and used to fuse the low- and high-frequency coefficients. Finally, the fused image is constructed by the inverse NSCT with all composite coefficients. Experimental results and a comparative study show that the proposed fusion framework provides an effective way to enable more accurate analysis of multimodality images. Further, the applicability of the proposed framework is demonstrated on three clinical examples of patients affected by Alzheimer's disease, subacute stroke and recurrent tumor.
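NSCT implementations are not in common Python libraries, so as a simplified stand-in the sketch below fuses two registered images in an ordinary wavelet domain: low-frequency bands are averaged and high-frequency coefficients are taken by maximum absolute value, a crude proxy for the paper's phase-congruency and directive-contrast rules:

```python
# Simplified wavelet-domain fusion (a stand-in for the paper's NSCT rules).
import numpy as np
import pywt

def wavelet_fuse(img_a, img_b, wavelet="db2", level=2):
    ca = pywt.wavedec2(img_a, wavelet, level=level)
    cb = pywt.wavedec2(img_b, wavelet, level=level)
    fused = [(ca[0] + cb[0]) / 2.0]                    # low frequency: average
    for (ha, va, da), (hb, vb, db) in zip(ca[1:], cb[1:]):
        fused.append(tuple(np.where(np.abs(x) >= np.abs(y), x, y)  # max-abs
                           for x, y in ((ha, hb), (va, vb), (da, db))))
    return pywt.waverec2(fused, wavelet)

a, b = np.random.rand(64, 64), np.random.rand(64, 64)  # stand-in modalities
print(wavelet_fuse(a, b).shape)
```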
381 citations
TL;DR: This paper proposes content-based photo quality assessment using both regional and global features, together with an approach that trains an adaptive classifier online to combine the proposed features according to the visual content of a test photo without knowing its category.
Abstract: Automatically assessing photo quality from the perspective of visual aesthetics is of great interest in high-level vision research and has drawn much attention in recent years. In this paper, we propose content-based photo quality assessment using both regional and global features. Under this framework, subject areas, which draw the most attention from human eyes, are first extracted. Then regional features extracted from both subject areas and background regions are combined with global features to assess photo quality. Since professional photographers adopt different photographic techniques and have different aesthetic criteria in mind when taking different types of photos (e.g., landscape versus portrait), we propose to segment subject areas and extract visual features in different ways according to the variety of photo content. We divide the photos into seven categories based on their visual content and develop a set of new subject area extraction methods and new visual features specially designed for the different categories. The effectiveness of this framework is supported by extensive experimental comparisons of existing photo quality assessment approaches, as well as our new features, on different categories of photos. In addition, we propose an approach that trains an adaptive classifier online to combine the proposed features according to the visual content of a test photo, without knowing its category. Another contribution of this work is a large and diversified benchmark dataset for the research of photo quality assessment. It includes 17,673 photos with manually labeled ground truth. This new benchmark dataset can be downloaded at http://mmlab.ie.cuhk.edu.hk/CUHKPQ/Dataset.htm.
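A hedged sketch of the "online adaptive classifier" idea only (the features, labels and split sizes below are invented): regional and global feature vectors are concatenated and a linear classifier is updated incrementally, so the combination can adapt per test photo:

```python
# Sketch of online training to combine regional + global features.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
regional = rng.normal(size=(200, 8))     # stand-in regional features
global_f = rng.normal(size=(200, 4))     # stand-in global features
X = np.hstack([regional, global_f])
y = (X[:, 0] + X[:, 8] > 0).astype(int)  # stand-in quality labels

clf = SGDClassifier(loss="log_loss")
for i in range(0, len(X), 50):           # online / incremental updates
    clf.partial_fit(X[i:i+50], y[i:i+50], classes=[0, 1])
print(clf.score(X, y))
```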
300 citations
TL;DR: It is found that the links to related videos generated by uploaders' choices form a small-world network, which suggests that the videos have strong correlations with each other, and creates opportunities for developing novel caching and peer-to-peer distribution schemes to efficiently deliver videos to end users.
Abstract: Established in 2005, YouTube has become the most successful Internet website providing a new generation of short video sharing service. Today, YouTube alone consumes as much bandwidth as did the entire Internet in the year 2000. Understanding the features of YouTube and similar video sharing sites is thus crucial to their sustainable development and to network traffic engineering. In this paper, using traces crawled over a 1.5-year span (from February 2007 to September 2008), we present an in-depth and systematic measurement study of the characteristics of YouTube videos. We find that YouTube videos have noticeably different statistics compared to traditional streaming videos, ranging from length and access pattern to their active life span. The series of datasets also allows us to identify the growth trend of this fast-evolving Internet site, which has seldom been explored before. We also look closely at the social networking aspect of YouTube, as this is a key driving force behind its success. In particular, we find that the links to related videos generated by uploaders' choices form a small-world network. This suggests that the videos have strong correlations with each other, and it creates opportunities for developing novel caching and peer-to-peer distribution schemes to efficiently deliver videos to end users.
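The small-world property can be checked with networkx: such a graph has a clustering coefficient far above a same-density random graph while keeping a comparably short average path length. The sketch below uses a synthetic Watts-Strogatz graph as a stand-in for the crawled related-video graph:

```python
# Small-world check on a toy stand-in for the related-video graph.
import networkx as nx

def avg_path(g):  # measure on the largest connected component
    comp = g.subgraph(max(nx.connected_components(g), key=len))
    return nx.average_shortest_path_length(comp)

G = nx.connected_watts_strogatz_graph(1000, 10, 0.05, seed=0)
R = nx.gnm_random_graph(G.number_of_nodes(), G.number_of_edges(), seed=1)

print("clustering:", nx.average_clustering(G), "vs random:", nx.average_clustering(R))
print("avg path:  ", avg_path(G), "vs random:", avg_path(R))
```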
242 citations
TL;DR: A novel approach, Multiple Feature Hashing (MFH), is proposed to tackle both the accuracy and the scalability issues of NDVR; experiments show that it outperforms state-of-the-art techniques in both accuracy and efficiency.
Abstract: Near-duplicate video retrieval (NDVR) has recently attracted much research attention due to the exponential growth of online videos. It has many applications, such as copyright protection, automatic video tagging and online video monitoring. Many existing approaches use only a single feature to represent a video for NDVR. However, a single feature is often insufficient to characterize the video content. Moreover, while accuracy has been the main concern in the previous literature, the scalability of NDVR algorithms for large-scale video datasets has rarely been addressed. In this paper, we present a novel approach, Multiple Feature Hashing (MFH), to tackle both the accuracy and the scalability issues of NDVR. MFH preserves the local structural information of each individual feature and also globally considers the local structures for all the features to learn a group of hash functions, which map the video keyframes into the Hamming space and generate a series of binary codes to represent the video dataset. We evaluate our approach on a public video dataset and a large-scale video dataset consisting of 132,647 videos that we collected from YouTube. This dataset has been released (http://itee.uq.edu.au/shenht/UQ_VIDEO/). The experimental results show that the proposed method outperforms the state-of-the-art techniques in both accuracy and efficiency.
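As a much-simplified stand-in for MFH (whose hash functions are learned, not random), the sketch below hashes each feature type with random hyperplanes and concatenates the per-feature codes, so every feature contributes to the final Hamming code used for search:

```python
# Simplified multi-feature hashing: random hyperplanes per feature type.
import numpy as np

rng = np.random.default_rng(0)

def hash_feature(X, n_bits):
    W = rng.normal(size=(X.shape[1], n_bits))    # random hyperplanes
    return (X @ W > 0).astype(np.uint8)

color_hist = rng.normal(size=(1000, 64))          # stand-in feature 1
lbp        = rng.normal(size=(1000, 32))          # stand-in feature 2
codes = np.hstack([hash_feature(color_hist, 16), hash_feature(lbp, 16)])

query = codes[0]
hamming = np.count_nonzero(codes != query, axis=1)
print("nearest keyframes:", np.argsort(hamming)[:5])
```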
235 citations
TL;DR: A novel saliency detection model is introduced by utilizing low-level features obtained from the wavelet transform domain to modulate local contrast at a location with its global saliency computed based on the likelihood of the features.
Abstract: Researchers have been taking advantage of visual attention in various image processing applications such as image retargeting, video coding, etc. Recently, many saliency detection algorithms have been proposed that extract features in spatial or transform domains. In this paper, a novel saliency detection model is introduced that utilizes low-level features obtained from the wavelet transform domain. First, the wavelet transform is employed to create multi-scale feature maps that can represent different features, from edge to texture. Then, we propose a computational model for the saliency map from these features. The proposed model modulates local contrast at a location with its global saliency, computed based on the likelihood of the features, and it considers both local center-surround differences and global contrast in the final saliency map. Experimental evaluation shows promising results, with the proposed model outperforming the relevant state-of-the-art saliency detection models.
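A compact, simplified sketch of the multi-scale step only (not the paper's full likelihood model): detail-band energies at several wavelet scales are upsampled to image size and summed, so both fine edges and coarser texture contrast contribute to the map:

```python
# Crude wavelet-energy saliency map over multiple scales.
import numpy as np
import pywt
from scipy.ndimage import zoom

def wavelet_saliency(img, wavelet="db2", levels=3):
    coeffs = pywt.wavedec2(img, wavelet, level=levels)
    saliency = np.zeros_like(img, dtype=float)
    for (h, v, d) in coeffs[1:]:                      # detail bands per scale
        energy = h**2 + v**2 + d**2
        factors = (img.shape[0] / energy.shape[0],
                   img.shape[1] / energy.shape[1])
        saliency += zoom(energy, factors, order=1)    # upsample to image size
    return saliency / saliency.max()

img = np.random.rand(128, 128)
print(wavelet_saliency(img).shape)
```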
231 citations
TL;DR: Detection of attention-invoking audiovisual segments is formulated in this work on the basis of saliency models for the audio, visual, and textual information conveyed in a video stream, forming the basis for a generic, bottom-up video summarization algorithm.
Abstract: Multimodal streams of sensory information are naturally parsed and integrated by humans using signal-level feature extraction and higher-level cognitive processes. Detection of attention-invoking audiovisual segments is formulated in this work on the basis of saliency models for the audio, visual, and textual information conveyed in a video stream. Aural or auditory saliency is assessed by cues that quantify multifrequency waveform modulations, extracted through nonlinear operators and energy tracking. Visual saliency is measured through a spatiotemporal attention model driven by intensity, color, and orientation. Textual or linguistic saliency is extracted from part-of-speech tagging on the subtitle information available with most movie distributions. The individual saliency streams, obtained from modality-dependent cues, are integrated into a multimodal saliency curve, modeling the time-varying perceptual importance of the composite video stream and signifying prevailing sensory events. The multimodal saliency representation forms the basis of a generic, bottom-up video summarization algorithm. Different fusion schemes are evaluated on a movie database of multimodal saliency annotations, with comparative results provided across modalities. The produced summaries, based on low-level features and content-independent fusion and selection, are of subjectively high aesthetic and informative quality.
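A toy sketch of the fusion step only: per-modality saliency curves are normalized and linearly combined, and summary segments are then the highest-saliency frames. The weights and curves below are invented, not one of the paper's evaluated fusion schemes:

```python
# Linear fusion of normalized per-modality saliency curves.
import numpy as np

def fuse_saliency(audio, visual, text, w=(0.4, 0.4, 0.2)):
    norm = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-9)
    return w[0]*norm(audio) + w[1]*norm(visual) + w[2]*norm(text)

t = np.linspace(0, 10, 500)
curve = fuse_saliency(np.abs(np.sin(3*t)), np.abs(np.cos(2*t)), np.ones_like(t))
summary_frames = np.argsort(curve)[-50:]     # keep the top 10% salient frames
print(sorted(summary_frames)[:10])
```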
229 citations
TL;DR: An analytical framework for optimization of forwarding decisions at the control layer to enable dynamic Quality of Service (QoS) over OpenFlow networks is presented and application of this framework to QoS-enabled streaming of scalable encoded videos with two QoS levels is discussed.
Abstract: OpenFlow is a programmable network protocol and associated hardware designed to effectively manage and direct traffic by decoupling the control and forwarding layers of routing. This paper presents an analytical framework for optimization of forwarding decisions at the control layer to enable dynamic Quality of Service (QoS) over OpenFlow networks and discusses the application of this framework to QoS-enabled streaming of scalable encoded videos with two QoS levels. We pose and solve the optimization of dynamic QoS routing as a constrained shortest path problem, where we treat the base layer of scalable encoded video as a level-1 QoS flow, while the enhancement layers can be treated as level-2 QoS or best-effort flows. We provide experimental results which show that the proposed dynamic QoS framework achieves significant improvement in the overall quality of streaming of scalable encoded videos under various coding configurations and network congestion scenarios.
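The constrained shortest path at the core of such QoS routing can be illustrated with a small label-setting search: minimize path cost subject to a delay budget. The graph, costs and delays below are made up for illustration, not the paper's formulation:

```python
# Minimal constrained-shortest-path sketch: min cost s.t. delay <= budget.
import heapq

def csp(graph, src, dst, max_delay):
    heap, best = [(0, 0, src)], {}          # labels: (cost, delay, node)
    while heap:
        cost, delay, node = heapq.heappop(heap)
        if node == dst:
            return cost, delay
        if best.get(node, float("inf")) <= delay:
            continue                        # dominated by a cheaper label
        best[node] = delay
        for nxt, (c, d) in graph.get(node, {}).items():
            if delay + d <= max_delay:
                heapq.heappush(heap, (cost + c, delay + d, nxt))
    return None

# edges: node -> {neighbor: (cost, delay)}
g = {"A": {"B": (1, 5), "C": (4, 1)}, "B": {"D": (1, 5)}, "C": {"D": (4, 1)}}
print(csp(g, "A", "D", max_delay=4))        # forced onto the low-delay path
```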
217 citations
TL;DR: A new multi-task feature selection algorithm is proposed and applied to multimedia (e.g., video and image) analysis, which enables the common knowledge of multiple tasks as supplementary information to facilitate decision making.
Abstract: While much progress has been made in multi-task classification and subspace learning, multi-task feature selection has long remained largely unaddressed. In this paper, we propose a new multi-task feature selection algorithm and apply it to multimedia (e.g., video and image) analysis. Instead of evaluating the importance of each feature individually, our algorithm selects features in a batch mode, by which the feature correlation is considered. While feature selection has received much research attention, less effort has been made to improve the performance of feature selection by leveraging the shared knowledge from multiple related tasks. Our algorithm builds upon the assumption that different related tasks have common structures. Multiple feature selection functions of different tasks are simultaneously learned in a joint framework, which enables our algorithm to utilize the common knowledge of multiple tasks as supplementary information to facilitate decision making. An efficient iterative algorithm with guaranteed convergence is proposed for the optimization. Experiments on different databases have demonstrated the effectiveness of the proposed algorithm.
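The batch-mode, joint-selection idea can be loosely illustrated (it is not the paper's formulation) with an l2,1-style penalty: scikit-learn's MultiTaskLasso zeroes out entire feature rows across all tasks at once, so related tasks share one selected feature subset:

```python
# Joint feature selection across related tasks via an l2,1-style penalty.
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
W = np.zeros((20, 3))
W[[2, 5, 7], :] = rng.normal(size=(3, 3))   # 3 features shared by all tasks
Y = X @ W + 0.01 * rng.normal(size=(100, 3))  # 3 related tasks

model = MultiTaskLasso(alpha=0.1).fit(X, Y)
row_norms = np.linalg.norm(model.coef_.T, axis=1)   # one norm per feature
print("jointly selected features:", np.flatnonzero(row_norms > 1e-6))
# features 2, 5 and 7 should dominate
```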
TL;DR: The optimal rule of value modification under a payload-distortion criterion is found by using an iterative procedure, and a practical reversible data hiding scheme is proposed, where the secret data are carried by the differences between the original pixel-values and the corresponding values estimated from the neighbors.
Abstract: In reversible data hiding techniques, the values of the host data are modified according to some particular rules, and the original host content can be perfectly restored after extraction of the hidden data on the receiver side. In this paper, the optimal rule of value modification under a payload-distortion criterion is found by using an iterative procedure, and a practical reversible data hiding scheme is proposed. The secret data, as well as the auxiliary information used for content recovery, are carried by the differences between the original pixel values and the corresponding values estimated from their neighbors. Here, the estimation errors are modified according to the optimal value transfer rule. Also, the host image is divided into a number of pixel subsets, and the auxiliary information of a subset is always embedded into the estimation errors in the next subset. A receiver can successfully extract the embedded secret data and recover the original content of the subsets in inverse order. In this way, good reversible data hiding performance is achieved.
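A toy sketch of prediction-error embedding, much simplified relative to the paper's optimal transfer rule: the predictor is each pixel's row anchor in column 0, which is kept unmodified so extraction can reuse it; overflow handling and the payload-length bookkeeping are omitted:

```python
# Toy reversible embedding into prediction errors (histogram-shift style).
import numpy as np

def embed(img, bits):
    out, k = img.astype(int), 0
    for i in range(img.shape[0]):
        for j in range(1, img.shape[1]):
            e = int(img[i, j]) - int(img[i, 0])      # prediction error
            if e > 0:
                out[i, j] = img[i, 0] + e + 1        # shift positive errors
            elif e == 0 and k < len(bits):
                out[i, j] = img[i, 0] + bits[k]; k += 1
    return out, k

def extract(stego):
    img, bits = stego.copy(), []
    for i in range(stego.shape[0]):
        for j in range(1, stego.shape[1]):
            e = int(stego[i, j]) - int(stego[i, 0])
            if e in (0, 1):
                bits.append(e); img[i, j] = stego[i, 0]
            elif e > 1:
                img[i, j] = stego[i, 0] + e - 1
    return img, bits

x = np.array([[100, 100, 101], [50, 50, 52]])
s, n = embed(x, [1, 0])
rec, b = extract(s)
print(b[:n], np.array_equal(rec, x))   # -> [1, 0] True
```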
TL;DR: This work investigates the feasibility of crowdsourcing personality impressions from vlogging as a way to obtain judgements from a varied audience that consumes social media video, and addresses the task of automatic prediction of vloggers' personality impressions using nonverbal cues and machine learning techniques.
Abstract: Despite an increasing interest in understanding human perception in social media through the automatic analysis of users' personality, existing attempts have explored only user profiles and text blog data. We approach the study of personality impressions in social media from the novel perspective of crowdsourced impressions, social attention, and audiovisual behavioral analysis on slices of conversational vlogs extracted from YouTube. Conversational vlogs are a unique case study for understanding users in social media, as vloggers implicitly or explicitly share information about themselves that words, either written or spoken, cannot convey. In addition, research on vlogs may become a fertile ground for the study of video interactions, as conversational video expands to innovative applications. In this work, we first investigate the feasibility of crowdsourcing personality impressions from vlogging as a way to obtain judgements from a varied audience that consumes social media video. Then, we explore how these personality impressions mediate the online video watching experience and relate to measures of attention in YouTube. Finally, we investigate the use of automatic nonverbal cues as a suitable lens through which impressions are made, and we address the task of automatic prediction of vloggers' personality impressions using nonverbal cues and machine learning techniques. Our study, conducted on a dataset of 442 YouTube vlogs and 2210 annotations collected in Amazon's Mechanical Turk, provides new findings regarding the suitability of collecting personality impressions from crowdsourcing, the types of personality impressions that emerge through vlogging, their association with social attention, and the level of utilization of nonverbal cues in this particular setting. In addition, it constitutes a first attempt to address the task of automatic vlogger personality impression prediction using nonverbal cues, with promising results.
TL;DR: This paper investigates the problem of how to cache a set of media files with optimal streaming rates, under HTTP adaptive bit rate streaming over wireless networks, and finds there is a fundamental phase change in the optimal solution as the number of cached files grows.
Abstract: In this paper, we investigate the problem of optimal content cache management for HTTP adaptive bit rate (ABR) streaming over wireless networks. Specifically, in the media cloud, each content item is transcoded into a set of media files with diverse playback rates, and appropriate files are dynamically chosen in response to channel conditions and screen forms. Our design objective is to maximize the quality of experience (QoE) of an individual content item for end users, under a limited storage budget. Deriving a logarithmic QoE model from our experimental results, we formulate individual content cache management for HTTP ABR streaming over wireless networks as a constrained convex optimization problem. We adopt a two-step process to solve the snapshot problem. First, using the Lagrange multiplier method, we obtain the numerical solution of the set of playback rates for a fixed number of cached copies and characterize the optimal solution analytically. Our investigation reveals a fundamental phase change in the optimal solution as the number of cached files increases. Second, we develop three alternative search algorithms to find the optimal number of cached files, and compare their scalability under average- and worst-case complexity metrics. Our numerical results suggest that, under optimal cache schemes, the maximum QoE measurement, i.e., the mean opinion score (MOS), is a concave function of the allowable storage size. Our cache management can provide high expected QoE with low complexity, shedding light on the design of HTTP ABR streaming services over wireless networks.
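With a logarithmic QoE model, the Lagrange-multiplier step has a simple closed form, which the sketch below works through under invented weights and budget (the paper's exact QoE model and constraints differ; this shows only the core mechanics): maximizing sum_i w_i log(r_i) subject to sum_i r_i <= S gives stationarity w_i / r_i = lambda, hence r_i = w_i S / sum(w), i.e., the budget splits in proportion to the weights:

```python
# Closed-form Lagrangian split of a storage budget under log QoE.
import numpy as np

def optimal_rates(weights, budget):
    w = np.asarray(weights, dtype=float)
    return w * budget / w.sum()          # r_i = w_i * S / sum(w)

w = [3.0, 2.0, 1.0]                      # hypothetical QoE weights of 3 files
r = optimal_rates(w, budget=12.0)
print(r, "QoE:", np.sum(w * np.log(r)))
```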
TL;DR: A multiple-Kinect capturing system and a novel methodology are presented for creating accurate, realistic, full 3-D reconstructions of moving foreground objects, e.g., humans, to be exploited in real-time applications.
Abstract: The problem of robust, realistic and, especially, fast 3-D reconstruction of objects, although extensively studied, is still a challenging research task. Most state-of-the-art approaches that target real-time applications, such as immersive reality, mainly address the problem of synthesizing intermediate views for given viewpoints, rather than generating a single complete 3-D surface. In this paper, we present a multiple-Kinect capturing system and a novel methodology for the creation of accurate, realistic, full 3-D reconstructions of moving foreground objects, e.g., humans, to be exploited in real-time applications. The proposed method generates multiple textured meshes from multiple RGB-Depth streams, applies a coarse-to-fine registration algorithm and finally merges the separate meshes into a single 3-D surface. Although the Kinect sensor has attracted the attention of many researchers and home enthusiasts and has already appeared in many applications over the Internet, none of the previously presented works can produce full 3-D models of moving objects from multiple Kinect streams in real time. We present the capturing setup, the methodology for its calibration and the details of the proposed algorithm for real-time fusion of multiple meshes. The presented experimental results verify the effectiveness of the approach with respect to 3-D reconstruction quality, as well as the achieved frame rates.
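The coarse-to-fine registration stage can be loosely mirrored with Open3D's ICP, tightening the correspondence threshold over a few passes. The point clouds below are synthetic stand-ins for Kinect captures, and the paper's mesh generation and merging steps are omitted:

```python
# Coarse-to-fine rigid registration of two point clouds with Open3D ICP.
import numpy as np
import open3d as o3d

pts = np.random.rand(2000, 3)
src = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(pts))
tgt = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(pts))
src.translate((0.05, 0.02, 0.0))          # perturb the source cloud

T = np.eye(4)
for dist in (0.2, 0.05, 0.01):            # coarse-to-fine thresholds
    result = o3d.pipelines.registration.registration_icp(
        src, tgt, dist, T,
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    T = result.transformation
print(T)                                   # should undo the translation
```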
TL;DR: Mobile cloud computing can help bridge the gap, providing mobile applications the capabilities of cloud servers and storage together with the benefits of mobile devices and mobile connectivity, possibly enabling a new generation of truly ubiquitous multimedia applications on mobile devices: Cloud Mobile Media (CMM) applications.
Abstract: With worldwide shipments of smartphones (487.7 million) exceeding those of PCs (414.6 million, including tablets) in 2011, and with more users in the US alone predicted to access the Internet from mobile devices than from PCs by 2015, there is clearly a desire to use mobile devices and networks as we use PCs and wireline networks today. However, in spite of advances in the capabilities of mobile devices, a gap will continue to exist, and may even widen, with the requirements of rich multimedia applications. Mobile cloud computing can help bridge this gap, providing mobile applications the capabilities of cloud servers and storage together with the benefits of mobile devices and mobile connectivity, possibly enabling a new generation of truly ubiquitous multimedia applications on mobile devices: Cloud Mobile Media (CMM) applications.
TL;DR: A hierarchical regression model is designed to exploit the information derived from each type of feature, which is then collaboratively fused to obtain a multimedia semantic concept classifier.
Abstract: Multimedia data are usually represented by multiple features. In this paper, we propose a new algorithm, namely Multi-feature Learning via Hierarchical Regression, for multimedia semantics understanding, where two issues are considered. First, labeling a large amount of training data is labor-intensive; it is meaningful to effectively leverage unlabeled data to facilitate multimedia semantics understanding. Second, given that multimedia data can be represented by multiple features, it is advantageous to develop an algorithm which combines evidence obtained from different features to infer reliable multimedia semantic concept classifiers. We design a hierarchical regression model to exploit the information derived from each type of feature, which is then collaboratively fused to obtain a multimedia semantic concept classifier. Both the label information and the data distribution of the different features representing the multimedia data are considered. The algorithm can be applied to a wide range of multimedia applications, and experiments are conducted on video data for video concept annotation and action recognition. Using the Trecvid and CareMedia video datasets, the experimental results show that it is beneficial to combine multiple features. The performance of the proposed algorithm is remarkable when only a small amount of labeled training data is available.
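The fusion idea can be loosely illustrated (this is not the paper's hierarchical regression, which also uses unlabeled data) as a two-level stack: one regressor per feature type, whose outputs a second-level regressor fuses into the final concept score. The feature matrices and labels below are invented:

```python
# Two-level sketch: per-feature regressors fused by a second-level regressor.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
hog, sift = rng.normal(size=(300, 30)), rng.normal(size=(300, 50))  # 2 feature types
y = (hog[:, 0] + sift[:, 0] > 0).astype(float)                      # concept labels

level1 = [Ridge().fit(F, y) for F in (hog, sift)]                   # per-feature fits
stacked = np.column_stack([m.predict(F) for m, F in zip(level1, (hog, sift))])
fuser = Ridge().fit(stacked, y)                                     # level 2: fuse
print(fuser.predict(stacked[:5]))
```

A real stacking setup would fit the fuser on held-out first-level predictions to avoid optimistic bias; that refinement is skipped here for brevity.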
TL;DR: A novel RR IQA index based on visual information fidelity is proposed, advocating that distortions on the primary visual information mainly disturb image understanding, and distortions on the residual uncertainty mainly change the comfort of perception.
Abstract: Reduced-reference (RR) image quality assessment (IQA) aims to use less data about the reference image while achieving higher evaluation accuracy. Recent research on brain theory suggests that the human visual system (HVS) actively predicts the primary visual information and tries to avoid the residual uncertainty in image perception and understanding. Therefore, the perceptual quality relies on the information fidelities of the primary visual information and the residual uncertainty. In this paper, we propose a novel RR IQA index based on visual information fidelity. We advocate that distortions on the primary visual information mainly disturb image understanding, while distortions on the residual uncertainty mainly change the comfort of perception. We separately compute the quantities of the primary visual information and the residual uncertainty of an image. Then the fidelities of the two types of information are separately evaluated for quality assessment. Experimental results demonstrate that the proposed index uses very little data (30 bits) and achieves high consistency with human perception.
TL;DR: The experiments confirm that the attributes of individual and group travelers are promising and orthogonal to prior work using travel logs only, and can further improve prior travel recommendation methods, especially for difficult predictions, by further leveraging user contexts via mobile devices.
Abstract: Leveraging community-contributed data (e.g., blogs, GPS logs, and geo-tagged photos) for personalized recommendation is an active research problem, since there are rich contexts and human activities in such explosively growing data. In this work, we focus on personalized travel recommendation and show promising applications by leveraging freely available community-contributed photos. We propose to conduct personalized travel recommendation by further considering specific user profiles or attributes (e.g., gender, age, race) as well as travel group types (e.g., family, friends, couple). Instead of mining photo logs only, we exploit the automatically detected people attributes and travel group types in the photo contents. By information-theoretic measures, we demonstrate that such detected user profiles are informative and effective for travel recommendation, especially in providing a promising aspect for personalization. We effectively mine the demographics of individual and group travelers for different locations (or landmarks) and their travel paths. A probabilistic Bayesian learning framework, which further entails mobile recommendation on the spot, is introduced as well. We experiment on more than 10 million photos collected from 19 major cities worldwide and conduct an extensive investigation of profiling activities in communities according to temporal and spatial information. Note that the photos in the paper are attributed to various Flickr users under the Creative Commons License. The experiments confirm that the attributes of individuals and groups are promising and orthogonal to prior work using travel logs only, and can further improve prior travel recommendation methods, especially for difficult predictions, by further leveraging user contexts via mobile devices.
TL;DR: This paper proposes a method to directly optimize the graph Laplacian in spectral hashing; the learned graph, which can better represent similarity between samples, is then applied to SH for effective binary code learning.
Abstract: The ability of fast similarity search in a large-scale dataset is of great importance to many multimedia applications. Semantic hashing is a promising way to accelerate similarity search, which designs compact binary codes for a large number of images so that semantically similar images are mapped to close codes. Retrieving similar neighbors is then simply accomplished by retrieving images that have codes within a small Hamming distance of the code of the query. Among various hashing approaches, spectral hashing (SH) has shown promising performance by learning the binary codes with a spectral graph partitioning method. However, the Euclidean distance is usually used to construct the graph Laplacian in SH, which may not reflect the inherent distribution of the data. Therefore, in this paper, we propose a method to directly optimize the graph Laplacian. The learned graph, which can better represent similarity between samples, is then applied to SH for effective binary code learning. Meanwhile, our approach, unlike metric learning, can automatically determine the scale factor during the optimization. Extensive experiments are conducted on publicly available datasets and the comparison results demonstrate the effectiveness of our approach.
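The core of standard spectral hashing fits in a few lines: build a similarity graph, take the smallest nontrivial eigenvectors of its Laplacian, and threshold them into bits. The paper's contribution is learning the graph itself rather than fixing it with Euclidean distances, which this sketch does not attempt:

```python
# Baseline spectral-hashing-style binary codes from a fixed kNN graph.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse import csgraph
from scipy.sparse.linalg import eigsh

X = np.random.default_rng(0).normal(size=(500, 16))
W = kneighbors_graph(X, n_neighbors=10, mode="connectivity", include_self=False)
W = 0.5 * (W + W.T)                                # symmetrize the graph
L = csgraph.laplacian(W, normed=True)

n_bits = 8
vals, vecs = eigsh(L, k=n_bits + 1, which="SM")    # smallest eigenvectors
codes = (vecs[:, 1:] > 0).astype(np.uint8)         # drop the trivial one
print(codes[:3])
```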
TL;DR: A novel transfer learning framework that utilizes knowledge from social streams to grasp sudden popularity bursts in online content and envision that this cross-domain popularity prediction model will be substantially useful for various media applications that could not be previously solved by traditional multimedia techniques alone.
Abstract: Previous research on online media popularity prediction concluded that the rise in popularity of online videos follows a conventional logarithmic distribution. However, recent studies have shown that a significant portion of online videos exhibit a bursty, sudden rise in popularity, which cannot be accounted for by video-domain features alone. In this paper, we propose a novel transfer learning framework that utilizes knowledge from social streams (e.g., Twitter) to grasp sudden popularity bursts in online content. We develop a transfer learning algorithm that can learn topics from social streams, allowing us to model the social prominence of video content and improve popularity predictions in the video domain. Our transfer learning framework has the ability to scale with an incoming stream of tweets, harnessing physical-world event information in real time. Using data comprising 10.2 million tweets and 3.5 million YouTube videos, we show that the social prominence of a video's topic (context) is responsible for the sudden rise in its popularity, where social trends have a ripple effect as they spread from the Twitter domain to the video domain. We envision that our cross-domain popularity prediction model will be substantially useful for various media applications that could not previously be solved by traditional multimedia techniques alone.
TL;DR: This paper proposes a scheme that is able to enrich textual answers in cQA with appropriate media data and can enable a novel multimedia question answering (MMQA) approach as users can find multimedia answers by matching their questions with those in the pool.
Abstract: Community question answering (cQA) services have gained popularity over the past years. They not only allow community members to post and answer questions but also enable general users to seek information from a comprehensive set of well-answered questions. However, existing cQA forums usually provide only textual answers, which are not informative enough for many questions. In this paper, we propose a scheme that is able to enrich textual answers in cQA with appropriate media data. Our scheme consists of three components: answer medium selection, query generation for multimedia search, and multimedia data selection and presentation. This approach automatically determines which type of media information should be added to a textual answer. It then automatically collects data from the web to enrich the answer. By processing a large set of QA pairs and adding them to a pool, our approach enables a novel multimedia question answering (MMQA) approach, as users can find multimedia answers by matching their questions with those in the pool. Unlike many MMQA research efforts that attempt to directly answer questions with image and video data, our approach is built on community-contributed textual answers and is thus able to deal with more complex questions. We have conducted extensive experiments on a multi-source QA dataset. The results demonstrate the effectiveness of our approach.
TL;DR: It is shown that, in a new mobile video streaming framework dubbed AMES-Cloud, private agents in the clouds can effectively provide adaptive streaming and perform video sharing based on social network analysis.
Abstract: While demands on video traffic over mobile networks have been soaring, the wireless link capacity cannot keep up with the traffic demand. The gap between the traffic demand and the link capacity, along with time-varying link conditions, results in poor service quality of video streaming over mobile networks, such as long buffering times and intermittent disruptions. Leveraging cloud computing technology, we propose a new mobile video streaming framework, dubbed AMES-Cloud, which has two main parts: adaptive mobile video streaming (AMoV) and efficient social video sharing (ESoV). AMoV and ESoV construct a private agent to provide video streaming services efficiently for each mobile user. For a given user, AMoV lets her private agent adaptively adjust her streaming flow with a scalable video coding technique based on the feedback of link quality. Likewise, ESoV monitors the social network interactions among mobile users, and their private agents try to prefetch video content in advance. We implement a prototype of the AMES-Cloud framework to demonstrate its performance. It is shown that the private agents in the clouds can effectively provide adaptive streaming and perform video sharing (i.e., prefetching) based on social network analysis.
TL;DR: This paper presents a novel self-learning approach for SR that advances support vector regression (SVR) with image sparse representation, offering excellent generalization in modeling the relationship between images and their associated SR versions.
Abstract: Learning-based approaches for image super-resolution (SR) have attracted the attention of researchers in the past few years. In this paper, we present a novel self-learning approach for SR. In our proposed framework, we advance support vector regression (SVR) with image sparse representation, which offers excellent generalization in modeling the relationship between images and their associated SR versions. Unlike most prior SR methods, our proposed framework does not require the collection of low- and high-resolution training image data in advance, and we do not assume the reoccurrence (or self-similarity) of image patches within an image or across image scales. With the theoretical support of Bayes decision theory, we verify that our SR framework learns and selects the optimal SVR model when producing an SR image, which results in the minimum SR reconstruction error. We evaluate our method on a variety of images and obtain very promising SR results. In most cases, our method quantitatively and qualitatively outperforms bicubic interpolation and state-of-the-art learning-based SR approaches.
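A bare-bones flavor of SVR-based super-resolution (the paper's self-learning, sparse representation and model selection are omitted): regress the high-resolution center pixel from a bicubic-upsampled low-resolution patch:

```python
# SVR regressing HR center pixels from upsampled LR patches (toy setup).
import numpy as np
from scipy.ndimage import zoom
from sklearn.svm import SVR

rng = np.random.default_rng(0)
hr = rng.random((64, 64))                            # stand-in HR image
lr_up = zoom(zoom(hr, 0.5, order=3), 2.0, order=3)   # LR image, re-upsampled

patches, targets = [], []
for i in range(2, 62):
    for j in range(2, 62):
        patches.append(lr_up[i-2:i+3, j-2:j+3].ravel())  # 5x5 LR patch
        targets.append(hr[i, j])                         # HR center pixel

model = SVR(C=1.0, epsilon=0.01).fit(patches[:2000], targets[:2000])
print(model.predict(patches[2000:2005]), targets[2000:2005])
```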
TL;DR: A novel method to discover co-salient objects from a group of images, modeled as a linear fusion of an intra-image saliency (IaIS) map and an inter-image saliency (IrIS) map; the underlying region matching problem can be optimized by linear programming.
Abstract: In this paper, we propose a novel method to discover co-salient objects from a group of images, which is modeled as a linear fusion of an intra-image saliency (IaIS) map and an inter-image saliency (IrIS) map. The first term is to measure the salient objects from each image using multiscale segmentation voting. The second term is designed to detect the co-salient objects from a group of images. To compute the IrIS map, we perform the pairwise similarity ranking based on an image pyramid representation. A minimum spanning tree is then constructed to determine the image matching order. For each region in an image, we design three types of visual descriptors, which are extracted from the local appearance, e.g., color, color co-occurrence and shape properties. The final region matching problem between the images is formulated as an assignment problem that can be optimized by linear programming. Experimental evaluation on a number of images demonstrates the good performance of the proposed method on co-salient object detection.
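The region-matching step maps directly onto the classic assignment problem, and scipy's Hungarian solver is a drop-in for the LP formulation. The descriptor distances below are made up:

```python
# Region matching between two images as an assignment problem.
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i, j]: dissimilarity between region i of image A and region j of B,
# e.g., from color, color co-occurrence and shape descriptors.
cost = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.9],
                 [0.8, 0.9, 0.3]])
rows, cols = linear_sum_assignment(cost)
print(list(zip(rows, cols)), "total cost:", cost[rows, cols].sum())
```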
TL;DR: This paper reviews the recent work in NC for multimedia applications, focusing on the techniques that fill the gap between NC theory and practical applications; it outlines the benefits of NC and presents the open challenges in this area.
Abstract: While every network node only relays messages in a traditional communication system, the recent network coding (NC) paradigm proposes to implement simple in-network processing with packet combinations in the nodes. NC extends the concept of “encoding” a message beyond source coding (for compression) and channel coding (for protection against errors and losses). It has been shown to increase network throughput compared to traditional network implementations, to reduce delay and to provide robustness to transmission errors and network dynamics. These features are so appealing for multimedia applications that they have spurred a large research effort towards the development of multimedia-specific NC techniques. This paper reviews the recent work in NC for multimedia applications and focuses on the techniques that fill the gap between NC theory and practical applications. It outlines the benefits of NC and presents the open challenges in this area. The paper initially focuses on multimedia-specific aspects of network coding, in particular delay, in-network error control, and media-specific error control. These aspects permit handling of varying network conditions as well as client heterogeneity, which are critical to the design and deployment of multimedia systems. After introducing these general concepts, the paper reviews in detail two applications that lend themselves naturally to NC via the cooperation and broadcast models, namely peer-to-peer multimedia streaming and wireless networking.
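The "packet combination" idea in miniature, as a tiny linear network coding demo over GF(2): coded packets are XOR combinations of source packets, and a receiver decodes once it holds enough linearly independent combinations, via Gaussian elimination over GF(2). Packets and coefficients here are fixed toy values:

```python
# Linear network coding over GF(2): encode by XOR, decode by elimination.
import numpy as np

# Two 24-bit source packets (the bits of b"Hi!" and b"Ok2").
packets = np.unpackbits(np.frombuffer(b"Hi!Ok2", dtype=np.uint8)).reshape(2, 24)

# Four coded packets: 0/1 coefficient rows; payload = XOR of chosen sources.
coeffs = np.array([[1, 0], [1, 1], [0, 1], [1, 1]], dtype=np.uint8)
coded = (coeffs @ packets) % 2

# Receiver: Gaussian elimination over GF(2) on the augmented matrix.
A = np.hstack([coeffs, coded])
for col in range(2):
    pivot = next(r for r in range(col, len(A)) if A[r, col])
    A[[col, pivot]] = A[[pivot, col]]
    for r in range(len(A)):
        if r != col and A[r, col]:
            A[r] = (A[r] + A[col]) % 2
print(bytes(np.packbits(A[:2, 2:])))   # -> b'Hi!Ok2'
```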
TL;DR: A Bayesian framework that makes use of personal dietary tendencies to improve both food-image detection and food-balance estimation is proposed, which is the first to make use of statistical personal bias to improve the performance of the analysis.
Abstract: We have investigated the “FoodLog” multimedia food-recording tool, whereby users upload photographs of their meals and a food diary is constructed using image-processing functions such as food-image detection and food-balance estimation. In this paper, following a brief introduction to FoodLog, we propose a Bayesian framework that makes use of personal dietary tendencies to improve both food-image detection and food-balance estimation. The Bayesian framework facilitates incremental learning. It incorporates three personal dietary tendencies that influence food analysis: the likelihood, the prior distribution, and the mealtime category. In the evaluation of the proposed method using images uploaded to FoodLog, both food-image detection and food-balance estimation are improved. In particular, in food-balance estimation, the mean absolute error is significantly reduced from 0.69 servings to 0.28 servings on average for two persons using more than 200 personal images, and from 0.59 servings to 0.48 servings on average for four persons using 100 personal images. Among works analyzing food images, this is the first to make use of statistical personal bias to improve the performance of the analysis.
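A skeleton of the Bayesian personalization idea (all dishes and numbers below are invented for illustration): the generic food-image likelihood is reweighted by a personal prior over dishes and a mealtime-conditioned prior, and the normalized posterior is what gets thresholded:

```python
# Posterior = generic likelihood x personal prior x mealtime prior.
def personalized_posterior(likelihood, personal_prior, mealtime_prior):
    scores = {dish: likelihood[dish] * personal_prior.get(dish, 1e-3)
                    * mealtime_prior.get(dish, 1e-3)
              for dish in likelihood}
    z = sum(scores.values())
    return {dish: s / z for dish, s in scores.items()}

likelihood = {"rice": 0.30, "salad": 0.35, "cake": 0.35}  # generic classifier
personal   = {"rice": 0.60, "salad": 0.35, "cake": 0.05}  # this user's history
breakfast  = {"rice": 0.50, "salad": 0.10, "cake": 0.40}  # mealtime category
print(personalized_posterior(likelihood, personal, breakfast))
```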
TL;DR: A scheme for image annotation on the cloud is presented, which transmits mobile images compressed by Hamming compressed sensing to the cloud and conducts semantic annotation through a novel Hessian regularized support vector machine on the cloud.
Abstract: With the rapid development of cloud computing and mobile services, users expect a better experience through multimedia computing, such as automatic or semi-automatic personal image and video organization and intelligent user interfaces. These functions depend heavily on the success of image understanding, and thus large-scale image annotation has received intensive attention in recent years. The collaboration between mobile and cloud opens a new avenue for image annotation, because the heavy computation can be transferred to the cloud to respond immediately to user actions. In this paper, we present a scheme for image annotation on the cloud, which transmits mobile images compressed by Hamming compressed sensing to the cloud and conducts semantic annotation through a novel Hessian regularized support vector machine on the cloud. We carefully explain the rationale of Hessian regularization for encoding the local geometry of the compact support of the marginal distribution, and we prove that the Hessian regularized support vector machine in the reproducing kernel Hilbert space is equivalent to conducting the Hessian regularized support vector machine in the space spanned by the principal components of kernel principal component analysis. We conduct experiments on the PASCAL VOC'07 dataset and demonstrate the effectiveness of the Hessian regularized support vector machine for large-scale image annotation.
TL;DR: The proposed JIGSAW+ is able to achieve a 5% gain in search performance and is ten times faster.
Abstract: This paper describes a novel multimodal interactive image search system on mobile devices. The system, Joint search with ImaGe, Speech, And Word Plus (JIGSAW+), takes full advantage of the multimodal input and natural user interactions of mobile devices. It is designed for users who already have pictures in their minds but have no precise descriptions or names to address them. By describing the desired image using speech and then refining the recognized query by interactively composing a visual query from exemplary images, the user can easily find the desired images through a few natural multimodal interactions with his/her mobile device. Compared with our previous work JIGSAW, the algorithm has been significantly improved in three aspects: 1) segmentation-based image representation is adopted to remove the artificial block partitions; 2) relative position checking replaces the fixed position penalty; and 3) an inverted index is constructed instead of brute-force matching. The proposed JIGSAW+ achieves a 5% gain in search performance and is ten times faster.
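The "inverted index instead of brute-force matching" point in miniature (visual-word IDs and images below are invented): map each visual word to the images containing it, so a query only touches images that share at least one word with it rather than every image in the collection:

```python
# Toy inverted index over visual words for candidate retrieval.
from collections import defaultdict

image_words = {"img1": {3, 17, 42}, "img2": {17, 99}, "img3": {42, 99, 7}}

index = defaultdict(set)
for img, words in image_words.items():
    for w in words:
        index[w].add(img)

def search(query_words):
    scores = defaultdict(int)
    for w in query_words:               # only the query words' posting lists
        for img in index.get(w, ()):
            scores[img] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(search({17, 42}))                 # -> img1 first (shares both words)
```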
TL;DR: A cloud-based image coding method is proposed that differs from current image coding even on the ground: it no longer compresses images pixel by pixel and instead tries to describe images and reconstruct them from a large-scale image database via the descriptions.
Abstract: Current image coding schemes make it hard to utilize external images for compression, even if highly correlated images can be found in the cloud. To solve this problem, we propose a method of cloud-based image coding that is different from current image coding even on the ground. It no longer compresses images pixel by pixel and instead tries to describe images and reconstruct them from a large-scale image database via the descriptions. First, we describe an input image based on its down-sampled version and local feature descriptors. The descriptors are used to retrieve highly correlated images in the cloud and identify corresponding patches. The down-sampled image serves as a target to stitch the retrieved image patches together. Second, the down-sampled image is compressed using current image coding. The feature vectors of the local descriptors are predicted from the corresponding vectors extracted in the decoded down-sampled image. The prediction residual vectors are compressed by transform, quantization, and entropy coding. The experimental results show that the visual quality of reconstructed images is significantly better than that of intra-frame coding in HEVC and JPEG at compression ratios of thousands to one.