
Showing papers in "IEEE MultiMedia in 2014"


Journal ArticleDOI
TL;DR: In this paper, the authors propose a joint parsing system consisting of three modules: video parsing, text parsing, and joint inference, which produces a parse graph that represents the compositional structures of spatial information (objects and scenes), temporal information (actions and events), and causal information (causalities between events and fluents).
Abstract: This article proposes a multimedia analysis framework to process video and text jointly for understanding events and answering user queries. The framework produces a parse graph that represents the compositional structures of spatial information (objects and scenes), temporal information (actions and events), and causal information (causalities between events and fluents) in the video and text. The knowledge representation of the framework is based on a spatial-temporal-causal AND-OR graph (S/T/C-AOG), which jointly models possible hierarchical compositions of objects, scenes, and events as well as their interactions and mutual contexts, and specifies the prior probabilistic distribution of the parse graphs. The authors present a probabilistic generative model for joint parsing that captures the relations between the input video/text, their corresponding parse graphs, and the joint parse graph. Based on the probabilistic model, the authors propose a joint parsing system consisting of three modules: video parsing, text parsing, and joint inference. Video parsing and text parsing produce two parse graphs from the input video and text, respectively. The joint inference module produces a joint parse graph by performing matching, deduction, and revision on the video and text parse graphs. The proposed framework has the following objectives: to provide deep semantic parsing of video and text that goes beyond the traditional bag-of-words approaches; to perform parsing and reasoning across the spatial, temporal, and causal dimensions based on the joint S/T/C-AOG representation; and to show that deep joint parsing facilitates subsequent applications such as generating narrative text descriptions and answering queries in the forms of who, what, when, where, and why. The authors empirically evaluated the system based on comparison against ground-truth as well as accuracy of query answering and obtained satisfactory results.

126 citations


Journal ArticleDOI
Ling-Yu Duan1, Jie Lin1, Jie Chen1, Tiejun Huang1, Wen Gao1 
TL;DR: Major progress is reviewed in standardizing technologies that will enable efficient and interoperable design of visual search applications, and the location-search- and recognition-oriented data collection and benchmark under the MPEG CDVS evaluation framework is presented.
Abstract: To ensure application interoperability in visual object search technologies, the MPEG Working Group has made great efforts in standardizing visual search technologies. Moreover, extraction and transmission of compact descriptors are valuable for next-generation, mobile, visual search applications. This article reviews the significant progress of MPEG Compact Descriptors for Visual Search (CDVS) in standardizing technologies that will enable efficient and interoperable design of visual search applications. In addition, the article presents the location-search- and recognition-oriented data collection and benchmark developed under the MPEG CDVS evaluation framework.

89 citations


Journal ArticleDOI
TL;DR: The recent research progress in view-based 3D object retrieval is introduced by reviewing advances and identifying challenges in this field.
Abstract: View-based 3D object retrieval is an emerging research topic that has numerous geographic-related applications in many fields, such as computer-aided design (CAD) and virtual city navigation. This article briefly introduces the recent research progress in view-based 3D object retrieval by reviewing advances and identifying challenges in this field.

76 citations


Journal ArticleDOI
Jianbo Jiao1, Ronggang Wang1, Wenmin Wang1, Dong Shengfu1, Zhenyu Wang1, Wen Gao1 
TL;DR: A local stereo matching method that employs a new combined cost approach and a secondary disparity refinement mechanism; experiments show it is the best cost-volume filtering-based local method and validate the proposed method's effectiveness.
Abstract: Recent local stereo matching methods have achieved performance comparable with global methods. However, the final disparity map still contains significant outliers. In this article, the authors propose a local stereo matching method that employs a new combined cost approach and a secondary disparity refinement mechanism. They formulate the combined cost using a modified color census transform and truncated absolute differences of color and gradients. They also use a symmetric guided filter for cost aggregation. Unlike in traditional stereo matching, they propose a novel secondary disparity refinement to further remove the remaining outliers. Experimental results on the Middlebury benchmark show that their method ranks fifth out of 153 submitted methods, and it's the best cost-volume filtering-based local method. Experiments on real-world sequences and depth-based applications also validate the proposed method's effectiveness.
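The combined-cost idea can be sketched in miniature: a census-transform Hamming distance blended with a truncated absolute difference (TAD). This is a hedged illustration, not the authors' implementation; the 3x3 window, the weight `lam`, and the truncation threshold `tau_ad` are illustrative choices.

```python
def census(patch):
    """3x3 census transform: one bit per neighbor, set if neighbor < center."""
    center = patch[1][1]
    bits = 0
    for i in range(3):
        for j in range(3):
            if (i, j) != (1, 1):
                bits = (bits << 1) | (1 if patch[i][j] < center else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two census codes."""
    return bin(a ^ b).count("1")

def combined_cost(patch_l, patch_r, lam=0.3, tau_ad=30):
    """Blend the census Hamming distance with a truncated absolute
    difference of the center pixels; lam and tau_ad are illustrative."""
    cost_census = hamming(census(patch_l), census(patch_r))
    cost_ad = min(abs(patch_l[1][1] - patch_r[1][1]), tau_ad)
    return lam * cost_census + (1 - lam) * cost_ad
```

Identical patches cost zero, and a uniform brightness shift leaves the census term at zero while only the truncated intensity term grows, which is one reason census-based costs are robust to radiometric differences between the two views.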

65 citations


Journal ArticleDOI
TL;DR: The Video Browser Showdown evaluates the performance of exploratory tools for interactive content search in videos in direct competition and in front of an audience to push research on user-centric video search tools.
Abstract: The Video Browser Showdown is an international competition in the field of interactive video search and retrieval. It is held annually as a special session at the International Conference on Multimedia Modeling (MMM). The Video Browser Showdown evaluates the performance of exploratory tools for interactive content search in videos in direct competition and in front of an audience. Its goal is to push research on user-centric video search tools including video navigation, content browsing, content interaction, and video content visualization. This article summarizes the first three VBS competitions (2012-2014).

61 citations


Journal ArticleDOI
TL;DR: An overview of SHVC, the scalable extension of H.265/HEVC, which adopts a scalable coding architecture with only high-level syntax changes relative to its base codec, which allows SHVC to be deployed with significantly reduced implementation cost.
Abstract: This article presents an overview of SHVC, the scalable extension of H.265/HEVC. SHVC adopts a scalable coding architecture with only high-level syntax changes relative to its base codec, which allows SHVC to be deployed with significantly reduced implementation cost. SHVC supports a rich set of scalability features. It also addresses the increasing market demand for higher quality and higher value video content delivery by providing a set of desired scalability features with high coding efficiency.

59 citations


Journal ArticleDOI
TL;DR: The progress of standardization of biometric template protection schemes is reviewed, an umbrella term for a class of techniques used to mitigate the security and privacy threats inherent in biometric recognition.
Abstract: Whether it is providing fingerprints at airport immigration desks, tagging friends on social networking sites, or logging into a smartphone, biometrics provide a fast, convenient, and unobtrusive means for access control or identity verification. Biometric template protection is an umbrella term for a class of techniques used to mitigate the security and privacy threats inherent in biometric recognition. During the past decade and a half, template protection has gained traction in academia and industry, becoming the subject of publications, patents and conferences. This article reviews the progress of standardization of biometric template protection schemes.

48 citations


Journal ArticleDOI
TL;DR: The authors have made two contributions to the design of a BOF-based on-device MVLR system, including a memory-efficient approximate nearest-neighbor search algorithm that combines residual vector quantization (RVQ) and tree-structured RVQ (TSRVQ).
Abstract: Existing mobile visual location recognition (MVLR) applications typically rely on bag-of-features (BOF) representation, which shows superior performance in retrieval accuracy. However, although the BOF framework is promising, it is not compact enough for on-device MVLR. The authors have made two contributions to the design of a BOF-based on-device MVLR system. First, to generate BOF descriptors, they propose a memory-efficient approximate nearest-neighbor search algorithm by combining residual vector quantization (RVQ) and tree-structured RVQ (TSRVQ). Second, they implemented a GPS-based and heading-aware RankBoost algorithm to reduce the dimensionality of the BOF descriptors. The authors evaluate the effectiveness of the proposed algorithms on an HTC mobile phone. Their work applies to on-device MVLR in city-scale workspaces.

48 citations


Journal ArticleDOI
TL;DR: The authors propose a method of projected residual vector quantization for ANN search that considers the projection errors in the quantization process and design three simple and effective optimization strategies to improve the performance of the PRVQ algorithm.
Abstract: This article proposes Projected Residual Vector Quantization (PRVQ) to address the problem of large-scale approximate nearest neighbor (ANN) search in a high-dimensional space. Many quantization-based ANN search algorithms have been proposed in the past few years. However, most of the existing methods discard the projection errors generated in the dimension-reduction process, which inevitably decreases the search accuracy. In view of this, the proposed PRVQ accounts for the projection errors in the quantization process. The authors also design three simple and effective optimization strategies to improve the performance of the PRVQ algorithm, and they have integrated it into a mobile landmark recognition system to prove its effectiveness.
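The residual quantization at the heart of PRVQ (and of RVQ generally) can be sketched as follows: each stage quantizes the residual left by the previous stage, and the reconstruction is the sum of the chosen codewords. The tiny 2D codebooks below are hypothetical; in practice they are learned, for example by k-means, and PRVQ additionally folds the projection errors into the residual being quantized.

```python
def nearest(codebook, v):
    """Index of the codeword closest to v in squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)))

def rvq_encode(codebooks, v):
    """Encode v as one codeword index per stage, quantizing the residual."""
    codes, residual = [], list(v)
    for cb in codebooks:
        idx = nearest(cb, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return codes

def rvq_decode(codebooks, codes):
    """Reconstruction = sum of the selected codewords across stages."""
    rec = [0.0] * len(codebooks[0][0])
    for cb, idx in zip(codebooks, codes):
        rec = [r + c for r, c in zip(rec, cb[idx])]
    return rec

# Hypothetical codebooks: a coarse first stage, a fine second stage.
stage1 = [[0.0, 0.0], [10.0, 10.0]]
stage2 = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
codes = rvq_encode([stage1, stage2], [11.0, 10.2])
```

Adding stages shrinks the residual, so reconstruction accuracy grows with the bit budget while each stage's codebook stays small, which is what makes the scheme memory-efficient on a device.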

46 citations


Journal ArticleDOI
TL;DR: Computer vision promises to be an extraordinary enabling technology for augmenting visitor experiences, bridging the affective gap by understanding the visitor's individual cognitive needs and interests and his or her situational affective state.
Abstract: Museum visitor experiences differ from person to person, ranging from cognitive to affective. Progress in information technology has provided the opportunity to improve both the quantity and personalization of cultural information, privileging the cognitive experience over the affective. Computer vision promises to be an extraordinary enabling technology for augmenting visitor experiences, bridging the affective gap by understanding the visitor's individual cognitive needs and interests and his or her situational affective state.

41 citations


Journal ArticleDOI
TL;DR: The authors propose a general-purpose, no-reference image quality assessment (NR-IQA) method with the goal of developing a model that does not require prior knowledge about nondistorted reference images or the types of distortions, and which can achieve better prediction performance than other state-of-the-art approaches.
Abstract: With the rapid increase of digital imaging and communication technology usage, there's now great demand for fast and practical image quality assessment (IQA) algorithms that can predict an image's quality as consistently as humans. The authors propose a general-purpose, no-reference image quality assessment (NR-IQA) method with the goal of developing a model that does not require prior knowledge about nondistorted reference images or the types of distortions. The key is to obtain effective image representations by learning quality-aware filters (QAFs). Unlike other regression models, they also use a random forest to train the mapping from the feature space to the quality score. Extensive experiments conducted on the LIVE and CSIQ datasets demonstrate that the proposed NR-IQA metric, QAF, can achieve better prediction performance than other state-of-the-art approaches in terms of both prediction accuracy and generalization capability.
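The general pipeline (filter responses pooled into a feature vector, then a learned regressor) might be sketched like this; the two hand-written gradient filters and the 1-nearest-neighbor regressor are simplified stand-ins for the learned quality-aware filters and the random forest used in the article.

```python
def filter_response(img, kernel):
    """Sum of absolute responses of a 2x2 kernel slid over a 2D image."""
    h, w = len(img), len(img[0])
    total = 0.0
    for y in range(h - 1):
        for x in range(w - 1):
            r = sum(img[y + dy][x + dx] * kernel[dy][dx]
                    for dy in (0, 1) for dx in (0, 1))
            total += abs(r)
    return total

def features(img):
    """Pool responses of toy horizontal/vertical gradient filters."""
    kernels = [[[1, -1], [1, -1]], [[1, 1], [-1, -1]]]
    return [filter_response(img, k) for k in kernels]

def predict_score(img, train):
    """1-NN regression over (feature vector, quality score) pairs."""
    f = features(img)
    return min(train, key=lambda t: sum((a - b) ** 2
                                        for a, b in zip(t[0], f)))[1]
```

In this toy setup, a flat (detail-free) image produces near-zero filter responses and is mapped to the score of the nearest low-quality training example, mimicking how pooled filter statistics separate degraded images from sharp ones.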

Journal ArticleDOI
TL;DR: The authors first describe methods that compress visual word histograms, which require a codebook for decoding compressed signatures, and then describe methods that use residuals to achieve the same accuracy with much smaller codebooks and compressed-domain matching.
Abstract: Mobile visual search systems compare images against a database for object recognition. If query data is transmitted over a slow network or processed on a congested server, the latency increases substantially. This article shows how on-device database matching guarantees fast recognition regardless of external conditions. The database signatures must be compact because of limited memory, capable of fast comparisons, and discriminative for robust recognition. The authors first describe methods that compress visual word histograms, which require a codebook and decoding compressed signatures. They then describe methods that use residuals to achieve the same accuracy with much smaller codebooks and compressed domain matching.
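As a toy illustration of the first family of methods (compressing visual word histograms into compact, fast-to-compare signatures), one could binarize a histogram into a bit vector and match entirely in the compressed domain with Hamming distance. The thresholding rule and the tiny database below are hypothetical.

```python
def binarize(histogram, threshold=0):
    """Pack 'word count > threshold' decisions into an integer bit-signature."""
    sig = 0
    for count in histogram:
        sig = (sig << 1) | (1 if count > threshold else 0)
    return sig

def hamming(a, b):
    """Bitwise distance between two signatures (no decoding needed)."""
    return bin(a ^ b).count("1")

query = [3, 0, 1, 0, 5, 0]  # visual-word counts for the query image
db = {"landmark_a": [2, 0, 2, 0, 4, 1],
      "landmark_b": [0, 6, 0, 3, 0, 0]}
best = min(db, key=lambda k: hamming(binarize(query), binarize(db[k])))
```

The signature is a single machine word here, so comparisons reduce to an XOR and a popcount, which is what makes on-device matching fast and memory-light.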

Journal ArticleDOI
TL;DR: A multiscreen, social TV system integrated with social sense via a second screen as a novel paradigm for content consumption and the feasibility and effectiveness of the proposed approach in transforming the TV viewing experience are described.
Abstract: The increasing popularity of social interactions and geotagged, user-generated content has transformed the television viewing experience from laid-back video watching behavior into a "lean-forward," socially engaged experience. This article describes a multiscreen, social TV system integrated with social sense via a second screen as a novel paradigm for content consumption. This new application is built upon the authors' cloud-centric media platform, which provides on-demand virtual machines for content platform services, including media distribution, storage, and processing. The media platform is also integrated with a Big Data social platform that crawls and mines social data related to the media content. Specifically, this new social TV approach consists of three key subsystems: interactive TV, social sense, and multiscreen orchestration. Interactive TV implements a cloud-based, social TV system, offering rich social features; social sense discovers the geolocation-aware public perception and knowledge related to the media content; and multiscreen orchestration provides an intuitive and user-friendly human-computer interface to combine the two other subsystems, fusing the TV viewing experience with social perception. The authors have built a proof-of-concept demo over a private cloud at the Nanyang Technological University (NTU), Singapore. Feature verification and performance comparisons demonstrate the feasibility and effectiveness of the proposed approach in transforming the TV viewing experience.

Journal Article
TL;DR: The 21st century has witnessed significant advances in storage, processing, sensing, and communication technologies, resulting in the popularization of strong data-dependent approaches and the rise of scientism in almost all disciplines where data can be collected.
Abstract: Humans have always been interested in understanding themselves and their environment. Understanding their relationship with the environment is important for surviving and thriving in the present situation and for planning for the future. The 21st century has witnessed significant advances in storage, processing, sensing, and communication technologies. These advances have popularized strong data-dependent approaches, leading to the rise of scientism in almost all disciplines where data can be collected. As data has become widely available, understanding physical reality at different levels in different applications has become both possible and desirable.

Journal ArticleDOI
TL;DR: The compression formats described in this article can be used to support emerging auto-stereoscopic displays and free-viewpoint video functionalities.
Abstract: This article reviews the most recent extensions to the Advanced Video Coding (AVC) and High Efficiency Video Coding (HEVC) coding standards, which integrate depth video to support advanced multiview and 3D video functionalities. All the extensions provide single-view compatibility, while some extensions add depth support on top of conforming stereoscopic bitstreams. To achieve the highest gains in coding efficiency, depth information is utilized in coding the texture views. The compression formats described in this article can be used to support emerging auto-stereoscopic displays and free-viewpoint video functionalities.

Journal ArticleDOI
TL;DR: Unlike previous gaze estimation methods that use explicit offline calibration with a fixed number of calibration points or implicit calibration, the authors' approach constantly improves person-specific eye parameters through online calibration, which enables the system to adapt gradually to a new user.
Abstract: Gaze-tracking technology is highly valuable in many interactive and diagnostic applications. For many gaze estimation systems, calibration is an unavoidable procedure necessary to determine certain person-specific parameters, either explicitly or implicitly. Recently, several offline implicit calibration methods have been proposed to ease the calibration burden. However, the calibration procedure is still cumbersome, and gaze estimation accuracy needs further improvement. In this article, the authors present a novel 3D gaze estimation system with online calibration. The proposed system is based on a new 3D model-based gaze estimation method using a single consumer depth camera sensor (a Kinect). Unlike previous gaze estimation methods using explicit offline calibration with a fixed number of calibration points or implicit calibration, their approach constantly improves person-specific eye parameters through online calibration, which enables the system to adapt gradually to a new user. The experimental results and the human-computer interaction (HCI) application show that the proposed system can work in real time with superior gaze estimation accuracy and minimal calibration burden.

Journal ArticleDOI
TL;DR: The state-of-the-art clothing analysis techniques (clothing modeling, recognition, and parsing) that can be applied in many real applications, such as clothing retrieval and recommendation are surveyed.
Abstract: Driven by the huge profit potential in the fashion industry, intelligent fashion analysis based on techniques for clothing and makeover analysis is receiving much attention in the multimedia and computer vision literature. This article surveys the state-of-the-art clothing analysis techniques (clothing modeling, recognition, and parsing) that can be applied in many real applications, such as clothing retrieval and recommendation. The authors then introduce several makeover-related research directions, such as facial attractiveness prediction, facial makeup synthesis, and hair segmentation. Lastly, they discuss promising future directions for clothing and makeover analysis.

Journal ArticleDOI
TL;DR: Haptics is presented as a new component of the filmmaker's toolkit and a taxonomy of haptic effects is proposed and new effects coupled with classical cinematographic motions are introduced to enhance the video-viewing experience.
Abstract: Haptics, the technology which brings tactile or force-feedback to users, has a great potential for enhancing movies and could lead to new immersive experiences. This article introduces haptic cinematography, which presents haptics as a new component of the filmmaker's toolkit. The authors propose a taxonomy of haptic effects and introduce new effects coupled with classical cinematographic motions to enhance the video-viewing experience. They propose two models to render haptic effects based on camera motions: the first model makes the audience feel the motion of the camera, and the second provides haptic metaphors related to the semantics of the camera effect. Results from a user study suggest that these new effects improve the quality of experience. Filmmakers can use this new way of creating haptic effects to propose new immersive audio-visual experiences.

Journal ArticleDOI
TL;DR: This work proposes a probabilistic topic model called Multimodal Spatio-Temporal Theme Modeling (mmSTTM), which considers both textual and visual contexts to learn general, local, and temporal themes, which span a low-dimensional theme space.
Abstract: Here, we discuss mining and summarizing landmarks' general themes as well as their local and temporal themes. General themes occur extensively across various landmarks and include accommodations and other standard features. A local theme is a specific theme that exists only at a certain landmark, such as a unique physical characteristic. A temporal theme corresponds to a location-time-representative pattern that relates only to a certain landmark during a certain period, such as fleet week at the Golden Gate Bridge or red maple leaves in Kiyomizu-dera. Local themes are useful in landmark analysis for their discriminative and representative attributes. However, the ability to discover landmark diversity at different moments makes temporal themes equally important in landmark studies. Time-dependent diversity shows complete viewing angles over time and complements local themes in landmark understanding. Furthermore, it provides more comprehensive and structured information for landmark history browsing and tourist decision making. We propose a probabilistic topic model called Multimodal Spatio-Temporal Theme Modeling (mmSTTM). The model considers both textual and visual contexts to learn general, local, and temporal themes, which span a low-dimensional theme space. The model also assigns all textual and visual keywords to each theme, along with a probability for each; a keyword with a high weight is meaningful for the theme, while low-weighted keywords are considered noise.

Journal ArticleDOI
TL;DR: The authors present a novel multimodel fusion scheme to effectively fuse the multimodel results and generate the final ranked retrieval results.
Abstract: A multimedia semantic retrieval system based on hidden coherent feature groups (HCFGs) can support multimedia semantic retrieval on mobile applications. The system can capture the correlation between features and partition the original feature set into HCFGs, which have strong intragroup correlation while maintaining low intergroup correlation. The authors present a novel multimodel fusion scheme to effectively fuse the multimodel results and generate the final ranked retrieval results. In addition, to incorporate user interaction for effective retrieval, the proposed system also features a user feedback mechanism that helps refine the retrieval results.

Journal ArticleDOI
TL;DR: This article presents a 3D feature learning framework that combines different modality data effectively to promote the discriminability of unimodal features.
Abstract: Three-dimensional shapes contain different kinds of information that jointly characterize the shape. Traditional methods, however, perform recognition or retrieval using only one type. This article presents a 3D feature learning framework that combines different modality data effectively to promote the discriminability of unimodal features. Two independent deep belief networks (DBNs) are employed to learn high-level features from low-level features, and a restricted Boltzmann machine (RBM) is trained for mining the deep correlations between the different modalities. Experiments demonstrate that the proposed method can achieve better performance.

Journal ArticleDOI
TL;DR: A feature-based watermarking algorithm is proposed that embeds a binary image as a watermark in the DCT domain to guarantee visual quality and resist synchronization damage to images.
Abstract: Prompted by the crucial issue of copyright, digital watermarking plays a key role in protecting integrity and providing authorization for multimedia. Efficient watermarking techniques require visual imperceptibility and robustness against various attacks. In this article, the authors propose a feature-based watermarking algorithm that embeds a binary image as a watermark in the DCT domain to guarantee visual quality. In particular, the technique embeds marker bits to locate the original block of a cropped image and thereby resist synchronization damage. As simulation results show, compared to related methods, the proposed approach better resists several major image attacks, including cropping, shifting, blurring, noise, sharpening, and JPEG lossy compression. Moreover, this watermarking method achieves blind extraction, in which the original image isn't required for watermark extraction, and the embedded image can be restored with high visual quality.
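One common DCT-domain embedding trick, shown here only as a hedged sketch and not as the authors' algorithm, encodes one bit per block in the order relation of two mid-frequency coefficients. The coefficient positions and margin are illustrative, and the DCT itself, the feature selection, and the marker bits for cropping resistance are omitted.

```python
P1, P2 = (2, 1), (1, 2)  # two mid-frequency coefficient positions (hypothetical)

def embed_bit(block, bit, margin=2.0):
    """Reorder the two chosen DCT coefficients so their relation carries
    the bit: P1 > P2 encodes 1, P1 < P2 encodes 0. A minimum margin
    keeps the relation from flipping under mild distortion."""
    a, b = block[P1[0]][P1[1]], block[P2[0]][P2[1]]
    hi, lo = max(a, b), min(a, b)
    if hi - lo < margin:
        hi = lo + margin
    if bit:
        block[P1[0]][P1[1]], block[P2[0]][P2[1]] = hi, lo
    else:
        block[P1[0]][P1[1]], block[P2[0]][P2[1]] = lo, hi
    return block

def extract_bit(block):
    """Blind extraction: only the marked block is needed, not the original."""
    return 1 if block[P1[0]][P1[1]] > block[P2[0]][P2[1]] else 0
```

Because extraction only compares two coefficients of the received block, no reference image is required, which is the essence of the blind-extraction property the abstract mentions.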

Journal ArticleDOI
TL;DR: The authors propose a data-driven approach to explore the use of friendship locality, social proximity, and content proximity for geographically nearby users and extensively evaluates the proposed method using a large-scale real dataset to achieve 15 percent relative improvement over state-of-the-art approaches.
Abstract: Location information in social media is becoming increasingly vital in applications such as election prediction, epidemic forecasting, and emergency detection. However, only a tiny proportion of users proactively share their residence locations (which can be used to approximate the locations of most user-generated content) in their profiles, and inferring the residence location of the remaining users is nontrivial. In this article, the authors propose a framework for residence location inference in social media by jointly considering social, visual, and textual information. They first propose a data-driven approach to explore the use of friendship locality, social proximity, and content proximity for geographically nearby users. Based on these observations, they then propose a location propagation algorithm to effectively infer residence location for social media users. They extensively evaluate the proposed method using a large-scale real dataset and achieve a 15 percent relative improvement over state-of-the-art approaches.
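The friendship-locality observation suggests a simple propagation baseline: an unlabeled user iteratively adopts the majority location among already-located friends. This toy sketch uses friendship alone, whereas the article's algorithm also weighs social and content proximity.

```python
from collections import Counter

def propagate_locations(friends, known, rounds=3):
    """friends: user -> list of friends; known: user -> self-reported city.
    Unlabeled users adopt the majority location among located friends."""
    locations = dict(known)
    for _ in range(rounds):
        for user, fs in friends.items():
            if user in known:
                continue  # never overwrite self-reported locations
            votes = Counter(locations[f] for f in fs if f in locations)
            if votes:
                locations[user] = votes.most_common(1)[0][0]
    return locations

# Hypothetical graph: "ann" and "cam" have no profile location.
friends = {"ann": ["bob", "cam"], "bob": ["ann"],
           "cam": ["ann", "dan"], "dan": ["cam"]}
known = {"bob": "Beijing", "dan": "Beijing"}
```

Users with no located friends in round one can still be reached in later rounds once their neighbors acquire labels, which is why the propagation is run for several iterations.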

Journal ArticleDOI
TL;DR: The authors present two mobile systems, MMedia2U and CAPTAIN, that take the concept of context-aware multimedia management beyond photo organization and annotation.
Abstract: Context-aware and semantic-based technologies have been successfully employed to improve multimedia management in mobile environments. Large sets of context-tagged images on the Web are a concrete example of this achievement. The authors present two mobile systems, MMedia2U and CAPTAIN, that take the concept of context-aware multimedia management beyond photo organization and annotation. CAPTAIN is a tool that helps generate logbooks using context-tagged images and tracking data. Crewmembers used this tool to manage multimedia content and publish it to a blog during a sea expedition. MMedia2U is a mobile photo recommender system that exploits the user's context and context-tagged images to improve photo recommendation.

Journal ArticleDOI
TL;DR: The authors develop a method that improves face-clustering accuracy by incorporating the social context information inherent among characters in a movie by presenting a fusion scheme that eliminates ambiguities and bridges information from two fields.
Abstract: Clustering faces in movies is a challenging task because faces in a feature-length film are relatively uncontrolled and vary widely in appearance. Such variations make it difficult to appropriately measure the similarity between faces under significantly different settings. In this article, the authors develop a method that improves face-clustering accuracy by incorporating the social context information inherent among characters in a movie. In particular, they study the relation of social network construction and face clustering and present a fusion scheme that eliminates ambiguities and bridges information from two fields. Experiments on real-world data show superior clustering performance compared with state-of-the-art methods. Furthermore, their method can help incrementally build a character's social network that is similar to a manually labeled example.

Journal ArticleDOI
TL;DR: The authors first present a linear projection view to formulate subspace learning and then develop a novel framework, called Latent Subspace Projection Pursuit (LSPP), to estimate the intrinsic dimension, remove corruptions, and recover the subspace structure of observed datasets.
Abstract: This article develops a novel subspace learning algorithm for visual tracking. Specifically, the authors first present a linear projection view to formulate subspace learning and then develop a novel framework, called Latent Subspace Projection Pursuit (LSPP), to estimate the intrinsic dimension, remove corruptions, and recover the subspace structure of observed datasets. The authors evaluate the performance of their proposed method on various synthetic and real-world datasets, and the experimental results demonstrate that LSPP can achieve significant improvements in performance and reduced computational complexity for visual tracking.

Journal ArticleDOI
TL;DR: The authors propose a bit-level context-adaptive correlation model to exploit high-order statistical correlation for wavelet-domain distributed video coding (DVC) and introduce SI binning to classify the SI based on its quality.
Abstract: The authors propose a bit-level context-adaptive correlation model to exploit high-order statistical correlation for wavelet-domain distributed video coding (DVC). The magnitude and sign of each coefficient are coded separately in a bit-plane fashion. The contexts for the magnitude bit planes are designed based on the side information (SI), the local neighborhood, and the parent coefficient. The sign bit plane takes the sign of the SI as the context. The authors also introduce SI binning to classify the SI based on its quality. The SI's class is then included in the contexts for both magnitude coding and sign coding. Experimental results show that the proposed scheme provides significant coding gain over existing DVC systems.

Journal ArticleDOI
TL;DR: This article introduces a novel paradigm, describes a set of derived query-processing strategies and compares them along three dimensions: push versus pull, whether or not a communication infrastructure is utilized, and whether metadata dissemination is separated from blob dissemination.
Abstract: In this article, the authors study querying binary large objects (blobs), such as video and voice clips, in a network of vehicles communicating wirelessly. They introduce a novel paradigm, describe a set of derived query-processing strategies, and compare them along three dimensions: push versus pull, whether or not a communication infrastructure is utilized, and whether metadata dissemination is separated from blob dissemination. They analyze these strategies theoretically and experimentally in terms of answer throughput and communication overhead.

Journal ArticleDOI
John R. Smith1
TL;DR: The visual semantic concept basis is larger than the number of unique words, and more effort is needed to build out the set of visual semantic concepts.
Abstract: Visual scenes require complex description and modeling that involves more than a list of words. The visual semantic concept basis is larger than the number of unique words. More effort is needed to build out the set of visual semantic concepts.