Showing papers in "Multimedia Systems in 2016"
TL;DR: This paper critically examines MOS and the various ways it is used today, and discusses a variety of alternative approaches that have been proposed for media quality measurement.
Abstract: Mean opinion score (MOS) has become a very popular indicator of perceived media quality. While there is a clear benefit to such a "reference quality indicator" and its widespread acceptance, MOS is often applied without sufficient consideration of its scope or limitations. In this paper, we critically examine MOS and the various ways it is being used today. We highlight common issues with both subjective and objective MOS and discuss a variety of alternative approaches that have been proposed for media quality measurement.
336 citations
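Since MOS is simply the mean of a set of subjective ratings, a minimal sketch of how a MOS value and its 95 % confidence interval are computed from raw 1-5 opinion scores may be useful; the function name and the sample ratings below are illustrative only.

```python
import numpy as np
from scipy import stats

def mos_with_ci(ratings, confidence=0.95):
    """Mean opinion score and confidence interval from raw 1-5 ratings."""
    ratings = np.asarray(ratings, dtype=float)
    n = ratings.size
    mos = ratings.mean()
    # Standard error of the mean; t-distribution for small subject counts.
    sem = ratings.std(ddof=1) / np.sqrt(n)
    half_width = stats.t.ppf((1 + confidence) / 2, df=n - 1) * sem
    return mos, (mos - half_width, mos + half_width)

# Example: 24 subjects rating one test condition on the 5-point ACR scale.
scores = [4, 5, 3, 4, 4, 5, 2, 4, 3, 4, 4, 5, 4, 3, 4, 4, 5, 3, 4, 4, 2, 4, 4, 5]
mos, ci = mos_with_ci(scores)
print(f"MOS = {mos:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

Reporting the confidence interval alongside the mean is one of the practices the paper argues is too often omitted when MOS values are compared.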
TL;DR: This work proposes a cross-media public sentiment analysis system for microblog which fuses the text sentiment and image sentiment not only from sentiment results, but also from sentiment ontology.
Abstract: Since classical public sentiment analysis systems for microblogs are based on text sentiment analysis, it is difficult to determine the sentiment of short texts without clear sentiment words in microblog posts. Fortunately, many microblog posts contain images which also reflect users' sentiment. To fully understand users' sentiment, we propose a cross-media public sentiment analysis system for microblogs. The key advantage of this novel system is its unified cross-media public sentiment analysis framework, which fuses text sentiment and image sentiment not only at the level of sentiment results, but also at the level of the sentiment ontology. To enhance presentation, the system presents sentiment results from a macroscopic view and a microscopic view, the latter detailing the sentiment results by region, topic, microblog content and user diffusion. To our knowledge, this is the first unified cross-media public sentiment analysis system.
69 citations
TL;DR: A self-critical evaluation of the Pl@ntNet experience with regard to the requirements of a sustainable and effective ecological surveillance tool, discussing in particular the bias and the incompleteness of the produced data.
Abstract: Pl@ntNet is an innovative participatory sensing platform relying on image-based plant identification as a means to enlist non-expert contributors and facilitate the production of botanical observation data. One year after the public launch of the mobile application, we carry out a self-critical evaluation of the experience with regard to the requirements of a sustainable and effective ecological surveillance tool. We first demonstrate the attractiveness of the developed multimedia system (with more than 90K end-users) and the good self-improving capacities of the whole collaborative workflow. We then point out the current limitations of the approach towards producing timely and accurate distribution maps of plants at a very large scale. We discuss in particular two main issues: the bias and the incompleteness of the produced data. We finally open new perspectives and describe upcoming realizations towards bridging these gaps.
66 citations
TL;DR: A robust human action recognition algorithm by tensor representation and Tucker decomposition, where the still image containing a human action is represented by a tensor descriptor (Histograms of Oriented Gradients).
Abstract: Spatial information is an important cue for human action recognition. Unlike a vector representation, a tensor representation can preserve the spatial structure of human actions in still images. This paper proposes a robust human action recognition algorithm based on tensor representation and Tucker decomposition. In this method, the still image containing a human action is represented by a tensor descriptor (Histograms of Oriented Gradients), which preserves the spatial information inside the human action. Based on this representation, the unknown tensor parameter is first decomposed according to the Tucker tensor decomposition, and then the optimization problem is solved using an alternating optimization method: at each iteration, the tensor descriptor is projected along one order and the parameter along the corresponding order is estimated by solving a Ridge Regression problem. The estimated tensor parameter is more discriminative because it effectively uses the spatial information along each order. Experiments are conducted using action images from three publicly available databases. Experimental results demonstrate that our method outperforms other methods.
40 citations
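To make the alternating scheme concrete, here is a minimal sketch of alternating ridge regression for a rank-1 weight tensor; the paper uses a full Tucker decomposition with a core tensor, so this is a simplified special case, and all names are illustrative.

```python
import numpy as np

def rank1_tensor_ridge(X, y, lam=1.0, n_iter=20, seed=0):
    """Alternating ridge regression for a rank-1 weight tensor w = u (x) v (x) z.

    X: (n_samples, I, J, K) stack of tensor HOG descriptors.
    y: (n_samples,) class labels encoded as +1 / -1.
    """
    rng = np.random.default_rng(seed)
    _, I, J, K = X.shape
    u, v, z = rng.standard_normal(I), rng.standard_normal(J), rng.standard_normal(K)
    for _ in range(n_iter):
        # Mode-1 update: project X along modes 2 and 3, then solve ridge for u.
        A = np.einsum('nijk,j,k->ni', X, v, z)
        u = np.linalg.solve(A.T @ A + lam * np.eye(I), A.T @ y)
        # Mode-2 update.
        A = np.einsum('nijk,i,k->nj', X, u, z)
        v = np.linalg.solve(A.T @ A + lam * np.eye(J), A.T @ y)
        # Mode-3 update.
        A = np.einsum('nijk,i,j->nk', X, u, v)
        z = np.linalg.solve(A.T @ A + lam * np.eye(K), A.T @ y)
    return u, v, z

def score(X, u, v, z):
    # Inner product of each descriptor with the rank-1 weight tensor;
    # the sign of the score gives the predicted class.
    return np.einsum('nijk,i,j,k->n', X, u, v, z)
```

Each mode update is a closed-form ridge solve on the descriptor projected along the other two modes, which is exactly the projection-then-regression loop the abstract describes.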
TL;DR: This paper proposes a semi-supervised feature selection algorithm that is based on a hierarchical regression model that utilizes a statistical approach to exploit both labeled and unlabeled data, which preserves the manifold structure of each feature type.
Abstract: Feature selection is an important step in large-scale image data analysis, which has proved to be difficult due to the large size in both dimensions and samples. Feature selection first eliminates redundant and irrelevant features and then chooses a subset of features that performs as efficiently as the complete set. Generally, supervised feature selection yields better performance than unsupervised feature selection because it utilizes label information. However, labeled data samples are always expensive to obtain, which constrains the performance of supervised feature selection, especially for large web image datasets. In this paper, we propose a semi-supervised feature selection algorithm that is based on a hierarchical regression model. Our contribution can be highlighted as: (1) Our algorithm utilizes a statistical approach to exploit both labeled and unlabeled data, which preserves the manifold structure of each feature type. (2) The predicted label matrix of the training data and the feature selection matrix are learned simultaneously, so that the two mutually benefit each other. Extensive experiments are performed on three large-scale image datasets. Experimental results demonstrate the better performance of our algorithm, compared with state-of-the-art algorithms.
34 citations
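The paper's joint model is not reproduced here, but a toy sketch can illustrate the general semi-supervised idea of combining a manifold-smoothness term computed on all samples with a label-relevance term computed on the labeled subset; this is a loose analogue, not the authors' hierarchical regression model, and all names are illustrative.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def semi_supervised_feature_scores(X, y_labeled, labeled_idx, k=5):
    """Toy semi-supervised feature ranking (not the paper's exact model):
    combines a Laplacian smoothness term over all samples with a
    label-correlation term over the labeled subset."""
    n, d = X.shape
    # kNN graph over labeled + unlabeled data; L = D - W is the graph Laplacian.
    W = kneighbors_graph(X, k, mode='connectivity', include_self=False)
    W = (0.5 * (W + W.T)).toarray()
    L = np.diag(W.sum(axis=1)) - W
    # Per-feature smoothness x_d^T L x_d: low values = manifold-preserving.
    smoothness = np.einsum('id,ij,jd->d', X, L, X)
    smoothness /= X.var(axis=0) * n + 1e-12
    # Supervised relevance: squared correlation with labels on labeled data.
    Xl = X[labeled_idx]
    rel = np.array([np.corrcoef(Xl[:, j], y_labeled)[0, 1] ** 2 for j in range(d)])
    return rel - smoothness  # higher score = more useful feature
```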
TL;DR: A serious game framework based on augmented reality technology is presented that may motivate patients' involvement in rehabilitation exercises; evaluations show that the serious games with vibrotactile feedback are well accepted by patients.
Abstract: Stroke is considered one of the main causes of death around the world. Survivors often suffer different kinds of disabilities in terms of their cognitive and motor capabilities, and are therefore unable to perform their day-to-day activities. To regain some of their cognitive as well as motor abilities, they require rehabilitation. To this end, we present a serious game framework based on augmented reality technology that may motivate the patients' involvement in the rehabilitation exercise. Additionally, we analyze the requirements for such a framework and describe the concept and implementation of the proposed approach. Furthermore, we designed a wireless vibrotactile output device that is attached to a tangible object. The tangible object that is connected to the framework can give haptic as well as audio-visual feedback to the patient in a more motivating and entertaining environment for rehabilitation exercises. The suitability and utility of the proposed framework was evaluated with real stroke patients and compared against the performance of a healthy control group, thus facilitating occupational therapists in assessing a patient's progress. Our evaluations show that the serious games with vibrotactile feedback are well accepted by patients.
31 citations
TL;DR: A novel learning-based logo detection method with social network information assistance is proposed, together with a new dense-histogram-type feature to classify logo and non-logo image patches.
Abstract: Recent years have seen the rapid development of social networks. For companies, microblog platforms are more and more important as a source for disseminating brand information and monitoring brand development. Compared with the text information frequently used in traditional media, microblog platforms provide information about brands in more forms, such as images and other related information types. According to statistics, microblog posts on social networks contain an ever-growing percentage of images. Hence, recognizing logos in images from social networks is of high value. To address this problem, we propose a novel learning-based logo detection method with social network information assistance. A new dense histogram type feature is proposed to classify logo and non-logo image patches. To increase detection precision, social network content is analyzed and employed to filter and reduce detection window candidates. Through evaluation on large-scale data collected from the Sina Weibo platform, the proposed method is demonstrated to be effective.
30 citations
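A hedged sketch of the detection loop: a per-patch histogram feature feeds a linear classifier, and a candidate mask derived from social-network content prunes sliding windows, as the abstract describes. The plain intensity histogram below stands in for the paper's dense histogram feature; all names are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

def dense_histogram(patch, bins=32):
    """Simple dense intensity histogram as a stand-in for the paper's feature."""
    h, _ = np.histogram(patch, bins=bins, range=(0, 256), density=True)
    return h

def detect_logos(image, clf, win=64, stride=32, candidate_mask=None):
    """Slide a window; skip regions ruled out by social-content filtering."""
    hits = []
    H, W = image.shape[:2]
    for y in range(0, H - win + 1, stride):
        for x in range(0, W - win + 1, stride):
            if candidate_mask is not None and not candidate_mask[y:y+win, x:x+win].any():
                continue  # pruned by the (hypothetical) social-content filter
            feat = dense_histogram(image[y:y+win, x:x+win])
            if clf.decision_function([feat])[0] > 0:
                hits.append((x, y, win, win))
    return hits

# Training, schematically: clf = LinearSVC().fit(patch_features, labels)
# on labeled logo / non-logo patches.
```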
TL;DR: A set of comprehensive empirical studies is documented to explore the effects of multiple query evidences on large-scale social image search, and a novel quantitative metric is proposed and applied to assess the influences of different visual queries based on their complexity levels.
Abstract: System performance assessment and comparison are fundamental for large-scale image search engine development. This article documents a set of comprehensive empirical studies to explore the effects of multiple query evidences on large-scale social image search. The search performance based on social tags, different kinds of visual features and their combinations is systematically studied and analyzed. To quantify visual query complexity, a novel quantitative metric is proposed and applied to assess the influences of different visual queries based on their complexity levels. In addition, we study the effects on retrieval performance of automatic text query expansion with social tags using a pseudo relevance feedback method. Our analysis of experimental results shows a few key research findings: (1) social tag-based retrieval methods can achieve much better results than content-based retrieval methods; (2) a combination of textual and visual features can significantly and consistently improve search performance; (3) the complexity of image queries has a strong correlation with retrieval result quality: more complex queries lead to poorer search effectiveness; and (4) query expansion based on social tags frequently causes search topic drift and consequently leads to performance degradation.
26 citations
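For context on finding (4), a minimal sketch of tag-based pseudo relevance feedback: assume the top-ranked results are relevant and append their most frequent social tags to the query. Field names such as `result['tags']` are assumptions for illustration.

```python
from collections import Counter

def expand_query_with_tags(query_terms, ranked_results, top_k=10, n_tags=5):
    """Pseudo relevance feedback: treat the top-k results as relevant and
    add their most frequent social tags to the query."""
    tag_counts = Counter()
    for result in ranked_results[:top_k]:
        tag_counts.update(t for t in result['tags'] if t not in query_terms)
    expansion = [tag for tag, _ in tag_counts.most_common(n_tags)]
    return list(query_terms) + expansion

# e.g. expand_query_with_tags(['sunset'], results) might add 'beach', 'sky', ...
# If the top-k results are off-topic, the added tags drift the query, which is
# the failure mode the study observes.
```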
TL;DR: An extensive set of experiments is performed and this method is compared with some of the most recent approaches in the field, using publicly available datasets as well as a new annotated human action dataset including actions performed in complex scenarios.
Abstract: Detecting suspicious behavior in high definition (HD) videos is a complex and time-consuming process. To solve that problem, a fast suspicious behavior recognition method based on motion vectors is proposed. In this paper, the data format and decoding features of HD videos are analyzed. Then, the characteristics of suspicious activities and the ways of obtaining motion vectors directly from the video stream are summarized. The motion vectors are normalized by taking the reference frames into account, and feature vectors capturing the inter-frame and intra-frame information of the region of interest are extracted. A Gaussian radial basis function is employed as the kernel function of a support vector machine (SVM), which performs the detection and classification of suspicious behavior in HD videos. Finally, an extensive set of experiments is performed and this method is compared with some of the most recent approaches in the field, using publicly available datasets as well as a new annotated human action dataset including actions performed in complex scenarios.
25 citations
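The classification stage maps directly onto a standard RBF-kernel SVM; a hedged sketch follows, where the feature files and their contents are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# X: one row per video segment, e.g. normalized motion-vector statistics
# (mean/variance of magnitude and direction, inter- and intra-frame energy).
# y: 1 = suspicious behavior, 0 = normal. Both files are hypothetical.
X = np.load('mv_features.npy')
y = np.load('labels.npy')

# Gaussian RBF kernel, as in the paper; C and gamma would be tuned in practice.
clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', gamma='scale', C=10.0))
clf.fit(X, y)
print(clf.predict(X[:5]))
```

Working on motion vectors parsed from the compressed stream avoids full decoding, which is where the method's speed advantage on HD video comes from.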
TL;DR: This paper deviates from traditional visual speech information and proposes an AVSR system integrating 3D lip information that improves the recognition performance of traditional ASR and AVSR systems in acoustic noise environments.
Abstract: Audio-visual speech recognition (AVSR) has shown impressive improvements over audio-only speech recognition in the presence of acoustic noise. However, the problems of region-of-interest detection and feature extraction may influence recognition performance, because the visual speech information is typically obtained from planar video data. In this paper, we deviate from the traditional visual speech information and propose an AVSR system integrating 3D lip information. The Microsoft Kinect multi-sensory device was adopted for data collection. Different feature extraction and selection algorithms were applied to the planar images and the 3D lip information, so as to fuse the planar image and 3D lip features into a joint visual-3D lip feature. For automatic speech recognition (ASR), fusion methods were investigated and the audio-visual speech information was integrated into a state-synchronous two-stream Hidden Markov Model. The experimental results demonstrate that our AVSR system integrating 3D lip information improves the recognition performance of traditional ASR and AVSR systems in acoustic noise environments.
22 citations
TL;DR: A framework, called RecAm, is proposed, which enables the collection of contextual information and the delivery of the resulting recommendations by adapting the user's environment using Ambient Intelligence (AmI) interfaces, and establishes a bridge between the multimedia resources, user joint preferences, and the detected contextual information.
Abstract: With ever-increasing accessibility to different multimedia contents in real time, it is difficult for users to identify the proper resources from such a vast number of choices. By utilizing the user's context while consuming diverse multimedia contents, we can identify different personal preferences and settings. However, there is a need to reinforce the recommendation process in a systematic way, with context-adaptive information. The contributions of this paper are twofold. First, we propose a framework, called RecAm, which enables the collection of contextual information and the delivery of the resulting recommendations by adapting the user's environment using Ambient Intelligence (AmI) interfaces. Second, we propose a recommendation model that establishes a bridge between the multimedia resources, user joint preferences, and the detected contextual information. Hence, we obtain a comprehensive view of the user's context, as well as provide a personalized environment to deliver the feedback. We demonstrate the feasibility of RecAm with two prototype applications that use contextual information for recommendations. Offline experiments on two real-world datasets show the improvement of delivering personalized recommendations based on the user's context.
TL;DR: This paper formulates topic detection and tracking as an online tracking, detection and learning problem, and obtains a topic tracker which can also discover novel topics from new stream data.
Abstract: With the pervasiveness of online social media and the rapid growth of web data, a large amount of multimedia data is available online. However, how to organize it to facilitate users' experience and government supervision remains a problem yet to be seriously investigated. Topic detection and tracking, which has been a hot research topic for decades, can cluster web videos into different topics according to their semantic content. However, how to discover and track topics online from web videos and images has not been fully discussed. In this paper, we formulate topic detection and tracking as an online tracking, detection and learning problem. First, by learning from historical data, including labeled data and plenty of unlabeled data, using a semi-supervised multi-class multi-feature method, we obtain a topic tracker which can also discover novel topics from new stream data. Second, when new data arrive, an online updating method is developed to make the topic tracker adaptable to the evolution of the stream data. We conduct experiments on a public dataset to evaluate the performance of the proposed method, and the results demonstrate its effectiveness for topic detection and tracking.
TL;DR: An unsupervised co-clustering framework is proposed to address both the pixel spectral and spatial constraints, in which the relationship among pixels is formulated using an undirected bipartite graph.
Abstract: The high dimensionality of hyperspectral images is usually coupled with limited available data, which degrades the performance of clustering techniques based only on pixel spectra. To improve clustering performance, incorporation of spectral and spatial information is needed. As an attempt in this direction, in this paper, we propose an unsupervised co-clustering framework to address both pixel spectral and spatial constraints, in which the relationship among pixels is formulated using an undirected bipartite graph. The optimal partitions are obtained by spectral clustering on the bipartite graph. Experiments on four hyperspectral data sets are performed to evaluate the effectiveness of the proposed framework. Results show that our method achieves similar or better performance when compared to other clustering methods.
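As a rough illustration of bipartite-graph co-clustering (here between pixels and spectral bands, rather than the paper's exact pixel-relationship graph), scikit-learn's SpectralCoclustering can partition a flattened hyperspectral cube; the random cube below is stand-in data.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

# cube: (rows, cols, bands) hyperspectral image; flatten pixels to matrix rows.
cube = np.random.rand(50, 50, 64)            # stand-in data
X = cube.reshape(-1, cube.shape[2])          # pixels x bands, nonnegative

# Bipartite graph between pixels and bands, partitioned by spectral clustering.
model = SpectralCoclustering(n_clusters=6, random_state=0).fit(X)
pixel_labels = model.row_labels_.reshape(cube.shape[:2])  # per-pixel cluster map
band_labels = model.column_labels_                        # per-band assignment
```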
TL;DR: A cross-domain semantic modeling method for automatic image annotation for visual information from social network platforms, which leverages cross-domain datasets to discover the common knowledge of each semantic concept from UGCs and boosts the performance of semantic annotation by semantic transfer.
Abstract: With the rapid development of location-based social networks (LBSNs), more and more media data are unceasingly uploaded by users. The asynchrony between visual and textual information has made it extremely difficult to manage multimodal information for manual-annotation-free retrieval and personalized recommendation. Consequently, automated image semantic discovery of multimedia location-related user-generated contents (UGCs) has become mandatory for user experience. Most of the literature leverages single-modality data or correlated multimedia data for image semantic detection. However, the intrinsically heterogeneous UGCs in LBSNs are usually independent and uncorrelated, and it is hard to build correlations between textual and visual information. In this paper, we propose a cross-domain semantic modeling method for the automatic annotation of images from social network platforms. First, we extract a set of hot topics from the collected textual information for image dataset preparation. Then the proposed noisy sample filtering is applied to remove low-relevance photos. Finally, we leverage cross-domain datasets to discover the common knowledge of each semantic concept from UGCs and boost the performance of semantic annotation by semantic transfer. Comparison experiments on cross-domain datasets were conducted to demonstrate the superiority of the proposed method.
TL;DR: Reversible watermarking is proposed to protect the classical fingerprint recognition system from copy and replay attacks and ensure that native fingerprint recognition accuracy remains unaffected.
Abstract: The classical fingerprint recognition system can be compromised at the database and at the sensor. Therefore, to beef up the security of the fingerprint recognition system, reversible watermarking is proposed to protect these two points. Reversible watermarks thwart manipulations, viz., copy and replay attacks, and ensure that native fingerprint recognition accuracy remains unaffected. A fingerprint-dependent watermark W1 authenticates the database and shields it against the copy attack. A second watermark W2, derived from the higher-order moments of the fingerprint, verifies the fingerprint captured by the sensor and foils the replay attack. Divergence between the computed and extracted watermarks indicates loss of authenticity. The experimental results validate the proposed hypothesis.
TL;DR: Data collected from SopCast is used to show that there is high correlation between peer centrality—out-degree, out-closeness, and betweenness—in the P2P overlay graph and peer cooperation, and to propose a new regression-based model to predict peer cooperation from its past centrality.
Abstract: The Peer-to-Peer (P2P) architecture has been successfully used to reduce costs and increase the scalability of Internet live streaming systems. However, the effectiveness of these applications depends largely on user (peer) cooperation. In this article we use data collected from SopCast, a popular P2P live application, to show that there is a high correlation between peer centrality (out-degree, out-closeness, and betweenness) in the P2P overlay graph and peer cooperation. We use this finding to propose a new regression-based model to predict peer cooperation from its past centrality. Our model takes only peer out-degrees as input, as out-degree has the strongest correlation with peer cooperation. Our evaluation shows that our model has good accuracy and does not need to be retrained too often (e.g., once every 16 min). We also use our model to sketch a mechanism to detect malicious peers that report artificially inflated cooperation aiming at, for example, receiving better quality of service.
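Since the model reduces to regressing cooperation on past out-degree, a hedged sketch is a one-feature linear regression; the numbers below are stand-in placeholders purely to show the shape of the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in data: out-degree of 4 peers over 2 past measurement windows,
# and the cooperation (e.g., bytes uploaded) observed for each peer.
past_out_degree = np.array([[12, 3, 40, 7],
                            [15, 2, 38, 9]]).T.mean(axis=1)  # mean per peer
cooperation = np.array([310.0, 45.0, 990.0, 160.0])

model = LinearRegression().fit(past_out_degree.reshape(-1, 1), cooperation)
predicted = model.predict(np.array([[20.0]]))  # forecast for out-degree 20
```

A peer whose self-reported cooperation greatly exceeds the value predicted from its observed out-degree is a candidate for the malicious-peer detection the article sketches.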
TL;DR: A BHoG descriptor is proposed for sketch-based image retrieval, where the boundary image is detected from the natural image using the Berkeley boundary detector and then divided into many blocks, and the principal gradient orientation of each block is found.
Abstract: Due to the popularity of devices with touch screens, it is convenient to match images with a hand-drawn sketch query. However, existing methods usually care little about memory space and time efficiency and are thus inadequate for the rapid growth of multimedia resources. In this paper, a BHoG descriptor is proposed for sketch-based image retrieval. Firstly, the boundary image is detected from the natural image using the Berkeley boundary detector and then divided into many blocks. Secondly, we calculate the gradient feature of each block and find the principal gradient orientation. Finally, the principal gradient orientation is encoded into binary codes, which are shown to be efficient and discriminative. We evaluated the performance of BHoG on a large-scale social media dataset. The experimental results show that BHoG not only performs better in terms of flexibility and efficiency, but also occupies little memory.
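A hedged sketch of the per-block encoding step: compute a gradient-orientation histogram for each block of the boundary image, take the dominant bin, and emit it as a short binary code (matching can then use Hamming distance). The grid size and bin count are illustrative assumptions.

```python
import numpy as np

def bhog_block_code(block, n_bins=8):
    """Binary code of the principal gradient orientation of one boundary block."""
    gy, gx = np.gradient(block.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)           # orientation in [0, pi)
    hist, _ = np.histogram(ang, bins=n_bins, range=(0, np.pi), weights=mag)
    principal = int(np.argmax(hist))                   # dominant orientation bin
    n_bits = int(np.ceil(np.log2(n_bins)))
    return format(principal, f'0{n_bits}b')            # e.g. '101' for bin 5

def bhog_descriptor(boundary_image, grid=(8, 8)):
    """Concatenate block codes over a fixed grid of the boundary image."""
    H, W = boundary_image.shape
    bh, bw = H // grid[0], W // grid[1]
    return ''.join(bhog_block_code(boundary_image[r*bh:(r+1)*bh, c*bw:(c+1)*bw])
                   for r in range(grid[0]) for c in range(grid[1]))
```

Storing only a few bits per block is what keeps the memory footprint small relative to a full floating-point HOG vector.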
TL;DR: This special issue solicits recent related attempts in the multimedia community connecting both the social media and big data contexts to multimedia computing.
Abstract: In the past decade, social media has contributed significantly to the arrival of the Big Data era. Big Data has not only provided new solutions for social media mining and applications, but also brought about a paradigm shift in many fields of data analytics. This special issue solicits recent related attempts in the multimedia community. We believe that the enclosed papers in this special issue provide a unique opportunity for multidisciplinary works connecting both the social media and big data contexts to multimedia computing.
TL;DR: A novel framework called comprehensive video tagger (CVTagger) to facilitate accurate tag-based video annotation, which applies both multimodal and temporal properties combined with a novel hierarchical classification framework based on a multilayer concept model and regression analysis.
Abstract: Accurate video tagging has become increasingly crucial for online video management and search. This article documents a novel framework called comprehensive video tagger (CVTagger) to facilitate accurate tag-based video annotation. The system applies both multimodal and temporal properties, combined with a novel hierarchical classification framework based on a multilayer concept model and regression analysis. The advanced architecture enables effective incorporation of both video concept dependency and temporal dynamics. Using a large-scale test collection containing 50,000 YouTube videos, a set of empirical studies has been carried out, and the experimental results demonstrate various advantages of CVTagger over state-of-the-art techniques.
TL;DR: A novel visual tracking algorithm via online semi-supervised co-boosting, which has a good ability to recover from drifting by incorporating prior knowledge of the object while being adaptive to appearance changes by effectively combining the complementary strengths of different feature views.
Abstract: This paper proposes a novel visual tracking algorithm via online semi-supervised co-boosting, which investigates the benefits of co-boosting (i.e., the integration of co-training and boosting) and semi-supervised learning in the online tracking process. Existing discriminative tracking algorithms often use the classification results to update the classifier itself. However, classification errors are easily accumulated during this self-training process. In this paper, we employ an effective online semi-supervised co-boosting framework to update the weak classifiers built on two different feature views. In this framework, the pseudo-label and importance of an unlabeled sample are estimated based on additive logistic regression, integrating a prior model and an online classifier learned on one feature view, and are then used to update the weak classifiers built on the other feature view. The proposed algorithm has a good ability to recover from drifting by incorporating prior knowledge of the object, while being adaptive to appearance changes by effectively combining the complementary strengths of different feature views. Experimental results on a series of challenging video sequences demonstrate the superior performance of our algorithm compared to state-of-the-art tracking algorithms.
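The cross-view update can be illustrated with a simplified co-training-style exchange; the paper additionally uses boosting with a prior model, which this sketch omits. Both classifiers are assumed to be already initialized with `partial_fit` on labeled data from their respective views, and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def cotrain_update(clf_a, clf_b, X_view_a, X_view_b, threshold=1.0):
    """One co-training-style exchange: each view's classifier pseudo-labels
    confident unlabeled samples for the other view (simplified analogue of
    the paper's semi-supervised co-boosting update)."""
    for src, dst, X_src, X_dst in ((clf_a, clf_b, X_view_a, X_view_b),
                                   (clf_b, clf_a, X_view_b, X_view_a)):
        margin = src.decision_function(X_src)
        confident = np.abs(margin) > threshold
        if confident.any():
            pseudo = (margin[confident] > 0).astype(int)
            # Importance of each pseudo-labeled sample grows with the margin.
            dst.partial_fit(X_dst[confident], pseudo,
                            sample_weight=np.abs(margin[confident]))
    return clf_a, clf_b
```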
TL;DR: Based on native Thai users, the evaluation results surprisingly show that ThaiVQE achieves better accuracy and reliability than the standard E-model, with error reductions of over 13 % for G.711 and 28 % for G.729; this is an example study for other countries that have their own languages and cultures to create their own subjective MOS models.
Abstract: This paper presents a mathematical model that has been created from subjective MOS, instead of modifying or improving existing objective measurement methods (e.g., the E-model) for VoIP quality measurement. The proposed VoIP quality measurement model is based on native Thai users who communicate with each other in Thai, which is a tonal language, unlike English and most western languages. The data have been gathered using conversation-opinion tests with 400 and 354 native Thai subjects for two popular codecs, G.711 and G.729, respectively, covering the effects of two major network factors, packet loss and packet delay. This model is called the Thai subjective VoIP quality evaluation model (ThaiVQE). It has been evaluated using two test sets of subjective MOS, from 50 native Thai subjects for G.711 and 64 native Thai subjects for G.729, and the results have been compared with the E-model results. Based on native Thai users, the evaluation results surprisingly show that ThaiVQE achieves better accuracy and reliability than the standard E-model, with error reductions of over 13 % for G.711 and 28 % for G.729. Therefore, this is an example study for other countries that have their own languages and cultures to create their own subjective MOS models.
TL;DR: A novel visually assisted instant messaging scheme named Chat with Illustration (CWI), which automatically presents users with visual messages associated with their textual messages; a visual dialogue summarization is also proposed to help users recall past dialogues.
Abstract: Instant messaging is an important aspect of social media and has sprung up in recent decades. Traditional instant messaging services transfer information mainly via textual messages, while visual messages are ignored to a great extent. Such instant messaging services are thus far from satisfactory for all-around information communication. In this paper, we propose a novel visually assisted instant messaging scheme named Chat with Illustration (CWI), which automatically presents users with visual messages associated with their textual messages. When users start their chat, the system first identifies meaningful keywords from the dialogue content and analyzes grammatical and logical relations. Then CWI performs keyword-based image search on a hierarchically clustered image database which is built offline. Finally, according to the grammatical and logical relations, CWI assembles these images properly and presents an optimal visual message. With the combination of textual and visual messages, users get a more interesting and vivid communication experience. Especially for speakers of different native languages, CWI can help cross the language barrier to some degree. In addition, a visual dialogue summarization is also proposed, which helps users recall past dialogues. In-depth user studies demonstrate the effectiveness of our visually assisted instant messaging scheme.
TL;DR: This paper demonstrates how to uncover interesting spatio-temporal patterns by utilizing the aggregate measures released by a LBSN service, and describes the correlations between the social features of user clusters and users’ check-in patterns.
Abstract: Analysis of users' check-ins in location-based social networks (LBSNs, also called GeoSocial Networks), such as Foursquare and Yelp, is essential to understand users' mobility patterns and behaviors. However, most empirical results of users' mobility patterns reported in the current literature are based on users' sampled and nonconsecutive public check-ins. Additionally, such analyses take no account of the noise or false information in the dataset, such as dishonest check-ins created by users. These empirical results may be biased and hence may bring side effects to LBSN services, such as friend and venue recommendations. Foursquare, one of the most popular LBSNs, provides a feature called a user's score. A user's score is an aggregate measure computed by the system based on more accurate and complete check-ins of the user. It reflects a snapshot of the user's temporal and spatial patterns from his/her check-ins. For example, a high user score indicates that the user checked in at many venues regularly or s/he visited a number of new venues. In this paper, we show how a user's score can be used as an alternative way to investigate the user's mobility patterns. We first characterize a set of properties from the time series of a user's consecutive weekly scores. Based on these properties, we identify different types of users by clustering users' common check-in patterns using non-negative matrix factorization (NMF). We then analyze the correlations between the social features of user clusters and users' check-in patterns. We present several interesting findings. For example, users with high scores (more mobile) tend to have more friends (more social). Our empirical results demonstrate how to uncover interesting spatio-temporal patterns by utilizing the aggregate measures released by a LBSN service.
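A hedged sketch of the clustering step: factor the user-by-week score matrix with non-negative matrix factorization and assign each user to the latent check-in pattern with the largest loading. The random matrix below stands in for real weekly score series.

```python
import numpy as np
from sklearn.decomposition import NMF

# S[i, t]: weekly score of user i in week t (nonnegative); stand-in data here.
S = np.random.rand(500, 52)

nmf = NMF(n_components=5, init='nndsvda', random_state=0, max_iter=500)
U = nmf.fit_transform(S)           # user loadings on latent check-in patterns
patterns = nmf.components_         # 5 basis patterns over the 52 weeks
clusters = U.argmax(axis=1)        # assign each user to a dominant pattern
```

The per-cluster social features (e.g., friend counts) can then be correlated with the recovered patterns, which is how the paper arrives at findings such as "more mobile users tend to have more friends."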
TL;DR: An efficient dynamic path search method is proposed to detect the target video clips, and highly compact audio fingerprint and visual ordinal features are jointly utilized in a flexible frame to facilitate tolerance of the length variations caused during video re-targeting.
Abstract: Efficient and robust video copy detection is an important topic for many applications, such as commercial monitoring and social media retrieval. In this paper, with the aim of handling large-scale video data, we propose an efficient and robust video copy detection method jointly utilizing the characteristics of temporal continuity and multi-modality of video. The video is converted to a continuous sequence of states, and both the visual and auditory features are extracted for temporal frames. To facilitate tolerance of the length variations caused during video re-targeting, an efficient dynamic path search method is proposed to detect the target video clips, and highly compact audio fingerprint and visual ordinal features are jointly utilized in a flexible frame. The proposed scheme not only achieves high computational efficiency but also guarantees effectiveness in real applications. Comparison experiments were conducted using video commercials and real television programs from four channels as well as a benchmark video copy detection dataset, and the results demonstrate both the high efficiency and high robustness of the proposed method.
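The visual ordinal feature mentioned here can be sketched as the rank order of block-mean intensities per frame, which is stable under the brightness and contrast changes introduced by re-encoding; the grid size is an illustrative assumption.

```python
import numpy as np
from scipy.stats import rankdata

def ordinal_signature(frame, grid=(3, 3)):
    """Visual ordinal feature: rank of the mean intensity of each block.
    Robust to global brightness/contrast changes across re-encoded copies."""
    H, W = frame.shape
    bh, bw = H // grid[0], W // grid[1]
    means = [frame[r*bh:(r+1)*bh, c*bw:(c+1)*bw].mean()
             for r in range(grid[0]) for c in range(grid[1])]
    return rankdata(means)

def ordinal_distance(sig_a, sig_b):
    return np.abs(sig_a - sig_b).sum()   # small = likely copy of the same frame
```

In the paper, such per-frame visual signatures are combined with compact audio fingerprints, and a dynamic path search aligns query and reference sequences despite length variations.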
TL;DR: A framework for image clustering based on semi-supervised distance learning and multi-modal information is proposed, together with an effective clustering method.
Abstract: How to organize and retrieve images is now a great challenge in various domains. Image clustering is a key tool in some practical applications, including image retrieval and understanding. Traditional image clustering algorithms consider a single set of features and use ad hoc distance functions, such as the Euclidean distance, to measure the similarity between samples. However, multi-modal features can be extracted from images, and the dimension of multi-modal data is very high. In addition, we usually have a few, but not many, labeled images, which leads to semi-supervised learning. In this paper, we propose a framework for image clustering based on semi-supervised distance learning and multi-modal information. First we fuse multiple features and utilize a small amount of labeled images for semi-supervised metric learning. Then we compute similarity with the Gaussian similarity function and the learned metric. Finally, we construct a semi-supervised Laplace matrix for spectral clustering and propose an effective clustering method. Extensive experiments on several image data sets show the competitive performance of the proposed algorithm.
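A hedged sketch of the similarity-then-spectral-clustering pipeline: a Gaussian affinity under a (learned) Mahalanobis metric is fed to spectral clustering with a precomputed affinity. The identity metric below is a placeholder for one actually learned from the labeled pairs, and the data are stand-ins.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def gaussian_affinity(X, M, sigma=1.0):
    """Gaussian similarity under a Mahalanobis metric M (d x d, PSD)."""
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.einsum('ijk,kl,ijl->ij', diff, M, diff)   # squared metric distance
    return np.exp(-d2 / (2 * sigma ** 2))

# X: fused multi-modal features; M: metric learned from the few labeled
# images (e.g., with a metric-learning method); identity as a fallback here.
X = np.random.rand(200, 10)
A = gaussian_affinity(X, np.eye(10))
labels = SpectralClustering(n_clusters=4, affinity='precomputed',
                            random_state=0).fit_predict(A)
```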
TL;DR: The embedding procedure is analyzed to infer that Song et al. in fact used an embedding strategy that adds the gray information of the watermark image to the amplitude of the HT of the quantum carrier image with a weight α, and that the quantum watermark extraction cannot be implemented.
Abstract: Dear Sir, Song et al. [1] want to use quantum networks for arithmetic operations, i.e., the plain adder network in Ref. [4], to realize the addition operation. As we know, the inputs of the quantum gates of the quantum network are encoded in binary form in the computational basis of quantum registers. The addition of two registers |a⟩ and |b⟩ can be written as |a, b⟩ → |a, a + b⟩. As one can reconstruct the input (a, b) out of the output (a, a + b), there is no loss of information, and the calculation can be implemented reversibly. By observing HT(|C⟩) and |PW⟩, one can easily find that they are both multi-particle entangled states, so that the plain adder network is invalid for implementing HT(|CW⟩) = HT(|C⟩) + |PW⟩. Obviously, Song et al. neglected the constraint condition under which the plain adder network is used, that is, the inputs of the two registers should be encoded in binary form in the computational basis states. To further understand the embedding procedure, we analyze the classical simulation procedure in Ref. [1]. By reimplementing the classical simulation procedure in Ref. [1], we can infer that Song et al. [1] in fact used the embedding strategy of adding the gray information of the watermark image to the amplitude of the HT of the quantum carrier image with a weight α. A similar problem occurs in the watermark image's extracting phase. The embedder extracts the final watermark image in the following way: |W⟩ = P(HT(|CW⟩) − HT(|C⟩)). By analyzing and reimplementing the classical simulation procedure in Ref. [1], we infer that the authors of Ref. [1] in fact implemented the extracting strategy by subtracting the carrier image from the embedded carrier image. Similar to the watermark embedding algorithm, the implementation of the transform |W⟩ = P(HT(|CW⟩) − HT(|C⟩)) should also abide by the principles of quantum mechanics. That is, the quantum watermark extraction cannot be implemented.
TL;DR: An encryption framework design and implementation is proposed which adds region-of-interest encryption functionality to existing video surveillance systems with minimal integration and deployment effort; the performance of state-of-the-art face detectors, despite their frequent use in surveillance systems, is found to be insufficient for practical purposes.
Abstract: We propose an encryption framework design and implementation which adds region-of-interest encryption functionality to existing video surveillance systems with minimal integration and deployment effort. Apart from region-of-interest detection, all operations take place at the bit-stream level and require no re-compression whatsoever. This allows for very fast encryption and decryption speed at negligible space overhead. Furthermore, we provide both objective and subjective security evaluations of our proposed encryption framework. In addition, we address design- and implementation-related challenges and practical concerns. These include modularity, parallelization and, most notably, the performance of state-of-the-art face detectors. We find that their performance, despite their frequent use in surveillance systems, is insufficient for practical purposes, both in terms of speed and detection accuracy.
TL;DR: This paper proposes to improve OCCF accuracy by exploiting social media content information to find potential negative examples from the missing user-item pairs; it derives a content topic feature for each user and item by probabilistic topic modeling and embeds these features into the Matrix Factorization model.
Abstract: In recent years, recommender systems have become popular for handling the information overload problem of social media websites. The most widely used Collaborative Filtering methods make recommendations by mining users' rating history. However, users' behaviors in social media are usually implicit, where no ratings are available. This is a One-Class Collaborative Filtering (OCCF) problem with only positive examples. How to distinguish negative examples from missing data is important for OCCF. Existing OCCF methods tackle this through the statistical properties of users' historical behavior; however, they ignore the rich content information in social media websites, which provides additional evidence for profiling users and items. In this paper, we propose to improve OCCF accuracy by exploiting social media content information to find potential negative examples from the missing user-item pairs. Specifically, we derive a content topic feature for each user and item by probabilistic topic modeling and embed them into the Matrix Factorization model. Extensive experiments show that our algorithm can achieve better performance than state-of-the-art methods.
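A hedged sketch of the content side: derive topic features with LDA and treat the least topically similar missing pairs as likely negatives for the subsequent factorization. The paper embeds the topic features into matrix factorization itself, which this sketch omits; all names are illustrative.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def topic_vectors(texts, n_topics=20):
    """Topic feature per user/item from its associated social-media text."""
    counts = CountVectorizer(max_features=5000,
                             stop_words='english').fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    return lda.fit_transform(counts)   # one topic distribution per document

def pick_negative_pairs(user_topics, item_topics, observed, n_neg):
    """Among missing user-item pairs, treat the least topically similar ones
    as likely negatives for one-class collaborative filtering."""
    sim = user_topics @ item_topics.T
    sim[tuple(zip(*observed))] = np.inf      # never sample observed positives
    flat = np.argsort(sim, axis=None)[:n_neg]
    return np.column_stack(np.unravel_index(flat, sim.shape))
```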
TL;DR: An image annotation framework is proposed that uses a novel automatic random forest-based method and that takes into consideration visual and geographical closeness in the classification process and is able to annotate with a reasonable degree of confidence four of the main habitat classes.
Abstract: The classification of habitats is crucial for structuring knowledge and developing our understanding of the natural world. Currently, most successful methods employ human surveyors, a laborious, expensive and subjective process. In this paper, we formulate habitat classification as a fine-grained visual categorization problem. We build on previous work and propose an image annotation framework that uses a novel automatic random forest-based method and that takes visual and geographical closeness into consideration in the classification process. During training, low-level visual features and medium-level contextual information are extracted. For the latter, we use a human-in-the-loop methodology, asking humans a set of 17 questions about the appearance of the image that can easily be answered by non-ecologists, to extract medium-level knowledge about the images. During testing, and considering that nearby areas have similar ecological properties, we weight the influence of the prediction of each tree of the forest according to its distance to the unseen test photograph. Additionally, we present an updated version of a geo-referenced habitat image database containing over 1,000 high-resolution ground photographs that have been manually annotated by habitat classification experts. This publicly available image database has been specifically designed for the development of multimedia analysis techniques for ecological applications. We show experimental recall and precision results which illustrate that our image annotation framework is able to annotate, with a reasonable degree of confidence, four of the main habitat classes: woodland and scrub, grassland and marsh, heathland, and miscellaneous.
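A hedged sketch of the geo-weighted voting at test time: each tree's class probabilities are weighted by a Gaussian kernel on the distance between the test photo's location and a location associated with that tree. Associating each tree with, e.g., the centroid of its bootstrap sample is an assumption of this sketch, as are all names.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def geo_weighted_predict(forest, x, x_geo, tree_geo, bandwidth=10.0):
    """Weight each tree's vote by proximity between the test photo's location
    and the (hypothetical) location associated with the tree's training data."""
    dists = np.linalg.norm(tree_geo - x_geo, axis=1)
    weights = np.exp(-dists**2 / (2 * bandwidth**2))   # nearby trees count more
    votes = np.zeros(forest.n_classes_)
    for tree, w in zip(forest.estimators_, weights):
        votes += w * tree.predict_proba(x.reshape(1, -1))[0]
    return np.argmax(votes)   # index into forest.classes_

# forest = RandomForestClassifier(n_estimators=100).fit(features, habitat_labels)
# tree_geo: one representative (lat, lon) per tree, e.g. the centroid of its
# bootstrap sample's locations; an assumption made for this sketch.
```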
TL;DR: A visuo-haptic augmented reality system is presented for object manipulation and task learning from human demonstration and experiments show that a precedence graph, encoding the sequential structure of the task, can be successfully extracted from multiple user demonstrations and that the learned task can be executed by a robot system.
Abstract: A visuo-haptic augmented reality system is presented for object manipulation and task learning from human demonstration. The proposed system consists of a desktop augmented reality setup where users operate a haptic device for object interaction. Users of the haptic device are not co-located with the environment where real objects are present. A three degrees of freedom haptic device, providing force feedback, is adopted for object interaction by pushing, selection, translation and rotation. The system also supports physics-based animation of rigid bodies. Virtual objects are simulated in a physically plausible manner and seem to coexist with real objects in the augmented reality space. Algorithms for calibration, object recognition, registration and haptic rendering have been developed. Automatic model-based object recognition and registration are performed from 3D range data acquired by a moving laser scanner mounted on a robot arm. Several experiments have been performed to evaluate the augmented reality system in both single-user and collaborative tasks. Moreover, the potential of the system for programming robot manipulation tasks by demonstration is investigated. Experiments show that a precedence graph, encoding the sequential structure of the task, can be successfully extracted from multiple user demonstrations and that the learned task can be executed by a robot system.