Showing papers in "arXiv: Multimedia in 2016"


Proceedings ArticleDOI
TL;DR: In this article, a viewport-adaptive 360-degree video streaming system is proposed to reduce the bandwidth waste, while still providing an immersive experience, by preparing multiple video representations, which differ not only by their bit-rate, but also by the qualities of different scene regions.
Abstract: The delivery and display of 360-degree videos on Head-Mounted Displays (HMDs) present many technical challenges. 360-degree videos are ultra-high-resolution spherical videos, which contain an omnidirectional view of the scene. However, only a portion of this scene is displayed on the HMD. Moreover, HMDs need to respond to head movements within 10 ms, which prevents the server from sending only the displayed video part based on client feedback. To reduce the bandwidth waste, while still providing an immersive experience, a viewport-adaptive 360-degree video streaming system is proposed. The server prepares multiple video representations, which differ not only by their bit-rate, but also by the qualities of different scene regions. The client chooses a representation for the next segment such that its bit-rate fits the available throughput and a full-quality region matches its viewing direction. We investigate the impact of various spherical-to-plane projections and quality arrangements on the video quality displayed to the user, showing that the cube map layout offers the best quality for the given bit-rate budget. An evaluation with a dataset of users navigating 360-degree videos demonstrates that segments need to be short enough to enable frequent view switches.

228 citations
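
The client-side choice described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Representation` fields, the function names, and the tie-breaking rule (closest full-quality region first, then highest bit-rate) are assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Representation:
    # Hypothetical description of one server-side representation.
    bitrate_kbps: int        # average bit-rate of the representation
    center_yaw_deg: float    # yaw of the centre of the full-quality region
    center_pitch_deg: float  # pitch of the centre of the full-quality region


def angular_distance_deg(yaw1, pitch1, yaw2, pitch2):
    """Great-circle angle between two viewing directions, in degrees."""
    y1, p1, y2, p2 = map(math.radians, (yaw1, pitch1, yaw2, pitch2))
    cos_angle = (math.sin(p1) * math.sin(p2)
                 + math.cos(p1) * math.cos(p2) * math.cos(y1 - y2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def choose_representation(reps, throughput_kbps, view_yaw_deg, view_pitch_deg):
    """Pick a representation that fits the throughput and faces the viewer."""
    affordable = [r for r in reps if r.bitrate_kbps <= throughput_kbps]
    if not affordable:
        # Assumed fallback: nothing fits, so take the cheapest representation.
        return min(reps, key=lambda r: r.bitrate_kbps)
    # Assumed preference: nearest full-quality region, then highest bit-rate.
    return min(
        affordable,
        key=lambda r: (
            angular_distance_deg(r.center_yaw_deg, r.center_pitch_deg,
                                 view_yaw_deg, view_pitch_deg),
            -r.bitrate_kbps,
        ),
    )
```

Calling `choose_representation` once per segment, with a fresh throughput estimate and head orientation, mirrors the per-segment decision the abstract describes. Shorter segments let this decision track head movements more closely.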


Posted Content
Kaiye Wang, Qiyue Yin, Wei Wang, Shu Wu, Liang Wang 
TL;DR: A number of representative methods for cross-modal retrieval are reviewed and classified into two main groups: 1) real-valued representation learning, and 2) binary representation learning.
Abstract: In recent years, cross-modal retrieval has drawn much attention due to the rapid growth of multimodal data. It takes one type of data as the query to retrieve relevant data of another type. For example, a user can use text to retrieve relevant pictures or videos. Since the query and its retrieved results can be of different modalities, how to measure the content similarity between different modalities of data remains a challenge. Various methods have been proposed to deal with this problem. In this paper, we first review a number of representative methods for cross-modal retrieval and classify them into two main groups: 1) real-valued representation learning, and 2) binary representation learning. Real-valued representation learning methods aim to learn real-valued common representations for different modalities of data. To speed up cross-modal retrieval, a number of binary representation learning methods have been proposed to map different modalities of data into a common Hamming space. Then, we introduce several multimodal datasets in the community, and show the experimental results on two commonly used multimodal datasets. The comparison reveals the characteristics of different kinds of cross-modal retrieval methods, which is expected to benefit both practical applications and future research. Finally, we discuss open problems and future research directions.

222 citations
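
To illustrate why a common Hamming space speeds up retrieval, the sketch below ranks database items by Hamming distance to a query code. It assumes the binary codes have already been produced by some learned hash functions (one per modality, not shown here). The function names and the toy codes are hypothetical.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two binary codes stored as ints."""
    return bin(a ^ b).count("1")


def retrieve(query_code: int, database: dict, k: int = 3) -> list:
    """Return the ids of the k database items closest to the query code.

    `database` maps an item id to its binary code. Ties are broken by
    item id so that the ranking is deterministic.
    """
    ranked = sorted(
        database.items(),
        key=lambda kv: (hamming_distance(query_code, kv[1]), kv[0]),
    )
    return [item_id for item_id, _ in ranked[:k]]


# Toy example: a text query code matched against image codes.
image_codes = {"img_a": 0b1011, "img_b": 0b0100, "img_c": 0b1001}
print(retrieve(0b1010, image_codes, k=2))  # ['img_a', 'img_c']
```

Each comparison is an XOR plus a bit count, which is much cheaper than a floating-point similarity between real-valued vectors.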


Book ChapterDOI
TL;DR: In this article, a Variable-filter-size Residue-learning CNN (VRCNN) was proposed to improve the performance and to accelerate network training for High Efficiency Video Coding.
Abstract: Lossy image and video compression algorithms yield visually annoying artifacts including blocking, blurring, and ringing, especially at low bit-rates. To reduce these artifacts, post-processing techniques have been extensively studied. Recently, inspired by the great success of convolutional neural networks (CNNs) in computer vision, some research has been performed on adopting CNNs in post-processing, mostly for JPEG compressed images. In this paper, we present a CNN-based post-processing algorithm for High Efficiency Video Coding (HEVC), the state-of-the-art video coding standard. We redesign a Variable-filter-size Residue-learning CNN (VRCNN) to improve the performance and to accelerate network training. Experimental results show that using our VRCNN as post-processing leads to an average 4.6% bit-rate reduction compared to the HEVC baseline. The VRCNN outperforms previously studied networks in achieving higher bit-rate reduction, lower memory cost, and multiplied computational speedup.

201 citations


Posted Content
TL;DR: In this article, the authors propose an adaptive bandwidth-efficient 360 VR video streaming system using a divide-and-conquer approach, which uses MPEG-DASH SRD to describe the spatial relationship of tiles in the 360-degree space, and prioritizes the tiles in the Field of View (FoV).
Abstract: While traditional multimedia applications such as games and videos are still popular, there has been significant interest in recent years in new 3D media such as 3D immersion and Virtual Reality (VR) applications, especially 360 VR videos. 360 VR video is an immersive spherical video where the user can look around during playback. Unfortunately, 360 VR videos are extremely bandwidth intensive, and therefore are difficult to stream at acceptable quality levels. In this paper, we propose an adaptive bandwidth-efficient 360 VR video streaming system using a divide-and-conquer approach. In our approach, we propose a dynamic view-aware adaptation technique to tackle the huge streaming bandwidth demands of 360 VR videos. We spatially divide the videos into multiple tiles while encoding and packaging, use MPEG-DASH SRD to describe the spatial relationship of tiles in the 360-degree space, and prioritize the tiles in the Field of View (FoV). In order to describe such tiled representations, we extend MPEG-DASH SRD to the 3D space of 360 VR videos. We spatially partition the underlying 3D mesh, and construct an efficient 3D geometry mesh called hexaface sphere to optimally represent a tiled 360 VR video in the 3D space. Our initial evaluation results report up to 72% bandwidth savings on 360 VR video streaming with minor negative quality impacts compared to the baseline scenario when no adaptation is applied.

168 citations
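
The idea of prioritizing tiles in the Field of View can be sketched as below. This is only an illustration under simplifying assumptions: six tiles identified by the direction of their centres, a tile counted as "in view" when its centre lies within a half-angle of the viewing direction, and two bit-rate levels. None of these names or values come from the paper, and the sketch does not model MPEG-DASH SRD or the hexaface sphere mesh.

```python
import math


def direction(yaw_deg: float, pitch_deg: float) -> tuple:
    """Unit vector for a viewing direction given as yaw and pitch."""
    yaw, pitch = math.radians(yaw_deg), math.radians(pitch_deg)
    return (
        math.cos(pitch) * math.cos(yaw),
        math.cos(pitch) * math.sin(yaw),
        math.sin(pitch),
    )


def assign_tile_bitrates(tile_centers, view_dir, fov_half_angle_deg,
                         high_kbps, low_kbps):
    """Give tiles near the viewing direction the high bit-rate.

    A tile is treated as visible when the angle between its centre and the
    viewing direction is at most `fov_half_angle_deg` (a crude visibility
    test chosen for brevity).
    """
    cos_limit = math.cos(math.radians(fov_half_angle_deg))
    plan = {}
    for tile_id, center in tile_centers.items():
        dot = sum(c * v for c, v in zip(center, view_dir))
        plan[tile_id] = high_kbps if dot >= cos_limit else low_kbps
    return plan


# Hypothetical six-tile layout: four around the horizon, plus top and bottom.
TILES = {
    "front": direction(0, 0),
    "left": direction(90, 0),
    "back": direction(180, 0),
    "right": direction(-90, 0),
    "top": direction(0, 90),
    "bottom": direction(0, -90),
}

plan = assign_tile_bitrates(TILES, direction(0, 0), 60, 2000, 500)
print(plan["front"], sum(plan.values()))  # 2000 4500
```

In this toy setup the adapted plan requests 4500 kbps instead of the 12000 kbps needed to fetch every tile at the high bit-rate. The numbers are illustrative only and unrelated to the savings reported in the abstract.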


Journal ArticleDOI
TL;DR: A machine learning system is proposed that composes fashion outfits automatically by scoring outfit candidates based on their appearance and metadata; it achieves an AUC of 85% for the scoring component and an accuracy of 77% for a constrained composition task.
Abstract: Composing fashion outfits involves deep understanding of fashion standards while incorporating creativity for choosing multiple fashion items (e.g., Jewelry, Bag, Pants, Dress). In fashion websites, popular or high-quality fashion outfits are usually designed by fashion experts and followed by large audiences. In this paper, we propose a machine learning system to compose fashion outfits automatically. The core of the proposed automatic composition system is to score fashion outfit candidates based on the appearances and meta-data. We propose to leverage outfit popularity on fashion oriented websites to supervise the scoring component. The scoring component is a multi-modal multi-instance deep learning system that evaluates instance aesthetics and set compatibility simultaneously. In order to train and evaluate the proposed composition system, we have collected a large scale fashion outfit dataset with 195K outfits and 368K fashion items from Polyvore. Although the fashion outfit scoring and composition is rather challenging, we have achieved an AUC of 85% for the scoring component, and an accuracy of 77% for a constrained composition task.

130 citations


Posted Content
TL;DR: The transform coder is described, with particular attention to the psychoacoustic knowledge built into the Opus codec, which out-performs existing audio codecs that do not operate under real-time constraints.
Abstract: The IETF recently standardized the Opus codec as RFC6716. Opus targets a wide range of real-time Internet applications by combining a linear prediction coder with a transform coder. We describe the transform coder, with particular attention to the psychoacoustic knowledge built into the format. The result out-performs existing audio codecs that do not operate under real-time constraints.

71 citations


Posted Content
TL;DR: This study tries to cover different aspects related to VR content representation, streaming, and quality assessment that will help establish the basic knowledge of how to build a VR streaming system.
Abstract: The recent rise of interest in Virtual Reality (VR) came with the availability of commodity commercial VR products, such as the Head Mounted Displays (HMD) created by Oculus and other vendors. To accelerate the user adoption of VR headsets, content providers should focus on producing high quality immersive content for these devices. Similarly, multimedia streaming service providers should enable the means to stream 360 VR content on their platforms. In this study, we try to cover different aspects related to VR content representation, streaming, and quality assessment that will help establish the basic knowledge of how to build a VR streaming system.

63 citations


Journal ArticleDOI
TL;DR: In this paper, the authors presented an algorithm for encrypted HTTP adaptive video streaming title classification, where the adversary does not interact actively with the device, but he is able to eavesdrop on the network traffic of the device from the network side.
Abstract: Desktops and laptops can be maliciously exploited to violate privacy. There are two main types of attack scenarios: active and passive. In this paper, we consider the passive scenario where the adversary does not interact actively with the device, but he is able to eavesdrop on the network traffic of the device from the network side. Most of the Internet traffic is encrypted and thus passive attacks are challenging. Previous research has shown that information can be extracted from encrypted multimedia streams. This includes video title classification of non HTTP adaptive streams (non-HAS). This paper presents an algorithm for encrypted HTTP adaptive video streaming title classification. We show that an external attacker can identify the video title from video HTTP adaptive streams (HAS) sites such as YouTube. To the best of our knowledge, this is the first work that shows this. We provide a large data set of 10000 YouTube video streams of 100 popular video titles (each title downloaded 100 times) as examples for this task. The dataset was collected under real-world network conditions. We present several machine learning algorithms for the task and run a thorough set of experiments, which shows that our classification accuracy is more than 95%. We also show that our algorithms are able to classify video titles that are not in the training set as unknown, and some of the algorithms are also able to eliminate false predictions of video titles and instead report unknown. Finally, we evaluate our algorithms' robustness to delays and packet losses at test time and show that a solution that uses SVM is the most robust against these changes given enough training data. We provide the dataset and the crawler for future research.

56 citations


Journal ArticleDOI
TL;DR: In this paper, a generic hybrid deep learning framework for JPEG steganalysis is proposed that incorporates the domain knowledge behind rich steganalytic models. It involves two main stages, the first of which is hand-crafted, corresponding to the convolution phase and the quantization & truncation phase of the rich models.
Abstract: Adoption of deep learning in image steganalysis is still in its initial stage. In this paper we propose a generic hybrid deep-learning framework for JPEG steganalysis incorporating the domain knowledge behind rich steganalytic models. Our proposed framework involves two main stages. The first stage is hand-crafted, corresponding to the convolution phase and the quantization & truncation phase of the rich models. The second stage is a compound deep neural network containing multiple deep subnets in which the model parameters are learned in the training procedure. We provide experimental evidence and theoretical reflections to argue that the introduction of threshold quantizers, though it disables the gradient-descent-based learning of the bottom convolution phase, is indeed cost-effective. We have conducted extensive experiments on a large-scale dataset extracted from ImageNet. The primary dataset used in our experiments contains 500,000 cover images, while our largest dataset contains five million cover images. Our experiments show that the integration of quantization and truncation into deep-learning steganalyzers does boost the detection performance by a clear margin. Furthermore, we demonstrate that our framework is insensitive to JPEG blocking artifact alterations, and the learned model can be easily transferred to a different attacking target and even a different dataset. These properties are of critical importance in practical applications.

50 citations


Journal ArticleDOI
TL;DR: This technical report formally defines the QoE metrics which are introduced and discussed in the article "QoE Beyond the MOS: An In-Depth Look at QoE via Better Metrics and their Relation to MOS".
Abstract: This technical report formally defines the QoE metrics which are introduced and discussed in the article "QoE Beyond the MOS: An In-Depth Look at QoE via Better Metrics and their Relation to MOS" by Tobias Hoßfeld, Poul E. Heegaard, Martin Varela, Sebastian Möller, accepted for publication in the Springer journal "Quality and User Experience". Matlab scripts for computing the QoE metrics for given data sets are available on GitHub.

44 citations


Journal ArticleDOI
TL;DR: The dataset consists of 44 simple multi-instrument classical music pieces assembled from coordinated but separately recorded performances of individual tracks. For each piece, it provides the musical score in MIDI format, audio recordings of the individual tracks, the audio and video recording of the assembled mixture, and ground-truth annotation files including frame-level and note-level transcriptions.
Abstract: We introduce a dataset for facilitating audio-visual analysis of music performances. The dataset comprises 44 simple multi-instrument classical music pieces assembled from coordinated but separately recorded performances of individual tracks. For each piece, we provide the musical score in MIDI format, the audio recordings of the individual tracks, the audio and video recording of the assembled mixture, and ground-truth annotation files including frame-level and note-level transcriptions. We describe our methodology for the creation of the dataset, particularly highlighting our approaches for addressing the challenges involved in maintaining synchronization and expressiveness. We demonstrate the high quality of synchronization achieved with our proposed approach by comparing the dataset with existing widely-used music audio datasets. We anticipate that the dataset will be useful for the development and evaluation of existing music information retrieval (MIR) tasks, as well as for novel multi-modal tasks. We benchmark two existing MIR tasks (multi-pitch analysis and score-informed source separation) on the dataset and compare with other existing music audio datasets. Additionally, we consider two novel multi-modal MIR tasks (visually informed multi-pitch analysis and polyphonic vibrato analysis) enabled by the dataset and provide evaluation measures and baseline systems for future comparisons (from our recent work). Finally, we propose several emerging research directions that the dataset enables.

Posted Content
TL;DR: In this article, a CNN-based steganalyzer is designed for images obtained by applying steganography with a unique embedding key; it is able to deal with larger images and lower payloads.
Abstract: For the past few years, in the race between image steganography and steganalysis, deep learning has emerged as a very promising alternative to steganalyzer approaches based on rich image models combined with ensemble classifiers. The key knowledge of an image steganalyzer, which combines relevant image features and innovative classification procedures, can be deduced by a deep learning approach called Convolutional Neural Networks (CNNs). This kind of deep learning network is so well suited for classification tasks based on the detection of variations in 2D shapes that it is the state of the art in many image recognition problems. In this article, we design a CNN-based steganalyzer for images obtained by applying steganography with a unique embedding key. Our design is quite different from the previous study of Qian et al. and its successor, namely Pibre et al. The proposed architecture embeds fewer convolutions, with much larger filters in the final convolutional layer, and is more general: it is able to deal with larger images and lower payloads. For the "same embedding key" scenario, our proposal outperforms all other steganalyzers, in particular the existing CNN-based ones, and defeats many state-of-the-art image steganography schemes.

Posted Content
TL;DR: In this paper, a CNN architecture is proposed to generate a map that highlights semantically-salient regions so that they can be encoded at higher quality as compared to background regions by adding a complete set of features for every class and then taking a threshold over the sum of all feature activations.
Abstract: It has long been considered a significant problem to improve the visual quality of lossy image and video compression. Recent advances in computing power together with the availability of large training data sets have increased interest in the application of deep learning CNNs to address image recognition and image processing tasks. Here, we present a powerful CNN tailored to the specific task of semantic image understanding to achieve higher visual quality in lossy compression. A modest increase in complexity is incorporated into the encoder, which allows a standard, off-the-shelf JPEG decoder to be used. While JPEG encoding may be optimized for generic images, the process is ultimately unaware of the specific content of the image to be compressed. Our technique makes JPEG content-aware by designing and training a model to identify multiple semantic regions in a given image. Unlike object detection techniques, our model does not require labeling of object positions and is able to identify objects in a single pass. We present a new CNN architecture directed specifically to image compression, which generates a map that highlights semantically-salient regions so that they can be encoded at higher quality as compared to background regions. By adding a complete set of features for every class, and then taking a threshold over the sum of all feature activations, we generate a map that highlights semantically-salient regions so that they can be encoded at a better quality compared to background regions. Experiments are presented on the Kodak PhotoCD dataset and the MIT Saliency Benchmark dataset, in which our algorithm achieves higher visual quality for the same compressed size.

Posted Content
TL;DR: In this article, a new approach was proposed to recover subjective quality scores from noisy raw measurements, using maximum likelihood estimation, by jointly estimating the subjective quality of impaired videos, the bias and consistency of test subjects, and the ambiguity of video contents all together.
Abstract: Simple quality metrics such as PSNR are known to not correlate well with subjective quality when tested across a wide spectrum of video content or quality regimes. Recently, efforts have been made in designing objective quality metrics trained on subjective data (e.g. VMAF), demonstrating better correlation with video quality perceived by humans. Clearly, the accuracy of such a metric heavily depends on the quality of the subjective data that it is trained on. In this paper, we propose a new approach to recover subjective quality scores from noisy raw measurements, using maximum likelihood estimation, by jointly estimating the subjective quality of impaired videos, the bias and consistency of test subjects, and the ambiguity of video contents all together. We also derive closed-form expressions for the confidence interval of each estimate. Compared to previous methods which partially exploit the subjective information, our approach is able to exploit the information in full, yielding tighter confidence intervals and better handling of outliers without the need for z-scoring or subject rejection. It also handles missing data more gracefully. Finally, as side information, it provides interesting insights on the test subjects and video contents.

Posted Content
TL;DR: The authors proposed an approach that leverages contextual cues derived from the environment that the game is being played in to provide information about the excitement levels in the game, which can be ranked and selected to automatically produce high quality basketball highlights.
Abstract: The massive growth of sports videos has resulted in a need for automatic generation of sports highlights that are comparable in quality to the hand-edited highlights produced by broadcasters such as ESPN. Unlike previous works that mostly use audio-visual cues derived from the video, we propose an approach that additionally leverages contextual cues derived from the environment that the game is being played in. The contextual cues provide information about the excitement levels in the game, which can be ranked and selected to automatically produce high-quality basketball highlights. We introduce a new dataset of 25 NCAA games along with their play-by-play stats and the ground-truth excitement data for each basket. We explore the informativeness of five different cues derived from the video and from the environment through user studies. Our experiments show that for our study participants, the highlights produced by our system are comparable to the ones produced by ESPN for the same games.

Posted Content
TL;DR: A novel strategy is presented to extract temporal trajectory-like features from sensor data, and the Fisher Kernel framework is applied to fuse video and temporally enhanced sensor features, showing that sensor data can enhance information-rich video data.
Abstract: With the increasing availability of wearable devices, research on egocentric activity recognition has received much attention recently. In this paper, we build a Multimodal Egocentric Activity dataset which includes egocentric videos and sensor data of 20 fine-grained and diverse activity categories. We present a novel strategy to extract temporal trajectory-like features from sensor data. We propose to apply the Fisher Kernel framework to fuse video and temporally enhanced sensor features. Experimental results show that with careful design of the feature extraction and fusion algorithm, sensor data can enhance information-rich video data. We make the Multimodal Egocentric Activity dataset publicly available to facilitate future research.

Posted Content
TL;DR: This paper provides a comprehensive evaluation of ten different adaptation logics/algorithms proposed in past years, and describes an evaluation methodology that can be used to evaluate any other/new adaptation logic and to compare it directly with the results reported here.
Abstract: Multimedia content delivery over the Internet predominantly uses the Hypertext Transfer Protocol (HTTP) as its primary protocol, and multiple proprietary solutions exist. The MPEG standard Dynamic Adaptive Streaming over HTTP (DASH) provides an interoperable solution, and in recent years various adaptation logics/algorithms have been proposed. However, to the best of our knowledge, there is no comprehensive evaluation of the various logics/algorithms. Therefore, this paper provides a comprehensive evaluation of ten different adaptation logics/algorithms, which have been proposed in the past years. The evaluation is done both objectively and subjectively. The former uses a predefined bandwidth trajectory within a controlled environment and the latter is done in a real-world environment adopting crowdsourcing. The results shall provide insights about which strategy can be adopted in actual deployment scenarios. Additionally, the evaluation methodology described in this paper can be used to evaluate any other/new adaptation logic and to compare it directly with the results reported here.

Proceedings ArticleDOI
TL;DR: In this paper, the authors examine the Periscope service and crawl the service in order to understand its usage patterns and study the typical quality of experience indicators, such as playback smoothness and latency, video quality, and energy consumption of the Android application.
Abstract: Live multimedia streaming from mobile devices is rapidly gaining popularity but little is known about the QoE they provide. In this paper, we examine the Periscope service. We first crawl the service in order to understand its usage patterns. Then, we study the protocols used, the typical quality of experience indicators, such as playback smoothness and latency, video quality, and the energy consumption of the Android application.

Posted Content
TL;DR: This paper investigates and evaluates the usage of advanced transport options for the dynamic adaptive streaming over HTTP and utilizes a common test setup to evaluate HTTP/2.0 and Google's Quick UDP Internet Connections (QUIC) protocol in the context of DASH-based services.
Abstract: Multimedia streaming over HTTP is no longer a niche research topic as it has entered our daily lives. The common assumption is that it is deployed on top of the existing infrastructure utilizing application (HTTP) and transport (TCP) layer protocols as is. Interestingly, standards like MPEG's Dynamic Adaptive Streaming over HTTP (DASH) do not mandate the usage of any specific transport protocol, allowing for sufficient deployment flexibility, which is further supported by emerging developments within both protocol layers. This paper investigates and evaluates the usage of advanced transport options for dynamic adaptive streaming over HTTP. We utilize a common test setup to evaluate HTTP/2.0 and Google's Quick UDP Internet Connections (QUIC) protocol in the context of DASH-based services.

Posted Content
TL;DR: In this article, the authors presented an automatic thumbnail selection system that exploits two important characteristics commonly associated with meaningful and attractive thumbnails: high relevance to video content and superior visual aesthetic quality.
Abstract: Thumbnails play such an important role in online videos. As the most representative snapshot, they capture the essence of a video and provide the first impression to the viewers; ultimately, a great thumbnail makes a video more attractive to click and watch. We present an automatic thumbnail selection system that exploits two important characteristics commonly associated with meaningful and attractive thumbnails: high relevance to video content and superior visual aesthetic quality. Our system selects attractive thumbnails by analyzing various visual quality and aesthetic metrics of video frames, and performs a clustering analysis to determine the relevance to video content, thus making the resulting thumbnails more representative of the video. On the task of predicting thumbnails chosen by professional video editors, we demonstrate the effectiveness of our system against six baseline methods, using a real-world dataset of 1,118 videos collected from Yahoo Screen. In addition, we study what makes a frame a good thumbnail by analyzing the statistical relationship between thumbnail frames and non-thumbnail frames in terms of various image quality features. Our study suggests that the selection of a good thumbnail is highly correlated with objective visual quality metrics, such as the frame texture and sharpness, implying the possibility of building an automatic thumbnail selection system based on visual aesthetics.

Posted Content
TL;DR: This paper employs deep networks to learn distinct fake image related features through an AdaBoost-like transfer learning algorithm and obtains superior results over transfer learning methods based on the general ImageNet set.
Abstract: Numerous fake images spread on social media today and can severely jeopardize the credibility of online content to the public. In this paper, we employ deep networks to learn distinct fake image related features. In contrast to authentic images, fake images tend to be eye-catching and visually striking. Compared with traditional visual recognition tasks, it is extremely challenging to understand these psychologically triggered visual patterns in fake images. Traditional general image classification datasets, such as the ImageNet set, are designed for feature learning at the object level but are not suitable for learning the hyper-features that would be required by image credibility analysis. In order to overcome the scarcity of training samples of fake images, we first construct a large-scale auxiliary dataset indirectly related to this task. This auxiliary dataset contains 0.6 million weakly-labeled fake and real images collected automatically from social media. Through an AdaBoost-like transfer learning algorithm, we train a CNN model with a few instances in the target training set and 0.6 million images in the collected auxiliary set. This learning algorithm is able to leverage knowledge from the auxiliary set and gradually transfer it to the target task. Experiments on a real-world testing set show that our proposed domain transferred CNN model outperforms several competing baselines. It obtains superior results over transfer learning methods based on the general ImageNet set. Moreover, case studies show that our proposed method reveals some interesting patterns for distinguishing fake and authentic images.

Posted Content
TL;DR: This paper attacks the challenging problem of violence detection in videos by adding and exploiting subclasses visually related to violence, enriching the MediaEval 2015 violence dataset by manually labeling violence videos with respect to the subclasses.
Abstract: This paper attacks the challenging problem of violence detection in videos. Different from existing works focusing on combining multi-modal features, we go one step further by adding and exploiting subclasses visually related to violence. We enrich the MediaEval 2015 violence dataset by manually labeling violence videos with respect to the subclasses. Such fine-grained annotations not only help understand what has impeded previous efforts on learning to fuse the multi-modal features, but also enhance the generalization ability of the learned fusion to novel test data. The new subclass based solution, with AP of 0.303 and P100 of 0.55 on the MediaEval 2015 test set, outperforms several state-of-the-art alternatives. Notice that our solution does not require fine-grained annotations on the test set, so it can be directly applied on novel and fully unlabeled videos. Interestingly, our study shows that motion related features, though an essential part of previous systems, are dispensable.

Posted Content
TL;DR: A new method is presented and it is shown for the first time that video quality representation classification for (YouTube) encrypted HTTP adaptive streaming is possible and can independently classify, in real time, every video segment into one of the quality representation layers with 97.18% average accuracy.
Abstract: The increasing popularity of HTTP adaptive video streaming services has dramatically increased bandwidth requirements on operator networks, which attempt to shape their traffic through Deep Packet Inspection (DPI). However, Google and certain content providers have started to encrypt their video services. As a result, operators often encounter difficulties in shaping their encrypted video traffic via DPI. This highlights the need for new traffic classification methods for encrypted HTTP adaptive video streaming to enable smart traffic shaping. These new methods will have to effectively estimate the quality representation layer and playout buffer. We present a new method and show for the first time that video quality representation classification for (YouTube) encrypted HTTP adaptive streaming is possible. We analyze the performance of this classification method with Safari over HTTPS. Based on a large number of offline and online traffic classification experiments, we demonstrate that it can independently classify, in real time, every video segment into one of the quality representation layers with 97.18% average accuracy.

Posted Content
TL;DR: This work proposes a framework based on Bloom filters, which can be used to index long video segments, enabling efficient image-to-video comparisons and investigates several retrieval architectures, by considering different types of aggregation and different functions to encode visual information.
Abstract: We consider the problem of using image queries to retrieve videos from a database. Our focus is on large-scale applications, where it is infeasible to index each database video frame independently. Our main contribution is a framework based on Bloom filters, which can be used to index long video segments, enabling efficient image-to-video comparisons. Using this framework, we investigate several retrieval architectures, by considering different types of aggregation and different functions to encode visual information -- these play a crucial role in achieving high performance. Extensive experiments show that the proposed technique improves mean average precision by 24% on a public dataset, while being 4X faster, compared to the previous state-of-the-art.
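
As a rough illustration of the indexing idea, the sketch below shows a generic Bloom filter in Python. It is not the framework from the paper: the class name `BloomFilter`, the salted SHA-256 hashing, the filter size, and the string tokens standing in for encoded visual information are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter; a stand-in, not the paper's indexing scheme."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # May report a false positive, never a false negative.
        return all(self.bits[pos] for pos in self._positions(item))


# Hypothetical usage: one filter summarizes every frame of a video segment,
# so a query image is compared against the segment rather than each frame.
segment_index = BloomFilter()
for visual_word in ["w17", "w42", "w99"]:  # made-up quantized descriptors
    segment_index.add(visual_word)

query_hit = "w42" in segment_index
```

A single fixed-size filter per segment is what keeps the index small when segments are long; the price is a tunable false-positive rate.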

Posted Content
TL;DR: A submodular approach to diversifying Amazon Music recommendations incorporates item relevance score within its optimization function and produces a relevant and uniformly diverse set.
Abstract: We compare submodular and Jaccard methods to diversify Amazon Music recommendations. Submodularity significantly improves recommendation quality and user engagement. Unlike the Jaccard method, our submodular approach incorporates item relevance score within its optimization function, and produces a relevant and uniformly diverse set.
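
The abstract does not give the objective function, so the following Python sketch only illustrates the general idea of folding relevance into a submodular objective. The function name, the tag-coverage diversity term, the `diversity_weight` parameter, and the toy catalog are all assumptions, not the authors' formulation.

```python
def greedy_diverse_selection(items, k, diversity_weight=1.0):
    """Greedily pick k items maximizing relevance plus tag coverage.

    items maps an item id to (relevance, set of tags). The objective
    sum(relevance) + diversity_weight * |covered tags| is monotone
    submodular, which is why the greedy rule is a common choice.
    """
    selected = []
    covered = set()
    remaining = dict(items)
    while remaining and len(selected) < k:

        def marginal_gain(item_id):
            relevance, tags = remaining[item_id]
            # Only tags not yet covered add diversity value.
            return relevance + diversity_weight * len(tags - covered)

        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=marginal_gain)
        selected.append(best)
        covered |= remaining.pop(best)[1]
    return selected


# Made-up catalog: id -> (relevance score, tags).
catalog = {
    "a": (0.9, {"rock"}),
    "b": (0.8, {"rock"}),
    "c": (0.5, {"jazz"}),
    "d": (0.4, {"jazz", "blues"}),
}
picks = greedy_diverse_selection(catalog, k=2)
relevance_only = greedy_diverse_selection(catalog, k=2, diversity_weight=0.0)
```

With the diversity term switched off, the selection collapses to the two most relevant items, both tagged "rock"; with it on, the list covers three tags.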

Proceedings ArticleDOI
TL;DR: This paper applies energy conservation principles to the Daala video codec using gain-shape vector quantization to encode a vector of AC coefficients as a length and direction and derives an encoding of the vector-quantized codewords that takes advantage of their non-uniform distribution.
Abstract: This paper applies energy conservation principles to the Daala video codec using gain-shape vector quantization to encode a vector of AC coefficients as a length (gain) and direction (shape). The technique originates from the CELT mode of the Opus audio codec, where it is used to conserve the spectral envelope of an audio signal. Conserving energy in video has the potential to preserve textures rather than low-passing them. Explicitly quantizing a gain allows a simple contrast masking model with no signaling cost. Vector quantizing the shape keeps the number of degrees of freedom the same as scalar quantization, avoiding redundancy in the representation. We demonstrate how to predict the vector by transforming the space it is encoded in, rather than subtracting off the predictor, which would make energy conservation impossible. We also derive an encoding of the vector-quantized codewords that takes advantage of their non-uniform distribution. We show that the resulting technique outperforms scalar quantization by an average of 0.90 dB on still images, equivalent to a 24.8% reduction in bitrate at equal quality, while for videos, the improvement averages 0.83 dB, equivalent to a 13.7% reduction in bitrate.
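
To make the gain-shape idea concrete, here is a minimal Python sketch that splits a coefficient vector into a length and a direction, quantizes the direction to an integer codeword with a fixed pulse budget, and reconstructs a vector whose energy equals the coded gain. The pulse-allocation heuristic, the function names, and the pulse budget are assumptions for illustration; they are not the search or the prediction scheme used in Daala.

```python
import math


def gain_shape_split(x):
    """Split a vector into its L2 norm (gain) and unit-norm direction (shape)."""
    gain = math.sqrt(sum(v * v for v in x))
    if gain == 0.0:
        return 0.0, [0.0] * len(x)
    return gain, [v / gain for v in x]


def quantize_shape(shape, pulses):
    """Map a shape to an integer vector whose absolute values sum to `pulses`.

    Simple heuristic: scale, truncate, then hand the leftover pulses to
    the coefficients with the largest remaining error.
    """
    l1 = sum(abs(v) for v in shape)
    if l1 == 0.0:
        return [pulses] + [0] * (len(shape) - 1)
    scaled = [abs(v) * pulses / l1 for v in shape]
    counts = [int(s) for s in scaled]  # truncation stays within the budget
    leftover = pulses - sum(counts)
    by_error = sorted(
        range(len(shape)),
        key=lambda i: scaled[i] - counts[i],
        reverse=True,
    )
    for i in by_error[:leftover]:
        counts[i] += 1
    return [c if v >= 0 else -c for c, v in zip(counts, shape)]


def reconstruct(quantized_gain, codeword):
    """Rescale the codeword so the decoded vector has exactly the coded gain."""
    norm = math.sqrt(sum(c * c for c in codeword))
    return [quantized_gain * c / norm for c in codeword]


coefficients = [3.0, -4.0, 0.0, 0.0]  # made-up AC coefficients
gain, shape = gain_shape_split(coefficients)
codeword = quantize_shape(shape, pulses=7)
decoded = reconstruct(gain, codeword)
```

Because the decoder renormalizes the codeword, quantization error in the shape changes the direction of the decoded vector but not its energy.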

Posted Content
TL;DR: In this article, the authors have developed reduced reference parametric models for estimating perceived quality in audiovisual multimedia services, using Random Forest and Neural Network based machine learning methods to estimate Mean Opinion Scores (MOS) values.
Abstract: We have developed reduced reference parametric models for estimating perceived quality in audiovisual multimedia services. We have created 144 unique configurations for audiovisual content including various application and network parameters such as bitrates and distortions in terms of bandwidth, packet loss rate and jitter. To generate the data needed for model training and validation we have tasked 24 subjects, in a controlled environment, to rate the overall audiovisual quality on the absolute category rating (ACR) 5-level quality scale. We have developed models using Random Forest and Neural Network based machine learning methods in order to estimate Mean Opinion Scores (MOS) values. We have used information retrieved from the packet headers and side information provided as network parameters for model training. Random Forest based models have performed better in terms of Root Mean Square Error (RMSE) and Pearson correlation coefficient. The side information proved to be very effective in developing the model. We have found that, while the model performance might be improved by replacing the side information with more accurate bit stream level measurements, they are performing well in estimating perceived quality in audiovisual multimedia services.
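
For reference, the two evaluation measures named in the abstract can be computed as in the Python sketch below. The MOS values are invented for the example and do not come from the study.

```python
import math


def rmse(predicted, observed):
    """Root mean square error between predicted and observed scores."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(predicted, observed):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(observed)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_p * var_o)


# Made-up scores on the 5-level ACR scale.
observed_mos = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted_mos = [1.5, 2.5, 3.5, 4.5, 5.5]

error = rmse(predicted_mos, observed_mos)
correlation = pearson(predicted_mos, observed_mos)
```

The toy data shows why both measures are reported: a constant offset of half a point leaves the correlation perfect while the RMSE still registers the bias.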

Journal ArticleDOI
TL;DR: New 8-point DCT approximations with very low arithmetic complexity are presented and the best proposed transform according to the introduced metric presents a reduction in power consumption of 21–25%.
Abstract: Due to its remarkable energy compaction properties, the discrete cosine transform (DCT) is employed in a multitude of compression standards, such as JPEG and H.265/HEVC. Several low-complexity integer approximations for the DCT have been proposed for both 1-D and 2-D signal analysis. The increasing demand for low-complexity, energy efficient methods requires algorithms with even lower computational costs. In this paper, new 8-point DCT approximations with very low arithmetic complexity are presented. The new transforms are proposed based on pruning state-of-the-art DCT approximations. The proposed algorithms were assessed in terms of arithmetic complexity, energy retention capability, and image compression performance. In addition, a metric combining performance and computational complexity measures was proposed. Results showed good performance and extremely low computational complexity. Introduced algorithms were mapped into systolic-array digital architectures and physically realized as digital prototype circuits using FPGA technology and mapped to 45nm CMOS technology. All hardware-related metrics showed low resource consumption of the proposed pruned approximate transforms. The best proposed transform according to the introduced metric presents a reduction in power consumption of 21–25%.

Posted Content
TL;DR: A novel video captioning framework, termed BiLSTM, that deeply captures bidirectional global temporal structure in video, comprehensively preserving sequential and visual information and adaptively learning dense visual features and sparse semantic representations for videos and sentences.
Abstract: Video captioning has been attracting broad research attention in the multimedia community. However, most existing approaches either ignore temporal information among video frames or just employ local contextual temporal knowledge. In this work, we propose a novel video captioning framework, termed Bidirectional Long-Short Term Memory (BiLSTM), which deeply captures bidirectional global temporal structure in video. Specifically, we first devise a joint visual modelling approach to encode video data by combining a forward LSTM pass, a backward LSTM pass, together with visual features from Convolutional Neural Networks (CNNs). Then, we inject the derived video representation into the subsequent language model for initialization. The benefits are twofold: 1) comprehensively preserving sequential and visual information; and 2) adaptively learning dense visual features and sparse semantic representations for videos and sentences, respectively. We verify the effectiveness of our proposed video captioning framework on a commonly-used benchmark, i.e., the Microsoft Video Description (MSVD) corpus, and the experimental results demonstrate the superiority of the proposed approach as compared to several state-of-the-art methods.

Posted Content
TL;DR: The proposed audio codec is based on the modified discrete cosine transform with very short frames, uses gain-shape quantization to preserve the spectral envelope, and outperforms the ULD codec operating at the same rate.
Abstract: We propose an audio codec that addresses the low-delay requirements of some applications such as network music performance. The codec is based on the modified discrete cosine transform (MDCT) with very short frames and uses gain-shape quantization to preserve the spectral envelope. The short frame sizes required for low delay typically hinder the performance of transform codecs. However, at 96 kbit/s and with only 4 ms algorithmic delay, the proposed codec outperforms the ULD codec operating at the same rate. The total complexity of the codec is small, at only 17 WMOPS for real-time operation at 48 kHz.