Showing papers in "Multimedia Systems in 2018"
TL;DR: A content-based recommendation algorithm based on a convolutional neural network (CNN) that predicts latent factors from the text information of multimedia resources, with a latent factor model regularized by the L1-norm.
Abstract: Automatic multimedia learning resource recommendation has become an increasingly relevant problem: it allows students to discover new learning resources that match their tastes, and enables the e-learning system to target learning resources to the right students. In this paper, we propose a content-based recommendation algorithm based on a convolutional neural network (CNN). The CNN is used to predict the latent factors from the text information of the multimedia resources. To train the CNN, its input and output representations must first be determined. For the input, a language model is used. For the output, we propose a latent factor model regularized by the L1-norm. Furthermore, the split Bregman iteration method is introduced to solve the model. The major novelty of the proposed recommendation algorithm is that the text information is used directly to make content-based recommendations without tagging. Experimental results on public databases, assessed quantitatively, show significant improvements over conventional methods. In addition, the split Bregman iteration method introduced to solve the model greatly improves training efficiency.
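The key ingredient of a split Bregman (ADMM-style) iteration for an L1-regularized model like the one above is the soft-thresholding (shrinkage) step. The following is a minimal sketch on a generic L1-regularized least-squares problem; the parameter values (lam, rho, iteration count) and the toy problem are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def soft_threshold(x, lam):
    """Soft-thresholding (shrinkage) operator used in split Bregman / ISTA
    to handle the L1 penalty: prox of lam*|.| applied elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def l1_least_squares(A, b, lam=0.1, rho=1.0, iters=200):
    """Minimize 0.5*||A u - b||^2 + lam*||u||_1 via a split Bregman style
    iteration: the u-update is a ridge solve, the d-update is shrinkage."""
    n = A.shape[1]
    u = np.zeros(n); d = np.zeros(n); bb = np.zeros(n)
    AtA = A.T @ A; Atb = A.T @ b
    M = AtA + rho * np.eye(n)
    for _ in range(iters):
        u = np.linalg.solve(M, Atb + rho * (d - bb))  # smooth subproblem
        d = soft_threshold(u + bb, lam / rho)          # L1 subproblem
        bb = bb + u - d                                # Bregman update
    return d
```

The shrinkage step is what drives small latent-factor entries exactly to zero, which is the effect of the L1 regularizer.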
TL;DR: This paper presents a comprehensive and scrutinizing bibliography addressing the published literature in the field of passive-blind video content authentication, with primary focus on forgery/tamper detection, video re-capture and phylogeny detection, and video anti- Forensics and counter anti-forensics.
Abstract: In this digital day and age, we are becoming increasingly dependent on multimedia content, especially digital images and videos, to provide a reliable proof of occurrence of events. However, the availability of several sophisticated yet easy-to-use content editing software has led to great concern regarding the trustworthiness of such content. Consequently, over the past few years, visual media forensics has emerged as an indispensable research field, which basically deals with development of tools and techniques that help determine whether or not the digital content under consideration is authentic, i.e., an actual, unaltered representation of reality. Over the last two decades, this research field has demonstrated tremendous growth and innovation. This paper presents a comprehensive and scrutinizing bibliography addressing the published literature in the field of passive-blind video content authentication, with primary focus on forgery/tamper detection, video re-capture and phylogeny detection, and video anti-forensics and counter anti-forensics. Moreover, the paper intimately analyzes the research gaps found in the literature, provides worthy insight into the areas, where the contemporary research is lacking, and suggests certain courses of action that could assist developers and future researchers explore new avenues in the domain of video forensics. Our objective is to provide an overview suitable for both the researchers and practitioners already working in the field of digital video forensics, and for those researchers and general enthusiasts who are new to this field and are not yet completely equipped to assimilate the detailed and complicated technical aspects of video forensics.
TL;DR: This paper offers new insights on music emotion recognition methods and provides a comprehensive review of them from three aspects, according to the data they use during the modeling phase: music features only, ground-truth data only, and their combination.
Abstract: The ability of music to induce or convey emotions ensures the importance of its role in human life. Consequently, research on methods for identifying the high-level emotion states of a music segment from its low-level features has attracted attention. This paper offers new insights on music emotion recognition methods and provides a comprehensive review of them from three aspects, according to the data they use during the modeling phase: music features only, ground-truth data only, and their combination. Then, focusing on the relatively popular methods in which the two types of data, music features and ground-truth data, are combined, we further subdivide the methods in the literature according to label- and numerical-type ground-truth data, and analyze the development of music emotion recognition along the cues of modeling methods and time sequence. Three important current research directions are then summarized. Although much has been achieved in the area of music emotion recognition, many issues remain. We review these issues and put forward some suggestions for future work.
TL;DR: An evidence-based and personalized model for music emotion recognition, based on the IADS, a set of acoustic emotional stimuli for experimental investigations of emotion and attention, is proposed; experimental results suggest the proposed approach is effective.
Abstract: Emotion recognition of music objects is a promising and important research issue in the field of music information retrieval. Usually, music emotion recognition can be considered a training/classification problem. However, even given a benchmark (training data with ground truth) and using effective classification algorithms, music emotion recognition remains a challenging problem. Most previous relevant work focuses only on acoustic music content without considering individual differences (i.e., personalization issues). In addition, assessment of emotions is usually self-reported (e.g., emotion tags), which might introduce inaccuracy and inconsistency. Electroencephalography (EEG) is a non-invasive brain-machine interface which allows external machines to sense neurophysiological signals from the brain without surgery. Such unintrusive EEG signals, captured from the central nervous system, have been utilized for exploring emotions. This paper proposes an evidence-based and personalized model for music emotion recognition. In the training phase for model construction and personalized adaptation, based on the IADS (the International Affective Digitized Sound system, a set of acoustic emotional stimuli for experimental investigations of emotion and attention), we construct two predictive and generic models ANN1 ("EEG recordings of standardized group vs. emotions") and ANN2 ("music audio content vs. emotion"). Both models are trained by an artificial neural network. We then collect a subject's EEG recordings when listening to the selected IADS samples, and apply ANN1 to determine the subject's emotion vector. With the generic model and the corresponding individual differences, we construct the personalized model H by projective transformation.
In the testing phase, given a music object, the processing steps are: (1) extract features from the music audio content, (2) apply ANN2 to calculate the vector in the arousal-valence emotion space, and (3) apply the transformation matrix H to determine the personalized emotion vector. Moreover, for a music object of moderate length, we apply a sliding window to obtain a sequence of personalized emotion vectors; the predicted vectors are fitted and organized as an emotion trail that reveals the dynamics in the affective content of the music object. Experimental results suggest the proposed approach is effective.
TL;DR: Experimental results demonstrate the main feature of the proposed CLM-based HEVC SE: it reduces video encoding time while keeping the visual distortion of the encrypted video stream close to that of the Glenn HEVC SE, which uses the Advanced Encryption Standard (AES).
Abstract: At present, digital video encryption technology has become a research topic of interest as a result of the very rapid evolution of real-time video applications over the Internet. This paper presents a new method for encrypting the selected sensitive data of the latest video coding standard, High-Efficiency Video Coding (HEVC). The HEVC standard was finalized in 2013 by the Joint Collaborative Team on Video Coding (JCT-VC), formed by the ISO/IEC Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts Group (VCEG). The proposed selective encryption (SE) technique for HEVC video uses a low-complexity chaotic logistic map (CLM) to encrypt the sign bits of the Motion Vector Difference (MVD) and the Discrete Cosine Transform (DCT) coefficients in the entropy stage of the video encoding process. The contribution of the proposed CLM-based HEVC SE is to encrypt the sensitive video bits with low complexity overhead and fast encoding time, while keeping the HEVC bitrate constant and the stream format compliant. This paper also introduces a comparative study between the proposed CLM-based HEVC SE and the Glenn HEVC SE, which uses the Advanced Encryption Standard (AES). Experimental results demonstrate the main feature of the proposed CLM-based HEVC SE: it reduces video encoding time while keeping the visual distortion of the encrypted video stream close to that of the Glenn HEVC SE. This gain is due to the low complexity of the CLM-based encryption employed in the proposed scheme instead of the AES used in the Glenn HEVC SE. A series of security experiments is performed on the proposed CLM-based HEVC SE, covering the main security performance metrics, including encryption quality, key space, and statistical and sensitivity tests. The test results confirm the superiority of the proposed CLM-based HEVC SE for digital video streams.
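The chaotic logistic map keystream idea can be sketched as follows: the map is iterated to produce a pseudo-random bit stream that is XORed with the sign bits. The seed x0, control parameter r, and the thresholding rule for extracting bits are illustrative assumptions, since the paper's exact parameters are not given here:

```python
def logistic_keystream(x0, r, n):
    """Generate n pseudo-random bits from the chaotic logistic map
    x_{k+1} = r * x_k * (1 - x_k); a bit is 1 when x_k >= 0.5.
    (Illustrative; the paper's bit-extraction rule may differ.)"""
    bits, x = [], x0
    for _ in range(n):
        x = r * x * (1.0 - x)
        bits.append(1 if x >= 0.5 else 0)
    return bits

def encrypt_sign_bits(sign_bits, x0=0.3456, r=3.99):
    """XOR the MVD/DCT sign bits with the chaotic keystream.
    Decryption is identical, since XOR is an involution."""
    ks = logistic_keystream(x0, r, len(sign_bits))
    return [s ^ k for s, k in zip(sign_bits, ks)]
```

Because only sign bits are touched, the bitstream length, bitrate and format compliance are preserved, which is the point of selective encryption.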
TL;DR: A new frame duplication detection method based on the Bag-of-Words (BoW) model, first used in textual analysis and later in image and video retrieval, which shows better detection performance and reduced run time compared to similar methods reported in the literature.
Abstract: Duplicating a sequence of frames in a video to cover up or replicate a scene is a form of video forgery. There are methods to authenticate video files, but embedding authentication information into videos requires extra hardware or software. It is possible to detect frame duplication forgery by carefully inspecting the content to discover high correlation among groups of frames. A new frame duplication detection method based on the Bag-of-Words (BoW) model is proposed in this paper. BoW is a model first used in textual analysis and later adopted by researchers for image and video retrieval. We use BoW to create visual words and build a dictionary from Scale-Invariant Feature Transform (SIFT) keypoints of the frames in a video. Frame features, i.e., visual word representations at keypoints, are used to detect sequences of duplicated parts in the video. The method computes content-dependent thresholds to improve both robustness and performance. The proposed method is tested on 31 test videos selected from the Surrey University Library for Forensic Analysis (SULFA) and from various movies. Experimental results show better detection performance and reduced run time compared to similar methods reported in the literature.
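The BoW stage can be sketched as follows: per-frame local descriptors (SIFT keypoints in the paper) are quantized to a visual vocabulary, and frame pairs with nearly identical word histograms are flagged as duplication candidates. The vocabulary, similarity measure, minimum temporal gap and adaptive threshold below are simplified assumptions, not the paper's exact scheme:

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Quantize local descriptors of one frame to the nearest visual word
    and return a normalized word histogram."""
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / (hist.sum() + 1e-12)

def find_duplicate_frames(frame_descs, vocabulary, min_gap=5):
    """Flag frame pairs whose BoW histograms are nearly identical.
    The threshold adapts to the content: mean pairwise similarity plus a
    margin (a simplification of the paper's content-dependent thresholds)."""
    hists = [bow_histogram(d, vocabulary) for d in frame_descs]
    sims = {}
    for i in range(len(hists)):
        for j in range(i + min_gap, len(hists)):
            # L1-based histogram similarity in [0, 1]
            sims[(i, j)] = 1.0 - 0.5 * np.abs(hists[i] - hists[j]).sum()
    if not sims:
        return []
    vals = list(sims.values())
    thr = np.mean(vals) + 0.5 * np.std(vals)
    return [p for p, s in sims.items() if s > thr and s > 0.95]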
TL;DR: A high-payload, perceptually transparent and robust audio watermarking solution obtained by optimizing the embedding with a genetic algorithm is presented; results show that the proposed algorithm achieves a high payload with good robustness under perceptual constraints.
Abstract: In the field of audio watermarking, reliably embedding a large number of watermark bits per second in an audio signal, without affecting the audible quality of the host audio and with good robustness against signal processing attacks, remains one of the most challenging issues. In this paper, a high-payload, perceptually transparent and robust audio watermarking solution is presented, obtained by optimizing the embedding with a genetic algorithm. The genetic algorithm is used to find the optimal number of audio samples required for hiding each watermark bit. The embedding is done using the imperceptibility properties of LU (lower-upper) factorization in the wavelet domain. This paper addresses robustness within perceptual constraints at a high payload rate through both mathematical analysis and experimental testing, representing the behavior of various attacks using attack characterization. Experimental results show that the proposed audio watermarking algorithm can achieve a capacity of 1280 bps at an average signal-to-noise ratio (SNR) of 31.02 dB with good robustness to various signal processing attacks such as noise addition, filtering, and compression. In addition, the proposed watermarking algorithm is blind, as it does not require the original signal or watermark during extraction. Comparison with existing techniques also shows that the proposed algorithm achieves a high payload with good robustness under perceptual constraints.
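A toy version of the genetic search for the optimal number of samples per watermark bit might look like the following. The fitness function, population size, and operators are illustrative assumptions; the paper's fitness balances payload against imperceptibility and robustness:

```python
import random

def genetic_search(fitness, low, high, pop_size=20, gens=40, pmut=0.3, seed=1):
    """Tiny genetic algorithm over a single integer gene (here: the number
    of audio samples used to hide one watermark bit). Tournament selection,
    averaging crossover, +/-1 mutation; the caller supplies the fitness."""
    rng = random.Random(seed)
    pop = [rng.randint(low, high) for _ in range(pop_size)]
    for _ in range(gens):
        nxt = []
        for _ in range(pop_size):
            a, b = rng.sample(pop, 2)                # tournament of two
            p1 = a if fitness(a) >= fitness(b) else b
            c, d = rng.sample(pop, 2)
            p2 = c if fitness(c) >= fitness(d) else d
            child = (p1 + p2) // 2                   # averaging crossover
            if rng.random() < pmut:                  # clamped +/-1 mutation
                child = min(high, max(low, child + rng.choice([-1, 1])))
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)
```

With a hypothetical fitness peaking at 32 samples per bit, the search converges close to that optimum.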
TL;DR: Rule-based event modeling and multi-modal fusion are shown to be promising approaches for event recognition; the decision fusion results are promising, and the proposed algorithm is open to the fusion of new sources for further improvement.
Abstract: In this paper, we propose a multi-modal event recognition framework based on the integration of feature fusion, deep learning, scene classification and decision fusion. Frames, shots, and scenes are identified through the video decomposition process. Events are modeled utilizing features of, and relations between, the physical video parts. Event modeling is achieved through visual concept learning, scene segmentation and association rule mining. Visual concept learning is employed to bridge the semantic gap between the visual content and the textual descriptors of the events. Association rules are discovered by a specialized association rule mining algorithm whose strategy integrates temporality into the rule discovery process. In addition to frames, shots and scenes, the concept of a scene segment is proposed to define and extract elements of association rules. Various feature sources such as audio, motion, keypoint descriptors, temporal occurrence characteristics and fully connected layer outputs of a CNN model are combined in the feature fusion. The proposed decision fusion approach employs logistic regression to formulate the relation between the dependent variable (event type) and independent variables (classifiers' outputs) in terms of decision weights. Multi-modal fusion-based scene classifiers are employed in the event recognition. Rule-based event modeling and multi-modal fusion capability are shown to be promising approaches for event recognition. The decision fusion results are promising, and the proposed algorithm is open to the fusion of new sources and to new event type integrations for further improvement. The accuracy of the proposed methodology is evaluated on the CCV and Hollywood2 datasets for event recognition, and results are compared with benchmark implementations in the literature.
TL;DR: A detailed comparative evaluation of combining these four congestion control variants with two traffic-shaping methods, the Hierarchical Token Bucket shaping Method and the Receive Window Tuning Method, indicates that Illinois with RWTM offers the best QoE without causing congestion.
Abstract: HTTP adaptive streaming (HAS) is a streaming video technique widely used over the Internet. However, it has many drawbacks that degrade its user quality of experience (QoE). Our investigation involves several HAS clients competing for bandwidth inside the same home network. Studies have shown that managing the bandwidth between HAS clients using traffic shaping methods improves the QoE. Additionally, the TCP congestion control algorithm in the HAS server may also impact the QoE because every congestion control variant has its own method to control the congestion window size. Based on previous work, we describe two traffic shaping methods, the Hierarchical Token Bucket shaping Method (HTBM) and the Receive Window Tuning Method (RWTM), as well as four popular congestion control variants: NewReno, Vegas, Illinois, and Cubic. In this paper, our objective is to provide a detailed comparative evaluation of combining these four congestion control variants with these two shaping methods. The main result indicates that Illinois with RWTM offers the best QoE without causing congestion. Results were validated through experimentation and objective QoE analytical criteria.
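The Hierarchical Token Bucket shaping Method (HTBM) builds on the classic token bucket. A minimal sketch of that core mechanism follows; rates and sizes are in bytes and the parameter values are illustrative only:

```python
class TokenBucket:
    """Minimal token-bucket shaper (the core of HTB): tokens accrue at
    `rate` bytes/s up to `burst`; a packet is forwarded only when enough
    tokens are available, which caps the long-term sending rate."""
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, 0.0

    def allow(self, size, now):
        # Refill tokens for the elapsed time, capped at the bucket size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False
```

Shaping each HAS client with such a bucket is one way to share home-network bandwidth; RWTM achieves a similar effect by tuning the TCP receive window instead.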
TL;DR: A fully automatic active registration method which deforms a high-resolution template mesh to match low-quality human body scans; it is comparable with state-of-the-art methods for high-quality meshes in terms of accuracy and outperforms them in the case of low-quality scans, where noise, holes and obscured parts are prevalent.
Abstract: Registration of the 3D human body has been a challenging research topic for decades. Most traditional human body registration methods require manual assistance or other auxiliary information such as texture and markers. The majority of these methods are tailored for high-quality scans from expensive scanners. Following the introduction of low-quality scans from cost-effective devices such as Kinect, 3D data capture of the human body has become more convenient and easier. However, due to the inevitable holes, noise and outliers in a low-quality scan, the registration of the human body becomes even more challenging. To address this problem, we propose a fully automatic active registration method which deforms a high-resolution template mesh to match the low-quality human body scans. Our registration method operates on two levels of statistical shape models: (1) the first level is a holistic body shape model that defines the basic figure of the human; (2) the second level includes a set of shape models for every body part, aiming at capturing more body details. Our fitting procedure follows a coarse-to-fine approach that is robust and efficient. Experiments show that our method is comparable with state-of-the-art methods for high-quality meshes in terms of accuracy, and it outperforms them in the case of low-quality scans where noise, holes and obscured parts are prevalent.
TL;DR: In this paper, the authors proposed a new crowd algorithm that maximizes the QoE by using a crowd knowledge database generated by users of a professional service and evaluated their algorithm against state-of-the-art algorithms on large, real-life, crowdsourcing datasets.
Abstract: The increasing demand for video streaming services with a high Quality of Experience (QoE) has prompted considerable research on client-side adaptation logic approaches. However, most algorithms use the client’s previous download experience and do not use a crowd knowledge database generated by users of a professional service. We propose a new crowd algorithm that maximizes the QoE. We evaluate our algorithm against state-of-the-art algorithms on large, real-life, crowdsourcing datasets. There are six datasets, each of which contains samples of a single operator (T-Mobile, AT&T or Verizon) from a single road (I100 or I405). All measurements were from Android cellphones. The datasets were provided by WeFi LTD and are public for academic users. Our new algorithm outperforms all other methods in terms of QoE (eMOS).
TL;DR: A noise-resilient compressed-domain video watermarking system for in-car security that documents the vehicle history, which can serve as proof in case of any mischief.
Abstract: The in-car camera system has emerged as a very useful technology for automobile drivers, as it can provide proper evidence to insurance companies and police investigators in case of car accidents. The dashcam, one of the in-car video storage devices, stores data on an SD card, which is overwritten and can be tampered with very easily. Thus, it is important to develop an in-car security system where data can be stored on a server and user privacy is preserved. In this paper, we propose a noise-resilient compressed-domain video watermarking system for in-car security to document the vehicle history, which can serve as proof in case of any mischief. We use the Independent Pass Coding (INPAC) compression technique for designing the system. Data generated by the system are stored in the server database, which is accessible to the authenticated registered user and the administrator in case of any claim. This system ensures copyright, proprietorship, authentication and security against third-party intrusion. The proposed system is implemented in off-line mode to evaluate the robustness and efficiency of the algorithm, and in online mode to verify the system. A graphical user interface is designed for end-user access, to make the system user friendly.
TL;DR: RAM3S is presented, a framework for the real-time analysis of massive multimedia streams, where data come from multiple data sources that are widely distributed over the territory, with the final goal of discovering new and hidden information from the output of the data sources as it occurs, thus with very limited latency.
Abstract: Big Data platforms provide opportunities for the management and analysis of large quantities of information, but the services they provide are often too raw, since they focus on issues such as fault-tolerance and increased parallelism. An additional software layer is therefore needed to effectively use such architectures for advanced applications in several important real-world domains, such as scientific and health care sensors, user-generated data, supply chain systems and financial companies, to name a few. In this paper, we present RAM3S, a framework for the real-time analysis of massive multimedia streams, where data come from multiple data sources (such as sensors and cameras) that are widely distributed over the territory, with the final goal of discovering new and hidden information from the output of the data sources as it occurs, thus with very limited latency. We apply RAM3S to the use case of automatic detection of "suspect" people from several concurrent video streams, and instantiate it on top of three different open-source engines for the analysis of streaming Big Data (i.e., Apache Spark, Apache Storm, and Apache Flink). The effectiveness and scalability of the RAM3S instantiations are experimentally evaluated on real data, also comparing the performance of the three considered Big Data platforms. This comparison is performed both on a cluster of physical machines in our data lab and on the Google Cloud Platform.
TL;DR: A novel method for tracking multiple players in soccer videos, which include severe occlusions between players and nonlinear motions caused by their complex interactions, is introduced; experiments show its efficiency and robustness compared to previous approaches introduced in the literature.
Abstract: Visual tracking is an essential technique in computer vision. Even though notable improvements have been achieved during the last few years, tracking multiple objects still remains a challenging task. In this paper, a novel method for tracking multiple players in soccer videos, which include severe occlusions between players and nonlinear motions caused by their complex interactions, is introduced. Specifically, we first extract moving objects (i.e., players) by refining the results of background subtraction via the edge information obtained from the frame differencing result. Then, we conduct multiscale sampling in foreground regions that are spatially close to each tracked player, and subsequently compute the dissimilarity between sampled image blocks and each tracked player. Based on the best-matched case, the state of each tracked player (e.g., center position, color, etc.) is consistently updated using an online interpolation scheme. Experimental results on various soccer videos show the efficiency and robustness of our method compared to previous approaches introduced in the literature.
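The moving-object extraction step, combining background subtraction with a frame-differencing motion cue, can be sketched as follows; the thresholds are illustrative assumptions and the refinement is simplified to a pixelwise AND:

```python
import numpy as np

def moving_object_mask(frame, prev_frame, background, bg_thr=30, diff_thr=20):
    """Rough sketch of the extraction step: background subtraction yields
    candidate foreground, and frame differencing (a motion cue) keeps only
    regions that actually moved between consecutive frames."""
    fg = np.abs(frame.astype(int) - background.astype(int)) > bg_thr
    motion = np.abs(frame.astype(int) - prev_frame.astype(int)) > diff_thr
    return fg & motion
```

The intersection suppresses static background-model errors (e.g., pitch markings misclassified as foreground) because they show no inter-frame motion.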
TL;DR: This paper thoroughly examines the impact of packet loss on user QoE in cases when multimedia streaming service is based on underlying User Datagram Protocol and demonstrated how the overall user experience can be redeemed, despite the perceived quality distortions, if the content is entertaining to the viewer.
Abstract: Multimedia content delivery has become one of the pillar services of modern day mobile and fixed networks. The variety of devices, platforms, and content providers together with increasing network capacity has impacted the popularity of this type of service. Considering this context, it is crucial to ensure end-to-end service quality that can fulfill users’ expectations. The user quality of experience (QoE) for multimedia streaming is tempered by numerous objective and subjective parameters; therefore, it is important to understand the relationships among them. In this paper, we thoroughly examine the impact of packet loss on user QoE in cases when multimedia streaming service is based on underlying User Datagram Protocol. The dependencies between the chosen objective and subjective parameters and the user QoE were examined in a real-life environment by conducting a survey with 602 test subjects who rated the quality of a 1-h documentary film (72 different test sequences were prepared for the rating process). Based on the obtained results, we ranked the objective parameters by their order of importance in relation to their impact on user QoE as follows: (1) total duration of packet loss occurrences (PLOs), i.e., quality distortions in a video; (2) number of PLOs; (3) packet loss rate; and (4) duration of a single PLO. We also demonstrated how the overall user experience can be redeemed, despite the perceived quality distortions, if the content is entertaining to the viewer. The user experience was also found to be influenced by the existence/non-existence of video subtitles.
TL;DR: VMShadow, a system that automatically optimizes the location and performance of latency-sensitive VMs in the cloud; it employs black-box fingerprinting of a VM's network traffic to infer latency-sensitivity and uses both ILP and greedy heuristic-based algorithms to move highly latency-sensitive VMs to cloud sites that are closer to their end users.
Abstract: Distributed clouds offer a choice of data center locations for providers to host their applications. In this paper, we consider distributed clouds that host virtual desktops which are then accessed by users through remote desktop protocols. Virtual desktops have different levels of latency-sensitivity, primarily determined by the actual applications running and affected by the end users' locations. In the scenario of mobile users, even switching between 3G and WiFi networks affects the latency-sensitivity. We design VMShadow, a system to automatically optimize the location and performance of latency-sensitive VMs in the cloud. VMShadow performs black-box fingerprinting of a VM's network traffic to infer the latency-sensitivity and employs both ILP and greedy heuristic-based algorithms to move highly latency-sensitive VMs to cloud sites that are closer to their end users. VMShadow employs a WAN-based live migration and a new network connection migration protocol to ensure that the VM migration and subsequent changes to the VM's network address are transparent to end users. We implement a prototype of VMShadow in a nested hypervisor and demonstrate its effectiveness for optimizing the performance of VM-based desktops in the cloud. Our experiments on a private as well as the public EC2 cloud show that VMShadow is able to discriminate between latency-sensitive and insensitive desktop VMs and judiciously moves only those that will benefit the most from the migration. For desktop VMs with video activity, VMShadow improves VNC's refresh rate by 90% by migrating the virtual desktop to the closer location. Transcontinental remote desktop migrations take only about 4 min, and our connection migration proxy imposes 13 μs overhead per packet.
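The greedy placement heuristic can be sketched as follows; the data shapes (sensitivity scores, per-site latencies, slot capacities) are simplified assumptions for illustration, not VMShadow's actual interfaces:

```python
def greedy_placement(vms, capacity):
    """Greedy heuristic in the spirit of VMShadow's placement step: VMs are
    considered in decreasing latency-sensitivity, and each is assigned to the
    feasible site with the lowest latency to its users.
    `vms` is a list of (name, sensitivity, {site: latency});
    `capacity` maps site -> free VM slots."""
    cap = dict(capacity)
    placement = {}
    for name, _, lat in sorted(vms, key=lambda v: -v[1]):
        options = sorted((l, s) for s, l in lat.items() if cap.get(s, 0) > 0)
        if options:
            _, site = options[0]
            placement[name] = site
            cap[site] -= 1
    return placement
```

Processing the most latency-sensitive VMs first means that, when site capacity runs out, it is the insensitive VMs that end up farther away, which mirrors the stated goal of moving only the VMs that benefit most.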
TL;DR: The proposed covert VoIP communications system not only achieves high VoIP quality and prevents detection by statistical analysis, but also provides integrity for the secret data.
Abstract: As one of the most popular real-time services on the Internet, Voice over Internet Protocol (VoIP) has attracted attention in the information security field due to its real-time character and high traffic volume. To protect data security, a new covert VoIP communications system is proposed in this study to realize secure communications by hiding secret information in VoIP streams. In the proposed algorithm, secret data are divided into blocks after being encrypted with a block cipher, and then each block of secret data is embedded into VoIP streaming packets at positions chosen randomly using a chaotic mapping. The symmetric key is distributed through an efficient and secure channel, and a message digest is used to protect the integrity of the secret data. The experimental data were collected by comparing audio data between the sender and the receiver. The experimental results indicate that data embedding has little impact on the quality of speech. Moreover, statistical analysis could not detect the secret data embedded in the VoIP streams, thanks to the block cipher and the random numbers generated from the chaotic mapping. The proposed covert VoIP communications system not only achieves high VoIP quality and prevents detection by statistical analysis, but also provides integrity for the secret data.
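The use of a chaotic mapping to pick random embedding positions in the VoIP stream can be sketched with a logistic map; the seed, control parameter, and index-mapping rule below are illustrative assumptions, since the paper's exact mapping is not specified here:

```python
def chaotic_positions(n_slots, n_blocks, x0=0.7, r=3.99):
    """Use the logistic map x_{k+1} = r*x_k*(1-x_k) to pick distinct
    embedding positions for the encrypted secret blocks among n_slots
    packet slots. Both endpoints derive the same positions from the
    shared (x0, r), so no position list needs to be transmitted."""
    x, used, pos = x0, set(), []
    while len(pos) < n_blocks:
        x = r * x * (1.0 - x)
        p = int(x * n_slots) % n_slots
        if p not in used:       # keep positions distinct
            used.add(p)
            pos.append(p)
    return pos
```

Scattering the encrypted blocks at chaotic positions, rather than in consecutive packets, is what frustrates the statistical detection mentioned above.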
TL;DR: An efficient three-dimensional histogram shifting scheme is proposed for reversible data hiding; in traditional histogram shifting only one information bit can be hidden with at most one modification of one coefficient, whereas two data bits can be hidden at the same cost using the proposed scheme.
Abstract: Histogram shifting is an important method for reversible data hiding. However, in traditional histogram shifting, every pixel, difference, or prediction-error is changed individually to hide a single data bit, which constrains the capacity-distortion embedding performance. An efficient three-dimensional histogram shifting scheme for reversible data hiding is proposed in this paper. H.264 videos are taken as covers to demonstrate the method. In a 4 × 4 quantized discrete cosine transform luminance block that is not referenced by other blocks, three alternating-current coefficients are selected randomly as an embeddable group. According to the values of the selected coefficient groups, they are divided into different sets, and data are hidden according to these sets. In traditional histogram shifting, only one information bit can be hidden with at most one modification of one coefficient, whereas two data bits can be hidden at the same cost by the proposed scheme. The superiority of the presented technique is verified through experiments.
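The capacity gain claimed above, two bits hidden at the cost of at most one single-coefficient modification in a group of three, has the same flavor as classic [3,2] matrix (syndrome) embedding, sketched below on cover bits. This is an analogy to illustrate the counting argument, not the paper's exact 3-D histogram construction:

```python
def embed_2bits(bits3, msg2):
    """[3,2] matrix (syndrome) embedding: hide 2 message bits in 3 cover
    bits with at most one bit flip. H's columns are the nonzero vectors
    of length 2, so any required syndrome change is one flip away."""
    H = [(0, 1), (1, 0), (1, 1)]
    s = (sum(b * h[0] for b, h in zip(bits3, H)) % 2,
         sum(b * h[1] for b, h in zip(bits3, H)) % 2)
    e = (msg2[0] ^ s[0], msg2[1] ^ s[1])  # needed syndrome change
    out = list(bits3)
    if e != (0, 0):
        out[H.index(e)] ^= 1  # flip the one cover bit whose column equals e
    return out

def extract_2bits(bits3):
    """The receiver recomputes the syndrome; it equals the 2-bit message."""
    H = [(0, 1), (1, 0), (1, 1)]
    return (sum(b * h[0] for b, h in zip(bits3, H)) % 2,
            sum(b * h[1] for b, h in zip(bits3, H)) % 2)
```

The same arithmetic explains the paper's ratio: grouping three carriers lets two bits ride on at most one unit of distortion, instead of one bit per modified carrier.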
TL;DR: This paper proposes a throughput estimation method that accurately estimates the throughput based on previous throughput samples; it is robust to short-term, small fluctuations and sensitive to large fluctuations in throughput.
Abstract: Adaptive streaming allows for dynamic adaptation of the bitrate to varying network conditions, to guarantee the best user experience. Adaptive bitrate algorithms face a significant challenge in correctly estimating the throughput, as the throughput varies widely over time. Current throughput estimation methods cannot distinguish between throughput fluctuations of different amplitude and frequency. In this paper, we propose a throughput estimation method that accurately estimates the throughput based on previous throughput samples. It is robust to short-term, small fluctuations and sensitive to large fluctuations in throughput. Furthermore, we propose a rate adaptation algorithm for video on demand (VoD) services that selects the quality of the video based on the estimated throughput and playback buffer occupancy. The objective of the rate adaptation algorithm is to guarantee high video quality to improve user experience. The proposed algorithm dynamically adjusts the quality level of the video stream, selecting high-quality video segments while minimizing the risk of playback interruption. Furthermore, the proposed method minimizes the frequency of video rate changes. We show that the algorithm smoothly switches the video rate to improve user experience. Furthermore, we determine that it efficiently utilizes network resources to achieve a high video rate; competing HTTP clients achieve equitable video rates. We also confirm that variations in the playback buffer size and segment duration do not affect the performance of the proposed algorithm.
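An estimator that smooths small fluctuations yet reacts quickly to large ones can be sketched as an adaptive exponentially weighted moving average; the weights and jump threshold below are illustrative assumptions, not the paper's method:

```python
def make_estimator(alpha_small=0.1, alpha_large=0.6, jump=0.5):
    """Adaptive EWMA: small relative fluctuations are smoothed heavily
    (alpha_small), while a large relative jump (above `jump`) shifts the
    estimate quickly (alpha_large). Returns an update closure."""
    state = {"est": None}

    def update(sample):
        if state["est"] is None:
            state["est"] = float(sample)
        else:
            rel = abs(sample - state["est"]) / max(state["est"], 1e-9)
            a = alpha_large if rel > jump else alpha_small
            state["est"] += a * (sample - state["est"])
        return state["est"]

    return update
```

Feeding per-segment throughput samples into `update` gives an estimate that ignores jitter but tracks a genuine bandwidth drop within a sample or two.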
TL;DR: This paper proposes a pseudo-positive regularization method that enriches the diversity of the training data to reduce the risk of over-fitting in person re-ID.
Abstract: An intrinsic challenge of person re-identification (re-ID) is the annotation difficulty. This typically means (1) few training samples per identity and (2) thus the lack of diversity among the training samples. Consequently, we face high risk of over-fitting when training the convolutional neural network (CNN), a state-of-the-art method in person re-ID. To reduce the risk of over-fitting, this paper proposes a Pseudo-Positive Regularization method to enrich the diversity of the training data. Specifically, unlabeled data from an independent pedestrian database are retrieved using the target training data as query. A small proportion of these retrieved samples are randomly selected as the Pseudo-Positive samples and added to the target training set for the supervised CNN training. The addition of Pseudo-Positive samples is therefore a Data Augmentation method to reduce the risk of over-fitting during CNN training. We implement our idea in the identification CNN models (i.e., CaffeNet, VGGNet-16 and ResNet-50). On CUHK03 and Market-1501 datasets, experimental results demonstrate that the proposed method consistently improves the baseline and yields competitive performance to the state-of-the-art person re-ID methods.
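The augmentation step described above can be sketched as follows; the proportion value and the flat-list data layout are illustrative assumptions, not taken from the paper:

```python
import random

def add_pseudo_positives(train_set, retrieved, proportion=0.1, seed=0):
    """Randomly pick a small proportion of the samples retrieved from an
    independent pedestrian database and append them to the target
    training set as pseudo-positives (sketch; the 10% proportion is a
    hypothetical value)."""
    rng = random.Random(seed)
    k = max(1, int(len(retrieved) * proportion))
    pseudo = rng.sample(retrieved, k)       # random subset of retrievals
    return train_set + pseudo               # augmented training set
```

The augmented set then feeds the supervised CNN training unchanged; only the data changes, which is what makes this a regularizer rather than a model modification.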
TL;DR: Experimental results demonstrate that the proposed deep-learning-based object removal framework outperforms the conventional method on both object removal and inpainting.
Abstract: Object removal is a popular image manipulation technique that mainly involves two technical problems: object segmentation and image inpainting. In the conventional object removal framework, the segmentation part needs a mask or manual pre-processing, and the quality of the inpainting still requires improvement. In this paper, we propose a new object removal framework based on deep learning techniques. Conditional random fields as recurrent neural networks (CRF-RNN) are used to segment the target semantically, which avoids the need for a mask or manual pre-processing. For the inpainting part, a new method for filling the missing region is proposed: representation features are computed from the convolutional neural network (CNN) feature maps of the regions neighboring the missing region, and limited-memory bound-constrained optimization (L-BFGS-B) is then used to synthesize the missing region based on the CNN representation features of similar neighboring regions. We evaluate the proposed method by applying it to different kinds of images and textures for object removal and inpainting. Experimental results demonstrate that our method outperforms the conventional method.
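The L-BFGS synthesis step can be sketched with SciPy's bound-constrained solver on a toy feature-matching objective. The linear "feature extractor" W below is a hypothetical stand-in for real CNN feature maps, so this shows only the optimization pattern, not the paper's actual pipeline:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))     # hypothetical feature extractor
target_patch = rng.random(16)        # a known neighbouring region
target_feat = W @ target_patch       # its "CNN" features

def loss(x):
    """Half squared distance between synthesized and target features."""
    r = W @ x - target_feat
    return 0.5 * r @ r

def grad(x):
    return W.T @ (W @ x - target_feat)

# Synthesize the missing patch: match neighbour features under
# box constraints that keep pixel intensities in [0, 1].
res = minimize(loss, np.zeros(16), jac=grad, method="L-BFGS-B",
               bounds=[(0.0, 1.0)] * 16)
```

With real CNN features the gradient would come from backpropagation through the network rather than the closed form above; the bound constraints play the same role in either case.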
TL;DR: A saliency image embedding is proposed as a pedestrian descriptor, and it is shown that the descriptor learned by the proposed SIF CNN architecture provides a significant improvement over the baselines and achieves competitive performance compared with state-of-the-art person re-ID methods on three large-scale person re-ID benchmarks.
Abstract: Background interference, which arises from complex environments, is a critical problem for a robust person re-identification (re-ID) system, since background noise may significantly compromise feature learning and matching. To reduce background interference, this paper proposes a saliency image embedding as a pedestrian descriptor. First, to eliminate the background of each pedestrian image, a saliency image is constructed using an unsupervised manifold ranking-based saliency detection algorithm. Second, to compensate for errors and missing details of the pedestrian introduced during saliency image construction, a saliency image fusion (SIF) convolutional neural network (CNN) architecture is designed, in which the original pedestrian image and the saliency image are both employed as input. We implement our idea in identification models based on several state-of-the-art backbone CNN models (i.e., CaffeNet, VGGNet-16, GoogLeNet and ResNet-50). We show that the pedestrian descriptor learned by the proposed SIF CNN architecture provides a significant improvement over the baselines and achieves competitive performance compared with state-of-the-art person re-ID methods on three large-scale person re-ID benchmarks (i.e., Market-1501, DukeMTMC-reID and MARS).
TL;DR: Compression analysis and extensive security tests have demonstrated the robustness of the proposed selective encryption approach against the best-known types of attacks, the preservation of the main compression properties, and its efficiency in terms of execution time compared with other similar JPEG 2000 image encryption schemes.
Abstract: This paper presents an efficient and lightweight format-compliant selective encryption algorithm for secure JPEG 2000 coding. The proposed encryption scheme is dynamic in nature: the key is changed for every input image. Furthermore, 4% of the bytes of each packet's data are selected for encryption, and, to achieve the desired security, two rounds of substitution–diffusion are applied to the selected bytes. Experimental analysis shows that this amount of encrypted data ensures strong image distortion while significantly preserving the communication bandwidth. In addition, compression analysis and extensive security tests have demonstrated (1) the robustness of the proposed selective encryption approach against the best-known types of attacks, (2) the preservation of the main compression properties (i.e., compression friendliness and format compliance), and, most importantly, (3) its efficiency in terms of execution time compared with other similar JPEG 2000 image encryption schemes.
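A substitution–diffusion round pair over the selected bytes can be sketched as follows. The SHA-256-based keystream and the XOR chaining are illustrative choices, not the paper's actual cipher:

```python
import hashlib

def keystream(key, n):
    """Derive n pseudo-random bytes from the key (SHA-256 in counter
    mode; an illustrative construction)."""
    out, counter = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]

def encrypt_selected(data, key, rounds=2):
    """Each round: byte substitution (XOR keystream) followed by
    forward diffusion (each byte chained to its predecessor)."""
    buf = bytearray(data)
    for r in range(rounds):
        ks = keystream(key + bytes([r]), len(buf))
        prev = 0
        for i in range(len(buf)):
            buf[i] ^= ks[i]        # substitution
            buf[i] ^= prev         # diffusion: chain to previous byte
            prev = buf[i]
    return bytes(buf)

def decrypt_selected(data, key, rounds=2):
    """Invert the rounds in reverse order, walking backwards so the
    previous ciphertext byte is still available."""
    buf = bytearray(data)
    for r in reversed(range(rounds)):
        ks = keystream(key + bytes([r]), len(buf))
        for i in reversed(range(len(buf))):
            prev = buf[i - 1] if i > 0 else 0
            buf[i] ^= ks[i] ^ prev
    return bytes(buf)
```

The diffusion pass is what makes a single flipped input byte affect all following ciphertext bytes, which is the property the two-round design relies on.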
TL;DR: A customer pose estimation system using an orientational spatio-temporal deep neural network on surveillance camera footage, which applies a bi-directional recurrent neural network on top of the system to improve estimation accuracy by considering both forward and backward image sequences.
Abstract: The analysis of customer pose draws more and more attention from retailers and researchers, because this information can reveal customers' habits and their level of interest in the merchandise. In the retail store environment, customers' poses are highly related to their body orientations. For example, when a customer is picking an item from a merchandise shelf, he or she must face the shelf; on the other hand, if the customer's body orientation is parallel to the shelf, this customer is probably just walking through. Considering this fact, we propose a customer pose estimation system using an orientational spatio-temporal deep neural network applied to surveillance camera footage. The system first generates initial joint heatmaps using a fully convolutional network. Based on these heatmaps, we propose a set of novel orientational message-passing layers that fine-tune the joint heatmaps by introducing body orientation information into the conventional message-passing layers. In addition, we apply a bi-directional recurrent neural network on top of the system to improve the estimation accuracy by considering both forward and backward image sequences. In this system, therefore, the global body orientation, local joint connections, and temporal pose continuity are integrally considered. Finally, we conduct a series of comparison experiments to show the effectiveness of our system.
TL;DR: The aim of this research was to model user QoE for the User Datagram Protocol-based video streaming service from the results of uncontrolled subjective tests, and showed that a strong positive linear relationship exists between the assessedQoE of the model and the Mean Opinion Scores of the subjects.
Abstract: One of the paramount research questions in the scientific community today is how to remotely assess user quality of experience (QoE) for a specific service. To this end, various user QoE assessment models have been developed; however, they are mostly based on data gathered from controlled environment experimentation. The aim of this research was to model user QoE for the User Datagram Protocol-based video streaming service from the results of uncontrolled subjective tests. Specifically, using fuzzy logic, we have correlated the values of three objective network parameters (the packet loss rate and the number and duration of packet loss occurrences in one streaming session) with test subjects' subjective perception of quality distortions. The dependencies between different values of the parameters and the subjects' perception of video quality were used to develop a no-reference objective video quality assessment model for assessing user QoE. The key distinguishing feature of the developed model lies in the process of subjective evaluation, which was conducted with a panel of 602 test subjects who evaluated the quality of a 1-h video in home environments. For this purpose, 72 different test sequences were prepared for rating. We showed that a strong positive linear relationship exists between the QoE assessed by the model and the Mean Opinion Scores of the subjects (a Pearson correlation coefficient equal to 0.8841).
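The reported agreement (r = 0.8841) is the standard Pearson correlation coefficient, which can be computed directly from the model's QoE outputs and the subjects' MOS values:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value near +1 indicates the strong positive linear relationship the authors report; the sample data one would feed in here (model scores vs. MOS per test sequence) is not reproduced in the abstract.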
TL;DR: The experimental results show that the proposed UMIT method is both efficient and effective, minimizing the response time by decreasing the network transmission cost.
Abstract: To effectively and efficiently reduce the transmission costs of large medical images in (mobile) telemedicine systems, we design and implement a professionally user-adaptive large medical image transmission method called UMIT. Before transmission, a preprocessing step is first conducted to obtain the optimal image block (IB) replicas based on the users' professional preference model and the network bandwidth at a master node. After that, the candidate IBs are transmitted via slave nodes according to their transmission priorities. Finally, the IBs are reconstructed and displayed on the users' devices. The proposed method includes three enabling techniques: (1) derivation of the user's preference degree for the medically useful areas, (2) an optimal IB replica storage scheme, and (3) an adaptive and robust multi-resolution-based IB replica selection and transmission method. The experimental results show that the proposed UMIT method is both efficient and effective, minimizing the response time by decreasing the network transmission cost.
TL;DR: A novel iterative algorithm called conditionally adaptive multiple template update is proposed to regulate the object templates for handling dynamic occlusions effectively and shows that the proposed method competes well with other relevant state-of-the-art tracking methods.
Abstract: Tracking of moving objects in real-time scenes is a challenging research problem in computer vision, owing to incessant changes in object features, background, occlusions, and illumination occurring at different instances in the scene. With the objective of tracking visual objects in intricate videos, this paper presents a new color-independent tracking approach, the contributions of which are threefold. First, the illumination level of the sequences is kept constant using the fast discrete curvelet transform. Second, a Fisher information metric is calculated as a cumulative score obtained by comparing the template patches with a reference template at different timeframes; this metric quantifies the distance between the histogram distributions of consecutive frames. Third, a novel iterative algorithm called conditionally adaptive multiple template update is proposed to regulate the object templates so as to handle dynamic occlusions effectively. The proposed method is evaluated on a set of extensive, challenging benchmark datasets. Experimental results in terms of Center Location Error (CLE), Tracking Success Score (TSS), and Occlusion Success Score (OSS) show that the proposed method competes well with other relevant state-of-the-art tracking methods.
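For normalized histograms, the Fisher information (Fisher–Rao) distance has a closed form via the Bhattacharyya coefficient. The sketch below shows that standard formulation as a way to quantify the distance between two frames' histogram distributions; the paper's cumulative scoring over template patches is not reproduced here:

```python
import math

def fisher_rao(p, q):
    """Fisher-Rao distance between two normalized histograms, computed
    in closed form from the Bhattacharyya coefficient: the histograms
    are mapped to the unit sphere via sqrt, where the geodesic distance
    is an arc length."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return 2.0 * math.acos(min(1.0, bc))   # clamp guards float round-off
```

Identical histograms yield a distance of 0, and histograms with disjoint support yield the maximum distance of pi, so the metric is well suited to scoring how much a frame's appearance has drifted from the reference template.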
TL;DR: When the models were evaluated using the Mean Absolute Percentage Error (MAPE), the proposed models for G.726 and G.729 were found to outperform PESQ, reducing the MAPE by about 30% and 17%, respectively.
Abstract: This paper proposes two models of Mean Opinion Score (MOS) estimation for the G.726 and G.729 codecs, based on Thai users and the Thai language and referring to packet loss effects. Absolute Category Rating (ACR) listening tests were conducted with 89 and 107 participants for the development of the G.726 and G.729 MOS estimation models, respectively, while tests with a total of 60 participants were used for the evaluation of both models. Packet loss rates ranged from 0 to 15%, with 5 test conditions for G.726 and 6 for G.729; each condition was conducted with at least 16 participants. After gathering the data, the MOS estimation models for both codecs were created and then evaluated on the test sets in comparison with Perceptual Evaluation of Speech Quality (PESQ), a popular measurement method. As one of the contributions of this study, when the models were evaluated using the Mean Absolute Percentage Error (MAPE), the proposed models for G.726 and G.729 were found to outperform PESQ, reducing the MAPE by about 30% and 17%, respectively.
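The evaluation measure used here, MAPE, is straightforward to compute from the subjective MOS values and a model's estimates; the sample values below are illustrative, not taken from the paper:

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent: the average of
    |actual - predicted| / |actual| over all test conditions."""
    return 100.0 * sum(abs(a - p) / abs(a)
                       for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical example: MOS values on the usual 1-5 scale.
subjective_mos = [4.0, 3.2, 2.5, 1.8]
model_estimates = [3.8, 3.3, 2.3, 1.9]
model_error = mape(subjective_mos, model_estimates)
```

A "30% reduction in MAPE" then means the proposed model's MAPE is 30% lower than PESQ's MAPE on the same test set, i.e., a relative improvement of the error measure rather than an absolute percentage-point difference.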
TL;DR: It is shown how the Windsurf library can be effectively exploited to assess a fair comparison among the existing alternative approaches based on salient points, which can be contrasted on aspects of both effectiveness and efficiency.
Abstract: Despite their popularity, approaches based on salient point descriptors have yet to be proven effective for content-based image retrieval. In this paper, we show how the Windsurf library can be exploited to carry out a fair comparison among the existing alternative approaches based on salient points, contrasting them on aspects of both effectiveness and efficiency. Our extensive experimental evaluation, performed on four different image benchmarks, shows that techniques based on salient point descriptors are no more effective than other existing techniques and are less amenable to indexing; thus, their efficiency remains questionable.
TL;DR: The results show that, if employed correctly, haptic and olfactory stimuli can significantly enhance immersion and user experience; major success factors appear to be the amplitude and frequency of stimulation as well as the temporal synchronization with the other media channels, in particular the visual stimuli.
Abstract: The Virtual Jumpcube is a virtual reality setup from 2015 that allows for jumping and flying in audiovisual virtual environments. Recently, we have included several haptic and olfactory stimuli that should further increase the degree of immersion in the experienced virtuality. These additional media channels were tested by the participants of several events, and the feedback of 196 jumpers was gathered in a questionnaire. In this paper, we describe the stimulation hardware and software as well as the performed experiment, and we present the major findings of the evaluation. It shows that, if employed correctly, haptic and olfactory stimuli can significantly enhance immersion and user experience. Major success factors appear to be the amplitude and frequency of stimulation, as well as the temporal synchronization with the other media channels, in particular the visual stimuli.