
Showing papers in "IEEE Transactions on Circuits and Systems for Video Technology in 2013"


Journal ArticleDOI
TL;DR: This paper categorizes different ALPR techniques according to the features they used for each stage, and compares them in terms of pros, cons, recognition accuracy, and processing speed.
Abstract: Automatic license plate recognition (ALPR) is the extraction of vehicle license plate information from an image or a sequence of images. The extracted information can be used with or without a database in many applications, such as electronic payment systems (toll payment, parking fee payment), and freeway and arterial monitoring systems for traffic surveillance. The ALPR uses either a color, black and white, or infrared camera to take images. The quality of the acquired images is a major factor in the success of the ALPR. ALPR as a real-life application has to quickly and successfully process license plates under different environmental conditions, such as indoors, outdoors, day or night time. It should also be generalized to process license plates from different nations, provinces, or states. These plates usually contain different colors, are written in different languages, and use different fonts; some plates may have a single color background and others have background images. The license plates can be partially occluded by dirt, lighting, and towing accessories on the car. In this paper, we present a comprehensive review of the state-of-the-art techniques for ALPR. We categorize different ALPR techniques according to the features they used for each stage, and compare them in terms of pros, cons, recognition accuracy, and processing speed. Future forecasts of ALPR are given at the end.

682 citations


Journal ArticleDOI
TL;DR: A family of reduced reference video quality assessment (QA) models that utilize spatial and temporal entropic differences are presented, adopting a hybrid approach of combining statistical models and perceptual principles to design QA algorithms.
Abstract: We present a family of reduced reference video quality assessment (QA) models that utilize spatial and temporal entropic differences. We adopt a hybrid approach of combining statistical models and perceptual principles to design QA algorithms. A Gaussian scale mixture model for the wavelet coefficients of frames and frame differences is used to measure the amount of spatial and temporal information differences between the reference and distorted videos, respectively. The spatial and temporal information differences are combined to obtain the spatio-temporal-reduced reference entropic differences. The algorithms are flexible in terms of the amount of side information required from the reference that can range between a single scalar per frame and the entire reference information. The spatio-temporal entropic differences are shown to correlate quite well with human judgments of quality, as demonstrated by experiments on the LIVE video quality assessment database.

308 citations
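
To make the entropic-difference idea above concrete, here is a much-simplified Python sketch: under a local Gaussian assumption (the paper uses a Gaussian scale mixture on wavelet coefficients rather than raw frame differences), block-wise differential entropies of frame differences are compared between reference and distorted videos. The function names, block size, and pooling are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def block_entropy(frame_diff, block=8, eps=1e-6):
    """Per-block differential entropy under a local Gaussian assumption:
    h = 0.5 * log(2*pi*e*var)."""
    h, w = frame_diff.shape
    h, w = h - h % block, w - w % block
    x = frame_diff[:h, :w].reshape(h // block, block, w // block, block)
    var = x.var(axis=(1, 3)) + eps
    return 0.5 * np.log(2 * np.pi * np.e * var)

def temporal_entropic_difference(ref_frames, dist_frames):
    """Average absolute entropy difference of frame differences (temporal part);
    the spatial part would do the same on the frames themselves."""
    scores = []
    for t in range(1, len(ref_frames)):
        d_ref = ref_frames[t].astype(float) - ref_frames[t - 1].astype(float)
        d_dst = dist_frames[t].astype(float) - dist_frames[t - 1].astype(float)
        scores.append(np.abs(block_entropy(d_ref) - block_entropy(d_dst)).mean())
    return float(np.mean(scores))
```

The reduced-reference aspect comes from transmitting only these block entropies (or an even coarser summary, down to a single scalar per frame) as side information.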


Journal ArticleDOI
Seung-Hyun Cho, Munchurl Kim
TL;DR: A fast CU splitting and pruning method is presented for HEVC intra coding, which allows for significant reduction in computational complexity with small degradations in rate-distortion (RD) performance.
Abstract: High Efficiency Video Coding (HEVC), a new video coding standard currently being established, adopts a quadtree-based Coding Unit (CU) block partitioning structure that is flexible in adapting to the various texture characteristics of images. However, this causes a dramatic increase in computational complexity compared to previous video coding standards due to the necessity of finding the best CU partitions. In this paper, a fast CU splitting and pruning method is presented for HEVC intra coding, which allows for a significant reduction in computational complexity with small degradations in rate-distortion (RD) performance. The proposed fast splitting and pruning method is performed in two complementary steps: 1) early CU split decision and 2) early CU pruning decision. For CU blocks, the early CU splitting and pruning tests are performed at each CU depth level according to a Bayes decision rule based on low-complexity RD costs and full RD costs, respectively. The statistical parameters for the early CU split and pruning tests are periodically updated on the fly for each CU depth level to cope with varying signal characteristics. Experimental results show that our proposed fast CU splitting and pruning method reduces the encoding time of the current HM by about 50% with only a 0.6% increase in BD-rate.

306 citations
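
The flavor of the Bayes-rule early decision can be sketched as follows, assuming the class-conditional distributions of the RD cost are modeled as Gaussians whose parameters are refreshed on the fly; the paper uses low-complexity RD costs for the split test and full RD costs for the pruning test, which this illustrative sketch collapses into a single threshold test.

```python
import numpy as np

def gauss_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def early_cu_decision(rd_cost, stats, risk_ratio=1.0):
    """Bayes-style early decision at one CU depth level.
    'stats' holds Gaussian parameters and priors of the split / non-split
    classes, periodically re-estimated from already coded CUs."""
    p_split = stats["prior_split"] * gauss_pdf(rd_cost, stats["mu_split"], stats["var_split"])
    p_keep = stats["prior_keep"] * gauss_pdf(rd_cost, stats["mu_keep"], stats["var_keep"])
    if p_split > risk_ratio * p_keep:
        return "split-early"   # skip the full RD test of the current CU size
    if p_keep > risk_ratio * p_split:
        return "prune-early"   # skip evaluating the four sub-CUs
    return "full-rd-check"     # ambiguous: fall back to the normal RDO search
```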


Journal ArticleDOI
TL;DR: A novel prediction-based reversible steganographic scheme based on image inpainting that provides a greater embedding rate and better visual quality compared with recently reported methods.
Abstract: In this paper, we propose a novel prediction-based reversible steganographic scheme based on image inpainting. First, reference pixels are chosen adaptively according to the distribution characteristics of the image content. Then, the image inpainting technique based on partial differential equations is introduced to generate a prediction image that has similar structural and geometric information as the cover image. Finally, by using the two selected groups of peak points and zero points, the histogram of the prediction error is shifted to embed the secret bits reversibly. Since the same reference pixels can be exploited in the extraction procedure, the embedded secret bits can be extracted from the stego image correctly, and the cover image can be restored losslessly. Through the use of the adaptive strategy for choosing reference pixels and the inpainting predictor, the prediction accuracy is high, and more embeddable pixels are acquired. Thus, the proposed scheme provides a greater embedding rate and better visual quality compared with recently reported methods.

227 citations
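
The reversible embedding step, histogram shifting of the prediction errors, is standard and easy to sketch for a single (peak, zero) pair with the peak to the left of the zero bin; the helper names and the right-shift convention are illustrative.

```python
import numpy as np

def hs_embed(pred_errors, bits, peak, zero):
    """Shift bins strictly between 'peak' and 'zero' right by one, then embed one
    bit at each error equal to 'peak' (bit 0 stays at peak, bit 1 becomes peak+1)."""
    assert peak < zero
    out = pred_errors.copy()
    out[(pred_errors > peak) & (pred_errors < zero)] += 1
    sites = np.flatnonzero(pred_errors == peak)
    for b, i in zip(bits, sites):
        out[i] = peak + b
    return out

def hs_extract(stego_errors, peak, zero, n_bits):
    """Recover the embedded bits and restore the original prediction errors."""
    sites = np.flatnonzero((stego_errors == peak) | (stego_errors == peak + 1))
    bits = [int(stego_errors[i] == peak + 1) for i in sites[:n_bits]]
    restored = stego_errors.copy()
    restored[(stego_errors > peak) & (stego_errors <= zero)] -= 1
    return bits, restored
```

Because the reference pixels, and hence the inpainted prediction, are identical at the decoder, the same prediction errors are reproduced there, which is what makes the scheme reversible.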


Journal ArticleDOI
TL;DR: This paper organizes and surveys the corresponding literature, defines unambiguous key terms, and discusses links among fundamental building blocks ranging from human detection to action and interaction recognition, providing a comprehensive coverage of key aspects of video-based human behavior understanding.
Abstract: Understanding human behaviors is a challenging problem in computer vision that has recently seen important advances. Human behavior understanding combines image and signal processing, feature extraction, machine learning, and 3-D geometry. Application scenarios range from surveillance to indexing and retrieval, from patient care to industrial safety and sports analysis. Given the broad set of techniques used in video-based behavior understanding and the fast progress in this area, in this paper we organize and survey the corresponding literature, define unambiguous key terms, and discuss links among fundamental building blocks ranging from human detection to action and interaction recognition. The advantages and the drawbacks of the methods are critically discussed, providing a comprehensive coverage of key aspects of video-based human behavior understanding, available datasets for experimentation and comparisons, and important open research issues.

199 citations


Journal ArticleDOI
TL;DR: This paper presents regularized smoothing KISS metric learning (RS-KISS) by seamlessly integrating smoothing and regularization techniques for robustly estimating covariance matrices and introduces incremental learning to RS-KISS.
Abstract: With the rapid development of intelligent video surveillance (IVS), person re-identification, which is a difficult yet unavoidable problem in video surveillance, has received increasing attention in recent years. This is because computing capacity has advanced remarkably and person re-identification plays a critical role in video surveillance systems. In short, person re-identification aims to recognize an individual again once they have been observed across different cameras. It has been reported that KISS metric learning obtains state-of-the-art performance for person re-identification on the VIPeR dataset. However, given a small training set, the estimate of the inverse of a covariance matrix is not stable, and thus the resulting performance can be poor. In this paper, we present regularized smoothing KISS metric learning (RS-KISS) by seamlessly integrating smoothing and regularization techniques for robustly estimating covariance matrices. RS-KISS is superior to KISS because RS-KISS can enlarge the underestimated small eigenvalues and reduce the overestimated large eigenvalues of the estimated covariance matrix in an effective way. By providing additional data, we can obtain a more robust model with RS-KISS. However, retraining RS-KISS on all the available examples in a straightforward way is time consuming, so we introduce incremental learning to RS-KISS. We conduct thorough experiments on the VIPeR dataset and verify that 1) RS-KISS outperforms all previously reported results for person re-identification and 2) incremental RS-KISS performs as well as RS-KISS but reduces the computational cost significantly.

168 citations
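
A sketch of the covariance-conditioning idea and the resulting KISS-style metric is given below; the specific smoothing rule (shrinking eigenvalues toward their mean) and the parameter alpha are illustrative assumptions rather than the exact RS-KISS estimator.

```python
import numpy as np

def condition_covariance(cov, alpha=0.1, ridge=1e-3):
    """Shrink eigenvalues toward their mean, which enlarges underestimated small
    eigenvalues and reduces overestimated large ones, then add a small ridge."""
    vals, vecs = np.linalg.eigh(cov)
    vals = (1 - alpha) * vals + alpha * vals.mean()
    return vecs @ np.diag(vals + ridge) @ vecs.T

def kiss_style_metric(diff_similar, diff_dissimilar, alpha=0.1):
    """Mahalanobis matrix M = inv(Sigma_S) - inv(Sigma_D) estimated from
    pairwise feature differences of similar and dissimilar pairs."""
    cov_s = condition_covariance(np.cov(diff_similar, rowvar=False), alpha)
    cov_d = condition_covariance(np.cov(diff_dissimilar, rowvar=False), alpha)
    return np.linalg.inv(cov_s) - np.linalg.inv(cov_d)

def pair_distance(x_i, x_j, M):
    d = x_i - x_j
    return float(d @ M @ d)
```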


Journal ArticleDOI
TL;DR: An adaptive self-interpolation algorithm is first proposed to estimate a sharp high-resolution gradient field directly from the input low-resolution image; the estimated gradient is then regarded as a gradient constraint or an edge-preserving constraint to reconstruct the high-resolution image.
Abstract: Super-resolution from a single image plays an important role in many computer vision systems. However, it is still a challenging task, especially in preserving local edge structures. To construct high-resolution images while preserving the sharp edges, an effective edge-directed super-resolution method is presented in this paper. An adaptive self-interpolation algorithm is first proposed to estimate a sharp high-resolution gradient field directly from the input low-resolution image. The obtained high-resolution gradient is then regarded as a gradient constraint or an edge-preserving constraint to reconstruct the high-resolution image. Extensive results have shown both qualitatively and quantitatively that the proposed method can produce convincing super-resolution images containing complex and sharp features, as compared with the other state-of-the-art super-resolution algorithms.

158 citations


Journal ArticleDOI
TL;DR: A new way to incorporate spatial information between neighboring pixels into the Gaussian mixture model, based on a Markov random field (MRF), is proposed and shown to be robust, accurate, and effective compared with other mixture models.
Abstract: In this paper, a new mixture model for image segmentation is presented. We propose a new way to incorporate spatial information between neighboring pixels into the Gaussian mixture model based on Markov random field (MRF). In comparison to other mixture models that are complex and computationally expensive, the proposed method is fast and easy to implement. In mixture models based on MRF, the M-step of the expectation-maximization (EM) algorithm cannot be directly applied to the prior distribution ${\pi_{ij}}$ for maximization of the log-likelihood with respect to the corresponding parameters. Compared with these models, our proposed method directly applies the EM algorithm to optimize the parameters, which makes it much simpler. Experimental results obtained by employing the proposed method on many synthetic and real-world grayscale and colored images demonstrate its robustness, accuracy, and effectiveness, compared with other mixture models.

150 citations
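
The spirit of the method, EM on a Gaussian mixture whose per-pixel priors carry spatial smoothness, can be sketched for a grayscale image as below; using a box filter on the responsibilities as the spatial coupling and running a fixed number of iterations are illustrative simplifications.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def spatial_gmm_segment(img, k=3, iters=20, win=5, eps=1e-8):
    """EM for a K-component Gaussian mixture whose per-pixel priors are
    spatially smoothed responsibilities (a simple MRF-like coupling)."""
    img = img.astype(float)
    h, w = img.shape
    mu = np.linspace(img.min(), img.max(), k)
    var = np.full(k, img.var() + eps)
    pi = np.full((h, w, k), 1.0 / k)
    for _ in range(iters):
        # E-step: posterior responsibilities of each component at each pixel
        lik = np.stack([np.exp(-0.5 * (img - mu[j]) ** 2 / var[j]) /
                        np.sqrt(2 * np.pi * var[j]) for j in range(k)], axis=-1)
        r = pi * lik
        r /= r.sum(axis=-1, keepdims=True) + eps
        # M-step: update component means and variances
        for j in range(k):
            w_j = r[..., j].sum() + eps
            mu[j] = (r[..., j] * img).sum() / w_j
            var[j] = (r[..., j] * (img - mu[j]) ** 2).sum() / w_j + eps
        # spatially smoothed responsibilities play the role of the MRF prior
        pi = np.stack([uniform_filter(r[..., j], size=win) for j in range(k)], axis=-1)
        pi /= pi.sum(axis=-1, keepdims=True) + eps
    return r.argmax(axis=-1)          # per-pixel label map
```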


Journal ArticleDOI
TL;DR: This work proposes a keypoint-based framework to address the keyframe selection problem so that local features can be employed in selecting keyframes, and introduces two criteria, coverage and redundancy, based on keypoint matching in the selection process.
Abstract: Keyframe selection has been crucial for effective and efficient video content analysis. While most of the existing approaches represent individual frames with global features, we, for the first time, propose a keypoint-based framework to address the keyframe selection problem so that local features can be employed in selecting keyframes. In general, the selected keyframes should be representative of the video content and contain minimal redundancy. Therefore, we introduce two criteria, coverage and redundancy, based on keypoint matching in the selection process. Comprehensive experiments demonstrate that our approach outperforms the state of the art.

134 citations
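
A greedy selection routine built on the two criteria might look like the following sketch, where each frame is reduced to a set of matched keypoint identifiers (in practice these would come from descriptor matching across frames); the coverage and redundancy thresholds are illustrative.

```python
def select_keyframes(frame_keypoints, coverage_target=0.9, redundancy_max=0.5):
    """Greedily add the frame that covers the most not-yet-covered keypoints,
    rejecting frames that overlap too much with already selected ones."""
    all_kp = set().union(*frame_keypoints)
    covered, selected = set(), []
    while all_kp and len(covered) / len(all_kp) < coverage_target:
        best, best_gain = None, 0
        for i, kps in enumerate(frame_keypoints):
            if i in selected or not kps:
                continue
            redundancy = len(kps & covered) / len(kps)
            gain = len(kps - covered)
            if redundancy <= redundancy_max and gain > best_gain:
                best, best_gain = i, gain
        if best is None:          # nothing admissible left
            break
        selected.append(best)
        covered |= frame_keypoints[best]
    return selected
```

For example, select_keyframes([{1, 2, 3}, {2, 3}, {4, 5}]) returns [0, 2]: the second frame is rejected as redundant.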


Journal ArticleDOI
TL;DR: A novel real-time stereo matching method is presented that uses a two-pass approximation of adaptive support-weight aggregation, and a low-complexity iterative disparity refinement technique, which is shown to be an accurate approximation of the support weights while greatly reducing the complexity of aggregation.
Abstract: High-quality real-time stereo matching has the potential to enable various computer vision applications including semi-automated robotic surgery, teleimmersion, and 3-D video surveillance. A novel real-time stereo matching method is presented that uses a two-pass approximation of adaptive support-weight aggregation, and a low-complexity iterative disparity refinement technique. Through an evaluation of computationally efficient approaches to adaptive support-weight cost aggregation, it is shown that the two-pass method produces an accurate approximation of the support weights while greatly reducing the complexity of aggregation. The refinement technique, constructed using a probabilistic framework, incorporates an additive term into matching cost minimization and facilitates iterative processing to improve the accuracy of the disparity map. This method has been implemented on massively parallel high-performance graphics hardware using the Compute Unified Device Architecture computing engine. Results show that the proposed method is the most accurate among all of the real-time stereo matching methods listed on the Middlebury stereo benchmark.

120 citations
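
The two-pass approximation amounts to making adaptive support-weight aggregation separable: the (H, W, D) cost volume is aggregated first along columns and then along rows, with weights driven by color similarity and spatial distance. A simplified sketch (gamma values, window radius, and grayscale guidance are illustrative choices):

```python
import numpy as np

def asw_weights_1d(guide_line, gamma_c=10.0, gamma_s=10.0, radius=8):
    """Adaptive support weights along one scan line of the guidance image."""
    n = len(guide_line)
    w = np.zeros((n, 2 * radius + 1))
    for i in range(n):
        for k, j in enumerate(range(i - radius, i + radius + 1)):
            if 0 <= j < n:
                dc = abs(float(guide_line[i]) - float(guide_line[j]))
                w[i, k] = np.exp(-dc / gamma_c - abs(i - j) / gamma_s)
    return w

def _aggregate_line(cost_line, w):
    """Weighted aggregation of an (n, D) cost line with per-pixel weights."""
    n, D = cost_line.shape
    radius = (w.shape[1] - 1) // 2
    res = np.zeros_like(cost_line, dtype=float)
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        ww = w[i, lo - (i - radius): hi - (i - radius)]
        res[i] = (ww[:, None] * cost_line[lo:hi]).sum(axis=0) / (ww.sum() + 1e-8)
    return res

def aggregate_two_pass(cost_volume, guide):
    """Vertical then horizontal aggregation of an (H, W, D) cost volume."""
    H, W, D = cost_volume.shape
    tmp = np.empty((H, W, D))
    out = np.empty((H, W, D))
    for x in range(W):                       # pass 1: along columns
        tmp[:, x, :] = _aggregate_line(cost_volume[:, x, :], asw_weights_1d(guide[:, x]))
    for y in range(H):                       # pass 2: along rows
        out[y, :, :] = _aggregate_line(tmp[y, :, :], asw_weights_1d(guide[y, :]))
    return out
```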


Journal ArticleDOI
TL;DR: A probabilistic computational algorithm by integrating objectness likelihood with appearance rarity is developed and can serve as a basis for many techniques such as image/video segmentation, retrieval, retargeting, and compression.
Abstract: Saliency detection aims at quantitatively predicting attended locations in an image. It may mimic the selection mechanism of the human vision system, which processes a small subset of a massive amount of visual input while the redundant information is ignored. Motivated by the biological evidence that the receptive fields of simple cells in V1 of the vision system are similar to sparse codes learned from natural images, this paper proposes a novel framework for saliency detection by using image sparse coding representations as features. Unlike many previous approaches dedicated to examining the local or global contrast of each individual location, this paper develops a probabilistic computational algorithm by integrating objectness likelihood with appearance rarity. In the proposed framework, image sparse coding representations are yielded through learning on a large amount of eye-fixation patches from an eye-tracking dataset. The objectness likelihood is measured by three generic cues called compactness, continuity, and center bias. The appearance rarity is inferred by using a Gaussian mixture model. The proposed framework can serve as a basis for many techniques such as image/video segmentation, retrieval, retargeting, and compression. Extensive evaluations on benchmark databases and comparisons with a number of up-to-date algorithms demonstrate its effectiveness.
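
A rough sketch of combining an appearance-rarity term with a simple objectness cue is shown below: rarity is the negative log-likelihood of patch features under a Gaussian mixture model, and a center-bias map stands in for the objectness cues; raw patch intensities replace the learned sparse codes, so this is only an illustrative approximation of the framework.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def saliency_map(img, patch=8, n_components=5):
    """Patch-level saliency = appearance rarity * center-bias objectness cue."""
    img = img.astype(float)
    h, w = img.shape
    hb, wb = h // patch, w // patch
    patches = (img[:hb * patch, :wb * patch]
               .reshape(hb, patch, wb, patch)
               .transpose(0, 2, 1, 3)
               .reshape(hb * wb, patch * patch))
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag").fit(patches)
    rarity = -gmm.score_samples(patches).reshape(hb, wb)      # rare patches score high
    ys, xs = np.mgrid[0:hb, 0:wb]
    center_bias = np.exp(-(((ys - hb / 2) / hb) ** 2 + ((xs - wb / 2) / wb) ** 2) / 0.1)
    rarity = (rarity - rarity.min()) / (rarity.max() - rarity.min() + 1e-8)
    return rarity * center_bias
```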

Journal ArticleDOI
TL;DR: The proposed method can successfully operate in situations that may appear in real application scenarios, since it does not set any assumption concerning the visual scene background and the camera view angle.
Abstract: In this paper, we propose a novel method aiming at view-independent human action recognition. Action description is based on local shape and motion information appearing at spatiotemporal locations of interest in a video. Action representation involves fuzzy vector quantization, while action classification is performed by a feedforward neural network. A novel classification algorithm, called minimum class variance extreme learning machine, is proposed in order to enhance the action classification performance. The proposed method can successfully operate in situations that may appear in real application scenarios, since it does not set any assumption concerning the visual scene background and the camera view angle. Experimental results on five publicly available databases, aiming at different application scenarios, denote the effectiveness of both the adopted action recognition approach and the proposed minimum class variance extreme learning machine algorithm.

Journal ArticleDOI
TL;DR: The pixel-based classification is adopted for refining the results from the block-based background subtraction, which can further classify pixels as foreground, shadows, and highlights and can provide a high precision and efficient processing speed to meet the requirements of real-time moving object detection.
Abstract: Moving object detection is an important and fundamental step for intelligent video surveillance systems because it provides a focus of attention for post-processing. A multilayer codebook-based background subtraction (MCBS) model is proposed for video sequences to detect moving objects. Combining the multilayer block-based strategy and the adaptive feature extraction from blocks of various sizes, the proposed method can remove most of the nonstationary (dynamic) background and significantly increase the processing efficiency. Moreover, the pixel-based classification is adopted for refining the results from the block-based background subtraction, which can further classify pixels as foreground, shadows, and highlights. As a result, the proposed scheme can provide a high precision and efficient processing speed to meet the requirements of real-time moving object detection.

Journal ArticleDOI
TL;DR: The proposed scheme takes advantage of local and global features and, therefore, provides a discriminative representation for human actions and outperforms the state-of-the-art methods on the IXMAS action recognition dataset.
Abstract: In this paper, we propose a novel scheme for human action recognition that combines the advantages of both local and global representations. We explore human silhouettes for human action representation by taking into account the correlation between sequential poses in an action. A modified bag-of-words model, named bag of correlated poses, is introduced to encode temporally local features of actions. To utilize the property of visual word ambiguity, we adopt the soft assignment strategy to reduce the dimensionality of our model and circumvent the penalty of computational complexity and quantization error. To compensate for the loss of structural information, we propose an extended motion template, i.e., extensions of the motion history image, to capture the holistic structural features. The proposed scheme takes advantage of local and global features and, therefore, provides a discriminative representation for human actions. Experimental results demonstrate the complementary properties of the two descriptors, and the proposed approach outperforms the state-of-the-art methods on the IXMAS action recognition dataset.

Journal ArticleDOI
TL;DR: This paper extends previous work by extracting audiovisual and film grammar descriptors and, driven by users' rates on connotative properties, creates a shared framework where movie scenes are placed, compared, and recommended according to connotation.
Abstract: The apparent difficulty in assessing emotions elicited by movies and the undeniable high variability in subjects' emotional responses to film content have been recently tackled by exploring film connotative properties: the set of shooting and editing conventions that help in transmitting meaning to the audience. Connotation provides an intermediate representation that exploits the objectivity of audiovisual descriptors to predict the subjective emotional reaction of single users. This is done without the need of registering users' physiological signals. It is not done by employing other people's highly variable emotional rates, but by relying on the intersubjectivity of connotative concepts and on the knowledge of user's reactions to similar stimuli. This paper extends previous work by extracting audiovisual and film grammar descriptors and, driven by users' rates on connotative properties, creates a shared framework where movie scenes are placed, compared, and recommended according to connotation. We evaluate the potential of the proposed system by asking users to assess the ability of connotation in suggesting film content able to target their affective requests.

Journal ArticleDOI
TL;DR: A novel cost aggregation method inspired by domain transformation, a recently proposed dimensionality reduction technique, that enables the aggregation of 2-D cost data to be performed using a sequence of 1-D filters, which lowers computation and memory costs compared to conventional 2-D filters.
Abstract: Binocular stereo matching is one of the most important algorithms in the field of computer vision. Adaptive support-weight approaches, the current state-of-the-art local methods, produce results comparable to those generated by global methods. However, excessive time consumption is the main problem of these algorithms since the computational complexity is proportionally related to the support window size. In this paper, we present a novel cost aggregation method inspired by domain transformation, a recently proposed dimensionality reduction technique. This transformation enables the aggregation of 2-D cost data to be performed using a sequence of 1-D filters, which lowers computation and memory costs compared to conventional 2-D filters. Experiments show that the proposed method outperforms the state-of-the-art local methods in terms of computational performance, since its computational complexity is independent of the input parameters. Furthermore, according to the experimental results with the Middlebury dataset and real-world images, our algorithm is currently one of the most accurate and efficient local algorithms.
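
The heart of the method is a recursive 1-D filter whose feedback coefficient collapses across image edges; a minimal horizontal pass over one disparity slice of the cost volume could look like the sketch below (sigma values are illustrative, and a full aggregation would alternate horizontal and vertical passes over every slice).

```python
import numpy as np

def domain_transform_1d(cost_slice, guide, sigma_s=20.0, sigma_r=0.1):
    """One horizontal pass of recursive domain-transform filtering on a 2-D
    cost slice (one disparity hypothesis), guided by the image 'guide'."""
    H, W = guide.shape
    a = np.exp(-np.sqrt(2.0) / sigma_s)
    # per-pixel feedback: a ** d, with d = 1 + (sigma_s / sigma_r) * |I(x) - I(x-1)|
    dIdx = np.abs(np.diff(guide.astype(float), axis=1))
    coef = a ** (1.0 + (sigma_s / sigma_r) * dIdx)
    out = cost_slice.astype(float).copy()
    for x in range(1, W):                       # left-to-right pass
        out[:, x] += coef[:, x - 1] * (out[:, x - 1] - out[:, x])
    for x in range(W - 2, -1, -1):              # right-to-left pass
        out[:, x] += coef[:, x] * (out[:, x + 1] - out[:, x])
    return out
```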

Journal ArticleDOI
TL;DR: A new control-point representation that favors differential coding is proposed for efficient compression of affine parameters by exploiting the spatial correlation between adjacent coding blocks, motion vectors at control points can be predicted and thus efficiently coded, leading to overall improved performance.
Abstract: The affine-motion model is able to capture rotation, zooming, and the deformation of moving objects, thereby providing a better motion-compensated prediction. However, it is not widely used due to difficulty in both estimation and efficient coding of its motion parameters. To alleviate this problem, a new control-point representation that favors differential coding is proposed for efficient compression of affine parameters. By exploiting the spatial correlation between adjacent coding blocks, motion vectors at control points can be predicted and thus efficiently coded, leading to overall improved performance. To evaluate the proposed method, four new affine prediction modes are designed and embedded into the high-efficiency video coding test model HM1.0. The encoder adaptively chooses whether to use the new affine mode in an operational rate-distortion optimization. Bitrate savings up to 33.82% in low-delay and 23.90% in random-access test conditions are obtained for low-complexity encoder settings. For high-efficiency settings, bitrate savings up to 14.26% and 4.89% for these two modes are observed.

Journal ArticleDOI
TL;DR: A block-based low-complexity screen compression scheme, in which multiple block modes are adopted to exploit the intra- and inter-frame redundancies, and a proposed fast block classification algorithm, which exploits the discriminative features between the pictorial and the textual blocks.
Abstract: Interactive screen sharing requires extremely low latency end-to-end transmission, which in turn requires highly efficient and low-complexity screen compression. In this paper, we present a block-based low-complexity screen compression scheme, in which multiple block modes are adopted to exploit the intra- and inter-frame redundancies. In particular, we classify the intra-coded blocks into pictorial blocks and textual blocks using a proposed fast block classification algorithm, which exploits the discriminative features between the pictorial and the textual blocks. Then, we design a low-complexity, yet efficient, algorithm to compress the textual blocks. We use base colors and escape colors to represent and quantize the textual pixels, which not only achieves high compression ratios but also preserves a high quality on textual pixels. The two-dimensionally predictive index coding and hierarchical pattern coding technologies are used to exploit local spatial correlations and global pattern correlation, respectively. To further utilize the correlation between the luminance and chrominance channels, we propose a joint-channel index coding method. We compare the coding efficiency and the computational complexity of the proposed scheme against the standard image coding schemes such as JPEG, JPEG2000, and PNG, the compound image compressor HJPC, and the popular video coding standard H.264. We also compare the visual quality of the proposed scheme against H.264 intra coding, JPEG2000, and HJPC. The evaluation results show that the proposed scheme achieves superior or comparable compression efficiency with much lower complexity than other schemes in most cases.
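
The base-color/escape-color idea for textual blocks can be sketched as: take the few most frequent colors of the block as base colors, map each pixel to its nearest base color when it is close enough, and flag the remaining pixels as escape colors to be coded exactly; the palette size and distance threshold below are illustrative.

```python
import numpy as np
from collections import Counter

def base_color_quantize(block, n_base=4, threshold=24):
    """Return per-pixel base-color indices (-1 marks an escape pixel) and the
    palette. 'block' is an (H, W, 3) uint8 array."""
    pixels = block.reshape(-1, 3)
    palette = np.array([c for c, _ in Counter(map(tuple, pixels)).most_common(n_base)],
                       dtype=np.int32)
    # distance of every pixel to every base color (max absolute difference per channel)
    dist = np.abs(pixels[:, None, :].astype(np.int32) - palette[None, :, :]).max(axis=2)
    idx = dist.argmin(axis=1)
    idx[dist.min(axis=1) > threshold] = -1        # escape colors, coded separately
    return idx.reshape(block.shape[:2]), palette
```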

Journal ArticleDOI
TL;DR: In this article, a block-based method is proposed to deal with noise, illumination variations, and dynamic backgrounds, while still obtaining smooth contours of foreground objects, where image sequences are analyzed on an overlapping block-by-block basis and a low-dimensional texture descriptor obtained from each block is passed through an adaptive classifier cascade, where each stage handles a distinct problem.
Abstract: Background subtraction is a fundamental low-level processing task in numerous computer vision applications. The vast majority of algorithms process images on a pixel-by-pixel basis, where an independent decision is made for each pixel. A general limitation of such processing is that rich contextual information is not taken into account. We propose a block-based method capable of dealing with noise, illumination variations, and dynamic backgrounds, while still obtaining smooth contours of foreground objects. Specifically, image sequences are analyzed on an overlapping block-by-block basis. A low-dimensional texture descriptor obtained from each block is passed through an adaptive classifier cascade, where each stage handles a distinct problem. A probabilistic foreground mask generation approach then exploits block overlaps to integrate interim block-level decisions into final pixel-level foreground segmentation. Unlike many pixel-based methods, ad-hoc postprocessing of foreground masks is not required. Experiments on the difficult Wallflower and I2R datasets show that the proposed approach obtains on average better results (both qualitatively and quantitatively) than several prominent methods. We furthermore propose the use of tracking performance as an unbiased approach for assessing the practical usefulness of foreground segmentation methods, and show that the proposed approach leads to considerable improvements in tracking accuracy on the CAVIAR dataset.

Journal ArticleDOI
TL;DR: This paper presents a novel multiview gait recognition method that combines the enhanced Gabor (EG) representation of the gait energy image and the regularized local tensor discriminant analysis (RLTDA) method, and adopts a nonlinear mapping to emphasize those important feature points.
Abstract: This paper presents a novel multiview gait recognition method that combines the enhanced Gabor (EG) representation of the gait energy image and the regularized local tensor discriminant analysis (RLTDA) method. EG first derives desirable gait features characterized by spatial frequency, spatial locality, and orientation selectivity to cope with the variations due to surface, shoe types, clothing, carrying conditions, and so on. Unlike traditional Gabor transformation, which does not consider the structural characteristics of the gait features, our representation method not only considers the statistical property of the input features but also adopts a nonlinear mapping to emphasize those important feature points. The dimensionality of the derived EG gait feature is further reduced by using RLTDA, which directly obtains a set of locally optimal tensor eigenvectors and can capture nonlinear manifolds of gait features that exhibit appearance changes due to variable viewing angles. An aggregation scheme is adopted to combine the complementary information from different RLTDA recognizers at the matching score level. The proposed method achieves the best average Rank-1 recognition rates for multiview gait recognition based on image sequences from the USF HumanID gait challenge database and the CASIA gait database.

Journal ArticleDOI
TL;DR: It is shown that a regularization term based on the scale invariance of fractal dimension and length can be effective in recovering details of the high-resolution image.
Abstract: In this paper, we propose a single image super-resolution and enhancement algorithm using local fractal analysis. If we treat the pixels of a natural image as a fractal set, the image gradient can then be regarded as a measure of the fractal set. According to the scale invariance (a special case of bi-Lipschitz invariance) feature of fractal dimension, we will be able to estimate the gradient of a high-resolution image from that of a low-resolution one. Moreover, the high-resolution image can be further enhanced by preserving the local fractal length of the gradient during the up-sampling process. We show that a regularization term based on the scale invariance of fractal dimension and length can be effective in recovering details of the high-resolution image. Analysis is provided on the relation and difference between the proposed approach and some other state-of-the-art interpolation methods. Experimental results show that the proposed method has superior super-resolution and enhancement results as compared to other competitors.

Journal ArticleDOI
Yong Ju Jung, Hosik Sohn, Seong-il Lee, HyunWook Park, Yong Man Ro
TL;DR: A new objective assessment method for visual discomfort of stereoscopic images that makes effective use of the human visual attention model and can achieve significantly higher prediction accuracy than the state-of-the-art methods.
Abstract: We introduce a new objective assessment method for visual discomfort of stereoscopic images that makes effective use of the human visual attention model. The proposed method takes into account visual importance regions that play an important role in determining the overall degree of visual discomfort of a stereoscopic image. After obtaining a saliency-based visual importance map for an image, perceptually significant disparity features are extracted to predict the overall degree of visual discomfort. Experimental results show that the proposed method can achieve significantly higher prediction accuracy than the state-of-the-art methods.
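
The feature-extraction step can be sketched as computing disparity statistics weighted by the saliency-based visual importance map; the exact feature set below is an illustrative stand-in, and in the paper such features feed a predictor trained on subjective discomfort scores.

```python
import numpy as np

def discomfort_features(disparity, saliency, eps=1e-8):
    """Saliency-weighted disparity statistics as predictors of visual discomfort."""
    w = saliency / (saliency.sum() + eps)
    mean_d = float((w * disparity).sum())
    var_d = float((w * (disparity - mean_d) ** 2).sum())
    # emphasis on large disparities inside the visually important regions
    p95 = float(np.percentile(disparity[saliency >= saliency.mean()], 95))
    return np.array([mean_d, np.sqrt(var_d), np.abs(disparity).max(), p95])
```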

Journal ArticleDOI
Tao Lin, Peijun Zhang, Shuhui Wang, Kailun Zhou, Xianyi Chen
TL;DR: A mixed chroma sampling-rate approach for screen content coding achieves very high visual quality with a minimal increase in computational complexity for SCC, and has better R-D performance than the two-full-chroma-coder approach, especially at low bitrates.
Abstract: Computer screens contain discontinuous-tone content and continuous-tone content. Thus, the most effective way for screen content coding (SCC) is to use two essentially different coders: a dictionary-entropy coder and a traditional hybrid coder. Although screen content is originally in a full-chroma (e.g., YUV444) format, the current method of compression is to first subsample chroma of pictures and then compress pictures using a chroma-subsampled (e.g., YUV420) coder. Using two chroma-subsampled coders cannot achieve high-quality SCC, but using two full-chroma coders is overkill and inefficient for SCC. To solve the dilemma, this paper proposes a mixed chroma sampling-rate approach for SCC. An original full-chroma input macroblock (coding unit) or its prediction residual is chroma-subsampled. One full-chroma base coder and one chroma-subsampled base coder are used simultaneously to code the original and the chroma-subsampled macroblock, respectively. The coder minimizing rate-distortion (R-D) is selected as the final coder for the macroblock. The two base coders are coherently unified and optimized to get the best overall coding performance and share coding components and resources as much as possible. The approach achieves very high visual quality with a minimal increase in computational complexity for SCC, and has better R-D performance than the two-full-chroma-coder approach, especially at low bitrates.

Journal ArticleDOI
TL;DR: DSRML preserves the local geometrical structure of the training data by employing manifold learning, e.g., locally linear embedding.
Abstract: Over the past few years, high resolutions have become desirable or essential, e.g., in online video systems, and therefore, much has been done to recover an image of higher resolution from the corresponding low-resolution ones. This procedure of recovery/rebuilding is called single-image super-resolution (SR). Performance of image SR has been significantly improved via methods of sparse coding. That is, an image patch can be represented as a sparse linear combination of basis elements. However, most of these existing methods fail to consider the local geometrical structure in the space of the training data. To take this crucial issue into account, this paper proposes a method named double sparsity regularized manifold learning (DSRML). DSRML can preserve the properties of this local geometrical structure by employing manifold learning, e.g., locally linear embedding. Based on a large number of experimental results, DSRML is demonstrated to be more robust and more effective than previous efforts in the task of single-image SR.

Journal ArticleDOI
TL;DR: To tackle the problem of saliency detection accuracy in visual media, which stems from the lack of a well-defined model for interpreting saliency formulation, this letter proposes to detect salient objects based on selective contrast.
Abstract: Automatic detection of salient objects in visual media (e.g., videos and images) has been attracting much attention. The detected salient objects can be utilized for segmentation, recognition, and retrieval. However, the accuracy of saliency detection remains a challenge. The reason behind this challenge is mainly due to the lack of a well-defined model for interpreting saliency formulation. To tackle this problem, this letter proposes to detect salient objects based on selective contrast. Selective contrast intrinsically explores the most distinguishable component information in color, texture, and location. A large number of experiments are thereafter carried out upon a benchmark dataset, and the results are compared with those of 12 other popular state-of-the-art algorithms. In addition, the advantage of the proposed algorithm is also demonstrated in a retargeting application.

Journal ArticleDOI
TL;DR: A novel adaptive mode-skipping algorithm for mode decision and signaling is presented; the mode decision process can be sped up because some modes are skipped in the first two candidate sets, and, importantly, fewer bits are required to signal the mode index.
Abstract: Up to 35 intra prediction modes are available for each luma prediction unit in the upcoming HEVC standard. This can provide more accurate predictions and thereby improve the compression efficiency of intra coding. However, the encoding complexity is thus increased dramatically due to the large number of modes involved in the intra mode decision process. In addition, more overhead bits must be assigned to signal the mode index. Intuitively, it is not necessary for all modes to be checked and signaled all the time. Therefore, a novel adaptive mode-skipping algorithm for mode decision and signaling is presented in this paper. More specifically, three optimized candidate sets with 1, 19, and 35 intra prediction modes are initiated for each prediction unit in the proposed algorithm. Based on the statistical properties of the neighboring reference samples used for intra prediction, the proposed algorithm is able to adaptively select the optimal set from the three candidates for each prediction unit before the mode decision and signaling processing. As a result, the mode decision process can be sped up because some modes are skipped in the first two sets, and, importantly, fewer bits are required to signal the mode index. Experimental results show that, compared to the test model HM7.0 of HEVC, average BD-rate savings of 0.18% are achieved for both the AI-Main and AI-HE10 cases in low-bitrate ranges, and the average encoding time can be reduced by 8%-38% and 8%-34% for the AI-Main and AI-HE10 cases, respectively.
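
The set-selection step can be sketched as thresholding a simple statistic of the neighboring reference samples to pick one of the 1-, 19-, or 35-mode candidate sets; the variance statistic and thresholds below are illustrative stand-ins for the paper's criterion.

```python
import numpy as np

def choose_candidate_set(ref_samples, t_flat=1.0, t_simple=25.0):
    """Pick the intra-mode candidate set for a PU from its reference samples.
    Returns the number of modes to test and signal: 1, 19, or 35."""
    v = np.asarray(ref_samples, dtype=float).var()
    if v < t_flat:        # nearly flat neighborhood: single-mode set
        return 1
    if v < t_simple:      # mildly textured: reduced directional set
        return 19
    return 35             # full set of HEVC intra prediction modes
```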

Journal ArticleDOI
TL;DR: This paper proposes a variety of sparse census transforms that dramatically reduce the resource requirements of census-based stereo systems while maintaining stereo correlation accuracy and proposes and analyzes a new class of Census-like transforms, called the generalized census transforms.
Abstract: Real-time stereo vision has proven to be a useful technology with many applications. However, the computationally intensive nature of stereo vision algorithms makes real-time implementation difficult in resource-limited systems. The field-programmable gate array (FPGA) has proven to be very useful in the implementation of local stereo methods, yet the resource requirements can still be a significant challenge. This paper proposes a variety of sparse census transforms that dramatically reduce the resource requirements of census-based stereo systems while maintaining stereo correlation accuracy. This paper also proposes and analyzes a new class of census-like transforms, called the generalized census transforms. This new transform allows a variety of very sparse census-like stereo correlation algorithms to be implemented while demonstrating increased robustness and flexibility. The resource savings and performance of these transforms is demonstrated by the design and implementation of a parameterizable stereo system that can implement stereo correlation using any census transform. Several optimizations for typical FPGA-based correlation systems are also proposed. The resulting system is capable of running at over 500 MHz on a modern FPGA, resulting in a throughput of over 500 million input pixel pairs per second.
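
A census transform compares neighbors of a pixel against the center and packs the outcomes into a bit string; a sparse or generalized variant simply restricts or generalizes the set of comparison points. A small sketch (the sampling pattern below is an arbitrary illustrative choice, not one of the paper's patterns):

```python
import numpy as np

def census_transform(img, offsets=((-2, -2), (-2, 2), (0, -1), (0, 1),
                                   (2, -2), (2, 2), (-1, 0), (1, 0))):
    """Bit i of the output is 1 if the pixel at the i-th offset is smaller than
    the center. A sparse pattern like this keeps few comparisons per pixel."""
    img = img.astype(np.int32)
    h, w = img.shape
    pad = max(max(abs(dy), abs(dx)) for dy, dx in offsets)
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros((h, w), dtype=np.uint32)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = padded[pad + dy: pad + dy + h, pad + dx: pad + dx + w]
        out |= (neighbor < img).astype(np.uint32) << bit
    return out
```

Stereo correlation then minimizes the Hamming distance between census codes along each candidate disparity.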

Journal ArticleDOI
TL;DR: The proposed design scheme derives a convolution-based generic architecture for the computation of three-level 2-D DWT based on Daubechies (Daub) as well as biorthogonal filters that offers significant savings of area and power over existing structures due to a substantial reduction in memory size and a smaller clock period.
Abstract: In this paper, we have proposed a design strategy for the derivation of memory-efficient architecture for multilevel 2-D DWT. Using the proposed design scheme, we have derived a convolution-based generic architecture for the computation of three-level 2-D DWT based on Daubechies (Daub) as well as biorthogonal filters. The proposed structure does not involve frame-buffer. It involves line-buffers of size 3(K-2)M/4 which is independent of throughput-rate, where K is the order of Daubechies/biorthogonal wavelet filter and M is the image height. This is a major advantage when the structure is implemented for higher throughput. The structure has regular data-flow, small cycle period TM and 100% hardware utilization efficiency. As per theoretical estimate, for image size 512 × 512, the proposed structure for Daub-4 filter requires 152 more multipliers and 114 more adders, but involves 82 412 less memory words and takes 10.5 times less time to compute three-level 2-D DWT than the best of the existing convolution-based folded structures. Similarly, compared with the best of the existing lifting-based folded structures, proposed structure for 9/7-filter involves 93 more multipliers and 166 more adders, but uses 85 317 less memory words and requires 2.625 times less computation time for the same image size. It involves 90 (nearly 47.6%) more multipliers and 118 (nearly 40.1%) more adders, but requires 2723 less memory words than the recently proposed parallel structure and performs the computation in nearly half the time of the other. Inspite of having more arithmetic components than the lifting-based structures, the proposed structure offers significant saving of area and power over the other due to substantial reduction in memory size and smaller clock-period. ASIC synthesis result shows that, the proposed structure for Daub-4 involves 1.7 times less area-delay-product (ADP) and consumes 1.21 times less energy per image (EPI) than the corresponding best available convolution-based structure. It involves 2.6 times less ADP and consumes 1.48 times less EPI than the parallel lifting-based structure.

Journal ArticleDOI
TL;DR: The proposed algorithms can achieve a good balance among multiple objectives and effectively optimize both operational cost and user experience and pave the way for building the next-generation video cloud.
Abstract: For Internet video services, the high fluctuation of user demands in geographically distributed regions results in low resource utilizations of traditional content distribution network systems. Due to the capability of rapid and elastic resource provisioning, cloud computing emerges as a new paradigm to reshape the model of video distribution over the Internet, in which resources (such as bandwidth, storage) can be rented on demand from cloud data centers to meet volatile user demands. However, it is challenging for a video service provider (VSP) to optimally deploy its distribution infrastructure over multiple geo-distributed cloud data centers. A VSP needs to minimize the operational cost induced by the rentals of cloud resources without sacrificing user experience in all regions. The geographical diversity of cloud resource prices further makes the problem complicated. In this paper, we investigate the optimal deployment problem of cloud-assisted video distribution services and explore the best tradeoff between the operational cost and the user experience. We aim to pave the way for building the next-generation video cloud. Toward this objective, we first formulate the deployment problem into a min-cost network flow problem, which takes both the operational cost and the user experience into account. Then, we apply the Nash bargaining solution to solve the joint optimization problem efficiently and derive the optimal bandwidth provisioning strategy and optimal video placement strategy. In addition, we extend the algorithms to the online case and consider the scenario when peers participate into video distribution. Finally, we conduct extensive simulations to evaluate our algorithms in the realistic settings. Our results show that our proposed algorithms can achieve a good balance among multiple objectives and effectively optimize both operational cost and user experience.
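
The bandwidth-provisioning part can be illustrated as a min-cost network flow in which data centers supply bandwidth at region-dependent prices and regions impose demands; the toy instance below (made-up node names, capacities, and integer costs that could fold in a user-experience penalty) uses networkx only to show the shape of the formulation, not the paper's Nash-bargaining solution.

```python
import networkx as nx

G = nx.DiGraph()
# supply at data centers (negative demand), demand at user regions;
# supplies are balanced with demands in this toy instance
G.add_node("dc_us", demand=-60)
G.add_node("dc_eu", demand=-40)
G.add_node("asia", demand=50)
G.add_node("europe", demand=50)

# weight = rental price plus a latency/user-experience penalty (integers)
G.add_edge("dc_us", "asia", capacity=60, weight=4)
G.add_edge("dc_us", "europe", capacity=60, weight=6)
G.add_edge("dc_eu", "asia", capacity=40, weight=7)
G.add_edge("dc_eu", "europe", capacity=40, weight=2)

flow = nx.min_cost_flow(G)   # e.g., {'dc_us': {'asia': 50, 'europe': 10}, ...}
```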

Journal ArticleDOI
Seung-Won Jung
TL;DR: An adaptive joint trilateral filter (AJTF), consisting of domain, range, and depth filters, is presented for the joint enhancement of images and depth maps; experiments show that the proposed algorithm is effective.
Abstract: In this paper, we present an adaptive joint trilateral filter (AJTF), which consists of domain, range, and depth filters. The AJTF is used for the joint enhancement of images and depth maps, which is achieved by suppressing the noise and sharpening the edges simultaneously. For improving the sharpness of the image and depth map, the AJTF parameters, the offsets, and the standard deviations of the range and depth filters are determined in such a way that image edges that match well with depth edges are emphasized. To this end, pattern matching between local patches in the image and depth map is performed and the matching result is utilized to adjust the AJTF parameters. Experimental results show that the AJTF produces sharpness-enhanced images and depth maps without overshoot and undershoot artifacts, while successfully reducing noise as well. A comparison of the performance of the AJTF with those of conventional image and depth enhancement algorithms shows that the proposed algorithm is effective.
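
A plain joint trilateral filter already shows the structure of the AJTF: each output pixel is a weighted average whose weights multiply a spatial Gaussian, an intensity-range Gaussian, and a depth-range Gaussian. The sketch below omits the adaptive offsets and sigma selection that give the AJTF its sharpening behavior; the sigma values are illustrative.

```python
import numpy as np

def joint_trilateral_filter(img, depth, radius=3, sigma_s=2.0, sigma_r=10.0, sigma_d=5.0):
    """Filter 'img' with weights from spatial distance, intensity difference,
    and depth difference (both guidance terms taken at the center pixel)."""
    img = img.astype(float)
    depth = depth.astype(float)
    h, w = img.shape
    out = np.zeros_like(img)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    w_spatial = np.exp(-(ys ** 2 + xs ** 2) / (2 * sigma_s ** 2))
    pad_i = np.pad(img, radius, mode="edge")
    pad_d = np.pad(depth, radius, mode="edge")
    for y in range(h):
        for x in range(w):
            patch_i = pad_i[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            patch_d = pad_d[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            w_range = np.exp(-(patch_i - img[y, x]) ** 2 / (2 * sigma_r ** 2))
            w_depth = np.exp(-(patch_d - depth[y, x]) ** 2 / (2 * sigma_d ** 2))
            weight = w_spatial * w_range * w_depth
            out[y, x] = (weight * patch_i).sum() / weight.sum()
    return out
```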