
Showing papers in "Signal Processing: Image Communication" in 2006


Journal ArticleDOI
TL;DR: A real-time FTV system is constructed, covering the complete chain from capture to display; a new algorithm is developed to generate free-viewpoint images, as a step towards ray-based image engineering through the development of FTV.
Abstract: We have been developing ray-based 3D information systems that consist of ray acquisition, ray processing, and ray display. Free viewpoint television (FTV) based on the ray-space method is a typical example. FTV will bring an epoch-making change in the history of television because it enables us to view a distant 3D world freely by changing our viewpoints as if we were there. We constructed a real-time FTV including the complete chain from capturing to display. A new algorithm was developed to generate free viewpoint images. In addition, a new user interface is presented for FTV to make full use of 3D information. FTV is not a pixel-based system but a ray-based system. We are creating ray-based image engineering through the development of FTV.

261 citations


Journal ArticleDOI
TL;DR: Three main types of relevance feedback algorithms are investigated: the Euclidean, the query point movement and the correlation-based approaches; a new objective criterion, called the average normalized similarity metric distance, is introduced, which exploits the difference between the actual and the ideal similarity measures over all best retrievals.
Abstract: Multimedia content modeling, i.e., the identification of semantically meaningful entities, is an arduous task, mainly because (a) humans perceive content using high-level concepts and (b) human perception is subjective, often interpreting the same content in different ways at different times. For this reason, an efficient content management system has to be adapted to the current user's information needs and preferences through an on-line learning strategy based on user interaction. One adaptive learning strategy is relevance feedback, originally developed in traditional text-based information retrieval systems. In this approach, the user interacts with the system to provide information about the relevance of the content, which is then fed back to the system to update its performance. In this paper, we evaluate and investigate three main types of relevance feedback algorithms: the Euclidean, the query point movement and the correlation-based approaches. In the first case, we examine heuristic and optimal techniques which are based either on the weighted or on the generalized Euclidean distance. In the second case, we survey single- and multipoint query movement schemes. As far as the third type is concerned, two different ways of parametrizing the normalized cross-correlation similarity metric are proposed. The first scales only the elements of the query feature vector and is called the query-scaling strategy, while the second scales both the query and the selected samples (the query-sample scaling strategy). All the examined algorithms are evaluated using both subjective and objective criteria. Subjective evaluation is performed by depicting the best retrieved images as the system's response to a user's query. Objective evaluation uses standard criteria, such as the precision–recall curve and the average normalized modified retrieval rank (ANMRR). Furthermore, a new objective criterion, called the average normalized similarity metric distance, is introduced, which exploits the difference between the actual and the ideal similarity measures over all best retrievals. Discussions and comparisons of all the aforementioned relevance feedback algorithms are presented.
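As a concrete illustration of the query point movement family surveyed above, the sketch below implements a single-point, Rocchio-style update that shifts the query feature vector toward user-marked relevant samples and away from irrelevant ones; the function name and the default weights are illustrative choices, not values from the paper.

```python
import numpy as np

def rocchio_update(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Single-point query movement: move the query toward the centroid of the
    relevant feedback samples and away from the centroid of the irrelevant ones."""
    query = np.asarray(query, dtype=float)
    new_query = alpha * query
    if len(relevant):
        new_query += beta * np.mean(np.atleast_2d(relevant), axis=0)
    if len(irrelevant):
        new_query -= gamma * np.mean(np.atleast_2d(irrelevant), axis=0)
    return new_query
```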

74 citations


Journal ArticleDOI
TL;DR: The proposed visual quality metric is based on an effective Human Visual System model and relies on the computation of three distortion factors: blockiness, edge errors and visual impairments, which take into account the typical artifacts introduced by several classes of coders.
Abstract: In this paper, a multi-factor full-reference image quality index is presented. The proposed visual quality metric is based on an effective Human Visual System model. Images are pre-processed in order to take into account luminance masking and contrast sensitivity effects. The proposed metric relies on the computation of three distortion factors: blockiness, edge errors and visual impairments, which take into account the typical artifacts introduced by several classes of coders. A pooling algorithm is used in order to obtain a single distortion index. Results show the effectiveness of the proposed approach and its consistency with subjective evaluations.
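A minimal sketch of the pooling step mentioned above: three per-image distortion factors are combined into a single index with a Minkowski-type sum. The weights and exponent are placeholders; the paper's actual factor computations and pooling parameters are not reproduced here.

```python
import numpy as np

def pooled_distortion(blockiness, edge_error, visual_impairment,
                      weights=(1.0, 1.0, 1.0), p=2.0):
    """Combine three distortion factors into one index via Minkowski pooling."""
    factors = np.array([blockiness, edge_error, visual_impairment], dtype=float)
    w = np.array(weights, dtype=float)
    return float(np.sum(w * factors ** p) ** (1.0 / p))
```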

64 citations


Journal ArticleDOI
TL;DR: A theoretical framework to analyze the rate-distortion performance of a light field coding and streaming system is proposed, revealing that the efficiency gains from more accurate geometry increase as the correlation between images increases.
Abstract: A theoretical framework to analyze the rate-distortion performance of a light field coding and streaming system is proposed. This framework takes into account the statistical properties of the light field images, the accuracy of the geometry information used in disparity compensation, and the prediction dependency structure or transform used to exploit correlation among views. Using this framework, the effect that various parameters have on compression efficiency is studied. The framework reveals that the efficiency gains from more accurate geometry increase as the correlation between images increases. The coding gains due to prediction suggested by the framework match those observed from experimental results. This framework is also used to study the performance of light field streaming by deriving a view-trajectory-dependent rate-distortion function. Simulation results show that the streaming results depend on both the prediction structure and the viewing trajectory. For instance, independent coding of images gives the best streaming performance for certain view trajectories. These and other trends described by the simulation results agree qualitatively with actual experimental streaming results.

41 citations


Journal ArticleDOI
TL;DR: A new image-based rendering method is presented that uses input from an array of cameras and synthesizes high-quality free-viewpoint images in real time; the focus measurement scheme is also discussed in both the spatial and frequency domains.
Abstract: This paper introduces a new image-based rendering method that uses input from an array of cameras and synthesizes high-quality free-viewpoint images in real-time. The input cameras can be roughly arranged, if they are calibrated in advance. Our method uses a set of depth layers to deal with scenes with large depth ranges, but does not require prior knowledge of the scene geometry. Instead, during the on-the-fly process, the optimal depth layer is automatically assigned to each pixel on the synthesized image by using our focus measurement scheme. We implemented the rendering method and achieved nearly interactive frame rates on a commodity PC. This paper also discusses the focus measurement scheme in both spatial and frequency domains. The discussion in the spatial domain is practical since it can be applied for arbitrary camera arrays. On the other hand, the frequency domain analysis is theoretically interesting since it proves that a signal-processing theory is applicable to the depth assignment problem.

41 citations


Journal ArticleDOI
TL;DR: An RD-optimized dynamic 3D mesh coder that includes different prediction modes as well as an RD cost computation that controls the mode selection across all possible spatial partitions of a mesh to find the clustering structure together with the associated prediction modes is presented.
Abstract: Compression of computer graphics data such as static and dynamic 3D meshes has received significant attention in recent years, since new applications require transmission over channels and storage on media with limited capacity. This includes pure graphics applications (virtual reality, games) as well as 3DTV and free viewpoint video. Efficient compression algorithms have been developed first for static 3D meshes, and later for dynamic 3D meshes and animations. Standard formats are available for instance in MPEG-4 3D mesh compression for static meshes, and Interpolator Compression for the animation part. For some important types of 3D objects, e.g. human head or body models, facial and body animation parameters have been introduced. Recent results for compression of general dynamic meshes have shown that the statistical dependencies within a mesh sequence can be exploited well by predictive coding approaches. Coders introduced so far use experimentally determined or heuristic thresholds for tuning the algorithms. In video coding, rate-distortion (RD) optimization is often used to avoid fixed thresholds and to select the optimum prediction mode. We applied these ideas and present here an RD-optimized dynamic 3D mesh coder. It includes different prediction modes as well as an RD cost computation that controls the mode selection across all possible spatial partitions of a mesh to find the clustering structure together with the associated prediction modes. The general coding structure is derived from statistical analysis of mesh sequences and exploits temporal as well as spatial mesh dependencies. To evaluate the coding efficiency of the developed coder, comparative coding experiments on mesh sequences at different resolutions were carried out.
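The RD cost computation mentioned above follows the standard Lagrangian form used in video coding; the tiny sketch below shows mode selection by minimizing J = D + λR over a set of candidate prediction modes. The candidate list and λ value are purely illustrative.

```python
def select_mode(candidates, lam):
    """Pick the prediction mode minimizing the Lagrangian cost J = D + lam * R.
    `candidates` is a list of (mode_name, distortion, rate_in_bits) tuples."""
    best_mode, best_cost = None, float("inf")
    for mode, distortion, rate in candidates:
        cost = distortion + lam * rate
        if cost < best_cost:
            best_mode, best_cost = mode, cost
    return best_mode, best_cost

# Example: three hypothetical prediction modes for one mesh cluster.
modes = [("static", 4.2, 120), ("delta", 1.8, 260), ("affine", 0.9, 540)]
print(select_mode(modes, lam=0.01))
```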

38 citations


Journal ArticleDOI
TL;DR: A continuous optimization approach to solve the MSSI problem and a very effective dynamic programming algorithm to measure the similarity between the attributed nodes are adapted for shape-based matching of multi-object images.
Abstract: We aim at developing a geometry-based retrieval system for multi-object images. We model both shape and topology of image objects including holes using a structured representation called curvature tree (CT); the hierarchy of the CT reflects the inclusion relationships between the objects and holes. To facilitate shape-based matching, triangle-area representation (TAR) of each object and hole is stored at the corresponding node in the CT. The similarity between two multi-object images is measured based on the maximum similarity subtree isomorphism (MSSI) between their CTs. For this purpose, we adapt a continuous optimization approach to solve the MSSI problem and a very effective dynamic programming algorithm to measure the similarity between the attributed nodes. Our matching scheme agrees with many recent findings in psychology about the human perception of multi-object images. Experiments on a database of 1500 logos and the MPEG-7 CE-1 database of 1400 shape images have shown the significance of the proposed method.
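The triangle-area representation (TAR) stored at each curvature-tree node can be sketched as follows: for every contour point, the signed area of the triangle formed with its two neighbours at a given separation is computed, which distinguishes convex, concave and straight boundary segments. This is the generic TAR formulation; any normalization or multi-scale handling used in the paper is omitted.

```python
import numpy as np

def triangle_area_representation(contour, ts):
    """Signed triangle areas for each point of a closed contour, using the
    neighbours at separation `ts`. `contour` is an (N, 2) array of points."""
    n = len(contour)
    tar = np.empty(n)
    for i in range(n):
        p1 = contour[(i - ts) % n]
        p2 = contour[i]
        p3 = contour[(i + ts) % n]
        # Signed area: positive/negative/zero for convex/concave/straight segments.
        tar[i] = 0.5 * (p1[0] * (p2[1] - p3[1])
                        + p2[0] * (p3[1] - p1[1])
                        + p3[0] * (p1[1] - p2[1]))
    return tar
```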

37 citations


Journal ArticleDOI
TL;DR: This paper first analyse the sign language viewer's eye-gaze, based on the results of an eye-tracking study that was conducted, as well as the video content involved in sign language person-to-person communication, and proposes a sign language video coding system using foveated processing.
Abstract: The ability to communicate remotely through the use of video, as promised by wireless networks and already practised over fixed networks, is as important for deaf people as voice telephony is for hearing people. Sign languages are visual–spatial languages and as such demand good image quality for interaction and understanding. In this paper, we first analyse the sign language viewer's eye-gaze, based on the results of an eye-tracking study that we conducted, as well as the video content involved in sign language person-to-person communication. Based on this analysis we propose a sign language video coding system using foveated processing, which can lead to bit rate savings without compromising the comprehension of the coded sequence or, equivalently, produce a coded sequence with higher comprehension value at the same bit rate. We support this claim with the results of an initial comprehension assessment trial of such coded sequences by deaf users. The proposed system constitutes a new paradigm for coding sign language image sequences at limited bit rates.

34 citations


Journal ArticleDOI
TL;DR: Experimental results show the ability of the system to detect tampering and to limit the peak error between the original and the processed images.
Abstract: A system is presented to jointly achieve image watermarking and compression. The watermark is a fragile one, being intended for authentication purposes. The watermarked and compressed images are fully compliant with the JPEG-LS standard, the only price to pay being a slight reduction of compression efficiency and an additional distortion that can anyway be tuned to grant a maximum preset error. Watermark detection is possible both in the compressed and in the pixel domain, thus increasing the flexibility and usability of the system. The system is expressly designed to be used in remote sensing and telemedicine applications, hence we designed it in such a way that the maximum compression and watermarking error can be strictly controlled (near-lossless compression and watermarking). Experimental results show the ability of the system to detect tampering and to limit the peak error between the original and the processed images.

34 citations


Journal ArticleDOI
TL;DR: This paper presents a new multiple image view synthesis algorithm for novel view creation that requires only implicit scene geometry information and identifies and selects only the best quality surface areas from available reference images, thereby reducing perceptual errors in virtual view reconstruction.
Abstract: Interactive audio-visual applications such as free viewpoint video (FVV) endeavour to provide unrestricted spatio-temporal navigation within a multiple camera environment. Current novel view creation approaches for scene navigation within FVV applications are either purely image-based, implying large information redundancy and dense sampling of the scene; or involve reconstructing complex 3-D models of the scene. In this paper we present a new multiple image view synthesis algorithm for novel view creation that requires only implicit scene geometry information. The multi-view synthesis approach can be used in any multiple camera environment and is scalable, as virtual views can be created given 1 to N of the available video inputs, providing a means to gracefully handle scenarios where camera inputs decrease or increase over time. The algorithm identifies and selects only the best quality surface areas from available reference images, thereby reducing perceptual errors in virtual view reconstruction. Experimental results are provided and verified using both objective (PSNR) and subjective comparisons and also the improvements over the traditional multiple image view synthesis approach of view-oriented weighting are presented.

32 citations


Journal ArticleDOI
TL;DR: Experimental results show that good shadow and object contours and light source locations are obtained with the proposed method even if the theoretical assumptions are not fully valid.
Abstract: This paper proposes a new method which allows a joint estimation of the light source projection on the image plane and the segmentation of moving cast shadows in natural video sequences. It improves the segmentation of moving objects by clearly separating cast shadows from the moving objects themselves. The method is based on a shadow model which mainly assumes that the cast shadows are projected onto planar, Lambertian surfaces, and that the light source is unique. The moving cast shadows, including the penumbra, are detected using a segmentation method based on a comparison between a reference image and the original one. The light source position is estimated using geometrical relations linking the light source, the object and its cast shadow on the 2-D image plane. This is obtained using a robust temporal filtering method. For each image, using the current estimation of the light source position and the video object contours, a cast shadow search area is defined. This reduces the risk of false detections during the segmentation process, and thus increases the detection rate while reducing the false-alarm rate. Experimental results show that good shadow and object contours and light source locations are obtained with the proposed method even if the theoretical assumptions are not fully valid.

Journal ArticleDOI
TL;DR: ICA provides an excellent tool for learning a coder for a specific image class, which can even be done using a single image from that class, and generalizes very well for a wide range of image classes.
Abstract: This paper addresses the use of independent component analysis (ICA) for image compression. Our goal is to study the adequacy (for lossy transform compression) of bases learned from data using ICA. Since these bases are, in general, non-orthogonal, two methods are considered to obtain image representations: matching pursuit type algorithms and orthogonalization of the ICA bases followed by standard orthogonal projection. Several coder architectures are evaluated and compared, using both the usual SNR and a perceptual quality measure called the picture quality scale. We consider four classes of images (natural, faces, fingerprints, and synthetic) to study the generalization and adaptation abilities of the data-dependent ICA bases. In this study, we have observed that bases learned from natural images generalize well to other classes of images, while bases learned from the other specific classes show good specialization. For example, for fingerprint images, our coders perform close to the special-purpose WSQ coder developed by the FBI. For some classes, the visual quality of the images obtained with our coders is similar to that obtained with JPEG2000, which is currently the state-of-the-art coder and much more sophisticated than a simple transform coder. We conclude that ICA provides an excellent tool for learning a coder for a specific image class, which can even be done using a single image from that class. This is an alternative to hand-tailoring a coder for a given class (as was done, for example, in the WSQ for fingerprint images). Another conclusion is that a coder learned from natural images acts like a universal coder, that is, it generalizes very well over a wide range of image classes.

Journal ArticleDOI
TL;DR: A fast block-matching algorithm based on search center prediction and early search termination, called the center-prediction and early-termination based motion search algorithm (CPETS), which achieves high performance, is well suited to efficient VLSI implementation and outperforms some popular fast algorithms.
Abstract: In this paper, we propose a fast block-matching algorithm based on search center prediction and early search termination, called the center-prediction and early-termination based motion search algorithm (CPETS). CPETS achieves high performance while remaining well suited to efficient VLSI implementation. It makes use of the spatial and temporal correlation in motion vector (MV) fields and of the features of all-zero blocks to accelerate the search process. This paper describes CPETS at three levels. At the coarsest level, which is used when center prediction fails, the search area is defined to enclose the entire original search range. At the middle level, the search area is a 7×7-pel square around the predicted center. At the finest level, a 5×5-pel search area around the predicted center is adopted. At each level, a uniformly allocated 9-point search pattern is used. The experimental results show that CPETS achieves a 95.67% reduction in encoding time on average compared with the full-search scheme, with negligible peak signal-to-noise ratio (PSNR) loss and bitrate increase. CPETS also clearly outperforms popular fast algorithms such as three-step search, new three-step search and four-step search. This paper also describes an efficient four-way pipelined VLSI architecture based on CPETS for H.264/AVC coding. The proposed architecture divides the current block and the search area into four sub-regions with 4:1 sub-sampling and processes them in parallel. Each sub-region is processed by a pipelined structure so that the nine candidate points are searched simultaneously. By adopting the early-termination strategy, the architecture can compute one MV for a 16×16 block in 81 clock cycles in the best case and 901 clock cycles in the worst case. The architecture has been designed and simulated in VHDL. Simulation results show that the proposed architecture achieves high performance for real-time motion estimation: only 47.3 K gates and 1624×8 bits of on-chip RAM are needed for a search range of (−15, +15) with three reference frames and four candidate block modes, using 36 processing elements.
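The three-level search strategy described above can be sketched as follows: the search starts at a center predicted from neighbouring motion vectors, terminates early when the matching cost is already small (as for all-zero blocks), and otherwise refines within a small window around the prediction. The SAD cost, the early-termination threshold and the window radius below are assumptions for illustration, not the CPETS parameters.

```python
import numpy as np

def sad(cur, ref, bx, by, mvx, mvy, bs=16):
    """Sum of absolute differences between the current block and the displaced
    block in the reference frame (inf if the candidate falls out of bounds)."""
    h, w = ref.shape
    x, y = bx + mvx, by + mvy
    if x < 0 or y < 0 or x + bs > w or y + bs > h:
        return float("inf")
    return float(np.abs(cur[by:by+bs, bx:bx+bs].astype(int)
                        - ref[y:y+bs, x:x+bs].astype(int)).sum())

def center_predicted_search(cur, ref, bx, by, neighbor_mvs,
                            bs=16, early_stop=512, radius=2):
    """Start at the median of the neighbouring MVs, stop early if the cost is
    already small, otherwise refine in a (2*radius+1)^2 window around it."""
    if neighbor_mvs:
        pred = tuple(int(v) for v in np.median(np.asarray(neighbor_mvs), axis=0))
    else:
        pred = (0, 0)
    best_mv = pred
    best_cost = sad(cur, ref, bx, by, pred[0], pred[1], bs)
    if best_cost <= early_stop:            # early termination
        return best_mv, best_cost
    for dy in range(-radius, radius + 1):  # fine refinement around the center
        for dx in range(-radius, radius + 1):
            mv = (pred[0] + dx, pred[1] + dy)
            cost = sad(cur, ref, bx, by, mv[0], mv[1], bs)
            if cost < best_cost:
                best_mv, best_cost = mv, cost
    return best_mv, best_cost
```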

Journal ArticleDOI
TL;DR: Experimental results show that the proposed method can produce good scene models from a small set of widely separated images and synthesize novel views in good quality.
Abstract: In this paper, we present a method for modeling a complex scene from a small set of input images taken from widely separated viewpoints and then synthesizing novel views. First, we find sparse correspondences across multiple input images and calibrate these input images taken with unknown cameras. Then one of the input images is chosen as the reference image for modeling by match propagation. A sparse set of reliably matched pixels in the reference image is initially selected and then propagated to neighboring pixels based on both the clustering-based light invariant photoconsistency constraint and the data-driven depth smoothness constraint, which are integrated into a pixel matching quality function to efficiently deal with occlusions, light changes and depth discontinuity. Finally, a novel view rendering algorithm is developed to fast synthesize a novel view by match propagation again. Experimental results show that the proposed method can produce good scene models from a small set of widely separated images and synthesize novel views in good quality.

Journal ArticleDOI
TL;DR: A new marker-based segmentation algorithm relying on disjoint set union is proposed in this paper, which consists of three steps, namely: pixel sorting, set union, and pixel resolving.
Abstract: Marker-based image segmentation has been widely used in image analysis and understanding. The well-known Meyer's marker-based watershed algorithm by immersion is realized using hierarchical circular queues. A new marker-based segmentation algorithm relying on disjoint set union is proposed in this paper. It consists of three steps, namely: pixel sorting, set union, and pixel resolving. The memory requirement for the proposed algorithm is fixed at 2×N integers (N is the image size), whereas the memory requirement for Meyer's algorithm is image dependent. The advantage of the proposed algorithm lies in its regularity and simplicity in software/firmware/hardware implementation.
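A hedged sketch of the three-step (sort, union, resolve) organisation is given below, using a standard union-find structure: pixels are visited in increasing gradient order, merged with already-processed neighbours unless two different marker labels would collide, and finally resolved to the label of their representative. This illustrates the disjoint-set idea only; the paper's exact memory layout and tie-breaking rules are not reproduced.

```python
import numpy as np

def marker_segmentation(gradient, markers):
    """gradient: 2-D array; markers: same shape, 0 = unlabelled, >0 = marker label."""
    h, w = gradient.shape
    n = h * w
    labels = markers.astype(int).flatten()
    parent = np.arange(n)

    def find(i):                                 # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    order = np.argsort(gradient, axis=None, kind="stable")   # step 1: pixel sorting
    processed = np.zeros(n, dtype=bool)
    for p in order:                                          # step 2: set union
        y, x = divmod(p, w)
        for q in (p - 1 if x > 0 else -1, p + 1 if x < w - 1 else -1,
                  p - w if y > 0 else -1, p + w if y < h - 1 else -1):
            if q < 0 or not processed[q]:
                continue
            rp, rq = find(p), find(q)
            if rp == rq:
                continue
            lp, lq = labels[rp], labels[rq]
            if lp == 0 or lq == 0 or lp == lq:   # merge unless two markers collide
                parent[rp] = rq
                if labels[rq] == 0:
                    labels[rq] = lp
        processed[p] = True
    out = np.array([labels[find(p)] for p in range(n)])      # step 3: pixel resolving
    return out.reshape(h, w)
```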

Journal ArticleDOI
TL;DR: Simulation results show that, over a wide range of channel signal-to-noise ratio (SNR), the combined technique is superior to non-scalable transmission and outperforms UEP with turbo coding alone.
Abstract: This paper investigates the unequal error protected (UEP) transmission of scalable H.264 bitstreams with two priority layers, where differentiated turbo coding provides better protection for the high-priority (HP) base layer than for the low-priority (LP) enhancement layer. The drawback of such a method is the high overhead introduced by the channel coding, which results in a low source data rate for the HP layer, and hence lowers video quality. To overcome this problem, we introduce an efficient combination of turbo coding and hierarchical quadrature amplitude modulation (HQAM) to provide high protection for the HP layer while maintaining the requisite channel-coding redundancy. Simulation results show that, over a wide range of channel signal-to-noise ratio (SNR), our combined technique is superior to non-scalable transmission and outperforms UEP with turbo coding alone.
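The hierarchical modulation idea can be illustrated with a hierarchical 16-QAM mapper: the two high-priority bits select the quadrant and the two low-priority bits select the point inside it, so enlarging the quadrant spacing relative to the intra-quadrant spacing gives the HP bits a lower error rate. The bit ordering and distances below are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def hqam16_map(hp_bits, lp_bits, d1=2.0, d2=0.5):
    """Map pairs of HP and LP bits to hierarchical 16-QAM symbols.
    hp_bits, lp_bits: flat arrays of 0/1 values, two bits per symbol each."""
    hp = np.asarray(hp_bits).reshape(-1, 2)
    lp = np.asarray(lp_bits).reshape(-1, 2)
    i = (1 - 2 * hp[:, 0]) * d1 + (1 - 2 * lp[:, 0]) * d2   # in-phase component
    q = (1 - 2 * hp[:, 1]) * d1 + (1 - 2 * lp[:, 1]) * d2   # quadrature component
    return i + 1j * q

# With d1 = 2.0 and d2 = 0.5 the HP bits see a larger minimum distance than the
# LP bits, i.e. the base layer is better protected on a noisy channel.
```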

Journal ArticleDOI
TL;DR: Experimental results show that the compression performance of the proposed coding algorithm is competitive to other wavelet-based image coding algorithms reported in the literature.
Abstract: In this paper, a new wavelet transform image coding algorithm is presented. The discrete wavelet transform (DWT) is applied to the original image. The DWT coefficients are firstly quantized with a uniform scalar dead zone quantizer. Then the quantized coefficients are decomposed into four symbol streams: a binary significance map symbol stream, a binary sign stream, a position of the most significant bit (PMSB) symbol stream and a residual bit stream. An adaptive arithmetic coder with different context models is employed for the entropy coding of these symbol streams. Experimental results show that the compression performance of the proposed coding algorithm is competitive to other wavelet-based image coding algorithms reported in the literature.
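The quantization and symbol decomposition steps can be sketched as below: a uniform scalar dead-zone quantizer produces indices whose magnitudes are then split into a significance map, sign bits, the position of the most significant bit (PMSB) and residual bits. The exact bit packing and context modelling of the paper are not reproduced; names here are illustrative.

```python
import numpy as np

def deadzone_quantize(coeffs, step):
    """Uniform scalar dead-zone quantizer: q = sign(c) * floor(|c| / step),
    which makes the zero bin twice as wide as the other bins."""
    return (np.sign(coeffs) * np.floor(np.abs(coeffs) / step)).astype(int)

def decompose_symbols(q):
    """Split quantized indices into the four streams named above: significance
    map, signs, PMSB positions and residual bits (packing details assumed)."""
    mag = np.abs(q).astype(int)
    significance = (mag > 0).astype(np.uint8)       # binary significance map
    signs = (q[mag > 0] < 0).astype(np.uint8)       # one sign bit per significant coeff
    sig_mag = mag[mag > 0]
    pmsb = np.floor(np.log2(sig_mag)).astype(int)   # position of the most significant bit
    residual = sig_mag - 2 ** pmsb                  # magnitude bits below the MSB
    return significance, signs, pmsb, residual
```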

Journal ArticleDOI
TL;DR: Experiments carried out on several multispectral images show that the resulting unsupervised coder has a fully acceptable complexity, and a rate-distortion performance which is superior to that of the original supervised coder and comparable to that of the best coders known in the literature.
Abstract: Compression of remote-sensing images can be necessary in various stages of the image life, and especially on-board a satellite before transmission to the ground station. Although on-board CPU power is quite limited, it is now possible to implement sophisticated real-time compression techniques, provided that complexity constraints are taken into account at design time. In this paper we consider the class-based multispectral image coder originally proposed in [Gelli and Poggi, Compression of multispectral images by spectral classification and transform coding, IEEE Trans. Image Process. (April 1999) 476-489 [5]] and modify it to allow its use in real time with limited hardware resources. Experiments carried out on several multispectral images show that the resulting unsupervised coder has a fully acceptable complexity, and a rate-distortion performance which is superior to that of the original supervised coder, and comparable to that of the best coders known in the literature.

Journal ArticleDOI
TL;DR: Integral images for the computation of histogram, mean and variance are proposed, with which the similarity measure can be evaluated at negligible computational cost; exhaustive search for object localization is then performed efficiently, guaranteeing that the global maximum is achieved.
Abstract: The paper presents a clustering-based color model and develops a fast algorithm for object tracking. The color model is built upon K-means clustering, by which the color space of the object can be partitioned adaptively and the histogram bins can be determined accordingly. In addition, in each bin the multi-channel gray level is modelled as a Gaussian distribution. Defined in this way, the color model can describe the color distribution accurately with very few bins. To evaluate the similarity between the reference model and the candidate model, a similarity measure based on the Bhattacharyya distance is introduced and its simplified form is derived under the assumption that, in each bin, the gray-level distributions in different channels are independent of each other. Motivated by the paper of Viola and Jones, integral images for the computation of histogram, mean and variance are proposed, with which the similarity measure can be evaluated at negligible computational cost. Thus, an exhaustive search for object localization can be carried out efficiently, which guarantees that the global maximum is achieved. Comparisons with the well-known mean shift algorithm demonstrate that the proposed algorithm has better performance at the same (or lower) computational cost.
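The integral-image trick for constant-time window histograms can be sketched as follows: one cumulative-sum image per colour bin allows the histogram of any candidate window to be read off with four array look-ups per bin. The per-pixel bin labels are assumed to come from the K-means clustering step; function names are illustrative.

```python
import numpy as np

def bin_integral_images(label_map, n_bins):
    """Per-bin integral images: ii[b, y, x] = number of pixels with bin label b
    inside the rectangle [0, y) x [0, x)."""
    h, w = label_map.shape
    ii = np.zeros((n_bins, h + 1, w + 1), dtype=np.int32)
    for b in range(n_bins):
        ii[b, 1:, 1:] = np.cumsum(np.cumsum(label_map == b, axis=0), axis=1)
    return ii

def window_histogram(ii, y, x, hh, ww):
    """Histogram of the window with top-left (y, x) and size hh x ww,
    computed with four look-ups per bin."""
    return (ii[:, y + hh, x + ww] - ii[:, y, x + ww]
            - ii[:, y + hh, x] + ii[:, y, x])
```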

Journal ArticleDOI
TL;DR: An interactive browsing environment for 3D scenes, which allows for the dynamic optimization of selected client views by distributing available transmission resources between geometry and texture components, is considered.
Abstract: We consider an interactive browsing environment for 3D scenes, which allows for the dynamic optimization of selected client views by distributing available transmission resources between geometry and texture components. Texture information is available at a server in the form of scalably compressed images, corresponding to a multitude of original image views. Surface geometry is also available at the server in the form of scalably compressed depth maps, again corresponding to a multitude of original views. Texture and depth components are both open to augmentation as more content becomes available. At any point in the interactive browsing experience, the server must decide how to allocate transmission resources between the delivery of new elements from the various original view bit-streams and new elements from the original geometry bit-streams. The proposed framework implicitly supports dynamic view sub-sampling, based on rate-distortion criteria, since the best server policy is not always to send the nearest original view image to the one which the client is rendering. In this paper, we particularly elaborate upon a novel strategy for distortion-sensitive synthesis of both geometry and rendered imagery at the client, based upon whatever data is provided by the server. We also outline how the JPIP standard for interactive communication of JPEG2000 images, can be leveraged for the 3D scene browsing application.

Journal ArticleDOI
TL;DR: BFlavor is an efficient and harmonized description tool for enabling XML-driven adaptation of media resources in a format-agnostic way, and it outperforms BSDL and XFlavor in terms of execution times, memory consumption, and file sizes.
Abstract: During recent years, several tools have been developed that allow the automatic generation of XML descriptions containing information about the syntax of binary media resources. Such a bitstream syntax description (BSD) can then be transformed to reflect a desired adaptation of a media resource, and can subsequently be used to create a tailored version of this resource. The main contribution of this paper is the introduction of BFlavor, a new tool for exposing the syntax of binary media resources as an XML description. Its development was inspired by two other technologies, i.e. MPEG-21 BSDL and XFlavor. Although created from a different point of view, both languages offer solutions for translating the syntax of a media resource into an XML representation for further processing. BFlavor (BSDL+XFlavor) harmonizes the two technologies by combining their strengths and eliminating their weaknesses. More precisely, the processing efficiency and expressive power of XFlavor on the one hand, and the ability to create high-level BSDs using MPEG-21 BSDL on the other hand, were our key motives for its development. To assess the expressive power and performance of a BFlavor-driven content adaptation chain, several experiments were conducted. These experiments test the automatic generation of BSDs for MPEG-1 Video and H.264/AVC, as well as the exploitation of multi-layered temporal scalability in H.264/AVC. Our results show that BFlavor is an efficient and harmonized description tool for enabling XML-driven adaptation of media resources in a format-agnostic way. BSDL and XFlavor are outperformed by BFlavor in terms of execution times, memory consumption, and file sizes.

Journal ArticleDOI
TL;DR: This paper presents an efficient variable block size motion estimation algorithm for use in real-time H.264 video encoder implementation that results in over 80% reduction in the encoding time over the full-search reference implementation and around 55% improvement over the fast motion estimation algorithm (FME) of the reference implementation.
Abstract: This paper presents an efficient variable block size motion estimation algorithm for use in real-time H.264 video encoder implementation. In this recursive motion estimation algorithm, results of variable block size modes and motion vectors previously obtained for neighboring macroblocks are used in determining the best mode and motion vectors for encoding the current macroblock. Considering only a limited number of well chosen candidates helps reduce the computational complexity drastically. An additional fine search stage to refine the initially selected motion vector enhances the motion estimator accuracy and SNR performance to a value close to that of full search algorithm. The proposed methods result in over 80% reduction in the encoding time over full search reference implementation and around 55% improvement in the encoding time over the fast motion estimation algorithm (FME) of the reference implementation. The average SNR and compression performance do not show significant difference from the reference implementation. Results based on a number of video sequences are presented to demonstrate the advantage of using the proposed motion estimation technique.

Journal ArticleDOI
TL;DR: In this paper, the human visual attention mechanism directs the viewer's eye movements around the image to provide a sequence of fixations, which are analyzed, clustered and classified into regions of interest (ROI).
Abstract: Current image coding systems such as JPEG are far away from the capability of the human perceptual system in that the encoding may not maximise the reconstruction quality of image contents. Humans are often concerned with the interpretability of the image and thus enhanced reconstruction quality in image contents would facilitate improved recognition performance. This paper addresses this issue by incorporating characteristics of the human perceptual system into an image coding system. This is achieved by analysing the spatial and temporal characteristics of the human visual attention system as recorded from an eye-tracking device at the encoding end. Human visual attention mechanisms direct the viewer's eye movements around the image to provide a sequence of fixations, which are analysed, clustered and classified into regions of interest (ROI). These ROIs are used to selectively encode and prioritise regions such that an improved image content recognition performance can be achieved.

Journal ArticleDOI
TL;DR: This paper proposes a spatial error concealment method that uses edge-related information in order not only to preserve existing edges but also to avoid introducing new strong ones by switching to a smooth approximation of missing information where necessary.
Abstract: Video transmission over error-prone networks can suffer from packet erasures which can greatly reduce the quality of the received video. Error concealment methods reduce the perceived quality degradation at the receiving end by masking the effects of such errors. They accomplish this by exploiting temporal and spatial correlations that exist in image sequences. Spatial error concealment approaches conceal errors by making use of spatial information only, which is necessary in cases where motion information is not available or reliable. The performance of such methods can be greatly increased if perceptual considerations are taken into account, e.g., the preservation of edge information. This paper proposes a spatial error concealment method that uses edge-related information in order not only to preserve existing edges but also to avoid introducing new strong ones, by switching to a smooth approximation of the missing information where necessary. A novel switching algorithm which uses the directional entropy of neighbouring edges chooses between two interpolation methods: a directional interpolation along detected edges or a bilinear interpolation using the nearest neighbouring pixels. Results show that the performance of the proposed method is better compared to both ‘single interpolation’ and edge strength-based switching methods.
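A hedged sketch of the switching rule follows: the entropy of the edge-direction histogram around the lost block decides between directional and bilinear concealment, and a simple bilinear interpolation from the four nearest received boundary pixels is shown as the fall-back. The entropy threshold and the weighting are illustrative values, not those of the paper.

```python
import numpy as np

def directional_entropy(angles, n_bins=8):
    """Entropy of the edge-direction histogram around a lost block
    (low entropy = one dominant edge direction)."""
    hist, _ = np.histogram(np.asarray(angles) % np.pi, bins=n_bins, range=(0, np.pi))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def choose_mode(neighbour_edge_angles, threshold=2.0):
    """Switch: directional interpolation if there is one clear dominant edge,
    bilinear interpolation otherwise."""
    return "directional" if directional_entropy(neighbour_edge_angles) < threshold else "bilinear"

def bilinear_conceal(top, bottom, left, right, wy, wx):
    """Bilinear estimate of a missing pixel from the four nearest received
    boundary pixels; wy, wx in [0, 1] are the normalized vertical/horizontal
    positions inside the lost block."""
    return 0.5 * ((1 - wy) * top + wy * bottom) + 0.5 * ((1 - wx) * left + wx * right)
```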

Journal ArticleDOI
TL;DR: Simulation results show the efficacy of the proposed transmission system using hybrid unequal-error-protection and selective-retransmission for 3D meshes which are encoded with multi-resolutions in reducing transmission latency and providing smooth performance for interactive applications.
Abstract: Three-dimensional (3D) meshes are used intensively in distributed graphics applications where model data are transmitted on demand to users’ terminals and rendered for interactive manipulation. For real-time rendering and high-resolution visualization, the transmission system should adapt to both data properties and transport link characteristics while providing scalability to accommodate terminals with disparate rendering capabilities. This paper presents a transmission system using hybrid unequal-error-protection and selective-retransmission for 3D meshes which are encoded with multiple resolutions. Based on the distortion-rate performance of the 3D data, the end-to-end channel statistics and the network parameters, transmission policies that maximize the service quality under a client-specific constraint are determined with linear computational complexity. A TCP-friendly protocol is utilized to further provide performance stability over time as well as bandwidth fairness for parallel flows in the network. Simulation results show the efficacy of the proposed transmission system in reducing transmission latency and providing smooth performance for interactive applications. For example, for a fixed rendering quality, the proposed system achieves a 20–30% reduction in transmission latency compared to the system based on 3TP, a recently presented 3D application protocol using hybrid TCP and UDP.

Journal ArticleDOI
TL;DR: A general two-stage multiple description coding (MDC) scheme using whitening transform is analyzed and identifies the importance of a good coarse approximation and explores different approaches for changing its resolution and coding it.
Abstract: A general two-stage multiple description coding (MDC) scheme using a whitening transform is analyzed. It represents the original image as a coarse image approximation and a residual image. The coarse approximation is subsequently duplicated and combined with the residual image, which is further split into two descriptions using a checkerboard rearrangement of the block transform coefficients. We identify the importance of a good coarse approximation and explore different approaches for changing its resolution and coding it. We also propose different approaches for coding the residual signal. The coding scheme is quite simple and yet achieves high performance, comparable with other MDC methods.
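One plausible reading of the checkerboard rearrangement is sketched below: on the block grid of the residual image's transform coefficients, blocks of one checkerboard colour go to one description and blocks of the other colour to the second, while the coarse approximation is copied into both. This is an interpretation for illustration only, not the paper's exact construction.

```python
import numpy as np

def checkerboard_descriptions(coarse, residual_blocks):
    """coarse: coarse image approximation (copied into both descriptions);
    residual_blocks: array of shape (rows, cols, bh, bw) of transform blocks."""
    rows, cols = residual_blocks.shape[:2]
    mask = (np.add.outer(np.arange(rows), np.arange(cols)) % 2 == 0)
    mask = mask[:, :, None, None]                  # broadcast over block contents
    desc0 = (coarse, np.where(mask, residual_blocks, 0))
    desc1 = (coarse, np.where(~mask, residual_blocks, 0))
    return desc0, desc1
```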

Journal ArticleDOI
TL;DR: A geometric representation is developed that can re-organize the original perspective images into a set of parallel projections with different oblique viewing angles that can be used as both an advanced video interface and a pre-processing step for 3D reconstruction.
Abstract: In this paper we address the problem of fusing images from many video cameras or a moving video camera. The captured images have obvious motion parallax, but they will be aligned and integrated into a few mosaics with a large field-of-view (FOV) that preserve 3D information. We have developed a compact geometric representation that can re-organize the original perspective images into a set of parallel projections with different oblique viewing angles. In addition to providing a wide field of view, mosaics with various oblique views well represent occlusion regions that cannot be seen in a usual nadir view. Stereo mosaic pairs can be formed from mosaics with different oblique viewing angles. This representation can be used as both an advanced interface for interactive 3D video and a pre-processing step for 3D reconstruction. A ray interpolation approach for generating the parallel-projection mosaics is presented, and efficient 3D scene/object rendering based on multiple parallel-projection mosaics is discussed. Several real-world examples are provided, with applications ranging from aerial video surveillance/environmental monitoring, ground mobile robot navigation, to under-vehicle inspection.

Journal ArticleDOI
TL;DR: A new adaptation method, called partial linear regression (PLR), is proposed and adopted in an audio-driven talking head application; it allows users to adapt part of the parameters from the available adaptation data while keeping the others unchanged.
Abstract: Avatars in many applications are constructed manually or by a single speech-driven model which needs a lot of training data and a long training time. It is essential to build up a user-dependent model more efficiently. In this paper, a new adaptation method, called partial linear regression (PLR), is proposed and adopted in an audio-driven talking head application. This method allows users to adapt part of the parameters from the available adaptation data while keeping the others unchanged. In our experiments, the PLR algorithm saves the hours that would otherwise be spent retraining a new user-dependent model, and adjusts the user-independent model into a more personalized one. The animated results with adapted models are 36% closer to the user-dependent model than those obtained with the pre-trained user-independent model.
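A heavily simplified sketch of the "adapt part, freeze the rest" idea follows: only selected columns of a linear audio-to-animation mapping are re-estimated from the user's adaptation data, while the remaining columns of the user-independent model are kept. The matrix shapes and the plain least-squares fit are assumptions for illustration; the paper's actual PLR equations are not reproduced here.

```python
import numpy as np

def partial_adaptation(X, Y, W_old, adapt_idx):
    """Re-estimate only the output dimensions listed in `adapt_idx` by least
    squares on the adaptation data (X: inputs, n x d_in; Y: targets, n x d_out),
    keeping the remaining columns of the mapping W_old (d_in x d_out) unchanged."""
    W_new = W_old.copy()
    # Least-squares solution for the selected outputs only.
    W_sel, *_ = np.linalg.lstsq(X, Y[:, adapt_idx], rcond=None)
    W_new[:, adapt_idx] = W_sel
    return W_new
```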

Journal ArticleDOI
TL;DR: The experimental results demonstrate that using the MINVAR based rate distortion tradeoff framework, the decoded picture quality is smoother than the traditional H.264 joint model (JM) rate control without sacrificing global quality such that a better subjective visual quality is guaranteed.
Abstract: In this paper, we review the rate distortion tradeoff issues in real-time video coding and introduce a minimum variation (MINVAR) distortion criterion based approach. The MINVAR based rate distortion tradeoff framework provides a local optimization strategy as a rate control mechanism in real-time video coding applications by minimizing the distortion variation while the corresponding bit rate fluctuation is limited by utilizing the encoder buffer. The proposed approach aims to achieve a smooth decoded picture quality for pleasing human visual experience. The performance of the proposed method is evaluated with H.264. The experimental results demonstrate that using the proposed approach, the decoded picture quality is smoother than the traditional H.264 joint model (JM) rate control without sacrificing global quality such that a better subjective visual quality is guaranteed.

Journal ArticleDOI
TL;DR: A system for image replica detection is presented: it adapts a detector to a specific reference image and is able to classify test images as replicas of that reference image or as unrelated images.
Abstract: This paper presents a system for image replica detection. The idea behind the proposed approach is to adapt a system for detecting the replicas of a specific reference image. The system is then able to classify test images as replicas of the reference image or as unrelated images. More precisely, the test procedure is as follows. A set of features is extracted from a test image, representing texture, colour and grey-level characteristics. These features are then fed into a preprocessing step, which is fine-tuned to the reference image. Finally, the resulting features are fed to a support vector classifier that determines whether the test image is a replica of the reference image. Experimental results show the effectiveness of the proposed system. Target applications include search for copyright infringement (e.g. variations of copyrighted images) and known illicit content (e.g. paedophile images known to the police).
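A minimal sketch of the classification stage: a feature vector (here just a grey-level histogram plus colour means, far simpler than the paper's texture/colour/grey-level features) is computed for each training image, and a two-class support vector classifier is trained to separate replicas of the reference image from unrelated images. The feature choice and SVM settings are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def extract_features(image):
    """Toy feature vector: 32-bin grey-level histogram plus per-channel colour
    means, for an image stored as an (H, W, 3) array with values in 0..255."""
    grey = image.mean(axis=2)
    hist, _ = np.histogram(grey, bins=32, range=(0, 255), density=True)
    colour_means = image.reshape(-1, 3).mean(axis=0) / 255.0
    return np.concatenate([hist, colour_means])

def train_replica_detector(replicas, unrelated):
    """Train a two-class SVM on features of known replicas of the reference
    image (label 1) and unrelated images (label 0)."""
    X = np.array([extract_features(im) for im in replicas + unrelated])
    y = np.array([1] * len(replicas) + [0] * len(unrelated))
    return SVC(kernel="rbf", gamma="scale").fit(X, y)
```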