
Showing papers in "Journal of Real-time Image Processing in 2014"


Journal ArticleDOI
TL;DR: A novel fall detection system based on the Kinect sensor that detects walking falls accurately and robustly in real time, without false positives from similar activities such as lying down on the floor.
Abstract: This paper presents a novel fall detection system based on the Kinect sensor. The system runs in real time and detects walking falls accurately and robustly, without false positives from similar activities such as lying down on the floor. Velocity and inactivity calculations are performed to decide whether a fall has occurred. The key novelty of our approach is measuring the velocity from the contraction or expansion of the width, height and depth of the 3D bounding box. By explicitly using the 3D bounding box, our algorithm requires no prior knowledge of the scene (e.g. the floor plane), as the set of detected actions is adequate to complete the process of fall detection.
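The velocity-from-bounding-box idea can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code; the thresholds (1.5 m/s fall velocity, a 30-frame inactivity window, a 5 cm change tolerance) are assumed values.

```python
import numpy as np

def bbox_velocity(prev_dims, curr_dims, dt):
    """Approximate subject velocity from the change in width, height and
    depth of the 3D bounding box between two frames."""
    prev = np.asarray(prev_dims, dtype=float)
    curr = np.asarray(curr_dims, dtype=float)
    # Rate of contraction/expansion of each box dimension (m/s).
    return np.linalg.norm((curr - prev) / dt)

def detect_fall(dim_history, dt=1 / 30, v_thresh=1.5, inactivity_frames=30):
    """Flag a fall when a fast box collapse is followed by inactivity
    (thresholds are illustrative, not taken from the paper)."""
    for i in range(1, len(dim_history) - inactivity_frames):
        v = bbox_velocity(dim_history[i - 1], dim_history[i], dt)
        if v > v_thresh:
            tail = np.asarray(dim_history[i:i + inactivity_frames], dtype=float)
            if np.ptp(tail, axis=0).max() < 0.05:  # box barely changes: inactive
                return True
    return False
```

A standing box that collapses into a wide, flat box within one frame and then stays still would be flagged; a static subject would not.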

335 citations


Journal ArticleDOI
TL;DR: The first GPU-based NFFT algorithm without special structural assumptions on the positions of nodes is designed, a novel nearest-neighbour identification scheme for continuous point distributions is introduced, and the whole algorithm is optimised for n-body problems such as electrostatic halftoning.
Abstract: Electrostatic halftoning is a high-quality method for stippling, dithering, and sampling, but it suffers from a high runtime, which has made the technique difficult to use in most real-world applications. A recently proposed minimisation scheme based on the non-equispaced fast Fourier transform (NFFT) lowers the complexity in the particle number M from $\mathcal{O}(M^2)$ to $\mathcal{O}(M \log M)$. However, the NFFT is hard to parallelise, and the runtime on modern CPUs still ranges from about an hour for 50,000 particles to a day for 1 million particles. Our contributions to remedy this problem are threefold: we design the first GPU-based NFFT algorithm without special structural assumptions on the positions of nodes, we introduce a novel nearest-neighbour identification scheme for continuous point distributions, and we optimise the whole algorithm for n-body problems such as electrostatic halftoning. For 1 million particles, the new algorithm runs 50 times faster than the most efficient CPU technique, and even yields a speedup of 7,000 over the original algorithm.

57 citations


Journal ArticleDOI
TL;DR: This paper implements automatic recognition of road signs in real time by optimizing the techniques used in the different phases of the recognition process, on a Virtex-4 FPGA connected to a camera mounted in the moving vehicle.
Abstract: Automatic detection of road signs alerts the vehicle's driver to the presence of signals and invites a timely reaction, with the aim of avoiding potential traffic accidents. Such an application can thus improve the safety of persons and vehicles travelling on the road. Several techniques and algorithms for automatic road sign detection have been developed, but most are implemented in software and are not suitable for embedded applications. In this work we propose an efficient algorithm and its hardware implementation in an embedded system running in real time, optimizing the techniques used in the different phases of the recognition process. The system is implemented on a Virtex-4 family FPGA connected to a camera mounted in the moving vehicle, and can be integrated into the vehicle's dashboard. The performance of the system shows a good compromise between speed and efficiency.

49 citations


Journal ArticleDOI
TL;DR: An advanced system that is able to generate and maintain a complex background model for a scene as well as segment the foreground for an HD colour video stream (1,920 × 1,080 @ 60 fps) in real-time is presented.
Abstract: The processing of a high-definition video stream in real-time is a challenging task for embedded systems. However, modern FPGA devices have both a high operating frequency and sufficient logic resources to be successfully used in these tasks. In this article, an advanced system is presented that is able to generate and maintain a complex background model for a scene, as well as segment the foreground for an HD colour video stream (1,920 × 1,080 @ 60 fps) in real-time. The possible applications range from video surveillance to machine vision systems; that is, all cases in which information is needed about which objects in the scene are new or moving. Excellent results are obtained by using the CIE Lab colour space, an advanced background representation, and the integration of information about lightness, colour and texture in the segmentation step. Finally, the complete system is implemented in a single high-end FPGA device.
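As a much-simplified software analogue of such a pipeline, a per-pixel running-average background model with a colour-distance foreground test might look like this. The learning rate and threshold are illustrative; the paper's model is far richer, integrating lightness, colour and texture.

```python
import numpy as np

def update_background(bg, frame, alpha=0.05):
    """Exponential running-average background update (a toy stand-in for
    the paper's complex background model)."""
    return (1 - alpha) * bg + alpha * frame

def segment_foreground(bg, frame, thresh=25.0):
    """Mark pixels whose colour distance to the background exceeds a
    threshold (illustrative value)."""
    dist = np.linalg.norm(frame.astype(float) - bg, axis=-1)
    return dist > thresh
```

On an FPGA, both operations map naturally onto a streaming per-pixel datapath, which is what makes 1080p60 throughput feasible.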

37 citations


Journal ArticleDOI
TL;DR: A stereo algorithm that is capable of estimating scene depth information with high accuracy and in real time and driven by two design goals: real-time performance and high accuracy depth estimation is presented.
Abstract: We present a stereo algorithm that is capable of estimating scene depth information with high accuracy and in real time. The key idea is to employ an adaptive cost-volume filtering stage in a dynamic programming optimization framework. The per-pixel matching costs are aggregated via a separable implementation of the bilateral filtering technique. Our separable approximation offers comparable edge-preserving filtering capability and leads to a significant reduction in computational complexity compared to the traditional 2D filter. This cost aggregation step resolves the disparity inconsistency between scanlines, which is a typical problem for conventional dynamic programming based stereo approaches. Our algorithm is driven by two design goals: real-time performance and high accuracy depth estimation. For computational efficiency, we utilize the vector processing capability and parallelism in commodity graphics hardware to speed up this aggregation process by over two orders of magnitude. Over 90 million disparity evaluations per second are achieved in our current implementation [the number of disparity evaluations per second (MDE/s) is the product of the number of pixels, the disparity range and the obtained frame rate, and therefore captures the performance of a stereo algorithm in a single number]. In terms of quality, quantitative evaluation using data sets with ground truth disparities shows that our approach is one of the state-of-the-art real-time stereo algorithms.
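The MDE/s metric defined in the brackets is straightforward to compute. As a hedged example, a hypothetical 640 × 480 stream with a 60-level disparity range at 30 fps (numbers chosen for illustration, not taken from the paper) yields about 553 MDE/s:

```python
def mde_per_second(width, height, disparity_range, fps):
    """Throughput in millions of disparity evaluations per second (MDE/s):
    the product of pixel count, disparity range and frame rate."""
    return width * height * disparity_range * fps / 1e6
```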

33 citations


Journal ArticleDOI
TL;DR: New fast implementations of VD and HySime using commodity graphics processing units are developed and validated in terms of accuracy and computational performance, showing significant speedups with regards to optimized serial implementations.
Abstract: Spectral unmixing is a very important task for remotely sensed hyperspectral data exploitation. It amounts to identifying a set of spectrally pure components (called endmembers) and their associated per-pixel coverage fractions (called abundances). A challenging problem in spectral unmixing is how to determine the number of endmembers in a given scene. Several automatic techniques exist for this purpose, including the virtual dimensionality (VD) concept and hyperspectral signal identification by minimum error (HySime). Due to the complexity and high dimensionality of hyperspectral scenes, these techniques are computationally expensive. In this paper, we develop new fast implementations of VD and HySime using commodity graphics processing units. The proposed parallel implementations are validated in terms of accuracy and computational performance, showing significant speedups with respect to optimized serial implementations. The newly developed implementations are integrated in a fully operational unmixing chain which exhibits real-time performance with respect to the time the hyperspectral instrument takes to collect the image data.

32 citations


Journal ArticleDOI
TL;DR: This work investigates the acceleration of the image reconstruction by GPUs and FPGAs and shows that both architectures are able to accelerate processing, whereas the GPU reaches the highest performance.
Abstract: As today's standard screening methods frequently fail to diagnose breast cancer before metastases have developed, earlier breast cancer diagnosis is still a major challenge. Three-dimensional ultrasound computer tomography promises high-quality images of the breast, but is currently limited by a time-consuming image reconstruction. In this work, we investigate the acceleration of the image reconstruction by GPUs and FPGAs. We compare the obtained performance results with a recent multi-core CPU. We show that both architectures are able to accelerate processing, whereas the GPU reaches the highest performance. Furthermore, we draw conclusions in terms of applicability of the accelerated reconstructions in future clinical application and highlight general principles for speed-up on GPUs and FPGAs.

32 citations


Journal ArticleDOI
TL;DR: The MDC tool, a novel automatic platform builder exploiting dataflow specifications for the creation of run-time reconfigurable multi-application systems, is presented and evaluated; savings of 60 % can be achieved with the MDC-generated coprocessor compared to an equivalent non-reconfigurable design, without performance losses.
Abstract: Dataflow specifications are suitable to describe both signal processing applications and the corresponding specialized hardware architectures, fostering closure of the hardware/software development gap. They can be exploited for the development of automatic tools aimed at the integration of multiple applications on the same coarse-grained computational substrate. In this paper, the multi-dataflow composer (MDC) tool, a novel automatic platform builder exploiting dataflow specifications for the creation of run-time reconfigurable multi-application systems, is presented and evaluated. In order to prove the effectiveness of the adopted approach, a coprocessor for still image and video processing acceleration has been assembled and implemented on both FPGA and 90 nm ASIC technology. Savings of 60 % in both area occupancy and power consumption can be achieved with the MDC-generated coprocessor compared to an equivalent non-reconfigurable design, without performance losses. Thanks to the generality of the high-level dataflow specification approach, the tool can be successfully applied in different application domains.

31 citations


Journal ArticleDOI
TL;DR: This study describes a dataflow-based design methodology aiming at a unified co-design and co-synthesis of heterogeneous systems; results on the implementation of a JPEG codec and an MPEG-4 SP decoder on heterogeneous platforms demonstrate the flexibility and capabilities of this design approach.
Abstract: The potential computational power of today's multicore processors has improved drastically compared to single-processor architectures. Since the trend of increasing the processor frequency is almost over, the competition for increased performance has moved to the number of cores. Consequently, the fundamental features of system designs and their associated design flows and tools need to change to support scalable parallelism and design portability. The same feature can be exploited to design reconfigurable hardware, such as FPGAs, which calls for rethinking the mapping of sequential algorithms to HDL. The sequential programming paradigm, widely used for programming single-processor systems, does not naturally provide explicit or implicit forms of scalable parallelism. Conversely, dataflow programming is an approach that naturally exposes parallelism and has the potential to unify SW and HDL designs on heterogeneous platforms. This study describes a dataflow-based design methodology aiming at a unified co-design and co-synthesis of heterogeneous systems. Experimental results on the implementation of a JPEG codec and an MPEG-4 SP decoder on heterogeneous platforms demonstrate the flexibility and capabilities of this design approach.

31 citations


Journal ArticleDOI
TL;DR: This paper introduces an additional, intra-operator level of parallelism in this dilation/erosion algorithm, realized in a dedicated hardware, for rectangular structuring elements with programmable size, which allows obtaining previously unachievable, real-time performances for these traditionally costly operators.
Abstract: Many useful morphological filters are built as more or less long concatenations of erosions and dilations: openings, closings, size distributions, sequential filters, etc. An efficient implementation of these concatenations would allow all the sequentially concatenated operators to run simultaneously on time-delayed data. A recent algorithm for morphological dilation/erosion allows such inter-operator parallelism. This paper introduces an additional, intra-operator level of parallelism into this dilation/erosion algorithm. Realized in dedicated hardware, for rectangular structuring elements with programmable size, such an implementation achieves previously unattainable real-time performance for these traditionally costly operators. Low latency and low memory requirements are the main benefits, and performance does not deteriorate even for long concatenations or high-resolution images.
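The rectangular structuring elements mentioned above are attractive precisely because a dilation by a W × H rectangle decomposes into a horizontal and a vertical 1D pass. A software sketch of that separability (naive sliding maxima here; the dedicated algorithm computes each 1D pass in constant time per pixel):

```python
import numpy as np

def dilate_1d(line, k):
    """Grey-scale dilation of a 1D signal by a flat segment of length k
    (naive sliding max for clarity)."""
    pad = k // 2
    padded = np.pad(line, (pad, k - 1 - pad), mode="edge")
    return np.array([padded[i:i + k].max() for i in range(len(line))])

def dilate_rect(img, kw, kh):
    """Dilation by a kw x kh rectangle, decomposed into a horizontal pass
    followed by a vertical pass -- the separability hardware exploits."""
    tmp = np.apply_along_axis(dilate_1d, 1, img, kw)
    return np.apply_along_axis(dilate_1d, 0, tmp, kh)
```

An opening is then simply an erosion (the dual, a sliding minimum) followed by this dilation, and the two stages can run concurrently on time-delayed data.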

27 citations


Journal ArticleDOI
TL;DR: A solution that can be easily adapted to different types of lens and camera, and meets real-time constraints with a power budget within 100 mW and a board size of a few cm².
Abstract: The development of an embedded system for real-time correction of the fish-eye effect is presented. The fish-eye lens is applied to driver assistance video systems because of its wide-angled view. A large field of view can reduce the number of cameras needed for a video system and their cost, installation, maintenance and wiring issues. On the other hand, this lens introduces inherent radial distortion into the image, which has to be corrected in real time on a low-cost, low-power processing platform. This paper proposes a solution that can be easily adapted to different types of lens and camera, and meets real-time constraints with a power budget within 100 mW and a board size of a few cm². Starting from mathematical equations given by geometrical optics, a state-of-the-art correction method is presented, then optimizations are introduced at different levels: the algorithmic level, where real-time correction parameter calculation avoids extra non-volatile off-chip memory cards; the data transfer level, where a new pixel pair management reduces memory access and storage burden; and the HW-SW implementation level, where a low-power board has been developed and tested in real automotive scenarios. Other applications of the developed system, such as multi-camera and multi-dimensional video systems, are finally presented.
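The core of any such corrector is a per-pixel lookup from corrected to distorted coordinates followed by a resampling step. The sketch below uses a generic single-parameter division model with nearest-neighbour resampling as a stand-in; the paper derives its mapping from the actual lens optics, and the parameter `k` here is purely illustrative.

```python
import numpy as np

def undistort_map(width, height, cx, cy, k=-4e-7):
    """Lookup from corrected to distorted coordinates using a generic
    single-parameter radial model (an assumption, not the paper's model)."""
    ys, xs = np.mgrid[0:height, 0:width].astype(float)
    dx, dy = xs - cx, ys - cy
    r2 = dx * dx + dy * dy
    scale = 1.0 / (1.0 + k * r2)        # radial rescaling about the centre
    return cx + dx * scale, cy + dy * scale

def correct_fisheye(img, k=-4e-7):
    """Nearest-neighbour remap of a grayscale image through the lookup."""
    h, w = img.shape
    mx, my = undistort_map(w, h, w / 2.0, h / 2.0, k)
    xi = np.clip(np.round(mx).astype(int), 0, w - 1)
    yi = np.clip(np.round(my).astype(int), 0, h - 1)
    return img[yi, xi]
```

Precomputing the lookup once per lens, as this sketch does, is what lets an embedded implementation avoid per-frame trigonometry.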

Journal ArticleDOI
TL;DR: This article describes the typical modelling steps involved in the creation of a range of digital documents provided by the 3D digitization company Artescan to customers and presents how these modelling steps were applied in the context of creating digital documents used in the preservation of Mosteiro da Batalha.
Abstract: Advances in both terrestrial laser scanning hardware and photogrammetric systems combined are creating increasingly precise and rich 3D coloured data. In this article we show how computer graphics and visualization techniques have played an important role in real-time visualization, data management, modelling, and data fusion in an increasing number of applications such as surveying engineering, structure analysis, architecture, archaeology and cultural heritage. Specifically, we describe the typical modelling steps involved in the creation of a range of digital documents provided by the 3D digitization company Artescan to customers. We present how these modelling steps were applied in the context of creating digital documents used in the preservation of Mosteiro da Batalha.

Journal ArticleDOI
TL;DR: The GPURetinex algorithm is presented, which is a data parallel algorithm accelerating a modified center/surround retinex with GPGPU/CUDA and can gain 74 times acceleration compared with an SSE-optimized single-threaded implementation on Core2 Duo.
Abstract: Retinex is an image restoration approach used to restore the original appearance of an image. Among various methods, the center/surround retinex algorithm is favorable for parallelization because it uses convolution operations with large-scale kernels to achieve dynamic range compression and color/lightness rendition. This paper presents the GPURetinex algorithm, a data parallel algorithm accelerating a modified center/surround retinex with GPGPU/CUDA. The GPURetinex algorithm exploits the massively parallel threading and heterogeneous memory hierarchy of a GPGPU to improve efficiency. Two challenging problems, irregular memory access and block size for data partitioning, are analyzed mathematically. The proposed mathematical models help optimally choose memory spaces and block sizes for maximal parallelization performance. The mathematical analyses are applied to three parallelization issues in the retinex problem: block-wise, pixel-wise, and serial operations. Experimental results conducted on a GT200 GPU with CUDA 3.2 show that GPURetinex can attain a 74-fold acceleration over an SSE-optimized single-threaded implementation on a Core2 Duo for images with 4,096 × 4,096 resolution. The proposed method also outperforms the parallel retinex implemented with the nVidia Performance Primitives library. Our results indicate that careful design of memory access and multithreading patterns for CUDA devices can yield substantial acceleration for real-time image restoration.
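The center/surround operation at the heart of retinex is a log-ratio of each pixel against a large-scale Gaussian surround. A minimal single-scale CPU sketch of that core, using two separable 1D convolutions (the paper parallelises a modified, multi-stage variant of this on the GPU):

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    """Normalised 1D Gaussian kernel."""
    radius = radius or int(3 * sigma)
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-x * x / (2 * sigma * sigma))
    return k / k.sum()

def single_scale_retinex(img, sigma=15.0):
    """Center/surround retinex: log(image) minus log of its large-scale
    Gaussian surround, computed with two separable 1D passes."""
    k = gaussian_kernel(sigma)
    blur = np.apply_along_axis(np.convolve, 0, img.astype(float), k, mode="same")
    blur = np.apply_along_axis(np.convolve, 1, blur, k, mode="same")
    return np.log1p(img.astype(float)) - np.log1p(blur)
```

The large surround scale is exactly why the convolution dominates the runtime and why the GPU memory-access analysis in the paper matters.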

Journal ArticleDOI
TL;DR: The Reed–Xiaoli real-time oriented techniques have been improved using a linear algebra-based strategy to efficiently update the inverse covariance matrix, thus avoiding its computation and inversion for each pixel of the hyperspectral image.
Abstract: In the field of hyperspectral image processing, anomaly detection (AD) is a deeply investigated task whose goal is to find objects in the image that are anomalous with respect to the background. In many operational scenarios, detection, classification and identification of anomalous spectral pixels have to be performed in real time to quickly furnish information for decision-making. In this framework, many studies concern the design of computationally efficient AD algorithms for hyperspectral images in order to assure real-time or nearly real-time processing. In this work, a sub-class of anomaly detection algorithms is considered, i.e., those algorithms aimed at detecting small rare objects that are anomalous with respect to their local background. Among such techniques, one of the most established is the Reed–Xiaoli (RX) algorithm, which is based on a local Gaussian assumption for the background clutter and locally estimates its parameters by means of the pixels inside a window around the pixel under test (PUT). In the literature, the RX decision rule has been employed to develop computationally efficient algorithms tested in real-time systems. Initially, a recursive block-based parameter estimation procedure was adopted that makes the RX processing and the detection performance differ from those of the original RX. More recently, an update strategy has been proposed which relies on a line-by-line processing without altering the RX detection statistic. In this work, the above-mentioned RX real-time oriented techniques have been improved using a linear algebra-based strategy to efficiently update the inverse covariance matrix, thus avoiding its computation and inversion for each pixel of the hyperspectral image. The proposed strategy is discussed in depth, pointing out the benefits introduced on the two analyzed architectures in terms of the overall number of elementary operations required. The results show the benefits of the new strategy with respect to the original architectures.
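The RX statistic, and the kind of rank-one inverse-covariance update such a strategy relies on, can be sketched as follows. The Sherman–Morrison form shown is an illustrative stand-in for pixels entering or leaving the sliding window; the paper's exact update rule may differ.

```python
import numpy as np

def rx_statistic(x, mean, inv_cov):
    """Reed-Xiaoli anomaly score: Mahalanobis distance of the pixel under
    test from the local background statistics."""
    d = x - mean
    return float(d @ inv_cov @ d)

def sherman_morrison_update(inv_cov, u, sign=+1.0):
    """Rank-one update of an inverse, (A + sign * u u^T)^-1, computed
    without re-inverting A -- the linear-algebra shortcut that avoids a
    fresh matrix inversion at every pixel (illustrative form)."""
    Au = inv_cov @ u
    denom = 1.0 + sign * float(u @ Au)
    return inv_cov - sign * np.outer(Au, Au) / denom
```

With `sign=+1` a pixel's outer product is added to the covariance; with `sign=-1` it is removed, so a sliding window can be maintained in O(d²) per pixel instead of the O(d³) of a full inversion.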

Journal ArticleDOI
TL;DR: A new technique based on the FABEMD with the aim of improving the well-known pyramidal algorithm of Lucas and Kanade which, in principle, utilizes two consecutive frames extracted from video sequence to determine a dense optical flow.
Abstract: Motion estimation is a basic step serving several processes in computer vision. This motion is currently approximated by the visual displacement field called optical flow. Several methods are used to estimate it, but a good compromise between computational cost and accuracy is hard to achieve. This paper tackles the problem by proposing a new technique based on the FABEMD (fast and adaptive bidimensional empirical mode decomposition), with the aim of improving the well-known pyramidal algorithm of Lucas and Kanade (LK), which, in principle, uses two consecutive frames extracted from a video sequence to determine a dense optical flow. The proposed algorithm uses the FABEMD method to decompose each of the two considered frames into several BIMFs (bidimensional intrinsic mode functions) that are matched in number and properties. To compute the optical flow, the LK algorithm is then applied to each pair of matching BIMFs belonging to the same mode of the decomposition. Although the implementation does not use an iterative refinement, the results show that the proposed approach is less sensitive to noise and provides improved motion estimation with a reduction in computing time compared to iterative methods.

Journal ArticleDOI
TL;DR: This study proposes color invariant-based binary, ternary, and quaternary coded structured light-based range scanners that can scan shiny and matte objects under ambient light and hypothesizes that, by using color invariants, they can eliminate the effect of highlights and ambient light in the scanning process.
Abstract: Three-dimensional range data provides useful information for various computer vision and computer graphics applications. For these, extracting the range data reliably is of utmost importance. Therefore, various range scanners based on different working principles have been proposed in the literature. Among these, coded structured light-based range scanners are popular and used in most industrial applications. Unfortunately, these range scanners cannot scan shiny objects reliably: either highlights on the shiny object surface or the ambient light in the environment disturb the code word. As the code is changed, the range data extracted from it will also be disturbed. In this study, we focus on developing a system that can scan shiny and matte objects under ambient light. Therefore, we propose color invariant-based binary, ternary, and quaternary coded structured light-based range scanners. We hypothesize that, by using color invariants, we can eliminate the effect of highlights and ambient light in the scanning process. Thus, we can extract the range data of shiny and matte objects in a robust manner. We implemented these scanners using a TI DM6437 EVM board with a flexible system setup such that the user can select the scanning type. Furthermore, we implemented a TI MSP430 microcontroller-based rotating table system that accompanies our scanner. With the help of this system, we can obtain the range data of the target object from different viewpoints. We also implemented a range image registration method to obtain the complete object model from the range data extracted. We tested our scanner system on various objects and provide their range and model data.

Journal ArticleDOI
TL;DR: Extensive experiments on various scenes indicate that the proposed graphic processing unit (GPU)-accelerated real-time image enhancing method can process a large one-megapixel hazy image into a visually acceptable haze-free result at a rate of 80 frames/s.
Abstract: Single image de-hazing is an important and challenging research topic in computer vision. Computational efficiency and robustness are key problems for this task in real-time applications. In this paper, a graphic processing unit (GPU)-accelerated real-time image enhancing method is proposed to remove haze from a single hazy input image. The foundation of this method is a novel pixel-level optimal de-hazing criterion, proposed to combine a sequence of virtual haze-free candidate images into a final single haze-free image. This image sequence is estimated from the input hazy image by exhausting all possible discretely sampled scene depth values. The main advantage of the proposed method is its pixel-independent computing fashion: the computation at one pixel position is independent of all others. Based on this property, it is straightforward to implement the proposed method with fully parallel GPU acceleration. Extensive experiments on various scenes indicate that the proposed method can process a large one-megapixel hazy image into a visually acceptable haze-free result at a rate of 80 frames/s. Moreover, the proposed method is also less affected by nonuniform illumination than previous methods.
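The candidate-sequence idea can be sketched per pixel with the standard haze model J = (I − A)/t + A, enumerating discretely sampled transmissions t. The selection criterion below (smallest t whose candidate is still a valid intensity) is a simple stand-in for illustration only; the paper's optimal criterion is more elaborate, and the airlight value A is assumed known.

```python
import numpy as np

def dehaze_pixelwise(I, A=1.0, t_samples=np.linspace(0.1, 1.0, 10)):
    """Per-pixel de-hazing sketch: build a candidate J = (I - A)/t + A for
    each sampled transmission t and pick, independently at every pixel,
    the smallest t whose candidate is a valid intensity (stand-in rule)."""
    I = np.asarray(I, dtype=float)
    out = np.full_like(I, A)
    chosen = np.zeros_like(I, dtype=bool)
    for t in t_samples:                 # small t -> strongest haze removal
        J = (I - A) / t + A
        ok = (J >= 0.0) & (J <= 1.0) & ~chosen
        out[ok] = J[ok]
        chosen |= ok
    return out
```

Every pixel is processed independently of its neighbours, which is exactly the property that makes the method embarrassingly parallel on a GPU.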

Journal ArticleDOI
TL;DR: A flexible hardware implementation of the motion estimator which enables the integer motion search algorithms to be modified, and the fractional search as well as the variable block size to be selected and adjusted.
Abstract: Despite the diversity of video compression standards, motion estimation remains a key process used in most of them. Moreover, the required coding performance (bit-rate, PSNR, image spatial resolution, etc.) obviously depends on the application, the environment and the network communication. Motion estimation can therefore be adapted to meet these requirements. Meanwhile, real-time encoding is required in many applications. To reach this goal, we propose in this paper a flexible hardware implementation of the motion estimator which enables the integer motion search algorithms to be modified, and the fractional search as well as the variable block size to be selected and adjusted. Hence, this novel architecture, especially designed for FPGA targets, offers high-speed processing for a configuration which supports variable block sizes and quarter-pel refinement, as described in H.264. The proposed low-cost architecture based on a Virtex-6 FPGA can process integer motion estimation on 1080 HD video streams at 13 fps using a full search strategy (108k macroblocks/s) and at up to 223 fps using diamond search (1.8M macroblocks/s). Moreover, subpel refinement in quarter-pel mode is performed at 232k macroblocks/s.
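Among the integer search strategies mentioned, diamond search is easy to sketch in software: walk a large diamond pattern of candidate displacements until its centre wins, then refine with a small diamond. Scalar Python for clarity (the paper implements this class of algorithm in FPGA hardware):

```python
import numpy as np

LDSP = [(0, 0), (0, -2), (0, 2), (-2, 0), (2, 0),
        (-1, -1), (-1, 1), (1, -1), (1, 1)]        # large diamond pattern
SDSP = [(0, 0), (0, -1), (0, 1), (-1, 0), (1, 0)]  # small diamond pattern

def sad(ref, cur, y, x, dy, dx, b):
    """Sum of absolute differences between the current block and a
    displaced block in the reference frame (inf if out of bounds)."""
    h, w = ref.shape
    ry, rx = y + dy, x + dx
    if ry < 0 or rx < 0 or ry + b > h or rx + b > w:
        return float("inf")
    return float(np.abs(cur[y:y + b, x:x + b].astype(float)
                        - ref[ry:ry + b, rx:rx + b]).sum())

def diamond_search(ref, cur, y, x, b=8, max_iter=16):
    """Diamond-search motion estimation for one b x b block."""
    vy = vx = 0
    for _ in range(max_iter):
        costs = [sad(ref, cur, y, x, vy + dy, vx + dx, b) for dy, dx in LDSP]
        best = int(np.argmin(costs))
        if best == 0:          # centre of the large diamond won: refine
            break
        vy += LDSP[best][0]; vx += LDSP[best][1]
    costs = [sad(ref, cur, y, x, vy + dy, vx + dx, b) for dy, dx in SDSP]
    best = int(np.argmin(costs))
    return vy + SDSP[best][0], vx + SDSP[best][1]
```

The far smaller number of SAD evaluations per block, compared with full search, is what drives the 13 fps versus 223 fps gap reported above.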

Journal ArticleDOI
TL;DR: This paper has designed a multi-layer neuro-fuzzy computing system based on the memristor crossbar structure by introducing a new concept called the fuzzy minterm, and shows how the fuzzy XOR function can be constructed and how it can be used to extract edges from grayscale images.
Abstract: Fuzzy inference systems have always suffered from the lack of efficient structures or platforms for their hardware implementation. In this paper, we address this difficulty by proposing a new method for the implementation of fuzzy rule-based inference systems. To achieve this goal, we have designed a multi-layer neuro-fuzzy computing system based on the memristor crossbar structure by introducing a new concept called the fuzzy minterm. Although many applications can be realized with the proposed system, in this study we only show how the fuzzy XOR function can be constructed and how it can be used to extract edges from grayscale images. One main advantage of our memristive fuzzy edge detector (implemented in analog form) over other commonly used edge detectors is that it can be implemented in parallel form, which makes it a powerful device for real-time applications.
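In software, the classical min/max fuzzy XOR and its use for edge extraction can be sketched as follows. The min/max realisation is the standard fuzzy-logic definition; the paper's contribution is realising it in an analog memristor crossbar, and the neighbour pairing used here is an illustrative choice.

```python
import numpy as np

def fuzzy_xor(a, b):
    """Classical fuzzy XOR: max(min(a, 1-b), min(1-a, b)), on values in [0, 1]."""
    return np.maximum(np.minimum(a, 1 - b), np.minimum(1 - a, b))

def edge_map(img):
    """Fuzzy-XOR edge extraction sketch: XOR each pixel with its right and
    lower neighbours on an intensity image scaled to [0, 1]."""
    g = img.astype(float) / 255.0
    ex = fuzzy_xor(g[:, :-1], g[:, 1:])   # horizontal transitions
    ey = fuzzy_xor(g[:-1, :], g[1:, :])   # vertical transitions
    return np.maximum(ex[:-1, :], ey[:, :-1])
```

On a crisp black/white boundary the XOR responds with 1, and inside uniform black or white regions with 0; in the crossbar all pixel positions evaluate simultaneously.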

Journal ArticleDOI
TL;DR: A hardware solution to ensure a successful and friendly acquisition of the fingerprint image, which can be incorporated at low cost into an embedded fingerprint recognition system due to its small size and high speed.
Abstract: The first step in any fingerprint recognition system is fingerprint acquisition. A well-acquired fingerprint image results in high recognition accuracy at low computational processing effort. Hence, it is very useful for the recognition system to evaluate a confidence level, so as to request new fingerprint samples if the confidence level is low and to facilitate the recognition process if it is high. This paper presents a hardware solution that ensures a successful and user-friendly acquisition of the fingerprint image, and that can be incorporated at low cost into an embedded fingerprint recognition system due to its small size and high speed. The solution implements a novel technique based on directional image processing that allows not only the estimation of fingerprint image quality, but also the extraction of useful information (in particular, singular points). The digital architecture of the module is detailed, and its features in terms of resource consumption and processing speed are illustrated with implementation results on FPGAs from Xilinx. The performance of the solution has been verified with fingerprints from several standard databases, acquired with sensors of different sizes and technologies (optical, capacitive, and thermal sweeping).

Journal ArticleDOI
TL;DR: An efficient algorithm, working in the JPEG domain, for fusing a pair of long- and short-exposure images; it uses the spatial frequency analysis provided by the discrete cosine transform within JPEG to combine the uniform regions of the long-exposure image with the detailed regions of the short-exposure image, thereby reducing noise while providing sharp details.
Abstract: We present an efficient algorithm for fusing a pair of long- and short-exposure images that works in the JPEG domain. The algorithm uses the spatial frequency analysis provided by the discrete cosine transform within JPEG to combine the uniform regions of the long-exposure image with the detailed regions of the short-exposure image, thereby reducing noise while providing sharp details. Two additional features of the algorithm enable its implementation at low cost, and in real time, on a digital camera: the camera's response between exposures is equalized with a look-up table implementing a parametric sigmoidal function, and image fusion is performed by selective overwriting during the JPEG file save operation. The algorithm requires no more than a single JPEG macro-block of the short-exposure image to be maintained in RAM at any one time, and needs only a single pass over both long- and short-exposure images. The performance of the algorithm is demonstrated with examples of image stabilization and high dynamic range image acquisition.
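The block-selection idea can be sketched with an orthonormal 8 × 8 DCT: blocks whose AC energy in the sharp short-exposure image is low (uniform regions) are taken from the low-noise long exposure, detailed blocks from the short exposure. The threshold value is illustrative, the exact selection rule is an assumption, and exposure equalisation is taken as already applied.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix, as used inside JPEG."""
    k, i = np.mgrid[0:n, 0:n]
    C = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    C[0] /= np.sqrt(2.0)
    return C

def fuse_blocks(long_b, short_b, ac_thresh=50.0):
    """Pick one 8x8 block: short exposure where detail (AC energy) is
    present, long exposure in uniform regions (threshold illustrative)."""
    C = dct_matrix(8)
    coeffs = C @ short_b.astype(float) @ C.T
    ac_energy = np.abs(coeffs).sum() - abs(coeffs[0, 0])
    return short_b if ac_energy > ac_thresh else long_b
```

Because the decision is made per macro-block during the JPEG save, only the current short-exposure block ever needs to be held in RAM, matching the single-pass constraint described above.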

Journal ArticleDOI
TL;DR: The study proposes a method of parallelisation theoretically better tailored to execution on a single machine, along with a number of optimisations to further improve performance, and provides indicative results on multicore CPU and GPU platforms that may be of interest to researchers and practitioners wishing to implement a real-time 3D reconstruction system.
Abstract: Our ultimate aim was to achieve commodity telepresence systems capable of communicating both what someone looks like and what, within the technology-joined space, they are looking at. Towards this we have implemented a previously distributed approach to reconstructing form from multiple video streams, so that it runs on a single computer. Importantly, the way in which the problem is parallelised has been optimised to reflect the various stages of the process rather than the need to minimise data communication across a network. The Exact Polyhedral Visual Hull (EPVH) algorithm had previously been distributed to achieve real-time frame rates. EPVH has five sequential steps, of which four were previously parallelised as two pairs. The metric for parallelisation of each pair was thus the best fit across both sequential steps within it, and the outcome of the first stage of a pair could not determine the parallelisation of the second. We parallelised all five stages according to both distinct metrics and data from the previous stage. In this way we provided a better fit of parallelisation to both process and data. The study proposes a method of parallelisation theoretically better tailored to execution on a single machine, provides a detailed description of the implementation along with a number of optimisations to further improve performance, and gives indicative results, for example on multicore CPU and GPU platforms, that may be of interest to researchers and practitioners wishing to implement a real-time 3D reconstruction system.

Journal ArticleDOI
TL;DR: This work proposes a method which significantly reduces the search space to only a few candidates, and permits the implementation of real-time vision and video encoding algorithms which do not require specialized hardware such as GPUs or FPGAs.
Abstract: In computer vision and video encoding applications, one of the first and most important steps is to establish a pixel-to-pixel correspondence between two images of the same scene obtained at slightly different times or points of view. One of the most popular methods to find these correspondences, known as Area Matching, consists of performing a computationally intensive search for each pixel in the first image, around a neighborhood of the same pixel in the second image. In this work we propose a method which significantly reduces the search space to only a few candidates, and permits the implementation of real-time vision and video encoding algorithms which do not require specialized hardware such as GPUs or FPGAs. Theoretical and experimental support for this method is provided. Specifically, we present results from the application of the method to real-time video compression and transmission, as well as real-time estimation of dense optical flow and stereo disparity maps, where a basic implementation achieves up to 100 fps on a typical dual-core PC.
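The candidate-based matching idea can be sketched as follows. Here the cost is the sum of absolute differences (SAD), and the short candidate list is supplied by the caller as a stand-in for the paper's search-space reduction; in practice such candidates typically come from predictors like neighboring blocks' vectors and the zero displacement:

```python
import numpy as np

def sad(block, ref, y, x):
    """Sum of absolute differences between a block and the reference
    patch whose top-left corner is at (y, x)."""
    h, w = block.shape
    return int(np.abs(ref[y:y+h, x:x+w].astype(int) - block.astype(int)).sum())

def best_match(block, ref, candidates):
    """Evaluate only a short list of candidate displacements instead of
    an exhaustive window search; candidate generation is left to the
    caller as a toy stand-in for the paper's method."""
    return min(candidates, key=lambda c: sad(block, ref, c[0], c[1]))

ref = np.zeros((8, 8), dtype=np.uint8)
block = np.array([[10, 20], [30, 40]], dtype=np.uint8)
ref[2:4, 3:5] = block                      # plant the block at (2, 3)
match = best_match(block, ref, [(0, 0), (2, 3), (5, 5)])
```

With only a handful of candidates per pixel, the cost per correspondence drops from the size of the full search window to the length of the candidate list.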

Journal ArticleDOI
TL;DR: A novel motion-estimation (ME) algorithm tailored for NVIDIA GPU implementation is proposed, accompanied by a novel pipelining technique, called sub-frame ME processing, to effectively hide the communication overhead between the host CPU and the GPU.
Abstract: H.264/AVC video encoders have been widely used for their high coding efficiency. Since the computational demand grows with frame resolution, there has been great interest in accelerating H.264/AVC by parallel processing. Recently, graphics processing units (GPUs) have emerged as a viable target for accelerating general-purpose applications by exploiting fine-grain data parallelism. Despite extensive research efforts to use GPUs to accelerate the H.264/AVC algorithm, none had achieved a speed-up over x264, known as the fastest CPU implementation, mainly due to significant communication overhead between the host CPU and the GPU and intra-frame dependencies in the algorithm. In this paper, we propose a novel motion-estimation (ME) algorithm tailored for NVIDIA GPU implementation. It is accompanied by a novel pipelining technique, called sub-frame ME processing, to effectively hide the communication overhead between the host CPU and the GPU. Further, we incorporate a frame-level parallelization technique to improve the overall throughput. Experimental results show that our proposed H.264 encoder has higher performance than the x264 encoder.
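Sub-frame pipelining can be illustrated with a toy host-side schedule in which the upload of sub-frame i+1 overlaps the processing of sub-frame i. The transfer and motion-estimation steps are mocked here by plain callables; this only illustrates the overlap pattern, not the paper's CUDA implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def process_frame(subframes, upload, motion_estimate):
    """Toy sub-frame pipeline: while sub-frame i is being processed,
    sub-frame i+1 is already uploading on a background thread, hiding
    the (mocked) host-to-device transfer latency."""
    if not subframes:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(upload, subframes[0])
        for i in range(len(subframes)):
            data = pending.result()                      # wait for transfer i
            if i + 1 < len(subframes):
                pending = io.submit(upload, subframes[i + 1])  # overlap i+1
            results.append(motion_estimate(data))        # compute on i
    return results
```

When transfer and compute times are comparable, this double-buffered schedule roughly halves the end-to-end latency compared to alternating the two steps serially.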

Journal ArticleDOI
TL;DR: A new and fast algorithm for star labeling and centroid calculation that needs only one scan of the input image, enabling higher accuracy as well as a higher update rate in star tracking applications.
Abstract: Nowadays, hardware implementation of image and video processing algorithms on application-specific integrated circuits (ASICs) has become a viable target in many applications. Star tracking algorithms are commonly used in space missions to recover the attitude of a satellite or spaceship. The algorithm matches stars seen by the satellite camera with the stars in a catalog to calculate the camera orientation (attitude). The number of stars in the catalog has a major impact on the accuracy of the star tracking algorithm. However, a higher number of stars in the catalog increases the computational burden and decreases the update rate of the algorithm. Hardware implementation of the star tracking algorithm using a parallel and pipelined architecture is a proper solution to ensure higher accuracy as well as a higher update rate. Noise filtering and the detection of stars and their centroids in the camera image are the main stages in most star tracking algorithms. In this paper, we propose a new hardware architecture for star detection and centroid calculation in star tracking applications. The method contains several stages, including noise smoothing with fast Gaussian and median filters, connected component labeling, and centroid calculation. We introduce a new and fast algorithm for star labeling and centroid calculation that needs only one scan of the input image.
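A software sketch of single-scan labeling with centroid accumulation: per-label pixel counts and coordinate sums are gathered in the same raster pass that assigns labels, so centroids fall out at the end. The paper describes a hardware architecture; this union-find version only illustrates the one-scan idea:

```python
import numpy as np

def star_centroids(img, threshold):
    """One raster scan: assign provisional labels with union-find merging,
    accumulating count and coordinate sums per label; centroids are
    resolved after the scan. A software sketch, not the paper's design."""
    parent = {}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    labels = np.zeros(img.shape, dtype=int)
    stats = {}   # provisional label -> [count, sum_y, sum_x]
    nxt = 1
    H, W = img.shape
    for y in range(H):
        for x in range(W):
            if img[y, x] <= threshold:
                continue
            left = labels[y, x - 1] if x > 0 else 0
            up = labels[y - 1, x] if y > 0 else 0
            if left and up:
                ra, rb = find(left), find(up)
                if ra != rb:
                    parent[rb] = ra        # merge the two components
                lab = ra
            elif left or up:
                lab = left or up
            else:
                lab = nxt                  # start a new component
                parent[lab] = lab
                nxt += 1
            labels[y, x] = lab
            c = stats.setdefault(lab, [0, 0.0, 0.0])
            c[0] += 1; c[1] += y; c[2] += x
    merged = {}                            # fold stats into root labels
    for lab, (n, sy, sx) in stats.items():
        m = merged.setdefault(find(lab), [0, 0.0, 0.0])
        m[0] += n; m[1] += sy; m[2] += sx
    return [(sy / n, sx / n) for n, sy, sx in merged.values()]
```

Because all per-star statistics are additive, the merge step after the scan is cheap, which is what makes a one-pass streaming (or hardware-pipelined) implementation possible.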

Journal ArticleDOI
TL;DR: In order to improve the performance of co-occurrence matrix computation and texture feature extraction, an architecture on an FPGA platform is proposed, and the computation of 13 texture features is reduced to 3 using a ranking of Haralick's features.
Abstract: The most popular second-order statistical texture features are derived from the co-occurrence matrix proposed by Haralick. However, computing the matrix and extracting the texture features is very time-consuming. In order to improve the performance of co-occurrence matrix computation and texture feature extraction, we propose an architecture on an FPGA platform. In the proposed architecture, the co-occurrence matrix is computed first, then all thirteen texture features are calculated in parallel from the computed matrix. We have implemented the proposed architecture on a Virtex 5 fx130T-3 FPGA device. Our experimental results show a speedup of 421× over a software implementation on an Intel Core i7 2.0 GHz processor. To improve performance further, we reduce the computation from 13 texture features to 3 using a ranking of Haralick's features, which raises the speedup to 484×.
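For reference, the co-occurrence matrix for a single displacement and three commonly used Haralick features can be sketched as below; the feature subset shown (contrast, energy, homogeneity) is illustrative and is not claimed to match the three retained by the paper's ranking:

```python
import numpy as np

def glcm(img, levels, dy=0, dx=1):
    """Normalized gray-level co-occurrence matrix for one displacement
    (dy, dx); img must hold integer gray levels in [0, levels)."""
    M = np.zeros((levels, levels))
    H, W = img.shape
    for y in range(H - dy):
        for x in range(W - dx):
            M[img[y, x], img[y + dy, x + dx]] += 1
    return M / M.sum()

def features(P):
    """Three second-order texture features from a normalized GLCM."""
    i, j = np.indices(P.shape)
    contrast = ((i - j) ** 2 * P).sum()
    energy = (P ** 2).sum()
    homogeneity = (P / (1.0 + (i - j) ** 2)).sum()
    return contrast, energy, homogeneity
```

Each feature is a reduction over the same matrix, which is why a hardware design can evaluate all of them in parallel once the matrix is available.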

Journal ArticleDOI
TL;DR: A modified accumulation scheme for the Hough transform, using a new parameterization of lines called "PClines", suitable for computer systems with a small but fast read-write memory and for special and low-power processors and special-purpose chips.
Abstract: The Hough transform is a well-known and popular algorithm for detecting lines in raster images. The standard Hough transform is too slow to be usable in real time, so various accelerated and approximated algorithms exist. This study proposes a modified accumulation scheme for the Hough transform, using a new parameterization of lines called "PClines". The algorithm is suitable for computer systems with a small but fast read-write memory, such as today's graphics processors. The algorithm requires no floating-point computations or goniometric functions, which also makes it suitable for special and low-power processors and special-purpose chips. The proposed algorithm is evaluated both on synthetic binary images and on complex real-world photos of high resolution. The results show that using today's commodity graphics chips, the Hough transform can be computed at interactive frame rates, even with a high resolution of the Hough space and with the Hough transform fully computed.
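For context, the classic rho-theta accumulation that PClines reformulates can be sketched as below. Note the per-angle trigonometry and floating-point rounding; the PClines parameterization replaces exactly this with integer line rasterization in two parallel-coordinate spaces:

```python
import numpy as np

def hough_lines(binary, n_theta=180, n_rho=200):
    """Classic rho-theta Hough accumulation over a binary edge image,
    shown only as the baseline that PClines-style schemes reformulate."""
    H, W = binary.shape
    diag = np.hypot(H, W)
    thetas = np.linspace(0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((n_rho, n_theta), dtype=np.int32)
    ys, xs = np.nonzero(binary)
    for t, theta in enumerate(thetas):
        rho = xs * np.cos(theta) + ys * np.sin(theta)
        # map rho in [-diag, diag] onto [0, n_rho - 1] accumulator rows
        r = np.round((rho + diag) * (n_rho - 1) / (2 * diag)).astype(int)
        np.add.at(acc, (r, t), 1)
    return acc, thetas
```

Peaks in the accumulator correspond to lines: a horizontal row of edge pixels votes into a single bin near theta = pi/2.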

Journal ArticleDOI
TL;DR: This special issue presents several papers which address this general topic, and also presents papers dealing with the implementation complexity as well as exploring different opportunities concerning the possible architectures of CPU, GPGPU, FPGA and ASIC implementations.
Abstract: Current embedded systems are increasingly used to support high-performance applications. This is due to the diffusion of these systems in the domain of mobile devices and the large number of services required from them. To support the execution of these applications, several architectures based on CPUs, GPUs or FPGAs have been developed and are still under investigation. When the application demands both performance and flexibility, architectures based on several types of execution resources are of high benefit. In this context, designers define architectures which include all the necessary resources on the same chip, also called a "Multiprocessor System on a Chip" (MPSoC). Image processing is one of the major application areas in the embedded domain, and it requires substantial computational effort. Image processing for medicine, automotive, and video compression is among the main topics addressed by the authors of the Design and Architecture for Image and Signal Processing (DASIP) conference. This special issue presents several papers which address this general topic; it also presents papers dealing with implementation complexity and exploring different opportunities concerning possible CPU, GPGPU, FPGA and ASIC implementations. Comparisons among these different technologies are also presented in an attempt to define the best implementation for each application. Due to the complexity of the applications and architectures, research concerning implementation methodologies is also addressed in this special issue. The main objective is to provide designers with efficient methodologies and tools which can help during the exploration of different implementation opportunities. In the next paragraph, the guest editors provide a brief description of each paper presented in this special issue.
We hope to have provided JRTIP readers with a good collection of papers, and that these selected papers will be a source of inspiration for future work.

Journal ArticleDOI
TL;DR: A high level programming model based on a data flow graph (DFG) allowing an efficient implementation of digital signal processing applications on a multi-GPU computer cluster by automating computation–communication overlap, which can lead to significant speedups as shown in the presented benchmark.
Abstract: Nowadays, it is possible to build a multi-GPU supercomputer, well suited for implementing digital signal processing algorithms, for a few thousand dollars. However, to achieve the highest performance with this kind of architecture, the programmer has to focus on inter-processor communication and task synchronization. In this paper, we propose a high-level programming model based on a data flow graph (DFG) allowing an efficient implementation of digital signal processing applications on a multi-GPU computer cluster. This DFG-based design flow abstracts the underlying architecture. We focus particularly on the efficient implementation of communications by automating computation/communication overlap, which can lead to significant speedups as shown in the presented benchmark. The approach is validated on three experiments: a multi-host multi-GPU benchmark, a 3D granulometry application developed for research on materials, and an application for computing visual saliency maps.
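The DFG execution model can be illustrated with a toy scheduler in which a node fires once all its predecessors have produced values. This is only a sketch of the dependency-driven model; the paper's runtime additionally distributes nodes across GPUs and overlaps transfers with computation:

```python
from collections import defaultdict, deque

def run_dfg(nodes, edges):
    """Execute a data flow graph: nodes maps name -> function taking the
    list of predecessor values; edges is a list of (src, dst) pairs.
    Nodes fire in dependency order (Kahn-style topological traversal)."""
    preds, succs, indeg = defaultdict(list), defaultdict(list), defaultdict(int)
    for s, d in edges:
        preds[d].append(s)
        succs[s].append(d)
        indeg[d] += 1
    values = {}
    ready = deque(n for n in nodes if indeg[n] == 0)   # source nodes
    while ready:
        n = ready.popleft()
        values[n] = nodes[n]([values[p] for p in preds[n]])
        for d in succs[n]:
            indeg[d] -= 1
            if indeg[d] == 0:
                ready.append(d)
    return values
```

Because the graph makes all data dependencies explicit, a runtime is free to reorder, distribute, or overlap independent nodes, which is what the abstraction buys on a multi-GPU cluster.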

Journal ArticleDOI
TL;DR: A flexible VLIW VHDL processor model with a variable instruction set and a customizable architecture which allows exploiting intrinsic parallelism of a target application using advanced compiler technology and implementing it in an optimal manner on FPGA is presented.
Abstract: Modern field programmable gate array (FPGA) chips, with their larger memory capacity and reconfigurability potential, are opening new frontiers in rapid prototyping of embedded systems. With the advent of high-density FPGAs, it is now possible to implement a high-performance VLIW (very long instruction word) processor core in an FPGA. With a VLIW architecture, the processor's effectiveness depends on the ability of compilers to extract sufficient ILP (instruction-level parallelism) from program code. This paper describes research results on enabling the VLIW processor model for real-time processing applications by exploiting FPGA technology. Our goals are to keep the flexibility of processors to shorten the development cycle, and to use the powerful FPGA resources to increase real-time performance. We present a flexible VLIW VHDL processor model with a variable instruction set and a customizable architecture, which allows exploiting the intrinsic parallelism of a target application using advanced compiler technology and implementing it in an optimal manner on an FPGA. Some common image processing algorithms were tested and validated using the proposed development cycle. We also realized the rapid prototyping of embedded contactless palmprint extraction on an FPGA Virtex-6 based board for a biometric application and obtained a processing time of 145.6 ms per image. Our approach satisfies key criteria for co-design tools: flexibility, modularity, performance, and reusability.