Showing papers presented at "Southern Conference Programmable Logic in 2012"


Proceedings ArticleDOI
20 Mar 2012
TL;DR: In this article, convolution was implemented in each of the aforementioned architectures with the following languages: CUDA for GPUs and Verilog for FPGAs, and the same algorithms were also implemented in MATLAB, using predefined operations and in C using a regular x86 quad-core processor.
Abstract: Convolution is one of the most important operators used in image processing. With the constant need to increase performance in high-end applications and the rising popularity of parallel architectures, such as GPUs and those implemented in FPGAs, comes the need to compare these architectures in order to determine which of them performs better and in what scenario. In this article, convolution was implemented in each of the aforementioned architectures with the following languages: CUDA for GPUs and Verilog for FPGAs. In addition, the same algorithms were also implemented in MATLAB, using predefined operations, and in C using a regular x86 quad-core processor. Comparative performance measures, considering the execution time and the clock ratio, were taken and are discussed in the paper. Overall, it was possible to achieve a CUDA speedup of roughly 200× in comparison to C, 70× in comparison to MATLAB and 20× in comparison to FPGA.
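
For readers unfamiliar with the operator being benchmarked, a direct 2-D convolution can be sketched as follows. This is a plain reference formulation, not the authors' CUDA, Verilog, MATLAB or C code; the function name `convolve2d`, the zero padding at the borders and the same-size output are assumptions made for illustration.

```python
def convolve2d(image, kernel):
    """Direct 2-D convolution with zero padding; output has the input's size.

    Assumes an odd-sized kernel given as a list of rows (illustrative only).
    """
    rows, cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    r_off, c_off = k_rows // 2, k_cols // 2
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    # Convolution flips the kernel; correlation would not.
                    rr = r + r_off - i
                    cc = c + c_off - j
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += image[rr][cc] * kernel[i][j]
            out[r][c] = acc
    return out
```

Every output pixel depends only on a small input neighbourhood, so the two outer loops can run in parallel, which is the property GPU and FPGA implementations exploit.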

36 citations


Proceedings ArticleDOI
20 Mar 2012
TL;DR: This paper presents experimental measurements of power consumption using different techniques to turn off part of a system and switch between active and standby modes; the main techniques analyzed are clock gating, clock enable, and blocking inputs.
Abstract: This paper presents experimental measurements of power consumption using different techniques to turn off part of a system and switch between active and standby modes. The main techniques analyzed are clock gating, clock enable, and blocking inputs. The laboratory work is described, including the measurement setups and the benchmark circuits. Quantitative measurements on both a 65 nm CMOS Cyclone III FPGA and a 45 nm CMOS Spartan 6 FPGA are presented. The circuits selected as benchmarks are different types of multipliers. Results of power consumption in active and standby modes are presented and compared.

32 citations


Proceedings ArticleDOI
20 Mar 2012
TL;DR: This paper presents the hardware design of a 16-point 1-D DCT used in the emerging video coding standard HEVC (High Efficiency Video Coding), and is the first work in the literature that presents hardware results for the HEVC transforms.
Abstract: This paper presents the hardware design of a 16-point 1-D DCT used in the emerging video coding standard HEVC (High Efficiency Video Coding). The 1-D DCT is used by the 16×16 2-D DCT of the HEVC standard. The transform stage is one of the innovations proposed by HEVC, not only because of the variable size (from 4×4 to 32×32) but also because transforms larger than the traditional 4×4 and 8×8 are used. The hardware design presented in this work focuses on low cost and high throughput. To achieve these objectives, the 16-point algorithm from HEVC was simplified so that a more efficient hardware design could be implemented. Several strategies were used during this simplification, such as reordering operations, factoring to reduce the length of the operators, turning multiplications by constants into shifts and adds, and sharing sub-expressions, among others. The architecture was designed in a fully combinational way in order to reduce hardware overhead. Synthesis results obtained using Altera FPGAs from the Cyclone II and Stratix III families showed a reduction in hardware resources of up to 72% when compared to an architecture described as a direct transcription of the non-optimized version of the algorithm. Even with a purely combinational implementation, the designed architecture achieved a throughput between 376 Msamples/s and 1.4 Gsamples/s. With these results, the architecture is capable of processing, in the worst case, more than 30 QFHD frames (3840×2160 pixels) per second. Therefore, the architecture is capable of processing videos with significantly high resolutions in real time. To the best of our knowledge, this is the first work in the literature that presents hardware results for the HEVC transforms.
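
Two of the claims above can be checked with a few lines of arithmetic. The sketch below first shows what "multiplication by a constant turned into shifts and adds" means, using the constant 90 purely as an illustrative value (the abstract does not list the constants the authors actually decomposed). It then reproduces the worst-case frame rate, assuming 4:2:0 chroma subsampling (1.5 samples per pixel), which is an assumption not stated in the abstract.

```python
def times_90(x):
    """Multiply by 90 using only shifts and adds: 90 = 64 + 16 + 8 + 2."""
    return (x << 6) + (x << 4) + (x << 3) + (x << 1)


def frames_per_second(samples_per_second, width, height, samples_per_pixel=1.5):
    """Frame rate sustained at a given sample rate.

    samples_per_pixel=1.5 assumes 4:2:0 video (hypothetical, for illustration).
    """
    return samples_per_second / (width * height * samples_per_pixel)


# Worst-case throughput quoted in the abstract: 376 Msamples/s.
QFHD_FPS = frames_per_second(376e6, 3840, 2160)
```

Under that assumption `QFHD_FPS` comes out slightly above 30, which is consistent with the "more than 30 QFHD frames per second" figure.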

25 citations


Proceedings ArticleDOI
20 Mar 2012
TL;DR: A hardware architecture for motion detection based on the background subtraction algorithm, implemented on FPGAs (Field Programmable Gate Arrays), which provides one processed pixel per FPGA clock cycle and speeds up the software implementation by a factor of 32.
Abstract: Currently, both the market and the academic community demand applications based on image and video processing with several real-time constraints. In addition, detection of moving objects is a very important task in mobile robotics and surveillance applications. In order to achieve an alternative design that allows for rapid development of real-time motion detection systems, this paper proposes a hardware architecture for motion detection based on the background subtraction algorithm, which is implemented on FPGAs (Field Programmable Gate Arrays). To achieve this, the following steps are executed: (a) a background image (in gray-level format) is stored in an external SRAM memory, (b) a low-pass filter is applied to both the stored and current images, (c) the two images are subtracted, and (d) a morphological filter is applied to the resulting image. Afterward, the center of gravity of the object is calculated and sent to a PC (via an RS-232 interface). Both the practical results of the motion detection system and the synthesis results have demonstrated the feasibility of implementing the proposed algorithms on an FPGA-based hardware platform. The implemented system provides one processed pixel per FPGA clock cycle (after the latency time) and speeds up the software implementation (using the real-time xPC Target OS from MathWorks) by a factor of 32.
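
Steps (b) to (d) and the center-of-gravity calculation can be sketched in software as follows. This is only an illustrative model: the 3×3 mean filter, the threshold of 30, the choice of erosion as the morphological filter, and all function names are assumptions, since the abstract does not specify them.

```python
def box_filter(img):
    """3x3 mean (low-pass) filter; border pixels average the neighbours that exist."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            vals = [img[rr][cc]
                    for rr in range(max(0, r - 1), min(h, r + 2))
                    for cc in range(max(0, c - 1), min(w, c + 2))]
            out[r][c] = sum(vals) // len(vals)
    return out


def erode(mask):
    """3x3 binary erosion: keep a pixel only if its whole neighbourhood is set."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            out[r][c] = int(all(mask[rr][cc]
                                for rr in range(r - 1, r + 2)
                                for cc in range(c - 1, c + 2)))
    return out


def detect_motion(background, frame, threshold=30):
    """Return the (row, col) center of gravity of the moving region, or None."""
    bg, cur = box_filter(background), box_filter(frame)       # step (b)
    mask = [[int(abs(a - b) > threshold)                      # step (c) + threshold
             for a, b in zip(row_cur, row_bg)]
            for row_cur, row_bg in zip(cur, bg)]
    mask = erode(mask)                                        # step (d)
    pts = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pts:
        return None
    return (sum(r for r, _ in pts) / len(pts),
            sum(c for _, c in pts) / len(pts))
```

Each stage reads only a small window around the current pixel, which is what makes a streaming, one-pixel-per-clock hardware pipeline plausible.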

24 citations


Proceedings ArticleDOI
20 Mar 2012
TL;DR: This work presents an architecture to compute matrix inversions in a hardware reconfigurable FPGA using different floating-point precisions: single, double and 40-bit.
Abstract: This work presents an architecture to compute matrix inversions in a hardware reconfigurable FPGA using different floating-point precisions: single, double and 40-bit. The architectural approach is divided into five principal parts, four modules and one unit, namely the Change Row Module, Pivo Module, Matrix Elimination Module, Normalization Module and finally the Gauss-Jordan Control-Circuit Unit. This division makes it possible to work with smaller arithmetic units that are organized to maintain the accuracy of the results without the need to internally normalize and de-normalize the floating-point data. The implementation of the operations and of the units as a whole takes advantage of the resources available in the Virtex-5 FPGA. The error propagation and resource consumption of the implementation, especially the internal RAM memory blocks that are used, constitute improvements when compared with previous work of the authors and with other more elaborate architectures whose implementations are significantly more complex than the current one and thus unsuitable for its application. The approach is validated by implementing benchmarks based on solutions in FPGA and software (e.g. Matlab) implemented previously.
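
The module names above map naturally onto the steps of Gauss-Jordan elimination. The sketch below is a software illustration of that algorithm, not the authors' architecture: the use of partial pivoting for the row change, the order in which normalization and elimination are interleaved, and the function name `invert` are all assumptions.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with row exchange."""
    n = len(matrix)
    # Augment with the identity: [A | I] is reduced to [I | A^-1].
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Row change: bring the largest-magnitude candidate onto the diagonal.
        pivot_row = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot_row][col] == 0.0:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        # Normalization: scale the pivot row so the pivot becomes 1.
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]
        # Elimination: clear the pivot column in every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]
```

Running the same routine at different floating-point widths is one way a software benchmark could be compared against single, double and 40-bit hardware results.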

16 citations


Proceedings ArticleDOI
20 Mar 2012
TL;DR: HardNoC is a platform based on simple modules to inject traffic and collect basic statistics of NoCs, used to early validate NoC designs and to provide initial numerical results for NoC evaluation and design.
Abstract: The use of intrachip buses is no longer a consensus to build interconnection architectures for complex integrated circuits. Networks on chip (NoCs) are a choice in several real designs. However, the distributed nature of NoCs, the huge amount of wires and interfaces of large NoCs can make system/interconnection architecture debugging a nightmare. This work accelerates the NoC validation process using FPGA prototyping. HardNoC is a platform based on simple modules to inject traffic and collect basic statistics of NoCs. It can be used to early validate NoC designs and to provide initial numerical results for NoC evaluation and design.

15 citations


Proceedings ArticleDOI
20 Mar 2012
TL;DR: The use of a simple processor named BIP (Basic Instruction-set Processor), developed by applying a multidisciplinary approach, is discussed in courses on digital circuits and systems design.
Abstract: The design of digital circuits and systems is a topic covered in undergraduate courses on Computer Science, Computer Engineering, and Electrical Engineering. Simple processor architectures are used as examples of digital systems to apply and integrate the concepts studied in these courses. In this paper, we discuss the use of a simple processor named BIP (Basic Instruction-set Processor) in courses on digital circuits and systems design. BIP is distinguished from similar processors because it was developed by applying a multidisciplinary approach in order to allow its use in introductory courses on computer programming and in several other courses in the Computer Science area.

12 citations


Proceedings ArticleDOI
20 Mar 2012
TL;DR: This work proposes the use of dual port BRAM often available in modern FPGAs to implement a core using Memory mapped I/O (MMIO) and presents the development of an AVR microcontroller core with the Media Access Controller (MAC) Ethernet built in.
Abstract: Nowadays, Direct Memory Access (DMA) is one of the most used mechanisms for data transfer between a processor and its peripherals. Another possibility is to map peripherals directly in the memory space, which has the disadvantage of requiring dual port memories when the device handles large quantities of data. This is typically the case in video and network applications. In this work we propose the use of the dual port BRAM often available in modern FPGAs to implement a core using memory-mapped I/O (MMIO). As a case study, we present the development of an AVR microcontroller core with a built-in Ethernet Media Access Controller (MAC). It is capable of running the uIP TCP/IP stack, with a Web Server as an example application. Additionally, we discuss the advantages of moving the program code to an external memory that uses the Common Flash Interface (CFI) standard. This design was simulated with Free Software tools and it was verified in hardware using a Xilinx Virtex 4 FPGA.

11 citations


Proceedings ArticleDOI
20 Mar 2012
TL;DR: This paper describes and analyzes the main features of the Hardware Real-Time Scheduler Coprocessor unit (HRTSC) for the NIOS II processor and describes how the HRTSC supports time, events, tasks and priorities.
Abstract: In this paper we describe and analyze the main features of the Hardware Real-Time Scheduler Coprocessor unit (HRTSC) for the NIOS II processor. We describe how the HRTSC supports time, events, tasks and priorities. The HRTSC was designed as a SOPC component to incorporate real-time features for embedded real-time applications. The hardware architecture integrates easily with the IDE programming environment. The Avalon interface proved to be an efficient specification for sharing memory and data communication among memory, processor and HRTSC. The performance of the HRTSC architecture is analyzed considering real-time flexibility, programmability and power consumption reduction.

11 citations


Proceedings ArticleDOI
20 Mar 2012
TL;DR: The hardware synthesis and performance results show that the designed cryptoprocessor presents a good area-throughput trade-off and it can be used as a suitable core for an RSA cryptosystem embedded into a SoC.
Abstract: This paper presents the design of an 8192-bit RSA cryptoprocessor using a radix 2 Montgomery multiplier based on a systolic architecture. In this case, the Montgomery multiplier simultaneously performs two multiplications, and the cryptoprocessor carries out the modular exponentiation using the binary exponentiation algorithm. The designs are described using generic structural VHDL and synthesized on the EP3SL150F1152C2, using Quartus II 11. The hardware synthesis and performance results show that the designed cryptoprocessor presents a good area-throughput trade-off and it can be used as a suitable core for an RSA cryptosystem embedded into a SoC.
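
The two algorithms named above can be sketched in software as follows. This is a bit-serial reference model, not the authors' systolic VHDL design. The right-to-left form of binary exponentiation is an assumption (the abstract does not say which variant is used); it is chosen here because its two products per iteration are independent, which is the kind of workload a multiplier performing two multiplications at once could serve.

```python
def mont_mul(a, b, modulus, n_bits):
    """Radix-2 Montgomery product: a * b * 2^(-n_bits) mod modulus.

    Requires an odd modulus and a, b < modulus < 2^n_bits.
    """
    s = 0
    for i in range(n_bits):
        if (a >> i) & 1:      # add b when bit i of a is set
            s += b
        if s & 1:             # make s even so the halving below is exact
            s += modulus
        s >>= 1
    return s - modulus if s >= modulus else s


def mod_exp(base, exponent, modulus):
    """Right-to-left binary exponentiation carried out in the Montgomery domain."""
    n_bits = modulus.bit_length()
    r2 = pow(2, 2 * n_bits, modulus)                    # R^2 mod N, R = 2^n_bits
    x = mont_mul(base % modulus, r2, modulus, n_bits)   # base in Montgomery form
    acc = mont_mul(1, r2, modulus, n_bits)              # 1 in Montgomery form
    e = exponent
    while e:
        if e & 1:
            acc = mont_mul(acc, x, modulus, n_bits)     # multiply
        x = mont_mul(x, x, modulus, n_bits)             # square (independent of acc)
        e >>= 1
    return mont_mul(acc, 1, modulus, n_bits)            # leave the Montgomery domain
```

For an 8192-bit modulus each `mont_mul` call runs 8192 iterations, and an exponentiation needs up to two such calls per exponent bit, which illustrates why the multiplier dominates the design.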

10 citations


Proceedings ArticleDOI
20 Mar 2012
TL;DR: This paper presents the Reference Frame Context Adaptive Variable-Length Compressor (RFCAVLC) for video coding systems, and results indicate that this solution can be easily coupled to a complete video encoder system with negligible hardware overhead and without compromising throughput.
Abstract: This paper presents the Reference Frame Context Adaptive Variable-Length Compressor (RFCAVLC) for video coding systems. RFCAVLC aims to reduce the external memory bandwidth required to carry out this process. Six experiments were performed, all based on adaptations of the Huffman algorithm, and the best experiment achieved an average compression rate of more than 24% without any loss in quality for all targeted resolutions. This result is similar to the best solutions proposed in the literature, but it is the only one without losses in this process. The presented RFCAVLC splits the reference frames into 4×4 blocks and compresses these blocks using one of four static code tables in a context-adaptive way. An architecture that implements the encoder of the RFCAVLC solution was described in VHDL and synthesized to an Altera Stratix IV FPGA. The synthesis results achieved by the designed architecture indicate that this solution can be easily coupled to a complete video encoder system with negligible hardware overhead and without compromising throughput.

Proceedings ArticleDOI
20 Mar 2012
TL;DR: In this work a co-design flow for processor centric embedded systems with hardware acceleration using FPGAs is proposed, which helps to reduce design effort by raising abstraction level while not imposing the need for engineers to learn new languages and tools.
Abstract: In this work a co-design flow for processor-centric embedded systems with hardware acceleration using FPGAs is proposed. This flow helps to reduce design effort by raising the abstraction level while not imposing the need for engineers to learn new languages and tools. The whole system is designed using well-established high-level modeling techniques, languages and tools from the software domain, that is, an OOP design approach expressed in UML and implemented in C++. Software coding effort is reduced since the C++ implementation not only provides a golden reference model, but may also be used as part of the final embedded software. Hardware coding effort is also reduced. The modular OOP design helps the engineer find the exact methods that need to be accelerated by hardware using profiling tools, preventing useless translations to hardware. Moreover, the two-process structured VHDL design method used for hardware implementation has proven to reduce man-years, code lines and bugs in many major developments. A real-time image processing application for multiple robot localization is presented as a case study. The overall time improvement from the original software solution to the final hardware accelerated solution is 9.7×, with only a 4% increase in area (143 extra slices). The embedded solution achieved following the proposed methodology runs 17% faster than on a standard PC, and it is a much smaller, cheaper and less power-consuming solution.

Proceedings ArticleDOI
20 Mar 2012
TL;DR: This paper presents an MPEG-4 AAC decoder described in VHDL language and compliant with the Brazilian Digital Television standard (SBTVD), synthesized to an Altera Cyclone II 2C35 FPGA using 26549 logic elements and 248704 memory bits.
Abstract: This paper presents an MPEG-4 AAC decoder described in VHDL language and compliant with the Brazilian Digital Television standard (SBTVD). It has been synthesized to an Altera Cyclone II 2C35 FPGA using 26549 logic elements and 248704 memory bits. The implemented architecture has been verified using an Altera DE2 prototyping board, being capable of decoding stereo signals coded as MPEG-4 AAC Low Complexity audio objects. The minimum operating frequency required for real time decoding of a stereo audio stream with a sampling rate of 48 kHz is 4 MHz and the implemented decoder is capable of running at 56 MHz, meeting the requirements. This decoder design is intended to be integrated with a system on chip for the SBTVD set-top box.

Proceedings ArticleDOI
20 Mar 2012
TL;DR: Taking advantage of the heterogeneous resources of FPGAs, e.g. embedded memory and digital signal processing blocks, the performance of the architecture is improved; the use of DSP blocks shortens the critical path, increasing the maximum frequency, which enables the architecture to process 60 HD1080p frames per second.
Abstract: Video coding applications are found in a wide range of devices and require application-specific hardware support to deal with the ever increasing computational complexity of advanced video coding standards. Designing an application-specific circuit for the intra-frame prediction module of the H.264/AVC standard is the most efficient solution; however, it makes future design changes difficult and costly. This work presents an H.264/AVC intra-frame prediction hardware architecture targeting Field-Programmable Gate Arrays (FPGAs). Taking advantage of the heterogeneous resources of FPGAs, e.g. embedded memory and digital signal processing blocks, the performance of our architecture is improved. Storing intermediate data in block RAM memories reduces the number of cycles needed to process a macroblock by up to 73% and the memory bandwidth by 75%. The use of DSP blocks shortens the critical path, increasing the maximum frequency, which enables the architecture to process 60 HD1080p frames per second.

Proceedings ArticleDOI
20 Mar 2012
TL;DR: The present work first details an MPSoC architecture, which supports the execution of distributed applications, including an operating system enabling multitask execution at each processing element, and a framework able to cover the design steps previously mentioned is presented.
Abstract: The design of a Multiprocessor System-on-Chip (MPSoC) is a complex task, including steps such as application development, platform configuration, code generation, task mapping onto the platform and debugging. An integrated environment covering most of these steps is a gap in the literature. The present work first details an MPSoC architecture, which supports the execution of distributed applications, including an operating system enabling multitask execution at each processing element. The MPSoC is heterogeneous, due to its support for different processor architectures. Then, a framework able to cover the design steps previously mentioned is presented. The framework enables design space exploration for applications to be executed in the MPSoC, varying for example the number and type of processors, the memory size, and the task mapping. Results demonstrate the correct operation of different MPSoC configurations generated from the proposed framework. This open-source framework enables the research community to investigate new subjects related to MPSoC and Network on Chip (NoC) design, as well as to evaluate distributed applications in a multiprocessor environment.

Proceedings ArticleDOI
20 Mar 2012
TL;DR: An FPGA implementation of the soft parity check node for min-sum LDPC decoders is analyzed and it is shown that more than 60% of the hardware resources of the CNPU are used for finding the two smallest input values.
Abstract: A typical high-speed decoder implementation for an LDPC code may require hundreds or even thousands of variable and check node processors. Since the check node processing unit (CNPU) is far more complex than the variable node processing unit, the hardware requirements of the CNPU have a big impact on the final decoder complexity. Here, an FPGA implementation of the soft parity check node for min-sum LDPC decoders is analyzed. The hardware cost and speed of the main block of the CNPU, which finds the two smallest input values, are thoroughly studied for different numbers of input values with different bit-widths. Experiments for an FPGA implementation demonstrate that hardware cost and speed vary with the number of input values in the same way as they do for an ASIC implementation. Furthermore, it is shown that more than 60% of the hardware resources of the CNPU are used for finding the two smallest input values.
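
The reason the two smallest values matter can be seen in a short software model of a min-sum check node. The sketch below is a generic textbook formulation, not the authors' hardware; the function names and the use of floating-point log-likelihood ratios are assumptions for illustration.

```python
def two_smallest(values):
    """Return (min1, min2, index_of_min1) in one pass over the magnitudes."""
    min1 = min2 = float("inf")
    idx1 = -1
    for i, v in enumerate(values):
        if v < min1:
            min1, min2, idx1 = v, min1, i
        elif v < min2:
            min2 = v
    return min1, min2, idx1


def check_node_update(llrs):
    """Min-sum check node: each output excludes the input from its own edge."""
    min1, min2, idx1 = two_smallest([abs(x) for x in llrs])
    negatives = sum(1 for x in llrs if x < 0)
    out = []
    for i, x in enumerate(llrs):
        # The smallest magnitude among the *other* inputs is min1, except on
        # the edge that supplied min1 itself, where it is min2.
        magnitude = min2 if i == idx1 else min1
        # Sign of the product of all other inputs.
        others_negative = negatives - (1 if x < 0 else 0)
        out.append(-magnitude if others_negative % 2 else magnitude)
    return out
```

Only `min1`, `min2` and the position of `min1` are needed to produce every output magnitude, so the cost of the node is dominated by the search for those two values.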

Proceedings ArticleDOI
20 Mar 2012
TL;DR: This paper proposes a new asynchronous GALS wrapper architecture for FPGAs that is free from essential hazards, requires no special care in the choice of LUTs during implementation, and is fully compatible with FPGAs.
Abstract: Contemporary digital systems must be based on the “System-on-Chip (SoC)” concept. An interesting style for SoC design is the GALS paradigm (Globally Asynchronous, Locally Synchronous), which can be used to implement circuits in FPGAs (Field Programmable Gate Arrays), but the implementation of asynchronous interfaces (asynchronous wrappers, AW) constitutes a major drawback for this kind of device. Although there is a typical AW design style which is based on asynchronous controllers and provides communication between modules (called ports), port controllers are subject to essential hazards when implemented in FPGAs. In this context, this paper proposes a new asynchronous GALS wrapper architecture to be implemented in FPGAs that is free from essential hazards, requiring no special care in the choice of LUTs during implementation and being fully compatible with FPGAs. Additional advantages of the proposed architecture are the total autonomy that synchronous modules achieve when interacting with the asynchronous wrapper; its ports can be synthesized in the direct mapping style (so without knowledge of asynchronous logic synthesis); and its ports interact in Ib/Ob mode, not needing a timing analysis and also being more robust than GFM.

Proceedings ArticleDOI
20 Mar 2012
TL;DR: This paper shows how the history of electronic technology, like that of science, can be seen as a series of revolutions, which can be predicted by means of two projections: Moore's Law and Makimoto's Wave.
Abstract: The intention of this paper is to show how the history of electronic technology, like that of science, can be seen as a series of revolutions. Such changes can be predicted by means of two projections: Moore's Law and Makimoto's Wave. The first one, currently normative in nature, indicates the course the semiconductor industry must follow. The second one, analytical, describes the industry's behavior as a consequence of observation.

Proceedings ArticleDOI
20 Mar 2012
TL;DR: The main goal of this methodology is to create an FPGA environment to emulate such soft core, which is fully compatible to test the manufactured ASIC.
Abstract: The Brazilian government has been investing in microelectronics, especially in hardware education as a strategic factor. In the literature, FPGA-based methodologies have been widely used in teaching hardware and embedded systems design. However, these methodologies do not take into account timing design constraints and an in-depth verification process, which are essential to understand physical issues and to reduce non-recurring engineering costs and fault risks. This paper presents a design methodology that integrates functionally verified ASIC soft cores into an FPGA. The main goal of this methodology is to create an FPGA environment to emulate such a soft core, which is fully compatible with testing the manufactured ASIC. As a result of applying this methodology in education, students can learn the fundamentals of hardware and its design challenges, covering not only development but also verification and physical implications.

Proceedings ArticleDOI
20 Mar 2012
TL;DR: A PCB-level platform which can be used as a common platform for SRAM-based FPGAs has been designed with a BQV series FPGA from BMTI, and the complete testing time for the logic resources of the BQV600 has been decreased significantly.
Abstract: A dynamic reconfiguration system allows hardware resources to be allocated dynamically as needed by particular applications. This paper focuses on the application of an FPGA-based partial dynamic reconfiguration system (PDRS) with a configurable boundary-scan circuit (CBSC), which can be used in many different fields, especially in the military and aerospace fields. Generally speaking, if an important function such as key encryption needs to be changed in a PDRS operating on a high security system, the corresponding logic resources need to be verified and tested again before being reconfigured. By making use of the CBSC technology, fault diagnosis for the target FPGA is accelerated and the reliability of the PDRS is improved. A PCB-level platform which can be used as a common platform for SRAM-based FPGAs has been designed with a BQV series FPGA from BMTI in this paper. Verified with test vectors for the BQV600, the complete testing time for the logic resources of the BQV600 has been decreased significantly.

Proceedings ArticleDOI
20 Mar 2012
TL;DR: This work presents an FPGA-based hardware architecture for the H.264/AVC motion vector predictor targeting HD1080p resolution and shows that the architecture uses few hardware resources and can process up to 52 HD1080p frames per second.
Abstract: Motion vector coding is an important issue in low bitrate video coding, since it increases the efficiency of modern video encoders. Motion vector prediction exploits the correlation between the motion of neighboring blocks, since they may represent the same object and thus present the same movement direction. Motion vector prediction takes the difference between the current motion vector and the predictive motion vector (PMV), which is generated using the neighboring blocks as reference. This way, only the motion vector difference (MVD) is sent to the bit stream. Due to its performance, motion vector prediction is defined as a mandatory tool in the H.264/AVC standard. This work presents an FPGA-based hardware architecture for the H.264/AVC motion vector predictor targeting HD1080p resolution. The architecture was described in VHDL and synthesized to a Xilinx xc5vlx30 Virtex V FPGA. The results were compared with one motion vector prediction architecture from the literature. Our design has shown better results than the related work in terms of hardware usage and throughput. In addition, we used a motion estimation and motion compensation architecture composing a whole inter-frame prediction module to better evaluate the results generated by our proposed motion vector predictor architecture. The results have shown that our architecture uses few hardware resources and can process up to 52 HD1080p frames per second.
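
The relationship between the motion vector, the PMV and the MVD can be sketched as below. The sketch uses the component-wise median of three neighbouring vectors, which is the general case in H.264/AVC; the special cases for particular partition shapes and for unavailable neighbours are deliberately omitted, and the function names are hypothetical.

```python
def median3(a, b, c):
    """Median of three scalars."""
    return sorted((a, b, c))[1]


def predict_mv(left, top, top_right):
    """PMV as the component-wise median of three neighbouring motion vectors."""
    return (median3(left[0], top[0], top_right[0]),
            median3(left[1], top[1], top_right[1]))


def encode_mvd(mv, pmv):
    """Encoder side: only the difference MVD = MV - PMV goes to the bit stream."""
    return (mv[0] - pmv[0], mv[1] - pmv[1])


def decode_mv(mvd, pmv):
    """Decoder side: rebuild MV = MVD + PMV from the same neighbours."""
    return (mvd[0] + pmv[0], mvd[1] + pmv[1])
```

When neighbouring blocks move together, the PMV is close to the actual vector and the MVD values stay small, which is what makes them cheap to entropy-code.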

Proceedings ArticleDOI
20 Mar 2012
TL;DR: A set of well defined steps to design functional verification monitors intended to verify Floating Point Units (FPU) described in HDL, and an already verified reference model is used in order to test the correctness of the Device Under Verification (DUV).
Abstract: This paper proposes a set of well-defined steps to design functional verification monitors intended to verify Floating Point Units (FPUs) described in HDL. The first step consists of defining the input and output domain coverage. Next, the corner cases are defined. Finally, an already verified reference model is used in order to test the correctness of the Device Under Verification (DUV). As a case study, a monitor for an IEEE 754-2008 compliant design is implemented. This monitor is built to be easily instantiated into verification frameworks such as OVM. Two different designs were verified, reaching complete input coverage and successful compliance results.

Proceedings ArticleDOI
20 Mar 2012
TL;DR: This paper presents an efficient implementation of a fully pipelined decimal multiplier designed with Carry Save Addition and coded into a reduced group of BCD-4221.
Abstract: Decimal multiplication is one of the most frequently used operations in financial, scientific, commercial and internet-based applications. This paper presents an efficient implementation of a fully pipelined decimal multiplier designed with Carry Save Addition and coded into a reduced group of BCD-4221. This design is based on multiplier operands recoded in Signed-Digit radix-10, a simplified partial products generator, and decimal adders. A variety of multiplier architectures are implemented on a Virtex-6 FPGA device. Several assessments are carried out for various N by M multiplications, and their respective synthesis results show slightly optimistic figures in terms of area and delay with regard to some previously published works.
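
As background on the digit coding mentioned above: in BCD-4221 the four bits of a decimal digit carry the weights 4, 2, 2 and 1. The sketch below decodes such a code and shows one well-known consequence of the weights summing to 9, namely that inverting all four bits yields the nine's complement of the digit. The encoding table is just one valid choice (several digits have more than one 4221 representation) and is an assumption for illustration, not the "reduced group" used by the authors.

```python
WEIGHTS = (4, 2, 2, 1)

# One valid 4221 code per digit, MSB first (illustrative choice, not unique).
ENCODE_4221 = {
    0: (0, 0, 0, 0), 1: (0, 0, 0, 1), 2: (0, 0, 1, 0), 3: (0, 0, 1, 1),
    4: (1, 0, 0, 0), 5: (1, 0, 0, 1), 6: (1, 0, 1, 0), 7: (1, 0, 1, 1),
    8: (1, 1, 1, 0), 9: (1, 1, 1, 1),
}


def decode_4221(bits):
    """Value of a 4-bit BCD-4221 code given as a tuple of bits, MSB first."""
    return sum(w * b for w, b in zip(WEIGHTS, bits))


def nines_complement(bits):
    """Invert every bit; because the weights sum to 9 the value becomes 9 - d."""
    return tuple(1 - b for b in bits)
```

Because every 4-bit pattern is a valid digit (values 0 to 9 only), bitwise operations never produce an illegal code, unlike in conventional BCD-8421.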

Proceedings ArticleDOI
20 Mar 2012
TL;DR: This paper presents a high throughput and low off-chip memory bandwidth Motion and Disparity Estimation architecture targeting the Multiview Video Coding requirements and presents the best efficiency in terms of off-chip memory access and maximum throughput at this data input.
Abstract: This paper presents a high throughput and low off-chip memory bandwidth Motion and Disparity Estimation architecture targeting the Multiview Video Coding requirements. The ME and DE modules are the critical paths in the multiview encoding process, corresponding to up to 80% of the encoding time. In addition, these two modules are responsible for more than 70% of the off-chip memory accesses. The goal of this work is to design a hardware architecture that deals with these two constraints. The design space exploration points to the best balance between area and throughput. The Memory Hierarchy also allows a reduction of 87% in memory accesses when compared to a solution without memory management. The synthesis results for the FPGA implementation show that the ME/DE architecture is able to process up to 5-view HD 1080p multiview videos in real time in a typical prediction structure with 2 reference frames (temporal and disparity neighbors). When compared to related works, this work presents the best efficiency in terms of off-chip memory access and maximum throughput at this data input.

Proceedings ArticleDOI
20 Mar 2012
TL;DR: Simulation results showed that the R2SDF processor is able to compute the FFT for a 2048-point sequence in 180 ms, being almost five times faster than the FSSR2B processor.
Abstract: This paper presents a Radix-2 Single-Path Delay Feedback (R2SDF) configurable processor to calculate 64/128/512/1024/2048-point Fast Fourier Transform (FFT). Such range of FFT input sequences allows for the realization of the widely used wireless protocols IEEE 802.11n (WLAN) and the IEEE 802.16 (WiMax). The presented R2SDF configurable processor, as well as a fully sequential configurable processor that uses a single radix-2 butterfly (FSSR2B), were synthesized into Cyclone III Altera FPGAs to allow for a comparison in terms of hardware resources and performance. The R2SDF processor required more FPGA resources than the FSSR2B processor, mostly in the datapath. However, the overhead in terms of memory bits and registers was moderate. On the other hand, simulation results showed that the R2SDF processor is able to compute the FFT for a 2048-point sequence in 180 ms, being almost five times faster than the FSSR2B processor. The average relative error was evaluated by comparing the results provided by the designed FFT processors to that obtained from a software implementation of the FFT algorithm.
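
The kind of software reference and error measurement described in the last sentence can be sketched as follows. This is a recursive radix-2 FFT compared against a direct DFT; it models neither the R2SDF pipeline nor the FSSR2B datapath, and the exact definition of "average relative error" used here (mean of per-bin relative magnitudes) is an assumption.

```python
import cmath


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft_radix2(x[0::2]), fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Radix-2 butterfly: one twiddle multiplication, one add, one subtract.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def dft(x):
    """Direct O(n^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


def average_relative_error(result, reference):
    """Mean of |result - reference| / |reference| over non-zero reference bins."""
    errs = [abs(a - b) / abs(b) for a, b in zip(result, reference) if abs(b) > 0]
    return sum(errs) / len(errs)
```

A fixed-point hardware design would show a much larger error than this floating-point comparison, which is why such a metric is useful when choosing word lengths.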

Proceedings ArticleDOI
20 Mar 2012
TL;DR: A new interpolative method is presented that makes use of the information obtained in previous steps of the WLO process to guide the search so the number of required simulations is minimized and provides optimized results several times faster than the traditional approaches.
Abstract: As Digital Signal Processing (DSP) systems grow in complexity, the classical simulation-based approaches to the wordlength optimization (WLO) problem for fixed-point data representation can no longer be used due to unaffordable execution times. Thus, it is necessary to accelerate the computations and significantly reduce the number of simulations performed in order to obtain optimized solutions in reasonable times. In this paper a new interpolative method is presented. This technique makes use of the information obtained in previous steps of the WLO process to guide the search so the number of required simulations is minimized. Experimental results show that this process provides optimized results several times faster than the traditional approaches without any significant penalty on the quality of the solutions.

Proceedings ArticleDOI
20 Mar 2012
TL;DR: This paper proposes two protection schemes against design theft, for SRAM-based FPGA devices, considering the key management issue at the customer facility, making good use of different cryptography features.
Abstract: The continuing improvement in field programmable gate array (FPGA) performance encourages embedded systems designers to incorporate FPGA devices extensively in their systems. This expanding use makes FPGA-based systems more attractive to attackers and hence vulnerable to a number of threats. In this paper, we propose two protection schemes against design theft, for SRAM-based FPGA devices, considering the key management issue at the customer facility. The first scheme proposes some improvements to a previously reported scheme to reduce its implementation cost. The second scheme differs from others by combining both symmetric and asymmetric encryption. A comparison shows that our proposed schemes are more advantageous than other works and present the best tradeoff between hardware, security and key management, making good use of different cryptography features.

Proceedings ArticleDOI
20 Mar 2012
TL;DR: This paper presents the implementation and integration of the AES 128 data encryption IP and the I2C serial communication interface IP, into the IP of the M8051 microcontroller, and performs functionality testing in FPGA to verify the correct functioning of the IPs.
Abstract: This paper presents the implementation and integration of the AES 128 data encryption IP and the I2C serial communication interface IP into the IP of the M8051 microcontroller. We detail each block and validate them through testbench simulation. We performed functionality testing in FPGA to verify the correct functioning of the IPs and their integration.

Proceedings ArticleDOI
20 Mar 2012
TL;DR: This paper presents the process of implementation of the MIPS-1 ISA on a simple didactic processor, without increasing the datapath complexity, and shows the physical changes needed in the target datapath to fit the features of the new ISA.
Abstract: This paper presents the process of implementation of the MIPS-1 ISA on a simple didactic processor, without increasing the datapath complexity. This implementation may be desirable for academic purposes or for the use of datapaths of different complexity and performance in the MPSoC (Multiprocessor System-on-Chip) design. This paper shows the physical changes needed in the target datapath to fit the features of the new ISA. The techniques used to maintain the datapath simplicity are also shown. Finally, we present a simple implementation example used to validate this datapath, with simulation and synthesis results on FPGA.

Proceedings ArticleDOI
20 Mar 2012
TL;DR: The paper discusses design and implementation issues for several modules, such as the video scaler and video captioning, as well as the generation of video output signals (VGA or composite PAL-M), and provides implementation results using an FPGA-based hardware platform.
Abstract: In this paper a video processing architecture for use in a set top box (STB) compatible with the Brazilian Digital Television System (SBTVD) is presented. After the decoding process, a video frame is stored in the STB memory and is scanned by the output subsystem while executing several operations in order to fit the external display. The paper discusses design and implementation issues for several modules, such as the video scaler and video captioning, as well as the generation of video output signals (VGA or composite PAL-M). Implementation results using an FPGA-based hardware platform are also provided. The goal is to proceed to silicon implementation after the FPGA validation phase.