Showing papers on "VHDL" published in 2015

Journal Article•DOI•

Design of Mixed Synchronous/Asynchronous Systems with Multiple Clocks

Yu Jiang¹, Hehua Zhang¹, Huafeng Zhang¹, Han Liu¹, Xiaoyu Song², Ming Gu¹, Jiaguang Sun¹•Institutions (2)

Tsinghua University¹, Portland State University²

01 Aug 2015-IEEE Transactions on Parallel and Distributed Systems

TL;DR: A novel computation model named GalsBlock is proposed for the design of multi-clocked embedded systems with both synchronous and asynchronous components, and a graphical modeling, simulation, verification, and code generation toolkit is developed to support the computation model.

Abstract: Today’s distributed systems are commonly equipped with both synchronous and asynchronous components controlled by multiple clocks. The key challenges in designing such systems are (1) how to model multi-clocked local synchronous components, local asynchronous components, and asynchronous communication among components in a single framework, and (2) how to ensure the correctness of the model and keep the model consistent with the implementation of the real system. In this paper, we propose a novel computation model named GalsBlock for the design of multi-clocked embedded systems with both synchronous and asynchronous components. The computation model consists of several hierarchical compound and atom blocks communicating through data port connections. Each atom block can be refined as parallel Mealy automata. A synchronous component can be captured in an atom block with the corresponding local control clock, an asynchronous component in an atom block without a clock, and asynchronous communication in the data port connections among blocks. Unified operational semantics and formal semantics are defined, which can be used for simulation and verification, respectively. We can then generate efficient VHDL code from the validated model, which can be synthesized into the FPGA processor for direct execution. We have developed a graphical modeling, simulation, verification, and code generation toolkit to support the computation model, and applied it to the design of a subsystem used in real train communication control.
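
The abstract says each atom block can be refined into Mealy automata, that is, automata whose output depends on both the current state and the current input. The Python sketch below illustrates only that general notion. It does not reproduce GalsBlock's semantics, and the class name, the edge-detector example and the `run` helper are all hypothetical.

```python
class MealyAutomaton:
    """Mealy automaton: each output depends on the current state and the current input."""

    def __init__(self, transitions, initial_state):
        # transitions maps (state, input) -> (next_state, output)
        self.transitions = transitions
        self.state = initial_state

    def step(self, symbol):
        """Consume one input symbol, update the state, and return the output."""
        self.state, output = self.transitions[(self.state, symbol)]
        return output


def make_edge_detector():
    """Hypothetical example block: outputs 1 only on a 0 -> 1 input transition."""
    transitions = {
        ("low", 0): ("low", 0),
        ("low", 1): ("high", 1),
        ("high", 1): ("high", 0),
        ("high", 0): ("low", 0),
    }
    return MealyAutomaton(transitions, initial_state="low")


def run(automaton, inputs):
    """Drive the automaton with one input per tick of its (assumed) local clock."""
    return [automaton.step(symbol) for symbol in inputs]


print(run(make_edge_detector(), [0, 1, 1, 0, 1]))  # [0, 1, 0, 0, 1]
```

In this sketch one call to `step` stands for one tick of a block's local clock. A clockless asynchronous block would instead react whenever an input arrives.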

55 citations

Proceedings Article•DOI•

Comparative analysis of OpenCL vs. HDL with image-processing kernels on Stratix-V FPGA

Kenneth Hill¹, Stefan Craciun¹, Alan D. George¹, Herman Lam¹•Institutions (1)

University of Florida¹

27 Jul 2015

TL;DR: This study conducts a performance and productivity comparison between three image-processing kernels developed using Altera's SDK for OpenCL and traditional VHDL, finding similar performance in terms of frequency and resource utilization.

Abstract: Application development with hardware description languages (HDLs) such as VHDL or Verilog involves numerous productivity challenges, limiting the potential impact of reconfigurable computing (RC) with FPGAs in high-performance computing. Major challenges with HDL design include steep learning curves, large and complex codes, long compilation times, and lack of development standards across platforms. A relative newcomer to RC, the Open Computing Language (OpenCL) reduces productivity hurdles by providing a platform-independent, C-based programming language. In this study, we conduct a performance and productivity comparison between three image-processing kernels (Canny edge detector, Sobel filter, and SURF feature-extractor) developed using Altera's SDK for OpenCL and traditional VHDL. Our results show that VHDL designs achieved a more efficient use of resources (59% to 70% less logic); however, both OpenCL and VHDL designs resulted in similar timing constraints (255 MHz < fmax < 325 MHz). Furthermore, we observed a 6× increase in productivity when using OpenCL development tools, as well as the ability to efficiently port the same OpenCL designs without change to three different RC platforms, with similar performance in terms of frequency and resource utilization.

50 citations

Journal Article•DOI•

Hardware implementation of neural network with Sigmoidal activation functions using CORDIC

Vipin Tiwari¹, Nilay Khare¹•Institutions (1)

Maulana Azad National Institute of Technology¹

01 Aug 2015-Microprocessors and Microsystems

TL;DR: This article presents a field-programmable gate array (FPGA)-based hardware implementation of a multilayer feed-forward neural network with log-sigmoid and tangent-sigmoid (hyperbolic tangent) activation functions, reported to be more accurate than previous implementations of a neural network with the same activation functions.
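
The listing gives no detail on how the paper applies CORDIC, so the following is only a generic sketch of hyperbolic-mode CORDIC rotation for tanh, with the log-sigmoid derived through the identity 1/(1 + e^(-v)) = (1 + tanh(v/2))/2. It uses floating-point arithmetic for readability, whereas hardware would normally use fixed-point shifts and adds. The function names and the default of 16 iterations are assumptions, and the method only converges for |z| up to roughly 1.1.

```python
import math


def hyperbolic_schedule(count):
    """Shift amounts 1, 2, 3, ... with 4, 13 and 40 repeated, which hyperbolic CORDIC needs to converge."""
    schedule, k = [], 1
    while len(schedule) < count:
        schedule.append(k)
        if k in (4, 13, 40) and len(schedule) < count:
            schedule.append(k)
        k += 1
    return schedule


def cordic_tanh(z, iterations=16):
    """Approximate tanh(z) for |z| up to about 1.1 using hyperbolic CORDIC rotations."""
    x, y = 1.0, 0.0  # the CORDIC gain scales x and y equally, so it cancels in y / x
    for k in hyperbolic_schedule(iterations):
        d = 1.0 if z >= 0 else -1.0   # rotate so that the residual angle z moves toward 0
        shift = 2.0 ** -k             # in hardware this is a right shift by k bits
        x, y = x + d * y * shift, y + d * x * shift
        z -= d * math.atanh(shift)    # in hardware this comes from a small lookup table
    return y / x                      # sinh / cosh of the rotated angle


def log_sigmoid(v, iterations=16):
    """Log-sigmoid via the tanh identity; valid for |v| up to about 2.2 in this sketch."""
    return 0.5 * (1.0 + cordic_tanh(0.5 * v, iterations))


print(cordic_tanh(0.5), math.tanh(0.5))
print(log_sigmoid(1.0), 1.0 / (1.0 + math.exp(-1.0)))
```

Inputs outside the convergence range would need range reduction first, which this sketch leaves out.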

49 citations

Journal Article•DOI•

FPGA Implementation of a Real-Time Weak Signal Detector Using a Duffing Oscillator

Vahid Rashtchi¹, Mohsen Nourazar¹•Institutions (1)

University of Zanjan¹

01 Oct 2015-Circuits Systems and Signal Processing

TL;DR: Experimental results are presented to demonstrate the effectiveness of the proposed implementation of a weak signal detector using a Duffing oscillator on field-programmable gate arrays (FPGAs) for the real-time detection of weak signals in noisy environments.

Abstract: This paper presents an implementation of a weak signal detector using a Duffing oscillator on field-programmable gate arrays (FPGAs) for the real-time detection of weak signals in noisy environments. The proposed implementation combines the efficiency of weak signal detection by chaotic oscillators in noisy environments with the advantages of hardware implementation to achieve an efficient weak signal detector. To optimize the performance versus area, we have used VHDL. A novel state detector, phase trajectory autocorrelation, has been introduced for the state detection of the Duffing oscillator. As an experiment, the Duffing oscillator has been implemented on a Cyclone IV GX FPGA. In addition to the structure and resource utilization of the design, experimental results are presented to demonstrate the effectiveness of the proposed implementation.

40 citations

Book•

Digital Design and Computer Architecture: ARM Edition

Sarah L. Harris, David Harris

09 Apr 2015

TL;DR: Digital Design and Computer Architecture: ARM Edition covers the fundamentals of digital logic design and reinforces logic concepts through the design of an ARM microprocessor, and features side-by-side examples of the two most prominent hardware description languages (HDLs), SystemVerilog and VHDL, which illustrate and compare the ways each can be used in the design of digital systems.

Abstract: Digital Design and Computer Architecture: ARM Edition takes a unique and modern approach to digital design. Beginning with digital logic gates and progressing to the design of combinational and sequential circuits, Harris and Harris use these fundamental building blocks as the basis for what follows: the design of an actual ARM processor. With over 75% of the world's population using products with ARM processors, the design of the ARM processor offers an exciting and timely application of digital design while also teaching the fundamentals of computer architecture. SystemVerilog and VHDL are integrated throughout the text in examples illustrating the methods and techniques for CAD-based circuit design. By the end of this book, readers will be able to build their own microprocessor and will have a top-to-bottom understanding of how it works. Harris and Harris have combined an engaging and humorous writing style with an updated and hands-on approach to digital design. Covers the fundamentals of digital logic design and reinforces logic concepts through the design of an ARM microprocessor. Features side-by-side examples of the two most prominent hardware description languages (HDLs), SystemVerilog and VHDL, which illustrate and compare the ways each can be used in the design of digital systems. Includes examples throughout the text that enhance the reader's understanding and retention of key concepts and techniques. The companion website includes a chapter on I/O systems with practical examples that show how to use the Raspberry Pi computer to communicate with peripheral devices such as LCDs, Bluetooth radios, and motors. The companion website also includes appendices covering practical digital design issues and C programming as well as links to CAD tools, lecture slides, laboratory projects, and solutions to exercises.

33 citations

Proceedings Article•DOI•

Design and implementation of 16 × 16 multiplier using Vedic mathematics

S.P. Pohokar¹, R.S. Sisal¹, K.M. Gaikwad¹, M.M. Patil¹, Rushikesh Borse¹•Institutions (1)

Sinhgad Academy of Engineering¹

28 May 2015

TL;DR: The basic building block: 16 × 16 Vedic multiplier based on Urdhva-Tiryagbhyam Sutra is implemented and coded in VHDL and synthesized and simulated by using Xilinx ISE 10.1.

Abstract: This paper briefly describes the Urdhva-Tiryagbhyam Sutra of Vedic mathematics and presents a multiplier design based on the sutra. Vedic mathematics is an ancient system of mathematics with a unique calculation technique based on 16 sutras, which were discovered by Sri Bharti Krishna Tirthaji. In this era of digitalization, it is necessary to increase the speed of digital circuits while reducing on-chip area and memory consumption. In various applications of digital signal processing, multiplication is one of the key components. The Vedic technique eliminates unwanted multiplication steps, thus reducing the propagation delay in the processor and hence the hardware complexity in terms of area and memory requirements. We implement the basic building block: a 16 × 16 Vedic multiplier based on the Urdhva-Tiryagbhyam Sutra. This Vedic multiplier is coded in VHDL and synthesized and simulated using Xilinx ISE 10.1. Further, the design of an array multiplier in VHDL is compared with the proposed multiplier in terms of speed and memory.
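
Urdhva-Tiryagbhyam ("vertically and crosswise") forms each result column from all partial products whose bit positions sum to that column, then passes the carry on to the next column. The Python sketch below is a bit-level behavioural model of that idea for a flat 16 × 16 multiply. It is not the authors' VHDL, which may well be built hierarchically from smaller blocks, and the function name and `width` parameter are assumptions.

```python
def urdhva_multiply(a, b, width=16):
    """Multiply two unsigned width-bit integers column by column (vertically and crosswise)."""
    if not (0 <= a < (1 << width) and 0 <= b < (1 << width)):
        raise ValueError("operands must fit in the given width")
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]

    result = 0
    carry = 0
    for col in range(2 * width - 1):
        # Crosswise step: add every partial product a[i] * b[col - i] that lands in this column.
        column_sum = carry
        for i in range(max(0, col - width + 1), min(col, width - 1) + 1):
            column_sum += a_bits[i] & b_bits[col - i]
        result |= (column_sum & 1) << col  # the low bit stays in this column
        carry = column_sum >> 1            # the rest carries into the next column
    result |= carry << (2 * width - 1)     # final carry becomes the top result bit
    return result


print(urdhva_multiply(12345, 54321) == 12345 * 54321)  # True
```

In hardware all column sums can be formed in parallel, which is where the claimed speed advantage would come from. The sequential loop here only mirrors the arithmetic.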

31 citations

Proceedings Article•DOI•

Performance, analysis and comparison of digital adders

Jasmine Saini¹, Somya Agarwal¹, Aditi Kansal¹•Institutions (1)

Jaypee Institute of Information Technology¹

19 Mar 2015

TL;DR: The drawbacks and gains of ripple-carry, carry-lookahead, carry-select and Kogge-Stone adders in terms of area, speed, and delay are discussed.

Abstract: This paper primarily discusses the construction of different high-speed adders using very high speed integrated circuit hardware design on the ModelSim 5.5c platform. The reason for this investigation is that adders are the most important circuits requiring improved designs in order to obtain the maximum possible gain. In any digital system, adders are the most fundamental unit. Addition is an indispensable operation in any digital, analog, or control system. Adders are used not only in the arithmetic logic units of computers and some processors but also in other kinds of processors, where they calculate addresses, table indices, and similar operations [6]. Today, technology is measured by its ability to measure computational algorithms. This paper discusses the drawbacks and gains of ripple-carry, carry-lookahead, carry-select and Kogge-Stone adders in terms of area, speed, and delay. This paper focuses on the implementation and simulation of a 64-bit full adder using the very high speed integrated circuit hardware description language (VHDL).
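
To make the comparison concrete, the Python sketch below models two of the adders named in the abstract at the bit level: a ripple-carry adder, whose carry passes through every bit position in turn, and a Kogge-Stone parallel-prefix adder, which combines generate and propagate signals in log2(width) levels. It is a behavioural illustration with assumed function names, not the paper's VHDL.

```python
def ripple_carry_add(a, b, width=64):
    """Add bit by bit; the carry ripples through all `width` full-adder stages."""
    result, carry = 0, 0
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        result |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return result, carry


def kogge_stone_add(a, b, width=64):
    """Parallel-prefix addition: carries are resolved in log2(width) combining levels."""
    mask = (1 << width) - 1
    g = a & b & mask        # generate: this bit creates a carry
    p = (a ^ b) & mask      # propagate: this bit passes an incoming carry on
    p0 = p                  # keep the bitwise propagate for the final sum
    dist = 1
    while dist < width:
        # Combine each position with the group ending `dist` bits below it.
        g = (g | (p & (g << dist))) & mask
        p = (p & (p << dist)) & mask
        dist *= 2
    carries = (g << 1) & mask           # carry into bit i is the group generate of bits i-1..0
    total = p0 ^ carries
    carry_out = (g >> (width - 1)) & 1
    return total, carry_out


print(kogge_stone_add(2**64 - 1, 1))  # (0, 1)
print(kogge_stone_add(123456789, 987654321) == ripple_carry_add(123456789, 987654321))  # True
```

For 64 bits the prefix loop runs 6 times, against 64 sequential stages in the ripple-carry model. That is the usual speed-versus-area trade-off between the two structures.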

28 citations

Journal Article•DOI•

Comparative Design and Analysis of Mesh, Torus and Ring NoC

Arpit Jain¹, Adesh Kumar², Sanjeev Sharma•Institutions (2)

Teerthanker Mahaveer University¹, University of Petroleum and Energy Studies²

01 Jan 2015-Procedia Computer Science

TL;DR: The simulation and FPGA synthesis of mesh, torus and ring Network on Chip (NoC) based on the Multiprocessor System on Chip structure for a network cluster of 256 nodes is presented.

27 citations

Dissertation•DOI•

Digital circuit in CλaSH: functional specifications and type-directed synthesis

Christiaan Pieter Rudolf Baaij

23 Jan 2015

TL;DR: This thesis describes the inner workings of the CλaSH compiler, which translates the aforementioned circuit descriptions written in Haskell to low-level descriptions in VHDL, and proves that this term rewrite system always reduces a polymorphic, higher-order circuit description to a synthesisable variant.

Abstract: Over the last three decades, the number of transistors used in microchips has increased by three orders of magnitude, from millions to billions. The productivity of the designers, however, lags behind. Managing to implement complex algorithms, while keeping non-functional properties within desired bounds, and thoroughly verifying the design against its specification, are the main difficulties in circuit design. As a motivation for our work we make a qualitative analysis of the tools available to circuit designers. Here we see that progress has been slow, and that the same techniques have been used for over 20 years. We claim that functional languages can be used to raise the abstraction level in circuit design. Especially higher-order functional languages, where functions are first-class and can be manipulated by other functions, offer a single abstraction mechanism that can capture many design patterns. This thesis explores the idea of using the functional language Haskell directly as a hardware specification language, and move beyond the limitations of embedded languages. Additionally, we can use normal functions from existing Haskell libraries to model the behaviour of our circuits. This thesis describes the inner workings of our CλaSH compiler, which translates the aforementioned circuit descriptions written in Haskell to low-level descriptions in VHDL. The challenge then becomes the reduction of the higher-level abstractions in the descriptions to a form where synthesis is feasible. This thesis describes a term rewrite system (with bound variables) to achieve this reduction. We prove that this term rewrite system always reduces a polymorphic, higher-order circuit description to a synthesisable variant. Even when descriptions use high-level abstractions, the CλaSH compiler can synthesize efficient circuits. Case studies show that circuits designed in Haskell, and synthesized with the CλaSH compiler, are on par with hand-written VHDL, in both area and gate propagation delay. This thesis thus shows the merits of using a modern functional language for circuit design. The advanced type system and higher-order functions allow us to design circuits that have the desired property of being correct-by-construction. Finally, our synthesis approach enables us to derive efficient circuits from descriptions that use high-level abstractions.

26 citations

Journal Article•DOI•

Hardware architectures for the H.265/HEVC discrete cosine transform

Grzegorz Pastuszak¹•Institutions (1)

University of Warsaw¹

01 Jun 2015-IET Image Processing

TL;DR: This study presents a design methodology for the two-dimensional discrete cosine transform dedicated for H.265/HEVC hardware encoders that decomposes matrix multiplications for different transform sizes into some steps based on the division of transform units into fixed-size blocks.

Abstract: This study presents a design methodology for the two-dimensional (2D) discrete cosine transform dedicated for H.265/HEVC hardware encoders. The methodology decomposes matrix multiplications for different transform sizes into some steps based on the division of transform units into fixed-size blocks. The modified order of processed blocks allows a significant reduction of the size of the transposition buffer. As a consequence, the resource consumption of the whole 2D-transform architecture is decreased. Separate transform cores assigned to two transform stages increase the throughput more than twice. The decomposition enables different hardware configurations of the architectures. Particularly, the architectures applying the proposed methodology are parametrically specified in VHDL, and configuration parameters enable the tradeoff between resources and the throughput. Furthermore, the interface adaptation to desired horizontal and vertical sizes is possible. The use of regular multipliers allows the support for transforms specified in other video standards. Computational elements embedded in architectures are well-suited to FPGA devices, which improves the area-speed efficiency. Synthesis results show that they can operate at 200 and 400 MHz when implemented in FPGA Arria II and TSMC 90 nm, respectively.

26 citations

Journal Article•DOI•

Multiplier-less pipeline architecture for lifting-based two-dimensional discrete wavelet transform

Anand D. Darji, R Arun, Shabbir Noman Merchant, A.N. Chandorkar

19 Mar 2015-IET Computers and Digital Techniques

TL;DR: A multiplier-less, high-speed and low-power pipeline architecture with novel dual Z-scanning technique for lifting-based two-dimensional (2D) discrete wavelet transform (DWT) with superior speed, power and hardware utilisation for similar throughput specification is presented.

Abstract: In this study, the authors present a multiplier-less, high-speed and low-power pipeline architecture with a novel dual Z-scanning technique for the lifting-based two-dimensional (2D) discrete wavelet transform (DWT). The proposed architecture is composed of pipelined one-dimensional row and column processors and five transposing registers. Moreover, it uses 4N temporal line buffers to process the 2D DWT of an image with N × N resolution. Multipliers are designed with shift-and-add logic to reduce the critical path to one adder. The dual Z-scanning method is employed to reduce the transposition buffers and latency. The proposed architecture is superior to existing architectures in speed, power and hardware utilisation for a similar throughput specification. The register-transfer level (RTL) description of the proposed design is written in VHDL and synthesised using Xilinx ISE 10.1. The proposed architecture operates at a frequency of 353.107 MHz when synthesised for a Xilinx Virtex-IV series field programmable gate array. A frame processing rate of 340 frames/second for full high-definition video can be achieved at this frequency of operation. The RTL of the proposed design is also synthesised using a UMC 180 nm complementary metal-oxide semiconductor (CMOS) standard cell library for application specific integrated circuit (ASIC) implementation. ASIC synthesis of the 2D DWT core uses 20 358 logic gates and consumes only 20.83 mW of power at a frequency of 100 MHz.
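
The "multiplier-less" claim rests on replacing each constant multiplication with shifts and additions. The Python sketch below shows the general idea for a fixed-point constant. The constant 1.586134342 is only an example value (its magnitude matches a commonly quoted 9/7 lifting coefficient), the 8 fractional bits are arbitrary, and the abstract does not say which filter or word length the authors use. The function names are hypothetical.

```python
def to_fixed(value, frac_bits):
    """Quantise a positive real constant to an integer with `frac_bits` fractional bits."""
    return round(value * (1 << frac_bits))


def shift_add_multiply(x, constant, frac_bits):
    """Multiply integer x by a fixed-point constant using only shifts and adds."""
    acc = 0
    bit = 0
    while constant >> bit:
        if (constant >> bit) & 1:
            acc += x << bit        # one shifted copy of x per set bit of the constant
        bit += 1
    return acc >> frac_bits        # drop the fractional bits again


coeff = to_fixed(1.586134342, 8)   # 406 = 0b110010110, so five shifted terms are added
print(shift_add_multiply(100, coeff, 8) == (100 * coeff) >> 8)  # True
```

In a pipelined design each of these additions could sit in its own stage, which would be consistent with the abstract's statement that the critical path is reduced to one adder.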

Proceedings Article•DOI•

Out-of-plane NML modeling and architectural exploration

Fabrizio Cairo¹, Giovanna Turvani¹, Fabrizio Riente¹, Marco Vacca¹, S. Breitkreutz-v. Gamm², Markus Becherer², Mariagrazia Graziano¹, Maurizio Zamboni¹•Institutions (2)

Polytechnic University of Turin¹, Technische Universität München²

27 Jul 2015

TL;DR: This paper presents the design of a full adder entirely based on single domain out-of-plane nanomagnetic logic (pNML), and proposes different solutions of the same circuit which allow for the best performance in terms of occupied area and timing.

Abstract: One of the most innovative solutions studied as an alternative technology to CMOS transistors is NanoMagnetic Logic (NML). It exhibits remarkable characteristics that overcome some intrinsic limitations of CMOS, such as low power consumption and the possibility of merging logic and memory in the same device. We present the design of a full adder entirely based on single domain out-of-plane nanomagnetic logic (pNML). We propose different solutions of the same circuit which allow us to obtain the best performance in terms of occupied area and timing. We modeled the pNML basic elements using VHDL (VHSIC Hardware Description Language) and then performed micromagnetic simulations to demonstrate the correct operation of the circuits.

Journal Article•DOI•

Power efficient and high performance VLSI architecture for AES algorithm

K. Kalaiselvi, H. Mangalam¹•Institutions (1)

Sri Krishna College of Engineering & Technology¹

01 Sep 2015-Journal of Electrical Systems and Information Technology

TL;DR: Experimental results reveal that the proposed AES architectures offer superior performance than the existing VLSI architectures in terms of power, throughput and critical path delay.

Journal Article•DOI•

Design and simulation of a sensorless permanent magnet synchronous motor drive with microprocessor-based PI controller and dedicated hardware EKF estimator

Ying-Shieh Kung¹, Nguyen Phan Thanh¹,², Ming-Shyng Wang¹•Institutions (2)

Southern Taiwan University of Science and Technology¹, Ho Chi Minh City University of Technology²

01 Oct 2015-Applied Mathematical Modelling

TL;DR: A digital hardware implementation of a speed controller for a sensorless permanent magnet synchronous motor (PMSM) drive using the extended Kalman filter is proposed and the EKF algorithm is used to estimate the rotor flux angle and rotor speed.

Proceedings Article•DOI•

Accelerating video and image processing design for FPGA using HDL coder and simulink

Jerry Chan Ting Hai¹, Ooi Chee Pun¹, Tan Wooi Haw¹•Institutions (1)

Multimedia University¹

01 Oct 2015

TL;DR: A model-based design framework built on HDL Coder, Vision HDL Toolbox and Simulink is presented to accelerate the design of video and image processing solutions, tackle the technical complexity, and reduce the development time of traditional FPGA design.

Abstract: Video and image processing solutions requiring high throughput are often implemented in dedicated hardware such as FPGAs. The design process traditionally uses Verilog and VHDL for synthesizing and validating the hardware. This design process is technically complex and time-consuming. In this paper, we present an alternative approach using a model-based design framework based on HDL Coder, Vision HDL Toolbox and Simulink to accelerate the design of video and image solutions. Several important issues in this framework are discussed, namely Pixel Streaming Design, Cosimulation and FPGA in the Loop (FIL). Based on this framework, a video of a human walking is processed to extract two features: the human's height and edges. The design is implemented on an Altera DE2-115 FPGA board. The goal of this paper is to tackle the technical complexity and reduce the development time of traditional FPGA design.

Journal Article•DOI•

Computer architecture and FPGAs: A learning-by-doing methodology for digital-native students

Ma de los Angeles Cifredo-Chacon¹, Angel Quiros-Olozabal¹, Jose Maria Guerrero-Rodriguez¹•Institutions (1)

University of Cádiz¹

01 May 2015-Computer Applications in Engineering Education

TL;DR: A learning‐by‐doing methodology to teach Computer Architecture to first‐year students who belong to a digital‐native generation by developing a whole computer from scratch while they are introduced to hardware description languages (HDL) and programmable logic devices.

Abstract: Purely theoretical teaching of Computer Architecture is no longer suitable. At present, students ask for learning-by-doing that matches their dynamic and active character. Interactive teaching is now possible thanks to the falling prices of Field Programmable Gate Arrays. This paper proposes a learning-by-doing methodology to teach Computer Architecture to first-year students who belong to a digital-native generation. The method consists of developing a whole computer from scratch while students are introduced to hardware description languages (HDL) and programmable logic devices. First, students design each element of the computer in VHDL. Later, they interconnect the verified elements and test the complete computer. An FPGA-based board is needed to implement the designed computer and check that it works correctly. This educational approach is intended for first-year students of the Computer Engineering degree; it is thus the students' first experience with the basics of Computer Architecture. Students have a computer and an FPGA-based board at all times. In the final exam, the design of a different computer is proposed. Testing and programming the computer is a requirement to pass. The high percentage of students who passed corroborated the success of the methodology. Thus, computer operation and construction are understood through a hands-on methodology while VHDL and FPGA technology are introduced. Lack of attention is avoided since students keep a dynamic role, working with their personal computer and FPGA at all times. © 2015 Wiley Periodicals, Inc. Comput Appl Eng Educ 23:464–470, 2015; View this article online at wileyonlinelibrary.com/journal/cae; DOI 10.1002/cae.21617

Posted Content•

GMU Hardware API for Authenticated Ciphers.

Ekawat Homsirikamol, William Diehl, Ahmed Ferozpuri, Farnoud Farahmand, Malik Umar Sharif, Kris Gaj¹•Institutions (1)

George Mason University¹

01 Jan 2015-IACR Cryptology ePrint Archive

TL;DR: A universal hardware Application Programming Interface (API) for authenticated ciphers is proposed, intended to meet the requirements of all algorithms submitted to the CAESAR competition, and composed of the specification of an interface of the authenticated cipher core, and the communication protocol describing the exact format of all inputs and outputs.

Abstract: In this paper, we propose a universal hardware Application Programming Interface (API) for authenticated ciphers. In particular, our API is intended to meet the requirements of all algorithms submitted to the CAESAR competition. Two major parts of the API, the interface and the communication protocol, were developed with the goal of reducing any potential biases in benchmarking of authenticated ciphers in hardware. Our high-speed implementation of the proposed hardware API includes universal, open-source pre-processing and post-processing units, common for all CAESAR candidates and the current standards, such as AES-GCM and AES-CCM. Apart from the full documentation, examples, and the source code of the pre-processing and post-processing units, we have made available in public domain a) a universal testbench to verify the functionality of any CAESAR candidate implemented using our hardware API, b) a Python script used to automatically generate test vectors for this testbench, c) VHDL wrappers used to determine the maximum clock frequency and the resource utilization of all implementations, and d) RTL VHDL source codes of high-speed implementations of AES and the Keccak Permutation F, which may be used as building blocks in implementations of related ciphers. We hope that the existence of these resources will substantially reduce the time necessary to develop hardware implementations of all CAESAR candidates for the purpose of evaluation, comparison, and future deployment in real products.

1 Motivation

The CAESAR competition [1], launched in 2014, aims at identifying a portfolio of future authenticated ciphers with security, performance, and flexibility exceeding that of the current standards, such as AES-GCM [2] and AES-CCM [3]. Although security is commonly accepted to be the most important criterion in all cryptographic contests, it is rarely by itself sufficient to determine a winner. This is because multiple candidates generally offer adequate security, and a tradeoff between security and performance must be investigated. The focus of this paper is to facilitate the comparison of modern authenticated ciphers in terms of their performance and cost in hardware, and in particular in FPGAs, All Programmable Systems on Chip, and ASICs. As a starting point for such a comparison we propose defining hardware API, composed of the specification of an interface of the authenticated cipher core, and the communication protocol describing the exact format of all inputs and outputs, as well as the timing dependencies among all data and control signals passing through the specified interface.

Similarly to the case of previous contests, software implementations of the CAESAR candidates are being compared using a uniform API, clearly defined in the call for submissions [1]. So far, no similar hardware API has been proposed, not to mention accepted by the cryptographic community. As a result any attempt at the comparison of existing hardware implementations is highly dependent on specific assumptions about the hardware API, made independently by various hardware designers. These assumptions can have potentially a very high influence on all major performance measures of the developed implementations. Additionally, a hardware API is typically much more difficult to modify than a software API, making any last minute standardization efforts and code adjustments highly inefficient and questionable. Therefore, there is a clear need for a proposal regarding a uniform hardware API, which could be further modified and improved using feedback from the cryptographic community, and eventually endorsed by the CAESAR Committee, and adopted by majority of future hardware developers. Our goal is to address this issue by providing the exact specification of the proposed interface, as well as multiple supporting materials, such as open-source codes of pre-processing and post-processing units, a universal testbench, and uniform ways of generating optimized results.

2 Proposed Features

The proposed features of our hardware API are as follows:

– inputs of arbitrary size in bytes (but a multiple of a byte only)
– size of the entire message/ciphertext does not need to be known before the encryption/decryption starts (unless required by the algorithm itself)
– wide range of data port widths, 8 ≤ w ≤ 256
– independent data and key inputs
– simple high-level communication protocol
– support for the burst mode
– possible overlap among processing the current input block, reading the next input block, and storing the previous output block
– storing decrypted messages internally, until the result of authentication is known
– support for encryption and decryption within the same core
– ability to communicate with very simple, passive devices, such as FIFOs
– ease of extension to support existing communication interfaces and protocols, such as AMBA-AXI4 (a de-facto standard for the System-on-Chip (SoC) buses [4]) and PCI Express (high-bandwidth serial communication between PCs and hardware accelerator boards [5]).

Journal Article•DOI•

Picos: A hardware runtime architecture support for OmpSs

Fahimeh Yazdanpanah¹,², Carlos Alvarez¹,², Daniel Jiménez-González¹,², Rosa M. Badia¹,²,³, Mateo Valero¹,²•Institutions (3)

Polytechnic University of Catalonia¹, Barcelona Supercomputing Center², Spanish National Research Council³

01 Dec 2015-Future Generation Computer Systems

TL;DR: This paper describes the Picos Hardware Design and the latencies of the main functionality of its components, based on the synthesis of their VHDL design, and proposes Picos, an implementation of the Task Superscalar (TSS) architecture that provides hardware support to the OmpSs programming model.

Journal Article•DOI•

Parallel H.264/AVC Fast Rate-Distortion Optimized Motion Estimation by Using a Graphics Processing Unit and Dedicated Hardware

Muhammad Usman Shahid¹, Ashfaq Ahmed², Maurizio Martina¹, Guido Masera¹, Enrico Magli¹•Institutions (2)

Polytechnic University of Turin¹, COMSATS Institute of Information Technology²

01 Apr 2015-IEEE Transactions on Circuits and Systems for Video Technology

TL;DR: An inherently parallel, low-complexity, rate-distortion (RD) optimized fast ME algorithm is presented that is well suited for parallel implementations and eliminates various data dependencies caused by a reliance on spatial predictions.

Abstract: Heterogeneous systems on a single chip composed of a central processing unit, graphics processing unit (GPU), and field-programmable gate array (FPGA) are expected to emerge in the near future. In this context, the system on chip can be dynamically adapted to employ different architectures for execution of data-intensive applications. Motion estimation (ME) is one such task that can be accelerated using FPGA and GPU for high-performance H.264/Advanced Video Coding encoder implementation. This paper presents an inherently parallel, low-complexity, rate-distortion (RD) optimized fast ME algorithm well suited for parallel implementations, eliminating various data dependencies caused by a reliance on spatial predictions. In addition, this paper provides details of the GPU and FPGA implementations of the parallel algorithm by using OpenCL and Very High Speed Integrated Circuits (VHSIC) Hardware Description Language (VHDL), respectively, and presents a practical performance comparison between the two implementations. The experimental results show that the proposed scheme achieves significant speedup on GPU and FPGA, and has RD performance comparable to that of the sequential fast ME algorithm.

Proceedings Article•DOI•

A 2.48Gb/s FPGA-based QC-LDPC decoder: An algorithmic compiler implementation

Swapnil Mhaske¹, David C. Uliana², Hojin Kee², Tai Ly², Ahsan Aziz², Predrag Spasojevic¹•Institutions (2)

Rutgers University¹, National Instruments²

12 Nov 2015

TL;DR: This brief presents two approaches to improve the throughput of a Quasi-Cyclic Low-Density Parity-Check (QC-LDPC) decoder architecture, providing an algorithmic method to enhance parallel processing within the decoder in the first approach and applying the decoding architecture to achieve another highly-parallel architecture in the second approach.

Abstract: The increasing data rates, expected to be on the order of Gb/s for future wireless systems, directly impact the throughput requirements of the modulation and coding systems of the physical layer. In an effort to design a suitable channel coding solution for 5G wireless systems, in this brief we present two approaches to improve the throughput of a Quasi-Cyclic Low-Density Parity-Check (QC-LDPC) decoder architecture. While providing an algorithmic method to enhance parallel processing within the decoder in the first approach, in the second approach we apply the decoder architecture to achieve another highly-parallel architecture. We have successfully validated the second approach to get a 2.48Gb/s QC-LDPC decoder implementation operating at 200MHz on the Xilinx Kintex-7 FPGA in the NI USRP-2953R. To rapidly prototype our research findings, the high-level description of the entire decoder was translated to a Hardware Description Language (HDL), namely VHDL, using the algorithmic compiler in the National Instruments LabVIEW™ Communication System Design Suite (CSDS™). To our knowledge, at the time of writing this paper, this is the fastest FPGA-based implementation of a standard-compliant QC-LDPC decoder on a USRP using an algorithmic compiler.

Proceedings Article•DOI•

An overview of Altera SDK for OpenCL: A user perspective

Ian Janik¹, Qing Tang¹, Mohammed A. S. Khalid¹•Institutions (1)

University of Windsor¹

03 May 2015

TL;DR: A user-centric overview of Altera SDK for OpenCL is presented to provide the novice users with a useful tutorial that will enable them to quickly become proficient in using this important HLS CAD tool.

Abstract: In recent years there has been great interest in High Level Synthesis (HLS) CAD tools to raise the level of design abstraction, reduce design time, rapidly explore the design space and fully exploit the multi-million gate heterogeneous hardware platforms provided by dramatic improvements in integrated circuits. Open Computing Language (OpenCL) is a well-known standard for heterogeneous computing. The Altera SDK for OpenCL is used to convert OpenCL code to kernels that can be run on an FPGA accelerator card. It is a recently introduced HLS CAD tool that makes it possible to convert existing C/C++ programs, or create new ones, that utilize dedicated hardware to execute specific applications much faster and more efficiently than current computer systems, whether single core or multi-core. This can all be done without knowledge of FPGAs, VHDL, or Verilog, as the SDK converts the OpenCL files into Verilog models that are then compiled into FPGA hardware. This paper presents a user-centric overview of Altera SDK for OpenCL. As a first step to achieve the best speedup, the candidate algorithm for acceleration must be analyzed to check if it is inherently parallelizable. The key features such as designing appropriate OpenCL kernels and host program, their compilation, execution and testing are summarized. A working example for accelerating a simple matrix multiplication algorithm is described. Our motivation is to provide novice users with a useful tutorial that will enable them to quickly become proficient in using this important HLS CAD tool. To our knowledge, such a user-centric tutorial has not been presented so far in the literature.

Journal Article•DOI•

Design and Implementation of RSA Algorithm using FPGA

Ari Shawkat Tahir¹•Institutions (1)

University of Zakho¹

25 Sep 2015

TL;DR: A new architecture and model is proposed for the RSA public-key algorithm; the suggested system uses 1024-bit RSA encryption/decryption for a restricted system and the multiply-and-square algorithm to perform modular exponentiation.

Abstract: The RSA cryptographic algorithm is used to encrypt and decrypt messages sent over a transmission channel such as the Internet. The RSA algorithm is a secure, high-quality, public-key algorithm. In this paper, a new architecture and model is proposed for the RSA public-key algorithm; the suggested system uses 1024-bit RSA encryption/decryption for a restricted system. The system uses the multiply-and-square algorithm to perform modular exponentiation. The design has been described in VHDL and simulated using the Xilinx ISE 12.2 tool. The architectures have been implemented on reconfigurable FPGA platforms. Implementation on a Xilinx Spartan-3 (device XC3S50, package PG208, speed -4) confirms that the proposed architectures use minimal hardware resources: only 29% of the chip resources are used for the RSA algorithm design, with a realizable operating clock frequency of 68.573 MHz.
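
The multiply-and-square technique named in this abstract can be illustrated in software. The following is a minimal Python sketch, not the paper's VHDL design: the function name `modexp_square_and_multiply` is hypothetical, and the tiny textbook key (n = 3233, e = 17, d = 2753) is a toy assumption that stands in for the 1024-bit operands used in the paper.

```python
def modexp_square_and_multiply(base: int, exponent: int, modulus: int) -> int:
    """Left-to-right binary (square-and-multiply) modular exponentiation."""
    if modulus <= 0 or exponent < 0:
        raise ValueError("modulus must be positive and exponent non-negative")
    result = 1 % modulus
    base %= modulus
    # Scan the exponent bits from most significant to least significant.
    for bit in bin(exponent)[2:]:
        result = (result * result) % modulus      # always square
        if bit == "1":
            result = (result * base) % modulus    # multiply only on a 1 bit
    return result


# Toy RSA round trip (illustrative key, far too small for real use).
n, e, d = 3233, 17, 2753          # n = 61 * 53, e * d = 1 mod 3120
message = 65
cipher = modexp_square_and_multiply(message, e, n)
recovered = modexp_square_and_multiply(cipher, d, n)
```

Each exponent bit costs one modular squaring plus, for a set bit, one modular multiplication, which is why a hardware design can reuse a single modular multiplier across the whole loop.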

Journal Article•DOI•

Secured Network on Chip (NoC) Architecture and Routing with Modified TACIT Cryptographic Technique

Adesh Kumar¹, Piyush Kuchhal¹, Sonal Singhal²•Institutions (2)

University of Petroleum and Energy Studies¹, Shiv Nadar University²

01 Jan 2015-Procedia Computer Science

TL;DR: NoC architecture is integrated with modified TACIT security algorithm on Virtex-5 FPGA and the key generation scheme is considered based on Hash function and distributed under 4 Hash function (4H) scheme to provide secured data in NoC routers.

Proceedings Article•DOI•

High-Performance FPGA Implementation of Modular Inversion over F_256 for Elliptic Curve Cryptography

Selim Hossain¹, Yinan Kong¹•Institutions (1)

Macquarie University¹

11 Dec 2015

TL;DR: The main goal is to implement a fast, high-performance modular inversion for ECC using field-programmable gate-array (FPGA) technology; the result is an area-efficient design which takes a small amount of resources on the FPGA and needs only 1480 slices.

Abstract: Modular inversion over a prime field is an important operation for public-key cryptographic applications. It is the most crucial operation to speed up the calculation of an elliptic curve crypto-processor (ECC) when affine coordinates are used. In this work, the main goal is to implement a fast, high-performance modular inversion for ECC using field-programmable gate-array (FPGA) technology. A binary inversion algorithm in VHDL has been used for this efficient implementation. Timing simulation shows that the delay for one modular inversion operation in a modern Xilinx Virtex-7 FPGA is only 2.329 µs at the maximum frequency of 146.389 MHz. We have implemented an area-efficient design which takes a small amount of resources on the FPGA and needs only 1480 slices. To the best of the authors' knowledge, the proposed modular inversion over F_256 provides better performance than the available hardware implementations in terms of area and timing.
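
For readers unfamiliar with binary inversion, the sketch below shows the textbook binary extended Euclidean variant for an odd prime modulus in Python. It is an assumption that the paper's VHDL core follows this exact formulation; the function name `binary_mod_inverse` and the 255-bit example prime are illustrative choices, not taken from the paper.

```python
def binary_mod_inverse(a: int, p: int) -> int:
    """Return a^-1 mod p for an odd prime p, using only shifts, adds and subtracts."""
    if p < 3 or p % 2 == 0:
        raise ValueError("p must be an odd prime")
    a %= p
    if a == 0:
        raise ValueError("0 has no inverse")
    u, v = a, p
    x1, x2 = 1, 0          # invariants: x1*a = u (mod p), x2*a = v (mod p)
    while u != 1 and v != 1:
        while u % 2 == 0:
            u //= 2
            # Halve x1 modulo p: add p first if x1 is odd so the division is exact.
            x1 = x1 // 2 if x1 % 2 == 0 else (x1 + p) // 2
        while v % 2 == 0:
            v //= 2
            x2 = x2 // 2 if x2 % 2 == 0 else (x2 + p) // 2
        if u >= v:
            u, x1 = u - v, x1 - x2
        else:
            v, x2 = v - u, x2 - x1
    return x1 % p if u == 1 else x2 % p


# Example with a well-known 255-bit prime (illustrative operand size only).
p_example = 2**255 - 19
a_example = 0x123456789ABCDEF
inv_example = binary_mod_inverse(a_example, p_example)
```

The algorithm avoids general division and multiplication entirely, which is the property that makes it attractive for a compact FPGA datapath.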

Journal Article•DOI•

A new intelligent hardware implementation based on field programmable gate array for chaotic systems

Remzi Tuntas¹•Institutions (1)

Yüzüncü Yıl University¹

01 Oct 2015

TL;DR: A new intelligent hardware implementation was developed for chaotic systems by using field programmable gate array (FPGA) and the results obtained show that the proposed intelligent system simulation has much higher speed in comparison with HSPICE simulation.

Abstract: This paper presents a new intelligent hardware implementation for chaotic systems. Intelligent hardware is based on the wavelet decomposition and neural network (NN). Wavelet decomposition was used for feature extraction and NN was used for modeling. Configurations have been simulated and tested under ModelSim Xilinx software. The best configuration has been implemented under the Xilinx Virtex-II Pro chip. In the present study, a new intelligent hardware implementation was developed for chaotic systems by using field programmable gate array (FPGA). The success and superior properties of this new intelligent hardware implementation were shown by applying it to the Modified Van der Pol-Duffing Oscillator Circuit (MVPDOC). The validation of the intelligent system model was tested with both software and hardware. For this purpose, initially the intelligent system model of MVPDOC was obtained by using the wavelet decompositions and Artificial Neural Network (ANN). Then, the intelligent system model obtained has been written in Very High Speed Integrated Circuit Hardware Description Language (VHDL). In the next step, these configurations have been simulated and tested under ModelSim Xilinx software. Finally, the best configuration has been implemented under the Xilinx Virtex-II Pro FPGA (XC2V1000). Furthermore, the High Personal Simulation Program with Integrated Circuit Emphasis (HSPICE) simulation of MVPDOC has been carried out under ModelSim Xilinx software for comparison with the proposed intelligent system. The results obtained show that the proposed intelligent system simulation has much higher speed in comparison with HSPICE simulation.

Proceedings Article•DOI•

Hardware implementation of linear back-projection algorithm for capacitance tomography

Hans Herdian¹, Imamul Muttakin, Almushfi Saputra, Arbai Yusuf, Wahyu Widada, Warsito P. Taruno•Institutions (1)

Bandung Institute of Technology¹

01 Nov 2015

TL;DR: The final design is able to reconstruct a 32×32 pixel image from 8-electrode Electrical Capacitance Tomography (ECT) with speed of 23809 slice images per second and the image is shown on LCD.

Abstract: This paper presents a method to implement the Linear Back Projection (LBP) algorithm in Field Programmable Gate Arrays (FPGA). A top-down approach has been adopted for the design of the hardware of the LBP algorithm. The FPGA used is a Xilinx Spartan 3A and the language used to design the hardware is VHSIC Hardware Description Language (VHDL). The final design is able to reconstruct a 32×32 pixel image from 8-electrode Electrical Capacitance Tomography (ECT) at a speed of 23809 slice images per second, and the image is shown on an LCD. It could be further extended to form a quasi-3D image with 32 slices at a rate of 744 frames per second.
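
As a rough software illustration of linear back-projection, the Python sketch below uses the commonly cited normalized form, in which each pixel is the sensitivity-weighted average of the normalized capacitance measurements. Whether the paper uses exactly this normalization is an assumption, and the function name, the list-of-lists sensitivity layout, and the derived constants are hypothetical. An 8-electrode sensor yields 8·7/2 = 28 independent electrode pairs, and the quoted 744 frames per second is consistent with 23809 slices per second divided by 32 slices.

```python
from typing import List

N_ELECTRODES = 8
N_MEASUREMENTS = N_ELECTRODES * (N_ELECTRODES - 1) // 2   # 28 electrode pairs
N_PIXELS = 32 * 32                                        # 1024 pixels per slice


def linear_back_projection(sensitivity: List[List[float]],
                           capacitances: List[float]) -> List[float]:
    """Normalized LBP: pixel p = sum_m S[m][p] * c[m] / sum_m S[m][p].

    sensitivity  -- one row per measurement, one column per pixel (assumed layout)
    capacitances -- normalized capacitance per measurement, same order as the rows
    """
    if len(sensitivity) != len(capacitances):
        raise ValueError("one capacitance value is needed per sensitivity row")
    n_pixels = len(sensitivity[0])
    image = []
    for p in range(n_pixels):
        weighted = sum(row[p] * c for row, c in zip(sensitivity, capacitances))
        total = sum(row[p] for row in sensitivity)
        # Pixels that no measurement is sensitive to are left at zero.
        image.append(weighted / total if total else 0.0)
    return image


# Tiny two-measurement, two-pixel example (made-up numbers).
demo_image = linear_back_projection([[1.0, 0.0], [1.0, 2.0]], [0.5, 1.0])
```

Because the per-pixel computation is an independent multiply-accumulate over 28 terms, it maps naturally onto parallel or pipelined hardware.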

Proceedings Article•DOI•

Memory-Aware and High-Throughput Hardware Design for the HEVC Fractional Motion Estimation

Vladimir Afonso¹, Henrique Maich¹, Luan Audibert¹, Bruno Zatt¹, Marcelo Porto¹, Luciano Agostini¹•Institutions (1)

Universidade Federal de Pelotas¹

31 Aug 2015

TL;DR: The synthesis results for TSMC 65nm standard cells demonstrate that the developed design is able to process UHD 2160p videos at 60 frames per second (fps), reducing the required hardware resources by about five times when compared with the main related work.

Abstract: This paper presents a hardware design for the Fractional Motion Estimation (FME) of the High Efficiency Video Coding (HEVC) standard. The solution designed in this work uses a scheme to reduce the number of accesses to the reference frames stored in the external memory by up to 49.22%. A strategy to reduce the computational effort is also used. This strategy consists of using only the four square-shaped Prediction Unit (PU) sizes rather than all 24 possible PU sizes. This approach reduces the total encoding time by about 59%, with a bit-rate increase of only 4% for the same image quality. The hardware design was described in VHDL and synthesized for FPGA and ASIC technologies. The synthesis results for TSMC 65nm standard cells demonstrate that the developed design is able to process UHD 2160p videos at 60 frames per second (fps), reducing the required hardware resources by about five times when compared with the main related work.

Journal Article•DOI•

Power Consumption Analysis of BCD Adder using XPower Analyzer on VIRTEX FPGA

Gaurav Verma¹, Shambhavi Mishra¹, Sakshi Aggarwal¹, Surabhi Singh¹, Sushant Shekhar¹, Sukhbani Kaur Virdi¹•Institutions (1)

Jaypee Institute of Information Technology¹

06 Aug 2015-Indian journal of science and technology

TL;DR: In this work an efficient BCD ADDER1 is analyzed in terms of power consumption by scaling the various parameters like voltage, frequency and load capacitance and the focus is also given on the airflow of the device to reduce the power.

Abstract: Adders are an integral part of any digital circuit operation. Optimization of adder’s supremacy along with its vicinity is a demanding chore. In this work an efficient BCD ADDER1 is analyzed in terms of power consumption by scaling various parameters such as voltage, frequency and load capacitance. In addition, attention is given to the airflow of the device to reduce the power. Finally, the power is reduced by sending different encoded data at the input. The proposed designs are hardened and implemented by means of VHDL and Xilinx ISE (Integrated Software Environment) 14.5 and validated using XPower targeting a Virtex FPGA. Power consumption is discussed in terms of clock, signals, logic, inputs/outputs and leakage. A comparative analysis is shown at the end to validate the obtained results.

Proceedings Article•DOI•

Real time implementation of a novel chaotic generator on FPGA

Murat Tuna¹, Ismail Koyuncu², Can Bülent Fidan³, Ihsan Pehlivan⁴•Institutions (4)

Kırklareli University¹, Düzce University², Karabük University³, Sakarya University⁴

16 May 2015

TL;DR: With the developed FPGA-based novel chaotic system model, various chaos-based engineering applications such as true random number generation and secure communication systems can be realized.

Abstract: In this study, a new continuous-time autonomous chaotic system has been presented and implemented on FPGA. The presented chaotic system has been designed using the IEEE 754-1985 floating-point format and implemented using the Heun algorithm in VHDL. The designed system has been synthesized and tested on a Xilinx Virtex-6 FPGA chip. According to the test results, the operating frequency of the new FPGA-based chaotic signal generator was determined to be 390 MHz, and performance results have been given with chip statistics. In addition, the results obtained from the new FPGA-based chaotic generator have been compared with Matlab-based numerical results, and it has been observed that the obtained results are successful. With the developed FPGA-based novel chaotic system model, various chaos-based engineering applications such as true random number generation and secure communication systems can be realized.
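
The Heun algorithm mentioned here is a two-stage predictor-corrector integrator. The Python sketch below shows one generic Heun step; since the abstract does not give the equations of the new chaotic system, the classic Lorenz system with its usual parameters is used purely as a stand-in, and all names (`heun_step`, `lorenz`, the step size `h`) are illustrative assumptions.

```python
from typing import Callable, Tuple

State = Tuple[float, ...]


def heun_step(f: Callable[[State], State], state: State, h: float) -> State:
    """One Heun step: Euler predictor, then trapezoidal corrector."""
    k1 = f(state)
    predicted = tuple(s + h * d for s, d in zip(state, k1))
    k2 = f(predicted)
    return tuple(s + 0.5 * h * (d1 + d2) for s, d1, d2 in zip(state, k1, k2))


def lorenz(state: State) -> State:
    """Lorenz system, used only as a stand-in chaotic system (not the paper's)."""
    x, y, z = state
    sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


# Integrate a short trajectory with a small fixed step.
trajectory = [(1.0, 1.0, 1.0)]
for _ in range(1000):
    trajectory.append(heun_step(lorenz, trajectory[-1], 0.001))
```

Each step needs two evaluations of the right-hand side, so a floating-point hardware implementation trades roughly twice the arithmetic of forward Euler for second-order accuracy.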

Proceedings Article•DOI•

A General-Purpose Method for Faithfully Rounded Floating-Point Function Approximation in FPGAs

David B. Thomas¹•Institutions (1)

Imperial College London¹

22 Jun 2015

TL;DR: This paper presents a method for automatically creating high-performance pipelined floating-point function approximations, which can be integrated as IP cores into numerical accelerators, whether derived from HLS or traditional design methods.

Abstract: A barrier to wide-spread use of Field Programmable Gate Arrays (FPGAs) has been the complexity of programming, but recent advances in High-Level Synthesis (HLS) have made it possible for non-experts to easily create floating-point numerical accelerators from C-like code. However, HLS users are limited to the set of numerical primitives provided by HLS vendors and designers of floating-point IP cores, and cannot easily implement new fast or accurate numerical primitives. This paper presents a method for automatically creating high-performance pipelined floating-point function approximations, which can be integrated as IP cores into numerical accelerators, whether derived from HLS or traditional design methods. Both input and output are floating-point, but internally the function approximator uses fixed-point polynomial segments, guaranteeing a faithfully rounded output. A robust and automated non-uniform segmentation scheme is used to segment any twice-differentiable input function and produce platform-independent VHDL. The approach is demonstrated across ten functions, which are automatically generated then placed and routed in Xilinx devices. The method provides a 1.1x-3x improvement in area over composite numerical approximations, while providing similar performance and significantly better relative error.
