
Showing papers on "Reconfigurable computing published in 2018"


Journal ArticleDOI
TL;DR: A survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics, which include the supported applications, architectural choices, design space exploration methods, and achieved performance.
Abstract: In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated in the existing deep-learning ecosystem to provide a tunable balance between performance, power consumption, and programmability. In this article, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics, which include the supported applications, architectural choices, design space exploration methods, and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, a uniform evaluation methodology is proposed, aiming at the comprehensive, complete, and in-depth evaluation of CNN-to-FPGA toolflows.

167 citations


Posted Content
TL;DR: The methods and tools investigated in this survey represent the recent trends in FPGA CNN inference accelerators and will fuel future advances in efficient hardware deep learning.
Abstract: Convolutional Neural Networks (CNNs) are currently adopted to solve an ever greater number of problems, ranging from speech recognition to image classification and segmentation. The large amount of processing required by CNNs calls for dedicated and tailored hardware support methods. Moreover, CNN workloads have a streaming nature, well suited to reconfigurable hardware architectures such as FPGAs. The amount and diversity of research on the subject of CNN FPGA acceleration within the last 3 years demonstrates the tremendous industrial and academic interest. This paper presents a state-of-the-art survey of CNN inference accelerators on FPGAs. The computational workloads, their parallelism, and the involved memory accesses are analyzed. At the level of neurons, optimizations of the convolutional and fully connected layers are explained and the performance of the different methods is compared. At the network level, approximate computing and datapath optimization methods are covered and state-of-the-art approaches are compared. The methods and tools investigated in this survey represent the recent trends in FPGA CNN inference accelerators and will fuel future advances in efficient hardware deep learning.

114 citations


Journal ArticleDOI
27 Feb 2018
TL;DR: It is demonstrated how system designers can exploit hybrid and reconfigurable computing on SmallSats to harness these advantages for a variety of purposes, and several recent missions by NASA and industry that feature these principles and technologies are highlighted.
Abstract: Due to the increasing demands of onboard sensor and autonomous processing, one of the principal needs and challenges for future spacecraft is onboard computing. Space computers must provide high performance and reliability (which are often at odds), using limited resources (power, size, weight, and cost), in an extremely harsh environment (due to radiation, temperature, vacuum, and vibration). As spacecraft shrink in size, while assuming a growing role for science and defense missions, the challenges for space computing become particularly acute. For example, processing capabilities on CubeSats (a smaller class of SmallSats) have been extremely limited to date, often featuring microcontrollers with performance and reliability barely sufficient to operate the vehicle, let alone support various sensor and autonomous applications. This article surveys the challenges and opportunities of onboard computers for small satellites (SmallSats) and focuses on new concepts, methods, and technologies that are revolutionizing their capabilities, in terms of two guiding themes: hybrid computing and reconfigurable computing. These innovations are of particular need and value to CubeSats and other SmallSats. With new technologies, such as the CHREC Space Processor (CSP), we demonstrate how system designers can exploit hybrid and reconfigurable computing on SmallSats to harness these advantages for a variety of purposes, and we highlight several recent missions by NASA and industry that feature these principles and technologies.

101 citations


Proceedings ArticleDOI
15 Feb 2018
TL;DR: Rosetta is a realistic benchmark suite for software programmable FPGAs that can be useful for the HLS research community, but can also serve as a set of design tutorials for non-expert HLS users.
Abstract: Modern high-level synthesis (HLS) tools greatly reduce the turn-around time of designing and implementing complex FPGA-based accelerators. They also expose various optimization opportunities, which cannot be easily explored at the register-transfer level. With the increasing adoption of the HLS design methodology and continued advances of synthesis optimization, there is a growing need for realistic benchmarks to (1) facilitate comparisons between tools, (2) evaluate and stress-test new synthesis techniques, and (3) establish meaningful performance baselines to track progress of the HLS technology. While several HLS benchmark suites already exist, they are primarily comprised of small textbook-style function kernels, instead of complete and complex applications. To address this limitation, we introduce Rosetta, a realistic benchmark suite for software programmable FPGAs. Designs in Rosetta are fully-developed applications. They are associated with realistic performance constraints, and optimized with advanced features of modern HLS tools. We believe that Rosetta is not only useful for the HLS research community, but can also serve as a set of design tutorials for non-expert HLS users. In this paper we describe the characteristics of our benchmarks and the optimization techniques applied to them. We further report experimental results on an embedded FPGA device as well as a cloud FPGA platform.

91 citations


Journal ArticleDOI
TL;DR: Recryptor is a reconfigurable cryptographic processor that augments the existing memory of a commercial general-purpose processor with compute capabilities and demonstrates Recryptor’s programmability by implementing the cryptographic primitives of various public/secret key cryptographies and hash functions.
Abstract: Providing security for the Internet of Things (IoT) is increasingly important, but supporting many different cryptographic algorithms and standards within the physical constraints of IoT devices is highly challenging. Software implementations are inefficient due to the high bitwidth cryptographic operations; domain-specific accelerators are often inflexible; and reconfigurable crypto processors generally have large area and power overhead. This paper proposes Recryptor, a reconfigurable cryptographic processor that augments the existing memory of a commercial general-purpose processor with compute capabilities. It supports in-memory bitline computing using a 10-transistor bitcell to support different bitwise operations up to 512 bits wide. Custom-designed shifter, rotator, and S-box modules sit near the memory, providing high-throughput near-memory computing capabilities. We demonstrate Recryptor’s programmability by implementing the cryptographic primitives of various public/secret key cryptographies and hash functions. Recryptor runs at 28.8 MHz at 0.7 V, achieving a 6.8× average speedup and 12.8× average energy improvement over state-of-the-art software- and hardware-accelerated implementations, with only 0.128 mm² area overhead in 40-nm CMOS.

81 citations


Journal ArticleDOI
TL;DR: This research work is the first comprehensive survey on how random number generators are implemented on Field Programmable Gate Arrays (FPGAs), with a rich and up-to-date list of generators specifically mapped to FPGA.

75 citations


Proceedings ArticleDOI
05 Nov 2018
TL;DR: The Tile-Grained Pipeline Architecture (TGPA) is proposed, a heterogeneous design which supports pipelining execution of multiple tiles within a single input image on multiple heterogeneous accelerators.
Abstract: FPGAs are more and more widely used as reconfigurable hardware accelerators for applications leveraging convolutional neural networks (CNNs) in recent years. Previous designs normally adopt a uniform accelerator architecture that processes all layers of a given CNN model one after another. This homogeneous design methodology usually suffers from dynamic resource underutilization due to the tensor shape diversity of different layers. As a result, designs equipped with heterogeneous accelerators specific to different layers were proposed to resolve this issue. However, existing heterogeneous designs sacrifice latency for throughput by concurrent execution of multiple input images on different accelerators. In this paper, we propose an architecture named Tile-Grained Pipeline Architecture (TGPA) for low-latency CNN inference. TGPA adopts a heterogeneous design which supports pipelined execution of multiple tiles within a single input image on multiple heterogeneous accelerators. The accelerators are partitioned onto different FPGA dies to guarantee high frequency. A partition strategy is designed to maximize on-chip resource utilization. Experimental results show that TGPA designs for different CNN models achieve up to 40% performance improvement over homogeneous designs, and 3× latency reduction over state-of-the-art designs.

64 citations


Proceedings ArticleDOI
01 Aug 2018
TL;DR: BISMO is presented, a vectorized bit-serial matrix multiplication overlay for reconfigurable computing that utilizes the excellent binary-operation performance of FPGAs to offer a matrix multiplication performance that scales with required precision and parallelism.
Abstract: Matrix-matrix multiplication is a key computational kernel for numerous applications in science and engineering, with ample parallelism and data locality that lends itself well to high-performance implementations. Many matrix multiplication-dependent applications can use reduced-precision integer or fixed-point representations to increase their performance and energy efficiency while still offering adequate quality of results. However, precision requirements may vary between different application phases or depend on input data, rendering constant-precision solutions ineffective. We present BISMO, a vectorized bit-serial matrix multiplication overlay for reconfigurable computing. BISMO utilizes the excellent binary-operation performance of FPGAs to offer a matrix multiplication performance that scales with required precision and parallelism. We characterize the resource usage and performance of BISMO across a range of parameters to build a hardware cost model, and demonstrate a peak performance of 6.5 TOPS on the Xilinx PYNQ-Z1 board.
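The precision-scalable decomposition described in the abstract can be sketched in a few lines: an integer matrix product is rebuilt from binary (bit-plane) matrix products combined with shifts, which is why FPGA binary-operation throughput translates directly into multi-bit performance. This is an illustrative NumPy model of the idea, not BISMO's overlay hardware:

```python
import numpy as np

def bit_serial_matmul(A, B, a_bits, b_bits):
    """Multiply unsigned integer matrices by summing binary matrix products.

    A and B are decomposed into bit planes; each pair of planes needs only
    binary operations (AND + popcount in hardware), and the partial results
    are combined with shifts. Runtime scales with a_bits * b_bits, i.e.
    with the precision actually required.
    """
    acc = np.zeros((A.shape[0], B.shape[1]), dtype=np.int64)
    for i in range(a_bits):
        Ai = (A >> i) & 1          # i-th bit plane of A (0/1 matrix)
        for j in range(b_bits):
            Bj = (B >> j) & 1      # j-th bit plane of B
            acc += (Ai @ Bj) << (i + j)
    return acc
```

Because precision is a runtime parameter (`a_bits`, `b_bits`), the same kernel serves 1-bit through full-precision operands, mirroring how the overlay scales performance with required precision.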

63 citations


Proceedings ArticleDOI
29 May 2018
TL;DR: This paper observes that a "long" routing wire carrying a logical 1 reduces the propagation delay of other adjacent but unconnected long wires in the FPGA interconnect, thereby leaking information about its state, and proposes a communication channel that can be used for both covert transmissions between circuits, and for exfiltration of secrets from the chip.
Abstract: Field-Programmable Gate Arrays (FPGAs) are integrated circuits that implement reconfigurable hardware. They are used in modern systems, creating specialized, highly-optimized integrated circuits without the need to design and manufacture dedicated chips. As the capacity of FPGAs grows, it is increasingly common for designers to incorporate implementations of algorithms and protocols from a range of third-party sources. The monolithic nature of FPGAs means that all on-chip circuits, including third party black-box designs, must share common on-chip infrastructure, such as routing resources. In this paper, we observe that a "long" routing wire carrying a logical 1 reduces the propagation delay of other adjacent but unconnected long wires in the FPGA interconnect, thereby leaking information about its state. We exploit this effect and propose a communication channel that can be used for both covert transmissions between circuits, and for exfiltration of secrets from the chip. We show that the effect is measurable for both static and dynamic signals, and that it can be detected using very small on-board circuits. In our prototype, we are able to correctly infer the logical state of an adjacent long wire over 99% of the time, even without error correction, and for signals that are maintained for as little as 82 µs. Using a Manchester encoding scheme, our channel bandwidth is as high as 6 kbps. We characterize the channel in detail and show that it is measurable even when multiple competing circuits are present and can be replicated on different generations and families of Xilinx devices (Virtex 5, Virtex 6, and Artix 7). Finally, we propose countermeasures that can be deployed by systems and tools designers to reduce the impact of this information leakage.
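The Manchester encoding step the channel relies on is standard and easy to sketch. The polarity convention below is an assumption (IEEE 802.3 and G.E. Thomas use opposite conventions); the relevant property is that every bit carries a mid-bit transition, which lets the receiver stay synchronized over a noisy side channel:

```python
def manchester_encode(bits):
    # Assumed convention: 0 -> high-then-low, 1 -> low-then-high.
    # Each bit becomes two half-bit samples with a guaranteed transition.
    out = []
    for b in bits:
        out.extend((0, 1) if b else (1, 0))
    return out

def manchester_decode(halves):
    # A rising mid-bit transition (low -> high) decodes as 1, falling as 0.
    return [1 if halves[i] < halves[i + 1] else 0
            for i in range(0, len(halves), 2)]
```

The guaranteed transition halves the raw signaling rate, which is consistent with a channel whose per-symbol detection is slow relative to its bit rate.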

60 citations


Journal ArticleDOI
01 Oct 2018
TL;DR: In this paper, the authors proposed a memory-centric, reconfigurable, general purpose computing platform that is capable of handling the explosive amount of data in a fast and energy-efficient manner.
Abstract: For decades, advances in electronics were directly driven by the scaling of CMOS transistors according to Moore's law. However, both the CMOS scaling and the classical computer architecture are approaching fundamental and practical limits, and new computing architectures based on emerging devices, such as resistive random-access memory (RRAM) devices, are expected to sustain the exponential growth of computing capability. Here, we propose a novel memory-centric, reconfigurable, general purpose computing platform that is capable of handling the explosive amount of data in a fast and energy-efficient manner. The proposed computing architecture is based on a uniform, physical, resistive, memory-centric fabric that can be optimally reconfigured and utilized to perform different computing and data storage tasks in a massively parallel approach. The system can be tailored to achieve maximal energy efficiency based on the data flow by dynamically allocating the basic computing fabric for storage, arithmetic, and analog computing including neuromorphic computing tasks.

49 citations


Proceedings ArticleDOI
24 Jun 2018
TL;DR: This paper presents a novel approximate multiplier architecture customized towards the FPGA-based fabrics, an efficient design methodology, and an open-source library that provides higher area, latency and energy gains along with better output accuracy than those offered by the state-of-the-art ASIC-based approximate multipliers.
Abstract: The architectural differences between ASICs and FPGAs limit the effective performance gains achievable by the application of ASIC-based approximation principles to FPGA-based reconfigurable computing systems. This paper presents a novel approximate multiplier architecture customized to FPGA-based fabrics, an efficient design methodology, and an open-source library. Our designs provide higher area, latency, and energy gains along with better output accuracy than those offered by state-of-the-art ASIC-based approximate multipliers. Moreover, compared to the multiplier IP offered by Xilinx Vivado, our proposed design achieves up to 30%, 53%, and 67% gains in terms of area, latency, and energy, respectively, while incurring an insignificant accuracy loss (below 1% average relative error). Our library of approximate multipliers is open-source and available online at https://cfaed.tudresden.de/pd-downloads to fuel further research and development in this area, thereby enabling a new research direction for the FPGA community.
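As a rough illustration of the accuracy/area trade-off such libraries navigate (this is a generic truncated partial-product multiplier, not the paper's architecture), one can drop all partial-product bits below a weight threshold and measure the resulting relative error:

```python
def truncated_mul(a, b, k):
    """Approximate unsigned multiply that drops every partial-product bit
    of weight below 2**k (a generic approximation scheme, not the paper's
    design). Fewer partial-product bits means less adder logic in hardware."""
    acc = 0
    for i in range(a.bit_length()):
        if (a >> i) & 1:
            for j in range(b.bit_length()):
                if ((b >> j) & 1) and i + j >= k:
                    acc += 1 << (i + j)
    return acc

def mean_relative_error(k, width=8):
    # Exhaustive average relative error over all width-bit operand pairs;
    # truncation only removes bits, so the error is always non-negative.
    errs = []
    for a in range(1, 1 << width):
        for b in range(1, 1 << width):
            errs.append((a * b - truncated_mul(a, b, k)) / (a * b))
    return sum(errs) / len(errs)
```

Sweeping `k` traces out the same kind of area-versus-accuracy curve that an approximate-multiplier library exposes to the designer.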

Journal ArticleDOI
TL;DR: This work introduces a new design paradigm where the analogue and digital worlds are seamlessly fused via memristors, enabling electronics with reconfigurability.
Abstract: As the world enters the age of ubiquitous computing, the need for reconfigurable hardware operating close to the fundamental limits of energy consumption becomes increasingly pressing. Simultaneously, scaling-driven performance improvements within the framework of traditional analogue and digital design become progressively more restricted by fundamental physical constraints. Emerging nanoelectronics technologies bring forth new prospects yet a significant rethink of electronics design is required for realising their full potential. Here we lay the foundations of a design approach that fuses analogue and digital thinking by combining digital electronics with analogue memristive devices for achieving charge-based computation; information processing where every dissipated charge counts. This is realised by introducing memristive devices into standard logic gates, thus rendering them reconfigurable and capable of performing analogue computation at a power cost close to digital. The versatility and benefits of our approach are experimentally showcased through a hardware data clusterer and an analogue NAND gate.

Journal ArticleDOI
TL;DR: A design for a phase measurement logic core with resolution and precision in the range of a few picoseconds is proposed, based on subsample accumulation using systematic sampling of the phase detector signal.
Abstract: Phase measurement is required in electronic applications where a synchronous relationship between signals needs to be preserved. Traditional electronic systems used for time measurement are designed using a classical mixed-signal approach. With the advent of reconfigurable hardware such as field-programmable gate arrays (FPGAs), it is more advantageous for designers to opt for an all-digital architecture. Most high-speed serial transceivers in FPGA circuitry do not ensure the same chip latency after each power cycle, reset cycle, or firmware upgrade, causing uncertainty in the phase relationship between the recovered signals. To address the need to register minute phase shifts inside an FPGA, we propose a design for a phase measurement logic core with resolution and precision in the range of a few picoseconds. The working principle is based on subsample accumulation using systematic sampling of the phase detector signal. The phase measurement logic can operate over a wide range of digital clock frequencies, from a few kilohertz to the maximum frequency supported within the FPGA fabric. A mathematical model is developed to illustrate the operating principle of the design, and the VLSI architecture is designed for the logic core. We also discuss the measurement procedure, the calibration sequence involved, and the performance of the design in terms of accuracy, precision, and resolution.
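The subsampling principle can be modeled in software: an XOR phase detector is high for a fraction 2Δt/T of each period, so accumulating many systematic subsamples of its output recovers Δt far below the sample clock's resolution. The model below is an illustration under simplified assumptions (ideal square waves, a hypothetical `estimate_phase` helper), not the paper's logic core:

```python
import math

def estimate_phase(dt, period=1.0, n=200_000):
    """Estimate the offset dt (0 < dt < period/2) between two square waves
    from the accumulated duty cycle of their XOR phase detector output."""
    # Golden-ratio sample spacing gives low-discrepancy coverage of the
    # period, a stand-in for the paper's systematic sampling scheme.
    step = period * (math.sqrt(5) - 1) / 2
    high = 0
    for k in range(n):
        t = (k * step) % period
        a = t < period / 2                    # reference square wave
        b = ((t - dt) % period) < period / 2  # delayed square wave
        high += a != b                        # XOR detector output
    # XOR is high for a fraction 2*dt/period of the time.
    return (high / n) * period / 2
```

The estimate sharpens as more subsamples are accumulated, which is how a slow sampling clock can resolve picosecond-scale offsets in principle.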

Proceedings ArticleDOI
01 Aug 2018-Rice
TL;DR: This work reviews research articles on neural networks concerned with executing more than one input neuron and multiple layers, with or without linearity, on FPGAs, and employs a Multi Layer Perceptron with a Back Propagation learning algorithm to identify a prototype for diagnosis.
Abstract: Efficient hardware realization of an artificial neural network (ANN) depends to a large extent on the efficient implementation of a single neuron. For hardware execution of NNs, FPGA-based reconfigurable computing systems are generally favorable. FPGA realization of ANNs with a large number of neurons is a challenging task. This work reviews research articles on neural networks concerned with executing more than one input neuron and multiple layers, with or without linearity, on FPGAs. An implementation technique based on reserve substitution is proposed to handle signed decimal data. A detailed review of many research papers was carried out for the proposed work. The proposed paper employs a Multi Layer Perceptron with a Back Propagation learning algorithm to identify a prototype for diagnosis. A brief introduction to artificial neural networks as used today for disease diagnosis is also given.

Journal ArticleDOI
TL;DR: This paper focuses on improving the performance of the data plane from the edge to the core network segment (backhaul) in a 5G multi-tenant network by leveraging and exploring the programmability introduced by software-based networking.

Proceedings ArticleDOI
01 Oct 2018
TL;DR: In this paper, the high-level P4 language is used to implement a packet parser on reconfigurable hardware (i.e., an FPGA), which is then compiled to firmware by Xilinx SDNet.
Abstract: Nowadays, network managers look for ways to change the design and management of networks so that decisions can be made in the control plane. Future switches should support the new features and flexibility required for parsing and processing packets. One of the critical components of a switch is the packet parser, which processes packet headers so that decisions can be made about incoming packets. Here we focus on the data plane, and particularly the packet parser in OpenFlow switches, which should have the flexibility and programmability to support new requirements and multiple OpenFlow versions. We design an architecture that, unlike static network equipment, offers flexibility and programmability in the data plane, especially in SDN networks, and supports the parsing and processing of specific packets. The architecture is described in the high-level P4 language and implemented on reconfigurable hardware (i.e., an FPGA). After automatically generating the protocol-independent packet parser architecture on a Virtex-7, it is compiled to firmware by Xilinx SDNet, and ultimately an FPGA platform is implemented. The design consumes fewer resources and is more efficient in terms of throughput and processing speed than other architectures.

Journal ArticleDOI
01 Feb 2018-EPL
TL;DR: In this article, a quantum interferometer is used as a programmable spin logic device (PSLD) to characterize spin-based logical operations using the spin degree of freedom of the electron.
Abstract: Exploiting the spin degree of freedom of the electron, a new proposal is given to characterize spin-based logical operations using a quantum interferometer that can be utilized as a programmable spin logic device (PSLD). The ON and OFF states of both inputs and outputs are described by spin state only, circumventing spin-to-charge conversion at every stage, as is often required in conventional devices through the inclusion of extra hardware that can eventually diminish efficiency. All possible logic functions can be engineered from a single device without redesigning the circuit, which offers opportunities for designing a new generation of spintronic devices. Moreover, we also discuss the utilization of the present model as a memory device and suitable computing operations with proposed experimental setups.

Proceedings ArticleDOI
20 Oct 2018
TL;DR: The framework introduces a task-based computation model with explicit continuation passing to support dynamic parallelism in addition to static parallelism and introduces a design methodology that includes an architectural template that allows easily creating parallel accelerators from high-level descriptions.
Abstract: In this paper, we propose ParallelXL, an architectural framework for building application-specific parallel accelerators with low manual effort. The framework introduces a task-based computation model with explicit continuation passing to support dynamic parallelism in addition to static parallelism. In contrast, today's high-level design frameworks for accelerators focus on static data-level or thread-level parallelism that can be identified and scheduled at design time. To realize the new computation model, we develop an accelerator architecture that efficiently handles dynamic task generation and scheduling as well as load balancing through work stealing. The architecture is general enough to support many dynamic parallel constructs such as fork-join, data-dependent task spawning, and arbitrary nesting and recursion of tasks, as well as static parallel patterns. We also introduce a design methodology that includes an architectural template that allows easily creating parallel accelerators from high-level descriptions. The proposed framework is studied through an FPGA prototype as well as detailed simulations. Evaluation results show that the framework can generate high-performance accelerators targeting FPGAs for a wide range of parallel algorithms and achieve an average of 4.0x speedup over an eight-core out-of-order processor (24.1x over a single core), while being 11.8x more energy efficient.
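The dynamic load-balancing idea, each worker popping tasks from its own deque and stealing from a victim's opposite end when idle, can be sketched as a software simulation. The scheduler below is a hypothetical illustration of work stealing with dynamic task spawning, not ParallelXL's hardware:

```python
import random
from collections import deque

def run_work_stealing(tasks, n_workers=4, seed=0):
    """Simulate work-stealing load balancing: each worker owns a deque,
    pops work from its own tail (LIFO), and steals from a random victim's
    head (FIFO) when idle. Tasks are thunks that may spawn children by
    returning a list of new thunks (dynamic parallelism).
    Returns the number of tasks executed."""
    rng = random.Random(seed)
    deques = [deque() for _ in range(n_workers)]
    for i, t in enumerate(tasks):
        deques[i % n_workers].append(t)
    done = 0
    while any(deques):
        for w in range(n_workers):
            if deques[w]:
                task = deques[w].pop()            # own work: LIFO
            else:
                victims = [v for v in range(n_workers) if deques[v]]
                if not victims:
                    continue
                task = deques[rng.choice(victims)].popleft()  # steal: FIFO
            for child in task() or []:
                deques[w].append(child)           # dynamic task spawning
            done += 1
    return done
```

Tasks spawned at runtime (e.g., data-dependent recursion) are balanced across workers without any schedule being fixed at design time, which is the property static HLS parallelism lacks.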

Journal ArticleDOI
TL;DR: To the best of our knowledge, this is the first scalable FPGA implementation of the bilateral filter; it requires just O(1) operations per pixel for any arbitrary filter width and is both scalable and reconfigurable.
Abstract: The bilateral filter is an edge-preserving smoother that has applications in image processing, computer vision, and computational photography. In the past, field-programmable gate array (FPGA) implementations of the filter have been proposed that can achieve high throughput using parallelization and pipelining. An inherent limitation of direct implementations is that their complexity scales as O(ω²) with the filter width ω. In this paper, we propose an FPGA implementation of a fast bilateral filter that requires just O(1) operations for any arbitrary ω. The attractive feature of the FPGA implementation is that it is both scalable and reconfigurable. To the best of our knowledge, this is the first scalable FPGA implementation of the bilateral filter. As an application, we use the FPGA implementation for image denoising.
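The O(1) trick rests on "shiftable" range kernels: a cosine range kernel factors through the angle-sum identity into per-pixel terms times plain box filters, and a box filter costs O(1) per sample via running sums regardless of width. The 1-D sketch below uses a single cosine term for clarity (an assumption for illustration; Gaussian-like kernels are approximated by a sum of such terms):

```python
import numpy as np

def box_sum(x, r):
    # Running-sum box filter: O(1) work per sample regardless of radius r.
    c = np.concatenate(([0.0], np.cumsum(x)))
    i = np.arange(len(x))
    return c[np.minimum(i + r + 1, len(x))] - c[np.maximum(i - r, 0)]

def fast_bilateral_1d(f, r, gamma):
    # Range kernel cos(gamma*(f(x)-f(y))) expands via the angle-sum identity
    # into products of per-pixel terms and plain box filters, so the cost
    # per pixel is independent of the window radius r.
    cf, sf = np.cos(gamma * f), np.sin(gamma * f)
    num = cf * box_sum(f * cf, r) + sf * box_sum(f * sf, r)
    den = cf * box_sum(cf, r) + sf * box_sum(sf, r)
    return num / den
```

For a single cosine term the decomposition is exact, so the fast path matches a direct O(ωn) double loop to floating-point precision; hardware implementations pipeline the few auxiliary box filters instead of an ω-wide window.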

Journal ArticleDOI
TL;DR: A master–slave AMR architecture using the reconfigurability of field-programmable gate arrays (FPGAs) is proposed, and the constraint conditions of AMR on FPGAs are derived from the aspects of computing optimization and memory-access optimization.
Abstract: Intelligent radios collect information by sensing signals within the radio spectrum, and the automatic modulation recognition (AMR) of signals is one of their most challenging tasks. Although modulation classification based on deep neural networks achieves better results, training the neural network requires complicated calculations and expensive hardware. Therefore, in this paper, we propose a master–slave AMR architecture using the reconfigurability of field-programmable gate arrays (FPGAs). First, we discuss the method of building AMR using a stacked convolutional autoencoder (CAE), and analyze the principles of training and classification. Then, on the basis of the radio-frequency network-on-chip architecture, the constraint conditions of AMR on FPGAs are derived from the aspects of computing optimization and memory-access optimization. The experimental results not only demonstrate that CAE-based AMR works correctly, but also show that AMR based on neural networks can be implemented on FPGAs, with potential for dynamic spectrum allocation and cognitive radio systems.

Journal ArticleDOI
TL;DR: This paper presents new challenges for the real-time scheduling of distributed reconfigurable embedded systems powered by a renewable energy and shows the effectiveness of the proposed intelligent multiagent distributed architecture in terms of the number of exchanged messages, deadline success ratio, and the energy consumption.
Abstract: This paper presents new challenges for the real-time scheduling of distributed reconfigurable embedded systems powered by renewable energy. Reconfigurable computing systems have to deal with unpredictable events from the environment, such as the activation of new tasks and hardware or software failures, by adapting task allocation and scheduling in order to maintain system feasibility and performance. The proposed approach is based on an intelligent multiagent distributed architecture composed of: 1) a global agent, the "coordinator," associated with the whole distributed system and 2) four local agents (supervisor, scheduler, battery manager, and reconfiguration manager) belonging to each subsystem. The efficiency and completeness of the adaptive reconfiguration strategy are proved, as all possible reconfiguration forms are considered to guarantee a feasible system with a graceful quality of service. Two communication protocols, an intra-subsystem protocol and an inter-subsystem protocol, are proposed to ensure the effectiveness of the reconfiguration strategy. Extensive simulations show the effectiveness of the proposed intelligent multiagent distributed architecture in terms of the number of exchanged messages, deadline success ratio, and energy consumption.

Journal ArticleDOI
TL;DR: A flexible and scalable hardware accelerator for classification using RBFNNs, which places no limitation on the dimension of the input data, is developed; comparison of results shows that the scalability of the hardware architecture makes it a favorable solution for classifying very large data sets.
Abstract: In this paper we present the design and analysis of scalable hardware architectures for training the learning parameters of RBFNNs to classify large data sets. We design scalable hardware architectures for the K-means clustering algorithm, to train the positions of the hidden nodes in the hidden layer of the RBFNN, and for the pseudoinverse algorithm, for weight adjustment at the output layer. These scalable, parallel, pipelined architectures can handle data sets with no restriction on their dimensions. This paper also presents a flexible and scalable hardware accelerator for classification using the RBFNN that places no limitation on the dimension of the input data. We report FPGA synthesis results of our implementations and compare our hardware accelerator with CPU and GPU implementations of the same algorithms, as well as with other existing approaches. Analysis of these results shows that the scalability of our hardware architecture makes it a favorable solution for classifying very large data sets.
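The two training stages the paper accelerates, K-means for the hidden-layer centers and a pseudoinverse solve for the output weights, can be written as a compact NumPy reference (function names here are illustrative, not the paper's):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    # Lloyd's algorithm: the clustering stage mapped to a parallel pipeline.
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(0)
    return C

def rbf_design_matrix(X, C, sigma):
    # Gaussian activations of each sample at each hidden-node center.
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def train_rbfnn(X, Y, k, sigma):
    C = kmeans(X, k)                                        # hidden centers
    W = np.linalg.pinv(rbf_design_matrix(X, C, sigma)) @ Y  # output weights
    return C, W
```

Both stages are dominated by dense distance and matrix computations with no dimension-dependent control flow, which is what makes them amenable to the dimension-agnostic pipelines the paper describes.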

Journal ArticleDOI
TL;DR: This work reviews research articles on neural networks concerned with executing more than one input neuron and multiple layers, with or without linearity, on FPGAs; an implementation technique based on reserve substitution is proposed to handle signed decimal data.
Abstract: Efficient hardware realization of an artificial neural network (ANN) depends to a large extent on the efficient implementation of a single neuron. For hardware execution of NNs, FPGA-based reconfigurable computing systems are generally favorable. FPGA realization of ANNs with a large number of neurons is a challenging task. This work reviews research articles on neural networks concerned with executing more than one input neuron and multiple layers, with or without linearity, on FPGAs. An implementation technique based on reserve substitution is proposed to handle signed decimal data. A detailed review of many research papers was carried out for the proposed work.

Journal ArticleDOI
TL;DR: This paper proposes a reliable yet efficient FPGA-based security system via crypto engines and Physical Unclonable Functions (PUFs) for big data applications for cloud computing.
Abstract: Editor’s note: In cloud computing framework, the data security and protection is one of the most important aspects for optimization and concrete implementation. This paper proposes a reliable yet efficient FPGA-based security system via crypto engines and Physical Unclonable Functions (PUFs) for big data applications. Considering that FPGA or GPU-based accelerators are popular in data centers, we believe the proposed approach is very practical and effective method for data security in cloud computing. —Gi-Joon Nam, IBM Research

Journal ArticleDOI
TL;DR: Experimental results demonstrate that the hardware version of the HFC-VD algorithm can significantly outperform an equivalent software version, which makes the reconfigurable system appealing for onboard hyperspectral data processing.
Abstract: A challenging problem in spectral unmixing is how to determine the number of endmembers in a given scene. One of the most popular ways to determine the number of endmembers is by estimating the virtual dimensionality (VD) of the hyperspectral image using the well-known Harsanyi–Farrand–Chang (HFC) method. Due to the complexity and high dimensionality of hyperspectral scenes, this task is computationally expensive. Reconfigurable field-programmable gate arrays (FPGAs) are promising platforms that allow hardware/software codesign and the potential to provide powerful onboard computing capabilities and flexibility at the same time. In this paper, we present the first FPGA design for the HFC-VD algorithm. The proposed method has been implemented on a Virtex-7 XC7VX690T FPGA and tested using real hyperspectral data collected by NASA’s Airborne Visible Infra-Red Imaging Spectrometer over the Cuprite mining district in Nevada and the World Trade Center in New York. Experimental results demonstrate that our hardware version of the HFC-VD algorithm can significantly outperform an equivalent software version, which makes our reconfigurable system appealing for onboard hyperspectral data processing. Most important, our implementation exhibits real-time performance with regard to the time that the hyperspectral instrument takes to collect the image data.
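For reference, the HFC test itself is compact in software: it compares the sorted eigenvalues of the sample correlation and covariance matrices against a per-band Neyman-Pearson threshold at a chosen false-alarm probability, and counts the bands where the correlation eigenvalue is significantly larger. The sketch below uses a common asymptotic variance approximation for the eigenvalue difference; function names and defaults are illustrative, not the paper's FPGA formulation:

```python
import numpy as np
from statistics import NormalDist

def hfc_vd(pixels, pf=1e-3):
    """HFC virtual-dimensionality estimate.
    pixels: (N, L) array of N pixel vectors over L spectral bands.
    Counts bands whose correlation-matrix eigenvalue exceeds the
    covariance-matrix eigenvalue by a Neyman-Pearson threshold at
    false-alarm probability pf."""
    N, L = pixels.shape
    R = pixels.T @ pixels / N                    # sample correlation matrix
    K = np.cov(pixels, rowvar=False, bias=True)  # sample covariance matrix
    lr = np.sort(np.linalg.eigvalsh(R))[::-1]    # eigenvalues, descending
    lk = np.sort(np.linalg.eigvalsh(K))[::-1]
    # Asymptotic std of the eigenvalue difference under the null
    sigma = np.sqrt(2.0 * (lr ** 2 + lk ** 2) / N)
    tau = -NormalDist().inv_cdf(pf) * sigma      # per-band threshold
    return int(np.sum(lr - lk > tau))
```

The dominant cost is forming the two L-by-L Gram matrices over all pixels, which is exactly the streaming, highly parallel workload an FPGA implementation accelerates.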

Proceedings ArticleDOI
07 Aug 2018
TL;DR: A new network function is described that relies on in-network computing to limit the erratic effect of failing network links, to enable the continued use of those links until they can be repaired.
Abstract: Failing network links are usually disabled, and packets are routed around them until the links are repaired. While it is often possible to utilize some of a failing link's capacity, losing what remains of a link's capacity is typically deemed preferable to the erratic effect that unreliable links can have on application-level behavior. We describe a new network function that relies on in-network computing to limit the erratic effect of failing network links, to enable the continued use of those links until they can be repaired. We explore the design space using ns-3, and evaluate our implementation on a physical test-bed that includes programmable switches and reconfigurable hardware. Our current hardware prototype can almost saturate a 10GbE link while using around 10% of our FPGA's resources.

Journal ArticleDOI
TL;DR: In this paper, a quantum interferometer is operated as a programmable spin logic device (PSLD) to realize spin-based logical operations; the same device can also be utilized as a memory device.
Abstract: Exploiting the spin degree of freedom of the electron, a new proposal is given to characterize spin-based logical operations using a quantum interferometer that can be utilized as a programmable spin logic device (PSLD). The ON and OFF states of both inputs and outputs are described by the spin state only, circumventing the spin-to-charge conversion that conventional devices often require at every stage, with the extra hardware that can eventually diminish efficiency. All possible logic functions can be engineered from a single device without redesigning the circuit, which certainly offers opportunities for designing a new generation of spintronic devices. Moreover, we discuss the utilization of the present model as a memory device and suitable computing operations, with proposed experimental setups.

Posted Content
TL;DR: BISMO as discussed by the authors is a vectorized bit-serial matrix multiplication overlay for reconfigurable computing, which utilizes the excellent binary operation performance of FPGAs to offer a matrix multiplication performance that scales with required precision and parallelism.
Abstract: Matrix-matrix multiplication is a key computational kernel for numerous applications in science and engineering, with ample parallelism and data locality that lends itself well to high-performance implementations. Many matrix multiplication-dependent applications can use reduced-precision integer or fixed-point representations to increase their performance and energy efficiency while still offering adequate quality of results. However, precision requirements may vary between different application phases or depend on input data, rendering constant-precision solutions ineffective. We present BISMO, a vectorized bit-serial matrix multiplication overlay for reconfigurable computing. BISMO utilizes the excellent binary-operation performance of FPGAs to offer a matrix multiplication performance that scales with required precision and parallelism. We characterize the resource usage and performance of BISMO across a range of parameters to build a hardware cost model, and demonstrate a peak performance of 6.5 TOPS on the Xilinx PYNQ-Z1 board.
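The bit-serial scheme can be mirrored in software: each operand is decomposed into bit-planes, the binary matrix products (an AND plus popcount in hardware) are computed per pair of planes, and the partial results are accumulated with shifts that weight each bit's significance. A minimal NumPy sketch for unsigned integer operands follows; the function name and bit-width parameters are illustrative, not BISMO's actual interface:

```python
import numpy as np

def bitserial_matmul(A, B, bits_a, bits_b):
    """Multiply unsigned-integer matrices by summing shifted products
    of their bit-planes, in the spirit of bit-serial accelerators."""
    acc = np.zeros((A.shape[0], B.shape[1]), dtype=np.int64)
    for i in range(bits_a):
        Ai = (A >> i) & 1                  # i-th bit-plane of A
        for j in range(bits_b):
            Bj = (B >> j) & 1              # j-th bit-plane of B
            # Binary matmul: AND + popcount in hardware terms
            acc += (Ai @ Bj).astype(np.int64) << (i + j)
    return acc
```

The loop structure makes the scaling explicit: runtime grows with the product of the operand bit widths, so halving the precision of both operands quarters the work, which is the performance-precision trade-off the overlay exploits.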

Proceedings ArticleDOI
01 Oct 2018
TL;DR: A novel framework for virtualizing FPGA resources in the cloud that prevents the overhead of context switches between the virtual machine and host address spaces by using the in-kernel network stack for transferring packets to FPGAs.
Abstract: In this paper, we introduce a novel framework for virtualizing FPGA resources in the cloud. The proposed framework targets hardware/software architectures that leverage the Virtio paradigm for efficient communication between virtual machines (VMs) and the FPGAs. Furthermore, we present an FPGA overlay that uses reconfigurable hardware tiles and a flexible network-on-chip (NoC) architecture for transparent and optimized allocation of FPGA resources to VMs. The proposed overlay makes it possible to merge several FPGA regions allocated to a VM into a larger area, thus allowing FPGA resources to be resized on demand. Hardware sandboxes are then provided as a means to enforce domain separation between hardware tasks belonging to different VMs. The introduced framework avoids the overhead of context switches between the virtual machine and host address spaces by using the in-kernel network stack for transferring packets to FPGAs. Experimental results show a 2x to 35x performance increase compared to current state-of-the-art virtualization approaches.

Journal ArticleDOI
TL;DR: An FPGA-oriented baseband processing architecture suitable for communication scenarios such as non-contiguous carrier aggregation, centralized Cloud Radio Access Network (C-RAN) processing, and 4G/5G waveform coexistence is proposed and evaluated.
Abstract: The next evolution in cellular communications will not only improve upon the performance of previous generations, but also represent an unparalleled expansion in the number of services and use cases. One of the foundations for this evolution is the design of highly flexible, versatile, and resource-/power-efficient hardware components. This paper proposes and evaluates an FPGA-oriented baseband processing architecture suitable for communication scenarios such as non-contiguous carrier aggregation, centralized Cloud Radio Access Network (C-RAN) processing, and 4G/5G waveform coexistence. Our system is upgradeable, resource-efficient, cost-effective, and provides support for three 5G waveform candidates. Exploring Dynamic Partial Reconfiguration (DPR), the proposed architecture expands the design space exploration beyond the available hardware resources on the Zynq xc7z020 through hardware virtualization. Additionally, Dynamic Frequency Scaling (DFS) allows for run-time adjustment of processing throughput and reduces power consumption up to 88%. The resource overhead for DPR and DFS is residual, and the reconfiguration latency is two orders of magnitude below the control plane latency requirements proposed for 5G communications.