
Showing papers on "System on a chip published in 2018"


Proceedings ArticleDOI
05 Nov 2018
TL;DR: DNNBuilder, an automatic design space exploration tool to generate optimized parallelism guidelines by considering external memory access bandwidth, data reuse behaviors, FPGA resource availability, and DNN complexity, is designed and demonstrated.
Abstract: Building a high-performance FPGA accelerator for Deep Neural Networks (DNNs) often requires RTL programming, hardware verification, and precise resource allocation, all of which can be time-consuming and challenging to perform even for seasoned FPGA developers. To bridge the gap between fast DNN construction in software (e.g., Caffe, TensorFlow) and slow hardware implementation, we propose DNNBuilder for building high-performance DNN hardware accelerators on FPGAs automatically. A number of novel techniques, including high-quality RTL neural network components, a fine-grained layer-based pipeline architecture, and a column-based cache scheme, are developed to meet the throughput and latency requirements of both cloud and edge devices by boosting throughput, reducing latency, and saving FPGA on-chip memory. To address the limited-resource challenge, we design an automatic design space exploration tool to generate optimized parallelism guidelines by considering external memory access bandwidth, data reuse behaviors, FPGA resource availability, and DNN complexity. DNNBuilder is demonstrated on four DNNs (AlexNet, ZF, VGG16, and YOLO) on two FPGAs (XC7Z045 and KU115) targeting edge and cloud computing, respectively. The fine-grained layer-based pipeline architecture and the column-based cache scheme contribute 7.7x and 43x reductions in latency and BRAM utilization compared to conventional designs. We achieve the best performance (up to 5.15x faster) and efficiency (up to 5.88x more efficient) compared to published FPGA-based classification-oriented DNN accelerators for both edge and cloud computing cases. We reach 4218 GOPS for running the object detection DNN, which is, to the best of our knowledge, the highest throughput reported. DNNBuilder can provide millisecond-scale real-time performance for processing HD video input and deliver higher efficiency (up to 4.35x) than GPU-based solutions.
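
The parallelism-guideline search described above can be illustrated with a small design-space-exploration loop. The sketch below is a simplified, hypothetical model (the cost functions, the DSP_PER_MAC constant, and the candidate unroll factors are assumptions, not DNNBuilder's actual formulation): it enumerates channel and pixel unrolling factors for one layer and keeps the fastest configuration that fits both the DSP budget and the external-memory bandwidth.

```python
from itertools import product

# Hypothetical device parameters (illustrative only).
DEVICE = {"dsp": 3600, "bw_gbps": 12.8, "freq_mhz": 200}
DSP_PER_MAC = 1          # assumed cost of one multiply-accumulate unit
BYTES_PER_PIXEL = 2      # assumed 16-bit activations

def throughput_gops(pf_ch, pf_pix):
    """Operations finished per second, scaled to GOPS at the target clock."""
    return 2 * pf_ch * pf_pix * DEVICE["freq_mhz"] * 1e6 / 1e9

def bandwidth_need_gbps(pf_pix):
    """Rough input-streaming bandwidth needed for the unrolled pixels."""
    return pf_pix * BYTES_PER_PIXEL * DEVICE["freq_mhz"] * 1e6 / 1e9

best = None
for pf_ch, pf_pix in product([1, 2, 4, 8, 16, 32], [1, 2, 4, 7, 14]):
    dsp_used = pf_ch * pf_pix * DSP_PER_MAC
    if dsp_used > DEVICE["dsp"]:
        continue                      # violates the FPGA resource budget
    if bandwidth_need_gbps(pf_pix) > DEVICE["bw_gbps"]:
        continue                      # violates external-memory bandwidth
    gops = throughput_gops(pf_ch, pf_pix)
    if best is None or gops > best[0]:
        best = (gops, pf_ch, pf_pix, dsp_used)

print("best config (GOPS, ch-unroll, pix-unroll, DSPs):", best)
```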

244 citations


Journal ArticleDOI
TL;DR: This paper quantitatively analyzes and optimizes the design objectives of the CNN accelerator based on multiple design variables, and proposes a specific dataflow of hardware CNN acceleration to minimize data communication while maximizing resource utilization to achieve high performance.
Abstract: As convolution contributes most operations in convolutional neural network (CNN), the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution involves multiply and accumulate operations with four levels of loops, which results in a large design space. Prior works either employ limited loop optimization techniques, e.g., loop unrolling, tiling, and interchange, or only tune some of the design variables after the accelerator architecture and dataflow are already fixed. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit the data reuse and manage data movement efficiently. This paper overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g., memory access) of the CNN accelerator based on multiple design variables. Then, we propose a specific dataflow of hardware CNN acceleration to minimize the data communication while maximizing the resource utilization to achieve high performance. The proposed CNN acceleration scheme and architecture are demonstrated by implementing end-to-end CNNs including NiN, VGG-16, and ResNet-50/ResNet-152 for inference. For VGG-16 CNN, the overall throughputs achieve 348 GOPS and 715 GOPS on Intel Stratix V and Arria 10 FPGAs, respectively.
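
The four loop levels mentioned above (output channels, input channels, spatial positions of a feature map, and the kernel window) can be written out explicitly; tiling, unrolling, and interchange choices on exactly these loops define the design space the paper explores. Below is a plain NumPy reference of the loop nest with simple output-tile blocking; the tile size and loop order are illustrative and not the dataflow chosen in the paper.

```python
import numpy as np

def conv_tiled(ifm, weights, tile=7):
    """Direct convolution with the four loop levels made explicit.
    ifm:     (C_in, H, W) input feature maps
    weights: (C_out, C_in, K, K) kernels (stride 1, no padding)
    """
    c_in, h, w = ifm.shape
    c_out, _, k, _ = weights.shape
    oh, ow = h - k + 1, w - k + 1
    ofm = np.zeros((c_out, oh, ow))
    for oy0 in range(0, oh, tile):                  # spatial tiles
        for ox0 in range(0, ow, tile):
            for co in range(c_out):                 # output channels
                for ci in range(c_in):              # input channels
                    for ky in range(k):             # kernel window
                        for kx in range(k):
                            for oy in range(oy0, min(oy0 + tile, oh)):
                                for ox in range(ox0, min(ox0 + tile, ow)):
                                    ofm[co, oy, ox] += (
                                        ifm[ci, oy + ky, ox + kx]
                                        * weights[co, ci, ky, kx]
                                    )
    return ofm

# Quick check on a tiny random layer.
x = np.random.rand(3, 8, 8)
wt = np.random.rand(4, 3, 3, 3)
print(conv_tiled(x, wt).shape)   # (4, 6, 6)
```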

228 citations


Proceedings ArticleDOI
08 Mar 2018
TL;DR: This paper demonstrates the use of on-chip SGD-based training to compensate for PVT and data statistics variation to design a robust in-memory SVM classifier.
Abstract: Embedded sensory systems (Fig. 31.2.1) continuously acquire and process data for inference and decision-making purposes under stringent energy constraints. These always-ON systems need to track changing data statistics and environmental conditions, such as temperature, with minimal energy consumption. Digital inference architectures [1,2] are not well-suited for such energy-constrained sensory systems due to their high energy consumption, which is dominated (>75%) by the energy cost of memory read accesses and digital computations. In-memory architectures [3,4] significantly reduce the energy cost by embedding pitch-matched analog computations in the periphery of the SRAM bitcell array (BCA). However, their analog nature combined with stringent area constraints makes these architectures susceptible to process, voltage, and temperature (PVT) variation. Previously, off-chip training [4] has been shown to be effective in compensating for PVT variations of in-memory architectures. However, PVT variations are die-specific and data statistics in always-ON sensory systems can change over time. Thus, on-chip training is critical to address both sources of variation and to enable the design of energy efficient always-ON sensory systems based on in-memory architectures. The stochastic gradient descent (SGD) algorithm is widely used to train machine learning algorithms such as support vector machines (SVMs), deep neural networks (DNNs) and others. This paper demonstrates the use of on-chip SGD-based training to compensate for PVT and data statistics variation to design a robust in-memory SVM classifier.
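
As background for the on-chip SGD training mentioned above, the snippet below shows the textbook SGD update for a linear SVM with hinge loss on synthetic data; the learning rate, regularization, and data are illustrative, and the chip's mixed-signal in-memory update path is not modeled here.

```python
import numpy as np

def svm_sgd(X, y, epochs=20, lr=0.01, lam=1e-3):
    """Textbook SGD on the L2-regularized hinge loss for a linear SVM.
    X: (n_samples, n_features), y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (xi @ w + b)
            if margin < 1:                      # point violates the margin
                w += lr * (yi * xi - lam * w)
                b += lr * yi
            else:                               # only shrink w (regularizer)
                w -= lr * lam * w
    return w, b

# Tiny synthetic two-class problem.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.5, (50, 2)), rng.normal(+1, 0.5, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)
w, b = svm_sgd(X, y)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```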

156 citations


Journal ArticleDOI
TL;DR: In this paper, a scalable high-performance depthwise separable convolution optimized CNN accelerator is proposed, which can fit into FPGAs of different sizes by balancing hardware resources against processing speed.
Abstract: Convolutional neural networks (CNNs) have been widely deployed in the fields of computer vision and pattern recognition because of their high accuracy. However, large convolution operations are computing intensive and often require a powerful computing platform such as a graphics processing unit. This makes it difficult to apply CNNs to portable devices. The state-of-the-art CNNs, such as MobileNetV2 and Xception, adopt depthwise separable convolution to replace the standard convolution for embedded platforms, which significantly reduces operations and parameters with only a limited loss in accuracy. This highly structured model is very suitable for field-programmable gate array (FPGA) implementation. In this brief, a scalable high-performance depthwise separable convolution optimized CNN accelerator is proposed. The accelerator can fit into FPGAs of different sizes by balancing hardware resources against processing speed. As an example, MobileNetV2 is implemented on an Arria 10 SoC FPGA, and the results show this accelerator can classify each picture from ImageNet in 3.75 ms, which is about 266.6 frames per second. The FPGA design achieves a 20x speedup compared to a CPU.
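
The operation savings behind depthwise separable convolution are easy to quantify: a standard KxK convolution costs roughly H*W*C_in*C_out*K^2 multiply-accumulates, while the depthwise-plus-pointwise factorization costs H*W*C_in*K^2 + H*W*C_in*C_out. The short calculation below uses illustrative layer dimensions to show the close to K^2-fold reduction that MobileNet-style models rely on.

```python
def mac_counts(h, w, c_in, c_out, k):
    """Multiply-accumulate counts for standard vs. depthwise separable conv."""
    standard = h * w * c_in * c_out * k * k
    depthwise = h * w * c_in * k * k          # one KxK filter per input channel
    pointwise = h * w * c_in * c_out          # 1x1 conv to mix channels
    return standard, depthwise + pointwise

std, sep = mac_counts(h=56, w=56, c_in=128, c_out=128, k=3)
print(f"standard: {std:,} MACs, separable: {sep:,} MACs, ratio: {std / sep:.1f}x")
```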

150 citations


Proceedings ArticleDOI
19 Mar 2018
TL;DR: An adaptive layer partitioning and scheduling scheme, called SmartShuttle, is proposed to minimize off-chip memory accesses for CNN accelerators; it adaptively switches among different data reuse schemes and the corresponding tiling factor settings to dynamically match different convolutional layers.
Abstract: Convolutional Neural Network (CNN) accelerators are rapidly growing in popularity as a promising solution for deep learning based applications. Though optimizations on computation have been intensively studied, the energy efficiency of such accelerators remains limited by off-chip memory accesses, since their energy cost is magnitudes higher than other operations. Minimizing off-chip memory access volume, therefore, is the key to further improving energy efficiency. However, we observed that sticking to minimizing the accesses of one data type, as much prior work did, cannot fit the varying shapes of convolutional layers in CNNs. Hence, there is a dilemma over which data type's accesses to minimize. To overcome this problem, this paper proposes an adaptive layer partitioning and scheduling scheme, called SmartShuttle, to minimize off-chip memory accesses for CNN accelerators. SmartShuttle can adaptively switch among different data reuse schemes and the corresponding tiling factor settings to dynamically match different convolutional layers. Moreover, SmartShuttle thoroughly investigates the impact of data reusability and sparsity on the memory access volume. The experimental results show that SmartShuttle processes the convolutional layers at 434.8 multiply-and-accumulates (MACs)/DRAM access for VGG16 (batch size = 3), and 526.3 MACs/DRAM access for AlexNet (batch size = 4), which outperforms the state-of-the-art approach (Eyeriss) by 52.2% and 52.6%, respectively.
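
To make the "which data type to reuse" dilemma concrete, the sketch below estimates off-chip traffic for two simplified reuse schemes (keep weights on-chip vs. keep output partial sums on-chip) and picks the cheaper one per layer. The cost model, buffer size, and layer shapes are deliberately coarse illustrations, not SmartShuttle's actual access-volume equations.

```python
import math

def dram_volume(layer, scheme, buf_kb=512):
    """Very coarse off-chip traffic model (bytes, 16-bit data) for two
    simplified reuse schemes; real accelerators use finer tiling models."""
    c_in, c_out, h, w, k = (layer[x] for x in ("c_in", "c_out", "h", "w", "k"))
    ifmap = c_in * h * w * 2
    ofmap = c_out * h * w * 2
    weights = c_out * c_in * k * k * 2
    buf = buf_kb * 1024
    if scheme == "weight-reuse":
        # weights stay on chip; input feature maps re-read once per weight tile
        passes = math.ceil(weights / buf)
        return weights + passes * ifmap + ofmap
    if scheme == "psum-reuse":
        # output partial sums stay on chip; weights re-read once per output tile
        passes = math.ceil(ofmap / buf)
        return ifmap + passes * weights + ofmap
    raise ValueError(scheme)

layers = {
    "conv3_1 (VGG-like)": dict(c_in=128, c_out=256, h=56, w=56, k=3),
    "fc-like 1x1":        dict(c_in=2048, c_out=1000, h=1, w=1, k=1),
}
for name, layer in layers.items():
    best = min(("weight-reuse", "psum-reuse"),
               key=lambda s: dram_volume(layer, s))
    print(name, "->", best)
```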

81 citations


Journal ArticleDOI
TL;DR: A new architecture for an FPGA-based CNN accelerator is proposed that maps each layer to its own on-chip unit, with all units working concurrently as a pipeline, which can achieve maximum resource utilization as well as optimal computational efficiency.
Abstract: Recently, field-programmable gate arrays (FPGAs) have been widely used in the implementation of hardware accelerators for convolutional neural networks (CNNs). However, most of these existing accelerators are designed with the same idea as their ASIC counterparts, in which all operations from different layers are mapped to the same hardware units and work in a multiplexed way. This manner does not take full advantage of the reconfigurability and customizability of FPGAs, resulting in a certain degree of computational efficiency degradation. In this paper, we propose a new architecture for FPGA-based CNN accelerators that maps all the layers to their own on-chip units, which work concurrently as a pipeline. A comprehensive mapping and optimizing methodology based on a roofline-model-oriented optimization model is proposed, which can achieve maximum resource utilization as well as optimal computational efficiency. Besides, to ease the programming burden, we propose a design framework which provides a one-stop function for developers to generate the accelerator with our optimizing methodology. We evaluate our proposal by implementing different modern CNN models on Xilinx Zynq-7020 and Virtex-7 690t FPGA platforms. Experimental results show that our implementations can achieve a peak performance of 910.2 GOPS on the Virtex-7 690t and 36.36 GOP/s/W energy efficiency on the Zynq-7020, which are superior to previous approaches.
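
For context on the roofline-oriented optimization mentioned above, the roofline model bounds attainable throughput by min(peak compute, memory bandwidth x arithmetic intensity). The helper below applies that bound to a hypothetical per-layer accelerator unit; the device numbers are placeholders rather than the Zynq-7020 or Virtex-7 figures.

```python
def roofline_gops(ops, bytes_moved, peak_gops, bw_gbs):
    """Attainable performance under the roofline model.
    ops: total operations of the layer, bytes_moved: off-chip traffic."""
    intensity = ops / bytes_moved                  # operations per byte
    return min(peak_gops, bw_gbs * intensity)

# Hypothetical layer: 0.2 GOP moving 5 MB, on a 200-GOPS / 4-GB/s device.
print(roofline_gops(ops=0.2e9, bytes_moved=5e6, peak_gops=200, bw_gbs=4))
```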

80 citations


Proceedings ArticleDOI
01 Apr 2018
TL;DR: This work presents a scalable framework, FPDeep, which helps engineers map a specific CNN's training logic to a multi-FPGA cluster or cloud and to build RTL implementations for the target network.
Abstract: FPGA-based CNN accelerators have advantages in flexibility and power efficiency and so are being deployed by a number of cloud computing service providers, including Microsoft, Amazon, Tencent, and Alibaba. Given the increasing complexity of neural networks, however, it is becoming challenging to efficiently map CNNs to multi-FPGA platforms. In this work, we present a scalable framework, FPDeep, which helps engineers map a specific CNN's training logic to a multi-FPGA cluster or cloud and to build RTL implementations for the target network. With FPDeep, multi-FPGA accelerators work in a deeply-pipelined manner using a simple 1-D topology; this enables the accelerators to map directly onto many existing platforms, including Catapult, Catapult2, and almost any tightly-coupled FPGA cluster. FPDeep uses two mechanisms to facilitate high performance and energy efficiency. First, FPDeep provides a strategy to balance workload among FPGAs, leading to improved utilization. Second, training of CNNs is executed in a fine-grained inter- and intra-layer pipelined manner, minimizing the time that features need to remain available while waiting for back-propagation. This reduces the storage demand to where only on-chip memory is required for convolution layers. Experiments show that FPDeep has good scalability to a large number of FPGAs, with the limiting factor being the FPGA-to-FPGA bandwidth. Using six transceivers per FPGA, FPDeep shows linearity up to 60 FPGAs. We evaluate energy efficiency in GOPs/J and find that FPDeep provides up to 34 times higher energy efficiency than the Tesla K80 GPU.

67 citations


Journal ArticleDOI
18 Jun 2018
TL;DR: This work presents the first support vector machine (SVM) processor that supports on-chip active learning for seizure detection with the best detection performance and a 153,310x higher energy efficiency than a high-end CPU for SVM training.
Abstract: This article presents a support vector machine (SVM) processor that supports both seizure detection and on-chip model adaptation for epileptic seizure control. Alternating direction method of multipliers (ADMM) is utilized for highly parallel computing for SVM training. From the algorithm aspect, minimum redundancy maximum relevance (mRMR) and low-rank approximation are exploited to reduce overall computational complexity by 99.4% while also reducing memory storage by 90.4%. For hardware optimization, overall hardware complexity is reduced by 87% through a hardware-shared configurable coordinate rotation digital computer (CORDIC)-based processing element array. Parallel rotations and a folded structure for the approximate Jacobi method reduce overall training latency by 98.6%. The chip achieves a detection performance with a 96.6% accuracy and a 0.28/h false alarm rate within 0.71 s with a power dissipation of 1.9 mW. The proposed SVM processor achieves the shortest detection latency compared with state-of-the-art seizure detectors. It also supports real-time model adaptation with a latency of 0.78 s. Compared with previous designs, this work achieves a 22x higher throughput and a 162x higher energy efficiency for SVM training.

47 citations


Journal ArticleDOI
TL;DR: A solution to tackle both privacy issues and big data transmission by incorporating the theory of compressive sensing and a simple, yet, efficient identification mechanism using the electrocardiogram (ECG) signal as a biometric trait is proposed.
Abstract: The ever-increasing demand for biometric solutions for Internet-of-Things (IoT)-based connected health applications is mainly driven by the need to tackle fraud issues, along with the imperative to improve patient privacy, safety, and personalized medical assistance. However, the advantages offered by IoT platforms come with the burden of big data and its associated challenges in terms of computing complexity, bandwidth availability, and power consumption. This paper proposes a solution to tackle both privacy issues and big data transmission by incorporating the theory of compressive sensing and a simple, yet efficient, identification mechanism using the electrocardiogram (ECG) signal as a biometric trait. Moreover, the paper presents the hardware implementation of the proposed solution on a system-on-chip (SoC) platform with an optimized architecture to further reduce hardware resource usage. First, we investigate the feasibility of compressing the ECG data while maintaining a high identification quality. The obtained results show a 98.88% identification rate using only a compression ratio of 30%. Furthermore, the proposed system has been implemented on a Zynq SoC using a heterogeneous software/hardware solution, which is able to accelerate the software implementation by a factor of 7.73 with a power consumption of 2.318 W.
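
To illustrate the compressive-sensing front end described above, the snippet below compresses an ECG-like window with a random Bernoulli sensing matrix; the window length, matrix choice, and the reading of a "30% compression ratio" as m = 0.3*n are illustrative assumptions rather than the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 512                                   # samples per ECG window (assumed)
m = int(0.3 * n)                          # 30% compression ratio, read as m = 0.3*n

# Random Bernoulli sensing matrix, a common hardware-friendly choice.
phi = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)

# Synthetic ECG-like window: a sharp bump riding on a slow baseline.
t = np.linspace(0, 1, n)
ecg = 0.1 * np.sin(2 * np.pi * t) + np.exp(-((t - 0.5) ** 2) / 1e-4)

y = phi @ ecg                             # compressed measurements sent over the IoT link
print(ecg.shape, "->", y.shape)           # (512,) -> (153,)
```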

46 citations


Journal ArticleDOI
TL;DR: The ability of the chip to reach a high upper frequency of operation is demonstrated, thus overcoming the low-frequency Debye screening limit at nearly physiological salt concentrations in the electrolyte and allowing for detection of events occurring beyond the extent of the electrical double layer.
Abstract: We describe the realization of a fully electronic label-free temperature-controlled biosensing platform aimed to overcome the Debye screening limit over a wide range of electrolyte salt concentrations. It is based on an improved version of a 90-nm CMOS-integrated circuit featuring a nanocapacitor array, readout and A/D conversion circuitry, and a field programmable gate array (FPGA)-based interface board with a NIOS II soft processor. We describe the chip's processing, mounting, microfluidics, and temperature control system, as well as the calibration and compensation procedures to reduce systematic errors, which altogether make up a complete quantitative sensor platform. Capacitance spectra recorded up to 70 MHz are shown and successfully compared to predictions by finite element method (FEM) numerical simulations in the Poisson–Drift–Diffusion formalism. They demonstrate the ability of the chip to reach a high upper frequency of operation, thus overcoming the low-frequency Debye screening limit at nearly physiological salt concentrations in the electrolyte, and allowing for detection of events occurring beyond the extent of the electrical double layer. Furthermore, calibrated multifrequency measurements enable quantitative recording of capacitance spectra, whose features can reveal new properties of the analytes. The scalability of the electrode dimensions, interelectrode pitch, and size of the array makes this sensing approach of quite general applicability, even in a non-bio context (e.g., gas sensing).

46 citations


Proceedings ArticleDOI
20 Oct 2018
TL;DR: To attain power savings without NN accuracy loss, a novel technique is proposed that relies on the deterministic behavior of undervolting faults and can limit the accuracy loss to 0.1% without any timing-slack overhead.
Abstract: In this work, we evaluate aggressive undervolting, i.e., voltage scaling below the nominal level, to reduce the energy consumption of Field Programmable Gate Arrays (FPGAs). Usually, voltage guardbands are added by chip vendors to cover worst-case process and environmental scenarios. Through experimenting on several FPGA architectures, we measure this voltage guardband to be on average 39% of the nominal level, which, in turn, delivers more than an order of magnitude in power savings. However, further undervolting below the voltage guardband may cause reliability issues as a result of increased circuit delay, i.e., faults start to appear. We extensively characterize the behavior of these faults in terms of rate, location, type, as well as sensitivity to environmental temperature, with a concentration on on-chip memories, or Block RAMs (BRAMs). Finally, we evaluate a typical FPGA-based Neural Network (NN) accelerator under low-voltage BRAM operation. Consequently, the substantial NN energy savings come at the cost of NN accuracy loss. To attain the power savings without NN accuracy loss, we propose a novel technique that relies on the deterministic behavior of undervolting faults and can limit the accuracy loss to 0.1% without any timing-slack overhead.

Journal ArticleDOI
25 Jul 2018-PLOS ONE
TL;DR: A potentiostat built with a single commercially available integrated circuit (IC) that does not require any external electronic components to perform electrochemical experiments is demonstrated using the capabilities of the Programmable System on a Chip (PSoC) by Cypress Semiconductor.
Abstract: In this paper we demonstrate a potentiostat built with a single commercially available integrated circuit (IC) that does not require any external electronic components to perform electrochemical experiments. This is done using the capabilities of the Programmable System on a Chip (PSoC®) by Cypress Semiconductor, which integrates all of the necessary electrical components. This is in contrast to other recent papers that have developed potentiostats but require technical skills or specialized equipment to produce. This eliminates the process of having to make a printed circuit board and soldering on electronic components. To control the device, a graphical user interface (GUI) was developed in the python programming language. Python is open source, with a style that makes it easy to read and write programs, making it an ideal choice for open source projects. As the developed device is open source and based on a PSoC, modification to implement other electrochemical techniques is straightforward and only requires modest programming skills, but no expensive equipment or difficult techniques. The potentiostat developed here adds to the growing amount of open source laboratory equipment. To demonstrate the PSoC potentiostat in a wide range of applications, we performed cyclic voltammetry (to measure vitamin C concentration in orange juice), amperometry (to measure glucose with a glucose strip), and stripping voltammetry experiments (to measure lead in water). The device was able to perform all experiments and could accurately measure Vitamin C, glucose, and lead.
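
For readers unfamiliar with the technique, cyclic voltammetry sweeps the cell potential linearly between two limits and back while the current is sampled; the host-side Python sketch below only generates such a staircase sweep. The sweep limits, step size, and scan rate are arbitrary examples, and the actual PSoC firmware commands and serial protocol of the published device are not modeled.

```python
import numpy as np

def cv_waveform(v_start=-0.5, v_vertex=0.5, step_mv=5, scan_rate_mv_s=100):
    """Triangular (staircase) potential sweep for one cyclic-voltammetry cycle.
    Returns the potential steps in volts and the dwell time per step in seconds."""
    step = step_mv / 1000.0
    up = np.arange(v_start, v_vertex + step, step)
    down = np.arange(v_vertex, v_start - step, -step)
    potentials = np.concatenate([up, down[1:]])
    dwell_s = step_mv / scan_rate_mv_s          # time spent at each step
    return potentials, dwell_s

pots, dwell = cv_waveform()
print(f"{len(pots)} steps, {dwell * 1000:.0f} ms per step, "
      f"{len(pots) * dwell:.1f} s per cycle")
```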

Journal ArticleDOI
TL;DR: A packaging solution for millimeter-wave system-on-chip (SoC) radio transceivers is presented; it includes a high-permittivity silicon lens, which additionally serves as a heat sink, and a quad flat no-lead package that is mountable on a standard printed circuit board (PCB).
Abstract: In this paper, a packaging solution for millimeter-wave system-on-chip (SoC) radio transceivers is presented. The on-chip antennas are realized as primary radiators of an integrated lens antenna which offer high bandwidth and high efficiency. The package concept includes a high permittivity silicon lens which serves additionally as heat sink and a quad flat no-lead package which is mountable on a standard printed circuit board (PCB). The electrical and thermal properties of the package are investigated through simulations and calibrated measurements. The concept is verified by realizing a complete radar sensor. The manufactured SoC radar frontend is soldered on a standard PCB which includes the baseband circuitry for a frequency-modulated continuous wave radar and finally, measurements are performed to compare the superposed radiation patterns of the transmit and receive antennas with simulations.

Proceedings ArticleDOI
04 Nov 2018
TL;DR: A post-SoC/32-bit design point called Hamilton is developed, showing that using integrated components enables a ~$7 core and shifts hardware modularity to design time, and its efficient MCU control improves concurrency with ~30% less energy consumption.
Abstract: The emergence of low-power 32-bit Systems-on-Chip (SoCs), which integrate a 32-bit MCU, radio, and flash, presents an opportunity to re-examine design points and trade-offs at all levels of the system architecture of networked sensors. To this end, we develop a post-SoC/32-bit design point called Hamilton, showing that using integrated components enables a ~$7 core and shifts hardware modularity to design time. We study the interaction between hardware and embedded operating systems, identifying that (1) post-SoC motes provide lower idle current (5.9 μA) than traditional 16-bit motes, (2) 32-bit MCUs are a major energy consumer (e.g., tick increases idle current >50 times), comparable to radios, and (3) thread-based concurrency is viable, requiring only 8.3 μs of context switch time. We design a system architecture, based on a tickless multithreading operating system, with cooperative/adaptive clocking, advanced sensor abstraction, and preemptive packet processing. Its efficient MCU control improves concurrency with ~30% less energy consumption. Together, these developments set the system architecture for networked sensors in a new direction.

Proceedings ArticleDOI
20 Oct 2018
TL;DR: A new SSD simulation framework, SimpleSSD 2.0, namely Amber, is introduced that models embedded CPU cores, DRAMs, and various flash technologies (within an SSD), and operates under a full-system simulation environment by enabling data transfer emulation.
Abstract: SSDs have become a major storage component in modern memory hierarchies, and SSD research demands exploring future simulation-based studies by integrating SSD subsystems into a full-system environment. However, several challenges exist in modeling SSDs under full-system simulation; SSDs constitute their own complete system and architecture, which employ all necessary hardware, such as CPUs, DRAM, and an interconnect network. In addition to these hardware components, SSDs also require multiple device controllers, internal caches, and software modules that respect a wide spectrum of storage interfaces and protocols. All of this SSD hardware and software is necessary to realize storage subsystems in a full-system environment, which can operate in parallel with the host system. In this work, we introduce a new SSD simulation framework, SimpleSSD 2.0, namely Amber, that models embedded CPU cores, DRAMs, and various flash technologies (within an SSD), and operates under a full-system simulation environment by enabling data transfer emulation. Amber also includes a full firmware stack, including DRAM cache logic and flash firmware such as the FTL and HIL, and obeys diverse standard protocols by revising the host DMA engines and system buses of all functional and timing CPU models of a popular full-system simulator (gem5). The proposed simulator can capture the details of dynamic performance and power of embedded cores, DRAMs, firmware, and flash under the execution of various operating systems and hardware platforms. Using Amber, we characterize several system-level challenges by simulating different types of full systems, such as mobile devices and general-purpose computers, and offer comprehensive analyses by comparing passive storage and active storage architectures.

Proceedings ArticleDOI
Yun Yin1, Liang Xiong1, Yiting Zhu1, Bowen Chen1, Hao Min1, Hongtao Xu1 
01 Feb 2018
TL;DR: A high-power digital Doherty PA for NB-IoT applications is proposed and a parallel-combining-transformer (PCT) power combiner for dual-band coverage, back-off efficiency enhancement, and ultra-compact implementation is introduced.
Abstract: Narrowband Internet-of-Things (NB-IoT) is a newly developed 3GPP protocol optimized for low-power wide-area IoT applications and is evolving toward the future fifth-generation (5G) mobile communication. It specifies at least 23 dBm maximum output power for long-range communication and a stringent emission mask compatible with guard-band or in-band scenarios, and it supports multiple operation bands from 699 to 915 MHz (LB) and from 1710 to 1980 MHz (HB). For cost reduction, longer battery life, and fast time to market, the integration of high-power high-efficiency power amplifiers (PAs) on-chip is greatly demanded. To benefit from advanced CMOS technology, the digital polar transmitter has become a very attractive architecture for NB-IoT applications [1]. To simultaneously support dual bands for user flexibility, the traditional solution is to implement two separately optimized PAs [2], which requires extra design effort and increases die area. An ultra-compact single-transformer-based parallel power combiner proposed in [3] provides optimum load transformation in the two operation bands. Moreover, to support higher throughputs and achieve better spectral efficiency, high peak-to-average-power-ratio (PAPR) multi-subcarrier modulation is adopted in NB-IoT, which requires the PA to be efficient not only at peak power but also at power back-off (PBO) to extend battery life. Efficiency-boosting techniques for digital Doherty PAs have been shown in [4-6], but two transformers are needed in the passive network. In this work, a high-power digital Doherty PA for NB-IoT applications is proposed, which introduces a parallel-combining-transformer (PCT) power combiner for dual-band coverage, back-off efficiency enhancement, and ultra-compact implementation.

Journal ArticleDOI
TL;DR: The proposed ECG-on-chip contains a low noise preamplifier with embedded band-pass function, a programmable gain buffer, a 12-bit successive approximation ADC, a novel morphological filter based QRS detector, 8-Kb on-chip SRAM, a control unit and MCU interfaces.
Abstract: This brief presents an ultra-low power single chip solution for electrocardiography (ECG) signal acquisition and processing in wearable ECG sensors. The chip contains a low noise preamplifier with embedded band-pass function, a programmable gain buffer, a 12-bit successive approximation ADC, a novel morphological filter based QRS detector, 8-Kb on-chip SRAM, a control unit and MCU interfaces. The chip was designed and implemented in a 0.35-μm standard CMOS process. The analog core operates from 0.8 V to 1.8 V, while the digital circuits and SRAM operate from 1.5 V to 3.6 V. The chip has a total core area of 5.74 mm2 and consumes 2.3 μW. Small size and low power consumption make this chip suitable for usage in wearable ECG sensors. Apart from presenting the measurement results, we also successfully demonstrate a prototype wearable ECG device for long term cardiac monitoring using the proposed ECG-on-chip.

Journal ArticleDOI
TL;DR: An Integer-Linear Programming (ILP) model is proposed to properly address the communication problem, generating optimal solutions with consideration of inter-processor communication, and a novel heuristic algorithm for task mapping in dark silicon many-core systems, called TopoMap, is presented.
Abstract: Dark silicon is the phenomenon that a fraction of a many-core chip has to be turned off or run in a low-power state in order to maintain a safe chip temperature. System-level thermal management techniques normally map applications onto non-adjacent cores, but communication efficiency among these cores is adversely affected compared with a conventional network-on-chip (NoC). Recently, the SMART NoC architecture was proposed, enabling single-cycle multi-hop bypass channels to be built between distant cores at runtime to reduce communication latency. However, the communication efficiency of SMART NoC is diminished by communication contention, which in turn decreases system performance. In this paper, we first propose an Integer-Linear Programming (ILP) model to properly address the communication problem, which generates optimal solutions with consideration of inter-processor communication. We further present a novel heuristic algorithm for task mapping in dark silicon many-core systems, called TopoMap, on top of the SMART architecture, which can effectively solve the communication contention problem in polynomial time. With fine-grained consideration of chip thermal reliability and inter-processor communication, the presented approaches are able to control the reconfigurability of the NoC communication topology in task mapping and scheduling. A thermally safe system is guaranteed by physically decentralized active cores, and communication overhead is reduced by minimized communication contention and maximized bypass routing. Performance evaluation on PARSEC shows the applicability and effectiveness of the proposed techniques, which achieve on average 42.5 and 32.4 percent improvement in communication and application performance, and a 32.3 percent reduction in system energy consumption, compared with state-of-the-art techniques. TopoMap only introduces a 1.8 percent performance difference compared to the ILP model and is more scalable to large-size NoCs.
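
The mapping objective discussed above, minimizing inter-processor communication while keeping active cores thermally separated, can be sketched with a toy cost function: the sum over all communicating task pairs of traffic volume times hop distance on the mesh. The snippet below evaluates that cost for two candidate placements on a small mesh; it is a didactic stand-in, not the ILP formulation or the TopoMap heuristic, and it ignores SMART bypass channels.

```python
def hops(a, b, mesh_w):
    """Manhattan (XY-routing) hop distance between two mesh cores."""
    ax, ay = a % mesh_w, a // mesh_w
    bx, by = b % mesh_w, b // mesh_w
    return abs(ax - bx) + abs(ay - by)

def comm_cost(mapping, traffic, mesh_w=4):
    """Sum of traffic volume x hop distance over all communicating task pairs.
    mapping: task -> core index, traffic: {(task_i, task_j): volume}."""
    return sum(vol * hops(mapping[i], mapping[j], mesh_w)
               for (i, j), vol in traffic.items())

# Toy task graph: 4 tasks, traffic volumes in flits (illustrative).
traffic = {("t0", "t1"): 120, ("t1", "t2"): 80, ("t0", "t3"): 40}
adjacent  = {"t0": 0, "t1": 1, "t2": 2, "t3": 4}     # clustered placement
scattered = {"t0": 0, "t1": 5, "t2": 10, "t3": 15}   # thermally spread placement
print(comm_cost(adjacent, traffic), "vs", comm_cost(scattered, traffic))
```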

Journal ArticleDOI
01 Feb 2018
TL;DR: In this paper, a field-programmable gate array (FPGA)-accelerated adaptation of the efficient large-scale stereo (ELAS) algorithm is presented, achieving a frame rate of 47 fps while consuming under 4 W of power.
Abstract: For many applications in low-power real-time robotics, stereo cameras are the sensors of choice for depth perception as they are typically cheaper and more versatile than their active counterparts. Their biggest drawback, however, is that they do not directly sense depth maps; instead, these must be estimated through data-intensive processes. Therefore, appropriate algorithm selection plays an important role in achieving the desired performance characteristics. Motivated by applications in space and mobile robotics, we implement and evaluate a field-programmable gate array (FPGA)-accelerated adaptation of the efficient large-scale stereo (ELAS) algorithm. Despite offering one of the best tradeoffs between efficiency and accuracy, ELAS has only been shown to run at 1.5–3 fps on a high-end CPU. Our system preserves all intriguing properties of the original algorithm, such as the slanted plane priors, but can achieve a frame rate of 47 fps whilst consuming under 4 W of power. Unlike previous FPGA-based designs, we take advantage of both components on the CPU/FPGA system-on-chip to showcase the strategy necessary to accelerate more complex and computationally diverse algorithms for such low-power, real-time systems.

Journal ArticleDOI
TL;DR: A two-layer configurable global interconnection is implemented in the proposed architecture to reduce virtualization time overhead, make an efficient trade-off between the resource utilization and simulation time of the whole simulator, and especially provide the capability of simulating irregular topologies.
Abstract: On-chip interconnections play an important role in multi/many-processor systems-on-chip (MPSoCs). In order to achieve efficient optimization, each specific application must utilize a specific architecture, and consequently a specific interconnection network. For design space exploration and finding the best NoC solution for each specific application, a fast and flexible NoC simulator is necessary, especially for large design spaces. In this paper, we present an FPGA-based NoC co-simulator, which is able to be configured via software. In our proposed NoC simulator, entitled DuCNoC, we implement a dual-clock router micro-architecture, which demonstrates a 75x-350x speed-up against BOOKSIM. Additionally, we implement a two-layer configurable global interconnection in our proposed architecture to (1) reduce virtualization time overhead, (2) make an efficient trade-off between the resource utilization and simulation time of the whole simulator, and especially (3) provide the capability of simulating irregular topologies. Migration of some important sub-modules, like traffic generators (TGs) and traffic receptors (TRs), to the software side, and the implementation of dual-clock context switching in virtualization, are other major features of DuCNoC. Thanks to its dual-clock router micro-architecture, as well as the TG and TR migration to the software side, DuCNoC can simulate a 100-node (10×10) non-virtualized or a 2048-node virtualized mesh network on a Xilinx Zynq-7000.

Proceedings ArticleDOI
Bo-Yuan Huang1, Sayak Ray2, Aarti Gupta1, Jason M. Fung2, Sharad Malik1 
24 Jun 2018
TL;DR: This paper models hardware using the Instruction-Level Abstraction (ILA), capturing firmware-visible behavior at the architecture level, and proposes an optimization using abstraction to prevent expensive bit-precise reasoning.
Abstract: Formal security verification of firmware interacting with hardware in modern Systems-on-Chip (SoCs) is a critical research problem. This faces the following challenges: (1) design complexity and heterogeneity, (2) semantics gaps between software and hardware, (3) concurrency between firmware/hardware and between Intellectual Property Blocks (IPs), and (4) expensive bit-precise reasoning. In this paper, we present a co-verification methodology to address these challenges. We model hardware using the Instruction-Level Abstraction (ILA), capturing firmware-visible behavior at the architecture level. This enables integrating hardware behavior with firmware in each IP into a single thread. The co-verification with multiple firmware across IPs is formulated as a multi-threaded program verification problem, for which we leverage software verification techniques. We also propose an optimization using abstraction to prevent expensive bit-precise reasoning. The evaluation of our methodology on an industry SoC Secure Boot design demonstrates its applicability in SoC security verification.

Journal ArticleDOI
TL;DR: This paper presents a fully integrated system-on-a-chip for real-time terahertz super-resolution near-field imaging that features both an analog readout mode and a lock-in-amplifier-based digital readout mode, and demonstrates real-time imaging of samples including a biometric human fingerprint.
Abstract: This paper presents a fully integrated system-on-a-chip for real-time terahertz super-resolution near-field imaging. The chip consists of 128 sensing pixels with individual cross-bridged double 3-D split-ring resonators arranged in a 3.2 mm long 2×64 1-D array. It is implemented in 0.13-μm SiGe bipolar complementary metal–oxide–semiconductor technology and operated at around 550 GHz. All the functions, including sensor illumination, near-field sensing, and detection, are co-integrated with a readout integrated circuit for real-time image acquisition. The pixels exhibit a permittivity-based imaging contrast with a worst-case estimated relative permittivity uncertainty of 0.33 and 10–12-μm spatial resolution. The sensor illumination is provided by on-chip oscillators feeding four-way equal power divider networks to enable an effective pixel pitch of 25 μm and a dense fill factor of 48% for the 1-D sensing area. The oscillators are equipped with electronic chopping to avoid 1/f-noise-related desensitization of the SiGe-heterojunction bipolar transistor power detectors integrated at each pixel. The chip features both an analog readout mode and a lock-in-amplifier-based digital readout mode. In the analog readout mode, the measured dynamic range (DR) is 63.8 dB for a 1-ms integration time at an external lock-in amplifier. The digital readout mode achieves a DR of 38.5 dB at 28 f/s. The chip consumes 37–104 mW of power and is packaged into a compact imaging module. This paper further demonstrates real-time acquisition of 2-D terahertz super-resolution images of a nickel mesh with 50-μm feature size, as well as a biometric human fingerprint.

Journal ArticleDOI
TL;DR: A ring-shaped switched-capacitor dc-dc converter that has a unity-gain frequency a few times higher than its switching frequency is introduced with comprehensive considerations, and optimizing the number of time-interleaving phases (power cells) is detailed.
Abstract: On-chip power supply distribution faces the challenges of high and fast-changing load current, limited metal layers and decoupling capacitors, efficiency, and thermal issues. This paper mainly discusses system-level design considerations of both distributed and centralized fully integrated voltage regulators. In particular, a ring-shaped switched-capacitor dc-dc converter that has a unity-gain frequency a few times higher than its switching frequency is introduced with comprehensive considerations, and optimizing the number of time-interleaving phases (power cells) is detailed. Design issues such as on-chip power-rail routing parasitics, input capacitor and input ripple, and phase mismatch between power cells are addressed. Furthermore, a couple of possible extensions of the converter-ring architecture are proposed, which include power management of the active-matrix light-emitting diode array in a microdisplay system, cascading multiple nMOS-LDO regulators for granular power, and on-chip power converter grid with flipped-chip packaging.

Journal ArticleDOI
TL;DR: Simulation results reveal that the proposed mapping algorithm greatly improves the reliability of the system and reduces the communication energy.
Abstract: Extensive research has been conducted on task scheduling and mapping on multi-processor systems on chip. The mapping strategy on a network on chip (NoC) has a huge effect on communication energy and performance. This paper proposes an efficient core mapping for NoC-based architectures, which focuses on energy-aware and reliability-aware mapping issues and considers new applications with insignificant inter-processor communication overhead to be added to the system. This methodology was assessed by applying it to various benchmark applications. Simulation results reveal that the proposed mapping algorithm greatly improves the reliability of the system and reduces the communication energy.

Journal ArticleDOI
TL;DR: This paper reviews the gas sensing capabilities of the sensor and summarizes achievements in modeling relevant materials and processes for these emerging devices, highlighting the importance of a thorough understanding of the electro-thermal-mechanical problem and how it links to the operation of the sensing film.
Abstract: The growing demand for the integration of functionalities on a single device is peaking with the rise of IoT. We are near to having multiple sensors in portable and wearable technologies, made possible through integration of sensor fabrication with mature CMOS manufacturing. In this paper we address semiconductor metal oxide sensors, which have the potential to become a universal sensor since they can be used in many emerging applications. This review concentrates on the gas sensing capabilities of the sensor and summarizes achievements in modeling relevant materials and processes for these emerging devices. Recent advances in sensor fabrication and the modeling thereof are further discussed, followed by a description of the essential electro-thermal-mechanical analyses employed to estimate the devices' mechanical reliability. We further address advances made in understanding the sensing layer, which can be modeled similarly to a transistor, where instead of a gate contact, the ionosorbed gas ions create a surface potential, changing the film's conduction. Due to the intricate nature of the porous sensing films and the reception-transduction mechanism, many added complexities must be addressed. The importance of a thorough understanding of the electro-thermal-mechanical problem and how it links to the operation of the sensing film is thereby highlighted.

Journal ArticleDOI
TL;DR: The research article presents the use of machine learning techniques to predict the FPGA resource utilization for NoC in advance by taking into account known hardware design parameters, memory utilization and timing parameters such as minimum and maximum period, frequency support etc.
Abstract: Network on chip (NoC) is the solution to the problems of larger systems on chip and bus-based communication systems. NoC provides a scalable, highly reliable, and modular approach for on-chip communication and related problems. Wireless communication technologies such as IEEE 802.15.4 Zigbee follow mesh, star, and cluster-tree topologies. The paper focuses on the development of machine learning models for the design and FPGA synthesis of mesh, ring, and fat-tree NoCs for different cluster sizes (N = 2, 4, 8, 16, 32, 64, 128, and 256). The fat-tree based topologies incorporate more links near the root of the tree, in order to fulfill the requirement for higher communication demand closer to the root of the tree as compared to its leaves. It is an indirect topology in which not all routers are identical in terms of the number of ports connecting to other routers or elements in the network. The research article presents the use of machine learning techniques to predict the FPGA resource utilization for NoC in advance. The present study helps in NoC chip planning before designing the chip itself by taking into account known hardware design parameters, memory utilization, and timing parameters such as minimum and maximum period, supported frequency, etc. The machine learning is carried out using multiple linear regression, decision tree regression, and random forest regression, which estimate the design metrics with good accuracy and performance. The interprocess communication among nodes is verified using a Virtex-5 FPGA, in which data flows in packets that can vary up to 'n' bits. The designs are developed in Xilinx ISE 14.2 and simulated in Modelsim 10.1b with the help of the VHDL programming language. The developed model has been validated and has performed well on independent test data.
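
The regression-based prediction flow described above can be prototyped directly with scikit-learn using the three regressor families named in the abstract (multiple linear regression, decision tree, random forest). In the sketch below the feature encoding and the synthetic LUT counts are placeholders, since the paper's real training data comes from Xilinx ISE synthesis runs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Toy design points: [cluster size N, topology id (0=mesh, 1=ring, 2=fat-tree)]
X = np.array([[n, t] for t in (0, 1, 2) for n in (2, 4, 8, 16, 32, 64, 128, 256)])
# Placeholder LUT counts standing in for real synthesis results.
rng = np.random.default_rng(42)
y = 150 * X[:, 0] * (1 + 0.3 * X[:, 1]) + rng.normal(0, 200, len(X))

models = {
    "linear":        LinearRegression(),
    "decision tree": DecisionTreeRegressor(max_depth=4),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
for name, model in models.items():
    model.fit(X, y)
    pred = model.predict([[48, 0]])[0]    # unseen 48-node mesh design point
    print(f"{name:14s} predicted LUTs: {pred:,.0f}")
```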

Journal ArticleDOI
TL;DR: The article demonstrates the usefulness of heterogeneous System on Chip (SoC) devices in smart cameras used in intelligent transportation systems (ITS) and other advanced embedded image processing, analysis and recognition applications.
Abstract: The article demonstrates the usefulness of heterogeneous System on Chip (SoC) devices in smart cameras used in intelligent transportation systems (ITS). In a compact, energy efficient system the following exemplary algorithms were implemented: vehicle queue length estimation, vehicle detection, vehicle counting and speed estimation (using multiple virtual detection lines), as well as vehicle type (local binary features and SVM classifier) and colour (k-means classifier and YCbCr colourspace analysis) recognition. The solution exploits the hardware–software architecture, i.e. the combination of reconfigurable resources and the efficient ARM processor. Most of the modules were implemented in hardware, using Verilog HDL, taking full advantage of the possible parallelization and pipeline, which allowed to obtain real-time image processing. The ARM processor is responsible for executing some parts of the algorithm, i.e. high-level image processing and analysis, as well as for communication with the external systems (e.g. traffic lights controllers). The demonstrated results indicate that modern SoC systems are a very interesting platform for advanced ITS systems and other advanced embedded image processing, analysis and recognition applications.

Journal ArticleDOI
TL;DR: A generic methodology, which leverages the burst-mode communication protocol, to detect intrusions at runtime; the approach is validated by applying it to AES Trojan benchmarks that utilize an intermodule interface to communicate with other modules in the system on chip (SoC).

Proceedings ArticleDOI
05 Nov 2018
TL;DR: This work proposes AxBA, an approximate bus architecture framework that is aware of the data amenable to approximations and seamlessly compresses/decompresses the corresponding transactions on the bus without requiring any changes to pre-designed masters and slaves.
Abstract: Modern computing platforms expend significant amounts of time and energy in transmitting data across on-chip and off-chip interconnects. This challenge is exacerbated in prevalent data-intensive workloads such as machine learning, data analytics and search. However, these workloads also present a unique opportunity in the form of intrinsic resilience to approximations in computations and data. We explore approximate compression of communication traffic, which leverages this intrinsic resilience to improve communication bandwidth and reduce the energy consumed by interconnects. Specifically, we propose AxBA, an approximate bus architecture framework that is aware of the data amenable to approximations and seamlessly compresses/decompresses the corresponding transactions on the bus without requiring any changes to pre-designed masters and slaves. AxBA uses a lightweight compression scheme based on approximate deduplication, which is suitable for the tight latency constraints imposed by bus-based interconnects. To facilitate software development on AxBA-based systems, we introduce a software interface that enables programmers to identify regions of the system address space that are amenable to approximations. We also propose a run-time quality monitoring framework that automatically determines the error constraints for the identified regions such that a specified application-level quality is maintained. We demonstrate the feasibility of the proposed concepts by realizing a prototype AxBA system on a Cyclone-IV FPGA development board using an Intel Nios II processor-based SoC. Across a suite of six machine learning benchmarks, AxBA obtains an average improvement in system performance of 29% and a 25% reduction in system-level energy for a 0.5% loss in application-level quality.
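
A minimal software version of the approximate-deduplication idea reads as follows: each word of an approximable transaction is compared against the previously transmitted word, and if it differs by no more than an error threshold, only a short "repeat" flag is sent instead of the full word. The word width, threshold, and framing below are illustrative assumptions; AxBA's actual hardware scheme and its run-time quality monitoring are more involved.

```python
def approx_dedup_compress(words, threshold=4):
    """Replace words close to the previously transmitted word with a repeat flag.
    Returns the token stream and the compressed size in bits (32-bit words assumed)."""
    tokens, last, bits = [], None, 0
    for w in words:
        if last is not None and abs(w - last) <= threshold:
            tokens.append(("DUP",))           # 1-bit flag on the bus
            bits += 1
        else:
            tokens.append(("RAW", w))         # flag plus full 32-bit word
            bits += 1 + 32
            last = w
    return tokens, bits

def approx_dedup_decompress(tokens):
    out, last = [], None
    for t in tokens:
        last = last if t[0] == "DUP" else t[1]
        out.append(last)
    return out

pixels = [100, 101, 103, 180, 182, 181, 90, 91]     # e.g. smooth image data
tokens, bits = approx_dedup_compress(pixels)
print(approx_dedup_decompress(tokens), f"{bits} bits vs {32 * len(pixels)} bits")
```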

Journal ArticleDOI
TL;DR: A tiny, energy-efficient, and domain-specific manycore accelerator referred to as power-efficient nanoclusters (PENC) is proposed to map and execute the kernels of these applications; results show that the PENC is able to reduce energy consumption for both DSP and ML kernels when optimally parallelized.
Abstract: Wearable personalized health monitoring systems can offer a cost-effective solution for human health care. These systems must constantly monitor patients' physiological signals and provide highly accurate, and quick processing and delivery of the vast amount of data within a limited power and area footprint. These personalized biomedical applications require sampling and processing multiple streams of physiological signals with a varying number of channels and sampling rates. The processing typically consists of feature extraction, data fusion, and classification stages that require a large number of digital signal processing (DSP) and machine learning (ML) kernels. In response to these requirements, in this paper, a tiny, energy-efficient, and domain-specific manycore accelerator referred to as power-efficient nanoclusters (PENC) is proposed to map and execute the kernels of these applications. Simulation results show that the PENC is able to reduce energy consumption by up to 80% and 25% for DSP and ML kernels, respectively, when optimally parallelized. In addition, we fully implemented three compute-intensive personalized biomedical applications, namely, multichannel seizure detection, multiphysiological stress detection, and standalone tongue drive system (sTDS), to evaluate the proposed manycore performance relative to commodity embedded CPU, graphical processing unit (GPU), and field-programmable gate array (FPGA)-based implementations. For these three case studies, the energy consumption and the performance of the proposed PENC manycore, when acting as an accelerator along with an Intel Atom processor as a host, are compared with existing commercial off-the-shelf general-purpose, customizable, and programmable embedded platforms, including the Intel Atom, Xilinx Artix-7 FPGA, and NVIDIA TK1 ARM-A15 and K1 GPU system on a chip. For these applications, the PENC manycore is able to significantly improve throughput and energy efficiency by up to 1872x and 276x, respectively. For the most computationally intensive application of seizure detection, the PENC manycore is able to achieve a throughput of 15.22 giga-operations-per-second (GOPs), which is a 14x improvement in throughput over the custom FPGA solution. For stress detection, the PENC achieves a throughput of 21.36 GOPs and an energy efficiency of 4.23 GOP/J, which are 14.87x and 2.28x better than the FPGA implementation, respectively. For the sTDS application, the PENC improves throughput by 5.45x and energy efficiency by 2.37x over the FPGA implementation.