
Showing papers in "IEEE Transactions on Computers in 2021"


Journal ArticleDOI
TL;DR: In this paper, a semi-asynchronous federated learning (SAFA) protocol is proposed to mitigate the impacts of stragglers, crashes and model staleness in order to boost efficiency and improve the quality of the global model.
Abstract: Federated learning (FL) has attracted increasing attention as a promising approach to driving a vast number of end devices with artificial intelligence. However, it is very challenging to guarantee the efficiency of FL considering the unreliable nature of end devices while the cost of device-server communication cannot be neglected. In this article, we propose SAFA, a semi-asynchronous FL protocol, to address the problems in federated learning such as low round efficiency and poor convergence rate in extreme conditions (e.g., clients dropping offline frequently). We introduce novel designs in the steps of model distribution, client selection and global aggregation to mitigate the impacts of stragglers, crashes and model staleness in order to boost efficiency and improve the quality of the global model. We have conducted extensive experiments with typical machine learning tasks. The results demonstrate that the proposed protocol is effective in terms of shortening federated round duration, reducing local resource wastage, and improving the accuracy of the global model at an acceptable communication cost.
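As a hedged illustration of the kind of staleness-aware aggregation a semi-asynchronous protocol performs (not SAFA's exact rule), the sketch below drops cached client updates that are too stale and weights the remaining ones by freshness; the tolerance parameter `tau` and the weighting form are assumptions for illustration.

```python
import numpy as np

def safa_style_aggregate(global_model, client_updates, current_round, tau=2):
    """Hedged sketch of semi-asynchronous aggregation.

    client_updates: list of (model_vector, round_trained) tuples, possibly stale.
    Updates older than `tau` rounds are dropped; fresher updates get larger weight.
    The staleness weighting here is illustrative, not the paper's exact rule.
    """
    weighted, weights = [], []
    for model_vec, round_trained in client_updates:
        staleness = current_round - round_trained
        if staleness > tau:          # too stale: skip this cached update
            continue
        w = 1.0 / (1.0 + staleness)  # fresher updates count more (assumed form)
        weighted.append(w * model_vec)
        weights.append(w)
    if not weights:                  # no usable updates this round
        return global_model
    return np.sum(weighted, axis=0) / np.sum(weights)

# toy usage: three clients, one of them too stale to be counted
g = np.zeros(4)
updates = [(np.ones(4), 10), (2 * np.ones(4), 9), (3 * np.ones(4), 6)]
print(safa_style_aggregate(g, updates, current_round=10))
```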

150 citations


Journal ArticleDOI
TL;DR: A quantum algorithm is presented and analyzed to estimate credit risk more efficiently than Monte Carlo simulations can do on classical computers and how this translates into an expected runtime under reasonable assumptions on future fault-tolerant quantum hardware is analyzed.
Abstract: We present and analyze a quantum algorithm to estimate credit risk more efficiently than Monte Carlo simulations can do on classical computers. More precisely, we estimate the economic capital requirement, i.e. the difference between the Value at Risk and the expected value of a given loss distribution. The economic capital requirement is an important risk metric because it summarizes the amount of capital required to remain solvent at a given confidence level. We implement this problem for a realistic loss distribution and analyze its scaling to a realistic problem size. In particular, we provide estimates of the total number of required qubits, the expected circuit depth, and how this translates into an expected runtime under reasonable assumptions on future fault-tolerant quantum hardware.
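For reference, the risk metric the abstract defines (economic capital = Value at Risk minus expected loss) can be estimated classically with a Monte Carlo baseline such as the sketch below; the toy default model, exposures, and confidence level are assumptions for illustration.

```python
import numpy as np

def economic_capital(losses, alpha=0.999):
    """Economic capital = VaR_alpha(loss) - E[loss], estimated from loss samples."""
    var = np.quantile(losses, alpha)       # Value at Risk at confidence level alpha
    return var - losses.mean()

# toy portfolio: K obligors, each defaults independently with probability p_k
rng = np.random.default_rng(0)
K, n_samples = 20, 200_000
p = rng.uniform(0.01, 0.05, size=K)        # default probabilities (assumed)
exposure = rng.uniform(1.0, 10.0, size=K)  # loss given default (assumed)
defaults = rng.random((n_samples, K)) < p
losses = defaults.astype(float) @ exposure
print(f"Economic capital estimate: {economic_capital(losses):.2f}")
```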

86 citations


Journal ArticleDOI
TL;DR: This work introduces an improved double-layer Stackelberg game model to describe the cloud-edge-client collaboration and proposes a novel pricing prediction algorithm based on double-label Radius K-nearest Neighbors, thereby reducing the number of invalid games to accelerate the game convergence.
Abstract: Nowadays, IoT systems can better satisfy the service requirements of users by effectively utilizing edge computing resources. Designing an appropriate pricing scheme is critical for users to obtain the optimal computing resources at a reasonable price and for service providers to maximize profits. This problem is complicated by incomplete information. The state-of-the-art solutions focus on the pricing game between a single service provider and users, ignoring the competition among multiple edge service providers. To address this challenge, we design an edge-intelligent hierarchical dynamic pricing mechanism based on cloud-edge-client collaboration. We introduce an improved double-layer Stackelberg game model to describe the cloud-edge-client collaboration. Technically, we propose a novel pricing prediction algorithm based on double-label Radius K-nearest Neighbors, thereby reducing the number of invalid games to accelerate the game convergence. The experimental results show that our proposed mechanism effectively improves the quality of service for users and realizes the maximum benefit equilibrium for service providers, compared with the traditional pricing scheme. Our proposed mechanism is highly suitable for IoT applications (e.g., intelligent agriculture or the Internet of Vehicles), where multiple edge service providers compete for resource allocation.
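As a rough illustration of the pricing-prediction step (not the paper's double-label variant), the sketch below uses scikit-learn's plain radius-neighbors regressor to predict a starting price from similar past game states, which is the kind of shortcut that can skip invalid early game rounds; the features and history are made-up examples.

```python
import numpy as np
from sklearn.neighbors import RadiusNeighborsRegressor

# toy history of game states -> converged price (features and values are assumed)
# features: [user demand, competitor price]; target: this provider's equilibrium price
X_hist = np.array([[10, 1.0], [12, 1.1], [20, 2.0], [22, 2.1], [30, 3.0]])
y_price = np.array([1.2, 1.3, 2.2, 2.3, 3.1])

model = RadiusNeighborsRegressor(radius=3.0, weights="distance")
model.fit(X_hist, y_price)

# predict a starting price for a new game state from nearby historical states
new_state = np.array([[21, 2.05]])
print("predicted starting price:", model.predict(new_state))
```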

82 citations


Journal ArticleDOI
TL;DR: This work proposes DORY (Deployment Oriented to memoRY) – an automatic tool to deploy DNNs on low cost MCUs with typically less than 1MB of on-chip SRAM memory and releases all the developments – the DORY framework, the optimized backend kernels, and the related heuristics – as open-source software.
Abstract: The deployment of Deep Neural Networks (DNNs) on end-nodes at the extreme edge of the Internet-of-Things is a critical enabler to support pervasive Deep Learning-enhanced applications. Low-cost MCU-based end-nodes have limited on-chip memory and often replace caches with scratchpads, to reduce area overheads and increase energy efficiency – requiring explicit DMA-based memory transfers between different levels of the memory hierarchy. Mapping modern DNNs on these systems requires aggressive topology-dependent tiling and double-buffering. In this work, we propose DORY (Deployment Oriented to memoRY) – an automatic tool to deploy DNNs on low-cost MCUs with typically less than 1MB of on-chip SRAM memory. DORY abstracts tiling as a Constraint Programming (CP) problem: it maximizes L1 memory utilization under the topological constraints imposed by each DNN layer. Then, it generates ANSI C code to orchestrate off- and on-chip transfers and computation phases. Furthermore, to maximize speed, DORY augments the CP formulation with heuristics promoting performance-effective tile sizes. As a case study for DORY, we target GreenWaves Technologies GAP8, one of the most advanced parallel ultra-low power MCU-class devices on the market. On this device, DORY achieves up to 2.5× better MAC/cycle than the GreenWaves proprietary software solution and 18.1× better than the state-of-the-art result on an STM32-H743 MCU on single layers. Using our tool, GAP-8 can perform end-to-end inference of a 1.0-MobileNet-128 network consuming just 63 pJ/MAC on average @ 4.3 fps – 15.4× better than an STM32-H743. We release all our developments – the DORY framework, the optimized backend kernels, and the related heuristics – as open-source software.
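A minimal sketch in the spirit of the constraint-based tiling the abstract describes: enumerate tile sizes for one convolutional layer and keep the candidate that maximizes L1 occupancy without exceeding the buffer budget. The memory model, buffer size, and double-buffering factor are illustrative assumptions, not DORY's actual CP formulation.

```python
def best_tile(C_in, C_out, H, W, kernel=3, l1_bytes=64 * 1024, double_buffer=2):
    """Pick (tile_h, tile_w, tile_cout) maximizing L1 use without overflowing it.

    Memory model (bytes, 8-bit data) is a simplification of a real tiler:
    input tile + weight tile + output tile, times a double-buffering factor.
    """
    best, best_use = None, 0
    for th in range(1, H + 1):
        for tw in range(1, W + 1):
            for tco in range(1, C_out + 1):
                in_tile = C_in * (th + kernel - 1) * (tw + kernel - 1)
                w_tile = C_in * tco * kernel * kernel
                out_tile = tco * th * tw
                use = double_buffer * (in_tile + w_tile + out_tile)
                if use <= l1_bytes and use > best_use:
                    best, best_use = (th, tw, tco), use
    return best, best_use

tile, used = best_tile(C_in=32, C_out=64, H=32, W=32)
print(tile, f"{used / 1024:.1f} KiB of L1 used")
```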

69 citations


Journal ArticleDOI
TL;DR: ZigZag extends the common DSE with uneven mapping opportunities and smart mapping search strategies, opening up a whole new space for DSE, and thus better design points are found by ZigZag compared to other SotAs.
Abstract: Building efficient embedded deep learning systems requires a tight co-design between DNN algorithms, hardware, and algorithm-to-hardware mapping, a.k.a. dataflow. However, owing to the large joint design space, finding an optimal solution through physical implementation becomes infeasible. To tackle this problem, several design space exploration (DSE) frameworks have emerged recently, yet they either suffer from long runtimes or a limited exploration space. This article introduces ZigZag, a rapid DSE framework for DNN accelerator architecture and mapping. ZigZag extends the common DSE with uneven mapping opportunities and smart mapping search strategies. Uneven mapping decouples operands (W/I/O), memory hierarchy, and mappings (temporal/spatial), opening up a whole new space for DSE, and thus better design points are found by ZigZag compared to other SotAs. For this, ZigZag uses an enhanced nested-for-loop format as a uniform representation to integrate algorithm, accelerator, and algorithm-to-accelerator mapping. ZigZag consists of three key components: 1) an analytical energy-performance-area Hardware Cost Estimator, 2) two Mapping Search Engines that support spatial and temporal even/uneven mapping search, and 3) an Architecture Generator that auto-explores the wide memory hierarchy design space. Benchmarking experiments against published works, in-house accelerator, and existing DSE frameworks, together with three case studies, show the reliability and capability of ZigZag. Up to 64 percent more energy-efficient solutions are found compared to other SotAs, due to ZigZag's uneven mapping capabilities.

68 citations


Journal ArticleDOI
TL;DR: This paper first extends the simulatability framework of Belaïd et al. (EUROCRYPT 2016) and proves that a compositional strategy that is correct without glitches remains valid with glitches, and proves the first masked gadgets that enable trivial composition with glitches at arbitrary orders.
Abstract: The design of glitch-resistant higher-order masking schemes is an important challenge in cryptographic engineering. A recent work by Moos et al. (CHES 2019) showed that most published schemes (and all efficient ones) exhibit local or composability flaws at high security orders, leaving a critical gap in the literature on hardware masking. In this article, we first extend the simulatability framework of Belaïd et al. (EUROCRYPT 2016) and prove that a compositional strategy that is correct without glitches remains valid with glitches. We then use this extended framework to prove the first masked gadgets that enable trivial composition with glitches at arbitrary orders. We show that the resulting “Hardware Private Circuits” approach the implementation efficiency of previous (flawed) schemes. We finally investigate how trivial composition can serve as a basis for a tool that allows verifying full masked hardware implementations (e.g., of complete block ciphers) at any security order from their HDL code. As side products, we improve the randomness complexity of the best published refreshing gadgets, show that some S-box representations allow latency reductions and confirm practical claims based on implementation results.

61 citations


Journal ArticleDOI
TL;DR: A challenge self-obfuscation structure (CSoS) which employs previous challenges combined with keys or random numbers to obfuscate the current challenge for the VOS-based authentication to resist ML attacks is proposed.
Abstract: It is a challenging task to deploy lightweight security protocols in resource-constrained IoT applications. A hardware-oriented lightweight authentication protocol based on device signatures generated during voltage over-scaling (VOS) was recently proposed to address this issue. VOS-based authentication employs computation units such as adders to generate process-variation-dependent errors, which are combined with secret keys to create a two-factor authentication protocol. In this paper, machine learning (ML)-based modeling attacks to break such authentication are presented. We also propose a challenge self-obfuscation structure (CSoS), which employs previous challenges combined with keys or random numbers to obfuscate the current challenge, enabling VOS-based authentication to resist ML attacks. Experimental results show that ANN, RNN and CMA-ES can clone the challenge-response behavior of VOS-based authentication with up to 99.65% prediction accuracy, while the prediction accuracy is less than 51.2% after deploying our proposed ML-resilient technique. In addition, our proposed CSoS also shows good obfuscation ability for strong PUFs. Experimental results show that the modeling accuracies are below 54% when 10^6 challenge-response pairs (CRPs) are collected to model the CSoS-based Arbiter PUF with ML attacks such as LR, SVM, ANN, RNN and CMA-ES.
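A hedged sketch of the general challenge self-obfuscation idea: the challenge actually applied internally is derived from the current challenge, the previous challenge, and a secret key, so an attacker who only observes external challenge-response pairs faces a harder modeling task. The hash-and-XOR mixing below is an illustrative assumption, not the paper's CSoS circuit.

```python
import hashlib

class CSoSObfuscator:
    """Illustrative challenge self-obfuscation (not the paper's exact design)."""

    def __init__(self, key: bytes, challenge_bytes: int = 8):
        self.key = key
        self.prev = bytes(challenge_bytes)  # previous challenge, all zeros at start

    def obfuscate(self, challenge: bytes) -> bytes:
        # Mix the current challenge with the previous challenge and the key via a
        # hash, then remember the current challenge for the next round.
        mixed = hashlib.sha256(self.prev + self.key + challenge).digest()
        internal = bytes(a ^ b for a, b in zip(challenge, mixed))
        self.prev = challenge
        return internal

obf = CSoSObfuscator(key=b"secret-key")
print(obf.obfuscate(b"\x01" * 8).hex())  # internal challenge seen by the VOS unit / PUF
print(obf.obfuscate(b"\x01" * 8).hex())  # same external challenge, different internal one
```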

54 citations


Journal ArticleDOI
TL;DR: This work introduces LIME, a better approach for modeling dynamic and heterogeneous information networks and shows, for the first time, how an effective incremental learning approach can be developed – with the help of RsNN, the authors' cuboid structure, and a set of novel optimization techniques – to allow a learning framework to quickly and efficiently adapt to a constantly evolving network.
Abstract: Understanding the interconnected relationships of large-scale information networks like social, scholar and Internet of Things networks is vital for tasks like recommendation and fraud detection. The vast majority of real-world networks are inherently heterogeneous and dynamic, containing many different types of nodes and edges and changing drastically over time. The dynamicity and heterogeneity make it extremely challenging to reason about the network structure. Unfortunately, existing approaches are inadequate in modeling real-life networks as they require extensive computational resources and do not scale well to large, dynamically evolving networks. We introduce LIME, a better approach for modeling dynamic and heterogeneous information networks. LIME is designed to extract high-quality network representations with significantly lower memory resources and computational time than the state-of-the-art. Unlike prior work that uses a vector to encode each network node, we exploit the semantic relationships among network nodes to encode multiple nodes with similar semantics in shared vectors. We evaluate LIME by applying it to three representative network-based tasks (node classification, node clustering and anomaly detection) on three large-scale datasets. Our extensive experiments demonstrate that LIME not only reduces the memory footprint by over 80% and the computational time by over 2x when learning network representations but also delivers comparable performance for downstream processing tasks.

54 citations


Journal ArticleDOI
TL;DR: This article proposes an improved logarithmic multiplier (ILM) that, unlike existing designs, rounds both inputs to their nearest powers of two by using a proposed nearest-one detector (NOD) circuit.
Abstract: Multiplication is the most resource-hungry operation in neural networks (NNs). Logarithmic multipliers (LMs) simplify multiplication to shift and addition operations and thus reduce the energy consumption. Since implementing the logarithm in a compact circuit often introduces approximation, some accuracy loss is inevitable in LMs. However, this inaccuracy accords with the inherent error tolerance of NNs and their associated applications. This article proposes an improved logarithmic multiplier (ILM) that, unlike existing designs, rounds both inputs to their nearest powers of two by using a proposed nearest-one detector (NOD) circuit. Considering that the output of the NOD uses a one-hot representation, some entries in the truth table of a conventional adder cannot occur. Hence, a compact adder is designed for the reduced truth table. The 8×8 ILM achieves up to 17.48 percent saving in power consumption compared to a recent LM in the literature while being almost 8 percent more accurate. Moreover, the evaluation of the ILM for two benchmark NN workloads shows up to 21.85 percent reduction in energy consumption compared to the NNs implemented with other LMs. Interestingly, using the ILM increases the classification accuracy of the considered NNs by up to 1.4 percent compared to an NN implementation that uses exact multipliers.
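A behavioral sketch of the rounding step described in the abstract: each operand is rounded to its nearest power of two, so the product reduces to a shift. The tie-handling and the software-only model are assumptions; the paper's actual contributions (the NOD circuit and the compact adder) are not modeled here.

```python
def nearest_pow2(x: int) -> int:
    """Round a positive integer to its nearest power of two (ties round up, assumed)."""
    if x <= 1:
        return 1
    hi = 1 << (x.bit_length() - 1)        # largest power of two <= x
    lo_dist, hi_dist = x - hi, 2 * hi - x
    return hi if lo_dist < hi_dist else 2 * hi

def ilm_style_multiply(a: int, b: int) -> int:
    """Approximate a*b by multiplying the nearest powers of two (a single shift)."""
    return nearest_pow2(a) * nearest_pow2(b)

for a, b in [(100, 200), (57, 93), (15, 17)]:
    approx, exact = ilm_style_multiply(a, b), a * b
    print(a, b, approx, f"error {100 * (approx - exact) / exact:+.1f}%")
```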

53 citations


Journal ArticleDOI
TL;DR: EnGN as discussed by the authors proposes a specialized accelerator architecture to accelerate the three key stages of GNN propagation, which are abstracted as common computing patterns shared by typical GNNs, and uses a graph tiling strategy to fit large graphs into EnGN and make good use of the hierarchical on-chip buffers through adaptive computation reordering and tile scheduling.
Abstract: Graph neural networks (GNNs) have emerged as a powerful approach to processing non-Euclidean data structures and have proven effective in various application domains such as social networks and e-commerce. The graph data maintained in real-world systems can be extremely large and sparse; thus, employing GNNs to process them requires substantial computational and memory overhead, which induces considerable energy and resource costs on CPUs and GPUs. In this article, we present a specialized accelerator architecture, EnGN, to enable high-throughput and energy-efficient processing of large-scale GNNs. The proposed EnGN is designed to accelerate the three key stages of GNN propagation, which are abstracted as common computing patterns shared by typical GNNs. To support the key stages simultaneously, we propose the ring-edge-reduce (RER) dataflow, which tames the poor locality of sparsely and randomly connected vertices, and the RER PE array to realize the RER dataflow. In addition, we utilize a graph tiling strategy to fit large graphs into EnGN and make good use of the hierarchical on-chip buffers through adaptive computation reordering and tile scheduling. Overall, EnGN achieves performance speedups of 1802.9X, 19.75X, and 2.97X and energy efficiency improvements of 1326.35X, 304.43X, and 6.2X on average compared to a CPU, a GPU, and the state-of-the-art GCN accelerator HyGCN, respectively.

49 citations


Journal ArticleDOI
TL;DR: By diagonalizing the transform matrix of the map, the explicit formulation of any iteration of the generalized Cat map is given, and its real graph (cycle) structure in any binary arithmetic domain is disclosed.
Abstract: Chaotic dynamics is an important source for generating pseudorandom binary sequences (PRBS). Much effort has been devoted to obtaining the period distribution of the generalized discrete Arnold's Cat map in various domains using all kinds of theoretical methods, including Hensel's lifting approach. Diagonalizing the transform matrix of the map, this paper gives the explicit formulation of any iteration of the generalized Cat map. Then, its real graph (cycle) structure in any binary arithmetic domain is disclosed. The subtle rules on how the cycles (themselves and their distribution) change with the arithmetic precision e are elaborately investigated and proved. The regular and beautiful patterns of the Cat map demonstrated on a computer adopting fixed-point arithmetic are rigorously proved and experimentally verified. The results can serve as a benchmark for studying the dynamics of variants of the Cat map in any domain. In addition, the methodology can be used to evaluate the randomness of PRBS generated by iterating any other maps.
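As a small illustration of the kind of cycle-structure experiment the abstract describes, the sketch below iterates the generalized Cat map in a fixed-point domain of precision e and reports the length of the cycle containing a given state; the coefficients a and b and the precision are arbitrary example values, not the paper's test cases.

```python
def cat_map_step(x, y, a, b, N):
    """One iteration of the generalized Arnold Cat map modulo N (determinant 1)."""
    return (x + a * y) % N, (b * x + (a * b + 1) * y) % N

def cycle_length(x0, y0, a, b, e):
    """Length of the cycle containing (x0, y0) in the 2^e fixed-point domain.

    The map is a permutation of (Z_{2^e})^2, so every state lies on a cycle
    and this loop always terminates.
    """
    N = 1 << e
    x, y = cat_map_step(x0, y0, a, b, N)
    steps = 1
    while (x, y) != (x0, y0):
        x, y = cat_map_step(x, y, a, b, N)
        steps += 1
    return steps

# example: cycle lengths of a few states for a = b = 1, e = 8 (N = 256)
for start in [(1, 0), (3, 5), (17, 100)]:
    print(start, cycle_length(*start, a=1, b=1, e=8))
```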

Journal ArticleDOI
TL;DR: NACIM as mentioned in this paper proposes a cross-layer exploration framework, which jointly explores device, circuit and architecture design space and takes device variation into consideration to find the most robust neural architectures, coupled with the most efficient hardware design.
Abstract: Co-exploration of neural architectures and hardware design is promising due to its capability to simultaneously optimize network accuracy and hardware efficiency. However, state-of-the-art neural architecture search algorithms for the co-exploration are dedicated to the conventional von Neumann computing architecture, whose performance is heavily limited by the well-known memory wall. In this article, we are the first to bring the computing-in-memory architecture, which can easily transcend the memory wall, to interplay with the neural architecture search, aiming to find the most efficient neural architectures with high network accuracy and maximized hardware efficiency. Such a novel combination creates opportunities to boost performance but also brings several challenges: the optimization space spans multiple design layers from device type and circuit topology to neural architecture, and the presence of device variation may drastically degrade the neural network performance. To address these challenges, we propose a cross-layer exploration framework, namely NACIM, which jointly explores the device, circuit and architecture design space and takes device variation into consideration to find the most robust neural architectures, coupled with the most efficient hardware design. Experimental results demonstrate that NACIM can find a robust neural network with 0.45 percent accuracy loss in the presence of device variation, compared with a 76.44 percent loss from the state-of-the-art NAS without consideration of variation; in addition, NACIM achieves an energy efficiency of up to 16.3 TOPs/W, 3.17× higher than the state-of-the-art NAS.

Journal ArticleDOI
TL;DR: A double offloading framework is proposed to simulate the offloading process in a real edge scenario consisting of different edge servers and devices, and a deep reinforcement learning (DRL) algorithm named asynchronous advantage actor-critic (A3C) is utilized as the offloading decision-making strategy to balance the workload of edge servers and reduce the overhead in terms of energy and time.
Abstract: Currently, huge amounts of data are produced by edge devices. Considering the heavy burden on network bandwidth and the service delay requirements of delay-sensitive applications, processing the data at the network edge is a natural choice. However, edge devices such as smart wearables and connected and autonomous vehicles usually have limited computational capacity and energy, which influences the quality of service. As an effective and efficient strategy, offloading is widely used to address this issue. But when facing device heterogeneity and increasing task complexity, service quality degradation and reduced resource utility often occur due to unreasonable task distribution. Since conventional simplex offloading strategies show limited performance in complex environments, we are motivated to design a dynamic regional resource scheduling framework that works effectively while taking different indexes into consideration. Thus, in this article we first propose a double offloading framework to simulate the offloading process in a real edge scenario consisting of different edge servers and devices. Then we formulate the offloading as a Markov Decision Process (MDP) and utilize a deep reinforcement learning (DRL) algorithm named asynchronous advantage actor-critic (A3C) as the offloading decision-making strategy to balance the workload of edge servers and reduce the overhead in terms of energy and time. Comparison experiments against local computing and the widely used DRL algorithm DQN are conducted on a comprehensive benchmark, and the results show that our work performs much better in terms of self-adjustment and overhead reduction.

Journal ArticleDOI
TL;DR: The concept of a device score, measured with the entropy weight method to assess the quality of model updates, is proposed, and the proposed BAFL framework performs better in terms of both efficiency and resistance to poisoning attacks than other distributed ML methods.
Abstract: As an emerging distributed machine learning (ML) technology, federated learning can protect data privacy by collaboratively learning AI models across a large number of IoT devices. However, inefficiency and vulnerability to poisoning attacks have limited federated learning performance. To solve these problems, a blockchain-based asynchronous federated learning framework (BAFL) is proposed to pursue both security and efficiency. The blockchain ensures that data cannot be tampered with and remains secure, while asynchronous learning speeds up global aggregation. We further propose the concept of a device score and use the entropy weight method to measure the quality of model updates. The score design directly determines the proportion of the device's model in the global aggregation and the allowed local update delay. By analyzing the optimal block generation rate, the paper also balances equipment energy consumption and local update delay by adjusting the local training delay and communication delay. The extensive evaluation results show that the proposed BAFL framework performs better in terms of both efficiency and resistance to poisoning attacks than other distributed ML methods.
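A hedged sketch of the entropy weight method mentioned in the abstract: indicators that vary more across devices receive larger weights, and a device's score is a weighted combination of its normalized indicators. The example indicators (accuracy gain, freshness, local data size) are assumptions for illustration, not BAFL's exact scoring inputs.

```python
import numpy as np

def entropy_weights(X):
    """Entropy weight method: indicator columns with higher dispersion get larger weight.

    X: (n_devices, n_indicators) matrix of positive, 'larger is better' values.
    """
    n = X.shape[0]
    P = X / X.sum(axis=0, keepdims=True)           # column-wise proportions
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(P > 0, P * np.log(P), 0.0)
    e = -plogp.sum(axis=0) / np.log(n)             # entropy of each indicator
    d = 1.0 - e                                    # degree of divergence
    return d / d.sum()

# toy indicators per device: accuracy gain, freshness, local data size (assumed)
X = np.array([[0.02, 0.9, 500.0],
              [0.05, 0.6, 800.0],
              [0.01, 1.0, 300.0]])
w = entropy_weights(X)
scores = (X / X.max(axis=0)) @ w                   # simple normalized weighted score
print("weights:", w, "scores:", scores)
```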

Journal ArticleDOI
TL;DR: The QUEKO benchmarks as mentioned in this paper evaluate the optimality of current layout synthesis tools, including Cirq from Google, Qiskit from IBM, t|ket⟩ from Cambridge Quantum Computing, and a recent academic work.
Abstract: Layout synthesis, an important step in quantum computing, processes quantum circuits to satisfy device layout constraints. In this paper, we construct QUEKO benchmarks for this problem, which have known optimal depths and gate counts. We use QUEKO to evaluate the optimality of current layout synthesis tools, including Cirq from Google, Qiskit from IBM, t|ket⟩ from Cambridge Quantum Computing, and a recent academic work. To our surprise, despite over a decade of research and development by academia and industry on compilation and synthesis for quantum circuits, we are still able to demonstrate large optimality gaps: 1.5-12x on average on a smaller device and 5-45x on average on a larger device. This suggests substantial room for improvement in the efficiency of quantum computers through better layout synthesis tools. Finally, we also prove the NP-completeness of the layout synthesis problem for quantum computing. We have made the QUEKO benchmarks open-source.

Journal ArticleDOI
TL;DR: A deep reinforcement learning based approach has been proposed to adaptively control the training of local models and the phase of global aggregation simultaneously, which can improve the model accuracy by up to 30%, as compared to the state-of-the-art approaches.
Abstract: Federated learning (FL) has been widely recognized as a promising approach that enables individual end devices to cooperatively train a global model without exposing their own data. One of the key challenges in FL is the non-independent and identically distributed (Non-IID) data across the clients, which decreases the efficiency of the stochastic gradient descent (SGD) based training process. Moreover, clients with different data distributions may cause bias to the global model update, resulting in degraded model accuracy. To tackle the Non-IID problem in FL, we aim to optimize the local training process and global aggregation simultaneously. For local training, we analyze the effect of hyperparameters (e.g., the batch size, the number of local updates) on the training performance of FL. Guided by a toy example and theoretical analysis, we are motivated to mitigate the negative impacts incurred by Non-IID data by selecting a subset of participants and adaptively adjusting their batch sizes. A deep reinforcement learning based approach is proposed to adaptively control the training of local models and the phase of global aggregation. Extensive experiments on different datasets show that our method can improve the model accuracy by up to 30%, compared to the state-of-the-art approaches.

Journal ArticleDOI
TL;DR: An ensemble detector is proposed that exploits the capabilities of the main analysis algorithms proposed in the literature, each designed to offer greater resilience to specific evasion techniques; the ensemble can be used to increase the unpredictability of the detection strategy and to improve the detection rate in the presence of unknown malware families.

Journal ArticleDOI
TL;DR: This article addresses the resource management issue by proposing a novel approach, named the Energy-aware Fog Resource Optimization (EFRO) model, to optimize the utilization of connected devices in fog computing, with a heuristic algorithm minimizing both energy cost and time consumption in a holistic way.
Abstract: Combining Internet-of-Things (IoT) technology with cloud computing is a significant alternative for powering the utilization of computing resources in a connected environment. A grand challenge in communications is raised by the emergence of big data, due to large-sized data transmissions and frequent data exchanges. Applying fog computing is considered an option for resolving the communication challenge. However, the large amount of heterogeneous computing hardware attached to fog computing servers makes resource management difficult. This article addresses the resource management issue by proposing a novel approach, named the Energy-aware Fog Resource Optimization (EFRO) model, to optimize the utilization of connected devices in fog computing. We develop a heuristic algorithm minimizing both energy cost and time consumption in a holistic way. A salient feature of EFRO lies in the integration of standardization and smart-shift operations fueled by a hill-climbing mechanism to produce near-optimal resource allocation solutions. Experimental results demonstrate that EFRO is adroit at making near-optimal decisions in managing resources in fog computing environments. In particular, EFRO boosts the energy efficiency of the existing MESF and RR schemes by 54.83 and 71.28 percent, respectively. EFRO shortens DECM's allocation-generation time by up to a factor of 507.
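A minimal hill-climbing sketch in the spirit of EFRO's smart-shift idea (not the paper's algorithm): start from a random task-to-device assignment and keep single-task moves that lower a combined energy-and-makespan cost. The cost model, device parameters, and weighting are illustrative assumptions.

```python
import random

def cost(assign, task_load, dev_power, dev_speed, alpha=0.5):
    """Weighted sum of total energy and makespan (illustrative cost model)."""
    busy = [0.0] * len(dev_power)
    energy = 0.0
    for t, d in enumerate(assign):
        run = task_load[t] / dev_speed[d]
        busy[d] += run
        energy += run * dev_power[d]
    return alpha * energy + (1 - alpha) * max(busy)

def hill_climb(task_load, dev_power, dev_speed, iters=2000, seed=0):
    rng = random.Random(seed)
    assign = [rng.randrange(len(dev_power)) for _ in task_load]
    best = cost(assign, task_load, dev_power, dev_speed)
    for _ in range(iters):
        t = rng.randrange(len(task_load))
        old, assign[t] = assign[t], rng.randrange(len(dev_power))  # smart shift: move one task
        c = cost(assign, task_load, dev_power, dev_speed)
        if c < best:
            best = c                 # keep the improving move
        else:
            assign[t] = old          # otherwise roll it back
    return assign, best

tasks = [4, 7, 2, 9, 5, 3]           # task loads (assumed units)
power = [2.0, 3.5, 1.0]              # per-device power draw (assumed)
speed = [1.0, 2.0, 0.5]              # per-device speed (assumed)
print(hill_climb(tasks, power, speed))
```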

Journal ArticleDOI
Jinming Lu, Chao Fang, Xu Mingyang, Jun Lin, Zhongfeng Wang
TL;DR: It is demonstrated that the posit format shows great potential to be employed in the training of DNNs, and a DNN training framework using 8-bit posit is proposed with a novel tensor-wise scaling scheme.
Abstract: The training of Deep Neural Networks (DNNs) brings enormous memory requirements and computational complexity, which makes it a challenge to train DNN models on resource-constrained devices. Training DNNs with reduced-precision data representation is crucial to mitigate this problem. In this article, we conduct a thorough investigation on training DNNs with low-bit posit numbers, a Type-III universal number (Unum). Through a comprehensive analysis of quantization with various data formats, it is demonstrated that the posit format shows great potential to be employed in the training of DNNs. Moreover, a DNN training framework using 8-bit posit is proposed with a novel tensor-wise scaling scheme. The experiments show the same performance as the state-of-the-art (SOTA) across multiple datasets (MNIST, CIFAR-10, ImageNet, and Penn Treebank) and model architectures (LeNet-5, AlexNet, ResNet, MobileNet-V2, and LSTM). We further design an energy-efficient hardware prototype for our framework. Compared to the standard floating-point counterpart, our design achieves a reduction of 68, 51, and 75 percent in terms of area, power, and memory capacity, respectively.
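The tensor-wise scaling idea can be illustrated with the hedged sketch below: each tensor gets its own scale factor so that its values land in the range where a low-bit format is most accurate, and the scale is undone after the (simulated) low-precision step. The symmetric 8-bit grid used here is only a stand-in; real posit encoding and the paper's exact scaling rule are more involved.

```python
import numpy as np

def tensorwise_scale_quantize(x, levels=255):
    """Scale a whole tensor into [-1, 1], then snap it to a low-bit grid.

    Stand-in for an 8-bit posit: the key point is the per-tensor scale factor,
    stored alongside the quantized tensor and undone after the low-precision op.
    """
    scale = np.max(np.abs(x)) + 1e-12
    q = np.round(x / scale * (levels // 2)) / (levels // 2)   # low-bit grid in [-1, 1]
    return q, scale

def dequantize(q, scale):
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.05, size=(4, 4)).astype(np.float32)      # small-magnitude weights
q, s = tensorwise_scale_quantize(w)
print("max abs error:", np.max(np.abs(dequantize(q, s) - w)))
```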

Journal ArticleDOI
TL;DR: PyQUBO as discussed by the authors is an open-source, Python library for constructing quadratic unconstrained binary optimizations (QUBOs) from the objective functions and the constraints of optimization problems.
Abstract: We present PyQUBO, an open-source, Python library for constructing quadratic unconstrained binary optimizations (QUBOs) from the objective functions and the constraints of optimization problems. PyQUBO enables users to prepare QUBOs or Ising models for various combinatorial optimization problems with ease thanks to the abstraction of expressions and the extensibility of the program. QUBOs and Ising models formulated using PyQUBO are solvable by Ising machines, including quantum annealing machines. We introduce the features of PyQUBO with applications in the number partitioning problem, knapsack problem, graph coloring problem, and integer factorization using a binary multiplier. Moreover, we demonstrate how PyQUBO can be applied to production-scale problems through integration with quantum annealing machines. Through its flexibility and ease of use, PyQUBO has the potential to make quantum annealing a more practical tool among researchers.
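As a concrete taste of the workflow, the sketch below builds a QUBO for the number partitioning problem mentioned in the abstract, following the Array/compile/to_qubo usage documented for the library; the specific numbers are arbitrary.

```python
# Number partitioning with PyQUBO: minimize (sum_i n_i * s_i)^2 over spins s_i = ±1,
# so the two groups (s_i = +1 and s_i = -1) have sums as close as possible.
from pyqubo import Array

numbers = [4, 2, 7, 1]
s = Array.create("s", shape=len(numbers), vartype="SPIN")
H = sum(n * s_i for n, s_i in zip(numbers, s)) ** 2

model = H.compile()
qubo, offset = model.to_qubo()      # QUBO dictionary ready for an Ising machine / annealer
print(offset)
print(sorted(qubo.items())[:3])
```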

Journal ArticleDOI
TL;DR: This work proposes an architectural concept that tackles the issues of achieving extreme energy efficiency while still maintaining high flexibility as a general-purpose compute engine, by enhancing the ISA with two minimally intrusive extensions: stream semantic registers (SSR) and a floating-point repetition instruction (FREP).
Abstract: Data-parallel applications, such as data analytics, machine learning, and scientific computing, are placing an ever-growing demand on floating-point operations per second on emerging systems. With increasing integration density, the quest for energy efficiency becomes the number one design concern. While dedicated accelerators provide high energy efficiency, they are over-specialized and hard to adjust to algorithmic changes. We propose an architectural concept that tackles the issues of achieving extreme energy efficiency while still maintaining high flexibility as a general-purpose compute engine. The key idea is to pair a tiny 10kGE (kilo gate equivalent) control core, called Snitch, with a double-precision floating-point unit (FPU) to adjust the compute-to-control ratio. While minimizing non-FPU area and achieving high floating-point utilization has traditionally been a trade-off, with Snitch we achieve both by enhancing the ISA with two minimally intrusive extensions: stream semantic registers (SSR) and a floating-point repetition instruction (FREP). SSRs allow the core to implicitly encode load/store instructions as register reads/writes, eliding many explicit memory instructions. The FREP extension decouples the floating-point and integer pipeline by sequencing instructions from a micro-loop buffer. These ISA extensions significantly reduce the pressure on the core and free it up for other tasks, making Snitch and the FPU effectively dual-issue at a minimal incremental cost of 3.2 percent. The two low-overhead ISA extensions make Snitch more flexible than a contemporary vector processor lane, achieving a 2× energy-efficiency improvement. We have evaluated the proposed core and ISA extensions on an octa-core cluster in 22 nm technology. We achieve more than 6× multi-core speed-up and a 3.5× gain in energy efficiency on several parallel microkernels.

Journal ArticleDOI
Hyeokjea Kwon, Joonwoo Bae
TL;DR: In this article, the authors present a scheme to deal with unknown quantum noise and show that it can be used to mitigate errors in measurement readout with noisy intermediate-scale quantum (NISQ) devices.
Abstract: When noisy intermediate-scale quantum (NISQ) devices are applied in information processing, all of the stages through preparation, manipulation, and measurement of multipartite qubit states contain various types of noise that are generally hard to verify in practice. In this article, we present a scheme to deal with unknown quantum noise and show that it can be used to mitigate errors in measurement readout with NISQ devices. Quantum detector tomography, which identifies the type of noise in a measurement, can be circumvented. The scheme applies only single-qubit operations, which have relatively higher precision than measurement readout or two-qubit gates. Classical post-processing is then performed on the measurement outcomes. The scheme is implemented in quantum algorithms with NISQ devices: the Bernstein-Vazirani algorithm and a quantum amplitude estimation algorithm on IBMQ_yorktown and IBMQ_essex. The enhancement in the statistics of the measurement outcomes is presented for both of the algorithms with NISQ devices.

Journal ArticleDOI
TL;DR: A heterogeneous fairness-aware energy-efficient framework (HFEE) that employs DVFS to meet fairness constraints and provide energy-efficient scheduling is proposed, implemented, and evaluated on a real heterogeneous multi-core processor.
Abstract: Heterogeneous multi-core processors (HMPs) with the same instruction set architecture (ISA) integrate complex high-performance big cores with power-efficient small cores on the same chip. In comparison with homogeneous architectures, HMPs have been shown to significantly increase energy efficiency. However, current techniques to exploit the energy efficiency of HMPs do not consider fair usage of resources, which leads to reduced performance predictability, a longer makespan, starvation, and QoS degradation. The effect of different cluster voltage and frequency levels on fairness is another issue neglected by previous task scheduling algorithms. The present study investigates both the fairness problem and energy efficiency in HMPs. This article proposes a heterogeneous fairness-aware energy-efficient framework (HFEE) that employs DVFS to meet fairness constraints and provide energy-efficient scheduling. The proposed framework is implemented and evaluated on a real heterogeneous multi-core processor. The experimental results indicate that the introduced technique can significantly improve energy efficiency and fairness when compared to the standard Linux scheduler and two energy-efficient and fairness-aware schedulers.

Journal ArticleDOI
TL;DR: This article designs and implements a cost-efficient function resource provisioning framework to provide predictable performance for serverless DDNN training workloads, while saving the budget of provisioned functions.
Abstract: Serverless computing is becoming a promising paradigm for Distributed Deep Neural Network (DDNN) training in the cloud, as it allows users to decompose complex model training into a number of functions without managing virtual machines or servers. Though provided with a simpler resource interface (i.e., function number and memory size), inadequate function resource provisioning (either under-provisioning or over-provisioning) easily leads to unpredictable DDNN training performance in serverless platforms. Our empirical studies on AWS Lambda indicate that such unpredictable performance of serverless DDNN training is mainly caused by the resource bottleneck of Parameter Servers (PS) and small local batch sizes. In this paper, we design and implement λDNN, a cost-efficient function resource provisioning framework to provide predictable performance for serverless DDNN training workloads, while saving the budget of provisioned functions. Leveraging the PS network bandwidth and function CPU utilization, we build a lightweight analytical DDNN training performance model to enable our design of the λDNN resource provisioning strategy, so as to guarantee DDNN training performance with serverless functions. Extensive prototype experiments on AWS Lambda and complementary trace-driven simulations demonstrate that λDNN can deliver predictable DDNN training performance and save the monetary cost of function resources by up to 66.7%, compared with the state-of-the-art resource provisioning strategies, yet with an acceptable runtime overhead.

Journal ArticleDOI
TL;DR: The approach uses an Unmanned Aerial Vehicle with an embedded system on board running various Fully Convolutional Neural Networks (FCNNs) and proposes the optimal FCNN architecture for the embedded system, relying on the trade-off between detection quality and frame rate.
Abstract: Sosnowsky's hogweed (lat. Heracleum sosnowskyi) is poisonous to humans and dangerous to farm crops and local ecosystems. This plant is fast-growing and has already spread all over Eurasia, from Germany to the Siberian part of Russia, and its distribution expands year by year. In-situ detection of this harmful plant is a tremendous challenge for many countries. Meanwhile, there are no automatic systems for the detection and localization of hogweed. In this article, we report on an approach for fast and accurate detection of hogweed. The approach uses an Unmanned Aerial Vehicle (UAV) with an embedded system on board running various Fully Convolutional Neural Networks (FCNNs). We propose the optimal FCNN architecture for the embedded system, relying on the trade-off between detection quality and frame rate. We propose a model that achieves an ROC AUC of 0.96 in the hogweed segmentation task and can process 4K frames at 0.46 FPS on an NVIDIA Jetson Nano. The developed system can recognize hogweed at the scale of individual plants and leaves. This system opens up a wide vista for obtaining comprehensive and relevant data about the spread of harmful plants, allowing for the elimination of their expansion.

Journal ArticleDOI
TL;DR: In this paper, the authors present the first practical software implementation of supersingular isogeny key encapsulation (SIKE) round 2, targeting NIST's 1, 2, 3, and 5 security levels on 32-bit ARM Cortex-M4 microcontrollers.
Abstract: We present the first practical software implementation of Supersingular Isogeny Key Encapsulation (SIKE) round 2, targeting NIST's 1, 2, 3, and 5 security levels on 32-bit ARM Cortex-M4 microcontrollers. The proposed library introduces a new speed record for all SIKE Round 2 protocols with reasonable memory consumption on the low-end target platform. We achieved this record by adopting several state-of-the-art engineering techniques as well as a highly-optimized hand-crafted assembly implementation of finite field arithmetic. In particular, we carefully redesign the previous optimized implementations of finite field arithmetic on the 32-bit ARM Cortex-M4 platform and propose a set of novel techniques which are explicitly suitable for SIKE primes. The benchmark result on an STM32F4 Discovery board equipped with a 32-bit ARM Cortex-M4 microcontroller shows that the entire key encapsulation and decapsulation over SIKEp434 take about 184 million clock cycles (i.e., 1.09 seconds @ 168 MHz). In contrast to the previous optimized implementation of isogeny-based key exchange on the low-end 32-bit ARM Cortex-M4, our performance evaluation shows the feasibility of using the SIKE mechanism on the low-end platform. In comparison to most of the post-quantum candidates, SIKE requires an excessive number of arithmetic operations, resulting in significantly slower timings. However, its small key size makes this scheme a promising candidate on low-end microcontrollers in the quantum era by ensuring lower energy consumption for key transmission than other schemes.

Journal ArticleDOI
TL;DR: A strategy to load continuous data without post-selection, with computational cost O(Mn), is proposed; it is based on the probabilistic quantum memory (a strategy to load binary data in quantum devices) and the FF-QRAM using standard quantum gates, and it is suitable for noisy intermediate-scale quantum computers.
Abstract: Loading data in a quantum device is required in several quantum computing applications. Without an efficient loading procedure, the cost to initialize the algorithms can dominate the overall computational cost. A circuit-based quantum random access memory named FF-QRAM can load M n-bit patterns with computational cost O(CMn) to load continuous data, where C depends on the data distribution. In this article, we propose a strategy to load continuous data without post-selection, with computational cost O(Mn). The proposed method is based on the probabilistic quantum memory, a strategy to load binary data in quantum devices, and the FF-QRAM using standard quantum gates, and is suitable for noisy intermediate-scale quantum computers.

Journal ArticleDOI
TL;DR: This work proposes a novel implementation technique for designing resource-efficient and low-power accurate and approximate multipliers which are optimized for FPGA-based systems.
Abstract: Multiplication is one of the most extensively used arithmetic operations in a wide range of applications. In order to provide resource-efficient and high-performance multipliers, previous works have proposed different designs of accurate and approximate multipliers, mainly for ASIC-based systems. However, the architectural differences between ASIC- and FPGA-based systems limit the effectiveness of these multipliers for FPGA-based systems. Moreover, most of these multiplier designs are valid only for unsigned numbers. To bridge this gap, we propose a novel implementation technique for designing resource-efficient and low-power accurate and approximate signed multipliers which are optimized for FPGA-based systems. Compared to Vivado's area-optimized multiplier IPs, the designs obtained using our proposed technique occupy 47 to 63 percent less area (lookup tables). To accelerate further research in this direction and to reproduce the presented results, the RTL and behavioral models of our proposed methodology are available as an open-source library (online: https://cfaed.tu-dresden.de/pd-downloads).
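For context only, the sketch below is a generic behavioral model of one common approximation style (dropping operand LSBs before an exact signed multiply); it is not the paper's FPGA LUT-level design, but it shows how such approximate signed multipliers are typically modeled and evaluated for error in software.

```python
import random

def approx_signed_multiply(a: int, b: int, trunc: int = 2) -> int:
    """Generic behavioral approximation: zero the `trunc` LSBs of each signed
    operand (arithmetic shift keeps the sign), then multiply exactly.
    NOT the paper's design; used only to illustrate error evaluation."""
    a_t = (a >> trunc) << trunc
    b_t = (b >> trunc) << trunc
    return a_t * b_t

# error statistics over random signed 8-bit operands
random.seed(1)
errs = []
for _ in range(10_000):
    a, b = random.randint(-128, 127), random.randint(-128, 127)
    exact = a * b
    if exact != 0:
        errs.append(abs(approx_signed_multiply(a, b) - exact) / abs(exact))
print(f"mean relative error: {100 * sum(errs) / len(errs):.2f}%")
```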

Journal ArticleDOI
TL;DR: This article proposes a directed test generation technique to activate a target by effective utilization of concolic testing on RTL models and develops efficient learning and clustering techniques to minimize the overlapping searches across targets to drastically reduce the overall test generation effort.
Abstract: Simulation is widely used for validation of Register-Transfer-Level (RTL) models. While simulating with millions of random or constrained-random tests can cover the majority of the functional scenarios, the number of remaining scenarios can still be huge (hundreds or thousands) in case of today's industrial designs. Hard-to-activate branches are one of the major contributors to such remaining/untested scenarios. While directed test generation techniques using formal methods are promising in activating branches, it is infeasible to apply them on large designs due to state space explosion. In this article, we propose a fully automated and scalable approach to cover the hard-to-activate branches using concolic testing of RTL models. While applications of concolic testing on hardware designs have shown some promising results in improving the overall coverage, they are not designed to activate specific targets such as uncovered corner cases and rare scenarios. In other words, existing concolic testing approaches address the state space explosion problem but lead to a path explosion problem while searching for the uncovered targets. Our proposed approach maps the directed test generation problem to a target search problem while avoiding overlapping searches involving multiple targets. This article makes two important contributions. (1) We propose a directed test generation technique to activate a target by effective utilization of concolic testing on RTL models. (2) We develop efficient learning and clustering techniques to minimize the overlapping searches across targets to drastically reduce the overall test generation effort. Experimental results demonstrate that our approach significantly outperforms the state-of-the-art methods in terms of test generation time (up to 205X, 69X on average) as well as memory requirements (up to 31X, 7X on average).

Journal ArticleDOI
TL;DR: A design methodology aiming at allocating the execution of Convolutional Neural Networks (CNNs) on a distributed IoT application is introduced, formalized as an optimization problem where the latency between the data-gathering phase and the subsequent decision-making one is minimized, within the given constraints on memory and processing load at the units level.
Abstract: The severe constraints on memory and computation that characterize Internet-of-Things (IoT) units may prevent the execution of Deep Learning (DL)-based solutions, which typically demand large memory and a high processing load. To support real-time execution of the considered DL model at the IoT unit level, DL solutions must be designed with the memory and processing constraints of the chosen IoT technology in mind. In this article, we introduce a design methodology for allocating the execution of Convolutional Neural Networks (CNNs) onto a distributed IoT application. The methodology is formalized as an optimization problem in which the latency between the data-gathering phase and the subsequent decision-making phase is minimized, within the given constraints on memory and processing load at the unit level. The methodology supports multiple sources of data as well as multiple CNNs executing on the same IoT system, allowing the design of CNN-based applications demanding autonomy, low decision latency, and high Quality-of-Service.
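A tiny brute-force sketch of the allocation problem the abstract formalizes: assign consecutive CNN layers to IoT units so that each unit's memory budget is respected and the end-to-end latency (compute plus inter-unit transfers) is minimized. The layer sizes, unit capabilities, link bandwidth, and latency model are illustrative assumptions, not the paper's formulation.

```python
from itertools import product

# per-layer (memory_kB, mega-ops, output activation size in kB) -- assumed values
layers = [(60, 5.0, 40), (120, 9.0, 20), (80, 6.0, 10), (30, 2.0, 1)]
# per-unit (memory_budget_kB, mega-ops_per_ms) and link bandwidth between units (kB/ms)
units = [(150, 1.0), (200, 2.5)]
link_kb_per_ms = 5.0

def latency(assign):
    """End-to-end latency of a layer->unit assignment, or None if memory overflows."""
    mem = [0.0] * len(units)
    t = 0.0
    for i, (m, ops, out_kb) in enumerate(layers):
        u = assign[i]
        mem[u] += m
        if mem[u] > units[u][0]:
            return None                        # violates the unit's memory budget
        t += ops / units[u][1]                 # compute time on this unit
        if i + 1 < len(layers) and assign[i + 1] != u:
            t += out_kb / link_kb_per_ms       # transfer activation to the next unit
    return t

best = min((a for a in product(range(len(units)), repeat=len(layers))
            if latency(a) is not None), key=latency)
print("best assignment:", best, "latency (ms):", round(latency(best), 2))
```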