Proceedings Article

Xuantie-910: A Commercial Multi-Core 12-Stage Pipeline Out-of-Order 64-bit High Performance RISC-V Processor with Vector Extension: Industrial Product

TL;DR: Xuantie-910 is an industry-leading 64-bit high performance embedded RISC-V processor from Alibaba T-Head division that features custom extensions to arithmetic operation, bit manipulation, load and store, TLB and cache operations, and implements the 0.7.1 stable release of the RISC-V vector extension specification for high efficiency vector processing.
Abstract: The open source RISC-V ISA has been quickly gaining momentum. This paper presents Xuantie-910, an industry-leading 64-bit high performance embedded RISC-V processor from Alibaba T-Head division. It is fully based on the RV64GCV instruction set and it features custom extensions to arithmetic operation, bit manipulation, load and store, TLB and cache operations. It also implements the 0.7.1 stable release of the RISC-V vector extension specification for high efficiency vector processing. Xuantie-910 supports multi-core multi-cluster SMP with cache coherence. Each cluster contains 1 to 4 core(s) capable of booting the Linux operating system. Each single core utilizes a state-of-the-art 12-stage deep pipeline, out-of-order, multi-issue superscalar architecture, achieving a maximum clock frequency of 2.5 GHz under typical process, voltage and temperature conditions in TSMC 12 nm FinFET process technology. Each single core with the vector execution unit occupies an area of 0.8 mm² (excluding the L2 cache). The toolchain is enhanced significantly to support the vector extension and custom extensions. Through hardware and toolchain co-optimization, to date Xuantie-910 delivers the highest performance (in terms of IPC, speed, and power efficiency) for a number of industrial control flow and data computing benchmarks, when compared with its predecessors in the RISC-V family. The Xuantie-910 FPGA implementation has been deployed in the data centers of Alibaba Cloud for application-specific acceleration (e.g., blockchain transaction). ASIC deployment in low-cost SoC applications, such as IoT endpoints and edge computing, is planned to facilitate Alibaba's end-to-end and cloud-to-edge computing infrastructure.
Citations
Proceedings Article
11 May 2021
TL;DR: In this article, the authors compared the most prominent open-source application-class RISC-V projects by running identical benchmarks on identical platforms with defined configuration settings, including the Rocket, BOOM, CVA6, and SHAKTI C-Class implementations.
Abstract: The numerous emerging implementations of RISC-V processors and frameworks underline the success of this Instruction Set Architecture (ISA) specification. The free and open source character of many implementations facilitates their adoption in academic and commercial projects. As yet it is not easy to say which implementation fits best for a system with given requirements such as processing performance or power consumption. With varying backgrounds and histories, the developed RISC-V processors are very different from each other. Comparisons are difficult, because results are reported for arbitrary technologies and configuration settings. Scaling factors are used to draw comparisons, but this gives only rough estimates. In order to give more substantiated results, this paper compares the most prominent open-source application-class RISC-V projects by running identical benchmarks on identical platforms with defined configuration settings. The Rocket, BOOM, CVA6, and SHAKTI C-Class implementations are evaluated for processing performance, area and resource utilization, power consumption as well as efficiency. Results are presented for the Xilinx Virtex UltraScale+ family and GlobalFoundries 22FDX ASIC technology.

35 citations

Journal Article
TL;DR: Vector architectures lack tools for research: the gem5 simulator, possibly the leading platform for computer-system architecture research, does not have an available distribution that includes a flexible and customizable vector architecture model.
Abstract: Vector architectures lack tools for research. Consider the gem5 simulator, which is possibly the leading platform for computer-system architecture research. Unfortunately, gem5 does not have an available distribution that includes a flexible and customizable vector architecture model. In consequence, researchers have to develop their own simulation platform to test their ideas, which consumes much research time. However, once the base simulator platform is developed, another question is the following: Which applications should be tested to perform the experiments? The lack of Vectorized Benchmark Suites is another limitation. To face these problems, this work presents a set of tools for designing and evaluating vector architectures. First, the gem5 simulator was extended to support the execution of RISC-V Vector instructions by adding a parameterizable Vector Architecture model for designers to evaluate different approaches according to the target they pursue. Second, a novel Vectorized Benchmark Suite is presented: a collection composed of seven data-parallel applications from different domains that can be classified according to the modules that are stressed in the vector architecture. Finally, a study of the Vectorized Benchmark Suite executing on the gem5-based Vector Architecture model is highlighted. This suite is the first in its category that covers the different possible usage scenarios that may occur within different vector architecture designs such as embedded systems, mainly focused on short vectors, or High-Performance-Computing (HPC), usually designed for large vectors.

20 citations

Proceedings Article
13 Jun 2021
TL;DR: In this article, the authors proposed a method to evaluate the energy consumption using power and performance values, and shortlisted the most optimized cores for resource-constrained devices and implemented them using an ASIC prototyping platform as a unified technology.
Abstract: Resource-constrained electronic devices targeting IoT applications need a microcontroller to control their operation, and hence this microcontroller should be energy efficient to increase battery life. One of the emerging open-source processor ISAs is RISC-V; being open source, it provides a great opportunity for innovation and creativity in designing processor cores. This study aims to survey open-source RISC-V cores and classify them as high-performance or resource-constrained. Afterward, we shortlist the most optimized cores for resource-constrained devices. Seven shortlisted cores are implemented using an ASIC prototyping platform as a unified technology and are compared using resource utilization and energy consumption profile to find the most energy-efficient core. We proposed a method to evaluate the energy consumption using power and performance values. Results showed the Ibex core to have the best energy consumption characteristics, with 8.7 CoreMark iterations/mJ on the ASIC prototyping platform.
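The abstract above reports efficiency in CoreMark iterations per millijoule and says the metric is derived from power and performance values. One plausible reading, sketched here in Python, divides benchmark throughput by average power, since 1 mW equals 1 mJ/s. The function name and the sample operating point are hypothetical and are not taken from the paper, whose exact method may differ.

```python
def iterations_per_millijoule(iterations_per_second: float,
                              power_milliwatts: float) -> float:
    """Benchmark iterations completed per millijoule of energy.

    1 mW is 1 mJ/s, so (iterations/s) / (mJ/s) yields iterations/mJ.
    """
    if power_milliwatts <= 0:
        raise ValueError("power must be positive")
    return iterations_per_second / power_milliwatts


# Hypothetical operating point for illustration only (not measured data):
# a core completing 50 CoreMark iterations per second while drawing 5 mW.
efficiency = iterations_per_millijoule(iterations_per_second=50.0,
                                       power_milliwatts=5.0)
print(f"{efficiency:.1f} iterations/mJ")
```

Under this reading, a core can improve the metric either by finishing more iterations per second at the same power or by drawing less power at the same throughput.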

8 citations

Proceedings Article
01 Jul 2022
TL;DR: This work presents the first open-source implementation of the RISC-V V extension in its 1.0-Frozen version, discusses the new specification's impact on the micro-architecture of a lane-based design, and provides insights on performance-oriented design of coupled scalar-vector processors.
Abstract: Vector architectures are gaining traction for highly efficient processing of data-parallel workloads, driven by all major ISAs (RISC-V, Arm, Intel), and boosted by landmark chips, like the Arm SVE-based Fujitsu A64FX, powering the TOP500 leader Fugaku. The RISC-V V extension has recently reached 1.0-Frozen status. Here, we present its first open-source implementation, discuss the new specification's impact on the micro-architecture of a lane-based design, and provide insights on performance-oriented design of coupled scalar-vector processors. Our system achieves comparable/better PPA than state-of-the-art vector engines that implement older RVV versions: 15% better area, 6% improved throughput, and FPU utilization >98.5% on crucial kernels.

7 citations

Proceedings Article
01 Oct 2022
TL;DR: This work proposes MINJIE, an open-source platform supporting an agile processor development flow; it integrates a broad set of tools for logic design, functional verification, performance modelling, pre-silicon validation and debugging to improve the development efficiency of state-of-the-art processor designs.
Abstract: While research has shown that the agile chip design methodology is promising to sustain the scaling of computing performance in a more efficient way, it is still of limited usage in actual applications due to two major obstacles: 1) Lack of tool-chain and developing framework supporting agile chip design, especially for large-scale modern processors. 2) The conventional verification methods are less agile and become a major bottleneck of the entire process. To tackle both issues, we propose MINJIE, an open-source platform supporting agile processor development flow. MINJIE integrates a broad set of tools for logic design, functional verification, performance modelling, pre-silicon validation and debugging for better development efficiency of state-of-the-art processor designs. We demonstrate the usage and effectiveness of MINJIE by building two generations of an open-source superscalar out-of-order RISC-V processor code-named XIANGSHAN using agile methodologies. We quantify the performance of XIANGSHAN using SPEC CPU2006 benchmarks and demonstrate that XIANGSHAN achieves industry-competitive performance.

6 citations

References
Journal Article
TL;DR: The Niagara processor implements a thread-rich architecture designed to provide a high-performance solution for commercial server applications that exploits the thread-level parallelism inherent to server applications, while targeting low levels of power consumption.
Abstract: The Niagara processor implements a thread-rich architecture designed to provide a high-performance solution for commercial server applications. This is an entirely new implementation of the Sparc V9 architectural specification, which exploits large amounts of on-chip parallelism to provide high throughput. The hardware supports 32 threads with a memory subsystem consisting of an on-board crossbar, level-2 cache, and memory controllers for a highly integrated design that exploits the thread-level parallelism inherent to server applications, while targeting low levels of power consumption.

1,053 citations


"Xuantie-910: A Commercial Multi-Cor..." refers background in this paper

  • ...Admittedly, compared with the X86, ARM, MIPS [16]–[18], PowerPC, SPARC [11], [20], [21], [23], [30], openRISC [14], [24], [26] and other ISAs under the hood of popular GPUs and DSPs, RISC-V is still in its infancy....


01 Jan 2014
TL;DR: This draft specification may change before being accepted as standard by the RISC-V Foundation, and it remains possible that implementations made to this draft specification will not conform to the future standard.
Abstract: Volume II: Privileged Architecture. Privileged Architecture Version 1.10, Document Version 1.10. Warning! This draft specification may change before being accepted as standard by the RISC-V Foundation. While the editors intend future changes to this specification to be forward compatible, it remains possible that implementations made to this draft specification will not conform to the future standard.

583 citations


"Xuantie-910: A Commercial Multi-Cor..." refers background in this paper

  • ...XT-910 is fully in compliance with the RISC-V RV64GCV instruction set specification [31]....


Journal Article
TL;DR: In this paper, the authors describe the design of an open-source RISC-V processor core specifically designed for near-threshold (NT) operation in tightly coupled multicore clusters and introduce instruction extensions and micro-architectural optimizations to increase the computational density and to minimize the pressure toward the shared-memory hierarchy.
Abstract: Endpoint devices for Internet-of-Things not only need to work under an extremely tight power envelope of a few milliwatts, but also need to be flexible in their computing capabilities, from a few kOPS to GOPS. Near-threshold (NT) operation can achieve higher energy efficiency, and the performance scalability can be gained through parallelism. In this paper, we describe the design of an open-source RISC-V processor core specifically designed for NT operation in tightly coupled multicore clusters. We introduce instruction extensions and microarchitectural optimizations to increase the computational density and to minimize the pressure toward the shared-memory hierarchy. For typical data-intensive sensor processing workloads, the proposed core is, on average, $3.5\times$ faster and $3.2\times$ more energy efficient, thanks to a smart L0 buffer to reduce cache access contentions and support for compressed instructions. Single Instruction Multiple Data extensions, such as dot products, and a built-in L0 storage further reduce the shared-memory accesses by $8\times$, reducing contentions by $3.2\times$. With four NT-optimized cores, the cluster is operational from 0.6 to 1.2 V, achieving a peak efficiency of 67 MOPS/mW in a low-cost 65-nm bulk CMOS technology. In a low-power 28-nm FD-SOI process, a peak efficiency of 193 MOPS/mW (40 MHz and 1 mW) can be achieved.

304 citations

Journal Article
Florian Zaruba, Luca Benini
TL;DR: A thorough power, performance, and efficiency analysis of the RISC-V ISA targeting baseline “application class” functionality, i.e., supporting the Linux OS and its application environment based on the authors' open-source single-issue in-order implementation of the 64-bit ISA variant (RV64GC) called Ariane.
Abstract: The open-source RISC-V instruction set architecture (ISA) is gaining traction, both in industry and academia. The ISA is designed to scale from microcontrollers to server-class processors. Furthermore, openness promotes the availability of various open-source and commercial implementations. Our main contribution in this paper is a thorough power, performance, and efficiency analysis of the RISC-V ISA targeting baseline “application class” functionality, i.e., supporting the Linux OS and its application environment based on our open-source single-issue in-order implementation of the 64-bit ISA variant (RV64GC) called Ariane. Our analysis is based on a detailed power and efficiency analysis of the RISC-V ISA extracted from silicon measurements and calibrated simulation of an Ariane instance (RV64IMC) taped-out in GlobalFoundries 22FDX technology. Ariane runs at up to 1.7-GHz, achieves up to 40-Gop/sW energy efficiency, which is superior to similar cores presented in the literature. We provide insight into the interplay between functionality required for the application-class execution (e.g., virtual memory, caches, and multiple modes of privileged operation) and energy cost. We also compare Ariane with RISCY, a simpler and slower microcontroller-class core. Our analysis confirms that supporting application-class execution implies a nonnegligible energy-efficiency loss and that compute performance is more cost-effectively boosted by instruction extensions (e.g., packed SIMD) rather than high-frequency operation.

195 citations

Proceedings Article
10 Jul 2018
TL;DR: GAP-8 is proposed: a multi-GOPS fully programmable RISC-V IoT-edge computing engine, featuring an 8-core cluster with CNN accelerator, coupled with an ultra-low power MCU with 30 μW state-retentive sleep power.
Abstract: Current ultra-low power smart sensing edge devices, operating for years on small batteries, are limited to low-bandwidth sensors, such as temperature or pressure. Enabling the next generation of edge devices to process data from richer sensors such as image, video, audio, or multi-axial motion/vibration has huge application potential. However, edge processing of data-rich sensors poses the extreme challenge of squeezing the computational requirements of advanced, machine-learning-based near-sensor data analysis algorithms (such as Convolutional Neural Networks) within the mW-range power envelope of always-ON battery-powered IoT end-nodes. To address this challenge, we propose GAP-8: a multi-GOPS fully programmable RISC-V IoT-edge computing engine, featuring an 8-core cluster with CNN accelerator, coupled with an ultra-low power MCU with 30 μW state-retentive sleep power. GAP-8 delivers up to 10 GMAC/s for CNN inference (90 MHz, 1.0 V) at the energy efficiency of 600 GMAC/s/W within a worst-case power envelope of 75 mW.
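As a quick consistency check on the figures in this abstract, dividing the quoted peak throughput by the quoted energy efficiency gives the power those two numbers imply. The Python sketch below uses a hypothetical function name and assumes both figures describe the same operating point, which the abstract suggests but does not spell out.

```python
def implied_power_milliwatts(throughput_gmac_per_s: float,
                             efficiency_gmac_per_s_per_w: float) -> float:
    """Power implied by a throughput and an efficiency figure.

    P [W] = throughput [GMAC/s] / efficiency [GMAC/s/W]; result in mW.
    """
    if efficiency_gmac_per_s_per_w <= 0:
        raise ValueError("efficiency must be positive")
    return throughput_gmac_per_s / efficiency_gmac_per_s_per_w * 1000.0


# Figures quoted in the abstract: 10 GMAC/s at 600 GMAC/s/W, 75 mW envelope.
peak_power_mw = implied_power_milliwatts(10.0, 600.0)
envelope_mw = 75.0
print(f"implied power: {peak_power_mw:.1f} mW "
      f"(worst-case envelope: {envelope_mw:.0f} mW)")
```

The implied figure of roughly 16.7 mW sits well inside the 75 mW worst-case envelope, so the two quoted numbers are at least mutually consistent under the stated assumption.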

157 citations


"Xuantie-910: A Commercial Multi-Cor..." refers background in this paper

  • ...Along the RISC-V performance spectrum, most of the existing cores are in the microcontroller class [12], [15], [19], [33]....
