scispace - formally typeset
Author

Zhijian Chen

Bio: Zhijian Chen is an academic researcher from Alibaba Group. The author has contributed to research in the topics of Operand & Cache. The author has an h-index of 5, and has co-authored 6 publications receiving 58 citations.

Papers
Proceedings ArticleDOI
30 May 2020
TL;DR: Xuantie-910 is an industry-leading 64-bit high-performance embedded RISC-V processor from the Alibaba T-Head division that features custom extensions for arithmetic operations, bit manipulation, load and store, and TLB and cache operations, and implements the 0.7.1 stable release of the RISC-V vector extension specification for high-efficiency vector processing.
Abstract: The open source RISC-V ISA has been quickly gaining momentum. This paper presents Xuantie-910, an industry leading 64-bit high performance embedded RISC-V processor from Alibaba T-Head division. It is fully based on the RV64GCV instruction set and it features custom extensions to arithmetic operation, bit manipulation, load and store, TLB and cache operations. It also implements the 0.7.1 stable release of RISC-V vector extension specification for high efficiency vector processing. Xuantie-910 supports multi-core multi-cluster SMP with cache coherence. Each cluster contains 1 to 4 core(s) capable of booting the Linux operating system. Each single core utilizes the state-of-the-art 12-stage deep pipeline, out-of-order, multi-issue superscalar architecture, achieving a maximum clock frequency of 2.5 GHz in the typical process, voltage and temperature condition in a TSMC 12nm FinFET process technology. Each single core with the vector execution unit costs an area of 0.8 mm2 (excluding the L2 cache). The toolchain is enhanced significantly to support the vector extension and custom extensions. Through hardware and toolchain co-optimization, to date Xuantie-910 delivers the highest performance (in terms of IPC, speed, and power efficiency) for a number of industrial control flow and data computing benchmarks, when compared with its predecessors in the RISC-V family. Xuantie-910 FPGA implementation has been deployed in the data centers of Alibaba Cloud, for application-specific acceleration (e.g., blockchain transaction). The ASIC deployment at low-cost SoC applications, such as IoT endpoints and edge computing, is planned to facilitate Alibaba's end-to-end and cloud-to-edge computing infrastructure.

55 citations

Patent
08 Jul 2009
TL;DR: In this article, the authors propose an out-of-order execution control device consisting of a transmit unit, a reservation station register unit and an execution control unit, the last of which monitors the working condition of each execution unit in real time.
Abstract: The invention relates to a device for an embedded processor that controls out-of-order execution. The device comprises a transmit unit, a reservation station register unit and an execution control unit. The transmit unit stores a decoded instruction in a pipeline register and issues one instruction per clock cycle; the reservation station register unit temporarily stores an instruction that stalls because of a read/write data dependence conflict, and performs bypass monitoring on its operands; the execution control unit monitors the working condition of each execution unit in real time and dynamically dispatches either an instruction from the reservation station register unit or the currently issued instruction, according to the information returned by each execution unit. The invention has the advantages of a simple design, easy realization and a remarkable improvement in the performance of the embedded processor.
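The dispatch scheme the abstract describes can be sketched in software. This is a minimal illustrative model, not the patented circuit: all class and method names (`ReservationStation`, `ExecUnits`, `monitor`) are invented for the sketch, and the single-entry-per-cycle behavior is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    name: str
    operands_ready: bool = True   # becomes True once bypassed data arrives

class ExecUnits:
    """Toy pool of execution units; dispatch records completion order."""
    def __init__(self, n_units):
        self.free = n_units
        self.executed = []

    def has_free_unit(self):
        return self.free > 0

    def dispatch(self, instr):
        self.free -= 1
        self.executed.append(instr.name)

class ReservationStation:
    def __init__(self):
        self.pending = []            # instructions parked after a stall

    def issue(self, instr, units):
        """Issue an instruction; park it on a data hazard or busy units."""
        if instr.operands_ready and units.has_free_unit():
            units.dispatch(instr)
        else:
            self.pending.append(instr)

    def monitor(self, units):
        """Each cycle, re-check parked instructions (bypass monitoring)
        and dispatch any whose operands are now ready."""
        still_waiting = []
        for instr in self.pending:
            if instr.operands_ready and units.has_free_unit():
                units.dispatch(instr)
            else:
                still_waiting.append(instr)
        self.pending = still_waiting
```

In this toy model, an instruction with a pending read/write conflict sits in `pending` until the bypass network flips `operands_ready`, at which point `monitor` dispatches it to a free unit.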

27 citations

Patent
21 Jul 2010
TL;DR: In this paper, a low-power, high-capacity cache design is proposed in which the storage area is divided into a number of physical blocks, so that on a hit only 1/n of the total capacity is accessed.
Abstract: The invention discloses a design method for a low-power, high-capacity cache. On the one hand, the invention prevents invalid accesses to the cache by obtaining in advance the information of whether the access will hit, reducing dynamic power consumption; on the other hand, the invention divides the storage area into a number of physical blocks and, on a hit, accesses only 1/n of the total capacity, so the dynamic power consumption is 1/n of the original. By using virtual addresses to index the cache, the invention saves the virtual-to-physical address translation time, reduces the extra hardware cost of address translation, and reduces the overall power consumption of the system. The invention can greatly reduce the power consumption of the cache and greatly increase the total cache capacity usable by an embedded processor, with the advantages of low hardware cost and a simple design.
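The two power-saving ideas can be illustrated with a small software model: a hit check before the data access avoids useless reads, and the data store is split into banks so a hit reads only one of them. This is a sketch under assumptions; the class name, the address-to-bank mapping, and the use of a read counter as an energy proxy are all invented for illustration.

```python
class BankedCache:
    """Toy banked cache: on a miss no data bank is read at all, and on a
    hit exactly one of the n banks is read (1/n of the capacity)."""
    def __init__(self, n_banks):
        self.n_banks = n_banks
        self.banks = [dict() for _ in range(n_banks)]
        self.bank_reads = 0              # proxy for dynamic read energy

    def lookup(self, addr):
        bank = addr % self.n_banks       # bank-select bits of the address
        if addr not in self.banks[bank]:
            return None                  # known miss: skip the data read
        self.bank_reads += 1             # hit: one bank read, not n
        return self.banks[bank][addr]

    def fill(self, addr, data):
        self.banks[addr % self.n_banks][addr] = data
```

A conventional design would read every bank (or every way) in parallel on each access; here the pre-check and the bank select together model why the dynamic power drops roughly to 1/n.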

8 citations

Patent
17 Feb 2010
TL;DR: In this paper, a floating-point addition device based on complement rounding is proposed which supports both floating-point addition and floating-point subtraction; it has a uniform mechanism, avoids the especially complex mantissa operand preparation and rounding judgment logic of floating-point addition, and reduces the logic complexity.
Abstract: The invention relates to a floating-point addition device based on complement rounding, which supports both floating-point addition and floating-point subtraction. The floating-point addition device comprises an exponent adder, a mantissa shifter, a mantissa operand preparation logic unit, a mantissa adder, a rounding judgment logic unit and a rounding adder. The mantissa operand preparation logic unit processes the mantissa operands according to the sign bits and the exponent difference of the first and second floating-point operands. The rounding judgment logic unit executes a uniform rounding judgment on the mantissa addition result: it judges whether the mantissa sum is positive or negative from the highest bit output by the mantissa adder, determines a constant bit for the rounding judgment from the highest four bits output by the mantissa adder, and unifies the sign-magnitude round-plus-1 judgment logic with the complement round-plus-0 judgment logic. The rounding adder rounds the mantissa addition result and completes the code extraction and complement operation on the mantissa sum. The invention has a uniform mechanism, avoids the especially complex mantissa operand preparation and rounding judgment logic of floating-point addition, and reduces the logic complexity.

7 citations

Patent
08 May 2013
TL;DR: In this paper, a single-instruction multiple-data arithmetic unit supporting various data types is presented; it comprises N atom operation arrays, each containing an operand preparation unit, an additive operation unit, a round-off operation unit and a saturation operation unit.
Abstract: A single-instruction multiple-data arithmetic unit supporting various data types comprises N atom operation arrays. Each atom operation array comprises an operand preparation unit, an additive operation unit, a round-off operation unit, a saturation operation unit and a result encapsulation unit. The operand preparation unit operates on an input source operand and outputs an intermediate operand according to the input operation-type and data-type information; the additive operation unit receives the intermediate operand, completes the additive operation and outputs its result; the round-off operation unit performs a round-off operation on the result of the additive operation according to the operation-type and data-type information, and outputs the rounded result. The saturation operation unit performs a saturation operation on the result of the additive operation according to the operation-type and data-type information. The result encapsulation unit selects the output of either the round-off operation unit or the saturation operation unit, and packs the intermediate result into the final datum according to the data-type information. The single-instruction multiple-data arithmetic unit can effectively support various data widths, and its applicability is good.
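The saturation step of such a unit can be sketched numerically. This is an illustrative model only: the function name, the tuple-of-lanes representation, and the signed two's-complement saturation range are assumptions, and the real unit also handles rounding and result packing.

```python
def simd_saturating_add(a, b, lane_bits):
    """Lane-wise signed addition with saturation: each lane's sum is
    clamped to the representable range of a `lane_bits`-bit signed
    integer, instead of wrapping around."""
    lo = -(1 << (lane_bits - 1))
    hi = (1 << (lane_bits - 1)) - 1
    return tuple(min(max(x + y, lo), hi) for x, y in zip(a, b))
```

Parameterizing the lane width models how one datapath serves several data types: the same logic clamps to ±127 for 8-bit lanes and ±32767 for 16-bit lanes.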

5 citations


Cited by
Proceedings ArticleDOI
11 May 2021
TL;DR: In this article, the authors compared the most prominent open-source application-class RISC-V projects by running identical benchmarks on identical platforms with defined configuration settings, including the Rocket, BOOM, CVA6, and SHAKTI C-Class implementations.
Abstract: The numerous emerging implementations of RISC-V processors and frameworks underline the success of this Instruction Set Architecture (ISA) specification. The free and open source character of many implementations facilitates their adoption in academic and commercial projects. As yet it is not easy to say which implementation fits best for a system with given requirements such as processing performance or power consumption. With varying backgrounds and histories, the developed RISC-V processors are very different from each other. Comparisons are difficult, because results are reported for arbitrary technologies and configuration settings. Scaling factors are used to draw comparisons, but this gives only rough estimates. In order to give more substantiated results, this paper compares the most prominent open-source application-class RISC-V projects by running identical benchmarks on identical platforms with defined configuration settings. The Rocket, BOOM, CVA6, and SHAKTI C-Class implementations are evaluated for processing performance, area and resource utilization, power consumption as well as efficiency. Results are presented for the Xilinx Virtex UltraScale+ family and GlobalFoundries 22FDX ASIC technology.

35 citations

Journal ArticleDOI
TL;DR: Vector architectures lack tools for research; the gem5 simulator, possibly the leading platform for computer-system architecture research, does not have an available distribution that includes a flexible and customizable vector architecture model.
Abstract: Vector architectures lack tools for research. Consider the gem5 simulator, which is possibly the leading platform for computer-system architecture research. Unfortunately, gem5 does not have an available distribution that includes a flexible and customizable vector architecture model. In consequence, researchers have to develop their own simulation platform to test their ideas, which consumes much research time. However, once the base simulator platform is developed, another question follows: which applications should be tested to perform the experiments? The lack of vectorized benchmark suites is another limitation. To address these problems, this work presents a set of tools for designing and evaluating vector architectures. First, the gem5 simulator was extended to support the execution of RISC-V Vector instructions by adding a parameterizable Vector Architecture model for designers to evaluate different approaches according to the target they pursue. Second, a novel Vectorized Benchmark Suite is presented: a collection of seven data-parallel applications from different domains that can be classified according to the modules that are stressed in the vector architecture. Finally, a study of the Vectorized Benchmark Suite executing on the gem5-based Vector Architecture model is highlighted. This suite is the first in its category that covers the different possible usage scenarios that may occur within different vector architecture designs, such as embedded systems, mainly focused on short vectors, or High-Performance Computing (HPC), usually designed for large vectors.

20 citations

Patent
22 Jun 2011
TL;DR: In this paper, a floating-point calculator consisting of a float-to-fixed conversion module, a fixed-point addition module and a normalization module is described; the accumulation is performed on fixed-point numbers, and the result is converted back to a floating-point number with the same bit width as the input.
Abstract: The invention discloses a floating-point calculator and a processing method for floating-point calculation. The floating-point calculator comprises a float-to-fixed conversion module, a fixed-point addition module and a normalization module. The float-to-fixed conversion module converts the floating-point number at its input into a fixed-point number. The input of the fixed-point addition module is connected to the output of the float-to-fixed conversion module, and the output of the fixed-point addition module is fed back to its own input; the fixed-point addition module adds the fixed-point number output by the float-to-fixed conversion module to the fixed-point number it produced in the previous cycle. The normalization module is connected to the output of the fixed-point addition module; it normalizes that output and converts it into a floating-point number with the same bit width as the input of the floating-point calculator. By adopting this floating-point calculator, single-beat accumulation at high frequency is realized.
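The accumulator structure can be modeled in a few lines: convert each incoming float to a wide fixed-point integer, accumulate with a plain integer adder whose output feeds back to its input, then convert the final sum back to a float. The 32 fraction bits and all function names are assumptions of this sketch; the hardware's conversion and normalization are far more involved.

```python
FRAC_BITS = 32          # assumed width of the fixed-point fraction

def to_fixed(x):
    """Float-to-fixed conversion: scale by 2**FRAC_BITS and round."""
    return int(round(x * (1 << FRAC_BITS)))

def to_float(f):
    """Normalization step of the sketch: back to a float."""
    return f / (1 << FRAC_BITS)

def accumulate(values):
    acc = 0                          # fixed-point accumulator register
    for v in values:
        acc += to_fixed(v)           # one integer add per beat
    return to_float(acc)
```

The point of the design is that the per-beat loop body is a single integer addition, which is why one accumulation per cycle is feasible at high frequency, whereas a floating-point add would need alignment and normalization inside the loop.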

10 citations

Patent
29 Sep 2010
TL;DR: In this paper, a reconfigurable transverse (horizontal) summing network structure supporting both fixed-point and floating-point operation is proposed, comprising a floating-point exponent operation part, a floating-point mantissa / fixed-point operation part and a floating-point normalization operation part connected in sequence.
Abstract: The invention discloses a reconfigurable transverse (horizontal) summing network structure supporting both fixed-point and floating-point operation, which comprises a floating-point exponent operation part, a floating-point mantissa / fixed-point operation part and a floating-point normalization operation part connected in sequence. The floating-point exponent operation part selects the maximum exponent, computes the exponent differences, and outputs the obtained exponent differences to the floating-point mantissa / fixed-point operation part. The floating-point mantissa / fixed-point operation part performs shift alignment, data compression and summation of the floating-point mantissas and complement conversion of the floating-point result, while in parallel, through a bypass, finishing the leading-zero prediction and judgment required for the floating-point normalization operation, and outputs the processing result to the floating-point normalization operation part. The floating-point normalization operation part performs the normalization shift of the floating-point mantissa and the exponent adjustment. The reconfigurable transverse summing network structure reduces the critical-path delay of multi-input floating-point addition, reduces the operation resources consumed by fixed-point summation, and reduces power consumption.
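The multi-input summation path can be sketched arithmetically: pick the maximum exponent, shift every mantissa right by its exponent difference, sum once, then normalize. This sketch assumes exact (mantissa, exponent) pairs with value m * 2**e, ignores rounding of bits lost in the alignment shift, and uses a trivial trailing-zero normalization; all names are invented.

```python
def horizontal_sum(operands):
    """Sum values given as (mantissa, exponent) pairs, value = m * 2**e.
    Align every mantissa to the maximum exponent, sum once, then
    normalize by stripping trailing zero bits of the mantissa."""
    emax = max(e for _, e in operands)                 # exponent-max selection
    total = sum(m >> (emax - e) for m, e in operands)  # align + sum
    e = emax
    while total and total % 2 == 0:                    # simple normalization
        total //= 2
        e += 1
    return total, e
```

Doing one wide sum after a single alignment pass, instead of chaining two-input floating-point adders, is what shortens the critical path in the multi-input case.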

10 citations

Patent
15 Mar 2013
TL;DR: In the data caching system and method provided by the invention, the data cache is filled in advance by storing related information in the instruction cache or in a data track table, so that the waiting time caused by data cache misses and/or the access latency of the data cache is partially or totally hidden.
Abstract: The invention provides a data caching system and a data caching method. In the provided system and method, the data cache is filled in advance by storing related information in the instruction cache or in a data track table, and the data cache is controlled to output, ahead of time, data the processor may access, so that the waiting time caused by data cache misses and/or the access latency of the data cache is partially or totally hidden.
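A toy model can illustrate the prefill idea: a per-instruction table remembers the last data address and stride, and once a stride repeats, the next line is filled into the cache before the processor demands it. The stride-based policy, the class names, and the set-based cache are assumptions of this sketch; the patent's track table stores richer information.

```python
class DataCache:
    def __init__(self):
        self.lines = set()
    def prefill(self, addr):
        self.lines.add(addr)          # line brought in ahead of demand
    def hit(self, addr):
        return addr in self.lines

class TrackTable:
    """Remembers, per instruction, the last data address and stride."""
    def __init__(self):
        self.last = {}                # pc -> (last_addr, stride)

    def access(self, pc, addr, cache):
        prev = self.last.get(pc)
        if prev is None:
            self.last[pc] = (addr, 0)
            return
        stride = addr - prev[0]
        if stride != 0 and stride == prev[1]:
            cache.prefill(addr + stride)   # fill the cache in advance
        self.last[pc] = (addr, stride)
```

With the next line resident before the demand access arrives, the miss latency is overlapped with useful work, which is the "partially or totally covered" waiting time the abstract describes.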

8 citations