# Design and Implementation of a Pipelined Bit-Serial SFQ Microprocessor, $CORE1\beta$

Y. Yamanashi, M. Tanaka, A. Akimoto, H. Park, Y. Kamiya, N. Irie, N. Yoshikawa, A. Fujimaki, H. Terai, and Y. Hashimoto

Abstract—A pipelined 8-bit-serial single-flux-quantum (SFQ) microprocessor, called CORE1$\beta$, was designed and tested. The CORE1$\beta$ has two cascaded arithmetic logic units (ALUs) based on forwarding architecture, which can perform two register operations from one instruction. Pipelining is also extensively adopted to enhance the performance. A new design method, known as one-hot encoding, has been introduced into the design of the control circuit. The 4-stage-pipelined SFQ microprocessors, CORE1$\beta$8, have been implemented using the CONNECT cell library and the SRL 2.5 kA/cm$^2$ Nb process. The clock frequency is 25 GHz for the instruction fetch and 20 GHz for the bit-serial data operation. The peak performance and the power consumption of the CORE1$\beta$8 are estimated to be 1400 MOPS (million operations per second) and 3.4 mW, respectively. We have experimentally demonstrated 4-stage pipelining and all functionalities of the CORE1$\beta$8 microprocessors by on-chip high-speed tests.

*Index Terms*—Josephson logic, microprocessors, pipelining, SFQ circuits, superconducting integrated circuits.

## I. INTRODUCTION

Single-flux-quantum (SFQ) logic is considered a very promising technology for realizing future high-end information processing systems because of its high-speed and ultralow-power operation [1]. One of the most attractive applications for SFQ technology is the microprocessor, which requires very high-speed operation as the central component of an information processing system.

After previous studies of the SFQ microprocessor, such as the FLUX chip [2] and the TIPPY processor [3], we began the development of SFQ microprocessors based on the complexity-reduced (CORE) architecture [4], [5]. In the CORE architecture, bit-serial processing is employed to reduce the complexity of the hardware. In 2003, we first demonstrated the complete operation of a prototype of the SFQ microprocessor, called CORE1$\alpha$ [6]. This first prototype, which is a very simple 8-bit-serial microprocessor consisting of 4999 Josephson junctions, was operated at a clock frequency of 16 GHz. Its performance was estimated to be 167 MIPS (million instructions per second). We have also demonstrated the correct operation of an improved version, called CORE1$\alpha$6, which utilizes passive transmission lines (PTLs) for the connection of circuit blocks [7]. Utilization of PTL wiring has enhanced the performance and the design flexibility. The CORE1$\alpha$6 microprocessor was operated at 18 GHz, and the experimentally demonstrated maximum performance was 240 MIPS [8]. We also demonstrated the operation of a further improved version, called CORE1$\alpha$10, at 21 GHz, which was integrated with a 4-byte SFQ memory [9].

Manuscript received August 29, 2006. This work was supported by the New Energy and Industrial Technology Development Organization (NEDO) through ISTEC as a Collaborative Research and Superconductors Network Device Project.

Y. Yamanashi, A. Akimoto, H. Park, and N. Yoshikawa are with the Department of Electrical and Computer Engineering, Yokohama National University, Yokohama 240-8501, Japan (e-mail: yamanasi@yoshilab.dnj.ynu.ac.jp).

M. Tanaka, Y. Kamiya, N. Irie, and A. Fujimaki are with the Department of Quantum Engineering, Nagoya University, Nagoya 464-8603, Japan (e-mail: fujimaki@nuee.nagoya-u.ac.jp).

H. Terai is with National Institute of Information and Communication Technology, Kobe 651-2492, Japan (e-mail: terai@nict.go.jp).

Y. Hashimoto is with the Superconductivity Research Laboratory, International Superconductivity Technology Center, Tsukuba 305-8501, Japan (e-mail: hasimoto@istec.or.jp).

Digital Object Identifier 10.1109/TASC.2007.898606

As a next step, we have been developing a new microprocessor, called CORE1 $\beta$ , which has a peak performance equivalent to that of semiconductor microprocessors, by improvement of the microprocessor architecture. CORE1 $\beta$  has been developed by introducing various techniques to enhance the performance. In this paper, we will describe the design of the CORE1 $\beta$  microprocessor in detail and provide experimental results of on-chip high-speed tests.

## II. CORE1$\beta$ MICROPROCESSOR

The main improvements in the  $CORE1\beta$  microprocessor are the introduction of pipelining and the implementation of two cascaded ALUs [10]. The microarchitecture and instruction set of the microprocessor are arranged by taking these improvements into account.

Fig. 1 shows the microarchitecture of the CORE1 $\beta$  microprocessor. The main circuit components of the microprocessor are an 8-bit program counter (PC), an instruction memory (IM), a 16-bit instruction register (IR), a 4 × 8-bit register file, a data memory (DM), two cascaded ALUs (ALUa, ALUb), a decoder for the ALUs, a forwarding buffer (FB), and a controller. Besides these components, several buffers (SRB1, SRB2, DRB) are added for the pipelining. The processor has two cascaded bit-serial ALUs based on forwarding architecture, which enable the execution of two operations from one instruction [11].

The CORE1$\beta$ has eight instructions: data transfers (LD, ST), register operation (R-type), unconditional and conditional branches (J, BEQZ, BNEZ), halt (HLT), and no operation (NOP), as listed in Table I. These instructions are specified by a 4-bit primary operation code (opcode). In the R-type operation, seven arithmetic/logic operations, which are specified by 3-bit ALU opcodes, can be performed at each ALU. The lengths of the instruction and the data are 16 bits and 8 bits, respectively. Two source registers (Rs1, Rs2) and the destination register (Rd) are specified by 2-bit fields.
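As an illustration of the primary opcode assignment, the following minimal sketch maps the 4-bit opcodes of Table I to their mnemonics. The position of the opcode within the 16-bit word (taken here as the four most significant bits) and the function name `decode_primary_opcode` are assumptions made for the example; the field layout is not specified in the text.

```python
# Primary opcodes copied from Table I (4-bit binary strings).
PRIMARY_OPCODES = {
    "1000": "LD",
    "0110": "ST",
    "1100": "R-type",
    "0010": "J",
    "0101": "BEQZ",
    "0100": "BNEZ",
    "0001": "HLT",
    "0000": "NOP",
}


def decode_primary_opcode(instruction: int) -> str:
    """Return the mnemonic of a 16-bit instruction word.

    ASSUMPTION: the primary opcode occupies bits 15..12. The paper does
    not give the bit positions, so this layout is hypothetical.
    """
    if not 0 <= instruction < (1 << 16):
        raise ValueError("instruction must fit in 16 bits")
    opcode = format(instruction >> 12, "04b")
    if opcode not in PRIMARY_OPCODES:
        raise ValueError(f"undefined primary opcode {opcode}")
    return PRIMARY_OPCODES[opcode]


if __name__ == "__main__":
    print(decode_primary_opcode(0b1100_0000_0000_0000))
```

Only eight of the sixteen possible 4-bit codes are defined in Table I, so the sketch rejects the remaining codes rather than guessing their meaning.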

Fig. 1. Microarchitecture of the CORE1$\beta$ microprocessor. The processor is composed of a program counter (PC), an instruction memory (IM), an instruction register (IR), a register file, two ALUs (ALUa, ALUb), two source register buffers (SRB1, SRB2), a destination register buffer (DRB), a forwarding buffer (FB), and a controller. All instructions are divided into seven phases, as shown at the top of the figure.

TABLE I INSTRUCTION SET FOR CORE1$\beta$

| Instruction | Definition                           | Opcode |
|-------------|--------------------------------------|--------|
| LD          | $Rd \leftarrow DM$                   | 1000   |
| ST          | $DM \leftarrow Rs1$                  | 0110   |
| R-type      | Register operation                   | 1100   |
| J           | $PC \leftarrow address$              | 0010   |
| BEQZ        | if (Rs == 0) $PC \leftarrow address$ | 0101   |
| BNEZ        | if (Rs != 0) $PC \leftarrow address$ | 0100   |
| HLT         | Stop                                 | 0001   |
| NOP         | No operation                         | 0000   |

In order to introduce pipelining, the execution of every instruction is divided into seven phases, as shown in Fig. 1. The operations of the microprocessor during the seven phases are as follows.

### Phase 0: Instruction Fetch 0 (IF0)

The instruction is read from the IM using the address in the PC. Then, for the next instruction, the PC address is incremented by 2, which corresponds to the length of an instruction. The internal states of the ALUs and the ALU decoder are reset.

### Phase 1: Instruction Fetch 1 (IF1)

The serial 16-bit instruction is transferred from the IM to the IR.

### Phase 2: Instruction Decode 0 (ID0)

The 4-bit opcode is read out from the IR and the instruction is decoded in the controller. The Rs1, Rs2, and Rd are set using the 2-bit fields transferred from the IR. The opcodes for the ALUs and the 8-bit address for conditional/unconditional branch operations are read out and latched for use in a later phase.

### Phase 3: Instruction Decode 1 (ID1)

For R-type operation, the data in the Rs1 and Rs2 are transferred to the SRB1 and SRB2. For the ST operation, the data in the Rs1 is written into the DM. For the HLT operation, the controller outputs the stop signal.

### Phase 4: Execute 0 (EX0)

During R-type operation, the opcodes for the ALUs transferred in Phase 2 are decoded in the ALU decoder. After the functionality of each ALU is set, the data in the SRB1 and SRB2 are transferred to the ALUs, and the arithmetic/logical operation is executed in ALUa using the data in the FB and SRB1. The ALUa also performs a zero-check function, which determines whether the data in the Rs1 is zero or not.

### Phase 5: Execute 1 (EX1)

During R-type operation, the arithmetic/logical operation is executed in the ALUb. The result of the calculation is input to the DRB and the FB. The result of the zero-check is sent to the controller.

### Phase 6: Write Back (WB)

For R-type operation, the data in the DRB is written into the Rd. For LD operation, the data in the DM is read out and written into the Rd. For the J instruction, the address in the PC is overwritten. For the BEQZ and BNEZ operations, the address in the PC is overwritten if the condition on the zero-check result is satisfied.
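The EX0 and EX1 phases can be pictured with a small behavioral model of two cascaded bit-serial ALUs. This is only a sketch under several assumptions that are not stated in the text: data are processed least-significant bit first, ALUb combines the ALUa output with the data from SRB2, and the operation names `AND` and `XOR` are illustrative stand-ins (only ADD is named in this paper). The class and function names are hypothetical.

```python
from typing import Iterable, List


def to_bits_lsb_first(value: int, width: int = 8) -> List[int]:
    """Serialize an integer into `width` bits, least-significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits_lsb_first(bits: Iterable[int]) -> int:
    """Rebuild an integer from a least-significant-bit-first stream."""
    return sum(bit << i for i, bit in enumerate(bits))


class BitSerialALU:
    """Processes one bit pair per clock and keeps a carry as internal state."""

    def __init__(self, op: str) -> None:
        self.op = op
        self.carry = 0  # internal state; reset at Phase 0 in the text

    def step(self, a: int, b: int) -> int:
        if self.op == "ADD":
            total = a + b + self.carry
            self.carry = total >> 1
            return total & 1
        if self.op == "AND":  # illustrative operation, not from the paper
            return a & b
        if self.op == "XOR":  # illustrative operation, not from the paper
            return a ^ b
        raise ValueError(f"unsupported operation {self.op}")


def execute_r_type(fb: int, rs1: int, rs2: int, op_a: str, op_b: str,
                   width: int = 8) -> int:
    """Model (fb op_a rs1) op_b rs2 with two cascaded bit-serial ALUs.

    ASSUMPTION: ALUb takes the ALUa output and the SRB2 data as operands.
    """
    alu_a, alu_b = BitSerialALU(op_a), BitSerialALU(op_b)
    result = []
    streams = zip(to_bits_lsb_first(fb, width),
                  to_bits_lsb_first(rs1, width),
                  to_bits_lsb_first(rs2, width))
    for fb_bit, rs1_bit, rs2_bit in streams:
        intermediate = alu_a.step(fb_bit, rs1_bit)        # ALUa (EX0)
        result.append(alu_b.step(intermediate, rs2_bit))  # ALUb (EX1)
    return from_bits_lsb_first(result)


if __name__ == "__main__":
    print(execute_r_type(3, 5, 7, "ADD", "ADD"))
```

In this model, each result bit of ALUa is passed to ALUb as soon as it is produced, which is one way to read the statement that two operations are executed from one instruction.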

We have introduced four-stage pipelining to enhance the peak performance of the microprocessor. An instruction is therefore issued every two system cycles. Fig. 2 shows the pipelining of the CORE1$\beta$ microprocessor. Four instructions are overlapped at odd system cycles, as shown in the figure. No circuit component is accessed by multiple instructions at each system cycle except the register, which can be written and read out simultaneously [10].
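The issue pattern described above can be reproduced with a few lines of arithmetic. The sketch below assumes the numbering of Fig. 2, in which instruction 1 enters IF0 at system cycle 1 and each later instruction starts two cycles after its predecessor; the function names are hypothetical.

```python
from typing import Dict, Optional

PHASES = ["IF0", "IF1", "ID0", "ID1", "EX0", "EX1", "WB"]
ISSUE_INTERVAL = 2  # one instruction is issued every two system cycles


def phase_of(instruction: int, system_cycle: int) -> Optional[str]:
    """Phase of a 1-based instruction at a 1-based system cycle, or None."""
    start_cycle = 1 + ISSUE_INTERVAL * (instruction - 1)
    index = system_cycle - start_cycle
    if 0 <= index < len(PHASES):
        return PHASES[index]
    return None


def active_instructions(system_cycle: int, n_instructions: int) -> Dict[int, str]:
    """Map each in-flight instruction to its phase at the given cycle."""
    active = {}
    for instruction in range(1, n_instructions + 1):
        phase = phase_of(instruction, system_cycle)
        if phase is not None:
            active[instruction] = phase
    return active


if __name__ == "__main__":
    for cycle in range(1, 16):
        print(cycle, active_instructions(cycle, 5))
```

With seven phases and an issue interval of two cycles, the model gives four overlapping instructions at odd cycles and three at even cycles once the pipeline is full.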

Although pipelining is very effective for enhancing performance, it greatly complicates the controller, which handles all the circuit components of the microprocessor by providing appropriate control signals. With the conventional design method used previously [12], a large number of pipeline registers are required to hold the control information for each phase. To overcome this problem, we use a new design method that achieves complex pipeline control by introducing one-hot encoding into the design of the controller.

The internal state of the microprocessor is generally represented by a state transition diagram. With one-hot encoding, the state transition table of the microprocessor is directly implemented by the SFQ circuits, where each state is replaced with a 1-bit delay flip-flop (DFF), and the current status of the microprocessor is represented by the existence of the SFQ pulse in the DFF [13]. The advantage of one-hot encoding is a fast decoding time. In addition, pipeline control is easily carried out, because the multiple states in pipelining are simply represented by the existence of multiple SFQ pulses. One-hot encoding is well suited to the design of SFQ control circuits because a DFF is inexpensive to implement in SFQ logic.

| Instruction | SC1 | SC2 | SC3 | SC4 | SC5 | SC6 | SC7 | SC8 | SC9 | SC10 | SC11 | SC12 | SC13 | SC14 | SC15 |
|-------------|-----|-----|-----|-----|-----|-----|-----|-----|-----|------|------|------|------|------|------|
| Inst1 | IF0 (PC) | IF1 (IR) | ID0 (IR, Reg) | ID1 (Reg, DM) | EX0 (SRB1, SRB2, ALUa, FB) | EX1 (ALUa, ALUb) | WB (DRB, PC, Reg) | | | | | | | | |
| Inst2 | | | IF0 (PC) | IF1 (IR) | ID0 (IR, Reg) | ID1 (Reg, DM) | EX0 (SRB1, SRB2, ALUa, FB) | EX1 (ALUa, ALUb) | WB (DRB, PC, Reg) | | | | | | |
| Inst3 | | | | | IF0 (PC) | IF1 (IR) | ID0 (IR, Reg) | ID1 (Reg, DM) | EX0 (SRB1, SRB2, ALUa, FB) | EX1 (ALUa, ALUb) | WB (DRB, PC, Reg) | | | | |
| Inst4 | | | | | | | IF0 (PC) | IF1 (IR) | ID0 (IR, Reg) | ID1 (Reg, DM) | EX0 (SRB1, SRB2, ALUa, FB) | EX1 (ALUa, ALUb) | WB (DRB, PC, Reg) | | |
| Inst5 | | | | | | | | | IF0 (PC) | IF1 (IR) | ID0 (IR, Reg) | ID1 (Reg, DM) | EX0 (SRB1, SRB2, ALUa, FB) | EX1 (ALUa, ALUb) | WB (DRB, PC, Reg) |

Fig. 2. Pipelining of the CORE1$\beta$ microprocessor. Execution of only five instructions is illustrated. Instructions are issued every two system cycles (SC1 to SC15). Squares separated by system cycles correspond to the phases of each instruction. The characters in brackets represent circuit components accessed at each system cycle.
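A minimal behavioral sketch of one-hot pipeline control is given here. It assumes, for simplicity, a linear chain of seven states (one per phase) in which a token advances by one DFF per system cycle and a new token is injected every second cycle; the actual state transition diagram of the controller presumably depends on the instruction type and is described in [13]. All names in the code are hypothetical.

```python
from typing import List

STATES = ("IF0", "IF1", "ID0", "ID1", "EX0", "EX1", "WB")


def step(state: List[int], issue: bool) -> List[int]:
    """Shift every token to the next DFF; optionally inject a token at IF0.

    ASSUMPTION: states form a simple linear chain. A token leaving WB is
    dropped, modeling the completion of an instruction.
    """
    return [1 if issue else 0] + state[:-1]


def run(cycles: int) -> List[List[int]]:
    """Return the one-hot state vector after each system cycle (1-based)."""
    state = [0] * len(STATES)
    history = []
    for cycle in range(1, cycles + 1):
        state = step(state, issue=(cycle % 2 == 1))  # issue on odd cycles
        history.append(state)
    return history


def active_states(state: List[int]) -> List[str]:
    """Decoding is trivial: a set bit directly names an active phase."""
    return [name for name, bit in zip(STATES, state) if bit]


if __name__ == "__main__":
    for cycle, state in enumerate(run(8), start=1):
        print(cycle, state, active_states(state))
```

Several bits are set at the same time when instructions overlap, which corresponds to the multiple SFQ pulses mentioned above, and no separate pipeline registers are needed to track which instruction is in which phase.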

## III. TEST RESULTS

The four-stage pipelined SFQ microprocessor, called CORE1 $\beta$ 8, was designed and implemented using the CONNECT cell library [14] and the SRL Nb standard process [15]. The PC and the IR are almost the same as those of the previous microprocessor, CORE1 $\alpha$  [12]. The IM and the DM are substituted by 16-bit and 8-bit shift registers in the new design. The clock frequencies for CORE1 $\beta$ 8 are 25 GHz for the instruction fetch, and 20 GHz for the bit-serial data operation. As a result of the timing adjustment between all circuit components, the system cycle frequency could be enhanced up to 1.4 GHz. The peak performance corresponds to 1400 MOPS (million operations per second), because instructions are issued every two system cycles, and two operations are executed for one instruction in the cascaded ALUs.
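The peak figure follows directly from the numbers stated above. Writing $f_{sys}$ for the system cycle frequency (a symbol introduced here only for this check), one instruction per two system cycles and two operations per instruction give

$$
\text{peak performance} = \frac{f_{sys}}{2\ \text{cycles/instruction}} \times 2\ \text{operations/instruction} = \frac{1.4\ \text{GHz}}{2} \times 2 = 1.4 \times 10^{9}\ \text{operations/s} = 1400\ \text{MOPS}.
$$

The corresponding instruction issue rate is 700 million instructions per second; the 1400 MOPS figure is reached only when both cascaded ALUs perform useful operations in every instruction.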

Fig. 3 shows a microphotograph of the CORE1$\beta$8 microprocessor. All cells used in the CORE1$\beta$8 have a superconducting shielding (SUSHI) structure, which can completely remove the influence of magnetic fields induced by the bias feeding lines in each cell [16]. In addition, to reduce the effect of external magnetic fields generated by bonding wires and off-chip bias feeding lines [17], we have fabricated the CORE1$\beta$8 microprocessor on a large die with an area of 8 × 8 mm, whereas a typical die size is 5 × 5 mm. The large die size enables wider spacing between the circuit and the off-chip bias feeding lines, which reduces the influence of external magnetic fields on circuit operation. The CORE1$\beta$8 microprocessor is made up of 10995 Josephson junctions. The effective area of the circuit, excluding the clock generators for the high-speed test, is 4.7 × 4.6 mm. The processor is composed of five main circuit blocks: CTRL, PC, REG, ALU, and DEC, as shown in Fig. 3. The power consumption of the CORE1$\beta$8 is estimated to be 3.4 mW. Bias currents are supplied individually to each circuit block, so the dc bias margin of each circuit block can be measured. The total bias current is 1373 mA, and the bias current for each circuit block is designed so as not to exceed approximately 300 mA.

Fig. 3. Microphotograph of the CORE1$\beta$8 microprocessor. The CORE1$\beta$8 is made up of 10995 Josephson junctions, and has five main circuit blocks. The circuit blocks are connected by PTL wiring. The microprocessor is fabricated on an area of 4.7 × 4.6 mm. The die size used was 8 × 8 mm.
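As a consistency check on the two figures quoted above, assume that the estimated power is simply the product of the total bias current $I_b$ and a single dc bias voltage $V_b$ (both symbols and this simple model are assumptions made here, not statements from the text). The implied bias voltage is then

$$
V_b \approx \frac{P}{I_b} = \frac{3.4\ \text{mW}}{1373\ \text{mA}} \approx 2.5\ \text{mV}.
$$

This value is not given in the paper, so it should be read only as the bias voltage that would make the quoted power and current agree under the stated model.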

We have examined the operation of the CORE1$\beta$8 microprocessor using on-chip high-speed tests, and its main operations, including multiple add operations, have been confirmed. Fig. 4 shows the measured dc bias margins of each circuit block when the multiple add operations, i.e., LD-LD-ADD-ADD-ST, are performed at high speed. It can be seen that each circuit block operates successfully at high speed with sufficient dc bias margins. However, we could not confirm the conditional branch operations in this chip (Chip #3F4) due to a malfunction of the zero-check signal from the ALU.

Another chip (Chip #2E6) was then measured, and the conditional branch operations (BEQZ, BNEZ) were confirmed to work correctly. Fig. 5 shows the dc bias margins of each circuit block when conditional/unconditional branch operations are performed at high speed. The margin of the ALU block is relatively large, because only the zero-check function of the ALU was tested in this measurement. Unfortunately, the correct functionality of R-type operations could not be observed in this chip. We believe that the malfunctions of the microprocessor are caused by the low circuit yield, which results from causes such as circuit defects and flux trapping.

Fig. 4. DC bias margins for each circuit block of the CORE1$\beta$8 when the multiple add operations (LD-LD-ADD-ADD-ST) are performed at high speed (Chip #3F4).

Fig. 5. DC bias margins for each circuit block of the CORE1$\beta$ when the three branch operations (J, BEQZ, BNEZ) are performed (Chip #2E6). The bias margin of the DEC block could not be measured because of a malfunction of the R-type operation.

## IV. CONCLUSION

We have designed and tested an 8-bit-serial four-stage-pipelined SFQ microprocessor, CORE1$\beta$8. It has two cascaded ALUs based on forwarding architecture for enhancement of the performance. A new design method using one-hot encoding was adopted for the design of the control circuit, which enabled the efficient implementation of complex pipelining. The microprocessor has been fabricated on a large die to reduce the influence of external magnetic fields. The peak performance and power consumption are estimated to be 1400 MOPS and 3.4 mW, respectively. The functionalities of all instructions for the CORE1$\beta$8 have been demonstrated using on-chip high-speed tests.

## ACKNOWLEDGMENT

The authors thank all the members of CONNECT, which consists of Nagoya University, SRL-ISTEC, NICT, and Yokohama National University.

## REFERENCES

- [1] K. K. Likharev and V. K. Semenov, "RSFQ logic/memory family: A new Josephson-junction technology for sub-terahertz-clock-frequency digital systems," *IEEE Trans. Appl. Supercond.*, vol. 1, pp. 3–28, Mar. 1991.
- [2] P. Bunyk, M. Leung, J. Spargo, and M. Dorojevets, "FLUX-1 RSFQ microprocessor: Physical design and test results," *IEEE Trans. Appl. Supercond.*, vol. 13, pp. 433–436, Jun. 2003.
- [3] N. Yoshikawa, F. Matsuzaki, N. Nakajima, K. Fujiwara, K. Yoda, and K. Kawasaki, "Design and component test of a tiny processor based on the SFQ technology," *IEEE Trans. Appl. Supercond.*, vol. 13, pp. 441–445, Jun. 2003.
- [4] A. Fujimaki, Y. Takai, and N. Yoshikawa, "High-end server based on complexity-reduced architecture for superconductor technology," *IEICE Trans. Electron.*, vol. 85, pp. 612–616, Mar. 2002.
- [5] M. Tanaka, F. Matsuzaki, T. Kondo, N. Nakajima, Y. Yamanashi, H. Terai, S. Yorozu, N. Yoshikawa, A. Fujimaki, and H. Hayakawa, "Prototypic design of the single-flux quantum microprocessor, CORE1," *Supercond. Sci. Technol.*, vol. 16, pp. 1460–1463, Nov. 2003.
- [6] M. Tanaka, F. Matsuzaki, T. Kondo, N. Nakajima, Y. Yamanashi, A. Fujimaki, H. Hayakawa, N. Yoshikawa, H. Terai, and S. Yorozu, "A single-flux-quantum logic prototype microprocessor," in *Tech. Dig. IEEE Int. Solid-State Circuit Conf.*, San Francisco, CA, Feb. 2004.
- [7] Y. Hashimoto, S. Yorozu, Y. Kameda, and V. K. Semenov, "A design approach to passive interconnects for single flux quantum logic cells," *IEEE Trans. Appl. Supercond.*, vol. 13, pp. 535–538, Jun. 2003.
- [8] M. Tanaka, T. Kondo, N. Nakajima, T. Kawamoto, Y. Yamanashi, Y. Kamiya, A. Akimoto, A. Fujimaki, H. Hayakawa, N. Yoshikawa, H. Terai, Y. Hashimoto, and S. Yorozu, "Demonstration of a single-flux-quantum microprocessor using passive transmission lines," *IEEE Trans. Appl. Supercond.*, vol. 15, pp. 400–404, Jun. 2005.
- [9] K. Fujiwara, Y. Yamashiro, N. Yoshikawa, A. Fujimaki, H. Terai, and S. Yorozu, "Design and high-speed test of (4 × 8)-bit single-flux-quantum shift register files," *Supercond. Sci. Technol.*, vol. 16, pp. 1456–1459, Nov. 2003.
- [10] M. Tanaka, T. Kawamoto, Y. Yamanashi, Y. Kamiya, A. Akimoto, K. Fujiwara, A. Fujimaki, N. Yoshikawa, H. Terai, and S. Yorozu, "Design of a pipelined 8-bit-serial single-flux-quantum microprocessor with multiple ALUs," *Supercond. Sci. Technol.*, vol. 19, pp. S344–S349, Mar. 2006.
- [11] M. Tanaka, T. Kondo, T. Kawamoto, Y. Kamiya, K. Fujiwara, Y. Yamanashi, A. Akimoto, A. Fujimaki, N. Yoshikawa, H. Terai, and S. Yorozu, "Design of a data path for single-flux-quantum microprocessors with multiple ALUs," *Physica C*, vol. 426–431, pp. 1693–1698, Nov. 2005.
- [12] N. Nakajima, F. Matsuzaki, Y. Yamanashi, N. Yoshikawa, M. Tanaka, T. Kondo, A. Fujimaki, H. Terai, and S. Yorozu, "Design and implementation of circuit components of the SFQ microprocessor, CORE1," *Supercond. Sci. Technol.*, vol. 17, pp. 301–307, Jan. 2004.
- [13] Y. Yamanashi, A. Akimoto, N. Yoshikawa, M. Tanaka, T. Kawamoto, Y. Kamiya, A. Fujimaki, H. Terai, and S. Yorozu, "A new design approach for control circuits of a pipelined single-flux-quantum microprocessor," *Supercond. Sci. Technol.*, vol. 19, pp. S340–S343, Mar. 2006.
- [14] S. Yorozu, Y. Kameda, H. Terai, A. Fujimaki, T. Yamada, and S. Tahara, "A single flux quantum standard logic cell library," *Physica C*, vol. 378–381, pp. 1471–1474, Sep. 2002.
- [15] S. Nagasawa, Y. Hashimoto, H. Numata, and S. Tahara, "A 380 ps, 9.5 mW Josephson 4-kbit RAM operated at a high bit yield," *IEEE Trans. Appl. Supercond.*, vol. 5, pp. 2447–2452, Jan. 1995.
- [16] N. Yoshikawa, T. Nishigai, H. Kojima, K. Fujiwara, A. Fujimaki, T. Yamada, M. Tanaka, S. Yorozu, M. Hidaka, and H. Terai, "Magnetic shielding against DC bias current toward large-scale SFQ integrated circuits," in *Appl. Supercond. Conf.*, Jacksonville, FL, Oct. 2004.
- [17] H. Terai, S. Yorozu, A. Fujimaki, N. Yoshikawa, and Z. Wang, "Signal integrity in large-scale single-flux-quantum circuit," in *18th International Symposium on Superconductivity*, Tsukuba, Japan, Oct. 2005.