# On the High-Speed VLSI Implementation of Errors-and-Erasures Correcting Reed-Solomon Decoders 

Tong Zhang and Keshab K. Parhi<br>Department of Electrical and Computer Engineering, University of Minnesota

\{tzhang,parhi\}@ece.umn.edu


#### Abstract

Recently a novel algorithm transformation was proposed to reduce the critical path of Berlekamp-Massey algorithm implementation for errors-alone Reed-Solomon decoding. In this paper, we apply the same methodology to transform the Berlekamp-Massey algorithm for errors-and-erasures RS decoding. We present a regular hardware architecture to implement the reformulated Berlekamp-Massey algorithm, which can achieve high throughput. Moreover, an operation scheduling scheme is proposed to further reduce the hardware complexity without loss of throughput.


## Categories and Subject Descriptors

B.2.4 [ARITHMETIC AND LOGIC STRUCTURES]: High-Speed Arithmetic

## General Terms

Design

## Keywords

Reed-Solomon codes, Berlekamp-Massey algorithm, erasure, VLSI architectures

## 1. INTRODUCTION

Reed-Solomon (RS) codes are widely used for forward error correcting (FEC) in numerous communication systems because of their good error correction capability for burst errors. A conventional errors-alone RS decoding procedure can be modified to correct both errors and erasures [1][4] (an erasure occurs when the position of a corrupted symbol is known). RS decoder with the capability of correcting errors as well as erasures will improve performance in various systems [9][3].

To achieve higher decoding speed, except applying the more advanced physical-level, e.g., sub-micron, technology to improve the throughput, the algorithm/architecture-level

[^0]transformations and modifications are also playing important roles. In RS decoders, the throughput bottleneck is in solving the Berlekamp's key equation, where two algorithms are typically employed [1]: extended Euclidean (eE) algorithm and Berlekamp-Massey (BM) algorithm. Recently a novel algorithm transformation was proposed in [7] to reformulate the BM algorithm for errors-alone RS decoders so that the key-equation-solving block has a very regular semisystolic architecture with the critical path of only $T_{\text {mult }}+$ $T_{\text {add }}$, where $T_{\text {mult }}$ and $T_{\text {add }}$ are the delays of the finite field multiplier and adder, respectively. This amount of critical path is much smaller than that of the previous RS decoder implementations using BM algorithm and comparable to the best known result of implementations based on eE algorithm. The basic methodology behind such algorithm transformation is to remove the data dependency inside of one iteration at the cost of increased computation complexity so that the critical path can be reduced. In this paper, we show that this methodology can be readily applied to the errors-and-erasures RS decoder implementations to achieve regular semi-systolic architecture and critical path of $T_{\text {mult }}+T_{\text {add }}$. Moreover, exploiting the characteristics of errors-and-erasures RS decoding, we propose an operation scheduling scheme for the developed decoder architecture to reduce the hardware complexity without loss of throughput.

This paper is organized as follows. We introduce the fundamentals of RS decoding in Section 2. An inverse-free BM algorithm for errors-and-erasures RS decoding is described in Section 3. In Section 4, we apply the methodology proposed in [7] to transform the BM algorithm for errors-anderasures RS decoding, based on which we present a semisystolic high-speed hardware architecture. We note that in order to facilitate the reader's reference to [7], we use the same notations as much as possible in the algorithm and hardware architecture descriptions.

## 2. RS CODES DECODING

Let's consider an $(n, k) \mathrm{RS}$ code over $G F\left(2^{m}\right)$, where each code symbol is a $m$-bit data, $n=2^{m}-1$ is the codeword block length and $k$ is the information length. Since RS codes are maximum distance separable (MDS) codes [6], the ( $n, k$ ) RS code has a minimum distance $d=n-k+1$. When such RS code is used on a channel that makes both errors and erasures, any pattern of $v$ errors and $\rho$ erasures can be corrected provided that $d \geq 2 v+\rho+1$. Let $C(z)$ denote the transmitted codeword polynomial and $R(z)$ denote the received word polynomial in which erased symbols are represented as
blanks (or zeroes), we have

$$
R(z)=C(z)+E(z)+F(z)
$$

where $E(z)$ and $F(z)$ represent the error polynomial and erasure polynomial. Define the errata polynomial as

$$
\begin{aligned}
\tilde{E}(z) & =E(z)+F(z) \\
& =\tilde{e}_{i_{1}} z^{i_{1}}+\tilde{e}_{i_{2}} z^{i_{2}}+\cdots+\tilde{e}_{i_{v+\rho}} z^{i_{v+\rho}} .
\end{aligned}
$$

Let $\alpha$ be the primitive element in $G F\left(2^{m}\right)$, we say that errata magnitudes $\tilde{Y}_{l}=\tilde{e}_{i_{l}}$ for $l=1, \ldots, v+\rho$, occurred at errata locations $\tilde{X}_{l}=\alpha^{i_{l}}$ for $l=1, \ldots, v+\rho$. Suppose that the $d-1$ consecutive powers of $\alpha$ used in RS code construction are $\alpha^{m_{0}}, \alpha^{m_{0}+1}, \cdots, \alpha^{m_{0}+d-2}$. The RS decoder always begin by computing the syndromes

$$
s_{k}=R\left(\alpha^{m_{0}+k}\right)=\tilde{E}\left(\alpha^{m_{0}+k}\right), \quad 0 \leq k \leq d-2
$$

We introduce the syndrome polynomial $S(z)=\sum_{k=0}^{d-2} s_{k} z^{k}$, and define the errata locator polynomial $\Lambda(z)$ and errata evaluator polynomial $\Omega(z)$ as

$$
\begin{aligned}
\Lambda(z) & =\prod_{j=1}^{v+\rho}\left(1-\tilde{X}_{j} z\right)=1+\lambda_{1} z+\cdots+\lambda_{v+\rho} z^{v+\rho} \\
\Omega(z) & =\sum_{i=1}^{v+\rho} \tilde{Y}_{i} \tilde{X}_{i}^{m_{0}} \prod_{j=1, j \neq i}^{v+\rho}\left(1-\tilde{X}_{j} z\right) \\
& =\omega_{0}+\omega_{1} z+\cdots+\omega_{v+\rho-1} z^{v+\rho-1}
\end{aligned}
$$

We can determine the above two unknown polynomials $\Lambda(z)$ and $\Omega(z)$ by solving the key equation [1]:

$$
\begin{equation*}
\Omega(z) \equiv \Lambda(z) S(z) \bmod z^{d-1} \tag{1}
\end{equation*}
$$

Notice that the $v+\rho$ roots of $\Lambda(z)$ consist of the inverses of $v$ unknown error locations and $\rho$ known erasure locations. Once $\Lambda(z)$ and $\Omega(z)$ are found, the decoder can use Chien search to identify the $v$ unknown roots of $\Lambda(z)$ so that all the $v+\rho$ errata locations $\tilde{X}_{i}$ become known. Then all the $v+\rho$ errata magnitudes $\tilde{Y}_{i}$ can be determined by using the Forney algorithm [4]

$$
\begin{equation*}
\tilde{Y}_{i}=-\frac{\tilde{X}_{i}^{1-m_{0}} \Omega\left(\tilde{X}_{i}^{-1}\right)}{\Lambda^{\prime}\left(\tilde{X}_{i}^{-1}\right)} \text {, for } i=1,2, \cdots, v+\rho \tag{2}
\end{equation*}
$$

Using all the $\tilde{X}_{i}^{\prime} s$ and $\tilde{Y}_{i}^{\prime} s$ we can derive the errata polynomial $\tilde{E}(z)$, and finally reconstruct the correct codeword $C(z)=R(z)-\tilde{E}(z)$.

The VLSI implementation of the above decoding process typically contains four consecutive pipelined stages : 1. the syndrome computation (SC) block; 2. the key equation solver (KES) block; 3. the Chien search (CS) block; 4. the errata evaluator (EE) block. All the blocks work in parallel on consecutive codewords. Pipeline is also typically used in each block to achieve higher throughput. Compared with the other three blocks which can be pipelined in a straightforward manner, the KES block, based on BM or eE algorithm, cannot be easily pipelined due to the presence of inherent feedback loops. Thus we may say that the throughput bottleneck in RS decoders is in the KES block.

## 3. BERLEKAMP-MASSEY ALGORITHM

Recently an efficient inverse-free BM algorithm was proposed in [5] for errors-and-erasures RS decoding. We adopt
such algorithm as the starting point of this work. In the following, we briefly describe this algorithm based on [5]. Notice that such algorithm actually finds scalar multiples $\beta \cdot \Lambda(z)$ and $\beta \cdot \Omega(z)$, where $\beta$ is a field element in $G F\left(2^{m}\right)$. Since these scaled results will not affect the computation of errata locations using Chien search and the errata magnitudes using (2), we still refer to the polynomials output by the following BM algorithm as $\Lambda(z)$ and $\Omega(z)$.

Suppose errata locations $\tilde{X}_{l_{1}}, \tilde{X}_{l_{2}}, \cdots, \tilde{X}_{l_{\rho}}$ correspond to the $\rho$ erasures, we define the erasure locator polynomial as

$$
\begin{equation*}
\Psi(z)=\prod_{i=1}^{\rho}\left(1-\tilde{X}_{l_{i}} z\right)=1+\psi_{1} z+\cdots+\psi_{\rho} z^{\rho} \tag{3}
\end{equation*}
$$

Notice that, provided with $2 v+\rho+1 \leq d$, when $\rho=d-1$ or $\rho=d-2$, no errors occurred (or $v=0$ ) and the errata locator polynomial $\Lambda(z)=\Psi(z)$ is immediately available, the decoder can bypass the following BM algorithm and computes errata evaluator polynomial $\Omega(z)$ directly using (1).

Algorithm 3.1. Inverse-Free BM Algorithm for Errors-and-Erasures Decoding

1. Initialization: $r=k^{(0)}=0, \gamma^{(0)}=1, \Lambda^{(0)}(z)=$ $B^{(0)}(z)=\Psi(z)$.
2. Set $r=r+1$. If $r>d-\rho-1$, go to step 6 , else compute the discrepancy

$$
\begin{equation*}
\delta^{(r-1)}=\lambda_{0}^{(r-1)} \cdot s_{r+\rho-1}+\cdots+\lambda_{r+\rho-1}^{(r-1)} \cdot s_{0} \tag{4}
\end{equation*}
$$

3. Compute $\Lambda^{(r)}(z)=\gamma^{(r-1)} \cdot \Lambda^{(r-1)}(z)-z \cdot \delta^{(r-1)} \cdot B^{(r-1)}(z)$ so that for $i=0,1, \cdots, d-3$,

$$
\begin{equation*}
\lambda_{i}^{(r)}=\gamma^{(r-1)} \cdot \lambda_{i}^{(r-1)}-\delta^{(r-1)} \cdot b_{i-1}^{(r-1)} \tag{5}
\end{equation*}
$$

4. If $\delta^{(r-1)} \neq 0$ and $k^{(r-1)} \geq 0$, then

$$
\begin{cases}B^{(r)}(z) & =\Lambda^{(r-1)}(z) \\ \gamma^{(r)} & =\delta^{(r-1)} \\ k^{(r)} & =-k^{(r-1)}-1\end{cases}
$$

else

$$
\begin{cases}B^{(r)}(z) & =z \cdot B^{(r-1)}(z) \\ \gamma^{(r)} & =\gamma^{(r-1)} \\ k^{(r)} & =k^{(r-1)}+1\end{cases}
$$

5. Return to step 2.
6. Set $\Lambda(z)=\Lambda^{(d-\rho-1)}(z)$, and compute $\Omega(z)=(\Lambda(z)$. $S(z)) \bmod z^{d-1}$ so that for $i=0,1, \cdots, d-3$
$\omega_{i}=\lambda_{0}^{(d-\rho-1)} \cdot s_{i}+\lambda_{1}^{(d-\rho-1)} \cdot s_{i-1} \cdots+\lambda_{i}^{(d-\rho-1)} \cdot s_{0}$.
Notice that $b_{-1}^{(r)}$ occurring in the above algorithm are set to zeroes. This algorithm provides the solutions of the key equation: $\Lambda(z)$ and $\Omega(z)$, where all the $\lambda_{i}$ for $i>v+\rho$ and $\omega_{j}$ at least for $j>v+\rho-1$ are zeroes.

Notice that the loop between step 2 and 5 is completed in $d-\rho-1$ iterations, and in each iteration there is a data dependency between the computations of (4) and (5) through the discrepancy $\delta^{(r-1)}$. Thus, if we want to realize the above algorithm in such a way that one iteration is completed in one clock cycle for the minimal latency, the critical path
will be larger than $2 \cdot\left(T_{\text {mult }}+T_{\text {add }}\right)$ that is typically much longer than the critical paths of all other building blocks in RS decoder. Therefore, reducing this critical path becomes the key to achieve higher RS decoding throughput.

## 4. HIGH-SPEED RS DECODER ARCHITECTURE

As mentioned earlier, the authors of [7] proposed a novel algorithm transformation to reformulate the errors-alone correcting BM algorithm to significantly reduce the critical path, where the basic idea is to remove the data dependency inside of one iteration at the cost of increased computation complexity so that the critical path can be reduced. In this section, we apply the same methodology to reformulate the above BM algorithm for errors-and-erasures RS decoding. The developed KES block has a regular semi-systolic architecture with the critical path of only $T_{m u l t}+T_{\text {add }}$. Moreover, to reduce the overall RS decoder hardware complexity without loss of latency, we propose an operation scheduling scheme to make KES block not only solve the key equation but also compute an intermediate polynomial that is a primary input to the reformulated BM algorithm.

### 4.1 Reformulated BM Algorithm

In the following we present how to apply the essential methodology behind the algorithm transformation proposed in [7] for errors-alone RS decoding to our interested errors-and-erasures RS decoding. The readers are highly recommended to read [7] for its detailed descriptions.

First notice that the discrepancy $\delta^{(r-1)}$ in Algorithm 3.1, the pivotal element along the data dependency path, is actually $\delta_{r+\rho-1}^{(r-1)}$, the coefficient of $z^{r+\rho-1}$ in the polynomial

$$
\begin{aligned}
\Delta^{(r-1)}(z) & =\Lambda^{(r-1)}(z) \cdot S(z) \\
& =\delta_{0}^{(r-1)}+\delta_{1}^{(r-1)} z+\cdots+\delta_{r+\rho-1}^{(r-1)} z^{r+\rho-1}+\cdots
\end{aligned}
$$

If we introduce a new polynomial $\Theta^{(r)}(z)=B^{(r)}(z) \cdot S(z)$, according to Algorithm 3.1, we have

$$
\begin{align*}
\Delta^{(r)}(z) & =\Lambda^{(r)}(z) \cdot S(z) \\
& =\left(\gamma^{(r-1)} \cdot \Lambda^{(r-1)}(z)-z \cdot \delta^{(r-1)} \cdot B^{(r-1)}(z)\right) \cdot S(z) \\
& =\gamma^{(r-1)} \cdot \Delta^{(r-1)}(z)-z \cdot \delta^{(r-1)} \cdot \Theta^{(r-1)}(z), \tag{6}
\end{align*}
$$

where $\Theta^{(r)}(z)=B^{(r)} \cdot S(z)$ will be either $\Delta^{(r-1)}(z)=$ $\Lambda^{(r-1)}(z) \cdot S(z)$ or $z \cdot \Theta^{(r-1)}(z)=z \cdot B^{(r-1)} \cdot S(z)$. We can easily prove that, if $\Delta^{(0)}(z)$ is available at the beginning and we replace the computation of (4) in Algorithm 3.1 with (6), discrepancy $\delta^{(r-1)}=\delta_{r+\rho-1}^{(r-1)}$, the crucial element in each iteration, will be computed in the previous clock cycle instead of the current clock cycle (or we can say that discrepancy $\delta^{(r-1)}$ is computed in a look-ahead style). In this way, the original data dependency in each iteration (or intra-iteration data dependency) has been transformed into successive iterations (or replaced by an inter-iteration data dependency). Thus, with the disappearance of intraiteration data dependency, all the computations in one iteration can be performed in parallel and, as shown later, the corresponding KES block can achieve the critical path of only $T_{\text {mult }}+T_{\text {add }}$.

Let $z^{i} \bmod z^{d-1}=0$ for $i<0$, we introduce the polyno-
mial $\Phi(z)$ :

$$
\begin{equation*}
\Phi(z)=\left(z^{-\rho} \cdot \Psi(z) \cdot S(z)\right) \bmod z^{d-1}=\sum_{i=0}^{d-2} \phi_{i} z^{i} \tag{7}
\end{equation*}
$$

where

$$
\phi_{i}=\psi_{0} \cdot s_{i+\rho}+\psi_{1} \cdot s_{i+\rho-1}+\cdots+\psi_{\rho} \cdot s_{i} .
$$

Moreover, notice that for any $i<r+\rho, \delta_{i}^{(r)}$ and $\theta_{i}^{(r)}$ cannot affect the value of any later discrepancy $\delta_{r+\rho+j}^{(r+j)}$. Consequently, we need not store $\delta_{i}^{(r)}$ and $\theta_{i}^{(r)}$ for $i<r+\rho$. Thus we define $\hat{\delta}_{i}^{(r)}=\delta_{i+r+\rho}^{(r)}, \hat{\theta}_{i}^{(r)}=\theta_{i+r+\rho}^{(r)}$ and the polynomials

$$
\begin{equation*}
\hat{\Delta}^{(r)}(z)=\sum_{i=0}^{d-2} \hat{\delta}_{i}^{(r)} z^{i} \text { and } \hat{\Theta}^{(r)}(z)=\sum_{i=0}^{d-2} \hat{\theta}_{i}^{(r)} z^{i} \tag{8}
\end{equation*}
$$

with initial values $\hat{\Delta}^{(0)}(z)=\hat{\Theta}^{(0)}(z)=\Phi(z)$. Based on the above discussion and introduced polynomials, we can readily transform Algorithm 3.1 to the following algorithm suitable for high-speed implementations:

Algorithm 4.1. Reformulated Inverse-Free BM Algorithm for Errors-and-Erasures Decoding

1. Initialization: $r=k^{(0)}=0, \gamma^{(0)}=1, \Lambda^{(0)}(z)=$ $B^{(0)}(z)=\Psi(z)$ and $\hat{\Delta}^{(0)}(z)=\hat{\Theta}_{i}^{(0)}(z)=\Phi(z)$.
2. Set $r=r+1$. If $r>d-\rho-1$, algorithm terminates, else compute $\hat{\Delta}^{(r)}(z)=z \cdot \gamma^{(r-1)} \cdot \hat{\Delta}^{(r-1)}(z)-\hat{\delta}_{0}^{(r-1)}$. $\hat{\Theta}^{(r-1)}(z)$ so that for $i=0,1, \cdots, d-2$,

$$
\begin{equation*}
\hat{\delta}_{i}^{(r)}=\gamma^{(r-1)} \cdot \hat{\delta}_{i+1}-\hat{\delta}_{0}^{(r-1)} \cdot \hat{\theta}_{i}^{(r-1)} \tag{9}
\end{equation*}
$$

3. Compute $\Lambda^{(r)}(z)=\gamma^{(r-1)} \cdot \Lambda^{(r-1)}(z)-z \cdot \hat{\delta}_{0}^{(r-1)} \cdot B^{(r-1)}(z)$ so that for $i=0,1, \cdots, d-3$,

$$
\begin{equation*}
\lambda_{i}^{(r)}=\gamma^{(r-1)} \cdot \lambda_{i}^{(r-1)}-\hat{\delta}_{0}^{(r-1)} \cdot b_{i-1}^{(r-1)} \tag{10}
\end{equation*}
$$

4. If $\hat{\delta}_{0}^{(r-1)} \neq 0$ and $k^{(r-1)} \geq 0$, then

$$
\begin{cases}B^{(r)}(z) & =\Lambda^{(r-1)}(z) \\ \hat{\Theta}^{(r)}(z) & =z^{-1} \cdot \hat{\Delta}^{(r-1)}(z) \\ \gamma^{(r)} & =\hat{\delta}_{0}^{(r-1)} \\ k^{(r)} & =-k^{(r-1)}-1\end{cases}
$$

else

$$
\begin{cases}B^{(r)}(z) & =z \cdot B^{(r-1)}(z) \\ \hat{\Theta}^{(r)}(z) & =\hat{\Theta}^{(r-1)}(z) \\ \gamma^{(r)} & =\gamma^{(r-1)} \\ k^{(r)} & =k^{(r-1)}+1\end{cases}
$$

5. Return to step 2.

It is clear that, in the above algorithm, all the computations in one iteration, (9) and (10), can be carried out in parallel, from which we may easily develop its hardware realization architecture, as shown later, with a critical path of only $T_{\text {mult }}+T_{\text {add }}$. Notice that $\Lambda^{(d-\rho-1)}(z)$ is exactly the solution of the key equation for $\Lambda(z)$. Although $\hat{\Omega}(z)=\hat{\Delta}^{(d-\rho-1)}(z)$ is not the solution of the key equation for $\Omega(z)$, following the argument in [7], we have
$\Omega\left(\tilde{X}_{i}^{-1}\right)=-\tilde{X}_{i}^{-(d-1)} \hat{\Omega}\left(\tilde{X}_{i}^{-1}\right)$, for $i=1,2, \cdots, v+\rho$.


Figure 1: Key equation solver (KES) block architecture.

Thus, using the above reformulated algorithm, we can find all the errata locations $\tilde{X}_{i}$ by applying Chien search on $\Lambda(z)$ and derive all the errata magnitudes by rewriting (2) as

$$
\tilde{Y}_{i}=-\frac{\tilde{X}_{i}^{2-d-m_{0}} \hat{\Omega}\left(\tilde{X}_{i}^{-1}\right)}{\Lambda^{\prime}\left(\tilde{X}_{i}^{-1}\right)} \text {, for } i=1,2, \cdots, v+\rho
$$

In this way, we still say that the Algorithm 4.1 solves the RS decoding key equation and the corresponding hardware block is referred as key equation solver (KES).

### 4.2 Decoder Architecture

Following the above discussion, we know that such errors-and-erasures RS decoder principally contains six blocks: 1 . the syndrome computation (SC) block; 2. the erasure locator (EL) block to compute erasure locator polynomial $\Psi(z)$; 3 . the $\Phi$-block to compute polynomial $\Phi(z) ; 4$. the key equation solver (KES) block; 5. the Chien search (CS) block; 6. errata evaluator (EE) block. Since all other blocks have been well-investigated in conventional errors-alone RS decoder implementations, in what follows, we only consider the architecture design for the EL, $\Phi$-block and KES blocks. For the implementations of other blocks, readers are referred to $[2][8]$, etc.

The architecture design for EL block can be performed in a quite straightforward manner. Let $\alpha^{t_{1}}, \alpha^{t_{2}}, \cdots, \alpha^{t_{\rho}}$ be the $\rho$ erasure locations where $t_{1}<t_{2}<\cdots<t_{\rho}$. Suppose, before the erasure at $\alpha^{t_{s}}$ is detected, we have obtained the intermediate erasure locator polynomial

$$
\begin{aligned}
\Psi^{(s-1)}(z) & =\prod_{i=1}^{s-1}\left(1-\alpha^{t_{i}} z\right) \\
& =1+\psi_{1}^{(s-1)} z+\cdots+\psi_{s-1}^{(s-1)} z^{s-1}
\end{aligned}
$$

Then, after detecting the erasure at $\alpha^{t_{s}}$, we update the intermediate erasure locator polynomial as

$$
\begin{aligned}
\Psi^{(s)}(z) & =\Psi^{(s-1)}(z) \cdot\left(1-\alpha^{t_{s}} z\right) \\
& =1+\psi_{1}^{(s)} z+\cdots+\psi_{s}^{(s)} z^{s}
\end{aligned}
$$

where $\psi_{i}^{(s)}=\psi_{i}^{(s-1)}-\alpha^{t_{s}} \cdot \psi_{i-1}^{(s-1)}$. Based on such iterative update rule, we can easily develop one possible hardware architecture as shown in Fig. 2 for the EL block if the received codeword is fed to the decoder one code symbol per clock cycle, which is a typical configuration in conventional RS decoder design. Notice that the critical path here is $T_{\text {mult }}+T_{\text {add }}$.


Figure 2: . Erasure locator (EL) block architecture.
Next let's consider the KES architecture. Directly exploring the above reformulated BM algorithm and referring to the semi-systolic KES architecture presented in [7], we may easily obtain the semi-systolic KES architecture, as shown in Fig. 1, for the implementations of errors-and-erasures RS decoder. This KES architecture mainly consists of two subblocks: discrepancy computation (DC) block and error locator update (ELU) block, which perform the computations of (9) and (10) in each iteration, respectively. Compared with the KES for errors-alone RS decoder [7], this architecture contains $\left\lfloor\frac{d-1}{2}\right\rfloor-1$ more $P E 0$ blocks due to the higher possible degree of errata locator polynomial $\Lambda(z)$.

Notice that, as shown in Fig. 1, $\kappa_{i}^{(U)}$ and $\kappa_{i}^{(L)}$ in each $P E 0_{i}$ and $\mu_{i}^{(U)}$ and $\mu_{i}^{(L)}$ in each $P E 1_{i}$ represent the initialization value of the corresponding upper and lower DFF (D flip-flop) in each $P E 0_{i}$ and $P E 1_{i}$ blocks. According to Algorithm 4.1, if we initialize all the DFF's with

$$
\left\{\begin{aligned}
\kappa_{i}^{(U)}=\kappa_{i}^{(L)}=\psi_{i}, & \text { for } i=0,1, \cdots, d-3, \\
\mu_{i}^{(U)}=\mu_{i}^{(L)}=\phi_{i}, & \text { for } i=0,1, \cdots, d-2,
\end{aligned}\right.
$$

such KES block will produce $\Lambda(z)$ and $\hat{\Omega}(z)$ in $d-\rho-1$ clock cycles.

Finally let's consider the implementation of $\Phi$-block for the computation of polynomial $\Phi(z)$. From (7) we know that $\Phi(z)$ is obtained through a polynomial multiplication followed by modular $z^{d-1}$. It's well-known that the polynomial multiplication can be realized using a FIR filter with the coefficients initialized by one of the two multiplying polynomials [1], thus we can implement the $\Phi$-block using a $d-1$ order FIR as shown in Fig. 3: each coefficient is set to $s_{i}$, the coefficient of $S(z)$, and each coefficient $\psi_{i}$ of $\Psi(z)$ is sequentially fed to this FIR as primary input.


Figure 3: FIR Architecture for $\Phi$-block.
Each DFF in the such FIR filter is reset to zeroes at the beginning, then each coefficient $\psi_{i}$ is input sequentially with the subscript $i$ increasing from 0 to $\rho$. We can easily prove that after the last element $\psi_{\rho}$ is fed to FIR at $(\rho+1)$-th clock cycle, as shown in Fig. 3, the output of all the DFF's in this FIR are exactly $\phi_{d-2}, \phi_{d-3}, \cdots, \phi_{0}$, the coefficients of desired polynomial $\Phi(z)$.

From the above discussion, we know that the $\Phi$-block and the KES block complete computation for one codeword frame in $\rho+1$ and $d-\rho-1$ clock cycles, respectively. Since the erasure number $\rho$ may varies from 0 to $d-1$, it's desirable to implement the $\Phi$-block and the KES block in the same pipeline stage so that they work on the same codeword frame and produce polynomial $\Lambda(z)$ and $\hat{\Omega}(z)$ in the fixed $d$ clock cycles. However, in each $d$ clock cycles, the $\Phi$-block and the KES block will be idle in $d-\rho-1$ and $\rho+1$ clock cycles, respectively, which is a somehow waste of silicon area. In the following, we propose a simple scheduling scheme for the KES block in such a way that it can compute $\Phi(z)$ and solve the key equation in total $d$ clock cycles. Therefore, we don't need to physically implement the above $d-1$ order FIR for the $\Phi$-block.

Notice that, as shown in Fig. 1, if we set $\gamma^{(r)}=1$ and $M C=0$ and consecutively assign $\delta^{(r)}$ as $\psi_{i}$ with the subscript $i$ increasing from 0 to $\rho$, the DC block will perform in the exactly same way as the $d-1$ order FIR for the $\Phi$-block as in Fig. 3. Based on the above observation, we propose the following scheduling scheme for the KES block to complete both the computation of $\Phi(z)$ and solve the key equation in total $d$ clock cycles:

## Procedure 4.1. KES Block Operation Scheduling

1. In the first $\rho+1$ clock cycles, compute $\Phi(z)$ using the DC block only:
(a) Initialize all the DFF's in DC block with $\mu_{i}^{(U)}=0$ and $\mu_{i}^{(L)}=s_{i}$, for $i=0,1, \cdots, d-2$, set $\gamma^{(r)}=1$ and $M C=0$.
(b) For $0 \leq r \leq \rho$, in each clock cycle, set $\delta^{(r)}=\psi_{r}$.

After $\rho+1$ clock cycles, the content of the upper DFF in each $P E 1_{i}$ block, $\hat{\delta}_{i}^{(\rho+1)}$, is equal to $\phi_{i}$.
2. In the following $d-\rho-1$ clock cycles, compute the $\Lambda(z)$ using $E L U$ block and $\hat{\Omega}(z)$ using $D C$ block:
(a) Re-initialize all the DFF's in DC and ELU blocks with $\mu_{i}^{(L)}=\mu_{i}^{(U)}=\phi_{i}$ and $\kappa_{i}^{(L)}=\kappa_{i}^{(L)}=\psi_{i}$.
(b) For $0 \leq r \leq d-\rho-1$, in each clock cycle, set $\delta^{(r)}=\hat{\delta}_{0}^{(r)}$ and generate $\gamma^{(r)}$ and $M C^{(r)}$ according to Algorithm 4.1.
Finally, we obtain the polynomial $\Lambda(z)$ and $\hat{\Omega}(z)$ as illustrated in Fig. 1.

## 5. CONCLUSION

Applying the methodology behind the algorithm transformation proposed in [7], in this paper, we reformulated the BM algorithm for errors-and-erasures RS decoding and correspondingly present the semi-systolic architecture which has a critical path of only $T_{\text {mult }}+$ Tadd. Moreover, an operation scheduling scheme has been proposed for the key equation solver (KES) block so that the overall RS decoder hardware complexity can be reduced without loss of throughput. Compared with the errors-alone RS decoder, the presented errors-and-erasures RS decoder may achieve the same throughput but requires extra $d+2 \cdot\left\lfloor\frac{d-1}{2}\right\rfloor-3$ multipliers and $d+\left\lfloor\frac{d-1}{2}\right\rfloor-3$ adders due to the implementation of erasure locator (EL) block and higher possible degree of errata locator polynomial $\Lambda(z)$ in the KES block.

## 6. REFERENCES

[1] R. E. Blahut. Theory and Practice of Error Control Codes. Addison Wesley, 1984.
[2] H.-C. Chang, C. B. Shung, and C.-Y. Lee. A Reed-Solomon product-code (RS-PC) decoder chip for DVD applications. IEEE Journal of Solid-State Circuits, 36(2):229-238, Feb. 2001.
[3] C. B. et al. The U.S. HDTV standard the grand. IEEE Spectrum, 32:36-45, April 1995.
[4] J. Forney. On decoding BCH codes. IEEE Trans. on Inform. Theory, IT-11:549-557, Oct. 1965.
[5] J.-H. Jeng and T.-K. Truong. On decoding of both errors and erasures of a Reed-Solomon code using an inverse-free Berlekamp-Massey algorithm. IEEE Trans. on Communications, 47(10):1488-1494, Oct. 1999.
[6] V. Pless. Introduction to the theory of error-correcting codes. Wiley, 1998.
[7] D. V. Sarwate and N. R. Shanbhag. High-speed architecture for Reed-Solomon decoders. IEEE Trans. on VLSI Systems, 9(5):641-655, Oct. 2001, available at http://www.icims.csl.uiuc.edu/~shanbhag/myhome.
[8] K. Seki, K. Mikami, M. Baba, N. Shinohara, S. Suzuki, and H. Tezuka. Single-ship $10.7 \mathrm{~Gb} / \mathrm{s}$ FEC CODEC LSI using time-multiplexed RS decoder. In IEEE Custom Integrated Circuits Conference, pages 289-292, 2001.
[9] L.-L. Yang and L. Hanzo. Performance analysis of coded M-ary orthogonal signaling using errors-and-erasures decoding over frequency-selective fading channels. IEEE Journal on Selected Areas in Communications, 19(2):211-221, Feb. 2001.


[^0]:    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for proft or commercial advantage and that copies bear this notice and the full citation on the £rst page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specifc permission and/or a fee.
    GLSVLSI'02, April 18-19, 2002, New York, New York, USA
    Copyright 2002 ACM 1-58113-462-2/02/0004 ...\$5.00.

