BULETINUL INSTITUTULUI POLITEHNIC DIN IAȘI Publicat de Universitatea Tehnică "Gheorghe Asachi" din Iași Volumul 64 (68), Numărul 4, 2018 Secția ELECTROTEHNICĂ. ENERGETICĂ. ELECTRONICĂ

# A NEW SYSTOLIC ARRAY ALGORITHM BASED ON BAND-CORRELATION STRUCTURE AND AN EFFICIENT VLSI ARCHITECTURE FOR THE ODD-TIME GDHT

ΒY

### **DORU FLORIN CHIPER**<sup>\*</sup>

Technical University "Gheorghe Asachi" of Iaşi Faculty of Electronics, Telecommunications and Information Technology,

Received: October 19, 2018 Accepted for publication: November 29, 2018

Abstract. Using two input restructuring sequences a new VLSI algorithm for 1-D odd-time generalized Hartley transform (GDHT) for an efficient VLSI architecture based on systolic array architectural paradigm. The proposed VLSI architecture has appealing topological features and high performances. The new algorithm is based on a modular and regular computational structure called bandcorrelation. Using the proposed algorithm we can obtain an efficient VLSI architecture for odd-time GDHT having a high speed at a low hardware complexity.

**Keywords:** generalized discrete Hartley transform; discrete transforms; systolic arrays; systolic algorithms; VLSI algorithms.

## 1. Introduction

The generalized discrete Hartley transform (GDHT) (Hu *et al.*,1992) has been proved to efficiently replace the generalized discrete Fourier transform

<sup>\*</sup>Corresponding author: *e-mail*: chiper@etti.tuiasi.ro

| Doru Fiorm Chipe | Doru | Florin | Chipe |
|------------------|------|--------|-------|
|------------------|------|--------|-------|

(GDFT) when the input sequence is real. It has the same useful applications as GDFT in many fields as designing filter banks, signal representation, computing convolutions and fast computation of DFT. One of the most useful forms of

GDHT is the odd-timed GDHT. There are many software solutions for GDHT but until now a few efficient hardware algorithms has been proposed for DHT (Chiper, 2005; Pan,1997; Chiper,2013) and especially for GDHT (Chiper, 2012; Chiper *et al.*,2004). Due to the fact that GDHT has a high computational complexity, in many real-time applications it is necessary to find efficient dedicated VLSI-based hardware structures to satisfy the requirements of the specific applications.

In real-time applications it is necessary to design VLSI hardware accelerators that can speed up the execution of the GDHT transform. In order to do this is necessary to design new VLSI algorithms or to reformulate existing algorithms for odd-time GDHT in a such manner that an efficient VLSI implementation can be obtained.

It is well known that the efficiency of a hardware algorithm is due not only to the computational one but especially to the communication complexity. This is due to the fact that data movement plays a key role in a VLSI implementation. So, the FFT algorithm that has a very good computational complexity is not so good from a VLSI implementation point of view.

Cycle convolution and circular correlation algorithms have remarkable advantages over other ones due to its efficient input/output and data transfer operations. These computational structures can be efficiently implemented in VLSI using distributed arithmetic (White, 1989) or systolic arrays (Kung,1982). Thus, several solutions have been proposed for a VLSI implementation of some DSP algorithms based on cyclic convolution or circular correlation (Meher, 2006; Chiper, 2007; Cheng, 2006; Chiper *et al.*, 2011).

The important advantages of the circular correlation or cycle convolution from a VLSI implementation point of view can be extended to other structures as for example skew-circular and pseudo-circular correlations or band-correlation (Chiper, 2011a,b; Chiper, 2017).

In this paper we propose a new systolic array algorithm for odd-time GDHT using two band-correlation structures. These structures have a regular and modular form and can be computed in parallel resulting thus a high throughput VLSI implementation. In order to obtain an efficient VLSI algorithm we have used an appropriate restructuring method of the 1-D odd-time GDHT into such regular structures. All the advantages of a circular correlation based implementation as regularity, modularity, low I/O cost and a reduced data management scheme can be obtained with the proposed computational structures.

The rest of the paper is organized as follows: in Section 2 an efficient formulation of 1-D odd-time GDHT is presented using an example for a 1-D GDHT of length N = 13 that can be used to obtain a unified VLSI implementation. In Section 3 we discuss some considerations about a VLSI

104

implementation of the proposed algorithm using the systolic array architectural paradigm. Conclusions are presented in Section 4.

# 2. Systolic Algorithm for the Odd-Time GDHT

To illustrate our decomposition method, we use an example with the transform length N = 13 and the primitive root g = 2.

The odd-time generalized DHT of the input sequence  $\{x(i): i = 0, 1, ..., N-1\}$  is defined as:

$$Y(k) = \sum_{i=0}^{N-1} x(i) cas[(2i+1)k\alpha],$$
(1)

where:  $\alpha = \pi / N$ , and  $cas(\theta) = cos(\theta) + sin(\theta)$ .

If the transform length N is a prime number, its computation can be reformulated for an efficient VLSI implementation as follows:

$$\begin{bmatrix} Y(1) \\ Y(2) \\ Y(3) \\ Y(4) \\ Y(5) \\ Y(6) \end{bmatrix} = \begin{bmatrix} H_C(1) \cdot \cos(\alpha) \\ H_C(2) \cdot \cos(2\alpha) \\ H_C(3) \cdot \cos(3\alpha) \\ H_C(3) \cdot \cos(3\alpha) \\ H_C(5) \cdot \cos(5\alpha) \\ H_C(5) \cdot \cos(5\alpha) \\ H_C(6) \cdot \cos(6\alpha) \end{bmatrix} + \begin{bmatrix} H_S(1) \cdot \sin(\alpha) \\ H_S(2) \cdot \sin(2\alpha) \\ H_S(3) \cdot \sin(3\alpha) \\ H_S(4) \cdot \sin(4\alpha) \\ H_S(5) \cdot \sin(5\alpha) \\ H_S(6) \cdot \sin(6\alpha) \end{bmatrix},$$
(2)

$$\begin{bmatrix} Y(12) \\ Y(11) \\ Y(10) \\ Y(9) \\ Y(9) \\ Y(8) \\ Y(7) \end{bmatrix} = -\begin{bmatrix} H_C(1) \cdot \cos(\alpha) \\ H_C(2) \cdot \cos(2\alpha) \\ H_C(3) \cdot \cos(3\alpha) \\ H_C(4) \cdot \cos(4\alpha) \\ H_C(5) \cdot \cos(5\alpha) \\ H_C(5) \cdot \cos(5\alpha) \\ H_C(6) \cdot \cos(6\alpha) \end{bmatrix} + \begin{bmatrix} H_S(1) \cdot \sin(\alpha) \\ H_S(2) \cdot \sin(2\alpha) \\ H_S(3) \cdot \sin(3\alpha) \\ H_S(4) \cdot \sin(4\alpha) \\ H_S(5) \cdot \sin(5\alpha) \\ H_S(6) \cdot \sin(6\alpha) \end{bmatrix},$$
(3)

where we have introduced two auxiliary output sequences  $\{H_c(i): i = 0, 1, ..., (N-1)/2\}$  and  $\{H_s(i): i = 0, 1, ..., (N-1)/2\}$  that are computed as:

$$\begin{bmatrix} H_{C}(1) \\ H_{C}(2) \\ H_{C}(3) \\ H_{C}(4) \\ H_{C}(5) \\ H_{C}(6) \end{bmatrix} = \begin{bmatrix} x_{C}(0) + 2 \cdot T_{C}(1) \\ x_{C}(0) + 2 \cdot T_{C}(2) \\ x_{C}(0) + 2 \cdot T_{C}(3) \\ x_{C}(0) + 2 \cdot T_{C}(4) \\ x_{C}(0) + 2 \cdot T_{C}(5) \\ x_{C}(0) + 2 \cdot T_{C}(6) \end{bmatrix},$$
(4)

```
Doru Florin Chiper
```

$$\begin{bmatrix} H_{S}(1) \\ H_{S}(2) \\ H_{S}(3) \\ H_{S}(4) \\ H_{S}(5) \\ H_{S}(6) \end{bmatrix} = \begin{bmatrix} x_{C}(0) + 2 \cdot T_{S}(1) \\ x_{C}(0) + 2 \cdot T_{S}(2) \\ x_{C}(0) + 2 \cdot T_{S}(3) \\ x_{C}(0) + 2 \cdot T_{S}(4) \\ x_{C}(0) + 2 \cdot T_{S}(5) \\ x_{C}(0) + 2 \cdot T_{S}(6) \end{bmatrix}.$$
(5)

Where we have introduced two other auxiliary output sequences  $\{T_c(i): i = 0, 1, ..., (N-1)/2\}$  and  $\{T_s(i): i = 0, 1, ..., (N-1)/2\}$  that can be computed using a special computational structure called band-correlation that can be efficiently implemented using systolic arrays as will be seen further on.

The two auxiliary output sequences are computed using eqs. (6) and (7) as follows:

$$\begin{bmatrix} T_{C}(2) \\ T_{C}(4) \\ T_{C}(5) \\ T_{C}(6) \\ T_{C}(1) \end{bmatrix} = \begin{bmatrix} \cos(4a) & \cos(8a) & \cos(3a) & \cos(6a) & \cos(12a) & \cos(11a) \\ \cos(8a) & \cos(3a) & \cos(6a) & \cos(12a) & \cos(11a) & \cos(9a) \\ \cos(3a) & \cos(6a) & \cos(12a) & \cos(11a) & \cos(9a) & \cos(5a) \\ \cos(6a) & \cos(12a) & \cos(11a) & \cos(9a) & \cos(5a) & \cos(10a) \\ \cos(12a) & \cos(11a) & \cos(9a) & \cos(5a) & \cos(10a) & \cos(7a) \\ \cos(11a) & \cos(9a) & \cos(5a) & \cos(10a) & \cos(7a) \\ \cos(11a) & \cos(9a) & \cos(5a) & \cos(11a) & \cos(9a) \\ \cos(3a) & \cos(6a) & \cos(12a) & \cos(11a) & \cos(9a) & \cos(1a) \\ \cos(3a) & \cos(3a) & \cos(6a) & \cos(12a) & \cos(11a) & \cos(9a) \\ \cos(3a) & \cos(6a) & \cos(12a) & \cos(11a) & \cos(9a) \\ \cos(3a) & \cos(6a) & \cos(12a) & \cos(11a) & \cos(9a) \\ \cos(3a) & \cos(6a) & \cos(12a) & \cos(11a) & \cos(9a) \\ \cos(3a) & \cos(6a) & \cos(12a) & \cos(11a) & \cos(9a) \\ \cos(5a) & \cos(12a) & \cos(11a) & \cos(9a) & \cos(5a) \\ \cos(12a) & \cos(11a) & \cos(9a) & \cos(5a) & \cos(10a) \\ \cos(12a) & \cos(11a) & \cos(9a) & \cos(5a) & \cos(10a) \\ \cos(12a) & \cos(11a) & \cos(9a) & \cos(5a) & \cos(10a) \\ \cos(12a) & \cos(11a) & \cos(9a) & \cos(5a) & \cos(10a) \\ \cos(11a) & \cos(9a) & \cos(5a) & \cos(10a) & \cos(7a) \\ \cos(11a) & \cos(9a) & \cos(5a) & \cos(10a) & \cos(7a) \\ \end{array} \right] \times \begin{bmatrix} x_{S}(2) + x_{S}(11) \\ x_{S}(4) + x_{S}(9) \\ x_{S}(5) + x_{S}(8) \\ x_{S}(3) + x_{S}(10) \\ x_{S}(6) + x_{S}(7) \\ x_{S}(1) + x_{S}(12) \end{bmatrix}$$
(7)

where:  $a = 2\alpha$ .

We have used two auxiliary input sequences, defined as:

$$x_{C}(N-1) = x(N-1)$$
  

$$x_{C}(i) = x(i) - x_{C}(i+1)$$
 for  $i = N-2,...,0$  (8)  

$$x_{C}(N-1) = x(N-1)$$

$$x_{s}(N-1) = x(N-1)$$
  

$$x_{s}(i) = x(i) + x_{s}(i+1)$$
 for  $i = N-2,...,0$  (9)

and appropriate permutation of the Galois Field of the indexes:

$$\psi(k) = \begin{cases} \varphi(k) & \text{if } \varphi(k) \le (N-1)/2\\ \varphi(k+(N-1)/2) & \text{otherwise} \end{cases}$$
(10)

$$\varphi(k) = \langle g^K \rangle_N \tag{11}$$

106

### 3. A VLSI Implementation Discussion

Using the algorithm presented in Section 2 we can obtain an efficient VLSI systolic array using a dependence-graph based synthesis procedure (Kung, 1988). Using this design procedure we can implement eqs. (6) and (7) using two linear systolic arrays working in parallel as shown in Fig. 1.



Fig. 1 – The systolic arrays that implement the hardware core of the proposed VLSI architecture.

These systolic arrays represent the hardware core of the architecture used for the VLSI implementation of the derived algorithm. We can obtain a further reduction of the hardware complexity using an appropriate hardware sharing method as that proposed in (Chiper, 2005b) to unify the two systolic arrays presented in Fig. 1. The function of the processing elements Pe from Fig. 1 is presented in Fig. 2.



Fig. 2 – The function of the processing elements Pe from Fig. 1.

It can be seen from Fig. 2 that each processing element consists of a multiplier, an adder and a multiplexer that is used to properly select the input data in each multiplier.

Using the so-called tag control bits (Jen *et al.*,1988) we can control the input data to be correctly stored in each processing elements in a such manner that all the input channels to be placed at one of the two ends of the linear systolic array.

It can be seen from Fig. 1 that the input data x(i) + x(N - i) and c(i) are flowing through the systolic array at a half rate of the partial results in such a manner that each input data will meat the right data at the right moment in the right processor. This is a very important condition in order to obtain the correct results at the end of the linear systolic arrays.

This architecture offers a high processing speed by using pipelining and parallelism. The high rate of the processing inside of the array will involve a high volume of input data to be sent to the systolic array. It is the so called I/O bottle neck that will limit the speed performances of a such architecture.

In our solution the so-called I/O bottleneck is avoided by using each data in as many processing elements as possible.

### 4. Conclusions

In this paper is proposed a new VLSI algorithm for 1-D odd-time GDHT that can be used to obtain an efficient VLSI architecture based on systolic array architectural paradigm. The proposed VLSI algorithm is based on a regular and modular computation structure called band-correlation. It was shown that this computational structure can be implemented in VLSI using the systolic array architectural paradigm with the same appealing features for a VLSI implementation and high speed performances as cyclic convolution and circular correlation. The two band-correlations can be computed in parallel thus

resulting a high throughput VLSI implementation. The architecture that can be obtained has high performances and good topological properties well suited for a VLSI implementation.

#### REFERENCES

- Cheng C., Parhi K.K., Hardware Efficient Fast DCT Based on Novel Cyclic Convolution Structures, IEEE Trans. Signal Processing, 54, 11, 4419-4434 (2006).
- Chiper D.F., A VLSI Algorithm for a Systolic Array VLSI Implementation of Type IV DST Based on A Pseudo-Band Correlation Structure, Proc. of Int. IEEE Symp. on Circuits and Systems, ISSCS 2011, Iaşi, 2011.
- Chiper D.F., Comşa C., An Efficient Linear Systolic Array Architecture for a Memory-Based VLSI Implementation of Type III Generalized Hartley Transform, Bul. Științific al Universității Politehnica din Timişoara, s. Electronică şi Telecomunicații, 49(63), I, 117-121 (2004).
- Chiper D.F., Cracan A., Burdia D., A New Systolic Array Algorithm and Architecture for the VLSI Implementation of IDST Based on a Pseudo-Band Correlation Structure, Advances in Electrican and Computer Engineering, 54, 1 (2017).
- Chiper D.F., *Radix-2 Fast Algorithm for Computing Discrete Hartley Transform of Type III*, IEEE Transactions on Circuits and Systems-II, **59**, 5, 297-301 (2012).
- Chiper D.F., Swamy M.N.S., Ahmad M.O., An Efficient Unified Framework for the VLSI Implementation of a Prime-Length DCT/IDCT with High Throughput, IEEE Transactions on Signal Processing, Regular Papers, **54**, 6 (2007).
- Chiper D.F., Swamy M.N.S., Ahmad M.O., Stouraitis T., Systolic Algorithms and a Memory-Based Design Approach for a Unified Architecture for the Computation of DCT/DST/IDCT/IDST, IEEE Trans. on Circuits and Systems\_I: Regula papers, 52, 6 (2005).
- Chiper D.F., Swamy M.N.S., and Ahmad M.O., *An Efficient Systolic Array Algorithm* for the VLSI Implementation of Prime-Length DHT, in Proc. of Internat. Symp. on Signals, Circuits and Systems (ISSCS2005), Jul. 2005, vol. **1**, 167-169.
- Chiper D.F., Ungureanu P., Novel VLSI Algorithm and Architecture with Good Quantization Properties for a High-Throughput Area Efficient Systolic Array Implementation of DCT, EURASIP Journal on Advances in Signal Processing, Vol. 1 (2011).
- Chiper D.F., A Novel VLSI DHT Algorithm For A Highly Modular And Parallel Architecture, IEEE Transactions on Circuits and Systems- II, **60**, 5, 282-286 (2013).
- Chiper D.F., A New VLSI Algorithm and Architecture for the Hardware Implementation of Type IV Discrete Cosine Transform Using a Pseudo-Band Correlation Structure, Central European Journal of Computer Science, 1, 2, 90-97 (2011).
- Hu N.C., Chang H.I., Orsoy O., *Generalized Discrete Hartley Transform*, IEEE Trans. Signal Processing, **40**, *6*, 2931-2940 (1992).
- Jen C.W., Hsu H.Y., *The Design of Systolic Arrays with Tag Input*, in Proc. IEEE Int. Symp. Circuits and Systems ISCAS'88, Finland, 1988, 2263-2266.
- Kung H.T., Why Systolic Architectures, IEEE Comp., vol.15, 37-46, Jan. 1982
- Kung S.Y., VLSI Array Processors, Prentice Hall, Englewood Cliffs, New Jersay, 1988.

Meher P.K., Systolic Designs for DCT Using a Low-Complexity Concurrent Convolutional Formulation, IEEE Trans. Circuits Syst. Video Technol., **16**, 9, 1041-1050 (2006).

Pan S.B., Park R.H., *Unified Systolic Array for Computatiuon of DCT/DST/DHT*, IEEE Trans. on Circuits and Systems for Video Technology, 7, 2, 413-419 (1997).

White S.A., Applications of Distributed Arithmetic to Digital Signal Processing: A Tutorial Review, IEEE ASSP Mag., 6, 3, 5-19 (1989).

#### UN NOU ALGORITM SISTOLIC BAZAT PE STRUCTURA BAND-CORRELATION ȘI O ARHITECTURĂ VLSI EFICIENTĂ PENTRU TRANSFORMATA HARTLEY DISCRETĂ GENERALIZATĂ DE TIP ODD-TIME

#### (Rezumat)

În această lucrare a fost propus un nou algoritm VLSI pentru transformata discrete Hartley generalizată de tip odd-time care poate fi folosit pentru a obține o arhitectură VLSI eficientă bazată pe paradigm arhitecturală sistolică.

Algoritmul VLSI propus se bazează pe o structură computațional regulate și modular denumită band-correlation. S-a arătat că această structură computațională poate fi implementată efficient în VLSI ustilizând paradigm arhitecturală denumită arie sistolică cu aceleași caracteristici atrăgătoare din punct al implementării VLSI și a performanțelor înalte de viteză ca și convoluția ciclică sau corelația circular. Cele două band-correlations pot fi procesate în parallel rezultând astfel o implementare VLSI cu un throughput ridicat. Arhitectura VLSI ce poate fi astfel obținută are performanțe înalte și proprietăți topologice foarte bune, adecvate implementării VLSI.

110