BULETINUL INSTITUTULUI POLITEHNIC DIN IAȘI Publicat de Universitatea Tehnică "Gheorghe Asachi" din Iași Tomul LVII (LXI), Fasc. 4, 2011 Secția ELECTROTEHNICĂ. ENERGETICĂ. ELECTRONICĂ

# A VLSI ALGORITHM FOR THE HARDWARE IMPLEMENTATION OF 1-D DISCRETE SINE TRANSFORM BASED ON A PSEUDO-BAND CORRELATION STRUCTURE

ΒY

## **DORU-FLORIN CHIPER**\*

"Gheorghe Asachi" Technical University of Iaşi, Faculty of Electronics, Telecommunications and Information Technology

Received, June 5, 2011 Accepted for publication: August 16, 2011

**Abstract.** Using a novel input restructuring sequence and appropriate index-mapping techniques a new VLSI algorithm for a prime-length DST is presented. The proposed algorithm uses a modular and regular computational structure, called *pseudo-band* correlation structure, well adapted to the VLSI implementation of 1-D DST, thus resulting an efficient VLSI implementation. The proposed algorithm can be further used to obtain a linear systolic array with a high computing speed and low I/O cost characterized by a small number of I/O channels placed at the two ends of the linear array. The resulting architecture obtained by mapping the proposed algorithm is highly regular and modular with local connections specific to the systolic array architectural paradigm.

Key words: discrete sine transform; systolic array; VLSI algorithm; VLSI architecture.

## **1. Introduction**

The discrete cosine transform (DCT) and discrete sine transform (DST) (Ahmet *et al.*, 1974; Jain, 1976, 1989) are key functions used in many digital signal processing applications, especially in transform coding. Besides they are

<sup>\*</sup> Corresponding author: *e-mail*: chiper@etc.tuiasi.ro

good approximations to the statistically-optimal Karhunen-Loeve transform (Jain, 1976, 1989); they are also used in speech and image transform coding (Jain, 1976; Zhang *et al.*, 2007), DCT based subband decomposition in speech and image compression (Chen, 2007), or video transcoding (Fung & Siu, 2006). Other important applications are: block filtering (Martucci & Mersereau, 1993), feature extraction (Jadhav & Holambe, 2008), digital signal interpolation (Wang *et al.*, 1993), image resizing (Park & Park, 2006), transform-adaptive filtering (Pei & Tseng, 1996, Mayyas, 2005) and filter banks (Bergen, 2008).

Essential for multimedia communications are high speed and low cost transmission of audio and video signals. To reduce storage and transmission cost, different data compression schemes are used. Among these schemes transform coding plays a significant role in reducing the bit rate for digital signal processing and transmission for both video and audio signals. For high correlation images DCT yields better results but for low correlation ones DST generate lower bit rates. The DST is signal independent and represents a good approximation of the statistically optimal Karhunen-Loeve transform.

In transform coding the usual transform length is 8 or 16. But in order to reduce blocks artifacts we can use a prime transform length of 11 or 17 and an overlapping technique. Also, in many other applications it is useful to have a prime DST length or a composite transform length were the factors are mutually prime. A prime factor could be a more suitable transform length for such applications than a power of two (Tatsaki *et al.*, 1995). Thus, there are in the literature several prime factor algorithms for an efficient calculation or implementation of DST (Chiper *et al.*, 2002). Also, there are several approaches to combine prime-factor algorithms for an efficient computation or implementation of the DST transform for composite-lengths (Kar & Rao, 1994).

There are many software implementation solutions but only a few good VLSI implementation ones .Due to the fact that DST is computational intensive the programmable general-purpose architectures can not satisfy the speed requirements of many real-time applications. Thus, it is necessary to design application specific hardware to accelerate the execution of this transform. In order to attain this goal it is necessary to appropriately reformulate the existing algorithms or even better to design new ones for a VLSI implementation. In order to obtain the necessary high speed for a real-time implementation we have to exploit the concurrency of the DST transform implementation.

It is already well known that the efficiency of a hardware algorithm is due especially to the communication complexity and then to the computational one. This is due to the fact that the data movement and transfer play a key role in obtaining an efficient VLSI implementation. Cycle convolution and circular correlation algorithms have remarkable advantages over other ones due to the fact that they are very appropriate for a VLSI implementation especially due to its efficient input/output operations and data transfer. Thus, several solutions have been proposed for a VLSI implementation based on cyclic convolution (Chiper *et al.*, 2005) or circular correlation (Chiper *et al.*, 2002). These

30

computational structures are well suited for a VLSI implementation using distributed arithmetic (White, 1989) or systolic arrays (Kung, 1982) due to the fact that they avoid complicated data routing and management leading thus to efficient VLSI implementation with a regular and modular structure. In this paper we propose another computational structure that is well suited to a VLSI implementation.

Systolic arrays represent an interesting architectural paradigm that can exploit the advantages offered by the VLSI technology, especially through modularity, regularity and short and local interconnections. The main bottleneck of this architecture is represented by the so called *I/O bottle-neck*. Thus, it is important to find systolic algorithms with a low I/O cost. In the same time systolic arrays are well suited for real time implementation due to the fact that they offer a high speed through an efficient exploitation of the concurrency.

The above mentioned advantages of the circular correlation can be extended to another computational structure that was called *pseudo-band correlation*. The differences in sign can be managed using a control-tag based mechanism appropriate for the systolic array paradigm.

In this paper we propose a new VLSI algorithm based on a regular and modular computational structure called *pseudo-band correlation structure* that can be efficiently mapped on a systolic array architecture resulting a high throughput VLSI implementation with reduced hardware complexity. We have used a new restructuring method of the DST into a such regular structure using a single input restructuring sequence as opposed to the used proceeding in a previous paper (Chiper *et al.*, 2005), where two such auxiliary sequences are used. This can be exploited to significantly reduce the area occupied by the preprocessing stage that for relatively short DST transforms represents an important overhead which can diminish the advantages offered by cycle convolution structures. The differences in sign involved by the pseudo-band correlation structure can be efficiently tackled with some tag control bits. All the advantages of a circular correlation based implementation as regularity, modularity, low I/O cost and a reduced data management scheme can be obtained with the proposed VLSI algorithm.

The rest of the paper is organized as follows: in Section 2 a low complexity formulation for the computation of the DST transform is presented. An example for a 1-D DST of length N = 11 is presented in Section 3. In Section 4 are discussed some details about a VLSI implementation of the proposed algorithm using the systolic array architectural paradigm. Conclusions are presented in Section 5.

## 2. Systolic Algorithm for 1-D DST

For the real input sequence x(i): i = 0, 1, ..., N-1, 1-D DST is defined

as

$$x(k) = \sqrt{\frac{2}{N}} \sum_{i=0}^{N-1} x(i) \sin[(2i+1)k\alpha], \ (k = 1, ..., N),$$
(1)

with

$$\alpha = \frac{\pi}{2N}.$$
 (2)

To simplify our presentation we will drop the constant coefficient  $\sqrt{2/N}$  from the eq. (1), that represents the definition of DST. We will add at the end of the VLSI array a multiplier to scale the output sequence with this constant.

We will reformulate relation (1) as a parallel decomposition based on a skew-cycle and pseudo-cycle convolution forms using a single new input restructuring sequence as opposed to the proceeding used in a previous paper (Chiper *et al.*, 2005) where we have used two such auxiliary input sequences. Further, we'll use the properties of DST kernel and the properties of the Galois Field of indexes to appropriately permute the auxiliary input and output sequences.

Thus, we will introduce the following auxiliary input sequence:  $\{x_a(i): i = 0, 1, ..., N-1\}$ . It can be recursively computed as follows:

$$x_a(N-1) = x(N-1),$$
(3)

$$x_a(i) = (-1)^i x(i) + x_a(i+1), \ (i = N - 2,...,0).$$
(4)

Using this restructuring input sequence we can reformulate (1) as follows:

$$X(0) = x_a(0), (5)$$

$$X(k) = x_a(0)\sin k\alpha + T(k)\cos k\alpha, \quad (k = 1,..., N-1).$$
 (6)

The new auxiliary output sequence,  $\{T(k): k = 1, 2, ..., N-1\}$ , can be computed as a pseudo-band correlation, if the transform length, N, is a prime number, as following:

$$T(\varphi(k)) = \sum_{i=1}^{(N-1)/2} \left[ (-1)^{\xi(k,i)} x_a(\varphi(i)) - (-1)^{\xi(k,i+(N-1)/2)} x_a(\varphi(i+(N-1)/2)) \right] \times (7)$$
  
 
$$\times 2 \sin[\varphi(i+k)2\alpha], \ (k=0,1,...,(N-1)/2),$$

where  $\varphi(k) = \langle g^k \rangle_N$ ,  $\langle x \rangle_N$  denoting the result of x modulo N.

We have also used the properties of the Galois Field of indexes to convert the computation of 1-D DST as a pseudo-band correlation.

# 3. An Example

To illustrate our approach we will consider an example for 1-D DST with the length N = 11 and the primitive root g = 2.

First we recursively compute the two auxiliary input sequences  $\{x_a(i): i = 0, ..., N-1\}$  as following:

$$x_a(10) = x(10),$$
 (8)

$$x_a(i) = (-1)^i x(i) + x_a(i+1), \ (i = 9,...,0).$$
 (9)

The functions  $\xi(k,i)$  and  $\xi(k,i+(N-1)/2)$  define the sign of terms in eq. (10). They are defined as follows:

a)  $\xi(k,i)$  is defined by the matrix

$$\begin{bmatrix} 0 & 0 & 1 & 1 & 1 \\ 0 & 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 0 \end{bmatrix};$$

b)  $\xi(k,i+(N-1)/2)$  is defined by the matrix

$$\begin{bmatrix} 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \end{bmatrix}$$

Using the auxiliary input sequence,  $\{x_a(i): i = 0,..., N-1\}$ , we can write the matrix-vector product form as

| $\int T(2)$                          |  | $\int s(4)$       | <i>s</i> (8)  | <i>s</i> (5)  | <i>s</i> (10) | s(9)         |                                                                                            |
|--------------------------------------|--|-------------------|---------------|---------------|---------------|--------------|--------------------------------------------------------------------------------------------|
| <i>T</i> (4)                         |  | s(8)              | <i>s</i> (5)  | <i>s</i> (10) | s(9)          | <i>s</i> (7) |                                                                                            |
| T(8)                                 |  | s(5)              | <i>s</i> (10) | s(9)          | <i>s</i> (7)  | <i>s</i> (3) | [+[r, (2) + r, (0)]]                                                                       |
| T(5)                                 |  | s(10)             | s(9)          | <i>s</i> (7)  | <i>s</i> (3)  | <i>s</i> (6) | $ \begin{bmatrix} \pm [x_a(2) \pm x_a(9)] \\ \pm [x_a(4) \pm x_a(7)] \end{bmatrix} $       |
| <i>T</i> (10)                        |  | s(9)              | <i>s</i> (7)  | <i>s</i> (3)  | <i>s</i> (6)  | <i>s</i> (1) | $ \begin{bmatrix} \pm [x_a(4) \pm x_a(7)] \\ \pm [x_a(8) \pm x_a(2)] \end{bmatrix} $ (10)  |
| T(9)                                 |  | $\overline{s(7)}$ | <i>s</i> (3)  | <i>s</i> (6)  | <i>s</i> (1)  | <i>s</i> (2) | $ \begin{bmatrix} \pm [x_a(0) \pm x_a(5)] \\ \pm [x_a(5) \pm x_a(5)] \end{bmatrix}, (10) $ |
| <i>T</i> (7)                         |  | s(3)              | <i>s</i> (6)  | <i>s</i> (1)  | <i>s</i> (2)  | <i>s</i> (4) | $\frac{1}{2} \left[ x_a(3) \pm x_a(0) \right] + \left[ x_a(1) + x_a(1) \right]$            |
| T(3)                                 |  | s(6) $s(1)$       | <i>s</i> (1)  | <i>s</i> (2)  | <i>s</i> (4)  | <i>s</i> (8) | $\begin{bmatrix} \pm [x_a(10) \pm x_a(1)] \end{bmatrix}$                                   |
| T(6)                                 |  | s(1)              | <i>s</i> (2)  | <i>s</i> (4)  | <i>s</i> (8)  | <i>s</i> (5) |                                                                                            |
| $\begin{bmatrix} T(1) \end{bmatrix}$ |  | $\int s(2)$       | <i>s</i> (4)  | <i>s</i> (8)  | <i>s</i> (5)  | s(10)_       |                                                                                            |

where we noted by  $s(k) = 2 \sin 2k\alpha$  and the sign of the items in relation (10) is given by the following matrix:

where

a) the first bit designates the sign before the brackets;

b) the second bit denotes the sign inside the brackets.

The "1" bit indicates the minus sign (the first bit) and the subtraction operation (the second one)

From eq. (10) we can see that all the elements along the secondary diagonal of the matrix in (10) are the same. We'll call this *regular computational structure band-correlation*. As regards the differences in sign it can be seen from the SIGN matrix this computational structure will be called *pseudo band-correlation structure*.

Finally, the output sequence  $\{X(k): k = 1, 2, ..., N-1\}$  can be computed as follows:

$$X(0) = \sum_{i=0}^{10} x_a(i), \qquad (12)$$

 $X(1) = x_{a}(0)\sin \alpha + T(1)\cos \alpha,$   $X(2) = x_{a}(0)\sin 2\alpha + T(2)\cos 2\alpha,$   $X(3) = x_{a}(0)\sin 3\alpha + T(3)\cos 3\alpha,$   $X(4) = x_{a}(0)\sin 4\alpha + T(4)\cos 4\alpha,$   $X(5) = x_{a}(0)\sin 5\alpha + T(5)\cos 5\alpha,$   $X(6) = x_{a}(0)\sin 6\alpha + T(6)\cos 6\alpha,$   $X(7) = x_{a}(0)\sin 7\alpha + T(7)\cos 7\alpha,$   $X(8) = x_{a}(0)\sin 8\alpha + T(8)\cos 8\alpha,$   $X(9) = x_{a}(0)\sin 9\alpha + T(9)\cos 9\alpha,$   $X(10) = x_{a}(0)\sin 10\alpha + T(10)\cos 10\alpha,$ (13)

with

$$\varphi(k) = \left\langle g^k \right\rangle_N,$$

where  $\langle x \rangle_N$  denotes the result of x modulo N.

We have also used the properties of the Galois Field of indexes to convert the computation of DST as two pseudo-cycle convolutions.

#### 4. A VLSI Implementation Discussion

Using the proposed algorithm we can obtain a linear systolic array as will be indicate in the following.

First, we have obtained the recursive formulation of eq. (7). Then, using a dependence-graph based synthesis procedure we can map eq. (7) into a linear systolic array with I/O channels placed only at the two ends of the linear systolic array. This array form the hardware-core of the proposed architecture used for the VLSI implementation of the proposed algorithm. Due to the pseudo-band correlation structure there are differences in sign that can be managed using a tag-control mechanism. This technique is very well suited for the systolic array architectural paradigm. The obtained processing elements consist of a multiplier, an adder and some multiplexers used to manage the differences in sign in eqs. (7) and (10), respectively.

| D     |     | •    | 01   | •    |
|-------|-----|------|------|------|
| Dorn- | Ela | orin | ( 'h | iner |
| Dura  | 1 1 | Jun  | CII  | iper |

Another important advantage of the proposed algorithm is its low I/O cost. This aspect is very important in designing systolic arrays as the so called I/O bottle-neck can seriously limit the usefulness of this concept. Using the tagcontrol mechanism it is possible to place all I/O channels at the two extreme ends of each systolic array. The tag bits used in this mechanism can control the internal registers using only I/O channels placed at the two ends of the linear systolic array.

The computation of the auxiliary input sequence using eqs. (3) and (4) and the appropriate permutation of this sequence is realized in the preprocessing stage. There are two reordering operations involved in obtaining the right auxiliary input sequence. First, we can reverse the order of the input signal in order to compute eq. (4) and then we have to change the order of the obtained sequence in order to obtain the desired auxiliary input sequence in the desired order. Each permutation is obtained using a RAM with N words. In a previous paper (Chiper *et al.*, 2005) the number of RAMs and adders is doubled as compared with the proposed solution.

In the post-processing stage we reorder the auxiliary output sequence  $\{T(k): k = 1, 2, ..., N-1\}$  to put it in a natural order and then we will compute the final output sequence using eq. (6), respectively (13).

#### **5.** Conclusions

A new VLSI algorithm for 1-D DST is proposed based on a modular and regular computation structure called pseudo-band correlation. The proposed algorithm can be mapped into a linear systolic array with a high throughput and low I/O cost and hardware complexity. This algorithm is based on a new computational structure with a regular and modular feature that is appropriate for a regular and modular VLSI implementation with short and local interconnections. The proposed restructuring of 1-D DST presents important advantages as a regular and local data flow and a reduced computational complexity. The VLSI architecture that can be obtained by mapping the proposed algorithm has a high throughput and a regular and modular structure with local connections well suited for a VLSI implementation.

#### REFERENCES

- Ahmed N., Natarajan T., Rao K.R., *Discrete Cosine Transform*. IEEE Trans. Comput., C-23, *1*, 90-94 (1974).
- Bergen S.W., A Design Method for Cosine-Modulated Filter Banks Using Weighted Constrained-Least-Squares Filters. Dig. Sign. Proc., 18, 3, 282-290 (2008).
- Chen Y.-Y., Medical Image Compression Using DCT-Based Subband Decomposition and Modified SPIHT Data Organization. Internat. J. of Medical Inform., **76**, 717-725 (2007).

36

- Chiper D.F., Swamy M.N.S., Ahmad M.O., Stouraitis T., Systolic Algorithms and a Memory-Based Design Approach for a Unified Architecture for the Computation of DCT/DST/IDCT/IDST. IEEE Trans. on Circ. a. Syst.-I; Regular papers, 52, 6, 1125-1137 (2005).
- Chiper D.F., Swamy M.N.S., Ahmad M.O., Stouraits T., *A Systolic Array Architecture* for the Discrete Sine Transform. IEEE Trans. on Sign. Proc., **50**, 9, 2347-2394 (2002).
- Fung K.-T., Siu W-C., On Re-composition of Motion Compensated Macroblocks for DCT-Based Video Transcoding. Sign. Proc.: Image Commun., 21, 44-58 (2006).
- Jadhav D.V., Holambe R.S., Random and Discrete Cosine Transform Based Features Extraction and Dimensionality Reduction Approach for Face Recognition. Sign. Proc., 88, 2604-2609 (2008).
- Jain A.K., A Fast Karhunen-Loeve Transform for a Class of Random Processes. IEEE Trans. Comm., COM-24, 10, 1023-1029 (1976).
- Jain A.K., *Fundamentals of Digital Image Processing*. Englewood Cliffs, Prentice-Hall, NJ, 1989.
- Kar D., Rao V.V.B., On Prime Factor Decomposition Algorithm for the Discrete Sine Transform. IEEE Trans. on Sign. Proc., 42, 11, 3258-3260 (1994).
- Kung H.T., Why Systolic Architectures. IEEE Comp., 15, 37-46 (1982).
- Martucci S.A., Mersereau R., New Approaches to Block Filtering of Images Using Symmetric Convolution and the DST or DCT. Proc. IEEE Internat. Symp. Circ. Syst. (ISCAS'93), May 1993, Jassy, 259-262.
- Mayyas K., A Note on Performance Analysis of the DCT-LMS Adaptive Filtering Algorithm. Sign. Proc., 85, 1465-1467 (2005).
- Park Y.S., Park H.W., Arbitrary-Ratio Image Resizing Using Fast DCT of Composite Length for DCT-Based Transcoder. IEEE Trans. Image Proc., 15, 2, 494-500 (2006).
- Pei S.-C., Tseng C.-C., Transform-Domain Adaptive Filter. IEEE Trans. Sign. Proc., 44, 12, 3142-3146 (1996).
- Tatsaki A., Dre C., Stouraitis T., Goutis C., *Prime-Factor DCT Algorithms*. IEEE Trans. on Sign. Proc., 43, 3, 772-776 (1995).
- Wang Z., Jullien G.A., Miller W.C., Interpolation Using the Discrete Sine Transform with Increased Acuracy. Electron. Lett., 29, 22, 1918-1920 (1993).
- White S.A., Applications of Distributed Arithmetic to Digital Signal Processing: A Tutorial Review. IEEE ASSP Mag., 6, 3, 5-19 (1989).
- Zhang D., Lin S., Zhang Y., Yu L., *Complexity Controllable DCT for Real-Time H.264 Encoder.* J. of Visual Commun. a. Image Repres., 18, 59-67 (2007).

#### UN ALGORITM VLSI PENTRU IMPLEMENTAREA HARDWARE A TRANSFORMATEI 1-D DST BAZAT PE UTILIZAREA STRUCTURII PSEUDO-BAND CORRELATION

#### (Rezumat)

Se propune un nou algoritm VLSI pentru transformata 1-D DST bazat pe o structură regulată și modulară denumită "Pseudo-band Correlation". Algoritmul propus poate fi mapat pe o arie sistolică liniară având un through-put ridicat și o complexitate 38

hardware și a operațiilor de intrare/ieșire reduse. Algoritmul propus se bazează pe o structură computațională regulată și modulară adecvată pentru o implementare VLSI. Restructurarea propusă pentru transformata DST are anumite avantaje importante cum ar fi un flux de date regulat și local precum și o complexitate aritmetică redusă. Arhitectura VLSI ce poate fi obținută prin maparea algoritmului propus are un throughput ridicat și o structură regulată și modulară cu interconexiuni locale, fiind astfel potrivită unei implementări VLSI.