Low Complexity FFT Factorization for CS Reconstruction

Alahari Radhika, K. Satya Prasad, K. Kishan Rao

Abstract: This paper presents a novel area-efficient fast Fourier transform (FFT) for real-time compressive sensing (CS) reconstruction. Among the methods used for CS reconstruction, the greedy orthogonal matching pursuit (OMP) approach provides accurate reconstruction, but at the cost of considerable computational overhead. Several computationally intensive arithmetic operations, such as complex matrix multiplication, are required to form the correlation vectors, which makes a hardware implementation of the algorithm complex and power hungry. Computational complexity is especially important in complex FFT models that must meet different operational standards and system requirements. In real-time applications, the FFT must deliver high-speed computation with the least possible complexity overhead in order to support a wide range of applications. This paper presents a hardware-efficient FFT computation technique with twiddle factor normalization for correlation optimization in OMP. Experimental results are provided to validate the proposed normalization technique in terms of complexity and energy. The proposed method is verified by FPGA synthesis and validated through comparative analysis.

Keywords: Compressive sensing, Fast Fourier transform, Orthogonal matching pursuit, FPGA.

I. INTRODUCTION

In recent years, high speed and improved output quality have emerged as important requirements of many digital systems, such as clinical imaging [1], wireless communication [2] and IoT applications [3], which in turn raise both bandwidth and frequency demands. In almost all fields of signal and image processing, data acquisition based on sampling theory has been the primary means of narrowing the gap between the desired quality and the rate of signal acquisition [4]. However, sampling at the Nyquist rate is unsuitable for many applications and often causes a storage burden or hardware complexity overhead. Additionally, with wideband and high-throughput signal processing, this conventional method requires a high sampling rate, which leads to energy consumption problems.

On the other hand, when a signal has a sparse representation, parts of it can be discarded to attain some level of compression without compromising the quality of the end result; this is referred to as compressive sampling (CS). In this method, the signal is acquired as a small set of measurements, and an algorithm at the receiver side restores the input signal from these reduced-rate measurements. Its advantage is that accurate reconstruction of the input signal requires far fewer samples than conventional sampling theory does.

In CS, signals are represented sparsely in some orthogonal basis, chosen from prior statistics and characteristics of the signal to be compressed. Prominent and widely preferred bases include the discrete cosine transform (DCT), fast Fourier transform (FFT), wavelet and Gabor bases. Signal reconstruction from the measured values poses several problems, and the solutions to them generally involve some form of optimization.

Numerous works have investigated the influence of compressive sampling on effective signal representation. Some have also focused on the use of greedy methods for signal reconstruction [5]. This work proposes a highly optimized FFT core for correlation optimization, which is a crucial step in CS signal reconstruction [6, 7].

The major contributions of this paper towards CS reconstruction are as follows: (i) Twiddle factor normalization using the radix-\(2^2\) framework, which has low complexity and a prominent impact, since CS analysis always requires a large number of computational resources for its correlation computations. (ii) Conventional hardware optimization models for FFT computation come with performance trade-offs. In contrast, in this work the hardware complexity and energy consumption problems are addressed without using any arithmetic-level optimization technique. In addition, this FFT core can perform FFT computation at high speed, whereas conventional FFT methods require pipelining or parallel processing to accomplish this. (iii) Because fewer non-trivial twiddle factors are used, the FFT computation is more robust to arithmetic errors arising from the fixed word-width constraint and can provide improved signal recovery compared with other FFT methods.

The remainder of the paper is organized as follows. Section II describes various FFT hardware optimization models. Section III explores the role of the FFT within the CS reconstruction algorithm framework. Section IV elaborates the radix factorization technique, based on an index mapping framework, for low complexity and energy efficiency. Section V presents experimental results and comparative analyses that demonstrate the area and power efficiency of the FFT twiddle factor normalization scheme, and Section VI concludes with a summary.

II. FFT OPTIMIZATION RELATED WORKS

Hardware optimization techniques applied in isolation are often ineffective, and hierarchical hardware integration is vulnerable to synchronization problems, which makes FFT computation less efficient. Moreover, such optimization comes with a performance trade-off that reduces speed. Thus, to optimize FFT computation, various approaches such as the choice of radix, pipelining and sophisticated multiplication techniques have to be explored. In [8], multi-path canonical signed digit (CSD) based twiddle factor multiplication is proposed, using a multi-layer scheme to carry out mixed-radix FFT computation. In [9], a pipelined FFT architecture is presented that incorporates a feed-forward data commutator and decomposes the complex FFT input into two streams, the real and imaginary parts; a decimated dual-path delay is used for high-speed computation with an improved hardware utilization rate. In [10], a folding-transformation-based approach is developed for FFT computation using a parallel pipelined architecture, with register minimization techniques used to reduce design complexity. Compared with the parallel data path approach, the folding transformation saves significant computational cost, since folding eliminates some of the redundancy in the FFT computation. In [11], a hierarchical memory scheduling approach is proposed for memory-efficient FFT computation; a multipath delay commutator (MDC) driven parallel data stream is used to achieve a high throughput rate. In [12], a mixed-radix multipath delay feedback FFT structure is developed to process the twiddle factor multiplications of a radix-\(2^2 \times 2^2 \times 2^2\) FFT architecture for wireless personal area network applications.

Though high-radix FFT computation supports high speed and reduced complexity, there is always a trade-off between throughput rate and spectral efficiency. Radix-\(2^m\) FFT methods achieve spectral efficiency and meet throughput demands simultaneously: they keep the simple radix-2 butterfly while requiring the fewest twiddle factor multiplications [13].
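The saving in twiddle factor multiplications can be illustrated by counting. The following Python sketch is an illustration under stated assumptions, not a result from this paper: it assumes the textbook radix-2 DIF twiddle pattern \( W_L^{n} \) and the radix-\(2^2\) pattern \( W_L^{n_3(k_1 + 2k_2)} \) derived in Section IV, treats every twiddle other than \( \pm 1, \pm j \) as one complex multiplication, and uses hypothetical function names.

```python
def is_trivial(exponent, length):
    """W_length^exponent is one of 1, -j, -1, j exactly when
    exponent is a multiple of length/4."""
    return (4 * exponent) % length == 0


def radix2_nontrivial(N):
    """Count non-trivial twiddles in a radix-2 DIF FFT (N a power of two)."""
    count, L = 0, N
    while L >= 2:
        blocks = N // L  # number of length-L sub-DFTs at this stage
        count += blocks * sum(not is_trivial(n, L) for n in range(L // 2))
        L //= 2
    return count


def radix22_nontrivial(N):
    """Count non-trivial twiddles in a radix-2^2 DIF FFT (N a power of four)."""
    count, L = 0, N
    while L >= 4:
        blocks = N // L  # number of length-L sub-DFTs at this level
        count += blocks * sum(
            not is_trivial(n3 * m, L)   # m stands for k1 + 2*k2
            for n3 in range(L // 4)
            for m in range(4)
        )
        L //= 4
    return count


for N in (16, 64, 256):
    print(N, radix2_nontrivial(N), radix22_nontrivial(N))
```

Under these assumptions the count for \( N = 16 \) drops from 10 to 8, and for \( N = 64 \) from 98 to 76.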

This paper proposes a factorization technique that converts non-trivial twiddle factors into trivial ones for FFT design, which has the following advantages:

1) Trivial twiddle factors are used directly in alternate stages of the FFT, which requires less memory and computation time and at times produces more realistic synthesis results; the design can be configured for any FFT length.

2) Low complexity with improved hardware utilization rate.

3) It can meet the desired throughput demands with a reduced number of complex twiddle factors and improved operating performance.

In addition to the reduction in hardware complexity, worst-case critical path analysis shows that reducing the number of complex multiplications enables high-speed FFT computation. The FFT architecture here is designed without using any arithmetic-level optimization technique.

III. FFT FOR CS RECONSTRUCTION

Orthogonal matching pursuit (OMP), developed by Tropp and Gilbert [14], is a prominent and reliable signal reconstruction technique. The OMP recovery algorithm shows superior performance but involves two computationally expensive tasks, coefficient selection and signal estimation:

- During the coefficient selection stage, the most relevant vectors are found using matrix-vector multiplication.
- Signal estimation requires solving a least squares problem, which involves both matrix inversion and matrix-vector multiplication.
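The two stages listed above can be sketched in a few lines of Python. This is a minimal floating-point illustration of generic OMP, assuming NumPy and a known sparsity level; the function name `omp` and its arguments are hypothetical, and it is not the hardware implementation evaluated in this paper.

```python
import numpy as np


def omp(A, y, sparsity):
    """Estimate a sparse x with y = A @ x by orthogonal matching pursuit."""
    residual = y.copy()
    support = []
    coeffs = np.zeros(0)
    for _ in range(sparsity):
        if np.linalg.norm(residual) < 1e-12:
            break  # y is already explained by the selected columns
        # Coefficient selection: one matrix-vector product A^H r,
        # then pick the column with the largest correlation magnitude.
        correlations = A.conj().T @ residual
        support.append(int(np.argmax(np.abs(correlations))))
        # Signal estimation: least squares restricted to selected columns.
        coeffs, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coeffs
    x_hat = np.zeros(A.shape[1], dtype=np.result_type(A, y))
    x_hat[support] = coeffs
    return x_hat, support
```

Each iteration costs one full correlation `A.conj().T @ residual`, which is the step the FFT-based approach in this section targets.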

The hardware complexity overhead of the correlation optimization and of the least squares problem (LSP) limits the performance of the OMP algorithm. An FPGA-based implementation of OMP for CS reconstruction relies on optimizing one or both of these steps. Though FPGA implementation has inherent advantages such as massive parallelism and configurable processing elements, measures must be taken to reduce the complex arithmetic involved in FFT computation in order to increase the reconstruction rate. Here, the fast Fourier transform (FFT) is used to reduce the resource requirement of correlation optimization during CS reconstruction. The IFFT is used in lieu of matrix-vector multiplication to improve the efficiency of the inner product computation.

Let us define the time-frequency dictionary \( w = F\omega \).

Here, \( \omega \) is the Fourier basis of size \( N \times N \).

\( F \) is the \( M \times N \) matrix obtained by arbitrarily selecting \( M \) rows from the \( N \times N \) identity matrix.

Instead of calculating the correlation vectors sequentially with vector dot products, one can select a column with the help of the fast Fourier transform, as given in Eqn. (1):

\[
\lambda_t = \arg \max_{j=1,\ldots,N} \left| \mathrm{IFFT}(R_{t-1})_j \right|
\]

(1)

Here \( \lambda_t \) is the column index selected at iteration \( t \), and \( R_{t-1} \) is the residual vector from the previous iteration, starting from the initial residual \( R_0 \).
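The equivalence behind Eqn. (1) can be checked numerically. The Python sketch below assumes that the dictionary consists of \( M \) selected rows of the unnormalized DFT matrix and that the residual is zero-filled to length \( N \) at the selected row positions before the IFFT; these conventions, the scaling by \( N \), and the function names are assumptions made for illustration.

```python
import numpy as np


def correlate_direct(rows, residual, N):
    """Correlation A^H r computed as an explicit matrix-vector product."""
    omega = np.fft.fft(np.eye(N))   # N x N Fourier basis
    A = omega[rows, :]              # keep M rows: A = F @ omega
    return A.conj().T @ residual


def correlate_ifft(rows, residual, N):
    """Same correlation from one IFFT of the zero-filled residual."""
    padded = np.zeros(N, dtype=complex)
    padded[rows] = residual         # place residual at the sampled positions
    return N * np.fft.ifft(padded)  # NumPy's ifft carries a 1/N factor


rng = np.random.default_rng(1)
N, M = 64, 16
rows = rng.choice(N, size=M, replace=False)
r = rng.standard_normal(M) + 1j * rng.standard_normal(M)
lambda_t = int(np.argmax(np.abs(correlate_ifft(rows, r, N))))
```

The direct product costs on the order of \( MN \) complex multiplications per iteration, whereas the IFFT costs on the order of \( N \log_2 N \); the scale factor does not affect the arg max.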

IV. FFT FACTORIZATION

The FFT algorithm can take many forms, depending on the manner in which the input discrete signal is factorized. The radix-2 decimation-in-time FFT (radix-2 DIT) is the simplest and most widely used form, in which each stage divides an \( N \)-point DFT into two DFTs of size \( N/2 \). Radix-2 DIT first computes the DFTs of the even-indexed inputs \( x[2m] \) and of the odd-indexed inputs \( x[2m+1] \), and then combines the two results with butterfly operations to produce the final FFT output.
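The even/odd split can be written as a short recursive routine. The following Python sketch is a textbook floating-point illustration, assuming NumPy and a power-of-two length; the function name is hypothetical and the code is not the hardware architecture discussed in this paper.

```python
import numpy as np


def fft_radix2_dit(x):
    """Recursive radix-2 decimation-in-time FFT for power-of-two lengths."""
    x = np.asarray(x, dtype=complex)
    N = x.size
    if N == 1:
        return x.copy()
    if N % 2:
        raise ValueError("length must be a power of two")
    even = fft_radix2_dit(x[0::2])   # DFT of x[2m]
    odd = fft_radix2_dit(x[1::2])    # DFT of x[2m+1]
    twiddle = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    # Butterfly: the first half adds, the second half subtracts.
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])
```

Every butterfly here multiplies by a twiddle factor, including the trivial ones; reducing the non-trivial multiplications is the purpose of the factorization that follows.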

A. Radix factorization model:

The proposed approach restructures the FFT computation using radix factorization to improve the hardware utilization rate, which in turn helps to improve system performance and energy efficiency. Reduced overall latency and energy consumption in transceiver designs are additional benefits that can be obtained.

In the proposed radix factorization, non-trivial twiddle factors are converted into trivial twiddle factors, which can be implemented with simple swapping and 2's complement operations in different stages of the FFT computation.

A hierarchical index mapping rule is used to decompose the radix-2 DIF FFT into multiple regions, and the twiddle factors are extracted using a linear index mapping. This reduces the number of complex twiddle factor multiplications at the different levels, according to the order of the index map. As stated above, this way of configuring the twiddle factors adapts well to architectural changes, as conventional high-radix schemes do. By cascading decomposition levels in the radix-2 DIF FFT through a 3-dimensional linear index map, the number of complex multipliers used for FFT computation is reduced as follows.

\[
n = \frac{N}{2} n_1 + \frac{N}{4} n_2 + n_3, \quad n_1, n_2 \in \{0, 1\}, \quad n_3 = 0, 1, \ldots, \frac{N}{4} - 1
\]
\[
k = k_1 + 2k_2 + 4k_3, \quad k_1, k_2 \in \{0, 1\}, \quad k_3 = 0, 1, \ldots, \frac{N}{4} - 1
\]

(2)
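As a quick sanity check on the index map (2), the Python sketch below enumerates all index triples and confirms that both maps cover \( 0, \ldots, N-1 \) exactly once. It assumes that \( n_1, n_2, k_1, k_2 \) take the values 0 and 1; the function name is hypothetical.

```python
def index_maps(N):
    """Enumerate the 3-D linear index maps of (2) for N divisible by 4."""
    n_map, k_map = {}, {}
    for a in (0, 1):                  # n1 or k1
        for b in (0, 1):              # n2 or k2
            for c in range(N // 4):   # n3 or k3
                n_map[(a, b, c)] = (N // 2) * a + (N // 4) * b + c
                k_map[(a, b, c)] = a + 2 * b + 4 * c
    return n_map, k_map


n_map, k_map = index_maps(16)
# Both maps are one-to-one onto 0..N-1.
print(sorted(n_map.values()) == list(range(16)))
print(sorted(k_map.values()) == list(range(16)))
```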

The corresponding DFT is of the form

\[
X(k_1 + 2k_2 + 4k_3) = \sum_{n_3=0}^{\frac{N}{4}-1} \sum_{n_2=0}^{1} \sum_{n_1=0}^{1} x\left(\frac{N}{2} n_1 + \frac{N}{4} n_2 + n_3\right) W_N^{\left(\frac{N}{2} n_1 + \frac{N}{4} n_2 + n_3\right)(k_1 + 2k_2 + 4k_3)}
\]

(3)

Summing over \( n_1 \) gives the first FFT stage, which has the form

\[
B_{N/2}^{k_1}\left(\frac{N}{4} n_2 + n_3\right) = x\left(\frac{N}{4} n_2 + n_3\right) + (-1)^{k_1} x\left(\frac{N}{4} n_2 + n_3 + \frac{N}{2}\right)
\]

(4)

The twiddle factor that remains after the first stage of the radix-2 DIF FFT is decomposed as follows.

\[
W_N^{\left(\frac{N}{4} n_2 + n_3\right)(k_1 + 2k_2 + 4k_3)} = (-j)^{n_2(k_1 + 2k_2)} \, W_N^{n_3(k_1 + 2k_2)} \, W_{N/4}^{n_3 k_3}
\]

(5)

Substituting equations (4) and (5) into equation (3) and expanding the summation with respect to the index \( n_2 \) gives a set of four DFTs of length \( N/4 \).

\[
X(k_1 + 2k_2 + 4k_3) = \sum_{n_3=0}^{\frac{N}{4}-1} \left[ H_{N/4}^{k_1 k_2}(n_3) \, W_N^{n_3(k_1 + 2k_2)} \right] W_{N/4}^{n_3 k_3}
\]

(6)

The second FFT stage \( H_{N/4}^{k_1 k_2}(n_3) \) is then described as

\[
H_{N/4}^{k_1 k_2}(n_3) = B_{N/2}^{k_1}(n_3) + (-1)^{k_2} (-j)^{k_1} B_{N/2}^{k_1}\left(n_3 + \frac{N}{4}\right)
\]

(7)
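Equations (4), (6) and (7) can be checked numerically. The Python sketch below is a floating-point illustration of the standard radix-\(2^2\) DIF recursion, assuming NumPy and a power-of-two length; the function name `radix22_dif_fft` is hypothetical, and the code is not the Verilog design evaluated in this paper. It applies the two butterflies and then recurses on the four length-\( N/4 \) DFTs of equation (6).

```python
import numpy as np


def radix22_dif_fft(x):
    """Radix-2^2 DIF FFT built from the two butterflies of (4) and (7)."""
    x = np.asarray(x, dtype=complex)
    N = x.size
    if N == 1:
        return x.copy()
    if N == 2:  # leftover radix-2 stage when N is not a power of four
        return np.array([x[0] + x[1], x[0] - x[1]])
    if N % 4:
        raise ValueError("length must be a power of two")
    q = N // 4
    n3 = np.arange(q)
    X = np.empty(N, dtype=complex)
    for k1 in (0, 1):
        # First butterfly, eq. (4): only additions and subtractions.
        B = x[: N // 2] + (-1) ** k1 * x[N // 2:]
        for k2 in (0, 1):
            # Second butterfly, eq. (7): trivial factor (-1)^k2 * (-j)^k1.
            H = B[:q] + (-1) ** k2 * (-1j) ** k1 * B[q:]
            # Only remaining general twiddle: W_N^{n3 (k1 + 2 k2)}.
            W = np.exp(-2j * np.pi * n3 * (k1 + 2 * k2) / N)
            # Eq. (6): a length-N/4 DFT over n3 yields X(k1 + 2 k2 + 4 k3).
            X[k1 + 2 * k2::4] = radix22_dif_fft(H * W)
    return X
```

In this form, general complex multiplications appear only once per pair of butterfly stages, in the factor `W`.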

Applying this decomposition recursively to the remaining DFTs of length \( N/4 \) in equation (6) yields the radix-\(2^2\) FFT algorithm. Here, at stage 1, 50% of the non-trivial twiddle factors are transformed into trivial factors (\( \pm 1, \pm j \)), for which only swapping and sign inversion are required. The resulting algorithm has the multiplicative complexity of radix-4 while retaining the radix-2 butterfly structure.
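The reason trivial factors need no multiplier can be seen from a small example. The Python sketch below multiplies a complex sample, held as separate real and imaginary integers, by a power of \( -j \) using only a swap and a negation; in fixed-point hardware the negation would be a 2's complement operation. The function name and the integer representation are assumptions for illustration.

```python
def multiply_trivial(re, im, quarter_turns):
    """Multiply (re + j*im) by (-j)**quarter_turns without a multiplier."""
    q = quarter_turns % 4
    if q == 0:
        return re, im        # times 1
    if q == 1:
        return im, -re       # times -j: (a + jb)(-j) = b - ja
    if q == 2:
        return -re, -im      # times -1
    return -im, re           # times +j: (a + jb)(j) = -b + ja


print(multiply_trivial(3, -5, 1))  # (3 - 5j) * (-j)
```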

V. RESULTS AND PERFORMANCE ANALYSIS

A. Experimental Setup

Here, the twiddle-factor-normalized FFT model is compared with the basic radix-2 model in terms of both delay and area efficiency. Verilog HDL is used to model the proposed architecture, and FPGA synthesis targeting Cyclone II family devices is used for the comparison. The goal is to attain high performance and reduced hardware complexity with low resource utilization, which is validated from the Quartus II synthesis results. The proposed FFT framework achieves both area and performance efficiency without causing any significant quality degradation.

B. Experimental Results

Here, the radix-\(2^2\) methodology and twiddle factor modeling for FFT computation are validated against a conventional radix-2 FFT core. The hardware complexity reduction is measured by the number of logic elements used. The FPGA synthesis tool is used to generate the hardware utilization report for an Altera Cyclone III EP3C10F256C6 device, as shown in Figure 1. The reduction in critical path delay due to the reduced number of complex multipliers is also demonstrated through delay analysis.

Figure 1: Hardware report summary

To illustrate the efficiency of the radix-\(2^2\) FFT unit over the basic model, hardware synthesis is carried out separately for each. As shown in Table I, besides the complexity reduction at the logic register level, the logic element utilization of the proposed normalized FFT is compared with that of the basic FFT model. In this case the number of integer multipliers required is minimized, and various hardware optimization techniques are also analyzed as a potential extension to complexity reduction.

C. Delay optimization

In general, performance degradation in FFT arithmetic arises from its complex multiplications. Here, only a minimal number of complex multiplications are used in the FFT operation, which improves system performance. Sequentially dependent post-computation operations are also reduced, and this reduction plays a significant role in shortening the overall critical path.

The TimeQuest Timing Analyzer tool is used to evaluate the maximum operating frequency, as shown in Figure 2. The proposed factorization-driven FFT core reaches 80.63 MHz against 74.94 MHz for the conventional model (Table I), an increase of about 7.6% that corresponds to a shorter propagation delay.

Figure 2: Performance reports $F_{max}$ summary

D. Energy Efficiency

To study power dissipation in the FFT architecture, post-simulation is carried out to generate a value change dump (.vcd) file that captures all dynamic transitions during computation. The PowerPlay Power Analyzer tool is used to evaluate power dissipation, and the results are tabulated in Table I. The radix-\(2^2\) model dissipates 98.45 mW against 112.72 mW for the radix-2 FFT architecture, a reduction of about 12.7%. The saving is seen in both static and dynamic power dissipation, as is evident from Figure 3.

Figure 3: Power dissipation report

Table I: Performance comparison of the normalized FFT core with the conventional model

<table>
<thead>
<tr>
<th>FFT arithmetic model</th>
<th>Area (LEs used)</th>
<th>Power (mW)</th>
<th>Speed (MHz)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Radix-2 FFT (Conventional)</td>
<td>685</td>
<td>112.72</td>
<td>74.94</td>
</tr>
<tr>
<td>Factorized FFT</td>
<td>5295</td>
<td>98.45</td>
<td>80.63</td>
</tr>
</tbody>
</table>

VI. CONCLUSION

This paper has presented a twiddle factor normalization scheme for realizing a low-complexity FFT, intended for correlation optimization in CS reconstruction algorithms. An accurate and energy-efficient hardware implementation was adopted to cope with the resource needs of CS analysis, and its functionality was verified using exhaustive test bench stimuli. The hardware utilization, the operating frequency and the power consumption were analyzed in detail through FPGA hardware synthesis. The experimental results show that the proposed radix factorization outperforms the conventional radix-2 FFT core against which it was compared in power consumption and operating frequency.

REFERENCES


AUTHORS PROFILE

A. Radhika was born in Andhra Pradesh, India in 1970. She received her B.Tech. and M.Tech. degrees in Electronics and Communication Engineering from JNTU Hyderabad, Andhra Pradesh, India in 2005 and 2010 respectively, and is currently pursuing her Ph.D. degree at JNTU Kakinada, AP, India. She has 14 years of teaching experience and is currently working as Assistant Professor in the Department of Electronics and Communication Engineering at Anurag College of Engineering, Hyderabad, India. Her research interests include Embedded Systems, VLSI and Signal Processing.

Dr. K. Satya Prasad was born in AP, India in 1955. He obtained his B.Tech. degree from JNTU, Anantapur, AP, India in 1977, his M.E. degree from Madras University, India in 1979 and his Ph.D. from IIT Madras, India in 1989. He started his teaching career at REC, Warangal in 1979 and joined JNT University in 1980. He has rendered the majority of his services to JNTU at Hyderabad, Anantapur and Kakinada in various capacities, viz. Associate Professor, Professor, Head of the Department, Vice Principal and Principal. He has published more than 275 technical papers in national and international journals and conferences and has authored four textbooks. He is a recipient of awards such as Siksha Ratan Puraskar and Best Citizen of India, and a Fellow of professional bodies including IETE, IE and ISTE. He has guided 32 Ph.D. scholars, and presently 35 scholars are working under his supervision. He has a total teaching experience of 38 years and is currently working as Rector at Vignan’s Deemed to be University, Guntur, Andhra Pradesh, India. His areas of research include Communications, Signal Processing, Image Processing, Speech Processing, Neural Networks and Ad-hoc Wireless Networks.

Prof. K. Kishan Rao was born in India in 1943. He received his B.E. and M.E. degrees from Osmania University, Hyderabad, India in 1965 and 1967 respectively. He acquired his Ph.D. degree from IIT Kanpur, India, in 1973. He worked as Principal of NIT Warangal (NITW) and KITS, Warangal. He is currently working as Professor in ECE and Director-Faculty Development at Sreenidhi Institute of Technology and Science, Telangana, India. He has guided 8 Ph.D. scholars, and presently 6 Ph.D. scholars are working with him in their doctoral programs. He has published more than 117 papers (National-53; International-64). He is a senior member of many professional bodies including IEEE, ISTE, IETE and ISOI, India. His research interests include Wireless Communications, Signal Processing Applications, and Cooperative and Mobile Communications.