# High Speed Parallel SAD Architecture Implementation on FPGA for HEVC encoder

### Jaya Koshta, Kavita Khare



Abstract: Video compression is a complex and time-consuming task that generally demands high performance. The Motion Estimation (ME) process in any video encoder is primarily responsible for this performance and contributes significantly to the compression gain. The Summation of Absolute Difference (SAD) is widely applied as the distortion metric for the ME process. In High Efficiency Video Coding (HEVC), the increase in block size to 64×64 for real-time applications, together with the introduction of asymmetric mode partitioning (AMP), makes variable block size motion estimation very complex. This increases computation time and demands significant hardware resources. In this paper, a parallel SAD hardware circuit for the ME process in HEVC is proposed, in which parallelism is exploited at several levels. The proposed circuit has been implemented on a Xilinx Virtex-5 FPGA of the XC5VLX20T family. Synthesis results show that the proposed circuit provides a significant reduction in delay and an increase in operating frequency compared with other parallel architectures.

Keywords: HEVC, Video compression, Motion estimation (ME), Summation of absolute differences (SAD), FPGA.

## I. INTRODUCTION

Applications across a broad spectrum demand low power, less memory and fast transfer rates, while the demand for high-resolution video keeps growing. Together these create a strong need for better video compression efficiency. High Efficiency Video Coding (HEVC), or H.265, is a video compression standard designed to considerably improve coding efficiency over its predecessor, Advanced Video Coding (AVC), or H.264. HEVC is a block-based standard designed to support higher resolutions, and it achieves a bit-rate saving of about fifty percent over AVC/H.264 at similar video quality [1][2]. A number of advanced techniques have been adopted in H.265/HEVC to raise its performance. These techniques also extend to the Motion Estimation (ME) process, which is the most complex and time-consuming block in the video prediction process.

The function of the ME unit is to remove temporal redundancy between video frames. It does so by finding, for every block of the current frame, the best-matching block within a selected region of the reference frame, so that the constructed frame carries the least residual information [3]. The ME process thus yields the motion vectors by locating the best match for each block of the current frame in the selected region of the reference frame. For block-based ME, the Summation of Absolute Difference (SAD) is the generally adopted metric; it adds up the absolute differences between corresponding pixels of the current and reference blocks.

Revised Manuscript Received on August 30, 2019. \* Correspondence Author

Jaya Koshta\*, Department of Electronics and Comm. Engg., MANIT, Bhopal, India. Email: jayakoshta15@gmail.com

Kavita Khare, Department of Electronics and Comm. Engg., MANIT, Bhopal, India. Email: kavita\_khare1@yahoo.co.in

© The Authors. Published by Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

The SAD between the current and reference blocks is computed as shown in equation (1),

$$\mathrm{SAD} = \sum_{x=1}^{A} \sum_{y=1}^{B} \left| IB(x, y) - RB(x, y) \right| \qquad (1)$$

where IB is the current block, RB is the reference block, A × B is the block size of the current and reference blocks, and x, y are the two-dimensional coordinates within the block.
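As an illustration of equation (1), the following minimal Python sketch computes the SAD of two equally sized blocks in software. The function name `sad` and the sample pixel values are assumptions made for this example; they are not taken from the hardware design described in this paper.

```python
def sad(current_block, reference_block):
    """Sum of absolute differences between two equally sized 2-D blocks.

    Mirrors equation (1): the double sum over x and y of |IB(x, y) - RB(x, y)|.
    Blocks are given as lists of rows of integer pixel values.
    """
    if len(current_block) != len(reference_block):
        raise ValueError("blocks must have the same number of rows")
    total = 0
    for cur_row, ref_row in zip(current_block, reference_block):
        if len(cur_row) != len(ref_row):
            raise ValueError("rows must have the same length")
        for cur_pixel, ref_pixel in zip(cur_row, ref_row):
            total += abs(cur_pixel - ref_pixel)
    return total


# Hypothetical 2x2 blocks, chosen only for illustration.
ib = [[10, 20], [30, 40]]
rb = [[12, 18], [30, 45]]
example_sad = sad(ib, rb)  # |10-12| + |20-18| + |30-30| + |40-45| = 9
```

A SAD of zero means the two blocks are identical, so a block-matching search keeps the candidate with the smallest SAD.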

In HEVC, video frames are encoded using coding units (CU) and transform units (TU). Each frame is segmented into CUs whose dimensions range from 8×8 to 64×64 pixels. The largest CU dimension of 64×64 adds to the complexity of the ME unit design. HEVC uses 8 different ways of partitioning the inter prediction unit at each CU level, as presented in Figure 1. At stage 0, 2M = 64; at stage 1, 2M = 32; and so on. Partitions (i)-(iv) are known as Symmetric-Mode-Partition (SMP), whereas (v)-(viii) are known as Asymmetric-Mode-Partition (AMP). HEVC supports variable block sizes ranging from 4x8 and 8x4 up to 64x64, as shown in Table 1. Variable block size ME substantially improves compression performance, but it also raises the computational complexity of encoding.

The expansion of the coding block size to 64x64 in HEVC, compared with 16x16 in H.264/AVC, together with the introduction of asymmetric mode partitioning, therefore makes the ME process very complex. The number of SAD computations for the ME process increases greatly, which in turn increases processing delay, power consumption and hardware complexity. Different VLSI architectures for SAD computation in the HEVC ME process are available in the literature, and their hardware implementations involve many tradeoffs between speed, power and area.






Figure 1 Segmentation dimensions in HEVC
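The eight inter prediction partition types can be sketched in Python as follows. The mode names follow the usual HEVC naming (written here with M instead of N to match the text), and the order in which they correspond to labels (i)-(viii) of Figure 1 is an assumption, since the figure is not reproduced here. The function name `inter_partitions` is hypothetical.

```python
def inter_partitions(two_m):
    """Return {mode: [(width, height), ...]} for a two_m x two_m coding unit.

    The first four modes are symmetric (SMP); the last four are asymmetric
    (AMP) and split the CU at one quarter of its width or height.
    """
    m = two_m // 2   # half of the CU size
    q = two_m // 4   # quarter of the CU size, used by the AMP modes
    return {
        # Symmetric-Mode-Partition (SMP)
        "2Mx2M": [(two_m, two_m)],
        "2MxM": [(two_m, m)] * 2,
        "Mx2M": [(m, two_m)] * 2,
        "MxM": [(m, m)] * 4,
        # Asymmetric-Mode-Partition (AMP)
        "2MxnU": [(two_m, q), (two_m, two_m - q)],   # small part on top
        "2MxnD": [(two_m, two_m - q), (two_m, q)],   # small part at bottom
        "nLx2M": [(q, two_m), (two_m - q, two_m)],   # small part on the left
        "nRx2M": [(two_m - q, two_m), (q, two_m)],   # small part on the right
    }


# Stage 0 (2M = 64) and stage 1 (2M = 32), as in the text.
stage0 = inter_partitions(64)
stage1 = inter_partitions(32)
```

For 2M = 64 the asymmetric modes produce the 64x16 and 16x64 blocks listed in the table that follows, and for 2M = 32 they produce the 32x8 and 8x32 blocks.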

Table 1 Overall number of SAD calculations

| Block dimensions       | Number of SADs                 | Block dimensions | Number of SADs |
|------------------------|--------------------------------|------------------|----------------|
| 64x64                  | 1                              | 8x32 (left)      | 8              |
| 32x64                  | 2                              | 16x16            | 16             |
| 64x32                  | 2                              | 8x16             | 32             |
| 64x16 (up)             | 2                              | 16x8             | 32             |
| 64x16 (down)           | 2                              | 16x4 (up)        | 32             |
| 16x64 (right)          | 2                              | 16x4 (down)      | 32             |
| 16x64 (left)           | 2                              | 4x16 (right)     | 32             |
| 32x32                  | 4                              | 4x16 (left)      | 32             |
| 16x32                  | 8                              | 8x8              | 64             |
| 32x16                  | 8                              | 4x8              | 128            |
| 32x8 (up)              | 8                              | 8x4              | 128            |
| 32x8 (down)            | 8                              | 4x4              | 256            |
| 8x32 (right)           | 8                              |                  |                |
| Overall number of SADs | 316 Square + 340 SMP + 168 AMP |                  |                |

## II. RELATED WORK

For the ME process, different VLSI architectures for SAD computation are available in the literature, covering both ASIC [4-5] and FPGA [6-7] implementations. In [8], propagate partial SAD and SAD tree architectures were presented; these architectures are generally used for low-resolution video and less complex applications. In [9], a highly parallel SAD architecture for motion estimation in an HEVC encoder is presented, in which sixty-four processing units operate in parallel to compute the SAD values in the ME unit. The architecture uses two separate memory banks, one for the reference CTB and the other for the current CTB. In [10] and [11], the parallel implementation requires a data bus width of 1024 bits for a one-stage design and 4096 bits for a two-stage design. Such a large data width makes the design impractical. In [12], a novel high-speed SAD architecture for variable block size ME in an HEVC encoder is proposed to reduce both hardware resources and processing time.

In this paper, a highly parallelized hardware circuit for the SAD unit is proposed. Synthesis results show that the proposed circuit processes video data with a significant reduction in delay and a higher operating frequency than existing designs.

## III. PROPOSED CIRCUIT

This paper presents the implementation of a parallel SAD circuit that calculates all possible SAD values for blocks from 4x4 to 64x64. In the proposed circuit, both the current and the reference 64x64 blocks are fetched from the memory bank at the same time instant for processing, so the circuit can be used with all types of fast and full search algorithms for motion estimation. To process each pair of 64x64 blocks (current and reference), the block is first segmented into sixteen 16x16 blocks, and each 16x16 block is composed of sixteen 4x4 SAD units operating in parallel, as shown in Figure 2.



Figure 2 Memory Organization of 64x64 block
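The two-level segmentation described above can be sketched in Python as follows. The raster (row-major) ordering of the tiles and the helper name `split_block` are assumptions for illustration only; the actual memory organization is the one given in Figure 2.

```python
def split_block(top, left, size, sub_size):
    """Top-left (row, column) coordinates of the sub_size x sub_size tiles
    that cover a size x size block whose own top-left corner is (top, left).
    Tiles are listed in raster order (an assumption of this sketch)."""
    return [(top + r, left + c)
            for r in range(0, size, sub_size)
            for c in range(0, size, sub_size)]


# Level 1: a 64x64 block is segmented into sixteen 16x16 blocks.
blocks_16 = split_block(0, 0, 64, 16)

# Level 2: each 16x16 block holds sixteen 4x4 blocks.
blocks_4 = {origin: split_block(origin[0], origin[1], 16, 4)
            for origin in blocks_16}

total_4x4 = sum(len(tiles) for tiles in blocks_4.values())  # 16 * 16 = 256
```

The count of 256 tiles matches the 256 4x4 SAD units that the design runs in parallel.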

The proposed architecture is mainly composed of the current block, the reference block, the absolute difference (AD) circuit, the processing element (PE), the 4x4 SAD block, the 16x16 SAD array, a merger tree for SAD aggregation and a register bank for storage of all SAD combinations from 4x4 to 64x64 [12]. The AD circuit is the basic building block of the 4x4 SAD calculation, which adds up the absolute differences between corresponding pixels of the current and reference blocks. Figure 3 shows the block diagram of the AD circuit, implemented using a comparator and a subtractor: the comparator determines which of the current and reference pixels is larger, so that the absolute difference can be computed by subtracting the smaller number from the larger one.



Figure 3 AD circuit block diagram
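A behavioural sketch of the comparator-plus-subtractor idea is given below in Python. It is a software model only, not the Verilog of the design; the function name is hypothetical, and treating pixels as unsigned integers is an assumption, since the pixel width is not stated in the text.

```python
def absolute_difference(current_pixel, reference_pixel):
    """Model of the AD circuit: compare first, then subtract.

    Pixels are assumed to be non-negative integers (unsigned samples).
    """
    # Comparator: decide which of the two inputs is larger.
    if current_pixel >= reference_pixel:
        larger, smaller = current_pixel, reference_pixel
    else:
        larger, smaller = reference_pixel, current_pixel
    # Subtractor: larger minus smaller is never negative, so an unsigned
    # subtractor suffices and no sign handling is needed afterwards.
    return larger - smaller
```

Ordering the operands before the subtraction is what lets the circuit avoid a signed result followed by a separate negation step.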

For the 4x4 SAD calculation, a Processing Element (PE) block is first implemented as shown in Figure 4; it uses four AD circuits to calculate the SAD of one row of a 4x4 block.






For the 4x4 SAD calculation, the PE blocks operate in parallel, and the SAD values of the rows are added to give the 4x4 SAD, which forms the primitive block for the 64x64 SAD computation.
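The PE and 4x4 SAD structure can be modelled in Python as shown below. This is a sequential software sketch of hardware that runs in parallel; the function names and the sample pixel values are assumptions for illustration.

```python
def absolute_difference(a, b):
    """AD circuit model: larger input minus smaller input."""
    return a - b if a >= b else b - a


def processing_element(cur_row, ref_row):
    """One PE: four AD circuits whose outputs are summed into a row SAD."""
    if len(cur_row) != 4 or len(ref_row) != 4:
        raise ValueError("a PE handles exactly one row of four pixels")
    return sum(absolute_difference(c, r) for c, r in zip(cur_row, ref_row))


def sad_4x4(cur_block, ref_block):
    """Four PEs, one per row, followed by an adder over the four row SADs."""
    if len(cur_block) != 4 or len(ref_block) != 4:
        raise ValueError("a 4x4 SAD block needs exactly four rows")
    row_sads = [processing_element(c, r) for c, r in zip(cur_block, ref_block)]
    return sum(row_sads)


# Hypothetical pixel values, chosen only for illustration.
cur = [[10] * 4, [20] * 4, [30] * 4, [40] * 4]
ref = [[25] * 4 for _ in range(4)]
example = sad_4x4(cur, ref)  # row SADs 60, 20, 20, 60 -> 160
```

In hardware the four PEs evaluate their rows at the same time; the list comprehension here only reproduces the arithmetic, not the timing.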





Merger Tree - Because HEVC supports variable block size SAD calculation, a merger tree is used to satisfy this requirement. For this purpose, 256 4×4 SAD values are computed and saved in the register bank for further processing. HEVC uses a recursive quad-tree coding unit arrangement; that is, the SAD of one $2k\times 2k$ block requires the SADs of four $k\times k$ blocks, as shown in Figure 5. The units of the various block sizes are connected in the merger tree in a hierarchical arrangement. For 4×4, 8×8, 16×16, 32×32 and 64×64 blocks, there are 256, 64, 16, 4 and 1 structures respectively.



Figure 5 Hierarchical SAD Structure
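The quad-tree aggregation can be sketched in Python as follows. The sketch covers only the square block sizes; the sums for the non-square SMP and AMP partitions are omitted. The function names and the grid layout (a 16 x 16 grid of 4x4 SAD values in raster order) are assumptions for illustration.

```python
def merge_level(sads):
    """Combine a k x k grid of SADs into a (k/2) x (k/2) grid by adding
    each 2x2 group of neighbouring values (four k x k SADs per 2k x 2k SAD)."""
    half = len(sads) // 2
    return [[sads[2 * r][2 * c] + sads[2 * r][2 * c + 1]
             + sads[2 * r + 1][2 * c] + sads[2 * r + 1][2 * c + 1]
             for c in range(half)]
            for r in range(half)]


def merger_tree(sads_4x4):
    """Build every square level from a 16 x 16 grid of 4x4 SAD values.

    Returns {block_size: grid_of_sads} for block sizes 4, 8, 16, 32 and 64.
    """
    levels = {4: sads_4x4}
    size, grid = 4, sads_4x4
    while len(grid) > 1:
        grid = merge_level(grid)
        size *= 2
        levels[size] = grid
    return levels


# Hypothetical input: every 4x4 SAD equals 1, so each level simply counts
# how many 4x4 blocks it contains.
levels = merger_tree([[1] * 16 for _ in range(16)])
counts = {size: len(grid) * len(grid[0]) for size, grid in levels.items()}
# counts == {4: 256, 8: 64, 16: 16, 32: 4, 64: 1}
```

Because every larger SAD is a sum of already computed 4x4 SADs, no pixel difference is ever calculated twice.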

## IV. RESULTS

The proposed circuit is described in the Verilog hardware description language and synthesized for a Xilinx XC5VLX20T FPGA. Simulation and functional verification of the circuit were carried out using the Xilinx ISE 14.2 tool. Figure 6 shows the simulation result of the proposed circuit, where 256 4x4 SADs are computed in parallel and the timing waveform shows the outputs of the 4x4 SADs. Figure 7 shows the RTL schematic of the 4x4 SAD block, where the Processing Elements (PE) and adders operate in parallel, and Figure 8 shows the RTL schematic of the 16x16 block, where the pins show its inputs and outputs in terms of 4x4 SAD blocks. The 64x64 block is implemented using these 16x16 blocks operating in parallel.

Table 2 compares the proposed work with existing work. The proposed circuit achieves an operating frequency of 425.51 MHz with a delay of 9.091 ns, at the cost of a larger area than the other architectures.



Figure 6 Simulation result of proposed circuit



Figure 7 RTL schematic of 4 x 4 SAD block



Figure 8 RTL schematic of 16 x 16 block


|                                  | [9]          | [12]         | Proposed     |
|----------------------------------|--------------|--------------|--------------|
| Technology                       | 65 nm        | 65 nm        | 65 nm        |
| Number of registers              | 38032        | 8841         | 65536        |
| Number of LUTs                   | 25173        | 17992        | 307264       |
| Maximum frequency (MHz)          | 348          | 190.785      | 425.51       |
| Block dimensions                 | 4x4 to 64x64 | 4x4 to 64x64 | 4x4 to 64x64 |
| Delay for each 64x64 block (ns)  | 45.98        | 167.73       | 9.019        |
| AMP support                      | Yes          | Yes          | Yes          |

Table 2 Comparison with related work
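To make the comparison concrete, the short Python sketch below derives delay and frequency ratios from the figures listed in Table 2. The ratios are computed here for illustration and are not reported in the paper itself; the dictionary layout and the function name are assumptions.

```python
# Maximum frequency (MHz) and delay per 64x64 block (ns), copied from Table 2.
TABLE2 = {
    "[9]": {"fmax_mhz": 348.0, "delay_ns": 45.98},
    "[12]": {"fmax_mhz": 190.785, "delay_ns": 167.73},
    "Proposed": {"fmax_mhz": 425.51, "delay_ns": 9.019},
}


def improvement_over(reference, proposed="Proposed"):
    """Ratios of the proposed design relative to one reference design."""
    ref, new = TABLE2[reference], TABLE2[proposed]
    return {
        # How many times shorter the proposed delay is.
        "delay_ratio": ref["delay_ns"] / new["delay_ns"],
        # How many times higher the proposed maximum frequency is.
        "frequency_ratio": new["fmax_mhz"] / ref["fmax_mhz"],
    }


for name in ("[9]", "[12]"):
    r = improvement_over(name)
    print(f"{name}: delay {r['delay_ratio']:.2f}x lower, "
          f"fmax {r['frequency_ratio']:.2f}x higher")
```

With the tabulated values, the delay per 64x64 block is roughly 5 times lower than in [9] and roughly 18.6 times lower than in [12].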

## V. CONCLUSION

An efficient real-time parallel SAD circuit for ME in HEVC has been presented. The proposed circuit has been prototyped, simulated and synthesized on a Xilinx Virtex-5 FPGA of the XC5VLX20T family. The design uses 256 4x4 SAD basic building blocks running in parallel to determine the 256 SAD values of a 64x64 block. The proposed VLSI architecture has an operating frequency of 425.51 MHz with a delay of 9.091 ns, but occupies a larger area than other architectures. The proposed circuit can be used with all kinds of fast and full search algorithms for block motion estimation.

## REFERENCES

- [1] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649-1668, December 2012.
- [2] I. Richardson, "HEVC: An introduction to high efficiency video coding," 2001, https://www.vcodex.com/h265.html
- [3] N. Purnachand, L. N. Alves, and A. Navarro, "Fast Motion Estimation Algorithm for HEVC," IEEE International Conference on Consumer Electronics-Berlin (ICCE-Berlin), September 2012.
- [4] Zhenyu Liu, Goto, S., Ikenaga, T., "Optimization of Propagate Partial SAD and SAD tree motion estimation hardwired engine for H.264," IEEE Inter. Conf. on Comp. Des., pp. 328-333, Oct. 2008.
- [5] Tung-Chien Chen, Yu-Han Chen, Sung-Fang Tsai, Shao-Yi Chien, Liang-Gee Chen, "Fast Algorithm and Architecture Design of Low Power Integer Motion Estimation for H.264/AVC," IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 5, May 2007.
- [6] Rehman, S., Young, R., Chatwin, C., Birch, P., "An FPGA Based Generic Framework for High Speed Sum of Absolute Difference Implementation," Europ. Jour. Scient. Res., vol. 33, no. 1, 2009.
- [7] Niitsuma, H., Maruyama, T., "Sum of Absolute Difference Implementations for Image Processing on FPGAs," Inter. Conf. on Field Programmable Logic and Applications (FPL), Sept. 2010.
- [8] Zhenyu Liu, Goto, S., Ikenaga, T., "Optimization of Propagate Partial SAD and SAD tree motion estimation hardwired engine for H.264," IEEE Inter. Conf. on Comp. Des., pp. 328-333, Oct. 2008.
- [9] Ahmed Medhat, Ahmed Shalaby, Mohammed S. Sayed, Maha Elsabrouty and Farhad Mehdipour, "A Highly Parallel SAD Architecture for Motion Estimation in HEVC Encoder," Circuits and Systems (APCCAS), IEEE Asia Pacific Conference, pp. 280-283, 2014.
- [10] P. Nalluri, L. N. Alves, A. Navarro, "A novel SAD architecture for variable block size motion estimation in HEVC video coding," International Symposium on System on Chip (SoC), pp. 1-4, Oct. 2013.
- [11] Purnachand Nalluri, Luis Nero Alves, Antonio Navarro, "High speed SAD architectures for variable block size motion estimation in HEVC video coding," Image Processing, IEEE International Conference, pp. 1233-1237, 2014.
- [12] Dinh, Vu Nam, et al., "High speed SAD architecture for variable block size motion estimation in HEVC encoder," Communications and Electronics (ICCE), 2016 IEEE Sixth International Conference on, IEEE, 2016.
- [13] Amit M. Joshi, Mohd. Samar Ansari, Chitrakant Sahu, "VLSI Architecture of High Speed SAD for High Efficiency Video Coding (HEVC) Encoder," IEEE, 2018.
- [14] Jarno Vanne, Eero Aho, Timo D. Hamalainen and Kimmo Kuusilinna, "A High-Performance Sum of Absolute Difference Implementation for Motion Estimation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 7, pp. 876-883, 2006.

Retrieval Number F8380088619/2019©BEIESP DOI: 10.35940/ijeat.F8380.088619 Journal Website: www.ijeat.org

## AUTHORS PROFILE



Jaya Koshta received her M.Tech degree in Electronics and Communication Engineering in 2009. She is currently pursuing a Ph.D. in the Department of Electronics and Communication Engineering, MANIT, Bhopal.



Dr. Kavita Khare received her B.Tech degree in Electronics and Communication Engineering in 1989, her M.Tech. degree in Digital Communication Systems in 1993 and her Ph.D. degree in the field of VLSI Design in 2004. She has nearly 200 publications in various international conferences and journals, including IEEE Transactions on Circuits and Systems II, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Circuits, Systems, and Signal Processing (Springer), and journals from Elsevier, Oxford, Taylor and Francis, Hindawi and Wiley. She has contributed 6 book chapters to renowned publications and holds a patent with Intellectual Property India, Government of India. She received the Best Paper Award at the International Conference MS-05 by AMSE, France. She is currently working as Associate Professor of Electronics and Communication Engineering at MANIT, Bhopal, and has guided around 45 post-graduate and 17 doctoral theses. Her fields of interest are VLSI design and communication systems. Her research mainly covers the design of low-power VLSI circuits and arithmetic circuits, and various communication algorithms related to synchronization, estimation and routing circuits. Dr. Khare is a Fellow of IETE (India) and a Life Member of ISTE.


