PROCEEDINGS OF THE SIXTH INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED, GPU AND CLOUD COMPUTING FOR ENGINEERING
Edited by: P. Iványi and B.H.V. Topping
Evaluation of FPGA-based motion estimation module for HEVC video coding standard
O. López1, R. Gutierrez1, E. Alcocer, V. Galiano1, H. Migallon1, G. Van Wallendael2 and M.P. Malumbres1
1Miguel Hernández University, Spain
O. López, R. Gutierrez, E. Alcocer, V. Galiano, H. Migallon, G. Van Wallendael, M.P. Malumbres, "Evaluation of FPGA-based motion estimation module for HEVC video coding standard", in P. Iványi, B.H.V. Topping, (Editors), "Proceedings of the Sixth International Conference on Parallel, Distributed, GPU and Cloud Computing for Engineering", Civil-Comp Press, Stirlingshire, UK, Paper 29, 2019. doi:10.4203/ccp.112.29
Keywords: HEVC, FPGA, video coding, motion estimation, special purpose hardware.
The High Efficiency Video Coding (HEVC) standard was developed by the Joint Collaborative Team on Video Coding (JCT-VC) to address current and future market trends such as 4K and 8K video resolutions. HEVC nearly doubles the coding efficiency of its predecessor, H.264/AVC, at similar visual quality, but at the expense of a significant increase in coding complexity.
As in previous video coding standards, Motion Estimation (ME) is the most computationally demanding task of the encoder, accounting for more than 90% of the encoding time. In HEVC, this complexity stems from several factors: (a) the large set of Coding Tree Unit (CTU) partitioning modes, (b) the use of multiple reference frames, and (c) the variable size of Coding Units (CUs). In addition, HEVC adopts Variable Block Size Motion Estimation (VBSME) to improve coding efficiency.
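To make the cost of ME concrete, the following is a minimal software sketch of integer-pel full-search block matching with the Sum of Absolute Differences (SAD) criterion, for a single block size. It is not the search used by the HEVC reference software (which uses a faster search pattern and adds a motion-vector rate term to the cost); the function names `sad` and `full_search`, the exhaustive search and the `search_range` parameter are illustrative assumptions. VBSME repeats this kind of search for every block size and partition, which is where the cost multiplies.

```python
from typing import Optional

import numpy as np


def sad(block: np.ndarray, candidate: np.ndarray) -> int:
    """Sum of Absolute Differences between two equally sized pixel blocks."""
    # Widen to int32 so subtracting uint8 pixels cannot wrap around.
    return int(np.abs(block.astype(np.int32) - candidate.astype(np.int32)).sum())


def full_search(cur: np.ndarray, ref: np.ndarray, x: int, y: int,
                size: int, search_range: int) -> tuple[tuple[int, int], Optional[int]]:
    """Exhaustive integer-pel search for the size x size block at (x, y) of `cur`.

    Returns ((dx, dy), best_sad); candidates not fully inside `ref` are skipped.
    The cost is the pure SAD (hypothetical simplification: no motion-vector rate term).
    """
    block = cur[y:y + size, x:x + size]
    height, width = ref.shape
    best_mv, best_sad = (0, 0), None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            ry, rx = y + dy, x + dx
            if ry < 0 or rx < 0 or ry + size > height or rx + size > width:
                continue  # candidate block would fall outside the reference frame
            cost = sad(block, ref[ry:ry + size, rx:rx + size])
            if best_sad is None or cost < best_sad:
                best_mv, best_sad = (dx, dy), cost
    return best_mv, best_sad


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
    # Shift the content so that cur[r, c] == ref[r - 2, c + 3]: true MV is (dx, dy) = (3, -2).
    cur = np.roll(ref, shift=(2, -3), axis=(0, 1))
    print(full_search(cur, ref, 16, 16, 8, 4))  # ((3, -2), 0)
```

Each candidate costs size² absolute differences, so a ±64 full search for a single 64×64 block already needs 129² × 4096 ≈ 68 million of them, before any smaller block or partition is considered.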
In previous work, we presented a hardware architecture that performs ME computation using FPGA technology, introducing two novel techniques: (a) a new Sum of Absolute Differences (SAD) adder tree structure, and (b) a new memory scan order. The SAD adder tree performs the additions at the first level of the tree starting from the maximum CTU size, halving the number of additions at each subsequent level. Unlike other state-of-the-art designs, this approach takes advantage of the resources provided by the FPGA to obtain the minimum possible latency when calculating the SADs of all levels and partitions of a CTU. As a result, the SADs of asymmetric partitions are obtained quickly and efficiently. Regarding the new memory scan order, a set of reconfigurable shift registers and processing elements stores the pixels needed from both the reference and the current CTU, keeping them always available for calculating SADs and Motion Vectors (MVs) and thus avoiding external memory accesses.
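The point of an adder tree is that SADs of small blocks are computed once and then reused for every larger block and partition. The sketch below models that reuse in software under our own assumptions: 4×4-pixel leaf SADs (`BASE`), square CUs aligned to that grid, and plain 2×2 merging between quadtree levels. It is a generic bottom-up aggregation, not a model of the authors' FPGA tree or its addition order, and the function names are hypothetical.

```python
import numpy as np

BASE = 4  # assumed leaf size: SADs are first computed on 4x4-pixel sub-blocks


def base_sads(cur_ctu: np.ndarray, ref_cand: np.ndarray) -> np.ndarray:
    """Leaf level: SAD of every BASE x BASE sub-block for one candidate position."""
    diff = np.abs(cur_ctu.astype(np.int64) - ref_cand.astype(np.int64))
    n = diff.shape[0] // BASE
    # View the CTU as an n x n grid of BASE x BASE tiles and sum inside each tile.
    return diff.reshape(n, BASE, n, BASE).sum(axis=(1, 3))


def quadtree_levels(leaves: np.ndarray) -> list[np.ndarray]:
    """Symmetric levels, bottom-up: each parent SAD is the sum of its 2x2 children."""
    levels = [leaves]
    while levels[-1].shape[0] > 1:
        m = levels[-1].shape[0] // 2
        levels.append(levels[-1].reshape(m, 2, m, 2).sum(axis=(1, 3)))
    return levels


def partition_sads(leaves: np.ndarray, gx: int, gy: int, size: int) -> dict:
    """SADs of the symmetric and asymmetric (AMP) PUs of one CU, reusing leaf SADs.

    (gx, gy) is the CU's top-left corner in leaf units; size is the CU side in pixels.
    """
    assert size >= 16, "AMP is only defined for CUs of 16x16 and larger"
    k = size // BASE  # CU side in leaves
    q = k // 4        # a quarter of the CU side in leaves
    cu = leaves[gy:gy + k, gx:gx + k]

    def rows(a: int, b: int) -> int:
        # Horizontal strip made of leaf rows a..b-1.
        return int(cu[a:b, :].sum())

    def cols(a: int, b: int) -> int:
        # Vertical strip made of leaf columns a..b-1.
        return int(cu[:, a:b].sum())

    return {
        "2Nx2N": int(cu.sum()),
        "2NxN": (rows(0, k // 2), rows(k // 2, k)),
        "Nx2N": (cols(0, k // 2), cols(k // 2, k)),
        "2NxnU": (rows(0, q), rows(q, k)),
        "2NxnD": (rows(0, k - q), rows(k - q, k)),
        "nLx2N": (cols(0, q), cols(q, k)),
        "nRx2N": (cols(0, k - q), cols(k - q, k)),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    cur = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
    ref = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
    leaves = base_sads(cur, ref)
    print([lvl.shape for lvl in quadtree_levels(leaves)])
    print(partition_sads(leaves, 0, 0, 64))
```

Because the asymmetric partitions 2NxnU, 2NxnD, nLx2N and nRx2N split a CU at one quarter of its side, each of their SADs is a sum of leaf SADs, so no pixel difference is recomputed. HEVC only allows these asymmetric partitions for CUs of 16×16 and larger, which the sketch enforces with its `size >= 16` check.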
In this work, we propose an architecture in which an embedded Linux running on the Xilinx Zynq-7 SoC Mini-ITX board manages, through a Direct Memory Access (DMA) module, the transfers between our previous hardware design (the Integer Motion Estimation (IME) SAD module) and a Double Data Rate (DDR) memory. We have developed an application that checks the results of our IME hardware design against those provided by the HEVC reference software model, and evaluates how the CTU size and the DMA transfers of that CTU and its reference window affect both the rate/distortion (R/D) performance and the throughput of our system. As a result, by including our hardware IME proposal in the HEVC reference software model, we can speed up the encoding time by a factor of 558.
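To illustrate why CTU size matters for the DMA traffic, the following back-of-the-envelope sketch counts the bytes moved per CTU and per frame. Its assumptions are ours, not the paper's: 8-bit luma only, a single reference frame, a square reference window of side CTU + 2 × search range with a ±32-pixel range, no reuse of window pixels shared between adjacent CTUs, and no special handling at frame borders. The function names are hypothetical.

```python
import math


def ctu_transfer_bytes(ctu: int, search_range: int, bytes_per_pixel: int = 1) -> dict[str, int]:
    """Bytes sent over DMA for one CTU: the current block plus its reference window.

    Assumes luma only, one reference frame and no reuse of window pixels
    shared with adjacent CTUs.
    """
    current = ctu * ctu * bytes_per_pixel
    side = ctu + 2 * search_range  # window covers every candidate within +/- search_range
    window = side * side * bytes_per_pixel
    return {"current": current, "window": window, "total": current + window}


def ctus_per_frame(width: int, height: int, ctu: int) -> int:
    """CTUs covering the frame; partial CTUs at the borders are counted as whole ones."""
    return math.ceil(width / ctu) * math.ceil(height / ctu)


def bytes_per_frame(width: int, height: int, ctu: int, search_range: int) -> int:
    """Total DMA bytes for one frame under the assumptions above."""
    return ctus_per_frame(width, height, ctu) * ctu_transfer_bytes(ctu, search_range)["total"]


if __name__ == "__main__":
    for ctu in (64, 32, 16):
        print(ctu, bytes_per_frame(1920, 1080, ctu, 32))
    # prints 64 10444800, then 32 20889600, then 16 54312960 (one pair per line)
```

Under these assumptions a 1920×1080 frame needs about 10.4 MB of transfers with 64×64 CTUs, but about 20.9 MB with 32×32 CTUs and 54.3 MB with 16×16 CTUs, because the window margin is fetched again for every CTU; this is one reason CTU size can affect system throughput.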