Computational Technology Resources - CCP

Keywords: computational fluid dynamics, graphics processing unit, lattice Boltzmann, high-performance computing, STREAM benchmarks, CUDA.

Summary

We have implemented a lattice Boltzmann method (LBM) flow solver based on the D3Q19 stencil on nVIDIA graphics processing units (GPUs) using the compute unified device architecture (CUDA [1]) framework. Compared to general-purpose processors, GPUs impose restrictive constraints that must be followed to obtain optimal performance. The basic programming guidelines we present in this report for our LBM implementation apply similarly to other codes.

Data alignment in main memory and aligned access patterns within specific groups of threads are frequent prerequisites for efficient memory access on GPUs. For the LBM we have chosen a combination of data layout and parallelization approach that easily guarantees aligned data access for the so-called collision part of the algorithm. In the subsequent propagation step, the updated values of each LB cell need to be propagated to neighbouring cells, potentially breaking the alignment requirements. Hence, the propagated values are passed through on-chip shared memory, reordered there, and stored back to device memory in accordance with the alignment boundaries [3].
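The alignment issue can be illustrated with a small host-side sketch. It is not the authors' code: the structure-of-arrays layout with x as the fastest index, the lattice dimensions, the group size of 16 threads, and the helper names `soa_index` and `is_aligned` are all assumptions made for illustration. The sketch only computes element offsets and checks whether a group of threads would touch one contiguous, aligned range.

```python
# Sketch only: layout, sizes and the 16-thread group are assumptions,
# not taken from the paper.
NX, NY, NZ, Q = 64, 4, 4, 19   # hypothetical lattice; D3Q19 has 19 directions
GROUP = 16                     # assumed number of threads that must access aligned data


def soa_index(q, x, y, z):
    """Element offset of distribution q of cell (x, y, z); x runs fastest."""
    return ((q * NZ + z) * NY + y) * NX + x


def is_aligned(offsets):
    """True if the offsets are consecutive and start on a GROUP boundary."""
    consecutive = all(b - a == 1 for a, b in zip(offsets, offsets[1:]))
    return consecutive and offsets[0] % GROUP == 0


# Collision: thread t of a group handles cell x = x0 + t and reads its own cell.
x0, y, z, q = 16, 1, 2, 5
collide = [soa_index(q, x0 + t, y, z) for t in range(GROUP)]

# Propagation along +x: the same threads write to the neighbour cell x + 1.
propagate_x = [soa_index(q, x0 + t + 1, y, z) for t in range(GROUP)]

# Propagation along +y only changes the row, so the start stays on a boundary
# as long as NX is a multiple of GROUP.
propagate_y = [soa_index(q, x0 + t, y + 1, z) for t in range(GROUP)]

print("collision aligned:      ", is_aligned(collide))
print("x-propagation aligned:  ", is_aligned(propagate_x))
print("y-propagation aligned:  ", is_aligned(propagate_y))
```

Under these assumptions only the shifts along the fastest index break alignment, which is the case where staging values in shared memory and writing them back in aligned order would help.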

The D3Q19 model leads to increased register usage compared to stencils with fewer discrete velocities, as used by, e.g., Tölke et al. [3]. The straightforward implementation of the LBM required a total of 70 registers per thread, which severely restricted the level of thread concurrency and resulted in inefficient use of the main memory bandwidth. By replacing the standard array accesses with manual index calculations that repeatedly store the current index in the same variable, the register usage of the kernel was reduced by nearly 50%. This translated into a similar performance boost (2x), to a final 200 FluidMLUPS (400 FluidMLUPS), i.e. million fluid lattice cell updates per second, on the 8800 GTX (GTX 280), equivalent to approximately 40 GFlop/s (80 GFlop/s) of sustained single-precision performance. These performance numbers are in line with our simple performance model, which is based on the memory bandwidth measured by the STREAM benchmarks [4].
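The relation between the quoted update rates and floating-point rates can be checked with a few lines of arithmetic. The sketch below also estimates the memory traffic these rates would imply, assuming that every cell update loads and stores each of the 19 single-precision distributions exactly once. That traffic count is an assumption of this sketch and is not stated in the summary; the function name `implied` is hypothetical.

```python
# Worked arithmetic; the 2 * 19 * 4 bytes per update is an assumption.
Q = 19                 # distributions per cell in D3Q19
BYTES_PER_VALUE = 4    # single precision

# One load and one store per distribution value (assumed traffic model).
bytes_per_update = 2 * Q * BYTES_PER_VALUE


def implied(mlups, gflops):
    """Return (flops per cell update, implied memory traffic in GB/s)."""
    updates_per_s = mlups * 1e6
    flops_per_update = gflops * 1e9 / updates_per_s
    bandwidth_gb_s = updates_per_s * bytes_per_update / 1e9
    return flops_per_update, bandwidth_gb_s


# Figures quoted in the text for the two GPUs.
reported = {"8800 GTX": (200.0, 40.0), "GTX 280": (400.0, 80.0)}

for gpu, (mlups, gflops) in reported.items():
    flops, bandwidth = implied(mlups, gflops)
    print(f"{gpu}: {flops:.0f} flops/update, {bandwidth:.1f} GB/s implied traffic")
```

Both cards come out at the same number of floating-point operations per update, as expected for the same kernel. The implied traffic figures are only meaningful if the assumed byte count per update matches the actual implementation.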

For GPU clusters, the bandwidth of the host-to-device interface, i.e. PCIe, is very important. Interestingly, we found that a manually blocked data transfer routine improves host-to-GPU data transfer for long messages from 2.2 GB/s (2.2 GB/s) to a maximum of 2.5 GB/s (4.5 GB/s) on PCIe 1.1 (PCIe 2.0). Therefore, the PCIe bus should not limit inter- and intranode communication.
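The summary does not describe how the blocked transfer routine is built, so the following is only a sketch of the general idea of splitting one long message into fixed-size blocks. The helper names `blocks` and `blocked_copy`, the block size, and the use of plain byte buffers in place of host and device memory are assumptions for illustration. In a CUDA program each block would correspond to one copy call such as `cudaMemcpy`.

```python
def blocks(total_bytes, block_bytes):
    """Yield (offset, length) pairs that cover total_bytes in fixed-size blocks."""
    if block_bytes <= 0:
        raise ValueError("block_bytes must be positive")
    offset = 0
    while offset < total_bytes:
        length = min(block_bytes, total_bytes - offset)
        yield offset, length
        offset += length


def blocked_copy(dst, src, block_bytes):
    """Copy src into dst block by block; each slice stands in for one device copy."""
    for offset, length in blocks(len(src), block_bytes):
        dst[offset:offset + length] = src[offset:offset + length]


src = bytes(range(256)) * 10   # 2560-byte stand-in for a host buffer
dst = bytearray(len(src))      # stand-in for the device buffer
blocked_copy(dst, src, block_bytes=1024)

print(list(blocks(len(src), 1024)))
print("copy complete:", bytes(dst) == src)
```

The block size that gives the best throughput depends on the platform and would have to be determined by measurement.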

References

1: nVIDIA Cuda Toolkit 2.0, December 2008, http://www.nvidia.com/object/cuda_get.html
2: G. Wellein, T. Zeiser, G. Hager, S. Donath, "On the single processor performance of simple lattice Boltzmann kernels", Computers & Fluids, 35:910-919, 2006. doi:10.1016/j.compfluid.2005.02.008
3: J. Tölke, M. Krafczyk, "Towards three-dimensional teraflop CFD computing on a desktop PC using graphics hardware", in "Proceedings of International Conference for Mesoscopic Methods in Engineering and Science ICMMES07", Munich, 2007.
4: J. McCalpin, "The STREAM Benchmark", http://www.streambench.org/, 2008.

Full Bibliographic Reference for this paper

J. Habich, T. Zeiser, G. Hager, G. Wellein, "Speeding up a Lattice Boltzmann Kernel on nVIDIA GPUs", in "Proceedings of the First International Conference on Parallel, Distributed and Grid Computing for Engineering", Civil-Comp Press, Stirlingshire, UK, Paper 17, 2009. doi:10.4203/ccp.90.17

Civil-Comp Proceedings, ISSN 1759-3433, CCP: 90. Authors' affiliation: Regional Computing Center, Erlangen, Germany.