PROCEEDINGS OF THE FIRST INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED AND GRID COMPUTING FOR ENGINEERING
Edited by: B.H.V. Topping and P. Iványi
Speeding up a Lattice Boltzmann Kernel on nVIDIA GPUs
J. Habich, T. Zeiser, G. Hager and G. Wellein
Regional Computing Center, Erlangen, Germany
J. Habich, T. Zeiser, G. Hager, G. Wellein, "Speeding up a Lattice Boltzmann Kernel on nVIDIA GPUs", in B.H.V. Topping, P. Iványi, (Editors), "Proceedings of the First International Conference on Parallel, Distributed and Grid Computing for Engineering", Civil-Comp Press, Stirlingshire, UK, Paper 17, 2009. doi:10.4203/ccp.90.17
Keywords: computational fluid dynamics, graphics processing unit, lattice Boltzmann, high-performance computing, STREAM benchmarks, CUDA.
We have implemented a D3Q19 stencil based lattice Boltzmann (LBM) flow solver on nVIDIA graphics processing units (GPUs) using the compute unified device architecture (CUDA) framework. Compared to general-purpose processors, GPUs impose some restrictive constraints that must be followed to obtain optimal performance. The basic programming guidelines we present in this report for our LBM implementation apply similarly to other codes as well.
Data alignment in main memory and aligned access patterns within specific groups of threads are frequent prerequisites for efficient memory access on GPUs. For the LBM we have chosen a combination of data layout and parallelization approach which easily guarantees aligned data access for the so-called collision part of the algorithm. In the succeeding propagation step the updated values of each LB cell need to be propagated to neighbouring cells, potentially breaking alignment requirements. Hence, the propagation values are passed through on-chip shared memory, are reordered there, and are stored back to device memory according to the alignment boundaries.
The D3Q19 model leads to increased register usage as compared to lower discretization stencils as used in, e.g., Tölke et al. The straightforward implementation of the LBM required a total of 70 registers per thread, which severely restricted the level of thread concurrency, resulting in inefficient use of the main memory bandwidth. By changing the standard array accesses to manual index calculations, which consecutively store the current index in the same variable, we reduced the kernel's register usage by nearly 50%. This translated into a similar performance boost (2x), yielding a final 200 FluidMLUPS/s (400 FluidMLUPS/s) on the 8800 GTX (GTX 280), equivalent to approximately 40 GFlop/s (80 GFlop/s) of sustained performance in single precision. These performance numbers are perfectly in line with our simple performance model as measured by the STREAM benchmarks.
For GPU clusters the bandwidth of the host-to-device interface, i.e. PCIe, is very important. Interestingly, we found that using a manually blocked data transfer routine improves host-to-GPU data transfer considerably, from 2.2 GB/s (2.2 GB/s) to a maximum of 2.5 GB/s (4.5 GB/s) for long messages on PCIe 1.1 (PCIe 2.0). Therefore, the PCIe bus should not limit inter- and intranode communication.