Civil-Comp Proceedings
ISSN 1759-3433
CCP: 90
Paper 17

Speeding up a Lattice Boltzmann Kernel on nVIDIA GPUs

J. Habich, T. Zeiser, G. Hager and G. Wellein

Regional Computing Center, Erlangen, Germany

Full Bibliographic Reference for this paper
J. Habich, T. Zeiser, G. Hager, G. Wellein, "Speeding up a Lattice Boltzmann Kernel on nVIDIA GPUs", in , (Editors), "Proceedings of the First International Conference on Parallel, Distributed and Grid Computing for Engineering", Civil-Comp Press, Stirlingshire, UK, Paper 17, 2009. doi:10.4203/ccp.90.17
Keywords: computational fluid dynamics, graphics processing unit, lattice Boltzmann, high-performance computing, STREAM benchmarks, CUDA.

We have implemented a D3Q19 stencil-based lattice Boltzmann (LBM) flow solver on nVIDIA graphics processing units (GPUs) using the compute unified device architecture (CUDA [1]) framework. Compared with general-purpose processors, GPUs impose several restrictive constraints that must be observed to obtain optimal performance. The basic programming guidelines we present in this report for our LBM implementation apply similarly to other codes.

Data alignment in main memory and aligned access patterns within specific groups of threads are frequent prerequisites for efficient memory access on GPUs. For the LBM we have chosen a combination of data layout and parallelization approach which easily guarantees aligned data access for the so-called collision part of the algorithm. In the succeeding propagation step the updated values of each LB cell need to be propagated to neighbouring cells, potentially breaking the alignment requirements. Hence, the propagation values are passed through on-chip shared memory, are reordered there and stored back to the device memory according to the alignment boundaries [3].

The D3Q19 model leads to increased register usage as compared to smaller discretization stencils as used in, e.g., Tölke et al. [3]. The straightforward implementation of the LBM required a total of 70 registers per thread, which severely restricted the level of thread concurrency, resulting in inefficient use of the main memory bandwidth. By changing the standardized access to data arrays to manual index calculations, which store the current index in the same variable consecutively, the register usage of the kernel was reduced by nearly 50%. This translated into a similar performance boost (2x) to a final 200 FluidMLUPS (400 FluidMLUPS) on the 8800 GTX (GTX 280), equivalent to approximately 40 GFlop/s (80 GFlop/s) of sustained performance in single precision. These performance numbers are perfectly in line with our simple performance model based on the memory bandwidth as measured by the STREAM benchmarks [4].

For GPU clusters the bandwidth of the host-to-device interface, i.e. PCIe, is very important. Interestingly, we found that a manually blocked data transfer routine improves the host-to-GPU data transfer considerably, from 2.2 GB/s (2.2 GB/s) to a maximum of 2.5 GB/s (4.5 GB/s) for long messages on PCIe 1.1 (PCIe 2.0). Therefore, the PCIe bus should not limit inter- and intranode communication.

1. nVIDIA, "CUDA Toolkit 2.0", December 2008.
2. G. Wellein, T. Zeiser, G. Hager, S. Donath, "On the single processor performance of simple lattice Boltzmann kernels", Computers & Fluids, 35:910-919, 2006. doi:10.1016/j.compfluid.2005.02.008
3. J. Tölke, M. Krafczyk, "Towards three-dimensional teraflop CFD computing on a desktop PC using graphics hardware", in "Proceedings of the International Conference for Mesoscopic Methods in Engineering and Science (ICMMES07)", Munich, 2007.
4. J. McCalpin, "The STREAM Benchmark", 2008.
