Civil-Comp Conferences
ISSN 2753-3239
CCC: 2
PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON ENGINEERING COMPUTATIONAL TECHNOLOGY
Edited by: B.H.V. Topping and P. Iványi
Paper 8.5

MRT Lattice Boltzmann Method on multiple Graphics Processing Units with halo sharing over PCI-e for non-contiguous memory

T.O.M. Forslund

Division of Fluid and Experimental Mechanics, Luleå University of Technology, Sweden

Full Bibliographic Reference for this paper
T.O.M. Forslund, "MRT Lattice Boltzmann Method on multiple Graphics Processing Units with halo sharing over PCI-e for non-contiguous memory", in B.H.V. Topping, P. Iványi, (Editors), "Proceedings of the Eleventh International Conference on Engineering Computational Technology", Civil-Comp Press, Edinburgh, UK, Online volume: CCC 2, Paper 8.5, 2022, doi:10.4203/ccc.2.8.5
Keywords: lattice Boltzmann method, GPU, multi-GPU programming.

Abstract
The Lattice Boltzmann Method (LBM) has been shown to be well suited for implementation on Graphics Processing Units (GPUs). The benefit of GPU implementations over CPU implementations is a reduction in computational time by as much as two orders of magnitude. This staggering difference arises because the LBM computations are both explicit and local, so the method, like most other cellular automata methods, can make full use of the GPU's capabilities. Although GPUs offer significantly higher performance in terms of floating-point operations per second (FLOPS) than CPUs, they have two significant drawbacks. First, the complexity of the calculations is limited by the relative simplicity of the GPU core design compared to a CPU core. Secondly, GPU memory is usually limited in comparison, ranging from a few GB up to around 100 GB for high-end enterprise cards. Because the LBM is well suited for execution on GPUs, the first point need not be considered, but the second point becomes a limitation as larger or more highly resolved computational domains are of interest. This can be remedied by distributing the computations across several GPUs executing in parallel. The GPUs share values in overlapping regions, called halo values, that need to be transferred each time step. If the memory is contiguous, each transfer can be executed as a single memory-transfer call that utilizes the PCI-e lanes efficiently. If this is not the case, support exists for copying so-called strided memory, which has a constant offset between values, for either single-strided (2D) or double-strided (3D) layouts. In practice, these functions result in poor PCI-e lane utilization. To remedy this, a method is proposed in which the halo values are calculated and packed into a contiguous memory buffer that is then communicated between the GPUs via the PCI-e lanes. It is shown that the method introduces some additional overhead compared to single-GPU execution but maintains a reasonable 70% of the single-GPU performance.
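To illustrate the packing strategy described in the abstract, the following minimal CUDA sketch gathers a non-contiguous halo plane into a contiguous buffer and ships it to the neighbouring GPU with a single peer-to-peer copy instead of a strided transfer. This is not the paper's implementation: the lattice sizes, the D3Q19 structure-of-arrays layout, the decomposition along x, and the names packHalo and exchangeHalo are assumptions made for illustration only.

// Minimal sketch (hypothetical names and layout): pack the x = const halo
// plane of a D3Q19 LBM lattice into a contiguous buffer, then transfer it to
// the neighbouring GPU with one contiguous peer copy over PCI-e.

#include <cuda_runtime.h>

// Lattice dimensions local to one GPU (assumed, not from the paper).
constexpr int NX = 256, NY = 256, NZ = 256;
constexpr int NQ = 19;                    // D3Q19 velocity set
constexpr int HALO_SIZE = NY * NZ * NQ;   // one y-z plane, all directions

// Index helper for an assumed structure-of-arrays layout f[q][x][y][z].
__host__ __device__ inline size_t idx(int q, int x, int y, int z)
{
    return ((size_t)q * NX + x) * NY * NZ + (size_t)y * NZ + z;
}

// Gather the non-contiguous halo plane at x = xPlane into a contiguous buffer.
__global__ void packHalo(const float* __restrict__ f,
                         float* __restrict__ buf, int xPlane)
{
    int z = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int q = blockIdx.z;
    if (y < NY && z < NZ && q < NQ)
        buf[((size_t)q * NY + y) * NZ + z] = f[idx(q, xPlane, y, z)];
}

// Host-side exchange for one time step: pack on the source GPU, then issue a
// single contiguous cudaMemcpyPeerAsync instead of many strided transfers.
// (Direct PCI-e peer access can be enabled beforehand with
// cudaDeviceEnablePeerAccess where the hardware supports it.)
void exchangeHalo(float* f_src, float* sendBuf, int srcDev,
                  float* recvBuf, int dstDev, cudaStream_t stream)
{
    cudaSetDevice(srcDev);  // stream is assumed to belong to srcDev
    dim3 block(32, 8, 1), grid((NZ + 31) / 32, (NY + 7) / 8, NQ);
    packHalo<<<grid, block, 0, stream>>>(f_src, sendBuf, NX - 1);

    cudaMemcpyPeerAsync(recvBuf, dstDev, sendBuf, srcDev,
                        HALO_SIZE * sizeof(float), stream);
    // A matching "unpackHalo" kernel on dstDev would scatter recvBuf back
    // into that GPU's ghost layer (omitted for brevity).
}

With this pattern each time step requires only one contiguous PCI-e transfer per shared face, which is the utilization improvement the abstract attributes to packing over strided copies.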

