Computational & Technology Resources
an online resource for computational,
engineering & technology publications
PROCEEDINGS OF THE THIRD INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED, GRID AND CLOUD COMPUTING FOR ENGINEERING
Edited by: B.H.V. Topping and P. Iványi
Parallel Structurally-Symmetric Sparse Matrix-Vector Products on Multi-Core Processors
G.O. Ainsworth Jr.1, V.H.F. Batista2 and F.L.B. Ribeiro3
1Structural Engineering Department, Federal University of Juiz de Fora, Juiz de Fora, MG, Brazil
G.O. Ainsworth Jr., V.H.F. Batista, F.L.B. Ribeiro, "Parallel Structurally-Symmetric Sparse Matrix-Vector Products on Multi-Core Processors", in B.H.V. Topping, P. Iványi, (Editors), "Proceedings of the Third International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering", Civil-Comp Press, Stirlingshire, UK, Paper 22, 2013. doi:10.4203/ccp.101.22
Keywords: sparse matrix-vector product, compressed sparse row, symmetric nonzero pattern, parallel implementation, multi-core architectures.
In this paper we consider the implementation of a multi-threaded code for the computation of sparse matrix-vector products that occur in finite element applications using Krylov Space iterative solvers [1,2]. Processor vendors have been concentrated on developing systems that group two or more processors on a single socket and multi-threading has been considered as a key optimization technique for many scientific computing applications. Some aspects of the distributed-memory implementation of the matrix-vector products are also considered in this paper.
Global finite element matrices are usually sparse and structurally symmetric, i.e., most of its entries are zeros and the matrix presents a symmetric pattern. To take advantage of these properties the compressed sparse row-column format (CSRC) is used to store the global matrices . In this paper, two versions of the CSRC format are considered: one for square matrices and an extension to rectangular matrices. The square version is suitable to storage global matrices originated in non-overlapping partitions implemented in any distributed-memory finite element code; the later is used in a subdomain-by-subdomain approach. Differently of the classical CSR format, a multi-threaded parallel implementation of the CSRC format requires a thread-safe access to the output vector. This is accomplished by reserving one array for each thread to store its contribution, which needs to be later accumulated in parallel into the resulting vector. We show how to perform this accumulation in four different ways: an all-in-one, per buffer, effective and interval approaches. The implementation is further optimized through the use of manual prefetching and NUMA-specific measures, namely, thread binding and memory affinity considering the topology of the processor architectures used in the numerical experiments.
We conduct experiments on two multi-core architectures, which reveal that our implementation achieved substantial performance improvements. We could observe that, provided sufficient memory bandwidth, our implementation has demonstrated to be fairly scalable.
purchase the full-text of this paper (price £20)