Computational Technology Resources - CCP

Keywords: sparse matrix-vector product, compressed sparse row, symmetric nonzero pattern, parallel implementation, multi-core architectures.

Summary

In this paper we consider the implementation of a multi-threaded code for the computation of sparse matrix-vector products that occur in finite element applications using Krylov Space iterative solvers [1,2]. Processor vendors have been concentrated on developing systems that group two or more processors on a single socket and multi-threading has been considered as a key optimization technique for many scientific computing applications. Some aspects of the distributed-memory implementation of the matrix-vector products are also considered in this paper.

Global finite element matrices are usually sparse and structurally symmetric, i.e., most of its entries are zeros and the matrix presents a symmetric pattern. To take advantage of these properties the compressed sparse row-column format (CSRC) is used to store the global matrices [1]. In this paper, two versions of the CSRC format are considered: one for square matrices and an extension to rectangular matrices. The square version is suitable to storage global matrices originated in non-overlapping partitions implemented in any distributed-memory finite element code; the later is used in a subdomain-by-subdomain approach. Differently of the classical CSR format, a multi-threaded parallel implementation of the CSRC format requires a thread-safe access to the output vector. This is accomplished by reserving one array for each thread to store its contribution, which needs to be later accumulated in parallel into the resulting vector. We show how to perform this accumulation in four different ways: an all-in-one, per buffer, effective and interval approaches. The implementation is further optimized through the use of manual prefetching and NUMA-specific measures, namely, thread binding and memory affinity considering the topology of the processor architectures used in the numerical experiments.

We conduct experiments on two multi-core architectures, which reveal that our implementation achieved substantial performance improvements. We could observe that, provided sufficient memory bandwidth, our implementation has demonstrated to be fairly scalable.

References

1: F.L.B Ribeiro, I.A. Ferreira, "Parallel implementation of the finite element method using compressed data structures", Computational Mechanics, 41(18): 31-48, 2007.
2: G.O. Ainsworth Jr., F.L.B Ribeiro, C. Magluta, "A parallel implementation of the implicitly restarted Arnoldi/Lanczos method", Computational Mechanics, 1: 1-15, 2011.

purchase the full-text of this paper (price £20)

go to the previous paper
go to the next paper
return to the table of contents
return to the book description
purchase this book (price £40 +P&P)

	Computational & Technology Resources an online resource for computational, engineering & technology publications
	not logged in - login
Front Page Browse CCP CSETS CTR IJRT Other Authors Search Purchase Guide FAQ Contact us	Civil-Comp Proceedings ISSN 1759-3433 CCP: 101 PROCEEDINGS OF THE THIRD INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED, GRID AND CLOUD COMPUTING FOR ENGINEERING Edited by: Paper 22 Parallel Structurally-Symmetric Sparse Matrix-Vector Products on Multi-Core Processors G.O. Ainsworth Jr.¹, V.H.F. Batista² and F.L.B. Ribeiro³ ¹Structural Engineering Department, Federal University of Juiz de Fora, Juiz de Fora, MG, Brazil ²Federal University of Ceara, Juazeiro do Norte, CE, Brazil ³Civil Engineering Program, Federal University of Rio de Janeiro, Rio de Janeiro, RJ, Brazil doi:10.4203/ccp.101.22 purchase the full-text of this paper Full Bibliographic Reference for this paper G.O. Ainsworth Jr., V.H.F. Batista, F.L.B. Ribeiro, "Parallel Structurally-Symmetric Sparse Matrix-Vector Products on Multi-Core Processors", in , (Editors), "Proceedings of the Third International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering", Civil-Comp Press, Stirlingshire, UK, Paper 22, 2013. doi:10.4203/ccp.101.22 Keywords: sparse matrix-vector product, compressed sparse row, symmetric nonzero pattern, parallel implementation, multi-core architectures. Summary In this paper we consider the implementation of a multi-threaded code for the computation of sparse matrix-vector products that occur in finite element applications using Krylov Space iterative solvers [1,2]. Processor vendors have been concentrated on developing systems that group two or more processors on a single socket and multi-threading has been considered as a key optimization technique for many scientific computing applications. Some aspects of the distributed-memory implementation of the matrix-vector products are also considered in this paper. Global finite element matrices are usually sparse and structurally symmetric, i.e., most of its entries are zeros and the matrix presents a symmetric pattern. To take advantage of these properties the compressed sparse row-column format (CSRC) is used to store the global matrices [1]. In this paper, two versions of the CSRC format are considered: one for square matrices and an extension to rectangular matrices. The square version is suitable to storage global matrices originated in non-overlapping partitions implemented in any distributed-memory finite element code; the later is used in a subdomain-by-subdomain approach. Differently of the classical CSR format, a multi-threaded parallel implementation of the CSRC format requires a thread-safe access to the output vector. This is accomplished by reserving one array for each thread to store its contribution, which needs to be later accumulated in parallel into the resulting vector. We show how to perform this accumulation in four different ways: an all-in-one, per buffer, effective and interval approaches. The implementation is further optimized through the use of manual prefetching and NUMA-specific measures, namely, thread binding and memory affinity considering the topology of the processor architectures used in the numerical experiments. We conduct experiments on two multi-core architectures, which reveal that our implementation achieved substantial performance improvements. We could observe that, provided sufficient memory bandwidth, our implementation has demonstrated to be fairly scalable. References 1 F.L.B Ribeiro, I.A. Ferreira, "Parallel implementation of the finite element method using compressed data structures", Computational Mechanics, 41(18): 31-48, 2007. 2 G.O. Ainsworth Jr., F.L.B Ribeiro, C. Magluta, "A parallel implementation of the implicitly restarted Arnoldi/Lanczos method", Computational Mechanics, 1: 1-15, 2011. purchase the full-text of this paper (price £20) go to the previous paper go to the next paper return to the table of contents return to the book description purchase this book (price £40 +P&P)
Back to top	©Civil-Comp Limited 2023 - terms & conditions