Civil-Comp Proceedings
ISSN 1759-3433
CCP: 101
PROCEEDINGS OF THE THIRD INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED, GRID AND CLOUD COMPUTING FOR ENGINEERING
Paper 40

Programming Finite Element Methods for ccNUMA Processors

E. Borin1 and P. Devloo2

1Institute of Computing, University of Campinas, Brazil
2Faculty of Civil Engineering, University of Campinas, Brazil

Full Bibliographic Reference for this paper
E. Borin, P. Devloo, "Programming Finite Element Methods for ccNUMA Processors", in , (Editors), "Proceedings of the Third International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering", Civil-Comp Press, Stirlingshire, UK, Paper 40, 2013. doi:10.4203/ccp.101.40
Keywords: parallel programming, parallel processing, cache-coherent non-uniform memory access, finite element methods, multi-core, shared memory.

Summary
Recent multi-core designs have migrated from symmetric multiprocessing (SMP) to cache-coherent non-uniform memory access (ccNUMA) architectures. In this paper we discuss performance issues that arise when designing parallel finite element method programs for a 64-core ccNUMA computer and explore solutions to these issues. First, we present an overview of the computer architecture and show that highly parallel code that does not take the organization of the system memory into account scales poorly, achieving only a 2.8x speedup when running with 64 threads. We then discuss how we identified the sources of overhead and evaluate two possible solutions to the problem. The first consists of distributing the data evenly among the memory banks using the numactl tool; the second consists of using the libnuma library to schedule threads and their associated data on local CPUs and memory banks, exploiting the parallelism of the memory subsystem and reducing the average memory access latency. We show that the first approach boosts performance by 10.6x merely by changing the way the program is invoked on the command line, and that the second approach further boosts performance by 30.9x at the expense of changing the application's code. Finally, we argue that the reported issues only arise for large data sets and conclude with recommendations to help programmers design algorithms and programs that perform well on this type of machine.
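The first approach needs no source changes at all: launching the solver as, for example, numactl --interleave=all ./solver spreads its pages round-robin across the memory banks. To illustrate the flavour of the second approach, the sketch below shows one possible way of using the libnuma API to pin one worker thread per NUMA node and allocate its working set on that node's memory bank. This is not the authors' code; the program name, slice size and loop body are placeholders chosen only for illustration.

/* Minimal sketch (not the paper's implementation): one worker per NUMA node,
 * with the thread and its data slice bound to the same node via libnuma.
 * Build with: gcc -O2 numa_sketch.c -lnuma -lpthread */
#include <numa.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK_BYTES (64UL * 1024 * 1024)  /* per-thread data slice; illustrative size */

typedef struct {
    int node;        /* NUMA node this worker is pinned to */
    double *data;    /* slice allocated on that node's memory bank */
    size_t n;        /* number of doubles in the slice */
} worker_arg_t;

static void *worker(void *p)
{
    worker_arg_t *a = (worker_arg_t *)p;

    /* Restrict this thread to the CPUs of its NUMA node, so its
     * memory accesses stay on the local bank. */
    numa_run_on_node(a->node);

    /* Placeholder computation standing in for element assembly/solve. */
    double acc = 0.0;
    for (size_t i = 0; i < a->n; ++i) {
        a->data[i] = (double)i;
        acc += a->data[i];
    }
    printf("node %d: checksum %g\n", a->node, acc);
    return NULL;
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return EXIT_FAILURE;
    }

    int nodes = numa_max_node() + 1;
    pthread_t *tid = malloc(nodes * sizeof *tid);
    worker_arg_t *args = malloc(nodes * sizeof *args);

    for (int node = 0; node < nodes; ++node) {
        args[node].node = node;
        args[node].n    = CHUNK_BYTES / sizeof(double);
        /* Allocate this worker's slice directly on its own memory bank. */
        args[node].data = numa_alloc_onnode(CHUNK_BYTES, node);
        if (args[node].data == NULL) {
            fprintf(stderr, "allocation on node %d failed\n", node);
            return EXIT_FAILURE;
        }
        pthread_create(&tid[node], NULL, worker, &args[node]);
    }

    for (int node = 0; node < nodes; ++node) {
        pthread_join(tid[node], NULL);
        numa_free(args[node].data, CHUNK_BYTES);
    }
    free(tid);
    free(args);
    return EXIT_SUCCESS;
}

Because the thread that initializes each slice is already pinned to the owning node, the first-touch page placement and the explicit numa_alloc_onnode allocation agree, which is the property the second approach relies on to keep memory accesses local.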
