Civil-Comp Proceedings
ISSN 1759-3433
CCP: 101
PROCEEDINGS OF THE THIRD INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED, GRID AND CLOUD COMPUTING FOR ENGINEERING
Paper 40

Programming Finite Element Methods for ccNUMA Processors

E. Borin1 and P. Devloo2

1Institute of Computing, University of Campinas, Brazil
2Faculty of Civil Engineering, University of Campinas, Brazil

Full Bibliographic Reference for this paper
E. Borin, P. Devloo, "Programming Finite Element Methods for ccNUMA Processors", in , (Editors), "Proceedings of the Third International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering", Civil-Comp Press, Stirlingshire, UK, Paper 40, 2013. doi:10.4203/ccp.101.40
Keywords: parallel programming, parallel processing, cache-coherent non-uniform memory access, finite element methods, multi-core, shared memory.

Summary
Recent multi-core designs have migrated from symmetric multiprocessing (SMP) to cache-coherent non-uniform memory access (ccNUMA) architectures. In this paper we discuss performance issues that arise when designing parallel finite element method programs for a 64-core ccNUMA computer and explore solutions to these issues. First, we present an overview of the computer architecture and show that highly parallel code that does not take the organization of the system memory into account scales poorly, achieving only a 2.8x speedup when running with 64 threads. We then discuss how we identified the sources of overhead and evaluate two possible solutions to the problem. The first consists of distributing the data evenly among the memory banks using the numactl tool; the second consists of using the libnuma library to schedule threads and their associated data on local CPUs and memory banks, exploiting the parallelism of the memory subsystem and reducing the average memory access latency. We show that the first approach boosts performance by 10.6x merely by changing the way the program is invoked on the command line, and that the second approach further boosts performance by 30.9x at the expense of changing the application's code. Finally, we argue that the reported issues only arise for large data sets and conclude with recommendations to help programmers design algorithms and programs that perform well on this type of machine.
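The first approach needs no source changes at all: launching the solver as, for example, numactl --interleave=all ./solver spreads its pages round-robin across the memory banks. To illustrate the flavour of the second approach, the sketch below shows one possible way of using the libnuma API to pin one worker thread per NUMA node and allocate its working set on that node's memory bank. This is not the authors' code; the program name, slice size and loop body are placeholders chosen only for illustration.

/* Minimal sketch (not the paper's implementation): one worker per NUMA node,
 * with the thread and its data slice bound to the same node via libnuma.
 * Build with: gcc -O2 numa_sketch.c -lnuma -lpthread */
#include <numa.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK_BYTES (64UL * 1024 * 1024)  /* per-thread data slice; illustrative size */

typedef struct {
    int node;        /* NUMA node this worker is pinned to */
    double *data;    /* slice allocated on that node's memory bank */
    size_t n;        /* number of doubles in the slice */
} worker_arg_t;

static void *worker(void *p)
{
    worker_arg_t *a = (worker_arg_t *)p;

    /* Restrict this thread to the CPUs of its NUMA node, so its
     * memory accesses stay on the local bank. */
    numa_run_on_node(a->node);

    /* Placeholder computation standing in for element assembly/solve. */
    double acc = 0.0;
    for (size_t i = 0; i < a->n; ++i) {
        a->data[i] = (double)i;
        acc += a->data[i];
    }
    printf("node %d: checksum %g\n", a->node, acc);
    return NULL;
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return EXIT_FAILURE;
    }

    int nodes = numa_max_node() + 1;
    pthread_t *tid = malloc(nodes * sizeof *tid);
    worker_arg_t *args = malloc(nodes * sizeof *args);

    for (int node = 0; node < nodes; ++node) {
        args[node].node = node;
        args[node].n    = CHUNK_BYTES / sizeof(double);
        /* Allocate this worker's slice directly on its own memory bank. */
        args[node].data = numa_alloc_onnode(CHUNK_BYTES, node);
        if (args[node].data == NULL) {
            fprintf(stderr, "allocation on node %d failed\n", node);
            return EXIT_FAILURE;
        }
        pthread_create(&tid[node], NULL, worker, &args[node]);
    }

    for (int node = 0; node < nodes; ++node) {
        pthread_join(tid[node], NULL);
        numa_free(args[node].data, CHUNK_BYTES);
    }
    free(tid);
    free(args);
    return EXIT_SUCCESS;
}

Because the thread that initializes each slice is already pinned to the owning node, the first-touch page placement and the explicit numa_alloc_onnode allocation agree, which is the property the second approach relies on to keep memory accesses local.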
