Civil-Comp Proceedings
ISSN 1759-3433
CCP: 111
PROCEEDINGS OF THE FIFTH INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED, GRID AND CLOUD COMPUTING FOR ENGINEERING
Edited by:
Paper 38

Exploiting Functional Directives to Achieve MPI Parallelization

D. Rubio Bonilla and C.W. Glass

HLRS - University of Stuttgart, Germany

Full Bibliographic Reference for this paper
D. Rubio Bonilla, C.W. Glass, "Exploiting Functional Directives to Achieve MPI Parallelization", in , (Editors), "Proceedings of the Fifth International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering", Civil-Comp Press, Stirlingshire, UK, Paper 38, 2017. doi:10.4203/ccp.111.38
Keywords: high performance computing, functional programming, MPI, parallelization, programming models.

Summary
In recent years CPU manufacturers have not been able to substantially increase the instructions per cycle of CPU cores. To overcome this situation, manufacturers have increased the raw performance of HPC systems by increasing the number of processors, multiplying the number of cores in each processor, and integrating specialized accelerators such as GPGPUs, FPGAs and other ASICs with specialized instruction sets. To exploit these new hardware capabilities, applications have to be written explicitly with parallelism in mind, to deal with the growing number of available cores, and also need parts of their source code written in specialized languages to make use of the integrated accelerators.

This creates a major paradigm shift from compute-centric to communication-centric execution, to which most programming models are not yet properly aligned: classical models are geared towards optimizing the computational operations, assuming data access is almost free. Languages like C, for example, assume that variables are immediately available and accessed synchronously. The new situation implies that data will be distributed across the system, and communication latency will have a large impact. A poor data distribution results in continuous data exchange while processing units sit idle waiting for the data to compute on. Most programming models do not convey the necessary dependency information to the compiler, which must therefore be careful not to make wrong assumptions. There are successful attempts, such as OpenMP, to exploit parallelism by introducing structural information about the application.
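To illustrate what such structural information looks like in the OpenMP case, the following minimal C sketch (not taken from the paper) annotates a loop whose iterations are independent; the pragma conveys a single, flat level of structure to the compiler and runtime:

#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];

    /* The pragma states that the loop iterations are independent and may
       be distributed over threads; it carries one flat level of structural
       information about the loop, nothing about the surrounding program. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        b[i] = 2.0 * a[i];

    printf("b[0] = %f\n", b[0]);
    return 0;
}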

Research projects such as POLCA have developed means to introduce functional-like semantics, in the form of directives, into procedural code. These directives describe the structural behavior of the application, with the aim of allowing compilers to perform aggressive code transformations that increase performance and enable portability across different architectures. The functional semantics are based on Higher-Order Functions (HOFs), that is, functions that take other functions as parameters, or return them as results, instead of plain values. Thanks to this property the directives can be interlinked to create hierarchical structures that can be analyzed at different levels (in contrast to the flat structure created by OpenMP). At the same time, HOFs have a very clear, well-understood execution structure that can be manipulated to obtain equivalent execution structures with different properties (memory usage, degree of parallelization or communication pattern, among others).
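As an illustration of how HOF-based directives might look on procedural C code, the sketch below annotates a loop as a map over an array. The directive name and syntax are hypothetical (they are not claimed to match the POLCA notation); the point is that map, fold and similar annotations can be nested to form the hierarchical structure described above:

#include <stdio.h>

#define N 8

/* Element-wise body applied by the outer map HOF. */
static void scale(const double *in, double *out)
{
    *out = 2.0 * (*in);
}

/* Hypothetical HOF directive: "map scale over A producing B".
   Nesting, e.g. a fold inside a map, would express a hierarchy the
   compiler can analyze level by level. */
#pragma hof map scale A B
static void scale_all(const double A[N], double B[N])
{
    for (int i = 0; i < N; i++)
        scale(&A[i], &B[i]);
}

int main(void)
{
    double A[N] = {1, 2, 3, 4, 5, 6, 7, 8}, B[N];
    scale_all(A, B);
    printf("B[0] = %g, B[7] = %g\n", B[0], B[7]);
    return 0;
}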

In this paper we present how functional directives based on Higher-Order Functions can be applied to procedural code to obtain the application's hierarchical structure. We then demonstrate how this structure is analyzed to find the parallelism and its data dependencies. After that, we follow the process a compiler can take to exploit this information and adapt the original source code for HPC clusters without a shared memory address space. These steps involve detecting the communication pattern from the data flow, partitioning the data, modifying the data structures and introducing the MPI (Message Passing Interface) calls. Finally, we present execution results for the MPI code generated from non-parallelized C code (N-Body and 3D Heat Diffusion) following this process. The results show that the parallelization on large HPC clusters is correct, with performance comparable to hand-tuned versions and almost equivalent scalability and energy consumption.
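As a rough idea of the target of such a transformation, the following sketch shows the kind of MPI code a map-style partitioning could produce (distribute the data, compute locally, gather the results). It is a simplified illustration rather than the paper's generated code; real cases such as 3D heat diffusion would additionally require halo exchanges between neighbouring ranks.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024  /* assumed divisible by the number of ranks */

static double f(double x) { return 2.0 * x; }  /* element-wise map body */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = N / size;
    double *a = NULL, *b = NULL;
    double *la = malloc(chunk * sizeof(double));
    double *lb = malloc(chunk * sizeof(double));

    if (rank == 0) {
        a = malloc(N * sizeof(double));
        b = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) a[i] = (double)i;
    }

    /* Distribute the input according to the chosen data partitioning. */
    MPI_Scatter(a, chunk, MPI_DOUBLE, la, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Each rank applies the map body to its local partition. */
    for (int i = 0; i < chunk; i++)
        lb[i] = f(la[i]);

    /* Collect the partial results on rank 0. */
    MPI_Gather(lb, chunk, MPI_DOUBLE, b, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("b[N-1] = %g\n", b[N - 1]);

    free(la); free(lb);
    if (rank == 0) { free(a); free(b); }
    MPI_Finalize();
    return 0;
}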
