Civil-Comp Proceedings
ISSN 1759-3433
CCP: 101
Edited by: B.H.V. Topping and P. Iványi
Paper 36

Towards a Divide-and-Conquer Strategy for a Coordinated Resilience of HPC Applications: A Job Manager as a Coordinating System Middleware

W. Abu Abed, M. Krafczyk, J. Hegewald and K. Kucher

Institute for Computational Modeling in Civil Engineering, Technische Universität Braunschweig, Germany

Full Bibliographic Reference for this paper
W. Abu Abed, M. Krafczyk, J. Hegewald, K. Kucher, "Towards a Divide-and-Conquer Strategy for a Coordinated Resilience of HPC Applications: A Job Manager as a Coordinating System Middleware", in B.H.V. Topping, P. Iványi, (Editors), "Proceedings of the Third International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering", Civil-Comp Press, Stirlingshire, UK, Paper 36, 2013. doi:10.4203/ccp.101.36
Keywords: high-performance computing, resilience, fault-tolerance, system middleware, fault tolerant environment, job manager.

Robustness and stability of HPC systems are essential prerequisites for the successful and economical operation of HPC applications. The increasing size and complexity of HPC systems are two major factors leading to an inevitable increase in the frequency of hard and soft errors. As a result, current and future HPC systems are becoming less robust and less stable, and their operating efficiency and reliability are deteriorating significantly. New integrated approaches to improving the resilience of HPC systems are needed to keep such systems operating reasonably. Recent literature surveys have shown that resilience and fault tolerance cannot be realised efficiently by implementing fault-tolerance mechanisms at the system level only. Different application domains have different, specific methodological requirements for achieving resilience of an HPC system. Hence, an integrated, application-oriented approach is required.

This paper presents a framework for a fault tolerant environment (FETOL) [1] that implements a coordinated resilience solution. The focus is on the Job Manager (JM), a coordinating system middleware that constitutes a core component of the framework. FETOL is a software solution that exploits a divide-and-conquer strategy and offers methods at both the system and application levels for dealing with different failure scenarios. An important feature that may help FETOL establish sustainable and efficient fault tolerance for many HPC applications is that the components of the framework can be adapted individually at the application level.

The JM, as a coordinating middleware, bundles, orchestrates and extends the functionalities of the FETOL components. These components are: the HPC application, which is parallelised using MPI; the scheduler; the system monitoring tools; and BOND [2], a multi-agent communication framework based on TCP/IP. Breaking down the main task of an HPC application into subtasks and grouping them into so-called process bundles is a central operation of the FETOL strategy. The JM is responsible for managing and coordinating both the I/O procedures (i.e. checkpointing) at the application level and the migration and restoration of failed process bundles. System and application information delivered by the monitoring tools influences the coordinated reactions of the JM. The main task, i.e. the HPC application, can consequently be kept alive with acceptable computational overhead and without the need to restart the whole job.
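
To make the idea of process bundles more concrete, the following minimal Python sketch illustrates how a coordinator might group ranks into bundles, record per-bundle checkpoints and migrate only the bundles affected by a node failure. It is not the FETOL implementation: the names ProcessBundle, JobManager, make_bundles, checkpoint and on_node_failure, the round-robin placement and the least-loaded migration policy are all assumptions made for illustration, and no real MPI, scheduler or BOND calls are involved.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessBundle:
    """Hypothetical bundle: a group of ranks checkpointed and migrated as one unit."""
    bundle_id: int
    ranks: list
    node: str
    checkpoint: dict = field(default_factory=dict)  # last saved application state


class JobManager:
    """Toy coordinator (illustrative only, not the FETOL Job Manager)."""

    def __init__(self, nodes):
        self.healthy_nodes = set(nodes)
        self.bundles = {}

    def make_bundles(self, n_ranks, bundle_size):
        # Divide step: split the ranks into bundles, placed round-robin on nodes.
        nodes = sorted(self.healthy_nodes)
        for i, start in enumerate(range(0, n_ranks, bundle_size)):
            ranks = list(range(start, min(start + bundle_size, n_ranks)))
            self.bundles[i] = ProcessBundle(i, ranks, nodes[i % len(nodes)])

    def checkpoint(self, bundle_id, state):
        # Store a copy of the state so later changes do not alter the checkpoint.
        self.bundles[bundle_id].checkpoint = dict(state)

    def _load(self, node):
        # Number of bundles currently placed on a node.
        return sum(1 for b in self.bundles.values() if b.node == node)

    def on_node_failure(self, node):
        # Monitoring reports a failed node: migrate only the affected bundles;
        # all other bundles keep running, so the whole job is not restarted.
        self.healthy_nodes.discard(node)
        if not self.healthy_nodes:
            raise RuntimeError("no healthy node left to migrate to")
        migrated = []
        for bundle in self.bundles.values():
            if bundle.node == node:
                # Assumed policy: move to the least-loaded healthy node.
                bundle.node = min(sorted(self.healthy_nodes), key=self._load)
                migrated.append(bundle.bundle_id)
        return migrated


jm = JobManager(["n0", "n1", "n2"])
jm.make_bundles(n_ranks=8, bundle_size=2)
jm.checkpoint(1, {"step": 40})
migrated = jm.on_node_failure("n1")
```

In this sketch only the bundle that was placed on the failed node is moved, and it keeps its last checkpoint so that it could be restored from that state on the new node; the remaining bundles are untouched.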