Computational & Technology Resources
an online resource for computational,
engineering & technology publications
Computational Science, Engineering & Technology Series
PATTERNS FOR PARALLEL PROGRAMMING ON GPUS
Edited by: F. Magoulès
Migrating a Big-Data Grade Application to Large GPU Clusters
D. Tello (1), V. Ducrot (1), J.-M. Batto (2), S. Monot (1), F. Boumezbeur (2), V. Arslan (1) and T. Saidani (1)
(1) Alliance Services Plus, Groupe EOLEN, Malakoff, France
D. Tello, V. Ducrot, J.-M. Batto, S. Monot, F. Boumezbeur, V. Arslan, T. Saidani, "Migrating a Big-Data Grade Application to Large GPU Clusters", in F. Magoulès, (Editor), "Patterns for Parallel Programming on GPUs", Saxe-Coburg Publications, Stirlingshire, UK, Chapter 12, pp 281-310, 2014. doi:10.4203/csets.34.12
Keywords: GPU, Cuda, OpenCL, OpenMP, MPI, HMPP, Curie, big data.
This chapter presents a typical example of porting a legacy application to GPU architectures. The application, named MetaProf, aims to find correlation patterns in metagenomic catalogues and to help identify new species. What sets the application apart is the volume of data and computation involved: the input is a matrix of 8 million genes by 800 samples, and the computational complexity is quadratic. Processing such data with a sequential single-core implementation takes more than one month on a commodity server. In this chapter, we first describe a parallel version of the algorithm for multi-core architectures based on hybrid MPI-OpenMP. We then show how emerging GPU architectures proved to be an attractive alternative to these early implementations in terms of raw performance, scalability and power efficiency. Different programming models, namely Cuda, OpenCL and HMPP, are evaluated and compared. Our implementations were tested both on large-scale GPU clusters such as TGCC Titane and Curie and on GPU-based workstations. Our work confirmed that Cuda implementations are the fastest on Nvidia GPUs, whereas their OpenCL and HMPP counterparts perform slightly worse but are valuable in terms of portability and longevity. As far as the initial use case is concerned, a GPU cluster such as Curie allowed us to bring the processing time of a 3 M matrix down to a few minutes.
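To give a sense of why the quadratic cost dominates, the following minimal sketch compares every pair of gene profiles in a small genes-by-samples matrix. It is an illustration under stated assumptions, not the MetaProf algorithm: the abstract does not name the correlation measure, so Pearson correlation is assumed here, and the names `pearson`, `correlated_pairs`, `pair_count` and the `toy` matrix are hypothetical. The sketch also assumes the cost is quadratic in the number of genes.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumed metric)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        # Constant profile: correlation is undefined; reported as 0 in this sketch.
        return 0.0
    return cov / math.sqrt(vx * vy)


def correlated_pairs(matrix, threshold):
    """Return (i, j, r) for every gene pair i < j with |r| >= threshold."""
    hits = []
    for i in range(len(matrix)):
        for j in range(i + 1, len(matrix)):
            r = pearson(matrix[i], matrix[j])
            if abs(r) >= threshold:
                hits.append((i, j, r))
    return hits


def pair_count(n_genes):
    """Number of distinct gene pairs, n * (n - 1) / 2."""
    return n_genes * (n_genes - 1) // 2


# Hypothetical toy input: 4 genes (rows) by 4 samples (columns).
toy = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
    [1.0, 0.0, 1.0, 0.0],
]

if __name__ == "__main__":
    for i, j, r in correlated_pairs(toy, threshold=0.9):
        print(i, j, round(r, 3))
    print(pair_count(8_000_000))
```

Under these assumptions, 8 million genes yield roughly 3.2 × 10^13 gene pairs, each requiring a pass over 800 samples. Since each pair can be evaluated independently, this kind of workload maps naturally onto many cores or GPU threads.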