Computational & Technology Resources
an online resource for computational, engineering & technology publications
Civil-Comp Conferences
ISSN 2753-3239, CCC: 12
PROCEEDINGS OF THE EIGHTH INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED, GPU AND CLOUD COMPUTING FOR ENGINEERING
Edited by: P. Iványi, J. Kruis and B.H.V. Topping
Paper 2.4
GPU Parallelization for Analytical Hierarchical Tucker Representation Using Binary Trees
Z. Qiu1,2, F. Magoulès2 and D. Peláez1
1Institut des Sciences Moléculaires d'Orsay, Université Paris-Saclay, Orsay, Île-de-France, France
Full Bibliographic Reference for this paper
Z. Qiu, F. Magoulès, D. Peláez, "GPU Parallelization for Analytical Hierarchical Tucker Representation Using Binary Trees", in P. Iványi, J. Kruis, B.H.V. Topping (Editors), "Proceedings of the Eighth International Conference on Parallel, Distributed, GPU and Cloud Computing for Engineering", Civil-Comp Press, Edinburgh, UK, Online volume: CCC 12, Paper 2.4, 2025.
Keywords: functional Hierarchical Tucker format, finite basis representation, deep learning, multi-GPU parallelism, DDP, high-dimensional fitting.
Abstract
In this contribution, we present a Distributed Data Parallel (DDP) approach for optimizing an analytical hierarchical Tucker representation in finite basis representation (HT-FBR) of high-order multivariate functions. Training on 1, 2, 4 and 8 GPUs, we achieve up to a 10× speedup over the benchmark on a 6D proof-of-concept dataset and a 2× speedup on a 12D ethene trajectory dataset. This speedup is attributed to our implementation of an element-wise GPU parallelization algorithm for both the forward and backward propagations, together with a customized dataset and data loader configuration. We also show that the model trained on multiple GPUs suffers less from overfitting than the one trained on a single CPU/GPU. This paves the way towards a large-scale, high-performance training scheme and towards model parallelism in a multi-GPU setting, in particular parallelism over the levels of the nodes based on their binary-tree dependency, as well as towards improving accuracy and reducing overfitting as a function of the number of GPUs.
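The abstract refers to Distributed Data Parallel training with a customized dataset and data loader. The following is a minimal sketch of such a multi-GPU setup in PyTorch; the HTFBRModel class, the synthetic 6D data, and all hyperparameters are illustrative assumptions and do not reproduce the paper's actual HT-FBR model or its element-wise kernels.

```python
# Minimal multi-GPU DDP training sketch (placeholder model, not the paper's HT-FBR network).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import TensorDataset, DataLoader, DistributedSampler


class HTFBRModel(torch.nn.Module):
    """Placeholder regression model standing in for the HT-FBR representation."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, hidden), torch.nn.Tanh(),
            torch.nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)


def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each spawned process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Synthetic 6D data as a stand-in for the proof-of-concept dataset.
    x = torch.rand(10_000, 6)
    y = torch.sin(x.sum(dim=1))
    dataset = TensorDataset(x, y)

    # DistributedSampler shards the samples so each GPU sees a disjoint subset.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=256, sampler=sampler)

    model = HTFBRModel(dim=6).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(10):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for xb, yb in loader:
            xb, yb = xb.cuda(local_rank), yb.cuda(local_rank)
            opt.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()  # DDP all-reduces gradients across GPUs here
            opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with, for example, `torchrun --nproc_per_node=8 train.py`, each process drives one GPU, and gradients are averaged across processes during the backward pass.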
Download the full-text of this paper (PDF, 12 pages, 707 KB)