Computational Technology Resources - CCP

Keywords: multithreaded array processor, element by element, preconditioned conjugate gradient solver.

Summary

A range of emerging hardware technologies, such as Field Programmable Gate Arrays (FPGAs) and floating-point co-processors, promise to accelerate engineering computations significantly. The authors selected one of these technologies, the multithreaded array processor, to investigate whether it can accelerate finite element computations. The hardware used in the study was kindly provided by ClearSpeed Ltd [1]. It is an accelerator card that plugs into the PCI-X slot of a standard PC. The card contains 192 processing elements (PEs) working in parallel; each PE has floating-point and integer units and 6 KB of local memory. The card has the potential to speed up significantly any program dominated by floating-point operations.

Iterative solvers such as the conjugate gradient method are built from simple numerical kernels, chiefly repeated matrix-vector multiplications and vector dot products. In the element-by-element implementation of the preconditioned conjugate gradient method (EBE-PCG) [3], no global stiffness matrix is assembled. Instead, the individual element matrices are stored and can be processed independently, in parallel, by the 192 PEs. Thus, the hardware and the algorithm appear to be ideally matched.
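The element-by-element idea can be illustrated with a minimal serial sketch in plain Python. It is not the authors' code: the function names, the use of -1 to mark a restrained freedom, the diagonal preconditioner and the four-element bar example are all assumptions made for illustration. The matrix-vector product is formed as a gather, a small dense product per element, and a scatter, so the global matrix never exists.

```python
def ebe_matvec(elem_mats, elem_dofs, x):
    """Return y = K x, with K held only as a list of element matrices."""
    y = [0.0] * len(x)
    for ke, dofs in zip(elem_mats, elem_dofs):
        # Gather: pull this element's entries of x (restrained dofs read as 0).
        xe = [x[d] if d >= 0 else 0.0 for d in dofs]
        # Local dense product; each element is independent of the others here.
        ye = [sum(kij * xj for kij, xj in zip(row, xe)) for row in ke]
        # Scatter: accumulate the element result into the global vector.
        for d, v in zip(dofs, ye):
            if d >= 0:
                y[d] += v
    return y


def inverse_diagonal(elem_mats, elem_dofs, n):
    """Assumed preconditioner: inverse of the diagonal of K, built per element."""
    diag = [0.0] * n
    for ke, dofs in zip(elem_mats, elem_dofs):
        for i, d in enumerate(dofs):
            if d >= 0:
                diag[d] += ke[i][i]
    return [1.0 / d for d in diag]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def ebe_pcg(elem_mats, elem_dofs, b, tol=1e-10, max_iter=200):
    """Preconditioned conjugate gradients using only element-level products."""
    n = len(b)
    minv = inverse_diagonal(elem_mats, elem_dofs, n)
    x = [0.0] * n
    r = list(b)                                  # residual for x = 0
    z = [m * ri for m, ri in zip(minv, r)]       # preconditioned residual
    p = list(z)                                  # first search direction
    rz = dot(r, z)
    if rz == 0.0:
        return x, 0                              # zero load: x = 0 is the solution
    b_norm = dot(b, b) ** 0.5
    for it in range(1, max_iter + 1):
        q = ebe_matvec(elem_mats, elem_dofs, p)
        alpha = rz / dot(p, q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, it
        z = [m * ri for m, ri in zip(minv, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Hypothetical example: four bar elements of stiffness 2 in a chain,
# the first node fixed (dof -1) and a unit load on the last node.
elem_mats = [[[2.0, -2.0], [-2.0, 2.0]] for _ in range(4)]
elem_dofs = [(-1, 0), (0, 1), (1, 2), (2, 3)]
load = [0.0, 0.0, 0.0, 1.0]
x, iters = ebe_pcg(elem_mats, elem_dofs, load)
```

For this chain the exact displacements are 0.5, 1.0, 1.5 and 2.0, which the sketch reproduces to within the tolerance. The loop over elements in `ebe_matvec` is the part that maps naturally onto many PEs; the accumulation in the scatter step is the part that does not.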

A parallel finite element library, ParaFEM [2], has been modified by the authors to offload numerical computations to the multithreaded array processor. The core, i.e. the solver, has been rewritten for a special C-like compiler [1] so that it runs on the multithreaded array processor. The main part of the ParaFEM-based program runs on the host PC and generates the system of linear equations. This system is transferred to the multithreaded array processor, where it is solved, and the solution vector is then retrieved.

The first implementation of the solver used the full parallel capability of the multithreaded array processor for only one part of the EBE-PCG algorithm, the matrix-vector multiplication, which ran on all 96 available processing elements of one chip. The remaining steps, including the gather and scatter, were executed on the serial processing facility provided by the chip. The multithreaded array processor has many PEs, each of which is slower than the single-core processor of a standard PC. This design reduces energy consumption and increases the FLOPS per watt, an important consideration for current technology. To use the technology properly, serial sections should be avoided, since the performance comes from the PEs working in parallel. Parallelizing the gather is trivial. Parallelizing the scatter is harder, because multiple PEs cannot update the same memory location at the same time. Nonetheless, a solution was found by exploiting special properties of both the multithreaded array processor and the finite element mesh.
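The summary does not say which properties of the hardware and mesh were exploited, so the authors' scheme cannot be reproduced here. As a stand-in, the sketch below shows a different and widely used way of avoiding scatter conflicts: element colouring. Elements that share a node receive different colours, so all elements of one colour can scatter at the same time without two of them touching the same memory location. The function names and the four-quadrilateral strip mesh are hypothetical.

```python
def colour_elements(elem_nodes):
    """Greedy colouring: elements that share a node never share a colour."""
    node_colours = {}            # node -> colours used by elements touching it
    colours = []
    for nodes in elem_nodes:
        used = set()
        for n in nodes:
            used |= node_colours.get(n, set())
        c = 0
        while c in used:         # smallest colour not used by any neighbour
            c += 1
        colours.append(c)
        for n in nodes:
            node_colours.setdefault(n, set()).add(c)
    return colours


def scatter_by_colour(elem_nodes, elem_values, colours, n_nodes):
    """Accumulate element values into a nodal vector one colour at a time."""
    y = [0.0] * n_nodes
    for c in sorted(set(colours)):
        batch = [e for e, ce in enumerate(colours) if ce == c]
        # Sanity check: inside one colour every node is written at most once,
        # so these updates could be issued concurrently without conflicts.
        touched = [n for e in batch for n in elem_nodes[e]]
        assert len(touched) == len(set(touched))
        for e in batch:          # serial stand-in for one parallel step
            for n, v in zip(elem_nodes[e], elem_values[e]):
                y[n] += v
    return y


# Hypothetical mesh: a strip of four quadrilaterals, bottom nodes 0-4 and
# top nodes 5-9; neighbouring elements share an edge (two nodes).
elem_nodes = [(0, 1, 6, 5), (1, 2, 7, 6), (2, 3, 8, 7), (3, 4, 9, 8)]
elem_values = [(1.0, 1.0, 1.0, 1.0)] * 4
colours = colour_elements(elem_nodes)
y = scatter_by_colour(elem_nodes, elem_values, colours, 10)
```

On this strip the greedy pass alternates between two colours, so the scatter completes in two conflict-free steps instead of four serial ones. The cost of colouring is paid once per mesh, while the scatter is repeated in every solver iteration.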

References

[1] ClearSpeed Ltd, www.clearspeed.com
[2] L. Margetts, "Parallel Finite Element Analysis", University of Manchester, 2002.
[3] I.M. Smith and D.V. Griffiths, "Programming the Finite Element Method", Wiley, 2004.


Full Bibliographic Reference

V. Szeremi, L. Margetts, "The Implementation of an Element by Element Preconditioned Conjugate Gradient Solver for a Novel Multithreaded Array Processor", in B.H.V. Topping, (Editor), "Proceedings of the Fifteenth UK Conference of the Association of Computational Mechanics in Engineering", Civil-Comp Press, Stirlingshire, UK, Paper 54, 2007. doi:10.4203/ccp.85.54

Civil-Comp Proceedings, ISSN 1759-3433, CCP: 85. Author affiliations: V. Szeremi, School of Materials; L. Margetts, Manchester Computing; University of Manchester, United Kingdom.