Acceleration of the Relativistic Dirac–Kohn–Sham Method with GPU: A Pre-Exascale Implementation of BERTHA and PyBERTHA

Storchi, Loriano
2025-01-01

Abstract

In this paper, we present recent advances in the Dirac-Kohn-Sham (DKS) implementation of the BERTHA code. We show that the simple underlying structure of the FORTRAN code favors efficient porting to GPUs, leading to a hybrid CPU/GPU implementation (OpenMP/OpenACC) in which the most computationally intensive part of the DKS matrix evaluation (three-center two-electron integrals evaluated via the McMurchie-Davidson scheme) is offloaded to the GPU through OpenACC compiler directives. Combined with GPU-optimized linear algebra libraries (cuBLAS, cuSOLVER), this scheme significantly accelerates DKS calculations. In addition, the low-level integral kernel developed at the FORTRAN level was used to port our Python-based real-time DKS (RT-TDDKS) implementation (PyBERTHART) to the GPU. Tests were run on LEONARDO, the new Tier-0 EuroHPC supercomputer of the CINECA Supercomputing Centre, using a single NVIDIA A100 card. Compared with our OpenMP (i.e., CPU-only) parallel implementation on 32 cores, we achieve a speedup of up to 30 for Au16 in a single-point DKS energy calculation and of up to 10 for the Au8 systems in an RT-TDDKS calculation. The approach is very general and, to our knowledge, represents the first port of a Python API to GPUs based on a FORTRAN kernel for the evaluation of two-electron integrals. The implementation is currently limited to a single GPU accelerator, but future paths toward an actual exascale implementation are discussed.
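
To illustrate why the three-center integral step lends itself to offloading, the following is a minimal NumPy sketch and not the BERTHA implementation. It assumes, since the abstract does not spell this out, that the three-center integrals enter the DKS matrix through a density-fitting style contraction J[m, n] = sum over t of (t|m n) * d[t]. The names `coulomb_from_three_center`, `coulomb_reference`, `ints` and `coeffs` are hypothetical, and the integrals are random real numbers rather than McMurchie-Davidson values.

```python
import numpy as np


def coulomb_from_three_center(ints, coeffs):
    """Contract toy three-center integrals (t|m n) with fitting coefficients d[t].

    ints   : array of shape (n_aux, n_bas, n_bas), hypothetical integral values
    coeffs : array of shape (n_aux,), hypothetical fitting coefficients
    Returns an (n_bas, n_bas) matrix.
    """
    return np.einsum("tmn,t->mn", ints, coeffs)


def coulomb_reference(ints, coeffs):
    """Same contraction written as explicit loops.

    Every (m, n) element is computed independently of the others, which is the
    kind of loop nest that directive-based offloading typically targets.
    """
    n_aux, n_bas, _ = ints.shape
    out = np.zeros((n_bas, n_bas))
    for m in range(n_bas):
        for n in range(n_bas):
            for t in range(n_aux):
                out[m, n] += ints[t, m, n] * coeffs[t]
    return out


# Toy data only: sizes and values are assumptions chosen for illustration.
rng = np.random.default_rng(0)
n_aux, n_bas = 6, 4
ints = rng.standard_normal((n_aux, n_bas, n_bas))
ints = 0.5 * (ints + ints.transpose(0, 2, 1))  # make each slice symmetric in (m, n)
coeffs = rng.standard_normal(n_aux)

J = coulomb_from_three_center(ints, coeffs)
```

In this toy form the cost grows with the number of basis-function pairs times the number of auxiliary functions, and no element of the result depends on another, so the work can be spread over many GPU threads.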
Use this identifier to cite or link to this document: https://hdl.handle.net/11564/886516