Acceleration of the Relativistic Dirac–Kohn–Sham Method with GPU: A Pre-Exascale Implementation of BERTHA and PyBERTHA

Storchi, Loriano; Bellentani, Laura; Hammond, Jeff; Orlandini, Sergio; Pacifici, Leonardo; Antonini, Nicoló; Belpassi, Leonardo

doi:10.1021/acs.jctc.4c01759

In this paper, we present recent advances in the implementation of the Dirac-Kohn-Sham (DKS) method in the BERTHA code. We show that the simple underlying structure of the FORTRAN code favors efficient porting to GPUs, leading to a particularly efficient hybrid CPU/GPU implementation (OpenMP/OpenACC) in which the most computationally intensive part of the DKS matrix evaluation (three-center two-electron integrals evaluated via the McMurchie-Davidson scheme) is offloaded to the GPU via OpenACC compiler directives. This scheme, combined with linear algebra libraries optimized for GPUs (cuBLAS, cuSOLVER), significantly accelerates DKS calculations. In addition, the low-level integral kernel developed here at the FORTRAN level was used to port our Python-based real-time DKS (RT-TDDKS) implementation (PyBERTHART) to the GPU. The results obtained with a single NVIDIA A100 card on LEONARDO, the new Tier-0 EuroHPC supercomputer of the CINECA Supercomputing Centre, are very satisfactory. We achieve a speedup of up to 30 for Au16 in a single-point DKS energy calculation and of up to 10 for the Au8 systems in an RT-TDDKS calculation, compared to our OpenMP (i.e., CPU-only) parallel implementation running on 32 cores. The approach presented here is very general and, to our knowledge, represents the first port of a Python API to GPUs based on a FORTRAN kernel for the evaluation of two-electron integrals. The implementation is currently limited to a single GPU accelerator, but possible paths toward a true exascale implementation are discussed.
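
As an illustration of why three-center two-electron integrals dominate the DKS matrix build, the following minimal Python sketch contracts a three-index integral array with a vector of fitting coefficients to form a Coulomb-like matrix. It assumes a density-fitting form, J_mn = sum_P (mn|P) d_P, which is not spelled out above. The function names, array shapes, and random real-valued toy data are all hypothetical. This is not the BERTHA or PyBERTHA API, and the actual relativistic code works with complex spinor quantities and evaluates the integrals with the McMurchie-Davidson scheme rather than reading them from an array.

```python
import numpy as np


def coulomb_from_three_center(eri3c, d):
    """Contract toy three-center integrals (m n|P) with coefficients d_P.

    eri3c has shape (n_basis, n_basis, n_aux); d has shape (n_aux,).
    Returns the (n_basis, n_basis) matrix J_mn = sum_P eri3c[m, n, P] * d[P].
    """
    return np.einsum("mnp,p->mn", eri3c, d)


def coulomb_reference(eri3c, d):
    """Same contraction written as explicit loops, for checking only."""
    n_basis, _, n_aux = eri3c.shape
    J = np.zeros((n_basis, n_basis))
    for m in range(n_basis):
        for n in range(n_basis):
            for p in range(n_aux):
                J[m, n] += eri3c[m, n, p] * d[p]
    return J


# Hypothetical toy sizes and random data (assumption: real-valued,
# symmetric in the two basis-function indices).
rng = np.random.default_rng(0)
n_basis, n_aux = 4, 6
eri3c = rng.standard_normal((n_basis, n_basis, n_aux))
eri3c = 0.5 * (eri3c + eri3c.transpose(1, 0, 2))
d = rng.standard_normal(n_aux)

J = coulomb_from_three_center(eri3c, d)
```

The triple loop in `coulomb_reference` makes the cost visible: the work scales with the square of the basis size times the auxiliary basis size, and each (m, n) element can be computed independently, which is the kind of loop structure that directive-based offloading targets.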