Transferable and Transparent Energy Decomposition-Based Machine Learning Models for Computing Accurate Reaction Energetics
Storchi, Loriano
2025-01-01
Abstract
We present a transferable, interpretable, and modular machine-learning framework that enhances the accuracy of density functional theory (DFT) reaction energies using physically meaningful energy-decomposition descriptors. Reaction energies computed at the DFT level with standard basis sets are first decomposed into chemically intuitive contributions, such as kinetic and potential energy, which are then used to train a library of linear regression (LR) models. This library includes a general-purpose model that reduces the mean absolute percentage error (MAPE) relative to gold-standard CCSD(T)/CBS reference values by up to 63% compared to uncorrected DFT across extended benchmark sets. In parallel, a series of specialized LR models provides improved accuracy for specific reaction classes. A random forest (RF) classifier dynamically selects the optimal model for each case, pushing accuracy further and achieving a MAPE reduction of up to 123 percentage points, all while maintaining full model interpretability. In a rigorous out-of-distribution stress test on the WCCR10 data set, which contains transition-metal complexes absent from training, both the general LR model and the RF/LR pipeline retain robust performance. Unlike typical neural network models, which often face generalization challenges beyond their training set, our framework maintains stable performance outside its training domain.
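The abstract reports MAPE improvements in two different units: a relative reduction ("by up to 63%") and an absolute reduction ("up to 123 percentage points"). The record does not give the formulas, so the following is only a minimal sketch of how the two quantities are commonly computed, assuming the standard MAPE definition. The function name `mape` and all energy values are hypothetical and chosen purely for illustration; they are not data from the paper.

```python
def mape(predicted, reference):
    """Mean absolute percentage error, returned in percent.

    Assumes the standard definition: mean of |pred - ref| / |ref| * 100.
    """
    if len(predicted) != len(reference):
        raise ValueError("inputs must have the same length")
    terms = [abs((p - r) / r) for p, r in zip(predicted, reference)]
    return 100.0 * sum(terms) / len(terms)


# Made-up reaction energies (kcal/mol), for illustration only.
reference = [10.0, -20.0, 40.0]    # stand-in for CCSD(T)/CBS values
uncorrected = [12.0, -25.0, 44.0]  # stand-in for raw DFT values
corrected = [10.5, -21.0, 41.0]    # stand-in for ML-corrected values

mape_dft = mape(uncorrected, reference)
mape_ml = mape(corrected, reference)

# Relative reduction: a fraction of the baseline MAPE, expressed in percent.
relative_reduction = 100.0 * (mape_dft - mape_ml) / mape_dft

# Absolute reduction: the plain difference of the two MAPE values,
# expressed in percentage points.
point_reduction = mape_dft - mape_ml

print(f"MAPE (uncorrected): {mape_dft:.2f}%")
print(f"MAPE (corrected):   {mape_ml:.2f}%")
print(f"relative reduction: {relative_reduction:.2f}%")
print(f"absolute reduction: {point_reduction:.2f} percentage points")
```

Under this definition a relative reduction cannot exceed 100% as long as the corrected MAPE is non-negative, whereas a reduction measured in percentage points can exceed 100 whenever the baseline MAPE is itself above 100%.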


