Predicting protein pKa by environment similarity

Milletti, F.; Storchi, Loriano; Cruciani, G.

doi:10.1002/prot.22363

A statistical method to predict protein pKa has been developed by using the 3D structure of a protein and a database of 434 experimental protein pKa values. Each pKa in the database is associated with a fingerprint that describes the chemical environment around an ionizable residue. A computational tool, MoKaBio, has been developed to automatically identify ionizable residues in a protein, generate fingerprints that describe the chemical environment around such residues, and predict pKa from the experimental pKa values in the database by using a similarity metric. The method, which correctly retrieved the pKa of 429 of the 434 ionizable sites in the database, was cross-validated by leave-one-out and yielded a root mean square error (RMSE) of 0.95, a result superior to that obtained with the Null Model (RMSE 1.07) and with other well-established protein pKa prediction tools. This novel approach is suitable for rationalizing protein pKa by comparing the region around the ionizable site with similar regions whose ionizable site pKa is known. The pKa of residues that have a unique environment not represented in the training set cannot be predicted accurately; however, the method offers the advantage of being trainable to increase its predictive power.
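
The general idea of similarity-based prediction with leave-one-out validation can be illustrated with a minimal sketch. It is not MoKaBio's implementation: the abstract does not specify the fingerprint encoding or the similarity metric, so the binary fingerprints, the Tanimoto similarity, the k-nearest-neighbour averaging, the toy database values, and the reference pKa of 4.0 used for the null model are all assumptions made for illustration.

```python
import math

def tanimoto(a, b):
    """Tanimoto similarity of two equal-length non-negative vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0

def predict_pka(query_fp, database, k=1):
    """Average the pKa of the k database entries most similar to the query."""
    ranked = sorted(database, key=lambda e: tanimoto(query_fp, e[0]), reverse=True)
    top = ranked[:k]
    return sum(pka for _, pka in top) / len(top)

def loo_rmse(database, k=1):
    """Leave-one-out RMSE: predict each entry from all the other entries."""
    sq_err = 0.0
    for i, (fp, pka) in enumerate(database):
        rest = database[:i] + database[i + 1:]
        sq_err += (predict_pka(fp, rest, k) - pka) ** 2
    return math.sqrt(sq_err / len(database))

def null_rmse(database, reference_pka):
    """RMSE of a null model that always predicts one fixed reference pKa."""
    sq_err = sum((reference_pka - pka) ** 2 for _, pka in database)
    return math.sqrt(sq_err / len(database))

# Toy database of (fingerprint, pKa) pairs for a single residue type.
# These numbers are invented for the example, not experimental data.
TOY_DB = [
    ([1, 1, 0, 0], 3.5),
    ([1, 1, 0, 1], 3.7),
    ([0, 0, 1, 1], 4.6),
    ([0, 1, 1, 1], 4.4),
]

if __name__ == "__main__":
    print("leave-one-out RMSE:", loo_rmse(TOY_DB))
    print("null-model RMSE:", null_rmse(TOY_DB, reference_pka=4.0))
```

In this sketch a query whose environment resembles nothing in the database still receives the pKa of its nearest neighbour, however dissimilar, which mirrors the limitation noted for residues with unique environments. Adding entries to the database is the only "training" step.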