HIV protease inhibitors QSAR

Quantitative structure-activity relationship (QSAR) models and their applicability domain analysis on HIV-1 protease inhibitors by machine learning methods

Tian, Y.J.; Zhang, S.D.; Yin, H.Y.; Yan, A.X.*

Chemometrics and Intelligent Laboratory Systems, 2020, 196, 103888

Read the original article

Download the raw data

Download Supporting Information

HIV-1 protease inhibitors (PIs) make a vital contribution to highly active antiretroviral therapy (HAART) for human immunodeficiency virus (HIV) infection. In this study, 14 quantitative structure-activity relationship (QSAR) models were built on 1238 PIs using four machine learning methods: multiple linear regression (MLR), support vector machine (SVM), random forest (RF), and deep neural network (DNN). The best model, Model 2G, built with the DNN algorithm, achieved a coefficient of determination (R2) of 0.88 and 0.79 and a root mean squared error (RMSE) of 0.39 and 0.51 on the training set and test set, respectively. For Model 2G, an applicability domain threshold (ADT) of 1.765 was obtained from the training set. A compound whose similarity distance (d) is less than the ADT is considered to be inside the applicability domain, where the model's prediction is regarded as reliable; by this criterion, 65.37% of the compounds in the test set were reliably predicted. In addition, the 1238 PIs were manually divided into eight subsets with different scaffolds. Hydroxylamine derivatives and seven-membered cyclic urea derivatives showed higher inhibitory activity than the other subsets. We also built QSAR models with the SVM, RF, and DNN methods on two of these subsets: 299 hydroxylamine derivative inhibitors (Dataset 2) and 377 seven-membered cyclic urea derivative inhibitors (Dataset 3). The best model on Dataset 2, Model 3A, gave an R2 of 0.71 and an RMSE of 0.53 on the test set. The best model on Dataset 3, Model 4B, gave an R2 of 0.82 and an RMSE of 0.51 on the test set. Finally, we analyzed the descriptors that contribute significantly to the bioactivity of the inhibitors in these two subsets. Highly active seven-membered cyclic urea derivatives usually contained several aromatic nitrogen heterocyclic substituents, such as imidazole and pyrazole, whereas the oxazolidinone group and sulfanilamide appeared mainly in highly active hydroxylamine derivatives. These observations may be useful in designing promising HIV-1 protease inhibitors.
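
The applicability-domain rule in the abstract (a compound is inside the domain when its similarity distance d is below the ADT) can be illustrated with a minimal sketch. This page does not say how d or the ADT were computed, so everything below is an assumption made for illustration only: d is taken as the mean Euclidean distance to the k nearest training compounds in descriptor space, and the threshold as the mean plus z times the standard deviation of those distances over the training set. The function names, k, z, and the toy descriptor vectors are all hypothetical.

```python
import math
from statistics import mean, pstdev


def knn_distance(x, reference, k):
    """Mean Euclidean distance from x to its k nearest vectors in reference."""
    dists = sorted(math.dist(x, r) for r in reference)
    return mean(dists[:k])


def fit_threshold(train, k=3, z=0.5):
    """Assumed threshold: mean + z * std of leave-one-out k-NN distances."""
    d_train = [
        knn_distance(x, train[:i] + train[i + 1:], k)
        for i, x in enumerate(train)
    ]
    return mean(d_train) + z * pstdev(d_train)


def coverage(test, train, threshold, k=3):
    """Fraction of test compounds whose distance d is below the threshold."""
    inside = [knn_distance(x, train, k) < threshold for x in test]
    return sum(inside) / len(inside)


# Toy, made-up descriptor vectors (not data from the study).
train = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.2, 0.1]]
test = [[0.05, 0.05], [5.0, 5.0]]

adt = fit_threshold(train, k=2)
print(f"threshold = {adt:.3f}")
print(f"coverage  = {coverage(test, train, adt, k=2):.0%}")
```

Under these assumptions, the first toy test compound sits among the training points and falls inside the domain, while the second is far from all of them and falls outside. A coverage figure such as the 65.37% reported for Model 2G is the same kind of fraction, computed over the real test set.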

Dataset 1: 1238 HIV-1 protease inhibitors

Model Name	Algorithm	Descriptors	Training set R2	Training set RMSE	Test set R2	Test set RMSE
Model 1A	MLR	21 RDKit descriptors	0.55	0.75	0.56	0.77
Model 1B	MLR	22 RDKit descriptors	0.57	0.73	0.55	0.76
Model 1C	SVM	21 RDKit descriptors	0.89	0.38	0.73	0.60
Model 1D	SVM	21 RDKit descriptors	0.90	0.35	0.76	0.56
Model 1E	RF	22 RDKit descriptors	0.86	0.42	0.75	0.58
Model 1F	RF	24 RDKit descriptors	0.86	0.41	0.74	0.59
Model 1G	DNN	63 RDKit descriptors	0.91	0.36	0.76	0.57
Model 2A	MLR	25 RDKit descriptors	0.54	0.77	0.57	0.73
Model 2B	MLR	26 RDKit descriptors	0.56	0.75	0.57	0.72
Model 2C	SVM	24 RDKit descriptors	0.84	0.46	0.76	0.55
Model 2D	SVM	13 RDKit descriptors	0.83	0.47	0.76	0.54
Model 2E	RF	23 RDKit descriptors	0.85	0.44	0.76	0.55
Model 2F	RF	12 RDKit descriptors	0.85	0.44	0.74	0.56
Model 2G	DNN	69 RDKit descriptors	0.88	0.39	0.79	0.51

Dataset 2: 299 hydroxylamine derivative inhibitors

Model Name	Algorithm	Descriptors	Training set R2	Training set RMSE	Test set R2	Test set RMSE
Model 3A	RF	25 RDKit descriptors	0.89	0.37	0.71	0.53
Model 3B	SVM	16 RDKit descriptors	0.84	0.38	0.64	0.60
Model 3C	RF	22 RDKit descriptors	0.78	0.43	0.61	0.56
Model 3D	SVM	18 RDKit descriptors	0.80	0.41	0.65	0.62
Model 3E	DNN	68 RDKit descriptors	0.90	0.30	0.69	0.59

Dataset 3: 377 seven-membered cyclic urea derivative inhibitors

Model Name	Algorithm	Descriptors	Training set R2	Training set RMSE	Test set R2	Test set RMSE
Model 4A	RF	25 RDKit descriptors	0.90	0.28	0.74	0.53
Model 4B	SVM	23 RDKit descriptors	0.87	0.43	0.82	0.51
Model 4C	RF	18 RDKit descriptors	0.91	0.27	0.78	0.56
Model 4D	SVM	24 RDKit descriptors	0.85	0.44	0.73	0.59
Model 4E	DNN	68 RDKit descriptors	0.94	0.27	0.81	0.50
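
For readers who want to compare their own predictions against the R2 and RMSE columns in the tables above, the two metrics can be computed as in the sketch below. This page does not state which definition of R2 was used. The sketch assumes the coefficient of determination, 1 - SS_res / SS_tot; some QSAR studies instead report the squared Pearson correlation, which can give different values. The function names and example values are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up activity values, not data from the study.
y_true = [6.0, 7.0, 8.0, 9.0]
y_pred = [6.5, 6.5, 8.5, 8.5]

print(f"R2   = {r2(y_true, y_pred):.2f}")    # R2   = 0.80
print(f"RMSE = {rmse(y_true, y_pred):.2f}")  # RMSE = 0.50
```

Reporting both metrics for the training set and the test set, as the tables do, makes it easy to spot models whose training fit is much better than their test performance.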

Main project members

Tian Yujia (田钰嘉)

PhD student

1204429112@qq.com

Zhang Shengde (张声德)

Master's student