Title: Influence of Data Similarity on the Scoring Power of Machine-learning Scoring Functions for Docking
Authors: Sze, Kam-Heung; Xiong, Zhiqiang; Ma, Jinlong; Lu, Gang; Chan, Wai-Yee; Li, Hongjian
Author affiliations: [Sze, Kam-Heung; Xiong, Zhiqiang; Ma, Jinlong; Li, Hongjian] SDIVF R&D Ctr, Bioinformat Unit, Sha Tin, Hong Kong Sci Pk, Hong Kong, Peoples R China.
Conference: 13th International Joint Conference on Biomedical Engineering Systems and Technologies
Conference date: FEB 24-26, 2020
Source: PROCEEDINGS OF THE 13TH INTERNATIONAL JOINT CONFERENCE ON BIOMEDICAL ENGINEERING SYSTEMS AND TECHNOLOGIES, VOL 3: BIOINFORMATICS
Keywords: Molecular Docking; Binding Affinity Prediction; Machine Learning; Feature Engineering; Data Similarity
Abstract: Recent studies exploring the influence of data similarity on the scoring power of machine-learning scoring functions have drawn inconsistent conclusions, but they were all based on the PDBbind v2007 refined set, which is limited to just 1300 protein-ligand complexes. Whether these conclusions generalize to substantially larger and more diverse datasets warrants further examination. In addition, the previous definition of protein structure similarity, which relied on aligning monomers, might not truly reflect the similarity it was intended to capture, and the impact of binding pocket similarity has not been investigated. Here we employed the updated refined set v2013, which provides 2959 complexes, and used not only protein structure and ligand fingerprint similarity but also a novel measure based on binding pocket topology dissimilarity to systematically control the incorporation of similar or dissimilar complexes into the training of predictive models. Three empirical scoring functions (X-Score, AutoDock Vina, and Cyscore) and their random forest counterparts were evaluated. The results confirm that dissimilar training complexes may be valuable if combined with appropriate machine learning algorithms and informative descriptor sets. Machine-learning scoring functions acquire their remarkable scoring power by mining more data to keep improving performance, whereas classical scoring functions lack such learning ability. The software code, data, and supplementary results of this study are available at https://GitHub.com/HongjianLi/MLSF.
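
The idea of using a similarity measure to control which complexes enter the training set can be illustrated with a minimal sketch. It is not the authors' implementation (that is in the repository linked above). Several things here are assumptions made for illustration only: similarity is measured against a held-out test set, ligand fingerprints are represented as sets of on-bit indices, the Tanimoto coefficient is used as the fingerprint similarity, and the names `tanimoto` and `select_training` are hypothetical.

```python
from typing import Dict, FrozenSet, List

# Assumed representation: a fingerprint is the set of its "on" bit indices.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient of two on-bit sets (1.0 means identical)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def select_training(
    train: Dict[str, Fingerprint],
    test: Dict[str, Fingerprint],
    cutoff: float,
    keep_dissimilar: bool = True,
) -> List[str]:
    """Pick training complexes by their highest similarity to any test complex.

    keep_dissimilar=True  keeps complexes whose maximum similarity is <= cutoff.
    keep_dissimilar=False keeps complexes whose maximum similarity is >  cutoff.
    """
    selected = []
    for name, fp in train.items():
        # Similarity of a training complex to the test set = best match found.
        max_sim = max((tanimoto(fp, t) for t in test.values()), default=0.0)
        if (max_sim <= cutoff) == keep_dissimilar:
            selected.append(name)
    return selected


if __name__ == "__main__":
    # Toy fingerprints, invented for this example.
    train = {
        "a": frozenset({1, 2, 3}),
        "b": frozenset({7, 8}),
        "c": frozenset({1, 2, 4}),
    }
    test = {"t": frozenset({1, 2, 3})}
    for cutoff in (0.0, 0.5, 1.0):
        print(cutoff, select_training(train, test, cutoff))
```

Raising the cutoff step by step yields nested training sets of growing size, so a model can be retrained at each step to see how its performance changes as more similar complexes are admitted. The same filter would work with a protein structure or binding pocket measure in place of `tanimoto`.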