A practical guide to machine-learning scoring for structure-based virtual screening

Tran-Nguyen, V-K; Junaid, M; Simeon, S; Ballester, PJ

File	Description	Size	Format
accepted-version.pdf	Accepted version	2.28 MB	Adobe PDF

Title:	A practical guide to machine-learning scoring for structure-based virtual screening
Authors:	Tran-Nguyen, V-K; Junaid, M; Simeon, S; Ballester, PJ
Item Type:	Journal Article
Abstract:	Structure-based virtual screening (SBVS) via docking has been used to discover active molecules for a range of therapeutic targets. Chemical and protein data sets that contain integrated bioactivity information have increased both in number and in size. Artificial intelligence and, more concretely, its machine-learning (ML) branch, including deep learning, have effectively exploited these data sets to build scoring functions (SFs) for SBVS against targets with an atomic-resolution 3D model (e.g., generated by X-ray crystallography or predicted by AlphaFold2). Often outperforming their generic and non-ML counterparts, target-specific ML-based SFs represent the state of the art for SBVS. Here, we present a comprehensive and user-friendly protocol to build and rigorously evaluate these new SFs for SBVS. This protocol is organized into four sections: (i) using a public benchmark of a given target to evaluate an existing generic SF; (ii) preparing experimental data for a target from public repositories; (iii) partitioning data into a training set and a test set for subsequent target-specific ML modeling; and (iv) generating and evaluating target-specific ML SFs by using the prepared training-test partitions. All necessary code and input/output data related to three example targets (acetylcholinesterase, HMG-CoA reductase, and peroxisome proliferator-activated receptor-α) are available at https://github.com/vktrannguyen/MLSF-protocol, can be run by using a single computer within 1 week and make use of easily accessible software/programs (e.g., Smina, CNN-Score, RF-Score-VS and DeepCoy) and web resources. Our aim is to provide practical guidance on how to augment training data to enhance SBVS performance, how to identify the most suitable supervised learning algorithm for a data set, and how to build an SF with the highest likelihood of discovering target-active molecules within a given compound library.
Issue Date:	Nov-2023
Date of Acceptance:	3-Jul-2023
URI:	http://hdl.handle.net/10044/1/108994
DOI:	10.1038/s41596-023-00885-w
ISSN:	1750-2799
Publisher:	Nature Research
Start Page:	3460
End Page:	3511
Journal / Book Title:	Nature Protocols
Volume:	18
Issue:	11
Copyright Statement:	Copyright © 2023 Springer-Verlag. This version of the article has been accepted for publication, after peer review (when applicable) and is subject to Springer Nature’s AM terms of use, but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at: http://dx.doi.org/10.1038/s41596-023-00885-w
Publication Status:	Published
Online Publication Date:	2023-10-16
Appears in Collections:	Bioengineering
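
Illustrative sketch (not part of the record or of the authors' code): the abstract describes evaluating scoring functions by how likely they are to surface target-active molecules in a compound library, but it does not name a metric. The example below assumes the enrichment factor (EF), a metric commonly used in virtual screening, and computes it for a ranked test set. The function name `enrichment_factor`, the toy scores and labels, and the 5% cutoff are all hypothetical choices made for illustration; the protocol's actual evaluation code is in the GitHub repository cited in the abstract.

```python
def enrichment_factor(scores, labels, fraction=0.01):
    """Enrichment factor at a top fraction of a ranked library.

    EF = (hit rate among the top-ranked molecules) / (hit rate in the
    whole library). Assumes higher score = predicted more likely active,
    and labels are 1 (active) or 0 (inactive/decoy).
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal in length")
    n_total = len(scores)
    n_actives = sum(labels)
    if n_actives == 0:
        raise ValueError("at least one active molecule is required")

    # Number of molecules in the top fraction (at least one).
    n_top = max(1, round(fraction * n_total))

    # Rank molecules by descending predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_actives = sum(label for _, label in ranked[:n_top])

    # Equivalent to (top_actives / n_top) / (n_actives / n_total).
    return (top_actives * n_total) / (n_top * n_actives)


# Toy test set (hypothetical): 100 molecules, 5 of them active.
labels = [1] * 5 + [0] * 95

# A scoring function that ranks every active first ...
ideal_scores = [100 - i for i in range(100)]
# ... and one that ranks every active last.
worst_scores = [i for i in range(100)]

ef_ideal = enrichment_factor(ideal_scores, labels, fraction=0.05)
ef_worst = enrichment_factor(worst_scores, labels, fraction=0.05)
print(ef_ideal, ef_worst)
```

In this toy case the maximum attainable EF at 5% is 1 / 0.05 = 20, reached when all five actives occupy the top five ranks. Under the same assumptions, a target-specific ML scoring function and a generic one could be compared by computing this value on the same held-out test set.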