Virtual screening of the SARS-CoV-2 main protease with rxDock and pose scoring
Overview
Questions:Objectives:
How can candidate ligands be generated and docked to a protein in Galaxy?
How can the poses of the docked ligands be evaluated?
How can a workflow for drug virtual screening be constructed in Galaxy?
Requirements:
Understand how Galaxy was used to perform docking and pose scoring on the SARS-CoV-2 main protease (MPro).
Replicate the study on a (very) small scale
Gain familiarity with the docking and scoring techniques involved.
- Introduction to Galaxy Analyses
- Computational chemistry
- Protein-ligand docking: tutorial hands-on
Time estimation: 2 hoursLevel: Intermediate IntermediateSupporting Materials:Last modification: Sep 2, 2021
Introduction
This tutorial provides a companion to the work performed in March 2020 by InformaticsMatters, the Diamond Light Source, and the European Galaxy Team to perform virtual screening on candidate ligands for the SARS-CoV-2 main protease (MPro). This work is described here.
In this tutorial, you will perform protein-ligand docking to MPro using rxDock (Ruiz-Carmona et al. 2014), a version of the popular rDock software, and score the results using two different methods. The same tools will be used as in the original study, but with a smaller dataset.
Agenda
In this tutorial, we will cover:
Background
Early in March 2020, the Diamond Light Source completed a successful fragment screen on MPro, which provided 55 fragment hits (see their press release here). In an effort to identify candidate molecules for binding, InformaticsMatters, the XChem group and the European Galaxy team joined forces to construct and execute a Galaxy workflow for performing and evaluating molecular docking on a massive scale.
An initial list of 41,000 candidate molecules was assembled by using the Fragalysis fragment network to elaborate from the initial fragment hits, as described here. These were used as inputs for the docking and scoring workflow. The workflow consists of the following steps, each of which was carried out using tools installed on the European Galaxy server:
- Charge enumeration of the 41,000 candidate molecules selected based on the fragment hits.
- Generation of 3D conformations based on SMILES strings of the candidate molecules.
- Docking of molecules into each of the MPro structures using rxDock.
- Evaluation of the docking poses using a TransFS, a deep learning approach (Scantlebury 2019) developed by the XChem group and collaborators, and SuCOS scoring (Leong 2019), which compares the poses with the structures of the original fragment hits.
The original study required almost 20 years of CPU time, not counting GPU resources consumed. This is obviously not reproducible as a tutorial. Therefore, we will repeat the workflow with a small library of just 100 molecules, on a single MPro fragment structure. Links will be provided to original Galaxy histories, with notes to explain where and why things were done differently to the tutorial.
Get data
We require three datasets for the simulation and analysis:
- A list of 100 ligand candidates. These are the molecules which will be docking into the protein binding site.
- A PDB file of the receptor MPro protein (without ligand or solvent).
- A list of fragment hits (17 in total) in SDF format.
details Differences with the original study
Of the initial 55 fragment hits, 17 were chosen for further study. From these, 41,587 compounds were generated using the Fragalysis fragment network for further study. The 100 compounds used in the tutorial are taken from this list.
Starting data is available from this Galaxy history: https://usegalaxy.eu/u/sbray/h/mpro-raw-data.
This history contains 103 files. One of these (
Initial candidates for docking
) contains the 41k candidate compounds in SMILES format. The remaining 102 files (all with names beginning withMpro-x...
) provide structural information on the fragment hits - 6 files per hit (hence 17 x 6 = 102).The identity of the files is as follows:
- the
*_0.mol
files contain the fragment structure in mol format.- the
*_0.pdb
files contain the fragment structure in pdb format.- the
*_0_apo.pdb
files contain the protein with solvent, but without ligand- the
*_0_apo-desolv.pdb
files contain the protein without either solvent or ligand- the
*_0_apo-solv.pdb
files contain only solvent- the
*_0_bound.pdb
file contain everything (protein, ligand and solvent)The PDB file of the receptor that we are using is
Mpro-x0195_0_apo-desolv.pdb
. In other words, the structure is derived from just one fragment hit. In the original study, however, all compounds were docked against all of the fragment hit structures.
hands_on Hands-on: Data upload
- Create a new history for this tutorial
Import the files from Zenodo:
https://zenodo.org/record/3730474/files/candidates.smi https://zenodo.org/record/3730474/files/Mpro-x0195_0_apo-desolv.pdb https://zenodo.org/record/3730474/files/hits.sdf
Tip: Importing via links
- Copy the link location
Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)
- Select Paste/Fetch Data
Paste the link into the text field
Press Start
- Close the window
- Rename the datasets
Candidates SMILES
,Receptor PDB
andHits SDF
respectively.Check that the datatypes (
smi
,pdb
, andsdf
respectively) are correct. In particularly, check theCandidates SMILES
file, as the SMILES datatype is not detected automatically by Galaxy.Tip: Changing the datatype
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, click on the galaxy-chart-select-data Datatypes tab on the top
- Select
datatypes
- tip: you can start typing the datatype into the field to filter the dropdown menu
- Click the Save button
Preparation for docking
Before docking, the candidate ligands need to be prepared for docking with the following steps: 1) charge enumeration, 2) generation of three-dimensional structures, and 3) splitting the SD-file into a collection.
details Differences with the original study
This stage is carried out as described here, except of course with the full set of 42,000 compounds. See here for more details.
Charge enumeration
Many of the compounds may contain functional groups which can exist in multiple charge states, and this will affect the quality of binding to the receptor dramatically. Therefore, we perform ‘charge enumeration’, which means that we generate all charge forms of the compounds within a certain pH range.
hands_on Hands-on: Charge enumeration
- Enumerate charges Tool: toolshed.g2.bx.psu.edu/repos/bgruening/enumerate_charges/enumerate_charges/2020.03.4+galaxy0 with the following parameters:
- “Input molecule data”:
Candidate SMILES
- “Minimum pH”:
4.4
- “Maximum pH”:
10.4
Rename the output file
Enumerated candidates SMILES
.comment Comment
The Enumerate charges tool tool is based on the Dimorphite-DL program. (Ropp et al. 2019)
The output is another SMILES file, with several hundred entries.
Generate three-dimensional conformations
So far our list of enumerated candidate compounds is still in SMILES format; we need to produce three-dimensional structures in SDF format for docking. This can be done with the Compound conversion tool tool.
If you are not familiar with SMILES and SDF formats, consult the introductory protein-ligand docking tutorial for more details.
hands_on Hands-on: Convert to SDF format
- Compound conversion Tool: toolshed.g2.bx.psu.edu/repos/bgruening/openbabel_compound_convert/openbabel_compound_convert/3.1.1+galaxy0 with the following parameters:
- “Molecular input file”:
Enumerated candidates
dataset.- “Output format”:
MDL MOL format (sdf, mol)
- “Generate 3D coordinates”:
Yes
Rename the output file
Enumerated candidates SDF
.comment Comment
The Compound conversion tool tool is based on the OpenBabel toolkit. (O’Boyle et al. 2011)
Splitting the SD-file into a collection
The next stage is to split the SD-file with the candidate ligands into a set of smaller SD-files.
question Questions
Why is splitting the file necessary?
solution Solution
The rxDock tool performs one docking at a time (more technically: the task is not parallelized, as it uses only a single CPU). Therefore, splitting the large SD-file into many small files allows the work to be carried out by multiple Galaxy jobs in parallel, so it completes faster.
In the original study, this kind of parallelization was absolutely essential because of the enormous dataset; at some points, there were 5,000 docking jobs running concurrently on the European Galaxy server. Even on the much smaller scale of this tutorial, we can speed things up considerably using this trick.
hands_on Hands-on: Split the SD-files
- Split file to dataset collection Tool: toolshed.g2.bx.psu.edu/repos/bgruening/split_file_to_collection/split_file_to_collection/0.5.0 with the following parameters:
- “Select the file type to split”:
SD-files
- “SD-file to split”:
Enumerated candidates SDF
- “Specify number of output files or number of records per file?”:
Number of output files
- “Number of new files”:
10
- “Method to allocate records to new files”:
Alternate output files
Active site preparation
The active site also needs to be prepared for docking, using the following steps: 1) conversion to MOL2 format, and 2) generation of the active site using the rbcavity tool tool.
details Differences with the original study
This stage was carried out as described here. However, it was repeated for each of the fragment hit structures, not just the
Mpro-x0195_0_apo-desolv.pdb
file used here. See here for more details.
Convert protein structure to MOL2 format
The receptor file we are using is in PDB format, but the rxDock tool we use for docking requires an input in MOL2 format. Therefore, we first convert the file.
hands_on Hands-on: Conversion to MOL2 format
- Compound conversion Tool: toolshed.g2.bx.psu.edu/repos/bgruening/openbabel_compound_convert/openbabel_compound_convert/3.1.1+galaxy0 with the following parameters:
- “Molecular input file”:
Receptor PDB
dataset.- “Output format”:
Sybyl Mol2 format (mol2)
- Rename the output file
Receptor MOL2
.
Generate Frankenstein ligand
For docking with rxDock, a file needs to be created defining the active site. This requires two input files - one for the protein and one for the ligand. We want an active site generation that takes into account the features of all 17 fragments, and therefore need to generate a ‘Frankenstein ligand’ which possesses the properties of all the fragments. A very simple Galaxy tool is available for this.
question Questions
What is a ‘Frankenstein ligand’ and why do we need it?
solution Solution
The Frankenstein ligand combines the atoms of all the fragments and therefore occupies the entire space of the binding site. Therefore, it is the best choice for cavity definition. See the information provided by InformaticsMatters for more details.
hands_on Hands-on: Generate Frankenstein ligand
- Create Frankenstein ligand Tool: toolshed.g2.bx.psu.edu/repos/bgruening/ctb_frankenstein_ligand/ctb_frankenstein_ligand/2013.1-0+galaxy0 with the following parameters:
- “Input file”:
Hits SDF
- Rename the file to
Frankstein SDF
.
Generate active site definition
The active site can now be generated using the rbcavity tool tool, which requires the receptor in MOL2 format as input as well as a single reference ligand in Mol/SDF format. We use the Frankenstein ligand as the reference.
hands_on Hands-on: Active site preparation
- rxDock cavity definition Tool: toolshed.g2.bx.psu.edu/repos/bgruening/rxdock_rbcavity/rxdock_rbcavity/0.1.5 with the following parameters:
- “Receptor”:
Receptor MOL2
- “Reference ligand”:
Frankenstein SDF
- “Mapper sphere radius”:
3.0
- “Mapper small sphere radius”:
1.0
- “Mapper minimum volume”:
100
- “Mapper volume increment”:
0
- “Mapper grid step”:
0.5
- “Cavity weight”:
1.0
- Rename the output file
Active site
.comment Comment
The meanings of these parameters are too complex to go into in this tutorial. If you are interested, see the rDock documentation for more details.
Docking and scoring
Docking and scoring are now performed, using the following steps: 1) docking using rxDock, 2) recombining the results into a single SDF file, 3) TransFS scoring, and 4) SuCOS scoring.
details Differences with the original study
This section in the original study differed from this tutorial in the following ways:
- Docking was performed on over 100,000 enumerated candidates, rather than the 300 used here.
- 25 different poses were generated per candidate, rather than 5, as in this tutorial.
- Because of the large number of poses to score (more than a million), the scoring steps were parallelized by splitting into collections. This is skipped in the tutorial.
- The entire process was repeated 17 times, using a different fragment hit as the receptor structure each time. See here and here for more details. A full list of Galaxy histories generated is listed here.
Docking with rxDock
hands_on Hands-on: Docking
- rxDock docking Tool: toolshed.g2.bx.psu.edu/repos/bgruening/rxdock_rbdock/rxdock_rbdock/0.1.5 with the following parameters:
- “Receptor”:
Receptor MOL2
- “Active site”:
Active site
- “Ligands”:
Split file
collection- “Number of dockings”:
5
- “Number of best poses”:
5
comment Comment
For more information about docking, check out the introductory tutorial. It uses a different tool, AutoDock Vina, rather than rxDock, but the general principles are the same.
Collapse collection to a single file
Having created a collection to parallelize the docking procedure, we can now recombine the results to a single file.
hands_on Hands-on: Collapse collection
- Collapse Collection Tool: toolshed.g2.bx.psu.edu/repos/nml/collapse_collections/collapse_dataset/4.2 with the following parameters:
- “Collection of files to collapse into single dataset”: Output of docking step
- Rename to
Docked poses SDF
.
The output file should contain around 1,900 docked poses in SDF format.
Pose scoring with TransFS
In this step, we carry out scoring of the poses using TransFS. This is a deep learning approach developed at the University of Oxford, employing augmentation of training data with incorrectly docked ligands to prompt the model to learn from protein-ligand interactions. (Scantlebury 2019)
The TransFS scoring returns a value (saved as <TransFSScore>
in the SDF file) between 0 (poor) and 1 (good).
hands_on Hands-on: TransFS scoring
- XChem TransFS pose scoring Tool: toolshed.g2.bx.psu.edu/repos/bgruening/xchem_transfs_scoring/xchem_transfs_scoring/0.4.0 with the following parameters:
- “Receptor”:
Receptor PDB
- “Ligands”:
Docked poses SDF
- “Distance to waters”:
2
Pose scoring with SuCOS
This step involves scoring of the poses from each molecule against the original fragment screening hit ligands using the SuCOS MAX shape and feature overlay algorithm. (Leong 2019) The conformation and position of the poses are compared to known structures (i.e. the fragment hits) to determine a score.
SuCOS scoring returns a value (saved as <Max_SuCOS_Score>
in the SDF file) between 0 (poor) and 1 (good).
hands_on Hands-on: SuCOS scoring
- Max SuCOS score Tool: toolshed.g2.bx.psu.edu/repos/bgruening/sucos_max_score/sucos_max_score/2020.03.4+galaxy0 with the following parameters:
- “Ligands to be scored”: Output of the TransFS step
- “Set of clusters to score against”:
Hits SDF
- Rename the output file to
Scored poses
.
Compound selection
The aim of the original study was to select 500 candidate molecules for synthesis and experimental study. In order to do this, the data for all fragment hits had to be combined (i.e. so that each compound was assigned the lowest score from all the fragments). The resulting table was then compared with a list of compounds available from Enamine and Chemspace and the 500 highest scoring matching compounds were selected for purchase.
This step is skipped in the tutorial, as only 100 compounds were tested, using only a single fragment hit structure. If you want, though, check out the history and workflow used.
Conclusion
This tutorial guided you through docking and scoring of a small set of compounds to the MPro protein. Hopefully, you have a better understanding of how docking can be practically used, as well as how the original study was performed.
Key points
Galaxy can support large, rapid studies in computational chemistry
Protein-ligand docking contributes to the discovery of new drugs
Frequently Asked Questions
Have questions about this tutorial? Check out the tutorial FAQ page or the FAQ page for the Computational chemistry topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help ForumUseful literature
Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.
References
- O’Boyle, N. M., M. Banck, C. A. James, C. Morley, T. Vandermeersch et al., 2011 Open Babel: An open chemical toolbox. Journal of Cheminformatics 3: 10.1186/1758-2946-3-33
- Ruiz-Carmona, S., D. Alvarez-Garcia, N. Foloppe, A. B. Garmendia-Doval, S. Juhos et al., 2014 rDock: A Fast, Versatile and Open Source Program for Docking Ligands to Proteins and Nucleic Acids. PLOS Computational Biology 10: 1–7. 10.1371/journal.pcbi.1003571
- Rose, A. S., A. R. Bradley, Y. Valasatava, J. M. Duarte, A. Prlić et al., 2018 NGL viewer: web-based molecular graphics for large complexes. Bioinformatics 34: 3755–3758. 10.1093/bioinformatics/bty419
- Leong, S. et al., 2019 SuCOS is Better than RMSD for Evaluating Fragment Elaboration and Docking Poses. 10.26434/chemrxiv.8100203.v1 https://chemrxiv.org/articles/SuCOS_is_Better_than_RMSD_for_Evaluating_Fragment_Elaboration_and_Docking_Poses/8100203/1
- Ropp, P. J., J. C. Kaminsky, S. Yablonski, and J. D. Durrant, 2019 Dimorphite-DL: an open-source program for enumerating the ionization states of drug-like small molecules. Journal of Cheminformatics 11: 10.1186/s13321-019-0336-9
- Scantlebury, J. et al., 2019 Dataset Augmentation Allows Deep Learning-Based Virtual Screening To Better Generalise To Unseen Target Classes, And Highlight Important Binding Interactions. https://www.biorxiv.org/content/10.1101/2020.03.06.979625v1
Feedback
Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.
Citing this Tutorial
- Simon Bray, 2021 Virtual screening of the SARS-CoV-2 main protease with rxDock and pose scoring (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/computational-chemistry/tutorials/covid19-docking/tutorial.html Online; accessed TODAY
- Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012
details BibTeX
@misc{computational-chemistry-covid19-docking, author = "Simon Bray", title = "Virtual screening of the SARS-CoV-2 main protease with rxDock and pose scoring (Galaxy Training Materials)", year = "2021", month = "09", day = "02" url = "\url{https://training.galaxyproject.org/training-material/topics/computational-chemistry/tutorials/covid19-docking/tutorial.html}", note = "[Online; accessed TODAY]" } @article{Batut_2018, doi = {10.1016/j.cels.2018.05.012}, url = {https://doi.org/10.1016%2Fj.cels.2018.05.012}, year = 2018, month = {jun}, publisher = {Elsevier {BV}}, volume = {6}, number = {6}, pages = {752--758.e1}, author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning}, title = {Community-Driven Data Analysis Training for Biology}, journal = {Cell Systems} }