4. How-to#
This notebook serves as a practical guide to common questions users might have. Some functionalities of moldrug may not be available on PyPi or Conda yet. This could be because the new version is not yet release, but in brief it is going to be. You could install directly from the repo if some problem pops up. Just paste the following in a code cell:
! pip install git+https://github.com/ale94mleon/moldrug.git@main
Table of content
# Import section
from moldrug.data import get_data
import os, requests,inspect, gzip, shutil, yaml, sys
from moldrug import utils
from typing import List
import random
# Remember to change to your vina executable
# If you define as a path, it must be absolute
vina_executable = '/Users/klimt/GIT/docking/bin/vina'
# use moldrug's internal data
data_x0161 = get_data('x0161')
# Defining the wd and change directory
wd = 'wd'
os.mkdir(wd)
os.chdir(wd)
4.1. Execute moldrug from the command line#
moldrug it is meant to be for using as a python module. However it has capabilities to work from the command line. Due to the complexity of the input for a normal simulation, yaml structured file are used for the input specification. Let first see the help of moldrug.
! moldrug -h
usage: moldrug [-h] [-f [FITNESS]] [-c] [-v] [-V [VERBOSE]] yaml_file
For information of moldrug:
Docs: https://moldrug.readthedocs.io/en/latest/
Source Code: https://github.com/ale94mleon/moldrug
positional arguments:
yaml_file The configuration yaml file
optional arguments:
-h, --help show this help message and exit
-f [FITNESS], --fitness [FITNESS]
The path to the user-custom fitness module; inside of which the given custom cost function must be implemented. See the docs for how to do it properly. E.g. my/awesome/fitness_module.py. By default will look in the moldrug.fitness module.
-c, --continue To continue the simulation. The moldrug command must be the same and all the output moldrug files must be located in the working directory. This option is only compatible with moldrug.utils.GA; otherwise, a RuntimeError will be raised.
-v, --version show program's version number and exit
-V [VERBOSE], --verbose [VERBOSE]
As it shows, only one positional arguments is needed yaml_file. Then you could give:
fitness: A customized fitness python module where your cost function must be implemented.outdir: The out directory.
Lets go steep by steep
4.1.1. yaml_file anatomy#
# Getting the data
lig = data_x0161['smiles']
box = data_x0161['box']
# Getting the CReM data base
url = "http://www.qsar4u.com/files/cremdb/replacements02_sc2.db.gz"
r = requests.get(url, allow_redirects=True)
crem_dbgz_path = 'crem.db.gz'
crem_db_path = 'crem.db'
open(crem_dbgz_path, 'wb').write(r.content)
with gzip.open(crem_dbgz_path, 'rb') as f_in:
with open(crem_db_path, 'wb') as f_out:
shutil.copyfileobj(f_in, f_out)
print(lig)
print(box)
print(os.listdir())
COC(=O)C=1C=CC(=CC1)S(=O)(=O)N
{'boxcenter': [12.11, 1.84, 23.56], 'boxsize': [22.5, 22.5, 22.5]}
['crem.db.gz', 'crem.db']
Now we have all that we need. The SMILES, the definition of the box, the CReM data base and the receptor. But first let see what are the arguments that the function utils.Local needs:
inspect.getfullargspec(utils.Local.__init__).args
# The sam can be done for GA: inspect.getfullargspec(utils.Local.__init__).args
['self',
'seed_mol',
'crem_db_path',
'costfunc',
'grow_crem_kwargs',
'costfunc_kwargs',
'AddHs',
'randomseed',
'deffnm']
We will pass all of this parameters using config.yaml file.
Note: Here we will not work neither modify with the desirability parameter of the cost function. We will talk about it in a separate tutorial inside of Advance Topics.
Warning!: Check your correct temporal file and change it accordantly. Of course you could also set a normal directory and safe the results of the simulation.
Here is an example yaml file to work with the class utils.Local.
main:
type: Local
njobs: 1
pick: 1
AddHs: False
seed_mol: COC(=O)C=1C=CC(=CC1)S(=O)(=O)N
costfunc: Cost
costfunc_kwargs:
vina_executable: /Users/klimt/GIT/docking/bin/vina
receptor_pdbqt_path: x0161.pdbqt
boxcenter:
- 12.11
- 1.84
- 23.56
boxsize:
- 22.5
- 22.5
- 22.5
exhaustiveness: 4
ncores: 12
num_modes: 1
crem_db_path: crem.db
grow_crem_kwargs:
radius: 3
min_atoms: 1
max_atoms: 2
ncores: 12
deffnm: local1
The structure of a yaml file is like a python dictionary but more human readable (see this youtube video if you are not familiar).
First we have the directrix:
main:
This is just the name of your main project and could be any word (we will see that moldrug.utils.GA accept also follow projects). Then we have:
type: Local
njobs: 1
pick: 1
AddHs: False
Look how this section is inside of main (one indentation level). type is the name of the class inside moldrug.utils; for now could be GA or Local (case insensitive). In this case we want to use the class Local. njobs, pick and AddHs are the parameters when the class Local (or GA) is call. The rest of the parameters are just the parameters that needs Local for the initialization. All of them are at the same level of the previous ones. Because costfunc_kwargs is a dictionary; we add for its value a new indentation level:
costfunc_kwargs:
vina_executable: vina
receptor_pdbqt_path: x0161.pdbqt
boxcenter:
- 12.11
- 1.84
- 23.56
boxsize:
- 22.5
- 22.5
- 22.5
exhaustiveness: 4
ncores: 12
num_modes: 1
As you could imagine the keyword boxcenter is a list and this is one of the way to represent list in yaml.
deffnm: local1
The previous is a prefix dto add to the genereated files.
The next dictionary is the same that the yaml file and we will save it in the temporal file that we created.
config = {
"main": {
"type": "Local",
"njobs": 1,
"pick": 1,
"seed_mol": lig,
"costfunc": "Cost",
"costfunc_kwargs": {
"vina_executable": vina_executable,
"receptor_pdbqt_path": data_x0161['protein']['pdbqt'],
"boxcenter": [
12.11,
1.84,
23.56
],
"boxsize": [
22.5,
22.5,
22.5
],
"exhaustiveness": 4,
"ncores": 12,
"num_modes": 1
},
"crem_db_path": 'crem.db',
"grow_crem_kwargs": {
"radius": 3,
"min_atoms": 1,
"max_atoms": 2,
"ncores": 12,
},
"deffnm": 'local1'
}
}
# Save the config as a yaml file
with open('config.yml', 'w') as f:
yaml.dump(config, f)
os.listdir()
['crem.db.gz', 'config.yml', 'crem.db']
4.1.2. Run moldrug#
! moldrug config.yml
Started at Thu Jun 13 14:40:53 2024
You are using moldrug: 3.7.2.
CommandLineHelper(yaml_file='config.yml', fitness=None, continuation=False, verbose=False)
Calculating cost function...
100%|█████████████████████████████████████████████| 2/2 [00:05<00:00, 2.59s/it]
Finished at Thu Jun 13 14:41:00 2024.
File local_pop.sdf was createad!
os.listdir()
['local_result.pbz2', 'crem.db.gz', 'local_pop.sdf', 'config.yml', 'crem.db']
As you see the simulation run successfully and the following files were generated:
local_result.pbz2; the compresses version of theLocalclass with all the information of the simulation (binary file).local_pop.sdf; the mol structures. This file is useful to use inside Pymol.
4.1.3. Specify a custom cost function#
The following is a simple cost function based on the vina score and very similar to moldrug.fitness.Cost used by moldrug. The details on the parameters and how to correctly implement it will be discussed letter on the following section.
def MyAwesomeCost(Individual:utils.Individual, wd:str = '.vina_jobs', vina_executable:str = 'vina', receptor_path:str = None, boxcenter:List[float] = None, boxsize:List[float] =None, exhaustiveness:int = 8, ncores:int = 1, num_modes:int = 1):
# Getting Vina score
cmd = f"{vina_executable} --receptor {receptor_path} --ligand {os.path.join(wd, f'{Individual.idx}.pdbqt')} "\
f"--center_x {boxcenter[0]} --center_y {boxcenter[1]} --center_z {boxcenter[2]} "\
f"--size_x {boxsize[0]} --size_y {boxsize[1]} --size_z {boxsize[2]} "\
f"--out {os.path.join(wd, f'{Individual.idx}_out.pdbqt')} --cpu {ncores} --exhaustiveness {exhaustiveness} --num_modes {num_modes}"
# Creating the ligand pdbqt
with open(os.path.join(wd, f'{Individual.idx}.pdbqt'), 'w') as l:
l.write(Individual.pdbqt)
utils.run(cmd)
# Getting the information
best_energy = utils.VINA_OUT(os.path.join(wd, f'{Individual.idx}_out.pdbqt')).BestEnergy()
# Changing the 3D conformation by the conformation of the binding pose
Individual.pdbqt = ''.join(best_energy.chunk)
# Getting the Scoring function of Vina
Individual.vina_score = best_energy.freeEnergy
# Getting the Scoring function of Vina and assign it to the cost attribute of the Individual
Individual.cost = best_energy.freeEnergy
return Individual
Because this cost function accept the same parameters we can use the same configuration yaml file; but the name of the function is different. So we have to modify that parameter. Let’s save the function in a .py file with the corresponded import statements
# Create a new config file
# Save the config as a yaml file
config['main']['costfunc'] = 'MyAwesomeCost'
# this is needed because if out_dir we will change directories
config['main']['crem_db_path'] = os.path.abspath('crem.db')
with open('config_custom_fitness.yml', 'w') as f:
yaml.dump(config, f)
os.listdir()
['local_result.pbz2',
'crem.db.gz',
'local_pop.sdf',
'config.yml',
'crem.db',
'config_custom_fitness.yml']
str_module = """#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from moldrug import utils
from typing import List
import os
def MyAwesomeCost(Individual:utils.Individual, wd:str = '.vina_jobs', vina_executable:str = 'vina', receptor_pdbqt_path:str = None, boxcenter:List[float] = None, boxsize:List[float] =None, exhaustiveness:int = 8, ncores:int = 1, num_modes:int = 1):
# Getting Vina score
cmd = f"{vina_executable} --receptor {receptor_pdbqt_path} --ligand {os.path.join(wd, f'{Individual.idx}.pdbqt')} "\
f"--center_x {boxcenter[0]} --center_y {boxcenter[1]} --center_z {boxcenter[2]} "\
f"--size_x {boxsize[0]} --size_y {boxsize[1]} --size_z {boxsize[2]} "\
f"--out {os.path.join(wd, f'{Individual.idx}_out.pdbqt')} --cpu {ncores} --exhaustiveness {exhaustiveness} --num_modes {num_modes}"
# Creating the ligand pdbqt
with open(os.path.join(wd, f'{Individual.idx}.pdbqt'), 'w') as l:
l.write(Individual.pdbqt)
utils.run(cmd)
# Getting the information
best_energy = utils.VINA_OUT(os.path.join(wd, f'{Individual.idx}_out.pdbqt')).BestEnergy()
# Changing the 3D conformation by the conformation of the binding pose
Individual.pdbqt = ''.join(best_energy.chunk)
# Getting the Scoring function of Vina
Individual.vina_score = best_energy.freeEnergy
# Getting the Scoring function of Vina and assign it to the cost attribute of the Individual
Individual.cost = best_energy.freeEnergy
return Individual
"""
# Saving the new fitness
with open('MyCustomFitness.py', 'w') as f:
f.write(str_module)
os.listdir()
['MyCustomFitness.py',
'local_result.pbz2',
'crem.db.gz',
'local_pop.sdf',
'config.yml',
'crem.db',
'config_custom_fitness.yml']
Now we only need to call moldrug and specify the new fitness module to use. It is recommendable to output in on folder the results; therefore also use the option outdir of the command line.
! moldrug config_custom_fitness.yml --fitness MyCustomFitness.py
Started at Thu Jun 13 14:41:01 2024
You are using moldrug: 3.7.2.
CommandLineHelper(yaml_file='config_custom_fitness.yml', fitness='MyCustomFitness.py', continuation=False, verbose=False)
Calculating cost function...
100%|█████████████████████████████████████████████| 2/2 [00:04<00:00, 2.28s/it]
Finished at Thu Jun 13 14:41:08 2024.
File local_pop.sdf was createad!
os.listdir()
['MyCustomFitness.py',
'local_result.pbz2',
'CustomMoldrugFitness.py',
'__pycache__',
'crem.db.gz',
'local_pop.sdf',
'config.yml',
'crem.db',
'config_custom_fitness.yml']
print(os.listdir())
['MyCustomFitness.py', 'local_result.pbz2', 'CustomMoldrugFitness.py', '__pycache__', 'crem.db.gz', 'local_pop.sdf', 'config.yml', 'crem.db', 'config_custom_fitness.yml']
As you can see the simulation run successfully! But be aware of a possible issue! Because the cost function used is not actually inside of moldrug. If we would like to use on the future the binary file local_result.pbz2. We have first say to Python where to find the used fitness function. For reproducibility and also peace of mind, moldrug will generate a new file: CustomMoldrugFitness.py this is nothing more than a copy of your implemented python fitness module. This module is the one used for the internal calculation of moldrug. Therefore, we have to say to python where is CustomMoldrugFitness and append to sys.path in order to open the binary file. If we do not do that we will get an error. Let see the example.
whole_result = utils.decompress_pickle('local_result.pbz2')
whole_result
<moldrug.utils.Local at 0x30588bc40>
# The cost modify cost function
whole_result.costfunc
<function CustomMoldrugFitness.MyAwesomeCost(Individual: moldrug.utils.Individual, wd: str = '.vina_jobs', vina_executable: str = 'vina', receptor_pdbqt_path: str = None, boxcenter: List[float] = None, boxsize: List[float] = None, exhaustiveness: int = 8, ncores: int = 1, num_modes: int = 1)>
4.2. Create your own cost function#
Create your own cost function should be straightforward as soon as we understand how moldrug works with it. So, let’s take a look again to the function defined previously:
def MyAwesomeCost(Individual:utils.Individual, wd:str = '.vina_jobs', vina_executable:str = 'vina', receptor_pdbqt_path:str = None, boxcenter:List[float] = None, boxsize:List[float] =None, exhaustiveness:int = 8, ncores:int = 1, num_modes:int = 1):
# Getting Vina score
cmd = f"{vina_executable} --receptor {receptor_pdbqt_path} --ligand {os.path.join(wd, f'{Individual.idx}.pdbqt')} "\
f"--center_x {boxcenter[0]} --center_y {boxcenter[1]} --center_z {boxcenter[2]} "\
f"--size_x {boxsize[0]} --size_y {boxsize[1]} --size_z {boxsize[2]} "\
f"--out {os.path.join(wd, f'{Individual.idx}_out.pdbqt')} --cpu {ncores} --exhaustiveness {exhaustiveness} --num_modes {num_modes}"
# Creating the ligand pdbqt
with open(os.path.join(wd, f'{Individual.idx}.pdbqt'), 'w') as l:
l.write(Individual.pdbqt)
utils.run(cmd)
# Getting the information
best_energy = utils.VINA_OUT(os.path.join(wd, f'{Individual.idx}_out.pdbqt')).BestEnergy()
# Changing the 3D conformation by the conformation of the binding pose
Individual.pdbqt = ''.join(best_energy.chunk)
# Getting the Scoring function of Vina
Individual.vina_score = best_energy.freeEnergy
# Getting the Scoring function of Vina and assign it to the cost attribute of the Individual
Individual.cost = best_energy.freeEnergy
return Individual
MyAwesomeCost function it will only:
Perform Docking
Update:
pdbqtattribute ofIndividualcostattribute ofIndividual(the must important in order to moldrug works properly)
Assign
vina_scoreattribute toIndividual
Basically the idea is give a number to the attribute cost of the passed Individual instance. moldrug works based on the class moldrug.utils.Individual; this is just a representation of each chemical structure. Therefore all the custom cost function must work based on this class; in oder words, “the chemical structure input” must be coded as an Individual and it will be the first positional argument to pass. The rest is completely optional. If the function perform I/O operations you can also provided as keyword wd. By default moldrug.utils.GA and moldrug.utils.Local create a temporal directory (in /tmp) even when you specify some values for it. This could be a problem if your function create big files (bigger than the capacity of your /tmp directory) or if you would like to preserve some of this files. In this case just create an alternative keyword (like wd_dont_touch) and work based on it. The other important idea to have in mind is that moldrug.utils.GA perform minimization (the best element is the one with lower cost attribute).
WARNING: moldrug uses multiprocessing for the parallelization. This module uses itself pickle. In our case moldrug parallelize the cost function; therefore the COST FUNCTION MUST BE PICKLEABLE. To see what can be pickled and unpickled go here.
Common error
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7efd82c72b80>'. Reason: 'PicklingError("Can't pickle <class 'Boost.Python.ArgumentError'>: import of module 'Boost.Python' failed")'
The last error could be due to some user defined functions (or classes) that uses your custom cost function that are not a the top level of the module. In other words, you are using user defined functions (or classes) that are outside of your custom_fitness_module.py (in case command line is used) or your main script where the cost function was defined. Solution: put all your code for the cost function in only one .py ((in case command line is used)) file or in your main script.
The following is probably the easiest and dummy cost function possible. But I bring it here to show that there is not limitation on how to create your own cost function.
def MyPerfectDummyCost(Individual):
Individual.cost = random.random()
return Individual
The next code is a duplication for the implementation of MyPerfectDummyCost fitness function.
Is written in this way due to some possible problems that might arise because the execution of multiprocessing inside GA on an interactive Python like IPython-Notebook.
This will create an importable function in your working directory. In case you execute this code in a python script rather that in a Notebook, consider to put the execution of the class after the section if __name__ == '__main__:'
python_script = """
import random
from rdkit import Chem
from moldrug import utils
def MyPerfectDummyCost(Individual):
Individual.cost = random.random()
return Individual
if __name__ == '__main__':
ga = utils.GA(
Chem.MolFromSmiles('CCCC(=O)OCCN'),
crem_db_path='crem.db',
maxiter=2,
popsize=20,
costfunc=MyPerfectDummyCost,
mutate_crem_kwargs={
'ncores': 12
},
costfunc_kwargs={})
ga(20)
"""
with open('script.py', 'w') as f:
f.write(python_script)
! python script.py
Creating the first population with 20 members:
100%|███████████████████████████████████████████| 20/20 [00:02<00:00, 7.19it/s]
Initial Population: Best Individual: Individual(idx = 17, smiles = NN(CC(=O)O)CC(=O)O, cost = 0.041591764561210165)
Accepted rate: 20 / 20
Evaluating generation 1 / 2:
100%|███████████████████████████████████████████| 20/20 [00:02<00:00, 7.87it/s]
Generation 1: Best Individual: Individual(idx = 17, smiles = NN(CC(=O)O)CC(=O)O, cost = 0.041591764561210165).
Accepted rate: 7 / 20
Evaluating generation 2 / 2:
100%|███████████████████████████████████████████| 20/20 [00:02<00:00, 7.83it/s]
Generation 2: Best Individual: Individual(idx = 46, smiles = COC(=O)C=C(C)C, cost = 0.02724931885829862).
Accepted rate: 10 / 20
=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
The simulation finished successfully after 2 generations witha population of 20 individuals. A total number of 60 Individuals were seen during the simulation.
Initial Individual: Individual(idx = 0, smiles = CCCC(=O)OCCN, cost = 0.05650119309032309)
Final Individual: Individual(idx = 46, smiles = COC(=O)C=C(C)C, cost = 0.02724931885829862)
The cost function dropped in 0.029251874232024466 units.
=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
Total time (2 generations): 109.90 (s).
Finished at Thu Jun 13 14:42:59 2024.
moldrug.utils.GA exports an sdf file that contains the individual of the current generations every save_pop_every_gen. For that, it uses the information contained in the pdbqt or pdbqts attributes (depending if single or multi-receptor is used respectively). Therefore if we use docking we must update the pdbqt attribute with the pose obtained by vina. If we don’t do that, we will just get a conformation generated by RDKit that is automatically created when a instance of utils.Individual is made it.