3. Quickstart#
First let import the modules to work with. Some functionalities of moldrug may not be available on PyPi or Conda yet. This could be because the new version is not yet release, but in brief it is going to be. You could install directly from the repo if some problem pops up. Just paste the following in a code cell:
! pip install git+https://github.com/ale94mleon/moldrug.git@main
from rdkit import Chem
from moldrug import utils, fitness
from moldrug.data import get_data
import os, gzip, shutil, requests
from multiprocessing import cpu_count
# Remember to change to your vina executable
# If you define as a path, it must be absolute
vina_executable = '/Users/klimt/GIT/docking/bin/vina'
# use moldrug's internal data
data_x0161 = get_data('x0161')
# Defining the wd and change directory
wd = 'wd'
os.mkdir(wd)
os.chdir(wd)
We will use some test examples from moldrug.data module.
Now let’s download the crem data base. Here we are downloading the smaller one. But consider to visit CReM for more information about how to use and create the data base of fragments.
url = "http://www.qsar4u.com/files/cremdb/replacements02_sc2.db.gz"
r = requests.get(url, allow_redirects=True)
crem_dbgz_path = 'crem.db.gz'
crem_db_path = 'crem.db'
open(crem_dbgz_path, 'wb').write(r.content)
with gzip.open(crem_dbgz_path, 'rb') as f_in:
with open(crem_db_path, 'wb') as f_out:
shutil.copyfileobj(f_in, f_out)
Now let’s initialize the GA class.
maxiter = 3
popsize = 20
njobs = 2
out = utils.GA(
seed_mol=Chem.MolFromSmiles(data_x0161['smiles']),
maxiter=maxiter,
popsize=popsize,
crem_db_path = crem_db_path,
pc = 1,
get_similar = False,
mutate_crem_kwargs = {
'radius':3,
'min_size':1,
'max_size':8,
'min_inc':-5,
'max_inc':3,
'ncores':cpu_count(),
},
costfunc = fitness.Cost,
costfunc_kwargs = {
'vina_executable': vina_executable,
'receptor_pdbqt_path': data_x0161['protein']['pdbqt'],
'boxcenter' : data_x0161['box']['boxcenter'] ,
'boxsize': data_x0161['box']['boxsize'],
'exhaustiveness': 4,
'ncores': int(cpu_count() / njobs),
'num_modes': 1,
'vina_seed':1234,
},
save_pop_every_gen = 20,
deffnm = 'pop_test_single_receptor',
AddHs=True,
randomseed = 1234,
)
Then let’s call the class
out(njobs = njobs)
Creating the first population with 20 members:
100%|██████████| 20/20 [00:25<00:00, 1.25s/it]
Initial Population: Best Individual: Individual(idx = 6, smiles = CC(C(=O)NO)c1ccc(S(N)(=O)=O)cc1, cost = 0.6845382847545236)
Accepted rate: 20 / 20
File pop_test_single_receptor_pop.sdf was createad!
Evaluating generation 1 / 3:
100%|██████████| 20/20 [00:28<00:00, 1.44s/it]
Generation 1: Best Individual: Individual(idx = 34, smiles = NS(=O)(=O)c1ccc(C(O)C(=O)c2ccncc2)cc1, cost = 0.5736490508359202).
Accepted rate: 8 / 20
Evaluating generation 2 / 3:
100%|██████████| 20/20 [00:30<00:00, 1.53s/it]
Generation 2: Best Individual: Individual(idx = 46, smiles = NS(=O)(=O)c1ccc(CNCc2cccc(Cl)c2)cc1, cost = 0.49019358878889185).
Accepted rate: 9 / 20
Evaluating generation 3 / 3:
100%|██████████| 20/20 [00:35<00:00, 1.76s/it]
File pop_test_single_receptor_pop.sdf was createad!
Generation 3: Best Individual: Individual(idx = 46, smiles = NS(=O)(=O)c1ccc(CNCc2cccc(Cl)c2)cc1, cost = 0.49019358878889185).
Accepted rate: 8 / 20
=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
The simulation finished successfully after 3 generations witha population of 20 individuals. A total number of 80 Individuals were seen during the simulation.
Initial Individual: Individual(idx = 0, smiles = COC(=O)c1ccc(S(N)(=O)=O)cc1, cost = 1.0)
Final Individual: Individual(idx = 46, smiles = NS(=O)(=O)c1ccc(CNCc2cccc(Cl)c2)cc1, cost = 0.49019358878889185)
The cost function dropped in 0.5098064112111081 units.
=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
Total time (3 generations): 211.39 (s).
Finished at Thu Jun 13 15:30:10 2024.
This is a very silly example. In real life, you should increase the population size and the number of generation. In this way we explore in a better way the chemical space. The GA class has some nice methods that give you an idea of the obtained results.
out.to_dataframe().tail()
| pdbqt | cost | idx | genID | kept_gens | qed | sa_score | vina_score | |
|---|---|---|---|---|---|---|---|---|
| 75 | MODEL 1\nREMARK VINA RESULT: -6.795 0.... | 0.490194 | 46 | 2 | {2, 3} | 0.889623 | 1.728995 | -6.795 |
| 76 | MODEL 1\nREMARK VINA RESULT: -5.991 0.... | 1.000000 | 28 | 1 | {} | 0.734842 | 2.415447 | -5.991 |
| 77 | MODEL 1\nREMARK VINA RESULT: -5.991 0.... | 1.000000 | 61 | 3 | {} | 0.618853 | 1.995441 | -5.991 |
| 78 | MODEL 1\nREMARK VINA RESULT: -5.411 0.... | 1.000000 | 62 | 3 | {} | 0.750082 | 2.980767 | -5.411 |
| 79 | MODEL 1\nREMARK VINA RESULT: -5.807 0.... | 1.000000 | 42 | 2 | {} | 0.858139 | 2.288519 | -5.807 |
You could save the entire class for future in a compressed or uncompressed pickle file. Here I just saved the file in the temporal directory, you could change the path to your current working directory.
out.pickle(f"NumGens_{out.NumGens}_PopSize_{popsize}", compress=True)
Could be that after a long simulation we would like to perform a different searching strategy over the last population. This is really simple, we could “initialize” again the out instance with a different set of parameters. Let say that we would like to be close to the last solutions. Then we could use the parameters: min_size=0, max_size=1, min_inc=-1, max_inc=1.This flavor will add, delate mutate heavy atoms or change hydrogens to heavy atoms.
out.maxiter = 5
out.mutate_crem_kwargs = {
'radius':3,
'min_size':0,
'max_size':1,
'min_inc':-1,
'max_inc':1,
'ncores':cpu_count(),
}
And run again the class
out(njobs = njobs)
File pop_test_single_receptor_pop.sdf was createad!
Evaluating generation 4 / 8:
100%|██████████| 18/18 [00:29<00:00, 1.64s/it]
Generation 4: Best Individual: Individual(idx = 88, smiles = CC(C)(C)C(=O)C=Cc1ccc(-c2nn[nH]n2)cc1, cost = 0.4605309287890408).
Accepted rate: 13 / 18
Evaluating generation 5 / 8:
100%|██████████| 19/19 [00:32<00:00, 1.70s/it]
Generation 5: Best Individual: Individual(idx = 103, smiles = NS(=O)(=O)c1ccc(CC=CCc2cccc(Cl)c2)cc1, cost = 0.4216017240342247).
Accepted rate: 11 / 19
Evaluating generation 6 / 8:
100%|██████████| 17/17 [00:29<00:00, 1.75s/it]
Generation 6: Best Individual: Individual(idx = 103, smiles = NS(=O)(=O)c1ccc(CC=CCc2cccc(Cl)c2)cc1, cost = 0.4216017240342247).
Accepted rate: 10 / 17
Evaluating generation 7 / 8:
100%|██████████| 18/18 [00:35<00:00, 1.96s/it]
Generation 7: Best Individual: Individual(idx = 103, smiles = NS(=O)(=O)c1ccc(CC=CCc2cccc(Cl)c2)cc1, cost = 0.4216017240342247).
Accepted rate: 4 / 18
Evaluating generation 8 / 8:
100%|██████████| 17/17 [00:31<00:00, 1.88s/it]
File pop_test_single_receptor_pop.sdf was createad!
Generation 8: Best Individual: Individual(idx = 156, smiles = CCC(C)(C)C(=O)C=Cc1ccc(-c2nn[nH]n2)cc1Cl, cost = 0.4194509097394795).
Accepted rate: 6 / 17
=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
The simulation finished successfully after 8 generations witha population of 20 individuals. A total number of 169 Individuals were seen during the simulation.
Initial Individual: Individual(idx = 0, smiles = COC(=O)c1ccc(S(N)(=O)=O)cc1, cost = 1.0)
Final Individual: Individual(idx = 156, smiles = CCC(C)(C)C(=O)C=Cc1ccc(-c2nn[nH]n2)cc1Cl, cost = 0.4194509097394795)
The cost function dropped in 0.5805490902605205 units.
=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
Total time (5 generations): 213.37 (s).
Finished at Thu Jun 13 15:33:43 2024.
As you can see the GA class is updated rather than be replaced. This is a perfect feature for rerun with different ideas. As you see the number of generations is also updated as well the number of calls.
print(out.NumGens, out.NumCalls)
8 2
The same idea is for the rest of the methods and atributtes
out.to_dataframe().tail()
| pdbqt | cost | idx | genID | kept_gens | qed | sa_score | vina_score | |
|---|---|---|---|---|---|---|---|---|
| 164 | MODEL 1\nREMARK VINA RESULT: -6.761 0.... | 0.497567 | 151 | 7 | {7} | 0.763430 | 2.113062 | -6.761 |
| 165 | MODEL 1\nREMARK VINA RESULT: -5.991 0.... | 1.000000 | 61 | 3 | {} | 0.618853 | 1.995441 | -5.991 |
| 166 | MODEL 1\nREMARK VINA RESULT: -5.411 0.... | 1.000000 | 62 | 3 | {} | 0.750082 | 2.980767 | -5.411 |
| 167 | MODEL 1\nREMARK VINA RESULT: -5.807 0.... | 1.000000 | 42 | 2 | {} | 0.858139 | 2.288519 | -5.807 |
| 168 | MODEL 1\nREMARK VINA RESULT: -6.481 0.... | 0.610629 | 100 | 5 | {} | 0.578641 | 1.971277 | -6.481 |
This is the best solution of this simulation (in your case you could get a different one):
Chem.MolFromSmiles(out.pop[0].smiles)
And here is the input
from rdkit import Chem
Chem.MolFromSmiles("COC(=O)C=1C=CC(=CC1)S(=O)(=O)N")
Printing some information
import matplotlib.pyplot as plt
from moldrug import utils
import pandas as pd
sascorer = utils.import_sascorer()
fig, ax = plt.subplots(nrows = 6, figsize = (16,20))
ax[0].plot(range(len(out.SawIndividuals)),[Individual.qed for Individual in out.SawIndividuals], 'o')
ax[0].set(title = 'qed')
ax[1].plot(range(len(out.SawIndividuals)),[Individual.sa_score for Individual in out.SawIndividuals], 'o')
ax[1].set(title = 'sa_score')
ax[2].plot(range(len(out.SawIndividuals)),[Individual.vina_score for Individual in out.SawIndividuals], 'o')
ax[2].set(title = 'vina_score')
ax[3].plot(range(len(out.SawIndividuals)),[Individual.cost for Individual in out.SawIndividuals], 'o')
ax[3].set(title = 'cost')
ax[4].plot(range(len(out.avg_cost)),out.avg_cost, 'o')
ax[4].set(title = 'avg_cost')
ax[5].plot(range(len(out.best_cost)),out.best_cost, 'o')
ax[5].set(title = 'best_cost')
print(f"Initial vina score: {out.InitIndividual.vina_score}. Final vina score: {out.pop[0].vina_score}")
print(f"sascorer of the best Individual: {sascorer.calculateScore(out.pop[0].mol)}")
print(f"QED of the best Individual: {fitness.QED.weights_mean(out.pop[0].mol)}")
pd.DataFrame(utils.lipinski_profile(out.pop[0].mol), index=['value']).T
Initial vina score: -5.209. Final vina score: -7.174
sascorer of the best Individual: 7.4282033353078765
QED of the best Individual: 0.8576139251605003
| value | |
|---|---|
| NumHAcceptors | 4.000000 |
| NumHDonors | 1.000000 |
| wt | 304.781000 |
| MLogP | 3.538600 |
| NumRotatableBonds | 8.000000 |
| TPSA | 71.530000 |
| FractionCSP3 | 0.333333 |
| HeavyAtomCount | 21.000000 |
| NHOHCount | 1.000000 |
| NOCount | 5.000000 |
| NumAliphaticCarbocycles | 0.000000 |
| NumAliphaticHeterocycles | 0.000000 |
| NumAliphaticRings | 0.000000 |
| NumAromaticCarbocycles | 1.000000 |
| NumAromaticHeterocycles | 1.000000 |
| NumAromaticRings | 2.000000 |
| NumHeteroatoms | 6.000000 |
| NumSaturatedCarbocycles | 0.000000 |
| NumSaturatedHeterocycles | 0.000000 |
| NumSaturatedRings | 0.000000 |
| RingCount | 2.000000 |
import seaborn as sns
import matplotlib.pyplot as plt
from typing import List
from moldrug.utils import Individual
import numpy as np
# This function is possible from version 2.2.0
def plot_dist(individuals:List[Individual], properties:List[str], every_gen:int = 1):
# Set up the matplotlib figure
sns.set_theme(style="whitegrid")
fig, axes = plt.subplots(nrows = len(properties), figsize=(25, 25))
SawIndividuals = utils.to_dataframe(individuals).drop(['pdbqt'], axis = 1).replace([np.inf, -np.inf], np.nan).dropna()
SawIndividuals = SawIndividuals[SawIndividuals['kept_gens'].map(len) != 0].reset_index(drop=True)
gen_idxs = sorted(set(item for sublist in SawIndividuals['kept_gens'] for item in sublist))
NumGens = max(gen_idxs)
# Set pop to the initial population and pops out the first gen
pop = SawIndividuals[SawIndividuals.genID == gen_idxs.pop(0)].sort_values(by=["cost"])
pops = pop.copy()
for gen_idx in gen_idxs:
idx = [i for i in range(SawIndividuals.shape[0]) if gen_idx in SawIndividuals.loc[i,'kept_gens']]
pop = SawIndividuals.copy().iloc[idx,:].assign(genID=gen_idx)
pops = pd.concat([pops, pop.copy()])
# Draw a violinplot with a narrow bandwidth than the default
pops = pops.loc[pops['genID'].isin([gen for gen in range(0, NumGens+every_gen, every_gen)])]
for i, prop in enumerate(properties):
sns.violinplot(x = 'genID', hue='genID', y = prop, data=pops, palette="Set3", bw_adjust=.2, cut=0, linewidth=1, ax=axes[i], legend = False)
return fig, axes
plot_dist(out.SawIndividuals, properties=['qed', 'sa_score', 'vina_score', 'cost'])
(<Figure size 2500x2500 with 4 Axes>,
array([<Axes: xlabel='genID', ylabel='qed'>,
<Axes: xlabel='genID', ylabel='sa_score'>,
<Axes: xlabel='genID', ylabel='vina_score'>,
<Axes: xlabel='genID', ylabel='cost'>], dtype=object))