3. Quickstart#

First let import the modules to work with. Some functionalities of moldrug may not be available on PyPi or Conda yet. This could be because the new version is not yet release, but in brief it is going to be. You could install directly from the repo if some problem pops up. Just paste the following in a code cell:

! pip install git+https://github.com/ale94mleon/moldrug.git@main

from rdkit import Chem
from moldrug import utils, fitness
from moldrug.data import get_data
import os, gzip, shutil, requests
from multiprocessing import cpu_count

# Remember to change to your vina executable
# If you define as a path, it must be absolute
vina_executable = '/Users/klimt/GIT/docking/bin/vina'

# use moldrug's internal data
data_x0161 = get_data('x0161')

# Defining the wd and change directory
wd = 'wd'
os.mkdir(wd)
os.chdir(wd)

We will use some test examples from moldrug.data module.

Now let’s download the crem data base. Here we are downloading the smaller one. But consider to visit CReM for more information about how to use and create the data base of fragments.

url = "http://www.qsar4u.com/files/cremdb/replacements02_sc2.db.gz"
r = requests.get(url, allow_redirects=True)
crem_dbgz_path = 'crem.db.gz'
crem_db_path = 'crem.db'
open(crem_dbgz_path, 'wb').write(r.content)
with gzip.open(crem_dbgz_path, 'rb') as f_in:
    with open(crem_db_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

Now let’s initialize the GA class.

maxiter = 3
popsize = 20
njobs = 2

out = utils.GA(
    seed_mol=Chem.MolFromSmiles(data_x0161['smiles']),
    maxiter=maxiter,
    popsize=popsize,
    crem_db_path = crem_db_path,
    pc = 1,
    get_similar = False,
    mutate_crem_kwargs = {
        'radius':3,
        'min_size':1,
        'max_size':8,
        'min_inc':-5,
        'max_inc':3,
        'ncores':cpu_count(),
    },
    costfunc = fitness.Cost,
    costfunc_kwargs = {
        'vina_executable': vina_executable,
        'receptor_pdbqt_path': data_x0161['protein']['pdbqt'],
        'boxcenter' : data_x0161['box']['boxcenter'] ,
        'boxsize': data_x0161['box']['boxsize'],
        'exhaustiveness': 4,
        'ncores': int(cpu_count() / njobs),
        'num_modes': 1,
        'vina_seed':1234,
    },
    save_pop_every_gen = 20,
    deffnm = 'pop_test_single_receptor',
    AddHs=True,
    randomseed = 1234,
)

Then let’s call the class

out(njobs = njobs)

Creating the first population with 20 members:

100%|██████████| 20/20 [00:25<00:00,  1.25s/it]

Initial Population: Best Individual: Individual(idx = 6, smiles = CC(C(=O)NO)c1ccc(S(N)(=O)=O)cc1, cost = 0.6845382847545236)
Accepted rate: 20 / 20

File pop_test_single_receptor_pop.sdf was createad!

Evaluating generation 1 / 3:

100%|██████████| 20/20 [00:28<00:00,  1.44s/it]

Generation 1: Best Individual: Individual(idx = 34, smiles = NS(=O)(=O)c1ccc(C(O)C(=O)c2ccncc2)cc1, cost = 0.5736490508359202).
Accepted rate: 8 / 20

Evaluating generation 2 / 3:

100%|██████████| 20/20 [00:30<00:00,  1.53s/it]

Generation 2: Best Individual: Individual(idx = 46, smiles = NS(=O)(=O)c1ccc(CNCc2cccc(Cl)c2)cc1, cost = 0.49019358878889185).
Accepted rate: 9 / 20

Evaluating generation 3 / 3:

100%|██████████| 20/20 [00:35<00:00,  1.76s/it]

File pop_test_single_receptor_pop.sdf was createad!
Generation 3: Best Individual: Individual(idx = 46, smiles = NS(=O)(=O)c1ccc(CNCc2cccc(Cl)c2)cc1, cost = 0.49019358878889185).
Accepted rate: 8 / 20


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

The simulation finished successfully after 3 generations witha population of 20 individuals. A total number of 80 Individuals were seen during the simulation.
Initial Individual: Individual(idx = 0, smiles = COC(=O)c1ccc(S(N)(=O)=O)cc1, cost = 1.0)
Final Individual: Individual(idx = 46, smiles = NS(=O)(=O)c1ccc(CNCc2cccc(Cl)c2)cc1, cost = 0.49019358878889185)
The cost function dropped in 0.5098064112111081 units.

=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Total time (3 generations): 211.39 (s).
Finished at Thu Jun 13 15:30:10 2024.

This is a very silly example. In real life, you should increase the population size and the number of generation. In this way we explore in a better way the chemical space. The GA class has some nice methods that give you an idea of the obtained results.

out.to_dataframe().tail()

	pdbqt	cost	idx	genID	kept_gens	qed	sa_score	vina_score
75	MODEL 1\nREMARK VINA RESULT: -6.795 0....	0.490194	46	2	{2, 3}	0.889623	1.728995	-6.795
76	MODEL 1\nREMARK VINA RESULT: -5.991 0....	1.000000	28	1	{}	0.734842	2.415447	-5.991
77	MODEL 1\nREMARK VINA RESULT: -5.991 0....	1.000000	61	3	{}	0.618853	1.995441	-5.991
78	MODEL 1\nREMARK VINA RESULT: -5.411 0....	1.000000	62	3	{}	0.750082	2.980767	-5.411
79	MODEL 1\nREMARK VINA RESULT: -5.807 0....	1.000000	42	2	{}	0.858139	2.288519	-5.807

You could save the entire class for future in a compressed or uncompressed pickle file. Here I just saved the file in the temporal directory, you could change the path to your current working directory.

out.pickle(f"NumGens_{out.NumGens}_PopSize_{popsize}", compress=True)

Could be that after a long simulation we would like to perform a different searching strategy over the last population. This is really simple, we could “initialize” again the out instance with a different set of parameters. Let say that we would like to be close to the last solutions. Then we could use the parameters: min_size=0, max_size=1, min_inc=-1, max_inc=1.This flavor will add, delate mutate heavy atoms or change hydrogens to heavy atoms.

out.maxiter = 5
out.mutate_crem_kwargs = {
        'radius':3,
        'min_size':0,
        'max_size':1,
        'min_inc':-1,
        'max_inc':1,
        'ncores':cpu_count(),
    }

And run again the class

out(njobs = njobs)

File pop_test_single_receptor_pop.sdf was createad!
Evaluating generation 4 / 8:

100%|██████████| 18/18 [00:29<00:00,  1.64s/it]

Generation 4: Best Individual: Individual(idx = 88, smiles = CC(C)(C)C(=O)C=Cc1ccc(-c2nn[nH]n2)cc1, cost = 0.4605309287890408).
Accepted rate: 13 / 18

Evaluating generation 5 / 8:

100%|██████████| 19/19 [00:32<00:00,  1.70s/it]

Generation 5: Best Individual: Individual(idx = 103, smiles = NS(=O)(=O)c1ccc(CC=CCc2cccc(Cl)c2)cc1, cost = 0.4216017240342247).
Accepted rate: 11 / 19

Evaluating generation 6 / 8:

100%|██████████| 17/17 [00:29<00:00,  1.75s/it]

Generation 6: Best Individual: Individual(idx = 103, smiles = NS(=O)(=O)c1ccc(CC=CCc2cccc(Cl)c2)cc1, cost = 0.4216017240342247).
Accepted rate: 10 / 17

Evaluating generation 7 / 8:

100%|██████████| 18/18 [00:35<00:00,  1.96s/it]

Generation 7: Best Individual: Individual(idx = 103, smiles = NS(=O)(=O)c1ccc(CC=CCc2cccc(Cl)c2)cc1, cost = 0.4216017240342247).
Accepted rate: 4 / 18

Evaluating generation 8 / 8:

100%|██████████| 17/17 [00:31<00:00,  1.88s/it]

File pop_test_single_receptor_pop.sdf was createad!
Generation 8: Best Individual: Individual(idx = 156, smiles = CCC(C)(C)C(=O)C=Cc1ccc(-c2nn[nH]n2)cc1Cl, cost = 0.4194509097394795).
Accepted rate: 6 / 17


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

The simulation finished successfully after 8 generations witha population of 20 individuals. A total number of 169 Individuals were seen during the simulation.
Initial Individual: Individual(idx = 0, smiles = COC(=O)c1ccc(S(N)(=O)=O)cc1, cost = 1.0)
Final Individual: Individual(idx = 156, smiles = CCC(C)(C)C(=O)C=Cc1ccc(-c2nn[nH]n2)cc1Cl, cost = 0.4194509097394795)
The cost function dropped in 0.5805490902605205 units.

=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Total time (5 generations): 213.37 (s).
Finished at Thu Jun 13 15:33:43 2024.

As you can see the GA class is updated rather than be replaced. This is a perfect feature for rerun with different ideas. As you see the number of generations is also updated as well the number of calls.

print(out.NumGens, out.NumCalls)

8 2

The same idea is for the rest of the methods and atributtes

out.to_dataframe().tail()

	pdbqt	cost	idx	genID	kept_gens	qed	sa_score	vina_score
164	MODEL 1\nREMARK VINA RESULT: -6.761 0....	0.497567	151	7	{7}	0.763430	2.113062	-6.761
165	MODEL 1\nREMARK VINA RESULT: -5.991 0....	1.000000	61	3	{}	0.618853	1.995441	-5.991
166	MODEL 1\nREMARK VINA RESULT: -5.411 0....	1.000000	62	3	{}	0.750082	2.980767	-5.411
167	MODEL 1\nREMARK VINA RESULT: -5.807 0....	1.000000	42	2	{}	0.858139	2.288519	-5.807
168	MODEL 1\nREMARK VINA RESULT: -6.481 0....	0.610629	100	5	{}	0.578641	1.971277	-6.481

This is the best solution of this simulation (in your case you could get a different one):

Chem.MolFromSmiles(out.pop[0].smiles)

../_images/1a2ac599cdc3b812e2567d2635bfdbdfde8449bb1650885faddb2dcc7c5b093c.png

And here is the input

from rdkit import Chem
Chem.MolFromSmiles("COC(=O)C=1C=CC(=CC1)S(=O)(=O)N")

../_images/b1bc40529a3cfda2ab6afae4c493b38290af287884b7f257a2fd479f4afa0176.png

Printing some information

import matplotlib.pyplot as plt
from moldrug import utils
import pandas as pd
sascorer = utils.import_sascorer()


fig, ax = plt.subplots(nrows = 6, figsize = (16,20))

ax[0].plot(range(len(out.SawIndividuals)),[Individual.qed for Individual in out.SawIndividuals], 'o')
ax[0].set(title = 'qed')

ax[1].plot(range(len(out.SawIndividuals)),[Individual.sa_score for Individual in out.SawIndividuals], 'o')
ax[1].set(title = 'sa_score')

ax[2].plot(range(len(out.SawIndividuals)),[Individual.vina_score for Individual in out.SawIndividuals], 'o')
ax[2].set(title = 'vina_score')

ax[3].plot(range(len(out.SawIndividuals)),[Individual.cost for Individual in out.SawIndividuals], 'o')
ax[3].set(title = 'cost')

ax[4].plot(range(len(out.avg_cost)),out.avg_cost, 'o')
ax[4].set(title = 'avg_cost')

ax[5].plot(range(len(out.best_cost)),out.best_cost, 'o')
ax[5].set(title = 'best_cost')


print(f"Initial vina score: {out.InitIndividual.vina_score}. Final vina score: {out.pop[0].vina_score}")
print(f"sascorer of the best Individual: {sascorer.calculateScore(out.pop[0].mol)}")
print(f"QED of the best Individual: {fitness.QED.weights_mean(out.pop[0].mol)}")

pd.DataFrame(utils.lipinski_profile(out.pop[0].mol), index=['value']).T

Initial vina score: -5.209. Final vina score: -7.174
sascorer of the best Individual: 7.4282033353078765
QED of the best Individual: 0.8576139251605003

	value
NumHAcceptors	4.000000
NumHDonors	1.000000
wt	304.781000
MLogP	3.538600
NumRotatableBonds	8.000000
TPSA	71.530000
FractionCSP3	0.333333
HeavyAtomCount	21.000000
NHOHCount	1.000000
NOCount	5.000000
NumAliphaticCarbocycles	0.000000
NumAliphaticHeterocycles	0.000000
NumAliphaticRings	0.000000
NumAromaticCarbocycles	1.000000
NumAromaticHeterocycles	1.000000
NumAromaticRings	2.000000
NumHeteroatoms	6.000000
NumSaturatedCarbocycles	0.000000
NumSaturatedHeterocycles	0.000000
NumSaturatedRings	0.000000
RingCount	2.000000

../_images/d229cdc4ce9c41cbef0e42b99ac0996a500f2bf5d7f7f725a0218a3bf71527cc.png

import seaborn as sns
import matplotlib.pyplot as plt
from typing import List
from moldrug.utils import Individual
import numpy as np

# This function is possible from version 2.2.0

def plot_dist(individuals:List[Individual], properties:List[str], every_gen:int = 1):


    # Set up the matplotlib figure
    sns.set_theme(style="whitegrid")
    fig, axes = plt.subplots(nrows = len(properties), figsize=(25, 25))

    SawIndividuals = utils.to_dataframe(individuals).drop(['pdbqt'], axis = 1).replace([np.inf, -np.inf], np.nan).dropna()
    SawIndividuals = SawIndividuals[SawIndividuals['kept_gens'].map(len) != 0].reset_index(drop=True)
    gen_idxs = sorted(set(item for sublist in SawIndividuals['kept_gens'] for item in sublist))
    NumGens = max(gen_idxs)

    # Set pop to the initial population and pops out the first gen
    pop = SawIndividuals[SawIndividuals.genID == gen_idxs.pop(0)].sort_values(by=["cost"])
    pops = pop.copy()
    for gen_idx in gen_idxs:
        idx = [i for i in range(SawIndividuals.shape[0]) if gen_idx in SawIndividuals.loc[i,'kept_gens']]
        pop = SawIndividuals.copy().iloc[idx,:].assign(genID=gen_idx)
        pops = pd.concat([pops, pop.copy()])
    # Draw a violinplot with a narrow bandwidth than the default
    pops = pops.loc[pops['genID'].isin([gen for gen in range(0, NumGens+every_gen, every_gen)])]
    for i, prop in enumerate(properties):
        sns.violinplot(x = 'genID', hue='genID', y = prop, data=pops, palette="Set3", bw_adjust=.2, cut=0, linewidth=1, ax=axes[i], legend = False)

    return fig, axes

plot_dist(out.SawIndividuals, properties=['qed', 'sa_score', 'vina_score', 'cost'])

(<Figure size 2500x2500 with 4 Axes>,
 array([<Axes: xlabel='genID', ylabel='qed'>,
        <Axes: xlabel='genID', ylabel='sa_score'>,
        <Axes: xlabel='genID', ylabel='vina_score'>,
        <Axes: xlabel='genID', ylabel='cost'>], dtype=object))

../_images/19742d4af3e0a09eb0b57c4421673f4a56e3723b25aba7474536878605c4198b.png