7. Advance topics#

7.1. Optimizing selected positions#

Could be that our aim is to optimize a specif feature of the molecule instead of the whole chemical structure. From version 1.0.0 of moldrug the task is possible. Previously the class moldrug.utils.Local could be used as a workaround. However this class only create one population from the seed molecule rather than to make a proper optimization on the chemical space.

Now is as easy as works with CReM. In this case we just need to provided the correct CReM parameter and the class moldrug.utils.GA will do the job.

Let’s begging importing and creating some simple functions.

from moldrug.data import get_data
from moldrug import utils, fitness
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
import copy, os, requests, gzip, shutil
from multiprocessing import cpu_count
from rdkit.Chem import Draw

# making moldrug talkative
os.environ['MOLDRUG_VERBOSE'] = 'True'

def drawgrid(mols):
    return Draw.MolsToGridImage(mols, molsPerRow=4, subImgSize=(300,200), useSVG=True)

def mol_with_atom_index(mol):
    for atom in mol.GetAtoms():
        atom.SetAtomMapNum(atom.GetIdx())
    return mol
# Defining the wd and change directory
wd = 'wd'
os.mkdir(wd)
os.chdir(wd)
# Remember to change to your vina executable
# If you define as a path, it must be absolute
vina_executable = '/Users/klimt/GIT/docking/bin/vina'
# use moldrug's internal data
data_x0161 = get_data('x0161')

We will use the same example as the one discussed in Quickstart but with some minor modifications.

url = "http://www.qsar4u.com/files/cremdb/replacements02_sc2.db.gz"
r = requests.get(url, allow_redirects=True)
crem_dbgz_path = os.path.join('crem.db.gz')
crem_db_path = os.path.join('crem.db')
open(crem_dbgz_path, 'wb').write(r.content)
with gzip.open(crem_dbgz_path, 'rb') as f_in:
    with open(crem_db_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

Let’s first take a look on the numeration of the molecule. In this example we want to optimize the carbon atom with index “0” (highlighted in the above picture) and do not touch the rest of the molecule. CReM handles this possibilities in a very friendlily way: with the kwargs: replace_ids and protected_ids (visit CReM for more information about it). In this case we will use replace_ids specifying the atom with index “0” inside of the mutate_crem_kwargs of moldrug.utils.GA.

ligand = Chem.MolFromSmiles(data_x0161['smiles'])
ligand.__sssAtoms = [0]
mol_with_atom_index(copy.deepcopy(ligand))
../_images/0cb960dfebdb890e83f1ca822bf831fdd63e19879063a5e65156abdef1d54850.png

Initializing moldrug.utils.GA

maxiter = 3
popsize = 20
njobs = 2
out = utils.GA(
    seed_mol=ligand,
    maxiter=maxiter,
    popsize=popsize,
    crem_db_path = crem_db_path,
    pc = 1,
    get_similar = False,
    mutate_crem_kwargs = {
        'radius':3,
        'min_size':0,
        'max_size':0,
        'min_inc':2,
        'max_inc':6,
        'replace_ids':[0],
        'ncores':cpu_count(),
    },
    costfunc = fitness.Cost,
    costfunc_kwargs = {
        'vina_executable': vina_executable,
        'receptor_pdbqt_path': data_x0161['protein']['pdbqt'],
        'boxcenter' : data_x0161["box"]['boxcenter'] ,
        'boxsize': data_x0161["box"]['boxsize'],
        'exhaustiveness': 4,
        'ncores': int(cpu_count() / njobs),
        'num_modes': 1,
        'vina_seed':1234  # This is only for reproducibility, ot should not be used for production
    },
    save_pop_every_gen = 20,
    deffnm = 'specific_position',
    randomseed=1234,  # This is only for reproducibility, ot should not be used for production
    AddHs=True
)

As you can see in the mutate_crem_kwargs we are substituting hydrogen atoms by fragments between 2 and 6 heavy atoms; and only will be used for replace the carbon atom with index “0”. However, after the first generation and a new fragments is added, all the atoms of the new fragments will be considered also for mutations. In other words: moldrug.utils.GA will internally update the value of replace_ids adding the index of the atoms added by CReM in the previous generation.

Check that the AddHs is set to True. This is needed if we want to work with the hydrogens atoms of the molecule.

out(njobs=njobs)
Creating the first population with 20 members:
100%|██████████| 20/20 [00:34<00:00,  1.74s/it]
Initial Population: Best Individual: Individual(idx = 6, smiles = NS(=O)(=O)c1ccc(C(=O)OCc2ccccc2)cc1, cost = 0.5972412921346482)
Accepted rate: 20 / 20

File specific_position_pop.sdf was createad!

Evaluating generation 1 / 3:
100%|██████████| 20/20 [01:00<00:00,  3.04s/it]
Generation 1: Best Individual: Individual(idx = 33, smiles = NS(=O)(=O)c1ccc(C(=O)OC(c2ccccc2)c2ccncc2)cc1, cost = 0.4092908688748841).
Accepted rate: 8 / 20

Evaluating generation 2 / 3:
100%|██████████| 20/20 [01:07<00:00,  3.37s/it]
Generation 2: Best Individual: Individual(idx = 33, smiles = NS(=O)(=O)c1ccc(C(=O)OC(c2ccccc2)c2ccncc2)cc1, cost = 0.4092908688748841).
Accepted rate: 11 / 20

Evaluating generation 3 / 3:
100%|██████████| 20/20 [01:23<00:00,  4.16s/it]
File specific_position_pop.sdf was createad!
Generation 3: Best Individual: Individual(idx = 68, smiles = NS(=O)(=O)c1ccc(C(=O)OC(c2ccncc2)c2ccc(N3CCC(=O)N3)cc2)cc1, cost = 0.36239300658520834).
Accepted rate: 8 / 20


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

The simulation finished successfully after 3 generations witha population of 20 individuals. A total number of 80 Individuals were seen during the simulation.
Initial Individual: Individual(idx = 0, smiles = COC(=O)c1ccc(S(N)(=O)=O)cc1, cost = 1.0)
Final Individual: Individual(idx = 68, smiles = NS(=O)(=O)c1ccc(C(=O)OC(c2ccncc2)c2ccc(N3CCC(=O)N3)cc2)cc1, cost = 0.36239300658520834)
The cost function dropped in 0.6376069934147917 units.

=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Total time (3 generations): 290.53 (s).
Finished at Thu Jun 13 14:06:46 2024.

mol_list = [Chem.RemoveHs(m.mol) for m in out.pop]
AllChem.Compute2DCoords(ligand)
_ = [AllChem.GenerateDepictionMatching2DStructure(m,Chem.RemoveHs(ligand)) for m in mol_list]
drawgrid([mol_with_atom_index(Chem.RemoveHs(m)) for m in mol_list])
../_images/9a85c7619009ccef72c459533aa4af652e20b547e32bc4c18c65edd1fc9cc03a.svg

As you can see only the selected atom is the one submitted to modifications. Also, is evident that the numeration of the molecules are different. However this is not a problem for moldrug.utils.GA, it is smart enough to identify those changes on numeration during the generations.

Known issues:

  1. Symmetric molecules! If, for example, we provided the atom index [5] instead of [0] could be that also the position 9 is submitted to mutation.

7.2. Known issues#

  1. Symmetric molecules! If, for example, we provided the atom index [5] instead of [0] could be that also the position 9 is submitted to mutation.

  2. If for the molecule CCC(O)CC the carbon atom attached to the oxygen atom is submitted to mutation and one of the products is CCN(O)CC; the last two atoms and the oxygen will be also considered for mutation in the next generation. This undesired behavior is because of the use of the MCS to detect what are the “new” atoms in the offspring respect to the parent molecule. The use of Disconnected Maximum Common Substructures under Constraints might solve this issue, but we have not tested yet such approach.

from rdkit.Chem import Draw
parent = 'CCC(O)CC'
offspring = 'CCN(O)CC'
RDMols = [mol_with_atom_index(Chem.MolFromSmiles(parent)), mol_with_atom_index(Chem.MolFromSmiles(offspring))]

[mol_with_atom_index(m) for m in RDMols]
Draw.MolsToGridImage(RDMols, subImgSize=(250,250))
../_images/792f7144a98fbafd983c8f340083263beb6515be4cda8dd285927af35d58a1ac.png
new_replace_ids, new_protected_ids = utils.update_reactant_zone(RDMols[0], RDMols[1], parent_replace_ids=[2])
print(f"{new_replace_ids=}") # Which is wrong, it should only be 2
print(f"{new_protected_ids=}")
new_replace_ids=[2, 3, 4, 5]
new_protected_ids=[]