Utils

class moldrug.utils.Atom(line)[source]#

This is a simple class to wrap a pdbqt Atom. It is based on https://userguide.mdanalysis.org/stable/formats/reference/pdbqt.html#writing-out.

__init__(line)[source]#

class moldrug.utils.CHUNK_VINA_OUT(chunk)[source]#

This class is used by VINA_OUT to read the pdbqt output of a Vina docking result.

__init__(chunk)[source]#

get_atoms()[source]#

Return a list of all atoms.

If to_dict is True, each atom is represented as a dictionary. Otherwise, a list of Atom objects is returned.

moldrug.utils.DerringerSuichDesirability()[source]#

A wrapper around the implemented desirability functions.

Returns:: A dict whose keys are the names of the desirability functions and whose values are the corresponding functions.
Return type:: dict

class moldrug.utils.GA(seed_mol: Mol | Iterable[Mol], costfunc: Callable, costfunc_kwargs: Dict, crem_db_path: str, maxiter: int = 10, popsize: int = 20, beta: float = 0.001, pc: float = 1, get_similar: bool = False, mutate_crem_kwargs: None | Dict = None, save_pop_every_gen: int = 0, checkpoint: bool = False, deffnm: str = 'ga', AddHs: bool = False, randomseed: int | None = None)[source]#

An implementation of a genetic algorithm to search in the chemical space.

randomseed#

The random seed to use with random module.

Type:: Union[None, int]

__moldrug_version__#

The molDrug version.

Type:: str

costfunc#

The cost function set by the user.

Type:: object

crem_db_path#

Path to the CReM data base.

Type:: str

maxiter#

Maximum number of iterations to perform.

Type:: int

popsize#

Population size.

Type:: int

beta#

Selection pressure.

Type:: float

costfunc_kwargs#

The keyword arguments of the costfunc.

Type:: dict

costfunc_ncores#

The number of cores to use for costfunc.

Type:: int

nc#

Number of children (offspring) = round(pc * popsize)

Type:: int

get_similar#

Bias the search toward similar molecules. If True, moldrug.utils.get_similar_mols() is used after the mutation with CReM instead of a random choice.

Type:: bool

mutate_crem_kwargs#

The keyword arguments to pass to crem.crem.mutate_mol().

Type:: dict

save_pop_every_gen#

Frequency to save the pickle file of the population during the optimization.

Type:: int

checkpoint#

Save a checkpoint file; this helps to restart a simulation.

Type:: bool

deffnm#

Prefix for the generated files.

Type:: str

NumCalls#

How many times the __call__ method has been called.

Type:: int

NumGens#

The number of generations performed by the class. Subsequent __call__ executions update this number accordingly.

Type:: int

SawIndividuals#

All the Individuals seen during the optimization.

Type:: set[moldrug.utils.Individual]

acceptance#

A dictionary keyed by generation id. Each value is another dictionary with the keys accepted and generated, holding the number of accepted and generated individuals in that generation, respectively.

Type:: dict

AddHs#

Whether explicit hydrogens should be added to all generated molecules.

Type:: bool

_seed_mol#

The list of seed molecules.

Type:: list[Chem.rdchem.Mol]

InitIndividual#

The initial individual based on _seed_mol.

Type:: moldrug.utils.Individual

pop#

The final population sorted by cost.

Type:: list[moldrug.utils.Individual]

best_cost#

The list of best costs, one per generation.

Type:: list[float]

avg_cost#

The list of average costs, one per generation.

Type:: list[float]

TODO#

Time the simulation: add tracking variables for the timing of the evaluation and generation of molecules, and print them at the end of each call.
Extend to other generators:
- mutate_crem_kwargs = None and some other keyword that gets the generator function. In this case the mutate method will be overwritten with the user-provided function, which takes an Individual and returns a new offspring. To be more compatible and avoid issues, a good idea would be for this function to accept self as its first argument and internally use the self of the GA class.

__call__(njobs: int = 1)[source]#

Call definition

Parameters:: njobs (int, optional) – The number of jobs for parallelization; the multiprocessing module will be used, by default 1
Raises:: RuntimeError – Error during the initialization of the population.

__init__(seed_mol: Mol | Iterable[Mol], costfunc: Callable, costfunc_kwargs: Dict, crem_db_path: str, maxiter: int = 10, popsize: int = 20, beta: float = 0.001, pc: float = 1, get_similar: bool = False, mutate_crem_kwargs: None | Dict = None, save_pop_every_gen: int = 0, checkpoint: bool = False, deffnm: str = 'ga', AddHs: bool = False, randomseed: int | None = None) → None[source]#

Constructor

Parameters:

seed_mol (Union[Chem.rdchem.Mol, Iterable[Chem.rdchem.Mol]]) – The seed molecule submitted to genetic algorithm optimization on the chemical space. Could be only one RDKit molecule or more than one specified in an Iterable object.
costfunc (Callable) – The cost function to work with (any from moldrug.fitness or a valid user defined).
costfunc_kwargs (Dict) – The keyword arguments of the selected cost function
crem_db_path (str) – Path to the CReM data base.
maxiter (int, optional) – Maximum number of iteration (or generation), by default 10.
popsize (int, optional) – Population size, by default 20.
beta (float, optional) – Selection pressure. Higher values mean that the best individuals are submitted for mutation more frequently, by default 0.001.
pc (float, optional) – Proportion of children, by default 1
get_similar (bool, optional) – If True the search will be biased toward similar molecules, by default False
mutate_crem_kwargs (Union[None, Dict], optional) – Parameters for mutate_mol of CReM, by default None
save_pop_every_gen (int, optional) – Frequency to save the population, by default 0
checkpoint (bool, optional) – If True the whole class will be saved as cpt with the frequency of save_pop_every_gen. This means that if save_pop_every_gen = 0 and checkpoint = True, no checkpoint will be output, by default False
deffnm (str, optional) – Default prefix name for all generated files, by default ‘ga’
AddHs (bool, optional) – If True the explicit hydrogens will be added, by default False
randomseed (Union[None, int], optional) – Set a random seed for reproducibility, by default None

Raises:

TypeError – In case that seed_mol is a wrong input.
ValueError – In case of incorrect definition of mutate_crem_kwargs. It must be None or a dict instance.
ValueError – In case crem_db_path does not exist.

mutate(individual: Individual)[source]#

Genetic operators

Parameters:: individual (Individual) – The individual to mutate.
Returns:: A new Individual.
Return type:: Individual

pickle(title: str, compress: bool = False)[source]#

Method to pickle the whole GA class

Parameters:

title (str) – Name of the object which will be completed with the corresponding extension depending if compress is set to True or False.
compress (bool, optional) – Use compression, by default False. If True moldrug.utils.compressed_pickle() will be used; if not moldrug.utils.full_pickle() will be used instead.

to_dataframe(return_mol: bool = False)[source]#

Create a DataFrame from self.SawIndividuals.

Returns:: The DataFrame
Return type:: pandas.DataFrame

class moldrug.utils.Individual(mol: Mol, idx: int | str = 0, pdbqt: str | None = None, cost: float = inf, randomseed: int | None = None)[source]#

Base class to work with GA, Local and all the fitness functions. Individual is a mutable object; only the smiles attribute is immutable, and it is used for hashing. The class is therefore hashable based on the smiles attribute, which is also used for ‘==’ comparison: if two Individuals have the same smiles, they are considered equal no matter how the rest of their attributes differ. The cost attribute is used for arithmetic operations. The class also supports copy and deepcopy operations. Known issue: when working with a numpy array of Individuals, the dtype of the generated arrays must be changed.

mol#

The molecule object

Type:: Chem.rdchem.Mol

idx#

The identifier

Type:: Union[int, str]

pdbqt#

A pdbqt string representation of the molecule, used for docking with Vina. It is generated during the initialization of the class

Type:: str

smiles#

The SMILES representation of the mol attribute without explicit hydrogens, this attribute (property) is immutable.

Type:: str (property)

cost#

This attribute is used to interact with the fitness functions of moldrug.fitness

Type:: float

Example

In [1]: from moldrug import utils, fitness

In [2]: import numpy as np

In [3]: from copy import copy, deepcopy

In [4]: from rdkit import Chem

In [5]: i1 = utils.Individual(mol = Chem.MolFromSmiles('CC'), idx = 1, cost = 5)

In [6]: i2 = utils.Individual(mol = Chem.MolFromSmiles('CC'), idx = 2, cost = 4)

In [7]: i3 = utils.Individual(mol = Chem.MolFromSmiles('CCC'), idx = 3, cost = 4)

# Show the '==' operation
In [8]: print(i1 == i2, i1 == i3)
True False

# Show that Individual is a hashable object based on the smiles
In [9]: print(set([i1,i2,i3]))
{Individual(idx = 3, smiles = CCC, cost = 4), Individual(idx = 1, smiles = CC, cost = 5)}

# Show arithmetic operations
In [10]: print(i1+i2)
9

# How to work with numpy
In [11]: array = np.array([i1,i2, i3])

In [12]: array_2 = (array*2).astype('float64')

In [13]: print(array_2)
[10.  8.  8.]

# Show copy
In [14]: print(copy(i3), deepcopy(i3))
Individual(idx = 3, smiles = CCC, cost = 4) Individual(idx = 3, smiles = CCC, cost = 4)

__init__(mol: Mol, idx: int | str = 0, pdbqt: str | None = None, cost: float = inf, randomseed: int | None = None) → None[source]#

This is the constructor of the class.

Parameters:

mol (Chem.rdchem.Mol) – A valid RDKit molecule.
idx (Union[int, str], optional) – An identifier, by default 0
pdbqt (str, optional) – A valid pdbqt string. If it is not provided, it will be generated from mol through utils.confgen and the mol attribute will be updated with the 3D model, by default None
cost (float, optional) – This attribute is used to perform operations between Individuals and should be used for the cost functions, by default np.inf
randomseed (Union[None, int], optional) – Provide a seed for the random number generator so that the “same” coordinates can be obtained for the attribute pdbqt on multiple runs. If None, the RNG will not be seeded, by default None

moldrug.utils.LargerTheBest(Value: float, LowerLimit: float, Target: float, r: float = 1) → float[source]#

Desirability function used when larger values are the targets. If Value is greater than or equal to Target it returns 1; if it is lower than LowerLimit it returns 0; otherwise it returns a number between 0 and 1. See also: doi:10.1016/j.chemolab.2011.04.004 and https://www.youtube.com/watch?v=quz4NW0uIYw&list=PL6ebkIZFT4xXiVdpOeKR4o_sKLSY0aQf_&index=3

Parameters:

Value (float) – Value to test.
LowerLimit (float) – Lowest accepted value. Anything lower than this returns 0.
Target (float) – The target value. On this value (or higher) the function takes 1 as value.
r (float, optional) – This is the exponent of the interpolation. Could be used to control the interpolation, by default 1

Returns:

A number between 0 and 1, where 1 is the most desirable value.

Return type:

float
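The piecewise behaviour described above can be sketched in plain Python. This is only a minimal sketch: it assumes a power-law interpolation between LowerLimit and Target (the usual Derringer-Suich form), and the name larger_the_best_sketch is hypothetical. The moldrug implementation may differ in its details.

```python
def larger_the_best_sketch(value: float, lower_limit: float, target: float, r: float = 1) -> float:
    """Hypothetical stand-in for a larger-the-best desirability (not moldrug's code)."""
    if value < lower_limit:
        return 0.0
    if value >= target:
        return 1.0
    # Assumed interpolation: power-law ramp from lower_limit (0) up to target (1).
    return ((value - lower_limit) / (target - lower_limit)) ** r


print(larger_the_best_sketch(5, 0, 10))       # halfway between the limits
print(larger_the_best_sketch(5, 0, 10, r=2))  # r > 1 penalizes values far from the target
print(larger_the_best_sketch(12, 0, 10))      # at or above the target
```

With r = 1 the ramp is linear; larger exponents make the function stricter for values far below the target.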

class moldrug.utils.Local(seed_mol: Mol, crem_db_path: str, costfunc: object, grow_crem_kwargs: Dict | None = None, costfunc_kwargs: Dict | None = None, AddHs: bool = False, randomseed: int | None = None, deffnm: str = 'local')[source]#

This class is used to generate solutions close to the seed molecule. It uses crem.crem.grow_mol().

randomseed#

The random seed to use with random module.

Type:: Union[None, int]

__moldrug_version__#

The molDrug version.

Type:: str

costfunc#

The cost function set by the user.

Type:: object

crem_db_path#

Path to the CReM data base.

Type:: str

costfunc_kwargs#

The keyword arguments of the costfunc.

Type:: dict

grow_crem_kwargs#

The keyword arguments to pass to crem.crem.grow_mol().

Type:: dict

AddHs#

In case explicit hydrogens should be added.

Type:: bool

pop#

The final population sorted by cost.

Type:: list[moldrug.utils.Individual]

__call__(njobs: int = 1, pick: int | None = None)[source]#

Call definition

Parameters:

njobs (int, optional) – The number of jobs for parallelization, the module multiprocessing will be used, by default 1
pick (int, optional) – How many molecules to take from those generated through the grow_mol CReM operation, by default None, which means all generated molecules.

__init__(seed_mol: Mol, crem_db_path: str, costfunc: object, grow_crem_kwargs: Dict | None = None, costfunc_kwargs: Dict | None = None, AddHs: bool = False, randomseed: int | None = None, deffnm: str = 'local') → None[source]#

Constructor

Parameters:

seed_mol (Chem.rdchem.Mol) – The seed molecule from which the population will be generated.
crem_db_path (str) – Path to the CReM data base.
costfunc (object) – The cost function to work with (any from moldrug.fitness or a valid user defined).
grow_crem_kwargs (Dict, optional) – The keywords of the grow_mol function of CReM, by default None
costfunc_kwargs (Dict, optional) – The keyword arguments of the selected cost function, by default None
AddHs (bool, optional) – If True the explicit hydrogens will be added, by default False
randomseed (Union[None, int], optional) – Set a random seed for reproducibility, by default None
deffnm (str) – Just a place holder for compatibility with the CLI.

Raises:

Exception – In case some problem occurred during the creation of the Individual from the seed_mol.
ValueError – In case of incorrect definition of grow_crem_kwargs and/or costfunc_kwargs. They must be None or a dict instance.

pickle(title: str, compress: bool = False)[source]#

Method to pickle the whole Local class

Parameters:

title (str) – Name of the object which will be completed with the corresponding extension depending if compress is set to True or False.
compress (bool, optional) – Use compression, by default False. If True moldrug.utils.compressed_pickle() will be used; if not moldrug.utils.full_pickle() will be used instead.

to_dataframe(return_mol: bool = False)[source]#

Create a DataFrame from self.pop.

Returns:: The DataFrame
Return type:: pandas.DataFrame

moldrug.utils.NominalTheBest(Value: float, LowerLimit: float, Target: float, UpperLimit: float, r1: float = 1, r2: float = 1) → float[source]#

Desirability function used when a target value is desired. If Value is lower than or equal to LowerLimit, or higher than or equal to UpperLimit, it returns 0; otherwise it returns a number between 0 and 1.

Parameters:

Value (float) – Value to test.
LowerLimit (float) – Lowest accepted value. Anything lower than this returns 0.
Target (float) – The target value. At this value the function returns 1.
UpperLimit (float) – Highest accepted value. Anything higher than this returns 0.
r1 (float, optional) – This is the exponent of the interpolation from LowerLimit to Target. Could be used to control the interpolation, by default 1
r2 (float, optional) – This is the exponent of the interpolation from Target to UpperLimit. Could be used to control the interpolation, by default 1

Returns:

A number between 0 and 1, where 1 is the most desirable value.

Return type:

float
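A two-sided version of the same idea can be sketched as follows. This is a sketch under stated assumptions, not the library's code: the power-law ramps on each side of Target and the name nominal_the_best_sketch are illustrative choices.

```python
def nominal_the_best_sketch(
    value: float,
    lower_limit: float,
    target: float,
    upper_limit: float,
    r1: float = 1,
    r2: float = 1,
) -> float:
    """Hypothetical stand-in for a nominal-the-best desirability (not moldrug's code)."""
    if value <= lower_limit or value >= upper_limit:
        return 0.0
    if value <= target:
        # Assumed rising ramp from lower_limit (0) to target (1), shaped by r1.
        return ((value - lower_limit) / (target - lower_limit)) ** r1
    # Assumed falling ramp from target (1) to upper_limit (0), shaped by r2.
    return ((upper_limit - value) / (upper_limit - target)) ** r2


for value in (0, 2.5, 5, 7.5, 10):
    print(value, nominal_the_best_sketch(value, lower_limit=0, target=5, upper_limit=10))
```

Using separate exponents r1 and r2 lets the two sides of the target be penalized asymmetrically.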

moldrug.utils.SmallerTheBest(Value: float, Target: float, UpperLimit: float, r: float = 1) → float[source]#

Desirability function used when lower values are the targets. If Value is lower than or equal to Target it returns 1; if it is higher than UpperLimit it returns 0; otherwise it returns a number between 0 and 1.

Parameters:

Value (float) – Value to test.
Target (float) – The target value. On this value (or lower) the function takes 1 as value.
UpperLimit (float) – Highest accepted value. Anything higher than this returns 0.
r (float, optional) – This is the exponent of the interpolation. Could be used to control the interpolation, by default 1

Returns:

A number between 0 and 1, where 1 is the most desirable value.

Return type:

float
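For the smaller-is-better case, a minimal sketch could look like the following. It assumes a power-law ramp from Target down to UpperLimit; the name smaller_the_best_sketch is hypothetical and the code is not taken from moldrug.

```python
def smaller_the_best_sketch(value: float, target: float, upper_limit: float, r: float = 1) -> float:
    """Hypothetical stand-in for a smaller-the-best desirability (not moldrug's code)."""
    if value <= target:
        return 1.0
    if value > upper_limit:
        return 0.0
    # Assumed interpolation: power-law ramp falling from target (1) to upper_limit (0).
    return ((upper_limit - value) / (upper_limit - target)) ** r


print(smaller_the_best_sketch(-1, 0, 10))  # at or below the target
print(smaller_the_best_sketch(5, 0, 10))   # halfway between the limits
print(smaller_the_best_sketch(11, 0, 10))  # above the upper limit
```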

class moldrug.utils.VINA_OUT(file)[source]#

Class to handle Vina output. Consider using Meeko in the future.

__init__(file)[source]#

moldrug.utils.compressed_pickle(title: str, data: object)[source]#

Compress a Python object: it is first pickled and then compressed with bz2.BZ2File.

Parameters:

title (str) – Name of the file without extensions, .pbz2 will be added by default
data (object) – Any serializable python object

moldrug.utils.confgen(mol: Mol, return_mol: bool = False, randomseed: int | None = None)[source]#

Create a 3D model from an RDKit molecule and return a pdbqt string and, if return_mol = True, the molecule as well.

Parameters:

mol (Chem.rdchem.Mol) – A valid RDKit molecule.
return_mol (bool, optional) – If true the function will also return the rdkit.Chem.rdchem.Mol, by default False
randomseed (Union[None, int], optional) – Provide a seed for the random number generator so that the same coordinates can be obtained for a molecule on multiple runs. If None, the RNG will not be seeded, by default None

Returns:

If return_mol = True it will return a tuple (str[pdbqt], Chem.rdchem.Mol), if not only a str that represents the pdbqt.

Return type:

tuple or str

moldrug.utils.decompress_pickle(file: str)[source]#

Decompress pickled objects that were compressed with bz2.

Parameters:: file (str) – The pickle file compressed with bz2.BZ2File (by convention with the extension .pbz2, but this is not required).
Returns:: The python object.
Return type:: object
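The compress and decompress round trip can be illustrated with the standard library alone. The two helpers below are hypothetical stand-ins written for this example (including the returned path, which is an addition for convenience); they only assume what the descriptions state, namely pickle combined with bz2.BZ2File and a .pbz2 extension.

```python
import bz2
import os
import pickle
import tempfile


def compressed_pickle_sketch(title: str, data: object) -> str:
    """Pickle `data` into `<title>.pbz2`; returning the path is an illustrative extra."""
    path = f"{title}.pbz2"
    with bz2.BZ2File(path, "wb") as handle:
        pickle.dump(data, handle)
    return path


def decompress_pickle_sketch(file: str) -> object:
    """Read back an object written by compressed_pickle_sketch."""
    with bz2.BZ2File(file, "rb") as handle:
        return pickle.load(handle)


# Round trip inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    saved = compressed_pickle_sketch(os.path.join(tmp_dir, "population"), {"best_cost": [0.8, 0.5]})
    restored = decompress_pickle_sketch(saved)

print(restored)
```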

moldrug.utils.deep_update(target_dict: dict, update_dict: dict) → dict[source]#

Recursively update a dictionary with the key-value pairs from another dictionary. Inspired by https://stackoverflow.com/questions/3232943/update-value-of-a-nested-dictionary-of-varying-depth

Parameters:

target_dict (dict) – The dictionary to be updated
update_dict (dict) – The dictionary providing the updates

Example

In [1]: from moldrug.utils import deep_update

In [2]: target = {'a': 1, 'b': {'c': 2, 'd': 3}}

In [3]: updates = {'b': {'c': 4, 'e': 5}, 'f': 6}

In [4]: result = deep_update(target, updates)

In [5]: print(result)
{'a': 1, 'b': {'c': 4, 'd': 3, 'e': 5}, 'f': 6}

Returns:: The updated dictionary
Return type:: dict

moldrug.utils.full_pickle(title: str, data: object)[source]#

Normal pickle.

Parameters:

title (str) – Name of the file without extension, .pkl will be added by default.
data (object) – Any serializable python object

moldrug.utils.get_sim(ms: List[Mol], ref_fps: List)[source]#

Get the molecules with the highest similarity to each member of ref_fps.

Parameters:

ms (list[Chem.rdchem.Mol]) – List of molecules
ref_fps (list[AllChem.GetMorganFingerprintAsBitVect(mol, 2)]) – A list of reference fingerprints

Returns:

A list of the molecules with the highest similarity to their corresponding ref_fps value.

Return type:

list[Chem.rdchem.Mol]

moldrug.utils.get_similar_mols(mols: List, ref_mol: Mol, pick: int, beta: float = 0.01)[source]#

Pick molecules from mols that are similar to ref_mol using a roulette wheel selection strategy.

Parameters:

mols (list) – The list of molecules from where to pick molecules.
ref_mol (Chem.rdchem.Mol) – The reference molecule
pick (int) – Number of molecules to pick from mols
beta (float, optional) – Selection threshold, by default 0.01

Returns:

A list of picked molecules.

Return type:

list

moldrug.utils.import_sascorer()[source]#

Function to import sascorer from RDConfig.RDContribDir of RDKit

Returns:: The sascorer module ready to use.
Return type:: module

moldrug.utils.is_iter(obj)[source]#

Check if obj is iterable

Parameters:: obj (Any) – Any python object
Returns:: True if obj is iterable, False if not
Return type:: bool
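One common way to implement such a check is to attempt iter(obj) and catch the TypeError. The helper below is a hypothetical sketch of that idiom, not necessarily how moldrug does it.

```python
def is_iter_sketch(obj) -> bool:
    """Return True when iter(obj) succeeds (illustrative, not moldrug's code)."""
    try:
        iter(obj)
    except TypeError:
        return False
    return True


print(is_iter_sketch([1, 2, 3]))  # lists are iterable
print(is_iter_sketch("CCO"))      # note: strings are iterable too
print(is_iter_sketch(42))         # plain integers are not
```

Note that under this idiom strings count as iterable, which matters when distinguishing a single item from a collection of items.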

moldrug.utils.lipinski_filter(mol: Mol, maxviolation: int = 2)[source]#

Implementation of Lipinski filter.

Parameters:

mol (Chem.rdchem.Mol) – An RDKit molecule.
maxviolation (int, optional) – Maximum number of violations. Above this value the function returns False, by default 2

Returns:

True if the molecule presents fewer than maxviolation violations; otherwise False.

Return type:

bool

moldrug.utils.lipinski_profile(mol: Mol)[source]#

See: https://www.rdkit.org/docs/source/rdkit.Chem.Lipinski.html?highlight=lipinski#module-rdkit.Chem.Lipinski

Parameters:: mol (Chem.rdchem.Mol) – An RDKit molecule.
Returns:: A dictionary with molecular properties.
Return type:: dict

moldrug.utils.loosen(file: str)[source]#

Unpickle a pickled object.

Parameters:: file (str) – The path to the file that stores the pickled object.
Returns:: The python object.
Return type:: object

moldrug.utils.make_sdf(individuals: List[Individual], sdf_name: str = 'out')[source]#

This function creates an SDF file from a list of Individuals based on their pdbqt attribute. It assumes that the cost function updates the pdbqt attribute after docking with the conformations obtained. In the case of multiple receptors, the attribute should be a list of valid pdbqt strings; several SDF files are then exported, depending on how many pdbqt strings are in the pdbqt attribute.

Parameters:

individuals (list[Individual]) – A list of individuals
sdf_name (str, optional) – The name for the output file. Could be a path + sdf_name. The sdf extension will be added by the function, by default ‘out’

Example

In [1]: import tempfile, os

In [2]: from moldrug import utils

In [3]: from rdkit import Chem

# Create a temporary directory
In [4]: tmp_path = tempfile.TemporaryDirectory()

# Creating two individuals
In [5]: I1 = utils.Individual(Chem.MolFromSmiles('CCCCl'))

In [6]: I2 = utils.Individual(Chem.MolFromSmiles('CCOCCCF'))

# Creating the pdbqt attribute as a list with the pdbqt attribute (this is just a silly example)
In [7]: I1.pdbqt = [I1.pdbqt, I1.pdbqt]

In [8]: I2.pdbqt = [I2.pdbqt, I2.pdbqt]

In [9]: utils.make_sdf([I1, I2], sdf_name = os.path.join(tmp_path.name, 'out'))
 File /tmp/tmprj8ilrsr/out_1.sdf was created!
 File /tmp/tmprj8ilrsr/out_2.sdf was created!

# Two files were created
# On the other hand, if the attribute pdbqt is not a list, only one file is going to be created
# Set pdbqt to the original value
In [10]: I1.pdbqt = I1.pdbqt[0]

In [11]: I2.pdbqt = I2.pdbqt[0]

In [12]: utils.make_sdf([I1, I2], sdf_name = os.path.join(tmp_path.name, 'out'))
File /tmp/tmprj8ilrsr/out.sdf was createad!

# Only one file will also be created if the pdbqt attribute has no len for some of
# the individuals, or if the lens differ between individuals.
# In this case the pdbqts will be completely ignored and pdbqt attribute
# will be used for the construction of the sdf file
In [13]: I1.pdbqt = [I1.pdbqt, I1.pdbqt, I1.pdbqt]

In [14]: I2.pdbqt = [I2.pdbqt, I2.pdbqt]

In [15]: utils.make_sdf([I1, I2], sdf_name = os.path.join(tmp_path.name, 'out'))
File /tmp/tmprj8ilrsr/out.sdf was createad!

moldrug.utils.roulette_wheel_selection(p: List[float])[source]#

Function to select the offspring based on their fitness.

Parameters:: p (list[float]) – Probabilities
Returns:: The selected index
Return type:: int
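Roulette wheel selection is commonly implemented by drawing a uniform random number and locating it on the cumulative sum of the probabilities. The sketch below follows that textbook approach; the function name and the optional rng argument are assumptions made for this example, not part of moldrug.

```python
import random
from bisect import bisect_right
from itertools import accumulate
from typing import List, Optional


def roulette_wheel_sketch(p: List[float], rng: Optional[random.Random] = None) -> int:
    """Return an index drawn with probability proportional to p (illustrative only)."""
    if rng is None:
        rng = random.Random()
    cumulative = list(accumulate(p))
    # Spin the wheel: a uniform draw in [0, total weight).
    spin = rng.random() * cumulative[-1]
    # First slot whose cumulative weight exceeds the draw; min() guards against rounding.
    return min(bisect_right(cumulative, spin), len(p) - 1)


rng = random.Random(0)  # seeded only to make the demonstration repeatable
counts = [0, 0, 0]
for _ in range(10_000):
    counts[roulette_wheel_sketch([0.1, 0.3, 0.6], rng)] += 1
print(counts)  # counts should be roughly proportional to the probabilities
```

Entries with larger probability occupy a wider slice of the wheel, so they are selected more often while low-probability entries still have a chance.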

moldrug.utils.run(command: str, shell: bool = True, executable: str = '/bin/bash')[source]#

This function is just a useful wrapper around subprocess.run

Parameters:

command (str) – Any command to execute.
shell (bool, optional) – Keyword argument passed to subprocess.run (and, through it, to subprocess.Popen), by default True
executable (str, optional) – Keyword argument passed to subprocess.run (and, through it, to subprocess.Popen), by default ‘/bin/bash’

Returns:

The process object returned by subprocess.run.

Return type:

object

Raises:

RuntimeError – In case of non-zero exit status on the provided command.
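A wrapper with this contract could be sketched as follows. The capture settings and the error message format are assumptions made for the example, the name run_sketch is hypothetical, and the demonstration call uses /bin/sh (assuming a POSIX system) rather than the documented /bin/bash default.

```python
import subprocess


def run_sketch(command: str, shell: bool = True, executable: str = "/bin/bash") -> subprocess.CompletedProcess:
    """Run a command and raise RuntimeError on a non-zero exit status (illustrative only)."""
    process = subprocess.run(
        command,
        shell=shell,
        executable=executable,
        stdout=subprocess.PIPE,  # assumed: capture output so it can be reported
        stderr=subprocess.PIPE,
        text=True,
    )
    if process.returncode != 0:
        raise RuntimeError(
            f"Command {command!r} exited with status {process.returncode}:\n{process.stderr}"
        )
    return process


# /bin/sh is used here only because it is present on any POSIX system.
result = run_sketch("echo hello", executable="/bin/sh")
print(result.stdout.strip())
```

Raising an exception instead of returning the failed process forces callers to handle errors explicitly.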

moldrug.utils.tar_errors(error_path: str = 'error')[source]#

Clean up errors in the working directory: compress error_path into error.tar.gz and delete the directory.

Parameters:: error_path (str) – Where the errors are stored.
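The archive-then-delete behaviour can be sketched with tarfile and shutil. The helper name and the decision to do nothing when the directory is missing are assumptions of this example, not confirmed details of moldrug.

```python
import os
import shutil
import tarfile
import tempfile


def tar_errors_sketch(error_path: str = "error") -> None:
    """Archive error_path as <error_path>.tar.gz and remove the directory (illustrative only)."""
    if not os.path.isdir(error_path):
        return  # assumed behaviour: nothing to clean up
    error_path = os.path.normpath(error_path)
    with tarfile.open(f"{error_path}.tar.gz", "w:gz") as archive:
        archive.add(error_path, arcname=os.path.basename(error_path))
    shutil.rmtree(error_path)


# Demonstration in a throwaway working directory with a made-up error file.
with tempfile.TemporaryDirectory() as work_dir:
    error_dir = os.path.join(work_dir, "error")
    os.makedirs(error_dir)
    with open(os.path.join(error_dir, "idx_1_error.txt"), "w") as handle:
        handle.write("docking failed\n")
    tar_errors_sketch(error_dir)
    archive_created = os.path.isfile(os.path.join(work_dir, "error.tar.gz"))
    directory_removed = not os.path.exists(error_dir)

print(archive_created, directory_removed)
```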

moldrug.utils.to_dataframe(individuals: List[Individual], return_mol: bool = False) → DataFrame[source]#

Convert a list of individuals to a DataFrame

Parameters:

individuals (List[Individual]) – The list of individuals
return_mol (bool, optional) – If True the attribute mol will also be returned, by default False

Returns:

The DataFrame

Return type:

pd.DataFrame

moldrug.utils.update_reactant_zone(parent: Mol, offspring: Mol, parent_replace_ids: List[int] | None = None, parent_protected_ids: List[int] | None = None)[source]#

This function finds the difference between offspring and parent based on the Maximum Common Substructure (MCS). This difference is considered offspring_replace_ids. Because the indexes of the product can change with respect to the reactant after a reaction, the parent_replace_ids may no longer be valid. The function maps the indexes of the parent to the offspring based on the MCS. If some of the parent_replace_ids are still present among those indexes, they are updated based on the offspring and also added to offspring_replace_ids. The same is done for the parent_protected_ids.

Parameters:

parent (Chem.rdchem.Mol) – The original molecule from where offspring was generated
offspring (Chem.rdchem.Mol) – A derivative of parent
parent_replace_ids (List[int], optional) – A list of replaceable indexes in the parent, by default None
parent_protected_ids (List[int], optional) – A list of protected indexes in the parent, by default None

Returns:

The function returns a tuple composed of two lists of integers. The first list is offspring_replace_ids and the second one is offspring_protected_ids.

Return type:

tuple[list[int]]
