Summary#

Input data#

Currently, moldrug only accepts valid RDKit molecules and a valid pdbqt file that will be processed in case AutoDock-Vina is used.

The idea#

The idea is simple: we want the best drug for some applications. The problem is that it is not as simple as it sounds. The chemical space is enormous and the estimation of the fitness is also a term of concern.

Here we address this problem using a Genetic Algorithm. The inputs are

  1. Biological target (pdbqt format).

  2. RDKit molecule of some known (or guessing) drug. (it could get easily get from the SMILES string)

  3. Definition of the binding pocket.

  4. Definition of the moldrug.utils.GA() running variables.

With the molecule, a random population of popsize individuals will be generated through the CReM operation mutate_mol. Every individual will be evaluated through the selected fitness function. The best individual will be used for crossover and mutation, generating pc*popsize``offspring. The offspring will be merged with the current population, sorted with respect to the fitness function and the first ``popsize individuals will be kept for the next generation. This cycle will run for maxiter generations.

In a general way moldrug will try to minimize:

\[cost = f(\vec{X})\]

Where \(\vec{X}\) is some description of the chemical structure of the molecule which is mapped to the \(\mathbb{R}\) numbers through the function \(f\). This function will be called cost function.

Alternatively, there is also the class moldrug.utils.Local(). In this case, only an initial population is created and exported.

Fitness functions#

The default fitness functions could be accessed through the module moldrug.fitness. There are currently four implemented:

  1. moldrug.fitness.Cost() (standard). It uses all the response variables.

  2. moldrug.fitness.CostOnlyVina(). Only use the information of the Vina score.

  3. moldrug.fitness.CostMultiReceptors(). It uses all the response variables.

  4. moldrug.fitness.CostMultiReceptorsOnlyVina(). Only use the information of the Vina scores

from moldrug import fitness
cost = fitness.Cost

A molecule must present several properties to be considered a drug. Some of the most important are: be potent, reach the biological target (good ADME profile) and be real. The last obvious property could be a bottleneck for computer-assisted drug design. Because we want to optimize several response variables at the same time. These cost functions (with the exception of moldrug.fitness.CostOnlyVina()) use the concept of desirability functions which optimizes several response variables on the hub.

The following are the response variables:

  1. Vina score.

  2. Quantitative Estimation of Drug-likeness (QED).

  3. Synthetic accessibility score.

For each of this response variable we create their corresponding Derringer-Suich desirability functions (see here some examples of how it is implemented in moldrug). And then we combine them as a geometric mean:

\[D = {\left[\prod_{i = 1}^{3} d_i^{w_i}\right]}^{\frac{1}{\sum_i w_i}}\]

where \(w_i\) are the weights of each variable; and \(d_i\) the desirability functions. Each individual \(d_i\) ranges from 0 to 1 and therefore also \(D\). Because we are looking for the minimum, the function cost return \(1 - D\).

Multi Receptor#

Could be that our receptor presents high flexibility or that we are interested in generating specific small molecules. In this case could be convenient to add more than one receptor to the cost function. In moldrug.fitness module the cost functions moldrug.fitness.CostMultiReceptors() and moldrug.fitness.CostMultiReceptorsOnlyVina() try to reach this goal. For the case of flexibility, we could perform docking in an ensemble of protein structures and just keep the lower scoring rather than include all of them in the final desirability function.