Using parallelized incremental meta-docking can solve the conformational sampling issue when docking large ligands to proteins

Background

Docking large ligands, and especially peptides, to protein receptors is still considered a challenge in computational structural biology. Besides the issue of accurately scoring the binding modes of a protein-ligand complex produced by a molecular docking tool, the conformational sampling of a large ligand is also often considered a challenge because of its underlying combinatorial complexity. In this study, we evaluate the impact of using parallelized and incremental paradigms on the accuracy and performance of conformational sampling when docking large ligands. We use five datasets of protein-ligand complexes involving ligands that could not be accurately docked by classical protein-ligand docking tools in previous similar studies.

Results

Our computational evaluation shows that simply increasing the amount of conformational sampling performed by a protein-ligand docking tool, such as Vina, by running it for longer is rarely beneficial. Instead, it is more efficient and advantageous to run several short instances of this docking tool in parallel and group their results together, in a straightforward parallelized docking protocol. Even greater accuracy and efficiency are achieved by our parallelized incremental meta-docking tool, DINC, showing the additional benefits of its incremental paradigm. Using DINC, we could accurately reproduce the vast majority of the protein-ligand complexes we considered.

Conclusions

Our study suggests that, even when trying to dock large ligands to proteins, the conformational sampling of the ligand should no longer be considered an issue, as simple docking protocols using existing tools can solve it. Therefore, scoring should currently be regarded as the biggest unmet challenge in molecular docking.

in solution. Furthermore, the number of possible conformations increases exponentially with the number of rotatable bonds. This underpins the computational challenge of exploring all these conformations and predicting the best fit between a ligand and a protein's binding site [11]. With the exception of very small ligands, exhaustively exploring all the rotatable bonds, or degrees of freedom (DoFs), of a ligand is infeasible, and several strategies have been proposed to achieve efficient sampling [8,12,13].
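To make this combinatorial growth concrete, suppose each rotatable bond were discretized into k torsion positions (an illustrative simplification; real samplers explore continuous angles). The conformation count then grows as k to the power of the number of bonds:

```python
def n_conformations(n_bonds: int, k: int = 3) -> int:
    """Illustrative count of discrete conformations for a ligand with
    n_bonds rotatable bonds, assuming k torsion positions per bond.
    The exponential growth is the point, not the exact numbers."""
    return k ** n_bonds
```

With only 15 DoFs and 3 positions per bond, this already exceeds 14 million conformations, which is why exhaustive enumeration is infeasible for large ligands.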
Other considerations that we do not address in this paper can render conformational sampling even more computationally challenging. This is the case when considering the flexibility of the protein receptor in addition to that of the ligand. As there might exist structural differences between a protein's bound and unbound conformations, ideally, protein flexibility should be taken into account in molecular docking studies. However, due to the tremendous computational cost of doing so, various methods have been proposed that consider only limited levels of flexibility [6]. Another important issue is the inclusion of "explicit" water molecules in molecular docking because they sometimes mediate interactions between ligands and receptors. Unfortunately, there is currently no consensus on the way this should be done to improve the results of docking tools [9,10,12].
The second important component of molecular docking is scoring [9,14]. The goal of a scoring function is to assess the "quality" of the conformations produced by the sampling algorithm in order to guide the search towards better binding modes and to rank conformations of different ligands (as in virtual screening). Since numerous conformations are evaluated during sampling, scoring functions have to be computationally efficient. This requirement imposes a major trade-off between efficiency and accuracy when designing a useful scoring function [12,13,15].
A variety of docking tools is now available, relying on various strategies for sampling and scoring, which both affect docking performance and involve different challenges [10,16]. Our work has been focused on addressing the challenges associated with sampling when docking large ligands and even peptides. Our first step was to develop a parallelized incremental meta-docking approach to dock large ligands, called DINC [11,17]. Showing promising capabilities, DINC was applied in studies on STAT3 and STAT6 inhibition [18,19]. After significant software improvements, we demonstrated its ability to dock large peptides binding MHC receptors [20]. However, these improvements were not sufficient to ensure that DINC could accurately dock any large ligand. To address this issue, we have recently released a new version of DINC. One of the most significant changes is that DINC now mostly relies on the popular docking tool Vina [21], instead of AutoDock4 [22], as in its past versions [11,20]. This latest upgrade has been made available through the DINC 2.0 web server [23,24].
The rationale behind the switch from AutoDock4 to Vina is that several benchmark studies have reported that Vina performs generally better than AutoDock4. For example, in [25], it is shown that Vina outperforms AutoDock4 both in terms of sampling power and scoring power. This is especially true for the Lamarckian genetic algorithm method in AutoDock4, which is the one that was used in DINC [11,20]. Similar differences in docking performance are reported in [26], which also illustrates the computational efficiency of Vina over AutoDock4. This confirms what was initially stated by Vina's authors, i.e., improved scoring accuracy and sampling efficiency (through multi-threading and optimization) [21].
In this paper, we report the results of our evaluation of the sampling power of several docking protocols, including DINC. We focus on the problem of docking large ligands (including peptides) to protein receptors. For our evaluation, we use five datasets of protein-ligand complexes reported in related work and involving large ligands that could not be accurately docked using Vina or other docking tools [11,25-28]. Although it is impossible to fully dissociate the effects of sampling and scoring on the output of a docking approach, there are approaches for assessing sampling power somewhat independently of scoring power [14,25-28]. To do so, we check whether a given docking approach is able to produce binding modes that are similar to the crystal structure of a given complex, whether or not these binding modes receive high scores. The docking approaches we evaluate all involve Vina and are based on (i) varying Vina's exhaustiveness (i.e., the parameter defining the amount of sampling performed by Vina), (ii) running several instances of Vina in parallel and grouping their results together, in a protocol we call Multi-Vina, and (iii) using our parallelized incremental meta-docking method, DINC.
Our results clearly show the benefits of using parallelized approaches over simply increasing Vina's exhaustiveness. Furthermore, the good performance of DINC indicates that the incremental paradigm it relies on provides additional benefits over only using parallelism. Overall, our study suggests that, even when docking large ligands (i.e., ligands with more than a dozen DoFs), conformational sampling is rarely critical if enough computing resources are available. Although this might not be satisfactory in the context of virtual screening applications, where computational efficiency is paramount, this is evidence that the conformational sampling challenge can essentially be considered solved in the context of geometry optimization. This also highlights the fact that scoring remains the biggest unmet challenge of molecular docking.

Results
In this section we present the results of our evaluation of several docking methodologies that involve the classical molecular docking tool Vina, including our parallelized incremental meta-docking tool, DINC. We perform redocking experiments, which consist of trying to reproduce the crystal structures of challenging protein-ligand complexes from five different datasets (see "Methods" section). The most extensive part of our benchmarking involves only the first four datasets (Dhanik, Renard, LEADS and Hou) as the fifth one (PPDbench) was published after we had performed our study. The PPDbench dataset is involved only in a smaller experiment reported at the end of this section. Note that, in our redocking tasks, we only explore the flexibility of ligands, and keep proteins rigid at all times.
To assess the quality of the results produced by a docking protocol for a specific complex, we evaluate the Root Mean Square Deviation (RMSD) between the predicted binding modes and the initial crystal structure of the complex, considering all the heavy atoms of the ligand, i.e., the so-called all-atom RMSD. The results we report for each complex correspond to the RMSD between its crystal structure and the so-called top-RMSD conformation, i.e., the conformation produced by the docking tool which is the closest to the crystal structure (see "Methods" section). This allows assessing the sampling power of a docking tool irrespective of its scoring power. Note that we consider a crystal structure to be successfully reproduced if this all-atom RMSD is less than 2 Å, which is a common threshold in the docking community.
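The evaluation criterion above can be sketched as follows. This is a minimal illustration (not DINC's actual code), assuming ligand coordinates are available as N×3 arrays with matching atom order; no superposition is applied, since redocked poses already share the receptor's reference frame:

```python
import numpy as np

def all_atom_rmsd(pose, crystal):
    """All-heavy-atom RMSD (in Angstroms) between a predicted binding
    mode and the crystal structure. Atoms must be in matching order;
    no alignment is applied, since redocked poses already share the
    receptor's frame."""
    pose = np.asarray(pose, dtype=float)
    crystal = np.asarray(crystal, dtype=float)
    return float(np.sqrt(np.mean(np.sum((pose - crystal) ** 2, axis=1))))

def top_rmsd(poses, crystal):
    """RMSD of the 'top-RMSD' conformation, i.e. the pose closest to
    the crystal structure among all poses produced by a docking run."""
    return min(all_atom_rmsd(p, crystal) for p in poses)

def is_reproduced(poses, crystal, threshold=2.0):
    """Success criterion: at least one pose within 2 Angstroms."""
    return top_rmsd(poses, crystal) < threshold
```

Because the minimum is taken over all produced poses, this measures sampling power irrespective of how the scoring function would have ranked them.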

Vina
As explained in the "Methods" section, the protein-ligand complexes we selected for this study involve large ligands and cannot be reproduced using Vina with its default parameters. The main parameter we will refer to in this section is Vina's exhaustiveness, which defines the amount of sampling that is performed by Vina before it returns its results, and whose default value is 8. Note that it is a unitless parameter whose value should be a positive integer, and it has no maximum value. The results we have obtained when using Vina to try and reproduce the complexes from our first four datasets are listed in Tables 1-4, under the column titled "Vina". These results are averages (and standard deviations) calculated from five runs. For all complexes, the all-atom RMSD between the initial crystal structure and the top-RMSD conformation produced by Vina is greater than 2 Å, with an average of 5.17 Å (and a standard deviation of 1.12 Å) across the four datasets combined.
The strategy recommended by Vina's creators to improve its performance is to increase its exhaustiveness. Therefore, we have tried to reproduce all the complexes from our first four datasets after increasing Vina's exhaustiveness to 100. Results (averaged over five runs) are listed in Tables 1-4, under the column titled "Vina 100 ". They show limited overall improvement, with an average of 4.52 Å (and a standard deviation of 0.76 Å) across the four datasets. For some complexes, such as 4FIV in the Dhanik dataset, 2D5W in the Renard dataset, 3MMG in the LEADS dataset and 2ER6 in the Hou dataset, increasing the exhaustiveness yields a significant improvement and a successful reproduction. On the other hand, for other complexes, such as 1JQ9 in the Dhanik dataset, 1TJ9 in the Renard dataset, 3OBQ in the LEADS dataset and 1G7V in the Hou dataset, we can see a deterioration of the results. The fact that the standard deviation associated with these complexes decreased after increasing Vina's exhaustiveness suggests that this deterioration is not due to the inherent randomness of the sampling process, but to the fact that Vina more consistently produces "bad" results.
A critical effect of increasing exhaustiveness is a rise in computing time: Vina's runtime increases roughly linearly with respect to its exhaustiveness. Therefore, we do not report explicit running times (which depend on the computing platform) and only discuss differences in computing times through differences in exhaustiveness. In our study, increasing exhaustiveness from 8 to 100 resulted in a 12-fold increase in Vina's running times. A legitimate question is thus whether the limited improvement in results quality achieved by increasing exhaustiveness is worth such an increase in computing time. If one was not deterred by the prohibitive running times, one could increase exhaustiveness beyond 100 and hopefully get better results. However, our experience and other studies have shown that, when docking large ligands, increasing Vina's exhaustiveness only results in minor improvements that quickly plateau [26,27].

Multi-Vina
We aimed to make better use of computing resources than is achieved by increasing Vina's exhaustiveness. To that end, we assessed a docking protocol we call "Multi-Vina", based on running several independent instances of Vina (performing different non-deterministic conformational searches) in parallel and grouping their results together. This method generates a larger pool of binding modes from which we extract the top-RMSD conformation.
The first protocol we evaluated involves running 12 instances of Vina, with its exhaustiveness set to 8; we call it 12×Vina. Using 12×Vina requires the same amount of computing resources as Vina 100 (in terms of CPU time), but only as much time as a single run of Vina (in terms of wall clock time). The results (averaged over five replicates) we obtained when trying to redock the complexes from our first four datasets with this protocol are listed in Tables 1-4, under the column titled "12×Vina". The overall average and standard deviation across the four datasets are 3.28 Å and 0.58 Å, respectively. As expected, 12×Vina performs better than Vina, and interestingly it also performs significantly better than Vina 100 (see Fig. 1) despite using a similar amount of computing resources. The only complexes for which Vina 100 produced a better result than 12×Vina are 1N12 and 1H6W in the LEADS dataset, as well as 3FVH in the Hou dataset.
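The 12×Vina protocol can be sketched as follows; this is a minimal illustration (not DINC's code), assuming a `vina` binary on the PATH and illustrative PDBQT file names. The `runner` argument is injectable so the sketch can be dry-run without Vina installed:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def vina_command(receptor, ligand, out, seed, exhaustiveness=8):
    # Each instance gets its own RNG seed and output file so that the
    # parallel conformational searches are independent.
    return ["vina", "--receptor", receptor, "--ligand", ligand,
            "--out", out, "--seed", str(seed),
            "--exhaustiveness", str(exhaustiveness)]

def multi_vina(receptor, ligand, n_instances=12, runner=subprocess.run):
    """Launch n_instances short Vina runs in parallel and return their
    output files; the binding modes they contain are then pooled, and
    the top-RMSD conformation is extracted from the combined set."""
    outs = [f"modes_{i}.pdbqt" for i in range(n_instances)]
    with ThreadPoolExecutor(max_workers=n_instances) as pool:
        futures = [pool.submit(runner,
                               vina_command(receptor, ligand, o, seed=i))
                   for i, o in enumerate(outs)]
        for f in futures:
            f.result()  # re-raise any failure from a worker
    return outs
```

Because each instance runs at the default exhaustiveness of 8, the wall-clock time matches a single Vina run while the total CPU time is comparable to Vina 100.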
To assess the impact of increasing the amount of computing resources allocated to Multi-Vina, we also tried to redock the complexes from our first four datasets using a 24×Vina protocol. Results (averaged over five replicates) are listed in Tables 1-4, under the column titled "24×Vina". The overall average and standard deviation across the four datasets are 3 Å and 0.49 Å, respectively. This is only a small improvement when compared to results obtained with the 12×Vina protocol (see Fig. 1), especially considering that computing resources have been doubled.
Finally, to evaluate the full potential of the Multi-Vina method, we performed redocking experiments with a 288×Vina protocol (where 288 = 12 × 24) on our first four datasets. However, because of the huge amount of computing resources required and the very low expected standard deviation, we performed only one replicate for this experiment. Results are listed in Tables 1-4, under the column titled "288×Vina". The average across the four datasets is 1.83 Å, and only 30% of complexes could not be reproduced.

DINC
Fig. 1 Average RMSD achieved by the docking protocols. For each docking protocol, we report the all-atom RMSD averaged over all complexes from the first four datasets. For Vina, Vina 100 , 12×Vina, 24×Vina and DINC, we also report the corresponding standard deviation

In its current implementation, DINC can be seen as an incremental Multi-Vina approach (see "Methods" section). Therefore, we wanted to examine whether this additional incremental paradigm would give DINC an advantage over the regular Multi-Vina approach. To that end, we ran five replicates of a redocking experiment involving all the complexes from our first four datasets. Results (averaged over the five replicates) are listed in Tables 1-4, under the column titled "DINC". The overall average and standard deviation across the four datasets are 2.34 Å and 0.32 Å, respectively. When comparing these results to those obtained with the 24×Vina protocol, one can conclude that DINC seems to perform better than the Multi-Vina approach. Indeed, the improvement observed between 24×Vina and DINC is much larger than the improvement observed between the 12×Vina and 24×Vina protocols (see Fig. 1). This comparison is meaningful because, in terms of computing resources, DINC lies between a 24×Vina and a 36×Vina protocol.
To assess the full potential of DINC, we collected all the results from all the redocking experiments we performed when comparing the various DINC protocols (see "Methods" section). More specifically, for each complex of our first four datasets, we looked for the minimum RMSD among all the top-RMSD conformations generated by the replicates of the 15 DINC protocols we had evaluated (see "Methods" section). This amounts to running a Multi-DINC, similar to the Multi-Vina, except that different instances of DINC may use different parameters. Note that all DINC protocols run 12 Vina instances, except the final one, which runs 24 Vina 4 instances (i.e., Vina with its exhaustiveness set to 4) as explained in the "Methods" section. The results we obtained by combining all the DINC protocols together are listed in Tables 1-4, under the column titled "DINC best ". The average across the four datasets is 1.55 Å, and only 17% of complexes could not be reproduced. Note that the amount of computing resources involved in obtaining the results reported for DINC best is certainly greater than that used by the 288×Vina protocol, although a direct comparison is not really possible. Despite this fact, we can conclude that DINC shows a greater potential to reproduce challenging complexes than the Multi-Vina approach.
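Conceptually, the DINC best aggregation amounts to taking, for each complex, the minimum top-RMSD over every replicate of every protocol. A hypothetical helper (the data layout is assumed purely for illustration):

```python
def dinc_best(per_complex, threshold=2.0):
    """per_complex maps a PDB code to {protocol_name: [top-RMSD per
    replicate]}. Returns the best (minimum) RMSD per complex and the
    fraction of complexes that remain above the success threshold."""
    best = {pdb: min(min(reps) for reps in protocols.values())
            for pdb, protocols in per_complex.items()}
    not_reproduced = sum(r >= threshold for r in best.values()) / len(best)
    return best, not_reproduced
```

This mirrors the Multi-Vina pooling idea, except that the pooled runs come from DINC protocols with different parameters.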

PPDbench dataset
As the PPDbench dataset was published after we had performed our evaluation study [28], we used it only in a smaller experiment to compare the sampling capabilities of Vina's and DINC's default protocols (see "Methods" section). We performed five replicates of a redocking experiment in which we tried to reproduce the crystal structures of the 89 complexes of this dataset. The results we obtained are presented in Table 5. Vina could not reproduce any of these complexes. The average all-atom RMSD (across the whole dataset) between the initial crystal structure and the top-RMSD conformation produced by Vina is 7.7 Å, with a standard deviation of 1.01 Å. Although DINC was able to successfully reproduce only 7 complexes, the average all-atom RMSD it achieved across the whole dataset is 4.17 Å, with a standard deviation of 0.45 Å. This represents a significant improvement in comparison to Vina. Results obtained on this dataset, which contains very large peptides (with up to 67 DoFs), illustrate that, even when using DINC, more sampling is required to reproduce such challenging complexes.

To illustrate the differences between a successful and an unsuccessful reproduction of a specific complex, we report the best results obtained with Vina and DINC when trying to reproduce the protein-peptide complex with PDB code 2O9V (see Fig. 2). The peptide involved in this complex is composed of 69 heavy atoms and features 15 DoFs. The all-atom RMSDs between the crystal structure and the top-RMSD conformations produced by Vina and DINC are 5.61 Å and 1.1 Å, respectively. As the threshold for success is 2 Å, the binding mode obtained with DINC constitutes a successful reproduction of the crystal structure. Figure 2 shows that only the ends of the peptide's conformation are not very well aligned with the crystal structure. On the other hand, the binding mode produced by Vina corresponds to a totally different conformation.

Vina's exhaustiveness
The default value of Vina's exhaustiveness is 8. When we increased its value to 100, we obtained better docking results for many protein-ligand complexes, but not all of them. One can mostly interpret the effect of raising exhaustiveness as increasing the amount of sampling performed by Vina during a given run. Therefore, the fact that increasing Vina's exhaustiveness can sometimes lead to a deterioration of the docking results might be counter-intuitive. However, another effect of increased exhaustiveness is an enhanced impact of the scoring function on the final output, as Vina can spend more time improving the fit of the population of binding modes it internally maintains. In other words, raising Vina's exhaustiveness increases the bias of its scoring function on the sampling procedure. If Vina's scoring function was perfectly accurate, raising exhaustiveness would systematically yield better docking results. Unfortunately, as it is not perfect, Vina's scoring function sometimes drives the sampling process away from near-native binding modes of a complex. In other cases the scoring function simply favors an alternative binding mode that is as valid as the one captured by the crystal structure.
Note that two docking studies focused on protein-peptide complexes have shown that, despite leading to huge increases in running times, raising exhaustiveness to large values does not produce drastic improvements [26,27]. Therefore, increasing Vina's exhaustiveness is clearly not the most effective use of computing resources when trying to improve docking results. In addition, our results suggest that, in the context of a meta-docking approach involving several instances of Vina, such as Multi-Vina or DINC, reducing exhaustiveness is beneficial. The first benefit is that it reduces running times. The second benefit is that it ensures more diversity in the sets of binding modes produced by all Vina instances, as the bias from the scoring function is reduced. Vina's exhaustiveness controls the number of independent runs that it performs internally. However, other parameters might have an impact on the achieved amount of sampling, such as the number of steps performed by each run, the length of the local optimization within each step, or the optimization method itself (knowing that Vina uses the Broyden-Fletcher-Goldfarb-Shanno algorithm). Unfortunately, the only parameter accessible to Vina's users is the exhaustiveness.
The default protocol we have adopted in DINC uses Vina 4 , i.e., it runs Vina with its exhaustiveness set to 4. This allows reducing Vina's running time by half, as compared to its regular version. In addition, after performing redocking experiments with DINC using either Vina 4 or the regular Vina, we observed no significant change in results quality in terms of all-atom RMSD to the initial crystal structures. For very few complexes, using Vina 4 improved docking results, and for even fewer complexes, using Vina 4 deteriorated docking results. On the other hand, using Vina 2 clearly deteriorated docking results on the four datasets. Therefore, using Vina 4 seems to be a good compromise between achieving computational efficiency and obtaining docking results of good quality.

Parallelized meta-docking
As discussed in the "Results" section, running several Vina instances in parallel and grouping their results together is a more efficient way to improve docking results than simply increasing exhaustiveness in a single instance of Vina. First, this kind of parallelized meta-docking approach is computationally efficient: even if it uses the same amount of computing resources as a long Vina run, it is much faster in terms of wall clock time. Second, with a given computing budget, the multi-threaded approach provides better docking results than the single-threaded approach.
This result is most likely not specific to Vina: a parallelized meta-docking approach using another docking tool would probably provide similar benefits. This was demonstrated, to some extent, by the original implementation of DINC [11,17], which involved AutoDock4. Therefore, to rephrase the above statement in more general terms, it is more effective to combine the results from several short docking runs than to extend a single docking run. This concept should be familiar to readers of the computational biophysics literature: better results have been obtained by combining several short molecular dynamics simulations than by running a single very long simulation [29]. This paradigm has been applied in numerous fields of computer science, such as genetic programming [30].

Parallelized incremental meta-docking
Despite achieving rather good results, the Multi-Vina approach failed to reproduce some complexes, even when using a very large number of threads. On the other hand, DINC was able to reproduce some of these complexes. In addition, when using a similar computing budget, DINC provides better docking results than the Multi-Vina approach. Therefore, the addition of the incremental paradigm to a simple parallelized meta-docking approach can be considered beneficial. Our understanding is that this is especially true when the binding site is not at the protein's surface but deeper in the protein's core. Complexes exhibiting this characteristic are often the result of significant conformational changes undergone by the protein receptor during the binding process in vivo [6]. Therefore, reproducing their crystal structure might be impossible if one keeps the protein receptor rigid while attempting to dock the ligand in the binding site. More specifically, because the binding site is so constrained, it becomes difficult to computationally sample conformations of the whole ligand within it. On the other hand, docking a smaller fragment of this ligand and growing it in the binding site can be easier.
After evaluating several docking protocols in DINC, we had to conclude that none of them systematically performed best. Compared to the others, each protocol improves the docking results for some complexes and deteriorates them for other complexes. Therefore, we chose as a default protocol for our latest version of DINC the one providing a good trade-off between docking accuracy and computational efficiency. This protocol involves three rounds of incremental docking using Vina 4 ; it was evaluated with 24 threads. To obtain better docking results, one can simply increase the number of threads. In addition, if enough computing resources are available, one can easily implement a Multi-DINC method in which the various DINC instances would use different protocols. The beauty of such a meta-docking strategy is that it can be implemented with as many levels as computing resources permit. As illustrated by our results, in this way, most protein-ligand complexes, even challenging ones involving large ligands, can be reproduced.

Additional conformational sampling
The docking protocols we have presented can be combined with other techniques providing additional conformational sampling of a protein-ligand complex. For example, one could envision using molecular dynamics (MD) simulations to explore the conformational space around binding modes produced by a docking protocol. As simulating binding with MD is very expensive, MD has mostly been used for post-docking relaxation [31-34] or ensemble docking [35-37]. To broaden the sampling of MD simulations, several strategies have been proposed, such as enhanced sampling (e.g., umbrella sampling, metadynamics, replica exchange) [38] or accelerated MD [39,40].

Conclusions

To separate the sampling challenge from the scoring challenge as much as possible, our assessment of docking results was based on evaluating the all-atom RMSD between the original crystal structure of a complex and the top-RMSD conformation generated by a docking approach for this complex. The rationale was to assess whether a docking approach could produce binding modes that were close enough to a native conformation, irrespective of whether the scoring function could select these binding modes as being the most favorable ones.
Our results show that increasing Vina's exhaustiveness yields limited improvement, with few complexes being reproduced using this approach. Therefore, when dealing with large ligands, the increase in computing costs incurred by raising Vina's exhaustiveness is not worth the small improvement in docking accuracy. Running several short Vina instances and grouping their results together in a Multi-Vina approach seems to be a better use of additional computing resources. Indeed, this approach yields a significant improvement in docking accuracy, with numerous complexes being successfully reproduced. However, even when using a very large number of threads, about a third of the complexes remain too challenging for the Multi-Vina approach. On the other hand, our incremental meta-docking approach, DINC, can successfully reproduce the vast majority of complexes studied here, albeit only when using a huge amount of computing resources for some of these complexes. In general, even when using a more reasonable amount of computing resources, DINC performs significantly better than the Multi-Vina approach, given a similar computing budget.
In conclusion, this study clearly demonstrates the benefits of using parallelized docking approaches, as well as incremental docking approaches such as our meta-docking tool DINC, to solve the sampling challenge associated with the docking of large ligands, including peptides. More generally, our results illustrate that conformational sampling is no longer a real challenge, contrary to the impression given by previous similar studies [25-27]. The real challenge of molecular docking resides in improving scoring functions. In fact, methods such as Multi-Vina or DINC incur additional computing costs precisely to counter the bias imposed by the scoring function on the sampling procedure. For now, the solution we suggest is to use a meta-docking approach to generate a large pool of binding modes by grouping the results from several independent docking runs. The benefit is that this pool of binding modes can then be re-scored using a scoring function that is more computationally expensive, but more accurate, than the fast functions typically used by protein-ligand docking tools. We are planning to evaluate such re-scoring techniques in future work, as well as the consensus scoring paradigm [41].

Methods

DINC - docking incrementally
DINC is a parallelized meta-docking method developed for the incremental docking of large ligands to protein receptors. The rationale for the method and its implementation have been described in previous publications [11,17,20]. The newest version of DINC, called DINC 2.0, has been made available online as a web server [23,24]. In short, DINC is based on a divide-and-conquer approach enabling the docking of large ligands containing too many flexible bonds to be efficiently docked by traditional protein-ligand docking tools. The idea behind DINC is to incrementally dock larger and larger overlapping fragments of a ligand instead of docking it all at once. The workflow of the algorithm is illustrated in Fig. 3.
Given a ligand, DINC's algorithm starts by selecting a subset of the ligand's flexible bonds to be explored, executing the sampling and scoring of this first fragment. Then, several docked conformations of this fragment are selected for expansion. During this process, the selected conformations are "grown" so as to include an additional subset of flexible bonds from the original ligand. This is defined by one of DINC's parameters, which determines how many new flexible bonds are added at each round of docking. The expanded fragments are then used as input for a second round of sampling and scoring. This process is incrementally repeated until all the flexible bonds of the original ligand have been explored.
The number of flexible bonds that are explored in each round of docking is a key parameter for the success of the incremental process. Although the number of atoms and bonds composing the fragments increases from one round to the next, the number of bonds that are considered flexible and are effectively sampled is kept constant. In DINC, this parameter is referred to as the fragment size. Instead of defining the fragment size and the number of new flexible bonds (which then automatically determines the number of docking rounds), it is possible to define the number of docking rounds and new flexible bonds in DINC (which then automatically determines the fragment size). Note that different heuristics can be used to decide which bonds will be active at each round, and which previously-explored bonds will be kept rigid. The important point is that, by keeping only a subset of bonds active at each round, DINC enables the efficient sampling and scoring of the growing fragments.
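As an illustration, the round schedule described above can be sketched as follows, using a rolling-window heuristic to decide which previously-explored bonds are frozen (one of several possible heuristics, not necessarily the one DINC uses; function and parameter names are hypothetical):

```python
def dinc_rounds(n_bonds, fragment_size, new_per_round):
    """Simplified sketch of DINC's incremental schedule. Round 1
    activates the first `fragment_size` bonds; each later round adds
    `new_per_round` fresh bonds while keeping the number of flexible
    bonds constant at `fragment_size` (the oldest bonds are frozen).
    Returns, per round, the indices of the bonds sampled flexibly."""
    rounds = [list(range(min(fragment_size, n_bonds)))]
    explored = len(rounds[0])
    while explored < n_bonds:
        new = list(range(explored, min(explored + new_per_round, n_bonds)))
        explored += len(new)
        # keep only the most recent `fragment_size` bonds flexible
        window = rounds[-1] + new
        rounds.append(window[-fragment_size:])
    return rounds
```

For a 10-bond ligand with a fragment size of 6 and 2 new bonds per round, this yields three rounds, each sampling exactly 6 flexible bonds, which is what keeps the per-round search tractable.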
Another key aspect of the DINC approach is parallelism. At each round of the incremental process, multiple attempts at docking a given fragment are performed independently in parallel. Then, all generated conformations are grouped together, and a subset of this conformation pool is selected for expansion and for the next docking round. In DINC, the parameter driving this behavior is referred to as the number of docking tasks, or simply the number of threads. Through parallelism, the amount of sampling at every round is greatly increased without affecting the overall running time. DINC can run on a desktop computer using multiple threads, and is also well-suited for high-performance computing systems.
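The pooling-and-selection step at each round can be sketched as follows. `mock_dock` is a stand-in for a call to a real docking engine, and the scores are made up; DINC launches separate docking processes, whereas threads are used here only to keep the example self-contained and runnable.

```python
# Illustrative sketch of DINC's per-round parallelism: several independent
# docking tasks run concurrently, all generated conformations are grouped
# into one pool, and the best-scored subset is selected for expansion.

from concurrent.futures import ThreadPoolExecutor
import random

def mock_dock(seed: int):
    """Pretend docking task: returns (score, conformation_id) tuples."""
    rng = random.Random(seed)
    return [(rng.uniform(-10.0, -2.0), f"task{seed}_mode{i}") for i in range(5)]

def run_round(num_tasks: int, num_selected: int):
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        results = list(pool.map(mock_dock, range(num_tasks)))
    # Group all conformations from all tasks into one pool ...
    pooled = [mode for task in results for mode in task]
    pooled.sort(key=lambda m: m[0])  # more negative score = better
    # ... and keep only the best ones for the next docking round.
    return pooled[:num_selected]

selected = run_round(num_tasks=12, num_selected=10)
print(f"selected {len(selected)} of {12 * 5} conformations")
```

Because the tasks are independent, adding more of them increases the amount of sampling per round without lengthening the round itself, which is the point made above.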
DINC is also a meta-docking approach, in the sense that it relies on regular docking tools to perform the sampling and scoring tasks at each docking round. DINC itself only manages the parallelism, the generation of the fragments and the selection of active flexible bonds. In its original version, DINC relied solely on AutoDock4 [22]. The newest version of DINC, which we evaluate in this study, involves the popular docking tool Vina [21]. As two docking tools are available in DINC, it is now possible to use separate tools for sampling and scoring tasks. More generally, DINC is a highly customizable tool in which all parameters can be tuned.

Docking protocols
The first docking protocol we evaluated involves only Vina and was aimed at studying the effects of varying Vina's exhaustiveness. This parameter defines how long Vina will run by setting the number of independent runs that are performed internally, starting from random conformations of the ligand; its default value is 8. We increased it to 100, in a docking protocol that we refer to as Vina100, to evaluate whether this would improve Vina's performance. While varying Vina's exhaustiveness, we kept its other parameters constant, but we did not use their default values. For the num_modes parameter, which defines the maximum number of binding modes that Vina can produce, we used a value of 25 instead of 9. For the energy_range parameter, which defines the maximum energy difference (in kcal/mol) allowed between the best and worst binding modes produced by Vina, we used a value of 10 instead of 3. Increasing the energy range and the maximum number of binding modes returned by Vina enabled us to rely less on its scoring function when analyzing its output.
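This configuration can be expressed as a Vina command line. The flag names (`--exhaustiveness`, `--num_modes`, `--energy_range`) are genuine Vina options, but the file names are placeholders, and this is only a sketch of how such a protocol could be launched programmatically, not the study's actual scripts.

```python
# Hypothetical helper building the command line for the protocol described
# above (exhaustiveness raised to 100, num_modes 25, energy_range 10).

def vina_command(receptor, ligand, out, exhaustiveness=100,
                 num_modes=25, energy_range=10):
    """Return an argv list for one Vina run; paths are placeholders."""
    return [
        "vina",
        "--receptor", receptor,
        "--ligand", ligand,
        "--out", out,
        "--exhaustiveness", str(exhaustiveness),
        "--num_modes", str(num_modes),
        "--energy_range", str(energy_range),
    ]

cmd = vina_command("protein.pdbqt", "ligand.pdbqt", "out.pdbqt")
# subprocess.run(cmd, check=True)  # would invoke Vina if it were installed
print(" ".join(cmd))
```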
We call the second docking protocol we evaluated Multi-Vina. It consists of running several independent instances of Vina in parallel and grouping their results together. This yields a larger set of binding modes that can be analyzed to evaluate the success of docking. In this context, we set Vina's exhaustiveness to 8; we kept its num_modes and energy_range parameters at 25 and 10, respectively. We varied the number of Vina instances from 12, to 24, to 288 (= 12 × 24), in docking protocols that we refer to as 12×Vina, 24×Vina and 288×Vina, respectively.
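The fan-out part of such a protocol amounts to generating N independent Vina invocations, each with its own random seed and output file, whose binding modes are then pooled. The sketch below illustrates this under stated assumptions: `--seed` is a genuine Vina option, but the paths, seed values, and helper name are placeholders, and in the study the instances would be dispatched concurrently on a cluster.

```python
# Illustrative sketch of the Multi-Vina fan-out: one command per instance,
# each with a distinct seed and output file. Paths and seeds are made up.

def multi_vina_commands(receptor, ligand, n_instances=12):
    cmds = []
    for i in range(n_instances):
        cmds.append([
            "vina",
            "--receptor", receptor,
            "--ligand", ligand,
            "--out", f"out_{i:03d}.pdbqt",   # one result file per instance
            "--seed", str(1000 + i),          # distinct seed per instance
            "--exhaustiveness", "8",          # default, as in the text
            "--num_modes", "25",
            "--energy_range", "10",
        ])
    return cmds

cmds = multi_vina_commands("protein.pdbqt", "ligand.pdbqt", n_instances=24)
# Each command could be launched with subprocess.Popen, and the resulting
# out_*.pdbqt files concatenated into a single pool of binding modes.
print(len(cmds), "instances prepared")
```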
We then evaluated our parallelized incremental meta-docking tool, DINC. As DINC involves several parameters defining its incremental process, we wanted to assess which set of parameter values (i.e., which DINC protocol) would produce the best results. First, we varied the fragment size using the values 6, 12, 18, 24 and 30, while keeping all other parameters fixed: 12 threads and 3 new bonds at each docking round. Unfortunately, no fragment size value seemed to systematically produce the best results. Then, instead of using a fixed fragment size (which would result in a varying number of docking rounds based on the ligand's number of flexible bonds), we decided to evaluate DINC protocols in which the number of docking rounds was kept fixed, letting the fragment size be determined automatically.

Evaluation methodology
To evaluate all the docking protocols we perform redocking experiments, in which we try to reproduce the crystal structure of protein-ligand complexes obtained from the Protein Data Bank (PDB) [42]. This requires processing the PDB files, following a standard procedure in the docking field: 1) removing water molecules; 2) if several instances of the complex are present in the PDB file, conserving only the first instance; 3) for each complex, separating the ligand from the protein and saving both molecules in separate PDB files that can be given as input to a docking tool. The preparation of the ligand and protein receptor is done using the scripts prepare_ligand4 and prepare_receptor4 from the AutoDockTools4 suite [22]. The operations performed by these scripts are: removing lone atoms, adding polar hydrogens, removing non-polar hydrogens, and adding Gasteiger charges. At the beginning of the docking process performed by DINC, similar to many other docking tools, the conformation of the given ligand is randomized. The objective of the redocking experiment is to assess whether the docking tool can produce a binding mode for the ligand in the protein's binding site that is similar to the initial crystal structure.
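The preprocessing steps listed above can be sketched directly on raw PDB text. The function below is a simplified, hypothetical illustration: it drops water molecules (HOH residues), keeps only the first model when several are present, and separates the ligand's atoms from the protein's. Real preparation also runs the AutoDockTools prepare_ligand4/prepare_receptor4 scripts afterwards, which this sketch does not cover.

```python
# Simplified sketch of the PDB preprocessing described above. Residue and
# record handling is minimal; `ligand_resname` identifies the ligand's
# residue name (columns 18-20 of ATOM/HETATM records).

def split_complex(pdb_text: str, ligand_resname: str):
    """Return (protein_lines, ligand_lines) from raw PDB text."""
    protein, ligand = [], []
    in_first_model = True
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            in_first_model = False        # discard any subsequent models
        if not in_first_model:
            continue
        if line.startswith(("ATOM", "HETATM")):
            resname = line[17:20].strip()
            if resname == "HOH":
                continue                   # remove water molecules
            (ligand if resname == ligand_resname else protein).append(line)
    return "\n".join(protein), "\n".join(ligand)

demo = "HETATM    1  O   HOH A   1\nHETATM    2  C1  LIG A   2"
protein_part, ligand_part = split_complex(demo, "LIG")
print(repr(ligand_part))  # only the LIG atom survives; the water is dropped
```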
The similarity between two binding modes of a protein-ligand complex (whether a computationally-generated binding mode or a crystal structure) is usually assessed by calculating the Root Mean Square Deviation (RMSD) between them. Several ways of calculating RMSD values have been reported in the molecular docking literature. Here, we use the strictest definition of the RMSD, calculating its values using all the heavy atoms of the ligand. This means measuring changes in the conformation, as well as in the position and orientation, of the whole ligand within the protein's binding site. As the protein is kept rigid and in a fixed position during the docking process, no alignment is required between two binding modes to calculate the RMSD between them. Note that a more lenient definition of the RMSD that is often used in related work consists of using only the backbone atoms of the ligand; in this case, changes in side-chain conformations are mostly ignored.
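Since no superposition step is needed, the heavy-atom RMSD reduces to a direct calculation over matched atom coordinates. The sketch below shows this calculation; the coordinate values are illustrative 3D positions in angstroms, not data from the study.

```python
# Heavy-atom RMSD between two matched coordinate lists, with no alignment
# (receptor and docking box are fixed, as described above).

import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two matched coordinate lists."""
    assert len(coords_a) == len(coords_b)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))

crystal = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
docked  = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0)]
print(f"RMSD = {rmsd(crystal, docked):.2f} A")  # 1.00: every atom shifted 1 A
```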
Similar to other docking tools, DINC produces several binding modes for a given protein-ligand complex as a result of the docking process. Therefore, there exist several ways to assess whether a docking tool was successful at reproducing a given crystal structure. If users decide to fully rely on their tool's scoring function, they might calculate the RMSD between the initial crystal structure and the binding mode that receives the best score, i.e., the so-called top-scoring conformation. A common threshold used to determine whether the crystal structure was successfully reproduced is 2 Å. Instead of considering only the top-scoring conformation, users can also check whether any binding mode among a small number of those with the best scores has an RMSD of less than 2 Å from the crystal structure. In this study, as we want to evaluate the sampling power of the docking protocols, we calculate the RMSD between the crystal structure and all the produced binding modes, irrespective of their score, and we determine which one is the closest to the crystal structure, i.e., the so-called top-RMSD conformation. We consider that a given crystal structure has been successfully reproduced if the top-RMSD conformation is less than 2 Å away from it. This is common practice in the molecular docking field, as it allows evaluating the sampling power of a docking tool independently (to some extent) of its scoring power.
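The top-RMSD success criterion can be stated in a few lines of code. This is a sketch of the criterion described above, not the study's evaluation scripts; the RMSD values in the example are made up.

```python
# Top-RMSD success criterion: among all produced binding modes, take the one
# closest to the crystal structure and compare it to the 2 A threshold,
# ignoring the scoring function entirely.

def top_rmsd_success(rmsds, threshold=2.0):
    """rmsds: RMSD (in A) of every produced binding mode vs. the crystal."""
    best = min(rmsds)
    return best, best < threshold

best, ok = top_rmsd_success([5.3, 1.7, 8.2, 3.1])
print(best, ok)  # 1.7 True: redocking counts as successful
```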

Datasets of protein-ligand complexes
For this study, we define five datasets containing protein-ligand complexes involving large ligands, some being peptides. These datasets comprise complexes involved in previous evaluations of molecular docking tools performed by several research groups. We restrict our datasets to complexes that some of these tools cannot reproduce. Our goal is to perform an indirect comparison between these tools and our own docking tool, DINC, by showing that it can reproduce most of these complexes.

Dhanik dataset.
Our first dataset is a small subset of the one on which the original version of DINC was evaluated [11,17]. The complete dataset contains 73 protein-ligand complexes extracted from an old version of the PDBbind database [43]. Results in [11,17] show that DINC and AutoDock4 could reproduce the crystal structures of only 31 of these complexes, when considering top-RMSD conformations. The remaining 42 complexes could thus have been considered challenging. However, after running redocking experiments with the most recent version of Vina (using its default parameters), we realized that some of these complexes were actually not very challenging. Therefore, we removed from the dataset all the complexes that Vina could reproduce, when considering top-RMSD conformations. We also discarded the complexes featuring more than one ligand in the protein's binding site.
Eventually, we obtained a dataset containing 19 complexes, with ligands having between 7 and 30 rotatable bonds (see Table 1).

Renard dataset.
Our second dataset is a subset of a meta-dataset compiled from previous studies with the objective of assessing the ability of Vina to dock small peptides [27]. The original meta-dataset contains 47 complexes involving peptides with up to five amino acid residues. Results in [27] show that Vina could produce top-RMSD conformations similar to the crystal structure for only about half of these complexes. However, since these results were based on RMSD values calculated only for the backbone atoms of the peptides, we performed our own redocking experiments to evaluate which complexes were really challenging. After discarding the complexes that already belonged to our first dataset, we were left with a set of 26 complexes (involving peptides having between 10 and 22 rotatable bonds) that Vina was not able to reproduce, even when considering top-RMSD conformations (see Table 2).

LEADS dataset.
Our third dataset is a subset of LEADS-PEP, which currently contains 53 protein-peptide complexes involving peptides composed of 3 to 12 residues [26]. LEADS-PEP was created as an unbiased benchmark dataset for researchers wanting to assess the efficacy of molecular docking tools on peptides. It was used to evaluate four protein-ligand docking tools: GOLD, Surflex-Dock, AutoDock4 and Vina [26]. Results show that, in spite of not being specifically aimed at peptides, these four tools were able to perform quite well with small peptides composed of 3 or 4 residues. However, they all showed poor performance on larger peptides, even when considering top-RMSD conformations, and despite the fact that reported RMSD values were calculated using only backbone atoms. Therefore, we removed from the dataset complexes containing small peptides, as well as complexes already present in previous datasets. The resulting dataset contains 33 complexes involving peptides having between 11 and 52 rotatable bonds (see Table 3).

Hou dataset.
Our fourth dataset is a small subset of an extensive dataset used to evaluate ten molecular docking tools, including AutoDock4 and Vina [25]. The complete dataset contains 2002 protein-ligand complexes extracted from a recent version of the PDBbind database [44]. Results in [25] clearly show that all docking tools struggle with neutral ligands and large ligands, such as peptides.
Interestingly, a set of 72 complexes could not be reproduced by any of the tested docking tools, even when considering top-RMSD conformations; these are, therefore, ideal candidates for our evaluation of challenging protein-ligand complexes. After discarding from this small set the complexes involving more than one ligand in the binding site and the complexes involving ligands with fewer than 7 rotatable bonds, we were left with a dataset containing 28 complexes involving ligands with 7 to 31 rotatable bonds (see Table 4).
PPDbench dataset.
Our fifth dataset was published after we carried out our benchmarking. Therefore, we used it only for a smaller experiment comparing results obtained with Vina and DINC using their default parameters. The original PPDbench dataset was involved in a study comparing the performance of six molecular docking tools: ZDOCK, FRODOCK, Hex, PatchDock, ATTRACT and pepATTRACT [28]. It was created by combining two smaller datasets published in previous studies, and contains 133 protein-peptide complexes composed of 9 to 15 amino acids. After discarding from the original dataset the complexes involving more than one ligand in the binding site, we obtained a dataset containing 89 complexes involving peptides with 13 to 67 rotatable bonds (see Table 5).

Evaluation platform
We ran all our docking jobs on the Comet cluster, from the San Diego Supercomputer Center, through an Extreme Science and Engineering Discovery Environment (XSEDE) allocation [45]. Comet features Intel next-gen processors with AVX2, Mellanox FDR InfiniBand interconnects and Aeon storage. Its compute nodes consist of Intel Xeon E5-2680v3 processors, 128 GB DDR4 DRAM, and 320 GB of SSD local scratch memory. Each node contains 24 cores, with a clock speed of 2.5 GHz, a flop speed of 960 GFlop/s and a memory bandwidth of 120 GB/s.