Identification of a druggable binding pocket in the spike protein reveals a key site for existing drugs potentially capable of combating Covid-19 infectivity

Following the recent outbreak of the new coronavirus pandemic (Covid-19), the rapid determination of the structure of the homo-trimeric spike glycoprotein has prompted the study reported here. The aims were to identify potential “druggable” binding pockets in the protein and, if located, to virtual screen pharmaceutical agents currently in use for predicted affinity to these pockets which might be useful to restrict, reduce, or inhibit the infectivity of the virion. Our analyses of this structure have revealed a key potentially druggable pocket where it might be viable to bind pharmaceutical agents to inhibit its ability to infect human cells. This pocket is found at the inter-chain interface that exists between two domains prior to the virion binding to human Angiotensin Converting Enzyme 2 (ACE2) protein. One of these domains is the highly mobile receptor binding domain, which must move into position to interact with ACE2, which is an essential feature for viral entry to the host cell. Virtual screening with a library of purchasable drug molecules has identified pharmaceuticals currently in use as prescription and over the counter medications that, in silico, readily bind into this pocket. This study highlights possible drugs already in use as pharmaceuticals that may act as agents to interfere with the movements of the domains within this protein essential for the infectivity processes and hence might slow, or even halt, the infection of host cells by this new coronavirus. As these are existing pharmaceuticals already approved for use in humans, this knowledge could accelerate their roll-out, through repurposing, for affected individuals and help guide the efforts of other researchers in finding effective treatments for the disease.


Background
Coronaviruses are a family of envelope viruses which are hosted primarily by mammals and by birds. Their general structure comprises of a single-stranded positive sense RNA genome which creates four viral proteins, the S (spike), N (nucleocapsid), M (membrane) and E (envelope) proteins. Each has at least one key role: M and E make up the primary protein components of the viral envelope defining its shape and having a major role in virus propagation, respectively. N is involved in stabilising the nucleocapsid binding directly to the RNA viral material. The spike protein (S), is pivotal to viral infection as it is this that binds to receptors on the host cells enabling subsequent fusion between the host and viral membranes such that the interior RNA material can then invade the host cell [1]. The S protein is comprised of three identical polypeptide subunits arranged as a trimer structure. It is this protein that gives the virus its name as the end of the spike has the appearance of a small "crown" in shape. In this end region is a mobile domain which moves from an inaccessible (down) state to an accessible (up) state which makes it available for interaction with Angiotensin Converting Enzyme 2 (ACE2), the transmembrane protein through which coronavirus infection usually proceeds. It is the spike protein therefore, that is critical for infectivity and in this regard it can be considered as the optimal target for vaccine and drug intervention, being the most prominent on the surface of the virion.
In recent years these types of viruses have posed a major threat due to their ability to cross the species barrier, causing infection in the human population from a virus innately from another source, mammal or bird. Such cross-species events have resulted in Severe Acute Respiratory Syndrome (SARS) [2] and Middle Eastern Respiratory Syndrome (MERS) [3] which arose from bats and then jumped to humans from civet and camel, respectively, as intermediaries [4]. Such viruses that arise in this way within the human population are potentially very problematic in that we have no innate antibodies to these virions and so there is always the possibility for severe illness and death to occur as a result. A new crossspecies event has recently arisen in Wuhan City in China, now termed Covid-19. The source is still being debated, although it is likely this virus again originated in bats and then crossed to humans probably again using a mammalian intermediary.
The spike (S) protein from the SARS and MERS coronaviruses have been studied in detail and a number of X-ray and cryo-electron microscopy (Cryo-EM) structures have been produced. Gaining detailed information about these structures offers ways both of understanding how the spike protein is used to infect host cells, and of combating this infectivity. To date, two structural models of this new coronavirus spike protein have been produced, one [5] developed using C-I-Tasser [6], the other [7] produced by Swiss Modeller [8] using as templates homologous structures from the SARS and MERS spike proteins. Both models were used for evaluating the possibility of its binding to ACE2, and Zhang et al. [5] also dispelled suggestions of the presence of novel sequence inclusions in this protein.
Of key importance here, is the recently reported Cryo-EM structure, solved to 3.5 Å resolution, of the spike protein of Covid-19 [9] and the coordinates for this structure are available under the Protein Data Bank (PDB) code 6VSB; it is this structure that prompted the investigations reported here to look for sites where it might be possible to bind drugs. The structure here is interesting as one subunit chain (A) is in the "up" accessible state, while the other two (B and C) are in the "down" inaccessible state. The structure is shown in Fig. 1 where the differences in conformation of the up and down states are also shown. Figure 1b shows chain A in the up state, whilst Fig. 1c shows the same view of Fig. 1 a The structure of the Spike (S) protein from Covid-19 (PDB code 6VSB) showing its three subunit conformation. In red is chain A in the "up" position, and chains B and C (blue and green respectively) are in the "down" position (see text for specific details). b Chain A in the up position and c chain B in the same orientation as A, in the down position. These are coloured according to the structural differences between them where red indicates the most significant of these differences, using 2StrucCompare [10]. Structure coloured grey only exists in chain B Drew and Janes BMC Molecular and Cell Biology (2020) 21:49 Page 2 of 13 chain B but in the down state; the mobile domain is coloured red in both cases, while structure that is in dark grey only exists in the one chain, B in this case. The protein retains its three fold symmetry over all regions of its structure that do not interact with this mobile domain. The aims were to see if pharmaceutical products currently available and approved for human use might be able to bind to the spike protein with the chance that this might disrupt, stall, or even prevent, the infection process. In order to infect, the spike protein has to be exposed and then become accessible within the host so it can contact the receptors on the host cell. If infectivity were able to be slowed down because of disruption of the process, this would leave the spike protein still exposed but in the inaccessible state and this would enable the host to register the virion as being foreign and so antibodies would be raised against it.

Pocket identification
Results from the DoGSiteScorer server [11] identified a total of 106 possible pockets within the Cryo-EM structure, with druggability scores ranging from 0.125 to 0.849 over a scoring range between 0 and 1, where 1 would be a perfect druggable site. Of these 106 possible sites, 79 had a druggability score less than 0.7, and a further 7 did not have a three-fold symmetry where it should be expected, leaving 20 remaining sites. Removing pockets with a small overall volume (less than 500 Å 3 ) as these were considered as unlikely to be successfully druggable [11], left 12 potential candidate pockets. Analysis of these revealed an~800 Å 3 pocket with a high druggability score of 0.79 which was in the 90th percentile of those identified (Table 1 and Fig. 2), and its location, between the mobile domain (in the down position) of one subunit and a second subunit, suggested it could be of potential interest regarding infectivity. Comparison between the residues of Covid-19 lining this pocket and the matched SARS spike protein residues is given in Fig. 3. Figure 4 shows the structural comparisons between chain A and B in the up and down states with selected pocket residues identified to highlight the positional changes.

Docking methods
The results from the three methods, Autodock vina [12], Smina [13] and Ledock [14], were used to generate a list of 4358 poses of commercially-available drugs capable of binding into this Covid-19 spike protein pocket. None of the pharmaceutical agents that were successfully docked into the site displayed any steric hindrance within the site and all of these had favourable empirically calculated binding energies.

Analysis of the compound rankings
From the extensive spreadsheet of data, obtained from the results of the Autodock vina, Smina and Ledock methods, together with related information associated with the drugs within the list (supplied as Supplementary Information), analyses were undertaken to establish the presence of any correlations between, or statistical significance within these results to aid further investigation. Correlations were obtained between various properties, looking for the differences/similarities between them over the whole data set of results. Specifically, the chemical and geometrical properties of the top 150 conformers, as ranked by ComboPC score, were compared against the full dataset to see if any pattern emerged concerning the top performing compounds. To assess the significance of any difference found, a t-test was employed and a ratio of the top 150 set average versus the whole set average was obtained to assess the direction and magnitude of any difference. Table 2 lists the top 100 drug poses ordered by the Combo percentile score. In this table 14 drugs are actively used for treatment and one further drug is highlighted, as it might be available, which is used in the treatment of various pulmonary diseases. These 15 molecules are shown in their poses binding into the pocket in Fig. 5. Six of these are shown in Fig. 6 as selected examples of how these drugs interact with the lining pocket residues. The figures are using the Ledock poses in both cases.

Discussion
Our criteria for pursuing a pocket site were that the druggability score should be high, that in the parts of the protein where a three-fold symmetry should be present, the same pocket should be found in each of the subunits, that the site was of a likely suitable volume [11], and that, if possible, the site had a high interest regarding either structure or function within the S protein.
The pocket identified in this study is of particular interest as it is located at an interface between two chains, B and C, of the protein and is not present between the comparable residues in A and B or A and C. This is because chains B and C have their mobile domains in the down position while chain A has its mobile domain in the up position. The pocket is between two domains, one flanking region in chain C of the protein (between residues 1 and 290) and the other, the underside of the mobile domain in chain B (Fig. 2). The comparison of the pocket residues of Covid-19 and the matched SARS spike protein residues (Fig. 3) shows a very high degree of conservation between them, adding weight to the suggestion they might be important regarding the function of the protein. The domain in chain B is the region between residues around 335 and 526 in the Covid-19 spike protein structure, but in chain A, this same domain region has hinged away losing all contacts between residues 328 and 530 (in beta strands N-and C-terminal to the domain, respectively) as this domain is in the "up" position. In related spike proteins like that from human SARS-CoV (PDB code: 6ACK), when this domain is in the up position it interacts with ACE2 facilitating entry to the cell [16]. The movement of this domain, therefore, appears to be required for ACE2 interaction, as in the SARS-CoV structure an ACE2 is present and bound to the spike protein. Critically, in the Covid-19 Cryo-EM spike protein structure, when in the up position (and when no ACE2 protein is present) then residues 460 to 473 in chain A are unobserved in the structure; they are too flexible to be detectable (grey coloured structure in Fig. 4b), whereas the equivalent residues are present in the SARS-CoV structure with the ACE2 bound. This implies they are flexible when there is no contact with the ACE2 protein, only being stabilised in this up position once contact has been established. However, residues 464 to 466 are structured and visible in the Covid-19 Cryo-EM B chain and form part of the side of the druggable pocket (Table 1 and Fig. 4) because this chain is in the down position. If the inter-chain interactions within and around this pocket formed between the two domains could be stabilised by the binding in of an appropriate drug molecule, it might prevent the domain from moving to the up position and this could be crucial in preventing or hindering the infectivity of the virus. The location of this pocket near such a functionally important domain, and its high predicted druggability score were the reasons behind this site being chosen for the virtual screening studies. The averaged pairwise root mean square deviation (avRMSD) calculations between each of the docking methods were used as a guide to the quality of each drug pose being in a well-established position within the druggable pocket. A small RMSD value indicated that each method had placed the specific drug pose into a similar position within the pocket. In vina for example, due to the non-deterministic nature of the docking algorithm [12], both it and Smina [13] use a random seed approach to initiate the search process. The results of the poses from all three methods would ideally produce "comparable" but not "identical" results. For all three methods the poses obtained retain information of rotamer space, and overall structural space, which can be employed to find good candidates for fitting the pocket, but from slightly different start positions. So it would be expected and anticipated that, for a given drug molecule, good poses would be comparable, but not the same, from each of the methods used, which would be reflected in small and similar RMSD pairwise values being obtained. However, given that the scoring of the two methods, vina and Smina, was "expected" to be different, but in fact proved to be highly correlated in their results, we removed the Smina scoring from the overall ComboPC equation as we did not want to bias in favour of these two methods over all three. For the top 150 poses ranked by ComboPC the number of hydrogen bond acceptors was found to be higher than expected (1.16, p < 0.001). The number of rings/heterocycles per compound was also significantly higher (1.29, p < 0.001). Compounds were also much more likely to contain halogen atoms (1.74, p < 0.001), particularly fluorine atoms (2.16, p < 0.001). Compounds with SO 2 groups were also overrepresented (2.97, p < 0.001), as exemplified by Ladarixin, Diflumidone and Quinethazone. Moieties of this kind are capable of many electrostatic interactions which need not be associated only with hydrogen-bonding. Given the statistical significance associated with the data above it appears that the properties of the pocket, its shape and lining residue composition, have generated a focused list of drugs with notable component properties of their own. In the top 100 drug poses, of interest there are 20 antiinflammatory drugs of which 8 are members of the "profen" family, derivatives containing the 2-arylpropionic acid moiety. Many of the highlighted drugs in Table 2 could be capable of successfully binding into this pocket and hence alter the infectivity profile. Clearly, these would need to be examined experimentally because at such a sensitive site in the spike protein there is a possibility of promoting the mobile domain into the up, infective conformation. However, with the fact that these drugs interact with the residues lining this pocket they might strengthen the interaction forces in this region and prevent the mobile domain from moving.

Conclusions
This study provides a suggested list of pharmaceutical agents, identifying some of them as being from related structure families, that are available on the market and have been sanctioned for use in humans that have been shown to be capable of binding into a druggable pocket in the spike protein of Covid-19. It must be stated that whether they do or do not actually bind in cannot be ratified here, that would have to be determined experimentally. Some might even bind into the site and increase infectivity as a result. However, the aims of this study were to present and show that the members of this list might have the capacity to bind in, thereby providing open suggestions to experimentalists to establish whether many, some, or few, of these agents do actually bind into the spike protein. Equally, it is not possible to state whether these drugs would be in any way efficacious towards suppressing the infectivity of Covid-19 because that is not the remit of this work. Again, that would be for those in the field to establish whether the listed drugs show any effect on reducing virus titre levels, thereby indicating that they are indeed interfering with the infectivity of the virion. The primary aim has been to show that pharmaceutical agents already available and approved might be usable as drugs to interfere with the process of infection, thereby providing time for the host generation of antibodies to combat this latest of cross-species coronavirus events.

Pocket identification
We utilised the Cryo-EM spike protein structure (PDB code: 6VSB), as a source for our studies to locate the presence of "druggable" pockets where small molecules Fig. 4 Images showing the differences between paired C-alpha residues from chain A and chain B of the spike protein coloured according to their extent of difference using 2StrucCompare [10]. Residues coloured in red indicate differences in paired positions greater than 5 Å from each other between the two chains. a. Shows the shape and position of chain A in the "up" position. b. Shows the same orientation for chain B in the "down" position. Some residues are labelled to indicate approximately where the pocket residues are in the two positions of these chains, and in B three of these residues lie on the strand coloured dark grey indicating that this is not seen in chain A Drew and Janes BMC Molecular and Cell Biology (2020) 21:49 Page 6 of 13  could bind. The protein atomic coordinates were uploaded to the DoGSiteScorer server [11] to identify potential pockets within the structure. The server provides a druggability score that ranges from 0 to 1, 1 being the most druggable.

Docking methods
Three approaches to docking the pharmaceutical agents into the identified druggable pocket in the Covid-19 spike protein were employed to act to corroborate the output produced. These were Autodock vina [12], Smina [13] and Ledock [14] . The centre position of the pocket was calculated and a search grid space, defined by those coordinates ±12 Å in the x, y and z coordinate directions from that centre, was created and supplied to each of the docking methods. By default in each of the packages used, the drugs were free to explore their rotamer space to optimise their binding into the pocket. However, whilst it is customary to allow side chain flexibility in the residues lining the pocket when undertaking in silico docking studies, given that the resolution of the structure used here was only 3.5 Å, no flexible side chains were defined.
The number of poses considered by all three methods used was left as default because this was thought to be optimal for this study. For Ledock, a maximum of 20 docking poses was returned per conformer, with an additional pose root mean square deviation (RMSD) cutoff of 1.0 Å applied to reduce redundancy of returned poses. For Autodock vina and Smina, a maximum of 9 poses with a maximum energy range of 3 kcal/mol between best and worst were returned.
A 1049-member library of compounds was derived from the Drugs-lib dataset available from the virtual screening webserver MTIOpenScreen [17] from an initial study we undertook on the S protein model from I-Tasser. This dataset consists of approved drugs and research chemicals refined from the "drug" subset of the ChEMBL database [18], the "approved" subset of Drug-Bank version 5.0.10 [19], the DrugCentral online compendium [20] and the "approved" SuperDrug2 database version 2.0 [21].
Initially Autodock vina and Ledock were the methods of choice as these have been considered as the best non _commercial docking packages available [22]. However, after pairing the poses between the two methods according to their structural similarity when in the docking site (as in the RMSD Calculations below) the corresponding docking scores showed only a weak correlation (0.36, p < 0.0001)) and so Smina, which is reported as having a different scoring scheme from Autodock vina [13], was used to see if this improved the correlation of the docking scoring to Ledock. In fact, the scoring correlation between Smina and Autodock vina proved to be very high and so only Autodock vina and Ledock data were used in the docking scoring as a result.

RMSD calculations
For a given drug, poses were obtained that fit the pocket.
To establish the best superposition of these different poses from each method, firstly, pairwise RMSDs were calculated using the following equation: where x i and y i are comparable atoms from two of the methods, and n is the number of atoms in the given drug. The best overall superposition from the three methods was then calculated by taking the average of these pairwise RMSD superpositions. From these calculations, the poses from each of the methods with the best overall superposition, and hence, highest degree of similarity, would have the lowest average RMSD values (avRMSD).

Overall druggability analysis and ranking
An extensive amount of data was generated tabulating the output information from the docking packages of the druggable compounds with their structural characteristics. In generalised overview these correspond to the docking output scores from each of the methods, avRMSD values between the poses chosen for each compound, hydrogen bonding data, and the JOELib Native descriptor set, calculated using the ChemMine webserver [23], which include molar refractivity, polar surface area, and frequencies of atom and selected group types among other geometric and chemical properties. Drugs within the search group were also classified by chemical compound class and, where possible, by subclass through classification using the ClassyFire webserver [24]. A compound with a pose that scored highly in the two scoring regimes, and with good superposition agreement between docking methods, as determined from their avRMSD values, was considered more likely to be a meaningful data point. As these individual terms were on different scales, their significant feature was their explicit "order" rather than "value". As a result each component, scoring value, and avRMSD, was ranked according to their percentile position within their individual scales. This therefore ordered the results by position rather than by value. Thus, the following combined score, ComboPC, was created to rank the compounds applying equal weight to the estimated docking scores from each docking method and to the agreement between the superpositions from each of the methods, as measured by avRMSD: Where LedockPC is the percentile rank of the docking score of the pose as docked by Ledock, vinaPC is the