Recognizing ion ligand binding sites by SMO algorithm

Background In many important life activities, proteins carry out their functions by interacting with ligands. Ions are an important class of protein-binding ligands, so identifying ion ligand binding sites plays an important role in the study of protein function. Results In this study, four acid radical ion ligands (NO₂⁻, CO₃²⁻, SO₄²⁻, PO₄³⁻) and ten metal ion ligands (Zn²⁺, Cu²⁺, Fe²⁺, Fe³⁺, Ca²⁺, Mg²⁺, Mn²⁺, Na⁺, K⁺, Co²⁺) were selected as research objects, and a prediction method based on sequence information and the sequential minimal optimization (SMO) algorithm was proposed. Good prediction results were obtained under 5-fold cross validation. Conclusions An efficient method for predicting ion ligand binding sites is presented.


Introduction
Ions play an important role in the structure and function of proteins. For example, SO₄²⁻ participates in the synthesis of cysteine [1], the sulfation of proteins after translation [2], the synthesis of proteoglycans, and sulfate absorption and decomposition in plants [3]; PO₄³⁻ is an important component of bones and teeth and helps maintain the neutrality of body fluids; the alkali metals K⁺ and Na⁺ control the charge balance in cells, tissue fluids and blood, which is important for maintaining the normal circulation of body fluids and controlling the acid-base balance of the body; the alkaline earth metal Ca²⁺ plays a regulatory role in nerve conduction and blood coagulation; and the transition metal Fe³⁺ plays an important role in the oxidative damage of proteins, lipids, sugars and nucleic acids [4]. These biological functions are realized through the interaction of proteins with ion ligands, so recognizing ion ligand binding sites is important for studying protein function [5][6][7][8][9][10].
In 2002, Richard et al. [11] examined the sulfate ion binding sites of proteoglycans and identified the sites that interact with heparan sulfate. In 2017, Li et al. [12] used the Structural Classification of Proteins (SCOP) and Protein Data Bank (PDB) databases to extract 1251 protein chains with the Ligand-Protein Contacts (LPC) software, reported 8112 binding residues, and used the support vector machine (SVM) algorithm to predict the sulfate ion-binding residues of proteins. In recent years, the Zhang Lab has compiled BioLip [13], a semi-manually curated database of ligand-protein interactions. Its functional annotations are relatively comprehensive compared with other databases, and it contains extensive and accurate ligand-protein data.
In recent years, many approaches have been developed to predict the binding sites of metal ions on proteins. In 2008, Babor et al. [14] predicted the binding sites of protein chains and transition metal ions with the CHED algorithm; 95% specificity was obtained when predicting 349 whole proteins, and 96% specificity was obtained for 82 prions. In 2012, Lu et al. [15] used the "fragment transformation" method to predict metal ion (Ca²⁺, Mg²⁺, Cu²⁺, Fe³⁺, Mn²⁺, Zn²⁺) ligand binding sites, obtaining an overall accuracy of 94.6% and a true positive rate of 60.5%. In 2016, Hu et al. [16] identified the binding sites of four metal ions in the BioLip database by both sequence-based and template-based methods, with Matthews correlation coefficient (MCC) values greater than 0.5. In 2017, Cao et al. [17] used the SVM algorithm to identify the binding sites of ten metal ions from amino acid sequences and obtained good results under 5-fold cross validation. In 2018, Greenside et al. [18] used an interpretable confidence-rated boosting algorithm to predict protein-ligand interactions with high accuracy from ligand chemical substructures and protein 1D sequence motifs.
In this paper, a dataset of acid radical ion and metal ion ligands was extracted from the BioLip database, and the sequential minimal optimization (SMO) algorithm was used to predict the binding sites from component information, position conservation information and refinement characteristics. Under 5-fold cross validation, the MCC values of the four acid radical ion ligands exceeded 0.470 and their accuracy values were not less than 74.0%; the MCC values of the six metal ion ligands Zn²⁺, Cu²⁺, Fe²⁺, Fe³⁺, Mn²⁺ and Co²⁺ exceeded 0.620 and their accuracy values were not less than 80%; and the MCC values of the four metal ion ligands Ca²⁺, Mg²⁺, Na⁺ and K⁺ exceeded 0.430 and their accuracy values were not less than 71%.

Dataset
The construction of the dataset directly affects the reliability of the prediction results. The dataset used in this paper was constructed from the BioLip database.
The binding protein chains of four acid radical ion ligands (NO₂⁻, CO₃²⁻, SO₄²⁻, PO₄³⁻) and ten metal ion ligands (Zn²⁺, Cu²⁺, Fe²⁺, Fe³⁺, Ca²⁺, Mg²⁺, Mn²⁺, Na⁺, K⁺, Co²⁺) were downloaded from the BioLip database, keeping only chains with a sequence length greater than 50 residues, a resolution better than 3 Å and a sequence identity below 30%. The sliding window method was then used to cut overlapping segments from each protein chain: if the center of a segment is a ligand binding site, the segment is defined as a positive sample; otherwise it is defined as a negative sample. Taking a segment length of 17 as an example, Table 1 summarizes the datasets and shows how many times larger the negative set is than the positive set.
Since the negative set contains several tens of times as many samples as the positive set, a subset of the negative set equal in size to the positive set was randomly selected ten times for the 5-fold cross validation to ensure stable results, and the final result was the average over the ten runs.
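The segment extraction and balanced sampling described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the representation of binding sites as a set of 0-based indices, and the fixed random seed are all assumptions. The "X" padding follows the dummy-residue convention described later in this paper.

```python
import random

def window_segments(sequence, binding_positions, window=17):
    """Cut one segment per residue and split segments by the label of the center."""
    if window % 2 == 0:
        raise ValueError("window must be odd so that a central residue exists")
    half = (window - 1) // 2
    # Pad both terminals with (L-1)/2 dummy residues so every residue can be a center.
    padded = "X" * half + sequence + "X" * half
    positives, negatives = [], []
    for i in range(len(sequence)):
        segment = padded[i:i + window]  # centered on sequence[i]
        if i in binding_positions:
            positives.append(segment)
        else:
            negatives.append(segment)
    return positives, negatives

def balanced_negative_draws(positives, negatives, repeats=10, seed=0):
    """Draw `repeats` random negative subsets, each as large as the positive set."""
    rng = random.Random(seed)  # hypothetical seed, only for reproducibility
    size = min(len(positives), len(negatives))
    return [rng.sample(negatives, size) for _ in range(repeats)]
```

Each of the ten draws would then be combined with the full positive set, evaluated by 5-fold cross validation, and the ten results averaged.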

The statistical analysis of dataset

Amino acid composition information
According to the literature [12,17], amino acid composition information is an important feature for recognizing binding sites. Therefore, we analyzed the composition information of the acid radical ion and metal ion ligands. Taking the SO₄²⁻ ligand as an example, the violin plot is shown in Fig. 1. A violin plot combines a box plot with a kernel density plot and is mainly used to display the distribution of the data. The left side of each group represents the amino acid composition in the negative set, the right side represents the amino acid composition in the positive set, the ordinate represents the frequency of occurrence of the amino acid, and the white dot represents the median. The black box ranges from the lower quartile to the upper quartile and represents the range in which the amino acid frequencies are concentrated; the outer shape represents the kernel density estimate, so the more concentrated the data, the wider the shape. Figure 1 shows that the range in which R, S and T are concentrated was larger in the positive set than in the negative set, while D, E and G were more concentrated in the negative set than in the positive set. Since the concentrated ranges of amino acid composition differed significantly between the positive and negative sets, we used the amino acid composition information as a characteristic parameter.
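As a rough illustration of the composition feature analyzed above, the sketch below turns one sequence segment into a composition vector. It assumes a 21-symbol alphabet (the 20 amino acids plus the dummy residue "X" used in this paper for padding); the function name and the choice of relative frequencies rather than raw counts are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ALPHABET = AMINO_ACIDS + "X"  # "X" is the dummy residue used for terminal padding

def composition(segment):
    """Relative frequency of each alphabet symbol in the segment.

    Symbols outside ALPHABET are ignored (an assumption of this sketch).
    """
    counts = [segment.count(symbol) for symbol in ALPHABET]
    total = sum(counts)
    return [count / total if total else 0.0 for count in counts]
```

Comparing such vectors between the positive and negative sets, residue by residue, gives the kind of distributions that the violin plot displays.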

The position conservation of amino acids
The WEBLOGO [19] software was used to analyze the position conservation of the acid radical ion and metal ion ligands. Since ion ligands are small, they usually bind only a few residues, so we selected a window length L of 17 as an example. The x-axis represents the 17 positions and the y-axis represents the conservation of amino acids at each position, with the height of each letter corresponding to the occurrence probability of that amino acid; the center of the positive set is the ion ligand binding residue. As shown in Fig. 2, both the SO₄²⁻ binding residues and the surrounding residues are strongly conserved, but the binding residues are more conserved; the preferred residues are R, G, K, S, H and T, and amino acid conservation differs significantly between the positive set and the negative set. For example, at the eighth position, the most frequent amino acids are G, S, A and L in the positive set and L, A, G and V in the negative set. At the tenth position, the most frequent amino acids are G, T, S and A in the positive set and L, A, G and V in the negative set. This analysis shows that the position conservation of amino acid residues is a good indicator of protein-ion binding, so it was selected as characteristic information for developing an effective identification model.

The selection of characteristic parameters

The characteristic parameters from statistical analysis
According to the statistical analysis of the component information and position conservation information of amino acids, these two kinds of information were selected as characteristic parameters.

Physicochemical properties of amino acids
According to the biological background, the physicochemical properties of amino acid residues play an essential role in the binding of proteins to ions. Therefore, we chose the hydropathy and polarization charge of amino acids as characteristic parameters. The 20 amino acids are grouped into 6 classes [20] according to their hydropathy (Table 2).

Predicted structural information
The predicted secondary structure and solvent accessibility reflect the spatial structure of the backbone and side chains [22], so we also extracted this information as characteristic parameters using the ANGLOR [23] software. According to the predicted secondary structure, residues are divided into 3 classes: α-helix, β-sheet and coil; according to the predicted relative solvent accessibility (SA), residues are divided into 2 classes: exposed (SA value greater than 0.25) and buried (SA value less than 0.25).

The extraction of characteristic parameters
Based on the statistical analysis, five kinds of component information were selected as characteristic parameters: amino acid, hydropathy, charge, secondary structure and relative solvent accessibility. The Increment of Diversity algorithm was used to reduce the dimension of these five component features and extract their refinement features, and the Position matrix scoring algorithm was used to extract the position information of the same five characteristic parameters and reduce its dimension into refinement features.

Position matrix scoring algorithm
The Position matrix scoring algorithm uses known sequence patterns to construct a position frequency matrix that describes the amino acid composition at each position of an unknown sequence pattern and characterizes the position conservation of amino acids in the sequence. Statistical analysis of the ion ligands in this study showed obvious position conservation, so the Position matrix scoring algorithm was selected to extract feature parameters.
The Position matrix scoring algorithm is a classification algorithm. It has been successfully used to predict transcription factor binding sites in genomes and supersecondary structures [24,25].
The position frequency matrix has elements $P_{i,j}$, the observed probability of the j-th amino acid at the i-th position, where j runs over the 20 amino acids and one pseudo amino acid "X", $n_{i,j}$ is the frequency of the j-th amino acid at the i-th position, and $N_i$ is the total number of amino acids occurring at the i-th position.
The position weight matrix has elements $m_{i,j}$, the weight of the j-th amino acid at the i-th position, which is defined from $P_{i,j}$ and the background probability $P_{0,j}$ of the j-th amino acid.
The score S of a sequence segment is computed from the position weight matrix, where L is the length of the amino acid sequence segment, $C_i$ is the conservation index at the i-th position, and $m_{i,min}$ and $m_{i,max}$ are the minimum and maximum values at the i-th position.
Taking the amino acid residue at each position as a parameter, two standard scoring matrices were constructed from the training set. For an arbitrary sequence segment in the test set, two scoring (S) values can then be obtained and used as refinement characteristic parameters. In addition, 2L-dimensional position information can be obtained from the position weight matrices and used as characteristic parameters.
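To make the scoring procedure concrete, the following sketch builds a position probability matrix, a position weight matrix and a normalized score. The specific formulas are assumptions drawn from common position-weight-matrix practice, not formulas taken from this paper: a pseudocount of $\sqrt{N_i}/21$ per symbol, the log-ratio weight $m_{i,j} = \ln(P_{i,j}/P_{0,j})$, an entropy-based conservation index $C_i$, and a min-max normalized score. All function names are hypothetical.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # 20 amino acids plus the pseudo amino acid "X"

def position_probabilities(segments):
    """P[i][a]: smoothed probability of symbol a at position i (assumed smoothing)."""
    total = len(segments)
    pseudo = math.sqrt(total)  # assumed pseudocount mass, shared by all symbols
    probs = []
    for i in range(len(segments[0])):
        counts = {a: 0 for a in ALPHABET}
        for segment in segments:
            counts[segment[i]] += 1
        probs.append({a: (counts[a] + pseudo / len(ALPHABET)) / (total + pseudo)
                      for a in ALPHABET})
    return probs

def weight_matrix(probs, background):
    """m[i][a] = ln(P[i][a] / P0[a]), the log-ratio against the background."""
    return [{a: math.log(p[a] / background[a]) for a in ALPHABET} for p in probs]

def conservation_index(probs):
    """Assumed index: 100 * (1 - entropy / ln 21), larger for conserved positions."""
    result = []
    for p in probs:
        entropy = -sum(v * math.log(v) for v in p.values())
        result.append(100.0 * (1.0 - entropy / math.log(len(ALPHABET))))
    return result

def segment_score(segment, weights, cons):
    """Conservation-weighted score, min-max normalized to the range [0, 1]."""
    numerator = sum(c * (w[a] - min(w.values()))
                    for a, w, c in zip(segment, weights, cons))
    denominator = sum(c * (max(w.values()) - min(w.values()))
                      for w, c in zip(weights, cons))
    return numerator / denominator if denominator else 0.0
```

Under these assumptions, building one matrix from positive training segments and one from negative training segments would give two S values per test segment, which matches the two scoring values mentioned above; the positive/negative split itself is an interpretation.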

Increment of diversity (ID) algorithm
The measure of diversity quantifies the diversity of information. It can quantitatively describe a given kind of feature information contained in an amino acid sequence and characterize its overall diversity. The increment of diversity (ID) is an information coefficient that is applied to classification as a classification algorithm: it reduces the dimension of the features, and the refined values are used as characteristic parameters for classification prediction. It has been successfully applied to protein folding and protein structure classification prediction [26,27]. Therefore, the Increment of Diversity algorithm was used to extract feature information from the sequence.
In a state space of dimension s, the measure of diversity of a source X: [n_1, n_2, …, n_s] is
$$D(X) = N \log N - \sum_{i=1}^{s} n_i \log n_i, \qquad N = \sum_{i=1}^{s} n_i$$
For two sources in the same state space of dimension s, X: [n_1, n_2, …, n_s] and Y: [m_1, m_2, …, m_s], the measure of diversity of the mixed source X + Y is
$$D(X+Y) = (N+M) \log (N+M) - \sum_{i=1}^{s} (n_i + m_i) \log (n_i + m_i), \qquad M = \sum_{i=1}^{s} m_i$$
The increment of diversity between the sources X and Y is
$$ID(X,Y) = D(X+Y) - D(X) - D(Y)$$
The amino acid composition information was input into the ID algorithm. Standard diversity sources were constructed from the training set. For each segment of the test set, two ID values can then be obtained, and this two-dimensional ID value is used as a characteristic parameter.
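A minimal sketch of the increment of diversity, assuming the standard definitions D(X) = N log N − Σ n_i log n_i and ID(X, Y) = D(X + Y) − D(X) − D(Y). The natural logarithm and the convention 0·log 0 = 0 are assumptions, as are the function names.

```python
import math

def diversity(counts):
    """Measure of diversity D(X) = N ln N - sum(n_i ln n_i), with 0 ln 0 taken as 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(n * math.log(n) for n in counts if n > 0)

def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y) for two count vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("both sources must live in the same state space")
    mixed = [a + b for a, b in zip(x, y)]
    return diversity(mixed) - diversity(x) - diversity(y)
```

A smaller ID means the two sources are more similar: identical count vectors give 0, while sources with no states in common give the largest values. Comparing the count vector of a segment with two standard sources therefore yields the two ID values used as features.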

Algorithm
The SMO algorithm, proposed by Platt in 1998, is the sequential minimal optimization method for training support vector machines. It is one of the fastest algorithms for the underlying quadratic programming problem and can effectively improve computational efficiency. The SMO algorithm optimizes only two variables at a time while treating all other variables as constants, which transforms a complex optimization problem into a series of relatively simple two-variable problems; each of these is solved analytically, which avoids the error accumulation of an iterative numerical solver and helps ensure accuracy. In this paper, we established our identification model using the SMO implementation in Weka 3.8 [28,29] with the Pearson VII universal kernel (PUK) function. PUK is a general kernel function based on the Pearson VII function [30]. It is robust and has a mapping ability equal to or even stronger than that of the standard kernel functions, so it can be used as a general kernel in place of the usual linear, polynomial and radial basis kernels. To a certain extent, this removes the need to select a kernel function for the SVM algorithm and saves time.
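For reference, the Pearson VII universal kernel can be written in a few lines. The sketch below implements only the kernel function, not the SMO solver or Weka's implementation; the parameter names omega and sigma and their values of 1.0 are illustrative assumptions, not settings reported in this paper.

```python
import math

def puk_kernel(x, y, omega=1.0, sigma=1.0):
    """Pearson VII universal kernel between two equal-length feature vectors.

    K(x, y) = 1 / [1 + (2 * ||x - y|| * sqrt(2**(1/omega) - 1) / sigma)**2]**omega
    """
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    scaled = 2.0 * distance * math.sqrt(2.0 ** (1.0 / omega) - 1.0) / sigma
    return 1.0 / (1.0 + scaled ** 2) ** omega
```

The kernel equals 1 for identical vectors and decays with distance; sigma controls the width of the peak and omega the shape of its tails, which is what lets a single function imitate kernels of quite different shapes.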

Performance measure
We used the following four standard measures [31] to evaluate the identification of ion binding residues: sensitivity (Sn), specificity (Sp), accuracy (Acc) and Matthews correlation coefficient (MCC). They were calculated by the following formulae:
$$S_n = \frac{TP}{TP + FN} \times 100\%$$
$$S_p = \frac{TN}{TN + FP} \times 100\%$$
$$Acc = \frac{TP + TN}{TP + TN + FP + FN} \times 100\%$$
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where TP is the number of correctly identified acid radical or metal ion binding residues, FN is the number of binding residues wrongly identified as non-binding residues, TN is the number of correctly identified non-binding residues, and FP is the number of non-binding residues identified as binding residues.
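The four measures can be computed directly from the confusion-matrix counts. The sketch below uses the standard definitions of sensitivity, specificity, accuracy and MCC; the function name and the choice to return fractions rather than percentages are assumptions.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Return (Sn, Sp, Acc, MCC) as fractions; undefined ratios are reported as 0.0."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    sn = ratio(tp, tp + fn)                    # sensitivity
    sp = ratio(tn, tn + fp)                    # specificity
    acc = ratio(tp + tn, tp + tn + fp + fn)    # accuracy
    mcc = ratio(tp * tn - fp * fn,
                math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return sn, sp, acc, mcc
```

For example, with 40 true positives, 10 false negatives, 45 true negatives and 5 false positives, the function gives Sn = 0.8, Sp = 0.9 and Acc = 0.85.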

Results and discussion
The optimal window size
Whether an amino acid residue binds an ion ligand depends not only on the residue itself but also on its neighboring residues [32]. To extract more comprehensive information, we used the sliding window method with window sizes L ranging from 5 to 17, intercepting sequence segments from the N-terminal to the C-terminal. To ensure that every residue appears at the center of a segment, we added (L−1)/2 dummy residues "X" at both terminals of each protein. If the central residue of a segment was an ion binding residue, the segment was labeled positive; otherwise it was labeled negative. Taking the SO₄²⁻ ligand as an example (Fig. 3), the x-axis represents the window size and the y-axis represents the MCC, Acc, Sn and Sp values under the different window sizes. We searched over the 7 window sizes and combined the results with the WEBLOGO diagram of each ion ligand to finally determine the optimal window size of SO 4 : 11, 13, 9, 7, 13, 9, 9, 9, 9, 7, 9, 11, 11. The following calculations were made under the optimal window sizes with the 5-fold cross validation commonly used in the literature [33][34][35].

The results under component information parameters
Under the optimal window size, the amino acid, hydropathy, charge, secondary structure and relative solvent accessibility component information were used together as characteristic parameters and input to the SMO algorithm. The results of 5-fold cross validation are shown in Table 3.
Table 3 shows that the Acc values of the four acid radical ion ligands were all greater than 61.0%, the MCC values of CO₃²⁻, SO₄²⁻ and PO₄³⁻ exceeded 0.360, and only the MCC value of NO₂⁻ was lower than 0.225. Among the metal ion ligands, the results for Zn²⁺, Cu²⁺, Fe²⁺, Fe³⁺ and K⁺ were better, with MCC values not less than 0.5, so these five metal ion ligands can be considered sensitive to the component information; this is consistent with previous research. The reason can be seen from the statistical diagram of amino acid composition given in [17]: the differences between the positive and negative sets of transition metal ions were relatively large, so their prediction results were better. The remaining ion ligands were examined further by adding other characteristic parameters.

The results under position conservation information parameters
Under the optimal window size, we identified the ion ligand binding sites with the SMO algorithm, using position amino acid, position hydropathy, position charge, position secondary structure and position relative solvent accessibility as characteristic parameters. The results of 5-fold cross validation are shown in Table 4.
From Table 4, the MCC value of NO₂⁻ was 0.350, that of CO₃²⁻ was 0.462, that of SO₄²⁻ was 0.460 and that of PO₄³⁻ was 0.548. Compared with the results obtained with all component information as characteristic parameters, the recognition results were improved.
Among the ten metal ion ligands, the six ligands Zn²⁺, Cu²⁺, Fe²⁺, Fe³⁺, Mn²⁺ and Co²⁺ had good prediction results, with MCC values not less than 0.600. Na⁺ and K⁺ had the worst recognition results; we considered that these two ion ligands were less sensitive to the position conservation information and that their identification could be continued with the refinement characteristics. Compared with the results obtained with all component information as characteristic parameters, the MCC values of Na⁺ and K⁺ decreased slightly, but the MCC values of the other ligands increased, indicating that those ion ligands were more sensitive to the position conservation information. This can be seen from the WEBLOGO diagrams in [17], where the positive and negative sets differ more than in the statistical analysis of the components in [17].

The results under refinement characteristic parameters
The ID algorithm was used to reduce the dimensionality of the amino acid, hydropathy, charge, secondary structure and relative solvent accessibility component information, giving a 10-dimensional ID value; the Position matrix scoring algorithm was used to reduce the dimensionality of the position amino acid, position hydropathy, position charge, position secondary structure and position relative solvent accessibility, giving a 10-dimensional S value. The 10-dimensional ID value and the 10-dimensional S value were combined into a 20-dimensional refinement characteristic and input to the SMO algorithm, and the results of 5-fold cross validation (OUR'S) are shown in Table 5.
For comparison, the results of the SVM algorithm in [17] and the results of SMO with the characteristic parameters of [17] are also included in Table 5.
As seen, the results for the four acid radical ion ligands under the refinement characteristic parameters were good: the MCC values were over 0.460 and the Acc values were all greater than 73.0%. Compared with the recognition results for all component information and all position conservation information, the values of Sn, Sp and Acc gradually improved, indicating that the refinement characteristic parameters contain more complete information.
The MCC values of Zn²⁺, Fe²⁺, Fe³⁺ and Cu²⁺ reached above 0.7, the MCC values of Mn²⁺ and Co²⁺ exceeded 0.6, and the MCC value of K⁺ was only 0.362. The MCC values of the eight metal ion ligands Zn²⁺, Cu²⁺, Fe²⁺, Fe³⁺, Mn²⁺, Na⁺, K⁺ and Co²⁺ improved slightly compared with the results in Table 4, indicating that these eight ion ligands were more sensitive to the refinement characteristic. The evaluation indexes of Ca²⁺ and Mg²⁺ under the refinement characteristic parameters were not higher than those under the position conservation information, indicating that these two ion ligands were more sensitive to position conservation information. Na⁺ and K⁺ had higher MCC values when the refinement characteristic was used as a parameter; compared with the results for all component information as characteristic parameters, it can be understood that Na⁺ and K⁺ were more sensitive to the component information among the three kinds of characteristic parameters, but their results were still lower than those of the other metal ion ligands. For the remaining ion ligands, the MCC values under the refinement characteristic parameters were improved compared with the results for all component information and were the best among the three kinds of characteristic parameters.
In general, the recognition results under the refinement characteristic parameters were higher than those under a single combination of characteristic parameters, which demonstrated that the SMO algorithm is compatible with these parameters. In addition, new characteristic parameters were added on the basis of the SMO results, and the prediction results for some ion ligands were improved (the results labeled OUR'S in Table 5), indicating that the new characteristic parameters we added are useful and suitable for the SMO algorithm.
Overall, in predicting ion ligand binding sites, the SMO algorithm solves each subproblem analytically, which avoids the error accumulation caused by iterative methods and supports the accuracy of the prediction results. The PUK kernel function handles the nonlinear classification data of binding site prediction well and reflects the distribution characteristics of the training samples, since it maps features from a low-dimensional space to a high-dimensional space in which they become linearly separable. Therefore, the SMO algorithm performs well in predicting ion ligand binding sites.

Conclusion
In this paper, the ligand binding sites of four acid radical ions and ten metal ions were predicted. First, the dataset was built from the BioLip database and the optimal window sizes were determined by calculation. Second, component information, position conservation information and refinement characteristics were extracted as characteristic parameters. The different characteristic parameters were then input into the SMO algorithm and evaluated by 5-fold cross validation. The identification of the binding sites of the four acid radical ion ligands gave good results. Among the ten metal ion ligands, the prediction results for transition metals were better than those for alkaline earth metals and alkali metals. The results with all position conservation information as characteristic parameters were better than those with all component information, and the results under the refinement characteristic were better than those under a single combination of characteristics, so the characteristic parameters should be refined as far as possible in subsequent work.

Availability of data and materials
The data and materials are available from the corresponding author on request.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.