Bigram-PGK: phosphoglycerylation prediction using the technique of bigram probabilities of position specific scoring matrix

Background The biological process known as post-translational modification (PTM) is a condition whereby proteomes are modified that affects normal cell biology, and hence the pathogenesis. A number of PTMs have been discovered in the recent years and lysine phosphoglycerylation is one of the fairly recent developments. Even with a large number of proteins being sequenced in the post-genomic era, the identification of phosphoglycerylation remains a big challenge due to factors such as cost, time consumption and inefficiency involved in the experimental efforts. To overcome this issue, computational techniques have emerged to accurately identify phosphoglycerylated lysine residues. However, the computational techniques proposed so far hold limitations to correctly predict this covalent modification. Results We propose a new predictor in this paper called Bigram-PGK which uses evolutionary information of amino acids to try and predict phosphoglycerylated sites. The benchmark dataset which contains experimentally labelled sites is employed for this purpose and profile bigram occurrences is calculated from position specific scoring matrices of amino acids in the protein sequences. The statistical measures of this work, such as sensitivity, specificity, precision, accuracy, Mathews correlation coefficient and area under ROC curve have been reported to be 0.9642, 0.8973, 0.8253, 0.9193, 0.8330, 0.9306, respectively. Conclusions The proposed predictor, based on the feature of evolutionary information and support vector machine classifier, has shown great potential to effectively predict phosphoglycerylated and non-phosphoglycerylated lysine residues when compared against the existing predictors. The data and software of this work can be acquired from https://github.com/abelavit/Bigram-PGK.


Background
The biological process of enzymatic change in proteins brought about after the translation in the ribosome is known as post-translational modification (PTM). The high interest in PTM for various organisms have emerged as a result of efforts in high-throughput proteomics for the study of site-specific PTM as well as enzymes which cause these modifications [1]. The genetic code comprises 20 amino acids and out of which, lysine is the most commonly modified [2,3]. From the literature [4], some of the major covalent modifications of lysine residues are acetyl [5], glycosyl [6], methyl [7], succinyl [8], pupyl [9], crotonyl [10], and propionyl [11]. These various amino acid modifications, as well as their regulatory enzymes, are associated with several human diseases including heart disease, rheumatoid arthritis, multiple sclerosis, neurodegenerative disorders, and celiac disease [12][13][14][15].
Phosphoglycerylation, which is a non-enzymatic lysine modification, is a type of PTM that has been recently discovered in human cells and mouse liver [16,17]. Cardiovascular disease, such as heart failure, is a highly probable condition caused by phosphoglycerylation since this chemical modification is associated with glycolytic pathways and glucose metabolism [18,19]. Phosphoglycerylation is dynamic and reversible and occurs as a result of reaction between primary glycolytic intermediate (1, and lysine residue, which result in the formation of 3-phosphoglyceryl-lysine (pgK) [17]. Glycolytic enzymes are affected by pgk. It also builds up on cells having high exposure to glucose. As a result, potential feedback mechanism which leads to build up and redirection of glycolytic intermediates to different biosynthetic pathways is established. As this PTM is relatively new to the field, it is important to identify and analyze its functional aspects to be able to understand the selectivity mechanism and its regulatory roles for better diagnosis and treatment of affected persons.
In the recent years, there have been a number of studies involved to identify phosphoglycerylation using computational technique. Phogly-PseAAC was the earliest work to be carried out where it utilized a KNN-based predictor to predict phosphoglycerylation using a feature set of pseudo amino acid composition [38]. The second work called CKSAAP_PhoglySite uses the composition of k-spaced amino acid pairs (CKSAAP) as features and employs a fuzzy support vector machine to predict [16].
Finally, the recent work named iPGK-PseAAC was proposed and it uses a four tier amino acid pairwise coupling technique alongside a SVM operation engine for prediction [39].
The proposed predictors of phosphoglycerylation in the literature are still limited in terms of their performance. In this regard, we are introducing a novel predictor called Bigram-PGK which employs evolutionary information to predict phosphoglycerylated and nonphosphoglycerylated lysine residues. A total of 91 protein sequences were used in this work which contained experimentally confirmed phosphoglycerylated sites and their profile bigram was obtained from the positionspecific scoring matrix (PSSM). With the evolutionary information of the sequences, different segment sizes for each lysine residue was analyzed in terms of the performance. The residue window of ±32 performed the best on Mathews correlation coefficient (MCC) metric when the size of ±1 to ±32 was considered (see Additional file 1). Residue window sizes further than ± 32 could not be taken into account due to constrain of the protein sequence length. Hence a lysine residue, whether phosphoglycerylated or non-phosphoglycerylated, was considered by encompassing a stretch of 32 upstream and 32 downstream amino acids to the lysine with the lysine residue at the center. The number of phosphoglycerylated residues were small compared to the nonphosphoglycerylated, therefore a k-nearest neighbors cleaning treatment was implemented to deal with the class imbalance [35,40,41]. The balanced dataset was then used to construct the Bigram-PGK predictor using LibSVM package which showed a superior performance over the existing methods on the 10-fold cross-validation procedure. The performance of Bigram-PGK on the metrics sensitivity, specificity, precision, accuracy, MCC and area under the ROC curve (AUC) was 0.9642, 0.8973, 0.8253, 0.9193, 0.8330, 0.9306, respectively.

Dataset balancing
The phosphoglycerylation dataset obtained from PLMD was found to be imbalanced, whereby the phosphoglycerylated sites were much less compared to that of nonphosphoglycerylated. The 111 phosphoglycerylated sites compared to 3249 non-phosphoglycerylated sites resulted in an imbalance ratio of 1:29. Having imbalance ratio of this magnitude will easily bias the classification process. Resolving the class imbalance is critical in order to build a reliable predictor. To deal with the imbalance issue, we employed the commonly used k-nearest neighbor cleaning treatment which removed instances from the majority class (non-phosphoglycerylated in this case) when they are one of the k neighbors of a positive instance (phosphoglycerylated site) [35,37,40,42,43].
The cleaning treatment was initiated with a k value equal to the imbalance ratio i.e. 29. The intention was to remove those negative instances which were among the 29 neighbors of every positive instance. With a k value of 29, the imbalance ratio remained undesirable hence the threshold was further increased until the final data set attained an imbalance ratio of 1:2. As a result, the number of non-phosphoglycerylated sites was reduced to 224 instances after applying a k value of 111. The final dataset of 111 positive instances and 224 negative instances, obtained using a k value of 111, was used to validate the performance of the predictor.

Statistical measures
In proposing any new predictor, it is crucial to assess its performance. In this work, we have employed five statistical measures including sensitivity, specificity, precision, accuracy, and Mathews correlation coefficient [16,35,38,41,42,[44][45][46][47]. Furthermore, we have calculated the area under the ROC curve of the predictor and it is depicted in the later section.
The first metric, sensitivity, determines the ability of the classifier to correctly predict phosphoglycerylated lysine sites. The measure ranges from 0 to 1 where a higher value indicates the better the predictor is in classifying the phosphoglycerylated sites. Specificity is the second metric and it measures the ability of the classifier to correctly predict non-phosphoglycerylated lysine sites. This metric also takes on the 0 to 1 range of values where a high value indicates that the predictor is effective at predicting non-phosphoglycerylated sites. The third and fourth metrics are precision and accuracy, respectively, and they take on the same range of values as sensitivity and specificity. Precision metric assesses capability of the predictor to correctly classify phosphoglycerylated sites from all the phosphoglycerylated sites predicted. The accuracy metric evaluates how correctly the predictor distinguishes between phosphoglycerylated sites and non-phosphoglycerylated sites. Mathews correlation metric, which is the fifth measure, assesses the quality of the predictor. The range of values of MCC metric is − 1 to + 1 where − 1 signifies a completely negative correlation, while + 1 indicates a highly positive correlation. These five statistical measures can be written as equations as shown below: In the equations above, FN, FP, TN, and TP represents false negatives, false positives, true negatives and true positives, respectively. False negatives represents instances which are phosphoglycerylated sites but predicted as non-phosphoglycerylated. False positives are those that are non-phosphoglycerylated sites yet predicted as phosphoglycerylated. The true negatives are instances correctly predicted as non-phosphoglycerylated sites and finally, true positives are instances correctly predicted as phosphoglycerylated sites. It is desirable for the best predictor to have high scores in all of the statistical measures. Nevertheless, the proposed predictor should at least have higher sensitivity measure compared to the existing predictors.

Validation scheme
The statistical measures outlined in the previous section to assess the predictor's performance was carried out using the 10-fold cross-validation scheme. In the literature, there are three common ways of determining the effectiveness of a predictor and these are n-fold crossvalidation test, independent dataset test, and the jackknife test [48,49]. Though the jackknife test is regarded to be the least arbitrary of the three and outputs distinctive result on dataset [50], we employed the n-fold cross-validation scheme to avoid high computational time, with n equal to 10. The below steps highlight the 10-fold cross-validation procedure: Step 1: Divide the dataset into 10 equal parts Step 2: Train predictor by combining the 9 parts and test it using the part left out Step 3: Adjust the classifier parameters with training set Step 4: Obtain the statistical measures with the test set Step 5: Repeat steps 2 to 4 until all the folds have been used as test sets and average the statistical measures The result of 10-fold cross-validation scheme on Bigram-PGK is presented in the following section.

Bigram-PGK comparison with available predictors
There are three predictors in the literature which carry out the classification of phosphoglycerylated and nonphosphoglycerylated sites, and these are Phogly_PseAAC [38], CKSAAP_PhoglySite [16], and iPGK_PseAAC [39]. Firstly, we obtained the predictions of these methods on all the lysine residues in our benchmark dataset. This was carried out by preparing the dataset in FASTA format and uploading it to the webservers of Phogly-PseAAC and iPGK-PseAAC predictors, and for the CKSAAP_PhoglySite predictor by inputting the file to the Matlab software package. It is intuitive to point out that these predictors may have been trained using samples which are being used to carry out the performance evaluation and therefore the results can be biased in their favor. The performance comparison of our predictor against the existing methods was carried out on the validation set, the sets put aside as test sets during the 10-fold cross-validation scheme. Likewise, the same validation set was used to obtain the performance of the other methods by investigating their predictions on those samples when the benchmark dataset was uploaded to the respective webserver/software packages.
The comparison result of Phogly_PseAAC [38], CKSAAP_PhoglySite [16], iPGK_PseAAC [39], and our predictor Bigram-PGK is shown in Table 1. In Table 1, we have also added the AUC measure for all the predictors for a more robust comparison since the predictor with the highest AUC measure is always favorable. It can be seen from the results that Bigram-PGK gives the highest performance on the metrics sensitivity, accuracy, MCC and AUC. The sensitivity measure increased by 16.4%, accuracy by 1.7%, MCC by 6.5%, and AUC by 5.1%. These improved performances goes on to say that Bigram-PGK is quite an effective predictor for the phosphoglycerylation problem. From Table 1, it can also be realized that iPGK_PseAAC predictor [39] obtained the highest specificity measure (0.9864) but its sensitivity measure is very low (0.4555), which shows that almost 55% of the phosphoglycerylated sites were undetected by this method.
The promising result in Table 1 clearly illustrate the ability of Bigram-PGK to correctly predict phosphoglycerylated and non-phosphoglycerylated lysine residues. This can be credited to the use of underlying important evolutionary information in protein sequences. The information is captured in PSSM of amino acids surrounding the lysines and when this information is transformed into the matrix of bigram occurrences, it produces the necessary characteristics for identifying the modified lysines. Furthermore, the improved performance can also be attributed to the SVM algorithm for its effective data handling.

Insights into phosphoglycerylation prediction
In the Additional file 2, we present the analysis of phosphoglycerylation sites predicted by iPGK_PseAAC [39], CKSAAP_PhoglySite [16], Phogly_PseAAC [38], and Bigram-PGK on the 10-fold cross-validation procedure. It has been observed that for the proteins having multiple phosphoglycerylation sites, not all the predictors were able to detect them entirely. In fact, only the Bigram-PGK predictor managed to detect almost all of these proteins. The only protein which went undetected was Beta-globin (UniProt Accession A8DUK4) which is a subunit of a larger protein named hemoglobin [51], and this protein was successfully identified by the Phogly_PseAAC [38] predictor alone. Moreover, Bigram-PGK was the only one that effectively detected all phosphoglycerylation sites of the protein Carbamoylphosphate synthase (UniProt Accession Q8C196) which plays a vital role in the removal of surplus ammonia from the cell of ureotelic animals [52]. Moving on to the proteins with single phosphoglycerylation site, there were a number of proteins which only the Bigram-PGK predictor was able to detect. These proteins include Arf-GAP with SH3 domain (UniProt Accession E9QMI7) which regulates the formation of post-Golgi vesicles and controls constitutive secretion [53], 14-3-3 protein beta/ alpha (UniProt Accession A2A5N1) which regulates both general and specialized signaling pathways [54], 60S ribosomal protein L31 (UniProt Accession P62900) which is heavily involved in RNA binding and structural integrity of the ribosome [55], and Zinc finger protein GLI1 (UniProt Accession P47806) which acts as a transcriptional activator [56]. There were also proteins that the Bigram-PGK could not detect but were detected by the rest of the predictors. These proteins include Glutamate receptor ionotropic (UniProt Accession B1AS29) which acts as an excitatory neurotransmitter at many synapses in the central nervous system [57], and EH domain-containing protein 4 (UniProt Accession Q9EQP2) that binds ATP and membrane and it could likely control membrane reorganization upon ATP hydrolysis [58]. Furthermore, none of the predictors were able to detect phosphoglycerylation site of a couple of  [59]. Nevertheless, all the predictors successfully predicted the phosphoglycerylation site of many of the proteins such as ATP-dependent 6phosphofructokinase (UniProt Accession P47857) which acts as a catalyst in phosphorylation of D-fructose 6phophate to fructose 1,6-bisphosphate by ATP [60], Farnesyl pyrophosphate synthase (UniProt Accession Q920E5) which plays a key role in isoprenoid biosynthesis [61], Calcium-binding mitochondrial carrier protein Aralar2 (UniProt Accession Q9QXX4) which acts as a catalyst in calcium-dependent exchange of cytoplasmic glutamate with mitochondrial aspartate [62], Triosephosphate isomerase (UniProt Accession P60174) that catalyzes interconversion between dihydroxyacetone phosphate and D-glyceraldehyde-3-phosphate in glycolysis [63], Kinectin (UniProt Accession Q86UP2) which is involved in kinesin-driven vesicle motility [64], Fructosebisphosphate aldolase (UniProt Accession A6ZI44) which plays an important role in glycolysis and gluconeogenesis [65], and Eukaryotic translation initiation factor 4E-binding protein 1 (UniProt Accession Q13541) which is a repressor of translational initiation that controls EIF4E activity [66].

Conclusion
This paper presents a novel predictor Bigram-PGK, which utilizes the feature PSSM + bigram to predict phosphoglycerylation. The underlying evolutionary information in PSSM of protein sequences and its transformation to bigram occurrences appears to be a crucial property in detecting the lysine modification. The use of studied feature in this work and the SVM classifier with polynomial kernel to obtain a decent hyperplane separation was effective to distinguish between the modified and unmodified lysine sites.

Protein dataset
The benchmark dataset used in this work was obtained from the Compendium of Protein Lysine Modifications (CPLM) repository, accessed 1 March 2017 (available at http://cplm.biocuckoo.org) which has now been upgraded to Protein Lysine Modification Database (PLMD). PLMD contains a number of different protein lysine modifications that have been experimentally identified. Phosphoglycerylation dataset obtained was initially prepared by removing sequences which had 40% or higher sequential similarities, which is a widely used level in the literature [40,67,68], using the Cd-hit tool [69]. As a result, a total of 91 sequences were attained and in each sequence, there were more than one lysine residue. From these sequences, 3360 lysine residues were found. Three thousand two hundred forty-nine lysines were non-phosphoglycerylated and 111 were phosphoglycerylated.

Position specific scoring matrix
Evolutionary feature captures how proteins have evolved in relative to its structural, functional and sequential similarities with other protein sequences [70]. PSSM calculates the substitution probability of amino acids in the sequence to all the amino acids of the genetic code. PSSM profiles is a highly revered feature in the area of proteomics [71][72][73]. The profiles are obtained using PSI-BLAST toolbox [74] which aligns protein sequences to similar sequences stored in protein data bank [75]. The outputs of PSI-BLAST are two matrices with a dimension of L × 20; L being the length of the queried protein sequence and 20 being the 20 amino acids of the genetic code. Of the two matrices, one being log odds and the other the amino acid linear probabilities, the latter was used in this work. The PSSM for the purpose of this work was produced on non-redundant proteins using a threshold value of 0.001 of the PSI-BLAST toolbox with three iterations.

Feature extraction
This section deals with the segment sizing for each lysine residue and its corresponding feature extraction. To represent each sample, we have used the evolutionary information of 32 upstream and 32 downstream amino acids to the lysine K portrayed in Fig. 1a. In the cases where lysine residue did not have enough amino acids, either upstream or downstream, the mirror technique [35] was used to create the missing amino acids as shown in Fig. 1b. The segment consisting of 32 upstream and 32 downstream neighboring amino acids of lysine K can be denoted by P as: From eq. (6), the downstream amino acids are represented by An where 1 ≤ n ≤ 32 while the upstream by An where 1 ≤ n ≤ 32. Moreover, it can be seen that a segment consists of a total of 65 amino acids, including the lysine K at the center. The segment P is attached with an experimentally confirmed label of either 1 or a 0 indicating a phosphoglycerylated site or a nonphosphoglycerylated site, respectively. The acquired submatrix by segment P describing each lysine was converted to a frequency vector of bigrams (PSSM + bigram) resulting in the matrix of size 20 × 20. Each lysine was then represented by transforming this matrix to a 400 dimensional row vector capturing evolutionary information of the segment P.

Profile bigrams
The profile bigrams technique has displayed promising results in dealing with discriminatory information [76][77][78][79]. For the purpose of explanation, let's assume that the PSSM of a protein sequence is denoted by a matrix M. Every element in matrix M, indicated by mij, can be said to be the transitional probability of j-th amino acid at i-th location within the given protein sequence. The segment P, consisting of 65 amino acids (a fraction of the protein sequence), is hence represented by a 65 × 20 feature matrix in which 20 denotes the amino acids of the genetic code. Therefore the PSSM was calculated based on the substitution probabilities of each amino acid in the segment to the 20 amino acids. For the matrix M, its profile bigram is calculated by.
k¼1 m k;p m kþ1;q where 1≤ p ≤ 20 and 1≤ q ≤20 From the above equation, the resulting dimension of matrix B representing the PSSM + bigram is 20 × 20. Finally, matrix B is converted to a 400 dimensional row vector indicated by Eq. (8) which represents the 400 transitional probabilities pertaining to evolutionary information of each lysine residue.

Support vector machine
Support vector machine is one of the supervised learning model listed under the topic of machine learning. The algorithm is commonly used in classification and regression applications. It is a discriminative classifier that works by defining a separating hyperplane. Usage of SVM is not only popular in protein problems [20,[80][81][82][83], but also in other areas of biology, such as genomes [84,85]. With a given set of training data, the algorithm produces an optimal hyperplane separating the two classes and for every new data points presented, it is able to categorize based on this defined hyperplane. The data points represent a point in n-dimensional space where n corresponds to the number of features it possesses. These data points of two class problem are not always linearly separable, hence non-linear kernels are used to carry out classification. The non-linear kernels project the nonlinear input space to a higher dimensional space where the classes are linearly separable. For the purpose of this work, LibSVM package on Matlab software was used to carry out the identification of phosphoglycerylated and non-phosphoglycerylated sites using C-SVC type SVM and a polynomial kernel.
Additional file 1. MCC values for different segment sizes.
Additional file 2. Number of Phosphoglycerylation sites detected by each predictor.