- Research article
- Open Access
Ridge regression estimated linear probability model predictions of O-glycosylation in proteins with structural and sequence data
BMC Molecular and Cell Biology volume 20, Article number: 21 (2019)
To-date, no claim regarding finding a consensus sequon for O-glycosylation has been made. Thus, predicting the likelihood of O-glycosylation with sequence and structural information using classical regression analysis is quite difficult. In particular, if a binary response is used to distinguish between O-glycosylated and non-O-glycosylated sequences, an appropriate set of non-O-glycosylatable sequences is hard to find.
Three sequences from similar post-translational modifications (PTMs) of proteins occurring at, or very near, the S/T-site are analyzed: N-glycosylation, O-mucin type (O-GalNAc) glycosylation, and phosphorylation. Results found include: 1) The consensus composite sequon for O-glycosylation is: ~(W–S/T–W), where “~” denotes the “not” operator. 2) The consensus sequon for phosphorylation is ~(W–S/T/Y/H–W); although W–S/T/Y/H–W is not an absolute inhibitor of phosphorylation. 3) For linear probability model (LPM) estimation, N-glycosylated sequences are good approximations to non-O-glycosylatable sequences; although N – ~P – S/T is not an absolute inhibitor of O-glycosylation. 4) The selective positioning of an amino acid along the sequence, differentiates the PTMs of proteins. 5) Some N-glycosylated sequences are also phosphorylated at the S/T-site in the N – ~P – S/T sequon. 6) ASA values for N-glycosylated sequences are stochastically larger than those for O-GlcNAc glycosylated sequences. 7) Structural attributes (beta turn II, II´, helix, beta bridges, beta hairpin, and the phi angle) are significant LPM predictors of O-GlcNAc glycosylation. The LPM with sequence and structural data as explanatory variables yields a Kolmogorov-Smirnov (KS) statistic of 99%. 8) With only sequence data, the KS statistic erodes to 80%, and 21% of out-of-sample O-GlcNAc glycosylated sequences are mispredicted as not being glycosylated. The 95% confidence interval around this mispredictions rate is 16% to 26%.
The data indicates the existence of a consensus sequon for O-glycosylation; and underscores the germaneness of structural information for predicting the likelihood of O-glycosylation.
The process of post-translational modification (PTM) of proteins has allowed cells to vastly increase the otherwise normal array of functions that proteins carry out. PTMs are essential modifications that occur in higher eukaryotes. Glycosylation is one type of modification that chemically alters proteins by the enzymatic or non-enzymatic addition of glycans. There are two main types of glycosylation present: N-linked and O-linked . There are two types of O-linked glycosylation in cells: mucin-type O-linked glycosylation (O-GalNAc); and several types of non-mucin O-glycans, which include α-linked O-fucose, β-linked O-xylose, α-linked O-mannose, β-linked O-GlcNAc (N-Acetyl Glucosamine), α- or β-linked O-galactose, α- or β-linked O-glucose glycans, and O-GlcNAcylation (or O-GlcNAc). In addition, C-linked, S-linked and non-enzymatic glycation processes exist . The consensus sequon for N-linked glycosylation is N – X – S/T, where X is not proline. Two consensus sequons for C-linked glycosylation are: W – X – X – W and W – S/T – X – C . No consensus sequon for O-glycosylation is known thus far.
O-GlcNAc modification is the attachment of β-N-Acetyl Glucosamine (GlcNAc) to protein serine (S) or threonine (T) amino acid residues via the O-linked glycosylation process . O-GlcNAcylation occurs primarily in nucleocytoplasmic proteins [5, 6]. O-GlcNAcylation mechanism targets numerous transcription factors, tumor suppressors, kinases, phosphatases, and histone-modifying proteins [6,7,8,9]. Furthermore, the monoaddition of GlcNAc is suggested to be the most significant glycosylation process in the regulation of metazoan cytosolic and nuclear proteins [10, 11]. Often, the S/T residues, which are targeted by O-GlcNAcylation, are also targets of kinases and phosphatases. Hence, the consensus sequon for O-linked glycosylation can also be the one for phosphorylation. These residues are both dynamic and inducing in nature . Thus, to summarize, O-GlcNAcylation plays a major role in modulating kinase-induced signal transduction pathways , supporting the notion that O-GlcNAcylation has an important biological role.
The source of GlcNAc for O-GlcNAcylation in eukaryotic cells is UDP-N-Acetyl Glucosamine , UDP-GlcNAc, obtained from glucose in the hexosamine biosynthetic pathway (HBP). The attachment process of O-GlcNAcylation is catalyzed via O-linked β-N-Acetyl Glucosamine transferase [15, 16] or O-GlcNAc transferase and is removed by O-linked β-N-acetyl- glucosaminidase or O-GlcNAcase . Currently, O-GlcNAc transferase is the only known enzyme in mammals responsible for this addition of GlcNAc . O-GlcNAc transferase is found in the nucleus, mitochondria, and cytoplasm of cells.
O-GlcNAc transferase has important roles in many biological processes and diseases. These include type 2 diabetes and its complications, such as blindness, renal failure, and nerve damage; diabetes-accelerated atherosclerosis; dyslipidemia; cardiovascular diseases; aging and cancer [18,19,20,21,22,23]. Additionally, O-GlcNAc transferase is associated with tauopathies, including Alzheimer’s disease, dementia, and Parkinson’s disease [24, 25]. Recently, the role of glycosylation in neuroblastoma has taken center stage .
The common underlying cause of such diseases is due to the over- or under- active process of O-GlcNAcylation. With the complex metabolic processes in O-GlcNAcylation and its critical role in the regulation of cellular processes, it is not unusual to consider abnormal O-GlcNAcylation activity as contributing to various diseased states. This concept suggests that an in-depth understanding of sequence and structural parameters that drive O-GlcNAcylation may shed some light on O-linked glycosylation.
The goal of this paper is to use the ridge regression (RR) estimated linear probability model (LPM) to predict the likelihood of O-GlcNAc glycosylation in human proteins [27,28,29,30,31,32,33]. A detailed description of the statistical methodology used herein is in a predecessor paper . The LPM models a binary outcome: O-GlcNAc glycosylated or not. So, the LPM estimation dataset requires two data subsets: O-GlcNAc glycosylated sequences and sequences that cannot be O-GlcNAc glycosylated. Experimentally one can determine if a protein is O-glycosylated. Thus, getting data on O-glycosylated proteins is straightforward. Getting data on proteins that are highly unlikely (or impossible) to be O-glycosylated is quite challenging, because a consensus sequon for O-glycosylation is needed. To understand whether a true, or even approximate, consensus sequon for O-glycosylation exists, a family of similar sequences with PTMs of proteins occurring at the S/T-site is collected. This family includes sequences on the following PTMs of proteins: N-glycosylation, O-GalNAc glycosylation, and phosphorylation. The rest of this paper documents the analyses done with all of this collected data.
Results from examining the empirical data
The distribution of groups of residues around the O-GlcNAc glycosylated proteins, in dataset dbogap-seq, is shown in Table 1. It is striking that the immediate vicinity of the S/T-site is predominantly polar uncharged and hydrophobic at position –1 and +1, respectively.
The marginal distributions of residues in the vicinity of the O-GlcNAc and O-GalNAc glycosylated sequences are shown in Tables 2 and 3, respectively. About 10% or more of the sequences in positions –4 thru –8 have S, P, or A in Table 2; positions –2 and –3 are proline rich; and amino acids C, F, M, and W do not occur much. There are some differences between amino acid preferences in O-GlcNAc and O-GalNAc glycosylated sequences. As reported by earlier studies, the vicinity of the S/T-site in O-GalNAc glycosylated sequences is proline rich.
The marginal distributions of residues in phosphorylated sequences are shown in Table 4. Along the sequence, S tends to show up the most in phosphorylated sequences. There are higher percentages of charged amino acids in these sequences, as well.
While visual inspection of amino acids in the vicinity of these sites provide some indication of their preferences and charge distributions, a subjective analysis alone is insufficient. Therefore, statistical analysis of the data is done in this paper to provide more rigorous indications of the same. Empirical analysis suggests that given a set of sequences for a particular PTM of proteins, the proportions of an amino acid occupying positions (e.g., ±8 from the S/T-site) along a sequence can discriminate between two PTMs of proteins. This is tested using the Wilcoxon signed-rank test  and the results are shown in Table 5.
For example, in Tables 2 and 4, the proportions of occurrence of amino acid F along seven positions to the left/right of the S/T- and S/T/Y/H- sites in O-GlcNAc glycosylated and phosphorylated sequences, respectively, appear different under pairwise comparison (i.e., the amino acid proportions along one row of one Table are compared, pairwise, with the proportions of that same amino acid along the corresponding row of the other Table). The Wilcoxon signed-rank test is used to determine if these two sets of proportions are statistically different. As shown in Table 5, detected differences are flagged as “1” and by “0” otherwise. Table 5 indicates that using amino acids in the positions they occupy as “explanatory variables” to predict O-glycosylation is reasonable. In contrast, the proportions of amino acids in a particular position for one PTM process are not statistically different from the proportions of amino acids in that same position for another PTM process. For example, the Wilcoxon signed-rank test applied to compare, pairwise, the proportions in any column of Table 2 with those in the corresponding column of Table 3 shows the pairwise differences are not statistically significant. This same result holds when comparing the columns of Table 2 with the corresponding ones in Table 4; and those of Table 3 with the corresponding ones in Table 4.
Finally, it is of interest to know whether sequons involved in the PTMs of proteins are unique to each kind of PTM process. For example, given the consensus sequon for N-glycosylation is N–X–S/T, one can find out if in the collected data there are N-glycosylated sequences that are also O-glycosylated or phosphorylated at the S/T-site, which is at the 2nd position to the right of N in this sequon. UniProt Accession numbers and UniProt position pairs are compared across PTMs. Because N-glycosylation occurs at N, the UniProt positions of N-glycosylated sequences, in the data, are increased by 2 so the S/T-sites can be compared across the other PTMs. In the empirical data, there are no N-glycosylated sequences with UniProt Accession No. and UniProt position +2 pairs that match the UniProt Accession No. and UniProt position pairs of O-GlcNAc or O-GalNAc glycosylated sequences. This is as expected, because the structures of N- and O-linked oligosaccharides are very different, and reflect differences in their biosynthesis and cellular localizations. Thus, the N-linked site cannot be O-glycosylated at the S/T-site of the N–X–S/T sequon. However, there is a small number of N-glycosylated sequences that are phosphorylated. Table 6 presents the counts of sequences in the empirical data that are common between the different PTMs of proteins considered herein (i.e., intersections of two sets of sequences are calculated).
Table 7 lists the 13 proteins that are N-glycosylated at N and phosphorylated at the second position to the right of N, which is S/T. The majority of these proteins are receptors involved in signaling and are phosphorylated via a threonine residue. What roles these phosphorylation sites play require further analysis and experimentation.
Search for a Consensus Sequon for O-glycosylation
Data (see Table 8) on three PTMs of proteins, increasing in sample size, is collected to identify a consensus sequon: O-GlcNAc glycosylation, O-GalNAc glycosylation, and phosphorylation, which is an important and pervasive PTM of proteins  that occurs, generally, at the S/T/Y-site. It has also been found to occur at the H-site . Several sequons are tested to see if a consensus sequon exists. The outcome of these tests is shown in Table 9. As can be noted, the sequon W – S/T – W does not occur in the collected set of O-glycosylated sequences. Thus, ~ (W – S/T – W) is likely a consensus composite sequon for O-glycosylation, where, following Principia Mathematica , “~” denotes the “not” operator. This means there are 3 sequons that satisfy ~ (W – S/T – W): ~W – S/T – W, W – S/T – ~W, or ~W – S/T – ~W.
An important question is whether, as sample size increases, W – S/T – W continues not to occur at all in O-glycosylated sequences. The answer to this question has to wait until much larger sets of O-glycosylated sequences are available for study. In the meantime, one can approach this question in another way. The N – ~P – S/T sequon is considered a necessary [46, 47], but not sufficient, condition for N-glycosylation, which is another important PTM of proteins. Its likelihood is modeled in Gana et al. . In the strictest sense, this is a true statement if the occurrence of P in the position between N and S/T is an “absolute” inhibitor of N-glycosylation  – i.e., if the N – ~P – S/T sequon is never observed in N-glycosylated sequences. However, this is not the case. Rather, the word “absolute” is used in a probabilistic sense: the fact that N – P – S/T occurs “rarely” is sufficient for N – ~P – S/T to be termed a “consensus” sequon. In particular, a partial search for the N–P–S/T sequons in UniProt revealed 23 such sequons associated with N-glycosylated human proteins. In contrast, 16 (0.0066% + 0.004% = 0.007%) of the 227,810 phosphorylated human sequences in Table 9 have the W – S/T/Y/H – W sequon. Finally, a search of UniProt for human proteins with the W – S/T – W sequon yielded 236 sequences (in the WSTW-Uniprot dataset of Table 8). It is found that none of them are O-glycosylated at S/T.
The sequon for C-linked glycosylation is W–X–X–W or W–S/T–X–C. Analysis of the data indicates that the presence of W immediately before and immediately after the S/T-site (S/T/Y/H-site) disallows O-glycosylation (phosphorylation). Furthermore, the occurrence of N two positions to the left of the S/T-site is relatively rare in O-glycosylation. Thus, nature appears to have evolved distinct sequons and amino acid propensities unique to each of the PTMs discussed herein.
Table 9 indicates that the majority (98.54% + 0.24% = 98.78%) of O-GlcNAc glycosylated sequences have the ~N – X – S/T sequon. In contrast, these sequences having the N – ~P – S/T sequon is small (1.22%). Thus, for LPM estimation, the set of N-glycosylated sequences in Gana et al.  is considered to be a reasonable approximation for non-O-glycosylated sequences. By discriminating between the features of O- and non-O- glycosylated sequences, respectively, the LPM estimates the likelihood of O-glycosylation. Because 1.22%, rather than 0%, of the O-glycosylated sequences have the N – ~P – S/T sequon, the ambiguity regarding what constitutes a set of sequences that cannot be O-glycosylated has not been completely removed. However, that ambiguity has been considerably reduced.
Using the data in Gana et al. , O-GlcNAc glycosylation is modeled by including the amino acids in 8 positions to the right, and 8 positions to the left, of the S/T-site as possible explanatory variables. The collected data on N-glycosylated sequences described in Gana et al.  have 10 positions to the right, and 10 positions to the left, of the N-site. Because training the LPM to discriminate between O- and not O- glycosylated proteins must focus on the S/T-site as center for all proteins in the estimation dataset, only 8 positions to the right of S/T are available for use in that data. The remaining possible explanatory variables are the collected structural data on the proteins.
When the LPM is applied to out-of-sample data, other types of O-glycosylated sequences may be a part of it. To assess the sensitivity of the mispredictions rate, the LPM of O-GlcNAc glycosylation is applied to a sample of O-GalNAc glycosylated sequences. The modeling and validation strategy is summarized in Table 10.
LPM predictions using sequence and structural data
Based on the collected data (glycos_public.xlsx), as in Gana et al., the ith position to the left (right) of S/T is minusi (plusi). If amino acid α occupies i, then: minusiα = 1 if α is to the left of S/T and plusiα = 1 if α is to the right of S/T; otherwise, these indicator variables take the value 0. Thus, 140 (20 × 8 – 20) and 160 (20 × 8) indicator variables define the left and right neighborhoods of S/T, respectively. The ratio (posj) of the position of the glycosylation site (S/T or N) to the total length of protein j is also assumed to be an explanatory variable. Finally, it is assumed that the structural information on the sequences are additional explanatory variables.
Let Yj be the binary dependent variable that takes the value 1 if sequence j is O-glycosylated and 0 if it is not (N-glycosylated). Then, the postulated LPM (indexed as Eq. (1)) is:
where β0 , βiα , φiα , λ , δ , π , θk , ω1 and ω2 are constants (coefficients) estimated using the data; Σα , Σi denote summations over significant amino acid and position combinations, respectively, and Σk denotes summation over significant turn types. The variable ASA_zero is a dummy variable that takes the value 1 if ASA is zero, and 0 otherwise. minus2α is dropped so as not to bias the LPM, because the set of non-O-glycosylated proteins only has an N (i.e., there is no amino acid variation, by design) at minus 2. Thus, retaining minus2α with α ≠ N will flag amino acids (other than N) present at minus 2 only whenever Yj = 1, which would be inappropriate. Amino acids A, D, F and K dominate the 998 human proteins at minus 2; and D, F, K, R and S dominate the remaining 107 non-human proteins. The set of O-glycosylated sequences have no N at minus 2. Also, note that WSTW-Uniprot is not used as the set of non-O-glycosylated sequences, because it will lead to the perfect, but trivial, model: Yj = 1 – minus1W. In this case, one option is to estimate the LPM by ignoring the two amino acids on either side of the S/T-site. However, this does not improve LPM predictive capability materially. More importantly, it would be of interest to find sequences with the W–S/T–W sequon that do “something” (e.g., C-glycosylation with W–S/T–W–W or W–S/T–W–C sequons) rather than “nothing” (or staying “silent”); and then refit the LPM to see how well it discriminates between O-glycosylated and non-O-glycosylatable sequences. In addition, if structural data is included for LPM estimation, solid predictive power would likely be the outcome.
The selected LPM, shown in Table 11, is estimated on 1,083 (human) N-glycosylated proteins and 987 O-glycosylated proteins. As in Gana et al. , N-glycosylated sites with a single sugar are deleted from the estimation data because these may be false positives given the known artifacts from crystallization. As detailed in Gana et al. , the LPM is selected using the stepwise selection method [49,50,51] in conjunction with 10-fold cross-validation [52,53,54,55,56] to minimize the prediction residual sum of squared errors (PRESS), which has a point of contact with the Brier Score [57, 58]. Because the LS estimated LPM produces heteroscedastic errors (i.e., residuals with varying variances), it was confirmed that stepwise selected significant variables continued to remain significant when reevaluated in terms of heteroscedasticity-consistent standard errors , as a first approximation. Because the exact functional form of the LPM error variance is known, RR is applied to the stepwise selected LPM to reconfirm that all variables continue to be statistically significant. This is done by re-estimating the stepwise selected LPM by RR so the predicted values of Yj are forced to lie in the 0-1 interval and estimating the error variances : (RR predicted Yj ) × (1 – RR predicted Yj ). The optimal value of the RR tuning parameter, k, for which all of the aforesaid predictions fall in the range 0-1 is 3.77895. Using these variances as weights, the LPM is re-estimated using classical weighted LS (WLS). If variables with p-values greater than 5% result, the entire LPM estimation process is redone until all variables are significant at 5%.
The number of amino acid and position combinations found to be significant in the selected LPM is 76. Model parsimony (e.g., favoring retaining fewer, rather than more, explanatory variables) is not sought, because it is not at all clear that Nature is “inefficient”, or provides redundant information, by making irrelevant the placement of certain amino acids in certain positions along the sequence. The selected LPM yields an in-sample Kolmogorov-Smirnov (KS) statistic of about 99.2%, which signals a reasonable degree of discrimination between O- and non-O- glycosylated proteins. The maximum separation for the KS statistic [60,61,62,63,64] occurs at the predicted probability value of about 43%. The average (in-sample) predicted probabilities of O-glycosylation when Yj = 0 and Yj = 1 are about 1% and 98%, respectively. The standard deviations of predicted probabilities of O-glycosylation when Yj = 0 and Yj = 1 are about 11% and 7%, respectively. The Durbin-Watson (DW) statistic [65, 66] for the LPM is about 1.5.
The nine mispredictions made by the LPM, in-sample, are documented in Table 12. The accessions in Table 12 are not annotated in UniProt to be glycosylated except for Q16566 (O-linked), Q92854 (N-linked), and P31749 (phosphorylated). Thus, how many sequences in Table 12 are truly false positives, remains to be seen.
It has been observed that amino acid P is usually present around O-glycosylation sites; and that O-glycosylation tends to occur more in the beta-strands of proteins [67, 68] indicating the significance of proline around the S/T-site. However, Table 11 indicates that the likelihood of O-glycosylation tends to be more favorable in the beta hairpins and beta bridges of proteins. In Table 11, ASA_zero is significant. This appears to be “unusual” because only 2 sequences in the LPM estimation data, both of which are experimentally deemed to be O-GlcNAc glycosylated, have ASA values of 0. ASA_zero has a point of contact with the concept of an “observation-specific” dummy variable [69,70,71,72,73]. The two sequences with ASA values of 0 are shown in Table 13. An important biological question is whether sequences not exposed at all, but with other O-glycosylation friendly characteristics, have a greater propensity to be O-glycosylated.
The two proteins in Table 13, Peroxiredoxin-2 (P32119) and Band 3 anion transport protein (P02730), have their O-linked glycosylation sites buried as indicated by their zero ASA values. While an explanation for this is not immediately apparent, position 112 in Peroxiredoxin-2 (P32119) is a phosphorylation site as well. A question is whether the site is buried in order to make itself amenable to switch between the two processes via a conformational change (O-linked and Phosphorylation). Further research is needed to answer this question.
ASA is not significant in the LPM. As the two data subsets of sequences in the LPM estimation dataset are either O- or N- glycosylated, it appears that differences in ASA, between N- and O- glycosylated sequences does not rise to the level of a predictor variable in the fitted LPM. The empirical cumulative probability distribution functions (CDFs) for ASA, by estimation data subset, is shown in Fig. 1. The mean/median ASA values, shown in Fig. 1, differ by LPM estimation data subset. In particular, it appears that ASA values for N-glycosylated sequences are generally “larger” than those for O-GlcNAc glycosylated sequences. Interestingly, the Wilcoxon rank sum test  indicates that the ASA value for N-glycosylated sequences is “stochastically larger”  than the ASA value for O-GlcNAc glycosylated sequences. The p-value of this test is about 3.6%. If the KS test is applied, the p-value drops to nearly zero (3.96e-13). However, because the first p-value is less than 5%, it would be of interest to retest this phenomenon with more data to see if one can get significantly lower p-values, say less than 0.5% , with the Wilcoxon rank sum test.
LPM predictions using only sequence data
The LPM in Table 14 employs only sequence data. The optimal value of k for which all predictions are in the 0-1 range is 1.4357. With this k, the LPM in Table 14 is estimated by WLS and produces a KS statistic of about 80.4%, which is a steep drop from 99.2% and underscores the germaneness of structural information. Absent 3D structure, this loss in LPM predictive power is not surprising, considering that the same site is used by two fundamental PTM processes driven by conformational flexibility. Maximum separation occurs at the predicted probability value of about 49%. The DW statistic value is about 1.12, indicating, as expected, missing explanatory variables (e.g., structural information) in the LPM as specified. This LPM produced only one prediction outside the 0-1 range flagging a protein deemed to be O-glycosylated in the data with a negative (or zero) “probability” (≈ –7.2%) of O-glycosylation; its UniProt Accession No., UniProt position, protein length, and sequence are Q16566, 58, 473, and VESELGRGATSIVYRCKQKGT, respectively. The LPM in Table 14 mispredicts (using 50% as the cutoff probability) 54 of the 259 (or 20.85%) out-of-sample O-GlcNAc glycosylated sequences as not being O-GlcNAc glycosylated. The 95% confidence interval  around 20.85% is: (16.35%, 26.20%). If the LPM in Table 14 is applied to a sample of O-GalNAc glycosylated sequences, it mispredicts 656 of the 2,079 (or 31.6%) as not being O-GalNAc glycosylated. A LPM specific to O-GalNAc glycosylation will decrease the mispredictions rate. If the LPM in Table 14 is applied to the WSTW-Uniprot dataset (by ignoring the amino acids on either side of the S/T-site), 52 of the 236 (≈ 22%) sequences are mispredicted as being O-glycosylated. As expected, this is slightly more than 20.85%.
In order to clarify why the LPM methodology used herein is appropriate, we discuss several points, in a question-answer format, in this section.
Is the LPM a valid methodology to model a binary response?
The short answer is “yes”. However, a longer answer is needed because it is quite unfashionable to use the LPM these days. And that perception needs to be addressed.
The LPM is fairly well-known to econometricians and statisticians ( [76,77,78,79,80,81]) and has sparked a fairly substantial literature. The LPM is Y = Xb + u, but where Y only takes the values 0 and 1. If E(u) = 0, then each ui has variance E(Yi)(1 – E(Yi)). Goldberger  suggested estimating E(Yi) by ordinary LS (OLS), and then re-estimating the model by WLS to achieve homoscedasticity (i.e., constant residual variance) with E(Yi)(1 – E(Yi)) as the “weights”. The primary attraction of the LPM is its simplicity. In large samples, the non-normality of u does not matter. The LPM produces consistent coefficients . And the problem of getting predictions that lie outside the 0-1 range is not an asymptotic one . Yes, “probabilities” less than 0 or greater than 1 can hinder empirical work with finite samples; but, only if those probabilities, per se, are used for something. In our case, probabilities less than 0 are interpreted as nearly 0 and those greater than 1 as nearly 1. This is so, because we are only interested in whether or not a sequence is O-glycosylated by assessing whether or not its probability of O-glycosylation is greater or less than 50%, respectively. The numerical value, per se, of the estimated probability of O-glycosylation below or above this threshold does not matter to us.
Rounding LPM predictions that are less than 0 to a small number close to 0 (e.g., 0.001), and those greater than 1 to a number close to 1 (e.g., 0.999), is a well-known method to constrain the LPM predictions within the 0-1 range. In another approach, the sum of squared errors is minimized subject to the constraints 0 ≤ Xb ≤ 1 . OLS LPM predictions are used to reestimate the LPM using WLS to overcome the problem of heteroscedasticity. Another method uses the absolute values of the weights to do WLS estimation. Another method uses only OLS predictions that are in the 0-1 range to do WLS estimation . In another method, the weights are bounded and negative weights are assigned a constant value . Another method generalizes the last two methods . A limitation of these methods is that their sampling properties are not available. So, we use RR to construct the weights .
To avoid the problem, in finite samples, of LPM estimates falling outside of the 0-1 range, logistic regression is a greatly favored alternative to the LPM . The logit model mathematically forces the estimates of Y to lie in the 0-1 range by the transformation: log(Y ÷ (1 – Y)) = Xb + u; and then estimates Y. In contrast, an alternative transformation is to use RR to force the estimates of Y in the 0-1 range. However, there is no a priori reason that says one way of constraining the estimates of Y to behave like “probabilities” is “superior” to the other. Furthermore, given that most of the predictors in our model Eq. (1) are also binary variables, it is not at all clear that Nature prefers the likelihood of O-glycosylation to vary along an ogive- or sigmoid- like surface, rather than linearly.
This naturally leads to the question of whether our fitted model satisfies the assumptions of a linear model. For the LPM, the error (residual) term is never normal; rather it is binomial. Furthermore, the error term is not observable – we only observe the final binary outcome in Nature. For large samples, the central limit theorem guarantees that LPM errors converge to a normal distribution . Although LPM errors are not observable, one way to approximate them is to simply compute them like in ordinary linear regression with a continuous dependent variable – i.e., take the difference of the binary dependent variable, Y, and the corresponding LPM prediction; and compute the average squared error, which is the Brier Score. For our model in Table 11, Fig. 2 indicates that our sample size is approximately normal with a concentration of small errors in the range –0.04 to +0.04. A normal distribution with a mean of about 0.0025 (which is close to 0) and a standard deviation of about 0.1 fits Fig. 2 approximately.
Cook’s distance is used to test for the presence of outliers [88, 89]. We use the rule of thumb that Cook’s distances (Ds) greater than 0.5 may be problematic. Cook’s D are shown in Fig. 3, which indicates that the assumption of no outliers is a satisfactory one.
In the LPM, a unit increase in an independent variable, holding the other independent variables at fixed values, leads to a constant increase in the dependent variable. Because most of our independent variables in the LPM are binary, this is indeed the case. However, for the two continuous variables, the question is whether this constant “partial derivative” or “marginal” effect of an independent variable on the dependent variable holds. In a logit model, for example, this marginal effect varies with the independent variable. A plot of the residuals versus the two continuous variables in our LPM in Table 11 is shown in Fig. 4. This plot indicates fairly random scatters. Thus, the constant linear marginal effect assumption is satisfactory.
The LPM is a heteroscedastic model by construction. We first used White’s procedure  to correct for it. This LS estimated LPM with its heteroscedasticity consistent |t| values is shown in Table 15.
Because the form of the heteroscedasticity in the LPM is known, we had also used WLS to correct for it. This was done by estimating Y by RR in such a manner so as to guarantee that the predicted Y lies in the unit interval . The plots of the actual observations (which are binary) vs. the RR predictions and the final WLS predictions of our model in Table 11 are shown in the left-hand and right-hand panels of Fig. 5, respectively.
The RR predicted Y in the left-hand panel of Fig. 5 was used to estimate the LPM residual variances (this is the “first stage” LPM), so the problem of non-constant variances in the LPM could be corrected for by using WLS. As can be visually observed in the left-hand panel, all predictions are constrained to lie in the 0-1 range for the smallest value of k . That is, similar to logistic regression, a mathematical constraint has been imposed on the predicted Y. In logistic regression this is done via a transformation; here it is done by constraining the estimated coefficients. It is also known [30, 31] that the RR predicted LPM (in the left-hand panel of Fig. 5) is also a competitive model. That is, one does not necessarily have to compute the predictions in the right-hand panel of Fig. 5, because RR has the capability to account for non-constant residual variances . The thing to note is that the KS drops and the Brier score increases in the left-hand panel of Fig. 5 relative to the corresponding values in the right-hand panel . This may be interpretable as the “cost” for constraining the predicted Y to lie in the 0-1 range, rather than leveraging the asymptotic property that predictions outside the 0-1 range do not matter in large samples.
What is the RR estimated LPM ANOVA?
The RR estimated LPM that generated Fig. 5 is shown in Table 16. The RR estimated standard errors for the coefficients are also shown in Table 16. Because RR produces non-integer degrees of freedom, unlike in classical LS estimation, the resultant RR analysis of variance (ANOVA) table is shown in Table 17 . Note that the RR estimated mean square error (MSE) is about 0.0827, which is very much like the Brier Score in the left-hand panel of Fig. 5. RR produces an approximate, but not exact, F-ratio (for non-stochastic k) , in the ANOVA table in Table 17, which is about 341 . The angle between the regression vector and the residual vector is about 370, which measures “how much the residual vector deviates from being orthogonal to the X space as it is in least squares” .
Is the lasso penalized logit model (LPLM) more appropriate than the LPM to estimate Y?
The concept of the “lasso” penalty to model a dataset in which the number of predictors exceeds the number of observations goes back, at least, to the 1970s [92,93,94]. Later it was made fashionable by describing its applicability to statistical model selection .
The LPLM is estimated by minimizing the negative log-likelihood function, L(·), subject to the constraint that the sum of the absolute values of the coefficients is less than or equal to some value – that is, ||b||1 ≤ t, where log[(Pr(Y=1) ÷ (1 – Pr(Y=1))] = Xb and ||v||p denotes the “p-norm” of a vector v. In contrast, RR uses the constraint (||b||2)2 ≤ t. That is, the sum of squared coefficients, instead of absolute coefficients, is constrained. In Lagrangian form [95, 96], the LPLM optimization problem can be recast as: minimize
–L(·) + λ × sum(|b|). This minimization is done using Nesterov’s method . Here, a regularization parameter, ρi, where 0 < ρ < 1 and i is the ith step in Lasso selection, is used for λ in the ith step. Variable subset selection, via the Lasso, is done, using up to 200 steps, so as to minimize the Schwarz Bayesian Criterion (SBC) . In-sample, several LPLMs are estimated for different values of ρ and the resultant number of variables (excluding the intercept) selected, the KS statistic values, the Brier Scores, and twice the negative log-likelihood values are shown in Table 18. In terms of the SBC, the “optimal” LPLM, on the full sample, is the one with ρ = 0.815; and this is also nearly “optimal” in terms of Brier Score and KS statistic value. To assess LPLM performance out-of-sample as ρ varies, 5-fold cross validation (CV) is done. LPLMs are fit on a set of 4 “training” folds and the Brier Score is calculated on the holdout (i.e., 5th) fold. Thus, for a particular choice of ρ, five out-of-sample Brier Scores are available. The signal-to-noise ratio (SNR), which is μ/σ, of these 5 Brier Scores is calculated. Per the Rose criterion , SNRs above 5 are deemed to be “good” in the engineering literature. The SNRs are shown in the last row of Table 18; and at ρ of 0.82, the SNR is a maximum and satisfies the Rose criterion. Thus, overall, it is reasonable to take ρ = 0.82 as “optimal”.
The resultant LPLM is shown in Table 19. An important question is: if the LPLM selected variables drive O-glycosylation in Nature, would they, post-Lasso, continue to be statistically significant if they were put into a classical logit model to reestimate Y ?
Using the classical logit model, we reestimated Y with the independent variables in Table 19. The classical logit model could not be estimated because of the problem of “quasi-complete separation”, which makes finding the optimal maximum likelihood solution difficult. This problem arises when Y separates a subset of the predictors to a large enough degree. To resolve this problem, we used Firth’s penalized maximum likelihood method [100, 101]. Several predictors (23 of the 51) turn out to be statistically insignificant at the 10% level of significance, and these are noted by asterisks in Table 19. Next, Y is reestimated using the LPM with the predictors in Table 19. The predictors that are statistically insignificant (10 of the 51), in terms of White t-values, at the 10% level of significance are noted in Table 19 by the sun-like symbol “☼”. The LPM also indicates there is no multicollinearity among the predictors, because all of the VIFs are less than 10. However, the DW statistic drops to 0.8, likely indicating there are missing explanatory variables .
Of course, given a set of predictors, the estimated coefficients produced by the classical logit model, the LPM, and the LPLM are not comparable, because different optimization procedures are used to estimate Y. However, a philosophical question arises. Suppose, Nature operates along a logit surface to reveal the probability of O-glycosylation, given sequence and structural information in the neighborhood of the S/T site; and the Lasso has identified the relevant predictors. It appears to us that the selected predictors should, for the most part, be statistically significant, in the classical sense, when they are put into a classical logit model. The fact that 23 of the 51 LPLM predictors are not significant is troubling.
The rates at which the estimated models mispredict Y as 1 when it is empirically 0 (i.e., not O-glycosylated) in the estimation dataset, and 0 when it is empirically 1 (i.e., O-glycosylated), is shown in Table 20, together with the fit statistics. A 50% cutoff probability is used to measure the mispredictions rate.
As can be seen in Table 20, our WLS estimated LPM (Table 11) is performing quite well and does have an edge over the other models. This is not surprising. For example, the LPLM selects a subset of variables by setting certain coefficients to exactly 0. In our context, the interpretation of coefficients that are exactly zero is not easy. Coefficients of variables may be 0 if they are “irrelevant” and uncorrelated with the “relevant” variables. If many variables are irrelevant one possibility is that Nature has “wasted” certain amino acids by placing them in certain positions along the sequence; and another possibility is that they are “silent”, rather than “wasted”. In contrast, RR never sets coefficients to exactly 0. Rather, RR exhibits optimal linear shrinkage of the coefficients . In particular, simulation studies indicate that the performance of RR dominates those of its peers such as partial least squares and principal component regression . If we assume that Nature has not “wasted” amino acids, then RR is better suited to modify coefficients because amino acids in certain positions may have a “small” effect that is not yet visible. That is, the “relevant” amino acids may dominate the “irrelevant” ones (until a “signal” changes that) in determining the three-dimensional structure of the protein, and, hence, its function. In such situations, we do not seek to set, as in the LPLM, some coefficients to exactly zero. Next, we present an example to clarify how fundamental are the amino acids and the positions they occupy along a sequence.
Consider the case of the psi angle. It is not a predictor in Table 11, but is a predictor in Table 19. Because the phi and psi angles go together, an important question is whether the LPM in Table 11 failed to capture it as a predictor, while the LPLM in Table 19 captured it. The answer is “no”; because to a significant degree the psi angle is determined by the amino acids and the positions they occupy. To see this, a linear regression is fit with the psi angle as the dependent variable and sequence information as independent variables. The linear regression model is chosen via two steps. In the first step, stepwise variable selection under 5-fold CV is done so as to minimize the CV PRESS. The minimum CV PRESS attained was about 4,253,609 with 106 independent variables in the specification. Then, in the second step, because the model is heteroscedastic, the White t-values , are used to retain only those variables that have p-values less than 10% – i.e., variable subset selection is done, manually, using White t-values instead of classical t-values. The selected model is in Table 21.
The model in Table 21 has 99 variables, and an adjusted R2 of about 76%. However, its standardized (studentized) residuals are not normally distributed as indicated by the Q-Q plot in Fig. 6 . This is not surprising, because psi is a “circular” measure and has heavy tails as can be seen by its tendency to cluster (with phi) in certain (and remain sparse in other) regions of the X-Y plane, which can be seen in the Ramachandran plot in Fig. 7 . The non-normality of the residuals indicates the need to do more work to normalize psi. However, as fitting a model to psi is not the focus of this paper, we did not pursue that goal at this time (our focus is on answering the question raised in the last paragraph). Nonetheless, the t-values can be viewed as approximations (e.g., SNRs) for model selection by viewing linear regression, in this case, as purely an optimization method. In particular, we held the 99 variables in Table 21 “constant” and redid 5-fold CV 50 times to study the variability in out-of-sample CV PRESS values. We did this because we noted that in the second step used to select the model (based on White’s t-values), the new CVPRESS tends to exceed the minimum CV PRESS of about 4,253,609 – the reduction of 106 variables to 99 variables is fairly small and likely explains this movement. This variability, with its associated summary statistics, is shown in Fig. 8 by plotting the percentage increase in each new CV PRESS over the minimal CV PRESS (i.e., 4,253,609). Figure 8 indicates that this variation is about 4.5% on average, which is not large enough to reject the assumption that the distribution of amino acids along the sequence plays a significant role in determining structure and, hence, function. Future research can more precisely mathematize this role.
The number of times predictors are selected by the LPLM as ρ varies in Table 18 is shown in Table 22. In particular, ρ takes 13 values in Table 18. For each ρ, an optimal LPLM is selected. In all, there are 13 optimal LPLMs. All possible variables are shown in a rectangular format in Table 22. The positions of amino acids along the sequence is shown in an “integer line” format in the penultimate row of Table 22. So, for example, if we are at position number 3 to the right of S/T in Table 22 and go up vertically 6 “cells” we would arrive at the row for amino acid G; and this cell would represent the dummy variable p3G – that is, the dummy variable representing amino acid G in position +3. The variables (predictors) in the LPM of Table 11 are shaded in green in Table 22. For the 13 LPLMs, the frequency with which a predictor occurs in the LPLMs is noted in Table 22. For example, the number 13 in cell (–3, G) in Table 22 means that all of the 13 LPLMs had m3G as a predictor. Similarly, the 13 in cell (8, N) means that p8N was selected as a predictor by all of the 13 LPLMs; and, additionally, because this cell is shaded green, p8N is also a predictor in the selected LPM of Table 11. As another contrasting example, the predictor BH (Beta Hairpin) was only selected in 3 of the 13 LPLMs, but is in the LPM of Table 11. In summary, the wide variation in subsets of selected LPLM predictors, as ρ varies, is captured in Table 22. There are 107 integers (between 1 and 13, inclusive) entered in the cells identifying amino acids and their positions in Table 22; and 44 of these 107 integers are also shaded in green. That is, about 41% (44 ÷ 107) of the explanatory variables in the LPM of Table 11 are also selected by one or more of the 13 LPLMs.
How do fit statistics like the KS statistic and the Brier Score vary for the selected LPM?
The KS statistic is an alternative measure to the well-known Gini coefficient [105,106,107]. The KS statistic for the LPM in Table 11 is 99.2%; and the corresponding Brier Score is 0.0094 (see Fig. 5). To study the out-of-sample variation in KS statistic and Brier Score, we reestimate the coefficients of the LPM in Table 15 under 5-fold CV, 100 times. That is, the predictors of this LPM are held “constant” and only their coefficients are reestimated by pooling the data in 4 of the 5 randomly selected folds; and then calculating the KS statistic and Brier Score on the remaining fold. This exercise is done 5 times for a random selection of 5 folds. That is, 5 LPMs are estimated by partitioning the 5 folds into 5 sets of “training” and “validation” samples; where a training sample includes the data in 4 of the 5 folds, and the remaining fold is used as the validation sample. In turn, random selection of 5 folds is done 20 times. Thus, 100 KS statistics and Brier Scores are produced on sample sizes of about 400 each (the estimation sample size is 2,070).
The LPM in Table 15 instead of the one in Table 11 is selected for this exercise because it is simpler for our purpose. In particular, the difference between the two LPMs is the correction for heteroscedasticity, which is only a problem impacting the standard error of the coefficients, not their consistency. The results of this exercise are shown in Fig. 9, which indicates that the variation in these fit statistics is restricted to fairly narrow bands. Thus, there is nothing unreasonable in LPM performance.
Is “ASA_zero” a valid, rather than spurious, predictor in the LPM?
ASA measures the extent to which a sequence is exposed. As mentioned previously, there are two sequences in our data that have ASA values of 0, but are deemed to be O-glycosylated. In our model, ASA_zero is a statistically significant variable. Because of its special status as an “observation specific” dummy variable, it is not estimable within the classical logit modeling framework . Its statistical significance, and persistence during CV, raises the open question of whether sequences that are not exposed at all have a greater propensity to be O-glycosylated. Although, we do not know the answer to that, we think that until empirical evidence to the contrary comes up, it does not hurt keeping ASA_zero in the LPM. Furthermore, the good news is that dropping it from the model does not change any of our major conclusions. That is, in our case, dropping ASA_zero as an explanatory variable, simply allowed its effect to be distributed among the other coefficients without disrupting anything. However, we did not drop it, because it is of interest to consider (until evidence to the contrary) its marginal contribution toward encouraging O-glycosylation.
The presence of ASA_zero in the LPM, but not ASA, raises another important question: is ASA_zero a spurious dummy variable in the LPM, because the fundamental variable underlying it, ASA, is not in the LPM? The answer is “no”. And we discuss the reason for this next.
The ultimate assumption underlying this paper is that the amino acids and the positions they occupy along the sequence are the “building blocks” leading to O-glycosylation. This means a significant amount of the information carried by ASA should be embedded in the amino acids and the positions they occupy in the sequences. To test this assumption, we searched for a linear regression model with the natural logarithm (log) of ASA as the dependent variable and sequence information as independent variables. Our search followed the same principles that were used to model psi. We used the log transformation to get ASA “closer” to normality, and, as a consequence, the two observations with ASA values of 0 also get eliminated from the estimation dataset (because log(0) is undefined). ASA, like psi, has “heavy” tails. The selected model is shown in Table 23. The model has 103 predictors and an adjusted R2 of about 80%. Its in-sample average squared error (ASE) is about 0.0425. The studentized residual distribution in Fig. 10 show the heavy tails; but, a reasonable amount of symmetry is preserved. The model predicts that the two sequences with Uniprot Accession Numbers P02730 and P 32119, which have empirical ASA values of 0 (see Table 13), have predicted ASA values of about 32.5 and 24.0, respectively.
The information content in ASA generated by the amino acids and the positions they occupy along the sequence can be measured by how well the regression model estimates of log(ASA) can discriminate between O- and non-O- glycosylated sequences (i.e., Y=1 vs. Y=0). The KS statistic would be a measure of this. In the estimation sample, if Y is ranked by estimated log(ASA), the resultant KS statistic is about 32.0%, which is a strong enough value to not reject the assumption that the distribution of amino acids along the sequence does meaningfully drive ASA. Like in the case of psi, we do 5-fold CV (holding the 103 predictors “constant”), 20 times, to study the variability in the resultant 100 realizations of the out of sample ASEs and KS statistics. These variabilities, which are not unreasonable relative to the baseline ASE and KS statistic, are shown in Fig. 11. The baseline ASE and KS statistic are for the model in Table 23, which is fit on the full sample.
Is it valid to use mixed data from PTMs occurring in different parts of the cell to estimate the LPM?
The PTMs occur in different parts of the cell by a very complex process involving many different enzymes. For example, the initial processing of the N-glycosylation occurs in the rough ER and the process ends in the Golgi apparatus; phosphorylation occurs in the cytoplasm/nucleus; and O-GlcNAc proteins are also cytosolic/nuclear. So, the question is whether our “control” dataset (i.e., N-glycosylated sequences) is the correct “negative control”? The answer is “yes”. By virtue of their differences in their mechanisms and cellular compartmentalization, it makes logical sense to use N-glycosylated sequences as the negative control for O-glycosylation. For example, in the absence of a consensus sequon for O-glycosylated sequences, using sequences with the S/T motif within the O-glycosylated set of sequences makes little sense, because there is no guarantee that such sequences will not get O-glycosylated.
The LPM models a binary outcome. For example, when tossing a “fair” coin, if one wants to model (using the LPM, say) whether the coin will come down as “head”, then one needs to start with two distinct data subsets: a set of heads and a set of tails . One cannot have entries in the set of tails (the “control”) that may also be heads. From physics, we know that the outcome of the toss is exact, not probabilistic . However, given the mathematical complexity of, and the instrumentation needed for, making this prediction based on the laws of physics (e.g., Newton’s laws), we use the language of probability, instead of physics, to estimate the outcome . In particular, we say the probability of “head” is 50%. But, although this statement is false in theory, the methodology used is an important approximation that can be quite useful in practice, and for guiding experimental work.
Similarly, our use of the complex concept called “probability” , in this paper does not imply that we are saying there is a fundamental or intrinsic randomness in how a sequence folds and becomes sugar linked; rather, we are saying that given the underlying complexity of this process (as in the case of the coin toss), our methodology, which depends on probabilistic ideas, may represent a reasonable approximation (via rank-ordering) to facilitate predicting whether a sequence is O-glycosylated. Now the way we do this, is by starting with “partial” or “incomplete” information. That is, we start with explanatory variables in the ±10 positions around the S/T site. The distribution of amino acids in this neighborhood is considered. And the structural information at S/T is considered. For example, we have not considered the distribution of amino acids further away from S/T, which must also carry key information. So, for example, the LPM is “unaware” of the existence of the signal peptide in N-glycosylated proteins that sends it to a specific part of the cell.
So, we assumed that sequence and structural information in the neighborhood of the S/T site is sufficient for predicting the likelihood of O-glycosylation. To validate this assumption, we started with Table 9, which indicates that the proportion of O-glycosylated proteins having the N – ~ P – S/T sequon is tiny (1.22% of O-GlcNAc and 0.77% of O-GalNAc proteins). Given this empirical observation, we assumed that the “signature” N – ~ P – S/T discourages O-glycosylation – i.e., it appears to be an approximate inhibitor of O-glycosylation. Now, a reason that only a tiny number of sequences with the N – ~ P – S/T sequon are O-glycosylated may be because the remaining signature around the O-glycosylated S/T site is quite similar to the corresponding one around the N-glycosylated S/T site. And, there may be some “signal” outside of our chosen ±10 neighborhood that may occasionally allow a protein with the N – ~ P – S/T sequon to get O-glycosylated. However, the LPM is not trained on data outside of the ±10 neighborhood and, thus, such a signal, if it were to exist, is unobservable from the LPM’s frame of reference. So, our selecting the N-glycosylated sequences as the negative control set is independent of cell “compartments”. We are simply training the LPM on the signatures around the ±10 neighborhood of the N- and O- glycosylated sequences. Furthermore, if our assumption regarding the signature is unreasonable, we would have seen material deterioration in prediction accuracy under CV, but did not.
The consensus composite sequon for O-glycosylation is ~ (W–S/T–W), where “~” denotes the “not” operator. This means ~W – S/T – W, W – S/T – ~W, or ~W – S/T – ~W are necessary for O-glycosylation. The RR estimated LPM approach was instrumental in making this finding. The consensus sequon for phosphorylation is ~ (W–S/T/Y/H–W); although W–S/T/Y/H–W is not an absolute inhibitor of phosphorylation. Structural attributes (beta turn II, II´, helix, beta bridges, beta hairpin, and the phi angle) are significant predictors of O-GlcNAc glycosylation.
For LPM estimation it is found that N-glycosylated sequences are good approximations to non-O-glycosylatable sequences; although N – ~P – S/T is not an absolute inhibitor of O-glycosylation. The selective positioning of an amino acid along the sequence, differentiates the PTMs of proteins. Some N-glycosylated sequences are also phosphorylated at the S/T-site in the N – ~P – S/T sequon. ASA values for N-glycosylated sequences are stochastically larger than those for O-GlcNAc glycosylated sequences.
The LPM with sequence and structural data as explanatory variables yields a KS statistic value of 99%. With only sequence data, the KS statistic erodes to 80%. With 50% as the cutoff probability for predicting O-GlcNAc glycosylation, this LPM mispredicts 21% of out-of-sample O-GlcNAc glycosylated sequences as not being glycosylated. The 95% confidence interval around this mispredictions rate is 16% to 26%. This study underscores the germaneness of structural information for predicting the likelihood of O-glycosylation.
Several other studies are available for the predicting the likelihood of O-glycosylation. These studies have employed “machine learning” techniques, like random forests and factor analysis, to predict O-glycosylation [111,112,113,114]. However, this approach was not pursued in this paper because the focus herein is on the explanatory process driving O-glycosylation. Thus, classical hypothesis testing is germane to the present work, and algorithmic modeling with its pervasive focus on prediction is not, in this paper, an end in itself. In this sense, Cox’s crucial comment to Breiman is echoed, which is that the “starting point” in this paper is not the data, but the underlying process generating it . Furthermore, echoing Cox, again, the preference in this paper is to avoid proceeding with “a directly empirical black-box approach” in favor of trying to “take account of some underlying explanatory process” that can be simply represented by the LPM. In this context, the Lasso penalized logistic model was also fit to the data. However, it was found that the LPM is quite competitive with it, and provides a simpler mechanistic understanding of the O-glycosylation prediction process.
Data collected on the glycosylated sequences used for analysis herein is stored in an Excel spreadsheet: glycos_public.xlsx. This data, and key operations performed on it, are described in Table 8. Data is collected from two resources (dbOGAP  and PhosphoSitePlus ) as described in Table 8.
Structural data is obtained from the Protein Data Bank (PDB) , and the PDB-ID codes used are provided in glycos_public.xlsx in the column labeled PDBID. In dbOGAP, the total number of UniProt sequences is 376; and, of these, 39 have structural information. Several unique structures exist for each of the 39 sequences. This results in the 39 sequences having a total of 998 structures with unique PDB-IDs. Each structure is unique in terms of different ligands/drugs bound to them. Although there is redundancy in this data from a sequence standpoint, these 39 sequences have different structural attributes as defined by phi-psi angles and ASA, which are important structural variables. In order to make sure that these are real structural differences, similar structures were superposed to calculate a RMSD (root-mean-squared-deviation). It is confirmed that the resolution of the structures is high enough to believe in the subtle conformational changes captured in the data. The 10 residues to the left of the S/T-site, and the 10 residues to the right of the S/T-site, are superposed to see if the S/T-site displays slightly different conformations. An example is presented in Fig. 12. This example supports the rationale behind using all of the redundant sequences for regression model building. It is not surprising that the S-site, in this example, is flexible enough to accommodate its role in a multitude of cellular processes. Residues that have maximum deviation (0.5-1.0Å) are close to the S-site. Residues farther out superimpose with less than 0.1Å RMSD.
The sequence and taxonomy information for the data used in the analysis are extracted from the UniProtKB database (www.uniprot.org) and shown in columns labeled Uniprot_Accession_No and Protein_Organism in glycos_public.xlsx . The ligand information (see column labeled Ligand_Sequence in glycos_public.xlsx) is obtained from the PDBsum database 
Accessible Surface Area (ASA) values (see column labeled ASA in glycos_public.xlsx) are obtained using the ASAView tool and database . The definitions in the ASAView tool are used to classify the amino acids into five categories: positively charged residues (R, K, H), negatively charged residues (D, E), polar uncharged residues (G, N, Y, Q, S, T, W), Cysteine (C), and hydrophobic residues (L, V, I, A, F, M, P).
The ProMotif database is used to obtain the turn types: http://www.ebi.ac.uk/thornton-srv/databases/cgi in/pdbsum/GetPage.pl?pdbcode=n/a&template=doc_promotif.html. These are in glycos_public.xlsx in the column labeled Turn_Type.
Because structural data on O-glycosylated sequences is sparse, two LPMs are estimated. One with sequence and structural data as explanatory variables. The other with only sequence data. The LPM has two principal sources of error: the exact process by which a protein becomes O-glycosylated is unknown; and uncertainty due to ambiguity in what constitutes a set of proteins that cannot be O-glycosylated. So, a suitable sequon is needed for discriminating between the two outcomes: O-glycosylated or not.
Availability of data and materials
The dataset (glycos_public.xlsx), which is in Microsoft Excel, is included as Additional file 1.
Lasso penalized logit model
Linear probability model
predicted error sum of squares
Schwarz Bayesian criterion
Brooks SA. Strategies for analysis of the glycosylation of proteins: current status and future perspectives. Mol Biotechnol. 2009;43(1):76–88.
Joshi HJ, Narimatsu Y, Schjoldager KT, Tytgat HLP, Aebi M, Clausen H, Halim A. SnapShot: O-Glycosylation Pathways across Kingdoms. Cell. 2018;172(3):632–632 e632.
Krieg J, Hartmann S, Vicentini A, Gläsner W, Hess D, Hofsteenge J. Recognition signal for C-mannosylation of Trp-7 in RNase 2 consists of sequence Trp-x-x-Trp. Mol Biol Cell. 1998;9(2):301–9.
Torres CR, Hart GW. Topography and polypeptide distribution of terminal N-acetylglucosamine residues on the surfaces of intact lymphocytes. Evidence for O-linked GlcNAc. J Biol Chem. 1984;259(5):3308–17.
Haltiwanger RS, Holt GD, Hart GW. Enzymatic addition of O-GlcNAc to nuclear and cytoplasmic proteins. Identification of a uridine diphospho-N-acetylglucosamine:peptide beta-N-acetylglucosaminyltransferase. J Biol Chem. 1990;265(5):2563–8.
Hart GW, Housley MP, Slawson C. Cycling of O-linked beta-N-acetylglucosamine on nucleocytoplasmic proteins. Nature. 2007;446(7139):1017–22.
Yang X, Zhang F, Kudlow JE. Recruitment of O-GlcNAc transferase to promoters by corepressor mSin3A: coupling protein O-GlcNAcylation to transcriptional repression. Cell. 2002;110(1):69–80.
Dias WB, Cheung WD, Wang Z, Hart GW. Regulation of calcium/calmodulin-dependent kinase IV by O-GlcNAc modification. J Biol Chem. 2009;284(32):21327–37.
Lazarus MB, Nam Y, Jiang J, Sliz P, Walker S. Structure of human O-GlcNAc transferase and its complex with a peptide substrate. Nature. 2011;469(7331):564–7.
Wells L, Hart GW. O-GlcNAc turns twenty: functional implications for post-translational modification of nuclear and cytosolic proteins with a sugar. FEBS lett. 2003;546(1):154–8.
Capotosti F, Guernier S, Lammers F, Waridel P, Cai Y, Jin J, Conaway JW, Conaway RC, Herr W. O-GlcNAc transferase catalyzes site-specific proteolysis of HCF-1. Cell. 2011;144(3):376–88.
Wells L, Vosseller K, Hart GW. Glycosylation of nucleocytoplasmic proteins: signal transduction and O-GlcNAc. Science (New York). 2001;291(5512):2376–8.
Lubas WA, Hanover JA. Functional expression of O-linked GlcNAc transferase. Domain structure and substrate specificity. J Biol Chem. 2000;275(15):10983–8.
Kreppel LK, Blomberg MA, Hart GW. Dynamic glycosylation of nuclear and cytosolic proteins. Cloning and characterization of a unique O-GlcNAc transferase with multiple tetratricopeptide repeats. J Biol Chem. 1997;272(14):9308–15.
Haltiwanger RS, Blomberg MA, Hart GW. Glycosylation of nuclear and cytoplasmic proteins. Purification and characterization of a uridine diphospho-N-acetylglucosamine:polypeptide beta-N-acetylglucosaminyltransferase. J Biol Chem. 1992;267(13):9005–13.
Dong DL, Hart GW. Purification and characterization of an O-GlcNAc selective N-acetyl-beta-D-glucosaminidase from rat spleen cytosol. J Biol Chem. 1994;269(30):19321–30.
Lubas WA, Frank DW, Krause M, Hanover JA. O-Linked GlcNAc transferase is a conserved nucleocytoplasmic protein containing tetratricopeptide repeats. J Biol Chem. 1997;272(14):9316–24.
Copeland RJ, Bullen JW, Hart GW. Cross-talk between GlcNAcylation and phosphorylation: roles in insulin resistance and glucose toxicity. Am J Physiol Endocrinol Metab. 2008;295(1):E17–28.
Yang X, Ongusaha PP, Miles PD, Havstad JC, Zhang F, So WV, Kudlow JE, Michell RH, Olefsky JM, Field SJ, et al. Phosphoinositide signalling links O-GlcNAc transferase to insulin resistance. Nature. 2008;451(7181):964–9.
Brownlee M. Biochemistry and molecular cell biology of diabetic complications. Nature. 2001;414(6865):813–20.
Caldwell SA, Jackson SR, Shahriari KS, Lynch TP, Sethi G, Walker S, Vosseller K, Reginato MJ. Nutrient sensor O-GlcNAc transferase regulates breast cancer tumorigenesis through targeting of the oncogenic transcription factor FoxM1. Oncogene. 2010;29(19):2831–42.
Wright JN, Collins HE, Wende AR, Chatham JC. O-GlcNAcylation and cardiovascular disease. Biochem Soc Trans. 2017;45(2):545–53.
Banerjee PS, Lagerlof O, Hart GW. Roles of O-GlcNAc in chronic diseases of aging. Mol Aspects Med. 2016;51:1–15.
Wani WY, Ouyang X, Benavides GA, Redmann M, Cofield SS. O-GlcNAc regulation of autophagy and α-synuclein homeostasis; implications for Parkinson’s disease. Mol Brain. 2017;10(1):32.
Ma X, Li H, He Y, Hao J. The emerging link between O-GlcNAcylation and neurological disorders. Cell Mol life Sci. 2017;74(20):3667–86.
Ho WL, Hsu WM, Huang MC, Kadomatsu K, Nakagawara A. Protein glycosylation in cancers and its potential therapeutic applications in neuroblastoma. J Hematol Oncol. 2016;9(1):100.
Hoerl AE, Kennard RW. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics. 1970;12:55–67.
Goldberger AS. Econometric Theory. New York: Wiley; 1964.
Gana R. Ridge regression estimation of the linear probability model. J Appl Stat. 1995;22(4):537–9.
Saccucci MS. Effect of variance-inflated outliers on least squares and ridge regression. Newark: University of Delaware (unpublished PhD dissertation; 1985.
Monyak JT: Mean squared error properties of the ridge regression estimated linear probability model. unpublished PhD dissertation 1998.
McGillivray RG. Estimating the linear probability function. Econometrica. 1970;30:775–6.
Amemiya T. Some theorems in the linear probability model. Int Econ Rev. 1977;18(3):645–50.
Gana R, Naha S, Mazumder R, Goldman R, Vasudevan S. Ridge Regression Estimated Linear Probability Model Predictions of N-glycosylation in Proteins with Structural and Sequence Data. ArXiv. 2018.
Wilcoxon F. Individual comparisons by ranking methods. Biometrics Bull. 1945;1:80–3.
Wang J, Torii M, Liu H, Hart GW, Hu ZZ. dbOGAP - an integrated bioinformatics resource for protein O-GlcNAcylation. BMC Bioinformatics. 2011;12:91.
Hornbeck PV, Zhang B, Murray B, Kornhauser JM, Latham V, Skrzypek E. PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. Nucleic Acids Res. 2015;43(Database issue):D512–20.
Burley SK, Berman HM, Christie C, Duarte JM, Feng Z, Westbrook J, Young J, Zardecki C. RCSB Protein Data Bank: Sustaining a living digital data resource that enables breakthroughs in scientific research and biomedical education. Protein Sci. 2018;27(1):316–30.
UniProt C. UniProt: a hub for protein information. Nucleic Acids Res. 2015;43(Database issue):D204–12.
Laskowski RA, Hutchinson EG, Michie AD, Wallace AC, Jones ML, Thornton JM. PDBsum: a Web-based database of summaries and analyses of all PDB structures. Trends Biochem Sci. 1997;22(12):488–90.
Ahmad S, Gromiha M, Fawareh H, Sarai A. ASAView: database and tool for solvent accessibility representation in proteins. BMC Bioinformatics. 2004;5:51.
Cohen P. The origins of protein phosphorylation. Nat Cell Biol. 2002;4(5):E127–30.
Fuhs SR, Meisenhelder J, Aslanian A, Ma L, Zagorska A, Stankova M, Binnie A, Al-Obeidi F, Mauger J, Lemke G, et al. Monoclonal 1- and 3-Phosphohistidine Antibodies: New Tools to Study Histidine Phosphorylation. Cell. 2015;162(1):198–210.
Wilson EB. Probable inference, the law of succession, and statistical inference. J Am Stat Assoc. 1927;22:209–12.
Whitehead AN, Russell BAW. Principia Mathematica to *56; 1910.
Hart GW. Glycosylation. Curr Opin Cell Biol. 1992;4(6):1017–23.
Gavel Y, von Heijne G. Sequence differences between glycosylated and non-glycosylated Asn-X-Thr/Ser acceptor sites: implications for protein engineering. Protein Eng. 1990;3(5):433–42.
Ben-Dor S, Esterman N, Rubin E, Sharon N. Biases and complex patterns in the residues flanking protein N-glycosylation sites. Glycobiology. 2004;14(2):95–101.
Efroymson MA. Multiple Regression Analysis. In: Mathematical Methods for Digital Computers. New York: Wiley; 1960.
Hocking RR. A Biometrics Invited Paper. The Analysis and Selection of Variables in Linear Regression. Biometrics. 1976;32(1):1–49.
Hoerl RW, Schuenemeyer JH, Hoerl AE. A simulation of biased estimation and subset selection regression techniques. Technometrics. 1986;28:369–80.
Larson SC. The shrinkage of the coefficient of multiple correlation. J Educ Psychol. 1931;22(1):45–55.
Mosteller F, Wallace DL. Inference in an authorship problem: A comparative study of discrimination methods applied to the authorship of the disputed Federalist Papers. J Am Stat Assoc. 1963;58:275–309.
Mosteller F, Tukey JW. Data analysis, including statistics. Handbook Soc Psychol. 1968;2:80–203.
Stone M. Cross-validatory choice and assessment of statistical problems. J R Stat Soc. 1974;36(1):103–6.
Geisser S. The predictive sample reuse method with applications. J Am Stat Assoc. 1975;70:320–8.
Brier GW: Verification of forecasts expressed in terms of probability. Mon Weather Rev 1950, 78:1-3.
Murphy A. A new vector partition of the probability score. J Appl Meteorol. 1970;12:695–700.
White H. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica. 1980;48:817–38.
Smirnov NV. On the estimation of the discrepancy between empirical curves of distribution for two independent samples, vol. 2. Moscow: Bulletin of the University of Moscow; 1939. p. 3–14.
Smirnov NV. Table for Estimating the Goodness of Fit of Empirical Distributions. Ann Math Stat. 1948;19(2):279–81.
Kolmogorov A. Sulla determinazione empirica di una lgge di distribuzione. InstItalAttuari Giorn. 1933;4:83–91.
Feller W. On the Kolmogorov-Smirnov Limit Theorems for Empirical Distributions. Ann Math Stat. 1950;21(2):301–2.
Doob JL. Heuristic approach to the Kolmogorov-Smirnov theorems. Ann Math Stat. 1949;20:393–403.
Durbin J, Watson GS. Testing for serial correlation in least squares regression I. Biometrika. 1950;37:409–28.
Durbin J, Watson GS. Testing for serial correlation in least squares regression II. Biometrika. 1951;38:159–77.
Blom N, Sicheritz-Ponten T, Gupta R, Gammeltoft S, Brunak S. Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics. 2004;4(6):1633–49.
Thanka Christlet TH, Veluraja K. Database analysis of O-glycosylation sites in proteins. Biophys J. 2001;80(2):952–60.
Salkever DS. The use of dummy variables to compute predictions, prediction errors, and confidence intervals. J Econ. 1976;4:393–7.
Anderson GJ. Prediction tests in limited dependent variable models. J Econ. 1987;34:253–61.
Caudill SB. An advantage of the linear probability model over probit or logit. Oxford Bull Econ Stat. 1988;50:425–7.
Caudill SB. Dichotomous choice models and dummy variables. Statistician. 1987;36(4):381–3.
Oksanen EH. A Note on Observation-Specific Dummies and Logit Analysis. J R Stat Soc Series D (The Statistician). 1986;35(4):413–6.
Mann HB, Whitney DR. On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat. 1947;18(1):50–60.
Benjamin DJ, et al. Redefine Statistical Significance. Nat Hum Behav. 2018;2:6–10.
Cox DR. Analysis of Binary Data; 1970.
Gujarati DN. Basic Econometrics; 1995.
Judge G, Hill C, Griffiths W, Lee T. The Theory and Practice of Econometrics; 1985.
Maddala GS. Introduction to Econometrics; 1992.
Takeshi A. Advanced Econometrics; 1985.
Wooldridge JM. Introductory Econometrics: A Modern Approach; 2016.
Judge G, Takayama T. Inequality restrictions in regression analysis. J Am Stat Assoc. 1966;61(313):166–81.
Goldfeld SM, Quandt RE. Nonlinear Methods in Econometrics; 1972.
Hensher DA, Johnson LW. Applied Discrete Choice Modeling; 1981.
Mullahy J. Weighted least squares estimation of the linear probability model revisited. Econ Lett. 1990;32(1):35–41.
Cox DR. The regression analysis of binary sequences. J R Stat Soc. 1958;20(2):215–42.
Schneider I, De Moivre A. The Doctrine of Chances (1718, 1738, 1756), Grattan-Guinness, I Landmark Writings in Western Mathematics; 2005. p. 1640–940. 1105-1120
Cook DR. Influential observations in linear regression. J Am Stat Assoc. 1979;74(365):169–74.
Cook DR. Detection of influential observations in linear regression. Technometrics. 1977;19(1):15–8.
Hoerl AE, Kennard RW. Ridge regression: degrees of freedom in the analysis of variance. Commun Stat. 1990;19:1485–95.
Obenchain RL. Classical F-tests and confidence regions for ridge regression. Technometrics. 1977;19:429–39.
Santosa F, Symes WW. Linear inversion of band-limited reflection seismograms. J Sci Stat Comput SIAM. 1986;7(4):1307–30.
Taylor HL, Banks SC, McCoy JF. Deconvolution with the ℓ1 norm. Geophysics. 1979;44:39–52.
Tibshirani R. Regression Shrinkage and Selection via the lasso. J R Stat Soc Ser B Methodol. 1996;58(1):267–88.
Bussotti P. On the Genesis of the Lagrange Multipliers. J Optim Theory Appl. 2003;117(3):453–9.
Lagrange JL. Mecanique Analitique; 1811.
Nesterov Y. Gradient Methods for Minimizing Composite Objective Function. Math Program. 2013;140:125–62.
Schwarz GE. Estimating the Dimension of a Model. Ann Stat. 1978;6:461–4.
Rose A. Vision - Human and Electronic; 1973.
Firth D. Reduction of Maximum Likelihood Estimates. Biometrika. 1993;80:27–38.
Heinze G, Schemper M. A Solution to the Problem of Separation in Logistic Regression. Stat Med. 2002;21:2409–19.
Frank IE, Friedman JH. A statistical view of some chemometrics regression tools. Technometrics. 1993;35:109–35.
Wilk MB, Gnanadesikan R. Probability plotting methods for the analysis of data. Biometrika. 1968;55(1):1–17.
Ramachandran GN, Ramakrishnan C, Sasisekharan V. Stereochemistry of polypeptide chain configurations. J Mol Biol. 1963;7(1):95–9.
Ceriani L, Verme P. The origins of the Gini index: extracts from Variabilita e Mutabilita (1912) by Corrado Gini. J Econ Ineqal. 2012;10(3):421–43.
Gini C. Variabilita e Mutabilita; 1912.
Gini C. Measurement of inequality and incomes. Econ J. 1921;31:124–6.
Keller J. The probability of heads. Am Math Mon. 1986;93(3):191–7.
Diaconis P, Mazur BC. The problem of thinking too much. Bull Am Acad Arts Sci. 2003;56:26–38.
Ergodos N. The enigma of probability. J Cogn Neuroethics. 2014;1(2):37–71.
Yang X, Han H. Factors analysis of protein O-glycosylation site prediction. Comput Biol Chem. 2017;71:258–63.
Hassan H, Badr A, Abdelhalim MB. Prediction of O-glycosylation Sites Using Random Forest and GA-Tuned PSO Technique. Bioinform Biol Insights. 2015;9:103–9.
Chen Y, Zhou W, Wang H, Yuan Z. Prediction of O-glycosylation sites based on multi-scale composition of amino acids and feature selection. Med Biol Eng Comput. 2015;53(6):535–44.
Breiman L. Random forests. Mach Learn. 2001;45:5–32.
Breiman L, Modeling S. Stat Sci. 2001;16(3):199–231.
We thank Swagata Naha, who was a graduate student at Georgetown University, for collecting data on some of the sequences. We thank Zhang-Zhi Hu for providing the data mined for the creation of dbOGAP during his tenure, as a faculty member, at Georgetown University. We thank Elliott Crooke, Professor and Chair, Department of Biochemistry and Molecular & Cellular Biology, and Senior Associate Dean, Faculty and Academic Affairs, Georgetown University, for his support and encouragement. We thank Dr. Sarma Dittakavi, Professor Emeritus, Laboratory Medicine and Pathobiology, University of Toronto, for reading the manuscript, providing valuable comments, and for the many stimulating discussions that we have had together. We thank the Editors and the two anonymous reviewers for their valuable comments.
No funding was obtained for this study.
Ethics approval and consent to participate
Not applicable. This study does not utilize any data that requires ethical approval or consent to participate.
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Gana, R., Vasudevan, S. Ridge regression estimated linear probability model predictions of O-glycosylation in proteins with structural and sequence data. BMC Mol and Cell Biol 20, 21 (2019). https://doi.org/10.1186/s12860-019-0200-9
- consensus sequon
- probability model
- ridge regression