 Research article
 Open access
 Published:
Ridge regression estimated linear probability model predictions of Oglycosylation in proteins with structural and sequence data
BMC Molecular and Cell Biology volume 20, Article number: 21 (2019)
Abstract
Background
Todate, no claim regarding finding a consensus sequon for Oglycosylation has been made. Thus, predicting the likelihood of Oglycosylation with sequence and structural information using classical regression analysis is quite difficult. In particular, if a binary response is used to distinguish between Oglycosylated and nonOglycosylated sequences, an appropriate set of nonOglycosylatable sequences is hard to find.
Results
Three sequences from similar posttranslational modifications (PTMs) of proteins occurring at, or very near, the S/Tsite are analyzed: Nglycosylation, Omucin type (OGalNAc) glycosylation, and phosphorylation. Results found include: 1) The consensus composite sequon for Oglycosylation is: ~(W–S/T–W), where “~” denotes the “not” operator. 2) The consensus sequon for phosphorylation is ~(W–S/T/Y/H–W); although W–S/T/Y/H–W is not an absolute inhibitor of phosphorylation. 3) For linear probability model (LPM) estimation, Nglycosylated sequences are good approximations to nonOglycosylatable sequences; although N – ~P – S/T is not an absolute inhibitor of Oglycosylation. 4) The selective positioning of an amino acid along the sequence, differentiates the PTMs of proteins. 5) Some Nglycosylated sequences are also phosphorylated at the S/Tsite in the N – ~P – S/T sequon. 6) ASA values for Nglycosylated sequences are stochastically larger than those for OGlcNAc glycosylated sequences. 7) Structural attributes (beta turn II, II´, helix, beta bridges, beta hairpin, and the phi angle) are significant LPM predictors of OGlcNAc glycosylation. The LPM with sequence and structural data as explanatory variables yields a KolmogorovSmirnov (KS) statistic of 99%. 8) With only sequence data, the KS statistic erodes to 80%, and 21% of outofsample OGlcNAc glycosylated sequences are mispredicted as not being glycosylated. The 95% confidence interval around this mispredictions rate is 16% to 26%.
Conclusions
The data indicates the existence of a consensus sequon for Oglycosylation; and underscores the germaneness of structural information for predicting the likelihood of Oglycosylation.
Background
The process of posttranslational modification (PTM) of proteins has allowed cells to vastly increase the otherwise normal array of functions that proteins carry out. PTMs are essential modifications that occur in higher eukaryotes. Glycosylation is one type of modification that chemically alters proteins by the enzymatic or nonenzymatic addition of glycans. There are two main types of glycosylation present: Nlinked and Olinked [1]. There are two types of Olinked glycosylation in cells: mucintype Olinked glycosylation (OGalNAc); and several types of nonmucin Oglycans, which include αlinked Ofucose, βlinked Oxylose, αlinked Omannose, βlinked OGlcNAc (NAcetyl Glucosamine), α or βlinked Ogalactose, α or βlinked Oglucose glycans, and OGlcNAcylation (or OGlcNAc). In addition, Clinked, Slinked and nonenzymatic glycation processes exist [2]. The consensus sequon for Nlinked glycosylation is N – X – S/T, where X is not proline. Two consensus sequons for Clinked glycosylation are: W – X – X – W and W – S/T – X – C [3]. No consensus sequon for Oglycosylation is known thus far.
OGlcNAc modification is the attachment of βNAcetyl Glucosamine (GlcNAc) to protein serine (S) or threonine (T) amino acid residues via the Olinked glycosylation process [4]. OGlcNAcylation occurs primarily in nucleocytoplasmic proteins [5, 6]. OGlcNAcylation mechanism targets numerous transcription factors, tumor suppressors, kinases, phosphatases, and histonemodifying proteins [6,7,8,9]. Furthermore, the monoaddition of GlcNAc is suggested to be the most significant glycosylation process in the regulation of metazoan cytosolic and nuclear proteins [10, 11]. Often, the S/T residues, which are targeted by OGlcNAcylation, are also targets of kinases and phosphatases. Hence, the consensus sequon for Olinked glycosylation can also be the one for phosphorylation. These residues are both dynamic and inducing in nature [12]. Thus, to summarize, OGlcNAcylation plays a major role in modulating kinaseinduced signal transduction pathways [13], supporting the notion that OGlcNAcylation has an important biological role.
The source of GlcNAc for OGlcNAcylation in eukaryotic cells is UDPNAcetyl Glucosamine [14], UDPGlcNAc, obtained from glucose in the hexosamine biosynthetic pathway (HBP). The attachment process of OGlcNAcylation is catalyzed via Olinked βNAcetyl Glucosamine transferase [15, 16] or OGlcNAc transferase and is removed by Olinked βNacetyl glucosaminidase or OGlcNAcase [16]. Currently, OGlcNAc transferase is the only known enzyme in mammals responsible for this addition of GlcNAc [17]. OGlcNAc transferase is found in the nucleus, mitochondria, and cytoplasm of cells.
OGlcNAc transferase has important roles in many biological processes and diseases. These include type 2 diabetes and its complications, such as blindness, renal failure, and nerve damage; diabetesaccelerated atherosclerosis; dyslipidemia; cardiovascular diseases; aging and cancer [18,19,20,21,22,23]. Additionally, OGlcNAc transferase is associated with tauopathies, including Alzheimer’s disease, dementia, and Parkinson’s disease [24, 25]. Recently, the role of glycosylation in neuroblastoma has taken center stage [26].
The common underlying cause of such diseases is due to the over or under active process of OGlcNAcylation. With the complex metabolic processes in OGlcNAcylation and its critical role in the regulation of cellular processes, it is not unusual to consider abnormal OGlcNAcylation activity as contributing to various diseased states. This concept suggests that an indepth understanding of sequence and structural parameters that drive OGlcNAcylation may shed some light on Olinked glycosylation.
The goal of this paper is to use the ridge regression (RR) estimated linear probability model (LPM) to predict the likelihood of OGlcNAc glycosylation in human proteins [27,28,29,30,31,32,33]. A detailed description of the statistical methodology used herein is in a predecessor paper [34]. The LPM models a binary outcome: OGlcNAc glycosylated or not. So, the LPM estimation dataset requires two data subsets: OGlcNAc glycosylated sequences and sequences that cannot be OGlcNAc glycosylated. Experimentally one can determine if a protein is Oglycosylated. Thus, getting data on Oglycosylated proteins is straightforward. Getting data on proteins that are highly unlikely (or impossible) to be Oglycosylated is quite challenging, because a consensus sequon for Oglycosylation is needed. To understand whether a true, or even approximate, consensus sequon for Oglycosylation exists, a family of similar sequences with PTMs of proteins occurring at the S/Tsite is collected. This family includes sequences on the following PTMs of proteins: Nglycosylation, OGalNAc glycosylation, and phosphorylation. The rest of this paper documents the analyses done with all of this collected data.
Results
Results from examining the empirical data
The distribution of groups of residues around the OGlcNAc glycosylated proteins, in dataset dbogapseq, is shown in Table 1. It is striking that the immediate vicinity of the S/Tsite is predominantly polar uncharged and hydrophobic at position –1 and +1, respectively.
The marginal distributions of residues in the vicinity of the OGlcNAc and OGalNAc glycosylated sequences are shown in Tables 2 and 3, respectively. About 10% or more of the sequences in positions –4 thru –8 have S, P, or A in Table 2; positions –2 and –3 are proline rich; and amino acids C, F, M, and W do not occur much. There are some differences between amino acid preferences in OGlcNAc and OGalNAc glycosylated sequences. As reported by earlier studies, the vicinity of the S/Tsite in OGalNAc glycosylated sequences is proline rich.
The marginal distributions of residues in phosphorylated sequences are shown in Table 4. Along the sequence, S tends to show up the most in phosphorylated sequences. There are higher percentages of charged amino acids in these sequences, as well.
While visual inspection of amino acids in the vicinity of these sites provide some indication of their preferences and charge distributions, a subjective analysis alone is insufficient. Therefore, statistical analysis of the data is done in this paper to provide more rigorous indications of the same. Empirical analysis suggests that given a set of sequences for a particular PTM of proteins, the proportions of an amino acid occupying positions (e.g., ±8 from the S/Tsite) along a sequence can discriminate between two PTMs of proteins. This is tested using the Wilcoxon signedrank test [35] and the results are shown in Table 5.
For example, in Tables 2 and 4, the proportions of occurrence of amino acid F along seven positions to the left/right of the S/T and S/T/Y/H sites in OGlcNAc glycosylated and phosphorylated sequences, respectively, appear different under pairwise comparison (i.e., the amino acid proportions along one row of one Table are compared, pairwise, with the proportions of that same amino acid along the corresponding row of the other Table). The Wilcoxon signedrank test is used to determine if these two sets of proportions are statistically different. As shown in Table 5, detected differences are flagged as “1” and by “0” otherwise. Table 5 indicates that using amino acids in the positions they occupy as “explanatory variables” to predict Oglycosylation is reasonable. In contrast, the proportions of amino acids in a particular position for one PTM process are not statistically different from the proportions of amino acids in that same position for another PTM process. For example, the Wilcoxon signedrank test applied to compare, pairwise, the proportions in any column of Table 2 with those in the corresponding column of Table 3 shows the pairwise differences are not statistically significant. This same result holds when comparing the columns of Table 2 with the corresponding ones in Table 4; and those of Table 3 with the corresponding ones in Table 4.
Finally, it is of interest to know whether sequons involved in the PTMs of proteins are unique to each kind of PTM process. For example, given the consensus sequon for Nglycosylation is N–X–S/T, one can find out if in the collected data there are Nglycosylated sequences that are also Oglycosylated or phosphorylated at the S/Tsite, which is at the 2^{nd} position to the right of N in this sequon. UniProt Accession numbers and UniProt position pairs are compared across PTMs. Because Nglycosylation occurs at N, the UniProt positions of Nglycosylated sequences, in the data, are increased by 2 so the S/Tsites can be compared across the other PTMs. In the empirical data, there are no Nglycosylated sequences with UniProt Accession No. and UniProt position +2 pairs that match the UniProt Accession No. and UniProt position pairs of OGlcNAc or OGalNAc glycosylated sequences. This is as expected, because the structures of N and Olinked oligosaccharides are very different, and reflect differences in their biosynthesis and cellular localizations. Thus, the Nlinked site cannot be Oglycosylated at the S/Tsite of the N–X–S/T sequon. However, there is a small number of Nglycosylated sequences that are phosphorylated. Table 6 presents the counts of sequences in the empirical data that are common between the different PTMs of proteins considered herein (i.e., intersections of two sets of sequences are calculated).
Table 7 lists the 13 proteins that are Nglycosylated at N and phosphorylated at the second position to the right of N, which is S/T. The majority of these proteins are receptors involved in signaling and are phosphorylated via a threonine residue. What roles these phosphorylation sites play require further analysis and experimentation.
Search for a Consensus Sequon for Oglycosylation
Data (see Table 8) on three PTMs of proteins, increasing in sample size, is collected to identify a consensus sequon: OGlcNAc glycosylation, OGalNAc glycosylation, and phosphorylation, which is an important and pervasive PTM of proteins [42] that occurs, generally, at the S/T/Ysite. It has also been found to occur at the Hsite [43]. Several sequons are tested to see if a consensus sequon exists. The outcome of these tests is shown in Table 9. As can be noted, the sequon W – S/T – W does not occur in the collected set of Oglycosylated sequences. Thus, ~ (W – S/T – W) is likely a consensus composite sequon for Oglycosylation, where, following Principia Mathematica [45], “~” denotes the “not” operator. This means there are 3 sequons that satisfy ~ (W – S/T – W): ~W – S/T – W, W – S/T – ~W, or ~W – S/T – ~W.
An important question is whether, as sample size increases, W – S/T – W continues not to occur at all in Oglycosylated sequences. The answer to this question has to wait until much larger sets of Oglycosylated sequences are available for study. In the meantime, one can approach this question in another way. The N – ~P – S/T sequon is considered a necessary [46, 47], but not sufficient, condition for Nglycosylation, which is another important PTM of proteins. Its likelihood is modeled in Gana et al. [34]. In the strictest sense, this is a true statement if the occurrence of P in the position between N and S/T is an “absolute” inhibitor of Nglycosylation [48] – i.e., if the N – ~P – S/T sequon is never observed in Nglycosylated sequences. However, this is not the case. Rather, the word “absolute” is used in a probabilistic sense: the fact that N – P – S/T occurs “rarely” is sufficient for N – ~P – S/T to be termed a “consensus” sequon. In particular, a partial search for the N–P–S/T sequons in UniProt revealed 23 such sequons associated with Nglycosylated human proteins. In contrast, 16 (0.0066% + 0.004% = 0.007%) of the 227,810 phosphorylated human sequences in Table 9 have the W – S/T/Y/H – W sequon. Finally, a search of UniProt for human proteins with the W – S/T – W sequon yielded 236 sequences (in the WSTWUniprot dataset of Table 8). It is found that none of them are Oglycosylated at S/T.
The sequon for Clinked glycosylation is W–X–X–W or W–S/T–X–C. Analysis of the data indicates that the presence of W immediately before and immediately after the S/Tsite (S/T/Y/Hsite) disallows Oglycosylation (phosphorylation). Furthermore, the occurrence of N two positions to the left of the S/Tsite is relatively rare in Oglycosylation. Thus, nature appears to have evolved distinct sequons and amino acid propensities unique to each of the PTMs discussed herein.
Table 9 indicates that the majority (98.54% + 0.24% = 98.78%) of OGlcNAc glycosylated sequences have the ~N – X – S/T sequon. In contrast, these sequences having the N – ~P – S/T sequon is small (1.22%). Thus, for LPM estimation, the set of Nglycosylated sequences in Gana et al. [34] is considered to be a reasonable approximation for nonOglycosylated sequences. By discriminating between the features of O and nonO glycosylated sequences, respectively, the LPM estimates the likelihood of Oglycosylation. Because 1.22%, rather than 0%, of the Oglycosylated sequences have the N – ~P – S/T sequon, the ambiguity regarding what constitutes a set of sequences that cannot be Oglycosylated has not been completely removed. However, that ambiguity has been considerably reduced.
Modeling Strategy
Using the data in Gana et al. [34], OGlcNAc glycosylation is modeled by including the amino acids in 8 positions to the right, and 8 positions to the left, of the S/Tsite as possible explanatory variables. The collected data on Nglycosylated sequences described in Gana et al. [34] have 10 positions to the right, and 10 positions to the left, of the Nsite. Because training the LPM to discriminate between O and not O glycosylated proteins must focus on the S/Tsite as center for all proteins in the estimation dataset, only 8 positions to the right of S/T are available for use in that data. The remaining possible explanatory variables are the collected structural data on the proteins.
When the LPM is applied to outofsample data, other types of Oglycosylated sequences may be a part of it. To assess the sensitivity of the mispredictions rate, the LPM of OGlcNAc glycosylation is applied to a sample of OGalNAc glycosylated sequences. The modeling and validation strategy is summarized in Table 10.
LPM predictions using sequence and structural data
Based on the collected data (glycos_public.xlsx), as in Gana et al.[34], the i^{th} position to the left (right) of S/T is minus_{i} (plus_{i}). If amino acid α occupies i, then: minus_{iα} = 1 if α is to the left of S/T and plus_{iα} = 1 if α is to the right of S/T; otherwise, these indicator variables take the value 0. Thus, 140 (20 × 8 – 20) and 160 (20 × 8) indicator variables define the left and right neighborhoods of S/T, respectively. The ratio (pos_{j}) of the position of the glycosylation site (S/T or N) to the total length of protein j is also assumed to be an explanatory variable. Finally, it is assumed that the structural information on the sequences are additional explanatory variables.
Let Y_{j} be the binary dependent variable that takes the value 1 if sequence j is Oglycosylated and 0 if it is not (Nglycosylated). Then, the postulated LPM (indexed as Eq. (1)) is:
where β_{0} , β_{iα} , φ_{iα} , λ , δ , π , θ_{k} , ω_{1} and ω_{2} are constants (coefficients) estimated using the data; Σ_{α} , Σ_{i} denote summations over significant amino acid and position combinations, respectively, and Σ_{k} denotes summation over significant turn types. The variable ASA_zero is a dummy variable that takes the value 1 if ASA is zero, and 0 otherwise. minus_{2α} is dropped so as not to bias the LPM, because the set of nonOglycosylated proteins only has an N (i.e., there is no amino acid variation, by design) at minus 2. Thus, retaining minus_{2α} with α ≠ N will flag amino acids (other than N) present at minus 2 only whenever Y_{j} = 1, which would be inappropriate. Amino acids A, D, F and K dominate the 998 human proteins at minus 2; and D, F, K, R and S dominate the remaining 107 nonhuman proteins. The set of Oglycosylated sequences have no N at minus 2. Also, note that WSTWUniprot is not used as the set of nonOglycosylated sequences, because it will lead to the perfect, but trivial, model: Y_{j} = 1 – minus_{1W}. In this case, one option is to estimate the LPM by ignoring the two amino acids on either side of the S/Tsite. However, this does not improve LPM predictive capability materially. More importantly, it would be of interest to find sequences with the W–S/T–W sequon that do “something” (e.g., Cglycosylation with W–S/T–W–W or W–S/T–W–C sequons) rather than “nothing” (or staying “silent”); and then refit the LPM to see how well it discriminates between Oglycosylated and nonOglycosylatable sequences. In addition, if structural data is included for LPM estimation, solid predictive power would likely be the outcome.
The selected LPM, shown in Table 11, is estimated on 1,083 (human) Nglycosylated proteins and 987 Oglycosylated proteins. As in Gana et al. [34], Nglycosylated sites with a single sugar are deleted from the estimation data because these may be false positives given the known artifacts from crystallization. As detailed in Gana et al. [34], the LPM is selected using the stepwise selection method [49,50,51] in conjunction with 10fold crossvalidation [52,53,54,55,56] to minimize the prediction residual sum of squared errors (PRESS), which has a point of contact with the Brier Score [57, 58]. Because the LS estimated LPM produces heteroscedastic errors (i.e., residuals with varying variances), it was confirmed that stepwise selected significant variables continued to remain significant when reevaluated in terms of heteroscedasticityconsistent standard errors [59], as a first approximation. Because the exact functional form of the LPM error variance is known, RR is applied to the stepwise selected LPM to reconfirm that all variables continue to be statistically significant. This is done by reestimating the stepwise selected LPM by RR so the predicted values of Y_{j} are forced to lie in the 01 interval and estimating the error variances [29]: (RR predicted Y_{j} ) × (1 – RR predicted Y_{j} ). The optimal value of the RR tuning parameter, k, for which all of the aforesaid predictions fall in the range 01 is 3.77895. Using these variances as weights, the LPM is reestimated using classical weighted LS (WLS). If variables with pvalues greater than 5% result, the entire LPM estimation process is redone until all variables are significant at 5%.
The number of amino acid and position combinations found to be significant in the selected LPM is 76. Model parsimony (e.g., favoring retaining fewer, rather than more, explanatory variables) is not sought, because it is not at all clear that Nature is “inefficient”, or provides redundant information, by making irrelevant the placement of certain amino acids in certain positions along the sequence. The selected LPM yields an insample KolmogorovSmirnov (KS) statistic of about 99.2%, which signals a reasonable degree of discrimination between O and nonO glycosylated proteins. The maximum separation for the KS statistic [60,61,62,63,64] occurs at the predicted probability value of about 43%. The average (insample) predicted probabilities of Oglycosylation when Y_{j} = 0 and Y_{j} = 1 are about 1% and 98%, respectively. The standard deviations of predicted probabilities of Oglycosylation when Y_{j} = 0 and Y_{j} = 1 are about 11% and 7%, respectively. The DurbinWatson (DW) statistic [65, 66] for the LPM is about 1.5.
The nine mispredictions made by the LPM, insample, are documented in Table 12. The accessions in Table 12 are not annotated in UniProt to be glycosylated except for Q16566 (Olinked), Q92854 (Nlinked), and P31749 (phosphorylated). Thus, how many sequences in Table 12 are truly false positives, remains to be seen.
It has been observed that amino acid P is usually present around Oglycosylation sites; and that Oglycosylation tends to occur more in the betastrands of proteins [67, 68] indicating the significance of proline around the S/Tsite. However, Table 11 indicates that the likelihood of Oglycosylation tends to be more favorable in the beta hairpins and beta bridges of proteins. In Table 11, ASA_zero is significant. This appears to be “unusual” because only 2 sequences in the LPM estimation data, both of which are experimentally deemed to be OGlcNAc glycosylated, have ASA values of 0. ASA_zero has a point of contact with the concept of an “observationspecific” dummy variable [69,70,71,72,73]. The two sequences with ASA values of 0 are shown in Table 13. An important biological question is whether sequences not exposed at all, but with other Oglycosylation friendly characteristics, have a greater propensity to be Oglycosylated.
The two proteins in Table 13, Peroxiredoxin2 (P32119) and Band 3 anion transport protein (P02730), have their Olinked glycosylation sites buried as indicated by their zero ASA values. While an explanation for this is not immediately apparent, position 112 in Peroxiredoxin2 (P32119) is a phosphorylation site as well. A question is whether the site is buried in order to make itself amenable to switch between the two processes via a conformational change (Olinked and Phosphorylation). Further research is needed to answer this question.
ASA is not significant in the LPM. As the two data subsets of sequences in the LPM estimation dataset are either O or N glycosylated, it appears that differences in ASA, between N and O glycosylated sequences does not rise to the level of a predictor variable in the fitted LPM. The empirical cumulative probability distribution functions (CDFs) for ASA, by estimation data subset, is shown in Fig. 1. The mean/median ASA values, shown in Fig. 1, differ by LPM estimation data subset. In particular, it appears that ASA values for Nglycosylated sequences are generally “larger” than those for OGlcNAc glycosylated sequences. Interestingly, the Wilcoxon rank sum test [35] indicates that the ASA value for Nglycosylated sequences is “stochastically larger” [74] than the ASA value for OGlcNAc glycosylated sequences. The pvalue of this test is about 3.6%. If the KS test is applied, the pvalue drops to nearly zero (3.96e13). However, because the first pvalue is less than 5%, it would be of interest to retest this phenomenon with more data to see if one can get significantly lower pvalues, say less than 0.5% [75], with the Wilcoxon rank sum test.
LPM predictions using only sequence data
The LPM in Table 14 employs only sequence data. The optimal value of k for which all predictions are in the 01 range is 1.4357. With this k, the LPM in Table 14 is estimated by WLS and produces a KS statistic of about 80.4%, which is a steep drop from 99.2% and underscores the germaneness of structural information. Absent 3D structure, this loss in LPM predictive power is not surprising, considering that the same site is used by two fundamental PTM processes driven by conformational flexibility. Maximum separation occurs at the predicted probability value of about 49%. The DW statistic value is about 1.12, indicating, as expected, missing explanatory variables (e.g., structural information) in the LPM as specified. This LPM produced only one prediction outside the 01 range flagging a protein deemed to be Oglycosylated in the data with a negative (or zero) “probability” (≈ –7.2%) of Oglycosylation; its UniProt Accession No., UniProt position, protein length, and sequence are Q16566, 58, 473, and VESELGRGATSIVYRCKQKGT, respectively. The LPM in Table 14 mispredicts (using 50% as the cutoff probability) 54 of the 259 (or 20.85%) outofsample OGlcNAc glycosylated sequences as not being OGlcNAc glycosylated. The 95% confidence interval [44] around 20.85% is: (16.35%, 26.20%). If the LPM in Table 14 is applied to a sample of OGalNAc glycosylated sequences, it mispredicts 656 of the 2,079 (or 31.6%) as not being OGalNAc glycosylated. A LPM specific to OGalNAc glycosylation will decrease the mispredictions rate. If the LPM in Table 14 is applied to the WSTWUniprot dataset (by ignoring the amino acids on either side of the S/Tsite), 52 of the 236 (≈ 22%) sequences are mispredicted as being Oglycosylated. As expected, this is slightly more than 20.85%.
Discussion
In order to clarify why the LPM methodology used herein is appropriate, we discuss several points, in a questionanswer format, in this section.
Is the LPM a valid methodology to model a binary response?
The short answer is “yes”. However, a longer answer is needed because it is quite unfashionable to use the LPM these days. And that perception needs to be addressed.
The LPM is fairly wellknown to econometricians and statisticians ([28] [76,77,78,79,80,81]) and has sparked a fairly substantial literature. The LPM is Y = Xb + u, but where Y only takes the values 0 and 1. If E(u) = 0, then each u_{i} has variance E(Y_{i})(1 – E(Y_{i})). Goldberger [28] suggested estimating E(Yi) by ordinary LS (OLS), and then reestimating the model by WLS to achieve homoscedasticity (i.e., constant residual variance) with E(Y_{i})(1 – E(Y_{i})) as the “weights”. The primary attraction of the LPM is its simplicity. In large samples, the nonnormality of u does not matter. The LPM produces consistent coefficients [32]. And the problem of getting predictions that lie outside the 01 range is not an asymptotic one [33]. Yes, “probabilities” less than 0 or greater than 1 can hinder empirical work with finite samples; but, only if those probabilities, per se, are used for something. In our case, probabilities less than 0 are interpreted as nearly 0 and those greater than 1 as nearly 1. This is so, because we are only interested in whether or not a sequence is Oglycosylated by assessing whether or not its probability of Oglycosylation is greater or less than 50%, respectively. The numerical value, per se, of the estimated probability of Oglycosylation below or above this threshold does not matter to us.
Rounding LPM predictions that are less than 0 to a small number close to 0 (e.g., 0.001), and those greater than 1 to a number close to 1 (e.g., 0.999), is a wellknown method to constrain the LPM predictions within the 01 range. In another approach, the sum of squared errors is minimized subject to the constraints 0 ≤ Xb ≤ 1 [82]. OLS LPM predictions are used to reestimate the LPM using WLS to overcome the problem of heteroscedasticity. Another method uses the absolute values of the weights to do WLS estimation. Another method uses only OLS predictions that are in the 01 range to do WLS estimation [83]. In another method, the weights are bounded and negative weights are assigned a constant value [84]. Another method generalizes the last two methods [85]. A limitation of these methods is that their sampling properties are not available. So, we use RR to construct the weights [29].
To avoid the problem, in finite samples, of LPM estimates falling outside of the 01 range, logistic regression is a greatly favored alternative to the LPM [86]. The logit model mathematically forces the estimates of Y to lie in the 01 range by the transformation: log(Y ÷ (1 – Y)) = Xb + u; and then estimates Y. In contrast, an alternative transformation is to use RR to force the estimates of Y in the 01 range. However, there is no a priori reason that says one way of constraining the estimates of Y to behave like “probabilities” is “superior” to the other. Furthermore, given that most of the predictors in our model Eq. (1) are also binary variables, it is not at all clear that Nature prefers the likelihood of Oglycosylation to vary along an ogive or sigmoid like surface, rather than linearly.
This naturally leads to the question of whether our fitted model satisfies the assumptions of a linear model. For the LPM, the error (residual) term is never normal; rather it is binomial. Furthermore, the error term is not observable – we only observe the final binary outcome in Nature. For large samples, the central limit theorem guarantees that LPM errors converge to a normal distribution [87]. Although LPM errors are not observable, one way to approximate them is to simply compute them like in ordinary linear regression with a continuous dependent variable – i.e., take the difference of the binary dependent variable, Y, and the corresponding LPM prediction; and compute the average squared error, which is the Brier Score. For our model in Table 11, Fig. 2 indicates that our sample size is approximately normal with a concentration of small errors in the range –0.04 to +0.04. A normal distribution with a mean of about 0.0025 (which is close to 0) and a standard deviation of about 0.1 fits Fig. 2 approximately.
Cook’s distance is used to test for the presence of outliers [88, 89]. We use the rule of thumb that Cook’s distances (Ds) greater than 0.5 may be problematic. Cook’s D are shown in Fig. 3, which indicates that the assumption of no outliers is a satisfactory one.
In the LPM, a unit increase in an independent variable, holding the other independent variables at fixed values, leads to a constant increase in the dependent variable. Because most of our independent variables in the LPM are binary, this is indeed the case. However, for the two continuous variables, the question is whether this constant “partial derivative” or “marginal” effect of an independent variable on the dependent variable holds. In a logit model, for example, this marginal effect varies with the independent variable. A plot of the residuals versus the two continuous variables in our LPM in Table 11 is shown in Fig. 4. This plot indicates fairly random scatters. Thus, the constant linear marginal effect assumption is satisfactory.
The LPM is a heteroscedastic model by construction. We first used White’s procedure [58] to correct for it. This LS estimated LPM with its heteroscedasticity consistent t values is shown in Table 15.
Because the form of the heteroscedasticity in the LPM is known, we had also used WLS to correct for it. This was done by estimating Y by RR in such a manner so as to guarantee that the predicted Y lies in the unit interval [29]. The plots of the actual observations (which are binary) vs. the RR predictions and the final WLS predictions of our model in Table 11 are shown in the lefthand and righthand panels of Fig. 5, respectively.
The RR predicted Y in the lefthand panel of Fig. 5 was used to estimate the LPM residual variances (this is the “first stage” LPM), so the problem of nonconstant variances in the LPM could be corrected for by using WLS. As can be visually observed in the lefthand panel, all predictions are constrained to lie in the 01 range for the smallest value of k [29]. That is, similar to logistic regression, a mathematical constraint has been imposed on the predicted Y. In logistic regression this is done via a transformation; here it is done by constraining the estimated coefficients. It is also known [30, 31] that the RR predicted LPM (in the lefthand panel of Fig. 5) is also a competitive model. That is, one does not necessarily have to compute the predictions in the righthand panel of Fig. 5, because RR has the capability to account for nonconstant residual variances [30]. The thing to note is that the KS drops and the Brier score increases in the lefthand panel of Fig. 5 relative to the corresponding values in the righthand panel [57]. This may be interpretable as the “cost” for constraining the predicted Y to lie in the 01 range, rather than leveraging the asymptotic property that predictions outside the 01 range do not matter in large samples.
What is the RR estimated LPM ANOVA?
The RR estimated LPM that generated Fig. 5 is shown in Table 16. The RR estimated standard errors for the coefficients are also shown in Table 16. Because RR produces noninteger degrees of freedom, unlike in classical LS estimation, the resultant RR analysis of variance (ANOVA) table is shown in Table 17 [90]. Note that the RR estimated mean square error (MSE) is about 0.0827, which is very much like the Brier Score in the lefthand panel of Fig. 5. RR produces an approximate, but not exact, Fratio (for nonstochastic k) [91], in the ANOVA table in Table 17, which is about 341 [90]. The angle between the regression vector and the residual vector is about 37^{0}, which measures “how much the residual vector deviates from being orthogonal to the X space as it is in least squares” [90].
Is the lasso penalized logit model (LPLM) more appropriate than the LPM to estimate Y?
The concept of the “lasso” penalty to model a dataset in which the number of predictors exceeds the number of observations goes back, at least, to the 1970s [92,93,94]. Later it was made fashionable by describing its applicability to statistical model selection [94].
The LPLM is estimated by minimizing the negative loglikelihood function, L(·), subject to the constraint that the sum of the absolute values of the coefficients is less than or equal to some value – that is, b_{1} ≤ t, where log[(Pr(Y=1) ÷ (1 – Pr(Y=1))] = Xb and v_{p} denotes the “pnorm” of a vector v. In contrast, RR uses the constraint (b_{2})^{2} ≤ t. That is, the sum of squared coefficients, instead of absolute coefficients, is constrained. In Lagrangian form [95, 96], the LPLM optimization problem can be recast as: minimize
–L(·) + λ × sum(b). This minimization is done using Nesterov’s method [97]. Here, a regularization parameter, ρ^{i}, where 0 < ρ < 1 and i is the i^{th} step in Lasso selection, is used for λ in the i^{th} step. Variable subset selection, via the Lasso, is done, using up to 200 steps, so as to minimize the Schwarz Bayesian Criterion (SBC) [98]. Insample, several LPLMs are estimated for different values of ρ and the resultant number of variables (excluding the intercept) selected, the KS statistic values, the Brier Scores, and twice the negative loglikelihood values are shown in Table 18. In terms of the SBC, the “optimal” LPLM, on the full sample, is the one with ρ = 0.815; and this is also nearly “optimal” in terms of Brier Score and KS statistic value. To assess LPLM performance outofsample as ρ varies, 5fold cross validation (CV) is done. LPLMs are fit on a set of 4 “training” folds and the Brier Score is calculated on the holdout (i.e., 5^{th}) fold. Thus, for a particular choice of ρ, five outofsample Brier Scores are available. The signaltonoise ratio (SNR), which is μ/σ, of these 5 Brier Scores is calculated. Per the Rose criterion [99], SNRs above 5 are deemed to be “good” in the engineering literature. The SNRs are shown in the last row of Table 18; and at ρ of 0.82, the SNR is a maximum and satisfies the Rose criterion. Thus, overall, it is reasonable to take ρ = 0.82 as “optimal”.
The resultant LPLM is shown in Table 19. An important question is: if the LPLM selected variables drive Oglycosylation in Nature, would they, postLasso, continue to be statistically significant if they were put into a classical logit model to reestimate Y ?
Using the classical logit model, we reestimated Y with the independent variables in Table 19. The classical logit model could not be estimated because of the problem of “quasicomplete separation”, which makes finding the optimal maximum likelihood solution difficult. This problem arises when Y separates a subset of the predictors to a large enough degree. To resolve this problem, we used Firth’s penalized maximum likelihood method [100, 101]. Several predictors (23 of the 51) turn out to be statistically insignificant at the 10% level of significance, and these are noted by asterisks in Table 19. Next, Y is reestimated using the LPM with the predictors in Table 19. The predictors that are statistically insignificant (10 of the 51), in terms of White tvalues, at the 10% level of significance are noted in Table 19 by the sunlike symbol “☼”. The LPM also indicates there is no multicollinearity among the predictors, because all of the VIFs are less than 10. However, the DW statistic drops to 0.8, likely indicating there are missing explanatory variables [77].
Of course, given a set of predictors, the estimated coefficients produced by the classical logit model, the LPM, and the LPLM are not comparable, because different optimization procedures are used to estimate Y. However, a philosophical question arises. Suppose, Nature operates along a logit surface to reveal the probability of Oglycosylation, given sequence and structural information in the neighborhood of the S/T site; and the Lasso has identified the relevant predictors. It appears to us that the selected predictors should, for the most part, be statistically significant, in the classical sense, when they are put into a classical logit model. The fact that 23 of the 51 LPLM predictors are not significant is troubling.
The rates at which the estimated models mispredict Y as 1 when it is empirically 0 (i.e., not Oglycosylated) in the estimation dataset, and 0 when it is empirically 1 (i.e., Oglycosylated), is shown in Table 20, together with the fit statistics. A 50% cutoff probability is used to measure the mispredictions rate.
As can be seen in Table 20, our WLS estimated LPM (Table 11) is performing quite well and does have an edge over the other models. This is not surprising. For example, the LPLM selects a subset of variables by setting certain coefficients to exactly 0. In our context, the interpretation of coefficients that are exactly zero is not easy. Coefficients of variables may be 0 if they are “irrelevant” and uncorrelated with the “relevant” variables. If many variables are irrelevant one possibility is that Nature has “wasted” certain amino acids by placing them in certain positions along the sequence; and another possibility is that they are “silent”, rather than “wasted”. In contrast, RR never sets coefficients to exactly 0. Rather, RR exhibits optimal linear shrinkage of the coefficients [102]. In particular, simulation studies indicate that the performance of RR dominates those of its peers such as partial least squares and principal component regression [102]. If we assume that Nature has not “wasted” amino acids, then RR is better suited to modify coefficients because amino acids in certain positions may have a “small” effect that is not yet visible. That is, the “relevant” amino acids may dominate the “irrelevant” ones (until a “signal” changes that) in determining the threedimensional structure of the protein, and, hence, its function. In such situations, we do not seek to set, as in the LPLM, some coefficients to exactly zero. Next, we present an example to clarify how fundamental are the amino acids and the positions they occupy along a sequence.
Consider the case of the psi angle. It is not a predictor in Table 11, but is a predictor in Table 19. Because the phi and psi angles go together, an important question is whether the LPM in Table 11 failed to capture it as a predictor, while the LPLM in Table 19 captured it. The answer is “no”; because to a significant degree the psi angle is determined by the amino acids and the positions they occupy. To see this, a linear regression is fit with the psi angle as the dependent variable and sequence information as independent variables. The linear regression model is chosen via two steps. In the first step, stepwise variable selection under 5fold CV is done so as to minimize the CV PRESS. The minimum CV PRESS attained was about 4,253,609 with 106 independent variables in the specification. Then, in the second step, because the model is heteroscedastic, the White tvalues [59], are used to retain only those variables that have pvalues less than 10% – i.e., variable subset selection is done, manually, using White tvalues instead of classical tvalues. The selected model is in Table 21.
The model in Table 21 has 99 variables, and an adjusted R^{2} of about 76%. However, its standardized (studentized) residuals are not normally distributed as indicated by the QQ plot in Fig. 6 [103]. This is not surprising, because psi is a “circular” measure and has heavy tails as can be seen by its tendency to cluster (with phi) in certain (and remain sparse in other) regions of the XY plane, which can be seen in the Ramachandran plot in Fig. 7 [104]. The nonnormality of the residuals indicates the need to do more work to normalize psi. However, as fitting a model to psi is not the focus of this paper, we did not pursue that goal at this time (our focus is on answering the question raised in the last paragraph). Nonetheless, the tvalues can be viewed as approximations (e.g., SNRs) for model selection by viewing linear regression, in this case, as purely an optimization method. In particular, we held the 99 variables in Table 21 “constant” and redid 5fold CV 50 times to study the variability in outofsample CV PRESS values. We did this because we noted that in the second step used to select the model (based on White’s tvalues), the new CVPRESS tends to exceed the minimum CV PRESS of about 4,253,609 – the reduction of 106 variables to 99 variables is fairly small and likely explains this movement. This variability, with its associated summary statistics, is shown in Fig. 8 by plotting the percentage increase in each new CV PRESS over the minimal CV PRESS (i.e., 4,253,609). Figure 8 indicates that this variation is about 4.5% on average, which is not large enough to reject the assumption that the distribution of amino acids along the sequence plays a significant role in determining structure and, hence, function. Future research can more precisely mathematize this role.
The number of times predictors are selected by the LPLM as ρ varies in Table 18 is shown in Table 22. In particular, ρ takes 13 values in Table 18. For each ρ, an optimal LPLM is selected. In all, there are 13 optimal LPLMs. All possible variables are shown in a rectangular format in Table 22. The positions of amino acids along the sequence is shown in an “integer line” format in the penultimate row of Table 22. So, for example, if we are at position number 3 to the right of S/T in Table 22 and go up vertically 6 “cells” we would arrive at the row for amino acid G; and this cell would represent the dummy variable p3G – that is, the dummy variable representing amino acid G in position +3. The variables (predictors) in the LPM of Table 11 are shaded in green in Table 22. For the 13 LPLMs, the frequency with which a predictor occurs in the LPLMs is noted in Table 22. For example, the number 13 in cell (–3, G) in Table 22 means that all of the 13 LPLMs had m3G as a predictor. Similarly, the 13 in cell (8, N) means that p8N was selected as a predictor by all of the 13 LPLMs; and, additionally, because this cell is shaded green, p8N is also a predictor in the selected LPM of Table 11. As another contrasting example, the predictor BH (Beta Hairpin) was only selected in 3 of the 13 LPLMs, but is in the LPM of Table 11. In summary, the wide variation in subsets of selected LPLM predictors, as ρ varies, is captured in Table 22. There are 107 integers (between 1 and 13, inclusive) entered in the cells identifying amino acids and their positions in Table 22; and 44 of these 107 integers are also shaded in green. That is, about 41% (44 ÷ 107) of the explanatory variables in the LPM of Table 11 are also selected by one or more of the 13 LPLMs.
How do fit statistics like the KS statistic and the Brier Score vary for the selected LPM?
The KS statistic is an alternative measure to the wellknown Gini coefficient [105,106,107]. The KS statistic for the LPM in Table 11 is 99.2%; and the corresponding Brier Score is 0.0094 (see Fig. 5). To study the outofsample variation in KS statistic and Brier Score, we reestimate the coefficients of the LPM in Table 15 under 5fold CV, 100 times. That is, the predictors of this LPM are held “constant” and only their coefficients are reestimated by pooling the data in 4 of the 5 randomly selected folds; and then calculating the KS statistic and Brier Score on the remaining fold. This exercise is done 5 times for a random selection of 5 folds. That is, 5 LPMs are estimated by partitioning the 5 folds into 5 sets of “training” and “validation” samples; where a training sample includes the data in 4 of the 5 folds, and the remaining fold is used as the validation sample. In turn, random selection of 5 folds is done 20 times. Thus, 100 KS statistics and Brier Scores are produced on sample sizes of about 400 each (the estimation sample size is 2,070).
The LPM in Table 15 instead of the one in Table 11 is selected for this exercise because it is simpler for our purpose. In particular, the difference between the two LPMs is the correction for heteroscedasticity, which is only a problem impacting the standard error of the coefficients, not their consistency. The results of this exercise are shown in Fig. 9, which indicates that the variation in these fit statistics is restricted to fairly narrow bands. Thus, there is nothing unreasonable in LPM performance.
Is “ASA_zero” a valid, rather than spurious, predictor in the LPM?
ASA measures the extent to which a sequence is exposed. As mentioned previously, there are two sequences in our data that have ASA values of 0, but are deemed to be Oglycosylated. In our model, ASA_zero is a statistically significant variable. Because of its special status as an “observation specific” dummy variable, it is not estimable within the classical logit modeling framework [71]. Its statistical significance, and persistence during CV, raises the open question of whether sequences that are not exposed at all have a greater propensity to be Oglycosylated. Although, we do not know the answer to that, we think that until empirical evidence to the contrary comes up, it does not hurt keeping ASA_zero in the LPM. Furthermore, the good news is that dropping it from the model does not change any of our major conclusions. That is, in our case, dropping ASA_zero as an explanatory variable, simply allowed its effect to be distributed among the other coefficients without disrupting anything. However, we did not drop it, because it is of interest to consider (until evidence to the contrary) its marginal contribution toward encouraging Oglycosylation.
The presence of ASA_zero in the LPM, but not ASA, raises another important question: is ASA_zero a spurious dummy variable in the LPM, because the fundamental variable underlying it, ASA, is not in the LPM? The answer is “no”. And we discuss the reason for this next.
The ultimate assumption underlying this paper is that the amino acids and the positions they occupy along the sequence are the “building blocks” leading to Oglycosylation. This means a significant amount of the information carried by ASA should be embedded in the amino acids and the positions they occupy in the sequences. To test this assumption, we searched for a linear regression model with the natural logarithm (log) of ASA as the dependent variable and sequence information as independent variables. Our search followed the same principles that were used to model psi. We used the log transformation to get ASA “closer” to normality, and, as a consequence, the two observations with ASA values of 0 also get eliminated from the estimation dataset (because log(0) is undefined). ASA, like psi, has “heavy” tails. The selected model is shown in Table 23. The model has 103 predictors and an adjusted R^{2} of about 80%. Its insample average squared error (ASE) is about 0.0425. The studentized residual distribution in Fig. 10 show the heavy tails; but, a reasonable amount of symmetry is preserved. The model predicts that the two sequences with Uniprot Accession Numbers P02730 and P 32119, which have empirical ASA values of 0 (see Table 13), have predicted ASA values of about 32.5 and 24.0, respectively.
The information content in ASA generated by the amino acids and the positions they occupy along the sequence can be measured by how well the regression model estimates of log(ASA) can discriminate between O and nonO glycosylated sequences (i.e., Y=1 vs. Y=0). The KS statistic would be a measure of this. In the estimation sample, if Y is ranked by estimated log(ASA), the resultant KS statistic is about 32.0%, which is a strong enough value to not reject the assumption that the distribution of amino acids along the sequence does meaningfully drive ASA. Like in the case of psi, we do 5fold CV (holding the 103 predictors “constant”), 20 times, to study the variability in the resultant 100 realizations of the out of sample ASEs and KS statistics. These variabilities, which are not unreasonable relative to the baseline ASE and KS statistic, are shown in Fig. 11. The baseline ASE and KS statistic are for the model in Table 23, which is fit on the full sample.
Is it valid to use mixed data from PTMs occurring in different parts of the cell to estimate the LPM?
The PTMs occur in different parts of the cell by a very complex process involving many different enzymes. For example, the initial processing of the Nglycosylation occurs in the rough ER and the process ends in the Golgi apparatus; phosphorylation occurs in the cytoplasm/nucleus; and OGlcNAc proteins are also cytosolic/nuclear. So, the question is whether our “control” dataset (i.e., Nglycosylated sequences) is the correct “negative control”? The answer is “yes”. By virtue of their differences in their mechanisms and cellular compartmentalization, it makes logical sense to use Nglycosylated sequences as the negative control for Oglycosylation. For example, in the absence of a consensus sequon for Oglycosylated sequences, using sequences with the S/T motif within the Oglycosylated set of sequences makes little sense, because there is no guarantee that such sequences will not get Oglycosylated.
The LPM models a binary outcome. For example, when tossing a “fair” coin, if one wants to model (using the LPM, say) whether the coin will come down as “head”, then one needs to start with two distinct data subsets: a set of heads and a set of tails [108]. One cannot have entries in the set of tails (the “control”) that may also be heads. From physics, we know that the outcome of the toss is exact, not probabilistic [108]. However, given the mathematical complexity of, and the instrumentation needed for, making this prediction based on the laws of physics (e.g., Newton’s laws), we use the language of probability, instead of physics, to estimate the outcome [109]. In particular, we say the probability of “head” is 50%. But, although this statement is false in theory, the methodology used is an important approximation that can be quite useful in practice, and for guiding experimental work.
Similarly, our use of the complex concept called “probability” [110], in this paper does not imply that we are saying there is a fundamental or intrinsic randomness in how a sequence folds and becomes sugar linked; rather, we are saying that given the underlying complexity of this process (as in the case of the coin toss), our methodology, which depends on probabilistic ideas, may represent a reasonable approximation (via rankordering) to facilitate predicting whether a sequence is Oglycosylated. Now the way we do this, is by starting with “partial” or “incomplete” information. That is, we start with explanatory variables in the ±10 positions around the S/T site. The distribution of amino acids in this neighborhood is considered. And the structural information at S/T is considered. For example, we have not considered the distribution of amino acids further away from S/T, which must also carry key information. So, for example, the LPM is “unaware” of the existence of the signal peptide in Nglycosylated proteins that sends it to a specific part of the cell.
So, we assumed that sequence and structural information in the neighborhood of the S/T site is sufficient for predicting the likelihood of Oglycosylation. To validate this assumption, we started with Table 9, which indicates that the proportion of Oglycosylated proteins having the N – ~ P – S/T sequon is tiny (1.22% of OGlcNAc and 0.77% of OGalNAc proteins). Given this empirical observation, we assumed that the “signature” N – ~ P – S/T discourages Oglycosylation – i.e., it appears to be an approximate inhibitor of Oglycosylation. Now, a reason that only a tiny number of sequences with the N – ~ P – S/T sequon are Oglycosylated may be because the remaining signature around the Oglycosylated S/T site is quite similar to the corresponding one around the Nglycosylated S/T site. And, there may be some “signal” outside of our chosen ±10 neighborhood that may occasionally allow a protein with the N – ~ P – S/T sequon to get Oglycosylated. However, the LPM is not trained on data outside of the ±10 neighborhood and, thus, such a signal, if it were to exist, is unobservable from the LPM’s frame of reference. So, our selecting the Nglycosylated sequences as the negative control set is independent of cell “compartments”. We are simply training the LPM on the signatures around the ±10 neighborhood of the N and O glycosylated sequences. Furthermore, if our assumption regarding the signature is unreasonable, we would have seen material deterioration in prediction accuracy under CV, but did not.
Conclusion
The consensus composite sequon for Oglycosylation is ~ (W–S/T–W), where “~” denotes the “not” operator. This means ~W – S/T – W, W – S/T – ~W, or ~W – S/T – ~W are necessary for Oglycosylation. The RR estimated LPM approach was instrumental in making this finding. The consensus sequon for phosphorylation is ~ (W–S/T/Y/H–W); although W–S/T/Y/H–W is not an absolute inhibitor of phosphorylation. Structural attributes (beta turn II, II´, helix, beta bridges, beta hairpin, and the phi angle) are significant predictors of OGlcNAc glycosylation.
For LPM estimation it is found that Nglycosylated sequences are good approximations to nonOglycosylatable sequences; although N – ~P – S/T is not an absolute inhibitor of Oglycosylation. The selective positioning of an amino acid along the sequence, differentiates the PTMs of proteins. Some Nglycosylated sequences are also phosphorylated at the S/Tsite in the N – ~P – S/T sequon. ASA values for Nglycosylated sequences are stochastically larger than those for OGlcNAc glycosylated sequences.
The LPM with sequence and structural data as explanatory variables yields a KS statistic value of 99%. With only sequence data, the KS statistic erodes to 80%. With 50% as the cutoff probability for predicting OGlcNAc glycosylation, this LPM mispredicts 21% of outofsample OGlcNAc glycosylated sequences as not being glycosylated. The 95% confidence interval around this mispredictions rate is 16% to 26%. This study underscores the germaneness of structural information for predicting the likelihood of Oglycosylation.
Several other studies are available for the predicting the likelihood of Oglycosylation. These studies have employed “machine learning” techniques, like random forests and factor analysis, to predict Oglycosylation [111,112,113,114]. However, this approach was not pursued in this paper because the focus herein is on the explanatory process driving Oglycosylation. Thus, classical hypothesis testing is germane to the present work, and algorithmic modeling with its pervasive focus on prediction is not, in this paper, an end in itself. In this sense, Cox’s crucial comment to Breiman is echoed, which is that the “starting point” in this paper is not the data, but the underlying process generating it [115]. Furthermore, echoing Cox, again, the preference in this paper is to avoid proceeding with “a directly empirical blackbox approach” in favor of trying to “take account of some underlying explanatory process” that can be simply represented by the LPM. In this context, the Lasso penalized logistic model was also fit to the data. However, it was found that the LPM is quite competitive with it, and provides a simpler mechanistic understanding of the Oglycosylation prediction process.
Methods
Data Collection

a.
Data collected on the glycosylated sequences used for analysis herein is stored in an Excel spreadsheet: glycos_public.xlsx. This data, and key operations performed on it, are described in Table 8. Data is collected from two resources (dbOGAP [36] and PhosphoSitePlus [37]) as described in Table 8.

b.
Structural data is obtained from the Protein Data Bank (PDB) [38], and the PDBID codes used are provided in glycos_public.xlsx in the column labeled PDBID. In dbOGAP, the total number of UniProt sequences is 376; and, of these, 39 have structural information. Several unique structures exist for each of the 39 sequences. This results in the 39 sequences having a total of 998 structures with unique PDBIDs. Each structure is unique in terms of different ligands/drugs bound to them. Although there is redundancy in this data from a sequence standpoint, these 39 sequences have different structural attributes as defined by phipsi angles and ASA, which are important structural variables. In order to make sure that these are real structural differences, similar structures were superposed to calculate a RMSD (rootmeansquareddeviation). It is confirmed that the resolution of the structures is high enough to believe in the subtle conformational changes captured in the data. The 10 residues to the left of the S/Tsite, and the 10 residues to the right of the S/Tsite, are superposed to see if the S/Tsite displays slightly different conformations. An example is presented in Fig. 12. This example supports the rationale behind using all of the redundant sequences for regression model building. It is not surprising that the Ssite, in this example, is flexible enough to accommodate its role in a multitude of cellular processes. Residues that have maximum deviation (0.51.0Å) are close to the Ssite. Residues farther out superimpose with less than 0.1Å RMSD.

c.
The sequence and taxonomy information for the data used in the analysis are extracted from the UniProtKB database (www.uniprot.org) and shown in columns labeled Uniprot_Accession_No and Protein_Organism in glycos_public.xlsx [39]. The ligand information (see column labeled Ligand_Sequence in glycos_public.xlsx) is obtained from the PDBsum database [40]

d.
Accessible Surface Area (ASA) values (see column labeled ASA in glycos_public.xlsx) are obtained using the ASAView tool and database [41]. The definitions in the ASAView tool are used to classify the amino acids into five categories: positively charged residues (R, K, H), negatively charged residues (D, E), polar uncharged residues (G, N, Y, Q, S, T, W), Cysteine (C), and hydrophobic residues (L, V, I, A, F, M, P).

e.
The ProMotif database is used to obtain the turn types: http://www.ebi.ac.uk/thorntonsrv/databases/cgi in/pdbsum/GetPage.pl?pdbcode=n/a&template=doc_promotif.html. These are in glycos_public.xlsx in the column labeled Turn_Type.
Modeling Strategy
Because structural data on Oglycosylated sequences is sparse, two LPMs are estimated. One with sequence and structural data as explanatory variables. The other with only sequence data. The LPM has two principal sources of error: the exact process by which a protein becomes Oglycosylated is unknown; and uncertainty due to ambiguity in what constitutes a set of proteins that cannot be Oglycosylated. So, a suitable sequon is needed for discriminating between the two outcomes: Oglycosylated or not.
Availability of data and materials
The dataset (glycos_public.xlsx), which is in Microsoft Excel, is included as Additional file 1.
Abbreviations
 CV:

Cross validation
 KS:

KolmogorovSmirnov
 LPLM:

Lasso penalized logit model
 LPM:

Linear probability model
 PRESS:

predicted error sum of squares
 PTM:

Posttranslational modification
 RR:

Ridge regression
 SBC:

Schwarz Bayesian criterion
References
Brooks SA. Strategies for analysis of the glycosylation of proteins: current status and future perspectives. Mol Biotechnol. 2009;43(1):76–88.
Joshi HJ, Narimatsu Y, Schjoldager KT, Tytgat HLP, Aebi M, Clausen H, Halim A. SnapShot: OGlycosylation Pathways across Kingdoms. Cell. 2018;172(3):632–632 e632.
Krieg J, Hartmann S, Vicentini A, Gläsner W, Hess D, Hofsteenge J. Recognition signal for Cmannosylation of Trp7 in RNase 2 consists of sequence TrpxxTrp. Mol Biol Cell. 1998;9(2):301–9.
Torres CR, Hart GW. Topography and polypeptide distribution of terminal Nacetylglucosamine residues on the surfaces of intact lymphocytes. Evidence for Olinked GlcNAc. J Biol Chem. 1984;259(5):3308–17.
Haltiwanger RS, Holt GD, Hart GW. Enzymatic addition of OGlcNAc to nuclear and cytoplasmic proteins. Identification of a uridine diphosphoNacetylglucosamine:peptide betaNacetylglucosaminyltransferase. J Biol Chem. 1990;265(5):2563–8.
Hart GW, Housley MP, Slawson C. Cycling of Olinked betaNacetylglucosamine on nucleocytoplasmic proteins. Nature. 2007;446(7139):1017–22.
Yang X, Zhang F, Kudlow JE. Recruitment of OGlcNAc transferase to promoters by corepressor mSin3A: coupling protein OGlcNAcylation to transcriptional repression. Cell. 2002;110(1):69–80.
Dias WB, Cheung WD, Wang Z, Hart GW. Regulation of calcium/calmodulindependent kinase IV by OGlcNAc modification. J Biol Chem. 2009;284(32):21327–37.
Lazarus MB, Nam Y, Jiang J, Sliz P, Walker S. Structure of human OGlcNAc transferase and its complex with a peptide substrate. Nature. 2011;469(7331):564–7.
Wells L, Hart GW. OGlcNAc turns twenty: functional implications for posttranslational modification of nuclear and cytosolic proteins with a sugar. FEBS lett. 2003;546(1):154–8.
Capotosti F, Guernier S, Lammers F, Waridel P, Cai Y, Jin J, Conaway JW, Conaway RC, Herr W. OGlcNAc transferase catalyzes sitespecific proteolysis of HCF1. Cell. 2011;144(3):376–88.
Wells L, Vosseller K, Hart GW. Glycosylation of nucleocytoplasmic proteins: signal transduction and OGlcNAc. Science (New York). 2001;291(5512):2376–8.
Lubas WA, Hanover JA. Functional expression of Olinked GlcNAc transferase. Domain structure and substrate specificity. J Biol Chem. 2000;275(15):10983–8.
Kreppel LK, Blomberg MA, Hart GW. Dynamic glycosylation of nuclear and cytosolic proteins. Cloning and characterization of a unique OGlcNAc transferase with multiple tetratricopeptide repeats. J Biol Chem. 1997;272(14):9308–15.
Haltiwanger RS, Blomberg MA, Hart GW. Glycosylation of nuclear and cytoplasmic proteins. Purification and characterization of a uridine diphosphoNacetylglucosamine:polypeptide betaNacetylglucosaminyltransferase. J Biol Chem. 1992;267(13):9005–13.
Dong DL, Hart GW. Purification and characterization of an OGlcNAc selective NacetylbetaDglucosaminidase from rat spleen cytosol. J Biol Chem. 1994;269(30):19321–30.
Lubas WA, Frank DW, Krause M, Hanover JA. OLinked GlcNAc transferase is a conserved nucleocytoplasmic protein containing tetratricopeptide repeats. J Biol Chem. 1997;272(14):9316–24.
Copeland RJ, Bullen JW, Hart GW. Crosstalk between GlcNAcylation and phosphorylation: roles in insulin resistance and glucose toxicity. Am J Physiol Endocrinol Metab. 2008;295(1):E17–28.
Yang X, Ongusaha PP, Miles PD, Havstad JC, Zhang F, So WV, Kudlow JE, Michell RH, Olefsky JM, Field SJ, et al. Phosphoinositide signalling links OGlcNAc transferase to insulin resistance. Nature. 2008;451(7181):964–9.
Brownlee M. Biochemistry and molecular cell biology of diabetic complications. Nature. 2001;414(6865):813–20.
Caldwell SA, Jackson SR, Shahriari KS, Lynch TP, Sethi G, Walker S, Vosseller K, Reginato MJ. Nutrient sensor OGlcNAc transferase regulates breast cancer tumorigenesis through targeting of the oncogenic transcription factor FoxM1. Oncogene. 2010;29(19):2831–42.
Wright JN, Collins HE, Wende AR, Chatham JC. OGlcNAcylation and cardiovascular disease. Biochem Soc Trans. 2017;45(2):545–53.
Banerjee PS, Lagerlof O, Hart GW. Roles of OGlcNAc in chronic diseases of aging. Mol Aspects Med. 2016;51:1–15.
Wani WY, Ouyang X, Benavides GA, Redmann M, Cofield SS. OGlcNAc regulation of autophagy and αsynuclein homeostasis; implications for Parkinson’s disease. Mol Brain. 2017;10(1):32.
Ma X, Li H, He Y, Hao J. The emerging link between OGlcNAcylation and neurological disorders. Cell Mol life Sci. 2017;74(20):3667–86.
Ho WL, Hsu WM, Huang MC, Kadomatsu K, Nakagawara A. Protein glycosylation in cancers and its potential therapeutic applications in neuroblastoma. J Hematol Oncol. 2016;9(1):100.
Hoerl AE, Kennard RW. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics. 1970;12:55–67.
Goldberger AS. Econometric Theory. New York: Wiley; 1964.
Gana R. Ridge regression estimation of the linear probability model. J Appl Stat. 1995;22(4):537–9.
Saccucci MS. Effect of varianceinflated outliers on least squares and ridge regression. Newark: University of Delaware (unpublished PhD dissertation; 1985.
Monyak JT: Mean squared error properties of the ridge regression estimated linear probability model. unpublished PhD dissertation 1998.
McGillivray RG. Estimating the linear probability function. Econometrica. 1970;30:775–6.
Amemiya T. Some theorems in the linear probability model. Int Econ Rev. 1977;18(3):645–50.
Gana R, Naha S, Mazumder R, Goldman R, Vasudevan S. Ridge Regression Estimated Linear Probability Model Predictions of Nglycosylation in Proteins with Structural and Sequence Data. ArXiv. 2018.
Wilcoxon F. Individual comparisons by ranking methods. Biometrics Bull. 1945;1:80–3.
Wang J, Torii M, Liu H, Hart GW, Hu ZZ. dbOGAP  an integrated bioinformatics resource for protein OGlcNAcylation. BMC Bioinformatics. 2011;12:91.
Hornbeck PV, Zhang B, Murray B, Kornhauser JM, Latham V, Skrzypek E. PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. Nucleic Acids Res. 2015;43(Database issue):D512–20.
Burley SK, Berman HM, Christie C, Duarte JM, Feng Z, Westbrook J, Young J, Zardecki C. RCSB Protein Data Bank: Sustaining a living digital data resource that enables breakthroughs in scientific research and biomedical education. Protein Sci. 2018;27(1):316–30.
UniProt C. UniProt: a hub for protein information. Nucleic Acids Res. 2015;43(Database issue):D204–12.
Laskowski RA, Hutchinson EG, Michie AD, Wallace AC, Jones ML, Thornton JM. PDBsum: a Webbased database of summaries and analyses of all PDB structures. Trends Biochem Sci. 1997;22(12):488–90.
Ahmad S, Gromiha M, Fawareh H, Sarai A. ASAView: database and tool for solvent accessibility representation in proteins. BMC Bioinformatics. 2004;5:51.
Cohen P. The origins of protein phosphorylation. Nat Cell Biol. 2002;4(5):E127–30.
Fuhs SR, Meisenhelder J, Aslanian A, Ma L, Zagorska A, Stankova M, Binnie A, AlObeidi F, Mauger J, Lemke G, et al. Monoclonal 1 and 3Phosphohistidine Antibodies: New Tools to Study Histidine Phosphorylation. Cell. 2015;162(1):198–210.
Wilson EB. Probable inference, the law of succession, and statistical inference. J Am Stat Assoc. 1927;22:209–12.
Whitehead AN, Russell BAW. Principia Mathematica to *56; 1910.
Hart GW. Glycosylation. Curr Opin Cell Biol. 1992;4(6):1017–23.
Gavel Y, von Heijne G. Sequence differences between glycosylated and nonglycosylated AsnXThr/Ser acceptor sites: implications for protein engineering. Protein Eng. 1990;3(5):433–42.
BenDor S, Esterman N, Rubin E, Sharon N. Biases and complex patterns in the residues flanking protein Nglycosylation sites. Glycobiology. 2004;14(2):95–101.
Efroymson MA. Multiple Regression Analysis. In: Mathematical Methods for Digital Computers. New York: Wiley; 1960.
Hocking RR. A Biometrics Invited Paper. The Analysis and Selection of Variables in Linear Regression. Biometrics. 1976;32(1):1–49.
Hoerl RW, Schuenemeyer JH, Hoerl AE. A simulation of biased estimation and subset selection regression techniques. Technometrics. 1986;28:369–80.
Larson SC. The shrinkage of the coefficient of multiple correlation. J Educ Psychol. 1931;22(1):45–55.
Mosteller F, Wallace DL. Inference in an authorship problem: A comparative study of discrimination methods applied to the authorship of the disputed Federalist Papers. J Am Stat Assoc. 1963;58:275–309.
Mosteller F, Tukey JW. Data analysis, including statistics. Handbook Soc Psychol. 1968;2:80–203.
Stone M. Crossvalidatory choice and assessment of statistical problems. J R Stat Soc. 1974;36(1):103–6.
Geisser S. The predictive sample reuse method with applications. J Am Stat Assoc. 1975;70:320–8.
Brier GW: Verification of forecasts expressed in terms of probability. Mon Weather Rev 1950, 78:13.
Murphy A. A new vector partition of the probability score. J Appl Meteorol. 1970;12:695–700.
White H. A heteroskedasticityconsistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica. 1980;48:817–38.
Smirnov NV. On the estimation of the discrepancy between empirical curves of distribution for two independent samples, vol. 2. Moscow: Bulletin of the University of Moscow; 1939. p. 3–14.
Smirnov NV. Table for Estimating the Goodness of Fit of Empirical Distributions. Ann Math Stat. 1948;19(2):279–81.
Kolmogorov A. Sulla determinazione empirica di una lgge di distribuzione. InstItalAttuari Giorn. 1933;4:83–91.
Feller W. On the KolmogorovSmirnov Limit Theorems for Empirical Distributions. Ann Math Stat. 1950;21(2):301–2.
Doob JL. Heuristic approach to the KolmogorovSmirnov theorems. Ann Math Stat. 1949;20:393–403.
Durbin J, Watson GS. Testing for serial correlation in least squares regression I. Biometrika. 1950;37:409–28.
Durbin J, Watson GS. Testing for serial correlation in least squares regression II. Biometrika. 1951;38:159–77.
Blom N, SicheritzPonten T, Gupta R, Gammeltoft S, Brunak S. Prediction of posttranslational glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics. 2004;4(6):1633–49.
Thanka Christlet TH, Veluraja K. Database analysis of Oglycosylation sites in proteins. Biophys J. 2001;80(2):952–60.
Salkever DS. The use of dummy variables to compute predictions, prediction errors, and confidence intervals. J Econ. 1976;4:393–7.
Anderson GJ. Prediction tests in limited dependent variable models. J Econ. 1987;34:253–61.
Caudill SB. An advantage of the linear probability model over probit or logit. Oxford Bull Econ Stat. 1988;50:425–7.
Caudill SB. Dichotomous choice models and dummy variables. Statistician. 1987;36(4):381–3.
Oksanen EH. A Note on ObservationSpecific Dummies and Logit Analysis. J R Stat Soc Series D (The Statistician). 1986;35(4):413–6.
Mann HB, Whitney DR. On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat. 1947;18(1):50–60.
Benjamin DJ, et al. Redefine Statistical Significance. Nat Hum Behav. 2018;2:6–10.
Cox DR. Analysis of Binary Data; 1970.
Gujarati DN. Basic Econometrics; 1995.
Judge G, Hill C, Griffiths W, Lee T. The Theory and Practice of Econometrics; 1985.
Maddala GS. Introduction to Econometrics; 1992.
Takeshi A. Advanced Econometrics; 1985.
Wooldridge JM. Introductory Econometrics: A Modern Approach; 2016.
Judge G, Takayama T. Inequality restrictions in regression analysis. J Am Stat Assoc. 1966;61(313):166–81.
Goldfeld SM, Quandt RE. Nonlinear Methods in Econometrics; 1972.
Hensher DA, Johnson LW. Applied Discrete Choice Modeling; 1981.
Mullahy J. Weighted least squares estimation of the linear probability model revisited. Econ Lett. 1990;32(1):35–41.
Cox DR. The regression analysis of binary sequences. J R Stat Soc. 1958;20(2):215–42.
Schneider I, De Moivre A. The Doctrine of Chances (1718, 1738, 1756), GrattanGuinness, I Landmark Writings in Western Mathematics; 2005. p. 1640–940. 11051120
Cook DR. Influential observations in linear regression. J Am Stat Assoc. 1979;74(365):169–74.
Cook DR. Detection of influential observations in linear regression. Technometrics. 1977;19(1):15–8.
Hoerl AE, Kennard RW. Ridge regression: degrees of freedom in the analysis of variance. Commun Stat. 1990;19:1485–95.
Obenchain RL. Classical Ftests and confidence regions for ridge regression. Technometrics. 1977;19:429–39.
Santosa F, Symes WW. Linear inversion of bandlimited reflection seismograms. J Sci Stat Comput SIAM. 1986;7(4):1307–30.
Taylor HL, Banks SC, McCoy JF. Deconvolution with the ℓ1 norm. Geophysics. 1979;44:39–52.
Tibshirani R. Regression Shrinkage and Selection via the lasso. J R Stat Soc Ser B Methodol. 1996;58(1):267–88.
Bussotti P. On the Genesis of the Lagrange Multipliers. J Optim Theory Appl. 2003;117(3):453–9.
Lagrange JL. Mecanique Analitique; 1811.
Nesterov Y. Gradient Methods for Minimizing Composite Objective Function. Math Program. 2013;140:125–62.
Schwarz GE. Estimating the Dimension of a Model. Ann Stat. 1978;6:461–4.
Rose A. Vision  Human and Electronic; 1973.
Firth D. Reduction of Maximum Likelihood Estimates. Biometrika. 1993;80:27–38.
Heinze G, Schemper M. A Solution to the Problem of Separation in Logistic Regression. Stat Med. 2002;21:2409–19.
Frank IE, Friedman JH. A statistical view of some chemometrics regression tools. Technometrics. 1993;35:109–35.
Wilk MB, Gnanadesikan R. Probability plotting methods for the analysis of data. Biometrika. 1968;55(1):1–17.
Ramachandran GN, Ramakrishnan C, Sasisekharan V. Stereochemistry of polypeptide chain configurations. J Mol Biol. 1963;7(1):95–9.
Ceriani L, Verme P. The origins of the Gini index: extracts from Variabilita e Mutabilita (1912) by Corrado Gini. J Econ Ineqal. 2012;10(3):421–43.
Gini C. Variabilita e Mutabilita; 1912.
Gini C. Measurement of inequality and incomes. Econ J. 1921;31:124–6.
Keller J. The probability of heads. Am Math Mon. 1986;93(3):191–7.
Diaconis P, Mazur BC. The problem of thinking too much. Bull Am Acad Arts Sci. 2003;56:26–38.
Ergodos N. The enigma of probability. J Cogn Neuroethics. 2014;1(2):37–71.
Yang X, Han H. Factors analysis of protein Oglycosylation site prediction. Comput Biol Chem. 2017;71:258–63.
Hassan H, Badr A, Abdelhalim MB. Prediction of Oglycosylation Sites Using Random Forest and GATuned PSO Technique. Bioinform Biol Insights. 2015;9:103–9.
Chen Y, Zhou W, Wang H, Yuan Z. Prediction of Oglycosylation sites based on multiscale composition of amino acids and feature selection. Med Biol Eng Comput. 2015;53(6):535–44.
Breiman L. Random forests. Mach Learn. 2001;45:5–32.
Breiman L, Modeling S. Stat Sci. 2001;16(3):199–231.
Acknowledgements
We thank Swagata Naha, who was a graduate student at Georgetown University, for collecting data on some of the sequences. We thank ZhangZhi Hu for providing the data mined for the creation of dbOGAP during his tenure, as a faculty member, at Georgetown University. We thank Elliott Crooke, Professor and Chair, Department of Biochemistry and Molecular & Cellular Biology, and Senior Associate Dean, Faculty and Academic Affairs, Georgetown University, for his support and encouragement. We thank Dr. Sarma Dittakavi, Professor Emeritus, Laboratory Medicine and Pathobiology, University of Toronto, for reading the manuscript, providing valuable comments, and for the many stimulating discussions that we have had together. We thank the Editors and the two anonymous reviewers for their valuable comments.
Funding
No funding was obtained for this study.
Author information
Authors and Affiliations
Contributions
SV conceived the idea of predicting the likelihood of Oglycosylation. RG conceived the idea of predicting this likelihood using the classical regression framework. SV and RG analyzed the data, wrote and reviewed the manuscript. Both authors have read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable. This study does not utilize any data that requires ethical approval or consent to participate.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1:
Estimation dataset. (XLSX 1187 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Gana, R., Vasudevan, S. Ridge regression estimated linear probability model predictions of Oglycosylation in proteins with structural and sequence data. BMC Mol and Cell Biol 20, 21 (2019). https://doi.org/10.1186/s1286001902009
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1286001902009