
Table 8 Description of the collected data (a)

From: Ridge regression estimated linear probability model predictions of O-glycosylation in proteins with structural and sequence data

Dataset name given

Data description

Identified in glycos_public.xlsx by

Sample Size



O-GlcNAc glycosylated sequences with sequence and structural data

Oglycos_status = yes

1,105. Of these, 998 are human with unique PDB-IDs; only 16 of the 998 are inferred from known O-GlcNAcylated orthologs (the others are experimentally validated). The remaining 107 sequences are non-human. The 998 human sequences are in-sample data for the proteins with sequence and structural information. The structural information on the 998 sequences was collected



Human O-GlcNAc glycosylated sequences

Ogly_only_seq = yes

376. These are unique UniProt Accession No. and position pairs.



Extract of human sequences from dbogap-str with unique UniProt Accession number and position pairs (i.e., structure is ignored)

Not identified as it is derivable using software like SAS or R

39. Of these, 28 are experimentally validated and the remaining 11 are inferred. These 39 sequences correspond to 998 entries in dbogap-str because of the many conformational variants associated with them. The 39 sequences come from 25 unique proteins (UniProt Accession Nos.).



Merge dbogap with dbogap-unique-seq-with-str, by UniProt Accession No. and position, and retain those in the first dataset, but not in the second.

Not identified as it is derivable




Additional extract of O-GlcNAc glycosylated human sequences

Ogly_21 = yes

411. Of these, 59.12% are glycosylated at S and the others at T. Of the 25 unique human proteins in dbogap-unique-seq-with-str, 18 are in Oglc-PS+



Merge Oglc-PS+ with dbogap-seq and dbogap-unique-seq-with-str, by UniProt Accession No. and position, and retain those in Oglc-PS+, but not in dbogap-seq or dbogap-unique-seq-with-str

GLCNAC_s1 = yes

259. This is used as out-of-sample data. Note that 152 of the 340 sequences in dbogap-seq are also in Oglc-PS+. The total number of unique sequences (UniProt Accession No. and position pairs) in dbogap and Oglc-PS+ is 638.



O-GalNAc glycosylated human sequences

GALNAC_s1 = yes

2,079. This is used as out-of-sample data. Of these, 60.27% are glycosylated at T and the others at S



N-glycosylated sequences

glyco_status = yes

6,328. Of these, 2,422 are “Homo sapiens (Human)”. Of the 2,422, the count of sequences with more than one sugar bound is 1,083. These 1,083 sequences are in-sample data for the proteins with sequence and structural information. If structure is ignored, there are 361 unique sequences (i.e., unique UniProt Accession No. and position pairs). These 361 sequences are in-sample data for the proteins with only sequence data

Gana et al. [34]


[35] Phosphorylated sequences

Not identified. This is archived in a separate file: Phosy.csv

363,256. Of these, 227,810 are human with amino acids in ±7 positions of the S/T-site; and 58.95%, 24.51%, and 16.54% are phosphorylated at S, T and Y, respectively



Human sequences with the W–S/T–W sequon

wstw = yes

236. This extract is unique in terms of UniProt Accession No. and position pairs.


  (a) The columns give the dataset name, a description of the data, the counts of the sequences collected, and the data source. For example, 1,105 O-GlcNAc glycosylated proteins with sequence and structural data are collected and stored as dataset dbogap-str. These data are identified in glycos_public.xlsx by “yes” in column Oglycos_status. In terms of unique PDB-IDs, there are 998 sequences in this data. The last column cites the source of the collected data, dbOGAP.
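
Two rows of the table are described as derivable rather than flagged in glycos_public.xlsx: the extract of unique UniProt Accession No. and position pairs, and the merges that retain sequences found in the first dataset but not in the others. The table names SAS or R for this; the sketch below shows one way the same two steps might look in Python with pandas. The column names `accession` and `position`, the helper function names, and the toy rows are assumptions made for illustration only and are not taken from the published files.

```python
import pandas as pd

# Hypothetical key columns: UniProt Accession No. and site position.
KEYS = ["accession", "position"]


def unique_sites(df: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per (accession, position) pair, i.e. ignore structure."""
    return df.drop_duplicates(subset=KEYS)[KEYS].reset_index(drop=True)


def retain_not_in(first: pd.DataFrame, *others: pd.DataFrame) -> pd.DataFrame:
    """Anti-join: key pairs of `first` that appear in none of `others`."""
    result = unique_sites(first)
    for other in others:
        # A left merge with indicator=True labels each row by where it was found.
        merged = result.merge(
            unique_sites(other), on=KEYS, how="left", indicator=True
        )
        result = merged.loc[merged["_merge"] == "left_only", KEYS]
        result = result.reset_index(drop=True)
    return result


# Toy stand-ins (invented rows): one site can appear under several PDB-IDs.
with_structure = pd.DataFrame(
    {
        "accession": ["P1", "P1", "P2"],
        "position": [10, 10, 5],
        "pdb_id": ["1ABC", "2DEF", "3GHI"],
    }
)
sequence_only = pd.DataFrame({"accession": ["P1", "P3"], "position": [10, 7]})

unique_with_structure = unique_sites(with_structure)
only_in_sequence = retain_not_in(sequence_only, with_structure)
```

In this toy case the three structural rows collapse to two unique sites, and the anti-join keeps only the site that has no structural entry. The same pattern extends to the three-way merge by passing both comparison datasets to `retain_not_in`.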