Dataset name | Data description | Identified in glycos_public.xlsx by | Sample size | Source |
---|---|---|---|---|
dbogap-str | O-GlcNAc glycosylated sequences with sequence and structural data | Oglycos_status = yes | 1,105, of which 998 are human with unique PDB-IDs; of these, only 16 are inferred from known O-GlcNAcylated orthologs (the others are experimentally validated). The remaining 107 sequences are non-human. The 998 human sequences are in-sample data for the proteins with sequence and structural information. The structural information on the 998 sequences was collected | dbOGAP |
dbogap | Human O-GlcNAc glycosylated sequences | Ogly_only_seq = yes | 376. These are unique UniProt Accession No. and position pairs. | dbOGAP |
dbogap-unique-seq-with-str | Extract of human sequences from dbogap-str with unique UniProt Accession No. and position pairs (i.e., structure is ignored) | Not identified, as it is derivable using software such as SAS or R | 39. Of these, 28 are experimentally validated and the remaining 11 are inferred. These 39 sequences expand to 998 entries in dbogap-str owing to the many conformational variants associated with them. The 39 sequences come from 25 unique proteins (UniProt Accession Nos.) | N/A |
dbogap-seq | Sequences in dbogap but not in dbogap-unique-seq-with-str, obtained by merging the two datasets by UniProt Accession No. and position | Not identified, as it is derivable | 340 | N/A |
Oglc-PS+ | Additional extract of O-GlcNAc glycosylated human sequences | Ogly_21 = yes | 411. Of these, 59.12% are glycosylated at S and the others at T. Of the 25 unique human proteins in dbogap-unique-seq-with-str, 18 are in Oglc-PS+ | PhosphoSitePlus |
Oglc-non-dbogap | Sequences in Oglc-PS+ but in neither dbogap-seq nor dbogap-unique-seq-with-str, obtained by merging the three datasets by UniProt Accession No. and position | GLCNAC_s1 = yes | 259. This is used as out-of-sample data. Note that 152 of the 340 sequences in dbogap-seq are in Oglc-PS+. The total number of unique sequences (UniProt Accession No. and UniProt position pairs) in dbogap and Oglc-PS+ is 638. | N/A |
Ogal | O-GalNAc glycosylated human sequences | GALNAC_s1 = yes | 2,079. This is used as out-of-sample data. Of these, 60.27% are glycosylated at T and the others at S | PhosphoSitePlus |
Ngly | N-glycosylated sequences | glyco_status = yes | 6,328. Of these, 2,422 are “Homo sapiens (Human)”. Of the 2,422, the count of sequences with more than one sugar bound is 1,083. These 1,083 sequences are in-sample data for the proteins with sequence and structural information. If structure is ignored, there are 361 unique sequences (i.e., unique UniProt Accession No. and position pairs). These 361 sequences are in-sample data for the proteins with only sequence data | Gana et al.[34] |
Phosy | Phosphorylated sequences [35] | Not identified. This is archived in a separate file: Phosy.csv | 363,256. Of these, 227,810 are human with amino acids in ±7 positions of the S/T-site; and 58.95%, 24.51%, and 16.54% are phosphorylated at S, T and Y, respectively | PhosphoSitePlus |
WSTW-Uniprot | Human sequences with the W–S/T–W sequon | wstw = yes | 236. This extract is unique in terms of UniProt Accession No. and position pairs | UniProt |
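
The derived datasets in the table (dbogap-unique-seq-with-str, dbogap-seq, and Oglc-non-dbogap) are described as de-duplication and merge-and-exclude steps keyed on UniProt Accession No. and position. The following is a minimal Python sketch of that logic, not the authors' SAS or R code. The `Site` type, the helper names, and the toy records are all hypothetical and are not taken from glycos_public.xlsx.

```python
from __future__ import annotations

from typing import Iterable, NamedTuple


class Site(NamedTuple):
    """A glycosylation site keyed the way the table describes (hypothetical type)."""
    accession: str  # UniProt Accession No.
    position: int   # residue position in the UniProt sequence


def unique_sites(rows: Iterable[tuple[str, int, str]]) -> set[Site]:
    """Collapse (accession, position, pdb_id) rows to unique sites, ignoring structure."""
    return {Site(acc, pos) for acc, pos, _pdb_id in rows}


def anti_join(left: set[Site], *others: set[Site]) -> set[Site]:
    """Keep the sites in `left` that appear in none of `others`."""
    excluded: set[Site] = set().union(*others)
    return left - excluded


# Toy records for illustration only (assumed values, not real data).
dbogap_str_rows = [
    ("P00001", 10, "1ABC"),
    ("P00001", 10, "2DEF"),  # same site, second structure
    ("P00002", 55, "3GHI"),
]
dbogap = {Site("P00001", 10), Site("P00003", 7), Site("P00004", 120)}
oglc_ps_plus = {Site("P00003", 7), Site("P00005", 42), Site("P00002", 55)}

# dbogap-unique-seq-with-str: structure is ignored, so duplicate sites collapse.
dbogap_unique_seq_with_str = unique_sites(dbogap_str_rows)

# dbogap-seq: in dbogap but not in dbogap-unique-seq-with-str.
dbogap_seq = anti_join(dbogap, dbogap_unique_seq_with_str)

# Oglc-non-dbogap: in Oglc-PS+ but in neither of the two dbOGAP-derived sets.
oglc_non_dbogap = anti_join(oglc_ps_plus, dbogap_seq, dbogap_unique_seq_with_str)

# Unique sites across dbogap and Oglc-PS+ (the analogue of the 638 total in the table).
all_sites = dbogap | oglc_ps_plus
```

Set difference is used here because each dataset is already defined as a collection of unique accession and position pairs; with tabular data, the same result could be obtained with a left join that keeps only the unmatched rows.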