Table 8 Description of the collected data^a

From: Ridge regression estimated linear probability model predictions of O-glycosylation in proteins with structural and sequence data

| Dataset name | Data description | Identified in glycos_public.xlsx by | Sample size | Source |
| --- | --- | --- | --- | --- |
| dbogap-str | O-GlcNAc glycosylated sequences with sequence and structural data | Oglycos_status = yes | 1,105, of which 998 are human with unique PDB-IDs; of these 998, only 16 are inferred from known O-GlcNAcylated orthologs (the others are experimentally validated). The remaining 107 sequences are non-human. The 998 sequences are in-sample data for the proteins with sequence and structural information. The structural information on the 998 sequences was collected | dbOGAP |
| dbogap | Human O-GlcNAc glycosylated sequences | Ogly_only_seq = yes | 376. These are unique UniProt Accession No. and position pairs. | dbOGAP |
| dbogap-unique-seq-with-str | Extract of human sequences from dbogap-str with unique UniProt Accession No. and position pairs (i.e., structure is ignored) | Not identified, as it is derivable using software like SAS or R | 39. Of these, 28 are experimentally validated and the remaining 11 are inferred. These 39 sequences become 998 in dbogap-str via the richness in conformational changes associated with them. The 39 sequences come from 25 unique proteins (UniProt Accession Nos.). | N/A |
| dbogap-seq | Merge dbogap with dbogap-unique-seq-with-str by UniProt Accession No. and position, and retain those in the first dataset but not in the second | Not identified, as it is derivable | 340 | N/A |
| Oglc-PS+ | Additional extract of O-GlcNAc glycosylated human sequences | Ogly_21 = yes | 411. Of these, 59.12% are glycosylated at S and the others at T. Of the 25 unique human proteins in dbogap-unique-seq-with-str, 18 are in Oglc-PS+. | PhosphoSitePlus |
| Oglc-non-dbogap | Merge Oglc-PS+ with dbogap-seq and dbogap-unique-seq-with-str by UniProt Accession No. and position, and retain those in Oglc-PS+ but not in dbogap-seq or dbogap-unique-seq-with-str | GLCNAC_s1 = yes | 259. This is used as out-of-sample data. Note that 152 of the 340 sequences in dbogap-seq are in Oglc-PS+. The total number of unique sequences (UniProt Accession No. and UniProt position pairs) in dbogap and Oglc-PS+ is 638. | N/A |
| Ogal | O-GalNAc glycosylated human sequences | GALNAC_s1 = yes | 2,079. This is used as out-of-sample data. Of these, 60.27% are glycosylated at T and the others at S. | PhosphoSitePlus |
| Ngly | N-glycosylated sequences | glyco_status = yes | 6,328. Of these, 2,422 are “Homo sapiens (Human)”. Of the 2,422, the count of sequences with more than one sugar bound is 1,083. These 1,083 sequences are in-sample data for the proteins with sequence and structural information. If structure is ignored, there are 361 unique sequences (i.e., unique UniProt Accession No. and position pairs). These 361 sequences are in-sample data for the proteins with only sequence data. | Gana et al. [34] |
| Phosy [35] | Phosphorylated sequences | Not identified. This is archived in a separate file: Phosy.csv | 363,256. Of these, 227,810 are human with amino acids in ±7 positions of the S/T-site; 58.95%, 24.51%, and 16.54% are phosphorylated at S, T, and Y, respectively. | PhosphoSitePlus |
| WSTW-Uniprot | Human sequences with the W–S/T–W sequon | wstw = yes | 236. This extract is unique in terms of UniProt Accession No. and position pairs. | UniProt |
^a The columns give the dataset name, a description of the data, how the data are identified in glycos_public.xlsx, the sample size, and the source. For example, 1,105 O-GlcNAc glycosylated proteins with sequence and structural data are collected and stored as dataset dbogap-str. These data are identified in glycos_public.xlsx by “yes” in column Oglycos_status. In terms of unique PDB-IDs, there are 998 sequences in this dataset. The last column cites the source of the collected data, in this case dbOGAP.
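
The dbogap-seq and Oglc-non-dbogap rows are both described as merges by UniProt Accession No. and position that retain the records found in the first dataset but not in the others (an anti-join). The table notes that such extracts are derivable using software like SAS or R; the following is only a minimal sketch of the same idea in Python. It assumes each record is a dictionary with hypothetical `accession` and `position` fields, and it uses made-up placeholder sites, not real entries from the datasets.

```python
# Sketch of the "retain those in the first dataset, but not in the second" merge.
# Field names and all accession/position values are hypothetical placeholders.

def site_keys(records):
    """Return the set of (accession, position) pairs found in `records`."""
    return {(r["accession"], r["position"]) for r in records}


def anti_join(left, *others):
    """Keep records of `left` whose (accession, position) key is in none of `others`."""
    excluded = set()
    for other in others:
        excluded |= site_keys(other)
    return [r for r in left if (r["accession"], r["position"]) not in excluded]


def make(pairs):
    """Build toy records from (accession, position) tuples."""
    return [{"accession": a, "position": p} for a, p in pairs]


# Toy stand-ins for the datasets named in the table.
dbogap = make([("A0001", 10), ("A0001", 25), ("B0002", 7)])
dbogap_unique_seq_with_str = make([("A0001", 25)])
oglc_ps_plus = make([("A0001", 10), ("C0003", 42), ("A0001", 25), ("D0004", 5)])

# dbogap-seq: in dbogap, but not in dbogap-unique-seq-with-str.
dbogap_seq = anti_join(dbogap, dbogap_unique_seq_with_str)

# Oglc-non-dbogap: in Oglc-PS+, but in neither dbogap-seq nor dbogap-unique-seq-with-str.
oglc_non_dbogap = anti_join(oglc_ps_plus, dbogap_seq, dbogap_unique_seq_with_str)

print(sorted(site_keys(dbogap_seq)))
print(sorted(site_keys(oglc_non_dbogap)))
```

Keying on the (accession, position) pair rather than on the accession alone mirrors the table's definition of a unique sequence, so two sites on the same protein are treated as distinct records.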