Table 8 Description of the collected data^a

From: Ridge regression estimated linear probability model predictions of O-glycosylation in proteins with structural and sequence data

| Dataset name given | Data description | Identified in glycos_public.xlsx by | Sample size | Source |
| --- | --- | --- | --- | --- |
| dbogap-str | O-GlcNAc glycosylated sequences with sequence and structural data | Oglycos_status = yes | 1,105. Where 998 are human with unique PDB-IDs; of these, only 16 are inferred from known O-GlcNAcylated orthologs (the others are experimentally validated). The remaining 107 sequences are non-human. These 998 sequences are in-sample data for the proteins with sequence and structural information. The structural information on the 998 sequences was collected | dbOGAP |
| dbogap | Human O-GlcNAc glycosylated sequences | Ogly_only_seq = yes | 376. These are unique UniProt Accession No. and position pairs. | dbOGAP |
| dbogap-unique-seq-with-str | Extract of human sequences from dbogap-str with unique UniProt Accession No. and position pairs (i.e., structure is ignored) | Not identified as it is derivable using software like SAS or R | 39. Of these, 28 are experimentally validated and the remaining 11 are inferred. These 39 sequences become 998 in dbogap-str via the richness in conformational changes associated with them. Of the 39 sequences, 25 are unique proteins (UniProt Accession Nos.) | N/A |
| dbogap-seq | Merge dbogap with dbogap-unique-seq-with-str, by UniProt Accession No. and position, and retain those in the first dataset, but not in the second. | Not identified as it is derivable | 340 | N/A |
| Oglc-PS+ | Additional extract of O-GlcNAc glycosylated human sequences | Ogly_21 = yes | 411. Of these, 59.12% are glycosylated at S and the others at T. Of the 25 unique human proteins in dbogap-unique-seq-with-str, 18 are in Oglc-PS+ | PhosphoSitePlus |
| Oglc-non-dbogap | Merge Oglc-PS+ with dbogap-seq and dbogap-unique-seq-with-str, by UniProt Accession No. and position, and retain those in Oglc-PS+, but not in dbogap-seq or dbogap-unique-seq-with-str | GLCNAC_s1 = yes | 259. This is used as out-of-sample data. Note, 152 of the 340 sequences in dbogap-seq are in Oglc-PS+. The total number of unique sequences (UniProt Accession No. and UniProt position pairs) in dbogap and Oglc-PS+ is 638. | N/A |
| Ogal | O-GalNAc glycosylated human sequences | GALNAC_s1 = yes | 2,079. This is used as out-of-sample data. Of these, 60.27% are glycosylated at T and the others at S | PhosphoSitePlus |
| Ngly | N-glycosylated sequences | glyco_status = yes | 6,328. Of these, 2,422 are "Homo sapiens (Human)". Of the 2,422, the count of sequences with more than one sugar bound is 1,083. These 1,083 sequences are in-sample data for the proteins with sequence and structural information. If structure is ignored, there are 361 unique sequences (i.e., unique UniProt Accession No. and position pairs). These 361 sequences are in-sample data for the proteins with only sequence data | Gana et al. [34] |
| Phosy | [35] Phosphorylated sequences | Not identified. This is archived in a separate file: Phosy.csv | 363,256. Of these, 227,810 are human with amino acids in ±7 positions of the S/T-site; and 58.95%, 24.51%, and 16.54% are phosphorylated at S, T and Y, respectively | PhosphoSitePlus |
| WSTW-Uniprot | Human sequences with the W–S/T–W sequon | wstw = yes | 236. This extract is unique in terms of UniProt Accession No. and position pairs | UniProt |

a. The columns give the dataset name, a description of the data, how the data is identified in glycos_public.xlsx, the count of sequences collected, and the source. For example, 1,105 O-GlcNAc glycosylated proteins with sequence and structural data were collected and stored as dataset dbogap-str. This data is identified in glycos_public.xlsx by "yes" in column Oglycos_status. In terms of unique PDB-IDs, there are 998 sequences in this data. The last column cites the source of the collected data, dbOGAP.
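
The derived datasets in the table (dbogap-unique-seq-with-str, dbogap-seq and Oglc-non-dbogap) are described as de-duplication and merge-and-exclude steps keyed on UniProt Accession No. and position. The table mentions SAS or R for this; the sketch below shows the same logic in plain Python as an illustration only. The record layout, the field names (`accession`, `position`, `pdb_id`), the helper names and all values are hypothetical toy data, not taken from glycos_public.xlsx.

```python
def unique_pairs(records):
    """Return distinct (accession, position) pairs in first-seen order,
    ignoring any structural fields such as a PDB ID."""
    seen = set()
    pairs = []
    for rec in records:
        key = (rec["accession"], rec["position"])
        if key not in seen:
            seen.add(key)
            pairs.append(key)
    return pairs


def anti_join(first, second):
    """Keep the pairs of `first` that do not occur in `second`."""
    exclude = set(second)
    return [pair for pair in first if pair not in exclude]


# Hypothetical toy stand-ins for the datasets (placeholder accessions).
dbogap_str = [
    {"accession": "ACC_A", "position": 10, "pdb_id": "PDB_1"},
    {"accession": "ACC_A", "position": 10, "pdb_id": "PDB_2"},  # same site, other structure
    {"accession": "ACC_B", "position": 55, "pdb_id": "PDB_3"},
]
dbogap = [("ACC_A", 10), ("ACC_C", 7), ("ACC_D", 120)]
oglc_ps_plus = [("ACC_C", 7), ("ACC_E", 33), ("ACC_B", 55)]

# dbogap-unique-seq-with-str: structure is ignored, one row per site.
unique_seq_with_str = unique_pairs(dbogap_str)

# dbogap-seq: in dbogap but not in dbogap-unique-seq-with-str.
dbogap_seq = anti_join(dbogap, unique_seq_with_str)

# Oglc-non-dbogap: in Oglc-PS+ but in neither of the two dbOGAP-derived sets.
oglc_non_dbogap = anti_join(oglc_ps_plus, dbogap_seq + unique_seq_with_str)

print(unique_seq_with_str)
print(dbogap_seq)
print(oglc_non_dbogap)
```

With real data, the same two helpers would be applied to the (UniProt Accession No., position) pairs read from the spreadsheet; how the file is loaded is left out here because its exact column layout is not shown in the table.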