Selection of HT-SELEX datasets and rounds
We analyzed the quality-filtered datasets for 169 human TFs released with Yang et al.'s recent study [35]. Yang et al. resequenced these datasets from [14] at substantially higher depth (on average a ~10-fold increase) and filtered them through a quality-control pipeline [35]. For each dataset, we analyzed the same HT-SELEX round that Yang et al. selected, based on criteria that maximize the presence of both strong and weak sites for the corresponding TF.
CIS-BP motifs and finding CIS-BP motif matches
As the motifs, we chose every CIS-BP motif [33] derived from direct binding evidence. This gave a median of four motifs per TF (range: 1−17). For detecting motif matches, we used the FIMO program [10] with the commonly adopted and relatively liberal significance threshold of 1e-4 [7, 12, 15, 16, 21, 31, 34].
A pipeline to identify non-canonical motifs from HT-SELEX data
To investigate the existence of non-canonical motifs and their difference from canonical motifs, we developed a pipeline for analyzing HT-SELEX data. We combine the standard practices of HT-SELEX data modeling [29] with an additional set of conservative filtering criteria (Fig. 4). The additional criteria aim to ensure that any observed signal of non-canonical motifs is likely not an artifact of the HT-SELEX procedure [23]. The following is an outline of our pipeline; we describe the steps in detail in the following sub-sections.
Step 1. The HT-SELEX experiment for a TF starts with a library of random oligonucleotides (oligos), known as the round #0 library. From the TF-bound oligos of the round #0 library, one constructs the round #1 library, and this process is repeated for several rounds. Thus, each successive HT-SELEX round becomes more enriched for oligos containing specific sites for the TF [14]. Here we analyze the HT-SELEX libraries that Yang et al. selected for each TF after applying a set of quality-control criteria [35]. These criteria ensure that the chosen rounds contain binding sites with an expected level of variation in the TFs' DNA-binding affinity.
Step 2. Within the oligos of the selected round, we identify occurrences of the TF's known motifs. As the known motifs, we use every CIS-BP motif [33] derived from direct binding evidence in human. (In a post-hoc check, we confirmed that the discovered non-canonical motifs do not match CIS-BP motifs derived from indirect evidence.) We use the FIMO program [10] to identify occurrences of the CIS-BP motifs in the oligos.
Step 3. Following the approach of Slattery et al. [29], we compute the effective length (L) of the TF’s binding sites. This is an information-theoretic approach to estimate the length of a TF’s DNA-binding sites without making assumptions about its DNA-binding properties.
Step 4. We perform an initial filtering of L-mers based on the following two criteria. First, an L-mer should occur at least 100 times (following [29]) in the selected round's oligos. Second, we compute the entropy (based on dinucleotide frequency) of each L-mer and discard all L-mers whose entropy is lower than the minimum entropy of a CIS-BP motif occurrence.
Step 5. We compute the enrichment of the remaining L-mers in the selected round’s oligos with respect to round #0 oligos. To estimate the count of an L-mer in round #0, we build higher-order Markov models of the round #0 oligos following Slattery et al.’s approach [29].
Step 6. We then identify the "motif L-mers" and "non-motif L-mers". An L-mer is a motif L-mer if it is a substring or a superstring of a CIS-BP motif occurrence; otherwise it is a non-motif L-mer. We discard all non-motif L-mers that fail to satisfy the following three criteria. First, we discard all non-motif L-mers whose enrichment is lower than that of the least enriched motif L-mer. Second, for each non-motif L-mer, we compute its "round-over-round enrichment", i.e., the ratio of its enrichments in each pair of successive rounds, and discard the L-mer if this ratio is less than 1 for any pair of successive rounds. Finally, we discard the non-motif L-mers that always co-occur with motif L-mers in the selected round's oligos.
Step 7. We separately cluster the motif L-mers and the non-motif L-mers into canonical and non-canonical motifs. To ensure the reliability of the clustering algorithm, we confirm that the canonical motifs are similar to the CIS-BP motifs of the corresponding TF. We discard every non-canonical motif which does not occur alone in at least 5% of the selected round’s oligos. When counting the number of oligos where a non-canonical motif occurs alone, we also ensure that the same oligo is not counted more than once for different non-canonical motifs.
Step 8. For each canonical and non-canonical motif, we compute its similarity score with the CIS-BP motifs. Based on the similarity scores of the canonical motifs, we then compute a one-tailed p-value for the similarity score of each non-canonical motif. We call a non-canonical motif significantly different from the CIS-BP motifs if this one-tailed p-value passes the threshold corresponding to a 5% FDR (false discovery rate) [4].
Modeling k-mer frequencies in round #0 libraries
We followed Slattery et al.'s [29] Markov model-based procedure to model k-mer frequencies in round #0 of the HT-SELEX datasets. For each round #0 library, we first shuffled the order of the sequences and partitioned them into two equal-sized sets for training and validation. We then determined the optimal order of a Markov model for that library by fitting Markov models of orders between zero and an integer M on the training sequences and comparing the models' performance (coefficient of determination, R2) on the validation sequences. To determine M, we identified the largest value of k such that every k-mer "occurs" at least 100 times in the library, and set M = k − 1. We say that a k-mer occurs in a DNA sequence if the sequence has a k-length substring (in either the forward or the reverse-complement direction) exactly matching the k-mer. Across the datasets, the median value of M was 6 (range: 5−8), the median optimal order was 5 (range: 4−7), and the median R2 of the optimal models was 0.94 (range: 0.76−0.99).
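The sketch below illustrates this order-selection procedure under stated assumptions rather than reproducing our exact implementation: for concreteness, it computes R2 between predicted and observed (M + 1)-mer frequencies in the validation half, uses add-one smoothing of the transition probabilities, and counts k-mers on the forward strand only; the function names (kmer_counts, markov_probs, select_order) are illustrative.

```python
import random
from collections import Counter
from itertools import product

import numpy as np

BASES = "ACGT"

def kmer_counts(seqs, k):
    """Count all k-length substrings (forward strand only, for brevity)."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def markov_probs(seqs, order):
    """Estimate P(base | preceding `order` bases) with add-one smoothing."""
    num = kmer_counts(seqs, order + 1)
    den = kmer_counts(seqs, order) if order > 0 else None
    total = sum(num.values())
    probs = {}
    for ctx in map("".join, product(BASES, repeat=order)):
        for b in BASES:
            d = den[ctx] if order > 0 else total
            probs[ctx + b] = (num[ctx + b] + 1) / (d + 4)
    return probs

def predict_kmer_freq(kmer, probs, order, base_freq):
    """Predicted frequency of a k-mer under the fitted Markov model."""
    p = 1.0
    for i, b in enumerate(kmer):
        if i < order:
            p *= base_freq[b]  # crude start treatment: independent base frequencies
        else:
            p *= probs[kmer[i - order:i + 1]]
    return p

def select_order(seqs, M, seed=0):
    """Pick the order (0..M) maximizing R^2 of predicted vs observed
    (M+1)-mer frequencies on a held-out half of the round #0 library."""
    rng = random.Random(seed)
    seqs = list(seqs)
    rng.shuffle(seqs)
    half = len(seqs) // 2
    train, valid = seqs[:half], seqs[half:]

    k = M + 1
    obs = kmer_counts(valid, k)
    total = sum(obs.values())
    obs_freq = {w: c / total for w, c in obs.items()}

    base_counts = kmer_counts(train, 1)
    base_total = sum(base_counts.values())
    base_freq = {b: base_counts[b] / base_total for b in BASES}

    best = None
    for order in range(0, M + 1):
        probs = markov_probs(train, order)
        kmers = list(obs_freq)
        y_obs = np.array([obs_freq[w] for w in kmers])
        y_pred = np.array([predict_kmer_freq(w, probs, order, base_freq) for w in kmers])
        ss_res = np.sum((y_obs - y_pred) ** 2)
        ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
        r2 = 1 - ss_res / ss_tot
        if best is None or r2 > best[1]:
            best = (order, r2)
    return best  # (optimal order, its R^2)
```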
Computing the effective length (L) of TF binding sites
We followed Slattery et al.'s [29] procedure to identify the effective site length (L) of a TF from the sequences of its selected round's library. According to Slattery et al., for the effective site length L, the distribution of L-mer frequencies in the selected round should be maximally distant from that in round #0. Thus, for each value of k between m + 1 and 15, where m is the optimal order of the Markov model for the corresponding round #0 library, we computed the KL divergence DKL between the distributions of k-mer frequencies in the selected round and in round #0 (to compute the k-mer frequencies in round #0, we used the Markov models described above). We then set L to the value of k that maximized this KL divergence. Like Slattery et al., we took the frequencies of all k-mers occurring at least 100 times in the selected round and treated all other k-mers as one combined group. We thus computed the KL divergence DKL as follows.
$$D_{KL}=\sum_{w\in S_{100}} P_R(w)\log\frac{P_R(w)}{P_0(w)}+Q_R\log\frac{Q_R}{Q_0},$$
where:
S100 is the set of all k-mers that occur at least 100 times in the selected round R,
PR(w) and P0(w) are the frequencies of the k-mer w in rounds #R and #0, respectively, and \(Q_R = 1 - \sum_{w \in S_{100}} P_R(w)\) and \(Q_0 = 1 - \sum_{w \in S_{100}} P_0(w)\).
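A minimal sketch of this computation, assuming the k-mer frequencies have already been collected into dictionaries restricted to S100 (p_R for the selected round and p_0 from the round #0 Markov model; the variable names are illustrative):

```python
import math

def kl_divergence(p_R, p_0):
    """D_KL between the k-mer distributions of the selected round (p_R) and
    round #0 (p_0), with all k-mers outside S_100 pooled into one group Q."""
    d = sum(p_R[w] * math.log(p_R[w] / p_0[w]) for w in p_R)
    q_R = 1.0 - sum(p_R.values())
    q_0 = 1.0 - sum(p_0[w] for w in p_R)
    if q_R > 0 and q_0 > 0:
        d += q_R * math.log(q_R / q_0)
    return d

def effective_site_length(kmer_freqs_R, kmer_freqs_0, m, k_max=15):
    """Pick L = the k in (m+1 .. 15) maximizing D_KL.
    kmer_freqs_R / kmer_freqs_0 map k -> frequency dictionary over S_100."""
    return max(range(m + 1, k_max + 1),
               key=lambda k: kl_divergence(kmer_freqs_R[k], kmer_freqs_0[k]))
```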
Initial filtering of L-mers from the library of the selected round
From the sequences (oligos) of the selected round's library, we removed every L-mer that failed either of the following two criteria.
1. Minimum count: we eliminate an L-mer if it occurs fewer than 100 times in the library sequences.
2. Minimum dinucleotide-based entropy: we first compute the dinucleotide-based entropy of every match to the TF's CIS-BP motifs and take the minimum; we then eliminate an L-mer if its dinucleotide-based entropy is less than this minimum. We define the dinucleotide-based entropy of a given sequence as follows (a minimal code sketch follows the definitions below).
$$H=\sum_{k\in S_2} p_k\log\left(\frac{1}{p_k}\right),$$
where:
S2 is the set of all di-nucleotides, i.e., {AA, AC, AG, AT, …, TG, TT}, and
pk is the frequency of the k-th di-nucleotide in that sequence.
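The sketch below computes this entropy for a single sequence; it assumes natural logarithms (consistent with the formula above) and that dinucleotides are counted in overlapping windows on the forward strand. The function name dinucleotide_entropy is illustrative.

```python
import math
from collections import Counter

def dinucleotide_entropy(seq):
    """Shannon entropy of the dinucleotide composition of `seq`.
    Dinucleotides absent from the sequence contribute zero to the sum."""
    dinucs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = Counter(dinucs)
    n = len(dinucs)
    return sum((c / n) * math.log(n / c) for c in counts.values())

# e.g. keep an L-mer only if dinucleotide_entropy(lmer) >= min entropy
# observed over the CIS-BP motif matches
```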
Identifying motif L-mers and non-motif L-mers
We then rank all L-mers in the selected round's library according to their enrichment with respect to the round #0 library. We define the enrichment of an L-mer w as \(\frac{P_R(w)}{P_0(w)}\), where PR(w) and P0(w) are the frequencies of the L-mer w in rounds #R (the selected round) and #0, respectively.
We then mark every L-mer that is either a substring or a superstring of a match to the TF's CIS-BP motifs, and call these the "motif L-mers". We identify the lowest-ranked motif L-mer and discard every L-mer ranked below it. Among the remaining L-mers, we call an L-mer a "non-motif L-mer" if it is not a motif L-mer.
Filtering non-motif L-mers based on round-over-round (RoR) enrichment and alone occurrence
We further filtered non-motif L-mers based on the following two criteria.
1. Round-over-round enrichment is at least one: for each non-motif L-mer, we computed its enrichment between every pair of consecutive rounds (round-over-round, RoR enrichment) and discarded the L-mer if its RoR enrichment between any pair of consecutive rounds was < 1.
2. Alone occurrence: we discarded a non-motif L-mer if it never occurs "alone" in the selected round's sequences, i.e., if every oligo in which it occurs also contains some motif L-mer. (A minimal sketch of both filters follows this list.)
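The sketch below illustrates both filters, assuming L-mer frequencies per round are available as dictionaries and that occurrence in an oligo is checked by exact forward-strand substring matching (a simplification for brevity). The names round_freqs, motif_lmers, and selected_round_oligos are illustrative.

```python
def passes_ror(lmer, round_freqs):
    """round_freqs: list of dicts mapping L-mer -> frequency in successive
    HT-SELEX rounds (earliest round first). The L-mer passes if its frequency
    never drops between consecutive rounds, i.e. every RoR enrichment >= 1."""
    for prev, curr in zip(round_freqs, round_freqs[1:]):
        if curr.get(lmer, 0.0) < prev.get(lmer, 0.0):
            return False
    return True

def occurs_alone(lmer, oligos, motif_lmers):
    """True if some oligo contains `lmer` but none of the motif L-mers."""
    for oligo in oligos:
        if lmer in oligo and not any(m in oligo for m in motif_lmers):
            return True
    return False

# hypothetical filtering step
# kept = [w for w in non_motif_lmers
#         if passes_ror(w, round_freqs)
#         and occurs_alone(w, selected_round_oligos, motif_lmers)]
```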
Clustering L-mers into canonical and non-canonical motifs
We take the filtered lists of motif L-mers and non-motif L-mers and cluster the L-mers into canonical and non-canonical motifs. The key idea is to iteratively identify a cluster head (defined below) and cluster all the L-mers that (a) have not yet been assigned to any cluster and (b) are covered by the current cluster head.
A cluster head is an l-mer (we chose l = 8; if L < 8, then we chose l = L − 2) that occurs in the maximum number of L-mers with up to m = 2 mismatches (we used m = 1 if l ≤ 5). These choices were adopted from previous string-kernel-based support vector machine models of TF binding specificity [2]. We say that a cluster head covers an L-mer if it occurs in the L-mer with up to m mismatches. Intuitively, a cluster head identifies a core region within the L-mers that it covers. After we cluster the L-mers covered by the current cluster head, we identify a new cluster head for the remaining L-mers and repeat the same process. We continue this iterative process until every L-mer has been assigned to a cluster or we have identified a maximum number of clusters (we set the limit at five).
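The sketch below illustrates the cluster-head search under one simplifying assumption not stated above: candidate heads are restricted to l-mers that actually appear as substrings of the L-mers rather than all 4^l possible l-mers. The helper names (covers, best_cluster_head) are illustrative.

```python
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    """Reverse complement of a DNA string."""
    return s.translate(COMP)[::-1]

def covers(head, lmer, max_mismatch):
    """Does `head` occur in `lmer` (either strand) with <= max_mismatch mismatches?"""
    for target in (lmer, revcomp(lmer)):
        for i in range(len(target) - len(head) + 1):
            window = target[i:i + len(head)]
            if sum(a != b for a, b in zip(head, window)) <= max_mismatch:
                return True
    return False

def best_cluster_head(lmers, l=8, max_mismatch=2):
    """Greedy choice: the l-mer (among substrings of the L-mers) covering
    the most not-yet-clustered L-mers. Use max_mismatch=1 if l <= 5."""
    candidates = {w[i:i + l] for w in lmers for i in range(len(w) - l + 1)}
    return max(candidates,
               key=lambda h: sum(covers(h, w, max_mismatch) for w in lmers))
```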
We next align the L-mers in every cluster. We identify the position within each L-mer where the cluster head occurs with the fewest mismatches. We call these positions the anchor positions for alignment. If there is more than one anchor position for an L-mer, we choose the one closest to the middle of the L-mer. We then align the L-mers along the anchor positions and pad each L-mer with 'N's so that all L-mers in the alignment have the same length. Of note, we always count mismatches by considering l-mers in both the forward and the reverse-complement orientation.
From these alignments, we finally create the position weight matrices or motifs by counting the number of occurrences of each nucleotide at each position of the alignment. An ‘N’ at a position of an L-mer contributes a count of 0.25 to each nucleotide at that position.
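A minimal sketch of the PWM construction from an aligned, 'N'-padded set of L-mers, with each 'N' contributing 0.25 to every nucleotide at its position as described above:

```python
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def build_pwm(aligned_lmers):
    """Build a (length x 4) position weight matrix (columns A, C, G, T)
    from equal-length aligned L-mers; rows are normalized to frequencies."""
    length = len(aligned_lmers[0])
    counts = np.zeros((length, 4))
    for lmer in aligned_lmers:
        for pos, base in enumerate(lmer):
            if base == "N":
                counts[pos] += 0.25
            else:
                counts[pos, BASE_INDEX[base]] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# e.g. build_pwm(["NNACGTAN", "TAACGTNN"])
```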
We visually inspected each canonical motif constructed by the above process and confirmed its similarity with the CIS-BP motifs of the same TF (Supplementary Table 1).
A final point about the non-canonical motifs constructed in this process is worth noting. At any stage during cluster construction, if we find multiple tied cluster heads (i.e., each covers the same number of L-mers), we execute the above process independently for each cluster head. In such cases, the same L-mer will be assigned to more than one cluster and will thus contribute to more than one motif. It is not clear how this may influence our downstream analyses. Therefore, after performing multiple-testing correction on the non-canonical motifs (see below), we manually check whether any pair of significant non-canonical motifs includes the same L-mer and, if so, keep the motif that is more different from the CIS-BP motifs (see below). Thus, in our results, an L-mer never occurs in more than one non-canonical motif.
Selecting non-canonical motifs based on the fraction of oligos explained
We say that a motif (canonical or non-canonical) of a TF occurs in a sequence (an oligo or a ChIP-Seq peak) if any of its constituent L-mers occur in the sequence. When a non-canonical motif of a TF occurs in a sequence, but none of the canonical motifs of the TF occurs in that sequence, we say that the non-canonical motif occurs alone in that sequence. We eliminate a non-canonical motif if it does not occur alone in at least 5% of the selected round’s oligos.
Statistical significance of non-canonical motifs
For each canonical and non-canonical motif of a TF, we first computed its minimum distance Dmin from the collection of CIS-BP motifs of the same TF. To compute the distance D between two motifs, we first trim the motifs by eliminating non-informative positions (information content less than 0.25 bits) from the two ends. We then consider every possible l-length sub-motif (see below) of the two trimmed motifs, compute their Euclidean distances normalized by l, and report the minimum of these normalized distances as D. We set l = 8, or l equal to the length of the shorter motif if that length is less than 8. As with the cluster heads above, the l-length sub-motifs capture the similarity between the two motifs in a core region. When computing the Euclidean distances, we always consider one of the motifs in both forward and reverse-complement orientation and take the smaller of the two distances.
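The sketch below illustrates this distance computation on (length × 4) frequency matrices with columns ordered A, C, G, T (so the reverse complement is obtained by reversing both axes); the trimming, sliding l-length windows, normalization by l, and reverse-complement handling follow the description above, while the per-column information-content convention (2 + Σ p log2 p) is an assumption.

```python
import numpy as np

def trim(pwm, min_ic=0.25):
    """Drop non-informative columns (information content < min_ic bits)
    from the two ends of a (length x 4) PWM."""
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(pwm > 0, pwm * np.log2(pwm), 0.0)
    ic = 2.0 + plogp.sum(axis=1)
    keep = np.where(ic >= min_ic)[0]
    return pwm[keep[0]:keep[-1] + 1] if keep.size else pwm

def motif_distance(pwm_a, pwm_b, l=8):
    """Minimum Euclidean distance (normalized by l) over all l-length
    sub-motif pairs, considering pwm_b in both orientations."""
    a, b = trim(pwm_a), trim(pwm_b)
    l = min(l, len(a), len(b))
    b_rc = b[::-1, ::-1]  # reverse positions and swap A<->T, C<->G
    best = np.inf
    for b_or in (b, b_rc):
        for i in range(len(a) - l + 1):
            for j in range(len(b_or) - l + 1):
                d = np.linalg.norm(a[i:i + l] - b_or[j:j + l]) / l
                best = min(best, d)
    return best
```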
Next, we compute the statistical significance of the Dmin value of each non-canonical motif as a p-value under a normal distribution whose mean and variance are computed from the Dmin values of the canonical motifs. We report this p-value as the statistical significance of the non-canonical motif.
Finally, we report the non-canonical motifs that pass a 5% false discovery rate threshold under the Benjamini-Hochberg procedure [4].
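A minimal sketch of this significance calculation, assuming the Dmin values are already available as arrays; since a larger Dmin means a motif is more different from the CIS-BP motifs, the one-tailed p-value is the upper-tail probability under the fitted normal distribution, followed by Benjamini-Hochberg selection at 5% FDR.

```python
import numpy as np
from scipy.stats import norm

def noncanonical_significance(canonical_dmin, noncanonical_dmin, fdr=0.05):
    """One-tailed p-values for non-canonical D_min values under a normal
    distribution fitted to the canonical D_min values, plus BH selection.
    Returns (p-values, boolean mask of significant non-canonical motifs)."""
    mu = np.mean(canonical_dmin)
    sigma = np.std(canonical_dmin, ddof=1)
    pvals = norm.sf(np.asarray(noncanonical_dmin), loc=mu, scale=sigma)

    # Benjamini-Hochberg at the given FDR
    order = np.argsort(pvals)
    n = len(pvals)
    thresholds = fdr * np.arange(1, n + 1) / n
    passed = pvals[order] <= thresholds
    k = np.max(np.where(passed)[0]) + 1 if passed.any() else 0
    significant = np.zeros(n, dtype=bool)
    significant[order[:k]] = True
    return pvals, significant
```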
ChIP-Seq and regulatory sequence data
We collected ChIP-Seq data from Cistrome DB [17, 38] (batch download for Human_Factor) and regulatory sequence annotations based on ENCODE [6, 8] and Roadmap Epigenomics [27] from Cao et al.'s recent study [5]. For promoters, we downloaded from UCSC [11] the sequences 1000 bases upstream of the annotated transcription start sites of RefSeq genes with annotated 5′ UTRs.
Computational validation: enrichment of non-canonical motifs in ChIP-seq data
We computed the enrichment of non-canonical and canonical motifs in Cistrome ChIP-seq data using the following three types of control data: (i) shuffled versions of the L-mers constituting the motifs, (ii) dinucleotide-shuffled versions of the ChIP-peaks, and (iii) randomly selected genomic sequences matched for length, GC content, and repeat content (using gkmSVM [9]).
For the first analysis, we define the enrichment e(m) of a motif m as follows.
$$e(m)=\frac{1}{|D|}\sum_{d\in D}\frac{1}{|L|}\sum_{l\in L}\frac{n(l,d)}{n(l',d)}$$
where,
D is the set of ChIP-seq datasets of the corresponding TF,
L is the set of L-mers constituting the motif,
l′ is the shuffled sequence of a constituent L-mer l of the motif m,
n(l, d) and n(l′, d) are the number of times l and l′ occur in the ChIP-peaks of a dataset d, respectively.
In other words, for each motif, we first take the mean ratio of the number of times its constituent L-mers and their shuffled versions occur in the ChIP-peaks (with a pseudocount of 1). We then take the mean of this statistic over all datasets of the corresponding TF.
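The sketch below illustrates this calculation for control (i); it assumes a simple mononucleotide shuffle of each L-mer and non-overlapping forward-strand counting, both of which are simplifications rather than details stated above. The function names are illustrative.

```python
import random

def count_occurrences(lmer, peaks):
    """Number of (non-overlapping, forward-strand) occurrences of `lmer`
    across a list of peak sequences."""
    return sum(peak.count(lmer) for peak in peaks)

def shuffled(lmer, seed=0):
    """A mononucleotide-shuffled version of an L-mer (illustrative choice)."""
    chars = list(lmer)
    random.Random(seed).shuffle(chars)
    return "".join(chars)

def enrichment_vs_shuffled_lmers(lmers, datasets, pseudocount=1):
    """e(m) against control (i): mean over ChIP-seq datasets of the mean
    ratio of L-mer counts to shuffled-L-mer counts, with a pseudocount of 1.
    `datasets` is a list of peak-sequence lists (one per ChIP-seq dataset)."""
    per_dataset = []
    for peaks in datasets:
        ratios = [(count_occurrences(l, peaks) + pseudocount) /
                  (count_occurrences(shuffled(l), peaks) + pseudocount)
                  for l in lmers]
        per_dataset.append(sum(ratios) / len(ratios))
    return sum(per_dataset) / len(per_dataset)
```

The analogous calculation against controls (ii) and (iii) replaces the shuffled-L-mer counts in the denominator with counts in the corresponding control sequences, as in the second formula below.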
For the other two analyses, we define the enrichment e(m) of a motif m as follows.
$$e(m)=\frac{1}{|D|}\sum_{d\in D}\frac{1}{|L|}\sum_{l\in L}\frac{n(l,d)}{n(l,C)}$$
where,
D is the set of ChIP-seq datasets of the corresponding TF,
L is the set of L-mers constituting the motif,
n(l, d) and n(l, C) are the number of times an L-mer l occurs in the ChIP-peaks of a dataset d and the corresponding control dataset C, respectively.
In other words, for each motif, we first take the mean ratio of the number of times its constituent L-mers occur in the ChIP-peaks compared to the control sequences (with a pseudocount of 1). We then take the mean of this statistic over all datasets of the corresponding TF.
We show the results in Supplementary Figure 1 and discuss them in Supplementary Text 2. To make the comparison between non-canonical and canonical motifs clear, we plot the e(m) value of each non-canonical motif against the mean e(m) value of all canonical motifs of that TF.
Computational validation: checking for the existence of non-canonical motifs in a separate HT-SELEX data
As a second computational validation, we analyzed the HT-SELEX dataset of Yin et al. [36]. This dataset includes DNA-binding data for full-length TFs and extended DNA-binding domains for 28 of the 31 TFs that have non-canonical motifs in the Yang et al. dataset [35]. For each non-canonical motif, we computed its constituent L-mers' enrichment in the HT-SELEX round that Yin et al. used to derive motifs, relative to the first round of that dataset. We eliminated all L-mers with enrichment less than 1 and applied our clustering algorithm described above. We then scrutinized the resulting motifs for similarity with the non-canonical motifs discovered from the Yang et al. data [35].
Evolutionary conservation analysis
For each TF, we compared the 7-way (human and six other vertebrates) phyloP scores [24] between its canonical and non-canonical motif occurrences within its ChIP-Seq peaks (Cistrome datasets [17, 38]). For non-canonical motif occurrences, we counted only the occurrences within peaks that lack occurrences of the TF's canonical motifs. For each matching sequence in these two groups, we computed its mean phyloP score from the basewise scores and performed a two-sample Kolmogorov-Smirnov test on the two groups. We then computed, for each non-canonical motif, the fraction of datasets in which the two groups did not differ significantly in phyloP scores (two-sample KS test p-value > 0.01).
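A minimal sketch of the per-dataset comparison, assuming the basewise phyloP scores of each motif occurrence have already been extracted into arrays (the function name and inputs are illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

def compare_conservation(canonical_scores, noncanonical_scores, alpha=0.01):
    """canonical_scores / noncanonical_scores: lists of per-occurrence basewise
    phyloP arrays for canonical and non-canonical motif matches within the peaks
    of one ChIP-Seq dataset. Returns the KS p-value and whether the two groups
    differ significantly at the given alpha."""
    mean_canonical = [np.mean(s) for s in canonical_scores]
    mean_noncanonical = [np.mean(s) for s in noncanonical_scores]
    _, pval = ks_2samp(mean_canonical, mean_noncanonical)
    return pval, pval <= alpha
```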