Selection of HTSELEX datasets and rounds
We analyzed the quality filtered datasets for 169 human TFs released with Yang et al.’s recent study [35]. Yang et al. resequenced these datasets from [14] at a significantly higher depth (on average ~10fold increase in depth) and filtered through a qualitycontrol pipeline [35]. For each dataset, we analyzed the same HTSELEX round that Yang et al. selected based on a set of criteria that maximize the presence of both strong and weak sites for the corresponding TF.
CISBP motifs and finding CISBP motif matches
As the motifs, we chose every CISBP motif [33] that was derived based on direct binding evidence. This returned a median number of four motifs per TF (range: 1−17). For detecting motif matches, we used the FIMO [10] program with the commonly adopted and relatively liberal significance threshold of 1e4 [7, 12, 15, 16, 21, 31, 34].
A pipeline to identify noncanonical motifs from HTSELEX data
To investigate the existence of noncanonical motifs and their difference from canonical motifs, we developed a pipeline for analyzing HTSELEX data. We combine the standard practices of HTSELEX data modeling [29] with an additional set of conservative filtering criteria (Fig. 4). The additional criteria aim to ensure that any observed signal of noncanonical motifs is likely not an artifact of the HTSELEX procedure [23]. The following is an outline of our pipeline; we describe the steps in detail in the following subsections.
Step 1. The HTSELEX experiment for a TF starts with a library of random oligonucleotides (oligos), also known as the round #0 library. From the TFbound oligos from the round #0 library, one then constructs the round #1 library and repeats this process for several rounds. Thus, each HTSELEX round becomes more enriched for oligos with specific sites for the TF [14]. Here we analyze the HTSELEX libraries that Yang et al. selected for each TF after applying a set of qualitycontrol criteria [35]. These criteria were to ensure that the chosen rounds contained binding sites with an expected level of variation in the TFs’ DNA binding affinity.
Step 2. Within the oligos of the selected round, we identify the occurrences of the known motifs of the TF. As the known motifs, we use every CISBP motif [33] that is derived based on direct binding evidence in human. (In a postanalysis checking, we confirmed that the discovered noncanonical motifs do not match with CISBP motifs derived from indirect evidence.) We use the program FIMO [10] to identify the occurrences of the CISBP motifs in the oligos.
Step 3. Following the approach of Slattery et al. [29], we compute the effective length (L) of the TF’s binding sites. This is an informationtheoretic approach to estimate the length of a TF’s DNAbinding sites without making assumptions about its DNAbinding properties.
Step 4. We perform an initial filtering on Lmers based on the following two criteria. First, an Lmer should occur at least 100 times (following [29]) in the selected round’s oligos. Secondly, we computed the entropy (based on dinucleotide frequency) of each Lmer, and discarded all Lmers that have an entropy lower than the minimum entropy of a CISBP motif occurrence.
Step 5. We compute the enrichment of the remaining Lmers in the selected round’s oligos with respect to round #0 oligos. To estimate the count of an Lmer in round #0, we build higherorder Markov models of the round #0 oligos following Slattery et al.’s approach [29].
Step 6. We then identify the “motif Lmers” and “nonmotif Lmers”. An Lmer is a motif Lmer if it is a substring or a superstring of a CISBP motif occurrence, otherwise it is a nonmotif Lmer. We discarded all nonmotif Lmers that fail to satisfy the following three criteria. First, we discard all nonmotif Lmers that have an enrichment lower than the least enriched motif Lmer. Secondly, for each nonmotif Lmer, we compute its “roundoverround enrichment”, i.e., the ratio of its enrichment in each pair of successive rounds. We discard a nonmotif Lmer if its roundoverround enrichment for any pair of successive rounds is less than 1. Finally, we discard the nonmotif Lmers that always cooccur with motif Lmers in the selected round’s oligos.
Step 7. We separately cluster the motif Lmers and the nonmotif Lmers into canonical and noncanonical motifs. To ensure the reliability of the clustering algorithm, we confirm that the canonical motifs are similar to the CISBP motifs of the corresponding TF. We discard every noncanonical motif which does not occur alone in at least 5% of the selected round’s oligos. When counting the number of oligos where a noncanonical motif occurs alone, we also ensure that the same oligo is not counted more than once for different noncanonical motifs.
Step 8. For each canonical and noncanonical motif, we compute its similarity score with the CISBP motifs. Based on the similarity scores of the canonical motifs, we then compute the empirical onetailed pvalues for the similarity scores of the noncanonical motif. We called a noncanonical motif as significantly different from the CISBP motifs if this onetailed pvalue passed the threshold corresponding to 5% FDR (false discovery rate) correction [4].
Modeling kmer frequencies in round #0 libraries
We followed Slattery et al.’s [29] Markov model based procedure to model kmer frequencies in round #0 of the HTSELEX datasets. For each round #0 library, we first shuffled the order of the sequences and partitioned those into two equal sized datasets for training and validation. We then computed the optimal order of a Markov model for that library by fitting Markov models of order between zero and an integer M on the training sequences, and comparing the model performance (coefficient of determination, R^{2}) on the validation sequences. To determine M, we identified the largest value of k such that every kmer “occurs” at least 100 times in the library, and we set M = k − 1. We say that a kmer occurs in a DNA sequence if the sequence has a klength substring (either in the forward or the reverse complement direction) exactly matching the kmer. Across the datasets, the median value of M was 6 (range: 5−8), the median value of optimal orders was 5 (range: 4−7), and the median R^{2} values of the optimal models was 0.94 (range: 0.76−0.99).
Computing the effective length (L) of TF binding sites
We followed Slattery et al.’s [29] procedure to identify the effective site length (L) of a TF from its selected round’s library sequences. According to Slattery et al., for the effective site length L, the distribution of Lmer frequencies in the selected round should be maximally distant from the distribution of Lmer frequencies in round #0. Thus, for each value of k between m + 1 and 15, where m is the optimal order of Markov model for the corresponding round #0 library, we computed the KL divergence D_{KL} between the distributions of kmer frequencies in the selected round and round #0 (to compute the kmer frequencies in round #0, we used the Markov models computed above). We then set L equal to the value of k for which the above KL divergence was the maximum. Like Slattery et al., we took the frequencies of all kmers that occur at least 100 times in the selected round and considered all other kmers as one combined group. Thus, we computed the above KL divergence D_{KL} as follows.
$${D}_{KL}={\sum}_{w\in {S}_{100}}\kern0.24em {P}_R(w)\mathit{\log}\frac{P_R(w)}{P_0(w)}+{Q}_R(w)\mathit{\log}\frac{Q_R(w)}{Q_0(w)},$$
where:
S_{100 }is the set of all kmers that occur at least 100 times in the selected round R,
P_{R}(w) and P_{0}(w) are the frequencies of the kmer w in rounds #R and #0 respectively, and \({Q}_R(w)=1{\sum}_{w\in {S}_{100}}\;{P}_R(w)\) and \({Q}_0(w)=1{\sum}_{w\in {S}_{100}}\kern0.24em {P}_0(w)\).
Initial filtering of Lmers from the library of the selected round
From the sequences (oligos) of the selected round’s library, we removed every Lmer if it failed in either of the following two criteria.

1.
Minimum count: we eliminate an Lmer if it occurs less than 100 times in the library sequences

2.
Minimum dinucleotide based entropy: we first computed the minimum dinucleotide based entropy of all matches to the TF’s CISBP motif, and we eliminate an Lmer if its dinucleotide based entropy is less than the minimum of that computed from the CISBP motifs. We defined the dinucleotide based entropy for a given sequence as follows.
$$H={\sum}_{k\in {S}_2}\;{p}_k\mathit{\log}\left(\frac{1}{p_k}\right),$$
where:
S_{2 }is the set of all dinucleotides, i.e., {AA, AC, AG, AT, …, TG, TT}, and
p_{k }is the frequency of the kth dinucleotide in that sequence.
Identifying motif Lmers and nonmotif Lmers
We then rank all Lmers in the selected round’s library according to their enrichment with respect to round #0 library. We defined the enrichment of an Lmer w as \(\frac{P_R(w)}{P_0(w)}\), where P_{R}(w) and P_{0}(w) are the frequencies of the kmer w in rounds #R (the selected round) and #0 respectively.
We then mark every Lmer that is as either a substring or a superstring of the TF’s CISBP motif matches, and call these the “motif Lmers”. We identify the lowest ranked motif Lmer and discard every lowerranked Lmers. From the remaining Lmers, we call an Lmer to be a “nonmotif Lmer” if it is not a motif Lmer.
Filtering nonmotif Lmers based on roundoverround (RoR) enrichment and alone occurrence
We further filtered nonmotif Lmers based on the following two criteria.

1.
RoundoverRound enrichment is at least one: for each nonmotif Lmer, we computed its enrichment between every pair of consecutive rounds (RoundoverRound, RoR enrichment). We discarded a nonmotif Lmer if its RoR enrichment between any pair of consecutive rounds was < 1.

2.
Alone occurrence: we discard a nonmotif Lmer if it never occurs “alone” in the selected round’s sequences, i.e., if it occurs with some motif Lmer in all oligos.
Clustering Lmers into canonical and noncanonical motifs
We take the filtered lists of motif Lmers and nonmotif Lmers, and cluster the Lmers into canonical and noncanonical motifs. The key idea is to iteratively identify a cluster head (defined below) and cluster all the Lmers that: (a) have not been assigned to any other cluster yet and (b) are covered by the current cluster head (defined below).
A cluster head is an lmer (we chose l = 8; if L < 8, then we chose l = L − 2) that occurs in the maximum number of Lmers with up to m = 2 mismatches (we use m = 1 if l ≤ 5). These choices were adopted from previous stringkernel based support vector machine models of TF binding specificity [2]. We say that a cluster head covers an Lmer if it occurs in the Lmer with up to m mismatches. Intuitively, a cluster head identifies a core region within the Lmers that it covers. After we cluster the Lmers covered by the current cluster head, we identify a new cluster head for the remaining Lmers and repeat the same process. We continue this iterative process until every Lmer has been assigned to a cluster or we have identified a maximum number of clusters (we set the limit at five).
We next align the Lmers in every cluster. We identify the position within each Lmer where the cluster head occurs with the fewest number of mismatches. We call these positions the anchor positions for alignment. If there are more than one anchor position for an Lmer, we choose the one that is closest to the middle position of the Lmer. We then align the Lmers along the anchor positions, and pad each Lmer with ‘N’s to make sure that all Lmers in the alignment have the same length. Of note, we always count mismatches by considering lmers in both the forward and the reverse complement orientation.
From these alignments, we finally create the position weight matrices or motifs by counting the number of occurrences of each nucleotide at each position of the alignment. An ‘N’ at a position of an Lmer contributes a count of 0.25 to each nucleotide at that position.
We visually confirmed each canonical motif constructed from the above process and confirmed their similarity with the CISBP motifs of the same TF (Supplementary Table 1).
It is useful here to mention a final point about the noncanonical motifs constructed in the above process. At any stage during cluster construction, if we find multiple cluster heads (i.e., each of them covers the same number of Lmers), then we execute the above process independently for each cluster head. In such cases, the same Lmer will be assigned to more than one cluster and thus, will contribute to more than one motif. It is not clear how this may influence our downstream analyses. Therefore, after performing multiple test corrections on the noncanonical motifs (see below), we manually check if there is any pair of significant noncanonical motifs that includes the same Lmer and keep the motif that is more different from the CISBP motifs (see below). Thus, in our results, an Lmer never occurs more than once in the noncanonical motifs.
Selecting noncanonical motifs based on fraction of Oligos explained
We say that a motif (canonical or noncanonical) of a TF occurs in a sequence (an oligo or a ChIPSeq peak) if any of its constituent Lmers occur in the sequence. When a noncanonical motif of a TF occurs in a sequence, but none of the canonical motifs of the TF occurs in that sequence, we say that the noncanonical motif occurs alone in that sequence. We eliminate a noncanonical motif if it does not occur alone in at least 5% of the selected round’s oligos.
Statistical significance of noncanonical motifs
For each canonical and noncanonical motif of a TF, we first computed its minimum distance D_{min} from the collection of CISBP motifs of the same TF. To compute the distance D between two motifs, we first trim the motifs by eliminating noninformative positions (information content less than 0.25 bits) from the two ends. Then we consider the every possible llength submotifs (see below) of the two trimmed motifs, compute their Euclidean distances normalized by l, and report the minimum of these normalized distances as D. We chose l = 8 or set l = the length of the smaller motif if its length is smaller than 8. As we did for cluster heads above, the llength submotifs capture the similarity between the two motifs in a core region. While computing the Euclidean distances, we always consider one of the motifs in both forward and reverse complement orientation, and take the smaller of the two distances.
Next, we compute the statistical significance of the D_{min} value of each noncanonical motif by computing a pvalue using a normal distribution with mean and variances computed from the D_{min} values of the canonical motifs. We report this pvalue as the statistical significance of the noncanonical motif.
Finally, we reported the noncanonical motifs that pass a 5% false discovery rate threshold in BenjaminiHochberg procedure [4].
ChIPSeq and regulatory sequence data
We collected the ChIPSeq data from Cistrome DB [17, 38] (Batch download for Human_Factor) and regulatory sequence annotations based on ENCODE [6, 8] and Roadmap Epigenomics [27] from Cao et al.’s [5] recent study. For promoters, we downloaded from UCSC [11] the sequences 1000 bases upstream of annotated transcription starts of RefSeq genes with annotated 5′ UTRs.
Computational validation: enrichment of noncanonical motifs in ChIPseq data
We computed the enrichment of noncanonical and canonical motifs in Cistrome ChIPseq data using the following three control data: (i) shuffled versions of Lmers constituting the motifs, (ii) dinucleotide shuffled versions of ChIPpeaks, and (iii) randomly selected genomic sequences matched for length, GCcontent, and repeat content (using gkmSVM [9]).
For the first analysis, for a given motif and a ChIPseq dataset, we define the enrichment e(m) of a motif m as follows.
$$e(m)=\frac{1}{\leftD\right}\sum \limits_{d\in D}\;\frac{1}{\leftL\right}\sum \limits_{l\in L}\kern0.24em \frac{n\left(l,d\right)}{n\left(l\hbox{'},d\right)}$$
where,
D is the set of ChIPseq datasets of the corresponding TF,
L is the set of Lmers constituting the motif,
l′ is the shuffled sequence of a constituent Lmer l of the motif m,
n(l, d) and n(l′, d) are the number of times l and l′ occur in the ChIPpeaks of a dataset d, respectively.
In other words, for each motif, we first take the mean ratio of the number of times its constituent Lmers and their shuffled sequences occur in the ChIPpeaks (we considered a pseudocount of 1). Then, we take the mean of the above statistic over all datasets of the corresponding TF.
For the other two analyses, for a given motif and a ChIPseq dataset, we define the enrichment e(m) of a motif m as follows.
$$e(m)=\frac{1}{\leftD\right}\sum \limits_{d\in D}\;\frac{1}{\leftL\right}\sum \limits_{l\in L}\;\frac{n\left(l,d\right)}{n\left(l,C\right)}$$
where,
D is the set of ChIPseq datasets of the corresponding TF,
L is the set of Lmers constituting the motif,
n(l, d) and n(l, C) are the number of times an Lmer l occurs in the ChIPpeaks of a dataset d and the corresponding control dataset C, respectively.
In other words, for each motif, we first take the mean ratio of the number of times its constituent Lmers occur in the ChIPpeaks compared to the control sequences (we considered a pseudocount of 1). Then, we take the mean of the above statistic over all datasets of the corresponding TF.
We show the results in Supplementary Figure 1 and discuss in Supplementary Text 2. To make the comparisons clear between noncanonical and canonical motifs, we have plotted the e(m) value of each noncanonical motif against the mean e(m) value of all canonical motifs of that TF.
Computational validation: checking for the existence of noncanonical motifs in a separate HTSELEX data
As a second computational validation, we analyzed the HTSELEX dataset of Yin et al. [36]. This dataset includes DNAbinding data of fulllength TFs and extended DNA binding domains for 28 of the 31 TFs that have noncanonical motifs in the Yang et al. dataset [35]. For each noncanonical motif, we computed its constituent Lmers’ enrichments in the HTSELEX round that Yin et al. used to derive motifs compared to the first round of that dataset. We eliminated all Lmers with enrichment less than 1, and applied our clustering algorithms discussed above. We then scrutinized the resulting motifs for similarity with the noncanonical motifs discovered from Yang et al. data [35].
Evolutionary conservation analysis
For each TF, we compared the 7way (human and six vertebrates) phyloP scores [24] between its canonical and noncanonical motif occurrences within its ChIPSeq peaks (Cistrome datasets [17, 38]). For noncanonical motif occurrences, we count only the occurrences within peaks that lack occurrences of the TF’s canonical motifs. For each matching sequence in these two groups, we computed its mean phyloP score from the basewise scores and performed a two sample KolmogorovSmirnov test on the two groups. We then computed the fraction of datasets per noncanonical motif where the two groups do not have a significantly different level of phyloP scores (two sample KS test pvalue > 0.01).