Supplementary Materials for 104 Paper

Diversity and Complexity in DNA Recognition by Transcription Factors

1,*Gwenael Badis, 4,7,*Michael F. Berger, 4,6,7,*Anthony A. Philippakis, 1,2,*Shaheynoor Talukder, 4,*Andrew R. Gehrke, 4,*Savina A. Jaeger, 2,*Esther T. Chan, 8Genita Metzler, 10Anastasia Vedenko, 1Xiaoyu Chen, 8Hanna Kuznetsov, 9Chi-Fong Wang, 1David Coburn, 4Daniel Newburger, 1-3Quaid Morris, 1,2Timothy R. Hughes, 4-7Martha L. Bulyk

1Banting and Best Department of Medical Research, and Departments of 2Molecular Genetics and 3Computer Science, University of Toronto, 160 College St., Toronto, ON, Canada M5S 3E1.
4Division of Genetics, Department of Medicine; 5Department of Pathology; Brigham and Women’s Hospital and Harvard Medical School, Boston, MA 02115.
6Harvard/MIT Division of Health Sciences and Technology (HST); Harvard Medical School, Boston, MA 02115.
7Committee on Higher Degrees in Biophysics, Harvard University, Cambridge, MA 02138.
8Department of Biology and 9Department of Physics, Massachusetts Institute of Technology, Cambridge, MA 02139.
10Department of Biology, Wellesley College, Wellesley, MA 02481.

*These authors contributed equally to this work.

Correspondence should be addressed to Timothy R. Hughes and Martha L. Bulyk.

Abstract

Sequence preferences of DNA-binding proteins are a primary mechanism by which cells interpret the genome. Despite these proteins' central importance in physiology, development, and evolution, comprehensive DNA-binding specificities have been determined experimentally for few proteins. Here, we used microarrays containing all 10-base-pair sequences to examine the binding specificities of 104 distinct mouse DNA-binding proteins representing 22 structural classes. Our results reveal a complex landscape of binding, with virtually every protein analyzed possessing unique preferences. Roughly half of the proteins each recognized multiple distinctly different sequence motifs, challenging our molecular understanding of how proteins interact with their DNA binding sites. This complexity in DNA recognition may be important in gene regulation and in evolution of transcriptional regulatory networks.

Click here to visit the online database of experimental results

Supplementary Materials

Supplementary Methods

Click here to download PDF.

Figures

All Supplementary Figures:: All supplementary figures combined into one file.
Figure S1:: Cloning strategy.
Figure S2:: Comparison of PBM data for DBD versus full-length constructs for 5 TFs. (A) Motif logo comparisons and k-mer correlations, and (B) k-mer PBM enrichment score scatter plots.
Figure S3:: Comparison of TFs overexpressed and purified from E. coli versus expressed by in vitro transcription and translation. (A) Motif logos, and (B) k-mer correlation plots, from PBM experiments.
Figure S4:: PBM data reproducibility. (A)-(D) Clustergram of k-mers for all PBM data prior to combining data from array designs #1 and #2, showing that array designs #1 and #2 cluster together for each protein. (E) Reproducibility of E-scores and Z-scores from array designs #1 and #2.
Figure S5:: Agreement of PBM k-mer data with prior motif data, in general.
Figure S6:: Comparison of PBM data versus Kd data for the yeast TF Cbf1 and the murine/human TF Max.
Figure S7:: Confirmation of PBM-derived motifs by EMSAs for three newly characterized proteins and one recently characterized protein.
Figure S8:: Binding profiles of specific TF DBD structural classes. (A) HMG/SOX, (B) AP-2, (C) ARID/BRIGHT, (D) bZIP, (E) ZnF_C4, (F) E2F, (G) ETS, (H) Forkhead, (I) GATA, (J) HLH, (K) homeodomain, (L) IRF, (M) RFX, (N) SAND DNA-binding domains.
Figure S9:: Confirmation of secondary motifs by EMSAs for 6 TFs: Hnf4a, Nkx3.1, Mybl1, Foxj3, Rfxdc2 and Myb.
Figure S10:: Primary, secondary, and tertiary Seed-and-Wobble motifs identified in PBM data for the human POU homeodomain TF Oct-1.
Figure S11:: High-scoring k-mers belonging to the Jundm2 secondary motif are not bound as well by the related bZIP protein Atf1.
Figure S12:: RFX protein-DNA recognition positions.
Figure S13:: Graphs showing log10(1-AUC) (area under ROC curve) (y-axis) versus log10(number of positives) (x-axis) for Hnf4a.
Figure S14:: Enrichment of primary versus secondary motif 8-mers bound in vitro within genomic regions bound in vivo for (A, C, D) Hnf4a and (B, E, F) Bcl6b.

Logos for all motifs generated from each array design separately and from the combined array data.

Statistical performance plots show that a multiple motif model best captures the binding profiles for most TFs.

Shuffled simulated 14bp motifs

Shuffled simulated 14bp motifs Seed-and-Wobble-primary motifs

Shuffled simulated 14bp motifs Seed-and-Wobble-secondary motifs

Simulated 14bp motifs

Simulated 14bp motifs Seed-and-Wobble-primary motifs

Simulated 14bp motifs Seed-and-Wobble-secondary motifs

All PBM-derived PWMs (.tar.gz compressed file, ~323 KB)

All PBM-derived PWMs (.zip compressed file, ~1.3 MB)

Tables

Table S1:: Number of proteins in each different TF DBD structural class that exists in the mouse genome, and the number of those that were examined in this paper.
Table S2:: TF clones, sequences, and approximate concentrations used in PBMs.
Table S3:: Comparison of PBM k-mer data to JASPAR, TRANSFAC, and literature-derived motifs (AUC ≥ 0.8 and Q ≤ 0.01).