Diversity and Complexity in DNA Recognition by Transcription Factors
1,*Gwenael Badis, 4,7,*Michael F. Berger, 4,6,7,*Anthony A. Philippakis,
1,2,*Shaheynoor Talukder, 4,*Andrew R. Gehrke, 4,*Savina A. Jaeger,
2,*Esther T. Chan, 8Genita Metzler, 10Anastasia Vedenko, 1Xiaoyu Chen,
8Hanna Kuznetsov, 9Chi-Fong Wang, 1David Coburn, 4Daniel Newburger,
1-3Quaid Morris, 1,2Timothy R. Hughes, 4-7Martha L. Bulyk
- 1Banting and Best Department of Medical Research, and Departments of
2Molecular Genetics and 3Computer Science, University of Toronto,
160 College St., Toronto, ON, Canada M5S 3E1.
- 4Division of Genetics, Department of Medicine; 5Department of Pathology;
Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115.
- 6Harvard/MIT Division of Health Sciences and Technology (HST); Harvard
Medical School, Boston, MA 02115.
- 7Committee on Higher Degrees in Biophysics, Harvard University,
Cambridge, MA 02138.
- 8Department of Biology and 9Department of Physics, Massachusetts
Institute of Technology, Cambridge, MA 02139.
- 10Department of Biology, Wellesley College, Wellesley, MA 02481.
*These authors contributed equally to this work.
Correspondence should be addressed to Timothy R. Hughes and Martha L. Bulyk.
Abstract
Sequence preferences of DNA-binding proteins are a primary mechanism by
which cells interpret the genome. Despite these proteins' central
importance in physiology, development, and evolution, comprehensive
DNA-binding specificities have been determined experimentally for few
proteins. Here, we used microarrays containing all 10-base-pair sequences to
examine the binding specificities of 104 distinct mouse DNA-binding proteins
representing 22 structural classes. Our results reveal a complex landscape of
binding, with virtually every protein analyzed possessing unique preferences.
Roughly half of the proteins each recognized multiple distinctly different
sequence motifs, challenging our molecular understanding of how proteins
interact with their DNA binding sites. This complexity in DNA recognition may
be important in gene regulation and in evolution of transcriptional
regulatory networks.
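To give a sense of scale for "all 10-base-pair sequences", the following is a minimal sketch that enumerates every k-mer over the DNA alphabet and counts how many remain distinct once a sequence and its reverse complement are treated as the same double-stranded site. The function names are hypothetical and chosen for illustration only; this is not the array design procedure used in the study, which this page does not describe.

```python
from itertools import product

# Translation table mapping each base to its Watson-Crick complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (uppercase ACGT)."""
    return seq.translate(COMPLEMENT)[::-1]


def all_kmers(k):
    """Yield every DNA sequence of length k in lexicographic order."""
    for letters in product("ACGT", repeat=k):
        yield "".join(letters)


def count_kmers(k):
    """Return (total, canonical) k-mer counts.

    'canonical' counts each sequence/reverse-complement pair once,
    which is the relevant number for double-stranded DNA.
    """
    total = 0
    canonical = 0
    for kmer in all_kmers(k):
        total += 1
        # Keep only the lexicographically smaller member of each pair;
        # self-complementary (palindromic) k-mers are counted once.
        if kmer <= reverse_complement(kmer):
            canonical += 1
    return total, canonical


if __name__ == "__main__":
    print(count_kmers(10))  # (1048576, 524800)
```

For k = 10 there are 4^10 = 1,048,576 sequences, of which 4^5 = 1,024 are their own reverse complement, leaving 524,800 distinct double-stranded 10-mers.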
The experimental results are available in an online database.
Supplementary Materials
Supplementary Methods
Available for download as a PDF.
Figures
- All Supplementary Figures:
- All supplementary figures combined into one file.
- Figure S1:
- Cloning strategy.
- Figure S2:
- Comparison of PBM data for DBD versus full-length constructs for 5
TFs. (A) Motif logo comparisons and k-mer correlations, and
(B) k-mer PBM enrichment score scatter plots.
- Figure S3:
- Comparison of TFs overexpressed and purified from E. coli
versus expressed by in vitro transcription and translation.
(A) Motif logos, and (B) k-mer correlation plots, from
PBM experiments.
- Figure S4:
- PBM data reproducibility. (A)-(D) Clustergram of k-mers for
all PBM data prior to combining data from array designs #1 and #2,
showing that array designs #1 and #2 cluster together for each protein.
(E) Reproducibility of E-scores and Z-scores from array designs
#1 and #2.
- Figure S5:
- Agreement of PBM k-mer data with prior motif data, in general.
- Figure S6:
- Comparison of PBM data versus Kd data for the yeast TF Cbf1 and the
murine/human TF Max.
- Figure S7:
- Confirmation of PBM-derived motifs by EMSAs for three newly
characterized proteins and one recently characterized protein.
- Figure S8:
- Binding profiles of specific TF DBD structural classes. (A)
HMG/SOX, (B) AP-2, (C) ARID/BRIGHT, (D) bZIP,
(E) ZnF_C4, (F) E2F, (G) ETS, (H) Forkhead,
(I) GATA, (J) HLH, (K) homeodomain, (L)
IRF, (M) RFX, (N) SAND DNA-binding domains.
- Figure S9:
- Confirmation of secondary motifs by EMSAs for 6 TFs: Hnf4a, Nkx3.1,
Mybl1, Foxj3, Rfxdc2 and Myb.
- Figure S10:
- Primary, secondary, and tertiary Seed-and-Wobble motifs identified in
PBM data for the human POU homeodomain TF Oct-1.
- Figure S11:
- High-scoring k-mers belonging to the Jundm2 secondary motif are not
bound as well by the related bZIP protein Atf1.
- Figure S12:
- RFX protein-DNA recognition positions.
- Figure S13:
- Graphs showing log10(1-AUC) (area under ROC curve) (y-axis)
versus log10(number of positives) (x-axis) for Hnf4a.
- Figure S14:
- Enrichment of primary versus secondary motif 8-mers bound in vitro
within genomic regions bound in vivo for (A, C, D) Hnf4a and
(B, E, F) Bcl6b.
- Logos for all motifs generated from each array design separately and
from the combined array data.
- Statistical performance plots show that a multiple motif model best
captures the binding profiles for most TFs.
- Shuffled simulated 14bp motifs
- Shuffled simulated 14bp motifs: Seed-and-Wobble primary motifs
- Shuffled simulated 14bp motifs: Seed-and-Wobble secondary motifs
- Simulated 14bp motifs
- Simulated 14bp motifs: Seed-and-Wobble primary motifs
- Simulated 14bp motifs: Seed-and-Wobble secondary motifs
- All PBM-derived PWMs (.tar.gz compressed file, ~323 KB)
- All PBM-derived PWMs (.zip compressed file, ~1.3 MB)
Tables
- Table S1:
- Number of proteins in each different TF DBD structural class that
exists in the mouse genome, and the number of those that were examined
in this paper.
- Table S2:
- TF clones, sequences, and approximate concentrations used in
PBMs.
- Table S3:
- Comparison of PBM k-mer data to JASPAR, TRANSFAC, and
literature-derived motifs (AUC ≥ 0.8 and Q ≤ 0.01).
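Figure S13 and Table S3 both report AUC (area under the ROC curve). As a minimal sketch of what that statistic measures, the code below computes AUC as the fraction of positive/negative pairs in which the positive item receives the higher score, with ties counted as half. The function name, inputs, and example values are assumptions for illustration; the scoring and thresholds actually used are described in the Supplementary Methods, not here.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: numeric score per item (higher = more likely positive).
    labels: truthy for positives, falsy for negatives.
    Equivalent to the normalized Mann-Whitney U statistic.
    """
    positives = [s for s, is_pos in zip(scores, labels) if is_pos]
    negatives = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical scores: two positives, two negatives.
    print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
    print(roc_auc([0.9, 0.1, 0.8, 0.3], [1, 1, 0, 0]))  # 0.5
```

The pairwise loop is quadratic in the number of items, so a rank-based formulation would be preferable for large k-mer sets. Note also that a quantity such as log10(1-AUC) is undefined when AUC equals exactly 1.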