ChIP-seq data analysis: from quality check to motif discovery and more

Lausanne, 27 April - 1 May 2015

Data reproduction exercise: Variation in Drosophila transcription factor binding sites.

Author: Philipp Bucher

Introduction

This exercise is based on the paper by Spivakov et al. (2012). The authors analyze cross-species and intra-species variation of transcription factor binding sites identified in previous studies. We will try to reproduce some of the results concerning binding sites for the Drosophila transcription factors Twist, Binou and Tinman reported in: These binding sites were identified by scanning ChIP-on-chip peak regions with newly derived PWMs for the transcription factor under investigation.

Exercise

We will try to reproduce the results shown in Figure 1a,b of (Spivakov et al. 2012).

Have a look at the Figure legend and the Methods section of the corresponding paper. However, you are not asked to follow the authors' data analysis protocol precisely. Just use the relevant sequence conservation and SNP tracks from the ChIP-Seq server to produce figures of the same kind. Once you have the results, you may ask yourself whether you agree with the interpretation by the authors:

The genomic coordinates upon which this analysis is based are given in supplementary Table S3. Note that Figure 1a is based on PhastCons scores. Higher-resolution pictures could potentially be obtained using the PhyloP track, which was probably not yet available when the paper was published.

Hints and recipes

The supplementary Table S3 is a tab-delimited text file which provides binding site coordinates for several transcription factors in the following format. The genomic coordinates refer to the D. melanogaster genome assembly dm3. The file contains binding sites for 10 different transcription factors identified by the codes: bin_ef, cnc_disc1, h_known1, hb_disc1, hkb_known2, mod_disc3, tin_ef, trx_disc1 and twi_ef. We are only interested in twi_ef (Twist), bin_ef (Binou) and tin_ef (Tinman).

Several editing operations need to be applied to transform the lines of this table into a valid BED file that can be uploaded to the ChIP-Seq server:

The following R code could be used to extract appropriate binding site lists. The BED files produced in this way can be uploaded to ChIP-Convert, conversion options: All other switches should be left blank.
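The extraction step can also be sketched in Python. This is not the recipe from the course, only an illustration of the kind of editing involved. The input column layout used here (factor code, chromosome, 1-based start, inclusive end, strand) is an assumption made for the example, as are the function name table_to_bed and the toy input lines; check them against the real Table S3 before use.

```python
import csv
import io

# Factor codes named in the exercise text.
FACTORS = {"twi_ef": "Twist", "bin_ef": "Binou", "tin_ef": "Tinman"}


def table_to_bed(lines, factor):
    """Return BED6 lines for one factor code.

    ASSUMED (hypothetical) input columns, tab-separated:
    factor code, chromosome, start (1-based), end (inclusive), strand.
    """
    bed = []
    for row in csv.reader(lines, delimiter="\t"):
        # Skip short lines, header lines and other factors.
        if len(row) < 5 or row[0] != factor:
            continue
        code, chrom, start, end, strand = row[:5]
        # UCSC-style assemblies such as dm3 use names like "chr2L".
        if not chrom.startswith("chr"):
            chrom = "chr" + chrom
        # BED starts are 0-based; BED ends are exclusive, so an
        # inclusive 1-based end is kept unchanged.
        bed.append("\t".join([chrom, str(int(start) - 1), end, code, "1", strand]))
    return bed


# Toy input, invented for illustration only.
example = io.StringIO(
    "twi_ef\t2L\t1001\t1012\t+\n"
    "hb_disc1\t2L\t2001\t2010\t-\n"
    "tin_ef\tchr3R\t501\t508\t-\n"
)
rows = example.read().splitlines()
for factor in FACTORS:
    print(factor, table_to_bed(rows, factor))
```

Writing one BED file per factor keeps the three site lists separate, so that each can be uploaded and analyzed on its own.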

From the results page, you can directly navigate to the ChIP-Cor server. Since the binding motifs are asymmetric and the binding sites may occur in either orientation, you should specify "strand oriented" for the reference feature. The count cut-off can be set to 10 throughout, since all relevant tracks have count values of at most 10.
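To see why the orientation setting matters, here is a minimal Python sketch of a strand-oriented aggregation. It is not the ChIP-Cor implementation. The function name oriented_profile, the toy scores and the treatment of the count cut-off as a simple cap on each value are assumptions made for illustration.

```python
def oriented_profile(sites, scores, flank, cutoff=10):
    """Average per-base score at offsets -flank..+flank around oriented sites.

    sites  : list of (position, strand) tuples, strand is "+" or "-"
    scores : dict mapping genomic position -> score (missing positions count as 0)
    cutoff : values above this are capped (ASSUMED role of the count cut-off)
    """
    if not sites:
        raise ValueError("at least one site is required")
    offsets = range(-flank, flank + 1)
    totals = {off: 0.0 for off in offsets}
    for pos, strand in sites:
        sign = 1 if strand == "+" else -1
        for off in offsets:
            # For a minus-strand site the window is mirrored, so
            # offset +1 lies at the lower genomic coordinate.
            value = scores.get(pos + sign * off, 0)
            totals[off] += min(value, cutoff)
    return {off: totals[off] / len(sites) for off in offsets}


# Toy data, invented for illustration only.
scores = {99: 2, 100: 10, 101: 4, 199: 6, 200: 10, 201: 12}
sites = [(100, "+"), (200, "-")]
profile = oriented_profile(sites, scores, flank=1)
print(profile)
```

Without the mirroring step, positions upstream and downstream of the motif would be mixed for sites on opposite strands, which would blur any asymmetric conservation pattern.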

Relevant target feature tracks for this exercise are: