Chip-seq data analysis: from quality check to motif discovery and more

Lausanne, 27 April - 1 May 2015

Data reproduction exercise: Nucleosome organization around YY1 sites.

Author: Philipp Bucher


This exercise is based on the following paper: This paper contains many results covering diverse aspects of transcription factor binding sites. In this data reprroduction exercise, we focus on the part that deals with the nucleosome architectures, see results subsection: Specifically, we will try to reprocude results shown in Figure 5a and 5b.

The Figure shows average nucleosome occupancy profiles around YY1 binding sites in the lymphoblastoid cell line GM12878. The peak lists were first divided into promoter-proximal and promoter-distal binding sites and further stratified into three groups based on ChIP-seq signal strength.

The authors of the paper describe and interpret the results as follows:


Try to reproduce the above Figure with public data available from the ChIP-Seq server

Hints and recipes

Input data from human genome assembly hg19
    ENCODE ChIP-seq-peak -> Uniform TFBS from UCSC 
       sample: GM12878 YY1_(SC-281) None HudsonAlpha peaks 
    ENCODE DNase FAIRE etc. -> GSE35586, Nucleosome Position by MNase-seq ...
       sample: Nucleosome Gm12878 rep1 
    Genome Annotation -> EPD, the Human Curated Promoter Database
       sample: TSS from hg19 EPDnew rel 003 
Note: The MNase data files are very voluminous and thus take time to process (minutes rather than secons). We therefore recommend to generate the figures with one replicate only.

Making a nucleosome occupancy profile for distal YY1 sites.

In the paper, the authors write:

Got to the ChIP-Cor input form. Select the above indicated YY1 peak list as reference feature and the TSS from EPDnew as target feature (other parameters are not important at this stage). On the results page use the "Enriched Feature Extraction Option" submenu to extract YY1 sites that do or do not have an annotated TSS within 2kb. (toggle "Depleted Feature Extraction" on/off). From the next page, navigate directly to ChIP-Cor using the action button near the lower right corner. Then run ChIP-Cor with the above indicated MNase sample as target feature (strand option "all", centering distance 70).

Stratifying the YY1 peaks into three subsets by signal strength

One way of doing this is by downloading the peak SGA file from the MGA repository via the MGA overview page at:

The name of the compressed SGA file is: The 6th column of this extended SGA file contains a statistical binding site score which can be used for ranking. The three subsets can be generated in R are as follows: You can now individually upload these SGA files to the ChIP-cor server and repeat the previously outlined procedure for generate nucleosome occupancy plots.

A more efficient prodedures to reproduc Figures 5a and 5b at once

You can use ChIP-Cor to generate an extended YY1 peak file with promoter status annotation. Select the above indicated YY1 peak as reference feature and EPD promoters as target feature. Then use the "Enriched Feature Extraction Option" with Threshold=0 (all promoters will be selected) and save the resulting SGA file under the name:

This is an extended SGA files with peak scores in the 6th field and the TSS counts in the 7th field as shown below:
NC_000001.10   YY1_P   714001	0   1   133.535942201664   TSS=0
NC_000001.10   YY1_P   805331	0   1   46.6896250548417   TSS=0
NC_000001.10   YY1_P   840144   0   1	70.0879017887289   TSS=0
NC_000001.10   YY1_P   894612   0   1	244.724808404057   TSS=2
NC_000001.10   YY1_P   895930   0   1	40.8279820108751   TSS=2
You can now upload this file to the ChIP-Extract input form at: in order to extract the nucleosome occupancy data (strand any, centering 70) in count matrix format (from -2000 to +2000, window size 20). This is slow - it will take about 20 mins. But once you have the output files, you can do everything that in R very fast.

From the ChIP-Extract results page, download the "Ref SGA File" and the "Table (TEXT)" under the names:

Actually, it is not necessary to save the SGA file again since it is identical to the uploaded file. However, it is good practice to always save the two files together, in order to be sure that the sorting of the reference features is the same in the data and in the annotation file.

Below is the R recipe to generate the plots equivalent to Figure 5a and 5b in the paper.