Chip-seq data analysis: from quality check to motif discovery and more

Lausanne, 27 April

Exercises with position weight matrices (PWMs)

Author: Philipp Bucher

Exercise 1: High resolution aggregation plots for bound PWM matches

Carry out part 7 of Tutorial 3 pdf posted on the ChIP-seq server website.

In the recipe given there your are asked to generate a control set of random STAT1 motifs from regions far away from an in vivo occupied STAT1 binding sites. Most of these sites are presumably non-occupied but you cannot be sure.

Try to compile a true negative set with the aid of PWMscan. Use may use for this purpose either the MEME-matrix generated in part 5 of Tutorial 3 (see below) or the STAT1 matrix from JASPAR.

Use PWMScan with P-value cut-off 0.0001 as before. The match list will be rather large. From there navigate directly to ChIP-cor in order to select PWM matches that have reasonably hight coverage by STAT1 ChIP-seq tags or no coverage at all.

Important or recommended ChIP-Cor parameters:

Other parameters are not important at this stage. On the ChIP-cor results page, use "Enriched Feature Extraction Option" to extract STAT1 bound motif using the following settings.

Save the resulting file under the name given below. To extract unbound motif matches, use the following parameters: The saved FPS files are posted below for your convenience. From here, continue as in Tutorial 3, part 7. You will see some differences as compared to the previous results. In particular, the PhyloP profile for unbound motifs shows a weak peak in the motif region, indicating that a small fraction of these unbound sites are conserved accross vertebrate genomes. There may be several explanations for this observation:

Exercise 2: Compare alternative PWMs for the same transcription factors

For many transcription factors, there are several PWMs available on our server. In some cases, these matrices correspond to primary and secondary motifs extracted by a motif discovery programs from high-throughput data.

Compare the performance of these alternative PWMs with the aid of motif enrichment plots. Use peak lists available as server-resident files for this purpose. We propose the following examples:

Factor Peak files from ENCODE ChIP-seq-peak
Uniform TFBS from UCSC
Position weight matrices
GM12878 RXRA None HudsonAlpha peaks
H1-hESC RXRA None HudsonAlpha peaks
HepG2 RXRA   None HudsonAlpha peaks
JASPAR CORE 2014: MA0512.1 Rxra:
Jolma2013:        Rxra_nuclearreceptor_DBD dimeric
Mouse UniPROBE:   UP00053_1 Rxra_primary
Mouse UniPROBE:   UP00053_2 Rxra_secondary
HOCOMOCO:         RXRA_f1
K562    eGFP-JunD None UChicago peaks
GM12878 JunD None Yale peaks
H1-hESC JunD None HudsonAlpha peaks
H1-hESC JunD None Stanford peaks
HeLa-S3 JunD None Stanford peaks
HepG2   JunD None HudsonAlpha peaks
HepG2   JunD None Stanford peaks
K562    JunD None Stanford peaks
Mouse UniPROBE:   UP00103_1 Jundm2_primary
Mouse UniPROBE:   UP00103_2 Jundm2_secondary
HOCOMOCO:         JUND_f1

Save the text files containing the numerical values of the motif occurrence profiles to disk and make composite figures to show the respective motif enrichment for different PWMs.


Exercise 3: Site occupancy as a function of PWM score

How many predicted sites (PWM matches) in the genome are actually occupied in a particular cell type? To get an answer to this question you may proceed as follows. Go to PWMScan and choose: Then navigate directly to the ChIP-Cor server to annotate the predicted motif matches with raw counts. Chose an ENCODE ChIP-seq data set for this purpose, e.g. Choose strand option oriented for the Reference feature.

From the results page, use the "Enriched Feature Extraction Option" to annotate the matches (From -100 to + 100, Threshold 0, Cut-Off 10, Ref Feature Oriented ON). Save the annotated SGA file under the name:

The lines of this files contain the following fields, see example below:
   NC_000001.10  ChIP_R  138296  +  1  AGGCCACTGGGAGGCAGGA  1181  CTCF=0
   NC_000001.10  ChIP_R  138971  +  1  AGGCCACCAGGAGGCAGCA  1910  CTCF=7
   NC_000001.10  ChIP_R  139230  +  1  AGGCCACTGGGTGGCAGGA  1447  CTCF=1
   NC_000001.10  ChIP_R  141074  +  1  TGAACTCCAGGGGGAAGAG  1231  CTCF=0
   NC_000001.10  ChIP_R  237750  +  1  GCAGCACCAGGTGGCAGCA  1412  CTCF=53
   NC_000001.10  ChIP_R  251318  -  1  CCAACAGCAGGTGGCAGCC  1566  CTCF=0
Note the score is in the 7th field, the CTCF count coverage in the 8th field.

We will first make a plot showing the average number of tags as a function of the PWM score.

R code: Show Hide

Next we will make a plot showing the percentage of occupied sites as a function of the score. Based on the previous Figure, we consider a PWM match as occupied if it has a coverage of at least 10 tags.

R code: Show Hide

We observe that most predicted CTCF sites are in fact occupied in vivo and that the coverage strongly correlates with the PWM score. Is this true for other TFs as well? To get a broader picture, run the same type of analysis for other transcription factor, starting with STAT1.