EPD Exercises

1. Make a validated promoter collection

The first step in this exercise is to download a promoter collection from a publicly available database. The procedure is shown here for the UCSC gene annotation table, but you can also use BioMart. This is your starting point: a promoter collection that is not experimentally validated and that places the TSS with low accuracy. You will then use CAGE data to validate the promoters (selecting those that are expressed) and to shift each TSS to the nearest CAGE peak, which should be a better estimate of the true TSS location.

Now you have the promoter collection correctly formatted for analysis with the ChIP-Seq server. Open it with a text editor to familiarize yourself with the SGA format. Note that the 6th column contains the associated gene name. The next step is to get the CAGE peaks in the genome. These peaks will then be used to validate the promoters and to define the location of the TSS.
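As a quick orientation, the sketch below writes a two-line, made-up SGA sample and prints selected columns with awk. It assumes the usual tab-separated SGA layout (sequence name, feature, position, strand, count, plus an optional 6th field, here the gene name). The file name example.sga, the coordinates and the gene names are placeholders, not real output of the server.

```shell
# Made-up sample: sequence, feature, position, strand, count, gene name
printf 'chr1\tTSS\t1000\t+\t1\tGENE_A\n' >  example.sga
printf 'chr1\tTSS\t5000\t-\t1\tGENE_B\n' >> example.sga

# Print the gene name (6th column) followed by sequence, position and strand
awk -F'\t' '{ print $6, $1, $3, $4 }' example.sga
```

The same awk pattern can be pointed at your own SGA file to check that the gene names ended up in the 6th column.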

At this point in the analysis you have a promoter collection that has been validated by CAGE data, but the TSS coordinates are still the original UCSC ones, which are not very precise. To increase the TSS precision, you have to shift each TSS to the nearest base with the highest CAGE count. This is a two-step procedure: first extract the positions of the CAGE tags around each promoter, then shift the TSS to the nearest peak.
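The logic of the shift can be illustrated locally with a small awk sketch. This is not the server's implementation, only one way to express "move the TSS to the nearest base with the highest CAGE count". The file names (cage.sga, promoters.sga, promotersShifted.sga), the 100 bp search window w, the strand-specific matching and all sample values are assumptions made for the example.

```shell
# Made-up inputs (tab-separated SGA): CAGE tag counts and one promoter
printf 'chr1\tCAGE\t990\t+\t3\n'   >  cage.sga
printf 'chr1\tCAGE\t1004\t+\t12\n' >> cage.sga
printf 'chr1\tCAGE\t1010\t+\t12\n' >> cage.sga
printf 'chr1\tTSS\t1000\t+\t1\tGENE_A\n' > promoters.sga

# w is a hypothetical search window (bp) on each side of the annotated TSS
awk -F'\t' -v OFS='\t' -v w=100 '
  # First file: store CAGE counts keyed by sequence, strand and position
  NR == FNR { cage[$1 FS $4 FS $3] += $5; next }
  # Second file: scan outward from the TSS so that ties go to the nearest base
  {
    best = $3; max = 0
    for (d = 0; d <= w; d++) {
      up = $1 FS $4 FS ($3 - d); down = $1 FS $4 FS ($3 + d)
      if ((up in cage) && cage[up] > max)     { max = cage[up];   best = $3 - d }
      if ((down in cage) && cage[down] > max) { max = cage[down]; best = $3 + d }
    }
    $3 = best    # unchanged when no CAGE tag falls inside the window
    print
  }' cage.sga promoters.sga > promotersShifted.sga

cat promotersShifted.sga
```

In this sample the TSS moves from 1000 to 1004: positions 1004 and 1010 carry the same highest count, and the scan keeps the one closer to the original coordinate.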

You now have a validated promoter collection. The next exercise will show you how to check the quality of your collection and compare it to the initial UCSC annotation and to EPDnew. Please note that, after shifting, your collection is likely to be no longer properly sorted. If you want to sort it, you can use the following bash command:

sort -s -k1,1 -k3,3n -k4,4 ucscPromotersShift.sga > ucscPromotersShiftSorted.sga

Alternatively, switch on the sorting option when uploading your collection to the server.
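If you sort locally, you may want to confirm the result before uploading. The sketch below applies the same sort keys to a made-up sample and then uses sort -c to check the order. The file names unsorted.sga and sorted.sga and the sample lines are placeholders.

```shell
# Made-up unsorted sample standing in for a shifted promoter file
printf 'chr2\tTSS\t500\t+\t1\tGENE_C\n'  >  unsorted.sga
printf 'chr1\tTSS\t5000\t-\t1\tGENE_B\n' >> unsorted.sga
printf 'chr1\tTSS\t1000\t+\t1\tGENE_A\n' >> unsorted.sga

# Same keys as the command above: sequence name, position (numeric), strand
sort -s -k1,1 -k3,3n -k4,4 unsorted.sga > sorted.sga

# sort -c exits non-zero and reports the first out-of-order line, if any
if sort -c -s -k1,1 -k3,3n -k4,4 sorted.sga; then
  echo "sorted.sga is in order"
fi
```

Running the sort -c line on the unsorted file instead reports the first misplaced line, which is a quick way to see whether sorting is needed at all.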

2. Quality control of your promoter collection

In this second exercise you will perform some quality controls on your promoter collection to see whether it is any better than the original UCSC collection and to compare it to EPDnew. First you will check the motif distribution around promoters, then the histone marks.

Motif search around promoters is done using the OProf tool (part of the SSA web server). It accepts only one input file type: FPS. Before starting your analysis, you have to convert the SGA file of your final promoter list into FPS:

Now you are ready to start the motif analysis:

Now check the histone marks around the promoter collections and compare them:

Considering the motif plots and the histone mark analysis, how does your collection compare to the other two? Write a short document reporting your findings (parameters used, number of promoters you have found, ...) and the figures you just generated.

3. Promoter selection

In this exercise you will learn how to select promoters that are expressed in a particular sample and how to study them.

As an example, you will use the ENCODE data for the cell line GM12878. This is an ENCODE tier 1 cell line. As a consequence, it has been heavily studied, and data are available for almost all conditions and targets used by the consortium.

In a first step you will use ChIP-Cor to select promoters that are expressed in GM12878. You will then study the distribution of their histone marks and check their Gene Ontology annotation (using the GREAT suite).