ChIP-seq coverage analysis

#######
UPDATE
#######

It has been a long time since I wrote this post, and in between a wonderfull set of tools for coverage calculation and visualization has been published: deepTools does metagene analysis, and much, much more in a better than I could possibly do with my näive scripts.

Try it. It is worth it.
#######

There a few analysis tasks for which one would think a solution, or an easy to follow example, is readily available in the web (it has been a long time since the Web became the Internet). One such example is how to do coverage analysis of ChIP-seq, or any other data for that matter. That is, how are the reads distributed around a region of interest, commonly the TSS?

HTSeq, the python library for NGS data analysis offers that, but I prefer to do my plots in R and ggplot. I also find bioconductor nice but some of the objects and the packages literature hard to understand for someone looking for imminently practical information.  So I decided to ask around in my institute and create my own script. I like to do scripts that can be easily modified and also that generate plots that can be used with little or now change for presentations.

Create the bed file with TSSs:

Firstly, how to create the BED file with the TSSs using a combination of bioconductor and bedtools – it is quite fast and it very-well documented. This only needs to be ran once and there is a myriad of ways to do it. Of course one could also use any bed file with features of interest.


######################
### Load libraries ###
library(GenomicRanges)
library(GenomicFeatures)
library(plyr)
library(reshape)
library(data.table)

### custom functions ##
# string concatenation without space gaps
concat <- function(...) paste(..., sep="")

################################
## create TSS bed file (UCSC) ##
# create a database of transcripts
hg19RefGenes <- makeTranscriptDbFromUCSC(genome = "hg19", tablename = "refGene") # run only once
hg19RefGenes # summary of the object
saveDb(hg19RefGenes, file='~/Homo_sapiens/hg19_RefSeq_genes.sqlite')
# by saving the database one keeps a record of which version of the genome was used for the analysis. Very important for reproducibility.
hg19RefGenes <- loadDb(file="~/Homo_sapiens/hg19_RefSeq_genes.sqlite") # not necessary of useful if the the script need to be ran again later

# create an object with a list of transcripts grouped by gene
transcripts <- transcriptsBy(hg19RefGenes, by="gene")

# convert to data frame:
gene2tx <- as.data.frame(transcripts)

# Note that transcription_start is always smaller than transcription_end, even when the transcript is on the “−” strand. Hence, we have to use either the start or the end coordinate of the transcript, depending on the strand, to get the actual transcription start sites, i.e., the 5’ ends of the transcripts:
#http://www.bioconductor.org/help/course-materials/2009/EMBLJune09/Practicals/TSS/TSS_plot.pdf
tss <- ifelse(gene2tx$strand == "+", gene2tx$start, gene2tx$end)

tss.bed <- data.frame(seqnames = gene2tx$seqnames, start = tss, end= tss, strand=gene2tx$strand, transcriptID=gene2tx$tx_name, type=rep("tss", length(tss)))

## just keep proper chromosomes
tss.bed <- subset(tss.bed, str_detect(seqnames, "^[[:alnum:]]*$"))
with(tss.bed, as.data.frame(table(seqnames)))    # list the frequency of genes per chromosome

# save bed file
write.table(tss.bed, file = "~/Homo_sapiens/hg19RefSeqTSS.bed", row.names = FALSE,
col.names = FALSE, quote = FALSE, sep="\t")

## use bedtools to increase the region around the tss
system('
chromInfo=~/Homo_sapiens/UCSC/hg19/Annotation/Genes/ChromInfo.txt

bedtools slop -i ~/Homo_sapiens/hg19RefSeqTSS.bed -g $chromInfo -b 1000 > ~/Homo_sapiens/hg19RefSeqTSS_1000kb.bed
')

Counting the number of reads per nucleotide

Because I run over several bam files and I like to keep the file names tidy I’ve wrapped the bedtools function in a shell function so that the several analysis can ran in parallel to save time. Because this step takes a of time, it sends me an email at the end. This way I can do something else without being constantly looking at it. Hint: running it in  a screen session also saves a lot of headaches. If you are doing for a single bam file, or are not versed in shell scripting all you need is to run this:

coverageBed -d -abam bamFile -b geneModel > name.cov.bed

But in all likelihood you will have several bam files:

#################################################################################
# This little shell script calculates coverage per base at any given annotation #

# wrapper shell function that does the coverage calculations and returns the files with nice names
coverageAtFeatures(){
# the input is a bam file that comes from sdout (see bellow)
bamFile=$1
name=$(basename ${bamFile%%.bam}) # to keep the directory tidy I rename the results files using the original bam file
geneModel=$2 # also from sdout
outDir=results/coverage # defines the output directory
coverageBed -d -abam $bamFile -b $geneModel > $outDir/$name.cov.bed # actually performs the coverage calculation.
# note the -d, very important!

}
export -f coverageAtFeatures # this add the function to the shell to be used

# Features:
promoters=~/Homo_sapiens/hg19RefSeqTSS_1000kb.bed
# bam folder:
bam_folder=~/chip_seq/data/reads/

# now we can run the function defined above:
# ls will list all the bam files in the folder and pass them to parallel.
#Each bam file will the first argument of coverageAtFeatures
# the second argument will be the bed file with the genomic regions of interest.
# Parallel will run the analysis for 4 bam simultaneously - this # can vary depending on your computer.
parallel  -i -j 4 bash -c "coverageAtFeatures {} $promoters" --  $(ls $bam_folder/*bam)"

# when all is done a message will arrive in your inbox.
echo "Subject:Coverage done" |  sendmail -v myname@mymail.com > /dev/null

At the end of this we will have a file that contains something like:

head chip_experiment.cov.bed
chr1	19922471	19924471	+	NM_001204088	tss	1	1
chr1	19922471	19924471	+	NM_001204088	tss	2	1
chr1	19922471	19924471	+	NM_001204088	tss	3	1
chr1	19922471	19924471	+	NM_001204088	tss	4	1
chr1	19922471	19924471	+	NM_001204088	tss	5	1
chr1	19922471	19924471	+	NM_001204088	tss	6	0
chr1	19922471	19924471	+	NM_001204088	tss	7	0
chr1	19922471	19924471	+	NM_001204088	tss	8	0
chr1	19922471	19924471	+	NM_001204088	tss	9	0
chr1	19922471	19924471	+	NM_001204088	tss	10	0

Col1-6 are the original entries of the bed file containing the TSSs, followed by the position of the base in the feature (remember this is a bp-resolution coverage) and the number of reads that mapped to the base.

Now we are ready calculate some statistics on own many reads map to each base surrounding the TSSs and plot these. In the next snippet I’ll calculate the sum of reads per base but calculating the average requires a minor change to the script.  Since several of the operations take some time, I use the package doMC save time but this should work without it.

############################################
## function to calculate coverage profile ##

## list coverage files
path="./"
cov.files <- list.files(path=path, pattern=".cov.bed")

## for multicore usage:
library(doMC)
registerDoMC(cores=10) # yay for our server

halfWindow=1000 # this half the size of the window around the TSSs

loadCovMat <- function(file){
# read files
cov.data <- read.table(concat(path, file), header=FALSE, sep = "\t", stringsAsFactors = FALSE)

#     http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r
#     cov.data <- fread(file)    # not used because strand not recognized
names(cov.data) <- c("chr", "start", "end", "strand", "ID", "type", "position", "count")

# create a new ID to avoid repeated gene names - common issue with UCSC annotations
cov.data.n <- transform(cov.data, newID = paste(ID, start, sep="_"))
# add base position with reference to the TSS.
    # very important: not the strand operation.
cov.algned <- mutate(cov.data.n, position_aligned=position-halfWindow,
                  position_str_cor=ifelse(strand=="+", position_aligned, -1*position_aligned))

# for the next operation R needs a lot of memory so I'll remove objs from memory
rm("cov.data", "cov.data.n", "cov.data.t")

# counts the number of reads per position - the final result!
    res <- ddply(cov.algned, .(position_str_cor), summarize, sum=sum(count), .parallel=T)

## get condition name
    # important to have column with the name of the experiment when running multiple analysis
sample <- strsplit(file, "\\.")[[1]][1]
res$condition <- rep(sample, nrow(res))

return(res)
}

## Run the function over all coverage files
# the output is a list of data frames
cov.l <- lapply(cov.files, loadCovMat)

## see the data
lapply(cov.l, head)

## save the coverage data
lapply(cov.l, function(x){
                          write.table(x,
                                      concat(unique(x$condition), ".sum.cov"),
                                      row.names = FALSE,
                                      quote = FALSE, sep="\t")})

# join the data frame in he list in one big data frame for plotting
cov <- do.call("rbind", cov.l)

## 2 different plots:
p.cov.all <- ggplot(data=cov, aes(x=position_str_cor, y=sum, color=condition)) + geom_line()
ggsave(filename = "./plots/coverage_tss_all.pdf", p.cov.all)

p.cov.wrap <- p.cov.all + facet_wrap(~condition)
ggsave(filename = "./plots/coverage_tss_wrap.pdf", p.cov.wrap)

If all went well this is the final result:
Plot 1

Plot2

Al there is lacking now is putting all of this in a nice wrapper, which could be easily done, but I preferred to show the different stages of the process separated because it is easier to explain. I hope it helps.

Advertisements
This entry was posted in Uncategorized and tagged , , , . Bookmark the permalink.

One Response to ChIP-seq coverage analysis

  1. Pingback: Coverage Plots for Chip Seq | aditiqamra

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s