Nerd up your R

I probably spend most of my working day in an R terminal, or at least I start R often enough. In a blog I follow I recently saw a way to entertain and instruct in equal measure every time that little statistical box is opened: fortunes. This is a very simple package that collects and displays quotes related to R (and statistics in R). Some examples (as of v1.5-2 there are 360 quotes to choose from):

Release 1.0.0
(silence)
Wow! Thank you! […] If I am allowed to ask just one question today: How do
you fit 48 hours of coding in an ordinary day? Any hints will be appreciated
… 🙂
— Detlef Steuer (on 2000-02-29)
R-help (February 2000)

or

Perhaps one is the real forge and the other is a forgery? Or a forge-R-y? I’ll
get my coat…
— Barry Rowlingson (on the question whether http://www.RForge.net/ or
http://R-Forge.R-project.org/ is the official forge server)
R-help (April 2007)

Set-up

I have the following lines in my ~/.Rprofile (think of .bashrc but for R):

library("fortunes")
fortune()

This ensures each time I start R I am greeted by a nice little message to brighten my day:

my R start screen
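A small aside: if the same ~/.Rprofile is shared across machines where fortunes might not be installed, a guarded variant avoids an error at start-up. A minimal sketch, assuming an interactive session:

if (interactive() && requireNamespace("fortunes", quietly = TRUE)) {
    print(fortunes::fortune()) # an explicit print() is needed inside the if block
}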

How to install

I had some problems installing from CRAN, so I went with the install_url function of devtools, which makes installing from source easier when one does not want to keep a local copy of the package source:

library("devtools")
install_url("http://cran.r-project.org/src/contrib/fortunes_1.5-2.tar.gz") #remember to use the link for the latest package version

And it is done!
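For completeness: when CRAN access behaves (it just did not for me at the time), the usual route works as well:

install.packages("fortunes")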


RColorBrewer palettes

I am always looking for a good-quality printout of the available palettes in the RColorBrewer package. I finally decided to take matters into my own hands and create a PDF for myself. Of course it is very easy:

library(RColorBrewer)
pdf("~/Pictures/RColorBrewerPalette.pdf")

display.brewer.all()

dev.off()
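Once a palette is picked from the printout, pulling its colours into a plot is a one-liner. A quick sketch, with the palette name and size chosen arbitrarily:

library(RColorBrewer)
my.colours <- brewer.pal(9, "Blues") # 9 colours from the sequential "Blues" palette
barplot(rep(1, 9), col = my.colours, border = NA) # quick visual check of the chosen colours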

ChIP-seq coverage analysis

#######
UPDATE
#######

It has been a long time since I wrote this post, and in the meantime a wonderful set of tools for coverage calculation and visualization has been published: deepTools does metagene analysis, and much, much more, far better than I could possibly do with my naïve scripts.

Try it. It is worth it.
#######

There are a few analysis tasks for which one would think a solution, or an easy-to-follow example, would be readily available on the web (it has been a long time since the Web became the Internet). One such task is coverage analysis of ChIP-seq data, or any other data for that matter. That is, how are the reads distributed around a region of interest, commonly the TSS (transcription start site)?

HTSeq, the Python library for NGS data analysis, offers that, but I prefer to do my plots in R with ggplot. I also find Bioconductor nice, but some of the objects and the package literature are hard to understand for someone looking for eminently practical information. So I decided to ask around in my institute and write my own script. I like to write scripts that can be easily modified and that generate plots that can be used with little or no change in presentations.

Create the bed file with TSSs:

First, how to create the BED file with the TSSs using a combination of Bioconductor and bedtools – it is quite fast and very well documented. This only needs to be run once, and there is a myriad of ways to do it. Of course one could also use any BED file with features of interest.


######################
### Load libraries ###
library(GenomicRanges)
library(GenomicFeatures)
library(plyr)
library(reshape)
library(data.table)
library(stringr) # for str_detect() used below

### custom functions ##
# string concatenation without space gaps
concat <- function(...) paste(..., sep="")

################################
## create TSS bed file (UCSC) ##
# create a database of transcripts
hg19RefGenes <- makeTranscriptDbFromUCSC(genome = "hg19", tablename = "refGene") # run only once
hg19RefGenes # summary of the object
saveDb(hg19RefGenes, file='~/Homo_sapiens/hg19_RefSeq_genes.sqlite')
# by saving the database one keeps a record of which version of the genome was used for the analysis. Very important for reproducibility.
hg19RefGenes <- loadDb(file="~/Homo_sapiens/hg19_RefSeq_genes.sqlite") # not needed right after saveDb, but this is how to reload the database if the script is run again later

# create an object with a list of transcripts grouped by gene
transcripts <- transcriptsBy(hg19RefGenes, by="gene")

# convert to data frame:
gene2tx <- as.data.frame(transcripts)

# Note that transcription_start is always smaller than transcription_end, even when the transcript is on the “−” strand. Hence, we have to use either the start or the end coordinate of the transcript, depending on the strand, to get the actual transcription start sites, i.e., the 5’ ends of the transcripts:
#http://www.bioconductor.org/help/course-materials/2009/EMBLJune09/Practicals/TSS/TSS_plot.pdf
tss <- ifelse(gene2tx$strand == "+", gene2tx$start, gene2tx$end)

tss.bed <- data.frame(seqnames = gene2tx$seqnames, start = tss, end= tss, strand=gene2tx$strand, transcriptID=gene2tx$tx_name, type=rep("tss", length(tss)))

## just keep proper chromosomes
tss.bed <- subset(tss.bed, str_detect(seqnames, "^[[:alnum:]]*$"))
with(tss.bed, as.data.frame(table(seqnames)))    # list the frequency of genes per chromosome

# save bed file
write.table(tss.bed, file = "~/Homo_sapiens/hg19RefSeqTSS.bed", row.names = FALSE,
col.names = FALSE, quote = FALSE, sep="\t")

## use bedtools to increase the region around the tss
system('
chromInfo=~/Homo_sapiens/UCSC/hg19/Annotation/Genes/ChromInfo.txt

bedtools slop -i ~/Homo_sapiens/hg19RefSeqTSS.bed -g $chromInfo -b 1000 > ~/Homo_sapiens/hg19RefSeqTSS_1000kb.bed
')
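A quick sanity check on the widened regions (a sketch; the column names simply mirror the BED file written above): nearly every interval should now be 2000 bp wide, with smaller widths only where bedtools clipped the window at a chromosome end.

tss.win <- read.table("~/Homo_sapiens/hg19RefSeqTSS_1000kb.bed", sep = "\t",
                      col.names = c("seqnames", "start", "end", "strand", "transcriptID", "type"))
table(tss.win$end - tss.win$start) # mostly 2000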

Counting the number of reads per nucleotide

Because I run this over several BAM files and like to keep the file names tidy, I’ve wrapped the bedtools command in a shell function so that the several analyses can run in parallel to save time. Because this step takes a lot of time, the script sends me an email at the end; this way I can do something else without constantly checking on it. Hint: running it in a screen session also saves a lot of headaches. If you are running it on a single BAM file, or are not versed in shell scripting, all you need to run is this:

coverageBed -d -abam bamFile -b geneModel > name.cov.bed

But in all likelihood you will have several bam files:

#################################################################################
# This little shell script calculates coverage per base at any given annotation #

# wrapper shell function that does the coverage calculations and returns the files with nice names
coverageAtFeatures(){
# the input is a bam file passed as the first argument by parallel (see below)
bamFile=$1
name=$(basename ${bamFile%%.bam}) # to keep the directory tidy I rename the results files using the original bam file
geneModel=$2 # the bed file with the regions of interest, passed as the second argument
outDir=results/coverage # defines the output directory
coverageBed -d -abam $bamFile -b $geneModel > $outDir/$name.cov.bed # actually performs the coverage calculation.
# note the -d, very important!

}
export -f coverageAtFeatures # this exports the function so the sub-shells started by parallel can use it

# Features:
promoters=~/Homo_sapiens/hg19RefSeqTSS_1000kb.bed
# bam folder:
bam_folder=~/chip_seq/data/reads/

# now we can run the function defined above:
# ls will list all the bam files in the folder and pass them to parallel.
# Each bam file will be the first argument of coverageAtFeatures;
# the second argument will be the bed file with the genomic regions of interest.
# parallel will run the analysis for 4 bam files simultaneously - this can vary depending on your computer.
parallel -i -j 4 bash -c "coverageAtFeatures {} $promoters" -- $(ls $bam_folder/*bam)

# when all is done a message will arrive in your inbox.
echo "Subject:Coverage done" |  sendmail -v myname@mymail.com > /dev/null

At the end of this we will have a file that contains something like:

head chip_experiment.cov.bed
chr1	19922471	19924471	+	NM_001204088	tss	1	1
chr1	19922471	19924471	+	NM_001204088	tss	2	1
chr1	19922471	19924471	+	NM_001204088	tss	3	1
chr1	19922471	19924471	+	NM_001204088	tss	4	1
chr1	19922471	19924471	+	NM_001204088	tss	5	1
chr1	19922471	19924471	+	NM_001204088	tss	6	0
chr1	19922471	19924471	+	NM_001204088	tss	7	0
chr1	19922471	19924471	+	NM_001204088	tss	8	0
chr1	19922471	19924471	+	NM_001204088	tss	9	0
chr1	19922471	19924471	+	NM_001204088	tss	10	0

Columns 1-6 are the original entries of the BED file containing the TSSs, followed by the position of the base within the feature (remember this is bp-resolution coverage) and the number of reads mapping to that base.

Now we are ready to calculate some statistics on how many reads map to each base surrounding the TSSs and plot them. In the next snippet I’ll calculate the sum of reads per base; calculating the average requires only a minor change to the script (see the sketch after the code). Since several of the operations take some time, I use the doMC package to save time, but this should work without it.

############################################
## function to calculate coverage profile ##

## list coverage files
path="./"
cov.files <- list.files(path=path, pattern=".cov.bed")

## for multicore usage:
library(doMC)
registerDoMC(cores=10) # yay for our server

halfWindow=1000 # this is half the size of the window around the TSSs

loadCovMat <- function(file){
# read files
cov.data <- read.table(concat(path, file), header=FALSE, sep = "\t", stringsAsFactors = FALSE)

#     http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r
#     cov.data <- fread(file)    # not used because strand not recognized
names(cov.data) <- c("chr", "start", "end", "strand", "ID", "type", "position", "count")

# create a new ID to avoid repeated gene names - common issue with UCSC annotations
cov.data.n <- transform(cov.data, newID = paste(ID, start, sep="_"))
# add base position with reference to the TSS.
    # very important: note the strand operation.
cov.algned <- mutate(cov.data.n, position_aligned=position-halfWindow,
                  position_str_cor=ifelse(strand=="+", position_aligned, -1*position_aligned))

# for the next operation R needs a lot of memory so I'll remove intermediate objects
rm("cov.data", "cov.data.n")

# counts the number of reads per position - the final result!
    res <- ddply(cov.algned, .(position_str_cor), summarize, sum=sum(count), .parallel=T)

## get condition name
    # important to have column with the name of the experiment when running multiple analysis
sample <- strsplit(file, "\\.")[[1]][1]
res$condition <- rep(sample, nrow(res))

return(res)
}

## Run the function over all coverage files
# the output is a list of data frames
cov.l <- lapply(cov.files, loadCovMat)

## see the data
lapply(cov.l, head)

## save the coverage data
lapply(cov.l, function(x){
                          write.table(x,
                                      concat(unique(x$condition), ".sum.cov"),
                                      row.names = FALSE,
                                      quote = FALSE, sep="\t")})

# join the data frames in the list into one big data frame for plotting
cov <- do.call("rbind", cov.l)

## 2 different plots:
library(ggplot2) # not loaded above but needed for the plots
p.cov.all <- ggplot(data=cov, aes(x=position_str_cor, y=sum, color=condition)) + geom_line()
ggsave(filename = "./plots/coverage_tss_all.pdf", p.cov.all)

p.cov.wrap <- p.cov.all + facet_wrap(~condition)
ggsave(filename = "./plots/coverage_tss_wrap.pdf", p.cov.wrap)
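As mentioned above, switching from the summed profile to the average per-base profile only requires changing the summary inside loadCovMat. A sketch of the replacement ddply call:

# per-base mean coverage across regions instead of the sum (drop-in replacement inside loadCovMat)
res <- ddply(cov.algned, .(position_str_cor), summarize, mean = mean(count), .parallel = TRUE)
# remember to also change aes(y = sum) to aes(y = mean) in the ggplot calls below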

If all went well this is the final result:
Plot 1

Plot 2

All that is lacking now is putting all of this into a nice wrapper, which could easily be done, but I preferred to show the different stages of the process separately because it is easier to explain. I hope it helps.


BioPython musings

On my quest to improve my scripting skills I recently took the Introduction to BioPython course at the VIB in Leuven.

The course

It was all very well organized and the instructor, Kristian Rother, was well prepared and engaging. Perhaps he was not expecting so many of us wanting to learn more about handling NGS data, but when he saw the interest he wrote, overnight, a custom script to parse and extract information from SAM files. Not only is it useful, it also serves as a practical example of several Python functions:

#!/usr/bin/env python

# TASK: find out how many sequences there are in the filtered dataset
#       (the one printed at the bottom)

data = []

for line in open('example.sam'):
    if not line.startswith('$') and not line.startswith('@'):
        line = line.strip().split()
        record = {
            'QNAME' : line[0],
            'FLAG'  : int(line[1]),
            'RNAME' : line[2],
            'POS'   : int(line[3]),
            'MAPQ'  : int(line[4]),
            'CIGAR' : line[5],
            'RNEXT' : line[6],
            'PNEXT' : int(line[7]),
            'TLEN'  : int(line[8]),
            'SEQ'   : line[9],
            'QUAL'  : line[10],
            'optional' : []
        }
        for optional in line[11:]:
            record['optional'].append(optional.split(':'))
        data.append(record)

# filter all records between position 109200 and 110000
count = 0
for rec in data:
    if rec['POS'] > 109200 and rec['POS'] + len(rec['SEQ']) <= 110000:
        print rec['POS'], rec['SEQ']
        count = count + 1
print "Total number of reads is: %i" % (len(data))
print "Filtered reads is: %i" % (count)

Although I learned a lot, there was a problem that delayed the class’s progression: despite the organizers’ warning that the course was only for people who already knew the basics of Python, most of the class had not worked with Python, and a few had never programmed or even used the command line to create a directory. This is not the organizers’ or the instructor’s fault, and it only delayed us on the first day. Those of us who already knew a bit of programming carried on with the exercises on our own.

BioPython

First, let me get out of the way that after the course ended, and for my own purposes – mostly analysis of NGS data – I did not see anything that could be done with BioPython that I could not already do with a combination of R and shell scripting. That said, parsing files, for instance FASTA and even SAM, appears to be faster and more intuitive than using R/Bioconductor packages; file parsing in Python is much, much simpler than in R. Of course there is a reason for that: R is for statistical analysis and was not developed to deal with strings. Python, on the other hand, has plenty of neat built-in functions for handling strings, which makes it much more powerful for dealing with DNA/RNA/protein sequences. BioPython seems to add even greater power for this.

Another point in favour of BioPython is its documentation, which appears to be of a better standard than that of the average Bioconductor package. I also quite liked the possibility of wrapping any variable in help() to check which functions can be used with that particular variable – simple and effective.

Plotting is OK in Python with matplotlib, but its aesthetics are not even close to those of ggplot in R.

NGS data in BioPython

Before doing the course I did some searching on what was available for the analysis of NGS data in Python and got the impression that Python programmers had disregarded this technology. There were plenty of scripts in awk, Perl, etc. to parse, manipulate and analyse SAM/BAM/FastQ files, but not in Python. I was wrong. During the course I found out about HTSeq. I have yet to explore it completely, but it looks like it could be a nice tool for quick QC and for counting reads over features – something I have yet to master in R.

All in all, I think I will be using Python a lot more in the future, but still doing most of my analysis with R and shell scripting.


One-liners, snippets and small scripts

Often I use one-liners or small scripts for useful tasks but then keep forgetting them. So I’ll just put them here for future reference.

Count the number of occurrences in a column/field. In this case, how many lines in a GFF3 file exist for each chromosome:


cut -f1 hg19.GFF3 | sort | uniq  -c | sort

This can be further refined to count only genes per chromosome using a simple grep:


grep "gene" hg19.GFF3 | cut -f1 | sort | uniq  -c | sort



Calculating and plotting mapped reads – a simple R/shell script

As this is my first post I’ll start with something very simple I did recently. The problem is simple: “how many reads does my deep-seq experiment have and how many were mapped to the genome?”. I am now dealing with several RNA-seq and ChIP-seq experiments, and having a pipeline to get this information quickly for several files is important to me.

There are several suggestions out there on how to do it, but I wanted something that could be used routinely for several files per experiment and that could output the results in a visual manner (and also as a table).

Using SAMtools as suggested here (thanks!), I started with a simple bash script that would output the mapped read counts into a text file and possibly do the maths:

#! /bin/bash

# file header
echo "bam_file total mapped unmapped"

# loop to count reads in all files and add results to a table
for bam_file in *.bam
do
total=$(samtools view -c $bam_file)
mapped=$(samtools view -c -F 4 $bam_file)
unmapped=$(samtools view -c -f 4 $bam_file)
echo "$bam_file $total $mapped $unmapped"
done
# ultimately the data is saved in a space-separated file

# saved in a file called "CountMappedReads.sh"

Unfortunately, arithmetic operations are not something I would advise doing in the shell – in my particular case because I am using a remote server that lacks a necessary package. That minor setback led me to think about doing it in R: it can send commands to the shell, it is certainly built for mathematical/statistical operations, and it has ggplot!

So this is how the code looks:

# count reads using SAM tools
library(ggplot2)
library(reshape2)
library(plyr)

# function to call SAMtools (via the shell script above) and store the read counts in a text file
CountReadsBAM <- function()
{
command= "~/bin/CountMappedReads.sh > MappedReads.txt"
try(system(command))
res=read.table("MappedReads.txt", header=T)
return(res)
}


# calling the function stores the read counts in an object:
ReadCount <- CountReadsBAM()

# a bit of polishing to remove total (not needed for plotting)
ReadCountSmall <- data.frame(BAM = ReadCount$bam_file, Mapped = ReadCount$mapped, Unmapped = ReadCount$unmapped)

# ggplot needs data in a specific layout
MeltedReadCount = melt(ReadCountSmall, id=c('BAM'))
names(MeltedReadCount) <- c('BAM', 'Mapping', 'Reads')

# calculate the fraction of mapped reads
ReadsFraction <- ddply(
MeltedReadCount,
.(BAM),
summarise,
Count.Fraction = Reads / sum(Reads)
)

# sort the data frame and add fraction to the data frame
to_graph <- cbind(arrange(MeltedReadCount, BAM), fraction = ReadsFraction$Count.Fraction)

# Now all we have to do is plot the data
gp <- ggplot(data=to_graph, aes(x=BAM, y=Reads, fill=Mapping)) +
geom_bar(stat="identity",position = "stack") +
geom_text(aes(label=paste(round(fraction*100),"%", sep="")),size = 3,vjust=0,position="stack") +
theme(axis.text.x=element_text(angle=90)) # rotate x labels; older ggplot2 versions used opts()/theme_text()

pdf("MappedReads.pdf")
gp
dev.off()

To execute it, all you have to do is go to the folder that contains the .bam files and run from the shell:

Rscript ~/bin/CountMappedReads.r

This is how the plot looks:

This is of course very simple, which also means it is very easy to adapt to one’s specific needs, but it serves the purpose of getting quick visual information on how the data mapped, perhaps to present in group meetings. That said, never underestimate the power of visual information, and one of the many beauties of ggplot is that the plots can be customized to exhaustion (I’ll update this plot soonish to include the % mapping: UPDATED).

This simple plot contains a lot of useful information: (i) how many reads we collected; (ii) how many were mapped (the totals are visible on the Y-axis); (iii) visual information on the proportion of mapped reads; and, just in case, (iv) the actual percentage of mapped and unmapped reads. And all this for 12 BAM files!
