I'm trying to visualize coverage vs. gc content across a handful of regions. My starting materials are my.bam and gc5Base.txt.gz. What's the easiest way to get a scatterplot of coverage vs. GC-content using free tools? y-axis is coverage, x-axis is %GC.
I got as far as this in R (EDIT: see complete solution below)
library(Rsamtools)
library(GenomicRanges)
x <- readGappedAlignments("my.bam")
cvg <- coverage(x)
but I'm struggling to figure out how to join the GC-content to the coverage.
BONUS - I'd also like to be able to easily adjust the smoothing window.
EDIT: solution modified from Michael Dondrup's answers ([IMAGE] - http://min.us/lmrKfU):
library(GenomicFeatures)
library(ShortRead)
library(rtracklayer)
library(BSgenome)
library("BSgenome.Hsapiens.UCSC.hg18")
x <- readGappedAlignments("sample1.bam")
cvg <- coverage(x)
chrZ <- DNAString(Hsapiens$chrZ)
window.size <- 1001 #BONUS
gcZ <- rowSums(letterFrequencyInSlidingView(chrZ, window.size, c("G","C")))/window.size
pad.right <- function (rle, value=0, len) {
if (length(rle) > len) stop("length mismatch: len must be >= length(rle)!")
if (len > length(rle))
c(rle, Rle(value, len-length(rle)))
else rle
}
cvgZ <- cvg[["chrZ"]]
cvgZp <- pad.right(cvgZ, len=length(gcZ))
cvgZw <- as.numeric(runmean(x=cvgZp, k=window.size, endrule="constant"))
smoothScatter(x=gcZ, y=cvgZw)
Hi, I hope this works for you. check my answer
cool, I edited my question to respond.
Yes, we forgot to pad the coverage vector because the coverage vector will only be as long as the largest alignment and thereby possibly shorter, we have to add 0's at the 'right-hand' side. Se my updated code.
Sweet, after a cup of coffee or 10 I'll have a plot. Some level of alpha transparency will help the madness, or I can switch to a 2D density plot. Thank you. This exercise was most informative for me.
fine!, for completeness, the smoothScatter function in the graphics package will do that.
also very helpful! I must have too much time on my hands because now I am interested in viewing this density strung out along a genomic coordinate 3rd dimension, but I guess the density would then be in a 4th dimension. Shoot.
Can you post the resulting graphics? And maybe make a new question for the new task, because this question would become too crowded with code blocks. The density is however not a function of the genomic position, but only on gc and coverage.
This was very helpful, thanks for the discussion and sharing.
You do mention 'handful of regions' but I cant figure in the code where those select regions are used for the plot. In other words, assuming exome or targeted sequencing data, one would like to use a substring of the chrZ in order to make the final GC plot.
One thing am going to try is subset the original BAM file (capture data aligned to whole genome, so contains off-target stuff) to just the capture regions, and then load the subsetted BAM in this code snippet!