Question

Plot Coverage Vs. Gc Content

10

Entering edit mode

14.0 years ago

Bfranks ▴ 100

I'm trying to visualize coverage vs. gc content across a handful of regions. My starting materials are my.bam and gc5Base.txt.gz. What's the easiest way to get a scatterplot of coverage vs. GC-content using free tools? y-axis is coverage, x-axis is %GC.

I got as far as this in R (EDIT: see complete solution below)

library(Rsamtools)
library(GenomicRanges)
x <- readGappedAlignments("my.bam")
cvg <- coverage(x)

but I'm struggling to figure out how to join the GC-content to the coverage.

BONUS - I'd also like to be able to easily adjust the smoothing window.

EDIT: solution modified from Michael Dondrup's answers ([IMAGE] - http://min.us/lmrKfU):

library(GenomicFeatures)
library(ShortRead)
library(rtracklayer)
library(BSgenome)
library("BSgenome.Hsapiens.UCSC.hg18")
x <- readGappedAlignments("sample1.bam")
cvg <- coverage(x)
chrZ <- DNAString(Hsapiens$chrZ)
window.size <- 1001 #BONUS
gcZ <- rowSums(letterFrequencyInSlidingView(chrZ, window.size, c("G","C")))/window.size
pad.right <- function (rle, value=0, len) {
    if (length(rle) > len) stop("length mismatch: len must be >= length(rle)!")
        if (len > length(rle))
            c(rle, Rle(value, len-length(rle)))
    else rle
}
cvgZ <- cvg[["chrZ"]]
cvgZp <- pad.right(cvgZ, len=length(gcZ))
cvgZw <- as.numeric(runmean(x=cvgZp, k=window.size, endrule="constant"))
smoothScatter(x=gcZ, y=cvgZw)

next-gen sequencing visualization gc • 16k views

ADD COMMENT • link updated 2.1 years ago by Ram 44k • written 14.0 years ago by Bfranks ▴ 100

0

Entering edit mode

Hi, I hope this works for you. check my answer

ADD REPLY • link 13.9 years ago by Michael 55k

0

Entering edit mode

cool, I edited my question to respond.

ADD REPLY • link 13.9 years ago by Bfranks ▴ 100

0

Entering edit mode

Yes, we forgot to pad the coverage vector because the coverage vector will only be as long as the largest alignment and thereby possibly shorter, we have to add 0's at the 'right-hand' side. Se my updated code.

ADD REPLY • link 13.9 years ago by Michael 55k

0

Entering edit mode

Sweet, after a cup of coffee or 10 I'll have a plot. Some level of alpha transparency will help the madness, or I can switch to a 2D density plot. Thank you. This exercise was most informative for me.

ADD REPLY • link 13.9 years ago by Bfranks ▴ 100

0

Entering edit mode

fine!, for completeness, the smoothScatter function in the graphics package will do that.

ADD REPLY • link 13.9 years ago by Michael 55k

0

Entering edit mode

also very helpful! I must have too much time on my hands because now I am interested in viewing this density strung out along a genomic coordinate 3rd dimension, but I guess the density would then be in a 4th dimension. Shoot.

ADD REPLY • link 13.9 years ago by Bfranks ▴ 100

0

Entering edit mode

Can you post the resulting graphics? And maybe make a new question for the new task, because this question would become too crowded with code blocks. The density is however not a function of the genomic position, but only on gc and coverage.

ADD REPLY • link 13.9 years ago by Michael 55k

0

Entering edit mode

This was very helpful, thanks for the discussion and sharing.

You do mention 'handful of regions' but I cant figure in the code where those select regions are used for the plot. In other words, assuming exome or targeted sequencing data, one would like to use a substring of the chrZ in order to make the final GC plot.

One thing am going to try is subset the original BAM file (capture data aligned to whole genome, so contains off-target stuff) to just the capture regions, and then load the subsetted BAM in this code snippet!

ADD REPLY • link updated 2.1 years ago by Ram 44k • written 9.7 years ago by Bioinfosm ▴ 620

Ram · Answer 1 · 2011-03-12

Computing GC content and smoothed coverage in R

Edit: pre-processing the data to compute GC content and compute running mean with flexible window size. The genome sequence is required to compute the GC content. I assume it is present as DNAString object, use read.DNAStringSet to read in the genome from a fasta file or use the BSGenome package.

You will have to experiment a bit how to best load the data, e.g. maybe process only one chromosome at a time will help save memory.

library(Biostrings)

window.size <- 1000 # configure the window size

# create some test yeast data, the coordinates must fit with the coverage ofc, 
# so this might not work directly:
data(yeastSEQCHR1)
chrom <- DNAString(yeastSEQCHR1)

# compute gc content in a sliding window (as a fraction, if you want % : *100):
gc <- rowSums(letterFrequencyInSlidingView(chrom, window.size, c("G","C")))/window.size


# the coverage vector can be shorter than the sequence. 
# The length is equal to the largest end position of an alignment in the data:
pad.right <- function (rle, value=0, len) {
if (length(rle) > len) stop("length mismatch: len must be >= length(rle)!")
    if (len > length(rle)) 
      c(rle, Rle(value, len-length(rle))) 
    else rle
}
cvg <- pad.right(cvg, len=length(gc))

# we have to process the coverage vector to have the same window size:
cvgw <- runmean(x=cvg, k=window.size, endrule = c("constant")) 
# endrule is important to get a vector of same size, see ?runmean

Finally, do a scatter-plot and maybe add a lowess scatterplot smoothing curve to visualize the correlation:

plot(x=gc, y=cvgw)
lines(lowess(x=gc, y=cvgw), col=2)

should do the trick.

Plotting the GC content as a genome track

Once you are using R, the easiest way to plot coverage is possibly the GenomeGraphs package. Use the function makeBaseTrack to generate a coverage track. Given your coverage is in the cvg Rle object and gc is an Rle with the gc content, that looks like this:

library("GenomeGraphs")
baseCoverage <- as.numeric(cvg) # makeBaseTrack doesn't take Rle
gcContent <- as.numeric(gc)
 # contruct your gene models here: 
gene <- ...
pl <- list( 
      genes=gene, makeGenomeAxis(T,T,T),
      coverage=makeBaseTrack(base=1:length(baseCoverage), value=baseCoverage),
      gc=makeBaseTrack(base=1:length(gcContent), value=gcContent)
)


gdPlot(pl, min=1, max=1000) # restrict viewport

There are also multiple configurable functions in that package to plot the gene models, e.g.: makeAnnotationTrack or using biomaRt (from the package documentation):

mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# Next we can retrieve the gene structure of the gene of interest.
gene <- makeGene(id = "ENSG00000095203",
type = "ensembl_gene_id", biomart = mart)

Ram · Answer 2 · 2011-03-15

1

Entering edit mode

13.9 years ago

Ian 6.1k

Sorry if this is a silly suggestion, but have you looked at the data using the UCSC browser? It will easily present the two tracks for any given region. The browser can present BAM files (here) and there is the GC content track

ADD COMMENT • link updated 2.1 years ago by Ram 44k • written 13.9 years ago by Ian 6.1k

0

Entering edit mode

Thanks, I can do the track visualizations, but I'm interested in the scatterplot where coverage is on the y-axis and GC-content is on the x-axis.