Plot Coverage Vs. Gc Content
2
10
Entering edit mode
13.7 years ago
Bfranks ▴ 100

I'm trying to visualize coverage vs. gc content across a handful of regions. My starting materials are my.bam and gc5Base.txt.gz. What's the easiest way to get a scatterplot of coverage vs. GC-content using free tools? y-axis is coverage, x-axis is %GC.

I got as far as this in R (EDIT: see complete solution below)

library(Rsamtools)
library(GenomicRanges)
x <- readGappedAlignments("my.bam")
cvg <- coverage(x)

but I'm struggling to figure out how to join the GC-content to the coverage.

BONUS - I'd also like to be able to easily adjust the smoothing window.

EDIT: solution modified from Michael Dondrup's answers ([IMAGE] - http://min.us/lmrKfU):

library(GenomicFeatures)
library(ShortRead)
library(rtracklayer)
library(BSgenome)
library("BSgenome.Hsapiens.UCSC.hg18")
x <- readGappedAlignments("sample1.bam")
cvg <- coverage(x)
chrZ <- DNAString(Hsapiens$chrZ)
window.size <- 1001 #BONUS
gcZ <- rowSums(letterFrequencyInSlidingView(chrZ, window.size, c("G","C")))/window.size
pad.right <- function (rle, value=0, len) {
    if (length(rle) > len) stop("length mismatch: len must be >= length(rle)!")
        if (len > length(rle))
            c(rle, Rle(value, len-length(rle)))
    else rle
}
cvgZ <- cvg[["chrZ"]]
cvgZp <- pad.right(cvgZ, len=length(gcZ))
cvgZw <- as.numeric(runmean(x=cvgZp, k=window.size, endrule="constant"))
smoothScatter(x=gcZ, y=cvgZw)
next-gen sequencing visualization gc • 16k views
ADD COMMENT
0
Entering edit mode

Hi, I hope this works for you. check my answer

ADD REPLY
0
Entering edit mode

cool, I edited my question to respond.

ADD REPLY
0
Entering edit mode

Yes, we forgot to pad the coverage vector because the coverage vector will only be as long as the largest alignment and thereby possibly shorter, we have to add 0's at the 'right-hand' side. Se my updated code.

ADD REPLY
0
Entering edit mode

Sweet, after a cup of coffee or 10 I'll have a plot. Some level of alpha transparency will help the madness, or I can switch to a 2D density plot. Thank you. This exercise was most informative for me.

ADD REPLY
0
Entering edit mode

fine!, for completeness, the smoothScatter function in the graphics package will do that.

ADD REPLY
0
Entering edit mode

also very helpful! I must have too much time on my hands because now I am interested in viewing this density strung out along a genomic coordinate 3rd dimension, but I guess the density would then be in a 4th dimension. Shoot.

ADD REPLY
0
Entering edit mode

Can you post the resulting graphics? And maybe make a new question for the new task, because this question would become too crowded with code blocks. The density is however not a function of the genomic position, but only on gc and coverage.

ADD REPLY
0
Entering edit mode

This was very helpful, thanks for the discussion and sharing.

You do mention 'handful of regions' but I cant figure in the code where those select regions are used for the plot. In other words, assuming exome or targeted sequencing data, one would like to use a substring of the chrZ in order to make the final GC plot.

One thing am going to try is subset the original BAM file (capture data aligned to whole genome, so contains off-target stuff) to just the capture regions, and then load the subsetted BAM in this code snippet!

ADD REPLY
12
Entering edit mode
13.7 years ago
Michael 55k

Computing GC content and smoothed coverage in R

Edit: pre-processing the data to compute GC content and compute running mean with flexible window size. The genome sequence is required to compute the GC content. I assume it is present as DNAString object, use read.DNAStringSet to read in the genome from a fasta file or use the BSGenome package.

You will have to experiment a bit how to best load the data, e.g. maybe process only one chromosome at a time will help save memory.

library(Biostrings)

window.size <- 1000 # configure the window size

# create some test yeast data, the coordinates must fit with the coverage ofc, 
# so this might not work directly:
data(yeastSEQCHR1)
chrom <- DNAString(yeastSEQCHR1)

# compute gc content in a sliding window (as a fraction, if you want % : *100):
gc <- rowSums(letterFrequencyInSlidingView(chrom, window.size, c("G","C")))/window.size


# the coverage vector can be shorter than the sequence. 
# The length is equal to the largest end position of an alignment in the data:
pad.right <- function (rle, value=0, len) {
if (length(rle) > len) stop("length mismatch: len must be >= length(rle)!")
    if (len > length(rle)) 
      c(rle, Rle(value, len-length(rle))) 
    else rle
}
cvg <- pad.right(cvg, len=length(gc))

# we have to process the coverage vector to have the same window size:
cvgw <- runmean(x=cvg, k=window.size, endrule = c("constant")) 
# endrule is important to get a vector of same size, see ?runmean

Finally, do a scatter-plot and maybe add a lowess scatterplot smoothing curve to visualize the correlation:

plot(x=gc, y=cvgw)
lines(lowess(x=gc, y=cvgw), col=2)

should do the trick.


Plotting the GC content as a genome track

Once you are using R, the easiest way to plot coverage is possibly the GenomeGraphs package. Use the function makeBaseTrack to generate a coverage track. Given your coverage is in the cvg Rle object and gc is an Rle with the gc content, that looks like this:

library("GenomeGraphs")
baseCoverage <- as.numeric(cvg) # makeBaseTrack doesn't take Rle
gcContent <- as.numeric(gc)
 # contruct your gene models here: 
gene <- ...
pl <- list( 
      genes=gene, makeGenomeAxis(T,T,T),
      coverage=makeBaseTrack(base=1:length(baseCoverage), value=baseCoverage),
      gc=makeBaseTrack(base=1:length(gcContent), value=gcContent)
)


gdPlot(pl, min=1, max=1000) # restrict viewport

There are also multiple configurable functions in that package to plot the gene models, e.g.: makeAnnotationTrack or using biomaRt (from the package documentation):

mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# Next we can retrieve the gene structure of the gene of interest.
gene <- makeGene(id = "ENSG00000095203",
type = "ensembl_gene_id", biomart = mart)
ADD COMMENT
0
Entering edit mode

I'm not sure how to get the gc-content Rle object. Specifically, I would like the gc-content averaged across views identical to the coverage Rle.

Also, you prefaced your comment with, 'Once you are using R,". Would you recommend a different tool for the job?

ADD REPLY
0
Entering edit mode

Also, why do you say a scatterplot would not be helpful? A plot of both across the genome axis is nice for the eye, but I'm interested in determining the extent of GC-bias in the data, which I imagine is more accurately visualized in euclidean space across the whole dataset rather than one genomic range at a time. At least with the scatterplot I could get a numeric correlation.

ADD REPLY
0
Entering edit mode

I think R is a good tool to do this, I was just happy about that your use of R would provide for an easy and brief solution. You will see if the plot helps. I was just thinking that, should you try to plot the data for the whole human genome, the plotting might show a lot of noise, but you will see that better. I edit my question to reflect how to compute gc content.

ADD REPLY
1
Entering edit mode
13.7 years ago
Ian 6.1k

Sorry if this is a silly suggestion, but have you looked at the data using the UCSC browser? It will easily present the two tracks for any given region. The browser can present BAM files (here) and there is the GC content track

ADD COMMENT
0
Entering edit mode

Thanks, I can do the track visualizations, but I'm interested in the scatterplot where coverage is on the y-axis and GC-content is on the x-axis.

ADD REPLY
0
Entering edit mode

Apologies - misread the question!

ADD REPLY

Login before adding your answer.

Traffic: 1744 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6