Compensating for GC bias is only needed if the GC bias differs between samples or groups. Likewise, compensating for the length distribution is only needed if it shows a significant bias by sample or group. The CQN vignette walks you through a few diagnostic plots (e.g., an MA-plot with points colored by GC content) that will tell you whether there's notable GC bias. In my experience (so take this with the appropriate grain of salt) this is rarely needed; I recall only one dataset that showed significant per-sample GC bias (CQN nicely got rid of it).
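If you want a quick version of that plot without opening the vignette, something like this works (a sketch of my own, not the vignette's code; 'counts' is assumed to be a gene-by-sample count matrix and 'gc' a per-gene GC fraction in the same order):

cpm <- t(t(counts) / colSums(counts)) * 1e6              # counts per million
M <- log2(cpm[,1] + 0.5) - log2(cpm[,2] + 0.5)           # log ratio, sample 1 vs. 2
A <- (log2(cpm[,1] + 0.5) + log2(cpm[,2] + 0.5)) / 2     # average log expression
bins <- cut(gc, quantile(gc, seq(0, 1, 0.2)), include.lowest=TRUE)  # 5 GC bins
plot(A, M, col=heat.colors(5)[as.integer(bins)], pch=16, cex=0.3)

A GC-dependent color gradient in M is the tell-tale sign of per-sample GC bias.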
BTW, you mention length normalization as well. I've actually never seen a length bias in a dataset, though I suppose it could occur. For calculating length and GC content, one would normally use a "union gene model", wherein overlapping transcripts are merged such that overlapping portions are only counted once. The following R script will read in a GTF file and a genomic reference in fasta format and produce per-gene length and GC content calculations in an appropriate manner. The output (GC_lengths.tsv) can be used in CQN. I wrote this for my own needs, so the GTF and fasta filenames are hard-coded (just change them).
#!/usr/bin/Rscript
library(GenomicRanges)
library(rtracklayer)
library(Rsamtools)

#Change the next two lines!!!
GTFfile = "/home/ryand/Documents/Misc/Mus_musculus/Ensembl/GRCm38.71/Annotation/Mus_musculus.GRCm38.71.gtf" #CHANGE ME!!!
FASTAfile = "/home/ryand/Documents/Misc/Mus_musculus/Ensembl/GRCm38.71/Sequence/Mus_musculus.GRCm38.71.dna.toplevel.fa" #CHANGE ME!!!

#Load the exon annotation and collapse each gene to its union gene model
GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", feature.type="exon")
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
reducedGTF <- unlist(grl, use.names=TRUE)
elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementNROWS(grl))

#Open the fasta file (it must be indexed; run indexFa(FASTAfile) once if needed)
FASTA <- FaFile(FASTAfile)
open(FASTA)

#Count the G/C bases and record the width of each reduced exon
elementMetadata(reducedGTF)$nGCs <- letterFrequency(getSeq(FASTA, reducedGTF), "GC")[,1]
elementMetadata(reducedGTF)$widths <- width(reducedGTF)

#Per gene: total exonic length and GC fraction
calc_GC_length <- function(x) {
    nGCs = sum(elementMetadata(x)$nGCs)
    width = sum(elementMetadata(x)$widths)
    c(width, nGCs/width)
}
output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_GC_length))
colnames(output) <- c("Length", "GC")
write.table(output, file="GC_lengths.tsv", sep="\t")
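To connect that to CQN, something along these lines should work (a sketch, assuming 'counts' is a count matrix whose rownames match the gene_ids in the table):

library(cqn)
gc_len <- read.delim("GC_lengths.tsv")
gc_len <- gc_len[rownames(counts),]            # match the gene order of the counts
cqn.res <- cqn(counts, x=gc_len$GC, lengths=gc_len$Length)
cqnplot(cqn.res, n=1, xlab="GC content")       # per-sample GC curves
cqnplot(cqn.res, n=2, xlab="length")           # per-sample length curves
normalized <- cqn.res$y + cqn.res$offset       # CQN-normalized log2 expression

If the per-sample curves in the cqnplot output sit on top of each other, there's no bias worth correcting.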
I guess I should reply to the ERCC spike-in part of your question since you linked to a comment from me :p
ERCC spike-ins are useful only for library size normalization, whereas CQN is meant to account for GC or other biases. Because they address different problems, the two are compatible. Having said that, remember that ERCC spike-ins are only beneficial in very particular circumstances (e.g., when performing single-cell sequencing or when you suspect gross changes in transcription levels). Most people won't benefit from spike-ins.
I'm curious how you ended up with GC bias in your data. Was it a technical artifact?
My first instinct is that if the sequencing really uses the same protocol, then when you measure the same gene in different samples the denominator for length and GC is the same in both the test and the control, and the biases are also the same, so you should get the same result whether or not you try to account for them. So I don't generally try to fix it, though RSEM (or whatever you use for transcript quantification) will account for that.
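To make that concrete with a toy example (numbers invented): dividing both samples' counts by the same per-gene constant leaves the fold-change untouched:

counts_test <- 200; counts_ctrl <- 100; gene_len_kb <- 2.5
counts_test / counts_ctrl                                    # fold-change of 2
(counts_test / gene_len_kb) / (counts_ctrl / gene_len_kb)    # still 2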
If I saw a substantial GC bias I would probably ask them to resequence it with a new library because there's probably something wrong with it. I usually use GC bias as a marker for a library prep gone bad.
Have you seen it for other reasons?
I'm not really sure; it was a one-off event, presumably due to a technical issue during sample prep. You're exactly correct that the GC content (or, equivalently, length) of each feature is the same across samples, so there's normally nothing to account for.
In my current experiment, we spike in the ERCCs to serve as an internal control, hoping they can also act as a normalization control so that if things go wrong we still have something to normalize against. So far the ERCCs seem to behave normally in my samples, but I imagine that for them to work well the wet lab must be very accurate when spiking them in; otherwise we could wrongly assume that the ERCC content is the same in all samples.
I can imagine ERCCs being extremely important in single-cell analysis, as described in this paper: http://www.nature.com/nmeth/journal/v11/n1/full/nmeth.2694.html. For normal RNA sequencing, though, I'm not sure they're as useful, considering that we can't control the number of cells used in the experiment (unless we pick them with one of those single-cell protocols) and there's simply more heterogeneity in the sample. So do you think that for a normal RNA-seq experiment the ERCCs serve more as library-prep QC information than as a signal to be used for normalization?
Yeah, that's probably a good way of using them.
I was wondering why you assume that the ERCC spike-ins are only good for QC purposes. I'm not trying to say otherwise; I'm just wondering why they're good for QC but not really as a normalization factor.
The most sensible thinking that I have heard on that (from the ERCC folks, actually) is that the current batch of ERCCs is a little different than real RNA in terms of things like length and the length of the poly A tail (if I am remembering correctly). So they may behave a little differently than real RNA. So for normalization you can consider the ERCC information, but you would not want to only consider the ERCCs and ignore the rest of the information in your libraries. Therefore the ERCCs can supplement your normalization, or provide support that libraries are correctly normalized, but you should not rely solely on them. Though every experiment is different and I can imagine some cases where they may be the only handle you have.
In the current experiment I am working on, I used DESeq2 to analyse my data. I mapped the spiked-in fastq files to the indexed genome+ERCC sequences. I then took the subset of ERCC genes and calculated the size factors with DESeq2. I then created a second DESeq2 object from all other genes (the complete dataset excluding the ERCC sequences) and assigned the size factors calculated above to this new object. I then ran the rest of the DESeq2 analysis. In the output I see the message that it uses the given size factors for the estimation of dispersion.
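Roughly, in code (a sketch; 'counts' is my full count matrix including the ERCC rows, 'coldata' the sample table, and 'condition' my grouping variable):

library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData=counts, colData=coldata, design=~condition)
erccRows <- grepl("^ERCC-", rownames(dds))          # ERCC IDs all start with "ERCC-"
ddsGenes <- dds[!erccRows,]                         # everything except the spike-ins
sizeFactors(ddsGenes) <- sizeFactors(estimateSizeFactors(dds[erccRows,]))
ddsGenes <- DESeq(ddsGenes)                         # "using pre-existing size factors"
res <- results(ddsGenes)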
With the same dataset, the standard analysis gave me "only" ~70 significant genes with an adjusted p-value <= 0.1, while with the ERCC normalization I now get almost 4000 genes with an adjusted p-value <= 0.05.
Somehow this seems unreliable. I did a lowess normalization in parallel, based on the ERCC subset of genes, and again got quite a large set of differentially expressed genes. Overlapping the two groups still leaves over 2000 genes that are highly significant.
Does this look right? Somehow I'm not sure whether to trust these results or not.
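For reference, the lowess step I mean is roughly this (a sketch for one pair of samples; 'logexp' is a log2 expression matrix including the ERCC rows and 'ercc' a logical index of those rows):

M <- logexp[,1] - logexp[,2]
A <- (logexp[,1] + logexp[,2]) / 2
fit <- lowess(A[ercc], M[ercc])                            # trend fitted on ERCCs only
M.corrected <- M - approx(fit$x, fit$y, xout=A, rule=2)$y  # subtract the ERCC trend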
Hi Devon, I read the paper "Population and sex differences in Drosophila melanogaster brain gene expression". In this paper they used the length of the longest transcript of each gene as the gene length. Is this a good idea? I'm also interested in how to extract the longest transcript from a transcript.fasta file. I posted a question here; hope you can answer it, thanks.
That's not a good idea, ignore what they did in that paper.
Hey Devon,
I am trying to run your code to obtain gene lengths for RPKM calculations in edgeR's rpkm() after using TopHat2 and htseq-count. However, I run into an error: I could not find any function called elementLengths when searching the internet. Could you please clarify how I can use this function?
EDIT: asRangedData in the import.gff() command seems to be deprecated, but looking it up, this does not seem to change anything in the function. I just wanted to mention it for the sake of completeness.
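For context, this is roughly what I am trying to do (a sketch; 'counts' is my htseq-count matrix and GC_lengths.tsv is the output of the script above):

library(edgeR)
gc_len <- read.delim("GC_lengths.tsv")
common <- intersect(rownames(counts), rownames(gc_len))    # genes present in both
y <- DGEList(counts=counts[common,])
y <- calcNormFactors(y)
rpkm_vals <- rpkm(y, gene.length=gc_len[common, "Length"])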