Normalization Of Rna Sequencing Counts (By Ercc / Gene Length)
2
2
Entering edit mode
10.8 years ago
Sam ★ 4.8k

There are now a lot of ways to perform RNA Sequencing count normalization and most tools are available. However, do you guys think that gene length normalization / GC count normalization are necessary before the differential gene expression analysis?

  1. Assuming that we are performing DESeq or EdgeR Analysis, we can use R package cqn to provide a size factor for downstream analysis, do you think that is necessary?
  2. When performing CQN, we need to provide the gene length and gene gc content for the normalization, do you provide the full gene boundary e.g. include the intronic region, or just the combined exonic region? If that exon is shared by two gene, will you perform include it or will you ignore that?

Another question is, we can now include the internal control (ERCC spike in) when performing the RNA Sequencing. Should we combine the normalization with cqn or just simply use the ERCC spike in as the size factor estimate (as mentioned in C: RNA-Seq data normalization with spike-in using DESeq) ?

rnaseq normalization deseq edger • 15k views
ADD COMMENT
9
Entering edit mode
10.8 years ago

Compensating for GC bias is only needed if you have GC bias between samples or groups. Likewise, trying to compensate for length distribution is only needed if it shows a significant bias according to sample or group. With CQN, there are a few diagnostic plots that you can learn how to make in the vignette (e.g. an MA-plot with points colored by GC content) that will let you know if there's notable GC bias. In my experience (so take this with the appropriate grain of salt) this is rarely useful. I recall only having one dataset that showed significant per-sample GC-bias (CQN nicely got rid of that).

BTW, you mention length normalization as well. I've actually never seen a length bias in a dataset, though I suppose it could occur. For calculating length and GC content, one would normally use a "union gene model", wherein overlapping transcripts are merged, such that overlapping portions are only counted once. The following R script will read in a GTF file and genomic reference in fasta format and produce per-gene length and GC content calculations in an appropriate manner. The output (GC_lengths.tsv) can be used in CQN. I wrote this for my own needs, so the GTF and fasta filenames are hard-coded (just change them).

#!/usr/bin/Rscript
library(GenomicRanges)
library(rtracklayer)
library(Rsamtools)

#Change the next two lines!!!
GTFfile = "/home/ryand/Documents/Misc/Mus_musculus/Ensembl/GRCm38.71/Annotation/Mus_musculus.GRCm38.71.gtf" #CHANGE ME!!!
FASTAfile = "/home/ryand/Documents/Misc/Mus_musculus/Ensembl/GRCm38.71/Sequence/Mus_musculus.GRCm38.71.dna.toplevel.fa" #CHANGE ME!!!

#Load the annotation and reduce it
GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", asRangedData=F, feature.type="exon")
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
reducedGTF <- unlist(grl, use.names=T)
elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementLengths(grl))

#Open the fasta file
FASTA <- FaFile(FASTAfile)
open(FASTA)

#Add the GC numbers
elementMetadata(reducedGTF)$nGCs <- letterFrequency(getSeq(FASTA, reducedGTF), "GC")[,1]
elementMetadata(reducedGTF)$widths <- width(reducedGTF)

#Create a list of the ensembl_id/GC/length
calc_GC_length <- function(x) {
    nGCs = sum(elementMetadata(x)$nGCs)
    width = sum(elementMetadata(x)$widths)
    c(width, nGCs/width)
}
output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_GC_length))
colnames(output) <- c("Length", "GC")

write.table(output, file="GC_lengths.tsv", sep="\t")

I guess I should reply to the ERCC spike-in part of your question since you linked to a comment from me :p

ERCC spike-ins are useful only for library size normalization. CQN is meant to account for GC or other biases. Because of that, the two can be compatible. Having said that, remember that ERCC spike-ins are only beneficial in very particular circumstances (e.g., when performing single-cell sequencing or when you suspect gross changes in transcription levels). Most people won't benefit from spike-ins.

ADD COMMENT
1
Entering edit mode

I'm interested in how you got a GC bias in your data? Was it coming in as a technical artifact?

My first instinct would be that if the sequencing is really using the same protocol then if you are measuring the same gene in different samples then the denominator for length and GC would be the same in both the test and control, and the biases would also be the same, so you should get the same result whether you try to account for them or not. So I do not generally try to fix it, though RSEM or whatever you use for transcript detection will look at that.

If I saw a substantial GC bias I would probably ask them to resequence it with a new library because there's probably something wrong with it. I usually use GC bias as a marker for a library prep gone bad.

Have you seen it for other reasons?

ADD REPLY
0
Entering edit mode

I'm not really sure, it was a one-off event, presumably due to a technical issue during sample prep. You're exactly correct regarding the GC (or equivalently length) of each feature will be the same across samples, so there's normally nothing to account for.

ADD REPLY
1
Entering edit mode

In my current experiment, we spike in the ERCC to serve as an internal control and hoping to serve as a normalization control such that if things go wrong, we can still have something to control for. So far, the ERCC seems to behave normally in my samples but I can imagine that in order for the ERCC to work well the wet lab must be very accurate in spiking in the ERCC or else it is possible for us to have a wrong assumption that the ERCC should be the same in all samples.

I can imagine ERCC will be extremely important in single cell analysis as stated in this paper: http://www.nature.com/nmeth/journal/v11/n1/full/nmeth.2694.html but then for normal RNA Sequencing, I am not so sure if it is as useful considering that we can't control the number of cells we use for the experiment (unless we pick them using those single cell analysis protocol) and there are simply more heterogeneity in the sample. So, do you think that for normal RNA Sequencing experiment, the ERCC serve more as a library prep QC information instead of a signal to be used for normalization?

ADD REPLY
0
Entering edit mode

Yeah, that's probably a good way of using them.

ADD REPLY
0
Entering edit mode

I was wondering why you assume that the ercc pike in are only good for QC purposes. I am not trying to say otherwise, I was just wondering why it is good for the QC, but not really as a normalization factor.

ADD REPLY
1
Entering edit mode

The most sensible thinking that I have heard on that (from the ERCC folks, actually) is that the current batch of ERCCs is a little different than real RNA in terms of things like length and the length of the poly A tail (if I am remembering correctly). So they may behave a little differently than real RNA. So for normalization you can consider the ERCC information, but you would not want to only consider the ERCCs and ignore the rest of the information in your libraries. Therefore the ERCCs can supplement your normalization, or provide support that libraries are correctly normalized, but you should not rely solely on them. Though every experiment is different and I can imagine some cases where they may be the only handle you have.

ADD REPLY
0
Entering edit mode

In the current experiment I am working on, I used DESeq2 to analyse my data. I mapped the spiked-In fastq files to the indexed genome+ercc sequences. I thank took the subset of ercc genes and calculated the size factors using DESeq2. I than created a second DESeq Object from all other genes (complete data set excluding ercc sequences) and gave the above calculated size factors to this newly made DESeq object. I than ran the rest of the DESeq analysis. In the output I see the message that it uses the given size factors for the estimation of dispersion.

With the same data set,I got via standard analysis "only" ~ 70 significant genes with an adj. p-value <=0.1, I get now with the ercc normalization almost 4000 genes with an adj. p-value <=0.05.

Somehow this seems unreliable. I have done a lowess normalization in parallel based on the ercc subset of genes and got another quite high chunk of genes to be differentially expressed. By overlapping the two groups I still get 0ver 2000 genes which are highly significant.

Does it look right? Somehow I am not sure if I want to trust this results or not.

ADD REPLY
0
Entering edit mode

Hi Devon, I read the paper "Population and sex differences in Drosophila melanogaster brain gene expression" In this paper they used the length of the longest transcript of each gene as the gene length. Is this a good idea to do this? And I'm also interested in how to extract the longest transcript from a transcript.fasta file. I posted a question here. Hope you can answer it, thanks.

ADD REPLY
0
Entering edit mode

That's not a good idea, ignore what they did in that paper.

ADD REPLY
0
Entering edit mode

Hey Devon,

I am trying to run your code to assess the gene lengths for RPKM calculations in edgeR rpkm() after using TopHat2 and htseq-count. However, I run into the following error:

elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementLengths(grl))
Error in elementLengths(grl) : could not find function "elementLengths"`

I did not find any function called elementLengths when searching the internet. Could you please clarify, how I can use this function?

EDIT:

in the command

GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", asRangedData=F, feature.type="exon")

asRangedData seems deprecated, but looking it up this seems not to change anything in the function. I just wanted to mention for the sake of completeness.

ADD REPLY
0
Entering edit mode
10.8 years ago

I would typically use RPKM expression values for differential expression analysis (which would correct for gene length), and I would consider this the most preferable option:

http://cdwscience.blogspot.com/2013/11/rna-seq-differential-expression.html

That said, edgeR and DESeq are supposed to use raw gene counts. I see your argument (neither DESeq nor edgeR has any knowledge of gene length or GC content), but the read dispersions are modeled on a per-gene basis (so, perhaps this captures the effect of these factors indirectly).

ADD COMMENT

Login before adding your answer.

Traffic: 1611 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6