Normalization of RNASeq counts with a reference gene (/genes)
2.6 years ago
arctic ▴ 40

Dear all,

I know it is unorthodox, but I want to explore the performance of using a panel of "reference genes" for normalization of RNASeq gene counts (I am doing this alongside a more traditional normalization method in my dataset). To do this:

1. Can I simply divide each gene's raw counts by the ratio of the reference gene's counts between samples (see below)? Or should I log2-transform all the counts first?

Conversion_Rate = Ref_Count_Sample_A / Ref_Count_Sample_B

Normalized_GeneX_SampleA = GeneX_Counts_SampleA / Conversion_Rate

2. If I were to combine counts from several reference genes for normalization, would it make sense to use arithmetic mean or geometric mean for combination of the reference counts?
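
To make the two questions concrete, here is a minimal sketch of the conversion rate computed from a reference panel, once with the arithmetic mean and once with the geometric mean. The gene names and counts are hypothetical, and `panel_summary` and `conversion_rate` are illustrative helpers, not functions from any package. It only shows that the arithmetic mean is dominated by the most highly expressed reference gene, while the geometric mean weights each gene's fold change equally.

```python
from statistics import mean, geometric_mean

# Hypothetical raw counts for a two-gene reference panel (not real data).
ref_a = {"REF_LOW": 10, "REF_HIGH": 1000}
ref_b = {"REF_LOW": 20, "REF_HIGH": 1000}  # only the low-count gene doubles


def panel_summary(ref_counts, how):
    """Combine the reference counts of one sample into a single number."""
    values = list(ref_counts.values())
    # The geometric mean needs strictly positive counts; zeros need separate handling.
    return mean(values) if how == "arithmetic" else geometric_mean(values)


def conversion_rate(ref_counts_a, ref_counts_b, how):
    """Conversion_Rate = Ref_Count_Sample_A / Ref_Count_Sample_B, for a panel."""
    return panel_summary(ref_counts_a, how) / panel_summary(ref_counts_b, how)


rate_arith = conversion_rate(ref_a, ref_b, "arithmetic")  # 505 / 510, close to 1
rate_geo = conversion_rate(ref_a, ref_b, "geometric")     # 100 / sqrt(20000)

genex_a = 50  # hypothetical raw count of GeneX in sample A
normalized_arith = genex_a / rate_arith
normalized_geo = genex_a / rate_geo
```

With these made-up numbers the arithmetic rate barely moves when the low-count reference gene doubles, whereas the geometric rate reflects the average fold change across the panel.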

Thank you for sharing your thoughts in advance,

geometric normalization arithmetic RNASeq reference_gene • 2.5k views

Hi, of course you could do this, but you are also likely to shoot yourself in the foot. You would need a perfect housekeeping gene that never changes, as is assumed in qPCR, and you cannot know beforehand whether a gene is that stable. In qPCR you have no choice, but with RNA-seq you can take advantage of the large number of genes sampled.


I am trying normalization with the "Pseudo Reference" method to identify the least variable genes within each dataset, to be used as reference genes, and then find the genes shared across multiple datasets. So far, within the same dataset, normalization by a set of reference genes gives results comparable to those of the "Pseudo Reference" method.

2.5 years ago
yance_feng ▴ 30

https://doi.org/10.1186/s12859-021-04288-0

MUlti-REference Normalizer (MUREN) is a multi-reference robust scaling method for normalizing read counts. It is based on the assumption that most genes are not differentially expressed. The article also evaluates it against several other methods.

Reply to Q1:
Conversion_Sample_A_log = mean(log2(Ref_genes_Count_Sample_A))
Conversion_Sample_A = 2^Conversion_Sample_A_log
Normalized_GeneX_SampleA = GeneX_SampleA / Conversion_Sample_A
This makes the geometric mean of the reference panel the same in every sample. Because the mean is taken on the log2 scale, Conversion_Sample_A is the geometric mean of the reference counts in sample A.
You can also divide all the conversion factors (Conversion_Sample_A/B/C) by their median or mean so that the normalization changes the raw counts as little as possible.
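
The recipe above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the nested-dict layout, the gene names, the counts, and the helpers `reference_size_factors` and `normalize` are all hypothetical, every reference count is assumed to be strictly positive, and the factors are centred by their geometric mean (one of several reasonable choices for the rescaling mentioned above).

```python
import math


def reference_size_factors(counts, ref_genes):
    """Per-sample factor = 2^(mean log2 reference count), centred across samples.

    counts: dict mapping sample name -> dict mapping gene name -> raw count.
    Assumes every reference count is > 0 (log2 of zero is undefined).
    """
    log_means = {}
    for sample, gene_counts in counts.items():
        logs = [math.log2(gene_counts[g]) for g in ref_genes]
        log_means[sample] = sum(logs) / len(logs)
    # Centre on the log scale so the factors have a geometric mean of 1,
    # which keeps normalized values close to the raw count scale.
    centre = sum(log_means.values()) / len(log_means)
    return {s: 2 ** (m - centre) for s, m in log_means.items()}


def normalize(counts, factors):
    """Divide every count in a sample by that sample's factor."""
    return {
        s: {g: c / factors[s] for g, c in gene_counts.items()}
        for s, gene_counts in counts.items()
    }


# Hypothetical example: sample B was sequenced twice as deeply as sample A.
counts = {
    "A": {"REF1": 100, "REF2": 400, "GENEX": 50},
    "B": {"REF1": 200, "REF2": 800, "GENEX": 100},
}
factors = reference_size_factors(counts, ["REF1", "REF2"])
normalized = normalize(counts, factors)
```

In this toy case GENEX ends up with the same normalized value in both samples, because its twofold difference is fully explained by the reference panel.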


Hi Yance, welcome to Biostars! Please add a minimal description of the linked paper so that readers can understand why it is relevant to the question without clicking on the link.


Thanks a lot for the detailed reply to Q1, and thank you for sharing the MUREN reference. I have to read it in more detail, but it is definitely in line with what I am trying to do.

2.5 years ago
ATpoint 85k

There is nothing unusual about this approach. DESeq2 has an argument in its size factor estimation function (controlGenes in estimateSizeFactors) that lets you base the calculation on control genes: https://support.bioconductor.org/p/115682/

This is usually not necessary in standard RNA-seq but might be necessary if you have extreme DE profiles and/or asymmetrical shifts. Use MA-plots to explore the performance. You can also use edgeR: first subset the DGEList to the control genes, then calculate the TMM factors, and then feed them back into the original object. I do that routinely, though more for applications like ChIP-seq and similar. As said, there is nothing unusual about this as long as you can show that the normalization went well, e.g. with the MA-plots.

Edit: Note, though, that the above simply bases the scaling normalization on these control genes. It does not apply any transformation such as calculating fold changes over a control, which would indeed be unusual, as fold changes typically require some moderation to avoid inflated changes due to small counts. I recommend sticking with counts and letting the DE analysis be handled by expert software.
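
For intuition about what "basing the size factors on control genes" means, here is a simplified pure-Python sketch of median-of-ratios size factors computed from a control set only. It is not DESeq2's or edgeR's code, just the general idea; the function name, data layout, gene names, and counts are hypothetical, and control genes with a zero count in any sample are skipped as an assumption of this sketch.

```python
import math
from statistics import median


def control_gene_size_factors(counts, control_genes):
    """Median-of-ratios size factors estimated from control genes only.

    counts: dict mapping sample name -> dict mapping gene name -> raw count.
    """
    samples = list(counts)
    # Log geometric mean of each control gene across samples (the pseudo-reference).
    log_geo_mean = {}
    for gene in control_genes:
        values = [counts[s][gene] for s in samples]
        if min(values) > 0:  # skip genes with a zero count in any sample
            log_geo_mean[gene] = sum(math.log(v) for v in values) / len(values)
    if not log_geo_mean:
        raise ValueError("no control gene has non-zero counts in every sample")
    # Size factor = median, over control genes, of count / pseudo-reference.
    return {
        s: math.exp(
            median(math.log(counts[s][g]) - lg for g, lg in log_geo_mean.items())
        )
        for s in samples
    }


# Hypothetical counts: the control genes double in B, DE_GENE changes 100-fold.
counts = {
    "A": {"CTRL1": 100, "CTRL2": 200, "CTRL3": 50, "DE_GENE": 10},
    "B": {"CTRL1": 200, "CTRL2": 400, "CTRL3": 100, "DE_GENE": 1000},
}
size_factors = control_gene_size_factors(counts, ["CTRL1", "CTRL2", "CTRL3"])
```

Because DE_GENE is not in the control set, its large change has no influence on the factors. The counts themselves stay untouched; the factors would be passed on to the DE software.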

2.5 years ago
Trivas ★ 1.8k

You could use the RUVSeq package, specifically RUVg, which does exactly this. It was originally built for the ERCC spike-ins but can be adapted to use any genes.


Thanks for sharing the information. It is clearly in line with my question above. Will check the details of the package and the mathematics in their publication.

2.5 years ago
fracarb8 ★ 1.7k

When I was working with Oncoland (Qiagen bioinformatics), they had a way of normalising different datasets together using upper quartile normalisation.

1) Get the normalised counts (FPKM/TPM in their case)

2) Select a set of genes that will constitute your invariant set for all the datasets (e.g. housekeepers)

3) Calculate the scaling factor based on the 75th percentile (upper quartile) of this invariant set

4) Scale so that the 75th percentile equals the target value of 10

Keep in mind that, as Michael Dondrup said, you need to pay close attention to what goes into your invariant set: it could give you the results you want, but it could also lead you completely astray.
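
The four steps can be sketched as follows. This is my reading of the procedure, not Oncoland's actual implementation: the function name, gene names, and FPKM values are hypothetical, and the "inclusive" quantile definition is an assumption, since different tools compute percentiles slightly differently.

```python
from statistics import quantiles


def upper_quartile_scale(expression, invariant_genes, target=10.0):
    """Scale one sample so the 75th percentile of the invariant set equals target.

    expression: dict mapping gene name -> normalised value (e.g. FPKM or TPM).
    Returns the scaled values and the scaling factor that was applied.
    """
    invariant_values = [expression[g] for g in invariant_genes]
    # quantiles(..., n=4) returns the three quartile cut points; index 2 is Q3.
    q75 = quantiles(invariant_values, n=4, method="inclusive")[2]
    if q75 <= 0:
        raise ValueError("upper quartile of the invariant set must be positive")
    factor = target / q75
    return {g: v * factor for g, v in expression.items()}, factor


# Hypothetical FPKM values for one sample.
sample = {"HK1": 2.0, "HK2": 4.0, "HK3": 6.0, "HK4": 8.0, "HK5": 10.0, "GENEX": 40.0}
invariant = ["HK1", "HK2", "HK3", "HK4", "HK5"]
scaled, factor = upper_quartile_scale(sample, invariant)
```

Applying the same function to every sample in every dataset puts the invariant set's upper quartile at the same value everywhere, which is what makes the scaled values comparable across datasets.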


Thanks for your reply. I am not sure I follow step 4, but using a quartile of an invariant set sounds like an interesting approach. In my case I hope to use only a handful of genes as reference.


The scaling step is used because in Oncoland you have multiple datasets and you want to be able to compare them. By scaling, you make sure that the range is the same and the expression values are comparable (e.g. 5 FPKM in one dataset means the same as 5 FPKM in another dataset). You can skip step 4 if you only have one dataset.
