Proper normalization for comparison of gene expression values between different samples
2
0
Entering edit mode
9.3 years ago
Les Ander ▴ 110

Hi,

I have a question about normalization of gene expression data.

I have RNA-seq data from 4 different cell lines. Each cell line was treated with a mock and drug1. So in total I have 8 RNA-seq samples. Cell lines 3 and 4 have a physiological response to this drug1 and cell lines 1 and 2 do not have any response. So what I am looking for is to identify genes in samples 3 and 4 that may underlie the physiological response to drug1.

This is what I have done.

  1. I aligned the reads to the genome (I do not care about splice forms or discovery of new isoforms) and used BWA MEM or it. I also did not remove potential PCR artifacts by picard because the reads were only 50bp and therefore the chances of seeing true copies that start and stop at the same site are not negligible.
  2. For each gene's exonic region, I counted the reads that fall into that region and normalized by the length of the exon (to make it reads per bp) and I divided by the total number of reads in those sample.

At this point I am not sure how best to proceed and looking for advice.

One way to proceed is for me to generate a difference in gene expression for each sample (by subtracting the expression value from drug treated sample from the mock treated sample; for example,

d_k = 1 + (e_k_drug1 - e_k_mock) / e_k_mock

where e_k_mock is the expression level of gene k from the mock sample. By doing the above normalization, the values d_k are converted into a number which can be compared relative to the expression level of gene in the mock sample. So for example, if the genes do not change their expression, then the d_k value is 1; if the genes expression doubled upon drug1 treatment, the d_k value is 2 and so forth. Does this sound like a reasonable way to transform the values?

However, the issue with the above normalization is that the expression values of a given gene between the mock and drug1 treatment may not be comparable. For example, for unknown reasons, the normalized expression of each gene is higher in drug1 treated samples, how do I correct for it? I am wondering if I should have a second level of normalization by making all the gene expression values for each sample into a 0 mean and 1 standard deviation. This way, all the expression values are now relative to the expression within the sample and the distribution of the d_k values will have a 0 mean. Any suggestion would be much appreciated.

Please note, I want to properly understand how the normalization is done and that is why I am not using the existing tools such as cuffdiff etc.

Thank you

normalization gene-expression RNA-Seq • 5.5k views
ADD COMMENT
1
Entering edit mode

I have several comments about your research:

  1. Replicates, why don't you include replicates? Cell-line 1/2 and 3/4 could be considered biological replicates?
  2. Why are not using common tools such as the "tuxedo tools" (Bowtie2/Tophat -> Cufflinks) or BWA -> HTSeq -> edgeR/DESeq?
  3. Don't use RPKMs.
  4. To compare your cell-lines, I would run each analysis separately and get a list of DEG for each one, then you can compare which genes are shared and not-shared without concerns in normalizations.
ADD REPLY
0
Entering edit mode

Thank you for the helpful answer

ADD REPLY
0
Entering edit mode
9.3 years ago

Have u tried edgeR/DESeq paired analysis ? so finally you have two cell lines ( 3 and 4) both are treated with drug1 and mock, so 4 samples in total. You could just try a simple paired analysis in edgeR and see how MDSplot looks like.

Why are you trying to come up with a new normalisation method? It still looks like a RPKM applied at Exon level.

If you are interested in exon level use DEXSeq, otherwiseedgeR/DESeq. You can read papers to understand how normalisation is done. There are many threads as well which explains the normalisation methods.

ADD COMMENT
0
Entering edit mode

Hi, Thank you for your response. After doing reads per bp of each exon, I took a mean of all the exons in the gene. In the end I have one (normalized) value for each gene, not per exon.

Can you please send me the link to a good (not overly technical) paper that talks about normalization?

ADD REPLY
0
Entering edit mode

This paper provides an overview, but if you want to recreate their approaches, you're going to have to get technical about it and look at their equations in the papers referenced in this review.

ADD REPLY
0
Entering edit mode
9.3 years ago

I think the earlier responses have provided a good answer, but I would emphasize using the gene-level summarization from either cufflinks or HTSeq. Basically, instead of looking at each exon separately (since you said "length of the exon", I thought that might be what you were doing), I think it is better to compare coverage across all regions associated with the gene.

I think FPKM/RPKM values should be an OK normalization. If there are batch effects that are randomized with respect to your biological groups, you can also factor that in. There are also tools like SVA that try and detect unknown biases, but it seems like you do not have enough samples to justify that sort of analysis.

I would definitely not do the normalization using subtraction that you are describing, and (if at all possible) I think you should follow JC's suggestion to group the cells with similar phenotypes and treat them like replicates for statistical analysis.

ADD COMMENT
0
Entering edit mode

I have several RNA-seq data(all of them have replication),after RNA-seq analysis,I want to perform met-analysis (for given a set of genesthat are up-regulated or down regulate under drug conditions), I am wondering the best way to normalize the data - just calculate RPKM values or I should perform some sort of upper normalization? If so, what is the best way to do this?

ADD REPLY
2
Entering edit mode

Even though TMM (or upper-quantile, etc) normalization is popular (and you can certainly give it a try), I've found that it typically doesn't get rid of biases present in samples with noticeably different distributions (or, it can end up exaggerating the differences when it comes to identifying genes of interest, even though the overall signal distribution may look more similar).

So, my recommendation is just to use RPKM/FPKM/TPM values. That said, it would really be best to process all samples the same way. Otherwise, I would compare each cohort separately and look at the overlapping genes, although you could also use a differential expression model where you treat the cohort like a batch.

ADD REPLY

Login before adding your answer.

Traffic: 2505 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6