Question

Gene expression analysis on gene counts from different genome builds

6.3 years ago

skylinesky ▴ 10

Hello all, I will do a differential gene expression analysis using DESeq2. However in my count data, half of the samples are aligned to reference genome using mm10 genome build, and the other half is aligned using mm9(bam files). I generated the gene counts for each sample using its respective genome build. I have merged all the count files and will run the differential expression analysis on the merged table. I am wondering whether using count files from different genome builds can influence the result.

Thanks!

RNA-Seq deseq2 gene count mm9 mm10 • 1.3k views

Yes, of course that will influence your analysis. You should realign the mm9 data to mm10, and then use the same annotation (GTF) file to produce the count files.

6.3 years ago by Benn 8.3k

Thank you for your answer. The samples are from the same experiment; two different genome builds were simply used to map the reads. So I have mm9 and mm10 count files and want to analyse them together. At first I thought that even though the genome annotation differs for half of the samples, the overall result should be the same. Maybe I can add an extra factor to control for the batch effect in my analysis.

6.3 years ago by skylinesky ▴ 10

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

This comment belongs under @h.mon's answer.

6.3 years ago by GenoMax 147k

You do have to align all reads to the same genome version, preferably mm10.

You don't need to correct for a batch effect, as there is none. I only asked because I found it odd that some samples were mapped to mm9 and others to mm10, and I reasoned that this could be because the samples were sequenced at different times as part of different experiments.

6.3 years ago by h.mon 35k

score 0 · Answer 1 · 2018-08-02

Before answering your question:

However in my count data, half of the samples are aligned to reference genome using mm10 genome build, and the other half is aligned using mm9(bam files).

Why is that? Are these different experiments that you want to analyse together? If so, you have to take batch effects into account, and depending on the experimental design, it may be impossible to untangle batch effects from your factors of interest.
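
To make the "impossible to untangle" case concrete, here is a minimal Python sketch of a design check. It assumes each sample is described by a (name, condition, batch) tuple; the function name and the toy designs are made up for illustration and are not part of DESeq2. When every batch contains only one condition, the condition is fully determined by the batch, so a model containing both terms cannot separate the two effects.

```python
from collections import defaultdict


def batch_confounded_with_condition(samples):
    """Return True when every batch contains exactly one condition.

    `samples` is a list of (sample_name, condition, batch) tuples.
    This helper is hypothetical and only illustrates the idea.
    """
    conditions_per_batch = defaultdict(set)
    for _name, condition, batch in samples:
        conditions_per_batch[batch].add(condition)
    return all(len(conds) == 1 for conds in conditions_per_batch.values())


# Made-up designs, for illustration only.
confounded_design = [
    ("s1", "control", "run1"), ("s2", "control", "run1"),
    ("s3", "treated", "run2"), ("s4", "treated", "run2"),
]
balanced_design = [
    ("s1", "control", "run1"), ("s2", "treated", "run1"),
    ("s3", "control", "run2"), ("s4", "treated", "run2"),
]

print(batch_confounded_with_condition(confounded_design))  # True
print(batch_confounded_with_condition(balanced_design))    # False
```

In the first toy design no statistical adjustment can tell "treated" apart from "run2"; in the second, each run contains both conditions, so a batch term can be estimated alongside the condition.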

Regarding your question:

The mm10 genome sequence is better (more bases and fewer errors) than mm9, and one generally gets more mapped reads when using mm10 as the reference genome.

In addition, and more importantly, the annotation has changed considerably, mostly through new genes being added in mm10, but also through gene models changing between versions, pseudogenes and incorrect annotations being dropped, and some genes / transcripts changing names.
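
One rough way to see these annotation differences is to compare the gene records of the two GTF files. The Python sketch below assumes GTF lines that contain `gene` features with `gene_id` and `gene_name` attributes (as in Ensembl/GENCODE-style files; other GTF flavours may lack them). The helper names and the toy records are invented for illustration and are not real mm9/mm10 entries.

```python
import re

# Matches GTF attribute pairs of the form: key "value"
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def gene_names_by_id(gtf_lines):
    """Map gene_id -> gene_name for 'gene' records in GTF-formatted lines."""
    genes = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        if "gene_id" in attrs:
            genes[attrs["gene_id"]] = attrs.get("gene_name", "")
    return genes


def compare_annotations(old, new):
    """Report gene IDs dropped, added, or renamed between two annotations."""
    old_ids, new_ids = set(old), set(new)
    return {
        "dropped": sorted(old_ids - new_ids),
        "added": sorted(new_ids - old_ids),
        "renamed": sorted(g for g in old_ids & new_ids if old[g] != new[g]),
    }


def toy_gene_line(gene_id, gene_name):
    """Build a made-up GTF gene record (toy coordinates)."""
    return (f"chr1\ttoy\tgene\t1\t100\t.\t+\t.\t"
            f'gene_id "{gene_id}"; gene_name "{gene_name}";')


old_gtf = [toy_gene_line("G1", "Abc1"), toy_gene_line("G2", "Pseudo2"),
           toy_gene_line("G3", "Old3")]
new_gtf = [toy_gene_line("G1", "Abc1"), toy_gene_line("G3", "New3"),
           toy_gene_line("G4", "Novel4")]

diff = compare_annotations(gene_names_by_id(old_gtf), gene_names_by_id(new_gtf))
print(diff)
```

This only compares identifiers and names; a gene whose ID and name are unchanged can still have a different gene model (exon boundaries) between versions, which also changes its counts.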

In summary, you have to map the original reads to the same reference genome to proceed with differential expression analysis.
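
As a simple safeguard before merging, one could refuse to combine count tables whose gene IDs do not match. The Python sketch below assumes each sample's counts are already loaded as a `{gene_id: count}` dictionary; `merge_counts` and the toy numbers are hypothetical. Matching gene IDs are a necessary check, not proof that the same build and GTF were used.

```python
def merge_counts(tables):
    """Merge per-sample counts into gene_id -> {sample: count}.

    `tables` maps sample name -> {gene_id: count}. Raises ValueError if
    the samples do not share exactly the same set of gene IDs.
    """
    samples = list(tables)
    reference = set(tables[samples[0]])
    for sample in samples[1:]:
        mismatch = reference ^ set(tables[sample])
        if mismatch:
            raise ValueError(
                f"{sample} differs from {samples[0]} in {len(mismatch)} "
                f"gene IDs, e.g. {sorted(mismatch)[:3]}"
            )
    return {g: {s: tables[s][g] for s in samples} for g in sorted(reference)}


# Made-up counts, for illustration only.
same_annotation = {
    "s1": {"G1": 10, "G2": 0},
    "s2": {"G1": 7, "G2": 3},
}
mixed_annotation = {
    "s1": {"G1": 10, "G2": 0},
    "s2": {"G1": 7, "G4": 3},
}

merged = merge_counts(same_annotation)
print(merged)

try:
    merge_counts(mixed_annotation)
except ValueError as err:
    print("refusing to merge:", err)
```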