Hi all, I'm very new to RNA expression analysis, and I'm a bit confused about how to proceed with my own RNAseq expression analysis, as I have two different sets of transcripts from the same species.
My situation is that we have two very different lines of the same species, and for each line we have a complete genome assembly and its own transcript annotation. I know that RNAseq reads from different individuals/conditions/samples are commonly aligned to one single reference transcriptome for expression analysis. But a single reference would result in poor alignment for some transcripts or genes, due to sequence variation or a missing annotation for a specific isoform. So I'm wondering whether I can align the RNAseq reads from each line separately to its own set of transcripts, use kallisto or RSEM to quantify expression separately, and then finally use edgeR's TMM to normalise the data for both lines? We have replicates for each individual.
I think edgeR's TMM would probably be appropriate because, as I understand it, it normalises all the samples by considering transcript/gene length, RNA composition, and sequencing depth. If I analyse only the transcripts that are homologous between the two lines, then this normalisation method could even out the bias from differing transcript/gene lengths, which is the major bias I can see so far in using two different transcriptomes. Am I correct in this case? Or did I overlook something?
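To make the normalisation question concrete, here is a minimal sketch of the trimmed-mean-of-M-values idea for a single sample against a reference sample. It is a simplification, not edgeR's implementation: the function name `simplified_tmm_factor` is hypothetical, the mean is unweighted, and the trimming fractions (0.30 on log-ratios, 0.05 on average abundance) are assumed illustrative defaults. Note that the sketch takes only read counts as input. Nothing in it uses transcript length, so whether length differences between the two annotations are handled is worth checking against the edgeR documentation.

```python
import math

def simplified_tmm_factor(obs, ref, trim_m=0.30, trim_a=0.05):
    """Simplified, unweighted TMM-style scaling factor of `obs` relative to `ref`.

    `obs` and `ref` are lists of raw counts for the same genes, in the same order.
    """
    n_obs, n_ref = sum(obs), sum(ref)
    m_vals, a_vals = [], []
    for o, r in zip(obs, ref):
        if o > 0 and r > 0:  # log-ratios are undefined for zero counts
            p_o, p_r = o / n_obs, r / n_ref
            m_vals.append(math.log2(p_o / p_r))        # M: log fold change
            a_vals.append(0.5 * math.log2(p_o * p_r))  # A: average log abundance
    n = len(m_vals)

    def kept(values, trim):
        # Indices that survive trimming `trim` of the genes from each tail.
        order = sorted(range(n), key=lambda i: values[i])
        cut = int(n * trim)
        return set(order[cut:n - cut])

    keep = kept(m_vals, trim_m) & kept(a_vals, trim_a)
    if not keep:
        raise ValueError("no genes left after trimming")
    mean_m = sum(m_vals[i] for i in keep) / len(keep)
    return 2 ** mean_m

# Toy counts (invented for illustration only).
ref = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
depth_only = [2 * x for x in ref]        # same composition, twice the depth
one_gene_up = ref[:-1] + [1000]          # one gene dominates the library

factor_depth = simplified_tmm_factor(depth_only, ref)
factor_comp = simplified_tmm_factor(one_gene_up, ref)
```

In the toy data, a pure depth difference leaves the factor at 1, while a single dominant gene shifts the factor away from 1, which is the RNA-composition effect the method is meant to capture.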
I also thought about analysing the RNAseq data by aligning the reads to the concatenated transcriptomes of both lines, as people suggest doing when conducting RNA expression analysis across multiple species. But I think this may cause more bias, as probably 80% of the transcripts will have multi-mapping reads. So I guess this should not be an option. Is that right?
I'd recommend mapping to a combined reference. RSEM and kallisto will resolve the multimapping probabilistically.
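One simple reading of "combined reference" is the concatenation of the two transcript sets mentioned in the question. A minimal sketch of that, assuming each line's transcripts are available as FASTA text: the helper names and the `lineA_`/`lineB_` prefixes are hypothetical, and they are only there so that identically named transcripts from the two lines stay distinguishable after merging.

```python
def prefix_fasta(fasta_text, prefix):
    """Return FASTA text with `prefix` prepended to every record name."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(">" + prefix + line[1:])  # rename the record
        elif line.strip():
            out.append(line)                     # keep sequence lines as-is
    return "\n".join(out) + "\n"

def combine_references(fasta_a, fasta_b, prefix_a="lineA_", prefix_b="lineB_"):
    """Concatenate two transcript sets, tagging each record with its line."""
    return prefix_fasta(fasta_a, prefix_a) + prefix_fasta(fasta_b, prefix_b)

# Toy records (invented sequences, differing by a single base).
line_a = ">tx1\nACGTACGTTAGCCGATAGGCTA\n"
line_b = ">tx1\nACGTACGTTAGCTGATAGGCTA\n"

combined = combine_references(line_a, line_b)
print(combined)
```

The combined text would then be written to a file and passed to the quantifier's indexing step.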
The problem with mapping to a single reference is erroneous mapping -- some reads that should be mapped to line B will actually be mapped to line A, because line A gives the best mapping results (since it's a line A-only reference). With a combined line A+B reference, kallisto can at least figure out whether a read should be line B-only or a line A+B multimapping. This is because of kallisto's k-mer approach: if it encounters k-mers that exist in line B but not in line A, it'll figure out that the read should map to line B only.
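The k-mer reasoning above can be illustrated with a toy example. This is only a sketch of the general idea (look up which transcripts contain each k-mer of a read, then intersect those sets), not kallisto's actual implementation. The sequences, the transcript names, and the choice of k = 5 are all invented for illustration.

```python
from collections import defaultdict

def build_index(transcripts, k):
    """Map each k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of all indexed k-mers in the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if not hits:
            continue  # in this toy version, k-mers absent from the index are skipped
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()

# Two toy transcripts that differ by a single base (C vs T).
K = 5
transcripts = {
    "lineA_tx1": "ACGTACGTTAGCCGATAGGCTA",
    "lineB_tx1": "ACGTACGTTAGCTGATAGGCTA",
}
index = build_index(transcripts, K)

shared_read = "ACGTACGTT"   # taken from the region identical in both lines
line_b_read = "TAGCTGATA"   # spans the base that is specific to line B

shared_hits = compatible_transcripts(shared_read, index, K)
line_b_hits = compatible_transcripts(line_b_read, index, K)
print(sorted(shared_hits), sorted(line_b_hits))
```

The read from the shared region stays compatible with both transcripts (an A+B multimapping), while the read covering the line B-specific base is compatible with the line B transcript only.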
Thanks, @dsull! I agree with you that a combined reference would be best. But I'm having some difficulty generating one. Should I build the combined reference by pooling all the reads and assembling a combined transcriptome with Trinity? How would you recommend generating the combined reference? I was thinking of a pan-transcriptome and have googled a bit, but I still have no clue so far. If you could suggest some papers to read, that would be great! Thanks a lot in advance!