Hi,
I have a question regarding the alignment of RNA-Seq data.
Let us consider the following two strategies:
(1) Genome aligner to transcriptome. E.g., use bowtie to align the reads to a FASTA file with all transcript sequences - the strategy used by several DE software packages: BitSeq, RSEM, MMSEQ, etc.
(2) Splice-aware aligner to genome. E.g., TopHat, STAR, etc.
Which is the better approach for aligning RNA-Seq data, (1) or (2)? Approach (2) is probably more principled, but if that's the case, what are the hazards/limitations of the first approach? I can think of a trivial case for total RNA, since the reads won't come ONLY from the transcriptome, but I am more interested in polyA+ experiments, where most reads should come from the transcriptome.
I am looking for practical experience with the data as well as any theoretical consideration.
Do you care about novel splice variants or new transcripts that may be represented? If the answer is yes, then you need a splice-aware aligner. If no, then it still depends on how reliable you believe your reference transcriptome is.
@Chris, I am using the mouse transcriptome (Ensembl, RefSeq, GENCODE, any transcriptome one can think of); as for its reliability, I have the feeling it is still a matter under debate... I do not care about novel splice variants, but I am trying to assess the pitfalls (or lack of pitfalls) of approach (1) vs (2). One way to do it is to try both, then intersect the genomic alignments with the transcriptome GTF and see if the alignments agree... To my surprise, if I do that for short single-end data the agreement is quite weak. These are preliminary results though... maybe someone has done a more extensive study...
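For anyone who wants to repeat that comparison, here is a minimal sketch of the bookkeeping, assuming you have already reduced each strategy to a read ID -> set of gene IDs table (for (1) via a transcript-to-gene map, for (2) by overlapping the genomic alignments with the GTF). The function name, the category labels and the toy IDs are made up for illustration; they are not from any tool mentioned here.

```python
from collections import Counter

def compare_assignments(tx_genes, genome_genes):
    """Classify each read by how its gene assignments agree between strategies.

    tx_genes:     read ID -> set of gene IDs from the transcriptome alignment
    genome_genes: read ID -> set of gene IDs from the genome alignment + GTF overlap
    (both inputs are assumed to be prepared beforehand)
    """
    summary = Counter()
    for read_id in set(tx_genes) | set(genome_genes):
        a = tx_genes.get(read_id, set())
        b = genome_genes.get(read_id, set())
        if not a and not b:
            summary["unassigned"] += 1
        elif not a:
            summary["genome_only"] += 1
        elif not b:
            summary["transcriptome_only"] += 1
        elif a == b:
            summary["identical"] += 1
        elif a & b:
            summary["partial_overlap"] += 1   # share at least one gene
        else:
            summary["conflict"] += 1          # no gene in common
    return summary

# Toy input with hypothetical read and gene IDs
tx = {"r1": {"g1"}, "r2": {"g1", "g2"}, "r3": {"g3"}, "r4": {"g4"}}
gn = {"r1": {"g1"}, "r2": {"g1"}, "r3": {"g5"}, "r5": {"g6"}}
print(dict(compare_assignments(tx, gn)))
```

Splitting the disagreement into "genome_only", "partial_overlap" and "conflict" is just one way to see where a weak agreement comes from (intronic/intergenic reads vs. multi-mapping vs. genuinely different placements).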
I am not sure whether bowtie will be able to align a read from a transcript that joins two non-adjacent exons to a transcriptome FASTA in which you have concatenated the adjacent exons for mapping purposes. I may be wrong on this.
@ashutosmits well, if it's not in the transcriptome GTF you are absolutely right, but in principle the GTF contains all the alternative variants...
OK, got it. I didn't read that you will be using tools like RSEM. I thought you would do it yourself from scratch.
With approach 1, you are going to have multiple alignments to different transcripts for a read that aligns to a shared exon. You'll need to have a plan to deal with that situation.
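To get a feel for how big this is in a given dataset, a rough sketch like the one below counts how many reads are ambiguous at the transcript level and how many stay ambiguous after collapsing to genes. It assumes the alignments have already been parsed into (read ID, transcript ID) pairs and that a transcript-to-gene map is available; the function name and the toy IDs are hypothetical.

```python
from collections import defaultdict

def ambiguity_summary(alignments, tx2gene):
    """Count reads hitting more than one transcript and more than one gene.

    alignments: iterable of (read_id, transcript_id) pairs
    tx2gene:    transcript ID -> gene ID
    """
    tx_hits = defaultdict(set)
    for read_id, tx_id in alignments:
        tx_hits[read_id].add(tx_id)

    multi_tx = 0
    multi_gene = 0
    for txs in tx_hits.values():
        genes = {tx2gene[t] for t in txs}
        if len(txs) > 1:
            multi_tx += 1
        if len(genes) > 1:
            multi_gene += 1
    return {
        "reads": len(tx_hits),
        "multi_transcript": multi_tx,
        "multi_gene": multi_gene,
    }

# Toy example: t1 and t2 are isoforms of gene gA, t3 belongs to gB
alignments = [("r1", "t1"), ("r1", "t2"), ("r2", "t1"), ("r3", "t2"), ("r3", "t3")]
tx2gene = {"t1": "gA", "t2": "gA", "t3": "gB"}
print(ambiguity_summary(alignments, tx2gene))
```

Reads that are ambiguous only between isoforms of the same gene are harmless for gene-level counts; the ones that span several genes are where a plan is really needed.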
@sean thanks, good point, but I plan to use the alignments for transcript/gene expression estimation with one of the many available tools such as BitSeq, eXpress, RSEM, etc. My concern is forcing the alignments into "known" annotations and what the effect of that is.
That's exactly the problem, as @Sean Davis mentioned. You will map one read multiple times just by using a non-spliced aligner. The number of mappable reads may therefore drop by some amount (I can't say by what order of magnitude), and you artificially change the expression of some genes. Especially if all transcript variants are present.
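To illustrate how the handling of multi-mapped reads shifts the estimates, here is a toy sketch comparing naive counting (every alignment counts as a full read) with a very simplified EM that splits each read among its compatible transcripts. This is only an illustration under strong assumptions (no transcript lengths, no alignment qualities, no fragment model) and is not the actual algorithm of any tool named in this thread; the function names are made up.

```python
from collections import Counter

def naive_fractions(read_hits):
    """Count every alignment as one read, then normalise to fractions."""
    counts = Counter(t for hits in read_hits for t in hits)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def em_fractions(read_hits, n_iter=50):
    """Toy EM: split each read among its transcripts in proportion to
    the current abundance estimates (transcript length is ignored)."""
    transcripts = sorted(set().union(*read_hits))
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    n_reads = len(read_hits)
    for _ in range(n_iter):
        counts = {t: 0.0 for t in transcripts}
        for hits in read_hits:
            total = sum(theta[t] for t in hits)
            for t in hits:
                counts[t] += theta[t] / total   # E-step: fractional assignment
        theta = {t: counts[t] / n_reads for t in transcripts}  # M-step
    return theta

# Toy data: 10 reads unique to t1, 10 reads from an exon shared by t1 and t2
read_hits = [{"t1"}] * 10 + [{"t1", "t2"}] * 10
print(naive_fractions(read_hits))
print(em_fractions(read_hits))
```

In this toy case naive counting credits t2 with a third of the signal even though no read supports it uniquely, while the EM pushes nearly everything to t1. Which answer is "right" depends on the biology, which is exactly why the result is sensitive to which transcript variants are present in the reference.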