Hi, I'm wondering what exactly is the meaning of an aligner being "splice aware". I know it has to do with the mapping of reads spanning splice junctions, but as someone pretty new to RNA-seq and molecular biology, that's not quite enough for me to grasp the concept.
The following is my best reasoning about its meaning. RNA-seq reads are derived from mature mRNA, so there are typically no introns in the sequence. But aligners map reads against a reference genome, so a read may span what are, in the actual transcript, two adjacent exons, while the reference has the first exon followed by an intron. The aligner would then find a matching sequence for only the part of the read that lies in one exon; the rest of the read would not match the intron in the reference, so the read can't be properly aligned. A splice-aware aligner would know not to try to align RNA-seq reads to introns, and would somehow identify possible downstream exons and try to align to those instead, ignoring introns altogether.
Is this anywhere close to the meaning of splice-aware? And if so, would a splice-unaware aligner properly align RNA-seq data, given a reference transcriptome?
To extend this...
Splice-aware aligners are not necessary when aligning to a transcriptome, only when aligning to a genome. A "splice-unaware" aligner will do a perfectly fine job of aligning to a transcriptome, with one caveat:
Transcriptomes of organisms with alternative splicing (basically, Eukaryota) are both incomplete (since not all transcripts have been identified) and highly redundant (since genes have multiple isoforms that share sequence). Both of these cause problems with all aligners. I call it only one caveat, though, because splice-aware aligners run into the same problems when given a transcriptome.
If you align to a genome, which I always recommend, splice-aware aligners are required. The main advantage of aligning to a transcriptome is speed; genome alignment is much more scientifically valuable, as it starts with fewer assumptions.
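To make the difference concrete, here is a toy sketch in Python. It is not how BBMap or any real aligner works (real tools use indexes, scoring, mismatch tolerance, and splice-site motifs); the sequences, the function names, and the `min_anchor` and `max_intron` parameters are all made up for illustration. It only shows why a read that spans an exon-exon junction cannot be placed contiguously on the genome, and how allowing one long gap fixes that.

```python
def find_unspliced(genome, read):
    """Splice-unaware search: the read must match one contiguous stretch."""
    pos = genome.find(read)
    return None if pos == -1 else pos


def find_spliced(genome, read, min_anchor=4, max_intron=1000):
    """Toy splice-aware search (exact matches only).

    Splits the read into a left and right part at every possible position
    and looks for the two parts separated by a gap of 1..max_intron bases.
    Returns (left_start, intron_start, intron_end) or None.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_pos = genome.find(left)
        while left_pos != -1:
            left_end = left_pos + split
            # The right part must start after a gap and within max_intron bases.
            search_end = left_end + max_intron + len(right)
            right_pos = genome.find(right, left_end + 1, search_end)
            if right_pos != -1:
                return left_pos, left_end, right_pos
            left_pos = genome.find(left, left_pos + 1)
    return None


# Hypothetical two-exon gene; the intron is present in the genome only.
exon1 = "ATGGCCAAGT"
intron = "GTAAGTCCCTTTTCAG"
exon2 = "CGTTGACCTA"
genome = exon1 + intron + exon2

# A read from the mature mRNA that crosses the exon1/exon2 junction.
read = exon1[-6:] + exon2[:6]

unspliced_hit = find_unspliced(genome, read)
spliced_hit = find_spliced(genome, read)
print("unspliced:", unspliced_hit)
print("spliced:", spliced_hit)
```

The contiguous search finds nothing, while the gapped search places the two halves of the read on either side of the intron. Against a transcriptome the same read would match contiguously, which is why splice awareness only matters for genome alignment.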
Note that I say this as someone who has developed a high-speed tool for quantifying transcript expression (Seal). It is probably 100x faster than BBMap (a splice-aware aligner) in most cases, and it does a very good job at quantifying expression differences. But it presumes that your transcriptome is accurate, which it never is. Essentially, it forces your data into a mold that you know is wrong, while BBMap would actually allow you to discover new things, assuming that the genome is correct. Genomes are far more complete and accurate than transcriptomes.
If all you want to know is whether gene A or B is more upregulated in your experiment, then mapping to a transcriptome using any aligner is fine... but you could accomplish the same thing faster and probably more accurately using a kmer-matching tool like Seal. However, if you want to seriously study what is going on and care about differential splicing, you need to map to the full genome using a splice-aware aligner.
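For a rough idea of what kmer matching means, here is a minimal sketch. It is not Seal's actual algorithm; the function names, the choice of `k`, the tie handling, and the sequences are all assumptions made for illustration. Each read is assigned to the reference transcript it shares the most kmers with, and a read from a transcript that is missing from the reference simply has nowhere to go.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def assign_reads(reads, transcripts, k=5):
    """Count reads per transcript by number of shared kmers (toy version)."""
    index = {name: kmers(seq, k) for name, seq in transcripts.items()}
    counts = Counter()
    for read in reads:
        read_kmers = kmers(read, k)
        hits = {name: len(read_kmers & ref) for name, ref in index.items()}
        best = max(hits.values(), default=0)
        if best == 0:
            counts["unassigned"] += 1
            continue
        winners = [name for name, n in hits.items() if n == best]
        # Reads tied between transcripts (e.g. shared exons) are not forced.
        counts[winners[0] if len(winners) == 1 else "ambiguous"] += 1
    return counts


# Hypothetical reference transcriptome and reads.
transcripts = {
    "geneA": "ATGGCCAAGTCGTTGACCTA",
    "geneB": "TTTACGGGATCCGATAGCAT",
}
reads = [
    "CCAAGTCGTTGA",  # from geneA
    "GGATCCGATA",    # from geneB
    "CCCCCCCCCC",    # from something absent from the reference
]

counts = assign_reads(reads, transcripts)
print(dict(counts))
```

The third read illustrates the limitation described above: anything not represented in the transcriptome is lost or, with a more permissive scheme, forced onto the wrong transcript, whereas a genome alignment would still place it.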