I ran HISAT2 (index built using a transcriptome multi fasta) intending that it won't perform gapped alignment. I use following script to run HISAT:
INDEX=./indices/hisat/transcriptome
FASTQ=$1
OUTPUT=./transcriptome_aligned/$2.sam
./software/hisat-0.1.6-beta/hisat \
-q \
-p 2 \
--no-spliced-alignment \
--end-to-end \
-x $INDEX \
-U $FASTQ \
-S $OUTPUT
Should I still expect gapped alignment in my SAM file? I have records like this in the SAM output.
SRR2144041.255 0 YCL025C 274 255 16M1I33M * 0 0 CAGGCTCAAGAACTAGAAAAAAAATGAAAGTTCGGACAACATAGGCGCTA CCCFFFFFHHHHHJJJJJJJJJJJJJJJIJGIIIIJJJJJIJJIIJJJHH AS:i:-8 XN:i:0 XM:i:0 XO:i:1 XG:i:1 NM:i:1 MD:Z:49 YT:Z:UU NH:i:1
This shows that HISAT2 is still performing gapped alignment even with --end-to-end
and --no-splice-alignment
parameters.
I'm trying to use the output SAM for rsem-calculate-expression
but it returns following error due to presence of gapped alignment:
rsem-parse-alignments ./indices/rsem/rsem ./rsem_output/sample.temp/sample ./rsem_output/sample.stat/sample ./transcriptome_aligned/sample.bam 1 -tag XM
Read SRR2144041.836747: RSEM currently does not support gapped alignments, sorry!
"rsem-parse-alignments ./indices/rsem/rsem ./rsem_output/sample.temp/sample ./rsem_output/sample.stat/sample ./transcriptome_aligned/sample.bam 1 -tag XM" failed! Plase check if you provide correct parameters/options for the pipeline!
How do I make sure that HISAT2 doesn't perform gapped alignment? Should I filter the output for using grep -v XO:i:0
?
EDIT:
I checked RSEM manual and found that in order to avoid gapped alignments using Bowtie2, RSEM uses following Bowtie2 parameters:
--sensitive --dpad 0 --gbar 99999999 --mp 1,1 --np 1 --score-min L,0,-0.1
I wonder what is the equivalent of --gbar
in HISAT2
Thanks
Thanks for your reply Istvan.
But RSEM is one of the revered tools for isoform quantification. If exclusion of gapped alignment is such a serious issue then, why has it not been highlighted in any literature for gene quantification.
In fact newer tools like Kalliso used RSEM for most of benchmarking.
While I probably would not go so far is to claim that RSEM is obsolete (i.e. that it should no longer be used), I would note that quite a few tools have incorporated gapped alignments, and have benefitted as a result. Some of them even compare directly against RSEM and demonstrate marked improvement (see e.g. TIGAR — it gets some boost by considering an improved alignment model allowing gaps, and an even bigger boost by adopting the Variational Bayesian EM algorithm). The default alignment arguments for RSEM also discards orphaned alignments, though you probably wouldn't use that, alone, as an argument that it's the right thing to discard orphaned reads, right?
I've never realized that RSEM cannot handle gapped alignments and I say this with no disrespect - but a tool that cannot use make use of gapped alignment can't be all that accurate.
This has played out in the past the same way. Remember how bowtie1 was superfast and brought so much excitement into the field but would not perform gapped alignments. Try publishing anything with this tool today you shouldn't be able to, today we understand much better just how much variation is in the genome.
Well, I certainly agree that it is throwing out useful information. I was similarly surprised by the fact that it discards orphaned alignments. I think that, in part, other difficulties inherent in accurately estimating transcript-level abundance may dominate the negative performance impact RSEM incurs by ignoring gapped and orphaned alignments. However, as we continue to get a better handle on RNA-seq data, develop more accurate modes and better optimization algorithms, incorporating this information will become increasingly important (e.g. the alignment-based mode of Salmon accounts for such alignments rather than discards them, and we find that this does improve quantification accuracy, at least under certain important metrics).