Entering edit mode
7.6 years ago
Tobias
▴
150
I have been running HISAT2 for some of my FASTQ files via
hisat2 -x genome --sp 1000,1000 -U /home/a.fastq.gz -S /home/output.sam
It worked and I got out:
16419826 reads; of these:
16419826 (100.00%) were unpaired; of these:
1719313 (10.47%) aligned 0 times
9510754 (57.92%) aligned exactly 1 time
5189759 (31.61%) aligned >1 times
89.53% overall alignment rate
However, when I looked into the data I found, that it was indeed 16419826 reads and 1719313 were not aligned, but it was 12448816 reads aligning exactly 1 time and 2251697 reads aligning more than 1 time.
Hence, the number I got out while looking at the data were a bit different.
How is that explicable? Did I do something wrong?
Many thanks for your help in advance!
How did you determine "aligned only once"? My guess is that this corresponds to MAPQ > 3 or so in hisat2.
Hi Devon, many thanks for question! Well, I just counted the read IDs appearing just once!
Give the counting by MAPQ a go, though make sure to exclude secondary alignments :)
But if I would be doing that I would obtain more not aligned reads (which would not make much sense, because that number is a perfect match) and I would obtain less multiple aligned reads. But since I already have fewer multiple aligned reads in my analysis, the impairment would get stronger!
In your original post you asked why you were getting different numbers than expected given what hisat2 output. I gave a possible reason (namely, that you're using different definitions for a multimapper). Whether you prefer one definition or another or whether there are downstream consequences of that is wholly immaterial.
I agree. All I wanted to say is that if they would have been using MAPQ>3 for the statistics, they would have been observing less multiple aligned reads than I did by simply counting them, and not more.
No, it would have found more multimappers and fewer "unique" mappers, which is indeed what it found. The presumption is that multimappers aren't always discernible by having multiple entries, but rather by MAPQ (this is essentially how it works in bowtie2 and since the code bases are related...).
Okay great, do you know some code (in R or by using some bash script) with which I could test that out?
samtools view -F 256 foo.bam | cut -f 5 | sort | uniq -c
will give you the number of primary alignments per MAPQ. My guess is that some combination of the lower MAPQs will at least be closer to the numbers output by hisat2.