Entering edit mode
6.9 years ago
JJ
▴
730
Hi all,
So I am working with a public dataset and I am a bit worried about the high number of secondary alignment I get. Here is the samtools flagstat output. I usually get a much lower number. What could be the cause of such a high number of secondary alignments? What number is still acceptable? Do I need to dump this dataset? This is human mRNA. I used HISAT2 with default settings.
53638643 + 0 in total (QC-passed reads + QC-failed reads)
27516323 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
50786919 + 0 mapped (94.68% : N/A)
26122320 + 0 paired in sequencing
13061160 + 0 read1
13061160 + 0 read2
21125408 + 0 properly paired (80.87% : N/A)
22335136 + 0 with itself and mate mapped
935460 + 0 singletons (3.58% : N/A)
213596 + 0 with mate mapped to a different chr
131241 + 0 with mate mapped to a different chr (mapQ>=5)
What is the reference where you mapped these RNASeq reads on?
Current Gencode release - human primary assembly and corresponding gtf were used to build the index. I also mapped it using the standard index provided by HISAT2 - almost the same result.
Anyway, with the default
--score-min
,--mp
and gap opening / extending penalties you allow for quite some mismatches / gaps in your read mapping. This means that you likely get secondary alignments on the second copies of those genes, or on similar regions. Try to increase the penalties or to lower the accepted alignment score. If only the number of reads with secondary alignments goes down, then you're good, if also the "mapped" goes down then you're being too strict. This might be a workaround to test this.So I set the
--score-min
to--score-min L,0,-0.6
and I get now this:Better. Are there any guidelines for this parameter? Thanks a lot, Is there any documentation on this parameter apart from what is stated in the manual?
I think the documentation is made on discussions on this forum :) What I can only suggest you is to spend some time on this parameter every time you map reads. Multiply the read length by the third number (-0.6) and then subtract the second number (0): this way you get the maximum penalty. For reads of 125 nt, if I set it to
L,2,-0.5
I will get-((125*0.5)+2) = -77
. If my mismatch penalty is--mp 7,7
I will get at most 11 mismatches. If my gap penalty is--rdg 7,5
I will get a gap of most length 14 (77-7 opening, 70/5 for extension).