Hello,
I basecall the pod5 files generated by the sequencer with dorado and align with minimap2 and in the BAM file I get the exact same read with the same ID and all more than once. Is this normal? If it is, I suppose I have to remove the duplicates right?
Thank you!
If you have short(er) reads you could be seeing secondary alignments. Which would be a normal thing. Check the SAM tags
I don't think I have seen what @marco.barr is referring to below.
I had the same problem in this iusse https://github.com/nanoporetech/dorado/issues/603 and I solved it as they say here
Thanks for the link. We never use "pod5_fail" folders when re-basecalling so we did not see this issue. Current version of MinKNOW now makes a single pod5 folder.
Since original question is about alignments following
minimap2
showing duplicate ID, it may likely be due to secondary alignments.it's also true that now I only have one pod5 folder, thanks anyway for clarifying the focus of the issue. Yes, they are probably secondary alignments
I think this is the answer but I can't find in the tags info about primary and secondary alignments. I've only found in the flag that the second entry for each repeated ID corresponds to a supplementary alignment, but from what I've seen that's not the same a what you are saying.
How to remove duplicate reads after ONT basecalling from fastq.gz files you could try this
If you only have supplementary alignments then they may be caused by reasons mentioned here --> What are Supplementary Reads? I did not mention them originally since they are not main reason for dup id's.
If the alignments are valid you can't simply ignore them as being from duplicate ID's.
I've looked into it and one of the reads with a supplementary alignment has the representative alignment aligning to the + strand and being 1207 bp long and the supplementary aligning to the - strand being 130 bp, but they are overlapping so this is not the indel thing I think
Supplementary Alignments
chr4:433,255-434,368 (-) = 1,115bp @MAPQ 60 NM132 (representative)
chr4:434,228-434,367 (+) = 140bp @MAPQ 33 NM20 (supplemantary)
Here is a screenshot of both reads in IGV, the representative in red and supplemetary in blue.
I've found this thread but they are talking about bwa and I don't think they got anywhere.