7 months ago
quentinperriere
Hi,
For long-read technologies like Oxford Nanopore, do I need to remove duplicates after aligning with minimap2, or should I keep them?
(The BAM files generated with minimap2 are used by freebayes to detect variants.)
I guess the enlightening question is: duplicates of what?
PCR duplicates and/or read duplicates. Should I use the command below to remove them, or do we not talk about duplicates when dealing with ONT? Sorry, I'm lost; this is a new notion for me.

samtools markdup -r -@ [number_of_thread] [input_sorted_bam] [output_dedup_bam]
With ONT you will only have PCR duplicates. 'Read' duplicates (do you mean optical?) are an artifact of cluster-based sequencing (i.e. Illumina), and you won't have them with ONT.
Whether or not to remove duplicates at the read level (fastq) or alignment level (bam) depends on what you're trying to do, and how the library was constructed.
As above, you do not need to worry about duplicates for ONT data
Also, have you looked at other variant-calling tools better suited to ONT data, such as medaka and longshot?
I could imagine an ONT scenario where you'd want to remove duplicates (amplicon-seq, etc.), but for variant calling there is likely no need, especially if the library was PCR-free or low-cycle PCR, as is common with ONT.
If the 'freebayes' program requires a duplicate flag be present, it might not be necessary to perform duplicate marking but just add the flag manually/synthetically ...depends on the library and what you expect.
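To make "adding the flag manually" a bit more concrete: in the SAM format, duplicate status is a single bit (0x400) in the FLAG column, so marking or unmarking a record just means setting or clearing that bit. Below is a minimal Python sketch on a made-up SAM record. The function name `set_duplicate` and the example record are hypothetical, and this says nothing about whether freebayes actually needs the flag.

```python
DUPLICATE_BIT = 0x400  # SAM FLAG bit meaning "PCR or optical duplicate"


def set_duplicate(sam_line: str, is_duplicate: bool) -> str:
    """Return the SAM record with the duplicate bit set or cleared.

    Hypothetical helper for illustration; works on one tab-separated
    SAM alignment line (not on header lines starting with '@').
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # FLAG is the second column of a SAM record
    if is_duplicate:
        flag |= DUPLICATE_BIT   # switch the duplicate bit on
    else:
        flag &= ~DUPLICATE_BIT  # switch the duplicate bit off
    fields[1] = str(flag)
    return "\t".join(fields)


# Made-up record: reverse-strand read (FLAG 16), not marked as duplicate.
record = "read1\t16\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF"
marked = set_duplicate(record, True)
print(marked.split("\t")[1])
```

In practice you would let a tool such as samtools markdup do this on the BAM file; the sketch only shows what the flag is.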
Thank you for responding. I was looking for a suitable variant-calling tool. I'm working on a fungus and I don't have fast5 files. Could you recommend suitable tools for this situation, please?
I'd try those OP samuel.a.odonnell recommends
My guess is you're new to ONT. fast5 is the 'raw data', i.e. the current measured over time. Other than for generating the fastq files, these are not important for most people. You can re-generate fastq files from a bam file.
Regarding the technicalities of duplicate removal (not that it's likely needed here): depending on your experiment and library, you might want to remove duplicates at the level of the fastq file or of the bam file. For example, if you have amplicon sequencing, want to be very strict about unique reads, and have a UMI, you would remove duplicates at the level of the fastq file. If you lack a UMI and used PCR when creating the library, it might make more sense to remove reads at the level of the bam/alignment. This is because two reads may align identically across a span of the genome but differ slightly due to PCR errors. Because they align to an identical spot they can be assumed to be duplicates, but if you tried to remove them at the level of the fastq they would count as unique reads, since they differ by a SNP/indel.
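Here is a small Python sketch of that last point, using made-up toy reads rather than real fastq/bam parsing. The tuple layout and the function names `dedup_by_sequence` and `dedup_by_position` are assumptions for illustration only. Read r3 sits at the same locus as r1 but carries one simulated PCR error, so sequence-level deduplication keeps it while position-level deduplication drops it.

```python
# Toy reads: (name, sequence, chrom, start, end, strand). All values are made up.
reads = [
    ("r1", "ACGTACGTAC", "chr1", 100, 110, "+"),
    ("r2", "ACGTACGTAC", "chr1", 100, 110, "+"),  # exact copy of r1
    ("r3", "ACGTACGAAC", "chr1", 100, 110, "+"),  # same locus, one PCR error
    ("r4", "TTGACCGTAA", "chr1", 500, 510, "-"),  # unrelated read
]


def dedup_by_sequence(reads):
    """fastq-level idea: keep the first read seen for each exact sequence."""
    seen, kept = set(), []
    for read in reads:
        sequence = read[1]
        if sequence not in seen:
            seen.add(sequence)
            kept.append(read)
    return kept


def dedup_by_position(reads):
    """bam-level idea: keep the first read seen for each alignment position."""
    seen, kept = set(), []
    for read in reads:
        position = read[2:6]  # (chrom, start, end, strand)
        if position not in seen:
            seen.add(position)
            kept.append(read)
    return kept


by_sequence = [r[0] for r in dedup_by_sequence(reads)]
by_position = [r[0] for r in dedup_by_position(reads)]
print(by_sequence)  # ['r1', 'r3', 'r4']
print(by_position)  # ['r1', 'r4']
```

Real tools use more careful criteria than an exact coordinate match, so treat this only as a picture of why the two levels can give different answers.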