Hi,
I have a BAM file containing paired-end reads. This BAM file only stores the raw output of the sequencing run, i.e. the reads are unaligned and it contains no mapping information.
When I convert this BAM file to a FASTQ file I can see that in some cases there are reads with duplicated names (for about 1% of the reads). For example:
@HS34_15849:1:1101:1065:15188#26/1
@HS34_15849:1:1101:1065:15188#26/2
@HS34_15849:1:1101:1065:15188#26/2
If I use bamToFastq (from bedtools) to generate two FASTQ files (one for each mate of a pair), all these "duplications" are removed. Apparently, bamToFastq keeps only one read for each duplicated name, seemingly at random. It then raises a warning that the next read has no mate.
Is it normal to have this kind of read name duplication? What would be the best way to handle these duplicates?
Many thanks,
Federico
How did you make the BAM file? I suspect that the origin of this issue is there.
Did you check to see if the reads with duplicated names are identical?
H.mon: No, they are different. In fact, in these cases one of the redundant reads is usually much shorter than the typical read length.
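For anyone wanting to run the same check, here is a minimal sketch in plain Python. It assumes a simple FASTQ with exactly four lines per record; the function name `duplicated_reads` and the sequences in the example are made up for illustration, only the read names come from this thread.

```python
from collections import defaultdict


def duplicated_reads(fastq_lines):
    """Return {read_name: [sequences]} for every name seen more than once."""
    seqs_by_name = defaultdict(list)
    lines = [line.rstrip("\n") for line in fastq_lines]
    # Assumption: strict 4-line records (header, sequence, '+', qualities).
    for i in range(0, len(lines) - 3, 4):
        name = lines[i][1:].split()[0]  # drop the leading '@' and any comment
        seqs_by_name[name].append(lines[i + 1])
    return {name: seqs for name, seqs in seqs_by_name.items() if len(seqs) > 1}


# Hypothetical records: the names are from the question, the bases are invented.
example = [
    "@HS34_15849:1:1101:1065:15188#26/1", "ACGTACGTAC", "+", "IIIIIIIIII",
    "@HS34_15849:1:1101:1065:15188#26/2", "TTGCATTGCA", "+", "IIIIIIIIII",
    "@HS34_15849:1:1101:1065:15188#26/2", "TTGC", "+", "IIII",
]

dups = duplicated_reads(example)
for name, seqs in dups.items():
    identical = len(set(seqs)) == 1
    print(name, "copies:", len(seqs), "identical:", identical,
          "lengths:", [len(s) for s in seqs])
```

On a real file, passing an open file handle instead of the `example` list should work the same way, although this version reads everything into memory and so may not suit very large FASTQ files.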
Devon: I have asked how these BAM files were generated from the sequencing. Waiting for an answer.
Could this perhaps be a case where a read was mapped twice (e.g. by BWA-MEM)? I'm not sure whether converting a BAM with multi-mapped reads to FASTQ format would cause this, but it might.
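If the BAM did turn out to contain alignments, extra records for the same read would normally be marked as secondary (flag bit 0x100) or supplementary (flag bit 0x800) in the SAM FLAG field. The sketch below only shows how those bits could be tested on a FLAG value; the helper name `is_primary` is hypothetical, and it says nothing about how the file in the question was actually produced.

```python
# Flag bits as defined in the SAM specification.
SECONDARY = 0x100      # secondary alignment
SUPPLEMENTARY = 0x800  # supplementary alignment


def is_primary(flag):
    """True if a record is neither a secondary nor a supplementary alignment."""
    return (flag & (SECONDARY | SUPPLEMENTARY)) == 0


# 77 and 141 are the usual flags for an unaligned read 1 / read 2 pair.
for flag in (77, 141, 77 | SECONDARY, 141 | SUPPLEMENTARY):
    print(flag, "primary" if is_primary(flag) else "secondary/supplementary")
```

Keeping only the records for which `is_primary` is true before converting to FASTQ would be one way to avoid repeated names from this particular cause.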