Hi everyone,
I have mapped some PacBio long reads onto an assembled genome and I'm interested in finding out reads that span (encompass) some regions of interest.
To do that, I'm using the following command, which should return reads that cover my ROI coordinates in a .bed
file.
bedtools intersect -abam mappedReads.bam -b ROI.bed -F 1.0 -c -bed > ROIreads.bed
What I noticed, however, is that the output always double counts what appears to be a single read; something like this:
contig1 876600 923835 m64269e_211012_003412/93128595/54266_87870 60 + 869186 924692 0,0,0 1 3156, 0,
contig1 876600 923835 m64269e_211012_003412/93128595/54266_87870 60 + 869186 924692 0,0,0 1 3156, 0,
What might be the possible reasons for this output?
One potential source of this problem that I can think of is that I have converted my PacBio subreads data (which comes in .bam
and .xml
files using samtools
instead of the BAM2fastx
, which means that it does not reference the native .xml
file from the sequencing run. I did not use the recommended BAM2fastx
as it was reporting that I have a missing file; but that is a separate issue. However, if this is indeed the problem, please let me know.
Thank you all!
Update: The double counting phenomenon is indeed due to an error with the conversion using
samtools
. Counting withBAM2fastx
resolves this issue completely. The.xml
file is also unnecessary for the conversion.