Double counts with bedtools intersect?
1
0
Entering edit mode
10 months ago
Zeng Hao ▴ 40

Hi everyone,

I have mapped some PacBio long reads onto an assembled genome and I'm interested in finding out reads that span (encompass) some regions of interest.

To do that, I'm using the following command, which should return reads that cover my ROI coordinates in a .bed file.

bedtools intersect -abam mappedReads.bam -b ROI.bed -F 1.0 -c -bed > ROIreads.bed

What I noticed, however, is that the output always double counts what appears to be a single read; something like this:

contig1 876600 923835 m64269e_211012_003412/93128595/54266_87870 60 + 869186 924692 0,0,0 1 3156, 0,
contig1 876600 923835 m64269e_211012_003412/93128595/54266_87870 60 + 869186 924692 0,0,0 1 3156, 0,

What might be the possible reasons for this output?

One potential source of this problem that I can think of is that I have converted my PacBio subreads data (which comes in .bam and .xml files using samtools instead of the BAM2fastx, which means that it does not reference the native .xml file from the sequencing run. I did not use the recommended BAM2fastx as it was reporting that I have a missing file; but that is a separate issue. However, if this is indeed the problem, please let me know.

Thank you all!

bedtools mapping alignment • 729 views
ADD COMMENT
1
Entering edit mode

Update: The double counting phenomenon is indeed due to an error with the conversion using samtools. Counting with BAM2fastx resolves this issue completely. The .xml file is also unnecessary for the conversion.

ADD REPLY
2
Entering edit mode
10 months ago

Not sure, but hopefully this bedmap/bam2bed combination will help give the answer you're looking for:

bedmap --echo --skip-unmapped --fraction-ref 1 <(bam2bed -R < reads.bam) roi.bed > answer.bed

The bam2bed tool uses samtools to turn the BAM into BED. If you get double-counts via this method, also, then there may be an issue with conversion via samtools, so at the very least it may be a useful diagnostic.

You can also pass a region and a BAM file into samtools to extract reads which overlap the singular region by one or more bases. This is not the 100% threshold you are after, but you might investigate the output to see if you get double reads, so this may be yet another approach to test the fidelity of read conversion via samtools.

ADD COMMENT
1
Entering edit mode

Hi Alex, many thanks for the help. I've ran the command you suggested and the output is still producing double-counts. It does seems likely that something indeed went wrong with the samtools conversion and I should look into that instead.

ADD REPLY

Login before adding your answer.

Traffic: 1553 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6