I have a set of BAM files from a bwa alignment, as well as a BED file of "target regions" I am interested in.
What I want to know is the count of the reads in the BAM file which overlap with any of the region in the BED file. Essentially, each read in the BAM will have an associated TRUE or FALSE denoting if it overlaps with any of the BED regions.
I can't seem to find the exact tool to use within bedtools, since bedtools multicov
returns the total reads which overlap within each BED region, which ends up overcounting since an aligned BAM read can overlap with multiple BED regions.
bedtools intersect -a BAM -b BED -u
technically gets me there, but in this case it generates a new bam file with only the reads in the BED regions, which I then have to count with samtools view -c
. Since the BAM files are large, this is time-consuming and I was wondering if there is a faster way to do this.
See the answers in this thread: Extract Reads From A Bam File That Fall Within A Given Region (
samtools view -L BED_file
accepts a BED file of regions also).Not sure if that would need a custom solution.
hum... what are you trying to do at the end ?
I probably should have clarified the purpose of this, which might make it easier to understand.
We have taken genomic DNA and isolated certain regions out (essentially exons). We then took this DNA enriched for exons and sequenced it.
My question is: how many of our sequencing reads fall within these exon regions? In other words, how efficient is the enrichment for these exon regions?
So what I want is a single number for each BAM file which states how many of the mapped reads overlapped with an exon (which I have as a BED file).
Currently the way I am doing this is:
bedtools intersect -a BAM -b BED -u | samtools view -c
Which works but is very slow.
For reference:
From bedtools intersect documentation, the
-u
flag will:"Write original A entry once if any overlaps found in B. In other words, just report the fact at least one overlap was found in B."