Question

Find count of reads in BAM file that fall within BED region

3

Entering edit mode

3.6 years ago

KH ▴ 90

I have a set of BAM files from a bwa alignment, as well as a BED file of "target regions" I am interested in.

What I want to know is the count of the reads in the BAM file which overlap with any of the region in the BED file. Essentially, each read in the BAM will have an associated TRUE or FALSE denoting if it overlaps with any of the BED regions.

I can't seem to find the exact tool to use within bedtools, since bedtools multicov returns the total reads which overlap within each BED region, which ends up overcounting since an aligned BAM read can overlap with multiple BED regions.

bedtools intersect -a BAM -b BED -u technically gets me there, but in this case it generates a new bam file with only the reads in the BED regions, which I then have to count with samtools view -c. Since the BAM files are large, this is time-consuming and I was wondering if there is a faster way to do this.

bedtools BAM overlap BED • 6.4k views

ADD COMMENT • link updated 3.6 years ago by Pierre Lindenbaum 164k • written 3.6 years ago by KH ▴ 90

0

Entering edit mode

count of the reads in the BAM file which overlap with any of the region in the BED file

See the answers in this thread: Extract Reads From A Bam File That Fall Within A Given Region (samtools view -L BED_file accepts a BED file of regions also).

each read in the BAM will have an associated TRUE or FALSE denoting if it overlaps with any of the BED regions.

Not sure if that would need a custom solution.

ADD REPLY • link 3.6 years ago by GenoMax 147k

0

Entering edit mode

Essentially, each read in the BAM will have an associated TRUE or FALSE denoting if it overlaps with any of the BED regions.

hum... what are you trying to do at the end ?

ADD REPLY • link 3.6 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

I probably should have clarified the purpose of this, which might make it easier to understand.

We have taken genomic DNA and isolated certain regions out (essentially exons). We then took this DNA enriched for exons and sequenced it.

My question is: how many of our sequencing reads fall within these exon regions? In other words, how efficient is the enrichment for these exon regions?

So what I want is a single number for each BAM file which states how many of the mapped reads overlapped with an exon (which I have as a BED file).

Currently the way I am doing this is:
bedtools intersect -a BAM -b BED -u | samtools view -c
Which works but is very slow.

For reference:
From bedtools intersect documentation, the -u flag will:
"Write original A entry once if any overlaps found in B. In other words, just report the fact at least one overlap was found in B."

ADD REPLY • link 3.6 years ago by KH ▴ 90

score 1 · Answer 1 · 2021-05-07

1

Entering edit mode

3.6 years ago

Pierre Lindenbaum 164k

how many of our sequencing reads fall within these exon regions? In other words, how efficient is the enrichment for these exon regions?

samtools view -c -q 30 -F 3844 in.bam > total.txt
samtools view -c -q 30 -F 3844 -L capture.bed in.bam > in_target.txt

ADD COMMENT • link 3.6 years ago by Pierre Lindenbaum 164k