What options are there to produce a bed file with every entry being a start-end segment of the reference with a gap in coverage from a bam file?
What options are there to produce a bed file with every entry being a start-end segment of the reference with a gap in coverage from a bam file?
If you create a BEDGRAPH of coverage from your BAM file using the bedtools genomecov tool, you can simply filter for regions with 0 (or whatever minimum value) as follows:
bedtools genomecov -ibam my.sorted.bam -bg > my.sorted.bedgraph
awk '$4 < 1' my.sorted.bedgraph > my.uncovered.intevals.bedgraph
You could try bedtools
, though you'd need to do a little intermediate processing. Use bedtools bamtobed
to convert the bam file into bed format, then use bedtools complement
to identify the intervals that weren't covered in the original bam file.
First, to demonstrate the idea, make a sorted reference BED file. Let's say you're working with human data (hg19
build):
$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg19.chromInfo" \
| awk '{print $1"\t0\t"($2+1)}' \
| tail -n +2 \
| sort-bed - \
> hg19.ref.bed
$ head hg19.ref.bed
chr1 0 249250622
chr10 0 135534748
chr11 0 135006517
...
(You can use any sorted reference coordinates you want — I am using sorted coordinates for the hg19
reference genome for demonstration purposes.)
Next, convert your BAM file to BED:
$ bam2bed < myReads.bam > myReads.bed
Finally, run the reads and reference files through BEDOPS bedops --partition and bedops --not-element-of to extract regions-of-interest:
$ bedops --partition hg19.ref.bed myReads.bed \
| bedops --not-element-of -1 - myReads.bed \
> answer.bed
The file answer.bed
will be a sorted BED file that contains regions from the reference file hg19.ref.bed
, with gaps where the BAM-sourced reads are located.
(Briefly, the way this works is that the --partition
operation splits the reference and read regions into disjoint subsets. The --not-element-of
operation filters out disjoint subsets that are specifically from the reads set.)
this would work if "gap" means no coverage at all. if "gap" is understood as "target regions without enough coverage", which would include those without coverage, then you would have to extract the coverage first in bedgraph format for instance, and then parse that bedgraph file looking for positions not reaching the threshold of interest.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
that's exactly what we do to detect poorly covered regions (and it works quite fast I must add). you may want to intersect that final bedgraph with your design bed file to get only the designed regions that were not covered enough.