Question

Getting All Reads That Align To A Region In Compact Bed Format Using Bedtools?

1

Entering edit mode

12.2 years ago

user ▴ 950

I'm trying to find all the reads (by name) from a BAM file that align to various regions in a bed file. Right now I can do this with bedtools using intersectBed:

intersectBed -abam reads.bam -wo -f 1 -b regions.bed -bed

From this one can parse all the read ids that land in every interval in regions.bed, but it's not very compact. Is there a way to get bedtools to natively transform this into a more compact format, e.g.

chr1 x y .... read_id1,read_id2,read_id3

where chr1 x y is a given interval in regions.bed and the comma separated read_id1,... is the list of read ids from reads.bam that fall in that interval. In this compact format, the output BED file would have at most as many entries as there are regions in regions.bed, whereas with the -wo option it can be even larger than the number of reads in reads.bam. Thanks.

bedtools bed gff bam rna-seq • 5.4k views

ADD COMMENT • link updated 12.2 years ago by Alex Reynolds 36k • written 12.2 years ago by user ▴ 950

Ram · Accepted Answer · 2013-01-16

One way to do this is to turn things around a bit, by making your BAM file be the "B" input. This way, the "driving" "A" file will be your regions.bed. This example below requires that your BAM file is sorted by position. Now, intersect won't allow BAM as the B file directly (it will soon), but you can use bedtobam to convert it to BED and pipe this to intersect. Using the -sorted option to limit memory consumption (and requiring that both the BED and the BAM are position sorted; sort -k1,1 -k2,2n for your BED file), you can pipe the output to the groupby option to get the result you'd like. The collapse option in groupby takes all values from a specific column (in this case the read ID is the 8th column) and creates a comma separated list from them. The list is returned based on changes in the columns upon which you are grouping --- in this case, columns 1-4 (-g 1-4), which represent the columns in my example regions file. Below is an example with some test files I have. If you are unfamiliar with groupby, there are more details here.

$  bedtools bamtobed -i testingData/NA18152.bam | head -3
chr1    554304    554637    NA18152-SRR007381.35051    16    -
chr1    554304    554618    NA18152-SRR007381.637219    16    -
chr1    554304    554656    NA18152-SRR007381.730912    16    -
...

Intersect the regions with the BAM file (NOTE: I have annotated the output to show where each output group starts). In this example, the BAM read name is the 8th column and the overlap (-wo) is the 11th column.

$ bedtools bamtobed -i NA18152.sorted.bam | \
   bedtools intersect -a regions.bed -b -  -sorted -wo
# group 1
chr1    713984    714547    CpG:_60    chr1    714220    714373    NA18152-SRR007381.251923    31    +    153
chr1    713984    714547    CpG:_60    chr1    714220    714368    NA18152-SRR007381.831825    37    +    148
# group 2
chr1    858970    861632    CpG:_257    chr1    860064    860310    NA18152-SRR007381.329161    10    -    246
# group 3
chr1    875730    878363    CpG:_246    chr1    875876    876203    NA18152-SRR007381.732122    9    +    327
# group 4
chr1    933387    937410    CpG:_413    chr1    936966    937135    NA18152-SRR007381.925947    12    -    169
# group 5
chr1    1109314    1110145    CpG:_59    chr1    1108986    1109385    NA18152-SRR007381.659411    23    +    71
chr1    1109314    1110145    CpG:_59    chr1    1108992    1109433    NA18152-SRR007381.615088    38    +    119
chr1    1109314    1110145    CpG:_59    chr1    1108992    1109347    NA18152-SRR007381.677905    43    +    33
chr1    1109314    1110145    CpG:_59    chr1    1109029    1109424    NA18152-SRR007381.217465    44    -    110
chr1    1109314    1110145    CpG:_59    chr1    1109038    1109346    NA18152-SRR007381.1495841    46    +    32
...

Now use group by to condense the read names and overlap columns

$ bedtools bamtobed -i NA18152.sorted.bam | \
   bedtools intersect -a regions.bed -b -  -sorted | \
      bedtools groupby -g 1-4 -c 8,11 -o collapse,collapse \
          head -5
chr1    713984    714547    CpG:_60    NA18152-SRR007381.251923,NA18152-SRR007381.831825    153,148
chr1    858970    861632    CpG:_257    NA18152-SRR007381.329161    246
chr1    875730    878363    CpG:_246    NA18152-SRR007381.732122    327
chr1    933387    937410    CpG:_413    NA18152-SRR007381.925947    169
chr1    1109314    1110145    CpG:_59    NA18152-SRR007381.659411,NA18152-SRR007381.615088,NA18152-SRR007381.677905,NA18152-SRR007381.217465,NA18152-SRR007381.1495841,NA18152-SRR007381.1061710,NA18152-SRR007381.209479,NA18152-SRR007381.293009,NA18152-SRR007381.350684,NA18152-SRR007381.1386200,NA18152-SRR007381.176689,NA18152-SRR007381.369558,NA18152-SRR007381.744765,NA18152-SRR007381.1321023,NA18152-SRR007381.838421,NA18152-SRR007381.991283,NA18152-SRR007381.1385310,NA18152-SRR007381.1039387,NA18152-SRR007381.1380869,NA18152-SRR007381.551813,NA18152-SRR007381.1295807,NA18152-SRR007381.1443162,NA18152-SRR007381.403094,NA18152-SRR007381.130789,NA18152-SRR007381.448068,NA18152-SRR007381.678372,NA18152-SRR007381.1300780,NA18152-SRR007381.160158,NA18152-SRR007381.1454803,NA18152-SRR007381.467939,NA18152-SRR007381.1405856,NA18152-SRR007381.1057252,NA18152-SRR007381.1062561,NA18152-SRR007381.85329,NA18152-SRR007381.722618,NA18152-SRR007381.878135    71,119,33,110,32,131,107,142,374,374,454,314,355,281,331,268,232,226,118,272,280,272,274,157,160,137,202,62,62,54,54,42,25,23,10,10

Ram · Accepted Answer · 2013-01-16

The BEDOPS bedmap tool with the --echo, --echo-map-id, --delim and --multidelim options can also answer this very quickly. You can pipe in reads into bedmap by using the bam2bed script:

$ bam2bed.csh < reads.bam \
    | bedmap --echo --echo-map-id --delim '\t' --multidelim ',' sorted-regions.bed - \
    > answer.bed

The --echo option returns the region element, while --echo-map-id returns a list of read elements which overlap the region by one or more bases.

The answer.bed file will look like:

[ region-A in BED format ] \t [ comma-delimited list of read IDs overlapping region-A ]
[ region-B in BED format ] \t [ comma-delimited list of read IDs overlapping region-B ] 
...

As the answer is BED-formatted output, you can very easily pipe this to downstream processes that consume BED data.

Just be sure to sort regions, if not sorted, e.g.:

$ sort-bed regions.bed > sorted-regions.bed

And, likewise, for reads:

$ bam2bed.csh < reads.bam | sort-bed - > sorted-reads.bed

Or to put it together with the overall analysis:

$ bam2bed.csh < reads.bam \
    | sort-bed - \
    | bedmap --echo --echo-map-id --delim '\t' --multidelim ',' sorted-regions.bed - \
    > answer.bed