One way to do this is to turn things around a bit, by making your BAM file be the "B" input. This way, the "driving" "A" file will be your regions.bed. This example below requires that your BAM file is sorted by position. Now, intersect won't allow BAM as the B file directly (it will soon), but you can use bedtobam
to convert it to BED and pipe this to intersect
. Using the -sorted option to limit memory consumption (and requiring that both the BED and the BAM are position sorted; sort -k1,1 -k2,2n
for your BED file), you can pipe the output to the groupby
option to get the result you'd like. The collapse
option in groupby takes all values from a specific column (in this case the read ID is the 8th column) and creates a comma separated list from them. The list is returned based on changes in the columns upon which you are grouping --- in this case, columns 1-4 (-g 1-4
), which represent the columns in my example regions file. Below is an example with some test files I have. If you are unfamiliar with groupby
, there are more details here.
$ bedtools bamtobed -i testingData/NA18152.bam | head -3
chr1 554304 554637 NA18152-SRR007381.35051 16 -
chr1 554304 554618 NA18152-SRR007381.637219 16 -
chr1 554304 554656 NA18152-SRR007381.730912 16 -
...
Intersect the regions with the BAM file (NOTE: I have annotated the output to show where each output group starts). In this example, the BAM read name is the 8th column and the overlap (-wo
) is the 11th column.
$ bedtools bamtobed -i NA18152.sorted.bam | \
bedtools intersect -a regions.bed -b - -sorted -wo
# group 1
chr1 713984 714547 CpG:_60 chr1 714220 714373 NA18152-SRR007381.251923 31 + 153
chr1 713984 714547 CpG:_60 chr1 714220 714368 NA18152-SRR007381.831825 37 + 148
# group 2
chr1 858970 861632 CpG:_257 chr1 860064 860310 NA18152-SRR007381.329161 10 - 246
# group 3
chr1 875730 878363 CpG:_246 chr1 875876 876203 NA18152-SRR007381.732122 9 + 327
# group 4
chr1 933387 937410 CpG:_413 chr1 936966 937135 NA18152-SRR007381.925947 12 - 169
# group 5
chr1 1109314 1110145 CpG:_59 chr1 1108986 1109385 NA18152-SRR007381.659411 23 + 71
chr1 1109314 1110145 CpG:_59 chr1 1108992 1109433 NA18152-SRR007381.615088 38 + 119
chr1 1109314 1110145 CpG:_59 chr1 1108992 1109347 NA18152-SRR007381.677905 43 + 33
chr1 1109314 1110145 CpG:_59 chr1 1109029 1109424 NA18152-SRR007381.217465 44 - 110
chr1 1109314 1110145 CpG:_59 chr1 1109038 1109346 NA18152-SRR007381.1495841 46 + 32
...
Now use group by to condense the read names and overlap columns
$ bedtools bamtobed -i NA18152.sorted.bam | \
bedtools intersect -a regions.bed -b - -sorted | \
bedtools groupby -g 1-4 -c 8,11 -o collapse,collapse \
head -5
chr1 713984 714547 CpG:_60 NA18152-SRR007381.251923,NA18152-SRR007381.831825 153,148
chr1 858970 861632 CpG:_257 NA18152-SRR007381.329161 246
chr1 875730 878363 CpG:_246 NA18152-SRR007381.732122 327
chr1 933387 937410 CpG:_413 NA18152-SRR007381.925947 169
chr1 1109314 1110145 CpG:_59 NA18152-SRR007381.659411,NA18152-SRR007381.615088,NA18152-SRR007381.677905,NA18152-SRR007381.217465,NA18152-SRR007381.1495841,NA18152-SRR007381.1061710,NA18152-SRR007381.209479,NA18152-SRR007381.293009,NA18152-SRR007381.350684,NA18152-SRR007381.1386200,NA18152-SRR007381.176689,NA18152-SRR007381.369558,NA18152-SRR007381.744765,NA18152-SRR007381.1321023,NA18152-SRR007381.838421,NA18152-SRR007381.991283,NA18152-SRR007381.1385310,NA18152-SRR007381.1039387,NA18152-SRR007381.1380869,NA18152-SRR007381.551813,NA18152-SRR007381.1295807,NA18152-SRR007381.1443162,NA18152-SRR007381.403094,NA18152-SRR007381.130789,NA18152-SRR007381.448068,NA18152-SRR007381.678372,NA18152-SRR007381.1300780,NA18152-SRR007381.160158,NA18152-SRR007381.1454803,NA18152-SRR007381.467939,NA18152-SRR007381.1405856,NA18152-SRR007381.1057252,NA18152-SRR007381.1062561,NA18152-SRR007381.85329,NA18152-SRR007381.722618,NA18152-SRR007381.878135 71,119,33,110,32,131,107,142,374,374,454,314,355,281,331,268,232,226,118,272,280,272,274,157,160,137,202,62,62,54,54,42,25,23,10,10
@Aaronquinlan: I believe the above example won't work if you don't pass
intersectBed
the-wao
option - the command as you wrote it will output a BED that has only 4 columns and doesn't output the read id information whichgroupBy
relies on (i.e. the-c 8
parameter). Also on my end, if I pass-wao
, the read column id is actually the 9th column (1-based) so it needs to be-c 9
(c-8
will give the read start position). Perhaps I misunderstood your answer?I've updated the example to hopefully make it more clear to you what is happening. The example also now uses -wo.
Thanks! Final issue: if I pass intersectBed
-split
, it will split junction reads into distinct intervals. My goal is to look for reads that are in total containment within the region (which is not the same as splitting them). Is there a way to achieve this for junction reads as well? So if you have(a, d)
in the intervals fileregions.bed
and you have a junction read spanning the two exons(a, b), (c, d)
(wherea < b < c < d
), then I would like to count that junction read as strictly contained within(a, d)
but not count it towards(a, b)
or(c, d)
if those intervals are inregions.bed
. Edit: actually, I think-split
takes care of this, perhaps I'm wrong.