Reporting The Bam Reads Overlapping A Set Of Intervals With Bedtools
3
2
Entering edit mode
13.0 years ago
User 9996 ▴ 840

I am trying to use bedtools to pull out the reads falling directly within a set of BED coordinates. While this command does it successfully:

intersectBed -abam mybam.bam -b intervals.gff -wa -wb -f 1 | coverageBed -abam stdin -b intervals.gff

I find that it loses key information that I need. I'd like to get a listing of the BAM reads -- getting at least their ID -- split by exon. In other words, all the read IDs that fall into the first interval in intervals.gff, all the read IDs that fall into the second interval in intervals.gff... ideally, it would also report the CIGAR string for these reads, but I'd settle for just the ID.

Is there a way to report these reads, such that it's easy to tell from the output which set of reads landed in a given interval in the input BED file?

Thanks you.

bedtools bam next-gen sequencing bed • 8.0k views
ADD COMMENT
3
Entering edit mode
13.0 years ago

A memory-efficient way to do this is to use the new -sorted option in intersectBed. This requires your GFF and BAM files to be sorted by chrom then start. For the BAM it would be:

samtools sort mybam.bam mybam.sort

For the GFF it would be:

sort -k1,1 -k4,4n intervals.gff > intervals.sort.gff

The -sorted option in bedtools does not yet support BAM (work in progress), but you could just use bamToBed in a FIFO to get around this:

intersectBed -a intervals.gff -b <(bamToBed -i mybam.sort.bam) -sorted -wa -wb

This will report, for each interval in the GFF, a new line for each overlapping read. If you wanted to denormalize (collapse) this output to one line per interval with a list of read Ids, you could use the groupBy command as Istvan mentions. Something like below will generate a comma separated list of read ids (column 13 of the output) for each interval.

intersectBed -a intervals.gff -b <(bamToBed -i mybam.sort.bam) -sorted -wa -wb | \
      groupBy -g 1,2,3,4,5,6,7,8,9 -c 13 -o collapse
ADD COMMENT
0
Entering edit mode

Wow, groupBy really needs short-cuts like the -g 1-9.

ADD REPLY
0
Entering edit mode

Hi Aaron, I noticed that you have answered quite a few questions in this focus area. So, I thought may be I can post my question here. I am interested in extracting the number of paired-end reads aligned to each member of a large set of regions. I have used BEDTools/IntersectBED tool so far; it works well and returns reads overlapping my regions. But I am only interested in their count; and given the fact that I have quite a large number of regions, the output of intersectBED will be pretty large. Is there a way to prompt intersectBED to only return counts?

ADD REPLY
0
Entering edit mode

Thank you so much for the very quick reply! I browsed biostars and SeqAnswers since posting the initial question, and came across coverageBED. Do you know the pros/cons of using -c option vs coverageBED? will they return identical results? and is there a way to make them return the number of paired-end reads as opposed to counting each member of pair once? [ presumably, these counts should be ~2x what I am looking for; but just to be safe!]

ADD REPLY
1
Entering edit mode
13.0 years ago

You might look into the bedops and bedmap tools in the open-source BEDOPS suite.

The bedops tool performs many kinds of set operations (including intersection and element-of operations) on UCSC BED-formatted data.

The bedmap tool performs calculations on the signal/score information that are in four- or five-column BED files, which are map-ped to a reference BED file.

You may need to convert intermediate formats to BED to use these tools, but they are designed to be fast, accurate and low-profile.

(Disclaimer: I am one of the authors of this suite, but not of bedops or bedmap.)

ADD COMMENT
0
Entering edit mode
13.0 years ago

I believe that the groupBy tool might give you what you need. I would transform to BED and stick the CIGAR attribute into a column, for example as the name.

Then once the intersect is done and the reads are listed next to the exons sort the file by the columns that you want to group on (exon coordinates I guess) and run the groupBy tool with the collapse operator.

ADD COMMENT

Login before adding your answer.

Traffic: 1942 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6