What alternative tools can do what bedtools intersect does?
You would need sorted inputs (sorted per sort-bed; I'm not sure what sortBed does), but for faster options, bedops --intersect and bedops --element-of do different kinds of intersections.
If you're counting overlaps of elements by class, bedmap --count and bedmap --faster --count can be useful.
You can also use the --chrom operator with BEDOPS tools to trivially parallelize work by chromosome via GNU Parallel or HPC job schedulers.
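As a rough illustration of the per-chromosome split, the sketch below only builds one bedops command line per chromosome; it does not run anything. The file names and the chromosome list are hypothetical placeholders, and the choice of --element-of 1 is just an example operation.

```python
import shlex


def per_chromosome_commands(chroms, a_bed, b_bed, min_overlap="1"):
    """Build one bedops command line per chromosome.

    Each command restricts the work to a single chromosome with --chrom,
    so the commands are independent of one another.
    """
    commands = []
    for chrom in chroms:
        argv = ["bedops", "--chrom", chrom,
                "--element-of", min_overlap, a_bed, b_bed]
        commands.append(shlex.join(argv))
    return commands


# Hypothetical inputs, for illustration only.
commands = per_chromosome_commands(
    ["chr1", "chr2", "chrX"], "sites.sorted.bed", "exons.sorted.bed"
)
for command in commands:
    print(command)
```

Each printed line is an independent job that could be handed to GNU Parallel or submitted to a job scheduler.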
I don't know what overlap criteria bedtools uses by default, but --element-of 1 requires one or more bases of overlap. More stringent overlap can be specified with more bases or with a percentage, e.g. --element-of 100% for full enclosure. Also check that inputs are sorted and that they are provided in the correct order: bedops -e 1 A B will give a different answer from bedops -e 1 B A.
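To make the overlap threshold and the argument order concrete, here is a small pure-Python model of the idea. It is a simplification written for illustration, not the bedops implementation: it compares each element of the first set against individual elements of the second set, assumes half-open (chrom, start, end) coordinates, and the function names are made up.

```python
def overlap_bases(a, b):
    """Number of bases shared by two half-open intervals (chrom, start, end)."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def element_of(first, second, min_bases=None, min_fraction=None):
    """Keep elements of `first` that sufficiently overlap an element of `second`.

    Pass min_bases for an absolute threshold, or min_fraction for a threshold
    relative to the length of the element from `first` (1.0 = full enclosure).
    """
    kept = []
    for a in first:
        length = a[2] - a[1]
        needed = min_bases if min_bases is not None else min_fraction * length
        if any(overlap_bases(a, b) >= needed for b in second):
            kept.append(a)
    return kept


A = [("chr1", 100, 200), ("chr1", 160, 180), ("chr1", 300, 400)]
B = [("chr1", 150, 350)]

one_base_ab = element_of(A, B, min_bases=1)       # elements of A touching B
one_base_ba = element_of(B, A, min_bases=1)       # elements of B touching A
enclosed_ab = element_of(A, B, min_fraction=1.0)  # elements of A inside B
```

Here one_base_ab holds all three elements of A, one_base_ba holds the single element of B, and enclosed_ab holds only ("chr1", 160, 180). Swapping the inputs changes which file's elements are reported, which is one reason counts can differ between runs or between tools.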
And why wouldn't you use bedtools intersect?
There are 10+ methylome BED files containing the positions of methylated bases. They were obtained through bwa-meth, followed by MethylDackel, followed by conversion to BED. Each file has roughly 80,000,000+ entries. I was trying to intersect these with gene features such as exons, and the counts came out as expected. The problem is that I have around 50,000 genes, and bedtools intersect takes approximately 5 minutes in total for the exons, introns and promoter of one gene. Extrapolating to 50,000 genes at that rate, the intersection would take weeks to complete. (I also tried bedops, but the counts were different from those with bedtools.)
I have long been an avid user of bedtools, but in this case, even with sorted BED files, I could not achieve the necessary speed.
That is why I asked for other alternatives.
That's a better description of the problem you are trying to solve.
Some ideas:
- Use tabix to index your BED file. Doing this gives you random access to given regions.
- GNU Parallel
fin swimmer
I see. How about parallelizing things per exon?
When I tried to parallelize bedtools, the individual processes in fact slowed down, effectively nullifying the expected advantage. I checked on different servers [256 GB], and this behaviour recurs; it may have something to do with RAM.
Please confirm whether you are using bedtools intersect -sorted.
John,
I believe so. Let me cross-check and confirm as soon as I can.
John,
Confirmed. It is slow even with -sorted.
Jeffin
It depends on what you want to "intersect".
But what is the purpose? For some tasks, the grep -f or join commands can also be used.
A good read: What Is The Proper Way To Think About Reinventing The Wheel As A Bioinformatician?
If this is whole-genome sequencing, I recommend first running a basic intersect (with either bedtools or bedops) against just the gene regions, keeping only those bases that overlap genes for the more detailed annotation tasks looking at exons, introns, etc. Odds are you want to separate the bases into gene-overlapping and non-gene-overlapping anyway. You may also want to consider splitting the intersect by annotation type, i.e., running a separate process with a BED file containing only exons or only introns, respectively.
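The two-stage idea can be sketched in plain Python as below. This is only a toy model to show the order of operations, assuming half-open (chrom, start, end) tuples and one-base overlap; the helper names and the example coordinates are invented, and for files with tens of millions of entries the actual filtering would be done with bedtools or bedops.

```python
from bisect import bisect_right
from collections import defaultdict


def merge_by_chrom(regions):
    """Merge overlapping or touching regions into sorted spans per chromosome."""
    merged = defaultdict(list)
    for chrom, start, end in sorted(regions):
        spans = merged[chrom]
        if spans and start <= spans[-1][1]:
            spans[-1][1] = max(spans[-1][1], end)
        else:
            spans.append([start, end])
    return merged


def keep_overlapping(sites, regions):
    """Return the sites that share at least one base with any region."""
    merged = merge_by_chrom(regions)
    starts = {chrom: [s for s, _ in spans] for chrom, spans in merged.items()}
    kept = []
    for chrom, start, end in sites:
        if chrom not in merged:
            continue
        # Index of the last merged span that starts before the site ends.
        i = bisect_right(starts[chrom], end - 1) - 1
        if i >= 0 and merged[chrom][i][1] > start:
            kept.append((chrom, start, end))
    return kept


# Invented example data.
genes = [("chr1", 100, 1000), ("chr2", 50, 500)]
exons = [("chr1", 100, 200), ("chr1", 800, 1000), ("chr2", 50, 120)]
sites = [("chr1", 10, 11), ("chr1", 150, 151), ("chr1", 500, 501),
         ("chr1", 999, 1000), ("chr2", 499, 500), ("chr3", 5, 6)]

genic = keep_overlapping(sites, genes)   # stage 1: drop non-genic sites once
exonic = keep_overlapping(genic, exons)  # stage 2: annotate the smaller set
```

The second stage only ever sees the sites that survived the first, so each per-feature pass (exons, introns, promoters) works on a reduced input instead of the full methylome file.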