This question is related to this one, but I would like to know if anyone knows of any methods of quickly extracting reads from a BAM file that overlap with a list of many regions (e.g. a BED file), but I would like the reads separately for each region.
My eventual requirements are that for each region individually, I would like to know:
- The number of overlapping reads
- The location of the highest pileup
- The height of the highest pileup
However I am happy to calculate these, I just need a tool to return the reads for each region in a way that I can parse.
I am currently using Rsamtools, which does everything I require, as it returns a list, with each element in the list containing the reads that overlap each region. However, is really slow for a large number of regions. It's run time is dependent on the number of regions and largely independent of the number of reads in the BAM. I am using Rsamtools like this currently:
which <- RangedData(space=bed[,1],IRanges(bed[,2],bed[,3]))
param <- ScanBamParam(which=which,what=c("pos"))
bam <- scanBam(file=bamFile,param=param)
I can split the list of regions and run it as many parallel jobs and merge at the end, however first I wanted to check if there was another tool which was quicker for this task.
I have looked at intersectBed, but I can not get this to return reads in a way I can easily parse to give information for each region. The run time for this method is dependent on the number of reads in the BAM and largely independent of the number of regions, which would be good for my needs.
I coded this in python using pysam. For 166k regions and a bamfile with 100 million alignments, runtime was 412 seconds to count the number of reads in each region.
Hi, it's slow because scanBam with param is not meant to be used as repeatedly as you do it. for many regions it's advisable to use countOverlaps.
The java code by Pierre is incomplete. Hi Pierre, is there any chance that you could post the entire java code? Thx!