Is there a tool out there that will filter read-pairs if any pair maps to a location with coverage < N.
I can in theory run the samtools pileup find the positions at which coverage is < N and read the bam file again and exclude the reads that are mapped to positions in my low coverage locations per chromosome.
This can be memory intensive in some cases where coverage is low and I am not sure if I find a read in low coverage region how do I make sure to exclude its mate.
just want to get rid of any mapping artifact and noise. We expect regions of interest to have high coverage. This is actually the same transcript start,end data points and I would like to remove reads in a region with < N coverage. May be I should also think about physical coverage.
first, you have to detect those low coverage regions. then, you have to filter your bam file with those regions. I would suggest you open a new question rather than commenting a previous related one.
I think that the idea behind this question is to generate a reduced BAM file based on coverage, leaving only the reads and regions which would actually be useful for any kind of downstream analysis Abhi may want to perform. I can only foresee 2 major set of applications for this idea, which in my opinion shouldn't be addressed this way: a) you want simplify a later coverage calculation, or b) you want to work only on regions with a coverage above certain threshold. in any of these cases, the usual proceeding is to deal directly with the entire BAM, setting filters/thresholds for the analysis to be performed on them. the tools that were designed to deal with BAM files are indeed optimized to perform these filters/thresholds when needed.
I could only understand performing such extra work if your intention is to perform a later intensive work on that BAM file, which just being significantly reduced on size would represent an interesting save of disk usage, hence you'll get reduced timings. if you still want to go for such filtering process, the easiest thing I can think of would be a first pass trying to generate a bed file with the regions of coverage above the desired threshold (bedtools' coverageBed should do the work, and it'll also be very fast), and then a second pass filtering the BAM file with those regions (samtools should do the work, and again that should be very fast indeed).
you could use samtools depth tool and then create a bed from the output. This bed could be used to filter your bam. There must be an easier way?
What is the workflow that requires this manipulation--just curious?
just want to get rid of any mapping artifact and noise. We expect regions of interest to have high coverage. This is actually the same transcript start,end data points and I would like to remove reads in a region with < N coverage. May be I should also think about physical coverage.
Abhi, How do yo get read a bam file and exclude the reads that are mapped to positions with low coverage?
first, you have to detect those low coverage regions. then, you have to filter your bam file with those regions. I would suggest you open a new question rather than commenting a previous related one.