I would like to remove x% of reads from a BAM file at a specific region . For instance : remove 50% of reads in chr4:1-1000000
What's the most efficient approach ?
I would like to remove x% of reads from a BAM file at a specific region . For instance : remove 50% of reads in chr4:1-1000000
What's the most efficient approach ?
This should work:
java -jar GenomeAnalysisTK.jar \
-T PrintReads \
-R reference.fasta \
-I input.bam \
-o output.bam \
-dfrac 0.5 \
-L chr4:1-1000000
Similarly, you can do this with Picard Tools
java -jar ~/tools/picard.jar DownsampleSam I=Input.bam o=Output.bam PROBABILITY=0.5
Or use use samtools view
samtools view -s 0.5 Input.bam > Output.bam
Picardtools has the advantage that it keeps or discards read pairs as a whole. E.g. if one mate is removed/kept, the other is as well.
To remove reads from a bam in a particular region, I only found the python way using pysam. Here is the pseudo-code
for read in bamfile:
if read.pos > region_start and (read.pos + read.length) < region_end:
do_nothing()
else:
output.write(read)
My version of pysam doesn't have read.length - what is it? If you meant len(read.seq), strictly speaking you should parse the CIGAR string as that takes into consideration insertions/deletions/skips/softclipped bases/etc. Also, obviously, read pairs will be all screwed up as stam says, but perhaps thats OK for your application.
Yeah, I mean, you can do that by all means, but it won't be as accurate as parsing the CIGAR strings. To be honest, for your tests it'll probably be fine. But this is what i'm talking about:
>>> b
<pysam.calignedsegment.AlignedSegment object at 0x109a18f58>
>>> print b
SOLEXA1_0001:4:87:13822:5416#0 16 0 3299480 0 8M16308N28M -1 -1 36 GAAACTAACACTCTCTCTCTCTCTCTCTCTCTCTCT array('B', [34, 34, 34, 34, 34, 34, 34, 34, 34, 33, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34]) [('NM', 2), ('XS', '-'), ('NH', 5), ('CC', 'chr10'), ('CP', 85523284)]
>>> print b.query_length
36
So this is a read where, according to the CIGAR, the first 8bp map, then theres a 16,308bp gap, then 28bp map. To get the end position you want the POS + 8 + 16308 + 28, and not POS + 36. However, since reads like this can be rare, you might want to just add code that checks if the read has a weird CIGAR and ask you manually to accept/reject the read. Might be faster in the long run.
I also don't know how pysam handles soft-clipped reads (and for the record, this is exactly why i dislike pysam - it causes more ambiguity than it gives back in abstraction). Perhaps query_length, whatever that is, takes into account soft-clipped reads. Perhaps not. Who knows. But if it does not and you have soft-clipped adapters, you'll have to parse CIGAR strings for sure.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
I prefer something using samtools / bedtools ! But thanks, I will try it
Maybe something like this:
Those commands keeps reads . I want to exclude them. I though to remove all reads in chr4:1-1000000, and then merge with your output. Unfortunally, I didn't find how to remove easily reads in the region chr4:1-1000000
This removes all the reads in the specified region (-XL option) to include use L, don't know a samtools equivalent:
Thanks ! it seems working. But I wonder why the reference is required.