Entering edit mode
9.0 years ago
schelarina
▴
50
Hello
Is there a way to output the single counts for every region with samtools? For example I use samtools -c file.bam chrom1:10-1000 chrom1:50-800
and it gives the sum of the counts of the two regions. How to get the counts for the two regions separately?
Not by running samtools two separate times :-)
This would work :) But if OP wants to expand to solve the general problem of counting N regions of a BAM file, it becomes messy quickly. Better to use a different approach (like using pysam/hts-python).
Also, um, I remember trying to do this myself via samtools and having a bad time. I think, either it doesn't count reads which end in the window but start outside - or it does - i can't remember. Regardless, it's ambiguous and the OP should look into it (if they haven't already). If my memory serves me correctly and samtools doesn't count reads which start and end in the region, and the OP wants that, they'll have to use something like python.
If they are doing Paired End sequencing, he/she also needs to think about the DNA between reads (not counting read/mate twice - counting stuff in between even if both read and mate aren't within the region, etc) then they'll have to pile that stuff up and count it. Good times.
BTW, samtools view -c will include alignments starting/ending within a region.
It will include alignments that span over, but don't directly align in, a bin as well. For example:
This is handled in things like mpileup (and depth), but not the default -c option code. Of course this only matters with spliced data.
Ahh, ok got yah :) Thanks for clearing that up for me.
I think I was doing it for P.E. ChIP data, which I guess can be thought of as a single spliced fragment ;)
Hm. Come to think of it, it would be kind of neat if there was a way to turn P.E. into S.E. (modifying the CIGAR to treat it as a splice), because then tools which read the data don't have to match reads up in-memory. A lot of my run-through-a-bam-and-calculate-statistics tools in pysam have to store the read in memory and wait until it comes across the mate before it can calculate the stat. About 80-90% of my memory usage is spent keeping track of reads without mates (because if the mate doesn't map, it's at the end of the BAM when sorted on POS). Single-Ending the reads would be one nice way to tidy that problem up... kind of off-topic though.
You have the TLEN field, just ignore the downstream mate and you needn't blow up the memory. Note that this won't work if you're fetching reads near a region first, since reads that would be extended into a region wouldn't get fetched (there's no good way around that, this is due to how the BAM format works).
Yeah, that's true - you'd have to pile up everything in the chromosome I guess. But sometimes there are issues where from the read you cant infer everything about the mate, which was why I was thinking about PE -> SE. I made this little diagram to wrap my head around what values you can/cant get from the mate:
The result is that some statistics like mate-length, or mate 3' end, depending on the read type, can't be inferred without knowing the CIGAR/SEQ of the mate (I call PSEQ, but that doesnt exist). Now that I think about it, you could add PSEQ in as a tag and fix this gap. Meh.