Hi,
I know this post is old but I was looking for a solution to the same problem (preferably as a one-line bash code) and after a lot of head-scratching and googling, I came up with the following. I'm leaving it here for whomever wants it (or for my future self when I inevitably forget what I did):
samtools mpileup bamfile.bam -f reference.fa -r regions -q 20 -Q 20 | awk 'BEGIN{RS="\n";OFS="\t";ORS="\t"}{split("", count); a=split($5,chars,""); print $1,$2,$3,$4; for (u=1; u<=a; u++) count[chars[u]]++i; for (u in count) {printf "%s ",u; printf "%s ", count[u]}; print "\n"}'
The -q
filters for base phred-scores and -Q
filters for mapping quality if needed. The output table is not very elegant, but it will print the composition of each base pair position and their number according to the mpileup formatting. An example:
9 133730382 T 3365 ] 2 ^ 2 , 1815 G 2 . 1548 $ 1
9 133730383 T 3363 ] 3 ^ 3 , 1813 . 1550 $ 2
Two positions from an example BAM file (chromosome 9, positions 133730382 and 133730383) have a T with a depth of 3365 and 3363 respectively and have 1815 reads with the T in reverse and 1548 in forward for the first for the first position. Hope someone finds this helpful.
A all-in-one package would have been better but yours did a perfectly good job. It'll just require some digging in the 16,443,247 lines that it produced^^
Thanks!
If someone is interested by the next step:
GOAL: For each position of a chromosome, find the ATCGN content while filtering for the low phred score: