I have whole genome sequences of several clinical isolates and I want to get the nucleotide frequencies (% A,C,T,G) at a given position. Among novel SNPs we could discover, there are some very specific positions that I want to look at. There are enough clinical isolates that I don't wish to this manually using IGV.
I can do this, but it's extremely inelegant and I think that there is some more elegant way that I am missing. I have summarize that there are probably two approaches.
Approach 1
1 . Run the basic samtools SNP pipeline, outputting all positions without a VCF ( remove -V) and outputting all possible reference alleles.
- Write a script that parses the info column in the VCF file, get the depth info for each nucleotide, calculate the frequencies (for my variants of interest only); luckily I already have this script.
Approach 2
- Use bedtools and convert that bam to fastq
- Use seqtk and convert fastq to fasta
- Use bedtools nuc command, specifying position of interest
I feel like this is too common a task to require the use of so many tools and file format conversions and even custom scripts, so I ask : what's a better way to do this?
I think using mpileup with a given genomic coordinate is the simplest solution. It supports filters by MAPQ and you can specify genomic coordinates (just be careful about the off-by-one error if youre zero-counting vs one-counting bases).
This was the fastest way to do it.