I am trying to characterize the depth of coverage for SNPs/INDELs/other variants in my BAM/SAM file. Using samtools mpileup generates a nice VCF that gives me information on this, but it's not quite complete. Entering the "-t AD" flag gives me the depth of the alternate alleles compared to the reference. That is what I want, though the actual depth is found in the column after the flag is indicated and it is colon delimited. It is readable, but not terribly concise.
However, I am trying to check specific positions along the genome for specific alternate alleles. Is there a convenient way to parse the VCF for the AD flag values by position and allele without complicated and format-specific if/else statements?
I can use bedtools intersect or a dozen other methods to reduce the file to specific positions, already, but I'm hoping there's a method that doesn't require piping a handful of tools together.
EDIT:
It would be straightforward to write an if statement of the kind:
if ALT column is T then return AD and if AD is > 20 return "likely to be target organism strain"
However, over entire genomes and many different variant calls, this won't be so trivial. I have a file that contains a list of positions along the genome and alleles that I'm specifically looking for. I guess I should better word my question to ask for some software that performs that level of analysis. I need to compare allele identity at a given position to a file containing many positions and alleles of interest and then return the coverage of those alleles in the sample that match the ones in the comparison file. After that, I need to review the depth to determine the significance of the variant - which is arbitrary.
BBMap's variant-caller callvariants.sh) will produce a vcf file with the allele depth always in the exact same position, which is relatively easy to parse with no if statements... a typical line is:
The total coverage here (DP) is 33 for the first sample, and allele depth (AD) is also 33 for the first sample. So, allele depth is the the 3rd colon-delimited field of the 9th tab-delimited field.
That doesn't look too much different from the pileup output. The greater difficulty I'm having isn't parsing out the AD feature, but doing so when comparing the alternate allele value at that point.
It would be straightforward to write an if statement of the kind:
However, over entire genomes and many different variant calls, this won't be so trivial. I guess I should better word my question to ask for some software that performs that level of analysis. I need to compare allele identity at a given position to a file containing many positions and alleles of interest and then return the coverage of those alleles in the sample that match the ones in the comparison file. After that, I need to review the depth to determine the significance of the variant - which is arbitrary.