I have a multisample vcffile and would like to edit heterozygous individuals with a low or high allelic balance (AB). AB is defined as the ratio of reads showing the reference allele to all reads.
If the AB is below 0.25, I would like to change the genotype to 1/1 (homozygous for alternateallele ). If the AB is above 0.75, I would like to change the genotype to 0/0 (homozygous for reference allele).
Unfortunately, I dont have the AB information for each individual in my vcf. Therefore, I was planning to use RO and DP instead. RO is the count of full observations of the reference haplotype and DP is the total number of reads. These tags are specified in the genotype column, meaning that they are specific for every individual.
The way to proceed in these cases apparently is as described here: https://software.broadinstitute.org/gatk/documentation/article?id=12350
I have been trying to use something similar to this to tag the variants that I want to convert:
gatk VariantFiltration -R /ref.fasta -V input.vcf -O output.vcf --genotype-filter-expression 'isHet == 1 && RO / DP < 0.25 ' --genotype-filter-name 'lowAB'
But somehow this just adds an anotation to all heterozygous positions...
As far as I know, GATK only allows you to convert the calls with the tag to no call. Therefore I think I need a script to manually change the genotype fields if the tags are present.
Are you sure that's not a XY Problem?
I guess so. The original aim is to filter a multisample VCF by AB. The problem is, that if I do
on a multisample VCF file, it uses the AB over ALL heterozygous genotypes in that site. However, what I want is to filter for the AB on a genotype level instead. So that, if an individual has 2 reads supporting allele A and 8 reads supporting allele B, that site is filtered since the 2 reads are then assumed to be sequencing errors.
However, I think that automatically removing the genotypes with low or high AB results in a great loss of information. There could be cases where the AB is low or high but when one discards the reads supporting A (because they are assumed to be sequencing errors), there are still sufficient reads showing that that site is homozygous for B.
Therefore, it would be good, if the program could discard the reads supporting A and then recall the genotype instead of converting the genotype under consideration to ./. automatically...
What would you advise?
It probably depends for what you need allelic imbalance and how sensitive your detection of allelic imbalance needs to be. Is this RNA or DNA-seq? If the latter, what's the coverage?
DNA, average coverage is about 10 reads.
I am working with very closely related species, thats why I wanted to keep the hets. My plan was to infer the phylogenetic tree using several methods (SNAPP, phylonet, Treemix and a concatenated tree with raxml) and then to compare the results.
Others are way more qualified to answer here since I'm not a germline person, but with average 10X, you have to accept that you don't have the power to detect heterozygosity for all sites. So you have to find a way to deal with the uncertainty. I'm sure there are standard ways of doing that for your goal. Allelic imbalance seems like the wrong term for a literature research here.