I am trying to set the reference allele to the ancestral allele in 1000 Genomes VCF files. I can do this using the --derived option in vcftools.
However, most of the ancestral alleles are in lowercase, so vcftools does not recognise them.
I am currently looking at extracting the ancestral alleles and converting them to uppercase like this:
bcftools view -G -H file.vcf.gz | awk -F'[;=|]' '{for(i=1;i<=NF;i++)if($i=="AA"){print toupper($(i+1));next}}'
I would then reinsert them into the file.
This is quite a convoluted way of doing things. Does anyone have a tidier method?
EDIT:
Here is a single entry from the VCF file (with genotype info hidden):
11 128196 rs576393503 A G 100 PASS AC=453;AF=0.0904553;AN=5008;NS=2504;DP=5057;EAS_AF=0.0159;AMR_AF=0.0259;AFR_AF=0.3071;EUR_AF=0.006;SAS_AF=0.0072;AA=g|||;VT=SNP
So here the ancestral allele is g (AA=g) and I need it to be in uppercase so that vcftools recognises it when running the --derived option.
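One possibly tidier route, as a rough sketch rather than a tested pipeline: uppercase the AA value in place while streaming the VCF, so nothing has to be extracted and reinserted. The helper name uppercase_aa and the output file name are made up for illustration. The sketch assumes tab-separated records with INFO in column 8 and the allele as the first |-separated part of AA, as in the entry above.

```shell
# Hypothetical helper: uppercase the allele in the AA= entry of the INFO column.
# Assumes a tab-separated VCF on stdin with INFO as the 8th column.
uppercase_aa() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { print; next }                 # header lines pass through unchanged
       {
         n = split($8, info, ";")           # split INFO into KEY=VALUE entries
         for (i = 1; i <= n; i++) {
           if (info[i] ~ /^AA=/) {
             val = substr(info[i], 4)       # text after "AA="
             p = index(val, "|")            # allele ends at the first "|", if any
             if (p > 0) val = toupper(substr(val, 1, p - 1)) substr(val, p)
             else       val = toupper(val)
             info[i] = "AA=" val
           }
         }
         s = info[1]                        # rebuild the INFO column
         for (i = 2; i <= n; i++) s = s ";" info[i]
         $8 = s
         print
       }'
}

# Example use (output file name is an assumption):
#   bcftools view file.vcf.gz | uppercase_aa | bgzip -c > file.AA_upper.vcf.gz
```

Only the allele itself is changed; anything after the first | in the AA value is left as it was, and all other INFO entries and the genotype columns are passed through untouched.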
I don't understand what this "AA" is. Please show us one line of this VCF.
I have edited my question. Thanks.