Extracting Ancestral allele from INFO column of vcf file
2
0
Entering edit mode
5.6 years ago
NS ▴ 10

Hi From my vcf file, I need to generate a text file with the following headers (shown just an example):
CHROM POS ID REF ALT AA
1 886817 rs11174805 C G C
1 886817 rs111111111 C T .
1 886817 rs11144444 A T A

Here AA column is the Ancestral allele column that can be obtained from the INFO column of the vcf file which has "AA=allele_name". In second row, "." implies missing "AA=." in INFO column and thus the ancestral allele is unknown. i have extracted the columns from my vcf file using: awk 'BEGIN {OFS ="\t" ; FS = "\t"};{print $1, $2, $3, $4, $5, $8}' chr22.vcf

This gives me columns - CHROM, POS, ID, REF, ALT, INFO.
INFO column looks like this: GAC=1;AF=0.000199681;AN=5008;NS=2504;DP=8012;EAS_AF=0;AMR_AF=0;AFR_AF=0;EUR_AF=0;SAS_AF=0.001;AA=.|||;VT=SNP

Now, using shell script, how can I extract AA from INFO and create a new column with AA alleles in it ?

Thanks

vcf SNP ancestral allele shell linux • 2.9k views
ADD COMMENT
1
Entering edit mode
5.6 years ago

Assuming that AA is defined in your header, like this:

##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele. Format: AA|REF|ALT|IndelType. AA: Ancestral allele, REF:Reference Allele, ALT:Alternate Allele, IndelType:Type of Indel (REF, ALT and IndelType are only defined for indels)">

Then, you can just do this:

bcftools query --format '%CHROM\t%POS\t%ID\t%REF\t%ALT\t%AA\n' 1000Genomes.Norm.bcf | head -20
1   10177   rs367896724 A   AC  |||unknown(NO_COVERAGE)
1   10235   rs540431307 T   TA  |||unknown(NO_COVERAGE)
1   10352   rs555500075 T   TA  |||unknown(NO_COVERAGE)
1   10505   rs548419688 A   T   .|||
1   10506   rs568405545 C   G   .|||
1   10511   rs534229142 G   A   .|||
1   10539   rs537182016 C   A   .|||
1   10542   rs572818783 C   T   .|||
1   10579   rs538322974 C   A   .|||
1   10616   rs376342519 CCGCCGTTGCAAAGGCGCGCCG  C   .
1   10642   rs558604819 G   A   .|||
1   11008   rs575272151 C   G   .|||
1   11012   rs544419019 C   G   .|||
1   11063   rs561109771 T   G   .|||
1   13011   rs574746232 T   G   t|||
1   13110   rs540538026 G   A   g|||
1   13116   rs62635286  T   G   t|||
1   13118   rs200579949 A   G   a|||
1   13156   rs552314247 G   C   g|||
1   13259   rs562993331 G   A   g|||
ADD COMMENT
0
Entering edit mode

Thanks for your help ! Is there also a way to keep only bi-allelic SNPs in this output using bcf tools ?

ADD REPLY
0
Entering edit mode
5.6 years ago

In addition to Kevin's answer, GATK has a very nice VariantsToTable tool for extracting data like this that is sometimes easier and quicker when your queries are more complicated.

ADD COMMENT

Login before adding your answer.

Traffic: 2937 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6