Split multiple annotations from SnpEff
1
0
Entering edit mode
2.2 years ago
biomonte ▴ 220

The INFO field (8th column) of SnpEff can have multiple annotations for a particular SNP. For example:

1   52100121    .   T   A   2605.89 .   AC=2;AN=2;ANN=A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034394.1|protein_coding|26/27|c.7207+1962A>T||||||,A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034408.1|protein_coding|25/26|c.6514+1962A>T||||||,A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034418.1|protein_coding|23/24|c.5716+813A>T||||||,A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034451.1|protein_coding|23/24|c.5698+1962A>T||||||,A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034483.1|protein_coding|24/25|c.5638+1962A>T||||||    GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0

I would like to put each annotation into a different row but keeping the rest of the information untouched, just like so:

1   52100121    .   T   A   2605.89 .   AC=2;AN=2;ANN=A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034394.1|protein_coding|26/27|c.7207+1962A>T||||||   GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0
1   52100121    .   T   A   2605.89 .   AC=2;AN=2;ANN=A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034408.1|protein_coding|25/26|c.6514+1962A>T||||||   GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0
1   52100121    .   T   A   2605.89 .   AC=2;AN=2;ANN=A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034418.1|protein_coding|23/24|c.5716+813A>T||||||    GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0
1   52100121    .   T   A   2605.89 .   AC=2;AN=2;ANN=A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034451.1|protein_coding|23/24|c.5698+1962A>T||||||   GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0
1   52100121    .   T   A   2605.89 .   AC=2;AN=2;ANN=A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034483.1|protein_coding|24/25|c.5638+1962A>T||||||   GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0

I wrote the following script that can successfully achieve this task using bash and awk, but it is crazy inefficient on large files with millions of SNPs:

while read SNP ; \
do \
awk -F"\t" -v snp_pos="$SNP" '$2==snp_pos{print $8}' snpeff_annotation_results.vcf | \
awk -F"ANN=" '{print $2}' | \
sed 's/,/\n/g' | \
sed "s/^/$(awk -F"\t" -v snp_pos="$SNP" '$2==snp_pos{print $8}' \
snpeff_annotation_results.vcf | \
awk -F"|" '{print $1}' | \
awk '{print substr($0, 1, length($0)-1)}')/g" | \
sed "s/^/$(awk -F"\t" -v snp_pos="$SNP" '$2==snp_pos{print $1,$2,$3,$4,$5,$6,$7}' \
snpeff_annotation_results.vcf) /g" | \
sed "s%\$% $(awk -F"\t" -v snp_pos="$SNP" '$2==snp_pos{print $9,$10}' \
snpeff_annotation_results.vcf)%g" \
; done < snps.list >> snpeff_annotation_results.vcf.splitted

Does anyone know an alternative way to do this task faster?

Additional comment 1: the files 'snpeff_annotation_results.vcf' and 'snps.list' (not shown) contain annotation results from SnpEff and the position of millions of snps (one per row), respectively. Here I am just posting one SNP to make my question more straightforward.

Additional comment 2: besides bash/awk, I have experience in Python, Perl and R. So, I would be grateful if the solution could be with any of these languages.

SnpEff AWK Bash VCF Annotation • 1.0k views
ADD COMMENT
3
Entering edit mode
2.2 years ago
iraun 6.2k

Hi! Would something like this work for you?

Let me know if you need any explanation of the code :)

awk -F'\t' '{split($8,info,"ANN="); split(info[2],ann,",");  for (i in ann) {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"info[1]","ann[i]"\t"$9"\t"$10}}' test.vcf
ADD COMMENT
1
Entering edit mode

Hi iraun ! Your code works great and it seems to run super fast compared to mine, so thank you very much for your help :) there is only one minor issue which is the "ANN=" is not being printed, but I think I can easily fix this with 'sed' after running the script, just like so: sed 's/;,/;ANN=/g'

1   52100121    .   T   A   2605.89 .   AC=2;AN=2;,A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034451.1|protein_coding|23/24|c.5698+1962A>T||||||  GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0
1   52100121    .   T   A   2605.89 .   AC=2;AN=2;,A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034483.1|protein_coding|24/25|c.5638+1962A>T||||||  GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0
1   52100121    .   T   A   2605.89 .   AC=2;AN=2;,A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034394.1|protein_coding|26/27|c.7207+1962A>T||||||  GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0
1   52100121    .   T   A   2605.89 .   AC=2;AN=2;,A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034408.1|protein_coding|25/26|c.6514+1962A>T||||||  GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0
1   52100121    .   T   A   2605.89 .   AC=2;AN=2;,A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034418.1|protein_coding|23/24|c.5716+813A>T||||||   GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0
ADD REPLY
1
Entering edit mode

Glad it worked! And you are totally right, ANN= is missing. You can probably use sed as you mentioned, or slightly modify the awk command as follows:

awk -F'\t' '{split($8,info,"ANN=");split(info[2],ann,","); for (i in ann) {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"info[1]"ANN="ann[i]"\t"$9"\t"$10}}' 
ADD REPLY

Login before adding your answer.

Traffic: 2018 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6