The INFO field (8th column) of SnpEff can have multiple annotations for a particular SNP. For example:
1 52100121 . T A 2605.89 . AC=2;AN=2;ANN=A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034394.1|protein_coding|26/27|c.7207+1962A>T||||||,A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034408.1|protein_coding|25/26|c.6514+1962A>T||||||,A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034418.1|protein_coding|23/24|c.5716+813A>T||||||,A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034451.1|protein_coding|23/24|c.5698+1962A>T||||||,A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034483.1|protein_coding|24/25|c.5638+1962A>T|||||| GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0
I would like to put each annotation into a different row but keeping the rest of the information untouched, just like so:
1 52100121 . T A 2605.89 . AC=2;AN=2;ANN=A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034394.1|protein_coding|26/27|c.7207+1962A>T|||||| GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0
1 52100121 . T A 2605.89 . AC=2;AN=2;ANN=A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034408.1|protein_coding|25/26|c.6514+1962A>T|||||| GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0
1 52100121 . T A 2605.89 . AC=2;AN=2;ANN=A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034418.1|protein_coding|23/24|c.5716+813A>T|||||| GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0
1 52100121 . T A 2605.89 . AC=2;AN=2;ANN=A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034451.1|protein_coding|23/24|c.5698+1962A>T|||||| GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0
1 52100121 . T A 2605.89 . AC=2;AN=2;ANN=A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034483.1|protein_coding|24/25|c.5638+1962A>T|||||| GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0
I wrote the following script that can successfully achieve this task using bash and awk, but it is crazy inefficient on large files with millions of SNPs:
while read SNP ; \
do \
awk -F"\t" -v snp_pos="$SNP" '$2==snp_pos{print $8}' snpeff_annotation_results.vcf | \
awk -F"ANN=" '{print $2}' | \
sed 's/,/\n/g' | \
sed "s/^/$(awk -F"\t" -v snp_pos="$SNP" '$2==snp_pos{print $8}' \
snpeff_annotation_results.vcf | \
awk -F"|" '{print $1}' | \
awk '{print substr($0, 1, length($0)-1)}')/g" | \
sed "s/^/$(awk -F"\t" -v snp_pos="$SNP" '$2==snp_pos{print $1,$2,$3,$4,$5,$6,$7}' \
snpeff_annotation_results.vcf) /g" | \
sed "s%\$% $(awk -F"\t" -v snp_pos="$SNP" '$2==snp_pos{print $9,$10}' \
snpeff_annotation_results.vcf)%g" \
; done < snps.list >> snpeff_annotation_results.vcf.splitted
Does anyone know an alternative way to do this task faster?
Additional comment 1: the files 'snpeff_annotation_results.vcf' and 'snps.list' (not shown) contain annotation results from SnpEff and the position of millions of snps (one per row), respectively. Here I am just posting one SNP to make my question more straightforward.
Additional comment 2: besides bash/awk, I have experience in Python, Perl and R. So, I would be grateful if the solution could be with any of these languages.
Hi iraun ! Your code works great and it seems to run super fast compared to mine, so thank you very much for your help :) there is only one minor issue which is the "ANN=" is not being printed, but I think I can easily fix this with 'sed' after running the script, just like so:
sed 's/;,/;ANN=/g'
Glad it worked! And you are totally right,
ANN=
is missing. You can probably usesed
as you mentioned, or slightly modify theawk
command as follows: