How to extract only transition mutations/SNPs from a VCF file
7.0 years ago
sadsam ▴ 50

I have generated VCF files containing SNPs from mutant genome sequences. Since the mutagen only caused transitions in these genomes, I want to make VCF files that contain only transition mutations/SNPs. I know awk could easily do it. I started with the following command:

 cat file.vcf | awk '$4=="G" && $5=="C"' > transitions.vcf

It generates a file with G>C transitions only. However, I need to include 3 other transitions (A>G, C>T and T>C) from the same file in transitions.vcf. I am not sure how to combine all the awk conditions to get one single VCF file. Any help is highly appreciated.

SNP next-gen • 4.9k views

All this would do is extract positions where the REF allele is G and the ALT allele is C, irrespective of your subject's genotype.

It will NOT give you positions where your subject actually carries a G > C mutation.
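
To illustrate the point about genotypes, here is a minimal sketch that keeps only records where the sample carries at least one ALT allele. It assumes a single-sample VCF with the genotype in column 10, biallelic records, and GT as the first key in the FORMAT column; the file names toy_gt.vcf and carriers.vcf are made up for the example.

```shell
# Hypothetical single-sample toy VCF: positions 10 and 30 carry ALT, position 20 is 0/0.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n1\t10\t.\tG\tA\t.\t.\t.\tGT\t0/1\n1\t20\t.\tC\tT\t.\t.\t.\tGT\t0/0\n1\t30\t.\tA\tG\t.\t.\t.\tGT:DP\t1/1:12\n' > toy_gt.vcf

awk -F '\t' '
    /^#/ { print; next }             # keep header lines
    {
        split($10, fields, ":")      # assumption: GT is the first FORMAT key
        if (fields[1] ~ /1/) print   # allele index 1 (ALT) appears in the call
    }
' toy_gt.vcf > carriers.vcf
```

With a multi-sample VCF the same check would have to be repeated per sample column, so this is only a starting point.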

7.0 years ago

Using vcffilterjdk (http://lindenb.github.io/jvarkit/VcfFilterJdk.html) and the isTransition function of VariantContextUtils:

 java -jar dist/vcffilterjdk.jar \
  -e 'return variant.isSNP() && variant.isBiallelic() && VariantContextUtils.isTransition(variant);' \
  input.vcf

And if you really want awk:

awk -F '\t' '(($4 == "A" && $5 == "G") || ($4 == "G" && $5 == "A") || ($4 == "C" && $5 == "T") || ($4 == "T" && $5 == "C") || $0 ~ /^#/ )'   input.vcf
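
A more compact variant of the same awk filter, as a sketch. Transitions are the purine-to-purine and pyrimidine-to-pyrimidine changes (A>G, G>A, C>T, T>C), so the G>C change in the question is actually a transversion and is excluded here. The toy input and the file names toy.vcf and transitions.vcf are made up for the example; the lookup-table approach is just one way to write it.

```shell
# Hypothetical toy input; real VCFs have more header lines and columns.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\n1\t10\t.\tG\tA\n1\t20\t.\tG\tC\n1\t30\t.\tc\tt\n1\t40\t.\tA\tAG\n1\t50\t.\tT\tC\n' > toy.vcf

awk -F '\t' '
    BEGIN {
        # Lookup table of the four REF+ALT pairs that are transitions.
        n = split("AG GA CT TC", pairs, " ")
        for (i = 1; i <= n; i++) ts[pairs[i]] = 1
    }
    /^#/ { print; next }        # keep header lines
    (toupper($4 $5) in ts)      # print records whose REF+ALT is a transition
' toy.vcf > transitions.vcf
```

On the toy input this keeps the two header lines and the records at positions 10, 30 and 50. Indels and multi-allelic records (ALT such as "A,T") never match a two-letter key, so they are dropped; split them first if you need them.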

Thanks a lot, Pierre. With my current knowledge, I would prefer awk, as there is no need to download or install anything.
