Question

Need suggestions in subsetting the Annovar annotated VCF file

0

5.7 years ago

Riri • 0

Hello Everyone,

I am a fairly novice bioinformatician. I need some help and suggestions on tools that I can use to subset my annotated VCF file using specific criteria. The criteria are: (i) coding and splice-site variants; (ii) CADD > 10 for nonsynonymous SNPs; (iii) AA change: nonsense; (iv) absent from the ExAC database; (v) frequency in KAVIAR: 6.4E-06. I am writing my own Python code because I couldn't find any tool that serves my need. So far I have tried GATK's VariantsToTable and VariantFiltration, bcftools, and vcftools. I would like to know if there are any tools out there that can parse the INFO column of a VCF file and help filter/subset the file based on selected criteria. Thank you in advance for your help!

next-gen • 1.8k views

updated 5.7 years ago by mbelmadani ★ 1.4k • written 5.7 years ago by Riri • 0

score 0 · Answer 1 · 2019-08-01

0

5.7 years ago

mbelmadani ★ 1.4k

Using table_annovar.pl, you should get both a VCF and a tabular .txt version of the results. While the .vcf one has an INFO field that requires parsing, the tabular .txt file should already have the information you want in columns. From there, it should be easy to filter by column using Python, any other standard programming language, or shell tools like awk.
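To illustrate the column-filtering idea, here is a minimal Python sketch, not a definitive implementation. The column names (`Func.refGene`, `ExonicFunc.refGene`, `ExAC_ALL`, `Kaviar_AF`, `CADD_phred`) are assumptions and depend on which databases and protocols were passed to table_annovar.pl, so check the header of your own .txt file. The way the criteria are combined is also only one reading of the question: the KAVIAR value is treated as an upper bound, "." is treated as "absent from the database", and the CADD cutoff is applied only to nonsynonymous variants.

```python
import csv
import io

# Assumed spellings of the Func.refGene values for coding / splice-site variants.
CODING_OR_SPLICE = {"exonic", "splicing", "exonic;splicing"}
# ANNOVAR writes "." when a database has no entry for the variant.
MISSING = {"", "."}


def to_float(value):
    """Return a float, or None for a missing value."""
    return None if value in MISSING else float(value)


def keep(row, cadd_min=10.0, kaviar_max=6.4e-06):
    """Apply one interpretation of the criteria to a single table row."""
    if row["Func.refGene"] not in CODING_OR_SPLICE:
        return False
    if row["ExAC_ALL"] not in MISSING:  # must be absent from ExAC
        return False
    kaviar = to_float(row["Kaviar_AF"])
    if kaviar is not None and kaviar > kaviar_max:
        return False
    if row["ExonicFunc.refGene"] == "nonsynonymous SNV":
        cadd = to_float(row["CADD_phred"])
        if cadd is None or cadd <= cadd_min:
            return False
    # A nonsense-only filter could be added here by requiring
    # row["ExonicFunc.refGene"] == "stopgain".
    return True


def filter_table(lines):
    """Read a tab-separated table with a header line and return the kept rows."""
    reader = csv.DictReader(lines, delimiter="\t")
    return [row for row in reader if keep(row)]


# Tiny made-up table standing in for the real *_multianno.txt file.
header = ["Chr", "Start", "Func.refGene", "ExonicFunc.refGene",
          "ExAC_ALL", "Kaviar_AF", "CADD_phred"]
records = [
    ["1", "100", "exonic", "nonsynonymous SNV", ".", ".", "23.1"],
    ["1", "200", "exonic", "nonsynonymous SNV", ".", ".", "4.2"],
    ["1", "300", "intronic", ".", ".", ".", "."],
    ["2", "400", "splicing", ".", "0.01", ".", "."],
    ["2", "500", "exonic", "stopgain", ".", "3.0e-06", "35"],
]
demo = "\n".join("\t".join(fields) for fields in [header] + records)

kept = filter_table(io.StringIO(demo))
print([row["Start"] for row in kept])  # prints ['100', '500']
```

For a real file, pass an open file handle instead of the `io.StringIO` object, for example `filter_table(open(path, newline=""))`, where `path` is your own table.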

0

Hi Manuel, Thank you for your response. I tried using tabular.txt to filter, but it is missing my Sample IDs that are present in the corresponding VCF file, so it is not very helpful. The VCF file I have is around 95 GB and it has 1048 samples. Is it normal for tabular.txt to not have Sample IDs?

5.7 years ago by Riri • 0

1

I've typically only used ANNOVAR with single-sample VCFs, but it looks possible if your VCF file is version 4.0, using -format vcf4 and -allsample. From the ANNOVAR FAQ (http://annovar.openbioinformatics.org/en/latest/misc/faq/):

By default "vcf4" will only process the first sample, and will only print out mutations that exist in the first sample. So if you have a multi-sample VCF file, then usually only a subset of lines will exist in the output file. The -format vcf4 can be combined with -allsample argument, which will print out a separate output file for each sample in the VCF4 file (again by default, only the first sample in the VCF4 file will be processed). More importantly, if you use -format vcf4 -allsample -withfreq, then all input lines from VCF will be kept in output lines, yet an allele frequency measure is included in each line calculating the frequency of each variant among all the samples in the VCF file.
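Since the annotated VCF still carries every sample column, another option is to filter that file directly by parsing its INFO field line by line. The sketch below is only a rough illustration under some assumptions: the annotations appear in INFO as `key=value` pairs separated by `;`, with key names such as `Func.refGene`, and each line carries a single ALT allele (multiallelic records would need splitting first, for example with bcftools norm). The function names are my own, and the exact spelling of annotation values in your file should be checked before relying on it.

```python
def parse_info(info):
    """Turn a VCF INFO string into a dict; flags without '=' map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def filter_vcf(lines, predicate):
    """Yield header lines unchanged, plus records whose INFO passes predicate."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        # Split only the first eight columns so the (possibly >1000)
        # sample columns are never parsed, only passed through.
        columns = line.rstrip("\n").split("\t", 8)
        if predicate(parse_info(columns[7])):
            yield line


def is_coding_or_splice(info):
    """Example predicate; the value spellings are assumptions."""
    return info.get("Func.refGene") in {"exonic", "splicing"}


# Tiny made-up VCF standing in for the real annotated file.
demo = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n",
    "1\t100\t.\tA\tG\t50\tPASS\tAC=1;Func.refGene=exonic;ALLELE_END\tGT\t0/1\t0/0\n",
    "1\t200\t.\tC\tT\t50\tPASS\tAC=2;Func.refGene=intronic;ALLELE_END\tGT\t1/1\t0/0\n",
]

for out_line in filter_vcf(demo, is_coding_or_splice):
    print(out_line, end="")
```

Because `filter_vcf` is a generator that reads one line at a time, it can be fed an open file handle (or `gzip.open(path, "rt")` for a compressed file) without loading a 95 GB VCF into memory. The sample IDs survive because the `#CHROM` header line and the genotype columns are written out untouched.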

5.7 years ago by mbelmadani ★ 1.4k