VCF-simplify.v2
A Python parser that simplifies a VCF file into a table-like format. https://github.com/everestial/VCF-simplify
There are several tools available to manipulate and alter VCF files. But a simple and comprehensive tool that produces the simplest output an empirical biologist needs is still missing.
This tool takes in a sorted VCF file and reports a simplified table output of the INFO and FORMAT fields for each SAMPLE of interest. By default (minimal command), all the INFO and FORMAT fields for all the SAMPLEs are simplified. The fields can be narrowed down further with command-line options. See the examples given below.
The output table can be created in both "long" and "wide" format, which makes mining the data by sample versus position quite simple. The output can be further filtered downstream with awk, or loaded into R and used with tidyr and dplyr, where different columns can be accessed by matching names or prefixes and suffixes.
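The same prefix/suffix column matching can be done in plain Python. The snippet below is only a minimal sketch: the helper name select_columns is hypothetical, and it assumes the output table is tab-separated with one header row.

```python
import csv
import io

def select_columns(table_text, prefix=None, suffix=None, keep=("CHROM", "POS")):
    """Return (header, rows) limited to `keep` plus columns matching prefix/suffix.

    Hypothetical helper for illustration; it is not part of VCF-simplify.
    """
    # Assumption: the simplified table is tab-separated.
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    header = next(reader)
    wanted = [
        i for i, name in enumerate(header)
        if name in keep
        or (prefix is not None and name.startswith(prefix))
        or (suffix is not None and name.endswith(suffix))
    ]
    picked_header = [header[i] for i in wanted]
    picked_rows = [[row[i] for i in wanted] for row in reader]
    return picked_header, picked_rows

# Two records in the wide layout shown in this post.
wide_table = (
    "CHROM\tPOS\tREF\tALT\tMA605_GT\tMA605_PG\tms01e_GT\tms01e_PG\n"
    "2\t15881018\tG\tA,C\tG/G\t0/0\t./.\t./.\n"
    "2\t15881224\tT\tG\tT/T\t0/0\t./.\t./.\n"
)

# All genotype columns across samples (suffix match).
gt_header, gt_rows = select_columns(wide_table, suffix="_GT")

# All fields of one sample (prefix match).
ma_header, ma_rows = select_columns(wide_table, prefix="MA605_")
```

Because wide-format columns are named SAMPLE_FIELD, a suffix match pulls one field across samples and a prefix match pulls every field of one sample.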
Prerequisites:
Python packages and modules:
- argparse (https://docs.python.org/3/library/argparse.html)
- cyvcf2 (https://github.com/brentp/cyvcf2/)
- Python3 (https://www.python.org/)
Usage (using the given input test data):
Call for available options
python3 vcf_simplify-v2.py --help
If no options are provided, all the INFO and FORMAT fields are reported for all the SAMPLEs
python3 vcf_simplify-v2.py --vcf input_test.vcf --out simplified_vcf.txt
Report wide output and "GT" as nucleotide bases
python3 vcf_simplify-v2.py --vcf input_test.vcf --out simplified_vcf.txt --infos AF,AN,BaseQRankSum,ClippingRankSum --formats PI,GT,PG --pre_header CHROM,POS,REF,ALT,FILTER --mode wide --samples MA605,ms01e --gtbase yes
Expected output
CHROM POS REF ALT FILTER AF AN BaseQRankSum ClippingRankSum MA605_PI MA605_GT MA605_PG ms01e_PI ms01e_GT ms01e_PG
2 15881018 G A,C PASS 1.0 8 -0.771 0.0 . G/G 0/0 . ./. ./.
2 15881080 A G PASS 0.458 6 -0.732 0.0 . A/A 0/0 . ./. .
2 15881106 C CA PASS 0.042 6 0.253 0.0 . C/C 0/0 . ./. .
2 15881156 A G PASS 0.5 6 None None . A/A 0/0 . ./. .
2 15881224 T G PASS 0.036 12 1.75 0.0 . T/T 0/0 . ./. ./.
2 15881229 C G PASS 0.308 10 None None . C/C 0/0 . ./. ./.
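For readers wondering how a numeric genotype relates to the bases shown in the GT columns above, here is a minimal sketch of the idea. It is not the tool's actual implementation; the function name gt_to_bases is hypothetical, and it assumes the usual VCF convention that index 0 is REF and indices 1, 2, ... follow the comma-separated ALT alleles.

```python
import re

def gt_to_bases(gt, ref, alt):
    """Translate a numeric genotype such as '0/1' into bases such as 'G/A'.

    Hypothetical helper for illustration; not taken from VCF-simplify.
    """
    # Allele index 0 is REF; 1, 2, ... are the ALT alleles in order.
    alleles = [ref] + alt.split(",")
    # Split on the separators but keep them ('/' unphased, '|' phased).
    parts = re.split(r"([/|])", gt)
    out = []
    for part in parts:
        if part in ("/", "|"):
            out.append(part)      # keep the phasing separator as is
        elif part == ".":
            out.append(".")       # missing allele stays missing
        else:
            out.append(alleles[int(part)])
    return "".join(out)

first_record = gt_to_bases("0/0", "G", "A,C")   # first record above, sample MA605
missing_call = gt_to_bases("./.", "G", "A,C")   # no call stays './.'
```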
Report simplified output (all available fields) for samples MA605 and ms01e
python3 vcf_simplify-v2.py --vcf F1.phased_variants.Final02.vcf --out simplified_vcf.txt --samples MA605,ms01e
Expected output
CHROM POS ID REF ALT QUAL FILTER AF BaseQRankSum ClippingRankSum DP DS END ExcessHet FS HaplotypeScore InbreedingCoeff MLEAC MLEAF MQ MQRankSum QD RAW_MQ ReadPosRankSum SOR set SF AC AN MA605_AD MA605_DP MA605_GQ MA605_GT MA605_MIN_DP MA605_PGT MA605_PID MA605_PL MA605_RGQ MA605_SB MA605_PG MA605_PB MA605_PI MA605_PM MA605_PW MA605_PC ms01e_AD ms01e_DP ms01e_GQ ms01e_GT ms01e_MIN_DP ms01e_PGT ms01e_PID ms01e_PL ms01e_RGQ ms01e_SB ms01e_PG ms01e_PB ms01e_PI ms01e_PM ms01e_PW ms01e_PC
2 15881018 . G A,C 5082.45 PASS 1.0 -0.771 0.0 902 None None 0.005 0.0 None 0.8 12,1 0.462,0.038 60.29 0.0 33.99 None 0.26 0.657 HignConfSNPs 0,1,2,3,4,5,6 2,0 8 3,0,0 3 9 0/0 None None None 0,9,112,9,112,112 None None 0/0 . . . 0/0 . 0,0 0 . ./. None None None 0,0,0,.,.,. None None ./. . . . ./. .
2 15881080 . A G 4336.44 PASS 0.458 -0.732 0.0 729 None None 0.01 0.0 None 0.826 11 0.458 60.0 0.0 34.24 None -0.414 0.496 HignConfSNPs 4,5,6 0 6 5,0 5 15 0/0 None None None 0,15,181 None None 0/0 . . . 0/0 . . . . . None None None . None None . . . . . .
2 15881106 . C CA 33.32 PASS 0.042 0.253 0.0 654 None None 3.01 0.0 None -0.047 1 0.042 60.0 0.0 6.66 None 0.253 0.223 HignConfSNPs 4,5,6 0 6 6,0 6 18 0/0 None None None 0,18,206 None None 0/0 . . . 0/0 . . . . . None None None . None None . . . . . .
Report simplified output in "long" format
python3 vcf_simplify-v2.py --vcf F1.phased_variants.Final02.vcf --out simplified_vcf.txt --infos AF,AN,BaseQRankSum,ClippingRankSum --formats PI,GT,PG --pre_header CHROM,POS,REF,ALT,FILTER --mode long --samples MA605,ms01e
Expected output
CHROM POS REF ALT FILTER AF AN BaseQRankSum ClippingRankSum SAMPLE PI GT PG
2 15881018 G A,C PASS 1.0 8 -0.771 0.0 MA605 . 0/0 0/0
2 15881018 G A,C PASS 1.0 8 -0.771 0.0 ms01e . ./. ./.
2 15881080 A G PASS 0.458 6 -0.732 0.0 MA605 . 0/0 0/0
2 15881080 A G PASS 0.458 6 -0.732 0.0 ms01e . . .
2 15881106 C CA PASS 0.042 6 0.253 0.0 MA605 . 0/0 0/0
2 15881106 C CA PASS 0.042 6 0.253 0.0 ms01e . . .
2 15881156 A G PASS 0.5 6 None None MA605 . 0/0 0/0
2 15881156 A G PASS 0.5 6 None None ms01e . . .
2 15881224 T G PASS 0.036 12 1.75 0.0 MA605 . 0/0 0/0
2 15881224 T G PASS 0.036 12 1.75 0.0 ms01e . ./. ./.
2 15881229 C G PASS 0.308 10 None None MA605 . 0/0 0/0
2 15881229 C G PASS 0.308 10 None None ms01e . ./. ./.
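The relationship between the "wide" and "long" layouts can be sketched in a few lines of Python. This is an illustration under assumptions, not the tool's code: the function name wide_to_long is hypothetical, and it assumes wide columns are named SAMPLE_FIELD, as in the examples above.

```python
def wide_to_long(header, rows, samples, shared):
    """Reshape wide rows (SAMPLE_FIELD columns) into one row per sample.

    Hypothetical helper for illustration; not taken from VCF-simplify.
    """
    # Collect FORMAT field names in order of first appearance (e.g. PI, GT, PG).
    fields = []
    for name in header:
        for sample in samples:
            if name.startswith(sample + "_"):
                field = name[len(sample) + 1:]
                if field not in fields:
                    fields.append(field)
    long_header = list(shared) + ["SAMPLE"] + fields
    long_rows = []
    for row in rows:
        record = dict(zip(header, row))
        for sample in samples:
            long_rows.append(
                [record[col] for col in shared]
                + [sample]
                # A field missing for this sample is written as '.'.
                + [record.get(f"{sample}_{field}", ".") for field in fields]
            )
    return long_header, long_rows

header = ["CHROM", "POS", "REF", "ALT",
          "MA605_PI", "MA605_GT", "MA605_PG",
          "ms01e_PI", "ms01e_GT", "ms01e_PG"]
rows = [["2", "15881018", "G", "A,C", ".", "0/0", "0/0", ".", "./.", "./."]]

long_header, long_rows = wide_to_long(
    header, rows, samples=["MA605", "ms01e"], shared=["CHROM", "POS", "REF", "ALT"]
)
```

Each wide record becomes one long row per sample, with the shared site columns repeated.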
Upcoming features:
- Ability to add genotype bases for fields other than "GT".
- Write the table back to a VCF file.
Citation: Giri, B.K, (2018). VCF-simplify: A vcf simplification tool.
How does the tool handle SVs? In particular, what happens with variants reported as a) symbolic alleles or b) in breakend notation?
I haven't dealt with that directly. My assumption is that it should report them as they are.
This tool is meant to simplify the VCF output in the simplest way possible, so the simplification is limited to the structure of the output. It doesn't interpret fields or tags.
Interpreted extraction of specific tags or fields should be handled by writing custom methods with pyvcf or cyvcf. The only field that is converted into an interpretable value is "GT". See the provided examples.
Hope it helps!
Looks nice for users with minimal programming experience. Have you tested it on a wide range of VCFs?
In addition, for large multi-sample VCFs, you may consider adding some functions that can do what I have done here:
@Kevin: These look like good add-on methods for the tool. The implementation shouldn't take long.
I will have to check whether there is already a GATK method to do so for "A" and add it to the INFO field, so I am not reinventing the wheel. "B" looks like an extensive version of "B". Thanks,
I had some time to think over your question. Symbolic variants can only be mined if cyvcf2 has a method built in for it, and I think there is none, because a symbolic allele needs a method to interpret the symbol (be it a deletion overlapping the variant, an inversion, etc.). Symbolic variants only make sense in relation to alignment data (SAM, BAM) and cannot be interpreted from the REF and ALT alleles alone; therefore they cannot be extracted by cyvcf2 directly. Hence, VCF-simplify is limited in its ability to interpret symbolic variants. Hopefully this kind of issue will change in the future.
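Even without interpreting them, such records can at least be flagged by the notation of the ALT string. The snippet below is a minimal sketch, assuming the notation described in the VCF specification (angle brackets for symbolic alleles, square brackets for breakends); the function name classify_alt and the labels it returns are hypothetical and not part of VCF-simplify or cyvcf2.

```python
def classify_alt(alt):
    """Label one ALT allele by its notation, without interpreting it.

    Hypothetical helper for illustration only.
    """
    if alt.startswith("<") and alt.endswith(">"):
        return "symbolic"            # e.g. <DEL>, <INV>, <DUP:TANDEM>
    if "[" in alt or "]" in alt:
        return "breakend"            # e.g. G]17:198982] or ]13:123456]T
    if alt == ".":
        return "missing"             # no alternate allele
    if alt.startswith(".") or alt.endswith("."):
        return "single_breakend"     # e.g. .A or A.
    if alt == "*":
        return "spanning_deletion"   # allele removed by an overlapping deletion
    return "sequence"                # plain bases such as A or CA

def classify_alt_field(alt_field):
    """Classify every allele in a comma-separated ALT column value."""
    return [classify_alt(allele) for allele in alt_field.split(",")]

labels = classify_alt_field("A,<DEL>,G]17:198982]")
```

A filter built on this could pass plain sequence alleles through to the simplified table and set the rest aside for a dedicated SV tool.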