Hello biostars community,
I have recently annotated a multi-sample VCF using VEP standalone with plugins and custom annotations. For custom annotations, I used kaviar, goNl and gnomAD.
After annotation, I wanted to filter rare damaging missense variants. At this moment I have the following filter:
filter_vep \
--force_overwrite \
--input_file $1 \
--output_file $2 \
--only_matched \
--filter "gnomADG_AF_NFE < 0.01 and kaviar_AF < 0.01 and goNl_AF < 0.01" \
--filter "Polyphen2_HDIV_pred match D and SIFT4G_pred match D" \
--filter "CADD_phred > 20" \
--filter "phastCons100way_vertebrate >= 0.3 and phyloP100way_vertebrate >= 0"
For example I have a variant with the following annotation:
chr1 1500144 rs186758134 C T 2077.27 . AC=10;AF=0.179;AN=56;BaseQRankSum=3.76;ClippingRankSum=0;DB;DP=446;ExcessHet=2.2123;FS=4.535;InbreedingCoeff=0.0252;MLEAC=10;MLEAF=0.179;MQ=58.52;MQRankSum=1.14;QD=13.15;ReadPosRankSum=1.58;SOR=0.185;VariantCaller=Intersection;CSQ=T|downstream_gene_variant|MODIFIER|ATAD3B|ENSG00000160072|Transcript|ENST00000308647|protein_coding|||||||||||3942|1||HGNC|HGNC:24007|||||||||||||rs186758134|0.016|1:1500143-1500145&rs186758134|0.0010615&0.0162256|rs186758134|7.74330e-02,T|downstream_gene_variant|MODIFIER|ATAD3B|ENSG00000160072|Transcript|ENST00000472194|retained_intron|||||||||||2296|1||HGNC|HGNC:24007|||||||||||||rs186758134|0.016|1:1500143-1500145&rs186758134|0.0010615&0.0162256|rs186758134|7.74330e-02,T|downstream_gene_variant|MODIFIER|ATAD3B|ENSG00000160072|Transcript|ENST00000474481|retained_intron|||||||||||3943|1||HGNC|HGNC:24007|||||||||||||rs186758134|0.016|1:1500143-1500145&rs186758134|0.0010615&0.0162256|rs186758134|7.74330e-02,T|downstream_gene_variant|MODIFIER|ATAD3B|ENSG00000160072|Transcript|ENST00000485748|retained_intron|||||||||||3942|1||HGNC|HGNC:24007|||||||||||||rs186758134|0.016|1:1500143-1500145&rs186758134|0.0010615&0.0162256|rs186758134|7.74330e-02,T|upstream_gene_variant|MODIFIER|AL645728.2|ENSG00000284740|Transcript|ENST00000641679|transcribed_unprocessed_pseudogene|||||||||||3438|1||Clone_based_ensembl_gene||||||||||||||rs186758134|0.016|1:1500143-1500145&rs186758134|0.0010615&0.0162256|rs186758134|7.74330e-02,T|upstream_gene_variant|MODIFIER|AL645728.2|ENSG00000284740|Transcript|ENST00000641974|processed_transcript|||||||||||3106|1||Clone_based_ensembl_gene||||||||||||||rs186758134|0.016|1:1500143-1500145&rs186758134|0.0010615&0.0162256|rs186758134|7.74330e-02
For clarity, there are multiple CSQ fields:
CSQ=T|downstream_gene_variant|MODIFIER|ATAD3B|ENSG00000160072|Transcript|ENST00000308647|protein_coding|||||||||||3942|1||HGNC|HGNC:24007|||||||||||||rs186758134|0.016|1:1500143-1500145&rs186758134|0.0010615&0.0162256|rs186758134|7.74330e-02
T|downstream_gene_variant|MODIFIER|ATAD3B|ENSG00000160072|Transcript|ENST00000472194|retained_intron|||||||||||2296|1||HGNC|HGNC:24007|||||||||||||rs186758134|0.016|1:1500143-1500145&rs186758134|0.0010615&0.0162256|rs186758134|7.74330e-02
T|downstream_gene_variant|MODIFIER|ATAD3B|ENSG00000160072|Transcript|ENST00000474481|retained_intron|||||||||||3943|1||HGNC|HGNC:24007|||||||||||||rs186758134|0.016|1:1500143-1500145&rs186758134|0.0010615&0.0162256|rs186758134|7.74330e-02
T|downstream_gene_variant|MODIFIER|ATAD3B|ENSG00000160072|Transcript|ENST00000485748|retained_intron|||||||||||3942|1||HGNC|HGNC:24007|||||||||||||rs186758134|0.016|1:1500143-1500145&rs186758134|0.0010615&0.0162256|rs186758134|7.74330e-02
T|upstream_gene_variant|MODIFIER|AL645728.2|ENSG00000284740|Transcript|ENST00000641679|transcribed_unprocessed_pseudogene|||||||||||3438|1||Clone_based_ensembl_gene||||||||||||||rs186758134|0.016|1:1500143-1500145&rs186758134|0.0010615&0.0162256|rs186758134|7.74330e-02
T|upstream_gene_variant|MODIFIER|AL645728.2|ENSG00000284740|Transcript|ENST00000641974|processed_transcript|||||||||||3106|1||Clone_based_ensembl_gene||||||||||||||rs186758134|0.016|1:1500143-1500145&rs186758134|0.0010615&0.0162256|rs186758134|7.74330e-02
Take kaviar AF values for example: 0.0010615&0.0162256
I tried running filter_vep however I get this kind messages in the stdout:
Argument "0.0001896&0.8633710&0.0302146" isn't numeric in numeric lt (<) at cedrick/conda/envs/ensembl-vep/share/ensembl-vep-98.0-0/modules/Bio/EnsEMBL/VEP/FilterSet.pm line 679, <GEN0> line 2305.
How does filter_vep go about processing this? Does it skip it?
Hello Emily,
Thank you for the quick response. I used the --custom as follow: --custom ~{kaViarPath},kaviar,vcf,exact,0,AF wouldn't it have matched only the exact coordinates?
The custom VCFs match by rsID, not by coordinates, but these should be distinct.
You should check what's in the kaviar VCF. If you're getting these fields with "&" in them, is that what's in kaviar?
Hello, actually kaviar had variants with multiple allele