Hi,
I'm trying to filter my VCFs using filter_vep (https://asia.ensembl.org/info/docs/tools/vep/script/vep_filter.html) following certain criteria. Variants in my output need to pass all filters.
filter_vep \
--input_file input.vcf.gz \
--output_file out.vcf \
--format vcf \
--force_overwrite \
--only_matched \
--filter "CANONICAL is YES" \
--filter "BIOTYPE is protein_coding" \
--filter "gnomAD_AF < 0.01 or not gnomAD_AF" \
--filter "(IMPACT is HIGH and (Aloft_pred match Recessive or Aloft_pred match Dominant)) or (REVEL > 0.5) or (VEST4_rankscore > 0.5) or (MaxEntScan_diff > 0 and MaxEntScan_alt <= 8.5) or (CADD_phred > 30 and (phastCons30way_mammalian_rankscore > 0.8 or phyloP30way_mammalian_rankscore > 0.8 or GERP++_RS_rankscore > 0.8))"
However, I keep getting non-canonical transcripts and biotypes other than protein_coding, such as lncRNA, in my output. From what I understand, multiple --filter flags may be used and are treated as logical ANDs, i.e. all filters must pass for a line to be printed. I'm not sure what I am doing wrong here. Could anyone help point out any errors in my command?
Here's an example of a variant in the output file after running filter_vep:
chr1 2556714 . A G 672.77 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=0.284;DP=41;ExcessHet=3.0103;FS=6.967;MLEAC=1;MLEAF=0.5;MQ=60;MQRankSum=0;QD=16.41;ReadPosRankSum=1.19;SOR=0.454;CSQ=G|intron_variant&non_coding_transcript_variant|MODIFIER|TNFRSF14-AS1|ENSG00000238164|Transcript|ENST00000416860|lncRNA||1/5|ENST00000416860.2:n.36-18T>C|||||||rs4870||-1||SNV|HGNC|HGNC:26966|||2|||||||||||||0.6148|0.7837|0.5303|0.5397|0.4682|0.6748|0.7263|0.472|0.5136|0.7267|0.5108|0.4422|0.4915|0.4894|0.4669|0.4949|0.6332|0.7837|AFR|not_provided||1|24728327&19825846|ClinVar::VCV000135349&RCV000122164--Uniprot::VAR_013007||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||2||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||3.056|0.318|3.375|||,G|intron_variant&non_coding_transcript_variant|MODIFIER|TNFRSF14-AS1|ENSG00000238164|Transcript|ENST00000452793|lncRNA||1/3|ENST00000452793.1:n.56-18T>C|||||||rs4870||-1||SNV|HGNC|HGNC:26966|||3|||||||||||||0.6148|0.7837|0.5303|0.5397|0.4682|0.6748|0.7263|0.472|0.5136|0.7267|0.5108|0.4422|0.4915|0.4894|0.4669|0.4949|0.6332|0.7837|AFR|not_provided||1|24728327&19825846|ClinVar::VCV000135349&RCV000122164--Uniprot::VAR_013007||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||3||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||3.056|0.318|3.375||| GT:AD:DP:GQ:PL 0/1:17,24:41:99:701,0,458
Here's the CSQ header line:
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|MANE|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|GENE_PHENO|SIFT|PolyPhen|DOMAINS|miRNA|HGVS_OFFSET|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|AA_AF|EA_AF|gnomAD_AF|gnomAD_AFR_AF|gnomAD_AMR_AF|gnomAD_ASJ_AF|gnomAD_EAS_AF|gnomAD_FIN_AF|gnomAD_NFE_AF|gnomAD_OTH_AF|gnomAD_SAS_AF|MAX_AF|MAX_AF_POPS|CLIN_SIG|SOMATIC|PHENO|PUBMED|VAR_SYNONYMS|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|TRANSCRIPTION_FACTORS|REVEL|1000Gp3_AC|1000Gp3_AF|1000Gp3_AFR_AC|1000Gp3_AFR_AF|1000Gp3_AMR_AC|1000Gp3_AMR_AF|1000Gp3_EAS_AC|1000Gp3_EAS_AF|1000Gp3_EUR_AC|1000Gp3_EUR_AF|1000Gp3_SAS_AC|1000Gp3_SAS_AF|ALSPAC_AC|ALSPAC_AF|APPRIS|Aloft_Confidence|Aloft_Fraction_transcripts_affected|Aloft_pred|Aloft_prob_Dominant|Aloft_prob_Recessive|Aloft_prob_Tolerant|AltaiNeandertal|Ancestral_allele|CADD_phred|CADD_raw|CADD_raw_rankscore|DANN_rankscore|DANN_score|DEOGEN2_pred|DEOGEN2_rankscore|DEOGEN2_score|Denisova|ESP6500_AA_AC|ESP6500_AA_AF|ESP6500_EA_AC|ESP6500_EA_AF|Eigen-PC-phred_coding|Eigen-PC-raw_coding|Eigen-PC-raw_coding_rankscore|Eigen-pred_coding|Eigen-raw_coding|Eigen-raw_coding_rankscore|Ensembl_geneid|Ensembl_proteinid|Ensembl_transcriptid|ExAC_AC|ExAC_AF|ExAC_AFR_AC|ExAC_AFR_AF|ExAC_AMR_AC|ExAC_AMR_AF|ExAC_Adj_AC|ExAC_Adj_AF|ExAC_EAS_AC|ExAC_EAS_AF|ExAC_FIN_AC|ExAC_FIN_AF|ExAC_NFE_AC|ExAC_NFE_AF|ExAC_SAS_AC|ExAC_SAS_AF|ExAC_nonTCGA_AC|ExAC_nonTCGA_AF|ExAC_nonTCGA_AFR_AC|ExAC_nonTCGA_AFR_AF|ExAC_nonTCGA_AMR_AC|ExAC_nonTCGA_AMR_AF|ExAC_nonTCGA_Adj_AC|ExAC_nonTCGA_Adj_AF|ExAC_nonTCGA_EAS_AC|ExAC_nonTCGA_EAS_AF|ExAC_nonTCGA_FIN_AC|ExAC_nonTCGA_FIN_AF|ExAC_nonTCGA_NFE_AC|ExAC_nonTCGA_NFE_AF|ExAC_nonTCGA_SAS_AC|ExAC_nonTCGA_SAS_AF|ExAC_nonpsych_AC|ExAC_nonpsych_AF|ExAC_nonpsych_AFR_AC|ExAC_nonpsych_AFR_AF|ExAC_nonpsych_AMR_AC|ExAC_nonpsych_AMR_AF|ExAC_nonpsych_Adj_AC|ExAC_nonpsych_Adj_AF|ExAC_nonpsych_EAS_AC|ExAC_nonpsych_EAS_AF|ExAC_nonpsych_FIN_AC|ExAC_nonpsych_FIN_AF|ExAC_nonpsych_NFE_AC|ExAC_nonpsych_NFE_AF|ExAC_nonpsych_SAS_AC|ExAC_nonpsych_SAS_AF|FATHMM_converted_rankscore|FATHMM_pred|FATHMM_score|GENCODE_basic|GERP++_NR|GERP++_RS|GERP++_RS_rankscore|GM12878_confidence_value|GM12878_fitCons_rankscore|GM12878_fitCons_score|GTEx_V7_gene|GTEx_V7_tissue|GenoCanyon_rankscore|GenoCanyon_score|Geuvadis_eQTL_target_gene|H1-hESC_confidence_value|H1-hESC_fitCons_rankscore|H1-hESC_fitCons_score|HGVSc_ANNOVAR|HGVSc_VEP|HGVSc_snpEff|HGVSp_ANNOVAR|HGVSp_VEP|HGVSp_snpEff|HUVEC_confidence_value|HUVEC_fitCons_rankscore|HUVEC_fitCons_score|Interpro_domain|LINSIGHT|LINSIGHT_rankscore|LRT_Omega|LRT_converted_rankscore|LRT_pred|LRT_score|M-CAP_pred|M-CAP_rankscore|M-CAP_score|MPC_rankscore|MPC_score|MVP_rankscore|MVP_score|MetaLR_pred|MetaLR_rankscore|MetaLR_score|MetaSVM_pred|MetaSVM_rankscore|MetaSVM_score|MutPred_AAchange|MutPred_Top5features|MutPred_protID|MutPred_rankscore|MutPred_score|MutationAssessor_pred|MutationAssessor_rankscore|MutationAssessor_score|MutationTaster_AAE|MutationTaster_converted_rankscore|MutationTaster_model|MutationTaster_pred|MutationTaster_score|PROVEAN_converted_rankscore|PROVEAN_pred|PROVEAN_score|Polyphen2_HDIV_pred|Polyphen2_HDIV_rankscore|Polyphen2_HDIV_score|Polyphen2_HVAR_pred|Polyphen2_HVAR_rankscore|Polyphen2_HVAR_score|PrimateAI_pred|PrimateAI_rankscore|PrimateAI_score|REVEL_rankscore|REVEL_score|Reliability_index|SIFT4G_converted_rankscore|SIFT4G_pred|SIFT4G_score|SIFT_converted_rankscore|SIFT_pred|SIFT_score|SiPhy_29way_logOdds|SiPhy_29way_logOdds_rankscore|SiPhy_29way_pi|TSL|TWINSUK_AC|TWINSUK_AF|UK10K_AC|UK10K_AF|Uniprot_acc|Uniprot_entry|VEP_canonical|VEST4_rankscore|VEST4_score|VindijiaNeandertal|aaalt|aapos|aaref|alt|bStatistic|bStatistic_rankscore|cds_strand|chr|clinvar_MedGen_id|clinvar_OMIM_id|clinvar_Orphanet_id|clinvar_clnsig|clinvar_hgvs|clinvar_id|clinvar_review|clinvar_trait|clinvar_var_source|codon_degeneracy|codonpos|fathmm-MKL_coding_group|fathmm-MKL_coding_pred|fathmm-MKL_coding_rankscore|fathmm-MKL_coding_score|fathmm-XF_coding_pred|fathmm-XF_coding_rankscore|fathmm-XF_coding_score|genename|gnomAD_exomes_AC|gnomAD_exomes_AF|gnomAD_exomes_AFR_AC|gnomAD_exomes_AFR_AF|gnomAD_exomes_AFR_AN|gnomAD_exomes_AFR_nhomalt|gnomAD_exomes_AMR_AC|gnomAD_exomes_AMR_AF|gnomAD_exomes_AMR_AN|gnomAD_exomes_AMR_nhomalt|gnomAD_exomes_AN|gnomAD_exomes_ASJ_AC|gnomAD_exomes_ASJ_AF|gnomAD_exomes_ASJ_AN|gnomAD_exomes_ASJ_nhomalt|gnomAD_exomes_EAS_AC|gnomAD_exomes_EAS_AF|gnomAD_exomes_EAS_AN|gnomAD_exomes_EAS_nhomalt|gnomAD_exomes_FIN_AC|gnomAD_exomes_FIN_AF|gnomAD_exomes_FIN_AN|gnomAD_exomes_FIN_nhomalt|gnomAD_exomes_NFE_AC|gnomAD_exomes_NFE_AF|gnomAD_exomes_NFE_AN|gnomAD_exomes_NFE_nhomalt|gnomAD_exomes_POPMAX_AC|gnomAD_exomes_POPMAX_AF|gnomAD_exomes_POPMAX_AN|gnomAD_exomes_POPMAX_nhomalt|gnomAD_exomes_SAS_AC|gnomAD_exomes_SAS_AF|gnomAD_exomes_SAS_AN|gnomAD_exomes_SAS_nhomalt|gnomAD_exomes_controls_AC|gnomAD_exomes_controls_AF|gnomAD_exomes_controls_AFR_AC|gnomAD_exomes_controls_AFR_AF|gnomAD_exomes_controls_AFR_AN|gnomAD_exomes_controls_AFR_nhomalt|gnomAD_exomes_controls_AMR_AC|gnomAD_exomes_controls_AMR_AF|gnomAD_exomes_controls_AMR_AN|gnomAD_exomes_controls_AMR_nhomalt|gnomAD_exomes_controls_AN|gnomAD_exomes_controls_ASJ_AC|gnomAD_exomes_controls_ASJ_AF|gnomAD_exomes_controls_ASJ_AN|gnomAD_exomes_controls_ASJ_nhomalt|gnomAD_exomes_controls_EAS_AC|gnomAD_exomes_controls_EAS_AF|gnomAD_exomes_controls_EAS_AN|gnomAD_exomes_controls_EAS_nhomalt|gnomAD_exomes_controls_FIN_AC|gnomAD_exomes_controls_FIN_AF|gnomAD_exomes_controls_FIN_AN|gnomAD_exomes_controls_FIN_nhomalt|gnomAD_exomes_controls_NFE_AC|gnomAD_exomes_controls_NFE_AF|gnomAD_exomes_controls_NFE_AN|gnomAD_exomes_controls_NFE_nhomalt|gnomAD_exomes_controls_POPMAX_AC|gnomAD_exomes_controls_POPMAX_AF|gnomAD_exomes_controls_POPMAX_AN|gnomAD_exomes_controls_POPMAX_nhomalt|gnomAD_exomes_controls_SAS_AC|gnomAD_exomes_controls_SAS_AF|gnomAD_exomes_controls_SAS_AN|gnomAD_exomes_controls_SAS_nhomalt|gnomAD_exomes_controls_nhomalt|gnomAD_exomes_flag|gnomAD_exomes_nhomalt|gnomAD_genomes_AC|gnomAD_genomes_AF|gnomAD_genomes_AFR_AC|gnomAD_genomes_AFR_AF|gnomAD_genomes_AFR_AN|gnomAD_genomes_AFR_nhomalt|gnomAD_genomes_AMR_AC|gnomAD_genomes_AMR_AF|gnomAD_genomes_AMR_AN|gnomAD_genomes_AMR_nhomalt|gnomAD_genomes_AN|gnomAD_genomes_ASJ_AC|gnomAD_genomes_ASJ_AF|gnomAD_genomes_ASJ_AN|gnomAD_genomes_ASJ_nhomalt|gnomAD_genomes_EAS_AC|gnomAD_genomes_EAS_AF|gnomAD_genomes_EAS_AN|gnomAD_genomes_EAS_nhomalt|gnomAD_genomes_FIN_AC|gnomAD_genomes_FIN_AF|gnomAD_genomes_FIN_AN|gnomAD_genomes_FIN_nhomalt|gnomAD_genomes_NFE_AC|gnomAD_genomes_NFE_AF|gnomAD_genomes_NFE_AN|gnomAD_genomes_NFE_nhomalt|gnomAD_genomes_POPMAX_AC|gnomAD_genomes_POPMAX_AF|gnomAD_genomes_POPMAX_AN|gnomAD_genomes_POPMAX_nhomalt|gnomAD_genomes_controls_AC|gnomAD_genomes_controls_AF|gnomAD_genomes_controls_AFR_AC|gnomAD_genomes_controls_AFR_AF|gnomAD_genomes_controls_AFR_AN|gnomAD_genomes_controls_AFR_nhomalt|gnomAD_genomes_controls_AMR_AC|gnomAD_genomes_controls_AMR_AF|gnomAD_genomes_controls_AMR_AN|gnomAD_genomes_controls_AMR_nhomalt|gnomAD_genomes_controls_AN|gnomAD_genomes_controls_ASJ_AC|gnomAD_genomes_controls_ASJ_AF|gnomAD_genomes_controls_ASJ_AN|gnomAD_genomes_controls_ASJ_nhomalt|gnomAD_genomes_controls_EAS_AC|gnomAD_genomes_controls_EAS_AF|gnomAD_genomes_controls_EAS_AN|gnomAD_genomes_controls_EAS_nhomalt|gnomAD_genomes_controls_FIN_AC|gnomAD_genomes_controls_FIN_AF|gnomAD_genomes_controls_FIN_AN|gnomAD_genomes_controls_FIN_nhomalt|gnomAD_genomes_controls_NFE_AC|gnomAD_genomes_controls_NFE_AF|gnomAD_genomes_controls_NFE_AN|gnomAD_genomes_controls_NFE_nhomalt|gnomAD_genomes_controls_POPMAX_AC|gnomAD_genomes_controls_POPMAX_AF|gnomAD_genomes_controls_POPMAX_AN|gnomAD_genomes_controls_POPMAX_nhomalt|gnomAD_genomes_controls_nhomalt|gnomAD_genomes_flag|gnomAD_genomes_nhomalt|hg18_chr|hg18_pos(1-based)|hg19_chr|hg19_pos(1-based)|integrated_confidence_value|integrated_fitCons_rankscore|integrated_fitCons_score|phastCons100way_vertebrate|phastCons100way_vertebrate_rankscore|phastCons17way_primate|phastCons17way_primate_rankscore|phastCons30way_mammalian|phastCons30way_mammalian_rankscore|phyloP100way_vertebrate|phyloP100way_vertebrate_rankscore|phyloP17way_primate|phyloP17way_primate_rankscore|phyloP30way_mammalian|phyloP30way_mammalian_rankscore|pos(1-based)|ref|refcodon|rs_dbSNP151|TSSDistance|MaxEntScan_alt|MaxEntScan_diff|MaxEntScan_ref|GO|miRNA|FunMotifs">
Thanks, Ben. I don't think the --gz flag was the issue. I still get output without the --gz flag, but the variants in my output don't pass all of the filters.
When I combine all my filters into one line, as below, without the --gz flag, I get my desired output, i.e. variants passing all my filters. But when I use multiple --filter flags, it seems to treat each filter separately, behaving more like an "OR" operator.
Hi Jan,
Yes, when multiple filters are used they behave like 'AND' operators. Looking in more detail, it seems the problem is that the last filter is missing some parentheses:
Without the parentheses, when the filters are merged, the final filter is:
Instead of being:
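The snippets from this reply were not preserved here. As a rough illustration of the precedence problem (a sketch, not the original code from the reply), the shell functions below build the single expression that several --filter values would collapse into, assuming they are simply joined with "and" as described above. The helper names join_naive and join_wrapped are hypothetical and are not part of filter_vep. Without outer parentheses, the top-level "or" clauses of the last filter end up at the same level as the "and" joins, so a record might pass on one "or" branch alone. The lncRNA record in the question appears to have MaxEntScan_diff = 0.318 and MaxEntScan_alt = 3.056, which would satisfy the MaxEntScan branch by itself and fits this explanation.

```shell
# Sketch only: join_naive and join_wrapped are hypothetical helpers, not part of filter_vep.
# Assumption: multiple --filter values are merged by joining them with " and ".

# Join the filters with " and " exactly as written.
join_naive() {
  merged=""
  for f in "$@"; do
    if [ -z "$merged" ]; then
      merged="$f"
    else
      merged="$merged and $f"
    fi
  done
  printf '%s\n' "$merged"
}

# Wrap every filter in its own parentheses before joining, so each stays one unit.
join_wrapped() {
  merged=""
  for f in "$@"; do
    if [ -z "$merged" ]; then
      merged="($f)"
    else
      merged="$merged and ($f)"
    fi
  done
  printf '%s\n' "$merged"
}

# Shortened versions of two filters from the question.
naive=$(join_naive "CANONICAL is YES" "(REVEL > 0.5) or (VEST4_rankscore > 0.5)")
wrapped=$(join_wrapped "CANONICAL is YES" "(REVEL > 0.5) or (VEST4_rankscore > 0.5)")

printf '%s\n' "$naive"
printf '%s\n' "$wrapped"
```

In the real command this amounts to adding one outer pair of parentheses inside the quotes of any --filter value that contains a top-level "or", for example --filter "(gnomAD_AF < 0.01 or not gnomAD_AF)", and likewise around the whole of the long final filter.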
Thanks for the correction, Ben. It is now working as intended.