I've been analysing some next-generation DNA sequencing datasets over the past months, thoroughly examining both the datasets themselves and the tools used to analyse them.
In one of the last steps, I came across a VCF file (simplified to the text below to give a reproducible case):
[cedric@laptop]:/data_error$ tail insertions.vcf |cut -f 1,2,4,5,6
#CHROM POS REF ALT QUAL
chrX 48991024 T TG .
chrX 48996531 T C .
chrX 49068386 C CT .
chrX 49068452 C T .
chrX 49068845 CTG C .
chrX 123697721 C CT .
chrX 135052062 A AG .
Running vep on this vcf file:
../../vep/vep -i insertions.vcf -o outputvep_inserts --cache --force_overwrite --symbol
generates the following vep output file:
cat outputvep_inserts|cut -f 2,3,11,12|uniq
Location Allele Amino_acids Codons
chrX:48991024-48991025 G -/X -/C
chrX:48991024-48991025 G - -
chrX:48991024-48991025 G -/X -/C
chrX:48991024-48991025 G - -
chrX:48991024-48991025 G P/PX cca/ccCa
chrX:48991024-48991025 G -/X -/C
chrX:48996531 C - -
chrX:49068386-49068387 T - -
chrX:49068452 T - -
chrX:49068846-49068847 - - -
chrX:123697721-123697722 T S/KX agc/aAgc
chrX:123697721-123697722 T - -
chrX:123697721-123697722 T S/KX agc/aAgc
chrX:123697721-123697722 T - -
chrX:135052062-135052063 G L/PX ctc/cCtc
chrX:135052062-135052063 G - -
Could anybody verify that the Codons column actually shows the result we should expect? To me it seems that, although the "Allele" field reports the correct nucleotide most of the time, the Codons field reports completely different insertions/deletions from those in the VCF file.
Please note that the VCF file is human and was generated by Mutect2 according to the GATK best practices. This file contains only the erroneous entries out of the 20000 entries in the full VCF file; on the rest of the full file vep behaves as expected. So I don't suspect an installation error to be the cause of this behaviour.
Thanks in advance, Cedric
full vep output:
[cedric@laptop]:/Mupexi/Mupexi/data_error$ cat outputvep_inserts
> ## ENSEMBL VARIANT EFFECT PREDICTOR v90.9
> ## Output produced at 2017-12-05 15:40:58
> ## Using cache in /media/cedric/Extra_space_linu/.vep/homo_sapiens/90_GRCh38
> ## Using API version 90, DB version ?
> ## ensembl-io version 90.9a148ea
> ## ensembl-variation version 90.00c29b7
> ## ensembl-funcgen version 90.743f32b
> ## ensembl version 90.4a44397
> ## dbSNP version 150
> ## ESP version V2-SSA137
> ## gencode version GENCODE 27
> ## 1000genomes version phase3
> ## ClinVar version 201706
> ## sift version sift5.2.2
> ## regbuild version 16
> ## genebuild version 2014-07
> ## assembly version GRCh38.p10
> ## COSMIC version 81
> ## gnomAD version 170228
> ## polyphen version 2.2.2
> ## HGMD-PUBLIC version 20164
> ## Column descriptions:
> ## Uploaded_variation : Identifier of uploaded variant
> ## Location : Location of variant in standard coordinate format (chr:start or chr:start-end)
> ## Allele : The variant allele used to calculate the consequence
> ## Gene : Stable ID of affected gene
> ## Feature : Stable ID of feature
> ## Feature_type : Type of feature - Transcript, RegulatoryFeature or MotifFeature
> ## Consequence : Consequence type
> ## cDNA_position : Relative position of base pair in cDNA sequence
> ## CDS_position : Relative position of base pair in coding sequence
> ## Protein_position : Relative position of amino acid in protein
> ## Amino_acids : Reference and variant amino acids
> ## Codons : Reference and variant codon sequence
> ## Existing_variation : Identifier(s) of co-located known variants
> ## Extra column keys:
> ## IMPACT : Subjective impact classification of consequence type
> ## DISTANCE : Shortest distance from variant to transcript
> ## STRAND : Strand of the feature (1/-1)
> ## FLAGS : Transcript quality flags
> ## SYMBOL : Gene symbol (e.g. HGNC)
> ## SYMBOL_SOURCE : Source of gene symbol
> ## HGNC_ID : Stable identifer of HGNC gene symbol
> #Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation Extra
> . chrX:48991024-48991025 G ENSG00000068400 ENST00000376423 Transcript frameshift_variant 578-579 543-544 181-182 -/X -/C - IMPACT=HIGH;STRAND=-1;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48991024-48991025 G ENSG00000068400 ENST00000473581 Transcript non_coding_transcript_exon_variant 362-363 - - - - - IMPACT=MODIFIER;STRAND=-1;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48991024-48991025 G ENSG00000068400 ENST00000474512 Transcript upstream_gene_variant - - - - - - IMPACT=MODIFIER;DISTANCE=2373;STRAND=-1;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48991024-48991025 G ENSG00000068400 ENST00000593475 Transcript frameshift_variant 548-549 543-544 181-182 -/X -/C - IMPACT=HIGH;STRAND=-1;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48991024-48991025 G ENSG00000068400 ENST00000611757 Transcript non_coding_transcript_exon_variant 417-418 - - - - - IMPACT=MODIFIER;STRAND=-1;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48991024-48991025 G ENSG00000068400 ENST00000617369 Transcript downstream_gene_variant - - - - - - IMPACT=MODIFIER;DISTANCE=2423;STRAND=-1;FLAGS=cds_end_NF;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48991024-48991025 G ENSG00000068400 ENST00000619149 Transcript upstream_gene_variant - - - - - - IMPACT=MODIFIER;DISTANCE=3207;STRAND=-1;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48991024-48991025 G ENSG00000068400 ENST00000621664 Transcript frameshift_variant,NMD_transcript_variant 144-145 146-147 49 P/PX cca/ccCa - IMPACT=HIGH;STRAND=-1;FLAGS=cds_start_NF;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48991024-48991025 G ENSG00000068400 ENST00000622231 Transcript frameshift_variant 352-353 348-349 116-117 -/X -/C - IMPACT=HIGH;STRAND=-1;FLAGS=cds_end_NF;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48991024-48991025 G ENSG00000068400 ENST00000622599 Transcript frameshift_variant 428-429 408-409 136-137 -/X -/C - IMPACT=HIGH;STRAND=-1;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48996531 C ENSG00000068400 ENST00000376423 Transcript intron_variant - - - - - - IMPACT=MODIFIER;STRAND=-1;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48996531 C ENSG00000068400 ENST00000480041 Transcript downstream_gene_variant - - - - - - IMPACT=MODIFIER;DISTANCE=1417;STRAND=-1;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48996531 C ENSG00000068400 ENST00000495258 Transcript downstream_gene_variant - - - - - - IMPACT=MODIFIER;DISTANCE=2443;STRAND=-1;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48996531 C ENSG00000068400 ENST00000593475 Transcript intron_variant - - - - - - IMPACT=MODIFIER;STRAND=-1;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48996531 C ENSG00000068400 ENST00000611705 Transcript downstream_gene_variant - - - - - - IMPACT=MODIFIER;DISTANCE=728;STRAND=-1;FLAGS=cds_end_NF;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48996531 C ENSG00000068400 ENST00000611757 Transcript intron_variant,non_coding_transcript_variant - - - - - - IMPACT=MODIFIER;STRAND=-1;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48996531 C ENSG00000068400 ENST00000617369 Transcript intron_variant - - - - - - IMPACT=MODIFIER;STRAND=-1;FLAGS=cds_end_NF;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48996531 C ENSG00000068400 ENST00000621664 Transcript intron_variant,NMD_transcript_variant - - - - - - IMPACT=MODIFIER;STRAND=-1;FLAGS=cds_start_NF;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48996531 C ENSG00000068400 ENST00000622231 Transcript intron_variant - - - - - - IMPACT=MODIFIER;STRAND=-1;FLAGS=cds_end_NF;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48996531 C ENSG00000068400 ENST00000622599 Transcript intron_variant - - - - - - IMPACT=MODIFIER;STRAND=-1;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
..
Tagging: Emily_Ensembl
I am not close to the computer but I bet those genes are on the minus strand.
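To illustrate the minus-strand idea, here is a minimal Python sketch. It assumes that the Allele column is given on the forward (genomic) strand, as in the VCF, while the Codons column is written in the orientation of the transcript, so for a STRAND=-1 transcript the inserted base should show up reverse-complemented. The function names are made up for this example and are not part of vep. The STRAND=-1 value is taken from the GRIPAP1 rows in the output above; for the other two positions the strand is not visible in the truncated output, so -1 is only an assumption there.

```python
# Complement table for the four DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def inserted_bases(ref, alt):
    """Bases added by a simple VCF insertion, i.e. ALT starts with REF."""
    if not alt.startswith(ref) or len(alt) <= len(ref):
        raise ValueError("not a simple insertion: %s > %s" % (ref, alt))
    return alt[len(ref):]


def allele_on_transcript(ref, alt, strand):
    """Inserted bases as they would read on a transcript of the given strand."""
    bases = inserted_bases(ref, alt)
    return reverse_complement(bases) if strand == -1 else bases


# (position, REF, ALT, strand, upper-case bases seen in the Codons column)
# Strand is confirmed as -1 only for 48991024 (GRIPAP1); the other two
# strands are assumed for the sake of the illustration.
examples = [
    (48991024, "T", "TG", -1, "C"),   # Codons: cca/ccCa
    (123697721, "C", "CT", -1, "A"),  # Codons: agc/aAgc
    (135052062, "A", "AG", -1, "C"),  # Codons: ctc/cCtc
]

for pos, ref, alt, strand, codon_bases in examples:
    predicted = allele_on_transcript(ref, alt, strand)
    status = "matches" if predicted == codon_bases else "differs from"
    print("chrX:%d %s>%s predicted %s, %s Codons column (%s)"
          % (pos, ref, alt, predicted, status, codon_bases))
```

If the assumption holds, the inserted G appearing as C in the codon (and the inserted T as A) would be the expected behaviour rather than an error. Checking STRAND in the Extra column for the remaining transcripts would confirm it.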