Hey everyone,
I have a question about the VEP results. Why are there for some variants multiple features like consequence, gene symbol, ensemble id...? And why does it get more, if I have more samples?
Shouldn't the gene be specified by the position on the genome?
An example is this (around 20 samples, after bcftools +vep-split
):
CHROM POS REF ALT ID Consequence SYMBOL Existing_variation VARIANT_CLASS Gene
17 81645307 G A . intron_variant&non_coding_transcript_variant,non_coding_transcript_exon_variant,missense_variant,missense_variant,missense_variant&NMD_transcript_variant,regulatory_region_variant NPLOC4,TSPAN10,TSPAN10,TSPAN10,TSPAN10,. rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617 SNV,SNV,SNV,SNV,SNV,SNV ENSG00000182446,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,.
The same variant with around 500 samples (including the above 17):
CHROM POS REF ALT ID Consequence SYMBOL Existing_variation VARIANT_CLASS Gene
17 81645307 G A . intron_variant&non_coding_transcript_variant,intron_variant&non_coding_transcript_variant,non_coding_transcript_exon_variant,non_coding_transcript_exon_variant,missense_variant,missense_variant,missense_variant,missense_variant,missense_variant&NMD_transcript_variant,missense_variant&NMD_transcript_variant,regulatory_region_variant,regulatory_region_variant NPLOC4,NPLOC4,TSPAN10,TSPAN10,TSPAN10,TSPAN10,TSPAN10,TSPAN10,TSPAN10,TSPAN10,.,. rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617 SNV,SNV,SNV,SNV,SNV,SNV,SNV,SNV,SNV,SNV,SNV,SNV ENSG00000182446,ENSG00000182446,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,.,.
One explanation for multiple entries that I could think of could be that a variant can sit in a coding region for one gene and in a regulatory region for another. However, this does not explain, why there's a different amount at different n. Can someone explain this to me?
My VCF are called with Haplotypecaller
per sample and then merged and the merged VCF is then annotated.
Could you elaborate this in a little more detail. I get such an output e.g. (already after
bcftools +split-vep
to make it more readable and I left out some columns):How do I know which of these annotations is for what? I haven't found this comma form in the manpages for vep.
If you have the latest bcftools you can use
--select
argument. If you're going to pick more than one transcript you might want to add-d
parameter.More about it can be found in the how to guide. https://samtools.github.io/bcftools/howtos/plugin.split-vep.html
Ok, I types a really long answer and then I hit the edit button on another message and it was gone. Maybe it would be nice to have drafts saved here, but that's another topic :). So, let's try again (in a shorter way):
My question is not really about filtering, but more about understanding the output. One of the first answers was: "VEP annotates for every allele, gene and transcript". So, isn't there a way to check, which annotation is for wich? So which is for the gene, which for the alleles and which for the different transcripts? If I understand it correctly, using
-d -s PICK=1
simply takes the first entry, but that's not really, what I want to do.E.g. I also annotated with
AlphaMissense
and here I have the problem that some entries, even though they show an aa change, don't show anAlphaMissense
annotation. So simply picking the first would definitely not work here:So, all are missense, but only 2 and 3 have
AM
annotations. That's why I would rather get all outputs and then check, which one to use, rather then simply picking one.That problem is not exclusive to biostars but to any form you're filling on a web browser. I suggest you do all your editing in a reliable text editor and just copy and paste to form after you finish editing, or you can use something like ghosttext.
You can see that from the annotation itself. You will see multiple genes, transcripts lines for the given input line.
Let's inspect a locus that overlaps two genes:
Let's create a multi-allelic example input for VEP:
And let's inspect the results.
| index | pos | allele | consequence | symbol | transcript |
| 1 | 1:45340254-45340254 | C | start_lost | TOE1 | ENST00000372090.6 |
| 2 | 1:45340254-45340254 | A | start_lost | TOE1 | ENST00000372090.6 |
| 3 | 1:45340254-45340254 | G | start_lost | TOE1 | ENST00000372090.6 |
| 4 | 1:45340254-45340254 | C | start_lost | MUTYH | ENST00000372098.7 |
| 5 | 1:45340254-45340254 | A | start_lost | MUTYH | ENST00000372098.7 |
| 6 | 1:45340254-45340254 | G | start_lost | MUTYH | ENST00000372098.7 |
| 7 | 1:45340254-45340254 | C | upstream_gene_variant | MUTYH | ENST00000372104.5 |
| 8 | 1:45340254-45340254 | A | upstream_gene_variant | MUTYH | ENST00000372104.5 |
| 9 | 1:45340254-45340254 | G | upstream_gene_variant | MUTYH | ENST00000372104.5 |
The rows 1,2,3 are output for C, A, G alleles for TOE1 gene, annotations for different input alleles.
The rows 4,5,6 are again annotations for different alleles but also for different gene MUTYH.
And the rows 7,8,9 are annotation for different alleles for the gene MUTYH but this time for a different transcript of the MUTYH.
AlphaMissense might only have annotations for some transcripts. That might be why you're not seeing the results in all of them. I suggest you check the AlphaMissense tsv file you downloaded and see for yourself.