Question

VEP outputs more rs ids than input, why?

0

6.2 years ago

bioinfo456 ▴ 150

I've downloaded the standalone version of VEP.

My input to the tool is 109,153 rs ids, but I seem to be getting an output of 142,325 rs ids. Why is this so?

Which are the additional rs ids in the output file?

Thanks in advance.

SNP ensembl vep • 4.6k views

0

Add exact commands, please

ADD REPLY • link 6.2 years ago by lakhujanivijay 5.9k

0

./vep --cache --force_overwrite -i p0_01.csv --vcf --fork 4  -o case_vep_output.vcf

ADD REPLY • link 6.2 years ago by bioinfo456 ▴ 150

0

Any thoughts anybody?

ADD REPLY • link 6.2 years ago by bioinfo456 ▴ 150

0

You are not getting responses because your question is unspecific. How do you define "more rs ids"? More lines, as in wc -l, or more lines containing 'rs', as in grep -c 'rs'? Please elaborate.

ADD REPLY • link 6.2 years ago by ATpoint 86k

0

Output file contains more entries than the input file.

ADD REPLY • link 6.2 years ago by bioinfo456 ▴ 150

0

Can you show some of the new entries?

ADD REPLY • link 6.2 years ago by ATpoint 86k

0

uday@uday-desktop:~/ensembl-vep$ wc -l temp1.csv
109153 temp1.csv
uday@uday-desktop:~/ensembl-vep$ wc -l temp2.csv
142325 temp2.csv
uday@uday-desktop:~/ensembl-vep$ grep -c 'rs' temp1.csv
108774
uday@uday-desktop:~/ensembl-vep$ grep -c 'rs' temp2.csv
142325

temp1.csv is the list of input rs ids and temp2.csv is the list of rs ids in the VEP output.

ADD REPLY • link 6.2 years ago by bioinfo456 ▴ 150

1

Are you sure that every single line in your files has a unique rsID? Could the same rsID be listed multiple times because it has different possible effects in different transcripts?
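
One way to check this is to compare the number of lines with the number of distinct rsIDs. Below is a minimal Python sketch, assuming one record per line and that an rsID is any token of the form rs followed by digits. The function name rsid_counts and the sample lines are hypothetical and only serve the illustration.

```python
import re
from collections import Counter

# Assumption: an rsID is "rs" followed by digits, as a whole word.
RSID = re.compile(r"\brs\d+\b")

def rsid_counts(lines):
    """Count how many lines mention each rsID (first ID on each line only)."""
    counts = Counter()
    for line in lines:
        match = RSID.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts

# Hypothetical output lines: rs1 is reported once per transcript.
sample = ["rs1\tENST0001", "rs1\tENST0002", "rs2\tENST0003"]
counts = rsid_counts(sample)
repeated = {rsid: n for rsid, n in counts.items() if n > 1}
print(len(sample), "lines,", len(counts), "distinct rsIDs")
print("repeated:", repeated)
```

To run it on a real file, pass the open file handle instead of the sample list, for example rsid_counts(open("temp2.csv")).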

ADD REPLY • link 6.2 years ago by swbarnes2 14k

0

Thanks for your response. I don't see a column with that name in the output file. However, I gave rs ids as input to VEP and got more rs ids back than expected. My questions are:

1. Is the tool designed to do this, as you've mentioned?
2. Is it meant to report all merged SNPs at a particular locus?
3. How do I deal with SNPs given in the form chr:location? The output says "No variant found" for such formats.

ADD REPLY • link 6.2 years ago by bioinfo456 ▴ 150

2

Where are you seeing the rsIDs? In what column of the output? Can you show us a few lines of your output please? Could you tell us some rsIDs that appear in your output that are not in your input?

The aim of the VEP is to tell you the effects of variants on genes. You can input data in a variety of formats including VCF, lists of variant IDs and HGVS. You do not need to know the rsID of the variants you input and the variants can be novel. For every variant, it will tell you which genes it hits and the effects on those genes. If the variant is already known in the database, it will also tell you the identifier (including rsID, COSMIC ID and many more) and give you relevant information about that variant, such as frequency and clinical significance.

You can input variants without an rsID using only the location, if you use one of the accepted formats. You cannot use a mixed format input file. If you have some variants with just an rsID and others with just a location, you will need to do two queries. If most of your data is a list of rsIDs, the VEP is looking for all the inputs to be variant identifiers and will give a "no variant found" message for anything that is not.
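
Splitting a mixed file into two query files could be sketched as follows in Python. This is only an illustration: it assumes one entry per line and treats anything that is not rs followed by digits as "other". The function name split_by_format and the sample entries are hypothetical. The second group would still need converting to one of the accepted formats, which this sketch does not attempt.

```python
import re

# Assumption: a whole-line match of "rs" + digits marks a variant identifier.
RSID = re.compile(r"^rs\d+$")

def split_by_format(entries):
    """Separate rsIDs from everything else so each group can be queried on its own."""
    rsids, others = [], []
    for entry in entries:
        entry = entry.strip()
        if not entry:
            continue  # skip blank lines
        if RSID.match(entry):
            rsids.append(entry)
        else:
            others.append(entry)
    return rsids, others

# Hypothetical mixed input: two rsIDs and one chr:location entry.
mixed = ["rs55746161", "1:1130420", "rs11580120", ""]
rsids, others = split_by_format(mixed)
print(rsids)
print(others)
```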

ADD REPLY • link 6.2 years ago by Emily 24k

0

##fileformat=VCFv4.1
##VEP="v94" time="2018-11-19 03:50:42" cache="/home/uday/.vep/homo_sapiens/94_GRCh38" db="homo_sapiens_core_94_38@ensembldb.ensembl.org" ensembl=94.5c08d90 ensembl-io=94.8d53275 ensembl-variation=94.066b102 ensembl-funcgen=94.08b0c13 1000genomes="phase3" COSMIC="86" ClinVar="201807" ESP="V2-SSA137" HGMD-PUBLIC="20174" assembly="GRCh38.p12" dbSNP="151" gencode="GENCODE 29" genebuild="2014-07" gnomAD="170228" polyphen="2.2.2" regbuild="16" sift="sift5.2.2"
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1   1068801 rs55746161  C   A,G .   .   CSQ=A|downstream_gene_variant|MODIFIER|AL390719.1|ENSG00000217801|Transcript|ENST00000394517|processed_transcript|||||||||||2796|1||Clone_based_ensembl_gene|,G|downstream_gene_variant|MODIFIER|AL390719.1|ENSG00000217801|Transcript|ENST00000394517|processed_transcript|||||||||||2796|1||Clone_based_ensembl_gene|,A|intron_variant&non_coding_transcript_variant|MODIFIER|AL390719.1|ENSG00000217801|Transcript|ENST00000412397|transcribed_unprocessed_pseudogene||9/9||||||||||1||Clone_based_ensembl_gene|,G|intron_variant&non_coding_transcript_variant|MODIFIER|AL390719.1|ENSG00000217801|Transcript|ENST00000412397|transcribed_unprocessed_pseudogene||9/9||||||||||1||Clone_based_ensembl_gene|,A|downstream_gene_variant|MODIFIER|AL390719.1|ENSG00000217801|Transcript|ENST00000427998|processed_transcript|||||||||||2345|1||Clone_based_ensembl_gene|,G|downstream_gene_variant|MODIFIER|AL390719.1|ENSG00000217801|Transcript|ENST00000427998|processed_transcript|||||||||||2345|1||Clone_based_ensembl_gene|,A|downstream_gene_variant|MODIFIER|AL390719.1|ENSG00000217801|Transcript|ENST00000433695|processed_transcript|||||||||||2527|1||Clone_based_ensembl_gene|,G|downstream_gene_variant|MODIFIER|AL390719.1|ENSG00000217801|Transcript|ENST00000433695|processed_transcript|||||||||||2527|1||Clone_based_ensembl_gene|,A|downstream_gene_variant|MODIFIER|AL390719.1|ENSG00000217801|Transcript|ENST00000451054|processed_transcript|||||||||||2360|1||Clone_based_ensembl_gene|,G|downstream_gene_variant|MODIFIER|AL390719.1|ENSG00000217801|Transcript|ENST00000451054|processed_transcript|||||||||||2360|1||Clone_based_ensembl_gene|,A|downstream_gene_variant|MODIFIER|RNF223|ENSG00000237330|Transcript|ENST00000453464|protein_coding|||||||||||2165|-1||HGNC|HGNC:40020,G|downstream_gene_variant|MODIFIER|RNF223|ENSG00000237330|Transcript|ENST00000453464|protein_coding|||||||||||2165|-1||HGNC|HGNC:40020,A|downstream_gene_variant|MODIFIER|AL390719.1|ENSG00000217801|Transcript|ENST00000456409|processed_transcript|||||||||||2345|1||Clone_based_ensembl_gene|,G|downstream_gene_variant|MODIFIER|AL390719.1|ENSG00000217801|Transcript|ENST00000456409|processed_transcript|||||||||||2345|1||Clone_based_ensembl_gene|
1   1130420 rs11580120  C   T   .   .   CSQ=T|intergenic_variant|MODIFIER||||||||||||||||||||
1   1130717 rs61766345  G   A   .   .   CSQ=A|intergenic_variant|MODIFIER||||||||||||||||||||
1   1132196 rs11589263  G   A   .   .   CSQ=A|upstream_gene_variant|MODIFIER|LINC01342|ENSG00000223823|Transcript|ENST00000416774|lincRNA|||||||||||4821|1||HGNC|HGNC:50551
1   1132482 rs9442374   T   C,G .   .   CSQ=C|upstream_gene_variant|MODIFIER|LINC01342|ENSG00000223823|Transcript|ENST00000416774|lincRNA|||||||||||4535|1||HGNC|HGNC:50551,G|upstream_gene_variant|MODIFIER|LINC01342|ENSG00000223823|Transcript|ENST00000416774|lincRNA|||||||||||4535|1||HGNC|HGNC:50551
1   1133503 rs61766346  G   A   .   .   CSQ=A|upstream_gene_variant|MODIFIER|LINC01342|ENSG00000223823|Transcript|ENST00000416774|lincRNA|||||||||||3514|1||HGNC|HGNC:50551
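
In records like these, each comma-separated CSQ entry is one annotation (one allele against one transcript), and the field order is given by the Format string in the ##INFO=<ID=CSQ header line. A rough Python sketch of unpacking one record is shown below. The function names are hypothetical, and the header and INFO strings are shortened to four fields for the illustration rather than copied in full from the output above.

```python
def csq_fields(header_line):
    """Pull the field names out of the ##INFO=<ID=CSQ,...> header line."""
    fmt = header_line.split("Format: ", 1)[1]
    return fmt.rstrip('">\n').split("|")

def csq_entries(info, fields):
    """Turn the CSQ value of one record into a list of dicts, one per annotation."""
    for item in info.split(";"):
        if item.startswith("CSQ="):
            return [dict(zip(fields, entry.split("|")))
                    for entry in item[4:].split(",")]
    return []

# Shortened, illustrative header and INFO value (only four fields kept).
header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
          'annotations. Format: Allele|Consequence|IMPACT|SYMBOL">')
info = ("CSQ=C|upstream_gene_variant|MODIFIER|LINC01342,"
        "G|upstream_gene_variant|MODIFIER|LINC01342")

fields = csq_fields(header)
entries = csq_entries(info, fields)
for entry in entries:
    print(entry["Allele"], entry["Consequence"], entry["SYMBOL"])
```

A single input variant can therefore produce many annotation entries, which is one way a count of annotations ends up larger than the count of input IDs.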

ADD REPLY • link 6.2 years ago by bioinfo456 ▴ 150

0

Regarding rs ids that appear in the output but not in the input: that is not the case at all, I was wrong. A few rs ids occurred multiple times, which is why the output was larger than the input. As you said, it must have reported multiple gene hits as well. Regardless of the gene hits, the position of an rs id with multiple gene hits is going to be the same, isn't it? I'm only interested in obtaining the positions of the variants in the GRCh38 assembly. Do you concur?

ADD REPLY • link 6.2 years ago by bioinfo456 ▴ 150

1

Yes, the position is the position.
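
If only the positions are needed, the first three columns of the VCF output are enough and the CSQ annotations can be ignored. A small Python sketch under that assumption follows. The function name rsid_positions is hypothetical, and the sample records are abbreviated from the output posted above.

```python
def rsid_positions(vcf_lines):
    """Map each ID in a VCF to its (chromosome, position), ignoring annotations."""
    positions = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blanks
        chrom, pos, var_id = line.split()[:3]
        positions[var_id] = (chrom, int(pos))
    return positions

# Abbreviated records; whitespace-separated here, tab-separated in a real VCF.
records = [
    "#CHROM POS ID REF ALT QUAL FILTER INFO",
    "1 1130420 rs11580120 C T . . CSQ=T|intergenic_variant|MODIFIER",
    "1 1132196 rs11589263 G A . . CSQ=A|upstream_gene_variant|MODIFIER",
]
positions = rsid_positions(records)
print(positions)
```

Because the result is keyed by ID, repeated lines for the same rsID collapse into one entry. If an ID were ever listed at more than one location, this sketch would keep only the last one seen.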

ADD REPLY • link 6.2 years ago by Emily 24k

0

Is this a comment on my answer? If it is a comment, please add it using "Add comment", do not create a new answer.

ADD REPLY • link 6.2 years ago by Emily 24k

0

Ya sorry about that. I didn't notice it, was kinda in a hurry.

ADD REPLY • link 6.2 years ago by bioinfo456 ▴ 150

score 3 · Accepted Answer · 2018-11-27

3

6.2 years ago

Emily 24k

Is this in the colocated variants column? Some loci have more than one rsID assigned to them, usually when multiple rsIDs have been merged. The colocated variants column will show you every variant known at that locus, not just the one you used as input.
