6.1 years ago
bioinfo456
I've downloaded the standalone version of VEP.
My input to the tool is 109,153 rs ids, but I seem to be getting an output of 142,325 rs ids. Why is this so?
Which are the additional rs ids in the output file?
Thanks in advance.
Add exact commands, please
Any thoughts anybody?
You are not getting responses because your question is unspecific. How do you define "more rs ids"? More entries as in wc -l, or more total occurrences as in grep -c 'rs'? Please elaborate.
The output file contains more entries than the input file.
Can you show some of the new entries?
temp1.csv is the list of input rs ids and temp2.csv is the list of resulting VEP's rs ids.
Are you sure that every single line in your files has a unique rsID? That the same rsID isn't being listed multiple times for having different possible effects in different transcripts?
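One way to check this is to compare the number of data rows with the number of distinct identifiers. Below is a minimal sketch, assuming the default tab-delimited VEP output, where header lines start with "#" and the first column (Uploaded_variation) holds the identifier you submitted. The function name count_ids and the example rows are made up for illustration and are not real VEP results.

```python
from collections import Counter


def count_ids(lines):
    """Count how often each identifier appears in column 1 of tab-delimited output."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and header lines (assumed to start with "#").
        if not line or line.startswith("#"):
            continue
        counts[line.split("\t")[0]] += 1
    return counts


# Invented rows: one identifier reported for two transcripts, one reported once.
example = [
    "## made-up header line\n",
    "#Uploaded_variation\tLocation\tAllele\tGene\tFeature\n",
    "rsEXAMPLE1\t1:1000\tA\tGENE_A\tTRANSCRIPT_1\n",
    "rsEXAMPLE1\t1:1000\tA\tGENE_A\tTRANSCRIPT_2\n",
    "rsEXAMPLE2\t2:2000\tT\tGENE_B\tTRANSCRIPT_3\n",
]

counts = count_ids(example)
total_rows = sum(counts.values())
repeated = {rsid: n for rsid, n in counts.items() if n > 1}
print(total_rows, "data rows,", len(counts), "unique IDs")
print("IDs listed more than once:", repeated)
```

To run it on a real file, pass an open file handle for your VEP output instead of the example list. If the number of unique IDs matches your input count, the extra rows are repeats rather than new rsIDs.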
Thanks for your response. I don't see a column with that name in the output file. However, I gave rs ids as input to the VEP and got more rs ids in the result than expected. 1. My question is, is the tool designed to do so, as you've mentioned? 2. Is it meant to give out the information for all merged SNPs at a particular locus? 3. And also, how do I deal with SNPs in the form chr:location, since the output says "No variant found" for such formats?
Where are you seeing the rsIDs? In what column of the output? Can you show us a few lines of your output, please? Could you tell us some rsIDs that appear in your output that are not in your input?
The aim of the VEP is to tell you the effects of variants on genes. You can input data in a variety of formats including VCF, lists of variant IDs and HGVS. You do not need to know the rsID of the variants you input and the variants can be novel. For every variant, it will tell you which genes it hits and the effects on those genes. If the variant is already known in the database, it will also tell you the identifier (including rsID, COSMIC ID and many more) and give you relevant information about that variant, such as frequency and clinical significance.
You can input variants without an rsID using only the location, if you use one of the accepted formats. You cannot use a mixed format input file. If you have some variants with just an rsID and others with just a location, you will need to do two queries. If most of your data is a list of rsIDs, the VEP is looking for all the inputs to be variant identifiers and will give a "no variant found" message for anything that is not.
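If your list mixes the two kinds of entry, you could separate them before running the two queries. This is only a rough sketch: it assumes one entry per line and treats anything matching "rs" followed by digits as an rsID. The function name split_inputs and the sample entries are placeholders, and the location entries would still have to be converted to one of the formats the VEP accepts.

```python
import re

# Assumption: an rsID is "rs" followed by one or more digits and nothing else.
RSID = re.compile(r"^rs\d+$")


def split_inputs(lines):
    """Split entries into (rsIDs, everything else), skipping blank lines."""
    ids, others = [], []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue
        if RSID.match(entry):
            ids.append(entry)
        else:
            others.append(entry)
    return ids, others


# Placeholder entries for illustration only.
mixed = ["rs123\n", "chr1:1000\n", "\n", "rs456\n", "chr2:2000\n"]
ids, others = split_inputs(mixed)
print("rsID query:", ids)
print("location query (needs reformatting):", others)
```

Writing each list to its own file gives you the two separate inputs described above.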
Regarding rs ids that appear in output but not in input, that is not at all the case, I was wrong. There were multiple occurrences of a few rs ids which is why the output was larger than the input. Like you said, it must have given me the output of multiple gene hits as well. Regardless of the gene hits, the position of the rs ids having multiple gene hits is going to be the same, isn't it? Since I'm only interested in obtaining the position of variants in GRCh38 assembly. Do you concur?
Yes, the position is the position.
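If all you need is one position per rsID, you could collapse the repeated rows yourself. Here is a small sketch, again assuming the default tab-delimited output with the identifier in column 1 and Location in column 2. The function name locations_by_id and the rows are invented for illustration. It keeps every distinct location seen for an ID rather than assuming there is exactly one, so anything unexpected stays visible. The coordinates are on whichever assembly your VEP cache uses.

```python
from collections import defaultdict


def locations_by_id(lines):
    """Map each identifier (column 1) to the sorted distinct locations (column 2)."""
    locations = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and header lines (assumed to start with "#").
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        locations[fields[0]].add(fields[1])
    return {rsid: sorted(locs) for rsid, locs in locations.items()}


# Invented rows: the first identifier hits two transcripts at the same position.
rows = [
    "#Uploaded_variation\tLocation\tAllele\tGene\tFeature\n",
    "rsEXAMPLE1\t1:1000\tA\tGENE_A\tTRANSCRIPT_1\n",
    "rsEXAMPLE1\t1:1000\tA\tGENE_A\tTRANSCRIPT_2\n",
    "rsEXAMPLE2\t2:2000\tT\tGENE_B\tTRANSCRIPT_3\n",
]

positions = locations_by_id(rows)
for rsid, locs in positions.items():
    print(rsid, ",".join(locs))
```

Any ID that ends up with more than one location in the result is worth inspecting by hand.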
Is this a comment on my answer? If it is a comment, please add it using "Add comment", do not create a new answer.
Ya sorry about that. I didn't notice it, was kinda in a hurry.