How to download VCF file of a specific human gene?
7.8 years ago

Hello everyone, I am a total beginner in biomedical research. The first step in my project is to compare some algorithms that operate on the VCF format. But I am confused because I can't find a VCF for the gene I want (there is never a VCF available, only FASTA or GFF).

Though it looks like a simple question, I wonder: is there any tool that can convert other formats into VCF, or that can efficiently query the VCF of a single gene? Or is there any tool that can cut a specific gene out of the whole database?

Thank you in advance.

vcf sequence gene • 5.8k views
The Variant Call Format (VCF) is used to store genetic variants that have been called/detected with respect to some reference. Here's a guide that I found useful: http://www.internationalgenome.org/wiki/Analysis/vcf4.0.

As for your question, those are two separate things. It is not difficult to convert a VCF file into another genome coordinate format, such as GFF or BED; just be sure that you know the difference between 0-based and 1-based file formats (https://www.biostars.org/p/84686/). I'm not aware of any dedicated tool for this conversion, since it can be achieved with standard Unix tools. To "query" a VCF file for a single gene, you can either find the genome coordinates of your gene of interest, store them as a BED file, and use BEDTools (http://bedtools.readthedocs.io/en/latest/) to overlap/intersect the two files, or, if your VCF file has gene annotations, simply search the file for your gene of interest.
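To make the 0-based vs 1-based point concrete, here is a minimal Python sketch of the coordinate-based approach. It is not BEDTools itself, just an illustration of the same overlap logic. The function names and the toy region (X:99-200 in BED coordinates) are made up for the example; they are not the real coordinates of any gene.

```python
def vcf_to_bed_interval(pos, ref):
    """Convert a 1-based VCF POS plus REF allele to a 0-based, half-open BED interval."""
    start = pos - 1          # VCF is 1-based, BED is 0-based
    end = start + len(ref)   # BED end is exclusive
    return start, end


def variants_in_region(vcf_lines, chrom, region_start, region_end):
    """Yield VCF data lines that overlap the BED-style region [region_start, region_end)."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if fields[0] != chrom:
            continue
        start, end = vcf_to_bed_interval(int(fields[1]), fields[3])
        if start < region_end and end > region_start:
            yield line.rstrip("\n")


# Toy VCF content (hypothetical records, not real data).
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "X\t100\t.\tA\tG\t.\t.\t.",
    "X\t250\t.\tCT\tC\t.\t.\t.",
    "1\t150\t.\tG\tT\t.\t.\t.",
]

hits = list(variants_in_region(example, "X", 99, 200))
for hit in hits:
    print(hit)
```

Only the first record is kept here: POS 100 becomes the BED interval [99, 100), which overlaps the toy region, while the other two records are either outside the region or on another chromosome.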

Thank you for replying. As far as I understand, VCF is more about variations and loci. It's easy to convert VCF into other formats, but I don't know how to convert other files into VCF because it contains reference and alternative alleles. My thesis requires me to compare the performance of VCF-based algorithms, so I plan to download some benchmark gene sequences, but so far I can only access the whole release from 1000 Genomes. By "query" I mean searching for a gene (e.g. DMD) in the release.

I'm not quite sure what you have to do, but I would imagine you don't need to download gene sequences. Perhaps this paper may be relevant for your thesis: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0973-5

I am not 100% sure I know the exact answer you need. It is worth searching for your gene on the COSMIC website, where you will find the mutations recorded for that gene. In addition, you can download the VCF file that includes mutations of all genes and then write code to extract the info you need. As another workaround, once you have the mutation information, you can generate a VCF file yourself according to the VCF format.
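As a rough sketch of the "code to extract the info you need" step, the Python below filters VCF records by a gene annotation in the INFO column. It assumes the gene symbol is stored under an INFO key called `GENE`; that key name, the function names, and the toy records are assumptions for illustration, so check the header of your actual file for the real key.

```python
def info_to_dict(info_field):
    """Parse a VCF INFO string such as 'GENE=DMD;CNT=2;SOMATIC' into a dict."""
    info = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True  # flags have no '=' and become True
    return info


def records_for_gene(vcf_lines, gene, key="GENE"):
    """Yield the split fields of every record whose INFO[key] equals `gene`.

    `key` is a hypothetical annotation name; adjust it to your file.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        if info_to_dict(fields[7]).get(key) == gene:
            yield fields


# Toy records (hypothetical, not taken from COSMIC).
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "X\t1000\tid1\tA\tG\t.\t.\tGENE=DMD;CNT=2",
    "17\t2000\tid2\tC\tT\t.\t.\tGENE=TP53;SOMATIC",
    "X\t3000\tid3\tG\tA\t.\t.\tGENE=DMD",
]

dmd_records = list(records_for_gene(toy_vcf, "DMD"))
print(len(dmd_records), "records annotated with DMD")
```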

Thank you so much for the response. Do you mean that I can generate a VCF file for one gene with all of its variations? But how?

Of course you can get all mutations of that gene, e.g. DMD:
1) Go to Ensembl Biomart.
2) Choose database Ensembl variation 87.
3) Choose dataset Human somatic short variant.
4) Click Filters in the left panel of the webpage.
5) Expand "Gene associated variant filters".
6) Tick Ensembl gene ID(s) and input ENSG00000198947, which is DMD.
7) Click Count at the top left, and you can see this gene has 4303 somatic variants.
8) Export the result in TSV, CSV or XLS format, and you will have all the essential info for a VCF file.
9) Read the VCF specification.
10) Write a script to generate the VCF file yourself, or search for existing tools (I think there should be some).
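The last step of this procedure could look roughly like the Python sketch below. It assumes the exported TSV has been reduced to columns named `chrom`, `pos`, `id`, `ref` and `alt`; those column names are hypothetical, and a real export will most likely need its columns renamed or split first. The output is only a bare-bones VCF with empty QUAL, FILTER and INFO fields.

```python
import csv
import io

# Minimal header: the fileformat line plus the eight fixed VCF columns.
VCF_HEADER = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
)


def tsv_to_vcf(tsv_text):
    """Turn TSV text with chrom/pos/id/ref/alt columns (assumed names) into VCF text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Sort by chromosome name and numeric position so records are ordered.
    rows = sorted(reader, key=lambda r: (r["chrom"], int(r["pos"])))
    out = [VCF_HEADER]
    for r in rows:
        record = [r["chrom"], r["pos"], r["id"] or ".", r["ref"], r["alt"], ".", ".", "."]
        out.append("\t".join(record) + "\n")
    return "".join(out)


# Toy input (made-up variants, not real export data).
toy_tsv = (
    "chrom\tpos\tid\tref\talt\n"
    "X\t200\tvar2\tC\tT\n"
    "X\t100\tvar1\tA\tG\n"
)

vcf_text = tsv_to_vcf(toy_tsv)
print(vcf_text, end="")
```

A file produced this way should still be checked against the VCF specification, for example that the REF alleles really match the reference genome build you are using.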

I don't think a vcf is what you think it is. A gene doesn't have a vcf. You can have a vcf in which variants are described which are found in that gene in a certain study. But a vcf is not a characteristic of a gene. What is the scope of your analysis?

Thank you for asking. I agree with your definition of VCF, and that is what I was confused about. My thesis requires me to compare the efficiency of some compression algorithms that support VCF. Following earlier studies (many of which probably used FASTA), I thought I should use some benchmark human gene sequences for the analysis, but apparently VCF is not used like that. So I want to find a proper way to get suitable VCF files for the comparison.

So essentially you are just looking for a vcf file to do some benchmarking on for testing multiple compression algorithms?

Yes, this is what I need to do first.

You can download VCF files from the 1000 Genomes Project; that's a commonly used dataset which should be okay for testing. If you want a large file, take a large chromosome. If it's small-scale testing, go with chr21 or chr22.
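Once you have a file, a first benchmarking harness could be sketched as below. It uses Python's built-in general-purpose compressors (gzip, bz2, lzma) only as stand-ins for whatever VCF-aware algorithms the thesis actually compares, and the sample data is a made-up record repeated many times, so the numbers it prints say nothing about real VCF files. To test a real file, read its bytes and pass them to `benchmark` instead.

```python
import bz2
import gzip
import lzma
import time


def benchmark(data):
    """Compress `data` (bytes) with each codec; return ratio and wall time per codec."""
    codecs = {
        "gzip": gzip.compress,
        "bz2": bz2.compress,
        "lzma": lzma.compress,
    }
    results = {}
    for name, compress in codecs.items():
        t0 = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - t0
        results[name] = {
            "ratio": len(data) / len(packed),  # original size / compressed size
            "seconds": elapsed,
        }
    return results


# Synthetic stand-in for VCF content (one invented record repeated).
sample = ("22\t16050075\t.\tA\tG\t100\tPASS\t.\n" * 5000).encode()

results = benchmark(sample)
for name, r in results.items():
    print(f"{name}: ratio {r['ratio']:.1f}, {r['seconds']:.4f} s")
```

For a fair comparison it is probably worth repeating each measurement several times and also timing decompression, since single runs on small inputs are noisy.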
