7.8 years ago
insanenostalgic
Hello everyone, I am a total beginner in biomedical research. The first step in my program is to compare some algorithms that work on the VCF format. But I am confused because I can't get a VCF for the gene I want (there is never a VCF available, only FASTA or GFF).
Though it looks like a simple question, I wonder: are there any tools that can convert other formats into VCF, or that can query the VCF of a single gene efficiently? Or are there any tools that can cut a specific gene out of the whole database?
Thank you in advance.
The Variant Call Format (VCF) is used to store genetic variants that have been called/detected with respect to some reference. Here's a guide that I found useful: http://www.internationalgenome.org/wiki/Analysis/vcf4.0.
As for your question, these are two separate things. It is not difficult to convert a VCF file into another genome coordinate format, such as GFF or BED; just be sure that you know the difference between 0-based and 1-based file formats (https://www.biostars.org/p/84686/). I'm not aware of any dedicated tools for this conversion, since it can be done with standard Unix tools. To "query" a VCF file for a single gene you can either find the genome coordinates of your gene of interest, store them as a BED file, and use BEDTools (http://bedtools.readthedocs.io/en/latest/) to overlap/intersect the two files, or, if your VCF file has gene annotations, simply search the file for your gene of interest.
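To make the coordinate point concrete, here is a minimal Python sketch of the VCF-to-BED conversion and of a region query. It is an illustration, not a replacement for BEDTools. The function names and the toy records are made up for this example, and it assumes plain tab-separated VCF data lines with an explicit REF allele (symbolic alleles that rely on an END tag in INFO are not handled).

```python
def vcf_line_to_bed(line):
    """Convert one VCF data line to a (chrom, start, end, name) BED-style tuple."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref = fields[0], int(fields[1]), fields[2], fields[3]
    start = pos - 1          # VCF POS is 1-based; BED start is 0-based
    end = start + len(ref)   # BED end is exclusive, so the span covers REF
    return chrom, start, end, var_id


def variants_in_region(vcf_lines, chrom, start, end):
    """Yield VCF data lines whose REF span overlaps a 0-based, half-open region."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header
        v_chrom, v_start, v_end, _ = vcf_line_to_bed(line)
        if v_chrom == chrom and v_start < end and v_end > start:
            yield line


# Toy records, not real variants; the region is a hypothetical gene interval.
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "X\t100\trs1\tA\tG\t.\t.\t.",
    "X\t500\trs2\tAT\tA\t.\t.\t.",
]
hits = list(variants_in_region(toy_vcf, "X", 90, 110))
```

For a real gene you would take the chromosome, start and end from an annotation source and pass them in place of the toy interval.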
Thank you for replying. As far as I understand, VCF is more about variants and loci. It's easy to convert VCF into other formats, but I don't know how to convert other files to VCF because it contains reference and alternative alleles. My thesis requires me to compare the performance of VCF-based algorithms, so I plan to download some benchmark gene sequences, but so far I can only access the whole release from the 1000 Genomes Project. By "query" I mean searching for a gene (e.g. DMD) in the release.
I'm not quite sure what you have to do but I would imagine you don't need to download gene sequences. Perhaps this paper may be relevant https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0973-5 for your thesis.
I am not 100% sure I know the exact answer you need. It is worth searching for your gene on the COSMIC website, where you will find the mutations recorded for that gene. In addition, you can download the VCF file that includes mutations for all genes, and then write code to extract the info you need. As another workaround, once you have the mutation information, you can generate a VCF file yourself, following the VCF format.
Thank you so much for the response. Do you mean that I can generate a VCF file of one gene with all of its variations? But how?
Of course you can get all mutations of that gene, e.g. DMD:
1) Go to Ensembl BioMart;
2) Choose the database Ensembl variation 87;
3) Choose the dataset Human somatic short variant;
4) Click Filters in the left panel of the webpage;
5) Expand "Gene associated variant filters";
6) Tick Ensembl gene ID(s) and input ENSG00000198947, which is DMD;
7) Click Count at the top left, and you can see this gene has 4303 somatic variants;
8) Export the result in TSV, CSV or XLS format, and then you will have all the essential info for a VCF file;
9) Read the VCF specification;
10) Write a script to generate the VCF file yourself, or Google for existing tools (I think there should be some).
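For the last step, a script along these lines could turn an exported table into a minimal VCF. This is only a sketch under assumptions: the column names `chrom`, `pos`, `id`, `ref` and `alt` are hypothetical, so you would first need to map (and possibly split) the columns of your actual export onto them. It writes only the eight fixed VCF columns and uses `.` for missing values.

```python
import csv
import io

# Minimal header: the mandatory fileformat line plus the fixed column header.
VCF_HEADER = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
)


def tsv_to_vcf(tsv_text):
    """Build a minimal VCF string from TSV text.

    Assumes hypothetical columns: chrom, pos, id, ref, alt.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    records = []
    for row in reader:
        var_id = row["id"] or "."  # '.' marks a missing value in VCF
        records.append((row["chrom"], int(row["pos"]), var_id, row["ref"], row["alt"]))
    # Keep records ordered by chromosome name, then position.
    records.sort(key=lambda r: (r[0], r[1]))
    body = "".join(
        f"{chrom}\t{pos}\t{var_id}\t{ref}\t{alt}\t.\t.\t.\n"
        for chrom, pos, var_id, ref, alt in records
    )
    return VCF_HEADER + body


# Toy input, not real variants.
example_tsv = "chrom\tpos\tid\tref\talt\nX\t200\trs2\tC\tT\nX\t100\t\tA\tG\n"
example_vcf = tsv_to_vcf(example_tsv)
```

QUAL, FILTER and INFO are left as `.` here because an export like this does not carry call-quality information.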
I don't think a vcf is what you think it is. A gene doesn't have a vcf. You can have a vcf in which variants are described which are found in that gene in a certain study. But a vcf is not a characteristic of a gene. What is the scope of your analysis?
Thank you for asking. I agree with your definition of VCF, and that's what I am confused about. My thesis requires me to compare the efficiency of some compression algorithms that support VCF. Referring to earlier research (much of which probably used FASTA), I thought I should use some benchmark human gene sequences for the analysis. But apparently VCF is not used like that. So I want to find a proper way to get suitable VCF files for the comparison.
So essentially you are just looking for a vcf file to do some benchmarking on for testing multiple compression algorithms?
Yes, this is what I need to do first.
You can download VCF files from the 1000 Genomes Project here; that's a commonly used dataset which should be okay for testing. If you want a large file, take a large chromosome. For small-scale testing, go with chr21 or chr22.
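Once you have a file, a small harness like the following could give you a first comparison. It is only a sketch: it uses the general-purpose compressors from the Python standard library (gzip, bz2, lzma) as stand-ins for whatever VCF-aware algorithms your thesis covers, and the function name and toy data are made up for this example.

```python
import bz2
import gzip
import lzma
import time


def benchmark_compressors(data):
    """Compress the same bytes with several codecs and report size, ratio and time."""
    compressors = {
        "gzip": gzip.compress,
        "bz2": bz2.compress,
        "lzma": lzma.compress,
    }
    results = {}
    for name, compress in compressors.items():
        t0 = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - t0
        results[name] = {
            "bytes": len(packed),
            "ratio": len(data) / len(packed),  # uncompressed size / compressed size
            "seconds": elapsed,
        }
    return results


# Toy VCF-like bytes; for a real run, read your uncompressed VCF in binary mode.
toy_data = b"X\t100\trs1\tA\tG\t.\t.\t.\n" * 5000
toy_results = benchmark_compressors(toy_data)
for name, stats in toy_results.items():
    print(name, stats["bytes"], round(stats["ratio"], 1), round(stats["seconds"], 4))
```

If the downloaded file is already gzip-compressed, decompress it first (for example with `gzip.open`) so that every algorithm starts from the same raw bytes.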