6.5 years ago
marion.ryan
Hello
Any advice on the best (easiest) way to handle VCF files (to find SNPs), e.g. using R or VCFtools within Linux?
Sorry if this is basic, but I just need to get started.
Regards, Marion
Define: "way to handle"
To clarify (in case OP was not aware), VCF files already contain the SNPs.
Just to chime in: please read up on the structure of a VCF file, on how VCFtools works and on what it can do. Then read about SNPs and which file formats represent them. Once you have done that, you should be able to answer your own question. The VCFtools man page is pretty descriptive. Read it, then come back with a specific query where you get stuck and we will be happy to help. And yes, a VCF already contains the SNPs (check the ID column: entries with rsIDs are known SNPs). It is also worth understanding the difference between SNPs and SNVs. ;)
Good luck!
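For anyone following along, here is a minimal Python sketch of what "the VCF already contains the SNPs" looks like in practice. It only assumes the standard tab-separated VCF columns (CHROM, POS, ID, REF, ALT, ...). The function names, the example records and the rsID in them are made up for illustration; this is not how VCFtools does it internally.

```python
import io

def iter_vcf_records(handle):
    """Yield one dict per data line of a VCF, skipping '#' header lines."""
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, variant_id, ref, alt = fields[:5]
        yield {
            "chrom": chrom,
            "pos": int(pos),
            "id": variant_id,
            "ref": ref.upper(),
            "alts": [a.upper() for a in alt.split(",")],
        }

def is_snv(record):
    """True when REF and every ALT allele are exactly one base."""
    bases = {"A", "C", "G", "T"}
    return record["ref"] in bases and all(a in bases for a in record["alts"])

# Made-up example data: two single-base substitutions and one deletion.
EXAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\trs0000001\tA\tG\t50\tPASS\tDP=30\n"
    "chr1\t200\t.\tC\tT\t40\tPASS\tDP=12\n"
    "chr1\t300\t.\tAT\tA\t60\tPASS\tDP=25\n"
)

snvs = [r for r in iter_vcf_records(io.StringIO(EXAMPLE_VCF)) if is_snv(r)]
# Records whose ID column carries an rsID are already known variants.
known = [r for r in snvs if r["id"].startswith("rs")]

print(len(snvs), len(known))  # prints "2 1"
```

On a real file you would pass an open file handle instead of the `io.StringIO` wrapper (and `gzip.open(path, "rt")` for a compressed VCF).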
Thanks for the quick answers. I am looking for the best tool to navigate and explore VCF files derived from an RNA-seq experiment, in order to obtain specific SNPs relating to particular genes and to compare the samples at those SNPs. Any tips on the best tools would be great, but I will also read up myself. Sorry, I should have been a bit clearer. Regards, Marion
Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.
You can start with this guide from the GATK people:
https://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it
So if you are trying to find mutations or variants from RNA-seq, then a samtools workflow or a STAR/GATK workflow should be fine. I personally like STAR/GATK owing to its statistical model and the robustness you can add to it. Having said that, once you have a VCF you can always plot stats to see how many of your variants have the PASS flag and what their DP and AF values are. Then again, are you looking at somatic or germline variants? Once you have run the GATK workflow you should have a set of high-confidence calls: some of the variants will be known SNPs, meaning they carry an rsID, and the rest should be novel.
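As a rough illustration of the "plot stats" step, the sketch below counts FILTER values and collects DP and AF from the INFO column using only the Python standard library. It assumes DP and AF are present as INFO keys, which depends on the caller and its settings. The helper names and the example records are made up.

```python
import io
from collections import Counter
from statistics import median

def parse_info(info_field):
    """Turn 'DP=30;AF=0.5;DB' into {'DP': '30', 'AF': '0.5', 'DB': True}."""
    info = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True  # keys without '=' are flags
    return info

def summarise(handle):
    """Return (FILTER counts, list of DP values, list of AF values)."""
    filters = Counter()
    depths = []
    allele_freqs = []
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        filters[fields[6]] += 1
        info = parse_info(fields[7])
        if "DP" in info:
            depths.append(int(info["DP"]))
        if "AF" in info:
            # AF has one value per ALT allele, comma-separated.
            allele_freqs.extend(float(x) for x in info["AF"].split(",") if x != ".")
    return filters, depths, allele_freqs

# Made-up example data; a real file would come from your variant caller.
EXAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tA\tG\t50\tPASS\tDP=30;AF=0.5\n"
    "chr1\t200\t.\tC\tT\t40\tPASS\tDP=12;AF=1.0\n"
    "chr1\t300\t.\tG\tA\t9\tLowQual\tDP=4;AF=0.5\n"
    "chr2\t150\t.\tT\tC\t60\tPASS\tDP=25;AF=0.5\n"
)

filters, depths, allele_freqs = summarise(io.StringIO(EXAMPLE_VCF))
print(filters["PASS"], "of", sum(filters.values()), "records PASS")
print("median DP:", median(depths))
```

The returned lists can go straight into whatever plotting tool you prefer (a histogram of DP for PASS versus non-PASS records is a common first look).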
If you have the IDs of the SNPs you are interested in, for example a list of rsIDs or a VCF containing them, you can overlap them with your VCF file and pull out the scores to summarise them. Whatever you do, you will need to annotate the variants and associate them with their consequences in order to find some biological relevance.
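The overlap by rsID could look something like the sketch below. It assumes your VCF has already been annotated so that the ID column carries rsIDs. The function name, the rsIDs and the records are placeholders for illustration only.

```python
import io

def records_for_ids(handle, wanted_ids):
    """Return {rsID: record dict} for VCF lines whose ID column matches wanted_ids."""
    hits = {}
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # The ID column may hold several identifiers separated by ';'.
        for variant_id in fields[2].split(";"):
            if variant_id in wanted_ids:
                hits[variant_id] = {
                    "chrom": fields[0],
                    "pos": int(fields[1]),
                    "qual": fields[5],
                    "filter": fields[6],
                    "info": fields[7],
                }
    return hits

# Placeholder rsIDs and records, not real variants.
WANTED = {"rs0000001", "rs0000003"}
EXAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\trs0000001\tA\tG\t50\tPASS\tDP=30\n"
    "chr1\t200\trs0000002\tC\tT\t40\tPASS\tDP=12\n"
    "chr1\t300\t.\tG\tA\t60\tPASS\tDP=25\n"
)

hits = records_for_ids(io.StringIO(EXAMPLE_VCF), WANTED)
missing = WANTED - hits.keys()  # IDs on your list with no call in this VCF

for rsid, rec in sorted(hits.items()):
    print(rsid, rec["chrom"], rec["pos"], rec["qual"], rec["filter"], rec["info"])
print("not found:", sorted(missing))
```

Keeping track of the IDs that were not found is useful with RNA-seq data, since a SNP in an unexpressed or poorly covered gene will simply have no call.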