3.8 years ago
iibrams07
Given whole-genome sequencing or whole-exome sequencing datasets of the human genome, I need to generate a list of all the SNPs present in each dataset separately. Ideally the SNPs should be well annotated with the genes they belong to, their coordinates, and possibly additional attributes. The file should be in .csv or a similar format, since I need to upload it and manipulate the data in it.
How should I proceed?
Many thanks.
@Dave Carlson. Many thanks. Once at step 6, what should I do to collect only the SNPs and discard the other variants? I need to write this collection to, say, a text file. The best practices pipeline you are referring to has no SNP pipeline for the somatic case; it covers germline SNPs instead. Do you know of a resource that provides the command-line code to achieve this step by step?
To exclude all but SNPs from your VCF file you can use GATK's SelectVariants tool with --select-type-to-include SNP or bcftools with --skip-variants indels.
From the rest of your comment, it sounds like you're doing somatic variant calling. Is that right? If so, GATK has a separate best practices workflow for this, though I don't have any personal experience with it.
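If neither tool is at hand, the same filtering idea can be sketched in plain Python, which also produces the CSV asked for in the original question. This is only a rough illustration, not a replacement for SelectVariants or bcftools: it assumes an uncompressed, tab-separated VCF, it treats a record as a SNP only when REF and every ALT allele are a single A/C/G/T base, and the function names `is_snp` and `vcf_snps_to_csv` are made up for this example. It does not add gene annotation.

```python
import csv
import io

BASES = {"A", "C", "G", "T"}


def is_snp(ref, alt):
    """Return True when REF and every ALT allele are single nucleotides."""
    alts = alt.split(",")
    return ref.upper() in BASES and all(a.upper() in BASES for a in alts)


def vcf_snps_to_csv(vcf_lines, out):
    """Write CHROM, POS, ID, REF, ALT of SNP records to `out` as CSV.

    `vcf_lines` is any iterable of VCF text lines; `out` is a writable
    text stream. Returns the number of SNP records written.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["chrom", "pos", "id", "ref", "alt"])
    kept = 0
    for line in vcf_lines:
        # Skip blank lines, meta lines (##...) and the column header (#CHROM...).
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, vid, ref, alt = line.rstrip("\n").split("\t")[:5]
        if is_snp(ref, alt):
            writer.writerow([chrom, pos, vid, ref, alt])
            kept += 1
    return kept


if __name__ == "__main__":
    # Toy records invented for illustration only.
    demo = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "chr1\t100\trs1\tA\tG\t50\tPASS\t.\n",
        "chr1\t200\t.\tAT\tA\t50\tPASS\t.\n",
    ]
    buf = io.StringIO()
    vcf_snps_to_csv(demo, buf)
    print(buf.getvalue(), end="")
```

For a real file you would pass an open file handle instead of the toy list, for example `vcf_snps_to_csv(open("calls.vcf"), open("snps.csv", "w", newline=""))`, where both file names are placeholders.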
The DNA was extracted from somatic cells. It is surprising that GATK provides more pipelines for germline than for somatic samples. The last best practices workflow you are referring to does not clearly say whether it is about SNPs. SNVs are not the same as SNPs; they are totally different, so I am confused. Can you comment on this? I have another question. When I open the code file extracted from GitHub, I find a .json and a .wdl file. What should I do with these files? What I need is the code. Is the code inside one of these files? How can I read them? Thanks.
SNP = single nucleotide polymorphism
SNV = single nucleotide variant
These two things are subtly different, but they are not totally different.
I'm not sure which precise GitHub page you're looking at, but if you're referring to the code for GATK, it's written in Java, and the easiest way to run it would be to download the latest release and call the wrapper script gatk.
By code I mean the command-line code needed to run the pipeline, not the code to download GATK. Clicking on the best practices workflow leads to a page of pipelines, each of which links to a GitHub page where a code file is located.
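For the single SNP-selection step mentioned earlier in the thread, a call through the gatk wrapper can be sketched as below. This is a minimal sketch that assumes GATK4 is installed and that `gatk` is on the PATH; the file names are placeholders, and the helper names `select_snps_command` and `run_select_snps` are invented for this example. It covers only that one step, not a full best practices pipeline.

```python
import subprocess


def select_snps_command(input_vcf, output_vcf, gatk="gatk"):
    """Build the argument list for a GATK4 SelectVariants call that keeps only SNPs."""
    return [
        gatk, "SelectVariants",
        "-V", input_vcf,                      # input VCF (placeholder path)
        "--select-type-to-include", "SNP",    # keep SNP records only
        "-O", output_vcf,                     # output VCF (placeholder path)
    ]


def run_select_snps(input_vcf, output_vcf, gatk="gatk"):
    """Run the command; raises CalledProcessError if gatk exits with an error."""
    subprocess.run(select_snps_command(input_vcf, output_vcf, gatk), check=True)


if __name__ == "__main__":
    # Print the equivalent shell command instead of running it.
    print(" ".join(select_snps_command("calls.vcf.gz", "snps.vcf.gz")))
```

Printing the command first makes it easy to check it by eye before running `run_select_snps` on real data.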