With full respect, GATK is a good tool for SNP calling. But the tutorial on the GATK website is too complex, and I get lost in the details.
Is there an easy-to-use list of GATK commands for SNP calling, one that I can copy and paste, changing just the input file names and maybe a few parameters?
Hello,
what's the problem with the Tool Documentation?
I guess you tried to go through the best practice guide and got lost somewhere there? To begin with, it's OK to start with just the command for variant calling using HaplotypeCaller. But I would recommend reading more about the whole "pipeline thing" (not only the best practice guide, but that's a good starting point). Depending on what you are trying to analyse, there is much more to do than just typing in the command for variant calling.
Please feel free to ask a specific question if you don't understand a certain point.
fin swimmer
Hi swimmer, thank you very much for the suggestions. Do you think https://gencore.bio.nyu.edu/variant-calling-pipeline/ is a good command pipeline that I can follow? This is the kind of pipeline I am looking for, but I am not sure if they miss something important.
That pipeline might be a bit old. I don't think you need the RealignerTargetCreator/IndelRealigner steps for indels anymore, as HaplotypeCaller handles that itself now.
The general steps for me are:
And yes, their tutorials are a bit of a mess. Their best practice guide is badly organized. You have to dig around a lot.
Hi, Damian! I like the steps you mentioned a lot; I was just looking for something like this. By "mark duplicates", did you mean marking the duplicated reads using MarkDuplicates (Picard)? And should duplicated/recombinant regions be removed from the reference as well, or does that happen naturally when MarkDuplicates runs?
Yes, I usually just use picardtools' MarkDuplicates. Duplicate/recombinant regions are tricky to deal with. It might be better to do some kind of de novo assembly of those regions specifically if that's what you want to study.
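For anyone who wants something to copy and paste, here is a minimal sketch of the duplicate-marking step, assuming GATK4 (which bundles the Picard tools) and a coordinate-sorted BAM. The function name and file names are placeholders of mine, not from the posts above. Note that MarkDuplicates works on the aligned reads in the BAM and does not touch the reference.

```shell
#!/usr/bin/env bash
# Sketch only: the function name and all file names are hypothetical placeholders.
# Flags duplicate reads in a coordinate-sorted BAM. By default the reads are
# marked in the output BAM, not removed, and the reference is left untouched.
mark_duplicates() {
    local in_bam="$1" out_bam="$2" metrics="$3"
    gatk MarkDuplicates -I "$in_bam" -O "$out_bam" -M "$metrics"
}

# Example call (adjust the file names to your data):
# mark_duplicates sample.sorted.bam sample.dedup.bam sample.dup_metrics.txt
```

The metrics file is a small text report of how many reads were flagged, which is worth a quick look before moving on.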
Dear Damian~ I'm trying to understand best practices for variant calling. Following alignment and marking duplicates, does each individual need to have variants called before calling variants across multiple samples (step 6)?
GATK best practices suggest creating a genome VCF (g.vcf) for each individual, combining the g.vcfs, and then doing joint calling. These are steps 4, 5, and 6 in my comment.
A genome VCF is different from a normal VCF in that it will also output information on positions that are not different from the reference. You want this information when you eventually do a joint-calling among all samples so you can make the comparison with other samples where there is a difference to reference at that position. I would read up on g.vcfs if you want more info.
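The per-sample g.vcf, combine, and joint-calling steps described above could be sketched roughly as follows, assuming GATK4. The function names and file names are placeholders I made up for illustration; check the tool documentation for options that fit your data.

```shell
#!/usr/bin/env bash
# Sketch only: function names and file names are hypothetical placeholders.

# Per-sample calling in GVCF mode (-ERC GVCF), which also records
# positions that match the reference.
call_sample_gvcf() {
    local ref="$1" bam="$2" out_gvcf="$3"
    gatk HaplotypeCaller -R "$ref" -I "$bam" -O "$out_gvcf" -ERC GVCF
}

# Combine any number of per-sample g.vcfs into one cohort g.vcf.
combine_gvcfs() {
    local ref="$1" out_gvcf="$2" g
    shift 2
    local variant_args=()
    for g in "$@"; do
        variant_args+=(--variant "$g")
    done
    gatk CombineGVCFs -R "$ref" "${variant_args[@]}" -O "$out_gvcf"
}

# Joint genotyping across all samples in the combined g.vcf.
joint_genotype() {
    local ref="$1" cohort_gvcf="$2" out_vcf="$3"
    gatk GenotypeGVCFs -R "$ref" -V "$cohort_gvcf" -O "$out_vcf"
}

# Example calls (adjust the file names to your data):
# call_sample_gvcf ref.fasta s1.dedup.bam s1.g.vcf.gz
# call_sample_gvcf ref.fasta s2.dedup.bam s2.g.vcf.gz
# combine_gvcfs ref.fasta cohort.g.vcf.gz s1.g.vcf.gz s2.g.vcf.gz
# joint_genotype ref.fasta cohort.g.vcf.gz cohort.vcf.gz
```

The final cohort.vcf.gz is a normal VCF with genotypes for every sample at every variable site, which is what makes the between-sample comparison possible.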
Which GATK command did you use to call variants? I have tried many times but have not been able to generate a VCF file.
Here is my command:
./gatk --java-options "-Xmx4g" HaplotypeCaller -R ../sequence.fasta -I ../trt.bam -O ../output.vcf.gz
I also tried command:
./gatk HaplotypeCaller -R ../sequence.fasta -I ../sorted-trt.bam -O ../variants-trt.vcf
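Both commands look syntactically fine, so this is only a guess, since the error message was not posted: HaplotypeCaller commonly fails when the reference lacks its .fai and .dict files, when the BAM is not indexed, or when the BAM has no read groups. A minimal sketch of those checks, assuming samtools is installed and gatk is on your PATH; the function names and the read-group values (lib1, unit1, ILLUMINA) are placeholders of mine.

```shell
#!/usr/bin/env bash
# Sketch only: function names and read-group values are hypothetical placeholders.

# Create the index files HaplotypeCaller expects next to the inputs.
# Run once per reference; the BAM must be coordinate-sorted.
prepare_inputs() {
    local ref="$1" bam="$2"
    samtools faidx "$ref"                     # writes the .fai index
    gatk CreateSequenceDictionary -R "$ref"   # writes the .dict file
    samtools index "$bam"                     # writes the .bai index
}

# Succeeds (exit status 0) if the BAM header has at least one @RG line.
has_read_groups() {
    samtools view -H "$1" | grep -q '^@RG'
}

# Add a single read group when the BAM has none.
add_read_group() {
    local in_bam="$1" out_bam="$2" sample="$3"
    gatk AddOrReplaceReadGroups -I "$in_bam" -O "$out_bam" \
        --RGID "$sample" --RGLB lib1 --RGPL ILLUMINA --RGPU unit1 --RGSM "$sample"
}

# Example calls (adjust the file names to your data):
# prepare_inputs ../sequence.fasta ../sorted-trt.bam
# has_read_groups ../sorted-trt.bam || add_read_group ../sorted-trt.bam ../sorted-trt.rg.bam trt
```

If a read group had to be added, index the new BAM as well and pass that file to HaplotypeCaller. If it still fails, please post the actual error message.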
Hi Chen, the pipeline that you mentioned by NYU seems fine. It appears to be mostly for internal use, though. You don't appear to be based at NYU...?
An updated version of this pipeline using GATK4 is now available here: https://gencore.bio.nyu.edu/variant-calling-pipeline-gatk4/
It's available as a Nextflow script on GitHub and is fully dockerized, so anyone outside of NYU can now use this same pipeline.
I agree - It would be awesome if GATK could be used through a front-end application or were more user-friendly!