I created a VCF file from FASTQ files using the recent GATK pipeline (https://www.broadinstitute.org/gatk/guide/presentations?id=4765).
After finishing the variant discovery procedure (including variant recalibration), I have a VCF file that is ready to annotate with other tools such as snpEff.
==================================================
But my question is this.
Our MiSeq machine from Illumina comes with a built-in program that makes a VCF file from the FASTQ files automatically.
(In this case I don't need to run GATK myself; the machine's built-in program does everything. I checked that it also uses the GATK pipeline.)
However, my VCF file (which I created myself with the GATK pipeline) and the VCF file generated automatically by the Illumina machine are very different in the number of variants.
I know that different programs report different variant calls. However, the VCF file generated automatically by the Illumina machine has about 9300 variants called, while my VCF file (generated using GATK) has 55000 variants, which is a huge difference.
I know I need to filter out some variants based on several criteria such as read depth, quality score, etc. But I think that at the very beginning the numbers of called variants should be comparable. Am I missing something?
Could someone please help me with this?
Thanks
Going from FASTQ to VCF is a long way. First you have to align the reads against the reference (and aligners can already introduce differences). Then SNP calling can be done with different parameters, and this might also affect the results.
I suggest you look for a tutorial on SNP calling using GATK and one using the MiSeq built-in tools.
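
Before digging into tutorials, it may help to see where the two call sets actually differ rather than only comparing totals. Below is a minimal Python sketch, not a definitive method: it assumes both VCFs were called against the same reference with the same contig names, and the function names (`read_sites`, `compare_sites`) and file names are made up for illustration. It also lets you count only records whose FILTER is PASS or ".", in case one of the files still contains records that were flagged but not removed.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed VCF as text."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def read_sites(path, pass_only=False):
    """Return the set of (CHROM, POS, REF, ALT) keys found in a VCF.

    Multi-allelic records are split into one key per ALT allele.
    With pass_only=True, records whose FILTER is neither PASS nor "."
    are skipped.
    """
    sites = set()
    with open_text(path) as handle:
        for line in handle:
            # Skip header lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 7:
                continue  # not a well-formed VCF record
            chrom, pos, ref, alts, filt = (
                fields[0], fields[1], fields[3], fields[4], fields[6]
            )
            if pass_only and filt not in ("PASS", "."):
                continue
            for alt in alts.split(","):
                sites.add((chrom, int(pos), ref, alt))
    return sites


def compare_sites(first, second):
    """Count the sites shared by two call sets and unique to each."""
    return {
        "shared": len(first & second),
        "only_first": len(first - second),
        "only_second": len(second - first),
    }


# Example usage (hypothetical file names):
# mine = read_sites("my_gatk.vcf", pass_only=True)
# miseq = read_sites("miseq_reporter.vcf", pass_only=True)
# print(compare_sites(mine, miseq))
```

If most of the extra variants disappear with `pass_only=True`, the difference is mainly filtering. If they cluster in particular regions, check whether the two pipelines were restricted to the same target intervals.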