Question

From fastq file to vcf file (inconsistent vcf file from two methods)

10.3 years ago

illinois.ks ▴ 210

I created a VCF file from FASTQ files using the recent GATK pipeline (https://www.broadinstitute.org/gatk/guide/presentations?id=4765).

After finishing the variant discovery procedure (including variant recalibration), I have a VCF file that is ready to annotate with other tools such as snpEff.

==================================================

But my question is this.

Our MiSeq machine from Illumina comes with a built-in program that generates a VCF file from the FASTQ files automatically.

(In this case I don't need to run GATK myself; the machine's built-in program does everything. I checked, and it also uses a GATK pipeline.)

However, the VCF file I created myself with the GATK pipeline and the VCF file generated automatically by the Illumina machine differ greatly in the number of variants.

I know that different programs report different variant calls. But the VCF file generated automatically by the Illumina machine has about 9300 variants, while the VCF file I generated with GATK has 55000 variants, which is a huge difference.

I know I need to filter out some variants based on criteria such as read depth and quality score. But I would expect the number of called variants to be comparable from the very beginning. Am I missing something?

Could someone please help me with this?

Thanks

vcf next-gen illumina gatk • 5.0k views

ADD COMMENT • link updated 2.6 years ago by Ram 45k • written 10.3 years ago by illinois.ks ▴ 210

Going from FASTQ to VCF is a long way. First you have to align the reads against the reference (and aligners can already introduce differences). Then SNP calling can be done with different parameters, which can also affect the results.

I suggest you look for a tutorial on SNP calling with GATK and one on the MiSeq built-in tools.
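
One quick way to look for such differences, assuming both VCF files still carry their `##` meta lines, is to compare the headers, since callers often record the reference and other settings there. This is only a minimal sketch: the function names `header_meta` and `header_diff` and the sample header values are made up for illustration, and what a given pipeline actually writes into its header varies.

```python
import io


def header_meta(handle):
    """Collect '##key=value' meta lines from the top of a VCF into a dict of lists."""
    meta = {}
    for line in handle:
        if not line.startswith("##"):
            break  # the '#CHROM' column line ends the meta section
        key, _, value = line[2:].rstrip("\n").partition("=")
        meta.setdefault(key, []).append(value)
    return meta


def header_diff(a, b):
    """Return the keys whose values differ, mapped to (values_in_a, values_in_b)."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k, []), b.get(k, [])) for k in keys if a.get(k) != b.get(k)}


# Hypothetical headers, standing in for the hand-made and the machine-made VCF.
mine = io.StringIO(
    "##fileformat=VCFv4.1\n##reference=file:///refs/hg19.fa\n#CHROM\tPOS\n"
)
machine = io.StringIO(
    "##fileformat=VCFv4.1\n##reference=file:///refs/other.fa\n"
    "##source=SomeCaller\n#CHROM\tPOS\n"
)

for key, (left, right) in header_diff(header_meta(mine), header_meta(machine)).items():
    print(key, left, right)
```

With real files, replace the `io.StringIO` objects with `open("your.vcf")` handles. A differing reference or caller line would already explain part of the gap.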

ADD REPLY • link updated 2.6 years ago by Ram 45k • written 10.3 years ago by Fabio Marroni ★ 3.0k

Ram · Answer 1 · 2015-05-20

The issue of concordance between genotypers has been discussed before: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3706896/

If you're comparing a black-box pipeline with your hand-rolled GATK run, there are all kinds of variables that may or may not be in play. As someone has already pointed out, you may be looking at data that has come from two separate aligners. You may be calling SNPs across a whole genome with GATK, whereas the Illumina calls may be restricted to, e.g., regions of enrichment from an amplicon assay or exome capture. There are too many variables to diagnose.
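
To test the "restricted regions" idea, one could count shared and private calls before and after limiting both files to the assay's target regions. The following is a rough sketch, not a replacement for a proper comparison tool: it assumes uncompressed VCFs, a BED file of targets (0-based, half-open intervals), and identical chromosome naming in all files. The function names and the tiny in-memory data are invented for illustration.

```python
import io


def read_variants(handle):
    """Return a set of (chrom, pos, ref, alt) keys, one per ALT allele."""
    keys = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


def read_bed(handle):
    """Return {chrom: [(start, end), ...]} from a BED file."""
    regions = {}
    for line in handle:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    return regions


def in_regions(key, regions):
    """VCF POS is 1-based, BED is 0-based half-open, hence start < pos <= end."""
    chrom, pos = key[0], key[1]
    return any(start < pos <= end for start, end in regions.get(chrom, []))


def compare(a, b, regions=None):
    """Count shared and private variant keys, optionally restricted to regions."""
    if regions is not None:
        a = {k for k in a if in_regions(k, regions)}
        b = {k for k in b if in_regions(k, regions)}
    return {"shared": len(a & b), "only_a": len(a - b), "only_b": len(b - a)}


# Made-up example data: 'mine' calls genome-wide, 'machine' only in the target.
HEADER = "##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
mine = read_variants(io.StringIO(
    HEADER
    + "chr1\t100\t.\tA\tG\t50\tPASS\t.\n"
    + "chr1\t5000\t.\tC\tT\t30\tPASS\t.\n"
    + "chr2\t200\t.\tG\tA,C\t40\tPASS\t.\n"
))
machine = read_variants(io.StringIO(HEADER + "chr1\t100\t.\tA\tG\t60\tPASS\t.\n"))
targets = read_bed(io.StringIO("chr1\t50\t150\n"))

print(compare(mine, machine))           # genome-wide comparison
print(compare(mine, machine, targets))  # restricted to the target regions
```

If most of the extra GATK calls disappear once both sets are restricted to the targets, the difference in called regions is the likely explanation. If they remain, the aligner, the caller settings, or the filtering are better suspects.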