Question

GATK: HaplotypeCaller gVCF and multisample

9.9 years ago

iraun 6.2k

Hi all,

Can anyone explain to me the main difference between using GATK HC in gVCF mode and in multi-sample mode? I know that HC in GVCF mode is used for variant discovery on cohorts of samples, but what does "cohorts of samples" mean? If I have 2 groups of samples, one WT and the other mutant, should I use GVCF mode? I've read almost all of the GATK tutorials and how-tos and I still do not understand it.

Also, how can I give more than one bam to HC? Is this the correct way?:

# several BAMs via repeated -I flags, like this:
java \
  -Xmx8g \
  -XX:ParallelGCThreads=4 \
  -jar GenomeAnalysisTK.jar \
  -T HaplotypeCaller \
  -R reference.fasta \
  -I sample1.bam -I sample2.bam -I ... \
  --genotyping_mode DISCOVERY \
  -stand_emit_conf 10 \
  -stand_call_conf 30 \
  --dbsnp dbsnp_138.b37.vcf \
  -o raw_variants.vcf

Thank you

GATK • 13k views

updated 2.7 years ago by Ram 44k • written 9.9 years ago by iraun 6.2k

Ram · Answer 1 · 2015-05-19

between using GATK HC in gVCF mode instead of in multi-sample mode?

I'm not sure what you mean, but there are two possible scenarios:

1. HaplotypeCaller in gVCF mode vs. plain variant calling
2. HaplotypeCaller (without regard to gVCF or plain variant calling) vs. UnifiedGenotyper per sample

I think you meant the second one. Unlike UnifiedGenotyper, HaplotypeCaller performs de novo assembly of regions containing variants, which gives more confident variant calls. More info is described here: Variant Caller Of Choice?

"Cohort" is usually subjective.

Cohort: A collection of samples being analyzed together. This organizational unit is the most subjective and depends very specifically on the design goals of the sequencing project. For population discovery projects like the 1000 Genomes, the analysis cohort is the ~100 individual in each population. For exome projects with many deeply sequenced samples (e.g., ESP with 800 EOMI samples) we divide up the complete set of samples into cohorts of ~50 individuals for multi-sample analyses.

From: http://gatkforums.broadinstitute.org/discussion/3059/lane-library-sample-and-cohort-what-do-they-mean-and-why-are-they-important

In our GATK-based pipeline, "cohort of samples" means all of the "pools". For example, we have four pools. We induced mutation in plants, and fifteen plants still show a phenotype as if they had not undergone mutation. We call that Pool1. The rest, Pools 2-4, with around 15 physical plants per pool, exhibit mutation to varying degrees. In the sequencing data, Pools 1, 2, 3 and 4 are different samples. The cohort is all of them together.

I think you can either use GVCF mode or not in your analysis (WT vs mutant); that depends on what you do in your downstream processing. In everything I have seen so far, people run HaplotypeCaller in GVCF mode, then GenotypeGVCFs, then Variant Quality Score Recalibration, which in practice uses the VariantRecalibrator and ApplyRecalibration walkers of GATK. From there you select variants with an acceptable VQSLOD, usually >= 4.0. Further filtration might be needed after that.
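As a rough sketch of that workflow, assuming GATK 3.x (the version the -T syntax in the question implies): the script below only prints the commands rather than running them, so you can check them first. The file names (sampleN.bam, cohort.vcf, cohort.recal.vcf, cohort.filtered.vcf) and the helper function names are placeholders I made up for illustration. The VQSR step is left out because VariantRecalibrator needs training resources that depend on your organism. Some older 3.x releases may also need extra index arguments for gVCF output, so check the documentation for your version.

```shell
#!/bin/sh
# Dry-run sketch: prints GATK 3.x commands instead of executing them.
# All file names below are hypothetical placeholders.
set -eu

GATK="java -Xmx8g -jar GenomeAnalysisTK.jar"
REF="reference.fasta"

# Step 1: one HaplotypeCaller run per sample, emitting a gVCF.
hc_gvcf_cmd() {
  printf '%s\n' "$GATK -T HaplotypeCaller -R $REF -I $1.bam --emitRefConfidence GVCF -o $1.g.vcf"
}

# Step 2: joint genotyping over all per-sample gVCFs (the "cohort").
genotype_gvcfs_cmd() {
  gg_cmd="$GATK -T GenotypeGVCFs -R $REF"
  for gg_sample in "$@"; do
    gg_cmd="$gg_cmd --variant $gg_sample.g.vcf"
  done
  printf '%s\n' "$gg_cmd -o cohort.vcf"
}

# Step 3 (after VQSR, not shown): keep variants at or above a VQSLOD cutoff.
select_vqslod_cmd() {
  printf '%s\n' "$GATK -T SelectVariants -R $REF -V cohort.recal.vcf -select \"VQSLOD >= $1\" -o cohort.filtered.vcf"
}

for s in sample1 sample2; do
  hc_gvcf_cmd "$s"
done
genotype_gvcfs_cmd sample1 sample2
select_vqslod_cmd 4.0
```

The point of the design is that step 1 runs once per sample (so WT and mutant samples are each called separately), and only step 2 looks at the whole cohort. Adding a sample later means one more HaplotypeCaller run plus a rerun of GenotypeGVCFs, not recalling everything.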

And yes, you are correct in giving two or more BAMs to GATK.