Hello,
I am trying to create a VCF file using GenotypeGVCFs in GATK4. I have 60 samples, and each sample is pooled data. The ploidy per sample is 60. This is due to the biological system I work in.
The data has been processed with HaplotypeCaller. Below is an example for one pooled sample, from BAM to g.vcf:
./gatk HaplotypeCaller \
-I /home/novaseq/bams/4_12.bam \
-R /home/novaseq/gatk/genomic_refseq.fna \
-O /home/novaseq/gatk/gvcf_by_sample/4_12_WG.g.vcf \
-ERC GVCF \
-ploidy 60
The data was then run through GenomicsDBImport to merge the single-sample g.vcfs into a database:
./gatk GenomicsDBImport \
--genomicsdb-workspace-path /home/novaseq/gatk/gvcf_by_sample/genomic_work_space/ \
-L /home/novaseq/gatk/gvcf_by_sample/intervals.list \
--sample-name-map /home/novaseq/gatk/gvcf_by_sample/gvcf.sample_map \
--tmp-dir /home/novaseq/gatk/gvcf_by_sample/tmp
The resulting database was used to produce a vcf file via GenotypeGVCFs:
./gatk GenotypeGVCFs \
-R /home/novaseq/gatk/GCF_003254395.2_Amel_HAv3.1_genomic_refseq.fna \
-V gendb:///home/novaseq/gatk/gvcf_by_sample/genomic_work_space/ \
--sample-ploidy 60 \
-O /home/novaseq/gatk/pooled_colony.vcf.gz
I get a "too many genotypes" message from GenotypeGVCFs when creating the VCF file:
Sample/Callset 4_9( TileDB row idx 59) at Chromosome NC_037638.1 position 60188 (TileDB column 60187) has too many genotypes in the combined VCF record : 1891 : current limit : 1024 (num_alleles, ploidy) = (3, 60). Fields, such as PL, with length equal to the number of genotypes will NOT be added for this sample for this location.
I know there is a limit on how many genotypes can be written to the VCF.
But could someone explain why there are so many genotypes for the 3 alleles at this site? Are there 1891 different ways of combining those 3 alleles? In one sample I have over 635000 genotypes for 5 alleles. How is this possible with a ploidy of 60? Is it due to the depth of sequencing?
Finally, can anyone offer a way to write the VCF file? Ultimately, what I would like is allele frequencies for my downstream filtering and analysis. Should I have used additional filters earlier on?
Many thanks.
Microsatellites, many indels in the context, many different clipped sequences, etc.
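On the counting part of the question: the numbers in the message are consistent with the count of unordered genotypes, i.e. the number of ways to choose `ploidy` allele copies from `num_alleles` alleles with repetition, which is C(ploidy + num_alleles - 1, num_alleles - 1). Below is a minimal sketch to check this, assuming Python 3.8+ for `math.comb`; the function name `genotype_count` is only illustrative and is not part of GATK.

```python
from math import comb


def genotype_count(num_alleles: int, ploidy: int) -> int:
    """Number of unordered genotypes for a site.

    A genotype is a multiset of `ploidy` allele copies drawn from
    `num_alleles` distinct alleles, so the count is the
    "combinations with repetition" formula.
    """
    return comb(ploidy + num_alleles - 1, num_alleles - 1)


# (num_alleles, ploidy) = (3, 60), as reported in the message above
print(genotype_count(3, 60))  # 1891

# 5 alleles at ploidy 60
print(genotype_count(5, 60))  # 635376

# For comparison: a diploid sample with 3 alleles
print(genotype_count(3, 2))   # 6
```

So the count depends only on the ploidy and the number of alleles at the site, not on the sequencing depth. Since fields such as PL carry one value per genotype, they grow very quickly at ploidy 60, which is why they are dropped once the count passes the 1024 limit.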
Hello! I am having the same issue with merged samples of tetraploids. Any update on this? Would you share the link to your question in GATK?
Thanks!
Hi, I am having the same issue with WGS data when joint genotyping from GenomicsDBImport. Does anyone have a thought on this and why it is happening? Does it cause any critical issue in the output?