Question

GenotypeGVCFs: too many genotypes from pooled samples

0

3.5 years ago

Vic ▴ 100

Hello,

I am trying to create a VCF file using GenotypeGVCFs in GATK4. I have 60 samples, and each sample is a pool with a ploidy of 60. This is due to the biological system I work in.

The data have been processed with HaplotypeCaller. Below is an example for one pooled sample, from BAM to g.vcf:

./gatk HaplotypeCaller \
-I /home/novaseq/bams/4_12.bam \
-R /home/novaseq/gatk/genomic_refseq.fna \
-O /home/novaseq/gatk/gvcf_by_sample/4_12_WG.g.vcf \
-ERC GVCF \
-ploidy 60

The single-sample g.vcf files were then merged into a database with GenomicsDBImport:

./gatk GenomicsDBImport \
--genomicsdb-workspace-path /home/novaseq/gatk/gvcf_by_sample/genomic_work_space/ \
-L /home/novaseq/gatk/gvcf_by_sample/intervals.list \
--sample-name-map /home/novaseq/gatk/gvcf_by_sample/gvcf.sample_map \
--tmp-dir /home/novaseq/gatk/gvcf_by_sample/tmp

The resulting database was used to produce a vcf file via GenotypeGVCFs:

./gatk GenotypeGVCFs \
-R /home/novaseq/gatk/GCF_003254395.2_Amel_HAv3.1_genomic_refseq.fna \
-V gendb:///home/novaseq/gatk/gvcf_by_sample/genomic_work_space/ \
--sample-ploidy 60 \
-O /home/novaseq/gatk/pooled_colony.vcf.gz

GenotypeGVCFs reports a "too many genotypes" error when creating the VCF file:

Sample/Callset 4_9( TileDB row idx 59) at Chromosome NC_037638.1 position 60188 (TileDB column 60187) has too many genotypes in the combined VCF record : 1891 : current limit : 1024 (num_alleles, ploidy) = (3, 60). Fields, such as PL, with length equal to the number of genotypes will NOT be added for this sample for this location.

I know there is a limit on how many genotypes can be stored in the VCF.

But could someone explain why there are so many genotypes for the 3 alleles at this site? Are there 1891 different ways of combining those 3 alleles? In one sample I have over 635000 genotypes for 5 alleles. How is this possible with a ploidy of 60? Is it due to the depth of sequencing?

Finally, can anyone offer a way to write the VCF file? Ultimately, what I would like is allele frequencies for my downstream filtering and analysis. Should I have used additional filters earlier on?

Many thanks.

GenotypeGVCFs GATK VCF • 1.8k views

ADD COMMENT • link updated 5 months ago by Sd • 0 • written 3.5 years ago by Vic ▴ 100

score 0 · Answer 1 · 2022-06-29

0

2.4 years ago

Begonia_pavonina ▴ 200

I have the same issue with GenotypeGVCFs, with two different warning messages.

An error related to the number of genotypes:

> Sample/Callset 95( TileDB row idx 76) at Chromosome Chr2 position
> 2405066 (TileDB column 168320521) has too many genotypes in the
> combined VCF record : 1081 : current limit :  1024 (num_alleles,
> ploidy) = (46, 2). Fields, such as  PL, with length equal to the
> number of genotypes will NOT be added

An error related to the number of alleles:

> Chromosome Chr2 position 5276810 (TileDB column 171192265) has too
> many alleles in the combined VCF record : 58 : current limit : 50.
> Fields, such as  PL, with length equal to the number of genotypes will
> NOT be added for this location.

I don't really understand either how we can get that many different alleles and genotypes at a single position, given that I have 80 samples, all of which are diploid. I will ask the question directly on the tool webpage.

ADD COMMENT • link 2.4 years ago by Begonia_pavonina ▴ 200

1

> I don't really understand either how can we get that many different alleles and genotypes at a single position,

Microsatellites, many indels in the context, many different clipped sequences, etc.
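
On the genotype counts themselves: the numbers in these messages look like plain combinatorics rather than anything to do with sequencing depth. Assuming the tool counts unordered genotypes (every way of choosing `ploidy` allele copies from `num_alleles` alleles, with repetition), the count is C(ploidy + num_alleles - 1, num_alleles - 1). A minimal bash sketch to check this against the figures quoted in this thread (the function name `genotype_count` is made up for illustration and is not part of GATK):

```bash
#!/usr/bin/env bash
# Number of unordered genotypes for a given allele count and ploidy:
#   C(ploidy + alleles - 1, alleles - 1)
# Hypothetical helper, not a GATK command. Uses 64-bit shell integers,
# so very large allele/ploidy combinations may overflow.
genotype_count() {
  local alleles=$1 ploidy=$2
  local n=$((ploidy + alleles - 1))
  local k=$((alleles - 1))
  local result=1 i
  # Multiplicative binomial formula; each step stays an exact integer.
  for ((i = 1; i <= k; i++)); do
    result=$((result * (n - k + i) / i))
  done
  echo "$result"
}

genotype_count 3 60    # (num_alleles, ploidy) = (3, 60)  -> 1891
genotype_count 5 60    # 5 alleles at ploidy 60           -> 635376
genotype_count 46 2    # (num_alleles, ploidy) = (46, 2)  -> 1081
```

Under that assumption, a ploidy of 60 goes over the 1024 limit as soon as a site has 3 alleles (2 alleles give only 61 genotypes), while a diploid sample needs 45 or more alleles at one site to get there.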

ADD REPLY • link 2.4 years ago by Pierre Lindenbaum 164k

0

Hello! I am having the same issue with merged samples of tetraploids. Any update on this? Could you share the link to your question on the GATK site?

Thanks!

ADD REPLY • link 13 months ago by Paula Andrea • 0

0

Hi, I am having the same issue with WGS data when joint genotyping from a GenomicsDBImport workspace. Does anyone have thoughts on why this is happening? Does it cause any critical problem in the output?

ADD REPLY • link 5 months ago by Sd • 0