Question

Haploid Genotypes in discoSNP++ VCFs

6.2 years ago

dustin.r.long ▴ 10

I am trialing discoSNP++ as part of a bacterial GWAS pipeline and would like some clarification on the genotypes in the multisample VCFs. Similar to the Pseudomonas example provided in the VCF_creator user guide (pages 4-5), we see diploid-style genotypes (0/0, 0/1, 1/1), including a large number of heterozygous calls (0/1), even though all the samples are from haploid organisms. How should we interpret these heterozygous calls in our bacterial sequence data? The paired-end data were provided in the fof_reads1.txt and fof_reads2.txt structure described in Case 4 of the discoSNP user guide. Any guidance appreciated!

discosnp • 1.3k views

score 1 · Accepted Answer · 2018-09-18

6.2 years ago

pierre.peterlongo ▴ 900

Hi

Genotyping is optional and can be switched off (-n option) when working on non-diploid species. Computing genotypes for a haploid species is meaningless.

However, 0/1 results should be read as a warning (depending on the effective coverage of each allele): they may reflect approximate repeats in the genome, such as -----A------//------T-------, which can be reported as SNP variants although they are not.
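To act on this warning, one option is to list the samples called 0/1 for each variant so that their allele coverage can be inspected. Here is a minimal sketch in Python, assuming a standard tab-separated multisample VCF data line whose FORMAT column contains a GT key. The function name `het_samples`, the sample names, and the example line are hypothetical and are not part of discoSNP++.

```python
def het_samples(vcf_line, sample_names):
    """Return the names of samples whose GT holds two different alleles (e.g. 0/1)."""
    fields = vcf_line.rstrip("\n").split("\t")
    fmt_keys = fields[8].split(":")   # FORMAT column, e.g. GT:DP
    gt_index = fmt_keys.index("GT")   # position of GT inside each sample field
    flagged = []
    for name, sample in zip(sample_names, fields[9:]):
        gt = sample.split(":")[gt_index]
        alleles = gt.replace("|", "/").split("/")
        # Skip missing calls; flag calls with more than one distinct allele
        if "." not in alleles and len(set(alleles)) > 1:
            flagged.append(name)
    return flagged


# Hypothetical data line with three samples (illustration only)
line = "chr\t10\t.\tA\tT\t.\tPASS\t.\tGT:DP\t0/0:30\t0/1:28\t1/1:31"
print(het_samples(line, ["s1", "s2", "s3"]))  # ['s2']
```

Variants where many haploid samples are flagged are candidates for the repeat-induced false positives described above.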

1/1 results are expected: with no reference genome, the "reference" allele is randomly chosen, so a sample that carries the other allele is called as a homozygous variant.
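If the genotypes have already been computed, homozygous calls can be read as haploid alleles. Here is a minimal Python sketch, assuming that treating 0/1 as missing is acceptable for the downstream analysis; that choice is an assumption of this example, not a discoSNP++ recommendation. `to_haploid` is a hypothetical helper.

```python
def to_haploid(gt):
    """Collapse a diploid GT string to a single haploid allele, or '.' if ambiguous."""
    alleles = gt.replace("|", "/").split("/")
    # Homozygous call: both alleles agree, so report that single allele
    if len(set(alleles)) == 1 and alleles[0] != ".":
        return alleles[0]
    # Heterozygous or missing call: no single allele can be assigned
    return "."


print([to_haploid(g) for g in ["0/0", "0/1", "1/1", "./."]])  # ['0', '.', '1', '.']
```

Because the "reference" allele is arbitrary here, 0 and 1 only say which of the two alleles a sample carries, not which one is ancestral.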

Hope this helps; Pierre

Thanks for the great (and quick!) explanation.

6.2 years ago by dustin.r.long ▴ 10