Question

DiscoSNP++: reducing missing data for heterozygous species


6.9 years ago

dussert.yann • 0

Hi,

I am trying to use DiscoSNP++ v2.3 on a data set of 30 individuals (Illumina resequencing, 100/150 bp reads and 25 to 50x coverage depending on the individual).

I get a lot of missing data (./.), probably because, for many of the detected SNPs, a high proportion of individuals do not have enough reads mapped by the kissreads module. I suspect part of the problem is that I am working with a very heterozygous species, with a high SNP density in some genomic regions.

I tried different values for the maximum number of polymorphisms (-P) and for the number of substitutions authorized while mapping reads (-d), with no real improvement. Does anyone have a suggestion for fine-tuning the parameters in my case?

Thank you in advance for your answer. Yann

discosnp variant calling • 1.5k views


score 0 · Answer 1 · 2018-02-05


6.9 years ago

pierre.peterlongo ▴ 900

Hi Yann,

I do not understand why this is an issue. Are you expecting those variants to be covered?

The advice I can give you is to look at the coverages themselves rather than at the genotypes only. Whether a variant is reported as missing (./.) depends on the solidity threshold used by discoSnp (-c parameter, by default determined automatically from the sequencing coverage). Thus, looking at the read coverage may provide more information.
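As a rough illustration of that advice, here is a minimal Python sketch that lists the read depth next to each genotype. It assumes a standard VCF layout in which the FORMAT column declares a `DP` field; the sample names (`ind1` to `ind3`) and the two records are made up for the example and are not real DiscoSNP output.

```python
def sample_coverages(vcf_text):
    """Yield (variant_id, {sample: (genotype, depth)}) for each VCF record."""
    samples = []
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and empty lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        keys = fields[8].split(":")  # FORMAT column, e.g. GT:DP
        per_sample = {}
        for name, raw in zip(samples, fields[9:]):
            values = dict(zip(keys, raw.split(":")))
            genotype = values.get("GT", "./.")
            dp = values.get("DP", ".")
            depth = int(dp) if dp.isdigit() else 0  # treat "." as no coverage
            per_sample[name] = (genotype, depth)
        yield fields[2], per_sample


# Made-up records for illustration only.
EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.1",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "ind1", "ind2", "ind3"]),
    "\t".join(["SNP_1", "31", "1", "A", "G", ".", ".", ".",
               "GT:DP", "0/1:24", "./.:3", "./.:0"]),
    "\t".join(["SNP_2", "31", "2", "C", "T", ".", ".", ".",
               "GT:DP", "0/0:30", "0/1:27", "1/1:22"]),
])

for variant_id, calls in sample_coverages(EXAMPLE_VCF):
    print(variant_id, calls)
```

In the made-up first record, `ind2` and `ind3` are both ./., but one has a few reads and the other has none, which is the kind of difference the genotype column alone hides.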

Hoping this helps.



Hi Pierre,

Thanks for your answer, and also for developing DiscoSNP.

I am expecting the variants to be covered, but I was thinking that regions with too many SNPs (i.e. very heterozygous regions) would not be properly mapped by DiscoSNP, leading to low or no coverage and missing data. We had some issues mapping reads to our reference genome because of that.

For the large majority of SNPs with lots of missing data, only one or two individuals are genotyped and the others have a read coverage of 0 most of the time (so maybe these are not real polymorphisms?). Concerning the solidity thresholds determined by DiscoSNP, they seem reasonable compared to the genome coverage of the individuals.
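The pattern described above (one or two covered individuals, zero reads everywhere else) can be flagged with a few lines of Python. This is only a sketch: the variant names, the depths, and the `max_covered` cut-off are hypothetical, and the depths would in practice come from the per-individual coverage values of the output VCF.

```python
def is_sparse(depths, max_covered=2):
    """True if at least one but at most `max_covered` individuals have reads."""
    covered = sum(1 for depth in depths if depth > 0)
    return 0 < covered <= max_covered


# Hypothetical per-individual read depths for three variants.
variant_depths = {
    "snp_a": [22, 0, 0, 0, 0],
    "snp_b": [18, 25, 30, 0, 21],
    "snp_c": [0, 15, 0, 12, 0],
}

sparse = sorted(name for name, depths in variant_depths.items()
                if is_sparse(depths))
print(sparse)
```

Variants flagged this way can then be inspected separately from those where most individuals are covered.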

Reply • 6.9 years ago by dussert.yann • 0


Hi, sorry for the delay,

In regions with high variant density, discoSnp results depend on the -b option. With -b 0 no variant is detected, with -b 1 some of them are detected, and with -b 2 almost all of them are detected (but with a high false positive rate on complex data).

Once variants are detected, the mapping step comes in. Reads are mapped back onto the variant sequences (the sequences of the bubbles) for counting (and later genotyping) purposes. During this mapping process, we forbid any mismatch at variant positions and, by default, we authorize up to one mismatch elsewhere. You can tune this parameter with the -d option.
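The mapping rule just described can be sketched in Python as follows. This is an illustration of the rule only, not the kissreads implementation: the function name, its arguments, and the requirement that the read lie entirely within the variant sequence are assumptions made for the example; `max_subst` plays the role of the -d value.

```python
def read_maps(read, variant_seq, start, variant_positions, max_subst=1):
    """Return True if `read`, placed at offset `start`, is accepted.

    Any mismatch at a variant position rejects the read; elsewhere,
    up to `max_subst` substitutions are tolerated.
    """
    if start < 0 or start + len(read) > len(variant_seq):
        return False  # simplification: the read must fit in the sequence
    mismatches = 0
    for i, base in enumerate(read):
        pos = start + i
        if base != variant_seq[pos]:
            if pos in variant_positions:
                return False  # no mismatch allowed on the variant itself
            mismatches += 1
            if mismatches > max_subst:
                return False
    return True


variant_seq = "ACGTACGTAC"
variant_positions = {5}  # 0-based position of the SNP on this path

print(read_maps("GTACGT", variant_seq, 2, variant_positions))  # exact match
print(read_maps("GTAGGT", variant_seq, 2, variant_positions))  # mismatch on the SNP
print(read_maps("GAACGT", variant_seq, 2, variant_positions))  # one mismatch elsewhere
print(read_maps("GAACGA", variant_seq, 2, variant_positions))  # two mismatches elsewhere
```

Under this rule, a read carrying a second, nearby SNP consumes the substitution budget, so in dense regions raising the budget lets more reads count toward coverage.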

Reply • 6.9 years ago by pierre.peterlongo ▴ 900


Hi,

I think I understand, thanks!

I am still wondering how DiscoSNP behaves when there is a mixture of shared polymorphisms and low-frequency SNPs. For example:

CACCCTGCAATTGAACGTGCACGCGTTTTTCAAGCATTGCGTCATCAGTAGAGCATTCCAC
CACCCTGCAATTGAACGTGGACGCGTTTTTCAAGCATTGCGTCATCAGTAGAGCATTCCAC
CACCCAGCAATTGAACGTGGACGCGTTATTCAAGCATTGCGTCATCAGTAGAGCATTCCAC
CACCCAGCAATTGAACGTGGACGCGTTTTTCAAGCATTGCGTCAACAGTAGAGCATTCCAC

Does DiscoSNP detect only the shared polymorphism, all of them, or none?
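To make the example concrete, the columns at which the four sequences differ can be listed with a short Python sketch. It only compares the sequences column by column; it says nothing about what DiscoSNP itself would report. The names `HAPLOTYPES` and `variable_sites` are introduced here for illustration.

```python
HAPLOTYPES = [
    "CACCCTGCAATTGAACGTGCACGCGTTTTTCAAGCATTGCGTCATCAGTAGAGCATTCCAC",
    "CACCCTGCAATTGAACGTGGACGCGTTTTTCAAGCATTGCGTCATCAGTAGAGCATTCCAC",
    "CACCCAGCAATTGAACGTGGACGCGTTATTCAAGCATTGCGTCATCAGTAGAGCATTCCAC",
    "CACCCAGCAATTGAACGTGGACGCGTTTTTCAAGCATTGCGTCAACAGTAGAGCATTCCAC",
]


def variable_sites(seqs):
    """Return [(position, {base: count})] for columns where sequences differ."""
    sites = []
    for pos, column in enumerate(zip(*seqs)):
        counts = {}
        for base in column:
            counts[base] = counts.get(base, 0) + 1
        if len(counts) > 1:  # more than one base observed in this column
            sites.append((pos, counts))
    return sites


for pos, counts in variable_sites(HAPLOTYPES):
    minor = min(counts.values())
    print(f"0-based position {pos}: {counts}, "
          f"minor allele in {minor} of {len(HAPLOTYPES)} sequences")
```

The minor-allele count separates the column shared by two sequences from the columns where a single sequence carries the alternative base.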

Thanks in advance, Yann

Reply • 6.9 years ago by dussert.yann • 0