Hi,
I am trying to use DiscoSNP++ v2.3 on a data set of 30 individuals (Illumina resequencing, 100/150 bp reads and 25 to 50x coverage depending on the individual).
I get a lot of missing data (./.), probably because, for many of the detected SNPs, a high proportion of individuals do not have enough reads mapped by the kissreads module. I suspect part of the problem is that I am working with a very heterozygous species, with a high SNP density in some genomic regions.
I tried different values for the maximum number of polymorphisms (-P) and the number of authorized substitutions while mapping reads (-d), with no real improvement. Does anyone have a suggestion for fine-tuning the parameters in my case?
Thank you in advance for your answer. Yann
Hi Pierre,
Thanks for your answer, and also for developing DiscoSNP.
I am expecting variants to be covered, but I was thinking that regions with too many SNPs (i.e. very heterozygous regions) would not be properly mapped by DiscoSNP, leading to low or no coverage and missing data. We had some issues mapping reads to our reference genome because of that.
For the large majority of SNPs with lots of missing data, only one or two individuals are genotyped and the others have a read coverage of 0 most of the time (so maybe these are not real polymorphisms?). Concerning the solidity thresholds determined by DiscoSNP, they seem reasonable compared to the genome coverage of the individuals.
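To put a number on this pattern, the fraction of missing genotypes per variant can be computed directly from the VCF. The sketch below is only an illustration: the function names are made up here, and it assumes a standard tab-separated VCF in which GT is the first subfield of each sample column.

```python
def missing_fraction(vcf_line):
    """Return the fraction of samples with a missing genotype in one VCF record."""
    fields = vcf_line.rstrip("\n").split("\t")
    samples = fields[9:]  # 9 fixed VCF columns, then one column per sample
    # Assumption: GT is the first colon-separated subfield of each sample column
    genotypes = [s.split(":")[0] for s in samples]
    missing = sum(1 for gt in genotypes if gt in ("./.", ".|.", "."))
    return missing / len(genotypes) if genotypes else 0.0


def summarize(vcf_lines):
    """Yield (variant_id, missing_fraction) for every non-header record."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        yield line.split("\t")[2], missing_fraction(line)
```

Passing an open VCF file to `summarize` and sorting the result by missing fraction should make it easy to see whether the "one or two individuals genotyped" variants dominate the output.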
Hi, sorry for the delay,
In regions with high variant density, discoSnp results depend on the -b option. With -b 0, no variant is detected in such regions; with -b 1, some of them are detected; and with -b 2, almost all of them are detected (but with a high false positive rate on complex data).
Once variants are detected, mapping comes into play. Reads are mapped back onto the variant sequences (the sequences of the bubbles) for counting (and later genotyping) purposes. During this mapping step, we forbid any mismatch at variant positions and, by default, allow up to one mismatch elsewhere. You can tune this parameter with the -d option.
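The mapping rule just described can be sketched in a few lines of Python. This is only an illustration of the rule, not the kissreads implementation: the function `read_maps`, its parameters, and the simplification that the read lies entirely inside the variant sequence at a known offset are all assumptions made for the example. `max_mismatches` plays the role of the -d value.

```python
def read_maps(read, variant_seq, offset, variant_positions, max_mismatches=1):
    """Return True if `read`, placed at `offset` on `variant_seq`, is accepted.

    variant_positions: set of 0-based positions on variant_seq where
    no mismatch is tolerated (the polymorphic sites of the bubble).
    """
    # Simplifying assumption: the read must fit entirely on the sequence
    if offset < 0 or offset + len(read) > len(variant_seq):
        return False
    mismatches = 0
    for i, base in enumerate(read):
        pos = offset + i
        if base != variant_seq[pos]:
            if pos in variant_positions:
                return False  # a mismatch at a variant position is forbidden
            mismatches += 1
            if mismatches > max_mismatches:
                return False  # too many mismatches outside variant positions
    return True
```

Under this rule, a read that carries an extra, undetected SNP close to the variant still maps as long as the number of such differences stays within the -d limit, which may explain why raising -d helps only up to a point in very dense regions.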
Hi,
I think I understand, thanks!
I am still wondering how DiscoSNP behaves when there is a mixture of shared polymorphisms and low-frequency SNPs. For example:
CACCCTGCAATTGAACGTGCACGCGTTTTTCAAGCATTGCGTCATCAGTAGAGCATTCCAC
CACCCTGCAATTGAACGTGGACGCGTTTTTCAAGCATTGCGTCATCAGTAGAGCATTCCAC
CACCCAGCAATTGAACGTGGACGCGTTATTCAAGCATTGCGTCATCAGTAGAGCATTCCAC
CACCCAGCAATTGAACGTGGACGCGTTTTTCAAGCATTGCGTCAACAGTAGAGCATTCCAC
Does DiscoSNP detect only the shared polymorphism, all of them, or none?
Thanks in advance, Yann
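For readers following the example, the differences between the four sequences above are hard to spot by eye. The short Python sketch below lists the variable columns and their allele counts. It says nothing about what DiscoSNP itself would report; `variable_sites` is a helper name invented here, and positions are 0-based.

```python
from collections import Counter

# The four example sequences from the post above, copied verbatim
HAPLOTYPES = [
    "CACCCTGCAATTGAACGTGCACGCGTTTTTCAAGCATTGCGTCATCAGTAGAGCATTCCAC",
    "CACCCTGCAATTGAACGTGGACGCGTTTTTCAAGCATTGCGTCATCAGTAGAGCATTCCAC",
    "CACCCAGCAATTGAACGTGGACGCGTTATTCAAGCATTGCGTCATCAGTAGAGCATTCCAC",
    "CACCCAGCAATTGAACGTGGACGCGTTTTTCAAGCATTGCGTCAACAGTAGAGCATTCCAC",
]


def variable_sites(seqs):
    """Map each 0-based column holding more than one allele to its allele counts."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must have equal length")
    sites = {}
    for pos, column in enumerate(zip(*seqs)):
        counts = Counter(column)
        if len(counts) > 1:  # at least two different bases in this column
            sites[pos] = dict(counts)
    return sites


if __name__ == "__main__":
    for pos, counts in variable_sites(HAPLOTYPES).items():
        print(pos, counts)
```

On these sequences the helper finds four variable columns: one where the two alleles are split evenly between the sequences, and three where a single sequence carries the minority allele.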