Hello
I am trying to find SNPs with DiscoSNP, without a reference genome, in wheat, which has a highly complex genome with many repeats. The reads come from a reduced-representation library: the DNA was cut with a restriction enzyme and the adaptors attach only to the sticky ends of the restriction site, so no overlap is expected between reads. The DNA was obtained from plants that were selfed several times, so low heterozygosity is expected. In runs with b=0 or b=1 and all other parameters at their defaults, about 35% of the genotype-by-SNP data points are heterozygous. It seems to me that these heterozygous data points are caused by paralogs and are not true SNPs. The TASSEL pipeline gives the same results. Please recommend parameters that can reduce the number of false heterozygous calls and that work well with non-overlapping reads. Thank you
Hanan
Hello Pierre
Thank you for the response
Sorry I was not clear about no overlap. The situation is like this:
and not partly overlapping like this
So for every cut site there are many reads that start at the same position, but the sequenced region is only one read length long.
Hanan
Hi,
Sorry for the misunderstanding.
My answer about distinguishing heterozygous from homozygous variants remains the same: it should be done only a posteriori, using the read coverage (or the VCF genotype information).
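As a rough illustration of such an a posteriori check, here is a minimal Python sketch that drops sites where too many samples are called heterozygous, which in selfed lines may point to paralogs rather than true SNPs. It assumes a standard multi-sample VCF whose FORMAT column contains a GT key; the function names and the `max_het` threshold of 0.1 are hypothetical and should be tuned to the data.

```python
def het_counts(vcf_line):
    """Return (called, het) genotype counts for one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    gt_idx = fields[8].split(":").index("GT")  # position of GT in FORMAT
    called = het = 0
    for sample in fields[9:]:
        alleles = sample.split(":")[gt_idx].replace("|", "/").split("/")
        if "." in alleles:  # missing genotype, skip it
            continue
        called += 1
        if len(set(alleles)) > 1:  # two different alleles: heterozygous
            het += 1
    return called, het


def filter_sites(lines, max_het=0.1):
    """Keep data lines whose heterozygous fraction is at most max_het.

    max_het is an assumed threshold, not a DiscoSNP recommendation.
    Header lines (starting with '#') are skipped.
    """
    kept = []
    for line in lines:
        if line.startswith("#"):
            continue
        called, het = het_counts(line)
        if called and het / called <= max_het:
            kept.append(line)
    return kept
```

For example, `filter_sites(open("variants.vcf"))` (with a hypothetical file name) would return only the sites that pass the heterozygosity threshold.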
If the dataset is small and you're interested in obtaining not just some variants (very confident, b 0 or b 1) but all of them (less confident), you may use b 2. In this case most of the predictions are false positives, but with downstream filters on the coverage this method may be interesting. We are currently testing this on small amplicon datasets.
However, remember that in complex genomes b 2 may take a lot of time because of the large number of possible paths to traverse in the graph.
Best regards,
Pierre
Hi
I am currently trying b 2 and it is taking forever (more than 4 days so far), but I have the time to wait. What filters do you suggest?
Hi,
If I were you, I'd make two tries: first with b 1 and, if you're expecting more variants, a second with b 2. In both cases, in your context, you may apply post-treatment filters. I'd leave the other parameters at their defaults.
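One possible post-treatment filter on coverage is sketched below in Python. It sets a genotype to missing when its read depth falls outside a chosen window: too low to be trusted, or so high that collapsed repeats are a plausible explanation. It assumes the VCF FORMAT column carries both GT and DP; the function name and the `min_dp`/`max_dp` values are hypothetical and depend on the sequencing depth of the experiment.

```python
def mask_by_depth(vcf_line, min_dp=5, max_dp=200):
    """Return the VCF data line with out-of-range genotypes set to './.'.

    min_dp and max_dp are assumed thresholds, to be tuned per dataset.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    fmt = fields[8].split(":")
    gt_idx, dp_idx = fmt.index("GT"), fmt.index("DP")
    for i in range(9, len(fields)):
        parts = fields[i].split(":")
        dp = parts[dp_idx] if dp_idx < len(parts) else "."
        # Mask when depth is missing or outside [min_dp, max_dp]
        if not dp.isdigit() or not (min_dp <= int(dp) <= max_dp):
            parts[gt_idx] = "./."
            fields[i] = ":".join(parts)
    return "\t".join(fields)
```

Applying this before any per-site heterozygosity count means that poorly supported calls no longer contribute to the heterozygous fraction.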
Pierre