Hi folks,
I am testing discoSNP++ on my data and I have some questions.
My data is composed of NGS Illumina paired-reads for three species of the legume genus Stylosanthes. Two species are supposed to be diploid parents of the third allotetraploid species. I was able to run discoSNP++ for all three samples, but I am not sure how can I compare the results between them. Is there possible to use vcf-compare to see shared and unique SNPs among all samples? Or can I just compare overall amount of SNPs found. If I use a set of contigs obtained from a de novo assembly of the allotetraploid and use this as reference for the mapping all samples, can I then compare unique and shared SNPs? For me it would be ideal to find shared and unique SNPs to support the allopolyploid origin of this species and further supports the two diploid as its parents.
Thanks in advance, André
Hi Pierre,
I did run option 1 and it seems to be working. Now I would like to test the second option and compare results.
I wrote the shell script below to automate the analysis:
However, when I compare the vcf files with vcf-compare I got very different results from 1st way. In the second way I got most SNPs not been shared with other samples, so I think I am doing something wrong...
Can you help me, please?!
Best, André
Thanks for your message André,
Your commands look perfect. For kissreads steps, you may remove the
-coverage_file 270X-discoSNP-5th-RUN_cov.h5
option. This one provides a solidity threshold per input read set (for coherent vs uncoherent variants distinction), but, as we hack here kissreads using it with more read sets than what is contained in those .h5 files, they should be removed.Btw, I think that your problem come from the fact that VCF are not comparables with this second solution. We have no reference, thus ''chromosome'' and positions in the VCFs are only related to sequences constructed by discoSnp itself. Those sequences (and sequences ids) are not comparable from one run to another.
The idea of this second approach is to detect the variants per read set (explaining why you obtain three .fasta files), and, for each of those fasta files, you compute the coverage from all read sets, providing genotyping for all your three read sets.
The drawback indeed, stands in the fact that numerous variants are predicted two or three times. Maybe simply checking coverage in the final three VCF files could fix this issue:
- for VCF file
2701-discoSNP-5th-RUN.vcf
consider all variants- for
2702-discoSNP-5th-RUN.vcf
: consider only variants whose coverage is nearly null in read set corresponding to 2701- for
2703-discoSNP-5th-RUN.vcf
: consider only variants whose coverage is nearly null in read set corresponding to 2702 and 2703Best, Pierre
I would also add the
-genotype
flag to get the actual genotypes.Hi Pierre,
Many thanks for explaining how to do that. I will try it out and will report back here if it worked.
Best, André
Hi Pierre,
What are the thresholds used by discoSnp to consider a variation found in the reads as a SNP and indel? How can I adjust that?
Best, André
Hi André,
Your question is not clear to me, but I'm trying to answer it there: There is a threshold applied on kmers abundance (kmer seen less than T times are not considered in the graph) and there is a threshold for read coherency (once reads are mapped back on alleles, we verify that each allele has at least one read set from which the number of mapped reads is >= T).
The threshold is by automatically detected per read set by analysing kmer spectrum or it can be set using option -c (see doc).
Hope this helps, Pierre
Sorry for not being clear...
I mean that normally to consider a variation found as a SNP, the nucleotide polymorphism needs to be found in at least X reads, considering a minimum coverage that can be specified. So this information is not clear for me in discoSNP.