discoSNP++ comparison of shared and unique variants between different samples without reference sequence
2
2
Entering edit mode
7.0 years ago

Hi folks,

I am testing discoSNP++ on my data and I have some questions.

My data is composed of NGS Illumina paired-reads for three species of the legume genus Stylosanthes. Two species are supposed to be diploid parents of the third allotetraploid species. I was able to run discoSNP++ for all three samples, but I am not sure how can I compare the results between them. Is there possible to use vcf-compare to see shared and unique SNPs among all samples? Or can I just  compare overall amount of SNPs found. If I use a set of contigs obtained from a de novo assembly of the allotetraploid and use this as reference for the mapping all samples, can I  then compare unique and shared SNPs? For me it would be ideal to find shared and unique SNPs to support the allopolyploid origin of this species and further supports the two diploid as its parents.

Thanks in advance, André

discoSNP discoSNP++ • 2.1k views
ADD COMMENT
2
Entering edit mode
7.0 years ago

Hi all,

Thanks André for asking your question here. This is a generic need, that should be more explicit in the documentation. There are two main ways to answer your question (in absence of a reference genome, else, simply use the -G discoSnp++ option to obtain variant locus).

  • Run discoSnp on all read sets together. In practice you'll have to generate four "file of file" files (fof). For each of your three samples, a fof (say fof1.txt, fof2.txt, fof3.txt) describes the two read files (one pair of read files per species) and one fof (says root_fof.txt) that describes fof1.txt, fof2.txt and fof3.txt. Run discoSnp++.sh on the root_fof.txt. All variants will be detected at once and genotyped automatically.
  • It may appear that a user prefers to detect independently variants from several species, and then genotype variants in all species taken together. Here is a way to do it using discoSnp: Perform discoSnp++.sh with option -X(added today in commit 22d80). Option -X will stop discoSnp++ after the variant detection, not mapping reads. Thus no read coverage information per allele and per read set is provided. Then use kissreads module and vcf creator module to genotype variants with all read sets.
    • In practice, 1/ run three times discoSnp++.sh with fof1.txt, with fof2.txt, with fof3.txt, each time using -X; 2/ run three times kissreads with each of the three generated .fa files, using the root_fof.txt so reads from all species will be mapped. This will generate three fasta files containing read coverage information per allele and per read set; 3/ Finally, those three .fasta files may be formatted as vcf files using vcf_creator.

Hope this helps, Pierre

ADD COMMENT
1
Entering edit mode

Hi Pierre,

I did run option 1 and it seems to be working. Now I would like to test the second option and compare results.

I wrote the shell script below to automate the analysis:

#! /bin/bash

# 1st step
/home/marques/Documents/DiscoSnp/run_discoSnp++.sh -r fof_reads_2701.txt -p 2701-discoSNP-5th-RUN -t -T -D 1 -X

/home/marques/Documents/DiscoSnp/run_discoSnp++.sh -r fof_reads_2702.txt -p 2702-discoSNP-5th-RUN -t -T -D 1 -X

/home/marques/Documents/DiscoSnp/run_discoSnp++.sh -r fof_reads_2703.txt -p 2703-discoSNP-5th-RUN -t -T -D 1 -X

# 2nd step Kissreads2

/home/marques/Documents/DiscoSnp/build/bin/kissreads2 -predictions 2701-discoSNP-5th-RUN*.fa -reads root_fof.txt -co 2701-discoSNP-5th-RUN_coherent -unco 2701-discoSNP-5th-RUN_uncoherent -k 31 -coverage_file 2701-discoSNP-5th-RUN_cov.h5

/home/marques/Documents/DiscoSnp/build/bin/kissreads2 -predictions 2702-discoSNP-5th-RUN*.fa -reads root_fof.txt -co 2702-discoSNP-5th-RUN_coherent -unco 2702-discoSNP-5th-RUN_uncoherent -k 31 -coverage_file 2702-discoSNP-5th-RUN_cov.h5

/home/marques/Documents/DiscoSnp/build/bin/kissreads2 -predictions 2703-discoSNP-5th-RUN*.fa -reads root_fof.txt -co 2703-discoSNP-5th-RUN_coherent -unco 2703-discoSNP-5th-RUN_uncoherent -k 31 -coverage_file 2703-discoSNP-5th-RUN_cov.h5

# 3rd step format files

sort -rg 2701-discoSNP-5th-RUN_coherent | cut -d " " -f 2 | tr ';' '\n' > 2701-discoSNP-5th-RUN_coherent.fa

sort -rg 2702-discoSNP-5th-RUN_coherent | cut -d " " -f 2 | tr ';' '\n' > 2702-discoSNP-5th-RUN_coherent.fa

sort -rg 2703-discoSNP-5th-RUN_coherent | cut -d " " -f 2 | tr ';' '\n' > 2703-discoSNP-5th-RUN_coherent.fa

# 4th step VCF_creator
/home/marques/Documents/DiscoSnp/scripts/run_VCF_creator.sh -p 2701-discoSNP-5th-RUN_coherent.fa -o 2701-discoSNP-5th-RUN.vcf

/home/marques/Documents/DiscoSnp/scripts/run_VCF_creator.sh -p 2702-discoSNP-5th-RUN_coherent.fa -o 2702-discoSNP-5th-RUN.vcf

/home/marques/Documents/DiscoSnp/scripts/run_VCF_creator.sh -p 2703-discoSNP-5th-RUN_coherent.fa -o 2703-discoSNP-5th-RUN.vcf

However, when I compare the vcf files with vcf-compare I got very different results from 1st way. In the second way I got most SNPs not been shared with other samples, so I think I am doing something wrong...

Can you help me, please?!

Best, André

ADD REPLY
0
Entering edit mode

Thanks for your message André,

Your commands look perfect. For kissreads steps, you may remove the -coverage_file 270X-discoSNP-5th-RUN_cov.h5 option. This one provides a solidity threshold per input read set (for coherent vs uncoherent variants distinction), but, as we hack here kissreads using it with more read sets than what is contained in those .h5 files, they should be removed.

Btw, I think that your problem come from the fact that VCF are not comparables with this second solution. We have no reference, thus ''chromosome'' and positions in the VCFs are only related to sequences constructed by discoSnp itself. Those sequences (and sequences ids) are not comparable from one run to another.

The idea of this second approach is to detect the variants per read set (explaining why you obtain three .fasta files), and, for each of those fasta files, you compute the coverage from all read sets, providing genotyping for all your three read sets.

The drawback indeed, stands in the fact that numerous variants are predicted two or three times. Maybe simply checking coverage in the final three VCF files could fix this issue:
- for VCF file 2701-discoSNP-5th-RUN.vcfconsider all variants
- for 2702-discoSNP-5th-RUN.vcf: consider only variants whose coverage is nearly null in read set corresponding to 2701
- for 2703-discoSNP-5th-RUN.vcf: consider only variants whose coverage is nearly null in read set corresponding to 2702 and 2703

Best, Pierre

ADD REPLY
0
Entering edit mode

I would also add the -genotype flag to get the actual genotypes.

ADD REPLY
0
Entering edit mode

Hi Pierre,

Many thanks for explaining how to do that. I will try it out and will report back here if it worked.

Best, André

ADD REPLY
0
Entering edit mode

Hi Pierre,

What are the thresholds used by discoSnp to consider a variation found in the reads as a SNP and indel? How can I adjust that?

Best, André

ADD REPLY
0
Entering edit mode

Hi André,

Your question is not clear to me, but I'm trying to answer it there: There is a threshold applied on kmers abundance (kmer seen less than T times are not considered in the graph) and there is a threshold for read coherency (once reads are mapped back on alleles, we verify that each allele has at least one read set from which the number of mapped reads is >= T).

The threshold is by automatically detected per read set by analysing kmer spectrum or it can be set using option -c (see doc).

Hope this helps, Pierre

ADD REPLY
0
Entering edit mode

Sorry for not being clear...

I mean that normally to consider a variation found as a SNP, the nucleotide polymorphism needs to be found in at least X reads, considering a minimum coverage that can be specified. So this information is not clear for me in discoSNP.

ADD REPLY

Login before adding your answer.

Traffic: 1939 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6