9.3 years ago
tkitapci
Hi,
How can I use discoSNP++ to call SNPs from pooled samples and/or polyploid samples? I have 6 pooled population samples and I want to compare allele frequencies between them (they are from an organism with no reference genome).
Thanks
Best Regards
Hamdi
Hi Hamdi
This question means that you finally succeeded in compiling discoSnp++? (see thread DiscoSNP++ compilation problem)
You may use disco as usual, giving as input the 6 read sets, each corresponding to a pooled population. In this case, the genotyping could be meaningless and misleading. You may disable the genotyping by using the -n option.
Pierre
Hi Pierre,
Thanks a lot for the answer. I could not compile/run the newest version, DiscoSNP++-2.2.0, but I tried the older version, DiscoSNP++-2.1.7, and it worked. (How can I delete my post about the compilation problem?)
I am trying to get the allele frequencies in each pool. I think the number of samples in each pool needs to be an input for DiscoSNP++, right? If I don't input the number of samples, how can I interpret the SNP calls?
Thanks!
Hi,
Sorry for the compilation trouble.
You cannot inform DiscoSnp++ about pooled datasets.
However, the output coverage is given for each read set. This means that you have to normalize this coverage if the number of samples in each pool is not uniform.
I hope I understood your question correctly, and that this answer helps.
Pierre
Hi,
Thanks for your reply. I have the same number of samples in each pool (30 in this case). Do I need to do any normalization in this case? In the case of pooled samples, should I interpret the ratio of the output coverages of the REF/ALT alleles as the allele frequency at that position?
Thanks
Best Regards
T. Hamdi Kitapci
Hi Hamdi,
I'd say that with the same number of samples in each pool, the normalization isn't mandatory.
Indeed, in your case, the REF and ALT coverages give the allele frequency: the coverage of one allele divided by the total coverage (REF + ALT).
Best, Pierre
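As a minimal sketch of that calculation, the function below turns the REF and ALT read coverages of one pool into allele frequencies. The function name and the example numbers (30 and 10 reads) are made up for illustration and do not come from discoSnp++ output. It assumes coverage is proportional to allele abundance in the pool, which is the interpretation discussed above.

```python
def allele_frequencies(ref_cov, alt_cov):
    """Return (ref_freq, alt_freq) for one pool, or None when no read covers the site."""
    total = ref_cov + alt_cov
    if total == 0:
        # No coverage at all: the frequency is undefined, not zero.
        return None
    return ref_cov / total, alt_cov / total


# Hypothetical pool: 30 reads support REF, 10 reads support ALT.
print(allele_frequencies(30, 10))  # (0.75, 0.25)
```

Returning None for an uncovered site keeps "no data" distinct from "allele absent", which matters when comparing pools.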
Hi Pierre,
I got a .VCF file that looks like this
how can I extract the ratio of REF/ALT from the .VCF or .FA file?
Thanks
Best Regards
Hamdi
Hi
Strange VCF. Could you please show a few lines of the ..._coherent.fa file ?
Pierre
Hi Pierre,
When I run it with the -n option, these are a few lines from _coherent.fa:
When I do the same run without -n (calling the genotypes), this is what I get:
There seems to be something wrong when genotypes are not called. In the second version I can see the count of reads mapped to REF vs ALT for each sample. Given that those are not "genotypes" but "pooled samples" (which can similarly be thought of as a polyploid organism with ploidy = 30), is it correct to use the ratio from this file as the allele frequency of the pool?
Thanks
Best Regards
Hamdi
Hi
Indeed, there was a bug in the vcf creation when no genotype was previously computed. Thanks to your message this is now fixed. We'll release this fix quickly.
Indeed, the ratio can be used as the allele frequency (e.g. allele A of your first SNP has a frequency of 100% in all pooled samples except the last one, which has an allele frequency of 100% for G).
Hi Pierre,
How did you get the info that for the first SNP allele A has 100% for all samples except the last one?
In the fasta file it says G1_0/0:5,28,164
I am trying to understand what each number here means. I checked the discoSNP_user_guide.pdf but I can't understand it. 0/0 is the SNP call. What do the numbers 5, 28 and 164 mean? I assumed those numbers are related to the number of reads mapping to each allele, so a ratio of those should give me the allele frequency. I think I am wrong here because you said the allele frequency for this allele is 100%. I would appreciate it if you could clarify this.
Thanks a lot
Hamdi
Here is an explanation:
G1_0/0:5,28,164 means that (in the case of a diploid individual) the most probable genotype is 0/0. The three following values (5, 28, and 164) provide the phred-scaled likelihood of each of the three genotypes (0/0, 0/1 and 1/1). We used the same math formula as GATK for computing these values (see https://www.broadinstitute.org/gatk/guide/tagged?tag=genotype)
In your case, all these values are useless. They are well suited when you want to evaluate a diploid allele frequency. What may interest you is the coverage of each allele. In the first set the allele is covered by 8 reads (C1_8) while the other is covered by zero reads (C1_0). This is why I said that allele A has a 100% frequency in this dataset.
I hope this helps.
Pierre
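To make the genotype field concrete, here is a small sketch that splits G1_0/0:5,28,164 into its parts and converts the three phred-scaled values back to relative likelihoods. It assumes the usual phred convention, value = -10 * log10(likelihood); the function names are hypothetical and are not part of discoSnp++.

```python
import re


def parse_genotype_field(field):
    """Split 'G1_0/0:5,28,164' into (sample index, genotype, phred-scaled values)."""
    m = re.fullmatch(r"G(\d+)_(\d/\d):(\d+),(\d+),(\d+)", field)
    if m is None:
        raise ValueError(f"unexpected genotype field: {field!r}")
    sample = int(m.group(1))
    genotype = m.group(2)
    phred = [int(m.group(i)) for i in (3, 4, 5)]
    return sample, genotype, phred


def relative_likelihoods(phred):
    """Undo the phred scaling (assumed -10*log10) and normalize so the values sum to 1."""
    raw = [10 ** (-p / 10) for p in phred]
    total = sum(raw)
    return [r / total for r in raw]


sample, genotype, phred = parse_genotype_field("G1_0/0:5,28,164")
probs = relative_likelihoods(phred)
# The smallest phred value (5, for 0/0) gives the largest relative likelihood.
print(sample, genotype, [round(p, 4) for p in probs])
```

For pooled data these likelihoods remain a diploid-model quantity, so they only illustrate how to read the field; the per-allele coverages are what carry the frequency information.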
Hi Pierre,
I just want to reiterate to make sure I am getting this right (also, I am sorry if this info is already in the documentation; I tried to find it but could not. I would appreciate it if you could point out where I can find these details).
For this one in the .fa file:
Are C1_8|C2_11|C3_11 ... etc. showing the number of reads that cover the REF allele in samples 1, 2, 3 respectively? Similarly, are Q1_71|Q2_70|Q3_70 showing the number of reads that cover the ALT allele in samples 1, 2, 3 ... respectively? So can I use the ratio C1_X/Q1_Y as the allele frequency for sample 1 at this locus?
Thanks a lot
Best Regards
T. Hamdi Kitapci
C1_8|C2_11|C3_11 are coverages of the REF allele in samples 1, 2, 3.
Q1_71|Q2_70|Q3_70 are average phred qualities of the REF allele in samples 1, 2, 3.
For finding the same information for the ALT allele, see C1, C2, C3 from the "lower_path" comment: "C1_0|C2_0|C3_0|C4_0|C5_0|C6_0|C7_0|C8_3" are coverages of the ALT allele in all samples. The ALT allele is present only in the last sample (with quality Q8_71).
Pierre
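The description above can be sketched as a small parser that reads the C fields of the two header comments and derives a per-pool REF frequency. It uses only the first three coverage values quoted in this thread; the function name is hypothetical, and treating one comment as REF and the "lower_path" comment as ALT follows the explanation given here rather than a checked specification of the .fa header format.

```python
import re


def parse_coverages(comment):
    """Turn 'C1_8|C2_11|C3_11' into {1: 8, 2: 11, 3: 11} (sample index -> read coverage)."""
    return {int(s): int(c) for s, c in re.findall(r"C(\d+)_(\d+)", comment)}


# Coverage fields quoted in the thread, restricted to samples 1 to 3.
ref_cov = parse_coverages("C1_8|C2_11|C3_11")   # REF allele
alt_cov = parse_coverages("C1_0|C2_0|C3_0")     # ALT allele ("lower_path" comment)

ref_freq = {}
for sample in sorted(ref_cov):
    total = ref_cov[sample] + alt_cov[sample]
    # Leave the frequency undefined (None) when a pool has no coverage at all.
    ref_freq[sample] = ref_cov[sample] / total if total else None

print(ref_freq)  # {1: 1.0, 2: 1.0, 3: 1.0}
```

The Q fields are skipped on purpose: they are average phred qualities, not read counts, so they do not enter the frequency.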