Question

Pooled / polyploid samples

9.6 years ago

tkitapci ▴ 60

Hi,

How can I use discoSNP++ to call SNPs from pooled and/or polyploid samples? I have 6 pooled population samples and I want to compare allele frequencies between them (the organism has no reference genome).

Thanks

Best Regards
Hamdi

discosnp • 3.7k views

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 9.6 years ago by tkitapci ▴ 60

Hi Hamdi

Does this question mean that you finally succeeded in compiling discoSnp++? (See the thread "DiscoSNP++ compilation problem".)

You can run disco as usual, giving it the 6 read sets as input, each corresponding to one pooled population. In this case, the genotyping could be meaningless and misleading. You can disable the genotyping with the -n option.

Pierre

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.6 years ago by pierre.peterlongo ▴ 900

Hi Pierre,

Thanks a lot for the answer. I could not compile/run the newest version, DiscoSNP++-2.2.0, but I tried the older version, DiscoSNP++-2.1.7, and it worked. (How can I delete my post about the compilation problem?)

I am trying to get the allele frequencies in each pool. I think the number of samples in each pool needs to be an input for DiscoSNP++, right? If I don't input the number of samples, how can I interpret the SNP calls?

Thanks!

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.6 years ago by tkitapci ▴ 60

Hi,

Sorry for the compilation trouble.

You cannot inform DiscoSnp++ about pooled datasets.

However, the output coverage is given for each read set. This means that you have to normalize this coverage if the number of samples in each pool is not uniform.

I hope I understood your question correctly and that this answer helps.

Pierre

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.6 years ago by pierre.peterlongo ▴ 900

Hi,

Thanks for your reply. I have the same number of samples in each pool (30 in this case). Do I need to do any normalization in this case? For pooled samples, should I interpret the ratio of the output coverages of the REF/ALT alleles as the allele frequency at that position?

Thanks

Best Regards

T. Hamdi Kitapci

ADD REPLY • link 9.6 years ago by tkitapci ▴ 60

Hi Hamdi,

I'd say that with the same number of samples in each pool, the normalization isn't mandatory.

Indeed, in your case, the REF/ALT coverage ratio corresponds to the allele frequency.

Best, Pierre

ADD REPLY • link 9.5 years ago by pierre.peterlongo ▴ 900

Hi Pierre,

I got a .VCF file that looks like this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT
SNP_higher_path_94640   30      94640   A       G       .       .       Ty=SNP;Rk=1;UL=.;UR=.;CL=.;CR=. 0       0
SNP_higher_path_92763   30      92763   A       C       .       .       Ty=SNP;Rk=1;UL=.;UR=.;CL=.;CR=. 0       0
SNP_higher_path_8939    30      8939    A       G       .       .       Ty=SNP;Rk=1;UL=.;UR=.;CL=.;CR=. 0       0
SNP_higher_path_8409    30      8409    A       C       .       .       Ty=SNP;Rk=1;UL=.;UR=.;CL=.;CR=. 0       0

How can I extract the REF/ALT ratio from the .VCF or .FA file?

Thanks

Best Regards
Hamdi

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.5 years ago by tkitapci ▴ 60

Hi

Strange VCF. Could you please show a few lines of the ..._coherent.fa file?

Pierre

ADD REPLY • link 9.5 years ago by pierre.peterlongo ▴ 900

Hi Pierre,

When I run it with the -n option, these are a few lines from _coherent.fa:

>SNP_higher_path_94640|P_1:30_A/G|high|nb_pol_1|C1_8|C2_11|C3_11|C4_22|C5_14|C6_10|C7_14|C8_0|Q1_71|Q2_70|Q3_70|Q4_68|Q5_68|Q6_70|Q7_70|Q8_0|rank_1
ATAAGTGAGAGACACCCACACGGAGTACTTATTTCGGGGAGCGACAACCTGAACCAAACAG
>SNP_lower_path_94640|P_1:30_A/G|high|nb_pol_1|C1_0|C2_0|C3_0|C4_0|C5_0|C6_0|C7_0|C8_3|Q1_0|Q2_0|Q3_0|Q4_0|Q5_0|Q6_0|Q7_0|Q8_71|rank_1
ATAAGTGAGAGACACCCACACGGAGTACTTGTTTCGGGGAGCGACAACCTGAACCAAACAG
>SNP_higher_path_92763|P_1:30_A/C|high|nb_pol_1|C1_0|C2_0|C3_0|C4_0|C5_0|C6_0|C7_0|C8_5|Q1_0|Q2_0|Q3_0|Q4_0|Q5_0|Q6_0|Q7_0|Q8_71|rank_1
ATAAAGCCCAATGTATAATTTCCCGCGTAAATGACATGACACAACTAAACATTTCTTGGTG

When I do the same run without -n (calling the genotypes), this is what I get:

>SNP_higher_path_94598|P_1:30_A/G|high|nb_pol_1|C1_8|C2_11|C3_11|C4_22|C5_14|C6_10|C7_14|C8_0|Q1_71|Q2_70|Q3_70|Q4_68|Q5_68|Q6_70|Q7_70|Q8_0|G1_0/0:5,28,164|G2_0/0:5,37,224|G3_0/0:5,37,224|G4_0/0:5,70,444|G5_0/0:5,46,284|G6_0/0:5,34,204|G7_0/0:5,46,284|G8_1/1:64,13,4|rank_1
ATAAGTGAGAGACACCCACACGGAGTACTTATTTCGGGGAGCGACAACCTGAACCAAACAG
>SNP_lower_path_94598|P_1:30_A/G|high|nb_pol_1|C1_0|C2_0|C3_0|C4_0|C5_0|C6_0|C7_0|C8_3|Q1_0|Q2_0|Q3_0|Q4_0|Q5_0|Q6_0|Q7_0|Q8_71|G1_0/0:5,28,164|G2_0/0:5,37,224|G3_0/0:5,37,224|G4_0/0:5,70,444|G5_0/0:5,46,284|G6_0/0:5,34,204|G7_0/0:5,46,284|G8_1/1:64,13,4|rank_1
ATAAGTGAGAGACACCCACACGGAGTACTTGTTTCGGGGAGCGACAACCTGAACCAAACAG
>SNP_higher_path_92747|P_1:30_A/C|high|nb_pol_1|C1_0|C2_0|C3_0|C4_0|C5_0|C6_0|C7_0|C8_5|Q1_0|Q2_0|Q3_0|Q4_0|Q5_0|Q6_0|Q7_0|Q8_71|G1_1/1:164,28,5|G2_0/1:4,4,4|G3_0/1:4,4,4|G4_1/1:44,10,4|G5_0/1:4,4,4|G6_0/1:4,4,4|G7_0/1:4,4,4|G8_0/0:4,19,104|rank_1
ATAAAGCCCAATGTATAATTTCCCGCGTAAATGACATGACACAACTAAACATTTCTTGGTG

There seems to be something wrong when genotypes are not called. In the second version I can see the count of reads mapped to REF vs ALT for each sample. Given that those are not "genotypes" but "pooled samples" (which can similarly be thought of as a polyploid organism with ploidy = 30), is it correct to use the ratio from this file as the allele frequency of the pool?

Thanks

Best Regards
Hamdi

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.5 years ago by tkitapci ▴ 60

Hi

Indeed, there was a bug in the VCF creation when no genotype had been computed beforehand. Thanks to your message, this is now fixed. We'll release the fix quickly.

Indeed, the ratio can be used as the allele frequency (e.g. allele A of your first SNP has a frequency of 100% in all pooled samples except the last one, in which allele G has a frequency of 100%).

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.5 years ago by pierre.peterlongo ▴ 900

Hi Pierre,

How did you work out that, for the first SNP, allele A has a frequency of 100% in all samples except the last one?

In the fasta file it says G1_0/0:5,28,164, and I am trying to understand what each number means. I checked the discoSNP_user_guide.pdf but I can't work it out. 0/0 is the SNP call. What do the numbers 5, 28 and 164 mean? I assumed those numbers are related to the number of reads mapping to each allele, so that a ratio of them would give me the allele frequency. I think I am wrong here, because you said the allele frequency for this allele is 100%. I would appreciate it if you could clarify this.

Thanks a lot
Hamdi

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.5 years ago by tkitapci ▴ 60

Here is an explanation:

G1_0/0:5,28,164 means that (in the case of a diploid individual) the most probable genotype is 0/0.

The three values that follow (5, 28 and 164) are the phred-scaled likelihoods of the three genotypes (0/0, 0/1 and 1/1). We used the same formula as GATK to compute these values (see https://www.broadinstitute.org/gatk/guide/tagged?tag=genotype).
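To make "phred-scaled" concrete, here is a minimal sketch that converts the three values back to relative genotype probabilities. It only assumes the standard phred relation (likelihood proportional to 10^(-value/10)); the function name is made up for illustration and is not part of DiscoSnp++ or GATK.

```python
def phred_to_relative_probs(phred_values):
    """Turn phred-scaled likelihoods into relative probabilities summing to 1.

    Assumption: each value v corresponds to a likelihood proportional to
    10 ** (-v / 10), the usual phred convention. Lower value = more likely.
    """
    raw = [10 ** (-v / 10) for v in phred_values]
    total = sum(raw)
    return [r / total for r in raw]


# Values taken from the G1 field discussed above: G1_0/0:5,28,164
probs = phred_to_relative_probs([5, 28, 164])
for genotype, prob in zip(["0/0", "0/1", "1/1"], probs):
    print(f"{genotype}: {prob:.6f}")
```

The smallest phred value gets the largest probability, which is why 0/0 is reported as the most probable genotype for this read set.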

In your case, all these values are useless. They are well suited to genotyping diploid individuals. What may interest you is the coverage of each allele. In the first read set, one allele is covered by 8 reads (C1_8, higher path) while the other is covered by zero reads (C1_0, lower path). This is why I said that allele A has a 100% frequency in this dataset.
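As a rough sketch of that calculation, the snippet below estimates the frequency of the higher-path allele in each read set as its coverage divided by the total coverage of both alleles. The coverage values are copied from the C1..C8 fields of the two headers shown earlier in the thread; the function name and the choice to return None for uncovered sites are my own assumptions.

```python
def allele_frequencies(higher_cov, lower_cov):
    """Per-read-set frequency of the higher-path allele.

    Frequency = higher / (higher + lower). Read sets with no coverage on
    either allele get None, since no estimate is possible there.
    """
    freqs = []
    for higher, lower in zip(higher_cov, lower_cov):
        total = higher + lower
        freqs.append(higher / total if total > 0 else None)
    return freqs


# C1..C8 of SNP_higher_path_94598 (allele A) and SNP_lower_path_94598 (allele G)
higher = [8, 11, 11, 22, 14, 10, 14, 0]
lower = [0, 0, 0, 0, 0, 0, 0, 3]

ref_freqs = allele_frequencies(higher, lower)
for sample, freq in enumerate(ref_freqs, start=1):
    print(f"read set {sample}: freq(A) = {freq}")
```

With low coverage (here 3 to 22 reads per read set) these are noisy estimates, so it is probably worth filtering on a minimum total coverage before comparing pools.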

I hope this helps.

Pierre

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.5 years ago by pierre.peterlongo ▴ 900

Hi Pierre,

I just want to reiterate to make sure I am getting this right. (I am sorry if this information is already in the documentation; I tried to find it but could not. I would appreciate it if you could point me to where these details are.)

For this one in the .fa file:

>SNP_higher_path_94598|P_1:30_A/G|high|nb_pol_1|C1_8|C2_11|C3_11|C4_22|C5_14|C6_10|C7_14|C8_0|Q1_71|Q2_70|Q3_70|Q4_68|Q5_68|Q6_70|Q7_70|Q8_0|G1_0/0:5,28,164|G2_0/0:5,37,224|G3_0/0:5,37,224|G4_0/0:5,70,444|G5_0/0:5,46,284|G6_0/0:5,34,204|G7_0/0:5,46,284|G8_1/1:64,13,4|rank_1

Are C1_8|C2_11|C3_11, etc. showing the number of reads that cover the REF allele in samples 1, 2, 3, respectively?

Similarly, are Q1_71|Q2_70|Q3_70 showing the number of reads that cover the ALT allele in samples 1, 2, 3, ... respectively?

So can I use the ratio C1_X/Q1_Y as the allele frequency for sample 1 at this locus?

Thanks a lot

Best Regards
T. Hamdi Kitapci

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.5 years ago by tkitapci ▴ 60

C1_8|C2_11|C3_11 are the coverages of the REF allele in samples 1, 2, 3.

Q1_71|Q2_70|Q3_70 are the average phred qualities of the REF allele in samples 1, 2, 3.

To find the same information for the ALT allele, see C1, C2, C3 in the "lower_path" comment:

"C1_0|C2_0|C3_0|C4_0|C5_0|C6_0|C7_0|C8_3" are the coverages of the ALT allele in all samples. The ALT allele is present only in the last sample (with quality Q8_71).

Pierre
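For anyone who wants to pull these fields out of the _coherent.fa headers automatically, here is a small sketch. It assumes the header layout seen in this thread (fields separated by "|", coverage as C<sample>_<count>, average quality as Q<sample>_<value>); parse_header is a hypothetical helper written for this post, not a DiscoSnp++ function.

```python
import re

# Matches fields such as C1_8 or Q8_71: kind, sample index, integer value.
FIELD_RE = re.compile(r"([CQ])(\d+)_(\d+)")


def parse_header(header):
    """Return (path name, {sample: coverage}, {sample: average quality})."""
    fields = header.lstrip(">").split("|")
    coverage, quality = {}, {}
    for field in fields:
        match = FIELD_RE.fullmatch(field)
        if match is None:
            continue  # skip name, position, genotype (G...) and rank fields
        kind, sample, value = match.group(1), int(match.group(2)), int(match.group(3))
        if kind == "C":
            coverage[sample] = value
        else:
            quality[sample] = value
    return fields[0], coverage, quality


# The two headers of the first SNP, as posted earlier in the thread.
higher = ">SNP_higher_path_94640|P_1:30_A/G|high|nb_pol_1|C1_8|C2_11|C3_11|C4_22|C5_14|C6_10|C7_14|C8_0|Q1_71|Q2_70|Q3_70|Q4_68|Q5_68|Q6_70|Q7_70|Q8_0|rank_1"
lower = ">SNP_lower_path_94640|P_1:30_A/G|high|nb_pol_1|C1_0|C2_0|C3_0|C4_0|C5_0|C6_0|C7_0|C8_3|Q1_0|Q2_0|Q3_0|Q4_0|Q5_0|Q6_0|Q7_0|Q8_71|rank_1"

name_h, cov_h, qual_h = parse_header(higher)
name_l, cov_l, qual_l = parse_header(lower)

for sample in sorted(cov_h):
    print(f"sample {sample}: REF coverage {cov_h[sample]}, ALT coverage {cov_l[sample]}")
```

Headers produced with genotyping enabled also contain G fields; the pattern above ignores them, so the same function should work on both kinds of output.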

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.5 years ago by pierre.peterlongo ▴ 900