Hello all,
I have spent the whole day trying to figure out what the real difference is between SNPs present in only one individual (singletons) and SNPs with a minor allele frequency (MAF) of ≤ 1%.
I have one SNP data set covering 2 plant species (14 wild individuals and 157 cultivated individuals), with 11,046,501 SNPs in total.
I used the --singletons option from vcftools to get a file listing the SNPs that occur in only one individual.
This gave 3,671,719 singleton SNPs. I then used the vcftools option --max-maf 0.01 to keep sites with MAF less than or equal to 0.01; counting the SNPs in the resulting VCF file gave 4,901,160 SNPs (44% of the total SNPs). That's a lot :(
The big issue comes when I make two subsets of the VCF file. The first has the wild individuals only (14 samples with 9,839,152 SNPs). I ran the same --singletons and --max-maf 0.01 calculations on it, and what surprised me was the large number of singletons (3,513,130) and the small number of SNPs with MAF ≤ 1% (only 160,140 SNPs).
The other VCF subset has the cultivated samples only (157 individuals with 2,617,322 SNPs). It has 836,609 singleton SNPs and 989,630 SNPs with MAF ≤ 1% (the two numbers are more or less similar, unlike in the wild subset). I am so confused!
I tried outputting SNPs with MAF ≤ 1% with plink as well, and it gave exactly the same results.
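To make the comparison concrete, here is a minimal sketch of the arithmetic behind a 1% cutoff for each of my three sample sizes. It assumes diploid samples with no missing genotypes (my assumption, the real data may have missing calls), and the function names are made up for illustration; this is not how vcftools or plink compute anything internally.

```python
from fractions import Fraction

def min_nonzero_maf(n_diploid):
    """Frequency of an allele seen exactly once among 2n chromosomes."""
    return Fraction(1, 2 * n_diploid)

def max_minor_count(n_diploid, threshold):
    """Largest minor allele count k such that k / (2n) <= threshold."""
    chromosomes = 2 * n_diploid
    # floor(threshold * chromosomes), done with exact integer arithmetic
    return (threshold.numerator * chromosomes) // threshold.denominator

threshold = Fraction(1, 100)  # MAF <= 0.01
# Sample sizes taken from the post; complete genotypes are assumed.
cohorts = {"all": 171, "wild": 14, "cultivated": 157}

for name, n in cohorts.items():
    singleton_freq = min_nonzero_maf(n)
    k = max_minor_count(n, threshold)
    print(f"{name}: singleton frequency = {float(singleton_freq):.4f}, "
          f"largest minor allele count with MAF <= 0.01 = {k}")
```

Under these assumptions, a single copy among the 14 wild samples has a frequency of 1/28 ≈ 0.036, which is already above 0.01, whereas among the 157 cultivated samples (1/314 ≈ 0.003) and all 171 samples (1/342 ≈ 0.003) up to 3 copies of the minor allele still fall under the cutoff. This may be part of why the two counts diverge so much in the wild subset.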
How should I interpret these results? I was expecting similar numbers of singletons and SNPs with MAF ≤ 1%, but it seems it doesn't work that way. So back to the main question:
What is the difference between singleton SNPs and SNPs with MAF ≤ 0.01?
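For anyone who wants to reason about the two definitions side by side, below is a rough sketch of how I understand them, written as a toy classifier for a single biallelic site. The function `classify_site` and its input format are hypothetical and only for illustration. As far as I can tell from the vcftools manual, --singletons reports both true singletons and private doubletons (the minor allele occurs in only one individual, which is homozygous for it), so the sketch tracks both. Diploid genotypes with no missing calls are assumed.

```python
def classify_site(genotypes, max_maf=0.01):
    """Classify one biallelic site.

    genotypes: one (a1, a2) tuple per individual, with 0 = reference
    allele and 1 = alternate allele. Missing calls are not handled.
    """
    alleles = [a for gt in genotypes for a in gt]
    total = len(alleles)
    alt_count = sum(alleles)
    # The minor allele is whichever of the two is rarer in this sample set.
    minor_allele = 1 if alt_count <= total - alt_count else 0
    minor_count = min(alt_count, total - alt_count)
    carriers = [gt for gt in genotypes if minor_allele in gt]
    maf = minor_count / total
    return {
        "maf": maf,
        # Exactly one copy of the minor allele in the whole sample set.
        "singleton": minor_count == 1,
        # Two copies, both carried by one homozygous individual.
        "private_doubleton": minor_count == 2 and len(carriers) == 1,
        # A frequency cutoff, independent of how many individuals carry it.
        "passes_max_maf": maf <= max_maf,
    }

# One heterozygous carrier among 14 individuals (like the wild subset).
wild_site = [(0, 1)] + [(0, 0)] * 13
# Three heterozygous carriers among 171 individuals (like the full set).
full_site = [(0, 1)] * 3 + [(0, 0)] * 168

print(classify_site(wild_site))
print(classify_site(full_site))
```

In this toy example the first site is a singleton but does not pass the 0.01 cutoff, while the second passes the cutoff without being a singleton, so the two filters do not have to select the same set of SNPs.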
P.S. Just to clarify: getting rid of the wild individuals (n=14) together with the non-polymorphic SNPs results in a huge drop in the number of SNPs. This is because the wild samples are highly diverse compared to the cultivated ones (there are a lot of SNPs and differences between the wild genome sequences and the reference genome).
Thanks for your help!