Question

The difference between singleton SNPs and SNPs with MAF less than or equal to 0.01


4.9 years ago

Hann ▴ 110

Hello all,

I struggled the whole day trying to figure out what the difference really is between SNPs present in only one individual (singletons) and SNPs with a minor allele frequency (MAF) of ≤ 1%.

I have one SNP data set containing 2 plant species (14 wild individuals and 157 cultivated individuals), with a total of 11,046,501 SNPs.

I used the --singletons option of vcftools to get a file listing the SNPs that occur in only one individual.

This gave 3,671,719 singleton SNPs. I then used the vcftools option --max-maf 0.01 to keep sites with MAF less than or equal to 0.01; counting the SNPs in the resulting VCF file gave 4,901,160 SNPs (44% of the total SNPs) - That's a lot :(

The big issue came when I made two subsets of the VCF file. The first has the wild individuals only (14 samples with 9,839,152 SNPs). I ran the same --singletons and --max-maf 0.01 calculations on it, and what surprised me was the large number of singletons (3,513,130) and the small number of SNPs with MAF ≤ 1% (only 160,140 SNPs).

The other VCF subset has the cultivated samples only (157 individuals with 2,617,322 SNPs). It has 836,609 singleton SNPs and 989,630 SNPs with MAF ≤ 1% (both numbers are more or less similar, unlike the wild subset) - I am so confused!

I tried outputting SNPs with MAF ≤ 1% with plink as well, and it gave exactly the same results.

How should I interpret these results? I was expecting to get similar numbers of singletons and SNPs with MAF ≤ 1%, but it seems it doesn't work that way. So back to the main question:

What is the difference between singleton SNPs and SNPs with MAF ≤ 0.01?


P.S. Just to clarify: removing the wild individuals (n=14) together with the SNPs that become non-polymorphic results in a huge drop in the number of SNPs. This is because the wild samples are highly diverse compared to the cultivated ones (there are a lot of SNPs and differences between the wild genome sequences and the reference genome).

Thanks for your help!

SNP sequencing population genetics • 2.9k views


score 1 · Answer 1 · 2020-05-21

It was easy in the end to explain this observation.

An allele frequency calculated from at most 14 individuals is not comparable with one calculated from a bigger population. In the wild population (n=14), very few SNPs can pass the same MAF cutoff (0.01): to get a MAF below 0.01, all individuals must carry the same allele at the site, so the frequency is actually 0. If only one individual has a different allele (that is, a singleton), the frequency of that allele across 14 individuals is 1/14 ≈ 0.07, which is higher than 0.01.
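The arithmetic behind this can be checked with a minimal sketch. It assumes diploid samples with no missing genotypes, so a sample of n individuals carries 2n allele copies; the function names are made up for illustration. Under that assumption a true singleton (one heterozygous individual) in 14 samples has frequency 1/28 ≈ 0.036, and the 0.07 quoted above corresponds to one individual being homozygous for the minor allele (2/28), a case that vcftools --singletons also reports, as a private doubleton. Either value is well above 0.01.

```python
import math

def min_nonzero_maf(n_individuals, ploidy=2):
    """Smallest non-zero allele frequency: one copy among all sampled allele copies."""
    return 1 / (n_individuals * ploidy)

def max_minor_count(n_individuals, max_maf=0.01, ploidy=2):
    """Largest number of minor-allele copies that still satisfies MAF <= max_maf."""
    return math.floor(max_maf * n_individuals * ploidy)

# Sample sizes taken from the question: wild, cultivated, and both combined.
for label, n in [("wild", 14), ("cultivated", 157), ("combined", 171)]:
    print(f"{label:>10} (n={n}): smallest non-zero MAF = {min_nonzero_maf(n):.4f}, "
          f"up to {max_minor_count(n)} minor-allele copies pass MAF <= 0.01")
```

For the wild subset no non-zero minor allele count passes the cutoff, so --max-maf 0.01 can only keep sites whose MAF within that subset is exactly 0. For the 157 cultivated samples up to 3 copies pass, so every singleton also has MAF ≤ 0.01, which would explain why the two counts are close there.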