14 months ago
abedkurdi10
Hello everyone,
I have four PLINK samples. I harmonized the samples using Genotype Harmonizer in the presence of a reference panel. The genotyping rate for each PLINK sample is around 0.98-0.99. When I merge the four PLINK sets, the genotyping rate drops to 0.76 on average.
Does anyone know what could affect the genotyping rate?
Thank you!
Can you please explain your approaches before merging the datasets? Are your datasets in the same genomic build? Were there any allele flips? How do you deal with A/T and G/C SNPs? Did you merge common SNPs between the datasets? Genotyping rate should not drop that low after merging the data.
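On the A/T and G/C question: these SNPs are strand-ambiguous because the two alleles are complements of each other, so a strand flip cannot be detected from the alleles alone. A minimal sketch for listing them from PLINK .bim lines is below. It assumes the standard six-column .bim layout (chromosome, variant ID, cM position, bp position, allele 1, allele 2); the function names and the example records are made up for illustration.

```python
# Allele pairs that are their own reverse complement (strand-ambiguous).
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}


def is_strand_ambiguous(a1, a2):
    """True for A/T and C/G SNPs, in either allele order and any case."""
    return frozenset((a1.upper(), a2.upper())) in AMBIGUOUS_PAIRS


def ambiguous_ids(bim_lines):
    """Return variant IDs of strand-ambiguous SNPs from .bim-formatted lines."""
    ids = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        snp_id, a1, a2 = fields[1], fields[4], fields[5]
        if is_strand_ambiguous(a1, a2):
            ids.append(snp_id)
    return ids


# Hypothetical records, not taken from any real dataset.
example = [
    "1 rs1 0 1000 A T",
    "1 rs2 0 2000 A G",
    "1 rs3 0 3000 C G",
]
print(ambiguous_ids(example))  # ['rs1', 'rs3']
```

The resulting ID list could then be written to a file and passed to PLINK's `--exclude` if you decide to drop these SNPs before merging.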
Yes, my datasets are in the same genomic build. Of course, there were some allele flips.
I am using Genotype Harmonizer (https://github.com/molgenis/systemsgenetics/wiki/Genotype-Harmonizer), which takes care of everything: it handles A/T and G/C SNPs and corrects the flips. However, the tool does not seem to be flipping all the SNPs based on the reference panel I provided. When merging with bcftools, I found some variants that were not flipped in one sample but were flipped in the others. It seems I am facing issues with the merging process. If I merge only the common SNPs, I would lose a lot of SNPs, am I right? Unless I am missing something.
I have not tried Genotype Harmonizer yet. If the datasets were genotyped on the same array, you will not lose many SNPs when merging. If they were genotyped on different arrays, you will lose some. However, you can recover the lost SNPs when imputing your data. I would first check how many common SNPs are in your data as follows:
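One way to run this check, as a minimal sketch: read the variant IDs (second column) from each .bim file and intersect the sets. This assumes the IDs have already been harmonized across datasets. If they have not, matching on chromosome and position may be safer. The helper names and the demo IDs are hypothetical.

```python
def read_variant_ids(bim_lines):
    """Collect variant IDs (column 2) from an iterable of .bim lines.

    Accepts an open file handle or a list of strings.
    """
    return {line.split()[1] for line in bim_lines if line.strip()}


def common_variants(id_sets):
    """Return the IDs present in every dataset."""
    id_sets = list(id_sets)
    if not id_sets:
        return set()
    return set.intersection(*id_sets)


# Hypothetical in-memory example standing in for three .bim files.
demo_bims = [
    ["1 rs1 0 100 A G", "1 rs2 0 200 C T", "1 rs3 0 300 A C"],
    ["1 rs2 0 200 C T", "1 rs3 0 300 A C", "1 rs4 0 400 G T"],
    ["1 rs3 0 300 A C", "1 rs2 0 200 C T", "1 rs9 0 900 A G"],
]
shared = common_variants(read_variant_ids(lines) for lines in demo_bims)
print(len(shared), sorted(shared))  # 2 ['rs2', 'rs3']
```

With real data you would open each .bim file and pass the handle to `read_variant_ids`. The shared IDs can be written one per line and given to PLINK's `--extract` to restrict each dataset before merging.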
The datasets were not genotyped on the same array. I got ~207000 variants in common, while each dataset has:
Dataset 1 492592
Dataset 2 324282
Dataset 3 291611
Dataset 4 398343
Dataset 5 518396
Dataset 6 387083
Dataset 7 551603
Dataset 8 532975
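For intuition on why the merged genotyping rate can fall well below the per-dataset rates: if the merge keeps the union of variants, every sample from a dataset that lacks a variant gets a missing genotype there. The sketch below estimates the merged rate under that assumption. The function name is hypothetical and the toy numbers are not the real sample sizes or variant sets.

```python
def expected_merged_call_rate(datasets):
    """Estimate the genotyping rate after a union merge.

    datasets: list of (variant_id_set, n_samples, call_rate) tuples, where
    call_rate is the genotyping rate within that dataset before merging.
    Assumes variants absent from a dataset become missing for its samples.
    """
    union = set().union(*(ids for ids, _, _ in datasets))
    total_cells = len(union) * sum(n for _, n, _ in datasets)
    if total_cells == 0:
        raise ValueError("no variants or no samples to merge")
    called = sum(len(ids) * n * rate for ids, n, rate in datasets)
    return called / total_cells


# Toy example: two arrays sharing 2 of their 4 variants, 10 samples each.
array_a = ({"v1", "v2", "v3", "v4"}, 10, 1.0)
array_b = ({"v3", "v4", "v5", "v6"}, 10, 1.0)
print(round(expected_merged_call_rate([array_a, array_b]), 3))  # 0.667
```

Even with perfect calling inside each dataset, the toy merge ends up at about 0.67 because a third of the genotype cells are filled with missing values.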
Now you are saying 8 datasets (earlier in your post there were 4). The lowest number of SNPs (n=291611) is in your dataset 3. Given this, the number of common SNPs (~207000) among these datasets is not bad. I would suggest trying 3 different approaches for your data:
You can check each of these approaches and then choose the one that works best for your data or goal.
Thank you very much for your suggestions! Yeah, it was my mistake to say four! Thanks again!!