6.7 years ago
miaowzai
I have PLINK-format data (bim/bed/fam) from a data provider. As far as I know, the genotyping was done on an Illumina exome SNP BeadChip. I didn't call the genotypes myself.
Some variant IDs start with "exm". They do have rs numbers when I look them up. What are these SNPs? Why use "exm"?
Also, I found some duplicated monomorphic positions, like:
1 exm1771279 2435759 0 G
1 exm2277033 2435759 0 C
I checked the genotypes for both records: the file says everyone is GG for exm1771279 and everyone is CC for exm2277033. However, they are at the same site. How is this possible?
Thanks!
This happens because the alleles are reported on different strands. A G on one strand pairs with a C on the complementary strand, so GG and CC here are the same genotype read from the Watson and Crick strands, respectively.
Now that you know about this situation, the following one also deserves attention:
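As a small illustration of why GG and CC can describe the same site, here is a minimal Python sketch that re-reads a genotype from the opposite strand. The function name `flip_genotype` is made up for this example, and it assumes the genotype is a string of plain A/C/G/T calls.

```python
# Each base pairs with its complement on the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip_genotype(genotype: str) -> str:
    """Return the genotype as it would be read from the opposite strand.

    Assumes only A/C/G/T characters (hypothetical helper, not a PLINK function).
    """
    return "".join(COMPLEMENT[base] for base in genotype)


print(flip_genotype("GG"))  # CC
```

Flipping twice returns the original genotype, which is why the two records can be reconciled without losing information.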
When merging two datasets, it is clearly very important that the two sets of SNPs are concordant in terms of positive or negative strand. Whereas some mismatches will be easy to spot as more than two alleles will be observed in the merged dataset, other instances will not be so easy to spot, i.e. for A/T and C/G SNPs.
Be sure to flip the DNA strand for SNPs wherever your analysis needs it.
PLINK uses only the variant ID as the key, so exm1771279 and exm2277033 are both allowed in the analysis. Usually, though, I convert exm IDs to rs IDs. Both of these records then map to the same rs SNP, and one of them is removed from further analysis. It is very important to do that.
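The merging check described above can be sketched roughly as follows. This is only an illustration, assuming each SNP has two called alleles from A/C/G/T in both datasets. The function names and the labels "same", "flip", "ambiguous" and "mismatch" are made up for this example.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(a1: str, a2: str) -> bool:
    """A/T and C/G SNPs look identical on both strands."""
    return COMPLEMENT[a1] == a2


def compare_strand(alleles_a, alleles_b) -> str:
    """Classify how the allele pair in dataset B relates to dataset A."""
    if is_strand_ambiguous(*alleles_a):
        return "ambiguous"  # strand cannot be inferred from alleles alone
    set_a, set_b = set(alleles_a), set(alleles_b)
    if set_a == set_b:
        return "same"
    if {COMPLEMENT[b] for b in set_b} == set_a:
        return "flip"  # dataset B is on the opposite strand
    return "mismatch"  # would give more than two alleles after merging


print(compare_strand(("A", "G"), ("T", "C")))  # flip
print(compare_strand(("A", "T"), ("A", "T")))  # ambiguous
```

IDs classified as "flip" could then be written to a file and passed to PLINK's `--flip` option. The "ambiguous" ones need extra information, such as allele frequencies or the chip's strand annotation, before deciding.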
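To spot records like these before deciding which one to drop, one option is to group variants by chromosome and position. The sketch below is an assumption about how that could look, not the exact procedure used here. It takes (chromosome, ID, position) tuples already parsed from the .bim file, and the third ID in the sample data is a hypothetical placeholder.

```python
from collections import defaultdict


def find_shared_positions(records):
    """Return {(chrom, pos): [ids]} for sites listed under more than one ID."""
    by_site = defaultdict(list)
    for chrom, variant_id, pos in records:
        by_site[(chrom, pos)].append(variant_id)
    return {site: ids for site, ids in by_site.items() if len(ids) > 1}


records = [
    ("1", "exm1771279", 2435759),
    ("1", "exm2277033", 2435759),
    ("1", "exm_placeholder", 1000),  # hypothetical ID, for illustration only
]

print(find_shared_positions(records))
# {('1', 2435759): ['exm1771279', 'exm2277033']}
```

The IDs chosen for removal could then be listed in a file and given to PLINK's `--exclude` option.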
This makes a lot of sense! Thank you so much.