I got a PLINK file from a publication. I want to have the major allele correspond to the forward strand of hg19 because I want to analyze it with another file that is organized in this manner. I am curious if this is possible and if my thinking below is the right way.
Here is the head of the bim file, it supposed to be hg18. Using the UCSC browser suggests that these are not all reported on the same strand:
6 rs6938233 0 26022056 A G
6 rs2071299 0 26022780 T C
6 rs2071298 0 26022880 A G
6 rs7450342 0 26027459 C T
From looking up these few rsids manually on dbsnp, I think my anticipated result should be something like this when all is said and done: Major allele on forward strand of hg19:
6 25914077 rs6938233 0 T C
6 25914801 rs2071299 0 A G
6 25914901 rs2071298 0 A G
6 25919480 rs7450342 0 C T
I started by using https://github.com/sritchie73/liftOverPlink to lift over to change coordinates to hg19.
6 rs6938233 0 25914077 A G
6 rs2071299 0 25914801 T C
6 rs2071298 0 25914901 A G
6 rs7450342 0 25919480 T C
Ok, still need everything with the major allele on the forward strand, so I was hoping that I could use https://github.com/biocore-ntnu/snpflip, which gives me this:
chromosome 0_idx_position snp_name genetic_distance allele_1 allele_2 reference reference_rev strand
6 25914076 rs6938233 0 A G C G reverse
6 25914800 rs2071299 0 T C G C reverse
6 25914900 rs2071298 0 A G G C forward
6 25919479 rs7450342 0 T C T A forward
This seems to detect the strandedness well, but not sure how to proceed other than updating the bim file by taking the compliment of everything that is detected as reverse here with plink --flip
and then using plink --ref-from-fa
, which would hopefully give me the desired output above. I am a bit worried about ambiguous SNPs though (G/C A/T). This is getting pretty messy, is my approach even close or is there a better way to achieve this. Any help is greatly appreciated.