Assume you have genotype data in Plink format. (G.bed, G.bim, G.fam)
Download some reference panel data in Plink format (R.bed, R.bim, R.fam) in a specific build (hg19 or hg38).
Generate one more copy of the reference panel with a build different from the one in which it was initially available. (PyLiftOver)
from pyliftover import LiftOver
import pandas as pd
df = pd.read_csv("ref38.bim",names=['CHR','SNP','Distance','BP','A1','A2'],sep="\s+")
print(df.head())
#df =df.head()
df['chromosome'] = "chr"+df["CHR"].astype(str)
df['position_GRCh38'] = df["BP"].values
def convert_coordinates_batch(rows):
converted_coordinates = []
lo = LiftOver('hg38','hg19')
for index, row in rows.iterrows():
converted_coordinate = lo.convert_coordinate(row['chromosome'], row['position_GRCh38'])
converted_position = converted_coordinate[0][1] if converted_coordinate else None
converted_coordinates.append(converted_position)
return converted_coordinates
batch_size = 50
for i in range(0, len(df), batch_size):
df.loc[i:i+batch_size-1, 'position_GRCh37'] = convert_coordinates_batch(df.iloc[i:i+batch_size])
#df['position_GRCh38'] = df.apply(lambda row: convert_coordinate_GRCh37_to_GRCh38(row['chromosome'], row['position_GRCh37']), axis=1)
df["BP"] = df['position_GRCh37'].values
print(df.head())
#df["BP"] = df["BP"].astype(int)
del df['position_GRCh38']
del df['position_GRCh37']
del df['chromosome']
print(df.head())
df.to_csv("ref19.bim",sep="\t",index=False,header=False)
Now you have
R.hg19.bim (Reference panel in build 19)
R.hg38.bim (Reference panel in build 38)
Count the number of common variants (C19) between G.bim and R.hg19.bim.
Count the number of common variants (C38) between G.bim and R.hg38.bim.
bimfile["match"] = bimfile[0].astype(str)+"_"+bimfile[3].astype(str)+"_"+bimfile[4].astype(str)+"_"+bimfile[5].astype(str)
df["match"] = df["CHR"].astype(str)+"_"+df["BP"].astype(str)+"_"+df["A1"].astype(str)+"_"+df["A2"].astype(str)
df.drop_duplicates(subset='match', inplace=True)
bimfile.drop_duplicates(subset='match', inplace=True)
hg19variants = len(df[df['match'].isin(bimfile['match'].values)])
print(hg19variants)
df = pd.read_csv(GWAS, compression="gzip",sep="\s+",on_bad_lines='skip')
bimfile = pd.read_csv("genotypes.bim38",sep="\s+",header=None)
bimfile["match"] = bimfile[0].astype(str)+"_"+bimfile[3].astype(str)+"_"+bimfile[4].astype(str)+"_"+bimfile[5].astype(str)
df["match"] = df["CHR"].astype(str)+"_"+df["BP"].astype(str)+"_"+df["A1"].astype(str)+"_"+df["A2"].astype(str)
df.drop_duplicates(subset='match', inplace=True)
bimfile.drop_duplicates(subset='match', inplace=True)
hg38variants = len(df[df['match'].isin(bimfile['match'].values)])
print(hg38variants)
if hg19variants > hg38variants:
return "19"
if hg38variants > hg19variants:
return "38"
Match:
CHR:BP/Position/:A1:A2
If C19>C38:
G is hg19
else:
G is gh28