I'm working with a genetic dataset (roughly 23,000 samples and 300,000 SNPs as features). My files are in PLINK binary format (.bed, .bim, .fam); their sizes are listed below:

- .bed file: 1.6G
- .bim file: 9.3M
- .fam file: 737K

My aim is to convert them into pandas DataFrames and then start my predictive analysis in Python (it's a machine learning project).
I was advised to combine all three binary files into a single VCF (variant call format) file. Using the PLINK software, the resulting VCF file is 26G. There are Python packages and code for converting VCF files into pandas DataFrames, but my remote system's memory is limited (15 GiB). Due to the nature of the dataset, I can only work on university computers.
My question is: given all these limitations, how do I convert my dataset into a DataFrame that can be used for machine learning? Let me know if you need more details.
Use the `chunksize` option when you load your data into pandas.
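For example, here is a minimal sketch of chunked loading, assuming the VCF is the tab-delimited file PLINK produced; the path `genotypes.vcf`, the chunk size, and the `process` function are placeholders to adapt to your setup:

```python
import pandas as pd

VCF_PATH = "genotypes.vcf"  # hypothetical path to the 26G VCF from PLINK

# Count the leading '##' meta-information lines so we can skip them;
# the '#CHROM ...' line that follows is the actual column header.
with open(VCF_PATH) as f:
    n_meta = 0
    for line in f:
        if line.startswith("##"):
            n_meta += 1
        else:
            break

# Stream the tab-delimited body in fixed-size chunks instead of
# loading all 300,000 variant rows (x ~23,000 sample columns) at once.
reader = pd.read_csv(
    VCF_PATH,
    sep="\t",
    skiprows=n_meta,
    chunksize=1_000,  # variants per chunk; tune to fit the 15 GiB budget
)

for chunk in reader:
    # Each `chunk` is an ordinary DataFrame of ~1,000 variant rows.
    # Transform it here (e.g., recode genotype strings to small ints)
    # and append the result to an on-disk store rather than keeping
    # everything in memory.
    process(chunk)  # placeholder for your per-chunk logic
```

If you process each chunk and write it straight back to a compact on-disk format (downcasting genotypes to small integer dtypes as you go), peak memory stays bounded by the chunk size rather than the full 26G file.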