I'm working with a genetic dataset (roughly 23,000 samples and 300,000 SNPs as features). My files are in PLINK binary format (.bed, .bim, .fam); their sizes are listed below:

- .bed file: 1.6G
- .bim file: 9.3M
- .fam file: 737K

My aim is to convert them into pandas DataFrames and then start my predictive analysis in Python (it's a machine learning project).
I was advised to combine all three binary files into a single VCF (variant call format) file. Using the PLINK software, the resulting VCF file is 26G. There are Python packages and code for converting VCF files into pandas DataFrames, but my remote system's memory is limited (15 GiB). Due to the nature of the dataset, I can only work on university computers.
My question is: given all these limitations, how do I convert my dataset into a DataFrame that can be used for machine learning? Let me know if you need more details.
Use the `chunksize` option when you load your data into pandas.
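For example, here is a minimal sketch of chunked loading, assuming the VCF is the tab-delimited file PLINK produced; the path `genotypes.vcf`, the chunk size, and the `process` function are placeholders to adapt to your setup:

```python
import pandas as pd

VCF_PATH = "genotypes.vcf"  # hypothetical path to the 26G VCF from PLINK

# Count the leading '##' meta-information lines so we can skip them;
# the '#CHROM ...' line that follows is the actual column header.
with open(VCF_PATH) as f:
    n_meta = 0
    for line in f:
        if line.startswith("##"):
            n_meta += 1
        else:
            break

# Stream the tab-delimited body in fixed-size chunks instead of
# loading all 300,000 variant rows (x ~23,000 sample columns) at once.
reader = pd.read_csv(
    VCF_PATH,
    sep="\t",
    skiprows=n_meta,
    chunksize=1_000,  # variants per chunk; tune to fit the 15 GiB budget
)

for chunk in reader:
    # Each `chunk` is an ordinary DataFrame of ~1,000 variant rows.
    # Transform it here (e.g., recode genotype strings to small ints)
    # and append the result to an on-disk store rather than keeping
    # everything in memory.
    process(chunk)  # placeholder for your per-chunk logic
```

If you process each chunk and write it straight back to a compact on-disk format (downcasting genotypes to small integer dtypes as you go), peak memory stays bounded by the chunk size rather than the full 26G file.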