Question

Random Subset of Individuals from .BED File (.ped not available)

6.6 years ago

angus.gane • 0

I am trying to split a GWAS cohort into two random samples. I have the .bed, .fam, and .bim files. I know plink has commands for filtering out subsets of individuals (--filter), but this seems to require the .map file. It is possible to filter binary files in plink, but this doesn't seem to be allowed on the first two 'columns', which contain the individual IDs I need to filter on.

My very computationally intensive solution has been to recode the .bed file to .ped and .map files for each chromosome (800GB+), randomly select a cohort of individuals with shuf, and then grep these out of the .ped file before recoding as .bed files.

I was wondering if anyone had a better way of doing this?

Thanks, Angus

plink GWAS • 3.5k views

2.9 years ago by angus.gane • 0

Are you doing this for some 'machine learning' or bootstrapping method, i.e., breaking the dataset up into training and testing sets?

Just do the following:

1. Obtain a listing of sample IDs.
2. 'Randomly' select sample IDs from the listing (using any programming language).
3. Use --keep or --remove on your BED files to keep or remove samples accordingly.
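
The three steps above can be sketched in shell roughly as follows. This is a minimal sketch under stated assumptions, not a tested pipeline: the file prefix cohort, the output names train.ids and test.ids, and the helper split_ids are all hypothetical, and it assumes shuf is available and that a PLINK 1.9 binary is on the PATH as plink. The plink calls are skipped if either the input or the binary is missing.

```shell
# Sketch: split the samples listed in a .fam file into two random, disjoint ID lists.
# Assumptions (not from the thread): prefix "cohort", the output names, shuf installed.

split_ids() {
    # Usage: split_ids <fam file> <size of first set> <out file 1> <out file 2>
    local fam="$1" n_first="$2" out1="$3" out2="$4"
    local shuffled
    shuffled="$(mktemp)"
    # Columns 1 and 2 of a .fam file are the family ID (FID) and individual ID (IID),
    # which is the two-column format that --keep and --remove read.
    awk '{print $1, $2}' "$fam" | shuf > "$shuffled"
    # First n_first shuffled IDs go to set 1, everything after that to set 2.
    head -n "$n_first" "$shuffled" > "$out1"
    tail -n +"$((n_first + 1))" "$shuffled" > "$out2"
    rm -f "$shuffled"
}

# Example usage: runs only if the hypothetical cohort.fam exists and plink is installed.
if [ -f cohort.fam ] && command -v plink >/dev/null 2>&1; then
    n_total=$(wc -l < cohort.fam)
    split_ids cohort.fam "$((n_total / 2))" train.ids test.ids
    plink --bfile cohort --keep train.ids --make-bed --out cohort_train
    plink --bfile cohort --keep test.ids --make-bed --out cohort_test
fi
```

Because only the small .fam file is shuffled, nothing needs to be recoded to .ped; plink subsets the binary files directly.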

6.6 years ago by Kevin Blighe 88k

2.9 years ago

Alex Reynolds 36k

800GB of data per chromosome is a lot. If shuf does not scale to the size of data you are working with and you get out-of-memory errors, then the sample application might be of use. It samples like shuf, but uses a simple trick to reduce memory usage to 8 bytes per line.
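
For comparison, a different memory-bounded technique is reservoir sampling, which keeps only the k sampled lines in memory. The sketch below is not how the sample application works (that tool is described above as storing a small fixed amount per line); it is a swapped-in alternative, and the helper name reservoir_sample and its optional seed argument are hypothetical.

```shell
# Sketch: reservoir sampling (Algorithm R) of k lines from a file using awk.
# Memory use grows with k, not with the number of lines in the file.

reservoir_sample() {
    # Usage: reservoir_sample <k> <file> [seed]
    local k="$1" file="$2" seed="${3:-}"
    awk -v k="$k" -v seed="$seed" '
        BEGIN { if (seed != "") srand(seed); else srand() }
        # Fill the reservoir with the first k lines.
        NR <= k { r[NR] = $0; next }
        # Line NR replaces a random slot with probability k/NR.
        { j = int(rand() * NR) + 1; if (j <= k) r[j] = $0 }
        END { n = (NR < k) ? NR : k; for (i = 1; i <= n; i++) print r[i] }
    ' "$file"
}

# Example usage (hypothetical file name):
# reservoir_sample 1000 cohort.fam > subset.fam
```

The output is a uniform random sample without replacement, but the sampled lines are not themselves in random order.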

score 1 · Accepted Answer · 2022-01-03

Thank you all. Looking back on this a few years later there are a few possible approaches.

In the end I used sort -R on the .fam file, extracted a testing and a training set with head and tail, and then used:

plink1.9.exe --bfile file --keep set1.fam --make-bed --out subset1
plink1.9.exe --bfile file --keep set2.fam --make-bed --out subset2

In addition, of course, I ran a few checks to ensure everything went OK!
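
Checks of this kind might look like the following. This is only a sketch of plausible sanity checks, not the checks that were actually run: the helper name check_split is hypothetical, and the file names in the usage comment are taken from the commands above. It confirms that the two keep files share no samples and that together they cover every sample in the original .fam file.

```shell
# Sketch: sanity-check a two-way sample split.
# Returns 0 if the subsets are disjoint and jointly cover the full cohort.

check_split() {
    # Usage: check_split <full .fam> <subset 1> <subset 2>
    local full="$1" set1="$2" set2="$3"
    local ids_full ids_1 ids_2 status=0
    ids_full="$(mktemp)"; ids_1="$(mktemp)"; ids_2="$(mktemp)"
    # Compare on FID + IID only (the first two columns), sorted for comm.
    awk '{print $1, $2}' "$full" | sort -u > "$ids_full"
    awk '{print $1, $2}' "$set1" | sort -u > "$ids_1"
    awk '{print $1, $2}' "$set2" | sort -u > "$ids_2"

    # 1. No sample may appear in both subsets.
    if [ -n "$(comm -12 "$ids_1" "$ids_2")" ]; then
        echo "overlap between subsets" >&2
        status=1
    fi
    # 2. The union of the subsets must equal the full sample list.
    if ! sort -u "$ids_1" "$ids_2" | cmp -s - "$ids_full"; then
        echo "subsets do not cover the full cohort" >&2
        status=1
    fi
    rm -f "$ids_full" "$ids_1" "$ids_2"
    return "$status"
}

# Example usage with the file names from the commands above:
# check_split file.fam set1.fam set2.fam && echo "split looks consistent"
```

Comparing the line counts of subset1.fam and subset2.fam against set1.fam and set2.fam afterwards would additionally confirm that plink kept the expected number of samples.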