I have a very large BED file and need to extract the genotypes of only 10 subjects. I am not interested in the other subjects, so how can I extract just those 10 from the BED file?
When you say a BED file, are you talking about the binary genotype .bed file defined by plink?
If so, you can use plink to include or exclude certain samples easily. See https://www.cog-genomics.org/plink2/filter if you're using plink v2...
If you're using plink v1, and you have files called hapmap1.ped and hapmap1.map, and say you want to create output called mysubset.ped and mysubset.map, then the command would be:
plink --file hapmap1 --keep mylist.txt --recode --out mysubset
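One thing to watch: --keep expects a text file with two columns, family ID and within-family ID, not a bare list of subject names. Here is a rough sketch for building that file from a plain ID list by looking the IDs up in the .fam file that sits next to a binary .bed. The function name make_keep_file and the file names mydata and keep.txt are made up for illustration, and I am assuming your IDs match column 2 (IID) of the .fam file.

```shell
# Build a two-column keep file (FID IID) for plink --keep.
# Assumption: the IDs in the list match column 2 (IID) of the .fam file.
# Usage: make_keep_file <id_list> <fam_file> <output_keep_file>
make_keep_file() {
    ids="$1"    # subject IDs, separated by spaces, tabs, or newlines
    fam="$2"    # plink .fam file (columns: FID IID PAT MAT SEX PHENO)
    out="$3"    # keep file to write

    # Put one ID per line, then print FID and IID for every .fam row
    # whose IID is in the list.
    tr -s ' \t' '\n\n' < "$ids" |
        awk 'NR==FNR { if ($1 != "") want[$1]; next }
             ($2 in want) { print $1, $2 }' - "$fam" > "$out"
}

# Hypothetical usage with a binary fileset mydata.bed/.bim/.fam:
#   make_keep_file mylist.txt mydata.fam keep.txt
#   plink --bfile mydata --keep keep.txt --make-bed --out mysubset
```

With --bfile and --make-bed the subset is written as a new binary fileset (mysubset.bed, .bim, .fam) instead of being recoded to .ped/.map. It is worth checking that keep.txt has as many lines as you expect before running plink.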
There is another type of "BED file" (UCSC Genome Browser's BED format) which has nothing to do with plink. It doesn't sound like you're talking about this kind of file... but if you are, then I imagine you just want to extract a certain subset of columns (e.g. using awk), but you'd have to give a little more detail about the structure of the file to get a full answer.
I have never used the plink bed format, but from reading the documentation for 1.9, maybe:
awk 'NR==FNR{A[$1];next}$1 in A' mylist.txt hapmap1.ped > result.txt
Column 1 of each line in hapmap1.ped is checked for a match against the IDs in mylist.txt (one ID per line), and only the matching lines end up in the result. If column 1 is not the correct one to search, change the $1 after next to whatever column holds the subject ID.
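To make the column choice explicit, the same idea can be wrapped in a small function that takes the column number as an argument. This is only a sketch: the name filter_rows_by_id is made up, and it assumes a text .ped file and a list with one ID per line. In a .ped file column 1 is the family ID and column 2 is the individual ID, so which one to use depends on where your subject IDs actually appear.

```shell
# Print the rows of a whitespace-delimited file whose chosen column
# appears in an ID list (one ID per line).
# Usage: filter_rows_by_id <id_list> <data_file> [column, default 1]
filter_rows_by_id() {
    ids="$1"
    data="$2"
    col="${3:-1}"

    # First file: remember every ID. Second file: print matching rows.
    awk -v col="$col" 'NR==FNR { keep[$1]; next } (($col) in keep)' "$ids" "$data"
}

# Hypothetical usage, matching on the individual ID in column 2:
#   filter_rows_by_id mylist.txt hapmap1.ped 2 > result.ped
```

The .map file lists variants rather than subjects, so it does not need filtering and can be copied alongside the subset .ped file.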
Thank you. I used plink --hapmap1.ped --keep mylist.txt but it does not work. I get the error "problem parsing the command line arguments".
I only need to keep the data for the subjects below and remove the other subjects that are not in my list. Here is mylist.txt: 136_S_4269 130_S_4352 129_S_4371 129_S_4369 031_S_4496 031_S_4474 031_S_4218 031_S_4032 031_S_4024 031_S_4021 019_S_4477 019_S_4367 019_S_4252 018_S_4400 018_S_4399 018_S_4349 018_S_4313 018_S_4257 012_S_4026 006_S_4449 006_S_4357 006_S_4192 006_S_4153 006_S_4150 002_S_4270 002_S_4225 002_S_4213
@fadlwork: You are not providing the right command line arguments to plink. I have edited my answer below to include the command to subset by individuals for plink v1. Are you using plink v1 or v2?