I tried the following command to extract a bfile using a given snp list and a given subject ID list:
plink2 \
--bgen xxx.bgen \
--sample xxx.sample \
--extract input.snp_list \
--make-bed \
--out output
The input.snp_list is the second column of another file, input.bim. I want the alleles in output.bim to be identical to those in input.bim, but that is not the case right now, because output.bim contains duplicate SNPs such as
19 rs75617501 0 44544721 T C
19 rs75617501 0 44544721 T G
and
19 rs573790568 0 44678292 TTG T
19 rs573790568 0 44678292 T TTGTG
19 rs573790568 0 44678292 T TTGTGTG
whereas rs75617501 and rs573790568 have no duplicates in input.bim, where their alleles are
19 rs75617501 0 44544721 C T
and
19 rs573790568 0 44678292 TTG T
So I wonder if there is a way to remove the duplicated SNPs when extracting the bfile, so that only SNPs with alleles matching input.bim are kept. For example, after removing the duplicates I would only have the following in my output.bim:
19 rs75617501 0 44544721 T C
19 rs573790568 0 44678292 TTG T
Thank you!
Since you are using plink, you can update the SNP IDs (rsIDs) to POS:A1:A2 in the plink files and then extract the SNPs with the alleles you want. I hope this makes sense.
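As a rough sketch of the "IDs that include the alleles" idea, the list of IDs to keep could be derived from input.bim as below. The chr:pos:allele:allele naming scheme, the helper name bim_to_sorted_ids and the output name keep_ids.txt are all assumptions made for illustration; the two alleles are put in ASCII-sort order so that the A1/A2 column order in the .bim file does not matter.

```shell
# Hypothetical helper: print one "chr:pos:alleleX:alleleY" ID per .bim line,
# with the two alleles in ASCII-sort order so column order does not matter.
bim_to_sorted_ids() {
    LC_ALL=C awk '{
        a = $5 ""; b = $6 ""          # force string (not numeric) comparison
        if (a < b) print $1 ":" $4 ":" a ":" b
        else       print $1 ":" $4 ":" b ":" a
    }' "$1"
}

# Example usage (keep_ids.txt is an assumed output name):
# bim_to_sorted_ids input.bim > keep_ids.txt
```

For the two input.bim lines shown in the question, this gives 19:44544721:C:T and 19:44678292:T:TTG. The chromosome codes have to match whatever plink writes for the BGEN file (for example 19 versus chr19), so that part may need adjusting.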
Thank you! Could you give an example of the 'update' command? I think it should be one of the commands listed in https://www.cog-genomics.org/plink/1.9/data but I'm unsure which one to use.
You should use the following command in this page.
Thank you! I still have a question: rsID.lst seems to be a two-column file consisting of rsID and POS, so I'm not sure how the duplicated SNPs will be removed according to the alleles in input.bim. I want to generate a bfile in which, among all duplicated SNPs, only the SNP whose alleles exactly match input.bim is kept (for example, input.bim has '19 rs573790568 0 44678292 TTG T', so in the final bfile only '19 rs573790568 0 44678292 TTG T' is kept, and '19 rs573790568 0 44678292 T TTGTG' and '19 rs573790568 0 44678292 T TTGTGTG' are discarded).
This is usually done with plink 2.0's --set-all-var-ids flag (https://www.cog-genomics.org/plink/2.0/data#set_all_var_ids).
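A minimal sketch of how --set-all-var-ids might be used here, in two steps so that it does not depend on plink2's internal order of operations. Everything not in the original command is an assumption: the wrapper function name, the intermediate prefix tmp_ids, a file keep_ids.txt holding one chr:pos:allele:allele ID per line (alleles in ASCII-sort order, derived from input.bim), and the ref-first modifier, which recent plink2 builds expect after the BGEN filename and which may not be right for your data.

```shell
# Sketch only; file names and the ref-first mode are assumptions.
extract_matching_alleles() {
    # Step 1: rename every variant to chr:pos:allele:allele.
    # '$1' and '$2' are plink2 template tokens for the alleles in ASCII-sort
    # order, so the template must be single-quoted to stop the shell from
    # expanding them as positional parameters.
    plink2 \
        --bgen xxx.bgen ref-first \
        --sample xxx.sample \
        --set-all-var-ids '@:#:$1:$2' \
        --make-pgen \
        --out tmp_ids || return 1

    # Step 2: keep only the IDs built from input.bim; the other alleles at a
    # multiallelic site have different IDs and are dropped.
    plink2 \
        --pfile tmp_ids \
        --extract keep_ids.txt \
        --make-bed \
        --out output
}
```

The resulting output.bim would carry the chr:pos:allele:allele IDs rather than rsIDs; if the rsIDs are needed, they could probably be restored afterwards with --update-name. Very long indel alleles may also require adjusting --new-id-max-allele-len.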