Question

Extract bfile from .bgen using a given .bim to handle duplicated snps

14 months ago

yang1641 • 0

I tried the following command to extract a bfile using a given snp list and a given subject ID list:

plink2 \
    --bgen xxx.bgen \
    --sample xxx.sample \
    --extract  input.snp_list \
    --make-bed \
    --out output

input.snp_list is the second column of another file, input.bim. I want the alleles in output.bim to be identical to those in input.bim, but that is not currently the case, because output.bim contains duplicated SNPs, for example:

19  rs75617501  0 44544721           T       C
19  rs75617501  0 44544721           T       G

and

19 rs573790568  0 44678292         TTG       T
19 rs573790568  0 44678292           T   TTGTG
19 rs573790568  0 44678292           T TTGTGTG

whereas rs75617501 and rs573790568 each appear only once in input.bim, with the following alleles:

19 rs75617501  0 44544721  C  T

and

19 rs573790568  0 44678292 TTG  T

So I wonder if there is a way to remove the duplicated snps when extracting bfile so that only snps with alleles matching input.bim are kept. For example, after removing the duplicate snps I would only have the following in my output.bim:

19  rs75617501  0 44544721           T       C
19 rs573790568  0 44678292         TTG       T

Thank you!

bim plink bgen • 1.1k views

ADD COMMENT • link updated 14 months ago by chrchang523 11k • written 14 months ago by yang1641 • 0

Since you are using plink, you can replace the SNP IDs (rsIDs) in the plink files with allele-aware IDs of the form POS:A1:A2, and then extract the SNPs with the alleles you want. I hope this makes sense.

ADD REPLY • link 14 months ago by bk11 ★ 3.1k

Thank you! Could you give an example of the 'update' command? I think it should be one of the commands listed in https://www.cog-genomics.org/plink/1.9/data but unsure about which command to use.

ADD REPLY • link 14 months ago by yang1641 • 0

You should use the following command from that page:

plink --bfile mydata --update-map rsID.lst --update-name --make-bed --out mydata2

ADD REPLY • link 14 months ago by bk11 ★ 3.1k

Thank you! I still have a question: rsID.lst seems to be a two-column file consisting of rsID and POS, so I'm not sure how the duplicated SNPs will be removed according to the alleles in input.bim. I want to generate a bfile in which, among all duplicated SNPs, only the SNP whose alleles exactly match input.bim is kept (for example, input.bim contains '19 rs573790568 0 44678292 TTG T', so in the final bfile only '19 rs573790568 0 44678292 TTG T' is kept, and '19 rs573790568 0 44678292 T TTGTG' and '19 rs573790568 0 44678292 T TTGTGTG' are discarded).

ADD REPLY • link 14 months ago by yang1641 • 0
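
To make the matching rule described above concrete, here is a rough sketch in plain awk, not a PLINK feature. It keys each row on chromosome, position and the two alleles in sorted order, so that C/T and T/C count as the same variant. The file names (demo_input.bim, demo_output.bim, demo_matched.bim) are made up for the example and hold only the rows quoted in this thread. It filters the .bim text only; the genotype data would still have to be subset by some other means.

```shell
# Stand-ins for the rows quoted in the thread (hypothetical file names).
cat > demo_input.bim <<'EOF'
19 rs75617501 0 44544721 C T
19 rs573790568 0 44678292 TTG T
EOF
cat > demo_output.bim <<'EOF'
19 rs75617501 0 44544721 T C
19 rs75617501 0 44544721 T G
19 rs573790568 0 44678292 TTG T
19 rs573790568 0 44678292 T TTGTG
19 rs573790568 0 44678292 T TTGTGTG
EOF

# Key = chr:pos:alleles, with the two alleles in sorted order,
# so that allele order (C/T vs T/C) does not affect the match.
LC_ALL=C awk '
function key(   a, b, t) {
    a = $5 ""; b = $6 ""
    if (a > b) { t = a; a = b; b = t }
    return $1 ":" $4 ":" a ":" b
}
NR == FNR { want[key()]; next }   # first file: remember the wanted variants
key() in want                     # second file: print rows whose key is wanted
' demo_input.bim demo_output.bim > demo_matched.bim

cat demo_matched.bim
```

On these rows, only the T/C row of rs75617501 and the TTG/T row of rs573790568 survive, which is the result asked for in the question.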

This is usually done with plink 2.0's --set-all-var-ids flag (https://www.cog-genomics.org/plink/2.0/data#set_all_var_ids ).

ADD REPLY • link 14 months ago by chrchang523 11k
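
For readers who want to try the --set-all-var-ids route, here is a minimal sketch of one way it might be wired up; it is not taken from the replies above. It assumes that the chromosome codes in the .bgen match those in input.bim (for example "19" rather than "chr19"), that 'ref-first' is the right REF/ALT mode for the .bgen, and that an allele-length cap of 100 is acceptable. The file names example_input.bim, keep.ids, rename.map and tmp_alleleid are invented for the illustration. The '$1:$2' template puts the alleles in ASCII-sorted order, so the IDs do not depend on which allele is listed first.

```shell
# Tiny stand-in for input.bim (the two rows quoted in the question);
# substitute the real input.bim in practice.
printf '19 rs75617501 0 44544721 C T\n19 rs573790568 0 44678292 TTG T\n' > example_input.bim

# Build allele-aware IDs (chr:pos:allele1:allele2, alleles in ASCII order)
# plus a two-column map: allele-aware ID <TAB> original rsID.
LC_ALL=C awk '{
    a = $5 ""; b = $6 ""
    if (a > b) { t = a; a = b; b = t }
    id = $1 ":" $4 ":" a ":" b
    print id > "keep.ids"
    print id "\t" $2 > "rename.map"
}' example_input.bim

# Run PLINK only if it is installed and the inputs from the question exist.
if command -v plink2 >/dev/null 2>&1 && [ -f xxx.bgen ] && [ -f xxx.sample ]; then
    # Assumption: 'ref-first' suits this .bgen; check how the file was produced.
    plink2 --bgen xxx.bgen ref-first --sample xxx.sample \
        --set-all-var-ids '@:#:$1:$2' \
        --new-id-max-allele-len 100 missing \
        --extract keep.ids \
        --make-bed --out tmp_alleleid

    # Restore the rsIDs (defaults: old ID in column 1, new ID in column 2).
    plink2 --bfile tmp_alleleid --update-name rename.map \
        --make-bed --out output
fi
```

With IDs that include the alleles, the T/G row of rs75617501 and the longer insertions at rs573790568 get different IDs from the wanted variants, so --extract no longer pulls them in.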