Question

Subsetting .vcf.gz based on .txt file

5.2 years ago

chvbs2000 • 0

I am preprocessing a set of .vcf.gz files and need to subset each of them according to a list of SNPs. I stored the SNP IDs of interest in a text file, and I want to extract all rows from these .vcf.gz files whose SNP IDs appear in the SNP_ID file:

SNP_ID file:

In Python I would imagine processing each line with a conditional statement or an inner join, but Python may not be the best choice because these .vcf.gz files are huge. Is there a way to subset a .vcf.gz file based on a text file using bash commands such as awk, sed, or cat? Thanks!

SNP gene vcf genome sequencing • 1.4k views

Duplicate of: Soft filtering of SNPs in a list; How to get 1000 Genomes data in bulk?; etc.

5.2 years ago by Pierre Lindenbaum 166k

score 0 · Answer 1 · 2020-05-25

5.2 years ago

Yean ▴ 150

What about plink?

   plink1.9 --vcf input.vcf.gz --extract snp.snplist --make-bed --out extract_snp
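Note that `--make-bed` writes a PLINK binary fileset (.bed/.bim/.fam) rather than a VCF; replacing it with `--recode vcf` should keep the output in VCF format. Since the question also asks for a plain bash route, here is a minimal sketch using only gzip and awk. The function name `subset_vcf_by_id` and the file names are hypothetical, and it assumes the SNP list holds one ID per line and that each record's ID column (field 3) contains a single ID.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of plink or any other tool):
# print the VCF header plus every record whose ID column (field 3)
# exactly matches a line of the SNP ID list.
#   $1 = text file with one SNP ID per line (assumed layout)
#   $2 = gzip-compressed VCF
subset_vcf_by_id() {
    local ids="$1" vcf="$2"
    gzip -dc "$vcf" | awk -F'\t' -v idfile="$ids" '
        # Load the SNP IDs into a lookup table before reading the VCF.
        BEGIN { while ((getline line < idfile) > 0) keep[line] }
        # Pass header lines through unchanged.
        /^#/ { print; next }
        # Print data records whose ID is in the table.
        $3 in keep
    '
}

# Example usage over many files (file names are placeholders):
#   for f in *.vcf.gz; do
#       subset_vcf_by_id snp_ids.txt "$f" | gzip > "${f%.vcf.gz}.subset.vcf.gz"
#   done
```

This streams each file once and keeps only the ID list in memory, so file size should not be a problem. It matches IDs exactly, so records that list several IDs separated by semicolons would be missed, and the output is plain gzip; recompress with bgzip if the result needs to be indexed.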
