4.5 years ago
chvbs2000
I am preprocessing a set of .vcf.gz files and need to subset each of them according to a list of SNPs. I stored the SNP IDs of interest in a text file, and I want to extract all rows from the .vcf.gz files whose SNP ID appears in that file:
SNP_ID file:
rs61733845
rs1320571
rs9729550
rs1815606
rs7515488
rs11260562
rs6697886
rs6603785
rs11804831
In Python I would process each line with a conditional statement or an inner join, but Python may not be the best choice because these .vcf.gz files are huge. Is there a way to subset a .vcf.gz file based on a text file using bash commands such as awk, sed, or cat? Thanks!
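For illustration, here is a minimal sketch of one way this could be done with awk, assuming the SNP ID sits in the third tab-separated column (the standard VCF `ID` field) and the ID file has one ID per line. The file names and the toy VCF below are hypothetical and only there to make the example self-contained; swap in the real paths.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical paths, for illustration only.
workdir=$(mktemp -d)
snp_list="$workdir/snp_ids.txt"
vcf="$workdir/example.vcf.gz"
out="$workdir/example.subset.vcf.gz"

# Toy SNP list: one ID per line.
printf 'rs61733845\nrs1320571\n' > "$snp_list"

# Toy VCF with three records; only two of them are in the list.
printf '%s\n' \
  '##fileformat=VCFv4.2' \
  $'#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO' \
  $'1\t100\trs61733845\tA\tG\t.\tPASS\t.' \
  $'1\t200\trs999\tC\tT\t.\tPASS\t.' \
  $'1\t300\trs1320571\tG\tA\t.\tPASS\t.' \
  | gzip > "$vcf"

# While reading the first file (NR==FNR), store every ID as an array key.
# For the decompressed VCF on stdin ("-"), print header lines (starting
# with "#") and any record whose third column is one of the stored IDs.
gzip -dc "$vcf" \
  | awk -F'\t' 'NR==FNR { ids[$1]; next } /^#/ || ($3 in ids)' "$snp_list" - \
  | gzip > "$out"

# Show the result.
gzip -dc "$out"
```

The VCF is streamed once and only the ID list is held in memory, so the size of the .vcf.gz files should not be a problem. To handle many files, the pipeline can be wrapped in a `for f in *.vcf.gz` loop. Two caveats: records whose ID column holds several IDs joined by `;` will not match, and the output is plain gzip rather than bgzip. If bcftools is available, `bcftools view -i 'ID=@snp_ids.txt' in.vcf.gz -Oz -o out.vcf.gz` should do the same job and write bgzip-compressed output.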
Duplicate of: Soft filtering of SNPs in a list; How to get 1000 Genomes data in bulk?; etc.