Hello, I'm trying to extract a subset of SNPs using vcftools. I have a list of 2474008 SNPs and a 90 GB vcf file. I used this command:
vcftools --vcf GCF_000001405.25.vcf --snps rsLeptin_adj.txt --recode --recode-INFO-all --out match_rsLeptin_adj.txt
But my output file has 2562258 lines (88250 more than the number of SNPs in my list), so I'm not sure whether the command is not specific enough or whether some error during processing produces extra lines. I have also tried awk, using an array:
awk '{array[$1]}' rsLeptin_adj.txt
and matching against the 3rd column of the VCF file, which contains the SNP IDs:
awk 'FNR==NR {array[$1]; next}; $3 in array' rsLeptin_adj.txt GCF_000001405.25.vcf
Has anyone experienced the same issue? Any comment will help. Thanks in advance
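One thing that may be worth checking before suspecting the tool: a recoded VCF also contains header lines (starting with #), and the same rs ID can sit on more than one record, so the line count of the output need not equal the number of IDs in the list. Whether either of these explains the 88250 extra lines here is only a guess. A minimal sketch on toy data (toy_ids.txt, toy.vcf and their contents are made up for illustration):

```shell
# Toy inputs (hypothetical file names and contents, for illustration only).
tmp=$(mktemp -d)
printf 'rs1\nrs3\n' > "$tmp/toy_ids.txt"
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '1\t100\trs1\tA\tG\t.\t.\t.\n'
  printf '1\t200\trs2\tC\tT\t.\t.\t.\n'
  printf '1\t300\trs3\tG\tA\t.\t.\t.\n'
  printf '2\t300\trs3\tG\tA\t.\t.\t.\n'   # same ID on a second record
} > "$tmp/toy.vcf"

# Keep header lines plus records whose ID (column 3) is in the list.
awk -F'\t' 'FNR==NR {ids[$1]; next} /^#/ || ($3 in ids)' \
  "$tmp/toy_ids.txt" "$tmp/toy.vcf" > "$tmp/subset.vcf"

# Compare total lines, data records, and distinct IDs in the subset.
total_lines=$(wc -l < "$tmp/subset.vcf" | tr -d ' ')
records=$(grep -vc '^#' "$tmp/subset.vcf")
unique_ids=$(grep -v '^#' "$tmp/subset.vcf" | cut -f3 | sort -u | wc -l | tr -d ' ')
echo "lines=$total_lines records=$records unique_ids=$unique_ids"
```

If the three numbers differ on the real output, the gap between lines and records is the header, and the gap between records and unique IDs comes from IDs that occur on several records.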
If I understand correctly, you have a file rsLeptin_adj.txt containing IDs that may be in the ID column of your VCF file, and you would like to extract those variants. For this, use bcftools instead of vcftools. vcftools is deprecated.
fin swimmer
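A minimal sketch of what the bcftools suggestion could look like, assuming bcftools is installed and that rsLeptin_adj.txt holds one ID per line. It relies on the `ID=@file` filter expression of bcftools; the input names come from the question, while the output name match_rsLeptin_adj.vcf is a placeholder.

```shell
# Input names are taken from the question; the output name is a placeholder.
vcf=GCF_000001405.25.vcf
ids=rsLeptin_adj.txt
out=match_rsLeptin_adj.vcf

# Run only if bcftools and both input files are available.
if command -v bcftools >/dev/null 2>&1 && [ -f "$vcf" ] && [ -f "$ids" ]; then
  # Keep records whose ID column matches an entry in the list.
  bcftools view --include "ID=@$ids" --output "$out" "$vcf"
  # Count data records (header excluded) in the result.
  bcftools view --no-header "$out" | wc -l
fi
```

The final count excludes header lines, so it can be compared directly with the number of IDs in the list; it may still be larger if some IDs occur on more than one record.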