If not using vcftools, the simplest and reasonably fast way is to use awk:
awk 'BEGIN{while((getline<"list.txt")>0)l[$1]=1}/^#/||l[$3]' 1000g.vcf > output.vcf
If you really care about speed, you may use Perl/Python to split only the first few fields instead of wastefully splitting every genotype field. If you do not want to create a temporary file, you can stream directly (not recommended, though):
tabix ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase1/analysis_results/integrated_call_sets/ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf.gz .|awk 'BEGIN{while((getline<"tmp.list")>0)l[$1]=1}/^#/||l[$3]'
You still download the whole file, but no temporary file is created.
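Here is a minimal Python sketch of the "split only the first few fields" idea mentioned above (function names and file names are mine, for illustration): `split("\t", 3)` stops after the ID column, so the many genotype columns are never parsed.

```python
def load_ids(path):
    """Read one rs ID per line into a set for O(1) membership tests."""
    with open(path) as f:
        return set(line.split()[0] for line in f if line.strip())

def keep_line(line, wanted):
    """Keep header lines and data lines whose ID column is in `wanted`.

    Only the first three tab-separated fields (CHROM, POS, ID) are split;
    the rest of the line, including every genotype column, stays unsplit.
    """
    if line.startswith("#"):
        return True
    fields = line.split("\t", 3)
    return len(fields) > 2 and fields[2] in wanted

def filter_vcf(vcf_in, vcf_out, wanted):
    """Copy the VCF header and the selected records from vcf_in to vcf_out."""
    for line in vcf_in:
        if keep_line(line, wanted):
            vcf_out.write(line)
```

Used as a filter, e.g. `filter_vcf(sys.stdin, sys.stdout, load_ids("list.txt"))`, this does the same job as the awk one-liner.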
EDIT:
The right way to use grep is:
grep -wFf list.txt 1000g.vcf > output.vcf
We should NOT use plain grep -wf. With -F, the patterns are interpreted as fixed strings instead of regular expressions, which triggers the Aho–Corasick algorithm, much faster than multi-regex matching. Here is an experiment: tmp.vcf is the 1000g site-only VCF on chr20, and tmp.list contains 8582 rs IDs.
$ time awk 'BEGIN{while((getline<"tmp.list")>0)l[$1]=1}l[$3]' tmp.vcf > /dev/null
real 0m5.318s
user 0m4.365s
sys 0m0.937s
$ time grep -wFf tmp.list tmp.vcf > /dev/null
real 0m2.740s
user 0m1.879s
sys 0m0.768s
$ time grep -wf tmp.list tmp.vcf > /dev/null
(Unfinished. Already 6m50s and counting...)
EDIT 2:
Although grep -wFf is faster, it may give you false matches. For example, if the INFO field contains FOO=rs1234, a list containing rs1234 will match that line even though its ID is different. This scenario rarely happens, of course, but you should be aware of it.
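If the false matches matter, one option (not from the original answer, just a combination of the two commands above) is to use grep -wFf as a fast prefilter and then confirm with awk that the hit really sits in the ID column. File names are illustrative:

```shell
# Keep the header lines first.
grep '^#' 1000g.vcf > output.vcf
# Fast fixed-string prefilter, then an exact match on field 3 (the ID column).
grep -wFf list.txt 1000g.vcf \
  | awk 'BEGIN{while((getline<"list.txt")>0)l[$1]=1} l[$3]' >> output.vcf
```

This keeps most of grep's speed because awk only sees the small prefiltered subset.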
I will give both approaches a try, VCFtools and this awk. Wish I could choose more than one answer! Thanks for this amazing explanation!!