Hi all,
As one of the last steps of a large variant analysis, I would like to identify no-call sites and pass them to GATK via --snpmask, so that sites that cannot be confidently called get masked. Not terribly tricky.
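For context, the final masking step I have in mind looks roughly like the sketch below. It is only a sketch: it assumes GATK 3's FastaAlternateReferenceMaker as the walker consuming --snpmask, and all file names are placeholders.

    # Sketch of the final masking step, assuming GATK 3's
    # FastaAlternateReferenceMaker (the walker that accepts --snpmask).
    # All file names are placeholders.
    import subprocess

    subprocess.run(
        [
            "java", "-jar", "GenomeAnalysisTK.jar",
            "-T", "FastaAlternateReferenceMaker",
            "-R", "reference.fasta",
            "--variant", "sample_calls.vcf",   # confident calls for this sample
            "--snpmask", "nocall_sites.vcf",   # the no-call sites identified below
            "-o", "sample_consensus.fasta",
        ],
        check=True,
    )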
I decided the best way to go about this was to generate calls at all sites using EMIT_ALL_SITES in GATK. I then use a short Perl one-liner to identify sites with no-calls (./.), subset the VCF so that it includes only such sites, and pass the result back to GATK one last time with --snpmask. The whole approach from raw data to sequences, including a bunch more that I won't go into here, is currently implemented in Python.
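In case it's useful to see that step concretely, the following is roughly equivalent, in Python, to what the one-liner does. It's a sketch rather than the actual code: it assumes a single-sample VCF produced with EMIT_ALL_SITES, and the file names are placeholders.

    # Sketch: collect positions whose genotype is ./. from a single-sample
    # EMIT_ALL_SITES VCF and write them out as tab-separated chrom/pos lines.
    # Assumes GT is the first FORMAT sub-field, as the VCF spec requires.
    with open("all_sites.vcf") as vcf, open("nocall_positions.txt", "w") as out:
        for line in vcf:
            if line.startswith("#"):
                continue                            # skip header lines
            fields = line.rstrip("\n").split("\t")
            chrom, pos = fields[0], fields[1]
            genotype = fields[9].split(":")[0]      # first (only) sample column, GT sub-field
            if genotype in ("./.", "."):
                out.write(f"{chrom}\t{pos}\n")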
I was filtering the VCF using VCFtools by passing it the positions file generated above. However, the call to VCFtools takes over 100 GB of RAM, and I would like to do multiple samples simultaneously. The CPU time, too, is pretty large; this is one of the most resource-intensive steps, even outpacing indel realignment and variant calling.
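Concretely, the VCFtools call is along these lines. Again a sketch rather than the exact invocation; file names are placeholders, and --positions is fed the tab-separated chrom/pos file from above.

    import subprocess

    # Sketch of the current subsetting step: keep only the listed no-call
    # positions and write them back out as a VCF. File names are placeholders.
    subprocess.run(
        [
            "vcftools",
            "--vcf", "all_sites.vcf",
            "--positions", "nocall_positions.txt",
            "--recode",
            "--out", "nocall_sites",   # VCFtools writes nocall_sites.recode.vcf
        ],
        check=True,
    )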
My question: is there an alternative way to subset the VCF given these constraints? Should I try creating a BED file from my list of positions (it would be something like chr1 0 1; chr1 1 2 for many of the sites) and letting tabix take a crack at it, as sketched below? Are there other options I ought to consider? Something that takes a bit longer but has a smaller memory footprint would be fine.
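If it helps, this is roughly what I have in mind for the BED route. It's only a sketch: it converts the 1-based positions to 0-based, half-open BED intervals and merges runs of adjacent sites so the mask stays small; file names are placeholders, and it assumes the positions file is sorted, as it is when taken straight from the VCF.

    # Sketch: turn the tab-separated 1-based positions into a BED file
    # (0-based, half-open), merging adjacent no-call sites into single
    # intervals. Assumes sorted positions; file names are placeholders.
    def positions_to_bed(pos_path, bed_path):
        intervals = []
        with open(pos_path) as pos_file:
            for line in pos_file:
                chrom, pos = line.split()
                start = int(pos) - 1   # BED coordinates are 0-based
                if intervals and intervals[-1][0] == chrom and intervals[-1][2] == start:
                    intervals[-1][2] = start + 1   # extend the previous interval
                else:
                    intervals.append([chrom, start, start + 1])
        with open(bed_path, "w") as bed:
            for chrom, start, end in intervals:
                bed.write(f"{chrom}\t{start}\t{end}\n")

    positions_to_bed("nocall_positions.txt", "nocall_mask.bed")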
I recently ran a task similar to the OP's with the mask saved as BED files. The first time I ran it, I'd forgotten to tabix the files: ~7 hrs to run. Once I'd tabixed them, ~1 minute :)
Exactly. bgzip, then tabix to build the .tbi index, and just like that, boom!
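For anyone landing here later, the two commands amount to something like the following; a sketch assuming the htslib bgzip and tabix binaries are on the PATH, with a placeholder file name (shown via subprocess since the OP's pipeline is in Python).

    import subprocess

    # Compress the BED mask with bgzip, then build the .tbi index with tabix.
    # Placeholder file name; bgzip and tabix must be on the PATH.
    subprocess.run(["bgzip", "-f", "nocall_mask.bed"], check=True)            # -> nocall_mask.bed.gz
    subprocess.run(["tabix", "-p", "bed", "nocall_mask.bed.gz"], check=True)  # -> nocall_mask.bed.gz.tbi

    # Spot-check: print mask intervals overlapping a region.
    subprocess.run(["tabix", "nocall_mask.bed.gz", "chr1:1-100000"], check=True)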
Thanks! tabix was one of the first things I thought of; I just wasn't sure how it would scale to whole-genome calls.