Hi all,
As one of the last steps of a large variant analysis, I would like to identify no-call sites and pass them to GATK via --snpmask, so that sites that cannot be confidently called get masked. Not terribly tricky.
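For context, the final masking step I have in mind looks roughly like the sketch below. It is only a sketch: it assumes GATK 3's FastaAlternateReferenceMaker as the walker consuming --snpmask, and all file names are placeholders.

    # Sketch of the final masking step, assuming GATK 3's
    # FastaAlternateReferenceMaker (the walker that accepts --snpmask).
    # All file names are placeholders.
    import subprocess

    subprocess.run(
        [
            "java", "-jar", "GenomeAnalysisTK.jar",
            "-T", "FastaAlternateReferenceMaker",
            "-R", "reference.fasta",
            "--variant", "sample_calls.vcf",   # confident calls for this sample
            "--snpmask", "nocall_sites.vcf",   # the no-call sites identified below
            "-o", "sample_consensus.fasta",
        ],
        check=True,
    )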
I decided the best way to go about this was to generate calls at all sites using EMIT_ALL_SITES in GATK. I then use a short Perl one-liner to identify sites with no-calls (./.), subset the VCF so that it includes only such sites, and pass the result back to GATK one last time with --snpmask. The whole approach from raw data to sequences, including a bunch more that I won't go into here, is currently implemented in Python.
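In case it's useful to see that step concretely, the following is roughly equivalent, in Python, to what the one-liner does. It's a sketch rather than the actual code: it assumes a single-sample VCF produced with EMIT_ALL_SITES, and the file names are placeholders.

    # Sketch: collect positions whose genotype is ./. from a single-sample
    # EMIT_ALL_SITES VCF and write them out as tab-separated chrom/pos lines.
    # Assumes GT is the first FORMAT sub-field, as the VCF spec requires.
    with open("all_sites.vcf") as vcf, open("nocall_positions.txt", "w") as out:
        for line in vcf:
            if line.startswith("#"):
                continue                            # skip header lines
            fields = line.rstrip("\n").split("\t")
            chrom, pos = fields[0], fields[1]
            genotype = fields[9].split(":")[0]      # first (only) sample column, GT sub-field
            if genotype in ("./.", "."):
                out.write(f"{chrom}\t{pos}\n")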
I was filtering the VCF using VCFtools by passing it the positions file generated above. However, the call to VCFtools takes over 100 GB of RAM, and I would like to do multiple samples simultaneously. The CPU time, too, is pretty large; this is one of the most resource-intensive steps, even outpacing indel realignment and variant calling.
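Concretely, the VCFtools call is along these lines. Again a sketch rather than the exact invocation; file names are placeholders, and --positions is fed the tab-separated chrom/pos file from above.

    import subprocess

    # Sketch of the current subsetting step: keep only the listed no-call
    # positions and write them back out as a VCF. File names are placeholders.
    subprocess.run(
        [
            "vcftools",
            "--vcf", "all_sites.vcf",
            "--positions", "nocall_positions.txt",
            "--recode",
            "--out", "nocall_sites",   # VCFtools writes nocall_sites.recode.vcf
        ],
        check=True,
    )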
My question: is there an alternative way to subset the VCF given these constraints? Should I try creating a BED file from my list of positions (it would be something like chr1 0 1; chr1 1 2 for many of the sites) and letting tabix take a crack at it, as sketched below? Are there other options I ought to consider? Something that takes a bit longer but has a smaller memory footprint would be fine.
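If it helps, this is roughly what I have in mind for the BED route. It's only a sketch: it converts the 1-based positions to 0-based, half-open BED intervals and merges runs of adjacent sites so the mask stays small; file names are placeholders, and it assumes the positions file is sorted, as it is when taken straight from the VCF.

    # Sketch: turn the tab-separated 1-based positions into a BED file
    # (0-based, half-open), merging adjacent no-call sites into single
    # intervals. Assumes sorted positions; file names are placeholders.
    def positions_to_bed(pos_path, bed_path):
        intervals = []
        with open(pos_path) as pos_file:
            for line in pos_file:
                chrom, pos = line.split()
                start = int(pos) - 1   # BED coordinates are 0-based
                if intervals and intervals[-1][0] == chrom and intervals[-1][2] == start:
                    intervals[-1][2] = start + 1   # extend the previous interval
                else:
                    intervals.append([chrom, start, start + 1])
        with open(bed_path, "w") as bed:
            for chrom, start, end in intervals:
                bed.write(f"{chrom}\t{start}\t{end}\n")

    positions_to_bed("nocall_positions.txt", "nocall_mask.bed")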
I recently ran a task similar to the OP's with the mask saved as BED files. The first time I ran it, I'd forgotten to tabix the files: ~7 hrs to run. Once I'd tabixed them, ~1 minute :)
Exactly. bgzip, then tabix to build the .tbi index, and just like that, boom!
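For anyone landing here later, the two commands amount to something like the following; a sketch assuming the htslib bgzip and tabix binaries are on the PATH, with a placeholder file name (shown via subprocess since the OP's pipeline is in Python).

    import subprocess

    # Compress the BED mask with bgzip, then build the .tbi index with tabix.
    # Placeholder file name; bgzip and tabix must be on the PATH.
    subprocess.run(["bgzip", "-f", "nocall_mask.bed"], check=True)            # -> nocall_mask.bed.gz
    subprocess.run(["tabix", "-p", "bed", "nocall_mask.bed.gz"], check=True)  # -> nocall_mask.bed.gz.tbi

    # Spot-check: print mask intervals overlapping a region.
    subprocess.run(["tabix", "nocall_mask.bed.gz", "chr1:1-100000"], check=True)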
Thanks! tabix was one of the first things I thought of; I just wasn't sure how it would scale to whole-genome calls.