Quickly subsetting VCF while being memory efficient
Asked 9.5 years ago by Brice Sarver ★ 3.8k

Hi all,

As one of the last steps of a large variant analysis, I would like to identify nocalls and pass them to GATK via --snpmask so that sites which cannot be confidently called are masked. Not terribly tricky.

I decided the best way to go about this was to generate calls at all sites using EMIT_ALL_SITES in GATK. I then use a short Perl one-liner to identify sites with nocalls (./.), subset the VCF so that it includes only those sites, and pass the result back to GATK one last time with --snpmask. The whole pipeline from raw data to sequences, including a number of steps I won't go into here, is implemented in Python.
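
For concreteness, a minimal sketch of such a one-liner (the file names are illustrative, and the pattern assumes a single-sample VCF where ./. can only occur in the genotype column):

# print CHROM and POS for every record containing a nocall genotype (./.)
$ perl -lane 'print "$F[0]\t$F[1]" if !/^#/ and /\.\/\./' all_sites.vcf > nocall_positions.txt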

I was filtering the VCF with VCFtools by passing it the positions file generated above. However, the VCFtools call takes over 100 GB of RAM, and I would like to process multiple samples simultaneously. The CPU time is also substantial; this is one of the most resource-intensive steps, outpacing even indel realignment and variant calling.

My question: is there an alternative way to subset the VCF given these constraints? Should I try creating a BED file from my list of positions (it would be something like chr1 0 1; chr1 1 2 for many of the sites) and letting tabix take a crack at it? Are there other options I ought to consider? Something that takes a bit longer but has a smaller memory footprint would be fine.
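
(For reference, converting such a positions list to single-base BED intervals is itself a one-liner; a sketch with hypothetical file names, remembering that VCF POS is 1-based while BED is 0-based and half-open:)

# CHROM<TAB>POS (1-based) -> CHROM<TAB>POS-1<TAB>POS
$ awk -v OFS='\t' '{print $1, $2-1, $2}' nocall_positions.txt > nocall_sites.bed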

Tags: BED, R, Python, VCF
Answered 9.5 years ago by Ram 44k

If you can create that BED file, tabix has a BED filter option that is hands down the fastest approach I have seen: it takes maybe seconds to get things done where vcftools runs out of memory.

There's a catch, though: tabix does not extract the VCF header.

Check out this post: Extract Sub-Set Of Regions From Vcf File
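
Something like this sketch (file names assumed from the question; tabix's -R option takes a BED file of regions, and because of the header catch above you prepend the header yourself):

$ bgzip all_sites.vcf                 # tabix requires bgzip compression
$ tabix -p vcf all_sites.vcf.gz       # build the .tbi index
$ { zcat all_sites.vcf.gz | sed -n '/^#/p; /^#/!q'; tabix -R nocall_sites.bed all_sites.vcf.gz; } > nocall_only.vcf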


I recently ran a task similar to the OP's, with the mask saved as BED files. The first time I ran it, I'd forgotten to tabix the files: ~7 hours to run. Once I'd tabixed them, ~1 minute :)
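
For reference, the indexing step is just (illustrative file name):

$ bgzip mask.bed
$ tabix -p bed mask.bed.gz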

Exactly. bgzip, then tabix for the .tbi, and just like that, boom!

Thanks! tabix was one of the first things I thought of; I just wasn't sure how it would scale up to whole-genome calls.

Answered 9.5 years ago by David W 4.9k

I don't know if it can be used in the particular case you are talking about, but Genotype Query Tools (GQT, https://github.com/ryanlayer/gqt) was written specifically to handle very large genotype files with a low memory footprint and speedy queries.

EDIT

Some of GQT's tricks are specific to multi-sample genotype files, though, so you might not get as much out of it for this application?

Answered 9.5 years ago

The following BEDOPS call is very memory-efficient and fast:

# vcf2bed converts the VCF to sorted BED; bedops -e 1 keeps elements that
# overlap regions_of_interest.bed by at least 1 bp
$ vcf2bed < variants.vcf | bedops -e 1 - regions_of_interest.bed > answer.bed
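
Note that the output is BED rather than VCF. If you need the original VCF records back (e.g., to feed --snpmask), one option is to hand the resulting intervals to tabix as in the answer above (a sketch; it assumes variants.vcf has been bgzipped and indexed, and the header caveat noted there still applies):

$ cut -f1-3 answer.bed > hits.bed
$ tabix -R hits.bed variants.vcf.gz > hits.vcf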
