lait · 6.3 years ago
Hi,
if I have a large VCF file (for example, from the 1000 Genomes Project), what is the most efficient way (R libraries?) to extract the variants that lie in certain genomic regions?
With small VCF files I used to load the whole file into memory and start digging, but with a 700 MB VCF file, what would be a better way?
Since you asked for R, I'm adding my answer just as a comment. The most common way is to use tabix.
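A minimal sketch of the tabix workflow (file name and region are illustrative; requires htslib's bgzip and tabix):

```shell
# Compress with bgzip (plain gzip will NOT work for tabix):
bgzip my.vcf                    # produces my.vcf.gz

# Build the coordinate index (creates my.vcf.gz.tbi):
tabix -p vcf my.vcf.gz

# Random-access extraction of one region; repeatable for many regions:
tabix -h my.vcf.gz 2:1000000-2000000 > subset.vcf
```

The `-h` flag keeps the VCF header in the output, which most downstream tools expect.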
See if this is helpful: https://samtools.github.io/bcftools/howtos/query.html You will need to convert VCF to BCF.
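A hedged sketch of the bcftools route (file names and region are illustrative; bcftools also accepts bgzipped VCF directly, so the BCF conversion is optional but faster for repeated queries):

```shell
# Optional: convert to BCF for faster repeated access:
bcftools view -O b -o my.bcf my.vcf.gz
bcftools index my.bcf

# Extract a single region:
bcftools view -r 2:1000000-2000000 my.bcf > subset.vcf

# Or extract many regions at once from a BED file:
bcftools view -R myregions.bed my.bcf > subset.vcf
```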
There is kind of a sister project to this that removes the need for that conversion: https://vcftools.github.io/index.html
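For completeness, a vcftools sketch for the same task (file name and coordinates are illustrative):

```shell
# Extract variants on chromosome 2 between two positions,
# writing a valid VCF to stdout:
vcftools --gzvcf my.vcf.gz --chr 2 --from-bp 1000000 --to-bp 2000000 \
         --recode --stdout > subset.vcf
```

Note that vcftools scans the file rather than using an index, so for repeated region queries on a large file, tabix is usually faster.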
Tabix is also a good option. Actually, bedtools can do this as well.
I read this from time to time, but I've never managed to make bedtools as fast as tabix for this. What's the correct way?
fin swimmer
The easiest way is to provide the region(s) you want to extract variants from in a BED file and then use
bedtools intersect
bedtools intersect -a myvariants.vcf -b myregions.bed -header -wa > output.vcf
should do the trick. I have no idea if it's as fast as tabix, but it should be pretty quick.

Hello jared.andrews07,
yes, I know I can subselect variants with
bedtools intersect
, but on large files this is slow. tabix builds a position-based index of the bgzipped VCF and can randomly access the positions I need. AFAIK bedtools cannot do this.

A little comparison on the 1000 Genomes VCF file:
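(The timing numbers were not preserved here.) A way to reproduce such a comparison yourself, with illustrative file names, could be:

```shell
# Time an indexed random-access query (needs my.vcf.gz.tbi from tabix):
time tabix my.vcf.gz 2:1000000-2000000 > /dev/null

# Time the equivalent full-scan intersection with bedtools:
time bedtools intersect -a my.vcf.gz -b myregions.bed -wa > /dev/null
```

The tabix query only reads the blocks covering the requested region, while bedtools streams through the entire file, which is why the gap grows with file size.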
fin swimmer