lait · 6.3 years ago
Hi,
if I have a large VCF file (for example, from the 1000 Genomes Project), what is the most efficient way (R libraries?) to extract the variants that lie in certain genomic regions?
With small VCF files I used to load the whole file into memory and start digging, but with a 700 MB VCF file, what would be a better way?
Since you asked for R, I'm adding my answer just as a comment. The most common way is to use tabix.
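A minimal sketch of the tabix workflow (file name and region are illustrative; requires htslib's bgzip and tabix):

```shell
# Compress with bgzip (plain gzip will NOT work for tabix):
bgzip my.vcf                    # produces my.vcf.gz

# Build the coordinate index (creates my.vcf.gz.tbi):
tabix -p vcf my.vcf.gz

# Random-access extraction of one region; repeatable for many regions:
tabix -h my.vcf.gz 2:1000000-2000000 > subset.vcf
```

The `-h` flag keeps the VCF header in the output, which most downstream tools expect.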
See if this is helpful: https://samtools.github.io/bcftools/howtos/query.html You will need to convert VCF to BCF.
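A hedged sketch of the bcftools route (file names and region are illustrative; bcftools also accepts bgzipped VCF directly, so the BCF conversion is optional but faster for repeated queries):

```shell
# Optional: convert to BCF for faster repeated access:
bcftools view -O b -o my.bcf my.vcf.gz
bcftools index my.bcf

# Extract a single region:
bcftools view -r 2:1000000-2000000 my.bcf > subset.vcf

# Or extract many regions at once from a BED file:
bcftools view -R myregions.bed my.bcf > subset.vcf
```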
There is kind of a sister project to this that removes the need for that conversion: https://vcftools.github.io/index.html
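For completeness, a vcftools sketch for the same task (file name and coordinates are illustrative):

```shell
# Extract variants on chromosome 2 between two positions,
# writing a valid VCF to stdout:
vcftools --gzvcf my.vcf.gz --chr 2 --from-bp 1000000 --to-bp 2000000 \
         --recode --stdout > subset.vcf
```

Note that vcftools scans the file rather than using an index, so for repeated region queries on a large file, tabix is usually faster.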
Tabix is also a good option. Actually, bedtools can do this as well.
I read this from time to time, but I've never managed to make bedtools as fast as tabix for this. What's the correct way?
fin swimmer
The easiest way is to provide the region(s) you want to extract variants from in a BED file and then use
bedtools intersect
bedtools intersect -a myvariants.vcf -b myregions.bed -header -wa > output.vcf
should do the trick. I have no idea if it's as fast as tabix, but it should be pretty quick.

Hello jared.andrews07,
yes, I know I can subselect variants with
bedtools intersect
, but on large files this is slow. tabix builds a position-based index of the bgzipped VCF and can randomly access the positions I need. AFAIK bedtools cannot do this.

A little comparison on the 1000 Genomes VCF file:
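(The timing numbers were not preserved here.) A way to reproduce such a comparison yourself, with illustrative file names, could be:

```shell
# Time an indexed random-access query (needs my.vcf.gz.tbi from tabix):
time tabix my.vcf.gz 2:1000000-2000000 > /dev/null

# Time the equivalent full-scan intersection with bedtools:
time bedtools intersect -a my.vcf.gz -b myregions.bed -wa > /dev/null
```

The tabix query only reads the blocks covering the requested region, while bedtools streams through the entire file, which is why the gap grows with file size.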
fin swimmer