Question

Access data from very big vcf files in R

7.4 years ago

bisansamara ▴ 20

Hi, I have a very big vcf file (11.8 GB), the header and first row look like this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       13372   .       G       C       608.91  PASS    "AC=3;AC_AFR=0;AC_AMR=0

How can I access the #CHROM and POS columns?

Note that I cannot view it in Excel because it's too big. I have also tried the following, but neither attempt worked:

#1
> library(VariantAnnotation)
> vcfFile = system.file(package="VariantAnnotation", "extdata", "ExAC.r1.sites.vep.vcf.gz")
> scanVcfHeader(vcfFile)
Error in .io_check_exists(path(con)) : file(s) do not exist:
  ''

#2
> vcf<-readVcf("ExAC.r1.sites.vep.vcf.gz","hg19")
Error: cannot allocate vector of size 54 Kb

Any help is highly appreciated

R gene vcf chromosome ranges • 5.5k views

updated 7.4 years ago by d-cameron ★ 2.9k • written 7.4 years ago by bisansamara ▴ 20

I would do this on the Linux command line, as discussed below, but if you really need to read the file in R, you can use fread from library(data.table).

awk 'BEGIN{OFS="\t"} !/^#/ {print $1,$2}' <(gzip -dc yourfile.gz) | gzip > output.txt.gz

7.4 years ago by Medhat 9.8k

score 2 · Answer 1 · 2017-08-20

You can extract the columns you need with the following Linux shell command:

zcat ExAC.r1.sites.vep.vcf.gz | tail -n +x | awk 'BEGIN{OFS="\t"}{print $1, $2}' > target.bed

Here x is the line number of the first data line (the first line that does not start with #), and target.bed is the output file.
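If you would rather not look up x by hand, it can be computed. Below is a minimal sketch, assuming a POSIX shell with gzip and awk available. The tiny sample.vcf.gz, its second data row, and the output name target.tsv are made up for illustration; point the commands at your own file instead.

```shell
# Build a tiny gzipped VCF to stand in for the real file (illustrative data only).
workdir=$(mktemp -d)
vcf="$workdir/sample.vcf.gz"
printf '##fileformat=VCFv4.1\n##source=example\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n1\t13372\t.\tG\tC\t608.91\tPASS\tAC=3\n1\t13380\t.\tC\tG\t100.00\tPASS\tAC=1\n' | gzip > "$vcf"

# x = line number of the first line that does not start with '#'.
x=$(gzip -dc "$vcf" | awk '!/^#/ { print NR; exit }')

# Skip everything before line x, then keep only columns 1 and 2 (CHROM, POS).
gzip -dc "$vcf" | tail -n +"$x" | cut -f 1,2 > "$workdir/target.tsv"

cat "$workdir/target.tsv"
```

The file is decompressed twice here, which is slow for an 11.8 GB input; filtering on the leading # in a single awk pass avoids that, at the cost of not having x as an explicit number.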

This is a simple operation; you can contact me (cginsea@gmail.com) if you need any help with this question.

score 0 · Answer 2 · 2017-08-22

You do not have enough memory to load the entire VCF at once. readVcf() takes an optional param argument (a ScanVcfParam object), which lets you specify not only the region of the genome to load but also which VCF fields to load. By requesting only the regions and fields you actually need, you can reduce the memory footprint of the loaded VCF.

If it's still too big to load, you could shrink your problem by only considering a subset of the data at any point in time (e.g. performing your analysis per chromosome).
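One way to prepare a per-chromosome analysis is to split the two columns of interest into one small file per chromosome before touching R. This is only a sketch, assuming a POSIX shell with gzip and awk; the sample data, the by_chrom directory, and the chrN.tsv naming are invented for the example.

```shell
# Build a tiny gzipped VCF covering two chromosomes (illustrative data only).
workdir=$(mktemp -d)
vcf="$workdir/sample.vcf.gz"
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n1\t13372\t.\tG\tC\t608.91\tPASS\tAC=3\n1\t13380\t.\tC\tG\t100.00\tPASS\tAC=1\n2\t10133\t.\tA\tT\t50.00\tPASS\tAC=2\n' | gzip > "$vcf"

outdir="$workdir/by_chrom"
mkdir -p "$outdir"

# Stream the file once: skip header lines, and append CHROM and POS to a
# file named after the chromosome. The previous file is closed whenever the
# chromosome changes, so only one output file is open at a time.
gzip -dc "$vcf" | awk -v dir="$outdir" '
  BEGIN { OFS = "\t" }
  /^#/ { next }
  $1 != prev { if (out != "") close(out); out = dir "/chr" $1 ".tsv"; prev = $1 }
  { print $1, $2 >> out }
'

ls "$outdir"
```

Each resulting file is small enough to be loaded on its own, for example with fread as suggested in the comment above. Because the awk script appends, start from an empty output directory when rerunning it.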

Alternatively, you can use a computer with more memory.