parse large vcf file?
3
1
Entering edit mode
10.6 years ago
Luyi Tian ▴ 120

I have several datasets in vcf format(about 1GB per file) and I want to extract snp info in each samples. I usually use python so I tried pyvcf, but it works slow in such large datasets. Are there any better ways to do that, especially in python? Thanks in advance

python vcf • 11k views
ADD COMMENT
8
Entering edit mode
10.6 years ago

A trick is to compress the VCF file with bgzip, and index it with tabix. Both these tools are downloadable from the tabix home page.

See also this page for more documentation. Follow the instructions there:

bgzip my_file.vcf
tabix -p vcf my_file.vcf.gz

I am not sure if pyvcf supports compressed and indexed files. According to the source code, it seems so.

EDIT: yes, it seems that pyvcf supports compressed and indexed VCF files. See the example in pyvcf documentation:

>>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
>>> # fetch all records on chromosome 20 from base 1110696 through 1230237
>>> for record in vcf_reader.fetch('20', 1110695, 1230237):  
...     print record
Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
Record(CHROM=20, POS=1230237, REF=T, ALT=[None])
ADD COMMENT
1
Entering edit mode

thank you for your answer. BTW: I like your blog(http://bioinfoblog.it/)

ADD REPLY
4
Entering edit mode
10.6 years ago
lh3 33k

I don't know if the following is applicable to pyvcf - generally writing your task-specific parser is much faster than using a library for a complex format such as VCF. The reason is that libraries tend to parse everything but frequently you don't need everything. Libraries are wasting processing time. For collecting basic statistics in a VCF, I can write a script faster than the htslib parser in C (I wrote the initial version). Another example is samblaster is several times faster than the samtools/htslib parser because it only parse a few fields in SAM. It is possible to write libraries by using lazy evaluation, but it is non-trivial.

A second approach as others said is to index with tabix and process each chromosome separately. You don't need to pyvcf or to split by chromosome manually. Just make your script read data from stdin:

tabix -h your.vcf.gz chr2 | your-script.py > chr2.out
ADD COMMENT
1
Entering edit mode
10.6 years ago

Perhaps split your VCF file by chromosome, running extraction tasks in parallel by chromosome.

ADD COMMENT

Login before adding your answer.

Traffic: 2234 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6