Doing an operation on one huge VCF, use Python or would C be faster (somewhat niave bcftools/computer science question)
0
0
Entering edit mode
4.5 years ago
curious ▴ 810

I have to perform what amounts to basically a correlation calculation on dosages from every row of what is equivalent to a 300M varaint X 30K sample VCF.

One thing I am wondering is if this would be faster to write a C plugin and work with BCFs or to Use Python and read in chunks and convert to a numpy matrix before performing my calculation. I am fairly sure the Python approach is going to take a really long time, butI don't know if C would be any faster. Does anyone have any suggestions of how to approach this with performance in mind. I would greatly appreciate any tips. Thank you.

python bcftools c vcf • 1.8k views
ADD COMMENT
0
Entering edit mode

Can you give a representative example what you actually aim to do?

ADD REPLY
0
Entering edit mode
##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##FORMAT=<ID=HDS,Number=2,Type=Float,Description="Estimated Haploid Alternate Allele Dosage">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE1 SAMPLE2 SAMPLE3
chr5    11893   chr5:11893:T:C  T       C       .       PASS       GT:DS:HDS:GP       0|0:0.021:0.021,0:0.965,0.031,0 0|0:0.080:0.080,0.006:0.910,0.080,0.001       0|0:0.030:0.008,0.021:0.965,0.031,0

I am trying to apply for each row:

Var(HDS)/(p(1-p)) where p=mean(HDS)

So:

Var([0.021,0,0.080,0.006,0.008,0.021]) / .023(1-0.023)

ADD REPLY
0
Entering edit mode

If you like to use python, have a look at pysam, which is a wrapper around the htslib C-API.

ADD REPLY
1
Entering edit mode
from pysam import VariantFile
from itertools import chain
from statistics import variance, mean

def calc_rsq(row):
    row_variance = variance(row)
    row_mean = mean(row)
    rsq = row_variance / (row_mean * (1-row_mean))
    return rsq

vcf_in = VariantFile("my_fav_vcf.vcf.gz")  # auto-detect input format
vcf_out = VariantFile('test.vcf', 'w', header=vcf_in.header)


for rec in vcf_in.fetch('chr1', 100000, 101000):
    haploid_dosages = [s['HDS'] for s in rec.samples.values()]
    haploid_dosages_list = list(chain(*haploid_dosages))
    rsq = calc_rsq(haploid_dosages_list)
    rec.info["R2"] = rsq
    vcf_out.write(rec)

Thanks I think pysam is going to change my work dramatically, I was parsing vcfs manually before

ADD REPLY

Login before adding your answer.

Traffic: 1591 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6