4.5 years ago
curious
I have to perform what amounts to a correlation-style calculation on the dosages from every row of the equivalent of a 300M-variant × 30K-sample VCF.
One thing I am wondering is whether it would be faster to write a C plugin and work with BCFs, or to use Python, reading in chunks and converting each chunk to a numpy matrix before performing my calculation. I am fairly sure the Python approach is going to take a really long time, but I don't know if C would be any faster. Does anyone have suggestions on how to approach this with performance in mind? I would greatly appreciate any tips. Thank you.
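For the chunked numpy route, a minimal sketch of what one chunk's computation could look like (the chunk matrix here is hypothetical toy data; the per-row statistic is the one described in the follow-up below, and in practice the rows would come from parsing a slice of the VCF/BCF):

```python
import numpy as np

# Hypothetical chunk: rows = variants, columns = samples, values = dosages.
chunk = np.array([
    [0.021, 0.0, 0.080, 0.006, 0.008, 0.021],
    [0.5,   0.4, 0.6,   0.55,  0.45,  0.5],
])

# Vectorized over all rows at once: Var(HDS) / (p * (1 - p)), p = mean(HDS).
# np.var uses ddof=0 (population variance) by default.
p = chunk.mean(axis=1)
stat = chunk.var(axis=1) / (p * (1 - p))
```

Doing the arithmetic per chunk rather than per row keeps the work inside numpy's C loops, which is usually the deciding factor for whether a Python approach is fast enough.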
Can you give a representative example what you actually aim to do?
I am trying to apply for each row:
Var(HDS) / (p(1 - p)), where p = mean(HDS)
So:
Var([0.021, 0, 0.080, 0.006, 0.008, 0.021]) / (0.023 × (1 - 0.023))
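For a single row, that statistic can be sketched in plain Python (using the population variance, on the assumption that the row covers the full sample set; whether you want sample or population variance is worth checking against your downstream use):

```python
from statistics import mean, pvariance

def row_stat(hds):
    """Var(HDS) / (p * (1 - p)) with p = mean(HDS)."""
    p = mean(hds)
    return pvariance(hds) / (p * (1 - p))

hds = [0.021, 0, 0.080, 0.006, 0.008, 0.021]
print(row_stat(hds))
```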
If you'd like to use Python, have a look at pysam, which is a wrapper around the htslib C API.
Thanks, I think pysam is going to change my workflow dramatically; I was parsing VCFs manually before.
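For anyone landing here later, a rough pysam sketch of the per-record loop (the file name and the "HDS" FORMAT tag are assumptions; Minimac4-style imputed VCFs store haplotype dosages under HDS, but check your header):

```python
import pysam  # wrapper around htslib; reads VCF and BCF

# Hypothetical input file; BCF avoids re-parsing text per record.
vcf = pysam.VariantFile("imputed.bcf")

for rec in vcf:
    # Flatten per-sample HDS tuples (two haplotype dosages per diploid sample).
    hds = [v for sample in rec.samples.values() for v in sample["HDS"]]
    p = sum(hds) / len(hds)
    var = sum((x - p) ** 2 for x in hds) / len(hds)  # population variance
    print(rec.chrom, rec.pos, var / (p * (1 - p)))
```

At 300M variants, pulling the per-sample values into numpy arrays instead of Python lists, and parallelizing across genomic regions with `fetch()` on an indexed BCF, would likely matter more than the C-vs-Python question itself.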