In my previous question, I asked: I would like to take a vcf file and a reference genome from the 1000Genomes project, and obtain a fasta file that lists the genomes for each individual in the vcf, according to the SNPs each individual has in the vcf file. Answers showed that GATK (FastaAlternativeReferenceMarker) and vcftools (vcf-consensus, http://www.1000genomes.org/faq/are-there-any-fasta-files-containing-1000-genomes-variants-or-haplotypes/) were able to do something similar to this. However, I'd like to skip indels when creating these sequences. Do existing tools have options to do this?
I've tried parsing through the vcf files using python to remove indels but I've run it for days and it hasn't finished, and I'm not sure if it's a bug in my script or because there's 84.4 million SNPs to go through. I've looked at FastaVariant; does it create an alternate sequence like FastaAlternateReferenceMaker in GATK?
Yes, it creates a variant-substituted FASTA sequence that you can work with using the
Fasta
interface. It's only designed to handle SNPs, so in fact it does ignore INDELs (I thought this was a design flaw and was thinking about correcting it but maybe not...).I ran the script and created a FastaVariant object. It works well but it outputs 1 alternative genome. I was wondering if I can use it to get an alternative sequence for each individual (so if I want the genomes of 100 individuals, including HG00097, from the vcf file, can pyfaidx do this?)
I think this script would do the trick:
I just realized I hadn't documented the use of the
sample
argument, so thanks for pointing this out!Thank you! I will try it
I got this error:
What does it mean?
Oops. I didn't test before submitting. I just edited the Gist, so please try again.
What's the estimated run time for one individual? I've ran it for around 30 min and it hasn't given output yet for the first individual, HG00096. It's created the file but it's empty so far. I can continue running it for several more hours to see what happens, though I was wondering if I might be doing something if it's supposed to be faster.
Also, just to understand how the code works better, does this give the same DNA sequence for each individual as what you wrote above? It hasn't finished running yet so I don't know:
That should give you chromosome 20 for each of the individuals, yes.
Thanks! It took 33 seconds for str(consensus['20'][0:100000]) to finish running, so I estimate it'll take 6 hours for str(consensus['20']) to finish running.
LOL. I'm glad the code actually works, and granted the module wasn't purpose built for this, but it's still a bit embarrassing...
I hadn't ever tested fetching entire (large) chromosomes using
FastaVariant
, and it turns out that I really designed the class to support efficient random access of smaller subsequences. I'm testing your use case right now and you're correct - it's excruciatingly slow. The reason for this is thatpysam
is used to query every variant position on every chromosome for every individual. It will be much better to use a tool that is designed for this purpose. You might take a look at the recommendation from the 1000 Genomes team: http://www.1000genomes.org/faq/are-there-any-fasta-files-containing-1000-genomes-variants-or-haplotypes/I've tried the option for 1000genomes, along with many other options (at least 6, spending hours on each option) but the vcf-subset doesn't work for me as it gives me 'Broken VCF header- no columns names?', and Data Slicer crashes (it says page taking too long to respond) when I try to use it to get an entire chromosome for 1 individual. So I think pyfaidx is the best option i have right now, until I figure out why vcf-subset doesn't work.
Actually 100000 takes 30 seconds, while 500000 takes 7 minutes, so 6 hrs is far off. I've tracked how much the time increases for each increment i of 10000 for consensus[0:i]. It might take a day to run, I will see.
It's going to depend on SNP density.
Maybe it would be faster if I was to take chunks of it, says str(consensus[0:10000]) + str(consensus[10000:20000]), and add them together. That way it sort of 'parallelizes' it.
EDIT: len(str(consensus['20'][100000:110000])) takes around 12 seconds len(str(consensus['20'][200000:210000])) takes around 12 seconds.
So maybe it could work? I'll test it out.
Awesome!! I just ran it in parallel on the sun grid engine from 0 to a million and it all finished in less than 2 minutes!!
I ran the whole thing for 1 individual... finished in around 3-5 minutes. There were around 600 jobs. Next time I can use only 60 jobs and it'll probably finish in 30 min, and 6 jobs (10 mil each) for a couple of hours.
Yes, the access times will be linear with the size of the query, independent of the position of the query in the genome. Glad you figured out a way to parallelize your job!
In the phased vcf file it gives the information for each diploid chromosome for each individual. For example, for column HG00096, it has 0|0, or 0|1, etc., where 1 indicates the chromosome has the alternative, while 0 means it has the reference SNP. Does the pyfaidx FastaVariant object only create it for 1 of the 2 chromosomes in a set? Is it possible to use it to get both chromosomes in a set (say, both chromosome 1s?)
I guess it may have something to do with line 790 in the __init__.py script?
gt_type depends on 'het' and 'hom', which seem to have something to do with heterozygous/homozygous? I'm not sure how to interpret them.
I think what you're asking for is full respect of phasing information, and currently pyfaidx ignores phasing completely. You only get one sequence out, and it's SNP-substituted using either hetero/homozygous combination of SNPs. Really you seem to want haplotypes output separately, but unless your entire call set is completely phased you won't get "just" 2 versions of every chromosome. In fact, for unphased calls you'll get an exponential growth in the number of possible chromosomes. That's why I ignore phrasing currently...
In this thread: VCF to FASTA
You linked to a program called vcf2fasta.cpp. Would that do what I need?
EDIT: I wrote a simple python script that gets phased fasta from vcf for 1 individual, and it takes 5 min to run, so if I run it in parallel for all individuals I should get fastas