I'm using tools such as Pindel to call structural variants from exome data. Since the exome is a sparse set of regions with limited information, I'm only looking for large indels (say ~200bp): small enough that the breakpoints may fall within the exome, yet big enough to be missed by SNP-centric algorithms like GATK. Because BWA does not handle multi-mapping well in low-complexity regions, I'm trying to do de novo assembly around all called breakpoints with ABySS, in order to exclude possible false positives; a rough sketch of this step is below.
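Roughly, my current filtering and read-extraction step looks like this (a minimal sketch using pysam; file names, size window and flank are just placeholders):

```python
# Sketch: keep Pindel indel calls in my size window, then pull the reads
# around each breakpoint into a FASTQ for local ABySS assembly.
# File names are hypothetical; assumes Pindel output converted to VCF.
import pysam

MIN_LEN, MAX_LEN = 50, 500    # size window around the ~200bp indels of interest
FLANK = 300                   # bp of context to fetch around each breakpoint

vcf = pysam.VariantFile("pindel_calls.vcf")          # hypothetical path
bam = pysam.AlignmentFile("sample.exome.bam", "rb")  # hypothetical path

for rec in vcf:
    svlen = abs(len(rec.alts[0]) - len(rec.ref))     # indel size from REF/ALT
    if not (MIN_LEN <= svlen <= MAX_LEN):
        continue
    # Collect reads overlapping the breakpoint for local assembly
    with open(f"{rec.chrom}_{rec.pos}.fq", "w") as out:
        for r in bam.fetch(rec.chrom, max(0, rec.pos - FLANK), rec.pos + FLANK):
            if r.is_unmapped or r.query_sequence is None or r.query_qualities is None:
                continue
            qual = "".join(chr(q + 33) for q in r.query_qualities)
            out.write(f"@{r.query_name}\n{r.query_sequence}\n+\n{qual}\n")
```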
All of this prompts me to wonder: why not simply assemble the whole exome?
My questions:
- I know of assemblers such as ABySS and Velvet, but is there any algorithm that specifically calls variants based on de novo assembly?
- How much RAM do I need to assemble the whole exome?
- Are there any tricks for assembling exome sequences that make it different from whole-genome assembly? In other words, is it reasonable to feed exome reads into algorithms designed for whole-genome assembly?
Many thanks.
Many thanks, this is Cortex! I noticed that paper a while ago, but I haven't had time to read it yet.
Hi there. Gerry has also just posted this on the Cortex Google group (search for cortex_var and Google group), where I've posted an extensive reply. Essentially the answer is yes, sure. You should only need around 3GB of RAM to do an exome, and Cortex has tools to build an assembly graph, call variants and dump a VCF.

Cortex has mostly been used on whole-genome sequencing data, and so the new automated pipeline is not really tailored for exome data. That pipeline allows you to call at many kmers and take the union of the calls made at different k. Note this is unlike whole-genome consensus assembly, where people try to choose the "optimal" kmer: for variant calling there fundamentally is no optimal k, so the best you can do is either vary the kmer size as you go (Cortex doesn't do this) or run at many kmers and take the union of all your calls. Specifically, the automatic error-cleaning option won't work well on exome data, and you will need to do that step yourself. I give some details of how to do this on the Cortex Google group, and I'll probably post about it in more detail in the future.
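To illustrate the union-of-calls idea: conceptually it is just deduplicating variants by site and allele across the per-k VCFs. A rough Python sketch (hypothetical file names; the real pipeline handles this itself):

```python
# Sketch: union of calls across kmer sizes, keeping each distinct
# (chrom, pos, ref, alt) once. File names are hypothetical.
import gzip

def read_calls(path):
    """Yield (chrom, pos, ref, alt) tuples from a (possibly gzipped) VCF."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            chrom, pos, _id, ref, alt = line.split("\t")[:5]
            for a in alt.split(","):          # split multi-allelic records
                yield (chrom, int(pos), ref, a)

union = set()
for k in (31, 41, 51, 61):                    # example kmer sizes
    union.update(read_calls(f"calls_k{k}.vcf"))  # hypothetical per-k outputs

print(f"{len(union)} distinct calls across all kmer sizes")
```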
cheers
Zam
Hi - one more comment as a result of further questions from Gerry. Cortex uses a couple of statistical models: one for genotyping, and one for classifying putative sites as polymorphic, repeat or error. The former uses a Poisson model for read coverage and was designed for whole-genome data, so I would not expect it to be well suited to exome data (though I haven't measured its genotyping accuracy on exome data). However, that doesn't stop you discovering variants; it's just the genotyping step that uses that model.

If you just do discovery, you get a VCF where the sample columns/fields simply report the coverage on each allele, and you could try to do your own genotyping on the basis of this. You do need to be a little careful, though: since Cortex can call very long alleles, the two alleles can sometimes share homology, so you can get coverage on both alleles that is due to the shared sequence. That is, a site might look heterozygous because both alleles have coverage, when actually you need to look only at the positions where the alleles differ. Cortex does this in its genotyping step. If you want to do this with exome data and Cortex, I can help. Further details on the Cortex Google group.
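To make that concrete, here is a toy version of Poisson-based genotyping from per-allele coverage, counting only coverage on kmers that distinguish the two alleles (the numbers and parameter names are illustrative, not Cortex's actual model):

```python
# Toy Poisson genotyping from per-allele coverage, as a sketch of the idea
# (illustrative only; Cortex's real model is more sophisticated).
from math import log, factorial

def log_poisson(k, lam):
    """Log of the Poisson pmf P(K = k) for rate lam."""
    return k * log(lam) - lam - log(factorial(k))

def genotype(cov_ref, cov_alt, depth=30.0, err=0.01):
    """Pick the genotype maximizing the joint Poisson likelihood of the
    coverage seen on kmers unique to each allele (not shared sequence)."""
    models = {
        "0/0": (depth, depth * err),       # expected (ref, alt) coverage
        "0/1": (depth / 2, depth / 2),
        "1/1": (depth * err, depth),
    }
    scores = {
        gt: log_poisson(cov_ref, lam_r) + log_poisson(cov_alt, lam_a)
        for gt, (lam_r, lam_a) in models.items()
    }
    return max(scores, key=scores.get)

# Counts must come from allele-distinguishing kmers only: shared homologous
# sequence between long alleles inflates both counts and misleads the call.
print(genotype(cov_ref=14, cov_alt=16))   # -> "0/1"
print(genotype(cov_ref=29, cov_alt=2))    # -> "0/0"
```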
Zam