I am trying to assemble haplotypes for a peculiar region of the human genome that (1) has high heterozygosity, (2) has variation in the presence or absence of entire genes, and (3) encompasses a gene cluster of highly similar paralogues. This obviously makes assembly difficult, since a paralogue on the same chromosome may have only 10% divergence from its duplicate, while the homologue on the other chromosome differs by 5% due to segregating polymorphism at that locus. Currently, I have nanopore and short reads from this region, both at approximately 30X coverage. I would like to use canu to assemble the nanopore reads, then polish with the short reads, but I am getting nowhere near the full assembly. My command is:
canu -p prefix -d canu_run genomeSize=250k correctedErrorRate=0.144 minOverlapLength=500 -nanopore-raw sample.fastq
Here sample.fastq contains reads filtered for my region of interest, so it's a fairly small total assembly. So far, I have tried varying the corrected error rate between 0.1 and 0.2, and the minOverlapLength between 500 and 1000, with no luck. Using BLAST, I can see large chunks of my genes of interest in the prefix.unassembled.fasta file. It seems varying error rates should help find a sweet spot of expected divergence between reads from the same allele at a locus, reads from different alleles at a locus, and reads from different paralogous loci - I'm wondering, are there any other parameters I can vary to try and get a more complete assembly? Is there any preprocessing I can do with the more accurate short reads to lead to a more complete assembly? Ideally, I eventually want phased haplotype information.
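For reference, the sweep described above can be scripted rather than run by hand. The following is only a sketch: it builds one canu command per combination of correctedErrorRate and minOverlapLength, using the same options as the command in the question. The function name, the intermediate grid values, and the output directory naming scheme are assumptions made for illustration.

```python
import itertools
import shlex


def canu_commands(reads, error_rates, min_overlaps,
                  prefix="prefix", genome_size="250k"):
    """Build one canu command line per (error rate, overlap length) pair.

    Only options that appear in the original command are used. The run
    directory naming scheme is a hypothetical choice for this sketch.
    """
    commands = []
    for rate, overlap in itertools.product(error_rates, min_overlaps):
        run_dir = f"canu_e{rate:.3f}_o{overlap}"  # hypothetical naming scheme
        args = [
            "canu", "-p", prefix, "-d", run_dir,
            f"genomeSize={genome_size}",
            f"correctedErrorRate={rate:.3f}",
            f"minOverlapLength={overlap}",
            "-nanopore-raw", reads,
        ]
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    # Grid spanning the ranges mentioned in the question (values assumed).
    for cmd in canu_commands("sample.fastq", [0.10, 0.144, 0.20], [500, 750, 1000]):
        print(cmd)
```

The script only prints the commands, so they can be inspected or handed to a job scheduler. Writing each run to its own directory keeps the prefix.unassembled.fasta files separate for later comparison.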
I don't have any good suggestions for additional Canu parameters to tweak, but have you tried the latest version of the Platanus assembler? It's designed for heterozygous genome assembly, and uses both long- and short-read data. See more here:
https://www.nature.com/articles/s41467-019-09575-2
I did try platanus-allee too. Neither works well, and actually both seem to introduce errors, perhaps in the read correction stages (?). At least, if I use a set of gene-specific probe sequences to find long reads with plausible haplotypes (based on known gene order), and I do the same thing after canu or platanus-allee assembly, I find totally different haplotypes. Also, while the (error-ridden) haplotypes found in raw reads at least present the correct order of genes, the ones from platanus-allee and canu sometimes place gene-specific probes in positions that don't fit the known gene order, and sometimes one gene's probe ends up in the middle of a bunch of probes for another gene.
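The gene-order check described above might look roughly like the sketch below. It assumes the probe hits have already been located on each read or contig (for example, parsed from BLAST output) and are available as (gene, start position) pairs; that input format, the function names, and the placeholder gene names are all assumptions for illustration. A sequence is treated as plausible if its genes, read in either orientation, appear in the same relative order as in the known gene order. Missing genes are tolerated because whole genes can be absent in this region.

```python
def collapse_runs(genes):
    """Collapse consecutive repeats, e.g. [A, A, B, B, A] -> [A, B, A]."""
    collapsed = []
    for gene in genes:
        if not collapsed or collapsed[-1] != gene:
            collapsed.append(gene)
    return collapsed


def is_subsequence(candidate, reference):
    """True if candidate appears in reference in order, gaps allowed."""
    remaining = iter(reference)
    # Each membership test consumes the iterator up to the match,
    # which is what enforces the left-to-right ordering.
    return all(gene in remaining for gene in candidate)


def order_is_plausible(hits, known_order):
    """Check probe hits on one sequence against the known gene order.

    hits: list of (gene, start) tuples (assumed input format).
    known_order: gene names in their known order along the chromosome.
    """
    by_position = sorted(hits, key=lambda hit: hit[1])
    genes = collapse_runs([gene for gene, _ in by_position])
    # The sequence may come from either strand, so test both orientations.
    return (is_subsequence(genes, known_order)
            or is_subsequence(genes[::-1], known_order))


if __name__ == "__main__":
    known = ["G1", "G2", "G3", "G4"]  # placeholder gene names
    print(order_is_plausible([("G1", 100), ("G1", 400), ("G3", 9000)], known))
    print(order_is_plausible([("G1", 100), ("G2", 500), ("G1", 900)], known))
```

The second call illustrates the interleaving case: a probe for one gene sitting between probes for another gene fails the check in both orientations.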
I'm going to write my own script to assemble the region based on these gene-specific probes.
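One possible starting point for a probe-based script (not necessarily the approach eventually taken here) is to count how many reads support each pair of neighbouring probes before trying to chain them into haplotypes. The sketch below assumes each read has already been reduced to an ordered list of probe names; that representation, the function names, and the min_reads threshold are assumptions for illustration. Adjacencies are stored without direction so that reads from either strand support the same edge.

```python
from collections import Counter


def adjacency_support(read_chains):
    """Count the reads supporting each pair of neighbouring probes.

    read_chains: one list of probe names per read, ordered along the read
    (assumed input format). Each read counts at most once per adjacency.
    """
    support = Counter()
    for chain in read_chains:
        edges_in_read = set()
        for left, right in zip(chain, chain[1:]):
            if left == right:
                continue  # repeated hit of the same probe, not an adjacency
            # Sort the pair so both strand orientations map to one key.
            edges_in_read.add(tuple(sorted((left, right))))
        support.update(edges_in_read)
    return support


def supported_edges(support, min_reads=2):
    """Keep adjacencies seen in at least min_reads reads (assumed cutoff)."""
    return {edge: n for edge, n in support.items() if n >= min_reads}


if __name__ == "__main__":
    chains = [
        ["p1", "p2", "p3"],
        ["p3", "p2", "p1"],  # same region read from the other strand
        ["p2", "p3", "p4"],
        ["p1", "p3"],        # p2 missing: gene absence or a missed probe hit
    ]
    print(supported_edges(adjacency_support(chains)))
```

Adjacencies seen in only a single read are the ones most likely to come from read errors or chimeras, but with two haplotypes at roughly 30X coverage a genuine gene-absence haplotype should still gather support from several reads, so the threshold needs to be chosen with that in mind.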