I have paired-end reads and two closely related references, one for the mitochondrion and another for the chloroplast. Can I simply map the reads to the mitochondrial reference (with bwa) to assemble the mitochondrial genome, then map them to the chloroplast reference to assemble the chloroplast genome? Is there any problem with this approach? I'd appreciate ideas/suggestions from anyone who has done something similar before.
Thank you Brice. ARC sounds like the tool I need. Do you know if it would deal with multiple copies of the genome within the reads? I expect there to be several copies of mitochondrial/chloroplast genomes.
Yes, that should work fine. That basically means that the mitochondrion/chloroplast will have higher coverage relative to nuclear markers. ARC, in particular, will find reads that are similar to either, split them into pools, and attempt to assemble those pools de novo. If you expect heteroplasmy, this is a somewhat more complicated question informatically; the short answer is you will be able to recover distinct haplotypes given enough variation and coverage.
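To make the "find similar reads, split them into pools" idea concrete, here is a toy sketch in Python. It is not ARC's actual code (ARC relies on a real aligner for recruitment and a real assembler afterwards); the function names, the k-mer size `k`, and the `min_hits` threshold are all made up for illustration. It only shows the binning step: each read goes to the reference it shares the most k-mers with, and reads that match neither (e.g. nuclear reads) are left out.

```python
from collections import defaultdict

def revcomp(seq):
    """Reverse complement; anything that is not A/C/G/T becomes N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))

def kmers(seq, k):
    """Set of all k-mers in seq (empty if seq is shorter than k)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(references, k):
    """Map every k-mer (both strands) to the references that contain it."""
    index = defaultdict(set)
    for name, seq in references.items():
        for kmer in kmers(seq, k) | kmers(revcomp(seq), k):
            index[kmer].add(name)
    return index

def assign_read(read, index, k, min_hits):
    """Return the reference sharing the most k-mers with the read, or None."""
    hits = defaultdict(int)
    for kmer in kmers(read, k):
        for name in index.get(kmer, ()):
            hits[name] += 1
    if not hits:
        return None
    best = max(hits, key=hits.get)
    return best if hits[best] >= min_hits else None

def split_into_pools(reads, references, k=11, min_hits=2):
    """Bin reads per reference; unassigned reads are dropped.

    k and min_hits are illustrative defaults, not tuned values.
    """
    index = build_index(references, k)
    pools = defaultdict(list)
    for read in reads:
        target = assign_read(read, index, k, min_hits)
        if target is not None:
            pools[target].append(read)
    return dict(pools)
```

Each pool would then be handed to a de novo assembler separately. Note that this toy version sends a read that matches both references to only one pool, so regions shared between the two organelle genomes would need more careful handling in practice.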
Hi Brice, I'd like to know whether ARC works well on a large data set (about 80G) that contains a mix of mitochondrial and chloroplast genome reads?
80G of raw data? I see no reason why it wouldn't perform well. Arguably the most time-consuming step is the read splitting, so this might take a bit of time, but it was designed to handle datasets like this. I'd give it a try. At the very least, it will do better than a complete de novo since you know what you're looking for.
What should I consider if my reference sequence has repetitive elements? The chloroplast reference genome has two repetitive elements, IRa and IRb.
(I'm trying to assemble the chloroplast genome of purple maize.)
Thanks for the recommendation! I'll try ARC too :D