Dear all, I am trying to assemble a phage (virus) genome. I've checked the quality of the reads and mapped them to the host genome to remove the host reads. Now I am trying to assemble the unmapped reads, but I always get more than 1000 contigs, even after trying different k-mer sizes and different assemblers. I've found that there are chimeric reads, which is why the number of contigs is so large. My question is: how can I identify these chimeric reads and remove them?
It is possible that you have far more coverage than necessary (the genome must be pretty small). You could look into normalizing the data to a lower coverage and/or use tadpole.sh from the BBMap suite as a k-mer based assembler instead. Tadpole seems to do a better job of assembling viruses than SPAdes. I won't guarantee that, but it seems to be generally true.
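To make the idea of normalizing to a lower coverage concrete, here is a minimal Python sketch of streaming, median-based k-mer normalization. It is a toy illustration only, not how any BBMap tool is implemented: the function names, the tiny k, and the target value are all assumptions chosen for the example, and reverse complements and sequencing errors are ignored.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize_by_depth(reads, k=5, target=3):
    """Keep a read only while the region it covers is still below the target depth.

    Hypothetical helper: 'target' is the approximate depth to normalize to.
    """
    counts = Counter()  # k-mer counts over the reads kept so far
    kept = []
    for read in reads:
        ks = kmers(read, k)
        if not ks:
            continue  # read shorter than k, cannot be evaluated
        # The median k-mer count estimates how deeply this read is already covered.
        if median(counts[km] for km in ks) < target:
            kept.append(read)
            counts.update(ks)
    return kept


# Ten copies of one read (very high depth) plus a single low-depth read.
reads = ["ACGTTGCAAT"] * 10 + ["GGATCCTAGA"]
kept = normalize_by_depth(reads, k=5, target=3)
print(len(reads), "reads in,", len(kept), "reads kept")
```

With real data you would use a much larger k and a target depth suited to your assembler; the point here is only that redundant high-depth reads are dropped while low-depth reads are retained.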
Viruses and hosts can share sequence. If you remove all sequences shared between the virus and host, you will likely leave holes in your assembly of the virus.
In this case, I'd suggest partitioning reads by depth, and assembling the high-depth reads, which will be viral. You can do that with Tadpole by using the mindepth=X flag.
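As a rough illustration of what partitioning reads by depth means, the following Python sketch counts k-mers across all reads and then splits the reads by their median k-mer depth. This is a toy under stated assumptions, not Tadpole's implementation: the function names and the min_depth parameter are hypothetical, k is kept tiny for readability, and reverse complements are ignored.

```python
from collections import Counter
from statistics import median


def kmer_counts(reads, k):
    """Count every overlapping k-mer across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(read[i:i + k] for i in range(len(read) - k + 1))
    return counts


def partition_by_depth(reads, k=5, min_depth=3):
    """Split reads into (high, low) by the median depth of their k-mers.

    Hypothetical helper: reads whose median k-mer count is at least
    min_depth go to 'high', everything else goes to 'low'.
    """
    counts = kmer_counts(reads, k)
    high, low = [], []
    for read in reads:
        depths = [counts[read[i:i + k]] for i in range(len(read) - k + 1)]
        if depths and median(depths) >= min_depth:
            high.append(read)
        else:
            low.append(read)
    return high, low


# Five copies of a "viral" read (high depth) and one "host" read (low depth).
reads = ["ACGTTGCAAT"] * 5 + ["GGATCCTAGA"]
high, low = partition_by_depth(reads, k=5, min_depth=3)
print(len(high), "high-depth reads,", len(low), "low-depth reads")
```

The useful threshold depends on the data: it should sit well above the residual host depth and well below the viral depth, which you can judge from a k-mer depth histogram.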
I've already assembled the reads with Tadpole; the number of contigs decreased, but it is still very high. What is X in mindepth=X?
Brian Bushnell: Is the mindepth= flag new? I don't see it in the in-line help for tadpole.sh.
I could not find any major difference by applying mindepth.
What value did you use? 100 or more?
What about the V-GAP assembly pipeline? https://www.sciencedirect.com/science/article/pii/S0378111915012378
I could not find the pipeline that the authors propose in the article. If someone has it, please share it.