Does anybody know how SPAdes (or any other genome assembly tool) deals with circular genomes? They return a linear form of a circular genome, but how? How, in the de Bruijn graph, do they decide which parts are the beginning and the end, since all reads are supposed to overlap (when the entire genome is covered)?
The SPAdes developers just released plasmidSPAdes, unsurprisingly for assembling plasmids, which are circular. It apparently takes circularisation/overlapping reads into consideration during assembly, but I've only ever tried it on a plasmid that assembled as one complete contig. Perhaps it can be used for genomes too, since it seems to be just as quick as SPAdes. I would guess it won't work as well if you get multiple contigs instead of one like I had, which you almost certainly will for a genome. Could just try it and see though!
Thank you, but I have already read this. It didn't help much because it rests on a lot of assumptions. I actually have cases where the whole circular genome is covered by reads, and at some point the algorithm chooses a "breaking point". At first I thought (like others in the post you've added as a link) that the "cut spot" was decided by low-coverage regions, but looking at the reads mapped to the circular genome, I found that the positions of the low-coverage regions didn't match the "cut spots". That leaves me with the question: how does SPAdes choose the "breaking point"?
SPAdes does not know anything about circular replicons; it just assembles linear contigs. Contigs that represent the full length of a circular replication unit will show the same sequence at both ends. The ends form a kind of direct repeat, which I like to call a "circular overlap". You can derive the true sequence of the replicon by removing one copy of the repeat.
Because bacterial chromosomes usually contain some repeats that cannot be resolved by SPAdes, it is unlikely that in practice you will ever encounter a single contig that shows overlapping ends and represents a whole chromosome.
But it may happen with plasmids. Contigs representing plasmid DNA often show significantly higher coverage because bacterial cells often contain several copies of the same plasmid. Some plasmids (called cryptic plasmids) are only about 1000 nt in size and encode only a single gene. If you have contigs that you suspect arise from a plasmid, check their ends for overlaps. This can be done with a dot plot in which you plot the sequence of the contig against itself.
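As a rough alternative to the dot plot, the end check can also be scripted. The sketch below is only an illustration of the idea: it looks for the longest prefix of a contig that is identical to its suffix and, if one is found, removes one copy. The function names and the `min_len`/`max_len` defaults are made up for this example, and it only finds exact matches, so an overlap containing a mismatch would be missed.

```python
def circular_overlap_length(seq, min_len=20, max_len=500):
    """Length of the longest exact prefix/suffix match, or 0 if none.

    min_len and max_len are illustrative defaults (assumptions),
    not values taken from SPAdes.
    """
    seq = seq.upper()
    # An overlap cannot be longer than half of the contig.
    upper = min(max_len, len(seq) // 2)
    for k in range(upper, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


def trim_circular_overlap(seq, min_len=20, max_len=500):
    """Remove one copy of the circular overlap, if there is one."""
    k = circular_overlap_length(seq, min_len, max_len)
    return seq[:-k] if k else seq
```

A contig for which `circular_overlap_length` returns 0 may still come from a circular replicon; it only means that no exact end repeat of at least `min_len` bases was found.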
(written 9.2 years ago by piet; updated 5.1 years ago by Ram)
First, thank you for your answer. Second, what you say about the "circular overlap" is true. I observed those in my results (you're the first one to mention them, which is why I had been thinking something was wrong with my data), and it makes total sense to me. But there are still a few things that bother me:
This "circular overlap" (let's call it CO) is not present on every contig. You might argue that, for those not showing the CO, the reads did not cover the entire circular genome (so the contigs don't represent the full length of a circular replication unit). But for those contigs I did a read mapping against an artificial circular reference genome (the entire genome with its first 300 bases appended at the end to mimic circularity), and reads mapped along the entire reference sequence, meaning that the original sequence was fully covered by reads. So why is the CO not there?
For a lot of contigs (some having the CO), when I BLAST them I get "inverted" HSPs (high-scoring segment pairs), such that the end of my contig matches the beginning of the reference sequence and vice versa. In my opinion this may result from an assembly error, but what causes it? My first guess was the CO, but it was not always present on the affected contigs. NB: I don't know if it matters, but the "inverted" HSPs overlap when a CO is present (which is expected, since the same part is repeated).
My main concern remains: why don't I get contigs "in the right order" (the order matching that of the reference)? Is it an assembly issue, or something else I have not thought about?
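For readers who want to reproduce the artificial circular reference described above, here is a minimal sketch. The 300-base pad comes from the comment; the function names are invented for this example, and the coordinate helper assumes 0-based positions.

```python
def circularize_reference(genome, pad=300):
    """Append the first `pad` bases to the end of the genome so that
    reads spanning the original start/end junction can still map."""
    pad = min(pad, len(genome))
    return genome + genome[:pad]


def to_circular_coordinate(pos, genome_len):
    """Map a 0-based position on the padded reference back onto the
    original genome (positions in the pad wrap around to the start)."""
    return pos % genome_len
```

Reads that map inside the appended pad could equally well map at the start of the genome, so coverage in those two regions should be interpreted together.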
In my experience, all contigs assembled for bacterial genomes end in some kind of repeat. Bacterial genomes contain several different types of repeats. If you see inverted repeats, they may be related to transposons and insertion sequences (IS). It may help to annotate the ends of your contigs. Then you will see more clearly which types of repeats are there.
I'm working on viral species (human papillomaviruses). You seem to have misunderstood my second point (in my previous comment): I was talking about "inverted HSPs" (the short sequences that BLAST aligns), not "inverted repeats". About annotating my contigs: since the "portions" (HSPs) forming my contigs were not in the right order (compared to the order in the GenBank sequences), I looked for ORFs to check the order of the coding regions. The ORFs were still in the right order (maybe because of the circularity of the genome). But I don't understand why, on the one hand, the order of the ORFs is right but, on the other hand, the order (in sequence) of the HSPs is not! Any clue?
HSP is the BLAST term for (short) local alignments. BLAST usually produces hundreds of them, most being rather short. In my applications most HSPs are just artefacts. I do not really understand what you mean by inverted HSPs, but coordinates in BLAST output are often hard for humans to read. I am not familiar with papillomaviruses; are there any known repeats in their genomes?
There are a few small repeats, but they are small enough (the biggest ones are 5 or 7 bp) to be resolved by the assembler. I finally understood why I had "inverted HSPs" (i.e. the end of my contig matching the beginning of the reference sequence (RF) and the beginning of the contig matching the end of the RF). The question of order is a matter of point of view: the start and the end depend on where the assembler chooses the cut spot. Whatever the cut spot, if it doesn't correspond to that of the RF, the contig will seem to be in the wrong order.
What is left is the question of why the circular overlap is not found on every contig made of reads that cover the entire genome.
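To illustrate the cut-spot point, a circular contig can be rotated so that it starts at the same position as the reference before comparing the two. This is only a sketch under several assumptions: the function name and `anchor_len` are invented here, any circular overlap has already been trimmed from the contig, the contig is on the same strand as the reference, and the first `anchor_len` bases of the reference occur exactly (no mismatches) in the contig.

```python
def rotate_to_reference_start(contig, reference, anchor_len=30):
    """Rotate a circular contig so that it begins where the reference begins.

    Returns the rotated contig, or None if the anchor (the first
    `anchor_len` bases of the reference) is not found exactly.
    """
    anchor = reference[:anchor_len]
    # Extend the contig by its own start so that an anchor spanning
    # the contig's cut spot can still be found.
    doubled = contig + contig[:anchor_len - 1]
    start = doubled.find(anchor)
    if start == -1:
        return None
    return contig[start:] + contig[:start]
```

After rotation, a BLAST or pairwise alignment against the reference should show a single collinear match rather than two "inverted" HSPs, provided the assembly itself is correct.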
Another way to detect circular overlaps is to visualize the FASTG file created by SPAdes with Bandage. You will see that contigs with the same repeat at the beginning and end have a self-loop.