Hi,
We have sequenced a viral genome and assembled it with 454 Newbler. How can I tell whether the genome is circular or linear? Should this be a feature of the assembly software (I could not find one), or should I use external software?
Thanks a lot!
Almost added this as a comment to Pierre's answer...
Newbler does not report circularity, but it looks like you can work out whether a contig is circular from its output:
We assembled a bacterial genome using newbler (shotgun and paired end reads), and it showed a small plasmid. I checked the 454ReadStatus.txt file, and it showed a number of shotgun (!) reads that were aligned with the start in the first few hundred bases, and the end in the last few (orientation '-' and '+', respectively). We also found the reverse.
I guess you can take this as an indication for circularity.
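A rough sketch of that check in Python. It assumes 454ReadStatus.txt is tab-delimited with the columns accession, read status, 5' contig, 5' position, 5' strand, 3' contig, 3' position, 3' strand; please verify this against your own file, as the layout may differ between Newbler versions. The function name and the 300-base window are just illustrative choices.

```python
import csv
import io

def spanning_reads(status_text, contig, contig_len, window=300):
    """Return accessions of reads whose two ends map near opposite ends of `contig`."""
    hits = []
    for row in csv.reader(io.StringIO(status_text), delimiter="\t"):
        # Assumed columns: accno, status, 5' contig, 5' pos, 5' strand,
        #                  3' contig, 3' pos, 3' strand
        if len(row) < 8 or not row[3].isdigit() or not row[6].isdigit():
            continue  # header line, or a read without two placed ends
        acc, c5, p5, c3, p3 = row[0], row[2], int(row[3]), row[5], int(row[6])
        if c5 != contig or c3 != contig:
            continue
        near_start = min(p5, p3) <= window
        near_end = max(p5, p3) > contig_len - window
        if near_start and near_end:
            hits.append(acc)
    return hits
```

This only makes sense when the contig is much longer than a read plus twice the window; on a very short contig an ordinary read could satisfy both conditions.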
A second option would be to 'make the contig circular' and then cut the contig sequence at another position, so that the original ends are joined somewhere in the middle. Then map the reads back to this new contig and look for reads mapping perfectly across the original 'breakpoint' (hope this makes sense).
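To illustrate the 'cut at another position' idea, here is a minimal Python sketch. The function names are made up for this example, and the contig is assumed to be a plain non-empty string.

```python
def rotate_contig(seq, cut):
    """Re-linearise a presumed circular contig so that it starts at index `cut`."""
    cut %= len(seq)
    return seq[cut:] + seq[:cut]

def junction_position(seq_len, cut):
    """0-based index in the rotated contig of the original first base,
    i.e. the place where the original end meets the original start."""
    return (seq_len - cut) % seq_len
```

After mapping the reads back to the rotated contig, reads that align cleanly across `junction_position(len(seq), cut)` are the ones that support circularity.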
If you like my answer, vote it up :-)
+ orientation means the read is located on the positive (forward) strand, - on the negative (reverse) strand. Looks like your two different assemblies indeed happen to 'break' the circle at a different position.
Hi Lex,
I tried it, and indeed there are reads aligned with both the start and the end of the contig! Moreover, with the new version of the software I got a different contig (only the order of the two parts differs). Does that mean Newbler just breaks the circular contig somewhere in the middle? Also, what does it mean that the orientation is '-' and '+' respectively, and how did you find the reverse? Thanks!
I have 20 such "circular reads" and all of them are 5' - and 3' +. Does that make sense? Why do they begin on the positive strand and end on the negative one? And I voted your answer up, it was very helpful!
If your assembler is not aware of circularity, it will probably split the genome arbitrarily at some point, in order to present a report as if it were linear. (I haven't had the opportunity to use Newbler, so I don't know how it treats its results.)
If the genome is circular, you should see some apparently badly oriented, but otherwise well mapped, read-pairs, pointing "away" from each other at the ends of the linear assembly. For each such pair, the sum of the two reads' distances to their nearest contig ends should be similar to the expected insert size.
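The insert-size check can be sketched in Python as follows. This assumes 1-based inclusive mapping coordinates, a reverse-strand read near the contig start and its forward-strand mate near the contig end; the function names and the 25% tolerance are arbitrary choices for illustration, not taken from any tool.

```python
def implied_circular_insert(rev_read_end, fwd_read_start, contig_len):
    """Fragment length implied if the contig's two ends are joined.

    rev_read_end:   rightmost mapped base of the reverse-strand read near the start
    fwd_read_start: leftmost mapped base of the forward-strand read near the end
    """
    return rev_read_end + (contig_len - fwd_read_start + 1)

def supports_circularity(rev_read_end, fwd_read_start, contig_len,
                         expected_insert, tolerance=0.25):
    """True if the implied fragment length is within `tolerance` of the library insert size."""
    implied = implied_circular_insert(rev_read_end, fwd_read_start, contig_len)
    return abs(implied - expected_insert) <= tolerance * expected_insert
```

Many pairs passing this test, with implied sizes clustering around the library insert size, would be stronger evidence than a single pair.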
You may also find the "joining part" of the same circular sequence represented in a separate contig, so it's worth checking specifically for that.
As far as I am aware, no assembly programs produce explicitly labeled circular contigs on output, even though this would be useful for some viruses, many bacteria, plasmids, mitochondria, chloroplasts etc. In practice this is not usually an issue: you are unlikely to get enough good data for a whole circular bacterial genome to come out as one contig. For those of us interested in viral genomes, mitochondria etc. it is annoying, but the sequences are small enough to finish manually.
The other posters have suggested several ideas to help you manually stitch the ends together. I would add that I once had an apparently circular 40kb viral genome, assembled from 454 reads, come out as a linear contig of about 50kb: it had actually started repeating! Something else to check.
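A quick way to look for this kind of terminal repeat is to test whether the end of the contig repeats its beginning. The Python sketch below only finds exact overlaps, so a single sequencing error inside the repeat will hide it; a proper self-alignment would be more robust. The function name and the 20-base seed are illustrative assumptions.

```python
def terminal_overlap(seq, seed_len=20):
    """Length of the longest exact overlap between the contig's end and its start.

    Returns 0 if no overlap of at least `seed_len` bases exists.
    """
    n = len(seq)
    if n <= seed_len:
        return 0
    seed = seq[:seed_len]
    pos = seq.find(seed, 1)              # later occurrences of the first bases
    while pos != -1:
        if seq.startswith(seq[pos:]):    # suffix from pos matches the contig start
            return n - pos               # smallest pos gives the longest overlap
        pos = seq.find(seed, pos + 1)
    return 0
```

If this reports an overlap of, say, 10kb on a 50kb contig, trimming that many bases from one end leaves a candidate 40kb circle.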
Finally, and probably most crucially, talk to some virologists! Some viruses adopt a circular form for replication but a linear form for packaging into viral particles. In this situation you may have to do some lab work to work out where the ends really are.
It is a novel viral genome, so the virologists don't know (but want to know) whether it is circular. We do have only one contig. I don't understand how the assembler works if the genome is circular: does it cut the contig in the middle?
Different assemblers will do it differently, and it will depend a bit on your data and how variable it is. I think the best you can hope for is a linear contig which is the full length of the circle, perhaps with some overlap as I described (e.g. 50kb contig for a 40kb circle). The circle break point will probably be random (assuming you have good coverage - otherwise I would expect it to break at a region of low coverage). You will need to do some manual finishing, and should try Sanger/capillary sequencing over the end gap to confirm the ends do meet.
The repeat of some 10 kbp is certainly a sign that something is up, perhaps a circular genome. So, I'd run the assembly through Miropeats to identify those repeats, duplications or inversions. See http://genome.wustl.edu/software/miropeats
If your program produced a full linear sequence, I would simply look (a simple grep?) for reads that start with the end of the assembly and end with the beginning of the assembly.
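A Python equivalent of that grep might look like the sketch below. It builds the would-be junction from the last and first bases of the contig and searches the reads on both strands; it uses exact matching only, and the function names, the `reads` dictionary (name to sequence) and the 15-base flank are assumptions for illustration.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]

def reads_spanning_junction(contig, reads, flank=15):
    """Names of reads containing the end-to-start junction of `contig` on either strand."""
    junction = contig[-flank:] + contig[:flank]
    junction_rc = reverse_complement(junction)
    return [name for name, seq in reads.items()
            if junction in seq or junction_rc in seq]
```

If the assembly ends overlap each other (a contig longer than the circle), trim the repeated part first, otherwise the constructed junction will not exist in any read.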
Correct me if I'm wrong, but in order for this to work you need to be sure to use paired end reads only. In the case of 454 the 2 ends would be saved in one read, but with e.g. Illumina you would have to use /1 /2 sequences explicitly.
As you mentioned, you used shotgun-only 454 sequencing. Assuming that your viral genome is (almost) entirely sequenced:
There are no sequencing gaps, i.e. your genome did not contain segments where 454 sequencing failed and thus there are reads covering the entire genome. In this case, it would be best if you had only one contig the ends of which you could try to join by finding a read that spans the start and the end (taking orientations and complementarity into account). The more contigs you have the more impractical this approach gets and the less likely the assumption (no gaps) is.
There are 1 or 2 sequencing gaps. Find primers to try and close the gap with another sequencing method (Sanger comes to mind, as long reads are an advantage here). Again, take orientations into account to join contigs to a circular genome (or not). Again, with more contigs/gaps this gets impractical. You might want to tweak your assembler's options here.
The important thing to note is that it is not possible to distinguish between "no gaps-linear genome" and "one gap-circular genome". In order to be sure, I would try joining the ends together by either sequencing or simple PCR products.
Maybe this is way too simple a thought, but it could work:
As mentioned earlier by some answers the assembly will probably break at some point of low coverage. However, you could slightly change your input sequence dataset (for instance delete the reads within a segment of the genome assembly). A reassembly will in this case be pushed towards a defined break at the point where you deleted the sequences. However, you are now able to check whether the first assembly ends are joined or still present as ends in the new assembly.
Fiddling around a bit with which reads to delete might give you a good answer on circularity (in addition to, for instance, the functional annotation, which could also give a clue as to whether the assembly was broken arbitrarily or has real (ragged) ends).