Hi all,
This is my second post, again about sequencing reads and genome assembly. I have assembled reads from a protist genome sequenced with PacBio. As recommended by PacBio, I used the Canu package to do it, and I ended up with a list of files, but I am not sure which file corresponds to what. I have read the Canu documentation, but as a complete beginner in the field, I still have a lot of doubts!
In particular, Canu gave me 4 different FASTA output files:
x.bubbles.fasta (0 sequences - size: 0B)
x.contigs.fasta (2559 sequences - size: 66M)
x.unassembled.fasta (111507 sequences - size: 521M)
x.unitigs.fasta (6927 sequences - size: 90M)
If I understood correctly what I read, the contigs file contains all the reads that could be assembled, and the unassembled file contains the remaining reads that could not be integrated into the assembly? In that case, does it mean that my whole "genome" is contained in these two files? I am also unsure about the unitigs file, for which I can't find a proper description. Based on other posts I have read here, I understood that it contains all the single reads that have been integrated into the contigs, but in a unique version (if a sequence is present twice in the contigs, it will be present only once in the unitigs). But if that is the case, I don't understand why I end up with a unitigs file roughly 1.4x bigger than the contigs one (90M vs. 66M)...
Also, for information, the genome size has been estimated at around 176.5 Mb.
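To put the file sizes in perspective against that estimate, here is a minimal sketch of how the sequences and total bases in each output could be counted. It assumes plain, uncompressed FASTA files named as in the list above; the function name `fasta_stats` and the constant `GENOME_SIZE` are only illustrative and are not part of Canu.

```python
import os

GENOME_SIZE = 176.5e6  # estimated genome size in bases (176.5 Mb, from the post)


def fasta_stats(path):
    """Return (number of sequences, total bases) for a plain-text FASTA file."""
    n_seqs = 0
    n_bases = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                n_seqs += 1  # header line: one new sequence
            else:
                n_bases += len(line)  # sequence line: count its bases
    return n_seqs, n_bases


if __name__ == "__main__":
    # File names as listed in the post; missing files are skipped.
    names = [
        "x.contigs.fasta",
        "x.unitigs.fasta",
        "x.unassembled.fasta",
        "x.bubbles.fasta",
    ]
    for name in names:
        if not os.path.exists(name):
            continue
        seqs, bases = fasta_stats(name)
        ratio = bases / GENOME_SIZE
        print(f"{name}: {seqs} sequences, {bases / 1e6:.1f} Mb "
              f"({ratio:.2f}x the estimated genome size)")
```

Counting bases rather than bytes avoids including header lines and newlines in the comparison with the 176.5 Mb estimate.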
If this is redundant with another post, seems naive, or if I am not using the right vocabulary, I apologize in advance and will be grateful to be corrected! Thank you in advance!