Hi there,
I'm planning a comparative genomics analysis of bacterial genomes. I downloaded them from NCBI and found that many are scaffold-level assemblies with hundreds of contigs, while others are completely assembled into a single chromosome. Today I was asked some really simple questions that left me quite confused...
1) Do you think I can use a complete genome as a reference and try to assemble the fragmented genomes again? That is, give the fragmented FASTA genome as input to an assembler and use the complete FASTA as the reference? (Ideally I would get the FASTQ files from NCBI as well, but not all of them are available.)
2) Would reordering the contigs with Mauve improve the genome somehow? Of course it won't reduce the number of contigs, but would it improve the annotation instead?
3) Furthermore, I noticed that some genomes are in NCBI Assembly, while others are only in NCBI Genome. Why is that?
Any suggestions or explanations, please? Thanks, Silvia
Hi Silvia,
Can you give me an example of a genome that is in Genome but not Assembly?
Hi there,
If you look for Pseudomonas avellanae here you can find 16 entries, but if you look here you get only 14. R2sc214 and R25260 are missing.
Because R2sc214 and R25260 are marked as "Anomalous assembly" and are therefore excluded from RefSeq and GenBank.
Yes, exactly what's written in the FAQ ;)
If I understand correctly you are wondering if you can use the contiguous reference-like assemblies to improve the less contiguous assemblies?
There are tools for this; look for reference-based or reference-guided assembly. However, whether it is appropriate may depend on what you want to look at in your comparisons...
For example, the unassembled regions are most likely complex regions, which will probably be highly variable. So my guess is that with reference-guided assembly, the regions you 'solve' may just be introducing reference bias. This is only a guess; I have not checked it or read about it.
Your alternative, which I think you touch on and which is perhaps better, is to use your reference genome only to scaffold the less contiguous assemblies. This will again create scaffolds whose structure is biased toward the reference, but at least the gaps filled with 'N's will let you keep track of the missing information.
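To make the 'N' bookkeeping concrete, here is a minimal Python sketch, not a real scaffolder. It assumes the contigs have already been ordered and oriented against the reference by some other tool, and the function names `join_with_gaps` and `find_gaps`, the toy sequences, and the gap length are all made up for illustration.

```python
import re


def join_with_gaps(contigs, gap_len=100):
    """Join already-ordered contigs into one scaffold, separated by runs of N.

    gap_len is an arbitrary placeholder: the true gap size is unknown.
    """
    return ("N" * gap_len).join(contigs)


def find_gaps(scaffold):
    """Return (start, end) for every run of N, 0-based and end-exclusive."""
    return [m.span() for m in re.finditer(r"[Nn]+", scaffold)]


# Toy contigs (hypothetical), assumed to be in reference order already.
contigs = ["ACGTAC", "GGCTA", "TTAGC"]
scaffold = join_with_gaps(contigs, gap_len=10)
gaps = find_gaps(scaffold)

for start, end in gaps:
    print(f"gap of {end - start} N at {start}-{end}")
```

The gap coordinates can then be used to flag any gene or alignment that sits next to a gap, so that an apparent absence in a fragmented genome is not mistaken for a real difference.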